On 3/3/21 8:58 PM, Quanah Gibson-Mount wrote:
> --On Wednesday, March 3, 2021 6:24 PM +0100 Emmanuel Seyman emmanuel@seyman.fr wrote:
>> The problem is that I don't see any messages in the log that stand out as being errors (granted, I'm not sure what I'm looking for). In fact, the alert flaps every once in a while as the two nodes come back in sync and drift away from each other again.
>> I find these values surprising considering I've never seen a syncrepl error in the 2 years before the upgrade. Is there a known issue with replication in 2.4.57 that would explain these sync differences?
>
> The replication code in 2.4.44 was completely unreliable and could report being in sync regardless of whether or not that was true. It's also unknown to me if the nagios plugin is accurate for the current codebase.
> Generally what you want to look at are the contextCSN values in the root of the DIT of each server to see if they match.

My slapdcheck package [1] also implements exactly this check, and it sometimes shows a difference even though the changes have been correctly replicated (normal syncrepl).
You can look at the code to verify what it's doing:
https://gitlab.com/ae-dir/slapdcheck/-/blob/master/slapdcheck/__init__.py#L1...
(It reads the actual syncrepl providers from cn=config before comparing the contextCSN values for each serverID.)
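
As a rough illustration of that kind of comparison (this is not the slapdcheck code, just a minimal sketch), the snippet below groups contextCSN values by serverID and reports the difference in seconds between two servers. The function names parse_csn, latest_csn_by_sid and csn_lag are made up for this example, and it assumes the contextCSN values have already been read from the suffix entry of each server, e.g. with a base-scope search for the contextCSN attribute.

```python
from datetime import datetime, timezone


def parse_csn(csn):
    """Split one contextCSN value into (timestamp, serverID).

    A CSN looks like: 20210303172400.123456Z#000000#001#000000
    (timestamp # change count # serverID in hex # modification number).
    """
    time_part, _count, sid, _mod = csn.split("#")
    ts = datetime.strptime(time_part, "%Y%m%d%H%M%S.%fZ")
    return ts.replace(tzinfo=timezone.utc), int(sid, 16)


def latest_csn_by_sid(csn_values):
    """Map each serverID to the newest timestamp seen for it."""
    latest = {}
    for value in csn_values:
        ts, sid = parse_csn(value)
        if sid not in latest or ts > latest[sid]:
            latest[sid] = ts
    return latest


def csn_lag(local_values, remote_values):
    """Return {serverID: seconds} where seconds = remote - local.

    A serverID that is missing on one side maps to None.
    """
    local = latest_csn_by_sid(local_values)
    remote = latest_csn_by_sid(remote_values)
    lag = {}
    for sid in sorted(set(local) | set(remote)):
        if sid in local and sid in remote:
            lag[sid] = (remote[sid] - local[sid]).total_seconds()
        else:
            lag[sid] = None
    return lag


if __name__ == "__main__":
    # Hypothetical sample values, not taken from a real server.
    node_a = [
        "20210303172400.000000Z#000000#001#000000",
        "20210303172430.000000Z#000000#002#000000",
    ]
    node_b = [
        "20210303172400.000000Z#000000#001#000000",
        "20210303172500.000000Z#000000#002#000000",
    ]
    for sid, seconds in csn_lag(node_a, node_b).items():
        print(f"serverID {sid}: lag {seconds}")
```

Since the two servers are necessarily read at slightly different moments, a monitoring check built on this would probably want to alert only when a nonzero lag persists across several samples, rather than on a single reading.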
I discussed this several times with Howard and Ondrej, but nobody came up with an explanation for why that happens.
Ciao, Michael.