On Wed, Mar 03, 2021 at 09:52:26PM +0100, Michael Ströder wrote:
--On Wednesday, March 3, 2021 6:24 PM +0100 Emmanuel Seyman emmanuel@seyman.fr wrote:
The problem is that I don't see any messages in the log that stand out as being errors (granted, I'm not sure what I'm looking for). In fact, the alert flaps every once in a while as the two nodes come back in sync and drift away from each other again.
I find these values surprising considering I've never seen a syncrepl error in the 2 years before the upgrade. Is there a known issue with replication in 2.4.57 that would explain these sync differences?
My slapdcheck package [1] also implements exactly this check, and sometimes it shows a difference although the changes have been correctly replicated (normal syncrepl).
You can look at the code to verify what it's doing:
https://gitlab.com/ae-dir/slapdcheck/-/blob/master/slapdcheck/__init__.py#L1...
(It reads the actual syncrepl providers from cn=config before comparing the contextCSN values for each serverID.)
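For readers who have not written such a check, here is a minimal sketch of a per-serverID contextCSN comparison. It is not slapdcheck's code: the function names are made up for illustration, and it assumes the contextCSN values have already been read from each server as strings in the usual timestamp#count#sid#mod form.

```python
from datetime import datetime, timezone


def parse_csn(csn):
    """Split a CSN string into (UTC timestamp, serverID as int)."""
    time_part, _count, sid, _mod = csn.split("#")
    ts = datetime.strptime(time_part, "%Y%m%d%H%M%S.%fZ")
    # The serverID field is written in hexadecimal.
    return ts.replace(tzinfo=timezone.utc), int(sid, 16)


def csn_by_sid(values):
    """Map each serverID to the newest timestamp among the given CSNs."""
    newest = {}
    for csn in values:
        ts, sid = parse_csn(csn)
        if sid not in newest or ts > newest[sid]:
            newest[sid] = ts
    return newest


def csn_deltas(provider_values, consumer_values):
    """Per-serverID difference in seconds (provider minus consumer).

    Only serverIDs present on both sides are compared.
    """
    prov = csn_by_sid(provider_values)
    cons = csn_by_sid(consumer_values)
    return {
        sid: (prov[sid] - cons[sid]).total_seconds()
        for sid in prov
        if sid in cons
    }
```

A non-zero delta from a single sample only says that the two servers last recorded different changes for that serverID at that instant; it does not by itself say how far behind the consumer is.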
I discussed this several times with Howard and Ondrej but no idea came up why that happens.
I don't remember the discussion anymore but there's a corner case people writing syncrepl checking scripts often forget to address:
If it takes 1 second to replicate a change, and the previous change happened x seconds before this one, there is going to be a 1-second window in which you see an x-second CSN difference between the provider and the consumer. In no way does that mean the consumer is x seconds behind.
If there's an acceptable delay of n seconds, you'd better wait for that amount of time before raising an alarm, either at the script level or at the monitoring infrastructure level. See the logic in syncmonitor[0] for an example of a live monitoring tool that should implement this. Wrapping the code in a Nagios-compatible check tool is pending; patches welcome.
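To illustrate the waiting logic at the script level, here is a rough sketch. It is not taken from syncmonitor; the class name, its parameters, and the injectable clock are all made up for illustration. It only raises an alarm once a mismatch has been observed continuously for the acceptable delay.

```python
import time


class SyncAlarm:
    """Alarm only when a CSN mismatch has persisted for grace_seconds."""

    def __init__(self, grace_seconds, clock=time.monotonic):
        self.grace_seconds = grace_seconds
        self.clock = clock            # injectable so the logic can be tested
        self.mismatch_since = None    # time the current mismatch was first seen

    def update(self, provider_csn, consumer_csn):
        """Feed one sample; return True if the alarm should fire."""
        if provider_csn == consumer_csn:
            # Back in sync: forget any mismatch seen earlier.
            self.mismatch_since = None
            return False
        now = self.clock()
        if self.mismatch_since is None:
            self.mismatch_since = now
        return now - self.mismatch_since >= self.grace_seconds
```

Called once per polling interval with the provider's and the consumer's CSN for a given serverID, this stays quiet during the short window described above. It is deliberately simplistic: on a busy provider the two values may never be sampled as equal even though the consumer keeps up, so a real check would probably also want to notice that the consumer's CSN is advancing.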
[0]. https://git.openldap.org/openldap/syncmonitor/-/blob/master/syncmonitor/envi...