On Wed, Mar 03, 2021 at 09:52:26PM +0100, Michael Ströder wrote:
> --On Wednesday, March 3, 2021 6:24 PM +0100 Emmanuel Seyman
> <emmanuel(a)seyman.fr> wrote:
>> The problem is that I don't see any messages in the log that stand
>> out as being errors (granted, I'm not sure what I'm looking for).
>> In fact, the alert flaps every once in a while as the two nodes
>> come back in sync and drift away from each other again.
>> I find these values surprising considering I've never seen a syncrepl
>> error in the 2 years before the upgrade. Is there a known issue with
>> replication in 2.4.57 that would explain these sync differences?
My slapdcheck package also implements exactly this check, and it
sometimes shows a difference even though the changes have been
replicated correctly (normal syncrepl).
You can look at the code to verify what it's doing:
(It reads the actual syncrepl providers from cn=config before comparing
the contextCSN values for each serverID.)
I discussed this several times with Howard and Ondrej, but nobody came
up with an explanation for why that happens.
I don't remember the discussion anymore, but there's a corner case that
people writing syncrepl checking scripts often forget to address:
If it takes 1 second to replicate a change, and the previous change
happened x seconds before this one, there's going to be a window of 1
second where you see an x-second CSN difference between the provider and
the consumer. In no way does that mean the consumer is x seconds behind.
If there's an acceptable delay of n seconds, you had better wait for
that amount of time before raising an alarm, either at the script level
or at the monitoring infrastructure level. See the logic in syncmonitor
for an example of a live monitoring tool that should implement this.
Wrapping that code in a Nagios-compatible check is still pending;
patches welcome.
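
To make the "wait n seconds before alarming" idea concrete, here is a
minimal sketch in Python. It is not the syncmonitor or slapdcheck code;
the names SyncGrace, csn_sid and check are made up for illustration. It
assumes the caller has already read the contextCSN values from both
servers, that CSNs have the usual fixed-width form
"YYYYmmddHHMMSS.uuuuuuZ#count#sid#mod" (so plain string comparison
orders them), and that "now" is a timestamp in seconds supplied by the
caller.

```python
from typing import Dict, Iterable, List, Tuple


def csn_sid(csn: str) -> str:
    """Return the serverID field of a CSN (third '#'-separated field)."""
    return csn.split("#")[2]


class SyncGrace:
    """Report a serverID only after the consumer has been missing the
    same provider CSN for at least `grace` seconds (hypothetical helper)."""

    def __init__(self, grace: float) -> None:
        self.grace = grace
        # sid -> (provider CSN we are waiting for, time we first missed it)
        self.pending: Dict[str, Tuple[str, float]] = {}

    def check(self, provider_csns: Iterable[str],
              consumer_csns: Iterable[str], now: float) -> List[str]:
        prov = {csn_sid(c): c for c in provider_csns}
        cons = {csn_sid(c): c for c in consumer_csns}
        late = []
        for sid, pcsn in prov.items():
            ccsn = cons.get(sid, "")  # "" sorts before any real CSN
            if ccsn >= pcsn:
                # In sync for this serverID: forget any earlier mismatch.
                self.pending.pop(sid, None)
                continue
            waiting = self.pending.get(sid)
            if waiting is None or ccsn >= waiting[0]:
                # New mismatch, or the consumer caught up with the CSN we
                # were waiting for and the provider has moved on since:
                # restart the clock instead of counting it as lag.
                waiting = (pcsn, now)
                self.pending[sid] = waiting
            if now - waiting[1] >= self.grace:
                late.append(sid)
        return sorted(late)
```

Note that the size of the CSN difference never enters the decision:
only how long the consumer has failed to reach a given provider CSN
does. A check script would call check() on every poll and raise the
alarm only when the returned list is non-empty.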
Senior Software Engineer
Symas Corporation http://www.symas.com
Packaged, certified, and supported LDAP solutions powered by OpenLDAP