On 3/3/21 8:58 PM, Quanah Gibson-Mount wrote:
--On Wednesday, March 3, 2021 6:24 PM +0100 Emmanuel Seyman
<emmanuel(a)seyman.fr> wrote:
> The problem is that I don't see any messages in the log that stand
> out as being errors (granted, I'm not sure what I'm looking for).
> In fact, the alert flaps every once in a while as the two nodes
> come back in sync and drift away from each other again.
>
> I find these values surprising considering I've never seen a syncrepl
> error in the 2 years before the upgrade. Is there a known issue with
> replication in 2.4.57 that would explain these sync differences?
The replication code in 2.4.44 was completely unreliable and could
report being in sync regardless of whether or not that was true. It's
also unknown to me if the nagios plugin is accurate for the current
codebase.
Generally what you want to look at are the contextCSN values in the root
of the DIT of each server to see if they match.
My slapdcheck package [1] also implements exactly this check, and
sometimes it shows a difference even though the changes have been
correctly replicated (normal syncrepl).
You can look at the code to verify what it's doing:
https://gitlab.com/ae-dir/slapdcheck/-/blob/master/slapdcheck/__init__.py...
(It reads the actual syncrepl providers from cn=config before comparing
the contextCSN values for each serverID.)
I discussed this several times with Howard and Ondrej, but none of us
came up with an explanation for why that happens.
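
For illustration, the per-serverID contextCSN comparison mentioned above
can be sketched roughly as follows. This is only a minimal sketch and not
the slapdcheck code: the function names (parse_csn, latest_by_sid,
sync_lag) are made up for this example, and it assumes the contextCSN
values have already been read from the suffix entry of each server and
are in the usual timestamp#count#sid#mod form.

```python
from datetime import datetime, timezone


def parse_csn(csn):
    """Split a CSN string into (timestamp, count, sid, mod).

    Assumes the form YYYYmmddHHMMSS.ffffffZ#count#sid#mod with the
    last three fields in hex.
    """
    ts, count, sid, mod = csn.split("#")
    when = datetime.strptime(ts, "%Y%m%d%H%M%S.%fZ").replace(tzinfo=timezone.utc)
    return when, int(count, 16), int(sid, 16), int(mod, 16)


def latest_by_sid(csns):
    """Map each serverID to its newest (timestamp, count) pair."""
    latest = {}
    for csn in csns:
        when, count, sid, _mod = parse_csn(csn)
        key = (when, count)
        if sid not in latest or key > latest[sid]:
            latest[sid] = key
    return latest


def sync_lag(local_csns, remote_csns):
    """Return {sid: seconds} for every serverID whose CSN differs.

    A positive value means the remote side is ahead. None means the
    serverID is present on only one side.
    """
    local = latest_by_sid(local_csns)
    remote = latest_by_sid(remote_csns)
    lag = {}
    for sid in sorted(set(local) | set(remote)):
        if local.get(sid) == remote.get(sid):
            continue  # identical CSN for this serverID, nothing to report
        if sid in local and sid in remote:
            lag[sid] = (remote[sid][0] - local[sid][0]).total_seconds()
        else:
            lag[sid] = None
    return lag


if __name__ == "__main__":
    # Made-up sample values, not taken from a real server.
    node_a = ["20210303172400.000000Z#000000#001#000000",
              "20210303172500.000000Z#000000#002#000000"]
    node_b = ["20210303172400.000000Z#000000#001#000000",
              "20210303172530.000000Z#000000#002#000000"]
    print(sync_lag(node_a, node_b))
```

The input lists would typically come from a base-scope search for the
contextCSN attribute on the database suffix of each server. An empty
result means all serverIDs match at the moment the values were read; a
short-lived non-empty result can simply mean a write was in flight
between the two reads.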
Ciao, Michael.
[1] https://www.stroeder.com/slapdcheck.html