On 3/4/21 12:20 PM, Ondřej Kuzník wrote:
On Wed, Mar 03, 2021 at 09:52:26PM +0100, Michael Ströder wrote:
> My slapdcheck package [1] also implements exactly this check and
> sometimes it shows a difference although the changes have been correctly
> replicated (normal syncrepl).
>
> You can look at the code to verify what it's doing:
>
> https://gitlab.com/ae-dir/slapdcheck/-/blob/master/slapdcheck/__init__.py...
>
> (It reads the actual syncrepl providers from cn=config before comparing
> the contextCSN values for each serverID.)
>
> I discussed this several times with Howard and Ondrej but no idea came
> up why that happens.

I don't remember the discussion anymore, but there's a corner case people
writing syncrepl checking scripts often forget to address:
If it takes 1 second to replicate a change, and the previous change
happened x seconds before this one, there's going to be a window of 1
second where you see an x-second CSN difference between the provider and
consumer. In no way does it mean the consumer is x seconds behind.
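
To make that corner case concrete, here is a minimal Python sketch of a per-serverID contextCSN comparison. It is not the slapdcheck or syncmonitor code; the function names are made up for illustration, and it assumes the usual CSN layout of timestamp#count#SID#mod with a hexadecimal SID and that the contextCSN values have already been read from both servers.

```python
from datetime import datetime, timezone


def parse_csn(csn):
    """Split a CSN like 20210303205226.000000Z#000000#001#000000
    into (serverID, timestamp). Assumes the usual four-field layout."""
    stamp, _count, sid, _mod = csn.split("#")
    ts = datetime.strptime(stamp, "%Y%m%d%H%M%S.%fZ").replace(tzinfo=timezone.utc)
    return int(sid, 16), ts


def csn_gaps(provider_values, consumer_values):
    """Return {serverID: seconds} by which the provider's contextCSN is
    ahead of the consumer's; None if the consumer has no CSN for that SID."""
    provider = dict(parse_csn(v) for v in provider_values)
    consumer = dict(parse_csn(v) for v in consumer_values)
    gaps = {}
    for sid, p_ts in provider.items():
        c_ts = consumer.get(sid)
        gaps[sid] = None if c_ts is None else (p_ts - c_ts).total_seconds()
    return gaps


# The provider just accepted a write; the previous write was one hour earlier.
provider = ["20210303205226.000000Z#000000#001#000000"]
consumer = ["20210303195226.000000Z#000000#001#000000"]
gaps = csn_gaps(provider, consumer)
```

Sampled inside the replication window, the gap for serverID 1 is 3600 seconds, even though the consumer may catch up a second later. The number measures the time between the last two changes, not the replication lag.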

I'm talking about the contextCSN difference being visible for several
*hours* while the changes have already been successfully replicated.
Replication delay is very short, syncrepl type is refreshAndPersist.

If there's an acceptable delay of n seconds, you'd better wait for that
amount of time before raising an alarm,
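
One way to read "wait for that amount of time" is sketched below in Python. This is only an illustration under assumptions, not the logic syncmonitor uses: the class name, the grace_seconds parameter and the caller-supplied clock value are all hypothetical. The idea is to alarm only when a serverID's CSN on the consumer has been different from the provider's, and unchanged, for at least n seconds.

```python
class MismatchTracker:
    """Raise an alarm only after a CSN mismatch has persisted,
    without the consumer making progress, for grace_seconds."""

    def __init__(self, grace_seconds):
        self.grace_seconds = grace_seconds
        # serverID -> (consumer CSN when the mismatch was first seen, time seen)
        self.first_seen = {}

    def check(self, sid, provider_csn, consumer_csn, now):
        """Return True if an alarm should be raised for this serverID."""
        if provider_csn == consumer_csn:
            # In sync: forget any earlier mismatch.
            self.first_seen.pop(sid, None)
            return False
        seen = self.first_seen.get(sid)
        if seen is None or seen[0] != consumer_csn:
            # New mismatch, or the consumer moved forward: restart the clock.
            self.first_seen[sid] = (consumer_csn, now)
            return False
        return now - seen[1] >= self.grace_seconds


tracker = MismatchTracker(grace_seconds=5)
results = [
    tracker.check(1, "csn-b", "csn-a", now=0),  # first sighting
    tracker.check(1, "csn-b", "csn-a", now=3),  # still within the grace period
    tracker.check(1, "csn-b", "csn-a", now=6),  # stuck for longer than n
    tracker.check(1, "csn-b", "csn-b", now=7),  # caught up
]
```

With n = 5 only the third check returns True. Choosing n is still a judgment call; the sketch only shows where that value would plug in.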

And what's an appropriate value for n? 86400? ;-]

See the logic in syncmonitor[0]

Ideally I'd like to query cn=monitor whether slapd thinks replication is
in a healthy state.

Ciao, Michael.