Hello, all.
I have two OpenLDAP servers in production that were running 2.4.44. Both of these servers host 5 directories, each with its own backend. This is a bog-standard 2-node RHEL7 multi-master configuration that has worked for years. The binaries are the LTB-Project RPMs.
I updated my OpenLDAP servers to 2.4.57 last week. A few minutes after the upgrade, I started to see syncrepl alerts in Nagios and I'm having a hard time convincing myself that the upgrade isn't responsible for the errors.
The problem is that I don't see any messages in the log that stand out as being errors (granted, I'm not sure what I'm looking for). In fact, the alert flaps every once in a while as the two nodes come back in sync and drift away from each other again.
Here's what the output of the syncrepl nagios plugin looks like:
2021-03-03 09:54:01: CRITICAL - directories are not in sync - 199 seconds late (W:10 - C:5)
2021-03-03 09:56:01: CRITICAL - directories are not in sync - 197 seconds late (W:10 - C:5)
2021-03-03 09:58:01: CRITICAL - directories are not in sync - 42 seconds late (W:10 - C:5)
2021-03-03 10:00:01: CRITICAL - directories are not in sync - 4200 seconds late (W:10 - C:5)
2021-03-03 10:02:01: CRITICAL - directories are not in sync - 81 seconds late (W:10 - C:5)
2021-03-03 10:04:01: CRITICAL - directories are not in sync - 201 seconds late (W:10 - C:5)
2021-03-03 10:06:01: CRITICAL - directories are not in sync - 196 seconds late (W:10 - C:5)
2021-03-03 10:08:01: CRITICAL - directories are not in sync - 42 seconds late (W:10 - C:5)
2021-03-03 10:10:01: CRITICAL - directories are not in sync - 200 seconds late (W:10 - C:5)
2021-03-03 10:12:01: CRITICAL - directories are not in sync - 81 seconds late (W:10 - C:5)
I find these values surprising considering I've never seen a syncrepl error in the 2 years before the upgrade. Is there a known issue with replication in 2.4.57 that would explain these sync differences?
Regards, Emmanuel
--On Wednesday, March 3, 2021 6:24 PM +0100 Emmanuel Seyman emmanuel@seyman.fr wrote:
The problem is that I don't see any messages in the log that stand out as being errors (granted, I'm not sure what I'm looking for). In fact, the alert flaps every once in a while as the two nodes come back in sync and drift away from each other again.
I find these values surprising considering I've never seen a syncrepl error in the 2 years before the upgrade. Is there a known issue with replication in 2.4.57 that would explain these sync differences?
The replication code in 2.4.44 was completely unreliable and could report being in sync regardless of whether or not that was true. It's also unknown to me if the nagios plugin is accurate for the current codebase.
Generally what you want to look at are the contextCSN values in the root of the DIT of each server to see if they match. If you're using delta-syncrepl, you need to also look at the contextCSN values in the root of the accesslog DIT as well.
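As a rough illustration of what such a comparison might look like (the CSN strings and serverIDs below are made up for the example; a real check would fetch the contextCSN attribute from the root entry of each server's DIT, e.g. with ldapsearch), here is a minimal sketch that parses OpenLDAP CSN values and reports how far each serverID's CSN on one server trails the other:

```python
from datetime import datetime, timezone

def parse_csn(csn):
    """Split an OpenLDAP CSN (timestamp#count#serverID#modifier) into parts."""
    ts, count, sid, mod = csn.split("#")
    when = datetime.strptime(ts, "%Y%m%d%H%M%S.%fZ").replace(tzinfo=timezone.utc)
    return when, int(count, 16), int(sid, 16), int(mod, 16)

def csn_lag(provider_csns, consumer_csns):
    """Per serverID, how many seconds the consumer's contextCSN trails the
    provider's. A serverID missing on the consumer counts as infinitely behind."""
    provider = {parse_csn(c)[2]: parse_csn(c)[0] for c in provider_csns}
    consumer = {parse_csn(c)[2]: parse_csn(c)[0] for c in consumer_csns}
    lag = {}
    for sid, p_time in provider.items():
        c_time = consumer.get(sid)
        lag[sid] = float("inf") if c_time is None else (p_time - c_time).total_seconds()
    return lag

# Hypothetical values; real ones come from each server's root DIT entry.
p = ["20210303085400.000000Z#000000#001#000000",
     "20210303085600.000000Z#000000#002#000000"]
c = ["20210303085400.000000Z#000000#001#000000",
     "20210303085530.000000Z#000000#002#000000"]
print(csn_lag(p, c))  # {1: 0.0, 2: 30.0} - serverID 2 trails by 30 seconds
```

With delta-syncrepl, the same comparison would additionally be run against the contextCSN values in the root of the accesslog DIT.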
Regards, Quanah
--
Quanah Gibson-Mount Product Architect Symas Corporation Packaged, certified, and supported LDAP solutions powered by OpenLDAP: http://www.symas.com
On 3/3/21 8:58 PM, Quanah Gibson-Mount wrote:
--On Wednesday, March 3, 2021 6:24 PM +0100 Emmanuel Seyman emmanuel@seyman.fr wrote:
The problem is that I don't see any messages in the log that stand out as being errors (granted, I'm not sure what I'm looking for). In fact, the alert flaps every once in a while as the two nodes come back in sync and drift away from each other again.
I find these values surprising considering I've never seen a syncrepl error in the 2 years before the upgrade. Is there a known issue with replication in 2.4.57 that would explain these sync differences?
The replication code in 2.4.44 was completely unreliable and could report being in sync regardless of whether or not that was true. It's also unknown to me if the nagios plugin is accurate for the current codebase.
Generally what you want to look at are the contextCSN values in the root of the DIT of each server to see if they match.
My slapdcheck package [1] also implements exactly this check and sometimes it shows a difference although the changes have already been correctly replicated (normal syncrepl).
You can look at the code to verify what it's doing:
https://gitlab.com/ae-dir/slapdcheck/-/blob/master/slapdcheck/__init__.py#L1...
(It reads the actual syncrepl providers from cn=config before comparing the contextCSN values for each serverID.)
I discussed this several times with Howard and Ondrej but no idea came up why that happens.
Ciao, Michael.
On Wed, Mar 03, 2021 at 09:52:26PM +0100, Michael Ströder wrote:
--On Wednesday, March 3, 2021 6:24 PM +0100 Emmanuel Seyman emmanuel@seyman.fr wrote:
The problem is that I don't see any messages in the log that stand out as being errors (granted, I'm not sure what I'm looking for). In fact, the alert flaps every once in a while as the two nodes come back in sync and drift away from each other again.
I find these values surprising considering I've never seen a syncrepl error in the 2 years before the upgrade. Is there a known issue with replication in 2.4.57 that would explain these sync differences?
My slapdcheck package [1] also implements exactly this check and sometimes it shows a difference although the changes have already been correctly replicated (normal syncrepl).
You can look at the code to verify what it's doing:
https://gitlab.com/ae-dir/slapdcheck/-/blob/master/slapdcheck/__init__.py#L1...
(It reads the actual syncrepl providers from cn=config before comparing the contextCSN values for each serverID.)
I discussed this several times with Howard and Ondrej but no idea came up why that happens.
I don't remember the discussion anymore, but there's a corner case that people writing syncrepl checking scripts often forget to address:
If it takes 1 second to replicate a change and the previous change happened x seconds before this one, there's going to be a window of 1 second where you see an x-second CSN difference between the provider and consumer. In no way does it mean the consumer is x seconds behind.
If there's an acceptable delay of n seconds, you should wait for that amount of time before raising an alarm, either at the script level or at the monitoring infrastructure level. See the logic in syncmonitor[0] for an example of a live monitoring tool that should implement this; wrapping the code into a Nagios-check-compatible tool is pending, patches welcome.
[0]. https://git.openldap.org/openldap/syncmonitor/-/blob/master/syncmonitor/envi...
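The corner case above can be made concrete with a small sketch (all timestamps are hypothetical): if the previous change happened an hour before the current one and replication takes 1 second, a naive check that simply diffs CSN timestamps will briefly report an hour of "lag" even though the consumer is only about a second behind:

```python
# Hypothetical timeline: previous change at t=0, new change at t=3600,
# replication takes 1 second to deliver the new change to the consumer.
prev_change = 0
new_change = 3600
repl_delay = 1

def naive_lag(now):
    """CSN-timestamp difference a naive check sees at time `now`."""
    # The provider's contextCSN advances as soon as the change is written...
    provider_csn = new_change if now >= new_change else prev_change
    # ...but the consumer's only advances once the change has replicated.
    consumer_csn = new_change if now >= new_change + repl_delay else prev_change
    return provider_csn - consumer_csn

# During the 1-second replication window, the check reports a "lag" equal
# to the gap between the two changes, not the actual replication delay:
print(naive_lag(3600.5))  # 3600 - looks an hour behind, really ~1s behind
print(naive_lag(3601.5))  # 0 - back in sync one second later
```

This is why holding the alarm until the difference has persisted longer than the acceptable delay n, rather than alerting on a single sample, avoids the false positives.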
On 3/4/21 12:20 PM, Ondřej Kuzník wrote:
On Wed, Mar 03, 2021 at 09:52:26PM +0100, Michael Ströder wrote:
My slapdcheck package [1] also implements exactly this check and sometimes it shows a difference although the changes have already been correctly replicated (normal syncrepl).
You can look at the code to verify what it's doing:
https://gitlab.com/ae-dir/slapdcheck/-/blob/master/slapdcheck/__init__.py#L1...
(It reads the actual syncrepl providers from cn=config before comparing the contextCSN values for each serverID.)
I discussed this several times with Howard and Ondrej but no idea came up why that happens.
I don't remember the discussion anymore but there's a corner case people writing syncrepl checking scripts often forget to address:
If it takes 1 second to replicate a change and the previous change happened x seconds before this one, there's going to be a window of 1 second where you see an x-second CSN difference between the provider and consumer. In no way does it mean the consumer is x seconds behind.
I'm talking about the contextCSN difference being visible for several *hours* while the changes have already been successfully replicated. The replication delay is very short; the syncrepl type is refreshAndPersist.
If there's an acceptable delay of n seconds, you should wait for that amount of time before raising an alarm,
And what's an appropriate value for n? 86400? ;-]
See the logic in syncmonitor[0]
Ideally I'd like to query cn=monitor whether slapd thinks replication is in a healthy state.
Ciao, Michael.
On Thu, Mar 04, 2021 at 01:09:55PM +0100, Michael Ströder wrote:
On 3/4/21 12:20 PM, Ondřej Kuzník wrote:
If it takes 1 second to replicate a change and the previous change happened x seconds before this one, there's going to be a window of 1 second where you see an x-second CSN difference between the provider and consumer. In no way does it mean the consumer is x seconds behind.
I'm talking about the contextCSN difference being visible for several *hours* while the changes have already been successfully replicated. The replication delay is very short; the syncrepl type is refreshAndPersist.
I don't think I've ever seen this outside slapcat (only checkpoints affect the on-disk version). Please submit a bug report if you can reproduce this.
If there's an acceptable delay of n seconds, you better wait for that amount of time before raising an alarm,
And what's an appropriate value for n? 86400? ;-]
Depends where in the galaxy you place your replicas :)
See the logic in syncmonitor[0]
Ideally I'd like to query cn=monitor whether slapd thinks replication is in a healthy state.
A consumer will never think its replication is slow or broken (unless it gets an actual error, and you can already see that in cn=monitor). A provider might want to expose some information, but that's not implemented yet and would not be able to spot many issues if other providers exist.
openldap-technical@openldap.org