Hello, all.
I have two OpenLDAP servers in production that were running 2.4.44. Each server hosts 5 directories, each with its own backend. This is a bog-standard 2-node RHEL7 multi-master configuration that has worked for years. The binaries are the LTB-Project RPMs.
I updated my OpenLDAP servers to 2.4.57 last week. A few minutes after the upgrade, I started to see syncrepl alerts in Nagios and I'm having a hard time convincing myself that the upgrade isn't responsible for the errors.
The problem is that I don't see any messages in the log that stand out as being errors (granted, I'm not sure what I'm looking for). In fact, the alert flaps every once in a while as the two nodes come back in sync and drift away from each other again.
Here's what the output of the syncrepl Nagios plugin looks like:
2021-03-03 09:54:01: CRITICAL - directories are not in sync - 199 seconds late (W:10 - C:5)
2021-03-03 09:56:01: CRITICAL - directories are not in sync - 197 seconds late (W:10 - C:5)
2021-03-03 09:58:01: CRITICAL - directories are not in sync - 42 seconds late (W:10 - C:5)
2021-03-03 10:00:01: CRITICAL - directories are not in sync - 4200 seconds late (W:10 - C:5)
2021-03-03 10:02:01: CRITICAL - directories are not in sync - 81 seconds late (W:10 - C:5)
2021-03-03 10:04:01: CRITICAL - directories are not in sync - 201 seconds late (W:10 - C:5)
2021-03-03 10:06:01: CRITICAL - directories are not in sync - 196 seconds late (W:10 - C:5)
2021-03-03 10:08:01: CRITICAL - directories are not in sync - 42 seconds late (W:10 - C:5)
2021-03-03 10:10:01: CRITICAL - directories are not in sync - 200 seconds late (W:10 - C:5)
2021-03-03 10:12:01: CRITICAL - directories are not in sync - 81 seconds late (W:10 - C:5)
I find these values surprising considering I've never seen a syncrepl error in the 2 years before the upgrade. Is there a known issue with replication in 2.4.57 that would explain these sync differences?
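For anyone who wants to cross-check the plugin's numbers by hand, here is a minimal Python sketch of how a "seconds late" figure could be derived from the contextCSN values read off each node. It assumes the plugin compares contextCSN timestamps, which I have not verified against its source. The function names and the two sample CSN strings are made up for illustration; they are not taken from my servers.

```python
from datetime import datetime, timezone

def csn_time(csn: str) -> datetime:
    """Return the UTC timestamp embedded in a contextCSN value.

    A CSN looks like 'YYYYmmddHHMMSS.ffffffZ#count#sid#mod';
    only the first '#'-separated field is a timestamp.
    """
    stamp = csn.split("#", 1)[0]
    parsed = datetime.strptime(stamp, "%Y%m%d%H%M%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)

def lag_seconds(csn_a: str, csn_b: str) -> float:
    """Absolute difference, in seconds, between two CSN timestamps."""
    return abs((csn_time(csn_a) - csn_time(csn_b)).total_seconds())

# Hypothetical values, one read from each node for the same server ID (001).
node1_csn = "20210303095401.000000Z#000000#001#000000"
node2_csn = "20210303095042.000000Z#000000#001#000000"

print(lag_seconds(node1_csn, node2_csn))  # 199.0
```

In a multi-master setup each suffix carries one contextCSN per server ID, so the comparison only makes sense between values that share the same SID field (the third '#'-separated field).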
Regards, Emmanuel