Hi,
For some weeks now we have been operating an OpenLDAP deployment in N-Way Multi-Provider Delta-syncrepl replication. There are around 130000 entries in the main database, which is around 635 MB in size. Currently the replication consists of two VMs, each with 4 GB of RAM and 2 CPUs. The VMs are running slapd 2.6.10. I posted the configuration at [1]; I only removed the credentials and some user-specific ACLs.
This setup worked flawlessly for some time until both servers rebooted during a scheduled patch cycle. Since then, we have seen drastically increased response times and CPU utilization on both VMs. On one of the servers (ldap08) I see the following log entry every few seconds: do_syncrep2: rid=910 (4096) Content Sync Refresh Required
When I compare the contextCSN values of both servers, they differ a little, but by less than 5 seconds at most. They do, however, change constantly, because these servers are used for login and store the last-login and intruder-detection information, which must be replicated. Most of the other data is static, but there are some changes (changed passwords, name changes) every few minutes or so. For example, a few minutes ago we had the following values:
20250717103207.135309Z#000000#000#000000
20250717115153.689217Z#000000#06a#000000
20250814105455.935611Z#000000#06b#000000
20250814105522.282937Z#000000#06c#000000
I understand that the 000 entry is from before enabling replication and the 06a value is from an old server no longer belonging to this replication (ldap05).
When I last had a situation like this, the servers were not in production, so I shut both down, copied the database over and started them up again. This is not an option now, as they are in production and needed for login. Preventing updates for longer than a few minutes is not an option either, and even that has to be announced ahead of time. When I last tried adding a brand new server, configured in a similar way, in testing, it took multiple hours to get the new server up to speed. I fear that removing one of the two servers for multiple hours would overwhelm the remaining server with requests. In the future the plan is to have at least 3 servers in this replication, but currently we only have two. It is, however, an option to prepare new server(s) and add them to the replication if that might help somehow.
One more piece of information: the accesslog database is currently around 2 GB in size.
What would be the best approach to remediate this situation?
[1] https://next.hessenbox.de/index.php/s/jFX9gAEWXoqoxNS
Kind regards, Clemens (Bergmann)
Hello Clemens,
We ran into the same kind of issue last year. What happened was that the last-login information was refreshed every minute. Querying the accesslog DB showed that 95% of the logged entries were lastlogon updates. We increased the refresh delay to 30m.
We also checked disk usage and saw high write I/O whenever the accesslog cleanup occurred (every 4h). We shortened the cleanup interval to 2h (so there would be half the number of records to remove each time) and also reduced the log history a little.
I'm not on my laptop so I can't provide the exact settings but you can check this.
Regards,
Jerome

From: Bergmann, Clemens clemens.bergmann@tu-darmstadt.de
Sent: Thursday, 14 August 2025 13:32
To: openldap-technical@openldap.org
Subject: Reduced performance after restart because of replication inconsistency
Hi!
I'm definitely not an expert (see also https://serverfault.com/q/1177576/407952), but the symptom seems to indicate that either the servers were not shut down cleanly (I guess one was shut down, patched and started, and then the other one), or there were local changes on each server before replication was working. It now seems you need a manual content sync. Your replication log isn't full, BTW? Also, what is the cumulative size of your MDB databases compared to the RAM you have?
I think (see disclaimer at start) shutting down one server would be enough for a manual content sync: shut down the server, then delete the database and the changelog database. Import (slapadd) a reasonably current export from the other server (which options to use, BTW?). This will grow the changelog tremendously. I'm unsure whether you should delete the changelog database again before restarting the node or not; maybe the experts can tell. The newly started node should then pull the outstanding changes from the other node.
Kind regards, Ulrich Windl
-----Original Message----- From: Bergmann, Clemens clemens.bergmann@tu-darmstadt.de Sent: Thursday, August 14, 2025 1:33 PM To: openldap-technical@openldap.org Subject: [EXT] Reduced performance after restart because of replication inconsistency
-- Clemens Bergmann [er/ihm; he/him] Gruppe Nutzermanagement und Entwicklung Technische Universität Darmstadt Hochschulrechenzentrum, Alexanderstraße 2, 64283 Darmstadt Tel. +49 6151 16 71184 http://www.hrz.tu-darmstadt.de/
On Thu, Aug 14, 2025 at 11:32:45AM +0000, Bergmann, Clemens wrote:
> Hi,
> For some weeks now we have been operating an OpenLDAP deployment in N-Way Multi-Provider Delta-syncrepl replication. There are around 130000 entries in the main database, which is around 635 MB in size. Currently the replication consists of two VMs, each with 4 GB of RAM and 2 CPUs. The VMs are running slapd 2.6.10. I posted the configuration at [1]; I only removed the credentials and some user-specific ACLs.
> This setup worked flawlessly for some time until both servers rebooted during a scheduled patch cycle. Since then, we have seen drastically increased response times and CPU utilization on both VMs. On one of the servers (ldap08) I see the following log entry every few seconds: do_syncrep2: rid=910 (4096) Content Sync Refresh Required
Hi Clemens, check your contextCSNs and make sure they are all in sync (the DBs *and* their corresponding accesslog as well). If your servers (through misconfiguration or otherwise) missed a contextCSN update to accesslog and that CSN never got updated afterwards, deltasync will have to keep falling back to plain syncrepl (client has a cookie indicating "future" data).
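A minimal sketch of that comparison, assuming the contextCSN values of the main database and of its accesslog database have already been read (for example with a base-scope search on each suffix). The first list is taken from the values quoted in this thread; the accesslog list is purely hypothetical and only there to show what a lagging server ID looks like. The function names are illustrative as well.

```python
from datetime import datetime, timezone

def parse_csn(csn):
    """Split a contextCSN value into (UTC timestamp, server ID)."""
    stamp, _count, sid, _mod = csn.split("#")
    when = datetime.strptime(stamp, "%Y%m%d%H%M%S.%fZ")
    return when.replace(tzinfo=timezone.utc), int(sid, 16)

def newest_per_sid(csns):
    """Keep only the newest timestamp seen for each server ID."""
    newest = {}
    for csn in csns:
        when, sid = parse_csn(csn)
        if sid not in newest or when > newest[sid]:
            newest[sid] = when
    return newest

def stale_sids(db_csns, log_csns):
    """Server IDs whose accesslog CSN is missing or older than the main DB's."""
    db, log = newest_per_sid(db_csns), newest_per_sid(log_csns)
    return sorted(sid for sid, when in db.items()
                  if sid not in log or log[sid] < when)

# Main database values as quoted in this thread.
MAIN_DB_CSNS = [
    "20250717103207.135309Z#000000#000#000000",
    "20250717115153.689217Z#000000#06a#000000",
    "20250814105455.935611Z#000000#06b#000000",
    "20250814105522.282937Z#000000#06c#000000",
]
# Hypothetical accesslog values: SID 06c lags behind the main database.
ACCESSLOG_CSNS = [
    "20250717103207.135309Z#000000#000#000000",
    "20250717115153.689217Z#000000#06a#000000",
    "20250814105455.935611Z#000000#06b#000000",
    "20250814101500.000000Z#000000#06c#000000",
]

for sid in stale_sids(MAIN_DB_CSNS, ACCESSLOG_CSNS):
    print(f"accesslog is behind for SID {sid:03x}")
```

On a busy server the two reads should be taken as close together as possible, since both values keep moving; a server ID that stays behind across several reads is the interesting case.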
> When I compare the contextCSN values of both servers, they differ a little, but by less than 5 seconds at most. They do, however, change constantly, because these servers are used for login and store the last-login and intruder-detection information, which must be replicated. Most of the other data is static, but there are some changes (changed passwords, name changes) every few minutes or so. For example, a few minutes ago we had the following values:
First I would examine I/O status of the server: is there congestion? Is that on the read/write side? In general, read side congestion suggests low RAM, write side congestion suggests configuration issues or platform I/O limitations. Also if you're limited on write I/O, your cluster will keep struggling however many servers you spin up.
If it's write I/O, have a go at reading your accesslog and checking whether there is a major source of writes. Attributes like pwdLastSuccess can have their update granularity limited, which often greatly reduces write traffic.
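One way to look for a dominant write source is to tally which attributes show up in the reqMod values of the accesslog entries. The sketch below assumes those values have already been exported into a list of strings of the form "attribute:op value"; the sample data is hypothetical and the function name is illustrative.

```python
from collections import Counter

def tally_modified_attributes(req_mods):
    """Count how often each attribute appears in a list of reqMod values."""
    counts = Counter()
    for value in req_mods:
        attr, sep, _rest = value.partition(":")
        if sep:  # skip anything that does not look like "attr:op value"
            counts[attr.lower()] += 1
    return counts

# Hypothetical reqMod values, for illustration only.
SAMPLE_REQ_MODS = [
    "pwdLastSuccess:= 20250814105455Z",
    "pwdLastSuccess:= 20250814105501Z",
    "pwdLastSuccess:= 20250814105507Z",
    "pwdFailureTime:+ 20250814105510.000001Z",
    "displayName:= Example User",
]

for attr, count in tally_modified_attributes(SAMPLE_REQ_MODS).most_common():
    print(f"{count:6d}  {attr}")
```

If one attribute accounts for most of the entries, that is the one worth rate-limiting first.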
> When I last had a situation like this, the servers were not in production, so I shut both down, copied the database over and started them up again. This is not an option now, as they are in production and needed for login. Preventing updates for longer than a few minutes is not an option either, and even that has to be announced ahead of time. When I last tried adding a brand new server, configured in a similar way, in testing, it took multiple hours to get the new server up to speed. I fear that removing one of the two servers for multiple hours would overwhelm the remaining server with requests. In the future the plan is to have at least 3 servers in this replication, but currently we only have two. It is, however, an option to prepare new server(s) and add them to the replication if that might help somehow.
> One more piece of information: the accesslog database is currently around 2 GB in size.
Good. Just like Ulrich mentioned, set up monitoring to make sure it *never* becomes full. Missed accesslog writes cause hard-to-recover synchronisation errors.
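A rough sketch of such a check, assuming the page counters of the accesslog database are available from monitoring (back-mdb is assumed here to expose them under cn=monitor as olmMDBPagesMax, olmMDBPagesUsed and olmMDBPagesFree; verify the names on your version). The 80% threshold and the function names are arbitrary choices for illustration.

```python
def mdb_fill_ratio(pages_max, pages_used, pages_free):
    """Fraction of the configured map size that is effectively in use.

    Pages on the free list can be reused, so they are subtracted from
    the used count before dividing by the maximum.
    """
    if pages_max <= 0:
        raise ValueError("pages_max must be positive")
    return (pages_used - pages_free) / pages_max

def needs_attention(pages_max, pages_used, pages_free, threshold=0.8):
    """True once the database has crossed the alert threshold."""
    return mdb_fill_ratio(pages_max, pages_used, pages_free) >= threshold

# Hypothetical counter values, for illustration only.
print(needs_attention(pages_max=1000, pages_used=900, pages_free=50))
```

Alerting well before the limit leaves time to raise the maximum size or shorten the log retention before any write is refused.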
Regards,
Hi Ondřej,
thanks for the tips. Just to make sure I am not misunderstanding something fundamental: the accesslog database has the syncprov overlay configured for access from the other servers, but no olcSyncrepl attribute. It is referenced as 'logbase' in the olcSyncrepl attribute of the main database. My understanding is that the accesslog database is a local database that helps with synchronization and is therefore not itself synchronized. Is this correct?
You can see my "full" (minus credentials) config under [1] if I missed providing some relevant information.
[1] https://next.hessenbox.de/index.php/s/jFX9gAEWXoqoxNS
Kind regards, Clemens (Bergmann)
On Mon, Aug 18, 2025 at 01:50:44PM +0000, Bergmann, Clemens wrote:
> Hi Ondřej,
> thanks for the tips. Just to make sure I am not misunderstanding something fundamental: the accesslog database has the syncprov overlay configured for access from the other servers, but no olcSyncrepl attribute. It is referenced as 'logbase' in the olcSyncrepl attribute of the main database. My understanding is that the accesslog database is a local database that helps with synchronization and is therefore not itself synchronized. Is this correct?
Yes, the consumer reads the provider's accesslog and processes the entries therein as the changes they represent (this is the delta part of delta-syncrepl).
Its own accesslog overlay will then create new entries based on what has been done (so they will differ from those on the server it was replicating from). This is how the DB and its accesslog are tied together, and what makes it suitable as a deltasync source for someone trying to replicate it.
Regards,