https://bugs.openldap.org/show_bug.cgi?id=10274
--- Comment #6 from falgon.comp@gmail.com --- Hello, thank you very much for your detailed feedback and the time you have spent on this.
> First things first, can you use the latest: 2.5.18 at the very least, 2.6.8 or even OPENLDAP_REL_ENG_2_6 (the upcoming 2.6.9) would be better. And once you've upgraded, can you wipe the accesslog DBs all around (or check there is no duplicate sid in its minCSN), there's a known issue where minCSNs can get wrongly populated pre-2.5.18/2.6.8 (ITS#10173).
We were able to test version 2.5.18 (with all DBs wiped clean), but unfortunately we reproduced the problem. We are currently unable to test the 2.6 series; we may be able to do so around mid-2025.
> If the issue persists, the gaps tend to be around slap_queue_csn/slap_graduate_commit_csn. Are you able to run the test with a patched slapd, logging how many slap_csn_entry items slap_graduate_commit_csn had to crawl before it found the right one? This list shouldn't keep growing so that's one thing I'd like to rule out.
We are working on this so that we can get a precise view of it and provide you with further clean and useful logs.
> Also I suspect your configuration is slightly different (since the configs do not acknowledge there are 4 servers replicating with each other) but hopefully there isn't much else that's different.
We have globally the same configuration as the one initially provided (we were able to reproduce the problem with it).
> I can't see evidence of I/O issues but you can always monitor iowait I guess.
We have visibility on this part as well; we have many different metrics we can provide if you want/need them.
> Monitoring-wise, you also want to keep an eye on the contextCSNs of the cluster if you're not doing that yet (a DB is linked to its accesslog, divergence between their contextCSNs is also notable BTW).
Yes, we also have scripts at our disposal that check the contextCSN of each server to detect replication delay and alert us in the event of a problem.
> Yes, I can't see an actual replication conflict in the logs (I suspect the "delta-sync lost sync" messages are down to ITS#10173), so that's fine, the servers just get slower over time which shouldn't be happening and I don't remember seeing anything like it before either.
If you need new logs from version 2.5.18, we can provide them.
Unfortunately the problem persists; we are still running tests to try to find its origin. We had already checked the virtualization, network, and storage layers, but we have reopened discussions with those teams to dig further into this issue and find new leads. The detail that is problematic for us is that we had managed to reproduce the problem on a local VM with no network traffic involved.
I don't know whether you were able to try to reproduce the problem? We are interested in any ideas you may have.
Once again thank you for your time and involvement.