https://bugs.openldap.org/show_bug.cgi?id=10274
--- Comment #5 from Ondřej Kuzník ondra@mistotebe.net ---
On Tue, Nov 05, 2024 at 12:46:36PM +0000, openldap-its@openldap.org wrote:
>> I would question the design of your experiment: it appears you have 4 write-accepting nodes and attempt to saturate each with as much write load as it will handle (global write traffic being at least 4 times what the slowest node can handle).
>
> This is not really the case: currently one of our instances can handle from 300 MOD/s up to 450 MOD/s with acceptable latencies. See the Test01 folder.
Hi, thanks for the logs, they definitely help drill down further. Based on that I can see you're aiming to keep it below 450 mods/s globally, so things should be able to keep up. Instead, some servers just show pauses in the logs that get increasingly longer.
First things first, can you use the latest release: 2.5.18 at the very least; 2.6.8 or even OPENLDAP_REL_ENG_2_6 (the upcoming 2.6.9) would be better. And once you've upgraded, can you wipe the accesslog DBs all around (or check there is no duplicate sid in its minCSN)? There's a known issue where minCSNs can get wrongly populated pre-2.5.18/2.6.8 (ITS#10173).
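For reference, the sid is the third '#'-separated field of a CSN (timestamp#count#sid#mod), so a duplicate sid means two of the minCSN values on the accesslog suffix entry carry the same sid. A rough standalone illustration of that check (sample values made up):

    /* Illustration only: extract the sid field of a CSN and compare.
     * Real values come from the minCSN attribute on the accesslog suffix entry. */
    #include <stdio.h>
    #include <string.h>

    /* Copy the sid field (third '#'-separated component) into sid[]; 0 on success. */
    static int
    csn_sid( const char *csn, char *sid, size_t sidlen )
    {
        const char *p = strchr( csn, '#' );        /* after the timestamp */
        const char *q;
        size_t len;

        if ( p == NULL || ( p = strchr( p + 1, '#' ) ) == NULL )
            return -1;                             /* malformed CSN */
        p++;                                       /* start of the sid */
        q = strchr( p, '#' );
        len = q ? (size_t)( q - p ) : strlen( p );
        if ( len + 1 > sidlen )
            return -1;
        memcpy( sid, p, len );
        sid[len] = '\0';
        return 0;
    }

    int
    main( void )
    {
        /* Two hypothetical minCSN values; each should carry a distinct sid. */
        const char *mincsn[] = {
            "20241105124636.123456Z#000000#001#000000",
            "20241105124640.654321Z#000000#001#000000",  /* same sid 001: bad */
        };
        char s0[8], s1[8];

        if ( csn_sid( mincsn[0], s0, sizeof s0 ) == 0 &&
             csn_sid( mincsn[1], s1, sizeof s1 ) == 0 &&
             strcmp( s0, s1 ) == 0 )
            printf( "duplicate sid %s in minCSN values\n", s0 );
        return 0;
    }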
If the issue persists, the gaps tend to be around slap_queue_csn/slap_graduate_commit_csn. Are you able to run the test with a patched slapd, logging how many slap_csn_entry items slap_graduate_commit_csn had to crawl before it found the right one? That list shouldn't keep growing, so this is one thing I'd like to rule out.
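For illustration, something along these lines is what I mean (a sketch written from memory against the 2.6-era servers/slapd/ctxcsn.c; field and list names may need adjusting to your tree, and the original op->o_csn cleanup is omitted for brevity):

    void
    slap_graduate_commit_csn( Operation *op )
    {
        struct slap_csn_entry *csne;
        BackendDB *be;
        unsigned long crawled = 0;    /* added: entries examined so far */

        if ( op == NULL || BER_BVISNULL( &op->o_csn ) || op->o_bd == NULL )
            return;
        be = op->o_bd->bd_self;

        ldap_pvt_thread_mutex_lock( &be->be_pcl_mutex );
        LDAP_TAILQ_FOREACH( csne, be->be_pending_csn_list, ce_csn_link ) {
            crawled++;                /* added */
            if ( csne->ce_opid == op->o_opid && csne->ce_connid == op->o_connid ) {
                /* added: log how deep into the pending list the match was found */
                Debug( LDAP_DEBUG_SYNC,
                    "slap_graduate_commit_csn: matched after %lu entries\n",
                    crawled );
                LDAP_TAILQ_REMOVE( be->be_pending_csn_list, csne, ce_csn_link );
                /* ... original op->o_csn handling and csne freeing go here ... */
                break;
            }
        }
        ldap_pvt_thread_mutex_unlock( &be->be_pcl_mutex );
    }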
Also, I suspect your actual configuration is slightly different (since the configs do not acknowledge there are 4 servers replicating with each other), but hopefully there isn't much else that's different.
> We have done many performance tests with a single instance, multiple instances, mirror mode and MMR mode; our environment is supposed to hold the load without problems. We probably even oversized it. We have never exceeded our capacities, whether RAM / CPU / system load / network... The only thing we can have doubts about is our storage array, which other teams are also investigating.
I can't see evidence of I/O issues but you can always monitor iowait I guess.
Monitoring-wise, you also want to keep an eye on the contextCSNs across the cluster if you're not doing that yet (a DB is linked to its accesslog; divergence between their contextCSNs is also notable, BTW).
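If it helps, here's a minimal libldap sketch that dumps both sets of contextCSN values for comparison; the URI, the suffixes and the anonymous read are assumptions, adjust to your deployment, and build with -lldap -llber:

    #include <stdio.h>
    #include <ldap.h>

    /* Print every contextCSN value found on the given suffix entry. */
    static void
    dump_csns( LDAP *ld, const char *base )
    {
        char *attrs[] = { "contextCSN", NULL };
        LDAPMessage *res = NULL, *e;
        struct berval **vals;
        int i, rc;

        rc = ldap_search_ext_s( ld, base, LDAP_SCOPE_BASE, "(objectClass=*)",
                attrs, 0, NULL, NULL, NULL, 0, &res );
        if ( rc != LDAP_SUCCESS ) {
            fprintf( stderr, "%s: %s\n", base, ldap_err2string( rc ) );
            ldap_msgfree( res );
            return;
        }
        for ( e = ldap_first_entry( ld, res ); e != NULL; e = ldap_next_entry( ld, e ) ) {
            vals = ldap_get_values_len( ld, e, "contextCSN" );
            for ( i = 0; vals != NULL && vals[i] != NULL; i++ )
                printf( "%s: %s\n", base, vals[i]->bv_val );
            ldap_value_free_len( vals );
        }
        ldap_msgfree( res );
    }

    int
    main( void )
    {
        LDAP *ld;
        int vers = LDAP_VERSION3;

        /* URI and suffixes below are placeholders for this sketch. */
        if ( ldap_initialize( &ld, "ldap://localhost:389" ) != LDAP_SUCCESS )
            return 1;
        ldap_set_option( ld, LDAP_OPT_PROTOCOL_VERSION, &vers );
        /* anonymous read assumed; bind as needed (e.g. SASL/EXTERNAL over ldapi://) */
        dump_csns( ld, "dc=example,dc=com" );   /* main DB suffix */
        dump_csns( ld, "cn=accesslog" );        /* its accesslog suffix */
        ldap_unbind_ext_s( ld, NULL, NULL );
        return 0;
    }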
> For this case we never exceed 50% of the allocated capacities, and we have already followed all the OpenLDAP recommendations. We have a lot of statistics and monitoring for our directories. As said previously, we never reach system limits when the problem occurs. Our only current lead is that a replication conflict could be causing this problem.
Yes, I can't see an actual replication conflict in the logs (I suspect the "delta-sync lost sync" messages are down to ITS#10173), so that's fine. The servers just get slower over time, which shouldn't be happening, and I don't remember seeing anything like it before either.
Again, thanks for the logs and let us know how the interventions above pan out.
Thanks,