https://bugs.openldap.org/show_bug.cgi?id=10274
--- Comment #5 from Ondřej Kuzník ondra@mistotebe.net ---
On Tue, Nov 05, 2024 at 12:46:36PM +0000, openldap-its@openldap.org wrote:
>> I would question the design of your experiment: it appears you have 4 write-accepting nodes and attempt to saturate each with as much write load as it will handle (global write traffic being at least 4 times what the slowest node can handle).
>
> This is not really the case: currently one of our instances can handle from 300 MOD/s up to 450 MOD/s with acceptable latencies. See the Test01 folder.
Hi, thanks for the logs, they definitely help drill down further. Based on that I can see you're aiming to keep it below 450 mods/s globally, so things should be able to keep up. Instead, some servers just show pauses in the logs that get increasingly longer.
First things first, can you use the latest release: 2.5.18 at the very least; 2.6.8 or even OPENLDAP_REL_ENG_2_6 (the upcoming 2.6.9) would be better. And once you've upgraded, can you wipe the accesslog DBs all around (or check there is no duplicate sid in its minCSN)? There's a known issue where minCSNs can get wrongly populated pre-2.5.18/2.6.8 (ITS#10173).
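For reference, the sid is the third '#'-separated field of a CSN (timestamp#count#sid#mod), so a duplicate sid means two of the minCSN values on the accesslog suffix entry carry the same sid. A rough standalone illustration of that check (sample values made up):

    /* Illustration only: extract the sid field of a CSN and compare.
     * Real values come from the minCSN attribute on the accesslog suffix entry. */
    #include <stdio.h>
    #include <string.h>

    /* Copy the sid field (third '#'-separated component) into sid[]; 0 on success. */
    static int
    csn_sid( const char *csn, char *sid, size_t sidlen )
    {
        const char *p = strchr( csn, '#' );        /* after the timestamp */
        const char *q;
        size_t len;

        if ( p == NULL || ( p = strchr( p + 1, '#' ) ) == NULL )
            return -1;                             /* malformed CSN */
        p++;                                       /* start of the sid */
        q = strchr( p, '#' );
        len = q ? (size_t)( q - p ) : strlen( p );
        if ( len + 1 > sidlen )
            return -1;
        memcpy( sid, p, len );
        sid[len] = '\0';
        return 0;
    }

    int
    main( void )
    {
        /* Two hypothetical minCSN values; each should carry a distinct sid. */
        const char *mincsn[] = {
            "20241105124636.123456Z#000000#001#000000",
            "20241105124640.654321Z#000000#001#000000",  /* same sid 001: bad */
        };
        char s0[8], s1[8];

        if ( csn_sid( mincsn[0], s0, sizeof s0 ) == 0 &&
             csn_sid( mincsn[1], s1, sizeof s1 ) == 0 &&
             strcmp( s0, s1 ) == 0 )
            printf( "duplicate sid %s in minCSN values\n", s0 );
        return 0;
    }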
If the issue persists, the gaps tend to be around slap_queue_csn/slap_graduate_commit_csn. Are you able to run the test with a patched slapd, logging how many slap_csn_entry items slap_graduate_commit_csn had to crawl before it found the right one? That list shouldn't keep growing, so this is one thing I'd like to rule out.
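For illustration, something along these lines is what I mean (a sketch written from memory against the 2.6-era servers/slapd/ctxcsn.c; field and list names may need adjusting to your tree, and the original op->o_csn cleanup is omitted for brevity):

    void
    slap_graduate_commit_csn( Operation *op )
    {
        struct slap_csn_entry *csne;
        BackendDB *be;
        unsigned long crawled = 0;    /* added: entries examined so far */

        if ( op == NULL || BER_BVISNULL( &op->o_csn ) || op->o_bd == NULL )
            return;
        be = op->o_bd->bd_self;

        ldap_pvt_thread_mutex_lock( &be->be_pcl_mutex );
        LDAP_TAILQ_FOREACH( csne, be->be_pending_csn_list, ce_csn_link ) {
            crawled++;                /* added */
            if ( csne->ce_opid == op->o_opid && csne->ce_connid == op->o_connid ) {
                /* added: log how deep into the pending list the match was found */
                Debug( LDAP_DEBUG_SYNC,
                    "slap_graduate_commit_csn: matched after %lu entries\n",
                    crawled );
                LDAP_TAILQ_REMOVE( be->be_pending_csn_list, csne, ce_csn_link );
                /* ... original op->o_csn handling and csne freeing go here ... */
                break;
            }
        }
        ldap_pvt_thread_mutex_unlock( &be->be_pcl_mutex );
    }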
Also, I suspect your actual configuration is slightly different (since the configs do not acknowledge there are 4 servers replicating with each other), but hopefully there isn't much else that's different.
> We have done many performance tests with a single instance, multiple instances, mirror mode and MMR mode; our environment is supposed to hold the load without problems. We probably even oversized it. We have never exceeded our capacities, whether RAM / CPU / system load / network... The only thing we can have doubts about is our storage array, which other teams are also investigating.
I can't see evidence of I/O issues but you can always monitor iowait I guess.
Monitoring-wise, you also want to keep an eye on the contextCSNs across the cluster if you're not doing that yet (a DB is linked to its accesslog; divergence between their contextCSNs is also notable, BTW).
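If it helps, here's a minimal libldap sketch that dumps both sets of contextCSN values for comparison; the URI, the suffixes and the anonymous read are assumptions, adjust to your deployment, and build with -lldap -llber:

    #include <stdio.h>
    #include <ldap.h>

    /* Print every contextCSN value found on the given suffix entry. */
    static void
    dump_csns( LDAP *ld, const char *base )
    {
        char *attrs[] = { "contextCSN", NULL };
        LDAPMessage *res = NULL, *e;
        struct berval **vals;
        int i, rc;

        rc = ldap_search_ext_s( ld, base, LDAP_SCOPE_BASE, "(objectClass=*)",
                attrs, 0, NULL, NULL, NULL, 0, &res );
        if ( rc != LDAP_SUCCESS ) {
            fprintf( stderr, "%s: %s\n", base, ldap_err2string( rc ) );
            ldap_msgfree( res );
            return;
        }
        for ( e = ldap_first_entry( ld, res ); e != NULL; e = ldap_next_entry( ld, e ) ) {
            vals = ldap_get_values_len( ld, e, "contextCSN" );
            for ( i = 0; vals != NULL && vals[i] != NULL; i++ )
                printf( "%s: %s\n", base, vals[i]->bv_val );
            ldap_value_free_len( vals );
        }
        ldap_msgfree( res );
    }

    int
    main( void )
    {
        LDAP *ld;
        int vers = LDAP_VERSION3;

        /* URI and suffixes below are placeholders for this sketch. */
        if ( ldap_initialize( &ld, "ldap://localhost:389" ) != LDAP_SUCCESS )
            return 1;
        ldap_set_option( ld, LDAP_OPT_PROTOCOL_VERSION, &vers );
        /* anonymous read assumed; bind as needed (e.g. SASL/EXTERNAL over ldapi://) */
        dump_csns( ld, "dc=example,dc=com" );   /* main DB suffix */
        dump_csns( ld, "cn=accesslog" );        /* its accesslog suffix */
        ldap_unbind_ext_s( ld, NULL, NULL );
        return 0;
    }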
> For this case we never exceed 50% of the allocated capacities, and we have already followed all the OpenLDAP recommendations. We have a lot of statistics and monitoring for our directories. As said previously, we never reach system limits when the problem occurs. Our only current lead is that a replication conflict could be causing this problem.
Yes, I can't see an actual replication conflict in the logs (I suspect the "delta-sync lost sync" messages are down to ITS#10173), so that's fine. The servers just get slower over time, which shouldn't be happening, and I don't remember seeing anything like it before either.
Again, thanks for the logs and let us know how the interventions above pan out.
Thanks,