https://bugs.openldap.org/show_bug.cgi?id=10274
--- Comment #6 from falgon.comp@gmail.com --- Hello, thank you very much for your detailed feedback and the time you have spent on this.
> First things first, can you use the latest: 2.5.18 at the very least, 2.6.8 or even OPENLDAP_REL_ENG_2_6 (the upcoming 2.6.9) would be better. And once you've upgraded, can you wipe the accesslog DBs all around (or check there is no duplicate sid in its minCSN), there's a known issue where minCSNs can get wrongly populated pre-2.5.18/2.6.8 (ITS#10173).
We were able to test version 2.5.18 (with all DBs wiped clean), but unfortunately we reproduced the problem. We are currently unable to test the 2.6 series; we may be able to do so around mid-2025.
> If the issue persists, the gaps tend to be around slap_queue_csn/slap_graduate_commit_csn. Are you able to run the test with a patched slapd, logging how many slap_csn_entry items slap_graduate_commit_csn had to crawl before it found the right one? This list shouldn't keep growing so that's one thing I'd like to rule out.
We are working on this so that we can get a precise view of it and provide you with further clean and useful logs.
> Also I suspect your configuration is slightly different (since the configs do not acknowledge there are 4 servers replicating with each other) but hopefully there isn't much else that's different.
We have globally the same configuration as the one initially provided (we were able to reproduce the problem with it).
> I can't see evidence of I/O issues but you can always monitor iowait I guess.
We have visibility on this part as well; we have many different metrics we can provide if you want/need them.
> Monitoring-wise, you also want to keep an eye on the contextCSNs of the cluster if you're not doing that yet (a DB is linked to its accesslog, divergence between their contextCSNs is also notable BTW).
Yes, we also have scripts at our disposal that check the contextCSN of each server to detect replication delay and alert us in the event of a problem.
> Yes, I can't see an actual replication conflict in the logs (I suspect the "delta-sync lost sync" messages are down to ITS#10173), so that's fine, the servers just get slower over time which shouldn't be happening and I don't remember seeing anything like it before either.
If you need new logs from version 2.5.18, we can provide them.
Unfortunately the problem persists; we are still running tests to try to find its origin. We had already checked the virtualization, network, and storage layers, but we have reopened discussions with those teams to dig further into this issue and find new leads. The detail that is problematic for us is that we had managed to reproduce the problem on a local VM with no network traffic involved.
I don't know whether you were able to try to reproduce the problem? We are interested in any ideas you may have.
Once again thank you for your time and involvement.