https://bugs.openldap.org/show_bug.cgi?id=10274
--- Comment #4 from falgon.comp@gmail.com ---
Hello,
Here are some details and clarifications on the subject. This ticket follows up on the previous ticket opened in the technical section. As indicated in the last messages, we see the same problem with our new service, which will rely much more on MOD operations than our previous instances did, so the lastbind overlay is no longer the main factor. The problem is the same whether the write is a BIND that records the last authentication time or a standard MOD, so the graphs and logs in the examples below come mainly from MOD tests. We have set up a Google Drive to share the logs and screenshots referenced below: https://drive.google.com/drive/folders/1N4PWu9Eq4pUbORKVquEFTlZBRVjLK9en?usp...
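For context, the MOD traffic in our tests is generated with plain ldapmodify operations of roughly this shape (the suffix, entry, attribute and host below are placeholders, not our real data):

    # mod.ldif -- replace a single attribute on an existing entry
    dn: uid=user00001,ou=people,dc=example,dc=com
    changetype: modify
    replace: description
    description: load-test value

    ldapmodify -H ldap://ldap01.example.com -x \
      -D "cn=admin,dc=example,dc=com" -W -f mod.ldif

Each test client loops over such modifications against one of the four providers.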
> I would question the design of your experiment: it appears you have 4 write-accepting nodes and attempt to saturate each with as much write load as it will handle (global write traffic being at least 4 times what the slowest node can handle).
This is not really the case: currently a single one of our instances can sustain 300 MOD/s, and up to 450 MOD/s with increased latency. See the Test01 folder.
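(As a cross-check of those rates, one way is to sample the operation counters of the monitor backend; the host below is a placeholder, and this assumes back-monitor is enabled and readable by the client:)

    ldapsearch -LLL -H ldap://ldap01.example.com -x \
      -b "cn=Modify,cn=Operations,cn=Monitor" -s base monitorOpCompleted

Sampling monitorOpCompleted at two points in time and dividing the difference by the interval gives the completed MOD rate on that server.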
> When you stop the test after time t, you expect the environment to replicate (considerably) earlier than start+4*t. This is not the expected outcome; we expressly document that MPR is not a way to achieve write scaling.
No. We expect some replication delay (or logs indicating a problem), but the environment should catch up on its own once we stop our tests, without needing to restart the instances. An example of a test with replication delay on an instance that keeps writing and then catches up on its own is in the Test02 folder.
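The catch-up (or lack of it) is easy to observe by comparing the contextCSN values of the suffix entry on each provider; with MPR each serverID contributes its own contextCSN value, and once replication has converged all four servers should report the same set. A sketch (host names and suffix are placeholders, and anonymous read access to contextCSN is assumed; add -D/-W otherwise):

    for h in ldap01 ldap02 ldap03 ldap04; do
      echo "== $h"
      ldapsearch -LLL -H ldap://$h.example.com -x \
        -b "dc=example,dc=com" -s base contextCSN
    done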
> First, consider what I said before and size your environment accordingly. In the meantime:
We have run many performance tests with a single instance, multiple instances, mirror mode and MMR mode; our environment should hold this load without problem, and we probably even oversized it. We have never exceeded our capacity in terms of RAM, CPU, system load or network. The only thing we have doubts about is our storage array, which other teams are also investigating.
> I don't think you've shown a problem. You want to collect sync logs and see if there is one, I'd expect that instead you'll see a lot of replication traffic.
On our side we have enabled the stats + sync log levels and analyzed all the types of messages returned by the directories, but we do not see any telling error message when the problem occurs. We have uploaded the (anonymized) logs of the 4 servers from one of our tests. We would appreciate it if you could take a look and tell us whether you find something interesting that we have missed. See the Logs_repl_issue and Test03 folders.
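(For completeness, the log level was set roughly as follows, shown here in cn=config form even though the exact mechanism may differ per server:)

    dn: cn=config
    changetype: modify
    replace: olcLogLevel
    olcLogLevel: stats sync

We then searched the syncrepl output for the warning patterns we are aware of, for example (this list is certainly not exhaustive):

    grep -E "do_syncrep2|syncrepl_entry" slapd.log \
      | grep -Ei "error|fail|too old|switching to REFRESH"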
> If your accesslog runs out of space at any point, you have silenced one source of replication (and in a way you won't be able to recover from).
In this case we never exceed 50% of the allocated capacity; we have already followed all the OpenLDAP recommendations. We have extensive statistics and monitoring for our directories, and as said previously we never hit system limits when the problem occurs. Our only current lead is a replication conflict that could cause this behaviour.
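(For reference, the accesslog databases are purged regularly; the snippet below shows the kind of purge configuration we mean, using the example values from slapo-accesslog(5) rather than our exact settings:)

    overlay accesslog
    logdb cn=accesslog
    logops writes
    logsuccess TRUE
    # keep 7 days of log entries, purge once per day
    logpurge 07+00:00 01+00:00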
> If, once you've investigated further, you find an indication of an OpenLDAP bug, please provide relevant information (anonymized logs, ...) so that we have something to go by.
That is exactly the problem: we keep coming back to this potential issue, which is proven on the behaviour side (the directory behaves unexpectedly), yet we have not found any logs revealing a concrete cause. If you find something on your side that explains what is happening in this case, we are very interested in the explanation, and in a solution if there is one. You can find graphs of this test in the Test03 folder.
If you need any further information, do not hesitate to ask. Thank you in advance for your analysis.