https://bugs.openldap.org/show_bug.cgi?id=10274
Issue ID: 10274
Summary: Replication issue on MMR configuration
Product: OpenLDAP
Version: 2.5.14
Hardware: All
OS: Linux
Status: UNCONFIRMED
Keywords: needs_review
Severity: normal
Priority: ---
Component: slapd
Assignee: bugs@openldap.org
Reporter: falgon.comp@gmail.com
Target Milestone: ---
Created attachment 1036 --> https://bugs.openldap.org/attachment.cgi?id=1036&action=edit
In this attachment you will find two OpenLDAP configurations for two instances, a SLAMD configuration example, five screenshots showing the issue, and one text file explaining what you see.
Hello, we are opening this issue as a follow-up to the initial post on the openldap-technical list: https://lists.openldap.org/hyperkitty/list/openldap-technical@openldap.org/t...
Issue: We are working on a project and we have come across a replication issue after performance testing:
*Configuration:*
RHEL 8.6
OpenLDAP 2.5.14
*MMR-delta* configuration on multiple servers (configurations attached)
300,000 users configured and used for the tests
*olcLastBind: TRUE*
Use of SLAMD (load/performance testing tool)
*Problem description:*
We are currently running performance and resilience tests on our infrastructure using the SLAMD tool configured to perform BINDs and MODs on a defined range of accounts.
We use a load balancer (VIP) to spread the load evenly across all of our servers (though it is also possible to run the performance tests directly against each directory).
With our current infrastructure we can sustain approximately 300 MOD/BIND/s. Beyond that, delays start to build up and we can randomly hit the issue described below.
When we run performance tests that exceed our write capacity, replication between the servers can randomly break, with directories unable to catch up on their replication delay.
The directories keep updating their contextCSNs, but extremely slowly (as if frozen). From then on, it is impossible for the directories to catch up again, even with no incoming traffic.
A restart of the instance is required to perform a full refresh and solve the incident.
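For illustration, this is the kind of loop we use to watch the contextCSN values while the incident is ongoing (a minimal sketch; the suffix, hostnames and credentials below are placeholders):

#!/bin/sh
# Poll the contextCSN of every server every 10 seconds so the (lack of) catch-up is visible.
BASEDN="dc=example,dc=com"               # placeholder suffix
BINDDN="cn=monitor,dc=example,dc=com"    # placeholder credentials
BINDPW="secret"
SERVERS="ldap01 ldap02 ldap03 ldap04"    # placeholder hostnames
while true; do
    for H in $SERVERS; do
        echo "=== $(date -u +%FT%TZ) $H"
        ldapsearch -LLL -o ldif-wrap=no -x -H "ldaps://$H" \
            -D "$BINDDN" -w "$BINDPW" -s base -b "$BASEDN" contextCSN |
            grep '^contextCSN:'
    done
    sleep 10
done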
We have enabled synchronization logging and see no error or refresh message indicating a problem (we can provide the logs if necessary).
We suspect a write collision or a replication conflict, but nothing of the sort is ever written in our sync logs.
We've run a lot of tests.
For example, when we run a performance test on a single live server, we don't reproduce the problem.
Other examples: if we define different account ranges for each server in SLAMD, we don't reproduce the problem either.
If we use only one account for the range, we don't reproduce the problem either.
______________________________________________________________________
I have added some screenshots to the attachment to illustrate the issue, with explanations.
______________________________________________________________________
*Symptoms:*
One or more directories can no longer replicate normally after the performance test ends. There are no apparent errors in the logs. A restart of the instances is needed to resolve the problem.
*How to reproduce the problem:*
Have at least two servers in MMR mode.
Set lastbind to TRUE.
Run a SLAMD load test through a load balancer in bandwidth mode, OR start several SLAMD tests at the same time against each server with the same account range.
Exceed the maximum write capacity of the servers.
*SLAMD configuration:*
authrate.sh --hostname ${HOSTNAME} --port ${PORTSSL} \
    --useSSL --trustStorePath ${CACERTJKS} \
    --trustStorePassword ${CACERTJKSPW} --bindDN "${BINDDN}" \
    --bindPassword ${BINDPW} --baseDN "${BASEDN}" \
    --filter "(uid=[${RANGE}])" --credentials ${USERPW} \
    --warmUpIntervals ${WARMUP} \
    --numThreads ${NTHREADS} ${ARGS}
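To reproduce, we start this same job simultaneously against every server, all with an identical account range; schematically (the hostnames and run_authrate.sh are placeholders for our internal launch script, which exports the variables used above):

# Start the authrate.sh job above against each server at the same time,
# every instance using the exact same account range.
export RANGE="1-300000"                      # placeholder; same range everywhere
for H in ldap01 ldap02 ldap03 ldap04; do     # placeholder hostnames
    HOSTNAME="$H" ./run_authrate.sh &        # hypothetical wrapper around the command above
done
wait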
Quanah Gibson-Mount quanah@openldap.org changed:
What             |Removed            |Added
----------------------------------------------------------------------------
Assignee         |bugs@openldap.org  |ondra@mistotebe.net
Keywords         |needs_review       |
Quanah Gibson-Mount quanah@openldap.org changed:
What             |Removed |Added
----------------------------------------------------------------------------
Target Milestone |---     |2.5.19
--- Comment #1 from Ondřej Kuzník ondra@mistotebe.net --- I would question the design of your experiment: it appears you have 4 write-accepting nodes and attempt to saturate each with as much write load as it will handle (global write traffic being at least 4 times what the slowest node can handle).
When you stop the test after time t, you expect the environment to have converged (considerably) earlier than start + 4*t. That is not the expected outcome; we expressly document that MPR is not a way to achieve write scaling.
--- Comment #2 from falgon.comp@gmail.com --- We designed this architecture according to our needs and for performance optimization. We tested mirror mode first, but it was not suitable for our case. For the issue we are reporting, we would like to know why there is no log indicating a problem and why our instances freeze at the replication level. The behavior we observe is not what OpenLDAP is expected to do; how can this case be explained?
--- Comment #3 from Ondřej Kuzník ondra@mistotebe.net --- On Tue, Oct 29, 2024 at 08:26:50AM +0000, openldap-its@openldap.org wrote:
> We designed this architecture according to our needs and for performance optimization. We tested mirror mode first, but it was not suitable for our case.
First, consider what I said before and size your environment accordingly. In the meantime:
> For the issue we are reporting, we would like to know why there is no log indicating a problem and why our instances freeze at the replication level. The behavior we observe is not what OpenLDAP is expected to do; how can this case be explained?
I don't think you've shown a problem. You want to collect sync logs and see whether there is one; I'd expect that instead you'll see a lot of replication traffic.
Also don't forget to monitor your environment and check that your databases are not running out of space (e.g. mdb maxsize is not reached). If your accesslog runs out of space at any point, you have silenced one source of replication (and in a way you won't be able to recover from).
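For example, something along these lines (the environment path below is just an example; point it at the accesslog database's olcDbDirectory):

# Show the map size and pages used for the accesslog MDB environment.
# /var/lib/ldap/accesslog is an assumed path, not necessarily yours.
mdb_stat -e /var/lib/ldap/accesslog
# Compare "Map size" with "Number of pages used" x "Page size" to see how full it is.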
If, once you've investigated further, you find an indication of an OpenLDAP bug, please provide relevant information (anonymized logs, ...) so that we have something to go by.
Regards,
--- Comment #4 from falgon.comp@gmail.com --- Hello,
Here are some details and clarifications on the subject. This bug ticket follows the earlier thread opened on the openldap-technical list. As indicated in the last messages there, we have the same problem for our new service, which will rely much more on MOD operations than our previous instances, so the lastbind overlay is no longer as significant a factor. Our problem is the same whether it is a BIND that writes the last authentication time or a standard MOD. The following examples of graphs and logs will therefore mainly come from tests with MODs. We have set up a Google Drive to make it easy to share the logs and screenshots referenced below: https://drive.google.com/drive/folders/1N4PWu9Eq4pUbORKVquEFTlZBRVjLK9en?usp...
> I would question the design of your experiment: it appears you have 4 write-accepting nodes and attempt to saturate each with as much write load as it will handle (global write traffic being at least 4 times what the slowest node can handle).
This is not really the case: currently one of our instances can handle roughly 300 MOD/s, and up to 450 MOD/s with latency. See the Test01 folder.
> When you stop the test after time t, you expect the environment to have converged (considerably) earlier than start + 4*t. That is not the expected outcome; we expressly document that MPR is not a way to achieve write scaling.
No, we expect some replication delay (or logs indicating a problem), but it should catch up on its own when we stop our tests, without needing to restart the instances. For an example of a test with replication delay on an instance that keeps writing and catches up on its own, see the Test02 folder.
> First, consider what I said before and size your environment accordingly. In the meantime:
We have run many performance tests with a single instance, multiple instances, mirror mode and MMR mode; our environment is supposed to handle the load without problems, and we probably even oversized it. We have never exceeded our capacity, whether RAM, CPU, system load or network. The only thing we have doubts about is our storage array, which other teams are also investigating.
> I don't think you've shown a problem. You want to collect sync logs and see whether there is one; I'd expect that instead you'll see a lot of replication traffic.
On our side we have enabled the stats + sync logs and analyzed all the types of messages returned by the directories, but we do not see any meaningful error message when the problem occurs. We have uploaded the (anonymized) logs of the 4 servers from one of our tests. We would like you to take a look and tell us if you find something interesting that we have missed. See the Logs_repl_issue and Test03 folders.
> If your accesslog runs out of space at any point, you have silenced one source of replication (and in a way you won't be able to recover from).
For this case we never exceed 50% of the allocated capacity; we have already followed all the OpenLDAP recommendations. We have extensive statistics and monitoring for our directories. As said previously, we never reach system limits when the problem occurs. Our only current lead is a replication conflict that could be causing this problem.
> If, once you've investigated further, you find an indication of an OpenLDAP bug, please provide relevant information (anonymized logs, ...) so that we have something to go by.
That is exactly the problem: the unexpected behavior of our directories is clearly established, yet we have not found any log revealing a concrete problem. If you find something on your side that explains exactly what is happening in this case, we are interested in the explanation, and in a solution if there is one. You can find graphs of this test in the Test03 folder.
If you need any further information, do not hesitate to ask. Thank you in advance for your analysis.
--- Comment #5 from Ondřej Kuzník ondra@mistotebe.net --- On Tue, Nov 05, 2024 at 12:46:36PM +0000, openldap-its@openldap.org wrote:
>> I would question the design of your experiment: it appears you have 4 write-accepting nodes and attempt to saturate each with as much write load as it will handle (global write traffic being at least 4 times what the slowest node can handle).
> This is not really the case: currently one of our instances can handle roughly 300 MOD/s, and up to 450 MOD/s with latency. See the Test01 folder.
Hi, thanks for the logs, they definitely help drill down further. Based on them I can see that you're aiming to keep it below 450 mods/s globally, so things should be able to keep up. Instead, some servers just show pauses in the logs that get increasingly longer.
First things first, can you use the latest release: 2.5.18 at the very least; 2.6.8 or even OPENLDAP_REL_ENG_2_6 (the upcoming 2.6.9) would be better. And once you've upgraded, can you wipe the accesslog DBs all around (or check there is no duplicate sid in its minCSN); there's a known issue where minCSNs can get wrongly populated pre-2.5.18/2.6.8 (ITS#10173).
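A rough way to check for a duplicate sid, as a sketch only (it assumes the accesslog suffix is cn=accesslog, that its minCSN values can be read there, and that BINDDN/BINDPW are credentials allowed to read them; adjust to your layout):

# Print the SID field (third '#'-separated component) of each minCSN value
# and report any SID that appears more than once.
ldapsearch -LLL -o ldif-wrap=no -x -H ldap://localhost \
    -D "$BINDDN" -w "$BINDPW" -s base -b "cn=accesslog" minCSN \
    | awk -F'#' '/^minCSN:/ {print $3}' | sort | uniq -d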
If the issue persists, the gaps tend to be around slap_queue_csn/slap_graduate_commit_csn. Are you able to run the test with a patched slapd, logging how many slap_csn_entry items slap_graduate_commit_csn had to crawl before it found the right one? This list shouldn't keep growing, so that's one thing I'd like to rule out.
Also, I suspect your configuration is slightly different (since the configs do not acknowledge that there are 4 servers replicating with each other), but hopefully there isn't much else that's different.
> We have run many performance tests with a single instance, multiple instances, mirror mode and MMR mode; our environment is supposed to handle the load without problems, and we probably even oversized it. We have never exceeded our capacity, whether RAM, CPU, system load or network. The only thing we have doubts about is our storage array, which other teams are also investigating.
I can't see evidence of I/O issues but you can always monitor iowait I guess.
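For instance, with standard tools while a test is running:

# Sample CPU iowait and per-device latency every 5 seconds (pick either).
vmstat 5        # the "wa" column is the CPU iowait percentage
iostat -x 5     # per-device %util and await (sysstat package)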
Monitoring-wise, you also want to keep an eye on the contextCSNs of the cluster if you're not doing that yet (a DB is linked to its accesslog, divergence between their contextCSNs is also notable BTW).
> For this case we never exceed 50% of the allocated capacity; we have already followed all the OpenLDAP recommendations. We have extensive statistics and monitoring for our directories. As said previously, we never reach system limits when the problem occurs. Our only current lead is a replication conflict that could be causing this problem.
Yes, I can't see an actual replication conflict in the logs (I suspect the "delta-sync lost sync" messages are down to ITS#10173), so that's fine. The servers just get slower over time, which shouldn't be happening, and I don't remember seeing anything like it before either.
Again, thanks for the logs and let us know how the interventions above pan out.
Thanks,
--- Comment #6 from falgon.comp@gmail.com --- Hello, Thank you very much for your interesting feedback and the time spent.
> First things first, can you use the latest release: 2.5.18 at the very least; 2.6.8 or even OPENLDAP_REL_ENG_2_6 (the upcoming 2.6.9) would be better. And once you've upgraded, can you wipe the accesslog DBs all around (or check there is no duplicate sid in its minCSN); there's a known issue where minCSNs can get wrongly populated pre-2.5.18/2.6.8 (ITS#10173).
We were able to test version 2.5.18 (all DBs wiped), but unfortunately we reproduced the problem. As for version 2.6+, we are currently unable to test it; we may be able to do so around mid-2025.
> If the issue persists, the gaps tend to be around slap_queue_csn/slap_graduate_commit_csn. Are you able to run the test with a patched slapd, logging how many slap_csn_entry items slap_graduate_commit_csn had to crawl before it found the right one? This list shouldn't keep growing, so that's one thing I'd like to rule out.
We are working on this so we can get a precise view of it and provide you with additional clean and useful logs.
> Also, I suspect your configuration is slightly different (since the configs do not acknowledge that there are 4 servers replicating with each other), but hopefully there isn't much else that's different.
We have broadly the same configuration as the one initially provided (we were able to reproduce the problem with that one).
> I can't see evidence of I/O issues but you can always monitor iowait I guess.
We monitor this part as well; we have a lot of different metrics we can provide if you want or need them.
> Monitoring-wise, you also want to keep an eye on the contextCSNs of the cluster if you're not doing that yet (a DB is linked to its accesslog, divergence between their contextCSNs is also notable BTW).
Yes, we also have scripts at our disposal that check the contextCSN of each server for replication delay and alert us in the event of a problem.
> Yes, I can't see an actual replication conflict in the logs (I suspect the "delta-sync lost sync" messages are down to ITS#10173), so that's fine. The servers just get slower over time, which shouldn't be happening, and I don't remember seeing anything like it before either.
If you need new logs with the 2.5.18 version we can provide them to you.
Unfortunately the problem persists; we are still running tests to try to find its origin. We had already done checks on the virtualization, network and storage side, but we have reopened discussions with those teams to dig further into this issue and look for new leads. One detail that bothers us is that we managed to reproduce the problem on a local VM with no network traffic involved.
Have you been able to try to reproduce the problem? We are interested in any ideas you may have.
Once again thank you for your time and involvement.