https://bugs.openldap.org/show_bug.cgi?id=10274
--- Comment #4 from falgon.comp@gmail.com ---
Hello,
Here are some details and clarifications on the subject. This ticket follows up on the previous ticket opened in the technical section. As indicated in the last messages, we see the same problem with our new service, which will rely much more on MOD operations than our previous instances did, so the lastbind overlay is no longer the main factor. The problem is the same whether the write is a BIND that records the last authentication time or a standard MOD, so the graphs and logs in the examples below come mainly from MOD tests. We have set up a Google Drive to share the logs and screenshots referenced below: https://drive.google.com/drive/folders/1N4PWu9Eq4pUbORKVquEFTlZBRVjLK9en?usp...
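For context, the MOD traffic in our tests is generated with plain ldapmodify operations of roughly this shape (the suffix, entry, attribute and host below are placeholders, not our real data):

    # mod.ldif -- replace a single attribute on an existing entry
    dn: uid=user00001,ou=people,dc=example,dc=com
    changetype: modify
    replace: description
    description: load-test value

    ldapmodify -H ldap://ldap01.example.com -x \
      -D "cn=admin,dc=example,dc=com" -W -f mod.ldif

Each test client loops over such modifications against one of the four providers.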
> I would question the design of your experiment: it appears you have 4 write-accepting nodes and attempt to saturate each with as much write load as it will handle (global write traffic being at least 4 times what the slowest node can handle).
This is not really the case: currently a single one of our instances can sustain 300 MOD/s, and up to 450 MOD/s with increased latency. See the Test01 folder.
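(As a cross-check of those rates, one way is to sample the operation counters of the monitor backend; the host below is a placeholder, and this assumes back-monitor is enabled and readable by the client:)

    ldapsearch -LLL -H ldap://ldap01.example.com -x \
      -b "cn=Modify,cn=Operations,cn=Monitor" -s base monitorOpCompleted

Sampling monitorOpCompleted at two points in time and dividing the difference by the interval gives the completed MOD rate on that server.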
> When you stop the test after time t, you expect the environment to replicate (considerably) earlier than start+4*t. This is not the expected outcome; we expressly document that MPR is not a way to achieve write scaling.
No. We expect some replication delay (or logs indicating a problem), but the environment should catch up on its own once we stop our tests, without needing to restart the instances. An example of a test with replication delay on an instance that keeps writing and then catches up on its own is in the Test02 folder.
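The catch-up (or lack of it) is easy to observe by comparing the contextCSN values of the suffix entry on each provider; with MPR each serverID contributes its own contextCSN value, and once replication has converged all four servers should report the same set. A sketch (host names and suffix are placeholders, and anonymous read access to contextCSN is assumed; add -D/-W otherwise):

    for h in ldap01 ldap02 ldap03 ldap04; do
      echo "== $h"
      ldapsearch -LLL -H ldap://$h.example.com -x \
        -b "dc=example,dc=com" -s base contextCSN
    done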
> First, consider what I said before and size your environment accordingly. In the meantime:
We have run many performance tests with a single instance, multiple instances, mirror mode and MMR mode; our environment should hold this load without problem, and we probably even oversized it. We have never exceeded our capacity in terms of RAM, CPU, system load or network. The only thing we have doubts about is our storage array, which other teams are also investigating.
> I don't think you've shown a problem. You want to collect sync logs and see if there is one, I'd expect that instead you'll see a lot of replication traffic.
On our side we have enabled the stats + sync log levels and analyzed all the types of messages returned by the directories, but we do not see any telling error message when the problem occurs. We have uploaded the (anonymized) logs of the 4 servers from one of our tests. We would appreciate it if you could take a look and tell us whether you find something interesting that we have missed. See the Logs_repl_issue and Test03 folders.
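(For completeness, the log level was set roughly as follows, shown here in cn=config form even though the exact mechanism may differ per server:)

    dn: cn=config
    changetype: modify
    replace: olcLogLevel
    olcLogLevel: stats sync

We then searched the syncrepl output for the warning patterns we are aware of, for example (this list is certainly not exhaustive):

    grep -E "do_syncrep2|syncrepl_entry" slapd.log \
      | grep -Ei "error|fail|too old|switching to REFRESH"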
> If your accesslog runs out of space at any point, you have silenced one source of replication (and in a way you won't be able to recover from).
In this case we never exceed 50% of the allocated capacity; we have already followed all the OpenLDAP recommendations. We have extensive statistics and monitoring for our directories, and as said previously we never hit system limits when the problem occurs. Our only current lead is a replication conflict that could cause this behaviour.
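(For reference, the accesslog databases are purged regularly; the snippet below shows the kind of purge configuration we mean, using the example values from slapo-accesslog(5) rather than our exact settings:)

    overlay accesslog
    logdb cn=accesslog
    logops writes
    logsuccess TRUE
    # keep 7 days of log entries, purge once per day
    logpurge 07+00:00 01+00:00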
> If, once you've investigated further, you find an indication of an OpenLDAP bug, please provide relevant information (anonymized logs, ...) so that we have something to go by.
That is exactly the problem: we keep coming back to this potential issue, which is proven on the behaviour side (the directory behaves unexpectedly), yet we have not found any logs revealing a concrete cause. If you find something on your side that explains what is happening in this case, we are very interested in the explanation, and in a solution if there is one. You can find graphs of this test in the Test03 folder.
If you need any further information, do not hesitate to ask. Thank you in advance for your analysis.