[Issue 10274] New: Replication issue on MMR configuration

List overview All Threads
Download

newer

older

[Issue 10342] New: Potential...

[Issue 10308] New: Implement...

openldap-its＠openldap.org

22 Oct 2024 22 Oct '24

6:30 a.m.

https://bugs.openldap.org/show_bug.cgi?id=10274

Issue ID: 10274 Summary: Replication issue on MMR configuration Product: OpenLDAP Version: 2.5.14 Hardware: All OS: Linux Status: UNCONFIRMED Keywords: needs_review Severity: normal Priority: --- Component: slapd Assignee: bugs@openldap.org Reporter: falgon.comp@gmail.com Target Milestone: ---

Created attachment 1036 --> https://bugs.openldap.org/attachment.cgi?id=1036&action=edit In this attachment you will find 2 openldap configurations for 2 instances + slamd conf exemple + 5 screenshots to show the issue and one text file to explain what you see

Hello we are openning this issue further to the initial post in technical : https://lists.openldap.org/hyperkitty/list/openldap-technical@openldap.org/t...

Issue : We are working on a project and we've come across an issue with the replication after performance testing :

*Configuration :*

RHEL 8.6 OpenLDAP 2.5.14 *MMR-delta *configuration on multiple servers attached 300,000 users configured and used for tests *olcLastBind: TRUE*

Use of SLAMD (performance shooting)

*Problem description:*

We are currently running performance and resilience tests on our infrastructure using the SLAMD tool configured to perform BINDs and MODs on a defined range of accounts.

We use a load balancer (VIP) to poll all of our servers equally. (but it is possible to do performance tests directly on each of the directories)

With our current infrastructure we're able to perform approximately 300 MOD/BIND/s. Beyond that, we start to generate delays and can randomly come across one issue.

However, when we run performance tests that exceed our write capacity, our replication between servers can randomly create an incident with directories being unable to catch up with their replication delay.

The directories update their contextCSNs, but extremely slowly (like freezing). From then on, it's impossible for the directories to catch again. (even with no incoming traffic)

A restart of the instance is required to perform a full refresh and solve the incident.

We have enabled synchronization logs and have no error or refresh logs to indicate a problem ( we can provide you with logs if necessary).

We suspect a write collision or a replication conflict but this is never write in our sync logs.

We've run a lot of tests.

For example, when we run a performance test on a single live server, we don't reproduce the problem.

Anothers examples: if we define different accounts ranges for each server with SALMD, we don't reproduce the problem either.

If we use only one account for the range, we don't reproduce the problem either.

______________________________________________________________________ I have add some screenshots on attachement to show you the issue and all the explanations. ______________________________________________________________________

*Symptoms :*

One or more directories can no longer be replicated normally after performance testing ends. No apparent error logs. Need a restart of instances to solve the problem.

*How to reproduce the problem:*

Have at least two servers in MMR mode Set LastBind to TRUE Perform a SLAMD shot from a LoadBalancer in bandwidth mode OR start multiple SLAMD test on same time for each server with the same account range. Exceed the maximum write capacity of the servers.

*SLAMD configuration :*

authrate.sh --hostname ${HOSTNAME} --port ${PORTSSL} \ --useSSL --trustStorePath ${CACERTJKS} \ --trustStorePassword ${CACERTJKSPW} --bindDN "${BINDDN}" \ --bindPassword ${BINDPW} --baseDN "${BASEDN}" \ --filter "(uid=[${RANGE}])" --credentials ${USERPW} \ --warmUpIntervals ${WARMUP} \ --numThreads ${NTHREADS} ${ARGS}

-- You are receiving this mail because: You are on the CC list for the issue.

Show replies by date

openldap-its＠openldap.org

22 Oct 22 Oct

8:36 a.m.

New subject: [Issue 10274] Replication issue on MMR configuration

https://bugs.openldap.org/show_bug.cgi?id=10274

Quanah Gibson-Mount quanah@openldap.org changed:

-- You are receiving this mail because: You are on the CC list for the issue.

openldap-its＠openldap.org

8:36 a.m.

New subject: [Issue 10274] Replication issue on MMR configuration

https://bugs.openldap.org/show_bug.cgi?id=10274

Quanah Gibson-Mount quanah@openldap.org changed:

What |Removed |Added ---------------------------------------------------------------------------- Target Milestone|--- |2.5.19

-- You are receiving this mail because: You are on the CC list for the issue.

openldap-its＠openldap.org

23 Oct 23 Oct

2:56 a.m.

New subject: [Issue 10274] Replication issue on MMR configuration

https://bugs.openldap.org/show_bug.cgi?id=10274

--- Comment #1 from Ondřej Kuzník ondra@mistotebe.net --- I would question the design of your experiment: it appears you have 4 write-accepting nodes and attempt to saturate each with as much write load as they will handle (global write traffic being at least 4 times of what the slowest node can handle).

When you stop the test after time t, you expect the environment to replicate (considerably) earlier than start+4*t. This is not the expected outcome, we expressly document that MPR is not a way to achieve write scaling.

-- You are receiving this mail because: You are on the CC list for the issue.

openldap-its＠openldap.org

29 Oct 29 Oct

1:26 a.m.

New subject: [Issue 10274] Replication issue on MMR configuration

https://bugs.openldap.org/show_bug.cgi?id=10274

--- Comment #2 from falgon.comp@gmail.com --- We implemented this architecture according to our needs and performance optimization. We tested the mirror mode first but it was not suitable for our case. For the issue we are reporting, we would like to know why we have no log announcing a problem and why our instances freeze at the replication level. Our observation is not a behavior expected by OpenLDAP, how can we explain this case?

-- You are receiving this mail because: You are on the CC list for the issue.

openldap-its＠openldap.org

2:19 a.m.

New subject: [Issue 10274] Replication issue on MMR configuration

https://bugs.openldap.org/show_bug.cgi?id=10274

--- Comment #3 from Ondřej Kuzník ondra@mistotebe.net --- On Tue, Oct 29, 2024 at 08:26:50AM +0000, openldap-its@openldap.org wrote:

...

We implemented this architecture according to our needs and performance optimization. We tested the mirror mode first but it was not suitable for our case.

First, consider the what I said before and size your environment accordingly. In the meantime:

...

For the issue we are reporting, we would like to know why we have no log announcing a problem and why our instances freeze at the replication level. Our observation is not a behavior expected by OpenLDAP, how can we explain this case?

I don't think you've shown a problem. You want to collect sync logs and see if there is one, I'd expect that instead you'll see a lot of replication traffic.

Also don't forget to monitor your environment and check that your databases are not running out of space (e.g. mdb maxsize is not reached). If your accesslog runs out of space at any point, you have silenced one source of replication (and in a way you won't be able to recover from).

If, once you've investigated further, you find an indication of an OpenLDAP bug, please provide relevant information (anonymized logs, ...) so that we have something to go by.

Regards,

-- You are receiving this mail because: You are on the CC list for the issue.

openldap-its＠openldap.org

5 Nov 5 Nov

4:46 a.m.

New subject: [Issue 10274] Replication issue on MMR configuration

https://bugs.openldap.org/show_bug.cgi?id=10274

--- Comment #4 from falgon.comp@gmail.com --- Hello,

Here are some details and clarifications on the subject. This bug ticket follows the previous ticket opened in the technical section. As indicated in the last messages, we have the same problem for our new service which will be based much more on MOD operations than our previous instances. The lastbind overlay is therefore no longer as impactful. Our problem is the same whether it is BIND with writing of the last authentication or a standard MOD. For the next examples of graphs and logs, it will therefore mainly be tests with MODs. We have set up a gdrive in order to easily share the logs and screenshots referenced below : https://drive.google.com/drive/folders/1N4PWu9Eq4pUbORKVquEFTlZBRVjLK9en?usp...

...

I would question the design of your experiment: it appears you have 4 write accepting nodes and attempt to saturate each with as much write load as they will handle (global write traffic being at least 4 times of what the slowest node can handle).

This is not really the case, currently one of our instances can manage to perform 300MOD/s up to 450 MOD/s with latencies. See Test01 folder

...

When you stop the test after time t, you expect the environment to replicate (considerably) earlier than start+4*t. This is not the expected outcome, we expressly document that MPR is not a way to achieve write scaling.

No, we expect to have some replication delay (or logs indicating a problem) but it should catch up on its own when we stop our tests without needing to restart the instances. Example of a test with replication delay on an instance that continues to write and catches up on its own See Test02 folder

...

First, consider the what I said before and size your environment accordingly. In the meantime:

We have done many performance tests with single instance, multiple instances, mirror mode and MMR mode, our environment is supposed to hold the load without problem. We probably even oversized it We have never exceeded our capacities whether it is RAM / CPU / system load / network ... The only thing we can have doubts about is our storage array for which other teams are also investigating this case.

...

I don't think you've shown a problem. You want to collect sync logs and see if there is one, I'd expect that instead you'll see a lot of replication traffic.

On our side we have activated the stats logs + sync and we have analyzed all the types of messages returned by the directories but we do not see any error message speaking when the problem occurs. We have deposited the logs of the 4 servers during one of our tests that we have anonymized. We would like you to take a look at it and tell us if you find something interesting that we have missed. See Logs_repl_issue + Test03 folders

...

If your accesslog runs out of space at any point, you have silenced one source of replication (and in a way you won't be able to recover from).

For this case we never exceed the 50% of allocated capacities, we have already followed all the OpenLDAP recommendations. We have a lot of statistics and supervision for our directories. As said previously we never reach system limits when the problem occurs. our only current track is the replication conflict that would cause this problem.

...

If, once you've investigated further, you find an indication of an OpenLDAP bug, please provide relevant information (anonymized logs, ...) so that we have something to go by.

This is the problem, we are going back to this potential problem which is proven on the behavior side of our directory we have an unexpected behavior. However we have not found any logs revealing a concrete problem. If you find something interesting on your side that can explain exactly what is happening in this case we are interested in the explanation and if there is a solution. You can find graphs of this test on Test03 folder

If you need any further information, do not hesitate to ask. Thank you in advance for your analysis.

-- You are receiving this mail because: You are on the CC list for the issue.

openldap-its＠openldap.org

8 Nov 8 Nov

2:48 a.m.

New subject: [Issue 10274] Replication issue on MMR configuration

https://bugs.openldap.org/show_bug.cgi?id=10274

--- Comment #5 from Ondřej Kuzník ondra@mistotebe.net --- On Tue, Nov 05, 2024 at 12:46:36PM +0000, openldap-its@openldap.org wrote:

...

...
I would question the design of your experiment: it appears you have 4 write accepting nodes and attempt to saturate each with as much write load as they will handle (global write traffic being at least 4 times of what the slowest node can handle).

This is not really the case, currently one of our instances can manage to perform 300MOD/s up to 450 MOD/s with latencies. See Test01 folder

Hi, thanks for the logs, they definitely help drill down further. Based on that I can see that you're aiming to keep it below the 450 mods/s globally so things should be able to keep up. Instead, some servers just observe these pauses in the logs that are getting increasingly longer.

First things first, can you use the latest: 2.5.18 at the very least, 2.6.8 or even OPENLDAP_REL_ENG_2_6 (the upcoming 2.6.9) would be better. And once you've upgraded, can you wipe the accesslog DBs all around (or check there is no duplicate sid in its minCSN), there's a known issue where minCSNs can get wrongly populated pre-2.5.18/2.6.8 (ITS#10173).

If the issue persists, the gaps tend to be around slap_queue_csn/ slap_graduate_commit_csn. Are you able to run the test with a patched slapd, logging how many slap_csn_entry items slap_graduate_commit_csn had to crawl before it found the right one? This list shouldn't keep growing so that's one thing I'd like to rule out.

Also I suspect your configuration is slightly different (since the configs do not acknowgledge there are 4 servers replicating with each other) but hopefully there isn't much else that's different.

...

We have done many performance tests with single instance, multiple instances, mirror mode and MMR mode, our environment is supposed to hold the load without problem. We probably even oversized it We have never exceeded our capacities whether it is RAM / CPU / system load / network ... The only thing we can have doubts about is our storage array for which other teams are also investigating this case.

I can't see evidence of I/O issues but you can always monitor iowait I guess.

Monitoring-wise, you also want to keep an eye on the contextCSNs of the cluster if you're not doing that yet (a DB is linked to its accesslog, divergence between their contextCSNs is also notable BTW).

...

For this case we never exceed the 50% of allocated capacities, we have already followed all the OpenLDAP recommendations. We have a lot of statistics and supervision for our directories. As said previously we never reach system limits when the problem occurs. our only current track is the replication conflict that would cause this problem.

Yes, I can't see an actual replication conflict in the logs (I suspect the "delta-sync lost sync" messages are down to ITS#10173), so that's fine, the servers just get slower over time which shouldn't be happening and I don't remember seeing anything like it before either.

Again, thanks for the logs and let us know how the interventions above pan out.

Thanks,

-- You are receiving this mail because: You are on the CC list for the issue.

openldap-its＠openldap.org

20 Nov 20 Nov

7:35 a.m.

New subject: [Issue 10274] Replication issue on MMR configuration

https://bugs.openldap.org/show_bug.cgi?id=10274

--- Comment #6 from falgon.comp@gmail.com --- Hello, Thank you very much for your interesting feedback and the time spent.

...

First things first, can you use the latest: 2.5.18 at the very least, 2.6.8 or even OPENLDAP_REL_ENG_2_6 (the upcoming 2.6.9) would be better. And once you've upgraded, can you wipe the accesslog DBs all around (or check there is no duplicate sid in its minCSN), there's a known issue where minCSNs can get wrongly populated pre-2.5.18/2.6.8 (ITS#10173).

We were able to test version 2.5.18 (all DB cleans) but unfortunately we reproduced the problem. For version 2.6+ we are currently unable to test this version. We may be able to do it half of 2025.

...

If the issue persists, the gaps tend to be around slap_queue_csn/ slap_graduate_commit_csn. Are you able to run the test with a patched slapd, logging how many slap_csn_entry items slap_graduate_commit_csn had to crawl before it found the right one? This list shouldn't keep growing so that's one thing I'd like to rule out.

We are working on this to be able to have a precise view on it and provide you other clean and interesting logs.

...

Also I suspect your configuration is slightly different (since the configs do not acknowgledge there are 4 servers replicating with each other) but hopefully there isn't much else that's different.

We have globaly the same configuration as the one initially provided (We were able to reproduce the problem with this one).

...

I can't see evidence of I/O issues but you can always monitor iowait I guess.

We have a visual on this part as well, we have a lot of different metrics we can provide you if you want/need.

...

Monitoring-wise, you also want to keep an eye on the contextCSNs of the cluster if you're not doing that yet (a DB is linked to its accesslog, divergence between their contextCSNs is also notable BTW).

Yes, we also have scripts at our disposal that check the contextCSN of each server to check for a replication delay and alert us in the event of a problem.

...

Yes, I can't see an actual replication conflict in the logs (I suspect the "delta-sync lost sync" messages are down to ITS#10173), so that's fine, the servers just get slower over time which shouldn't be happening and I don't remember seeing anything like it before either.

If you need new logs with the 2.5.18 version we can provide them to you.

Unfortunately the problem persists, we are still doing tests on it to try to find the origin of the problem. We had already done checks on the Virtualization + network + storage side, but we have reopened discussions with them to try to go further on this issue and find/tracks on the problem. The detail that is problematic for us is that we had managed to reproduce it on a local VM without network transactions.

I don't know if you were able to try to reproduce the problem ? We are interested in all the ideas you may have.

Once again thank you for your time and involvement.

-- You are receiving this mail because: You are on the CC list for the issue.

openldap-its＠openldap.org

26 Nov 26 Nov

8:45 a.m.

New subject: [Issue 10274] Replication issue on MMR configuration

https://bugs.openldap.org/show_bug.cgi?id=10274

Quanah Gibson-Mount quanah@openldap.org changed:

What |Removed |Added ---------------------------------------------------------------------------- Target Milestone|2.5.19 |2.5.20

-- You are receiving this mail because: You are on the CC list for the issue.

openldap-its＠openldap.org

9 Dec 9 Dec

5:14 a.m.

New subject: [Issue 10274] Replication issue on MMR configuration

https://bugs.openldap.org/show_bug.cgi?id=10274

--- Comment #7 from falgon.comp@gmail.com --- Hello,

In addition to my previous comment, we are wondering about the "run the test with a patched slapd" that you mentioned.

...

If the issue persists, the gaps tend to be around slap_queue_csn/ slap_graduate_commit_csn. Are you able to run the test with a patched slapd, logging how many slap_csn_entry items slap_graduate_commit_csn had to crawl before it found the right one? This list shouldn't keep growing so that's one thing I'd like to rule out.

Do you already have a patched version to achieve this? If not, what are you referring to exactly ? Customize the code to perform the requested count? Thank's

-- You are receiving this mail because: You are on the CC list for the issue.

openldap-its＠openldap.org

9 Jan 9 Jan

4:19 a.m.

New subject: [Issue 10274] Replication issue on MMR configuration

https://bugs.openldap.org/show_bug.cgi?id=10274

--- Comment #8 from Ondřej Kuzník ondra@mistotebe.net --- On Mon, Dec 09, 2024 at 01:14:24PM +0000, openldap-its@openldap.org wrote:

...

Hello,

In addition to my previous comment, we are wondering about the "run the test with a patched slapd" that you mentioned.

...
If the issue persists, the gaps tend to be around slap_queue_csn/ slap_graduate_commit_csn. Are you able to run the test with a patched slapd, logging how many slap_csn_entry items slap_graduate_commit_csn had to crawl before it found the right one? This list shouldn't keep growing so that's one thing I'd like to rule out.

Do you already have a patched version to achieve this? If not, what are you referring to exactly ? Customize the code to perform the requested count? Thank's

Hi, sorry for not coming back you sooner.

Yes, count how many items slap_graduate_commit_csn[0] had to walk before it found what it wanted and maybe log it alongside the rest on line #122. I don't have a specific patch but if you are struggling to prepare one I can help.

If logs indicate this count indeed starts to grow out of hand, there might be enough info in there as to when/why that started. But walking the first few entries in gdb and looking at the CSNs in there might also be needed to cross-reference.

[0]. https://git.openldap.org/openldap/openldap/-/blob/master/servers/slapd/ctxcs...

Thanks,

-- You are receiving this mail because: You are on the CC list for the issue.

openldap-its＠openldap.org

11 Feb 11 Feb

1:06 a.m.

New subject: [Issue 10274] Replication issue on MMR configuration

https://bugs.openldap.org/show_bug.cgi?id=10274

--- Comment #9 from falgon.comp@gmail.com --- Hello, No problem same for us, thank's for your reply,

For your recommendation, we haven't implemented custom code to count the number of iterations of slap_graduate_commit_csn. Instead, we've made a script that analyzes all replication logs and the verification of each CSN after our test. We have find interesting things, Here are a few observations for the test we have share before :

If we Count the occurence of CSN for all servers we have : 176363 20241104132232.937968Z#000000#002#000000 143023 20241104132232.374890Z#000000#001#000000 69302 20241104130859.571000Z#000000#003#000000 69294 20241104130849.482824Z#000000#004#000000 23087 20240919072405.196064Z#000000#001#000000 11409 20241104121608.755579Z#000000#001#000000 7960 20241104121927.229914Z#000000#002#000000 921 20241104133528.749142Z#000000#001#000000 483 20241104133656.538323Z#000000#001#000000 ...

If we look up to 20241104130859.571000Z#000000#003#000000

Number of iterations by server server1.log:678 server2.log:46176 server3.log:11291 server4.log:11157

And here are the first CSN logs for each server

server1 : 2024-11-04T14:24:10.085660+01:00 server1 slapd-2.5-user[2432501]: conn=1064 op=1 syncprov_op_search: got a persistent search with a cookie=rid=001,sid=003,csn=20241104132232.374890Z#000000#001#000000;20241104132232.937968Z#000000#002#000000;20241104130859.571000Z#000000#003#000000;20241104130849.482824Z#000000#004#000000 2024-11-04T14:24:10.159051+01:00 server1 slapd-2.5-user[2432501]: conn=1066 op=1 syncprov_op_search: got a persistent search with a cookie=rid=001,sid=004,csn=20241104132232.374890Z#000000#001#000000;20241104132232.937968Z#000000#002#000000;20241104130859.571000Z#000000#003#000000;20241104130849.482824Z#000000#004#000000 2024-11-04T14:24:49.792914+01:00 server1 slapd-2.5-user[2432501]: conn=-1 op=0 syncprov_add_slog: adding csn=20241104130859.571000Z#000000#003#000000 to sessionlog, uuid=e800557e-2a1b-4f5c-8ff5-273fef541eb0 2024-11-04T14:32:14.833713+01:00 server1 slapd-2.5-user[2432501]: do_syncrep2: rid=004 cookie=rid=004,sid=004,csn=20241104132232.374890Z#000000#001#000000;20241104132232.937968Z#000000#002#000000;20241104130859.571000Z#000000#003#000000;20241104130849.482824Z#000000#004#000000 2024-11-04T14:34:15.116600+01:00 server1 slapd-2.5-user[2432501]: do_syncrep1: rid=002 starting refresh (sending cookie=rid=002,sid=001,csn=20241104132232.374890Z#000000#001#000000;20241104132232.937968Z#000000#002#000000;20241104130859.571000Z#000000#003#000000;20241104130849.482824Z#000000#004#000000) 2024-11-04T14:34:15.180789+01:00 server1 slapd-2.5-user[2432501]: do_syncrep1: rid=002 starting refresh (sending cookie=rid=002,sid=001,csn=20241104132232.374890Z#000000#001#000000;20241104132232.937968Z#000000#002#000000;20241104130859.571000Z#000000#003#000000;20241104130849.482824Z#000000#004#000000)

server2 : 2024-11-04T14:25:30.018539+01:00 server2 slapd-2.5-user[1777246]: conn=1046 op=1 syncprov_op_search: got a persistent search with a cookie=rid=002,sid=003,csn=20241104132232.374890Z#000000#001#000000;20241104132232.937968Z#000000#002#000000;20241104130859.571000Z#000000#003#000000;20241104130849.482824Z#000000#004#000000 2024-11-04T14:25:30.026018+01:00 server2 slapd-2.5-user[1777246]: conn=1047 op=1 syncprov_op_search: got a persistent search with a cookie=rid=002,sid=004,csn=20241104132232.374890Z#000000#001#000000;20241104132232.937968Z#000000#002#000000;20241104130859.571000Z#000000#003#000000;20241104130849.482824Z#000000#004#000000 2024-11-04T14:25:30.037401+01:00 server2 slapd-2.5-user[1777246]: conn=1048 op=1 syncprov_op_search: got a search with a cookie=rid=002,sid=003,csn=20241104132232.374890Z#000000#001#000000;20241104132232.937968Z#000000#002#000000;20241104130859.571000Z#000000#003#000000;20241104130849.482824Z#000000#004#000000 2024-11-04T14:25:30.041279+01:00 server2 slapd-2.5-user[1777246]: conn=1048 op=2 syncprov_op_search: got a persistent search with a cookie=rid=002,sid=003,csn=20241104132232.374890Z#000000#001#000000;20241104132232.937968Z#000000#002#000000;20241104130859.571000Z#000000#003#000000;20241104130849.482824Z#000000#004#000000

server3 : 2024-11-04T14:08:59.571006+01:00 server3 slapd-2.5-user[3405557]: slap_get_csn: conn=8684 op=0 generated new csn=20241104130859.571000Z#000000#003#000000 manage=1 2024-11-04T14:08:59.571012+01:00 server3 slapd-2.5-user[3405557]: slap_queue_csn: queueing 0x56290f862840 20241104130859.571000Z#000000#003#000000 2024-11-04T14:08:59.571930+01:00 server3 slapd-2.5-user[3405557]: conn=8684 op=0 syncprov_add_slog: adding csn=20241104130859.571000Z#000000#003#000000 to sessionlog, uuid=e800557e-2a1b-4f5c-8ff5-273fef541eb0 2024-11-04T14:08:59.571989+01:00 server3 slapd-2.5-user[3405557]: slap_queue_csn: queueing 0x5629040c2540 20241104130859.571000Z#000000#003#000000 2024-11-04T14:08:59.664457+01:00 server3 slapd-2.5-user[3405557]: conn=1073 op=1 syncprov_qresp: set up a new syncres mode=1 csn=20241104130859.571000Z#000000#003#000000 2024-11-04T14:08:59.664465+01:00 server3 slapd-2.5-user[3405557]: conn=1066 op=1 syncprov_qresp: set up a new syncres mode=1 csn=20241104130859.571000Z#000000#003#000000

server4 : 2024-11-04T14:08:59.665159+01:00 server4 slapd-2.5-user[3412537]: do_syncrep2: rid=003 cookie=rid=003,sid=003,csn=20241104130859.571000Z#000000#003#000000 2024-11-04T14:08:59.665242+01:00 server4 slapd-2.5-user[3412537]: slap_queue_csn: queueing 0x55e30b476000 20241104130859.571000Z#000000#003#000000 2024-11-04T14:08:59.666221+01:00 server4 slapd-2.5-user[3412537]: conn=-1 op=0 syncprov_add_slog: adding csn=20241104130859.571000Z#000000#003#000000 to sessionlog, uuid=e800557e-2a1b-4f5c-8ff5-273fef541eb0 2024-11-04T14:08:59.666272+01:00 server4 slapd-2.5-user[3412537]: slap_queue_csn: queueing 0x55e30af67f80 20241104130859.571000Z#000000#003#000000 2024-11-04T14:08:59.670261+01:00 server4 slapd-2.5-user[3412537]: conn=1044 op=1 syncprov_qresp: set up a new syncres mode=1 csn=20241104130859.571000Z#000000#003#000000

We can see for the server1 and server2 the firsts occurence of the CSN are after the restart of instances, when for server3 and server4 start at 14:08. Can this kind of behavior be used to determine whether there is a replication problem ? If you need more details, you just need to ask, if you want the full details I can add them in the google drive.

PS : We will also be able to start our tests with version 2.6 soon, we will keep you informed about the improvements.

-- You are receiving this mail because: You are on the CC list for the issue.

openldap-its＠openldap.org

7 Mar 7 Mar

12:37 a.m.

New subject: [Issue 10274] Replication issue on MMR configuration

https://bugs.openldap.org/show_bug.cgi?id=10274

--- Comment #10 from falgon.comp@gmail.com --- Hello, I'm here to give you an update on our recent tests, as we were able to install and test version 2.6.9. We were able to reproduce the synchronization problem with MMR-Delta mode from 2.6, so this problem is still present on the latest version. However, we tested the simple MMR mode (without accesslog base) with all the improvements/optimizations, and it does not have the replication problem We were able to run performance tests at over 1200 MOD/s without succeeding in recreating the problem. We're now migrating to the simple MMR mode to overcome this problem.

We'll keep our MMR-Delta instances in our development environments in case you need more information, or if you need us to carry out further tests. Thank's,

-- You are receiving this mail because: You are on the CC list for the issue.

openldap-its＠openldap.org

25 Mar 25 Mar

8:51 a.m.

New subject: [Issue 10274] Replication issue on MMR configuration

https://bugs.openldap.org/show_bug.cgi?id=10274

Ondřej Kuzník ondra@mistotebe.net changed:

What |Removed |Added ---------------------------------------------------------------------------- Target Milestone|2.5.20 |2.6.11

-- You are receiving this mail because: You are on the CC list for the issue.

openldap-its＠openldap.org

5 Sep 5 Sep

8:44 a.m.

New subject: [Issue 10274] Replication issue on MMR configuration

https://bugs.openldap.org/show_bug.cgi?id=10274

--- Comment #11 from Ondřej Kuzník ondra@mistotebe.net --- Retrying with the provided configs and a 4-way delta-MPR with absolute precision lastbind (same as MODs, it generates a lot of mutually conflicting traffic). I picked extremely underpowered machines so as to keep the saturating throughputs reasonably low and not have to worry about I/O being the limiting factor.

During the test, replication slows down to ~10-20 % of regular throughput (e.g. ~7ms to ~50-80ms per message processed in my tests). Investigating further, 40-70 % of replication wall-clock time (~25ms here) is spent on syncrepl_op_modify (deltaMMR mod resolution), specifically running the (entryCSN>=receivedCSN) part of the filter. It seems this effectively imposes a quadratic complexity on the delta-resolution traffic.

A representative traceback from master (`perf top` shows that mdb_idl_next/mdb_idl_union together take up about half of the CPU time):

``` (gdb) bt #0 0x00007fa2c37fea6a in mdb_idl_next (ids=0x7fa195476018, cursor=0x7fa1ab7ebcf0) at idl.c:947 #1 0x00007fa2c37fe80e in mdb_idl_union (a=0x7fa195476018, b=0x7fa195376018) at idl.c:831 #2 0x00007fa2c37f6823 in inequality_candidates (op=Search request rid=003 = {...}, rtxn=0x7fa1a445b1e0, ava="entryCSN"="20250904095808.059675Z#000000#003#000000", ids=0x7fa195476018, tmp=0x7fa195376018, gtorlt=165) at filterindex.c:1157 #3 0x00007fa2c37f2e35 in mdb_filter_candidates (op=Search request rid=003 = {...}, rtxn=0x7fa1a445b1e0, f=LDAP_FILTER_GE = {...}, ids=0x7fa195476018, tmp=0x7fa195376018, stack=0x7fa195576018) at filterindex.c:172 #4 0x00007fa2c37f43f8 in list_candidates (op=Search request rid=003 = {...}, rtxn=0x7fa1a445b1e0, flist=LDAP_FILTER_GE = {...}, ftype=160, ids=0x7fa1951f6018, tmp=0x7fa195376018, save=0x7fa195476018) at filterindex.c:582 #5 0x00007fa2c37f307d in mdb_filter_candidates (op=Search request rid=003 = {...}, rtxn=0x7fa1a445b1e0, f=LDAP_FILTER_AND = {...}, ids=0x7fa1951f6018, tmp=0x7fa195376018, stack=0x7fa195476018) at filterindex.c:195 #6 0x00007fa2c37ee81d in search_candidates (op=Search request rid=003 = {...}, rs=0x7fa1ab7fc6e0, e="cn=accesslog", isc=0x7fa1ab7ec1c0, mci=0x7fa1b01212e0, ids=0x7fa1951f6018, stack=0x7fa195376018) at search.c:1419 #7 0x00007fa2c37ec1aa in mdb_search (op=Search request rid=003 = {...}, rs=0x7fa1ab7fc6e0) at search.c:678 #8 0x000055b51e99feef in overlay_op_walk (op=Search request rid=003 = {...}, rs=0x7fa1ab7fc6e0, which=op_search, oi=0x55b54e3b7280, on=0x0) at backover.c:706 #9 0x000055b51e9a0178 in over_op_func (op=Search request rid=003 = {...}, rs=0x7fa1ab7fc6e0, which=op_search) at backover.c:766 #10 0x000055b51e9a02b1 in over_op_search (op=Search request rid=003 = {...}, rs=0x7fa1ab7fc6e0) at backover.c:796 #11 0x000055b51e9879b7 in syncrepl_op_modify (op=Modify request rid=003 = {...}, rs=0x7fa1ab7fced0) at syncrepl.c:3122 #12 0x000055b51e99fe49 in overlay_op_walk (op=Modify request rid=003 = {...}, rs=0x7fa1ab7fced0, which=op_modify, oi=0x55b54e367380, on=0x55b54e3f2c60) at backover.c:691 #13 0x000055b51e9a0178 in over_op_func (op=Modify request rid=003 = {...}, rs=0x7fa1ab7fced0, which=op_modify) at backover.c:766 #14 0x000055b51e9a0305 in over_op_modify (op=Modify request rid=003 = {...}, rs=0x7fa1ab7fced0) at backover.c:808 #15 0x000055b51e98918f in syncrepl_message_to_op (si=0x55b54e3f4fc0, op=Modify request rid=003 = {...}, msg=0x7fa1a422c190, do_lock=0) at syncrepl.c:3411 #16 0x000055b51e9822c0 in do_syncrep2 (op=Modify request rid=003 = {...}, si=0x55b54e3f4fc0) at syncrepl.c:1575 #17 0x000055b51e98549d in do_syncrepl (ctx=0x7fa1ab7fdb70, arg=0x55b54e3f3950, _rid=3) at syncrepl.c:2334 #18 0x000055b51e8faa6d in connection_read_thread (ctx=0x7fa1ab7fdb70, argv=0x13) at connection.c:1280 #20 0x00007fa2c41321f5 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442 #21 0x00007fa2c41b289c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 (gdb) frame 4 #4 0x00007fa2c37f43f8 in list_candidates (op=Search request rid=003 = {...}, rtxn=0x7fa1a445b1e0, flist=LDAP_FILTER_GE = {...}, ftype=160, ids=0x7fa1951f6018, tmp=0x7fa195376018, save=0x7fa195476018) at filterindex.c:582 582 rc = mdb_filter_candidates( op, rtxn, f, save, tmp, (gdb) p flist $1 = LDAP_FILTER_GE = {f_next = LDAP_FILTER_EQUALITY, f_un.f_un_ava = "entryCSN"="20250904095808.059675Z#000000#003#000000"} ```

-- You are receiving this mail because: You are on the CC list for the issue.

Age (days ago)

344

Last active (days ago)

openldap-bugs@openldap.org

15 comments

1 participants

tags (0)

participants (1)

openldap-its＠openldap.org