--On Wednesday, March 27, 2019 9:24 AM +0100 "Ángel L. Mateo" amateo@um.es wrote:
Although I'm not using any intermediate device between ldap servers (they all are in the same network) I have added the keepalive option to syncrepl, without any result.
Yeah, it's more of a safeguard to ensure one avoids those types of issues.
As for your current situation, I did have a customer the other day who was reporting multi-hour replication delays. But when I looked into it, the issue was not unexpected. I don't know if this has any relevance to your situation, but here is what they were doing.
They had a 4-way MMR setup. They were doing 50k ADDs to each of the masters. This is one of the worst things one could do, becuase it will massively slow down replication and cause all the servers to get behind due to the contention caused by the massive number of ADD ops occurring on each server while also trying to replicate the ADDs from the other servers. I.e., the best route would be to simply ADD all 150k entries to a single member node and let them replicate out from there.
You've noted before wanting some information on what replication lines mean. It's hard to answer that exactly, as what's important to examine and what isn't important can be difficult to determine without looking at the actual logs and having some knowledge of what's going on, but this may be of some help. Here are some different sets of logs and scenarios. All logs, etc, are based on delta-syncrepl being in use.
Scenario #1:
Skipping an change that's too far in the past. This should have been processed already from the original change on the original master (SID 066). This change request came from SID 0c9
Mar 25 12:19:23 master1 slapd[29939]: do_syncrep2: rid=201 cookie=rid=201,sid=0c9,csn=20190325191922.072386Z#000000#066#000000 Mar 25 12:19:23 master1 slapd[29939]: do_syncrep2: rid=201 CSN too old, ignoring 20190325191922.072386Z#000000#066#000000 (reqStart=20190325191922.000185Z,cn=deltalog)
Scenario #2:
Applying a change that originated on another server:
Mar 25 12:19:23 master1 slapd[29939]: do_syncrep2: rid=102 cookie=rid=102,sid=066,csn=20190325191923.177761Z#000000#066#000000 Mar 25 12:19:23 master1 slapd[29939]: slap_queue_csn: queueing 0x7f9ab88ce0f0 20190325191923.177761Z#000000#066#000000 Mar 25 12:19:23 master1 slapd[29939]: syncrepl_message_to_op: rid=102 tid bcffc700 Mar 25 12:19:23 master1 slapd[29939]: slap_queue_csn: queueing 0x7f9ab842eda0 20190325191923.177761Z#000000#066#000000 Mar 25 12:19:23 master1 slapd[29939]: syncprov_sendresp: to=12d, cookie=rid=101,sid=065,csn=20190325191923.177761Z#000000#066#000000 Mar 25 12:19:23 master1 slapd[29939]: syncprov_matchops: skipping original sid 066 Mar 25 12:19:23 master1 slapd[29939]: slap_graduate_commit_csn: removing 0x7f9ab842eda0 20190325191923.177761Z#000000#066#000000 Mar 25 12:19:23 master1 slapd[29939]: slap_graduate_commit_csn: removing 0x7f9ab88ce0f0 20190325191923.177761Z#000000#066#000000 Mar 25 12:19:23 master1 slapd[29939]: syncprov_sendresp: to=0c9, cookie=rid=101,sid=065,csn=20190325191923.177761Z#000000#066#000000 Mar 25 12:19:23 master1 slapd[29939]: syncrepl_message_to_op: rid=102 be_add cn=QA_MACHINE_9F06E7769C9B49B49D737414662C9089,ou=machines,dc=example,dc=com (0) Mar 25 12:19:23 master1 slapd[29939]: slap_queue_csn: queueing 0x7f9ab8a3b930 20190325191923.177761Z#000000#066#000000 Mar 25 12:19:23 master1 slapd[29939]: slap_graduate_commit_csn: removing 0x7f9ab8a3b930 20190325191923.177761Z#000000#066#000000
So in the above, this master (SID065) got a change from another master (SID066). It was not a change it had received yet from 066, so it applies it. It skips sending this change back to the originating server (Thus the "skipping original sid 066" line). It forwards the change to its other listeners (0c9, 12d) in case they do not have a direct link to 066, and commits the change locally.
Scenario #3:
Originating a change:
Mar 25 03:58:24 master1 slapd[29939]: conn=5788689 op=75 ADD dn="cn=ATOM Virtual:843a4b737992,ou=identities,dc=example,dc=com" Mar 25 03:58:24 master1 slapd[29939]: slap_queue_csn: queueing 0x7f9ab053ed40 20190325105824.517526Z#000000#065#000000 Mar 25 03:58:24 master1 slapd[29939]: slap_queue_csn: queueing 0x7f9ab106c0f0 20190325105824.517526Z#000000#065#000000 Mar 25 03:58:24 master1 slapd[29939]: syncprov_sendresp: to=12d, cookie=rid=101,sid=065,csn=20190325105824.517526Z#000000#065#000000 Mar 25 03:58:24 master1 slapd[29939]: syncprov_sendresp: to=0c9, cookie=rid=101,sid=065,csn=20190325105824.517526Z#000000#065#000000 Mar 25 03:58:24 master1 slapd[29939]: syncprov_sendresp: to=066, cookie=rid=101,sid=065,csn=20190325105824.517526Z#000000#065#000000 Mar 25 03:58:24 master1 slapd[29939]: slap_graduate_commit_csn: removing 0x7f9ab106c0f0 20190325105824.517526Z#000000#065#000000 Mar 25 03:58:24 master1 slapd[29939]: conn=5788689 op=75 RESULT tag=105 err=0 text= Mar 25 03:58:24 master1 slapd[29939]: slap_graduate_commit_csn: removing 0x7f9ab053ed40 20190325105824.517526Z#000000#065#000000
So in the above, the master originating this change (065) creates the relevant CSN and forwards this change to its listeners (SIDs 12d, 0c9, 066), commits, and cleans up.
To read a CSN value:
20190325105824.517526Z#000000#065#000000
The first part is the time (Zulu) that the change occured. So in above case, the change occurred at:
2019/03/25 @ 10:58:24.517526 seconds Zulu time (Server time as we see above was 03:58:24), so this server is 7 hours behind Zulu time.
The server where the modification was made was the one with a SID value of 065.
If you're looking at the contextCSN values on a server, it essentially tells you the last change it processed from each master that it talks to (and its own current change value). If the values aren't matching between servers, then you need to see if those servers are still processing changes from the other masters.
--Quanah
--
Quanah Gibson-Mount Product Architect Symas Corporation Packaged, certified, and supported LDAP solutions powered by OpenLDAP: http://www.symas.com