Re: Antw: Re: Replication delay

29 Mar 2019


      --On Wednesday, March 27, 2019 9:24 AM +0100 "Ángel L. Mateo" 
amateo@um.es wrote:
...
Although I'm not using any intermediate device between ldap servers
(they all are in the same network) I have added the keepalive option to
syncrepl, without any result.
Yeah, it's more of a safeguard to ensure one avoids those types of issues.
As for your current situation, I did have a customer the other day who was 
reporting multi-hour replication delays.  But when I looked into it, the 
issue was not unexpected.  I don't know if this has any relevance to your 
situation, but here is what they were doing.
They had a 4-way MMR setup.  They were doing 50k ADDs to each of the 
masters.  This is one of the worst things one could do, becuase it will 
massively slow down replication and cause all the servers to get behind due 
to the contention caused by the massive number of ADD ops occurring on each 
server while also trying to replicate the ADDs from the other servers. 
I.e., the best route would be to simply ADD all 150k entries to a single 
member node and let them replicate out from there.
You've noted before wanting some information on what replication lines 
mean.  It's hard to answer that exactly, as what's important to examine and 
what isn't important can be difficult to determine without looking at the 
actual logs and having some knowledge of what's going on, but this may be 
of some help. Here are some different sets of logs and scenarios.  All 
logs, etc, are based on delta-syncrepl being in use.
Scenario #1:
Skipping an change that's too far in the past.  This should have been 
processed already from the original change on the original master (SID 
066).  This change request came from SID 0c9
Mar 25 12:19:23 master1 slapd[29939]: do_syncrep2: rid=201 
cookie=rid=201,sid=0c9,csn=20190325191922.072386Z#000000#066#000000
Mar 25 12:19:23 master1 slapd[29939]: do_syncrep2: rid=201 CSN too old, 
ignoring 20190325191922.072386Z#000000#066#000000 
(reqStart=20190325191922.000185Z,cn=deltalog)
Scenario #2:
Applying a change that originated on another server:
Mar 25 12:19:23 master1 slapd[29939]: do_syncrep2: rid=102 
cookie=rid=102,sid=066,csn=20190325191923.177761Z#000000#066#000000
Mar 25 12:19:23 master1 slapd[29939]: slap_queue_csn: queueing 
0x7f9ab88ce0f0 20190325191923.177761Z#000000#066#000000
Mar 25 12:19:23 master1 slapd[29939]: syncrepl_message_to_op: rid=102 tid 
bcffc700
Mar 25 12:19:23 master1 slapd[29939]: slap_queue_csn: queueing 
0x7f9ab842eda0 20190325191923.177761Z#000000#066#000000
Mar 25 12:19:23 master1 slapd[29939]: syncprov_sendresp: to=12d, 
cookie=rid=101,sid=065,csn=20190325191923.177761Z#000000#066#000000
Mar 25 12:19:23 master1 slapd[29939]: syncprov_matchops: skipping original 
sid 066
Mar 25 12:19:23 master1 slapd[29939]: slap_graduate_commit_csn: removing 
0x7f9ab842eda0 20190325191923.177761Z#000000#066#000000
Mar 25 12:19:23 master1 slapd[29939]: slap_graduate_commit_csn: removing 
0x7f9ab88ce0f0 20190325191923.177761Z#000000#066#000000
Mar 25 12:19:23 master1 slapd[29939]: syncprov_sendresp: to=0c9, 
cookie=rid=101,sid=065,csn=20190325191923.177761Z#000000#066#000000
Mar 25 12:19:23 master1 slapd[29939]: syncrepl_message_to_op: rid=102 
be_add 
cn=QA_MACHINE_9F06E7769C9B49B49D737414662C9089,ou=machines,dc=example,dc=com 
(0)
Mar 25 12:19:23 master1 slapd[29939]: slap_queue_csn: queueing 
0x7f9ab8a3b930 20190325191923.177761Z#000000#066#000000
Mar 25 12:19:23 master1 slapd[29939]: slap_graduate_commit_csn: removing 
0x7f9ab8a3b930 20190325191923.177761Z#000000#066#000000
So in the above, this master (SID065) got a change from another master 
(SID066). It was not a change it had received yet from 066, so it applies 
it.  It skips sending this change back to the originating server (Thus the 
"skipping original sid 066" line).  It forwards the change to its other 
listeners (0c9, 12d) in case they do not have a direct link to 066, and 
commits the change locally.
Scenario #3:
Originating a change:
Mar 25 03:58:24 master1 slapd[29939]: conn=5788689 op=75 ADD dn="cn=ATOM 
Virtual:843a4b737992,ou=identities,dc=example,dc=com"
Mar 25 03:58:24 master1 slapd[29939]: slap_queue_csn: queueing 
0x7f9ab053ed40 20190325105824.517526Z#000000#065#000000
Mar 25 03:58:24 master1 slapd[29939]: slap_queue_csn: queueing 
0x7f9ab106c0f0 20190325105824.517526Z#000000#065#000000
Mar 25 03:58:24 master1 slapd[29939]: syncprov_sendresp: to=12d, 
cookie=rid=101,sid=065,csn=20190325105824.517526Z#000000#065#000000
Mar 25 03:58:24 master1 slapd[29939]: syncprov_sendresp: to=0c9, 
cookie=rid=101,sid=065,csn=20190325105824.517526Z#000000#065#000000
Mar 25 03:58:24 master1 slapd[29939]: syncprov_sendresp: to=066, 
cookie=rid=101,sid=065,csn=20190325105824.517526Z#000000#065#000000
Mar 25 03:58:24 master1 slapd[29939]: slap_graduate_commit_csn: removing 
0x7f9ab106c0f0 20190325105824.517526Z#000000#065#000000
Mar 25 03:58:24 master1 slapd[29939]: conn=5788689 op=75 RESULT tag=105 
err=0 text=
Mar 25 03:58:24 master1 slapd[29939]: slap_graduate_commit_csn: removing 
0x7f9ab053ed40 20190325105824.517526Z#000000#065#000000
So in the above, the master originating this change (065) creates the 
relevant CSN and forwards this change to its listeners (SIDs 12d, 0c9, 
066), commits, and cleans up.
To read a CSN value:
20190325105824.517526Z#000000#065#000000
The first part is the time (Zulu) that the change occured.  So in above 
case, the change occurred at:
2019/03/25 @ 10:58:24.517526 seconds Zulu time (Server time as we see above 
was 03:58:24), so this server is 7 hours behind Zulu time.
The server where the modification was made was the one with a SID value of 
065.
If you're looking at the contextCSN values on a server, it essentially 
tells you the last change it processed from each master that it talks to 
(and its own current change value).  If the values aren't matching between 
servers, then you need to see if those servers are still processing changes 
from the other masters.
--Quanah
--
Quanah Gibson-Mount
Product Architect
Symas Corporation
Packaged, certified, and supported LDAP solutions powered by OpenLDAP:
http://www.symas.com

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

Re: Antw: Re: Replication delay