--On Wednesday, March 27, 2019 9:24 AM +0100 "Ángel L. Mateo"
<amateo(a)um.es> wrote:
Although I'm not using any intermediate device between ldap
servers
(they all are in the same network) I have added the keepalive option to
syncrepl, without any result.
Yeah, it's more of a safeguard to ensure one avoids those types of issues.
As for your current situation, I did have a customer the other day who was
reporting multi-hour replication delays. But when I looked into it, the
issue was not unexpected. I don't know if this has any relevance to your
situation, but here is what they were doing.
They had a 4-way MMR setup. They were doing 50k ADDs to each of the
masters. This is one of the worst things one could do, becuase it will
massively slow down replication and cause all the servers to get behind due
to the contention caused by the massive number of ADD ops occurring on each
server while also trying to replicate the ADDs from the other servers.
I.e., the best route would be to simply ADD all 150k entries to a single
member node and let them replicate out from there.
You've noted before wanting some information on what replication lines
mean. It's hard to answer that exactly, as what's important to examine and
what isn't important can be difficult to determine without looking at the
actual logs and having some knowledge of what's going on, but this may be
of some help. Here are some different sets of logs and scenarios. All
logs, etc, are based on delta-syncrepl being in use.
Scenario #1:
Skipping an change that's too far in the past. This should have been
processed already from the original change on the original master (SID
066). This change request came from SID 0c9
Mar 25 12:19:23 master1 slapd[29939]: do_syncrep2: rid=201
cookie=rid=201,sid=0c9,csn=20190325191922.072386Z#000000#066#000000
Mar 25 12:19:23 master1 slapd[29939]: do_syncrep2: rid=201 CSN too old,
ignoring 20190325191922.072386Z#000000#066#000000
(reqStart=20190325191922.000185Z,cn=deltalog)
Scenario #2:
Applying a change that originated on another server:
Mar 25 12:19:23 master1 slapd[29939]: do_syncrep2: rid=102
cookie=rid=102,sid=066,csn=20190325191923.177761Z#000000#066#000000
Mar 25 12:19:23 master1 slapd[29939]: slap_queue_csn: queueing
0x7f9ab88ce0f0 20190325191923.177761Z#000000#066#000000
Mar 25 12:19:23 master1 slapd[29939]: syncrepl_message_to_op: rid=102 tid
bcffc700
Mar 25 12:19:23 master1 slapd[29939]: slap_queue_csn: queueing
0x7f9ab842eda0 20190325191923.177761Z#000000#066#000000
Mar 25 12:19:23 master1 slapd[29939]: syncprov_sendresp: to=12d,
cookie=rid=101,sid=065,csn=20190325191923.177761Z#000000#066#000000
Mar 25 12:19:23 master1 slapd[29939]: syncprov_matchops: skipping original
sid 066
Mar 25 12:19:23 master1 slapd[29939]: slap_graduate_commit_csn: removing
0x7f9ab842eda0 20190325191923.177761Z#000000#066#000000
Mar 25 12:19:23 master1 slapd[29939]: slap_graduate_commit_csn: removing
0x7f9ab88ce0f0 20190325191923.177761Z#000000#066#000000
Mar 25 12:19:23 master1 slapd[29939]: syncprov_sendresp: to=0c9,
cookie=rid=101,sid=065,csn=20190325191923.177761Z#000000#066#000000
Mar 25 12:19:23 master1 slapd[29939]: syncrepl_message_to_op: rid=102
be_add
cn=QA_MACHINE_9F06E7769C9B49B49D737414662C9089,ou=machines,dc=example,dc=com
(0)
Mar 25 12:19:23 master1 slapd[29939]: slap_queue_csn: queueing
0x7f9ab8a3b930 20190325191923.177761Z#000000#066#000000
Mar 25 12:19:23 master1 slapd[29939]: slap_graduate_commit_csn: removing
0x7f9ab8a3b930 20190325191923.177761Z#000000#066#000000
So in the above, this master (SID065) got a change from another master
(SID066). It was not a change it had received yet from 066, so it applies
it. It skips sending this change back to the originating server (Thus the
"skipping original sid 066" line). It forwards the change to its other
listeners (0c9, 12d) in case they do not have a direct link to 066, and
commits the change locally.
Scenario #3:
Originating a change:
Mar 25 03:58:24 master1 slapd[29939]: conn=5788689 op=75 ADD dn="cn=ATOM
Virtual:843a4b737992,ou=identities,dc=example,dc=com"
Mar 25 03:58:24 master1 slapd[29939]: slap_queue_csn: queueing
0x7f9ab053ed40 20190325105824.517526Z#000000#065#000000
Mar 25 03:58:24 master1 slapd[29939]: slap_queue_csn: queueing
0x7f9ab106c0f0 20190325105824.517526Z#000000#065#000000
Mar 25 03:58:24 master1 slapd[29939]: syncprov_sendresp: to=12d,
cookie=rid=101,sid=065,csn=20190325105824.517526Z#000000#065#000000
Mar 25 03:58:24 master1 slapd[29939]: syncprov_sendresp: to=0c9,
cookie=rid=101,sid=065,csn=20190325105824.517526Z#000000#065#000000
Mar 25 03:58:24 master1 slapd[29939]: syncprov_sendresp: to=066,
cookie=rid=101,sid=065,csn=20190325105824.517526Z#000000#065#000000
Mar 25 03:58:24 master1 slapd[29939]: slap_graduate_commit_csn: removing
0x7f9ab106c0f0 20190325105824.517526Z#000000#065#000000
Mar 25 03:58:24 master1 slapd[29939]: conn=5788689 op=75 RESULT tag=105
err=0 text=
Mar 25 03:58:24 master1 slapd[29939]: slap_graduate_commit_csn: removing
0x7f9ab053ed40 20190325105824.517526Z#000000#065#000000
So in the above, the master originating this change (065) creates the
relevant CSN and forwards this change to its listeners (SIDs 12d, 0c9,
066), commits, and cleans up.
To read a CSN value:
20190325105824.517526Z#000000#065#000000
The first part is the time (Zulu) that the change occured. So in above
case, the change occurred at:
2019/03/25 @ 10:58:24.517526 seconds Zulu time (Server time as we see above
was 03:58:24), so this server is 7 hours behind Zulu time.
The server where the modification was made was the one with a SID value of
065.
If you're looking at the contextCSN values on a server, it essentially
tells you the last change it processed from each master that it talks to
(and its own current change value). If the values aren't matching between
servers, then you need to see if those servers are still processing changes
from the other masters.
--Quanah
--
Quanah Gibson-Mount
Product Architect
Symas Corporation
Packaged, certified, and supported LDAP solutions powered by OpenLDAP:
<
http://www.symas.com>