Odd MMR behaviour with delta-syncrepl and refreshAndPersist

3 Jun 2016

      Hi,
I'd discussed this in the IRC channel a while back and we were all a bit
stumped so I thought it would be worth sending it to the mailing list.
If I use delta-syncrepl in an MMR cluster with >2 nodes I get the
following behaviour when adding and deleting objects:
Jun  3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: do_syncrep2:
rid=033 cookie=rid=033,sid=006,csn=20160603133734.289843Z#000000#006#000000
Jun  3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: slap_queue_csn:
queueing 0x4659f80 20160603133734.289843Z#000000#006#000000
Jun  3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]:
slap_graduate_commit_csn: removing 0x4659bc0
20160603133551.231411Z#000000#006#000000
Jun  3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: slap_queue_csn:
queueing 0x4659e80 20160603133734.289843Z#000000#006#000000
Jun  3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]:
syncprov_matchops: skipping original sid 006
Jun  3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]:
slap_graduate_commit_csn: removing 0x4659e80
20160603133734.289843Z#000000#006#000000
Jun  3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]:
slap_graduate_commit_csn: removing 0x4659f80
20160603133734.289843Z#000000#006#000000
Jun  3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]:
syncrepl_message_to_op: rid=033 be_add
cn=marksgroup2,ou=ug,ou=iti,ou=is,dc=authorise,dc=ed,dc=ac,dc=uk (0)
Jun  3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: slap_queue_csn:
queueing 0x4659fc0 20160603133734.289843Z#000000#006#000000
Jun  3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]:
syncprov_sendresp: to=004,
cookie=rid=032,sid=005,csn=20160603133734.289843Z#000000#006#000000
Jun  3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: do_syncrep2:
rid=031 cookie=rid=031,sid=004
Jun  3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: slap_queue_csn:
queueing 0x46599c0 20160603133734.289843Z#000000#006#000000
Jun  3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: slap_queue_csn:
queueing 0xb58a140 20160603133734.289843Z#000000#006#000000
Jun  3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]:
slap_graduate_commit_csn: removing 0x46599c0
20160603133734.289843Z#000000#006#000000
Jun  3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]:
slap_graduate_commit_csn: removing 0x4659fc0
20160603133734.289843Z#000000#006#000000
Jun  3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]:
slap_graduate_commit_csn: removing 0xb58a140
20160603133734.289843Z#000000#006#000000
Jun  3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]:
syncrepl_message_to_op: rid=031 be_add
cn=marksgroup2,ou=ug,ou=iti,ou=is,dc=authorise,dc=ed,dc=ac,dc=uk (68)
Jun  3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: do_syncrep2:
rid=031 delta-sync lost sync on
(reqStart=20160603133734.000001Z,cn=accesslog), switching to REFRESH
Jun  3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: do_syncrep2:
rid=031 LDAP_RES_INTERMEDIATE - SYNC_ID_SET
It looks very much like a time sync issue but I've confirmed that my
servers are all synced against a single common NTP server and are as
closely synced time-wise as could be reasonably expected:
[root@alder ~]# ntpstat
synchronised to NTP server (129.215.205.191) at stratum 3
   time correct to within 39 ms
   polling server every 64 s
[root@beech ~]# ntpstat
synchronised to NTP server (129.215.205.191) at stratum 3
   time correct to within 38 ms
   polling server every 64 s
[root@oak ~]# ntpstat
synchronised to NTP server (129.215.205.191) at stratum 3
   time correct to within 38 ms
   polling server every 64 s
[root@rowan ~]# ntpstat
synchronised to NTP server (129.215.205.191) at stratum 3
   time correct to within 39 ms
   polling server every 64 s
In terms of server config the olcSyncrepl directives are of the form:
olcSyncrepl: {0}rid=31 provider=ldaps://beech.authorise.is.ed.ac.uk/
realm=EASE.ED.AC.UK bindmethod=sasl saslmech=gssapi
authcid="replicator.authorise.is.ed.ac.uk@EASE.ED.AC.UK"
searchbase="dc=authorise,dc=ed,dc=ac,dc=uk" logbase="cn=accesslog"
logfilter="(&(objectClass=auditWriteObject)(reqResult=0))"
schemachecking=on type=refreshAndPersist retry="60 +" tls_reqcert=never
timeout=1 keepalive=300:12:5 syncdata=accesslog
with a line for each of the 4 servers.
and my syncprov config on the database being replicated is:
dn: olcOverlay={2}syncprov,olcDatabase={1}mdb,cn=config
objectClass: olcOverlayConfig
objectClass: olcConfig
objectClass: top
objectClass: olcSyncProvConfig
olcOverlay: {2}syncprov
structuralObjectClass: olcSyncProvConfig
If I have 2 servers in the cluster it works fine, if I use pull-based
replication (refreshOnly) it also works fine and if I use traditional
syncrepl with refreshAndPersist it works fine.
I'm using OpenLDAP 2.4.44 on Centos 6 and 7 x86_64 with the same behaviour.
Any ideas? At the moment I've proceeded to go-live with pull-based
replication but it would be nice to get push-based replication working.
Also while I have the "old" servers sitting there it makes sense to use
these for testing/debugging before retiring them.
-- 
/****************************

Mark Cairney
ITI Enterprise Services
Information Services
University of Edinburgh

Tel: 0131 650 6565
Email: Mark.Cairney@ed.ac.uk
PGP: 0x435A9621

*******************************/

The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

2026

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

Odd MMR behaviour with delta-syncrepl and refreshAndPersist