Hi,
I'd discussed this in the IRC channel a while back and we were all a bit stumped so I thought it would be worth sending it to the mailing list. If I use delta-syncrepl in an MMR cluster with >2 nodes I get the following behaviour when adding and deleting objects:
Jun 3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: do_syncrep2: rid=033 cookie=rid=033,sid=006,csn=20160603133734.289843Z#000000#006#000000 Jun 3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: slap_queue_csn: queueing 0x4659f80 20160603133734.289843Z#000000#006#000000 Jun 3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: slap_graduate_commit_csn: removing 0x4659bc0 20160603133551.231411Z#000000#006#000000 Jun 3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: slap_queue_csn: queueing 0x4659e80 20160603133734.289843Z#000000#006#000000 Jun 3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: syncprov_matchops: skipping original sid 006 Jun 3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: slap_graduate_commit_csn: removing 0x4659e80 20160603133734.289843Z#000000#006#000000 Jun 3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: slap_graduate_commit_csn: removing 0x4659f80 20160603133734.289843Z#000000#006#000000 Jun 3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: syncrepl_message_to_op: rid=033 be_add cn=marksgroup2,ou=ug,ou=iti,ou=is,dc=authorise,dc=ed,dc=ac,dc=uk (0) Jun 3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: slap_queue_csn: queueing 0x4659fc0 20160603133734.289843Z#000000#006#000000 Jun 3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: syncprov_sendresp: to=004, cookie=rid=032,sid=005,csn=20160603133734.289843Z#000000#006#000000 Jun 3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: do_syncrep2: rid=031 cookie=rid=031,sid=004 Jun 3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: slap_queue_csn: queueing 0x46599c0 20160603133734.289843Z#000000#006#000000 Jun 3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: slap_queue_csn: queueing 0xb58a140 20160603133734.289843Z#000000#006#000000 Jun 3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: slap_graduate_commit_csn: removing 0x46599c0 20160603133734.289843Z#000000#006#000000 Jun 3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: slap_graduate_commit_csn: removing 0x4659fc0 20160603133734.289843Z#000000#006#000000 Jun 3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: slap_graduate_commit_csn: removing 0xb58a140 20160603133734.289843Z#000000#006#000000 Jun 3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: syncrepl_message_to_op: rid=031 be_add cn=marksgroup2,ou=ug,ou=iti,ou=is,dc=authorise,dc=ed,dc=ac,dc=uk (68) Jun 3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: do_syncrep2: rid=031 delta-sync lost sync on (reqStart=20160603133734.000001Z,cn=accesslog), switching to REFRESH Jun 3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: do_syncrep2: rid=031 LDAP_RES_INTERMEDIATE - SYNC_ID_SET
It looks very much like a time sync issue but I've confirmed that my servers are all synced against a single common NTP server and are as closely synced time-wise as could be reasonably expected:
[root@alder ~]# ntpstat synchronised to NTP server (129.215.205.191) at stratum 3 time correct to within 39 ms polling server every 64 s [root@beech ~]# ntpstat synchronised to NTP server (129.215.205.191) at stratum 3 time correct to within 38 ms polling server every 64 s [root@oak ~]# ntpstat synchronised to NTP server (129.215.205.191) at stratum 3 time correct to within 38 ms polling server every 64 s [root@rowan ~]# ntpstat synchronised to NTP server (129.215.205.191) at stratum 3 time correct to within 39 ms polling server every 64 s
In terms of server config the olcSyncrepl directives are of the form:
olcSyncrepl: {0}rid=31 provider=ldaps://beech.authorise.is.ed.ac.uk/ realm=EASE.ED.AC.UK bindmethod=sasl saslmech=gssapi authcid="replicator.authorise.is.ed.ac.uk@EASE.ED.AC.UK" searchbase="dc=authorise,dc=ed,dc=ac,dc=uk" logbase="cn=accesslog" logfilter="(&(objectClass=auditWriteObject)(reqResult=0))" schemachecking=on type=refreshAndPersist retry="60 +" tls_reqcert=never timeout=1 keepalive=300:12:5 syncdata=accesslog
with a line for each of the 4 servers.
and my syncprov config on the database being replicated is: dn: olcOverlay={2}syncprov,olcDatabase={1}mdb,cn=config objectClass: olcOverlayConfig objectClass: olcConfig objectClass: top objectClass: olcSyncProvConfig olcOverlay: {2}syncprov structuralObjectClass: olcSyncProvConfig
If I have 2 servers in the cluster it works fine, if I use pull-based replication (refreshOnly) it also works fine and if I use traditional syncrepl with refreshAndPersist it works fine.
I'm using OpenLDAP 2.4.44 on Centos 6 and 7 x86_64 with the same behaviour.
Any ideas? At the moment I've proceeded to go-live with pull-based replication but it would be nice to get push-based replication working. Also while I have the "old" servers sitting there it makes sense to use these for testing/debugging before retiring them.