--On Thursday, April 07, 2016 11:57 PM +0000 quanah@zimbra.com wrote:
Full summary:
The syncprov checkpoint operation causes the CSN to be lost for the first write operation to occur after the checkpoint. It is important to note that no data is lost; all changes replicate as they should.
However, the replica CSN is not updated in this scenario, making it appear that the replica is out of sync with the master. Adding the syncprov overlay to a replica database works around this issue by forcing the replica to track its internal CSNs, rather than relying on broadcasts from the master.
It is trivial to reproduce this issue by setting a short checkpoint interval with the syncprov-checkpoint parameter.
Example of the problem:
We have a script modifying the userPassword attribute of an entry every 45 seconds. We have a syncprov-checkpoint set to happen every 5 minutes.
From the log we can see:
Apr 7 18:00:38 zre-ldap002 slapd[29904]: syncprov_sendresp: cookie=rid=100
Apr 7 18:05:53 zre-ldap002 slapd[29904]: syncprov_sendresp: cookie=rid=100
Apr 7 18:11:09 zre-ldap002 slapd[29904]: syncprov_sendresp: cookie=rid=100
Apr 7 18:16:25 zre-ldap002 slapd[29904]: syncprov_sendresp: cookie=rid=100
Apr 7 18:17:55 zre-ldap002 slapd[29904]: syncprov_sendresp: cookie=rid=100
Apr 7 18:21:41 zre-ldap002 slapd[29904]: syncprov_sendresp: cookie=rid=100
Apr 7 18:26:57 zre-ldap002 slapd[29904]: syncprov_sendresp: cookie=rid=100
Apr 7 18:32:13 zre-ldap002 slapd[29904]: syncprov_sendresp: cookie=rid=100
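For anyone wanting to pick these responses out of a larger log, here is a minimal sketch in Python. It assumes that a normal syncprov_sendresp cookie carries a csn= component after the rid, and that the CSN-less responses are the ones logged as bare cookie=rid=100. The helper name and the second sample line are made up for illustration and are not from the log above.

```python
import re

# Captures the cookie field of a syncprov_sendresp log line.
COOKIE_RE = re.compile(r"syncprov_sendresp: cookie=(\S*)")

def cookies_without_csn(lines):
    """Return the log lines whose syncprov_sendresp cookie has no csn= component."""
    hits = []
    for line in lines:
        m = COOKIE_RE.search(line)
        if m and "csn=" not in m.group(1):
            hits.append(line)
    return hits

sample = [
    # Taken from the log excerpt above.
    "Apr 7 18:00:38 zre-ldap002 slapd[29904]: syncprov_sendresp: cookie=rid=100",
    # Hypothetical line with a CSN, added only for contrast (an assumption).
    "Apr 7 18:01:23 zre-ldap002 slapd[29904]: syncprov_sendresp: "
    "cookie=rid=100,csn=20160407230123.000000Z#000000#000#000000",
]

for line in cookies_without_csn(sample):
    print(line)
```

Run against the real log, the timestamps of the lines it prints should line up with the checkpoint interval if the analysis above is right.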
Stopping the script after the 18:32:13 operation, and examining the CSN values on each server, we see the following.
master:
[zimbra@zre-ldap003 scripts]$ ldapsearch -x -LLL -H ldap://zre-ldap002.eng.zimbra.com -s base -b "dc=uvm,dc=edu" contextCSN
dn: dc=uvm,dc=edu
contextCSN: 20160407233212.979013Z#000000#000#000000

replica:
[zimbra@zre-ldap003 scripts]$ ldapsearch -x -LLL -H ldapi:// -s base -b "dc=uvm,dc=edu" contextCSN
dn: dc=uvm,dc=edu
contextCSN: 20160407233127.886702Z#000000#000#000000
Note that the CSNs are 45 seconds apart, which is the interval between our writes. So the CSN value left on the replica in this case is that of the write op /prior/ to the checkpoint: the replica ignores the syncprov send response that carries an empty CSN, and thus does not update its own CSN.
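The gap between the two contextCSN values can be checked with a few lines of Python. This is only a sketch: it assumes the first '#'-separated field of a CSN is a UTC timestamp with microsecond precision, as in the values shown above, and the function name is my own.

```python
from datetime import datetime, timezone

def csn_timestamp(csn: str) -> datetime:
    """Parse the timestamp field (the first '#'-separated field) of a CSN."""
    stamp = csn.split("#", 1)[0]
    parsed = datetime.strptime(stamp, "%Y%m%d%H%M%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)

# contextCSN values from the ldapsearch output in this report.
master = "20160407233212.979013Z#000000#000#000000"
replica = "20160407233127.886702Z#000000#000#000000"

# Difference between the master and replica CSN timestamps, in seconds.
lag = (csn_timestamp(master) - csn_timestamp(replica)).total_seconds()
print(f"replica contextCSN trails master by {lag:.3f} seconds")
```

A difference that matches the write interval, rather than growing over time, points at the lost CSN described here and not at a genuine replication failure.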
While it is of course best practice to run the syncprov overlay on the replica to enforce internal CSN cohesion, it still should not be required, and this is clearly a bug that can cause admins to incorrectly believe that their servers are having replication issues.
--Quanah
--
Quanah Gibson-Mount
Platform Architect
Zimbra, Inc.
--------------------
Zimbra :: the leader in open source messaging and collaboration
A division of Synacor, Inc