quanah@zimbra.com wrote:
--On Wednesday, May 16, 2012 10:27 PM +0000 quanah@OpenLDAP.org wrote:
Full_Name: Quanah Gibson-Mount
Version: 2.4.31
OS: Linux 2.6
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (75.108.184.39)
We can see that the script turning it into a master ran here:
Thu May 17 16:05:46 2012 *** Running as zimbra user: /opt/zimbra/libexec/zmldapenable-mmr -s 2 -m ldap://zre-ldap002.eng.vmware.com:389/
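So the conversion ran at 16:05:46 local time, which appears to correspond to 23:05:46Z in the accesslog timestamps (assuming the log above is in a UTC-7 zone).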
In the accesslog, we see:
dn: cn=accesslog
objectClass: auditContainer
cn: accesslog
structuralObjectClass: auditContainer
contextCSN: 20120517225152.913667Z#000000#000#000000
contextCSN: 20120517230823.615364Z#000000#001#000000
contextCSN: 20120517230546.409118Z#000000#002#000000

dn: reqStart=20120517230546.000019Z,cn=accesslog
objectClass: auditAdd
structuralObjectClass: auditAdd
reqStart: 20120517230546.000019Z
reqEnd: 20120517230546.000020Z
reqType: add
reqSession: 100
reqAuthzID: cn=config
reqDN: cn=zimbra
reqResult: 0
reqMod: objectClass:+ organizationalRole
reqMod: description:+ Zimbra Systems Application Data
reqMod: cn:+ zimbra
reqMod: structuralObjectClass:+ organizationalRole
reqMod: entryUUID:+ 40f78bea-34be-1031-8a5d-e1466f667e19
reqMod: creatorsName:+ cn=config
reqMod: createTimestamp:+ 20120517224907Z
reqMod: entryCSN:+ 20120517224907.221672Z#000000#000#000000
reqMod: modifiersName:+ cn=config
reqMod: modifyTimestamp:+ 20120517224907Z
reqEntryUUID: 40f78bea-34be-1031-8a5d-e1466f667e19
entryUUID: 948929e2-34c0-1031-9a14-c93bd10ff0f2
creatorsName: cn=config
createTimestamp: 20120517224907Z
entryCSN: 20120517224907.221672Z#000000#000#000000
modifiersName: cn=config
modifyTimestamp: 20120517224907Z
So it is tracking SID "000" as a third master? This seems to be why the original server (which was 000 before being promoted to 001) replicates these entries back to itself.
The loop is caused by the patch to ITS#6872, which considers a consumer out of date whenever the number of CSNs in its sync request doesn't match the number known to the provider.
The data here is basically invalid: server1 has entries generated using SID=0 but it has no contextCSN value with SID=0. It only sent SID=1 and SID=2 in its sync request. Server2, which just updated from server1, has a contextCSN for SID=0 in addition to 1 and 2 (and that's all correct).
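To make the mismatch concrete, here is a rough Python sketch. It is not the syncprov code; consumer_out_of_date() is a hypothetical, simplified stand-in for the ITS#6872 check as described above. The provider's contextCSN values are copied from the accesslog excerpt earlier in this message and treated as server2's view, and server1's request is assumed to carry the same values minus the SID=0 one.

```python
def sid_of(csn):
    """Extract the server ID from a CSN (timestamp#count#sid#mod)."""
    fields = csn.split("#")
    if len(fields) != 4:
        raise ValueError("not a CSN: %r" % csn)
    return int(fields[2], 16)  # the SID field is hexadecimal


def consumer_out_of_date(provider_csns, request_csns):
    """Hypothetical simplification of the ITS#6872 check: a differing
    number of CSNs alone marks the consumer as out of date."""
    return len(provider_csns) != len(request_csns)


# contextCSN values from the accesslog excerpt, taken here as the
# provider's (server2's) view. That attribution is an assumption.
server2 = [
    "20120517225152.913667Z#000000#000#000000",
    "20120517230823.615364Z#000000#001#000000",
    "20120517230546.409118Z#000000#002#000000",
]

# Assumption: server1's sync request carries only the SID=1 and SID=2 values.
server1_request = [c for c in server2 if sid_of(c) != 0]

print(sorted(sid_of(c) for c in server2))              # [0, 1, 2]
print(sorted(sid_of(c) for c in server1_request))      # [1, 2]
print(consumer_out_of_date(server2, server1_request))  # True
```

Since server1 never acquires a SID=0 contextCSN, the counts never match, so in this simplified model it is treated as out of date on every sync.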
Server1 should always have had a contextCSN value for SID=0, but it doesn't. This problem would not occur if server1 were first converted from standalone into a single master, i.e., load syncprov on it and let it scan the DB and generate the first SID=0 contextCSN before turning it into an MMR node.
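The invariant that is being violated (every SID that appears in an entryCSN should also have a contextCSN value) can be checked offline with something like the following. This is only an illustrative sketch: uncovered_sids() is a made-up helper, and the input lists would have to be pulled from the database by other means (ldapsearch, slapcat, etc.). The sample values are the cn=zimbra entryCSN from the accesslog above and the SID=1 and SID=2 contextCSN values, used on the assumption that they represent server1's state.

```python
def sid_of(csn):
    """Extract the server ID from a CSN (timestamp#count#sid#mod)."""
    fields = csn.split("#")
    if len(fields) != 4:
        raise ValueError("not a CSN: %r" % csn)
    return int(fields[2], 16)  # the SID field is hexadecimal


def uncovered_sids(entry_csns, context_csns):
    """Return the SIDs used in entryCSNs that have no contextCSN value."""
    covered = {sid_of(c) for c in context_csns}
    return sorted({sid_of(c) for c in entry_csns} - covered)


# entryCSN of cn=zimbra, as recorded in the accesslog excerpt.
entry_csns = ["20120517224907.221672Z#000000#000#000000"]

# Assumed state of server1: contextCSN values for SID=1 and SID=2 only.
context_csns = [
    "20120517230823.615364Z#000000#001#000000",
    "20120517230546.409118Z#000000#002#000000",
]

print(uncovered_sids(entry_csns, context_csns))  # [0]
```

A non-empty result before enabling MMR would suggest the server has not yet generated its initial contextCSN for the old SID.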