OpenLDAP version(s): 2.4.32 and recent git master
(54a111907b48900e1ce5de12a9d87818a334423d)
OS: CentOS 5, 32-bit
In a master/slave delta-syncrepl configuration, syncrepl connections from the fully
synchronized and restarted slave to the freshly loaded master will fail because the
master uses an earlier contextCSN. This state persists until the first modification
on the master, when the contextCSN becomes newer than the consumer's, whose next
syncrepl retry will then succeed.
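The failure mode comes down to plain string comparison: CSN values are zero-padded so that lexicographic order matches chronological order. A minimal sketch using the CSNs from this report (illustrative only, not slapd's actual code):

```python
# CSNs have the form <timestamp>.<microseconds>Z#<count>#<sid>#<mod>,
# zero-padded so that comparing them as strings orders them in time.
provider_ctx = "20120910091205.088888Z#000000#000#000000"    # master's accesslog contextCSN
consumer_cookie = "20120910091205.089003Z#000000#000#000000" # cookie saved by the slave

# The restarted consumer presents a cookie newer than anything the
# provider believes it has ever written, so the session cannot proceed.
assert consumer_cookie > provider_ctx
```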
Details (master and slave configurations are attached to this message):
The master is freshly loaded with the following LDIF:
dn: dc=example,dc=com
objectClass: locality
objectClass: dcObject
dc: example

dn: ou=People,dc=example,dc=com
objectClass: organizationalUnit
ou: People
When the master is started, there is only one entry in cn=accesslog:
dn: cn=accesslog
objectClass: auditContainer
cn: accesslog
structuralObjectClass: auditContainer
entryDN: cn=accesslog
subschemaSubentry: cn=Subschema
contextCSN: 20120910091205.088888Z#000000#000#000000
hasSubordinates: FALSE
The root entry of the main database looks like this:
dn: dc=example,dc=com
objectClass: locality
objectClass: dcObject
dc: example
structuralObjectClass: locality
entryUUID: 921f3734-24fc-4efa-ac52-10aa4d293332
creatorsName: cn=Manager,dc=example,dc=com
createTimestamp: 20120910090757Z
entryCSN: 20120910090757.483217Z#000000#000#000000
modifiersName: cn=Manager,dc=example,dc=com
modifyTimestamp: 20120910090757Z
entryDN: dc=example,dc=com
subschemaSubentry: cn=Subschema
auditContext: cn=accesslog
contextCSN: 20120910091205.089003Z#000000#000#000000
hasSubordinates: TRUE
When the slave server (with an empty database) is started, it connects to the master and
synchronizes. Their synchronization dialog, captured with -d sync, looks like this:
(master)
504db377 syncprov_search_response: cookie=rid=000,csn=20120910091205.089003Z#000000#000#000000
504db377 syncprov_search_response: cookie=rid=000,csn=20120910091205.088888Z#000000#000#000000
(slave)
504db370 syncrepl_message_to_entry: rid=000 DN: dc=example,dc=com, UUID: 921f3734-24fc-4efa-ac52-10aa4d293332
504db370 syncrepl_entry: rid=000 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_ADD)
504db370 syncrepl_entry: rid=000 inserted UUID 921f3734-24fc-4efa-ac52-10aa4d293332
504db370 syncrepl_entry: rid=000 be_search (32)
504db370 syncrepl_entry: rid=000 dc=example,dc=com
504db370 syncrepl_entry: rid=000 be_add dc=example,dc=com (0)
504db370 syncrepl_message_to_entry: rid=000 DN: ou=People,dc=example,dc=com, UUID: 08fe53fb-724a-4eca-a57e-3c271d99dd7e
504db370 syncrepl_entry: rid=000 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_ADD)
504db370 syncrepl_entry: rid=000 inserted UUID 08fe53fb-724a-4eca-a57e-3c271d99dd7e
504db370 syncrepl_entry: rid=000 be_search (0)
504db370 syncrepl_entry: rid=000 ou=People,dc=example,dc=com
504db370 syncrepl_entry: rid=000 be_add ou=People,dc=example,dc=com (0)
504db370 do_syncrep2: rid=000 LDAP_RES_SEARCH_RESULT
504db370 do_syncrep2: rid=000 cookie=rid=000,csn=20120910091205.089003Z#000000#000#000000
504db370 slap_queue_csn: queing 0x836dcf8 20120910091205.089003Z#000000#000#000000
504db370 slap_graduate_commit_csn: removing 0x8371328 20120910091205.089003Z#000000#000#000000
504db370 do_syncrep2: rid=000 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
504db370 do_syncrep2: rid=000 cookie=rid=000,csn=20120910091205.088888Z#000000#000#000000
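For reference, the cookie in these log lines is a comma-separated list of rid and csn fields; a minimal parser (illustrative helper, not part of OpenLDAP) shows which value the slave stores:

```python
def parse_cookie(cookie: str) -> dict:
    """Split a syncrepl cookie like 'rid=000,csn=<CSN>' into its fields."""
    return dict(field.split("=", 1) for field in cookie.split(","))

# The last cookie the slave received in the dialog above:
c = parse_cookie("rid=000,csn=20120910091205.088888Z#000000#000#000000")
# c["rid"] identifies the replication session; c["csn"] is the
# accesslog contextCSN that the master sent last.
```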
The master sends the cookie with the contextCSN belonging to cn=accesslog. If the
slave is now restarted, what happens is this:
504db38d do_syncrep2: rid=000 LDAP_RES_SEARCH_RESULT
504db38d do_syncrep2: rid=000 LDAP_RES_SEARCH_RESULT (53) Server is unwilling to perform
504db38d do_syncrep2: rid=000 (53) Server is unwilling to perform
504db38d do_syncrepl: rid=000 rc -2 retrying
The slave sends csn=20120910091205.089003Z#000000#000#000000, but the master uses
csn=20120910091205.088888Z#000000#000#000000. When the master is subsequently modified,
it will use a newer contextCSN, and the slave's next retry will succeed.
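The self-correction can be sketched as follows: once any write bumps the master's contextCSN past the consumer's cookie, the refresh is accepted on the next retry (hypothetical helper and CSN values, not slapd internals):

```python
def provider_accepts(provider_csn: str, cookie_csn: str) -> bool:
    # A provider can only serve a refresh if its contextCSN is at least
    # as new as the consumer's cookie; otherwise it answers
    # LDAP_UNWILLING_TO_PERFORM (53) and the consumer schedules a retry.
    return provider_csn >= cookie_csn

before_write = "20120910091205.088888Z#000000#000#000000"  # master's CSN at startup
cookie       = "20120910091205.089003Z#000000#000#000000"  # slave's saved cookie
after_write  = "20120910091301.000000Z#000000#000#000000"  # hypothetical CSN of a later write

assert not provider_accepts(before_write, cookie)  # slave retries (retry="60 +")
assert provider_accepts(after_write, cookie)       # next retry succeeds
```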
Is this worthy of an ITS? The situation is unlikely to occur in practice and will
correct itself, leaving the slave out of sync only for the duration of the retry
interval.
Master configuration (slapd.conf):
include ../servers/slapd/schema/core.schema
argsfile master/run/slapd.args
pidfile master/run/slapd.pid
database mdb
suffix cn=accesslog
directory master/accesslog
index entryCSN,objectClass,reqEnd,reqResult,reqStart eq
overlay syncprov
syncprov-nopresent TRUE
syncprov-reloadhint TRUE
database mdb
suffix "dc=example,dc=com"
rootdn "cn=Manager,dc=example,dc=com"
rootpw secret
maxsize 256000000
directory master/main
index objectClass eq
index entryCSN,entryUUID eq
overlay syncprov
syncprov-checkpoint 1000 60
overlay accesslog
logdb cn=accesslog
logops writes
logsuccess TRUE
logpurge 07+00:00 01+00:00
Replica configuration (slapd.conf):

include ../servers/slapd/schema/core.schema
argsfile replica/run/slapd.args
pidfile replica/run/slapd.pid
database mdb
suffix "dc=example,dc=com"
rootdn "cn=Manager,dc=example,dc=com"
rootpw secret
maxsize 256000000
directory replica/main
index objectClass eq
index entryUUID eq
syncrepl rid=0
provider=ldap://localhost:6389/
bindmethod=simple
binddn="cn=Manager,dc=example,dc=com"
credentials=secret
searchbase="dc=example,dc=com"
logbase="cn=accesslog"
logfilter="(&(objectClass=auditWriteObject)(reqResult=0))"
schemachecking=on
type=refreshAndPersist
retry="60 +"
syncdata=accesslog
updateref "ldap://localhost:6389"