Jonathan Clarke wrote:
Hi Jonathan,
This looks very much like ITS#6059.
Interesting reading. So it seems that there is a known race condition with ABANDON. I did manage to find out that the person who ran the original delete thought it had hung, but I'm not sure whether that resulted in an automatic (timeout) or manual ABANDON.
(lots of questions cut)
- Would a simple restart of slapd on the slave cause it to
resynchronise correctly with the master?
No, if my assumptions above are correct: as long as the syncrepl consumer didn't lose its connection to the master, other updates will have been propagated and the consumer's cookie will have been updated, skipping over the delete operation.
To resync, start your consumer(s) once with the -c rid=nnn command line option (with nnn being the rid of the syncrepl statement in config).
Alas, it doesn't seem to help :( While I can see the log file on the slave filling up with each entry DN as it checks it against the master, a quick grep showed that the slave never checked the master for the existence of the bad DN "cn=001901717" to determine whether it should be kept, and consequently it wasn't removed from the slave when the synchronisation finished.
I'm wondering if this could be an OpenLDAP bug: when invoked with the "-c rid=<foo>" syntax, does the slave download each DN from the master and verify it against its local copy, and hence never remove extra entries that remain on the slave but aren't present on the master?
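One rough way to confirm which entries are left over would be to dump the DNs from both servers and compare them. Below is a minimal Python sketch, assuming the two dumps are plain LDIF text (for instance from ldapsearch -LLL with the attribute list 1.1 run against each server). The function names and the dc=example,dc=com suffix are made up for illustration, and the lowercase comparison is only a crude stand-in for proper DN normalisation.

```python
def read_dns(ldif_text):
    """Collect the DN of every entry found in plain LDIF text.

    Assumption: DNs appear as "dn: ..." lines; base64-encoded
    DNs ("dn:: ...") are not handled in this sketch.
    """
    dns = set()
    current = None
    # The trailing "" acts as a sentinel so the last DN is flushed.
    for line in ldif_text.splitlines() + [""]:
        if current is not None and line.startswith(" "):
            # LDIF folds long lines; a continuation starts with one space.
            current += line[1:]
            continue
        if current is not None:
            # Crude normalisation: trim and lowercase (an assumption,
            # not real DN matching).
            dns.add(current.strip().lower())
            current = None
        if line.startswith("dn: "):
            current = line[4:]
    return dns


def extra_on_slave(master_ldif, slave_ldif):
    """Return the DNs present on the slave but missing from the master."""
    return sorted(read_dns(slave_ldif) - read_dns(master_ldif))


if __name__ == "__main__":
    # Hypothetical sample data; the suffix is invented for this example.
    master = "dn: ou=people,dc=example,dc=com\n\n"
    slave = (
        "dn: ou=people,dc=example,dc=com\n\n"
        "dn: cn=001901717,ou=people,dc=ex\n"
        " ample,dc=com\n\n"
    )
    for dn in extra_on_slave(master, slave):
        print(dn)
```

Any DN this prints exists only in the slave dump, which is what I'd expect for "cn=001901717" if the delete really was skipped.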
Also another interesting thing I've noticed is that ou=salvage (and its immediate child DN ou=people) both have totally the wrong objectClass. If I point an LDAP browser (JXplorer) directly at ou=salvage on the slave then I see the following:
objectClass: top
objectClass: glue
Interestingly, though, I see the correct objectClasses for ou=salvage if I do a simple ldapsearch on localhost on the slave. Weird.
ATB,
Mark.