Recently, a fluke in our configuration distribution system caused one of our consumers (running 2.3.41) to have stale schema information. slapd at debuglevel 16384 emitted:
syncrepl_message_to_op: rid 001 mods check (objectClass: value #0 invalid per syntax)
We had *not* specified schemachecking in this syncrepl stanza, and slapd.conf(5) says:
The schema checking can be enforced at the LDAP Sync consumer site by turning on the schemachecking parameter. The default is off.
Should this error have been raised in this case? I tried explicitly disabling schemachecking ("schemachecking=off" in the syncrepl stanza), but this error was still raised.
Once the schema was updated appropriately, I started slapd (again at debuglevel 16384) and saw syncrepl operations being successfully executed:
syncrepl_message_to_op: rid 001 be_modify uid=example,[...],o=org (0)
Thinking all was well, I ^C'd slapd, and slapd shut itself down successfully. I restarted slapd using an init script, but the backend's contextCSN didn't start incrementing. Once again at debuglevel 16384:
null_callback: error code 0x10
syncrepl_message_to_op: rid 001 be_modify uid=bad-objectclass,[...],o=org (16)
do_syncrepl: rid 001 retrying (29 retries left)
uid=bad-objectclass is the same entry that triggered the schemachecking error in the first place. Error 0x10 is LDAP_NO_SUCH_ATTRIBUTE, and this seems a lot like the symptoms described in this thread:
http://www.openldap.org/lists/openldap-software/200801/msg00126.html
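For anyone reading along with their own logs, here is a minimal sketch of how the trailing number in those syncrepl_message_to_op lines can be decoded. The result-code names are a small subset taken from RFC 4511; the regular expression is only an assumption based on the two log lines quoted above, and the function name is made up for illustration.

```python
import re

# Subset of LDAP result codes as named in RFC 4511, Appendix A.
LDAP_RESULT_NAMES = {
    0: "success",
    16: "noSuchAttribute",          # 0x10, LDAP_NO_SUCH_ATTRIBUTE
    17: "undefinedAttributeType",
    20: "attributeOrValueExists",
    21: "invalidAttributeSyntax",
    32: "noSuchObject",
    65: "objectClassViolation",
    68: "entryAlreadyExists",
}

# Assumed line shape: syncrepl_message_to_op: rid 001 be_modify <dn> (16)
_OP_LINE = re.compile(
    r"syncrepl_message_to_op: rid (\d+) (be_\w+) (.*) \((\d+)\)\s*$"
)


def decode_syncrepl_result(line):
    """Return (rid, operation, dn, code, name), or None if the line does not match."""
    match = _OP_LINE.search(line)
    if match is None:
        return None
    rid, operation, dn, code = match.groups()
    code = int(code)
    return rid, operation, dn, code, LDAP_RESULT_NAMES.get(code, "unknown")
```

Feeding it the be_modify line for uid=bad-objectclass yields code 16 with the name noSuchAttribute. The earlier "mods check" line does not match, since it carries a message instead of a result code.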
To make a long story short: does syncrepl not update the backend's contextCSN until it has processed its entire backlog? To check, I stopped another consumer and let a backlog build, then started it at debuglevel 16384 and watched the backend's contextCSN with ldapsearch(1). The contextCSN didn't change until the backlog was completely processed, even though each replicated change was visible via ldapsearch(1) as soon as it was applied.
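The check described above could be scripted roughly as follows. This is only a sketch. It assumes ldapsearch(1) is on the PATH, that an anonymous simple bind (-x) may read contextCSN on the suffix entry, and that you supply your own consumer URI and suffix. The function names are made up for illustration.

```python
import subprocess
import time


def extract_context_csn(ldif_text):
    """Return the contextCSN values found in ldapsearch LDIF output."""
    unfolded = ldif_text.replace("\n ", "")  # undo LDIF line folding
    values = []
    for line in unfolded.splitlines():
        name, sep, value = line.partition(": ")
        if sep and name.lower() == "contextcsn":
            values.append(value.strip())
    return values


def read_context_csn(uri, suffix):
    """Run a base-scope search on the suffix entry and return its contextCSN values."""
    result = subprocess.run(
        ["ldapsearch", "-x", "-LLL", "-H", uri,
         "-s", "base", "-b", suffix, "contextCSN"],
        check=True, capture_output=True, text=True,
    )
    return extract_context_csn(result.stdout)


def watch_context_csn(uri, suffix, interval=1.0, rounds=60):
    """Print a timestamped line each time the contextCSN values change."""
    last = None
    for _ in range(rounds):
        current = read_context_csn(uri, suffix)
        if current != last:
            print(time.strftime("%H:%M:%S"), current)
            last = current
        time.sleep(interval)
```

Something like watch_context_csn("ldap://consumer.example.org", "o=org") (hostname hypothetical) run alongside the consumer would show whether the value moves while the backlog is being applied or only once it is finished.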
If a consumer applies replicated changes without updating the backend's contextCSN and is then stopped, it will try to re-process the same changes when it starts up again, which will generally fail. This seems to leave one in a bind: either work out the correct contextCSN value and set it by hand, or remove the backend's data files and let syncrepl rebuild them from scratch. If this assessment is correct, this behavior doesn't seem desirable.
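To make the suspected failure mode concrete, here is a toy model of it. It is not slapd's code or data structures: the class, the integer "CSNs", the entry and the backlog are all invented for illustration, and it simply assumes the hypothesis above (the saved position is written only once the whole backlog is done).

```python
NO_SUCH_ATTRIBUTE = 16  # LDAP result code 0x10


class ToyConsumer:
    """Toy model only; integers stand in for real CSN values."""

    def __init__(self):
        self.entries = {"uid=example": {"mail": {"a@example.org", "b@example.org"}}}
        self.context_csn = 0  # saved replication position

    def apply(self, mod):
        """Apply a 'delete this value' modification and return a result code."""
        _csn, dn, attr, value = mod
        values = self.entries[dn].get(attr, set())
        if value not in values:
            return NO_SUCH_ATTRIBUTE  # the value is already gone
        values.discard(value)
        return 0

    def sync(self, backlog, interrupt_after=None, commit_each=False):
        """Replay mods newer than context_csn and return their result codes."""
        results = []
        pending = [mod for mod in backlog if mod[0] > self.context_csn]
        for count, mod in enumerate(pending, start=1):
            code = self.apply(mod)
            results.append(code)
            if code != 0:
                return results  # the toy just stops where slapd would retry
            if commit_each:
                self.context_csn = mod[0]
            if interrupt_after is not None and count == interrupt_after:
                return results  # stopped mid-backlog
        if pending:
            self.context_csn = pending[-1][0]  # saved only at the very end
        return results


BACKLOG = [
    (1, "uid=example", "mail", "a@example.org"),
    (2, "uid=example", "mail", "b@example.org"),
]
```

In this model, stopping after the first change leaves context_csn at 0, so the next sync replays change 1 and gets result 16, much like the log above. With commit_each=True the restart resumes at change 2 instead.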
john