The bdb/hdb and ldif backends assign CSNs to delete operations that
lack one, which causes problems in forwarding replication
configurations. During the refresh phase there may be legitimate
delete operations that should not have any CSN. When the forwarder
adds its own CSN it can leave the forwarder and its consumers with a
CSN set that includes a SID not present on the provider, after which
they will never be able to resync.
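
To make the failure mode concrete, here is a small standalone sketch
(plain C, made-up CSN values; assume for illustration that the
provider runs with serverID 1 and the forwarder with serverID 2) of
how the forwarder's own SID ends up in a contextCSN set the provider
will never match:

  /* Illustration only, not slapd code.  If the forwarder stamps a
   * CSN-less delete with its own CSN, its contextCSN set (and that of
   * its consumers) gains an element with SID 002 that the provider
   * will never publish, so the sets can never converge. */
  #include <stdio.h>

  int main(void)
  {
      const char *provider_ctxcsn[] = {
          "20080401120000.000000Z#000000#001#000000",
      };
      const char *forwarder_ctxcsn[] = {
          "20080401120000.000000Z#000000#001#000000",
          "20080401120005.000000Z#000000#002#000000", /* added by forwarder */
      };
      size_t i;

      puts("provider contextCSN:");
      for (i = 0; i < sizeof provider_ctxcsn / sizeof provider_ctxcsn[0]; i++)
          printf("  %s\n", provider_ctxcsn[i]);

      puts("forwarder contextCSN after stamping a CSN-less delete:");
      for (i = 0; i < sizeof forwarder_ctxcsn / sizeof forwarder_ctxcsn[0]; i++)
          printf("  %s\n", forwarder_ctxcsn[i]);
      return 0;
  }

Once that SID 002 element is in the cookie, the forwarder and its
consumers wait forever for the provider to catch up on a CSN it will
never produce.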
syncrepl_del_nonpresent() queues the minimum CSN received from the
provider, which partly obscures this problem but in return introduces
others :-( The CSN set received may include updates to more than one
CSN, and only one of these can be added to the queue. Much worse, the
first delete will commit the queued CSN. If there is more than one
entry to be deleted, this leaves an open window where the forwarder
(and its consumers) have an apparently up-to-date CSN set without
actually being in sync with the provider. Running the new test061
with sync debugging shows traces of these problems in the logs.
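
The ordering problem can be sketched like this (queue_csn(),
commit_csn() and delete_entry() are hypothetical stand-ins for the
real syncrepl and backend machinery, not slapd code):

  #include <stdio.h>

  static const char *queued_csn;

  static void queue_csn(const char *csn) { queued_csn = csn; }
  static void commit_csn(void)           { printf("contextCSN -> %s\n", queued_csn); }

  static void delete_entry(const char *dn, int first)
  {
      printf("deleting %s\n", dn);
      if (first)
          commit_csn();   /* the first delete commits the queued CSN... */
  }

  int main(void)
  {
      const char *nonpresent[] = {
          "cn=a,dc=example,dc=com",
          "cn=b,dc=example,dc=com",
          "cn=c,dc=example,dc=com",
      };
      size_t i;

      queue_csn("20080401120000.000000Z#000000#001#000000"); /* minimum CSN */

      for (i = 0; i < sizeof nonpresent / sizeof nonpresent[0]; i++)
          delete_entry(nonpresent[i], i == 0);

      /* ...so between the first and the last delete the cookie already
       * looks up to date although cn=b and cn=c are still present; a
       * consumer syncing from the forwarder in that window misses them. */
      return 0;
  }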
In back-bdb/delete.c, the CSN of the delete operation appears to be
added as a value in the entryCSN index, which really puzzles me. If
that index is to be modified at all, I would expect it to delete the
entryCSN value of the entry being deleted, not to add anything. Why
this is only done for non-shadowed databases I cannot tell either.
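
To spell out that expectation, a toy sketch (hypothetical
idx_add()/idx_del() helpers and made-up values, not the real back-bdb
index code) of what I would expect on delete versus what delete.c
appears to do:

  #include <stdio.h>

  static void idx_add(const char *csn, long id) { printf("add %s -> %ld\n", csn, id); }
  static void idx_del(const char *csn, long id) { printf("del %s -> %ld\n", csn, id); }

  int main(void)
  {
      const char *entry_csn  = "20080401115500.000000Z#000000#001#000000"; /* entryCSN of the entry */
      const char *delete_csn = "20080401120000.000000Z#000000#001#000000"; /* CSN of the delete op  */
      long id = 42;   /* made-up entry id */

      idx_del(entry_csn, id);   /* what I would expect on delete        */
      idx_add(delete_csn, id);  /* what back-bdb/delete.c appears to do */
      return 0;
  }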
I would fix these problems by assigning the CSN of delete operations
in the frontend, i.e. on the server where the ordinary delete
operation was performed. syncrepl_del_nonpresent() should not queue
the CSN; updating it should be left to the syncrepl_updateCookie()
call that takes place when the refresh phase completes. But what to do
about the index manipulation I cannot tell. Anyone?
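
To make the direction concrete, something along these lines (op_t and
assign_new_csn() are hypothetical names, this is not a patch against
the actual frontend code):

  #include <stdio.h>

  typedef struct op {
      const char *csn;        /* CSN carried by the operation, if any      */
      int         replicated; /* arrived via syncrepl rather than a client */
  } op_t;

  static const char *assign_new_csn(void)
  {
      /* placeholder: a real implementation would build this from the
       * local serverID and the current time */
      return "20080401120010.000000Z#000000#002#000000";
  }

  /* frontend delete path: stamp a CSN only on locally originated deletes */
  static void frontend_delete(op_t *op)
  {
      if (op->csn == NULL && !op->replicated)
          op->csn = assign_new_csn();
      /* hand the operation to the backend; the backend never invents a
       * CSN of its own */
  }

  int main(void)
  {
      op_t local = { NULL, 0 }, replicated = { NULL, 1 };

      frontend_delete(&local);
      frontend_delete(&replicated);

      printf("local delete CSN:      %s\n", local.csn ? local.csn : "(none)");
      printf("replicated delete CSN: %s\n", replicated.csn ? replicated.csn : "(none)");
      return 0;
  }

With that, replicated deletes keep whatever CSN the provider sent (or
none at all), and the contextCSN only moves forward when
syncrepl_updateCookie() runs at the end of the refresh.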
Rein