The bdb/hdb and ldif backends assign CSNs to delete operations that lack one, which causes problems in forwarding replication configurations. During the refresh phase there may be legitimate delete operations that should not have any CSN. When the forwarder adds its own CSN it may leave the forwarder and its consumers with a CSN set that includes a SID not present on the provider, and they will never be able to resync.
syncrepl_del_nonpresent() queues the minimum CSN received from the provider, which partly obscures this problem but in return introduces others :-( The CSN set received may include updates to more than one CSN, and only one of these can be added to the queue. Much worse, the first delete will commit the queued CSN. If more than one entry should be deleted, this leaves an open window where the forwarder (and its consumers) have an apparently up-to-date CSN set without actually being in sync with the provider. Running the new test061 with sync debugging shows traces of these problems in the logs.
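To make the window concrete, here is a minimal stand-alone C sketch of the sequence as I understand it; all names are hypothetical stand-ins, none of them are real slapd symbols:

#include <stdio.h>

#define NUM_NONPRESENT 3

static const char *committed_csn = "old-csn";        /* what the cookie claims          */
static int         entries_left  = NUM_NONPRESENT;   /* real divergence still remaining */
static const char *queued_csn;                       /* provider's minimum CSN, queued  */

static void delete_nonpresent_entry(void)
{
    /* The first delete commits the queued CSN, even though more
     * non-present entries still have to be removed. */
    if (queued_csn) {
        committed_csn = queued_csn;
        queued_csn = NULL;
    }
    entries_left--;
    printf("deleted one entry: cookie says %s, %d entr%s still out of sync\n",
           committed_csn, entries_left, entries_left == 1 ? "y" : "ies");
}

int main(void)
{
    queued_csn = "provider-min-csn";        /* queued before the deletes start         */
    for (int i = 0; i < NUM_NONPRESENT; i++)
        delete_nonpresent_entry();          /* window is open after the first call     */
    return 0;
}

The output shows the cookie advancing to the provider's CSN on the first delete while entries remain to be removed, which is exactly the inconsistency the logs hint at.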
In back-bdb/delete.c, the CSN of the delete operation appears to be added as a value in the entryCSN index, which really puzzles me. If that index is to be modified at all, I would expect it to delete the entryCSN value of the entry being deleted, not add anything. Why this is only done in non-shadowed databases I cannot tell either.
I would fix these problems by assigning the CSN of delete operations in the frontend, i.e. on the server where the ordinary delete operation was done. syncrepl_del_nonpresent() should not queue the CSN; updating it should be left to the syncrepl_updateCookie() call, which takes place when the refresh phase completes. But what to do about the index manipulation I cannot tell. Anyone?
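For the CSN part, roughly what I have in mind is something like the following; this is only a hypothetical sketch with made-up names, not the real slapd code paths:

#include <stdio.h>

struct delete_op {
    const char *csn;           /* CSN attached to the operation, NULL if none     */
    int         is_replicated; /* arrived via syncrepl refresh/persist            */
};

/* Hypothetical frontend hook: ordinary client deletes get a CSN stamped once,
 * replicated/refresh deletes keep whatever (if anything) the provider sent,
 * and the backends never invent one of their own. */
void frontend_assign_delete_csn(struct delete_op *op, const char *(*new_csn)(void))
{
    if (op->is_replicated)
        return;
    if (op->csn == NULL)
        op->csn = new_csn();
}

static const char *fake_csn(void) { return "locally-generated-csn"; }

int main(void)
{
    struct delete_op local   = { NULL, 0 };  /* ordinary delete from a client       */
    struct delete_op refresh = { NULL, 1 };  /* refresh-phase delete, no CSN wanted */

    frontend_assign_delete_csn(&local, fake_csn);
    frontend_assign_delete_csn(&refresh, fake_csn);

    printf("local delete CSN:   %s\n", local.csn ? local.csn : "(none)");
    printf("refresh delete CSN: %s\n", refresh.csn ? refresh.csn : "(none)");
    return 0;
}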
Rein
Rein Tollevik wrote:
The bdb/hdb and ldif backends assign CSNs to delete operations that lack one, which causes problems in forwarding replication configurations. During the refresh phase there may be legitimate delete operations that should not have any CSN. When the forwarder adds its own CSN it may leave the forwarder and its consumers with a CSN set that includes a SID not present on the provider, and they will never be able to resync.
OK, sounds like that should be fixed.
syncrepl_del_nonpresent() queues the minimum CSN received from the provider, which partly obscures this problem but in return introduces others :-( The CSN set received may include updates to more than one CSN, and only one of these can be added to the queue. Much worse, the first delete will commit the queued CSN. If more than one entry should be deleted, this leaves an open window where the forwarder (and its consumers) have an apparently up-to-date CSN set without actually being in sync with the provider. Running the new test061 with sync debugging shows traces of these problems in the logs.
In back-bdb/delete.c, the CSN of the delete operation appears to be added as a value in the entryCSN index, which really puzzles me. If that index is to be modified at all, I would expect it to delete the entryCSN value of the entry being deleted, not add anything. Why this is only done in non-shadowed databases I cannot tell either.
bdb_index_entry_del is already invoked to remove all appropriate index values. IIRC, this particular patch was done to accommodate the entryCSN>=foo search that syncprov performs. Probably this only matters if a Delete was the last operation on a DB just before shutdown. Hard to say if it's still relevant.
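For reference, the shape of that search is just an ordering filter on entryCSN. A hedged client-side illustration with libldap, using a placeholder URI, base DN and cookie CSN (syncprov does the equivalent internally, not through this API):

#include <ldap.h>
#include <stdio.h>

int main(void)
{
    LDAP *ld;
    LDAPMessage *res;
    char *attrs[] = { "entryCSN", NULL };
    /* Placeholder cookie CSN; syncprov substitutes the consumer's cookie value. */
    const char *filter = "(entryCSN>=20080101000000.000000Z#000000#000#000000)";

    if (ldap_initialize(&ld, "ldap://localhost:389") != LDAP_SUCCESS)
        return 1;

    if (ldap_search_ext_s(ld, "dc=example,dc=com", LDAP_SCOPE_SUBTREE,
                          filter, attrs, 0, NULL, NULL, NULL,
                          LDAP_NO_LIMIT, &res) == LDAP_SUCCESS) {
        printf("%d entries at or after the cookie CSN\n",
               ldap_count_entries(ld, res));
        ldap_msgfree(res);
    }
    ldap_unbind_ext_s(ld, NULL, NULL);
    return 0;
}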
I would fix these problems by assigning the CSN of delete operations in the frontend, i.e. on the server where the ordinary delete operation was done. syncrepl_del_nonpresent() should not queue the CSN; updating it should be left to the syncrepl_updateCookie() call, which takes place when the refresh phase completes. But what to do about the index manipulation I cannot tell. Anyone?