I have created a test script that shows the contextCSN propagation problems we have, as well as a patch that fixes most of them. But before I commit anything or file more ITSs I'd like to receive comments. The test script and patch can be found at:
ftp://ftp.openldap.org/incoming/test054-syncrepl-asymmetric
ftp://ftp.openldap.org/incoming/test054-syncrepl-asymmetric.patch
The comments in the script should explain the configuration. The patch adds support for the newCookie sync info protocol message, and sends it when the contextCSN is updated without any changes being sent from the syncprov provider (i.e. when the entries that changed didn't match the syncrepl filter and/or were forbidden by ACL rules). The queuing of the CSNs done in syncrepl_updateCookie should not be necessary with this patch, and should be removed after the patch has been committed. The bug reported in ITS#5710 is also worked around by this patch, but defining SLAP_SYNC_UPDATE_MSGID with a unique value would be better.
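To illustrate the intended behaviour (not the actual patch, which is C code in syncprov), here is a minimal Python sketch. The function name, the tuple shapes and the `visible` predicate are all hypothetical; `visible` stands in for the combined syncrepl filter and ACL check. The idea is only that every change advances the cookie on the consumer, whether or not the entry itself is sent.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical shapes, for illustration only:
#   a change is (entry DN, CSN of the change)
#   a message is (kind, DN or None, CSN carried in the cookie)
Change = Tuple[str, str]
Message = Tuple[str, Optional[str], str]


def messages_for_consumer(changes: List[Change],
                          visible: Callable[[str], bool]) -> List[Message]:
    """Decide what the provider sends for each change, in order."""
    out: List[Message] = []
    for dn, csn in changes:
        if visible(dn):
            # Normal case: the entry is sent and the cookie travels with it.
            out.append(("entry", dn, csn))
        else:
            # Entry filtered out or denied by ACL: send only the new cookie,
            # so the consumer's contextCSN still advances.
            out.append(("newCookie", None, csn))
    return out
```

With a change to a visible entry followed by a change to a hidden one, the consumer would receive one entry message and one newCookie message, and end up with the CSN of the second change.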
This test requires the proposed patch for ITS#5572 to be applied. Without it, dynamically creating the configuration doesn't work as expected, and the number of errors reported by the script increases from 10 to 25, mostly because the replication starts in an unexpected state and several restarts and writes are needed before an expected state is reached. The patch can be found at:
ftp://ftp.openldap.org/incoming/ITS5572.patch
The test script finds 10 errors (assuming the race condition discussed below is actually hit) after the ITS#5572 patch has been applied. With both patches I'm left with 3 errors, all related to the race condition.
Howard, you'll probably object to the rootdn usage in this test. But as of now there is no way I know of that allows me to create the layout I need without it. Extending syncrepl with the ability to exclude a list of DN subtrees from its control could be the solution. But in that case I assume it would be better to have a list of URIs (evaluated locally) that defines the entries to exclude. That again opens a can of new questions as to when the URIs should be evaluated. For add and delete it should be pretty obvious, but what about modify? Test against the entry as it is before the change, as it would be after the change, or both?
Hmm, the ability to exclude a list of URIs from syncrepl control could be (ab)used to merge some types of entries from one server and other types from other servers, if anyone should wish to do such weird things.
The errors remaining after these two patches have been applied are due to the race condition that arises if more than one subordinate database replicates from the same provider. I have a clear idea as to how this race can be eliminated. It must be addressed in both syncrepl and syncprov:
In syncrepl it should be fairly easy, as storing the contextCSN values in the suffix of the database where syncrepl is configured rather than the suffix of the glue entry should be sufficient.
Syncprov needs a bit more work. Whenever it detects that syncrepl updates the contextCSN value it must use the *lowest* contextCSN value for any given sid stored in all the databases within its context. And no updates must be accepted from syncrepl until a contextCSN value has been stored in the suffix DN of all databases (within its context) where syncrepl has been enabled. Assuming syncrepl cannot be used on both a superior and a subordinate database (which could no longer be true if the ability to exclude something from its control is introduced), it should not be required to make these extra tests when syncrepl updates the contextCSN in the suffix of the glue database itself (i.e. when syncrepl and syncprov are configured in the same database). The sending of newCookie messages should cause the contextCSN values to be updated to the newest values fairly quickly.
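A rough Python sketch of the rule proposed for syncprov follows. It is not existing code; the function names and the dictionary layout are made up for illustration. It assumes CSNs in the usual `timestamp#count#sid#mod` string form, so that the sid is the third `#`-separated field and CSNs of equal format compare correctly as plain strings. It also assumes that a sid missing from one database simply doesn't constrain the minimum, which is my reading and open for discussion.

```python
from typing import Dict, List, Optional


def sid_of(csn: str) -> str:
    """Return the sid field of a CSN like 'time#count#sid#mod'."""
    return csn.split("#")[2]


def lowest_csn_per_sid(databases: Dict[str, List[str]]) -> Optional[Dict[str, str]]:
    """databases maps the suffix of each syncrepl-enabled database (within
    the syncprov context) to the contextCSN values stored in its suffix entry.

    Returns None while any database has no contextCSN stored yet, meaning
    no updates from syncrepl should be accepted. Otherwise returns the
    lowest CSN seen for each sid across all the databases.
    """
    if any(not csns for csns in databases.values()):
        return None

    lowest: Dict[str, str] = {}
    for csns in databases.values():
        for csn in csns:
            sid = sid_of(csn)
            # Assumption: same-format CSNs order correctly as strings.
            if sid not in lowest or csn < lowest[sid]:
                lowest[sid] = csn
    return lowest
```

Using the lowest value means the provider never advertises a contextCSN that one of its subordinate databases has not yet reached, which is what closes the race.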
Actually, there could be another problem buried here as well, as the consumers really need a full resynchronization after new subordinate syncrepl consumers are added on their provider (at least if the new consumer replicates from the same server as an existing one). The easiest fix would be to require the clients to do a full resync whenever the config of the provider is changed, which I find a bit drastic. Noting it in the documentation should hopefully be sufficient.
Rein Tollevik Basefarm AS