https://bugs.openldap.org/show_bug.cgi?id=9282
--- Comment #7 from Howard Chu hyc@openldap.org --- (In reply to Ondřej Kuzník from comment #6)
Thanks for the reproducer script.
This is due to https://git.openldap.org/openldap/openldap/-/blob/master/servers/slapd/syncrepl.c#L1638 causing A to skip the present cull.
Based on the git history, this was introduced to deal with ITS#5470, but that seems wrong. If the number of SIDs in the cookie differs from what we requested, then either:
- a SID disappeared from the set we received, which sounds like what ITS#5470 is about? But slapd doesn't really allow this at the moment (it will say the consumer is newer than the provider), so that shouldn't happen
A SID can't disappear. They tend to stay in the contextCSN forever. (This is actually another problem: nodes that are converted from single-provider to multi-provider generally still have a SID 0 CSN, which is always ancient relative to the active SIDs. Routines that check whether the oldest CSN still exists in the DB end up doing wasteful checks because of that. Right now all you can do is use manage privs and delete the obsolete CSN.)
- a SID is added to the set by the provider, like here. This could be due to a delete (like here) and that delete has to be replicated; that is the point of running syncrepl_del_nonpresent
Yes, the problem that was being addressed is that if the local node knows about more SIDs than the remote node, then the incoming present list from the remote node can't be trusted. Doing a del_nonpresent could delete a lot of entries that the remote node doesn't know about, but exist legitimately on the local node.
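To make the asymmetry concrete, here is a minimal sketch of a direction-sensitive check, as opposed to just comparing SID counts. The function names and the plain int arrays are hypothetical and are not what syncrepl.c uses; the only point is that the present list is distrusted when the local node knows a SID the remote doesn't, and not in the opposite case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not from syncrepl.c): is sid a member of set[0..n)? */
bool sid_in_set(int sid, const int *set, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (set[i] == sid)
            return true;
    return false;
}

/*
 * Hypothetical check: the remote present list is only trusted when every
 * SID known locally also appears in the remote cookie. A remote cookie
 * with extra SIDs (the provider learned about a new SID) is still trusted,
 * so a delete made under that new SID can be culled locally.
 */
bool present_list_trustworthy(const int *local, size_t nlocal,
                              const int *remote, size_t nremote)
{
    assert(local != NULL || nlocal == 0);
    assert(remote != NULL || nremote == 0);

    for (size_t i = 0; i < nlocal; i++)
        if (!sid_in_set(local[i], remote, nremote))
            return false;   /* local knows a SID the remote has never seen */
    return true;
}
```

With local SIDs {1, 2} and remote SIDs {1, 2, 3} this returns true, so the cull would run; with the roles swapped it returns false and the cull would be skipped.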
I think a proper fix would require a change in the syncrepl protocol sequencing. E.g., two nodes should refresh from each other with all of their new Adds/Modifies first, and once those changes have been settled, then they can perform a present cross-check. This would also require saving some intermediate cookie state in case the full sequence gets interrupted.
Or, put another way, there needs to be a separately tracked contextDeleteCSN.
The above would not explain why server B then receives the deleted entry back rather than this being a silent desync. It turns out that check_syncprov() doesn't add SID 2 into the cookie [0] so it forgets its own modification when establishing the syncrepl session to A.
Howard, can you review if any of the claims above seem wrong?
[0]. https://git.openldap.org/openldap/openldap/-/blob/master/servers/slapd/syncrepl.c#L1638 - the loops should probably be inverted, with the outer loop operating on si_cookieState instead?
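For illustration, the inverted loop order could look roughly like the sketch below. This is not the code in check_syncprov(); struct csn_set and merge_state_into_cookie() are made-up stand-ins for si_cookieState and the session cookie, and it assumes CSNs of the same format order correctly under plain string comparison. With the outer loop walking the local state, a SID that is absent from the session cookie (SID 2 in this report) gets appended instead of being silently dropped.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_SIDS 16
#define CSN_LEN  64

/* Hypothetical stand-in for a set of (SID, CSN) pairs. */
struct csn_set {
    size_t n;
    int    sids[MAX_SIDS];
    char   csns[MAX_SIDS][CSN_LEN];
};

/*
 * Outer loop over the local state, inner loop over the cookie:
 * - SID present in both: keep the newer CSN in the cookie
 * - SID only in the local state: append it to the cookie
 */
void merge_state_into_cookie(const struct csn_set *state,
                             struct csn_set *cookie)
{
    for (size_t i = 0; i < state->n; i++) {
        size_t j;

        for (j = 0; j < cookie->n; j++) {
            if (cookie->sids[j] == state->sids[i]) {
                if (strcmp(state->csns[i], cookie->csns[j]) > 0)
                    strcpy(cookie->csns[j], state->csns[i]);
                break;
            }
        }
        if (j == cookie->n) {
            /* No match found: the cookie did not know this SID yet. */
            assert(cookie->n < MAX_SIDS);
            cookie->sids[cookie->n] = state->sids[i];
            strcpy(cookie->csns[cookie->n], state->csns[i]);
            cookie->n++;
        }
    }
}
```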