https://bugs.openldap.org/show_bug.cgi?id=9282
--- Comment #8 from Ondřej Kuzník ondra@mistotebe.net ---
On Thu, Jul 02, 2020 at 01:19:40PM +0000, openldap-its@openldap.org wrote:
--- Comment #7 from Howard Chu hyc@openldap.org ---
(In reply to Ondřej Kuzník from comment #6)
Thanks for the reproducer script.
This is due to https://git.openldap.org/openldap/openldap/-/blob/master/servers/slapd/syncrepl.c#L1638 causing A to skip the present cull.
Based on the git history, this was introduced to deal with ITS#5470, but that seems wrong: if the number of SIDs in the cookie differs from what we requested, then either:
- a SID disappeared from the set we received, which sounds like what ITS#5470 is about? But slapd doesn't really allow this at the moment (it will say the consumer is newer than the provider), so that shouldn't happen
A SID can't disappear. They tend to stay in the contextCSN forever. (This is actually another problem: nodes that are converted from single-provider to multi-provider generally still have a SID 0 CSN, which is always ancient relative to the active SIDs. Routines that check whether the oldest CSN still exists in the DB end up doing wasteful checks because of that. Right now all you can do is use manage privs and delete the obsolete CSN.)
Yeah, and it would not be so wasteful if we could query the database for the oldest/newest entry with a given SID in entryCSN. Removing a SID from the set is always going to be a manual operation unless we can coordinate with all provider and consumer nodes somehow.
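To make the SID 0 problem concrete, here is a minimal Python sketch of how a leftover value skews any "oldest CSN" logic. It assumes the usual CSN text form `<GeneralizedTime with microseconds>Z#<change count>#<SID>#<mod count>` with hex counters, as seen in contextCSN values; the helpers `parse_csn`, `oldest_sid` and `stale_sids` and the age threshold are hypothetical illustrations, not slapd code.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class CSN:
    time: datetime
    count: int
    sid: int
    mod: int


def parse_csn(value: str) -> CSN:
    """Parse a CSN like '20200702131940.123456Z#000000#001#000000' (hex fields assumed)."""
    stamp, count, sid, mod = value.split("#")
    when = datetime.strptime(stamp, "%Y%m%d%H%M%S.%fZ").replace(tzinfo=timezone.utc)
    return CSN(when, int(count, 16), int(sid, 16), int(mod, 16))


def oldest_sid(context_csn: list[str]) -> int:
    """SID of the oldest contextCSN value, i.e. what an 'oldest CSN' check would pick."""
    return min((parse_csn(v) for v in context_csn), key=lambda c: c.time).sid


def stale_sids(context_csn: list[str], max_lag_days: float) -> list[int]:
    """SIDs whose contextCSN lags the newest value by more than max_lag_days (hypothetical policy)."""
    csns = [parse_csn(v) for v in context_csn]
    newest = max(c.time for c in csns)
    limit = max_lag_days * 86400
    return sorted(c.sid for c in csns if (newest - c.time).total_seconds() > limit)


context_csn = [
    "20120101000000.000000Z#000000#000#000000",  # left over from single-provider days
    "20200702131940.000000Z#000000#001#000000",
    "20200702131500.000000Z#000000#002#000000",
]
print(oldest_sid(context_csn))      # 0
print(stale_sids(context_csn, 30))  # [0]
```

The stale SID 0 value always wins the "oldest" comparison, so any check keyed on it ends up asking about data from long before the active SIDs existed.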
- a SID is added to the set by the provider, like here. This could be due to a delete (like here), and that delete has to be replicated; that is the point of running syncrepl_del_nonpresent
Yes, the problem that was being addressed is that if the local node knows about more SIDs than the remote node, then the incoming present list from the remote node can't be trusted. Doing a del_nonpresent could delete a lot of entries that the remote node doesn't know about, but exist legitimately on the local node.
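The difference between the two conditions discussed above can be sketched in Python. This is a hypothetical illustration, not the code at syncrepl.c#L1638: `cull_allowed_by_count` paraphrases the "number of SIDs differs" behaviour described in this thread, and `cull_allowed_by_coverage` is one possible reading of the concern that the present list is only trustworthy when the remote node knows every SID the local node knows.

```python
def cull_allowed_by_count(requested: set[int], received: set[int]) -> bool:
    """Paraphrase of the behaviour described in the thread: any change in the
    number of SIDs between the request cookie and the final cookie skips the cull."""
    return len(requested) == len(received)


def cull_allowed_by_coverage(local: set[int], received: set[int]) -> bool:
    """Hypothetical alternative: trust the provider's present list only if the
    provider knows every SID the consumer knows; extra SIDs on its side are fine."""
    return local <= received


# The case from this report: we asked with {1, 2}, the provider ended with {1, 2, 3}.
print(cull_allowed_by_count({1, 2}, {1, 2, 3}))     # False: cull skipped, delete lost
print(cull_allowed_by_coverage({1, 2}, {1, 2, 3}))  # True: cull runs
# The case the check was meant to guard against: we know SID 4, the provider does not.
print(cull_allowed_by_coverage({1, 2, 4}, {1, 2}))  # False: present list incomplete
```

A subset test distinguishes "the provider knows more than we do" (safe to cull) from "we know more than the provider" (unsafe), which a plain count comparison lumps together.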
The scenario I describe here is one where we start a search with a cookie containing only SIDs {1, 2} but finish the present phase by receiving a cookie with SIDs {1, 2, 3}. Accepting that cookie implies we have to process the (implied) deletes too, or we have desynced.
If, in the meantime, we added entries with a SID of 4, those are not part of the original cookie and should not be deleted, that's for sure. I think we do the right thing already or are close to doing so.
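One plausible way to express "process the implied deletes, but leave SID 4 entries alone" is sketched below in Python. It is not slapd's syncrepl_del_nonpresent; the function `nonpresent_deletes`, the entry map, and the rule of skipping entries whose entryCSN is newer than the cookie's CSN for that SID are assumptions. It also relies on CSN strings of the same format sorting chronologically.

```python
def csn(stamp: str, sid: int) -> str:
    """Build a CSN string in the usual text form (change/mod counters zeroed)."""
    return f"{stamp}.000000Z#000000#{sid:03x}#000000"


def nonpresent_deletes(entries: dict[str, tuple[int, str]],
                       present: set[str],
                       cookie: dict[int, str]) -> set[str]:
    """Pick local entries to delete after a present phase (hypothetical rule).

    entries maps DN -> (SID of its entryCSN, entryCSN); cookie maps SID -> CSN
    taken from the cookie the provider sent at the end of the present phase.
    """
    doomed = set()
    for dn, (sid, entry_csn) in entries.items():
        if dn in present:
            continue  # the provider still has it
        cookie_csn = cookie.get(sid)
        if cookie_csn is None:
            continue  # SID the provider knows nothing about, e.g. local SID 4 writes
        if entry_csn > cookie_csn:
            continue  # changed locally after the state the provider described
        doomed.add(dn)
    return doomed


cookie = {sid: csn("20200702130000", sid) for sid in (1, 2, 3)}
entries = {
    "cn=a,dc=example": (1, csn("20200701000000", 1)),  # reported present
    "cn=b,dc=example": (1, csn("20200701100000", 1)),  # deleted on the provider
    "cn=c,dc=example": (4, csn("20200702125000", 4)),  # added locally with SID 4
    "cn=d,dc=example": (2, csn("20200702140000", 2)),  # modified after the cookie
}
print(nonpresent_deletes(entries, {"cn=a,dc=example"}, cookie))  # {'cn=b,dc=example'}
```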
I think a proper fix would require a change in the syncrepl protocol sequencing. E.g., two nodes should refresh from each other with all of their new Adds/Modifies first, and once those changes have been settled, then they can perform a present cross-check. This would also require saving some intermediate cookie state in case the full sequence gets interrupted.
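The proposed sequencing can be sketched as a small resumable state machine in Python. Everything here is hypothetical: the `Phase` names, the `run_sync` driver, the callbacks `pull_changes` and `present_check`, and the placeholder cookie strings stand in for whatever slapd would actually do. The point shown is that persisting the phase and intermediate cookie after each step lets an interrupted run resume at the present cross-check without redoing the change exchange.

```python
import json
from enum import Enum


class Phase(Enum):
    EXCHANGE_CHANGES = "exchange"  # both sides first pull each other's adds/modifies
    PRESENT_CHECK = "present"      # only then cross-check present lists
    DONE = "done"


def run_sync(state, pull_changes, present_check, save):
    """Drive the two-step refresh, persisting phase and cookie after each step."""
    while Phase(state["phase"]) is not Phase.DONE:
        if Phase(state["phase"]) is Phase.EXCHANGE_CHANGES:
            state["cookie"] = pull_changes(state["cookie"])
            state["phase"] = Phase.PRESENT_CHECK.value
        else:
            present_check(state["cookie"])
            state["phase"] = Phase.DONE.value
        save(json.dumps(state))  # intermediate cookie survives an interruption
    return state


saved, calls, failures = [], [], [1]


def pull_changes(cookie):
    calls.append("pull")
    return cookie + ";after-adds-mods"


def present_check(cookie):
    calls.append("check")
    if failures:
        failures.pop()
        raise ConnectionError("connection dropped during present phase")


try:
    run_sync({"phase": "exchange", "cookie": "rid=001"}, pull_changes, present_check, saved.append)
except ConnectionError:
    pass
resumed = json.loads(saved[-1])  # restart from the last persisted state
run_sync(resumed, pull_changes, present_check, saved.append)
print(calls)  # ['pull', 'check', 'check']: the change exchange is not repeated
```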
Or, put another way, there needs to be a separately tracked contextDeleteCSN.
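What "separately tracked" might mean can be sketched in Python, under the assumption that contextDeleteCSN would be a per-SID map advanced only by deletes, while contextCSN keeps advancing on every change. The helpers `new_state`, `record_change` and `needs_delete_pass` are hypothetical; the sketch only shows that a node could then tell whether a peer has seen deletes it has not, without relying on a present list.

```python
def new_state() -> dict[str, dict[int, str]]:
    return {"contextCSN": {}, "contextDeleteCSN": {}}


def record_change(state, sid: int, change_csn: str, is_delete: bool) -> None:
    """Every change advances contextCSN; deletes also advance contextDeleteCSN."""
    ctx = state["contextCSN"]
    ctx[sid] = max(ctx.get(sid, ""), change_csn)
    if is_delete:
        dels = state["contextDeleteCSN"]
        dels[sid] = max(dels.get(sid, ""), change_csn)


def needs_delete_pass(local_dels: dict[int, str], remote_dels: dict[int, str]) -> bool:
    """True if the peer has seen a delete (for any SID) newer than ours."""
    return any(c > local_dels.get(sid, "") for sid, c in remote_dels.items())


local, remote = new_state(), new_state()
for st in (local, remote):
    record_change(st, 1, "20200702120000.000000Z#000000#001#000000", False)
record_change(remote, 3, "20200702130000.000000Z#000000#003#000000", True)
print(needs_delete_pass(local["contextDeleteCSN"], remote["contextDeleteCSN"]))  # True
print(needs_delete_pass(remote["contextDeleteCSN"], local["contextDeleteCSN"]))  # False
```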
That's ITS#8125 work, I should get back to that eventually.