https://bugs.openldap.org/show_bug.cgi?id=9282
Issue ID: 9282 Summary: Syncrepl re-creates deleted entry Product: OpenLDAP Version: 2.4.50 Hardware: All OS: All Status: UNCONFIRMED Severity: normal Priority: --- Component: slapd Assignee: bugs@openldap.org Reporter: quanah@openldap.org Target Milestone: ---
Scenario:
2 node Multi-provider replication
Add database to provider A
ensure database replicates to provider B
Stop provider A
delete entry on provider B
Start provider A
Wait for provider B to reconnect to provider A
Deleted entry re-appears
https://bugs.openldap.org/show_bug.cgi?id=9282
Quanah Gibson-Mount quanah@openldap.org changed:
What |Removed |Added ---------------------------------------------------------------------------- Target Milestone|--- |2.4.51
https://bugs.openldap.org/show_bug.cgi?id=9282
--- Comment #1 from Quanah Gibson-Mount quanah@openldap.org --- Created attachment 741 --> https://bugs.openldap.org/attachment.cgi?id=741&action=edit config for node A
https://bugs.openldap.org/show_bug.cgi?id=9282
--- Comment #2 from Quanah Gibson-Mount quanah@openldap.org --- Created attachment 742 --> https://bugs.openldap.org/attachment.cgi?id=742&action=edit config for node B
https://bugs.openldap.org/show_bug.cgi?id=9282
--- Comment #3 from Quanah Gibson-Mount quanah@openldap.org --- Created attachment 743 --> https://bugs.openldap.org/attachment.cgi?id=743&action=edit Database used
https://bugs.openldap.org/show_bug.cgi?id=9282
--- Comment #4 from Quanah Gibson-Mount quanah@openldap.org --- General steps to reproduce:
slapadd server1 config on node A slapadd server2 config on node B
start server A, start server B
ldapadd example database on node A
confirm DB has replicated to both nodes, and that entry to be deleted exists:
ldapsearch -H ldap://server1.example.com -x -D cn=admin,dc=example,dc=com -w secret -b dc=example,dc=com "(objectClass=*)" 1.1 | grep ^dn: | wc -l
Should be: 1011
ldapsearch -H ldap://server2.example.com -x -D cn=admin,dc=example,dc=com -w secret -b dc=example,dc=com "(objectClass=*)" 1.1 | grep ^dn: | wc -l
Should be: 1011
ldapsearch -x -H ldap://server1.example.com/ -D cn=admin,dc=example,dc=com -w secret "cn=Daemon Leeson" 1.1
Should return entry DN: dn: cn=Damon Leeson,ou=Product Development,dc=example,dc=com
ldapsearch -x -H ldap://server2.example.com/ -D cn=admin,dc=example,dc=com -w secret "cn=Daemon Leeson" 1.1 dn: cn=Damon Leeson,ou=Product Development,dc=example,dc=com
Stop node A
Delete entry from node B:
ldapdelete -x -H ldap://server2.example.com/ -D cn=admin,dc=example,dc=com -w secret "cn=Damon Leeson, ou=Product Development, dc=example,dc=com"
Confirm entry is gone from node B: ldapsearch -x -H ldap://server2.example.com/ -D cn=admin,dc=example,dc=com -w secret "cn=Daemon Leeson" 1.1
<nothing>
Start node A
Wait for replication to re-initiate from node B to node A
Search for entry in both nodes.
ldapsearch -x -H ldap://server1.example.com/ -D cn=admin,dc=example,dc=com -w secret "cn=Daemon Leeson" 1.1
ldapsearch -x -H ldap://server2.example.com/ -D cn=admin,dc=example,dc=com -w secret "cn=Daemon Leeson" 1.1
Expected: Entry is not in either node
Actual: Entry is in both nodes
https://bugs.openldap.org/show_bug.cgi?id=9282
--- Comment #5 from Quanah Gibson-Mount quanah@openldap.org --- Confirmed same issue exists in master after writing regression test
https://bugs.openldap.org/show_bug.cgi?id=9282
Ondřej Kuzník ondra@mistotebe.net changed:
What |Removed |Added ---------------------------------------------------------------------------- Status|UNCONFIRMED |CONFIRMED Ever confirmed|0 |1
--- Comment #6 from Ondřej Kuzník ondra@mistotebe.net --- Thanks for the reproducer script.
This is due to https://git.openldap.org/openldap/openldap/-/blob/master/servers/slapd/syncr... causing A to skip the present cull.
Based on the git history, this was introduced to deal with ITS#5470 but that seems wrong, if the number of SIDs in the cookie differs from what we requested then either: - a SID disappeared from the set we received, which sounds like what ITS#5470 is about? But slapd doesn't really allow this at the moment as it will say consumer is newer than provider) so that shouldn't happen - a SID is added to the set by the provider, like here. This could be due to a delete (like here) and that delete has to be replicated - that is the point of running syncrepl_del_nonpresent
The above would not explain why server B then receives the deleted entry back rather than this being a silent desync. It turns out that check_syncprov() doesn't add SID 2 into the cookie[0] so it forgets its own modification when establishing the syncrepl session to A.
Howard, can you review if any of the claims above seem wrong?
[0]. https://git.openldap.org/openldap/openldap/-/blob/master/servers/slapd/syncr... - the loops should probably be inverted, with the outer loop operating on si_cookieState instead?
https://bugs.openldap.org/show_bug.cgi?id=9282
--- Comment #7 from Howard Chu hyc@openldap.org --- (In reply to Ondřej Kuzník from comment #6)
Thanks for the reproducer script.
This is due to https://git.openldap.org/openldap/openldap/-/blob/master/servers/slapd/ syncrepl.c#L1638 causing A to skip the present cull.
Based on the git history, this was introduced to deal with ITS#5470 but that seems wrong, if the number of SIDs in the cookie differs from what we requested then either:
- a SID disappeared from the set we received, which sounds like what
ITS#5470 is about? But slapd doesn't really allow this at the moment as it will say consumer is newer than provider) so that shouldn't happen
A SID can't disappear. They tend to stay in the contextCSN forever. (This is actually another problem, nodes that are converted from single-provider to multi-provider generally still have a SID 0 CSN, which is always ancient relative to the active SIDs. Routines that check for oldest CSN to still exist in the DB lead to wasteful checks because of that. Right now all you can do is use mage privs and delete the obsolete CSN.)
- a SID is added to the set by the provider, like here. This could be due to
a delete (like here) and that delete has to be replicated - that is the point of running syncrepl_del_nonpresent
Yes, the problem that was being addressed is that if the local node knows about more SIDs than the remote node, then the incoming present list from the remote node can't be trusted. Doing a del_nonpresent could delete a lot of entries that the remote node doesn't know about, but exist legitimately on the local node.
I think a proper fix would require a change in the syncrepl protocol sequencing. E.g., two nodes should refresh from each other with all of their new Adds/Modifies first, and once those changes have been settled, then they can perform a present cross-check. This would also require saving some intermediate cookie state in case the the full sequence gets interrupted.
Or, put in another way, there needs to be a separately tracked contextDeleteCSN.
The above would not explain why server B then receives the deleted entry back rather than this being a silent desync. It turns out that check_syncprov() doesn't add SID 2 into the cookie[0] so it forgets its own modification when establishing the syncrepl session to A.
Howard, can you review if any of the claims above seem wrong?
[0]. https://git.openldap.org/openldap/openldap/-/blob/master/servers/slapd/ syncrepl.c#L1638 - the loops should probably be inverted, with the outer loop operating on si_cookieState instead?
https://bugs.openldap.org/show_bug.cgi?id=9282
--- Comment #8 from Ondřej Kuzník ondra@mistotebe.net --- On Thu, Jul 02, 2020 at 01:19:40PM +0000, openldap-its@openldap.org wrote:
--- Comment #7 from Howard Chu hyc@openldap.org --- (In reply to Ondřej Kuzník from comment #6)
Thanks for the reproducer script.
This is due to https://git.openldap.org/openldap/openldap/-/blob/master/servers/slapd/ syncrepl.c#L1638 causing A to skip the present cull.
Based on the git history, this was introduced to deal with ITS#5470 but that seems wrong, if the number of SIDs in the cookie differs from what we requested then either:
- a SID disappeared from the set we received, which sounds like what
ITS#5470 is about? But slapd doesn't really allow this at the moment as it will say consumer is newer than provider) so that shouldn't happen
A SID can't disappear. They tend to stay in the contextCSN forever. (This is actually another problem, nodes that are converted from single-provider to multi-provider generally still have a SID 0 CSN, which is always ancient relative to the active SIDs. Routines that check for oldest CSN to still exist in the DB lead to wasteful checks because of that. Right now all you can do is use mage privs and delete the obsolete CSN.)
Yeah, and it would not be so wasteful if we could query the database for the oldest/newest entry with a given SID in entryCSN. Removing a SID from the set is always going to be a manual operation unless we can coordinate with all provider and consumer nodes somehow.
- a SID is added to the set by the provider, like here. This could be due to
a delete (like here) and that delete has to be replicated - that is the point of running syncrepl_del_nonpresent
Yes, the problem that was being addressed is that if the local node knows about more SIDs than the remote node, then the incoming present list from the remote node can't be trusted. Doing a del_nonpresent could delete a lot of entries that the remote node doesn't know about, but exist legitimately on the local node.
The scenario I describe here is if we start a search with a cookie containing only SIDs {1, 2} but finish present phase by receiving a cookie with SIDs {1, 2, 3}. Accepting that cookie implies we have to process the (implied) deletes too or we have desynced.
If, in the meantime, we added entries with a SID of 4, those are not part of the original cookie and should not be deleted, that's for sure. I think we do the right thing already or are close to doing so.
I think a proper fix would require a change in the syncrepl protocol sequencing. E.g., two nodes should refresh from each other with all of their new Adds/Modifies first, and once those changes have been settled, then they can perform a present cross-check. This would also require saving some intermediate cookie state in case the the full sequence gets interrupted.
Or, put in another way, there needs to be a separately tracked contextDeleteCSN.
That's ITS#8125 work, I should get back to that eventually.
https://bugs.openldap.org/show_bug.cgi?id=9282
--- Comment #9 from Howard Chu hyc@openldap.org --- (In reply to Howard Chu from comment #7)
(In reply to Ondřej Kuzník from comment #6)
- a SID is added to the set by the provider, like here. This could be due to
a delete (like here) and that delete has to be replicated - that is the point of running syncrepl_del_nonpresent
Yes, the problem that was being addressed is that if the local node knows about more SIDs than the remote node, then the incoming present list from the remote node can't be trusted. Doing a del_nonpresent could delete a lot of entries that the remote node doesn't know about, but exist legitimately on the local node.
More on this, from https://bugs.openldap.org/show_bug.cgi?id=5470#c15
We could try changing del_nonpresent to ignore entries with entryCSN newer than the current remote cookie, instead of ignoring the entire presentlist.
https://bugs.openldap.org/show_bug.cgi?id=9282
--- Comment #10 from Ondřej Kuzník ondra@mistotebe.net --- On Thu, Jul 02, 2020 at 03:31:14PM +0000, openldap-its@openldap.org wrote:
--- Comment #9 from Howard Chu hyc@openldap.org --- (In reply to Howard Chu from comment #7)
Yes, the problem that was being addressed is that if the local node knows about more SIDs than the remote node, then the incoming present list from the remote node can't be trusted. Doing a del_nonpresent could delete a lot of entries that the remote node doesn't know about, but exist legitimately on the local node.
More on this, from https://bugs.openldap.org/show_bug.cgi?id=5470#c15
We could try changing del_nonpresent to ignore entries with entryCSN newer than the current remote cookie, instead of ignoring the entire presentlist.
And that's what you've touched on in ITS#5470 (4673c99e96fdc70ef96b84a9aa4de6141f26e6df) already but had to partially revert in ac037d3a13ce56f72ef24c8e17fc944c39c71e72 since there is still no way to LE compare only within the same SID, but I repeat myself.
https://bugs.openldap.org/show_bug.cgi?id=9282
--- Comment #11 from Ondřej Kuzník ondra@mistotebe.net --- (In reply to Howard Chu from comment #9)
(In reply to Howard Chu from comment #7)
Yes, the problem that was being addressed is that if the local node knows about more SIDs than the remote node, then the incoming present list from the remote node can't be trusted. Doing a del_nonpresent could delete a lot of entries that the remote node doesn't know about, but exist legitimately on the local node.
More on this, from https://bugs.openldap.org/show_bug.cgi?id=5470#c15
We could try changing del_nonpresent to ignore entries with entryCSN newer than the current remote cookie, instead of ignoring the entire presentlist.
The top commit in https://git.openldap.org/openldap/openldap/-/merge_requests/91 attempts to address this very issue by checking whether its entryCSN is covered by the cookie we received.
https://bugs.openldap.org/show_bug.cgi?id=9282
Ondřej Kuzník ondra@mistotebe.net changed:
What |Removed |Added ---------------------------------------------------------------------------- See Also| |https://bugs.openldap.org/s | |how_bug.cgi?id=7748
https://bugs.openldap.org/show_bug.cgi?id=9282
--- Comment #12 from Quanah Gibson-Mount quanah@openldap.org --- Commits: • 9e736dbf by Quanah Gibson-Mount at 2020-07-22T18:24:51+00:00 ITS#9282 regression test
• 5bbcf38c by Ondřej Kuzník at 2020-07-22T18:24:51+00:00 ITS#9282 Build a complete cookie for the search
• 521b8bbe by Ondřej Kuzník at 2020-07-22T18:24:52+00:00 ITS#9282 Check entries are covered by new contextCSN before deletion
https://bugs.openldap.org/show_bug.cgi?id=9282
Quanah Gibson-Mount quanah@openldap.org changed:
What |Removed |Added ---------------------------------------------------------------------------- Status|CONFIRMED |RESOLVED Resolution|--- |FIXED
--- Comment #13 from Quanah Gibson-Mount quanah@openldap.org --- RE24:
Commits: • 7fb8912b by Quanah Gibson-Mount at 2020-07-23T15:57:22+00:00 ITS#9282 regression test
• 939404ac by Ondřej Kuzník at 2020-07-23T16:53:46+00:00 ITS#9282 Build a complete cookie for the search
• a34ecd86 by Ondřej Kuzník at 2020-07-23T17:02:29+00:00 ITS#9282 Check entries are covered by new contextCSN before deletion
https://bugs.openldap.org/show_bug.cgi?id=9282
Quanah Gibson-Mount quanah@openldap.org changed:
What |Removed |Added ---------------------------------------------------------------------------- Status|RESOLVED |VERIFIED
https://bugs.openldap.org/show_bug.cgi?id=9282
Quanah Gibson-Mount quanah@openldap.org changed:
What |Removed |Added ---------------------------------------------------------------------------- Target Milestone|2.4.51 |2.4.53
--- Comment #14 from Quanah Gibson-Mount quanah@openldap.org --- Commits: • 8699e5f3 by Howard Chu at 2020-08-31T19:36:10+01:00 ITS#9282 fix crash in nonpresent_callback
In a standard Refresh present phase, the provider sends no cookie since it is only listing the entries that existed as of the time in the cookie the consumer sent. In this case the consumer only needs to check entryCSNs against its last sent cookie.
• 41396248 by Howard Chu at 2020-08-31T20:09:52+01:00 ITS#9282 more for merge_state
Don't assume si_cookieState is always newer
https://bugs.openldap.org/show_bug.cgi?id=9282
--- Comment #15 from Quanah Gibson-Mount quanah@openldap.org --- Commits: • ee564399 by Ondřej Kuzník at 2021-02-18T17:31:32+00:00 ITS#9282 Check all csns
https://bugs.openldap.org/show_bug.cgi?id=9282
--- Comment #16 from Quanah Gibson-Mount quanah@openldap.org --- RE24:
Commits: • f2a6425c by Ondřej Kuzník at 2021-02-18T18:43:44+00:00 ITS#9282 Check all csns
https://bugs.openldap.org/show_bug.cgi?id=9282
--- Comment #17 from Quanah Gibson-Mount quanah@openldap.org --- head:
Commits: • da610c05 by Ondřej Kuzník at 2021-12-08T17:15:57+00:00 ITS#9282 Short-circuit cookie comparison in non-present check
• 8d428f31 by Ondřej Kuzník at 2021-12-08T17:15:57+00:00 ITS#9282 Do not resuscitate entries we already deleted
• a73ddda5 by Ondřej Kuzník at 2021-12-08T17:15:57+00:00 ITS#9282 Skip old accesslog entries even in delta-refresh
https://bugs.openldap.org/show_bug.cgi?id=9282
--- Comment #18 from Quanah Gibson-Mount quanah@openldap.org --- RE26:
• 8507e711 by Ondřej Kuzník at 2021-12-13T16:38:23+00:00 ITS#9282 Short-circuit cookie comparison in non-present check
• 68981147 by Ondřej Kuzník at 2021-12-13T16:38:28+00:00 ITS#9282 Do not resuscitate entries we already deleted
• 6f7dccd5 by Ondřej Kuzník at 2021-12-13T16:38:33+00:00 ITS#9282 Skip old accesslog entries even in delta-refresh
https://bugs.openldap.org/show_bug.cgi?id=9282
--- Comment #19 from Quanah Gibson-Mount quanah@openldap.org --- RE25:
• 11a86733 by Ondřej Kuzník at 2021-12-13T16:40:47+00:00 ITS#9282 Short-circuit cookie comparison in non-present check
• 9bd3fa40 by Ondřej Kuzník at 2021-12-13T16:40:50+00:00 ITS#9282 Do not resuscitate entries we already deleted
• f1ce0d20 by Ondřej Kuzník at 2021-12-13T16:40:52+00:00 ITS#9282 Skip old accesslog entries even in delta-refresh