https://bugs.openldap.org/show_bug.cgi?id=10358
Issue ID: 10358
Summary: syncrepl can revert an entry's CSN
Product: OpenLDAP
Version: unspecified
Hardware: All
OS: All
Status: UNCONFIRMED
Keywords: needs_review
Severity: normal
Priority: ---
Component: slapd
Assignee: bugs@openldap.org
Reporter: ondra@mistotebe.net
Target Milestone: ---
Created attachment 1080 --> https://bugs.openldap.org/attachment.cgi?id=1080&action=edit Debug log of an instance of this happening
There is a sequence of operations which can force an MPR (multi-provider replication) node to apply changes out of order, essentially reverting an operation. I am currently investigating which part of the code that should have prevented this let it slip.
A sample log showing how this happened is attached.
--- Comment #1 from Ondřej Kuzník ondra@mistotebe.net --- On Tue, Jun 17, 2025 at 10:31:00AM +0000, openldap-its@openldap.org wrote:
> There is a sequence of operations which can force an MPR node to apply
> changes out of order (essentially reverting an operation). Currently
> investigating which part of the code that should have prevented this has
> let it slip.
>
> A sample log showing how this happened is attached.
What we need for plain MPR is twofold: never apply a modification twice (we ensure this by tracking the latest applied/pending CSN per serverid), and apply only the latest version of an entry when we learn of it - we use the entryCSN to enforce serialisation across the cluster (for each entry, the latest version wins).
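As a minimal sketch (in Python, not slapd's actual code), the serialisation rule above amounts to a last-writer-wins comparison on entryCSN. CSN fields are fixed-width and zero-padded, so a plain string comparison orders them by time, then change count, then serverid; the example values below are made up:

```python
# Sketch of the "keep the latest version" rule plain-MPR syncrepl relies on.
# Not OpenLDAP's code; it only illustrates the intended serialisation.
# A CSN looks like "<GeneralizedTime>#<count>#<serverid>#<modcount>";
# all fields are fixed-width and zero-padded, so lexicographic string
# comparison matches the intended ordering.

def should_apply(incoming_csn: str, current_csn: str) -> bool:
    """Apply an incoming entry version only if it is newer than ours."""
    return incoming_csn > current_csn

older = "20250617103100.000001Z#000000#001#000000"
newer = "20250617103205.000000Z#000000#002#000000"

assert should_apply(newer, older)      # a newer version replaces an older one
assert not should_apply(older, newer)  # a stale version must be dropped
```

The bug below is precisely a case where this check is evaluated against a snapshot that is no longer the current version by the time the write lands.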
It is the latter that fails in this case. The syncrepl task picks up the change first and gets all the way to preparing the modification based on the version of the entry it sees; a Modify request then comes in and gets applied; syncrepl is scheduled again and its now-stale modification is applied, reverting the entryCSN. There is no synchronisation between the two, nor any way to ensure the syncrepl thread is not interrupted between diffing the entry and applying the result to the DB.
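The interleaving can be reproduced in miniature. This is a hypothetical Python sketch (not slapd code, CSN values made up) of the lost-update pattern: syncrepl diffs against a snapshot, a local Modify lands in between, and the stale diff then overwrites the newer entryCSN:

```python
# Hypothetical sketch of the race described above (not slapd code).
db = {"entryCSN": "20250617103100.000000Z#000000#001#000000"}

# 1. syncrepl reads the entry and prepares a modification against it
snapshot = dict(db)
prepared_mod = {"entryCSN": "20250617103150.000000Z#000000#002#000000"}

# 2. a local Modify is applied in the meantime, advancing the entryCSN
db["entryCSN"] = "20250617103200.000000Z#000000#001#000000"

# 3. syncrepl is rescheduled and applies its stale mod unconditionally
db.update(prepared_mod)

# The entryCSN has gone backwards: 10:32:00 was replaced by 10:31:50
assert db["entryCSN"] < "20250617103200.000000Z#000000#001#000000"
```

Step 3 is where a precondition ("the entry still matches the snapshot I diffed against") would catch the problem.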
We could go all in and introduce another way to serialise, or we could just tag the diff with a precondition that the entry has not changed in the meantime, and detect when it has: either by turning the entryCSN mod into a delete+add, or by using the assertion control. With the assertion control we can ignore LDAP_ASSERTION_FAILED; with plain mods we would have to start ignoring LDAP_NO_SUCH_ATTRIBUTE - excluding bugs, the only way either could happen is being preempted before the mod is applied. All of this only applies to messages handled by syncrepl_diff_entry() (not delta modifications, which have syncrepl_op_modify() to deal with MPR concerns).
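As a sketch of the delete+add idea (DN and CSN values below are made up): instead of replacing entryCSN, the diff deletes the specific value it was computed against and adds the new one. If a concurrent write has already changed the entryCSN, deleting the old value fails with noSuchAttribute, the whole Modify is rejected, and the stale diff never lands:

```ldif
dn: uid=example,dc=example,dc=com
changetype: modify
delete: entryCSN
entryCSN: 20250617103100.000000Z#000000#001#000000
-
add: entryCSN
entryCSN: 20250617103150.000000Z#000000#002#000000
-
```

Because LDAP applies the modifications in a Modify operation atomically, the add never happens unless the delete of the exact old value succeeds.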
I will see if the delete+add approach is practicable and post an MR to that effect if it is.
Ondřej Kuzník ondra@mistotebe.net changed:
What              |Removed           |Added
----------------------------------------------------------------------------
Ever confirmed    |0                 |1
Status            |UNCONFIRMED       |IN_PROGRESS
Assignee          |bugs@openldap.org |ondra@mistotebe.net
--- Comment #2 from Ondřej Kuzník ondra@mistotebe.net --- https://git.openldap.org/openldap/openldap/-/merge_requests/781
Quanah Gibson-Mount quanah@openldap.org changed:
What              |Removed      |Added
----------------------------------------------------------------------------
Keywords          |needs_review |
Target Milestone  |---          |2.6.11
--- Comment #3 from 3049720393@qq.com --- Hi All, I don't know whether this problem is related to this bug, but I want to share it here since we still do not know the cause.
We're using version 2.6.9, started via a docker container. It works very well when installed and started on a single server instance.
But recently we tried to run it on the Azure Container Apps service (a service that runs docker containers) and found that when two openldap containers (openldap processes) run at the same time with the same data volume mounted, one starts normally while the other loops trying to start and always fails. The failing one shows these errors:
```
mdb_db_open: database "dc=xxx,dc=exmple,dc=com" cannot be opened: Resource temporarily unavailable (11). Restore from backup!
backend_startup_one (type=mdb, suffix="dc=xxx,dc=exmple,dc=com"): bi_db_open failed! (11)
```
I think this is expected, because the running instance is holding the database. That instance works well for some time, but then at some point it suddenly exits with the following error messages:
```
slapd: id2entry.c:828: mdb_opinfo_get: Assertion `!rc' failed.
/usr/local/bin/entrypoint.sh: line 160:    50 Aborted    (core dumped) slapd -d 32768 -u openldap -g openldap -F /etc/ldap/slapd.d
```
And once the previously running instance exits, the other one can start up normally. So we ran `ldapsearch` inside it and found that some of the people entries were gone! We still do not know the reason... and then we came across this thread...
We'd like to know whether this is the same race condition described here, and would appreciate any help. Thanks.