https://bugs.openldap.org/show_bug.cgi?id=9888
Issue ID: 9888
Summary: When using cn=config replication, schema updates can corrupt the index database(s)
Product: OpenLDAP
Version: unspecified
Hardware: All
OS: All
Status: UNCONFIRMED
Keywords: needs_review
Severity: normal
Priority: ---
Component: slapd
Assignee: bugs@openldap.org
Reporter: quanah@openldap.org
Target Milestone: ---
Today I pushed a schema update out to the config node that holds the schema replicated to the providers and consumers. Post schema update, 2 of the 11 servers crashed in the mdb online indexing function. I fixed this by running slapcat on the db and reloading it with slapadd. This is important because it was later revealed that on the 9 of 11 servers that did not crash or have their database reloaded, ldapsearch would return the wrong attribute names for some attribute:value pairs in the database, which caused mayhem in downstream systems and caused replication issues between the nodes. The 2 nodes that were reloaded immediately after the schema change had the only "good" copies of the database left.
To give an example, say an entry was something like:
dn: uid=joe,ou=people,dc=example,dc=com
uid: joe
sn: smith
cn: joe smith
givenName: joe
After the change, the broken servers could return something like:
dn: uid=joe,ou=people,dc=example,dc=com
uid: joe
posixGroup: smith
cn: joe smith
givenName: joe
It's not clear how deeply this bug ran in the database. It for sure affected 2 attributes used by the person objectClass. Neither of the "replacement" attributes was a valid attribute for the person objectClasses in use.
Maybe related to the changes in ITS#9858?
Quanah Gibson-Mount quanah@openldap.org changed:
What    | Removed     | Added
--------|-------------|------
Version | unspecified | 2.6.3
Quanah Gibson-Mount quanah@openldap.org changed:
What     | Removed | Added
---------|---------|---------
Priority | ---     | Highest
Severity | normal  | critical
--- Comment #1 from Quanah Gibson-Mount quanah@openldap.org --- Note that one change that was made during the schema update was that some of the attribute OIDs were renumbered. Speculating that this was the root cause of the issue in slapd.
--- Comment #2 from Michael Ströder michael@stroeder.com --- I guess OID renumbering is something you must not do, because otherwise you run into this kind of race condition.
--- Comment #3 from Ondřej Kuzník ondra@mistotebe.net --- On Mon, Jul 25, 2022 at 11:19:00PM +0000, openldap-its@openldap.org wrote:
> Note that one change that was made during the schema update was that some of the attribute OIDs were renumbered. Speculating that this was the root cause of the issue in slapd
Just checked the code and it really seems like any AttributeDescription pointer caching is tied to OIDs, so if you mess with those and there is data in mdb, those pointers will no longer match up and you get undefined behaviour. As usual, schema changes removing or modifying attributes/objectclasses should include a textual (LDIF) reload.
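To make the failure mode above concrete, here is a minimal illustrative model (not slapd's actual code; all names and OIDs are invented for the example) of why descriptor caching keyed by OID breaks when OIDs are renumbered in place: data resolved under the old OID table keeps a stale pointer, while fresh lookups by the same OID now resolve to a different attribute.

```python
# Toy model of an AttributeDescription cache keyed by OID.
# The OIDs and attribute names here are hypothetical.

class AttributeDescription:
    def __init__(self, oid, name):
        self.oid = oid
        self.name = name

# Schema registry: OID -> descriptor.
schema = {
    "1.1.1": AttributeDescription("1.1.1", "sn"),
    "1.1.2": AttributeDescription("1.1.2", "givenName"),
}

# An entry caches the descriptor it resolved at write time.
entry = {"attr": schema["1.1.1"], "value": "smith"}  # meant to be sn

# A schema update renumbers OIDs: "1.1.1" now names a different
# attribute. In-memory data still points at the old object, while a
# fresh lookup by the cached OID resolves to the new definition.
schema["1.1.1"] = AttributeDescription("1.1.1", "posixGroup")

cached_name = entry["attr"].name                 # stale pointer: "sn"
resolved_name = schema[entry["attr"].oid].name   # fresh lookup: "posixGroup"
```

The mismatch between `cached_name` and `resolved_name` mirrors the reported symptom of `sn: smith` being served as `posixGroup: smith`.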
I guess not a bug? At least unless you can show that a change not touching OIDs gives you the same behaviour?
--- Comment #4 from Howard Chu hyc@openldap.org --- Also note that the OID is part of the index key, so reusing OIDs for different attributes will absolutely corrupt indices.
This is a long-standing rule - OIDs of existing attributes must never be changed or reused for different attributes. If you're doing schema changes of that sort, you must always assign entirely new OIDs to every changed attribute.
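The index-corruption hazard can be sketched the same way. This is an illustrative model only (not back-mdb's real key format; the `index_key` layout, OIDs, and DN are assumptions): once an OID participates in index keys, reusing that OID for a different attribute makes old index entries satisfy searches for the new attribute.

```python
# Hypothetical key layout: attribute OID plus normalized value.
def index_key(oid, value):
    return f"{oid}={value.lower()}"

index = {}

# Entry indexed while OID 1.1.1 meant "sn".
index[index_key("1.1.1", "smith")] = "uid=joe,ou=people,dc=example,dc=com"

# Schema change reuses 1.1.1 for "posixGroup". A search for
# (posixGroup=smith) now builds the identical key and "finds" joe's
# entry, even though joe has no posixGroup value at all.
hit = index.get(index_key("1.1.1", "smith"))
```

Assigning a brand-new OID instead of reusing 1.1.1 would produce a different key, so stale index entries could never match the new attribute.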
Howard Chu hyc@openldap.org changed:
What       | Removed     | Added
-----------|-------------|---------
Resolution | ---         | WONTFIX
Status     | UNCONFIRMED | RESOLVED
--- Comment #5 from Quanah Gibson-Mount quanah@openldap.org --- (In reply to Howard Chu from comment #4)
> Also note that the OID is part of the index key, so reusing OIDs for different attributes will absolutely corrupt indices.
> This is a long-standing rule - OIDs of existing attributes must never be changed or reused for different attributes. If you're doing schema changes of that sort, you must always assign entirely new OIDs to every changed attribute.
Hm, that's odd. One of my first conversations with you back at Stanford was when we hit an issue with AD that it stores knowledge of the attribute OIDs and OpenLDAP does not (back with OpenLDAP 2.1). We'd run into an issue where AD had problems with a schema change when an OID was re-used.
Past that point, this has not been a problem for OpenLDAP 2.1 through OpenLDAP 2.4, where this was done extensively at both Stanford and at Zimbra (I just went back and checked the commit history of the Zimbra schema to confirm).
--- Comment #6 from Michael Ströder michael@stroeder.com --- (In reply to Comment #5 from Quanah Gibson-Mount quanah@openldap.org)
> (In reply to Howard Chu from comment #4)
> > Also note that the OID is part of the index key, so reusing OIDs for different attributes will absolutely corrupt indices.
> Hm, that's odd. One of my first conversations with you back at Stanford was when we hit an issue with AD that it stores knowledge of the attribute OIDs and OpenLDAP does not (back with OpenLDAP 2.1).
AFAICT this changed with back-mdb.
--- Comment #7 from Quanah Gibson-Mount quanah@openldap.org --- (In reply to Michael Ströder from comment #6)
> (In reply to Comment #5 from Quanah Gibson-Mount quanah@openldap.org)
> > (In reply to Howard Chu from comment #4)
> > > Also note that the OID is part of the index key, so reusing OIDs for different attributes will absolutely corrupt indices.
> > Hm, that's odd. One of my first conversations with you back at Stanford was when we hit an issue with AD that it stores knowledge of the attribute OIDs and OpenLDAP does not (back with OpenLDAP 2.1).
> AFAICT this changed with back-mdb.
Zimbra was one of the earliest adopters of back-mdb and never hit this issue in OpenLDAP 2.4.
Quanah Gibson-Mount quanah@openldap.org changed:
What     | Removed      | Added
---------|--------------|---------
Keywords | needs_review |
Status   | RESOLVED     | VERIFIED
Quanah Gibson-Mount quanah@openldap.org changed:
What       | Removed  | Added
-----------|----------|------------
Resolution | WONTFIX  | ---
Status     | VERIFIED | UNCONFIRMED
--- Comment #8 from Quanah Gibson-Mount quanah@openldap.org --- Discussed further with Howard:
There is no dependency on the OIDs within the actual database. So with slapd.conf, you could (and still can) renumber OIDs in a schema file with no issue. You can also do this with cn=config based systems if slapd is offline while the schema update is made (which is what was done with Zimbra).
In essence, this means that ldapmodify operations on schema which touch OIDs can be destructive (2 slapds segfaulted out of 12 servers) and also cause slapd to serve out invalid information (at least until slapd is restarted).
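A hedged sketch of the offline textual (LDIF) reload the thread recommends when a schema change renumbers or modifies OIDs. This is an operational outline, not a definitive procedure; the database number, file paths, and service manager invocation are assumptions for illustration:

```shell
# Export the data while slapd still has the old schema loaded.
slapcat -n 2 -l data.ldif

# Take slapd offline before touching the schema.
systemctl stop slapd

# ...apply the schema change to the config/schema files here...

# Remove the old database files (paths are an assumption) and
# reload the export so everything is re-indexed under the new schema.
rm -rf /var/lib/ldap/*
slapadd -n 2 -l data.ldif

systemctl start slapd
```

Because slapadd re-normalizes and re-indexes every entry against the schema in effect at load time, no stale OID-keyed index entries or cached descriptors can survive the reload.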
--- Comment #9 from Quanah Gibson-Mount quanah@openldap.org --- At this point, I would say doing replication of the cn=schema tree is generally catastrophic and should be avoided. I ran into another issue today with this, this time with all-new schema elements, and it still caused slapd to break and return invalid data until slapd was restarted.
--- Comment #10 from Howard Chu hyc@openldap.org --- I'll note that the "notify" field of the ConfigTable existed for the purpose of allowing subsystems to register handlers to be notified when a config item they cared about was changed. Such a handler could have been used by back-mdb to refresh itself whenever an attribute definition was updated.
But so far nothing ever implemented these notify handlers, and that field was deleted from the config table in https://git.openldap.org/openldap/openldap/-/commit/0c3b8a35249942dc58f0746d...
We could resurrect it and define how it should be used, now that we have an actual use case for it.
--- Comment #11 from Quanah Gibson-Mount quanah@openldap.org --- Hit an issue with this again today, where no actual schema changes were made. In this case the change involved:
a) Adding an additional olcAuthzRegexp configuration
b) Adding an ACL
It is useful to note that the process that triggers cn=config updates regenerates the contextCSN of all the entries in the config db, so it causes a 'force sync' of all schema, even if they've had no changes.
After the change was replicated to the downstream consumers, the slapd process lost all knowledge of the schema it uses, leading to filters showing missing schema:
(&(?objectClass=person))
being one example. Although an odd practice, this seems indicative of some serious issues internal to slapd. I think that we should go back to marking cn=config replication experimental and not advised until this can be fixed.
--- Comment #12 from Quanah Gibson-Mount quanah@openldap.org --- Digging deeper, this is related to the use of back-relay/slapo-rwm in my environment. The environment continues to work correctly UNTIL a bind and search is done against the back-relay database in my configuration.
Quanah Gibson-Mount quanah@openldap.org changed:
What    | Removed | Added
--------|---------|------
Summary | When using cn=config replication, schema updates can corrupt the index database(s) | back-relay/slapo-rwm can corrupt slapd internal pointers after cn=config refresh
--- Comment #13 from Vins Vilaplana vins@disce.org --- We've been able to hit this issue in our staging platform. We currently have an isolated server with this bug, so we're happy to run on it all kinds of tests you need. Restarting slapd seems to fix the issue.
Additionally, we can also consider the option of starting a slapd with all debugging on (-d -1), taking it out of service, and playing around with cn=config modifications to try to trigger the bug. However, I'm not sure -d -1 will include any debug output regarding memory structures.
--- Comment #14 from Ondřej Kuzník ondra@mistotebe.net --- On Wed, Oct 18, 2023 at 05:54:08PM +0000, openldap-its@openldap.org wrote:
> We've been able to hit this issue in our staging platform. We currently have an isolated server with this bug, so we're happy to run on it all kind of tests you need. Restarting slapd seems to fix the issue.
> Additionally, we can also consider the option of starting a slapd with all debugs on (-d -1), take it out of service and play around modifying cn=config trying to trigger the bug. However, I'm not sure a -d -1 will include any debug regarding memory structures.
Hi Vins, I appreciate you might not be able to, but it would help most if you found a way to trigger this condition in a repeatable way; it doesn't have to hit 100% of the time.
Other than that, sharing -d -1 logs leading up to it certainly won't hurt. Still, unless we can reproduce it, we can't be sure any changes we make would actually help get rid of the issue.
Thanks,