Lesley Walker wrote:
I have spent today dissecting the logs from two incidents this week in which entries were erroneously deleted. Although the circumstances of the two incidents are quite different, from examining the logs I believe it is the same thing happening in each case.
I'm still unable to pinpoint the trigger condition, but I have a better idea of what happens. I believe it *may* be covered by ITS#4626 and ITS#4813, so I have built 2.3.35 to run on a test server.
On starting this new version for the first time and letting it build the database by replication from its provider, I get these messages in the log:
    is_entry_objectclass("", "2.5.17.0") no objectClass attribute
    is_entry_objectclass("", "2.5.6.1") no objectClass attribute
    is_entry_objectclass("", "2.16.840.1.113730.3.2.6") no objectClass attribute
I freely admit that I am not clued-up on schema design, but I have tried grepping for those numbers in the schema files and in an ldif of the database and I don't find them.
I note that these same messages were reported in ITS#4626, and wonder whether there's a connection, or is it a mere coincidence?
I also note that these exact same messages were discussed in December: http://www.openldap.org/lists/openldap-software/200612/msg00046.html but this discussion went over my head, so I would welcome any words-of-one-syllable explanations.
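For anyone else puzzling over the same messages: the three numbers are object class OIDs. To the best of my knowledge they belong to the standard object classes subentry (RFC 3672), alias (RFC 4512) and referral (RFC 3296). That they would not show up when grepping local schema files or an LDIF of the data is consistent with them being standard classes rather than site-defined ones, but that reading is an assumption and is not stated anywhere in this thread. A minimal Python sketch for decoding the messages, where `KNOWN_OIDS` and `decode_line` are hypothetical names chosen for illustration:

```python
import re

# Assumed mapping, taken from the RFCs that define these object classes.
KNOWN_OIDS = {
    "2.5.17.0": "subentry",                  # RFC 3672
    "2.5.6.1": "alias",                      # RFC 4512
    "2.16.840.1.113730.3.2.6": "referral",   # RFC 3296
}

# Matches e.g.: is_entry_objectclass("", "2.5.17.0") no objectClass attribute
PATTERN = re.compile(r'is_entry_objectclass\("([^"]*)", "([0-9.]+)"\)')


def decode_line(line):
    """Return (dn, oid, name) for an is_entry_objectclass message, else None."""
    m = PATTERN.search(line)
    if m is None:
        return None
    dn, oid = m.groups()
    return dn, oid, KNOWN_OIDS.get(oid, "unknown")
```

Note that the DN in all three messages is the empty string, so whatever entry is being checked is not one of the ordinary entries under the database suffix.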
The main problem I'm trying to troubleshoot is this:
In every case, there's a log entry:
    do_syncrep2: rid 123 LDAP_RES_INTERMEDIATE - SYNC_ID_SET
followed by some number of these:
    syncrepl_entry: rid 123 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_ADD)
("some number" is MUCH less than the number of records)
then:
    do_syncrep2: rid 123 LDAP_RES_INTERMEDIATE - REFRESH_PRESENT
followed by some (other) number of these:
    syncrepl_del_nonpresent: rid 123 be_delete uid=whatever,ou=Accounts,dc=example,dc=co,dc=nz (0)
*INCLUDING* be_deletes for nearly ALL the top-level entries:
    be_delete cn=root,dc=example,dc=co,dc=nz (0)
    be_delete ou=Accounts,dc=example,dc=co,dc=nz (66)
    be_delete ou=Mailbox,dc=example,dc=co,dc=nz (66)
    be_delete ou=Services,dc=example,dc=co,dc=nz (66)
    be_delete ou=Offices,dc=example,dc=co,dc=nz (66)
    be_delete ou=Networks,dc=example,dc=co,dc=nz (66)
    be_delete ou=Rooms,dc=example,dc=co,dc=nz (66)
    be_delete ou=Group,dc=example,dc=co,dc=nz (66)
    be_delete ou=EmailLists,dc=example,dc=co,dc=nz (66)
    be_delete ou=People,dc=example,dc=co,dc=nz (66)
    be_delete ou=Computers,dc=example,dc=co,dc=nz (66)
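The log pattern above can be tallied mechanically. The sketch below is a rough Python helper, not anything shipped with OpenLDAP: `summarize_deletes` and its arguments are hypothetical names. It counts `be_delete` lines by the number in parentheses and picks out the ones that target direct children of the suffix. I am assuming that number is the LDAP result code of the delete, in which case 0 is success and 66 is notAllowedOnNonLeaf (RFC 4511), i.e. the entry still had children at that moment. DNs containing escaped commas are not handled.

```python
import re
from collections import Counter

# Matches both the full and the abbreviated form quoted above, e.g.
#   syncrepl_del_nonpresent: rid 123 be_delete uid=x,ou=Accounts,dc=example,dc=co,dc=nz (0)
#   be_delete ou=Accounts,dc=example,dc=co,dc=nz (66)
DELETE_RE = re.compile(r"be_delete (?P<dn>.+) \((?P<code>\d+)\)\s*$")


def summarize_deletes(lines, suffix):
    """Count be_delete results and list deletes aimed at top-level entries."""
    by_code = Counter()
    top_level = []
    tail = "," + suffix.lower()
    for line in lines:
        m = DELETE_RE.search(line)
        if m is None:
            continue  # not a be_delete line
        dn, code = m.group("dn"), int(m.group("code"))
        by_code[code] += 1
        low = dn.lower()
        # A direct child of the suffix has exactly one RDN in front of it.
        if low.endswith(tail) and "," not in low[: -len(tail)]:
            top_level.append((dn, code))
    return by_code, top_level
```

Called as `summarize_deletes(open("slapd.log"), "dc=example,dc=co,dc=nz")` (the log file name is a placeholder), it returns the per-code counts and the list of top-level DNs with their codes, which makes it quick to compare incidents across servers.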
This would seem to leave the database completely empty, and in a state where nothing and nobody can authenticate to it. No amount of stopping/restarting has any effect (because it thinks it is in sync) until we repair it by starting with the empty sync cookie.
There have been at least 10 instances of this fault on different servers in the last 1-2 weeks.
Because I can't reproduce the problem on demand, I won't know for sure whether or not the new version fixes it, but I have built the new version and am now running it on a test server.
Here's the environment:
    OpenLDAP 2.3.32 running on Debian 3.1 (Sarge)
    compiled with sync logging patch discussed about 4 months ago
    loglevel config sync on all servers
    BDB 4.2 backend
    Syncrepl replication all round
    A "master" server (com)
      - holds the master copy of the database
    A number of servers that replicate directly from com
    An "intermediate" server (wwsv04) that
      - is on the same LAN and subnet as com
      - replicates from com
      - acts as provider for all other servers
    88 servers/replicas in total
    Approx 9000 records
    All replicas are supposed to be complete copies
    Nothing particularly fancy or clever going on
Lesley Walker wrote:
I also note that these exact same messages were discussed in December: http://www.openldap.org/lists/openldap-software/200612/msg00046.html but this discussion went over my head, so I would welcome any words-of-one-syllable explanations.
Ignore those messages. Sorry, more than one syllable required...
The main problem I'm trying to troubleshoot is this:
This is most likely ITS#4813, fixed in 2.3.34. As noted in that ITS, it's a bit tricky to manually reproduce the problem since it's quite timing dependent.
Howard Chu wrote:
Ignore those messages. Sorry, more than one syllable required...
I can live with that, if they really are harmless. :-)
This is most likely ITS#4813, fixed in 2.3.34. As noted in that ITS, it's a bit tricky to manually reproduce the problem since it's quite timing dependent.
Many thanks for the confirmation. Is the fix in the provider or the consumer? Or both?
Lesley Walker wrote:
Howard Chu wrote:
This is most likely ITS#4813, fixed in 2.3.34. As noted in that ITS, it's a bit tricky to manually reproduce the problem since it's quite timing dependent.
Many thanks for the confirmation. Is the fix in the provider or the consumer? Or both?
The fix is in the provider.
Howard Chu wrote:
The fix is in the provider.
Thanks.
I now have 2.3.35 running on a pair of test servers and will post again when/if I have new information to report.
openldap-software@openldap.org