I've been running OpenLDAP 2.3.35 for almost a year with no problems (Gentoo Linux, w/DB 4.5). I upgraded to 2.3.40 last week, and had a meltdown this morning during an account purge.
There had been some updates to the directory after upgrading, but no deletes. During my delete run, there were some index errors:
----- [successful deletes] Feb 3 03:50:36 derp idmgmt[3722]: error deleting user cjlindsay: DN index delete failed (LDAP) [successful deletes] Feb 3 03:56:55 derp idmgmt[3722]: error deleting user ddshah: DN index delete failed (LDAP) [successful deletes] Feb 3 04:00:10 derp idmgmt[3722]: error binding to directory: internal error (LDAP) -----
After the index errors, LDAP stopped working. The logs on the master showed:
----- Feb 3 03:50:36 fosse slapd[5129]: bdb(dc=csupomona,dc=edu): file dn2id.bdb has LSN 8388745/33900185 , past end of log at 137/33937441 Feb 3 03:50:36 fosse slapd[5129]: bdb(dc=csupomona,dc=edu): Commonly caused by moving a database fr om one transactional database Feb 3 03:50:36 fosse slapd[5129]: bdb(dc=csupomona,dc=edu): environment to another without clearing the database LSNs, or removing Feb 3 03:50:36 fosse slapd[5129]: bdb(dc=csupomona,dc=edu): all of the log files from a database en vironment Feb 3 04:00:10 fosse slapd[5129]: bdb(dc=csupomona,dc=edu): DB_ENV->log_flush: LSN of 8388745/33900 185 past current end-of-log of 137/37666668 Feb 3 04:00:10 fosse slapd[5129]: bdb(dc=csupomona,dc=edu): Database environment corrupt; the wrong log files may have been removed or incompatible database files imported from another environment Feb 3 04:00:10 fosse slapd[5129]: bdb(dc=csupomona,dc=edu): PANIC: DB_RUNRECOVERY: Fatal error, run database recovery Feb 3 04:00:10 fosse slapd[5129]: bdb(dc=csupomona,dc=edu): dn2id.bdb: unable to flush page: 3118 Feb 3 04:00:10 fosse slapd[5129]: bdb(dc=csupomona,dc=edu): txn_checkpoint: failed to flush the buf fer cache: DB_RUNRECOVERY: Fatal error, run database recovery Feb 3 04:00:10 fosse slapd[5129]: bdb(dc=csupomona,dc=edu): PANIC: fatal region error detected; run recovery -----
I restarted slapd, resulting in:
----- Feb 3 06:55:53 fosse slapd[5129]: bdb_db_close: txn_checkpoint failed: DB_RUNRECOVERY: Fatal error, run database recovery (-30975) Feb 3 06:55:53 fosse slapd[5129]: bdb(dc=csupomona,dc=edu): PANIC: fatal region error detected; run recovery Feb 3 06:55:53 fosse slapd[5129]: bdb_db_close: close failed: DB_RUNRECOVERY: Fatal error, run database recovery (-30975) Feb 3 06:55:53 fosse slapd[5129]: slapd stopped. Feb 3 06:56:02 fosse slapd[11055]: @(#) $OpenLDAP: slapd 2.3.40 (Jan 31 2008 21:55:32) $ portage@fosse:/var/tmp/portage/net-nds/openldap-2.3.40-r1/work/openldap-2.3.40/servers/slapd - Server is unavailable Feb 3 06:56:18 fosse slapd[11056]: hdb_db_open: unclean shutdown detected; attempting recovery. Feb 3 06:56:20 fosse slapd[11056]: slapd starting -----
I have two syncrepl slaves, both had failed at the exact same time with the exact same logs:
----- Feb 3 04:00:11 filmore slapd[5127]: bdb(dc=csupomona,dc=edu): DB_ENV->log_flush: LSN of 8388762/370 13381 past current end-of-log of 154/40862130 Feb 3 04:00:11 filmore slapd[5127]: bdb(dc=csupomona,dc=edu): Database environment corrupt; the wro ng log files may have been removed or incompatible database files imported from another environment Feb 3 04:00:11 filmore slapd[5127]: bdb(dc=csupomona,dc=edu): PANIC: DB_RUNRECOVERY: Fatal error, r un database recovery Feb 3 04:00:11 filmore slapd[5127]: bdb(dc=csupomona,dc=edu): dn2id.bdb: unable to flush page: 3118 Feb 3 04:00:11 filmore slapd[5127]: bdb(dc=csupomona,dc=edu): txn_checkpoint: failed to flush the b uffer cache: DB_RUNRECOVERY: Fatal error, run database recovery Feb 3 04:00:11 filmore slapd[5127]: bdb(dc=csupomona,dc=edu): PANIC: fatal region error detected; r un recovery Feb 3 04:00:11 filmore slapd[5127]: null_callback: error code 0x50
Feb 3 06:57:41 filmore slapd[5127]: bdb_db_close: txn_checkpoint failed: DB_RUNRECOVERY: Fatal error, run database recovery (-30975) Feb 3 06:57:41 filmore slapd[5127]: bdb(dc=csupomona,dc=edu): PANIC: fatal region error detected; run recovery Feb 3 06:57:41 filmore slapd[5127]: bdb_db_close: close failed: DB_RUNRECOVERY: Fatal error, run database recovery (-30975) Feb 3 06:57:41 filmore slapd[5127]: slapd stopped. Feb 3 06:57:49 filmore slapd[20722]: @(#) $OpenLDAP: slapd 2.3.40 (Jan 28 2008 18:22:34) $ portage@filmore:/var/tmp/portage/net-nds/openldap-2.3.40-r1/work/openldap-2.3.40/servers/slapd Feb 3 06:57:49 filmore slapd[20723]: hdb_db_open: unclean shutdown detected; attempting recovery. Feb 3 06:57:51 filmore slapd[20723]: slapd starting -----
After a restart, things seemed to be working ok. I deleted one object by hand with no problem. But then on deleting the second, I received the same index failure.
Finally, I shut everything down, did a slapcat to ldif, deleted the entire DB, and then a slapadd to regenerate from scratch.
So far, so good after that. I restarted my batch delete, it's been running for 20 minutes with no errors (fingers crossed).
What happened? It can't be local corruption, all three servers failed exactly the same way. I didn't upgrade bdb, it's the exact same version that's been running. I didn't think I needed to rebuild the db on an upgrade from 2.3.35 to 2.3.40, was I mistaken? Should I have done a dump/reload? Was this failure possibly a bug in 2.3.40? A bug in 2.3.35 exposed by the upgrade?
Any thoughts much appreciated...
Thanks much...