I've been running OpenLDAP 2.3.35 for almost a year with no problems (Gentoo Linux, w/DB 4.5). I upgraded to 2.3.40 last week, and had a meltdown this morning during an account purge.
There had been some updates to the directory after upgrading, but no deletes. During my delete run, there were some index errors:
-----
[successful deletes]
Feb  3 03:50:36 derp idmgmt[3722]: error deleting user cjlindsay: DN index delete failed (LDAP)
[successful deletes]
Feb  3 03:56:55 derp idmgmt[3722]: error deleting user ddshah: DN index delete failed (LDAP)
[successful deletes]
Feb  3 04:00:10 derp idmgmt[3722]: error binding to directory: internal error (LDAP)
-----
After the index errors, LDAP stopped working. The logs on the master showed:
-----
Feb  3 03:50:36 fosse slapd[5129]: bdb(dc=csupomona,dc=edu): file dn2id.bdb has LSN 8388745/33900185, past end of log at 137/33937441
Feb  3 03:50:36 fosse slapd[5129]: bdb(dc=csupomona,dc=edu): Commonly caused by moving a database from one transactional database
Feb  3 03:50:36 fosse slapd[5129]: bdb(dc=csupomona,dc=edu): environment to another without clearing the database LSNs, or removing
Feb  3 03:50:36 fosse slapd[5129]: bdb(dc=csupomona,dc=edu): all of the log files from a database environment
Feb  3 04:00:10 fosse slapd[5129]: bdb(dc=csupomona,dc=edu): DB_ENV->log_flush: LSN of 8388745/33900185 past current end-of-log of 137/37666668
Feb  3 04:00:10 fosse slapd[5129]: bdb(dc=csupomona,dc=edu): Database environment corrupt; the wrong log files may have been removed or incompatible database files imported from another environment
Feb  3 04:00:10 fosse slapd[5129]: bdb(dc=csupomona,dc=edu): PANIC: DB_RUNRECOVERY: Fatal error, run database recovery
Feb  3 04:00:10 fosse slapd[5129]: bdb(dc=csupomona,dc=edu): dn2id.bdb: unable to flush page: 3118
Feb  3 04:00:10 fosse slapd[5129]: bdb(dc=csupomona,dc=edu): txn_checkpoint: failed to flush the buffer cache: DB_RUNRECOVERY: Fatal error, run database recovery
Feb  3 04:00:10 fosse slapd[5129]: bdb(dc=csupomona,dc=edu): PANIC: fatal region error detected; run recovery
-----
I restarted slapd, resulting in:
-----
Feb  3 06:55:53 fosse slapd[5129]: bdb_db_close: txn_checkpoint failed: DB_RUNRECOVERY: Fatal error, run database recovery (-30975)
Feb  3 06:55:53 fosse slapd[5129]: bdb(dc=csupomona,dc=edu): PANIC: fatal region error detected; run recovery
Feb  3 06:55:53 fosse slapd[5129]: bdb_db_close: close failed: DB_RUNRECOVERY: Fatal error, run database recovery (-30975)
Feb  3 06:55:53 fosse slapd[5129]: slapd stopped.
Feb  3 06:56:02 fosse slapd[11055]: @(#) $OpenLDAP: slapd 2.3.40 (Jan 31 2008 21:55:32) $ portage@fosse:/var/tmp/portage/net-nds/openldap-2.3.40-r1/work/openldap-2.3.40/servers/slapd - Server is unavailable
Feb  3 06:56:18 fosse slapd[11056]: hdb_db_open: unclean shutdown detected; attempting recovery.
Feb  3 06:56:20 fosse slapd[11056]: slapd starting
-----
I have two syncrepl slaves, both had failed at the exact same time with the exact same logs:
-----
Feb  3 04:00:11 filmore slapd[5127]: bdb(dc=csupomona,dc=edu): DB_ENV->log_flush: LSN of 8388762/37013381 past current end-of-log of 154/40862130
Feb  3 04:00:11 filmore slapd[5127]: bdb(dc=csupomona,dc=edu): Database environment corrupt; the wrong log files may have been removed or incompatible database files imported from another environment
Feb  3 04:00:11 filmore slapd[5127]: bdb(dc=csupomona,dc=edu): PANIC: DB_RUNRECOVERY: Fatal error, run database recovery
Feb  3 04:00:11 filmore slapd[5127]: bdb(dc=csupomona,dc=edu): dn2id.bdb: unable to flush page: 3118
Feb  3 04:00:11 filmore slapd[5127]: bdb(dc=csupomona,dc=edu): txn_checkpoint: failed to flush the buffer cache: DB_RUNRECOVERY: Fatal error, run database recovery
Feb  3 04:00:11 filmore slapd[5127]: bdb(dc=csupomona,dc=edu): PANIC: fatal region error detected; run recovery
Feb  3 04:00:11 filmore slapd[5127]: null_callback: error code 0x50
Feb  3 06:57:41 filmore slapd[5127]: bdb_db_close: txn_checkpoint failed: DB_RUNRECOVERY: Fatal error, run database recovery (-30975)
Feb  3 06:57:41 filmore slapd[5127]: bdb(dc=csupomona,dc=edu): PANIC: fatal region error detected; run recovery
Feb  3 06:57:41 filmore slapd[5127]: bdb_db_close: close failed: DB_RUNRECOVERY: Fatal error, run database recovery (-30975)
Feb  3 06:57:41 filmore slapd[5127]: slapd stopped.
Feb  3 06:57:49 filmore slapd[20722]: @(#) $OpenLDAP: slapd 2.3.40 (Jan 28 2008 18:22:34) $ portage@filmore:/var/tmp/portage/net-nds/openldap-2.3.40-r1/work/openldap-2.3.40/servers/slapd
Feb  3 06:57:49 filmore slapd[20723]: hdb_db_open: unclean shutdown detected; attempting recovery.
Feb  3 06:57:51 filmore slapd[20723]: slapd starting
-----
After a restart, things seemed to be working ok. I deleted one object by hand with no problem. But then on deleting the second, I received the same index failure.
Finally, I shut everything down, did a slapcat to ldif, deleted the entire DB, and then a slapadd to regenerate from scratch.
So far, so good after that. I restarted my batch delete, it's been running for 20 minutes with no errors (fingers crossed).
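For the record, the rebuild was just the standard dump/reload cycle; roughly the following (the paths and init script name here are illustrative, not our exact setup):

-----
# rough sketch of the dump/reload, with everything stopped
/etc/init.d/slapd stop
slapcat -f /etc/openldap/slapd.conf -l /tmp/full-dump.ldif
# wipe the old environment: data files, region files, any logs, alock
rm -f /var/lib/openldap-data/*.bdb /var/lib/openldap-data/__db.* \
      /var/lib/openldap-data/log.* /var/lib/openldap-data/alock
slapadd -q -f /etc/openldap/slapd.conf -l /tmp/full-dump.ldif
/etc/init.d/slapd start
-----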
What happened? It can't be local corruption, all three servers failed exactly the same way. I didn't upgrade bdb, it's the exact same version that's been running. I didn't think I needed to rebuild the db on an upgrade from 2.3.35 to 2.3.40, was I mistaken? Should I have done a dump/reload? Was this failure possibly a bug in 2.3.40? A bug in 2.3.35 exposed by the upgrade?
Any thoughts much appreciated...
Thanks much...
On Mon, 4 Feb 2008, Michael Ströder wrote:
Paul B. Henson wrote:
Feb 3 03:50:36 derp idmgmt[3722]: error deleting user cjlindsay: DN index delete failed (LDAP)
Everything right with ownership/permissions on the data files?
Yes. Nothing had changed on the server since the upgrade. There were successful deletes before this one and some successful ones after.
On Mon, 2008-02-04 at 12:07 -0800, Paul B. Henson wrote:
On Mon, 4 Feb 2008, Michael Ströder wrote:
Paul B. Henson wrote:
Feb 3 03:50:36 derp idmgmt[3722]: error deleting user cjlindsay: DN index delete failed (LDAP)
Everything right with ownership/permissions on the data files?
Yes. Nothing had changed on the server since the upgrade. There were successful deletes before this one and some successful ones after.
Was your OL 2.3.40 built with the same version of berkeley DB as the previous 2.3.35 one? (Sorry if you already answered that)
On Tue, 5 Feb 2008, Andreas Hasenack wrote:
Yes. Nothing had changed on the server since the upgrade. There were successful deletes before this one and some successful ones after.
Was your OL 2.3.40 built with the same version of berkeley DB as the previous 2.3.35 one? (Sorry if you already answered that)
Yes.
--On Sunday, February 03, 2008 8:14 AM -0800 "Paul B. Henson" henson@acm.org wrote:
What happened? It can't be local corruption, all three servers failed exactly the same way. I didn't upgrade bdb, it's the exact same version that's been running. I didn't think I needed to rebuild the db on an upgrade from 2.3.35 to 2.3.40, was I mistaken? Should I have done a dump/reload? Was this failure possibly a bug in 2.3.40? A bug in 2.3.35 exposed by the upgrade?
Did you stop slapd and run db_recover before upgrading to 2.3.40?
Was 2.3.35 using back-bdb, and you changed to back-hdb with 2.3.40?
--Quanah
--
Quanah Gibson-Mount
Principal Software Engineer
Zimbra, Inc
--------------------
Zimbra :: the leader in open source messaging and collaboration
Quanah Gibson-Mount wrote:
--On Sunday, February 03, 2008 8:14 AM -0800 "Paul B. Henson" henson@acm.org wrote:
What happened? It can't be local corruption, all three servers failed exactly the same way. I didn't upgrade bdb, it's the exact same version that's been running. I didn't think I needed to rebuild the db on an upgrade from 2.3.35 to 2.3.40, was I mistaken? Should I have done a dump/reload? Was this failure possibly a bug in 2.3.40? A bug in 2.3.35 exposed by the upgrade?
Did you stop slapd and run db_recover before upgrading to 2.3.40?
Was 2.3.35 using back-bdb, and you changed to back-hdb with 2.3.40?
No change of that sort would cause these error messages.
Feb  3 03:50:36 fosse slapd[5129]: bdb(dc=csupomona,dc=edu): file dn2id.bdb has LSN 8388745/33900185, past end of log at 137/33937441
Feb  3 03:50:36 fosse slapd[5129]: bdb(dc=csupomona,dc=edu): Commonly caused by moving a database from one transactional database
Feb  3 03:50:36 fosse slapd[5129]: bdb(dc=csupomona,dc=edu): environment to another without clearing the database LSNs, or removing
Feb  3 03:50:36 fosse slapd[5129]: bdb(dc=csupomona,dc=edu): all of the log files from a database environment
This basically says that somebody zeroed out/removed the transaction log files.
On Mon, 4 Feb 2008, Howard Chu wrote:
This basically says that somebody zeroed out/removed the transaction log files.
Nothing ever touches the transaction log files in our configuration. I have "set_flags DB_LOG_AUTOREMOVE" in the DB_CONFIG file. Transaction logs are automatically pruned as necessary by bdb itself. We do backups via slapcat.
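For context, the relevant part of our DB_CONFIG is basically just that flag plus the usual cache/log tuning; something along these lines (the sizes shown are illustrative, the set_flags line is the relevant bit):

-----
# DB_CONFIG -- sizes are illustrative; the autoremove flag is the point
set_cachesize    0 268435456 1
set_lg_regionmax 262144
set_lg_bsize     2097152
set_flags        DB_LOG_AUTOREMOVE
-----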
Possibly a bug in bdb? It's strange that it happened simultaneously on three different servers. However, the actual database files were created on one server and then copied to the others; and presumably they all executed the same changes afterwards. It does seem oddly coincidental that after almost a year of successful operation the first mass deletion after the upgrade caused a problem.
Thanks...
Paul B. Henson wrote:
On Mon, 4 Feb 2008, Howard Chu wrote:
This basically says that somebody zeroed out/removed the transaction log files.
Nothing ever touches the transaction log files in our configuration. I have "set_flags DB_LOG_AUTOREMOVE" in the DB_CONFIG file. Transaction logs are automatically pruned as necessary by bdb itself. We do backups via slapcat.
Possibly a bug in bdb? It's strange that it happened simultaneously on three different servers. However, the actual database files were created on one server and then copied to the others; and presumably they all executed the same changes afterwards. It does seem oddly coincidental that after almost a year of successful operation the first mass deletion after the upgrade caused a problem.
You cannot copy the database files from one machine to another unless you're extremely careful and follow the procedures outlined in the BerkeleyDB documentation. It sounds like you didn't follow those procedures. Your problem isn't coincidental, it's inevitable when you don't RTFM.
On Mon, 4 Feb 2008, Howard Chu wrote:
You cannot copy the database files from one machine to another unless you're extremely careful and follow the procedures outlined in the BerkeleyDB documentation. It sounds like you didn't follow those procedures. Your problem isn't coincidental, it's inevitable when you don't RTFM.
Huh?
I can't seem to find it right now, but I distinctly recall reading documentation that indicated you could either start a syncrepl slave with no database, or to jumpstart the process and make it quicker you could copy the database from the master on to it before starting.
While no longer in the 2.4 documentation, the documentation from 2.3:
http://www.openldap.org/doc/admin23/replication.html#Configuring%20slurpd%20...
Discusses copying the database from the master to the slave in a slurpd context. Other than indicating to be sure that both systems are homogenous (same hardware, same OS, same versions; which in my case was completely true), there are no dire warnings or pointers to BerkeleyDB documentation procedures.
I took another quick look at the BerkeleyDB documentation on the Oracle site and did not see anything that seemed relevant to copying databases between machines. Could I trouble you for a URL to see whether there is anything in those procedures that might have been violated?
Also, even if for some reason the copies on the two slaves were invalid, that would not explain why the master failed. The database on the master was the original database built by slapadd when the server was first put into commission. How could making a copy of it have caused it to fail itself? Additionally taking into consideration that all three worked fine for almost a year under heavy load, it just doesn't seem likely that the failure of both the original master and both slaves was caused by an improper database copy.
Paul B. Henson wrote:
On Mon, 4 Feb 2008, Howard Chu wrote:
You cannot copy the database files from one machine to another unless you're extremely careful and follow the procedures outlined in the BerkeleyDB documentation. It sounds like you didn't follow those procedures. Your problem isn't coincidental, it's inevitable when you don't RTFM.
Huh?
I can't seem to find it right now, but I distinctly recall reading documentation that indicated you could either start a syncrepl slave with no database, or to jumpstart the process and make it quicker you could copy the database from the master on to it before starting.
While no longer in the 2.4 documentation, the documentation from 2.3:
http://www.openldap.org/doc/admin23/replication.html#Configuring%20slurpd%20...
Discusses copying the database from the master to the slave in a slurpd context. Other than indicating to be sure that both systems are homogenous (same hardware, same OS, same versions; which in my case was completely true), there are no dire warnings or pointers to BerkeleyDB documentation procedures.
That documentation is clearly obsolete, which is why it was removed.
I took another quick look at the BerkeleyDB documentation on the Oracle site and did not see anything that seemed relevant to copying databases between machines. Could I trouble you for a URL to see whether there is anything in those procedures that might have been violated?
http://www.oracle.com/technology/documentation/berkeley-db/db/ref/transapp/a...
At a guess, you failed to copy the transaction log files to the slaves.
Also, even if for some reason the copies on the two slaves were invalid, that would not explain why the master failed. The database on the master was the original database built by slapadd when the server was first put into commission. How could making a copy of it have caused it to fail itself?
Too difficult to guess, given the lack of information. We have only your assurance that nothing was done incorrectly, but the facts indicate that at least one step was done incorrectly.
On Mon, 4 Feb 2008, Howard Chu wrote:
Paul B. Henson wrote:
...
I took another quick look at the BerkeleyDB documentation on the Oracle site and did not see anything that seemed relevant to copying databases between machines. Could I trouble you for a URL to see whether there is anything in those procedures that might have been violated?
http://www.oracle.com/technology/documentation/berkeley-db/db/ref/transapp/a...
At a guess, you failed to copy the transaction log files to the slaves.
Or failed to perform catastrophic recovery instead of normal recovery. (I.e, db_recover must be invoked with the -c option.)
There's another catch: if the DB_LOG_AUTOREMOVE flag is in effect, then taking a hot backup may result in a broken backup if any transaction logs were removed during the backup. If you don't know about this and check for it, you won't even know that the backup is bogus.
So, as a practical matter, if you're using DB_LOG_AUTOREMOVE, then you need to shut down before trying to back up the database at the DB level.
(To quote the DBENV->set_flags() manual: DB_LOG_AUTOREMOVE If set, Berkeley DB will automatically remove log files that are no longer needed. Automatic log file removal is likely to make catastrophic recovery impossible.
...and catastrophic recovery is necessary for hot backups.)
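For anyone following along, the documented hot-backup sequence this implies is roughly the following (directory names are just examples):

-----
# BDB hot backup: copy the data files first, then the log files, then
# run *catastrophic* recovery on the copy. If DB_LOG_AUTOREMOVE deletes
# a log file between the two copy steps, the backup is silently broken.
mkdir -p /backup/ldap
cp /var/lib/openldap-data/*.bdb /backup/ldap/
cp /var/lib/openldap-data/log.* /backup/ldap/
db_recover -c -h /backup/ldap
-----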
Philip Guenther
On Mon, 4 Feb 2008, Philip Guenther wrote:
Or failed to perform catastrophic recovery instead of normal recovery. (I.e, db_recover must be invoked with the -c option.)
Well, while I won't discount the possibility I screwed up the copy, I still think it's unlikely and also that theory doesn't explain why the master (with the original uncopied database) died as well. Also, nothing was running during the copy, so it wouldn't have been considered a hot backup? And while still possible, it seems unlikely that this problem would not manifest for 10 months, and then just after an upgrade...
So, as a practical matter, if you're using DB_LOG_AUTOREMOVE, then you need to shut down before trying to back up the database at the DB level.
We actually rely on slapcat for backup. The database files haven't been touched (other than by slapd) or copied since the initial slapadd.
On Mon, 4 Feb 2008, Howard Chu wrote:
That documentation is clearly obsolete, which is why it was removed.
slurpd is obsolete, which is why the section on slurpd was removed from the 2.4 manual. Considering OpenLDAP-2.3.39 is still marked as the stable release, I can't really see that the 2.3 documentation in its entirety is obsolete.
http://www.oracle.com/technology/documentation/berkeley-db/db/ref/transapp/a...
Ah, that is the section on backing up/restoring a database, which I suppose could also be considered the same procedure to be used for copying a database from one system to another. Given your original wording, I was looking for something more specifically geared towards copying.
At a guess, you failed to copy the transaction log files to the slaves.
If I had failed to copy the transaction log files, I don't really see that it would have worked at all; let alone for almost a year.
Reviewing the backup/restore procedure, I don't really see anything I might have missed. slapd was not running during the copy, so clearly any updates were suspended. In fact, slapd had never been run -- the copy was made immediately after the initial slapadd. There were actually no log files present. As I mentioned, I have bdb configured to automatically remove them. Presumably slapadd explicitly/implicitly check pointed upon completion and they were removed. Even if there was a log file that I didn't see, the log files were stored in the same directory as the database files, and I copied the entire directory.
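In other words, the copy was the trivial offline case -- nothing fancier than something like this (rsync and the paths are illustrative; the point is a straight copy of the whole directory while nothing was running):

-----
# slapd never started on the master; copy the entire DB directory as-is
rsync -a /var/lib/openldap-data/ slave1:/var/lib/openldap-data/
rsync -a /var/lib/openldap-data/ slave2:/var/lib/openldap-data/
-----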
Also, even if for some reason the copies on the two slaves were invalid, that would not explain why the master failed. The database on the master was the original database built by slapadd when the server was first put into commission. How could making a copy of it have caused it to fail itself?
Too difficult to guess, given the lack of information. We have only your assurance that nothing was done incorrectly, but the facts indicate that at least one step was done incorrectly.
The facts only indicate that I had a catastrophic failure. That the failure was caused by incompetence is only a hypothesis.
I do greatly appreciate your response and willingness to help; I apologize if I'm getting a bit defensive.
You do have only my assurance that I didn't screw something up. However, assuming I'm not lying, the facts are:
* openldap 2.3.35 was initially installed on three servers
* on the master server, slapadd was run to load in an existing database in ldif format
* the resultant bdb database was then copied to both slaves
* all three were put into production March 2007 and ran perfectly under a reasonably heavy load
* a week or so ago I upgraded them to 2.3.40 (stop old server, install new server, start new server -- never touching bdb or the existing database files)
* they ran fine for at least 3-4 days
* this weekend, they died horribly
Given these facts, if something was done incorrectly, it does not seem likely that it was failure to copy a transaction log file in March 2007. If the failure was my own doing, it seems more likely a byproduct of the upgrade, although I can't think of anything that I could have done wrong during that process.
At this point, I guess I'll just write it off and hope it doesn't happen again.
--On February 4, 2008 5:30:29 PM -0800 "Paul B. Henson" henson@acm.org wrote:
Reviewing the backup/restore procedure, I don't really see anything I might have missed. slapd was not running during the copy, so clearly any updates were suspended. In fact, slapd had never been run -- the copy was made immediately after the initial slapadd. There were actually no log files present. As I mentioned, I have bdb configured to automatically remove them. Presumably slapadd explicitly/implicitly check pointed upon completion and they were removed. Even if there was a log file that I didn't see, the log files were stored in the same directory as the database files, and I copied the entire directory.
slapadd always creates at least one log file that would not be removed by automatic removal. If you had no log files when you were done, then something was done wrong.
--Quanah
--
Quanah Gibson-Mount
Principal Software Engineer
Zimbra, Inc
--------------------
Zimbra :: the leader in open source messaging and collaboration
On Mon, 4 Feb 2008, Quanah Gibson-Mount wrote:
slapadd always creates at least one log file that would not be removed by automatic removal. If you had no log files when you were done, then something was done wrong.
There's not much to slapadd; I'm not sure what could have been done wrong... I did use the -q option (otherwise it takes intractably long), but there were no errors or interruptions, and the database it created worked fine for 10 months or so.
I distinctly recall there were no log files last year, and I repeated the same procedure Sunday again with no log files at the end of the slapadd. The documentation says DB_LOG_AUTOREMOVE will "automatically remove log files that are no longer needed"; if the db is checkpointed at close wouldn't that make the log file unneeded?
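As I understand it, db_archive is the way to ask BDB itself which log files it still needs; e.g. (paths illustrative):

-----
# with nothing else using the environment:
db_archive -h /var/lib/openldap-data       # prints only logs no longer needed
db_archive -l -h /var/lib/openldap-data    # prints all log files present
-----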
I ran 'strace slapadd -q < /tmp/test.ldif > /tmp/slapadd.out 2>&1'
There was no log file, and 'grep 'log.' /tmp/slapadd.out' returned nothing.
Running without the -q, the same grep shows
open("/tmp/test/log.0000000001", O_RDWR|O_CREAT, 0600) = 4
Evidently the -q option to slapadd bypasses logging?
--On February 4, 2008 7:22:09 PM -0800 "Paul B. Henson" henson@acm.org wrote:
On Mon, 4 Feb 2008, Quanah Gibson-Mount wrote:
slapadd always creates at least one log file that would not be removed by automatic removal. If you had no log files when you were done, then something was done wrong.
There's not much to slapadd; I'm not sure what could have been done wrong... I did use the -q option (otherwise it takes intractably long), but there were no errors or interruptions, and the database it created worked fine for 10 months or so.
Let me expand slightly. If you *correctly* clean up the environment after slapadd finishes, you will have one non-archivable log file:
Here's the add:

[zimbra@freelancer openldap-data]$ /opt/zimbra/openldap/sbin/slapadd -b '' -q -f /opt/zimbra/conf/slapd.conf -l /tmp/output.ldif
[zimbra@freelancer openldap-data]$ ls
DB_CONFIG  __db.002   alock   db         entryCSN.bdb   id2entry.bdb  logs      objectClass.bdb  uid.bdb               zimbraId.bdb
__db.001   accesslog  cn.bdb  dn2id.bdb  entryUUID.bdb  ldap.bak      mail.bdb  sn.bdb           zimbraDomainName.bdb
[zimbra@freelancer openldap-data]$ ls logs/
[zimbra@freelancer openldap-data]$ /opt/zimbra/sleepycat/bin/db_recover
[zimbra@freelancer openldap-data]$ ls
DB_CONFIG  accesslog  alock  cn.bdb  db  dn2id.bdb  entryCSN.bdb  entryUUID.bdb  id2entry.bdb  ldap.bak  logs  mail.bdb  objectClass.bdb  sn.bdb  uid.bdb  zimbraDomainName.bdb
[zimbra@freelancer openldap-data]$ ls logs/
log.0000000001
[zimbra@freelancer openldap-data]$ /opt/zimbra/sleepycat/bin/db_archive
[zimbra@freelancer openldap-data]$
So you can see here -- slapadd completes. Now, it only creates a partial BDB environment (that's why there are two __db.* files). After it completes, before you can copy it anywhere, you need to run db_recover to clean that environment out, which I noted quite a long time ago.
You can see that once that is done, the log file is generated, and that it contains data necessary for recovery (which is why db_archive returns nothing). Way back at the beginning of this thread, I asked if you properly ran db_recover to clean up the environment first; the answer is obviously no.
As -q notes:
-q     enable quick (fewer integrity checks) mode. Does fewer
       consistency checks on the input data, and no consistency checks
       when writing the database. Improves the load time but if any
       errors or interruptions occur the resulting database will be
       unusable.
See the bit about no consistency checks when writing the DB. You need to run the db_recover after the add to finish up.
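I.e., something along these lines after any quick-mode load (paths illustrative):

-----
# finish/clean up the environment after slapadd -q, before copying it anywhere
slapadd -q -f /etc/openldap/slapd.conf -l /tmp/output.ldif
db_recover -h /var/lib/openldap-data
db_archive -h /var/lib/openldap-data   # empty output = remaining log still needed
-----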
--Quanah
--
Quanah Gibson-Mount
Principal Software Engineer
Zimbra, Inc
--------------------
Zimbra :: the leader in open source messaging and collaboration
Quanah Gibson-Mount wrote:
--On February 4, 2008 7:22:09 PM -0800 "Paul B. Henson" henson@acm.org wrote:
On Mon, 4 Feb 2008, Quanah Gibson-Mount wrote:
slapadd always creates at least one log file that would not be removed by automatic removal. If you had no log files when you were done, then something was done wrong.
There's not much to slapadd; I'm not sure what could have been done wrong... I did use the -q option (otherwise it takes intractably long), but there were no errors or interruptions, and the database it created worked fine for 10 months or so.
So you can see here -- Slapadd completes. Now, it only creates a partial BDB environment (That's why there are two __db.* files). After it completes, before you can copy it anywhere, you need to run db_recover to clean that environment out, which I noted quite a long time ago.
No. slapd will do the necessary recovery automatically in this case.
Paul's right - assuming the slapadd went well and nothing else was done, then a binary copy of the DB directory should have worked fine on another machine.
Still the fact remains that simply upgrading the slapd version from 2.3.35 to 2.3.40 wouldn't have done anything to the transaction log files, and the only reason that BDB would complain about those missing Log Sequence Numbers is because the log files no longer matched the database files.
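If you want to poke at it the next time it happens, with slapd stopped you can cross-check the environment with the standard BDB utilities; a rough sketch (paths illustrative, and the output format varies by BDB version):

-----
# current end-of-log position for the environment
db_stat -l -h /var/lib/openldap-data | head
# structural check of the data files themselves
db_verify /var/lib/openldap-data/dn2id.bdb
db_verify /var/lib/openldap-data/id2entry.bdb
-----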
--On Monday, February 04, 2008 8:24 PM -0800 Howard Chu hyc@symas.com wrote:
Quanah Gibson-Mount wrote:
--On February 4, 2008 7:22:09 PM -0800 "Paul B. Henson" henson@acm.org wrote:
On Mon, 4 Feb 2008, Quanah Gibson-Mount wrote:
slapadd always creates at least one log file that would not be removed by automatic removal. If you had no log files when you were done, then something was done wrong.
There's not much to slapadd; I'm not sure what could have been done wrong... I did use the -q option (otherwise it takes intractably long), but there were no errors or interruptions, and the database it created worked fine for 10 months or so.
So you can see here -- Slapadd completes. Now, it only creates a partial BDB environment (That's why there are two __db.* files). After it completes, before you can copy it anywhere, you need to run db_recover to clean that environment out, which I noted quite a long time ago.
No. slapd will do the necessary recovery automatically in this case.
Paul's right - assuming the slapadd went well and nothing else was done, then a binary copy of the DB directory should have worked fine on another machine.
Hm, I dunno, I've had enough issues with slapadd -q and not doing a db_recover that I'm rather particular about using it. ;)
--Quanah
--
Quanah Gibson-Mount
Principal Software Engineer
Zimbra, Inc
--------------------
Zimbra :: the leader in open source messaging and collaboration
On Mon, 4 Feb 2008, Howard Chu wrote:
Paul's right - assuming the slapadd went well and nothing else was done, then a binary copy of the DB directory should have worked fine on another machine.
And it did, for about 10 months :). I can't imagine a problem with the initial databases wouldn't have surfaced during that interval.
Still the fact remains that simply upgrading the slapd version from 2.3.35 to 2.3.40 wouldn't have done anything to the transaction log files, and the only reason that BDB would complain about those missing Log Sequence Numbers is because the log files no longer matched the database files.
I guess something else happened during the upgrade process that corrupted the environment on all three systems. I can't think of what that might be; but the problem happening right after the upgrade seems too coincidental. I did not upgrade bdb, but I did upgrade a handful of other packages and the kernel. I'm not sure how those would have caused a problem, just one of those mysteries of life I hope won't happen again 8-/.
Thanks again for the feedback...
On Tuesday 05 February 2008 05:22:09 Paul B. Henson wrote:
There's not much to slapadd; I'm not sure what could have been done wrong... I did use the -q option (otherwise it takes intractably long), but there were no errors or interruptions, and the database it created worked fine for 10 months or so.
I distinctly recall there were no log files last year, and I repeated the same procedure Sunday again with no log files at the end of the slapadd. The documentation says DB_LOG_AUTOREMOVE will "automatically remove log files that are no longer needed"; if the db is checkpointed at close wouldn't that make the log file unneeded?
No, you would still have one log file at most times, except that ...
I ran 'strace slapadd -q < /tmp/test.ldif > /tmp/slapadd.out 2>&1'
There was no log file, and 'grep 'log.' /tmp/slapadd.out' returned nothing.
Running without the -q, the same grep shows
open("/tmp/test/log.0000000001", O_RDWR|O_CREAT, 0600) = 4
Evidently the -q option to slapadd bypasses logging?
-q disables transactions, so there are no transaction logs after a slapadd -q.
Regards, Buchan
On Mon, 4 Feb 2008, Quanah Gibson-Mount wrote:
Did you stop slapd and run db_recover before upgrading to 2.3.40?
I did not run db_recover. I stopped the existing version, installed the new version, and then simply started it. I thought slapd now performed recovery automatically as needed?
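For next time, I gather the more cautious sequence would be something like this (the init script and emerge invocation are just how it would look on our Gentoo boxes, illustratively):

-----
# quiesce and recover the environment before swapping slapd binaries
/etc/init.d/slapd stop
db_recover -h /var/lib/openldap-data
emerge -1 net-nds/openldap     # or however the new version gets installed
/etc/init.d/slapd start
-----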
Was 2.3.35 using back-bdb, and you changed to back-hdb with 2.3.40?
No, the database has always been back-hdb.