A while ago I posted that we were having what we thought were random bdb backend crashes with the following in our log:
bdb(o=example.com): PANIC: fatal region error detected; run recovery.
This was on our RH5 openldap servers (2.3.43), which we were rebuilding.
It appears that the crashes were caused by a vulnerability scanner that was hitting the server (still testing), even though the scan was supposed to be safe. We'll have to investigate what exactly is causing it; maybe we will need an acl to block whatever the scanner is doing. Since we stopped the automated scan, the servers seem to be running as expected.
But this brought up another issue. When the bdb backend failed, the slapd process continued to run and listen on the ldap ports, and clients still tried to connect to the failed server for authentication. The server accepted and established the connection with the client. Of course the client could not authenticate, since the backend db was down. The client will not fail over to the other server listed in its ldap.conf file, since it thinks it has a valid connection. If the slapd process is not running, the failover works fine, since there are no ports for the client to connect to.
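To illustrate why a plain port check misses this failure, here is a minimal sketch of a probe that only reports success if the backend actually answers a search. It shells out to ldapsearch; the function name backend_answers, the URI and the timeout are made up for illustration, and it assumes that an anonymous base-scope read of the suffix entry is allowed and that a search against the failed backend returns an error.

```python
import subprocess


def backend_answers(uri, base, run=subprocess.run, timeout=5):
    """Return True only if a base-scope search under `base` succeeds.

    A successful TCP connect is not enough: slapd keeps listening after
    the bdb backend has failed, so we ask for an actual entry.
    `run` is injectable so the logic can be exercised without a server.
    """
    cmd = [
        "ldapsearch", "-x", "-LLL",   # simple bind, terse LDIF output
        "-H", uri,                    # server to probe
        "-b", base, "-s", "base",     # read just the suffix entry
        "1.1",                        # request no attributes
    ]
    try:
        result = run(cmd, stdout=subprocess.DEVNULL,
                     stderr=subprocess.DEVNULL, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        # ldapsearch missing, or the server hung: treat as "not answering"
        return False
    return result.returncode == 0
```

For example, backend_answers("ldap://ldap1.example.com", "o=example.com") would be expected to return False on a server in the state described above, even though the port is open (the hostname is a placeholder).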
I'm thinking that bdb failures will be rare once we solve the scanner issue, but on a network that relies heavily on ldap, a failed bdb backend with a running slapd would cause significant issues.
Simply restarting the slapd service doesn't fix the issue; a manual recovery (slapd_db_recover) is required. Has anyone put something in place to deal with this potential issue? Maybe run slapd_db_stat via cron and, if it errors due to a bdb corruption, just stop slapd and let the admin know. At least the clients would then be able to fail over to the other ldap servers. I guess an automated recovery via a script is possible, but I'm not sure that's a good idea. Maybe dealing with this type of failure is not really required; I was hoping that some of you who have been doing this for a while would have some insight.
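The cron idea could be sketched roughly like this. It is untested against a real failed environment and rests on several assumptions: that the check tool is slapd_db_stat and that it exits non-zero or prints the "run recovery" message when the environment has panicked, that the database lives in /var/lib/ldap, and that "service ldap stop" is the right way to stop slapd on the box. It deliberately does not run slapd_db_recover; it only stops slapd so clients can fail over, and returns a message for the admin.

```python
import subprocess

DB_DIR = "/var/lib/ldap"                            # assumption: bdb directory
CHECK_CMD = ["slapd_db_stat", "-h", DB_DIR, "-e"]   # environment statistics
STOP_CMD = ["service", "ldap", "stop"]              # assumption: init script name
PANIC_MARKER = "run recovery"                       # taken from the log message


def environment_is_broken(run=subprocess.run):
    """Run the check command; return (broken, combined output)."""
    result = run(CHECK_CMD, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                 universal_newlines=True, timeout=30)
    output = result.stdout or ""
    broken = result.returncode != 0 or PANIC_MARKER in output
    return broken, output.strip()


def watchdog(run=subprocess.run):
    """Stop slapd if the bdb environment check fails.

    Returns None when everything looks fine, otherwise a message for the
    admin. Recovery is left as a manual step.
    """
    broken, detail = environment_is_broken(run)
    if not broken:
        return None
    run(STOP_CMD)
    return ("slapd stopped: bdb environment check failed, "
            "manual slapd_db_recover needed\n" + detail)
```

A small wrapper that calls watchdog() and prints any returned message would be enough for cron, since cron mails a job's output to the configured recipient. If the check tool itself is missing or hangs, the exception is left to propagate rather than stopping a possibly healthy slapd.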