Dealing with BDB Crash

List overview All Threads
Download

newer

older

Re: Antwort: Re: RHEL 6 OpenLDAP...

issue with slapadd

ldap＠mm.st

31 Mar 2011 31 Mar '11

12:10 a.m.

A while ago I posted that we were having what we thought were random bdb backend crashes with the following in our log:

bdb(o=example.com): PANIC: fatal region error detected; run recovery.

This was on a on our RH5 openldap servers (2.3.43) that we were rebuilding:

It appears that the crashes were caused by a vulnerability scanner that was hitting the server (still testing), even though it was suppose to be safe. We'll have to investigate what is causing it, maybe we will need an acl to stop whatever the scanner is doing. Once we stopped the automated scan, the servers seem to be running as expected.

But, this brought up another issue. When the bdb backend failed, the slapd process continued run and listen on the ldap ports and clients still tried to connect to the failed server for authentication. The server accepted and established the connection with the client. Of course the client could not authenticate since the backend db was down. The client will not fail over to the other server that is listed in it's ldap.conf file since it thinks it has a valid connection. If the slap process is not running then the fail over works fine since no ports are there for the client to connect to.

I'm thinking that bdb failures will be rare once we solve the scanner issue, but on a network that relies heavily on ldap, a failed bdb backend with a running slapd would cause significant issues.

Just trying to restart the slapd service doesn't fix the issue, a manual recovery is required (slapd_db_recover). I was curious if anyone has put something in place to deal with this potential issue? Maybe run slapd_db_status via cron and if it errors due a bdb corruption, just stop slapd and let the admin know. At least the clients would be able to failover to the other ldap servers. I guess an automated recovery is possible via a script, but I'm not sure if that's a good idea. Maybe dealing with this type of failure is not really required, I was hoping that some of you that have been do this for a while would have some insight.

Show replies by date

Aaron Richton

31 Mar 31 Mar

12:43 a.m.

So, the best defense is a good offense in this case, and if you were running 2.4.25 with the appropriate BerkeleyDB library you'd likely not see an issue of this manner.

With that said, there was a time (with earlier releases of OpenLDAP) when we had this issue (one bdb go down, with the service apparently working via an overly simple smoke test). Not being fans of being bitten by the same failure mode twice, we wrote up a Nagios check that searches a known-present-on-disk entry that is in each of our databases. (You can either create one, or (ab)use "ou=People" if you're RFC2307 or use "cn=Manager" or what have you...) If any database doesn't return a hit, time for us to get a call.

As an aside, I find this thoroughly fascinating timing. Not that it'll make you feel any better in the present case, but I was just considering writing something up for the next LDAPCon on how we do monitoring (there are ~10 angles we check from, many of them due to real life situations similar to yours). They're all relatively simple ideas like the above, but I suppose cleaning up our code to the point where it's world-safe and getting something written up on it may be useful. They've proven occasionally useful for slapd(8) code issues and also, more frequently, useful in the face of human factors.

On Wed, 30 Mar 2011, ldap@mm.st wrote:

...

A while ago I posted that we were having what we thought were random bdb backend crashes with the following in our log:

bdb(o=example.com): PANIC: fatal region error detected; run recovery.

This was on a on our RH5 openldap servers (2.3.43) that we were rebuilding:

It appears that the crashes were caused by a vulnerability scanner that was hitting the server (still testing), even though it was suppose to be safe. We'll have to investigate what is causing it, maybe we will need an acl to stop whatever the scanner is doing. Once we stopped the automated scan, the servers seem to be running as expected.

But, this brought up another issue. When the bdb backend failed, the slapd process continued run and listen on the ldap ports and clients still tried to connect to the failed server for authentication. The server accepted and established the connection with the client. Of course the client could not authenticate since the backend db was down. The client will not fail over to the other server that is listed in it's ldap.conf file since it thinks it has a valid connection. If the slap process is not running then the fail over works fine since no ports are there for the client to connect to.

I'm thinking that bdb failures will be rare once we solve the scanner issue, but on a network that relies heavily on ldap, a failed bdb backend with a running slapd would cause significant issues.

Just trying to restart the slapd service doesn't fix the issue, a manual recovery is required (slapd_db_recover). I was curious if anyone has put something in place to deal with this potential issue? Maybe run slapd_db_status via cron and if it errors due a bdb corruption, just stop slapd and let the admin know. At least the clients would be able to failover to the other ldap servers. I guess an automated recovery is possible via a script, but I'm not sure if that's a good idea. Maybe dealing with this type of failure is not really required, I was hoping that some of you that have been do this for a while would have some insight.

Mark

1:25 a.m.

What is 'the appropriate BerkeleyDB library'? Are you referring to a specific *version*? I have a new OpenLDAP installation I've built with OpenLDAP 2.4.25 & BDB 4.8.30 and have been testing in a 4-way multi-master setup. Should I take the plunge to BDB 5.1.25?

On Wed, Mar 30, 2011 at 5:43 PM, Aaron Richton richton@nbcs.rutgers.eduwrote:

...

So, the best defense is a good offense in this case, and if you were running 2.4.25 with the appropriate BerkeleyDB library you'd likely not see an issue of this manner.

With that said, there was a time (with earlier releases of OpenLDAP) when we had this issue (one bdb go down, with the service apparently working via an overly simple smoke test). Not being fans of being bitten by the same failure mode twice, we wrote up a Nagios check that searches a known-present-on-disk entry that is in each of our databases. (You can either create one, or (ab)use "ou=People" if you're RFC2307 or use "cn=Manager" or what have you...) If any database doesn't return a hit, time for us to get a call.

As an aside, I find this thoroughly fascinating timing. Not that it'll make you feel any better in the present case, but I was just considering writing something up for the next LDAPCon on how we do monitoring (there are ~10 angles we check from, many of them due to real life situations similar to yours). They're all relatively simple ideas like the above, but I suppose cleaning up our code to the point where it's world-safe and getting something written up on it may be useful. They've proven occasionally useful for slapd(8) code issues and also, more frequently, useful in the face of human factors.

On Wed, 30 Mar 2011, ldap@mm.st wrote:

A while ago I posted that we were having what we thought were random bdb

...
backend crashes with the following in our log:

bdb(o=example.com): PANIC: fatal region error detected; run recovery.

This was on a on our RH5 openldap servers (2.3.43) that we were rebuilding:

It appears that the crashes were caused by a vulnerability scanner that was hitting the server (still testing), even though it was suppose to be safe. We'll have to investigate what is causing it, maybe we will need an acl to stop whatever the scanner is doing. Once we stopped the automated scan, the servers seem to be running as expected.

But, this brought up another issue. When the bdb backend failed, the slapd process continued run and listen on the ldap ports and clients still tried to connect to the failed server for authentication. The server accepted and established the connection with the client. Of course the client could not authenticate since the backend db was down. The client will not fail over to the other server that is listed in it's ldap.conf file since it thinks it has a valid connection. If the slap process is not running then the fail over works fine since no ports are there for the client to connect to.

I'm thinking that bdb failures will be rare once we solve the scanner issue, but on a network that relies heavily on ldap, a failed bdb backend with a running slapd would cause significant issues.

Just trying to restart the slapd service doesn't fix the issue, a manual recovery is required (slapd_db_recover). I was curious if anyone has put something in place to deal with this potential issue? Maybe run slapd_db_status via cron and if it errors due a bdb corruption, just stop slapd and let the admin know. At least the clients would be able to failover to the other ldap servers. I guess an automated recovery is possible via a script, but I'm not sure if that's a good idea. Maybe dealing with this type of failure is not really required, I was hoping that some of you that have been do this for a while would have some insight.

Aaron Richton

2:54 p.m.

On Wed, 30 Mar 2011, Mark wrote:

...

What is 'the appropriate BerkeleyDB library'? Are you referring to a specific version? I have a new OpenLDAP installation I've built with OpenLDAP 2.4.25 & BDB 4.8.30 and have been testing in a 4-way multi-master setup. Should I take the plunge to BDB 5.1.25?

This changes a bit with time; the experience of the community is occasionally shared on openldap-technical and/or openldap-devel. Keep in mind that not all BerkeleyDB<>OpenLDAP combinations are compatible (I'm noting an unfortunate recent spate of 2.3.43 users recently, for example) so that obviously must color the decision as well.

As some rules of thumb based on experience:

* if Oracle releases an official patch, that patch is almost certainly required --- especially if that's the version you've chosen already

* any version that is pulled from the official download site is to be avoided/uninstalled as soon as practical

* as with any software, allowing some lab and early adopter/new installation soak time is important

To take a practical/today's example, 5.1.25 was announced on February 3. If I was on 5.1.19 I'd have moved to it already. However, not being on 5.1, I'd question whether two months of soak is enough versus the generally positive reputation of 4.8.30. Of course, there's no one answer; different organizations have different risk tolerance/patch policy/etc. It's also worth considering your resources/deployment model. Decisions in a syncrepl multimaster environment, where a server should be able to be shot with no user-visible issue, may not be the same as in a classic provider-consumer model.

Dan Pritts

1 Apr 1 Apr

5:54 p.m.

When I recently built openldap 2.4.23, I first tried using whatever 5.x version was the latest on oracle's site.

I do not remember the specifics, but i had problems trying to build openldap with that version of bdb.

I then noticed this in the openldap README file:

SLAPD: BDB and HDB backends require Oracle Berkeley DB 4.4, 4.5, 4.6, 4.7, or 4.8. It is highly recommended to apply the patches from Oracle for a given release.

I assumed that meant 5.x wasn't supported, built 4.8.latest, and went on with life.

Anyway, if 5.x IS supported it would be great to update that README file for the next openldap release.

On Mar 31, 2011, at 8:54 AM, Aaron Richton wrote:

...

On Wed, 30 Mar 2011, Mark wrote:

...
What is 'the appropriate BerkeleyDB library'? Are you referring to a specific version? I have a new OpenLDAP installation I've built with OpenLDAP 2.4.25 & BDB 4.8.30 and have been testing in a 4-way multi-master setup. Should I take the plunge to BDB 5.1.25?

This changes a bit with time; the experience of the community is occasionally shared on openldap-technical and/or openldap-devel. Keep in mind that not all BerkeleyDB<>OpenLDAP combinations are compatible (I'm noting an unfortunate recent spate of 2.3.43 users recently, for example) so that obviously must color the decision as well.

As some rules of thumb based on experience:

if Oracle releases an official patch, that patch is almost certainly required --- especially if that's the version you've chosen already

any version that is pulled from the official download site is to be avoided/uninstalled as soon as practical

as with any software, allowing some lab and early adopter/new installation soak time is important

To take a practical/today's example, 5.1.25 was announced on February 3. If I was on 5.1.19 I'd have moved to it already. However, not being on 5.1, I'd question whether two months of soak is enough versus the generally positive reputation of 4.8.30. Of course, there's no one answer; different organizations have different risk tolerance/patch policy/etc. It's also worth considering your resources/deployment model. Decisions in a syncrepl multimaster environment, where a server should be able to be shot with no user-visible issue, may not be the same as in a classic provider-consumer model.

danno -- Dan Pritts, Sr. Systems Engineer Internet2 office: +1-734-352-4953 | mobile: +1-734-834-7224

Quanah Gibson-Mount

6:06 p.m.

--On Friday, April 01, 2011 11:54 AM -0400 Dan Pritts danno@internet2.edu wrote:

...

When I recently built openldap 2.4.23, I first tried using whatever 5.x version was the latest on oracle's site.

I do not remember the specifics, but i had problems trying to build openldap with that version of bdb.

I then noticed this in the openldap README file:
SLAPD:
    BDB and HDB backends require Oracle Berkeley DB 4.4, 4.5,
    4.6, 4.7, or 4.8.  It is highly recommended to apply the
    patches from Oracle for a given release.
I assumed that meant 5.x wasn't supported, built 4.8.latest, and went on with life.

Anyway, if 5.x IS supported it would be great to update that README file for the next openldap release.

It was, when 5.x configure support was added in 2.4.24+. 2.4.23 works fine with BDB 5.x, but you have to patch configure to detect it properly.

--Quanah

Quanah Gibson-Mount Sr. Member of Technical Staff Zimbra, Inc A Division of VMware, Inc. -------------------- Zimbra :: the leader in open source messaging and collaboration

ldap＠mm.st

31 Mar 31 Mar

1:25 a.m.

Interesting and good information. We happen to use Big Brother/Xymon for monitoring and have multiple scripts to check things like cache, locks etc. We will get notified when these sense a problem, but at 1AM on a Saturday getting notified and fixing the issue before all those services get impacted is a little scary. That's why we were contemplating that maybe it would be wise to "hit it with a hammer" until we are able to intervene and repair.

On Wed, 30 Mar 2011 18:43 -0400, "Aaron Richton" richton@nbcs.rutgers.edu wrote:

...

So, the best defense is a good offense in this case, and if you were running 2.4.25 with the appropriate BerkeleyDB library you'd likely not see an issue of this manner.

With that said, there was a time (with earlier releases of OpenLDAP) when we had this issue (one bdb go down, with the service apparently working via an overly simple smoke test). Not being fans of being bitten by the same failure mode twice, we wrote up a Nagios check that searches a known-present-on-disk entry that is in each of our databases. (You can either create one, or (ab)use "ou=People" if you're RFC2307 or use "cn=Manager" or what have you...) If any database doesn't return a hit, time for us to get a call.

As an aside, I find this thoroughly fascinating timing. Not that it'll make you feel any better in the present case, but I was just considering writing something up for the next LDAPCon on how we do monitoring (there are ~10 angles we check from, many of them due to real life situations similar to yours). They're all relatively simple ideas like the above, but I suppose cleaning up our code to the point where it's world-safe and getting something written up on it may be useful. They've proven occasionally useful for slapd(8) code issues and also, more frequently, useful in the face of human factors.

On Wed, 30 Mar 2011, ldap@mm">ldap@mm.st wrote:

...
A while ago I posted that we were having what we thought were random bdb backend crashes with the following in our log:

bdb(o=example.com): PANIC: fatal region error detected; run recovery.

This was on a on our RH5 openldap servers (2.3.43) that we were rebuilding:

It appears that the crashes were caused by a vulnerability scanner that was hitting the server (still testing), even though it was suppose to be safe. We'll have to investigate what is causing it, maybe we will need an acl to stop whatever the scanner is doing. Once we stopped the automated scan, the servers seem to be running as expected.

But, this brought up another issue. When the bdb backend failed, the slapd process continued run and listen on the ldap ports and clients still tried to connect to the failed server for authentication. The server accepted and established the connection with the client. Of course the client could not authenticate since the backend db was down. The client will not fail over to the other server that is listed in it's ldap.conf file since it thinks it has a valid connection. If the slap process is not running then the fail over works fine since no ports are there for the client to connect to.

I'm thinking that bdb failures will be rare once we solve the scanner issue, but on a network that relies heavily on ldap, a failed bdb backend with a running slapd would cause significant issues.

Just trying to restart the slapd service doesn't fix the issue, a manual recovery is required (slapd_db_recover). I was curious if anyone has put something in place to deal with this potential issue? Maybe run slapd_db_status via cron and if it errors due a bdb corruption, just stop slapd and let the admin know. At least the clients would be able to failover to the other ldap servers. I guess an automated recovery is possible via a script, but I'm not sure if that's a good idea. Maybe dealing with this type of failure is not really required, I was hoping that some of you that have been do this for a while would have some insight.

Bill MacAllister

8:50 p.m.

New subject: LDAPCon?

--On Wednesday, March 30, 2011 06:43:48 PM -0400 Aaron Richton richton@nbcs.rutgers.edu wrote:

...

Not that it'll make you feel any better in the present case, but I was just considering writing something up for the next LDAPCon on how we do monitoring (there are ~10 angles we check from, many of them due to real life situations similar to yours).

It there another LDAPCon scheduled? I certainly would be willing to help make these happen a bit more regularly.

Bill

-- Bill MacAllister Infrastructure Delivery Group, Stanford University

Michael Ströder

1 Apr 1 Apr

8:35 a.m.

New subject: LDAPCon?

Bill MacAllister wrote:

...

--On Wednesday, March 30, 2011 06:43:48 PM -0400 Aaron Richton richton@nbcs.rutgers.edu wrote:

...
Not that it'll make you feel any better in the present case, but I was just considering writing something up for the next LDAPCon on how we do monitoring (there are ~10 angles we check from, many of them due to real life situations similar to yours).

It there another LDAPCon scheduled? I certainly would be willing to help make these happen a bit more regularly.

Yes, 3rd LDAPcon 2011 is organized by DAASI, October 10 – 11 in Heidelberg, Germany.

see http://www.ldapcon.org

Ciao, Michael.