----- ldap(a)mm.st wrote:
I am rebuilding our aging pre-2.2 OpenLDAP servers that ran the ldbm backend and slurpd. We ran this setup without any issues for many years.
The new setup is:
RH5
openldap 2.3.43 (Stock RH)
bdb backend 4.4.20 (Stock RH)
Entries in db: about 1820
LDIF file is about 1.2M
Memory: Master 4GB, Slave 2GB (will add two more slaves)
Database section of slapd.conf:
database bdb
suffix "o=example.com"
rootdn "cn=root,o=example.com"
rootpw {SSHA} .....
cachesize 1900
checkpoint 512 30
directory /var/lib/ldap
index objectClass,uid,uidNumber,gidNumber,memberUid,uniqueMember eq
index cn,mail,surname,givenname eq,subinitial
DB_CONFIG:
set_cachesize 0 4153344 1
set_lk_max_objects 1500
set_lk_max_locks 1500
set_lk_max_lockers 1500
set_lg_regionmax 1048576
set_lg_bsize 32768
set_lg_max 131072
set_lg_dir /var/lib/ldap
set_flags DB_LOG_AUTOREMOVE
This new setup appeared to work great for the last 10 days or so. I was able to authenticate clients, add records, etc. Running slapd_db_stat -m and slapd_db_stat -c seemed to indicate everything was OK. Before I put this setup into production, I got slurpd to function, then decided to disable slurpd and use syncrepl in refreshonly mode. This also seemed to work fine. I'm not sure if the replication started this or not, but I wanted to include all the events that led up to this.
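For reference, the refreshonly consumer stanza in slapd.conf looks roughly like the following (the rid, interval, and bind DN/credentials below are placeholder values, not my real ones):

syncrepl rid=001
        provider=ldap://master.example.com
        type=refreshOnly
        interval=00:01:00:00
        searchbase="o=example.com"
        scope=sub
        bindmethod=simple
        binddn="cn=replica,o=example.com"
        credentials=secret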
Replication should not be related at all.
I have started to get:
bdb(o=example.com): PANIC: fatal region error detected; run recovery
on both servers at different times. During this time slapd continues to run, which seems to confuse clients that try to use it, and they will not try the other server listed in ldap.conf. To recover I did: service ldap stop, slapd_db_recover -h /var/lib/ldap, service ldap start.
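Spelled out, the recovery sequence is:

service ldap stop                     # stop slapd so nothing has the BDB environment open
slapd_db_recover -h /var/lib/ldap     # run Berkeley DB recovery against the database directory
service ldap start                    # start slapd again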
I then commented all the replication stuff out in slapd.conf and restarted ldap. It will run for a while (this varies, from 5 minutes to longer), then I get the same errors and clients are unable to authenticate. On one of the servers I deleted all the files (except DB_CONFIG) and did a slapadd of an LDIF file that I generate every night (without stopping slapd).
You imported while slapd was running? That is a recipe for failure. You can import into a different directory, stop slapd, switch directories, and then start slapd again, but importing into the live directory while slapd is running is a bad idea.
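Something along these lines, for example (the staging directory and the slapd-staging.conf name are just illustrative; slapd-staging.conf would be a copy of slapd.conf with the directory directive pointed at the staging location, and ownership should match whatever user slapd runs as):

mkdir /var/lib/ldap.new
cp /var/lib/ldap/DB_CONFIG /var/lib/ldap.new/
# import into the staging directory while the live slapd keeps running
slapadd -f /etc/openldap/slapd-staging.conf -l nightly.ldif
service ldap stop
mv /var/lib/ldap /var/lib/ldap.old
mv /var/lib/ldap.new /var/lib/ldap
chown -R ldap:ldap /var/lib/ldap
service ldap start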
Same results once I started slapd again. I have enabled debug for slapd and have not seen anything different; I attached gdb to the running slapd and no errors are noted. I even restored a backup copy of slapd.conf from before the replication settings were added (even though they are commented out), thinking that maybe something in there was causing it. Then, after several recoveries as described above, the systems seem to be working again. One has not generated the error for over 5.5 hours, the other has not had any problems for 2 hours. For some reason, after that period when the errors showed up for a while, things seem to be working again, at least for now.
I'm nervous about putting this into production until I can get it to function properly without these issues. During the 10-day period when everything was working well, the slave would occasionally (rarely) get the error and I would do a recovery, but we thought this was due to possible hardware problems. Now I'm not so sure.
I have a monitor script that runs slapd_db_stat -m and -c every 5 minutes, and nothing seems wrong there, as far as I can tell. I'm hoping someone can help me determine possible causes or things to look at.
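The monitor is essentially just a small script run from cron every 5 minutes, something like this (the log path is arbitrary):

#!/bin/sh
# dump BDB cache (-m) and lock (-c) statistics for the slapd database environment
DBDIR=/var/lib/ldap
LOG=/var/log/ldap-dbstat.log
{
    date
    slapd_db_stat -m -h $DBDIR
    slapd_db_stat -c -h $DBDIR
} >> $LOG 2>&1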
I would recommend that any server that hasn't had a clean import (done while slapd is *NOT* running) get one. Run it for a few days and see if you see any problems.
I have been running 2.3.43 for years (my own packages on RHEL4, then my own packages on
RHEL5, now some boxes run the RHEL packages), and never seen *this* issue, with a *much*
larger directory (with about 8 replicas, though not all replicas have all databases).
Usually database corruption is due to hardware failure, unclean shutdown, or finger
trouble.
Regards,
Buchan