I am rebuilding our aging pre-2.2 OpenLDAP servers, which ran the ldbm backend and slurpd. We ran that setup without any issues for many years.
The new setup is:

  RH5
  OpenLDAP 2.3.43 (stock RH)
  bdb backend, Berkeley DB 4.4.20 (stock RH)
  Entries in DB: about 1820
  LDIF file: about 1.2M
  Memory: master 4GB, slave 2GB (will add two more slaves)
Database section of slapd.conf:

database        bdb
suffix          "o=example.com"
rootdn          "cn=root,o=example.com"
rootpw          {SSHA} .....
cachesize       1900
checkpoint      512 30
directory       /var/lib/ldap
index           objectClass,uid,uidNumber,gidNumber,memberUid,uniqueMember eq
index           cn,mail,surname,givenname eq,subinitial
DB_CONFIG:

set_cachesize 0 4153344 1
set_lk_max_objects 1500
set_lk_max_locks 1500
set_lk_max_lockers 1500
set_lg_regionmax 1048576
set_lg_bsize 32768
set_lg_max 131072
set_lg_dir /var/lib/ldap
set_flags DB_LOG_AUTOREMOVE
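For reference, the quick check I do after touching DB_CONFIG is roughly this (just a sketch; the grep patterns pick out the lines I happen to look at in the slapd_db_stat output and may need adjusting for other BDB versions):

# Sketch: confirm the cache and lock limits set in DB_CONFIG are what the
# running environment actually reports. Grep patterns are approximate.
slapd_db_stat -h /var/lib/ldap -m | grep -i 'cache size'
slapd_db_stat -h /var/lib/ldap -c | grep -i 'maximum number'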
This new setup appeared to work great for the last 10 days or so. I was able to authenticate clients, add records, etc. Running slapd_db_stat -m and slapd_db_stat -c seemed to indicate everything was OK. Before I put this setup into production, I got slurpd to function. Then I decided to disable slurpd and use syncrepl in refreshonly mode. This also seemed to work fine. I'm not sure if the replication change started this or not, but I want to include all the events that led up to it. I have started to get:

bdb(o=example.com): PANIC: fatal region error detected; run recovery

on both servers, at different times. While this is happening slapd continues to run, which seems to confuse clients that try to use it; they will not fail over to the other server listed in ldap.conf. To recover I did: service ldap stop, slapd_db_recover -h /var/lib/ldap, service ldap start.
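For what it's worth, those recovery steps are what I have scripted on both boxes now (just a sketch; the init script name and database directory match the stock RH packages here):

#!/bin/sh
# Rough recovery wrapper for the PANIC situation described above.
# Assumes the stock RH init script name (ldap) and database
# directory (/var/lib/ldap); adjust for a different layout.
service ldap stop
slapd_db_recover -h /var/lib/ldap
service ldap start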
I then commented out all the replication settings in slapd.conf and restarted ldap. It will run for a while (anywhere from 5 minutes on up), then I get the same errors and clients are unable to authenticate. On one of the servers I deleted all the database files (except DB_CONFIG) and did a slapadd of an LDIF file that I generate every night (without stopping slapd). Same results once I started slapd again. I have enabled debug for slapd and have not seen anything different, and I attached gdb to the running slapd and no errors are noted. I even went back to a backup copy of slapd.conf from before the replication settings were added (even though they are commented out now), thinking that maybe something in there was causing it. Then, after several recoveries as described above, the systems seem to be working again: one has not generated the error for over 5.5 hours, the other has not had any problems for 2 hours. For some reason, after that period when the errors kept showing up, things seem to be working again, at least for now.
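For completeness, the rebuild on that server was essentially the following (just a sketch of the steps; the LDIF path is a placeholder for the nightly dump, and the script assumes slapd is stopped for the slapadd, which is stricter than what I described above):

#!/bin/sh
# Sketch of the rebuild steps on one server. DB_CONFIG is kept; everything
# else under the database directory is removed before the reload.
# Ownership matches the stock RH packages (slapd runs as user ldap).
service ldap stop
cd /var/lib/ldap || exit 1
find . -maxdepth 1 -type f ! -name DB_CONFIG -delete
slapadd -f /etc/openldap/slapd.conf -l /path/to/nightly.ldif
chown -R ldap:ldap /var/lib/ldap
service ldap start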
I'm nervous about putting this into production until I can get it to function properly without these issues. During the 10-day period when everything was working well, the slave would occasionally (rarely) get the error and I would do a recovery, but we thought that was due to possible hardware problems. Now I'm not so sure.
I have a monitor script that runs slapd_db_stat -m and -c every 5 minutes, and nothing seems wrong there, as far as I can tell. I'm hoping someone can help me determine possible causes or things to look at.
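The monitor script is nothing fancy; it is roughly the following (a sketch; the log file location and the cron wiring are just how I happen to have it set up here):

#!/bin/sh
# Sketch of the 5-minute monitor: capture the BDB memory-pool (-m) and
# lock (-c) statistics with a timestamp so I can look back through them.
# Run from cron, e.g.:  */5 * * * * root /usr/local/sbin/ldap-bdb-stats.sh
DBDIR=/var/lib/ldap
LOG=/var/log/ldap-bdb-stats.log
{
  date
  slapd_db_stat -h "$DBDIR" -m
  slapd_db_stat -h "$DBDIR" -c
} >> "$LOG" 2>&1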