I am rebuilding our aging pre 2.2 openldap servers that ran ldbm backend and slurpd. We ran this setup without any issues for many years.
The new setup is:

  RH5, OpenLDAP 2.3.43 (stock RH), BDB backend 4.4.20 (stock RH)
  Entries in DB: about 1820; LDIF file is about 1.2M
  Memory: master 4GB, slave 2GB (will add two more slaves)
Database section of slapd.conf:

  database        bdb
  suffix          "o=example.com"
  rootdn          "cn=root,o=example.com"
  rootpw          {SSHA}.....
  cachesize       1900
  checkpoint      512 30
  directory       /var/lib/ldap
  index           objectClass,uid,uidNumber,gidNumber,memberUid,uniqueMember eq
  index           cn,mail,surname,givenname eq,subinitial
DB_CONFIG:

  set_cachesize      0 4153344 1
  set_lk_max_objects 1500
  set_lk_max_locks   1500
  set_lk_max_lockers 1500
  set_lg_regionmax   1048576
  set_lg_bsize       32768
  set_lg_max         131072
  set_lg_dir         /var/lib/ldap
  set_flags          DB_LOG_AUTOREMOVE
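(For what it's worth, a quick way to sanity-check that cache size against the actual data is to compare it with the on-disk BDB files and the cache hit rate; a rough sketch, assuming the stock paths above:)

  # rough check of BDB cache sizing -- paths as in the config above
  du -ch /var/lib/ldap/*.bdb                      # total size of id2entry, dn2id and index files
  slapd_db_stat -m -h /var/lib/ldap | head -n 20  # look at "Requested pages found in the cache" (hit rate)
  # if the hit rate is poor, raise set_cachesize in DB_CONFIG (e.g. "set_cachesize 0 16777216 1" for 16MB)
  # and run slapd_db_recover with slapd stopped so the environment is recreated with the new size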
This new setup appeared to work great for the last 10 days or so. I was able to authenticate clients, add records, etc. Running slapd_db_stat -m and slapd_db_stat -c seemed to indicate everything was OK. Before putting this setup into production I got slurpd to function, then decided to disable slurpd and use syncrepl in refreshonly mode. This also seemed to work fine. I'm not sure whether the replication change started this or not, but I wanted to include all the events that led up to it. I have started to get:

  bdb(o=example.com): PANIC: fatal region error detected; run recovery

on both servers, at different times. While this is happening slapd continues to run, which seems to confuse clients that try to use it, and they will not try the other server listed in ldap.conf. To recover I did: service ldap stop, slapd_db_recover -h /var/lib/ldap, service ldap start.
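(Spelled out, that recovery sequence is just:)

  # manual recovery after the PANIC, run as root
  service ldap stop                    # stop slapd so nothing else touches the BDB environment
  slapd_db_recover -h /var/lib/ldap    # run BDB recovery against the database directory
  service ldap start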
I then commented all the replication settings out of slapd.conf and restarted ldap. It will run for a while (anywhere from 5 minutes on up), then I get the same errors and clients are unable to authenticate. On one of the servers I deleted all the database files (except DB_CONFIG) and did a slapadd of an LDIF file that I generate every night (without stopping slapd). Same results once I started slapd again. I have enabled debug for slapd and have not seen anything different, and I attached gdb to the running slapd and no errors are noted. I even copied back a backup of slapd.conf from before the replication settings were added (even though they are commented out), thinking that maybe something in there was causing it. Then, after several recoveries as described above, the systems seem to be working again: one has not generated the error for over 5.5 hours, the other has not had any problems for 2 hours. For some reason, after that period when the errors kept showing up, things seem to be working again, at least for now.
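(For reference, one way to capture more detail when this happens is to run slapd in the foreground at a debug level; the paths and level below are assumptions based on a stock RH install.)

  # run slapd in the foreground with stats-level debugging (assumed stock RH paths)
  service ldap stop
  /usr/sbin/slapd -u ldap -f /etc/openldap/slapd.conf -h "ldap:///" -d 256
  # 256 = connection/operation stats; -d -1 gives full trace output if stats alone shows nothing useful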
I'm nervous about putting this into production until I can get it to function properly without these issues. During the 10-day period when everything was working well, the slave would occasionally (rarely) get the error and I would do a recovery, but we thought this was due to possible hardware problems. Now I'm not so sure.
I have a monitor script that runs slapd_db_stat -m and -c every 5 minutes, and nothing seems wrong there, as far as I can tell. I'm hoping someone can help me determine possible causes or things to look at.
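(The monitor is nothing fancy -- essentially just a cron entry along these lines; the exact file and log path here are illustrative.)

  # /etc/cron.d/ldap-bdb-monitor -- sketch of the 5-minute slapd_db_stat check
  */5 * * * * root slapd_db_stat -m -h /var/lib/ldap >> /var/log/ldap-bdb-stat.log 2>&1
  */5 * * * * root slapd_db_stat -c -h /var/lib/ldap >> /var/log/ldap-bdb-stat.log 2>&1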
--On Wednesday, March 16, 2011 3:12 PM -0600 ldap@mm.st wrote:
I am rebuilding our aging pre 2.2 openldap servers that ran ldbm backend and slurpd. We ran this setup without any issues for many years.
The new setup is: RH5 openldap 2.3.43 (Stock RH) bdb backend 4.4.20 (Stock RH)
I highly advise you to build your own OpenLDAP and BDB packages. 2.3 is no longer a supported release series. The current stable release is 2.4.23.
Otherwise, you should be directing these issues to the company that provided you these packages, namely Red Hat.
Personally, I used OpenLDAP 2.3.43 with BDB 4.2.52+patches for years without issue (using delta-sync for replication).
It is entirely possible the BDB build provided by RedHat is missing critical patches.
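(For anyone going that route, a minimal sketch of building OpenLDAP against a self-built BDB, assuming BDB is installed under /usr/local/BerkeleyDB.4.2 -- adjust prefixes and versions to taste:)

  # build OpenLDAP against a self-built Berkeley DB
  CPPFLAGS="-I/usr/local/BerkeleyDB.4.2/include" \
  LDFLAGS="-L/usr/local/BerkeleyDB.4.2/lib -Wl,-rpath,/usr/local/BerkeleyDB.4.2/lib" \
  ./configure --prefix=/usr/local/openldap --enable-bdb --enable-syncprov
  make depend && make && make install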
--Quanah
--
Quanah Gibson-Mount
Sr. Member of Technical Staff
Zimbra, Inc
A Division of VMware, Inc.
--------------------
Zimbra :: the leader in open source messaging and collaboration
----- ldap@mm.st wrote:
I am rebuilding our aging pre 2.2 openldap servers that ran ldbm backend and slurpd. We ran this setup without any issues for many years.
The new setup is: RH5 openldap 2.3.43 (Stock RH) bdb backend 4.4.20 (Stock RH) Entries in db- about 1820 LDIF file is about 1.2M Memory- Master 4GB Slave 2GB (will add two more slaves)
Database section of slapd.conf: database bdb suffix "o=example.com" rootdn "cn=root,o=example.com" rootpw {SSHA} ..... cachesize 1900 checkpoint 512 30 directory /var/lib/ldap index objectClass,uid,uidNumber,gidNumber,memberUid,uniqueMember eq index cn,mail,surname,givenname eq,subinitial
DB_CONFIG: set_cachesize 0 4153344 1 set_lk_max_objects 1500 set_lk_max_locks 1500 set_lk_max_lockers 1500 set_lg_regionmax 1048576 set_lg_bsize 32768 set_lg_max 131072 set_lg_dir /var/lib/ldap set_flags DB_LOG_AUTOREMOVE
This new setup appeared to work great for the last 10 days or so. I was able to authenticate clients, add records, etc. Running slapd_db_stat -m and slapd_db_stat -c seemed to indicate everything was OK. Before putting this setup into production I got slurpd to function, then decided to disable slurpd and use syncrepl in refreshonly mode. This also seemed to work fine. I'm not sure whether the replication change started this or not, but I wanted to include all the events that led up to it.
Replication should not be related at all.
I have started to get "bdb(o=example.com): PANIC: fatal region error detected; run recovery" on both servers, at different times. While this is happening slapd continues to run, which seems to confuse clients that try to use it, and they will not try the other server listed in ldap.conf. To recover I did: service ldap stop, slapd_db_recover -h /var/lib/ldap, service ldap start.
I then commented all the replication settings out of slapd.conf and restarted ldap. It will run for a while (anywhere from 5 minutes on up), then I get the same errors and clients are unable to authenticate. On one of the servers I deleted all the database files (except DB_CONFIG) and did a slapadd of an LDIF file that I generate every night (without stopping slapd).
You imported while slapd was running? That is a recipe for failure. You can import into a different directory, stop slapd, switch directories, and then start slapd again; but importing into the directory while slapd is running is a bad idea.
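(Roughly, and assuming the stock RH paths mentioned earlier in the thread plus a nightly LDIF dump, that procedure looks something like this sketch:)

  # re-import offline into a fresh directory, then swap it into place
  mkdir /var/lib/ldap.new
  grep -v '^set_lg_dir' /var/lib/ldap/DB_CONFIG > /var/lib/ldap.new/DB_CONFIG   # set_lg_dir just named the env home, which is the default anyway
  sed 's|/var/lib/ldap|/var/lib/ldap.new|g' /etc/openldap/slapd.conf > /tmp/slapd-import.conf
  slapadd -f /tmp/slapd-import.conf -l nightly-backup.ldif   # "nightly-backup.ldif" is a placeholder for your nightly dump
  chown -R ldap:ldap /var/lib/ldap.new
  service ldap stop
  mv /var/lib/ldap /var/lib/ldap.old
  mv /var/lib/ldap.new /var/lib/ldap
  service ldap start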
Same results once I started slapd again. I have enabled debug for slapd and have not seen anything different, and I attached gdb to the running slapd and no errors are noted. I even copied back a backup of slapd.conf from before the replication settings were added (even though they are commented out), thinking that maybe something in there was causing it.
Then, after several recoveries as described above, the systems seem to be working again: one has not generated the error for over 5.5 hours, the other has not had any problems for 2 hours. For some reason, after that period when the errors kept showing up, things seem to be working again, at least for now.
I'm nervous about putting this into production until I can get it to function properly without these issues. During the 10-day period when everything was working well, the slave would occasionally (rarely) get the error and I would do a recovery, but we thought this was due to possible hardware problems. Now I'm not so sure.
I have a monitor script that runs slapd_db_stat -m and -c every 5 minutes, and nothing seems wrong there, as far as I can tell. I'm hoping someone can help me determine possible causes or things to look at.
I would recommend that any server which hasn't had a clean import done while slapd is *NOT* running get one. Run it for a few days and see if you see any problems.
I have been running 2.3.43 for years (my own packages on RHEL4, then my own packages on RHEL5, now some boxes run the RHEL packages), and never seen *this* issue, with a *much* larger directory (with about 8 replicas, though not all replicas have all databases).
Usually database corruption is due to hardware failure, unclean shutdown, or finger trouble.
Regards, Buchan