----- ldap(a)mm.st wrote:
I am rebuilding our aging pre-2.2 OpenLDAP servers that ran the ldbm backend and slurpd. We ran this setup without any issues for many years.
The new setup is:
RH5
openldap 2.3.43 (Stock RH)
bdb backend 4.4.20 (Stock RH)
Entries in db: about 1820
LDIF file is about 1.2M
Memory: Master 4GB, Slave 2GB (will add two more slaves)
Database section of slapd.conf:
database bdb
suffix "o=example.com"
rootdn "cn=root,o=example.com"
rootpw {SSHA} .....
cachesize 1900
checkpoint 512 30
directory /var/lib/ldap
index objectClass,uid,uidNumber,gidNumber,memberUid,uniqueMember eq
index cn,mail,surname,givenname eq,subinitial
DB_CONFIG:
set_cachesize 0 4153344 1
set_lk_max_objects 1500
set_lk_max_locks 1500
set_lk_max_lockers 1500
set_lg_regionmax 1048576
set_lg_bsize 32768
set_lg_max 131072
set_lg_dir /var/lib/ldap
set_flags DB_LOG_AUTOREMOVE
This new setup appeared to work great for the last 10 days or so. I was able to authenticate clients, add records, etc. Running slapd_db_stat -m and slapd_db_stat -c seemed to indicate everything was OK. Before I put this setup into production, I got slurpd to function, then decided to disable slurpd and use syncrepl in refreshonly mode. This also seemed to work fine. I'm not sure if the replication started this or not, but I wanted to include all the events that led up to this.
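For reference, the refreshonly consumer stanza in slapd.conf looks roughly like the following (the rid, interval, and bind DN/credentials below are placeholder values, not my real ones):

syncrepl rid=001
        provider=ldap://master.example.com
        type=refreshOnly
        interval=00:01:00:00
        searchbase="o=example.com"
        scope=sub
        bindmethod=simple
        binddn="cn=replica,o=example.com"
        credentials=secret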
Replication should not be related at all.
I have started to get:
bdb(o=example.com): PANIC: fatal region error detected; run recovery
on both servers at different times. During this time slapd continues to run, which seems to confuse clients that try to use it, and they will not try the other server listed in ldap.conf. To recover I did: service ldap stop, slapd_db_recover -h /var/lib/ldap, service ldap start.
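Spelled out, the recovery sequence is:

service ldap stop                     # stop slapd so nothing has the BDB environment open
slapd_db_recover -h /var/lib/ldap     # run Berkeley DB recovery against the database directory
service ldap start                    # start slapd again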
I then commented all the replication stuff out in slapd.conf and restarted ldap. It will run for a while (this varies, from 5 minutes to longer), then I get the same errors and clients are unable to authenticate. On one of the servers I deleted all the files (except DB_CONFIG) and did a slapadd of an LDIF file that I generate every night (without stopping slapd).
You imported while slapd was running? That is a recipe for failure. You can import into a different directory, stop slapd, switch directories, and then start slapd again, but importing into the live directory while slapd is running is a bad idea.
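Something along these lines, for example (the staging directory and the slapd-staging.conf name are just illustrative; slapd-staging.conf would be a copy of slapd.conf with the directory directive pointed at the staging location, and ownership should match whatever user slapd runs as):

mkdir /var/lib/ldap.new
cp /var/lib/ldap/DB_CONFIG /var/lib/ldap.new/
# import into the staging directory while the live slapd keeps running
slapadd -f /etc/openldap/slapd-staging.conf -l nightly.ldif
service ldap stop
mv /var/lib/ldap /var/lib/ldap.old
mv /var/lib/ldap.new /var/lib/ldap
chown -R ldap:ldap /var/lib/ldap
service ldap start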
Same results once I started slapd again. I have enabled debug for slapd and have not seen anything different; I attached gdb to the running slapd and no errors are noted. I even restored a backup copy of slapd.conf from before the replication settings were added (even though they are commented out), thinking that maybe something in there was causing it. Then, after several recoveries as described above, the systems seem to be working again. One has not generated the error for over 5.5 hours, the other has not had any problems for 2 hours. For some reason, after that period when the errors showed up for a while, things seem to be working again, at least for now.
I'm nervous about putting this into production until I can get it to function properly without these issues. During the 10-day period when everything was working well, the slave would occasionally (rarely) get the error and I would do a recovery, but we thought this was due to possible hardware problems. Now I'm not so sure.
I have a monitor script that runs slapd_db_stat -m and -c every 5 minutes, and nothing seems wrong there, as far as I can tell. I'm hoping someone can help me determine possible causes or things to look at.
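The monitor is essentially just a small script run from cron every 5 minutes, something like this (the log path is arbitrary):

#!/bin/sh
# dump BDB cache (-m) and lock (-c) statistics for the slapd database environment
DBDIR=/var/lib/ldap
LOG=/var/log/ldap-dbstat.log
{
    date
    slapd_db_stat -m -h $DBDIR
    slapd_db_stat -c -h $DBDIR
} >> $LOG 2>&1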
I would recommend that any server that hasn't had a clean import (done while slapd is *NOT* running) get one. Run it for a few days and see if you see any problems.
I have been running 2.3.43 for years (my own packages on RHEL4, then my own packages on
RHEL5, now some boxes run the RHEL packages), and never seen *this* issue, with a *much*
larger directory (with about 8 replicas, though not all replicas have all databases).
Usually database corruption is due to hardware failure, unclean shutdown, or finger
trouble.
Regards,
Buchan