Server Hangs

12 Aug 2011


Back in May I upgraded four of our OpenLDAP servers to the latest stable
version of OpenLDAP (2.4.23) along with BDB 4.8.30.  Everything ran
without issues until, a little over a month later, the slapd process on
one of the servers hung.  The only way I could restart it was with
kill -9; all other kill options failed.  Over the next month and a half
the issue recurred on the same server and also occurred on two of the
other servers.  There was nothing in the logs to indicate that we were
running out of file descriptors, hitting deadlocks, or anything else.

I set out to recreate the issue and found a setup that triggers it.  With
around 20000 entries loaded (our database has roughly 21000), I run one
script that queries random entries in the database, one at a time, and a
second script that adds 1000 entries, one at a time, then deletes them in
reverse order, one at a time, and repeats that cycle indefinitely.  When
I run the two scripts simultaneously, they hang after 3 to 16 deletes
have completed.

I tried the latest version of OpenLDAP (2.4.26) to see if any of its bug
fixes would help, and I still get the same results.  I also tried all of
the supported versions of BDB (4.4, 4.5, 4.6, 4.7, 5.0 and 5.1), again
with the same results.  I ran it with full logging on and was not able to
find anything that pointed to the problem.
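
For anyone who wants to try the workload without the original Perl
scripts, here is a rough Python sketch of the two loops described above.
It is not the code I ran.  It assumes the standard OpenLDAP client tools
(ldapadd, ldapdelete, ldapsearch) are on the PATH, and the URI, bind
password, entry naming scheme, timeout and the DN list file are all
placeholders made up for illustration.  A per-operation timeout stands in
for "the script hangs": subprocess raises TimeoutExpired when slapd stops
answering.

```python
import random
import subprocess
import sys

# Placeholder connection details for a throwaway test server (assumptions).
URI = "ldap://localhost"
SUFFIX = "dc=example,dc=net"
BIND_DN = "cn=Manager," + SUFFIX
BIND_PW = "secret"
TIMEOUT = 60  # seconds before a single operation is treated as a hang


def entry_dn(i):
    """DN of the i-th temporary entry (hypothetical naming scheme)."""
    return "cn=hangtest%d,%s" % (i, SUFFIX)


def entry_ldif(i):
    """Minimal LDIF for one temporary entry."""
    return ("dn: %s\nobjectClass: person\ncn: hangtest%d\nsn: hangtest\n"
            % (entry_dn(i), i))


def cycle_ops(count=1000):
    """One cycle: add entries 0..count-1, then delete them in reverse."""
    ops = [("add", i) for i in range(count)]
    ops += [("delete", i) for i in reversed(range(count))]
    return ops


def run_tool(argv, stdin_text=None):
    """Run one client command; TimeoutExpired here signals a hang."""
    auth = ["-x", "-H", URI, "-D", BIND_DN, "-w", BIND_PW]
    return subprocess.run(argv[:1] + auth + argv[1:], input=stdin_text,
                          text=True, capture_output=True, timeout=TIMEOUT)


def writer():
    """Add/delete loop; runs until an operation times out."""
    while True:
        for op, i in cycle_ops():
            if op == "add":
                run_tool(["ldapadd"], entry_ldif(i))
            else:
                run_tool(["ldapdelete", entry_dn(i)])


def reader(dns):
    """Base-scope search of random existing entries, one at a time."""
    while True:
        run_tool(["ldapsearch", "-b", random.choice(dns), "-s", "base"])


if __name__ == "__main__":
    # Usage: "script.py writer" or "script.py reader <file of DNs>".
    if sys.argv[1:2] == ["writer"]:
        writer()
    elif sys.argv[1:2] == ["reader"] and len(sys.argv) > 2:
        with open(sys.argv[2]) as f:  # one existing DN per line
            reader([line.strip() for line in f if line.strip()])
```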

We have been running OpenLDAP 2.2 and 2.3 for years without any lockups
(many servers have gone over a year without a slapd restart), so I
decided to test OpenLDAP 2.3.43 with BDB 4.2.52 (with patches).  I loaded
the exact same database and ran the exact same tests, and it runs
literally for hours with no issues.  When I then upgraded BDB to 4.4, the
hangs returned, so it appears to be a BDB issue.  I searched for related
issues with no success.  Considering that others have been running 2.4
with newer versions of BDB for a couple of years now, I find it odd that
I am running into this issue on my first use of 2.4.

I tested all of this on CentOS 5.4, 5.6 and Fedora 17 with the same
results.  Does anyone have any ideas or suggestions on what I can try to
fix this issue?

Below are some of the configs I used in my latest attempts to resolve the
issue:

DB_CONFIG:
set_cachesize 0 536870912 1
set_lg_regionmax 10485760
set_lg_max 104857600
set_lg_bsize 2097152
set_lg_dir /var/log/bdb
set_tmp_dir /var/log/bdb
# This one I added recently to see if it might help.
set_lk_detect DB_LOCK_DEFAULT
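
To make the sizes above easier to read, here is a small Python sketch
that converts them to MiB.  The helper name is made up for illustration;
the only BDB-specific knowledge it relies on is that set_cachesize takes
<gbytes> <bytes> <ncache>.  With the values above it reports a 512 MiB
cache (in one region), a 10 MiB log region, 100 MiB log files and a 2 MiB
log buffer.

```python
# Size-related directives copied from the DB_CONFIG above.
DB_CONFIG = """\
set_cachesize 0 536870912 1
set_lg_regionmax 10485760
set_lg_max 104857600
set_lg_bsize 2097152
"""

MIB = 1024 * 1024


def sizes_in_mib(text):
    """Map each size directive to its value in MiB."""
    sizes = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts or parts[0].startswith("#"):
            continue  # skip blank lines and comments
        name, args = parts[0], parts[1:]
        if name == "set_cachesize":
            # Arguments are <gbytes> <bytes> <ncache>.
            gbytes, nbytes = int(args[0]), int(args[1])
            sizes[name] = (gbytes * 1024 ** 3 + nbytes) / MIB
        elif name in ("set_lg_regionmax", "set_lg_max", "set_lg_bsize"):
            sizes[name] = int(args[0]) / MIB
    return sizes


for name, mib in sizes_in_mib(DB_CONFIG).items():
    print("%-18s %6.1f MiB" % (name, mib))
```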

slapd.conf:
include         /usr/local/etc/openldap/schema/cosine.schema
include         /usr/local/etc/openldap/schema/nis.schema
include         /usr/local/etc/openldap/schema/misc.schema
include         /usr/local/etc/openldap/schema/inetorgperson.schema
pidfile         /usr/local/var/run/slapd.pid
argsfile        /usr/local/var/run/slapd.args
conn_max_pending 1000
database        bdb
cachesize       20000
suffix          "dc=example,dc=net"
checkpoint      5120 30
rootdn          "cn=Manager,dc=example,dc=net"
rootpw          secret
directory       /usr/local/var/openldap-data
# Indices to maintain
index default pres,eq
index cn,uid
#index WhidNetCustID,CustID,ID
index sn pres,eq,sub
index   objectClass     eq
index   uidNumber eq
index   gidNumber eq
index   memberUid eq
# database access control definitions
access to attrs=userPassword
         by self write
         by anonymous auth
         by dn="cn=Admin,dc=example,dc=net" write
         by * none
access to *
         by self write
         by dn="cn=Admin,dc=example,dc=net" write
         by * read

I can send the LDIF I am using and the Perl scripts that I run to break
it to anyone who is interested.

Thank you,
-- 
David
Whidbey Telecom Internet and Broadband
Software Engineer
