On 8/17/11 2:31 PM, Howard Chu wrote:
David Engeset wrote:
> Back in May I upgraded four of our OpenLDAP servers to the latest stable
> version of OpenLDAP (2.4.23) along with BDB (4.8.30). Everything ran
> with no issues until, a little over a month later, the slapd process on
> one of the servers hung. The only way I could restart the process was
> to use kill -9; all other kill options failed. Over the next month and
> a half the issue recurred on the same server and also occurred on two
> of the other servers. There was nothing in the logs to indicate an
> issue with running out of file descriptors, deadlocks, or anything else.
>
> I set out to see if I could recreate the issue, and I found that I can
> with around 20000 entries (our database has roughly 21000) and two
> scripts. The first script randomly queries the entries in the database,
> one at a time. The second adds 1000 entries, one at a time, then
> deletes them in reverse order, one at a time, and repeats this
> indefinitely. When I ran the two scripts simultaneously they would hang
> after 3 to 16 deletes were completed. I tried the latest version of
> OpenLDAP (2.4.26) to see if any of the bug fixes in it would help, and
> I still get the same results. I even tried running it with all of the
> supported versions of BDB (4.4, 4.5, 4.6, 4.7, 5.0 and 5.1) with the
> same results. I ran it with full logging on and was not able to find
> anything that pointed to the problem.
Perhaps you should also post your test script so we can reproduce the
problem ourselves. Looking at your db_stat output I see a read-lock is
dangling out there without an active thread that owns it. We need to be
able to inspect each of the server threads under gdb to find which
thread the lock belonged to.
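
One possible way to capture that per-thread state is a batch-mode gdb
backtrace of the hung slapd. The sketch below is only an illustration: the
helper names gdb_thread_dump_cmd and dump_threads are hypothetical, and it
assumes gdb is installed, that you have permission to attach to the process,
and ideally that slapd was built with debug symbols.

```python
import subprocess


def gdb_thread_dump_cmd(pid):
    """Build a gdb command line that attaches to `pid`, prints a
    backtrace for every thread, and then exits (batch mode)."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def dump_threads(pid):
    """Run the command above and return gdb's standard output.

    Hypothetical helper: attaching usually needs root or the same
    user that owns the slapd process.
    """
    result = subprocess.run(
        gdb_thread_dump_cmd(pid),
        capture_output=True,
        text=True,
        check=False,  # do not raise if gdb exits non-zero; return what it printed
    )
    return result.stdout
```

Calling dump_threads() with the PID from the slapd pidfile while the server
is hung would give one backtrace per thread to compare against the locker
IDs in the db_stat output.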
> We have been running OpenLDAP 2.2 and 2.3 for years (many servers
> without any restarting of slapd for over a year) without any lockups, so
> I decided to test with OpenLDAP 2.3.43 with BDB 4.2.52 (with patches)
> and loaded the same exact database and the same exact tests and it runs
> literally for hours with no issues. I attempted to upgrade the version
> of BDB to 4.4 and I started to experience the hanging again, so it
> appears to be a BDB issue. I searched for related issues with no
> success, and considering that others have been running 2.4 with newer
> versions of BDB for a couple of years now, I find it odd that I am
> running into this issue on my first use of 2.4.
>
> I tested all of this on CentOS 5.4, 5.6 and Fedora 17 with the same
> results. Does anyone have any ideas or suggestions on what I can try to
> do to fix this issue?
>
> Below are some of the configs I used in my latest attempts to resolve
> the issue:
>
> DB_CONFIG:
> set_cachesize 0 536870912 1
> set_lg_regionmax 10485760
> set_lg_max 104857600
> set_lg_bsize 2097152
> set_lg_dir /var/log/bdb
> set_tmp_dir /var/log/bdb
> # This one I added recently to see if it might help.
> set_lk_detect DB_LOCK_DEFAULT
>
> slapd.conf:
> include /usr/local/etc/openldap/schema/cosine.schema
> include /usr/local/etc/openldap/schema/nis.schema
> include /usr/local/etc/openldap/schema/misc.schema
> include /usr/local/etc/openldap/schema/inetorgperson.schema
>
> pidfile /usr/local/var/run/slapd.pid
> argsfile /usr/local/var/run/slapd.args
> conn_max_pending 1000
>
> database bdb
> cachesize 20000
> suffix "dc=example,dc=net"
> checkpoint 5120 30
> rootdn "cn=Manager,dc=example,dc=net"
> rootpw secrect
> directory /usr/local/var/openldap-data
>
> # Indices to maintain
> index default pres,eq
> index cn,uid
> #index WhidNetCustID,CustID,ID
> index sn pres,eq,sub
> index objectClass eq
> index uidNumber eq
> index gidNumber eq
> index memberUid eq
>
> # database access control definitions
> access to attrs=userPassword
>     by self write
>     by anonymous auth
>     by dn="cn=Admin,dc=example,dc=net" write
>     by * none
> access to *
>     by self write
>     by dn="cn=Admin,dc=example,dc=net" write
>     by * read
>
> I can send out the LDIF I am using and the perl scripts that I run to
> break it for anyone who is interested.
> Thank you,
>
Howard,
Attached are the configs I used as well as the two perl scripts.
The first perl script, searchldap.pl, just calls the ldapsearch command to
look up a randomly selected user in the database, and does so indefinitely.
The second perl script, adddelldap.pl, uses the Net::LDAP module to
connect to the database, add 1000 records and then remove them. I call
each complete cycle a round and echo that out to the display, and the
script continues for 1000 rounds. I set the script to close its
connection and open a new one after a random number of operations, up to
a maximum of 10 per connection, to simulate a typical set of changes in
a single connection.
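
For anyone who wants to reason about the write workload without the
attachment, here is a rough sketch of the operation order it produces. This
is not the attached perl script: it is a Python illustration that only builds
the list of operations and does not talk to a server. The function names and
the uid=testuserN naming are assumptions made up for the example; the entry
count, the reverse-order deletes, and the limit of 10 operations per
connection come from the description above.

```python
import random


def build_round(n_entries=1000, base="ou=people,dc=example,dc=net"):
    """One round: add n_entries entries, then delete them in reverse order.

    The uid=testuserN naming is hypothetical, chosen only for illustration.
    """
    dns = [f"uid=testuser{i},{base}" for i in range(n_entries)]
    adds = [("add", dn) for dn in dns]
    deletes = [("delete", dn) for dn in reversed(dns)]
    return adds + deletes


def split_into_connections(ops, max_ops=10, rng=None):
    """Group operations into chunks of 1..max_ops.

    Each chunk stands for the work done on one connection before it is
    closed and a new one is opened.
    """
    rng = rng or random.Random()
    chunks = []
    i = 0
    while i < len(ops):
        size = rng.randint(1, max_ops)  # inclusive on both ends
        chunks.append(ops[i:i + size])
        i += size
    return chunks


if __name__ == "__main__":
    plan = split_into_connections(build_round())
    print(f"{len(plan)} connections for one round of 2000 operations")
```

Each chunk could then be replayed over its own connection with whatever
LDAP client library is at hand, while a separate process issues the random
searches.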
I am unable to attach the LDIF I used due to size, but it had
roughly 20000 records under the ou=people branch.
Thank you,
--
David
Whidbey Telecom Internet and Broadband
Software Engineer