Re: (ITS#5508) slapd process consumes all of CPU - openldap-bugs

13 May 2008


--On Tuesday, May 13, 2008 02:45:06 AM -0700 Howard Chu hyc@symas.com
wrote:
...
Bill MacAllister wrote:
...
Attached is the output of db4.2_stat -CA of the database.
Thanks for looking at this.
So far it just looks like a very busy server. Can you turn off the
network access to it and see if it settles down when the query traffic
stops?
Last night the server tried to do a log rotation.  When I look at the log 
now it is zero length and nothing is getting written to it.  An ldapsearch 
on the server just hangs.
I logged into the console and shut down the network interface, and the CPU
is still pinned.
...
It's a bit odd that a single transaction has so many pages of the
suPrivilegeGroup index locked.
The backtrace is somewhat suspicious: there are several
<value optimized out> items in the trace. In thread 8, frames 5 and 6, the
locker value is odd; usually in BDB the locker ID associated with a
transaction has bit 31 set, yielding a very large 32-bit number. Also,
there is no locker with that ID in the db_stat output you provided.
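[Editor's illustration, not part of the original message.] The bit-31 test described above can be sketched in a few lines of Python. The function name and the sample IDs are hypothetical; the only fact taken from the thread is that transaction lockers usually have bit 31 set.

```python
# Minimal sketch: flag locker IDs that do not look like transaction lockers.
# Assumption: IDs are treated as unsigned 32-bit values, as in the text above.
BIT_31 = 1 << 31  # 0x80000000


def looks_like_txn_locker(locker_id: int) -> bool:
    """Return True if bit 31 of the (32-bit) locker ID is set."""
    return bool((locker_id & 0xFFFFFFFF) & BIT_31)


if __name__ == "__main__":
    # Hypothetical sample values, not taken from the reported backtrace.
    for locker_id in (0x80000A3F, 0x1C5):
        kind = "transaction-like" if looks_like_txn_locker(locker_id) else "suspicious"
        print(f"{locker_id:#010x}: {kind}")
```

A locker value from a backtrace that fails this check, and that also does not appear in the db_stat lock output, is the kind of value being called odd here.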
It looks like you'll have to try this again with a non-optimized binary
to get a reliable backtrace.
Yes, we were afraid of that.  I will build a debug version of bdb.  The 
real rub is that we don't seem to be able to make this happen on demand.  I
took the log from the pinned server, turned it into a shell script of
ldapsearch commands, and pointed it at another server.  I could
not make the second server go CPU bound.  So, we will just have to deploy
the debug bdb support on our test servers and wait.
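[Editor's illustration, not part of the original message.] The replay approach described above could look roughly like the sketch below. It is not the script that was actually used. It assumes slapd was logging at the stats level, so that each search appears as a line containing `SRCH base="..." scope=N deref=N filter="..."`, and it assumes a simple bind (`-x`) against a placeholder URI; a real deployment may need different authentication options. The function and variable names are hypothetical.

```python
import re
import shlex

# Assumed shape of a stats-level slapd search log line.
SRCH_RE = re.compile(
    r'SRCH base="(?P<base>[^"]*)" scope=(?P<scope>\d) deref=\d filter="(?P<filter>.*)"$'
)
# Numeric LDAP scopes as logged, mapped to ldapsearch -s keywords.
SCOPES = {"0": "base", "1": "one", "2": "sub"}


def log_to_commands(lines, uri="ldap://localhost"):
    """Turn SRCH log lines into shell-quoted ldapsearch command strings."""
    commands = []
    for line in lines:
        match = SRCH_RE.search(line.rstrip())
        if not match:
            continue  # skip binds, results, and anything else
        argv = [
            "ldapsearch", "-x", "-H", uri,
            "-b", match.group("base"),
            "-s", SCOPES.get(match.group("scope"), "sub"),
            match.group("filter"),
        ]
        commands.append(" ".join(shlex.quote(arg) for arg in argv))
    return commands


if __name__ == "__main__":
    sample = [
        'slapd[1234]: conn=5 op=1 SRCH base="dc=example,dc=com" scope=2 deref=0 filter="(uid=jdoe)"',
        "slapd[1234]: conn=5 op=1 SEARCH RESULT tag=101 err=0 nentries=1 text=",
    ]
    for command in log_to_commands(sample):
        print(command)
```

A replay like this issues the searches one after another, so it does not reproduce the concurrency or timing of the original traffic, which may be one reason a replay fails to trigger the problem.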
Bill
...
--On Tuesday, May 13, 2008 01:20:49 AM -0700 Howard Chu hyc@symas.com
wrote:
...
whm@stanford.edu wrote:
...
Full_Name: Bill MacAllister
Version: 2.3.41-1su2
OS: debian etch kernel 2.6.18-4-amd64
URL: http://www.stanford.edu/~whm/ldap-test1-bt.txt
Submission from: (NULL) (171.64.19.165)
The slapd process will sometimes consume all available CPU.  We
observed this when we upgraded our production servers from 2.3.35-2su2
to 2.3.41-1su2.  The problem was bad enough that we downgraded the
production servers to 2.3.35-2su2. We have been trying to provoke the
problem in our test environment and have not been successful in
making it happen on demand.  Today, we noticed that one of our test
servers went completely CPU bound.  I took a backtrace.  It is
available at the URL below.  The interesting thing about the problem
is that although top shows a pinned CPU and a high load, the server is
still responsive and continues to answer LDAP searches.  The test
server that exhibits the problem is still CPU bound and has been for
2-3 hours now.  We will leave this server in this state in case there
is other information that we should harvest in resolving the problem.
Please also provide the output from db_stat -CA on the database in
question, thanks.
--
Bill MacAllister whm@stanford.edu
Systems Programmer, ITS Unix Systems, Stanford University