--On Tuesday, May 13, 2008 02:45:06 AM -0700 Howard Chu <hyc(a)symas.com>
Bill MacAllister wrote:
> Attached is the output of db4.2_stat -CA of the database.
> Thanks for looking at this.
So far it just looks like a very busy server. Can you turn off the
network access to it and see if it settles down when the query traffic
Last night the server tried to do a log rotation. When I look at the log
now it is zero length and nothing is getting written to it. An ldapsearch
on the server just hangs.
I logged into the console and shut the network interface down, and the CPU
is still pinned.
It's a bit odd that a single transaction has so many pages of
suPrivilegeGroup index locked.
The backtrace is somewhat suspicious; there are several <value optimized
out> items in the trace. In thread 8, frames 5 and 6, the locker value is
odd; usually in BDB the locker ID associated with a transaction has bit
31 set, yielding a very large 32-bit number. Also there is no locker with
that ID in the db_stat output you provided.
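
As a rough illustration of the bit-31 check described above, here is a minimal sketch in Python. The function name and the sample values are hypothetical and are not taken from the backtrace or the db_stat output; the only thing assumed from the thread is that a transactional locker ID normally has bit 31 set.

```python
# Sketch only: tests whether a 32-bit locker ID has bit 31 set,
# which is what a transaction-associated locker usually looks like.
TXN_BIT = 1 << 31  # 0x80000000


def looks_like_txn_locker(locker_id: int) -> bool:
    """Return True if bit 31 of the (32-bit) locker ID is set."""
    return bool((locker_id & 0xFFFFFFFF) & TXN_BIT)


if __name__ == "__main__":
    # Hypothetical sample values, for illustration only.
    for sample in (0x80000A3F, 0x0000001F):
        print(hex(sample), looks_like_txn_locker(sample))
```

A locker value from a backtrace that fails this check would be "odd" in the sense described above, although that alone does not say whether the value is corrupt or simply optimized out.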
It looks like you'll have to try this again with a non-optimized binary
to get a reliable backtrace.
Yes, we were afraid of that. I will build a debug version of bdb. The
real rub is that we don't seem to be able to make this happen on demand. I
tried taking the log from the pinned server, turned the log into a shell
script of ldapsearch commands, and pointed it at another server. I could
not make the second server go CPU bound. So, we will just have to deploy
the debug bdb support on our test servers and wait.
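
For anyone wanting to repeat the replay, a minimal sketch of the log-to-ldapsearch conversion might look like the following. It is not the script that was actually used. It assumes the log contains slapd "SRCH base=... scope=... deref=... filter=..." lines, that an anonymous simple bind (-x) is acceptable, and that the target URI is supplied by the caller; the function name and the example host are hypothetical.

```python
import re
import shlex

# Assumed log line shape (slapd stats logging), e.g.:
#   ... conn=10 op=1 SRCH base="dc=example,dc=com" scope=2 deref=0 filter="(uid=jdoe)"
SRCH_RE = re.compile(
    r'SRCH base="(?P<base>[^"]*)" scope=(?P<scope>[012]) '
    r'deref=\d filter="(?P<filter>.*)"$'
)

# Numeric LDAP scopes mapped to the names ldapsearch -s accepts.
SCOPES = {"0": "base", "1": "one", "2": "sub"}


def log_to_commands(lines, uri):
    """Turn SRCH log lines into shell-quoted ldapsearch command strings."""
    commands = []
    for line in lines:
        match = SRCH_RE.search(line.rstrip())
        if match is None:
            continue  # not a search request (or an unrecognized shape); skip it
        args = [
            "ldapsearch", "-x",
            "-H", uri,
            "-b", match.group("base"),
            "-s", SCOPES[match.group("scope")],
            match.group("filter"),
        ]
        commands.append(" ".join(shlex.quote(arg) for arg in args))
    return commands


if __name__ == "__main__":
    sample = ('May 13 02:00:01 ldap1 slapd[1234]: conn=10 op=1 SRCH '
              'base="dc=example,dc=com" scope=2 deref=0 filter="(uid=jdoe)"')
    for command in log_to_commands([sample], "ldap://test.example.com"):
        print(command)
```

Note that a script built this way runs the searches one at a time and anonymously, so it does not reproduce the original concurrency or bind identities.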
> --On Tuesday, May 13, 2008 01:20:49 AM -0700 Howard Chu<hyc(a)symas.com>
>> whm(a)stanford.edu wrote:
>>> Full_Name: Bill MacAllister
>>> Version: 2.3.41-1su2
>>> OS: debian etch kernel 2.6.18-4-amd64
>>> URL: http://www.stanford.edu/~whm/ldap-test1-bt.txt
>>> Submission from: (NULL) (220.127.116.11)
>>> The slapd process will sometimes consume all of available CPU. We
>>> observed this when we upgraded our production servers from 2.3.35-2su2
>>> to 2.3.41-1su2. The problem was bad enough that we downgraded the
>>> production servers to 2.3.35-2su2. We have been trying to provoke the
>>> problem in our test environment and have not been successful in
>>> making it happen on demand. Today, we noticed that one of our test
>>> servers went completely CPU bound. I took a backtrace. It is
>>> available at the URL below. The interesting thing about the problem
>>> is that although top shows a pinned CPU and a high load the server is
>>> still responsive and continues to answer LDAP searches. The test
>>> server that exhibits the problem is still CPU bound and has been for
>>> 2-3 hours now. We will leave this server in this state in case there
>>> is other information that we should harvest in resolving the problem.
>> Please also provide the output from db_stat -CA on the database in
>> question, thanks.
Bill MacAllister <whm(a)stanford.edu>
Systems Programmer, ITS Unix Systems, Stanford University