nikolai(a)net24.co.nz wrote:
Full_Name: Nikolai Schupbach
Version: 2.4.31
OS: FreeBSD
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (202.78.158.60)
We are experiencing frequent hangs in slapd. Once it has hung we can still
connect, but all searches hang indefinitely until we kill -9 the slapd
process and restart it. The directory is used for mail routing and we have been
migrating to it from an existing directory server over the last 3 weeks. We
have noted that the busier the directory becomes, the more often it hangs (now
once every 2 days).
We have one master and 10 syncrepl read-only replicas. The master is used
mainly for writes and has not hung yet, but most of the replicas have hung at
least once. The replicas receive between 50 and 300 searches/sec, while
the master only gets about 1/sec. There are 45k entries in the directory.
We are running:
FreeBSD 8.3/9.0 x64
OpenLDAP 2.4.31
Berkeley DB 4.6.21
The old directory we are migrating from has the same load and is also running
OpenLDAP, but has been rock solid for 5 years. It is running Berkeley DB 4.3.29
and OpenLDAP 2.3.27.
We have managed to collect db_stat lock information, which indicates the same
issue each time - a write lock on dn2id.bdb.
It's more than that. Your db_stat shows that a single thread has 3 active
transactions. This should never happen:
8000a85e dd= 0 locks held 2 write locks 0 pid/thread 88000/34386526336
8000a85e READ 1 HELD 0xb19a8 len: 9 data: 40xa800000000000000
8000a85e READ 1 HELD 0xb26c8 len: 9 data: 60xa800000000000000
8000a85f dd= 0 locks held 8 write locks 4 pid/thread 88000/34386526336
8000a85f READ 1 WAIT dn2id.bdb page 559
8000a85f READ 1 HELD dn2id.bdb page 768
8000a85f WRITE 2 HELD dn2id.bdb page 1362
8000a85f READ 2 HELD dn2id.bdb page 1362
8000a85f WRITE 2 HELD dn2id.bdb page 1353
8000a85f READ 2 HELD dn2id.bdb page 1353
8000a85f WRITE 2 HELD dn2id.bdb page 933
8000a85f READ 1 HELD dn2id.bdb page 933
8000a85f WRITE 4 HELD dn2id.bdb page 219
80001047 dd=28 locks held 1 write locks 1 pid/thread 88000/34386526336
80001047 WRITE 1 HELD dn2id.bdb page 559
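
For anyone wanting to check their own db_stat dump for the same pattern, here
is a minimal Python sketch. It is not part of OpenLDAP or Berkeley DB: the
function names (lockers_by_thread, self_blocked_waits) and the regular
expressions are assumptions based only on the line layout in the excerpt above,
and it only considers waits that are blocked by a held WRITE lock.

```python
import re
from collections import defaultdict

# Locker header line, layout assumed from the excerpt, e.g.
# "8000a85e dd= 0 locks held 2 write locks 0 pid/thread 88000/34386526336"
LOCKER_RE = re.compile(
    r"^(?P<id>[0-9a-f]+)\s+dd=\s*\d+\s+locks held\s+\d+\s+write locks\s+\d+"
    r"\s+pid/thread\s+(?P<thread>\d+/\d+)"
)
# Page lock line, e.g. "8000a85f READ 1 WAIT dn2id.bdb page 559"
LOCK_RE = re.compile(
    r"^(?P<id>[0-9a-f]+)\s+(?P<mode>READ|WRITE)\s+\d+\s+(?P<status>HELD|WAIT)"
    r"\s+(?P<file>\S+)\s+page\s+(?P<page>\d+)"
)


def lockers_by_thread(lines):
    """Map each pid/thread to the locker IDs it owns, in order of appearance."""
    threads = defaultdict(list)
    for line in lines:
        m = LOCKER_RE.match(line.strip())
        if m and m["id"] not in threads[m["thread"]]:
            threads[m["thread"]].append(m["id"])
    return dict(threads)


def self_blocked_waits(lines):
    """Return (waiter, holder, file, page) tuples where a locker waits on a
    page that is write-locked by a different locker on the same thread."""
    thread_of = {}
    held_writes = defaultdict(set)  # (file, page) -> lockers holding WRITE
    waits = []
    for line in lines:
        line = line.strip()
        m = LOCKER_RE.match(line)
        if m:
            thread_of[m["id"]] = m["thread"]
            continue
        m = LOCK_RE.match(line)
        if not m:
            continue  # non-page locks and unrelated lines are ignored
        key = (m["file"], int(m["page"]))
        if m["status"] == "WAIT":
            waits.append((m["id"], key))
        elif m["mode"] == "WRITE":
            held_writes[key].add(m["id"])
    result = []
    for waiter, key in waits:
        for holder in sorted(held_writes.get(key, ())):
            if holder != waiter and thread_of.get(holder) == thread_of.get(waiter):
                result.append((waiter, holder, key[0], key[1]))
    return result


if __name__ == "__main__":
    import sys
    dump = sys.stdin.read().splitlines()
    for thread, ids in lockers_by_thread(dump).items():
        if len(ids) > 1:
            print(f"thread {thread} owns {len(ids)} lockers: {', '.join(ids)}")
    for waiter, holder, fname, page in self_blocked_waits(dump):
        print(f"{waiter} waits on {fname} page {page}, held by {holder} (same thread)")
```

Fed the excerpt above on stdin, this reports three lockers on thread
88000/34386526336, and locker 8000a85f waiting on dn2id.bdb page 559, which is
write-locked by locker 80001047 on the same thread.
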
I would first recommend changing from BDB 4.6.21 to some other version. There
are no code paths in back-bdb where we would ever return without either
committing or aborting the current transactions, so this appears to be a BDB
bug, not an OpenLDAP bug.
We have also collected the backtrace for all the threads, which I have
uploaded to:
ftp://ftp.openldap.org/incoming/nikolai-gdb-120902.txt
The full db_stat output is located at:
ftp://ftp.openldap.org/incoming/nikolai-dbstat-120902.txt
--
-- Howard Chu
CTO, Symas Corp.           http://www.symas.com
Director, Highland Sun     http://highlandsun.com/hyc/
Chief Architect, OpenLDAP  http://www.openldap.org/project/