nikolai@net24.co.nz wrote:
Full_Name: Nikolai Schupbach
Version: 2.4.31
OS: FreeBSD
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (202.78.158.60)
We are experiencing frequent hangs in slapd. Once it has hung we can still connect, but all searches hang indefinitely until we kill -9 the slapd process and restart it. The directory is used for mail routing and we have been migrating to it from an existing directory server over the last 3 weeks. We have noted that the busier the directory becomes, the more often it hangs (now once every 2 days).
We have one master and 10 syncrepl read-only replicas. The master is used mainly for writes and has not hung yet, but most of the replicas have hung at least once. The replicas receive between 50 and 300 searches/sec, while the master gets only about 1/sec. There are 45k entries in the directory.
We are running:
FreeBSD 8.3/9.0 x64
OpenLDAP 2.4.31
Berkeley DB 4.6.21
The old directory we are migrating from has the same load and is also running OpenLDAP, but has been rock solid for 5 years. It is running Berkeley DB 4.3.29 and OpenLDAP 2.3.27.
We have managed to collect db_stat lock information, which indicates the same issue each time - a write lock on dn2id.bdb.
It's more than that. Your db_stat shows that a single thread has 3 active transactions. This should never happen:
8000a85e dd= 0 locks held 2 write locks 0 pid/thread 88000/34386526336
8000a85e READ 1 HELD 0xb19a8 len: 9 data: 40xa800000000000000
8000a85e READ 1 HELD 0xb26c8 len: 9 data: 60xa800000000000000
8000a85f dd= 0 locks held 8 write locks 4 pid/thread 88000/34386526336
8000a85f READ 1 WAIT dn2id.bdb page 559
8000a85f READ 1 HELD dn2id.bdb page 768
8000a85f WRITE 2 HELD dn2id.bdb page 1362
8000a85f READ 2 HELD dn2id.bdb page 1362
8000a85f WRITE 2 HELD dn2id.bdb page 1353
8000a85f READ 2 HELD dn2id.bdb page 1353
8000a85f WRITE 2 HELD dn2id.bdb page 933
8000a85f READ 1 HELD dn2id.bdb page 933
8000a85f WRITE 4 HELD dn2id.bdb page 219
80001047 dd=28 locks held 1 write locks 1 pid/thread 88000/34386526336
80001047 WRITE 1 HELD dn2id.bdb page 559
I would first recommend changing from BDB 4.6.21 to some other version. There are no code paths in back-bdb where we would ever return without either committing or aborting the current transactions, so this appears to be a BDB bug, not an OpenLDAP bug.
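For anyone who wants to check their own db_stat output for the same symptom, here is a minimal Python sketch that groups lockers by pid/thread and reports any thread owning more than one transactional locker. It is only an illustration: the function names are hypothetical, the regular expression is written against the locker header lines quoted above and may need adjusting for other BDB versions, and it assumes that locker IDs at or above 0x80000000 belong to transactions.

```python
import re
from collections import defaultdict

# Matches locker header lines of the form quoted above, e.g.
#   8000a85f dd= 0 locks held 8 write locks 4 pid/thread 88000/34386526336
# Individual lock lines (READ/WRITE ... HELD/WAIT) do not match and are skipped.
LOCKER_RE = re.compile(
    r"^\s*([0-9a-fA-F]+)\s+dd=\s*\d+\s+locks held\s+(\d+)"
    r"\s+write locks\s+(\d+)\s+pid/thread\s+(\d+)/(\d+)"
)

# Assumption: locker IDs from this value upward are transaction IDs.
TXN_MINIMUM = 0x80000000


def txn_lockers_by_thread(lines):
    """Map (pid, thread) to the list of transactional lockers it owns."""
    threads = defaultdict(list)
    for line in lines:
        m = LOCKER_RE.match(line)
        if not m:
            continue
        locker, held, writes, pid, tid = m.groups()
        if int(locker, 16) >= TXN_MINIMUM:
            threads[(int(pid), int(tid))].append(
                (locker, int(held), int(writes))
            )
    return dict(threads)


def threads_with_multiple_txns(lines):
    """Return only the threads that own more than one transactional locker."""
    return {
        key: lockers
        for key, lockers in txn_lockers_by_thread(lines).items()
        if len(lockers) > 1
    }


# A shortened copy of the output quoted above, used as sample input.
SAMPLE = """\
8000a85e dd= 0 locks held 2 write locks 0 pid/thread 88000/34386526336
8000a85e READ 1 HELD 0xb19a8 len: 9 data: 40xa800000000000000
8000a85f dd= 0 locks held 8 write locks 4 pid/thread 88000/34386526336
8000a85f READ 1 WAIT dn2id.bdb page 559
8000a85f WRITE 4 HELD dn2id.bdb page 219
80001047 dd=28 locks held 1 write locks 1 pid/thread 88000/34386526336
80001047 WRITE 1 HELD dn2id.bdb page 559
"""

if __name__ == "__main__":
    for (pid, tid), lockers in threads_with_multiple_txns(
        SAMPLE.splitlines()
    ).items():
        ids = ", ".join(locker for locker, _, _ in lockers)
        print(f"pid {pid} thread {tid} owns {len(lockers)} lockers: {ids}")
```

Run against the sample, this flags thread 34386526336 in pid 88000 as owning lockers 8000a85e, 8000a85f and 80001047. Note that 8000a85f is waiting on dn2id.bdb page 559 while 80001047, owned by the same thread, holds the write lock on that page.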
We have also collected the backtrace for all the threads which I have uploaded to:
ftp://ftp.openldap.org/incoming/nikolai-gdb-120902.txt
The full db_stat output is located at:
ftp://ftp.openldap.org/incoming/nikolai-dbstat-120902.txt