My environment consists of 1 provider and 6 consumer machines using delta-syncrepl for replication. For some reason, one of the consumer machines stopped receiving changes for several days and did not resume syncing until slapd was killed (SIGKILL) and restarted.
All machines are running 2.4.30 with BerkeleyDB 4.7.25 (plus patches), the hdb backend, and tcmalloc.
I notice that 2.4.31 has several syncrepl fixes, and I intend to upgrade in the near future, but my issue seems different from the ITS issues listed in the change log. I'm posting here to determine whether this has already been fixed or whether I need to open a new ITS.
When I noticed one of the machines was not syncing properly, I first ensured that I could still query it. Once I had verified that, I changed the LogLevel to 'any' to make sure I wasn't missing any logs that would help me figure out what the issue was. Unfortunately, this LDAP modify caused OpenLDAP to hang: the operation never finished, and new ldapsearch queries simply hung as well.
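For completeness, the modify that hung was the usual cn=config loglevel change, along these lines (the ldapi/EXTERNAL bind here is illustrative, not necessarily how we authenticate):
=====
# change the loglevel of the running slapd via cn=config
ldapmodify -Y EXTERNAL -H ldapi:/// <<'EOF'
dn: cn=config
changetype: modify
replace: olcLogLevel
olcLogLevel: any
EOF
=====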
At this point I captured a gdb backtrace and db_stat output from BDB.
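In case the exact commands are useful, the captures were along these lines (the environment path is illustrative):
=====
# grab backtraces from the running slapd without stopping it
gdb -p "$(pidof slapd)" -batch -ex 'thread apply all bt' > slapd.bt

# BDB lock statistics, run against the database environment directory
db_stat -c -h /var/openldap-data
db_stat -C A -h /var/openldap-data
=====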
db_stat -c:
=====
Default locking region information:
640         Last allocated locker ID
0x7fffffff  Current maximum unused locker ID
9           Number of lock modes
3000        Maximum number of locks possible
1500        Maximum number of lockers possible
1500        Maximum number of lock objects possible
80          Number of lock object partitions
47          Number of current locks
1262        Maximum number of locks at any one time
19          Maximum number of locks in any one bucket
0           Maximum number of locks stolen by for an empty partition
0           Maximum number of locks stolen for any one partition
340         Number of current lockers
341         Maximum number of lockers at any one time
44          Number of current lock objects
683         Maximum number of lock objects at any one time
6           Maximum number of lock objects in any one bucket
0           Maximum number of objects stolen by for an empty partition
0           Maximum number of objects stolen for any one partition
3799M       Total number of locks requested (3799770221)
3799M       Total number of locks released (3799714045)
0           Total number of locks upgraded
73          Total number of locks downgraded
73M         Lock requests not available due to conflicts, for which we waited (73026840)
5165        Lock requests not available due to conflicts, for which we did not wait
50963       Number of deadlocks
0           Lock timeout value
0           Number of locks that have timed out
0           Transaction timeout value
0           Number of transactions that have timed out
1MB 400KB   The size of the lock region
70M         The number of partition locks that required waiting (1%)
15M         The maximum number of times any partition lock was waited for (0%)
15018       The number of object queue operations that required waiting (0%)
54M         The number of locker allocations that required waiting (2%)
0           The number of region locks that required waiting (0%)
6           Maximum hash bucket length
=====
The number of deadlocks was pretty shocking; I've never seen that value non-zero.
I also have db_stat -C A output if that helps.
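For reference, my understanding is that these deadlocks are normally broken by BDB's deadlock detector, which can be tuned via DB_CONFIG; a sketch of the relevant directives (values chosen to match the maximums db_stat reports above, not our exact config):
=====
# DB_CONFIG in the database environment directory (illustrative values)
set_lk_detect DB_LOCK_DEFAULT   # reject a lock request when a deadlock is detected
set_lk_max_locks 3000           # matches "Maximum number of locks possible" above
set_lk_max_lockers 1500
set_lk_max_objects 1500
=====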
gdb (syncrepl thread):
=====
Thread 12 (Thread 0x444a9950 (LWP 16336)):
#0  0x00007f5e087f2b99 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x00007f5e08a274fb in __db_pthread_mutex_lock () from /apps/local/depend/BerkeleyDB-4.7.25p4/lib/libdb-4.7.so
#2  0x00007f5e08aa5dec in __lock_get_internal () from /apps/local/depend/BerkeleyDB-4.7.25p4/lib/libdb-4.7.so
#3  0x00007f5e08aa6d6a in __lock_vec () from /apps/local/depend/BerkeleyDB-4.7.25p4/lib/libdb-4.7.so
#4  0x00007f5e08aa72db in __lock_vec_pp () from /apps/local/depend/BerkeleyDB-4.7.25p4/lib/libdb-4.7.so
#5  0x000000000054da72 in hdb_cache_entry_db_relock (bdb=0xaec000, txn=0x1031c20, ei=0x3d5e240, rw=1, tryOnly=0, lock=0x444a7b30) at cache.c:198
#6  0x000000000054fe9e in hdb_cache_modify (bdb=0xaec000, e=0x7f5ccb945f58, newAttrs=0x7f5ccf4c1ec0, txn=0x1031c20, lock=0x444a7b30) at cache.c:1231
#7  0x00000000004fc2bd in hdb_modify (op=0x444a86d0, rs=0x444a8040) at modify.c:711
#8  0x00000000004db66f in overlay_op_walk (op=0x444a86d0, rs=0x444a8040, which=op_modify, oi=0xace000, on=0x0) at backover.c:671
#9  0x00000000004db899 in over_op_func (op=0x444a86d0, rs=0x444a8040, which=op_modify) at backover.c:723
#10 0x00000000004db9de in over_op_modify (op=0x444a86d0, rs=0x444a8040) at backover.c:762
#11 0x00000000004c8c59 in syncrepl_message_to_op (si=0xb22000, op=0x444a86d0, msg=0x7f5cf7df5f80) at syncrepl.c:2316
#12 0x00000000004c40dd in do_syncrep2 (op=0x444a86d0, si=0xb22000) at syncrepl.c:986
#13 0x00000000004c61a6 in do_syncrepl (ctx=0x444a8df0, arg=0xb768c0) at syncrepl.c:1522
#14 0x0000000000449003 in connection_read_thread (ctx=0x444a8df0, argv=0x31) at connection.c:1288
#15 0x0000000000591269 in ldap_int_thread_pool_wrapper (xpool=0xa741c0) at tpool.c:688
#16 0x00007f5e087ee3f7 in start_thread () from /lib/libpthread.so.0
#17 0x00007f5e0794dbbd in clone () from /lib/libc.so.6
#18 0x0000000000000000 in ?? ()
=====
gdb (thread modifying cn=config):
=====
Thread 3 (Thread 0x48cb2950 (LWP 29597)):
#0  0x00007f5e087f2b99 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x0000000000592806 in ldap_pvt_thread_cond_wait (cond=0xa74220, mutex=0xa741c8) at thr_posix.c:277
#2  0x000000000059162f in handle_pause (tpool=0x8b5420, pause_type=96) at tpool.c:788
#3  0x00000000005916fa in ldap_pvt_thread_pool_pause (tpool=0x8b5420) at tpool.c:831
#4  0x0000000000433ac4 in config_back_modify (op=0x7f5cdc507800, rs=0x48cb1c90) at bconfig.c:5837
#5  0x0000000000468e52 in fe_op_modify (op=0x7f5cdc507800, rs=0x48cb1c90) at modify.c:303
#6  0x0000000000468761 in do_modify (op=0x7f5cdc507800, rs=0x48cb1c90) at modify.c:177
#7  0x0000000000448a65 in connection_operation (ctx=0x48cb1df0, arg_v=0x7f5cdc507800) at connection.c:1150
#8  0x0000000000448fe7 in connection_read_thread (ctx=0x48cb1df0, argv=0x15) at connection.c:1286
#9  0x0000000000591269 in ldap_int_thread_pool_wrapper (xpool=0xa741c0) at tpool.c:688
#10 0x00007f5e087ee3f7 in start_thread () from /lib/libpthread.so.0
#11 0x00007f5e0794dbbd in clone () from /lib/libc.so.6
#12 0x0000000000000000 in ?? ()
=====
I also have bt full output if needed.
Since restarting, I have seen no issues with any of the instances, and the failed instance synced without issue.
Let me know if I should create an ITS.
Thanks,
Dave
Hi Dave,
Have you been able to reproduce it since?
Thanks.
On 2012-07-18 17:15, Gavin Henry wrote:
Have you been able to reproduce it since?
So far I've only had the one failure and I haven't been able to reproduce it since.
That's tricky then. Did you file an ITS? Will check...
On 2012-07-18 18:01, Gavin Henry wrote:
That's tricky then. Did you file an ITS? Will check...
I haven't filed an ITS yet, but another consumer locked up yesterday.
This time I didn't try to modify anything while syncrepl was locked.
The backtrace for the hdb_modify is the same, but there are also threads locked on binds and searches. It turns out that these searches and binds are for the very user being modified. The consumer shut down, came back up normally, and replication continued without incident.
Here's a bind that is locked:
=====
Thread 5 (Thread 0x474cb950 (LWP 27575)):
#0  0x00007fad5de7eb99 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x00007fad5e0b34fb in __db_pthread_mutex_lock () from /apps/local/depend/BerkeleyDB-4.7.25p4/lib/libdb-4.7.so
#2  0x00007fad5e131dec in __lock_get_internal () from /apps/local/depend/BerkeleyDB-4.7.25p4/lib/libdb-4.7.so
#3  0x00007fad5e132391 in __lock_get_pp () from /apps/local/depend/BerkeleyDB-4.7.25p4/lib/libdb-4.7.so
#4  0x000000000054dbfa in bdb_cache_entry_db_lock (bdb=0xaec000, txn=0x32ee9560, ei=0xea8d680, rw=0, tryOnly=0, lock=0x474ca8b0) at cache.c:234
#5  0x000000000054f4d5 in hdb_cache_find_id (op=0x7fac2a014c00, tid=0x32ee9560, id=1455322, eip=0x474ca870, flag=0, lock=0x474ca8b0) at cache.c:988
#6  0x00000000005564f5 in hdb_dn2entry (op=0x7fac2a014c00, tid=0x32ee9560, dn=0x7fac2a014c38, e=0x474ca8e0, matched=1, lock=0x474ca8b0) at dn2entry.c:67
#7  0x000000000054d212 in hdb_bind (op=0x7fac2a014c00, rs=0x474cac90) at bind.c:70
...
=====
Clearly I need to upgrade and see whether this continues to happen.
David Hawes wrote:
Clearly I need to upgrade and see whether this continues to happen.
Give back-mdb in 2.4.32 a try. It's completely deadlock free.
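A minimal slapd.conf-style definition to start from (suffix, paths, and sizes are placeholders):
=====
# minimal back-mdb database (placeholders; adjust for your data)
database   mdb
suffix     "dc=example,dc=com"
rootdn     "cn=admin,dc=example,dc=com"
directory  /var/openldap-mdb
maxsize    10737418240    # 10GB memory map; must be big enough for the whole DB
index      objectClass eq
=====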
So the usual rules apply, right?
1. slapcat
2. stop slapd
3. change backend definition
4. empty db dir or use a new one
5. slapadd
6. start slapd
or skip the stop/start if using cn=config, and make sure the mdb module is loaded if slapd is built that way.
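In script form, the stop/start path is roughly this (suffix, paths, and binary locations are all made up):
=====
# 1. dump the current hdb database
slapcat -f /etc/openldap/slapd.conf -b "dc=example,dc=com" -l data.ldif

# 2. stop slapd
kill "$(cat /var/run/slapd.pid)"

# 3./4. switch "database hdb" to "database mdb" in slapd.conf and point
#       the directory directive at a fresh, empty directory
mkdir -p /var/openldap-mdb

# 5. reload the data under the new backend
slapadd -f /etc/openldap/slapd.conf -b "dc=example,dc=com" -l data.ldif

# 6. start slapd again
/usr/local/libexec/slapd -f /etc/openldap/slapd.conf
=====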
Thanks.
I think that even in the cn=config case a stop/start is necessary, because a database can't be deleted, and I doubt a backend change can be performed dynamically. Maybe some manual LDIF editing is even necessary, something evil ;)
Erwann Abalea wrote:
I think that even in the cn=config case a stop/start is necessary, ...
You can get the new database running dynamically. But since the new backend uses a different structural objectClass, no, you can't just modify the existing olcDatabase entry.
I would create a new slapd.d for slapadd'ing the new database. Then, when that's complete, ldapmodify the running slapd to add the MDB database and disable the existing one. Something like:
###
dn: olcDatabase=mdb,cn=config
changetype: add
...
olcSuffix: o=temporary
...

dn: olcDatabase={1}hdb,cn=config
changetype: modify
replace: olcDbDirectory
olcDbDirectory: /tmp/dummy
-
replace: olcSuffix
olcSuffix: o=unused
-

dn: olcDatabase=mdb,cn=config
changetype: modify
replace: olcSuffix
olcSuffix: dc=the,dc=suffix
-
###
Add the MDB database with a fake suffix first, then turn off the BDB/HDB database and set it to a dummy suffix, then put the real suffix onto the MDB database.
Pointing the BDB/HDB backend at an empty directory will close the current DB, freeing up all of its resources. There is a window of maybe a few microseconds between switching the suffix on the old database and switching it on the new database where clients might get No Such Object errors on their queries. If we ever get around to implementing LDAP Transaction support for cn=config, then the whole thing can be done atomically/transparently.
The slapd.d that was used for the MDB slapadd can be discarded afterward.
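To sanity-check the cutover afterward, something like this (URIs are placeholders):
=====
# the suffix should now be answered by the MDB database
ldapsearch -x -H ldap://consumer.example.com -b "dc=the,dc=suffix" -s base dn

# and cn=config should show the suffix on the olcDatabase=mdb entry
ldapsearch -Y EXTERNAL -H ldapi:/// -b cn=config "(olcSuffix=dc=the,dc=suffix)" dn
=====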