My environment consists of one provider and six consumers using delta-syncrepl for replication. For reasons I have not been able to determine, one of the consumers stopped receiving changes for several days and did not resume syncing until slapd was restarted (via SIGKILL).
All machines run OpenLDAP 2.4.30 with BerkeleyDB 4.7.25 (plus patches), the hdb backend, and tcmalloc.
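For context, each consumer uses a fairly standard delta-syncrepl configuration. The stanza below is only a rough sketch of its shape, shown in slapd.conf syntax (the olcSyncrepl value has the same form); hostnames, DNs, and credentials are placeholders, not my real values:

    # delta-syncrepl consumer stanza (placeholder values)
    syncrepl rid=001
      provider=ldap://provider.example.com
      type=refreshAndPersist
      retry="60 +"
      searchbase="dc=example,dc=com"
      bindmethod=simple
      binddn="cn=replicator,dc=example,dc=com"
      credentials=secret
      logbase="cn=accesslog"
      logfilter="(&(objectClass=auditWriteObject)(reqResult=0))"
      syncdata=accesslog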
I see that 2.4.31 contains several syncrepl fixes, and I intend to upgrade in the near future, but my issue looks different from the ITSes listed in the changelog. I'm posting here to find out whether this has already been fixed or whether I need to open a new ITS.
When I noticed that one of the machines was not syncing properly, I first confirmed that I could still query it. Having verified that, I changed the loglevel to 'any' (via an LDAP modify of cn=config) to make sure I wasn't missing any log messages that would help me identify the problem. Unfortunately, that modify caused slapd to hang: the operation never completed, and new ldapsearch queries simply hung as well.
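The modify itself was nothing unusual; from memory it was essentially this LDIF against cn=config:

    dn: cn=config
    changetype: modify
    replace: olcLogLevel
    olcLogLevel: any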
At this point I captured a gdb backtrace from the running slapd and lock statistics from the BDB environment.
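Roughly, the commands I used to collect this were as follows (the database directory below is a placeholder for my actual environment directory):

    # attach to the running slapd and dump all thread stacks
    gdb -p $(pidof slapd)
    (gdb) thread apply all bt full

    # lock subsystem statistics from the BDB environment
    db_stat -c -h /var/lib/ldap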
db_stat -c:
=====
Default locking region information:
640         Last allocated locker ID
0x7fffffff  Current maximum unused locker ID
9           Number of lock modes
3000        Maximum number of locks possible
1500        Maximum number of lockers possible
1500        Maximum number of lock objects possible
80          Number of lock object partitions
47          Number of current locks
1262        Maximum number of locks at any one time
19          Maximum number of locks in any one bucket
0           Maximum number of locks stolen by for an empty partition
0           Maximum number of locks stolen for any one partition
340         Number of current lockers
341         Maximum number of lockers at any one time
44          Number of current lock objects
683         Maximum number of lock objects at any one time
6           Maximum number of lock objects in any one bucket
0           Maximum number of objects stolen by for an empty partition
0           Maximum number of objects stolen for any one partition
3799M       Total number of locks requested (3799770221)
3799M       Total number of locks released (3799714045)
0           Total number of locks upgraded
73          Total number of locks downgraded
73M         Lock requests not available due to conflicts, for which we waited (73026840)
5165        Lock requests not available due to conflicts, for which we did not wait
50963       Number of deadlocks
0           Lock timeout value
0           Number of locks that have timed out
0           Transaction timeout value
0           Number of transactions that have timed out
1MB 400KB   The size of the lock region
70M         The number of partition locks that required waiting (1%)
15M         The maximum number of times any partition lock was waited for (0%)
15018       The number of object queue operations that required waiting (0%)
54M         The number of locker allocations that required waiting (2%)
0           The number of region locks that required waiting (0%)
6           Maximum hash bucket length
=====
The number of deadlocks was pretty shocking; I've never seen that value non-zero.
I also have db_stat -C A output if that helps.
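If it would be useful, I can also pull the per-locker and per-object breakdown, e.g. (again with a placeholder database directory):

    db_stat -h /var/lib/ldap -Cl   # current lockers
    db_stat -h /var/lib/ldap -Co   # lock objects, showing holders and waiters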
gdb (syncrepl thread):
=====
Thread 12 (Thread 0x444a9950 (LWP 16336)):
#0  0x00007f5e087f2b99 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x00007f5e08a274fb in __db_pthread_mutex_lock () from /apps/local/depend/BerkeleyDB-4.7.25p4/lib/libdb-4.7.so
#2  0x00007f5e08aa5dec in __lock_get_internal () from /apps/local/depend/BerkeleyDB-4.7.25p4/lib/libdb-4.7.so
#3  0x00007f5e08aa6d6a in __lock_vec () from /apps/local/depend/BerkeleyDB-4.7.25p4/lib/libdb-4.7.so
#4  0x00007f5e08aa72db in __lock_vec_pp () from /apps/local/depend/BerkeleyDB-4.7.25p4/lib/libdb-4.7.so
#5  0x000000000054da72 in hdb_cache_entry_db_relock (bdb=0xaec000, txn=0x1031c20, ei=0x3d5e240, rw=1, tryOnly=0, lock=0x444a7b30) at cache.c:198
#6  0x000000000054fe9e in hdb_cache_modify (bdb=0xaec000, e=0x7f5ccb945f58, newAttrs=0x7f5ccf4c1ec0, txn=0x1031c20, lock=0x444a7b30) at cache.c:1231
#7  0x00000000004fc2bd in hdb_modify (op=0x444a86d0, rs=0x444a8040) at modify.c:711
#8  0x00000000004db66f in overlay_op_walk (op=0x444a86d0, rs=0x444a8040, which=op_modify, oi=0xace000, on=0x0) at backover.c:671
#9  0x00000000004db899 in over_op_func (op=0x444a86d0, rs=0x444a8040, which=op_modify) at backover.c:723
#10 0x00000000004db9de in over_op_modify (op=0x444a86d0, rs=0x444a8040) at backover.c:762
#11 0x00000000004c8c59 in syncrepl_message_to_op (si=0xb22000, op=0x444a86d0, msg=0x7f5cf7df5f80) at syncrepl.c:2316
#12 0x00000000004c40dd in do_syncrep2 (op=0x444a86d0, si=0xb22000) at syncrepl.c:986
#13 0x00000000004c61a6 in do_syncrepl (ctx=0x444a8df0, arg=0xb768c0) at syncrepl.c:1522
#14 0x0000000000449003 in connection_read_thread (ctx=0x444a8df0, argv=0x31) at connection.c:1288
#15 0x0000000000591269 in ldap_int_thread_pool_wrapper (xpool=0xa741c0) at tpool.c:688
#16 0x00007f5e087ee3f7 in start_thread () from /lib/libpthread.so.0
#17 0x00007f5e0794dbbd in clone () from /lib/libc.so.6
#18 0x0000000000000000 in ?? ()
=====
gdb (thread modifying cn=config):
=====
Thread 3 (Thread 0x48cb2950 (LWP 29597)):
#0  0x00007f5e087f2b99 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x0000000000592806 in ldap_pvt_thread_cond_wait (cond=0xa74220, mutex=0xa741c8) at thr_posix.c:277
#2  0x000000000059162f in handle_pause (tpool=0x8b5420, pause_type=96) at tpool.c:788
#3  0x00000000005916fa in ldap_pvt_thread_pool_pause (tpool=0x8b5420) at tpool.c:831
#4  0x0000000000433ac4 in config_back_modify (op=0x7f5cdc507800, rs=0x48cb1c90) at bconfig.c:5837
#5  0x0000000000468e52 in fe_op_modify (op=0x7f5cdc507800, rs=0x48cb1c90) at modify.c:303
#6  0x0000000000468761 in do_modify (op=0x7f5cdc507800, rs=0x48cb1c90) at modify.c:177
#7  0x0000000000448a65 in connection_operation (ctx=0x48cb1df0, arg_v=0x7f5cdc507800) at connection.c:1150
#8  0x0000000000448fe7 in connection_read_thread (ctx=0x48cb1df0, argv=0x15) at connection.c:1286
#9  0x0000000000591269 in ldap_int_thread_pool_wrapper (xpool=0xa741c0) at tpool.c:688
#10 0x00007f5e087ee3f7 in start_thread () from /lib/libpthread.so.0
#11 0x00007f5e0794dbbd in clone () from /lib/libc.so.6
#12 0x0000000000000000 in ?? ()
=====
I also have bt full output if needed.
Since restarting, I have seen no problems on any of the instances, and the previously stuck consumer has been syncing without issue.
Let me know if I should create an ITS.
Thanks,
Dave