My environment consists of 1 provider and 6 consumer machines using delta-syncrepl for replication. For some reason, one of the consumer machines stopped receiving changes for several days and did not resume syncing until slapd was killed (SIGKILL) and restarted.
All machines are running 2.4.30 with BerkeleyDB 4.7.25 (plus patches), the hdb backend, and tcmalloc.
I notice that 2.4.31 has several syncrepl fixes, and I intend to upgrade in the near future, but my issue seems different from the ITS issues listed in the change log. I'm posting here to determine whether this has already been fixed or whether I need to open a new ITS.
When I noticed one of the machines was not syncing properly, I first ensured that I could still query it. Once I had verified that, I changed the LogLevel to 'any' to make sure I wasn't missing any logs that would help me figure out what the issue was. Unfortunately, this LDAP modify caused OpenLDAP to hang: the operation never finished, and new ldapsearch queries simply hung as well.
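For completeness, the modify that hung was the usual cn=config loglevel change, along these lines (the ldapi/EXTERNAL bind here is illustrative, not necessarily how we authenticate):
=====
# change the loglevel of the running slapd via cn=config
ldapmodify -Y EXTERNAL -H ldapi:/// <<'EOF'
dn: cn=config
changetype: modify
replace: olcLogLevel
olcLogLevel: any
EOF
=====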
At this point I captured a gdb backtrace and db_stat output from BDB.
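In case the exact commands are useful, the captures were along these lines (the environment path is illustrative):
=====
# grab backtraces from the running slapd without stopping it
gdb -p "$(pidof slapd)" -batch -ex 'thread apply all bt' > slapd.bt

# BDB lock statistics, run against the database environment directory
db_stat -c -h /var/openldap-data
db_stat -C A -h /var/openldap-data
=====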
db_stat -c:
=====
Default locking region information:
640         Last allocated locker ID
0x7fffffff  Current maximum unused locker ID
9           Number of lock modes
3000        Maximum number of locks possible
1500        Maximum number of lockers possible
1500        Maximum number of lock objects possible
80          Number of lock object partitions
47          Number of current locks
1262        Maximum number of locks at any one time
19          Maximum number of locks in any one bucket
0           Maximum number of locks stolen by for an empty partition
0           Maximum number of locks stolen for any one partition
340         Number of current lockers
341         Maximum number of lockers at any one time
44          Number of current lock objects
683         Maximum number of lock objects at any one time
6           Maximum number of lock objects in any one bucket
0           Maximum number of objects stolen by for an empty partition
0           Maximum number of objects stolen for any one partition
3799M       Total number of locks requested (3799770221)
3799M       Total number of locks released (3799714045)
0           Total number of locks upgraded
73          Total number of locks downgraded
73M         Lock requests not available due to conflicts, for which we waited (73026840)
5165        Lock requests not available due to conflicts, for which we did not wait
50963       Number of deadlocks
0           Lock timeout value
0           Number of locks that have timed out
0           Transaction timeout value
0           Number of transactions that have timed out
1MB 400KB   The size of the lock region
70M         The number of partition locks that required waiting (1%)
15M         The maximum number of times any partition lock was waited for (0%)
15018       The number of object queue operations that required waiting (0%)
54M         The number of locker allocations that required waiting (2%)
0           The number of region locks that required waiting (0%)
6           Maximum hash bucket length
=====
The number of deadlocks was pretty shocking; I've never seen that value non-zero.
I also have db_stat -C A output if that helps.
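For reference, my understanding is that these deadlocks are normally broken by BDB's deadlock detector, which can be tuned via DB_CONFIG; a sketch of the relevant directives (values chosen to match the maximums db_stat reports above, not our exact config):
=====
# DB_CONFIG in the database environment directory (illustrative values)
set_lk_detect DB_LOCK_DEFAULT   # reject a lock request when a deadlock is detected
set_lk_max_locks 3000           # matches "Maximum number of locks possible" above
set_lk_max_lockers 1500
set_lk_max_objects 1500
=====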
gdb (syncrepl thread):
=====
Thread 12 (Thread 0x444a9950 (LWP 16336)):
#0  0x00007f5e087f2b99 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x00007f5e08a274fb in __db_pthread_mutex_lock () from /apps/local/depend/BerkeleyDB-4.7.25p4/lib/libdb-4.7.so
#2  0x00007f5e08aa5dec in __lock_get_internal () from /apps/local/depend/BerkeleyDB-4.7.25p4/lib/libdb-4.7.so
#3  0x00007f5e08aa6d6a in __lock_vec () from /apps/local/depend/BerkeleyDB-4.7.25p4/lib/libdb-4.7.so
#4  0x00007f5e08aa72db in __lock_vec_pp () from /apps/local/depend/BerkeleyDB-4.7.25p4/lib/libdb-4.7.so
#5  0x000000000054da72 in hdb_cache_entry_db_relock (bdb=0xaec000, txn=0x1031c20, ei=0x3d5e240, rw=1, tryOnly=0, lock=0x444a7b30) at cache.c:198
#6  0x000000000054fe9e in hdb_cache_modify (bdb=0xaec000, e=0x7f5ccb945f58, newAttrs=0x7f5ccf4c1ec0, txn=0x1031c20, lock=0x444a7b30) at cache.c:1231
#7  0x00000000004fc2bd in hdb_modify (op=0x444a86d0, rs=0x444a8040) at modify.c:711
#8  0x00000000004db66f in overlay_op_walk (op=0x444a86d0, rs=0x444a8040, which=op_modify, oi=0xace000, on=0x0) at backover.c:671
#9  0x00000000004db899 in over_op_func (op=0x444a86d0, rs=0x444a8040, which=op_modify) at backover.c:723
#10 0x00000000004db9de in over_op_modify (op=0x444a86d0, rs=0x444a8040) at backover.c:762
#11 0x00000000004c8c59 in syncrepl_message_to_op (si=0xb22000, op=0x444a86d0, msg=0x7f5cf7df5f80) at syncrepl.c:2316
#12 0x00000000004c40dd in do_syncrep2 (op=0x444a86d0, si=0xb22000) at syncrepl.c:986
#13 0x00000000004c61a6 in do_syncrepl (ctx=0x444a8df0, arg=0xb768c0) at syncrepl.c:1522
#14 0x0000000000449003 in connection_read_thread (ctx=0x444a8df0, argv=0x31) at connection.c:1288
#15 0x0000000000591269 in ldap_int_thread_pool_wrapper (xpool=0xa741c0) at tpool.c:688
#16 0x00007f5e087ee3f7 in start_thread () from /lib/libpthread.so.0
#17 0x00007f5e0794dbbd in clone () from /lib/libc.so.6
#18 0x0000000000000000 in ?? ()
=====
gdb (thread modifying cn=config):
=====
Thread 3 (Thread 0x48cb2950 (LWP 29597)):
#0  0x00007f5e087f2b99 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x0000000000592806 in ldap_pvt_thread_cond_wait (cond=0xa74220, mutex=0xa741c8) at thr_posix.c:277
#2  0x000000000059162f in handle_pause (tpool=0x8b5420, pause_type=96) at tpool.c:788
#3  0x00000000005916fa in ldap_pvt_thread_pool_pause (tpool=0x8b5420) at tpool.c:831
#4  0x0000000000433ac4 in config_back_modify (op=0x7f5cdc507800, rs=0x48cb1c90) at bconfig.c:5837
#5  0x0000000000468e52 in fe_op_modify (op=0x7f5cdc507800, rs=0x48cb1c90) at modify.c:303
#6  0x0000000000468761 in do_modify (op=0x7f5cdc507800, rs=0x48cb1c90) at modify.c:177
#7  0x0000000000448a65 in connection_operation (ctx=0x48cb1df0, arg_v=0x7f5cdc507800) at connection.c:1150
#8  0x0000000000448fe7 in connection_read_thread (ctx=0x48cb1df0, argv=0x15) at connection.c:1286
#9  0x0000000000591269 in ldap_int_thread_pool_wrapper (xpool=0xa741c0) at tpool.c:688
#10 0x00007f5e087ee3f7 in start_thread () from /lib/libpthread.so.0
#11 0x00007f5e0794dbbd in clone () from /lib/libc.so.6
#12 0x0000000000000000 in ?? ()
=====
I also have bt full output if needed.
Since restarting, I have seen no issues with any of the instances, and the failed instance synced without issue.
Let me know if I should create an ITS.
Thanks,
Dave
Hi Dave,
Have you been able to reproduce it since?
Thanks.
On 2012-07-18 17:15, Gavin Henry wrote:
Have you been able to reproduce it since?
So far I've only had the one failure and I haven't been able to reproduce it since.
That's tricky then. Did you file an ITS? Will check...
On 2012-07-18 18:01, Gavin Henry wrote:
That's tricky then. Did you file an ITS? Will check...
I haven't filed an ITS yet, but another consumer locked up yesterday.
This time I didn't try to modify anything while syncrepl was locked.
The backtrace for the hdb_modify is the same, but there are also threads locked on binds and searches. It turns out that these searches and binds are for the very user being modified. The consumer shut down, came back up normally, and replication continued without incident.
Here's a bind that is locked:
=====
Thread 5 (Thread 0x474cb950 (LWP 27575)):
#0  0x00007fad5de7eb99 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x00007fad5e0b34fb in __db_pthread_mutex_lock () from /apps/local/depend/BerkeleyDB-4.7.25p4/lib/libdb-4.7.so
#2  0x00007fad5e131dec in __lock_get_internal () from /apps/local/depend/BerkeleyDB-4.7.25p4/lib/libdb-4.7.so
#3  0x00007fad5e132391 in __lock_get_pp () from /apps/local/depend/BerkeleyDB-4.7.25p4/lib/libdb-4.7.so
#4  0x000000000054dbfa in bdb_cache_entry_db_lock (bdb=0xaec000, txn=0x32ee9560, ei=0xea8d680, rw=0, tryOnly=0, lock=0x474ca8b0) at cache.c:234
#5  0x000000000054f4d5 in hdb_cache_find_id (op=0x7fac2a014c00, tid=0x32ee9560, id=1455322, eip=0x474ca870, flag=0, lock=0x474ca8b0) at cache.c:988
#6  0x00000000005564f5 in hdb_dn2entry (op=0x7fac2a014c00, tid=0x32ee9560, dn=0x7fac2a014c38, e=0x474ca8e0, matched=1, lock=0x474ca8b0) at dn2entry.c:67
#7  0x000000000054d212 in hdb_bind (op=0x7fac2a014c00, rs=0x474cac90) at bind.c:70
...
=====
Clearly I need to upgrade and see whether this continues to happen.
David Hawes wrote:
Clearly I need to upgrade and see whether this continues to happen.
Give back-mdb in 2.4.32 a try. It's completely deadlock free.
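A minimal slapd.conf-style definition to start from (suffix, paths, and sizes are placeholders):
=====
# minimal back-mdb database (placeholders; adjust for your data)
database   mdb
suffix     "dc=example,dc=com"
rootdn     "cn=admin,dc=example,dc=com"
directory  /var/openldap-mdb
maxsize    10737418240    # 10GB memory map; must be big enough for the whole DB
index      objectClass eq
=====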
So the usual rules apply, right?
1. slapcat
2. stop slapd
3. change backend definition
4. empty db dir or use a new one
5. slapadd
6. start slapd
or skip the stop/start if using cn=config, and make sure the mdb module is loaded if slapd is built that way.
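In script form, the stop/start path is roughly this (suffix, paths, and binary locations are all made up):
=====
# 1. dump the current hdb database
slapcat -f /etc/openldap/slapd.conf -b "dc=example,dc=com" -l data.ldif

# 2. stop slapd
kill "$(cat /var/run/slapd.pid)"

# 3./4. switch "database hdb" to "database mdb" in slapd.conf and point
#       the directory directive at a fresh, empty directory
mkdir -p /var/openldap-mdb

# 5. reload the data under the new backend
slapadd -f /etc/openldap/slapd.conf -b "dc=example,dc=com" -l data.ldif

# 6. start slapd again
/usr/local/libexec/slapd -f /etc/openldap/slapd.conf
=====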
Thanks.
I think that even in the cn=config case a stop/start is necessary, because a database can't be deleted, and I doubt a backend change can be performed dynamically. Maybe some manual LDIF editing is even necessary, something evil ;)
Erwann Abalea wrote:
I think that even in the cn=config case a stop/start is necessary, ...
You can get the new database running dynamically. But since the new backend uses a different structural objectClass, no, you can't just modify the existing olcDatabase entry.
I would create a new slapd.d for slapadd'ing the new database. Then, when that's complete, ldapmodify the running slapd to add the MDB database and disable the existing one. Something like:
###
dn: olcDatabase=mdb,cn=config
changetype: add
...
olcSuffix: o=temporary
...

dn: olcDatabase={1}hdb,cn=config
changetype: modify
replace: olcDbDirectory
olcDbDirectory: /tmp/dummy
-
replace: olcSuffix
olcSuffix: o=unused
-

dn: olcDatabase=mdb,cn=config
changetype: modify
replace: olcSuffix
olcSuffix: dc=the,dc=suffix
-
###
Add the MDB database with a fake suffix first, then turn off the BDB/HDB database and set it to a dummy suffix, then put the real suffix onto the MDB database.
Pointing the BDB/HDB backend at an empty directory will close the current DB, freeing up all of its resources. There is a window of maybe a few microseconds between switching the suffix on the old database and switching it on the new database where clients might get No Such Object errors on their queries. If we ever get around to implementing LDAP Transaction support for cn=config, then the whole thing can be done atomically/transparently.
The slapd.d that was used for the MDB slapadd can be discarded afterward.
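To sanity-check the cutover afterward, something like this (URIs are placeholders):
=====
# the suffix should now be answered by the MDB database
ldapsearch -x -H ldap://consumer.example.com -b "dc=the,dc=suffix" -s base dn

# and cn=config should show the suffix on the olcDatabase=mdb entry
ldapsearch -Y EXTERNAL -H ldapi:/// -b cn=config "(olcSuffix=dc=the,dc=suffix)" dn
=====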