hyc@symas.com wrote:
I was unable to reproduce the problem on my multi-core machines, but I do see it on a single-core machine. I've sent a backtrace and other debug info to the Oracle folks, will see what they have to say.
I see the problem; it's a bug in BDB's multi-partition lock manager. When using multiple lock table partitions, it obtains a lock on the system-wide lock mutex and a separate lock on the per-region mutex. On a single-core system it defaults to a single lock table partition. In that case, the macro that obtains the system-wide lock behaves identically to the one that obtains the per-region lock, i.e., both attempt to acquire the exact same mutex. Since that mutex is already held by the calling thread, the second acquisition never returns and the process deadlocks.
(gdb) bt
#0  0xb7f37424 in __kernel_vsyscall ()
#1  0xb7b36c4e in __lll_mutex_lock_wait () from /lib/libpthread.so.0
#2  0xb7b32a3c in _L_mutex_lock_88 () from /lib/libpthread.so.0
#3  0xb7b3242d in pthread_mutex_lock () from /lib/libpthread.so.0
#4  0xb7d00819 in __db_pthread_mutex_lock (env=0x8a84550, mutex=104)
    at ../dist/../mutex/mut_pthread.c:207
#5  0xb7daad19 in __lock_getobj (lt=0x8a84848, obj=0xbfd492ec, ndx=492,
    create=1, retp=0xbfd491e4) at ../dist/../lock/lock.c:1470
#6  0xb7da7f53 in __lock_get_internal (lt=0x8a84848, sh_locker=0xb776d508,
    flags=1, obj=0xbfd492ec, lock_mode=DB_LOCK_READ, timeout=0,
    lock=0xbfd493cc) at ../dist/../lock/lock.c:588
#7  0xb7da77d6 in __lock_get_api (env=0x8a84550, locker=2147483659, flags=1,
    obj=0xbfd492ec, lock_mode=DB_LOCK_READ, lock=0xbfd493cc)
    at ../dist/../lock/lock.c:423
#8  0xb7da765b in __lock_get_pp (dbenv=0x8a841c0, locker=2147483659, flags=1,
    obj=0xbfd492ec, lock_mode=DB_LOCK_READ, lock=0xbfd493cc)
    at ../dist/../lock/lock.c:395
#9  0x08124fb8 in bdb_dn2id_lock (bdb=0x8a68620, dn=0xbfd493f0, rw=0,
    txn=0x8a890b8, lock=0xbfd493cc)
    at ../../../../head/servers/slapd/back-bdb/dn2id.c:47
#10 0x08125d7d in bdb_dn2id (op=0xbfd49640, dn=0xbfd493f0, ei=0xbfd493e0,
    txn=0x8a890b8, lock=0xbfd493cc)
    at ../../../../head/servers/slapd/back-bdb/dn2id.c:307
---Type <return> to continue, or q <return> to quit---q
Quit
(gdb) frame 4
#4  0xb7d00819 in __db_pthread_mutex_lock (env=0x8a84550, mutex=104)
    at ../dist/../mutex/mut_pthread.c:207
207     RET_SET((pthread_mutex_lock(&mutexp->mutex)), ret);
(gdb) p *mutexp
$1 = {mutex = {__data = {__lock = 2, __count = 0, __owner = 29470,
      __kind = 0, __nusers = 1, {__spins = 0, __list = {__next = 0x0}}},
    __size = "\002\000\000\000\000\000\000\000\036s\000\000\000\000\000\000\001\000\000\000\000\000\000",
    __align = 2},
  cond = {__data = {__lock = 0, __futex = 0, __total_seq = 0,
      __wakeup_seq = 0, __woken_seq = 0, __mutex = 0x0, __nwaiters = 0,
      __broadcast_seq = 0}, __size = '\0' <repeats 47 times>, __align = 0},
  pid = 29470, tid = 3080046272, mutex_next_link = 0, alloc_id = 6,
  mutex_set_wait = 1, mutex_set_nowait = 129, flags = 3}
(gdb)
The mutex being acquired in frame 4 is the same one that was already acquired in frame 7, __lock_get_api line 418.
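
For anyone who wants to see the failure mode in isolation, here is a minimal sketch of the double acquisition described above. It is not BDB code: struct lock_region, system_mutex(), partition_mutex() and second_lock_result() are hypothetical names made up for this illustration, and the mapping of partitions to mutexes is only an assumption about the general shape of the problem. It also uses an error-checking mutex (PTHREAD_MUTEX_ERRORCHECK) so that the second lock attempt returns EDEADLK instead of hanging the way the default mutex in the backtrace does.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <pthread.h>

#define MAX_PARTITIONS 8

/* Hypothetical stand-in for a lock region; NOT BDB's real layout.
 * mutexes[0] plays the role of the system-wide mutex, and
 * mutexes[1..npartitions] are the per-partition mutexes. */
struct lock_region {
    int npartitions;
    pthread_mutex_t mutexes[MAX_PARTITIONS + 1];
};

static pthread_mutex_t *system_mutex(struct lock_region *r)
{
    return &r->mutexes[0];
}

/* With a single partition this resolves to the system-wide mutex,
 * which is the situation described in the message above. */
static pthread_mutex_t *partition_mutex(struct lock_region *r, int ndx)
{
    if (r->npartitions == 1)
        return &r->mutexes[0];
    return &r->mutexes[1 + ndx % r->npartitions];
}

/* Take the system-wide mutex, then the partition mutex, and return the
 * result of the second pthread_mutex_lock() call. */
int second_lock_result(int npartitions)
{
    struct lock_region r;
    pthread_mutexattr_t attr;
    int i, ret;

    if (npartitions < 1 || npartitions > MAX_PARTITIONS)
        return EINVAL;
    r.npartitions = npartitions;

    /* Error-checking mutexes report a relock by the owner as EDEADLK;
     * a default mutex would simply block forever here. */
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    for (i = 0; i <= npartitions; i++)
        pthread_mutex_init(&r.mutexes[i], &attr);
    pthread_mutexattr_destroy(&attr);

    pthread_mutex_lock(system_mutex(&r));
    ret = pthread_mutex_lock(partition_mutex(&r, 492));
    if (ret == 0)
        pthread_mutex_unlock(partition_mutex(&r, 492));
    pthread_mutex_unlock(system_mutex(&r));

    for (i = 0; i <= npartitions; i++)
        pthread_mutex_destroy(&r.mutexes[i]);
    return ret;
}
```

In this sketch, second_lock_result(1) reports EDEADLK because both helpers hand back the same mutex, while second_lock_result(4) returns 0 because the two locks are distinct.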