> Howard Chu wrote:
> Your lwp #2 is in the thread pool, probably waiting for lwp #1 to finish. Why lwp #1 is still waiting, dunno.
Yes, that was all I could conclude as well. As to why lwp #1 is still waiting: it's in bdb_tool_entry_close(), but without the debug symbols it's not easy to work out whether it's waiting on the trickle condition or the index condition. What seemed odd is that it was blocked on a condition at all when there was no other active thread capable of unblocking it. A race condition in the code somewhere, I guess.
Call me impatient, but I've dived into the disassembly now rather than waiting on Oracle, who have the debug symbols.
::walk thread | ::findstack -v
stack pointer for thread 1: f5a68538
[ f5a68538 libc_hwcap1.so.1`__lwp_park+0x19() ]
  f5a68568 libc_hwcap1.so.1`cond_wait_queue+0x63(82ed048, 82ed020, 0, 7ff16b0)
  f5a685a8 libc_hwcap1.so.1`__cond_wait+0x89(82ed048, 82ed020, f5a685c8, 7ff1771)
  f5a685c8 libc_hwcap1.so.1`cond_wait+0x27(82ed048, 82ed020, f5a685f8, 7ff17b7)
  f5a685e8 libc_hwcap1.so.1`pthread_cond_wait+0x24(82ed048, 82ed020, f5a68618, 7e56137)
  f5a68608 libldap_r-2.4.so.2.8.3`ldap_pvt_thread_cond_wait+0x24(82ed048, 82ed020, f5a68648, 81b7e29)
  f5a68648 bdb_tool_entry_close+0x62(8e9d398, ffffffff, f5a68698, 82ecc38)
  f5a68ae8 slapadd+0xcb8(4, f5a68bbc, f5a68bbc, 80dccb1)
  f5a68b88 main+0xac(4, f5a68bbc, f5a68bd0, ef60f968)
  f5a68bb0 _start+0x7d(4, f5a68c8e, f5a68ca0, f5a68ca3, f5a68ca6, 0)
The offending ldap_pvt_thread_cond_wait(), which will never return because there is no other thread left to signal it, was called from bdb_tool_entry_close and returns to bdb_tool_entry_close+0x62. Checking the disassembly:
bdb_tool_entry_close::dis
bdb_tool_entry_close:           pushl  %ebp
bdb_tool_entry_close+1:         movl   %esp,%ebp
bdb_tool_entry_close+3:         pushl  %ebx
bdb_tool_entry_close+4:         pushl  %esi
bdb_tool_entry_close+5:         pushl  %edi
bdb_tool_entry_close+6:         subl   $0x1c,%esp
bdb_tool_entry_close+9:         andl   $0xfffffff0,%esp
bdb_tool_entry_close+0xc:       call   +0x0 <bdb_tool_entry_close+0x11>
bdb_tool_entry_close+0x11:      popl   %ebx
bdb_tool_entry_close+0x12:      addl   $0x1171d7,%ebx
bdb_tool_entry_close+0x18:      cmpl   $0x0,0x1dfa4(%ebx)
bdb_tool_entry_close+0x1f:      je     +0x17e <bdb_tool_entry_close+0x1a3>
... we were at +0x62, so the je (a jump of +0x17e, to bdb_tool_entry_close+0x1a3) was not taken. Checking the source code:
int bdb_tool_entry_close(
    BackendDB *be )
{
    if ( bdb_tool_info ) {
... so bdb_tool_info was not a NULL pointer.
bdb_tool_entry_close+0x25:      movl   0x2dc(%ebx),%eax
bdb_tool_entry_close+0x2b:      movl   $0x1,(%eax)
bdb_tool_entry_close+0x31:      subl   $0xc,%esp
bdb_tool_entry_close+0x34:      leal   0x1e020(%ebx),%eax
bdb_tool_entry_close+0x3a:      pushl  %eax
bdb_tool_entry_close+0x3b:      call   -0xdcac0 <PLT=libldap_r-2.4.so.2.8.3`ldap_pvt_thread_mutex_lock>
bdb_tool_entry_close+0x40:      addl   $0x10,%esp
bdb_tool_entry_close+0x43:      cmpl   $0x0,0x1dfb8(%ebx)
bdb_tool_entry_close+0x4a:      jne    +0x22 <bdb_tool_entry_close+0x6e>
bdb_tool_entry_close+0x4c:      leal   0x1e048(%ebx),%esi
bdb_tool_entry_close+0x52:      leal   0x1e020(%ebx),%edi
bdb_tool_entry_close+0x58:      subl   $0x8,%esp
bdb_tool_entry_close+0x5b:      pushl  %edi
bdb_tool_entry_close+0x5c:      pushl  %esi
bdb_tool_entry_close+0x5d:      call   -0xdc882 <PLT=libldap_r-2.4.so.2.8.3`ldap_pvt_thread_cond_wait>
bdb_tool_entry_close+0x62:      addl   $0x10,%esp
This section of disassembly seems to match up quite nicely with the next section of code:
        slapd_shutdown = 1;
#ifdef USE_TRICKLE
        ldap_pvt_thread_mutex_lock( &bdb_tool_trickle_mutex );

        /* trickle thread may not have started yet */
        while ( !bdb_tool_trickle_active )
            ldap_pvt_thread_cond_wait( &bdb_tool_trickle_cond_end,
                    &bdb_tool_trickle_mutex );
So bdb_tool_trickle_active is 0, and we are blocking on the condition variable bdb_tool_trickle_cond_end. The corresponding call to ldap_pvt_thread_cond_signal occurs in bdb_tool_trickle_task(). The trickle task function looks like this:
    ldap_pvt_thread_mutex_lock( &bdb_tool_trickle_mutex );
    bdb_tool_trickle_active = 1;
    ldap_pvt_thread_cond_signal( &bdb_tool_trickle_cond_end );
    while ( 1 ) {
        ldap_pvt_thread_cond_wait( &bdb_tool_trickle_cond,
                &bdb_tool_trickle_mutex );
        if ( slapd_shutdown )
            break;
        env->memp_trickle( env, 30, &wrote );
    }
    bdb_tool_trickle_active = 0;
    ldap_pvt_thread_cond_signal( &bdb_tool_trickle_cond_end );
    ldap_pvt_thread_mutex_unlock( &bdb_tool_trickle_mutex );
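Both cond_signal calls are made with bdb_tool_trickle_mutex held, inside a single mutex_lock/mutex_unlock pair. So could our thread have reached its mutex_lock *after* the trickle task's final wakeup, blocked there until the trickle task function set bdb_tool_trickle_active back to 0 and returned, and then gone on to a cond_wait which will never get a signal? Since bdb_tool_trickle_active is 0 both before the task starts and after it exits, the "may not have started yet" loop cannot tell the two cases apart. There seems to be a race condition here.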
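To make the suspected interleaving concrete, here is a minimal standalone sketch using plain pthreads rather than the ldap_pvt_thread wrappers. It is not the back-bdb code and not a proposed patch: the names task_state, task(), close_task() and demo_task_exits_first() are all made up for illustration, and I am assuming the closing thread is allowed to see the shutdown flag under the mutex. The one idea it shows is that a three-state flag lets the closing thread distinguish "not started yet" from "already finished", so it does not wait for a signal that was sent before it arrived.

```c
#include <pthread.h>

/* Hypothetical replacement for a 0/1 "active" flag: a plain int is 0
 * both before the task starts and after it exits; this enum is not. */
enum task_state { TASK_NOT_STARTED, TASK_RUNNING, TASK_FINISHED };

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond_work = PTHREAD_COND_INITIALIZER; /* wakes the task */
static pthread_cond_t cond_end = PTHREAD_COND_INITIALIZER;  /* wakes the closer */
static enum task_state state = TASK_NOT_STARTED;
static int shutdown_flag = 0;

/* Stand-in for the trickle task: announce start, idle until shutdown,
 * announce exit. All state changes happen with mtx held. */
static void *task(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&mtx);
    state = TASK_RUNNING;
    pthread_cond_signal(&cond_end);
    while (!shutdown_flag)
        pthread_cond_wait(&cond_work, &mtx);
    state = TASK_FINISHED;
    pthread_cond_signal(&cond_end);
    pthread_mutex_unlock(&mtx);
    return NULL;
}

/* Stand-in for the close path. Each wait re-checks a predicate that
 * stays true once the task has exited, so a signal sent earlier
 * cannot be "missed". */
static int close_task(void)
{
    pthread_mutex_lock(&mtx);
    shutdown_flag = 1;
    while (state == TASK_NOT_STARTED)   /* task may not have started yet */
        pthread_cond_wait(&cond_end, &mtx);
    pthread_cond_signal(&cond_work);    /* harmless if nobody is waiting */
    while (state != TASK_FINISHED)
        pthread_cond_wait(&cond_end, &mtx);
    pthread_mutex_unlock(&mtx);
    return 0;
}

/* Forces the suspected ordering: the task runs to completion before
 * the closer takes the mutex. Returns 1 if the closer came back. */
int demo_task_exits_first(void)
{
    pthread_t tid;

    pthread_mutex_lock(&mtx);
    shutdown_flag = 1;
    pthread_mutex_unlock(&mtx);

    if (pthread_create(&tid, NULL, task, NULL) != 0)
        return -1;
    pthread_join(tid, NULL);            /* task has fully exited here */

    close_task();                       /* does not block: state is FINISHED */
    return state == TASK_FINISHED;
}
```

With a single 0/1 flag in place of the enum, the first wait loop in close_task() would see 0 after the task had exited and block forever, which is what the stack for thread 1 looks like.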
The trickle task is a thread in the pool, submitted by bdb_tool_entry_open():
ldap_pvt_thread_pool_submit( &connection_pool, bdb_tool_trickle_task, bdb->bi_dbenv );
... except that the thread wasn't in our core dump. So slapd_shutdown will have been non-zero.
Not being that familiar with the code, I'll stop here for now to see if anyone who is more familiar with it might now have an explanation as to how slapadd could have got itself into this state.
Thanks, Mark.