Full_Name: Mark Bannister Version: 2.4.30 OS: Solaris 11.3 URL: ftp://ftp.openldap.org/incoming/ Submission from: (NULL) (205.228.82.172)
Summary of this thread: http://www.openldap.org/lists/openldap-technical/201605/msg00079.html
slapadd process on Solaris 11 was hanging for many days (slapd 2.4.30) with the following stack trace:
# pstack /var/tmp/cores/core.slapd.0.16189.1463492837 core '/var/tmp/cores/core.slapd.0.16189.1463492837' of 16189: /usr/sbin/slapadd -q -f /etc/openldap/slapd.conf ------------ lwp# 1 / thread# 1 --------------- 07ff7af9 lwp_park (0, 0, 0) 07ff1193 cond_wait_queue (82ed048, 82ed020, 0, 7ff16b0) + 63 07ff1728 __cond_wait (82ed048, 82ed020, f5a685c8, 7ff1771) + 89 07ff177f cond_wait (82ed048, 82ed020, f5a685f8, 7ff17b7) + 27 07ff17cc pthread_cond_wait (82ed048, 82ed020, f5a68618, 7e56137) + 24 07e5614c ldap_pvt_thread_cond_wait (82ed048, 82ed020, f5a68648, 81b7e29) + 24 081b7e7a bdb_tool_entry_close (8e9d398, ffffffff, f5a68698, 82ecc38) + 62 081a38ec slapadd (4, f5a68bbc, f5a68bbc, 80dccb1) + cb8 080dcd2c main (4, f5a68bbc, f5a68bd0, ef60f968) + ac 080dc86d _start (4, f5a68c8e, f5a68ca0, f5a68ca3, f5a68ca6, 0) + 7d ------------ lwp# 2 / thread# 2 --------------- 07ff7af9 lwp_park (0, 0, 0) 07ff1193 cond_wait_queue (8e6b0e4, 8e6b0cc, 0, 7ff16b0) + 63 07ff1728 __cond_wait (8e6b0e4, 8e6b0cc, d7dffdb8, 7ff1771) + 89 07ff177f cond_wait (8e6b0e4, 8e6b0cc, 8e6b0c8, 7ff17b7) + 27 07ff17cc pthread_cond_wait (8e6b0e4, 8e6b0cc, 0, 7e56137) + 24 07e5614c ldap_pvt_thread_cond_wait (8e6b0e4, 8e6b0cc, d7dfffc8, 7e54bfc) + 24 07e54d66 ldap_int_thread_pool_wrapper (8e6b0c8, 803a000, d7dfffe8, 7ff7769) + 242 07ff77bc _thrp_setup (ee1e0240) + 9d 07ff7aa0 _lwp_start (ee1e0240, 0, 0, 0, 0, 0)
Had to examine disassembly output as Oracle stripped the debug symbols:
::walk thread | ::findstack -v
stack pointer for thread 1: f5a68538 [ f5a68538 libc_hwcap1.so.1`__lwp_park+0x19() ] f5a68568 libc_hwcap1.so.1`cond_wait_queue+0x63(82ed048, 82ed020, 0, 7ff16b0) f5a685a8 libc_hwcap1.so.1`__cond_wait+0x89(82ed048, 82ed020, f5a685c8, 7ff1771) f5a685c8 libc_hwcap1.so.1`cond_wait+0x27(82ed048, 82ed020, f5a685f8, 7ff17b7) f5a685e8 libc_hwcap1.so.1`pthread_cond_wait+0x24(82ed048, 82ed020, f5a68618, 7e56137) f5a68608 libldap_r-2.4.so.2.8.3`ldap_pvt_thread_cond_wait+0x24(82ed048, 82ed020, f5a68648, 81b7e29) f5a686 b bdb_tool_entry_close+0x62(8e9d398, ffffffff, f5a68698, 82ecc38) f5a68ae8 slapadd+0xcb8(4, f5a68bbc, f5a68bbc, 80dccb1) f5a68b88 main+0xac(4, f5a68bbc, f5a68bd0, ef60f968) f5a68bb0 _start+0x7d(4, f5a68c8e, f5a68ca0, f5a68ca3, f5a68ca6, 0)
The offending ldap_pvt_thread_cond_wait(), which will never be unblocked because no other thread will do it, is in bdb_tool_entry_close+0x62. Checking the disassembly:
bdb_tool_entry_close::dis
bdb_tool_entry_close: pushl %ebp bdb_tool_entry_cle%e+1: movl %esp,%ebp bdb_tool_entry_close+3: pushl %ebx bdb_tool_entry_close+4: pushl %esi bdb_tool_entry_close+5: pushl %edi bdb_tool_entry_close+6: subl $0x1c,%p%p bdb_tool_entry_close+9: andl $0xfffffff0,%esp bdb_tool_entry_close+0xc: call +0x0 <bdb_tool_entry_close+0x11> bdb_tool_entry_close+0x11: popl %ebx bdb_tool_entry_close+0x12: addl $0x1171d7,%ebx bdb_tool_entry_close+0x18: cmpl $0x0,0x1dfa4(%ebx) bdb_tool_entry_close+0x1f: je +0x17e <bdb_tool_entry_close+0x1a3>
... we were at 0x62, so we didn't jump to 0x17e, checking source code:
int bdb_tool_entry_close( BackendDB *be ) { if ( bdb_tool_info ) {
... so bdb_tool_info was not a NULL pointer.
bdb_tool_entry_close+0x25: movl 0x2dc(%ebx),%eax bdb_tool_entry_close+0x2b: movl $0x1,(%eax) bdb_tool_entry_close+0x31: subl $0xc,%esp bdb_tool_entry_close+0x34: leal 0x1e020(%ebx),%ea0D0D bdb_tool_entry_close+0x3a: pushl %eax bdb_tool_entry_close+0x3b: call -0xdcac0 <PLT=libldap_r-2.4.so.2.8.3`ldap_pvt_thread_mutex_lock> bdb_tool_entry_close+0x40: addl $0x10,%esp bdb_tool_entry_close+0x43: cmpl $0x0,0x1dfb8(%ebx) bdb_tool_entry_close+0x4a: jne +0x22 <bdb_tool_entry_close+0x6e> bdb_tool_entry_close+0x4c: leal 0x1e048(%ebx),%esi bdb_tool_entry_close+0x52: leal 0x1e020(%ebx),%edi bdb_tool_entry_close+0x58: subl $0x8,%esp bdb_tool_entry_close+0x5b: pushl %edi bdb_tool_entry_close+0x5c: pushl %esi bdb_tool_entry_close+0x5d: call -0xdc882 <PLT=libldap_r-2.4.so.2.8.3`ldap_pvt_thread_cond_wait> bdb_tool_entry_close+0x62: addl $0x10,%esp
This section of disassembly seems to match up quite nicely with the next section of code:
slapd_shutdown = 1; #ifdef USE_TRICKLE ldap_pvt_thread_mutex_lock( &bdb_tool_trickle_mutex );
/* trickle thread may not have started yet */ while ( !bdb_tool_trickle_active ) ldap_pvt_thread_cond_wait( &bdb_tool_trickle_cond_end, &bdb_tool_trickle_mutex );
So bdb_tool_trickle_actives 0%0, and we are blocking on the condition variable bdb_tool_trickle_cond_end. The corresponding call to ldap_pvt_thread_cond_signal occurs in bdb_tool_trickle_task(). The trickle task function looks like this:
ldap_pvt_thread_mutex_lock( &bdb_tool_trickle_mutex ); bdb_tool_trickle_active = 1; ldap_pvt_thread_cond_signal( &bdb_tool_trickle_cond_end ); while ( 1 ) { ldap_pvt_thread_cond_wait( &bdb_tool_trickle_cond, &bdb_tool_trickle_mutex ); if ( slapd_shutdown ) break; env->memp_trickle( env, 30, &wrote ); } bdb_tool_trickle_active = 0; ldap_pvt_thread_cond_signal( &bdb_tool_trickle_cond_end %%3; ldap_pvt_thread_mutex_unlock( &bdb_tool_trickle_mutex );
The cond_signal calls are all contained within one big mutex_lock. So could our thread have reached its mutex_lock *after* the very last invocation of the trickle task function, then blocked on its mutex_lock until the trickle task function returned, allowing our thread to reach a cond_wait which will never get a signal? There seems to be a race condition here.
The trickle task is a thread in the pool, submitted by bdb_tool_entry_open():