Hello,
I've seen a couple of instances where slapd becomes unresponsive,
apparently because the threads are waiting on a back-meta database. We're
running slapd 2.4.23 on Solaris 10 (update 11/06). We have 128 threads
configured, and when I attach with truss I see 130 allocated, most of which
look like this:
/53: lwp_park(0x00000000, 0) (sleeping...)
When I run pstack against the PID, the stack for almost all of the threads
looks like one of the two threads below, both of which are somewhere within
meta_back_search:
----------------- lwp# 20 / thread# 20 --------------------
fee40408 lwp_park (0, 0, 0)
00140e64 ldap_build_search_req (110f0c0, 599eb90, 2, 79db010, 79db0a0, 0) + 2c
001412bc ldap_pvt_search (110f0c0, 599eb90, 2, 79db010, 79db0a0, 0) + d4
000cd494 ???????? (3f45d70, f4fffd58, f4fff478, f4fff294, 0, 2b822c0)
000cd7fc meta_back_search (3f45d70, f4fffd58, 2, 2, 24d400, 0) + 1e0
000a0928 ???????? (3f45d70, f4fffd58, f4fff8e0, 28ff20, 2ac1b8, 14)
000a12dc ???????? (f4fff8e0, f4fffd58, 2, 2, 24d400, 28ff20)
000a3950 overlay_op_walk (8000, f4fffd58, 8000, 28fe18, 28ff20, 818) + 4c
000a3ae4 ???????? (3f45d70, f4fffd58, 2, 1e6000, a3ba0, 28fe18)
00041e08 fe_op_search (3f45d70, f4fffd58, 3f45e70, f4fffad8, 1ee5f8, 1ee6f0) + 3f8
00041528 do_search (3f45d70, f4fffd58, fee6cbc0, 1e6000, 16ec00, f4fffad8) + 590
0003f8e4 ???????? (f4fffe08, 3f45d70, fee6cbc0, fe2d4400, 2683c8, 0)
0013ca30 ???????? (2683b8, f5000000, 0, 0, 13c8d4, 1)
fee40368 _lwp_start (0, 0, 0, 0, 0, 0)
----------------- lwp# 21 / thread# 21 --------------------
fee40408 lwp_park (0, 0, 0)
0013e5c8 ldap_result (10ea3b0, 43e, 2, f47ff488, f47ff290, 0) + 3c
000cddf0 meta_back_search (53fe5b8, f47ffd58, 2, 2, 0, 1) + 7d4
000a0928 ???????? (53fe5b8, f47ffd58, f47ff8e0, 28ff20, 2ac1b8, 14)
000a12dc ???????? (f47ff8e0, f47ffd58, 2, 2, 24d400, 28ff20)
000a3950 overlay_op_walk (8000, f47ffd58, 8000, 28fe18, 28ff20, 818) + 4c
000a3ae4 ???????? (53fe5b8, f47ffd58, 2, 1e6000, a3ba0, 28fe18)
00041e08 fe_op_search (53fe5b8, f47ffd58, 53fe6b8, f47ffad8, 1ee5f8, 1ee6f0) + 3f8
00041528 do_search (53fe5b8, f47ffd58, fee6cbc0, 1e6000, 16ec00, f47ffad8) + 590
0003f8e4 ???????? (f47ffe08, 53fe5b8, fee6cbc0, fe2d4800, 2683c8, 0)
0013ca30 ???????? (2683b8, f4800000, 0, 0, 13c8d4, 1)
fee40368 _lwp_start (0, 0, 0, 0, 0, 0)
I took a core dump while the process was running, but I'm not really sure
how to proceed from here. Is there any way to get more information on what
was happening with these threads at the time? Either way, is this a
situation that should be handled with a general timeout directive? We
currently have only a network-timeout and a bind timeout specified:
network-timeout 3
timeout bind=3
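If a general timeout is the right approach, I assume it would be the per-operation form of the same back-meta `timeout` directive, extending our existing line with a search value, roughly as below. The 10-second search value is an arbitrary placeholder I picked for illustration, not something we've tested, and I'm not certain a search timeout covers the ldap_build_search_req case in the first stack.

```
# back-meta database section (sketch, untested)
network-timeout 3
# search=10 is an arbitrary placeholder value, not a tested setting
timeout bind=3 search=10
```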
Thanks,
Lincoln