Hi
Apologies - I was incorrect in item 2. We have not yet upgraded to 2.4.16 - that changecontrol is still pending. So the patch I referred to applies to 2.4.15.
Thanks, and good luck.
Oskar
On 28 Apr 2009, at 15:16, Oskar Pearson wrote:
Hi David
On 28 Apr 2009, at 12:20, David Angerer wrote:
Hi there,
Already posted this a while ago to openldap-bugs, but havn't got a response yet. As the problem still persist in my environment, I'll give it another try here.
We upgraded to openldap 2.4.16 a while ago. Since then we are experiencing relatively frequent core dumps (average uptime is around 2-3 days). Investigating the root cause we came across ITS#6034 which reminded us much of problem we are facing, but should be fixed in 2.4.16.
We've not had a crash related to this (or its #5836) since doing the following four things:
- Fixing path-mtu issues on the network with the VPN, so that when
large transactions occurred, the network wouldn't go into infinite wait, chewing up threads. Links flapping up and down, or path-mtu- discovery icmp filtering can have the same effect and be out of your control.
Upgrading to 2.4.16
setting this in slapd.conf (after applying the patch in the next
step) writetimeout 30
- Applying the patch here.
The patch is a hacked up version of the patch relating to #5836 pulled out of CVS, which removes some of the threading aspects, and "clarifies" some of the code in the process.
If I were you, I'd try to pull the #5836 patch out of CVS and see if that helps you. If not, or you need a very urgent fix to see if this helps, then consider using this patch.
I have to make numerous disclaimers about this patch... 4a) some of it may be a less-optimal way of performing the same changes that are mentioned in #5836 in trunk. I originally started hacking that code as I encountered some "wierdness" in it - where straces didn't do what I expected. In retrospect, I think some of that may have been optimisations that I didn't understand at the time. I tested my hacked-up patch, and it worked, so I've stuck to it since. I intend to re-pull the original patch out of the cvs source sometime and re-test it, but I've not had time. 4b) It works for us, but I make you absolutely no promises - I am releasing it under the same licence as openldap (no warantee, etc etc etc). 4c) it probably only reduces the problem because of some corner-case that I am completely unaware of. EG: it may mean the assert-ing code is never reached because the connections never reach the expected thread state. 4d) I've not done any C code for years, nor do I have very much experience with the openldap source. 4e) I fully expect at least three flames from Howard Chu including great detail about why this patch is horrible. :) But at least it has some useful comments. :)
<oskar-connection-patch-v1>
Hope that helps.
Oskar
Setup: Solaris 10 Openldap 2.4.16 (64bit) Berkeley DB 4.7
1 Master 5 Replicas with deltasyncrepl
Cores only happened to the master so far, replicas are running stable so far.
I couldn't find a reference to this ITS in the 2.4.16 changelog (probably listed as other ITS or fixed earlier). Also if the fix made it's way into 2.4.16, I am wondering why we still observe this cores.
If you need further information or have any suggestions please let me know. Please find some debugging information below.
Kind regards, David
(dbx) threads t@1 a l@1 ?() LWP suspended in __lwp_wait() t@2 a l@2 slapd_daemon_task() LWP suspended in __pollsys() t@3 a l@3 ldap_int_thread_pool_wrapper() sleep on 0x1004dd428 in __lwp_park() t@4 a l@4 ldap_int_thread_pool_wrapper() LWP suspended in ___lwp_cond_wait() t@5 a l@5 ldap_int_thread_pool_wrapper() LWP suspended in ___lwp_cond_wait() t@6 a l@6 ldap_int_thread_pool_wrapper() LWP suspended in ___lwp_cond_wait() t@7 a l@7 ldap_int_thread_pool_wrapper() sleep on 0x1004dd670 in __lwp_park() t@8 a l@8 ldap_int_thread_pool_wrapper() LWP suspended in ___lwp_cond_wait() t@9 a l@9 ldap_int_thread_pool_wrapper() sleep on 0x1004a9908 in __lwp_park() o t@10 a l@10 ldap_int_thread_pool_wrapper() signal SIGABRT in __lwp_kill() t@11 a l@11 ldap_int_thread_pool_wrapper() sleep on 0x1004a9908 in __lwp_park() t@12 a l@12 ldap_int_thread_pool_wrapper() LWP suspended in ___lwp_cond_wait()
t@13 a l@13 ldap_int_thread_pool_wrapper() LWP suspended in
slap_access_allowed() t@14 a l@14 ldap_int_thread_pool_wrapper() LWP suspended in ___lwp_cond_wait() t@15 a l@15 ldap_int_thread_pool_wrapper() sleep on 0x1004dd1e0 in __lwp_park() t@16 a l@16 ldap_int_thread_pool_wrapper() LWP suspended in ___lwp_cond_wait() t@17 a l@17 ldap_int_thread_pool_wrapper() sleep on 0x1004a9908 in __lwp_park() t@18 a l@18 ldap_int_thread_pool_wrapper() LWP suspended in __lock_getlocker_int() (dbx) lwp l@10 Current function is connection_state_closing 686 assert( c->c_struct_state == SLAP_C_USED ); t@10 (l@10) stopped in __lwp_kill at 0xffffffff7decea00 0xffffffff7decea00: __lwp_kill+0x0008: bcc,a,pt %icc,__lwp_kill +0x18 ! 0xffffffff7decea10 (dbx) where [1] __lwp_kill(0x0, 0x6, 0x10004b128, 0x19d480, 0x0, 0x0), at 0xffffffff7decea00 [2] raise(0x6, 0x0, 0x100445d58, 0xffffffffffffffff, 0xffffffff7dfe6000, 0x0), at 0xffffffff7de6b4f4 [3] abort(0x1, 0x1b8, 0x10004b128, 0x19d480, 0x0, 0x0), at 0xffffffff7de48c8c [4] __assert(0x100271f48, 0x100271f70, 0x2ae, 0x0, 0x19d138, 0x100069f70), at 0xffffffff7de48f30 =>[5] connection_state_closing(c = 0x1004dce30), line 686 in "connection.c" [6] send_ldap_ber(op = 0xffffffff733ff768, ber = 0xffffffff733ff028), line 148 in "result.c" [7] slap_send_search_entry(op = 0xffffffff733ff768, rs = 0xffffffff733ff2d0), line 1247 in "result.c" [8] syncprov_sendresp(op = 0xffffffff733ff768, opc = 0xffffffff733ff4a0, so = 0x113c0a930, e = 0xffffffff733ff500, mode = 3), line 848 in "syncprov.c" [9] syncprov_qplay(op = 0xffffffff733ff768, so = 0x113c0a930), line 913 in "syncprov.c" [10] syncprov_qtask(ctx = 0xffffffff733ffc20, arg = 0x113c0a930), line 971 in "syncprov.c" [11] ldap_int_thread_pool_wrapper(xpool = 0x10046a250), line 663 in "tpool.c" (dbx) list 680,690 680 int connection_state_closing( Connection *c ) 681 { 682 /* c_mutex must be locked by caller */ 683 684 int state; 685 assert( c != NULL ); 686 assert( c->c_struct_state == SLAP_C_USED ); 687 688 state = c->c_conn_state; 689 690 assert( state != SLAP_C_INVALID ); (dbx) print c c = 0x1004dce30 (dbx) print *c *c = { c_struct_state = 1 c_conn_state = 0 c_conn_idx = 26 c_sd = -1 c_close_reason = 0x100271eb0 "?" c_mutex = { __pthread_mutex_flags = { __pthread_mutex_flag1 = 4U __pthread_mutex_flag2 = '\0' __pthread_mutex_ceiling = '\0' __pthread_mutex_type = 0 __pthread_mutex_magic = 19800U } __pthread_mutex_lock = { __pthread_mutex_lock64 = { __pthread_mutex_pad = "" } __pthread_mutex_lock32 = { __pthread_ownerpid = 0 __pthread_lockword = 0 } __pthread_mutex_owner64 = 0 } __pthread_mutex_data = 0 } c_sb = 0x118772d90 c_starttime = 0 c_activitytime = 0 c_connid = 18446744073709551615U c_peer_domain = { bv_len = 0 bv_val = (nil) } c_peer_name = { bv_len = 0 bv_val = (nil) } c_listener = (nil) c_sasl_bind_mech = { bv_len = 0 bv_val = (nil) } c_sasl_dn = { bv_len = 0 bv_val = (nil) } c_sasl_authz_dn = { bv_len = 0 bv_val = (nil) } c_authz_backend = (nil) c_authz_cookie = (nil) c_authz = { sai_method = 128U sai_mech = { bv_len = 0 bv_val = (nil) } sai_dn = { bv_len = 0 bv_val = (nil) } sai_ndn = { bv_len = 0 bv_val = (nil) } sai_ssf = 0 sai_transport_ssf = 0 sai_tls_ssf = 0 sai_sasl_ssf = 0 } c_protocol = 0 c_ops = { stqh_first = (nil) stqh_last = 0x1004dcf38 } c_pending_ops = { stqh_first = (nil) stqh_last = 0x1004dcf48 } c_write1_mutex = { __pthread_mutex_flags = { __pthread_mutex_flag1 = 4U __pthread_mutex_flag2 = '\0' __pthread_mutex_ceiling = '\0' __pthread_mutex_type = 0 __pthread_mutex_magic = 19800U } __pthread_mutex_lock = { __pthread_mutex_lock64 = { __pthread_mutex_pad = "" } __pthread_mutex_lock32 = { __pthread_ownerpid = 0 __pthread_lockword = 4278190080U } __pthread_mutex_owner64 = 4278190080U } __pthread_mutex_data = 18446744071451976704U } c_write1_cv = { __pthread_cond_flags = { __pthread_cond_flag = "" __pthread_cond_type = 0 __pthread_cond_magic = 17238U } __pthread_cond_data = 0 } c_write2_mutex = { __pthread_mutex_flags = { __pthread_mutex_flag1 = 4U __pthread_mutex_flag2 = '\0' __pthread_mutex_ceiling = '\0' __pthread_mutex_type = 0 __pthread_mutex_magic = 19800U } __pthread_mutex_lock = { __pthread_mutex_lock64 = { __pthread_mutex_pad = "" } __pthread_mutex_lock32 = { __pthread_ownerpid = 0 __pthread_lockword = 0 } __pthread_mutex_owner64 = 0 } __pthread_mutex_data = 0 } c_write2_cv = { __pthread_cond_flags = { __pthread_cond_flag = "" __pthread_cond_type = 0 __pthread_cond_magic = 17238U } __pthread_cond_data = 0 } c_currentber = (nil) c_writers = 0 c_sasl_bind_in_progress = '\0' c_writewaiter = '\0' c_sasl_layers = '\0' c_sasl_done = '\0' c_sasl_authctx = (nil) c_sasl_sockctx = (nil) c_sasl_extra = (nil) c_sasl_bindop = (nil) c_pagedresults_state = { ps_be = (nil) ps_size = 0 ps_count = 0 ps_cookie = 0 ps_cookieval = { bv_len = 0 bv_val = (nil) } } c_n_ops_received = 3 c_n_ops_executing = 0 c_n_ops_pending = 0 c_n_ops_completed = 3 c_n_get = 4 c_n_read = 3 c_n_write = 1 c_extensions = (nil) c_clientfunc = (nil) c_clientarg = (nil) c_send_ldap_result = 0x10006b900 = &slap_send_ldap_result(Operation *op, SlapReply *rs) c_send_search_entry = 0x10006cda0 = &slap_send_search_entry(Operation *op, SlapReply *rs) c_send_search_reference = 0x10006fa80 = &slap_send_search_reference(Operation *op, SlapReply *rs) c_send_ldap_extended = 0x10006c5f8 = &slap_send_ldap_extended(Operation *op, SlapReply *rs) c_send_ldap_intermediate = 0x10006ca28 = &slap_send_ldap_intermediate(Operation *op, SlapReply *rs) }