--Apple-Mail-3-627969786 Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes Content-Transfer-Encoding: 7bit
Hi David
On 28 Apr 2009, at 12:20, David Angerer wrote:
Hi there,
Already posted this a while ago to openldap-bugs, but havn't got a response yet. As the problem still persist in my environment, I'll give it another try here.
We upgraded to openldap 2.4.16 a while ago. Since then we are experiencing relatively frequent core dumps (average uptime is around 2-3 days). Investigating the root cause we came across ITS#6034 which reminded us much of problem we are facing, but should be fixed in 2.4.16.
We've not had a crash related to this (or its #5836) since doing the following four things:
1) Fixing path-mtu issues on the network with the VPN, so that when large transactions occurred, the network wouldn't go into infinite wait, chewing up threads. Links flapping up and down, or path-mtu- discovery icmp filtering can have the same effect and be out of your control.
2) Upgrading to 2.4.16
3) setting this in slapd.conf (after applying the patch in the next step) writetimeout 30
4) Applying the patch here. The patch is a hacked up version of the patch relating to #5836 pulled out of CVS, which removes some of the threading aspects, and "clarifies" some of the code in the process.
If I were you, I'd try to pull the #5836 patch out of CVS and see if that helps you. If not, or you need a very urgent fix to see if this helps, then consider using this patch.
I have to make numerous disclaimers about this patch... 4a) some of it may be a less-optimal way of performing the same changes that are mentioned in #5836 in trunk. I originally started hacking that code as I encountered some "wierdness" in it - where straces didn't do what I expected. In retrospect, I think some of that may have been optimisations that I didn't understand at the time. I tested my hacked-up patch, and it worked, so I've stuck to it since. I intend to re-pull the original patch out of the cvs source sometime and re-test it, but I've not had time. 4b) It works for us, but I make you absolutely no promises - I am releasing it under the same licence as openldap (no warantee, etc etc etc). 4c) it probably only reduces the problem because of some corner-case that I am completely unaware of. EG: it may mean the assert-ing code is never reached because the connections never reach the expected thread state. 4d) I've not done any C code for years, nor do I have very much experience with the openldap source. 4e) I fully expect at least three flames from Howard Chu including great detail about why this patch is horrible. :) But at least it has some useful comments. :)
--Apple-Mail-3-627969786 Content-Disposition: attachment; filename=oskar-connection-patch-v1 Content-Type: application/octet-stream; x-unix-mode=0644; name="oskar-connection-patch-v1" Content-Transfer-Encoding: 7bit
diff -urN openldap-2.4.15-plain-source/doc/man/man5/slapd-config.5 openldap-2.4.15-oskar-connection-patch/doc/man/man5/slapd-config.5 --- openldap-2.4.15-plain-source/doc/man/man5/slapd-config.5 2009-01-22 00:00:50.000000000 +0000 +++ openldap-2.4.15-oskar-connection-patch/doc/man/man5/slapd-config.5 2009-03-11 14:15:59.000000000 +0000 @@ -451,7 +451,9 @@ .B olcIdleTimeout: <integer> Specify the number of seconds to wait before forcibly closing an idle client connection. A setting of 0 disables this -feature. The default is 0. +feature. The default is 0. You may also want to set the +.B olcWriteTimeout +option. .TP .B olcIndexIntLen: <integer> Specify the key length for ordered integer indices. The most significant @@ -750,11 +752,12 @@ Specify the maximum number of threads to use in tool mode. This should not be greater than the number of CPUs in the system. The default is 1. -."ucdata-path is obsolete / ignored... -.".TP -.".B ucdata-path <path> -."Specify the path to the directory containing the Unicode character -."tables. The default path is DATADIR/ucdata. +.TP +.B olcWriteTimeout: <integer> +Specify the number of seconds to wait before forcibly closing +a connection with an outstanding write. This allows recovery from +various network hang conditions. A setting of 0 disables this +feature. The default is 0. .SH TLS OPTIONS If .B slapd diff -urN openldap-2.4.15-plain-source/servers/slapd/bconfig.c openldap-2.4.15-oskar-connection-patch/servers/slapd/bconfig.c --- openldap-2.4.15-plain-source/servers/slapd/bconfig.c 2009-02-13 03:16:59.000000000 +0000 +++ openldap-2.4.15-oskar-connection-patch/servers/slapd/bconfig.c 2009-03-11 14:16:14.000000000 +0000 @@ -706,6 +706,9 @@ &config_updateref, "( OLcfgDbAt:0.13 NAME 'olcUpdateRef' " "EQUALITY caseIgnoreMatch " "SUP labeledURI )", NULL, NULL }, + { "writetimeout", "timeout", 2, 2, 0, ARG_INT, + &global_writetimeout, "( OLcfgGlAt:88 NAME 'olcWriteTimeout' " + "SYNTAX OMsInteger SINGLE-VALUE )", NULL, NULL }, { NULL, NULL, 0, 0, 0, ARG_IGNORED, NULL, NULL, NULL, NULL } }; @@ -764,7 +767,7 @@ "olcTLSCACertificatePath $ olcTLSCertificateFile $ " "olcTLSCertificateKeyFile $ olcTLSCipherSuite $ olcTLSCRLCheck $ " "olcTLSRandFile $ olcTLSVerifyClient $ olcTLSDHParamFile $ " - "olcTLSCRLFile $ olcToolThreads $ " + "olcTLSCRLFile $ olcToolThreads $ olcWriteTimeout $ " "olcObjectIdentifier $ olcAttributeTypes $ olcObjectClasses $ " "olcDitContentRules $ olcLdapSyntaxes ) )", Cft_Global }, { "( OLcfgGlOc:2 " diff -urN openldap-2.4.15-plain-source/servers/slapd/config.c openldap-2.4.15-oskar-connection-patch/servers/slapd/config.c --- openldap-2.4.15-plain-source/servers/slapd/config.c 2009-02-13 03:16:59.000000000 +0000 +++ openldap-2.4.15-oskar-connection-patch/servers/slapd/config.c 2009-03-11 14:16:54.000000000 +0000 @@ -62,6 +62,7 @@ slap_mask_t global_disallows = 0; int global_gentlehup = 0; int global_idletimeout = 0; +int global_writetimeout = 0; char *global_host = NULL; char *global_realm = NULL; char *sasl_host = NULL; diff -urN openldap-2.4.15-plain-source/servers/slapd/connection.c openldap-2.4.15-oskar-connection-patch/servers/slapd/connection.c --- openldap-2.4.15-plain-source/servers/slapd/connection.c 2009-01-30 18:51:16.000000000 +0000 +++ openldap-2.4.15-oskar-connection-patch/servers/slapd/connection.c 2009-03-11 21:44:57.000000000 +0000 @@ -244,6 +244,42 @@ return i; }
+/* + * Timeout connections that are having trouble writing + */ +int connections_timeout_write(time_t now) +{ + int i = 0; + int connindex; + Connection* c; + + for( c = connection_first( &connindex ); + c != NULL; + c = connection_next( c, &connindex ) ) + { + /* Don't timeout an outbound connection */ + if( c->c_conn_state == SLAP_C_CLIENT ) { + continue; + } + + if (! c->c_writewaiter ) { + continue; + } + + if( difftime( c->c_activitytime+global_writetimeout, now) < 0 ) { + /* close it */ + Debug( LDAP_DEBUG_CONNS, + "connection_closing: due to write timeout", NULL, 0, 0 ); + connection_closing( c, "writetimeout" ); + connection_close( c ); + i++; + } + } + connection_done( c ); + + return i; +} + static Connection* connection_get( ber_socket_t s ) { Connection *c; diff -urN openldap-2.4.15-plain-source/servers/slapd/daemon.c openldap-2.4.15-oskar-connection-patch/servers/slapd/daemon.c --- openldap-2.4.15-plain-source/servers/slapd/daemon.c 2009-02-13 02:35:39.000000000 +0000 +++ openldap-2.4.15-oskar-connection-patch/servers/slapd/daemon.c 2009-03-11 23:11:00.000000000 +0000 @@ -2033,25 +2033,23 @@ { int l; time_t last_idle_check = 0; - struct timeval idle; + time_t last_write_timeout_check = 0; int ebadf = 0;
#define SLAPD_IDLE_CHECK_LIMIT 4 +#define SLAPD_WRITE_CHECK_LIMIT 8
- if ( global_idletimeout > 0 ) { - last_idle_check = slap_get_time(); - /* Set the select timeout. - * Don't just truncate, preserve the fractions of - * seconds to prevent sleeping for zero time. - */ - idle.tv_sec = global_idletimeout / SLAPD_IDLE_CHECK_LIMIT; - idle.tv_usec = global_idletimeout - \ - ( idle.tv_sec * SLAPD_IDLE_CHECK_LIMIT ); - idle.tv_usec *= 1000000 / SLAPD_IDLE_CHECK_LIMIT; - } else { - idle.tv_sec = 0; - idle.tv_usec = 0; - } + /* This is how long we are going to sleep for, before we next need to check for idle or + * timed-out connections */ + struct timeval next_sleep; + + /* + * If the write timeout or idle timeout are both unset, there is no reason for the + * select/epoll loop to timeout. We thus default the sleep time to zero, + * which never times out + */ + next_sleep.tv_sec = 0; + next_sleep.tv_usec = 0;
slapd_add( wake_sds[0], 0, NULL );
@@ -2143,6 +2141,8 @@ #define SLAPD_EBADF_LIMIT 16
time_t now; + time_t tv_idlenextcheck = 0; + time_t tv_writenextcheck = 0;
SLAP_EVENT_DECL;
@@ -2155,14 +2155,89 @@
now = slap_get_time();
- if ( ( global_idletimeout > 0 ) && - difftime( last_idle_check + - global_idletimeout/SLAPD_IDLE_CHECK_LIMIT, now ) < 0 ) + /* + * Decide whether it's time to call the idle timeout or write timeout code, + * based on the last time we did such a check, plus whether there are any + * filedescriptors which have pending writes + */ + if ( ( global_idletimeout > 0 ) && + difftime( (last_idle_check + + global_idletimeout/SLAPD_IDLE_CHECK_LIMIT), now ) < 0 ) + { + connections_timeout_idle( now ); + last_idle_check = now; + } + + if ( (global_writetimeout > 0) && + difftime( (last_write_timeout_check + + global_writetimeout/SLAPD_WRITE_CHECK_LIMIT), now) < 0 ) { - connections_timeout_idle( now ); - last_idle_check = now; + connections_timeout_write( now ); + last_write_timeout_check = now; + } + + /* + * Now that we've timed out any pending sessions, we need to ascertain when the + * next time we return to this code is. + * + * We calculate this value based on the last run that we did of each type. We then + * calculate the difference in seconds between now and the next scheduled time, and then + * sleep for that amount of time the sleep value into the SLAP_EVENT_WAIT sleep value + * + * We must return based on the smaller value of either the idle-timeout handler's value, or the + * write-timeout value, excluding zero values. + * + */ + if ( global_idletimeout > 0 ) + { + tv_idlenextcheck = last_idle_check + + (global_idletimeout / SLAPD_WRITE_CHECK_LIMIT); + + /* If we missed our last deadline for the idle check, re-schedule it to + * some time in the future + */ + if (tv_idlenextcheck < now) + tv_idlenextcheck = now + + (global_idletimeout / SLAPD_WRITE_CHECK_LIMIT); + + assert (tv_idlenextcheck > 0); + assert (tv_idlenextcheck >= now); + } + if ( global_writetimeout > 0 ) + { + tv_writenextcheck = last_write_timeout_check + + (global_writetimeout / SLAPD_WRITE_CHECK_LIMIT); + + /* If we missed our last deadline for the write check, re-schedule it to + * some time in the future + */ + if (tv_writenextcheck < now) + tv_writenextcheck = now + + (global_writetimeout / SLAPD_WRITE_CHECK_LIMIT); + assert (tv_writenextcheck > 0); + assert (tv_writenextcheck >= now); + } + + if (tv_idlenextcheck && tv_writenextcheck) { + /* Figure out the smaller time, and sleep for that long */ + if (tv_idlenextcheck < tv_writenextcheck) { + tv.tv_sec = (tv_idlenextcheck - now) + 1; + tv.tv_usec = 0; + } else { + tv.tv_sec = (tv_writenextcheck - now) + 1; + tv.tv_usec = 0; + } + } else if (tv_idlenextcheck) { + tv.tv_sec = (tv_idlenextcheck - now) + 1; + tv.tv_usec = 0; + } else if (tv_writenextcheck) { + tv.tv_sec = (tv_writenextcheck - now) + 1; + tv.tv_usec = 0; + } else { + /* Sleep forever */ + tv.tv_sec = 0; + tv.tv_usec = 0; } - tv = idle;
#ifdef SIGHUP if ( slapd_gentle_shutdown ) { @@ -2211,7 +2286,7 @@
nfds = SLAP_EVENT_MAX;
- if ( global_idletimeout && slap_daemon.sd_nactives ) at = 1; + if ( (global_idletimeout || global_writetimeout) && slap_daemon.sd_nactives ) at = 1;
ldap_pvt_thread_mutex_unlock( &slap_daemon.sd_mutex );
diff -urN openldap-2.4.15-plain-source/servers/slapd/proto-slap.h openldap-2.4.15-oskar-connection-patch/servers/slapd/proto-slap.h --- openldap-2.4.15-plain-source/servers/slapd/proto-slap.h 2009-02-17 19:14:41.000000000 +0000 +++ openldap-2.4.15-oskar-connection-patch/servers/slapd/proto-slap.h 2009-03-11 23:11:38.000000000 +0000 @@ -728,6 +728,7 @@ LDAP_SLAPD_F (int) connections_shutdown LDAP_P((void)); LDAP_SLAPD_F (int) connections_destroy LDAP_P((void)); LDAP_SLAPD_F (int) connections_timeout_idle LDAP_P((time_t)); +LDAP_SLAPD_F (int) connections_timeout_write LDAP_P((time_t));
LDAP_SLAPD_F (Connection *) connection_client_setup LDAP_P(( ber_socket_t s, @@ -1930,6 +1931,7 @@
LDAP_SLAPD_V (int) global_gentlehup; LDAP_SLAPD_V (int) global_idletimeout; +LDAP_SLAPD_V (int) global_writetimeout; LDAP_SLAPD_V (char *) global_host; LDAP_SLAPD_V (char *) global_realm; LDAP_SLAPD_V (char *) sasl_host;
--Apple-Mail-3-627969786 Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes Content-Transfer-Encoding: 7bit
Hope that helps.
Oskar
Setup: Solaris 10 Openldap 2.4.16 (64bit) Berkeley DB 4.7
1 Master 5 Replicas with deltasyncrepl
Cores only happened to the master so far, replicas are running stable so far.
I couldn't find a reference to this ITS in the 2.4.16 changelog (probably listed as other ITS or fixed earlier). Also if the fix made it's way into 2.4.16, I am wondering why we still observe this cores.
If you need further information or have any suggestions please let me know. Please find some debugging information below.
Kind regards, David
(dbx) threads t@1 a l@1 ?() LWP suspended in __lwp_wait() t@2 a l@2 slapd_daemon_task() LWP suspended in __pollsys() t@3 a l@3 ldap_int_thread_pool_wrapper() sleep on 0x1004dd428 in __lwp_park() t@4 a l@4 ldap_int_thread_pool_wrapper() LWP suspended in ___lwp_cond_wait() t@5 a l@5 ldap_int_thread_pool_wrapper() LWP suspended in ___lwp_cond_wait() t@6 a l@6 ldap_int_thread_pool_wrapper() LWP suspended in ___lwp_cond_wait() t@7 a l@7 ldap_int_thread_pool_wrapper() sleep on 0x1004dd670 in __lwp_park() t@8 a l@8 ldap_int_thread_pool_wrapper() LWP suspended in ___lwp_cond_wait() t@9 a l@9 ldap_int_thread_pool_wrapper() sleep on 0x1004a9908 in __lwp_park() o t@10 a l@10 ldap_int_thread_pool_wrapper() signal SIGABRT in __lwp_kill() t@11 a l@11 ldap_int_thread_pool_wrapper() sleep on 0x1004a9908 in __lwp_park() t@12 a l@12 ldap_int_thread_pool_wrapper() LWP suspended in ___lwp_cond_wait()
t@13 a l@13 ldap_int_thread_pool_wrapper() LWP suspended in
slap_access_allowed() t@14 a l@14 ldap_int_thread_pool_wrapper() LWP suspended in ___lwp_cond_wait() t@15 a l@15 ldap_int_thread_pool_wrapper() sleep on 0x1004dd1e0 in __lwp_park() t@16 a l@16 ldap_int_thread_pool_wrapper() LWP suspended in ___lwp_cond_wait() t@17 a l@17 ldap_int_thread_pool_wrapper() sleep on 0x1004a9908 in __lwp_park() t@18 a l@18 ldap_int_thread_pool_wrapper() LWP suspended in __lock_getlocker_int() (dbx) lwp l@10 Current function is connection_state_closing 686 assert( c->c_struct_state == SLAP_C_USED ); t@10 (l@10) stopped in __lwp_kill at 0xffffffff7decea00 0xffffffff7decea00: __lwp_kill+0x0008: bcc,a,pt %icc,__lwp_kill+0x18 ! 0xffffffff7decea10 (dbx) where [1] __lwp_kill(0x0, 0x6, 0x10004b128, 0x19d480, 0x0, 0x0), at 0xffffffff7decea00 [2] raise(0x6, 0x0, 0x100445d58, 0xffffffffffffffff, 0xffffffff7dfe6000, 0x0), at 0xffffffff7de6b4f4 [3] abort(0x1, 0x1b8, 0x10004b128, 0x19d480, 0x0, 0x0), at 0xffffffff7de48c8c [4] __assert(0x100271f48, 0x100271f70, 0x2ae, 0x0, 0x19d138, 0x100069f70), at 0xffffffff7de48f30 =>[5] connection_state_closing(c = 0x1004dce30), line 686 in "connection.c" [6] send_ldap_ber(op = 0xffffffff733ff768, ber = 0xffffffff733ff028), line 148 in "result.c" [7] slap_send_search_entry(op = 0xffffffff733ff768, rs = 0xffffffff733ff2d0), line 1247 in "result.c" [8] syncprov_sendresp(op = 0xffffffff733ff768, opc = 0xffffffff733ff4a0, so = 0x113c0a930, e = 0xffffffff733ff500, mode = 3), line 848 in "syncprov.c" [9] syncprov_qplay(op = 0xffffffff733ff768, so = 0x113c0a930), line 913 in "syncprov.c" [10] syncprov_qtask(ctx = 0xffffffff733ffc20, arg = 0x113c0a930), line 971 in "syncprov.c" [11] ldap_int_thread_pool_wrapper(xpool = 0x10046a250), line 663 in "tpool.c" (dbx) list 680,690 680 int connection_state_closing( Connection *c ) 681 { 682 /* c_mutex must be locked by caller */ 683 684 int state; 685 assert( c != NULL ); 686 assert( c->c_struct_state == SLAP_C_USED ); 687 688 state = c->c_conn_state; 689 690 assert( state != SLAP_C_INVALID ); (dbx) print c c = 0x1004dce30 (dbx) print *c *c = { c_struct_state = 1 c_conn_state = 0 c_conn_idx = 26 c_sd = -1 c_close_reason = 0x100271eb0 "?" c_mutex = { __pthread_mutex_flags = { __pthread_mutex_flag1 = 4U __pthread_mutex_flag2 = '\0' __pthread_mutex_ceiling = '\0' __pthread_mutex_type = 0 __pthread_mutex_magic = 19800U } __pthread_mutex_lock = { __pthread_mutex_lock64 = { __pthread_mutex_pad = "" } __pthread_mutex_lock32 = { __pthread_ownerpid = 0 __pthread_lockword = 0 } __pthread_mutex_owner64 = 0 } __pthread_mutex_data = 0 } c_sb = 0x118772d90 c_starttime = 0 c_activitytime = 0 c_connid = 18446744073709551615U c_peer_domain = { bv_len = 0 bv_val = (nil) } c_peer_name = { bv_len = 0 bv_val = (nil) } c_listener = (nil) c_sasl_bind_mech = { bv_len = 0 bv_val = (nil) } c_sasl_dn = { bv_len = 0 bv_val = (nil) } c_sasl_authz_dn = { bv_len = 0 bv_val = (nil) } c_authz_backend = (nil) c_authz_cookie = (nil) c_authz = { sai_method = 128U sai_mech = { bv_len = 0 bv_val = (nil) } sai_dn = { bv_len = 0 bv_val = (nil) } sai_ndn = { bv_len = 0 bv_val = (nil) } sai_ssf = 0 sai_transport_ssf = 0 sai_tls_ssf = 0 sai_sasl_ssf = 0 } c_protocol = 0 c_ops = { stqh_first = (nil) stqh_last = 0x1004dcf38 } c_pending_ops = { stqh_first = (nil) stqh_last = 0x1004dcf48 } c_write1_mutex = { __pthread_mutex_flags = { __pthread_mutex_flag1 = 4U __pthread_mutex_flag2 = '\0' __pthread_mutex_ceiling = '\0' __pthread_mutex_type = 0 __pthread_mutex_magic = 19800U } __pthread_mutex_lock = { __pthread_mutex_lock64 = { __pthread_mutex_pad = "" } __pthread_mutex_lock32 = { __pthread_ownerpid = 0 __pthread_lockword = 4278190080U } __pthread_mutex_owner64 = 4278190080U } __pthread_mutex_data = 18446744071451976704U } c_write1_cv = { __pthread_cond_flags = { __pthread_cond_flag = "" __pthread_cond_type = 0 __pthread_cond_magic = 17238U } __pthread_cond_data = 0 } c_write2_mutex = { __pthread_mutex_flags = { __pthread_mutex_flag1 = 4U __pthread_mutex_flag2 = '\0' __pthread_mutex_ceiling = '\0' __pthread_mutex_type = 0 __pthread_mutex_magic = 19800U } __pthread_mutex_lock = { __pthread_mutex_lock64 = { __pthread_mutex_pad = "" } __pthread_mutex_lock32 = { __pthread_ownerpid = 0 __pthread_lockword = 0 } __pthread_mutex_owner64 = 0 } __pthread_mutex_data = 0 } c_write2_cv = { __pthread_cond_flags = { __pthread_cond_flag = "" __pthread_cond_type = 0 __pthread_cond_magic = 17238U } __pthread_cond_data = 0 } c_currentber = (nil) c_writers = 0 c_sasl_bind_in_progress = '\0' c_writewaiter = '\0' c_sasl_layers = '\0' c_sasl_done = '\0' c_sasl_authctx = (nil) c_sasl_sockctx = (nil) c_sasl_extra = (nil) c_sasl_bindop = (nil) c_pagedresults_state = { ps_be = (nil) ps_size = 0 ps_count = 0 ps_cookie = 0 ps_cookieval = { bv_len = 0 bv_val = (nil) } } c_n_ops_received = 3 c_n_ops_executing = 0 c_n_ops_pending = 0 c_n_ops_completed = 3 c_n_get = 4 c_n_read = 3 c_n_write = 1 c_extensions = (nil) c_clientfunc = (nil) c_clientarg = (nil) c_send_ldap_result = 0x10006b900 = &slap_send_ldap_result(Operation *op, SlapReply *rs) c_send_search_entry = 0x10006cda0 = &slap_send_search_entry(Operation *op, SlapReply *rs) c_send_search_reference = 0x10006fa80 = &slap_send_search_reference(Operation *op, SlapReply *rs) c_send_ldap_extended = 0x10006c5f8 = &slap_send_ldap_extended(Operation *op, SlapReply *rs) c_send_ldap_intermediate = 0x10006ca28 = &slap_send_ldap_intermediate(Operation *op, SlapReply *rs) }
--Apple-Mail-3-627969786--