I've investigated this issue a little bit more since my initial bug
report.
With loglevel conns sync in slapd.conf, the following are the last
messages written to the log before slapd crashes:
Jan 12 09:33:19 shelley slapd[2482]:
Jan 12 09:33:19 shelley slapd[2482]: daemon: write active on 15
Jan 12 09:33:19 shelley slapd[2482]: daemon: epoll: listen=9 active_threads=0 tvp=zero
Jan 12 09:33:19 shelley slapd[2482]: daemon: epoll: listen=10 active_threads=0 tvp=zero
Jan 12 09:33:19 shelley slapd[2482]: daemon: activity on 1 descriptor
Jan 12 09:33:19 shelley slapd[2482]: daemon: activity on:
Jan 12 09:33:19 shelley slapd[2482]: 15rw
Jan 12 09:33:19 shelley slapd[2482]: ber_flush2 failed errno=32 reason="Broken pipe"
Jan 12 09:33:19 shelley slapd[2482]: connection_closing: readying conn=1377 sd=15 for close
Jan 12 09:33:19 shelley slapd[2482]: send_search_entry: conn 1377 ber write failed.
Jan 12 09:33:19 shelley slapd[2482]: connection_resched: attempting closing conn=1377 sd=15
Jan 12 09:33:19 shelley slapd[2482]: daemon: removing 15
Jan 12 09:33:19 shelley slapd[2482]: connection_get(15): connection not used
Jan 12 09:33:19 shelley slapd[2482]: connection_read(15): no connection!
Jan 12 09:33:19 shelley slapd[2482]: connection_read(15) error
Jan 12 09:33:19 shelley slapd[2482]: connection_get(15): connection not used
Jan 12 09:33:19 shelley slapd[2482]: connection_read(15): no connection!
Jan 12 09:33:19 shelley slapd[2482]: connection_read(15) error
Jan 12 09:33:19 shelley slapd[2482]: connection_get(15): connection not used
Jan 12 09:33:19 shelley slapd[2482]: connection_read(15): no connection!
Jan 12 09:33:19 shelley slapd[2482]: connection_read(15) error
Jan 12 09:33:19 shelley slapd[2482]: connection_get(15): connection not used
Jan 12 09:33:19 shelley slapd[2482]: connection_read(15): no connection!
Jan 12 09:33:19 shelley slapd[2482]: connection_read(15) error
Jan 12 09:33:19 shelley slapd[2482]: connection_get(15): connection not used
Jan 12 09:33:19 shelley slapd[2482]: connection_read(15): no connection!
Jan 12 09:33:19 shelley slapd[2482]: connection_read(15) error
Jan 12 09:33:19 shelley slapd[2482]: connection_get(15): connection not used
Jan 12 09:33:19 shelley slapd[2482]: connection_read(15): no connection!
Jan 12 09:33:19 shelley slapd[2482]: connection_read(15) error
Jan 12 09:33:19 shelley slapd[2482]: connection_get(15): connection not used
Jan 12 09:33:19 shelley slapd[2482]: connection_read(15): no connection!
Jan 12 09:33:19 shelley slapd[2482]: connection_read(15) error
Jan 12 09:33:19 shelley slapd[2482]: connection_get(15): connection not used
Jan 12 09:33:19 shelley slapd[2482]: connection_read(15): no connection!
Jan 12 09:33:19 shelley slapd[2482]: connection_read(15) error
Jan 12 09:33:19 shelley slapd[2482]: connection_get(15): connection not used
Jan 12 09:33:19 shelley slapd[2482]: connection_read(15): no connection!
Jan 12 09:33:19 shelley slapd[2482]: connection_read(15) error
Jan 12 09:33:19 shelley slapd[2482]:
Jan 12 09:33:19 shelley slapd[2482]: daemon: write active on 15
"daemon: write active on 15" comes from line 2660 or so of daemon.c.
Here's that with some context:
for ( i = 0; nwfds > 0; i++ ) {
ber_socket_t wd;
if ( ! SLAP_EVENT_IS_WRITE( i ) ) continue;
wd = i;
SLAP_EVENT_CLR_WRITE( wd );
nwfds--;
Debug(LDAP_DEBUG_CONNS,
"daemon: write active on %d\n",
wd, 0, 0);
/*
* NOTE: it is possible that the connection was closed and that
* the stream is now inactive. connection_write() must validate
* the stream is still active.
*
* ITS#4338: if the stream is invalid, there is no need to close it
* here. It has already been closed in connection.c.
*/
if ( connection_write( wd ) < 0 ) {
if ( SLAP_EVENT_IS_READ( wd)) {
SLAP_EVENT_CLR_READ( (unsigned) wd);
nrfds--;
}
}
}
We do some housekeeping on the descriptor, then call connection_write to
take care of the write. Here's connection_write from connection.c:
int connection_write(ber_socket_t s)
{
Connection *c;
Operation *op;
assert( connections != NULL );
slapd_clr_write( s, 0 );
c = connection_get( s );
if( c == NULL ) {
Debug( LDAP_DEBUG_ANY,
"connection_write(%ld): no connection!\n",
(long)s, 0, 0 );
return -1;
}
[...]
connection_get seems to perform the validation that is mentioned by the
comment in daemon.c, but connection_get is called after slapd_clr_write,
which is the source of the assertion implicated by the backtrace in the
initial bug report:
void
slapd_clr_write( ber_socket_t s, int wake )
{
int id = DAEMON_ID(s);
ldap_pvt_thread_mutex_lock( &slap_daemon[id].sd_mutex );
if ( SLAP_SOCK_IS_WRITE( id, s )) {
assert( SLAP_SOCK_IS_ACTIVE( id, s ));
SLAP_SOCK_CLR_WRITE( id, s );
slap_daemon[id].sd_nwriters--;
}
ldap_pvt_thread_mutex_unlock( &slap_daemon[id].sd_mutex );
WAKE_LISTENER(id,wake);
}
I'm not sure if connection_write is supposed to validate that a stream
is active before or after calling slapd_clr_write, but it seems like the
assertion wouldn't be an issue if that validation were performed before
calling slapd_clr_write. To test this thought, I rebuilt openldap 2.4.28
with the following patch:
--- openldap-2.4.28/servers/slapd/connection.c 2011-11-25 10:52:29.000000000 -0800
+++ openldap-2.4.28-new/servers/slapd/connection.c 2012-01-12 13:35:45.000000000 -0800
@@ -1893,8 +1893,6 @@
assert( connections != NULL );
- slapd_clr_write( s, 0 );
-
c = connection_get( s );
if( c == NULL ) {
Debug( LDAP_DEBUG_ANY,
@@ -1903,6 +1901,8 @@
return -1;
}
+ slapd_clr_write( s, 0 );
+
#ifdef HAVE_TLS
if ( c->c_is_tls && c->c_needs_tls_accept ) {
connection_return( c );
and tried to reproduce the problem under the same circumstances as
reported in my initial bug report. The master slapd tolerated the
misconfigured replicas for 5 days without crashing; before, it would
crash reliably within a half hour or so. I didn't notice any regressions
due to the patch, though the master slapd wasn't exposed to a typical
workload during the experiment.
Any thoughts on this patch?
--
Kevan Carstensen <kacarstensen(a)csupomona.edu>
Operating Systems Analyst, I&IT Systems, Cal Poly Pomona