I've investigated this issue a little bit more since my initial bug report.
With loglevel conns sync in slapd.conf, the following are the last messages written to the log before slapd crashes:
Jan 12 09:33:19 shelley slapd[2482]: Jan 12 09:33:19 shelley slapd[2482]: daemon: write active on 15 Jan 12 09:33:19 shelley slapd[2482]: daemon: epoll: listen=9 active_threads=0 tvp=zero Jan 12 09:33:19 shelley slapd[2482]: daemon: epoll: listen=10 active_threads=0 tvp=zero Jan 12 09:33:19 shelley slapd[2482]: daemon: activity on 1 descriptor Jan 12 09:33:19 shelley slapd[2482]: daemon: activity on: Jan 12 09:33:19 shelley slapd[2482]: 15rw
Jan 12 09:33:19 shelley slapd[2482]: ber_flush2 failed errno=32 reason="Broken pipe" Jan 12 09:33:19 shelley slapd[2482]: connection_closing: readying conn=1377 sd=15 for close Jan 12 09:33:19 shelley slapd[2482]: send_search_entry: conn 1377 ber write failed. Jan 12 09:33:19 shelley slapd[2482]: connection_resched: attempting closing conn=1377 sd=15 Jan 12 09:33:19 shelley slapd[2482]: daemon: removing 15 Jan 12 09:33:19 shelley slapd[2482]: connection_get(15): connection not used Jan 12 09:33:19 shelley slapd[2482]: connection_read(15): no connection! Jan 12 09:33:19 shelley slapd[2482]: connection_read(15) error Jan 12 09:33:19 shelley slapd[2482]: connection_get(15): connection not used Jan 12 09:33:19 shelley slapd[2482]: connection_read(15): no connection! Jan 12 09:33:19 shelley slapd[2482]: connection_read(15) error Jan 12 09:33:19 shelley slapd[2482]: connection_get(15): connection not used Jan 12 09:33:19 shelley slapd[2482]: connection_read(15): no connection! Jan 12 09:33:19 shelley slapd[2482]: connection_read(15) error Jan 12 09:33:19 shelley slapd[2482]: connection_get(15): connection not used Jan 12 09:33:19 shelley slapd[2482]: connection_read(15): no connection! Jan 12 09:33:19 shelley slapd[2482]: connection_read(15) error Jan 12 09:33:19 shelley slapd[2482]: connection_get(15): connection not used Jan 12 09:33:19 shelley slapd[2482]: connection_read(15): no connection! Jan 12 09:33:19 shelley slapd[2482]: connection_read(15) error Jan 12 09:33:19 shelley slapd[2482]: connection_get(15): connection not used Jan 12 09:33:19 shelley slapd[2482]: connection_read(15): no connection! Jan 12 09:33:19 shelley slapd[2482]: connection_read(15) error Jan 12 09:33:19 shelley slapd[2482]: connection_get(15): connection not used Jan 12 09:33:19 shelley slapd[2482]: connection_read(15): no connection! Jan 12 09:33:19 shelley slapd[2482]: connection_read(15) error Jan 12 09:33:19 shelley slapd[2482]: connection_get(15): connection not used Jan 12 09:33:19 shelley slapd[2482]: connection_read(15): no connection! Jan 12 09:33:19 shelley slapd[2482]: connection_read(15) error Jan 12 09:33:19 shelley slapd[2482]: connection_get(15): connection not used Jan 12 09:33:19 shelley slapd[2482]: connection_read(15): no connection! Jan 12 09:33:19 shelley slapd[2482]: connection_read(15) error Jan 12 09:33:19 shelley slapd[2482]: Jan 12 09:33:19 shelley slapd[2482]: daemon: write active on 15
"daemon: write active on 15" comes from line 2660 or so of daemon.c. Here's that with some context:
for ( i = 0; nwfds > 0; i++ ) { ber_socket_t wd; if ( ! SLAP_EVENT_IS_WRITE( i ) ) continue; wd = i;
SLAP_EVENT_CLR_WRITE( wd ); nwfds--;
Debug(LDAP_DEBUG_CONNS, "daemon: write active on %d\n", wd, 0, 0);
/* * NOTE: it is possible that the connection was closed and that * the stream is now inactive. connection_write() must validate * the stream is still active. * * ITS#4338: if the stream is invalid, there is no need to close it * here. It has already been closed in connection.c. */ if ( connection_write( wd ) < 0 ) { if ( SLAP_EVENT_IS_READ( wd)) { SLAP_EVENT_CLR_READ( (unsigned) wd); nrfds--; } } }
We do some housekeeping on the descriptor, then call connection_write to take care of the write. Here's connection_write from connection.c:
int connection_write(ber_socket_t s) { Connection *c; Operation *op;
assert( connections != NULL );
slapd_clr_write( s, 0 );
c = connection_get( s ); if( c == NULL ) { Debug( LDAP_DEBUG_ANY, "connection_write(%ld): no connection!\n", (long)s, 0, 0 ); return -1; } [...]
connection_get seems to perform the validation that is mentioned by the comment in daemon.c, but connection_get is called after slapd_clr_write, which is the source of the assertion implicated by the backtrace in the initial bug report:
void slapd_clr_write( ber_socket_t s, int wake ) { int id = DAEMON_ID(s); ldap_pvt_thread_mutex_lock( &slap_daemon[id].sd_mutex );
if ( SLAP_SOCK_IS_WRITE( id, s )) { assert( SLAP_SOCK_IS_ACTIVE( id, s ));
SLAP_SOCK_CLR_WRITE( id, s ); slap_daemon[id].sd_nwriters--; }
ldap_pvt_thread_mutex_unlock( &slap_daemon[id].sd_mutex ); WAKE_LISTENER(id,wake); }
I'm not sure if connection_write is supposed to validate that a stream is active before or after calling slapd_clr_write, but it seems like the assertion wouldn't be an issue if that validation were performed before calling slapd_clr_write. To test this thought, I rebuilt openldap 2.4.28 with the following patch:
--- openldap-2.4.28/servers/slapd/connection.c 2011-11-25 10:52:29.000000000 -0800 +++ openldap-2.4.28-new/servers/slapd/connection.c 2012-01-12 13:35:45.000000000 -0800 @@ -1893,8 +1893,6 @@
assert( connections != NULL );
- slapd_clr_write( s, 0 ); - c = connection_get( s ); if( c == NULL ) { Debug( LDAP_DEBUG_ANY, @@ -1903,6 +1901,8 @@ return -1; }
+ slapd_clr_write( s, 0 ); + #ifdef HAVE_TLS if ( c->c_is_tls && c->c_needs_tls_accept ) { connection_return( c );
and tried to reproduce the problem under the same circumstances as reported in my initial bug report. The master slapd tolerated the misconfigured replicas for 5 days without crashing; before, it would crash reliably within a half hour or so. I didn't notice any regressions due to the patch, though the master slapd wasn't exposed to a typical workload during the experiment.
Any thoughts on this patch?