New subject: syncrepl losing connection

2 Oct 2009


Hi,
This issue has been bugging me for a while, but I can't find anything 
about it when googling.
I have a slapd 2.3.x server which has been taking longer and longer to 
start. Lately a restart has taken 45 minutes.
With strace you can just see it waiting for a futex:
Process 22740 attached - interrupt to quit
futex(0x56274bd8, FUTEX_WAIT, 22742, NULL
... then suddenly it starts to listen and answer queries.
Now, I hoped my planned upgrade to 2.4.x would solve that, but alas!
I now have a running setup with 2.4.17 compiled with OpenSSL in 
mirrormode, mirroring cn=config and the primary database with TLS (with 
client certs and SASL EXTERNAL), running on Linux 2.6.18 on a "vserver".
And it still hangs on startup.
I suspect that it has something to do with the vserver. One 
explanation would be that slapd tries to connect to itself via TCP, 
since the kernel just DROPs packets to 127.0.0.1.
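The loopback theory can be tested from inside the vserver with a small probe. This is only a sketch: the function name `probe_loopback` and the port numbers are illustrative assumptions, and it says nothing about whether slapd really connects to itself. A "timeout" result would be consistent with packets being silently dropped, while "refused" means the packets get through but nothing is listening.

```python
import socket

def probe_loopback(port, host="127.0.0.1", timeout=3.0):
    """Try a TCP connect and classify the outcome.

    Returns 'connected', 'refused' (RST came back, so packets are not
    being dropped) or 'timeout' (no answer at all within `timeout`).
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect((host, port))
        return "connected"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    finally:
        s.close()

if __name__ == "__main__":
    # 389 and 636 are the standard ldap/ldaps ports; adjust as needed.
    for port in (389, 636):
        print(port, probe_loopback(port))
```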
Another explanation would be that it can't gather enough entropy, but 
my 2.3.x setup didn't use TLS, and I have checked that /dev/random and 
/dev/urandom are world readable.
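Readable permissions alone do not show whether reads from /dev/random would block. A rough check, assuming a Linux kernel that exposes `/proc/sys/kernel/random/entropy_avail`, is sketched below; the helper names are made up for illustration.

```python
import os

def entropy_avail(path="/proc/sys/kernel/random/entropy_avail"):
    """Return the kernel's entropy estimate in bits, or None if unreadable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None

def random_read_would_block(device="/dev/random", nbytes=16):
    """Return True if a non-blocking read from `device` would block."""
    fd = os.open(device, os.O_RDONLY | os.O_NONBLOCK)
    try:
        os.read(fd, nbytes)
        return False
    except BlockingIOError:
        return True
    finally:
        os.close(fd)

if __name__ == "__main__":
    print("entropy_avail:", entropy_avail())
    print("/dev/random would block:", random_read_would_block())
```

Running it a few times while slapd is hanging should show whether the pool is drained at that moment.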
Looking at debug output at server-1 I see things like:
ber_flush2 failed errno=11 reason="Resource temporarily unavailable"
and:
connection_write(17): waking output for id=2
connection_get(17): got connid=2
connection_write(17): waking output for id=2
connection_get(17): got connid=2
connection_write(17): waking output for id=2
connection_get(17): got connid=2
connection_write(17): waking output for id=2
ber_flush2: 933 bytes to sd 17
send_search_entry: conn 2  ber write failed.
connection_close: conn=2 sd=17
connection_read(17): no connection!
connection_read(17): no connection!
connection_read(17): no connection!
connection_read(17): no connection!
Where sd 17 seems to be the last of 4 syncrepl connections.
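For reference, errno 11 on Linux is EAGAIN ("Resource temporarily unavailable"), which is what a write on a non-blocking socket returns when the send buffer is full because the peer is not reading. The sketch below reproduces that condition with a local socket pair. It only illustrates the errno; whether this is what happens inside slapd here is an assumption, and `fill_until_eagain` is a made-up name.

```python
import errno
import socket

def fill_until_eagain(chunk=b"x" * 4096):
    """Write to a non-blocking socket whose peer never reads.

    Returns (bytes_sent, errno) at the point the kernel buffer is full.
    """
    a, b = socket.socketpair()
    a.setblocking(False)
    sent = 0
    try:
        while True:
            sent += a.send(chunk)
    except BlockingIOError as e:
        return sent, e.errno
    finally:
        a.close()
        b.close()

if __name__ == "__main__":
    sent, err = fill_until_eagain()
    print("sent", sent, "bytes, then errno", err, errno.errorcode[err])
```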
lsof tells me:
slapd     25704   root   14u     IPv4             852048     TCP s01:40400->s02:ldaps (ESTABLISHED)
slapd     25704   root   15u     IPv4             852057     TCP s01:40401->s02:ldaps (ESTABLISHED)
slapd     25704   root   16u     IPv4             852169     TCP s01:ldaps->s02:48705 (ESTABLISHED)
But the last (sd 17) seems to be closed again.
It seems syncrepl tried to get started (the other server is empty, 
since the database has just been loaded with slapadd -w on server-1), 
but it only manages to replicate the first 5 entries or so.
slapd -d 16384 output is:
slapd starting
do_syncrep2: rid=004 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
do_syncrep2: rid=002 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
send_search_entry: conn 2  ber write failed.
connection_read(17): no connection!
connection_read(17): no connection!
connection_read(17): no connection!
send_search_entry: conn 3  ber write failed.
connection_write(17): no connection!
send_search_entry: conn 4  ber write failed.
connection_write(17): no connection!
send_search_entry: conn 5  ber write failed.
connection_write(17): no connection!
....
What can slapd be waiting for?
/Peter

slapd hangs on startup