Hi,
This issue has been bugging me for a while, but I can't find anything about it when googling.
I have a slapd 2.3.x server which has been taking longer and longer to start. Lately a restart has been taking 45 minutes. With strace you can just see it waiting on a futex:
Process 22740 attached - interrupt to quit
futex(0x56274bd8, FUTEX_WAIT, 22742, NULL
... then suddenly it starts to listen and answer queries.
Now, I hoped my planned upgrade to 2.4.x would solve that, but alas!
I now have a running setup with 2.4.17 compiled with OpenSSL in mirrormode, mirroring cn=config and the primary database with TLS (with client certs and SASL EXTERNAL), running on Linux 2.6.18 on a "vserver".
And it still hangs on startup.
I suspect that it has something to do with the vserver. One explanation would be that slapd tries to connect to itself via TCP, since the kernel just DROPs packets to 127.0.0.1.
Another explanation would be that it can't gather enough entropy, but my 2.3.x setup didn't use TLS, and I have checked that /dev/random and /dev/urandom are world readable.
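One way to test the first hypothesis without involving slapd is to try a TCP connection over loopback with a timeout. The following is only a minimal sketch: the function name `loopback_tcp_works` and the 3-second timeout are illustrative assumptions, not anything taken from slapd or the vserver tooling.

```python
import socket


def loopback_tcp_works(timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to 127.0.0.1 completes within `timeout` seconds.

    Hypothetical helper: if the kernel silently drops loopback packets,
    the connect attempt should time out and this returns False.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        try:
            # Port 0 asks the kernel for any free port.
            listener.bind(("127.0.0.1", 0))
            listener.listen(1)
            port = listener.getsockname()[1]
            # The handshake completes in the kernel backlog, so no accept() is needed.
            with socket.create_connection(("127.0.0.1", port), timeout=timeout):
                return True
        except OSError:
            # Covers timeouts as well as refused or unreachable errors.
            return False


if __name__ == "__main__":
    print("loopback TCP reachable:", loopback_tcp_works())
```

A result of False on the affected host would support the dropped-loopback explanation; True would point elsewhere.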
Looking at debug output at server-1 I see things like:
ber_flush2 failed errno=11 reason="Resource temporarily unavailable"
and:
connection_write(17): waking output for id=2
connection_get(17): got connid=2
connection_write(17): waking output for id=2
connection_get(17): got connid=2
connection_write(17): waking output for id=2
connection_get(17): got connid=2
connection_write(17): waking output for id=2
ber_flush2: 933 bytes to sd 17
send_search_entry: conn 2 ber write failed.
connection_close: conn=2 sd=17
connection_read(17): no connection!
connection_read(17): no connection!
connection_read(17): no connection!
connection_read(17): no connection!
Where sd 17 seems to be the last of 4 syncrepl connections.
lsof tells me:
slapd 25704 root 14u IPv4 852048 TCP s01:40400->s02:ldaps (ESTABLISHED)
slapd 25704 root 15u IPv4 852057 TCP s01:40401->s02:ldaps (ESTABLISHED)
slapd 25704 root 16u IPv4 852169 TCP s01:ldaps->s02:48705 (ESTABLISHED)
But the last (sd 17) seems to be closed again.
It seems syncrepl tried to get started (the other server is empty, since the database has just been loaded with slapadd -w on server-1), but it only manages to replicate the first 5 entries or so.
slapd -d 16384 output is:
slapd starting
do_syncrep2: rid=004 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
do_syncrep2: rid=002 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
send_search_entry: conn 2 ber write failed.
connection_read(17): no connection!
connection_read(17): no connection!
connection_read(17): no connection!
send_search_entry: conn 3 ber write failed.
connection_write(17): no connection!
send_search_entry: conn 4 ber write failed.
connection_write(17): no connection!
send_search_entry: conn 5 ber write failed.
connection_write(17): no connection!
....
What can slapd be waiting for?
/Peter
Peter Mogensen wrote:
What can slapd be waiting for?
Use gdb and find out.
http://www.openldap.org/faq/data/cache/59.html
Howard Chu wrote:
Use gdb and find out.
Unfortunately that's a bit hard for me until I get new hardware, since I'm running a 32-bit system on a 64-bit kernel, and that combination seems to have a bug with gdb debugging multithreaded applications (on 2.6.18).
Anyway... strace on one of the threads which seems to be doing syncrepl reveals:
write(19, "\27\3\1\3\2710\202\3\241\2\1\2d\202\3c\4\uuid=ef556050-8"..., 958) = 708
write(19, "09Z0i\4\7entryDN1^\4\uuid=ef556050-8"..., 250) = -1 EAGAIN (Resource temporarily unavailable)
epoll_ctl(7, EPOLL_CTL_MOD, 19, {EPOLLIN|EPOLLOUT, {u32=1437773908, u64=1437773908}}) = 0
write(6, "0"..., 1) = 1
futex(0x55cd2178, FUTEX_WAIT, 51, NULL) = 0
futex(0x55cd215c, FUTEX_WAKE, 1) = 0
write(19, "09Z0i\4\7entryDN1^\4\uuid=ef556050-8"..., 250) = -1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) ---
epoll_ctl(7, EPOLL_CTL_MOD, 19, {0, {u32=1437773908, u64=1437773908}}) = 0
futex(0x8165960, FUTEX_WAKE, 1) = 0
So somehow the syncrepl connection from server 2 to server 1 is closed (again and again) for the main database. But the connection for the cn=config database seems to persist.
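The EAGAIN-then-EPIPE sequence in the strace output can be reproduced in isolation. The sketch below is not slapd code: it uses a local socket pair as a stand-in for the syncrepl connection, and the function name `demo_eagain_then_epipe` is made up for illustration. It shows that a non-blocking write fails with EAGAIN once the peer stops reading, and that a later write fails with a broken-pipe style error once the peer has closed its end.

```python
import errno
import socket


def demo_eagain_then_epipe():
    """Return the errno values seen when the peer stops reading, then closes.

    Hypothetical demo: a socketpair stands in for the TCP connection.
    """
    writer, reader = socket.socketpair()
    writer.setblocking(False)
    chunk = b"x" * 4096
    seen = []
    try:
        # Step 1: the peer never reads, so the kernel buffer eventually fills
        # and the non-blocking send fails instead of waiting.
        try:
            while True:
                writer.send(chunk)
        except BlockingIOError as exc:
            seen.append(exc.errno)

        # Step 2: the peer closes its end; the retried write now fails outright.
        reader.close()
        try:
            writer.send(chunk)
        except OSError as exc:
            seen.append(exc.errno)
    finally:
        writer.close()
        reader.close()
    return seen


if __name__ == "__main__":
    for code in demo_eagain_then_epipe():
        print(code, errno.errorcode.get(code, "?"))
```

Python ignores SIGPIPE by default, so the second failure surfaces as an exception rather than a signal. The exact errno for the second write may vary by platform (EPIPE or ECONNRESET).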
/Peter
Hi,
I found the reason that slapd was hanging at startup. It turned out to be a schema which hadn't been properly replicated after being dynamically added. So now replication is actually moving entries. However, it seems to constantly lose the connection (which may be why the schema sometimes fails to replicate on load).
The setup is 2 mirrormode servers (slapd 2.4.17). Server 1 has the database and is trying to replicate it to Server 2 which was empty from start.
I have syncrepl for both cn=config and for the actual database. Which means that I should see 4 connections (2 each way) between server 1 and 2. But the last connection (server2->server1) seems to open and close constantly.
On server 2 I see repeated:
Oct 7 09:47:14 s02 slapd[26723]: do_syncrepl: rid=001 rc -1 retrying
Oct 7 09:47:28 s02 slapd[26723]: do_syncrep2: rid=003 (-1) Can't contact LDAP server
Oct 7 09:47:28 s02 slapd[26723]: do_syncrepl: rid=003 rc -1 retrying
Oct 7 09:48:49 s02 slapd[26723]: do_syncrep2: rid=003 (-1) Can't contact LDAP server
When adding "olcLogLevel: conns sync trace none" I see the log messages I would expect, mixed with a lot of these:
Oct 7 10:41:52 s02 slapd[26723]: slap_sl_malloc of 48 bytes failed, using ch_malloc
Oct 7 10:41:52 s02 slapd[26723]: slap_sl_malloc of 40 bytes failed, using ch_malloc
... coming in bursts with a varying number of bytes. However, the machine doesn't look like it's running out of memory.
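To see which replication IDs are affected, the do_syncrepl / do_syncrep2 lines quoted above can be tallied per rid. This is a rough sketch only: the regular expression is inferred from the four sample lines in this message, and the names `tally_syncrepl` and `SAMPLE` are illustrative.

```python
import re
from collections import Counter

# Pattern inferred from the sample syslog lines; other slapd versions may differ.
LINE_RE = re.compile(r"do_syncrep(?:l|2): rid=(\d+) (.*)$")

SAMPLE = [
    "Oct 7 09:47:14 s02 slapd[26723]: do_syncrepl: rid=001 rc -1 retrying",
    "Oct 7 09:47:28 s02 slapd[26723]: do_syncrep2: rid=003 (-1) Can't contact LDAP server",
    "Oct 7 09:47:28 s02 slapd[26723]: do_syncrepl: rid=003 rc -1 retrying",
    "Oct 7 09:48:49 s02 slapd[26723]: do_syncrep2: rid=003 (-1) Can't contact LDAP server",
]


def tally_syncrepl(lines):
    """Count retries and contact failures per rid in syslog lines."""
    retries, failures = Counter(), Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # not a syncrepl status line
        rid, rest = match.groups()
        if "retrying" in rest:
            retries[rid] += 1
        elif "Can't contact LDAP server" in rest:
            failures[rid] += 1
    return retries, failures


if __name__ == "__main__":
    retries, failures = tally_syncrepl(SAMPLE)
    print("retries:", dict(retries))
    print("failures:", dict(failures))
```

Fed a longer stretch of the syslog, the same tally would show whether only one rid keeps dropping its connection.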
/Peter
Peter Mogensen wrote:
The setup is 2 mirrormode servers (slapd 2.4.17). Server 1 has the database and is trying to replicate it to Server 2 which was empty from start.
Ah... My problem seems to go away if I remove the "idletimeout" directive (which I had set to 120).
I can find a few mentions of this problem with older versions of slapd, but nothing saying that I shouldn't use idletimeout with syncrepl on 2.4.17.
Have I missed something?
/Peter
--On October 7, 2009 3:06:43 PM +0200 Peter Mogensen apm@mutex.dk wrote:
Peter Mogensen wrote:
The setup is 2 mirrormode servers (slapd 2.4.17). Server 1 has the database and is trying to replicate it to Server 2 which was empty from start.
Ah... My problem seems to go away if I remove the "idletimeout" directive (which I had set to 120).
I can find a few mentions of this problem with older versions of slapd, but nothing saying that I shouldn't use idletimeout with syncrepl on 2.4.17.
Have I missed something?
From 2.4.18:
Fixed slapd incorrectly applying writetimeout when not set (ITS#6220)
most likely.
--Quanah
--
Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc -------------------- Zimbra :: the leader in open source messaging and collaboration
Quanah Gibson-Mount wrote:
I can find a few mentions of this problem with older versions of slapd, but nothing saying that I shouldn't use idletimeout with syncrepl on 2.4.17.
Have I missed something?
From 2.4.18:
Fixed slapd incorrectly applying writetimeout when not set (ITS#6220)
most likely.
Hmm... I actually did look in the changelog for 2.4.1[89] and did read that, but didn't associate it with idletimeout vs. syncrepl.
/Peter
--On October 7, 2009 10:53:27 AM +0200 Peter Mogensen apm@mutex.dk wrote:
When adding "olcLogLevel: conns sync trace none" I see the log messages I would expect, mixed with a lot of these:
Oct 7 10:41:52 s02 slapd[26723]: slap_sl_malloc of 48 bytes failed, using ch_malloc
Oct 7 10:41:52 s02 slapd[26723]: slap_sl_malloc of 40 bytes failed, using ch_malloc
... coming in bursts with a varying number of bytes. However, the machine doesn't look like it's running out of memory.
Unable to malloc means your system is running out of memory. That's bad.
--Quanah
--On October 7, 2009 1:48:52 PM -0700 Quanah Gibson-Mount quanah@zimbra.com wrote:
... coming in bursts with a varying number of bytes. However, the machine doesn't look like it's running out of memory.
Unable to malloc means your system is running out of memory. That's bad.
Sorry, wrong malloc... you can ignore those messages.
--Quanah
openldap-software@openldap.org