Richard Silverman wrote:
I’ve done this, and it may well be epoll-specific; the test has now run over twice as long as the longest it has ever required to produce the deadlock. With this sort of bug no amount of waiting would make me sure, but it seems likely. I’ll leave it running.
epoll(7) specifically mentions the possibility of epoll_wait hanging even though there is outstanding unread data on a socket, when using edge-triggered operation, and I notice in daemon.c that you switch to edge-triggered mode in the event that the client closes the connection (at least that’s what the comment suggests):
    /* Don't keep reporting the hangup */
    if ( SLAP_SOCK_IS_ACTIVE( tid, fd )) {
        SLAP_EPOLL_SOCK_SET( tid, fd, EPOLLET );
    }
Perhaps related?
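For readers unfamiliar with the epoll(7) caveat mentioned above, here is a minimal sketch of it in isolation. It does not reproduce the slapd deadlock and uses none of slapd's code; the helper name events_on_second_wait and the socketpair setup are assumptions made purely for illustration. Two bytes are queued on a socket, one is read, and a second epoll_wait() is issued. In edge-triggered mode the remaining byte should not be reported again, whereas in level-triggered mode it should be.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of slapd): returns the number of events
 * reported by a second epoll_wait() while one of two queued bytes is
 * still unread. Returns -1 if the setup fails.
 */
int events_on_second_wait(int edge_triggered)
{
    int sv[2];
    int result = -1;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    int ep = epoll_create1(0);
    if (ep < 0)
        goto out_sockets;

    /* Watch the read side, optionally in edge-triggered mode. */
    struct epoll_event ev = {0};
    ev.events = EPOLLIN | (edge_triggered ? EPOLLET : 0);
    ev.data.fd = sv[0];
    if (epoll_ctl(ep, EPOLL_CTL_ADD, sv[0], &ev) < 0)
        goto out_epoll;

    /* Queue two bytes for the watched socket. */
    if (write(sv[1], "ab", 2) != 2)
        goto out_epoll;

    /* The first wait reports the socket as readable in either mode. */
    struct epoll_event got;
    int first = epoll_wait(ep, &got, 1, 100);
    assert(first == 1);

    /* Consume only one byte, leaving one unread. */
    char c;
    if (read(sv[0], &c, 1) != 1)
        goto out_epoll;

    /*
     * Second wait with a 100 ms timeout. No new data has arrived, so an
     * edge-triggered registration has no new edge to report, while a
     * level-triggered one still sees the socket as readable.
     */
    result = epoll_wait(ep, &got, 1, 100);

out_epoll:
    close(ep);
out_sockets:
    close(sv[0]);
    close(sv[1]);
    return result;
}
```

On Linux, events_on_second_wait(1) is expected to return 0 and events_on_second_wait(0) to return 1. With an infinite timeout instead of 100 ms, the edge-triggered case would block until the peer sends more data, which is the hang the man page warns about; the usual remedy is to keep reading a non-blocking descriptor until read() fails with EAGAIN before waiting again.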
Indeed. Seems like ITS#5886 has resurfaced. You can get some insight into this by looking at the commits from about 96192064f3a3daea994eb8293f0413def5379958 forward. I don't have time to dig further into it at the moment, Christmas dinner(s) calling...
Also, what kernel version(s) are you testing on?
2.6.32 (Red Hat Enterprise Linux)
I see a lot of Linux-kernel email traffic about epoll bugs as well, but not sure which are relevant to this version. Just something to stay aware of.
On Sun, 23 Dec 2012, Howard Chu wrote:
Richard Silverman wrote:
I’ve done this, and it may well be epoll-specific; the test has now run over twice as long as the longest it has ever required to produce the deadlock. With this sort of bug no amount of waiting would make me sure, but it seems likely. I’ll leave it running.
epoll(7) specifically mentions the possibility of epoll_wait hanging even though there is outstanding unread data on a socket, when using edge-triggered operation, and I notice in daemon.c that you switch to edge-triggered mode in the event that the client closes the connection (at least that’s what the comment suggests):
    /* Don't keep reporting the hangup */
    if ( SLAP_SOCK_IS_ACTIVE( tid, fd )) {
        SLAP_EPOLL_SOCK_SET( tid, fd, EPOLLET );
    }
Perhaps related?
Indeed. Seems like ITS#5886 has resurfaced. You can get some insight into this by looking at the commits from about 96192064f3a3daea994eb8293f0413def5379958 forward. I don't have time to dig further into it at the moment, Christmas dinner(s) calling...
I hope you're enjoying your holidays. I've also been on vacation, but I will follow up on this when I can, probably next week. In the meantime, I thought I'd report that I let my test with select() instead of epoll() run for several more hours with no deadlock, for whatever that's worth.
Also, what kernel version(s) are you testing on?
2.6.32 (Red Hat Enterprise Linux)
I see a lot of Linux-kernel email traffic about epoll bugs as well, but not sure which are relevant to this version. Just something to stay aware of.
Good to know, although I think the bug is independent of this. We actually ran into this exact issue on Solaris several years ago and reported it then:
http://www.openldap.org/its/index.cgi/Incoming?id=6920
... but the problem was misidentified and we didn't follow up further. At the time the bug manifested very infrequently for us and we just put monitoring in place which restarted slapd whenever it happened. By the time we migrated to Linux we had forgotten all about this, so we didn't migrate the watchdog, and it turned out that with the changed environment the bug was much more severe this time around, which prompted us to investigate more fully.
Richard Silverman wrote:
On Sun, 23 Dec 2012, Howard Chu wrote:
Also, what kernel version(s) are you testing on?
2.6.32 (Red Hat Enterprise Linux)
I see a lot of Linux-kernel email traffic about epoll bugs as well, but not sure which are relevant to this version. Just something to stay aware of.
Good to know, although I think the bug is independent of this. We actually ran into this exact issue on Solaris several years ago and reported it then:
That bug is against Solaris, which has always only used select(), since epoll() is Linux-only. I don't see how the two can be related, since your current issue is epoll-specific.
On Wed, 2 Jan 2013, Howard Chu wrote:
Richard Silverman wrote:
On Sun, 23 Dec 2012, Howard Chu wrote:
Also, what kernel version(s) are you testing on?
2.6.32 (Red Hat Enterprise Linux)
I see a lot of Linux-kernel email traffic about epoll bugs as well, but not sure which are relevant to this version. Just something to stay aware of.
Good to know, although I think the bug is independent of this. We actually ran into this exact issue on Solaris several years ago and reported it then:
That bug is against Solaris, which has always only used select(), since epoll() is Linux-only. I don't see how the two can be related, since your current issue is epoll-specific.
It’s not clear to me. All we know at the moment is that if I rebuild with select(), I can no longer reproduce the problem. With a timing-dependent concurrency bug we do not yet fully understand, I would not call that conclusive, merely suggestive; it could just as well be that the different code alters the timing such that my test no longer tickles the bug. In fact, as I said it happened much less frequently for us on Solaris, on the order of a few times a month. I have not yet read your earlier bug reference though, so perhaps there’s information there that makes this more clear.
As the historical stuff below shows, once again most threads are blocked on condition waits while one is polling I/O; certainly, the symptoms were the same (slapd became unresponsive and had to be SIGKILL’ed). On the other hand, the last bit (gdb) shows that the cond_wait is for database access, not network as with our current scenario. Could be the same bug manifesting slightly differently.
Anyway, this is of course not directly relevant to solving the current problem; I just thought I’d point it out. At the moment, it seems far too similar to be coincidence to me.