Richard Silverman wrote:
OK, so thread 13 is waiting on conn 1370, sd 39. There are actually no error messages associated with this connection or socket anywhere in your logs.
Likewise for thread 16: conn 1351, sd 20. No errors anywhere.
This looks more like a problem with epoll() than anything else. Are the connections associated with socket 20 and 39 actually dead?
The process which initiated those connections is gone; the test harness runs
slamd (a load generator) repeatedly and then kills it while it is running, to simulate the connections going down in a variety of states. Interestingly, though, here’s the lsof output for the slapd file descriptors:
Most of the TCP connections are in CLOSE_WAIT, which is what you’d expect if
the client sent a FIN but the server is frozen and cannot close the socket. However, note the odd state of the two sockets you asked about, 20 and 39: “can’t identify protocol.” There are 25 IPv4 connections in CLOSE_WAIT shown above; by comparison, netstat shows all those, plus one more on port 43362 not shown above.
A little googling produced this and similar comments:
https://idea.popcount.org/2012-12-09-lsof-cant-identify-protocol/
I'm not sure this applies here; perhaps there's more than one way for the socket to end up in this state.
... which I verified using the sample program provided in that blog post; apparently, after a TCP connection is fully closed, Linux just drops the information about it and it shows up like that thereafter. This is the network trace of a connection which ends up in this state:
03:11:01.875853 IP client.52176 > server.9918: Flags [S], seq 4090195389, win 14600, options [mss 1460,sackOK,TS val 3524517096 ecr 0,nop,wscale 9], length 0 03:11:01.875870 IP server.9918 > client.52176: Flags [S.], seq 2724560604, ack 4090195390, win 14480, options [mss 1460,sackOK,TS val 1101728398 ecr 3524517096,nop,wscale 9], length 0 03:11:01.876773 IP client.52176 > server.9918: Flags [.], ack 1, win 29, options [nop,nop,TS val 3524517096 ecr 1101728398], length 0 03:11:01.876852 IP server.9918 > client.52176: Flags [F.], seq 1, ack 1, win 29, options [nop,nop,TS val 1101728399 ecr 3524517096], length 0 03:11:01.877894 IP client.52176 > server.9918: Flags [.], ack 2, win 29, options [nop,nop,TS val 3524517098 ecr 1101728399], length 0 03:11:01.878978 IP client.52176 > server.9918: Flags [F.], seq 1, ack 2, win 29, options [nop,nop,TS val 3524517099 ecr 1101728399], length 0 03:11:01.878993 IP server.9918 > client.52176: Flags [.], ack 2, win 29, options [nop,nop,TS val 1101728401 ecr 3524517099], length 0
... which is just a normal server-initiated close. I wonder if this odd socket state is not producing the expected/necessary error indication somewhere (e.g. ber_flush2) and thus mucking things up.
slapd definitely is not initiating the close though.
Something else to try would be to disable the use of epoll() and go back to select(), to see if this problem is epoll-specific. In slapd/daemon.c, after the #include "portable.h" add a #undef HAVE_EPOLL and then recompile and test.
Also, what kernel version(s) are you testing on?