Eric Déchaux wrote:
On Monday, 07 July 2008 at 02:32 -0700, Howard Chu wrote:
I have no idea what Debian or any other distro packages. You should quote specific version numbers for all relevant pieces of software.
Sorry about that. The version is 2.3.30. I also forgot to mention that I am running the whole thing inside a VMware ESX 301 virtual machine. I don't know whether this could have an impact.
Always possible. Time has a nasty tendency to drift inside a virtual machine, most often due to other real machine activity that isn't visible to the VM. Plus, it's just very hard to keep accurate time without expensive overhead in the VM environment. The VMware documentation describes a number of these issues.
Output 2
[ some uninteresting ldap stuff ]
futex(0x2b0db3b35dc8, FUTEX_WAKE, 1) = 1
select(16, [4 6 7 12], NULL, NULL, {15, 0}) = 0 (Timeout)
select(16, [4 6 7 12], NULL, NULL, {15, 0}) = 0 (Timeout)
select(16, [4 6 7 12], NULL, NULL, {15, 0}) = 0 (Timeout)
select(16, [4 6 7 12], NULL, NULL, {15, 0}) = 0 (Timeout)
select(16, [4 6 7 12], NULL, NULL, {15, 0}) = 0 (Timeout)
select(16, [4 6 7 12], NULL, NULL, {15, 0}) = 0 (Timeout)
write(5, "0", 1) = 1
shutdown(12, 2 /* send and receive */) = 0
close(12) = 0
Here we have six select system calls, each with a 15-second timeout, for an effective idle timeout of 90 seconds, which is long enough for the session to expire on the load balancer.
This is rather surprising.
If that is the case, shouldn't the difftime call be tested with <= 0 to help idle sessions get cleaned up sooner?
I don't think it makes much difference in the long run. Whenever you choose an idletimeout that is not evenly divisible by 4 (IDLE_CHECK_LIMIT) it's going to have extra slop anyway. And none of this explains how your 60 second idletimeout allowed an idle connection to continue for 90 seconds. Frankly I have no idea why that would be.
I believe it is possible when the main event loop takes less than 1 second, not counting the select timeout, when an idle check was done on the previous loop. If this condition happens, difftime(last_idle_check+global_idletimeout/SLAPD_IDLE_CHECK_LIMIT, now) will return 0 and no connection aging will be checked.
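To make the boundary case concrete, here is a minimal sketch of the comparison being discussed. It is not the actual slapd code: the function idle_check_due and its strict flag are hypothetical names introduced only for illustration, and the value 4 for SLAPD_IDLE_CHECK_LIMIT is taken from the earlier remark in this thread. Only the difftime expression comes from the message above.

```c
#include <assert.h>
#include <time.h>

#define SLAPD_IDLE_CHECK_LIMIT 4   /* value as quoted in this thread */

/*
 * Hypothetical helper, not part of slapd.
 * Returns nonzero when an idle sweep would run at time `now`.
 * strict != 0 models the "< 0" test; strict == 0 models "<= 0".
 */
static int idle_check_due(time_t last_idle_check, time_t global_idletimeout,
                          time_t now, int strict)
{
    double d;

    assert(global_idletimeout > 0);

    /* difftime(a, b) yields a - b in seconds: zero exactly on the boundary. */
    d = difftime(last_idle_check + global_idletimeout / SLAPD_IDLE_CHECK_LIMIT,
                 now);

    return strict ? (d < 0) : (d <= 0);
}
```

With a 60 second idletimeout and a previous check at t=1000, a call at t=1015 gives a difference of exactly 0, so the "< 0" form skips the sweep while the "<= 0" form runs it. One second later both forms agree.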
Then we should see an I/O event in the log, but there's no such event in the strace log you provided. And skipping one check would only extend the delay by 15 seconds; it would still close at 75. Still seems a bit too mysterious.
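The arithmetic can be checked with a small model. This is only a sketch under stated assumptions, not slapd's logic: modeled_close_time is a hypothetical function, sweeps are assumed to fire every idletimeout/4 seconds starting from the moment the connection goes idle, and a connection is assumed to be closed by the first sweep that runs once it has been idle for at least idletimeout seconds.

```c
#include <assert.h>

#define SLAPD_IDLE_CHECK_LIMIT 4   /* value as quoted in this thread */

/*
 * Hypothetical model, not slapd code.
 * Returns the number of seconds after going idle at which the connection
 * would be closed, given how many due sweeps failed to run.
 */
static long modeled_close_time(long idletimeout, int missed_sweeps)
{
    long interval, t;

    assert(idletimeout >= SLAPD_IDLE_CHECK_LIMIT);
    assert(missed_sweeps >= 0);

    interval = idletimeout / SLAPD_IDLE_CHECK_LIMIT;  /* integer division */

    for (t = interval; ; t += interval) {
        if (t < idletimeout)
            continue;            /* not idle long enough yet */
        if (missed_sweeps > 0) {
            missed_sweeps--;     /* this sweep did not run */
            continue;
        }
        return t;
    }
}
```

Under these assumptions a 60 second idletimeout closes at 60 with no missed sweep, at 75 with one, and only reaches the observed 90 with two consecutive misses. A value such as 62, which is not divisible by 4, closes at 75 even with no miss, which is the extra slop mentioned earlier.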
You can of course try changing the "< 0" to "<= 0" in both daemon.c and connection.c to see if that helps your situation.