Dear openldap gurus,
I am hitting some strange behavior with the idle sessions timeout feature. In my configuration this timeout is set to 60 seconds on 4 slaves that are behind a load balancer. This load balancer times-out idle sessions after 90 seconds, which should be fine. Openldap version is the stable one from Debian Etch r3.
I however encounter random connection issues that have been traced to the load balancer timeouting and idle session *before* the ldap slave.
I have straced the slapd process and I found out the applyed idletimeout was way above the configured one, please check the two following strace output :
Output 1
futex(0x603428, FUTEX_WAKE, 1) = 1 select(16, [4 6 7], NULL, NULL, {300, 0}) = 1 (in [6], left {238, 132000}) accept(6, {sa_family=AF_INET, sin_port=htons(34103), sin_addr=inet_addr("192.168.1.1")}, [4647729933731233808]) = 12 setsockopt(12, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0 setsockopt(12, SOL_TCP, TCP_NODELAY, [1], 4) = 0 open("/etc/hosts.allow", O_RDONLY) = 15 fstat(15, {st_mode=S_IFREG|0644, st_size=677, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaaccb3000 read(15, "# /etc/hosts.allow: list of host"..., 4096) = 677 read(15, "", 4096) = 0 close(15) = 0 munmap(0x2aaaaccb3000, 4096) = 0 open("/etc/hosts.deny", O_RDONLY) = 15 fstat(15, {st_mode=S_IFREG|0644, st_size=901, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaaccb3000 read(15, "# /etc/hosts.deny: list of hosts"..., 4096) = 901 read(15, "", 4096) = 0 close(15) = 0 munmap(0x2aaaaccb3000, 4096) = 0 fcntl(12, F_GETFL) = 0x2 (flags O_RDWR) fcntl(12, F_SETFL, O_RDWR|O_NONBLOCK) = 0 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2945, ...}) = 0 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2945, ...}) = 0 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2945, ...}) = 0 sendto(3, "<167>Jul 1 09:14:14 slapd[2765]"..., 102, MSG_NOSIGNAL, NULL, 0) = 102 select(16, [4 6 7 12], NULL, NULL, {15, 0}) = 1 (in [12], left {15, 0}) read(12, "0b\2\1\1`]\2", 8) = 8
[ some uninteresting ldap stuff ]
futex(0x603428, FUTEX_WAKE, 1) = 1 read(12, 0x6f30ff, 8) = -1 EAGAIN (Resource temporarily unavailable) futex(0x2b0db3b35dc8, FUTEX_WAKE, 1) = 1 select(16, [4 6 7 12], NULL, NULL, {15, 0}) = 0 (Timeout) select(16, [4 6 7 12], NULL, NULL, {15, 0}) = 0 (Timeout) select(16, [4 6 7 12], NULL, NULL, {15, 0}) = 0 (Timeout) select(16, [4 6 7 12], NULL, NULL, {15, 0}) = 0 (Timeout) select(16, [4 6 7 12], NULL, NULL, {15, 0}) = 0 (Timeout) write(5, "0", 1) = 1 shutdown(12, 2 /* send and receive */) = 0 close(12) = 0
Here, we can see 5 select system calls for a real idletimeout is 75 seconds instead of 60.
Output 2
futex(0x603428, FUTEX_WAKE, 1) = 1 select(16, [4 6 7], NULL, NULL, {300, 0}) = 1 (in [6], left {230, 828000}) accept(6, {sa_family=AF_INET, sin_port=htons(51692), sin_addr=inet_addr("192.168.1.1")}, [4647729933731233808]) = 12 setsockopt(12, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0 setsockopt(12, SOL_TCP, TCP_NODELAY, [1], 4) = 0 open("/etc/hosts.allow", O_RDONLY) = 15 fstat(15, {st_mode=S_IFREG|0644, st_size=677, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaaccb3000 read(15, "# /etc/hosts.allow: list of host"..., 4096) = 677 read(15, "", 4096) = 0 close(15) = 0 munmap(0x2aaaaccb3000, 4096) = 0 open("/etc/hosts.deny", O_RDONLY) = 15 fstat(15, {st_mode=S_IFREG|0644, st_size=901, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaaccb3000 read(15, "# /etc/hosts.deny: list of hosts"..., 4096) = 901 read(15, "", 4096) = 0 close(15) = 0 munmap(0x2aaaaccb3000, 4096) = 0 fcntl(12, F_GETFL) = 0x2 (flags O_RDWR) fcntl(12, F_SETFL, O_RDWR|O_NONBLOCK) = 0 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2945, ...}) = 0 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2945, ...}) = 0 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2945, ...}) = 0 sendto(3, "<167>Jul 1 09:19:21 slapd[2765]"..., 102, MSG_NOSIGNAL, NULL, 0) = 102 select(16, [4 6 7 12], NULL, NULL, {15, 0}) = 1 (in [12], left {15, 0}) read(12, "0b\2\1\1`]\2", 8) = 8
[ some uninteresting ldap stuff ]
futex(0x2b0db3b35dc8, FUTEX_WAKE, 1) = 1 select(16, [4 6 7 12], NULL, NULL, {15, 0}) = 0 (Timeout) select(16, [4 6 7 12], NULL, NULL, {15, 0}) = 0 (Timeout) select(16, [4 6 7 12], NULL, NULL, {15, 0}) = 0 (Timeout) select(16, [4 6 7 12], NULL, NULL, {15, 0}) = 0 (Timeout) select(16, [4 6 7 12], NULL, NULL, {15, 0}) = 0 (Timeout) select(16, [4 6 7 12], NULL, NULL, {15, 0}) = 0 (Timeout) write(5, "0", 1) = 1 shutdown(12, 2 /* send and receive */) = 0 close(12) = 0
Here, we can see 5 select system call for a real idletimeout is 75 Here we have 6 select system calls for a real idletimeout of 90 seconds which is enough for the session to expire on the load balancer.
I have checked the source code and the logic that choose either to idletimeout the session or go into a "SLAP_EVENT_WAIT" (select) call is the following :
from server/slap/daemon.c
now = slap_get_time();
if ( ( global_idletimeout > 0 ) && difftime( last_idle_check + global_idletimeout/SLAPD_IDLE_CHECK_LIMIT, now ) < 0 ) { connections_timeout_idle( now ); last_idle_check = now; }
As I understand this, no connection should be tested against the idletimeout before any "event wait loop" takes more time than the idletimeout parameter / 4.
In my case, I need the "event wait loop" to last more than 15 seconds for connections to be checked against aging.
If I am not mistaken, as the difftime call compares seconds, I need the loop to last a least for 16 seconds for the connections_timeout_idle procedure to be called.
Am I understanding everything the right way ? If it is the case, shouldn't the difftime call be tested <= 0 to help idle sessions to be cleaned sonner ?
Many thanks.
Eric Déchaux wrote:
Dear openldap gurus,
I am hitting some strange behavior with the idle sessions timeout feature. In my configuration this timeout is set to 60 seconds on 4 slaves that are behind a load balancer. This load balancer times-out idle sessions after 90 seconds, which should be fine. Openldap version is the stable one from Debian Etch r3.
I have no idea what Debian or any other distro packages. You should quote specific version numbers for all relevant pieces of software.
I however encounter random connection issues that have been traced to the load balancer timeouting and idle session *before* the ldap slave.
I have straced the slapd process and I found out the applyed idletimeout was way above the configured one, please check the two following strace output :
Output 1
[ some uninteresting ldap stuff ]
futex(0x603428, FUTEX_WAKE, 1) = 1 read(12, 0x6f30ff, 8) = -1 EAGAIN (Resource temporarily unavailable) futex(0x2b0db3b35dc8, FUTEX_WAKE, 1) = 1 select(16, [4 6 7 12], NULL, NULL, {15, 0}) = 0 (Timeout) select(16, [4 6 7 12], NULL, NULL, {15, 0}) = 0 (Timeout) select(16, [4 6 7 12], NULL, NULL, {15, 0}) = 0 (Timeout) select(16, [4 6 7 12], NULL, NULL, {15, 0}) = 0 (Timeout) select(16, [4 6 7 12], NULL, NULL, {15, 0}) = 0 (Timeout) write(5, "0", 1) = 1 shutdown(12, 2 /* send and receive */) = 0 close(12) = 0
Here, we can see 5 select system calls for a real idletimeout is 75 seconds instead of 60.
This doesn't really surprise me.
Output 2
[ some uninteresting ldap stuff ]
futex(0x2b0db3b35dc8, FUTEX_WAKE, 1) = 1 select(16, [4 6 7 12], NULL, NULL, {15, 0}) = 0 (Timeout) select(16, [4 6 7 12], NULL, NULL, {15, 0}) = 0 (Timeout) select(16, [4 6 7 12], NULL, NULL, {15, 0}) = 0 (Timeout) select(16, [4 6 7 12], NULL, NULL, {15, 0}) = 0 (Timeout) select(16, [4 6 7 12], NULL, NULL, {15, 0}) = 0 (Timeout) select(16, [4 6 7 12], NULL, NULL, {15, 0}) = 0 (Timeout) write(5, "0", 1) = 1 shutdown(12, 2 /* send and receive */) = 0 close(12) = 0
Here we have 6 select system calls for a real idletimeout of 90 seconds which is enough for the session to expire on the load balancer.
This is rather surprising.
I have checked the source code and the logic that choose either to idletimeout the session or go into a "SLAP_EVENT_WAIT" (select) call is the following :
from server/slap/daemon.c
now = slap_get_time(); if ( ( global_idletimeout> 0 )&& difftime( last_idle_check + global_idletimeout/SLAPD_IDLE_CHECK_LIMIT, now )< 0 ) { connections_timeout_idle( now ); last_idle_check = now; }
As I understand this, no connection should be tested against the idletimeout before any "event wait loop" takes more time than the idletimeout parameter / 4.
Right, on an otherwise idle server, we don't want to wake up too frequently to check for idle connections. It's OK to check a little late, but we don't want to wake up much too late, which would often occur if the IDLE_CHECK_LIMIT was smaller.
In my case, I need the "event wait loop" to last more than 15 seconds for connections to be checked against aging.
Basically, yes.
If I am not mistaken, as the difftime call compares seconds, I need the loop to last a least for 16 seconds for the connections_timeout_idle procedure to be called.
Am I understanding everything the right way ?
Sounds like it.
If it is the case, shouldn't the difftime call be tested<= 0 to help idle sessions to be cleaned sonner ?
I don't think it makes much difference in the long run. Whenever you choose an idletimeout that is not evenly divisible by 4 (IDLE_CHECK_LIMIT) it's going to have extra slop anyway. And none of this explains how your 60 second idletimeout allowed an idle connection to continue for 90 seconds. Frankly I have no idea why that would be.
In the meantime, on an idle server, I don't see any urgency in closing idle connections, because in this case there's no danger of resource starvation. On the other hand, for an active server, the event loop is going to be waking up more frequently anyway due to real activity, in which case the idle checks will happen more frequently. So as the server gets busier, the actual idletimeouts will get much closer to the configured value.
Le lundi 07 juillet 2008 à 02:32 -0700, Howard Chu a écrit :
Eric Déchaux wrote:
Dear openldap gurus,
I am hitting some strange behavior with the idle sessions timeout feature. In my configuration this timeout is set to 60 seconds on 4 slaves that are behind a load balancer. This load balancer times-out idle sessions after 90 seconds, which should be fine. Openldap version is the stable one from Debian Etch r3.
I have no idea what Debian or any other distro packages. You should quote specific version numbers for all relevant pieces of software.
Sorry about that. Version is 2.3.30. I also forgot to mention I am running the whole thing inside a VMware ESX 301 virtual machine. I don't know if this can have impact.
I however encounter random connection issues that have been traced to the load balancer timeouting and idle session *before* the ldap slave.
I have straced the slapd process and I found out the applyed idletimeout was way above the configured one, please check the two following strace output :
Output 1
[ some uninteresting ldap stuff ]
futex(0x603428, FUTEX_WAKE, 1) = 1 read(12, 0x6f30ff, 8) = -1 EAGAIN (Resource temporarily unavailable) futex(0x2b0db3b35dc8, FUTEX_WAKE, 1) = 1 select(16, [4 6 7 12], NULL, NULL, {15, 0}) = 0 (Timeout) select(16, [4 6 7 12], NULL, NULL, {15, 0}) = 0 (Timeout) select(16, [4 6 7 12], NULL, NULL, {15, 0}) = 0 (Timeout) select(16, [4 6 7 12], NULL, NULL, {15, 0}) = 0 (Timeout) select(16, [4 6 7 12], NULL, NULL, {15, 0}) = 0 (Timeout) write(5, "0", 1) = 1 shutdown(12, 2 /* send and receive */) = 0 close(12) = 0
Here, we can see 5 select system calls for a real idletimeout is 75 seconds instead of 60.
This doesn't really surprise me.
Me neither.
Output 2
[ some uninteresting ldap stuff ]
futex(0x2b0db3b35dc8, FUTEX_WAKE, 1) = 1 select(16, [4 6 7 12], NULL, NULL, {15, 0}) = 0 (Timeout) select(16, [4 6 7 12], NULL, NULL, {15, 0}) = 0 (Timeout) select(16, [4 6 7 12], NULL, NULL, {15, 0}) = 0 (Timeout) select(16, [4 6 7 12], NULL, NULL, {15, 0}) = 0 (Timeout) select(16, [4 6 7 12], NULL, NULL, {15, 0}) = 0 (Timeout) select(16, [4 6 7 12], NULL, NULL, {15, 0}) = 0 (Timeout) write(5, "0", 1) = 1 shutdown(12, 2 /* send and receive */) = 0 close(12) = 0
Here we have 6 select system calls for a real idletimeout of 90 seconds which is enough for the session to expire on the load balancer.
This is rather surprising.
I have checked the source code and the logic that choose either to idletimeout the session or go into a "SLAP_EVENT_WAIT" (select) call is the following :
from server/slap/daemon.c
now = slap_get_time(); if ( ( global_idletimeout> 0 )&& difftime( last_idle_check + global_idletimeout/SLAPD_IDLE_CHECK_LIMIT, now )< 0 ) { connections_timeout_idle( now ); last_idle_check = now; }
As I understand this, no connection should be tested against the idletimeout before any "event wait loop" takes more time than the idletimeout parameter / 4.
Right, on an otherwise idle server, we don't want to wake up too frequently to check for idle connections. It's OK to check a little late, but we don't want to wake up much too late, which would often occur if the IDLE_CHECK_LIMIT was smaller.
In my case, I need the "event wait loop" to last more than 15 seconds for connections to be checked against aging.
Basically, yes.
If I am not mistaken, as the difftime call compares seconds, I need the loop to last a least for 16 seconds for the connections_timeout_idle procedure to be called.
Am I understanding everything the right way ?
Sounds like it.
If it is the case, shouldn't the difftime call be tested<= 0 to help idle sessions to be cleaned sonner ?
I don't think it makes much difference in the long run. Whenever you choose an idletimeout that is not evenly divisible by 4 (IDLE_CHECK_LIMIT) it's going to have extra slop anyway. And none of this explains how your 60 second idletimeout allowed an idle connection to continue for 90 seconds. Frankly I have no idea why that would be.
I believe it is possible when the main event loop takes less than 1 second, not counting the select timeout, when an idle check was done on the previous loop. If this condition happens, difftime(last_idle_check+global_idletimeout/SLAPD_IDLE_CHECK_LIMIT, now) will return 0 and no connection aging will be checked.
In the meantime, on an idle server, I don't see any urgency in closing idle connections, because in this case there's no danger of resource starvation. On the other hand, for an active server, the event loop is going to be waking up more frequently anyway due to real activity, in which case the idle checks will happen more frequently. So as the server gets busier, the actual idletimeouts will get much closer to the configured value.
Got it. It seems there is no simple workaround on ldap side for my issue. I will search for other options.
Many thanks for your help.
Eric Déchaux wrote:
Le lundi 07 juillet 2008 à 02:32 -0700, Howard Chu a écrit :
I have no idea what Debian or any other distro packages. You should quote specific version numbers for all relevant pieces of software.
Sorry about that. Version is 2.3.30. I also forgot to mention I am running the whole thing inside a VMware ESX 301 virtual machine. I don't know if this can have impact.
Always possible. Time has a nasty tendency to drift inside a virtual machine, most often due to other real machine activity that isn't visible to the VM. Plus, it's just very hard to keep accurate time without expensive overhead in the VM environment. The VMware documentation describes a number of these issues.
Output 2 [ some uninteresting ldap stuff ]
futex(0x2b0db3b35dc8, FUTEX_WAKE, 1) = 1 select(16, [4 6 7 12], NULL, NULL, {15, 0}) = 0 (Timeout) select(16, [4 6 7 12], NULL, NULL, {15, 0}) = 0 (Timeout) select(16, [4 6 7 12], NULL, NULL, {15, 0}) = 0 (Timeout) select(16, [4 6 7 12], NULL, NULL, {15, 0}) = 0 (Timeout) select(16, [4 6 7 12], NULL, NULL, {15, 0}) = 0 (Timeout) select(16, [4 6 7 12], NULL, NULL, {15, 0}) = 0 (Timeout) write(5, "0", 1) = 1 shutdown(12, 2 /* send and receive */) = 0 close(12) = 0
Here we have 6 select system calls for a real idletimeout of 90 seconds which is enough for the session to expire on the load balancer.
This is rather surprising.
If it is the case, shouldn't the difftime call be tested<= 0 to help idle sessions to be cleaned sonner ?
I don't think it makes much difference in the long run. Whenever you choose an idletimeout that is not evenly divisible by 4 (IDLE_CHECK_LIMIT) it's going to have extra slop anyway. And none of this explains how your 60 second idletimeout allowed an idle connection to continue for 90 seconds. Frankly I have no idea why that would be.
I believe it is possible when the main event loop takes less than 1 second, not counting the select timeout, when an idle check was done on the previous loop. If this condition happens, difftime(last_idle_check+global_idletimeout/SLAPD_IDLE_CHECK_LIMIT, now) will return 0 and no connection aging will be checked.
Then we should see an I/O event in the log, but there's no such event in the strace log you provided. And skipping one check would only extend the delay by 15 seconds; it would still close at 75. Still seems a bit too mysterious.
You can of course try changing the "< 0" to "<= 0" in both daemon.c and connection.c to see if that helps your situation.
openldap-technical@openldap.org