We've noticed hard failures on both our Linux and Mac workstations when an LDAP server fails in a way which causes it to stop responding but leave a connection open (e.g. lock contention, disk failure). This usually ends up requiring the system to be rebooted because a key system process will probably have made a call which is waiting on a read() which might take days to fail.
I've created a patch which simply calls setsockopt() to set SO_SNDTIMEO|SO_RCVTIMEO when LDAP_OPT_NETWORK_TIMEOUT has been set. This appears to produce the desired result on Linux (both with pam_ldap and the ldap utilities) and OS X (within the DirectoryService plugin).
Is there a drawback to this approach which I've missed? It appears that the issue has come up in the past but there's no solution that I can see (certainly nothing else uses socket-level timeouts). I'd like to find a solution for this as it's by far the biggest source of Linux downtime in our environment.
Thanks, Chris
I think you might be confusing LDAP_OPT_NETWORK_TIMEOUT and LDAP_OPT_TIMEOUT. (Or maybe I am...) But as I recall, NETWORK_TIMEOUT is for initial connect(), and you're referring to ongoing conversations.
For that matter, I'm having a hard time envisioning the situation you describe playing out. Let's say your server dies hard and you reboot it. Then your client, blissfully unaware of this, sends some packets over its open connection. The rebooted server sees the packets, but doesn't have a matching TCP flow, so it's going to tell you to bug off -- I'd expect a "typical OS" to send a TCP reset in response to this. And at that point, libldap should produce LDAP_SERVER_DOWN or something along that flavor, and the client will of course have no bugs and handle this with perfect grace.
Finally, libldap does use TCP keepalive nowadays. In the event of intermediate network path dying hard (which can't be relied upon to nicely produce TCP resets), the underlying keepalive mechanism should pick that up.
On Tue, 8 Apr 2008, Chris Adams wrote:
We've noticed hard failures on both our Linux and Mac workstations when an LDAP server fails in a way which causes it to stop responding but leave a connection open (e.g. lock contention, disk failure). This usually ends up requiring the system to be rebooted because a key system process will probably have made a call which is waiting on a read() which might take days to fail.
I've created a patch which simply calls setsockopt() to set SO_SNDTIMEO|SO_RCVTIMEO when LDAP_OPT_NETWORK_TIMEOUT has been set. This appears to produce the desired result on Linux (both with pam_ldap and the ldap utilities) and OS X (within the DirectoryService plugin).
Is there a drawback to this approach which I've missed? It appears that the issue has come up in the past but there's no solution that I can see (certainly nothing else uses socket-level timeouts). I'd like to find a solution for this as it's by far the biggest source of Linux downtime in our environment.
Thanks, Chris
On Apr 8, 2008, at 10:54 AM, Aaron Richton wrote:
I think you might be confusing LDAP_OPT_NETWORK_TIMEOUT and LDAP_OPT_TIMEOUT. (Or maybe I am...) But as I recall, NETWORK_TIMEOUT is for initial connect(), and you're referring to ongoing conversations.
This is correct - I'm proposing extending that to include a timeout for all network communication. In some cases the APIs have a timeout but many do not and this seems cleaner than requiring the client to pass a timeout for every call which could conceivably perform network operations.
For that matter, I'm having a hard time envisioning the situation you describe playing out. Let's say your server dies hard and you reboot it.
This is the only situation which works well currently. The only three failures we've had with slapd, however, have been situations where the server failed by simply becoming unresponsive and anything which touched PAM/NSS hung waiting for read() to return. We've also seen similar problems with mobile and multi-homed systems where a connection was attempted before the defined LDAP server was reachable.
Finally, libldap does use TCP keepalive nowadays. In the event of intermediate network path dying hard (which can't be relied upon to nicely produce TCP resets), the underlying keepalive mechanism should pick that up.
This is an improvement but it wouldn't help with the slapd failures we've observed because the server's TCP stack can respond to keepalives even when the service is unresponsive. It would definitely help recover when the server is rebooted but it uses the system-wide keepalive settings and the values appropriate for a local LDAP server would be far too aggressive for internet connections.
I understand the current situation but as a user it would feel more correct for LDAP_OPT_NETWORK_TIMEOUT to mean "try the next server if a response is not obtained within this time", covering the additional class of failures where an LDAP server is partially up as we cannot guarantee minute-level admin response times to restart a failing server.
Chris
On Tue, 8 Apr 2008, Chris Adams wrote:
On Apr 8, 2008, at 10:54 AM, Aaron Richton wrote:
I think you might be confusing LDAP_OPT_NETWORK_TIMEOUT and LDAP_OPT_TIMEOUT. (Or maybe I am...) But as I recall, NETWORK_TIMEOUT is for initial connect(), and you're referring to ongoing conversations.
Aaron is correct that LDAP_OPT_NETWORK_TIMEOUT only affects connect() in the released versions.
In versions before 2.4.4, LDAP_OPT_TIMEOUT had no effect. Starting in version 2.4.4 it sets the default timeout for reading the requested result in ldap_result().
This is correct - I'm proposing extending that to include a timeout for all network communication. In some cases the APIs have a timeout but many do not and this seems cleaner than requiring the client to pass a timeout for every call which could conceivably perform network operations.
Since all the synchronous calls (ldap_*_s) are built on top of ldap_result(), if you set LDAP_OPT_TIMEOUT then those calls will automatically use that value. That's the behavior you're looking for, right?
I understand the current situation but as a user it would feel more correct for LDAP_OPT_NETWORK_TIMEOUT to mean "try the next server if a response is not obtained within this time", covering the additional class of failures where an LDAP server is partially up as we cannot guarantee minute-level admin response times to restart a failing server.
Hmm, what do you think the distinction between LDAP_OPT_NETWORK_TIMEOUT and LDAP_OPT_TIMEOUT should be? (Neither of which should be confused with LDAP_OPT_TIMELIMIT, of course.)
Philip Guenther
On Apr 9, 2008, at 4:00 AM, Philip Guenther wrote:
In versions before 2.4.4, LDAP_OPT_TIMEOUT had no effect. Starting in version 2.4.4 it sets the default timeout for reading the requested result in ldap_result().
This sounds like exactly what we've needed - that ends up in poll()/select(), both of which should handle our malevolent server scenario.
I understand the current situation but as a user it would feel more correct for LDAP_OPT_NETWORK_TIMEOUT to mean "try the next server if a response is not obtained within this time", covering the additional class of failures where an LDAP server is partially up as we cannot guarantee minute-level admin response times to restart a failing server.
Hmm, what do you think the distinction between LDAP_OPT_NETWORK_TIMEOUT and LDAP_OPT_TIMEOUT should be? (Neither of which should be confused with LDAP_OPT_TIMELIMIT, of course.)
Perhaps the difference between how long it will wait for a given server to respond (LDAP_OPT_NETWORK_TIMEOUT) and how long it will spend before giving up on the call (LDAP_OPT_TIMEOUT), so it will eventually time out if it can't contact any of the servers? The latter case can be useful in odd networking environments where connectivity is creatively broken (e.g. a "smart" gateway which attempts to spoof any IP it thinks your laptop is using as a gateway) - while it would be nice if hotels stopped using that kind of thing, laptops need to recover.
Chris
On Tue, 8 Apr 2008, Aaron Richton wrote:
For that matter, I'm having a hard time envisioning the situation you describe playing out. Let's say your server dies hard and you reboot it.
Our situation was a frozen disk controller on the server; it happily accepted connections, which never timed out...
On Apr 8, 2008, at 7:23 PM, Dave Horsfall wrote:
On Tue, 8 Apr 2008, Aaron Richton wrote:
For that matter, I'm having a hard time envisioning the situation you describe playing out. Let's say your server dies hard and you reboot it.
Our situation was a frozen disk controller on the server; it happily accepted connections, which never timed out...
We've had similar storage failures, and two slapd bugs: hitting the maximum file descriptor limit, where new connections were accepted but not serviced (http://bugs.debian.org/378261), and the bdb backend deadlocking until we learned to set the number of locks to a value much higher than was suggested when I first set the server up (http://bugs.debian.org/303057).
This is why I'm pretty big on making the client not trust the server to operate perfectly: there are many things which can go wrong, most of which can be addressed by a single client-side fix instead of much more difficult server-side hardening.
Chris
On Tue, 8 Apr 2008, Chris Adams wrote:
We've noticed hard failures on both our Linux and Mac workstations when an LDAP server fails in a way which causes it to stop responding but leave a connection open (e.g. lock contention, disk failure). This usually ends up requiring the system to be rebooted because a key system process will probably have made a call which is waiting on a read() which might take days to fail.
If this is the one I'm thinking of, there is a new client timeout implemented in 2.4.8.
-- Dave