Hi,
For quite some time now, libldap has enabled TCP keepalive, e.g. to detect dangling syncrepl connections. However, the default timeout of two hours that most systems use might be a bit too long for some applications (e.g. I recently had a problem where nscd didn't answer queries anymore because nss_ldap was blocking in SSL_read() while the underlying connection had been cut off). On the other hand, messing with the system-wide settings might not be a good idea either. On Linux it is possible to configure the keepalive settings on a per-socket basis through the TCP_KEEP* socket options.
Would it be worth adding ldap_set_option() support for those, even if they are not really portable?
[..]
I think it would; for archs that do not support it, it could do nothing (and log accordingly, just in case).
p.
On Friday, 01 May 2009 11:50:15, masarati@aero.polimi.it wrote:
[..]
I think it would; for archs that do not support it, it could do nothing (and log accordingly, just in case).
Ok, I'll introduce the following new options for keepalive support then:

LDAP_OPT_X_KEEPALIVE_IDLE      0x6300
LDAP_OPT_X_KEEPALIVE_PROBES    0x6301
LDAP_OPT_X_KEEPALIVE_INTERVAL  0x6302
We might also think about adding support to set those values for syncrepl and back-ldap/back-meta.
Ralf Haferkamp wrote:
[..]
Ok, I'll introduce the following new options for keepalive support then:

LDAP_OPT_X_KEEPALIVE_IDLE      0x6300
LDAP_OPT_X_KEEPALIVE_PROBES    0x6301
LDAP_OPT_X_KEEPALIVE_INTERVAL  0x6302
We might also think about adding support to set those values for syncrepl and back-ldap/back-meta.
I'd prefer a portable solution vs something so extremely platform-dependent. As already discussed many times before, we just need a client to send a periodic LDAP no-op message to get the same effect. (Abandon 0 will work fine.) While it's not as general purpose as setting a keepalive in the socket layer, I think we only need to worry about the syncrepl client. back-ldap/meta already have their own retry mechanisms, they can take care of themselves.
So - I'd rather see an option for a periodic LDAP ping added to the syncrepl client - that will work uniformly across all platforms.
And in general - I am opposed to any code that causes our feature set / behavior to differ from platform to platform.
On Tuesday, 05 May 2009 22:48:10, Howard Chu wrote:
[..]
I'd prefer a portable solution vs something so extremely platform-dependent. As already discussed many times before, we just need a client to send a periodic LDAP no-op message to get the same effect. (Abandon 0 will work fine.)
Something like proposed in ITS#5133? It seems that it was rejected with a reference to the enablement of SO_KEEPALIVE, though. Should we revisit that?
My problem was not so much with syncrepl though; I had nss_ldap causing me trouble.
While it's not as general purpose as setting a keepalive in the socket layer, I think we only need to worry about the syncrepl client. back-ldap/meta already have their own retry mechanisms, they can take care of themselves.
There seems to be a problem with many retry mechanisms when it comes to the scenario I described in my original post. On a TLS-protected connection, SSL_read() (called from ldap_result) might trigger multiple read() calls. As there are no select/poll calls in between them, one of those read()s might block forever (until TCP keepalive kicks in) in case the server is not answering anymore and didn't close the connection correctly (power failure, ...). I haven't had a good idea yet how to easily fix this case, apart from leveraging TCP keepalives.
(According to the docs, SSL_read() would return SSL_ERROR_WANT_READ when the underlying BIO is non-blocking. But we're using blocking I/O. I am unsure how much effort it would be to port that to non-blocking. I'd think it's a non-trivial task ;)).
So - I'd rather see an option for a periodic LDAP ping added to the syncrepl client - that will work uniformly across all platforms.
And in general - I am opposed to any code that causes our feature set / behavior to differ from platform to platform.
Understandable, that's why I was asking before committing anything. But AFAIK we have platform-specific issues in other places as well. (Or think about the various different LDAP_OPT_X_TLS settings depending on which underlying SSL implementation is used.)
Ralf Haferkamp wrote:
On Tuesday, 05 May 2009 22:48:10, Howard Chu wrote:
[..]
Something like proposed in ITS#5133? It seems that it was rejected with a reference to the enablement of SO_KEEPALIVE, though. Should we revisit that?
Seems like it, yes.
My problem was not so much with syncrepl though; I had nss_ldap causing me trouble.
While it's not as general purpose as setting a keepalive in the socket layer, I think we only need to worry about the syncrepl client. back-ldap/meta already have their own retry mechanisms, they can take care of themselves.
There seems to be a problem with many retry mechanisms when it comes to the scenario I described in my original post. On a TLS-protected connection, SSL_read() (called from ldap_result) might trigger multiple read() calls. As there are no select/poll calls in between them, one of those read()s might block forever (until TCP keepalive kicks in) in case the server is not answering anymore and didn't close the connection correctly (power failure, ...). I haven't had a good idea yet how to easily fix this case, apart from leveraging TCP keepalives.
(According to the docs, SSL_read() would return SSL_ERROR_WANT_READ when the underlying BIO is non-blocking. But we're using blocking I/O. I am unsure how much effort it would be to port that to non-blocking. I'd think it's a non-trivial task ;)).
I don't think there's any particular dependencies left in our code in this regard; ber_get_next() can be called as many times as necessary to retrieve a complete message. All of our input is triggered by select/poll/etc. What's less clear is how well OpenSSL actually behaves with non-blocking sockets; there are a lot of bug reports on that as I recall. You interested in testing that?
I guess, in the absence of a better solution, go ahead with what you've already worked up. We'll just have to document somewhere (Admin Guide I suppose) that a system's TCP keepalive setting may need to be adjusted if not on Linux...
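For the Admin Guide note: on Linux the system-wide defaults live in sysctl, and the values shown below are the usual kernel defaults, including the two-hour idle time mentioned at the start of this thread. Other platforms use different names for these knobs:

```shell
# Linux system-wide keepalive defaults; per-socket TCP_KEEP* options override them.
sysctl net.ipv4.tcp_keepalive_time    # usually 7200 seconds (the "two hours" above)
sysctl net.ipv4.tcp_keepalive_probes  # usually 9 probes
sysctl net.ipv4.tcp_keepalive_intvl   # usually 75 seconds between probes
```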
On Wednesday, 06 May 2009 11:27:29, Howard Chu wrote:
Ralf Haferkamp wrote:
On Tuesday, 05 May 2009 22:48:10, Howard Chu wrote:
[..]
Something like proposed in ITS#5133? It seems that it was rejected with a reference to the enablement of SO_KEEPALIVE, though. Should we revisit that?
Seems like it, yes.
Btw, you mentioned that sending Abandon 0 will be sufficient as a no-op. How's that going to work?
[..]
I don't think there's any particular dependencies left in our code in this regard; ber_get_next() can be called as many times as necessary to retrieve a complete message. All of our input is triggered by select/poll/etc. What's less clear is how well OpenSSL actually behaves with non-blocking sockets; there are a lot of bug reports on that as I recall. You interested in testing that?
Apart from the usual time-constraints, I am not too keen on that. ;)
I guess, in the absence of a better solution, go ahead with what you've already worked up. We'll just have to document somewhere (Admin Guide I suppose) that a system's TCP keepalive setting may need to be adjusted if not on Linux...
I just submitted the libldap part, will see how/if I can work out the syncrepl part later. I need to finish some other stuff first.
Ralf Haferkamp writes:
Btw, you mentioned that sending Abandon 0 will be sufficient as a no-op. How's that going to work?
It's a no-op, thus it can be sent when you just want to send some message:
- The Abandon request has no response.
- rfc4511 §4.11: "Servers MUST discard Abandon requests for messageIDs they do not recognize, for operations that cannot be abandoned, (...)
- No request may have Message ID 0 (§4.1.1.1); 0 is reserved for Unsolicited Notifications. Yet Message IDs are just defined as 0..2^31-1, so abandon(0) is not a protocolError.
Thus abandon(0) is a no-op.
I can imagine some implementation treating it as protocolError anyway, though. It's not as if everyone agrees on what the letter of the standard means and follows it to the letter.
I always thought that having a separate, specific NoOp operation would be helpful in such cases [not to be mixed up with the NOOP control], because similar kinds of problems tend to surface in various scenarios, and while there are workarounds, they are, well, workarounds. Some deployments out there might want to have such an operation disabled, since it can be abused by clients; in that case it could have a response defined, and servers could send unwillingToPerform or something similar.
Hallvard B Furuseth wrote:
[..]
Anton Bobrov wrote:
I always thought that having a separate, specific NoOp operation would be helpful in such cases [not to be mixed up with the NOOP control], because similar kinds of problems tend to surface in various scenarios, and while there are workarounds, they are, well, workarounds. Some deployments out there might want to have such an operation disabled, since it can be abused by clients; in that case it could have a response defined, and servers could send unwillingToPerform or something similar.
Yes, given the spotty nature of network connectivity it probably should have been part of the original spec. (Then again, people probably weren't thinking about long-lived sessions or persistent search back then...) But at this point it's too late; adding a new NoOp request probably isn't going to get sufficient adoption/deployment to actually be useful to any clients.
For the purpose of detecting a failed TCP connection, Abandon ought to be sufficient. No LDAP-level reply is needed since TCP will ACK the message. As an alternative you could send a Search request (as noted in ITS#5133) against the rootDSE, if you wanted to also measure the latency. The advantage of using Abandon is that it could easily be hidden from any applications by doing it invisibly in the library. For any other request with a reply, you'd also have to intercept / discard the result. (Though you could always send the request immediately followed by its own Abandon, I suppose.)