On Tuesday, 05 May 2009 22:48:10, Howard Chu wrote:
Ralf Haferkamp wrote:
On Friday, 01 May 2009 11:50:15, masarati@aero.polimi.it wrote:
Hi,
for quite some time libldap has enabled TCP keepalive, e.g. to detect dangling syncrepl connections. However, the default timeout of two hours that most systems use might be a bit too long for some applications (e.g. I had a problem lately where nscd didn't answer queries anymore because nss_ldap was blocking in SSL_read() while the underlying connection had been cut off). On the other hand, messing with the system-wide settings might not be a good idea either. On Linux it is possible to configure the keepalive settings on a per-socket basis through the TCP_KEEP* socket options.
Would it be worth adding ldap_set_option() support for those, even if they are not really portable?
I think it would; for archs that do not support it, it could do nothing (and log accordingly, just in case).
Ok, I'll introduce the following new options for keepalive support then:
  LDAP_OPT_X_KEEPALIVE_IDLE      0x6300
  LDAP_OPT_X_KEEPALIVE_PROBES    0x6301
  LDAP_OPT_X_KEEPALIVE_INTERVAL  0x6302
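As a rough sketch of how the three proposed options could map onto the per-socket TCP_KEEP* options, with the "do nothing and log" fallback suggested above for platforms that lack them: the helper name keepalive_apply() and its return convention are assumptions made for illustration and are not libldap code. The option codes are the ones proposed in this thread.

```c
#include <stdio.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Option codes as proposed in this thread. */
#ifndef LDAP_OPT_X_KEEPALIVE_IDLE
#define LDAP_OPT_X_KEEPALIVE_IDLE     0x6300
#define LDAP_OPT_X_KEEPALIVE_PROBES   0x6301
#define LDAP_OPT_X_KEEPALIVE_INTERVAL 0x6302
#endif

/* Hypothetical helper (not part of libldap): apply one keepalive option
 * to the socket fd.
 * Returns 0 on success, -1 if setsockopt() fails, and 1 if the option is
 * unknown or the platform has no matching socket option (logged, ignored). */
int keepalive_apply(int fd, int option, int value)
{
    int on = 1;
    int sockopt = -1;

    switch (option) {
#ifdef TCP_KEEPIDLE
    case LDAP_OPT_X_KEEPALIVE_IDLE:     /* idle seconds before first probe */
        sockopt = TCP_KEEPIDLE;
        break;
#endif
#ifdef TCP_KEEPCNT
    case LDAP_OPT_X_KEEPALIVE_PROBES:   /* unanswered probes before giving up */
        sockopt = TCP_KEEPCNT;
        break;
#endif
#ifdef TCP_KEEPINTVL
    case LDAP_OPT_X_KEEPALIVE_INTERVAL: /* seconds between probes */
        sockopt = TCP_KEEPINTVL;
        break;
#endif
    default:
        break;
    }

    if (sockopt == -1) {
        fprintf(stderr, "keepalive option 0x%x not supported here, ignoring\n",
                (unsigned)option);
        return 1;
    }

    /* The per-socket tunables only matter once keepalive itself is on. */
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, sockopt, &value, sizeof(value)) < 0)
        return -1;
    return 0;
}
```

On Linux, calling keepalive_apply(fd, LDAP_OPT_X_KEEPALIVE_IDLE, 60) on a TCP socket should make the first probe go out after 60 idle seconds instead of the system-wide default.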
We might also think about adding support to set those values for syncrepl and back-ldap/back-meta.
I'd prefer a portable solution vs something so extremely platform-dependent. As already discussed many times before, we just need a client to send a periodic LDAP no-op message to get the same effect. (Abandon 0 will work fine.)
Something like what was proposed in ITS#5133? It seems that it was rejected with a reference to SO_KEEPALIVE having been enabled, though. Should we revisit that?
My problem was not so much with syncrepl, though; nss_ldap was the one giving me trouble.
While it's not as general purpose as setting a keepalive in the socket layer, I think we only need to worry about the syncrepl client. back-ldap/meta already have their own retry mechanisms; they can take care of themselves.
There seems to be a problem with many retry mechanisms when it comes to the scenario I described in my original post. On a TLS-protected connection, SSL_read() (called from ldap_result()) might trigger multiple read() calls. As there are no select/poll calls in between them, one of those read()s might block forever (until TCP keepalive kicks in) in case the server is not answering anymore and didn't close the connection correctly (power failure, ...). I haven't had a good idea yet how to easily fix this case, apart from leveraging TCP keepalives.
(According to the docs, SSL_read() would fail and SSL_get_error() would report SSL_ERROR_WANT_READ when the underlying BIO is non-blocking. But we're using blocking IO. I am unsure how much effort it would be to port that to non-blocking. I'd think it's a non-trivial task ;)).
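The stall described above can be illustrated without TLS. In the minimal sketch below, read_record() is a hypothetical stand-in for a record reader such as SSL_read(): it needs a fixed number of bytes and may need several read() calls to get them. Unlike SSL_read() on a blocking socket, the sketch polls before every read(), so it can report a peer that goes silent mid-record instead of hanging. The function name, the timeout parameter and the return convention are assumptions made for illustration.

```c
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical record reader: collect `want` bytes from fd into buf.
 * Returns 1 once the whole record has arrived, 0 if the peer went silent
 * for timeout_ms before the record was complete, -1 on error or EOF. */
int read_record(int fd, char *buf, size_t want, int timeout_ms)
{
    size_t got = 0;

    while (got < want) {
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        int ready = poll(&pfd, 1, timeout_ms);

        if (ready < 0)
            return -1;
        if (ready == 0)
            return 0;   /* a bare blocking read() would hang at this point */

        ssize_t n = read(fd, buf + got, want - got);
        if (n <= 0)
            return -1;
        got += (size_t)n;
    }
    return 1;
}
```

A single select/poll before the call only guarantees that the first read() has data. If half a record arrives and the peer then disappears, it is the second read() that blocks, which is the case the loop above detects.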
So - I'd rather see an option for a periodic LDAP ping added to the syncrepl client - that will work uniformly across all platforms.
And in general - I am opposed to any code that causes our feature set / behavior to differ from platform to platform.
Understandable, that's why I was asking before committing anything. But AFAIK we have platform-specific issues in other places as well. (Or think about the various different LDAP_OPT_X_TLS settings depending on which underlying SSL implementation is used.)