Ralf Haferkamp wrote:
On Tuesday, 05 May 2009 22:48:10, Howard Chu wrote: Something like what was proposed in ITS#5133? It seems that it was rejected with a reference to the enablement of SO_KEEPALIVE, though. Should we revisit that?
Seems like it, yes.
My problem was not so much with syncrepl, though; it was nss_ldap that was causing me trouble.
While it's not as general purpose as setting a keepalive in the socket layer, I think we only need to worry about the syncrepl client. back-ldap/meta already have their own retry mechanisms, they can take care of themselves.
There seems to be a problem with many retry mechanisms when it comes to the scenario I described in my original post. On a TLS-protected connection, SSL_read (called from ldap_result) might trigger multiple read() calls. As there are no select/poll calls in between them, one of those read()s might block forever (until TCP keepalive kicks in) if the server has stopped answering and didn't close the connection correctly (power failure, ...). I haven't had a good idea yet for how to fix this case easily, apart from leveraging TCP keepalives.
(According to the docs, SSL_read() would fail and SSL_get_error() would report SSL_ERROR_WANT_READ when the underlying BIO is non-blocking. But we're using blocking IO. I am unsure how much effort it would be to port that to non-blocking. I'd think it's a non-trivial task ;).)
I don't think there are any particular dependencies left in our code in this regard; ber_get_next() can be called as many times as necessary to retrieve a complete message. All of our input is triggered by select/poll/etc. What's less clear is how well OpenSSL actually behaves with non-blocking sockets; there are a lot of bug reports on that, as I recall. Are you interested in testing that?
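To make the non-blocking idea concrete, here is a minimal sketch of a read that can never hang indefinitely, because every read() is preceded by a poll() with a timeout. It uses a plain descriptor rather than a TLS session; with OpenSSL the "try again" condition would be SSL_ERROR_WANT_READ instead of EAGAIN, and that case is not shown. The names set_nonblocking() and read_with_timeout(), and the -2 timeout code, are hypothetical and not taken from the OpenLDAP sources.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: switch a descriptor to non-blocking mode. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/*
 * Hypothetical helper: wait up to timeout_ms for input, then read once.
 * Returns the number of bytes read, 0 on EOF, -1 on error (errno set),
 * or -2 if no data arrived within one timeout period.
 */
ssize_t read_with_timeout(int fd, void *buf, size_t len, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    for (;;) {
        int rc = poll(&pfd, 1, timeout_ms);
        if (rc < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: wait again */
            return -1;
        }
        if (rc == 0)
            return -2;          /* peer is silent: let the caller decide */

        ssize_t n = read(fd, buf, len);
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR))
            continue;           /* spurious wakeup: go back to poll() */
        return n;
    }
}
```

A caller would treat -2 as "the server may be gone" and run its retry logic, rather than sitting in read() until a TCP keepalive fires.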
I guess, in the absence of a better solution, go ahead with what you've already worked up. We'll just have to document somewhere (Admin Guide I suppose) that a system's TCP keepalive setting may need to be adjusted if not on Linux...
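For reference, a rough sketch of what enabling keepalives on a client socket can look like. This is not the patch under discussion; enable_keepalive() and its parameters are hypothetical. SO_KEEPALIVE is widely available, while the per-socket tuning options (TCP_KEEPIDLE, TCP_KEEPCNT, TCP_KEEPINTVL) exist on Linux and some other systems, so they are guarded by #ifdef here. Where they are missing, only the system-wide settings apply.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: turn on TCP keepalive for fd.
 * idle     = seconds of silence before the first probe
 * probes   = unanswered probes before the connection is dropped
 * interval = seconds between probes
 * A value <= 0 leaves the corresponding system default untouched.
 * Returns 0 on success, -1 on failure (errno set by setsockopt).
 */
int enable_keepalive(int fd, int idle, int probes, int interval)
{
    int on = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;

#ifdef TCP_KEEPIDLE
    if (idle > 0 &&
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) < 0)
        return -1;
#endif
#ifdef TCP_KEEPCNT
    if (probes > 0 &&
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probes, sizeof(probes)) < 0)
        return -1;
#endif
#ifdef TCP_KEEPINTVL
    if (interval > 0 &&
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval, sizeof(interval)) < 0)
        return -1;
#endif
    return 0;
}
```

With, say, idle=60, probes=5 and interval=10, a dead peer would be noticed after roughly 60 + 5 * 10 seconds on a system that honours all three options.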