Made some good progress on this one this evening.
The original issue this ITS is about is that gnutls_handshake() can, in
some versions of GnuTLS, return GNUTLS_E_AGAIN even when the socket is
blocking. Specifically, this happens in the case I described with a
large CA list sent by the server.
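For a client on a blocking socket, the retry this implies can be sketched
roughly as below. This is an illustration only, not the OpenLDAP code:
handshake_fn, handshake_retry, fake_handshake, SKETCH_E_AGAIN and the retry
cap are all hypothetical stand-ins so the block compiles without GnuTLS. In
real code the callback would wrap gnutls_handshake() and the comparison
would be against GNUTLS_E_AGAIN.

```c
#include <stddef.h>

/* Stand-in for GNUTLS_E_AGAIN; the numeric value here is only illustrative. */
#define SKETCH_E_AGAIN (-28)

/* Hypothetical callback type: performs one handshake attempt. */
typedef int (*handshake_fn)(void *ctx);

/*
 * Call the handshake again while it reports "try again", for at most
 * max_tries attempts. Returns the last value the callback returned
 * (or SKETCH_E_AGAIN if max_tries is 0 and nothing was attempted).
 */
static int handshake_retry(handshake_fn fn, void *ctx, int max_tries)
{
    int ret = SKETCH_E_AGAIN;
    int i;

    for (i = 0; i < max_tries && ret == SKETCH_E_AGAIN; i++)
        ret = fn(ctx);
    return ret;
}

/*
 * Fake handshake for demonstration: reports "again" until the counter
 * pointed to by ctx reaches zero, then reports success (0).
 */
static int fake_handshake(void *ctx)
{
    int *remaining = ctx;

    if (*remaining > 0) {
        (*remaining)--;
        return SKETCH_E_AGAIN;
    }
    return 0;
}
```

With a counter of 3 and a cap of 10, handshake_retry(fake_handshake, &n, 10)
returns 0 after four calls. Note that this shape only makes sense where each
attempt actually blocks; on a non-blocking socket the same loop spins, which
is the busy-loop problem described for slapd.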
For slapd, the patch I committed is unfortunately completely wrong. slapd
has been using non-blocking sockets forever, so EAGAIN is expected and
handled robustly -- or it was, until I introduced the busy-loop.
For clients I'm still working on figuring out the right path forward.
There is some EAGAIN handling conditional on LDAP_USE_NON_BLOCKING_TLS,
which itself is behind LDAP_DEVEL. However, this code is meant for
non-blocking sockets, and in my case it ends up stuck in poll() waiting
for a notification that never arrives. In 2.4, ret == 1 simply falls
into the success case and proceeds to send data without completing the
handshake first.
It's possible that what I actually want here is a (ret > 0) case in
ldap_int_tls_start for when LDAP_USE_NON_BLOCKING_TLS is absent and
ldap_int_tls_connect returns 1. (I'd also need to adapt the non-blocking
path to be able to handle a blocking socket as well.)
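To make the difference concrete, here is a small sketch contrasting the
fall-through with an explicit (ret > 0) case. It is a simplified model under
stated assumptions, not a patch: connect_step, start_fallthrough,
start_with_retry and fake_connect are hypothetical names, and the only thing
taken from the real code is the convention that a positive return from the
connect step means the handshake has not finished.

```c
/*
 * Hypothetical single connect step:
 *   < 0  error, 0  handshake complete, > 0  handshake incomplete.
 */
typedef int (*connect_step)(void *ctx);

/*
 * Simplified model of the fall-through: only ret < 0 is treated as an
 * error, so ret == 1 is reported to the caller as success.
 */
static int start_fallthrough(connect_step step, void *ctx)
{
    int ret = step(ctx);

    if (ret < 0)
        return -1;
    return 0;   /* also reached when ret == 1 */
}

/*
 * Sketch of an explicit (ret > 0) case for a blocking socket: keep
 * calling the connect step until it either completes or fails.
 */
static int start_with_retry(connect_step step, void *ctx)
{
    int ret;

    while ((ret = step(ctx)) > 0)
        ;   /* handshake incomplete: call again */
    return ret < 0 ? -1 : 0;
}

/*
 * Fake connect step for demonstration: returns 1 until the counter
 * pointed to by ctx reaches zero, then returns 0.
 */
static int fake_connect(void *ctx)
{
    int *pending = ctx;

    if (*pending > 0) {
        (*pending)--;
        return 1;
    }
    return 0;
}
```

Starting from a counter of 2, start_fallthrough() reports success while one
step is still pending, whereas start_with_retry() only returns once the
counter has reached 0.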
But it's also possible that gnutls_handshake() returning GNUTLS_E_AGAIN
with a blocking socket is simply a GnuTLS bug that was introduced at
some point. I still need to determine exactly when and why its behaviour
changed. (It is still happening with 3.5.19.)
In any case, my patch has to be reverted, as its impact (making slapd
busy-loop) is obviously worse than the status quo (misbehaving clients
in a specific case). I have pushed that revert now and will continue
digging as time permits.