Hi Ryan,
I'm running into a problem with slapd 2.4.46 hanging on Ubuntu 18.04, which seems to be a side effect of the ITS#8650 patch:
https://github.com/openldap/openldap/commit/7b5181da8cdd47a13041f9ee36fa9590...
slapd runs fine for a while, but during some periods of high traffic it hangs: it pegs the CPU at 100% and stops responding to new LDAP connections. After some time it resumes working, but overall it's fairly unreliable.
strace on slapd during the hang shows that it is constantly making read() calls that return EAGAIN. A gdb stack trace on slapd shows that these read() calls come from the busy-wait for loop in tlsg_session_accept(), which calls gnutls_handshake() again every time it gets EAGAIN. When slapd recovers from the hang, the first message it prints is a TLS negotiation failure error on the culprit file descriptor.
If I back out the ITS#8650 patch, the problem goes away. If I insert sleep(1) in the for loop, slapd no longer pegs the CPU at 100%, but it still becomes unresponsive during these high-traffic periods.
I don't know what's happening during these high-traffic periods that causes the TLS negotiation to go astray. Unfortunately it's not easy to reproduce this problem outside of this production environment, given the diversity of clients running different OSes with various versions of SSL libraries.
Is there a better way to handle EAGAIN returned from gnutls_handshake() than the busy-wait from ITS#8650, or my simplistic attempt at inserting a sleep() call, which doesn't really seem to help? I'd also like to know how the GnuTLS developers intend gnutls_handshake() to be used, so that sessions involving long packets are handled gracefully without opening up a vulnerability that lets a peer chew up lots of system resources.
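To make the question concrete, here is a rough sketch in C of the kind of alternative I have in mind: wait in poll() for the descriptor to become ready, and give up after a deadline, instead of spinning or sleeping blindly. This is not slapd or GnuTLS code. Every name in it (step_fn, drive_handshake, read_one_byte_step, set_nonblocking, and the two enums) is made up for illustration, and the "handshake step" is a stand-in that just tries to read one byte. In slapd the step would presumably wrap gnutls_handshake(), with gnutls_record_get_direction() telling it whether GnuTLS was blocked on a read or a write.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical result of one non-blocking handshake attempt. */
enum step_result { STEP_DONE, STEP_WANT_READ, STEP_WANT_WRITE, STEP_FAILED };

/* Hypothetical outcome of the whole wait loop. */
enum hs_result { HS_OK = 0, HS_ERROR = -1, HS_TIMEOUT = -2 };

/* One attempt at making progress; must not block. */
typedef enum step_result (*step_fn)(int fd, void *ctx);

/* Monotonic clock in milliseconds, used for the overall deadline. */
static long long now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* Put a descriptor into non-blocking mode; returns 0 on success. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/*
 * Retry `step` until it finishes, fails, or timeout_ms elapses.
 * Between attempts the thread sleeps in poll() until the fd is ready
 * in the direction the step asked for, so it burns no CPU while idle.
 */
static enum hs_result drive_handshake(int fd, step_fn step, void *ctx,
                                      int timeout_ms)
{
    assert(step != NULL);
    long long deadline = now_ms() + timeout_ms;

    for (;;) {
        enum step_result r = step(fd, ctx);
        if (r == STEP_DONE)
            return HS_OK;
        if (r == STEP_FAILED)
            return HS_ERROR;

        long long left = deadline - now_ms();
        if (left <= 0)
            return HS_TIMEOUT;

        struct pollfd p = {
            .fd = fd,
            .events = (r == STEP_WANT_WRITE) ? POLLOUT : POLLIN,
            .revents = 0
        };
        int n = poll(&p, 1, (int)left);
        if (n == 0)
            return HS_TIMEOUT;          /* peer stayed silent too long */
        if (n < 0 && errno != EINTR)
            return HS_ERROR;            /* poll itself failed */
        /* Ready, or interrupted by a signal: loop and try the step again. */
    }
}

/* Stand-in for a handshake step: succeeds once one byte can be read. */
static enum step_result read_one_byte_step(int fd, void *ctx)
{
    (void)ctx;
    char c;
    ssize_t n = read(fd, &c, 1);
    if (n == 1)
        return STEP_DONE;
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR))
        return STEP_WANT_READ;
    return STEP_FAILED;                 /* EOF or a hard error */
}
```

Something like this would bound how long one stalled client can hold things up, but it still ties up the calling thread for up to timeout_ms. I suspect avoiding that completely means returning to the caller's event loop on EAGAIN and resuming the handshake when the descriptor becomes ready, and I'm not sure how well that fits the current slapd structure.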
Regards,
-Kartik