Janne Peltonen wrote:
Hi!
We've got a proxy OpenLDAP setup using the "ldap" proxy database (the newer "meta" backend had some issues, so I couldn't use that one). Most of the time, the proxy bind goes OK, but sometimes, the connection goes like this (proxy log, using ldaps):
--clip--
Jan 7 16:01:12 ldap-r-1 slapd[16851]: conn=7716 fd=96 ACCEPT from IP=128.214.54.152:36460 (IP=0.0.0.0:636)
Jan 7 16:01:12 ldap-r-1 slapd[16851]: conn=7716 fd=96 TLS established tls_ssf=256 ssf=256
Jan 7 16:01:12 ldap-r-1 slapd[16851]: conn=7716 op=0 BIND dn="<removed>" method=128
Jan 7 16:01:12 ldap-r-1 slapd[16851]: conn=7716 op=0 RESULT tag=97 err=52 text=Start TLS failed
Jan 7 16:01:12 ldap-r-1 slapd[16851]: conn=7716 fd=96 closed (connection lost)
--clip--
On the backend, it looks as if a new connection overruns the old one, so to speak; see what happens with connection 6983:
Looks odd indeed, but doesn't seem related to the other errors. Would need much finer resolution timestamps to correlate what's going on, unless you know there are no other active connections on the proxy when this occurred on the backend.
--clip--
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6983 fd=23 ACCEPT from IP=128.214.222.155:46144 (IP=0.0.0.0:389)
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 fd=41 ACCEPT from IP=128.214.222.155:46146 (IP=0.0.0.0:389)
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6983 op=0 EXT oid=1.3.6.1.4.1.1466.20037
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6983 op=0 STARTTLS
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6983 op=1 UNBIND
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6983 fd=23 closed
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 op=0 EXT oid=1.3.6.1.4.1.1466.20037
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 op=0 STARTTLS
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 op=0 RESULT oid= err=0 text=
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 fd=41 TLS established tls_ssf=256 ssf=256
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 op=1 BIND dn="<removed>" method=128
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 op=1 BIND dn="<removed>" mech=SIMPLE ssf=0
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 op=1 RESULT tag=97 err=0 text=
[SRCH and result lines]
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 op=3 UNBIND
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 fd=41 closed
--clip--
This doesn't seem to have any correlation with the number of binds or with server load, at least as far as I'm aware - and anyway, the servers are nowhere near capacity; we only get ~2 binds per second.
A normal connection looks like this:
--clip--
Jan 7 16:00:55 ldap-r-1 slapd[16851]: conn=7650 fd=159 ACCEPT from IP=128.214.54.152:36362 (IP=0.0.0.0:636)
Jan 7 16:00:55 ldap-r-1 slapd[16851]: conn=7650 fd=159 TLS established tls_ssf=256 ssf=256
Jan 7 16:00:55 ldap-r-1 slapd[16851]: conn=7650 op=0 BIND dn="<removed>" method=128
Jan 7 16:00:55 ldap-r-1 slapd[16851]: conn=7650 op=0 BIND dn="<removed>" mech=SIMPLE ssf=0
Jan 7 16:00:55 ldap-r-1 slapd[16851]: conn=7650 op=0 RESULT tag=97 err=0 text=
[SRCH and SEARCH RESULT lines]
Jan 7 16:01:12 ldap-r-1 slapd[16851]: conn=7650 op=219 UNBIND
Jan 7 16:01:12 ldap-r-1 slapd[16851]: conn=7650 fd=159 closed
--clip--
So far, it looks as if the problem is that we accept a second connection for the same client before the proxy has had time to do a bind (or indeed, a STARTTLS for the bind) on the backend for the second connection.
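If overlapping cached backend connections are indeed the trigger, the connection-handling directives in slapd-ldap(5) might be worth experimenting with. A hypothetical slapd.conf fragment - the directive names come from the man page, but the URI, suffix, and timeout values here are only illustrative:

```
database        ldap
suffix          "dc=helsinki,dc=fi"
uri             "ldap://ldap-v-2.example.org"
# issue STARTTLS on outbound connections to the backend
tls             start
# expire idle cached connections, and cap their total lifetime, so a
# stale half-initialized connection is less likely to be reused
idle-timeout    30
conn-ttl        300
# use one cached connection per identity instead of a pool
single-conn     yes
```

Whether these actually help depends on where the race is, but they at least narrow the window in which a half-set-up backend connection can be picked up by a second client connection.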
But we also seem to be getting spurious "Start TLS failed" messages without any competing connections. Here's one using ldap+STARTTLS, with no other ACCEPTs anywhere near:
These aren't spurious - your TLS library has genuinely failed to start a session. Which TLS library are you using? What OS are you running on? The most common cause for periodic failures is running out of entropy for the PRNG.
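A quick way to check the entropy hypothesis on a Linux box such as RHEL 7 is the kernel's standard proc interface (the path is the stock Linux one; the threshold mentioned in the comment is a rule of thumb, not an exact limit):

```shell
# Show the kernel's currently available entropy; values that stay in
# the low hundreds, or that dip toward zero under load, suggest the
# TLS library's PRNG could be starved of entropy.
cat /proc/sys/kernel/random/entropy_avail
```

If it does sit low, installing and running rngd from the rng-tools package is the usual remedy on RHEL.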
--clip--
Jan 17 17:46:20 ldap-r-1 slapd[28738]: conn=529684 fd=62 ACCEPT from IP=128.214.173.135:43148 (IP=0.0.0.0:389)
Jan 17 17:46:20 ldap-r-1 slapd[28738]: conn=529684 op=0 EXT oid=1.3.6.1.4.1.1466.20037
Jan 17 17:46:20 ldap-r-1 slapd[28738]: conn=529684 op=0 STARTTLS
Jan 17 17:46:20 ldap-r-1 slapd[28738]: conn=529684 op=0 RESULT oid= err=0 text=
Jan 17 17:46:20 ldap-r-1 slapd[28738]: conn=529684 fd=62 TLS established tls_ssf=256 ssf=256
Jan 17 17:46:20 ldap-r-1 slapd[28738]: conn=529684 op=1 BIND dn="ou=shibboleth,ou=login,dc=helsinki,dc=fi" method=128
Jan 17 17:46:20 ldap-r-1 slapd[28738]: conn=529684 op=1 RESULT tag=97 err=52 text=Start TLS failed
Jan 17 17:46:22 ldap-r-1 slapd[28738]: conn=529684 fd=62 closed (connection lost)
--clip--
on the proxy, and on the backend:
--clip--
Jan 17 17:46:20 ldap-v-2 slapd[40089]: conn=406149 fd=92 ACCEPT from IP=128.214.222.155:48996 (IP=0.0.0.0:389)
Jan 17 17:46:20 ldap-v-2 slapd[40089]: conn=406149 op=0 EXT oid=1.3.6.1.4.1.1466.20037
Jan 17 17:46:20 ldap-v-2 slapd[40089]: conn=406149 op=0 STARTTLS
Jan 17 17:46:20 ldap-v-2 slapd[40089]: conn=406149 op=0 RESULT oid= err=0 text=
Jan 17 17:46:20 ldap-v-2 slapd[40089]: conn=406149 fd=92 closed (TLS negotiation failure)
--clip--
Any ideas what could be causing these? It's a bit hard to pinpoint the problem, since it happens so rarely - we seem to get a couple dozen of these every hour against around 16 000 successful binds, not really depending on system load. This is RHEL 7 and OpenLDAP 2.4.40.
It may be that the same DN binding very often increases the probability of a "Start TLS failed" error, which I guess would be the first case (a new ACCEPT + BIND for a DN too soon after the previous one). But as I said, these sometimes seem to appear even without any other connections being opened at the same time. (Although there might be some old connections open for the same DN.)
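For tracking the failure rate over time, counting RESULT lines in the slapd syslog output is enough. A sketch - the sample log lines are inlined here purely for illustration; in practice you would point grep at your real syslog file:

```shell
# Build a tiny sample log (stand-in for e.g. /var/log/messages).
cat > /tmp/slapd-sample.log <<'EOF'
Jan 17 17:46:20 ldap-r-1 slapd[28738]: conn=529684 op=1 RESULT tag=97 err=52 text=Start TLS failed
Jan 17 17:46:21 ldap-r-1 slapd[28738]: conn=529685 op=1 RESULT tag=97 err=0 text=
Jan 17 17:46:22 ldap-r-1 slapd[28738]: conn=529686 op=1 RESULT tag=97 err=0 text=
EOF

# err=52 with "Start TLS failed" marks the failing binds; tag=97 err=0
# marks successful bind results.
failures=$(grep -c 'err=52 text=Start TLS failed' /tmp/slapd-sample.log)
successes=$(grep -c 'tag=97 err=0' /tmp/slapd-sample.log)
echo "failures=$failures successes=$successes"
```

Run hourly (e.g. from cron), this would show whether the couple-dozen-per-hour rate correlates with anything else in the logs.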