Hi!
We've got a proxy OpenLDAP setup using the "ldap" proxy database (the newer "meta" backend had some issues, so I couldn't use that one). Most of the time, the proxy bind goes OK, but sometimes, the connection goes like this (proxy log, using ldaps):
--clip--
Jan 7 16:01:12 ldap-r-1 slapd[16851]: conn=7716 fd=96 ACCEPT from IP=128.214.54.152:36460 (IP=0.0.0.0:636)
Jan 7 16:01:12 ldap-r-1 slapd[16851]: conn=7716 fd=96 TLS established tls_ssf=256 ssf=256
Jan 7 16:01:12 ldap-r-1 slapd[16851]: conn=7716 op=0 BIND dn="<removed>" method=128
Jan 7 16:01:12 ldap-r-1 slapd[16851]: conn=7716 op=0 RESULT tag=97 err=52 text=Start TLS failed
Jan 7 16:01:12 ldap-r-1 slapd[16851]: conn=7716 fd=96 closed (connection lost)
--clip--
On the backend, it looks as if a new connection kind of overruns the old one; see what happens with connection 6983:
--clip--
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6983 fd=23 ACCEPT from IP=128.214.222.155:46144 (IP=0.0.0.0:389)
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 fd=41 ACCEPT from IP=128.214.222.155:46146 (IP=0.0.0.0:389)
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6983 op=0 EXT oid=1.3.6.1.4.1.1466.20037
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6983 op=0 STARTTLS
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6983 op=1 UNBIND
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6983 fd=23 closed
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 op=0 EXT oid=1.3.6.1.4.1.1466.20037
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 op=0 STARTTLS
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 op=0 RESULT oid= err=0 text=
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 fd=41 TLS established tls_ssf=256 ssf=256
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 op=1 BIND dn="<removed>" method=128
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 op=1 BIND dn="<removed>" mech=SIMPLE ssf=0
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 op=1 RESULT tag=97 err=0 text=
[SRCH and result lines]
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 op=3 UNBIND
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 fd=41 closed
--clip--
This doesn't seem to have any correlation with the number of binds or server load, at least as far as I'm aware - and anyway, the servers are nowhere near capacity - we only get ~2 binds per second.
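For context, the proxy side is just a plain back-ldap database. A simplified sketch of the kind of definition assumed here (the suffix and URIs below are placeholders, not our actual config), with the proxy doing STARTTLS towards the backends, as seen in the backend logs:

database        ldap
suffix          "dc=example,dc=com"
# hypothetical backend URIs; the real ones point at our ldap-v-* servers
uri             "ldap://backend1.example.com ldap://backend2.example.com"
# have the proxy issue STARTTLS on its backend connections
tls             start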
A normal connection looks like this:
--clip--
Jan 7 16:00:55 ldap-r-1 slapd[16851]: conn=7650 fd=159 ACCEPT from IP=128.214.54.152:36362 (IP=0.0.0.0:636)
Jan 7 16:00:55 ldap-r-1 slapd[16851]: conn=7650 fd=159 TLS established tls_ssf=256 ssf=256
Jan 7 16:00:55 ldap-r-1 slapd[16851]: conn=7650 op=0 BIND dn="<removed>" method=128
Jan 7 16:00:55 ldap-r-1 slapd[16851]: conn=7650 op=0 BIND dn="<removed>" mech=SIMPLE ssf=0
Jan 7 16:00:55 ldap-r-1 slapd[16851]: conn=7650 op=0 RESULT tag=97 err=0 text=
[SRCH and SEARCH RESULT lines]
Jan 7 16:01:12 ldap-r-1 slapd[16851]: conn=7650 op=219 UNBIND
Jan 7 16:01:12 ldap-r-1 slapd[16851]: conn=7650 fd=159 closed
--clip--
So far, it looks as if the problem is that we accept a second connection from the same client before the proxy has had time to do a bind (or indeed, a STARTTLS for the bind) on the backend for the second connection.
But we seem to be getting spurious Start TLS failed messages also without any competing connections. Here's one using ldap+STARTTLS but no other ACCEPTs anywhere near:
--clip--
Jan 17 17:46:20 ldap-r-1 slapd[28738]: conn=529684 fd=62 ACCEPT from IP=128.214.173.135:43148 (IP=0.0.0.0:389)
Jan 17 17:46:20 ldap-r-1 slapd[28738]: conn=529684 op=0 EXT oid=1.3.6.1.4.1.1466.20037
Jan 17 17:46:20 ldap-r-1 slapd[28738]: conn=529684 op=0 STARTTLS
Jan 17 17:46:20 ldap-r-1 slapd[28738]: conn=529684 op=0 RESULT oid= err=0 text=
Jan 17 17:46:20 ldap-r-1 slapd[28738]: conn=529684 fd=62 TLS established tls_ssf=256 ssf=256
Jan 17 17:46:20 ldap-r-1 slapd[28738]: conn=529684 op=1 BIND dn="ou=shibboleth,ou=login,dc=helsinki,dc=fi" method=128
Jan 17 17:46:20 ldap-r-1 slapd[28738]: conn=529684 op=1 RESULT tag=97 err=52 text=Start TLS failed
Jan 17 17:46:22 ldap-r-1 slapd[28738]: conn=529684 fd=62 closed (connection lost)
--clip--
on the proxy; and on the backend:
--clip--
Jan 17 17:46:20 ldap-v-2 slapd[40089]: conn=406149 fd=92 ACCEPT from IP=128.214.222.155:48996 (IP=0.0.0.0:389)
Jan 17 17:46:20 ldap-v-2 slapd[40089]: conn=406149 op=0 EXT oid=1.3.6.1.4.1.1466.20037
Jan 17 17:46:20 ldap-v-2 slapd[40089]: conn=406149 op=0 STARTTLS
Jan 17 17:46:20 ldap-v-2 slapd[40089]: conn=406149 op=0 RESULT oid= err=0 text=
Jan 17 17:46:20 ldap-v-2 slapd[40089]: conn=406149 fd=92 closed (TLS negotiation failure)
--clip--
Successful ldap+STARTTLS connection on the proxy:
--clip--
Jan 17 18:11:01 ldap-r-1 slapd[28738]: conn=531181 fd=54 ACCEPT from IP=128.214.173.137:48700 (IP=0.0.0.0:389)
Jan 17 18:11:01 ldap-r-1 slapd[28738]: conn=531181 op=0 EXT oid=1.3.6.1.4.1.1466.20037
Jan 17 18:11:01 ldap-r-1 slapd[28738]: conn=531181 op=0 STARTTLS
Jan 17 18:11:01 ldap-r-1 slapd[28738]: conn=531181 op=0 RESULT oid= err=0 text=
Jan 17 18:11:01 ldap-r-1 slapd[28738]: conn=531181 fd=54 TLS established tls_ssf=256 ssf=256
Jan 17 18:11:01 ldap-r-1 slapd[28738]: conn=531181 op=1 BIND dn="<removed>" method=128
Jan 17 18:11:01 ldap-r-1 slapd[28738]: conn=531181 op=1 BIND dn="<removed>" mech=SIMPLE ssf=0
Jan 17 18:11:01 ldap-r-1 slapd[28738]: conn=531181 op=1 RESULT tag=97 err=0 text=
[SRCH and SEARCH RESULT lines]
Jan 17 18:11:01 ldap-r-1 slapd[28738]: conn=531181 op=3 UNBIND
Jan 17 18:11:01 ldap-r-1 slapd[28738]: conn=531181 fd=54 closed
--clip--
Any ideas what could be causing these? It's a bit hard to pinpoint the problem, since it happens so rarely - we seem to get a couple dozen of these per hour against around 16,000 successful binds, not really depending on system load. This is RHEL 7 and OpenLDAP 2.4.40.
It may be the case that the same DN binding very often increases the probability of a Start TLS failed error, which I guess would be the first case (a new ACCEPT + BIND for a DN too soon after the previous one). But as I said, these seem to also sometimes appear even without any other connections being opened at the same time. (Although there might be some old connections open for the same DN.)
--Janne Peltonen
University of Helsinki
Janne Peltonen wrote:
Hi!
We've got a proxy OpenLDAP setup using the "ldap" proxy database (the newer "meta" backend had some issues, so I couldn't use that one). Most of the time, the proxy bind goes OK, but sometimes, the connection goes like this (proxy log, using ldaps):
--clip--
Jan 7 16:01:12 ldap-r-1 slapd[16851]: conn=7716 fd=96 ACCEPT from IP=128.214.54.152:36460 (IP=0.0.0.0:636)
Jan 7 16:01:12 ldap-r-1 slapd[16851]: conn=7716 fd=96 TLS established tls_ssf=256 ssf=256
Jan 7 16:01:12 ldap-r-1 slapd[16851]: conn=7716 op=0 BIND dn="<removed>" method=128
Jan 7 16:01:12 ldap-r-1 slapd[16851]: conn=7716 op=0 RESULT tag=97 err=52 text=Start TLS failed
Jan 7 16:01:12 ldap-r-1 slapd[16851]: conn=7716 fd=96 closed (connection lost)
--clip--
On the backend, it looks as if a new connection kind of overruns the old one; see what happens with connection 6983:
Looks odd indeed, but doesn't seem related to the other errors. Would need much finer resolution timestamps to correlate what's going on, unless you know there are no other active connections on the proxy when this occurred on the backend.
--clip--
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6983 fd=23 ACCEPT from IP=128.214.222.155:46144 (IP=0.0.0.0:389)
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 fd=41 ACCEPT from IP=128.214.222.155:46146 (IP=0.0.0.0:389)
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6983 op=0 EXT oid=1.3.6.1.4.1.1466.20037
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6983 op=0 STARTTLS
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6983 op=1 UNBIND
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6983 fd=23 closed
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 op=0 EXT oid=1.3.6.1.4.1.1466.20037
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 op=0 STARTTLS
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 op=0 RESULT oid= err=0 text=
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 fd=41 TLS established tls_ssf=256 ssf=256
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 op=1 BIND dn="<removed>" method=128
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 op=1 BIND dn="<removed>" mech=SIMPLE ssf=0
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 op=1 RESULT tag=97 err=0 text=
[SRCH and result lines]
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 op=3 UNBIND
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 fd=41 closed
--clip--
This doesn't seem to have any correlation with the number of binds or server load, at least as far as I'm aware - and anyway, the servers are nowhere near capacity - we only get ~2 binds per second.
A normal connection looks like this:
--clip--
Jan 7 16:00:55 ldap-r-1 slapd[16851]: conn=7650 fd=159 ACCEPT from IP=128.214.54.152:36362 (IP=0.0.0.0:636)
Jan 7 16:00:55 ldap-r-1 slapd[16851]: conn=7650 fd=159 TLS established tls_ssf=256 ssf=256
Jan 7 16:00:55 ldap-r-1 slapd[16851]: conn=7650 op=0 BIND dn="<removed>" method=128
Jan 7 16:00:55 ldap-r-1 slapd[16851]: conn=7650 op=0 BIND dn="<removed>" mech=SIMPLE ssf=0
Jan 7 16:00:55 ldap-r-1 slapd[16851]: conn=7650 op=0 RESULT tag=97 err=0 text=
[SRCH and SEARCH RESULT lines]
Jan 7 16:01:12 ldap-r-1 slapd[16851]: conn=7650 op=219 UNBIND
Jan 7 16:01:12 ldap-r-1 slapd[16851]: conn=7650 fd=159 closed
--clip--
So far, it looks as if the problem is that we accept a second connection from the same client before the proxy has had time to do a bind (or indeed, a STARTTLS for the bind) on the backend for the second connection.
But we seem to be getting spurious Start TLS failed messages also without any competing connections. Here's one using ldap+STARTTLS but no other ACCEPTs anywhere near:
These aren't spurious - your TLS library has genuinely failed to start a session. Which TLS library are you using? What OS are you running on? The most common cause for periodic failures is running out of entropy for the PRNG.
--clip--
Jan 17 17:46:20 ldap-r-1 slapd[28738]: conn=529684 fd=62 ACCEPT from IP=128.214.173.135:43148 (IP=0.0.0.0:389)
Jan 17 17:46:20 ldap-r-1 slapd[28738]: conn=529684 op=0 EXT oid=1.3.6.1.4.1.1466.20037
Jan 17 17:46:20 ldap-r-1 slapd[28738]: conn=529684 op=0 STARTTLS
Jan 17 17:46:20 ldap-r-1 slapd[28738]: conn=529684 op=0 RESULT oid= err=0 text=
Jan 17 17:46:20 ldap-r-1 slapd[28738]: conn=529684 fd=62 TLS established tls_ssf=256 ssf=256
Jan 17 17:46:20 ldap-r-1 slapd[28738]: conn=529684 op=1 BIND dn="ou=shibboleth,ou=login,dc=helsinki,dc=fi" method=128
Jan 17 17:46:20 ldap-r-1 slapd[28738]: conn=529684 op=1 RESULT tag=97 err=52 text=Start TLS failed
Jan 17 17:46:22 ldap-r-1 slapd[28738]: conn=529684 fd=62 closed (connection lost)
--clip--
on the proxy; and on the backend:
--clip--
Jan 17 17:46:20 ldap-v-2 slapd[40089]: conn=406149 fd=92 ACCEPT from IP=128.214.222.155:48996 (IP=0.0.0.0:389)
Jan 17 17:46:20 ldap-v-2 slapd[40089]: conn=406149 op=0 EXT oid=1.3.6.1.4.1.1466.20037
Jan 17 17:46:20 ldap-v-2 slapd[40089]: conn=406149 op=0 STARTTLS
Jan 17 17:46:20 ldap-v-2 slapd[40089]: conn=406149 op=0 RESULT oid= err=0 text=
Jan 17 17:46:20 ldap-v-2 slapd[40089]: conn=406149 fd=92 closed (TLS negotiation failure)
--clip--
Any ideas what could be causing these? It's a bit hard to pinpoint the problem, since it happens so rarely - we seem to get a couple dozen of these per hour against around 16,000 successful binds, not really depending on system load. This is RHEL 7 and OpenLDAP 2.4.40.
It may be the case that the same DN binding very often increases the probability of a Start TLS failed error, which I guess would be the first case (a new ACCEPT + BIND for a DN too soon after the previous one). But as I said, these seem to also sometimes appear even without any other connections being opened at the same time. (Although there might be some old connections open for the same DN.)
--On Thursday, January 17, 2019 4:52 PM +0000 Howard Chu hyc@symas.com wrote:
But we seem to be getting spurious Start TLS failed messages also without any competing connections. Here's one using ldap+STARTTLS but no other ACCEPTs anywhere near:
These aren't spurious - your TLS library has genuinely failed to start a session. Which TLS library are you using? What OS are you running on? The most common cause for periodic failures is running out of entropy for the PRNG.
They noted RHEL7 and 2.4.40, which would mean MozNSS, as the most recent RHEL7 build of 2.4.44 switched back to OpenSSL. I would just add this to the many reasons not to use RHEL for OpenLDAP.
--Quanah
--
Quanah Gibson-Mount
Product Architect
Symas Corporation
Packaged, certified, and supported LDAP solutions powered by OpenLDAP:
http://www.symas.com
Hi!
Thanks for the replies!
Howard Chu wrote:
On the backend, it looks as if a new connection kind of overruns the old one; see what happens with connection 6983:
Looks odd indeed, but doesn't seem related to the other errors. Would need much finer resolution timestamps to correlate what's going on, unless you know there are no other active connections on the proxy when this occurred on the backend.
Yeah. I don't know for certain, but those are the only connections that would match the dates and the results.
These aren't spurious - your TLS library has genuinely failed to start a session. Which TLS library are you using? What OS are you running on? The most common cause for periodic failures is running out of entropy for the PRNG.
RHEL 7, and slapd seems to be linked to the Mozilla nss libraries.
I called them "spurious" because there doesn't seem to be any correlation between the errors and any external events. But I have to admit I do not understand what kind of activity might cause entropy to be low; I somehow thought it would be a simple case of "more entropy used than the pool has" and then it would be simply caused by too much activity. But these errors seem to come sometimes when the server is more loaded and sometimes when it is less loaded and I haven't found any way to make them more probable. Or less.
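(If low entropy really is the culprit, I suppose I could just watch /proc/sys/kernel/random/entropy_avail on the proxy while the errors happen and try to correlate. Something like this trivial, Linux-specific sketch of my own - not part of OpenLDAP, and untested against our setup - would log the kernel's entropy estimate once a second:)

/* hypothetical helper: log the kernel's entropy estimate once a second
 * so it can be lined up with the slapd log timestamps */
#include <stdio.h>
#include <unistd.h>
#include <time.h>

int main(void)
{
    for (;;) {
        FILE *f = fopen("/proc/sys/kernel/random/entropy_avail", "r");
        int avail = -1;

        if (f != NULL) {
            if (fscanf(f, "%d", &avail) != 1)
                avail = -1;
            fclose(f);
        }
        printf("%ld entropy_avail=%d\n", (long)time(NULL), avail);
        fflush(stdout);
        sleep(1);
    }
    return 0;
}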
Anything I can do about this? I mean, if the connection from the proxy to the backend fails because of Start TLS issues, couldn't the proxy just wait a while and try again once some entropy becomes available? Currently, the problem gets propagated back to the client of the OpenLDAP service, which then has to retry a failed connection.
Quanah Gibson-Mount wrote:
But we seem to be getting spurious Start TLS failed messages also without any competing connections. Here's one using ldap+STARTTLS but no other ACCEPTs anywhere near:
These aren't spurious - your TLS library has genuinely failed to start a session. Which TLS library are you using? What OS are you running on? The most common cause for periodic failures is running out of entropy for the PRNG.
They noted RHEL7 and 2.4.40, which would mean MozNSS, as the most recent RHEL7 build of 2.4.44 switched back to OpenSSL. I would just add this to the many reasons not to use RHEL for OpenLDAP.
The fact that they keep switching the TLS libraries they're linking to? I can roll out my own RPMs and keep them linked to the very same library all the time, but do you think linking to OpenSSL could help resolve my issue? Running out of entropy with only a few starttls calls per second, or only a few ldaps connections per second, seems to be a bit weird to me.
Anyway, thanks again.
--Janne / Helsinki Uni
On Tue, Jan 22, 2019 at 03:10:04PM +0200, Janne Peltonen wrote:
RHEL 7, and slapd seems to be linked to the Mozilla nss libraries.
I called them "spurious" because there doesn't seem to be any correlation between the errors and any external events. But I have to admit I do not understand what kind of activity might cause entropy to be low; I somehow thought it would be a simple case of "more entropy used than the pool has" and then it would be simply caused by too much activity. But these errors seem to come sometimes when the server is more loaded and sometimes when it is less loaded and I haven't found any way to make them more probable. Or less.
Perhaps unrelated, but at $JOB we had to explore using /dev/urandom vs /dev/random on our CentOS 6 servers:
https://unix.stackexchange.com/a/324210
--Janne / Helsinki Uni
--On Tuesday, January 22, 2019 3:10 PM +0200 Janne Peltonen janne.peltonen@helsinki.fi wrote:
Quanah Gibson-Mount wrote:
But we seem to be getting spurious Start TLS failed messages also without any competing connections. Here's one using ldap+STARTTLS but no other ACCEPTs anywhere near:
These aren't spurious - your TLS library has genuinely failed to start a session. Which TLS library are you using? What OS are you running on? The most common cause for periodic failures is running out of entropy for the PRNG.
They noted RHEL7 and 2.4.40, which would mean MozNSS, as the most recent RHEL7 build of 2.4.44 switched back to OpenSSL. I would just add this to the many reasons not to use RHEL for OpenLDAP.
The fact that they keep switching the TLS libraries they're linking to? I can roll out my own RPMs and keep them linked to the very same library all the time, but do you think linking to OpenSSL could help resolve my issue? Running out of entropy with only a few starttls calls per second, or only a few ldaps connections per second, seems to be a bit weird to me.
Hi Janne,
RedHat pursued linking OpenLDAP against MozNSS against the advice of the OpenLDAP foundation. We reluctantly included those patches in the 2.4 series, but they were a constant source of problems. RedHat never disclosed to the OpenLDAP project why exactly they abandoned MozNSS and switched back to OpenSSL.
Regardless, you should at the least update to the latest RHEL7 version from RH to see if it offers any relief from the issue you are encountering. There are also alternatives to the RH build that you can use on RH, such as:
a) The Symas OpenLDAP for Linux packages (currently at 2.4.47). See https://symas.com/linux-openldap-support-symas-corporation/, https://symas.com/linuxopenldap/. These packages are provided for free, with the option of having paid support.
b) The LTB project: https://ltb-project.org/documentation/openldap-rpm#yum_repository
c) The Symas commercial version of OpenLDAP, which requires a support contract and has additional features: https://symas.com/symasopenldap/
Regards, Quanah
--
Quanah Gibson-Mount
Product Architect
Symas Corporation
Packaged, certified, and supported LDAP solutions powered by OpenLDAP:
http://www.symas.com
Hi!
On Tue, Jan 22, 2019 at 06:42:52AM -0800, Quanah Gibson-Mount wrote:
RedHat pursued linking OpenLDAP against MozNSS against the advice of the OpenLDAP foundation. We reluctantly included those patches in the 2.4 series, but they were a constant source of problems. RedHat never disclosed to the OpenLDAP project why exactly they abandoned MozNSS and switched back to OpenSSL.
OK, thanks for the clarification.
Regardless, you should at the least update to the latest RHEL7 version from RH to see if it offers any relief from the issue you are encountering. There are also alternatives to the RH build that you can use on RH, such as:
Yeah, we're looking into that already. Should've probably done that first, before asking on the list. Oh well.
a) The Symas OpenLDAP for Linux packages (currently at 2.4.47). See https://symas.com/linux-openldap-support-symas-corporation/, https://symas.com/linuxopenldap/. These packages are provided for free, with the option of having paid support.
b) The LTB project: https://ltb-project.org/documentation/openldap-rpm#yum_repository
c) The Symas commercial version of OpenLDAP, which requires a support contract and has additional features: https://symas.com/symasopenldap/
Thanks for these suggestions as well.
Best,
Janne / Helsinki Uni
Hi!
On Tue, Jan 22, 2019 at 04:46:14PM +0200, Janne Peltonen wrote:
Regardless, you should at the least update to the latest RHEL7 version from RH to see if it offers any relief from the issue you are encountering. There are also alternatives to the RH build that you can use on RH, such as:
Yeah, we're looking into that already. Should've probably done that first, before asking on the list. Oh well.
We had a look at this, but still ended up with some Start TLS failed errors on the proxy. We were able to reproduce them on a server with no other load by hitting it with 200 simultaneous ldaps connections binding as the same user.
Next, we tried Unto Sten's suggestion: we confirmed that the "timeout" variable is zero, so we go into the "else" branch he mentioned; then, instead of calling the macro in the else branch, we just directly set tv.tv_sec = 3 and tv.tv_usec = 0 (a quick and dirty hack, I know). After that, we were no longer able to get any Start TLS failed errors on the proxy, and all proxy binds completed successfully. To make sure, we downgraded the proxy again, and sure enough, the Start TLS failed errors reappeared, or rather, we began to see some of them again. Upgraded again, and no errors at all.
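For reference, the change amounts to roughly this in ldap_back_start_tls() in back-ldap/bind.c (a sketch of our local hack from memory, not a proposed patch):

if ( timeout ) {
    tv.tv_sec = timeout;
    tv.tv_usec = 0;
} else {
    /* quick and dirty hack: instead of LDAP_BACK_TV_SET( &tv ),
     * which yields 0.1 s, hardcode a 3 second timeout */
    tv.tv_sec = 3;
    tv.tv_usec = 0;
}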
To us, it really seems as if the root of the problem is that the STARTTLS timeout ends up being 0.1 seconds, which is too short if there are any latencies in the network. What would be the correct place to fix this? It appears to me that you should be able to say "timeout extended=5" or something similar in a config file, but in back-ldap/config.c the "extended" timeout option is commented out as unimplemented. So, what would be required to implement it?
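My naive guess - and it is only a guess - is that the parsing side would just need the commented-out "extended=" entry in timeout_table in back-ldap/config.c enabled, along these lines. Whether back-ldap would then actually honour the value everywhere an extended-operation timeout is used, or whether there was a reason it was left unimplemented, I don't know:

static slap_cf_aux_table timeout_table[] = {
    ...
    /* abandon makes little sense */
    /* guess: enable this entry so "timeout extended=<secs>" is accepted */
    { BER_BVC("extended="), SLAP_OP_EXTENDED * sizeof( time_t ), 'u', 0, NULL },
    { BER_BVNULL, 0, 0, 0, NULL }
};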
Relevant files:
back-ldap/bind.c (the ldap_back_start_tls function, setting of tv using the LDAP_BACK_TV_SET macro)
back-ldap/back-ldap.h (definition of LDAP_BACK_TV_SET, which basically sets the timeout to 0.1 s)
back-ldap/config.c (definition of timeout_table)
Best,
Janne / Helsinki Uni
--On Monday, January 28, 2019 6:47 PM +0200 Janne Peltonen janne.peltonen@helsinki.fi wrote:
Next, we tried Unto Sten's suggestion: we confirmed that the "timeout" variable is zero, so we go into the "else" branch he mentioned; then, instead of calling the macro in the else branch, we just directly set tv.tv_sec = 3 and tv.tv_usec = 0 (a quick and dirty hack, I know). After that, we were no longer able to get any Start TLS failed errors on the proxy, and all proxy binds completed successfully. To make sure, we downgraded the proxy again, and sure enough, the Start TLS failed errors reappeared, or rather, we began to see some of them again. Upgraded again, and no errors at all.
To us, it really seems as if the root of the problem is that the STARTTLS timeout ends up being 0.1 seconds, which is too short if there are any latencies in the network. What would be the correct place to fix this? It appears to me that you should be able to say "timeout extended=5" or something similar in a config file, but in back-ldap/config.c the "extended" timeout option is commented out as unimplemented. So, what would be required to implement it?
Relevant files:
back-ldap/bind.c (the ldap_back_start_tls function, setting of tv using the LDAP_BACK_TV_SET macro)
back-ldap/back-ldap.h (definition of LDAP_BACK_TV_SET, which basically sets the timeout to 0.1 s)
back-ldap/config.c (definition of timeout_table)
Please file an issue report at http://www.openldap.org/its/ so this can be tracked and resolved.
Thanks!
--Quanah
--
Quanah Gibson-Mount
Product Architect
Symas Corporation
Packaged, certified, and supported LDAP solutions powered by OpenLDAP:
http://www.symas.com
On Mon, Jan 28, 2019 at 09:02:12AM -0800, Quanah Gibson-Mount wrote:
Please file an issue report at http://www.openldap.org/its/ so this can be tracked and resolved.
Done, #8963. Thanks!
Best,
Janne / Helsinki Uni
Hello,
On Thu, 17 Jan 2019, Howard Chu wrote:
These aren't spurious - your TLS library has genuinely failed to start a session. Which TLS library are you using? What OS are you running on? The most common cause for periodic failures is running out of entropy for the PRNG.
Could it be because of a too-short timeout period? Please see the following code analysis.
In OpenLDAP 2.4.46 on Fedora 29, back-ldap/bind.c has:
rs->sr_err = ldap_back_start_tls( ld, op->o_protocol, &is_tls,
        li->li_uri, flags, li->li_timeout[ SLAP_OP_EXTENDED ],
        &rs->sr_text );
li->li_uri_mutex_do_not_lock = 0;
ldap_pvt_thread_mutex_unlock( &li->li_uri_mutex );
if ( rs->sr_err != LDAP_SUCCESS ) {
        ldap_unbind_ext( ld, NULL, NULL );
        rs->sr_text = "Start TLS failed";
        goto error_return;
So the call to:
ldap_back_start_tls( ld, op->o_protocol, &is_tls, li->li_uri, flags, li->li_timeout[ SLAP_OP_EXTENDED ], &rs->sr_text )
is the one that fails.
By the way, I am wondering here about the correctness of parameter
li->li_timeout[ SLAP_OP_EXTENDED ]
slap.h has:
/*
 * Operation indices
 */
typedef enum {
        SLAP_OP_BIND = 0,
        SLAP_OP_UNBIND,
        SLAP_OP_SEARCH,
        SLAP_OP_COMPARE,
        SLAP_OP_MODIFY,
        SLAP_OP_MODRDN,
        SLAP_OP_ADD,
        SLAP_OP_DELETE,
        SLAP_OP_ABANDON,
        SLAP_OP_EXTENDED,
        SLAP_OP_LAST
} slap_op_t;
So SLAP_OP_EXTENDED has value 9. But back-ldap/config.c has:
/* see enum in slap.h */
static slap_cf_aux_table timeout_table[] = {
        { BER_BVC("bind="), SLAP_OP_BIND * sizeof( time_t ), 'u', 0, NULL },
        /* unbind makes no sense */
        { BER_BVC("add="), SLAP_OP_ADD * sizeof( time_t ), 'u', 0, NULL },
        { BER_BVC("delete="), SLAP_OP_DELETE * sizeof( time_t ), 'u', 0, NULL },
        { BER_BVC("modrdn="), SLAP_OP_MODRDN * sizeof( time_t ), 'u', 0, NULL },
        { BER_BVC("modify="), SLAP_OP_MODIFY * sizeof( time_t ), 'u', 0, NULL },
        { BER_BVC("compare="), SLAP_OP_COMPARE * sizeof( time_t ), 'u', 0, NULL },
        { BER_BVC("search="), SLAP_OP_SEARCH * sizeof( time_t ), 'u', 0, NULL },
        /* abandon makes little sense */
#if 0   /* not implemented yet */
        { BER_BVC("extended="), SLAP_OP_EXTENDED * sizeof( time_t ), 'u', 0, NULL },
#endif
        { BER_BVNULL, 0, 0, 0, NULL }
};
Hmmm, the "extended=" entry is not compiled in, so does li->li_timeout[ SLAP_OP_EXTENDED ] end up pointing outside of this timeout_table[]? I am not sure.
In any case, back-ldap/bind.c has:
static int
ldap_back_start_tls( LDAP *ld, int protocol, int *is_tls, const char *url,
        unsigned flags, int timeout, const char **text )
{
...
The async code is compiled in because of a #define in back-ldap.h:
#ifdef SLAP_STARTTLS_ASYNCHRONOUS
...
if ( timeout ) {
        tv.tv_sec = timeout;
        tv.tv_usec = 0;
} else {
        LDAP_BACK_TV_SET( &tv );
}
Now I assume (timeout == 0) and we go to the else-branch.
The LDAP_BACK_TV_SET macro is:
#define LDAP_BACK_RESULT_TIMEOUT        (0)
#define LDAP_BACK_RESULT_UTIMEOUT       (100000)
#define LDAP_BACK_TV_SET(tv) \
        do { \
                (tv)->tv_sec = LDAP_BACK_RESULT_TIMEOUT; \
                (tv)->tv_usec = LDAP_BACK_RESULT_UTIMEOUT; \
        } while ( 0 )
My current guess is that the 0.1 second default timeout is set, and for some reason, the slapd backend TLS initialization sometimes takes longer than that, causing the TLS failure.
Do you think this is a possible scenario?
If so, as a temporary fix, would it be safe to hardcode, say 5 seconds for the timeout just for the TLS timeout else-branch?
Best regards, Unto Sten
Hello again,
On Tue, 22 Jan 2019, openldap-technical@kolttonen.fi wrote:
By the way, I am wondering here about the correctness of parameter
li->li_timeout[ SLAP_OP_EXTENDED ]
slap.h has:
/*
 * Operation indices
 */
typedef enum {
        SLAP_OP_BIND = 0,
        SLAP_OP_UNBIND,
        SLAP_OP_SEARCH,
        SLAP_OP_COMPARE,
        SLAP_OP_MODIFY,
        SLAP_OP_MODRDN,
        SLAP_OP_ADD,
        SLAP_OP_DELETE,
        SLAP_OP_ABANDON,
        SLAP_OP_EXTENDED,
        SLAP_OP_LAST
} slap_op_t;
So SLAP_OP_EXTENDED has value 9. But back-ldap/config.c has:
/* see enum in slap.h */
static slap_cf_aux_table timeout_table[] = {
        { BER_BVC("bind="), SLAP_OP_BIND * sizeof( time_t ), 'u', 0, NULL },
        /* unbind makes no sense */
        { BER_BVC("add="), SLAP_OP_ADD * sizeof( time_t ), 'u', 0, NULL },
        { BER_BVC("delete="), SLAP_OP_DELETE * sizeof( time_t ), 'u', 0, NULL },
        { BER_BVC("modrdn="), SLAP_OP_MODRDN * sizeof( time_t ), 'u', 0, NULL },
        { BER_BVC("modify="), SLAP_OP_MODIFY * sizeof( time_t ), 'u', 0, NULL },
        { BER_BVC("compare="), SLAP_OP_COMPARE * sizeof( time_t ), 'u', 0, NULL },
        { BER_BVC("search="), SLAP_OP_SEARCH * sizeof( time_t ), 'u', 0, NULL },
        /* abandon makes little sense */
#if 0   /* not implemented yet */
        { BER_BVC("extended="), SLAP_OP_EXTENDED * sizeof( time_t ), 'u', 0, NULL },
#endif
        { BER_BVNULL, 0, 0, 0, NULL }
};
Hmmm, the "extended=" entry is not compiled in, so does li->li_timeout[ SLAP_OP_EXTENDED ] end up pointing outside of this timeout_table[]? I am not sure.
No, it does not. I read some more code, and "li" is of type struct ldapinfo_t, which has the member:
time_t li_timeout[ SLAP_OP_LAST ];
And hence we have an array:
li_timeout[ 10 ]
so li_timeout[ SLAP_OP_EXTENDED ] is li_timeout[ 9 ], and there is nothing wrong with that.
I do not know if I understand the code here, but I suppose this li_timeout[ SLAP_OP_EXTENDED ] has value 0. That is why the else-branch is taken, giving the default 0.1 second timeout for TLS initialization. Is this correct, and perhaps safe to increase the value here?
Best regards, Unto Sten