Hi!
We've got a proxy OpenLDAP setup using the "ldap" proxy database (the newer "meta" backend had some issues, so I couldn't use that one). Most of the time, the proxy bind goes OK, but sometimes, the connection goes like this (proxy log, using ldaps):
--clip--
Jan 7 16:01:12 ldap-r-1 slapd[16851]: conn=7716 fd=96 ACCEPT from IP=128.214.54.152:36460 (IP=0.0.0.0:636)
Jan 7 16:01:12 ldap-r-1 slapd[16851]: conn=7716 fd=96 TLS established tls_ssf=256 ssf=256
Jan 7 16:01:12 ldap-r-1 slapd[16851]: conn=7716 op=0 BIND dn="<removed>" method=128
Jan 7 16:01:12 ldap-r-1 slapd[16851]: conn=7716 op=0 RESULT tag=97 err=52 text=Start TLS failed
Jan 7 16:01:12 ldap-r-1 slapd[16851]: conn=7716 fd=96 closed (connection lost)
--clip--
On the backend, it looks as if a new connection kind of overruns the old one; see what happens with connection 6983:
--clip--
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6983 fd=23 ACCEPT from IP=128.214.222.155:46144 (IP=0.0.0.0:389)
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 fd=41 ACCEPT from IP=128.214.222.155:46146 (IP=0.0.0.0:389)
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6983 op=0 EXT oid=1.3.6.1.4.1.1466.20037
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6983 op=0 STARTTLS
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6983 op=1 UNBIND
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6983 fd=23 closed
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 op=0 EXT oid=1.3.6.1.4.1.1466.20037
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 op=0 STARTTLS
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 op=0 RESULT oid= err=0 text=
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 fd=41 TLS established tls_ssf=256 ssf=256
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 op=1 BIND dn="<removed>" method=128
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 op=1 BIND dn="<removed>" mech=SIMPLE ssf=0
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 op=1 RESULT tag=97 err=0 text=
[SRCH and result lines]
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 op=3 UNBIND
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 fd=41 closed
--clip--
This doesn't seem to have any correlation with the number of binds or server load, at least as far as I'm aware - and anyway, the servers are nowhere near capacity - we only get ~2 binds per second.
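For context, the proxy side is just a plain back-ldap database. A simplified sketch of the kind of definition assumed here (the suffix and URIs below are placeholders, not our actual config), with the proxy doing STARTTLS towards the backends, as seen in the backend logs:

database        ldap
suffix          "dc=example,dc=com"
# hypothetical backend URIs; the real ones point at our ldap-v-* servers
uri             "ldap://backend1.example.com ldap://backend2.example.com"
# have the proxy issue STARTTLS on its backend connections
tls             start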
A normal connection looks like this:
--clip--
Jan 7 16:00:55 ldap-r-1 slapd[16851]: conn=7650 fd=159 ACCEPT from IP=128.214.54.152:36362 (IP=0.0.0.0:636)
Jan 7 16:00:55 ldap-r-1 slapd[16851]: conn=7650 fd=159 TLS established tls_ssf=256 ssf=256
Jan 7 16:00:55 ldap-r-1 slapd[16851]: conn=7650 op=0 BIND dn="<removed>" method=128
Jan 7 16:00:55 ldap-r-1 slapd[16851]: conn=7650 op=0 BIND dn="<removed>" mech=SIMPLE ssf=0
Jan 7 16:00:55 ldap-r-1 slapd[16851]: conn=7650 op=0 RESULT tag=97 err=0 text=
[SRCH and SEARCH RESULT lines]
Jan 7 16:01:12 ldap-r-1 slapd[16851]: conn=7650 op=219 UNBIND
Jan 7 16:01:12 ldap-r-1 slapd[16851]: conn=7650 fd=159 closed
--clip--
So far, it looks as if the problem is that we accept a second connection from the same client before the proxy has had time to do a bind (or indeed, a STARTTLS for the bind) on the backend for the second connection.
But we seem to be getting spurious Start TLS failed messages also without any competing connections. Here's one using ldap+STARTTLS but no other ACCEPTs anywhere near:
--clip--
Jan 17 17:46:20 ldap-r-1 slapd[28738]: conn=529684 fd=62 ACCEPT from IP=128.214.173.135:43148 (IP=0.0.0.0:389)
Jan 17 17:46:20 ldap-r-1 slapd[28738]: conn=529684 op=0 EXT oid=1.3.6.1.4.1.1466.20037
Jan 17 17:46:20 ldap-r-1 slapd[28738]: conn=529684 op=0 STARTTLS
Jan 17 17:46:20 ldap-r-1 slapd[28738]: conn=529684 op=0 RESULT oid= err=0 text=
Jan 17 17:46:20 ldap-r-1 slapd[28738]: conn=529684 fd=62 TLS established tls_ssf=256 ssf=256
Jan 17 17:46:20 ldap-r-1 slapd[28738]: conn=529684 op=1 BIND dn="ou=shibboleth,ou=login,dc=helsinki,dc=fi" method=128
Jan 17 17:46:20 ldap-r-1 slapd[28738]: conn=529684 op=1 RESULT tag=97 err=52 text=Start TLS failed
Jan 17 17:46:22 ldap-r-1 slapd[28738]: conn=529684 fd=62 closed (connection lost)
--clip--
on the proxy; and on the backend:
--clip--
Jan 17 17:46:20 ldap-v-2 slapd[40089]: conn=406149 fd=92 ACCEPT from IP=128.214.222.155:48996 (IP=0.0.0.0:389)
Jan 17 17:46:20 ldap-v-2 slapd[40089]: conn=406149 op=0 EXT oid=1.3.6.1.4.1.1466.20037
Jan 17 17:46:20 ldap-v-2 slapd[40089]: conn=406149 op=0 STARTTLS
Jan 17 17:46:20 ldap-v-2 slapd[40089]: conn=406149 op=0 RESULT oid= err=0 text=
Jan 17 17:46:20 ldap-v-2 slapd[40089]: conn=406149 fd=92 closed (TLS negotiation failure)
--clip--
Successful ldap+STARTTLS connection on the proxy:
--clip--
Jan 17 18:11:01 ldap-r-1 slapd[28738]: conn=531181 fd=54 ACCEPT from IP=128.214.173.137:48700 (IP=0.0.0.0:389)
Jan 17 18:11:01 ldap-r-1 slapd[28738]: conn=531181 op=0 EXT oid=1.3.6.1.4.1.1466.20037
Jan 17 18:11:01 ldap-r-1 slapd[28738]: conn=531181 op=0 STARTTLS
Jan 17 18:11:01 ldap-r-1 slapd[28738]: conn=531181 op=0 RESULT oid= err=0 text=
Jan 17 18:11:01 ldap-r-1 slapd[28738]: conn=531181 fd=54 TLS established tls_ssf=256 ssf=256
Jan 17 18:11:01 ldap-r-1 slapd[28738]: conn=531181 op=1 BIND dn="<removed>" method=128
Jan 17 18:11:01 ldap-r-1 slapd[28738]: conn=531181 op=1 BIND dn="<removed>" mech=SIMPLE ssf=0
Jan 17 18:11:01 ldap-r-1 slapd[28738]: conn=531181 op=1 RESULT tag=97 err=0 text=
[SRCH and SEARCH RESULT lines]
Jan 17 18:11:01 ldap-r-1 slapd[28738]: conn=531181 op=3 UNBIND
Jan 17 18:11:01 ldap-r-1 slapd[28738]: conn=531181 fd=54 closed
--clip--
Any ideas what could be causing these? It's a bit hard to pinpoint the problem, since it happens so rarely - we seem to get a couple dozen of these per hour against around 16,000 successful binds, not really depending on system load. This is RHEL 7 and OpenLDAP 2.4.40.
It may be the case that the same DN binding very often increases the probability of a Start TLS failed error, which I guess would be the first case (a new ACCEPT + BIND for a DN too soon after the previous one). But as I said, these seem to also sometimes appear even without any other connections being opened at the same time. (Although there might be some old connections open for the same DN.)
--Janne Peltonen
University of Helsinki
Janne Peltonen wrote:
Hi!
We've got a proxy OpenLDAP setup using the "ldap" proxy database (the newer "meta" backend had some issues, so I couldn't use that one). Most of the time, the proxy bind goes OK, but sometimes, the connection goes like this (proxy log, using ldaps):
--clip--
Jan 7 16:01:12 ldap-r-1 slapd[16851]: conn=7716 fd=96 ACCEPT from IP=128.214.54.152:36460 (IP=0.0.0.0:636)
Jan 7 16:01:12 ldap-r-1 slapd[16851]: conn=7716 fd=96 TLS established tls_ssf=256 ssf=256
Jan 7 16:01:12 ldap-r-1 slapd[16851]: conn=7716 op=0 BIND dn="<removed>" method=128
Jan 7 16:01:12 ldap-r-1 slapd[16851]: conn=7716 op=0 RESULT tag=97 err=52 text=Start TLS failed
Jan 7 16:01:12 ldap-r-1 slapd[16851]: conn=7716 fd=96 closed (connection lost)
--clip--
On the backend, it looks as if a new connection kind of overruns the old one; see what happens with connection 6983:
Looks odd indeed, but doesn't seem related to the other errors. Would need much finer resolution timestamps to correlate what's going on, unless you know there are no other active connections on the proxy when this occurred on the backend.
--clip--
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6983 fd=23 ACCEPT from IP=128.214.222.155:46144 (IP=0.0.0.0:389)
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 fd=41 ACCEPT from IP=128.214.222.155:46146 (IP=0.0.0.0:389)
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6983 op=0 EXT oid=1.3.6.1.4.1.1466.20037
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6983 op=0 STARTTLS
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6983 op=1 UNBIND
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6983 fd=23 closed
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 op=0 EXT oid=1.3.6.1.4.1.1466.20037
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 op=0 STARTTLS
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 op=0 RESULT oid= err=0 text=
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 fd=41 TLS established tls_ssf=256 ssf=256
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 op=1 BIND dn="<removed>" method=128
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 op=1 BIND dn="<removed>" mech=SIMPLE ssf=0
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 op=1 RESULT tag=97 err=0 text=
[SRCH and result lines]
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 op=3 UNBIND
Jan 7 16:01:12 ldap-v-2 slapd[61739]: conn=6984 fd=41 closed
--clip--
This doesn't seem to have any correlation with the number of binds or server load, at least as far as I'm aware - and anyway, the servers are nowhere near capacity - we only get ~2 binds per second.
A normal connection looks like this:
--clip--
Jan 7 16:00:55 ldap-r-1 slapd[16851]: conn=7650 fd=159 ACCEPT from IP=128.214.54.152:36362 (IP=0.0.0.0:636)
Jan 7 16:00:55 ldap-r-1 slapd[16851]: conn=7650 fd=159 TLS established tls_ssf=256 ssf=256
Jan 7 16:00:55 ldap-r-1 slapd[16851]: conn=7650 op=0 BIND dn="<removed>" method=128
Jan 7 16:00:55 ldap-r-1 slapd[16851]: conn=7650 op=0 BIND dn="<removed>" mech=SIMPLE ssf=0
Jan 7 16:00:55 ldap-r-1 slapd[16851]: conn=7650 op=0 RESULT tag=97 err=0 text=
[SRCH and SEARCH RESULT lines]
Jan 7 16:01:12 ldap-r-1 slapd[16851]: conn=7650 op=219 UNBIND
Jan 7 16:01:12 ldap-r-1 slapd[16851]: conn=7650 fd=159 closed
--clip--
So far, it looks as if the problem is that we accept a second connection from the same client before the proxy has had time to do a bind (or indeed, a STARTTLS for the bind) on the backend for the second connection.
But we seem to be getting spurious Start TLS failed messages also without any competing connections. Here's one using ldap+STARTTLS but no other ACCEPTs anywhere near:
These aren't spurious - your TLS library has genuinely failed to start a session. Which TLS library are you using? What OS are you running on? The most common cause for periodic failures is running out of entropy for the PRNG.
--clip--
Jan 17 17:46:20 ldap-r-1 slapd[28738]: conn=529684 fd=62 ACCEPT from IP=128.214.173.135:43148 (IP=0.0.0.0:389)
Jan 17 17:46:20 ldap-r-1 slapd[28738]: conn=529684 op=0 EXT oid=1.3.6.1.4.1.1466.20037
Jan 17 17:46:20 ldap-r-1 slapd[28738]: conn=529684 op=0 STARTTLS
Jan 17 17:46:20 ldap-r-1 slapd[28738]: conn=529684 op=0 RESULT oid= err=0 text=
Jan 17 17:46:20 ldap-r-1 slapd[28738]: conn=529684 fd=62 TLS established tls_ssf=256 ssf=256
Jan 17 17:46:20 ldap-r-1 slapd[28738]: conn=529684 op=1 BIND dn="ou=shibboleth,ou=login,dc=helsinki,dc=fi" method=128
Jan 17 17:46:20 ldap-r-1 slapd[28738]: conn=529684 op=1 RESULT tag=97 err=52 text=Start TLS failed
Jan 17 17:46:22 ldap-r-1 slapd[28738]: conn=529684 fd=62 closed (connection lost)
--clip--
on the proxy; and on the backend:
--clip--
Jan 17 17:46:20 ldap-v-2 slapd[40089]: conn=406149 fd=92 ACCEPT from IP=128.214.222.155:48996 (IP=0.0.0.0:389)
Jan 17 17:46:20 ldap-v-2 slapd[40089]: conn=406149 op=0 EXT oid=1.3.6.1.4.1.1466.20037
Jan 17 17:46:20 ldap-v-2 slapd[40089]: conn=406149 op=0 STARTTLS
Jan 17 17:46:20 ldap-v-2 slapd[40089]: conn=406149 op=0 RESULT oid= err=0 text=
Jan 17 17:46:20 ldap-v-2 slapd[40089]: conn=406149 fd=92 closed (TLS negotiation failure)
--clip--
Any ideas what could be causing these? It's a bit hard to pinpoint the problem, since it happens so rarely - we seem to get a couple dozen of these per hour against around 16,000 successful binds, not really depending on system load. This is RHEL 7 and OpenLDAP 2.4.40.
It may be the case that the same DN binding very often increases the probability of a Start TLS failed error, which I guess would be the first case (a new ACCEPT + BIND for a DN too soon after the previous one). But as I said, these seem to also sometimes appear even without any other connections being opened at the same time. (Although there might be some old connections open for the same DN.)
--On Thursday, January 17, 2019 4:52 PM +0000 Howard Chu hyc@symas.com wrote:
But we seem to be getting spurious Start TLS failed messages also without any competing connections. Here's one using ldap+STARTTLS but no other ACCEPTs anywhere near:
These aren't spurious - your TLS library has genuinely failed to start a session. Which TLS library are you using? What OS are you running on? The most common cause for periodic failures is running out of entropy for the PRNG.
They noted RHEL7 and 2.4.40, which would mean MozNSS, as the most recent RHEL7 build of 2.4.44 switched back to OpenSSL. I would just add this to the many reasons not to use RHEL for OpenLDAP.
--Quanah
--
Quanah Gibson-Mount
Product Architect
Symas Corporation
Packaged, certified, and supported LDAP solutions powered by OpenLDAP:
http://www.symas.com
Hi!
Thanks for the replies!
Howard Chu wrote:
On the backend, it looks as if a new connection kind of overruns the old one; see what happens with connection 6983:
Looks odd indeed, but doesn't seem related to the other errors. Would need much finer resolution timestamps to correlate what's going on, unless you know there are no other active connections on the proxy when this occurred on the backend.
Yeah. I don't know for certain, but those are the only connections that would match the dates and the results.
These aren't spurious - your TLS library has genuinely failed to start a session. Which TLS library are you using? What OS are you running on? The most common cause for periodic failures is running out of entropy for the PRNG.
RHEL 7, and slapd seems to be linked to the Mozilla nss libraries.
I called them "spurious" because there doesn't seem to be any correlation between the errors and any external events. But I have to admit I do not understand what kind of activity might cause entropy to be low; I somehow thought it would be a simple case of "more entropy used than the pool has" and then it would be simply caused by too much activity. But these errors seem to come sometimes when the server is more loaded and sometimes when it is less loaded and I haven't found any way to make them more probable. Or less.
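(If low entropy really is the culprit, I suppose I could just watch /proc/sys/kernel/random/entropy_avail on the proxy while the errors happen and try to correlate. Something like this trivial, Linux-specific sketch of my own - not part of OpenLDAP, and untested against our setup - would log the kernel's entropy estimate once a second:)

/* hypothetical helper: log the kernel's entropy estimate once a second
 * so it can be lined up with the slapd log timestamps */
#include <stdio.h>
#include <unistd.h>
#include <time.h>

int main(void)
{
    for (;;) {
        FILE *f = fopen("/proc/sys/kernel/random/entropy_avail", "r");
        int avail = -1;

        if (f != NULL) {
            if (fscanf(f, "%d", &avail) != 1)
                avail = -1;
            fclose(f);
        }
        printf("%ld entropy_avail=%d\n", (long)time(NULL), avail);
        fflush(stdout);
        sleep(1);
    }
    return 0;
}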
Anything I can do about this? I mean, if the connection from the proxy to the backend fails because of Start TLS issues, couldn't the proxy just wait a while and try again once some entropy becomes available? Currently, the problem gets propagated back to the client of the OpenLDAP service, which then has to retry a failed connection.
Quanah Gibson-Mount wrote:
But we seem to be getting spurious Start TLS failed messages also without any competing connections. Here's one using ldap+STARTTLS but no other ACCEPTs anywhere near:
These aren't spurious - your TLS library has genuinely failed to start a session. Which TLS library are you using? What OS are you running on? The most common cause for periodic failures is running out of entropy for the PRNG.
They noted RHEL7 and 2.4.40, which would mean MozNSS, as the most recent RHEL7 build of 2.4.44 switched back to OpenSSL. I would just add this to the many reasons not to use RHEL for OpenLDAP.
The fact that they keep switching the TLS libraries they're linking to? I can roll out my own RPMs and keep them linked to the very same library all the time, but do you think linking to OpenSSL could help resolve my issue? Running out of entropy with only a few starttls calls per second, or only a few ldaps connections per second, seems to be a bit weird to me.
Anyway, thanks again.
--Janne / Helsinki Uni
On Tue, Jan 22, 2019 at 03:10:04PM +0200, Janne Peltonen wrote:
RHEL 7, and slapd seems to be linked to the Mozilla nss libraries.
I called them "spurious" because there doesn't seem to be any correlation between the errors and any external events. But I have to admit I do not understand what kind of activity might cause entropy to be low; I somehow thought it would be a simple case of "more entropy used than the pool has" and then it would be simply caused by too much activity. But these errors seem to come sometimes when the server is more loaded and sometimes when it is less loaded and I haven't found any way to make them more probable. Or less.
Perhaps unrelated, but at $JOB we had to explore using /dev/urandom vs /dev/random on our CentOS 6 servers:
https://unix.stackexchange.com/a/324210
--Janne / Helsinki Uni
--On Tuesday, January 22, 2019 3:10 PM +0200 Janne Peltonen janne.peltonen@helsinki.fi wrote:
Quanah Gibson-Mount wrote:
But we seem to be getting spurious Start TLS failed messages also without any competing connections. Here's one using ldap+STARTTLS but no other ACCEPTs anywhere near:
These aren't spurious - your TLS library has genuinely failed to start a session. Which TLS library are you using? What OS are you running on? The most common cause for periodic failures is running out of entropy for the PRNG.
They noted RHEL7 and 2.4.40, which would mean MozNSS, as the most recent RHEL7 build of 2.4.44 switched back to OpenSSL. I would just add this to the many reasons not to use RHEL for OpenLDAP.
The fact that they keep switching the TLS libraries they're linking to? I can roll out my own RPMs and keep them linked to the very same library all the time, but do you think linking to OpenSSL could help resolve my issue? Running out of entropy with only a few starttls calls per second, or only a few ldaps connections per second, seems to be a bit weird to me.
Hi Janne,
RedHat pursued linking OpenLDAP against MozNSS against the advice of the OpenLDAP foundation. We reluctantly included those patches in the 2.4 series, but they were a constant source of problems. RedHat never disclosed to the OpenLDAP project why exactly they abandoned MozNSS and switched back to OpenSSL.
Regardless, you should at the least update to the latest RHEL7 version from RH to see if it offers any relief from the issue you are encountering. There are also alternatives to the RH build that you can use on RH, such as:
a) The Symas OpenLDAP for Linux packages (currently at 2.4.47). See https://symas.com/linux-openldap-support-symas-corporation/, https://symas.com/linuxopenldap/. These packages are provided for free, with the option of having paid support.
b) The LTB project: https://ltb-project.org/documentation/openldap-rpm#yum_repository
c) The Symas commercial version of OpenLDAP, which requires a support contract and has additional features: https://symas.com/symasopenldap/
Regards, Quanah
--
Quanah Gibson-Mount
Product Architect
Symas Corporation
Packaged, certified, and supported LDAP solutions powered by OpenLDAP:
http://www.symas.com
Hi!
On Tue, Jan 22, 2019 at 06:42:52AM -0800, Quanah Gibson-Mount wrote:
RedHat pursued linking OpenLDAP against MozNSS against the advice of the OpenLDAP foundation. We reluctantly included those patches in the 2.4 series, but they were a constant source of problems. RedHat never disclosed to the OpenLDAP project why exactly they abandoned MozNSS and switched back to OpenSSL.
OK, thanks for the clarification.
Regardless, you should at the least update to the latest RHEL7 version from RH to see if it offers any relief from the issue you are encountering. There are also alternatives to the RH build that you can use on RH, such as:
Yeah, we're looking into that already. Should've probably done that first, before asking on the list. Oh well.
a) The Symas OpenLDAP for Linux packages (currently at 2.4.47). See https://symas.com/linux-openldap-support-symas-corporation/, https://symas.com/linuxopenldap/. These packages are provided for free, with the option of having paid support.
b) The LTB project: https://ltb-project.org/documentation/openldap-rpm#yum_repository
c) The Symas commercial version of OpenLDAP, which requires a support contract and has additional features: https://symas.com/symasopenldap/
Thanks for these suggestions as well.
Best,
Janne / Helsinki Uni
Hi!
On Tue, Jan 22, 2019 at 04:46:14PM +0200, Janne Peltonen wrote:
Regardless, you should at the least update to the latest RHEL7 version from RH to see if it offers any relief from the issue you are encountering. There are also alternatives to the RH build that you can use on RH, such as:
Yeah, we're looking into that already. Should've probably done that first, before asking on the list. Oh well.
We had a look at this, but still ended up with some Start TLS failed errors on the proxy. We were able to reproduce them on a server with no other load by hitting it with 200 simultaneous ldaps connections binding as the same user.
Next, we tried Unto Sten's suggestion: we confirmed that the "timeout" variable is zero, so we go into the "else" branch he mentioned; then, instead of calling the macro in the else branch, we just directly set tv.tv_sec = 3 and tv.tv_usec = 0 (a quick and dirty hack, I know). After that, we were no longer able to get any Start TLS failed errors on the proxy, and all proxy binds completed successfully. To make sure, we downgraded the proxy again, and sure enough, the Start TLS failed errors reappeared, or rather, we began to see some of them again. Upgraded again, and no errors at all.
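For reference, the change amounts to roughly this in ldap_back_start_tls() in back-ldap/bind.c (a sketch of our local hack from memory, not a proposed patch):

if ( timeout ) {
    tv.tv_sec = timeout;
    tv.tv_usec = 0;
} else {
    /* quick and dirty hack: instead of LDAP_BACK_TV_SET( &tv ),
     * which yields 0.1 s, hardcode a 3 second timeout */
    tv.tv_sec = 3;
    tv.tv_usec = 0;
}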
To us, it really seems as if the root of the problem is that the STARTTLS timeout ends up being 0.1 seconds, which is too short if there are any latencies in the network. What would be the correct place to fix this? It appears to me that you should be able to say "timeout extended=5" or something similar in a config file, but in back-ldap/config.c the "extended" timeout option is commented out as unimplemented. So, what would be required to implement it?
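My naive guess - and it is only a guess - is that the parsing side would just need the commented-out "extended=" entry in timeout_table in back-ldap/config.c enabled, along these lines. Whether back-ldap would then actually honour the value everywhere an extended-operation timeout is used, or whether there was a reason it was left unimplemented, I don't know:

static slap_cf_aux_table timeout_table[] = {
    ...
    /* abandon makes little sense */
    /* guess: enable this entry so "timeout extended=<secs>" is accepted */
    { BER_BVC("extended="), SLAP_OP_EXTENDED * sizeof( time_t ), 'u', 0, NULL },
    { BER_BVNULL, 0, 0, 0, NULL }
};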
Relevant files:
back-ldap/bind.c (the ldap_back_start_tls function, setting of tv using the LDAP_BACK_TV_SET macro)
back-ldap/back-ldap.h (definition of LDAP_BACK_TV_SET, which basically sets the timeout to 0.1 s)
back-ldap/config.c (definition of timeout_table)
Best,
Janne / Helsinki Uni
--On Monday, January 28, 2019 6:47 PM +0200 Janne Peltonen janne.peltonen@helsinki.fi wrote:
Next, we tried Unto Sten's suggestion: we confirmed that the "timeout" variable is zero, so we go into the "else" branch he mentioned; then, instead of calling the macro in the else branch, we just directly set tv.tv_sec = 3 and tv.tv_usec = 0 (a quick and dirty hack, I know). After that, we were no longer able to get any Start TLS failed errors on the proxy, and all proxy binds completed successfully. To make sure, we downgraded the proxy again, and sure enough, the Start TLS failed errors reappeared, or rather, we began to see some of them again. Upgraded again, and no errors at all.
To us, it really seems as if the root of the problem is that the STARTTLS timeout ends up being 0.1 seconds, which is too short if there are any latencies in the network. What would be the correct place to fix this? It appears to me that you should be able to say "timeout extended=5" or something similar in a config file, but in back-ldap/config.c the "extended" timeout option is commented out as unimplemented. So, what would be required to implement it?
Relevant files:
back-ldap/bind.c (the ldap_back_start_tls function, setting of tv using the LDAP_BACK_TV_SET macro)
back-ldap/back-ldap.h (definition of LDAP_BACK_TV_SET, which basically sets the timeout to 0.1 s)
back-ldap/config.c (definition of timeout_table)
Please file an issue report at http://www.openldap.org/its/ so this can be tracked and resolved.
Thanks!
--Quanah
--
Quanah Gibson-Mount
Product Architect
Symas Corporation
Packaged, certified, and supported LDAP solutions powered by OpenLDAP:
http://www.symas.com
On Mon, Jan 28, 2019 at 09:02:12AM -0800, Quanah Gibson-Mount wrote:
Please file an issue report at http://www.openldap.org/its/ so this can be tracked and resolved.
Done, #8963. Thanks!
Best,
Janne / Helsinki Uni
Hello,
On Thu, 17 Jan 2019, Howard Chu wrote:
These aren't spurious - your TLS library has genuinely failed to start a session. Which TLS library are you using? What OS are you running on? The most common cause for periodic failures is running out of entropy for the PRNG.
Could it be because of a too-short timeout period? Please see the following code analysis.
In OpenLDAP 2.4.46 on Fedora 29, back-ldap/bind.c has:
rs->sr_err = ldap_back_start_tls( ld, op->o_protocol, &is_tls,
        li->li_uri, flags, li->li_timeout[ SLAP_OP_EXTENDED ],
        &rs->sr_text );
li->li_uri_mutex_do_not_lock = 0;
ldap_pvt_thread_mutex_unlock( &li->li_uri_mutex );
if ( rs->sr_err != LDAP_SUCCESS ) {
        ldap_unbind_ext( ld, NULL, NULL );
        rs->sr_text = "Start TLS failed";
        goto error_return;
So the call to:
ldap_back_start_tls( ld, op->o_protocol, &is_tls, li->li_uri, flags, li->li_timeout[ SLAP_OP_EXTENDED ], &rs->sr_text )
is the one that fails.
By the way, I am wondering here about the correctness of parameter
li->li_timeout[ SLAP_OP_EXTENDED ]
slap.h has:
/*
 * Operation indices
 */
typedef enum {
        SLAP_OP_BIND = 0,
        SLAP_OP_UNBIND,
        SLAP_OP_SEARCH,
        SLAP_OP_COMPARE,
        SLAP_OP_MODIFY,
        SLAP_OP_MODRDN,
        SLAP_OP_ADD,
        SLAP_OP_DELETE,
        SLAP_OP_ABANDON,
        SLAP_OP_EXTENDED,
        SLAP_OP_LAST
} slap_op_t;
So SLAP_OP_EXTENDED has value 9. But back-ldap/config.c has:
/* see enum in slap.h */
static slap_cf_aux_table timeout_table[] = {
        { BER_BVC("bind="), SLAP_OP_BIND * sizeof( time_t ), 'u', 0, NULL },
        /* unbind makes no sense */
        { BER_BVC("add="), SLAP_OP_ADD * sizeof( time_t ), 'u', 0, NULL },
        { BER_BVC("delete="), SLAP_OP_DELETE * sizeof( time_t ), 'u', 0, NULL },
        { BER_BVC("modrdn="), SLAP_OP_MODRDN * sizeof( time_t ), 'u', 0, NULL },
        { BER_BVC("modify="), SLAP_OP_MODIFY * sizeof( time_t ), 'u', 0, NULL },
        { BER_BVC("compare="), SLAP_OP_COMPARE * sizeof( time_t ), 'u', 0, NULL },
        { BER_BVC("search="), SLAP_OP_SEARCH * sizeof( time_t ), 'u', 0, NULL },
        /* abandon makes little sense */
#if 0   /* not implemented yet */
        { BER_BVC("extended="), SLAP_OP_EXTENDED * sizeof( time_t ), 'u', 0, NULL },
#endif
        { BER_BVNULL, 0, 0, 0, NULL }
};
Hmmm, the "extended=" entry is not compiled in, so does li->li_timeout[ SLAP_OP_EXTENDED ] end up pointing outside of this timeout_table[]? I am not sure.
In any case, back-ldap/bind.c has:
static int
ldap_back_start_tls( LDAP *ld, int protocol, int *is_tls, const char *url,
        unsigned flags, int timeout, const char **text )
{
...
The async code is compiled in because of a #define in back-ldap.h:
#ifdef SLAP_STARTTLS_ASYNCHRONOUS
...
if ( timeout ) {
        tv.tv_sec = timeout;
        tv.tv_usec = 0;
} else {
        LDAP_BACK_TV_SET( &tv );
}
Now I assume (timeout == 0) and we go to the else-branch.
The LDAP_BACK_TV_SET macro is:
#define LDAP_BACK_RESULT_TIMEOUT        (0)
#define LDAP_BACK_RESULT_UTIMEOUT       (100000)
#define LDAP_BACK_TV_SET(tv) \
        do { \
                (tv)->tv_sec = LDAP_BACK_RESULT_TIMEOUT; \
                (tv)->tv_usec = LDAP_BACK_RESULT_UTIMEOUT; \
        } while ( 0 )
My current guess is that the 0.1 second default timeout is set, and for some reason, the slapd backend TLS initialization sometimes takes longer than that, causing the TLS failure.
Do you think this is a possible scenario?
If so, as a temporary fix, would it be safe to hardcode, say 5 seconds for the timeout just for the TLS timeout else-branch?
Best regards, Unto Sten
Hello again,
On Tue, 22 Jan 2019, openldap-technical@kolttonen.fi wrote:
By the way, I am wondering here about the correctness of parameter
li->li_timeout[ SLAP_OP_EXTENDED ]
slap.h has:
/*
 * Operation indices
 */
typedef enum {
        SLAP_OP_BIND = 0,
        SLAP_OP_UNBIND,
        SLAP_OP_SEARCH,
        SLAP_OP_COMPARE,
        SLAP_OP_MODIFY,
        SLAP_OP_MODRDN,
        SLAP_OP_ADD,
        SLAP_OP_DELETE,
        SLAP_OP_ABANDON,
        SLAP_OP_EXTENDED,
        SLAP_OP_LAST
} slap_op_t;
So SLAP_OP_EXTENDED has value 9. But back-ldap/config.c has:
/* see enum in slap.h */
static slap_cf_aux_table timeout_table[] = {
        { BER_BVC("bind="), SLAP_OP_BIND * sizeof( time_t ), 'u', 0, NULL },
        /* unbind makes no sense */
        { BER_BVC("add="), SLAP_OP_ADD * sizeof( time_t ), 'u', 0, NULL },
        { BER_BVC("delete="), SLAP_OP_DELETE * sizeof( time_t ), 'u', 0, NULL },
        { BER_BVC("modrdn="), SLAP_OP_MODRDN * sizeof( time_t ), 'u', 0, NULL },
        { BER_BVC("modify="), SLAP_OP_MODIFY * sizeof( time_t ), 'u', 0, NULL },
        { BER_BVC("compare="), SLAP_OP_COMPARE * sizeof( time_t ), 'u', 0, NULL },
        { BER_BVC("search="), SLAP_OP_SEARCH * sizeof( time_t ), 'u', 0, NULL },
        /* abandon makes little sense */
#if 0   /* not implemented yet */
        { BER_BVC("extended="), SLAP_OP_EXTENDED * sizeof( time_t ), 'u', 0, NULL },
#endif
        { BER_BVNULL, 0, 0, 0, NULL }
};
Hmmm, the "extended=" entry is not compiled in, so does li->li_timeout[ SLAP_OP_EXTENDED ] end up pointing outside of this timeout_table[]? I am not sure.
No, it does not. I read some more code, and "li" is of type struct ldapinfo_t, which has the member:
time_t li_timeout[ SLAP_OP_LAST ];
And hence we have an array:
li_timeout[ 10 ]
so li_timeout[ SLAP_OP_EXTENDED ] is li_timeout[ 9 ], and there is nothing wrong with that.
I do not know if I understand the code here, but I suppose this li_timeout[ SLAP_OP_EXTENDED ] has value 0. That is why the else-branch is taken, giving the default 0.1 second timeout for TLS initialization. Is this correct, and perhaps safe to increase the value here?
Best regards, Unto Sten