Full_Name: Maciej Puzio Version: 2.4.44 and current sources from git repo OS: Ubuntu 15.10 amd64 URL: Submission from: (NULL) (129.112.109.41)
While configuring two servers in a dual-master setup I encountered an issue causing replication failures, when replication is done on a TLS connection. In this particular setup the issue manifested itself by replication retries failing, even when the cause of the original failure no longer existed. For example, let's consider that dual-master setup consists of servers A and B. If server A was started before server B, its on-startup replication would fail (as expected, since B is down). Server B would then be started and its replication attempt would succeed (since A is already running). However A's subsequent retries would fail, even though B was now accessible. If I then restarted server A, its replication will succeed, but B will now lose connection to A and all B's subsequent retries would fail. This issue has ~50% reproducibility, but if it occurs once, then it would continue failing. On occasion both servers would replicate fine, but obviously the whole situation was unacceptable on servers being prepared for production use.
I tested this issue thoroughly eliminating various hypotheses, and ended up debugging OpenLDAP code and finding a bug causing the above behavior. In short, on failure slap_client_connect deallocates LDAP object ld and objects pointed by it, including ldap_common which contains ld_options. Pointer to ld_options is held in sb->sb_tls_ctx and reused in subsequent slap_client_connect calls, at this point referencing freed memory.
This issue occurs in OpenLDAP newest stable version 2.4.44, as well as in current source code from git repo (as of March 11, 2016), obtained from git://git.openldap.org/openldap.git
In order to demonstrate that the issue is indeed use-after-free, I made the following debug modifications to the code. (The diffs are against the source code from git repo as of 2016/03/11)
--- servers/slapd/config.c.orig 2016-03-11 14:52:14.000000000 -0600 +++ servers/slapd/config.c 2016-03-11 17:20:55.216170535 -0600 @@ -60,6 +60,26 @@
#define ARGS_STEP 512
+ +/* DEBUG */ +#include <syslog.h> +#include <gnutls/gnutls.h> +#include "../../libraries/libldap/ldap-int.h" + +typedef struct tlsg_ctx { + struct ldapoptions *lo; + gnutls_certificate_credentials_t cred; + gnutls_dh_params_t dh_params; + unsigned long verify_depth; + int refcount; + gnutls_priority_t prios; +#ifdef LDAP_R_COMPILE + ldap_pvt_thread_mutex_t ref_mutex; +#endif +} tlsg_ctx; +/* END DEBUG */ + + /* * defaults for various global variables */ @@ -2001,9 +2021,11 @@
#ifdef HAVE_TLS if ( sb->sb_tls_do_init ) { + syslog(LOG_DEBUG, "DEBUG: slap_client_connect sb->sb_tls_do_init branch\n"); rc = bindconf_tls_set( sb, ld );
} else if ( sb->sb_tls_ctx ) { + syslog(LOG_DEBUG, "DEBUG: slap_client_connect sb->sb_tls_ctx branch\n"); rc = ldap_set_option( ld, LDAP_OPT_X_TLS_CTX, sb->sb_tls_ctx ); } @@ -2015,6 +2037,7 @@ sb->sb_uri.bv_val, rc, 0 ); goto done; } + syslog(LOG_DEBUG, "DEBUG: slap_client_connect ldo_tls_require_cert=%X\n", ((tlsg_x*x*)ld->ld_options.ldo_tls_ctx)->lo->ldo_tls_require_cert); #endif
--- libraries/libldap/unbind.c.orig 2016-03-11 14:52:14.000000000 -0600 +++ libraries/libldap/unbind.c 2016-03-11 15:49:18.603447766 -0600 @@ -27,6 +27,8 @@
#include "ldap-int.h"
+#include <syslog.h> /* DEBUG */ + /* An Unbind Request looks like this: * * UnbindRequest ::= [APPLICATION 2] NULL @@ -208,6 +210,8 @@
#ifdef HAVE_TLS ldap_int_tls_destroy( &ld->ld_options ); + syslog(LOG_DEBUG, "DEBUG: ldap_ld_destroy\n"); + ld->ld_options.ldo_tls_require_cert = 0xdeadbeef; /* DEBUG */ #endif
if ( ld->ld_options.ldo_sctrls != NULL ) { @@ -233,7 +237,7 @@ #ifndef NDEBUG LDAP_TRASH(ld); #endif - LDAP_FREE( (char *) ld->ldc ); + //LDAP_FREE( (char *) ld->ldc ); /* DEBUG */ LDAP_FREE( (char *) ld );
return( err );
Here, instead of freeing ld->ldc (struct ldap_common), I overwrite its contents with a magic debug value 0xdeadbeef. I overwrite ldo_tls_require_cert, as an incorrect value of this particular variable turned out to be the direct cause of replication failures mentioned at the beginning (more about it later). However, other fields of ldap_common could be used as well.
File ldap-int.h is included to be able to examine contents of *ld, while the declaration of struct tlsg_ctx has been copied verbatim from libraries/libldap/tls_g.c, in order to examine the contents of ld->ld_options.ldo_tls_ctx. I use GnuTLS library following Ubuntu build options. But the issue in question is not GnuTLS-related, though it may cause different symptoms with OpenSSL or Mozilla NSS (I did not test these options).
One would expect that the only effect of these modifications would be a mere memory leak, but in addition to that we see that the magic value gets logged. As you can see, the second replication attempts uses memory that, if executing non-debug code, would be already freed:
The results are as follows (only one server started): Mar 11 17:42:21 srvA slapd[14360]: slapd starting Mar 11 17:42:21 srvA slapd[14360]: =>do_syncrepl rid=001 Mar 11 17:42:21 srvA slapd[14360]: DEBUG: slap_client_connect sb->sb_tls_do_init branch Mar 11 17:42:21 srvA slapd[14355]: ...done. Mar 11 17:42:21 srvA slapd[14360]: DEBUG: slap_client_connect ldo_tls_require_cert=0 Mar 11 17:42:21 srvA slapd[14360]: DEBUG: ldap_ld_destroy Mar 11 17:42:21 srvA slapd[14360]: do_syncrepl: rid=001 rc -1 retrying (9 retries left) Mar 11 17:42:51 srvA slapd[14360]: =>do_syncrepl rid=001 Mar 11 17:42:51 srvA slapd[14360]: DEBUG: slap_client_connect sb->sb_tls_ctx branch Mar 11 17:42:51 srvA slapd[14360]: DEBUG: slap_client_connect ldo_tls_require_cert=DEADBEEF Mar 11 17:42:51 srvA slapd[14360]: DEBUG: ldap_ld_destroy Mar 11 17:42:51 srvA slapd[14360]: do_syncrepl: rid=001 rc -1 retrying (8 retries left)
When running unmodified non-debug code, on replication retry ld_options contains either (A) previous values, (B) zeros or (C) random values. Occasionally a segfault occurs (although I observed them only during debugging). Replication fails much more readily if, in addition to the issue discussed here, there is a TLS certificate problem. In this case connection tends to fail in tlsg_cert_verify, even if tls_reqcert is set to 'allow' or 'never'. This is because random value of ldo_tls_require_cert will cause the certificate to be checked regardless of tls_reqcert setting. If certificates are correct, the issue may well remain hidden, as TLS connection will be established regardless of whether A, B or C occurs. In this case an occasional segfault may be the only symptom. Of course, if by chance B occurs, certificate verification will be silently skipped and this may lead to a security problem.
A quick and dirty fix is to comment out few lines in servers/slapd/config.c so that "rc = bindconf_tls_set( sb, ld )" is always executed, which avoids the reuse of sb->sb_tls_ctx. However, I think a better fix would involve changing the handling of ld->ldoptions and related objects. Given the complexity of interactions between various objects involved, and my limited familiarity with OpenLDAP design, I feel that I am not in position to propose a patch.
For reference, OpenLDAP was build with this configuration: ./configure --prefix=/opt/openldap --enable-debug --enable-dynamic --enable-syslog --enable-local --enable-slapd --enable-crypt --enable-modules --enable-backends --enable-ndb=no --enable-shell=no --enable-perl=no --enable-sql=no --enable-wt=no --enable-overlays --with-tls=gnutls CPPFLAGS="-Wno-format-extra-args"
The relevant parts of slapd.conf are [redacted to anonymize]:
loglevel 1 serverID 001 moduleload syncprov TLSCipherSuite SECURE256:-VERS-SSL3.0 TLSCACertificateFile /etc/ldap/ssl/ca.pem TLSCertificateFile /etc/ldap/ssl/srvA.pem TLSCertificateKeyFile /etc/ldap/ssl/srvA.key syncrepl rid=001 provider=ldaps://10.0.0.2 type=refreshAndPersist retry="30 10 300 +" searchbase="dc=test" attrs="*,+" bindmethod=simple binddn="cn=root,dc=test" credentials="plaintext-password" tls_reqcert=never keepalive="240:5:10" mirrormode TRUE overlay syncprov syncprov-checkpoint 10 1440
Test machine hardware: CPU: Intel Atom C2558 @ 2.40GHz Motherboard: Supermicro A1SRM-2558F NIC: Intel Corporation Ethernet Connection I354 (rev 03)
Software used: OS: Ubuntu 15.10 Kernel: 4.2.0-30-generic #36-Ubuntu SMP x86_64 OpenLDAP: slapd 2.4.44 and slapd 2.X (from git repo) GnuTLS: 3.3.15-5ubuntu2
Please let me know if you need further information. Thank you.
Maciej Puzio