Hi,
we are currently chasing a strange issue at a customer's site where the LDAP slaves become unresponsive when network connectivity to the master LDAP and DNS servers is lost.
They have a setup of two masters and two slaves at separate sites. A load balancer sits in front of the slaves and performs regular health checks consisting of a bind followed by a search for its own binddn.
During regular operations the load balancer's health checks look as follows [1]:
Dec 2 14:38:05 ldap slapd[57585]: conn=3924716 fd=36 ACCEPT from IP=192.0.2.189:33852 (IP=192.0.2.129:389)
Dec 2 14:38:05 ldap slapd[57585]: conn=3924716 op=0 BIND dn="cn=keepalive-check-lb,ou=system,dc=example,dc=org" method=128
Dec 2 14:38:05 ldap slapd[57585]: conn=3924716 op=0 BIND dn="cn=keepalive-check-lb,ou=system,dc=example,dc=org" mech=SIMPLE ssf=0
Dec 2 14:38:05 ldap slapd[57585]: conn=3924716 op=0 RESULT tag=97 err=0 text=
Dec 2 14:38:05 ldap slapd[57585]: conn=3924716 op=1 SRCH base="ou=system,dc=example,dc=org" scope=1 deref=0 filter="(cn=keepalive-check-lb)"
Dec 2 14:38:05 ldap slapd[57585]: conn=3924716 op=1 ENTRY dn="cn=keepalive-check-lb,ou=system,dc=example,dc=org"
Dec 2 14:38:05 ldap slapd[57585]: conn=3924716 op=1 SEARCH RESULT tag=101 err=0 nentries=1 text=
Dec 2 14:38:05 ldap slapd[57585]: conn=3924716 op=2 UNBIND
Dec 2 14:38:05 ldap slapd[57585]: connection_closing: readying conn=3924716 sd=36 for close
Dec 2 14:38:05 ldap slapd[57585]: connection_resched: attempting closing conn=3924716 sd=36
Dec 2 14:38:05 ldap slapd[57585]: conn=3924716 fd=36 closed
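For reference, the whole check can be reproduced by hand against a slave with something like the following (the bind password is a placeholder here):

ldapsearch -x -H ldap://192.0.2.129:389 \
  -D "cn=keepalive-check-lb,ou=system,dc=example,dc=org" -w secret \
  -b "ou=system,dc=example,dc=org" -s one "(cn=keepalive-check-lb)"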
When they experience a network outage separating the slaves from the masters and the DNS servers, the load balancers are not able to bind to the slaves:
Dec 2 14:38:50 ldap slapd[57585]: conn=3924725 fd=44 ACCEPT from IP=192.0.2.188:35761 (IP=192.0.2.129:389)
Dec 2 14:38:50 ldap slapd[57585]: connection_closing: readying conn=3924725 sd=44 for close
Dec 2 14:38:50 ldap slapd[57585]: connection_close: deferring conn=3924725 sd=44
Dec 2 14:38:50 ldap slapd[57585]: conn=3924725 op=0 BIND dn="cn=keepalive-check-lb,ou=system,dc=example,dc=org" method=128
Dec 2 14:38:50 ldap slapd[57585]: conn=3924725 op=0 BIND dn="cn=keepalive-check-lb,ou=system,dc=example,dc=org" mech=SIMPLE ssf=0
Dec 2 14:38:50 ldap slapd[57585]: connection_resched: attempting closing conn=3924725 sd=44
Dec 2 14:38:50 ldap slapd[57585]: conn=3924725 fd=44 closed (connection lost)
We have not been able to reproduce this problem in a lab setup which is supposed to be identical to the production setup. It does not seem to be related to the servers being unable to perform reverse lookups on the client IPs. We run a mixture of 2.4.35 and 2.4.38 on CentOS 6.4. In the lab the slaves are able to answer queries just fine without connectivity to the masters or to their DNS servers.
The servers are currently running with the following loglevel:
dn: cn=config
olcLogLevel: Conns
olcLogLevel: Stats
olcLogLevel: Stats2
olcLogLevel: Sync
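If more verbosity would help we can raise this at runtime with a plain cn=config modify, roughly like the following (access via ldapi:// with SASL EXTERNAL is an assumption about the setup, and the extra Args/Packets levels are just a suggestion and rather noisy):

ldapmodify -Y EXTERNAL -H ldapi:/// <<EOF
dn: cn=config
changetype: modify
replace: olcLogLevel
olcLogLevel: Conns
olcLogLevel: Stats
olcLogLevel: Stats2
olcLogLevel: Sync
olcLogLevel: Args
olcLogLevel: Packets
EOF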
It seems we only get to the point where the bind credentials are parsed, after which the connection is closed.
This could of course be a problem with the load balancer prematurely closing the connection.
I am trying to eliminate any causes on the LDAP servers' side.
Any ideas on how to debug this or where we could look?
Greetings Christian
[1] DNS names and IPs obfuscated to protect the customer
Hi,
On Tue, 3 Dec 2013, Christian Kratzer wrote:
Hi,
we are currently chasing a strange issue at a customer's site where the LDAP slaves become unresponsive when network connectivity to the master LDAP and DNS servers is lost.
They have a setup of two masters and two slaves at separate sites. A load balancer sits in front of the slaves and performs regular health checks consisting of a bind followed by a search for its own binddn.
It seems that this is due to LDAP chaining from slave to master running without a timeout and eventually blocking all of slapd.
We use referrals and chaining for slapo-ppolicy and slapo-lastbind (with the replication patch from ITS#7721).
I tried to resolve this using olcDbKeepalive and olcDbNetworkTimeout but have not been successful yet.
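For reference, the change we are applying to the chain database boils down to roughly this LDIF (fed to ldapmodify, values as in the lab config below):

dn: olcDatabase={1}ldap,olcOverlay={0}chain,olcDatabase={-1}frontend,cn=config
changetype: modify
replace: olcDbNetworkTimeout
olcDbNetworkTimeout: 3
-
replace: olcDbKeepalive
olcDbKeepalive: 60:6:10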
This is how the chaining backend is configured on the slaves in our lab:
dn: olcDatabase={1}ldap,olcOverlay={0}chain,olcDatabase={-1}frontend,cn=config
objectClass: olcLDAPConfig
objectClass: olcChainDatabase
olcDatabase: {1}ldap
olcDbURI: "ldap://ldaptest0.example.org"
olcDbStartTLS: start starttls=no tls_cert="/usr/local/etc/openldap/certs/server.cert" tls_key="/usr/local/etc/openldap/certs/server.key" tls_cacert="/usr/local/etc/openldap/certs/ca.cert" tls_reqcert=demand tls_crlcheck=none
olcDbIDAssertBind: mode=self flags=prescriptive,proxy-authz-non-critical bindmethod=simple binddn="cn=chain,ou=system,dc=de,dc=telefonica,dc=com" credentials="XXXXXXXXXXX" keepalive=60:6:10 tls_cert="/usr/local/etc/openldap/certs/server.cert" tls_key="/usr/local/etc/openldap/certs/server.key" tls_cacert="/usr/local/etc/openldap/certs/ca.cert" tls_reqcert=demand tls_crlcheck=none
olcDbRebindAsUser: FALSE
olcDbChaseReferrals: TRUE
olcDbTFSupport: no
olcDbProxyWhoAmI: FALSE
olcDbProtocolVersion: 3
olcDbSingleConn: FALSE
olcDbCancel: abandon
olcDbUseTemporaryConn: FALSE
olcDbConnectionPoolMax: 16
olcDbSessionTrackingRequest: FALSE
olcDbNoRefs: FALSE
olcDbNoUndefFilter: FALSE
olcDbOnErr: continue
olcDbKeepalive: 60:6:10
olcDbNetworkTimeout: 3
Any pointers on what we should change to allow quick detection of an unreachable olcDbURI?
Greetings Christian
Christian Kratzer wrote:
On Tue, 3 Dec 2013, Christian Kratzer wrote:
we are currently chasing a strange issue at a customer's site where the LDAP slaves become unresponsive when network connectivity to the master LDAP and DNS servers is lost.
They have a setup of two masters and two slaves at separate sites. A load balancer sits in front of the slaves and performs regular health checks consisting of a bind followed by a search for its own binddn.
It seems that this is due to LDAP chaining from slave to master running without a timeout and eventually blocking all of slapd.
That was my first idea, remembering your earlier info about your setup.
We use referrals and chaining for slapo-ppolicy and slapo-lastbind (with the replication patch from ITS#7721).
You have been warned. ;-)
No, I don't have a good suggestion other than to avoid chaining write operations by slapo-ppolicy and slapo-lastbind.
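For slapo-ppolicy that would mean not forwarding the policy state updates from the consumers at all, roughly like this (the overlay and database indexes are placeholders for your layout, and whether the ITS#7721 lastbind patch has an equivalent switch I don't know off-hand):

dn: olcOverlay={2}ppolicy,olcDatabase={2}hdb,cn=config
changetype: modify
replace: olcPPolicyForwardUpdates
olcPPolicyForwardUpdates: FALSE

With that the lockout/timestamp state gets written locally on the slave during a bind instead of being chained to the master.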
Ciao, Michael.