https://bugs.openldap.org/show_bug.cgi?id=9262
Bug ID: 9262
Summary: Segmentation Fault on ldap_chain_op during Network Issues
Product: OpenLDAP
Version: 2.4.49
Hardware: x86_64
OS: Linux
Status: UNCONFIRMED
Severity: normal
Priority: ---
Component: overlays
Assignee: bugs@openldap.org
Reporter: jeremy.diaz@rexconsulting.net
Target Milestone: ---
Created attachment 733
  --> https://bugs.openldap.org/attachment.cgi?id=733&action=edit
Full backtrace
This is a follow up to the following OpenLDAP mailing list post: https://www.openldap.org/lists/openldap-technical/202001/msg00045.html
The issue is as follows:
Slapd periodically crashes with a segmentation fault, most often during periods of high load. The issue appears to be related to the chain overlay.
Some background info:
This environment was originally made up of 3 producers and 6 consumers, all on OpenLDAP 2.4.48.
Initially the issue did not appear to be caused directly by high load. We noticed an uptick in saslauthd errors around the times crashes occurred:
Dec 11 07:23:20 {REDACTED} saslauthd[2720]: user ldap_search_st() failed: Timed out
Dec 11 07:23:30 {REDACTED} saslauthd[2718]: user ldap_search_st() failed: Timed out
Dec 11 07:23:40 {REDACTED} saslauthd[2718]: user ldap_search_st() failed: Timed out
Dec 11 07:23:56 {REDACTED} saslauthd[2717]: user ldap_search_st() failed: Timed out
Dec 11 07:23:57 {REDACTED} saslauthd[2720]: user ldap_search_st() failed: Timed out
Dec 11 07:23:57 {REDACTED} saslauthd[2719]: user ldap_search_st() failed: Timed out
Dec 11 07:24:05 {REDACTED} saslauthd[2718]: user ldap_search_st() failed: Timed out
Dec 11 07:24:06 {REDACTED} saslauthd[2717]: user ldap_search_st() failed: Timed out
Dec 11 07:24:54 {REDACTED} saslauthd[2719]: : write failure
Dec 11 07:24:54 {REDACTED} saslauthd[2717]: : write failure
Dec 11 07:24:54 {REDACTED} saslauthd[2719]: : write: Broken pipe
Dec 11 07:24:54 {REDACTED} saslauthd[2717]: : write: Broken pipe
Dec 11 07:24:54 {REDACTED} saslauthd[2718]: : write failure
Dec 11 07:24:54 {REDACTED} saslauthd[2720]: : write failure
Dec 11 07:24:54 {REDACTED} saslauthd[2718]: : write: Broken pipe
Dec 11 07:24:54 {REDACTED} saslauthd[2720]: : write: Broken pipe
Dec 11 07:25:04 {REDACTED} saslauthd[2721]: : write failure
Dec 11 07:25:04 {REDACTED} saslauthd[2721]: : write: Broken pipe
This seemed to indicate an issue with either AD, saslauthd, or the ldap meta instance. Running slapd at log level -1 did not reveal any obvious error messages.
One consumer server was particularly problematic, since it was in a data center with a slower network connection. This server was also handling a subset of traffic for another one of our applications, so overall it was receiving more traffic than some of the other servers.
After some network improvements, segfault occurrences dropped to about one every couple of days, mostly confined to the problematic server.
The environment was eventually updated to 2.4.49, but there was no obvious reduction in the number of segfaults.
A while later, an incident occurred that increased network latency, causing most of the slapd instances to crash constantly, every few minutes. The crashing stopped only once the network latency was reduced.
Ultimately, what seemed to significantly reduce the number of segfaults was adding another three consumers, further reducing the load on each server.
The following backtrace that we captured seems to indicate an issue in the chain overlay:
#0  ldap_chain_op (op=op@entry=0x7ffbe41f0870, rs=rs@entry=0x7ffbf09bbb20, op_f=0x4a845e <ldap_back_search>, ref=ref@entry=0x0, depth=depth@entry=0) at chain.c:422
#1  0x00000000004e61da in ldap_chain_response (op=0x7ffbe41f0870, rs=0x7ffbf09bbb20) at chain.c:1090
#2  0x000000000049d323 in over_back_response (op=0x7ffbe41f0870, rs=0x7ffbf09bbb20) at backover.c:237
#3  0x0000000000447c1a in slap_response_play (op=op@entry=0x7ffbe41f0870, rs=rs@entry=0x7ffbf09bbb20) at result.c:508
#4  0x00000000004480f5 in send_ldap_response (op=op@entry=0x7ffbe41f0870, rs=rs@entry=0x7ffbf09bbb20) at result.c:583
The issue ultimately seems to stem from slapd being very sensitive to network latency. Note that in frame #0 above, ldap_chain_op is entered with ref=0x0. The chain overlay appears to be dereferencing a referral list that never gets set, presumably because a timeout is exceeded under high load and/or high network latency, as the loop at chain.c:422 suggests:
for ( ; !BER_BVISNULL( ref ); ref++ ) {
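To illustrate the failure mode, here is a minimal, self-contained sketch (not the actual slapd source; the struct, the macro, and walk_refs() are simplified stand-ins) of why that loop crashes when the BerVarray itself is NULL: BER_BVISNULL( ref ) expands to ( ref )->bv_val == NULL, so the very first iteration dereferences ref, which the backtrace shows arriving as 0x0. The NULL guard below is only a hypothetical mitigation, not a proposed patch:

#include <stdio.h>

/* Simplified stand-ins for the definitions in <lber.h>. */
struct berval {
    unsigned long bv_len;
    char *bv_val;
};
typedef struct berval *BerVarray;

/* As in lber.h: this tests the berval's contents, not the pointer
 * itself, so BER_BVISNULL( NULL ) is a NULL-pointer dereference. */
#define BER_BVISNULL(bv) ((bv)->bv_val == NULL)

/* Hypothetical, simplified version of the loop at chain.c:422, with a
 * defensive guard that this code path does not appear to have. */
static void walk_refs( BerVarray ref )
{
    if ( ref == NULL ) { /* guard added for illustration only */
        fprintf( stderr, "no referrals to chase\n" );
        return;
    }
    for ( ; !BER_BVISNULL( ref ); ref++ ) {
        printf( "chasing referral: %s\n", ref->bv_val );
    }
}

int main( void )
{
    struct berval refs[] = {
        { sizeof( "ldap://a.example.com" ) - 1, "ldap://a.example.com" },
        { 0, NULL }, /* arrays are NULL-berval terminated, as the loop expects */
    };

    walk_refs( refs ); /* normal case: prints one referral */
    walk_refs( NULL ); /* the ref=0x0 case from frame #0; segfaults without the guard */
    return 0;
}

Without the ref == NULL guard, walk_refs( NULL ) reproduces the same class of crash as frame #0; whether the guard belongs in the loop, or whether the caller should never pass a NULL referral list down to ldap_chain_op in the first place, is a question for the maintainers.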
Attached is the updated full backtrace of all the threads, and below are the relevant portions of our olc configuration for the chain overlay:
dn: olcOverlay={1}chain,olcDatabase={-1}frontend,cn=config
objectClass: olcOverlayConfig
objectClass: olcChainConfig
olcOverlay: {1}chain
olcChainCacheURI: FALSE
olcChainMaxReferralDepth: 1
olcChainReturnError: TRUE
dn: olcDatabase={0}ldap,olcOverlay={1}chain,olcDatabase={-1}frontend,cn=config
objectClass: olcLDAPConfig
objectClass: olcChainDatabase
olcDatabase: {0}ldap
olcDbStartTLS: ldaps starttls=no
olcDbRebindAsUser: FALSE
olcDbChaseReferrals: TRUE
olcDbTFSupport: no
olcDbProxyWhoAmI: TRUE
olcDbProtocolVersion: 3
olcDbSingleConn: FALSE
olcDbCancel: abandon
olcDbUseTemporaryConn: FALSE
olcDbConnectionPoolMax: 16
olcDbSessionTrackingRequest: FALSE
olcDbNoRefs: FALSE
olcDbNoUndefFilter: FALSE
olcDbOnErr: continue
olcDbKeepalive: 0:0:0
dn: olcDatabase={1}ldap,olcOverlay={1}chain,olcDatabase={-1}frontend,cn=config
objectClass: olcLDAPConfig
objectClass: olcChainDatabase
olcDatabase: {1}ldap
olcDbURI: "{REDACTED URI 1}"
olcDbStartTLS: ldaps starttls=no tls_cacert="/{REDACTED}/openldap/tls/cacerts.cer" tls_reqcert=demand tls_crlcheck=none
olcDbIDAssertBind: mode=self flags=prescriptive,proxy-authz-non-critical bindmethod=simple timeout=0 network-timeout=0 binddn={REDACTED} credentials={REDACTED} keepalive=0:0:0
olcDbRebindAsUser: FALSE
olcDbChaseReferrals: TRUE
olcDbTFSupport: no
olcDbProxyWhoAmI: TRUE
olcDbProtocolVersion: 3
olcDbSingleConn: FALSE
olcDbCancel: abandon
olcDbUseTemporaryConn: FALSE
olcDbConnectionPoolMax: 16
olcDbSessionTrackingRequest: FALSE
olcDbNoRefs: FALSE
olcDbNoUndefFilter: FALSE
olcDbOnErr: continue
olcDbKeepalive: 0:0:0
dn: olcDatabase={2}ldap,olcOverlay={1}chain,olcDatabase={-1}frontend,cn=config
objectClass: olcLDAPConfig
objectClass: olcChainDatabase
olcDatabase: {2}ldap
olcDbURI: "{REDACTED URI 2}"
olcDbStartTLS: ldaps starttls=no tls_cacert="/{REDACTED}/openldap/tls/cacerts.cer" tls_reqcert=demand tls_crlcheck=none
olcDbIDAssertBind: mode=self flags=prescriptive,proxy-authz-non-critical bindmethod=simple timeout=0 network-timeout=0 binddn={REDACTED} credentials={REDACTED} keepalive=0:0:0
olcDbRebindAsUser: FALSE
olcDbChaseReferrals: TRUE
olcDbTFSupport: no
olcDbProxyWhoAmI: TRUE
olcDbProtocolVersion: 3
olcDbSingleConn: FALSE
olcDbCancel: abandon
olcDbUseTemporaryConn: FALSE
olcDbConnectionPoolMax: 16
olcDbSessionTrackingRequest: FALSE
olcDbNoRefs: FALSE
olcDbNoUndefFilter: FALSE
olcDbOnErr: continue
olcDbKeepalive: 0:0:0
dn: olcDatabase={3}ldap,olcOverlay={1}chain,olcDatabase={-1}frontend,cn=config
objectClass: olcLDAPConfig
objectClass: olcChainDatabase
olcDatabase: {3}ldap
olcDbURI: "{REDACTED URI 3}"
olcDbStartTLS: ldaps starttls=no tls_cacert="/{REDACTED}/openldap/tls/cacerts.cer" tls_reqcert=demand tls_crlcheck=none
olcDbIDAssertBind: mode=self flags=prescriptive,proxy-authz-non-critical bindmethod=simple timeout=0 network-timeout=0 binddn={REDACTED} credentials={REDACTED} keepalive=0:0:0
olcDbRebindAsUser: FALSE
olcDbChaseReferrals: TRUE
olcDbTFSupport: no
olcDbProxyWhoAmI: TRUE
olcDbProtocolVersion: 3
olcDbSingleConn: FALSE
olcDbCancel: abandon
olcDbUseTemporaryConn: FALSE
olcDbConnectionPoolMax: 16
olcDbSessionTrackingRequest: FALSE
olcDbNoRefs: FALSE
olcDbNoUndefFilter: FALSE
olcDbOnErr: continue
olcDbKeepalive: 0:0:0