https://bugs.openldap.org/show_bug.cgi?id=9262
Bug ID: 9262 Summary: Segmentation Fault on ldap_chain_op during Network Issues Product: OpenLDAP Version: 2.4.49 Hardware: x86_64 OS: Linux Status: UNCONFIRMED Severity: normal Priority: --- Component: overlays Assignee: bugs@openldap.org Reporter: jeremy.diaz@rexconsulting.net Target Milestone: ---
Created attachment 733 --> https://bugs.openldap.org/attachment.cgi?id=733&action=edit Full backtrace
This is a follow up to the following OpenLDAP mailing list post: https://www.openldap.org/lists/openldap-technical/202001/msg00045.html
The issue is as follows:
Slapd crashes with a segmentation fault periodically and seems to occur most often during periods of high load. The issue seems to be related to the chain overlay.
Some background info:
This environment was originally made up of 3 producers and 6 consumers all on OpenLdap 2.4.48.
Initially the issue did not appear to be caused directly by high load. We noticed that there was an uptick in saslauthd errors around the time when a crash occurred:
Dec 11 07:23:20 {REDACTED} saslauthd[2720]: user ldap_search_st() failed: Timed out Dec 11 07:23:30 {REDACTED} saslauthd[2718]: user ldap_search_st() failed: Timed out Dec 11 07:23:40 {REDACTED} saslauthd[2718]: user ldap_search_st() failed: Timed out Dec 11 07:23:56 {REDACTED} saslauthd[2717]: user ldap_search_st() failed: Timed out Dec 11 07:23:57 {REDACTED} saslauthd[2720]: user ldap_search_st() failed: Timed out Dec 11 07:23:57 {REDACTED} saslauthd[2719]: user ldap_search_st() failed: Timed out Dec 11 07:24:05 {REDACTED} saslauthd[2718]: user ldap_search_st() failed: Timed out Dec 11 07:24:06 {REDACTED} saslauthd[2717]: user ldap_search_st() failed: Timed out Dec 11 07:24:54 {REDACTED} saslauthd[2719]: : write failure Dec 11 07:24:54 {REDACTED} saslauthd[2717]: : write failure Dec 11 07:24:54 {REDACTED} saslauthd[2719]: : write: Broken pipe Dec 11 07:24:54 {REDACTED} saslauthd[2717]: : write: Broken pipe Dec 11 07:24:54 {REDACTED} saslauthd[2718]: : write failure Dec 11 07:24:54 {REDACTED} saslauthd[2720]: : write failure Dec 11 07:24:54 {REDACTED} saslauthd[2718]: : write: Broken pipe Dec 11 07:24:54 {REDACTED} saslauthd[2720]: : write: Broken pipe Dec 11 07:25:04 {REDACTED} saslauthd[2721]: : write failure Dec 11 07:25:04 {REDACTED} saslauthd[2721]: : write: Broken pipe
This seemed to indicate an issue with either AD, saslauthd or the ldap meta instance. Slapd with log level -1 did not return any obvious error messages.
One consumer server was particularly problematic since it was in a data center that had a slower network connection. This server was also dealing with a subset of traffic corresponding to another one of our applications so overall there was more traffic going into this server than some the other servers.
After some network improvements the segfault occurences dropped to about 1 every couple of days mostly confined to the problematic server.
The environment was eventually updated to 2.4.49 but there was no obvious reduction in the number of segfaults.
A while after an incident occurred that caused an increase in the latency of the network connections causing most of the slapd instances to crash constantly every few minutes. The crashing stopped only after the network latency was reduced.
Ultimately what seemed to significantly reduce the number of segfaults was adding another 3 consumers to further reduce the load on each server.
The following backtrace we captured seems to indicate that there is an issue with the chain overlay:
#0 ldap_chain_op (op=op@entry=0x7ffbe41f0870, rs=rs@entry=0x7ffbf09bbb20, op_f=0x4a845e <ldap_back_search>, ref=ref@entry=0x0, depth=depth@entry=0) at chain.c:422 #1 0x00000000004e61da in ldap_chain_response (op=0x7ffbe41f0870, rs=0x7ffbf09bbb20) at chain.c:1090 #2 0x000000000049d323 in over_back_response (op=0x7ffbe41f0870, rs=0x7ffbf09bbb20) at backover.c:237 #3 0x0000000000447c1a in slap_response_play (op=op@entry=0x7ffbe41f0870, rs=rs@entry=0x7ffbf09bbb20) at result.c:508 #4 0x00000000004480f5 in send_ldap_response (op=op@entry=0x7ffbe41f0870, rs=rs@entry=0x7ffbf09bbb20) at result.c:583
The issue ultimately seems to stem from slapd being really sensitive to network latency. The chain overlay appears to be attempting to dereference a value that never gets set due to a timeout mechanism being exceeded because of high load and/or high network latency as chain.c:422 seems to suggest:
for ( ; !BER_BVISNULL( ref ); ref++ ) {
Attached is the updated full backtrace of all the threads and below is the relevant portions of our olc configuration regarding the chain overlay:
dn: olcOverlay={1}chain,olcDatabase={-1}frontend,cn=config objectClass: olcOverlayConfig objectClass: olcChainConfig olcOverlay: {1}chain olcChainCacheURI: FALSE olcChainMaxReferralDepth: 1 olcChainReturnError: TRUE
dn: olcDatabase={0}ldap,olcOverlay={1}chain,olcDatabase={-1}frontend,cn=config objectClass: olcLDAPConfig objectClass: olcChainDatabase olcDatabase: {0}ldap olcDbStartTLS: ldaps starttls=no olcDbRebindAsUser: FALSE olcDbChaseReferrals: TRUE olcDbTFSupport: no olcDbProxyWhoAmI: TRUE olcDbProtocolVersion: 3 olcDbSingleConn: FALSE olcDbCancel: abandon olcDbUseTemporaryConn: FALSE olcDbConnectionPoolMax: 16 olcDbSessionTrackingRequest: FALSE olcDbNoRefs: FALSE olcDbNoUndefFilter: FALSE olcDbOnErr: continue olcDbKeepalive: 0:0:0
dn: olcDatabase={1}ldap,olcOverlay={1}chain,olcDatabase={-1}frontend,cn=config objectClass: olcLDAPConfig objectClass: olcChainDatabase olcDatabase: {1}ldap olcDbURI: “{REDACTED URI 1}“ olcDbStartTLS: ldaps starttls=no tls_cacert=“/{REDACTED}/openldap/tls/ cacerts.cer" tls_reqcert=demand tls_crlcheck=none olcDbIDAssertBind: mode=self flags=prescriptive,proxy-authz-non-critical bindm ethod=simple timeout=0 network-timeout=0 binddn={REDACTED} credentials={REDACTED} keepalive=0:0:0 olcDbRebindAsUser: FALSE olcDbChaseReferrals: TRUE olcDbTFSupport: no olcDbProxyWhoAmI: TRUE olcDbProtocolVersion: 3 olcDbSingleConn: FALSE olcDbCancel: abandon olcDbUseTemporaryConn: FALSE olcDbConnectionPoolMax: 16 olcDbSessionTrackingRequest: FALSE olcDbNoRefs: FALSE olcDbNoUndefFilter: FALSE olcDbOnErr: continue olcDbKeepalive: 0:0:0
dn: olcDatabase={2}ldap,olcOverlay={1}chain,olcDatabase={-1}frontend,cn=config objectClass: olcLDAPConfig objectClass: olcChainDatabase olcDatabase: {2}ldap olcDbURI: “{REDACTED URI 2}“ olcDbStartTLS: ldaps starttls=no tls_cacert=“/{REDACTED}/openldap/tls/ cacerts.cer" tls_reqcert=demand tls_crlcheck=none olcDbIDAssertBind: mode=self flags=prescriptive,proxy-authz-non-critical bindm ethod=simple timeout=0 network-timeout=0 binddn={REDACTED} credentials={REDACTED} keepalive=0:0:0 olcDbRebindAsUser: FALSE olcDbChaseReferrals: TRUE olcDbTFSupport: no olcDbProxyWhoAmI: TRUE olcDbProtocolVersion: 3 olcDbSingleConn: FALSE olcDbCancel: abandon olcDbUseTemporaryConn: FALSE olcDbConnectionPoolMax: 16 olcDbSessionTrackingRequest: FALSE olcDbNoRefs: FALSE olcDbNoUndefFilter: FALSE olcDbOnErr: continue olcDbKeepalive: 0:0:0
dn: olcDatabase={3}ldap,olcOverlay={1}chain,olcDatabase={-1}frontend,cn=config objectClass: olcLDAPConfig objectClass: olcChainDatabase olcDatabase: {3}ldap olcDbURI: “{REDACTED URI 3}“ olcDbStartTLS: ldaps starttls=no tls_cacert=“/{REDACTED}/openldap/tls/ cacerts.cer" tls_reqcert=demand tls_crlcheck=none olcDbIDAssertBind: mode=self flags=prescriptive,proxy-authz-non-critical bindm ethod=simple timeout=0 network-timeout=0 binddn={REDACTED} credentials={REDACTED} keepalive=0:0:0 olcDbRebindAsUser: FALSE olcDbChaseReferrals: TRUE olcDbTFSupport: no olcDbProxyWhoAmI: TRUE olcDbProtocolVersion: 3 olcDbSingleConn: FALSE olcDbCancel: abandon olcDbUseTemporaryConn: FALSE olcDbConnectionPoolMax: 16 olcDbSessionTrackingRequest: FALSE olcDbNoRefs: FALSE olcDbNoUndefFilter: FALSE olcDbOnErr: continue olcDbKeepalive: 0:0:0
https://bugs.openldap.org/show_bug.cgi?id=9262
Quanah Gibson-Mount quanah@openldap.org changed:
What |Removed |Added ---------------------------------------------------------------------------- Target Milestone|--- |2.4.51
https://bugs.openldap.org/show_bug.cgi?id=9262
--- Comment #1 from edeity@gmail.com --- We've encountered a similar segfault.
Program terminated with signal SIGSEGV, Segmentation fault. #0 ldap_chain_op (op=op@entry=0x7f32a812df70, rs=rs@entry=0x7f32bd14b9a0, op_f=0x56207ba8aa50 <ldap_back_search>, ref=ref@entry=0x0, depth=depth@entry=0) at chain.c:422 422 for ( ; !BER_BVISNULL( ref ); ref++ ) {
The segfault occurred because ref itself is NULL as you can see, and BER_BVISNULL checks for ref->by_val:
# include/lber_pvt.h 217 #define BER_BVISNULL(bv) ((bv)->bv_val == NULL)
Our backtrace contains:
#bt #0 ldap_chain_op (op=op@entry=0x7fb1f018d820, rs=rs@entry=0x7fb21bffe9a0, op_f=0x55a4b9a9ca50 <ldap_back_search>, ref=ref@entry=0x0, depth=depth@entry=0) at chain.c:422 #1 0x000055a4b9aec883 in ldap_chain_response (op=0x7fb1f018d820, rs=0x7fb21bffe9a0) at chain.c:1061 #2 0x000055a4b9a711e8 in over_back_response (op=0x7fb1f018d820, rs=0x7fb21bffe9a0) at backover.c:237 #3 0x000055a4b9a147f6 in slap_response_play (op=op@entry=0x7fb1f018d820, rs=rs@entry=0x7fb21bffe9a0) at result.c:508 #4 0x000055a4b9a14d50 in send_ldap_response (op=op@entry=0x7fb1f018d820, rs=rs@entry=0x7fb21bffe9a0) at result.c:583 #5 0x000055a4b9a158f2 in slap_send_ldap_result (op=0x7fb1f018d820, rs=0x7fb21bffe9a0) at result.c:861 #6 0x000055a4b9ab3750 in mdb_search (op=0x7fb1f018d820, rs=0x7fb21bffe9a0) at search.c:552 ...
In our case, we got here from ldap_chain_response, which defines ref as:
# servers/slapd/back-ldap/chain.c 955 static int 956 ldap_chain_response( Operation *op, SlapReply *rs ) 957 { .. 967 BerVarray ref;
However, in this case, it's value comes from rs->sr_ref, which does get set to NULL in various locations in the code in some situations and therefore appears to be a valid value.
1025 ref = rs->sr_ref; 1026 rs->sr_ref = NULL; ... 1062 case LDAP_REQ_DELETE: 1063 rc = ldap_chain_op( op, rs, op_delete, ref, 0 );
The assumption of the check at chain.c:422 seems to be that ref itself won't be NULL. We're working to identify the circumstances that get us to this.
https://bugs.openldap.org/show_bug.cgi?id=9262
--- Comment #2 from Howard Chu hyc@openldap.org --- (In reply to edeity from comment #1)
We've encountered a similar segfault.
Program terminated with signal SIGSEGV, Segmentation fault. #0 ldap_chain_op (op=op@entry=0x7f32a812df70, rs=rs@entry=0x7f32bd14b9a0, op_f=0x56207ba8aa50 <ldap_back_search>, ref=ref@entry=0x0, depth=depth@entry=0) at chain.c:422 422 for ( ; !BER_BVISNULL( ref ); ref++ ) {
The segfault occurred because ref itself is NULL as you can see, and BER_BVISNULL checks for ref->by_val:
Should be fixed by 2346dfd2a04cbc84c69b76249a41af070c2991ad, please test.
From what I can see, this can only occur if you have an entry of objectclass referral, but it has no ref attribute. Since the ref attribute is required by the referral OC schema, this should never happen.
https://bugs.openldap.org/show_bug.cgi?id=9262
--- Comment #3 from Quanah Gibson-Mount quanah@openldap.org --- main:
• 2346dfd2 by Howard Chu at 2020-06-21T00:45:45+01:00 ITS#9262 check referral
https://bugs.openldap.org/show_bug.cgi?id=9262
Quanah Gibson-Mount quanah@openldap.org changed:
What |Removed |Added ---------------------------------------------------------------------------- Status|UNCONFIRMED |RESOLVED Resolution|--- |TEST
https://bugs.openldap.org/show_bug.cgi?id=9262
--- Comment #4 from jeremy.diaz@rexconsulting.net --- None of the entries in our database make use of the ref attribute. The only referrals being used are those set through the chain overlay for the olcDbURI attribute. One thing I noticed is that when a config conversion is done through slaptest, a 0th chain overlay ldap backend entry is added with no dbURI attribute. I do not see anywhere else that could be missing a ref.
https://bugs.openldap.org/show_bug.cgi?id=9262
--- Comment #5 from Quanah Gibson-Mount quanah@openldap.org --- Hi Jeremy,
Have you been able to test with the suggested fix?
https://bugs.openldap.org/show_bug.cgi?id=9262
--- Comment #6 from jeremy.diaz@rexconsulting.net --- Hi Quannah, I am still in the process of testing and will post an update once that is done.
https://bugs.openldap.org/show_bug.cgi?id=9262
--- Comment #7 from jeremy.diaz@rexconsulting.net --- (In reply to Quanah Gibson-Mount from comment #5)
Hi Jeremy,
Have you been able to test with the suggested fix?
Hi Quanah i'm wondering what is the recommended way to apply the patch to the stable 2.4.50 release?
https://bugs.openldap.org/show_bug.cgi?id=9262
--- Comment #8 from Quanah Gibson-Mount quanah@openldap.org --- (In reply to jeremy.diaz from comment #7)
(In reply to Quanah Gibson-Mount from comment #5)
Hi Jeremy,
Have you been able to test with the suggested fix?
Hi Quanah i'm wondering what is the recommended way to apply the patch to the stable 2.4.50 release?
It's a 3 line change in a section of code that's identical between 2.4 and master. It should be trivial to apply via the patch utility, or just cut and paste, or anything else that works for you. ;)
https://bugs.openldap.org/show_bug.cgi?id=9262
Quanah Gibson-Mount quanah@openldap.org changed:
What |Removed |Added ---------------------------------------------------------------------------- Resolution|TEST |FIXED
--- Comment #9 from Quanah Gibson-Mount quanah@openldap.org --- RE24: Commits: • 6fade7d0 by Howard Chu at 2020-07-23T18:21:20+00:00 ITS#9262 check referral
https://bugs.openldap.org/show_bug.cgi?id=9262
Quanah Gibson-Mount quanah@openldap.org changed:
What |Removed |Added ---------------------------------------------------------------------------- Status|RESOLVED |VERIFIED