https://bugs.openldap.org/show_bug.cgi?id=9210
Bug ID: 9210 Summary: [with patch] Infinite retry-loop (and thus 100% CPU-Usage) when lots of requests are issued Product: OpenLDAP Version: 2.4.47 Hardware: All OS: All Status: UNCONFIRMED Severity: normal Priority: --- Component: libraries Assignee: bugs@openldap.org Reporter: lukas.juhrich@agdsn.de Target Milestone: ---
Created attachment 706 --> https://bugs.openldap.org/attachment.cgi?id=706&action=edit Patch adding errno resets
*tl;dr* single-stepping revealed a missing `errno` reset in `ber_int_sb_write`s retry loop.
An sssd-setup of ours, which we use for basic-auth on one of our services, issues ldap calls. When under load, i.e. when many `ldap_search_ext` calls had to be issued due to many requests, we observed that the corresponding process/thread went up to 100% CPU usage and stayed there.
- This was the [flamegraph](https://helios.wh2.tu-dresden.de/~shreyder/sssd_be%20--domain%20dom-http-wik...), where you can see that it was stuck below `ber_int_sb_write`. - Single-Stepping with GDB revealed that we are stuck in the `for(;;)`-Retry-loop. Indeed, we could observe that the `sbi_write` was successful, but the `errno` continued to be `EINTR` every time I hit that breakpoint. - Patching `sockbuf.c` as attached and rebuilding resolved the issue.
I also noticed similar sections with such a loop in `sockbuf.c` and added `errno = 0;` at the beginning of each iteration. In principle, they should suffer from the same problem.
The reasoning for why this happened under load is that with many requests being issued, the probability that the write happens when the process gets an _actual_ interrupt is much higher, and once that happens, we're stuck in the infinite loop.
https://bugs.openldap.org/show_bug.cgi?id=9210
Quanah Gibson-Mount quanah@openldap.org changed:
What |Removed |Added ---------------------------------------------------------------------------- Target Milestone|--- |2.4.50
https://bugs.openldap.org/show_bug.cgi?id=9210
--- Comment #1 from Howard Chu hyc@openldap.org --- Sorry, your explanation for this patch makes no sense. If the sbi_write() actually succeeds, then ret will be >= 0. And the loop condition is if ((ret<0) && (errno==EINTR)) continue;
So if the write succeeds, errno is irrelevant. If the write actually fails, errno will be set to an actual value. As such, a forced reset of errno is not relevant.
https://bugs.openldap.org/show_bug.cgi?id=9210
--- Comment #2 from lukas.juhrich@agdsn.de lukas.juhrich@agdsn.de --- If you wish, I can try to reproduce this again. Unfortunately, this has been two months ago and I've only come around to the bugreporting now, so I will need to downgrade and set some things up again.
The basic observation still stands though, adding the errno fixes the symptom.
https://bugs.openldap.org/show_bug.cgi?id=9210
--- Comment #3 from Howard Chu hyc@openldap.org --- (In reply to lukas.juhrich@agdsn.de from comment #2)
If you wish, I can try to reproduce this again. Unfortunately, this has been two months ago and I've only come around to the bugreporting now, so I will need to downgrade and set some things up again.
The basic observation still stands though, adding the errno fixes the symptom.
Yes, please try to reproduce it.
But you understand, the behavior you've reported is not how errno is documented to work. Your flamegraph shows that gnutls is underneath, perhaps it is mis-handling its reporting of success/failure. In any case, when an underlying call succeeds, its return value must be >= 0, and the value of errno is ignored. The value of errno is only significant if the return value is < 0.
https://bugs.openldap.org/show_bug.cgi?id=9210
--- Comment #4 from lukas.juhrich@agdsn.de lukas.juhrich@agdsn.de ---
Yes, please try to reproduce it.
I'm on it, after downgrading I can observe the behavior again, setting gdb up.
Thanks for your quick response so far and sorry for my sloppy analysis.
https://bugs.openldap.org/show_bug.cgi?id=9210
--- Comment #5 from lukas.juhrich@agdsn.de lukas.juhrich@agdsn.de --- Quick update, I found a document from myself from the last debugging.
The return value in the loop is always `-408`, which is `GNUTLS_E_UNAVAILABLE_DURING_HANDSHAKE`.
https://bugs.openldap.org/show_bug.cgi?id=9210
Ryan Tandy ryan@openldap.org changed:
What |Removed |Added ---------------------------------------------------------------------------- See Also| |https://bugs.openldap.org/s | |how_bug.cgi?id=8650
https://bugs.openldap.org/show_bug.cgi?id=9210
--- Comment #6 from Howard Chu hyc@openldap.org --- (In reply to lukas.juhrich@agdsn.de from comment #5)
Quick update, I found a document from myself from the last debugging.
The return value in the loop is always `-408`, which is `GNUTLS_E_UNAVAILABLE_DURING_HANDSHAKE`.
Which is quite different from your original report saying there was no error.
This still sounds like a GnuTLS bug. What version of that are you using? Can you reproduce this workload using OpenSSL instead?
https://bugs.openldap.org/show_bug.cgi?id=9210
--- Comment #7 from Ryan Tandy ryan@openldap.org --- Hi Lukas, can you check whether gnutls_handshake() returns GNUTLS_E_AGAIN in your case? If so then this is simply #8650 again. Also, what version of GnuTLS are you on, please?
https://bugs.openldap.org/show_bug.cgi?id=9210
--- Comment #8 from lukas.juhrich@agdsn.de lukas.juhrich@agdsn.de --- Gnutls version (debian packages): libgnutls28-dev=3.6.7-4+deb10u3, libgnutls30=3.6.7-4+deb10u2
Unfortunately, my reproduction stopped working, so I cannot give any more details right now. I'm trying to find an alternative.
https://bugs.openldap.org/show_bug.cgi?id=9210
--- Comment #9 from lukas.juhrich@agdsn.de lukas.juhrich@agdsn.de --- Clarification: By „My reproduction stopped working“, I mean
1. It worked a few minutes ago, and I had a gdb session with precisely the same situation (errno == EINTR, ret == GNUTLS_E_UNAVAILABLE_DURING_HANDSHAKE). No, it was not EAGAIN. Yes, it was the same stacktrace as in the flamegraph. 2. Then, it stopped working, and the CPU didn't go to 100% anymore.
https://bugs.openldap.org/show_bug.cgi?id=9210
Ryan Tandy ryan@openldap.org changed:
What |Removed |Added ---------------------------------------------------------------------------- Resolution|--- |DUPLICATE Status|UNCONFIRMED |RESOLVED
--- Comment #10 from Ryan Tandy ryan@openldap.org --- I just noticed your original report did mention sssd. So yes, this is exactly bug 8650. This case was previously documented as https://bugs.debian.org/849756 and https://github.com/SSSD/sssd/pull/67. The "fix" for that issue was to change the socket to blocking which is more like a workaround for the libldap bug. It's the same bug and happens equally if libldap sets the socket non-blocking itself (for example when doing an async connect). The GNUTLS_E_UNAVAILABLE_DURING_HANDSHAKE error code was introduced in GnuTLS 3.6.0 https://gitlab.com/gnutls/gnutls/issues/158 exactly to highlight users like libldap doing the wrong thing with gnutls_handshake.
*** This bug has been marked as a duplicate of bug 8650 ***
https://bugs.openldap.org/show_bug.cgi?id=9210
lukas.juhrich@agdsn.de lukas.juhrich@agdsn.de changed:
What |Removed |Added ---------------------------------------------------------------------------- Resolution|DUPLICATE |FIXED
--- Comment #11 from lukas.juhrich@agdsn.de lukas.juhrich@agdsn.de --- Thanks Ryan, your explanation makes sense (although I can't verify this is what actually happens due to lack of reproduction).
However, that sounds to me like a separate phenomenon from the missing `errno` reset: Shouldn't the retry condition be `( ret < 0 ) || ( errno == EINTR )`, so that in any of both scenarios (stuck in handshake or similar vs. interrupt during request) the retry is triggered?
Because if I would change the condition that way, my patch would instead _not_ fix my symptoms, which would be more consistent with your explanation (given that #8650 is still open).
Sorry if I misunderstand something, I'm not a C programmer and this whole library-ecosystem is quite foreign to me.
https://bugs.openldap.org/show_bug.cgi?id=9210
--- Comment #12 from Howard Chu hyc@openldap.org --- (In reply to lukas.juhrich@agdsn.de from comment #11)
However, that sounds to me like a separate phenomenon from the missing `errno` reset: Shouldn't the retry condition be `( ret < 0 ) || ( errno == EINTR )`, so that in any of both scenarios (stuck in handshake or similar vs. interrupt during request) the retry is triggered?
No.
https://bugs.openldap.org/show_bug.cgi?id=9210
lukas.juhrich@agdsn.de lukas.juhrich@agdsn.de changed:
What |Removed |Added ---------------------------------------------------------------------------- Summary|[with patch] Infinite |Infinite retry-loop (and |retry-loop (and thus 100% |thus 100% CPU-Usage) when |CPU-Usage) when lots of |lots of requests are issued |requests are issued |
https://bugs.openldap.org/show_bug.cgi?id=9210
Ryan Tandy ryan@openldap.org changed:
What |Removed |Added ---------------------------------------------------------------------------- Resolution|FIXED |DUPLICATE
--- Comment #13 from Ryan Tandy ryan@openldap.org ---
*** This bug has been marked as a duplicate of bug 8650 ***
https://bugs.openldap.org/show_bug.cgi?id=9210
--- Comment #14 from Ryan Tandy ryan@openldap.org --- Created attachment 707 --> https://bugs.openldap.org/attachment.cgi?id=707&action=edit test program
Here's a test program that exercises connecting with a non-blocking socket, like sssd does. Currently it fails on 2.4 with LDAP_SERVER_DOWN and on 2.5 with LDAP_TIMEOUT, but succeeds if you comment out the fcntl().
This only tests the non-blocking scenario. Any patch should also be tested with a blocking socket per the scenario from bug 8650.
https://bugs.openldap.org/show_bug.cgi?id=9210
Quanah Gibson-Mount quanah@openldap.org changed:
What |Removed |Added ---------------------------------------------------------------------------- Status|RESOLVED |VERIFIED