[Bug 9210] New: [with patch] Infinite retry-loop (and thus 100% CPU-Usage) when lots of requests are issued

List overview All Threads
Download

newer

older

[Bug 8650] EAGAIN from...

openldap-its＠openldap.org

12 Apr 2020 12 Apr '20

8:29 a.m.

https://bugs.openldap.org/show_bug.cgi?id=9210

Bug ID: 9210 Summary: [with patch] Infinite retry-loop (and thus 100% CPU-Usage) when lots of requests are issued Product: OpenLDAP Version: 2.4.47 Hardware: All OS: All Status: UNCONFIRMED Severity: normal Priority: --- Component: libraries Assignee: bugs@openldap.org Reporter: lukas.juhrich@agdsn.de Target Milestone: ---

Created attachment 706 --> https://bugs.openldap.org/attachment.cgi?id=706&action=edit Patch adding errno resets

*tl;dr* single-stepping revealed a missing `errno` reset in `ber_int_sb_write`s retry loop.

An sssd-setup of ours, which we use for basic-auth on one of our services, issues ldap calls. When under load, i.e. when many `ldap_search_ext` calls had to be issued due to many requests, we observed that the corresponding process/thread went up to 100% CPU usage and stayed there.

- This was the [flamegraph](https://helios.wh2.tu-dresden.de/~shreyder/sssd_be%20--domain%20dom-http-wik...), where you can see that it was stuck below `ber_int_sb_write`. - Single-Stepping with GDB revealed that we are stuck in the `for(;;)`-Retry-loop. Indeed, we could observe that the `sbi_write` was successful, but the `errno` continued to be `EINTR` every time I hit that breakpoint. - Patching `sockbuf.c` as attached and rebuilding resolved the issue.

I also noticed similar sections with such a loop in `sockbuf.c` and added `errno = 0;` at the beginning of each iteration. In principle, they should suffer from the same problem.

The reasoning for why this happened under load is that with many requests being issued, the probability that the write happens when the process gets an _actual_ interrupt is much higher, and once that happens, we're stuck in the infinite loop.

-- You are receiving this mail because: You are on the CC list for the bug.

Show replies by date

openldap-its＠openldap.org

12 Apr 12 Apr

8:40 a.m.

New subject: [Bug 9210] [with patch] Infinite retry-loop (and thus 100% CPU-Usage) when lots of requests are issued

https://bugs.openldap.org/show_bug.cgi?id=9210

Quanah Gibson-Mount quanah@openldap.org changed:

What |Removed |Added ---------------------------------------------------------------------------- Target Milestone|--- |2.4.50

-- You are receiving this mail because: You are on the CC list for the bug.

openldap-its＠openldap.org

9:14 a.m.

New subject: [Bug 9210] [with patch] Infinite retry-loop (and thus 100% CPU-Usage) when lots of requests are issued

https://bugs.openldap.org/show_bug.cgi?id=9210

--- Comment #1 from Howard Chu hyc@openldap.org --- Sorry, your explanation for this patch makes no sense. If the sbi_write() actually succeeds, then ret will be >= 0. And the loop condition is if ((ret<0) && (errno==EINTR)) continue;

So if the write succeeds, errno is irrelevant. If the write actually fails, errno will be set to an actual value. As such, a forced reset of errno is not relevant.

-- You are receiving this mail because: You are on the CC list for the bug.

openldap-its＠openldap.org

9:28 a.m.

New subject: [Bug 9210] [with patch] Infinite retry-loop (and thus 100% CPU-Usage) when lots of requests are issued

https://bugs.openldap.org/show_bug.cgi?id=9210

--- Comment #2 from lukas.juhrich@agdsn.de lukas.juhrich@agdsn.de --- If you wish, I can try to reproduce this again. Unfortunately, this has been two months ago and I've only come around to the bugreporting now, so I will need to downgrade and set some things up again.

The basic observation still stands though, adding the errno fixes the symptom.

-- You are receiving this mail because: You are on the CC list for the bug.

openldap-its＠openldap.org

9:56 a.m.

New subject: [Bug 9210] [with patch] Infinite retry-loop (and thus 100% CPU-Usage) when lots of requests are issued

https://bugs.openldap.org/show_bug.cgi?id=9210

--- Comment #3 from Howard Chu hyc@openldap.org --- (In reply to lukas.juhrich@agdsn.de from comment #2)

...

If you wish, I can try to reproduce this again. Unfortunately, this has been two months ago and I've only come around to the bugreporting now, so I will need to downgrade and set some things up again.

The basic observation still stands though, adding the errno fixes the symptom.

Yes, please try to reproduce it.

But you understand, the behavior you've reported is not how errno is documented to work. Your flamegraph shows that gnutls is underneath, perhaps it is mis-handling its reporting of success/failure. In any case, when an underlying call succeeds, its return value must be >= 0, and the value of errno is ignored. The value of errno is only significant if the return value is < 0.

-- You are receiving this mail because: You are on the CC list for the bug.

openldap-its＠openldap.org

10:13 a.m.

New subject: [Bug 9210] [with patch] Infinite retry-loop (and thus 100% CPU-Usage) when lots of requests are issued

https://bugs.openldap.org/show_bug.cgi?id=9210

--- Comment #4 from lukas.juhrich@agdsn.de lukas.juhrich@agdsn.de ---

...

Yes, please try to reproduce it.

I'm on it, after downgrading I can observe the behavior again, setting gdb up.

Thanks for your quick response so far and sorry for my sloppy analysis.

-- You are receiving this mail because: You are on the CC list for the bug.

openldap-its＠openldap.org

10:31 a.m.

New subject: [Bug 9210] [with patch] Infinite retry-loop (and thus 100% CPU-Usage) when lots of requests are issued

https://bugs.openldap.org/show_bug.cgi?id=9210

--- Comment #5 from lukas.juhrich@agdsn.de lukas.juhrich@agdsn.de --- Quick update, I found a document from myself from the last debugging.

The return value in the loop is always `-408`, which is `GNUTLS_E_UNAVAILABLE_DURING_HANDSHAKE`.

-- You are receiving this mail because: You are on the CC list for the bug.

openldap-its＠openldap.org

10:35 a.m.

New subject: [Bug 9210] [with patch] Infinite retry-loop (and thus 100% CPU-Usage) when lots of requests are issued

https://bugs.openldap.org/show_bug.cgi?id=9210

Ryan Tandy ryan@openldap.org changed:

-- You are receiving this mail because: You are on the CC list for the bug.

openldap-its＠openldap.org

10:52 a.m.

New subject: [Bug 9210] [with patch] Infinite retry-loop (and thus 100% CPU-Usage) when lots of requests are issued

https://bugs.openldap.org/show_bug.cgi?id=9210

--- Comment #6 from Howard Chu hyc@openldap.org --- (In reply to lukas.juhrich@agdsn.de from comment #5)

...

Quick update, I found a document from myself from the last debugging.

The return value in the loop is always `-408`, which is `GNUTLS_E_UNAVAILABLE_DURING_HANDSHAKE`.

Which is quite different from your original report saying there was no error.

This still sounds like a GnuTLS bug. What version of that are you using? Can you reproduce this workload using OpenSSL instead?

-- You are receiving this mail because: You are on the CC list for the bug.

openldap-its＠openldap.org

10:58 a.m.

New subject: [Bug 9210] [with patch] Infinite retry-loop (and thus 100% CPU-Usage) when lots of requests are issued

https://bugs.openldap.org/show_bug.cgi?id=9210

--- Comment #7 from Ryan Tandy ryan@openldap.org --- Hi Lukas, can you check whether gnutls_handshake() returns GNUTLS_E_AGAIN in your case? If so then this is simply #8650 again. Also, what version of GnuTLS are you on, please?

-- You are receiving this mail because: You are on the CC list for the bug.

openldap-its＠openldap.org

11:30 a.m.

New subject: [Bug 9210] [with patch] Infinite retry-loop (and thus 100% CPU-Usage) when lots of requests are issued

https://bugs.openldap.org/show_bug.cgi?id=9210

--- Comment #8 from lukas.juhrich@agdsn.de lukas.juhrich@agdsn.de --- Gnutls version (debian packages): libgnutls28-dev=3.6.7-4+deb10u3, libgnutls30=3.6.7-4+deb10u2

Unfortunately, my reproduction stopped working, so I cannot give any more details right now. I'm trying to find an alternative.

-- You are receiving this mail because: You are on the CC list for the bug.

openldap-its＠openldap.org

11:33 a.m.

New subject: [Bug 9210] [with patch] Infinite retry-loop (and thus 100% CPU-Usage) when lots of requests are issued

https://bugs.openldap.org/show_bug.cgi?id=9210

--- Comment #9 from lukas.juhrich@agdsn.de lukas.juhrich@agdsn.de --- Clarification: By „My reproduction stopped working“, I mean

1. It worked a few minutes ago, and I had a gdb session with precisely the same situation (errno == EINTR, ret == GNUTLS_E_UNAVAILABLE_DURING_HANDSHAKE). No, it was not EAGAIN. Yes, it was the same stacktrace as in the flamegraph. 2. Then, it stopped working, and the CPU didn't go to 100% anymore.

-- You are receiving this mail because: You are on the CC list for the bug.

openldap-its＠openldap.org

11:58 a.m.

New subject: [Bug 9210] [with patch] Infinite retry-loop (and thus 100% CPU-Usage) when lots of requests are issued

https://bugs.openldap.org/show_bug.cgi?id=9210

Ryan Tandy ryan@openldap.org changed:

--- Comment #10 from Ryan Tandy ryan@openldap.org --- I just noticed your original report did mention sssd. So yes, this is exactly bug 8650. This case was previously documented as https://bugs.debian.org/849756 and https://github.com/SSSD/sssd/pull/67. The "fix" for that issue was to change the socket to blocking which is more like a workaround for the libldap bug. It's the same bug and happens equally if libldap sets the socket non-blocking itself (for example when doing an async connect). The GNUTLS_E_UNAVAILABLE_DURING_HANDSHAKE error code was introduced in GnuTLS 3.6.0 https://gitlab.com/gnutls/gnutls/issues/158 exactly to highlight users like libldap doing the wrong thing with gnutls_handshake.

*** This bug has been marked as a duplicate of bug 8650 ***

-- You are receiving this mail because: You are on the CC list for the bug.

openldap-its＠openldap.org

12:28 p.m.

New subject: [Bug 9210] [with patch] Infinite retry-loop (and thus 100% CPU-Usage) when lots of requests are issued

https://bugs.openldap.org/show_bug.cgi?id=9210

lukas.juhrich@agdsn.de lukas.juhrich@agdsn.de changed:

What |Removed |Added ---------------------------------------------------------------------------- Resolution|DUPLICATE |FIXED

--- Comment #11 from lukas.juhrich@agdsn.de lukas.juhrich@agdsn.de --- Thanks Ryan, your explanation makes sense (although I can't verify this is what actually happens due to lack of reproduction).

However, that sounds to me like a separate phenomenon from the missing `errno` reset: Shouldn't the retry condition be `( ret < 0 ) || ( errno == EINTR )`, so that in any of both scenarios (stuck in handshake or similar vs. interrupt during request) the retry is triggered?

Because if I would change the condition that way, my patch would instead _not_ fix my symptoms, which would be more consistent with your explanation (given that #8650 is still open).

Sorry if I misunderstand something, I'm not a C programmer and this whole library-ecosystem is quite foreign to me.

-- You are receiving this mail because: You are on the CC list for the bug.

openldap-its＠openldap.org

12:30 p.m.

New subject: [Bug 9210] [with patch] Infinite retry-loop (and thus 100% CPU-Usage) when lots of requests are issued

https://bugs.openldap.org/show_bug.cgi?id=9210

--- Comment #12 from Howard Chu hyc@openldap.org --- (In reply to lukas.juhrich@agdsn.de from comment #11)

...

However, that sounds to me like a separate phenomenon from the missing `errno` reset: Shouldn't the retry condition be `( ret < 0 ) || ( errno == EINTR )`, so that in any of both scenarios (stuck in handshake or similar vs. interrupt during request) the retry is triggered?

No.

-- You are receiving this mail because: You are on the CC list for the bug.

openldap-its＠openldap.org

12:35 p.m.

New subject: [Bug 9210] Infinite retry-loop (and thus 100% CPU-Usage) when lots of requests are issued

https://bugs.openldap.org/show_bug.cgi?id=9210

lukas.juhrich@agdsn.de lukas.juhrich@agdsn.de changed:

-- You are receiving this mail because: You are on the CC list for the bug.

openldap-its＠openldap.org

1:46 p.m.

New subject: [Bug 9210] Infinite retry-loop (and thus 100% CPU-Usage) when lots of requests are issued

https://bugs.openldap.org/show_bug.cgi?id=9210

Ryan Tandy ryan@openldap.org changed:

What |Removed |Added ---------------------------------------------------------------------------- Resolution|FIXED |DUPLICATE

--- Comment #13 from Ryan Tandy ryan@openldap.org ---

*** This bug has been marked as a duplicate of bug 8650 ***

-- You are receiving this mail because: You are on the CC list for the bug.

openldap-its＠openldap.org

1:55 p.m.

New subject: [Bug 9210] Infinite retry-loop (and thus 100% CPU-Usage) when lots of requests are issued

https://bugs.openldap.org/show_bug.cgi?id=9210

--- Comment #14 from Ryan Tandy ryan@openldap.org --- Created attachment 707 --> https://bugs.openldap.org/attachment.cgi?id=707&action=edit test program

Here's a test program that exercises connecting with a non-blocking socket, like sssd does. Currently it fails on 2.4 with LDAP_SERVER_DOWN and on 2.5 with LDAP_TIMEOUT, but succeeds if you comment out the fcntl().

This only tests the non-blocking scenario. Any patch should also be tested with a blocking socket per the scenario from bug 8650.

-- You are receiving this mail because: You are on the CC list for the bug.

openldap-its＠openldap.org

13 Apr 13 Apr

7:45 a.m.

New subject: [Bug 9210] Infinite retry-loop (and thus 100% CPU-Usage) when lots of requests are issued

https://bugs.openldap.org/show_bug.cgi?id=9210

Quanah Gibson-Mount quanah@openldap.org changed:

What |Removed |Added ---------------------------------------------------------------------------- Status|RESOLVED |VERIFIED

-- You are receiving this mail because: You are on the CC list for the bug.

2016

Age (days ago)

2017

Last active (days ago)

openldap-bugs@openldap.org

18 comments

1 participants

tags (0)

participants (1)

openldap-its＠openldap.org