https://bugs.openldap.org/show_bug.cgi?id=10141
Issue ID: 10141
Summary: 100% CPU consumption with ldap_int_tls_connect
Product: OpenLDAP
Version: 2.6.3
Hardware: Other
OS: Linux
Status: UNCONFIRMED
Keywords: needs_review
Severity: normal
Priority: ---
Component: libraries
Assignee: bugs@openldap.org
Reporter: vivekanand754@gmail.com
Target Milestone: ---
While making a secure LDAP connection, I'm seeing the connection get stuck in a read loop when it is sometimes unable to reach Active Directory:

~ # strace -p 15049
strace: Process 15049 attached
read(3, 0x55ef720bda53, 5) = -1 EAGAIN (Resource temporarily unavailable)
read(3, 0x55ef720bda53, 5) = -1 EAGAIN (Resource temporarily unavailable)
..
..
..
..
After adding some logging, I can see that the "ldap_int_tls_start" function in "openldap-2.6.3/libraries/libldap/tls2.c" calls "ldap_int_tls_connect" in a while loop. It behaves like a blocking call: it retries continuously until the connection succeeds (ti_session_connect returns 0), and so consumes 100% CPU during that time.
Is this a known issue?
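For illustration, the pattern in the strace output above can be reproduced outside libldap. The following is a minimal C sketch, not OpenLDAP code: `set_nonblocking` and `count_eagain_spins` are hypothetical helpers written for this example, and the retry loop is only a stand-in for a handshake loop that retries without waiting. It shows that each read() on a non-blocking socket with no pending data fails immediately with EAGAIN, so such a loop spins on the CPU.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not part of libldap: put fd into non-blocking mode. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/*
 * Hypothetical stand-in for a retry loop that never waits between attempts.
 * Calls read() up to max_attempts times and returns how many of those calls
 * failed immediately with EAGAIN/EWOULDBLOCK.
 */
static long count_eagain_spins(int fd, long max_attempts)
{
    char buf[5];
    long spins = 0;

    for (long i = 0; i < max_attempts; i++) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n >= 0)
            break;              /* data arrived or the peer closed */
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            break;              /* a real error, stop retrying */
        spins++;                /* nothing to read, retried at once */
    }
    return spins;
}
```

Called on one end of a `socketpair(AF_UNIX, SOCK_STREAM, 0, sv)` that has been made non-blocking and has no data queued, `count_eagain_spins(sv[0], 1000)` should report all 1000 attempts as spins, which matches the repeated EAGAIN lines in the trace.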
--- Comment #1 from Quanah Gibson-Mount quanah@openldap.org --- There is not enough detail here to be actionable. Please provide further details. Is this sync or async? Which TLS library is in use? etc.
Quanah Gibson-Mount quanah@openldap.org changed:
What    |Removed      |Added
----------------------------------------------------------------------------
Keywords|needs_review |
--- Comment #2 from Vivek Anand vivekanand754@gmail.com --- This issue occurs in sync mode, and the TLS library in use is OpenSSL (--with-tls=openssl).
--- Comment #3 from Vivek Anand vivekanand754@gmail.com --- Created attachment 999 --> https://bugs.openldap.org/attachment.cgi?id=999&action=edit wait for 5ms and then retry after ldap_int_tls_connect failure
If ldap_int_tls_connect fails in the sync flow, I added a sleep of 5ms before retrying. This resolves the issue: I no longer see high CPU consumption because the loop frequency is reduced. The patch is attached (tls-reconnect-delay.patch).
Let me know if I can go with this minor change. Hope this change will not impact anything else. Let me know if this change is not recommended and if there is any other way to resolve this issue.
--- Comment #4 from Quanah Gibson-Mount quanah@openldap.org --- (In reply to Vivek Anand from comment #3)
Created attachment 999 [details] wait for 5ms and then retry after ldap_int_tls_connect failure
You need to provide concrete details and code examples of what you are doing. This patch cannot be accepted since it is just a workaround, not a solution to an actual problem.
--- Comment #5 from Quanah Gibson-Mount quanah@openldap.org --- (In reply to Quanah Gibson-Mount from comment #4)
(In reply to Vivek Anand from comment #3)
Created attachment 999 [details] wait for 5ms and then retry after ldap_int_tls_connect failure
You need to provide concrete details and code examples of what you are doing. This patch cannot be accepted since it is just a workaround, not a solution to an actual problem.
This didn't go out via email due to a bug in Bugzilla exposed by today's maintenance. Please note the above.
--- Comment #6 from Vivek Anand vivekanand754@gmail.com --- Created attachment 1002 --> https://bugs.openldap.org/attachment.cgi?id=1002&action=edit set async mode
Basically, I'm running a script that continuously calls a GET API which authenticates via LDAP. The flow is (MyApp --> Linux-PAM --> nss_ldap --> openldap).
Sometimes, if the LDAP server is not reachable, the script hangs because it gets stuck in the while loop that continuously calls "ldap_int_tls_connect" (in the "ldap_int_tls_start" function of "openldap-2.6.3/libraries/libldap/tls2.c"). During this time MyApp consumes 100% CPU and stays there for about 16~17 min. After that the connection gets terminated and CPU usage returns to normal.
To fix this problem, I tried the 2 approaches below:
1) Introduced a sleep of 50ms to reduce the while loop frequency (sync mode of operation): this reduced CPU consumption, but the process remained stuck for 16~17 min and was released after that.
2) Set async mode of operation (using LDAP_BOOL_SET in openldap-2.6.3/libraries/libldap/init.c): got a similar result to approach 1.
Both approaches reduced the CPU consumption of MyApp from ~25% (ideal scenario) to ~1.3%, and with them I'm no longer able to hit 100% CPU under API load.
I have the following questions:
1) How should this issue be handled for sync mode of operation? Is there a timeout parameter we can configure so that, if it is unable to connect to the LDAP server, it exits the while loop after the configured timeout?
2) Is there any way to set async mode via configuration?
--- Comment #7 from Vivek Anand vivekanand754@gmail.com --- Hi Team,
Is there any update on this? Do let me know if there is any other faster channel for communication.
-Thanks
--- Comment #8 from Quanah Gibson-Mount quanah@openldap.org --- (In reply to Vivek Anand from comment #7)
Hi Team,
Is there any update on this? Do let me know if there is any other faster channel for communication.
Hello,
Based on your description, you're opening a massive number of connections, which is causing OpenSSL to run out of entropy and is why adding the sleep works for you. This appears to be an abuse of the library. If that is correct, then the question is why you are opening a massive number of connections instead of simply using a small number of connections to do multiple queries.
--- Comment #9 from Vivek Anand vivekanand754@gmail.com --- Just to clarify, I'm not opening a massive number of connections. The API load is sequential, like below:
```
while [ 1 ]
do
    curl -v -k -u user:"password" --request GET "https://x.x.x.x/api/xyz"
    sleep 1
done
```
So only one application thread is spawned at a time. It is processed and then released. At some point, if the LDAP server is not reachable, the application thread active at that moment gets stuck and consumes 100% CPU.
Please do let me know your recommendation regarding this issue.
Also, it would be great if you could help with my earlier question: "Is there any way to set async mode via any configuration?"
--- Comment #10 from Vivek Anand vivekanand754@gmail.com --- Hi Team,
Any help with above query?
-Thanks
--- Comment #11 from maxime.besson@worteks.com maxime.besson@worteks.com --- Hi Quanah, I was able to reproduce this issue (or a very similar one) easily while investigating a production outage
* Start slapd somewhere (any version should work)
* pkill -STOP slapd (will freeze the slapd service, simulating an unresponsive LDAP server, but still allowing TCP connections to succeed)
* strace ldapsearch -H ldaps://slapd -d 1
(reports a single read() syscall hanging because the server is stuck)
* strace ldapsearch -H ldaps://slapd -d 1 -o network_timeout=5
(reports rapid-fire read() syscalls on the nonblocking socket)
I was able to reproduce this on the 2.6 branch with OpenSSL as well as a recent Debian system (2.5 branch + GnuTLS)
--- Comment #12 from maxime.besson@worteks.com maxime.besson@worteks.com --- Created attachment 1035 --> https://bugs.openldap.org/attachment.cgi?id=1035&action=edit prevent busyloop when socket was set to nonblocking by network_timeout option
I believe this patch might fix the issue by polling the socket before attempting a read. This was previously only performed in async mode, but setting a timeout also causes the socket to be nonblocking.
I renamed the async variable for clarity
--- Comment #13 from Ondřej Kuzník ondra@mistotebe.net --- On Tue, Oct 22, 2024 at 09:20:22AM +0000, openldap-its@openldap.org wrote:
I believe this patch might fix the issue by polling the socket before attempting a read. This is previously only performed in async mode. But setting a timeout also causes the socket to be nonblocking.
I renamed the async variable for clarity
Hi Maxime, thanks for the extra information and a proposed patch. Can you test the patch in MR!727 I created yesterday? It overlaps what you've just submitted and I believe is a more correct approach.
Thanks,
--- Comment #14 from maxime.besson@worteks.com maxime.besson@worteks.com --- Hi Ondřej, MR!727 correctly polls the socket instead of looping on read() syscalls when server is unresponsive during TLS handshake, thanks again!
Quanah Gibson-Mount quanah@openldap.org changed:
What    |Removed           |Added
----------------------------------------------------------------------------
Assignee|bugs@openldap.org |ondra@mistotebe.net
Quanah Gibson-Mount quanah@openldap.org changed:
What      |Removed     |Added
----------------------------------------------------------------------------
Resolution|---         |DUPLICATE
Status    |UNCONFIRMED |RESOLVED
--- Comment #15 from Quanah Gibson-Mount quanah@openldap.org ---
*** This issue has been marked as a duplicate of issue 8047 ***
Quanah Gibson-Mount quanah@openldap.org changed:
What  |Removed  |Added
----------------------------------------------------------------------------
Status|RESOLVED |VERIFIED