https://bugs.openldap.org/show_bug.cgi?id=9930
Issue ID: 9930
Summary: test050 deadlock on BSD OSes
Product: OpenLDAP
Version: 2.5.13
Hardware: All
OS: All
Status: UNCONFIRMED
Keywords: needs_review
Severity: normal
Priority: ---
Component: slapd
Assignee: bugs@openldap.org
Reporter: quanah@openldap.org
Target Milestone: ---
This was initially reported to me directly as an issue with NetBSD 9.1; I was able to reproduce it with FreeBSD 13.1 as well. Investigation is ongoing.
Running test050 in a loop eventually results in a deadlock in one of the 4 slapd provider processes.
--- Comment #1 from Quanah Gibson-Mount quanah@openldap.org ---
Created attachment 921
  --> https://bugs.openldap.org/attachment.cgi?id=921&action=edit
gdb output for all threads
--- Comment #2 from Ondřej Kuzník ondra@mistotebe.net ---
Thread 5 has paused the server and is now waiting to wlock() the olcDatabase={1}mdb,cn=config config entry, but thread 6 already holds a read lock on it (and is currently waiting for the pause to end).
The issue is that workers are not meant to join the pause while holding the rlock, but slap_send_search_entry->send_ldap_ber is willing to do just that because we got EAGAIN/EWOULDBLOCK on the socket. I assume this can end up worse than a deadlock if the server actually reconfigured itself while we were waiting for the writer to flush?
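The cycle described above can be modelled as a two-node wait-for graph. The following is a minimal sketch, not slapd code: the thread labels, `has_cycle`, and `scenario_deadlocks` are hypothetical names introduced only for illustration, and the model assumes the pausing thread needs the write lock before it can resume the pool.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the two threads from the gdb output; not slapd code. */
enum { PAUSER = 0, WORKER = 1, NTHREADS = 2 };

/* waits_for[a] == b means thread a cannot proceed until thread b does;
 * -1 means thread a is not blocked on anyone. */
bool has_cycle(const int waits_for[], int n)
{
    assert(n > 0);
    for (int start = 0; start < n; start++) {
        int cur = start;
        /* Follow at most n edges; returning to start means a cycle. */
        for (int steps = 0; steps < n && cur >= 0; steps++) {
            cur = waits_for[cur];
            if (cur == start)
                return true;
        }
    }
    return false;
}

/* The worker has joined the pause and waits for the pauser to resume.
 * The pauser waits for the worker only if the worker still holds the
 * read lock that the pauser's wlock() needs. */
bool scenario_deadlocks(bool worker_keeps_rlock_while_paused)
{
    int waits_for[NTHREADS];
    waits_for[WORKER] = PAUSER;
    waits_for[PAUSER] = worker_keeps_rlock_while_paused ? WORKER : -1;
    return has_cycle(waits_for, NTHREADS);
}
```

In this model, `scenario_deadlocks(true)` reports a cycle, while `scenario_deadlocks(false)` (the worker gives up or never holds the read lock before parking) does not.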
Quanah Gibson-Mount quanah@openldap.org changed:
What            |Removed            |Added
----------------------------------------------------------------------------
Assignee        |bugs@openldap.org  |hyc@openldap.org
Target Milestone|---                |2.5.14
Keywords        |needs_review       |
Howard Chu hyc@openldap.org changed:
What          |Removed      |Added
----------------------------------------------------------------------------
Status        |UNCONFIRMED  |CONFIRMED
Ever confirmed|0            |1
--- Comment #3 from Howard Chu hyc@openldap.org ---
Yeah that seems to be the case.
I think we can avoid this by checking to see if op->o_bd is cn=config and skipping the pausecheck in that case.
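A minimal sketch of that idea, under stated assumptions: `backend_model`, `op_model`, the `is_config` flag, and `may_join_pause` are hypothetical stand-ins rather than the real slapd types or the committed patch. It shows only the decision itself, namely that a writer blocked on the socket does not join a pending pause when its operation targets cn=config.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the database behind op->o_bd. */
typedef struct {
    bool is_config;   /* true when the database is cn=config */
} backend_model;

/* Hypothetical stand-in for an operation. */
typedef struct {
    const backend_model *o_bd;
} op_model;

/* Returns true if a writer that got EAGAIN/EWOULDBLOCK may join a
 * pending pool pause. Operations against cn=config may hold a read
 * lock on a config entry, so they skip the pause check. */
bool may_join_pause(const op_model *op)
{
    assert(op != NULL);
    if (op->o_bd != NULL && op->o_bd->is_config)
        return false;
    return true;
}
```

The check covers only operations against cn=config; operations against other databases still join the pause in this model.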
--- Comment #4 from Ondřej Kuzník ondra@mistotebe.net ---
That only fixes the deadlock, not the potential crashers when the op is against a random DB and its configuration changed/went away?
--- Comment #5 from Quanah Gibson-Mount quanah@openldap.org ---
Created attachment 924
  --> https://bugs.openldap.org/attachment.cgi?id=924&action=edit
gdb output from segv with test patch
--- Comment #6 from Howard Chu hyc@openldap.org ---
Probably need Ondrej to look into that, it seems that syncrepl called runqueue_stoptask for a task that wasn't currently running.
--- Comment #7 from Ondřej Kuzník ondra@mistotebe.net ---
On Sat, Oct 15, 2022 at 08:17:40PM +0000, openldap-its@openldap.org wrote:
> Probably need Ondrej to look into that, it seems that syncrepl called runqueue_stoptask for a task that wasn't currently running.
The task isn't running because do_syncrepl() is being called in response to traffic on the socket but si->si_ld == NULL, so we haven't actually sent any requests yet. What's more, I'm not sure I can see a way to have si->si_conn active without si->si_ld set up as well. It certainly looks like that's not supposed to happen.
Thanks,
--- Comment #8 from Quanah Gibson-Mount quanah@openldap.org ---
Created attachment 925
  --> https://bugs.openldap.org/attachment.cgi?id=925&action=edit
Logs of SEGV
Quanah Gibson-Mount quanah@openldap.org changed:
What   |Removed                       |Added
----------------------------------------------------------------------------
Summary|test050 deadlock on BSD OSes  |test050 deadlock with small send/receive queues
--- Comment #9 from Quanah Gibson-Mount quanah@openldap.org ---
Deadlock fix:
head:
• 04eded74 by Howard Chu at 2022-10-14T15:22:24+01:00 ITS#9930 fix cn=config / write_waiter deadlock
RE26:
• 0761de2d by Howard Chu at 2022-10-25T16:04:50+00:00 ITS#9930 fix cn=config / write_waiter deadlock
RE25:
• bf5607b4 by Howard Chu at 2022-10-25T16:04:12+00:00 ITS#9930 fix cn=config / write_waiter deadlock
Howard Chu hyc@openldap.org changed:
What  |Removed    |Added
----------------------------------------------------------------------------
Status|CONFIRMED  |IN_PROGRESS
Ondřej Kuzník ondra@mistotebe.net changed:
What    |Removed           |Added
----------------------------------------------------------------------------
Assignee|hyc@openldap.org  |ondra@mistotebe.net
Quanah Gibson-Mount quanah@openldap.org changed:
What      |Removed      |Added
----------------------------------------------------------------------------
Resolution|---          |FIXED
Status    |IN_PROGRESS  |RESOLVED
--- Comment #10 from Quanah Gibson-Mount quanah@openldap.org ---
head:
• fa030ef8 by Ondřej Kuzník at 2023-01-30T10:26:23+00:00 ITS#9930 Do not reschedule consumers that are shutting down
RE26:
• c6a3d2da by Ondřej Kuzník at 2023-01-30T19:00:38+00:00 ITS#9930 Do not reschedule consumers that are shutting down
RE25:
• a8b1e2f2 by Ondřej Kuzník at 2023-01-30T19:02:43+00:00 ITS#9930 Do not reschedule consumers that are shutting down
Quanah Gibson-Mount quanah@openldap.org changed:
What  |Removed   |Added
----------------------------------------------------------------------------
Status|RESOLVED  |VERIFIED