https://bugs.openldap.org/show_bug.cgi?id=9584
Issue ID: 9584 Summary: cn=config replication ops/refresh should pause server Product: OpenLDAP Version: unspecified Hardware: All OS: All Status: UNCONFIRMED Severity: normal Priority: --- Component: slapd Assignee: bugs@openldap.org Reporter: ondra@mistotebe.net Target Milestone: ---
Looking into this crash: https://git.openldap.org/openldap/openldap/-/jobs/7286
The thread in question is running a plain syncrepl refresh while another thread seems to have done the same. This thread fetched the entryUUID attribute of the 'cn=config' entry as 'a' and in the meantime, that entry has been rewritten, with 'a' presumably cleaned up and returned to the pool, so addressing a->a_nvals[0] is a NULL-dereference now.
This might or might not be related to the fix in ITS#8102.
https://bugs.openldap.org/show_bug.cgi?id=9584
Quanah Gibson-Mount quanah@openldap.org changed:
What |Removed |Added ---------------------------------------------------------------------------- Target Milestone|--- |2.6.0 Assignee|bugs@openldap.org |hyc@openldap.org
https://bugs.openldap.org/show_bug.cgi?id=9584
Quanah Gibson-Mount quanah@openldap.org changed:
What |Removed |Added ---------------------------------------------------------------------------- Keywords| |replication Priority|--- |High
https://bugs.openldap.org/show_bug.cgi?id=9584
--- Comment #1 from Howard Chu hyc@openldap.org --- (In reply to Ondřej Kuzník from comment #0)
Looking into this crash: https://git.openldap.org/openldap/openldap/-/jobs/7286
The thread in question is running a plain syncrepl refresh while another thread seems to have done the same. This thread fetched the entryUUID attribute of the 'cn=config' entry as 'a' and in the meantime, that entry has been rewritten, with 'a' presumably cleaned up and returned to the pool, so addressing a->a_nvals[0] is a NULL-dereference now.
I don't see how this could happen. In order for any consumer to write to cn=config, all other threads must be paused. There is no point where one consumer will have retrieved any entry from cn=config and then paused itself, giving any window for another consumer to rewrite the entry out from under it.
In particular, the stacktrace for the crashed thread is:
#0 0x0000557332297336 in nonpresent_callback (op=<optimized out>, rs=0x7ff39d52dcb0) at ../../../servers/slapd/syncrepl.c:5681 #1 0x0000557332244ef8 in slap_response_play (op=op@entry=0x7ff39d52e4f0, rs=rs@entry=0x7ff39d52dcb0) at ../../../servers/slapd/result.c:567 #2 0x0000557332246c30 in slap_send_search_entry (op=0x7ff39d52e4f0, rs=0x7ff39d52dcb0) at ../../../servers/slapd/result.c:1072 #3 0x000055733221c154 in config_send (op=op@entry=0x7ff39d52e4f0, rs=rs@entry=0x7ff39d52dcb0, ce=ce@entry=0x55733339fcd0, depth=depth@entry=0) at ../../../servers/slapd/bconfig.c:4719 #4 0x000055733221c790 in config_back_search (op=0x7ff39d52e4f0, rs=0x7ff39d52dcb0) at ../../../servers/slapd/bconfig.c:6794 #5 0x00005573322a57f1 in overlay_op_walk (op=op@entry=0x7ff39d52e4f0, rs=rs@entry=0x7ff39d52dcb0, which=which@entry=op_search, oi=oi@entry=0x7ff3941dcaf0, on=<optimized out>) at ../../../servers/slapd/backover.c:706 #6 0x00005573322a594e in over_op_func (op=0x7ff39d52e4f0, rs=0x7ff39d52dcb0, which=op_search) at ../../../servers/slapd/backover.c:766 #7 0x0000557332291b3d in syncrepl_del_nonpresent (op=op@entry=0x7ff39d52e4f0, si=si@entry=0x7ff388108640, uuids=uuids@entry=0x0, sc=sc@entry=0x7ff39d52dfc0, m=3) at ../../../servers/slapd/syncrepl.c:4625 #8 0x000055733229ea1a in do_syncrep2 (si=0x7ff388108640, op=0x7ff39d52e4f0) at ../../../servers/slapd/syncrepl.c:1840 #9 do_syncrepl (ctx=<optimized out>, arg=0x7ff388109cb0) at ../../../servers/slapd/syncrepl.c:2076 #10 0x00007ff3a0fe047e in ?? () #11 0x0000000000000000 in ?? ()
There is nowhere along the call path from frame #4 thru frame #0 where this thread will do a pausecheck. Which means any other consumer trying to modify the config DB necessarily waits for this call path to complete before it can proceed. Which means no other consumer could have changed any attributes out from under this thread.
This might or might not be related to the fix in ITS#8102.
https://bugs.openldap.org/show_bug.cgi?id=9584
Howard Chu hyc@openldap.org changed:
What |Removed |Added ---------------------------------------------------------------------------- Status|UNCONFIRMED |RESOLVED Resolution|--- |TEST
--- Comment #2 from Howard Chu hyc@openldap.org --- fixed in master
https://bugs.openldap.org/show_bug.cgi?id=9584
Quanah Gibson-Mount quanah@openldap.org changed:
What |Removed |Added ---------------------------------------------------------------------------- Keywords|replication |
--- Comment #3 from Quanah Gibson-Mount quanah@openldap.org --- Commits: • e1c90d09 by Howard Chu at 2021-07-27T16:12:14+01:00 ITS#9584 serialize refresh phase
https://bugs.openldap.org/show_bug.cgi?id=9584
--- Comment #4 from Michael Ströder michael@stroeder.com --- (In reply to Quanah Gibson-Mount from comment #3)
Commits: • e1c90d09 by Howard Chu at 2021-07-27T16:12:14+01:00 ITS#9584 serialize refresh phase
Does this affect all backends?
The subject of this ITS implies that an issue for cn=config is solved.
I'm asking because I see some potential operational trouble when automatically configuring and starting multiple consumers at once (with puppet, ansible or similar) and refresh phase takes some time.
https://bugs.openldap.org/show_bug.cgi?id=9584
--- Comment #5 from Howard Chu hyc@openldap.org --- (In reply to Michael Ströder from comment #4)
(In reply to Quanah Gibson-Mount from comment #3)
Commits: • e1c90d09 by Howard Chu at 2021-07-27T16:12:14+01:00 ITS#9584 serialize refresh phase
Does this affect all backends?
Yes.
The subject of this ITS implies that an issue for cn=config is solved.
I'm asking because I see some potential operational trouble when automatically configuring and starting multiple consumers at once (with puppet, ansible or similar) and refresh phase takes some time.
The overall time required will actually be shorter. Previously, without this serialization, all of the consumers would start at the same time and make redundant requests to their respective providers. Starting from the same cookie, they would each obtain the same content from their providers, so the same data is transmitted N times. The consumers go thru the entries one by one and end up ignoring the majority of the data.
By serializing the start, one consumer will pull down the bulk of the data, and then when it finishes, the next consumer will be using the latest cookie when it talks to its provider, so no redundant data will be downloaded.
https://bugs.openldap.org/show_bug.cgi?id=9584
--- Comment #6 from Ondřej Kuzník ondra@mistotebe.net ---
By serializing the start, one consumer will pull down the bulk of the data, and then when it finishes, the next consumer will be using the latest cookie when it talks to its provider, so no redundant data will be downloaded.
In that case, the other threads seem to be busy looping until the lock is available, we should probably make it either wait on a condition or suspend the tasks in a future release.
https://bugs.openldap.org/show_bug.cgi?id=9584
--- Comment #7 from Howard Chu hyc@openldap.org --- (In reply to Ondřej Kuzník from comment #6)
By serializing the start, one consumer will pull down the bulk of the data, and then when it finishes, the next consumer will be using the latest cookie when it talks to its provider, so no redundant data will be downloaded.
In that case, the other threads seem to be busy looping until the lock is available, we should probably make it either wait on a condition or suspend the tasks in a future release.
Agreed. It would have made more sense to have only a single task iterate through all of the configured consumers, at least for this startup refresh.
I guess instead of the busy loop I can just exit the task and explicitly re-enable it later. Will look into it.
https://bugs.openldap.org/show_bug.cgi?id=9584
--- Comment #8 from Michael Ströder michael@stroeder.com --- On 7/29/21 1:33 PM, openldap-its@openldap.org wrote:
By serializing the start, one consumer will pull down the bulk of the data, and then when it finishes, the next consumer will be using the latest cookie when it talks to its provider, so no redundant data will be downloaded.
Does this also apply to the situation where I start multiple empty consumers (no data in DB yet)?
https://bugs.openldap.org/show_bug.cgi?id=9584
--- Comment #9 from Quanah Gibson-Mount quanah@openldap.org --- Commits: • 79d33fe4 by Howard Chu at 2021-07-29T13:28:34+01:00 ITS#9584 avoid busy-loop while refresh is serialized
https://bugs.openldap.org/show_bug.cgi?id=9584
Quanah Gibson-Mount quanah@openldap.org changed:
What |Removed |Added ---------------------------------------------------------------------------- Resolution|TEST |FIXED Status|RESOLVED |VERIFIED
https://bugs.openldap.org/show_bug.cgi?id=9584
--- Comment #10 from Quanah Gibson-Mount quanah@openldap.org --- Commits: • 75636a40 by Ondřej Kuzník at 2021-12-15T01:22:38+00:00 ITS#9584 Track refreshing status explicitly
https://bugs.openldap.org/show_bug.cgi?id=9584
--- Comment #11 from Quanah Gibson-Mount quanah@openldap.org --- RE26:
Commits: • 5be1853a by Ondřej Kuzník at 2022-01-12T21:44:42+00:00 ITS#9584 Track refreshing status explicitly
https://bugs.openldap.org/show_bug.cgi?id=9584
Ondřej Kuzník ondra@mistotebe.net changed:
What |Removed |Added ---------------------------------------------------------------------------- Status|VERIFIED |CONFIRMED Ever confirmed|0 |1 Resolution|FIXED |---
--- Comment #12 from Ondřej Kuzník ondra@mistotebe.net --- The current fix in 2.6 abuses retry logic to handle SYNC_BUSY and that's going to cause us no end of trouble if we allow this into 2.5. Given it looks like we want to backport, we'll need to make some changes I planned to do eventually.
My thinking is we make each syncinfo track whether it's active/paused, with all of them starting in paused state and we keep the existing cs_refreshing management as is. Also order probably matters but that just got fixed in ITS#9761.
On startup, we kick off a task that picks the first paused syncinfo, marks it active and schedules accordingly.
When a syncinfo gets a SYNC_BUSY - there is another active syncinfo in cs_refreshing - it makes itself paused. When a syncinfo drops itself from cs_refreshing it will pick the first paused syncinfo from the start and activates it as above.
In this, it seems that syncinfos that run their retry counter to the end (going dead) stay marked as "active" and that looks safe, we don't actually want to reschedule them.
Any cn=config prodding has to take the above into account, so it might have to schedule the kick-off task if required and some tracking to allow that will be needed. There is no active link from cn=monitor that can change these states/retries so that's fine. Decision whether or not to schedule the initial task could be based on cs_refreshing == NULL. Even if we spawn duplicate kick-off tasks, one of them will be able to take over cs_refreshing and others will get BUSY or find no more paused syncinfos.
https://bugs.openldap.org/show_bug.cgi?id=9584
Quanah Gibson-Mount quanah@openldap.org changed:
What |Removed |Added ---------------------------------------------------------------------------- Target Milestone|2.6.0 |2.5.12
https://bugs.openldap.org/show_bug.cgi?id=9584
Quanah Gibson-Mount quanah@openldap.org changed:
What |Removed |Added ---------------------------------------------------------------------------- Resolution|--- |FIXED Status|CONFIRMED |RESOLVED
--- Comment #13 from Quanah Gibson-Mount quanah@openldap.org --- head:
commit 87ffc60006298069a5a044b8e63dab27a61d3fdf Author: Ondřej Kuzník ondra@mistotebe.net Date: Wed Jan 26 17:39:41 2022 +0000
ITS#9584 Do not rely on retry=.* to reschedule new syncrepl sessions
commit 62bf31e9664464a1df0e0cf83f8bbb4d047f10c8 Author: Howard Chu hyc@openldap.org Date: Thu Feb 3 00:49:52 2022 +0000
ITS#9584 bconfig: protect cf entries with rwlock
RE26:
commit f2671bc549a69035ffa471bff322577b10e6a3d4 Author: Ondřej Kuzník ondra@mistotebe.net Date: Wed Jan 26 17:39:41 2022 +0000
ITS#9584 Do not rely on retry=.* to reschedule new syncrepl sessions
commit ba11bd6b66077eafc6b75fcf4adf1dc3883efb80 Author: Howard Chu hyc@openldap.org Date: Thu Feb 3 00:49:52 2022 +0000
ITS#9584 bconfig: protect cf entries with rwlock
Since not all config writes pause the server, must prevent searches from seeing intermediate states.
RE25:
commit d9f82e1a40afc0f3dcc0a80090e9cb51f55a847e Author: Howard Chu hyc@openldap.org Date: Tue Jul 27 16:10:29 2021 +0100
ITS#9584 serialize refresh phase
Only allow one consumer at a time to perform a refresh on a database.
Contains: e1c90d0977d389db05803c127d45b39c89a5ac2f 79d33fe40ea41f52a2c1b9e299a6c711f62d0f40 75636a407e38f1502c592566b5bf4c3ebf142a2b 3e3d9d7637e65a40ec0ec9aa9b9bcb051e3a42b5 minus testsuite tweak
commit 30f42b16230d26fab2ccdd9c9f90cd3f60af02de Author: Howard Chu hyc@openldap.org Date: Thu Feb 3 00:49:52 2022 +0000
ITS#9584 bconfig: protect cf entries with rwlock
Since not all config writes pause the server, must prevent searches from seeing intermediate states.
https://bugs.openldap.org/show_bug.cgi?id=9584
Quanah Gibson-Mount quanah@openldap.org changed:
What |Removed |Added ---------------------------------------------------------------------------- Status|RESOLVED |VERIFIED