Paul B. Henson wrote:
I upgraded one node of a four-node MMR delta-syncrepl OpenLDAP system today to 2.4.44 plus the backported ITS 8432 patch. The other three nodes were still running 2.4.41, which all four had been running for quite some time with no issues. Within a few hours they started blowing up with the infinite replication issue referred to in ITS 8432: all the accesslogs were filled with the same modification repeated over and over. I ended up having to slapcat the db, restore the upgraded node to 2.4.41, and then reload the db on all of them to recover.
From what I understand, the bug was supposed to have existed in versions before 2.4.44, but I had never seen it in 2.4.41, and backing out to that version seems to have restored stability. I remember seeing another ITS regarding an issue even with the 8432 patch applied if you're using the memberOf overlay (although that number escapes me at the moment), but I thought that only applied if you were updating group memberships, and the change that blew up my system was a password change.
Is it possible that updating one node to the fixed version somehow triggered the bug in one of the other nodes? Do I need to upgrade all the nodes at the same time? Or are there possibly still edge cases where 2.4.44 with the patch is still broken? I did a test run of the upgrade in my dev environment, including the staged rollout with temporarily mixed versions, but it doesn't really have the same load and variety of access patterns that production sees.
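For anyone trying to confirm the symptom described above (accesslogs filled with the same modification repeated over and over), here is a minimal Python sketch that counts identical modify operations in an LDIF dump of the accesslog database. It relies on the standard accesslog overlay attributes reqType, reqDN and reqMod. The function names, the threshold, and the idea that looping ops show up with identical reqMod values are assumptions made for illustration, not something stated in this thread.

```python
from collections import Counter


def parse_ldif_entries(text):
    """Split LDIF text into entries, each a list of (attribute, value) pairs."""
    entries, current = [], []
    for raw in text.splitlines():
        if raw.startswith(" ") and current:
            # LDIF continuation line: glue it onto the previous value.
            attr, value = current[-1]
            current[-1] = (attr, value + raw[1:])
        elif not raw.strip():
            # A blank line ends the current entry.
            if current:
                entries.append(current)
                current = []
        elif ":" in raw and not raw.startswith("#"):
            # Base64 values ("attr:: ...") are kept in encoded form,
            # which is still good enough for equality comparison.
            attr, _, value = raw.partition(":")
            current.append((attr.strip(), value.strip()))
    if current:
        entries.append(current)
    return entries


def repeated_mods(text, threshold=3):
    """Return {(reqDN, reqMod values): count} for modify ops seen >= threshold times.

    threshold is an arbitrary illustrative cutoff (assumption).
    """
    counts = Counter()
    for entry in parse_ldif_entries(text):
        attrs = {}
        for attr, value in entry:
            attrs.setdefault(attr, []).append(value)
        if attrs.get("reqType") != ["modify"]:
            continue
        key = (attrs.get("reqDN", [""])[0], tuple(attrs.get("reqMod", [])))
        counts[key] += 1
    return {key: n for key, n in counts.items() if n >= threshold}
```

The input would typically be the output of slapcat run against the accesslog database (for example `slapcat -b cn=accesslog`, where the suffix is whatever your deployment uses). Any key that comes back with a large count is a candidate for a looping modification.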
The fix for #8432 only prevents the redundant mod from being processed on a particular node. If other nodes are still accepting the redundant op then yes, it will continue to propagate. So yes, you need the patched code on all nodes.
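The point that a per-node fix cannot stop a loop while other nodes still accept the redundant op can be illustrated with a toy simulation. This is a deliberately simplified sketch and not a model of slapd internals. It assumes a fully meshed set of nodes, assumes the condition that triggers the bug has already occurred (so every unpatched node reprocesses and forwards an op it has already applied), and uses the hypothetical names `simulate`, `patched` and `max_deliveries`.

```python
from collections import deque


def simulate(patched, max_deliveries=1000):
    """Toy model of one change spreading through a fully meshed cluster.

    patched: list of booleans, one per node (True = node drops redundant ops).
    Returns the number of deliveries before the op dies out, or None if
    max_deliveries is reached first (treated here as "still looping").
    """
    n = len(patched)
    seen = [False] * n
    seen[0] = True  # node 0 originates the change
    queue = deque((0, dst) for dst in range(1, n))
    deliveries = 0
    while queue:
        if deliveries >= max_deliveries:
            return None
        src, dst = queue.popleft()
        deliveries += 1
        redundant = seen[dst]
        seen[dst] = True
        if redundant and patched[dst]:
            continue  # patched node: the redundant op stops here
        # First-time op, or an unpatched node reprocessing a redundant op:
        # it is logged again and offered to every peer except the sender.
        for peer in range(n):
            if peer != dst and peer != src:
                queue.append((dst, peer))
    return deliveries
```

In this model, `simulate([True, True, True, True])` terminates, while `simulate([True, False, False, False])` (one patched node and three unpatched ones, as in the setup described above) hits the cap and returns None, because the unpatched nodes keep handing the op to each other.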
Thanks...
From: Howard Chu Sent: Thursday, July 21, 2016 3:36 AM
The fix for #8432 only prevents the redundant mod from being processed on a particular node. If other nodes are still accepting the redundant op then yes, it will continue to propagate. So yes, you need the patched code on all nodes.
Okay, thanks for the clarification. I usually stage updates to avoid a complete outage at any given time. It's interesting though that I had never seen this problem running 2.4.41 until I introduced a 2.4.44 system into the mix, and then it went away once I reverted that system back to 2.4.41. I wonder why that combination caused it to pop up suddenly. I'll have to schedule a downtime window and update them all at once and see what happens.
By any chance have you had the time to look at ITS 8444? The reporter says he sees a similar circular replication issue when the memberOf overlay is enabled, which we also use.
Thanks much.
I would note that I've opened up two new ITSes for problems that only occur on systems that have the ITS8432 fix. Still waiting on fixes for those, which is why I haven't pushed the fix into RE24 yet.
ITSes 8460 and 8462.
--Quanah
--On Thursday, July 21, 2016 2:59 PM -0700 "Paul B. Henson" henson@acm.org wrote:
From: Howard Chu Sent: Thursday, July 21, 2016 3:36 AM
The fix for #8432 only prevents the redundant mod from being processed on a particular node. If other nodes are still accepting the redundant op then yes, it will continue to propagate. So yes, you need the patched code on all nodes.
Okay, thanks for the clarification. I usually stage updates to avoid a complete outage at any given time. It's interesting though that I had never seen this problem running 2.4.41 until I introduced a 2.4.44 system into the mix, and then it went away once I reverted that system back to 2.4.41. I wonder why that combination caused it to pop up suddenly. I'll have to schedule a downtime window and update them all at once and see what happens.
By any chance have you had the time to look at ITS 8444? The reporter says he sees a similar circular replication issue when the memberOf overlay is enabled, which we also use.
Thanks much.
--
Quanah Gibson-Mount Platform Architect Manager, Systems Team Zimbra, Inc. -------------------- Zimbra :: the leader in open source messaging and collaboration A division of Synacor, Inc
Also see ITS 8448.
--Quanah
--On Thursday, July 21, 2016 5:47 PM -0700 Quanah Gibson-Mount quanah@zimbra.com wrote:
I would note that I've opened up two new ITSes for problems that only occur on systems that have the ITS8432 fix. Still waiting on fixes for those, which is why I haven't pushed the fix into RE24 yet.
ITSes 8460 and 8462.
--Quanah
From: Quanah Gibson-Mount Sent: Friday, July 22, 2016 10:40 AM
Also see ITS 8448.
Ah, thanks much for the heads up on these additional issues. Perhaps it would be safest for me to stick with 2.4.41 until a 2.4.45 is officially released with fixes for both the replication loop and possible fallout from its fix :). We've been pretty lucky with 2.4.41, knock on bits, it's been very stable with no issues or problems that I've seen while we've been running it.
openldap-technical@openldap.org