On 13.02.12 00:47, Howard Chu wrote:
Long time with little OpenLDAP work, but I'm still around ;-)
I'm still seeing cases where deleted entries are getting resurrected when a number of concurrent Add/Delete sequences are occurring, with multiple MMR servers (4 minimum to show the error).
Just for the record, this is not the problem reported in this ITS; the ITS bug is the same as the one discussed in this thread:
http://www.openldap.org/lists/openldap-devel/201012/msg00018.html
The queuing of an old CSN, done as a fix for ITS#7052, may have introduced a new race condition; an ITS and a fix are coming.
I would prefer a rewrite so that only the frontend assigns CSNs to operations, though. The current situation, where syncrepl attaches the entryCSN or old CSNs to the operation just to prevent the backend from generating new CSNs, looks to me like curing the symptom rather than the sickness.
The problem begins because multiple writes are outstanding, and they are replicated in persist mode without a CSN in their syncrepl cookie. This is a normal occurrence when the current op does not correspond to the last committed CSN.
This looks to me like the root of the problem seen here. Replicating without a CSN means replicating possibly incomplete state, and when there are multiple paths by which these operations can reach a server, we end up with race conditions.
I'd prefer that all changes replicated in persist mode carried a single CSN and were replicated in CSN order (for all CSNs with the same SID, that is). It is probably sufficient to enforce this in MMR mode, though.
The replicated changes are already being serialized, so serializing them in CSN order shouldn't stall things noticeably, and it would eliminate the kind of race conditions seen here. And I guess it's already required for delta replication?
The major drawback would be that after a refresh, syncprov would have to force its consumers to refresh as well. I.e., the first hop in a chain would have to complete its refresh before the next hop starts seeing the updates. But database consistency is most important to me, so I would have no problem living with that.
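To make the proposed ordering concrete, here is a minimal sketch in C, invented for this mail and not taken from syncprov: pending persist-mode writes are kept in a per-SID list sorted by their CSN and handed to the sender strictly in that order. All names and structures below are assumptions.

/* Illustrative only: keep pending writes sorted by CSN and release them
 * strictly in CSN order.  CSN strings are formatted so that a plain string
 * comparison orders them chronologically for a given SID. */
#include <stddef.h>
#include <string.h>

typedef struct pending_op {
    char *csn;                  /* CSN assigned to this write */
    struct pending_op *next;
} pending_op;

/* Insert an op so the per-SID list stays sorted by CSN. */
void
queue_in_csn_order( pending_op **head, pending_op *op )
{
    while ( *head != NULL && strcmp( (*head)->csn, op->csn ) <= 0 )
        head = &(*head)->next;
    op->next = *head;
    *head = op;
}

/* Take the oldest pending op; the caller replicates it, with its CSN in the
 * cookie, before asking for the next one. */
pending_op *
next_to_replicate( pending_op **head )
{
    pending_op *op = *head;
    if ( op != NULL )
        *head = op->next;
    return op;
}

The point is only that every change leaves the provider carrying its own CSN, and never ahead of an older change from the same SID.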
Because there is no CSN, the consumer doesn't update its cookie state while performing a particular op.
As a result, if a client does Add/Delete/Add/Delete of the same DN, it's possible for the Adds to propagate several times (more than the client actually executed).
Adds and Modifies can usually be rejected if they're too old, because they carry an entryCSN attribute that can be compared against the one on the existing entry, even if the consumer cookie state has not been updated. But Deletes don't carry any attributes, and deleted entries can't be checked.
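In code, that staleness check amounts to roughly the following; this is a hypothetical sketch rather than the actual consumer logic, and the function name is made up.

/* Sketch of the entryCSN staleness check described above.  CSN strings are
 * designed so that ordinary string comparison orders them in time. */
#include <stddef.h>
#include <string.h>

/* Return nonzero if an incoming Add/Modify should be applied. */
int
incoming_write_is_newer( const char *local_entrycsn, const char *incoming_entrycsn )
{
    if ( local_entrycsn == NULL )
        return 1;       /* no local copy to compare against */
    return strcmp( incoming_entrycsn, local_entrycsn ) > 0;
}

/* No equivalent test exists for a Delete: the operation carries no
 * attributes, and once the local entry is gone its entryCSN is gone too. */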
So, given an MMR setup like this:
1 -- 2
|    |
3 -- 4
A sequence of Add/Del/Add/Del performed at server 1 will be replicated to both 2 and 3 immediately, and they will then cascade it to server 4. If many other writes are occurring at the same time, causing these writes to be propagated without a cookie CSN, then server 4 will propagate them back to 3 and 2 respectively, and 3 and 2 will re-add the deleted entries because they have nothing to tell them the Adds are old. This cycle is only broken if server 1 eventually sends an op with an accompanying cookie update, so that all the downstream servers can see that the ops are old.
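For concreteness, server 1 in the square above might be configured roughly like this (slapd.conf style); the suffix, hostnames and credentials are placeholders invented for illustration, not taken from the original report.

  # fragment only; serverID is global, the rest belongs in the database section
  serverID    1

  overlay     syncprov

  syncrepl    rid=012
              provider=ldap://server2.example.com
              type=refreshAndPersist
              searchbase="dc=example,dc=com"
              bindmethod=simple
              binddn="cn=replicator,dc=example,dc=com"
              credentials=secret
              retry="5 +"

  syncrepl    rid=013
              provider=ldap://server3.example.com
              type=refreshAndPersist
              searchbase="dc=example,dc=com"
              bindmethod=simple
              binddn="cn=replicator,dc=example,dc=com"
              credentials=secret
              retry="5 +"

  mirrormode  on

Servers 2, 3 and 4 would mirror this with their own serverID and the provider URLs swapped to match the square, i.e. each node only pulls from its two direct neighbours.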
There are actually two possible race conditions in this configuration, when an add/delete is performed on the same DN:
1) The add is sent without a CSN, the delete with one. Assume the add/delete pair is handled by server 3 before it receives the same pair again from server 4. Server 3 will then act upon the CSN-less add and discard the delete as already seen, and end up with an entry that is not present on the origin server.
2) Neither the add nor the delete is sent with a CSN. This can lead to the endless add/delete cycle outlined above when there are loops in the MMR topology. The cycle will only be broken if the same DN is re-added with a CSN; updating the CSN by changing other entries is not sufficient. The wild CSN-less add will be stopped when it reaches a server that holds the newly added entry, and hence so will the delete. But which servers end up acting on the delete is yet another race condition :-(
Hm, given that replication handles add and modify in much the same way, could a modify/delete sequence be sufficient to trigger these race conditions?
OK, upon further digging, this appears to be caused by ITS#6024. Rein's patch prevents the consumer and provider from informing each other of their SIDs when no CSN is present; this prevents syncprov's propagation loop detection from working. Sigh. Reverting the ITS#6024 patch...
Unfortunately, this will not fix scenario 1, and it fixes scenario 2 only when all loops include the server initiating the change. The rid and sid fields of the cookie are not sufficient for loop detection in the general case, and as such should only be used as an optimization.
A new test script that exercises these race conditions is coming.
Rein