Re: (ITS#5454) syncrepl refreshAndPersist stops receiving

7 Apr 2008


      Howard Chu skrev:
...
rein@basefarm.no wrote:
...
...
Hm, I hope I have found the race condition that causes this :-)  I'm now
running with the patch at the end to see if that solves it, only time
will tell..
And we haven't seen any replication problems since this was implemented 
friday afternoon :-).  But it is still a bit too early to conclude that 
it has been fixed.
...
...
The race is that between the time selecting on the syncrepl socket is
enabled by the call to connection_client_enable() and the release of the
si_mutex a new message may arrive.  If so, the next call to do_syncrepl
may fail in its attempt to trylock the mutex and no-one will re-enable
selecting on it again.  My patch delays enabling of the socket until the
mutex has been released, which looks safe to me.  Or can the access to
si->si_conn without a lock be a problem?
How about just moving the enable to after the runqueue manipulation is
done?
That would still leave a glitch in the window open, unless the initial 
trylock is changed to a regular lock as your checking message suggests.
But, doing that could result in all the threads ending up waiting for 
the lock if do_syncrepl of a refresh only replication is configured to 
run too often.  I assume the trylock is there to avoid this?  What about 
starting out with deferring the next scheduled call, as a scheduled call 
  is not wanted until the first has completed anyhow?  And a rescedule 
is already being done before do_syncrepl finishes.
Deferring should not be necessary if rtask->next_sched.tv_sec is zero, 
i.e it has already been deferred. That would remove these calls for all 
the cases when do_syncrepl is called as a result of a new message on a 
persistent syncrepl connection.
...
Just need to be sure that do_syncrepl() isn't entered again before
si->si_conn gets initialized.
I'm not quite sure what you mean here.  As I see it, the problem is that 
do_syncrepl() must not be called after si->si_conn is enabled and the 
lock is still held.
...
It also occurs to me that we probably don't even need to manipulate the
slapd runqueue in persist mode, when si->si_conn is already set. I.e.,
in that case we can only have gotten here because of a listener event,
and not because of a runqueue schedule.
That is probably true.
Rein

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

Re: (ITS#5454) syncrepl refreshAndPersist stops receiving