Re: No replication after power failure

5 Oct 2007

Pierangelo Masarati wrote:
> ...
> Stelios Grigoriadis wrote:
>> ...
>> I am not sure this would be considered a bug, but it is a problem for
>> us. If the master goes down, the replicas have no way of detecting it.
>> When the master comes back up, all replica servers have to be
>> restarted. Is there a way to avoid this?
>> Enabling the KEEPALIVE option (socket or TCP) is not really a solution,
>> since the default timeout is 2 hours, which is too long.
>> Another approach would be some kind of timeout in the epoll loop that
>> checks whether the master is responding, but isn't that timeout already
>> used for the runqueue?
>> Have you come across this? I was surprised to see that no one else has
>> had any issues with it. Am I missing something?
>
> This was recently discussed (ITS#5133), and the only alternative to
> SO_KEEPALIVE would be to have some background thread poll the producer
> on the syncrepl descriptors on a regular basis, performing some no-op
> (like searching the rootDSE requesting 1.1). Aaron Richton noted that
> support for SO_KEEPALIVE was added in OpenLDAP 2.3.28.
>
> p.
>
> Ing. Pierangelo Masarati
> OpenLDAP Core Team
> SysNet s.r.l.
> via Dossi, 8 - 27100 Pavia - ITALIA
> http://www.sys-net.it
>
> Office:  +39 02 23998309
> Mobile:  +39 333 4963172
> Email:   pierangelo.masarati@sys-net.it
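
On the 2-hour figure: on Linux that is the default idle time before the first keepalive probe (7200 s); with the default 75 s probe interval and 9 probes, a silent peer is only declared dead after about 7875 s. Linux lets a program shorten these timers per socket. The C sketch below assumes a Linux-like system exposing TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT; it is not slapd code, and `enable_keepalive` and `keepalive_detect_secs` are hypothetical helper names used only for illustration.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Worst-case seconds before keepalive gives up on a silent peer:
 * the idle time before the first probe plus one interval per
 * unanswered probe. */
static long keepalive_detect_secs(long idle, long intvl, long cnt)
{
    return idle + intvl * cnt;
}

/* Hypothetical helper: enable SO_KEEPALIVE on fd and, where the platform
 * provides the options, override the system-wide timers for this socket
 * only. Returns 0 on success, -1 on error (errno set by setsockopt).
 * TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT are Linux (and some BSD)
 * extensions, hence the #ifdef guards. */
static int enable_keepalive(int fd, int idle, int intvl, int cnt)
{
    int on = 1;

    (void)idle; (void)intvl; (void)cnt; /* unused if no option is defined */

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) < 0)
        return -1;
#ifdef TCP_KEEPIDLE
    /* seconds of idleness before the first probe */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof idle) < 0)
        return -1;
#endif
#ifdef TCP_KEEPINTVL
    /* seconds between unanswered probes */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof intvl) < 0)
        return -1;
#endif
#ifdef TCP_KEEPCNT
    /* unanswered probes before the connection is dropped */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof cnt) < 0)
        return -1;
#endif
    return 0;
}
```

With, say, idle=60, intvl=10 and cnt=3, a dead master would be noticed after roughly 90 seconds instead of over two hours, at the cost of a few extra packets on idle replication connections.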

I have solved the problem by inserting a periodic check (called
do_mastercheck) into the runqueue. The period is set by the slapd.conf
parameter mastercheckint in the syncrepl section; it is specified in
minutes and is optional. If it is not specified, the check is not added
to the runqueue. I have tested it and it seems to work. I'm supplying a
patch (only syncrepl.c is affected) so you can review my solution and
hopefully incorporate it into the code (or, better yet, improve and
submit it).
/Stelios
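
The scheduling behaviour described above (a periodic check, an interval given in minutes, nothing queued when the parameter is absent) can be sketched in C as follows. This is not the code from the patch: the run queue is a hypothetical minimal one rather than slapd's runqueue, `schedule_mastercheck` is a hypothetical helper, and the `do_mastercheck` shown here is a stub that only counts calls instead of contacting the master.

```c
#include <stddef.h>
#include <time.h>

/* Deliberately tiny run queue used only for illustration; NOT slapd's
 * runqueue API. Each entry is re-run every `interval` seconds. */
typedef void (*task_fn)(void *arg);

struct task {
    const char *name;
    time_t interval;   /* seconds between runs */
    time_t next_run;   /* absolute time of the next run */
    task_fn fn;
    void *arg;
};

#define MAX_TASKS 8

struct runqueue {
    struct task tasks[MAX_TASKS];
    size_t ntasks;
};

/* Append a periodic task whose first run is one interval after `now`.
 * Returns 0 on success, -1 if the queue is full or the interval invalid. */
static int rq_add(struct runqueue *rq, const char *name, time_t interval,
                  task_fn fn, void *arg, time_t now)
{
    if (rq->ntasks >= MAX_TASKS || interval <= 0)
        return -1;
    struct task *t = &rq->tasks[rq->ntasks++];
    t->name = name;
    t->interval = interval;
    t->next_run = now + interval;
    t->fn = fn;
    t->arg = arg;
    return 0;
}

/* Run every task whose deadline has passed and reschedule it one interval
 * after `now`. Returns the number of tasks that ran. */
static size_t rq_run_due(struct runqueue *rq, time_t now)
{
    size_t ran = 0;
    for (size_t i = 0; i < rq->ntasks; i++) {
        struct task *t = &rq->tasks[i];
        if (now >= t->next_run) {
            t->fn(t->arg);
            t->next_run = now + t->interval;
            ran++;
        }
    }
    return ran;
}

/* Hypothetical stub standing in for the patch's do_mastercheck. A real
 * check would send a cheap request to the master and restart the syncrepl
 * session on failure; this one only counts how often it was invoked. */
static void do_mastercheck(void *arg)
{
    int *checks = arg;
    (*checks)++;
}

/* Interval in minutes; 0 models "mastercheckint not specified",
 * in which case nothing is added to the run queue. */
static int schedule_mastercheck(struct runqueue *rq, int mastercheckint_min,
                                void *arg, time_t now)
{
    if (mastercheckint_min <= 0)
        return 0;
    return rq_add(rq, "mastercheck", (time_t)mastercheckint_min * 60,
                  do_mastercheck, arg, now);
}
```

Because an unset interval queues nothing, replicas configured without the parameter behave exactly as before; only those that opt in pay for the periodic round trip to the master.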
