2.3.39 syncrepl lost connection

31 Jan 2008


I have a strange problem that is driving me nuts.
I have five servers, all on RHEL v4 and all running OpenLDAP 2.3.39
(locally built RPM).  The master server is a VMware guest, one of the
replicas is a blade, and the other three are 2U boxes.
Twice now, two of the four replicas have stopped updating at around
2:45am.  It was not the same two both times (although the blade was one
of them both times).
All five servers have loglevel set to "stats sync".
Nothing was logged on either end about any network error, and my
networking folks have looked at all the logs for all the ports involved
and found nothing.  Even so, my first suspicion was the network,
because we just moved these servers to a brand new data center.
The fix both times so far has been to restart slapd on the two stalled
replicas, after which they catch up within minutes.
The syncrepl config on the replicas uses refreshAndPersist with a
retry every 30 seconds -- so, if the replica knew the connection had
dropped, it should have re-established it.
We run a command via Nagios (NRPE) on each replica every five minutes
that compares the contextCSN of the replica and the master.  I see those
connections/queries continuing in the logs on the master, and Nagios
eventually yells that the replicas are dreadfully behind.
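For illustration, a contextCSN comparison of this kind could look roughly like the sketch below. It is not the actual check in use here: the function names, the thresholds, and the sample CSN values are all made up, and it assumes the two contextCSN strings have already been fetched (for example with ldapsearch against each server's suffix entry). It only relies on the value starting with a UTC timestamp followed by "#".

```python
from datetime import datetime, timezone

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def csn_time(csn: str) -> datetime:
    """Return the UTC timestamp encoded at the front of a contextCSN value."""
    stamp = csn.split("#", 1)[0].rstrip("Z")
    # Some releases add fractional seconds ("....123456Z"); ignore them here.
    stamp = stamp.split(".", 1)[0]
    return datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def replica_lag_seconds(master_csn: str, replica_csn: str) -> float:
    """Seconds by which the replica's contextCSN trails the master's."""
    return (csn_time(master_csn) - csn_time(replica_csn)).total_seconds()


def nagios_status(lag: float, warn: float = 300, crit: float = 900) -> int:
    """Map a lag in seconds to a Nagios status (thresholds are hypothetical)."""
    if lag >= crit:
        return CRITICAL
    if lag >= warn:
        return WARNING
    return OK


if __name__ == "__main__":
    # Made-up sample values: the replica stopped at 02:45, the master is at 07:45.
    master = "20080131074500Z#000000#00#000000"
    replica = "20080131024500Z#000000#00#000000"
    lag = replica_lag_seconds(master, replica)
    print(f"replica lag: {lag:.0f}s, status: {nagios_status(lag)}")
```

With the sample values above the lag works out to 18000 seconds (five hours), which the hypothetical thresholds report as critical.  Note that a check like this only shows that a replica has fallen behind; it says nothing about whether the persist connection itself is still alive.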
Has anyone seen something like this before -- or can anyone suggest a
way to figure out why/where the connection is getting broken?
Thanks,
-- 
Frank Swasey                    | http://www.uvm.edu/~fcs
Sr Systems Administrator        | Always remember: You are UNIQUE,
University of Vermont           |    just like everyone else.
   "I am not young enough to know everything." - Oscar Wilde (1854-1900)
