Re: 2.3.39 syncrepl lost connection

1 Feb 2008


On 1/31/08 4:31 PM, Aaron Richton wrote:
> ...
> A 2.3.39 replica should know the connection was lost, via the underlying
> OS, because it requests SO_KEEPALIVE. I assume these are
> (pseudo-?)dedicated servers, given the size of your OpenLDAP
> installation. As such, you may want to investigate your kernel tunable
> parameters to make keepalives more aggressive.
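For illustration, here is a minimal Python sketch of what "requests SO_KEEPALIVE" means at the socket level. It is not slapd's code: the function name `open_keepalive_socket` is hypothetical, and the optional per-socket overrides (`TCP_KEEPIDLE`, `TCP_KEEPINTVL`, `TCP_KEEPCNT`, which are Linux-specific) are shown only as an assumed alternative to changing the system-wide kernel tunables.

```python
import socket


def open_keepalive_socket(idle=None, interval=None, probes=None):
    """Hypothetical helper: a TCP socket that requests SO_KEEPALIVE.

    Without overrides, probe timing comes from the kernel's system-wide
    tunables. The per-socket overrides are an illustrative assumption
    (Linux-specific) and are skipped where the platform lacks them.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Ask the kernel to probe this connection once it has been idle.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    overrides = {
        "TCP_KEEPIDLE": idle,       # seconds idle before the first probe
        "TCP_KEEPINTVL": interval,  # seconds between unanswered probes
        "TCP_KEEPCNT": probes,      # unanswered probes before dropping
    }
    for name, value in overrides.items():
        if value is not None and hasattr(socket, name):
            sock.setsockopt(socket.IPPROTO_TCP, getattr(socket, name), value)
    return sock
```

Called with no arguments, the socket only asks for keepalives and inherits the OS defaults, which is why the fix discussed in this thread was made in the kernel tunables rather than in the replica.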
Yep, I'll bet that's it.  These are dedicated servers and they had the
Red Hat default values for the keepalive settings (which means about
2 hours 11 minutes, i.e. 7200 + 9 × 75 = 7875 seconds, before the replica
knows that the connection got axed -- and Nagios yells really loud after
10 minutes).  We have never waited the full 2 hours 11 minutes to see it
recover on its own.  I have changed the values to 300/5/5 (instead of
7200/75/9), so it should figure out the connection is dead within about
5 minutes 25 seconds (300 + 5 × 5 = 325 seconds).
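A minimal Python sketch of the detection-time arithmetic, assuming Linux semantics for the three sysctls in the 7200/75/9 triple (`net.ipv4.tcp_keepalive_time`, `net.ipv4.tcp_keepalive_intvl`, `net.ipv4.tcp_keepalive_probes`, in that order): the first probe goes out after the idle time, further probes follow at the interval, and the connection is dropped once the configured number of probes has gone unanswered. The helper names are hypothetical.

```python
def keepalive_detection_seconds(keepalive_time, keepalive_intvl, keepalive_probes):
    """Seconds of silence before the kernel drops a dead peer.

    Assumes Linux behaviour: first probe after keepalive_time idle seconds,
    then one probe every keepalive_intvl seconds, and the connection is
    dropped once keepalive_probes probes have gone unanswered.
    """
    return keepalive_time + keepalive_intvl * keepalive_probes


def fmt(seconds):
    """Render a duration in seconds as 'Hh Mm Ss'."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}h {minutes}m {secs}s"


if __name__ == "__main__":
    settings = (("Red Hat default", (7200, 75, 9)), ("tuned", (300, 5, 5)))
    for label, params in settings:
        total = keepalive_detection_seconds(*params)
        print(f"{label}: {total} s = {fmt(total)}")
```

Run as a script, this prints `Red Hat default: 7875 s = 2h 11m 15s` and `tuned: 325 s = 0h 5m 25s`.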
On 1/31/08 4:12 PM, Quanah Gibson-Mount wrote:
> ...
> Was the connection between the replica and master still open?  What bind
> mechanism are you using?
I wasn't the one who worked on it.  I was told "the connection still
thought it was up".  I believe that means netstat was run on the
replica and reported a connection to the master.  Since we didn't
wait for keepalive to figure out it had died, it would have been
reported as ESTABLISHED.  I don't know whether anyone checked the
master, where I'll bet the connection was no longer showing up.
The bind mechanism is simple, so that external services (like
Kerberos) are not required for the LDAP servers to be operational.
-- 
Frank Swasey                    | http://www.uvm.edu/~fcs
Sr Systems Administrator        | Always remember: You are UNIQUE,
University of Vermont           |    just like everyone else.
   "I am not young enough to know everything." - Oscar Wilde (1854-1900)