On 1/31/08 4:31 PM, Aaron Richton wrote:
> A 2.3.39 replica should know the connection was lost, via the
> OS, because it requests SO_KEEPALIVE. I assume these are
> (pseudo-?)dedicated servers, given the size of your OpenLDAP
> installation. As such, you may want to investigate your kernel tunable
> parameters to make keepalives more aggressive.
Yep, I'll bet that's it. These are dedicated servers and they had the
RedHat default values for the keepalive settings (which means about
2 hours 15 minutes before the replica knows that the connection got axed
-- and nagios yells really loud after 10 minutes). We have never waited
the full 2 hours and 15 minutes to see it recover itself. I have
changed the times to be 300/5/5 (instead of 7200/75/9). So, it should
figure out it's dead within 5 minutes 30 seconds.
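
For anyone checking the arithmetic, here is a rough sketch of where those times come from. It assumes the three numbers are, in order, the Linux tcp_keepalive_time, tcp_keepalive_intvl, and tcp_keepalive_probes settings (the sysctl names are my assumption; only the values appear above), and that a dead peer is declared gone after the idle time plus every probe interval has elapsed. The function names are made up for illustration.

```python
def keepalive_detection_seconds(idle, interval, probes):
    """Approximate worst-case seconds before an idle, dead peer is noticed.

    idle     -- seconds of silence before the first keepalive probe
    interval -- seconds between unanswered probes
    probes   -- unanswered probes sent before the connection is dropped
    """
    return idle + interval * probes


def fmt(seconds):
    """Render a number of seconds as hours, minutes, and seconds."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}h {minutes}m {secs}s"


# Default values mentioned above: 7200/75/9
print(fmt(keepalive_detection_seconds(7200, 75, 9)))  # 2h 11m 15s

# Tuned values mentioned above: 300/5/5
print(fmt(keepalive_detection_seconds(300, 5, 5)))    # 0h 5m 25s
```

Under these assumptions the sums land slightly below the rounded figures quoted above, and the tuned values stay well inside the 10 minute nagios window.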
On 1/31/08 4:12 PM, Quanah Gibson-Mount wrote:
> Was the connection between the replica and master still open? What
> mechanism are you using?
I wasn't the one that worked on it. I was told "the connection still
thought it was up". I believe that means that netstat was used on the
replica and a connection to the master was reported back. And as we
didn't wait for keepalive to figure out it had died, it would have
reported as ESTABLISHED. I don't know that anyone checked on the
master, where I'll bet the connection was not still showing up.
The bind mechanism is simple, so that no external services (like
Kerberos) are required for the LDAP servers to be operational.
Frank Swasey | http://www.uvm.edu/~fcs
Sr Systems Administrator | Always remember: You are UNIQUE,
University of Vermont | just like everyone else.
"I am not young enough to know everything." - Oscar Wilde (1854-1900)