I have a strange problem that is causing me to go nuts.
I have five servers all on RHEL v4, all with OpenLDAP 2.3.39 (locally built RPM). The master server is a VMWare guest, one of the replicas is a blade, the other three are 2U boxes.
Twice now two of the four replicas have stopped updating at around 2:45am. It was not the same two both times (although the blade was one of them both times).
All five servers have loglevel set to "stats sync".
Nothing was logged on either end about any network error, and my networking folks have looked at the logs for all the ports involved and found nothing. My first thought was still the network, because we just moved these servers to a brand new data center.
The fix both times so far has been to recycle slapd on the two replicas and they get caught up in minutes.
The syncrepl config on the replicas is for refreshAndPersist and does a retry every 30 seconds -- so, if the replica knew the connection had dropped, it should have restarted it.
We run a command via nagios (nrpe) on each replica every five minutes that compares the contextCSN of the replica and the master. I see those connections/queries continuing in the logs on the master, and nagios eventually yells that we're dreadfully behind on the replicas.
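For anyone wanting to reproduce that kind of check, here is a minimal sketch of the comparison step only, not the actual nrpe command used here. It assumes the contextCSN values have already been fetched from both servers, and that the leading field of each value is a UTC timestamp (with or without fractional seconds). The function names and the sample values are made up for illustration.

```python
from datetime import datetime, timezone


def csn_time(csn: str) -> datetime:
    """Extract the UTC timestamp from the first '#'-separated field of a CSN."""
    stamp = csn.split("#", 1)[0].rstrip("Z")
    # Some releases include fractional seconds in the timestamp; accept both.
    fmt = "%Y%m%d%H%M%S.%f" if "." in stamp else "%Y%m%d%H%M%S"
    return datetime.strptime(stamp, fmt).replace(tzinfo=timezone.utc)


def replica_lag_seconds(master_csn: str, replica_csn: str) -> float:
    """Seconds by which the replica's contextCSN trails the master's."""
    return (csn_time(master_csn) - csn_time(replica_csn)).total_seconds()


if __name__ == "__main__":
    # Hypothetical values: replica stuck at 02:45, master has moved on to 07:45.
    master = "20080131074500Z#000000#00#000000"
    replica = "20080131024500Z#000000#00#000000"
    print(replica_lag_seconds(master, replica))
```

A check like this only says that the replica is behind; it cannot say whether the syncrepl connection itself is still alive.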
Has anyone seen something like this before -- or have a suggestion of a method of figuring out why/where the connection is getting broken?
Thanks,
--On Thursday, January 31, 2008 10:54 AM -0500 Francis Swasey Frank.Swasey@uvm.edu wrote:
> The fix both times so far has been to recycle slapd on the two replicas and they get caught up in minutes.
Was the connection between the replica and master still open? What bind mechanism are you using? There's been a known issue for a while, for example, that if you use SASL/GSSAPI with MIT Kerberos and the ticket expires, replication will stop, and you'll have to bounce the server to get it replicating again.
--Quanah
--
Quanah Gibson-Mount
Principal Software Engineer
Zimbra, Inc
Zimbra :: the leader in open source messaging and collaboration
> The syncrepl config on the replicas is for refreshAndPersist and does a retry every 30 seconds -- so, if the replica knew the connection had dropped, it should have restarted it.
I ran into this around 2.3.27. It should be OK now; there was a patch to libldap.
A 2.3.39 replica should know the connection was lost, via the underlying OS, because it requests SO_KEEPALIVE. I assume these are (pseudo-?)dedicated servers, given the size of your OpenLDAP installation. As such, you may want to investigate your kernel tunable parameters to make keepalives more aggressive.
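To make the mechanism concrete, here is a small sketch of what requesting SO_KEEPALIVE looks like from an application. It is an illustration in Python, not slapd's or libldap's code. The `enable_keepalive` helper and its defaults are hypothetical, and the per-socket `TCP_KEEP*` overrides are a Linux feature that nothing in this thread says slapd uses. With only SO_KEEPALIVE set, the kernel-wide sysctl values decide the timing.

```python
import socket


def enable_keepalive(sock, idle=300, interval=5, probes=5):
    """Ask the kernel to probe an idle TCP connection (hypothetical helper)."""
    # The portable part: turn keepalive probing on for this socket.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Linux-only per-socket overrides of the net.ipv4.tcp_keepalive_* sysctls.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock


if __name__ == "__main__":
    s = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
    print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    s.close()
```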
You mention RHEL. If you have a test environment, iptables'ing ports 389/636 to /dev/null is a really good way to make sure that this works in a sane fashion. Around here, we don't pay by the byte, so we do:
net.ipv4.tcp_keepalive_time = 10
net.ipv4.tcp_keepalive_intvl = 5
net.ipv4.tcp_keepalive_probes = 5
to keep syncrepl sync'd. With the CentOS defaults, we saw ~2+hrs before the kernel got a clue.
On 1/31/08 4:31 PM, Aaron Richton wrote:
> A 2.3.39 replica should know the connection was lost, via the underlying OS, because it requests SO_KEEPALIVE. I assume these are (pseudo-?)dedicated servers, given the size of your OpenLDAP installation. As such, you may want to investigate your kernel tunable parameters to make keepalives more aggressive.
Yep, I'll bet that's it. These are dedicated servers and they had the Red Hat default values for the keepalive settings (which means about 2 hours 15 minutes before the replica knows that the connection got axed -- and nagios yells really loud after 10 minutes). We have never waited the full 2 hours and 15 minutes to see it recover by itself. I have changed the time/intvl/probes values to 300/5/5 (instead of 7200/75/9). So, it should figure out the connection is dead within 5 minutes 30 seconds.
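As a rough check on those figures, the worst case for noticing a silently dropped connection is the idle time plus one probe interval per unanswered probe. The sketch below just does that arithmetic for the three sets of values mentioned in this thread. The function name is made up, and the model ignores retransmission timers and any traffic still in flight.

```python
def keepalive_detection_seconds(time_s: int, intvl_s: int, probes: int) -> int:
    """Upper bound: idle time before the first probe, then every probe going unanswered."""
    return time_s + intvl_s * probes


if __name__ == "__main__":
    settings = {
        "Red Hat defaults (7200/75/9)": (7200, 75, 9),
        "new values (300/5/5)": (300, 5, 5),
        "aggressive values (10/5/5)": (10, 5, 5),
    }
    for label, (t, i, p) in settings.items():
        total = keepalive_detection_seconds(t, i, p)
        print(f"{label}: {total} s ({total // 60} min {total % 60} s)")
```

By this estimate the defaults come to 7875 seconds (a bit over 2 hours 11 minutes), the 300/5/5 values to 325 seconds, and the 10/5/5 values to 35 seconds.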
On 1/31/08 4:12 PM, Quanah Gibson-Mount wrote:
> Was the connection between the replica and master still open? What bind mechanism are you using?
I wasn't the one that worked on it. I was told "the connection still thought it was up". I believe that means that netstat was used on the replica and a connection to the master was reported back. And as we didn't wait for keepalive to figure out it had died, it would have reported as ESTABLISHED. I don't know that anyone checked on the master, where I'll bet the connection was not still showing up.
The bind mechanism is simple, so that the LDAP servers do not depend on external services (like Kerberos) to be operational.
openldap-software@openldap.org