On Thu, Apr 23, 2009 at 05:41:09PM -0500, John Kane wrote:
I am having a problem with what appears (to me) to be 'stale' TCP connections for syncrepl between the master and a pair of slaves. After restarting everything, I see changes on the master replicated to both slaves. BUT, if I wait about 30 minutes or more and then make a change, the replication fails (most of the time). netstat on the LDAP port shows the connections still established, but with packets queued at the master server. After about 15 minutes, the master server drops the connection. An overnight tcpdump on the master showed LDAP occasionally sending a keep-alive, with 2 hours between the keep-alive messages (these keep-alives are inconsistent, though; some nights I see none).
Note: The 2 slaves are running on blades in an IBM chassis, and the master is on a 1U Linux server, just 'one hop' away. Prior to this, when I had a master/slave pair running on the blades, syncrepl was working fine for several months. It was not until I moved the master to another server that the failures started.
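Do you have a firewall or NAT configured on or between any of the boxes? This sort of problem with long-lived connections is often due to connection state being dropped by IP-level devices: once a firewall or NAT has expired its entry for an idle connection, later packets on that connection are silently discarded, while both endpoints still show it as established.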
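
If a stateful device does turn out to be the cause, one common workaround is to send keep-alive probes more often than the device's idle timeout. The 2-hour spacing seen in the tcpdump matches the Linux default for tcp_keepalive_time (7200 seconds), which is usually far too long for that purpose. The following is only a rough sketch in Python of how an application can shorten the interval on its own socket; the function name and the 300/60/5 values are made up for illustration, and this is not how slapd itself is configured.

```python
import socket


def enable_keepalive(sock, idle=300, interval=60, probes=5):
    """Turn on TCP keep-alive probes for an existing socket.

    idle, interval and probes are illustrative values (assumptions),
    chosen to be well below a typical firewall idle timeout.
    """
    # Ask the kernel to send keep-alive probes on this connection.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Per-socket timing options. These names exist on Linux; they are
    # looked up defensively because other platforms may not define them.
    for name, value in (
        ("TCP_KEEPIDLE", idle),       # seconds of idle time before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between unanswered probes
        ("TCP_KEEPCNT", probes),      # unanswered probes before the connection is dropped
    ):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return sock


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s)
    print("SO_KEEPALIVE enabled:", s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    s.close()
```

The same effect can be had system-wide on Linux by lowering net.ipv4.tcp_keepalive_time with sysctl, but that only applies to sockets that have SO_KEEPALIVE set in the first place.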
Andrew