Re: (ITS#5133) Synchronous replication on slave doesn't notice lost network connection

17 Sep 2007


On Thursday 13 September 2007 23:05:29 ando@sys-net.it wrote:
...
> audrius.valunas@teo.lt wrote:
...
> > There is synchronous replication between master and slave. When network
> > connectivity problems occur, the master closes the TCP connection but the
> > slave doesn't notice those problems; it still has the TCP connection open,
> > but in reality it is not receiving updates any more.
> > I think that can be solved by adding some ack from the slave, because
> > sending on such a socket would fail and force the slave to retry the
> > connection.
> Well, this should already be taken into consideration by SO_KEEPALIVE,
> which is always set when available on all connections.  I concur that it
> usually requires quite a long time before a connection is actually
> checked (usually more than 2 hours), so some better policy could be put
> in place.
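
As a rough illustration of the "better policy" mentioned above, the sketch
below enables SO_KEEPALIVE on a socket and shortens the keepalive timers where
the platform exposes them. It is Python rather than slapd's own code; the
function name enable_keepalive and the values 600/60/5 are assumptions chosen
for the example, and TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT are Linux
option names that may be missing on other systems.

```python
import socket


def enable_keepalive(sock, idle=600, interval=60, count=5):
    """Enable TCP keepalive on sock and shorten its timers where possible.

    idle, interval and count are illustrative values, not slapd defaults.
    """
    # Ask the kernel to probe the peer when the connection has been idle.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Per-socket timer overrides. These names exist on Linux; skip any
    # that this platform's socket module does not define.
    overrides = (
        ("TCP_KEEPIDLE", idle),       # seconds of idle time before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between unanswered probes
        ("TCP_KEEPCNT", count),       # unanswered probes before the connection is dropped
    )
    for name, value in overrides:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock


if __name__ == "__main__":
    s = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
    print("SO_KEEPALIVE on:", s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    s.close()
```

With timers like these, a consumer that has silently lost its peer would get
an error on the socket after minutes rather than hours and could reconnect.
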
I think I filed a previous ITS on this, but the servers exhibiting this
behaviour at a remote site were lost (power supplies died), so I couldn't test
Howard's fix at the time. We have recently installed some QA servers, which
now also need to traverse a PIX firewall to get to the production master
(from which they replicate one database), and I have seen the behaviour again
(they go out of sync on most of the rare changes to this database until I
restart them or the keepalive check kicks in).
I note that a keepalive probably needs to be sent at least once an hour for a 
PIX not to drop the connection. I haven't looked up any relevant RFCs on this 
though ...
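
To put rough numbers on this: with keepalive, a dead peer is noticed after
about the idle time plus the probe interval times the probe count, and a
stateful firewall only keeps the connection if the idle time is shorter than
its own timeout. A small sketch of that arithmetic, assuming the Linux default
timers (7200 s idle, 75 s interval, 9 probes) and taking the one-hour firewall
figure above as an unverified estimate; the function names are made up for
the example.

```python
# Linux defaults for tcp_keepalive_time, tcp_keepalive_intvl, tcp_keepalive_probes.
LINUX_DEFAULTS = {"idle": 7200, "interval": 75, "count": 9}

# Assumed firewall idle timeout in seconds (the "once an hour" estimate).
ASSUMED_FIREWALL_TIMEOUT = 3600


def detection_seconds(idle, interval, count):
    """Approximate time for keepalive to declare an idle, dead connection lost."""
    return idle + interval * count


def probes_before_firewall_drop(idle, firewall_timeout):
    """True if the first probe is sent before the firewall forgets the connection."""
    return idle < firewall_timeout


if __name__ == "__main__":
    print(detection_seconds(**LINUX_DEFAULTS))
    print(probes_before_firewall_drop(LINUX_DEFAULTS["idle"], ASSUMED_FIREWALL_TIMEOUT))
    print(probes_before_firewall_drop(600, ASSUMED_FIREWALL_TIMEOUT))
```

Under these assumptions the default idle time of two hours is longer than the
firewall timeout, which would be consistent with the slaves going out of sync
until they are restarted.
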
I can now test a fix a lot more easily (since I can upgrade one of these
servers at will, as opposed to the previous slaves, which were in production).
Regards,
Buchan
