On Thursday 13 September 2007 23:05:29 ando(a)sys-net.it wrote:
> audrius.valunas(a)teo.lt wrote:
> > There is synchronous replication between master and slave. When network
> > connectivity problems occur master closes tcp connection but slave
> > doesn't notice those problems; it still has the tcp connection open, but in
> > reality it is not receiving updates any more.
> > I think that can be solved by adding some ack from the slave, because
> > sending on such a socket would fail and force the slave to retry the
> > connection.
>
> Well, this should already be taken into consideration by SO_KEEPALIVE,
> which is always set when available on all connections. I concur that it
> usually requires quite a long time before a connection is actually
> checked (usually more than 2 hours), so some better policy could be put
> in place.

I think I filed a previous ITS on this, but the servers exhibiting this
behaviour in a remote site were lost (power supplies died) so I couldn't test
Howard's fix at the time. We have recently installed some QA servers, which
now also need to traverse a PIX firewall to get to the production master
(from which they replicate one database), and I have seen the behaviour again
(they go out of sync on most of the rare changes to this database until I
restart them or the keepalive check kicks in).

I note that a keepalive probably needs to be sent at least once an hour for a
PIX not to drop the connection. I haven't looked up any relevant RFCs on this
though ...
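
To make the timer question concrete, here is a minimal sketch of shortening
the keepalive timers on a single socket. It is in Python purely for
illustration and is not what slapd does; the function name enable_keepalive
and the 600/60/5 values are my own assumptions. The TCP_KEEP* options are
platform-specific (Linux has them, where the system-wide idle default is
7200 seconds), so the sketch skips any the platform does not expose.

```python
import socket

def enable_keepalive(sock, idle=600, interval=60, count=5):
    """Enable TCP keepalive and, where supported, shorten its timers.

    idle:     seconds of silence before the first probe (assumed value)
    interval: seconds between unanswered probes (assumed value)
    count:    unanswered probes before the connection is dropped (assumed value)
    """
    # SO_KEEPALIVE alone uses the system-wide timers.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Per-socket overrides; not every platform defines all of these.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", count)):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return sock
```

Calling something like enable_keepalive(sock) right after connecting should
have the kernel probe an idle connection after ten minutes, which ought to be
well inside whatever the PIX allows, though I have not verified the PIX
timeout itself.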
I can now test a fix a lot more easily (since I can upgrade one of these
servers at will, as opposed to the previous slaves, which were in production).

Regards,
Buchan