On Thursday 13 September 2007 23:05:29 ando@sys-net.it wrote:
audrius.valunas@teo.lt wrote:
There is synchronous replication between master and slave. When network connectivity problems occur, the master closes the TCP connection but the slave doesn't notice; it still has the TCP connection open, but in reality it is no longer receiving updates. I think this could be solved by adding some ack from the slave, because sending on such a socket would fail and force the slave to retry the connection.
Well, this should already be taken into consideration by SO_KEEPALIVE, which is always set when available on all connections. I concur that it usually requires quite a long time before a connection is actually checked (usually more than 2 hours), so some better policy could be put in place.
I think I filed a previous ITS on this, but the servers exhibiting this behaviour in a remote site were lost (power supplies died) so I couldn't test Howard's fix at the time. We have recently installed some QA servers, which now also need to traverse a PIX firewall to get to the production master (from which they replicate one database), and I have seen the behaviour again (they go out of sync on most of the rare changes to this database until I restart them or the check kicks in).
I note that a keepalive probably needs to be sent at least once an hour for a PIX not to drop the connection. I haven't looked up any relevant RFCs on this though ...
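For illustration, a minimal sketch of what a shorter keepalive policy could look like at the socket level. This is Python rather than slapd's C, and it is not Howard's fix; the function name enable_keepalive and the timer values are assumptions, picked only so that the first probe goes out well within an hour. TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT are platform-specific (Linux has them), so the sketch sets only those the platform exposes.

```python
import socket


def enable_keepalive(sock, idle=1800, interval=60, count=5):
    """Enable SO_KEEPALIVE and shorten the probe timers where supported.

    idle, interval and count are illustrative values, not recommendations.
    Returns a dict of the per-socket TCP options that were actually set.
    """
    # Portable part: ask the kernel to probe idle connections at all.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    applied = {}
    # Platform-specific part: override the system-wide defaults
    # (often 2 hours before the first probe) for this socket only.
    for name, value in (
        ("TCP_KEEPIDLE", idle),       # seconds idle before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between unanswered probes
        ("TCP_KEEPCNT", count),       # unanswered probes before giving up
    ):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
            applied[name] = value
    return applied


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
applied = enable_keepalive(sock)
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```

With timers like these, a connection silently dropped by a firewall should be noticed once the probes go unanswered, instead of waiting for the default check.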
I can now test a fix a lot more easily (since I can upgrade one of these servers at-will, as opposed to the previous slaves which were in production).
Regards, Buchan