Thanks for the response Andrew. See comments below
-----Original Message----- From: Andrew Findlay [mailto:andrew.findlay@skills-1st.co.uk] Sent: Friday, April 24, 2009 4:42 AM To: John Kane Cc: openldap-technical@openldap.org Subject: Re: Syncrepl connections failing
On Thu, Apr 23, 2009 at 05:41:09PM -0500, John Kane wrote:
I am having a problem with what appears (to me) to be 'stale' TCP connections for syncrepl between the master and a pair of slaves.
After
restarting all, I see changes on the master replicated to both
slaves.
BUT, if I wait about 30 minutes or more, then make a change, the replication fails (most of the time). netstat on the LDAP port show
the
connections still established, but queued packets at the master
server.
After about 15 minutes, the master server drops the connection. An overnight tcpdump on the master showed LDAP occasionally sending a keep-alive, with 2hrs between the keep-alive messages (these keep-
alives
are inconsistent, though, some nights I see none).
Note: The 2 slaves are running on blades in an IBM chassis, and the master is on a 1U Linux server, just 'one-hop' away. Prior to
this,
when I had a master/slave pair running on the blades, syncRepl was working fine for several months. It was not until I moved the
master
to
the another server did the failures start.
Do you have a firewall or NAT configured on or between any of the boxes? This sort of problem with long-lived connections is often due to state being dropped from IP-level devices.
[JK] Spoke with the network admin, and he guarantees only a single router exists between the master and slaves, and it is wide-open, i.e. it is just doing pure routing, no NAT or firewall that would cause this.
To debug this, I am running a couple of crons on the master server. One does a netstat every 5 minutes, the other does an LDAP password change on a test user every 30 minutes. I restart all and watch the logs. The first password change may (or may not) get replicated to the slaves. Subsequent changes do not. netstat shows the message queued for about 10-15 minutes, then the connection drops. Then I see an LDAP keep-alive every two hours, and this will briefly bring the connections back up, and it goes through the whole cycle again.
Is there a configurable keep alive for syncRepl? I've seen other equipment, load balancers for example, that have TCP timers that will drop an inactive TCP connection. On one in particular, the Array load balancer we were testing, the max of this configurable TCP timer was less than 2 hours.
********* Here's an excerpt from the cron logs *********
Thu Apr 30 04:45:01 UTC 2009 <-------------------------------------------------------- BOTH DOWN Thu Apr 30 04:50:01 UTC 2009 tcp 0 0 172.24.1.191:ldap slave-1.dist-one:44364 ESTABLISHED <-- BOTH RECOVERED BY KEEPALIVE tcp 0 0 172.24.1.191:ldap slave-2.dist-one:38348 ESTABLISHED Thu Apr 30 04:55:01 UTC 2009 tcp 0 0 172.24.1.191:ldap slave-1.dist-one:44364 ESTABLISHED tcp 0 0 172.24.1.191:ldap slave-2.dist-one:38348 ESTABLISHED Thu Apr 30 05:00:01 UTC 2009 tcp 0 0 172.24.1.191:ldap slave-1.dist-one:44364 ESTABLISHED tcp 0 0 172.24.1.191:ldap slave-2.dist-one:38348 ESTABLISHED Thu Apr 30 05:01:01 UTC 2009 New password: sNM/tVks Result: Success (0) modifying entry "uid=testSyncDrop,ou=people,o=partner_x,dc=dist-one,dc=net"
Thu Apr 30 05:05:01 UTC 2009 tcp 0 2402 172.24.1.191:ldap slave-1.dist-one:44364 ESTABLISHED <-- QUEUED ON BOTH tcp 0 2402 172.24.1.191:ldap slave-2.dist-one:38348 ESTABLISHED Thu Apr 30 05:10:01 UTC 2009 tcp 0 2402 172.24.1.191:ldap slave-1.dist-one:44364 ESTABLISHED tcp 0 2402 172.24.1.191:ldap slave-2.dist-one:38348 ESTABLISHED Thu Apr 30 05:15:01 UTC 2009 tcp 0 2402 172.24.1.191:ldap slave-1.dist-one:44364 ESTABLISHED tcp 0 2402 172.24.1.191:ldap slave-2.dist-one:38348 ESTABLISHED Thu Apr 30 05:20:01 UTC 2009 <-------------------------------------------------------- DROPPED BOTH Thu Apr 30 05:25:01 UTC 2009 Thu Apr 30 05:30:01 UTC 2009 Thu Apr 30 05:31:01 UTC 2009 New password: Ajae2z9/ Result: Success (0) modifying entry "uid=testSyncDrop,ou=people,o=partner_x,dc=dist-one,dc=net"
Thu Apr 30 05:35:01 UTC 2009 Thu Apr 30 05:40:01 UTC 2009 Thu Apr 30 05:45:01 UTC 2009 Thu Apr 30 05:50:01 UTC 2009 Thu Apr 30 05:55:01 UTC 2009 Thu Apr 30 06:00:01 UTC 2009 Thu Apr 30 06:01:01 UTC 2009 New password: j.ec4OnZ Result: Success (0) modifying entry "uid=testSyncDrop,ou=people,o=partner_x,dc=dist-one,dc=net"
Thu Apr 30 06:05:01 UTC 2009 Thu Apr 30 06:10:01 UTC 2009 Thu Apr 30 06:15:01 UTC 2009 Thu Apr 30 06:20:01 UTC 2009 Thu Apr 30 06:25:01 UTC 2009 Thu Apr 30 06:30:01 UTC 2009 Thu Apr 30 06:31:01 UTC 2009 New password: NtUirUuh Result: Success (0) modifying entry "uid=testSyncDrop,ou=people,o=partner_x,dc=dist-one,dc=net"
Thu Apr 30 06:35:01 UTC 2009 Thu Apr 30 06:40:01 UTC 2009 Thu Apr 30 06:45:01 UTC 2009 Thu Apr 30 06:50:01 UTC 2009 Thu Apr 30 06:55:01 UTC 2009 tcp 0 0 172.24.1.191:ldap slave-1.dist-one:33271 ESTABLISHED <-- BOTH RECOVERED BY KEEPALIVE tcp 0 0 172.24.1.191:ldap slave-2.dist-one:42721 ESTABLISHED Thu Apr 30 07:00:01 UTC 2009 tcp 0 0 172.24.1.191:ldap slave-1.dist-one:33271 ESTABLISHED tcp 0 0 172.24.1.191:ldap slave-2.dist-one:42721 ESTABLISHED Thu Apr 30 07:01:01 UTC 2009 New password: r/Jn0AYk Result: Success (0) modifying entry "uid=testSyncDrop,ou=people,o=partner_x,dc=dist-one,dc=net"
Thu Apr 30 07:05:01 UTC 2009 tcp 0 2402 172.24.1.191:ldap slave-1.dist-one:33271 ESTABLISHED <-- QUEUED ON ONE tcp 0 0 172.24.1.191:ldap slave-2.dist-one:42721 ESTABLISHED <-- SUCCESSFUL ON THE OTHER Thu Apr 30 07:10:01 UTC 2009 tcp 0 2402 172.24.1.191:ldap slave-1.dist-one:33271 ESTABLISHED tcp 0 0 172.24.1.191:ldap slave-2.dist-one:42721 ESTABLISHED Thu Apr 30 07:15:01 UTC 2009 tcp 0 2402 172.24.1.191:ldap slave-1.dist-one:33271 ESTABLISHED tcp 0 0 172.24.1.191:ldap slave-2.dist-one:42721 ESTABLISHED Thu Apr 30 07:20:01 UTC 2009 tcp 0 0 172.24.1.191:ldap slave-2.dist-one:42721 ESTABLISHED <-- DROPPED THE QUEUED ONE Thu Apr 30 07:25:01 UTC 2009 tcp 0 0 172.24.1.191:ldap slave-2.dist-one:42721 ESTABLISHED
********* LDAP log on the 1st slave (unsuccessful update until the next 2hr keep alive) *********
Apr 30 04:49:53 slave-1 slapd[27904]: syncrepl_entry: rid 004 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_ADD) Apr 30 04:49:53 slave-1 slapd[27904]: syncrepl_entry: rid 004 be_search (0) Apr 30 04:49:53 slave-1 slapd[27904]: syncrepl_entry: rid 004 cn=lb_check,ou=systemaccts,o=partner_x,dc=dist-one,dc=net Apr 30 04:49:53 slave-1 slapd[27904]: syncrepl_entry: rid 004 be_modify (0) Apr 30 04:49:53 slave-1 slapd[27904]: do_syncrep2: rid 004 LDAP_RES_INTERMEDIATE - REFRESH_DELETE Apr 30 06:49:52 slave-1 slapd[27904]: do_syncrep2: rid 004 Can't contact LDAP server Apr 30 06:49:52 slave-1 slapd[27904]: do_syncrepl: rid 004 retrying (9 retries left) Apr 30 06:50:22 slave-1 slapd[27904]: syncrepl_entry: rid 004 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_ADD) Apr 30 06:50:22 slave-1 slapd[27904]: syncrepl_entry: rid 004 be_search (0) Apr 30 06:50:22 slave-1 slapd[27904]: syncrepl_entry: rid 004 o=partner_x,dc=dist-one,dc=net <snip> Apr 30 06:50:22 slave-1 slapd[27904]: syncrepl_entry: rid 004 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_ADD) Apr 30 06:50:22 slave-1 slapd[27904]: syncrepl_entry: rid 004 be_search (0) Apr 30 06:50:22 slave-1 slapd[27904]: syncrepl_entry: rid 004 cn=lb_check,ou=systemaccts,o=partner_x,dc=dist-one,dc=net Apr 30 06:50:22 slave-1 slapd[27904]: syncrepl_entry: rid 004 be_modify (0) Apr 30 06:50:22 slave-1 slapd[27904]: do_syncrep2: rid 004 LDAP_RES_INTERMEDIATE - REFRESH_DELETE Apr 30 06:53:32 slave-1 slapd[27904]: connection_read(33): no connection! Apr 30 06:57:32 slave-1 slapd[27904]: connection_read(33): no connection! Apr 30 07:45:32 slave-1 slapd[27904]: connection_read(37): no connection! Apr 30 07:55:32 slave-1 slapd[27904]: connection_read(33): no connection! Apr 30 07:56:32 slave-1 slapd[27904]: connection_read(33): no connection! Apr 30 07:59:32 slave-1 slapd[27904]: connection_read(33): no connection! Apr 30 08:17:32 slave-1 slapd[27904]: connection_read(33): no connection! Apr 30 08:25:32 slave-1 slapd[27904]: connection_read(33): no connection! Apr 30 08:31:32 slave-1 slapd[27904]: connection_read(37): no connection! Apr 30 08:46:32 slave-1 slapd[27904]: connection_read(37): no connection! Apr 30 08:50:21 slave-1 slapd[27904]: do_syncrep2: rid 004 Can't contact LDAP server Apr 30 08:50:21 slave-1 slapd[27904]: do_syncrepl: rid 004 retrying (9 retries left) Apr 30 08:50:51 slave-1 slapd[27904]: syncrepl_entry: rid 004 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_ADD) Apr 30 08:50:51 slave-1 slapd[27904]: syncrepl_entry: rid 004 be_search (0) Apr 30 08:50:51 slave-1 slapd[27904]: syncrepl_entry: rid 004 o=partner_x,dc=dist-one,dc=net
********* On the 2nd slave (successful update) *********
Apr 30 04:49:52 slave-2 slapd[6973]: syncrepl_entry: rid 002 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_ADD) Apr 30 04:49:52 slave-2 slapd[6973]: syncrepl_entry: rid 002 be_search (0) Apr 30 04:49:52 slave-2 slapd[6973]: syncrepl_entry: rid 002 cn=lb_check,ou=systemaccts,o=partner_x,dc=dist-one,dc=net Apr 30 04:49:52 slave-2 slapd[6973]: syncrepl_entry: rid 002 be_modify (0) Apr 30 04:49:52 slave-2 slapd[6973]: do_syncrep2: rid 002 LDAP_RES_INTERMEDIATE - REFRESH_DELETE Apr 30 06:49:51 slave-2 slapd[6973]: do_syncrep2: rid 002 Can't contact LDAP server Apr 30 06:49:51 slave-2 slapd[6973]: do_syncrepl: rid 002 retrying (9 retries left) Apr 30 06:50:21 slave-2 slapd[6973]: syncrepl_entry: rid 002 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_ADD) Apr 30 06:50:21 slave-2 slapd[6973]: syncrepl_entry: rid 002 be_search (0) Apr 30 06:50:21 slave-2 slapd[6973]: syncrepl_entry: rid 002 o=partner_x,dc=dist-one,dc=net <snip> Apr 30 06:50:21 slave-2 slapd[6973]: syncrepl_entry: rid 002 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_ADD) Apr 30 06:50:21 slave-2 slapd[6973]: syncrepl_entry: rid 002 be_search (0) Apr 30 06:50:21 slave-2 slapd[6973]: syncrepl_entry: rid 002 cn=lb_check,ou=systemaccts,o=partner_x,dc=dist-one,dc=net Apr 30 06:50:21 slave-2 slapd[6973]: syncrepl_entry: rid 002 be_modify (0) Apr 30 06:50:21 slave-2 slapd[6973]: do_syncrep2: rid 002 LDAP_RES_INTERMEDIATE - REFRESH_DELETE Apr 30 07:01:01 slave-2 slapd[6973]: syncrepl_entry: rid 002 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_MODIFY) Apr 30 07:01:01 slave-2 slapd[6973]: syncrepl_entry: rid 002 be_search (0) Apr 30 07:01:01 slave-2 slapd[6973]: syncrepl_entry: rid 002 uid=testSyncDrop,ou=people,o=partner_x,dc=dist-one,dc=net Apr 30 07:01:02 slave-2 slapd[6973]: syncrepl_entry: rid 002 be_modify (0)
Thanks again, John
This message is confidential to Prodea Systems, Inc unless otherwise indicated or apparent from its nature. This message is directed to the intended recipient only, who may be readily determined by the sender of this message and its contents. If the reader of this message is not the intended recipient, or an employee or agent responsible for delivering this message to the intended recipient:(a)any dissemination or copying of this message is strictly prohibited; and(b)immediately notify the sender by return message and destroy any copies of this message in any form(electronic, paper or otherwise) that you have.The delivery of this message and its information is neither intended to be nor constitutes a disclosure or waiver of any trade secrets, intellectual property, attorney work product, or attorney-client communications. The authority of the individual sending this message to legally bind Prodea Systems is neither apparent nor implied,and must be independently verified.