I have a 2-way multi-master setup on srv1.foo.com (EDT) and srv2.foo.com (PDT).
For about two hours this morning, srv2 was logging "Can't contact LDAP server" to syslog while it was in a replication conversation with srv1:
Sep 9 05:29:45 srv2 slapd[9413]: do_syncrep2: rid=001 (-1) Can't contact LDAP server
Sep 9 05:29:45 srv2 slapd[9413]: do_syncrepl: rid=001 rc -1 retrying
Sep 9 05:30:00 srv2 slapd[9413]: do_syncrep2: rid=001 LDAP_RES_INTERMEDIATE - SYNC_ID_SET
Sep 9 05:30:01 srv2 last message repeated 34 times
Sep 9 05:30:01 srv2 slapd[9413]: syncrepl_entry: rid=001 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_ADD)
Sep 9 05:30:01 srv2 slapd[9413]: syncrepl_entry: rid=001 be_search (0)
Sep 9 05:30:01 srv2 slapd[9413]: syncrepl_entry: rid=001 ...
Sep 9 05:30:01 srv2 slapd[9413]: syncprov_matchops: skipping original sid 003
Sep 9 05:30:01 srv2 slapd[9413]: syncprov_matchops: skipping original sid 003
Sep 9 05:30:01 srv2 slapd[9413]: syncrepl_entry: rid=001 be_modify ... (0)
Sep 9 05:30:01 srv2 slapd[9413]: syncrepl_entry: rid=001 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_ADD)
Sep 9 05:30:01 srv2 slapd[9413]: syncrepl_entry: rid=001 be_search (0)
Sep 9 05:30:01 srv2 slapd[9413]: syncrepl_entry: rid=001 ...
Sep 9 05:30:01 srv2 slapd[9413]: syncprov_matchops: skipping original sid 003
Sep 9 05:30:01 srv2 slapd[9413]: syncprov_matchops: skipping original sid 003
Sep 9 05:30:01 srv2 slapd[9413]: syncrepl_entry: rid=001 be_modify ... (0)
...
Sep 9 05:30:29 srv2 slapd[9413]: syncrepl_entry: rid=001 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_ADD)
Sep 9 05:30:29 srv2 slapd[9413]: syncrepl_entry: rid=001 be_search (0)
Sep 9 05:30:29 srv2 slapd[9413]: syncrepl_entry: rid=001 ...
Sep 9 05:30:29 srv2 slapd[9413]: syncprov_matchops: skipping original sid 003
Sep 9 05:30:29 srv2 slapd[9413]: syncprov_matchops: skipping original sid 003
Sep 9 05:30:29 srv2 slapd[9413]: syncrepl_entry: rid=001 be_modify ... (0)
Sep 9 05:30:29 srv2 slapd[9413]: do_syncrep2: rid=001 (-1) Can't contact LDAP server
Sep 9 05:30:29 srv2 slapd[9413]: do_syncrepl: rid=001 rc -1 retrying
Sep 9 05:30:45 srv2 slapd[9413]: do_syncrep2: rid=001 LDAP_RES_INTERMEDIATE - SYNC_ID_SET
Sep 9 05:30:45 srv2 last message repeated 34 times
Sep 9 05:30:45 srv2 slapd[9413]: syncrepl_entry: rid=001 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_ADD)
Sep 9 05:30:45 srv2 slapd[9413]: syncrepl_entry: rid=001 be_search (0)
Sep 9 05:30:45 srv2 slapd[9413]: syncrepl_entry: rid=001 ...
Sep 9 05:30:45 srv2 slapd[9413]: syncprov_matchops: skipping original sid 003
Sep 9 05:30:45 srv2 slapd[9413]: syncprov_matchops: skipping original sid 003
Sep 9 05:30:45 srv2 slapd[9413]: syncrepl_entry: rid=001 be_modify ... (0)
...
During that time no errors or connection closes were logged in the srv1 syslog. I tried restarting each slapd, but the issue continued. After about two hours it stopped occurring.
My question is: what would cause the "Can't contact LDAP server" message? The srv1 side doesn't log any error, nor does it log that the connection was closed. The text "Can't contact" would seem to imply that the error occurred during a connection attempt, but these errors seemed to occur during an active conversation. Does the srv2 side notice a read or write error on the socket and abandon the connection? I've been looking through the code trying to determine what causes that error message. Does it happen on a single read/write error? Does it retry a few times? I hate to just say "Must be a network error" without some due diligence.
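As part of that due diligence, here is a rough sketch of how the srv2 syslog could be summarized to see how long each replication session stayed up before the error. It assumes the syslog line format shown above; the function name session_lengths and the year default are my own placeholders (syslog lines carry no year), not anything from OpenLDAP.

```python
import re
from datetime import datetime

MONTHS = {name: num for num, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

# "Sep 9 05:29:45 srv2 slapd[9413]: do_syncrep2: ..." -> month, day, h, m, s, message
PREFIX = re.compile(
    r"^(\w{3})\s+(\d{1,2})\s+(\d{2}):(\d{2}):(\d{2})\s+\S+\s+(.*)$")


def session_lengths(lines, year=2011):
    """Return (resumed, dropped, seconds) for each syncrepl session that
    resumed after a 'retrying' line and later ended with
    "Can't contact LDAP server". The year is an assumption, since
    syslog timestamps do not include one."""
    sessions = []
    waiting = False   # True between a 'retrying' line and the next rid= line
    resumed = None    # timestamp at which the current session resumed
    for line in lines:
        m = PREFIX.match(line.strip())
        if not m or m.group(1) not in MONTHS:
            continue
        mon, day, hh, mm, ss, msg = m.groups()
        ts = datetime(year, MONTHS[mon], int(day), int(hh), int(mm), int(ss))
        if "Can't contact LDAP server" in msg:
            if resumed is not None:
                seconds = int((ts - resumed).total_seconds())
                sessions.append((resumed, ts, seconds))
            resumed = None
            waiting = False
        elif "retrying" in msg:
            waiting = True
        elif waiting and "rid=" in msg:
            resumed = ts
            waiting = False
    return sessions


if __name__ == "__main__":
    import sys
    # Usage (hypothetical path): python sessions.py < /var/log/messages
    for start, end, seconds in session_lengths(sys.stdin):
        print(f"{start:%H:%M:%S} -> {end:%H:%M:%S}  {seconds}s")
```

If the session lengths cluster around one value, that would point at a timer somewhere rather than random packet loss; if they vary widely, a flaky network path looks more likely.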
My env:
RHEL 5.5
OpenLDAP 2.4.25
BerkeleyDB 4.8.40
OpenSSL 1.0.0d
Cyrus SASL 2.1.23
2-way Multi-master
====================
#srv1 slapd.conf -> slapd.d
include /appl/openldap/etc/schema/core.schema
include /appl/openldap/etc/schema/cosine.schema
include /appl/openldap/etc/schema/nis.schema
include /appl/openldap/etc/schema/inetorgperson.schema
include /appl/openldap/etc/schema/foo.com.schema

argsfile /appl/openldap/var/run/slapd.args
pidfile /appl/openldap/var/run/slapd.pid

threads 16
tool-threads 4

idletimeout 300
writetimeout 5

reverse-lookup off

timelimit time.soft=30 time.hard=300
sizelimit size.soft=500 size.hard=1000

password-hash {SSHA}

loglevel stats sync

serverid 1 ldap://srv1.foo.com:10806

modulepath /appl/openldap/libexec
moduleload back_monitor.la
moduleload back_hdb.la
moduleload syncprov.la

database config
rootdn cn=manager,dc=foo,dc=com

database monitor
rootdn cn=manager,dc=foo,dc=com

database hdb
rootdn cn=manager,dc=foo,dc=com
suffix dc=foo,dc=com
directory /appl/openldap/var/data/dc=foo,dc=com
cachesize 1000
idlcachesize 3000
cachefree 5
checkpoint 128 15

index objectClass eq
index entryCSN eq
index entryUUID eq

syncrepl rid=001
  provider=ldap://srv1.foo.com:10806
  type=refreshAndPersist
  retry="15 +"
  bindmethod=simple
  binddn="cn=replicator,dc=foo,dc=com"
  credentials="secret"
  searchbase="dc=foo,dc=com"
  starttls=no
  schemachecking=off

syncrepl rid=002
  provider=ldap://srv2.foo.com:10806
  type=refreshAndPersist
  retry="15 +"
  bindmethod=simple
  binddn="cn=replicator,dc=foo,dc=com"
  credentials="secret"
  searchbase="dc=foo,dc=com"
  starttls=no
  schemachecking=off

mirrormode TRUE

overlay syncprov
syncprov-checkpoint 50 10
syncprov-sessionlog 100

limits dn.exact="cn=replicator,dc=foo,dc=com"
  time.soft=unlimited time.hard=unlimited
  size.soft=unlimited size.hard=unlimited
====================
#srv2 slapd.conf -> slapd.d
include /appl/openldap/etc/schema/core.schema
include /appl/openldap/etc/schema/cosine.schema
include /appl/openldap/etc/schema/nis.schema
include /appl/openldap/etc/schema/inetorgperson.schema
include /appl/openldap/etc/schema/foo.com.schema

argsfile /appl/openldap/var/run/slapd.args
pidfile /appl/openldap/var/run/slapd.pid

threads 16
tool-threads 4

idletimeout 300
writetimeout 5

reverse-lookup off

timelimit time.soft=30 time.hard=300
sizelimit size.soft=500 size.hard=1000

password-hash {SSHA}

loglevel stats sync

serverid 2 ldap://srv2.foo.com:10806

modulepath /appl/openldap/libexec
moduleload back_monitor.la
moduleload back_hdb.la
moduleload syncprov.la

database config
rootdn cn=manager,dc=foo,dc=com

database monitor
rootdn cn=manager,dc=foo,dc=com

database hdb
rootdn cn=manager,dc=foo,dc=com
suffix dc=foo,dc=com
directory /appl/openldap/var/data/dc=foo,dc=com
cachesize 1000
idlcachesize 3000
cachefree 5
checkpoint 128 15

index objectClass eq
index entryCSN eq
index entryUUID eq

syncrepl rid=001
  provider=ldap://srv1.foo.com:10806
  type=refreshAndPersist
  retry="15 +"
  bindmethod=simple
  binddn="cn=replicator,dc=foo,dc=com"
  credentials="secret"
  searchbase="dc=foo,dc=com"
  starttls=no
  schemachecking=off

syncrepl rid=002
  provider=ldap://srv2.foo.com:10806
  type=refreshAndPersist
  retry="15 +"
  bindmethod=simple
  binddn="cn=replicator,dc=foo,dc=com"
  credentials="secret"
  searchbase="dc=foo,dc=com"
  starttls=no
  schemachecking=off

mirrormode TRUE

overlay syncprov
syncprov-checkpoint 50 10
syncprov-sessionlog 100

limits dn.exact="cn=replicator,dc=foo,dc=com"
  time.soft=unlimited time.hard=unlimited
  size.soft=unlimited size.hard=unlimited
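
For reference, my reading of retry="15 +" per slapd.conf(5) is "retry every 15 seconds, indefinitely", which matches the roughly 15-second gap between each "retrying" line and the next SYNC_ID_SET line in the log. The sketch below only illustrates how I understand the <interval> <# of retries> pairs; parse_retry and retry_delays are hypothetical helper names of mine, not slapd functions.

```python
from itertools import islice


def parse_retry(spec):
    """Parse a syncrepl retry string into (interval_seconds, count) pairs.
    A count of '+' (retry indefinitely) is returned as None."""
    tokens = spec.split()
    if not tokens or len(tokens) % 2 != 0:
        raise ValueError("retry must be '<interval> <# of retries>' pairs")
    pairs = []
    for interval, count in zip(tokens[0::2], tokens[1::2]):
        pairs.append((int(interval), None if count == "+" else int(count)))
    return pairs


def retry_delays(spec):
    """Yield the successive delays (in seconds) implied by a retry string.
    The generator never ends if a pair uses '+'."""
    for interval, count in parse_retry(spec):
        if count is None:
            while True:
                yield interval
        else:
            for _ in range(count):
                yield interval


if __name__ == "__main__":
    print(parse_retry("15 +"))
    print(list(islice(retry_delays("60 10 300 3"), 13)))
```

With this reading, the consumer never gives up on the provider, so the two hours of repeated errors are at least consistent with the configured retry behavior.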
openldap-technical@openldap.org