Hi,

Of course I can add many more details, like detailed configs and logs; just ask.

We have a 4-host MMR (multi-master replication) setup in which each host replicates from the three others. Relevant snippets from the config:

* moduleload syncprov.la

* distinct (and consistent) serverIDs, configured correctly on all hosts

And then, on each host, the three other LDAP servers are defined as in the sample below from ldapm01.company.com:

syncrepl rid=212 provider=ldaps://ldapm02.company.com:636 bindmethod=simple binddn="cn=ldap_replicate,ou=Directory Access,o=company,c=com" credentials=xyz searchbase="o=company,c=com" filter="(objectClass=*)" scope=sub schemachecking=on type=refreshAndPersist retry="1 2 3 4 5 +" attrs="*,+" tls_reqcert=demand
syncrepl rid=232 provider=ldaps://ldaps01.company.com:636 bindmethod=simple binddn="cn=ldap_replicate,ou=Directory Access,o=company,c=com" credentials=xyz searchbase="o=company,c=com" filter="(objectClass=*)" scope=sub schemachecking=on type=refreshAndPersist retry="1 2 3 4 5 +" attrs="*,+" tls_reqcert=demand
syncrepl rid=242 provider=ldaps://ldaps02.company.com:636 bindmethod=simple binddn="cn=ldap_replicate,ou=Directory Access,o=company,c=com" credentials=xyz searchbase="o=company,c=com" filter="(objectClass=*)" scope=sub schemachecking=on type=refreshAndPersist retry="1 2 3 4 5 +" attrs="*,+" tls_reqcert=demand
MirrorMode True

Usually replication is quick, practically instant. But "sometimes" (yes, sometimes...) a replication line (rid) becomes 'stuck', and then, after an hour or so, it suddenly syncs up again. Restarting OpenLDAP makes it sync immediately. We notice this because we compare the contextCSN values across our nodes to monitor replication.
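Roughly, a contextCSN comparison of this kind can be sketched as below. This is only an illustration, not a monitoring tool from this setup. It assumes the contextCSN values have already been fetched from each host (for example with a base-scope search on o=company,c=com requesting contextCSN), and it assumes the usual OpenLDAP 2.4 CSN layout `timestamp#count#serverID#mod`. The host names come from this post; the function names and sample values are hypothetical.

```python
from datetime import datetime, timezone


def parse_csn(csn: str) -> tuple[int, datetime]:
    """Split a CSN like '20140305101530.123456Z#000000#001#000000'
    into (serverID, UTC timestamp). Assumes the 2.4-style layout."""
    stamp, _count, sid, _mod = csn.split("#")
    ts = datetime.strptime(stamp, "%Y%m%d%H%M%S.%fZ").replace(tzinfo=timezone.utc)
    return int(sid, 16), ts  # serverID is written in hex


def replication_lag(csns_by_host: dict[str, list[str]]) -> dict[str, dict[int, float]]:
    """For each host and serverID, return how many seconds that host's
    contextCSN lags behind the newest CSN any host has for the same serverID.
    A serverID missing on a host is reported as infinite lag."""
    parsed = {host: dict(parse_csn(c) for c in csns) for host, csns in csns_by_host.items()}

    # Newest timestamp seen anywhere, per originating serverID.
    newest: dict[int, datetime] = {}
    for per_sid in parsed.values():
        for sid, ts in per_sid.items():
            if sid not in newest or ts > newest[sid]:
                newest[sid] = ts

    lag: dict[str, dict[int, float]] = {}
    for host, per_sid in parsed.items():
        lag[host] = {}
        for sid, latest in newest.items():
            ts = per_sid.get(sid)
            lag[host][sid] = float("inf") if ts is None else (latest - ts).total_seconds()
    return lag


if __name__ == "__main__":
    # Hypothetical sample: ldaps02 is 10 minutes behind on changes from serverID 2.
    sample = {
        "ldapm01": ["20140305101530.000000Z#000000#001#000000",
                    "20140305101600.000000Z#000000#002#000000"],
        "ldaps02": ["20140305101530.000000Z#000000#001#000000",
                    "20140305100600.000000Z#000000#002#000000"],
    }
    for host, per_sid in replication_lag(sample).items():
        print(host, per_sid)
```

A non-zero lag that keeps growing for one serverID on one host would point at the single rid pulling from that provider, which matches the "one rid gets stuck" pattern described above.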

There is no significant load and the firewall ports are open, but the hosts are in different subnets (subnet A: ldapm01 / ldapm02; subnet B: ldaps01 / ldaps02). Ports 389 and 636 are of course open between the subnets.

We wondered what could cause this behaviour, and started thinking in the direction of long-lived TCP connections, which refreshAndPersist perhaps uses (much like IMAP IDLE).

Is anything special needed to make refreshAndPersist work reliably through firewalls and across subnets? Does refreshAndPersist use some kind of long-lived network connection? Is there a "keepalive" setting of some kind that we could try?

Any input would be appreciated, including input like "you are looking in the wrong direction, better look into this or that". :-)

Thanks!