Hi,

I have an issue where replication on a consumer starts failing and does not recover until OpenLDAP is restarted.

My setup consists of a pair of on-prem MirrorMode replicated providers (only one is active at a given time using a virtual IP managed by Keepalived), and one off-site (AWS) consumer. The providers use a dedicated port (LDAPS on 1636) for their own replication, as well as for the consumer to connect to them, so the consumer has access to both servers, regardless of where the providers' virtual IP is residing.

All the connections happen over LDAPS, and the syncrepl configs have the tls_reqcert=allow option.

The providers are always in sync, and I can easily make either one the "active" one. The consumer does the initial sync and stays in sync for a while, but I often (almost daily) find it out of sync. When that happens, I see error messages on both the consumer and the provider side:

On the consumer (every minute):
Sep 20 08:19:31 <consumer> slapd[1440]: slap_client_connect: URI=ldaps://<provider1>:1636/ DN="uid=replication,ou=sysaccounts,dc=example,dc=com" ldap_sasl_bind_s failed (-1)
Sep 20 08:19:31 <consumer> slapd[1440]: do_syncrepl: rid=001 rc -1 retrying
Sep 20 08:19:31 <consumer> slapd[1440]: slap_client_connect: URI=ldaps://<provider2>:1636/ DN="uid=replication,ou=sysaccounts,dc=example,dc=com" ldap_sasl_bind_s failed (-1)
Sep 20 08:19:31 <consumer> slapd[1440]: do_syncrepl: rid=002 rc -1 retrying
Sep 20 08:20:31 <consumer> slapd[1440]: slap_client_connect: URI=ldaps://<provider1>:1636/ DN="uid=replication,ou=sysaccounts,dc=example,dc=com" ldap_sasl_bind_s failed (-1)
Sep 20 08:20:31 <consumer> slapd[1440]: do_syncrepl: rid=001 rc -1 retrying
Sep 20 08:20:31 <consumer> slapd[1440]: slap_client_connect: URI=ldaps://<provider2>:1636/ DN="uid=replication,ou=sysaccounts,dc=example,dc=com" ldap_sasl_bind_s failed (-1)
Sep 20 08:20:31 <consumer> slapd[1440]: do_syncrepl: rid=002 rc -1 retrying

On the provider (every minute):
Sep 20 08:19:31 <provider1> slapd[1057]: conn=11242 fd=14 ACCEPT from IP=<consumer IP>:45438 (IP=<provider1 IP>:1636)
Sep 20 08:19:31 <provider1> slapd[1057]: conn=11242 fd=14 TLS established tls_ssf=256 ssf=256
Sep 20 08:19:31 <provider1> slapd[1057]: conn=11242 fd=14 closed (connection lost)
Sep 20 08:20:31 <provider1> slapd[1057]: conn=11243 fd=14 ACCEPT from IP=<consumer IP>:45458 (IP=<provider1 IP>:1636)
Sep 20 08:20:31 <provider1> slapd[1057]: conn=11243 fd=14 TLS established tls_ssf=256 ssf=256
Sep 20 08:20:31 <provider1> slapd[1057]: conn=11243 fd=14 closed (connection lost)

Sep 20 08:19:31 <provider2> slapd[1051]: conn=215893 fd=18 ACCEPT from IP=<consumer IP>:41706 (IP=<provider2 IP>:1636)
Sep 20 08:19:31 <provider2> slapd[1051]: conn=215893 fd=18 TLS established tls_ssf=256 ssf=256
Sep 20 08:19:31 <provider2> slapd[1051]: conn=215893 fd=18 closed (connection lost)
Sep 20 08:20:31 <provider2> slapd[1051]: conn=215898 fd=18 ACCEPT from IP=<consumer IP>:41726 (IP=<provider2 IP>:1636)
Sep 20 08:20:31 <provider2> slapd[1051]: conn=215898 fd=18 TLS established tls_ssf=256 ssf=256
Sep 20 08:20:31 <provider2> slapd[1051]: conn=215898 fd=18 closed (connection lost)
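
For reference, a minimal Python sketch of how the provider log lines above can be grouped per connection to confirm the pattern: TLS is established and the connection is closed without any BIND being logged. The function names are made up for illustration, and the conn=11000 lines are a hypothetical healthy connection added only for contrast; they are not from my logs.

```python
import re
from collections import defaultdict

# Matches slapd connection lines such as
# "... slapd[1057]: conn=11242 fd=14 TLS established tls_ssf=256 ssf=256"
CONN_LINE = re.compile(r"conn=(\d+) (?:fd|op)=\d+ (.*)$")


def summarize_connections(lines):
    """Group slapd log events by connection number."""
    events = defaultdict(list)
    for line in lines:
        match = CONN_LINE.search(line)
        if match:
            events[int(match.group(1))].append(match.group(2).strip())
    return dict(events)


def dropped_before_bind(events):
    """Return conn ids that completed TLS but were closed without a BIND line."""
    dropped = []
    for conn, evs in sorted(events.items()):
        tls = any(e.startswith("TLS established") for e in evs)
        bind = any(e.startswith("BIND") for e in evs)
        closed = any(e.startswith("closed") for e in evs)
        if tls and closed and not bind:
            dropped.append(conn)
    return dropped


sample = [
    # Hypothetical healthy connection, for contrast only.
    'slapd[1057]: conn=11000 fd=14 ACCEPT from IP=192.0.2.10:40000 (IP=192.0.2.1:1636)',
    'slapd[1057]: conn=11000 fd=14 TLS established tls_ssf=256 ssf=256',
    'slapd[1057]: conn=11000 op=0 BIND dn="uid=replication,ou=sysaccounts,dc=example,dc=com" method=128',
    'slapd[1057]: conn=11000 fd=14 closed',
    # Lines in the shape of the failing connections quoted above.
    'slapd[1057]: conn=11242 fd=14 ACCEPT from IP=192.0.2.10:45438 (IP=192.0.2.1:1636)',
    'slapd[1057]: conn=11242 fd=14 TLS established tls_ssf=256 ssf=256',
    'slapd[1057]: conn=11242 fd=14 closed (connection lost)',
    'slapd[1057]: conn=11243 fd=14 ACCEPT from IP=192.0.2.10:45458 (IP=192.0.2.1:1636)',
    'slapd[1057]: conn=11243 fd=14 TLS established tls_ssf=256 ssf=256',
    'slapd[1057]: conn=11243 fd=14 closed (connection lost)',
]

dropped = dropped_before_bind(summarize_connections(sample))
print(dropped)  # [11242, 11243]
```

Run against a real provider log, this only shows which connections never reached the bind; it does not say why the consumer gave up.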

I suspect the problem is on the consumer side: once the issue starts, the consumer cannot bind to either provider, even though both providers log a successful TLS handshake before the connection is lost.

Once I restart the consumer, it quickly resyncs and works fine again for a while.

The providers are OpenLDAP 2.4.44 (openldap-2.4.44-24.el7_9.x86_64), running on RHEL 7.
The consumer is OpenLDAP 2.4.44 (openldap-2.4.44-24.el7_9.x86_64), running on CentOS 7.

The consumer syncrepl config is:
olcSyncrepl: {0}rid=001
  provider=ldaps://<provider1>:1636/
  searchbase="dc=example,dc=com"
  type=refreshAndPersist
  retry="60 +"
  timeout=1
  bindmethod=simple
  binddn="uid=replication,ou=SysAccounts,dc=example,dc=com"
  credentials=<credentials>
  tls_reqcert=allow
olcSyncrepl: {1}rid=002
  provider=ldaps://<provider2>:1636/
  searchbase="dc=example,dc=com"
  type=refreshAndPersist
  retry="60 +"
  timeout=1
  bindmethod=simple
  binddn="uid=replication,ou=SysAccounts,dc=example,dc=com"
  credentials=<credentials>
  tls_reqcert=allow
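
As a sanity check on a config like this, here is a minimal Python sketch that splits each olcSyncrepl value into its key=value settings and reports any provider URI used by more than one rid. The function names and the hostnames in the sample are hypothetical; the parsing assumes the simple key=value layout shown above.

```python
import re
import shlex
from collections import Counter


def parse_syncrepl(value):
    """Split one olcSyncrepl value into a dict of its key=value settings."""
    value = re.sub(r"^\{\d+\}", "", value.strip())  # drop the {n} ordering prefix
    settings = {}
    for token in shlex.split(value):  # shlex keeps quoted values such as "60 +" together
        key, _, val = token.partition("=")
        settings[key] = val
    return settings


def duplicate_providers(values):
    """Return provider URIs that appear in more than one syncrepl stanza."""
    counts = Counter(parse_syncrepl(v)["provider"] for v in values)
    return [uri for uri, n in counts.items() if n > 1]


# Hypothetical hostnames; the remaining settings mirror the config above.
values = [
    '''{0}rid=001
      provider=ldaps://provider1.example.com:1636/
      searchbase="dc=example,dc=com"
      type=refreshAndPersist
      retry="60 +"
      timeout=1
      bindmethod=simple
      binddn="uid=replication,ou=SysAccounts,dc=example,dc=com"
      tls_reqcert=allow''',
    '''{1}rid=002
      provider=ldaps://provider2.example.com:1636/
      searchbase="dc=example,dc=com"
      type=refreshAndPersist
      retry="60 +"
      timeout=1
      bindmethod=simple
      binddn="uid=replication,ou=SysAccounts,dc=example,dc=com"
      tls_reqcert=allow''',
]

for value in values:
    settings = parse_syncrepl(value)
    print(settings["rid"], settings["provider"], settings["retry"], settings["timeout"])

print(duplicate_providers(values))
```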

The "uid=replication,ou=SysAccounts,dc=example,dc=com" DN has read-only access to the entire "dc=example,dc=com" tree.

Any idea what might be causing this?

Thank you,
Mircea
--
Mircea Baciu | Senior Unix Systems Administrator
Simmons University | 300 The Fenway | Boston, MA 02115 | 617-521-2194