Syncrepl failing after a while - openldap-technical

20 Sep 2021


Hi,
I have an issue where replication on a consumer starts failing and stays
broken until OpenLDAP is restarted.
My setup consists of a pair of on-prem MirrorMode replicated providers
(only one is active at a given time using a virtual IP managed by
Keepalived), and one off-site (AWS) consumer. The providers use a dedicated
port (LDAPS on 1636) for their own replication, as well as for the consumer
to connect to them, so the consumer has access to both servers, regardless
of where the providers' virtual IP is residing.
All the connections happen over LDAPS, and the syncrepl configs have the
tls_reqcert=allow option.
The providers are always in sync, and I can make either one the "active"
one with ease. The consumer does the initial sync and stays in sync for a
while, but I often (almost daily) find it out of sync. I see error messages
on both the consumer and provider side:
On the consumer (every minute):
Sep 20 08:19:31 <consumer> slapd[1440]: slap_client_connect: URI=ldaps://<provider1>:1636/ DN="uid=replication,ou=sysaccounts,dc=example,dc=com" ldap_sasl_bind_s failed (-1)
Sep 20 08:19:31 <consumer> slapd[1440]: do_syncrepl: rid=001 rc -1 retrying
Sep 20 08:19:31 <consumer> slapd[1440]: slap_client_connect: URI=ldaps://<provider2>:1636/ DN="uid=replication,ou=sysaccounts,dc=example,dc=com" ldap_sasl_bind_s failed (-1)
Sep 20 08:19:31 <consumer> slapd[1440]: do_syncrepl: rid=002 rc -1 retrying
Sep 20 08:20:31 <consumer> slapd[1440]: slap_client_connect: URI=ldaps://<provider1>:1636/ DN="uid=replication,ou=sysaccounts,dc=example,dc=com" ldap_sasl_bind_s failed (-1)
Sep 20 08:20:31 <consumer> slapd[1440]: do_syncrepl: rid=001 rc -1 retrying
Sep 20 08:20:31 <consumer> slapd[1440]: slap_client_connect: URI=ldaps://<provider2>:1636/ DN="uid=replication,ou=sysaccounts,dc=example,dc=com" ldap_sasl_bind_s failed (-1)
Sep 20 08:20:31 <consumer> slapd[1440]: do_syncrepl: rid=002 rc -1 retrying
On the providers (every minute):
Sep 20 08:19:31 <provider1> slapd[1057]: conn=11242 fd=14 ACCEPT from IP=<consumer IP>:45438 (IP=<provider1 IP>:1636)
Sep 20 08:19:31 <provider1> slapd[1057]: conn=11242 fd=14 TLS established tls_ssf=256 ssf=256
Sep 20 08:19:31 <provider1> slapd[1057]: conn=11242 fd=14 closed (connection lost)
Sep 20 08:20:31 <provider1> slapd[1057]: conn=11243 fd=14 ACCEPT from IP=<consumer IP>:45458 (IP=<provider1 IP>:1636)
Sep 20 08:20:31 <provider1> slapd[1057]: conn=11243 fd=14 TLS established tls_ssf=256 ssf=256
Sep 20 08:20:31 <provider1> slapd[1057]: conn=11243 fd=14 closed (connection lost)
Sep 20 08:19:31 <provider2> slapd[1051]: conn=215893 fd=18 ACCEPT from IP=<consumer IP>:41706 (IP=<provider2 IP>:1636)
Sep 20 08:19:31 <provider2> slapd[1051]: conn=215893 fd=18 TLS established tls_ssf=256 ssf=256
Sep 20 08:19:31 <provider2> slapd[1051]: conn=215893 fd=18 closed (connection lost)
Sep 20 08:20:31 <provider2> slapd[1051]: conn=215898 fd=18 ACCEPT from IP=<consumer IP>:41726 (IP=<provider2 IP>:1636)
Sep 20 08:20:31 <provider2> slapd[1051]: conn=215898 fd=18 TLS established tls_ssf=256 ssf=256
Sep 20 08:20:31 <provider2> slapd[1051]: conn=215898 fd=18 closed (connection lost)
There must be something wrong on the consumer side since when the issue
starts, the consumer is not able to connect to either provider.
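To put numbers on how often each rid fails, the retry messages in the consumer's syslog can be tallied with a short script. This is only a minimal sketch, assuming log lines in the same format as the consumer excerpt above; the function name `count_retries` and the sample host name are made up for illustration.

```python
import re
from collections import Counter

# Matches consumer lines such as:
#   "... slapd[1440]: do_syncrepl: rid=001 rc -1 retrying"
RETRY_RE = re.compile(r"do_syncrepl: rid=(\d+) rc (-?\d+) retrying")


def count_retries(lines):
    """Return a Counter mapping (rid, rc) to the number of retry messages."""
    counts = Counter()
    for line in lines:
        match = RETRY_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


# Sample input modelled on the consumer log above (host name is a placeholder).
sample = [
    "Sep 20 08:19:31 consumer slapd[1440]: do_syncrepl: rid=001 rc -1 retrying",
    "Sep 20 08:19:31 consumer slapd[1440]: do_syncrepl: rid=002 rc -1 retrying",
    "Sep 20 08:20:31 consumer slapd[1440]: do_syncrepl: rid=001 rc -1 retrying",
    "Sep 20 08:20:31 consumer slapd[1440]: do_syncrepl: rid=002 rc -1 retrying",
]

if __name__ == "__main__":
    for (rid, rc), n in sorted(count_retries(sample).items()):
        print(f"rid={rid} rc={rc}: {n} retries")
```

Feeding it the real syslog file instead of `sample` would show whether both rids always start failing at the same moment, which is what the excerpt suggests.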
Once I restart the consumer, it quickly resyncs and works just fine, for a
while.
The providers are OpenLDAP 2.4.44 (openldap-2.4.44-24.el7_9.x86_64),
running on RHEL 7.
The consumer is OpenLDAP 2.4.44 (openldap-2.4.44-24.el7_9.x86_64), running
on CentOS 7.
The consumer syncrepl config is:
olcSyncrepl: {0}rid=001
  provider=ldaps://<provider1>:1636/
  searchbase="dc=example,dc=com"
  type=refreshAndPersist
  retry="60 +"
  timeout=1
  bindmethod=simple
  binddn="uid=replication,ou=SysAccounts,dc=example,dc=com"
  credentials=<credentials>
  tls_reqcert=allow
olcSyncrepl: {1}rid=002
  provider=ldaps://<provider2>:1636/
  searchbase="dc=example,dc=com"
  type=refreshAndPersist
  retry="60 +"
  timeout=1
  bindmethod=simple
  binddn="uid=replication,ou=SysAccounts,dc=example,dc=com"
  credentials=<credentials>
  tls_reqcert=allow
The "uid=replication,ou=SysAccounts,dc=example,dc=com" DN has full
read-only permissions for the entire "dc=example,dc=com" tree.
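One way to check whether the same simple bind works from the consumer host outside of slapd while the problem is occurring is to run `ldapwhoami` with the same URI, bind DN and relaxed certificate checking. The following is a rough sketch, assuming the OpenLDAP client tools are installed on the consumer; the helper names and the password file path are hypothetical, and the placeholders stand for the real provider host names.

```python
import os
import subprocess

# Bind DN taken from the syncrepl config above.
BIND_DN = "uid=replication,ou=SysAccounts,dc=example,dc=com"


def bind_check_command(uri, password_file):
    """Build an ldapwhoami command line for a simple bind over the given URI.

    -x selects simple authentication, -H the LDAP URI, -D the bind DN, and
    -y reads the password from a file (the complete file contents are used,
    so the file must not end with a trailing newline).
    """
    return ["ldapwhoami", "-x", "-H", uri, "-D", BIND_DN, "-y", password_file]


def run_bind_check(uri, password_file, timeout=10):
    """Run the bind check and return (exit code, stderr text)."""
    # LDAPTLS_REQCERT=allow mirrors tls_reqcert=allow in the syncrepl config.
    env = dict(os.environ, LDAPTLS_REQCERT="allow")
    result = subprocess.run(
        bind_check_command(uri, password_file),
        env=env,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode, result.stderr.strip()


# Example (placeholders, not real host names or paths):
#   run_bind_check("ldaps://<provider1>:1636/", "/path/to/replication.pw")
```

If this succeeds while slapd on the consumer keeps logging `ldap_sasl_bind_s failed (-1)`, that would point at something inside the running slapd process rather than at the network path or the credentials.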
Any idea on what might be my issue here?
Thank you,
Mircea
--
Mircea Baciu | Senior Unix Systems Administrator
Simmons University | 300 The Fenway | Boston, MA 02115 | 617-521-2194