Thank you, Howard! I'll give that a try if the keepalive option Quanah mentioned does not fix the issue.

Mircea
--
Mircea Baciu | Senior Unix Systems Administrator
Simmons University | 300 The Fenway | Boston, MA 02115 | 617-521-2194


On Mon, Sep 20, 2021 at 12:02 PM Howard Chu <hyc@symas.com> wrote:
Mircea Baciu wrote:
> Hi,
>
> I have an issue where replication on a consumer starts failing and does not recover until OpenLDAP is restarted.
>
> My setup consists of a pair of on-prem MirrorMode replicated providers (only one is active at a given time using a virtual IP managed by Keepalived), and one
> off-site (AWS) consumer. The providers use a dedicated port (LDAPS on 1636) for their own replication, as well as for the consumer to connect to them, so the
> consumer has access to both servers, regardless of where the providers' virtual IP is residing.
>
> All the connections happen over LDAPS, and the syncrepl configs have the tls_reqcert=allow option.
>
> The providers are always in sync and I'm able to make either one the "active" one with ease. The consumer does the initial sync and stays in
> sync for a while, but I often (almost daily) find it out of sync. I see error messages on both the consumer and provider side:

Sounds like an issue in the TLS layer. You should increase the debug level on both provider and consumer to see
if there are any TLS-specific error messages being generated. If you have cn=monitor configured you can set the
debuglevel using ldapmodify, so no need to restart the servers for it to take effect. That'll let you see the
problem as it's occurring.
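
As a rough illustration of the suggestion above: the quoted consumer config uses olcSyncrepl, so the servers appear to be running with cn=config, and one way to change the log level at runtime is to modify olcLogLevel there. This is only a sketch under that assumption. The chosen levels (stats, sync, conns) and the ldapi:/// EXTERNAL bind in the note below are assumptions, not something stated in this thread.

```ldif
# Sketch: raise slapd's log level at runtime, no restart needed.
# Assumes the cn=config backend is in use and writable by the bind identity.
dn: cn=config
changetype: modify
replace: olcLogLevel
# stats = connection/operation logging, sync = syncrepl processing,
# conns = connection management; adjust as needed.
olcLogLevel: stats sync conns
```

Saved as, say, loglevel.ldif (a hypothetical file name), this could be applied on each server with something like "ldapmodify -Y EXTERNAL -H ldapi:/// -f loglevel.ldif", provided the local root identity is permitted to write to cn=config. The same modify with the previous value restores the original level once the failure has been captured.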
>
> On the consumer (every minute):
> Sep 20 08:19:31 <consumer> slapd[1440]: slap_client_connect: URI=ldaps://<provider1>:1636/ DN="uid=replication,ou=sysaccounts,dc=example,dc=com"
> ldap_sasl_bind_s failed (-1)
> Sep 20 08:19:31 <consumer> slapd[1440]: do_syncrepl: rid=001 rc -1 retrying
> Sep 20 08:19:31 <consumer> slapd[1440]: slap_client_connect: URI=ldaps://<provider2>:1636/ DN="uid=replication,ou=sysaccounts,dc=example,dc=com"
> ldap_sasl_bind_s failed (-1)
> Sep 20 08:19:31 <consumer> slapd[1440]: do_syncrepl: rid=002 rc -1 retrying
> Sep 20 08:20:31 <consumer> slapd[1440]: slap_client_connect: URI=ldaps://<provider1>:1636/ DN="uid=replication,ou=sysaccounts,dc=example,dc=com"
> ldap_sasl_bind_s failed (-1)
> Sep 20 08:20:31 <consumer> slapd[1440]: do_syncrepl: rid=001 rc -1 retrying
> Sep 20 08:20:31 <consumer> slapd[1440]: slap_client_connect: URI=ldaps://<provider2>:1636/ DN="uid=replication,ou=sysaccounts,dc=example,dc=com"
> ldap_sasl_bind_s failed (-1)
> Sep 20 08:20:31 <consumer> slapd[1440]: do_syncrepl: rid=002 rc -1 retrying
>
> On the provider (every minute):
> Sep 20 08:19:31 <provider1> slapd[1057]: conn=11242 fd=14 ACCEPT from IP=<consumer IP>:45438 (IP=<provider1 IP>:1636)
> Sep 20 08:19:31 <provider1> slapd[1057]: conn=11242 fd=14 TLS established tls_ssf=256 ssf=256
> Sep 20 08:19:31 <provider1> slapd[1057]: conn=11242 fd=14 closed (connection lost)
> Sep 20 08:20:31 <provider1> slapd[1057]: conn=11243 fd=14 ACCEPT from IP=<consumer IP>:45458 (IP=<provider1 IP>:1636)
> Sep 20 08:20:31 <provider1> slapd[1057]: conn=11243 fd=14 TLS established tls_ssf=256 ssf=256
> Sep 20 08:20:31 <provider1> slapd[1057]: conn=11243 fd=14 closed (connection lost)
>
> Sep 20 08:19:31 <provider2> slapd[1051]: conn=215893 fd=18 ACCEPT from IP=<consumer IP>:41706 (IP=<provider2 IP>:1636)
> Sep 20 08:19:31 <provider2> slapd[1051]: conn=215893 fd=18 TLS established tls_ssf=256 ssf=256
> Sep 20 08:19:31 <provider2> slapd[1051]: conn=215893 fd=18 closed (connection lost)
> Sep 20 08:20:31 <provider2> slapd[1051]: conn=215898 fd=18 ACCEPT from IP=<consumer IP>:41726 (IP=<provider2 IP>:1636)
> Sep 20 08:20:31 <provider2> slapd[1051]: conn=215898 fd=18 TLS established tls_ssf=256 ssf=256
> Sep 20 08:20:31 <provider2> slapd[1051]: conn=215898 fd=18 closed (connection lost)
>
> There must be something wrong on the consumer side since when the issue starts, the consumer is not able to connect to either provider.
>
> Once I restart the consumer, it quickly resyncs and works just fine, for a while.
>
> The providers are OpenLDAP 2.4.44 (openldap-2.4.44-24.el7_9.x86_64), running on RHEL 7.
> The consumer is OpenLDAP 2.4.44 (openldap-2.4.44-24.el7_9.x86_64), running on CentOS 7.
>
> The consumer syncrepl config is:
> olcSyncrepl: {0}rid=001
>   provider=ldaps://<provider1>:1636/
>   searchbase="dc=example,dc=com"
>   type=refreshAndPersist
>   retry="60 +"
>   timeout=1
>   bindmethod=simple
>   binddn="uid=replication,ou=SysAccounts,dc=example,dc=com"
>   credentials=<credentials>
>   tls_reqcert=allow
> olcSyncrepl: {1}rid=002
>   provider=ldaps://<provider2>:1636/
>   searchbase="dc=example,dc=com"
>   type=refreshAndPersist
>   retry="60 +"
>   timeout=1
>   bindmethod=simple
>   binddn="uid=replication,ou=SysAccounts,dc=example,dc=com"
>   credentials=<credentials>
>   tls_reqcert=allow
>
> The "uid=replication,ou=SysAccounts,dc=example,dc=com" DN has full read-only permissions for the entire "dc=example,dc=com" tree.
>
> Any idea on what might be my issue here?
>
> Thank you,
> Mircea
> --
> Mircea Baciu | Senior Unix Systems Administrator
> Simmons University | 300 The Fenway | Boston, MA 02115 | 617-521-2194


--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/