Hi,
I have an issue with a consumer whose replication starts failing and keeps failing until OpenLDAP is restarted.
My setup consists of a pair of on-prem MirrorMode replicated providers (only one is active at a given time using a virtual IP managed by Keepalived), and one off-site (AWS) consumer. The providers use a dedicated port (LDAPS on 1636) for their own replication, as well as for the consumer to connect to them, so the consumer has access to both servers, regardless of where the providers' virtual IP is residing.
All the connections happen over LDAPS, and the syncrepl configs have the tls_reqcert=allow option.
The providers are always in sync and I'm able to make either one the "active" one with ease. The consumer does the initial sync and stays in sync for a while, but I often (almost daily) find it out of sync. I see error messages on both the consumer and provider side:
On the consumer (every minute):
Sep 20 08:19:31 <consumer> slapd[1440]: slap_client_connect: URI=ldaps://<provider1>:1636/ DN="uid=replication,ou=sysaccounts,dc=example,dc=com" ldap_sasl_bind_s failed (-1)
Sep 20 08:19:31 <consumer> slapd[1440]: do_syncrepl: rid=001 rc -1 retrying
Sep 20 08:19:31 <consumer> slapd[1440]: slap_client_connect: URI=ldaps://<provider2>:1636/ DN="uid=replication,ou=sysaccounts,dc=example,dc=com" ldap_sasl_bind_s failed (-1)
Sep 20 08:19:31 <consumer> slapd[1440]: do_syncrepl: rid=002 rc -1 retrying
Sep 20 08:20:31 <consumer> slapd[1440]: slap_client_connect: URI=ldaps://<provider1>:1636/ DN="uid=replication,ou=sysaccounts,dc=example,dc=com" ldap_sasl_bind_s failed (-1)
Sep 20 08:20:31 <consumer> slapd[1440]: do_syncrepl: rid=001 rc -1 retrying
Sep 20 08:20:31 <consumer> slapd[1440]: slap_client_connect: URI=ldaps://<provider2>:1636/ DN="uid=replication,ou=sysaccounts,dc=example,dc=com" ldap_sasl_bind_s failed (-1)
Sep 20 08:20:31 <consumer> slapd[1440]: do_syncrepl: rid=002 rc -1 retrying
On the provider (every minute):
Sep 20 08:19:31 <provider1> slapd[1057]: conn=11242 fd=14 ACCEPT from IP=<consumer IP>:45438 (IP=<provider1 IP>:1636)
Sep 20 08:19:31 <provider1> slapd[1057]: conn=11242 fd=14 TLS established tls_ssf=256 ssf=256
Sep 20 08:19:31 <provider1> slapd[1057]: conn=11242 fd=14 closed (connection lost)
Sep 20 08:20:31 <provider1> slapd[1057]: conn=11243 fd=14 ACCEPT from IP=<consumer IP>:45458 (IP=<provider1 IP>:1636)
Sep 20 08:20:31 <provider1> slapd[1057]: conn=11243 fd=14 TLS established tls_ssf=256 ssf=256
Sep 20 08:20:31 <provider1> slapd[1057]: conn=11243 fd=14 closed (connection lost)

Sep 20 08:19:31 <provider2> slapd[1051]: conn=215893 fd=18 ACCEPT from IP=<consumer IP>:41706 (IP=<provider2 IP>:1636)
Sep 20 08:19:31 <provider2> slapd[1051]: conn=215893 fd=18 TLS established tls_ssf=256 ssf=256
Sep 20 08:19:31 <provider2> slapd[1051]: conn=215893 fd=18 closed (connection lost)
Sep 20 08:20:31 <provider2> slapd[1051]: conn=215898 fd=18 ACCEPT from IP=<consumer IP>:41726 (IP=<provider2 IP>:1636)
Sep 20 08:20:31 <provider2> slapd[1051]: conn=215898 fd=18 TLS established tls_ssf=256 ssf=256
Sep 20 08:20:31 <provider2> slapd[1051]: conn=215898 fd=18 closed (connection lost)
There must be something wrong on the consumer side since when the issue starts, the consumer is not able to connect to either provider.
Once I restart the consumer, it quickly resyncs and works just fine, for a while.
The providers are OpenLDAP 2.4.44 (openldap-2.4.44-24.el7_9.x86_64), running on RHEL 7. The consumer is OpenLDAP 2.4.44 (openldap-2.4.44-24.el7_9.x86_64), running on CentOS 7.
The consumer syncrepl config is:
olcSyncrepl: {0}rid=001
  provider=ldaps://<provider1>:1636/
  searchbase="dc=example,dc=com"
  type=refreshAndPersist
  retry="60 +"
  timeout=1
  bindmethod=simple
  binddn="uid=replication,ou=SysAccounts,dc=example,dc=com"
  credentials=<credentials>
  tls_reqcert=allow
olcSyncrepl: {1}rid=002
  provider=ldaps://<provider1>:1636/
  searchbase="dc=example,dc=com"
  type=refreshAndPersist
  retry="60 +"
  timeout=1
  bindmethod=simple
  binddn="uid=replication,ou=SysAccounts,dc=example,dc=com"
  credentials=<credentials>
  tls_reqcert=allow
The "uid=replication,ou=SysAccounts,dc=example,dc=com" DN has full read-only permissions for the entire "dc=example,dc=com" tree.
Any idea on what might be my issue here?
Thank you,
Mircea
--
Mircea Baciu | Senior Unix Systems Administrator
Simmons University | 300 The Fenway | Boston, MA 02115 | 617-521-2194
--On Monday, September 20, 2021 11:38 AM -0400 Mircea Baciu mircea.baciu@simmons.edu wrote:
The providers are OpenLDAP 2.4.44 (openldap-2.4.44-24.el7_9.x86_64), running on RHEL 7. The consumer is OpenLDAP 2.4.44 (openldap-2.4.44-24.el7_9.x86_64), running on CentOS 7.
Hello,
The OpenLDAP 2.4.44 release is over 5 years old, and numerous replication-related issues have been fixed since then. Additionally, Red Hat is known to have made questionable modifications to libldap, particularly around the TLS layer in RHEL7.
I'd strongly advise you to upgrade to a current release of OpenLDAP. I would note that Symas provides free drop-in replacement builds of OpenLDAP for RHEL7 with optional support available (https://repo.symas.com/sofl/rhel7/).
Symas also provides free builds of the current OpenLDAP release series (2.5) with optional support available (https://repo.symas.com/soldap/rhel7/).
I'd also note that your syncrepl stanza is missing the "keepalive" option, which is usually essential when dealing with traffic through load balancers.
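For illustration, here is a sketch of how the first syncrepl value from the original post might look with the option added. The parameter has the form keepalive=<idle>:<probes>:<interval> (seconds idle before probing, number of probes, seconds between probes). The 240:10:30 values are placeholders assumed for this example, not a recommendation from this thread; they would need to be tuned to whatever idle timeout exists on the path between the consumer and the providers, and it is worth checking slapd-config(5) for your build. The rid=002 value would get the same addition.

```ldif
# Sketch only: everything except the last line is copied from the original post.
# The keepalive values (240:10:30) are illustrative assumptions.
olcSyncrepl: {0}rid=001
  provider=ldaps://<provider1>:1636/
  searchbase="dc=example,dc=com"
  type=refreshAndPersist
  retry="60 +"
  timeout=1
  bindmethod=simple
  binddn="uid=replication,ou=SysAccounts,dc=example,dc=com"
  credentials=<credentials>
  tls_reqcert=allow
  keepalive=240:10:30
```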
Regards, Quanah
--
Quanah Gibson-Mount
Product Architect
Symas Corporation
Packaged, certified, and supported LDAP solutions powered by OpenLDAP:
http://www.symas.com
Mircea Baciu wrote:
Hi,
I have an issue with a consumer replication starting to fail until OpenLDAP is restarted.
My setup consists of a pair of on-prem MirrorMode replicated providers (only one is active at a given time using a virtual IP managed by Keepalived), and one off-site (AWS) consumer. The providers use a dedicated port (LDAPS on 1636) for their own replication, as well as for the consumer to connect to them, so the consumer has access to both servers, regardless of where the providers' virtual IP is residing.
All the connections happen over LDAPS, and the syncrepl configs have the tls_reqcert=allow option.
The providers are always in sync and I'm able to make either one the "active" one with ease. The consumer does the initial sync and stays in sync for a while, but I often (almost daily) find it out of sync. I see error messages on both the consumer and provider side:
Sounds like an issue in the TLS layer. You should increase the debug level on both provider and consumer to see if there are any TLS-specific error messages being generated. If you have cn=monitor configured you can set the debuglevel using ldapmodify, so no need to restart the servers for it to take effect. That'll let you see the problem as it's occurring.
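As a rough sketch of changing the log level at runtime: the example below goes through cn=config rather than cn=monitor, which is a different route from the one described above, chosen here only because the original post already shows a cn=config setup (olcSyncrepl). The keyword list is an assumption for illustration; pick whichever levels surface the TLS messages on your build.

```ldif
# Sketch only. Assumes cn=config is writable by the identity you bind as.
# The level keywords are an illustrative choice, not taken from this thread.
dn: cn=config
changetype: modify
replace: olcLogLevel
olcLogLevel: stats sync conns
```

This could be applied with something like `ldapmodify -Y EXTERNAL -H ldapi:/// -f loglevel.ldif` (the file name is hypothetical, and this assumes ldapi with EXTERNAL authentication is enabled). Note the previous olcLogLevel value first so it can be restored once the failure has been captured.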
Thank you, Howard! I'll give that a try in case the keepalive option Quanah mentioned does not fix the issue.
Mircea
--
Mircea Baciu | Senior Unix Systems Administrator
Simmons University | 300 The Fenway | Boston, MA 02115 | 617-521-2194
On Mon, Sep 20, 2021 at 12:02 PM Howard Chu hyc@symas.com wrote:
Sounds like an issue in the TLS layer. You should increase the debug level on both provider and consumer to see if there are any TLS-specific error messages being generated. If you have cn=monitor configured you can set the debuglevel using ldapmodify, so no need to restart the servers for it to take effect. That'll let you see the problem as it's occurring.
--
Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
openldap-technical@openldap.org