Hi,

I am trying to set up OpenLDAP 2-way multi-master replication for high availability (HA), but the HA setup freezes very often. I have automated the installation and configuration steps with Ansible, and I also tried the same steps manually before using Ansible for the actual deployment.

The environment used for installation and configuration is as follows.

Two nodes, ldap-test1 and ldap-test2, both running RHEL 7 as the base OS.

LDAP RPMs installed:

openldap-clients-2.4.39-6.el7.x86_64
openldap-servers-2.4.39-6.el7.x86_64
openldap-2.4.39-6.el7.x86_64

Configuration steps:

1. Add the cosine and nis schemas
ldapadd -Y EXTERNAL -H ldapi:/// -D "cn=config" -f cosine.ldif
ldapadd -Y EXTERNAL -H ldapi:/// -D "cn=config" -f nis.ldif

2. Send the new global configuration settings to slapd
ldapadd -Y EXTERNAL -H ldapi:/// -f /etc/openldap/global_config.ldif

# cat global_config.ldif

dn: cn=module{0},cn=config
objectClass: olcModuleList
cn: module{0}
olcModuleLoad: syncprov

dn: olcOverlay=syncprov,olcDatabase={0}config,cn=config
changetype: add
objectClass: olcOverlayConfig
objectClass: olcSyncProvConfig
olcOverlay: syncprov

dn: olcDatabase={2}hdb,cn=config
changetype: modify
replace: olcSuffix
olcSuffix: dc=example,dc=com

dn: olcDatabase={2}hdb,cn=config
changetype: modify
replace: olcRootDN
olcRootDN: cn=Manager,dc=example,dc=com

dn: olcDatabase={2}hdb,cn=config
changetype: modify
replace: olcRootPW
olcRootPW: {SSHA}vAp3OToPGMYnWEkh+76RJEVyfCIdnsDg

dn: cn=config
changetype: modify
replace: olcTLSCACertificateFile
olcTLSCACertificateFile: /etc/openldap/certs/cacert.pem

dn: cn=config
changetype: modify
replace: olcTLSCertificateFile
olcTLSCertificateFile: /etc/openldap/certs/slapdcert.pem

dn: cn=config
changetype: modify
replace: olcTLSCertificateKeyFile
olcTLSCertificateKeyFile: /etc/openldap/certs/slapdkey.pem

dn: cn=config
changetype: modify
replace: olcLogLevel
olcLogLevel: -1

dn: olcDatabase={1}monitor,cn=config
changetype: modify
replace: olcAccess
olcAccess: {0}to * by dn.base="gidNumber=0+uidNumber=0,cn=peercred,cn=external,cn=auth" read by dn.base="cn=Manager,dc=example,dc=com" read by * none

dn: cn=config
changetype: modify
add: olcServerID
olcServerID: 1

dn: olcDatabase={0}config,cn=config
changetype: modify
add: olcRootDN
olcRootDN: cn=admin,cn=config

dn: cn=config
changetype: modify

dn: olcDatabase={0}config,cn=config
changetype: modify
replace: olcRootPW
olcRootPW: {SSHA}vAp3OToPGMYnWEkh+76RJEVyfCIdnsDg

dn: cn=config
changetype: modify
replace: olcServerID
olcServerID: 1 ldaps://ldap-test1
olcServerID: 2 ldaps://ldap-test2
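
One way to check that each node actually picked up its intended olcServerID is to look at the server-ID field of a CSN generated on that node (for example, the entryCSN of an entry written there). In OpenLDAP 2.4 it is the third '#'-separated field. The helper below is only a sketch; the function name csn_sid and the sample CSN value are made up for illustration.

```shell
#!/bin/sh
# Print the server-ID field (third '#'-separated field) of a CSN string.
# csn_sid is a hypothetical helper name, not part of OpenLDAP.
csn_sid() {
    printf '%s\n' "$1" | cut -d'#' -f3
}

# Example with a made-up CSN value:
#   csn_sid '20150515222017.123456Z#000000#002#000000'
```

A write made on ldap-test2 would be expected to carry the ID configured for that node. A value of 000 would suggest that the node did not match any of the olcServerID URLs.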

3. Load base.ldif
ldapadd -x -w redhat7 -D cn=Manager,dc=example,dc=com -f /etc/openldap/base.ldif

# cat base.ldif
dn: dc=example,dc=com
dc: example
objectClass: top
objectClass: domain

dn: ou=People,dc=example,dc=com
ou: People
objectClass: top
objectClass: organizationalUnit

dn: ou=Group,dc=example,dc=com
ou: Group
objectClass: top
objectClass: organizationalUnit


4. Load hdb_config.ldif
ldapadd -Y EXTERNAL -H ldapi:/// -f /etc/openldap/hdb_config.ldif

# cat hdb_config.ldif

dn: olcDatabase={2}hdb,cn=config
changetype: modify
replace: olcSuffix
olcSuffix: dc=example,dc=com

dn: olcDatabase={2}hdb,cn=config
changetype: modify
replace: olcRootDN
olcRootDN: cn=Manager,dc=example,dc=com

dn: olcDatabase={2}hdb,cn=config
changetype: modify
replace: olcRootPW
olcRootPW: {{ ldap_root_password.stdout }}

dn: olcDatabase={2}hdb,cn=config
changetype: modify
replace: olcDbIndex
olcDbIndex: entryCSN eq
olcDbIndex: entryUUID eq

5. Load replication.ldif
ldapadd -Y EXTERNAL -H ldapi:/// -f /etc/openldap/replication.ldif

# cat replication.ldif

dn: olcOverlay=syncprov,olcDatabase={2}hdb,cn=config
changetype: add
objectClass: olcOverlayConfig
objectClass: olcSyncProvConfig
olcOverlay: syncprov

dn: olcDatabase={2}hdb,cn=config
changetype: modify
replace: olcSyncRepl
olcSyncRepl: rid=101 provider=ldaps://ldap-test1 binddn="cn=Manager,dc=example,dc=com" bindmethod=simple credentials=redhat7 searchbase="dc=example,dc=com" type=refreshAndPersist interval=00:00:00:10 retry="5 5 300 5" timeout=1
olcSyncRepl: rid=102 provider=ldaps://ldap-test2 binddn="cn=Manager,dc=example,dc=com" bindmethod=simple credentials=redhat7 searchbase="dc=example,dc=com" type=refreshAndPersist interval=00:00:00:10 retry="5 5 300 5" timeout=1
-
replace: olcMirrorMode
olcMirrorMode: TRUE

dn: olcDatabase={0}config,cn=config
changetype: modify
replace: olcSyncRepl
olcSyncRepl: rid=101 provider=ldaps://ldap-test1 binddn="cn=admin,cn=config" bindmethod=simple credentials=redhat7 searchbase="cn=config" type=refreshAndPersist retry="5 5 300 5" timeout=1
olcSyncRepl: rid=102 provider=ldaps://ldap-test2 binddn="cn=admin,cn=config" bindmethod=simple credentials=redhat7 searchbase="cn=config" type=refreshAndPersist retry="5 5 300 5" timeout=1
-
replace: olcMirrorMode
olcMirrorMode: TRUE


Steps 1 to 4 above are executed in parallel on both nodes (except that step 4 is applied on one node only, as noted below). Only step 5, replication.ldif, is executed serially, one node after the other,
because running step 5 in parallel on both nodes causes the setup to freeze.

Parallel execution on both nodes:
1. Add the cosine and nis schemas
2. Send the new global configuration settings to slapd
3. Load base.ldif

4. Load hdb_config.ldif: executed on only one of the two nodes, on the assumption that its content will be replicated automatically to the other node once replication is established

Serial execution (on both nodes, one after the other):
5. Load replication.ldif
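
The serial part of the procedure could be sketched roughly as follows. This is not the Ansible playbook used in the deployment. The function names, root SSH access to the nodes, and the readiness probe (an anonymous read of the root DSE over ldaps) are all assumptions made for illustration.

```shell
#!/bin/sh
# Retry a command until it succeeds: wait_until <tries> <delay_seconds> <command...>
wait_until() {
    tries=$1; delay=$2; shift 2
    while [ "$tries" -gt 0 ]; do
        "$@" && return 0
        tries=$((tries - 1))
        [ "$tries" -gt 0 ] && sleep "$delay"
    done
    return 1
}

# Hypothetical readiness probe: the node answers a root DSE search over ldaps.
node_ready() {
    ldapsearch -x -LLL -H "ldaps://$1" -s base -b "" namingContexts >/dev/null 2>&1
}

# Apply replication.ldif on one node at a time and wait for that node to
# answer again before moving on to the next one (assumes root SSH access).
apply_replication_serially() {
    for node in "$@"; do
        ssh "root@$node" ldapadd -Y EXTERNAL -H ldapi:/// \
            -f /etc/openldap/replication.ldif || return 1
        wait_until 12 5 node_ready "$node" || return 1
    done
}

# Example (not run here):
#   apply_replication_serially ldap-test1 ldap-test2
```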

Sometimes LDAP replication causes the setup to freeze. It may hang during the deployment, after the deployment, or while executing basic LDAP operations such as ldapadd, ldapmodify, or ldapdelete.

1. First, is there a specific order we need to follow to avoid the freeze, or to keep the ldapadd command from hanging, when the two nodes are configured in parallel?

2. Is anything wrong with the configuration attributes used? If so, which attributes do I need to add or update to keep the command or service from hanging during configuration or after the deployment?

3. To verify high availability, I stop the LDAP service on one of the two nodes and send LDAP requests to the other node, but sometimes restarting the service does not bring the two nodes back in sync. To verify replication, I check the number of connections established between the two servers (minimum 4: 2 for config replication and 2 for database replication).
   I also run a unit test that verifies database replication by creating an LDAP user on one node and then running search and delete operations on the other node.
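
A different check from the connection count and the add/search/delete test is to compare the contextCSN values that both nodes report for the suffix. The sketch below assumes that an anonymous simple bind is allowed to read contextCSN on dc=example,dc=com, and the function names are made up for illustration.

```shell
#!/bin/sh
# Print the contextCSN values of one node, sorted, one per line.
# Arguments: LDAP URI, suffix. Assumes anonymous read access to contextCSN.
get_context_csn() {
    ldapsearch -x -LLL -H "$1" -s base -b "$2" contextCSN |
        sed -n 's/^contextCSN: //p' | sort
}

# Succeed only if both CSN lists are non-empty and identical.
csn_in_sync() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# Example (not run here):
#   a=$(get_context_csn ldaps://ldap-test1 dc=example,dc=com)
#   b=$(get_context_csn ldaps://ldap-test2 dc=example,dc=com)
#   csn_in_sync "$a" "$b" && echo "in sync" || echo "NOT in sync"
```

Because writes in flight make the values differ briefly, a check like this is most meaningful when no changes are being made, or when it is retried a few times.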

4. Whenever a basic LDAP command (add, modify, etc.) hangs on a node, I restart the slapd service on that node. Is this the right approach?
The HA setup works fine when the above steps are configured in the given order: the LDAP services restart and are in sync. But sometimes the setup freezes when executing a basic ldapadd command on one of the nodes.
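
Whatever the cause of the hang, a deployment script can at least avoid blocking forever by putting a time limit on each LDAP command. The sketch below uses the coreutils `timeout` command. The wrapper name run_with_deadline and the 30-second limit are arbitrary choices for illustration.

```shell
#!/bin/sh
# Run a command with a time limit: run_with_deadline <seconds> <command...>
# coreutils `timeout` exits with status 124 when the limit is reached.
run_with_deadline() {
    limit=$1; shift
    timeout "$limit" "$@"
}

# Example (not run here):
#   run_with_deadline 30 ldapadd -Y EXTERNAL -H ldapi:/// \
#       -f /etc/openldap/replication.ldif
#   [ $? -eq 124 ] && echo "ldapadd timed out" >&2
```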
 
There are no specific error messages in the logs that explain the hang.
May 15 22:20:17 ldap-test2 slapd[8538]: daemon: epoll: listen=7 active_threads=0 tvp=zero
May 15 22:20:17 ldap-test2 slapd[8538]: daemon: epoll: listen=8 active_threads=0 tvp=zero

 
A single restart does not always bring the servers back in sync and available;
sometimes there are only 3 TCP connections instead of the required 4.
 
[root@ldap-test1 openldap]# netstat -a | grep ldaps
tcp        0      0 ldap-test1:ldaps 0.0.0.0:*               LISTEN
tcp        0      0 ldap-test1:34854 ldap-test2:ldaps ESTABLISHED
tcp        0      0 ldap-test1:ldaps ldap-test2:48493 ESTABLISHED
tcp        0      0 ldap-test1:34856 ldap-test2:ldaps ESTABLISHED

I have to restart a couple of times until 4 TCP connections are established and the servers are replicating properly.
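
The connection count described above could be automated along these lines. This is a sketch that parses `netstat -a` style text on stdin. The function name is made up, and the pattern accepts either the service name ldaps or the numeric port 636, in case netstat is run with -n.

```shell
#!/bin/sh
# Count ESTABLISHED TCP connections whose local or remote port is ldaps (636).
# Reads `netstat -a` style output on stdin; hypothetical helper name.
count_ldaps_established() {
    awk '$1 ~ /^tcp/ && $NF == "ESTABLISHED" &&
         ($4 ~ /:(ldaps|636)$/ || $5 ~ /:(ldaps|636)$/) { n++ }
         END { print n + 0 }'
}

# Example (not run here):
#   netstat -a | count_ldaps_established
```

For the netstat output shown above this prints 3, which matches the case where one of the expected 4 connections is missing.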

5. Can someone please help me resolve this?
I didn't find any clues in the list archives for similar problems.

Thanks & Regards,
Shashi