Full_Name: Bill MacAllister
Version: 2.4.46
OS: Ubuntu 16.04
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (50.247.112.108)
We have been seeing replication stall on many of our replicas. Once replication stalls, it never recovers on its own, and the slapd process on the replica needs to be restarted. As soon as slapd is restarted on a replica, it catches up to the master. Examining the contextCSN values shows that all of the stalled replicas stop at exactly the same point.
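The contextCSN comparison described above can be sketched roughly as follows. This is only an illustration, not the tooling we actually use: it assumes the contextCSN values have already been read from the suffix entry on the master and on each replica, and the host names and CSN values in the usage note are made up. The names parse_csn and stalled_replicas are hypothetical helpers.

```python
from datetime import datetime, timezone
from typing import Dict, List, NamedTuple


class CSN(NamedTuple):
    """One parsed contextCSN value; fields compare in this order."""
    timestamp: datetime
    count: int
    sid: int
    mod: int


def parse_csn(value: str) -> CSN:
    """Parse a CSN such as 20180801120000.123456Z#000000#001#000000.

    The three fields after the timestamp are hexadecimal.
    """
    ts, count, sid, mod = value.strip().split("#")
    when = datetime.strptime(ts, "%Y%m%d%H%M%S.%fZ").replace(tzinfo=timezone.utc)
    return CSN(when, int(count, 16), int(sid, 16), int(mod, 16))


def stalled_replicas(master_csn: str, replica_csns: Dict[str, str]) -> Dict[str, List[str]]:
    """Group replicas that are behind the master by the CSN they stopped at.

    Assumes a single master, so every value carries the same server ID.
    """
    master = parse_csn(master_csn)
    behind: Dict[str, List[str]] = {}
    for host, value in replica_csns.items():
        if parse_csn(value) < master:
            behind.setdefault(value, []).append(host)
    return behind
```

For example, with hypothetical values:

```python
master = "20180801120500.000000Z#000000#001#000000"
replicas = {
    "replica1": "20180801120000.123456Z#000000#001#000000",
    "replica2": "20180801120000.123456Z#000000#001#000000",
    "replica3": "20180801120500.000000Z#000000#001#000000",
}
print(stalled_replicas(master, replicas))
```

A result with a single key means every lagging replica stopped at the same CSN, which matches what we observe. A single snapshot only shows lag, so the check would need to be repeated over time to distinguish a stall from ordinary replication delay.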
Our configuration is a single master with 20 replicas. Most replicas are deployed in pairs at remote sites, generally to address latency issues at those sites. The LDAP infrastructure provides DHCP and authorization services. The replicas and the master are all running Ubuntu 16.04 with a custom-built slapd based on the 2.4.46 source. Our build of slapd starts with Ryan Tandy's PPA source and has the following changes:
- Build with OpenSSL
- ITS patches 8054, 8752, and 8727
We have been seeing this problem intermittently for months now, and it has recently gotten worse, going from once every other month to once or twice a week.
One anomaly that we have seen with the last two stalls is that only 18 of the 20 replicas stalled. The two replicas that did not stall are on the same network as the master. I think that is a red herring, since the stalling hosts are widely dispersed and have a variety of network paths.
We use GSSAPI authentication and the directory holds no passwords. We have logs and thread dumps of the most recent stall. If you would like to see them let me know where to send them.
Bill