(ITS#8951) Replication stalls

2 Jan 2019


Full_Name: Bill MacAllister
Version: 2.4.46
OS: Ubuntu 16.04
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (50.247.112.108)
We have been seeing replication stall on many of our replicas.  Once replication
stalls, it never recovers on its own, and the slapd process on the replica needs
to be restarted.  As soon as slapd is restarted on a replica, it catches up to
the master.  By examining the ContextCSNs, it appears that all of the replicas
stall at exactly the same point.
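
A minimal sketch of the kind of ContextCSN comparison described above, assuming
the contextCSN values have already been read from the master and from each
replica.  The host names, the sample values, and the 300-second threshold are
hypothetical and only illustrate the idea; this is not the tooling we actually
run.

```python
from datetime import datetime, timezone


def parse_csn(csn):
    """Split a contextCSN value into (timestamp, count, sid, mod).

    Assumes the OpenLDAP 2.4 layout: YYYYmmddHHMMSS.uuuuuuZ#count#sid#mod,
    where count, sid, and mod are hexadecimal fields.
    """
    stamp, count, sid, mod = csn.split("#")
    when = datetime.strptime(stamp, "%Y%m%d%H%M%S.%fZ").replace(tzinfo=timezone.utc)
    return when, int(count, 16), int(sid, 16), int(mod, 16)


def find_stalled(master_csn, replica_csns, max_lag_seconds=300):
    """Return {host: lag_in_seconds} for replicas behind the master.

    max_lag_seconds is a hypothetical threshold, not an OpenLDAP setting.
    """
    master_time = parse_csn(master_csn)[0]
    stalled = {}
    for host, csn in replica_csns.items():
        lag = (master_time - parse_csn(csn)[0]).total_seconds()
        if lag > max_lag_seconds:
            stalled[host] = lag
    return stalled


def stalled_at_same_point(replica_csns, stalled_hosts):
    """True if every stalled replica reports an identical contextCSN."""
    return len({replica_csns[host] for host in stalled_hosts}) <= 1


# Hypothetical sample data: one replica is current, two stopped an hour earlier.
master = "20190102183015.000000Z#000000#001#000000"
replicas = {
    "replica-a": "20190102183015.000000Z#000000#001#000000",
    "replica-b": "20190102173015.000000Z#000000#001#000000",
    "replica-c": "20190102173015.000000Z#000000#001#000000",
}
stalled = find_stalled(master, replicas)
same_point = stalled_at_same_point(replicas, stalled)
```

With the sample data, replica-b and replica-c are reported as stalled and share
the same contextCSN, which is the pattern we see during a stall.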
Our configuration is a single master with 20 replicas.  Most replicas are
deployed in pairs at remote sites, generally to address latency issues at those
sites.  The LDAP infrastructure provides DHCP and authorization services.  The
replicas and the master are all running Ubuntu 16.04 with a custom-built slapd
using the 2.4.46 source.  Our build of slapd starts with Ryan Tandy's PPA source
and has the following changes:
- Build with OpenSSL
- ITS patches 8054, 8752, and 8727
We have been seeing this problem intermittently for months now, and it has
recently gotten worse, going from once every other month to once or twice a
week.
One anomaly that we have seen with the last two stalls is that only 18 of the 20
replicas stalled. The replicas that did not stall are in the same network as the
master.  I think that is a red herring since the stalling hosts are widely
dispersed and have a variety of network paths.
We use GSSAPI authentication and the directory holds no passwords.  We have logs
and thread dumps of the most recent stall.  If you would like to see them, let
me know where to send them.
Bill
