(ITS#6320) Replication Issues - openldap-bugs

30 Sep 2009


Full_Name: Jeff Doyle
Version: 2.4.17
OS: Debian/Lenny
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (68.15.14.98)
Greetings,
I am faced with a persistent replication issue for which I am seeking guidance.
First, my infrastructure:
My organization is running Debian/Lenny.  HOWEVER, we have backported a later
release of Debian's ''slapd'' package to be more in line with current OpenLDAP
releases (the actual ''slapd'' release in Lenny, 2.4.11, is too far behind,
given some of the posts I've read on your site).  We are running
2.4.17.
I am absolutely stumped by intermittent replication inconsistencies that have me
running in circles.  Records added to a provider server are sometimes not
replicated to all consumers, and we can find no pattern to when this happens.
It can occur with a single record, or many.  Because of the severity of this
issue, I have written a pretty gnarly script to check the content of each
server and, when differences are found, send out an alert.  This script is run
via cron.
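The check script itself is not included in this report.  As a rough
illustration only, the comparison it performs might be sketched as below,
assuming each server's content has already been dumped to LDIF text (for
example with ldapsearch) and that comparing the sets of entry DNs is enough to
spot a missing record.  The function names are hypothetical, and lowercasing
DNs is a simplification rather than true DN matching.

```python
import base64


def dns_from_ldif(ldif_text):
    """Return the set of entry DNs found in LDIF text, lowercased for comparison."""
    # Unfold LDIF continuation lines: a line starting with one space
    # continues the previous line.
    unfolded = ldif_text.replace("\r\n", "\n").replace("\n ", "")
    dns = set()
    for line in unfolded.split("\n"):
        if line.startswith("dn:: "):
            # Base64-encoded DN (used when the DN is not plain ASCII).
            dns.add(base64.b64decode(line[5:]).decode("utf-8").strip().lower())
        elif line.startswith("dn: "):
            dns.add(line[4:].strip().lower())
    return dns


def diff_servers(provider_ldif, consumer_ldif):
    """Report DNs present on only one of the two servers."""
    provider = dns_from_ldif(provider_ldif)
    consumer = dns_from_ldif(consumer_ldif)
    return {
        "missing_on_consumer": sorted(provider - consumer),
        "extra_on_consumer": sorted(consumer - provider),
    }
```

A non-empty "missing_on_consumer" list is the condition that would trigger the
alert email.  This sketch does not compare attribute values, so a modification
that failed to replicate would go unnoticed by it.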
I simply cannot determine why our replication fails when it does.
Here is the SyncRepl stanza from one of the "consumer" LDAP servers in question:
syncrepl rid=001
  provider=ldaps://10.9.8.3:636/
  bindmethod=simple
  binddn="uid=syncrepl,dc=example,dc=com"
  credentials=password
  scope=sub
  filter="(objectClass=*)"
  schemachecking=off
  searchbase="dc=example,dc=com"
  retry="120 +"
  sizelimit=unlimited
  timeout=1
  type=refreshOnly
  interval=00:00:07:00
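To make the timing in this stanza explicit: per slapd.conf(5), "interval" is
written as dd:hh:mm:ss, and "retry" is a list of <retry interval> <# of
retries> pairs in which "+" means retry indefinitely.  The small sketch below
decodes both values; the helper names are hypothetical and exist only for
illustration.

```python
def interval_to_seconds(interval):
    """Convert a syncrepl interval in dd:hh:mm:ss form to seconds."""
    days, hours, minutes, seconds = (int(part) for part in interval.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def retry_schedule(retry):
    """Expand a syncrepl retry string into (wait_seconds, attempts) pairs.

    An attempts value of None stands for "+", i.e. retry indefinitely.
    """
    tokens = retry.split()
    if len(tokens) % 2 != 0:
        raise ValueError("retry must be pairs of <interval> <count>")
    schedule = []
    for wait, count in zip(tokens[0::2], tokens[1::2]):
        schedule.append((int(wait), None if count == "+" else int(count)))
    return schedule


print(interval_to_seconds("00:00:07:00"))  # 420 seconds, i.e. 7 minutes
print(retry_schedule("120 +"))             # [(120, None)]
```

So the consumer above polls the provider every 7 minutes and, after a failed
attempt, retries every 120 seconds without giving up.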
BE ADVISED, I have tried variations of this stanza involving:
  * Changing from refreshOnly to refreshAndPersist (and subsequently removing
the 'interval' parameter, as it is not needed with refreshAndPersist)
  * Trying longer and shorter intervals, when still using refreshOnly
  * Trying repeated "retry" integers both with and without the "forever (+)"
parameter
  * Using NO encryption, and also using StartTLS over 389 as opposed to
LDAPS/636
  * With and without the "filter" parameter
  * With and without the schemachecking parameter
  * With and without the timeout parameter
  * Tried the 'network-timeout' parameter with a variety of integers (after
reading the man page)
  * Using SASL/Kerberos for the bind method instead of simple binds
  * Using the ROOTDN for the user instead of our dedicated SyncRepl user (who
has confirmed 100% read access to the DIT and ALL of its OCs and attributes)
Assuming we just built this slave (with the stanza above) in our production
environment, starting slapd with any mix of debugging modes shows normal
syncrepl, ber, acl-processing and other activities in progress.  Everything
seems fine.  Our entire DIT is replicated in full, with no inconsistencies.
Network conditions between all hosts are relatively calm.  No policy issues or
routing issues exist. LDAP search operations from every possible angle have
been tested. The network is internally wide-open to itself.
Everything looks fine on both the provider and consumer.  We can add records to
a Provider server, and the records (or changes to existing records, such as
deletions) are replicated just fine to the consumer.  We can try this for any
length of time without errors, using a variety of 'delays' between operations,
bind users, etc.  All conceivable permutations end in success.
OK, now ... wait any specified amount of time.  A day or two, maybe a week
(during which all of the servers are doing mostly READ operations). I receive an
alert via email:  Records have stopped replicating to one or more hosts.  We
don't understand why.
The prescribed fix for it is to delete the slave's database and let it rebuild.
But calling that solution a "Hack" doesn't quite tell you how I really feel
about it.
Help.  Please.  This is not production-grade behavior, I think you would agree.
If you want more of my slapd.conf or other details, just ask, and I will
promptly supply them.
Thanks
Jeff