Full_Name: Jeff Doyle
Version: 2.4.17
OS: Debian/Lenny
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (68.15.14.98)
Greetings,
I am faced with a persistent replication issue for which I am seeking guidance.
First, my infrastructure:
My organization is running Debian/Lenny. However, we have backported a later Debian release of ''slapd'' to be more in line with current OpenLDAP releases, because the version of ''slapd'' that ships with Lenny (2.4.11) is too far behind, given some of the posts I've read on your site. We are running 2.4.17.
I am absolutely stumped with random replication inconsistencies that have me running in circles. We're having random issues involving records added to a provider server not being replicated to all consumers. This occurs on a very random basis. It can occur with a single record, or many. Because of the severity of this issue, I have written a pretty gnarly script to check the content of each server and when differences are found, send out an alert. This script is run via cron.
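To give an idea of the kind of check involved (this is not the actual script, only a minimal Python sketch), assume the contents of each server have already been dumped into a mapping of DN to entryCSN. The function name `compare_servers` and the sample DNs and CSN values are hypothetical, and DNs are assumed to be normalized the same way on both sides.

```python
def compare_servers(provider, consumer):
    """Compare two {dn: entryCSN} mappings and report the differences.

    Assumption: both mappings use identically normalized DNs, so a plain
    string comparison is enough to match entries.
    """
    # Entries present on the provider but never replicated to the consumer.
    missing = sorted(dn for dn in provider if dn not in consumer)
    # Entries still on the consumer that no longer exist on the provider.
    extra = sorted(dn for dn in consumer if dn not in provider)
    # Entries present on both sides whose entryCSN values differ.
    stale = sorted(
        dn for dn in provider
        if dn in consumer and provider[dn] != consumer[dn]
    )
    return {"missing": missing, "extra": extra, "stale": stale}


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    provider = {
        "uid=alice,dc=example,dc=com": "csn-1",
        "uid=bob,dc=example,dc=com": "csn-2",
    }
    consumer = {
        "uid=alice,dc=example,dc=com": "csn-1",
    }
    diff = compare_servers(provider, consumer)
    if any(diff.values()):
        # A cron job could mail this report instead of printing it.
        print("Replication differences found:", diff)
```

Run from cron, a non-empty result for any consumer is what triggers the alert.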
I simply cannot determine why our replication fails when it does.
Here is a SyncRepl stanza from a "consumer" LDAP server in question:
syncrepl rid=001
    provider=ldaps://10.9.8.3:636/
    bindmethod=simple
    binddn="uid=syncrepl,dc=example,dc=com"
    credentials=password
    scope=sub
    filter="(objectClass=*)"
    schemachecking=off
    searchbase="dc=example,dc=com"
    retry="120 +"
    sizelimit=unlimited
    timeout=1
    type=refreshOnly
    interval=00:00:07:00
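To spell out the timing values in that stanza: per slapd.conf(5), `interval` is written as dd:hh:mm:ss, and `retry` is a list of "retry interval, number of retries" pairs where `+` means retry indefinitely. The short Python sketch below only decodes those two strings; the helper names `interval_to_seconds` and `parse_retry` are hypothetical and are not part of OpenLDAP.

```python
def interval_to_seconds(interval):
    """Convert a syncrepl interval in dd:hh:mm:ss form to seconds."""
    days, hours, minutes, seconds = (int(part) for part in interval.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def parse_retry(retry):
    """Parse a syncrepl retry string into (wait_seconds, count) pairs.

    A count of None stands for "+", i.e. retry indefinitely.
    """
    tokens = retry.split()
    if len(tokens) % 2 != 0:
        raise ValueError("retry must be a list of <interval> <count> pairs")
    pairs = []
    for wait, count in zip(tokens[0::2], tokens[1::2]):
        pairs.append((int(wait), None if count == "+" else int(count)))
    return pairs


if __name__ == "__main__":
    print(interval_to_seconds("00:00:07:00"))  # 420 seconds, i.e. 7 minutes
    print(parse_retry("120 +"))                # [(120, None)]
```

In other words, this consumer polls the provider every 7 minutes and, after a failure, retries every 120 seconds without giving up.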
BE ADVISED, I have tried variations of this stanza involving:
* Changing from refreshOnly to refreshAndPersist (and subsequently removing the 'interval' parameter, as it is not needed with refreshAndPersist)
* Trying longer and shorter intervals, when still using refreshOnly
* Trying repeated "retry" integers both with and without the "forever (+)" parameter
* Using NO encryption, and also using StartTLS over 389 as opposed to LDAPS/636
* With and without the "filter" parameter
* With and without the schemachecking parameter
* With and without the timeout parameter
* Tried the 'network-timeout' parameter with a variety of integers (after reading the man page)
* Using SASL/Kerberos for the bind method instead of simple binds
* Using the ROOTDN for the user instead of our dedicated SyncRepl user (who has Confirmed 100% read-access to the DIT and ALL of its OCs and Attributes)
Assuming we have just built this slave (with the stanza above) in our production environment, starting slapd with any mix of debugging modes shows normal syncrepl, ber, acl-processing and other activities in progress. Everything seems fine. Our entire DIT is replicated in full, with no inconsistencies.
Network conditions between all hosts are relatively calm. No policy issues or routing issues exist. LDAP search operations have been tested from every possible angle. The network is internally wide-open to itself.
Everything looks fine on both the provider and consumer. We can add records to a provider server, and the records (or changes to existing records, such as deletions) are replicated just fine to the consumer. We can try this for any length of time without errors, using a variety of delays between operations, bind users, etc. All conceivable permutations end in success.
OK, now ... wait any specified amount of time. A day or two, maybe a week (during which all of the servers are doing mostly READ operations). I receive an alert via email: Records have stopped replicating to one or more hosts. We don't understand why.
The prescribed fix is to delete the slave's database and let it rebuild. But calling that solution a "hack" doesn't quite tell you how I really feel about it.
Help. Please. This is not production-grade behavior, I think you would agree.
If you want more of my slapd.conf or other details, just ask, and I will promptly supply them.
Thanks
Jeff