(ITS#5463) syncrepl stops or pauses - openldap-bugs

11 Apr 2008


      Full_Name: 
Version: 2.4.8
OS: Linux 2.6.23.13
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (194.97.7.65)
We're still migrating from slurpd to syncrepl. We have a 2.3-slurpd-fed provider
under OpenLDAP 2.4.8, currently used by two OpenLDAP 2.4.8 consumers with
"refreshAndPersist" and a retry interval of "180 +" (see below).
The database is fairly large (8 million objects, 15 GB of BDB files), and the
consumers have a syncrepl-replicated database (pulled once, then cloned). There
are ongoing updates, roughly "each second".
We see two issues:
1. When a consumer (re-)connects, replication to the already running consumer
   pauses until it continues on both nodes. This could become a problem, since a
   single consumer can then be a SPOF for the whole cluster.
2. Replication may suddenly stop, and we see no reason for that. Sometimes it
   recovers by itself, sometimes it recovers only when the consumer slapd is
   stopped and restarted.
Is there a "golden path" to debug this issue?
For the logs: We have "sync" as log level.
Consumer 1 (rid 000)
[...]
Apr 10 16:15:18 0 slapd[27548]: connection_input: conn=192 deferring operation: binding
[...]
Apr 11 08:45:21 0 slapd[27548]: connection_input: conn=5182 deferring operation: binding
Apr 11 08:50:11 0 slapd[27548]: connection_input: conn=5207 deferring operation: binding
Apr 11 08:51:09 0 slapd[27548]: connection_input: conn=5212 deferring operation: binding
Apr 11 08:53:05 0 slapd[27548]: connection_input: conn=5222 deferring operation: binding
Apr 11 08:54:03 0 slapd[27548]: connection_input: conn=5228 deferring operation: binding
Consumer 2 (rid 002):
[...]
Apr 11 02:00:16 2 slapd[22272]: connection_input: conn=3151 deferring operation: binding
[...]
Apr 11 08:52:09 2 slapd[22272]: connection_input: conn=5224 deferring operation: binding
Apr 11 08:53:07 2 slapd[22272]: connection_input: conn=5229 deferring operation: binding
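To get an overview of how often these "deferring operation" messages occur per consumer, the log excerpts can be tallied with a small script. The following is only a rough sketch: the function names `unwrap` and `count_deferrals` are made up for this example, and it assumes the syslog layout shown above (timestamp, one host field, `slapd[pid]:`). It also rejoins lines that were wrapped in transit.

```python
import re
from collections import Counter

# A syslog line in the excerpts starts with e.g. "Apr 11 08:52:09 ".
TS_RE = re.compile(r"^[A-Z][a-z]{2} [ \d]\d \d\d:\d\d:\d\d ")

# Assumed layout: "<timestamp> <host> slapd[<pid>]: connection_input:
# conn=<n> deferring operation: <reason>"
DEFER_RE = re.compile(
    r"^\w{3} +\d+ [\d:]{8} (\S+) slapd\[\d+\]: "
    r"connection_input: conn=(\d+) deferring operation: *(\w+)"
)

def unwrap(lines):
    """Rejoin wrapped log lines: anything that does not start with a
    syslog timestamp is treated as a continuation of the previous line."""
    out = []
    for raw in lines:
        line = raw.rstrip("\n")
        if TS_RE.match(line) or not out:
            out.append(line)
        else:
            out[-1] += " " + line.strip()
    return out

def count_deferrals(lines):
    """Count deferral messages per (host field, reason)."""
    counts = Counter()
    for line in unwrap(lines):
        m = DEFER_RE.match(line)
        if m:
            host, _conn, reason = m.groups()
            counts[(host, reason)] += 1
    return counts

if __name__ == "__main__":
    sample = [
        "Apr 11 08:52:09 2 slapd[22272]: connection_input: conn=5224 deferring operation:",
        "binding",
        "Apr 11 08:53:07 2 slapd[22272]: connection_input: conn=5229 deferring operation:",
        "binding",
    ]
    for (host, reason), n in sorted(count_deferrals(sample).items()):
        print(host, reason, n)
```

Feeding the full consumer logs through `count_deferrals` would show whether the deferrals cluster around the times replication stalls.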
Provider (filtered for last syncprov_sendresp entries per rid)
Apr 10 16:06:43 1 slapd[27917]: syncprov_sendresp: cookie=rid=000,csn=20080410140643.069039Z#000000#000#000000
Apr 11 01:57:58 1 slapd[27917]: syncprov_sendresp: cookie=rid=002,csn=20080410235758.696324Z#000000#000#000000
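The last CSN sent to each rid can also be pulled out of the provider log mechanically, which makes it easier to see how far behind each consumer is. This is a minimal sketch, assuming each log entry has already been rejoined onto a single line; `last_csn_per_rid` is a hypothetical helper name, not part of OpenLDAP. The CSN timestamp carries a "Z" suffix, so it is treated as UTC here.

```python
import re
from datetime import datetime, timezone

# Matches the cookie as printed by syncprov_sendresp, e.g.
# cookie=rid=000,csn=20080410140643.069039Z#000000#000#000000
COOKIE_RE = re.compile(r"cookie=rid=(\d+),csn=(\d{14}\.\d{6})Z#")

def last_csn_per_rid(lines):
    """Return {rid: datetime of the newest CSN seen for that rid}."""
    latest = {}
    for line in lines:
        m = COOKIE_RE.search(line)
        if not m:
            continue
        rid, stamp = m.groups()
        # The CSN starts with YYYYmmddHHMMSS.microseconds.
        when = datetime.strptime(stamp, "%Y%m%d%H%M%S.%f")
        when = when.replace(tzinfo=timezone.utc)
        if rid not in latest or when > latest[rid]:
            latest[rid] = when
    return latest

if __name__ == "__main__":
    sample = [
        "Apr 10 16:06:43 1 slapd[27917]: syncprov_sendresp: "
        "cookie=rid=000,csn=20080410140643.069039Z#000000#000#000000",
        "Apr 11 01:57:58 1 slapd[27917]: syncprov_sendresp: "
        "cookie=rid=002,csn=20080410235758.696324Z#000000#000#000000",
    ]
    for rid, when in sorted(last_csn_per_rid(sample).items()):
        print(rid, when.isoformat())
```

Comparing these timestamps with the time of the most recent write on the provider gives a rough measure of when each consumer stopped receiving updates.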
I see no indicators of failing operations, missing resources or any other
trouble, except for the "connection_input" entries.
Any idea how to get sync replication running smoothly and reliably? Thank you
very much in advance, of course!
OK, here are configuration snippets for provider and consumer, in case they
help. To be honest, I am not quite sure what a reasonable size for
syncprov-sessionlog is, even after reading the slapo-syncprov man page. The
current value is meant as "big, but finite".
Consumer:
---------
syncrepl rid=2
   provider="ldap://provider:389"
   type=refreshAndPersist
   bindmethod=simple
   binddn=...
   credentials=...
   searchbase=...
   retry="180 +"
Provider:
---------
database        bdb
dbnosync
cachesize       200000
suffix          ...
rootdn          ...
updatedn        ...
rootpw          ...
directory       /var/ldap/db
lastmod         on
index           objectClass eq
...
index           entryCSN,entryUUID eq
checkpoint      512 15
overlay syncprov
syncprov-sessionlog 300000
database        monitor