On Tue, Nov 11, 2008 at 02:18:10PM -0800, Quanah Gibson-Mount wrote:
> --On Tuesday, November 11, 2008 4:35 PM -0500 John Morrissey
> > Instead of slapcat(8)/slapadd(8)ding the old databases, we're removing
> > the existing databases and allowing slapd(8) to delta-syncrepl a copy
> > from scratch. Ironing out this use case is especially important for us
> > since we expect to be adding a number of consumers in the coming months
> > and would obviously prefer to bring them online without having to shut
> > down any other slapd instances for slapcat(8)ting.
>
> Why would you have to shut down a server to slapcat it? Hot slapcatting
> has been supported for a long time.

Right, slapcat's man page indicates it's always safe to run against the bdb
backend, but I suspect that's referring more to read concurrency and not
necessarily the generation of a consistent, point-in-time snapshot of the
Empirically, slapcat output does not have a consistent view of the database
while dumping it. Specifically, when slapcat is running and one changes an
entry that hasn't been dumped yet, that change will appear in slapcat's
output. Using slapadd's -w option would definitely be unsafe in this
Running slapadd without the -w option seems safe at first glance, since the
suffix entry's contextCSN will be older than the CSN of any change made while
the dump was running. Presumably any syncrepl updates that have already been
"applied" by virtue of the aforementioned slapcat behavior will simply be
skipped, since there will be no changes to the entry? Still, I couldn't find
anything in the
Administrator's Guide about this, and it feels like there's some concurrency
case I'm not considering here, so I'd definitely appreciate hearing any
thoughts you have on this.

> > At this point, slurpd seems to start processing the accesslog.
>
> There is no slurpd in OpenLDAP 2.4. I think you mean syncrepl?

Yes, that was a typo.

> > It's interesting that two consumers have successfully replicated
> > complete databases from scratch without experiencing this problem. At
> > least four other consumer machines fail in this manner. There seems to
> > be no rhyme or reason as to which machines succeed or fail; they're all
> > running the same binaries, same OS release and patches, some are even on
> > the same Ethernet segment as the provider. The provider slapd has been
> > up consistently (without a crash or restart) during at least two
I was referring to operating system patches; we're using the Debian
packaging of 2.4.11 (from the upcoming lenny release) that we've updated
locally for 2.4.12. Debian patches OpenLDAP fairly lightly, and none of
their patches seem to get anywhere near syncrepl or other hard slapd

> Newer patches for 2.4.13 of interest may be:
>
> Fixed slapd syncrepl event loss (ITS#5710)

We don't use the ppolicy overlay.

> Fixed slapd syncrepl MOD of attrs with no EQ rule (ITS#5781)

AFAICT all of the attributes we're modifying have equality matching rules.

> Fixed slapd syncrepl schema checking (ITS#5798)

We aren't using multimaster replication, and don't care about schemachecking
on our consumers.

> On the provider side:
>
> Fixed slapo-syncprov runqueue removal (ITS#5776)

Looks like this patch addresses a case where syncprov stops sending queued
responses to persistent searches, which I don't think is applicable to this

> Fixed slapo-syncprov unreplicatable ops (ITS#5709)

This might bite us in future so I'll be sure to include it locally.
However, this doesn't seem like the source of the current problem since the
symptoms are that entries that should have been pulled during the initial
refresh phase are not present for later syncrepl activity.
We haven't upgraded our provider to 2.4 yet, since we wanted to get some
consumers upgraded first. Would running a 2.4 provider with 2.3 consumers be
OK? The contextCSN format changed to add fractional seconds; will that have
any adverse impact on existing 2.3 consumers that try to continue
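
To make the format difference concrete, here is a small Python sketch. It is an illustration only and says nothing about how slapd itself handles mixed formats. It assumes a 2.3-style CSN looks like 20081111221810Z#000000#00#000000 and a 2.4-style CSN looks like 20081111221810.000000Z#000000#000#000000; both layouts, and the helper name normalize_csn, are assumptions made for this example. The helper rewrites the older layout into the newer one so that the two can be compared as plain strings.

```python
def normalize_csn(csn):
    """Rewrite a CSN into the fractional-seconds layout (illustrative only).

    Assumes four '#'-separated fields: timestamp, change count, server ID,
    and modification number. Both CSN layouts here are assumptions.
    """
    stamp, count, sid, mod = csn.split("#")
    if "." not in stamp:
        # Older layout: insert zero fractional seconds before the trailing Z.
        stamp = stamp[:-1] + ".000000Z"
    # Pad the server ID to three digits, as in the newer layout.
    return "#".join([stamp, count, sid.zfill(3), mod])


old = "20081111221810Z#000000#00#000000"
new = "20081111221810.500000Z#000000#000#000000"

# Compared as raw strings, the older-format value sorts after the newer one,
# even though it represents an earlier instant within the same second.
raw_order_is_wrong = old > new
normalized_order_is_right = normalize_csn(old) < normalize_csn(new)
```

The point of the sketch is only that values in the two layouts do not order correctly when compared byte for byte, which is one reason to be careful about mixing versions.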

John Morrissey
jwm(a)horde.net