On Tue, Nov 11, 2008 at 02:18:10PM -0800, Quanah Gibson-Mount wrote:
--On Tuesday, November 11, 2008 4:35 PM -0500 John Morrissey jwm@horde.net wrote:
Instead of slapcat(8)/slapadd(8)ding the old databases, we're removing the existing databases and allowing slapd(8) to delta-syncrepl a copy from scratch. Ironing out this use case is especially important for us since we expect to be adding a number of consumers in the coming months and would obviously prefer to bring them online without having to shut down any other slapd instances for slapcat(8)ting.
Why would you have to shut down a server to slapcat it? Hot slapcatting has been supported for a long time.
Right, slapcat's man page indicates it's always safe to run against the bdb backend, but I suspect that's referring more to read concurrency and not necessarily the generation of a consistent, point-in-time snapshot of the database.
Empirically, slapcat output does not have a consistent view of the database while dumping it. Specifically, when slapcat is running and one changes an entry that hasn't been dumped yet, that change will appear in slapcat's output. Using slapadd's -w option would definitely be unsafe in this situation.
Running slapadd without the -w option seems safe at first glance, since the suffix entry's contextCSN will be older than any CSN in the generated LDIF. Presumably any syncrepl updates that have already been "applied" by virtue of the aforementioned slapcat behavior will simply be skipped, since they would make no changes to the entry? Still, I couldn't find anything in the Administrator's Guide about this, and it feels like there's some concurrency case I'm not considering here, so I'd definitely appreciate hearing any thoughts you have on this.
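One rough way to check this on a given dump is to scan the slapcat LDIF for entries whose entryCSN sorts after the newest contextCSN in the same file. The following is only a sketch under several assumptions: the function name is made up for illustration, all CSNs in the dump use the same fixed-width format (so plain string comparison orders them), and the dn lines are neither base64-encoded nor folded across lines. It says nothing about what syncrepl will do with such entries; it only reports them.

```python
import re


def find_entries_newer_than_context(ldif_text):
    """Return (newest_contextCSN, [(dn, entryCSN), ...]).

    The list holds entries whose entryCSN sorts after the newest
    contextCSN found anywhere in the dump. Hypothetical helper, not
    part of OpenLDAP.
    """
    context_csns = []
    entries = []
    # LDIF records are separated by blank lines.
    for record in re.split(r"\n\s*\n", ldif_text.strip()):
        dn = None
        entry_csn = None
        for line in record.splitlines():
            name, sep, value = line.partition(": ")
            if not sep:
                continue  # continuation lines, "dn:: ..." etc. are ignored
            lname = name.lower()
            if lname == "dn":
                dn = value
            elif lname == "entrycsn":
                entry_csn = value
            elif lname == "contextcsn":
                context_csns.append(value)
        if dn is not None and entry_csn is not None:
            entries.append((dn, entry_csn))
    if not context_csns:
        return None, []
    # Assumes one CSN format throughout, so string order is CSN order.
    newest = max(context_csns)
    newer = [(dn, csn) for dn, csn in entries if csn > newest]
    return newest, newer


if __name__ == "__main__":
    import sys
    with open(sys.argv[1]) as fh:
        newest, newer = find_entries_newer_than_context(fh.read())
    print("newest contextCSN:", newest)
    for dn, csn in newer:
        print(csn, dn)
```

Run against the output file of a hot slapcat, any entries it prints would be ones modified after the contextCSN recorded in the dump.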
At this point, slurpd seems to start processing the accesslog.
There is no slurpd in OpenLDAP 2.4. I think you mean syncrepl?
Yes, that was a typo.
It's interesting that two consumers have successfully delta-syncrepl'd complete databases from scratch without experiencing this problem. At least four other consumer machines fail in this manner. There seems to be no rhyme or reason as to which machines succeed or fail; they're all running the same binaries and the same OS release and patches, and some are even on the same Ethernet segment as the provider. The provider slapd has been up consistently (without a crash or restart) during at least two attempts.
What patches?
I was referring to operating system patches; we're using the Debian packaging of 2.4.11 (from the upcoming lenny release) that we've updated locally for 2.4.12. Debian patches OpenLDAP fairly lightly, and none of their patches seem to get anywhere near syncrepl or other hard slapd internals.
Newer patches for 2.4.13 of interest may be:
Fixed slapd syncrepl event loss (ITS#5710)
We don't use the ppolicy overlay.
Fixed slapd syncrepl MOD of attrs with no EQ rule (ITS#5781)
AFAICT all of the attributes we're modifying have equality matching rules.
Fixed slapd syncrepl schema checking (ITS#5798)
We aren't using multimaster replication, and don't care about schemachecking on our consumers.
On the provider side:
Fixed slapo-syncprov runqueue removal (ITS#5776)
Looks like this patch addresses a case where syncprov stops sending queued responses to persistent searches, which I don't think is applicable to this particular problem?
Fixed slapo-syncprov unreplicatable ops (ITS#5709)
This might bite us in future so I'll be sure to include it locally.
However, this doesn't seem like the source of the current problem since the symptoms are that entries that should have been pulled during the initial refresh phase are not present for later syncrepl activity.
We haven't upgraded our provider to 2.4 yet, since we wanted to get some consumers upgraded first. Would running a 2.4 provider with 2.3 consumers be OK? The contextCSN format changed to add fractional seconds; will that have any adverse impact on existing 2.3 consumers that try to continue syncrepling?
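To make the format difference concrete, here is a small sketch that parses both CSN layouts into a comparable tuple. The layouts are assumptions on my part rather than something taken from the documentation: `YYYYmmddHHMMSSZ#count#sid#mod` with a two-digit sid for 2.3, and `YYYYmmddHHMMSS.uuuuuuZ#count#sid#mod` with a three-digit sid for 2.4. The names `CSN_RE` and `parse_csn` are purely illustrative, and this shows nothing about how a 2.3 slapd actually treats a 2.4-style value.

```python
import re
from datetime import datetime

# Assumed layouts (hex fields after the timestamp):
#   2.3: 20081111221810Z#000000#00#000000
#   2.4: 20081111221810.123456Z#000000#000#000000
CSN_RE = re.compile(
    r"^(\d{14})(?:\.(\d{6}))?Z"
    r"#([0-9a-fA-F]{6})#([0-9a-fA-F]{2,3})#([0-9a-fA-F]{6})$"
)


def parse_csn(csn):
    """Parse a CSN string into (timestamp, count, sid, mod)."""
    m = CSN_RE.match(csn)
    if m is None:
        raise ValueError("unrecognized CSN: %r" % (csn,))
    stamp, frac, count, sid, mod = m.groups()
    when = datetime.strptime(stamp, "%Y%m%d%H%M%S")
    # A missing fractional part (old format) is treated as zero microseconds.
    when = when.replace(microsecond=int(frac) if frac else 0)
    return (when, int(count, 16), int(sid, 16), int(mod, 16))


if __name__ == "__main__":
    old = "20081111221810Z#000000#00#000000"
    new = "20081111221810.500000Z#000000#000#000000"
    print(parse_csn(old) < parse_csn(new))
```

Comparing the parsed tuples avoids the trap of comparing mixed-format CSNs as raw strings, where `.` sorts before `Z` and a later 2.4-style value would order before an earlier 2.3-style one with the same whole-second timestamp.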
john