I'm upgrading a site from OpenLDAP 2.3.42 to 2.4.12 in an attempt to alleviate http://www.openldap.org/its/index.cgi/Incoming?id=5631 (slapd crashes due to assertion failure).
For us, doing so requires dumping and recreating two back-bdb databases since the OpenLDAP 2.4.x Debian packaging is linked against a newer version of BDB. The larger database contains about a million entries.
Instead of slapcat(8)/slapadd(8)ding the old databases, we're removing the existing databases and allowing slapd(8) to delta-syncrepl a copy from scratch. Ironing out this use case is especially important for us since we expect to be adding a number of consumers in the coming months and would obviously prefer to bring them online without having to shut down any other slapd instances for slapcat(8)ting. The Administrator's Guide seems to indicate this is an accepted use case, since its guide to bringing up a new consumer involves simply configuring the consumer and starting slapd.
When the consumer slapd comes up, it enters the refresh(?) phase and begins adding entries to the fresh, empty bdb database. When it finishes, contextCSN on the suffix entry is set to 20081111135024Z#000000#00#000000 (roughly when slapd was started) and this change is visible with ldapsearch(1).
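For readers following along, a contextCSN value like the one above can be split into its fields. This is a minimal sketch, not slapd's own parser: the `CSN` type and `parse_csn` helper are hypothetical names, and treating the three trailing fields as hexadecimal (change count, server ID, modification number) is an assumption.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class CSN(NamedTuple):
    """Hypothetical container for the four '#'-separated CSN fields."""
    timestamp: datetime
    count: int
    sid: int
    mod: int


def parse_csn(value: str) -> CSN:
    """Split a CSN string into timestamp, count, SID and mod fields."""
    stamp, count, sid, mod = value.split("#")
    # Whole-second stamps (as quoted in this thread) have no fractional part;
    # stamps with fractional seconds carry a '.' before the trailing 'Z'.
    fmt = "%Y%m%d%H%M%S.%fZ" if "." in stamp else "%Y%m%d%H%M%SZ"
    when = datetime.strptime(stamp, fmt).replace(tzinfo=timezone.utc)
    # Assumption: the remaining fields are hexadecimal counters.
    return CSN(when, int(count, 16), int(sid, 16), int(mod, 16))


csn = parse_csn("20081111135024Z#000000#00#000000")
print(csn.timestamp.isoformat())  # 2008-11-11T13:50:24+00:00
```

Printing the timestamp this way makes it easy to check the value against the time slapd was started.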
At this point, slurpd seems to start processing the accesslog. The first entry references a nonexistent DN (uid=nava209,...) and the backend operation returns LDAP_NO_SUCH_OBJECT. This is interesting, since this entry was created months ago and should have been found and created during the refresh phase. ldapsearch(1)ing against the provider with the same filter used by the consumer syncrepl ('(objectclass=*)') yields this entry, so it doesn't appear to be index corruption on the provider.
At this point, several hundred subsequent search entries are discarded, possibly because the be_modify operation failed?
After some time, slapd continues processing entries and does so successfully until it encounters another error (a modrdn that returns LDAP_ALREADY_EXISTS since the accesslog entry that modrdn'd the existing object out of the way was ignored by the consumer). After a while, slapd starts processing the same batch of modifications again, and repeats until the retry counter is exhausted. contextCSN on the suffix entry is never updated during this process, based on debugging output and ldapsearch(1).
It's interesting that two consumers have successfully delta-syncrepl'd complete databases from scratch without experiencing this problem. At least four other consumer machines fail in this manner. There seems to be no rhyme or reason as to which machines succeed or fail; they're all running the same binaries, same OS release and patches, and some are even on the same Ethernet segment as the provider. The provider slapd has been up consistently (without a crash or restart) during at least two attempts.
Syncrepl (level 16384) debug output, sans ~400Mbytes of entry processing during the refresh phase, is at:
http://horde.net/~jwm/slapd-syncrepl-debug
john
--On Tuesday, November 11, 2008 4:35 PM -0500 John Morrissey jwm@horde.net wrote:
Instead of slapcat(8)/slapadd(8)ding the old databases, we're removing the existing databases and allowing slapd(8) to delta-syncrepl a copy from scratch. Ironing out this use case is especially important for us since we expect to be adding a number of consumers in the coming months and would obviously prefer to bring them online without having to shut down any other slapd instances for slapcat(8)ting.
Why would you have to shut down a server to slapcat it? Hot slapcatting has been supported for a long time.
At this point, slurpd seems to start processing the accesslog.
There is no slurpd in OpenLDAP 2.4. I think you mean syncrepl?
It's interesting that two consumers have successfully delta-syncrepl'd complete databases from scratch without experiencing this problem. At least four other consumer machines fail in this manner. There seems to be no rhyme or reason as to which machines succeed or fail; they're all running the same binaries, same OS release and patches, and some are even on the same Ethernet segment as the provider. The provider slapd has been up consistently (without a crash or restart) during at least two attempts.
What patches?
Newer patches for 2.4.13 of interest may be:
Fixed slapd syncrepl event loss (ITS#5710)
Fixed slapd syncrepl MOD of attrs with no EQ rule (ITS#5781)
Fixed slapd syncrepl schema checking (ITS#5798)
On the provider side:
Fixed slapo-syncprov runqueue removal (ITS#5776)
Fixed slapo-syncprov unreplicatable ops (ITS#5709)
--Quanah
--
Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc -------------------- Zimbra :: the leader in open source messaging and collaboration
On Tue, Nov 11, 2008 at 02:18:10PM -0800, Quanah Gibson-Mount wrote:
--On Tuesday, November 11, 2008 4:35 PM -0500 John Morrissey jwm@horde.net wrote:
Instead of slapcat(8)/slapadd(8)ding the old databases, we're removing the existing databases and allowing slapd(8) to delta-syncrepl a copy from scratch. Ironing out this use case is especially important for us since we expect to be adding a number of consumers in the coming months and would obviously prefer to bring them online without having to shut down any other slapd instances for slapcat(8)ting.
Why would you have to shut down a server to slapcat it? Hot slapcatting has been supported for a long time.
Right, slapcat's man page indicates it's always safe to run against the bdb backend, but I suspect that's referring more to read concurrency and not necessarily the generation of a consistent, point-in-time snapshot of the database.
Empirically, slapcat output does not have a consistent view of the database while dumping it. Specifically, when slapcat is running and one changes an entry that hasn't been dumped yet, that change will appear in slapcat's output. Using slapadd's -w option would definitely be unsafe in this situation.
Running slapadd without the -w option seems safe at first glance, since the suffix entry's contextCSN will be older than any CSN in the generated LDIF. It seems that any syncrepl updates that have already been "applied" by virtue of the aforementioned slapcat behavior will simply be skipped since there will be no changes to the entry? Still, I couldn't find anything in the Administrator's Guide about this, and it feels like there's some concurrency case I'm not considering here, so I'd definitely appreciate hearing any thoughts you have on this.
At this point, slurpd seems to start processing the accesslog.
There is no slurpd in OpenLDAP 2.4. I think you mean syncrepl?
Yes, that was a typo.
It's interesting that two consumers have successfully delta-syncrepl'd complete databases from scratch without experiencing this problem. At least four other consumer machines fail in this manner. There seems to be no rhyme or reason as to which machines succeed or fail; they're all running the same binaries, same OS release and patches, and some are even on the same Ethernet segment as the provider. The provider slapd has been up consistently (without a crash or restart) during at least two attempts.
What patches?
I was referring to operating system patches; we're using the Debian packaging of 2.4.11 (from the upcoming lenny release) that we've updated locally for 2.4.12. Debian patches OpenLDAP fairly lightly, and none of their patches seem to get anywhere near syncrepl or other hard slapd internals.
Newer patches for 2.4.13 of interest may be:
Fixed slapd syncrepl event loss (ITS#5710)
We don't use the ppolicy overlay.
Fixed slapd syncrepl MOD of attrs with no EQ rule (ITS#5781)
AFAICT all of the attributes we're modifying have equality matching rules.
Fixed slapd syncrepl schema checking (ITS#5798)
We aren't using multimaster replication, and don't care about schemachecking on our consumers.
On the provider side:
Fixed slapo-syncprov runqueue removal (ITS#5776)
Looks like this patch addresses a case where syncprov stops sending queued responses to persistent searches, which I don't think is applicable to this particular problem?
Fixed slapo-syncprov unreplicatable ops (ITS#5709)
This might bite us in future so I'll be sure to include it locally.
However, this doesn't seem like the source of the current problem since the symptoms are that entries that should have been pulled during the initial refresh phase are not present for later syncrepl activity.
We haven't upgraded our provider to 2.4 yet, since we wanted to get some consumers upgraded first. Would running a 2.4 provider with 2.3 consumers be OK? The contextCSN format changed to add fractional seconds; will that have any adverse impact on existing 2.3 consumers that try to continue syncrepling?
john
John Morrissey wrote:
On Tue, Nov 11, 2008 at 02:18:10PM -0800, Quanah Gibson-Mount wrote:
--On Tuesday, November 11, 2008 4:35 PM -0500 John Morrissey jwm@horde.net wrote:
Instead of slapcat(8)/slapadd(8)ding the old databases, we're removing the existing databases and allowing slapd(8) to delta-syncrepl a copy from scratch. Ironing out this use case is especially important for us since we expect to be adding a number of consumers in the coming months and would obviously prefer to bring them online without having to shut down any other slapd instances for slapcat(8)ting.
Why would you have to shut down a server to slapcat it? Hot slapcatting has been supported for a long time.
Right, slapcat's man page indicates it's always safe to run against the bdb backend, but I suspect that's referring more to read concurrency and not necessarily the generation of a consistent, point-in-time snapshot of the database.
Consistency is preserved at entry level. If consistency is needed across several related entries, this cannot be guaranteed with pure LDAPv3.
Ciao, Michael.
Michael Ströder wrote:
John Morrissey wrote:
On Tue, Nov 11, 2008 at 02:18:10PM -0800, Quanah Gibson-Mount wrote:
--On Tuesday, November 11, 2008 4:35 PM -0500 John Morrissey jwm@horde.net wrote:
Instead of slapcat(8)/slapadd(8)ding the old databases, we're removing the existing databases and allowing slapd(8) to delta-syncrepl a copy from scratch. Ironing out this use case is especially important for us since we expect to be adding a number of consumers in the coming months and would obviously prefer to bring them online without having to shut down any other slapd instances for slapcat(8)ting.
Why would you have to shut down a server to slapcat it? Hot slapcatting has been supported for a long time.
Right, slapcat's man page indicates it's always safe to run against the bdb backend, but I suspect that's referring more to read concurrency and not necessarily the generation of a consistent, point-in-time snapshot of the database.
Consistency is preserved at entry level. If consistency is needed across several related entries, this cannot be guaranteed with pure LDAPv3.
That's not relevant here. Syncrepl guarantees eventual convergence.
John Morrissey wrote:
On Tue, Nov 11, 2008 at 02:18:10PM -0800, Quanah Gibson-Mount wrote:
--On Tuesday, November 11, 2008 4:35 PM -0500 John Morrissey jwm@horde.net wrote:
Instead of slapcat(8)/slapadd(8)ding the old databases, we're removing the existing databases and allowing slapd(8) to delta-syncrepl a copy from scratch. Ironing out this use case is especially important for us since we expect to be adding a number of consumers in the coming months and would obviously prefer to bring them online without having to shut down any other slapd instances for slapcat(8)ting.
Why would you have to shut down a server to slapcat it? Hot slapcatting has been supported for a long time.
Right, slapcat's man page indicates it's always safe to run against the bdb backend, but I suspect that's referring more to read concurrency and not necessarily the generation of a consistent, point-in-time snapshot of the database.
Empirically, slapcat output does not have a consistent view of the database while dumping it. Specifically, when slapcat is running and one changes an entry that hasn't been dumped yet, that change will appear in slapcat's output. Using slapadd's -w option would definitely be unsafe in this situation.
The -w option is only meant to be used when your LDIF does not already contain contextCSN values. There's no need to use it otherwise, and as you note, it would be detrimental in this case.
Running slapadd without the -w option seems safe at first glance, since the suffix entry's contextCSN will be older than any CSN in the generated LDIF. It seems that any syncrepl updates that have already been "applied" by virtue of the aforementioned slapcat behavior will simply be skipped since there will be no changes to the entry? Still, I couldn't find anything in the Administrator's Guide about this, and it feels like there's some concurrency case I'm not considering here, so I'd definitely appreciate hearing any thoughts you have on this.
slapcat/slapadd (no -w), let syncrepl figure out the rest.
We haven't upgraded our provider to 2.4 yet, since we wanted to get some consumers upgraded first. Would running a 2.4 provider with 2.3 consumers be OK? The contextCSN format changed to add fractional seconds; will that have any adverse impact on existing 2.3 consumers that try to continue syncrepling?
A recent enough 2.3 will accept 2.4-format CSNs.
--On November 12, 2008 1:38:48 PM -0800 Howard Chu hyc@symas.com wrote:
Empirically, slapcat output does not have a consistent view of the database while dumping it. Specifically, when slapcat is running and one changes an entry that hasn't been dumped yet, that change will appear in slapcat's output. Using slapadd's -w option would definitely be unsafe in this situation.
The -w option is only meant to be used when your LDIF does not already contain contextCSN values. There's no need to use it otherwise, and as you note, it would be detrimental in this case.
As a side note to your question about slapcat and its state, I'll note I filed a bug about this specifically for delta-syncrepl, which was addressed in the 2.3 series (2.3.42). The point is that delta-syncrepl has been adjusted to accommodate the fact that the slapcat from a hot server may not be entirely consistent between start and finish.
http://www.openldap.org/its/index.cgi/?findid=5378
--Quanah
On Wed, Nov 12, 2008 at 01:38:48PM -0800, Howard Chu wrote:
John Morrissey wrote:
We haven't upgraded our provider to 2.4 yet, since we wanted to get some consumers upgraded first. Would running a 2.4 provider with 2.3 consumers be OK? The contextCSN format changed to add fractional seconds; will that have any adverse impact on existing 2.3 consumers that try to continue syncrepling?
A recent enough 2.3 will accept 2.4-format CSNs.
2.3.42 should be recent enough, correct?
john
On Tue, Nov 11, 2008 at 02:18:10PM -0800, Quanah Gibson-Mount wrote:
Newer patches for 2.4.13 of interest may be:
Fixed slapd syncrepl event loss (ITS#5710)
Fixed slapd syncrepl MOD of attrs with no EQ rule (ITS#5781)
Fixed slapd syncrepl schema checking (ITS#5798)
Something in post-2.4.12 RE24 has fixed this delta-syncrepl problem, since I've initialized five consumers from scratch and all have been consistent when finished.
I'm now looking at upgrading our provider. There shouldn't be any problem with the CSN format change, since the new CSN format compares as greater than the old format and therefore existing consumers should continue as normal?
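One way to sanity-check the ordering question is to compare the strings directly. The sketch below rests on two assumptions that this thread does not confirm: that the newer format looks like the hypothetical values shown (fractional seconds plus a three-digit server ID), and that ordering is a plain byte-wise string comparison. Whether slapd actually compares mixed-format CSNs this way is not established here. `csn_time_key` is a hypothetical helper, not an OpenLDAP function.

```python
def csn_time_key(csn: str) -> str:
    """Return a timestamp key that lines up across both CSN styles."""
    stamp = csn.split("#", 1)[0].rstrip("Z")
    whole, _, frac = stamp.partition(".")
    # Pad a missing fraction so whole-second and fractional stamps align.
    return whole + "." + frac.ljust(6, "0")


old_style = "20081111135024Z#000000#00#000000"                  # value quoted in this thread
new_same_second = "20081111135024.000000Z#000000#000#000000"    # hypothetical value
new_later = "20081112090000.000000Z#000000#000#000000"          # hypothetical value

# A later timestamp wins on the leading digits, whatever the format.
print(new_later > old_style)          # True
# Within the same second, '.' sorts before 'Z', so the new style is NOT greater.
print(new_same_second > old_style)    # False
# Comparing on a padded timestamp key removes the format dependence.
print(csn_time_key(new_same_second) == csn_time_key(old_style))  # True
```

Under these assumptions, a new-format CSN compares as greater only because its timestamp is later, not because of the format itself.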
john
openldap-software@openldap.org