I've been running OpenLDAP 2.3.35 for almost a year with no problems (Gentoo Linux, w/DB 4.5). I upgraded to 2.3.40 last week, and had a meltdown this morning during an account purge.
There had been some updates to the directory after upgrading, but no deletes. During my delete run, there were some index errors:
-----
[successful deletes]
Feb  3 03:50:36 derp idmgmt[3722]: error deleting user cjlindsay: DN index delete failed (LDAP)
[successful deletes]
Feb  3 03:56:55 derp idmgmt[3722]: error deleting user ddshah: DN index delete failed (LDAP)
[successful deletes]
Feb  3 04:00:10 derp idmgmt[3722]: error binding to directory: internal error (LDAP)
-----
After the index errors, LDAP stopped working. The logs on the master showed:
-----
Feb  3 03:50:36 fosse slapd[5129]: bdb(dc=csupomona,dc=edu): file dn2id.bdb has LSN 8388745/33900185, past end of log at 137/33937441
Feb  3 03:50:36 fosse slapd[5129]: bdb(dc=csupomona,dc=edu): Commonly caused by moving a database from one transactional database
Feb  3 03:50:36 fosse slapd[5129]: bdb(dc=csupomona,dc=edu): environment to another without clearing the database LSNs, or removing
Feb  3 03:50:36 fosse slapd[5129]: bdb(dc=csupomona,dc=edu): all of the log files from a database environment
Feb  3 04:00:10 fosse slapd[5129]: bdb(dc=csupomona,dc=edu): DB_ENV->log_flush: LSN of 8388745/33900185 past current end-of-log of 137/37666668
Feb  3 04:00:10 fosse slapd[5129]: bdb(dc=csupomona,dc=edu): Database environment corrupt; the wrong log files may have been removed or incompatible database files imported from another environment
Feb  3 04:00:10 fosse slapd[5129]: bdb(dc=csupomona,dc=edu): PANIC: DB_RUNRECOVERY: Fatal error, run database recovery
Feb  3 04:00:10 fosse slapd[5129]: bdb(dc=csupomona,dc=edu): dn2id.bdb: unable to flush page: 3118
Feb  3 04:00:10 fosse slapd[5129]: bdb(dc=csupomona,dc=edu): txn_checkpoint: failed to flush the buffer cache: DB_RUNRECOVERY: Fatal error, run database recovery
Feb  3 04:00:10 fosse slapd[5129]: bdb(dc=csupomona,dc=edu): PANIC: fatal region error detected; run recovery
-----
I restarted slapd, resulting in:
-----
Feb  3 06:55:53 fosse slapd[5129]: bdb_db_close: txn_checkpoint failed: DB_RUNRECOVERY: Fatal error, run database recovery (-30975)
Feb  3 06:55:53 fosse slapd[5129]: bdb(dc=csupomona,dc=edu): PANIC: fatal region error detected; run recovery
Feb  3 06:55:53 fosse slapd[5129]: bdb_db_close: close failed: DB_RUNRECOVERY: Fatal error, run database recovery (-30975)
Feb  3 06:55:53 fosse slapd[5129]: slapd stopped.
Feb  3 06:56:02 fosse slapd[11055]: @(#) $OpenLDAP: slapd 2.3.40 (Jan 31 2008 21:55:32) $ portage@fosse:/var/tmp/portage/net-nds/openldap-2.3.40-r1/work/openldap-2.3.40/servers/slapd - Server is unavailable
Feb  3 06:56:18 fosse slapd[11056]: hdb_db_open: unclean shutdown detected; attempting recovery.
Feb  3 06:56:20 fosse slapd[11056]: slapd starting
-----
I have two syncrepl slaves, both had failed at the exact same time with the exact same logs:
-----
Feb  3 04:00:11 filmore slapd[5127]: bdb(dc=csupomona,dc=edu): DB_ENV->log_flush: LSN of 8388762/37013381 past current end-of-log of 154/40862130
Feb  3 04:00:11 filmore slapd[5127]: bdb(dc=csupomona,dc=edu): Database environment corrupt; the wrong log files may have been removed or incompatible database files imported from another environment
Feb  3 04:00:11 filmore slapd[5127]: bdb(dc=csupomona,dc=edu): PANIC: DB_RUNRECOVERY: Fatal error, run database recovery
Feb  3 04:00:11 filmore slapd[5127]: bdb(dc=csupomona,dc=edu): dn2id.bdb: unable to flush page: 3118
Feb  3 04:00:11 filmore slapd[5127]: bdb(dc=csupomona,dc=edu): txn_checkpoint: failed to flush the buffer cache: DB_RUNRECOVERY: Fatal error, run database recovery
Feb  3 04:00:11 filmore slapd[5127]: bdb(dc=csupomona,dc=edu): PANIC: fatal region error detected; run recovery
Feb  3 04:00:11 filmore slapd[5127]: null_callback: error code 0x50
Feb  3 06:57:41 filmore slapd[5127]: bdb_db_close: txn_checkpoint failed: DB_RUNRECOVERY: Fatal error, run database recovery (-30975)
Feb  3 06:57:41 filmore slapd[5127]: bdb(dc=csupomona,dc=edu): PANIC: fatal region error detected; run recovery
Feb  3 06:57:41 filmore slapd[5127]: bdb_db_close: close failed: DB_RUNRECOVERY: Fatal error, run database recovery (-30975)
Feb  3 06:57:41 filmore slapd[5127]: slapd stopped.
Feb  3 06:57:49 filmore slapd[20722]: @(#) $OpenLDAP: slapd 2.3.40 (Jan 28 2008 18:22:34) $ portage@filmore:/var/tmp/portage/net-nds/openldap-2.3.40-r1/work/openldap-2.3.40/servers/slapd
Feb  3 06:57:49 filmore slapd[20723]: hdb_db_open: unclean shutdown detected; attempting recovery.
Feb  3 06:57:51 filmore slapd[20723]: slapd starting
-----
After a restart, things seemed to be working ok. I deleted one object by hand with no problem. But then on deleting the second, I received the same index failure.
Finally, I shut everything down, did a slapcat to ldif, deleted the entire DB, and then a slapadd to regenerate from scratch.
So far, so good after that. I restarted my batch delete, it's been running for 20 minutes with no errors (fingers crossed).
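For the record, the rebuild was just the standard dump/reload cycle; roughly the following (the paths and init script name here are illustrative, not our exact setup):

-----
# rough sketch of the dump/reload, with everything stopped
/etc/init.d/slapd stop
slapcat -f /etc/openldap/slapd.conf -l /tmp/full-dump.ldif
# wipe the old environment: data files, region files, any logs, alock
rm -f /var/lib/openldap-data/*.bdb /var/lib/openldap-data/__db.* \
      /var/lib/openldap-data/log.* /var/lib/openldap-data/alock
slapadd -q -f /etc/openldap/slapd.conf -l /tmp/full-dump.ldif
/etc/init.d/slapd start
-----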
What happened? It can't be local corruption, all three servers failed exactly the same way. I didn't upgrade bdb, it's the exact same version that's been running. I didn't think I needed to rebuild the db on an upgrade from 2.3.35 to 2.3.40, was I mistaken? Should I have done a dump/reload? Was this failure possibly a bug in 2.3.40? A bug in 2.3.35 exposed by the upgrade?
Any thoughts much appreciated...
Thanks much...
On Mon, 4 Feb 2008, Michael Ströder wrote:
Paul B. Henson wrote:
Feb 3 03:50:36 derp idmgmt[3722]: error deleting user cjlindsay: DN index delete failed (LDAP)
Everything right with ownership/permissions on the data files?
Yes. Nothing had changed on the server since the upgrade. There were successful deletes before this one and some successful ones after.
On Mon, 2008-02-04 at 12:07 -0800, Paul B. Henson wrote:
On Mon, 4 Feb 2008, Michael Ströder wrote:
Paul B. Henson wrote:
Feb 3 03:50:36 derp idmgmt[3722]: error deleting user cjlindsay: DN index delete failed (LDAP)
Everything right with ownership/permissions on the data files?
Yes. Nothing had changed on the server since the upgrade. There were successful deletes before this one and some successful ones after.
Was your OL 2.3.40 built with the same version of berkeley DB as the previous 2.3.35 one? (Sorry if you already answered that)
On Tue, 5 Feb 2008, Andreas Hasenack wrote:
Yes. Nothing had changed on the server since the upgrade. There were successful deletes before this one and some successful ones after.
Was your OL 2.3.40 built with the same version of berkeley DB as the previous 2.3.35 one? (Sorry if you already answered that)
Yes.
--On Sunday, February 03, 2008 8:14 AM -0800 "Paul B. Henson" henson@acm.org wrote:
What happened? It can't be local corruption, all three servers failed exactly the same way. I didn't upgrade bdb, it's the exact same version that's been running. I didn't think I needed to rebuild the db on an upgrade from 2.3.35 to 2.3.40, was I mistaken? Should I have done a dump/reload? Was this failure possibly a bug in 2.3.40? A bug in 2.3.35 exposed by the upgrade?
Did you stop slapd and run db_recover before upgrading to 2.3.40?
Was 2.3.35 using back-bdb, and you changed to back-hdb with 2.3.40?
--Quanah
--
Quanah Gibson-Mount
Principal Software Engineer
Zimbra, Inc
--------------------
Zimbra :: the leader in open source messaging and collaboration
Quanah Gibson-Mount wrote:
--On Sunday, February 03, 2008 8:14 AM -0800 "Paul B. Henson" henson@acm.org wrote:
What happened? It can't be local corruption, all three servers failed exactly the same way. I didn't upgrade bdb, it's the exact same version that's been running. I didn't think I needed to rebuild the db on an upgrade from 2.3.35 to 2.3.40, was I mistaken? Should I have done a dump/reload? Was this failure possibly a bug in 2.3.40? A bug in 2.3.35 exposed by the upgrade?
Did you stop slapd and run db_recover before upgrading to 2.3.40?
Was 2.3.35 using back-bdb, and you changed to back-hdb with 2.3.40?
No change of that sort would cause these error messages.
Feb  3 03:50:36 fosse slapd[5129]: bdb(dc=csupomona,dc=edu): file dn2id.bdb has LSN 8388745/33900185, past end of log at 137/33937441
Feb  3 03:50:36 fosse slapd[5129]: bdb(dc=csupomona,dc=edu): Commonly caused by moving a database from one transactional database
Feb  3 03:50:36 fosse slapd[5129]: bdb(dc=csupomona,dc=edu): environment to another without clearing the database LSNs, or removing
Feb  3 03:50:36 fosse slapd[5129]: bdb(dc=csupomona,dc=edu): all of the log files from a database environment
This basically says that somebody zeroed out/removed the transaction log files.
On Mon, 4 Feb 2008, Howard Chu wrote:
This basically says that somebody zeroed out/removed the transaction log files.
Nothing ever touches the transaction log files in our configuration. I have "set_flags DB_LOG_AUTOREMOVE" in the DB_CONFIG file. Transaction logs are automatically pruned as necessary by bdb itself. We do backups via slapcat.
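For context, the relevant part of our DB_CONFIG is basically just that flag plus the usual cache/log tuning; something along these lines (the sizes shown are illustrative, the set_flags line is the relevant bit):

-----
# DB_CONFIG -- sizes are illustrative; the autoremove flag is the point
set_cachesize    0 268435456 1
set_lg_regionmax 262144
set_lg_bsize     2097152
set_flags        DB_LOG_AUTOREMOVE
-----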
Possibly a bug in bdb? It's strange that it happened simultaneously on three different servers. However, the actual database files were created on one server and then copied to the others; and presumably they all executed the same changes afterwards. It does seem oddly coincidental that after almost a year of successful operation the first mass deletion after the upgrade caused a problem.
Thanks...
Paul B. Henson wrote:
On Mon, 4 Feb 2008, Howard Chu wrote:
This basically says that somebody zeroed out/removed the transaction log files.
Nothing ever touches the transaction log files in our configuration. I have "set_flags DB_LOG_AUTOREMOVE" in the DB_CONFIG file. Transaction logs are automatically pruned as necessary by bdb itself. We do backups via slapcat.
Possibly a bug in bdb? It's strange that it happened simultaneously on three different servers. However, the actual database files were created on one server and then copied to the others; and presumably they all executed the same changes afterwards. It does seem oddly coincidental that after almost a year of successful operation the first mass deletion after the upgrade caused a problem.
You cannot copy the database files from one machine to another unless you're extremely careful and follow the procedures outlined in the BerkeleyDB documentation. It sounds like you didn't follow those procedures. Your problem isn't coincidental, it's inevitable when you don't RTFM.
On Mon, 4 Feb 2008, Howard Chu wrote:
You cannot copy the database files from one machine to another unless you're extremely careful and follow the procedures outlined in the BerkeleyDB documentation. It sounds like you didn't follow those procedures. Your problem isn't coincidental, it's inevitable when you don't RTFM.
Huh?
I can't seem to find it right now, but I distinctly recall reading documentation that indicated you could either start a syncrepl slave with no database, or to jumpstart the process and make it quicker you could copy the database from the master on to it before starting.
While no longer in the 2.4 documentation, the documentation from 2.3:
http://www.openldap.org/doc/admin23/replication.html#Configuring%20slurpd%20...
Discusses copying the database from the master to the slave in a slurpd context. Other than indicating to be sure that both systems are homogenous (same hardware, same OS, same versions; which in my case was completely true), there are no dire warnings or pointers to BerkeleyDB documentation procedures.
I took another quick look at the BerkeleyDB documentation on the Oracle site and did not see anything that seemed relevant to copying databases between machines. Could I trouble you for a URL to see whether there is anything in those procedures that might have been violated?
Also, even if for some reason the copies on the two slaves were invalid, that would not explain why the master failed. The database on the master was the original database built by slapadd when the server was first put into commission. How could making a copy of it have caused it to fail itself? Additionally taking into consideration that all three worked fine for almost a year under heavy load, it just doesn't seem likely that the failure of both the original master and both slaves was caused by an improper database copy.
Paul B. Henson wrote:
On Mon, 4 Feb 2008, Howard Chu wrote:
You cannot copy the database files from one machine to another unless you're extremely careful and follow the procedures outlined in the BerkeleyDB documentation. It sounds like you didn't follow those procedures. Your problem isn't coincidental, it's inevitable when you don't RTFM.
Huh?
I can't seem to find it right now, but I distinctly recall reading documentation that indicated you could either start a syncrepl slave with no database, or to jumpstart the process and make it quicker you could copy the database from the master on to it before starting.
While no longer in the 2.4 documentation, the documentation from 2.3:
http://www.openldap.org/doc/admin23/replication.html#Configuring%20slurpd%20...
Discusses copying the database from the master to the slave in a slurpd context. Other than indicating to be sure that both systems are homogenous (same hardware, same OS, same versions; which in my case was completely true), there are no dire warnings or pointers to BerkeleyDB documentation procedures.
That documentation is clearly obsolete, which is why it was removed.
I took another quick look at the BerkeleyDB documentation on the Oracle site and did not see anything that seemed relevant to copying databases between machines. Could I trouble you for a URL to see whether there is anything in those procedures that might have been violated?
http://www.oracle.com/technology/documentation/berkeley-db/db/ref/transapp/a...
At a guess, you failed to copy the transaction log files to the slaves.
Also, even if for some reason the copies on the two slaves were invalid, that would not explain why the master failed. The database on the master was the original database built by slapadd when the server was first put into commission. How could making a copy of it have caused it to fail itself?
Too difficult to guess, given the lack of information. We have only your assurance that nothing was done incorrectly, but the facts indicate that at least one step was done incorrectly.
On Mon, 4 Feb 2008, Howard Chu wrote:
Paul B. Henson wrote:
...
I took another quick look at the BerkeleyDB documentation on the Oracle site and did not see anything that seemed relevant to copying databases between machines. Could I trouble you for a URL to see whether there is anything in those procedures that might have been violated?
http://www.oracle.com/technology/documentation/berkeley-db/db/ref/transapp/a...
At a guess, you failed to copy the transaction log files to the slaves.
Or failed to perform catastrophic recovery instead of normal recovery. (I.e, db_recover must be invoked with the -c option.)
There's another catch: if the DB_LOG_AUTOREMOVE flag is in effect, then taking a hot backup may result in a broken backup if any transaction logs were removed during the backup. If you don't know about this and check for it, you won't even know that the backup is bogus.
So, as a practical matter, if you're using DB_LOG_AUTOREMOVE, then you need to shut down before trying to back up the database at the DB level.
(To quote the DBENV->set_flags() manual: DB_LOG_AUTOREMOVE If set, Berkeley DB will automatically remove log files that are no longer needed. Automatic log file removal is likely to make catastrophic recovery impossible.
...and catastrophic recovery is necessary for hot backups.)
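For anyone following along, the documented hot-backup sequence this implies is roughly the following (directory names are just examples):

-----
# BDB hot backup: copy the data files first, then the log files, then
# run *catastrophic* recovery on the copy. If DB_LOG_AUTOREMOVE deletes
# a log file between the two copy steps, the backup is silently broken.
mkdir -p /backup/ldap
cp /var/lib/openldap-data/*.bdb /backup/ldap/
cp /var/lib/openldap-data/log.* /backup/ldap/
db_recover -c -h /backup/ldap
-----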
Philip Guenther
On Mon, 4 Feb 2008, Philip Guenther wrote:
Or failed to perform catastrophic recovery instead of normal recovery. (I.e, db_recover must be invoked with the -c option.)
Well, while I won't discount the possibility I screwed up the copy, I still think it's unlikely and also that theory doesn't explain why the master (with the original uncopied database) died as well. Also, nothing was running during the copy, so it wouldn't have been considered a hot backup? And while still possible, it seems unlikely that this problem would not manifest for 10 months, and then just after an upgrade...
So, as a practical matter, if you're using DB_LOG_AUTOREMOVE, then you need to shut down before trying to back up the database at the DB level.
We actually rely on slapcat for backup. The database files haven't been touched (other than by slapd) or copied since the initial slapadd.
On Mon, 4 Feb 2008, Howard Chu wrote:
That documentation is clearly obsolete, which is why it was removed.
slurpd is obsolete, which is why the section on slurpd was removed from the 2.4 manual. Considering OpenLDAP-2.3.39 is still marked as the stable release, I can't really see that the 2.3 documentation in its entirety is obsolete.
http://www.oracle.com/technology/documentation/berkeley-db/db/ref/transapp/a...
Ah, that is the section on backing up/restoring a database, which I suppose could also be considered the same procedure to be used for copying a database from one system to another. Given your original wording, I was looking for something more specifically geared towards copying.
At a guess, you failed to copy the transaction log files to the slaves.
If I had failed to copy the transaction log files, I don't really see that it would have worked at all; let alone for almost a year.
Reviewing the backup/restore procedure, I don't really see anything I might have missed. slapd was not running during the copy, so clearly any updates were suspended. In fact, slapd had never been run -- the copy was made immediately after the initial slapadd. There were actually no log files present. As I mentioned, I have bdb configured to automatically remove them. Presumably slapadd explicitly/implicitly check pointed upon completion and they were removed. Even if there was a log file that I didn't see, the log files were stored in the same directory as the database files, and I copied the entire directory.
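In other words, the copy was the trivial offline case -- nothing fancier than something like this (rsync and the paths are illustrative; the point is a straight copy of the whole directory while nothing was running):

-----
# slapd never started on the master; copy the entire DB directory as-is
rsync -a /var/lib/openldap-data/ slave1:/var/lib/openldap-data/
rsync -a /var/lib/openldap-data/ slave2:/var/lib/openldap-data/
-----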
Also, even if for some reason the copies on the two slaves were invalid, that would not explain why the master failed. The database on the master was the original database built by slapadd when the server was first put into commission. How could making a copy of it have caused it to fail itself?
Too difficult to guess, given the lack of information. We have only your assurance that nothing was done incorrectly, but the facts indicate that at least one step was done incorrectly.
The facts only indicate that I had a catastrophic failure. That the failure was caused by incompetence is only a hypothesis.
I do greatly appreciate your response and willingness to help; I apologize if I'm getting a bit defensive.
You do have only my assurance that I didn't screw something up. However, assuming I'm not lying, the facts are:
* openldap 2.3.35 was initially installed on three servers
* on the master server, slapadd was run to load in an existing database in ldif format
* the resultant bdb database was then copied to both slaves
* all three were put into production March 2007 and ran perfectly under a reasonably heavy load
* a week or so ago I upgraded them to 2.3.40 (stop old server, install new server, start new server -- never touching bdb or the existing database files)
* they ran fine for at least 3-4 days
* this weekend, they died horribly
Given these facts, if something was done incorrectly, it does not seem likely that it was failure to copy a transaction log file in March 2007. If the failure was my own doing, it seems more likely a byproduct of the upgrade, although I can't think of anything that I could have done wrong during that process.
At this point, I guess I'll just write it off and hope it doesn't happen again.
--On February 4, 2008 5:30:29 PM -0800 "Paul B. Henson" henson@acm.org wrote:
Reviewing the backup/restore procedure, I don't really see anything I might have missed. slapd was not running during the copy, so clearly any updates were suspended. In fact, slapd had never been run -- the copy was made immediately after the initial slapadd. There were actually no log files present. As I mentioned, I have bdb configured to automatically remove them. Presumably slapadd explicitly/implicitly check pointed upon completion and they were removed. Even if there was a log file that I didn't see, the log files were stored in the same directory as the database files, and I copied the entire directory.
slapadd always creates at least one log file that would not be removed by automatic removal. If you had no log files when you were done, then something was done wrong.
--Quanah
--
Quanah Gibson-Mount
Principal Software Engineer
Zimbra, Inc
--------------------
Zimbra :: the leader in open source messaging and collaboration
On Mon, 4 Feb 2008, Quanah Gibson-Mount wrote:
slapadd always creates at least one log file that would not be removed by automatic removal. If you had no log files when you were done, then something was done wrong.
There's not much to slapadd; I'm not sure what could have been done wrong... I did use the -q option (otherwise it takes intractably long), but there were no errors or interruptions, and the database it created worked fine for 10 months or so.
I distinctly recall there were no log files last year, and I repeated the same procedure Sunday again with no log files at the end of the slapadd. The documentation says DB_LOG_AUTOREMOVE will "automatically remove log files that are no longer needed"; if the db is checkpointed at close wouldn't that make the log file unneeded?
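As I understand it, db_archive is the way to ask BDB itself which log files it still needs; e.g. (paths illustrative):

-----
# with nothing else using the environment:
db_archive -h /var/lib/openldap-data       # prints only logs no longer needed
db_archive -l -h /var/lib/openldap-data    # prints all log files present
-----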
I ran 'strace slapadd -q < /tmp/test.ldif > /tmp/slapadd.out 2>&1'
There was no log file, and 'grep 'log.' /tmp/slapadd.out' returned nothing.
Running without the -q, the same grep shows
open("/tmp/test/log.0000000001", O_RDWR|O_CREAT, 0600) = 4
Evidently the -q option to slapadd bypasses logging?
--On February 4, 2008 7:22:09 PM -0800 "Paul B. Henson" henson@acm.org wrote:
On Mon, 4 Feb 2008, Quanah Gibson-Mount wrote:
slapadd always creates at least one log file that would not be removed by automatic removal. If you had no log files when you were done, then something was done wrong.
There's not much to slapadd; I'm not sure what could have been done wrong... I did use the -q option (otherwise it takes intractably long), but there were no errors or interruptions, and the database it created worked fine for 10 months or so.
Let me expand slightly. If you *correctly* clean up the environment after slapadd finishes, you will have one non-archivable log file:
Here's the add:

[zimbra@freelancer openldap-data]$ /opt/zimbra/openldap/sbin/slapadd -b '' -q -f /opt/zimbra/conf/slapd.conf -l /tmp/output.ldif
[zimbra@freelancer openldap-data]$ ls
DB_CONFIG  __db.002   alock   db         entryCSN.bdb   id2entry.bdb  logs      objectClass.bdb  uid.bdb               zimbraId.bdb
__db.001   accesslog  cn.bdb  dn2id.bdb  entryUUID.bdb  ldap.bak      mail.bdb  sn.bdb           zimbraDomainName.bdb
[zimbra@freelancer openldap-data]$ ls logs/
[zimbra@freelancer openldap-data]$ /opt/zimbra/sleepycat/bin/db_recover
[zimbra@freelancer openldap-data]$ ls
DB_CONFIG  accesslog  alock  cn.bdb  db  dn2id.bdb  entryCSN.bdb  entryUUID.bdb  id2entry.bdb  ldap.bak  logs  mail.bdb  objectClass.bdb  sn.bdb  uid.bdb  zimbraDomainName.bdb
[zimbra@freelancer openldap-data]$ ls logs/
log.0000000001
[zimbra@freelancer openldap-data]$ /opt/zimbra/sleepycat/bin/db_archive
[zimbra@freelancer openldap-data]$
So you can see here -- slapadd completes. Now, it only creates a partial BDB environment (that's why there are two __db.* files). After it completes, before you can copy it anywhere, you need to run db_recover to clean that environment out, which I noted quite a long time ago.
You can see that once that is done, the log file is generated, and that it contains data necessary for recovery (which is why db_archive returns nothing). Way back at the beginning of this thread, I asked if you properly ran db_recover to clean up the environment first; the answer is obviously no.
As -q notes:
-q     enable quick (fewer integrity checks) mode. Does fewer
       consistency checks on the input data, and no consistency checks
       when writing the database. Improves the load time but if any
       errors or interruptions occur the resulting database will be
       unusable.
See the bit about no consistency checks when writing the DB. You need to run the db_recover after the add to finish up.
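I.e., something along these lines after any quick-mode load (paths illustrative):

-----
# finish/clean up the environment after slapadd -q, before copying it anywhere
slapadd -q -f /etc/openldap/slapd.conf -l /tmp/output.ldif
db_recover -h /var/lib/openldap-data
db_archive -h /var/lib/openldap-data   # empty output = remaining log still needed
-----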
--Quanah
--
Quanah Gibson-Mount
Principal Software Engineer
Zimbra, Inc
--------------------
Zimbra :: the leader in open source messaging and collaboration
Quanah Gibson-Mount wrote:
--On February 4, 2008 7:22:09 PM -0800 "Paul B. Henson" henson@acm.org wrote:
On Mon, 4 Feb 2008, Quanah Gibson-Mount wrote:
slapadd always creates at least one log file that would not be removed by automatic removal. If you had no log files when you were done, then something was done wrong.
There's not much to slapadd; I'm not sure what could have been done wrong... I did use the -q option (otherwise it takes intractably long), but there were no errors or interruptions, and the database it created worked fine for 10 months or so.
So you can see here -- Slapadd completes. Now, it only creates a partial BDB environment (That's why there are two __db.* files). After it completes, before you can copy it anywhere, you need to run db_recover to clean that environment out, which I noted quite a long time ago.
No. slapd will do the necessary recovery automatically in this case.
Paul's right - assuming the slapadd went well and nothing else was done, then a binary copy of the DB directory should have worked fine on another machine.
Still the fact remains that simply upgrading the slapd version from 2.3.35 to 2.3.40 wouldn't have done anything to the transaction log files, and the only reason that BDB would complain about those missing Log Sequence Numbers is because the log files no longer matched the database files.
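If you want to poke at it the next time it happens, with slapd stopped you can cross-check the environment with the standard BDB utilities; a rough sketch (paths illustrative, and the output format varies by BDB version):

-----
# current end-of-log position for the environment
db_stat -l -h /var/lib/openldap-data | head
# structural check of the data files themselves
db_verify /var/lib/openldap-data/dn2id.bdb
db_verify /var/lib/openldap-data/id2entry.bdb
-----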
--On Monday, February 04, 2008 8:24 PM -0800 Howard Chu hyc@symas.com wrote:
Quanah Gibson-Mount wrote:
--On February 4, 2008 7:22:09 PM -0800 "Paul B. Henson" henson@acm.org wrote:
On Mon, 4 Feb 2008, Quanah Gibson-Mount wrote:
slapadd always creates at least one log file that would not be removed by automatic removal. If you had no log files when you were done, then something was done wrong.
There's not much to slapadd; I'm not sure what could have been done wrong... I did use the -q option (otherwise it takes intractably long), but there were no errors or interruptions, and the database it created worked fine for 10 months or so.
So you can see here -- Slapadd completes. Now, it only creates a partial BDB environment (That's why there are two __db.* files). After it completes, before you can copy it anywhere, you need to run db_recover to clean that environment out, which I noted quite a long time ago.
No. slapd will do the necessary recovery automatically in this case.
Paul's right - assuming the slapadd went well and nothing else was done, then a binary copy of the DB directory should have worked fine on another machine.
Hm, I dunno, I've had enough issues with slapadd -q and not doing a db_recover that I'm rather particular about using it. ;)
--Quanah
--
Quanah Gibson-Mount
Principal Software Engineer
Zimbra, Inc
--------------------
Zimbra :: the leader in open source messaging and collaboration
On Mon, 4 Feb 2008, Howard Chu wrote:
Paul's right - assuming the slapadd went well and nothing else was done, then a binary copy of the DB directory should have worked fine on another machine.
And it did, for about 10 months :). I can't imagine a problem with the initial databases wouldn't have surfaced during that interval.
Still the fact remains that simply upgrading the slapd version from 2.3.35 to 2.3.40 wouldn't have done anything to the transaction log files, and the only reason that BDB would complain about those missing Log Sequence Numbers is because the log files no longer matched the database files.
I guess something else happened during the upgrade process that corrupted the environment on all three systems. I can't think of what that might be; but the problem happening right after the upgrade seems too coincidental. I did not upgrade bdb, but I did upgrade a handful of other packages and the kernel. I'm not sure how those would have caused a problem, just one of those mysteries of life I hope won't happen again 8-/.
Thanks again for the feedback...
On Tuesday 05 February 2008 05:22:09 Paul B. Henson wrote:
There's not much to slapadd; I'm not sure what could have been done wrong... I did use the -q option (otherwise it takes intractably long), but there were no errors or interruptions, and the database it created worked fine for 10 months or so.
I distinctly recall there were no log files last year, and I repeated the same procedure Sunday again with no log files at the end of the slapadd. The documentation says DB_LOG_AUTOREMOVE will "automatically remove log files that are no longer needed"; if the db is checkpointed at close wouldn't that make the log file unneeded?
No, you would still have one log file at most times, except that ...
I ran 'strace slapadd -q < /tmp/test.ldif > /tmp/slapadd.out 2>&1'
There was no log file, and 'grep 'log.' /tmp/slapadd.out' returned nothing.
Running without the -q, the same grep shows
open("/tmp/test/log.0000000001", O_RDWR|O_CREAT, 0600) = 4
Evidently the -q option to slapadd bypasses logging?
-q disables transactions, so there are no transaction logs after a slapadd -q.
Regards, Buchan
On Mon, 4 Feb 2008, Quanah Gibson-Mount wrote:
Did you stop slapd and run db_recover before upgrading to 2.3.40?
I did not run db_recover. I stopped the existing version, installed the new version, and then simply started it. I thought slapd now performed recovery automatically as needed?
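For next time, I gather the more cautious sequence would be something like this (the init script and emerge invocation are just how it would look on our Gentoo boxes, illustratively):

-----
# quiesce and recover the environment before swapping slapd binaries
/etc/init.d/slapd stop
db_recover -h /var/lib/openldap-data
emerge -1 net-nds/openldap     # or however the new version gets installed
/etc/init.d/slapd start
-----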
Was 2.3.35 using back-bdb, and you changed to back-hdb with 2.3.40?
No, the database has always been back-hdb.