On Mon, 4 Feb 2008, Howard Chu wrote:
> That documentation is clearly obsolete, which is why it was removed.
slurpd is obsolete, which is why the section on slurpd was removed from the 2.4 manual. Considering OpenLDAP-2.3.39 is still marked as the stable release, I can't really see that the 2.3 documentation in its entirety is obsolete.
> http://www.oracle.com/technology/documentation/berkeley-db/db/ref/transapp/a...
Ah, that is the section on backing up/restoring a database, which I suppose could also be considered the same procedure to be used for copying a database from one system to another. Given your original wording, I was looking for something more specifically geared towards copying.
> At a guess, you failed to copy the transaction log files to the slaves.
If I had failed to copy the transaction log files, I don't really see that it would have worked at all; let alone for almost a year.
Reviewing the backup/restore procedure, I don't really see anything I might have missed. slapd was not running during the copy, so clearly any updates were suspended. In fact, slapd had never been run -- the copy was made immediately after the initial slapadd. There were actually no log files present. As I mentioned, I have bdb configured to automatically remove them. Presumably slapadd explicitly or implicitly checkpointed upon completion and they were removed. Even if there was a log file that I didn't see, the log files were stored in the same directory as the database files, and I copied the entire directory.
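To make the copy step concrete, here is a minimal sketch of a whole-directory copy that also reports any transaction logs it carried along. It is an illustration only, not the exact commands that were run: the function name copy_bdb_dir and the example paths are hypothetical, and it assumes slapd is stopped and that Berkeley DB log files use the default log.NNNNNNNNNN naming in the same directory as the database files.

```shell
#!/bin/sh
# Hypothetical helper (not from the original procedure): copy a stopped
# slapd's BDB directory wholesale and confirm that any transaction logs
# present in the source also exist in the destination.
copy_bdb_dir() {
    src=$1
    dst=$2

    [ -d "$src" ] || { echo "no such directory: $src" >&2; return 1; }
    mkdir -p "$dst" || return 1

    # Copy everything in the directory: database files, DB_CONFIG,
    # and any log.* files, preserving modes and timestamps.
    cp -pR "$src"/. "$dst"/ || return 1

    # Berkeley DB names its transaction logs log.NNNNNNNNNN by default.
    for f in "$src"/log.*; do
        [ -e "$f" ] || continue          # glob matched nothing: no logs
        name=$(basename "$f")
        [ -f "$dst/$name" ] || { echo "missing log: $name" >&2; return 1; }
        echo "copied log: $name"
    done
    return 0
}
```

A call such as `copy_bdb_dir /var/lib/ldap /mnt/slave1/ldap` (both paths made up) prints nothing when no log files exist, which would match the situation described above where the logs had already been removed.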
Also, even if for some reason the copies on the two slaves were invalid, that would not explain why the master failed. The database on the master was the original database built by slapadd when the server was first put into commission. How could making a copy of it have caused it to fail itself?
> Too difficult to guess, given the lack of information. We have only your assurance that nothing was done incorrectly, but the facts indicate that at least one step was done incorrectly.
The facts only indicate that I had a catastrophic failure. That the failure was caused by incompetence is only a hypothesis.
I do greatly appreciate your response and willingness to help; I apologize if I'm getting a bit defensive.
You do have only my assurance that I didn't screw something up. However, assuming I'm not lying, the facts are:
* openldap 2.3.35 was initially installed on three servers
* on the master server, slapadd was run to load in an existing database in ldif format
* the resultant bdb database was then copied to both slaves
* all three were put into production March 2007 and ran perfectly under a reasonably heavy load
* a week or so ago I upgraded them to 2.3.40 (stop old server, install new server, start new server -- never touching bdb or the existing database files)
* they ran fine for at least 3-4 days
* this weekend, they died horribly
Given these facts, if something was done incorrectly, it does not seem likely that it was failure to copy a transaction log file in March 2007. If the failure was my own doing, it seems more likely a byproduct of the upgrade, although I can't think of anything that I could have done wrong during that process.
At this point, I guess I'll just write it off and hope it doesn't happen again.