On Fri, 10 Feb 2012, Buchan Milne wrote:
> On Friday, 10 February 2012 01:48:45 Quanah Gibson-Mount wrote:
> ...
>> I thought I was very clear on that in my last email. It is not sufficient. You need to stop slapd and run *db_recover*, which is more exhaustive than db_checkpoint, if you want to go the route of backing up the BDB db.
>
> If you checkpoint, and you back up all the database files (including transaction log files) in the correct order, you should not need to db_recover.
If that's all that you require at backup time, then in order to guarantee correctness *at restore time* you have to perform "catastrophic" recovery (à la db_recover -c) on the restored database before trying to use it. That's necessary whenever a checkpoint occurs between when you start copying .db files and when you copy the last transaction log file.
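As a rough illustration of the restore side, here is a minimal sketch. The function name restore_env, its arguments, and the "normal" mode switch are hypothetical; it assumes the backup is a flat directory holding the .db files and transaction logs, and it defaults to catastrophic recovery unless the caller knows the backup was taken without an intervening checkpoint.

```shell
#!/bin/sh
# Hypothetical helper: restore a flat backup directory and run recovery.
# Usage: restore_env BACKUP_DIR TARGET_DIR [normal]
# Without "normal", catastrophic recovery (db_recover -c) is used, which is
# the safe choice when nothing is known about checkpoints during the backup.
restore_env() {
    backup=$1
    target=$2
    mode=${3:-catastrophic}

    mkdir -p "$target" || return 1
    cp -p "$backup"/* "$target"/ || return 1

    if [ "$mode" = "normal" ]; then
        db_recover -h "$target"
    else
        db_recover -c -h "$target"
    fi
}
```

For example, `restore_env /backup/ldap /var/lib/ldap-restored` (both paths made up) would copy the files and then run catastrophic recovery on the copy.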
The optimized procedure that I worked out with Sleepycat's help (for a completely different program, but using the "transaction data store") was this:
** Backing up the database environment is done with the following
** steps:
**  0) all txn log files except the current one are copied to
**     the backup
**  1) a checkpoint is taken
**  2) the list of txn log files that are no longer needed for
**     recovery or txn_abort is obtained
**  3) the LSN of the most recent checkpoint is noted
**  4) all the database table files, including queue extents,
**     are copied to the backup
**  5) all the txn log files that were not copied in step (0)
**     are copied to the backup
**  6) if a checkpoint has *not* taken place since step (3),
**     then the database is marked as not needing catastrophic
**     recovery when restored
**  7) if the list from step (2) is not empty, then those txn
**     log files are removed from the active database environment
**     and are marked in the backup as unnecessary for normal
**     restoration
**
** Note that the ordering of this is almost completely inflexible.
** In particular:
**     (0) must precede (5)
**     (1) must precede (2) and (3)
**     (2) and (3) must precede (4)
**     (4) must precede (5)
**     (5) must precede (6) and (7)
**
** Minimizing the time between (3) and (6) is highly desirable,
** as that minimizes the window in which a checkpoint could
** occur that would result in a backup that would require
** catastrophic recovery when restored.  Restoring such a
** backup is *much* slower than restoring one that only requires
** normal recovery.  That's why (0) and (7) are pushed forward
** and backward to where they are.
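The steps above could be scripted roughly as follows. This is only a sketch under several assumptions, not the original program: the function names and the marker files NORMAL_RECOVERY_OK and UNNEEDED_LOGS are invented for illustration, log and table files are assumed to live directly in the environment home with names free of whitespace, the highest-numbered log file is taken to be the current one, and queue extents are not handled.

```shell
#!/bin/sh
# Print the LSN of the most recent checkpoint for the environment in $1.
checkpoint_lsn() {
    db_stat -t -h "$1" | awk '$2 ~ /^File\/offset/ { print $1; exit }'
}

# Usage: backup_env DB_HOME BACKUP_DIR   (both paths are the caller's choice)
backup_env() {
    home=$1
    backup=$2
    mkdir -p "$backup" || return 1

    # (0) copy every txn log file except the current (highest-numbered) one
    all_logs=$(db_archive -l -h "$home") || return 1
    old_logs=$(printf '%s\n' "$all_logs" | sort | sed '$d')
    for f in $old_logs; do
        cp -p "$home/$f" "$backup/" || return 1
    done

    # (1) take a checkpoint
    db_checkpoint -1 -h "$home" || return 1

    # (2) list the log files no longer needed for recovery
    unneeded=$(db_archive -h "$home") || return 1

    # (3) note the LSN of the most recent checkpoint
    lsn_before=$(checkpoint_lsn "$home")

    # (4) copy the database table files (queue extents not handled here)
    for f in $(db_archive -s -h "$home"); do
        cp -p "$home/$f" "$backup/" || return 1
    done

    # (5) copy the log files that were not copied in step (0)
    for f in $(db_archive -l -h "$home"); do
        [ -e "$backup/$f" ] || cp -p "$home/$f" "$backup/" || return 1
    done

    # (6) no checkpoint since (3): normal recovery suffices on restore
    lsn_after=$(checkpoint_lsn "$home")
    if [ -n "$lsn_before" ] && [ "$lsn_before" = "$lsn_after" ]; then
        : > "$backup/NORMAL_RECOVERY_OK"      # hypothetical marker file
    fi

    # (7) prune unneeded logs from the live environment, note them in backup
    for f in $unneeded; do
        rm -f "$home/$f"
        echo "$f" >> "$backup/UNNEEDED_LOGS"  # hypothetical marker list
    done
}
```

Steps (3) through (6) are kept free of anything but the copies themselves, so the window in which a checkpoint could force catastrophic recovery stays as short as the copy time allows.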
For those trying to script this, you can get the LSN of the most recent checkpoint with

    db_stat -t | awk '$2 ~ /^File\/offset/{print $1; exit}'
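Wrapped as a reusable filter, that extraction might look like the sketch below. The function name last_checkpoint_lsn is made up, and it assumes db_stat -t prints the LSN as the first whitespace-separated field of a line whose second field begins with "File/offset".

```shell
#!/bin/sh
# Read `db_stat -t` style output on stdin and print the file/offset LSN
# from the first line whose second field starts with "File/offset".
last_checkpoint_lsn() {
    awk '$2 ~ /^File\/offset/ { print $1; exit }'
}

# Typical use against a live environment (the path is hypothetical):
#   db_stat -t -h /var/lib/ldap | last_checkpoint_lsn
```

Capturing this value before and after the copy, then comparing the two strings, is enough to tell whether a checkpoint happened in between.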
Philip Guenther