(ITS#7668) LMDB enhancement, corruption detection - openldap-bugs

20 Aug 2013


Full_Name: Howard Chu
Version: LMDB 0.9.7
OS: 
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (78.155.233.73)
Submitted by: hyc
Currently LMDB always stores two snapshots of the environment, one in each of
the two meta pages. For ongoing operation it simply uses the one with the higher
transaction number; the older one is ignored.
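As a rough illustration of that selection rule, here is a minimal sketch. The struct is a hypothetical, heavily simplified stand-in for the real meta page, and sketch_pick_meta() is an invented name, not an LMDB function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified meta page: only the field the rule needs. */
typedef struct {
    uint64_t txnid;   /* ID of the transaction that wrote this snapshot */
} sketch_meta;

/* Return the index (0 or 1) of the snapshot with the higher transaction
 * number. On a tie, page 0 is chosen; that tie-break is an assumption. */
int sketch_pick_meta(const sketch_meta metas[2])
{
    assert(metas != NULL);
    return metas[1].txnid > metas[0].txnid ? 1 : 0;
}
```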
If a power failure occurs while a transaction commit is in progress,
it is possible for the meta page write to be corrupted in the storage device.
Storage devices use ECC to detect (and sometimes recover from) these problems;
if the error cannot be corrected then an attempt to read this sector will fail
and the OS may tell the application that the read failed with an I/O error.
In mdb_env_read_header() we currently attempt to read both header pages but
completely fail the mdb_env_open() if an error occurs. Instead, we should always
attempt to read both pages (assuming the file was not truncated, and both pages
really were written before). If both pages exist but only one can be read
successfully, we should allow the env_open() to proceed in read-only mode and
return an error code to the caller indicating this situation, e.g.
MDB_NEEDS_BACKUP. The point is that the user should use mdb_copy(1) to make a
backup of the environment ASAP and should not be able to do anything else to the
environment in the meantime.
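The proposed behaviour could look roughly like the sketch below. It models only the decision logic, not the actual I/O in mdb_env_read_header(). All names here (sketch_page, sketch_env, sketch_read_header, the SKETCH_* codes) are hypothetical, and SKETCH_NEEDS_BACKUP merely stands in for the suggested MDB_NEEDS_BACKUP, whose value is not defined anywhere yet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result codes; SKETCH_NEEDS_BACKUP stands in for the
 * proposed MDB_NEEDS_BACKUP. */
enum { SKETCH_OK = 0, SKETCH_NEEDS_BACKUP = 1, SKETCH_FAIL = -1 };

/* Outcome of attempting to read one meta page. */
typedef struct {
    int      readable;  /* nonzero if the read succeeded */
    uint64_t txnid;     /* valid only when readable is nonzero */
} sketch_page;

typedef struct {
    int meta_index;     /* which meta page the environment will use */
    int read_only;      /* nonzero if the environment is forced read-only */
} sketch_env;

/* Always consider both pages. Fail only if neither could be read; if
 * exactly one could be read, open read-only and report the condition. */
int sketch_read_header(const sketch_page pages[2], sketch_env *env)
{
    assert(pages != NULL && env != NULL);
    int ok0 = pages[0].readable, ok1 = pages[1].readable;

    if (!ok0 && !ok1)
        return SKETCH_FAIL;

    if (ok0 && ok1) {
        /* Normal case: use the snapshot with the higher transaction number. */
        env->meta_index = pages[1].txnid > pages[0].txnid ? 1 : 0;
        env->read_only = 0;
        return SKETCH_OK;
    }

    /* Degraded case: use the single surviving page, read-only. */
    env->meta_index = ok1 ? 1 : 0;
    env->read_only = 1;
    return SKETCH_NEEDS_BACKUP;
}
```

A caller that gets the degraded result would be expected to run mdb_copy and do nothing else with the environment.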
It's been suggested that LMDB should also use its own CRC on the meta page, to
detect more subtle corruptions. We may consider adding this as well, but
obviously this would be an on-disk format change. My personal view is this is
what ECC DRAM is for. The primary argument for it is the possibility (again,
during a power failure) for the contents of the storage device's write buffer to
be corrupted while writing to the meta page. I.e., ECC may have protected the
data all the way from the host to the storage device, but if the device was in
the middle of writing the sector when the power failure occurred, the write may
have completed, but the buffer DRAM may have been losing charge while the write
was happening. Again, I'm skeptical that corruptions of this sort would not be
detected by the device's own ECC checks.
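For reference, a meta page checksum might be sketched as follows. This is purely illustrative: the struct layout, the field names and the choice of CRC-32 (the common reflected polynomial 0xEDB88320) are all assumptions, and nothing like this exists in the current on-disk format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320). Slow but compact. */
uint32_t sketch_crc32(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Hypothetical meta page carrying a trailing checksum field. */
typedef struct {
    uint64_t txnid;
    uint64_t root;
    uint32_t crc;    /* covers every byte that precedes this field */
} sketch_meta_crc;

/* Compute and store the checksum before the page is written. */
void sketch_meta_seal(sketch_meta_crc *m)
{
    assert(m != NULL);
    m->crc = sketch_crc32(m, offsetof(sketch_meta_crc, crc));
}

/* Return nonzero if the stored checksum matches the page contents. */
int sketch_meta_verify(const sketch_meta_crc *m)
{
    assert(m != NULL);
    return m->crc == sketch_crc32(m, offsetof(sketch_meta_crc, crc));
}
```

A page that fails verification would then be treated the same way as one that could not be read at all.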
While we're at it, it may be useful to include an env_open() flag explicitly
requesting use of the older meta page. This flag would only be valid in
conjunction with MDB_RDONLY. It would allow using mdb_copy to make a backup of
the previous environment snapshot.
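The flag check could be sketched like this. SKETCH_RDONLY and SKETCH_PREVMETA are hypothetical stand-ins (for MDB_RDONLY and for the not-yet-named new flag), with made-up values, and sketch_choose_meta() is an invented helper rather than anything in LMDB.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits; the values are placeholders, not LMDB's. */
#define SKETCH_RDONLY   0x1u
#define SKETCH_PREVMETA 0x2u   /* request the older meta page */

/* Validate the flags and pick a meta page index. Returns 0 on success,
 * or EINVAL if the older page is requested without read-only mode. */
int sketch_choose_meta(const uint64_t txnids[2], unsigned flags, int *index)
{
    assert(txnids != NULL && index != NULL);

    if ((flags & SKETCH_PREVMETA) && !(flags & SKETCH_RDONLY))
        return EINVAL;

    int newest = txnids[1] > txnids[0] ? 1 : 0;
    *index = (flags & SKETCH_PREVMETA) ? !newest : newest;
    return 0;
}
```

Rejecting the flag outside read-only mode keeps a writer from ever committing on top of the older snapshot.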