Full_Name: Howard Chu
Version: LMDB 0.9.7
OS:
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (78.155.233.73)
Submitted by: hyc

Currently LMDB always stores two snapshots of the environment, one in each of the two meta pages. For ongoing operation it simply uses the one with the higher transaction number; the older one is ignored.
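As a rough illustration of the selection rule described above, here is a minimal sketch. The type toy_meta and the function toy_pick_newest() are hypothetical stand-ins, not LMDB's actual structures or code; the real meta page carries more fields than a transaction ID.

```c
#include <stdint.h>

/* Hypothetical, stripped-down stand-in for a meta page. */
typedef struct {
    uint64_t txnid;   /* ID of the transaction that wrote this snapshot */
} toy_meta;

/* Return the index (0 or 1) of the meta page with the higher
 * transaction number. The other page is simply ignored. */
int toy_pick_newest(const toy_meta metas[2])
{
    return metas[1].txnid > metas[0].txnid ? 1 : 0;
}
```

With transaction IDs 7 and 8 in pages 0 and 1, the sketch selects page 1; after the next commit overwrites page 0 with ID 9, it selects page 0.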
If a power failure occurs at the instant that a transaction commit is occurring, it is possible for the meta page write to be corrupted in the storage device. Storage devices use ECC to detect (and sometimes recover from) these problems; if the error cannot be corrected then an attempt to read this sector will fail and the OS may tell the application that the read failed with an I/O error.
In mdb_env_read_header() we currently attempt to read both header pages, but mdb_env_open() fails outright if an error occurs on either read. Instead, we should always attempt to read both pages (assuming the file was not truncated, and both pages really were written before). If both pages exist but only one can be read successfully, we should allow mdb_env_open() to proceed in read-only mode and return an error code to the caller indicating this situation, e.g. MDB_NEEDS_BACKUP. The point is that the user should use mdb_copy(1) to make a backup of the environment ASAP, and should not be able to do anything else to the environment in the meantime.
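The proposed decision logic might look roughly like the sketch below. Everything in it is an assumption made for illustration: the toy_* names, the numeric value of TOY_NEEDS_BACKUP, and the idea of passing in per-page read results are not part of LMDB. It also assumes both pages exist in the file, i.e. the truncation case has already been ruled out.

```c
#include <errno.h>
#include <stdint.h>

#define TOY_SUCCESS      0
#define TOY_NEEDS_BACKUP (-30000)  /* hypothetical value for the proposed code */

/* Outcome of reading one meta page. */
typedef struct {
    int      rc;     /* 0 if the page was read, else an errno value */
    uint64_t txnid;  /* meaningful only when rc == 0 */
} toy_meta_read;

/* What the open call should do next. */
typedef struct {
    int use_page;      /* meta page index to use, -1 if none */
    int force_rdonly;  /* nonzero: environment must be opened read-only */
} toy_open_result;

int toy_resolve_meta(const toy_meta_read r[2], toy_open_result *out)
{
    int ok0 = (r[0].rc == 0);
    int ok1 = (r[1].rc == 0);

    out->use_page = -1;
    out->force_rdonly = 0;

    if (ok0 && ok1) {
        /* Normal case: use the snapshot with the higher txnid. */
        out->use_page = r[1].txnid > r[0].txnid ? 1 : 0;
        return TOY_SUCCESS;
    }
    if (ok0 || ok1) {
        /* One page unreadable: continue read-only on the survivor
         * and tell the caller to take a backup. */
        out->use_page = ok1 ? 1 : 0;
        out->force_rdonly = 1;
        return TOY_NEEDS_BACKUP;
    }
    /* Neither page readable: nothing to open, report the first error. */
    return r[0].rc;
}
```

Returning a distinct code rather than plain success means a caller that ignores return values cannot accidentally keep writing to a damaged environment.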
It's been suggested that LMDB should also use its own CRC on the meta page, to detect more subtle corruptions. We may consider adding this as well, but obviously this would be an on-disk format change. My personal view is this is what ECC DRAM is for. The primary argument for it is the possibility (again, during a power failure) of the contents of the storage device's write buffer being corrupted while writing to the meta page. I.e., ECC may have protected the data all the way from the host to the storage device, but if the device was in the middle of writing the sector when the power failure occurred, the write may have completed while the buffer DRAM was losing charge. Again, I'm skeptical that corruptions of this sort would not be detected by the device's own ECC checks.
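For reference, a checksum of this kind could be sketched as follows. This is only an illustration under stated assumptions: the choice of the standard CRC-32 polynomial, the toy_crc_meta layout, and the helper names are all hypothetical, and nothing here reflects an actual or planned LMDB on-disk format.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320, as used by zlib). */
uint32_t toy_crc32(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Hypothetical meta page layout with a trailing checksum field. */
typedef struct {
    uint64_t txnid;
    uint64_t root_pgno;
    uint32_t crc;        /* covers every byte before this field */
} toy_crc_meta;

/* Compute and store the checksum just before the page is written. */
void toy_meta_seal(toy_crc_meta *m)
{
    m->crc = toy_crc32(m, offsetof(toy_crc_meta, crc));
}

/* Return nonzero if the stored checksum matches the page contents. */
int toy_meta_ok(const toy_crc_meta *m)
{
    return m->crc == toy_crc32(m, offsetof(toy_crc_meta, crc));
}
```

A page that fails toy_meta_ok() could then be treated the same way as a page that returned an I/O error on read.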
While we're at it, it may be useful to include an env_open() flag explicitly requesting use of the older meta page. This flag would only be valid in conjunction with MDB_RDONLY. It would allow using mdb_copy to make a backup of the previous environment snapshot.
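The flag check could be sketched as below. TOY_RDONLY and TOY_PREVMETA are hypothetical constants used only for this example (TOY_RDONLY stands in for MDB_RDONLY; the second flag does not exist in LMDB 0.9.7), and toy_select_meta() is not an LMDB function.

```c
#include <errno.h>
#include <stdint.h>

#define TOY_RDONLY   0x1u  /* stand-in for MDB_RDONLY */
#define TOY_PREVMETA 0x2u  /* hypothetical: request the older meta page */

/* Store the index of the meta page to use in *idx and return 0,
 * or return EINVAL if the older page is requested without read-only. */
int toy_select_meta(unsigned flags, const uint64_t txnid[2], int *idx)
{
    if ((flags & TOY_PREVMETA) && !(flags & TOY_RDONLY))
        return EINVAL;

    int newest = txnid[1] > txnid[0] ? 1 : 0;
    *idx = (flags & TOY_PREVMETA) ? !newest : newest;
    return 0;
}
```

Rejecting the flag without read-only matters because a write transaction started from the older snapshot could reuse pages that the newer snapshot still references.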