This paper https://www.usenix.org/conference/osdi14/technical-sessions/presentation/zhe... describes a potential crash vulnerability in LMDB due to its use of fdatasync instead of fsync when syncing writes to the data file. The vulnerability exists because fdatasync omits syncing the file metadata; if the data file needs to grow as a result of any writes, that growth requires a metadata update.
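For illustration only (this is not LMDB's actual write path, and the function below is hypothetical), the pattern at issue is a write that extends the file followed by fdatasync():

    #include <unistd.h>

    /* Hypothetical illustration: the write grows the file, so the file size
     * (metadata) changes.  fsync() flushes that metadata as well; fdatasync()
     * is allowed to skip metadata not needed to retrieve the data, but POSIX
     * still requires it to sync the new size. */
    static int append_and_sync(int fd, const void *buf, size_t len)
    {
        if (write(fd, buf, len) != (ssize_t)len)
            return -1;
        return fdatasync(fd);   /* on a buggy FS the size update may be lost */
    }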
This is a well-understood issue in LMDB; we briefly touched on it in this earlier email thread http://www.openldap.org/lists/openldap-technical/201402/msg00111.html and it's been a topic of discussion on IRC ever since the first multi-FS microbenchmarks we conducted back in 2012. http://symas.com/mdb/microbench/july/
It's worth noting that this vulnerability doesn't exist on Windows, MacOSX, Android, or *BSD, because none of these OSs have a function equivalent to fdatasync in the first place - they always use fsync (or the Windows equivalent). (Android is an oddball; the underlying Linux kernel of course supports fdatasync, but the C library, bionic, does not.)
We have a couple of approaches for Linux: 1) provide an option to preallocate the file, using fallocate(). Unfortunately this doesn't completely eliminate metadata updates - filesystem drivers tend to try to be "smart" and make fallocate cheap; they allocate the space in the FS metadata but they also mark it as "unseen." The first time a process accesses an unseen page, it gets zeroed out. Up until that point, the old contents of the disk page are still present. Changing a page from "unseen" to "seen" requires a metadata update of its own.
We had a discussion of this FS mis-feature a while ago, but it was fruitless. https://lkml.org/lkml/2012/12/7/396
2) preallocate the file by explicitly writing zeros to it. This has a couple of other disadvantages: a) on SSDs, such a write needlessly contributes to wear on the flash. b) Windows detects all-zero writes and compresses them out, creating a sparse file, thus defeating the attempt at preallocation.
3) track the allocated size of the file, and toggle between fsync and fdatasync depending on whether the allocated size actually grows or not. This is the approach I'm currently taking in a development branch (a rough sketch of the idea follows below). Whether we add this to a new 0.9.x release, or just in 1.0, I haven't yet decided.
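A minimal sketch of option 3, with hypothetical names rather than the actual development-branch code:

    #include <sys/types.h>
    #include <unistd.h>

    /* Hypothetical sketch: remember the largest size we know has been covered
     * by a full fsync, and only pay for fsync when a commit grows the file. */
    typedef struct {
        int   fd;           /* data file descriptor */
        off_t synced_size;  /* size already made durable by fsync */
    } datafile;

    static int sync_datafile(datafile *df, off_t current_size)
    {
        if (current_size > df->synced_size) {
            if (fsync(df->fd))          /* file grew: size metadata must be synced */
                return -1;
            df->synced_size = current_size;
            return 0;
        }
        return fdatasync(df->fd);       /* no growth: data-only sync is enough */
    }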
As another footnote, I plan to add support for LMDB on a raw partition in 1.x. Naturally, fsync vs fdatasync will be irrelevant in that case.
Catching up with old mail...
On 20/10/14 12:44, Howard Chu wrote:
This paper https://www.usenix.org/conference/osdi14/technical-sessions/presentation/zhe... describes a potential crash vulnerability in LMDB due to its use of fdatasync instead of fsync when syncing writes to the data file. The vulnerability exists because fdatasync omits syncing the file metadata; if the data file needs to grow as a result of any writes, that growth requires a metadata update.
Looks like an OS bug. fdatasync() should not break data integrity; it may only skip metadata that is not needed for retrieving the data. So size changes are synced. So say the POSIX spec and the Linux man page.
Hallvard Breien Furuseth wrote:
Looks like an OS bug. fdatasync() should not break data integrity; it may only skip metadata that is not needed for retrieving the data. So size changes are synced. So say the POSIX spec and the Linux man page.
Ah good point. If you check out their slides, slide #103 of 106 raises this question; the only failure they found in LMDB occurred on ext3 (and not on XFS), so we may just chalk this up to a flaw in ext3 instead.
Given that ext3 has already been superseded by ext4, this result of theirs may not be all that useful in the real world. We have already recommended against ext3 for performance reasons; perhaps we should just note this and move on.
On 15/11/14 02:57, Howard Chu wrote:
Ah good point. If you check out their slides, slide #103 of 106 raises this question; the only failure they found in LMDB occurred on ext3 (and not on XFS), so we may just chalk this up to a flaw in ext3 instead.
Given that ext3 has already been superseded by ext4, this result of theirs may not be all that useful in the real world. We have already recommended against ext3 for performance reasons; perhaps we should just note this and move on.
No, ext4 breaks too. Their paper's page 459, 5.1.2 LightningDB:
The fact that the journal commit block (op#402) is flushed with the next pwrite64 in the same thread means fdatasync on ext3 does not wait for the completion of journaling (similar behavior has been observed on ext4).
I guess O_DSYNC and fdatasync() should not be the LMDB defaults yet, at least not on Linux :-(
We need a bug report to Linux or to the distro they were using, noting that the power faults are simulated. And is this O_DSYNC, fdatasync, or both? The problem might be with only one of them. I haven't read the paper in detail yet.
Hallvard Breien Furuseth wrote:
No, ext4 breaks too. Their paper's page 459, 5.1.2 LightningDB:
The fact that the journal commit block (op#402) is flushed with the next pwrite64 in the same thread means fdatasync on ext3 does not wait for the completion of journaling (similar behavior has been observed on ext4).
I guess O_DSYNC and fdatasync() should not be the LMDB defaults yet, at least not on Linux :-(
We need a bug report to Linux or to the distro they were using, noting that the power faults are simulated. And is this O_DSYNC, fdatasync, or both? The problem might be with only one of them. I haven't read the paper in detail yet.
This appears to be quite old news.
https://lkml.org/lkml/2012/9/3/83
It has references going back to at least 2008.
The LKML thread indicates that this bug was already fixed. The Zheng et al. paper says they used RHEL6, which shipped with kernel 2.6.32, so it apparently was too old to have the fix.
All in all, a bunch of bogus reporting: claiming that all DBs are broken when in fact LMDB is perfectly correct.
On 01/05/2015 12:58 PM, Howard Chu wrote:
The LKML thread indicates that this bug was already fixed. The Zheng et al. paper says they used RHEL6, which shipped with kernel 2.6.32, so it apparently was too old to have the fix.
All in all, a bunch of bogus reporting: claiming that all DBs are broken when in fact LMDB is perfectly correct.
True, but often uninteresting from the user's perspective. So I do think LMDB on Linux should default to fsync for some years, at least when the file may have grown. The Makefile can explain the problem and provide a variable to always use fdatasync, if the admin knows the kernel is OK.
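A minimal sketch of such a compile-time override, assuming the MDB_FDATASYNC macro referred to below stays the switch point (the fsync default here is the proposal, not current behavior):

    /* Proposed default (sketch): use the safe fsync() unless the builder
     * overrides it, e.g. via the Makefile with
     *   CPPFLAGS="-DMDB_FDATASYNC=fdatasync"
     * on kernels/filesystems known to sync size changes on fdatasync. */
    #ifndef MDB_FDATASYNC
    # define MDB_FDATASYNC  fsync
    #endif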
As for how to know the synced size, if you want to do more than always use fsync on an OS where fdatasync is unreliable:
I drafted some code to get around it, but it got messy. If we use more code for this than just '#define MDB_FDATASYNC fsync', I suggest handling it all in mdb_env_sync(), which can fstat():
In struct MDB_env:

    off_t me_size;    /**< file size known to be synced, or 0 */

    mdb_env_sync()
    {
        ...;
    #if MDB_BUGGY_FDATASYNC
        size_t sz = 0;
        /* If the size is unknown, or differs from the last size known to be
         * synced, fall back to a full fsync(). */
        if (mdb_fsize(env->me_fd, &sz) != MDB_SUCCESS || sz != env->me_size) {
            if (fsync(env->me_fd))
                rc = ErrCode();
            else if (sz)
                env->me_size = sz;
        } else
    #endif
            ...normal sync...;
    }
mdb_env_open() does not know whether the current file size has been synced, so drop setting me_size there (leave it 0).
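For reference, a minimal sketch of the fstat()-based size check; the helper below is hypothetical and merely mirrors the shape of the mdb_fsize() call used above:

    #include <errno.h>
    #include <stddef.h>
    #include <sys/stat.h>

    /* Hypothetical stand-in for mdb_fsize(): report the data file's current
     * size so mdb_env_sync() can compare it against me_size. */
    static int get_file_size(int fd, size_t *size)
    {
        struct stat st;
        if (fstat(fd, &st))
            return errno;       /* caller maps this to an LMDB error code */
        *size = (size_t)st.st_size;
        return 0;               /* success, i.e. MDB_SUCCESS */
    }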
Another sync issue:
mdb_env_sync() syncs the wrong way (msync vs. fdatasync) if it runs in a process with a different MDB_WRITEMAP setting than the process which committed with MDB_NOSYNC or MDB_NOMETASYNC.
I.e. this statement in lmdb.h is too weak:
    * Processes with and without MDB_WRITEMAP on the same environment do
    * not cooperate well.
I think I added it after an IRC chat. But it should either say that it can break ACID, or env_sync() called explicitly should sync more aggressively - at least if the MDB_env did not commit all transactions since last known sync.
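To make the mismatch concrete, a simplified sketch (hypothetical types and names, not the real mdb_env_sync()): the sync method is chosen purely from the calling process's own flags, not from how the unsynced transactions were actually written.

    #include <errno.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Hypothetical stand-in for the relevant MDB_env fields. */
    typedef struct {
        int    writemap;   /* was this env opened with MDB_WRITEMAP? */
        void  *map;        /* the memory map */
        size_t mapsize;
        int    fd;         /* data file descriptor */
    } env_sketch;

    /* Sketch: the choice below depends only on this process's flags, so it
     * may not match the way another process (with the opposite WRITEMAP
     * setting) wrote the transactions it left unsynced. */
    static int env_sync_sketch(env_sketch *env)
    {
        if (env->writemap)
            return msync(env->map, env->mapsize, MS_SYNC) ? errno : 0;
        return fdatasync(env->fd) ? errno : 0;
    }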
On 01/05/2015 11:25 PM, Hallvard Breien Furuseth wrote:
I think I added it after an IRC chat. But it should either say that it can break ACID, or env_sync() called explicitly should sync more aggressively - at least if the MDB_env did not commit all transactions since last known sync.
Never mind the "called explicitly" part. The same applies to the sync in txn_commit if the last txn was committed with a different WRITEMAP setting.