This paper https://www.usenix.org/conference/osdi14/technical-sessions/presentation/zhe... describes a potential crash vulnerability in LMDB due to its use of fdatasync instead of fsync when syncing writes to the data file. The vulnerability exists because fdatasync omits syncing the file metadata; if the data file needs to grow as a result of any writes, that growth requires a metadata update.
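For illustration only (this is not LMDB's actual write path, and the function below is hypothetical), the pattern at issue is a write that extends the file followed by fdatasync():

    #include <unistd.h>

    /* Hypothetical illustration: the write grows the file, so the file size
     * (metadata) changes.  fsync() flushes that metadata as well; fdatasync()
     * is allowed to skip metadata not needed to retrieve the data, but POSIX
     * still requires it to sync the new size. */
    static int append_and_sync(int fd, const void *buf, size_t len)
    {
        if (write(fd, buf, len) != (ssize_t)len)
            return -1;
        return fdatasync(fd);   /* on a buggy FS the size update may be lost */
    }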
This is a well-understood issue in LMDB; we briefly touched on it in this earlier email thread http://www.openldap.org/lists/openldap-technical/201402/msg00111.html and it's been a topic of discussion on IRC ever since the first multi-FS microbenchmarks we conducted back in 2012. http://symas.com/mdb/microbench/july/
It's worth noting that this vulnerability doesn't exist on Windows, MacOSX, Android, or *BSD, because none of these OSs have a function equivalent to fdatasync in the first place - they always use fsync (or the Windows equivalent). (Android is an oddball; the underlying Linux kernel of course supports fdatasync, but the C library, bionic, does not.)
We have a couple of approaches for Linux: 1) provide an option to preallocate the file, using fallocate(). Unfortunately this doesn't completely eliminate metadata updates - filesystem drivers tend to try to be "smart" and make fallocate cheap; they allocate the space in the FS metadata but they also mark it as "unseen." The first time a process accesses an unseen page, it gets zeroed out. Up until that point, the old contents of the disk page are still present. Changing a page from "unseen" to "seen" requires a metadata update of its own.
We had a discussion of this FS mis-feature a while ago, but it was fruitless. https://lkml.org/lkml/2012/12/7/396
2) preallocate the file by explicitly writing zeros to it. This has a couple of other disadvantages: a) on SSDs, such a write needlessly contributes to wear on the flash. b) Windows detects all-zero writes and compresses them out, creating a sparse file, thus defeating the attempt at preallocation.
3) track the allocated size of the file, and toggle between fsync and fdatasync depending on whether the allocated size actually grows or not. This is the approach I'm currently taking in a development branch (a rough sketch of the idea follows below). Whether we add this to a new 0.9.x release, or just in 1.0, I haven't yet decided.
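A minimal sketch of option 3, with hypothetical names rather than the actual development-branch code:

    #include <sys/types.h>
    #include <unistd.h>

    /* Hypothetical sketch: remember the largest size we know has been covered
     * by a full fsync, and only pay for fsync when a commit grows the file. */
    typedef struct {
        int   fd;           /* data file descriptor */
        off_t synced_size;  /* size already made durable by fsync */
    } datafile;

    static int sync_datafile(datafile *df, off_t current_size)
    {
        if (current_size > df->synced_size) {
            if (fsync(df->fd))          /* file grew: size metadata must be synced */
                return -1;
            df->synced_size = current_size;
            return 0;
        }
        return fdatasync(df->fd);       /* no growth: data-only sync is enough */
    }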
As another footnote, I plan to add support for LMDB on a raw partition in 1.x. Naturally, fsync vs fdatasync will be irrelevant in that case.
Catching up with old mail...
On 20/10/14 12:44, Howard Chu wrote:
This paper https://www.usenix.org/conference/osdi14/technical-sessions/presentation/zhe... describes a potential crash vulnerability in LMDB due to its use of fdatasync instead of fsync when syncing writes to the data file. The vulnerability exists because fdatasync omits syncing the file metadata; if the data file needs to grow as a result of any writes, that growth requires a metadata update.
Looks like an OS bug. fdatasync() should not break data integrity; it may only skip metadata that is not needed for retrieving the data. So size changes are synced. So say the POSIX spec and the Linux man page.
Hallvard Breien Furuseth wrote:
Looks like an OS bug. fdatasync() should not break data integrity; it may only skip metadata that is not needed for retrieving the data. So size changes are synced. So say the POSIX spec and the Linux man page.
Ah good point. If you check out their slides, slide #103 of 106 raises this question; the only failure they found in LMDB occurred on ext3 (and not on XFS), so we may just chalk this up to a flaw in ext3 instead.
Given that ext3 has already been superseded by ext4, this result of theirs may not be all that useful in the real world. We have already recommended against ext3 for performance reasons; perhaps we should just note this and move on.
On 15/11/14 02:57, Howard Chu wrote:
Ah good point. If you check out their slides, slide #103 of 106 raises this question; the only failure they found in LMDB occurred on ext3 (and not on XFS), so we may just chalk this up to a flaw in ext3 instead.
Given that ext3 has already been superseded by ext4, this result of theirs may not be all that useful in the real world. We have already recommended against ext3 for performance reasons; perhaps we should just note this and move on.
No, ext4 breaks too. Their paper's page 459, 5.1.2 LightningDB:
The fact that the journal commit block (op#402) is flushed with the next pwrite64 in the same thread means fdatasync on ext3 does not wait for the completion of journaling (similar behavior has been observed on ext4).
I guess O_DSYNC and fdatasync() should not be the LMDB defaults yet, at least not on Linux :-(
We need a bug report to Linux or to the distro they were using, noting that the power faults are simulated. And is this O_DSYNC, fdatasync, or both? The problem might be with only one of them. I haven't read the paper in detail yet.
Hallvard Breien Furuseth wrote:
No, ext4 breaks too. Their paper's page 459, 5.1.2 LightningDB:
The fact that the journal commit block (op#402) is flushed with the next pwrite64 in the same thread means fdatasync on ext3 does not wait for the completion of journaling (similar behavior has been observed on ext4).
I guess O_DSYNC and fdatasync() should not be the LMDB defaults yet, at least not on Linux :-(
We need a bug report to Linux or to the distro they were using, noting that the power faults are simulated. And is this O_DSYNC, fdatasync, or both? The problem might be with only one of them. I haven't read the paper in detail yet.
This appears to be quite old news.
https://lkml.org/lkml/2012/9/3/83
It has references going back to at least 2008.
The LKML thread indicates that this bug was already fixed. The Zheng et al. paper says they used RHEL6, which shipped with kernel 2.6.32, so it apparently was too old to have the fix.
All in all, a bunch of bogus reporting: claiming that all DBs are broken when in fact LMDB is perfectly correct.
On 01/05/2015 12:58 PM, Howard Chu wrote:
The LKML thread indicates that this bug was already fixed. The Zheng et al. paper says they used RHEL6, which shipped with kernel 2.6.32, so it apparently was too old to have the fix.
All in all, a bunch of bogus reporting: claiming that all DBs are broken when in fact LMDB is perfectly correct.
True, but often uninteresting from the user's perspective. So I do think LMDB on Linux should default to fsync for some years, at least when the file may have grown. The Makefile can explain the problem and provide a variable to always use fdatasync, if the admin knows the kernel is OK.
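A minimal sketch of such a compile-time override, assuming the MDB_FDATASYNC macro referred to below stays the switch point (the fsync default here is the proposal, not current behavior):

    /* Proposed default (sketch): use the safe fsync() unless the builder
     * overrides it, e.g. via the Makefile with
     *   CPPFLAGS="-DMDB_FDATASYNC=fdatasync"
     * on kernels/filesystems known to sync size changes on fdatasync. */
    #ifndef MDB_FDATASYNC
    # define MDB_FDATASYNC  fsync
    #endif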
As for how to know the synced size, if you want to do more than always use fsync on an OS where fdatasync is unreliable:
I drafted some code to get around it, but it got messy. If we use more code for this than just '#define MDB_FDATASYNC fsync', I suggest handling it all in mdb_env_sync(), which can fstat():
In struct MDB_env:

    off_t me_size;    /**< file size known to be synced, or 0 */

    mdb_env_sync()
    {
        ...;
    #if MDB_BUGGY_FDATASYNC
        size_t sz = 0;
        /* If the size is unknown, or differs from the last size known to be
         * synced, fall back to a full fsync(). */
        if (mdb_fsize(env->me_fd, &sz) != MDB_SUCCESS || sz != env->me_size) {
            if (fsync(env->me_fd))
                rc = ErrCode();
            else if (sz)
                env->me_size = sz;
        } else
    #endif
            ...normal sync...;
    }
mdb_env_open() does not know whether the current file size has been synced, so drop setting me_size there (leave it 0).
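For reference, a minimal sketch of the fstat()-based size check; the helper below is hypothetical and merely mirrors the shape of the mdb_fsize() call used above:

    #include <errno.h>
    #include <stddef.h>
    #include <sys/stat.h>

    /* Hypothetical stand-in for mdb_fsize(): report the data file's current
     * size so mdb_env_sync() can compare it against me_size. */
    static int get_file_size(int fd, size_t *size)
    {
        struct stat st;
        if (fstat(fd, &st))
            return errno;       /* caller maps this to an LMDB error code */
        *size = (size_t)st.st_size;
        return 0;               /* success, i.e. MDB_SUCCESS */
    }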
Another sync issue:
mdb_env_sync() syncs the wrong way (msync vs. fdatasync) if it runs in a process with a different MDB_WRITEMAP setting than the process which committed with MDB_NOSYNC or MDB_NOMETASYNC.
I.e. this statement in lmdb.h is too weak:
    * Processes with and without MDB_WRITEMAP on the same environment do
    * not cooperate well.
I think I added it after an IRC chat. But it should either say that it can break ACID, or env_sync() called explicitly should sync more aggressively - at least if the MDB_env did not commit all transactions since last known sync.
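To make the mismatch concrete, a simplified sketch (hypothetical types and names, not the real mdb_env_sync()): the sync method is chosen purely from the calling process's own flags, not from how the unsynced transactions were actually written.

    #include <errno.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Hypothetical stand-in for the relevant MDB_env fields. */
    typedef struct {
        int    writemap;   /* was this env opened with MDB_WRITEMAP? */
        void  *map;        /* the memory map */
        size_t mapsize;
        int    fd;         /* data file descriptor */
    } env_sketch;

    /* Sketch: the choice below depends only on this process's flags, so it
     * may not match the way another process (with the opposite WRITEMAP
     * setting) wrote the transactions it left unsynced. */
    static int env_sync_sketch(env_sketch *env)
    {
        if (env->writemap)
            return msync(env->map, env->mapsize, MS_SYNC) ? errno : 0;
        return fdatasync(env->fd) ? errno : 0;
    }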
On 01/05/2015 11:25 PM, Hallvard Breien Furuseth wrote:
I think I added it after an IRC chat. But it should either say that it can break ACID, or env_sync() called explicitly should sync more aggressively - at least if the MDB_env did not commit all transactions since last known sync.
Never mind the "called explicitly" part. The same applies to the sync in txn_commit if the last txn was committed with a different WRITEMAP setting.