For posterity, this is the conversation we had with the University of Wisconsin researchers behind this Usenix OSDI paper: https://www.usenix.org/conference/osdi14/technical-sessions/presentation/pil...
On Thu, Jul 24, 2014 at 8:10 AM, Howard Chu <hyc@symas.com> wrote:
Hello Thanu, I just saw your slides from http://wisdom.cs.wisc.edu/index.htm#news-2 and was wondering if you could give more details on the crash vulnerabilities you found in LMDB. Thanks.
25/07/14 09:42 Thanumalayan Sankaranarayana Pillai wrote:
Hi Howard,
Thanks for your interest in the slides. We appreciate you caring about the vulnerability.
The single crash vulnerability that we discovered in LMDB concerns the 106-byte update to the header while inserting a key-value pair (within the mdb_txn_commit function). Specifically, if this overwrite is not done atomically (i.e., if only a part of the 106 bytes goes to disk, and then a crash happens), then LMDB reports an error if we try to further modify the database after reboot. The error reported is "mdb_page_touch: Assertion `mp->mp_p.p_pgno != pgno' failed".
We also did some manual analysis by making the loss in atomicity correspond to field boundaries within the information contained in those 106 bytes (such as the free list data and the transaction ID), and found that LMDB sometimes loses the previous transaction. For example, if the database contains Txn-1, Txn-2, and Txn-3, and a crash happens while inserting Txn-4, the rebooted database might lose Txn-3. I have not included any such manual analysis results in the slides.
I have attached a text file with a more detailed description of the vulnerability. If you have any questions on the details, I have CC'ed Ram, who actually tested LMDB for the vulnerabilities and might be able to help you better than me.
Thus, the vulnerability in LMDB relates to the atomicity of a file overwrite that is within a 512-byte sector. With current (ext3/ext4/btrfs/xfs/zfs) file systems running atop hard drives and SSDs, this is probably not a problem. (The only report we have seen indicating that atomicity might not be maintained here is [1].)
However, byte-addressable NVRAMs in the future will probably expose this vulnerability. We are also not sure what happens with the multi-level storage stacks many people use these days (virtual machines atop a Fusion-io card connected over the network to a NAS?). In our understanding, even for these situations, LMDB can solve this vulnerability easily, since it has two root pointers.
All said, since we are not experts at using LMDB, there is a slight chance we made a mistake in how we used LMDB while testing it. Please let us know if you suspect this to be the case. Hope this helps!
[1] http://lkml.iu.edu//hypermail/linux/kernel/0908.3/00155.html
-- Thanks, Thanu
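To make the torn-header scenario described above concrete, here is a minimal sketch of the kind of fault injection being discussed. It is not the researchers' actual tool; the file names, offset, and byte ranges are made-up illustrations. The idea is simply to apply only part of the intended 106-byte header overwrite to a copy of the data file, then reopen the environment and attempt another insert.

    /* Hedged sketch of a torn-write injector (assumed workflow, not the
     * researchers' tool).  Apply only bytes [lo, hi) of the overwrite the
     * commit intended, leaving the rest of the header as it was, to emulate
     * a crash in the middle of a non-atomic header update. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    static int inject_torn_write(const char *dbfile, off_t offset,
                                 const unsigned char *intended, size_t len,
                                 size_t lo, size_t hi)
    {
        int fd;

        if (lo > hi || hi > len)
            return -1;
        fd = open(dbfile, O_WRONLY);
        if (fd < 0) {
            perror("open");
            return -1;
        }
        /* Persist only part of the intended overwrite. */
        if (pwrite(fd, intended + lo, hi - lo, offset + (off_t)lo)
            != (ssize_t)(hi - lo)) {
            perror("pwrite");
            close(fd);
            return -1;
        }
        fsync(fd);
        close(fd);
        return 0;
    }

    int main(void)
    {
        /* Hypothetical example: the intended header image was captured to
         * "intended.bin" and belongs at offset 0 of "copy-of-data.mdb";
         * apply everything except the first 36 bytes, matching the
         * automatic-analysis case in the attached notes. */
        unsigned char buf[106];
        FILE *f = fopen("intended.bin", "rb");
        if (!f) {
            perror("fopen");
            return 1;
        }
        if (fread(buf, 1, sizeof(buf), f) != sizeof(buf)) {
            fprintf(stderr, "short read of intended header image\n");
            fclose(f);
            return 1;
        }
        fclose(f);
        return inject_torn_write("copy-of-data.mdb", 0, buf, sizeof(buf),
                                 36, sizeof(buf)) ? 1 : 0;
    }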
25/07/14 11:38 Howard:
Thanks much for your reply. The results you describe are not too surprising; we assume atomic sector writes. This is an area we've spent quite a lot of time researching ourselves. I summarized our findings here last year http://www.openldap.org/its/index.cgi/Incoming?id=7668
Currently all hard drive/SSD manufacturers guarantee that 512 byte sector writes are atomic. As such, failure to write the 106 byte header is not something we account for in current LMDB releases. Also, failures of this type should result in ECC errors in the disk sector - it should be impossible to successfully read a sector that was written incorrectly in the ways you describe.
Even in extreme cases, the probability of failure to write the leading 128 out of 512 bytes of a sector is nearly nil - even on very old hard drives, before 512-byte sector write guarantees. We would have to go back nearly 30 years to find such a device, e.g.
http://bitsavers.informatik.uni-stuttgart.de/pdf/quantum/Quantum_ProDrive_40...
Page 23, Section 2.1 "No damage or loss of data will occur if power is applied or removed during drive operation, except that data may be lost in the sector being written at the time of power loss."
From the specs on page 15, the data transfer rate to/from the platters is 1.25MB/sec, so the time to write one full sector is 0.4096ms; the time to write the leading 128 bytes of the sector is thus 1/4 of that: 0.10ms. You would have to be very very unlucky to have a power failure hit the drive within this .1ms window of time. Fast-forward to present day and it's simply not an issue.
But as you say, byte-addressable storage is coming soon. The LMDB 1.x versions will add a checksum to the two headers, and will fall back to using the previous header if a checksum error is found.
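A rough sketch of that checksum-and-fallback idea might look like the following; the struct layout, field names, and the use of CRC-32 here are illustrative assumptions, not the actual LMDB 1.x meta-page format.

    /* Hedged sketch of checksummed meta headers with fallback (assumed
     * layout, not LMDB's real format). */
    #include <stddef.h>
    #include <stdint.h>

    typedef struct meta_hdr {
        uint64_t freelist_root;   /* illustrative fields only */
        uint64_t main_root;
        uint64_t last_pgno;
        uint64_t txnid;
        uint32_t checksum;        /* CRC over the fields above */
    } meta_hdr;

    /* Plain CRC-32 (reflected, polynomial 0xEDB88320); any checksum would do. */
    static uint32_t crc32_calc(const void *data, size_t len)
    {
        const uint8_t *p = data;
        uint32_t crc = 0xFFFFFFFFu;
        while (len--) {
            crc ^= *p++;
            for (int k = 0; k < 8; k++)
                crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1u));
        }
        return ~crc;
    }

    static int meta_valid(const meta_hdr *m)
    {
        return crc32_calc(m, offsetof(meta_hdr, checksum)) == m->checksum;
    }

    /* Take the newest header whose checksum verifies; if the newest one is
     * torn, fall back to the older, still-intact header. */
    static const meta_hdr *pick_meta(const meta_hdr *m0, const meta_hdr *m1)
    {
        int v0 = meta_valid(m0), v1 = meta_valid(m1);
        if (v0 && v1) return m0->txnid > m1->txnid ? m0 : m1;
        if (v0) return m0;
        if (v1) return m1;
        return NULL;   /* both headers corrupt: unrecoverable */
    }

    int main(void)
    {
        meta_hdr a = { 1, 2, 3, 41, 0 };   /* older committed header */
        meta_hdr b = { 4, 5, 6, 42, 0 };   /* newer committed header */
        a.checksum = crc32_calc(&a, offsetof(meta_hdr, checksum));
        b.checksum = crc32_calc(&b, offsetof(meta_hdr, checksum));
        b.main_root = 999;                 /* simulate a torn write of b */
        return pick_meta(&a, &b) == &a ? 0 : 1;   /* reader falls back to a */
    }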
Thanks again for your work, looking forward to reading your full paper in the future.
25/07/14 13:14 Thanu:
Hi Howard,
Great to know that so much thought was already put into this issue. (We usually see people assume things about the storage stack that are simply false, without any notion of when they will become false if they aren't already.) I am looking forward to all the cool new things that will be in LMDB 1.0.
Thanks, Thanu
25/07/14 13:15 Howard:
Hi again, looking over these vulnerabilities, they all seem unrealistic even given the existence of byte-addressable storage devices. In particular, even if you assume writes may be non-atomic, why do you assume that writes will be non-sequential within a sector?
For example, if we were talking about an HDD that did not guarantee atomic sector writes, we still know that bytes are written sequentially from beginning of sector to end of sector; this is one reason why the LMDB meta page is laid out in the way that it is - the transaction ID is last because if a partial write occurs and the txn ID doesn't make it to disk, this meta page will be ignored and the previous one will be used automatically.
In the tests you describe here, you have broken this sequential assumption, but there are no storage devices in existence that can fail in the manners you describe. And even in byte-addressable memories, DMA or memcpy routines will always start at a low-numbered address and work sequentially higher. Do you really believe it is important for software to protect against this type of failure scenario?
Found using automatic analysis
Write (106 bytes) at lib/mdb.c:3227 [mdb_txn_commit], split into three parts; the first part (the first 36 bytes) alone does not go to disk.
Error: lib/mdb.c:1921: mdb_page_touch: Assertion `mp->mp_p.p_pgno != pgno' failed.
Exact situations where the vulnerabilities arise, found using manual analysis
- Only the txn id gets into the metapage, and no other data: the previous transaction is lost.
- The free list data, txn id, and last page go in, but not the main db details: the previous transaction is lost. Error: lib/mdb.c:1921: mdb_page_touch: Assertion `mp->mp_p.p_pgno != pgno' failed.
- Only the main db (2 fields) goes in, with no free pointer: Error: lib/mdb.c:1921: mdb_page_touch: Assertion `mp->mp_p.p_pgno != pgno' failed.
- Everything gets in except the database pointer (48 bytes), but the crash happens such that the second 24 bytes (which contain the root) are missed: the previous transaction is lost. Error: lib/mdb.c:1921: mdb_page_touch: Assertion `mp->mp_p.p_pgno != pgno' failed.
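To connect the ordering argument in Howard's 25/07/14 13:15 note with the cases above, here is a hedged sketch of a simplified meta-page layout in which the transaction ID comes last; the field names are illustrative assumptions, not LMDB's actual struct.

    /* Hedged sketch: why "transaction ID last" makes strictly sequential
     * partial writes safe.  Field names are illustrative only. */
    #include <stdint.h>

    typedef struct meta_page {
        uint64_t freelist_root;  /* written first                        */
        uint64_t main_root;
        uint64_t last_pgno;
        uint64_t txnid;          /* written last, after everything above */
    } meta_page;

    /* If bytes reach the sector strictly in order, a crash mid-header can
     * never leave a new txnid paired with stale roots, so the reader may
     * simply take the meta page with the larger txnid -- no checksum needed. */
    const meta_page *pick_meta_by_txnid(const meta_page *m0, const meta_page *m1)
    {
        return m0->txnid > m1->txnid ? m0 : m1;
    }

    /* The manual-analysis cases above (e.g. "only the txn id gets into the
     * metapage, and no other data") are exactly the non-sequential tears this
     * scheme cannot detect: the new txnid lands without the new roots, the
     * header with stale pointers is chosen anyway, and the next write
     * transaction trips the mdb_page_touch assertion. */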
25/07/14 14:03 Thanu:
Hi Howard,
To be honest, we are not yet sure whether it is important for software to protect against violations of the sequential-write assumption under practical circumstances. While byte-addressable devices might only persist things sequentially, a byte-addressable file system itself might re-order bytes (what if the memcpy doesn't start from the lowest address?); at this point, however, I am probably being cynical and hypothesizing too much about the future.
However, as far as we know, there is currently no standard or guarantee that (file-system-level) writes are sequential or atomic. One of our goals in doing this project is to show the entire community how non-standardized the interaction between the application and the file system is. It is fully possible that, by the time NVMs become mainstream, standards arise and sector-size non-sequential (or even non-atomic) writes are disallowed. (It might also be possible that someone invents an NVM with high parallel bandwidth, and memcpy() explicitly parallelizes updates to contiguous bytes for higher throughput.)
Personally, with software such as LMDB, I feel that we should simply add a CRC, just in case future storage stacks behave in "reasonably weird" ways. I do understand that it might be better not to do this, for readability or performance, and simply keep an eye out for any new storage stacks.
If it is not too much trouble, can you point me to any discussions about byte-addressable storage being sequential? I have heard this mentioned once before, but I cannot find any solid references to it. Any references to sequentiality within HDDs will be useful as well.
Thanks, Thanu
On Fri, Jul 25, 2014 at 8:50 AM, Howard Chu <hyc@symas.com> wrote:
Regarding sequentiality within HDDs, that's a natural consequence of the physical construction/operation of HDDs. The platters rotate under the R/W head, and the R/W head spits out a stream of bits in sequential order. The platter always rotates in one direction, at essentially constant speed, so the R/W head cannot skip back and forth within one sector and write out of order. It's physically impossible for HDDs to write non-sequentially within a single sector.
For your reference re: SSDs, here's an old Samsung NAND Flash data sheet: http://pdf1.alldatasheet.com/datasheet-pdf/view/84340/SAMSUNG/K9F1G08U0A.html The relevant info is on page 31, "Page Program": "The device is programmed basically on a page basis, but it does allow multiple partial page programing of a word or consecutive bytes up to 2112, in a single page program cycle. The number of consecutive partial page programming operation within the same page without an intervening erase operation must not exceed 4 times for main array(1time/512byte) and 4 times for spare array(1time/16byte). The addressing should be done in sequential order in a block."
For byte-addressable storage I suppose no hard rules apply. If we're talking about NVDIMMs, they are treated basically the same as DRAM. Still, in Linux today you're most likely to see a block-device style layer imposed above them. DMA controllers tend to be programmed for auto-incrementing addresses. memcpy will tend to do the same, since CPU prefetchers are also designed with at least cache-line sequentiality. I.e., it may not be important to write to the destination in sequential order, but there is a performance benefit to reading from the source in sequential order, and there is no benefit to holding the read and reordering it before pushing to the destination.
25/07/14 15:46 Thanu:
Hi Howard,
Thanks for the explanation and the pointers. They are really helpful.
While you certainly have more working knowledge of byte-addressable storage than I do, I still believe there is an (extremely) small chance of non-sequential updates on Linux. Let us say that the cache block is 64 bytes in size, and the block-device layer modifies the mapped memory addresses corresponding to all user writes. For argument's sake, let us also say that writes are written back to the actual storage when there is a change in the (coherence) ownership of a cache block. If we consider two adjacent 64-byte blocks, and the second block is written twice by different cores, it is possible that the second block reaches the storage first.
Of course, even if such non-sequential updates become possible in the future, they would require changes to the file-system journaling machinery itself and are bound to be well publicized. I am only justifying why we considered non-sequential updates in our study (to measure the impact of such scenarios on applications). We are hoping that if anyone in the future ever debates introducing non-atomic updates, they will fully understand the application-level consequences by looking at our study.
Thanks, Thanu
25/07/14 16:46 Howard:
Fair enough. Certainly, if the question ever arose, I would be glad we have your study to point to as a reference.