For posterity, this is the conversation we had with the University of Wisconsin researchers behind this Usenix OSDI paper: https://www.usenix.org/conference/osdi14/technical-sessions/presentation/pil...
On Thu, Jul 24, 2014 at 8:10 AM, Howard Chu <hyc@symas.com> wrote:
Hello Thanu, I just saw your slides from http://wisdom.cs.wisc.edu/index.htm#news-2 and was wondering if you could give more details on the crash vulnerabilities you found in LMDB. Thanks.
25/07/14 09:42 Thanumalayan Sankaranarayana Pillai wrote:
Hi Howard,
Thanks for your interest in the slides. We appreciate you caring about the vulnerability.
The single crash vulnerability that we discovered in LMDB concerns the 106-byte update to the header while inserting a key-value pair (within the mdb_txn_commit function). Specifically, if this overwrite is not done atomically (i.e., if only a part of the 106 bytes goes to disk, and then a crash happens), then LMDB reports an error if we try to further modify the database after reboot. The error reported is "mdb_page_touch: Assertion `mp->mp_p.p_pgno != pgno' failed".
We also did some manual analysis by making the loss in atomicity correspond to field boundaries within the information contained in those 106 bytes (such as the free list data and the transaction ID), and found that LMDB sometimes loses the previous transaction. For example, if the database contains Txn-1, Txn-2, and Txn-3, and a crash happens while inserting Txn-4, the rebooted database might lose Txn-3. I have not included any such manual analysis results in the slides.
I have attached a text file with a more detailed description of the vulnerability. If you have any questions on the details, I have CC'ed Ram, who actually tested LMDB for the vulnerabilities and might be able to help you better than me.
Thus, the vulnerability in LMDB relates to the atomicity of a file overwrite that is within a 512-byte sector. With current (ext3/ext4/btrfs/xfs/zfs) file systems running atop hard drives and SSDs, this is probably not a problem. (The only report we have seen indicating that atomicity might not be maintained here is [1].)
However, byte-addressable NVRAMs in the future will probably expose this vulnerability. We are also not sure what happens with the multi-level storage stacks many people use these days (virtual machines atop a Fusion-io card connected over the network to a NAS?). In our understanding, even for these situations, LMDB can solve this vulnerability easily, since it has two root pointers.
All said, since we are not experts at using LMDB, there is a slight chance we made a mistake in how we used LMDB while testing it. Please let us know if you suspect this to be the case. Hope this helps!
[1] http://lkml.iu.edu//hypermail/linux/kernel/0908.3/00155.html
-- Thanks, Thanu
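To make the torn-header scenario described above concrete, here is a minimal sketch of the kind of fault injection being discussed. It is not the researchers' actual tool; the file names, offset, and byte ranges are made-up illustrations. The idea is simply to apply only part of the intended 106-byte header overwrite to a copy of the data file, then reopen the environment and attempt another insert.

    /* Hedged sketch of a torn-write injector (assumed workflow, not the
     * researchers' tool).  Apply only bytes [lo, hi) of the overwrite the
     * commit intended, leaving the rest of the header as it was, to emulate
     * a crash in the middle of a non-atomic header update. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    static int inject_torn_write(const char *dbfile, off_t offset,
                                 const unsigned char *intended, size_t len,
                                 size_t lo, size_t hi)
    {
        int fd;

        if (lo > hi || hi > len)
            return -1;
        fd = open(dbfile, O_WRONLY);
        if (fd < 0) {
            perror("open");
            return -1;
        }
        /* Persist only part of the intended overwrite. */
        if (pwrite(fd, intended + lo, hi - lo, offset + (off_t)lo)
            != (ssize_t)(hi - lo)) {
            perror("pwrite");
            close(fd);
            return -1;
        }
        fsync(fd);
        close(fd);
        return 0;
    }

    int main(void)
    {
        /* Hypothetical example: the intended header image was captured to
         * "intended.bin" and belongs at offset 0 of "copy-of-data.mdb";
         * apply everything except the first 36 bytes, matching the
         * automatic-analysis case in the attached notes. */
        unsigned char buf[106];
        FILE *f = fopen("intended.bin", "rb");
        if (!f) {
            perror("fopen");
            return 1;
        }
        if (fread(buf, 1, sizeof(buf), f) != sizeof(buf)) {
            fprintf(stderr, "short read of intended header image\n");
            fclose(f);
            return 1;
        }
        fclose(f);
        return inject_torn_write("copy-of-data.mdb", 0, buf, sizeof(buf),
                                 36, sizeof(buf)) ? 1 : 0;
    }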
25/07/14 11:38 Howard:
Thanks much for your reply. The results you describe are not too surprising; we assume atomic sector writes. This is an area we've spent quite a lot of time researching ourselves. I summarized our findings here last year http://www.openldap.org/its/index.cgi/Incoming?id=7668
Currently all hard drive/SSD manufacturers guarantee that 512 byte sector writes are atomic. As such, failure to write the 106 byte header is not something we account for in current LMDB releases. Also, failures of this type should result in ECC errors in the disk sector - it should be impossible to successfully read a sector that was written incorrectly in the ways you describe.
Even in extreme cases, the probability of failure to write the leading 128 out of 512 bytes of a sector is nearly nil - even on very old hard drives, before 512-byte sector write guarantees. We would have to go back nearly 30 years to find such a device, e.g.
http://bitsavers.informatik.uni-stuttgart.de/pdf/quantum/Quantum_ProDrive_40...
Page 23, Section 2.1 "No damage or loss of data will occur if power is applied or removed during drive operation, except that data may be lost in the sector being written at the time of power loss."
From the specs on page 15, the data transfer rate to/from the platters is 1.25MB/sec, so the time to write one full sector is 0.4096ms; the time to write the leading 128 bytes of the sector is thus 1/4 of that: 0.10ms. You would have to be very very unlucky to have a power failure hit the drive within this .1ms window of time. Fast-forward to present day and it's simply not an issue.
But as you say, byte-addressable storage is coming soon. The LMDB 1.x versions will add a checksum to the two headers, and will fall back to using the previous header if a checksum error is found.
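A rough sketch of that checksum-and-fallback idea might look like the following; the struct layout, field names, and the use of CRC-32 here are illustrative assumptions, not the actual LMDB 1.x meta-page format.

    /* Hedged sketch of checksummed meta headers with fallback (assumed
     * layout, not LMDB's real format). */
    #include <stddef.h>
    #include <stdint.h>

    typedef struct meta_hdr {
        uint64_t freelist_root;   /* illustrative fields only */
        uint64_t main_root;
        uint64_t last_pgno;
        uint64_t txnid;
        uint32_t checksum;        /* CRC over the fields above */
    } meta_hdr;

    /* Plain CRC-32 (reflected, polynomial 0xEDB88320); any checksum would do. */
    static uint32_t crc32_calc(const void *data, size_t len)
    {
        const uint8_t *p = data;
        uint32_t crc = 0xFFFFFFFFu;
        while (len--) {
            crc ^= *p++;
            for (int k = 0; k < 8; k++)
                crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1u));
        }
        return ~crc;
    }

    static int meta_valid(const meta_hdr *m)
    {
        return crc32_calc(m, offsetof(meta_hdr, checksum)) == m->checksum;
    }

    /* Take the newest header whose checksum verifies; if the newest one is
     * torn, fall back to the older, still-intact header. */
    static const meta_hdr *pick_meta(const meta_hdr *m0, const meta_hdr *m1)
    {
        int v0 = meta_valid(m0), v1 = meta_valid(m1);
        if (v0 && v1) return m0->txnid > m1->txnid ? m0 : m1;
        if (v0) return m0;
        if (v1) return m1;
        return NULL;   /* both headers corrupt: unrecoverable */
    }

    int main(void)
    {
        meta_hdr a = { 1, 2, 3, 41, 0 };   /* older committed header */
        meta_hdr b = { 4, 5, 6, 42, 0 };   /* newer committed header */
        a.checksum = crc32_calc(&a, offsetof(meta_hdr, checksum));
        b.checksum = crc32_calc(&b, offsetof(meta_hdr, checksum));
        b.main_root = 999;                 /* simulate a torn write of b */
        return pick_meta(&a, &b) == &a ? 0 : 1;   /* reader falls back to a */
    }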
Thanks again for your work, looking forward to reading your full paper in the future.
25/07/14 13:14 Thanu:
Hi Howard,
Great to know that so much thought was already put into this issue. (We usually see people assume things about the storage stack that are simply false, without any notion of when they will become false if they aren't already.) I am looking forward to all the cool new things that will be in LMDB 1.0.
Thanks, Thanu
25/07/14 13:15 Howard:
Hi again, looking over these vulnerabilities, they all seem unrealistic even given the existence of byte-addressable storage devices. In particular, even if you assume writes may be non-atomic, why do you assume that writes will be non-sequential within a sector?
For example, if we were talking about an HDD that did not guarantee atomic sector writes, we still know that bytes are written sequentially from beginning of sector to end of sector; this is one reason why the LMDB meta page is laid out in the way that it is - the transaction ID is last because if a partial write occurs and the txn ID doesn't make it to disk, this meta page will be ignored and the previous one will be used automatically.
In the tests you describe here, you have broken this sequential assumption, but there are no storage devices in existence that can fail in the manners you describe. And even in byte-addressable memories, DMA or memcpy routines will always start at a low-numbered address and work sequentially higher. Do you really believe it is important for software to protect against this type of failure scenario?
Found using automatic analysis
Write (106 bytes) at lib/mdb.c:3227 [mdb_txn_commit], split into three parts; the first part (the first 36 bytes) alone does not go to disk.
Error: lib/mdb.c:1921: mdb_page_touch: Assertion `mp->mp_p.p_pgno != pgno' failed.
Exact situations where the vulnerabilities arise, found using manual analysis
- Only the txn id gets into the metapage, and no other data: the previous transaction is lost.
- The free list data, txn id, and last page go in, but not the main db details: the previous transaction is lost. Error: lib/mdb.c:1921: mdb_page_touch: Assertion `mp->mp_p.p_pgno != pgno' failed.
- Only the main db (2 fields) goes in, with no free pointer: Error: lib/mdb.c:1921: mdb_page_touch: Assertion `mp->mp_p.p_pgno != pgno' failed.
- Everything gets in except the database pointer (48 bytes), but the crash happens such that the second 24 bytes (which contain the root) are missed: the previous transaction is lost. Error: lib/mdb.c:1921: mdb_page_touch: Assertion `mp->mp_p.p_pgno != pgno' failed.
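To connect the ordering argument in Howard's 25/07/14 13:15 note with the cases above, here is a hedged sketch of a simplified meta-page layout in which the transaction ID comes last; the field names are illustrative assumptions, not LMDB's actual struct.

    /* Hedged sketch: why "transaction ID last" makes strictly sequential
     * partial writes safe.  Field names are illustrative only. */
    #include <stdint.h>

    typedef struct meta_page {
        uint64_t freelist_root;  /* written first                        */
        uint64_t main_root;
        uint64_t last_pgno;
        uint64_t txnid;          /* written last, after everything above */
    } meta_page;

    /* If bytes reach the sector strictly in order, a crash mid-header can
     * never leave a new txnid paired with stale roots, so the reader may
     * simply take the meta page with the larger txnid -- no checksum needed. */
    const meta_page *pick_meta_by_txnid(const meta_page *m0, const meta_page *m1)
    {
        return m0->txnid > m1->txnid ? m0 : m1;
    }

    /* The manual-analysis cases above (e.g. "only the txn id gets into the
     * metapage, and no other data") are exactly the non-sequential tears this
     * scheme cannot detect: the new txnid lands without the new roots, the
     * header with stale pointers is chosen anyway, and the next write
     * transaction trips the mdb_page_touch assertion. */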
25/07/14 14:03 Thanu:
Hi Howard,
To be honest, we are not yet sure whether it is important for software to protect against violations of the sequential-write assumption under practical circumstances. While byte-addressable devices might only persist things sequentially, a byte-addressable file system itself might re-order bytes (what if the memcpy doesn't start from the lowest address?); at this point, however, I am probably being cynical and hypothesizing too much about the future.
However, as far as we know, there is currently no standard or guarantee that (file-system-level) writes are sequential or atomic. One of our goals in doing this project is to show the entire community how non-standardized the interaction between the application and the file system is. It is fully possible that, by the time NVMs become mainstream, standards arise and sector-size non-sequential (or even non-atomic) writes are disallowed. (It might also be possible that someone invents an NVM with high parallel bandwidth, and memcpy() explicitly parallelizes updates to contiguous bytes for higher throughput.)
Personally, with software such as LMDB, I feel that we should simply add a CRC, just in case future storage stacks behave in "reasonably weird" ways. I do understand that it might be better not to do this, for readability or performance, and simply keep an eye out for any new storage stacks.
If it is not too much trouble, can you point me to any discussions about byte-addressable storage being sequential? I have heard this mentioned once before, but I cannot find any solid references to it. Any references to sequentiality within HDDs will be useful as well.
Thanks, Thanu
On Fri, Jul 25, 2014 at 8:50 AM, Howard Chu <hyc@symas.com> wrote:
Regarding sequentiality within HDDs, that's a natural consequence of the physical construction/operation of HDDs. The platters rotate under the R/W head, and the R/W head spits out a stream of bits in sequential order. The platter always rotates in one direction, at essentially constant speed, so the R/W head cannot skip back and forth within one sector and write out of order. It's physically impossible for HDDs to write non-sequentially within a single sector.
For your reference re: SSDs, here's an old Samsung NAND Flash data sheet: http://pdf1.alldatasheet.com/datasheet-pdf/view/84340/SAMSUNG/K9F1G08U0A.html The relevant info is on page 31, "Page Program": "The device is programmed basically on a page basis, but it does allow multiple partial page programing of a word or consecutive bytes up to 2112, in a single page program cycle. The number of consecutive partial page programming operation within the same page without an intervening erase operation must not exceed 4 times for main array(1time/512byte) and 4 times for spare array(1time/16byte). The addressing should be done in sequential order in a block."
For byte-addressable storage I suppose no hard rules apply. If we're talking about NVDIMMs, they are treated basically the same as DRAM. Still, in Linux today you're most likely to see a block-device style layer imposed above them. DMA controllers tend to be programmed for auto-incrementing addresses. memcpy will tend to do the same, since CPU prefetchers are also designed with at least cache-line sequentiality. I.e., it may not be important to write to the destination in sequential order, but there is a performance benefit to reading from the source in sequential order, and there is no benefit to holding the read and reordering it before pushing to the destination.
25/07/14 15:46 Thanu:
Hi Howard,
Thanks for the explanation and the pointers. They are really helpful.
While you certainly have more working knowledge of byte-addressable storage than I do, I still believe there is an (extremely) small chance of non-sequential updates on Linux. Let us say that the cache block is 64 bytes in size, and the block-device layer modifies the mapped memory addresses corresponding to all user writes. For argument's sake, let us also say that writes are written back to the actual storage when there is a change in the (coherence) ownership of a cache block. If we consider two adjacent 64-byte blocks, and the second block is written twice by different cores, it is possible that the second block reaches the storage first.
Of course, even if such non-sequential updates become possible in the future, they would require changes to the file-system journaling machinery itself and are bound to be well publicized. I am only justifying why we considered non-sequential updates in our study (to measure the impact of such scenarios on applications). We are hoping that if anyone in the future ever debates introducing non-atomic updates, they will fully understand the application-level consequences by looking at our study.
Thanks, Thanu
25/07/14 16:46 Howard:
Fair enough. Certainly, if the question ever arose, I would be glad we have your study to point to as a reference.