Hi there,
I have a few questions about MDB, and I have some things I'd like to work on.
The docs mention binary searching in a few places. It's not 100% clear, but I assume this means a binary search of the keys within a B-tree node, not that MDB is a binary search tree.
How does MDB provide crash resilience on the free pages?
According to the man page, free() should only be called on memory from malloc(), but I see that free() is used on mmap'd pages in mdb_dpage_free. I must be missing something here.
Anyway, I have two things I want to work on.
The simple one is when pages are moved from the txn free list to the env free list (I hope that's correct), it would be good to call madvise(MADV_REMOVE) on the data section.
The reason for this is that the madvise call will allow supported filesystems to hole punch the sparse file, allowing space reclamation - without MDB needing to worry about it!
The much more invasive change I want to work on is page checksumming. Basically there are 4 cases I have in mind
* No checksumming (today)
* Metadata checksumming only
* Metadata and data checksumming
These could be used in these scenarios:
* write checksums but don't verify them at run time
* write checksums, and only verify metadata on read (possibly a good default option)
* write checksums, and verify metadata and data on read (slowest, but has strong integrity properties for some applications)
And in all cases I want to add an "mdb_verify" command that would assert all of these are also correct offline.
There are a few reasons for this
* Hardware is unreliable. RAM, disks, cables, even CPU cache memory can all exhibit bit flips and other data loss. Changing a bit in a pointer can damage any data structure, which leads to crashes or silent corruption.
* Software is never perfect. Checksumming allows detection of overwrites of data from overflows or other mistakes that we as humans all make.
I'd opt to use something fast like crc32c (Intel provides hardware to accelerate this with -march=native). The only issue I see is that this would require an on-disk structure change, because the current structs don't have space for this, and the csums have to be *first*.
http://www.lmdb.tech/doc/group__internal.html#structMDB__page
The checksum would have to be the first value *or* the last value of the page header (so that it can be updated without affecting the result of the checksum). The checksum for the data would have to be within the header so that this is asserted as correct.
Is this something I should pursue? Would this require an on-disk format change? Is there something that could be done to avoid this?
Thanks,
William
William Brown wrote:
Hi there,
I have a few questions about MDB, and I have some things I'd like to work on.
The current name is LMDB.
The docs mention binary searching in a few places. It's not 100% clear, but I assume this means a binary search of the keys within a B-tree node, not that MDB is a binary search tree.
There's no need to assume. https://symas.com/lmdb/technical/#pubs
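For readers following along, the distinction in the question can be sketched in a few lines. This is only an illustration in Python, not LMDB's C code: search_node is a hypothetical helper, and a node is modelled as a plain sorted list of byte-string keys. The tree itself stays a B+tree; the binary search happens inside one node.

```python
from bisect import bisect_left

def search_node(keys, key):
    """Binary-search one node's sorted keys (hypothetical helper).

    Returns (index, exact): index is the first slot whose key is >= key,
    exact says whether that slot holds the key itself.
    """
    i = bisect_left(keys, key)
    exact = i < len(keys) and keys[i] == key
    return i, exact

# One node holding three sorted keys.
node = [b"apple", b"cherry", b"melon"]
hit = search_node(node, b"cherry")    # key present in the node
miss = search_node(node, b"banana")   # key absent, index is the insertion slot
```

On a miss, the returned index is the slot where the key would be inserted; in a branch node a lookup would use that position to pick the child to descend into.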
How does MDB provide crash resilience on the free pages?
According to the man page, free() should only be called on memory from malloc(), but I see that free() is used on mmap'd pages in mdb_dpage_free. I must be missing something here.
We obviously do not call free() on mmap'd pages. Mmap'd pages just sit there.
Anyway, I have two things I want to work on.
The simple one is when pages are moved from the txn free list to the env free list (I hope that's correct), it would be good to call madvise(MADV_REMOVE) on the data section.
No, it wouldn't.
The reason for this is that the madvise call will allow supported filesystems to hole punch the sparse file, allowing space reclamation - without MDB needing to worry about it!
Freespace reclamation is just added overhead. The pages will be reused in a future transaction anyway, hole punching would just make the filesystem do more work reassigning them back to the DB later.
The much more invasive change I want to work on is page checksumming.
This already exists in LMDB 1.0, along with page-level encryption.
Basically there are 4 cases I have in mind
- No checksumming (today)
- Metadata checksumming only
- Metadata and data checksumming
These could be used in these scenarios:
- write checksums but don't verify them at run time
- write checksums, and only verify metadata on read (possibly a good
default option)
- write checksums, and verify metadata and data on read (slowest, but
has strong integrity properties for some applications)
And in all cases I want to add an "mdb_verify" command that would assert all of these are also correct offline.
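The three run-time scenarios plus the offline check amount to a small policy decision per page read. A minimal sketch of that decision, assuming hypothetical names throughout (Verify, should_verify and the offline flag are not LMDB options):

```python
from enum import Enum

class Verify(Enum):
    """Hypothetical run-time verification policies from the list above."""
    NONE = 0   # write checksums, never verify at run time
    META = 1   # verify metadata pages only
    ALL = 2    # verify metadata and data pages

def should_verify(mode, is_meta_page, offline=False):
    """Decide whether a page's checksum is checked on this read."""
    if offline:
        # An offline "mdb_verify"-style pass checks everything.
        return True
    if mode is Verify.ALL:
        return True
    if mode is Verify.META:
        return is_meta_page
    return False
```

Checksums are written in every mode; only the read-side cost changes.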
There are a few reasons for this
- Hardware is unreliable. RAM, disks, cables, even CPU cache memory can
all exhibit bit flips and other data loss. Changing a bit in a pointer can damage any data structure, which leads to crashes or silent corruption.
IMO none of this is relevant. Data centers that require reliability will use ECC and redundant hardware. If you're not using these things, then clearly reliability isn't a high priority for you.
CPU caches are all ECC protected already, as are storage drives. The correct place to check for corruption above the drive is at the filesystem layer.
The main reason we added checksum support is as a side-effect of providing a space for the signature in authenticated encryption.
- Software is never perfect - checksumming allows detection of overwrites
of data from overflows or other mistakes that we as humans all make.
By default, with a read-only memory map, unintended overwrites of data are not possible.
I'd opt to use something fast like crc32c (Intel provides hardware to accelerate this with -march=native). The only issue I see is that this would require an on-disk structure change, because the current structs don't have space for this, and the csums have to be *first*.
The checksums can live wherever you want to put them. You just skip over a gap where you insert the result later. My inclination is to put them near the page header, but it depends on a few other decisions as well.
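The "skip over a gap" approach can be sketched as follows. This is an illustration in Python rather than LMDB's C code, and the layout is an assumption: PAGE_SIZE, CSUM_OFF and CSUM_LEN are made-up values, not fields of any LMDB page format. The CRC-32C here is a plain table-driven software version, not the hardware-accelerated one.

```python
import struct

CRC32C_POLY = 0x82F63B78  # reflected Castagnoli polynomial

def _make_table():
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ CRC32C_POLY if crc & 1 else crc >> 1
        table.append(crc)
    return table

_TABLE = _make_table()

def crc32c(data, crc=0):
    """Software CRC-32C; pass a previous result as crc to continue a run."""
    crc ^= 0xFFFFFFFF
    for b in data:
        crc = _TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF

# Hypothetical layout: a 4-byte checksum slot at byte 16 of a 4096-byte page.
PAGE_SIZE = 4096
CSUM_OFF = 16
CSUM_LEN = 4

def page_checksum(page):
    """Checksum everything except the slot that will hold the result."""
    crc = crc32c(page[:CSUM_OFF])
    return crc32c(page[CSUM_OFF + CSUM_LEN:], crc)

def seal_page(page):
    """Store the checksum into the gap (page must be a bytearray)."""
    struct.pack_into("<I", page, CSUM_OFF, page_checksum(page))

def verify_page(page):
    (stored,) = struct.unpack_from("<I", page, CSUM_OFF)
    return stored == page_checksum(page)
```

Because the slot is excluded from the computation, writing the result into it does not change the checksum, so the slot does not need to be the first or last field of the header.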
http://www.lmdb.tech/doc/group__internal.html#structMDB__page
The checksum would have to be the first value *or* the last value of the page header (so that it can be updated without affecting the result of the checksum). The checksum for the data would have to be within the header so that this is asserted as correct.
Is this something I should pursue? Would this require an on-disk format change? Is there something that could be done to avoid this?
I see no way to do this without a format change. This is why the feature has waited till LMDB 1.0 for rollout. There are multiple format-changing features in LMDB 1.0 and the on-disk format is still in flux there.
If you want to discuss this further, we should use the openldap-devel mailing list instead.
Thanks,
William
openldap-technical@openldap.org