Hi there,
I have a few questions about MDB, and I have some things I'd like to work on.
In the docs there are a few references to binary searching. It's not 100% clear, but I assume this means a binary search of the keys within a B-tree node, not that MDB is a binary search tree.
How does MDB provide crash resilience on the free pages?
According to the man page, free() should only be called on memory obtained from malloc(), but I see that you call free() on mmapped pages in mdb_dpage_free. I must be missing something here.
Anyway, I have two things I want to work on.
The simple one is when pages are moved from the txn free list to the env free list (I hope that's correct), it would be good to call madvise(MADV_REMOVE) on the data section.
The reason for this is that the madvise call will allow supported filesystems to hole punch the sparse file, allowing space reclamation - without MDB needing to worry about it!
The much more invasive change I want to work on is page checksumming. Basically there are three cases I have in mind:
* No checksumming (today)
* Metadata checksumming only
* Metadata and data checksumming
These could be used in these scenarios:
* write checksums but don't verify them at run time
* write checksums, and only verify metadata on read (possibly a good default option)
* write checksums, and verify metadata and data on read (slowest, but has strong integrity properties for some applications)
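As a sketch of how the write cases and the read-time scenarios could combine, something like the following might work. All of the names are hypothetical (nothing like this exists in MDB today), and exactly what counts as "metadata" versus "data" is left open here.

```c
#include <stdbool.h>

/* Hypothetical names for illustration only; none of these exist in MDB. */

/* What gets checksummed when a page is written. */
enum csum_write {
    CSUM_WRITE_NONE,           /* no checksumming (today) */
    CSUM_WRITE_META,           /* metadata only */
    CSUM_WRITE_META_AND_DATA   /* metadata and data */
};

/* What gets verified when a page is read at run time. */
enum csum_verify {
    CSUM_VERIFY_NONE,          /* write checksums but never check them */
    CSUM_VERIFY_META,          /* check metadata only */
    CSUM_VERIFY_ALL            /* check metadata and data */
};

/* Which part of a page a given checksum protects. */
enum csum_target {
    CSUM_TARGET_METADATA,
    CSUM_TARGET_DATA
};

/* A checksum can only be verified if it was written in the first place. */
static bool should_verify(enum csum_write w, enum csum_verify v,
                          enum csum_target t)
{
    bool written = (w == CSUM_WRITE_META_AND_DATA) ||
                   (w == CSUM_WRITE_META && t == CSUM_TARGET_METADATA);
    bool wanted  = (v == CSUM_VERIFY_ALL) ||
                   (v == CSUM_VERIFY_META && t == CSUM_TARGET_METADATA);

    return written && wanted;
}
```

Keeping the write setting separate from the verify setting means an offline checker could still validate everything that was written, regardless of what is verified at run time.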
In all cases I also want to add an "mdb_verify" command that would check offline that all of these checksums are correct.
There are a few reasons for this:
* Hardware is unreliable. RAM, disk, cables, even CPU cache memory can all exhibit bit flips and other data loss. Changing a single bit in a pointer can damage any data structure, and that flows on to crashes or silent corruption.
* Software is never perfect. Checksumming allows detection of overwrites of data from overflows or other mistakes that we as humans all make.
I'd opt to use something fast like crc32c (Intel CPUs with SSE4.2 provide a hardware instruction for it, which the compiler can use with -march=native). The only issue I see is that this would require an on-disk structure change, because the current structs don't have space for it, and the csums have to sit at the very start (or end) of the header.
http://www.lmdb.tech/doc/group__internal.html#structMDB__page
The checksum would have to be the first value *or* the last value of the page header (so that it can be updated without affecting the result of the checksum). The checksum for the data would have to live inside the header, so that the header checksum also covers it and it can be trusted.
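Here is a minimal sketch of the "checksum first" layout, to show why the position matters. It is a simplification of what I described: a single CRC32C stored in the first 4 bytes of the page and covering everything after it, rather than separate header and data checksums. The function names and the layout are my own assumptions, not MDB's format, and the CRC is a plain bitwise software version (a real implementation would want the hardware instruction).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CSUM_SIZE sizeof(uint32_t)

/* Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t crc32c(uint32_t crc, const void *buf, size_t len)
{
    const unsigned char *p = buf;

    crc = ~crc;
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Hypothetical layout: bytes [0, 4) hold the checksum of bytes [4, page_size). */
static void page_seal(unsigned char *page, size_t page_size)
{
    uint32_t c = crc32c(0, page + CSUM_SIZE, page_size - CSUM_SIZE);
    memcpy(page, &c, CSUM_SIZE);
}

/* Returns 1 if the stored checksum matches the page contents, else 0. */
static int page_check(const unsigned char *page, size_t page_size)
{
    uint32_t stored;

    memcpy(&stored, page, CSUM_SIZE);
    return stored == crc32c(0, page + CSUM_SIZE, page_size - CSUM_SIZE);
}

/* Offline check in the spirit of "mdb_verify": counts pages that fail. */
static size_t verify_pages(const unsigned char *map, size_t page_size,
                           size_t npages)
{
    size_t bad = 0;

    for (size_t i = 0; i < npages; i++)
        if (!page_check(map + i * page_size, page_size))
            bad++;
    return bad;
}
```

Because the checksum field sits outside the range it covers, it can be rewritten after the rest of the page is final without invalidating itself.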
Is this something I should pursue? Would this require an on-disk format change? Is there something that could be done to avoid this?
Thanks,
William