This post outlines a few changes to LMDB that I had to make for a specific use case. I’d like to see those changes upstream, but I understand that they may not be relevant for e.g. OpenLDAP. The use case is multiple databases on disks with long-running, large write transactions.
1. Option to not use custom memory allocator/page pool
LMDB has a custom malloc() implementation that re-uses pages (me_dpages). I understand that this improves performance a bit (depending on the malloc implementation). But there should at least be the option to not do that (for many reasons). I would even make not using it the default.
2. Large transactions and spilling
In a large write transaction, it will by default use a lot of memory (512 MiB) which won’t get freed when the transaction commits (see 1.). If one has a lot of databases, this adds up to a lot of memory that never gets freed.
Alternatively, one can use MDB_WRITEMAP, but (i) by default Linux isn’t tuned to delay writing pages to disk and (ii) before commit LMDB has to remove a dirty bit, so each page is written twice.
Both problems could be addressed by making the point at which pages get spilled configurable (currently mt_dirty_room, initialised from MDB_IDL_UM_MAX) and by reducing the default amount of non-spilled memory, at least for the MDB_WRITEMAP case. If this amount is low, mt_spill_pgs gets sorted often, so it might need to be converted to a different data structure (e.g. a red-black tree).
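To make this concrete, a rough sketch of the kind of knob I mean (all names here are invented for illustration; this is not existing LMDB code):

    /* Illustration only: a per-environment spill threshold instead of the
     * fixed MDB_IDL_UM_MAX-derived budget. */
    typedef unsigned long pgcount_t;

    struct env_cfg {
        pgcount_t spill_threshold;   /* dirty pages kept in memory before spilling */
    };

    struct txn_state {
        const struct env_cfg *env;
        pgcount_t dirty_pages;       /* pages dirtied so far in this write txn */
    };

    /* Would replace the fixed "mt_dirty_room exhausted" test. */
    static int should_spill(const struct txn_state *txn)
    {
        return txn->dirty_pages >= txn->env->spill_threshold;
    }

With a low threshold the spill set gets touched and re-sorted much more often, which is why mt_spill_pgs would probably need a cheaper data structure than a sorted list.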
3. LMDB causes crashes if database is corrupted
If the database is corrupted it can cause the application to crash. I have fixed those cases when they (randomly) occurred. Properly fixing this would probably be best done with some fuzzing.
4. Allow LMDB to reside on a device
I used dm-cache to improve LMDB read performance. LMDB needed a bit of adjustment to get the correct size of the device via the BLKGETSIZE64 ioctl.
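For reference, the size query boils down to something like this (a minimal sketch, not the exact patch I applied):

    /* Minimal sketch: determine the usable size of the backing store,
     * whether it is a regular file or a block device. */
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <sys/stat.h>
    #include <linux/fs.h>   /* BLKGETSIZE64 */

    static int backing_size(int fd, uint64_t *size)
    {
        struct stat st;
        if (fstat(fd, &st) != 0)
            return -1;
        if (S_ISBLK(st.st_mode))
            return ioctl(fd, BLKGETSIZE64, size);  /* block device: ask the kernel */
        *size = (uint64_t)st.st_size;              /* regular file: st_size is enough */
        return 0;
    }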
--
I’ve fixed those issues w.r.t. my application. If there is interest in any of those application-specific changes, I’ll clean them up and post them.
martin@urbackup.org wrote:
This post outlines a few changes to LMDB that I had to make for a specific use case. I’d like to see those changes upstream, but I understand that they may not be relevant for e.g. OpenLDAP. The use case is multiple databases on disks with long-running, large write transactions.
- Option to not use custom memory allocator/page pool
LMDB has a custom malloc() implementation that re-uses pages (me_dpages). I understand that this improves performance a bit (depending on the malloc implementation). But there should at least be the option to not do that (for many reasons). I would even make not using it the default.
Not going to happen. But maybe it would be reasonable to allow configuring a limit on how many pages it keeps hanging around, before actually using libc free() on them.
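I.e. something along these lines (just a sketch of the idea, not actual code):

    /* Sketch: keep at most max_cached pages on the me_dpages-style free
     * list and hand anything beyond that back to libc. */
    #include <stdlib.h>

    struct dpage { struct dpage *next; /* page data follows */ };

    struct dpage_pool {
        struct dpage *head;
        size_t        count;
        size_t        max_cached;   /* the configurable limit */
    };

    static void dpage_release(struct dpage_pool *pool, struct dpage *dp)
    {
        if (pool->count < pool->max_cached) {
            dp->next = pool->head;   /* keep for reuse */
            pool->head = dp;
            pool->count++;
        } else {
            free(dp);                /* over the limit: return to the system */
        }
    }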
- Large transactions and spilling
In a large write transaction, it will by default use a lot of memory (512 MiB) which won’t get freed when the transaction commits (see 1.). If one has a lot of databases, this adds up to a lot of memory that never gets freed.
Alternatively, one can use MDB_WRITEMAP, but (i) by default Linux isn’t tuned to delay writing pages to disk and (ii) before commit LMDB has to remove a dirty bit, so each page is written twice.
There is no more dirty bit in LMDB 1.0, and this double-write no longer happens.
- LMDB causes crashes if database is corrupted
You can enable per-page checksums in LMDB 1.0, in which case you'll just get an error code if a page is corrupted (and the checksum fails to match). The DB will still be unusable if anything is corrupted.
- Allow LMDB to reside on a device
LMDB 1.0 supports storage on raw devices.
Thanks! Looks like my specific issues are well on the radar.
On 22.10.2020 20:40 Howard Chu wrote:
martin@urbackup.org wrote:
This post outlines a few changes to LMDB that I had to make for a specific use case. I’d like to see those changes upstream, but I understand that they may not be relevant for e.g. OpenLDAP. The use case is multiple databases on disks with long-running, large write transactions.
- Option to not use custom memory allocator/page pool
LMDB has a custom malloc() implementation that re-uses pages (me_dpages). I understand that this improves performance a bit (depending on the malloc implementation). But there should at least be the option to not do that (for many reasons). I would even make not using it the default.
Not going to happen. But maybe it would be reasonable to allow configuring a limit on how many pages it keeps hanging around, before actually using libc free() on them.
A limit would work.
- Large transactions and spilling
In a large write transaction, it will by default use a lot of memory (512 MiB) which won’t get freed when the transaction commits (see 1.). If one has a lot of databases, this adds up to a lot of memory that never gets freed.
Alternatively, one can use MDB_WRITEMAP, but (i) by default Linux isn’t tuned to delay writing pages to disk and (ii) before commit LMDB has to remove a dirty bit, so each page is written twice.
There is no more dirty bit in LMDB 1.0, and this double-write no longer happens.
That'd improve the MDB_WRITEMAP case. The general problem there is that tuning of the write-back behaviour is system-wide (e.g. vm.dirty_expire_centisecs on Linux) and mostly not under the application's control, so if one wants to be sure that writeback only happens once sufficient changes have accumulated, one has to avoid MDB_WRITEMAP. In that case it would be great to be able to control the memory usage/write-back behaviour by configuring the amount at which it spills, as well.
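E.g. the relevant knob lives in a global sysctl, so the only way to inspect or change it is system-wide (illustration only, nothing LMDB-specific):

    /* Reads the current writeback expiry; changing it affects every
     * process on the machine, not just the LMDB application. */
    #include <stdio.h>

    int main(void)
    {
        long centisecs = 0;
        FILE *f = fopen("/proc/sys/vm/dirty_expire_centisecs", "r");
        if (f && fscanf(f, "%ld", &centisecs) == 1)
            printf("dirty pages are flushed after %ld ms, system-wide\n",
                   centisecs * 10);
        if (f)
            fclose(f);
        return 0;
    }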
- LMDB causes crashes if database is corrupted
You can enable per-page checksums in LMDB 1.0, in which case you'll just get an error code if a page is corrupted (and the checksum fails to match). The DB will still be unusable if anything is corrupted.
That would fix the problem properly. Does it check that it is the correct transaction as well (e.g. by putting a transid into the page like btrfs)? Returning wrong results or MDB_CORRUPTED is something my application can handle (but not crashes obviously).
Martin Raiber wrote:
Thanks! Looks like my specific issues are well on the radar.
On 22.10.2020 20:40 Howard Chu wrote:
martin@urbackup.org wrote:
This post outlines a few changes to LMDB that I had to make for a specific use case. I’d like to see those changes upstream, but I understand that they may not be relevant for e.g. OpenLDAP. The use case is multiple databases on disks with long-running, large write transactions.
- Option to not use custom memory allocator/page pool
LMDB has a custom malloc() implementation that re-uses pages (me_dpages). I understand that this improves performance a bit (depending on the malloc implementation). But there should at least be the option to not do that (for many reasons). I would even make not using it the default.
Not going to happen. But maybe it would be reasonable to allow configuring a limit on how many pages it keeps hanging around, before actually using libc free() on them.
A limit would work.
Feel free to submit an enhancement request for this on the OpenLDAP ITS.
- Large transactions and spilling
In a large write transaction, it will by default use a lot of memory (512 MiB) which won’t get freed when the transaction commits (see 1.). If one has a lot of databases, this adds up to a lot of memory that never gets freed.
Alternatively, one can use MDB_WRITEMAP, but (i) by default Linux isn’t tuned to delay writing pages to disk and (ii) before commit LMDB has to remove a dirty bit, so each page is written twice.
There is no more dirty bit in LMDB 1.0, and this double-write no longer happens.
That'd improve the MDB_WRITEMAP case. The general problem there is that tuning of the write-back behaviour is system-wide (e.g. vm.dirty_expire_centisecs on Linux) and mostly not under the application's control, so if one wants to be sure that writeback only happens once sufficient changes have accumulated, one has to avoid MDB_WRITEMAP. In that case it would be great to be able to control the memory usage/write-back behaviour by configuring the amount at which it spills, as well.
This sounds like a pretty esoteric tuning knob. Really goes against the philosophy of LMDB's simplicity.
- LMDB causes crashes if database is corrupted
You can enable per-page checksums in LMDB 1.0, in which case you'll just get an error code if a page is corrupted (and the checksum fails to match). The DB will still be unusable if anything is corrupted.
That would fix the problem properly. Does it check that it is the correct transaction as well (e.g. by putting a transid into the page like btrfs)? Returning wrong results or MDB_CORRUPTED is something my application can handle (but not crashes obviously).
The txnid is part of the page header, which is one of the incompatible format changes from LMDB 0.9. This is what allows us to eliminate the dirty bit.
Not sure what you mean about the txnid being correct or not, but certainly it is included in the checksum.
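Conceptually it works like this (this is not the actual 1.0 page layout, just an illustration of why a stale page fails verification):

    /* Conceptual sketch only: the txnid lives in the page header and the
     * per-page checksum covers it, so an old page that was never
     * overwritten does not verify against what the reader expects. */
    #include <stddef.h>
    #include <stdint.h>

    struct page_hdr {
        uint64_t p_txnid;     /* txn that last wrote this page */
        uint64_t p_checksum;  /* covers the rest of the header + payload */
    };

    static uint64_t fnv1a64(const void *buf, size_t len)   /* stand-in checksum */
    {
        const unsigned char *p = buf;
        uint64_t h = 1469598103934665603ULL;
        while (len--) { h ^= *p++; h *= 1099511628211ULL; }
        return h;
    }

    static int page_verify(const struct page_hdr *hdr,
                           const void *payload, size_t payload_len)
    {
        struct page_hdr tmp = *hdr;
        tmp.p_checksum = 0;                         /* exclude the stored value */
        uint64_t sum = fnv1a64(&tmp, sizeof tmp) ^ fnv1a64(payload, payload_len);
        return sum == hdr->p_checksum ? 0 : -1;     /* -1: treat as corrupted */
    }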
On 27.10.2020 18:52 Howard Chu wrote:
- LMDB causes crashes if database is corrupted
You can enable per-page checksums in LMDB 1.0, in which case you'll just get an error code if a page is corrupted (and the checksum fails to match). The DB will still be unusable if anything is corrupted.
That would fix the problem properly. Does it check that it is the correct transaction as well (e.g. by putting a transid into the page like btrfs)? Returning wrong results or MDB_CORRUPTED is something my application can handle (but not crashes obviously).
The txnid is part of the page header, which is one of the incompatible format changes from LMDB 0.9. This is what allows us to eliminate the dirty bit.
Not sure what you mean about the txnid being correct or not, but certainly it is included in the checksum.
A common problem that e.g. btrfs users encounter is that a disk drops some writes. If there was a page at the same location previously, the checksum check succeeds. But btrfs stores the transid of the page in the page's parent, so it compares that as well (the error message is "btrfs parent transid verify failed on OFFSET wanted TRANSID found TRANSID"). I think ZFS stores the checksum of the page in the page's parent as well (I don't know if this would work with LMDB's B-tree).
I guess a simple check (that might already exist) is checking if page transid <= root/meta page transid. But that doesn't catch the case where the root page was updated but updates to other pages were dropped (the disk might also drop a complete "transaction" without reporting an error, in which case the next transaction then writes a root page pointing to an incompletely written tree).
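In pseudo-C, the kind of check I mean would look roughly like this (hypothetical; the names are invented and none of this exists in LMDB):

    /* Sketch of the checks described above.  expected_txnid would have to
     * be stored next to the child pointer in the parent page, the way
     * btrfs records a transid (or ZFS a checksum) for each child. */
    #include <stdint.h>

    struct page_ref {
        uint64_t pgno;           /* child page number */
        uint64_t expected_txnid; /* txnid the parent recorded when it was written */
    };

    static int child_page_check(uint64_t child_txnid,
                                const struct page_ref *ref,
                                uint64_t meta_txnid)
    {
        if (child_txnid > meta_txnid)
            return -1;   /* simple check: page claims to be newer than the root */
        if (child_txnid != ref->expected_txnid)
            return -1;   /* the "parent transid verify failed" case */
        return 0;        /* looks consistent */
    }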
Martin Raiber <martin@urbackup.org> wrote on 19.01.2021 at 19:26 in message <010201771be5823e-1014e8bb-0f79-4506-b27e-d08d25400b1e-000000@eu-west-1.amazonse.com>:
On 27.10.2020 18:52 Howard Chu wrote:
- LMDB causes crashes if database is corrupted
You can enable per-page checksums in LMDB 1.0, in which case you'll just get an error code if a page is corrupted (and the checksum fails to match). The DB will still be unusable if anything is corrupted.
That would fix the problem properly. Does it check that it is the correct transaction as well (e.g. by putting a transid into the page like btrfs)? Returning wrong results or MDB_CORRUPTED is something my application can handle (but not crashes obviously).
The txnid is part of the page header, which is one of the incompatible format changes from LMDB 0.9. This is what allows us to eliminate the dirty bit.
Not sure what you mean about the txnid being correct or not, but certainly it is included in the checksum.
A common problem that e.g. btrfs users encounter is that a disk drops
If "a disk drops some writes" it's definitely not a problem of BtrFS. Dor you mean "BtrFS drops some writes"? I don't get it.
some writes. If there was a page at the same location previously, the checksum check succeeds. But btrfs stores the transid of the page in the page's parent, so it compares that as well (the error message is "btrfs parent transid verify failed on OFFSET wanted TRANSID found TRANSID"). I think ZFS stores the checksum of the page in the page's parent as well (I don't know if this would work with LMDB's B-tree).
I guess a simple check (that might already exist) is checking if page transid <= root/meta page transid. But that doesn't catch the case where the root page was updated but updates to other pages were dropped (the disk might also drop a complete "transaction" without reporting an error, in which case the next transaction then writes a root page pointing to an incompletely written tree).
Ulrich Windl wrote:
Martin Raiber <martin@urbackup.org> wrote on 19.01.2021 at 19:26 in message <010201771be5823e-1014e8bb-0f79-4506-b27e-d08d25400b1e-000000@eu-west-1.amazonse.com>:
On 27.10.2020 18:52 Howard Chu wrote:
- LMDB causes crashes if database is corrupted
You can enable per-page checksums in LMDB 1.0, in which case you'll just get an error code if a page is corrupted (and the checksum fails to match). The DB will still be unusable if anything is corrupted.
That would fix the problem properly. Does it check that it is the correct transaction as well (e.g. by putting a transid into the page like btrfs)? Returning wrong results or MDB_CORRUPTED is something my application can handle (but not crashes obviously).
The txnid is part of the page header, which is one of the incompatible format changes from LMDB 0.9. This is what allows us to eliminate the dirty bit.
Not sure what you mean about the txnid being correct or not, but certainly it is included in the checksum.
A common problem that e.g. btrfs users encounter is that a disk drops
If "a disk drops some writes" it's definitely not a problem of BtrFS. Dor you mean "BtrFS drops some writes"? I don't get it.
I meant that the disk drops some writes, and btrfs users notice it because btrfs does this checksumming + transid check (and then ask online for help with their now broken btrfs, because it doesn't have good repair tools).
See here for a btrfs user testing disks for this problem: https://lore.kernel.org/linux-btrfs/20190624052718.GD11831@hungrycats.org/T/
W.r.t. LMDB in e.g. LDAP you could say "don't use broken disks then". But as a general-purpose database (with checksumming) it would be something nice to have.
some writes. If there was a page at the same location previously, the checksum check succeeds. But btrfs stores the transid of the page in the page's parent, so it compares that as well (the error message is "btrfs parent transid verify failed on OFFSET wanted TRANSID found TRANSID"). I think ZFS stores the checksum of the page in the page's parent as well (I don't know if this would work with LMDB's B-tree).
I guess a simple check (that might already exist) is checking if page transid <= root/meta page transid. But that doesn't catch the case where the root page was updated but updates to other pages were dropped (the disk might also drop a complete "transaction" without reporting an error, in which case the next transaction then writes a root page pointing to an incompletely written tree).