Use-case specific changes to LMDB

List overview All Threads
Download

newer

older

openldap mdb tuning

Symas OpenLDAP For Linux 2.4.57...

martin＠urbackup.org

22 Oct 2020 22 Oct '20

11:21 a.m.

This post outlines a few changes to LMDB I had to do to make it work in a specific use case. I’d like to see those changes upstream, but I understand that they may be/are not relevant for e.g. OpenLDAP. The use case is multiple databases on disks with long running large write transactions.

1. Option to not use custom memory allocator/page pool

LMDB has a custom malloc() implementation that re-uses pages (me_dpages). I understand that this improves the performance at bit (depending on the malloc implementation). But there should at least be the option to not do that (for many reasons). I would even make not using it the default.

2. Large transactions and spilling

In a large write transaction, it will use a lot of memory per default (512MiB) which won’t get freed when the transaction commits (see 1.). If one has a lot of databases it uses a lot of memory that never gets freed.

Alternatively, one can use MDB_WRITEMAP, but (i) per default Linux isn’t tuned to delay writing pages to disk and (ii) before commit LMDB has to remove a dirty bit, so each page is written twice.

Both problems would be fixed by making when pages get spilled configurable (mt_dirty_room as MDB_IDL_UM_MAX currently) and reducing the default non-spill memory amount for at least the MDB_WRITEMAP case. If this memory amount is low mt_spill_pgs gets sorted often so maybe this needs to be converted to a different data structure (e.g. red-black tree).

3. LMDB causes crashes if database is corrupted

If the database is corrupted it can cause the application to crash. I have fixed those cases when they (randomly) occurred. Properly fixing this would probably be best done with some fuzzing.

4. Allow LMDB to reside on a device

I used dm-cache to improve LMDB read performance. It needed a bit of adjustment to get the correct size of the device via ioctl BLKGETSIZE64.

I’ve fixed those issues w.r.t. my application. If there is interest in any of those application specific changes, I’ll clean them up and post them.

Show replies by date

Howard Chu

22 Oct 22 Oct

11:40 a.m.

martin@urbackup.org wrote:

...

This post outlines a few changes to LMDB I had to do to make it work in a specific use case. I’d like to see those changes upstream, but I understand that they may be/are not relevant for e.g. OpenLDAP. The use case is multiple databases on disks with long running large write transactions.

Option to not use custom memory allocator/page pool

LMDB has a custom malloc() implementation that re-uses pages (me_dpages). I understand that this improves the performance at bit (depending on the malloc implementation). But there should at least be the option to not do that (for many reasons). I would even make not using it the default.

Not going to happen. But maybe it would be reasonable to allow configuring a limit on how many pages it keeps hanging around, before actually using libc free() on them.

...

Large transactions and spilling

In a large write transaction, it will use a lot of memory per default (512MiB) which won’t get freed when the transaction commits (see 1.). If one has a lot of databases it uses a lot of memory that never gets freed.

Alternatively, one can use MDB_WRITEMAP, but (i) per default Linux isn’t tuned to delay writing pages to disk and (ii) before commit LMDB has to remove a dirty bit, so each page is written twice.

There is no more dirty bit in LMDB 1.0, and this double-write no longer happens.

...

LMDB causes crashes if database is corrupted

You can enable per-page checksums in LMDB 1.0, in which case you'll just get an error code if a page is corrupted (and the checksum fails to match). The DB will still be unusable if anything is corrupted.

...

Allow LMDB to reside on a device

LMDB 1.0 supports storage on raw devices.

-- -- Howard Chu CTO, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/

Martin Raiber

27 Oct 27 Oct

10:24 a.m.

Thanks! Looks like my specific issues are well on the radar.

On 22.10.2020 20:40 Howard Chu wrote:

...

martin@urbackup.org wrote:

...
This post outlines a few changes to LMDB I had to do to make it work in a specific use case. I’d like to see those changes upstream, but I understand that they may be/are not relevant for e.g. OpenLDAP. The use case is multiple databases on disks with long running large write transactions.

Option to not use custom memory allocator/page pool

LMDB has a custom malloc() implementation that re-uses pages (me_dpages). I understand that this improves the performance at bit (depending on the malloc implementation). But there should at least be the option to not do that (for many reasons). I would even make not using it the default.

Not going to happen. But maybe it would be reasonable to allow configuring a limit on how many pages it keeps hanging around, before actually using libc free() on them.

A limit would work.

...

...

Large transactions and spilling

In a large write transaction, it will use a lot of memory per default (512MiB) which won’t get freed when the transaction commits (see 1.). If one has a lot of databases it uses a lot of memory that never gets freed.

Alternatively, one can use MDB_WRITEMAP, but (i) per default Linux isn’t tuned to delay writing pages to disk and (ii) before commit LMDB has to remove a dirty bit, so each page is written twice.

There is no more dirty bit in LMDB 1.0, and this double-write no longer happens.

That'd improve the MDB_WRITEMAP case. The general problem there is that tuning of the write-back behaviour is system wide (e.g. vm.dirty_expire_centisecs on Linux) and mostly not in control of the application, so if one wants to be sure that writeback happens only once sufficient changes have accumulated, one needs to not use MDB_WRITEMAP. In that case it would be great to be able to control the memory usage/write-back behaviour by configuring the amount at which it spills, as well.

...

...

LMDB causes crashes if database is corrupted

You can enable per-page checksums in LMDB 1.0, in which case you'll just get an error code if a page is corrupted (and the checksum fails to match). The DB will still be unusable if anything is corrupted.

That would fix the problem properly. Does it check that it is the correct transaction as well (e.g. by putting a transid into the page like btrfs)? Returning wrong results or MDB_CORRUPTED is something my application can handle (but not crashes obviously).

Howard Chu

10:52 a.m.

Martin Raiber wrote:

...

Thanks! Looks like my specific issues are well on the radar.

On 22.10.2020 20:40 Howard Chu wrote:

...
martin@urbackup.org wrote:

...
This post outlines a few changes to LMDB I had to do to make it work in a specific use case. I’d like to see those changes upstream, but I understand that they may be/are not relevant for e.g. OpenLDAP. The use case is multiple databases on disks with long running large write transactions.

Option to not use custom memory allocator/page pool

LMDB has a custom malloc() implementation that re-uses pages (me_dpages). I understand that this improves the performance at bit (depending on the malloc implementation). But there should at least be the option to not do that (for many reasons). I would even make not using it the default.

Not going to happen. But maybe it would be reasonable to allow configuring a limit on how many pages it keeps hanging around, before actually using libc free() on them.

A limit would work.

Feel free to submit an enhancement request for this on the OpenLDAP ITS.

...

...
...

Large transactions and spilling

In a large write transaction, it will use a lot of memory per default (512MiB) which won’t get freed when the transaction commits (see 1.). If one has a lot of databases it uses a lot of memory that never gets freed.

Alternatively, one can use MDB_WRITEMAP, but (i) per default Linux isn’t tuned to delay writing pages to disk and (ii) before commit LMDB has to remove a dirty bit, so each page is written twice.

There is no more dirty bit in LMDB 1.0, and this double-write no longer happens.

That'd improve the MDB_WRITEMAP case. The general problem there is that tuning of the write-back behaviour is system wide (e.g. vm.dirty_expire_centisecs on Linux) and mostly not in control of the application, so if one wants to be sure that writeback happens only once sufficient changes have accumulated, one needs to not use MDB_WRITEMAP. In that case it would be great to be able to control the memory usage/write-back behaviour by configuring the amount at which it spills, as well.

This sounds like a pretty esoteric tuning knob. Really goes against the philosophy of LMDB's simplicity.

...

...
...

LMDB causes crashes if database is corrupted

You can enable per-page checksums in LMDB 1.0, in which case you'll just get an error code if a page is corrupted (and the checksum fails to match). The DB will still be unusable if anything is corrupted.

That would fix the problem properly. Does it check that it is the correct transaction as well (e.g. by putting a transid into the page like btrfs)? Returning wrong results or MDB_CORRUPTED is something my application can handle (but not crashes obviously).

The txnid is part of the page header, which is one of the incompatible format changes from LMDB 0.9. This is what allows us to eliminate the dirty bit.

Not sure what you mean about the txnid being correct or not, but certainly it is included in the checksum.

-- -- Howard Chu CTO, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/

Martin Raiber

19 Jan 19 Jan

10:26 a.m.

On 27.10.2020 18:52 Howard Chu wrote:

...

...
...
...

LMDB causes crashes if database is corrupted

You can enable per-page checksums in LMDB 1.0, in which case you'll just get an error code if a page is corrupted (and the checksum fails to match). The DB will still be unusable if anything is corrupted.

That would fix the problem properly. Does it check that it is the correct transaction as well (e.g. by putting a transid into the page like btrfs)? Returning wrong results or MDB_CORRUPTED is something my application can handle (but not crashes obviously).

The txnid is part of the page header, which is one of the incompatible format changes from LMDB 0.9. This is what allows us to eliminate the dirty bit.

Not sure what you mean about the txnid being correct or not, but certainly it is included in the checksum.

A common problem that e.g. btrfs users encounter is that a disk drops some writes. If there was a page at the same location previously the checksum check succeeds. But btrfs stores the transid of the page in the page's parent, so it compares that as well (The error message is "btrfs parent transid verify failed on OFFSET wanted TRANSID found TRANSID"). I think ZFS stores the checksum of the page in the page's parent as well (idk if this would work with lmdbs b-tree).

I guess a simple check (that might already exist) is checking if page transid<=root/meta page transid. But that doesn't catch the cases where the root page was updated, but updates to other pages were dropped (the disk might also drop a complete "transaction" but not report an error, in which case the next transaction then writes the root page pointing to an incompletely written tree).

Ulrich Windl

10:51 p.m.

New subject: Antw: [EXT] Re: Use-case specific changes to LMDB

...

...
...
Martin Raiber martin@urbackup.org schrieb am 19.01.2021 um 19:26 in Nachricht

<010201771be5823e-1014e8bb-0f79-4506-b27e-d08d25400b1e-000000@eu-west-1.amazonse .com>:

...

On 27.10.2020 18:52 Howard Chu wrote:

...
...
...
...

LMDB causes crashes if database is corrupted

You can enable per-page checksums in LMDB 1.0, in which case you'll just get

an error code

...
...
...
if a page is corrupted (and the checksum fails to match). The DB will still

be unusable if

...
...
...
anything is corrupted.

That would fix the problem properly. Does it check that it is the correct

transaction as well (e.g. by putting a transid into the page like btrfs)? Returning

...
...
wrong results or MDB_CORRUPTED is something my application can handle (but

not crashes obviously).

...
...
The txnid is part of the page header, which is one of the incompatible

format changes from LMDB 0.9.

...
...
This is what allows us to eliminate the dirty bit.

Not sure what you mean about the txnid being correct or not, but certainly

it is included in the

...
...
checksum.

A common problem that e.g. btrfs users encounter is that a disk drops

If "a disk drops some writes" it's definitely not a problem of BtrFS. Dor you mean "BtrFS drops some writes"? I don't get it.

...

some writes. If there was a page at the same location previously the checksum check succeeds. But btrfs stores the transid of the page in the page's parent, so it compares that as well (The error message is "btrfs parent transid verify failed on OFFSET wanted TRANSID found TRANSID"). I think ZFS stores the checksum of the page in the page's parent as well (idk if this would work with lmdbs b-tree).

I guess a simple check (that might already exist) is checking if page transid<=root/meta page transid. But that doesn't catch the cases where the root page was updated, but updates to other pages were dropped (the disk might also drop a complete "transaction" but not report an error, in which case the next transaction then writes the root page pointing to an incompletely written tree).

martin＠urbackup.org

22 Jan 22 Jan

7:52 a.m.

New subject: Antw: [EXT] Re: Use-case specific changes to LMDB

Ulrich Windl wrote:

...

...
...
...
Martin Raiber <martin(a)urbackup.org> schrieb am

19.01.2021 um 19:26 in Nachricht <010201771be5823e-1014e8bb-0f79-4506-b27e-d08d25400b1e-000000(a)eu-west-1.amazonse

.com>:

...
On 27.10.2020 18:52 Howard Chu wrote:

...
...
...
...

LMDB causes crashes if database is corrupted

You can enable per-page checksums in LMDB 1.0, in which case you'll just

get an error code

...
...
...
if a page is corrupted (and the checksum fails to match). The DB will still

be unusable if

...
...
...
anything is corrupted.

That would fix the problem properly. Does it check that it is the correct

transaction as well (e.g. by putting a transid into the page like btrfs)? Returning

...
...
wrong results or MDB_CORRUPTED is something my application can handle (but

not crashes obviously).

...
...
The txnid is part of the page header, which is one of the incompatible

format changes from LMDB 0.9.

...
...
This is what allows us to eliminate the dirty bit.

Not sure what you mean about the txnid being correct or not, but certainly

it is included in the

...
...
checksum.

A common problem that e.g. btrfs users encounter is that a disk drops

If "a disk drops some writes" it's definitely not a problem of BtrFS. Dor you mean "BtrFS drops some writes"? I don't get it.

It was meant as disk drops some writes and btrfs users notice it because it does this checksumming + transid check (and ask online for help about their now broken btrfs because it doesn't have good repair tools).

See here for a btrfs user testing disks for this problem: https://lore.kernel.org/linux-btrfs/20190624052718.GD11831@hungrycats.org/T/

W.r.t. to LMDB in e.g. LDAP you could say "don't use broken disks then". But as a general purpose database (with checksumming) it would be something nice to have.

...

...
some writes. If there was a page at the same location previously the

checksum check succeeds. But btrfs stores the transid of the page in the page's parent, so it compares that as well (The error message is "btrfs parent transid verify failed on OFFSET wanted TRANSID found TRANSID"). I think ZFS stores the checksum of the page in the page's parent as well (idk if this would work with lmdbs b-tree).

I guess a simple check (that might already exist) is checking if page transid<=root/meta page transid. But that doesn't catch the cases where the root page was updated, but updates to other pages were dropped (the disk might also drop a complete "transaction" but not report an error, in which case the next transaction then writes the root page pointing to an incompletely written tree).

1617

Age (days ago)

1709

Last active (days ago)

openldap-technical@openldap.org

6 comments

4 participants

tags (0)

participants (4)

Howard Chu
Martin Raiber
martin＠urbackup.org
Ulrich Windl