Hi,
I'm struggling with an issue where a large but mostly free LMDB database reports
MDB_MAP_FULL when I try to store a large write transaction.
The LMDB mapsize is configured to 40 GiB and indeed, the database file (data.mdb) has
already reached this size, so it cannot grow.
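For reference, the environment is set up roughly like this (simplified; the path and
flags here are illustrative, not our exact code):

  #include <lmdb.h>

  MDB_env *env;
  mdb_env_create(&env);
  mdb_env_set_mapsize(env, (size_t)40 * 1024 * 1024 * 1024);  /* 40 GiB */
  mdb_env_open(env, "/var/lib/knot", 0, 0644);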
mdb_stat says:
Status of Main DB
Tree depth: 3
Branch pages: 24
Leaf pages: 1123
Overflow pages: 152153
Entries: 27423
Computing [ (branch + leaf + overflow) * page_size ] (I assume page_size is 4 KiB) leads
to only ~600 MiB of real database usage, which corresponds to the estimated amount of
data stored there. So only about 1.5% of the database is used, and there ought to be
plenty of free space inside.
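In numbers: (24 + 1123 + 152153) pages * 4096 B/page = 627,916,800 B ≈ 599 MiB;
599 MiB / 40 GiB ≈ 1.5%.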
The write attempt that fails tries to insert 700 MiB more data in one transaction, still
just a small percentage of the mapsize. However, it fails with MDB_MAP_FULL.
Note that our application stores such large data by splitting it into small chunks
(~70 KiB each) and storing them as many key-value records, roughly as sketched below.
This is done to avoid issues with searching for too large a contiguous free space.
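The storing scheme looks about like this (a simplified sketch, not our actual code;
the key layout and the names are made up for illustration):

  #include <lmdb.h>
  #include <stdint.h>
  #include <stddef.h>

  #define CHUNK_SIZE (70 * 1024)

  /* Store one large object as a series of ~70 KiB records, keyed by
   * the object id plus a chunk sequence number. */
  static int store_chunked(MDB_txn *txn, MDB_dbi dbi, uint32_t id,
                           const uint8_t *buf, size_t len)
  {
      struct { uint32_t id; uint32_t seq; } keybuf = { id, 0 };
      int rc = 0;

      for (size_t off = 0; off < len; off += CHUNK_SIZE, keybuf.seq++) {
          size_t n = len - off < CHUNK_SIZE ? len - off : CHUNK_SIZE;
          MDB_val key = { sizeof(keybuf), &keybuf };
          MDB_val data = { n, (void *)(buf + off) };
          if ((rc = mdb_put(txn, dbi, &key, &data, 0)) != MDB_SUCCESS)
              break;
      }
      return rc;
  }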
I have tracked down one such failing write attempt with gdb to see what exactly fails:
#0 mdb_page_alloc (mc=0x7fffd7e9fa08, num=8, mp=0x7fffd7e9f518) at
contrib/lmdb/mdb.c:2286
#1 0x00000000004e2cf6 in mdb_page_new (mc=0x7fffd7e9fa08, flags=4, num=8,
mp=0x7fffd7e9f578) at contrib/lmdb/mdb.c:7178
#2 0x00000000004e6108 in mdb_node_add (mc=0x7fffd7e9fa08, indx=98, key=0x7fffd7e9fc08,
data=0x7fffd7e9fc20, pgno=0, flags=65536) at contrib/lmdb/mdb.c:7320
#3 0x00000000004de628 in mdb_cursor_put (mc=0x7fffd7e9fa08, key=0x7fffd7e9fc08,
data=0x7fffd7e9fc20, flags=65536) at contrib/lmdb/mdb.c:6947
#4 0x00000000004e830a in mdb_put (txn=0x7fffd000cd80, dbi=1, key=0x7fffd7e9fc08,
data=0x7fffd7e9fc20, flags=65536) at contrib/lmdb/mdb.c:9022
It shows an unsuccessful attempt to allocate 8 pages (interestingly, quite a small
amount, since it had successfully allocated many 20-page chunks just before).
Digging through the LMDB source code, I found this interesting line:
https://git.openldap.org/openldap/openldap/-/blob/master/libraries/liblmd...
Unfortunately, I don't have enough insight to understand mdb_page_alloc() in all its
complexity, but this looks like a kind of heuristic: it limits the number of scanned
free-space fragments (for reasons unknown to me). In the past, it has also been "tuned":
https://git.openldap.org/openldap/openldap/-/commit/5ee99f1125a775f28ed69...
I tried patching my copy of the LMDB source code by doubling this magic constant (60 ->
120), and voila, this particular write transaction succeeded afterwards. However, this is
obviously not a sustainable way of sorting this out.
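To make my (possibly wrong) reading of the code explicit, here is a toy model of the
search loop as I understand it -- NOT the real mdb_page_alloc(), just the shape of the
heuristic. The allocator merges old transactions' freelist records one by one into an
in-memory list and rescans it for a contiguous run of `num` pages, giving up after a
budget of num * 60 failed rounds:

  #include <stddef.h>

  typedef unsigned long pgno_t;

  /* hypothetical callback: append the next freelist record into `out`,
   * returning the number of page numbers added, 0 when none are left */
  typedef size_t (*next_record_fn)(pgno_t *out, size_t cap);

  static pgno_t alloc_model(pgno_t *mop, size_t cap, unsigned num,
                            next_record_fn next)
  {
      int retry = num * 60;   /* the magic constant in question */
      unsigned n2 = num - 1;
      size_t len = 0;

      for (;;) {
          size_t got = next(mop + len, cap - len);
          if (got == 0)
              break;           /* freelist exhausted */
          len += got;          /* (the real code keeps mop sorted) */

          /* scan for num contiguous free pages */
          for (size_t i = len; i > n2; i--)
              if (mop[i - 1] == mop[i - 1 - n2] + n2)
                  return mop[i - 1 - n2];   /* found a run */

          if (--retry < 0)
              break;           /* scan budget exhausted */
      }
      return 0;  /* caller falls back to extending the map, which
                  * yields MDB_MAP_FULL once mapsize is reached */
  }

If no run is found within that budget, the freelist is no longer consulted, even if a
suitable run exists further on.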
What is interesting to me is that the number of retries depends only on the size of the
requested chunk (the number of pages to allocate), but not on the mapsize. It seems
plausible to me that in a huge database there may be a great many tiny (e.g. one-page)
fragments of free space that need to be skipped before any large-enough run is found.
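Back-of-the-envelope (assuming I read the code right): for the failing num = 8
allocation the budget is 8 * 60 = 480 rounds. A 40 GiB map holds about 10.5 million
4 KiB pages, so if even a fraction of the ~98.5% free space sits in small fragments
spread across many freelist records, 480 rounds could easily be used up before an
8-page run turns up.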
However, I don't yet dare to open an issue about this.
Could you please tell me whether my reasoning points in the right direction or a wrong
one? Does LMDB need some improvement to avoid this (and similar) issues?
Do you think it would be smarter for our application to chunk the large data into even
smaller parts (70 KiB -> 3 KiB) to reduce fragmentation of the overflow-page space in
LMDB?
Thank you for any hints and further insight into LMDB internals.
Many cheers,
Libor
Knot DNS | CZ NIC