Hi, I'm struggling with an issue where a large but mostly free LMDB database reports MDB_MAP_FULL when I try to commit a large write transaction.
The LMDB mapsize is configured to 40 GiB and, indeed, the database file (data.mdb) has already reached this size, so it cannot grow further.
mdb_stat says:
Status of Main DB
  Tree depth: 3
  Branch pages: 24
  Leaf pages: 1123
  Overflow pages: 152153
  Entries: 27423
Computing (branch + leaf + overflow) * page_size (I assume page_size is 4 KiB) gives only ~600 MiB of real database usage, which matches the estimated amount of data stored there. So only about 1.5% of the map is actually used, and there ought to be plenty of free space inside.
The failing write attempt tries to insert ~700 MiB more data in one transaction, still just a small percentage of the mapsize. Nevertheless, it fails with MDB_MAP_FULL.
Note that our application stores such large data by splitting it into small chunks (~70 KiB each) and storing them as many key-value records. This is done to avoid issues with searching for too-large contiguous free space.
I have tracked down one such failing write attempt with gdb to see what exactly fails:
#0  mdb_page_alloc (mc=0x7fffd7e9fa08, num=8, mp=0x7fffd7e9f518) at contrib/lmdb/mdb.c:2286
#1  0x00000000004e2cf6 in mdb_page_new (mc=0x7fffd7e9fa08, flags=4, num=8, mp=0x7fffd7e9f578) at contrib/lmdb/mdb.c:7178
#2  0x00000000004e6108 in mdb_node_add (mc=0x7fffd7e9fa08, indx=98, key=0x7fffd7e9fc08, data=0x7fffd7e9fc20, pgno=0, flags=65536) at contrib/lmdb/mdb.c:7320
#3  0x00000000004de628 in mdb_cursor_put (mc=0x7fffd7e9fa08, key=0x7fffd7e9fc08, data=0x7fffd7e9fc20, flags=65536) at contrib/lmdb/mdb.c:6947
#4  0x00000000004e830a in mdb_put (txn=0x7fffd000cd80, dbi=1, key=0x7fffd7e9fc08, data=0x7fffd7e9fc20, flags=65536) at contrib/lmdb/mdb.c:9022
It shows an unsuccessful attempt to allocate 8 pages (interestingly, quite a small amount, since it had successfully allocated many 20-page chunks just before).
Digging into the LMDB source code, I found this interesting line: https://git.openldap.org/openldap/openldap/-/blob/master/libraries/liblmdb/m...
Unfortunately, I don't have enough insight to understand mdb_page_alloc() in its full complexity, but this looks like a kind of heuristic: it limits the number of scanned free-space fragments (for a reason unknown to me). This constant has also been "tuned" in the past: https://git.openldap.org/openldap/openldap/-/commit/5ee99f1125a775f28ed69b06...
I tried patching my copy of the LMDB source by doubling this magic constant (60 -> 120), and voilà, this particular write transaction then succeeded. However, this is obviously not a sustainable way of sorting this out.
What is interesting to me is that the number of retries depends only on the size of the requested chunk (the number of pages to allocate), not on the mapsize. It seems plausible that in a huge database there may be a lot of tiny (e.g. one-page) fragments of free space that have to be skipped before any large-enough one is found.
However, I don't yet dare to open an issue about this.
Could you please tell me whether my reasoning is heading the right way or the wrong one? Does LMDB need some improvement to avoid this (and similar) issues?
Do you think it would be smarter for our application to chunk the large data into even smaller parts (70 KiB -> 3 KiB), to reduce fragmentation of the overflow-page space in LMDB?
Thank you for any hints and more insight to LMDB internals.
Many cheers, Libor
Knot DNS | CZ NIC
openldap-technical@openldap.org