https://bugs.openldap.org/show_bug.cgi?id=10236
Issue ID: 10236
Summary: fragmentation makes mdb_page_alloc slow
Product: LMDB
Version: 0.9.31
Hardware: All
OS: All
Status: UNCONFIRMED
Keywords: needs_review
Severity: normal
Priority: ---
Component: liblmdb
Assignee: bugs@openldap.org
Reporter: aalekseyev@janestreet.com
Target Milestone: ---
Created attachment 1022 --> https://bugs.openldap.org/attachment.cgi?id=1022&action=edit patch is relative to LMDB_0.9.31
It's a known problem that mdb_page_alloc can be slow when the free list is large and fragmented. [1] [2] [3]
I'm not sure it's known *how* slow it can be.
In our workload we saw a fragmented free list leading to pathological O(n^2) behavior. To handle a multi-page allocation, we iterate loading chunks of the free list one by one, and at every iteration we do O(n) work to check whether the allocation can succeed.
Even small-ish allocations (tens of pages) repeatedly hit this edge case, with the free list growing to ~1,000,000 entries and the outer loop taking ~2,000 iterations (~10^9 units of work in total, just to allocate a few pages).
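To make the shape of the problem concrete, here is a toy model of the per-iteration scan step (hypothetical names, plain ascending array; the real code lives in mdb_page_alloc and midl.c):

```c
#include <stddef.h>

typedef unsigned long pgno_t;

/* Toy model of the run scan: O(n) over an ascending sorted candidate
 * list, looking for `num` consecutive page numbers. In 0.9.31 a scan
 * like this effectively runs after *every* chunk load, so k chunk loads
 * over a list growing to n entries cost O(k*n), i.e. O(n^2) overall. */
static size_t find_run(const pgno_t *list, size_t len, size_t num)
{
    if (num <= 1)
        return len ? 0 : (size_t)-1;
    size_t run = 1;
    for (size_t i = 1; i < len; i++) {
        run = (list[i] == list[i - 1] + 1) ? run + 1 : 1;
        if (run == num)
            return i - num + 1;   /* start index of the contiguous run */
    }
    return (size_t)-1;            /* no run of `num` pages found */
}
```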
Even though I'm sure there are ways to avoid hitting this pathological scenario so often (avoid values larger than 4k, or fix whatever causes the fragmentation), a performance cliff this bad seems unacceptable.
I made a patch that makes the allocation roughly O(n log n), by loading and merging multiple chunks at once instead of one by one.
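The gist of the restructuring, as a hedged sketch rather than the patch itself (here a plain qsort over the concatenated chunks stands in for the patch's k-way merge, with the same O(n log n) bound):

```c
#include <stdlib.h>
#include <string.h>

typedef unsigned long pgno_t;

static int cmp_pgno(const void *a, const void *b)
{
    pgno_t x = *(const pgno_t *)a, y = *(const pgno_t *)b;
    return (x > y) - (x < y);
}

/* Gather all k candidate chunks first, establish sorted order once, and
 * only then do a single run scan, rather than re-merging and re-scanning
 * after every individual chunk load. Returns NULL on allocation failure. */
static pgno_t *merge_chunks(pgno_t *const *chunks, const size_t *lens,
                            size_t k, size_t *out_len)
{
    size_t total = 0, off = 0;
    for (size_t i = 0; i < k; i++)
        total += lens[i];
    pgno_t *all = malloc(total * sizeof *all);
    if (!all)
        return NULL;
    for (size_t i = 0; i < k; i++) {
        memcpy(all + off, chunks[i], lens[i] * sizeof *all);
        off += lens[i];
    }
    qsort(all, total, sizeof *all, cmp_pgno);
    *out_len = total;
    return all;
}
```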
I'd appreciate it if someone could review the patch (attached), improve it, and/or come up with an alternative fix.
The code in `midl.c` is kinda meme-y, including a contribution from GPT-4o, but it performs well enough to speed up our pathological workload by ~20x (which still leaves it ~3x slower than the non-fragmented case).
Anyway, the main thing that warrants scrutiny is the change in `mdb.c`: I understand very little about LMDB internals, and I worry that loading multiple pages at once instead of one by one might break something.
[1] issue #8664
[2] https://lists.openldap.org/hyperkitty/list/openldap-bugs@openldap.org/thread...
[3] https://lists.openldap.org/hyperkitty/list/openldap-technical@openldap.org/m...
--- Comment #1 from Howard Chu <hyc@openldap.org> --- Thanks. The overall approach makes sense, but we cannot use GPT-authored code. The Project only accepts code submitted by its original authors, and LLM-generated code is by definition plagiarized from unknown numbers of authors.
We will look into writing a usable patch adopting this feature.
--- Comment #2 from aalekseyev@janestreet.com --- (In reply to Howard Chu from comment #1)
> LLM-generated code is by definition plagiarized from unknown numbers of authors.
Welp, that's fair enough, although so is everything produced by my brain.
Let's try to find references with a sufficiently free license to take the liability out of having to reimplement-but-not-plagiarize this code.
An MIT-licensed k-way merge in Python can be seen here: https://github.com/amitrajitbose/Competitive_Programming/blob/master/Heap/me....
A heapify definition can be seen on Wikipedia (with 1-based indexing): https://en.wikipedia.org/wiki/Binary_heap#Heap_implementation, search for "Max-Heapify".
It even mentions the optimization idea that I thought was cute in the GPT-generated code:
> The down-heap operation (without the preceding swap) can also be used to modify the value of the root, even when an element is not being deleted.
(I might naively have done a pop followed by a push.)
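For concreteness, here is a hedged sketch of such a merge in C (illustrative types and names, 0-based indexing unlike the Wikipedia pseudocode; the attached patch is the authoritative version):

```c
#include <stddef.h>

typedef unsigned long pgno_t;

typedef struct {
    const pgno_t *cur, *end;   /* cursor into one sorted (ascending) chunk */
} stream_t;

/* Restore the min-heap property by sifting down from index i. */
static void sift_down(stream_t *heap, size_t n, size_t i)
{
    for (;;) {
        size_t l = 2 * i + 1, r = l + 1, m = i;
        if (l < n && *heap[l].cur < *heap[m].cur) m = l;
        if (r < n && *heap[r].cur < *heap[m].cur) m = r;
        if (m == i)
            return;
        stream_t tmp = heap[i]; heap[i] = heap[m]; heap[m] = tmp;
        i = m;
    }
}

/* Merge k non-empty ascending streams into out[], which must have room
 * for the combined length. Runs in O(n log k). Returns the output length. */
static size_t kway_merge(stream_t *heap, size_t k, pgno_t *out)
{
    size_t len = 0;
    for (size_t i = k / 2; i-- > 0; )      /* build the heap: O(k) */
        sift_down(heap, k, i);
    while (k > 0) {
        *out++ = *heap[0].cur++;           /* emit the smallest head */
        len++;
        if (heap[0].cur == heap[0].end)    /* stream exhausted: shrink heap */
            heap[0] = heap[--k];
        sift_down(heap, k, 0);             /* replace root + down-heap,
                                              instead of pop-then-push */
    }
    return len;
}
```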
--- Comment #3 from aalekseyev@janestreet.com --- Created attachment 1024 --> https://bugs.openldap.org/attachment.cgi?id=1024&action=edit repro
Attached a Rust program that reproduces the problem, using the lmdb-rs crate. (sorry, C is hard)
The program creates 10 million "dust" particles (small entries), followed by 200k 80 KB "bricks".
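For reference, a rough C sketch of the same workload shape against the plain LMDB C API (the actual attachment is the Rust program; batch sizes, key formats, and the map size are illustrative assumptions, and error handling is mostly omitted):

```c
#include <stdio.h>
#include <string.h>
#include "lmdb.h"

int main(void)
{
    MDB_env *env; MDB_txn *txn; MDB_dbi dbi;
    MDB_val key, val;
    char kbuf[32];
    static char brick[80 * 1024];       /* one 80 KB "brick" value */
    int rc;

    if ((rc = mdb_env_create(&env)))
        return rc;
    mdb_env_set_mapsize(env, (size_t)64 << 30);          /* generous map */
    if ((rc = mdb_env_open(env, "./repro-db", 0, 0664))) /* dir must exist */
        return rc;

    /* Phase 1: 10 million small "dust" entries, committed in batches;
     * every commit's copy-on-write frees scattered B-tree pages, growing
     * a large, fragmented free list. */
    for (long i = 0; i < 10000000; ) {
        mdb_txn_begin(env, NULL, 0, &txn);
        mdb_dbi_open(txn, NULL, 0, &dbi);
        for (long j = 0; j < 10000 && i < 10000000; j++, i++) {
            key.mv_size = (size_t)snprintf(kbuf, sizeof kbuf, "dust-%09ld", i);
            key.mv_data = kbuf;
            val.mv_size = 8;            /* tiny value */
            val.mv_data = kbuf;
            mdb_put(txn, dbi, &key, &val, 0);
        }
        mdb_txn_commit(txn);
    }

    /* Phase 2: 200k 80 KB "bricks"; each needs a multi-page (overflow)
     * allocation, which is what walks the fragmented free list. */
    for (long i = 0; i < 200000; i++) {
        mdb_txn_begin(env, NULL, 0, &txn);
        mdb_dbi_open(txn, NULL, 0, &dbi);
        key.mv_size = (size_t)snprintf(kbuf, sizeof kbuf, "brick-%06ld", i);
        key.mv_data = kbuf;
        val.mv_size = sizeof brick;
        val.mv_data = brick;
        mdb_put(txn, dbi, &key, &val, 0);
        mdb_txn_commit(txn);
    }

    mdb_env_close(env);
    return 0;
}
```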
When run with the patch, all bricks are created in under 100 s (<0.5 ms per brick).
When run without the patch, brick creation gets progressively slower, reaching 500 ms/brick. It's been running for a couple of hours so far and is only at 170k bricks.
--- Comment #4 from Howard Chu <hyc@openldap.org> --- Thanks again, it's great to have a measurable impact from this change.
--- Comment #5 from aalekseyev@janestreet.com --- Um, I'm afraid I found a bug in the original patch: it leaks pages by dropping free list chunks too early.
To fix the bug you need to avoid advancing `last` unless you've decided you're doing the merge (so both `if (oldest <= last_candidate)` exits from the inner loop are problematic).
Speaking of which, are there any tests of LMDB out there that should have caught this?
--- Comment #6 from aalekseyev@janestreet.com --- I'm afraid I also found that fixing that bug erases a large part of the performance saving. A lot of the saving came from dropping some "inconvenient" free list pages, thus accidentally defragmenting the list.
So:
Baseline:    >3 h (killed), 500 ms/brick towards the end
With bug:    ~2 min, 23.4 GB, 0.1 ms/brick towards the end
Without bug: ~20 min, 23.0 GB, 30 ms/brick towards the end
So the overall saving is "only" one order of magnitude, not the two or more I fooled myself into believing.
--- Comment #7 from aalekseyev@janestreet.com --- And, to put the numbers in perspective...
Without dust: ~15 s, 16.8 GB, 0.06 ms/brick towards the end
Howard Chu <hyc@openldap.org> changed:

What            |Removed            |Added
----------------------------------------------------------------------------
Ever confirmed  |0                  |1
Assignee        |bugs@openldap.org  |hyc@openldap.org
Status          |UNCONFIRMED        |CONFIRMED
Quanah Gibson-Mount <quanah@openldap.org> changed:

What            |Removed       |Added
----------------------------------------------------------------------------
Target Milestone|---           |1.0.0
Keywords        |needs_review  |