https://bugs.openldap.org/show_bug.cgi?id=9496
Issue ID: 9496
Summary: Some writes missing from database
Product: LMDB
Version: unspecified
Hardware: All
OS: All
Status: UNCONFIRMED
Severity: normal
Priority: ---
Component: liblmdb
Assignee: bugs@openldap.org
Reporter: igfoo@github.com
Target Milestone: ---
With the attached test program, some of my database writes appear not to actually be written to the database. For example, a run may look like this:
$ ./run.sh
All done.
All finished
1802 test.txt
foo_200 is missing
bar_200 is missing
foo_404 is missing
bar_404 is missing
foo_407 is missing
bar_407 is missing
The script that I am using to run the program is below. This is using mdb.master 52bc29ee2efccf09c650598635cd42a50b6ecffe on Linux, with an ext4 filesystem.
Is this an LMDB bug, or is there a bug in my code?
Thanks
Ian
#!/bin/sh

set -e

if ! [ -d lmdb ]
then
    rm -rf lmdb
    git clone https://github.com/LMDB/lmdb.git
    INSTALL_DIR="`pwd`/inst"
    cd lmdb/libraries/liblmdb
    make install prefix="$INSTALL_DIR"
    cd ../../..
fi

gcc -Wall -Werror -Iinst/include loop.c inst/lib/liblmdb.a -o loop -pthread
rm -f test.db test.db-lock
./loop
echo "All finished"
mdb_dump -np test.db > test.txt
wc -l test.txt
for i in `seq 100 999`
do
    if ! grep -q "foo_$i" test.txt
    then
        echo "foo_$i is missing"
    fi
    if ! grep -q "bar_$i" test.txt
    then
        echo "bar_$i is missing"
    fi
done
--- Comment #1 from igfoo@github.com ---
Created attachment 805
--> https://bugs.openldap.org/attachment.cgi?id=805&action=edit
Test case
--- Comment #2 from igfoo@github.com ---
Created attachment 806
--> https://bugs.openldap.org/attachment.cgi?id=806&action=edit
Script
igfoo@github.com changed:
What    |Removed |Added
----------------------------------------------------------------------------
Hardware|All     |x86_64
OS      |All     |Linux
--- Comment #3 from Nate Pierce nwpierce@gmail.com ---
Anyone look into this yet? I made a script to back up through the commit history, rebuild mdb_dump and your test, and run it 10 times - as long as I was able to get it to lose a commit on at least one run, I kept rewinding.
I cannot get your code to fail as of commit ce834559041747a8ae29884d2b82e144adc7600f. Everything at af2f8cc814fabe2814cacb573be3338292f47c0d or later will trigger it.
Tested on Debian 10 (4.0.19-11) and macOS 11.2.3.
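The rewind-and-retest approach described in comment #3 could be sketched roughly as follows. This is not the script that was actually used: the helper name `fails_within` is hypothetical, and the commented-out driver assumes an lmdb checkout in `./lmdb` plus a hypothetical `./check.sh` that rebuilds mdb_dump and the test and exits non-zero when a key is missing.

```shell
#!/bin/sh
# Hypothetical helper: run a command up to N times and report whether
# any run failed. Returns 0 (true) as soon as one run fails, and 1 if
# all N runs succeed.
fails_within() {
    runs=$1
    shift
    i=0
    while [ "$i" -lt "$runs" ]
    do
        if ! "$@" >/dev/null 2>&1
        then
            return 0
        fi
        i=$((i + 1))
    done
    return 1
}

# Illustrative driver (assumed layout and script names, see above):
#
#   while fails_within 10 ./check.sh
#   do
#       git -C lmdb checkout -q HEAD~1
#   done
#   git -C lmdb rev-parse HEAD   # first commit where no failure was seen
```

Because the failure is intermittent, a commit only counts as good when all runs pass. Ten clean runs make that likely rather than certain.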
Nate Pierce nwpierce@gmail.com changed:
What |Removed |Added
----------------------------------------------------------------------------
CC   |        |nwpierce@gmail.com
--- Comment #4 from Nate Pierce nwpierce@gmail.com ---
Created attachment 814
--> https://bugs.openldap.org/attachment.cgi?id=814&action=edit
backs out a portion of af2f8cc814fabe2814cacb573be3338292f47c0d
I don't know if this has any adverse effects - I don't know the lmdb internals, but the two spots look related. After applying, all of the existing tests still pass, and your supplied test does now too.
Howard Chu hyc@openldap.org changed:
What      |Removed     |Added
----------------------------------------------------------------------------
Resolution|---         |TEST
Status    |UNCONFIRMED |RESOLVED
--- Comment #5 from Howard Chu hyc@openldap.org ---
(In reply to Nate Pierce from comment #4)
> Created attachment 814
> backs out a portion of af2f8cc814fabe2814cacb573be3338292f47c0d
>
> I don't know if this has any adverse effects - I don't know the lmdb internals, but the two spots look related. After applying, all of the existing tests still pass, and your supplied test does now too.
Thanks for investigating. Looks like this is the right fix.
--- Comment #6 from Howard Chu hyc@openldap.org ---
Fixed in mdb.master and mdb.master3. The bug is not present in mdb.RE/0.9.
--- Comment #7 from Quanah Gibson-Mount quanah@openldap.org ---
mdb.master:
Commits:
• 4b615434 by Howard Chu at 2021-04-09T14:06:33+01:00
  ITS#9496 fix mdb_env_open bug from #8704
mdb.master3:
Commits:
• 557ab606 by Howard Chu at 2021-04-09T14:12:41+01:00
  ITS#9496 fix mdb_env_open bug from #8704
Quanah Gibson-Mount quanah@openldap.org changed:
What            |Removed |Added
----------------------------------------------------------------------------
Target Milestone|---     |1.0.0
--- Comment #8 from opensource@gmx-topmail.de opensource@gmx-topmail.de ---
After applying this patch, I observed some conditions under which MDB_PREVSNAPSHOT no longer works: instead of the previous commit, the current commit is taken.
While the exact conditions under which this happens are not 100% clear, these are the observations:
- Having the DB on a RAM disk has an impact. On macOS, we observed that it is always reproducible for one of our test cases (a straightforward test for MDB_PREVSNAPSHOT, but too much context is required to share it here). The RAM disk was created like this:
  diskutil partitionDisk $(hdiutil attach -nomount ram://2048000) 1 GPTFormat APFS 'ramdisk' '100%'
- When building with thread sanitizer, the issue was also observed on Linux. It was reproducible in about half of the runs (flaky).
- It also happened in regular builds, but very infrequently.
I'm sorry that I cannot dig deeper at the moment and, e.g., provide a patch. I still wanted to put these observations down in the hope that they might be useful for someone.
Best,
Markus
--- Comment #9 from Howard Chu hyc@openldap.org ---
(In reply to opensource@gmx-topmail.de from comment #8)
> After applying this patch, I observed some conditions under which MDB_PREVSNAPSHOT no longer works: instead of the previous commit, the current commit is taken.
>
> While the exact conditions under which this happens are not 100% clear, these are the observations:
> - Having the DB on a RAM disk has an impact. On macOS, we observed that it is always reproducible for one of our test cases (a straightforward test for MDB_PREVSNAPSHOT, but too much context is required to share it here). The RAM disk was created like this:
>   diskutil partitionDisk $(hdiutil attach -nomount ram://2048000) 1 GPTFormat APFS 'ramdisk' '100%'
> - When building with thread sanitizer, the issue was also observed on Linux. It was reproducible in about half of the runs (flaky).
Note that all writes in LMDB are fully serialized, which means there cannot possibly be any threading bugs in LMDB. Any thread sanitizer issues showing up indicate bugs in your calling app.
--- Comment #10 from opensource@gmx-topmail.de opensource@gmx-topmail.de ---
For clarification: we've seen the behavior without thread sanitizer too, just not as frequently. Also, the code in question does not use multiple threads.
--- Comment #11 from Quanah Gibson-Mount quanah@openldap.org ---
(In reply to opensource@gmx-topmail.de from comment #10)
> For clarification: we've seen the behavior without thread sanitizer too, just not as frequently. Also, the code in question does not use multiple threads.
LMDB write ops are 100% deterministic. Having flaky results on some runs and not others cannot be due to LMDB code.
--- Comment #12 from Markus markus@objectbox.io ---
I was just able to debug into this and thus gathered new info. Inside mdb_txn_renew0, I made some "odd" observations that I wanted to check with you. "Odd" in the sense that it seems that the meta page selection does not consider MDB_PREVSNAPSHOT.
It entered mdb_txn_renew0 with MDB_TXN_RDONLY and ti (MDB_txninfo; env->me_txns) being non-NULL. However, ti->mti_txnid was 0 and thus txn->mt_txnid was set to 0. That's the reason for always selecting the first (index 0) meta page in this code line:
meta = env->me_metas[txn->mt_txnid & 1];
This seems wrong but maybe I missed something?
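To make the observation above concrete: the index computed by the quoted line depends only on the lowest bit of the transaction ID. The shell sketch below simply mirrors that arithmetic. `meta_index` is a hypothetical name used for illustration, not an LMDB function, and the sketch says nothing about how LMDB handles MDB_PREVSNAPSHOT elsewhere.

```shell
#!/bin/sh
# Mirrors the arithmetic of `txn->mt_txnid & 1` from the quoted line.
# meta_index is a hypothetical helper, not part of LMDB.
meta_index() {
    echo $(( $1 & 1 ))
}

# Show which of the two meta page slots a few sample IDs map to.
for txnid in 0 1 2 7
do
    echo "txnid=$txnid -> me_metas[$(meta_index "$txnid")]"
done
```

Even IDs map to slot 0 and odd IDs to slot 1, so a transaction ID of 0 always yields index 0, as reported.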
--- Comment #13 from Howard Chu hyc@openldap.org ---
You'll have to provide more context for what you're trying to do. Note that the PREVSNAPSHOT flag only takes effect once and then is cleared. And probably you should open a new ticket since this one is already closed.
--- Comment #14 from Markus markus@objectbox.io ---
OK, opened new one here: https://bugs.openldap.org/show_bug.cgi?id=10024