https://bugs.openldap.org/show_bug.cgi?id=9378
Issue ID: 9378
Summary: Crash in mdb_put() / mdb_page_dirty()
Product: LMDB
Version: 0.9.26
Hardware: All
OS: Linux
Status: UNCONFIRMED
Severity: normal
Priority: ---
Component: liblmdb
Assignee: bugs@openldap.org
Reporter: nate@kde.org
Target Milestone: ---
The KDE Baloo file indexer uses lmdb as its database (source code available at https://invent.kde.org/frameworks/baloo). Our most common crash, with over 100 duplicate bug reports, is in lmdb. Here's the bug report tracking it: https://bugs.kde.org/show_bug.cgi?id=389848.
The version of lmdb does not seem to matter much. We have bug reports from Arch users with lmdb 0.9.26 as well as bug reports from people using many earlier versions.
Here's an example backtrace, taken from https://bugs.kde.org/show_bug.cgi?id=426195:
#6  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#7  0x00007f3c0bbb9859 in __GI_abort () at abort.c:79
#8  0x00007f3c0b23ba83 in mdb_assert_fail (env=0x55e2ad710600, expr_txt=expr_txt@entry=0x7f3c0b23e02f "rc == 0", func=func@entry=0x7f3c0b23e978 <__func__.7221> "mdb_page_dirty", line=line@entry=2127, file=0x7f3c0b23e010 "mdb.c") at mdb.c:1542
#9  0x00007f3c0b2306d5 in mdb_page_dirty (mp=<optimized out>, txn=0x55e2ad7109f0) at mdb.c:2114
#10 mdb_page_dirty (txn=0x55e2ad7109f0, mp=<optimized out>) at mdb.c:2114
#11 0x00007f3c0b231966 in mdb_page_alloc (num=num@entry=1, mp=mp@entry=0x7f3c0727aee8, mc=<optimized out>) at mdb.c:2308
#12 0x00007f3c0b231ba3 in mdb_page_touch (mc=mc@entry=0x7f3c0727b420) at mdb.c:2495
#13 0x00007f3c0b2337c7 in mdb_cursor_touch (mc=mc@entry=0x7f3c0727b420) at mdb.c:6523
#14 0x00007f3c0b2368f9 in mdb_cursor_put (mc=mc@entry=0x7f3c0727b420, key=key@entry=0x7f3c0727b810, data=data@entry=0x7f3c0727b820, flags=flags@entry=0) at mdb.c:6657
#15 0x00007f3c0b23976b in mdb_put (txn=0x55e2ad7109f0, dbi=5, key=key@entry=0x7f3c0727b810, data=data@entry=0x7f3c0727b820, flags=flags@entry=0) at mdb.c:9022
#16 0x00007f3c0c7124c5 in Baloo::DocumentDB::put (this=this@entry=0x7f3c0727b960, docId=<optimized out>, docId@entry=27041423333263366, list=...) at ./src/engine/documentdb.cpp:79
#17 0x00007f3c0c743da7 in Baloo::WriteTransaction::replaceDocument (this=0x55e2ad7ea340, doc=..., operations=operations@entry=...) at ./src/engine/writetransaction.cpp:232
#18 0x00007f3c0c736b16 in Baloo::Transaction::replaceDocument (this=this@entry=0x7f3c0727bc10, doc=..., operations=operations@entry=...) at ./src/engine/transaction.cpp:295
#19 0x000055e2ac5d6cbc in Baloo::UnindexedFileIndexer::run (this=0x55e2ad79ca20) at /usr/include/x86_64-linux-gnu/qt5/QtCore/qrefcount.h:60
#20 0x00007f3c0c177f82 in QThreadPoolThread::run (this=0x55e2ad717f20) at thread/qthreadpool.cpp:99
#21 0x00007f3c0c1749d2 in QThreadPrivate::start (arg=0x55e2ad717f20) at thread/qthread_unix.cpp:361
#22 0x00007f3c0b29d609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#23 0x00007f3c0bcb6103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Nate Graham nate@kde.org changed:
What     |Removed |Added
----------------------------------------------------------------------------
See Also |        |https://bugs.openldap.org/show_bug.cgi?id=8756
--- Comment #1 from Nate Graham nate@kde.org --- Worth mentioning that we previously thought this was https://bugs.openldap.org/show_bug.cgi?id=8756, but that's marked as verified fixed in 0.9.23 yet the issue is still happening. Must be a different issue.
--- Comment #2 from Howard Chu hyc@openldap.org --- (In reply to Nate Graham from comment #1)
> Worth mentioning that we previously thought this was https://bugs.openldap.org/show_bug.cgi?id=8756, but that's marked as verified fixed in 0.9.23 yet the issue is still happening. Must be a different issue.
Do you have the DB files from any of these crashes, that we can examine? Is the failure persistent, does a restart of the program abort again in the exact same place?
It's unlikely that anyone has time to build and debug baloo's source code. Need a self-contained test case.
--- Comment #3 from Howard Chu hyc@openldap.org --- (In reply to Howard Chu from comment #2)
> It's unlikely that anyone has time to build and debug baloo's source code. Need a self-contained test case.
I pulled the baloo git repo, am unable to build it because it requires newer extra-cmake-modules and libkf5* packages than I have available.
--- Comment #4 from Nate Graham nate@kde.org --- Thanks for looking into this, Howard! We have a tool that builds stuff from source. See https://community.kde.org/Get_Involved/development#Frameworks
People's databases tend to get pretty huge, but I can see if I can find anyone who can reliably reproduce the issue with a small DB.
--- Comment #5 from Howard Chu hyc@openldap.org --- (In reply to Nate Graham from comment #4)
> Thanks for looking into this, Howard! We have a tool that builds stuff from source. See https://community.kde.org/Get_Involved/development#Frameworks
> People's databases tend to get pretty huge, but I can see if I can find anyone who can reliably reproduce the issue with a small DB.
I switched to an Arch distro instead of Ubuntu and updated to a current snapshot. Now my build of baloo_file aborts when files change:
ASSERT: "!url.endsWith('/')" in file /mnt/2/software/kde/baloo/src/file/filewatch.cpp, line 102
Thread 1 "baloo_file" received signal SIGABRT, Aborted.
0x00007ffff722e615 in raise () from /usr/lib/libc.so.6
(gdb) bt
#0  0x00007ffff722e615 in raise () from /usr/lib/libc.so.6
#1  0x00007ffff7217862 in abort () from /usr/lib/libc.so.6
#2  0x00007ffff77b09ac in QMessageLogger::fatal(char const*, ...) const () from /usr/lib/libQt5Core.so.5
#3  0x00007ffff77afd59 in qt_assert(char const*, char const*, int) () from /usr/lib/libQt5Core.so.5
#4  0x0000555555585b72 in Baloo::FileWatch::slotFileDeleted (this=0x7fffffffe7e0, urlString=..., isDir=true) at /mnt/2/software/kde/baloo/src/file/filewatch.cpp:102
#5  0x000055555558716e in QtPrivate::FunctorCall<QtPrivate::IndexesList<0, 1>, QtPrivate::List<QString const&, bool>, void, void (Baloo::FileWatch::*)(QString const&, bool)>::call (f=(void (Baloo::FileWatch::*)(Baloo::FileWatch * const, const QString &, bool)) 0x555555585aee <Baloo::FileWatch::slotFileDeleted(QString const&, bool)>, o=0x7fffffffe7e0, arg=0x7fffffffe090) at /usr/include/qt/QtCore/qobjectdefs_impl.h:152
#6  0x0000555555586eaa in QtPrivate::FunctionPointer<void (Baloo::FileWatch::*)(QString const&, bool)>::call<QtPrivate::List<QString const&, bool>, void> (f=(void (Baloo::FileWatch::*)(Baloo::FileWatch * const, const QString &, bool)) 0x555555585aee <Baloo::FileWatch::slotFileDeleted(QString const&, bool)>, o=0x7fffffffe7e0, arg=0x7fffffffe090) at /usr/include/qt/QtCore/qobjectdefs_impl.h:185
#7  0x0000555555586d46 in QtPrivate::QSlotObject<void (Baloo::FileWatch::*)(QString const&, bool), QtPrivate::List<QString const&, bool>, void>::impl (which=1, this_=0x5555555ec510, r=0x7fffffffe7e0, a=0x7fffffffe090, ret=0x0) at /usr/include/qt/QtCore/qobjectdefs_impl.h:418
#8  0x00007ffff7a0b036 in ?? () from /usr/lib/libQt5Core.so.5
#9  0x000055555558ecf9 in KInotify::deleted (this=0x5555555ec1e0, _t1=..., _t2=true) at /mnt/2/software/kde/baloo/build/src/file/baloofilecommon_autogen/include/moc_kinotify.cpp:334
#10 0x000055555558d906 in KInotify::slotEvent (this=0x5555555ec1e0, socket=14) at /mnt/2/software/kde/baloo/src/file/kinotify.cpp:395
#11 0x000055555559347a in QtPrivate::FunctorCall<QtPrivate::IndexesList<0>, QtPrivate::List<QSocketDescriptor>, void, void (KInotify::*)(int)>::call (f=(void (KInotify::*)(KInotify * const, int)) 0x55555558d3ae <KInotify::slotEvent(int)>, o=0x5555555ec1e0, arg=0x7fffffffe340) at /usr/include/qt/QtCore/qobjectdefs_impl.h:152
#12 0x00005555555932a1 in QtPrivate::FunctionPointer<void (KInotify::*)(int)>::call<QtPrivate::List<QSocketDescriptor>, void> (f=(void (KInotify::*)(KInotify * const, int)) 0x55555558d3ae <KInotify::slotEvent(int)>, o=0x5555555ec1e0, arg=0x7fffffffe340) at /usr/include/qt/QtCore/qobjectdefs_impl.h:185
#13 0x0000555555592cd8 in QtPrivate::QSlotObject<void (KInotify::*)(int), QtPrivate::List<QSocketDescriptor>, void>::impl (which=1, this_=0x7fffec03fa00, r=0x5555555ec1e0, a=0x7fffffffe340, ret=0x0) at /usr/include/qt/QtCore/qobjectdefs_impl.h:418
#14 0x00007ffff7a0b036 in ?? () from /usr/lib/libQt5Core.so.5
#15 0x00007ffff7a0e5a0 in QSocketNotifier::activated(QSocketDescriptor, QSocketNotifier::Type, QSocketNotifier::QPrivateSignal) () from /usr/lib/libQt5Core.so.5
#16 0x00007ffff7a0edad in QSocketNotifier::event(QEvent*) () from /usr/lib/libQt5Core.so.5
#17 0x00007ffff79d3cb0 in QCoreApplication::notifyInternal2(QObject*, QEvent*) () from /usr/lib/libQt5Core.so.5
#18 0x00007ffff7a2d556 in ?? () from /usr/lib/libQt5Core.so.5
#19 0x00007ffff5f9b914 in g_main_context_dispatch () from /usr/lib/libglib-2.0.so.0
#20 0x00007ffff5fef7d1 in ?? () from /usr/lib/libglib-2.0.so.0
#21 0x00007ffff5f9a121 in g_main_context_iteration () from /usr/lib/libglib-2.0.so.0
#22 0x00007ffff7a2c941 in QEventDispatcherGlib::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib/libQt5Core.so.5
#23 0x00007ffff79d265c in QEventLoop::exec(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib/libQt5Core.so.5
#24 0x00007ffff79daaf4 in QCoreApplication::exec() () from /usr/lib/libQt5Core.so.5
#25 0x0000555555562bc4 in main (argc=1, argv=0x7fffffffe9f8) at /mnt/2/software/kde/baloo/src/file/main.cpp:78
(gdb)
I guess I need to build a tagged release version instead?
--- Comment #6 from Howard Chu hyc@openldap.org --- (In reply to Howard Chu from comment #5)
> (In reply to Nate Graham from comment #4)
> > Thanks for looking into this, Howard! We have a tool that builds stuff from source. See https://community.kde.org/Get_Involved/development#Frameworks
> > People's databases tend to get pretty huge, but I can see if I can find anyone who can reliably reproduce the issue with a small DB.
> I switched to an Arch distro instead of Ubuntu and updated to a current snapshot. Now my build of baloo_file aborts when files change:
> ASSERT: "!url.endsWith('/')" in file /mnt/2/software/kde/baloo/src/file/filewatch.cpp, line 102
> Thread 1 "baloo_file" received signal SIGABRT, Aborted. 0x00007ffff722e615 in raise () from /usr/lib/libc.so.6
> I guess I need to build a tagged release version instead?
Nope, same thing on tag v5.75.0. There's not much I can do if it's crashing for non-LMDB-related reasons.
Howard Chu hyc@openldap.org changed:
What |Removed |Added
----------------------------------------------------------------------------
CC   |        |zack+ldapbugs@owlfolio.org
--- Comment #7 from Howard Chu hyc@openldap.org --- *** Issue 10114 has been marked as a duplicate of this issue. ***
--- Comment #8 from Howard Chu hyc@openldap.org --- If you can compile your own debug build of liblmdb and run it with baloo, and capture its stderr into a logfile, try this MR https://git.openldap.org/openldap/openldap/-/merge_requests/655
If you compile lmdb with -DMDEB_DEBUG=4 it will output copious amounts of tracing info to stderr. With this MR it will also detect if txns are being misused by more than one thread.
--- Comment #9 from Howard Chu hyc@openldap.org --- Typo: "-DMDB_DEBUG=4"
--- Comment #10 from Zack Weinberg zack+ldapbugs@owlfolio.org --- I already recreated the baloo index from scratch, and it finished doing that with no errors, so I don't think repeating that process with a debug build is going to tell us anything interesting. My own best guess as to what caused the corruption in the first place is the indexer getting killed (by system shutdown) with a transaction in progress.
In https://bugs.kde.org/show_bug.cgi?id=475695#c5 tagwerk19@innerjoin.org suggested a database-checking script, which unfortunately just crashes with
mdb_cursor_get: MDB_PAGE_NOTFOUND: Requested page not found
--- Comment #11 from stefan.bruens@rwth-aachen.de --- @hyc I can definitely assure you this is not a threading issue in baloo, txns are never passed between threads.
--- Comment #12 from stefan.bruens@rwth-aachen.de --- Btw, having the actual rc from mdb_mid2l_insert as part of the assert output would probably be useful.
--- Comment #13 from Howard Chu hyc@openldap.org ---
Created attachment 989 --> https://bugs.openldap.org/attachment.cgi?id=989&action=edit
Replay log output
If you can build LMDB with debug logging enabled, and send the log output from a crashed session, we may be able to learn more. The attached program can replay all of the operations from a debug log, so it can exactly recreate the steps that occurred originally. (Note that it requires the log to start from the very beginning, i.e., when the DB is created.)
Since LMDB is single-writer all of its write operations are completely deterministic, and replaying the same sequence will always produce the exact same DB. The only way for this to fail is if multiple threads used the same write txn at the same time.
Howard Chu hyc@openldap.org changed:
What       |Removed     |Added
----------------------------------------------------------------------------
Status     |UNCONFIRMED |RESOLVED
Resolution |---         |INVALID
--- Comment #14 from Howard Chu hyc@openldap.org --- Thanks to assistance from another user, we've made some progress setting up a KDE test environment to reproduce this issue. Using the replay logging facility in this branch https://git.openldap.org/hyc/openldap/-/tree/mplay09?ref_type=heads we collected a trace from one of the crash instances. The suspicious part is excerpted here:
mdb_put: 0x5638e7e58130, 2, 8[646f6d696e616e74], 16, 0
mdb_put: 0x5638e7e58130, 3, 8[646f6d696e616e74], 24, 0
mdb_put: 0x5638e7e58130, 2, 8[74656c6c74616c65], 8, 0
mdb_put: 0x5638e7e58130, 3, 8[74656c6c74616c65], 11, 0
mdb_put: 0x5638e7e58130, 2, 3[747874], 56200, 0
mdb_env_create: 0x559276b2ddc0
mdb_env_set_maxdbs: 0x559276b2ddc0, 12
mdb_env_set_mapsize: 0x559276b2ddc0, 274877906944
mdb_env_open: 0x559276b2ddc0, /home/vm/.local/share/baloo/index, 16793600, 0664
mdb_txn_begin: 0x559276b2ddc0, (nil), 0 = 0x559276b2f1c0
mdb_dbi_open: 0x559276b2f1c0, postingdb, 262144 = 2
mdb_dbi_open: 0x559276b2f1c0, positiondb, 262144 = 3
mdb_dbi_open: 0x559276b2f1c0, docterms, 262152 = 4
mdb_dbi_open: 0x559276b2f1c0, docfilenameterms, 262152 = 5
mdb_dbi_open: 0x559276b2f1c0, docxatrrterms, 262152 = 6
mdb_dbi_open: 0x559276b2f1c0, idtree, 262152 = 7
mdb_dbi_open: 0x559276b2f1c0, idfilename, 262152 = 8
mdb_dbi_open: 0x559276b2f1c0, documenttimedb, 262152 = 9
mdb_dbi_open: 0x559276b2f1c0, documentdatadb, 262152 = 10
mdb_dbi_open: 0x559276b2f1c0, indexingleveldb, 262152 = 11
mdb_dbi_open: 0x559276b2f1c0, failediddb, 262152 = 12
mdb_dbi_open: 0x559276b2f1c0, mtimedb, 262204 = 13
mdb_txn_commit: 0x559276b2f1c0
mdb_put: 0x5638e7e58130, 3, 3[747874], 91570, 0
mdb_put: 0x5638e7e58130, 2, 2[6368], 464, 0
mdb_put: 0x5638e7e58130, 3, 2[6368], 1286, 0
mdb_put: 0x5638e7e58130, 2, 7[766172696f7573], 1440, 0
mdb_put: 0x5638e7e58130, 3, 7[766172696f7573], 2282, 0
In the middle of txn 0x5638e7e58130 the init sequence occurs again, and all of the contents of this logfile are only being written by a single process. That means baloo_file has opened the same env twice in the same process, which is explicitly forbidden by the LMDB docs. http://www.lmdb.tech/doc/
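The anomaly can also be spotted mechanically. A minimal sketch follows; the `find_reopen_during_txn` helper is hypothetical (not part of the replay tooling) and only assumes the one-operation-per-line `name: arg1, arg2, ...` format visible in the excerpt:

```python
# Sketch: scan a debug trace for an mdb_env_open that appears while an
# earlier write txn still has uncommitted mdb_put calls in the same log.
# Illustrative helper only; assumes the "name: arg1, arg2, ..." format.

def find_reopen_during_txn(lines):
    open_txns = set()   # txn handles seen in mdb_put but not yet committed
    suspicious = []
    for lineno, line in enumerate(lines, 1):
        op, _, rest = line.partition(":")
        args = [a.strip() for a in rest.split(",")]
        if op == "mdb_put":
            open_txns.add(args[0])          # txn handle is the first argument
        elif op == "mdb_txn_commit":
            open_txns.discard(args[0])
        elif op == "mdb_env_open" and open_txns:
            # A fresh env is opened while a write txn is still in flight.
            suspicious.append((lineno, args[0], set(open_txns)))
    return suspicious

# Condensed version of the trace excerpted above:
log = [
    "mdb_put: 0x5638e7e58130, 2, 3[747874], 56200, 0",
    "mdb_env_create: 0x559276b2ddc0",
    "mdb_env_open: 0x559276b2ddc0, /home/vm/.local/share/baloo/index, 16793600, 0664",
    "mdb_txn_begin: 0x559276b2ddc0, (nil), 0 = 0x559276b2f1c0",
    "mdb_txn_commit: 0x559276b2f1c0",
    "mdb_put: 0x5638e7e58130, 3, 3[747874], 91570, 0",
]
print(find_reopen_during_txn(log))
# flags the mdb_env_open on line 3, overlapping txn 0x5638e7e58130
```

On the full trace this flags exactly the init sequence shown in the excerpt.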
Going to close this ticket as Invalid, it's a KDE bug and not an LMDB bug.
--- Comment #15 from stefan.bruens@rwth-aachen.de --- baloo_file uses a singleton for the env, and the actual mdb_env_open is protected by a mutex, so the env can be opened exactly once.

So I have significant doubts this trace is actually from baloo_file built from an unaltered source repository. Note, baloo_file does fork a helper process; maybe that's what can be seen here?

(There was one unit test which did use the singleton and a separate instance, but that's a separate issue.)
--- Comment #16 from Howard Chu hyc@openldap.org --- That possibility crossed my mind too, but this happens only 5% of the way into the log, and after the active txn commits there are no subsequent references to its env, and only the newly opened env pointer is present in the log after that. If this log event occurred because of a forked process I would have expected logging from both envs to be present after that point.
--- Comment #17 from Howard Chu hyc@openldap.org --- We can of course expand the logging to explicitly check for a redundant env_open in the same process. It'll probably be another few hundred hours of iterations to hit it again.
--- Comment #18 from Howard Chu hyc@openldap.org --- We've expanded the logging to include PIDs so we can more definitively show what's happening. But the problem remains:
514137:mdb_put: 0x558dac27e200, 3, 3[706572], 19480, 0
514137:mdb_put: 0x558dac27e200, 2, 4[6d6f6e6f], 720, 0
514137:mdb_put: 0x558dac27e200, 3, 4[6d6f6e6f], 1367, 0
514137:mdb_put: 0x558dac27e200, 2, 5[4d74657874], 1482864, 0
514137:mdb_put: 0x558dac27e200, 3, 5[4d74657874], 1853580, 0
594170:mdb_env_create: 0x556435b8bdc0
594170:mdb_env_set_maxdbs: 0x556435b8bdc0, 12
594170:mdb_env_set_mapsize: 0x556435b8bdc0, 274877906944
594170:mdb_env_open: 0x556435b8bdc0, /home/vm/.local/share/baloo/index, 16793600, 0664
594170:mdb_txn_begin: 0x556435b8bdc0, (nil), 0 = 0x556435b8d1c0
594170:mdb_dbi_open: 0x556435b8d1c0, postingdb, 262144 = 2
594170:mdb_dbi_open: 0x556435b8d1c0, positiondb, 262144 = 3
594170:mdb_dbi_open: 0x556435b8d1c0, docterms, 262152 = 4
594170:mdb_dbi_open: 0x556435b8d1c0, docfilenameterms, 262152 = 5
594170:mdb_dbi_open: 0x556435b8d1c0, docxatrrterms, 262152 = 6
594170:mdb_dbi_open: 0x556435b8d1c0, idtree, 262152 = 7
594170:mdb_dbi_open: 0x556435b8d1c0, idfilename, 262152 = 8
594170:mdb_dbi_open: 0x556435b8d1c0, documenttimedb, 262152 = 9
594170:mdb_dbi_open: 0x556435b8d1c0, documentdatadb, 262152 = 10
594170:mdb_dbi_open: 0x556435b8d1c0, indexingleveldb, 262152 = 11
594170:mdb_dbi_open: 0x556435b8d1c0, failediddb, 262152 = 12
594170:mdb_dbi_open: 0x556435b8d1c0, mtimedb, 262204 = 13
594170:mdb_txn_commit: 0x556435b8d1c0
514137:mdb_put: 0x558dac27e200, 2, 4[64696167], 760, 0
514137:mdb_put: 0x558dac27e200, 3, 4[64696167], 23186, 0
514137:mdb_put: 0x558dac27e200, 2, 6[646576696365], 15944, 0
514137:mdb_put: 0x558dac27e200, 3, 6[646576696365], 31281, 0
Process 514137 has an active write transaction, but process 594170 successfully opens a new write transaction. That can only happen if LMDB's write mutex has been removed out from under it.
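This overlap can likewise be detected mechanically from the PID-prefixed log. A minimal sketch; the `find_overlapping_writers` helper is hypothetical (not part of the replay tooling), and the `mdb_txn_begin` line for PID 514137 in the sample is reconstructed, since the excerpt above only shows its puts:

```python
# Sketch: detect two PIDs holding write txns simultaneously, which LMDB's
# single-writer lock should make impossible. Illustrative only; assumes the
# PID-prefixed "pid:name: args" format shown above, and treats a PID as a
# writer between an mdb_txn_begin with flags 0 and its mdb_txn_commit.

def find_overlapping_writers(lines):
    writers = {}    # pid -> write txn handle currently open
    overlaps = []
    for line in lines:
        pid, op, rest = line.split(":", 2)
        args = [a.strip() for a in rest.split(",")]
        # flags field is the third argument; 0 means a write txn
        if op == "mdb_txn_begin" and args[2].split("=")[0].strip() == "0":
            txn = rest.split("=")[-1].strip()
            if writers and pid not in writers:
                overlaps.append((pid, txn, dict(writers)))
            writers[pid] = txn
        elif op == "mdb_txn_commit":
            writers.pop(pid, None)
    return overlaps

# Condensed sample; the first txn_begin is reconstructed for illustration:
log = [
    "514137:mdb_txn_begin: 0x558dac27c000, (nil), 0 = 0x558dac27e200",
    "514137:mdb_put: 0x558dac27e200, 2, 5[4d74657874], 1482864, 0",
    "594170:mdb_txn_begin: 0x556435b8bdc0, (nil), 0 = 0x556435b8d1c0",
    "594170:mdb_txn_commit: 0x556435b8d1c0",
    "514137:mdb_put: 0x558dac27e200, 2, 4[64696167], 760, 0",
]
print(find_overlapping_writers(log))
# flags PID 594170 beginning a write txn while 514137's is still open
```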
Things like this imply that that's exactly what is happening: https://github.com/KDE/baloo/blob/6f480871cbae83e5f3d02380df40bee73acb820f/s...
--- Comment #19 from Howard Chu hyc@openldap.org --- For reference, the compressed logfile is available temporarily on http://highlandsun.com/hyc/its9378log.xz - it's about 2.6GB uncompressed.
If you feed it to mplay on stdin (https://git.openldap.org/hyc/openldap/-/blob/mplay09/libraries/liblmdb/mplay...), the run will hang at line 12336235 of the logfile, because that txn_begin can't proceed until the write txn in 514137 finishes. mplay can't proceed because it waits for each log line to be processed before moving on to the next one. The baloo code can only proceed because it has broken LMDB's write lock.
Actual corruptions are first detected at line 40127945, much later in the log, but the damage is obviously done far earlier than that.
--- Comment #20 from Howard Chu hyc@openldap.org --- For completeness' sake, here are the scripts used to launch baloo_file and kill it repeatedly, which were needed to reproduce the issue.
#####
#!/bin/bash

while(true); do
	# /usr/bin/baloo_file
	/usr/bin/baloo_file 2>&1 &
	PID=$!
	wait -f $PID
	EXITCODE=$?
	echo "# killed $PID, exitcode $EXITCODE"

	# SIGABRT or SIGSEGV
	if [[ "$EXITCODE" -eq 6 ]] || [[ "$EXITCODE" -eq 11 ]]; then
		echo "===========================" 1>&2
		echo "===========================" 1>&2
		echo "SIGABRT or SIGSEGV happened." 1>&2
		echo "===========================" 1>&2
		echo "===========================" 1>&2
		touch ~/__IMPORTANT_FILE__
		exit 1
	fi
	sleep 0.1s
done;
#####
#!/bin/bash

sleep $(shuf -i 30-50 -n 1)s;
while (true); do
	sudo killall -s 9 baloo_file
	sleep $(shuf -i 20-480 -n 1)s
done;
#####
--- Comment #21 from Quanah Gibson-Mount quanah@openldap.org --- Replay log commits:
mdb.master:

• 754f3cb3 by Howard Chu at 2024-01-09T17:17:21+00:00 ITS#9378 Add explicit replay logging
• b8e54b4c by Howard Chu at 2024-01-09T17:41:59+00:00 ITS#9378 Add replay tool
mdb.master3:
• 9d45a80b by Howard Chu at 2024-01-09T17:19:36+00:00 ITS#9378 Add explicit replay logging
• 009dd916 by Howard Chu at 2024-01-09T17:42:58+00:00 ITS#9378 Add replay tool
mdb.RE/0.9:
• 4a19b804 by Howard Chu at 2024-01-09T17:27:59+00:00 ITS#9378 Add explicit replay logging
• 9bafe549 by Howard Chu at 2024-01-09T17:41:21+00:00 ITS#9378 Add replay tool
--- Comment #22 from stefan.bruens@rwth-aachen.de --- The lockfile removal comes from a time when robust mutexes were not available, or at least not widely available.
Of course just deleting the lockfile is wrong, but the Caveats documenting the correct approach for dealing with stale locks were added at about the same time (and there may have been some delay before this documentation was publicly visible and widely known).
As this is no longer the case, and robust mutexes can almost always be assumed to be available (at least on Linux and *BSD), there is no longer any need to handle stale locks explicitly.
Is there any way to query at runtime whether the env uses robust mutexes? (Runtime is required, as liblmdb is dynamically linked.)
--- Comment #23 from stefan.bruens@rwth-aachen.de ---
> That means baloo_file has opened the same env twice in the same process
> We've expanded the logging to include PIDs so we can more definitively show what's happening. But the problem remains:
> ... Process 514137 has an active write transaction, but process 594170 successfully opens a new write transaction. That can only happen if LMDB's write mutex has been removed out from under it.
So it actually never was a duplicated open from the same process, but multiple opens from different processes, correct?
--- Comment #24 from Howard Chu hyc@openldap.org --- (In reply to stefan.bruens from comment #23)
> > That means baloo_file has opened the same env twice in the same process
> > We've expanded the logging to include PIDs so we can more definitively show what's happening. But the problem remains:
> > ... Process 514137 has an active write transaction, but process 594170 successfully opens a new write transaction. That can only happen if LMDB's write mutex has been removed out from under it.
> So it actually never was a duplicated open from the same process, but multiple opens from different processes, correct?
Correct. My initial guess was based on the wild assumption that nobody would be stupid enough to actually delete the lockfile.
The caveat was always clear:
 * Otherwise just make all programs using the database close it;
 * the lockfile is always reset on first open of the environment.
You never had to do anything special, regardless of whether robust mutexes were in use or not. Once all processes using the DB exit, the lockfile doesn't matter any more.
--- Comment #25 from Howard Chu hyc@openldap.org --- (In reply to stefan.bruens from comment #22)
> Is there any way to query at runtime whether the env uses robust mutexes? (Runtime is required, as liblmdb is dynamically linked.)
No. Nor is there any reason to query this, since the application shouldn't care either way. Especially for programs like baloo_file, which start up periodically, do some work, and then exit - the lockfile becomes irrelevant when the last process opening the env closes it, so any potentially stale locks go away by themselves anyway.
--- Comment #26 from stefan.bruens@rwth-aachen.de --- (In reply to Howard Chu from comment #25)
> (In reply to stefan.bruens from comment #22)
> > Is there any way to query at runtime whether the env uses robust mutexes? (Runtime is required, as liblmdb is dynamically linked.)
> No. Nor is there any reason to query this, since the application shouldn't care either way. Especially for programs like baloo_file, which start up periodically, do some work, and then exit - the lockfile becomes irrelevant when the last process opening the env closes it, so any potentially stale locks go away by themselves anyway.
Your understanding here seems to be quite incorrect. baloo_file is started with the session, and keeps running until the session ends. Only the extractor process is spawned on demand.
Taken from http://www.lmdb.tech/doc/index.html
- Windows - automatic
- Linux, systems using POSIX mutexes with Robust option - automatic
- not on BSD, systems using POSIX semaphores. Otherwise just make all programs using the database close it; the lockfile is always reset on first open of the environment.
Note the third bullet point - there apparently *is* special handling required for these systems. "Otherwise just make all programs using the database close it;".
There may be many processes which also read the DB, and there is also another helper process which runs for the whole session duration. Other processes (e.g. dolphin) also keep the env open while they are running. These processes open read transactions only for very short durations, but they may nevertheless crash (not least because some users think it is a good idea to SIGKILL random processes). Btw, apparently there is no API to query stale readers in a programmatically useful way, and no way to query active readers at all.
Also, the Caveat only became clear when it was actually written (and obviously there was a need for it), and depending on where you got the documentation from (the homepage or the header file from your distribution) it may have just been missing. Or it was overlooked, because it is not linked from the mdb_env_open API documentation. No reason to call anyone stupid here ...
--- Comment #27 from Howard Chu hyc@openldap.org --- (In reply to stefan.bruens from comment #26)
> (In reply to Howard Chu from comment #25)
> > (In reply to stefan.bruens from comment #22)
> > > Is there any way to query at runtime whether the env uses robust mutexes? (Runtime is required, as liblmdb is dynamically linked.)
> > No. Nor is there any reason to query this, since the application shouldn't care either way. Especially for programs like baloo_file, which start up periodically, do some work, and then exit - the lockfile becomes irrelevant when the last process opening the env closes it, so any potentially stale locks go away by themselves anyway.
> Your understanding here seems to be quite incorrect. baloo_file is started with the session, and keeps running until the session ends. Only the extractor process is spawned on demand.
Then it's even more stupid, because the extractor is deleting the lockfile with full knowledge that the process that spawned it is still active. Regardless of any particular documentation version, deleting the lockfile of a DB that you know full well is open in more than one process is stupid.
> There may be many processes which also read the DB, and there is also another helper process which runs for the whole session duration. Other processes (e.g. dolphin) also keep the env open while they are running. These processes open read transactions only for very short durations, but they may nevertheless crash (not least because some users think it is a good idea to SIGKILL random processes). Btw, apparently there is no API to query stale readers in a programmatically useful way, and no way to query active readers at all.
https://git.openldap.org/openldap/openldap/-/blob/mdb.RE/0.9/libraries/liblm...
There's nothing useful you can do with a list of stale readers, so the only API you need is one which clears them. Which is what is provided. It's debatable whether you can programmatically do anything useful with a list of active readers. You can always provide a FILE* handle to a socket if you really want it.
--- Comment #28 from stefan.bruens@rwth-aachen.de --- (In reply to Howard Chu from comment #27)
(In reply to stefan.bruens from comment #26)
(In reply to Howard Chu from comment #25)
(In reply to stefan.bruens from comment #22)
Is there any way to query at runtime if the env uses robust mutexes? (Runtime is required, as we liblmdb is dynamically linked.)
No. Nor is there any reason to query this, since the application shouldn't care either way. Especially for programs like baloo_file, which start up periodically, do some work, and then exit - the lockfile becomes irrelevant when the last process opening the env closes it, so any potentially stale locks go away by themselves anyway.
Your understanding here seems to be quite incorrect. baloo_file is started with the session, and keeps running until the session ends. Only the extractor process is spawned on demand.
Then it's even more stupid, because the extractor is deleting the lockfile with full knowledge that the process that spawned it is still active. Regardless of any particular documentation version, deleting the lockfile of a DB that you know full well is open in more than one process is stupid.
How about calming down and reading what I wrote? baloo_file is not the extractor process; the extractor *never* deletes the lockfile.
--- Comment #29 from Howard Chu hyc@openldap.org --- (In reply to stefan.bruens from comment #28)
> (In reply to Howard Chu from comment #27)
> > (In reply to stefan.bruens from comment #26)
> > > (In reply to Howard Chu from comment #25)
> > > > (In reply to stefan.bruens from comment #22)
> > > > > Is there any way to query at runtime whether the env uses robust mutexes? (Runtime is required, as liblmdb is dynamically linked.)
> > > > No. Nor is there any reason to query this, since the application shouldn't care either way. Especially for programs like baloo_file, which start up periodically, do some work, and then exit - the lockfile becomes irrelevant when the last process opening the env closes it, so any potentially stale locks go away by themselves anyway.
> > > Your understanding here seems to be quite incorrect. baloo_file is started with the session, and keeps running until the session ends. Only the extractor process is spawned on demand.
> > Then it's even more stupid, because the extractor is deleting the lockfile with full knowledge that the process that spawned it is still active. Regardless of any particular documentation version, deleting the lockfile of a DB that you know full well is open in more than one process is stupid.
> How about calming down and reading what I wrote? baloo_file is not the extractor process; the extractor *never* deletes the lockfile.
It really doesn't matter which of these two processes is doing the delete, since as you already said, there are other processes that have the DB open. And clearly if baloo_file restarts for any reason, when it deletes the lockfile corruption will soon follow.
I don't even know why we're having this conversation. We've wasted enough time on this issue - countless hours for countless people wasted. There's nothing left for us to discuss here.
Quanah Gibson-Mount quanah@openldap.org changed:
What   |Removed  |Added
----------------------------------------------------------------------------
Status |RESOLVED |VERIFIED