https://bugs.openldap.org/show_bug.cgi?id=9360
Issue ID: 9360
Summary: MDB_BAD_TXN: Transaction must abort, has a child, or is invalid
Product: LMDB
Version: unspecified
Hardware: All
OS: All
Status: UNCONFIRMED
Severity: normal
Priority: ---
Component: liblmdb
Assignee: bugs@openldap.org
Reporter: spam@markandruth.co.uk
Target Milestone: ---
I have 2 Python scripts writing to a database (lmdb 0.9.26, py-lmdb 0.98) and 5-10 long-running Lua processes (using the lightningmdb module, which uses lmdb 0.9.22) serving queries from the database.
The database seems fine, not corrupted, and the Python writes keep working the whole time. But periodically (perhaps 10-20% of the time), in a way I am unable to reliably reproduce, after the Lua processes start up, every time a query is issued txn dbi_open returns "MDB_BAD_TXN: Transaction must abort, has a child, or is invalid". A direct restart of the processes does not fix this issue; however, stopping Lua and Python and then starting them again after a 5-20s wait usually fixes it. This has been reproduced on multiple servers, but I'm at a loss as to how to debug it any further.
--- Comment #1 from Howard Chu hyc@openldap.org --- (In reply to spam@markandruth.co.uk from comment #0)
You didn't specify what OS you're using. This doesn't sound like an "all hardware/systems" type of issue.
You should probably not mix versions, as a general rule.
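As a rough illustration of the version point above, here is a minimal sketch that groups components by the LMDB version they link against and flags a mix. The helper names (`group_by_lmdb_version`, `versions_are_mixed`) and the component labels are hypothetical; the version numbers are the ones stated in the report. This only detects a mismatch; it does not show that the mismatch causes the MDB_BAD_TXN error.

```python
from collections import defaultdict


def group_by_lmdb_version(linked):
    """Map each LMDB version tuple to the sorted names of components using it.

    `linked` is a hypothetical mapping of component name -> (major, minor, patch).
    """
    groups = defaultdict(list)
    for component, version in linked.items():
        groups[tuple(version)].append(component)
    return {version: sorted(names) for version, names in groups.items()}


def versions_are_mixed(linked):
    """Return True when the components link against more than one LMDB version."""
    return len(group_by_lmdb_version(linked)) > 1


# Versions as stated in the report. On the Python side, py-lmdb's
# lmdb.version() could supply the tuple instead of hard-coding it.
reported = {
    "py-lmdb writers": (0, 9, 26),
    "lightningmdb readers": (0, 9, 22),
}

groups = group_by_lmdb_version(reported)
for version, components in sorted(groups.items()):
    print(".".join(map(str, version)), "->", ", ".join(components))
print("mixed versions:", versions_are_mixed(reported))
```

Running a check like this at container start-up would make a version skew visible before any queries are served.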
--- Comment #2 from spam@markandruth.co.uk spam@markandruth.co.uk --- It's a Docker container running Alpine edge on a Google Cloud COS host with HDDs (all processes accessing the DB run within the same container). From some further tests earlier in the day, it seems that only a subset of the Lua processes have the issue, or perhaps it is intermittent in some of the Lua processes; I haven't quite figured this out yet. mdb_stat -r doesn't show anything strange, just 5-10 readers. The DB is ~500 MB.
I often see this when restarting the container after it has been running for several days; more frequent restarts don't seem to show the issue as much, which leads me to think it may be related to slow HDD cache access causing a race or something similar. But I don't fully understand why it would be a persistent issue rather than just the first few requests having a problem.
I'm happy to try to debug this further but need a bit of guidance as to what is the best data to try to get to figure this out.
--- Comment #3 from Howard Chu hyc@openldap.org --- (In reply to spam@markandruth.co.uk from comment #2)
Docker has been known to cause issues, particularly due to its use of overlay filesystems. If you use external persistent storage this problem will probably go away.
--- Comment #4 from spam@markandruth.co.uk spam@markandruth.co.uk --- The LMDB folder is actually within an external volume mount (i.e. a bind mount rather than a 'docker volume'), so I don't think it's anything overlay related.
Nate Graham nate@kde.org changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |nate@kde.org
--- Comment #5 from Nate Graham nate@kde.org --- We are seeing this in KDE's baloo file indexer running on a variety of Linux-based OSs. See https://bugs.kde.org/show_bug.cgi?id=422008 for more details.