https://bugs.openldap.org/show_bug.cgi?id=9291
Issue ID: 9291 Summary: Detection of corrupted database files Product: LMDB Version: unspecified Hardware: All OS: All Status: UNCONFIRMED Severity: normal Priority: --- Component: liblmdb Assignee: bugs@openldap.org Reporter: markus@objectbox.io Target Milestone: ---
Let's assume we have to deal with a corrupted database for whatever reason (e.g. broken hardware or file system). Current behavior seems to be mostly undefined, which is understandable as it's not known what is broken (e.g. there are no checksums).
For example, I'm seeing a SIGBUS in mdb_page_touch because the cursor's top page (mp) is pointing to invalid memory (0x7f99cf004000) during a commit: mdb_page_touch mdb.c:2772 mdb_page_search mdb.c:6595 mdb_freelist_save mdb.c:3575 mdb_txn_commit mdb.c:4060
Cursor data at that point: mc_snum = 1, mc_top = 0; myki[0] = 0
A SIGBUS is troublesome as it crashes the process, and I wonder if there are other ways to detect such inconsistencies. If that be possible there could be user-specific handling in place. E.g. a user might start a new database file.
This issue was reported by our users, which also provided DB files: https://github.com/objectbox/objectbox-java/issues/859
I did not find a lot of consistency checks besides MDB_PAGE_NOTFOUND and MDB_CORRUPTED. Also, I think there's no current way to thoroughly check a DB file (e.g. like fsck for the DB file)?
My first idea other than checksums was to walk through the branch pages from the root and check if the referenced pages are within reasonable bounds. Also check the page content (e.g. nodes, flags). Additionally (optionally?), it should be possible to check that the key values are actually sorted.
So, it boils down to 3 points in summary: 1.) If there no way to check the DB file for consistency yet(?), which approach do you think would make sense? There might be two modes; one for a through check through all data, and a quick check that does not take long and could be e.g. done when opening the DB. Goal is to avoid process crashes and let users handle the situation. 2.) In general, is it possible to add more consistency checks in regular DB operations? 3.) Could the the particular situation (for which I provided the stack trace) detected (e.g. is myki[0] = 0 legal here?)
I'd be happy to provide a patch if you provide some direction where you want to take that.
https://bugs.openldap.org/show_bug.cgi?id=9291
--- Comment #1 from Howard Chu hyc@openldap.org --- (In reply to Markus from comment #0)
Let's assume we have to deal with a corrupted database for whatever reason (e.g. broken hardware or file system). Current behavior seems to be mostly undefined, which is understandable as it's not known what is broken (e.g. there are no checksums).
For example, I'm seeing a SIGBUS in mdb_page_touch because the cursor's top page (mp) is pointing to invalid memory (0x7f99cf004000) during a commit: mdb_page_touch mdb.c:2772 mdb_page_search mdb.c:6595 mdb_freelist_save mdb.c:3575 mdb_txn_commit mdb.c:4060
Cursor data at that point: mc_snum = 1, mc_top = 0; myki[0] = 0
A SIGBUS is troublesome as it crashes the process, and I wonder if there are other ways to detect such inconsistencies. If that be possible there could be user-specific handling in place. E.g. a user might start a new database file.
This issue was reported by our users, which also provided DB files: https://github.com/objectbox/objectbox-java/issues/859
I did not find a lot of consistency checks besides MDB_PAGE_NOTFOUND and MDB_CORRUPTED. Also, I think there's no current way to thoroughly check a DB file (e.g. like fsck for the DB file)?
My first idea other than checksums was to walk through the branch pages from the root and check if the referenced pages are within reasonable bounds. Also check the page content (e.g. nodes, flags). Additionally (optionally?), it should be possible to check that the key values are actually sorted.
So, it boils down to 3 points in summary: 1.) If there no way to check the DB file for consistency yet(?), which approach do you think would make sense? There might be two modes; one for a through check through all data, and a quick check that does not take long and could be e.g. done when opening the DB. Goal is to avoid process crashes and let users handle the situation. 2.) In general, is it possible to add more consistency checks in regular DB operations? 3.) Could the the particular situation (for which I provided the stack trace) detected (e.g. is myki[0] = 0 legal here?)
I'd be happy to provide a patch if you provide some direction where you want to take that.
Hooks for per-page checksums (using user-provided checksum function) are already present in the 1.0 branch. It is currently only available to commercial licensed LMDB users.
re: (2) pervasive consistency checks will probably not be implemented since they would break LMDB's #1 reason to exist - fast read performance. (Although, we already do have a few such checks when compiled with higher values of MDB_DEBUG. You always have the option of compiling with your own debug checks as well.) Ultimately, it's not our responsibility to detect broken hardware or filesystems, it is the OS's.
https://bugs.openldap.org/show_bug.cgi?id=9291
--- Comment #2 from Markus markus@objectbox.io --- (In reply to Howard Chu from comment #1)
re: (2) pervasive consistency checks will probably not be implemented since they would break LMDB's #1 reason to exist - fast read performance.
Just to clarify, (2) would be about simple checks that should have no measurable performance impact, e.g. "page number is not zero" where appropriate, not the expensive fsck-like ones from (1).
Would you say that the described situation could be detected efficiently? (Basically my question (3)...)
(Although, we already do have a few such checks when compiled with higher values of MDB_DEBUG.
Thanks for that hint, I just found mdb_audit when looking for MDB_DEBUG. On a first glance it may be something similar to what I had in mind with (1). Will have a closer look at that and other MDB_DEBUG related code.
You always have the option of compiling with your own debug checks as well.) Ultimately, it's not our responsibility to detect broken hardware or filesystems, it is the OS's.
This could be read as you suggest forking for those cases? Again, to clarify, does that include checks that have no measurable performance impact?
https://bugs.openldap.org/show_bug.cgi?id=9291
--- Comment #3 from Markus markus@objectbox.io --- I digged deeper and found the actual cause for the SIGBUS and it which might be simple to detect. In short, a meta page points to a page beyond the file.
We could detect this in mdb_env_map and thus avoid the SIGBUS like this (first two lines exist and are for context):
env->me_metas[0] = METADATA(p); env->me_metas[1] = (MDB_meta *)((char *)env->me_metas[0] + env->me_psize);
mdb_size_t fsize; if(mdb_fsize(env->me_fd, &fsize) == MDB_SUCCESS && fsize) { pgno_t maxpgno = fsize / env->me_psize; if(env->me_metas[0]->mm_dbs[FREE_DBI].md_root > maxpgno || env->me_metas[0]->mm_dbs[MAIN_DBI].md_root > maxpgno || env->me_metas[1]->mm_dbs[FREE_DBI].md_root > maxpgno || env->me_metas[1]->mm_dbs[MAIN_DBI].md_root > maxpgno) { return MDB_PAGE_NOTFOUND; } }
Would that basic consistency check work for you?
Some background: file size was 2232320 (545 pages) and thus this new code caught the bad root at page 554. On odd thing was that I was able to see only 500 pages in the debugger's memory view; page 501 was already inaccessible. Didn't proceed there however...
https://bugs.openldap.org/show_bug.cgi?id=9291
--- Comment #4 from Howard Chu hyc@openldap.org --- If there are any references to pages beyond the end of the file, that implies that an fsync didn't complete. Are you running with NOSYNC or NOMETASYNC?
On Windows one of the common symptoms we see after a system crash is that trailing pages of the DB file are all-zero. Again, this indicates an fsync that didn't complete (or an NTFS bug).
I think your proposed change is too specific, since it could be any page, not just the meta pages, that references beyond the end of the file.
https://bugs.openldap.org/show_bug.cgi?id=9291
--- Comment #5 from Markus markus@objectbox.io --- We don't use NOSYNC or NOMETASYNC. The OS is Android, and it's likely that some device manufacture did something wrong in a specific model. However, I believe the specifics of how this happened are secondary as it will happen again in a different situation.
I think your proposed change is too specific, since it could be any page, not just the meta pages, that references beyond the end of the file.
Yes. There's a trade-off. Checking the roots can be done simply and in constant time. I'm aware that this is not a thorough check, which would require a tree traversal. I think in scenarios that are mostly appending new data, the chance would be quite good to catch a broken fsync. Of course it would not catch cases were pages are reused.
PS.: I added P_INVALID handling meanwhile.
https://bugs.openldap.org/show_bug.cgi?id=9291
--- Comment #6 from Markus markus@objectbox.io --- (In reply to Markus from comment #5)
Yes. There's a trade-off. Checking the roots can be done simply and in constant time. ...
Maybe it makes sense to level this trade-off out a bit. Let's say a branch page can reference up to around 40 page numbers. Checking the 4 root pages with a total of around 160 page numbers still seems quite doable without noticeable performance impact.
Going down another level might get noticeable with 160 pages (640 KB), but still doable. Actually, going down could be limited to the more recent meta tree and thus cutting it by half to 80 pages. It would give us up to 3200 additional page numbers we could check for consistency. The performance penalty is at least partially compensated by the fact that this "warms up" (caches) the top of the tree for future access.
What do you think?
https://bugs.openldap.org/show_bug.cgi?id=9291
--- Comment #7 from Markus markus@objectbox.io --- PS.: the math is different if you have really short keys, e.g. 4 bytes. In that case a node in the page should be around 18 bytes (?; from what I saw: 8 bytes meta info, 4 bytes key, 6 bytes page no). So let's say 225 page refs per page. Thus, going down 2 pages by one level may result in up to 450 pages or 1.8 MB. That might be too much already. Solvable by limiting the max number of pages to look for, e.g. first 100 pages.
https://bugs.openldap.org/show_bug.cgi?id=9291
--- Comment #8 from Markus markus@objectbox.io --- To move on with this quickly, we'll put the described checks in our next release (with some improvements). If you are interested in this, please let me know, I'm happy to contribute this.
Related question: the mentioned checksum are limited to the contents of a page, right? Thus, it does not help with verifying the consistency of the tree (within reasonable effort), correct? To my understanding that would require some bottom-up checksuming, e.g. checksum of a parent would also consider the checksums of its children (similar to a Merkle tree). Since this would have severe performance implications, I assume you don't consider to support that (ever)?
https://bugs.openldap.org/show_bug.cgi?id=9291
--- Comment #9 from Howard Chu hyc@openldap.org --- (In reply to Markus from comment #8)
To move on with this quickly, we'll put the described checks in our next release (with some improvements). If you are interested in this, please let me know, I'm happy to contribute this.
Feel free to attach your patch to this ticket. Please read https://openldap.org/devel/contributing.html first.
Related question: the mentioned checksum are limited to the contents of a page, right?
Right.
Thus, it does not help with verifying the consistency of the tree (within reasonable effort), correct? To my understanding that would require some bottom-up checksuming, e.g. checksum of a parent would also consider the checksums of its children (similar to a Merkle tree). Since this would have severe performance implications, I assume you don't consider to support that (ever)?
Not for production code, no.
https://bugs.openldap.org/show_bug.cgi?id=9291
Markus markus@objectbox.io changed:
What |Removed |Added ---------------------------------------------------------------------------- CC| |markus@objectbox.io
--- Comment #10 from Markus markus@objectbox.io --- Created attachment 744 --> https://bugs.openldap.org/attachment.cgi?id=744&action=edit First part: validating database root page numbers
The attached file is derived from OpenLDAP Software. All of the modifications to OpenLDAP Software represented in the following patch(es) were developed by ObjectBox Ltd. ObjectBox Ltd. has not assigned rights and/or interest in this work to any party. I, Markus Junginger am authorized by ObjectBox Ltd., my employer, to release this work under the following terms.
ObjectBox Ltd. hereby places the following modifications to OpenLDAP Software (and only these modifications) into the public domain. Hence, these modifications may be freely used and/or redistributed for any purpose with or without attribution and/or other notice.
https://bugs.openldap.org/show_bug.cgi?id=9291
Quanah Gibson-Mount quanah@openldap.org changed:
What |Removed |Added ---------------------------------------------------------------------------- Keywords| |has_patch
https://bugs.openldap.org/show_bug.cgi?id=9291
--- Comment #11 from Markus markus@objectbox.io --- For the second part, I think a new function that can be called optionally after the env has been opened, makes sense. Something like this draft:
/** * Checks the database for inconsistencies, e.g. page references beyond the * end of file. * @param max_page_number the maximum number of pages to check. * This limits the time spent, e.g. for a quick check on opening. * Pass zero to check all pages (the entire database). */ int mdb_check_pages(MDB_cursor *cursor, mdb_size_t max_page_number);
Please let me know if that looks good to you and if you have any naming preferences (not so sure here, e.g. mdb_page_ function prefix would also be fitting but is only used in mdb.c).
https://bugs.openldap.org/show_bug.cgi?id=9291
--- Comment #12 from Howard Chu hyc@openldap.org --- (In reply to Markus from comment #11)
For the second part, I think a new function that can be called optionally after the env has been opened, makes sense. Something like this draft:
/**
- Checks the database for inconsistencies, e.g. page references beyond the
- end of file.
- @param max_page_number the maximum number of pages to check.
This limits the time spent, e.g. for a quick check on opening.
Pass zero to check all pages (the entire database).
*/ int mdb_check_pages(MDB_cursor *cursor, mdb_size_t max_page_number);
Please let me know if that looks good to you and if you have any naming preferences (not so sure here, e.g. mdb_page_ function prefix would also be fitting but is only used in mdb.c).
You could simply add a check of pgno in mdb_page_get(), and check every page reference as they occur.
https://bugs.openldap.org/show_bug.cgi?id=9291
--- Comment #13 from Markus markus@objectbox.io --- (In reply to Howard Chu from comment #12)
You could simply add a check of pgno in mdb_page_get(), and check every page reference as they occur.
How would we know the maximum valid page number (e.g. maxpgno = fsize / env->me_psize)? I assumed we need mdb_fsize, which we obviously don't want to call too often. Happy to learn about a better alternative.
https://bugs.openldap.org/show_bug.cgi?id=9291
--- Comment #14 from Howard Chu hyc@openldap.org --- (In reply to Markus from comment #13)
(In reply to Howard Chu from comment #12)
You could simply add a check of pgno in mdb_page_get(), and check every page reference as they occur.
How would we know the maximum valid page number (e.g. maxpgno = fsize / env->me_psize)? I assumed we need mdb_fsize, which we obviously don't want to call too often. Happy to learn about a better alternative.
Use mm_last_pg from the meta page.
https://bugs.openldap.org/show_bug.cgi?id=9291
--- Comment #15 from Howard Chu hyc@openldap.org --- (In reply to Howard Chu from comment #14)
(In reply to Markus from comment #13)
(In reply to Howard Chu from comment #12)
You could simply add a check of pgno in mdb_page_get(), and check every page reference as they occur.
How would we know the maximum valid page number (e.g. maxpgno = fsize / env->me_psize)? I assumed we need mdb_fsize, which we obviously don't want to call too often. Happy to learn about a better alternative.
Use mm_last_pg from the meta page.
Actually, that's already been checked for you, you only need to compare to mt_next_pgno. And mdb_page_get already checks it. Though perhaps the check could be done earlier in that function.
https://bugs.openldap.org/show_bug.cgi?id=9291
--- Comment #16 from Markus markus@objectbox.io --- (In reply to Howard Chu from comment #15)
Use mm_last_pg from the meta page.
Actually, that's already been checked for you, you only need to compare to mt_next_pgno. And mdb_page_get already checks it. Though perhaps the check could be done earlier in that function.
Right, checking this earlier seems a good idea as the check might be skipped otherwise by one of the gotos?
1) What kept me away from using mm_last_pg was its comment "Actually the file may be shorter if the freeDB lists the final pages." So I wasn't sure if and when there are cases when mm_last_pg points beyond the file(?).
2) mt_next_pgno itself should be checked against the file size on opening. I think, I'll update the submitted patch.
3) A preemptive check to peek into a few pages might still make sense on opening, e.g. at that point one could easily switch to the previous snapshot. Also there might be additional checks possible, e.g. pages must not point to a parent, page numbers in branch pages must be 2 or greater, ...
https://bugs.openldap.org/show_bug.cgi?id=9291
Markus markus@objectbox.io changed:
What |Removed |Added ---------------------------------------------------------------------------- Attachment #744 is|0 |1 obsolete| |
--- Comment #17 from Markus markus@objectbox.io --- Created attachment 745 --> https://bugs.openldap.org/attachment.cgi?id=745&action=edit First part: validating database root page numbers (Updated)
Also check mm_last_pg against maxpgno
https://bugs.openldap.org/show_bug.cgi?id=9291
--- Comment #18 from Howard Chu hyc@openldap.org --- (In reply to Markus from comment #16)
(In reply to Howard Chu from comment #15)
Use mm_last_pg from the meta page.
Actually, that's already been checked for you, you only need to compare to mt_next_pgno. And mdb_page_get already checks it. Though perhaps the check could be done earlier in that function.
Right, checking this earlier seems a good idea as the check might be skipped otherwise by one of the gotos?
- What kept me away from using mm_last_pg was its comment "Actually the
file may be shorter if the freeDB lists the final pages." So I wasn't sure if and when there are cases when mm_last_pg points beyond the file(?).
It's possible for some pages to be dirtied at the end of the file, but then get deleted/unused again before the txn commits. If nothing else in the txn tries to reuse them, they'll go onto the freeDB. If this happens, then technically we've allocated those pages, but since they never got flushed to the filesystem, the file may be shorter.
- mt_next_pgno itself should be checked against the file size on opening. I
think, I'll update the submitted patch.
https://bugs.openldap.org/show_bug.cgi?id=9291
--- Comment #19 from Markus markus@objectbox.io --- (In reply to Howard Chu from comment #18)
It's possible for some pages to be dirtied at the end of the file, but then get deleted/unused again before the txn commits.
I was able to reproduce this scenario and I don't see how mm_last_pg could be used in testing. Did I miss something? So, unless you have a better idea, I guess I'm back to my initially proposed patch.
https://bugs.openldap.org/show_bug.cgi?id=9291
Markus markus@objectbox.io changed:
What |Removed |Added ---------------------------------------------------------------------------- Attachment #744 is|1 |0 obsolete| |
https://bugs.openldap.org/show_bug.cgi?id=9291
Markus markus@objectbox.io changed:
What |Removed |Added ---------------------------------------------------------------------------- Attachment #745 is|0 |1 obsolete| |
https://bugs.openldap.org/show_bug.cgi?id=9291
--- Comment #20 from Markus markus@objectbox.io --- Created attachment 750 --> https://bugs.openldap.org/attachment.cgi?id=750&action=edit Alternative: validating database root page numbers and sums
The additional sum check may catch additional corruptions e.g. if the last (corrupted) commit added a significant amount of pages (or messed up page counts).
https://bugs.openldap.org/show_bug.cgi?id=9291
--- Comment #21 from Markus markus@objectbox.io --- Fyi, we'll do a release on our side using the patch of the last attachment.
Would be great to have a rough indication where you want to go with this (if at all).
https://bugs.openldap.org/show_bug.cgi?id=9291
--- Comment #22 from Howard Chu hyc@openldap.org --- (In reply to Markus from comment #21)
Fyi, we'll do a release on our side using the patch of the last attachment.
Would be great to have a rough indication where you want to go with this (if at all).
The patch looks OK, still checking to see if there's any case it can possibly break.