Hi,
I am using LMDB version 0.9.20 on Linux (Ubuntu derivatives; for uname output see [1], [2]). One of the databases serves as an index to another database and was therefore created with the MDB_DUPSORT flag. Running my software in a test environment generated about 33 million entries in this database. To falsify a suspicion that my software does not perform housekeeping correctly, I copied the LMDB file to my workstation and forced my software to delete all "legal" entries in order to see whether any entries remain. Unfortunately, I got an MDB_CORRUPTED error during the delete operation on that database.
Some more details: The deletion takes place in multiple steps, by repeatedly calling a function that deletes a range in a database. The code is as follows (omitting some boilerplate code):
unsigned int dbFlags;
int error = mdb_dbi_flags (txn, dbi, &dbFlags);
// [...]
bool isDupSort = dbFlags & MDB_DUPSORT;
error = mdb_cursor_open (txn, dbi, &cursor);
// [...]
error = mdb_cursor_get (cursor, &ckey, &cdata, MDB_SET_RANGE);
while (error != MDB_NOTFOUND) {
    // [...]
    int compResult = mdb_cmp (txn, dbi, &ckey, &ekey);
    if (compResult > 0 || (!compResult && !endIsInclusive)) break;
    error = mdb_cursor_del (cursor, isDupSort ? MDB_NODUPDATA : 0);
    // [...]
    error = mdb_cursor_get (cursor, &ckey, &cdata, MDB_NEXT);
}
mdb_cursor_close (cursor);
Is this the correct way to delete the data? The MDB_CORRUPTED error occurs in the mdb_cursor_del call. Other operations on that specific database are mdb_put (with no flags) and mdb_del, supplying both key and data.
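For readers following the loop condition: since && binds tighter than ||, the break test reads as "the current key is past the end key, or it equals the end key and the end is exclusive". A minimal sketch of that predicate as a standalone helper follows; the name past_range_end is hypothetical and is neither part of LMDB nor of the code above.

```c
#include <stdbool.h>

/* Hypothetical helper, not part of LMDB: decides whether the cursor has
 * moved past the range to delete.
 * cmp is the result of mdb_cmp(txn, dbi, &ckey, &ekey):
 *   <  0  current key sorts before the end key
 *   == 0  current key equals the end key
 *   >  0  current key sorts after the end key */
static bool past_range_end(int cmp, bool end_is_inclusive)
{
    if (cmp > 0)
        return true;                  /* beyond the end key */
    if (cmp == 0 && !end_is_inclusive)
        return true;                  /* at an exclusive end key */
    return false;                     /* still inside the range */
}
```

In the loop above, the break line would then read: if (past_range_end (compResult, endIsInclusive)) break;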
One side observation: In a similar test with a lower number of entries, the database was completely emptied. However, mdb_stat still reported a large number of entries for the database (a five- to six-digit figure). I also use the stat data to estimate the size of the database by adding up all page counts and multiplying the sum by the page size. The result puzzles me, as it is lower than I expected (roughly the net size of only the data part of the entries).
I guess that the information provided might not be sufficient to find the problem. What additional information would be helpful? How can I test whether the database is already corrupt at the start of the deletion or whether it becomes corrupt during the deletion (I guess the latter)? Should I attempt to write a specific test case? I could reproduce the error a second time by running my software from scratch, but I don't know to what extent the data pattern affects the problem or whether I can reproduce this pattern artificially.
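To make the size estimate concrete, here is a minimal sketch of the calculation described above. The function name estimate_db_bytes is hypothetical; its arguments are meant to correspond to the MDB_stat fields ms_psize, ms_branch_pages, ms_leaf_pages and ms_overflow_pages as filled in by mdb_stat. It is written against plain integers so that it compiles without the LMDB headers.

```c
#include <stddef.h>

/* Hypothetical helper: sums the page counts reported for one database
 * and multiplies the sum by the page size.
 * Intended inputs (from an MDB_stat filled in by mdb_stat()):
 *   page_size      <- ms_psize
 *   branch_pages   <- ms_branch_pages
 *   leaf_pages     <- ms_leaf_pages
 *   overflow_pages <- ms_overflow_pages */
static size_t estimate_db_bytes(unsigned int page_size,
                                size_t branch_pages,
                                size_t leaf_pages,
                                size_t overflow_pages)
{
    size_t total_pages = branch_pages + leaf_pages + overflow_pages;
    return total_pages * (size_t)page_size;
}
```

With an MDB_stat variable st, the call would be estimate_db_bytes (st.ms_psize, st.ms_branch_pages, st.ms_leaf_pages, st.ms_overflow_pages).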
Regards,
Klaus
[1] Linux aaa 4.4.0-65-generic #86-Ubuntu SMP Thu Feb 23 17:49:58 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
[2] Linux bbb 4.8.0-42-generic #45-Ubuntu SMP Wed Mar 8 20:06:06 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
On 3/20/17 2:19 PM, Klaus Malorny wrote:
[...]
One addition: I fiddled around to turn mdb_debug on at the right moment and got:
entering mdb_cursor_del
mdb_page_touch:2426 touched db 8 page 2071338 -> 6529752
mdb_node_del:7376 delete node 0 on leaf page 6529752
mdb_rebalance:8227 rebalancing leaf page 6529752 (has 28 keys, 59,4% full)
mdb_rebalance:8232 no need to rebalance page 6529752, above fill threshold
mdb_cursor_del returned 0
entering mdb_cursor_get
mdb_page_search:5617 db -8 root page 2071112 has flags 0x4
mdb_page_search_root:5508 internal error, index points to a 04 page!?
mdb_cursor_get returned -30796
mdb_txn_end:2979 abort txn 15528w 0x7f2fdc6fcd00 on mdbenv 0x7f2fdc6e9500, root page 2007937
So mdb_cursor_get seems to cause the error, maybe due to a side effect of mdb_cursor_del.
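As a reading aid for the numbers in this log, here is a small sketch that decodes the return code and the page flags. The constant values are copied from the LMDB 0.9.x sources as I understand them (error codes from lmdb.h, page flags from mdb.c), which is an assumption of this sketch. All names carry an X_ or x_ prefix to mark them as hypothetical and to avoid clashing with the real headers.

```c
/* Error codes, assumed to match lmdb.h in LMDB 0.9.x. */
#define X_MDB_NOTFOUND       (-30798)
#define X_MDB_PAGE_NOTFOUND  (-30797)
#define X_MDB_CORRUPTED      (-30796)
#define X_MDB_PANIC          (-30795)

/* Page type flags, assumed to match the internal defines in mdb.c. */
#define X_P_BRANCH    0x01u
#define X_P_LEAF      0x02u
#define X_P_OVERFLOW  0x04u
#define X_P_META      0x08u

/* Maps a return code from the log to its symbolic name. */
static const char *x_error_name(int rc)
{
    switch (rc) {
    case 0:                   return "MDB_SUCCESS";
    case X_MDB_NOTFOUND:      return "MDB_NOTFOUND";
    case X_MDB_PAGE_NOTFOUND: return "MDB_PAGE_NOTFOUND";
    case X_MDB_CORRUPTED:     return "MDB_CORRUPTED";
    case X_MDB_PANIC:         return "MDB_PANIC";
    default:                  return "other";
    }
}

/* Names the page type encoded in a page's flag word. */
static const char *x_page_type(unsigned int flags)
{
    if (flags & X_P_BRANCH)   return "branch";
    if (flags & X_P_LEAF)     return "leaf";
    if (flags & X_P_OVERFLOW) return "overflow";
    if (flags & X_P_META)     return "meta";
    return "unknown";
}
```

If these values hold for 0.9.20, x_error_name (-30796) yields MDB_CORRUPTED and x_page_type (0x4) yields "overflow". That would mean the search found an overflow page where it expected a branch or leaf page.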
Regards,
Klaus
openldap-technical@openldap.org