[LMDB] getting MDB_CORRUPTED when deleting within a DUPSORT database - openldap-technical

20 Mar 2017


Hi,
I am using LMDB version 0.9.20 on Linux (Ubuntu derivatives; uname output in
[1], [2]). One of the databases serves as an index to another database and was
therefore created with the MDB_DUPSORT flag. Running my software in a test
environment generated about 33 million entries in this database. To test a
suspicion that my software does not perform housekeeping correctly, I copied
the LMDB file to my workstation and forced my software to delete all "legal"
entries to see whether any entries remain. Unfortunately, I got an
MDB_CORRUPTED error during the delete operation on that database.
Some more details: the deletion takes place in multiple steps, repeatedly
calling a function that deletes a range of keys from a database. The code is as
follows (with some boilerplate omitted):

    unsigned int dbFlags;
    int error = mdb_dbi_flags (txn, dbi, &dbFlags);
    // [...]
    bool isDupSort = dbFlags & MDB_DUPSORT;
    error = mdb_cursor_open (txn, dbi, &cursor);
    // [...]
    // Position the cursor at the first key >= ckey (start of the range).
    error = mdb_cursor_get (cursor, &ckey, &cdata, MDB_SET_RANGE);
    while (error != MDB_NOTFOUND)
      {
        // [...]
        // Stop once the current key lies past the end key ekey.
        int compResult = mdb_cmp (txn, dbi, &ckey, &ekey);
        if (compResult > 0 || (!compResult && !endIsInclusive))
          break;
        // For DUPSORT databases, delete the key with all its duplicates.
        error = mdb_cursor_del (cursor, isDupSort ? MDB_NODUPDATA : 0);
        // [...]
        error = mdb_cursor_get (cursor, &ckey, &cdata, MDB_NEXT);
      }
    mdb_cursor_close (cursor);

Is this the correct way to delete the data? The MDB_CORRUPTED error occurs in 
the mdb_cursor_del call. Other operations on that specific database are mdb_put 
(with no flags) and mdb_del, supplying both key and data.
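The exit test in the loop depends on && binding more tightly than ||. A minimal sketch of the same condition written out as a standalone helper may make the intent easier to check. The name past_range_end is hypothetical and not part of my software or of LMDB; cmp stands for the value returned by mdb_cmp for the current key and the end key.

```c
#include <stdbool.h>

/* Hypothetical helper, not part of LMDB: decides whether the cursor has
 * moved past the end of the range to delete.
 * cmp is the comparison of the current key against the end key:
 * negative = before the end key, zero = equal, positive = after it. */
bool past_range_end(int cmp, bool end_is_inclusive)
{
    if (cmp > 0)
        return true;                /* current key is beyond the end key */
    if (cmp == 0)
        return !end_is_inclusive;   /* end key itself: stop only if excluded */
    return false;                   /* still inside the range */
}
```

With this helper, the condition in the loop would read past_range_end (compResult, endIsInclusive), which is equivalent to the expression above.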
One side observation: in a similar test with a smaller number of entries, the
database was completely emptied. However, mdb_stat still reported a fairly
large number of entries for the database (five- to six-digit figures). I also
use the stat data to estimate the size of the database by adding up all page
counts and multiplying the sum by the page size. The result puzzles me, as it
is lower than I expected (it is roughly the net size of only the data part of
the entries).
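For reference, the size estimate described above amounts to roughly the following. This is a sketch, not my actual code: struct db_stat is a hypothetical stand-in that mirrors the fields of LMDB's MDB_stat so that the example compiles without lmdb.h, and estimate_db_bytes is an illustrative name. Real code would pass the MDB_stat filled in by mdb_stat.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in mirroring the fields of LMDB's MDB_stat
 * (ms_psize, ms_depth, ms_branch_pages, ms_leaf_pages,
 * ms_overflow_pages, ms_entries). */
struct db_stat {
    unsigned int psize;       /* page size in bytes */
    unsigned int depth;       /* depth of the B-tree */
    size_t branch_pages;      /* number of internal pages */
    size_t leaf_pages;        /* number of leaf pages */
    size_t overflow_pages;    /* number of overflow pages */
    size_t entries;           /* number of data items */
};

/* Estimate the on-disk size as (sum of all page counts) * page size.
 * Only pages reported in the stat structure are counted. */
uint64_t estimate_db_bytes(const struct db_stat *st)
{
    assert(st != NULL);
    uint64_t pages = (uint64_t)st->branch_pages
                   + (uint64_t)st->leaf_pages
                   + (uint64_t)st->overflow_pages;
    return pages * (uint64_t)st->psize;
}
```

The estimate can only be as complete as the page counts it is given. Whether those counts cover every page that holds duplicate data in a DUPSORT database is exactly what I am unsure about.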
I guess that the information provided may not be sufficient to find the
problem. What additional information would be helpful? How can I test whether
the database is already corrupt at the start of the deletion or whether it
becomes corrupt during the deletion (I guess the latter)? Should I attempt to
write a specific test case? I could reproduce the error a second time by
running my software from scratch, but I don't know to what extent the data
pattern affects the problem or whether I can reproduce this pattern
artificially.
Regards,
Klaus
[1] Linux aaa 4.4.0-65-generic #86-Ubuntu SMP Thu Feb 23 17:49:58 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
[2] Linux bbb 4.8.0-42-generic #45-Ubuntu SMP Wed Mar 8 20:06:06 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux