On 06. aug. 2016 17:38, bentrask@comcast.net wrote:
Transaction commits are one of the few bottlenecks in MDB, because it has to fsync twice, sequentially.
I think MDB could support mixed low and high durability transactions in the same database by adding per-page checksums and a third root page. The idea is that when committing a low-durability transaction, no fsyncs are performed. (...)
Yes and no. We can get rid of fsyncs, but not that way. Checksumming each page isn't enough: we must also know it's the right version of the page and not e.g. a similar page from a previous aborted transaction. To commit a branch or meta page, we'd need to scan its children and checksum the page headers (thus including their checksums) of each. Expensive.
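To make that cost concrete, here is a rough sketch, not LMDB code: page_hdr is a made-up stand-in for the on-disk page header, and fetch_page(), child_pgno() and crc32_update() are assumed helpers. The point is that for a branch or meta page's checksum to vouch for the exact versions of its children, it has to cover each child's header (page number, txnid and the child's own checksum), so checksumming a branch page means re-reading every child it references.

#include <stdint.h>
#include <stddef.h>

typedef struct page_hdr {
    uint64_t pgno;      /* page number */
    uint64_t txnid;     /* transaction that last wrote this page */
    uint32_t checksum;  /* checksum of the page contents */
    uint16_t nchildren; /* branch pages: number of children */
} page_hdr;

/* assumed helpers, not real LMDB API */
extern const page_hdr *fetch_page(uint64_t pgno);
extern uint64_t child_pgno(const page_hdr *branch, unsigned i);
extern uint32_t crc32_update(uint32_t crc, const void *buf, size_t len);

/* Checksum a branch page so that it also vouches for the exact
 * versions of its children.  Note the extra read of every child
 * page header on each commit of the branch page. */
static uint32_t branch_checksum(const page_hdr *branch,
                                const void *body, size_t bodylen)
{
    uint32_t crc = crc32_update(0, body, bodylen);
    for (unsigned i = 0; i < branch->nchildren; i++) {
        const page_hdr *child = fetch_page(child_pgno(branch, i));
        /* folding in child->checksum chains the whole subtree together */
        crc = crc32_update(crc, child, sizeof *child);
    }
    return crc;
}

And the same walk has to happen all the way up to the meta page, which is why it gets expensive.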
IIRC there are three things we can do:
- Use and fsync a WAL (write-ahead log) instead of the database pages. That can be cheaper because it writes one contiguous region instead of a lot of random-access pages. Requires recovery after a crash. (A rough sketch of such a commit follows after this list.)
- Volatile metapages which mdb_env_open() _always_ throws away if no other environment is already open. They are lost if the application crashes/exits without doing a final checkpoint.
- Improve that a bit: put them in a shared memory region, since that won't survive a system crash (unlike if we put them in the lockfile). That way they'll survive an application crash, provided something does a checkpoint before the next system crash. (See the second sketch below.)
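Rough sketch of the WAL option, just to show the shape of a commit -- wal_fd, the record format and the dirty-page list are all invented, and a real log record would also carry a length and checksum so recovery can detect a torn tail. A commit becomes one contiguous append plus a single fsync of the log, with the random-access page writes (and replay into the main file) deferred to a checkpoint or to recovery.

#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define PAGE_SIZE 4096

struct dirty_page {
    uint64_t pgno;
    unsigned char data[PAGE_SIZE];
};

/* Append all dirty pages of a transaction to the log and fsync once.
 * Returns 0 on success, -1 on error. */
static int wal_commit(int wal_fd, uint64_t txnid,
                      const struct dirty_page *pages, size_t npages)
{
    for (size_t i = 0; i < npages; i++) {
        unsigned char rec[sizeof txnid + sizeof pages[i].pgno + PAGE_SIZE];
        memcpy(rec, &txnid, sizeof txnid);
        memcpy(rec + sizeof txnid, &pages[i].pgno, sizeof pages[i].pgno);
        memcpy(rec + sizeof txnid + sizeof pages[i].pgno,
               pages[i].data, PAGE_SIZE);
        if (write(wal_fd, rec, sizeof rec) != (ssize_t)sizeof rec)
            return -1;          /* sequential append, no seeking */
    }
    return fsync(wal_fd);       /* one fsync per commit instead of two */
}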
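And a sketch of the shared-memory variant; every name here (VOLATILE_SHM, volatile_meta, etc.) is invented, not anything LMDB has. POSIX shared memory objects have kernel persistence: the region vanishes on reboot but survives an application crash, so a later process can pick the volatile meta up and checkpoint it, while a fresh boot simply finds nothing and falls back to the last fsync'd metapage.

#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define VOLATILE_SHM "/mdb-volatile-meta"   /* invented name */

struct volatile_meta {
    uint64_t txnid;      /* last committed but un-fsync'd transaction */
    uint64_t root_pgno;  /* root page of that transaction's tree */
};

/* Map (creating if needed) the shared region holding the volatile meta. */
static struct volatile_meta *volatile_meta_open(void)
{
    int fd = shm_open(VOLATILE_SHM, O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, sizeof(struct volatile_meta)) < 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, sizeof(struct volatile_meta),
                   PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    return p == MAP_FAILED ? NULL : p;
}

/* Low-durability commit: publish the new root here with no fsync at all.
 * A later checkpoint copies it into a real metapage and fsyncs. */
static void volatile_meta_publish(struct volatile_meta *vm,
                                  uint64_t txnid, uint64_t root_pgno)
{
    vm->root_pgno = root_pgno;
    vm->txnid = txnid;   /* a real version would order/fence these writes */
}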
We've discussed these sometimes and there are caveats for some of them which I don't quite remember. One issue is that a "system crash" isn't the only thing which can lose unsynced pages; another is unmounting and re-mounting the disk (e.g. a USB disk).