Thanks for the replies, Hallvard and Howard!
I was mistaken in thinking that NOMETASYNC didn't guarantee integrity. However, my proposal would allow fsync to be omitted entirely.
I think my approach with three roots is better than a WAL because it keeps the read and write paths simpler and more uniform. It also doesn't force periodic fsyncs when the log wraps, or consume unbounded space. In fact it's very similar to the basic design of MDB.
You're right that you'd actually need to record the page's checksum in the parent, rather than in the page itself. I guess this would hurt the branching factor.
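To make that concrete, here is a rough sketch of what I mean by keeping the checksum in the parent. Everything in it is hypothetical: `branch_entry`, `page_checksum` and `child_is_current` are names made up for illustration, the 4096-byte page size is an assumption, and FNV-1a only stands in for whatever checksum would really be used. None of this is MDB's actual page format.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u  /* assumed page size, for illustration only */

/* Hypothetical branch entry: the child's checksum is stored here, in the
 * parent, next to the child's page number. */
typedef struct {
    uint64_t child_pgno;
    uint64_t child_sum;
} branch_entry;

/* FNV-1a 64-bit, used only as a stand-in checksum. */
static uint64_t page_checksum(const uint8_t *page, size_t len)
{
    uint64_t h = 14695981039346656037ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= page[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* Returns 1 if the child page found on disk is the exact version the parent
 * was committed against. A stale page left by an earlier aborted transaction
 * would be internally consistent but would not match the parent's record. */
static int child_is_current(const branch_entry *e, const uint8_t *child_page)
{
    return page_checksum(child_page, PAGE_SIZE) == e->child_sum;
}
```

Each branch entry grows by the size of the checksum (8 bytes in this sketch), which is where the branching-factor cost comes from.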
Thanks again, Ben
On 08/06/2016 12:42 PM, Hallvard Breien Furuseth wrote:
On 06. aug. 2016 17:38, bentrask@comcast.net wrote:
Transaction commits are one of the few bottlenecks in MDB, because it has to fsync twice, sequentially.
I think MDB could support mixed low and high durability transactions in the same database by adding per-page checksums and a third root page. The idea is that when committing a low-durability transaction, no fsyncs are performed. (...)
Yesno. We can get rid of fsyncs, but not that way. Checksumming each page isn't enough. We must know it's the right version of the page and not e.g. a similar page from a previous aborted transaction. To commit a branch or meta page, we'd need to scan its children and checksum the page headers (thus including their checksum) of each. Expensive.
IIRC there are three things we can do:
Use and fsync a WAL (write-ahead log) instead of the database pages. That can be cheaper because it writes one contiguous region instead of a lot of random-access pages. Requires recovery after a crash.
Volatile metapages which mdb_env_open() _always_ throws away if no other environment is already open. They are lost if the application crashes/exits without doing a final checkpoint.
Improve that a bit: Put them in a shared memory region, since that won't survive a system crash (unlike if we put them in the lockfile). That way they'll survive application crash provided something does a checkpoint before next system crash.
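As a rough illustration of the first option (the WAL), the sketch below appends checksummed records to one contiguous log with a single fsync per append, and on recovery replays records until it hits the first torn or corrupt one. All names here (`wal_hdr`, `wal_sum`, `wal_encode`, `wal_append`, `wal_replay`) and the record layout are assumptions made for the example, not anything that exists in MDB. Log wrap-around and applying the records to the database pages are left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical record header; not an MDB on-disk format. */
typedef struct {
    uint32_t len;   /* payload length in bytes */
    uint32_t sum;   /* checksum of the payload */
} wal_hdr;

/* FNV-1a 32-bit, used only as a stand-in checksum. */
static uint32_t wal_sum(const uint8_t *p, size_t n)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < n; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* Encode one record into out; returns bytes written, or 0 if it won't fit. */
static size_t wal_encode(uint8_t *out, size_t cap, const void *data, uint32_t len)
{
    wal_hdr h = { len, wal_sum(data, len) };
    if (cap < sizeof h + len)
        return 0;
    memcpy(out, &h, sizeof h);
    memcpy(out + sizeof h, data, len);
    return sizeof h + len;
}

/* Append one encoded record to the log and make it durable with one fsync.
 * Returns 0 on success, -1 on failure. */
static int wal_append(int fd, const uint8_t *rec, size_t n)
{
    if (write(fd, rec, n) != (ssize_t)n)
        return -1;
    return fsync(fd);
}

/* Recovery: count the intact records from the start of the log, stopping at
 * the first record that is truncated or fails its checksum. */
static size_t wal_replay(const uint8_t *log, size_t n)
{
    size_t off = 0, count = 0;
    while (n - off >= sizeof(wal_hdr)) {
        wal_hdr h;
        memcpy(&h, log + off, sizeof h);
        if (h.len > n - off - sizeof h)
            break;                       /* torn tail */
        if (wal_sum(log + off + sizeof h, h.len) != h.sum)
            break;                       /* corrupt payload */
        off += sizeof h + h.len;
        count++;
    }
    return count;
}
```

The point of the layout is that a commit costs one sequential write plus one fsync, at the price of the replay step after a crash.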
We've discussed these a few times, and there are caveats for some of them that I don't quite remember. One issue is that a "system crash" isn't the only thing which can lose unsynced pages. Another is unmounting and re-mounting the disk (e.g. a USB disk).