Re: (ITS#8475) Feature request: MDB low durability transactions - openldap-bugs

7 Aug 2016

Thanks for the replies, Hallvard and Howard!
I was mistaken in thinking that NOMETASYNC didn't guarantee integrity. 
However, my proposal would allow fsync to be omitted entirely.
I think my approach with three roots is better than a WAL because it 
keeps the read and write paths simpler and more uniform. It also doesn't 
force periodic fsyncs when the log wraps, or consume unbounded space. In 
fact it's very similar to the basic design of MDB.
You're right that you'd actually need to record the page's checksum in 
the parent, rather than in the page itself. I guess this would hurt the 
branching factor.
Thanks again,
Ben
On 08/06/2016 12:42 PM, Hallvard Breien Furuseth wrote:
...
On 06. aug. 2016 17:38, bentrask@comcast.net wrote:
...
Transaction commits are one of the few bottlenecks in MDB, because it has to
fsync twice, sequentially.
I think MDB could support mixed low and high durability transactions in the
same database by adding per-page checksums and a third root page. The idea is
that when committing a low-durability transaction, no fsyncs are performed.
(...)
Yes and no.  We can get rid of fsyncs, but not that way.  Checksumming each
page isn't enough.  We must know it's the right version of the page and
not e.g. a similar page from a previous aborted transaction.  To commit
a branch or meta page, we'd need to scan its children and checksum the
page headers (thus including their checksum) of each.  Expensive.
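A minimal sketch in C of why a checksum stored in the page itself is not enough. Everything here is hypothetical: the page layout, the helper names, and the use of FNV-1a as a stand-in checksum. It also simplifies the scheme described above by having the parent keep one checksum per child, rather than a checksum over the children's headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAYLOAD_SIZE 32

/* Hypothetical page layout, for illustration only. */
struct page {
    uint64_t self_sum;                   /* checksum stored in the page itself */
    unsigned char payload[PAYLOAD_SIZE]; /* page contents */
};

/* 64-bit FNV-1a, standing in for whatever checksum would really be used. */
static uint64_t fnv1a(const unsigned char *p, size_t n)
{
    uint64_t h = 14695981039346656037ULL;
    for (size_t i = 0; i < n; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* Fill a page with text and record its own checksum. */
static void page_fill(struct page *pg, const char *text)
{
    assert(pg != NULL && text != NULL);
    memset(pg->payload, 0, sizeof pg->payload);
    strncpy((char *)pg->payload, text, sizeof pg->payload - 1);
    pg->self_sum = fnv1a(pg->payload, sizeof pg->payload);
}

/* Self-check: only shows that the page is internally intact. */
static int page_self_ok(const struct page *pg)
{
    return pg->self_sum == fnv1a(pg->payload, sizeof pg->payload);
}

/* Parent-check: shows that the page is the version the parent refers to. */
static int page_matches_parent(const struct page *pg, uint64_t sum_in_parent)
{
    return sum_in_parent == fnv1a(pg->payload, sizeof pg->payload);
}
```

A page left over from an aborted transaction passes page_self_ok() just as well as the committed one does. Only page_matches_parent(), using the checksum the parent recorded at commit time, tells the two apart.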
IIRC there are three things we can do:

Use and fsync a WAL (write-ahead log) instead of the database pages.
That can be cheaper because it writes one contiguous region instead
of a lot of random-access pages.  Requires recovery after a crash.

Volatile metapages which mdb_env_open() _always_ throws away if no
other environment is already open.  They are lost if the application
crashes/exits without doing a final checkpoint.

Improve that a bit: Put them in a shared memory region, since that
won't survive a system crash (unlike if we put them in the lockfile).
That way they'll survive application crash provided something does
a checkpoint before next system crash.
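The "throw away volatile metapages on open" rule from the second option could look roughly like the C sketch below. The struct, the is_volatile flag and pick_meta() are all hypothetical names invented for this example, not existing MDB structures or functions, and the sketch ignores the shared memory refinement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical meta page descriptor, for illustration only. */
struct meta {
    uint64_t txnid;   /* transaction that wrote this meta page */
    int is_volatile;  /* nonzero if it was committed without fsync */
};

/* Return the index of the newest usable meta page, or -1 if none.
 * Volatile metas are trusted only while another environment is
 * already open; otherwise they are skipped, i.e. thrown away. */
static int pick_meta(const struct meta *metas, size_t n, int other_env_open)
{
    int best = -1;
    assert(metas != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (metas[i].is_volatile && !other_env_open)
            continue;
        if (best < 0 || metas[i].txnid > metas[(size_t)best].txnid)
            best = (int)i;
    }
    return best;
}
```

With durable metas for transactions 10 and 11 and a volatile meta for transaction 13, the first opener falls back to transaction 11, while a process that joins an already open environment sees transaction 13.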

We've discussed these before, and there are caveats for some of them that
I don't quite remember.  One issue is that a "system crash" isn't the
only thing which can lose unsynced pages.  Another is unmounting and
re-mounting the disk (e.g. a USB disk).