I'll use this ITS to summarize details for "volatile commits". Hopefully I've managed to keep it all straight.
"Volatile" vs. "durable" are the most accurate names I can think of. Not sure if that's more instructive than simply "soft" / "hard".
Description:
* Volatile commits omit fdatasync() without losing consistency. To survive, they must be checkpointed *before* all processes close the env. Un-checkpointed volatiles are lost when the env closes.
Thus a separate checkpointing daemon can keep the env open to protect volatiles from application crash, at least if Robust locks are supported. (The lmdb.h doc seems a bit unclear about Robust.)
* Checkpointing == committing a durable (non-volatile) write-txn. (If there is nothing to do, Commit writes nothing.)
mdb_env_sync() will not checkpoint volatiles, since existing programs do not expect it to wait for the write mutex. It "checkpoints" MDB_NOMETASYNC/MDB_NOSYNC. Maybe mdb_checkpoint() will have a special case which obsoletes mdb_env_sync().
* Volatiles are unsupported with MDB_NOLOCK and pointless with MDB_NOSYNC. OTOH it makes sense to enable MDB_NOMETASYNC.
* Volatiles need a bigger datafile, because it takes two durable commits to make a freed page reusable. (Plus awaiting old readers).
Configuration. Too many options, ouch:
LMDB can be configured to auto-checkpoint after X volatile commits and/or Y written kbytes(pages?). Programs can also checkpoint every Z minutes(seconds?) - configured in LMDB to mimic Berkeley DB's "checkpoint <kbytes> <minutes>", but regular LMDB ops ignore that.
The lockfile gives the current config. An MDB_env could override for its particular process, e.g. with an MDB_NO_VOLATILE flag. Maybe resetting the lockfile should keep the previous config? OTOH I suppose MDB_meta can have default params the way it has mm_mapsize. That survives a backup/restore.
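A minimal sketch of the auto-checkpoint decision, to make the X/Y limits concrete. All names here (ckp_config, ckp_counters, ckp_commit_must_be_durable, ckp_account) are hypothetical and not part of LMDB. It assumes Y is counted in kbytes and that a limit of 0 means "no limit". In the plan below, this is the check write_meta() would make.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical limits as they might be stored in the lockfile; 0 = no limit. */
typedef struct {
    size_t max_volatile_commits; /* X: volatile commits allowed between checkpoints */
    size_t max_volatile_kbytes;  /* Y: kbytes written volatilely between checkpoints */
} ckp_config;

/* Hypothetical counters, reset by every durable commit. */
typedef struct {
    size_t volatile_commits;
    size_t volatile_kbytes;
} ckp_counters;

/* Return true if the commit about to be written must be durable,
 * i.e. committing it as a volatile would exceed a configured limit. */
static bool ckp_commit_must_be_durable(const ckp_config *cfg,
                                       const ckp_counters *cnt,
                                       size_t commit_kbytes)
{
    assert(cfg != NULL && cnt != NULL);
    if (cfg->max_volatile_commits &&
        cnt->volatile_commits + 1 > cfg->max_volatile_commits)
        return true;
    if (cfg->max_volatile_kbytes &&
        cnt->volatile_kbytes + commit_kbytes > cfg->max_volatile_kbytes)
        return true;
    return false;
}

/* Update the counters after a commit has been written. */
static void ckp_account(ckp_counters *cnt, bool durable, size_t commit_kbytes)
{
    assert(cnt != NULL);
    if (durable) {
        cnt->volatile_commits = 0;
        cnt->volatile_kbytes = 0;
    } else {
        cnt->volatile_commits++;
        cnt->volatile_kbytes += commit_kbytes;
    }
}
```

With X = 2, the first two commits may be volatile and the third is forced durable, which resets the counters. The time-based limit Z is left out on purpose, since regular LMDB ops would ignore it.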
Implementation - plain version first:
* Bump MDB_LOCK_FORMAT, MDB_DATA_VERSION (or make MDB_DATA_FORMAT).
* Keep 2 'MDB_meta's in the lockfile, for volatile commits. MDB_env.me_metas[] gets 4 elements: durable + volatile.
* mdb_env_open() throws away volatiles if it re-inits the lockfile.
* Add field MDB_meta.mm_oldest: 1 + (previous durable meta).mm_txnid in durable metas, and (previous meta).mm_oldest in volatile metas.
Init 'oldest' in mdb_find_oldest() to new field MDB_env.me_oldest, which mdb_txn_renew0(write txn) sets to MDB_meta.mm_oldest.
When there are no volatiles, this ends up initing 'oldest' = same value as today. Usually we could just have used 1 + (oldest durable meta).mm_txnid, but a failed write_meta() may have clobbered that.
* Replace MDB_txninfo.mti_txnid with mti_metaref = txnid*16 + P*4 + M: M = index to MDB_env.me_metas[], P = previous M during this session, initialized to the same as M, so we can get this info atomically.
P may prove unnecessary, but it's simplest to just include it for now. Possible uses: falling back to an older meta when meta M fails a checksum, mdb_mutex_failed(), and maybe checking whether there are any volatiles yet.
* Never use mdb_env_pick_meta() when the current metapage is known: Use it only in mdb_env_open(), in mdb_mutex_failed(), and if MDB_NOLOCK. Or rather, I guess it gets a "force" param for those cases.
* Add config in the lockfile. Maybe per-env config overriding it and defaults in the datafile. Txn flags "prefer volatile", "checkpoint".
* Track the number of pages and volatiles since last durable commit. write_meta() compares with the config limits and makes the final decision of whether the new meta will be volatile.
Add MDB_pgstate.mf_pgcount with #pages used so far. The rest goes in a lockfile array[4] indexed by mti_metaref % 4, or in MDB_meta. That way, switching to next snapshot stays atomic - just update mti_metaref.
* txn_begin must verify metas, since we have no fdatasync barriers. Re-read and compare, or checksum.
write_meta() and mutex_failed(): memory barrier between making a volatile meta and updating mti_metaref. Most modern compilers have that. Maybe a fallback implementation is lock;unlock an otherwise unused mutex. Should also include CACHEFLUSH().
It may make sense to have more than 2 volatile metas, so read-only txns will have more time to read a meta before it gets overwritten.
MDB_WRITEMAP (and MDB_VL32?) has non-atomic issues we should deal with anyway.
* We can have (durable metapage).mp_pgno == (txnid & 1) as before: mdb_txn_renew0() steps txnid by 2 instead of 1 if 'meta' is volatile. But note that the txnid doesn't say if the snapshot is durable.
* "Try to checkpoint" feature, which does not await the write mutex:
Trylock the write mutex in mdb_txn_begin(). If it fails, set a lockfile flag "Please checkpoint" and return. Hopefully someone will obey and clear the flag. mdb_txn_commit(writer) does.
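To illustrate the mti_metaref bullet above: a rough sketch of the txnid*16 + P*4 + M encoding, published with a release store so the meta contents written earlier are visible before the reference to them. It assumes C11 <stdatomic.h> and a 64-bit mti_metaref. The names metaref_fields, metaref_pack, metaref_unpack, metaref_publish and metaref_read are made up for this example, and the static variable only stands in for MDB_txninfo.mti_metaref.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical decoded form of mti_metaref. */
typedef struct {
    uint64_t txnid; /* txnid of the current snapshot */
    unsigned prev;  /* P: previous M during this session */
    unsigned meta;  /* M: index into me_metas[], 0..3 */
} metaref_fields;

/* mti_metaref = txnid*16 + P*4 + M */
static uint64_t metaref_pack(uint64_t txnid, unsigned prev, unsigned meta)
{
    assert(prev < 4 && meta < 4);
    assert(txnid <= (UINT64_MAX >> 4)); /* the top 4 txnid bits are given up */
    return txnid * 16 + prev * 4 + meta;
}

static metaref_fields metaref_unpack(uint64_t ref)
{
    metaref_fields f;
    f.txnid = ref / 16;
    f.prev = (unsigned)(ref / 4 % 4);
    f.meta = (unsigned)(ref % 4);
    return f;
}

/* Stand-in for MDB_txninfo.mti_metaref; starts with P == M == 0. */
static _Atomic uint64_t mti_metaref;

/* Writer side, called with the write mutex held and after the new meta
 * has been filled in. The old M becomes the new P. */
static void metaref_publish(uint64_t txnid, unsigned meta)
{
    metaref_fields old = metaref_unpack(
        atomic_load_explicit(&mti_metaref, memory_order_relaxed));
    atomic_store_explicit(&mti_metaref,
                          metaref_pack(txnid, old.meta, meta),
                          memory_order_release);
}

/* Reader side: one atomic load yields txnid, P and M together. */
static metaref_fields metaref_read(void)
{
    return metaref_unpack(
        atomic_load_explicit(&mti_metaref, memory_order_acquire));
}
```

Since switching snapshots is a single store, per-snapshot counters indexed by mti_metaref % 4 switch with it. This does not cover CACHEFLUSH() or the verification txn_begin still has to do on the meta it reads.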
Variants:
* Put volatile MDB_metas in the datafile, behind the usual MDB_metas.
That protects the metas from malicious/broken processes with read-only envs. Otherwise, using the lockfile (or non-file shared memory) extends read-only envs' ability to hack/break the DB.
Be careful to not read volatile MDB_metas that are older than last lockfile-reset, since the reset did not clear them. That also means this variant does not enable volatiles with MDB_NOLOCK.
* Stay with MDB_DATA_VERSION = 1, no change in datafile format:
- Volatiles share next durable txn's txnid (for freeDB keys), but put last durable txn's txnid in mti_metaref (for mdb_find_oldest).
- mdb_txn_id() / MDB_envinfo.me_last_txnid can no longer be used to distinguish txns.
Apps doing that could set an env flag "ignore volatiles", or txn flag "fail if current snapshot is volatile".
- Define V = 2-bit sequence number incremented by commit(writer). Include V in MDB_txninfo.mti_metaref. In volatile metas, include mm_metaref = copy of mti_metaref.
This lets mutex_failed() figure out which MDB_meta is most recent: abs(V in mm_metaref - V in mti_metaref) is <= 1 whether mm_metaref or mti_metaref changes first in the thread.
* Support full 32-bit txnids on 32-bit hosts.
mti_metaref eats some txnid bits in order to stay atomic, but we can get them back:
On 32-bit hosts, make mti_metaref a 64-bit value - something like ((txnid << 32) | R) where R = 32-bit metaref value described before.
When reading mti_metaref, R is authoritative for the low txnid bits. If the (txnid << 32) part does not match, adjust it so it does: That'll be + or - a small value. mti_metaref's high bits vary slowly, so this is normally "atomic". txn_renew0's loop re-reads it to verify.
* We can squeeze away some bytes and mti_metaref bits, but don't get sucked into spending time on that before the rest is working.
E.g. the array[4] counting pages/volatiles can be just [2] if Commit always toggles the (mti_metaref & 1) bit. And we can likely include some of the mti_metaref fields in the txnid.
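For the "full 32-bit txnids on 32-bit hosts" variant above, here is one way the "adjust it so it does" step could look. This is my reading of the idea, not settled design: it assumes R keeps the low 28 txnid bits (32 minus the 4 bits for P and M) and that the two halves read from mti_metaref are at most a few txnids apart. metaref64_pack and metaref64_txnid are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

enum { R_TXNID_BITS = 28 }; /* txnid bits that fit in R beside P and M */
#define R_TXNID_MASK ((UINT32_C(1) << R_TXNID_BITS) - 1)

/* Build ((txnid << 32) | R), with R = low 32 bits of txnid*16 + P*4 + M. */
static uint64_t metaref64_pack(uint32_t txnid, unsigned prev, unsigned meta)
{
    assert(prev < 4 && meta < 4);
    uint32_t r = txnid * 16u + prev * 4u + meta; /* wraps modulo 2^32 */
    return ((uint64_t)txnid << 32) | r;
}

/* Recover the txnid from two separately read 32-bit halves.
 * hi is the (txnid << 32) part, possibly from a neighbouring commit;
 * r is authoritative for the low 28 txnid bits. Pick the txnid nearest
 * to hi whose low bits agree with r. */
static uint32_t metaref64_txnid(uint32_t hi, uint32_t r)
{
    uint32_t low = r >> 4;                      /* low 28 bits of txnid */
    uint32_t delta = (low - hi) & R_TXNID_MASK; /* distance modulo 2^28 */
    if (delta >> (R_TXNID_BITS - 1))
        delta |= ~R_TXNID_MASK;                 /* sign-extend: small negative */
    return hi + delta;                          /* unsigned wraparound intended */
}
```

If a reader sees the high half of txnid 100 and the R of txnid 101, the result is 101, and the other way around it is 100. The re-read in txn_renew0's loop would still be needed to confirm the value.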
Roads to Hell:
* Support checkpointing without waiting for the write-mutex.
Programs which have used volatiles may want this so they can exit quickly. But it requires a new shared mutex for write_meta(), or some clever tricks I've thought of which would not quite work.
* Rescue volatiles in a dead lockfile when the user "knows" it's safe.
Easy enough, just don't clear them when resetting the lockfile. But users screw up, and may then blame LMDB. Users who want to screw up can use MDB_NOSYNC instead of volatile commits.
Or if volatile metas are in non-file shared memory which a system crash will kill, it's _almost_ safe to not reset them along with the lockfile. Unless someone unmounts/mounts the disk, or replaces the DB by overwriting it with another DB file, or who knows what else.