I think MDB v2 should move the variable parts of MDB_meta into the
data pages. The datafile header would retain a word with the position
of the last *synced* MDB_meta, or of the last meta when MDB_NOSYNC.
The lockfile header would hold the position of the *last* MDB_meta.
All transactions start from the lockfile->metapos commit. Write txns
do not reuse free pages younger than the datafile->metapos commit.
mdb_env_sync() called by the user does roughly:
size_t lastpos = lockfile->metapos;
sync;
# define pos2id(env, pos) (((MDB_meta*)((env)->me_map+(pos)))->mm_txnid)
if (pos2id(env, lastpos) > pos2id(env, datafile->metapos))
write lastpos to &datafile->metapos;
Called from mdb_txn_commit(), this may need lastpos as an argument.
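To make the ordering of those three steps concrete, here is a minimal
in-memory model. It is only a sketch: the model_* names are hypothetical,
a byte array stands in for the mapped datafile, and a counter stands in
for the real sync call. The point it illustrates is that lastpos is
sampled before the sync, so datafile->metapos never moves past a commit
that was not covered by that sync.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t txnid_t;              /* stand-in for MDB's txnid_t */

/* Only the field this sketch needs from the proposed MDB_meta. */
typedef struct model_meta { txnid_t mm_txnid; } model_meta;

/* Toy environment; every member here is an assumption of the sketch. */
typedef struct model_env {
    unsigned char map[256];            /* stands in for env->me_map */
    size_t datafile_metapos;           /* position of last synced meta */
    size_t lockfile_metapos;           /* position of last committed meta */
    int    sync_calls;                 /* counts simulated syncs */
} model_env;

/* Same job as pos2id(): read the txn ID of the meta stored at pos. */
static txnid_t model_pos2id(const model_env *env, size_t pos)
{
    model_meta m;
    assert(pos + sizeof m <= sizeof env->map);
    memcpy(&m, env->map + pos, sizeof m);   /* avoids unaligned access */
    return m.mm_txnid;
}

/* Simulated commit: write a meta at pos, then publish its position. */
static void model_commit(model_env *env, size_t pos, txnid_t id)
{
    model_meta m = { id };
    assert(pos + sizeof m <= sizeof env->map);
    memcpy(env->map + pos, &m, sizeof m);
    env->lockfile_metapos = pos;
}

/* Sample lastpos, sync, then let the datafile catch up if it is behind. */
static void model_env_sync(model_env *env)
{
    size_t lastpos = env->lockfile_metapos;   /* sampled before the sync */
    env->sync_calls++;                        /* stands in for the sync */
    if (model_pos2id(env, lastpos) > model_pos2id(env, env->datafile_metapos))
        env->datafile_metapos = lastpos;
}
```

With a zeroed model_env, two model_commit() calls followed by one
model_env_sync() leave datafile_metapos at the second commit; a second
model_env_sync() changes nothing except the sync counter.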
Results, if I'm keeping this straight:
Setting the latest commit becomes atomic: Just change metapos.
(Field MDB_txninfo.mti_txnid goes away.)
No sync issues with copying 'MDB_db's from the meta, since the meta
will not be overwritten during the txn.
Users can sync infrequently yet preserve consistency, a generalization
of MDB_NOMETASYNC. An application crash will then lose unsynced
commits, since resetting the lockfile must reset lockfile->metapos.
MDB cannot know if a system crash left those commits unsynced.
mdb_env_sync() needs a mutex - either its own or the write lock.
(A soft mode could trylock and do nothing if that fails.)
If mdb_env_sync() gets its own mutex, then mdb_txn_commit() can
announce the commit at lockfile->metapos and unlock the write lock
_before_ doing mdb_env_sync. With multiple writer threads, that's
like an ACID-safe MDB_MAPASYNC.
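The point above about losing unsynced commits after an application crash
can be sketched as follows. This is an illustration under assumptions,
not draft MDB code: the toy_* types and the reset helper are hypothetical
and carry only the metapos word from each header.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two headers in the proposal. */
typedef struct toy_datafile { size_t metapos; } toy_datafile; /* survives a crash */
typedef struct toy_lockfile { size_t metapos; } toy_lockfile; /* rebuilt on reset */

/* Resetting the lockfile: the only durable hint about the latest commit
 * is datafile->metapos, so any newer position held by the old lockfile
 * is discarded along with the commits it pointed to. */
static void toy_lockfile_reset(toy_lockfile *lf, const toy_datafile *df)
{
    assert(lf != NULL && df != NULL);
    lf->metapos = df->metapos;
}
```

If the lockfile pointed at a newer meta than the datafile did when the
application died, the reset rolls every later txn back to the datafile's
commit, whether or not the OS had already flushed the newer pages.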
However, that has quirks. I don't know how serious they are:
- mdb_txn_commit() can fail after other txns see the commit, or
succeed but set a failure flag for other txns to react to.
A delayed mdb_env_sync can fail today too, but here the failure
also occurs if mdb_env_sync cannot set datafile->metapos.
- mdb_txn_commit() may not return immediately after the commit
becomes visible to other txns, unless it is set up to queue the
{sync; set datafile->metapos} actions for a maintenance thread.
More detailed draft code, still ignoring various flags:
typedef struct MDB_meta { /* Meta info about a commit */
MDB_db mm_dbs[2];
txnid_t mm_txnid;
pgno_t mm_last_pg;
} MDB_meta;
typedef struct MDB_header { /* Datafile header */
...;
/* Position of last synced meta - or last known if MDB_NOSYNC */
size_t mh_metapos;
} MDB_header;
typedef struct MDB_txbody { /* Lockfile header */
...;
/* Position of last meta, possibly not synced. Both read and write
* txns start at this commit. Replaces the old member mtb_txnid. */
size_t mtb_metapos;
} MDB_txbody;
mdb_txn_commit(MDB_txn *txn) {
...;
/* Commit a write txn: */
pwritev(env->me_fd, <data pages including MDB_meta>);
/* Make the commit visible to other txns */
lockfile->mtb_metapos = <offset of MDB_meta in me_map>;
unlock(write_mutex);
/* Preserve the commit */
mdb_env_sync(env, 0);
}
# define pos2id(env, pos) (((MDB_meta*)((env)->me_map+(pos)))->mm_txnid)
mdb_env_sync(MDB_env *env, int force) {
MDB_txninfo *txns = env->me_txns;
enum { metapos_pos = offsetof(MDB_header, mh_metapos) };
lock(meta_mutex);
/* Positions of meta pages known to datafile and lockfile */
size_t cur = *(size_t *)(env->me_map + metapos_pos);
size_t lastpos = txns->mtb_metapos;
int got_new = pos2id(env, lastpos) > pos2id(env, cur);
if (force || (got_new && !(env->me_flags & MDB_NOSYNC)))
fdatasync(env->me_fd);
/* Make datafile catch up with pre-fdatasync lockfile */
if (got_new)
pwrite(env->me_mfd, &lastpos, sizeof(lastpos), metapos_pos);
unlock(meta_mutex);
}
--
Hallvard