Hallvard Breien Furuseth wrote:
I think MDB v2 should move the variable parts of MDB_meta into the data pages. The datafile header would retain a word with the position of the last *synced* MDB_meta, or of the last meta when MDB_NOSYNC. The lockfile header would hold the position of the *last* MDB_meta.
All transactions start from the lockfile->metapos commit. Write txns do not reuse free pages younger than the datafile->metapos commit.
I don't see this approach reducing seek overhead. It may be able to reduce sync overhead, but only if you accept the possibility of delayed syncs failing. Overall I don't see it as any improvement for ACID compliance.
What is the real benefit?
It may be worthwhile. I just want the actual specific advantages spelled out. Other DB systems use delayed/group commit to reduce sync overhead. It's worth doing, when your application can tolerate that type of behavior. But this can't be the default behavior.
mdb_env_sync() called by the user does roughly:
    size_t lastpos = lockfile->metapos;
    sync;
    # define pos2id(env, pos) ((MDB_meta*)((env)->me_map+(pos)))->mm_txnid
    if (pos2id(env, lastpos) > pos2id(env, datafile->metapos))
        write lastpos to &datafile->metapos;
Called from mdb_txn_commit(), this may need lastpos as an argument.
Results, if I'm keeping this straight:
Setting the latest commit becomes atomic: Just change metapos. (Field MDB_txninfo.mti_txnid goes away.)
No sync issues with copying 'MDB_db's from the meta, since the meta will not be overwritten during the txn.
Users can sync infrequently yet preserve consistency, a generalization of MDB_NOMETASYNC. An application crash will then lose unsynced commits, since resetting the lockfile must reset lockfile->metapos. MDB cannot know if a system crash left those commits unsynced.
mdb_env_sync() needs a mutex - either its own or the write lock. (A soft mode could trylock and do nothing if that fails.)
If mdb_env_sync() gets its own mutex, then mdb_txn_commit() can announce the commit at lockfile->metapos and unlock the write lock _before_ doing mdb_env_sync. With multiple writer threads, that's like an ACID-safe MDB_MAPASYNC. However, that has quirks. I don't know how serious they are:
- mdb_txn_commit() can fail after other txns see the commit, or succeed but set a failure flag for other txns to react to. A delayed mdb_env_sync can fail today too, but with this scheme it would also fail if it cannot set datafile->metapos.
- mdb_txn_commit() may not return immediately after the commit becomes visible to other txns. Unless it is set up to queue the {sync; set datafile->metapos} actions for a maintenance thread.
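To make the "soft mode" idea concrete, here is a minimal sketch of a sync call that tries the lock and does nothing if it is busy. Everything in it is hypothetical: meta_lock, env_sync_soft() and do_sync_work() are names invented for illustration, and a C11 atomic_flag stands in for whatever mutex the environment would really use (on POSIX that would presumably be pthread_mutex_trylock on a shared mutex).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the proposed meta mutex. */
atomic_flag meta_lock = ATOMIC_FLAG_INIT;

/* Counts how often the simulated sync body actually ran. */
int syncs_done = 0;

/* Placeholder for the real work: {sync; set datafile->metapos}. */
void do_sync_work(void)
{
    syncs_done++;
}

/* Soft sync: returns true if this call performed the sync, false if
 * the lock was already held and the call was skipped. */
bool env_sync_soft(void)
{
    if (atomic_flag_test_and_set(&meta_lock))
        return false;               /* lock busy: do nothing */
    do_sync_work();
    atomic_flag_clear(&meta_lock);  /* release the lock */
    return true;
}
```

A skipped call is harmless in this model, because whoever holds the lock is already syncing. It may still leave the newest commit unsynced until the next call.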
More detailed draft code, still ignoring various flags:
typedef struct MDB_meta {       /* Meta info about a commit */
    MDB_db  mm_dbs[2];
    txnid_t mm_txnid;
    pgno_t  mm_last_pg;
} MDB_meta;

typedef struct MDB_header {     /* Datafile header */
    ...;
    /* Position of last synced meta - or last known if MDB_NOSYNC */
    size_t mh_metapos;
} MDB_header;

typedef struct MDB_txbody {     /* Lockfile header */
    ...;
    /* Position of last meta, possibly not synced. Both read and write
     * txns start at this commit. Replaces the old member mtb_txnid. */
    size_t mtb_metapos;
} MDB_txbody;
mdb_txn_commit(MDB_txn *txn) {
    ...;
    /* Commit a write txn: */
    pwritev(env->me_fd, <data pages including MDB_meta>);
    /* Make the commit visible to other txns */
    lockfile->mtb_metapos = <offset of MDB_meta in me_map>;
    unlock(write_mutex);
    /* Preserve the commit */
    mdb_env_sync(env, 0);
}
# define pos2id(env, pos) (((MDB_meta*)((env)->me_map+(pos)))->mm_txnid)
mdb_env_sync(MDB_env *env, int force) {
    MDB_txninfo *txns = env->me_txns;
    enum { metapos_pos = offsetof(MDB_header, mh_metapos) };

    lock(meta_mutex);
    /* Positions of meta pages known to datafile and lockfile */
    size_t cur = *(size_t *)(env->me_map + metapos_pos);
    size_t lastpos = txns->mtb_metapos;
    int got_new = pos2id(env, lastpos) > pos2id(env, cur);
    if (force || (got_new && !(env->me_flags & MDB_NOSYNC)))
        fdatasync(env->me_fd);
    /* Make datafile catch up with pre-fdatasync lockfile */
    if (got_new)
        pwrite(env->me_mfd, &lastpos, sizeof(lastpos), metapos_pos);
    unlock(meta_mutex);
}
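
As a sanity check on the two-pointer idea, here is a toy in-memory model of it. This is only a sketch under stated assumptions, not MDB code: toy_env, toy_meta and the toy_* functions are invented names, the byte array stands in for me_map, metas are simply appended, and the actual fdatasync() is reduced to a comment. It shows commits advancing the lockfile position, sync letting the datafile position catch up, and a lockfile reset dropping commits that were never synced.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAP_SIZE 256

/* Hypothetical, stripped-down meta: just the txnid. */
typedef struct { unsigned long txnid; } toy_meta;

typedef struct {
    unsigned char map[MAP_SIZE]; /* stands in for me_map */
    size_t data_metapos;         /* datafile header: last synced meta */
    size_t lock_metapos;         /* lockfile header: last meta */
    size_t next_free;            /* where the next meta gets written */
} toy_env;

/* Read the txnid of the meta stored at byte offset pos. */
unsigned long toy_txnid_at(const toy_env *env, size_t pos)
{
    toy_meta m;
    memcpy(&m, env->map + pos, sizeof m);
    return m.txnid;
}

/* Fresh environment: a zeroed meta with txnid 0 sits at offset 0. */
void toy_init(toy_env *env)
{
    memset(env, 0, sizeof *env);
    env->next_free = sizeof(toy_meta);
}

/* Commit: write a new meta into the data area, then publish it by
 * changing the lockfile position. Returns the new meta's offset. */
size_t toy_commit(toy_env *env)
{
    toy_meta m = { toy_txnid_at(env, env->lock_metapos) + 1 };
    size_t pos = env->next_free;
    assert(pos + sizeof m <= MAP_SIZE);
    memcpy(env->map + pos, &m, sizeof m);
    env->next_free = pos + sizeof m;
    env->lock_metapos = pos;     /* now visible to new txns */
    return pos;
}

/* Sync: the datafile position catches up with the lockfile position. */
void toy_sync(toy_env *env)
{
    size_t lastpos = env->lock_metapos;
    /* real code would fdatasync() the data file here */
    if (toy_txnid_at(env, lastpos) > toy_txnid_at(env, env->data_metapos))
        env->data_metapos = lastpos;
}

/* Lockfile reset after an application crash: fall back to the last
 * synced commit, losing anything committed after it. */
void toy_reset_lockfile(toy_env *env)
{
    env->lock_metapos = env->data_metapos;
}
```

In this model, one commit followed by a sync, then two more commits and a lockfile reset, leaves both positions at txnid 1. The two unsynced commits are gone but the environment is still consistent.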