Re: MDB v2: Replace meta pages with "meta position" word

11 Nov 2012

Hallvard Breien Furuseth wrote:
...
I think MDB v2 should move the variable parts of MDB_meta into the
data pages.  The datafile header would retain a word with the position
of the last *synced* MDB_meta, or of the last meta when MDB_NOSYNC.
The lockfile header would hold the position of the *last* MDB_meta.
All transactions start from the lockfile->metapos commit.  Write txns
do not reuse free pages younger than the datafile->metapos commit.
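
The reuse rule in the last sentence can be sketched as a single predicate. This is only an illustration: `toy_can_reuse` is a hypothetical name, and it assumes that "younger than the datafile->metapos commit" means a larger txnid than the last synced commit. Any check against active readers would apply separately and is not shown.

```c
#include <stddef.h>

typedef size_t txnid_t;

/* Hypothetical helper, not part of MDB.
 * freed_by: txnid of the commit that freed the pages.
 * synced:   txnid of the commit at datafile->metapos.
 * Pages freed by a commit newer than the synced one may still belong
 * to the synced snapshot, so a crash could need them intact. */
static int toy_can_reuse(txnid_t freed_by, txnid_t synced)
{
    return freed_by <= synced;
}
```

With the synced commit at txnid 5, pages freed by commits 3 or 5 are reusable, while pages freed by commit 6 are not until a later sync advances datafile->metapos.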

I don't see this approach reducing seek overhead. It may be able to reduce
sync overhead, but only if you accept the possibility of delayed syncs
failing. Overall I don't see it as any improvement for ACID compliance.
What is the real benefit?

It may be worthwhile. I just want the actual specific advantages spelled out.
Other DB systems use delayed/group commit to reduce sync overhead. It's worth
doing, when your application can tolerate that type of behavior. But this
can't be the default behavior.

...
mdb_env_sync() called by the user does roughly:
   size_t lastpos = lockfile->metapos;
   sync;
# define pos2id(env, pos) (((MDB_meta*)((env)->me_map+(pos)))->mm_txnid)
   if (pos2id(env, lastpos) > pos2id(env, datafile->metapos))
     write lastpos to &datafile->metapos;
Called from mdb_txn_commit(), this may need lastpos as an argument.
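
The rough steps above can be tried out on an in-memory stand-in. This is a sketch under stated assumptions, not MDB code: `toy_meta`, `toy_env` and `toy_env_sync` are hypothetical names, the two metapos words are plain struct members rather than file headers, and the actual sync call is only marked by a comment.

```c
#include <stddef.h>
#include <string.h>

typedef size_t txnid_t;

/* Toy stand-ins for illustration only, not the real MDB structs */
typedef struct toy_meta { txnid_t mm_txnid; } toy_meta;

typedef struct toy_env {
    char  *me_map;        /* stands in for the mapped data file */
    size_t data_metapos;  /* stands in for datafile->metapos */
    size_t lock_metapos;  /* stands in for lockfile->metapos */
} toy_env;

/* Read the txnid of the meta stored at byte offset pos */
static txnid_t toy_pos2id(const toy_env *env, size_t pos)
{
    toy_meta m;
    memcpy(&m, env->me_map + pos, sizeof m);
    return m.mm_txnid;
}

/* Returns 1 if the datafile's metapos was advanced, 0 otherwise */
static int toy_env_sync(toy_env *env)
{
    size_t lastpos = env->lock_metapos;  /* snapshot before syncing */

    /* a real implementation would sync the data file here */

    if (toy_pos2id(env, lastpos) > toy_pos2id(env, env->data_metapos)) {
        env->data_metapos = lastpos;
        return 1;
    }
    return 0;
}
```

Taking the snapshot of lockfile->metapos before the sync matters: a commit published while the sync is in progress is not covered by it, so the datafile may only advance to the position read beforehand.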

Results, if I'm keeping this straight:

- Setting the latest commit becomes atomic: Just change metapos.
  (Field MDB_txninfo.mti_txnid goes away.)

- No sync issues with copying 'MDB_db's from the meta, since the meta
  will not be overwritten during the txn.

- Users can sync infrequently yet preserve consistency, a generalization
  of MDB_NOMETASYNC.  An application crash will then lose unsynced
  commits, since resetting the lockfile must reset lockfile->metapos.
  MDB cannot know if a system crash left those commits unsynced.

- mdb_env_sync() needs a mutex - either its own or the write lock.
  (A soft mode could trylock and do nothing if that fails.)

- If mdb_env_sync() gets its own mutex, then mdb_txn_commit() can
  announce the commit at lockfile->metapos and unlock the write lock
  _before_ doing mdb_env_sync.  With multiple writer threads, that's
  like an ACID-safe MDB_MAPASYNC.
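
The point about losing unsynced commits when the lockfile is reset can be made concrete with a small sketch. It assumes that a reset sets lockfile->metapos back to datafile->metapos, which the text implies but does not spell out; `toy_positions` and `toy_lockfile_reset` are hypothetical names.

```c
#include <stddef.h>

/* Toy stand-in for the two position words, not the real MDB headers */
typedef struct toy_positions {
    size_t data_metapos;  /* last synced meta, from the datafile header */
    size_t lock_metapos;  /* last meta, possibly not synced */
} toy_positions;

/* Assumed reset behavior: fall back to the last synced meta.
 * Returns 1 if a newer, unsynced commit became unreachable. */
static int toy_lockfile_reset(toy_positions *p)
{
    int lost = (p->lock_metapos != p->data_metapos);
    p->lock_metapos = p->data_metapos;
    return lost;
}
```

After the reset every new txn starts from the synced commit, so commits that were visible only through the lockfile are dropped even if their pages happened to reach disk.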
However, that has quirks.  I don't know how serious they are:

- mdb_txn_commit() can fail after other txns see the commit, or
  succeed but set a failure flag for other txns to react to.
  Delayed mdb_env_sync can fail today too, but it will also
  happen if mdb_env_sync cannot set datafile->metapos.

- mdb_txn_commit() may not return immediately after the commit
  becomes visible to other txns, unless it is set up to queue the
  {sync; set datafile->metapos} actions for a maintenance thread.

More detailed draft code, still ignoring various flags:
typedef struct MDB_meta {   /* Meta info about a commit */
     MDB_db      mm_dbs[2];
     txnid_t     mm_txnid;
     pgno_t      mm_last_pg;
} MDB_meta;
typedef struct MDB_header { /* Datafile header */
     ...;
     /* Position of last synced meta - or last known if MDB_NOSYNC */
     size_t      mh_metapos;
} MDB_header;
typedef struct MDB_txbody { /* Lockfile header */
     ...;
     /* Position of last meta, possibly not synced. Both read and write
      * txns start at this commit. Replaces the old member mtb_txnid. */
     size_t      mtb_metapos;
} MDB_txbody;
mdb_txn_commit(MDB_txn *txn) {
     ...;
     /* Commit a write txn: */
     pwritev(env->me_fd, <data pages including MDB_meta>);
     /* Make the commit visible to other txns */
     lockfile->mtb_metapos = <offset of MDB_meta in me_map>;
     unlock(write_mutex);
     /* Preserve the commit */
     mdb_env_sync(env, 0);
}
# define pos2id(env, pos) (((MDB_meta*)((env)->me_map+(pos)))->mm_txnid)
mdb_env_sync(MDB_env *env, int force) {
     MDB_txninfo *txns = env->me_txns;
     enum { metapos_pos = offsetof(MDB_header, mh_metapos) };

     lock(meta_mutex);

     /* Positions of meta pages known to datafile and lockfile */
     size_t cur = *(size_t *)(env->me_map + metapos_pos);
     size_t lastpos = txns->mtb_metapos;
     int got_new = pos2id(env, lastpos) > pos2id(env, cur);

     /* Flush data pages before the datafile header may point at them */
     if (force || (got_new && !(env->me_flags & MDB_NOSYNC)))
         fdatasync(env->me_fd);

     /* Make datafile catch up with pre-fdatasync lockfile */
     if (got_new)
         pwrite(env->me_mfd, &lastpos, sizeof(lastpos), metapos_pos);

     unlock(meta_mutex);
}
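
The two conditions in the draft sync function are easier to check when pulled out into a pure function. This is a sketch only: `toy_sync_plan`, `toy_plan_sync` and `TOY_NOSYNC` are hypothetical names, and the flag value is an arbitrary stand-in for MDB_NOSYNC.

```c
#define TOY_NOSYNC 0x1u  /* arbitrary stand-in for MDB_NOSYNC */

typedef struct toy_sync_plan {
    int do_fsync;         /* call fdatasync() on the data file */
    int advance_metapos;  /* write lastpos into the datafile header */
} toy_sync_plan;

/* Mirrors the two if-conditions of the draft:
 * force:   caller insists on a sync
 * got_new: lockfile knows a newer meta than the datafile header */
static toy_sync_plan toy_plan_sync(int force, int got_new, unsigned flags)
{
    toy_sync_plan plan;
    plan.do_fsync = force || (got_new && !(flags & TOY_NOSYNC));
    plan.advance_metapos = got_new;
    return plan;
}
```

With the no-sync flag set and no force, a new commit still advances the datafile's metapos but skips the fdatasync, which matches the "last known if MDB_NOSYNC" wording in the header comment.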
-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/
