I think MDB v2 should move the variable parts of MDB_meta into the data pages. The datafile header would retain a word with the position of the last *synced* MDB_meta, or of the last meta when MDB_NOSYNC. The lockfile header would hold the position of the *last* MDB_meta.
All transactions start from the lockfile->metapos commit. Write txns do not reuse free pages younger than the datafile->metapos commit.
mdb_env_sync() called by the user does roughly:

    size_t lastpos = lockfile->metapos;
    sync;
    # define pos2id(env, pos) ((MDB_meta*)((env)->me_map+(pos)))->mm_txnid
    if (pos2id(env, lastpos) > pos2id(env, datafile->metapos))
        write lastpos to &datafile->metapos;

Called from mdb_txn_commit(), this may need lastpos as an argument.
Results, if I'm keeping this straight:
Setting the latest commit becomes atomic: Just change metapos. (Field MDB_txninfo.mti_txnid goes away.)
No sync issues with copying 'MDB_db's from the meta, since the meta will not be overwritten during the txn.
Users can sync infrequently yet preserve consistency, a generalization of MDB_NOMETASYNC. An application crash will then lose unsynced commits, since resetting the lockfile must reset lockfile->metapos. MDB cannot know if a system crash left those commits unsynced.
mdb_env_sync() needs a mutex - either its own or the write lock. (A soft mode could trylock and do nothing if that fails.)
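The soft mode mentioned in parentheses could be sketched as follows. This is only an illustration: a C11 atomic_flag stands in for the real mutex, and the names sync_lock, sync_trylock and soft_sync are made up here, not part of MDB.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the sync mutex. A real build would use
 * the platform mutex that lives in the lockfile. */
typedef struct {
    atomic_flag busy;
    int syncs_done;      /* counts syncs, for demonstration only */
} sync_lock;

static void sync_lock_init(sync_lock *l) {
    atomic_flag_clear(&l->busy);
    l->syncs_done = 0;
}

/* Returns true if the lock was free and is now held by the caller. */
static bool sync_trylock(sync_lock *l) {
    return !atomic_flag_test_and_set(&l->busy);
}

static void sync_unlock(sync_lock *l) {
    atomic_flag_clear(&l->busy);
}

/* Soft mode: returns 1 if this call did the sync, 0 if another
 * caller holds the lock and this call did nothing. */
static int soft_sync(sync_lock *l) {
    if (!sync_trylock(l))
        return 0;
    l->syncs_done++;     /* placeholder for fdatasync + metapos update */
    sync_unlock(l);
    return 1;
}
```

A caller that gets 0 back knows only that some other sync was in progress, not that its own commit has reached disk.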
If mdb_env_sync() gets its own mutex, then mdb_txn_commit() can announce the commit at lockfile->metapos and unlock the write lock _before_ doing mdb_env_sync. With multiple writer threads, that's like an ACID-safe MDB_MAPASYNC. However, that has quirks. I don't know how serious they are:
- mdb_txn_commit() can fail after other txns see the commit, or succeed but set a failure flag for other txns to react to. Delayed mdb_env_sync can fail today too, but it will also happen if mdb_env_sync cannot set datafile->metapos.
- mdb_txn_commit() may not return immediately after the commit becomes visible to other txns. Unless it is set up to queue the {sync; set datafile->metapos} actions for a maintenance thread.
More detailed draft code, still ignoring various flags:
    typedef struct MDB_meta {        /* Meta info about a commit */
        MDB_db   mm_dbs[2];
        txnid_t  mm_txnid;
        pgno_t   mm_last_pg;
    } MDB_meta;

    typedef struct MDB_header {      /* Datafile header */
        ...;
        /* Position of last synced meta - or last known if MDB_NOSYNC */
        size_t   mh_metapos;
    } MDB_header;

    typedef struct MDB_txbody {      /* Lockfile header */
        ...;
        /* Position of last meta, possibly not synced. Both read and write
         * txns start at this commit. Replaces the old member mtb_txnid. */
        size_t   mtb_metapos;
    } MDB_txbody;

    mdb_txn_commit(MDB_txn *txn)
    {
        ...;
        /* Commit a write txn: */
        pwritev(env->me_fd, <data pages including MDB_meta>);
        /* Make the commit visible to other txns */
        lockfile->mtb_metapos = <offset of MDB_meta in me_map>;
        unlock(write_mutex);
        /* Preserve the commit */
        mdb_env_sync(env, 0);
    }

    # define pos2id(env, pos) (((MDB_meta*)((env)->me_map+(pos)))->mm_txnid)

    mdb_txn_sync(MDB_txn *txn, int force)
    {
        MDB_env *env = txn->mt_env;
        MDB_txninfo *txns = env->me_txns;
        enum { metapos_pos = offsetof(MDB_header, mh_metapos) };

        lock(meta_mutex);

        /* Positions of meta pages known to datafile and lockfile */
        size_t cur = *(size_t *)(env->me_map + metapos_pos);
        size_t lastpos = txns->mtb_metapos;
        int got_new = pos2id(env, lastpos) > pos2id(env, cur);

        if (force || (got_new && !(env->me_flags & MDB_NOSYNC)))
            fdatasync(env->me_fd);

        /* Make datafile catch up with pre-fdatasync lockfile */
        if (got_new)
            pwrite(env->me_mfd, &lastpos, sizeof(lastpos), metapos_pos);

        unlock(meta_mutex);
    }
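To check the reasoning, the draft can be reduced to a toy model that runs without MDB. Everything here is an assumption made for illustration: the "datafile" is a byte array with the header word at offset 0, the lockfile is a single field, and metas are appended instead of being placed in freshly written data pages. No real I/O or locking is done.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { MAP_SIZE = 4096, HDR_METAPOS = 0, FIRST_META = 64 };

typedef struct { uint64_t txnid; } toy_meta;   /* stands in for MDB_meta */

typedef struct {
    unsigned char map[MAP_SIZE];  /* toy datafile, header word at offset 0 */
    size_t lock_metapos;          /* toy lockfile->mtb_metapos */
    size_t next_free;             /* where the next meta gets appended */
} toy_env;

static uint64_t pos2id(const toy_env *env, size_t pos) {
    toy_meta m;
    memcpy(&m, env->map + pos, sizeof m);
    return m.txnid;
}

static size_t hdr_metapos(const toy_env *env) {
    size_t pos;
    memcpy(&pos, env->map + HDR_METAPOS, sizeof pos);
    return pos;
}

static void toy_init(toy_env *env) {
    size_t first = FIRST_META;
    memset(env, 0, sizeof *env);              /* initial meta has txnid 0 */
    memcpy(env->map + HDR_METAPOS, &first, sizeof first);
    env->lock_metapos = first;
    env->next_free = first + sizeof(toy_meta);
}

/* Commit: write a new meta, then publish it in the lockfile only. */
static void toy_commit(toy_env *env) {
    toy_meta m = { pos2id(env, env->lock_metapos) + 1 };
    size_t pos = env->next_free;
    assert(pos + sizeof m <= MAP_SIZE);
    memcpy(env->map + pos, &m, sizeof m);
    env->next_free = pos + sizeof m;
    env->lock_metapos = pos;
}

/* Sync: the datafile header catches up with the lockfile. */
static void toy_sync(toy_env *env) {
    size_t lastpos = env->lock_metapos;
    /* a real implementation would fdatasync() the data pages here */
    if (pos2id(env, lastpos) > pos2id(env, hdr_metapos(env)))
        memcpy(env->map + HDR_METAPOS, &lastpos, sizeof lastpos);
}

/* Lockfile reset after a crash: restart from the header's commit. */
static void toy_reset_lockfile(toy_env *env) {
    env->lock_metapos = hdr_metapos(env);
}
```

Committing three times with one sync after the first commit and then resetting the lockfile leaves the environment at txn 1: the two unsynced commits are lost, but the commit the header names is intact.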
Hallvard Breien Furuseth wrote:
I think MDB v2 should move the variable parts of MDB_meta into the data pages. The datafile header would retain a word with the position of the last *synced* MDB_meta, or of the last meta when MDB_NOSYNC. The lockfile header would hold the position of the *last* MDB_meta.
All transactions start from the lockfile->metapos commit. Write txns do not reuse free pages younger than the datafile->metapos commit.
I don't see this approach reducing seek overhead. It may be able to reduce sync overhead, but only if you accept the possibility of delayed syncs failing. Overall I don't see it as any improvement for ACID compliance.
What is the real benefit?
It may be worthwhile. I just want the actual specific advantages spelled out. Other DB systems use delayed/group commit to reduce sync overhead. It's worth doing, when your application can tolerate that type of behavior. But this can't be the default behavior.
Howard Chu writes:
I don't see this approach reducing seek overhead. It may be able to reduce sync overhead, but only if you accept the possibility of delayed syncs failing. Overall I don't see it as any improvement for ACID compliance.
What is the real benefit?
I thought I said under results, but maybe I went at it backwards again...
[From the other message]
Sounds to me like you want to make MDB fully multi-version.
No, I didn't think of that at all.
Anyway:
Default mode offers full ACID if no outside interruptions interfere, so that can hardly be improved. MDB with some of the speedup flags does not:
In terms of ACID, this is an ACI-safe variant of using MDB_NOSYNC + some mdb_env_sync()s, or MDB_MAPASYNC. That improves sync and seek overhead, but those flags allow a system crash to corrupt the database.
If you are willing to lose transactions but want ACI, MDB supports syncing once per txn (MDB_NOMETASYNC) instead of twice, but not fewer than that. This suggestion allows fewer.
Also syncing after unlock might speed up some programs since the next write txn can start sooner. That would be for whoever would use MDB_MAPASYNC if it were ACI-safe - i.e. sync busily but don't wait.
BTW, I wonder if the MDB_MAPASYNC flag is a good idea. It's almost as busy as MDB_NOMETASYNC, yet guarantees nothing. I don't know why anyone would use it. It might be more useful as an option to mdb_env_sync(), which a user of MDB_NOSYNC could call now and then.
One thing about ACID - IIRC there are some potential data races which my single-word meta info in the header will fix. E.g. if the user does ^Z at an unfortunate time, and with WRITEMAP updating the meta page while a txn is reading it. Could page back in the IRC discussions to check.
It may be worthwhile. I just want the actual specific advantages spelled out. Other DB systems use delayed/group commit to reduce sync overhead.
Delayed sync, yes.
It's worth doing, when your application can tolerate that type of behavior. But this can't be the default behavior.
Indeed, default mode should be safe. Or almost-safe - I wonder if MDB_NOMETASYNC is the best default since it's ACI-safe and will normally lose nothing.
I wrote:
One thing about ACID - IIRC there are some potential data races which my single-word meta info in the header will fix. E.g. if the user does ^Z at an unfortunate time, and with WRITEMAP updating the meta page while a txn is reading it.
Reproducible by sending frequent SIGSTOPs and SIGCONTs, with mdb.c patched to verify (memcmp or checksum) the meta info after copying. Happens even with 1 MDB process.
(Re comments elsewhere: Selecting a meta page is atomic, but copying/writing it is not.)
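The kind of check described, comparing the copy against the source after copying, might look like the sketch below. It is not the actual patch to mdb.c. The toy_meta layout and the stall hook are invented so the check can be exercised in a single thread, and the check only catches a change that lands between the copy and the comparison.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy meta with no padding, so memcmp() compares only real fields. */
typedef struct { uint64_t txnid; uint64_t root; } toy_meta;

/* Hypothetical hook: stands in for whatever runs while the reader
 * is stopped between the copy and the check. */
typedef void (*stall_hook)(toy_meta *live);

/* Example hook: the page gets recycled by a later commit. */
static void overwrite_meta(toy_meta *live) {
    live->txnid += 2;
    live->root = live->txnid * 10;
}

/* Copy the live meta, then compare the copy with the source again.
 * Returns 1 if the source did not change in between, else 0. */
static int copy_meta_checked(toy_meta *live, toy_meta *out, stall_hook stall) {
    memcpy(out, live, sizeof *out);
    if (stall)
        stall(live);
    return memcmp(out, live, sizeof *out) == 0;
}
```

A reader that gets 0 back would have to pick a meta again and retry.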
Hallvard Breien Furuseth wrote:
I wrote:
One thing about ACID - IIRC there are some potential data races which my single-word meta info in the header will fix. E.g. if the user does ^Z at an unfortunate time, and with WRITEMAP updating the meta page while a txn is reading it.
Reproducible by sending frequent SIGSTOPs and SIGCONTs, with mdb.c patched to verify (memcmp or checksum) the meta info after copying. Happens even with 1 MDB process.
(Re comments elsewhere: Selecting a meta page is atomic, but copying/writing it is not.)
True, but writing the meta page doesn't need to be atomic - it won't be considered current until its txnid is updated. Until then, its txnid will be less than the other page's anyway.
Howard Chu writes:
Hallvard Breien Furuseth wrote:
I wrote:
One thing about ACID - IIRC there are some potential data races which my single-word meta info in the header will fix. E.g. if the user does ^Z at an unfortunate time, and with WRITEMAP updating the meta page while a txn is reading it.
Reproducible by sending frequent SIGSTOPs and SIGCONTs, with mdb.c patched to verify (memcmp or checksum) the meta info after copying. Happens even with 1 MDB process.
(Re comments elsewhere: Selecting a meta page is atomic, but copying/writing it is not.)
True, but writing the meta page doesn't need to be atomic - it won't be considered current until its txnid is updated. Until then, its txnid will be less than the other page's anyway.
I did test it. Maybe the signal caught a reader in the middle of copying a meta page, then several write transactions changed both metas.
I see WRITEMAP can only be relevant if something reorders some accesses, and I see no particular reason for that to happen in this case. Or at least not compared to reordering meta page vs. me_txns->mti_txnid access.
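The stalled-reader hypothesis can be modelled in a single thread. This is a sketch of the guess above, not of the real reader code: two toy meta pages, a writer that always overwrites the older one and sets txnid last, and a reader whose copy is interrupted by a given number of commits. The invariant root == txnid * 10 is invented so a torn copy is detectable.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t txnid; uint64_t root; } toy_meta;
typedef struct { toy_meta metas[2]; } toy_env;

/* Writer: overwrite the older meta page. The page only becomes
 * current when its txnid is stored, as in the reply quoted above. */
static void toy_commit(toy_env *env) {
    int newest = env->metas[1].txnid > env->metas[0].txnid;
    toy_meta *dst = &env->metas[!newest];
    uint64_t id = env->metas[newest].txnid + 1;
    dst->root = id * 10;
    dst->txnid = id;
}

/* Reader: pick the newest meta, copy half of it, get stopped while
 * stall_commits write txns run, then copy the rest.
 * Returns 1 if the copy is consistent, 0 if it is torn. */
static int toy_read(toy_env *env, int stall_commits, toy_meta *out) {
    int newest = env->metas[1].txnid > env->metas[0].txnid;
    const toy_meta *src = &env->metas[newest];
    out->root = src->root;                /* first half of the copy */
    for (int i = 0; i < stall_commits; i++)
        toy_commit(env);                  /* SIGSTOP ... SIGCONT */
    out->txnid = src->txnid;              /* second half of the copy */
    return out->root == out->txnid * 10;
}
```

One commit during the stall is harmless, since it goes to the other page. Two commits recycle the page being copied, and the copy ends up with fields from two different txns.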
And of course I saw one problem as soon as I hit Send, after mulling over this for hours:( Shouldn't be a problem, though.
All transactions start from the lockfile->metapos commit. Write txns do not reuse free pages younger than the datafile->metapos commit.
Except a system crash may yet lose the latter transaction if the datafile->metapos itself has not been synced. So the datafile also needs to indicate the previously synced datafile->metapos value. Or the related txnid, since I think the only place this info is needed is in the search for old txnids in mdb_page_alloc().
If it proves useful to keep this in a single header word: Maybe MDB_meta could hold "txnid of last synced metapos", and (datafile->metapos & 1) can be set if "last synced metapos" == the current metapos. I.e. a full sync would sync the data pages, set datafile->metapos, sync again, and set datafile->metapos |= 1 to note that this transaction is safe.
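A sketch of how that single header word could be packed and used, assuming meta positions are at least 2-byte aligned so bit 0 is free. METAPOS_SYNCED and the helper names are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t txnid_t;

#define METAPOS_SYNCED ((size_t)1)   /* hypothetical flag in bit 0 */

/* Byte offset of the meta, with the flag bit masked off. */
static size_t metapos_offset(size_t word) {
    return word & ~METAPOS_SYNCED;
}

/* Nonzero if the header says the current meta is known to be synced. */
static int metapos_is_synced(size_t word) {
    return (int)(word & METAPOS_SYNCED);
}

static size_t metapos_make(size_t pos, int synced) {
    assert((pos & METAPOS_SYNCED) == 0);   /* position must be even */
    return pos | (synced ? METAPOS_SYNCED : 0);
}

/* Newest txnid whose freed pages a writer may recycle: the current
 * meta's txnid if the header flags it as synced, otherwise the
 * "txnid of last synced metapos" stored in that meta. */
static txnid_t reuse_limit(size_t word, txnid_t cur_txnid,
                           txnid_t last_synced_txnid) {
    return metapos_is_synced(word) ? cur_txnid : last_synced_txnid;
}
```

With this layout a full sync is: sync the data pages, store metapos_make(pos, 0), sync again, then set the flag bit.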
Hallvard Breien Furuseth wrote:
I think MDB v2 should move the variable parts of MDB_meta into the data pages. The datafile header would retain a word with the position of the last *synced* MDB_meta, or of the last meta when MDB_NOSYNC. The lockfile header would hold the position of the *last* MDB_meta.
Sounds to me like you want to make MDB fully multi-version. I don't see any benefit for OpenLDAP/slapd in doing that.
If that's not what you're trying to do, then you need to specify the algorithm for allocating meta pages such that versions don't accumulate endlessly.
All transactions start from the lockfile->metapos commit. Write txns do not reuse free pages younger than the datafile->metapos commit.
mdb_env_sync() called by the user does roughly:

    size_t lastpos = lockfile->metapos;
    sync;
    # define pos2id(env, pos) ((MDB_meta*)((env)->me_map+(pos)))->mm_txnid
    if (pos2id(env, lastpos) > pos2id(env, datafile->metapos))
        write lastpos to &datafile->metapos;

Called from mdb_txn_commit(), this may need lastpos as an argument.
Results, if I'm keeping this straight:
Setting the latest commit becomes atomic: Just change metapos. (Field MDB_txninfo.mti_txnid goes away.)
The latest commit is already atomic. mti_txnid is updated atomically in the current code.
No sync issues with copying 'MDB_db's from the meta, since the meta will not be overwritten during the txn.
There are no sync issues in the current code.
Users can sync infrequently yet preserve consistency, a generalization of MDB_NOMETASYNC. An application crash will then lose unsynced commits, since resetting the lockfile must reset lockfile->metapos. MDB cannot know if a system crash left those commits unsynced.
OK, preserving consistency is potentially a win vs what we have now. But it's also more of a crapshoot - you're only providing ACI, not D, and the application won't hear about the loss of D until long after the fact.
Some applications can probably tolerate this. But is it something we want to deal with?
Howard Chu writes:
Setting the latest commit becomes atomic: Just change metapos. (Field MDB_txninfo.mti_txnid goes away.)
The latest commit is already atomic. mti_txnid is updated atomically in the current code.
No sync issues with copying 'MDB_db's from the meta, since the meta will not be overwritten during the txn.
There are no sync issues in the current code.
I may have been thinking of the same thing with these two, but maybe I'm out of date.
OK, preserving consistency is potentially a win vs what we have now. But it's also more of a crapshoot - you're only providing ACI, not D, and the application won't hear about the loss of D until long after the fact.
Some applications can probably tolerate this. But is it something we want to deal with?
Then I'm not sure what's the point of so many speedup options like it has now.
Hallvard Breien Furuseth wrote:
Howard Chu writes:
Setting the latest commit becomes atomic: Just change metapos. (Field MDB_txninfo.mti_txnid goes away.)
The latest commit is already atomic. mti_txnid is updated atomically in the current code.
No sync issues with copying 'MDB_db's from the meta, since the meta will not be overwritten during the txn.
There are no sync issues in the current code.
I may have been thinking of the same thing with these two, but maybe I'm out of date.
OK, preserving consistency is potentially a win vs what we have now. But it's also more of a crapshoot - you're only providing ACI, not D, and the application won't hear about the loss of D until long after the fact.
Some applications can probably tolerate this. But is it something we want to deal with?
Then I'm not sure what's the point of so many speedup options like it has now.
MDB_NOSYNC is perfectly safe on some filesystems like ZFS that guarantee write order.
Some apps want the ability to return immediately from txn_commit while performing syncs in a background thread. MAPASYNC lets us do that. What you're talking about may also do that. I just want to be clear about the motivation and the expected benefit.
Using variably positioned meta pages seems like something we would try to cut down on seek overhead, but looking closer it doesn't appear that it can do that.
Howard Chu writes:
Hallvard Breien Furuseth wrote:
Then I'm not sure what's the point of so many speedup options like it has now.
MDB_NOSYNC is perfectly safe on some filesystems like ZFS that guarantee write order.
Nice. Should tweak the doc a bit, then.
Do you mean safe without MDB_WRITEMAP - i.e. it orders write system calls correctly? Or even with MDB_WRITEMAP - notices the order of updates to mmap pages? I imagine the latter would be rather tough.
Sounds like this needs yet another mode for best performance, though: Sync after but not before writing the meta page.
Some apps want the ability to return immediately from txn_commit while performing syncs in a background thread. MAPASYNC lets us do that.
Only with MDB_WRITEMAP and if you do not care about consistency after a system crash, as mentioned in my other mail. Unless on ZFS and (MDB_NOSYNC & MDB_WRITEMAP) is safe, as above.
What you're talking about may also do that. I just want to be clear about the motivation and the expected benefit.
Using variably positioned meta pages seems like something we would try to cut down on seek overhead, but looking closer it doesn't appear that it can do that.
Right, default mode will still write the file header every time. Hm, and it'll be a bit slower since some commits will need one more page: When none of the committed pages have room for the MDB_meta.
Fewer seeks would only be a side effect of choosing fewer syncs.
Howard Chu writes:
Using variably positioned meta pages seems like something we would try to cut down on seek overhead, but looking closer it doesn't appear that it can do that.
Could some commits choose where the next commit's MDB_meta should go? Maybe peel a page off the freelist on behalf of the next commit. Then the current commit's MDB_meta includes a reference to that page and a check bit which is equal to some bit in the chosen page's MDB_meta. Instead of updating the file header, next commit toggles that bit in its own MDB_meta to show that it is in use.
mdb_page_alloc() cannot reuse free pages younger than the current chain. mdb_env_pick_meta() must walk this chain of forward references, so such a chain should be short.
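One possible reading of the check-bit chain, as a toy model. The field names, the page array and the NPAGES hop bound are assumptions made for illustration; pick_meta here only plays the role suggested for mdb_env_pick_meta().

```c
#include <assert.h>

enum { NPAGES = 8, NO_NEXT = -1 };

typedef struct {
    unsigned long txnid;
    int next;        /* page reserved for the next commit's meta */
    unsigned check;  /* reserved page's 'used' bit when it was reserved */
    unsigned used;   /* toggled by the commit that writes this page */
} toy_meta;

/* Walk forward from the meta the file header names. A reserved page
 * whose 'used' bit still equals 'check' has not been written yet. */
static int pick_meta(const toy_meta pages[], int start) {
    int cur = start;
    for (int hops = 0; hops < NPAGES; hops++) {   /* bound guards cycles */
        int next = pages[cur].next;
        if (next == NO_NEXT || pages[next].used == pages[cur].check)
            break;
        cur = next;
    }
    return cur;
}

/* Commit into the page reserved by 'cur' and reserve 'reserve' for
 * the following commit. Returns the page that now holds the meta. */
static int chain_commit(toy_meta pages[], int cur, int reserve) {
    int dst = pages[cur].next;
    assert(dst != NO_NEXT);
    pages[dst].txnid = pages[cur].txnid + 1;
    pages[dst].next = reserve;
    pages[dst].check = (reserve == NO_NEXT) ? 0 : pages[reserve].used;
    pages[dst].used = !pages[cur].check;   /* toggle last: page in use */
    return dst;
}
```

The walk costs one page visit per commit made since the header was last updated, which is why the chain should stay short.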