(This is a quick follow-up to an earlier discussion, a note left for anyone confused by MDB_NOLOCK.)
On Mon, Dec 1, 2014 at 6:55 AM, Hallvard Breien Furuseth <h.b.furuseth@usit.uio.no> wrote:
A write transaction frees pages which its new snapshot cannot see. A later writer will overwrite them, when no *known* readers can see them either. But with MDB_NOLOCK, writers don't know about old readers and might overwrite pages which old readers can see.
Last snapshot is never overwritten. So readers which did begin/renew after latest commit(write txn) are safe from txn_begin(write txn).
The same with the commit of the write txn before that. I think. MDB keeps the last two snapshots in the metapages.
I've been reading the MDB source code a bit more to verify this assumption. It is not valid.
MDB does keep the last two metapages, but may begin to dismantle the elder of the two for pages if there are no readers for it. With MDB_NOLOCK, it simply assumes there are no readers for it (cf. mdb_find_oldest). **Only the most recent snapshot is preserved.**
For my current use case, I believe that I can still achieve a sufficient level of parallelism even if limited to double-buffering (whereas two snapshots would give me triple-buffering). I'm not going to press for any changes at this time.
On Fri, Jan 30, 2015 at 3:57 PM, David Barbour <dmbarbour@gmail.com> wrote:
For my current use case, I believe that I can still achieve a sufficient level of parallelism even if limited to double-buffering (whereas two snapshots would give me triple-buffering). I'm not going to press for any changes at this time.
After having examined this further, I've changed my mind.
With triple buffering, I can guarantee that the writer *almost* never waits on a short-running reader, and that the readers never wait on the writer. With double buffering, the probability of the writer waiting on even short-running readers, assuming they are frequent, is nearly 100%. Triple buffering is thus a huge advantage for users of MDB_NOLOCK.
The update to support this is almost trivial: tweak `mdb_find_oldest` such that both meta-page snapshots are considered to have active readers. I'm willing to develop and submit a patch, but only if this change also sounds good to the main LMDB developers.
Regards,
Dave
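(Editorial sketch, not part of the original thread. The double- versus triple-buffering argument above can be modelled in a few lines. This is a toy model in Python, not LMDB code: the function name `writer_must_wait` and the `protected` parameter are hypothetical, and it assumes the application tracks which snapshot each reader holds, since MDB_NOLOCK leaves that bookkeeping to the caller.)

```python
def writer_must_wait(next_txnid, reader_snapshots, protected):
    """True if some active reader uses a snapshot the next writer may overwrite.

    next_txnid        id of the write txn about to begin (last commit is next_txnid - 1)
    reader_snapshots  snapshot ids currently held by readers (tracked by the app)
    protected         how many of the newest snapshots are safe from page reuse
    """
    oldest_safe = next_txnid - protected
    return any(snap < oldest_safe for snap in reader_snapshots)


# Snapshots 8 and 9 are the last two commits; the writer prepares txn 10.
# One short-running reader began just before commit 9, so it still holds 8.
readers = [8, 9]
print(writer_must_wait(10, readers, protected=1))  # double buffering: True
print(writer_must_wait(10, readers, protected=2))  # triple buffering: False
```

(With only one protected snapshot, any reader that started before the latest commit stalls the writer. With two, the writer only waits on readers that have outlived two commits.)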
David Barbour wrote:
On Fri, Jan 30, 2015 at 3:57 PM, David Barbour <dmbarbour@gmail.com> wrote:
For my current use case, I believe that I can still achieve a sufficient level of parallelism even if limited to double-buffering (whereas two snapshots would give me triple-buffering). I'm not going to press for any changes at this time.
After having examined this further, I've changed my mind.
With triple buffering, I can guarantee that the writer *almost* never waits on a short-running reader, and that the readers never wait on the writer. With double buffering, the probability of the writer waiting on even short-running readers, assuming they are frequent, is nearly 100%. Triple buffering is thus a huge advantage for users of MDB_NOLOCK.
The update to support this is almost trivial: tweak `mdb_find_oldest` such that both meta-page snapshots are considered to have active readers. I'm willing to develop and submit a patch, but only if this change also sounds good to the main LMDB developers.
That is supposed to be its current behavior already. I.e., no page that either of the two meta pages points to is ever allowed to be reclaimed.
On Fri, Jan 30, 2015 at 6:21 PM, Howard Chu <hyc@symas.com> wrote:
That is supposed to be its current behavior already. I.e., no page that either of the two meta pages points to is ever allowed to be reclaimed.
Okay then. On Monday, I'll see if I can write a test demonstrating this bug. And, if so, a patch to fix it.
On 30/01/15 22:57, David Barbour wrote:
On Mon, Dec 1, 2014 at 6:55 AM, Hallvard Breien Furuseth <h.b.furuseth@usit.uio.no> wrote: (...) Last snapshot is never overwritten. So readers which did begin/renew after latest commit(write txn) are safe from txn_begin(write txn).
The same with the commit of the write txn before that. I think. MDB keeps the last two snapshots in the metapages.
I've been reading the MDB source code a bit more to verify this assumption. It is not valid.
MDB does keep the last two metapages, but may begin to dismantle the elder of the two for pages if there are no readers for it.
Yes, true. The last two snapshots' *data pages* are never overwritten, and any readers using them will have read the metapage and do not need it again.
I confused myself because I was thinking of sync issues: Even if the oldest metapage has been overwritten, that does not mean it is gone yet: If it has not been synced to disk, a system crash can bring it back. And with it, its refs to datapages.
Looking closer though, that's only relevant with MDB_NOSYNC, where the previous metapage has not been synced either.
Okay, I think I see what's happening here:
mdb_find_oldest(): will return the most recent snapshot if no readers exist
mdb_page_alloc(): will search FREE_DBI for a transaction `last` that is less than oldest, and will try to find a contiguous range of pages that were free'd by said transaction, potentially merging free pages from many transactions. If nothing is found, will instead grow the database.
Since `last` < `oldest` when we reuse any old pages, and we're only using the 'freed' pages from last (not the data pages), we know that the data pages for the eldest two transactions are protected.
Is this right?
My earlier assumption (before reading mdb_page_alloc) was that LMDB would be aggressive about grabbing pages freed by transactions that are not actively being read. If we're relying on `last < oldest` to create a two page discrepancy, this means that when we actually have readers on older transactions, we're being a little more conservative than necessary. But it does protect the last two snapshots.
...
Idle question: what happens with freeing old pages when the txnid_t wraps around on a 32-bit system? Do the pages free'd by those transactions just get stuck?
David Barbour wrote:
Idle question: what happens with freeing old pages when the txnid_t wraps around on a 32-bit system? Do the pages free'd by those transactions just get stuck?
Probably. Again, I really don't see 32-bit systems as being worthy of consideration.
On Sat, Jan 31, 2015 at 12:50 AM, Howard Chu <hyc@symas.com> wrote:
Probably. Again, I really don't see 32-bit systems as being worthy of consideration.
That's understandable. There are other embedded databases suitable for 32-bit systems.
David Barbour wrote:
Probably. Again, I really don't see 32-bit systems as being worthy of consideration.
That's understandable. There are other embedded databases suitable for 32-bit systems.
If anyone ever runs a 32 bit server fast enough and long enough to process 4 billion write transactions, they can always do an mdb_copy to reset the txnIDs.
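(Editorial sketch, not part of the original thread. To give a feel for "fast enough and long enough", here is the arithmetic for a 32-bit transaction id. The write rates are made-up illustrations, and `days_until_wrap` is a hypothetical helper, not an LMDB function.)

```python
def days_until_wrap(write_txns_per_second, txnid_bits=32):
    """Days of sustained writing before a txnid of the given width wraps."""
    seconds = 2 ** txnid_bits / write_txns_per_second
    return seconds / 86_400  # seconds per day


# Hypothetical sustained rates, in write transactions per second.
for rate in (1, 100, 1_000):
    print(rate, round(days_until_wrap(rate), 1))
```

(At a sustained 1,000 write transactions per second the counter would wrap after roughly 50 days. At one per second it would take well over a century.)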
David Barbour wrote:
Okay, I think I see what's happening here:
mdb_find_oldest(): will return the most recent snapshot if no readers exist
mdb_page_alloc(): will search FREE_DBI for a transaction `last` that is less than oldest, and will try to find a contiguous range of pages that were free'd by said transaction, potentially merging free pages from many transactions. If nothing is found, will instead grow the database.
Since `last` < `oldest` when we reuse any old pages, and we're only using the 'freed' pages from last (not the data pages), we know that the data pages for the eldest two transactions are protected.
Is this right?
Yep.
My earlier assumption (before reading mdb_page_alloc) was that LMDB would be aggressive about grabbing pages freed by transactions that are not actively being read. If we're relying on `last < oldest` to create a two page discrepancy, this means that when we actually have readers on older transactions, we're being a little more conservative than necessary.
More than necessary? I don't think so.
But it does protect the last two snapshots.
Yes, always.
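(Editorial sketch, not part of the original thread. The rule confirmed above, that a writer only reuses pages from free-list records with `last < oldest`, can be checked with a toy model. This is Python, not LMDB's C source: `find_oldest` and `reusable_pages` are simplified stand-ins for the behaviour described for mdb_find_oldest and mdb_page_alloc, the free list is a plain dict, and the page numbers are invented.)

```python
def find_oldest(writer_txnid, reader_txnids):
    """Oldest snapshot id the writer must respect.

    With no known readers (always the case under MDB_NOLOCK) this is the
    most recently committed snapshot, writer_txnid - 1.
    """
    return min([writer_txnid - 1, *reader_txnids])


def reusable_pages(free_db, oldest):
    """Pages from free-list records whose freeing txn id is below `oldest`."""
    pages = []
    for freed_by in sorted(free_db):
        if freed_by >= oldest:  # the `last < oldest` rule from the thread
            break
        pages.extend(free_db[freed_by])
    return pages


# Hypothetical state: snapshots 8 and 9 are committed, writer 10 is starting.
# free_db maps "txn that freed the pages" -> page numbers it freed.
free_db = {7: [11, 12], 8: [13], 9: [14, 15]}

oldest = find_oldest(10, [])  # no reader table under MDB_NOLOCK
print(oldest, reusable_pages(free_db, oldest))  # 9 [11, 12, 13]
```

(Pages 14 and 15 were freed by txn 9, so snapshot 8 still uses them and they stay off limits. Page 13 was freed by txn 8, so neither snapshot 8 nor snapshot 9 references it. Both of the last two snapshots survive, with or without a reader table.)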