On 15/12/14 22:26, Howard Chu wrote:
Yes, finding that gap is the tricky part now. I'm guessing we'll need a freeDB format change to have enough info to do it, though maybe not. Currently we record that in txn X, some list of pageIDs was freed. In LMDB 1.0 we will also have per-page txnID stamps, to tell us when a page was written.
Preserving the range from [txnO,txnL] pretty much means we cannot reuse any pages up to txnL's free list. I.e., the page's stamp must be > txnL, it must have been written at a point no txn in [txnO,txnL] could have seen it. In [txnL+1,txnE] we can reuse all the free pages whose stamp is
txnL. Perhaps we don't need a freeDB change.
Sure we do. The freeDB should have a more efficient format anyway, for overflow pages. Peeking at a lot of old pages and rejecting them can be a cache massacre. And a file page in the middle of a multi-page overflow page has no header to peek at anyway.
Also: A transaction which must inspect a page to see if it can be reused, must not overwrite that info in the page. The txn might abort or get rolled back by a system crash. Instead, it can move the page to a "reusable" list and commit. Then we wait one or two transactions for sync to metapages, and then we can reuse the page.
This may not be needed when the newer txnid in the page header is sufficient info, but I suggest we don't try to keep that straight yet: Can't overwrite if the page ends up inside an overflow page. Also the newer txnid will not always imply it can be reused.
Anyway, consider a page written by transaction ID (revision) W and freed (no longer seen) by txnid F: Visible to snapshots [W..F-1]. Transaction F writes freeDB record {F: list of pages freed by F}.
To F, oldest snapshot "S" which can see the page = max(W, oldest current snapshot). Store (F-S) along with the page number, it will often fit in the unused bits of the page number. If it doesn't, we can fall back to some expanded format, or peek at the page later, or return to the old way: Wait for all older readers.
Page numbers have 16 unused bits on 64-bit CPUs, log2(page size) unused bits on 32-bit CPUs. Use at least one bit to indicate a multi-page overflow page, or that the expanded format is used and this includes both age and number of file pages. (Overflow pages are used for nodes bigger than half a page.)
Once we have determined that a page can be reused, it can move to a freeDB record format which doesn't care about old snapshot IDs. Unless it does get reused at once, of course.
About MDB_RDONLY transactions: They just store their snapshot's transaction ID in the reader table and clear it on abort/commit.
Write transactions scan the reader table and do all the thinking. When they scan the table to make a list of all readers, they can save the youngest ones in a bit-vector indexed by transaction ID and the ones which do not fit there in a list. If most readers are ephemeral, that list will be short. If we sort the pages in a freeDB record by (F-S) above, it's easy to check the youngest freeDB records against the bitmap to look for pages we can use.