Hi guys,
we have had long discussions with Howard on LMDB about the release of pages that are no longer in use. This is a critical part of an MVCC-based B-tree, as each update copies many pages, and we want to re-use the copied pages.
The problem is that those copied pages may still be in use by older readers. Let's quickly explain how it all works.
At Ti, we are updating the B-tree, and that creates many new pages by copying pages from previous revisions. We will denote those new pages by the level of the B-tree they sit at and the revision of the version they replace: P[0, i-1] is the root page, at level 0, whose previous version is i-1 (as it's the root page, its previous version is obviously one revision earlier than the current one). P[k, a] is a page at level k whose previous version was written at revision a (where a <= i-1).
One more important point: a copied page may be split into two new pages (if the previous page was full), or the content of two old pages may be merged into a single new page (when the two old pages hold only N/2 elements each and we remove one element from one of them). Last but not least, we may have to remove one level, if the root page is removed. All in all, what is important here is to *know* which pages have been copied during the update operation, because those are the pages that will have to be removed at some point.
Now, assuming we keep track of all the pages copied at a revision R, we can remove all those pages once revision R is no longer in use (i.e. no reader thread is using R *and* no other reader thread is using a revision below R). This is what LMDB does, AFAICT.
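To make the bookkeeping concrete, here is a minimal sketch of what such a per-revision record could look like (names and layout are purely illustrative, not the actual LMDB structures):

#include <stdint.h>
#include <stddef.h>

/* One page that was replaced (copied) by the update at 'revision'. */
typedef struct copied_page {
    uint64_t page_no;        /* on-disk number of the old copy        */
    uint64_t page_revision;  /* revision 'a' that wrote that old copy */
    int      level;          /* depth 'k' in the B-tree (0 = root)    */
} copied_page;

/* All the pages replaced, split or merged by the update at revision R. */
typedef struct copied_pages_record {
    uint64_t     revision;   /* the update R that made the copies     */
    size_t       count;
    copied_page *pages;
} copied_pages_record;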
Can we go a bit further, and get rid of copied pages when a reader is using a revision < R ? I believe so.
Say that at revision R we have copied a page P[k, c], and that a reader is using a revision P < c (obviously, if a reader uses a revision >= c, we can't free P[k, c], because it's fully part of the B-tree that reader sees). If P < c, the reader's B-tree was built before P[k, c] existed: the reader sees the page that P[k, c] replaced, not P[k, c] itself. Now that P[k, c] has been copied in turn at revision R, no reader will ever use it, so we can get rid of it.
case 1 : P is newer than the copied page :
                      P           R
                      |           |
                      v           v
o--------[c]---------[P]---------[n]------> time
          ^                       |
          |                       +--> we create a new revision, which copies page [c] and creates page [n]
          |
          +--------------> Revision P is using page [c]
Here, we can't get rid of page [c], as it's still part of revision P's B-tree.
case 2 : P is older than the copied page
          P           Q           R
          |           |           |
          v           v           v
o--------[P]---------[c]---------[n]------> time
          |                       |
          |                       +--> we create a new revision, which copies page [c] and creates page [n]
          |
          +--------------> Revision P
Here, an update at revision Q created page [c], which was then copied by the update at revision R. We can get rid of page [c] because no reader will ever use it.
That is all clear and easy. Now, the real problem is: how do we determine the pages we can free, based on the pages copied at revision R? At revision R, we know which pages we have copied, and each page carries a unique revision number. That allows us to know which pages to free quite immediately: we just free all the copied pages whose revision is newer than the most recent reader's revision.
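As a sketch, reusing the hypothetical copied_pages_record above, and assuming the newest active reader revision is known, that immediate reclaim step could look like this:

/* Free, right after the update commits, every page it copied whose own
 * revision is newer than the newest active reader: no present or future
 * reader can ever reach such a page. */
size_t free_young_copies(const copied_pages_record *rec,
                         uint64_t newest_reader_revision,
                         void (*recycle)(uint64_t page_no))
{
    size_t n = 0;
    for (size_t i = 0; i < rec->count; i++) {
        if (rec->pages[i].page_revision > newest_reader_revision) {
            recycle(rec->pages[i].page_no);   /* e.g. push on a free list */
            n++;
        }
    }
    return n;
}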
For instance, we can immediately get rid of the root pages of any revision not in use, because the root page is always copied as soon as we do an update.
That works for any update being done when we don't have recent readers (i.e. if we do many updates, this algorithm works very well).
But what if we have more than one pending reader? There is little we can do here: we have to wait for one of the readers to be released, because that is when we will be able to free many pages.
Ok, let's see what happens when we release a reader revision now.
We know which pages this reader is holding: it's the list of pages copied by the update which occurred at the same revision. When we release this reader, all those copied pages have to be checked to see if they are still in use.
     M                   P       Q    R
     |                   |       |    |
     v                   v       v    v
o---[M]---[c]----[d]----[P]-----[Q]--[n]------> time
     |     ^      ^                   |
     |     |      |                   +--> we create a new revision, which copies page [c] and creates page [n]
     |     |      +-----------------------> page [d] was copied by the update at revision Q
     |     +------------------------------> Revision P is using page [c]
     +------------------------------------> Revision M
Here, we release reader P. We should expect that every page copied between M and P can now be released (in the diagram, revision Q has copied page [d], which is still held because revision P is using it). So when we release P, we know we should free pages [c] and [d]. The difficulty here is to list all the pages held by revision P, and this is a costly process: we have to check every revision > P that has copied pages whose revision is above M and below P.
In order to do that, we must have a data structure that holds all the readers in use, ordered from the oldest to the newest, and we must have a data structure keeping track of all the pages copied by each revision. We have both, but the second one is not in memory (well, we can't assume it's in memory, as this data structure is stored in a B-tree itself).
So basically, it's a matter of browsing the copied-pages B-tree looking for all the revisions between [M] and [R], and for each of those revisions, freeing all the copied pages whose revision is > M. Not exactly inexpensive, but still acceptable, IMO.
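A sketch of the check behind that scan (hypothetical types again; 'born' is the revision that wrote the old copy, 'died' the update that replaced it, so the copy is only visible to snapshots in [born, died-1]):

#include <stdint.h>
#include <stddef.h>

typedef struct { uint64_t page_no, born, died; } page_copy;

/* True when no active reader snapshot falls inside [born, died-1].
 * 'readers' holds the active reader revisions, sorted ascending. */
static int copy_is_reclaimable(const page_copy *pc,
                               const uint64_t *readers, size_t nreaders)
{
    for (size_t i = 0; i < nreaders; i++) {
        if (readers[i] >= pc->died)
            break;                  /* sorted: later readers are newer still */
        if (readers[i] >= pc->born)
            return 0;               /* some reader still sees this copy */
    }
    return 1;
}

When reader P is released, we walk the copied-pages records for the revisions between [M] and [R] and recycle every copy for which this predicate now holds.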
Wdyt ?
Emmanuel Lécharny
Emmanuel Lécharny wrote:
Hi guys,
we have had long discussions with Howard on LMDB about the release of pages that are no longer in use. This is a critical part of an MVCC-based B-tree, as each update copies many pages, and we want to re-use the copied pages.
The problem is that those copied pages may still be in use by older readers. Let's quickly explain how it all works.
This is a good analysis, but I have to state that the investigation down this path has stalled for a major reason:
The current reader/writer interaction depends on the fact that we only need to know the oldest reader, and that we can be lazy about determining what the oldest reader is. Once you start looking at intervals between ordered ages of readers, that means you require continuously up to date reader status, which means you require highly consistent updating and reading of the readers table. That imposes memory consistency guarantees that we currently don't require, and adding those guarantees has a significant performance cost.
On 10/12/14 18:07, Howard Chu wrote:
Emmanuel Lécharny wrote:
Hi guys,
we have had long discussions with Howard on LMDB about the release of pages that are no longer in use. This is a critical part of an MVCC-based B-tree, as each update copies many pages, and we want to re-use the copied pages.
The problem is that those copied pages may still be in use by older readers. Let's quickly explain how it all works.
This is a good analysis, but I have to state that the investigation down this path has stalled for a major reason:
The current reader/writer interaction depends on the fact that we only need to know the oldest reader, and that we can be lazy about determining what the oldest reader is.
You have to keep all the readers in memory, just to be able to find the oldest one when it leaves. Then, when the oldest leaves, you have to find the new oldest one way or another, which should be easy. Or you have to check that the leaving reader is the oldest by inspecting all the other ones (and in this case, you have to iterate over the existing LDAP sessions).
In any case, you need to know about the readers, no matter what.
Once you start looking at intervals between ordered ages of readers, that means you require continuously up to date reader status,
Yes. It's just a set associated with a list (the set lets us immediately retrieve the leaving reader's position in the list).
How do you manage readers currently ?
which means you require highly consistent updating and reading of the readers table.
Yes, and no. You can avoid dealing with consistency; it just depends on how you manage the readers.
That imposes memory consistency guarantees that we currently don't require, and adding those guarantees has a significant performance cost.
Ok, now let's suppose each thread handling reads (and there is a limited number of threads for that) has its own list of readers. This list clearly does not need to be protected against concurrent access, because it's specific to each thread. Now, when a reader managed by the thread leaves, you can tell the writer that some cleanup can occur. This cleanup will be done regardless of what the reader thread does next, because it only concerns a reader that is already gone. From the writer thread's POV, it's a matter of checking every reader thread to see if the released revision is still in use or not. If not, then it's time to clean up the pages.
This is totally thread safe, with no contention at all.
Of course, I'm assuming that the reader threads each keep a set of reader information...
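Here is a rough sketch of what I have in mind (sizes and names invented for the example): each reader thread owns its slots and is the only writer of them; the writer thread only ever reads them when deciding whether a released revision is still in use.

#include <stdint.h>

#define MAX_READER_THREADS  16
#define MAX_OPEN_PER_THREAD 8

typedef struct {
    volatile uint64_t open[MAX_OPEN_PER_THREAD];   /* 0 = free slot */
} reader_slots;

static reader_slots reader_table[MAX_READER_THREADS];

/* Called by the writer thread: is revision 'rev' still open anywhere? */
static int revision_in_use(uint64_t rev)
{
    for (int t = 0; t < MAX_READER_THREADS; t++)
        for (int s = 0; s < MAX_OPEN_PER_THREAD; s++)
            if (reader_table[t].open[s] == rev)
                return 1;
    return 0;
}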
I wonder if an intermediate variation on this idea might be easier to implement:
We can describe a subset of MDB_RDONLY transactions as 'long running', e.g. when copying the database. Normal read-only transactions might then be called 'ephemeral'. We might compute long-running transactions heuristically, e.g. ((currentTxnID - rdOnlyTxnId) > 100) and/or we might allow developers to mark them.
When deciding which transactions to free, we compute three numbers with a one-pass scan of the reader lock table:
txnO = the oldest active transaction
txnE = the oldest active ephemeral transaction
txnL = the newest active long-running transaction
Normally, LMDB just computes txnO, and conservatively assumes that readers might be operating anywhere between txnO and the current transaction. Emmanuel's proposal for precise analysis is somewhat daunting. But I think we can use this simple three-point analysis to find some sweet spots between precise and conservative.
We know by definition that txnO <= txnL, and txnO <= txnE.
Under normal conditions, i.e. assuming that long-running transactions are relatively infrequent, we also know that txnL < txnE will be true more frequently than not. Conservatively, we must protect the ranges from [txnO,txnL] and [txnE,currentTxnId]. But in the common case, there will be a non-empty range of transactions (txnL,txnE) that can be GC'd. And of course anything below txnO can still be GC'd.
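A minimal sketch of that one-pass scan (the reader-table layout and the age cutoff are made up for illustration; LMDB's real reader table is different):

#include <stdint.h>
#include <stddef.h>

#define LONG_RUNNING_AGE 100   /* heuristic cutoff from above */

typedef struct { uint64_t txnid; int active; } reader_entry;
typedef struct { uint64_t txnO, txnE, txnL; } gc_points;

static gc_points scan_readers(const reader_entry *r, size_t n,
                              uint64_t current_txnid)
{
    /* txnO/txnE start at "none seen" (max), txnL at "none seen" (0) */
    gc_points g = { UINT64_MAX, UINT64_MAX, 0 };
    for (size_t i = 0; i < n; i++) {
        if (!r[i].active)
            continue;
        uint64_t id = r[i].txnid;
        if (id < g.txnO)
            g.txnO = id;
        if (current_txnid - id > LONG_RUNNING_AGE) {
            if (id > g.txnL) g.txnL = id;      /* newest long-running */
        } else {
            if (id < g.txnE) g.txnE = id;      /* oldest ephemeral    */
        }
    }
    return g;   /* pages freed in (txnL, txnE) become GC candidates */
}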
I'm not sure whether or not we can easily identify which pages are used only within that gap, and may thus be collected. I haven't studied that aspect of LMDB yet. But supposing that aspect is addressed, this scheme should greatly mitigate the impact of infrequent long-running read-only transactions. A lot more space from recent write transactions would be reusable.
David Barbour wrote:
I wonder if an intermediate variation on this idea might be easier to implement:
We can describe a subset of MDB_RDONLY transactions as 'long running', e.g. when copying the database. Normal read-only transactions might then be called 'ephemeral'. We might compute long-running transactions heuristically, e.g. ((currentTxnID - rdOnlyTxnId) > 100) and/or we might allow developers to mark them.
Interesting. I think, rather than picking an arbitrary age, we could simply look for the largest gap between two reader txnIDs and work from there. As long as the gap is greater than 2, there should be some pages to recover.
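Something like this, assuming the active reader txnids have already been collected and sorted ascending (illustrative only):

#include <stdint.h>
#include <stddef.h>

/* Find the widest gap between consecutive reader txnids; a gap wider
 * than 2 means there are whole transactions whose freed pages no
 * remaining reader can see. */
static void largest_gap(const uint64_t *ids, size_t n,
                        uint64_t *lo, uint64_t *hi)
{
    *lo = *hi = 0;
    for (size_t i = 0; i + 1 < n; i++) {
        if (ids[i + 1] - ids[i] > *hi - *lo) {
            *lo = ids[i];
            *hi = ids[i + 1];
        }
    }
}

The gap's bounds would then play roughly the roles of txnL and txnE above.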
When deciding which transactions to free, we compute three numbers with a one-pass scan of the reader lock table:
txnO = the oldest active transaction
txnE = the oldest active ephemeral transaction
txnL = the newest active long-running transaction
Normally, LMDB just computes txnO, and conservatively assumes that readers might be operating anywhere between txnO and the current transaction. Emmanuel's proposal for precise analysis is somewhat daunting. But I think we can use this simple three-point analysis to find some sweet spots between precise and conservative.
We know by definition that txnO <= txnL, and txnO <= txnE.
Under normal conditions, i.e. assuming that long-running transactions are relatively infrequent, we also know that txnL < txnE will be true more frequently than not. Conservatively, we must protect the ranges from [txnO,txnL] and [txnE,currentTxnId]. But in the common case, there will be a non-empty range of transactions (txnL,txnE) that can be GC'd. And of course anything below txnO can still be GC'd.
I'm not sure whether or not we can easily identify which pages are used only within that gap, and may thus be collected. I haven't studied that aspect of LMDB yet. But supposing that aspect is addressed, this scheme should greatly mitigate the impact of infrequent long-running read-only transactions. A lot more space from recent write transactions would be reusable.
Yes, finding that gap is the tricky part now. I'm guessing we'll need a freeDB format change to have enough info to do it, though maybe not. Currently we record that in txn X, some list of pageIDs was freed. In LMDB 1.0 we will also have per-page txnID stamps, to tell us when a page was written.
Preserving the range from [txnO,txnL] pretty much means we cannot reuse any pages up to txnL's free list. I.e., the page's stamp must be > txnL: it must have been written at a point no txn in [txnO,txnL] could have seen it. In [txnL+1,txnE] we can reuse all the free pages whose stamp is > txnL. Perhaps we don't need a freeDB change.
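Spelled out, the test would be something like this (just a sketch; 'write_stamp' stands for the 1.0-style per-page txnid stamp, 'freed_by' for the freeDB record the page sits in):

#include <stdint.h>

/* A page freed by a txn in (txnL, txnE] may be reused if it was also
 * written after txnL, so no reader at or below txnL ever saw it. */
static int page_reusable(uint64_t write_stamp, uint64_t freed_by,
                         uint64_t txnL, uint64_t txnE)
{
    return freed_by > txnL && freed_by <= txnE && write_stamp > txnL;
}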
On 15/12/14 22:26, Howard Chu wrote:
Yes, finding that gap is the tricky part now. I'm guessing we'll need a freeDB format change to have enough info to do it, though maybe not. Currently we record that in txn X, some list of pageIDs was freed. In LMDB 1.0 we will also have per-page txnID stamps, to tell us when a page was written.
Preserving the range from [txnO,txnL] pretty much means we cannot reuse any pages up to txnL's free list. I.e., the page's stamp must be > txnL: it must have been written at a point no txn in [txnO,txnL] could have seen it. In [txnL+1,txnE] we can reuse all the free pages whose stamp is > txnL. Perhaps we don't need a freeDB change.
Sure we do. The freeDB should have a more efficient format anyway, for overflow pages. Peeking at a lot of old pages and rejecting them can be a cache massacre. And a file page in the middle of a multi-page overflow page has no header to peek at anyway.
Also: A transaction which must inspect a page to see if it can be reused, must not overwrite that info in the page. The txn might abort or get rolled back by a system crash. Instead, it can move the page to a "reusable" list and commit. Then we wait one or two transactions for sync to metapages, and then we can reuse the page.
This may not be needed when the newer txnid in the page header is sufficient info, but I suggest we don't try to keep that straight yet: we can't overwrite it if the page ends up inside an overflow page, and the newer txnid will not always imply the page can be reused.
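For illustration only (the names and the delay are invented), the two-step reuse could be as simple as tagging each "reusable" entry with the txn that made the decision and honoring it a couple of commits later:

#include <stdint.h>

#define REUSE_DELAY 2   /* wait for the sync to the metapages */

typedef struct {
    uint64_t page_no;
    uint64_t decided_by;   /* txn that moved the page to the reusable list */
} reusable_entry;

/* A later write txn may hand the page out only after the decision has
 * been committed and synced for a transaction or two. */
static int may_allocate(const reusable_entry *e, uint64_t current_txnid)
{
    return current_txnid >= e->decided_by + REUSE_DELAY;
}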
Anyway, consider a page written by transaction ID (revision) W and freed (no longer seen) by txnid F: Visible to snapshots [W..F-1]. Transaction F writes freeDB record {F: list of pages freed by F}.
To F, the oldest snapshot "S" which can see the page = max(W, oldest current snapshot). Store (F-S) along with the page number; it will often fit in the unused bits of the page number. If it doesn't, we can fall back to some expanded format, or peek at the page later, or return to the old way: wait for all older readers.
Page numbers have 16 unused bits on 64-bit CPUs, log2(page size) unused bits on 32-bit CPUs. Use at least one bit to indicate a multi-page overflow page, or that the expanded format is used and this includes both age and number of file pages. (Overflow pages are used for nodes bigger than half a page.)
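A sketch of the packing for a 64-bit build (the exact bit split is illustrative, not a proposed on-disk format):

#include <stdint.h>

#define PGNO_BITS     48
#define PGNO_MASK     ((UINT64_C(1) << PGNO_BITS) - 1)
#define EXPANDED_FLAG (UINT64_C(1) << PGNO_BITS)   /* overflow / expanded */
#define AGE_SHIFT     (PGNO_BITS + 1)
#define AGE_MAX       ((UINT64_C(1) << (64 - AGE_SHIFT)) - 1)

/* Pack (F-S) next to the page number; overflow pages, or ages that do
 * not fit, fall back to the expanded format. */
static uint64_t pack_entry(uint64_t pgno, uint64_t age, int overflow)
{
    if (overflow || age > AGE_MAX)
        return pgno | EXPANDED_FLAG;   /* age and size stored separately */
    return pgno | (age << AGE_SHIFT);
}

static uint64_t entry_pgno(uint64_t e) { return e & PGNO_MASK; }
static uint64_t entry_age(uint64_t e)  { return (e & EXPANDED_FLAG) ? 0 : e >> AGE_SHIFT; }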
Once we have determined that a page can be reused, it can move to a freeDB record format which doesn't care about old snapshot IDs. Unless it does get reused at once, of course.
About MDB_RDONLY transactions: They just store their snapshot's transaction ID in the reader table and clear it on abort/commit.
Write transactions scan the reader table and do all the thinking. When they scan the table to make a list of all readers, they can save the youngest ones in a bit-vector indexed by transaction ID and the ones which do not fit there in a list. If most readers are ephemeral, that list will be short. If we sort the pages in a freeDB record by (F-S) above, it's easy to check the youngest freeDB records against the bitmap to look for pages we can use.
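A sketch of such a snapshot, with the bitmap indexed by age relative to the scanning writer's txnid (sizes and names are only illustrative):

#include <stdint.h>
#include <stddef.h>

#define RECENT_WINDOW 1024   /* bitmap covers the last 1024 txnids */
#define MAX_OLD       64

typedef struct {
    uint64_t current;                    /* txnid of the scanning writer   */
    uint8_t  recent[RECENT_WINDOW / 8];  /* bit i set = reader at current-i */
    uint64_t old[MAX_OLD];               /* readers older than the window  */
    size_t   nold;
} reader_snapshot;

static void snapshot_add(reader_snapshot *s, uint64_t reader_txnid)
{
    uint64_t age = s->current - reader_txnid;
    if (age < RECENT_WINDOW)
        s->recent[age / 8] |= (uint8_t)(1u << (age % 8));
    else if (s->nold < MAX_OLD)
        s->old[s->nold++] = reader_txnid;   /* long-running readers */
}

static int snapshot_has_reader(const reader_snapshot *s, uint64_t txnid)
{
    uint64_t age = s->current - txnid;
    if (age < RECENT_WINDOW)
        return (s->recent[age / 8] >> (age % 8)) & 1;
    for (size_t i = 0; i < s->nold; i++)
        if (s->old[i] == txnid)
            return 1;
    return 0;
}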