Was chatting with Emmanuel Lecharny (who is currently working on Mavibot for ApacheDS, an MVCC backend similar to LMDB) and had an interesting realization: we can avoid the current issue of long-lived reader txns preventing page reclamation.
The key point is this - currently for any txn, we store in the freeDB a list of the pageIDs that were freed in that txn (due to being copied or deleted). All we need is to know that any of these pages has been copied *twice* since the txn of the outstanding reader. At that point, any such page is so new that the reader would never have seen it, and if no other readers are outstanding then the page can be safely reclaimed.
Currently mavibot maintains this revision history in a dedicated Btree (much like our freeDB but with more info contained). I'm thinking, since we're already going to add a txnID to every page's page header, we can simply add a 2nd txnID, recording the txnID of the previous change to this page's ancestor. Then, any page where this prevTxnID is >= the outstanding reader's txnID can be reclaimed.
Still thinking about the actual implementation of this, it may make more sense to store the prevTxnID in the freeDB than in each page header. Ideally we want to be able to grab a chunk of pageIDs unambiguously, instead of having to iterate thru each page and read its header to determine if it's safe.
Le 2/3/14 10:14 PM, Howard Chu a écrit :
Was chatting with Emmanuel Lecharny (who is currently working on Mavibot for ApacheDS, an MVCC backend similar to LMDB) and had an interesting realization: we can avoid the current issue of long-lived reader txns preventing page reclamation.
The key point is this - currently for any txn, we store in the freeDB a list of the pageIDs that were freed in that txn (due to being copied or deleted). All we need is to know that any of these pages has been copied *twice* since the txn of the outstanding reader.
At that point, any such page is so new that the reader would never have seen it, and if no other readers are outstanding then the page can be safely reclaimed.
Currently mavibot maintains this revision history in a dedicated Btree (much like our freeDB but with more info contained). I'm thinking, since we're already going to add a txnID to every page's page header, we can simply add a 2nd txnID, recording the txnID of the previous change to this page's ancestor. Then, any page where this prevTxnID is
= the outstanding reader's txnID can be reclaimed.
Still thinking about the actual implementation of this, it may make more sense to store the prevTxnID in the freeDB than in each page header. Ideally we want to be able to grab a chunk of pageIDs unambiguously, instead of having to iterate thru each page and read its header to determine if it's safe.
You have two use cases : o when you are adding new revisions, and no new read occurs (this is the easy use case : you can get rid of any pages being copied twice) o when a read txn is released (now, you have to go through all the pages that have been copied before this read txn down to the next read txn, and remove all pages which have been copied twice, and go up to the next read or write txn, to do the same work).
On 2014-02-03 22:14, Howard Chu wrote:
Was chatting with Emmanuel Lecharny (who is currently working on Mavibot for ApacheDS, an MVCC backend similar to LMDB) and had an interesting realization: we can avoid the current issue of long-lived reader txns preventing page reclamation.
(...) since we're already going to add a txnID to every page's page header, we can simply add a 2nd txnID, recording the txnID of the previous change to this page's ancestor. Then, any page where this prevTxnID is >= the outstanding reader's txnID can be reclaimed.
Nice. But: Seems to me the freelist needs to know when the page was written, and when it was freed. A reader older than the txn which wrote the current contents of a page, is irrelevant to whether the page can be overwritten. How do ancestors matter? (Do you mean a branch page? The age of the previous page contents?)
Still thinking about the actual implementation of this, it may make more sense to store the prevTxnID in the freeDB than in each page header. Ideally we want to be able to grab a chunk of pageIDs unambiguously, instead of having to iterate thru each page and read its header to determine if it's safe.
Yes, re-reading lots of pages just to find they can't be used does not sound fun. And we can't look for page headers in a broken-up overflow page. Unless we either quit breaking up freed ovpages, or write a page header to the unused chunk when breaking it up.
Can we grab Mavibot's freeDB structure?
Le 2/4/14 3:29 AM, Hallvard Breien Furuseth a écrit :
On 2014-02-03 22:14, Howard Chu wrote:
Was chatting with Emmanuel Lecharny (who is currently working on Mavibot for ApacheDS, an MVCC backend similar to LMDB) and had an interesting realization: we can avoid the current issue of long-lived reader txns preventing page reclamation.
(...) since we're already going to add a txnID to every page's page header, we can simply add a 2nd txnID, recording the txnID of the previous change to this page's ancestor. Then, any page where this prevTxnID is >= the outstanding reader's txnID can be reclaimed.
Nice. But: Seems to me the freelist needs to know when the page was written, and when it was freed. A reader older than the txn which wrote the current contents of a page, is irrelevant to whether the page can be overwritten. How do ancestors matter? (Do you mean a branch page? The age of the previous page contents?)
Still thinking about the actual implementation of this, it may make more sense to store the prevTxnID in the freeDB than in each page header. Ideally we want to be able to grab a chunk of pageIDs unambiguously, instead of having to iterate thru each page and read its header to determine if it's safe.
Yes, re-reading lots of pages just to find they can't be used does not sound fun. And we can't look for page headers in a broken-up overflow page. Unless we either quit breaking up freed ovpages, or write a page header to the unused chunk when breaking it up.
Can we grab Mavibot's freeDB structure?
The free page structure is irrelevant. What is important is the copied page structure, and we have the exact same data structure here : a B-tree of <txn ID, list of copiedPage>.
The key is to leverage this B-tree to free the pages that we know won't be used when we release a txn.
Hallvard Breien Furuseth wrote:
On 2014-02-03 22:14, Howard Chu wrote:
Was chatting with Emmanuel Lecharny (who is currently working on Mavibot for ApacheDS, an MVCC backend similar to LMDB) and had an interesting realization: we can avoid the current issue of long-lived reader txns preventing page reclamation.
(...) since we're already going to add a txnID to every page's page header, we can simply add a 2nd txnID, recording the txnID of the previous change to this page's ancestor. Then, any page where this prevTxnID is >= the outstanding reader's txnID can be reclaimed.
Nice. But: Seems to me the freelist needs to know when the page was written, and when it was freed. A reader older than the txn which wrote the current contents of a page, is irrelevant to whether the page can be overwritten. How do ancestors matter? (Do you mean a branch page? The age of the previous page contents?)
I meant the age of the previous page contents.
Still thinking about the actual implementation of this, it may make more sense to store the prevTxnID in the freeDB than in each page header. Ideally we want to be able to grab a chunk of pageIDs unambiguously, instead of having to iterate thru each page and read its header to determine if it's safe.
Yes, re-reading lots of pages just to find they can't be used does not sound fun. And we can't look for page headers in a broken-up overflow page. Unless we either quit breaking up freed ovpages, or write a page header to the unused chunk when breaking it up.
Can we grab Mavibot's freeDB structure?
Currently we're just using an array of pageIDs, we can turn that into an array of <pageID,prevTxnID> pairs instead.
Le 2/4/14 8:13 AM, Howard Chu a écrit :
Hallvard Breien Furuseth wrote:
On 2014-02-03 22:14, Howard Chu wrote:
Was chatting with Emmanuel Lecharny (who is currently working on Mavibot for ApacheDS, an MVCC backend similar to LMDB) and had an interesting realization: we can avoid the current issue of long-lived reader txns preventing page reclamation.
(...) since we're already going to add a txnID to every page's page header, we can simply add a 2nd txnID, recording the txnID of the previous change to this page's ancestor. Then, any page where this prevTxnID is >= the outstanding reader's txnID can be reclaimed.
Nice. But: Seems to me the freelist needs to know when the page was written, and when it was freed. A reader older than the txn which wrote the current contents of a page, is irrelevant to whether the page can be overwritten. How do ancestors matter? (Do you mean a branch page? The age of the previous page contents?)
I meant the age of the previous page contents.
Still thinking about the actual implementation of this, it may make more sense to store the prevTxnID in the freeDB than in each page header. Ideally we want to be able to grab a chunk of pageIDs unambiguously, instead of having to iterate thru each page and read its header to determine if it's safe.
Yes, re-reading lots of pages just to find they can't be used does not sound fun. And we can't look for page headers in a broken-up overflow page. Unless we either quit breaking up freed ovpages, or write a page header to the unused chunk when breaking it up.
Can we grab Mavibot's freeDB structure?
Currently we're just using an array of pageIDs, we can turn that into an array of <pageID,prevTxnID> pairs instead.
I don't know what's inside the Free List in LMDB, but I can only assume it's closed to what we have in Mavibot. Let me explain what we do (or will do !) in Mavibot, and what data structure we use for that.
We have 2 special b-trees to manage the other b-trees : - the btreeOfBtrees (BOB) which contain information about every version of btree still in use (ie, if a search is holding rev N, then BOB will hold btree N). - the copiedPagesBtree (CPB) store an nformation of every copied pages, associated with the revision which copied the page. In fact, we store a tuple <rev, list of copied pages>
We also maintain a list of free pages (and here, free pages mean pages that can be immediately reclaimed when we need some new page)
My guess is that the Free List is somehow a combinaison of the free pages list and the CPB b-tree we have in Mavibot.
At this point, the mechanism we foresee for pages release is the following : - every copied root page can be release immediately, if no search is using the revision (or, better said, when a search is completed, and no other search is using the revision, we can release the root page if it has been copied). This is very basic, as no other search will ever use this page. - for other copied pages, it's slightly more complex. When we release a search, and if the associated revision has not any more search associated, then we can do some cleanup. We will brows the CPB b-tree backward until we find a revision which is hold by a search request. Every elements found during the CPB browse can now be processed. If a page in this set of pages has been copied, we can safely remove the copied page, it's not in use anymore.
So if we have these elements in the CPB, when we release N and when the closest oldest search is N-3 :
rev(N) -> {p100, p120, p145} rev(N-1) -> {p99, p145} rev(N-2) -> {p45, p120, p145} rev(N-3) -> {p99, p100, p111, p145}
we can see that p145 has been copied twice, and p120 once. We can release both p145(N-1,N-2) pages, and p120(N-2) safely. OTOH, p100(N) is still in use by N-3, so we have to keep it (but it will be removed if any N+x revision is copying it)
What is needed at this point is to have a unique id for each page (and this id should remain when we copy a page), and a way to keep the tuple <rev, copied pages>. The algorithm is linear (O(n log(m) ) where n is the number of revisions between the released one and the closest oldest revision in use, and log(m) represent the number of copied pages for each revision : it's likely to be the height of the tree)
I don't know how it fits with LMDB implementation, I just wanted to give you an insight on what we are working on. Hope it helps.