openldap-commit2devel@OpenLDAP.org writes:
> commit aff2693fc0721df4ccb6ceb357f80501c413ed38
> Author: Howard Chu <hyc@symas.com>
> Date:   Mon Dec 10 12:16:50 2012 -0800
>
>     ITS#7455 simplify
>     Don't try to reclaim overflow pages while operating on the freelist
>     (for now). The circular dependencies are much like the single-page
>     case, but worse. Maybe look into this in the future, but it's not
>     absolutely necessary now.
Suggestions to reduce freelist changes during commit:
Let a freelist entry steal page numbers listed in the next entries. Then mdb_page_alloc can grab more old pages without deleting/updating their entries and producing new dirty pages. Next txn does the updates.
Preallocate the final MDB_oldpages with MDB_RESERVE in mdb_txn_commit(), leaving some room to spare. Then use page numbers from it and/or steal new ones as needed.
BTW, could MDB offer an MDB_RESERVE2 which says "give me data->mv_size bytes, plus as much more as will fit without growing the page"? And an MDB_RESERVE2_SHRINK which shrinks the reservation to its final size.
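Neither flag exists in MDB; a sketch of the size arithmetic the two hypothetical flags would imply (both function names invented):

```c
#include <assert.h>
#include <stddef.h>

/* MDB_RESERVE2: grant mv_size, plus whatever else fits in the page
 * without growing it. If the request already overflows the page, only
 * the requested size is granted. */
static size_t reserve2_grant(size_t mv_size, size_t room_in_page)
{
    return mv_size < room_in_page ? room_in_page : mv_size;
}

/* MDB_RESERVE2_SHRINK: cut the reservation back to the final size. */
static size_t reserve2_shrink(size_t granted, size_t final_size)
{
    return final_size < granted ? final_size : granted;
}
```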
Stolen pages -- one way would be to search for particular pages to steal, and list the stolen ones at the end of the freelist entry. Or: stealing only from the end of the previous entry/entries should be simpler, but doesn't let us choose some specific pages to steal in order to gain a big enough contiguous page range:

    typedef struct MDB_freelist_entry { /* freelist entry in the DB */
        short  mf_len;            /* saved length */
        short  mf_stolen_entries; /* #fully stolen entries */
        short  mf_nextlen;        /* 0 or remaining length of next entry */
        MDB_ID mf_pages[];        /* length mf_len */
    } MDB_freelist_entry;

Thus, if the free DB contains

    (txnid_t)123 => { .mf_stolen_entries = 1, .mf_nextlen = 7 }
    (txnid_t)124 => { ... }
    (txnid_t)125 => { .mf_len = 20 }

then mdb should henceforth skip entry#124 and entry#125.mf_pages[7..19].
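A self-contained walk over that layout, showing how a reader would honour mf_stolen_entries and mf_nextlen. The free DB is mocked as a plain array, mf_pages[] gets a fixed size instead of a flexible array member, and usable_pages() is an invented helper, not MDB code:

```c
#include <assert.h>
#include <stddef.h>

typedef size_t MDB_ID;

typedef struct MDB_freelist_entry {
    short  mf_len;            /* saved length */
    short  mf_stolen_entries; /* #fully stolen entries */
    short  mf_nextlen;        /* 0 or remaining length of next entry */
    MDB_ID mf_pages[20];      /* flexible array member in the proposal */
} MDB_freelist_entry;

/* Count the page numbers still usable, skipping stolen entries/pages. */
static size_t usable_pages(const MDB_freelist_entry *e, size_t n)
{
    size_t total = 0, i = 0;
    short cap = 0;            /* nonzero: a predecessor stole our tail */
    while (i < n) {
        total += cap ? cap : e[i].mf_len;
        cap = e[i].mf_nextlen;                /* caps the next survivor */
        i += 1 + e[i].mf_stolen_entries;      /* skip fully stolen entries */
    }
    return total;
}
```

With the example above -- entry 123 holding, say, 5 pages, entry 124 fully stolen, and entry 125 capped at 7 of its 20 pages -- the walk counts 5 + 7 pages and never touches entry 124.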
A simple variant of page ranges, to save space and simplify range handling:

    /* Page range: (pagecount << MDB_PGNO_BITS) | (pageno + pagecount) */
    typedef pgno_t mdb_pages_t;

Lone pages get pagecount=1. With MDB_PGCOUNT_BITS = (64-bit ? 19 : 12) and page size 4096, that limits MDB to a 128-petabyte DB and 2G entry size. Or a 4G database and 16M entry size on 32-bit machines. (I'd call limiting the entry size a bonus compared to today's mdb: the current freelist doesn't exactly handle 2 billion freed pages gracefully.)
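The encoding can be sketched with self-contained helpers; pgno_t is assumed 64-bit here, and the helper names are invented:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t pgno_t;          /* assumed 64-bit */
typedef pgno_t mdb_pages_t;

#define MDB_PGCOUNT_BITS 19       /* the 64-bit value from the text */
#define MDB_PGNO_BITS ((int)(sizeof(mdb_pages_t)*8) - MDB_PGCOUNT_BITS)
#define MDB_PGNO_MASK ((((pgno_t)1) << MDB_PGNO_BITS) - 1)

/* Page range: (pagecount << MDB_PGNO_BITS) | (pageno + pagecount) */
static mdb_pages_t mdb_pages(pgno_t pageno, pgno_t pagecount)
{
    return (pagecount << MDB_PGNO_BITS) | (pageno + pagecount);
}

static pgno_t mdb_pages_count(mdb_pages_t p) { return p >> MDB_PGNO_BITS; }

/* First page of the range: the low field stores pageno + pagecount. */
static pgno_t mdb_pages_first(mdb_pages_t p)
{
    return (p & MDB_PGNO_MASK) - mdb_pages_count(p);
}
```

With 19 count bits, 45 bits remain for pageno + pagecount: 2^45 pages of 4096 bytes is the 128-petabyte DB limit, and a range of at most 2^19 pages is the 2G entry limit quoted above.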
I wrote:
> A simple variant of page ranges, to save space and simplify range
> handling:
>
>     /* Page range: (pagecount << MDB_PGNO_BITS) | (pageno + pagecount) */
>     typedef pgno_t mdb_pages_t;
>
> Lone pages get pagecount=1. With MDB_PGCOUNT_BITS = (64-bit ? 19 : 12)

MDB_PGCOUNT_BITS = sizeof(mdb_pages_t)*CHAR_BIT - MDB_PGNO_BITS, i.e. mdb_pages_t's remaining bits.

> and page size 4096, that limits MDB to a 128-petabyte DB and 2G entry
> size. Or a 4G database and 16M entry size on 32-bit machines. (I'd call
> limiting the entry size a bonus compared to today's mdb: the current
> freelist doesn't exactly handle 2 billion freed pages gracefully.)

2/4096 billion.
Come to think of it, the meta page can hold some freelist info. mdb_txn_commit() can hopefully decide at some point that its current freelist is what gets written to the freeDB. Further changes go to an amendment in the meta page. "Stolen oldpages" info goes there, not in the freeDB.
Next write-transaction initializes oldpages and mt_free_pgs from the meta page. Those mt_free_pgs pages become reusable one transaction later than they do today.
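One invented shape for such a meta-page amendment: a small overflow area for pages freed after the freeDB write, plus the "stolen oldpages" bookkeeping. Nothing here exists in MDB; the names only mimic its mm_/mf_ conventions, and seed_free_pgs() stands in for the next write txn's initialization step:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t txnid_t;
typedef uint64_t pgno_t;

typedef struct MDB_freehint {
    txnid_t  mh_last_txn;   /* freeDB entry this amendment extends */
    uint16_t mh_stolen;     /* #pages stolen since the freeDB write */
    uint16_t mh_nfree;      /* #valid slots in mh_free[] */
    pgno_t   mh_free[14];   /* freed after the freeDB was written */
} MDB_freehint;

/* Next write txn seeds mt_free_pgs from the hint, not the freeDB. */
static size_t seed_free_pgs(const MDB_freehint *h, pgno_t *out)
{
    for (size_t i = 0; i < h->mh_nfree; i++)
        out[i] = h->mh_free[i];
    return h->mh_nfree;
}
```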
MDB_RDONLY txns will not copy MDB_meta.mm_dbs[FREE_DBI] and can no longer read the freeDB, unless they get an MDB_STAT flag for use by ./mdb_stat & co. That seems a good place to move the complexity, since it happens rarely. Maybe that'd create a readonly txn with the structure of a write txn, so the freelist can be used and mdb_stat0()ed.
Could use the opportunity to align mm_dbs[MAIN_DBI] with a cache line. Then the variable metadata a readonly txn copies will all come from one cache line.