On 08/08/2016 05:41 AM, Hallvard Breien Furuseth wrote:
A transaction must not reuse data pages visible in the last snapshot known to be durable, since that's how far back LMDB may need to revert after abnormal termination, such as a crash when MDB_NOMETASYNC is in use.
Sync the data pages from a txn, write the metapage, eventually sync that metapage, wait out any older read-only transactions, and *then* you can reuse the pages the txn freed. Not before. So when you don't sync, or a read-only txn won't die, LMDB degenerates to append-only.
...except if you sync the metapage and exit, next LMDB run may not know you synced it and must assume the metapage isn't yet durable. So it might not reuse pages visible to the _previous_ durable metapage, until it syncs. I'm rather losing track at this point, but I think it may mean twice as many not-yet-usable pages as one might expect.
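To check my reading of the rule quoted above, here is a rough Python model of it. This is only a sketch, not LMDB code: the function name and parameters are made up for illustration, and it assumes that pages freed by transaction N become reusable once the newest known-durable snapshot and the oldest live reader are both at N or later.

```python
def reclaimable(freed_by, last_durable, oldest_reader=None):
    """Return the txn ids whose freed pages may be reused (simplified model).

    freed_by      -- txn ids that freed pages (hypothetical bookkeeping)
    last_durable  -- newest snapshot whose metapage is known to be synced
    oldest_reader -- snapshot of the oldest live read-only txn, or None
    """
    # Both constraints must hold, so the tighter one sets the limit.
    if oldest_reader is None:
        limit = last_durable
    else:
        limit = min(last_durable, oldest_reader)
    return [t for t in sorted(freed_by) if t <= limit]
```

In this model, a writer that never syncs keeps last_durable at its starting value, so nothing is ever returned and the file only grows, which is the append-only behaviour described above. A reader that never ends pins oldest_reader and has the same effect.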
Concretely: say the current write transaction is number 10, and a long-lived reader is on number 7. Currently, MDB will be unable to reuse any pages used in transactions 7+ until the reader ends.
Now say a 3rd, durable root is added. For the sake of argument, no checksums are used and in the event of a crash, only the last durable state is recovered. Say the durable transaction is number 2. Pages used in transaction 2 need to be preserved, obviously. 7+ still need to be preserved for the slow reader. But pages from transactions 3-6 can be reused.
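The same example as a small Python sketch, in case it helps. It is a toy model of the proposal, not anything that exists in the code base: the function and its parameters are hypothetical, and it follows the loose framing above where whole transactions are either preserved or reusable.

```python
def reusable_txns(current_writer, durable_root, oldest_reader=None):
    """Txn ids between the durable root and the writer whose pages can be reused.

    Preserved: the durable root itself, plus everything from the oldest
    reader up to the current write transaction.
    """
    # With no live reader, only the writer's own transaction is pinned.
    if oldest_reader is None:
        oldest_reader = current_writer
    preserved = {durable_root} | set(range(oldest_reader, current_writer + 1))
    return [t for t in range(durable_root, current_writer + 1)
            if t not in preserved]


# Writer on 10, slow reader on 7, durable root at 2.
print(reusable_txns(current_writer=10, durable_root=2, oldest_reader=7))
```

For the numbers above this gives transactions 3, 4, 5 and 6, which matches the reasoning in the text.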
Note that the last durable transaction is controlled purely by the single writer, so tracking it is actually easier than tracking which readers are where.
If a crash happens before a durable root is fully synced, then there should be a second, older durable root that hasn't been reused yet. In that case MDB recovers the way it does currently.
Does this make sense? Thanks for bearing with me.