Hallvard Breien Furuseth wrote:
Could writing a word/byte to the current meta page break something?
While I'm asking, why are metas separate pages, instead of simply a fixed 256 or so bytes apart to keep them in separate cachelines?
The only reason I can think of is if a write gets garbled, the other meta page is safe - but mdb assumes correct filesystem operation anyway.
Because the fundamental unit of storage is a page. Writing anything smaller than a page requires the OS to read the full page and then update a portion of it. Doing so from multiple processes would require file locking to prevent corruption. Writes to separate pages are guaranteed not to interfere with each other.
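As a rough illustration of the point about page-sized writes, here is a minimal sketch that gives each meta record its own page-aligned slot and always writes the whole page. The names DEMO_PAGE_SIZE, struct demo_meta, meta_offset() and write_meta() are hypothetical, and the two-slot layout keyed on the transaction ID is an assumption for the example, not the actual mdb on-disk format.

```c
#define _XOPEN_SOURCE 700   /* for pwrite() */
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed page size for the sketch; real code would query the OS. */
#define DEMO_PAGE_SIZE 4096

/* Hypothetical meta record; not the real mdb layout. */
struct demo_meta {
    uint64_t txnid;      /* transaction that wrote this meta */
    uint64_t root_pgno;  /* root page of the tree it refers to */
};

/* Each meta gets its own page: slot 0 or 1, chosen by transaction ID. */
static off_t meta_offset(uint64_t txnid)
{
    return (off_t)(txnid % 2) * DEMO_PAGE_SIZE;
}

/* Write one complete, page-aligned meta page.
 * Returns 0 on success, -1 on error or short write. */
static int write_meta(int fd, const struct demo_meta *m)
{
    unsigned char page[DEMO_PAGE_SIZE];
    ssize_t n;

    memset(page, 0, sizeof page);
    memcpy(page, m, sizeof *m);
    n = pwrite(fd, page, sizeof page, meta_offset(m->txnid));
    return n == (ssize_t)sizeof page ? 0 : -1;
}
```

Because the two slots never share a page, a writer updating one slot does not need to read or rewrite any bytes of the other.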
This is for a "syncdelay <count>" feature to replace "dbnosync". The latter can break DB consistency after a system crash: without fdatasync(), the OS can reorder writes, leaving meta pages that refer to trees with unwritten or overwritten data pages.
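The ordering problem described above can be sketched as follows, assuming a POSIX system that provides fdatasync(). commit_durable() is a hypothetical helper, not an mdb function: it writes the data pages, syncs, and only then writes the meta page, so the meta can never reach disk ahead of the pages it points to.

```c
#define _XOPEN_SOURCE 700   /* for pwrite() and fdatasync() */
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical durable commit: data first, sync, then meta, sync.
 * Returns 0 on success, -1 on any error or short write. */
static int commit_durable(int fd,
                          const void *data, size_t data_len, off_t data_off,
                          const void *meta, size_t meta_len, off_t meta_off)
{
    if (pwrite(fd, data, data_len, data_off) != (ssize_t)data_len)
        return -1;
    /* Barrier: data pages must be on disk before the meta refers to them. */
    if (fdatasync(fd) != 0)
        return -1;
    if (pwrite(fd, meta, meta_len, meta_off) != (ssize_t)meta_len)
        return -1;
    /* Make the commit itself durable. */
    return fdatasync(fd) == 0 ? 0 : -1;
}
```

Dropping the first fdatasync() is what opens the window discussed here: the OS is then free to flush the meta page before the data pages.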
This should not be a new keyword. Just implement the <size> feature of the checkpoint keyword.
syncdelay <count> will only sync every <count> or maybe <count>/2 commits. It'll need 4 meta pages, of which 2 may refer to unsynced data pages. mdb_env_sync() may then need to write a "synced" flag to the current meta page, or do a dummy write transaction which sets a "synced data pages" flag in its meta page. The latter would have to wait out any existing/pending write transactions.
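One possible reading of the 4-meta-page idea, as a sketch only: after a crash, recovery would pick the newest meta page whose "synced" flag is set and ignore newer ones that may refer to unsynced data pages. struct sync_meta, NUM_METAS and pick_recovery_meta() are hypothetical names for this illustration, and the flag layout is an assumption rather than anything specified in the proposal.

```c
#include <stdint.h>

#define NUM_METAS 4   /* number of meta pages in the proposal */

/* Hypothetical in-memory view of one meta page. */
struct sync_meta {
    uint64_t txnid;   /* transaction that wrote this meta */
    int      synced;  /* nonzero once its data pages are known to be on disk */
};

/* Return the index of the newest meta flagged as synced,
 * or -1 if no meta page carries the flag. */
static int pick_recovery_meta(const struct sync_meta metas[NUM_METAS])
{
    int best = -1;
    int i;

    for (i = 0; i < NUM_METAS; i++) {
        if (!metas[i].synced)
            continue;   /* may refer to unsynced data pages */
        if (best < 0 || metas[i].txnid > metas[best].txnid)
            best = i;
    }
    return best;
}
```

With metas for transactions 10 and 11 flagged as synced and 12 and 13 not, this selects the slot holding transaction 11.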