A suggestion was made to use a read/write mmap (as an option), to allow writes to be performed with no syscall overhead. I'm thinking that might be ok as a completely separate version of the library, because a fair bit of the code would need to change to accommodate that update style, and it would push the library over the 32K boundary.
Also, this isn't as cool a suggestion as it sounds - it completely gives up MDB's current immunity to corruption, and in fact makes reliability much less predictable. When you write through an mmap, you have absolutely no idea when the OS is going to get around to flushing the data back to disk, or in what order the flushes will occur. You can force the OS's hand by calling msync() on every page you want to flush, in the order you want them flushed, but then you're back to having syscall overhead again, and by calling msync() in a particular order you defeat the underlying filesystem's ability to schedule the writes for optimal seeks.
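To make that second option concrete, here's a rough sketch (not actual MDB code - flush_pages_in_order, map, pgno_list, and pgsize are made-up names) of what explicitly ordered flushing through a writable mmap boils down to:

    #include <sys/mman.h>
    #include <stddef.h>

    /* Sketch only: flush a list of dirty pages in a fixed order through a
     * writable mmap. One msync() syscall per page, and the write order is
     * dictated by us instead of being left to the OS/filesystem to schedule. */
    static int
    flush_pages_in_order(void *map, const size_t *pgno_list, int npages,
        size_t pgsize)
    {
        int i;

        for (i = 0; i < npages; i++) {
            char *addr = (char *)map + pgno_list[i] * pgsize;
            /* MS_SYNC blocks until this one page is on disk */
            if (msync(addr, pgsize, MS_SYNC) != 0)
                return -1;
        }
        return 0;
    }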
Currently, by using writev, we can push a lot of data to the OS, and then when we call fdatasync() at the end, the OS schedules those writes as it sees fit. Right now the only ordering dependency MDB has is that all of the data pages must be flushed successfully before flushing the meta page, so we can afford to let the OS schedule all of the data page writes, and then do an explicitly synchronous write of the meta page.
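For comparison, the current commit path is roughly the following - a simplified illustration of the ordering only, not the actual mdb.c code; commit_pages, dfd, etc. are made-up names, and seeking to the correct file offsets is omitted:

    #include <sys/uio.h>
    #include <unistd.h>

    /* Sketch only: push all dirty data pages to the OS at once, let it
     * schedule the physical writes, then make the meta page write the
     * single ordered commit point. */
    static int
    commit_pages(int dfd, const struct iovec *iov, int iovcnt,
        const void *meta, size_t metalen, off_t metaofs)
    {
        /* 1. hand all the data pages to the OS in one syscall */
        if (writev(dfd, iov, iovcnt) < 0)
            return -1;
        /* 2. every data page must be durable before the meta page */
        if (fdatasync(dfd) != 0)
            return -1;
        /* 3. the meta page write is the commit point */
        if (pwrite(dfd, meta, metalen, metaofs) != (ssize_t)metalen)
            return -1;
        if (fdatasync(dfd) != 0)
            return -1;
        return 0;
    }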
So, with a writable mmap, we're stuck with the choice of either (a) not knowing at all whether our data has been flushed, or (b) being forced to explicitly flush every page ourselves, in a predetermined order that we have no way of knowing is optimal for the current disk layout.
It seems to me this can only be a viable mode of operation if you're always going to run asynch and don't care much about transaction durability or DB recoverability. Running in this mode offers absolutely zero crash resistance; the entire DB will almost always be irreparably damaged after a system crash.
Would you run like that, if it offered you the potential of maybe 10x faster write performance? (It could be useful for slapadd -q, certainly.)
Howard Chu wrote:
It seems to me this can only be a viable mode of operation if you're always going to run asynch and don't care much about transaction durability or DB recoverability. Running in this mode offers absolutely zero crash resistance; the entire DB will almost always be irreparably damaged after a system crash.
Would you run like that, if it offered you the potential of maybe 10x faster write performance? (It could be useful for slapadd -q, certainly.)
OK, a 10x speedup was far too optimistic. After quickly cobbling the changes together, it looks more like about a 70% speedup.
slapadd -q with 5 million entries took 24m16s as originally reported at LDAPCon last year.
With current mdb.master it takes 22m24s:
real    22m23.984s
user    26m1.658s
sys     8m17.415s
With the writable mmap and no msyncs it took 13m17s.
real    13m17.225s
user    22m15.511s
sys     1m12.533s
This code is currently available on the map2 branch of my git repo on ada.openldap.org. I'll clean it up a bit further then push it to mdb.master after some more testing.
Howard Chu wrote:
Howard Chu wrote:
It seems to me this can only be a viable mode of operation if you're always going to run asynch and don't care much about transaction durability or DB recoverability. Running in this mode offers absolutely zero crash resistance; the entire DB will almost always be irreparably damaged after a system crash.
Would you run like that, if it offered you the potential of maybe 10x faster write performance? (It could be useful for slapadd -q, certainly.)
OK, a 10x speedup was far too optimistic. After quickly cobbling the changes together, it looks more like about a 70% speedup.
slapadd -q with 5 million entries took 24m16s as originally reported at LDAPCon last year.
With current mdb.master it takes 22m24s:
real    22m23.984s
user    26m1.658s
sys     8m17.415s
With the writable mmap and no msyncs it took 13m17s.
real    13m17.225s
user    22m15.511s
sys     1m12.533s
This code is currently available on the map2 branch of my git repo on ada.openldap.org. I'll clean it up a bit further then push it to mdb.master after some more testing.
The speedup seems to be proportional to the number of indices that are defined on the database. With Quanah's torture-test LDIF (~6 million entries, 4.9GB), there's only a small difference between the two when no indices are defined:
Original mdb.master:
real    11m27.385s
user    15m5.489s
sys     6m47.825s
Writemap:
real    10m27.447s
user    14m35.663s
sys     6m10.767s
But with 31 indices defined...
Original:
real    94m35.862s
user    93m31.755s
sys     20m39.693s
Writemap:
real    42m53.499s
user    54m35.509s
sys     7m23.364s
Over a 2:1 speedup.
On 4/9/2012 5:14 PM, Howard Chu wrote:
It seems to me this can only be a viable mode of operation if you're always going to run asynch and don't care much about transaction durability or DB recoverability. Running in this mode offers absolutely zero crash resistance; the entire DB will almost always be irreparably damaged after a system crash.
Would you run like that, if it offered you the potential of maybe 10x faster write performance?
Absolutely not! DB reliability is crucial to us. Transaction durability too, esp. since we are using syncrepl in refreshAndPersist mode and numerous production systems (consumers) are affected.
Thanks for asking! Regards, Nick
Nikolaos Milas wrote:
On 4/9/2012 5:14 PM, Howard Chu wrote:
It seems to me this can only be a viable mode of operation if you're always going to run asynch and don't care much about transaction durability or DB recoverability. Running in this mode offers absolutely zero crash resistance; the entire DB will almost always be irreparably damaged after a system crash.
Would you run like that, if it offered you the potential of maybe 10x faster write performance?
Absolutely not! DB reliability is crucial to us. Transaction durability too, esp. since we are using syncrepl in refreshAndPersist mode and numerous production systems (consumers) are affected.
Thanks for asking!
Thanks for your response; that's my view on it as well. Still, there's room for experimentation and there are other folks using the MDB library in their own applications, who may be OK with different reliability levels. It looks like we can actually support all of these optional behaviors without too much fuss.
Howard Chu wrote:
Nikolaos Milas wrote:
On 4/9/2012 5:14 PM, Howard Chu wrote:
It seems to me this can only be a viable mode of operation if you're always going to run asynch and don't care much about transaction durability or DB recoverability. Running in this mode offers absolutely zero crash resistance; the entire DB will almost always be irreparably damaged after a system crash.
Would you run like that, if it offered you the potential of maybe 10x faster write performance?
Absolutely not! DB reliability is crucial to us. Transaction durability too, esp. since we are using syncrepl in refreshAndPersist mode and numerous production systems (consumers) are affected.
Thanks for asking!
Thanks for your response; that's my view on it as well. Still, there's room for experimentation and there are other folks using the MDB library in their own applications, who may be OK with different reliability levels. It looks like we can actually support all of these optional behaviors without too much fuss.
I've posted new microbenchmark results with the writable mmap.
http://highlandsun.com/hyc/mdb/microbench/
In asynchronous mode MDB writes are pretty much fastest all across the board. In fully synchronous mode it's slower but still quite competitive.
Now that these features are in the library, some additional work remains to allow configuring them in back-mdb. (So much for back-mdb having no performance tuning options. I guess that was never going to last...)
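For applications using the library directly, selecting these modes would look something like the sketch below - the flag and header names (MDB_WRITEMAP, MDB_NOSYNC, lmdb.h) and the path are illustrative assumptions, not a final API commitment:

    #include "lmdb.h"

    /* Sketch: open an environment with the writable mmap and no sync at
     * commit. Fast, but per the discussion above a system crash can leave
     * the DB unrecoverable in this mode. */
    int
    open_env_async(MDB_env **envp)
    {
        int rc;

        if ((rc = mdb_env_create(envp)) != 0)
            return rc;
        mdb_env_set_mapsize(*envp, 1UL << 30);  /* 1 GiB map, arbitrary */
        rc = mdb_env_open(*envp, "./testdb",
            MDB_WRITEMAP | MDB_NOSYNC, 0664);
        if (rc != 0) {
            mdb_env_close(*envp);
            *envp = NULL;
        }
        return rc;
    }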
--On Friday, September 07, 2012 6:04 AM -0700 Howard Chu <hyc@symas.com> wrote:
In asynchronous mode MDB writes are pretty much fastest all across the board. In fully synchronous mode it's slower but still quite competitive.
I enabled the new bits in my local RE24 build and ran a test against one of the largest real-data databases available to me: a 4.6 GB LDIF file. The test was done on an Ubuntu 10 LTS VM with 16 GB of RAM, using slapadd -q.
MDB prior to writemap:
time /opt/zimbra/openldap/sbin/slapadd -q -b "" -l /tmp/ldap.bak -F /opt/zimbra/data/ldap/config
-#################### 100.00% eta none elapsed 02h21m04s spd 567.5 k/s
Closing DB...
real 141m5.743s
MDB after writemap:
zimbra@zre-ldap001:~$ time /opt/zimbra/openldap-2.4.33.2z/sbin/slapadd -F /opt/zimbra/data/ldap/config -q -b "" -l /tmp/frontier.ldif
.#################### 100.00% eta none elapsed 45m19s spd 1.7 M/s
Closing DB...
real 45m19.682s
HDB (Using a 12GB SHM key):
zimbra@zre-ldap001:~$ time ./libexec/zmslapadd -q -F /opt/zimbra/data/ldap/config -b "" -l /tmp/frontier.ldif
_#################### 100.00% eta none elapsed 01h40m27s spd 797.0 k/s
Closing DB...
real 101m22.404s
So in my scenario, MDB is now 3.1 times faster than it used to be, and it is now 2.25 times faster than BDB. Nice!
--Quanah
--
Quanah Gibson-Mount
Sr. Member of Technical Staff
Zimbra, Inc
A Division of VMware, Inc.
--------------------
Zimbra :: the leader in open source messaging and collaboration