Howard Chu wrote:
Another update - http://highlandsun.com/hyc/mdb/microbench/MDB-fs.ods
OpenOffice spreadsheet tabulating the results from running the benchmarks
across many different filesystems. You can compare btrfs, ext2, ext3, ext4,
jfs, ntfs, reiserfs, xfs, and zfs to see which is best for the database
workloads being tested. In addition, ext3, ext4, jfs, reiserfs, and xfs are
tested in a 2nd configuration, with the journal stored on a tmpfs device, to
show how much overhead the filesystem's journaling mechanism imposes.
The hard drive used is the same as in the main benchmark document, attached
via eSATA to my laptop. The filesystems were created fresh for each test. The
tests are only run once each due to the great length of time needed to collect
all of the data. (It takes several minutes just to run mkfs for some of these
filesystems...) You will probably want to toggle through the tests in cell B13
of the spreadsheet to get the best view of the results.
With this drive, jfs with an external journal is the clear winner when you
need fully synchronous transactions. If you can tolerate some degree of
asynchronous operation, plain old ext2 is still the fastest for writes.
If you're dedicating an entire filesystem to an MDB database, it may make
sense to just use ext2 (or turn off metadata journaling in ext3/4). In that
case, you would want to preallocate all of the disk space for the DB. Once all
of the space has been allocated and the FS has been cleanly sync'd, there
would be no further structural metadata updates to worry about. I.e., in a
subsequent unclean shutdown, fsck would have no work to do.
From that point on, the only metadata updates would be updating the
mtime on write operations.
Note that just setting the file size (using ftruncate()) is inadequate since
that would just create a sparse file. Also, using fallocate() would only partly
serve the purpose (assuming it's even implemented on ext2) because fallocate()
marks the allocated space as unused. (So the first time a page is referenced,
the FS still needs to perform a metadata update to note that the page is now
in use.)
The only suitable approach here is actually writing data to fill out the size
of the file. (Which is also rather unfortunate, particularly if you're using
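For illustration, here is a rough sketch of the difference between just setting the size and actually writing the data. It is not taken from MDB or the benchmark scripts; the function names, the 1 MiB chunk size, and the file names are all made up for the example, and it assumes a POSIX system. The block counts it prints depend on the filesystem, so treat them as indicative only.

```python
import os
import tempfile

CHUNK = 1 << 20  # hypothetical choice: write 1 MiB of zeros at a time


def sparse_by_ftruncate(path, size):
    """Only set the file size; on most filesystems this leaves a sparse file."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.ftruncate(fd, size)
    finally:
        os.close(fd)
    return os.stat(path)


def preallocate_by_writing(path, size):
    """Fill the file with real zero bytes so every block gets allocated."""
    zeros = bytes(CHUNK)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        remaining = size
        while remaining > 0:
            # os.write may write fewer bytes than requested, so loop on the count
            written = os.write(fd, zeros[:min(CHUNK, remaining)])
            remaining -= written
        os.fsync(fd)  # push data and allocation metadata to disk before use
    finally:
        os.close(fd)
    return os.stat(path)


if __name__ == "__main__":
    size = 16 * CHUNK
    with tempfile.TemporaryDirectory() as d:
        sparse = sparse_by_ftruncate(os.path.join(d, "sparse.db"), size)
        full = preallocate_by_writing(os.path.join(d, "full.db"), size)
        # st_blocks is counted in 512-byte units
        print("ftruncate:", sparse.st_size, "bytes,", sparse.st_blocks, "blocks")
        print("written:  ", full.st_size, "bytes,", full.st_blocks, "blocks")
```

Both files report the same st_size; the interesting part is st_blocks, which on a typical filesystem stays near zero for the ftruncate() case. The sketch does not fsync the containing directory, which a careful setup would probably also want to do.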
MDB read speed is largely independent of FS type. I believe any variation in
the reported speeds here is just measurement noise.
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/