Howard Chu wrote:
Another update - http://highlandsun.com/hyc/mdb/microbench/MDB-fs.ods is an OpenOffice spreadsheet tabulating the results from running the benchmarks across many different filesystems. You can compare btrfs, ext2, ext3, ext4, jfs, ntfs, reiserfs, xfs, and zfs to see which is best for the database workloads being tested. In addition, ext3, ext4, jfs, reiserfs, and xfs are tested in a second configuration, with the journal stored on a tmpfs device, to show how much overhead the filesystem's journaling mechanism imposes.
The hard drive used is the same as in the main benchmark document, attached via eSATA to my laptop. The filesystems were created fresh for each test. The tests are only run once each due to the great length of time needed to collect all of the data. (It takes several minutes just to run mkfs for some of these filesystems...) You will probably want to toggle through the tests in cell B13 of the spreadsheet to get the best view of the results.
With this drive, jfs with an external journal is the clear winner when you need fully synchronous transactions. If you can tolerate some degree of asynchronous operation, plain old ext2 is still the fastest for writes.
If you're dedicating an entire filesystem to an MDB database, it may make sense to just use ext2 (or turn off metadata journaling in ext3/4). In that case, you would want to preallocate all of the disk space for the DB. Once all of the space has been allocated and the FS has been cleanly synced, there would be no further structural metadata updates to worry about. I.e., in a subsequent unclean shutdown, fsck would have no work to do.
From that point on, the only metadata updates would be updating the inode mtime on write operations.
Note that just setting the file size (using ftruncate()) is inadequate, since that would just create a sparse file. Using fallocate() would also only partly serve the purpose (assuming it's even implemented on ext2), because fallocate() marks the allocated space as unused. (So the first time a page is referenced, the filesystem still needs to perform a metadata update to note that the page is now in use.)
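The sparse-file effect is easy to observe. The following is a minimal Python sketch, not part of the benchmark: it assumes a POSIX system where st_blocks is reported in 512-byte units (as on Linux), and the helper names make_sparse and allocated_bytes are hypothetical, chosen here for illustration.

```python
import os

def allocated_bytes(path):
    """Bytes of disk space actually allocated to `path`.

    Assumes st_blocks is counted in 512-byte units, as on Linux.
    """
    return os.stat(path).st_blocks * 512

def make_sparse(path, size):
    """Create a new file and set its size with ftruncate() only.

    No data is written, so most filesystems allocate few or no blocks.
    O_EXCL makes this fail rather than touch an existing file.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    try:
        os.ftruncate(fd, size)
    finally:
        os.close(fd)
```

On a filesystem that supports sparse files, os.stat(path).st_size will report the full size after make_sparse(), while allocated_bytes(path) will typically be far smaller. The exact figure depends on the filesystem.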
The only suitable approach here is actually writing data to fill out the size of the file. (Which is also rather unfortunate, particularly if you're using an SSD...)
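As a rough illustration of filling the file with real writes, here is a minimal Python sketch. The function name preallocate_by_writing, the 1 MiB default chunk size, and the use of zero bytes as filler are all assumptions made for this example, not anything MDB itself provides. It is meant for a brand-new file only, before any database has been created in it.

```python
import os

def preallocate_by_writing(path, size, chunk_size=1 << 20):
    """Create `path` and fill it with `size` zero bytes using real writes.

    Unlike ftruncate(), this forces the filesystem to allocate and
    initialize every block. O_EXCL makes the call fail instead of
    overwriting an existing file (e.g. a live database).
    """
    zeros = bytes(chunk_size)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    try:
        remaining = size
        while remaining > 0:
            # os.write may write fewer bytes than requested, so use
            # its return value to track progress.
            written = os.write(fd, zeros[:min(chunk_size, remaining)])
            remaining -= written
        # Flush data and metadata to disk so the allocation is durable.
        os.fsync(fd)
    finally:
        os.close(fd)
```

For example, preallocate_by_writing("/mnt/db/data.mdb", 10 * 1024**3) would write out a 10 GiB file, where the path is purely illustrative. The cost is one full pass of writes over the file, which is the SSD wear concern mentioned above.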
MDB read speed is largely independent of FS type. I believe any variation in the reported speeds here is just measurement noise.