Was reading thru Google's leveldb stuff and found their benchmark page
http://leveldb.googlecode.com/svn/trunk/doc/benchmark.html
I adapted their sqlite test driver for MDB, attached.
On my laptop I get:

violino:/home/software/leveldb> ./db_bench_mdb
MDB:        version MDB 0.9.0: ("September 1, 2011")
Date:       Mon Jul  2 07:17:09 2012
CPU:        4 * Intel(R) Core(TM)2 Extreme CPU Q9300 @ 2.53GHz
CPUCache:   6144 KB
Keys:       16 bytes each
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
------------------------------------------------
fillseq        :       9.740 micros/op;   11.4 MB/s
fillseqsync    :       8.182 micros/op;   13.5 MB/s (10000 ops)
fillseqbatch   :       0.502 micros/op;  220.5 MB/s
fillrandom     :      11.558 micros/op;    9.6 MB/s
fillrandint    :       9.593 micros/op;   10.3 MB/s
fillrandibatch :       6.288 micros/op;   15.8 MB/s
fillrandsync   :       8.399 micros/op;   13.2 MB/s (10000 ops)
fillrandbatch  :       7.206 micros/op;   15.4 MB/s
overwrite      :      14.253 micros/op;    7.8 MB/s
overwritebatch :       9.075 micros/op;   12.2 MB/s
readrandom     :       0.261 micros/op;
readseq        :       0.079 micros/op; 1392.5 MB/s
readreverse    :       0.085 micros/op; 1301.9 MB/s
fillrand100K   :     106.695 micros/op;  894.0 MB/s (1000 ops)
fillseq100K    :      93.626 micros/op; 1018.8 MB/s (1000 ops)
readseq100K    :       0.095 micros/op; 1005185.9 MB/s
readrand100K   :       0.368 micros/op;
Compared to leveldb:

violino:/home/software/leveldb> ./db_bench
LevelDB:    version 1.5
Date:       Mon Jul  2 07:18:35 2012
CPU:        4 * Intel(R) Core(TM)2 Extreme CPU Q9300 @ 2.53GHz
CPUCache:   6144 KB
Keys:       16 bytes each
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
WARNING: Snappy compression is not enabled
------------------------------------------------
fillseq      :       1.752 micros/op;   63.1 MB/s
fillsync     :      13.877 micros/op;    8.0 MB/s (1000 ops)
fillrandom   :       2.836 micros/op;   39.0 MB/s
overwrite    :       3.723 micros/op;   29.7 MB/s
readrandom   :       5.390 micros/op; (1000000 of 1000000 found)
readrandom   :       4.811 micros/op; (1000000 of 1000000 found)
readseq      :       0.228 micros/op;  485.1 MB/s
readreverse  :       0.520 micros/op;  212.9 MB/s
compact      :  439250.000 micros/op;
readrandom   :       3.269 micros/op; (1000000 of 1000000 found)
readseq      :       0.197 micros/op;  560.4 MB/s
readreverse  :       0.438 micros/op;  252.5 MB/s
fill100K     :     504.147 micros/op;  189.2 MB/s (1000 ops)
crc32c       :       4.134 micros/op;  944.9 MB/s (4K per op)
snappycomp   :    6863.000 micros/op; (snappy failure)
snappyuncomp :    8145.000 micros/op; (snappy failure)
acquireload  :       0.439 micros/op; (each op is 1000 loads)
Interestingly enough, MDB wins on one or two write tests. It clearly wins on all of the read tests. MDB databases don't require compaction, so that's another win. MDB doesn't do compression, so those tests are disabled.
I haven't duplicated all of the test scenarios described on the web page yet; you can do that yourself with the attached code. It's pretty clear that nothing else even begins to approach MDB's read speed.
MDB sequential write speed is dominated by the memcpys required for copy-on-write page updates. There's not much that can be done to eliminate that, besides batching writes. For random writes, the memcmps in the key comparisons become more of an issue. The fillrandi* tests use an integer key instead of a string-based key, to show the difference due to key-comparison overhead.
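For anyone curious how an integer-keyed comparison can be wired up in MDB, a sketch like the one below works; this is just an illustration, not the exact code in the attached driver, and it's written against the current liblmdb header names, which differ slightly from the MDB 0.9.0 snapshot used here:

#include <string.h>
#include "lmdb.h"	/* named "mdb.h" in older trees */

/* Compare keys as native size_t values instead of byte strings.
 * Assumes every key in this DB was stored as a size_t in host byte order. */
static int
int_key_cmp(const MDB_val *a, const MDB_val *b)
{
	size_t ia, ib;
	memcpy(&ia, a->mv_data, sizeof(ia));
	memcpy(&ib, b->mv_data, sizeof(ib));
	return (ia < ib) ? -1 : (ia > ib);
}

/* After mdb_dbi_open() inside a write transaction:
 *	mdb_set_compare(txn, dbi, int_key_cmp);
 * The default comparator is a memcmp-style byte comparison. */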
For synchronous writes, MDB is also faster, because it doesn't need to synchronously write a transaction logfile.
The results for large data values are even more dramatic:
For leveldb:

violino:/home/software/leveldb> ./db_bench --value_size=100000 --num=1000
LevelDB:    version 1.5
Date:       Mon Jul  2 08:08:51 2012
CPU:        4 * Intel(R) Core(TM)2 Extreme CPU Q9300 @ 2.53GHz
CPUCache:   6144 KB
Keys:       16 bytes each
Values:     100000 bytes each (50000 bytes after compression)
Entries:    1000
RawSize:    95.4 MB (estimated)
FileSize:   47.7 MB (estimated)
WARNING: Snappy compression is not enabled
------------------------------------------------
fillseq      :     293.817 micros/op;  324.6 MB/s
fillsync     :   10305.000 micros/op;    9.3 MB/s (1 ops)
fillrandom   :     467.954 micros/op;  203.8 MB/s
overwrite    :     873.647 micros/op;  109.2 MB/s
readrandom   :      59.306 micros/op; (1000 of 1000 found)
readrandom   :      38.869 micros/op; (1000 of 1000 found)
readseq      :       3.762 micros/op; 25353.9 MB/s
readreverse  :      67.664 micros/op; 1409.7 MB/s
compact      :  327394.000 micros/op;
readrandom   :      35.603 micros/op; (1000 of 1000 found)
readseq      :       1.518 micros/op; 62847.5 MB/s
readreverse  :      19.971 micros/op; 4776.0 MB/s
fill100K     :    6584.000 micros/op;   14.5 MB/s (1 ops)
crc32c       :       3.929 micros/op;  994.2 MB/s (4K per op)
snappycomp   :   10660.000 micros/op; (snappy failure)
snappyuncomp :    8547.000 micros/op; (snappy failure)
acquireload  :       0.386 micros/op; (each op is 1000 loads)
For MDB:

violino:/home/software/leveldb> ./db_bench_mdb --value_size=100000 --num=1000
MDB:        version MDB 0.9.0: ("September 1, 2011")
Date:       Mon Jul  2 08:09:17 2012
CPU:        4 * Intel(R) Core(TM)2 Extreme CPU Q9300 @ 2.53GHz
CPUCache:   6144 KB
Keys:       16 bytes each
Values:     100000 bytes each (50000 bytes after compression)
Entries:    1000
RawSize:    95.4 MB (estimated)
FileSize:   47.7 MB (estimated)
------------------------------------------------
fillseq        :      89.330 micros/op; 1067.8 MB/s
fillseqsync    :     124.788 micros/op;  764.4 MB/s (10 ops)
fillseqbatch   :     152.159 micros/op;  626.9 MB/s
fillrandom     :     103.817 micros/op;  918.8 MB/s
fillrandint    :     105.732 micros/op;  902.0 MB/s
fillrandibatch :     115.781 micros/op;  823.7 MB/s
fillrandsync   :     130.296 micros/op;  732.0 MB/s (10 ops)
fillrandbatch  :     113.984 micros/op;  836.8 MB/s
overwrite      :     105.091 micros/op;  907.6 MB/s
overwritebatch :     101.044 micros/op;  944.0 MB/s
readrandom     :       0.303 micros/op;
readseq        :       0.142 micros/op; 671485.5 MB/s
readreverse    :       0.084 micros/op; 1131133.3 MB/s
fillrand100K   :     136.852 micros/op;  697.0 MB/s (1 ops)
fillseq100K    :     158.787 micros/op;  600.7 MB/s (1 ops)
readseq100K    :       9.060 micros/op; 10528.0 MB/s
readrand100K   :       5.007 micros/op;
MDB's zero-memcpy reads mean that read ops run at essentially constant speed, independent of data volume. MDB's overflow-page handling makes large writes extremely cheap, and that advantage totally offsets the overhead of the copy-on-write tree management.
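To make the "zero-memcpy" point concrete, here is a minimal read sketch (again just an illustration against the current liblmdb API; error handling is omitted and the path and key are made up). mdb_get() fills in an MDB_val whose mv_data pointer refers directly into the memory-mapped file, so nothing is copied no matter how large the value is:

#include <stdio.h>
#include "lmdb.h"	/* named "mdb.h" in older trees */

int main(void)
{
	MDB_env *env;
	MDB_txn *txn;
	MDB_dbi dbi;
	MDB_val key, data;
	char kbuf[] = "0000000000000001";	/* placeholder 16-byte key */

	mdb_env_create(&env);
	mdb_env_open(env, "./testdb", MDB_RDONLY, 0664);	/* path is hypothetical */
	mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);
	mdb_dbi_open(txn, NULL, 0, &dbi);

	key.mv_size = sizeof(kbuf) - 1;
	key.mv_data = kbuf;
	if (mdb_get(txn, dbi, &key, &data) == 0) {
		/* data.mv_data points straight into the mapped file;
		 * no copy is made, whatever data.mv_size happens to be. */
		printf("found %zu bytes\n", data.mv_size);
	}
	mdb_txn_abort(txn);
	mdb_env_close(env);
	return 0;
}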
I've expanded the tests, added BerkeleyDB 5.3.21 to the mix, and summarized the results here: http://highlandsun.com/hyc/mdb/microbench/
Another update - http://highlandsun.com/hyc/mdb/microbench/MDB-fs.ods is an OpenOffice spreadsheet tabulating the results from running the benchmarks across many different filesystems. You can compare btrfs, ext2, ext3, ext4, jfs, ntfs, reiserfs, xfs, and zfs to see which is best for the database workloads being tested. In addition, ext3, ext4, jfs, reiserfs, and xfs are tested in a 2nd configuration, with the journal stored on a tmpfs device, to show how much overhead the filesystem's journaling mechanism imposes.
The hard drive used is the same as in the main benchmark document, attached via eSATA to my laptop. The filesystems were created fresh for each test. The tests are only run once each due to the great length of time needed to collect all of the data. (It takes several minutes just to run mkfs for some of these filesystems...) You will probably want to toggle through the tests in cell B13 of the spreadsheet to get the best view of the results.
With this drive, jfs with an external journal is the clear winner when you need fully synchronous transactions. If you can tolerate some degree of asynchronous operation, plain old ext2 is still the fastest for writes.
MDB read speed is largely independent of FS type. I believe any variation in the reported speeds here is just measurement noise.
If you're dedicating an entire filesystem to an MDB database, it may make sense to just use ext2 (or turn off metadata journaling in ext3/4). In that case, you would want to preallocate all of the disk space for the DB. Once all of the space has been allocated and the FS has been cleanly sync'd, there would be no further structural meta-data updates to worry about. I.e., in a subsequent unclean shutdown, fsck would have no work to do.
From that point on, the only meta-data updates would be updating the inode mtime on write operations.
Note that just setting the file size (using ftruncate()) is inadequate, since that would only create a sparse file. Using fallocate() would also only partly serve the purpose (assuming it's even implemented on ext2), because fallocate() marks the allocated space as unwritten. (So the first time a page is actually written, the filesystem still has to perform a meta-data update to note that the space is now in use.)
The only suitable approach here is actually writing data to fill out the size of the file. (Which is also rather unfortunate, particularly if you're using an SSD...)
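For what it's worth, the kind of preallocation I mean is just a plain write loop; a rough sketch (the buffer size and function name are arbitrary, and a real tool would want better error reporting):

#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Preallocate a DB file by actually writing zeros, so all of the
 * filesystem's block-allocation meta-data is settled up front.
 * ftruncate() would only give you a sparse file, and fallocate()
 * leaves the new extents marked unwritten, so neither one avoids
 * the later meta-data updates. */
int prefill(const char *path, off_t size)
{
	static char buf[65536];		/* zero-filled by static initialization */
	off_t done = 0;
	int fd = open(path, O_WRONLY | O_CREAT, 0644);
	if (fd < 0)
		return -1;
	while (done < size) {
		size_t chunk = sizeof(buf);
		if (size - done < (off_t)chunk)
			chunk = (size_t)(size - done);
		if (write(fd, buf, chunk) != (ssize_t)chunk) {
			close(fd);
			return -1;
		}
		done += chunk;
	}
	fsync(fd);	/* make the allocation meta-data durable */
	close(fd);
	return 0;
}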