 
Thanks, Howard.
The --readahead flag helped!
After setting --readahead=0, the average speed improved from 3MB/s to 8MB/s. And I don't see a lot of reads anymore.
On Fri, Mar 16, 2018 at 10:55 PM, Howard Chu hyc@symas.com wrote:
Chuntao HONG wrote:
I am testing LMDB performance with the benchmark given in http://www.lmdb.tech/bench/ondisk/. And I noticed that LMDB random writes are really slow when the data goes beyond memory.
I am using a machine with 4GB of DRAM and an Intel PCIe SSD. The key size is 10 bytes and the value size is 1KB. The benchmark code is given in http://www.lmdb.tech/bench/ondisk/, and the command line I used is "./db_bench_mdb --benchmarks=fillrandbatch --threads=1 --stats_interval=1024 --num=10000000 --value_size=1000 --use_existing_db=0".
For the first 1GB of data written, the average write rate is 140MB/s. The rate then drops significantly to 40MB/s for the first 2GB. At the end of the test, in which 10M values are written, the average rate is just 3MB/s, and the instant rate is 1MB/s. I know LMDB is not optimized for writes, but I didn't expect it to be this slow, given that I have a really high-end Intel SSD.
Any flash SSD will get bogged down by a continuous write workload, since it must do wear-leveling and compaction in the background and "the background" is getting too busy.
I also notice that the way LMDB accesses the SSD is really strange. At the beginning of the test, it writes to the SSD at around 400MB/s but performs no reads, which is expected. But as we write more and more data, LMDB starts to read from the SSD. As time goes on, the read throughput rises while the write throughput drops significantly. At the end of the test, LMDB is constantly reading at around 190MB/s, while occasionally issuing 100MB writes at intervals of around 10-20 seconds.
- Is it normal for LMDB to have such low write throughput (1MB/s at the end of the test) for data stored on an SSD?
- Why is LMDB reading more data than it is writing (about 20MB of data read per 1MB written) at the end of the test?
To my understanding, although we have more data than the DRAM can hold, the branch nodes of the B-tree should still be in DRAM. So for every write, the only pages that we need to fetch from the SSD are the leaf nodes. And when we write a leaf node, we might also need to write its parents. So there should be more writes than reads. But it turns out LMDB is reading much more than it is writing. I think this might be the reason why it is so slow at the end, but I really cannot understand why.
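The claim that the branch nodes fit in DRAM can be checked with rough arithmetic. The sketch below is only an estimate under stated assumptions: a 4 KiB page size, a hypothetical 16 bytes of per-entry overhead in branch pages, and no page headers or fill-factor effects. It does not reproduce LMDB's actual page layout, and the function name is illustrative.

```python
import math


def btree_page_estimate(num_entries, key_size, value_size,
                        page_size=4096, branch_entry_overhead=16):
    """Rough (leaf_pages, branch_pages) count for a B-tree.

    Assumptions (not taken from LMDB's real layout): page headers are
    ignored, every page is full, and each branch entry costs the key
    plus `branch_entry_overhead` bytes.
    """
    # Entries that fit in one leaf page, and children per branch page.
    leaf_fanout = max(1, page_size // (key_size + value_size))
    branch_fanout = max(2, page_size // (key_size + branch_entry_overhead))

    leaf_pages = math.ceil(num_entries / leaf_fanout)

    # Walk up the tree one level at a time until a single root remains.
    branch_pages = 0
    level = leaf_pages
    while level > 1:
        level = math.ceil(level / branch_fanout)
        branch_pages += level
    return leaf_pages, branch_pages


if __name__ == "__main__":
    leaves, branches = btree_page_estimate(10_000_000, 10, 1000)
    print("leaf pages:", leaves, "branch pages:", branches)
    print("branch MiB:", branches * 4096 / 2**20)
```

With the numbers from the test (10M entries, 10-byte keys, 1000-byte values) this gives about 2.5 million leaf pages (roughly 10 GB) and about 16 thousand branch pages (roughly 63 MiB), so under these assumptions the branch pages do fit comfortably in 4GB of DRAM while the leaves do not.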
Rerun the benchmark with --readahead=0. The kernel does 16-page readahead by default, and on a random access workload, 15 of those pages are wasted effort. They also cause useful pages to be evicted from RAM. This is where the majority of the excess reads come from.
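The cost of readahead on a random workload can be put in numbers. The sketch below uses the 16-page figure quoted above and assumes a 4 KiB page size, which the thread does not state. The second function shows one general way a program can tell the kernel that a memory map will be accessed randomly, using `madvise` with `MADV_RANDOM` (Python 3.8+ on Linux). Whether the benchmark's --readahead=0 option does exactly this internally is an assumption, and the function names are illustrative only.

```python
import mmap
import os

PAGE_SIZE = 4096       # assumption: 4 KiB pages
READAHEAD_PAGES = 16   # figure quoted in the reply above


def readahead_cost(pages_per_fault=READAHEAD_PAGES, page_size=PAGE_SIZE):
    """Return (bytes read per fault, fraction wasted).

    Models a purely random workload in which only one page of each
    readahead window is actually needed.
    """
    bytes_read = pages_per_fault * page_size
    wasted = (pages_per_fault - 1) / pages_per_fault
    return bytes_read, wasted


def map_without_readahead(path):
    """Map a non-empty file read-only and hint random access to the kernel."""
    fd = os.open(path, os.O_RDONLY)
    try:
        m = mmap.mmap(fd, 0, prot=mmap.PROT_READ)
    finally:
        os.close(fd)  # the mmap object keeps its own duplicate descriptor
    # MADV_RANDOM is only defined on platforms that support it.
    if hasattr(m, "madvise") and hasattr(mmap, "MADV_RANDOM"):
        m.madvise(mmap.MADV_RANDOM)
    return m
```

Under these assumptions each random page fault pulls in 64 KiB, of which 60 KiB (about 94%) is never used, which points in the same direction as the observation that reads far outweigh writes at the end of the run.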
--
Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/