I am testing LMDB performance with the benchmark given at http://www.lmdb.tech/bench/ondisk/, and I noticed that LMDB random writes become really slow once the data size exceeds memory.
I am using a machine with 4GB of DRAM and an Intel PCIe SSD. The key size is 10 bytes and the value size is 1KB. The benchmark code is the one from http://www.lmdb.tech/bench/ondisk/, and the command line I used is "./db_bench_mdb --benchmarks=fillrandbatch --threads=1 --stats_interval=1024 --num=10000000 --value_size=1000 --use_existing_db=0".
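For context, my understanding is that fillrandbatch boils down to roughly the following loop against the plain LMDB C API. The map size, batch size, and key formatting here are my own guesses for illustration, not the actual benchmark code, and error checking is omitted:

/* Rough sketch of the fillrandbatch workload: random 10-byte keys,
 * 1000-byte values, grouped into write transactions. Map size, batch
 * size and key format are assumptions; error checking is omitted. */
#include <lmdb.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NUM_ENTRIES 10000000UL
#define VALUE_SIZE  1000
#define BATCH_SIZE  1000   /* assumed number of puts per write transaction */

int main(void)
{
    MDB_env *env;
    MDB_dbi dbi;
    MDB_txn *txn;
    char keybuf[11], valbuf[VALUE_SIZE];
    MDB_val key = { 10, keybuf };          /* 10-byte keys */
    MDB_val val = { VALUE_SIZE, valbuf };  /* 1000-byte values */
    unsigned long i, j;

    memset(valbuf, 'x', VALUE_SIZE);
    mdb_env_create(&env);
    mdb_env_set_mapsize(env, 32UL * 1024 * 1024 * 1024);  /* 32GB map, assumed */
    mdb_env_open(env, "./testdb", 0, 0664);               /* directory must exist */

    for (i = 0; i < NUM_ENTRIES; i += BATCH_SIZE) {
        mdb_txn_begin(env, NULL, 0, &txn);
        if (i == 0)
            mdb_dbi_open(txn, NULL, 0, &dbi);             /* default DB */
        for (j = 0; j < BATCH_SIZE; j++) {
            /* keys are drawn at random, so leaf pages are touched in
             * random order across the whole tree */
            snprintf(keybuf, sizeof(keybuf), "%010lu",
                     (unsigned long)rand() % NUM_ENTRIES);
            mdb_put(txn, dbi, &key, &val, 0);
        }
        mdb_txn_commit(txn);
    }
    mdb_env_close(env);
    return 0;
}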
For the first 1GB of data written, the average write rate is 140MB/s. The average rate then drops sharply to 40MB/s over the first 2GB. At the end of the test, in which 10M values are written, the average rate is just 3MB/s and the instantaneous rate is 1MB/s. I know LMDB is not optimized for writes, but I didn't expect it to be this slow, given that I have a really high-end Intel SSD.
I also noticed that the way LMDB accesses the SSD is strange. At the beginning of the test it writes to the SSD at around 400MB/s and performs no reads, which is expected. But as more and more data is written, LMDB starts to read from the SSD, and as time goes on the read throughput rises while the write throughput drops sharply. At the end of the test, LMDB is constantly reading at around 190MB/s while occasionally issuing roughly 100MB bursts of writes at 10-20 second intervals.
1. Is it normal for LMDB to have such low write throughput (1MB/s at the end of the test) for data stored on an SSD?
2. Why is LMDB reading more data than it writes (about 20MB read per 1MB written) at the end of the test?
To my understanding, even though there is more data than the DRAM can hold, the branch pages of the B-tree should still be cached in DRAM. So for each write, the only page that needs to be fetched from the SSD is the leaf page, and when we update that leaf we may also have to write its parents. There should therefore be more writes than reads, yet LMDB is reading far more than it writes. I suspect this is why it is so slow at the end, but I cannot understand why it happens.
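The rough arithmetic behind that assumption looks like this: with 4KB pages, the 10M leaf entries need on the order of 10GB of leaf pages, which clearly cannot all stay in 4GB of DRAM, while all of the branch pages together come to well under 100MB and should stay cached. (The page size and per-entry overheads below are my assumptions, so this is only an order-of-magnitude estimate.)

#include <stdio.h>

/* Back-of-envelope estimate of the B-tree footprint for this workload.
 * Page size and per-entry overheads are assumed, so the output is only
 * an order-of-magnitude figure. */
int main(void)
{
    const double entries      = 10e6;
    const double page_size    = 4096.0;         /* assumed page size */
    const double leaf_entry   = 10 + 1000 + 8;  /* key + value + node header, assumed */
    const double branch_entry = 10 + 8;         /* key + child pointer, assumed */

    double leaf_pages   = entries / (page_size / leaf_entry);      /* ~2.5M pages */
    double branch_pages = leaf_pages / (page_size / branch_entry); /* ~11K pages */

    printf("leaf pages:   ~%.1fM (~%.0f GB)\n",
           leaf_pages / 1e6, leaf_pages * page_size / 1e9);
    printf("branch pages: ~%.0fK (~%.0f MB)\n",
           branch_pages / 1e3, branch_pages * page_size / 1e6);
    return 0;
}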
For your reference, here is part of the log given by the benchmark:
--------------------------------------------------------
2018/03/12-10:36:30 ... thread 0: (1024,1024) ops and (54584.2,54584.2) ops/second in (0.018760,0.018760) seconds
2018/03/12-10:36:30 ... thread 0: (1024,2048) ops and (111231.8,73231.8) ops/second in (0.009206,0.027966) seconds
2018/03/12-10:36:30 ... thread 0: (1024,3072) ops and (125382.6,85019.2) ops/second in (0.008167,0.036133) seconds
2018/03/12-10:36:30 ... thread 0: (1024,4096) ops and (206202.2,99661.8) ops/second in (0.004966,0.041099) seconds
2018/03/12-10:36:30 ... thread 0: (1024,5120) ops and (259634.9,113669.2) ops/second in (0.003944,0.045043) seconds
2018/03/12-10:36:30 ... thread 0: (1024,6144) ops and (306495.1,126984.1) ops/second in (0.003341,0.048384) seconds
2018/03/12-10:36:30 ... thread 0: (1024,7168) ops and (339185.2,139447.1) ops/second in (0.003019,0.051403) seconds
2018/03/12-10:36:30 ... thread 0: (1024,8192) ops and (384240.2,151512.9) ops/second in (0.002665,0.054068) seconds
2018/03/12-10:36:30 ... thread 0: (1024,9216) ops and (385252.1,162465.2) ops/second in (0.002658,0.056726) seconds
2018/03/12-10:36:30 ... thread 0: (1024,10240) ops and (371553.0,172152.9) ops/second in (0.002756,0.059482) seconds
...
2018/03/12-10:36:37 ... thread 0: (1024,993280) ops and (70127.4,142518.0) ops/second in (0.014602,6.969505) seconds
2018/03/12-10:36:37 ... thread 0: (1024,994304) ops and (199415.8,142559.9) ops/second in (0.005135,6.974640) seconds
2018/03/12-10:36:37 ... thread 0: (1024,995328) ops and (75953.1,142431.4) ops/second in (0.013482,6.988122) seconds
2018/03/12-10:36:37 ... thread 0: (1024,996352) ops and (200823.7,142474.0) ops/second in (0.005099,6.993221) seconds
2018/03/12-10:36:37 ... thread 0: (1024,997376) ops and (71975.8,142330.8) ops/second in (0.014227,7.007448) seconds
2018/03/12-10:36:37 ... thread 0: (1024,998400) ops and (62117.1,142142.6) ops/second in (0.016485,7.023933) seconds
2018/03/12-10:36:37 ... thread 0: (1024,999424) ops and (36366.2,141720.2) ops/second in (0.028158,7.052091) seconds
2018/03/12-10:36:37 ... thread 0: (1024,1000448) ops and (61914.3,141533.5) ops/second in (0.016539,7.068630) seconds
2018/03/12-10:36:37 ... thread 0: (1024,1001472) ops and (60985.1,141342.6) ops/second in (0.016791,7.085421) seconds
2018/03/12-10:36:37 ... thread 0: (1024,1002496) ops and (60466.5,141149.8) ops/second in (0.016935,7.102356) seconds
2018/03/12-10:36:37 ... thread 0: (1024,1003520) ops and (60189.3,140956.3) ops/second in (0.017013,7.119369) seconds
2018/03/12-10:36:37 ... thread 0: (1024,1004544) ops and (61731.4,140772.1) ops/second in (0.016588,7.135957) seconds
...
2018/03/12-10:40:15 ... thread 0: (1024,3236864) ops and (5620.5,14373.0) ops/second in (0.182189,225.203790) seconds
2018/03/12-10:40:15 ... thread 0: (1024,3237888) ops and (6098.5,14366.9) ops/second in (0.167911,225.371701) seconds
2018/03/12-10:40:15 ... thread 0: (1024,3238912) ops and (5469.5,14359.5) ops/second in (0.187221,225.558922) seconds
2018/03/12-10:40:15 ... thread 0: (1024,3239936) ops and (5593.9,14352.4) ops/second in (0.183056,225.741978) seconds
2018/03/12-10:40:16 ... thread 0: (1024,3240960) ops and (5806.9,14345.7) ops/second in (0.176342,225.918320) seconds
2018/03/12-10:40:16 ... thread 0: (1024,3241984) ops and (5332.9,14338.1) ops/second in (0.192016,226.110336) seconds
2018/03/12-10:40:16 ... thread 0: (1024,3243008) ops and (5532.3,14330.9) ops/second in (0.185096,226.295432) seconds
2018/03/12-10:40:16 ... thread 0: (1024,3244032) ops and (6108.8,14324.8) ops/second in (0.167626,226.463058) seconds
2018/03/12-10:40:16 ... thread 0: (1024,3245056) ops and (6074.7,14318.6) ops/second in (0.168567,226.631625) seconds
2018/03/12-10:40:17 ... thread 0: (1024,3246080) ops and (5615.2,14311.6) ops/second in (0.182362,226.813987) seconds
2018/03/12-10:40:17 ... thread 0: (1024,3247104) ops and (5529.3,14304.5) ops/second in (0.185194,226.999181) seconds
2018/03/12-10:40:17 ... thread 0: (1024,3248128) ops and (5846.2,14298.0) ops/second in (0.175156,227.174337) seconds
2018/03/12-10:40:17 ... thread 0: (1024,3249152) ops and (5741.5,14291.2) ops/second in (0.178351,227.352688) seconds
2018/03/12-10:40:17 ... thread 0: (1024,3250176) ops and (5640.2,14284.3) ops/second in (0.181555,227.534243) seconds
...
2018/03/12-11:30:39 ... thread 0: (1024,9988096) ops and (1917.2,3074.3) ops/second in (0.534112,3248.860552) seconds
2018/03/12-11:30:39 ... thread 0: (1024,9989120) ops and (1858.9,3074.1) ops/second in (0.550851,3249.411403) seconds
2018/03/12-11:30:40 ... thread 0: (1024,9990144) ops and (1922.8,3073.9) ops/second in (0.532557,3249.943960) seconds
2018/03/12-11:30:40 ... thread 0: (1024,9991168) ops and (1857.2,3073.7) ops/second in (0.551382,3250.495342) seconds
2018/03/12-11:30:41 ... thread 0: (1024,9992192) ops and (1851.3,3073.5) ops/second in (0.553130,3251.048472) seconds
2018/03/12-11:30:41 ... thread 0: (1024,9993216) ops and (1941.0,3073.3) ops/second in (0.527568,3251.576040) seconds
2018/03/12-11:30:42 ... thread 0: (1024,9994240) ops and (1923.1,3073.2) ops/second in (0.532461,3252.108501) seconds
2018/03/12-11:30:42 ... thread 0: (1024,9995264) ops and (1987.6,3073.0) ops/second in (0.515200,3252.623701) seconds
2018/03/12-11:30:43 ... thread 0: (1024,9996288) ops and (1931.2,3072.8) ops/second in (0.530233,3253.153934) seconds
2018/03/12-11:30:43 ... thread 0: (1024,9997312) ops and (1918.9,3072.6) ops/second in (0.533633,3253.687567) seconds
2018/03/12-11:30:44 ... thread 0: (1024,9998336) ops and (1999.0,3072.4) ops/second in (0.512246,3254.199813) seconds
2018/03/12-11:30:44 ... thread 0: (1024,9999360) ops and (1853.3,3072.2) ops/second in (0.552533,3254.752346) seconds
fillrandbatch : 325.508 micros/op 3072 ops/sec; 3.0 MB/s
--------------------------------------------------------
And here is the read/write rate dumped from iostat:
--------------------------------------------------------
Device tps MB_read/s MB_wrtn/s MB_read MB_wrtn
sdb 73.00 0.12 25.52 0 25
sdb 531.00 0.00 495.21 0 495
sdb 15089.00 0.00 488.77 0 488
sdb 27431.00 0.01 463.55 0 463
sdb 13093.00 0.00 478.77 0 478
sdb 53676.00 0.00 413.79 0 413
sdb 16781.00 0.00 483.60 0 483
sdb 22267.00 0.00 323.32 0 323
sdb 23945.00 0.00 164.55 0 164
sdb 22867.00 0.00 152.25 0 152
sdb 22038.00 0.00 146.39 0 146
sdb 23825.00 0.00 263.61 0 263
...
sdb 20866.00 85.81 76.90 85 76
sdb 7684.00 101.75 115.19 101 115
sdb 3707.00 154.48 0.00 154 0
sdb 4349.00 181.41 0.00 181 0
sdb 4373.00 184.70 0.00 184 0
sdb 4329.00 185.04 0.00 185 0
sdb 4338.00 182.30 0.01 182 0
sdb 4364.00 184.27 0.00 184 0
sdb 5310.00 177.32 4.99 177 4
sdb 32130.00 99.07 119.70 99 119
sdb 27010.00 103.26 99.25 103 99
sdb 11109.00 67.18 99.96 67 99
sdb 3931.00 172.51 0.00 172 0
sdb 4112.00 171.28 0.00 171 0
sdb 4202.00 183.03 0.00 183 0
sdb 4119.00 183.79 0.00 183 0
sdb 4232.00 182.77 0.02 182 0
sdb 4224.00 185.90 0.00 185 0
sdb 4304.00 186.17 0.00 186 0
sdb 4279.00 188.83 0.00 188 0
sdb 4087.00 184.38 0.00 184 0
sdb 7758.00 163.86 16.70 163 16
sdb 21309.00 68.95 80.11 68 80
sdb 21166.00 81.66 78.42 81 78
sdb 19328.00 71.56 71.55 71 71
sdb 20836.00 89.08 76.52 89 76
sdb 3211.00 112.01 82.21 112 82
sdb 3939.00 173.40 0.00 173 0
sdb 3992.00 178.03 0.00 178 0
sdb 4251.00 181.49 0.00 181 0
sdb 4148.00 185.63 0.00 185 0
sdb 4094.00 184.12 0.01 184 0
sdb 4241.00 187.38 0.00 187 0
sdb 4044.00 186.60 0.00 186 0
sdb 4049.00 185.47 0.00 185 0
sdb 4247.00 189.17 0.00 189 0
...
sdb 17457.00 105.45 64.05 105 64
sdb 16736.00 82.12 62.35 82 62
sdb 12074.00 108.76 66.21 108 66
sdb 2232.00 194.44 0.00 194 0
sdb 2171.00 187.27 0.02 187 0
sdb 2322.00 197.91 0.00 197 0
sdb 2311.00 194.65 0.00 194 0
sdb 2240.00 187.93 0.00 187 0
sdb 2189.00 191.38 0.00 191 0
sdb 2266.00 192.33 0.01 192 0
sdb 2312.00 198.95 0.00 198 0
sdb 2310.00 199.84 0.00 199 0
sdb 2350.00 198.83 0.00 198 0
sdb 2275.00 198.31 0.00 198 0
sdb 3952.00 185.05 6.79 185 6
sdb 15842.00 59.89 59.67 59 59
sdb 16676.00 88.24 61.79 88 61
sdb 14768.00 75.94 55.00 75 54
sdb 5677.00 141.71 35.03 141 35
sdb 2135.00 184.78 0.04 184 0
sdb 2301.00 197.18 0.00 197 0
sdb 2334.00 198.81 0.00 198 0
sdb 2304.00 198.83 0.00 198 0
sdb 2348.00 198.67 0.00 198 0
sdb 2352.00 198.42 0.01 198 0
sdb 2373.00 199.32 0.00 199 0
sdb 2363.00 197.55 0.00 197 0
sdb 2289.00 198.71 0.00 198 0
sdb 2246.00 189.31 0.00 189 0
sdb 2357.00 198.64 0.01 198 0
sdb 2338.00 197.96 0.00 197 0
sdb 6292.00 177.60 16.56 177 16
sdb 19374.00 93.72 72.16 93 72
sdb 16873.00 101.38 62.01 101 62
sdb 16960.00 98.99 76.84 98 76
sdb 2299.00 189.32 6.16 189 6
sdb 2285.00 195.82 0.00 195 0
sdb 2346.00 198.25 0.00 198 0
sdb 2325.00 198.91 0.00 198 0
sdb 2353.00 197.72 0.02 197 0
sdb 2320.00 198.82 0.00 198 0
sdb 2327.00 200.05 0.00 200 0
sdb 2340.00 198.35 0.00 198 0
sdb 2322.00 199.29 0.00 199 0
sdb 2316.00 197.43 0.01 197 0
sdb 690.00 51.17 0.00 51 0
--------------------------------------------------------
Chuntao HONG wrote:
> I am testing LMDB performance with the benchmark given at http://www.lmdb.tech/bench/ondisk/, and I noticed that LMDB random writes become really slow once the data size exceeds memory.
> I am using a machine with 4GB of DRAM and an Intel PCIe SSD. The key size is 10 bytes and the value size is 1KB. The benchmark code is the one from http://www.lmdb.tech/bench/ondisk/, and the command line I used is "./db_bench_mdb --benchmarks=fillrandbatch --threads=1 --stats_interval=1024 --num=10000000 --value_size=1000 --use_existing_db=0".
> For the first 1GB of data written, the average write rate is 140MB/s. The average rate then drops sharply to 40MB/s over the first 2GB. At the end of the test, in which 10M values are written, the average rate is just 3MB/s and the instantaneous rate is 1MB/s. I know LMDB is not optimized for writes, but I didn't expect it to be this slow, given that I have a really high-end Intel SSD.
Any flash SSD will get bogged down by a continuous write workload, since it must do wear-leveling and compaction in the background and "the background" is getting too busy.
> I also noticed that the way LMDB accesses the SSD is strange. At the beginning of the test it writes to the SSD at around 400MB/s and performs no reads, which is expected. But as more and more data is written, LMDB starts to read from the SSD, and as time goes on the read throughput rises while the write throughput drops sharply. At the end of the test, LMDB is constantly reading at around 190MB/s while occasionally issuing roughly 100MB bursts of writes at 10-20 second intervals.
> - Is it normal for LMDB to have such low write throughput (1MB/s at the end of the test) for data stored on an SSD?
> - Why is LMDB reading more data than it writes (about 20MB read per 1MB written) at the end of the test?
> To my understanding, even though there is more data than the DRAM can hold, the branch pages of the B-tree should still be cached in DRAM. So for each write, the only page that needs to be fetched from the SSD is the leaf page, and when we update that leaf we may also have to write its parents. There should therefore be more writes than reads, yet LMDB is reading far more than it writes. I suspect this is why it is so slow at the end, but I cannot understand why it happens.
Rerun the benchmark with --readahead=0. The kernel does 16-page readahead by default, and on a random access workload, 15 of those 16 pages are wasted effort. They also cause useful pages to be evicted from RAM. This is where the majority of the excess reads come from.
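If you are driving LMDB from your own code rather than through the benchmark, the same effect comes from passing the MDB_NORDAHEAD flag when the environment is opened. A minimal sketch, with the path and map size as placeholders:

#include <lmdb.h>

/* Open an LMDB environment with OS readahead disabled on the map. */
int open_env_no_readahead(MDB_env **envp)
{
    int rc;
    if ((rc = mdb_env_create(envp)) != 0)
        return rc;
    /* The map must be large enough for the whole DB; 32GB is a placeholder. */
    if ((rc = mdb_env_set_mapsize(*envp, 32UL * 1024 * 1024 * 1024)) != 0)
        return rc;
    /* MDB_NORDAHEAD asks the OS not to read ahead on the memory map,
     * which is what you want for random access to a DB larger than RAM. */
    return mdb_env_open(*envp, "./testdb", MDB_NORDAHEAD, 0664);
}

Note that the flag is not implemented on Windows.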
Thanks Howard.
The --readahead flag helped!
After setting --readahead=0, the average speed improved from 3MB/s to 8MB/s. And I don't see a lot of reads anymore.