We have encountered an unexpected performance difference simply by moving an LMDB environment to a different Linux machine with similar hardware characteristics.
We have a 93GB LMDB environment with 3 databases. The database in question is 13GB. The test executable loops over key/value pairs in the 13GB database with a read-only cursor. For the same executable, we observe two different behaviors on different machines (the same LMDB environment was copied to both machines with scp). The first machine has 148GB RAM and the second has 105GB RAM; both have the same CPU.
The expected and desired behavior, on Linux kernel 3.13 / eglibc 2.19, is that the process takes 13GB of shared memory (seen in top and confirmed with /proc/<pid>/smaps below). In the alternate behavior, on kernel 4.15 / glibc 2.27, the process reads 83GB from disk into shared memory (and the corresponding initial run takes 16 minutes instead of the 8 minutes it takes on the well-behaved machine).
/proc/<pid>/smaps
Machine with expected behavior:
7f3595da8000-7f4cde517000 r--s 00000000 fb:10 6442473002 /fusionio1/lmdb/db.0/dbgraph/data.mdb
Size: 97656252 kB
Rss: 13203648 kB
Pss: 13203648 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 13203648 kB
Private_Dirty: 0 kB
Referenced: 13203648 kB
Anonymous: 0 kB
AnonHugePages: 0 kB
Swap: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Locked: 0 kB
VmFlags: rd sh mr mw me ms sd
Machine with excessive Rss and slower read time:
7f55990aa000-7f6ce1819000 r--s 00000000 fc:02 7077908 /lmdbguest/lmdb/db.0/dbgraph/data.mdb
Size: 97656252 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Rss: 82587036 kB
Pss: 82587036 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 82587036 kB
Private_Dirty: 0 kB
Referenced: 82587036 kB
Anonymous: 0 kB
LazyFree: 0 kB
AnonHugePages: 0 kB
ShmemPmdMapped: 0 kB
Shared_Hugetlb: 0 kB
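For reference, a simplified sketch of the kind of read-only cursor loop our test executable performs (not the exact code; the database name "graph" is a placeholder and error handling is reduced to abort()). Compile with -llmdb.

#include <lmdb.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    MDB_env *env;
    MDB_txn *txn;
    MDB_dbi dbi;
    MDB_cursor *cur;
    MDB_val key, data;
    size_t count = 0, bytes = 0;
    int rc;

    if (mdb_env_create(&env)) abort();
    mdb_env_set_maxdbs(env, 3);                  /* the environment holds 3 named databases */
    /* Read-only open of the environment directory from the smaps output above. */
    if (mdb_env_open(env, "/fusionio1/lmdb/db.0/dbgraph", MDB_RDONLY, 0664)) abort();

    if (mdb_txn_begin(env, NULL, MDB_RDONLY, &txn)) abort();
    if (mdb_dbi_open(txn, "graph", 0, &dbi)) abort();   /* "graph": placeholder name for the 13GB DB */
    if (mdb_cursor_open(txn, dbi, &cur)) abort();

    /* Walk every key/value pair with a read-only cursor; this is the access
     * pattern whose RSS behavior differs between the two machines. */
    for (rc = mdb_cursor_get(cur, &key, &data, MDB_FIRST);
         rc == MDB_SUCCESS;
         rc = mdb_cursor_get(cur, &key, &data, MDB_NEXT)) {
        count++;
        bytes += key.mv_size + data.mv_size;
    }
    if (rc != MDB_NOTFOUND)
        fprintf(stderr, "cursor stopped with error %d\n", rc);

    printf("%zu records, %zu bytes\n", count, bytes);
    mdb_cursor_close(cur);
    mdb_txn_abort(txn);
    mdb_env_close(env);
    return 0;
}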
Alec Matusis wrote:
The expected and desired behavior, on Linux kernel 3.13 / eglibc 2.19, is that the process takes 13GB of shared memory (seen in top and confirmed with /proc/<pid>/smaps below). In the alternate behavior, on kernel 4.15 / glibc 2.27, the process reads 83GB from disk into shared memory (and the corresponding initial run takes 16 minutes instead of the 8 minutes it takes on the well-behaved machine).
The glibc version is irrelevant; only the kernel version matters, since no library calls of any kind are invoked in a read operation.
Try repeating the test with MDB_NORDAHEAD set on the environment.
Try repeating the test with MDB_NORDAHEAD set on the environment.
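For concreteness, the change is a single extra flag at mdb_env_open; a rough sketch (not our exact test code, and the maxdbs value is just the 3 databases mentioned above):

#include <lmdb.h>
#include <stdlib.h>

/* Open the environment read-only with OS readahead disabled. With
 * MDB_NORDAHEAD, LMDB asks the kernel (via madvise on Linux) to treat the
 * map as random-access, so the cursor walk only faults in the pages it
 * actually touches instead of large readahead windows that inflate RSS. */
static MDB_env *open_env_nordahead(const char *path)
{
    MDB_env *env;
    if (mdb_env_create(&env)) abort();
    mdb_env_set_maxdbs(env, 3);                  /* same 3 named databases as above */
    if (mdb_env_open(env, path, MDB_RDONLY | MDB_NORDAHEAD, 0664)) abort();
    return env;
}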
Thank you: with MDB_NORDAHEAD it works on both machines as expected. We have a couple of questions and observations.
We have:
machine 1: XFS filesystem, 148GB RAM, kernel 3.13; blockdev --getra /dev/fiob reports 256. Shared memory grows to 13GB with or without MDB_NORDAHEAD (as expected).
machine 2: EXT4 filesystem, 105GB RAM, kernel 4.15; blockdev --getra /dev/vda2 reports 256. Shared memory grows to 83GB without MDB_NORDAHEAD (unexpected) and to 13GB with MDB_NORDAHEAD (as expected).
Questions and observations:
1. Since blockdev --getra shows the same 256 for both machines, why was MDB_NORDAHEAD necessary only on machine 2?
2. After dropping the cache with sysctl -w vm.drop_caches=3 on the problematic machine 2, we read 83GB into shared memory without MDB_NORDAHEAD. We then re-run with MDB_NORDAHEAD, reading from the now-cached file, but shared memory still grows to 83GB. This contrasts with starting cold with MDB_NORDAHEAD, where shared memory stays at 13GB on all reads.
3. Both machines have the same raw disk read speeds (tested with hdparm and dd). After sysctl -w vm.drop_caches=3, the initial read time on machine 2 is twice that on machine 1.
4. Initially, even the affected machine 2 read 13GB into shared memory without MDB_NORDAHEAD. We upgraded the Ubuntu kernel from 4.15.0-91-generic to 4.15.0-101-generic and it started reading 83GB. We then downgraded to 4.15.0-91-generic, but it continued to read 83GB until we added MDB_NORDAHEAD.
Alec Matusis wrote:
We have:
machine 1: XFS filesystem, 148GB RAM, kernel 3.13; blockdev --getra /dev/fiob reports 256. Shared memory grows to 13GB with or without MDB_NORDAHEAD (as expected).
machine 2: EXT4 filesystem, 105GB RAM, kernel 4.15; blockdev --getra /dev/vda2 reports 256. Shared memory grows to 83GB without MDB_NORDAHEAD (unexpected) and to 13GB with MDB_NORDAHEAD (as expected).
Questions and observations:
- Since “blockdev --getra” shows the same 256 for both machines, why was MDB_NORDAHEAD necessary only on machine 2?
This is a stupid question. You claimed both machines have similar setups and yet they are running wildly different kernel versions and using completely different filesystems, and now you wonder why they behave differently??
None of this has anything to do with LMDB. Ask a filesystem or kernel developer.
Howard Chu wrote:
For anyone just tuning in - we demonstrated from day 1 the huge difference in performance between different filesystems.
http://www.lmdb.tech/bench/microbench/july/#sec11
Hi again Howard,
Sorry for the confusion with two different machines, but I have a question about just one machine.
I observe two things on a single machine:
1. My test binary with MDB_NORDAHEAD reads 13GB into shared memory, and 83GB without MDB_NORDAHEAD. The cold read shows about 10MB/s sustained read speed in iotop and takes 18 minutes. When I then run dd if=/fusionio1/lmdb/db.0/dbgraph/data.mdb of=/dev/null bs=8k, dd shows a read speed of 300MB/s, i.e. 30x faster than looping over the read-only cursor. Can anything (other than removing MDB_NORDAHEAD) be done to reduce this 30x read speed difference on the first cold read?
2. dd reads the entire environment file into system file buffers (93GB). Then when the entire environment is cached, I run the binary with MDB_NORDAHEAD, but now it reads 80GB into shared memory, like when MDB_NORDAHEAD is not set. Is this expected? Can it be prevented?
Alec Matusis wrote:
1. My test binary with MDB_NORDAHEAD reads 13GB into shared memory, and 83GB without MDB_NORDAHEAD. The cold read shows about 10MB/s sustained read speed in iotop and takes 18 minutes. When I then run dd if=/fusionio1/lmdb/db.0/dbgraph/data.mdb of=/dev/null bs=8k, dd shows a read speed of 300MB/s, i.e. 30x faster than looping over the read-only cursor. Can anything (other than removing MDB_NORDAHEAD) be done to reduce this 30x read speed difference on the first cold read?
Just run dd before starting your main program ...
dd is reading sequentially, so of course it can stream the data at higher speed. When you're reading through a cursor, the data pages are most likely scattered throughout the file. Physical random accesses will always be slower than sequential reads.
If you leave readahead enabled, you'll get a higher proportion of sequential read bursts, but it will still be a lot of random access.
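Illustratively, an in-process equivalent of the dd pre-warm is just a plain sequential read of data.mdb before opening the environment; a minimal sketch (the 1MB buffer size is an arbitrary choice):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Sequentially stream the whole data file through the page cache before the
 * LMDB environment is opened, the same effect as the dd command above.
 * Sequential reads run at full disk speed and leave the pages cached for the
 * later mmap'd cursor walk. */
static void prewarm(const char *path)
{
    static char buf[1 << 20];          /* 1MB read buffer */
    ssize_t n;
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return; }
    while ((n = read(fd, buf, sizeof buf)) > 0)
        ;                              /* discard the data; we only want it cached */
    close(fd);
}

/* Usage: prewarm("/fusionio1/lmdb/db.0/dbgraph/data.mdb"); then open the environment. */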
2. dd reads the entire environment file into system file buffers (93GB). Then when the entire environment is cached, I run the binary with MDB_NORDAHEAD, but now it reads 80GB into shared memory, like when MDB_NORDAHEAD is not set. Is this expected? Can it be prevented?
It's not reading anything, since the data is already cached in memory.
Is this expected? Yes - the data is already present, and LMDB always requests a single mmap for the entire size of the environment. Since the physical memory is already assigned, the mmap contains it all.
Can it be prevented - why does it matter? If any other process needs to use the RAM, it will get it automatically.