http://symas.com/mdb/inmem/scaling.html
i just looked at this and i have a sneaking suspicion that you may be running into the same problem that i encountered when accidentally opening 10 LMDBs 20 times by forking 30 processes *after* opening the 10 LMDBs... (and also forgetting to close them in the parent).
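(for illustration, a minimal sketch - assuming the py-lmdb bindings, with purely hypothetical paths and counts, not the actual code involved - of the pattern that avoids this trap: open each environment only in the process that actually uses it, after fork, and never inherit or keep open handles from the parent. LMDB's own documentation cautions against using an environment handle across fork().)

    # Sketch only (py-lmdb assumed; paths and counts are illustrative).
    import lmdb
    import os

    DB_PATHS = ["/tmp/env%d" % i for i in range(10)]   # hypothetical

    def child_work(path):
        # Open the environment *in the process that uses it*, after fork();
        # LMDB environments must not be used across a fork().
        env = lmdb.open(path, map_size=1 << 30)
        try:
            with env.begin(write=False) as txn:
                txn.get(b"some-key")                   # placeholder workload
        finally:
            env.close()                                # ... and close it when done

    for path in DB_PATHS:
        if os.fork() == 0:
            child_work(path)
            os._exit(0)

    # the parent holds no LMDB handles at all, so there is nothing to leak
    for _ in DB_PATHS:
        os.wait()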
what i found was that when i reduced the number of LMDBs to 3 or below, the loadavg on the multi-core system was absolutely fine [ok, it was around 3 to 4 but that's ok]
adding one more LMDB (remember that's 31 extra file handles opened to a shm-mmap'd file) increased the loadavg to 6 or 7. by the time that was up to 10, the loadavg had jumped, completely unreasonably, to over 30. i could log in over ssh - that was not a problem. opening an existing file for editing was ok, but trying to create new files left applications (such as vim) blocked so badly that i often could not even press ctrl-z to background the task, and had to kill the entire ssh session.
in each test run the amount of work being done was actually relatively small.
basically i suspect a severe bug in the linux kernel, one where these extreme circumstances (32 or 16 processes accessing the same mmap'd file, for example) have simply never been encountered before, so the bug remains... unknown.
can i make the suggestion that, whilst i am aware that it is generally not recommended for production environments to run more processes than there are cores, you try running 128, 256 and even 512 processes all hitting that 64-core system, and monitor its I/O usage (iostats) and loadavg whilst doing so?
the hypothesis to test is that the performance, which should degrade roughly linearly with the ratio of the number of processes to the number of cores, instead drops like a lead balloon.
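(a rough sketch of the kind of oversubscription test and loadavg monitoring being suggested here - process counts, run length and the placeholder workload are all illustrative; iostat/atop would be watched alongside it:)

    # Launch far more worker processes than cores and sample loadavg while
    # they run. The worker body is a stand-in for the real benchmark.
    import multiprocessing
    import os
    import time

    def worker():
        deadline = time.time() + 60              # arbitrary run length
        while time.time() < deadline:
            sum(range(10000))                    # placeholder work

    def run_test(n_procs):
        procs = [multiprocessing.Process(target=worker) for _ in range(n_procs)]
        for p in procs:
            p.start()
        while any(p.is_alive() for p in procs):
            one, five, fifteen = os.getloadavg()
            print("%d procs: loadavg %.2f %.2f %.2f" % (n_procs, one, five, fifteen))
            time.sleep(5)
        for p in procs:
            p.join()

    for n in (128, 256, 512):                    # the counts suggested above
        run_test(n)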
l.
Luke Kenneth Casson Leighton wrote:
http://symas.com/mdb/inmem/scaling.html
i just looked at this and i have a sneaking suspicion that you may be running into the same problem that i encountered when accidentally opening 10 LMDBs 20 times by forking 30 processes *after* opening the 10 LMDBs... (and also forgetting to close them in the parent).
what i found was that when i reduced the number of LMDBs to 3 or below, the loadavg on the multi-core system was absolutely fine [ok, it was around 3 to 4 but that's ok]
adding one more LMDB (remember that's 31 extra file handles opened to a shm-mmap'd file) increased the loadavg to 6 or 7. by the time that was up to 10, the loadavg had jumped, completely unreasonably, to over 30. i could log in over ssh - that was not a problem. opening an existing file for editing was ok, but trying to create new files left applications (such as vim) blocked so badly that i often could not even press ctrl-z to background the task, and had to kill the entire ssh session.
in each test run the amount of work being done was actually relatively small.
basically i suspect a severe bug in the linux kernel, one where these extreme circumstances (32 or 16 processes accessing the same mmap'd file, for example) have simply never been encountered before, so the bug remains... unknown.
I've occasionally seen the system lockup as well; I'm pretty sure it's due to dirty page writeback. There are plenty of other references on the web about Linux dealing poorly with heavy writeback I/O.
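(for context, the writeback behaviour being blamed here is governed by the vm.dirty_* sysctls; a small read-only sketch that just prints what the running kernel is configured to do - no particular values are being recommended:)

    # Print the Linux tunables that control when dirty (e.g. mmap'd) pages
    # are flushed. Read-only; useful when heavy-writeback stalls are suspected.
    KNOBS = [
        "dirty_ratio",               # % of memory dirty before writers are throttled
        "dirty_background_ratio",    # % of memory dirty before background flushing starts
        "dirty_expire_centisecs",    # how old dirty data may get before it must be flushed
        "dirty_writeback_centisecs", # how often the flusher threads wake up
    ]

    for knob in KNOBS:
        with open("/proc/sys/vm/" + knob) as f:
            print(knob, "=", f.read().strip())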
can i make the suggestion that, whilst i am aware that it is generally not recommended for production environments to run more processes than there are cores, you try running 128, 256 and even 512 processes all hitting that 64-core system, and monitor its I/O usage (iostats) and loadavg whilst doing so?
Sure, I can conduct that test and collect system stats using atop. Will let ya know. By the way, we're using threads here, not processes. But the overall loading behavior should be the same either way.
the hypothesis to test is that the performance, which should degrade roughly linearly with the ratio of the number of processes to the number of cores, instead drops like a lead balloon.
l.
On Mon, Oct 20, 2014 at 5:51 AM, Howard Chu hyc@symas.com wrote:
basically i suspect a severe bug in the linux kernel, one where these extreme circumstances (32 or 16 processes accessing the same mmap'd file, for example) have simply never been encountered before, so the bug remains... unknown.
I've occasionally seen the system lockup as well; I'm pretty sure it's due to dirty page writeback. There are plenty of other references on the web about Linux dealing poorly with heavy writeback I/O.
ugh. that doesn't bode well for the future reliability of the system i've just been working on. i modified parabench.py yesterday (sent you and david a copy) and managed to completely hang my lovely precious macbook running debian amd64.
also i managed to get processes to deadlock when asked to die. killing them off one by one with -9, they would usually all exit properly at some point, but until the deadlocked one was found the parent would simply sit there.
is anyone actively dealing with this [poor page writeback behaviour] in the linux kernel?
l.
Howard Chu wrote:
Luke Kenneth Casson Leighton wrote:
http://symas.com/mdb/inmem/scaling.html
can i make the suggestion that, whilst i am aware that it is generally not recommended for production environments to run more processes than there are cores, you try running 128, 256 and even 512 processes all hitting that 64-core system, and monitor its I/O usage (iostats) and loadavg whilst doing so?
Sure, I can conduct that test and collect system stats using atop. Will let ya know. By the way, we're using threads here, not processes. But the overall loading behavior should be the same either way.
the hypothesis to test is that the performance, which should degrade roughly linearly with the ratio of the number of processes to the number of cores, instead drops like a lead balloon.
Threads  Wall      User         Sys          CPU %  DB Size   Proc Size   Vol CS   Invol CS    Write      Read
      1  10:01.39  00:19:39.18  00:00:21.00    199  12647888   12650460       21       1513    45605    275331
      2  10:01.38  00:29:35.21  00:00:24.33    299  12647888   12650472       48       2661    42726    528514
      4  10:01.37  00:49:32.93  00:00:25.42    498  12647888   12650496       84       4106    40961   1068050
      8  10:01.36  01:29:32.68  00:00:23.25    897  12647888   12650756      157       7738    38812   2058741
     16  10:01.36  02:49:22.44  00:00:28.51   1694  12647888   12650852      345      16941    33357   3857045
     32  10:01.36  05:28:35.39  00:01:02.69   3288  12647888   12651308      923     258250    23922   6091558
     64  10:01.38  10:35:44.42  00:01:51.69   6361  12648060   12652132     1766     145585    16571   8724687
    128  10:01.38  10:36:43.09  00:01:45.52   6368  12649296   12654928     3276    2906109     8594   9846720
    256  10:01.48  10:36:53.05  00:01:36.83   6369  12649304   12658056     5365    3557137     4178  10453540
    512  10:02.11  10:36:09.58  00:03:00.83   6369  12649320   12664304     8303    3511456     1947  10728221
Looks to me like the system was reasonably well behaved.
This is reusing a DB that had already had multiple iterations of this benchmark run on it, so the size is larger than for a fresh DB, and it would have significant internal fragmentation - i.e., a lot of sequential data will be in non-adjacent pages.
The only really obvious impact is that the number of involuntary context switches jumps up at 128 threads, which is what you'd expect since there are fewer cores than threads. The writer gets progressively starved, and read rates increase slightly.
On Mon, Oct 20, 2014 at 1:00 PM, Howard Chu hyc@symas.com wrote:
Howard Chu wrote:
Luke Kenneth Casson Leighton wrote:
http://symas.com/mdb/inmem/scaling.html
can i make the suggestion that, whilst i am aware that it is generally not recommended for production environments to run more processes than there are cores, you try running 128, 256 and even 512 processes all hitting that 64-core system, and monitor its I/O usage (iostats) and loadavg whilst doing so?
the hypothesis to test is that the performance, which should degrade roughly linearly with the ratio of the number of processes to the number of cores, instead drops like a lead balloon.
Looks to me like the system was reasonably well behaved.
and it looks like the writer's rate is approximately halving with each doubling of the thread count from 64 onwards.
ok, so that didn't show anything up... but wait... there's only one writer, right? the scenario where i am seeing difficulties is when there are multiple writers and readers (actually, multiple writers and readers to multiple envs simultaneously).
so to duplicate that scenario, it would either be necessary to modify the benchmark to use multiple writer threads (knowing that they are going to have contention, but that's ok) or, to be closer to the scenario where i have observed difficulties, to run the test several times *simultaneously* on the same database.
*thinks*.... actually in order to ensure that the reads and writes are approximately balanced, it would likely be necessary to modify the benchmark code to allow multiple writer threads and distribute the workload amongst them whilst at the same time keeping the number of reader threads the same as it was previously.
then it would be possible to make a direct comparison against the figures you just sent, e.g. for the 32-thread case: 32 readers, 2 writers; 32 readers, 4 writers; 32 readers, 8 writers; and so on. keeping the total number of threads (writers plus readers) at or below the number of cores avoids any unnecessary context-switching.
the hypothesis being tested is that the writers' overall performance remains the same, as only one of them may perform writes at a time.
i know it sounds silly to do that: it seems obvious that it should make no difference, given that however many writers there are, all but one of them will always be doing absolutely nothing, and the context switching when one finishes should be negligible. but i know there's something wrong and i'd like to help find out what it is.
l.
Luke Kenneth Casson Leighton wrote:
On Mon, Oct 20, 2014 at 1:00 PM, Howard Chu hyc@symas.com wrote:
Howard Chu wrote:
Luke Kenneth Casson Leighton wrote:
http://symas.com/mdb/inmem/scaling.html
can i make the suggestion that, whilst i am aware that it is generally not recommended for production environments to run more processes than there are cores, you try running 128, 256 and even 512 processes all hitting that 64-core system, and monitor its I/O usage (iostats) and loadavg whilst doing so?
the hypothesis to test is that the performance, which should degrade roughly linearly with the ratio of the number of processes to the number of cores, instead drops like a lead balloon.
Looks to me like the system was reasonably well behaved.
and it looks like the writer's rate is approximately halving with each doubling of the thread count from 64 onwards.
ok, so that didn't show anything up... but wait... there's only one writer, right? the scenario where i am seeing difficulties is when there are multiple writers and readers (actually, multiple writers and readers to multiple envs simultaneously).
so to duplicate that scenario, it would either be necessary to modify the benchmark to use multiple writer threads (knowing that they are going to have contention, but that's ok) or, to be closer to the scenario where i have observed difficulties, to run the test several times *simultaneously* on the same database.
*thinks*.... actually in order to ensure that the reads and writes are approximately balanced, it would likely be necessary to modify the benchmark code to allow multiple writer threads and distribute the workload amongst them whilst at the same time keeping the number of reader threads the same as it was previously.
then it would be possible to make a direct comparison against the figures you just sent, e.g. for the 32-thread case: 32 readers, 2 writers; 32 readers, 4 writers; 32 readers, 8 writers; and so on. keeping the total number of threads (writers plus readers) at or below the number of cores avoids any unnecessary context-switching.
We can do that by running two instances of the benchmark program concurrently; one doing a read-only job with a fixed number of threads (32) and one doing a write-only job with the increasing number of threads.
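(a sketch of what those two concurrent jobs amount to - py-lmdb assumed, and this is not the actual benchmark code; the path, key/value contents and run length are placeholders. One process runs the read job with a fixed 32 threads, a second runs the write job with the thread count as the variable:)

    import lmdb
    import sys
    import threading
    import time

    ENV_PATH = "/tmp/bench-env"          # hypothetical
    RUN_SECONDS = 600                    # roughly the 10-minute runs above

    def read_job(env, n_threads=32):
        def reader():
            deadline = time.time() + RUN_SECONDS
            while time.time() < deadline:
                with env.begin(write=False) as txn:
                    txn.get(b"key-0")            # placeholder read
        threads = [threading.Thread(target=reader) for _ in range(n_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

    def write_job(env, n_threads):
        def writer(i):
            n = 0
            deadline = time.time() + RUN_SECONDS
            while time.time() < deadline:
                # one write per transaction, matching the benchmark
                with env.begin(write=True) as txn:
                    txn.put(("key-%d-%d" % (i, n)).encode(), b"x" * 100)
                n += 1
        threads = [threading.Thread(target=writer, args=(i,)) for i in range(n_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

    if __name__ == "__main__":
        env = lmdb.open(ENV_PATH, map_size=16 << 30, max_readers=256)
        if sys.argv[1] == "read":
            read_job(env)                        # instance 1: read-only, 32 threads
        else:
            write_job(env, int(sys.argv[2]))     # instance 2: write-only, N threads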
the hypothesis being tested is that the writers' overall performance remains the same, as only one of them may perform writes at a time.
i know it sounds silly to do that: it seems obvious that it should make no difference, given that however many writers there are, all but one of them will always be doing absolutely nothing, and the context switching when one finishes should be negligible. but i know there's something wrong and i'd like to help find out what it is.
My experience from benchmarking OpenLDAP over the years is that mutexes scale only up to a point. When you have threads grabbing the same mutex from across socket boundaries, things go into the toilet. There's no fix for this; that's the nature of inter-socket communication.
This test machine has 4 physical sockets but 8 NUMA nodes; internally each "processor" in a socket is really a pair of 8-core CPUs on an MCM, which is why there are two NUMA nodes per physical socket.
Write throughput should degrade pretty noticeably as the number of writer threads goes up. When we get past 8 writer threads there's no way to keep them all in a single NUMA domain, so at that point we should see a sharp drop in throughput.
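(a quick way to confirm that layout from Linux sysfs - a sketch; numactl --hardware reports the same information:)

    # List NUMA nodes and the CPUs belonging to each; on the machine
    # described above this should show 8 nodes of 8 cores each.
    import glob
    import os

    for node in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
        with open(os.path.join(node, "cpulist")) as f:
            print(os.path.basename(node), "cpus:", f.read().strip())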
On Mon, Oct 20, 2014 at 1:53 PM, Howard Chu hyc@symas.com wrote:
then it would be possible to make a direct comparison against the figures you just sent, e.g. for the 32-thread case: 32 readers, 2 writers; 32 readers, 4 writers; 32 readers, 8 writers; and so on. keeping the total number of threads (writers plus readers) at or below the number of cores avoids any unnecessary context-switching.
We can do that by running two instances of the benchmark program concurrently; one doing a read-only job with a fixed number of threads (32) and one doing a write-only job with the increasing number of threads.
ohh, ok - great. saves a job doing some programming at least.
the hypothesis being tested is that the writers' overall performance remains the same, as only one of them may perform writes at a time.
i know it sounds silly to do that: it seems obvious that it should make no difference, given that however many writers there are, all but one of them will always be doing absolutely nothing, and the context switching when one finishes should be negligible. but i know there's something wrong and i'd like to help find out what it is.
My experience from benchmarking OpenLDAP over the years is that mutexes scale only up to a point. When you have threads grabbing the same mutex from across socket boundaries, things go into the toilet. There's no fix for this; that's the nature of inter-socket communication.
argh. ok. so... actually, accidentally, the design where i used a single LMDB (one env) shared amongst 20 to 30 processes, using db_open to create 10 or so databases, would mitigate that... taking a quick look at mdb.c, the mutex lock is taken on the env, not on the database...
so compared to the previous design there would only be one 20-or-30-to-1 mutex contention, whereas previously there were *10 sets* of 20-or-30-to-1 mutexes all competing... and if mutexes use sockets underneath, that would explain why the inter-process communication (which also used sockets) was so dreadful.
huh, how about that.
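(to make that concrete, a small sketch - py-lmdb assumed, hypothetical path - of the single-env design: several named databases inside one environment all share that environment's single writer lock, so there is one point of write contention per env rather than one per database:)

    import lmdb

    env = lmdb.open("/tmp/shared-env", max_dbs=10, map_size=4 << 30)
    dbs = [env.open_db(("table-%d" % i).encode()) for i in range(10)]

    # Whatever named DB a write transaction touches, it holds the
    # environment's one writer lock for its whole duration.
    with env.begin(write=True, db=dbs[3]) as txn:
        txn.put(b"key", b"value")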
do you happen to have access to a straight 8-core SMP system, or is it relatively easy to turn off the NUMA architecture?
l.
Luke Kenneth Casson Leighton wrote:
On Mon, Oct 20, 2014 at 1:53 PM, Howard Chu hyc@symas.com wrote:
My experience from benchmarking OpenLDAP over the years is that mutexes scale only up to a point. When you have threads grabbing the same mutex from across socket boundaries, things go into the toilet. There's no fix for this; that's the nature of inter-socket communication.
argh. ok. so... actually, accidentally, the design where i used a single LMDB (one env) shared amongst 20 to 30 processes, using db_open to create 10 or so databases, would mitigate that... taking a quick look at mdb.c, the mutex lock is taken on the env, not on the database...
so compared to the previous design there would only be one 20-or-30-to-1 mutex contention, whereas previously there were *10 sets* of 20-or-30-to-1 mutexes all competing... and if mutexes use sockets underneath, that would explain why the inter-process communication (which also used sockets) was so dreadful.
Note - I was talking about physical CPU sockets, not network sockets.
huh, how about that.
do you happen to have access to a straight 8-core SMP system,
No.
or is it relatively easy to turn off the NUMA architecture?
I can probably use taskset or something similar to restrict a process to a particular set of cores. What exactly do you have in mind?
On Mon, Oct 20, 2014 at 2:34 PM, Howard Chu hyc@symas.com wrote:
Luke Kenneth Casson Leighton wrote:
On Mon, Oct 20, 2014 at 1:53 PM, Howard Chu hyc@symas.com wrote:
My experience from benchmarking OpenLDAP over the years is that mutexes scale only up to a point. When you have threads grabbing the same mutex from across socket boundaries, things go into the toilet. There's no fix for this; that's the nature of inter-socket communication.
argh. ok. so... actually, accidentally, the design where i used a single LMDB (one env) shared amongst 20 to 30 processes, using db_open to create 10 or so databases, would mitigate that... taking a quick look at mdb.c, the mutex lock is taken on the env, not on the database...
so compared to the previous design there would only be one 20-or-30-to-1 mutex contention, whereas previously there were *10 sets* of 20-or-30-to-1 mutexes all competing... and if mutexes use sockets underneath, that would explain why the inter-process communication (which also used sockets) was so dreadful.
Note - I was talking about physical CPU sockets, not network sockets.
oh right, haha :) ok scratch that theory then.
or is it relatively easy to turn off the NUMA architecture?
I can probably use taskset or something similar to restrict a process to a particular set of cores. What exactly do you have in mind?
keeping the proposed writers-and-readers test, but restricted to the same 8 cores only, running:
* 1 writer, 16 readers (single program, 16 threads) as a base-line: 2-to-1 contention between threads and cores
* creating extra tests adding extra programs (single writers only): first 1 extra writer, then 2, then 4, then 8, then maybe even 16.
the idea is to see how the mutexes affect performance as a function of the number of writers, but without the effects of NUMA to contend with.
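(a hedged sketch of the pinning, done from inside the process itself; launching each program under `taskset -c 0-7` achieves the same thing:)

    # Restrict this process - and every thread it spawns afterwards - to
    # cores 0-7, so the test behaves as if on a plain 8-core machine.
    import os

    os.sched_setaffinity(0, range(8))    # pid 0 means the calling process
    print("running on cpus:", sorted(os.sched_getaffinity(0)))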
l.
Luke Kenneth Casson Leighton wrote:
On Mon, Oct 20, 2014 at 2:34 PM, Howard Chu hyc@symas.com wrote:
I can probably use taskset or something similar to restrict a process to a particular set of cores. What exactly do you have in mind?
keeping the proposed writers-and-readers test, but restricted to the same 8 cores only, running:
Well in the meantime, the 32/32 test finished:
Writers  Wall      User         Sys          CPU %    Vol CS     Invol CS    Write      Read      Wall      User         Sys          CPU %
      1  10:01.38  00:09:34.90  00:00:25.80     99          9        2040    27160   5877894  10:00.68  05:19:03.44  00:00:35.34     3192
      2  10:01.38  00:09:15.15  00:02:41.94    119   14397526        2955    24012   6150801  10:00.67  05:19:00.96  00:00:38.23     3192
      4  10:01.34  00:09:06.93  00:02:46.94    118   14246971        8545    23795   6067928  10:00.68  05:18:59.65  00:00:39.52     3192
      8  10:01.49  00:09:04.22  00:02:53.09    119   14011236       10850    23589   5961381  10:00.69  05:18:56.82  00:00:42.54     3192
     16  10:01.64  00:09:07.32  00:02:53.09    119   13787738       20049    23670   5966794  10:00.68  05:18:59.13  00:00:40.35     3192
     32  10:01.78  00:09:04.29  00:02:58.40    120   13205576       27847    23337   5959624  10:00.70  05:19:00.54  00:00:39.46     3192
The timings on the left are for the writer process; the reader process is on the right. You were right, the number of writers is largely irrelevant because they're all waiting most of the time. There's a huge jump in System CPU time from 1 writer to 2 writers, because with only 1 writer the mutex is uncontended and essentially zero-cost. As more writer threads are added, the System CPU time will rise.
The readers are basically unaffected by the number of writers.
On Mon, Oct 20, 2014 at 3:25 PM, Howard Chu hyc@symas.com wrote:
Luke Kenneth Casson Leighton wrote:
On Mon, Oct 20, 2014 at 2:34 PM, Howard Chu hyc@symas.com wrote:
I can probably use taskset or something similar to restrict a process to a particular set of cores. What exactly do you have in mind?
keeping the proposed writers-and-readers test, but restricted to the same 8 cores only, running:
Well in the meantime, the 32/32 test finished:
Writers  Wall      User         Sys          CPU %    Vol CS     Invol CS    Write      Read      Wall      User         Sys          CPU %
      1  10:01.38  00:09:34.90  00:00:25.80     99          9        2040    27160   5877894  10:00.68  05:19:03.44  00:00:35.34     3192
      2  10:01.38  00:09:15.15  00:02:41.94    119   14397526        2955    24012   6150801  10:00.67  05:19:00.96  00:00:38.23     3192
      4  10:01.34  00:09:06.93  00:02:46.94    118   14246971        8545    23795   6067928  10:00.68  05:18:59.65  00:00:39.52     3192
      8  10:01.49  00:09:04.22  00:02:53.09    119   14011236       10850    23589   5961381  10:00.69  05:18:56.82  00:00:42.54     3192
     16  10:01.64  00:09:07.32  00:02:53.09    119   13787738       20049    23670   5966794  10:00.68  05:18:59.13  00:00:40.35     3192
     32  10:01.78  00:09:04.29  00:02:58.40    120   13205576       27847    23337   5959624  10:00.70  05:19:00.54  00:00:39.46     3192
The timings on the left are for the writer process; the reader process is on the right. You were right, the number of writers is largely irrelevant because they're all waiting most of the time. There's a huge jump in System CPU time from 1 writer to 2 writers, because with only 1 writer the mutex is uncontended and essentially zero-cost. As more writer threads are added, the System CPU time will rise.
by some 15 to 20% or so, going from 2 writers up to 32 writers, which is not unreasonable - nothing like what i saw (CPU usage dropping to 40%, loadavg jumping to over 8x the number of cores).
ok... what's the number of writes being carried out in each transaction? the original scenario in which i saw the issue was when there was one (!!). the design at the time was not optimised for batches of reads followed by batches of writes.
a single write per transaction would mean quite a lot more mutex usage.
l.
Luke Kenneth Casson Leighton wrote:
On Mon, Oct 20, 2014 at 3:25 PM, Howard Chu hyc@symas.com wrote:
The timings on the left are for the writer process; the reader process is on the right. You were right, the number of writers is largely irrelevant because they're all waiting most of the time. There's a huge jump in System CPU time from 1 writer to 2 writers, because with only 1 writer the mutex is uncontended and essentially zero-cost. As more writer threads are added, the System CPU time will rise.
by some 15 to 20% or so, going from 2 writers up to 32 writers, which is not unreasonable - nothing like what i saw (CPU usage dropping to 40%, loadavg jumping to over 8x the number of cores).
ok... what's the number of writes being carried out in each transaction? the original scenario in which i saw the issue was when there was one (!!). the design at the time was not optimised for batches of reads followed by batches of writes.
This is 1 write per txn.
a single write per transaction would mean quite a lot more mutex usage.
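(to make the contrast concrete, a sketch - py-lmdb assumed, placeholder keys and values - of one write per transaction versus batching: the per-item version takes the env's writer lock, and pays a commit and its writeback, once for every single put, whereas the batched version amortises both:)

    import lmdb

    env = lmdb.open("/tmp/bench-env", map_size=4 << 30)    # hypothetical path
    items = [("key-%d" % i, "value-%d" % i) for i in range(1000)]

    # 1 write per txn - what the benchmark above does
    for k, v in items:
        with env.begin(write=True) as txn:
            txn.put(k.encode(), v.encode())

    # batched - the same 1000 writes, one lock acquisition, one commit
    with env.begin(write=True) as txn:
        for k, v in items:
            txn.put(k.encode(), v.encode())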
On Mon, Oct 20, 2014 at 3:47 PM, Howard Chu hyc@symas.com wrote:
ok... what's the number of writes being carried out in each transaction? the original scenario in which i saw the issue was when there was one (!!). the design at the time was not optimised for batches of reads followed by batches of writes.
This is 1 write per txn.
already? oh ok so that's hammering the machine rather a lot, then :)
ok let me do some exploring this end, see if i can come up with something that replicates the dreadful performance.
l.
Luke Kenneth Casson Leighton wrote:
On Mon, Oct 20, 2014 at 1:53 PM, Howard Chu hyc@symas.com wrote:
then it would be possible to make a direct comparison against the figures you just sent, e.g. for the 32-thread case: 32 readers, 2 writers; 32 readers, 4 writers; 32 readers, 8 writers; and so on. keeping the total number of threads (writers plus readers) at or below the number of cores avoids any unnecessary context-switching.
We can do that by running two instances of the benchmark program concurrently; one doing a read-only job with a fixed number of threads (32) and one doing a write-only job with the increasing number of threads.
ohh, ok - great. saves a job doing some programming at least.
This is why it's important to support both multi-process and multi-threaded concurrency ;)