Luke Kenneth Casson Leighton wrote:
> On Mon, Oct 20, 2014 at 1:00 PM, Howard Chu <hyc@symas.com> wrote:
>> Howard Chu wrote:
>>> Luke Kenneth Casson Leighton wrote:
>>>>
>>>> http://symas.com/mdb/inmem/scaling.html
>>>>
>>>> can i make the suggestion that, whilst i am aware that it is generally
>>>> not recommended for production environments to run more processes than
>>>> there are cores, you try running 128, 256 and even 512 processes all
>>>> hitting that 64-core system, and monitor its I/O usage (iostats) and
>>>> loadavg whilst doing so?
>>>>
>>>> the hypothesis to test is that the performance, which should scale
>>>> reasonably linearly downwards as a ratio of the number of processes to
>>>> the number of cores, instead drops like a lead balloon.
>>
>> Looks to me like the system was reasonably well behaved.
> and it looks like the writer rate is approximately halving with each
> doubling from 64 onwards.
>
> ok, so that didn't show anything up... but wait... there's only one
> writer, right?  the scenarios where i am seeing difficulties are when
> there are multiple writers and readers (actually, multiple writers and
> readers to multiple envs simultaneously).
>
> so to duplicate that scenario, it would either be necessary to modify
> the benchmark to do multiple writer threads (knowing that they are
> going to have contention, but that's ok) or, to be closer to the
> scenario where i have observed difficulties, to run the test several
> times *simultaneously* on the same database.
>
> *thinks*.... actually, in order to ensure that the reads and writes
> are approximately balanced, it would likely be necessary to modify the
> benchmark code to allow multiple writer threads and distribute the
> workload amongst them, whilst at the same time keeping the number of
> reader threads the same as it was previously.
>
> then it would be possible to make a direct comparison (against the
> figures you just sent) for e.g. the 32-thread case: 32 readers, 2
> writers; 32 readers, 4 writers; 32 readers, 8 writers; and so on.
> keeping the number of threads (writers plus readers) below or equal to
> the total number of cores avoids any unnecessary context-switching.
We can do that by running two instances of the benchmark program
concurrently: one doing a read-only job with a fixed number of threads (32)
and one doing a write-only job with an increasing number of writer threads.
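Roughly, the write-only instance boils down to something like the sketch
below (a minimal sketch against the raw LMDB API rather than the actual
benchmark code; the path, key layout, map size, thread cap and missing
error handling are all placeholders):

#include <lmdb.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static MDB_env *env;
static MDB_dbi dbi;

/* each writer thread runs its own sequence of small write txns;
 * LMDB allows only one write txn at a time, so the other threads
 * block inside mdb_txn_begin() on the writer mutex */
static void *writer(void *arg)
{
    long id = (long)arg;
    char kbuf[32], vbuf[100];
    memset(vbuf, 'x', sizeof(vbuf));
    for (long i = 0; i < 100000; i++) {
        MDB_txn *txn;
        MDB_val key, val;
        snprintf(kbuf, sizeof(kbuf), "w%ld-%ld", id, i);
        key.mv_size = strlen(kbuf); key.mv_data = kbuf;
        val.mv_size = sizeof(vbuf); val.mv_data = vbuf;
        mdb_txn_begin(env, NULL, 0, &txn);
        mdb_put(txn, dbi, &key, &val, 0);
        mdb_txn_commit(txn);
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int nthreads = argc > 1 ? atoi(argv[1]) : 2;  /* number of writers */
    pthread_t tids[512];
    MDB_txn *txn;

    if (nthreads > 512) nthreads = 512;
    mdb_env_create(&env);
    mdb_env_set_mapsize(env, 1UL << 34);          /* placeholder size */
    mdb_env_open(env, "/tmp/bench.mdb", MDB_NOSUBDIR, 0664);
    mdb_txn_begin(env, NULL, 0, &txn);
    mdb_dbi_open(txn, NULL, 0, &dbi);
    mdb_txn_commit(txn);

    for (long i = 0; i < nthreads; i++)
        pthread_create(&tids[i], NULL, writer, (void *)i);
    for (int i = 0; i < nthreads; i++)
        pthread_join(tids[i], NULL);
    mdb_env_close(env);
    return 0;
}

The read-only instance would be the same loop using MDB_RDONLY
transactions and mdb_get() instead, run with the fixed 32 threads.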
> the hypothesis being tested is that the writers' overall performance
> remains the same, as only one may perform writes at a time.
>
> i know it sounds silly to do that: it sounds so obvious that yeah, it
> really should not make any difference given that no matter how many
> writers there are they will always do absolutely nothing (except one
> of them), and the context-switching when one finishes should also be
> negligible, but i know there's something wrong and i'd like to help
> find out what it is.
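One cheap way to see where the blocked writers' time actually goes is to
time the mdb_txn_begin() call separately from the put and commit. A rough
sketch, written as a drop-in replacement for writer() in the sketch above
(same file, same env/dbi globals; per-thread totals are just printed at
the end):

#include <time.h>

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

static void *timed_writer(void *arg)
{
    long id = (long)arg;
    char kbuf[32], vbuf[100];
    double waited = 0, worked = 0;
    memset(vbuf, 'x', sizeof(vbuf));
    for (long i = 0; i < 100000; i++) {
        MDB_txn *txn;
        MDB_val key, val;
        snprintf(kbuf, sizeof(kbuf), "w%ld-%ld", id, i);
        key.mv_size = strlen(kbuf); key.mv_data = kbuf;
        val.mv_size = sizeof(vbuf); val.mv_data = vbuf;
        double t0 = now_sec();
        mdb_txn_begin(env, NULL, 0, &txn);  /* blocks on the writer mutex */
        double t1 = now_sec();
        mdb_put(txn, dbi, &key, &val, 0);
        mdb_txn_commit(txn);
        waited += t1 - t0;                  /* time spent blocked */
        worked += now_sec() - t1;           /* time spent writing */
    }
    printf("writer %ld: blocked %.2fs, writing %.2fs\n", id, waited, worked);
    return NULL;
}

If the hypothesis holds, the blocked time should grow with the writer
count while the writing time, and the total write throughput, stay
roughly flat.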
My experience from benchmarking OpenLDAP over the years is that mutexes scale
only up to a point. When you have threads grabbing the same mutex from across
socket boundaries, things go into the toilet. There's no fix for this; that's
the nature of inter-socket communication.
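The effect shows up with nothing but a plain pthread mutex: pin two
threads to cores on the same NUMA node, then to cores on different
sockets, and compare how long the same number of lock/unlock cycles
takes. A toy sketch (the default core ids are placeholders; the real
node-to-core mapping comes from numactl --hardware):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long counter;

/* both threads hammer the same mutex; where they are pinned is what
 * decides how expensive each handoff of the lock is */
static void *spin(void *arg)
{
    (void)arg;
    for (int i = 0; i < 10000000; i++) {
        pthread_mutex_lock(&lock);
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int cores[2] = { 0, 1 };   /* placeholder core ids */
    pthread_t t[2];

    if (argc > 2) {
        cores[0] = atoi(argv[1]);
        cores[1] = atoi(argv[2]);
    }
    for (int i = 0; i < 2; i++) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cores[i], &set);
        pthread_create(&t[i], NULL, spin, NULL);
        pthread_setaffinity_np(t[i], sizeof(set), &set);
    }
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);
    return 0;
}

Run it under time(1) once with two cores from the same node and once
with one core from each socket; the difference in wall-clock time is the
cross-socket penalty, with no LMDB involved at all.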
This test machine has 4 physical sockets but 8 NUMA nodes; internally each
"processor" in a socket is really a pair of 8-core CPUs on an MCM, which is
why there are two NUMA nodes per physical socket.
Write throughput should degrade pretty noticeably as the number of writer
threads goes up. When we get past 8 writer threads there's no way to keep them
all in a single NUMA domain, so at that point we should see a sharp drop in
throughput.
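One way to check that is to restrict the write-only instance to a single
domain before its threads start; a sketch, assuming (not checking) that
cores 0-7 belong to node 0:

#define _GNU_SOURCE
#include <sched.h>

/* call this at the top of main() in the write-only instance to keep
 * all of its threads on the cores of one NUMA node; the 0-7 range is
 * an assumption, not a fact about this box */
static void bind_to_one_node(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int c = 0; c < 8; c++)
        CPU_SET(c, &set);
    sched_setaffinity(0, sizeof(set), &set);
}

Running the instance under numactl --cpunodebind=0 --membind=0 does the
same thing from the command line and keeps its allocations local as well.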
--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/