hyc@OpenLDAP.org wrote:
Update of /repo/OpenLDAP/pkg/ldap/servers/slapd
Modified Files:
    bconfig.c      1.413 -> 1.414
    daemon.c       1.444 -> 1.445
    proto-slap.h   1.802 -> 1.803
Log Message: Add support for multiple listener threads. Lightly tested on Linux, Winsock needs a couple more tweaks
Well, it doesn't look like this patch caused any harm for the default case. I'm only seeing about a 10% gain in throughput using two listener threads on a 16-core machine. Not earth-shattering, but not bad.
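For anyone who wants to try it, the knob this patch adds is the listener-thread count. A minimal sketch of the setting - directive names from memory, so check slapd.conf(5) and the patch itself rather than taking this verbatim:

    # slapd.conf: give the connection manager two listener threads
    listener-threads 2

    # or the cn=config equivalent, applied with ldapmodify:
    dn: cn=config
    changetype: modify
    replace: olcListenerThreads
    olcListenerThreads: 2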
Eh. The 10% was on a pretty lightly loaded test. Under heavy load the advantage is only 1.2%. Hardly seems worth the trouble.
And now the really worrying news - I was originally testing on an old install of Debian Lenny with a 2.6.26 kernel. I just now updated the system to Debian Squeeze running a 2.6.32 kernel, and the throughput results are a solid 20% slower, with nothing else changed.
At lighter loads (16 slapd threads, 32 client threads) Lenny is up to 35% faster than Squeeze. At peak load (168 client threads) the difference is only 3%; it's probably network-limited by then.
I've re-run the same test sequence using tcmalloc, but that didn't make much difference. Something else is much slower on this OS revision. I may build a 2.6.35 kernel just to see whether it's a kernel or userspace problem...
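For the record, there's nothing exotic about the tcmalloc run; the usual approach is just to preload the library before starting slapd. A sketch along these lines - the library path and slapd invocation are illustrative, not necessarily the exact commands used for these tests:

    # swap in tcmalloc without rebuilding; library path varies by distro
    LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so \
        /usr/local/libexec/slapd -f /usr/local/etc/openldap/slapd.conf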
For anyone curious, the slamd reports from these test runs are available on http://highlandsun.com/hyc/slamd/
There is a slight drop in throughput for a single listener thread compared to the pre-patched code. It's around 1%, consistent enough to not be a measurement error, but not really significant.
At least the advantage always outweighs the above-mentioned 1% loss. I.e., cancelling both effects out, we're still ahead overall.
Comparing the results, with a single listener thread there are several points where it is obviously scaling poorly. With two listener threads, those weak spots in the single listener graphs are gone and everything runs smoothly up to the peak load.
E.g. comparing single listener
http://highlandsun.com/hyc/slamd/squeeze/singlenew/jobs/optimizing_job_20100...
vs double listener
http://highlandsun.com/hyc/slamd/squeeze/double/jobs/optimizing_job_20100808...
at 56 client threads, the double-listener slapd is 37.6% faster. Dunno why 56 clients is a magic number for the single listener; it jumps back up to a more reasonable throughput at 64 client threads, where the double listener is only 11.7% faster.
When looking for a performance bottleneck in a system, it always helps to search in the right component...
Tossing out the 4 old load generator machines and replacing them with two 8-core servers (and using slamd 2.0.1 instead of 2.0.0) paints quite a different picture.
http://highlandsun.com/hyc/slamd/squeeze/doublenew/jobs/optimizing_job_20100...
With the old client machines the latency went up to the 2-3msec range at peak load; with the new machines it stays under 0.9msec. So basically the slowdowns were due to the load generators getting overloaded, not any part of slapd getting overloaded.
The shape of the graph still looks odd with this kernel. (The column for 3 threads per client is out of whack.) But the results are so consistent that I don't think there's any measurement error to blame.
Also added results using BDB 4.8.30 (previous runs used 4.7.25) and a 2.6.35 kernel.
BDB 4.8 vs 4.7 seems to be worth about a 5% gain on its own. The 2.6.35 kernel gives a slight boost as well, with search rates spiking over 67600/second and idle CPU down to 6%. At that point, slapd is consuming around 90% of the CPU. Network interrupts also consume about 6% total, or ~75% of one core.
Hm, I may have forgotten to mention that these tests are being run on an HP ProLiant DL585 G5 with 4 Opteron 8354 (2.2GHz quad-core) processors. The box has 64GB of RAM; the DB has 5 million entries and slapd is using around 25-26GB of memory. (Around 4KB per entry, plus ~4GB of BDB shared memory cache.)
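For completeness: a BDB shared memory cache of that size is normally set in the back-bdb database directory's DB_CONFIG file. A sketch of the relevant line, with the size illustrative rather than a record of the exact values used here:

    # DB_CONFIG in the database directory: 4GB environment cache, one region
    set_cachesize 4 0 1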
I re-ran the BDB 4.8 job with 29 iterations, to match a run I had done against some other directory server. (That other server took 29 iterations to satisfy the two-consecutive-non-improving-iterations criterion; this was just to provide comparable data.) That's in the "double29" results. The only thing it really demonstrates is that OpenLDAP's performance is rock-steady under load; it doesn't just peak and then deteriorate as the load gets heavier. (Which we've seen on other servers as their thread queues get overwhelmed.)
Also for comparison I've done a run against OpenDS 2.3.0 build 3, using the Sun JRE 1.6. I'm very impressed with OpenDS's results; I configured the JVM with 32GB of heap, and it only used 17GB but returned very good performance. Aside from allocating a few GB for the BDB cache, it's basically in stock tune. (Also with access logging disabled for these runs.) I haven't yet tried with entry caching enabled; in previous runs I had a poor experience with entry caching. I guess I should ask on the OpenDS forums for further tuning advice.
http://highlandsun.com/hyc/slamd/squeeze/double29/
http://highlandsun.com/hyc/slamd/squeeze/opends2.3.0/
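For reference, sizing a 32GB heap on the Sun JVM is just the standard HotSpot options, roughly along these lines; the precise flags used for this run (and whether they went on the command line or into OpenDS's config/java.properties) aren't given above, so treat these as illustrative:

    # illustrative HotSpot heap sizing for the OpenDS JVM, not the exact options used
    -server -Xms32g -Xmx32g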
There's a lot to be said for being able to achieve good performance without needing to fret over configuring individual caches. It makes a stronger case for back-mdb, to my mind.
Don't bother too much with entry caches unless you can squeeze your entire dataset into the cache, or your tests iterate over some portion of your dataset that fits into the cache nicely (which is not the case here with your tests); otherwise you're going to get a lot of GC churn that will offset any cache gains. The other use case would be heavy objects like groups, since we do have very flexible cache config, so you can target specific things for cache retention. The best thing you can do is allocate as much as you can to the BDB cache, and maybe do a pre-load as well:
https://www.opends.org/wiki/page/Caching
https://www.opends.org/wiki/page/AdvancedProperties#section-AdvancedProperti...
Then for further tuning you can tweak
https://www.opends.org/promoted-builds/latest/OpenDS/build/docgen/configurat...
https://www.opends.org/promoted-builds/latest/OpenDS/build/docgen/configurat...
to get the most out of your specific hardware. Write and mixed-load tuning is more complex than that, though, and you are welcome to ask about any of that further on the OpenDS mailing lists, of course.
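To make the "allocate as much as you can to the BDB cache" advice concrete, it's normally a dsconfig backend property; a sketch like the following, with the property name from memory and the linked docs being the authoritative reference (connection/bind options omitted):

    # illustrative only: give the JE/BDB backend half the JVM heap for its database cache
    dsconfig set-backend-prop --backend-name userRoot --set db-cache-percent:50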