Anton Bobrov wrote:
> Increasing the number of pollers/readers should help significantly on
> massively multi-CPU/core systems, but it requires some fine tuning: you
> have to weigh it against the cost of processing those requests,
> otherwise all you're going to do is saturate your work queue. So it's
> good to have some mechanism to cap it and apply the brakes when needed
> as well. The real problems come from the cost of synchronization on
> those systems, though. A while back I had some play time with a fully
> loaded T5440 [256 h/w threads] and did manage to get it to 85% utilized
> with OpenDS. There were of course quite a number of threads involved,
> and thus various synchronization issues at that scale and architecture
> that normally make an insignificant difference on smaller systems, like
> when you have multiple pollers/readers putting things on the work queue
> and multiple worker threads taking things from it. The more creative
> you can get making things safely lockless, the better.

Unfortunately, at this time writing lockless algorithms means resorting to
heavily machine-dependent code, and we've been trying to stick to standardized
APIs (e.g. POSIX). It would be pretty easy to write a CPU-cache-friendly
producer/consumer queue in assembly language for a few specific architectures,
and maybe doable using compiler-specific intrinsics, but our portability would
go out the window.
(Which is not to say that the thought hasn't crossed my mind, numerous times
already. I still have a very nice implementation I wrote in SPARC assembly
language kicking around here, but it seems that only x86-64 matters these
days; that and ARM...)
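
For comparison, the portable, lock-based form of such a producer/consumer
queue can be sketched with nothing but POSIX threads. This is a minimal
illustration, not OpenLDAP's or OpenDS's actual queue: the names workq,
wq_put, wq_get and the WQ_CAP limit are all hypothetical. It is deliberately
not lockless; every put and get takes the same mutex, which is exactly the
synchronization cost that grows with thread count. The fixed capacity gives
the "cap it and apply the brakes" behaviour: pollers block in wq_put while
the queue is full.

```c
#include <pthread.h>
#include <stddef.h>

#define WQ_CAP 1024            /* hypothetical cap on queued jobs */

typedef struct {
    void *jobs[WQ_CAP];        /* ring buffer of opaque job pointers */
    size_t head, tail, count;
    pthread_mutex_t mu;
    pthread_cond_t not_full, not_empty;
} workq;

static void wq_init(workq *q)
{
    q->head = q->tail = q->count = 0;
    pthread_mutex_init(&q->mu, NULL);
    pthread_cond_init(&q->not_full, NULL);
    pthread_cond_init(&q->not_empty, NULL);
}

/* Called by poller/reader threads. Blocks while the queue is at its cap,
 * which throttles the pollers instead of letting the queue grow. */
static void wq_put(workq *q, void *job)
{
    pthread_mutex_lock(&q->mu);
    while (q->count == WQ_CAP)
        pthread_cond_wait(&q->not_full, &q->mu);
    q->jobs[q->tail] = job;
    q->tail = (q->tail + 1) % WQ_CAP;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->mu);
}

/* Called by worker threads. Blocks while the queue is empty. */
static void *wq_get(workq *q)
{
    pthread_mutex_lock(&q->mu);
    while (q->count == 0)
        pthread_cond_wait(&q->not_empty, &q->mu);
    void *job = q->jobs[q->head];
    q->head = (q->head + 1) % WQ_CAP;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->mu);
    return job;
}
```

Jobs come out in the order they went in. With many pollers and many workers
all contending for the one mutex, this is the part a lockless design would
try to replace.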
On 02/08/2010 08:34, Emmanuel Lécharny wrote:
>> Here's the situation: suppose you have thousands of clients connected
>> and active. Even if you have CPUs to spare, the number of connections
>> you can acknowledge and dispatch is limited by the speed of the single
>> thread that's processing select(). Even if all it does is walk thru
>> the list of active descriptors and dispatch a job to the thread pool
>> for each one, it's only possible to dispatch a fixed number of
>> ops/second, no matter how many other CPUs there are.
> I'm a bit surprised that the select() processing *is* the bottleneck...
> All in all, it's just -internally- a matter of processing a bit field to
> see which bit is set to 1, and then get back the FD that is associated
> with this bit. You must have some other tasks running that create this
> I will have to check OpenLDAP code here...
>> Right now on a 24 core server I'm seeing 48,000 searches/second and
>> 50% CPU utilization. Adding more clients only seems to increase the
>> overall latency, but CPU usage and throughput don't increase any further.
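
The single-listener pattern under discussion can be sketched as follows.
This is only an illustration of the general shape, not the actual slapd
code: poll_once, dispatch_fn and note_ready are hypothetical names, and
note_ready stands in for whatever would submit a job to the thread pool.
One thread calls select(), then walks the descriptors and tests each bit
with FD_ISSET, dispatching the ones that are ready.

```c
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Hypothetical callback type: hand one ready descriptor to the pool. */
typedef void (*dispatch_fn)(int fd, void *ctx);

/* Stand-in dispatcher for illustration: just records the last ready fd. */
static void note_ready(int fd, void *ctx)
{
    *(int *)ctx = fd;
}

/* One pass of a single-threaded listener. Waits (up to *tv, or forever if
 * tv is NULL) for any of fds[] to become readable, then dispatches each
 * ready descriptor. Returns the number dispatched, 0 on timeout, or -1 on
 * error. Every descriptor must be below FD_SETSIZE. */
static int poll_once(const int *fds, size_t nfds,
                     dispatch_fn dispatch, void *ctx, struct timeval *tv)
{
    fd_set rset;
    int maxfd = -1;

    FD_ZERO(&rset);
    for (size_t i = 0; i < nfds; i++) {
        FD_SET(fds[i], &rset);
        if (fds[i] > maxfd)
            maxfd = fds[i];
    }

    int n = select(maxfd + 1, &rset, NULL, NULL, tv);
    if (n <= 0)
        return n;

    /* Serial walk over the descriptors: this loop is the part that cannot
     * be spread across other CPUs when there is only one listener. */
    int dispatched = 0;
    for (size_t i = 0; i < nfds && dispatched < n; i++) {
        if (FD_ISSET(fds[i], &rset)) {
            dispatch(fds[i], ctx);
            dispatched++;
        }
    }
    return dispatched;
}
```

If the listener really is the limit, then at 48,000 operations/second the
whole cycle of select() plus the walk plus the hand-off has a budget of
roughly 21 microseconds per operation (1/48,000 s), however many worker
threads sit idle behind it.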
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/