Anton Bobrov wrote:
Increasing the number of pollers/readers should help significantly on massively multi-CPU/core systems, but it requires some fine tuning: you have to weigh it against the cost of processing those requests, otherwise all you're going to do is saturate your work queue, so it's good to have some mechanism to cap it and apply the brakes when needed. The real problems on those systems come from the cost of synchronization, though. A while back I had some play time with a fully loaded T5440 [256 h/w threads] and did manage to get it to 85% utilized with OpenDS. There were of course quite a number of threads involved, and with them various synchronization issues that make an insignificant difference on smaller systems but really show up at that scale and on that architecture: for example, multiple pollers/readers putting things on the work queue while multiple worker threads take things off it. The more creative you can get making things safely lockless, the better.
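To make the queue/brake idea concrete, here is a minimal sketch of a bounded work queue in the classic pthreads style (the names and the QUEUE_CAP value are made up for illustration; mutex/condvar initialization is omitted). Note that the single mutex here is exactly the kind of synchronization cost that starts to dominate once hundreds of hardware threads are hammering the same queue:

    #include <pthread.h>

    #define QUEUE_CAP 1024              /* high-water mark: the "brake" */

    struct work_queue {
        void            *items[QUEUE_CAP];
        int              head, tail, count;
        pthread_mutex_t  lock;
        pthread_cond_t   not_empty;
        pthread_cond_t   not_full;
    };

    /* Called by a poller/reader thread; blocks when the queue is saturated. */
    void wq_put(struct work_queue *q, void *op)
    {
        pthread_mutex_lock(&q->lock);
        while (q->count == QUEUE_CAP)   /* full: apply the brakes */
            pthread_cond_wait(&q->not_full, &q->lock);
        q->items[q->tail] = op;
        q->tail = (q->tail + 1) % QUEUE_CAP;
        q->count++;
        pthread_cond_signal(&q->not_empty);
        pthread_mutex_unlock(&q->lock);
    }

    /* Called by a worker thread; blocks until a poller has queued work. */
    void *wq_get(struct work_queue *q)
    {
        void *op;
        pthread_mutex_lock(&q->lock);
        while (q->count == 0)
            pthread_cond_wait(&q->not_empty, &q->lock);
        op = q->items[q->head];
        q->head = (q->head + 1) % QUEUE_CAP;
        q->count--;
        pthread_cond_signal(&q->not_full);
        pthread_mutex_unlock(&q->lock);
        return op;
    }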
Unfortunately, at this time writing lockless algorithms means resorting to heavily machine-dependent code, and we've been trying to stick to standardized (e.g., POSIX) APIs. It would be pretty easy to write a CPU-cache-friendly producer/consumer queue in assembly language for a few specific architectures, and maybe doable using compiler-specific intrinsics, but our portability would go out the window.
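For the curious, this is roughly the shape of the single-producer/single-consumer ring that such intrinsics make possible. It is spelled below with C11 <stdatomic.h> just so the ordering constraints are explicit; in practice you would be reaching for compiler-specific __sync/__atomic builtins or hand-rolled barriers per architecture, which is exactly the portability problem. Treat it as a sketch, and note it only handles one producer and one consumer:

    #include <stdatomic.h>
    #include <stddef.h>

    #define RING_SIZE 1024              /* must be a power of two */

    struct spsc_ring {
        void           *slots[RING_SIZE];
        _Atomic size_t  head;           /* advanced only by the consumer */
        _Atomic size_t  tail;           /* advanced only by the producer */
    };

    /* Producer side (a poller thread). Returns 0 if the ring is full. */
    int ring_put(struct spsc_ring *r, void *op)
    {
        size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
        if (tail - head == RING_SIZE)
            return 0;                   /* full: caller applies the brakes */
        r->slots[tail & (RING_SIZE - 1)] = op;
        /* release: publish the slot before the new tail becomes visible */
        atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
        return 1;
    }

    /* Consumer side (a worker thread). Returns NULL if the ring is empty. */
    void *ring_get(struct spsc_ring *r)
    {
        size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
        size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
        if (head == tail)
            return NULL;                /* empty */
        void *op = r->slots[head & (RING_SIZE - 1)];
        /* release: finish reading the slot before handing it back */
        atomic_store_explicit(&r->head, head + 1, memory_order_release);
        return op;
    }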
(Which is not to say that the thought hasn't crossed my mind, numerous times already. I still have a very nice implementation I wrote in SPARC assembly language kicking around here, but it seems that only x86-64 matters these days; that, and ARM...)
On 02/08/2010 08:34, Emmanuel Lécharny wrote:
Here's the situation: suppose you have thousands of clients connected and active. Even if you have CPUs to spare, the number of connections you can acknowledge and dispatch is limited by the speed of the single thread that's processing select(). Even when all that thread does is walk through the list of ready descriptors and hand each one off to the thread pool, it can only dispatch a fixed number of ops/second, no matter how many other CPUs there are.
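Concretely, the loop in question looks something like the following (a simplified sketch, not the actual slapd listener; dispatch_to_pool() is a hypothetical stand-in for the hand-off to a worker). Every iteration runs in this one thread, so its cost per ready descriptor puts a hard ceiling on dispatch throughput:

    #include <sys/select.h>

    void dispatch_to_pool(int fd);      /* hypothetical hand-off to a worker thread */

    void listener_loop(int maxfd, const fd_set *watched)
    {
        for (;;) {
            fd_set ready = *watched;    /* select() overwrites its argument */
            int n = select(maxfd + 1, &ready, NULL, NULL, NULL);
            if (n <= 0)
                continue;               /* error handling elided */
            /* Walk the descriptors and hand each ready one to the pool;
             * all of this is serialized in the single listener thread. */
            for (int fd = 0; fd <= maxfd && n > 0; fd++) {
                if (FD_ISSET(fd, &ready)) {
                    n--;
                    dispatch_to_pool(fd);
                }
            }
        }
    }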
I'm a bit surprised that the select() processing *is* the bottleneck... All in all, internally it's just a matter of scanning a bit field to see which bits are set to 1, and then getting back the FD associated with each set bit. You must have some other task running that creates this bottleneck.
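In other words, something like this (illustrative only; the real fd_set layout and the libc/kernel code are implementation-specific):

    #include <strings.h>                /* ffs() */

    /* Scan a ready-mask one word at a time and call cb() for each set bit.
     * This is only meant to show how cheap the "which bits are set, and
     * which FD does each correspond to" step is in principle. */
    void for_each_ready_fd(const unsigned int *words, int nwords,
                           void (*cb)(int fd))
    {
        int bits_per_word = (int)(sizeof(unsigned int) * 8);
        for (int w = 0; w < nwords; w++) {
            unsigned int bits = words[w];
            while (bits) {
                int b = ffs((int)bits) - 1;   /* index of the lowest set bit */
                cb(w * bits_per_word + b);
                bits &= bits - 1;             /* clear that bit and continue */
            }
        }
    }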
I will have to check OpenLDAP code here...
Right now, on a 24-core server, I'm seeing 48,000 searches/second at 50% CPU utilization. Adding more clients only seems to increase the overall latency; CPU usage and throughput don't increase any further.