Emmanuel Lécharny wrote:
On 8/2/10 5:26 AM, Howard Chu wrote:
<snip/>
> Why would you have more than one select()? Wouldn't it be better to
> have one thread processing the select() and dispatching the operation to
> a pool of threads?
That's what we have right now, and as far as I can see it's a bottleneck that prevents us from utilizing more than 12 cores. (I could be wrong, and the bottleneck might actually be in the thread pool manager. I haven't got precise enough measurements yet to know for sure.)
Here's the situation: suppose you have thousands of clients connected and active. Even if you have CPUs to spare, the number of connections you can acknowledge and dispatch is limited by the speed of the single thread that's processing select(). Even if all it does is walk through the list of active descriptors and dispatch a job to the thread pool for each one, it can only dispatch a fixed number of ops/second, no matter how many other CPUs there are.
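To make the shape of that design concrete, here is a minimal sketch in Python of one listener thread feeding a worker pool. It is not the slapd code: `handle`, `dispatch_once` and `demo` are hypothetical names, and socket pairs stand in for real client connections. The point is only that the loop in `dispatch_once` runs on one thread, however many workers the pool has.

```python
import selectors
import socket
from concurrent.futures import ThreadPoolExecutor


def handle(conn):
    """Worker job: read one request and send back a reply (stand-in for a real operation)."""
    data = conn.recv(4096)
    if data:
        conn.sendall(b"reply:" + data)
    return data


def dispatch_once(sel, pool, timeout=1.0):
    """One pass of the single listener thread.

    Every ready descriptor is handed to the pool from this one loop, so the
    dispatch rate is bounded by how fast this thread can iterate.
    """
    futures = []
    for key, _events in sel.select(timeout):
        # Stop watching the socket while a worker owns it, so it is not
        # dispatched a second time before the worker has read from it.
        sel.unregister(key.fileobj)
        futures.append(pool.submit(handle, key.fileobj))
    return futures


def demo(n_clients=4):
    """Hypothetical driver: n_clients socket pairs, one request each."""
    sel = selectors.DefaultSelector()
    pairs = [socket.socketpair() for _ in range(n_clients)]
    try:
        for i, (client, server) in enumerate(pairs):
            sel.register(server, selectors.EVENT_READ)
            client.sendall(b"req%d" % i)
        with ThreadPoolExecutor(max_workers=4) as pool:
            for future in dispatch_once(sel, pool):
                future.result()  # wait until the worker has replied
        return sorted(client.recv(4096) for client, _ in pairs)
    finally:
        sel.close()
        for client, server in pairs:
            client.close()
            server.close()
```

Raising `max_workers` in this sketch speeds up the handlers but does nothing for the loop in `dispatch_once`, which is the part under suspicion here.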
I'm a bit surprised that the select() processing *is* the bottleneck... All in all, it's just -internally- a matter of processing a bit field to see which bits are set to 1, and then getting back the FD that is associated with each set bit. You must have some other tasks running that create this bottleneck.
I will have to check the OpenLDAP code here...
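For reference, the bit-field scan described above can be modelled as follows. This is a sketch in Python, not the OpenLDAP code or the C `fd_set` macros; `ready_fds` is a hypothetical helper, and the integer `bitmask` stands in for the descriptor set that select() fills in.

```python
def ready_fds(bitmask):
    """Return the descriptor numbers whose bits are set in bitmask.

    Bit i of bitmask being 1 means descriptor i is ready. The scan visits
    every bit position up to the highest set bit, so its cost grows with
    the highest descriptor number, not with the number of ready descriptors.
    """
    fds = []
    fd = 0
    while bitmask:
        if bitmask & 1:
            fds.append(fd)
        bitmask >>= 1
        fd += 1
    return fds
```

For example, `ready_fds(0b101001)` yields descriptors 0, 3 and 5. The scan itself is cheap per descriptor; the open question in this thread is whether doing it, plus the dispatch, on a single thread caps throughput.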
Right, it only amounts to around a 10% difference. Still, this is an improvement. I guess we'll need to set up oprofile and get some more detailed numbers to find out where any remaining bottlenecks are.
Right now on a 24 core server I'm seeing 48,000 searches/second and 50% CPU utilization. Adding more clients only seems to increase the overall latency, but CPU usage and throughput don't increase any further.
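As a rough worked check of those numbers (a sketch only; `utilization_summary` is a hypothetical helper, and the last figure assumes perfectly linear scaling, which nothing measured so far confirms):

```python
def utilization_summary(cores, utilization, throughput):
    """Return (busy cores, ops/second per busy core, ops/second if all cores were busy).

    The third value assumes throughput scales linearly with busy cores,
    which is an idealization rather than a measurement.
    """
    busy_cores = cores * utilization
    per_core = throughput / busy_cores
    linear_ceiling = per_core * cores
    return busy_cores, per_core, linear_ceiling
```

With `cores=24`, `utilization=0.5` and `throughput=48000`, this gives 12 busy cores at 4,000 searches/second each, which lines up with the "no more than 12 cores" observation earlier in this message. Under the linear-scaling assumption the same box would top out at 96,000 searches/second.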
Have you tried something we did on ADS: removing all the processing so that you simply have a mock LDAP server, where only the network part is studied?
I.e., we just send back a mock response when a request has been received.
It helps to focus only on the network layer.
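A minimal sketch of that kind of mock server, in Python, might look like the following. It is neither the ADS test harness nor back-null: `CANNED_REPLY`, `MockHandler`, `start_mock_server` and `probe` are hypothetical names, the reply is placeholder bytes rather than a valid LDAP PDU, and nothing is decoded at all.

```python
import socket
import socketserver
import threading

# Placeholder payload: NOT a real LDAP response, just fixed bytes.
CANNED_REPLY = b"MOCK-RESPONSE"


class MockHandler(socketserver.BaseRequestHandler):
    """Answer every chunk received with the same canned reply, without parsing it."""

    def handle(self):
        while True:
            data = self.request.recv(4096)
            if not data:  # client closed the connection
                break
            self.request.sendall(CANNED_REPLY)


def start_mock_server(host="127.0.0.1"):
    """Start the mock server on an ephemeral port in a background thread."""
    server = socketserver.ThreadingTCPServer((host, 0), MockHandler)
    server.daemon_threads = True
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def probe(server, payload=b"anything"):
    """Send one request to the mock server and return whatever comes back."""
    with socket.create_connection(server.server_address) as conn:
        conn.sendall(payload)
        return conn.recv(4096)
```

Because the handler never looks at the request, any load generated against it exercises only the accept, read and write path, which is the part being isolated.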
Yes, we have back-null for that purpose.
(sadly, as the PDU decoding is a costly operation, you may have to take that into account).
Sure, anything that's in the frontend is going to be included, but that's OK, there's no synchronization overhead in the decoders.