Revisiting an old thread...
http://www.openldap.org/lists/openldap-devel/200504/msg00066.html
I'm definitely seeing our current listener running out of steam on servers with more than 12 or so cores, and some work in this direction will definitely help. First it would be a good idea to classify all of the tasks that the current listener manages, before deciding how to divide them among multiple threads.
The listener is responsible for many events right now:
    signal/shutdown processing
    idle timeout checks
    write timeout checks
    runqueue scheduling
    listener socket events
    threadpool pauses
    client socket read events
    client socket write events
Splitting the client socket handling across multiple threads will bring the greatest improvement in scalability. Just need to check our thinking and make sure the remaining division of labor still makes sense.
There are two cases that annoy me in our current design - why don't we just dedicate a thread to each listener socket, and let it block on accept() ? That would eliminate a bit of churn in the current select() workload.
Likewise, why don't we just let writer threads block in write(), instead of having them ask the listener to listen for writability on their socket? Or, if we're using non-blocking sockets, why don't we let the writer threads block in their own select call, instead of relying on the central thread to do the select and re-dispatch?
The first, obvious answer is this: when threads are blocked in system calls like accept(), we can't simply wake them up again for shutdown events or other situations. I believe the obvious fix here is to use select() in each thread, waiting for both the target fd and the wake_sds fd which is written to whenever a signal is caught. Off the top of my head I'm not sure, when several threads are selecting on the same fd, if they all receive a Readable event or if only one of them will. Anyone know?
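Roughly what I have in mind, as a sketch only (listener_fd / wake_fd are illustrative names here, not the actual slapd variables):

#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

/* Sketch: one thread per listener, blocking in select() on both its
 * listener fd and the read end of the wakeup pipe, so shutdown can
 * still interrupt it. */
void *listener_thread(void *arg)
{
    int listener_fd = ((int *)arg)[0];
    int wake_fd     = ((int *)arg)[1];

    for (;;) {
        fd_set rfds;
        int maxfd = listener_fd > wake_fd ? listener_fd : wake_fd;

        FD_ZERO(&rfds);
        FD_SET(listener_fd, &rfds);
        FD_SET(wake_fd, &rfds);

        if (select(maxfd + 1, &rfds, NULL, NULL, NULL) < 0)
            continue;                   /* EINTR and friends */

        if (FD_ISSET(wake_fd, &rfds))
            break;                      /* shutdown requested */

        if (FD_ISSET(listener_fd, &rfds)) {
            int cfd = accept(listener_fd, NULL, NULL);
            if (cfd >= 0) {
                /* hand cfd off to one of the client-socket threads */
            }
        }
    }
    return NULL;
}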
I don't think the remaining tasks really involve much overhead. So ideally, we can handle the idle/write timeout and runqueue scheduling in a single thread, which can also be responsible for the main signal/shutdown processing.
The listener sockets can each have their own dedicated thread, selecting on their listener socket and the signal fd.
A small number of threads can handle the bulk of the client socket events. We would continue to use a power-of-two descriptor table, and a power-of-two set of threads here. The division of labor would simply be thread# = FD % number of threads. (There's no need to split the current connection table into multiple arrays; we just divvy things up so that a given thread only accesses its own slots in the array.)
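In code that's nothing more than (sketch, with a made-up thread count):

/* Sketch: map an fd to one of a power-of-two set of event threads, so
 * each thread only ever touches its own slots in the connection table. */
#define NUM_EVENT_THREADS 8     /* must be a power of two */

static int fd_to_thread(int fd)
{
    return fd & (NUM_EVENT_THREADS - 1);    /* same as fd % NUM_EVENT_THREADS */
}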
Comments?
On 7/31/10 3:28 AM, Howard Chu wrote:
Revisiting an old thread...
http://www.openldap.org/lists/openldap-devel/200504/msg00066.html
I'm definitely seeing our current listener running out of steam on servers with more than 12 or so cores, and some work in this direction will definitely help. First it would be a good idea to classify all of the tasks that the current listener manages, before deciding how to divide them among multiple threads.
The listener is responsible for many events right now:
    signal/shutdown processing
    idle timeout checks
    write timeout checks
    runqueue scheduling
    listener socket events
    threadpool pauses
    client socket read events
    client socket write events
Splitting the client socket handling across multiple threads will bring the greatest improvement in scalability. Just need to check our thinking and make sure the remaining division of labor still makes sense.
There are two cases that annoy me in our current design - why don't we just dedicate a thread to each listener socket, and let it block on accept() ? That would eliminate a bit of churn in the current select() workload.
Likewise, why don't we just let writer threads block in write(), instead of having them ask the listener to listen for writability on their socket? Or, if we're using non-blocking sockets, why don't we let the writer threads block in their own select call, instead of relying on the central thread to do the select and re-dispatch?
If your writer threads block until the socket is writeable, how many threads will you need if many clients never read the data? You might quickly run out of available writer threads... The select() is supposed to tell you when a socket is ready to accept more writes; then it's time to pick a writing thread to push data into this socket. You just need a pool of writing threads, and when a writing thread has pushed data into the socket, it becomes available for the next write.
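Roughly this kind of thing, as a sketch only (pending_write and the re-arm step are just illustrative):

#include <stddef.h>
#include <unistd.h>

/* Sketch of the model above: the select() thread notices a socket became
 * writable and hands a job like this to the writer pool; the writer thread
 * never blocks in write(), it just pushes what the socket will take. */
struct pending_write {
    int         fd;
    const char *buf;
    size_t      len;
};

void writer_job(struct pending_write *pw)
{
    ssize_t n = write(pw->fd, pw->buf, pw->len);
    if (n > 0) {
        pw->buf += n;
        pw->len -= n;
    }
    if (pw->len > 0) {
        /* short write: ask the select() thread to watch the fd for
         * writability again, and re-queue this job when it fires */
    }
    /* this writer thread is now free for the next job */
}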
The first, obvious answer is this: when threads are blocked in system calls like accept(), we can't simply wake them up again for shutdown events or other situations. I believe the obvious fix here is to use select() in each thread, waiting for both the target fd and the wake_sds fd which is written to whenever a signal is caught. Off the top of my head I'm not sure, when several threads are selecting on the same fd, if they all receive a Readable event or if only one of them will. Anyone know?
Why would you have more than one select() ? Wouldn't it be better to have one thread processing the select() and dispatching the operation to a pool of threads ?
All in all, what consumes CPU in a server is most certainly the processing of incoming and outgoing requests, not the processing of the select() operation, no ?
I'm maybe a bit tainted by Java, but it's really based on the same mechanisms under the hood...
Emmanuel Lecharny wrote:
On 7/31/10 3:28 AM, Howard Chu wrote:
Revisiting an old thread...
http://www.openldap.org/lists/openldap-devel/200504/msg00066.html
I'm definitely seeing our current listener running out of steam on servers with more than 12 or so cores, and some work in this direction will definitely help. First it would be a good idea to classify all of the tasks that the current listener manages, before deciding how to divide them among multiple threads.
The listener is responsible for many events right now:
    signal/shutdown processing
    idle timeout checks
    write timeout checks
    runqueue scheduling
    listener socket events
    threadpool pauses
    client socket read events
    client socket write events
Splitting the client socket handling across multiple threads will bring the greatest improvement in scalability. Just need to check our thinking and make sure the remaining division of labor still makes sense.
There are two cases that annoy me in our current design - why don't we just dedicate a thread to each listener socket, and let it block on accept() ? That would eliminate a bit of churn in the current select() workload.
I'm not actually convinced that accept() traffic is a significant portion of our load, so maybe that part can be left as-is.
Likewise, why don't we just let writer threads block in write(), instead of having them ask the listener to listen for writability on their socket? Or, if we're using non-blocking sockets, why don't we let the writer threads block in their own select call, instead of relying on the central thread to do the select and re-dispatch?
If your writer threads block until the socket is writeable, how many threads will you need if many clients never read the data? You might quickly run out of available writer threads... The select() is supposed to tell you when a socket is ready to accept more writes; then it's time to pick a writing thread to push data into this socket. You just need a pool of writing threads, and when a writing thread has pushed data into the socket, it becomes available for the next write.
Unfortunately, while what you're describing is perfectly sane, that's not the way things work right now. Our writer state isn't self-contained, so we can't just package it up and put it on a queue somewhere. Right now, a single thread is occupied for the duration of a single operation. If it blocks on a write, it's stuck until the write timeout expires. Obviously it would be nicer if we had a saner structure here, but we don't...
The first, obvious answer is this: when threads are blocked in system calls like accept(), we can't simply wake them up again for shutdown events or other situations. I believe the obvious fix here is to use select() in each thread, waiting for both the target fd and the wake_sds fd which is written to whenever a signal is caught. Off the top of my head I'm not sure, when several threads are selecting on the same fd, if they all receive a Readable event or if only one of them will. Anyone know?
Why would you have more than one select() ? Wouldn't it be better to have one thread processing the select() and dispatching the operation to a pool of threads ?
That's what we have right now, and as far as I can see it's a bottleneck that prevents us from utilizing more than 12 cores. (I could be wrong, and the bottleneck might actually be in the thread pool manager. I haven't got precise enough measurements yet to know for sure.)
Here's the situation: suppose you have thousands of clients connected and active. Even if you have CPUs to spare, the number of connections you can acknowledge and dispatch is limited by the speed of the single thread that's processing select(). Even if all it does is walk thru the list of active descriptors and dispatch a job to the thread pool for each one, it's only possible to dispatch a fixed number of ops/second, no matter how many other CPUs there are.
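To make that concrete, the dispatch loop is essentially the following, and its iteration rate is the ceiling no matter how many workers exist (sketch only; pool_submit/handle_read are illustrative stand-ins, not our actual routines):

#include <sys/select.h>

/* Illustrative stand-ins for the real dispatch routines. */
extern void handle_read(int fd);
extern void pool_submit(void (*fn)(int), int fd);

/* Sketch: the single event thread walks the ready set and dispatches;
 * total dispatch rate is bounded by this one loop, no matter how many
 * worker threads or cores exist. */
void event_loop(fd_set active_fds, int max_fd)
{
    for (;;) {
        fd_set rfds = active_fds;       /* copy of the watched set */
        int fd, n = select(max_fd + 1, &rfds, NULL, NULL, NULL);

        if (n <= 0)
            continue;

        for (fd = 0; fd <= max_fd; fd++) {
            if (FD_ISSET(fd, &rfds))
                pool_submit(handle_read, fd);   /* hand off to the pool */
        }
    }
}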
Right now on a 24 core server I'm seeing 48,000 searches/second and 50% CPU utilization. Adding more clients only seems to increase the overall latency, but CPU usage and throughput don't increase any further.
Also in the test I'm looking at, some of the issues I pointed to above aren't even happening. E.g., none of the operations are blocking on write. It's all about handling read events, nothing else. (So again, maybe it's OK to leave the current write behavior as-is.)
As I've noted in the past, I can get double the throughput by running two slapds concurrently, so it's not a question of network or I/O resources. Just that a single listener thread (and/or a single mutex controlling the thread pool) will only scale so far, and no further.
All in all, what consumes CPU in a server is most certainly the processing of incoming and outgoing requests, not the processing of the select() operation, no ?
In relative costs, sure, but eventually even the small costs add up.
I'm maybe a bit tainted by Java, but it's really based on the same mechanisms under the hood...
Heh. Indeed. Concurrency issues are the same, no matter what language you use to dress them up.
One other realization I had is that the current design makes it very easy to build slapd as either threaded or non-threaded. Pushing too much activity into other threads would break the non-threaded builds. But at this point, with even cellphones going dual-core, I have to wonder how important it is to maintain compatibility for non-threaded slapd builds. ??
On 8/2/10 5:26 AM, Howard Chu wrote:
<snip/> Why would you have more than one select() ? Wouldn't it be better to have one thread processing the select() and dispatching the operation to a pool of threads ?
That's what we have right now, and as far as I can see it's a bottleneck that prevents us from utilizing more than 12 cores. (I could be wrong, and the bottleneck might actually be in the thread pool manager. I haven't got precise enough measurements yet to know for sure.)
Here's the situation: suppose you have thousands of clients connected and active. Even if you have CPUs to spare, the number of connections you can acknowledge and dispatch is limited by the speed of the single thread that's processing select(). Even if all it does is walk thru the list of active descriptors and dispatch a job to the thread pool for each one, it's only possible to dispatch a fixed number of ops/second, no matter how many other CPUs there are.
I'm a bit surprised that the select() processing *is* the bottleneck... All in all, it's just -internally- a matter of processing a bit field to see which bit is set to 1, and then getting back the FD that is associated with this bit. You must have some other tasks running that create this bottleneck.
I will have to check OpenLDAP code here...
Right now on a 24 core server I'm seeing 48,000 searches/second and 50% CPU utilization. Adding more clients only seems to increase the overall latency, but CPU usage and throughput don't increase any further.
Have you tried to do something we did on ADS: remove all the processing and have just a mock LDAP server, so that only the network part is exercised?
ie, we just send back a mock response when a request has been received.
It helps to focus only on the network layer.
(sadly, as the PDU decoding is a costly operation, you may have to take that into account).
Also in the test I'm looking at, some of the issues I pointed to above aren't even happening. E.g., none of the operations are blocking on write. It's all about handling read events, nothing else. (So again, maybe it's OK to leave the current write behavior as-is.)
One idea Jean-François Arcand devised was to have two select()s: one for the reads, one for the writes. I'm not sure it's really interesting, but it may be worth investigating.
Another idea, and this is what we have implemented in ADS, is to have more than one select() thread for reads/writes (in fact, we compute the number of processors and create as many threads as we have cores, plus one; each thread processes its own select(), of course). I'm not convinced that, in our case, it helps a lot, but it may be helpful in your case if you can't or don't want to modify the way requests are being processed in OpenLDAP.
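The sizing logic amounts to something like this (a sketch in C for the sake of this list, assuming _SC_NPROCESSORS_ONLN is available; selector_main is whatever the per-thread select() loop is):

#include <pthread.h>
#include <unistd.h>

/* Sketch: create one select() thread per core, plus one, as we do in ADS. */
void start_selectors(void *(*selector_main)(void *))
{
    long i, ncpu = sysconf(_SC_NPROCESSORS_ONLN);
    if (ncpu < 1)
        ncpu = 1;

    for (i = 0; i < ncpu + 1; i++) {
        pthread_t tid;
        if (pthread_create(&tid, NULL, selector_main, (void *)i) == 0)
            pthread_detach(tid);
    }
}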
As I've noted in the past, I can get double the throughput by running two slapds concurrently, so it's not a question of network or I/O resources. Just that a single listener thread (and/or a single mutex controlling the thread pool) will only scale so far, and no further.
All in all, what consumes CPU in a server is most certainly the processing of incoming and outgoing requests, not the processing of the select() operation, no ?
In relative costs, sure, but eventually even the small costs add up.
I'm maybe a bit tainted by Java, but it's really based on the same mechanisms under the hood...
Heh. Indeed. Concurrency issues are the same, no matter what language you use to dress them up.
One other realization I had is that the current design makes it very easy to build slapd as either threaded or non-threaded. Pushing too much activity into other threads would break the non-threaded builds. But at this point, with even cellphones going dual-core, I have to wonder how important it is to maintain compatibility for non-threaded slapd builds. ??
Java does not suffer from such a limitation, that's for sure, so we don't have to care about primitive devices which don't support threads... As Android - and supposedly iOS - supports threads, it should not be such a constraint.
Now, if you have an installed base of users who still depend on a single-threaded server, then maybe it's time to split the server in two and give the single-threaded server its own branch? Even if you fix some bugs in trunk, it's very likely that you will be able to apply the patches to the single-threaded branch, and I'm not sure it has a long life expectancy anyway :)
Interesting problems :)
On Sun, 1 Aug 2010, Howard Chu wrote:
One other realization I had is that the current design makes it very easy to build slapd as either threaded or non-threaded. Pushing too much activity into other threads would break the non-threaded builds. But at this point, with even cellphones going dual-core, I have to wonder how important it is to maintain compatibility for non-threaded slapd builds.
I'm not so sure if it's important to maintain slapd non-threaded, but it would still be nice to see libldap and other client-side needs without threads. It's becoming rarely needed from a hardware perspective, but the OS ecosystem to match seems to lag (either in implementation timeline or implementation quality).
On 8/2/10 2:00 PM, Aaron Richton wrote:
On Sun, 1 Aug 2010, Howard Chu wrote:
One other realization I had is that the current design makes it very easy to build slapd as either threaded or non-threaded. Pushing too much activity into other threads would break the non-threaded builds. But at this point, with even cellphones going dual-core, I have to wonder how important it is to maintain compatibility for non-threaded slapd builds.
I'm not so sure if it's important to maintain slapd non-threaded, but it would still be nice to see libldap and other client-side needs without threads. It's becoming rarely needed from a hardware perspective, but the OS ecosystem to match seems to lag (either in implementation timeline or implementation quality).
IMO, the client lib should be single-threaded if needed, but the default must be multi-threaded. Most of the 'clients' are in fact applications with many contexts, so not having a multi-threaded client would probably be a killer.
Increasing the number of pollers/readers should help significantly on massively multi-CPU/core systems, but it requires some fine tuning, because you have to weigh that against the cost of processing those requests; otherwise all you're going to do is saturate your work queue, so it's good to have some mechanism to cap it and apply the brakes when needed as well. The real problems come from the cost of synchronization on those systems, though. A while back I had some play time with a fully loaded T5440 [256 h/w threads] and did manage to get it 85% utilized with OpenDS. There was of course quite a number of threads involved, and thus various synchronization issues associated with them at that scale and architecture that normally make an insignificant difference on smaller systems - like when you have multiple pollers/readers putting things on the work queue and multiple worker threads taking things from it. The more creative you can get at making things safely lockless, the better.
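To illustrate the cap / apply-the-brakes part: the simplest version is just a bounded queue where the pollers block on submit when it's full (naive locked sketch with made-up names, assuming the mutex and condvars are initialized elsewhere; the lockless variant is where it gets interesting):

#include <pthread.h>

#define QUEUE_CAP 1024

struct work_queue {
    void           *items[QUEUE_CAP];
    int             head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t  not_full, not_empty;
};

/* pollers call this; blocking here when the queue is full is the "brake" */
void queue_submit(struct work_queue *q, void *job)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == QUEUE_CAP)
        pthread_cond_wait(&q->not_full, &q->lock);
    q->items[q->tail] = job;
    q->tail = (q->tail + 1) % QUEUE_CAP;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* worker threads call this */
void *queue_take(struct work_queue *q)
{
    void *job;
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->not_empty, &q->lock);
    job = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return job;
}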
On 02/08/2010 08:34, Emmanuel Lécharny wrote:
Here's the situation: suppose you have thousands of clients connected and active. Even if you have CPUs to spare, the number of connections you can acknowledge and dispatch is limited by the speed of the single thread that's processing select(). Even if all it does is walk thru the list of active descriptors and dispatch a job to the thread pool for each one, it's only possible to dispatch a fixed number of ops/second, no matter how many other CPUs there are.
I'm a bit surprised that the select() processing *is* the bottleneck... All in all, it's just -internally- a matter of processing a bit field to see which bit is set to 1, and then getting back the FD that is associated with this bit. You must have some other tasks running that create this bottleneck.
I will have to check OpenLDAP code here...
Right now on a 24 core server I'm seeing 48,000 searches/second and 50% CPU utilization. Adding more clients only seems to increase the overall latency, but CPU usage and throughput don't increase any further.
Anton Bobrov wrote:
Increasing the number of pollers/readers should help significantly on massively multi-CPU/core systems, but it requires some fine tuning, because you have to weigh that against the cost of processing those requests; otherwise all you're going to do is saturate your work queue, so it's good to have some mechanism to cap it and apply the brakes when needed as well. The real problems come from the cost of synchronization on those systems, though. A while back I had some play time with a fully loaded T5440 [256 h/w threads] and did manage to get it 85% utilized with OpenDS. There was of course quite a number of threads involved, and thus various synchronization issues associated with them at that scale and architecture that normally make an insignificant difference on smaller systems - like when you have multiple pollers/readers putting things on the work queue and multiple worker threads taking things from it. The more creative you can get at making things safely lockless, the better.
Unfortunately, at this time writing lockless algorithms means resorting to heavily machine-dependent code and we've been trying to stick to standardized e.g. POSIX APIs. It would be pretty easy to write a CPU-cache-friendly producer/consumer queue in assembly language for a few specific architectures, and maybe doable using compiler-specific intrinsics, but our portability would go out the window.
(Which is not to say that the thought hasn't crossed my mind, numerous times already. I still have a very nice implementation I wrote in sparc assembly language kicking around here, but it seems that only x86-64 matters these days; that and ARM...)
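Just to show what the intrinsics route looks like: a single-producer/single-consumer ring using gcc's __sync barrier, which is exactly the kind of compiler dependence I'm talking about (sketch only, not something I'm proposing to commit):

#include <stddef.h>

#define RING_SIZE 1024                  /* power of two */

/* Sketch: SPSC ring; only the producer writes tail, only the consumer
 * writes head, so a full barrier before publishing each index suffices
 * on the targets we'd care about. */
struct spsc_ring {
    void * volatile slots[RING_SIZE];
    volatile size_t head;               /* advanced by the consumer */
    volatile size_t tail;               /* advanced by the producer */
};

int ring_push(struct spsc_ring *r, void *item)      /* producer only */
{
    if (r->tail - r->head == RING_SIZE)
        return 0;                       /* full */
    r->slots[r->tail & (RING_SIZE - 1)] = item;
    __sync_synchronize();               /* publish the slot before the index */
    r->tail++;
    return 1;
}

void *ring_pop(struct spsc_ring *r)                 /* consumer only */
{
    void *item;
    if (r->head == r->tail)
        return NULL;                    /* empty */
    item = r->slots[r->head & (RING_SIZE - 1)];
    __sync_synchronize();               /* read the slot before releasing it */
    r->head++;
    return item;
}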
On 02/08/2010 08:34, Emmanuel Lécharny wrote:
Here's the situation: suppose you have thousands of clients connected and active. Even if you have CPUs to spare, the number of connections you can acknowledge and dispatch is limited by the speed of the single thread that's processing select(). Even if all it does is walk thru the list of active descriptors and dispatch a job to the thread pool for each one, it's only possible to dispatch a fixed number of ops/second, no matter how many other CPUs there are.
I'm a bit surprised that the select() processing *is* the bottleneck... All in all, it's just -internally- a matter of processing a bit field to see which bit is set to 1, and then getting back the FD that is associated with this bit. You must have some other tasks running that create this bottleneck.
I will have to check OpenLDAP code here...
Right now on a 24 core server I'm seeing 48,000 searches/second and 50% CPU utilization. Adding more clients only seems to increase the overall latency, but CPU usage and throughput don't increase any further.
Howard Chu writes:
Unfortunately, at this time writing lockless algorithms means resorting to heavily machine-dependent code and we've been trying to stick to standardized e.g. POSIX APIs. It would be pretty easy to write a CPU-cache-friendly producer/consumer queue in assembly language for a few specific architectures, and maybe doable using compiler-specific intrinsics, but our portability would go out the window.
Portability just means there should be a fall-back implementation which uses the ldap_pvt_thread_*() locking functions. Or if it is made a separate library from OpenLDAP, it could use macros which default to pthread_*() but could be overridden to use ldap_pvt_thread_*().
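I.e. something along these lines (sketch only; the LFQ_* names are made up):

/* Fall-back lock macros: default to pthreads so the library stands alone,
 * but let slapd override them with the ldap_pvt_thread_*() equivalents. */
#ifndef LFQ_MUTEX_T
#include <pthread.h>
#define LFQ_MUTEX_T     pthread_mutex_t
#define LFQ_LOCK(m)     pthread_mutex_lock(m)
#define LFQ_UNLOCK(m)   pthread_mutex_unlock(m)
#endif

/* In slapd one would define instead:
 *   #define LFQ_MUTEX_T   ldap_pvt_thread_mutex_t
 *   #define LFQ_LOCK(m)   ldap_pvt_thread_mutex_lock(m)
 *   #define LFQ_UNLOCK(m) ldap_pvt_thread_mutex_unlock(m)
 */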
On 8/3/10 4:52 PM, Hallvard B Furuseth wrote:
Howard Chu writes:
Unfortunately, at this time writing lockless algorithms means resorting to heavily machine-dependent code and we've been trying to stick to standardized e.g. POSIX APIs. It would be pretty easy to write a CPU-cache-friendly producer/consumer queue in assembly language for a few specific architectures, and maybe doable using compiler-specific intrinsics, but our portability would go out the window.
Portability just means there should be a fall-back implementation which uses the ldap_pvt_thread_*() locking functions. Or if it is made a separate library from OpenLDAP, it could use macros which default to pthread_*() but could be overridden to use ldap_pvt_thread_*().
No wonder why I prefer Java :) ...
Macros... Sad memories :/
Emmanuel Lécharny wrote:
On 8/2/10 5:26 AM, Howard Chu wrote:
<snip/> Why would you have more than one select() ? Wouldn't it be better to have one thread processing the select() and dispatching the operation to a pool of threads ?
That's what we have right now, and as far as I can see it's a bottleneck that prevents us from utilizing more than 12 cores. (I could be wrong, and the bottleneck might actually be in the thread pool manager. I haven't got precise enough measurements yet to know for sure.)
Here's the situation: suppose you have thousands of clients connected and active. Even if you have CPUs to spare, the number of connections you can acknowledge and dispatch is limited by the speed of the single thread that's processing select(). Even if all it does is walk thru the list of active descriptors and dispatch a job to the thread pool for each one, it's only possible to dispatch a fixed number of ops/second, no matter how many other CPUs there are.
I'm a bit surprised that the select() processing *is* the bottleneck... All in all, it's just -internally- a matter of processing a bit field to see which bit is set to 1, and then getting back the FD that is associated with this bit. You must have some other tasks running that create this bottleneck.
I will have to check OpenLDAP code here...
Right, it only amounts to around a 10% difference. Still, this is an improvement. I guess we'll need to set up oprofile and get some more detailed numbers to find out where any remaining bottlenecks are.
Right now on a 24 core server I'm seeing 48,000 searches/second and 50% CPU utilization. Adding more clients only seems to increase the overall latency, but CPU usage and throughput don't increase any further.
Have you tried to do something we did on ADS: remove all the processing and have just a mock LDAP server, so that only the network part is exercised?
ie, we just send back a mock response when a request has been received.
It helps to focus only on the network layer.
Yes, we have back-null for that purpose.
(sadly, as the PDU decoding is a costly operation, you may have to take that into account).
Sure, anything that's in the frontend is going to be included, but that's OK, there's no synchronization overhead in the decoders.