Testing on an 8-socket AMD server with Opteron 885 dual-core processors (16 cores total) and a Sun T5120 (T2 Niagara 8 cores, 64 hardware threads) has shown that our current frontend code is performing very poorly with more than 16 server threads.
E.g. on the AMD system with 16 cores allocated, performance was still slower than on the 4-socket AMD server with Opteron 875 dual-core processors (despite 2x the cores and a significant clock-speed advantage). Testing also showed that in this configuration, at least one of the 16 cores was always 100% idle. Basically, the frontend cannot hand out work fast enough to the worker threads.
Rather than using a single mutex to control all accesses into the thread pool, I think we need to have separate queues per worker thread. The frontend can operate in single-producer mode where only the single listener thread is allowed to submit jobs into the pool. The workers can just access their own individual work queues, thus significantly reducing mutex contention.
Ideally we would arrange things such that any data structure is only ever written by a single thread, and all other threads only perform reads against it. (And in the best case, only one other thread needs to perform that read.) By eliminating memory ownership changes and unnecessary cache line sharing, we can dramatically reduce the cache coherency traffic.
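One possible shape for this (purely illustrative names, not the actual libldap_r identifiers): each worker owns a queue and a small mutex that is only ever touched by two threads, the listener and that worker.

    /* Sketch only: per-worker work queues.  The single listener thread is the
     * only producer, and each worker consumes only from its own queue, so each
     * per-queue mutex is shared by at most two threads. */
    #include <pthread.h>

    typedef struct task {
        struct task *next;
        void (*func)(void *arg);
        void *arg;
    } task_t;

    typedef struct worker_queue {
        pthread_mutex_t mtx;    /* contended by the listener and one worker only */
        pthread_cond_t  cv;
        task_t *head, *tail;
        int len;                /* pending count, for picking the shortest queue */
    } worker_queue_t;

    /* Listener side: single producer for this queue. */
    void queue_submit(worker_queue_t *q, task_t *t)
    {
        t->next = NULL;
        pthread_mutex_lock(&q->mtx);
        if (q->tail) q->tail->next = t; else q->head = t;
        q->tail = t;
        q->len++;
        pthread_cond_signal(&q->cv);
        pthread_mutex_unlock(&q->mtx);
    }

    /* Worker side: single consumer for this queue. */
    task_t *queue_take(worker_queue_t *q)
    {
        task_t *t;
        pthread_mutex_lock(&q->mtx);
        while (q->head == NULL)
            pthread_cond_wait(&q->cv, &q->mtx);
        t = q->head;
        q->head = t->next;
        if (q->head == NULL) q->tail = NULL;
        q->len--;
        pthread_mutex_unlock(&q->mtx);
        return t;
    }

With a layout like this the listener and one worker only collide on a short critical section, instead of every thread in the pool serializing on pool->ltp_mutex.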
Howard Chu writes:
I think we need to have separate queues per worker thread. The frontend can operate in single-producer mode where only the single listener thread is allowed to submit jobs into the pool. The workers can just access their own individual work queues, thus significantly reducing mutex contention.
Won't otherwise-fast requests then be stuck waiting behind the occasional slow request? In our server I'm more concerned with worst-case times than the average time.
Hallvard B Furuseth wrote:
Howard Chu writes:
I think we need to have separate queues per worker thread. The frontend can operate in single-producer mode where only the single listener thread is allowed to submit jobs into the pool. The workers can just access their own individual work queues, thus significantly reducing mutex contention.
Won't otherwise-fast requests then be stuck waiting behind the occasional slow request? In our server I'm more concerned with worst-case times than the average time.
I guess so. We can try to minimize that by assigning jobs to the shortest queues. On a lightly loaded server, there will probably be at least one empty queue, so jobs will always be dispatched quickly. On a heavily loaded server, all jobs are going to experience longer delays anyway, so I don't think the worst case can deviate very far from the average case.
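A hedged sketch of that dispatch rule (the names and the per-worker pending counters are hypothetical, not existing slapd code): the listener scans the counts, takes the first idle worker it finds, and otherwise falls back to the shortest queue.

    /* Sketch only: pick the least-loaded worker.  Counts are read with relaxed
     * atomics; a stale read just means a slightly worse placement. */
    #include <limits.h>
    #include <stdatomic.h>

    #define NUM_WORKERS 16

    /* Incremented by the listener on enqueue, decremented by the worker when
     * it finishes a task. */
    static atomic_int pending[NUM_WORKERS];

    int pick_worker(void)
    {
        int best = 0, best_len = INT_MAX;
        for (int i = 0; i < NUM_WORKERS; i++) {
            int len = atomic_load_explicit(&pending[i], memory_order_relaxed);
            if (len == 0)
                return i;       /* an empty queue: dispatch immediately */
            if (len < best_len) {
                best = i;
                best_len = len;
            }
        }
        return best;
    }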
Howard Chu writes:
I guess so. We can try to minimize that by assigning jobs to the shortest queues. On a lightly loaded server, there will probably be at least one empty queue, so jobs will always be dispatched quickly.
Great...
On a heavily loaded server, all jobs are going to experience longer delays anyway, so I don't think the worst case can deviate very far from the average case.
It can if one uses one slow and one fast backend.
Or with operations like adding a member to a 20000-member posixGroup. I just did that, it took 0.7 seconds with back-bdb. In the meantime, slapd sent 350 other results (not counting search result entries) and accepted 65 connections. Mostly for operations against another database. I don't know how similar the relative times would be on a heavily loaded server though.
BTW, is the problem that each operation locks pool->ltp_mutex 2-3 times, or is it the amount of time it is kept locked? Most tests which pool_<submit/wrapper>() do with ltp_mutex locked could be precomputed.
Hallvard B Furuseth wrote:
Howard Chu writes:
I guess so. We can try to minimize that by assigning jobs to the shortest queues. On a lightly loaded server, there will probably be at least one empty queue, so jobs will always be dispatched quickly.
Great...
On a heavily loaded server, all jobs are going to experience longer delays anyway, so I don't think the worst case can deviate very far from the average case.
It can if one uses one slow and one fast backend.
True. Well as Rick suggested, we don't have to go to the extreme of one queue per thread, we can go part way and use one queue per N threads.
Or with operations like adding a member to a 20000-member posixGroup. I just did that, it took 0.7 seconds with back-bdb.
Did you try that with sorted values?
In the meantime, slapd sent 350 other results (not counting search result entries) and accepted 65 connections. Mostly for operations against another database. I don't know how similar the relative times would be on a heavily loaded server though.
We could identify slow operations and avoid them. Just stamp each op with a counter when it gets queued. On each pass thru the queues looking for the shortest queue, we can also notice if the active op's stamp is too old. E.g., with 16 threads running an equal distribution of jobs, all the active ops' stamps should be within a span of 16. Any stamp outside that range indicates an op that has run for longer than the average time.
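Roughly like this (hypothetical names, assuming one global dispatch counter and a per-worker record of the stamp of the op it is currently running):

    /* Sketch only: stamp ops with a dispatch counter to spot long runners.
     * With NUM_WORKERS threads and an even distribution, the active stamps
     * should span at most NUM_WORKERS; anything lagging further behind has
     * been running longer than average. */
    #include <stdatomic.h>

    #define NUM_WORKERS 16

    static atomic_long dispatch_counter;            /* bumped once per dispatch */
    static atomic_long active_stamp[NUM_WORKERS];   /* stamp of the op worker w is running */

    /* Listener: stamp an op as it is handed to worker w. */
    long stamp_op(int w)
    {
        long stamp = atomic_fetch_add(&dispatch_counter, 1);
        atomic_store(&active_stamp[w], stamp);
        return stamp;
    }

    /* Listener: while scanning the queues, flag workers stuck on an old op. */
    int worker_looks_slow(int w)
    {
        return atomic_load(&dispatch_counter) - atomic_load(&active_stamp[w]) > NUM_WORKERS;
    }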
BTW, is the problem that each operation locks pool->ltp_mutex 2-3 times, or is it the amount of time it is kept locked? Most tests which pool_<submit/wrapper>() do with ltp_mutex locked could be precomputed.
Both I think. I haven't set up Dtrace on the T5120 yet to get better info, but oprofile on Linux shows that pthread_mutex_lock is taking too much CPU time (vying with the ethernet driver for #1 consumer). And realistically, a single shared resource like this is just a really bad idea.
Howard Chu writes:
True. Well as Rick suggested, we don't have to go to the extreme of one queue per thread, we can go part way and use one queue per N threads.
Sounds OK. But I know little about scheduling, so I guess I'll mostly stay out of that and your suggestion below for now.
Or with operations like adding a member to a 20000-member posixGroup. I just did that, it took 0.7 seconds with back-bdb.
Did you try that with sorted values?
Argh, I knew we'd forgotten something when we set up that new server... Thanks for the tip.
(...)
BTW, is the problem that each operation locks pool->ltp_mutex 2-3 times, or is it the amount of time it is kept locked? Most tests which pool_<submit/wrapper>() do with ltp_mutex locked could be precomputed.
Both I think.
Then I'll at least reduce pool_wrapper a bit. The for(;;) can become:

    for (;;) {
        task = LDAP_STAILQ_FIRST(pool->ltp_pending_listptr);
        if (task == NULL) {
            if (pool->ltp_close_thread)
                break;  /* !ltp_pause && (FINISHING or too many threads) */
            ldap_pvt_thread_cond_wait(&pool->ltp_cond, &pool->ltp_mutex);
            continue;
        }
        <rest of loop untouched>;
    }

ltp_pending_listptr == (ltp_pause ? &(empty list) : &ltp_pending_list). Removed state STOPPING. We can use FINISHING and flush pending_list.
Reducing _submit() gets a bit uglier. The if (...RUNNING etc ...) test and the "create new thread?" test can both be reduced to simple compares, and the ltp_pause test can move into the branch for the latter. I think that'll make the file harder to rearrange later though, so maybe it should wait.
I haven't set up Dtrace on the T5120 yet to get better info, but oprofile on Linux shows that pthread_mutex_lock is taking too much CPU time (vying with the ethernet driver for #1 consumer). And realistically, a single shared resource like this is just a really bad idea.
True enough. Still, slapd has a lot of mutexes. Should perhaps check if this one stands out before rearranging scheduling around it.
Hallvard B Furuseth wrote:
Howard Chu writes:
(...)
BTW, is the problem that each operation locks pool->ltp_mutex 2-3 times, or is it the amount of time it is kept locked? Most tests which pool_<submit/wrapper>() do with ltp_mutex locked could be precomputed.
Both I think.
Then I'll at least reduce pool_wrapper a bit. The for(;;) can become:

    for (;;) {
        task = LDAP_STAILQ_FIRST(pool->ltp_pending_listptr);
        if (task == NULL) {
            if (pool->ltp_close_thread)
                break;  /* !ltp_pause && (FINISHING or too many threads) */
            ldap_pvt_thread_cond_wait(&pool->ltp_cond, &pool->ltp_mutex);
            continue;
        }
        <rest of loop untouched>;
    }

ltp_pending_listptr == (ltp_pause ? &(empty list) : &ltp_pending_list). Removed state STOPPING. We can use FINISHING and flush pending_list.
Reducing _submit() gets a bit uglier. The if (...RUNNING etc ...) test and the "create new thread?" test can both be reduced to simple compares, and the ltp_pause test can move into the branch for the latter. I think that'll make the file harder to rearrange later though, so maybe it should wait.
Reducing the size of the critical section is always a good idea. But right, if it's going to just make things more complicated, we can hold off for now.
I haven't set up Dtrace on the T5120 yet to get better info, but oprofile on Linux shows that pthread_mutex_lock is taking too much CPU time (vying with the ethernet driver for #1 consumer). And realistically, a single shared resource like this is just a really bad idea.
True enough. Still, slapd has a lot of mutexes. Should perhaps check if this one stands out before rearranging scheduling around it.
Using back-null makes this fairly clear. There are no globally shared mutexes in the connection manager at all, so this is the only remaining culprit. Everything else in the processing chain is per-connection, which should be zero contention. Granted, making back-null run fast may do little for back-bdb or other backends, but at the moment it's clear that the frontend is a problem.
Howard Chu wrote:
Hallvard B Furuseth wrote:
Reducing _submit() (...) I think that'll make the file harder to rearrange later though, so maybe it should wait.
Reducing the size of the critical section is always a good idea. But right, if it's going to just make things more complicated, we can hold off for now.
I was just tired and trying too hard. Some of it goes away easily. Testing now, and filed ITS#5364 for tracking.
Howard Chu wrote:
Hallvard B Furuseth wrote:
Then I'll at least reduce pool_wrapper a bit.
Done. Noticed a few other things underway:
- slapd does not set ltp_max_pending; we could drop it or add some slapd.conf option to set it. Or make it ltp_max_tasks instead; then the test in _submit() can be moved inside the branch which mallocs a new task. (The freelist size will enforce the limit.)
- It's slightly less work to maintain number of pending + active tasks than number of pending tasks, or for that matter a count which reduces the "paused or create new thread?" test to a single compare, but if _submit() grows smarter it may need to undo that. (That's the change I thought was best to delay for now.)
- Unless pool_purgekey() gets called from the main thread and expects to update other threads (I have no idea), we can kill thread_keys[] now and support multiple pools. Is that useful?
Replace thread_keys[] with: a circular list with prev/next pointers in the thread contexts, a pool->ltp_ctx_list which points to one of them, and a pool_context() call in pool_purgekey() to get at the list.
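Something along these lines (sketch only; names are illustrative, and the real code would do this under the pool mutex where threads start and exit):

    /* Sketch only: a circular prev/next list of per-thread contexts hanging off
     * the pool, replacing the global thread_keys[] array.  pool_purgekey()
     * would walk the ring starting from pool->ctx_list. */
    #include <stddef.h>

    typedef struct thread_ctx {
        struct thread_ctx *prev, *next;
        /* ... per-thread key/data pairs ... */
    } thread_ctx_t;

    typedef struct pool {
        thread_ctx_t *ctx_list;     /* points at any one member of the ring */
        /* ... mutex, queues, counters ... */
    } pool_t;

    /* Link a new thread's context into the ring (pool mutex held). */
    void ctx_link(pool_t *p, thread_ctx_t *c)
    {
        if (p->ctx_list == NULL) {
            c->prev = c->next = c;
            p->ctx_list = c;
        } else {
            c->next = p->ctx_list;
            c->prev = p->ctx_list->prev;
            c->prev->next = c;
            c->next->prev = c;
        }
    }

    /* Unlink a context when its thread exits (pool mutex held). */
    void ctx_unlink(pool_t *p, thread_ctx_t *c)
    {
        if (c->next == c) {
            p->ctx_list = NULL;
        } else {
            c->prev->next = c->next;
            c->next->prev = c->prev;
            if (p->ctx_list == c)
                p->ctx_list = c->next;
        }
        c->prev = c->next = NULL;
    }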
(...) a single shared resource like this [ltp_mutex] is just a really bad idea.
True enough. Still, slapd has a lot of mutexes. Should perhaps check if this one stands out before rearranging scheduling around it.
Using back-null makes this fairly clear. There are no globally shared mutexes in the connection manager at all, so this is the only remaining culprit.
After grepping around a bit I'm not sure what you mean. There are several other single shared mutexes which seem to be frequently used. I'm far from sure of all of these, but here are some in slapd:
attr.c: attr_mutex, for attr_alloc() and attrs_alloc().
daemon.c: slap_daemon.sd_mutex via slapd_<set/clr>_<read/write>() and in slapd_daemon_task().
daemon.c: slapd_rq.rq_mutex?
operation.c: slap_op_mutex, for slap_op_time().
gmtime_mutex if !HAVE_GMTIME_R, per connection in back-monitor at least. slap_get_csn() does not #ifndef HAVE_GMTIME_R, should it?
Hallvard B Furuseth wrote:
Howard Chu wrote:
Hallvard B Furuseth wrote:
Then I'll at least reduce pool_wrapper a bit.
Done. Noticed a few other things underway:
Seems to have helped with 16 threads on the T5120, back-null peak went from 17,900 auths/sec at 32 connections to 18,500 auths/sec at 96 connections.
Not much improvement with 24 threads, from a peak of 17,500 at 32 connections to a peak of 17,000 at 60 connections. So the overall peak is a little slower, but it can handle a heavier load before maxing out.
I haven't thought about any of the following suggestions yet.
slapd does not set ltp_max_pending; we could drop it or add some slapd.conf option to set it. Or make it ltp_max_tasks instead; then the test in _submit() can be moved inside the branch which mallocs a new task. (The freelist size will enforce the limit.)
It's slightly less work to maintain number of pending + active tasks than number of pending tasks, or for that matter a count which reduces the "paused or create new thread?" test to a single compare, but if _submit() grows smarter it may need to undo that. (That's the change I thought was best to delay for now.)
Unless pool_purgekey() gets called from the main thread and expects to update other threads (I have no idea), we can kill thread_keys[] now and support multiple pools. Is that useful?
Replace thread_keys[] with: a circular list with prev/next pointers in the thread contexts, a pool->ltp_ctx_list which points to one of them, and a pool_context() call in pool_purgekey() to get at the list.
(...) a single shared resource like this [ltp_mutex] is just a really bad idea.
True enough. Still, slapd has a lot of mutexes. Should perhaps check if this one stands out before rearranging scheduling around it.
Using back-null makes this fairly clear. There are no globally shared mutexes in the connection manager at all, so this is the only remaining culprit.
After grepping around a bit I'm not sure what you mean. There are several other single shared mutexes which seem to be frequently used. I'm far from sure of all of these, but here are some in slapd:
attr.c: attr_mutex, for attr_alloc() and attrs_alloc().
Doesn't affect back-null.
daemon.c: slap_daemon.sd_mutex via slapd_<set/clr>_<read/write>() and in slapd_daemon_task().
True.
daemon.c: slapd_rq.rq_mutex?
Doesn't affect my current test configuration.
operation.c: slap_op_mutex, for slap_op_time().
True.
gmtime_mutex if !HAVE_GMTIME_R, per connection in back-monitor at least. slap_get_csn() does not #ifndef HAVE_GMTIME_R, should it?
No, read the comment there.
Howard Chu writes:
Seems to have helped with 16 threads on the T5120, back-null peak went from 17,900 auths/sec at 32 connections to 18,500 auths/sec at 96 connections.
Not much improvement with 24 threads, from a peak of 17,500 at 32 connections to a peak of 17,000 at 60 connections. So the overall peak is a little slower, but it can handle a heavier load before maxing out.
Hm, ±3% with back-null. But I'm not sure how I got the decrease. I've committed a slight cleanup now which might help.
I had swapped the tests before and after '&&' in pool_submit() here, since the 1st now is shorter:

    if (pool->ltp_vary_open_count > 0 &&
        pool->ltp_open_count < pool->ltp_active_count + pool->ltp_pending_count)

The first checks if we may open a thread, the 2nd if we want to. If slapd had less than 24 threads, there would be one extra test.
Could test with usleep(1) when adding/removing a task, first in _submit() and next in _wrapper(), and see which one leads to more mutex contention.
And at the other mutexes I mentioned, for that matter.
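Another way to get numbers, besides the usleep(1) experiment: wrap the interesting mutexes in a counter that records how often a lock attempt actually found the mutex held. This is just a sketch, not existing slapd code.

    /* Sketch only: count contended vs. total lock attempts on a mutex. */
    #include <pthread.h>
    #include <stdatomic.h>

    typedef struct counted_mutex {
        pthread_mutex_t mtx;
        atomic_long total;      /* lock attempts */
        atomic_long blocked;    /* attempts that found the mutex already held */
    } counted_mutex_t;

    void counted_lock(counted_mutex_t *m)
    {
        atomic_fetch_add(&m->total, 1);
        if (pthread_mutex_trylock(&m->mtx) != 0) {
            atomic_fetch_add(&m->blocked, 1);
            pthread_mutex_lock(&m->mtx);    /* fall back to a blocking lock */
        }
    }

    void counted_unlock(counted_mutex_t *m)
    {
        pthread_mutex_unlock(&m->mtx);
    }

Comparing blocked/total for ltp_mutex against sd_mutex, slap_op_mutex and the rest would show which one threads actually fight over.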
(...) There are several other single shared mutexes which seem to be frequently used. I'm far from sure of all of these, but here are some in slapd: (...) operation.c: slap_op_mutex, for slap_op_time().
True.
Note ITS#5370. Can maybe remove it.
Hallvard B Furuseth wrote:
Howard Chu writes:
Seems to have helped with 16 threads on the T5120, back-null peak went from 17,900 auths/sec at 32 connections to 18,500 auths/sec at 96 connections.
Not much improvement with 24 threads, from a peak of 17,500 at 32 connections to a peak of 17,000 at 60 connections. So the overall peak is a little slower, but it can handle a heavier load before maxing out.
Hm, ±3% with back-null. But I'm not sure how I got the decrease. I've committed a slight cleanup now which might help.
The latest code got 19,500 auths/sec at 100 connections for 16 threads. Quite a jump.
For 24 threads, the peak was 17,230 at 52 connections.
I had swapped the tests before and after '&&' in pool_submit() here, since the 1st now is shorter:

    if (pool->ltp_vary_open_count > 0 &&
        pool->ltp_open_count < pool->ltp_active_count + pool->ltp_pending_count)

The first checks if we may open a thread, the 2nd if we want to. If slapd had less than 24 threads, there would be one extra test.
Could test with usleep(1) when adding/removing a task, first in _submit() and next in _wrapper(), and see which one leads to more mutex contention.
And at the other mutexes I mentioned, for that matter.
I think we need to get more detailed profile traces now. But I still have some other work to finish before I can spend any time in depth here.
We're still only getting about 20% total CPU utilization on the Sun T5120. Given how slow a single thread is on this machine, I think we're going to need multiple listener threads to really make effective use of it.
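Just to illustrate the shape of that idea (not slapd code; the real listener also multiplexes with select/epoll, and all names here are made up): several threads can block in accept() on the same listening socket, so the kernel spreads incoming connections across them instead of funnelling everything through one frontend thread.

    /* Sketch only: N listener threads sharing one listening socket. */
    #include <netinet/in.h>
    #include <pthread.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #define NUM_LISTENERS 4

    static void *listener(void *arg)
    {
        int lfd = *(int *)arg;
        for (;;) {
            int cfd = accept(lfd, NULL, NULL);
            if (cfd < 0)
                continue;
            /* ... hand cfd off to this listener's worker queues ... */
            close(cfd);
        }
        return NULL;
    }

    int main(void)
    {
        int i, lfd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in sa;
        pthread_t tid[NUM_LISTENERS];

        memset(&sa, 0, sizeof(sa));
        sa.sin_family = AF_INET;
        sa.sin_addr.s_addr = htonl(INADDR_ANY);
        sa.sin_port = htons(3890);              /* arbitrary test port */
        bind(lfd, (struct sockaddr *)&sa, sizeof(sa));
        listen(lfd, 128);

        for (i = 0; i < NUM_LISTENERS; i++)
            pthread_create(&tid[i], NULL, listener, &lfd);
        for (i = 0; i < NUM_LISTENERS; i++)
            pthread_join(tid[i], NULL);
        close(lfd);
        return 0;
    }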
Howard Chu wrote:
Testing on an 8-socket AMD server with Opteron 885 dual-core processors (16 cores total) and a Sun T5120 (T2 Niagara 8 cores, 64 hardware threads) has shown that our current frontend code is performing very poorly with more than 16 server threads.
You mention one particular hardware architecture here. Looking through the rest of your performance tuning slides that I know of, I'm not seeing a lot of this kind of work done on other architectures. Is this possibly an AMD or Intel limitation? Or maybe there are OS-specific issues?
I only ask because right now our big OpenLDAP servers are on Sun SPARC processors (UltraSPARC-IIIi?) running Solaris 9, and I'm wondering if the problems we've had in the past might be related to our particular hardware/OS choices, as compared to the ones you've been testing with.
Just curious. Thanks!
Brad Knowles wrote:
Howard Chu wrote:
Testing on an 8-socket AMD server with Opteron 885 dual-core processors (16 cores total) and a Sun T5120 (T2 Niagara 8 cores, 64 hardware threads) has shown that our current frontend code is performing very poorly with more than 16 server threads.
You mention one particular hardware architecture here. Looking through the rest of your performance tuning slides that I know of, I'm not seeing a lot of this kind of work done on other architectures. Is this possibly an AMD or Intel limitation? Or maybe there are OS-specific issues?
These factors certainly come into play, but their influence tends to be small, e.g. 10% or so. For example, OpenLDAP on SPARC runs faster on Linux than on Solaris, but it's not a huge difference. The behaviors I'm worrying about here are much worse, e.g. throughput with 24 threads is half of what it is with 16 threads.
I only ask because right now our big OpenLDAP servers are on Sun SPARC processors (UltraSPARC-IIIi?) running Solaris 9, and I'm wondering if the problems we've had in the past might be related to our particular hardware/OS choices, as compared to the ones you've been testing with.
In general, source-level optimizations - tuning algorithms, etc. - benefit all platforms. Some more than others, sure, but problems that show up on one platform are likely to be problems on all platforms. Likewise, a well-tuned installation should perform decently on any platform.
That aside, some platforms are definitely better than others. The SPARC architecture has really been lagging in instructions-per-cycle. I don't really believe that the Niagara design is going to get anywhere either. Aside from embarrassingly parallel workloads (array/vector processing, image processing, etc.), it's pretty hard to write good parallel code that will scale across hundreds of threads, and you're always up against Amdahl's Law. Still, until we've investigated every possible avenue for getting decent performance out of this machine, it's too early to just write it off.
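To put a number on the Amdahl's Law point (figures purely illustrative): if a fraction p of the work parallelizes perfectly across N hardware threads, the speedup is bounded by

    S(N) = \frac{1}{(1 - p) + p/N}

so even at p = 0.95, 64 threads buy at most 1/(0.05 + 0.95/64), about 15.4x; the serial 5% dominates long before all 64 hardware threads are kept busy.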
Brad Knowles wrote:
Howard Chu wrote:
Testing on an 8-socket AMD server with Opteron 885 dual-core processors (16 cores total) and a Sun T5120 (T2 Niagara 8 cores, 64 hardware threads) has shown that our current frontend code is performing very poorly with more than 16 server threads.
You mention one particular hardware architecture here.
Not just one - the Sun T2 Niagara is a quite different architecture from AMD/Intel. We also test on Itanium and PA-RISC, though less often. And of course, anyone with a system that they'd like to see tested is welcome to provide us access to conduct such tests. The 8-socket AMD server I mentioned above is based at a company in Singapore. As long as we can get network access and they can provide load generator machines on the same LAN as the target server, it doesn't really matter where the machine resides.