I presume we can close this ITS now.
I've been running some tests on a quad-processor AMD system and seeing a lot of mutex contention in the frontend. It looks like the current threadpool and connection manager architecture is a bad fit for a NUMA system like this. I'm planning to add support for multiple thread pools (the idea being one per CPU) and multiple listener threads to slapd.
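Roughly the shape of what I have in mind, sketched with plain pthreads and CPU affinity rather than slapd's actual thread-pool API; the names (cpu_pool, worker_main) are made up for illustration and the work-queue logic is elided:

/* Illustrative sketch only: one pool of worker threads per CPU, with each
 * pool's workers pinned to that CPU via pthread_setaffinity_np(), so a
 * pool's queue, connection state and malloc arenas stay on local memory.
 * Not the slapd code. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdlib.h>
#include <unistd.h>

typedef struct cpu_pool {
    int        cpu;        /* CPU this pool is bound to */
    pthread_t *workers;    /* worker threads, all pinned to 'cpu' */
    int        nworkers;
} cpu_pool;

static void *worker_main(void *arg)
{
    cpu_pool *p = arg;
    cpu_set_t set;

    /* Pin this worker to its pool's CPU. */
    CPU_ZERO(&set);
    CPU_SET(p->cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    /* ... pull operations off this pool's private queue and run them
     * (elided in this sketch) ... */
    return NULL;
}

int main(void)
{
    int ncpus = sysconf(_SC_NPROCESSORS_ONLN);
    cpu_pool *pools = calloc(ncpus, sizeof(*pools));

    /* One listener thread plus one pool per CPU; each listener hands work
     * only to its local pool, so no cross-socket mutex traffic. */
    for (int i = 0; i < ncpus; i++) {
        pools[i].cpu = i;
        pools[i].nworkers = 4;   /* arbitrary for the sketch */
        pools[i].workers = calloc(pools[i].nworkers, sizeof(pthread_t));
        for (int j = 0; j < pools[i].nworkers; j++)
            pthread_create(&pools[i].workers[j], NULL, worker_main, &pools[i]);
    }
    pause();  /* placeholder for the real event loop */
    return 0;
}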
As a first step, after 2.4.6 is released, I'm going to unifdef the SLAPD_LIGHTWEIGHT_DISPATCHER symbol and delete the old dispatcher code.
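For anyone unfamiliar with the mechanics: running unifdef -DSLAPD_LIGHTWEIGHT_DISPATCHER over a source file keeps the #ifdef branch unconditionally and drops the #else branch along with the preprocessor lines. A simplified illustration of the kind of conditional involved (the function names are placeholders, not real slapd symbols):

#include <stdio.h>

static void dispatch_to_workqueue(int fd) { printf("queue fd %d\n", fd); }
static void dispatch_inline(int fd)       { printf("handle fd %d inline\n", fd); }

static void on_readable(int fd)
{
#ifdef SLAPD_LIGHTWEIGHT_DISPATCHER
    /* new dispatcher: the listener only queues the fd for a worker
     * thread and goes straight back to waiting on the sockets */
    dispatch_to_workqueue(fd);
#else
    /* old dispatcher: the listener does more of the work itself
     * before returning to the event loop */
    dispatch_inline(fd);
#endif
}

int main(void) { on_readable(7); return 0; }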
Based on some experimental changes I've already made, I see 25K auths/sec with the current code vs. 39K auths/sec using separate thread pools.
Howard Chu wrote:
Based on some experimental changes I've already made, I see 25K auths/sec with the current code vs. 39K auths/sec using separate thread pools.
With some more tweaking and a faster client load generator added to the mix, I've coaxed 42,000 auths/sec out of the box (42,004, actually; that was a peak over a 30-second interval, and 41,800 is more typical over a sustained run). Analyzing the profile traces is interesting: the ethernet driver is the biggest CPU consumer at around 8.6%, followed by strval2str at 3.8%, then pthread_mutex_lock at 2.8%. As a practical matter we're already doing pretty well when the kernel/network overhead is greater than any of our own code. At these levels our code only gets about 690% of the CPU, another 100% is completely consumed by interrupt handling, and the remaining 10% is idle time (which I believe in this case is really time a CPU spent blocked waiting on an already-taken mutex).
It's pretty amazing to watch the processor status in top and see an entire CPU consumed by interrupt processing. That kind of points to some major walls down the road; while any 1GHz or faster processor today can saturate 100Mbps ethernet, it takes much faster processors to fully utilize 1Gbps ethernet. And unlike bulk data transfer protocols like ftp or http, we won't get any benefit from using jumbo frames in typical LDAP deployments.
Howard Chu wrote:
It's pretty amazing to watch the processor status in top and see an entire CPU consumed by interrupt processing. That kind of points to some major walls down the road; while any 1GHz or faster processor today can saturate 100Mbps ethernet, it takes much faster processors to fully utilize 1Gbps ethernet. And unlike bulk data transfer protocols like ftp or http, we won't get any benefit from using jumbo frames in typical LDAP deployments.
This may be a NIC hardware issue. I get 80k pps using ttcp with two very fast test machines that have Tigon NICs, which would match your 40k auths/s. This guy's work suggests that the Intel hardware is much more capable: http://pdos.csail.mit.edu/~rtm/e1000/. And in fact, when I test between two boxes with e1000 NICs but much slower CPUs than the first boxen, I get 250k pps.
David Boreham wrote:
Howard Chu wrote:
It's pretty amazing to watch the processor status in top and see an entire CPU consumed by interrupt processing. That kind of points to some major walls down the road; while any 1GHz or faster processor today can saturate 100Mbps ethernet, it takes much faster processors to fully utilize 1Gbps ethernet. And unlike bulk data transfer protocols like ftp or http, we won't get any benefit from using jumbo frames in typical LDAP deployments.
This may be a NIC hardware issue. I get 80k pps using ttcp with two very fast test machines that have Tigon NICs, which would match your 40k auths/s.
Actually an "authentication" in this case is 5 packets: Anonymous search for uid=foo, response, result, then Simple Bind + result. So, 200k pps. But still, the gigabit ethernet medium maxes out at over 1.4M pps for "small" packets, so 200k is nowhere near saturation.
This guy's work suggests that the Intel hardware is much more capable : http://pdos.csail.mit.edu/~rtm/e1000/ And in fact, when I test between two boxes with e1000 nics, but much slower CPUs than the first boxen, I get 250k pps.
Interesting link, thanks. Of course that's using a much older platform. The machine I'm testing on uses a Broadcom BCM5704 interface with Linux 2.6's tg3 (Tigon3) driver. (Running a 2.6.23.1 kernel now.) I haven't peeked inside the driver to see if it's got any tweaks for delayed interrupts, but it sounds like something worth checking. (I hope it's not the Linksys switch we're using; most of these switches seem to be able to handle at least 700k pps though.)
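A quick way to check the driver's interrupt-coalescing settings from userspace rather than reading the source is the ETHTOOL_GCOALESCE ioctl (this is what the ethtool utility's -c option reports). Throwaway code, not part of slapd; "eth0" is just an assumed default interface name:

#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>
#include <string.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const char *dev = argc > 1 ? argv[1] : "eth0";
    struct ethtool_coalesce ec;
    struct ifreq ifr;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, dev, IFNAMSIZ - 1);
    memset(&ec, 0, sizeof(ec));
    ec.cmd = ETHTOOL_GCOALESCE;       /* query current coalescing params */
    ifr.ifr_data = (char *)&ec;

    if (ioctl(fd, SIOCETHTOOL, &ifr) == 0)
        printf("%s: rx-usecs=%u rx-frames=%u\n",
               dev, ec.rx_coalesce_usecs, ec.rx_max_coalesced_frames);
    else
        perror("ETHTOOL_GCOALESCE");
    return 0;
}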
Reminds me of the old leapfrogging games with Excelan ethernet cards and their onboard TCP engines (15+ years ago), which allowed machines of that time to hit a whopping 250KB/sec on 10Mbit ethernet. A couple of years later the main CPUs got fast enough to do 500KB/sec without using the cards' "accelerators." I didn't see another NIC with an onboard TCP engine for many years after that, but they're on the market again now...