Howard Chu wrote:
Based on some experimental changes I've already made, I see a difference between 25K auths/sec with the current code and 39K auths/sec using separate thread pools.
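
For anyone who wants a concrete picture of what "separate thread pools" means, here is a rough standalone sketch (not the actual slapd code; the pool_t/pool_submit() names are invented for the example): network events go to one small pool and operations to another, so the two kinds of work stop contending for a single queue mutex.

/*
 * two_pools.c - standalone sketch of the "separate thread pools" idea:
 * one pool services network events, another runs the operations, so the
 * two kinds of work never contend on a single queue mutex.  This is NOT
 * the slapd code; everything here is invented for illustration.
 * Build: cc -o two_pools two_pools.c -lpthread
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct task {
    void (*fn)(void *);
    void *arg;
    struct task *next;
} task_t;

typedef struct pool {
    pthread_mutex_t mtx;
    pthread_cond_t cv;
    task_t *head, *tail;
    int shutdown, nthreads;
    pthread_t *threads;
} pool_t;

/* worker loop: pop a task, run it; exit once shut down and drained */
static void *worker(void *p)
{
    pool_t *pool = p;
    for (;;) {
        pthread_mutex_lock(&pool->mtx);
        while (!pool->head && !pool->shutdown)
            pthread_cond_wait(&pool->cv, &pool->mtx);
        if (!pool->head) {              /* shutdown and queue empty */
            pthread_mutex_unlock(&pool->mtx);
            return NULL;
        }
        task_t *t = pool->head;
        pool->head = t->next;
        if (!pool->head) pool->tail = NULL;
        pthread_mutex_unlock(&pool->mtx);
        t->fn(t->arg);
        free(t);
    }
}

static void pool_init(pool_t *pool, int nthreads)
{
    pool->head = pool->tail = NULL;
    pool->shutdown = 0;
    pool->nthreads = nthreads;
    pthread_mutex_init(&pool->mtx, NULL);
    pthread_cond_init(&pool->cv, NULL);
    pool->threads = malloc(nthreads * sizeof(pthread_t));
    for (int i = 0; i < nthreads; i++)
        pthread_create(&pool->threads[i], NULL, worker, pool);
}

static void pool_submit(pool_t *pool, void (*fn)(void *), void *arg)
{
    task_t *t = malloc(sizeof(*t));
    t->fn = fn; t->arg = arg; t->next = NULL;
    pthread_mutex_lock(&pool->mtx);
    if (pool->tail) pool->tail->next = t; else pool->head = t;
    pool->tail = t;
    pthread_cond_signal(&pool->cv);
    pthread_mutex_unlock(&pool->mtx);
}

/* let queued work finish, then stop and reap the workers */
static void pool_drain(pool_t *pool)
{
    pthread_mutex_lock(&pool->mtx);
    pool->shutdown = 1;
    pthread_cond_broadcast(&pool->cv);
    pthread_mutex_unlock(&pool->mtx);
    for (int i = 0; i < pool->nthreads; i++)
        pthread_join(pool->threads[i], NULL);
    free(pool->threads);
}

static void do_io(void *arg) { printf("net event %ld\n", (long)(intptr_t)arg); }
static void do_op(void *arg) { printf("operation %ld\n", (long)(intptr_t)arg); }

int main(void)
{
    pool_t net_pool, op_pool;
    pool_init(&net_pool, 2);            /* listener/network side */
    pool_init(&op_pool, 6);             /* operation workers */
    for (intptr_t i = 0; i < 4; i++) {
        pool_submit(&net_pool, do_io, (void *)i);
        pool_submit(&op_pool, do_op, (void *)i);
    }
    pool_drain(&net_pool);
    pool_drain(&op_pool);
    return 0;
}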
With some more tweaking and a faster client load generator added to the mix, I've coaxed 42,000 auths/sec out of the box. (42,004, actually.) That was a peak over a 30-second interval; 41,800 is more typical over a sustained run.

Analyzing the profile traces is interesting: the ethernet driver is the biggest CPU consumer at around 8.6%, followed by strval2str at 3.8%, then pthread_mutex_lock at 2.8%. As a practical matter we're already doing pretty well when kernel/network overhead outweighs any of our own code. At these levels we're only getting about 690% of the CPU for our own code; 100% is completely consumed by interrupt handling, and the remaining 10% is idle time (which I believe in this case is really time a CPU spends blocked waiting for an already-held mutex).
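
Spelling out the arithmetic (top reports utilization per CPU, so these figures add up to an 8-way box):

  ~690%  our own code
   100%  interrupt handling
  + 10%  idle / mutex-wait
  -----
   800%  total, i.e. 8 CPUs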
It's pretty amazing to watch the processor status in top and see an entire CPU consumed by interrupt processing. That kind of points to some major walls down the road; while any 1GHz or faster processor today can saturate 100Mbps ethernet, it takes much faster processors to fully utilize 1Gbps ethernet. And unlike bulk data transfer protocols like ftp or http, we won't get any benefit from using jumbo frames in typical LDAP deployments.
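
A rough back-of-envelope shows why (the ~300 bytes per packet below is a guess, not a measurement): each auth is at least one request packet and one response packet, and an LDAP bind PDU is tiny compared even to the standard 1500-byte MTU, never mind a 9000-byte jumbo frame.

  42,000 auths/sec x 2 packets minimum  =  84,000+ packets/sec (plus TCP ACKs)
  84,000 packets/sec x ~300 bytes       ~  25 MB/sec, i.e. only ~200Mbps

The wire is nowhere near full; it's the per-packet interrupt and protocol overhead that burns the CPU, and jumbo frames can't reduce the packet count when every message already fits in a fraction of a normal frame.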