Testing on an 8-socket AMD server with dual-core Opteron 885 processors (16 cores total) and on a Sun T5120 (UltraSPARC T2 "Niagara", 8 cores, 64 hardware threads) has shown that our current frontend code performs very poorly with more than 16 server threads.
For example, on the AMD system with all 16 cores allocated, performance was still slower than on the 4-socket AMD server with dual-core Opteron 875 processors, despite twice the cores and a significant clock-speed advantage. Testing also showed that in this configuration at least one of the 16 cores was always 100% idle. Basically, the frontend cannot hand out work to the worker threads fast enough.
Rather than using a single mutex to serialize all access to the thread pool, I think we need a separate queue per worker thread. The frontend can then operate in single-producer mode, where only the listener thread is allowed to submit jobs into the pool, and each worker only touches its own queue, significantly reducing mutex contention.
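As a rough illustration of what I have in mind, here is a minimal single-producer/single-consumer ring per worker, sketched in C11. All names here (spsc_ring, job_t, RING_SIZE) are made up for the example and don't correspond to anything in the tree. The listener thread is the only writer of head and the owning worker is the only writer of tail, so neither side ever needs a lock:

    /* Sketch only: one SPSC ring per worker; all names are illustrative. */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define RING_SIZE 1024              /* must be a power of two */

    typedef struct job job_t;           /* opaque work item */

    typedef struct {
        _Atomic size_t head;            /* written only by the listener */
        _Atomic size_t tail;            /* written only by the worker */
        job_t *slots[RING_SIZE];
    } spsc_ring;

    /* Called only from the listener thread. Returns false when the ring
     * is full, so the listener can try another worker instead. */
    static bool ring_push(spsc_ring *r, job_t *job)
    {
        size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
        size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

        if (head - tail == RING_SIZE)
            return false;               /* full */

        r->slots[head & (RING_SIZE - 1)] = job;
        /* Release: the slot write becomes visible before the new head. */
        atomic_store_explicit(&r->head, head + 1, memory_order_release);
        return true;
    }

    /* Called only from the owning worker thread. Returns NULL when empty. */
    static job_t *ring_pop(spsc_ring *r)
    {
        size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        size_t head = atomic_load_explicit(&r->head, memory_order_acquire);

        if (head == tail)
            return NULL;                /* empty */

        job_t *job = r->slots[tail & (RING_SIZE - 1)];
        atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
        return job;
    }

The listener could push jobs round-robin or to the least-loaded worker and simply move on to the next ring when one reports full; how an idle worker sleeps and gets woken again (condition variable, eventfd, etc.) is a separate question this sketch doesn't cover.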
Ideally we would arrange things such that any data structure is only ever written by a single thread, and all other threads only perform reads against it. (And in the best case, only one other thread needs to perform that read.) By eliminating memory ownership changes and unnecessary cache line sharing, we can dramatically reduce the cache coherency traffic.
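To make the false-sharing point concrete, the layout below (again purely illustrative, with an assumed 64-byte line size and an arbitrary MAX_WORKERS) keeps each worker's counters on their own cache line, so workers never invalidate each other's lines and the only coherency traffic is the occasional read by the frontend:

    #include <stdalign.h>
    #include <stdint.h>

    #define CACHE_LINE   64     /* assumed; query the CPU in real code */
    #define MAX_WORKERS  64     /* illustrative upper bound */

    /* Per-worker statistics: written only by the owning worker, read
     * (rarely) by the frontend. alignas plus the trailing pad give each
     * array element its own cache line. */
    typedef struct {
        alignas(CACHE_LINE) uint64_t jobs_done;
        uint64_t bytes_processed;
        char pad[CACHE_LINE - 2 * sizeof(uint64_t)];
    } worker_stats;

    static worker_stats stats[MAX_WORKERS];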