It's looking like the single mutex-controlled thread pool is a major bottleneck in the slapd frontend. Thinking it over, I've hit on a number of different ideas, but none without drawbacks.
Ideally we would get rid of the distinction between listener threads and worker threads, and only have worker threads. Each thread would be responsible for a fraction of the open sockets, and service them directly instead of queueing work into a thread pool. This would essentially mimic the behavior of SLAPD_NO_THREADS, just duplicated N times.
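To make the idea concrete, here is a rough sketch of what "each thread owns a fraction of the sockets" could look like. Nothing here is actual slapd code: worker_t, worker_for_fd(), worker_add_fd() and worker_service_once() are hypothetical names, the fd-modulo assignment is just one possible way to split connections, and the read() stands in for parsing and executing a whole request.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

#define MAX_FDS_PER_WORKER 64   /* hypothetical limit, for the sketch only */

/* Hypothetical per-worker state. Each worker owns a private set of
 * sockets, so the owning thread can touch it without taking a lock. */
typedef struct worker {
    struct pollfd fds[MAX_FDS_PER_WORKER];
    nfds_t        nfds;
    unsigned long ops_handled;
} worker_t;

/* Pick the worker that owns a descriptor. Plain modulo is an assumption;
 * any stable mapping from connection to thread would do. */
static size_t worker_for_fd(int fd, size_t nworkers)
{
    assert(fd >= 0 && nworkers > 0);
    return (size_t)fd % nworkers;
}

/* Hand a new connection to its owning worker. Returns 0, or -1 if full. */
static int worker_add_fd(worker_t *w, int fd)
{
    if (w->nfds >= MAX_FDS_PER_WORKER)
        return -1;
    w->fds[w->nfds].fd = fd;
    w->fds[w->nfds].events = POLLIN;
    w->fds[w->nfds].revents = 0;
    w->nfds++;
    return 0;
}

/* One iteration of a worker's loop: wait for readable sockets and service
 * each one inline, with no hand-off to a shared pool. Returns the number
 * of requests serviced, 0 on timeout, or -1 on poll() error. */
static int worker_service_once(worker_t *w, int timeout_ms)
{
    int ready = poll(w->fds, w->nfds, timeout_ms);
    if (ready <= 0)
        return ready;

    int serviced = 0;
    for (nfds_t i = 0; i < w->nfds; i++) {
        if (w->fds[i].revents & POLLIN) {
            char buf[512];
            /* Stand-in for reading and running one request to completion. */
            if (read(w->fds[i].fd, buf, sizeof buf) > 0) {
                w->ops_handled++;
                serviced++;
            }
        }
    }
    return serviced;
}
```

Each of the N threads would just call worker_service_once() in a loop on its own worker_t, which is the "SLAPD_NO_THREADS, duplicated N times" picture. The sketch ignores how new connections get handed to a worker that is currently blocked in poll().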
The upsides of such an approach are numerous: a whole slew of locks would disappear from the design entirely, and we'd be keeping work local to the CPU that originally received a request. The obvious downside is that Abandon/Cancel ops would never be useful (just as they are not useful in single-threaded slapd today). That is, since the thread responsible for a connection is always busy processing the current operation, it won't get back to reading the next request on that connection (e.g. Abandon) until the current op has already finished.
A possible solution to that would be to always do a quick poll in send_ldap_response() etc. to check for new requests on a connection before sending another reply.
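As a sketch of what that quick check might look like: the helper below does a zero-timeout poll() on the connection's descriptor, so it never blocks. request_pending() is a hypothetical name, not an existing slapd function, and how send_ldap_response() would actually react to a pending request is left open.

```c
#include <assert.h>
#include <poll.h>

/* Hypothetical helper: returns 1 if the descriptor has unread input,
 * 0 if not, -1 on poll() error. The zero timeout means it never blocks. */
static int request_pending(int fd)
{
    struct pollfd pfd;

    assert(fd >= 0);
    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    int rc = poll(&pfd, 1, 0);
    if (rc < 0)
        return -1;
    return (rc > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
}
```

The caller would check this before each reply and, if it returns 1, read the new request (e.g. an Abandon for the op in progress) before continuing. Note that poll() only sees data still sitting in the kernel; anything already pulled into a userspace buffer would have to be checked separately.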