Rick Jones wrote:
The ethernet controllers are on a hub attached by HyperTransport to a single processor; I don't think you can usefully distribute the interrupts to anything beyond that socket.
Well, it wouldn't necessarily be goodness from the standpoint of the time to do any PIO reads, but from the standpoint of getting packet processing spread out and matched with the core on which the application thread runs, it might be. It all depends, I guess, on how busy that HT link gets and whether the NIC(s) can do the spreading in the first place. I wonder if there are any pre-packaged linux binaries out there to measure HT link utilization?
Haven't looked. Would be interesting to see.
Although, if there is still 10% idle, that probably needs to go next :)
Heh heh. The 80/20 rule hits this with a vengeance. That's "10%" of "800%" total, which means really only about 1.2% of a CPU, which is almost totally indistinguishable from measurement error in the oprofile results. This is all intuition (guesswork) now, no more obvious hot spots left to attack. Maybe if I'm really bored over the holidays I'll spend some time on it. (Not likely.)
My degree is in Applied Math, which means I cannot do math to save my life, but if it was 10 %agepoints of 800 %agepoints total, that means 10% of 8 cores, or 80% of a core, doesn't it?
No, there are 800 %agepoints total for all 8 cores, and 10 out of 800 are idle. That's a total of 10% of a single core, but it's actually distributed across all 8 cores. So 1.25% of each core.
If it were 10% of one core, then it would be 1.25% of 800% IIRC.
Yes.
And besides, the nature of single-threading bottlenecks seems to be that when you go to 16 cores it will be rather more than 2X what it is today :)
Yes, well, I haven't got a 16-core test system handy at the moment...
The changes I just checked in hit the easy stuff, reducing mutex contention by about 25% on a read-only workload with a nice corresponding boost in throughput. I haven't gotten to the multiple pools yet.
The most obvious one was using per-thread free lists for Operations instead of a single global free list. This assumes that threads will get work allocated relatively uniformly; I guess we can add some counts and push free Ops onto a global list if a per-thread list gets too long. At the moment there's no checking, so once a few operations have been performed, Operation allocation occurs with pretty much zero mutex activity. (This is probably redundant effort when using something like Google tcmalloc, but not everyone is using that...)
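In rough C, the shape of the idea is something like the sketch below; Operation's layout, op_alloc()/op_free(), and the GCC-style __thread local are illustrative stand-ins here, not the actual slapd code:

#include <pthread.h>
#include <stdlib.h>

typedef struct Operation {
	struct Operation *o_next;
	/* ... request state ... */
} Operation;

static __thread Operation *op_free_list;	/* per-thread, touched without any lock */
static Operation *op_global_free;		/* shared fallback list */
static pthread_mutex_t op_global_mutex = PTHREAD_MUTEX_INITIALIZER;

static Operation *op_alloc(void)
{
	Operation *op = op_free_list;
	if (op) {				/* common case: no mutex at all */
		op_free_list = op->o_next;
		return op;
	}
	pthread_mutex_lock(&op_global_mutex);	/* rare: refill from the global list */
	op = op_global_free;
	if (op)
		op_global_free = op->o_next;
	pthread_mutex_unlock(&op_global_mutex);
	return op ? op : calloc(1, sizeof(*op));
}

static void op_free(Operation *op)
{
	op->o_next = op_free_list;		/* push back onto this thread's list */
	op_free_list = op;
}

The "counts" mentioned above would go in op_free(): track the per-thread list length and, past some threshold, push the excess back onto op_global_free under the mutex.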
The other change was using per-thread slap_counters for statistics instead of just a global statistics block. In this case, the global slap_counters structure still exists as the head of the list, and its mutex is used to protect list traversals/manipulations. But actual live data is accumulated in per-thread structures chained off the global.
The head mutex is locked when allocating a new per-thread structure, deallocating a structure for an exiting thread, and whenever back-monitor wants to tally up data to present to a client. In normal operation, once the configured number of threads have been created, there will be no accesses to the global at all. If cn=config is used to decrease the number of threads, then some number of threads will exit and so they'll lock the list at that time and accumulate their stats onto the head structure before destroying themselves. So again, in normal operation, if nobody is querying cn=monitor, there will be zero mutex contention due to statistics counters.
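Roughly, and again only as an illustrative sketch (made-up field names, plain pthreads in place of the real thread wrappers), the arrangement looks like this:

#include <pthread.h>
#include <stdlib.h>

typedef struct slap_counters_t {
	struct slap_counters_t *sc_next;	/* list link; the global head anchors the list */
	pthread_mutex_t sc_mutex;		/* head: guards the list; per-thread: guards that block */
	unsigned long sc_ops_completed;
	/* ... more counters ... */
} slap_counters_t;

static slap_counters_t slap_counters = { .sc_mutex = PTHREAD_MUTEX_INITIALIZER };

/* once per thread: allocate a private block and link it in under the head mutex */
slap_counters_t *stats_attach(void)
{
	slap_counters_t *sc = calloc(1, sizeof(*sc));
	pthread_mutex_init(&sc->sc_mutex, NULL);
	pthread_mutex_lock(&slap_counters.sc_mutex);
	sc->sc_next = slap_counters.sc_next;
	slap_counters.sc_next = sc;
	pthread_mutex_unlock(&slap_counters.sc_mutex);
	return sc;
}

/* normal path: each thread bumps only its own block; its mutex is essentially never contended */
void stats_op_completed(slap_counters_t *sc)
{
	pthread_mutex_lock(&sc->sc_mutex);
	sc->sc_ops_completed++;
	pthread_mutex_unlock(&sc->sc_mutex);
}

/* back-monitor: walk the list under the head mutex and tally up the totals */
unsigned long stats_total_completed(void)
{
	unsigned long total = slap_counters.sc_ops_completed;	/* stats folded in by exited threads */
	slap_counters_t *sc;
	pthread_mutex_lock(&slap_counters.sc_mutex);
	for (sc = slap_counters.sc_next; sc; sc = sc->sc_next) {
		pthread_mutex_lock(&sc->sc_mutex);
		total += sc->sc_ops_completed;
		pthread_mutex_unlock(&sc->sc_mutex);
	}
	pthread_mutex_unlock(&slap_counters.sc_mutex);
	return total;
}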
Something else I've been considering is padding certain structures out to a multiple of the cache line size. Not sure I want to write autoconf tests for that; I may just arbitrarily choose 64 bytes and let folks override it at compile time. The Connection array is the most likely candidate here. Something like
#define ALIGN 64
typedef struct { /* whatever */ } real_struct;
/* the union is sized by pad[]: sizeof(real_struct) rounded up to a multiple of ALIGN */
typedef union { real_struct foo; char pad[(sizeof(real_struct)+ALIGN-1) & ~(ALIGN-1)]; } padded_struct;
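Then the Connection array would just be allocated as an array of the padded type, so adjacent entries can't share a cache line. Just a sketch, with nconns standing in for whatever the real sizing is:

	padded_struct *connections = calloc( nconns, sizeof(padded_struct) );
	/* connections[i].foo and connections[i+1].foo now start at least ALIGN bytes apart */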