David Boreham wrote:
Howard Chu wrote:
What's also interesting is that for Hoard, Umem, and Tcmalloc, the multi-threaded query times are consistently about 2x slower than the single-threaded case. The 2x slowdown makes sense since it's only a dual-core CPU and it's doing 4x as much work. This kinda says that the cost of malloc is overshadowed by the overhead of thread scheduling.
Is it possible that the block stride in the addresses returned by malloc() is affecting cache performance in the glibc case? If they are too close, I think it is possible to thrash cache lines between cores.
I've been tinkering with oprofile and some of the performance counters, etc. I see that with the current entry_free(), which returns an entry to the head of the free list, the same structs get re-used over and over. This is cache-friendly on a single-core machine but causes cache contention on a multi-core machine (because a just-freed entry tends to get reused in a different thread). Putting freed entries at the tail of the list avoids the contention in this case, but it sort of makes things equally bad for all the cores. (I.e., everyone has to go out to main memory for the structures; nobody gets any benefit from the cache.) For the moment I'm going to leave it with entry_free() returning entries to the head of the list.
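
To make the head-vs-tail trade-off concrete, here is a minimal sketch of the two policies for a mutex-protected singly-linked free list. This is illustrative only, not the actual slapd code: Entry's real fields are elided, and the names (e_next, free_head, free_tail, entry_free_head(), entry_free_tail(), entry_alloc()) are invented for the example.

/* Illustrative sketch only -- not the real entry_free()/entry cache code.
 * Two policies for returning an entry to a mutex-protected singly-linked
 * free list. */
#include <pthread.h>
#include <stddef.h>

typedef struct Entry Entry;
struct Entry {
    Entry *e_next;          /* free-list link (hypothetical field) */
    /* ... ~80 bytes of real fields on a 64-bit build ... */
};

static Entry *free_head, *free_tail;
static pthread_mutex_t free_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Head insertion: the entry just freed is the next one handed out.
 * Cache-warm on one core, but on a multi-core box the next allocation
 * often happens in another thread, so the line bounces between cores. */
void entry_free_head(Entry *e)
{
    pthread_mutex_lock(&free_mutex);
    e->e_next = free_head;
    free_head = e;
    if (free_tail == NULL)
        free_tail = e;
    pthread_mutex_unlock(&free_mutex);
}

/* Tail insertion: a freed entry goes to the back of the queue, so by the
 * time it is reused it has likely aged out of every core's cache --
 * no contention, but no reuse benefit either. */
void entry_free_tail(Entry *e)
{
    pthread_mutex_lock(&free_mutex);
    e->e_next = NULL;
    if (free_tail)
        free_tail->e_next = e;
    else
        free_head = e;
    free_tail = e;
    pthread_mutex_unlock(&free_mutex);
}

/* Allocation pops from the head in either scheme. */
Entry *entry_alloc(void)
{
    Entry *e;
    pthread_mutex_lock(&free_mutex);
    e = free_head;
    if (e) {
        free_head = e->e_next;
        if (free_head == NULL)
            free_tail = NULL;
    }
    pthread_mutex_unlock(&free_mutex);
    return e;   /* caller falls back to malloc() when NULL */
}

The only difference is which end entry_free() pushes onto, which decides whether the next allocation gets a cache-warm struct (and the cross-core ping-pong that goes with it) or a cold one.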
Our current Entry structure is 80 bytes on a 64-bit machine. (Only 32 bytes on a 32-bit machine.) That's definitely not doing us any favors, since 80 bytes isn't a multiple of the 64-byte cache line and adjacently allocated entries end up sharing lines; I may try padding it up to 128 bytes to see how that affects things. Unfortunately, while it may be more CPU cache-friendly, it will definitely cost us as far as how many entries we can keep cached in RAM. Another possibility would be to interleave the prealloc list. (E.g., 5 stripes of stride 8 would keep everything on 128-byte boundaries.)
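
For the padding option, here is a minimal sketch of a 128-byte entry slot and the prealloc carving. Again this is illustrative rather than real slapd code: ENTRY_SLOT, PREALLOC_ENTRIES, prealloc_free, and entry_prealloc() are invented names, and the gcc/clang aligned attribute just stands in for however the padding would actually be done.

/* Illustrative sketch only -- not the real slapd structures; names like
 * ENTRY_SLOT, PREALLOC_ENTRIES and entry_prealloc() are invented here. */
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>
#include <stddef.h>

#define CACHE_LINE       64     /* assumed line size on this hardware   */
#define ENTRY_SLOT       128    /* pad the 80-byte entry to two lines   */
#define PREALLOC_ENTRIES 4096

typedef struct Entry Entry;
struct Entry {
    Entry *e_next;              /* free-list link                       */
    /* ... ~80 bytes of real fields on a 64-bit build ... */
} __attribute__((aligned(ENTRY_SLOT)));  /* gcc/clang: sizeof rounds up to 128 */

static Entry *prealloc_free;

/* Carve one block into the free list.  With 128-byte slots two entries
 * can never share a cache line, so threads touching different entries
 * cannot false-share; the cost is 128/80 = 1.6x the RAM per cached entry. */
static int entry_prealloc(void)
{
    Entry *block;
    size_t i;

    if (posix_memalign((void **)&block, ENTRY_SLOT,
                       PREALLOC_ENTRIES * sizeof(Entry)) != 0)
        return -1;

    for (i = 0; i < PREALLOC_ENTRIES; i++) {
        block[i].e_next = prealloc_free;
        prealloc_free = &block[i];
    }
    return 0;
}

The striping alternative would keep the 80-byte size and instead link the prealloc block into the free list in an interleaved order; 8 entries x 80 bytes = 640 bytes = 5 x 128-byte lines, which is presumably the arithmetic behind the 5-stripes/stride-8 figures, so entries handed out back-to-back never sit in the same line.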