David Boreham wrote:
Howard Chu wrote:
What's also interesting is that for Hoard, Umem, and Tcmalloc, the multi-threaded query times are consistently about 2x slower than the single-threaded case. The 2x slowdown makes sense since it's only a dual-core CPU and it's doing 4x as much work. This kinda says that the cost of malloc is overshadowed by the overhead of thread scheduling.
Is it possible that the block stride in the addresses returned by malloc() is affecting cache performance in the glibc case? If the blocks are too close together, I think it is possible to thrash cache lines between cores.
That's a very good question, and I don't have an answer yet. I've been working on some threading extensions for cachegrind so I can investigate that. Unfortunately, its existing infrastructure doesn't lend itself well to tracking multiple caches, so progress has been slow.
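
As a rough way to look at the stride question raised above without cachegrind, the sketch below allocates a run of small blocks and reports how far apart consecutive malloc() results land, and how many adjacent pairs fall in the same cache line. It is only an illustration under assumptions: the 64-byte line size and the helper names same_cache_line and count_shared_lines are made up for this example, and the check runs in a single thread, so it shows only whether the allocator packs blocks closely enough for sharing to be possible, not that cores are actually thrashing a line.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Assumption: 64-byte cache lines. Common on x86, but not universal. */
#define ASSUMED_LINE_SIZE 64u

/* Returns 1 if both addresses fall within the same cache line, else 0. */
static int same_cache_line(const void *a, const void *b, size_t line_size)
{
    assert(line_size != 0);
    return ((uintptr_t)a / line_size) == ((uintptr_t)b / line_size);
}

/*
 * Allocates `count` blocks of `size` bytes, prints the stride between
 * consecutive allocations, and returns how many adjacent pairs start in
 * the same assumed cache line. Returns -1 if the bookkeeping array
 * cannot be allocated.
 */
static int count_shared_lines(size_t count, size_t size)
{
    void **blocks;
    size_t i;
    int shared = 0;

    if (count == 0)
        return 0;

    blocks = malloc(count * sizeof *blocks);
    if (blocks == NULL)
        return -1;

    for (i = 0; i < count; i++)
        blocks[i] = malloc(size);

    for (i = 1; i < count; i++) {
        long long stride;

        if (blocks[i - 1] == NULL || blocks[i] == NULL)
            continue;

        /* Compare as integers; the blocks are unrelated objects. */
        stride = (long long)((intptr_t)blocks[i] - (intptr_t)blocks[i - 1]);
        printf("block %lu: stride %lld bytes\n", (unsigned long)i, stride);

        if (same_cache_line(blocks[i - 1], blocks[i], ASSUMED_LINE_SIZE))
            shared++;
    }

    for (i = 0; i < count; i++)
        free(blocks[i]);   /* free(NULL) is a no-op */
    free(blocks);

    return shared;
}
```

Calling count_shared_lines(16, 16) from main() and linking against each allocator in turn would give a first impression of how tightly each one packs small blocks. A real answer still needs per-core cache tracking, since the blocks only cause trouble when different threads write to them at the same time.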