David Boreham wrote:
Howard Chu wrote:
Kinda interesting - with hoard this shows us processing 23000 entries per second in the single-threaded case, vs only 3521 per second with glibc malloc.
It is possible that you are seeing the syndrome I wrote about here: http://www.bozemanpass.com/info/linux/malloc/Linux_Heap_Contention.html. AFAIK the poorly behaving code is still present in today's glibc malloc: it exhibits lock contention when blocks are freed by a different thread than the one that allocated them (more precisely, when the two threads hash to different arenas). The problem affects UMich-derived LDAP servers because cache entries make up a significant proportion of the heap traffic, and when the cache fills up they tend to be freed by a different thread than the one that allocated them.
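For anyone who hasn't hit this before, here is a minimal standalone sketch of the allocation pattern in question (hypothetical code, not slapd; names are invented). One thread allocates blocks and hands them off, a second thread frees them, so under glibc each free() may have to take the lock on the allocating thread's arena:

/* gcc -O2 -pthread cross_thread_free.c */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NBLOCKS 100000

static void *blocks[NBLOCKS];
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  qcond = PTHREAD_COND_INITIALIZER;
static int nready;              /* number of blocks handed off so far */

/* "Cache fill": allocate entries and hand them to the other thread. */
static void *producer(void *arg)
{
    (void)arg;
    for (int i = 0; i < NBLOCKS; i++) {
        void *p = malloc(512);          /* allocated in this thread's arena */
        pthread_mutex_lock(&qlock);
        blocks[nready++] = p;
        pthread_cond_signal(&qcond);
        pthread_mutex_unlock(&qlock);
    }
    return NULL;
}

/* "Cache eviction": free entries that some other thread allocated. */
static void *consumer(void *arg)
{
    (void)arg;
    for (int i = 0; i < NBLOCKS; i++) {
        pthread_mutex_lock(&qlock);
        while (i >= nready)
            pthread_cond_wait(&qcond, &qlock);
        void *p = blocks[i];
        pthread_mutex_unlock(&qlock);
        free(p);    /* cross-thread free: may contend on the producer's arena lock */
    }
    return NULL;
}

int main(void)
{
    pthread_t prod, cons;
    pthread_create(&prod, NULL, producer, NULL);
    pthread_create(&cons, NULL, consumer, NULL);
    pthread_join(prod, NULL);
    pthread_join(cons, NULL);
    puts("done");
    return 0;
}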
Yes, that's a factor I had considered. One of the approaches I tested was tagging each entry with the thread ID that allocated it and deferring the frees until the owning thread comes back to dispose of them. I wasn't really happy with this approach because it meant the entry cache size would not be strictly regulated, though on a busy server it's likely that all of the frees would happen within a reasonably short time.
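Roughly, the idea looks like the sketch below (this is just an illustration of the scheme described above, not the actual patch; all names are invented). Each entry records its allocating thread; a foreign thread never calls free() directly but pushes the entry onto the owner's pending list, which the owner drains the next time it runs:

#include <pthread.h>
#include <stdlib.h>

typedef struct entry {
    struct entry *next;       /* links pending entries awaiting free */
    pthread_t     owner;      /* thread that called entry_alloc()    */
    char          data[256];  /* stand-in for the real entry payload */
} entry_t;

typedef struct owner_ctx {
    pthread_mutex_t lock;
    entry_t        *pending;  /* entries "freed" by other threads */
} owner_ctx_t;

static entry_t *entry_alloc(void)
{
    entry_t *e = calloc(1, sizeof(*e));
    if (e) e->owner = pthread_self();
    return e;
}

/* Free immediately if we own the entry, otherwise defer to the owner. */
static void entry_free(entry_t *e, owner_ctx_t *owner_ctx)
{
    if (pthread_equal(e->owner, pthread_self())) {
        free(e);
        return;
    }
    pthread_mutex_lock(&owner_ctx->lock);
    e->next = owner_ctx->pending;
    owner_ctx->pending = e;
    pthread_mutex_unlock(&owner_ctx->lock);
}

/* Called periodically by the owning thread to dispose of deferred frees. */
static void entry_reclaim(owner_ctx_t *ctx)
{
    pthread_mutex_lock(&ctx->lock);
    entry_t *list = ctx->pending;
    ctx->pending = NULL;
    pthread_mutex_unlock(&ctx->lock);

    while (list) {
        entry_t *next = list->next;
        free(list);               /* freed by the same thread that allocated it */
        list = next;
    }
}

The drawback mentioned above is visible here: entries sitting on a pending list are logically evicted but still occupy memory until the owner runs entry_reclaim(), so the cache size is only loosely bounded.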
Still, free() contention alone doesn't explain the huge performance gap in the single-threaded case. However, judging from the fact that the very first glibc search often runs faster than the second and subsequent ones, I'd say there are other significant costs in glibc's free(). The most likely culprit is that glibc returns memory to the OS too frequently. Hoard and Umem also return memory to the OS, but they keep more extensive free lists. Tcmalloc never returns memory to the OS.
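If that is the culprit, glibc does expose knobs for it. As a rough sketch (arbitrary example values, not tuned recommendations), raising the trim and mmap thresholds via mallopt() makes glibc keep freed memory cached in the process rather than handing it back to the kernel right away:

#include <malloc.h>   /* glibc-specific: mallopt(), M_* constants */

void tune_glibc_malloc(void)
{
    /* Keep up to 64 MB of free heap before trimming it back via sbrk(). */
    mallopt(M_TRIM_THRESHOLD, 64 * 1024 * 1024);

    /* Serve allocations up to 1 MB from the heap rather than mmap(),
     * since mmap'd blocks are unmapped (returned to the OS) on free(). */
    mallopt(M_MMAP_THRESHOLD, 1024 * 1024);
}

The same thresholds can be set without touching the code via the MALLOC_TRIM_THRESHOLD_ and MALLOC_MMAP_THRESHOLD_ environment variables, which might be an easy way to test whether OS give-back really accounts for the gap.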
What's also interesting is that for Hoard, Umem, and Tcmalloc, the multi-threaded query times are consistently about 2x slower than the single-threaded case. The 2x slowdown makes sense: the multi-threaded run is doing 4x as much work on only a dual-core CPU, so spreading it across two cores should take roughly twice as long. This kinda says that the cost of malloc is overshadowed by the overhead of thread scheduling.
But for glibc the multi-threaded times are actually faster than the single-threaded case. Since glibc's costs are being partially hidden there, it seems the multiple threads are actually getting some benefit from the entry cache. Still, the difference between glibc and the other allocators is so dramatic that this small effect is just academic.