Howard Chu wrote:
I've tested again with libhoard 3.5.1, and it's actually superior to libumem for both speed and fragmentation. Here are some results against glibc 2.3.3, libhoard, and libumem:
I've retested with current HEAD, adding Google's tcmalloc to the mix.

              glibc            hoard            umem             tcmalloc
            size  time       size  time       size  time       size  time
initial      731              741              741              744
single      1368  2:03.02    1809  1:26.67    1778  1:43.49    1453  0:58.77
single      1369  2:23.61    1827  0:25.68    1825  0:18.40    1512  0:21.74
single      1368  1:48.41    1828  0:16.52    1826  0:21.07    1529  0:22.97
single      1368  1:48.59    1829  0:16.59    1827  0:16.95    1529  0:17.07
single      1368  1:48.72    1829  0:16.53    1827  0:16.61    1529  0:17.01
single      1368  1:48.39    1829  0:20.70    1827  0:16.56    1529  0:16.99
single      1368  1:48.63    1830  0:16.56    1828  0:17.48    1529  0:17.29
single      1384  1:48.14    1829  0:16.64    1828  0:22.17    1529  0:16.94
four        1967  1:20.21    1918  0:35.96    1891  0:29.95    1606  0:42.48
four        2002  1:10.58    1919  0:30.07    1911  0:29.00    1622  0:42.38
four        2009  1:33.45    1920  0:42.06    1911  0:40.01    1628  0:40.68
four        1998  1:32.94    1920  0:35.62    1911  0:39.11    1634  0:30.41
four        1995  1:35.47    1920  0:34.20    1911  0:28.40    1634  0:40.80
four        1986  1:34.38    1920  0:28.92    1911  0:31.16    1634  0:40.42
four        1989  1:33.23    1920  0:31.48    1911  0:33.73    1634  0:33.97
four        1999  1:33.04    1920  0:33.47    1911  0:38.33    1634  0:40.91
slapd CPU        26:31.56         8:34.78          8:33.39          9:19.87
The initial size is the size of the slapd process right after startup, with the process totally idle. The id2entry DB is 1.3GB with about 360,000 entries, BDB cache at 512M, entry cache at 70,000 entries, cachefree 7000. The subsequent statistics are the size of the slapd process after running a single search filtering on an unindexed attribute, basically spanning the entire DB. The entries range in size from a few K to a couple megabytes.
The BDB and slapd cache configurations are the same, but the machine now has 4GB of RAM so the entire DB fits in the filesystem buffer cache. As such there is no disk I/O during these tests.
After running the single ldapsearch 4 times, I then ran the same search again with 4 jobs in parallel. There should of course be some process growth to hold the resources for the 3 additional threads (around 60MB is in the right range, since this is an x86_64 system).
This time I ran the tests 8 times each. Basically I was looking for the slapd process size to stabilize at a constant number...
The machine only had 2GB of RAM, and you can see that with glibc malloc the kswapd got really busy in the 4-way run. The times might improve slightly after I add more RAM to the box. But clearly glibc malloc is fragmenting the heap like crazy. The current version of libhoard looks like the winner here.
There was no swap activity (or any other disk activity) this time, so the glibc numbers don't explode like they did before. But even so, the other allocators are 5-6 times faster. The Google folks claim tcmalloc is the fastest allocator they have ever seen. These tests show it is fast, but it is not the fastest. It definitely is the most space-efficient multi-threaded allocator though. It's hard to judge just by the execution times of the ldapsearch commands, so I also recorded the amount of CPU time the slapd process consumed by the end of each test. That gives a clearer picture of the performance differences for each allocator.
These numbers aren't directly comparable to the ones I posted on August 30, because I was using a hacked up RE23 there and used HEAD here.
Going through these tests is a little disturbing; I would expect the single-threaded results to be 100% repeatable, but they're not quite. In one run I saw glibc use up 2.1GB during the 4-way test, and never shrink back down. I later re-ran the same test (because I hadn't recorded the CPU usage the first time around) and got these numbers instead. The other thing that's disturbing is just how bad glibc's malloc really is, even in the single-threaded case, which is supposed to be its ideal situation.
The other thing to note is that for this run with libhoard, I doubled its SUPERBLOCK_SIZE from 64K to 128K. That's probably why it's less space-efficient here than umem, while it was more efficient in the August results. I guess I'll have to recompile that with the original size to see what difference that makes. For now I'd say my preference would be tcmalloc...
Howard Chu wrote:
Howard Chu wrote:
I've tested again with libhoard 3.5.1, and it's actually superior to libumem for both speed and fragmentation. Here are some results against glibc 2.3.3, libhoard, and libumem:
I've retested with current HEAD, adding Google's tcmalloc to the mix.
The initial size is the size of the slapd process right after startup, with the process totally idle. The id2entry DB is 1.3GB with about 360,000 entries, BDB cache at 512M, entry cache at 70,000 entries, cachefree 7000. The subsequent statistics are the size of the slapd process after running a single search filtering on an unindexed attribute, basically spanning the entire DB. The entries range in size from a few K to a couple megabytes.
To be exact, there are 380836 entries in the DB, varying in size from 1K to 10MB.
The other thing to note is that for this run with libhoard, I doubled its SUPERBLOCK_SIZE from 64K to 128K. That's probably why it's less space-efficient here than umem, while it was more efficient in the August results. I guess I'll have to recompile that with the original size to see what difference that makes. For now I'd say my preference would be tcmalloc...
I reran the tests one more time with hoard in its default configuration (64K SUPERBLOCK_SIZE), in single-user mode, with no network connected. The results are here http://highlandsun.com/hyc/#Malloc
Kinda interesting - with hoard this shows us processing 23000 entries per second in the single-threaded case, vs only 3521 per second with glibc malloc. In the multi-threaded case we get peaks up to 51136 entries per second vs 19645/second with glibc. It's understandable that the multi-threaded case may get greater than 2x boost over the single-threaded numbers on this dual-core box, since some portion of entries will be in-cache and won't take any malloc hit at all. But still the difference against glibc is staggering. All of the other throughput benchmarks we've published so far have just used the default libc malloc and we're already faster than all other LDAP products. Re-running those with hoard or umem will be a real eye-opener.
Howard Chu wrote:
Kinda interesting - with hoard this shows us processing 23000 entries per second in the single-threaded case, vs only 3521 per second with glibc malloc.
It is possible that you are seeing the syndrome that I wrote about here: http://www.bozemanpass.com/info/linux/malloc/Linux_Heap_Contention.html AFAIK the poorly behaving code is still present in today's glibc malloc. malloc exhibits lock contention when blocks are freed by a different thread than the one that allocated them (more exactly, when the threads hash to different arenas). The problem affects UMich-derived LDAP servers because cache entries make up a significant proportion of the heap traffic, and they tend to be allocated by a different thread than the one that frees them when the cache fills up.
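For anyone who wants to see the effect outside of slapd, a toy program along these lines reproduces the allocate-in-one-thread, free-in-another traffic pattern (this is purely illustrative code, not anything from the slapd sources); running it under the different allocators via LD_PRELOAD should make the arena contention visible:

/* Toy single-producer/single-consumer loop: one thread malloc()s blocks,
 * the other free()s them, so every free() lands in an arena owned by a
 * different thread.  Build with "cc -O2 -pthread xthread_free.c". */
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

#define RING  (1 << 16)
#define TOTAL (RING * 64L)

static void *ring[RING];
static atomic_long head, tail;   /* monotonically increasing indices */

static void *producer(void *arg)
{
    for (long i = 0; i < TOTAL; i++) {
        while (atomic_load(&head) - atomic_load(&tail) >= RING)
            ;                                    /* ring full: spin */
        ring[atomic_load(&head) % RING] = malloc(64 + (i % 1024));
        atomic_fetch_add(&head, 1);
    }
    return NULL;
}

static void *consumer(void *arg)
{
    for (long i = 0; i < TOTAL; i++) {
        while (atomic_load(&tail) >= atomic_load(&head))
            ;                                    /* ring empty: spin */
        free(ring[atomic_load(&tail) % RING]);
        atomic_fetch_add(&tail, 1);
    }
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}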
David Boreham wrote:
Howard Chu wrote:
Kinda interesting - with hoard this shows us processing 23000 entries per second in the single-threaded case, vs only 3521 per second with glibc malloc.
It is possible that you are seeing the syndrome that I wrote about here: http://www.bozemanpass.com/info/linux/malloc/Linux_Heap_Contention.html AFAIK the poorly behaving code is still present in today's glibc malloc. malloc exhibits lock contention when blocks are freed by a different thread than the one that allocated them (more exactly, when the threads hash to different arenas). The problem affects UMich-derived LDAP servers because cache entries make up a significant proportion of the heap traffic, and they tend to be allocated by a different thread than the one that frees them when the cache fills up.
Yes, that's a factor I had considered. One of the approaches I had tested was tagging each entry with the threadID that allocated it, and deferring the frees until the owning thread comes back to dispose of them. I wasn't really happy with this approach because it meant the entry cache size would not be strictly regulated, though on a busy server it's likely that all of the frees would happen in reasonably short time.
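In outline, the deferred-free scheme looks something like the sketch below (hypothetical names and types, not the code that was actually tested, and the owner lookup and locking are kept deliberately naive):

/* Sketch of the deferred-free idea: an entry remembers which thread
 * allocated it; a foreign thread never free()s it directly but pushes it
 * onto the owner's pending list, and the owner disposes of its own entries
 * the next time it runs. */
#include <pthread.h>
#include <stdlib.h>

typedef struct xentry {
    pthread_t      owner;     /* thread that allocated this entry */
    struct xentry *next;      /* link on the owner's pending list */
    /* ... entry payload ... */
} xentry;

typedef struct {
    pthread_mutex_t lock;
    xentry         *pending;  /* entries released by other threads */
} xowner;

xentry *xentry_alloc(void)
{
    xentry *e = malloc(sizeof(*e));
    if (e != NULL)
        e->owner = pthread_self();
    return e;
}

/* Any thread may release an entry; only the owner actually free()s it.
 * Looking up the owner's xowner slot from e->owner is not shown here. */
void xentry_release(xowner *owner_q, xentry *e)
{
    if (pthread_equal(e->owner, pthread_self())) {
        free(e);
        return;
    }
    pthread_mutex_lock(&owner_q->lock);
    e->next = owner_q->pending;
    owner_q->pending = e;
    pthread_mutex_unlock(&owner_q->lock);
}

/* Each worker thread calls this periodically to free its own backlog;
 * until it does, the entry cache can exceed its configured size. */
void xentry_drain(xowner *mine)
{
    pthread_mutex_lock(&mine->lock);
    xentry *list = mine->pending;
    mine->pending = NULL;
    pthread_mutex_unlock(&mine->lock);
    while (list) {
        xentry *next = list->next;
        free(list);
        list = next;
    }
}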
Still, free() contention alone doesn't explain the huge performance gap in the single-threaded case. However, judging from the fact that the very first glibc search often runs faster than the second and subsequent ones, I'd say there are other significant costs in glibc's free(). The most likely one is that glibc returns memory to the OS too frequently. Hoard and Umem also return memory to the OS, but they keep more extensive free lists. Tcmalloc never returns memory to the OS.
What's also interesting is that for Hoard, Umem, and Tcmalloc, the multi-threaded query times are consistently about 2x slower than the single-threaded case. The 2x slowdown makes sense since it's only a dual-core CPU and it's doing 4x as much work. This kinda says that the cost of malloc is overshadowed by the overhead of thread scheduling.
But for glibc the multi-threaded times are actually faster than the single-threaded case. Since glibc's costs are being partially hidden it seems that the multiple threads are actually getting some benefit from the entry cache here. Still the difference between glibc and the other allocators is so dramatic that this little difference is just academic.
Howard Chu wrote:
What's also interesting is that for Hoard, Umem, and Tcmalloc, the multi-threaded query times are consistently about 2x slower than the single-threaded case. The 2x slowdown makes sense since it's only a dual-core CPU and it's doing 4x as much work. This kinda says that the cost of malloc is overshadowed by the overhead of thread scheduling.
Is it possible that the block stride in the addresses returned by malloc() is affecting cache performance in the glibc case? If they are too close I think it is possible to thrash cache lines between cores.
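To make that concern concrete, here is a toy illustration (hypothetical code, nothing to do with slapd): two threads hammer counters that live in two separately malloc()ed blocks, and the run time depends on whether the allocator's stride leaves them in the same cache line.

/* With a small allocation size the two blocks can share a 64-byte cache
 * line and the cores ping-pong it; recompile with -DPAD=128 to spread
 * them apart and compare the run time. */
#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>

#ifndef PAD
#define PAD sizeof(long)           /* small stride: likely shared line */
#endif
#define ITERS 200000000L

static void *bump(void *p)
{
    volatile long *ctr = p;
    for (long i = 0; i < ITERS; i++)
        ++*ctr;
    return NULL;
}

int main(void)
{
    /* two small blocks allocated back to back */
    long *a = malloc(PAD);
    long *b = malloc(PAD);
    printf("stride between blocks: %ld bytes\n",
           (long)((char *)b - (char *)a));

    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump, a);
    pthread_create(&t2, NULL, bump, b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%ld %ld\n", *a, *b);
    return 0;
}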
David Boreham wrote:
Howard Chu wrote:
What's also interesting is that for Hoard, Umem, and Tcmalloc, the multi-threaded query times are consistently about 2x slower than the single-threaded case. The 2x slowdown makes sense since it's only a dual-core CPU and it's doing 4x as much work. This kinda says that the cost of malloc is overshadowed by the overhead of thread scheduling.
Is it possible that the block stride in the addresses returned by malloc() is affecting cache performance in the glibc case? If they are too close I think it is possible to thrash cache lines between cores.
That's a very good question, and I don't have an answer yet. I've been working on some threading extensions for cachegrind so I can investigate that. Unfortunately its existing infrastructure doesn't lend itself well to tracking multiple caches, so that's been slow going.
David Boreham wrote:
Howard Chu wrote:
What's also interesting is that for Hoard, Umem, and Tcmalloc, the multi-threaded query times are consistently about 2x slower than the single-threaded case. The 2x slowdown makes sense since it's only a dual-core CPU and it's doing 4x as much work. This kinda says that the cost of malloc is overshadowed by the overhead of thread scheduling.
Is it possible that the block stride in the addresses returned by malloc() is affecting cache performance in the glibc case? If they are too close I think it is possible to thrash cache lines between cores.
I've been tinkering with oprofile and some of the performance counters etc... I see that with the current entry_free() that returns an entry to the head of the free list, the same structs get re-used over and over. This is cache-friendly on a single-core machine but causes cache contention on a multi-core machine (because a just-freed entry tends to get reused in a different thread). Putting freed entries at the tail of the list avoids the contention in this case, but it sort of makes things equally bad for all the cores. (I.e., everyone has to go out to main memory for the structures, nobody gets any benefit from the cache.) For the moment I'm going to leave it with entry_free() returning entries to the head of the list.
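For reference, the two policies boil down to where the freed entry is linked back in; a stripped-down sketch with simplified types (locking omitted, and the real entry_free() obviously does more than this):

/* LIFO reuse is cache-warm on one core but hands a just-freed, still-hot
 * entry to whichever thread allocates next; FIFO reuse avoids that
 * ping-pong at the cost of touching colder memory. */
#include <stddef.h>

typedef struct entry {
    struct entry *next;
    /* ... */
} entry;

typedef struct {
    entry *head;
    entry *tail;
} entry_freelist;

/* LIFO: freed entry goes to the head and is the next one reused. */
void entry_free_head(entry_freelist *fl, entry *e)
{
    e->next = fl->head;
    fl->head = e;
    if (fl->tail == NULL)
        fl->tail = e;
}

/* FIFO: freed entry goes to the tail, so reuse cycles through the list. */
void entry_free_tail(entry_freelist *fl, entry *e)
{
    e->next = NULL;
    if (fl->tail)
        fl->tail->next = e;
    else
        fl->head = e;
    fl->tail = e;
}

/* Allocation always takes from the head. */
entry *entry_get(entry_freelist *fl)
{
    entry *e = fl->head;
    if (e) {
        fl->head = e->next;
        if (fl->head == NULL)
            fl->tail = NULL;
    }
    return e;
}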
Our current Entry structure is 80 bytes on a 64-bit machine. (Only 32 bytes on a 32 bit machine.) That's definitely not doing us any favors; I may try padding it up to 128 bytes to see how that affects things. Unfortunately while it may be more CPU cache-friendly, it will definitely cost us as far as how many entries we can keep cached in RAM. Another possibility would be to interleave the prealloc list. (E.g., 5 stripes of stride 8 would keep everything on 128 byte boundaries.)
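A sketch of what the padding would look like (the field layout here is a placeholder; only the 80-byte and 128-byte figures come from the discussion above):

/* Pad the 64-bit Entry out to a full 128-byte multiple so that two
 * entries never straddle the same cache-line pair.  The cost is exactly
 * the memory overhead mentioned: 48 wasted bytes per entry. */
#define CACHE_ALIGN   128
#define ENTRY_CORE    80    /* current sizeof(Entry) on a 64-bit build */

typedef struct padded_entry {
    char e_fields[ENTRY_CORE];                         /* real fields live here */
    char e_pad[CACHE_ALIGN - ENTRY_CORE % CACHE_ALIGN];
} padded_entry;

/* Compile-time check that the padded size is a 128-byte multiple. */
typedef char padded_entry_check[(sizeof(padded_entry) % CACHE_ALIGN) ? -1 : 1];

/* The alternative mentioned above: no per-entry padding, but a prealloc
 * block of 8 entries (8 * 80 = 640 = 5 * 128 bytes) handed out in an
 * interleaved order so the block stays on 128-byte boundaries. */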
Happy New Year!
Howard Chu wrote:
I've tested again with libhoard 3.5.1, and it's actually superior to libumem for both speed and fragmentation. Here are some results against glibc 2.3.3, libhoard, libumem, and tcmalloc:
With the latest code in HEAD the difference in speed between Hoard, Umem, and TCmalloc pretty much disappears. Compared to the results from November, they're about 5-15% faster in the single-threaded case, and significantly faster in the multi-threaded case. Looking at glibc's performance it's pretty clear that our new Entry and Attribute slabs helped a fair amount, but weren't a complete cure. At this point I think the only cure for glibc malloc is not to use it...
November 22 figures:

              glibc            hoard            umem             tcmalloc
            size  time       size  time       size  time       size  time
start        732              740              734              744
single      1368  01:06.04   1802  00:24.52   1781  00:38.46   1454  00:23.86
single      1369  01:47.15   1804  00:17.12   1808  00:18.93   1531  00:17.10
single      1384  01:48.22   1805  00:16.66   1825  00:24.56   1548  00:16.58
single      1385  01:48.65   1805  00:16.61   1825  00:16.17   1548  00:16.61
single      1384  01:48.87   1805  00:16.50   1825  00:16.37   1548  00:16.74
single      1384  01:48.63   1806  00:16.50   1825  00:16.22   1548  00:16.78
single      1385  02:31.95   1806  00:16.50   1825  00:16.30   1548  00:16.67
single      1384  02:43.56   1806  00:16.61   1825  00:16.20   1548  00:16.68
four        2015  02:00.42   1878  00:46.60   1883  00:28.70   1599  00:34.24
four        2055  01:17.54   1879  00:47.06   1883  00:39.45   1599  00:41.09
four        2053  01:21.53   1879  00:40.91   1883  00:37.90   1599  00:41.45
four        2045  01:20.48   1879  00:30.58   1883  00:39.59   1599  00:56.45
four        2064  01:26.11   1879  00:30.77   1890  00:47.71   1599  00:40.74
four        2071  01:29.01   1879  00:40.78   1890  00:44.53   1610  00:40.87
four        2053  01:30.59   1879  00:38.44   1890  00:39.31   1610  00:34.12
four        2056  01:28.11   1879  00:29.79   1890  00:39.53   1610  00:53.65
CPU1             15:23.00         02:20.00         02:43.00         02:21.00
CPU final        26:50.43         08:13.99         09:20.86         09:09.05
The start size is the size of the slapd process right after startup, with the process totally idle. The id2entry DB is 1.3GB with about 360,000 entries, BDB cache at 512M, entry cache at 70,000 entries, cachefree 7000. The subsequent statistics are the size of the slapd process after running a single search filtering on an unindexed attribute, basically spanning the entire DB. The entries range in size from a few K to a couple megabytes.
The process sizes are in megabytes. There are 380836 entries in the DB. The CPU1 line is the amount of CPU time the slapd process accumulated after the last single-threaded test. The multi-threaded test consists of starting the identical search 4 times in the background of a shell script, using "wait" to wait for all of them to complete, and timing the script. The CPU final line is the total amount of CPU time the slapd process used at the end of all tests.
Here are the numbers for HEAD as of today:

              glibc            hoard            umem             tcmalloc
            size  time       size  time       size  time       size  time
start        649              655              651              660
single      1305  01:02.21   1746  00:32.73   1726  00:21.93   1364  00:19.97
single      1575  00:13.53   1786  00:12.57   1753  00:13.74   1396  00:12.92
single      1748  00:14.23   1797  00:13.34   1757  00:14.54   1455  00:13.84
single      1744  00:14.06   1797  00:13.45   1777  00:14.44   1473  00:13.92
single      1533  01:48.20   1798  00:13.45   1777  00:14.15   1473  00:13.90
single      1532  01:27.63   1797  00:13.44   1790  00:14.14   1473  00:13.89
single      1531  01:29.70   1798  00:13.42   1790  00:14.10   1473  00:13.87
single      1749  00:14.45   1798  00:13.41   1790  00:14.11   1473  00:13.87
four        2202  00:33.63   1863  00:23.11   1843  00:23.49   1551  00:23.37
four        2202  00:38.63   1880  00:23.23   1859  00:22.59   1551  00:23.71
four        2202  00:39.24   1880  00:23.34   1859  00:22.77   1564  00:23.57
four        2196  00:38.72   1880  00:23.23   1859  00:22.71   1564  00:23.65
four        2196  00:39.41   1881  00:23.40   1859  00:22.67   1564  00:23.96
four        2196  00:38.82   1880  00:23.13   1859  00:22.79   1564  00:23.41
four        2196  00:39.02   1881  00:23.18   1859  00:22.83   1564  00:23.27
four        2196  00:38.90   1880  00:23.12   1859  00:22.82   1564  00:23.48
CPU1             06:44.07         01:53.00         02:01.00         01:56.34
CPU final        12:56.51         05:48.56         05:52.21         05:47.77
Looking at the glibc numbers really makes you wonder what it's doing, running "OK" for a while, then chewing up CPU, then coming back. Seems to be a pretty expensive garbage collection pass. As before, the system was otherwise idle and there was no paging/swapping/disk activity during the tests. The overall improvements probably come mainly from the new cache-replacement code, and splitting/rearranging some cache locks. (Oprofile has turned out to be quite handy for identifying problem areas... Though it seems that playing with the cacheline alignment only netted a 0.5-1% effect, pretty forgettable.)
It's worth noting that we pay a pretty high overhead cost for using BDB locks here, but that cost seems to amortize out as the number of threads increases. Put another way, this dual-core machine is scaling as if it were a quad - it takes about a 4x increase in job load to see a 2x increase in execution time. E.g., 16 concurrent searches complete in about 41 seconds for Hoard. 32 complete in only 57 seconds. That's starting to approach the speed of the back-hdb entry cache - when all the entries are in the entry cache (as opposed to the BDB cache) a single search completes in about .7 seconds, which is hitting around the 500,000 entries/second mark.
Oops. These November figures are for back-bdb. The new numbers in the previous post were from back-hdb, so not comparable. I have new numbers for back-bdb here. Interestingly, all of them are faster than back-hdb, except for tcmalloc, which got slower. Nothing's ever straightforward...
Howard Chu wrote:
November 22 figures:

              glibc            hoard            umem             tcmalloc
            size  time       size  time       size  time       size  time
start        732              740              734              744
single      1368  01:06.04   1802  00:24.52   1781  00:38.46   1454  00:23.86
single      1369  01:47.15   1804  00:17.12   1808  00:18.93   1531  00:17.10
single      1384  01:48.22   1805  00:16.66   1825  00:24.56   1548  00:16.58
single      1385  01:48.65   1805  00:16.61   1825  00:16.17   1548  00:16.61
single      1384  01:48.87   1805  00:16.50   1825  00:16.37   1548  00:16.74
single      1384  01:48.63   1806  00:16.50   1825  00:16.22   1548  00:16.78
single      1385  02:31.95   1806  00:16.50   1825  00:16.30   1548  00:16.67
single      1384  02:43.56   1806  00:16.61   1825  00:16.20   1548  00:16.68
four        2015  02:00.42   1878  00:46.60   1883  00:28.70   1599  00:34.24
four        2055  01:17.54   1879  00:47.06   1883  00:39.45   1599  00:41.09
four        2053  01:21.53   1879  00:40.91   1883  00:37.90   1599  00:41.45
four        2045  01:20.48   1879  00:30.58   1883  00:39.59   1599  00:56.45
four        2064  01:26.11   1879  00:30.77   1890  00:47.71   1599  00:40.74
four        2071  01:29.01   1879  00:40.78   1890  00:44.53   1610  00:40.87
four        2053  01:30.59   1879  00:38.44   1890  00:39.31   1610  00:34.12
four        2056  01:28.11   1879  00:29.79   1890  00:39.53   1610  00:53.65
CPU1             15:23.00         02:20.00         02:43.00         02:21.00
CPU final        26:50.43         08:13.99         09:20.86         09:09.05
back-bdb numbers for HEAD as of today:

              glibc            hoard            umem             tcmalloc
            size  time       size  time       size  time       size  time
start        655              655              651              660
single      1703  00:20.50   1703  00:20.50   1696  00:22.24   1350  00:21.54
single      1708  00:11.79   1708  00:11.79   1720  00:12.26   1364  00:12.76
single      1715  00:12.86   1715  00:12.86   1743  00:13.61   1440  00:14.01
single      1729  00:12.98   1729  00:12.98   1743  00:13.07   1445  00:14.11
single      1729  00:12.91   1729  00:12.91   1743  00:13.05   1445  00:14.09
single      1746  00:12.93   1746  00:12.93   1743  00:13.08   1445  00:14.13
single      1747  00:12.92   1747  00:12.92   1743  00:13.06   1458  00:14.11
single      1747  00:12.92   1747  00:12.92   1743  00:13.06   1458  00:14.10
four        1814  00:41.54   1820  00:22.01   1791  00:22.97   1519  00:24.05
four        1858  00:44.89   1852  00:22.14   1791  00:22.43   1541  00:24.15
four        2035  00:33.69   1852  00:22.11   1791  00:22.36   1548  00:24.24
four        2060  00:33.89   1853  00:22.02   1791  00:22.35   1548  00:23.68
four        2075  00:33.67   1853  00:22.01   1791  00:22.30   1548  00:24.68
four        2066  00:35.08   1853  00:22.11   1799  00:22.33   1548  00:23.78
four        2056  00:32.59   1853  00:21.88   1815  00:22.24   1548  00:23.87
four        2055  00:41.03   1853  00:22.21   1815  00:22.11   1548  00:23.75
CPU1             06:16.19         01:49.00         01:54.14         01:59.01
CPU final        12:04.24         05:32.77         05:40.43         05:51.68
Here are the numbers for HEAD as of today: (back-hdb)

              glibc            hoard            umem             tcmalloc
            size  time       size  time       size  time       size  time
start        649              655              651              660
single      1305  01:02.21   1746  00:32.73   1726  00:21.93   1364  00:19.97
single      1575  00:13.53   1786  00:12.57   1753  00:13.74   1396  00:12.92
single      1748  00:14.23   1797  00:13.34   1757  00:14.54   1455  00:13.84
single      1744  00:14.06   1797  00:13.45   1777  00:14.44   1473  00:13.92
single      1533  01:48.20   1798  00:13.45   1777  00:14.15   1473  00:13.90
single      1532  01:27.63   1797  00:13.44   1790  00:14.14   1473  00:13.89
single      1531  01:29.70   1798  00:13.42   1790  00:14.10   1473  00:13.87
single      1749  00:14.45   1798  00:13.41   1790  00:14.11   1473  00:13.87
four        2202  00:33.63   1863  00:23.11   1843  00:23.49   1551  00:23.37
four        2202  00:38.63   1880  00:23.23   1859  00:22.59   1551  00:23.71
four        2202  00:39.24   1880  00:23.34   1859  00:22.77   1564  00:23.57
four        2196  00:38.72   1880  00:23.23   1859  00:22.71   1564  00:23.65
four        2196  00:39.41   1881  00:23.40   1859  00:22.67   1564  00:23.96
four        2196  00:38.82   1880  00:23.13   1859  00:22.79   1564  00:23.41
four        2196  00:39.02   1881  00:23.18   1859  00:22.83   1564  00:23.27
four        2196  00:38.90   1880  00:23.12   1859  00:22.82   1564  00:23.48
CPU1             06:44.07         01:53.00         02:01.00         01:56.34
CPU final        12:56.51         05:48.56         05:52.21         05:47.77
And another oops. The single-thread numbers for glibc in this post were mis-copied (note they're identical to hoard's numbers). The multi-thread numbers and the CPU times are all correct.
The actual single-thread stats for glibc in that test run were:

            size  time
start        649
single      1281  01:02.25
single      1399  00:11.85
single      1404  01:43.01
single      1737  00:13.22
single      1748  00:12.80
single      1736  01:45.55
single      1764  00:12.79
single      1772  00:13.14
Still no idea why there are such erratic jumps in its behavior.
Howard Chu wrote:
back-bdb numbers for HEAD as of today:

              glibc            hoard            umem             tcmalloc
            size  time       size  time       size  time       size  time
start        655              655              651              660
single      1703  00:20.50   1703  00:20.50   1696  00:22.24   1350  00:21.54
single      1708  00:11.79   1708  00:11.79   1720  00:12.26   1364  00:12.76
single      1715  00:12.86   1715  00:12.86   1743  00:13.61   1440  00:14.01
single      1729  00:12.98   1729  00:12.98   1743  00:13.07   1445  00:14.11
single      1729  00:12.91   1729  00:12.91   1743  00:13.05   1445  00:14.09
single      1746  00:12.93   1746  00:12.93   1743  00:13.08   1445  00:14.13
single      1747  00:12.92   1747  00:12.92   1743  00:13.06   1458  00:14.11
single      1747  00:12.92   1747  00:12.92   1743  00:13.06   1458  00:14.10
four        1814  00:41.54   1820  00:22.01   1791  00:22.97   1519  00:24.05
four        1858  00:44.89   1852  00:22.14   1791  00:22.43   1541  00:24.15
four        2035  00:33.69   1852  00:22.11   1791  00:22.36   1548  00:24.24
four        2060  00:33.89   1853  00:22.02   1791  00:22.35   1548  00:23.68
four        2075  00:33.67   1853  00:22.01   1791  00:22.30   1548  00:24.68
four        2066  00:35.08   1853  00:22.11   1799  00:22.33   1548  00:23.78
four        2056  00:32.59   1853  00:21.88   1815  00:22.24   1548  00:23.87
four        2055  00:41.03   1853  00:22.21   1815  00:22.11   1548  00:23.75
CPU1             06:16.19         01:49.00         01:54.14         01:59.01
CPU final        12:04.24         05:32.77         05:40.43         05:51.68
Howard Chu wrote:
And another oops. The single-thread numbers for glibc in this post were mis-copied (note they're identical to hoard's numbers). The multi-thread numbers and the CPU times are all correct.
The actual single-thread stats for glibc in that test run were:
Still no idea why there are such erratic jumps in its behavior.
Found the explanation for this: http://sourceware.org/ml/libc-alpha/2006-03/msg00033.html
Running the glibc test with the MALLOC_MMAP_THRESHOLD_ environment variable set to 8MB drastically improved its performance. Aside from some high startup overhead, it's very nearly the fastest in the single-threaded case, which is more in line with what other folks have written about it. It's still the slowest for multi-threaded, though.
              glibc            hoard            umem             tcmalloc
            size  time       size  time       size  time       size  time
start        649              655              653              660
single      1278  01:01.95   1708  00:20.23   1693  00:22.33   1353  00:20.91
single      1278  00:26.69   1747  00:11.73   1731  00:12.37   1374  00:12.26
single      1389  00:11.33   1747  00:11.09   1736  00:11.75   1374  00:11.58
single      1412  00:11.12   1748  00:11.06   1736  00:11.71   1374  00:11.58
single      1412  00:11.09   1748  00:11.05   1736  00:11.62   1374  00:11.57
single      1408  00:11.12   1748  00:11.07   1736  00:11.65   1374  00:11.58
single      1412  00:11.11   1748  00:11.06   1736  00:11.54   1374  00:11.56
single      1402  00:11.11   1748  00:11.07   1736  00:11.51   1374  00:11.52
four        1504  00:27.48   1801  00:19.37   1786  00:19.44   1435  00:19.64
four        1502  00:24.24   1801  00:19.22   1799  00:20.90   1451  00:20.44
four        1501  00:23.67   1829  00:19.18   1807  00:19.41   1451  00:19.89
four        1497  00:23.47   1832  00:21.12   1821  00:19.84   1451  00:19.92
four        1498  00:23.68   1832  00:19.50   1821  00:21.27   1451  00:19.91
four        1498  00:23.71   1832  00:19.35   1821  00:19.63   1451  00:19.97
four        1512  00:23.71   1832  00:19.40   1821  00:19.36   1451  00:19.88
four        1512  00:24.53   1832  00:19.51   1821  00:19.59   1451  00:19.98
CPU1             02:36.32         01:38.54         01:45.14         01:42.62
CPU final        06:36.88         05:01.90         05:11.78         05:02.91
This is with current back-bdb (HEAD as of yesterday), slightly faster overall than the January 1 code. It still looks like a tradeoff between hoard for best speed and tcmalloc for best memory efficiency.
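For completeness: the same 8MB threshold can also be set from inside the process rather than through the environment. This is just a sketch of the glibc mallopt() call, not a change that exists in slapd:

/* Set glibc's mmap threshold to 8MB in-process, equivalent to running with
 * MALLOC_MMAP_THRESHOLD_=8388608 in the environment.  Large entry buffers
 * then stay on the heap and get recycled from the free lists instead of
 * being mmap()ed and munmap()ed on every allocation/free cycle. */
#include <malloc.h>

int raise_mmap_threshold(void)
{
    return mallopt(M_MMAP_THRESHOLD, 8 * 1024 * 1024);   /* nonzero on success */
}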