Howard Chu wrote:
> I've tested again with libhoard 3.5.1, and it's actually superior to libumem for both speed and fragmentation. Here are some results against glibc 2.3.3, libhoard, and libumem:
I've retested with current HEAD, adding Google's tcmalloc to the mix. (Sizes are slapd process sizes in MB; times are elapsed min:sec.)

             glibc            hoard            umem             tcmalloc
             size  time       size  time       size  time       size  time
initial       731              741              741              744
single       1368  2:03.02    1809  1:26.67    1778  1:43.49    1453  0:58.77
single       1369  2:23.61    1827  0:25.68    1825  0:18.40    1512  0:21.74
single       1368  1:48.41    1828  0:16.52    1826  0:21.07    1529  0:22.97
single       1368  1:48.59    1829  0:16.59    1827  0:16.95    1529  0:17.07
single       1368  1:48.72    1829  0:16.53    1827  0:16.61    1529  0:17.01
single       1368  1:48.39    1829  0:20.70    1827  0:16.56    1529  0:16.99
single       1368  1:48.63    1830  0:16.56    1828  0:17.48    1529  0:17.29
single       1384  1:48.14    1829  0:16.64    1828  0:22.17    1529  0:16.94
four         1967  1:20.21    1918  0:35.96    1891  0:29.95    1606  0:42.48
four         2002  1:10.58    1919  0:30.07    1911  0:29.00    1622  0:42.38
four         2009  1:33.45    1920  0:42.06    1911  0:40.01    1628  0:40.68
four         1998  1:32.94    1920  0:35.62    1911  0:39.11    1634  0:30.41
four         1995  1:35.47    1920  0:34.20    1911  0:28.40    1634  0:40.80
four         1986  1:34.38    1920  0:28.92    1911  0:31.16    1634  0:40.42
four         1989  1:33.23    1920  0:31.48    1911  0:33.73    1634  0:33.97
four         1999  1:33.04    1920  0:33.47    1911  0:38.33    1634  0:40.91
slapd CPU         26:31.56          8:34.78           8:33.39           9:19.87
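(For anyone who wants to try these allocators themselves: you can either relink slapd against them or just preload the shared library at startup. A preload run looks roughly like the following; the library and config paths are placeholders to be adjusted for your install, not the exact setup used for these numbers.)

    # drop-in allocator via preload; paths are illustrative only
    LD_PRELOAD=/usr/lib64/libtcmalloc.so /usr/local/libexec/slapd -f /etc/openldap/slapd.conf -h ldap:///
    # likewise libhoard.so or libumem.so; omit LD_PRELOAD to get the stock glibc malloc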
> The initial size is the size of the slapd process right after startup, with the process totally idle. The id2entry DB is 1.3GB with about 360,000 entries, BDB cache at 512M, entry cache at 70,000 entries, cachefree 7000. The subsequent statistics are the size of the slapd process after running a single search filtering on an unindexed attribute, basically spanning the entire DB. The entries range in size from a few K to a couple megabytes.
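(For context, the cache settings described above boil down to roughly this in slapd.conf and the BDB DB_CONFIG; the suffix and directory here are just placeholders, not the actual test config.)

    # slapd.conf, back-bdb database section (illustrative excerpt)
    database    bdb
    suffix      "dc=example,dc=com"
    directory   /var/lib/ldap
    cachesize   70000        # entry cache, in entries
    cachefree   7000         # entries released per cache purge

    # DB_CONFIG in the database directory
    set_cachesize 0 536870912 1    # 512MB BDB cache, one region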
The BDB and slapd cache configurations are the same, but the machine now has 4GB of RAM so the entire DB fits in the filesystem buffer cache. As such there is no disk I/O during these tests.
> After running the single ldapsearch 4 times, I then ran the same search again with 4 jobs in parallel. There should of course be some process growth to cover the resources for the 3 additional threads (about 60MB is about right, since this is an x86_64 system).
This time I ran the tests 8 times each. Basically I was looking for the slapd process size to stabilize at a constant number...
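(Schematically, the 4-way runs are just the same unindexed search launched as concurrent jobs, i.e. roughly the following; the search base and attribute are stand-ins for the real ones.)

    # "description" stands in for the unindexed attribute actually searched
    for i in 1 2 3 4; do
        time ldapsearch -x -b "dc=example,dc=com" "(description=*foo*)" > /dev/null &
    done
    wait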
> The machine only had 2GB of RAM, and you can see that with glibc malloc the kswapd got really busy in the 4-way run. The times might improve slightly after I add more RAM to the box. But clearly glibc malloc is fragmenting the heap like crazy. The current version of libhoard looks like the winner here.
There was no swap activity (or any other disk activity) this time, so the glibc numbers don't explode like they did before. But even so, the other allocators are 5-6 times faster. The Google folks claim tcmalloc is the fastest allocator they have ever seen. These tests show it is fast, but it is not the fastest. It definitely is the most space-efficient multi-threaded allocator though. It's hard to judge just by the execution times of the ldapsearch commands, so I also recorded the amount of CPU time the slapd process consumed by the end of each test. That gives a clearer picture of the performance differences for each allocator.
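(For anyone repeating this: ps will report both the process size and the accumulated CPU time of the running slapd in one shot, e.g. as below. This is just one way to read the numbers, not necessarily how these particular figures were collected.)

    # VSZ/RSS are in KB; cputime is cumulative user+system CPU of the process
    ps -C slapd -o pid=,vsz=,rss=,cputime=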
These numbers aren't directly comparable to the ones I posted on August 30, because I was using a hacked-up RE23 build there and current HEAD here.
Going through these tests is a little disturbing; I would expect the single-threaded results to be 100% repeatable, but they're not quite. In one run I saw glibc use up 2.1GB during the 4-way test and never shrink back down. I later re-ran the same test (because I hadn't recorded the CPU usage the first time around) and got these numbers instead. The other disturbing thing is just how bad glibc's malloc really is, even in the single-threaded case, which is supposed to be its ideal situation.
Also worth noting: for this run with libhoard, I doubled its SUPERBLOCK_SIZE from 64K to 128K. That's probably why it's less space-efficient here than umem, while it was more efficient in the August results. I guess I'll have to recompile it with the original size to see what difference that makes. For now I'd say my preference would be tcmalloc...