Howard Chu wrote:
I've tested again with libhoard 3.5.1, and it's actually superior to libumem for both speed and fragmentation. Here are some results against glibc 2.3.3, libhoard, and libumem:
I've retested with current HEAD, adding Google's tcmalloc to the mix.

              glibc            hoard            umem             tcmalloc
            size  time       size  time       size  time       size  time
initial      731              741              741              744
single      1368  2:03.02    1809  1:26.67    1778  1:43.49    1453  0:58.77
single      1369  2:23.61    1827  0:25.68    1825  0:18.40    1512  0:21.74
single      1368  1:48.41    1828  0:16.52    1826  0:21.07    1529  0:22.97
single      1368  1:48.59    1829  0:16.59    1827  0:16.95    1529  0:17.07
single      1368  1:48.72    1829  0:16.53    1827  0:16.61    1529  0:17.01
single      1368  1:48.39    1829  0:20.70    1827  0:16.56    1529  0:16.99
single      1368  1:48.63    1830  0:16.56    1828  0:17.48    1529  0:17.29
single      1384  1:48.14    1829  0:16.64    1828  0:22.17    1529  0:16.94
four        1967  1:20.21    1918  0:35.96    1891  0:29.95    1606  0:42.48
four        2002  1:10.58    1919  0:30.07    1911  0:29.00    1622  0:42.38
four        2009  1:33.45    1920  0:42.06    1911  0:40.01    1628  0:40.68
four        1998  1:32.94    1920  0:35.62    1911  0:39.11    1634  0:30.41
four        1995  1:35.47    1920  0:34.20    1911  0:28.40    1634  0:40.80
four        1986  1:34.38    1920  0:28.92    1911  0:31.16    1634  0:40.42
four        1989  1:33.23    1920  0:31.48    1911  0:33.73    1634  0:33.97
four        1999  1:33.04    1920  0:33.47    1911  0:38.33    1634  0:40.91
slapd CPU        26:31.56         8:34.78          8:33.39          9:19.87
The initial size is the size of the slapd process right after startup, with the process totally idle. The id2entry DB is 1.3GB with about 360,000 entries, BDB cache at 512M, entry cache at 70,000 entries, cachefree 7000. The subsequent statistics are the size of the slapd process after running a single search filtering on an unindexed attribute, basically spanning the entire DB. The entries range in size from a few K to a couple megabytes.
The BDB and slapd cache configurations are the same, but the machine now has 4GB of RAM so the entire DB fits in the filesystem buffer cache. As such there is no disk I/O during these tests.
After running the single ldapsearch 4 times, I then ran the same search again with 4 jobs in parallel. There should of course be some process growth to hold the resources for the 3 additional threads (around 60MB is in the right range, since this is an x86_64 system).
This time I ran the tests 8 times each. Basically I was looking for the slapd process size to stabilize at a constant number...
The machine only had 2GB of RAM, and you can see that with glibc malloc the kswapd got really busy in the 4-way run. The times might improve slightly after I add more RAM to the box. But clearly glibc malloc is fragmenting the heap like crazy. The current version of libhoard looks like the winner here.
There was no swap activity (or any other disk activity) this time, so the glibc numbers don't explode like they did before. But even so, the other allocators are 5-6 times faster. The Google folks claim tcmalloc is the fastest allocator they have ever seen. These tests show it is fast, but it is not the fastest. It definitely is the most space-efficient multi-threaded allocator though. It's hard to judge just by the execution times of the ldapsearch commands, so I also recorded the amount of CPU time the slapd process consumed by the end of each test. That gives a clearer picture of the performance differences for each allocator.
These numbers aren't directly comparable to the ones I posted on August 30, because I was using a hacked up RE23 there and used HEAD here.
Going through these tests is a little disturbing; I would expect the single-threaded results to be 100% repeatable, but they're not quite. In one run I saw glibc use up 2.1GB during the 4-way test, and never shrink back down. I later re-ran the same test (because I hadn't recorded the CPU usage the first time around) and got these numbers instead. The other thing that's disturbing is just how bad glibc's malloc really is, even in the single-threaded case, which is supposed to be its ideal situation.
The other thing to note is that for this run with libhoard, I doubled its SUPERBLOCK_SIZE from 64K to 128K. That's probably why it's less space-efficient here than umem, while it was more efficient in the August results. I guess I'll have to recompile that with the original size to see what difference that makes. For now I'd say my preference would be tcmalloc...
Howard Chu wrote:
Howard Chu wrote:
I've tested again with libhoard 3.5.1, and it's actually superior to libumem for both speed and fragmentation. Here are some results against glibc 2.3.3, libhoard, and libumem:
I've retested with current HEAD, adding Google's tcmalloc to the mix.
The initial size is the size of the slapd process right after startup, with the process totally idle. The id2entry DB is 1.3GB with about 360,000 entries, BDB cache at 512M, entry cache at 70,000 entries, cachefree 7000. The subsequent statistics are the size of the slapd process after running a single search filtering on an unindexed attribute, basically spanning the entire DB. The entries range in size from a few K to a couple megabytes.
To be exact, there are 380836 entries in the DB, varying in size from 1K to 10MB.
The other thing to note is that for this run with libhoard, I doubled its SUPERBLOCK_SIZE from 64K to 128K. That's probably why it's less space-efficient here than umem, while it was more efficient in the August results. I guess I'll have to recompile that with the original size to see what difference that makes. For now I'd say my preference would be tcmalloc...
I reran the tests one more time with hoard in its default configuration (64K SUPERBLOCK_SIZE), in single-user mode, with no network connected. The results are here http://highlandsun.com/hyc/#Malloc
Kinda interesting - with hoard this shows us processing 23000 entries per second in the single-threaded case, vs only 3521 per second with glibc malloc. In the multi-threaded case we get peaks up to 51136 entries per second vs 19645/second with glibc. It's understandable that the multi-threaded case may get greater than 2x boost over the single-threaded numbers on this dual-core box, since some portion of entries will be in-cache and won't take any malloc hit at all. But still the difference against glibc is staggering. All of the other throughput benchmarks we've published so far have just used the default libc malloc and we're already faster than all other LDAP products. Re-running those with hoard or umem will be a real eye-opener.
Howard Chu wrote:
Kinda interesting - with hoard this shows us processing 23000 entries per second in the single-threaded case, vs only 3521 per second with glibc malloc.
It is possible that you are seeing the syndrome that I wrote about here: http://www.bozemanpass.com/info/linux/malloc/Linux_Heap_Contention.html AFAIK the poorly behaving code is still present in today's glibc malloc. malloc exhibits lock contention when blocks are freed by a different thread than the one that allocated them (more exactly, when the threads hash to different arenas). The problem affects UMich-derived LDAP servers because cache entries make up a significant proportion of the heap traffic, and they tend to be allocated by a different thread than the one that frees them when the cache fills up.
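For anyone who wants to see the effect outside of slapd, a toy program along these lines reproduces the allocate-in-one-thread, free-in-another traffic pattern (this is purely illustrative code, not anything from the slapd sources); running it under the different allocators via LD_PRELOAD should make the arena contention visible:

/* Toy single-producer/single-consumer loop: one thread malloc()s blocks,
 * the other free()s them, so every free() lands in an arena owned by a
 * different thread.  Build with "cc -O2 -pthread xthread_free.c". */
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

#define RING  (1 << 16)
#define TOTAL (RING * 64L)

static void *ring[RING];
static atomic_long head, tail;   /* monotonically increasing indices */

static void *producer(void *arg)
{
    for (long i = 0; i < TOTAL; i++) {
        while (atomic_load(&head) - atomic_load(&tail) >= RING)
            ;                                    /* ring full: spin */
        ring[atomic_load(&head) % RING] = malloc(64 + (i % 1024));
        atomic_fetch_add(&head, 1);
    }
    return NULL;
}

static void *consumer(void *arg)
{
    for (long i = 0; i < TOTAL; i++) {
        while (atomic_load(&tail) >= atomic_load(&head))
            ;                                    /* ring empty: spin */
        free(ring[atomic_load(&tail) % RING]);
        atomic_fetch_add(&tail, 1);
    }
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}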
David Boreham wrote:
Howard Chu wrote:
Kinda interesting - with hoard this shows us processing 23000 entries per second in the single-threaded case, vs only 3521 per second with glibc malloc.
It is possible that you are seeing the syndrome that I wrote about here: http://www.bozemanpass.com/info/linux/malloc/Linux_Heap_Contention.html AFAIK the poorly behaving code is still present in today's glibc malloc. malloc exhibits lock contention when blocks are freed by a different thread than the one that allocated them (more exactly, when the threads hash to different arenas). The problem affects UMich-derived LDAP servers because cache entries make up a significant proportion of the heap traffic, and they tend to be allocated by a different thread than the one that frees them when the cache fills up.
Yes, that's a factor I had considered. One of the approaches I had tested was tagging each entry with the threadID that allocated it, and deferring the frees until the owning thread comes back to dispose of them. I wasn't really happy with this approach because it meant the entry cache size would not be strictly regulated, though on a busy server it's likely that all of the frees would happen in reasonably short time.
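In outline, the deferred-free scheme looks something like the sketch below (hypothetical names and types, not the code that was actually tested, and the owner lookup and locking are kept deliberately naive):

/* Sketch of the deferred-free idea: an entry remembers which thread
 * allocated it; a foreign thread never free()s it directly but pushes it
 * onto the owner's pending list, and the owner disposes of its own entries
 * the next time it runs. */
#include <pthread.h>
#include <stdlib.h>

typedef struct xentry {
    pthread_t      owner;     /* thread that allocated this entry */
    struct xentry *next;      /* link on the owner's pending list */
    /* ... entry payload ... */
} xentry;

typedef struct {
    pthread_mutex_t lock;
    xentry         *pending;  /* entries released by other threads */
} xowner;

xentry *xentry_alloc(void)
{
    xentry *e = malloc(sizeof(*e));
    if (e != NULL)
        e->owner = pthread_self();
    return e;
}

/* Any thread may release an entry; only the owner actually free()s it.
 * Looking up the owner's xowner slot from e->owner is not shown here. */
void xentry_release(xowner *owner_q, xentry *e)
{
    if (pthread_equal(e->owner, pthread_self())) {
        free(e);
        return;
    }
    pthread_mutex_lock(&owner_q->lock);
    e->next = owner_q->pending;
    owner_q->pending = e;
    pthread_mutex_unlock(&owner_q->lock);
}

/* Each worker thread calls this periodically to free its own backlog;
 * until it does, the entry cache can exceed its configured size. */
void xentry_drain(xowner *mine)
{
    pthread_mutex_lock(&mine->lock);
    xentry *list = mine->pending;
    mine->pending = NULL;
    pthread_mutex_unlock(&mine->lock);
    while (list) {
        xentry *next = list->next;
        free(list);
        list = next;
    }
}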
Still, free() contention alone doesn't explain the huge performance gap in the single-threaded case. However, judging from the fact that the very first glibc search often runs faster than the second and subsequent ones, I'd say there are other significant costs in glibc's free(). The most likely one is that glibc returns memory to the OS too frequently. Hoard and Umem also return memory to the OS, but they keep more extensive free lists. Tcmalloc never returns memory to the OS.
What's also interesting is that for Hoard, Umem, and Tcmalloc, the multi-threaded query times are consistently about 2x slower than the single-threaded case. The 2x slowdown makes sense since it's only a dual-core CPU and it's doing 4x as much work. This kinda says that the cost of malloc is overshadowed by the overhead of thread scheduling.
But for glibc the multi-threaded times are actually faster than the single-threaded case. Since glibc's costs are being partially hidden it seems that the multiple threads are actually getting some benefit from the entry cache here. Still the difference between glibc and the other allocators is so dramatic that this little difference is just academic.
Howard Chu wrote:
What's also interesting is that for Hoard, Umem, and Tcmalloc, the multi-threaded query times are consistently about 2x slower than the single-threaded case. The 2x slowdown makes sense since it's only a dual-core CPU and it's doing 4x as much work. This kinda says that the cost of malloc is overshadowed by the overhead of thread scheduling.
Is it possible that the block stride in the addresses returned by malloc() is affecting cache performance in the glibc case? If they are too close I think it is possible to thrash cache lines between cores.
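To make that concern concrete, here is a toy illustration (hypothetical code, nothing to do with slapd): two threads hammer counters that live in two separately malloc()ed blocks, and the run time depends on whether the allocator's stride leaves them in the same cache line.

/* With a small allocation size the two blocks can share a 64-byte cache
 * line and the cores ping-pong it; recompile with -DPAD=128 to spread
 * them apart and compare the run time. */
#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>

#ifndef PAD
#define PAD sizeof(long)           /* small stride: likely shared line */
#endif
#define ITERS 200000000L

static void *bump(void *p)
{
    volatile long *ctr = p;
    for (long i = 0; i < ITERS; i++)
        ++*ctr;
    return NULL;
}

int main(void)
{
    /* two small blocks allocated back to back */
    long *a = malloc(PAD);
    long *b = malloc(PAD);
    printf("stride between blocks: %ld bytes\n",
           (long)((char *)b - (char *)a));

    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump, a);
    pthread_create(&t2, NULL, bump, b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%ld %ld\n", *a, *b);
    return 0;
}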
David Boreham wrote:
Howard Chu wrote:
What's also interesting is that for Hoard, Umem, and Tcmalloc, the multi-threaded query times are consistently about 2x slower than the single-threaded case. The 2x slowdown makes sense since it's only a dual-core CPU and it's doing 4x as much work. This kinda says that the cost of malloc is overshadowed by the overhead of thread scheduling.
Is it possible that the block stride in the addresses returned by malloc() is affecting cache performance in the glibc case? If they are too close I think it is possible to thrash cache lines between cores.
That's a very good question, and I don't have an answer yet. I've been working on some threading extensions for cachegrind so I can investigate that. Unfortunately its existing infrastructure doesn't lend itself well to tracking multiple caches, so that's been slow going.
David Boreham wrote:
Howard Chu wrote:
What's also interesting is that for Hoard, Umem, and Tcmalloc, the multi-threaded query times are consistently about 2x slower than the single-threaded case. The 2x slowdown makes sense since it's only a dual-core CPU and it's doing 4x as much work. This kinda says that the cost of malloc is overshadowed by the overhead of thread scheduling.
Is it possible that the block stride in the addresses returned by malloc() is affecting cache performance in the glibc case? If they are too close I think it is possible to thrash cache lines between cores.
I've been tinkering with oprofile and some of the performance counters etc... I see that with the current entry_free() that returns an entry to the head of the free list, the same structs get re-used over and over. This is cache-friendly on a single-core machine but causes cache contention on a multi-core machine (because a just-freed entry tends to get reused in a different thread). Putting freed entries at the tail of the list avoids the contention in this case, but it sort of makes things equally bad for all the cores. (I.e., everyone has to go out to main memory for the structures, nobody gets any benefit from the cache.) For the moment I'm going to leave it with entry_free() returning entries to the head of the list.
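For reference, the two policies boil down to where the freed entry is linked back in; a stripped-down sketch with simplified types (locking omitted, and the real entry_free() obviously does more than this):

/* LIFO reuse is cache-warm on one core but hands a just-freed, still-hot
 * entry to whichever thread allocates next; FIFO reuse avoids that
 * ping-pong at the cost of touching colder memory. */
#include <stddef.h>

typedef struct entry {
    struct entry *next;
    /* ... */
} entry;

typedef struct {
    entry *head;
    entry *tail;
} entry_freelist;

/* LIFO: freed entry goes to the head and is the next one reused. */
void entry_free_head(entry_freelist *fl, entry *e)
{
    e->next = fl->head;
    fl->head = e;
    if (fl->tail == NULL)
        fl->tail = e;
}

/* FIFO: freed entry goes to the tail, so reuse cycles through the list. */
void entry_free_tail(entry_freelist *fl, entry *e)
{
    e->next = NULL;
    if (fl->tail)
        fl->tail->next = e;
    else
        fl->head = e;
    fl->tail = e;
}

/* Allocation always takes from the head. */
entry *entry_get(entry_freelist *fl)
{
    entry *e = fl->head;
    if (e) {
        fl->head = e->next;
        if (fl->head == NULL)
            fl->tail = NULL;
    }
    return e;
}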
Our current Entry structure is 80 bytes on a 64-bit machine. (Only 32 bytes on a 32 bit machine.) That's definitely not doing us any favors; I may try padding it up to 128 bytes to see how that affects things. Unfortunately while it may be more CPU cache-friendly, it will definitely cost us as far as how many entries we can keep cached in RAM. Another possibility would be to interleave the prealloc list. (E.g., 5 stripes of stride 8 would keep everything on 128 byte boundaries.)
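A sketch of what the padding would look like (the field layout here is a placeholder; only the 80-byte and 128-byte figures come from the discussion above):

/* Pad the 64-bit Entry out to a full 128-byte multiple so that two
 * entries never straddle the same cache-line pair.  The cost is exactly
 * the memory overhead mentioned: 48 wasted bytes per entry. */
#define CACHE_ALIGN   128
#define ENTRY_CORE    80    /* current sizeof(Entry) on a 64-bit build */

typedef struct padded_entry {
    char e_fields[ENTRY_CORE];                         /* real fields live here */
    char e_pad[CACHE_ALIGN - ENTRY_CORE % CACHE_ALIGN];
} padded_entry;

/* Compile-time check that the padded size is a 128-byte multiple. */
typedef char padded_entry_check[(sizeof(padded_entry) % CACHE_ALIGN) ? -1 : 1];

/* The alternative mentioned above: no per-entry padding, but a prealloc
 * block of 8 entries (8 * 80 = 640 = 5 * 128 bytes) handed out in an
 * interleaved order so the block stays on 128-byte boundaries. */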
Happy New Year!
Howard Chu wrote:
I've tested again with libhoard 3.5.1, and it's actually superior to libumem for both speed and fragmentation. Here are some results against glibc 2.3.3, libhoard, libumem, and tcmalloc:
With the latest code in HEAD the difference in speed between Hoard, Umem, and TCmalloc pretty much disappears. Compared to the results from November, they're about 5-15% faster in the single-threaded case, and significantly faster in the multi-threaded case. Looking at glibc's performance it's pretty clear that our new Entry and Attribute slabs helped a fair amount, but weren't a complete cure. At this point I think the only cure for glibc malloc is not to use it...
November 22 figures:

              glibc            hoard            umem             tcmalloc
            size  time       size  time       size  time       size  time
start        732              740              734              744
single      1368  01:06.04   1802  00:24.52   1781  00:38.46   1454  00:23.86
single      1369  01:47.15   1804  00:17.12   1808  00:18.93   1531  00:17.10
single      1384  01:48.22   1805  00:16.66   1825  00:24.56   1548  00:16.58
single      1385  01:48.65   1805  00:16.61   1825  00:16.17   1548  00:16.61
single      1384  01:48.87   1805  00:16.50   1825  00:16.37   1548  00:16.74
single      1384  01:48.63   1806  00:16.50   1825  00:16.22   1548  00:16.78
single      1385  02:31.95   1806  00:16.50   1825  00:16.30   1548  00:16.67
single      1384  02:43.56   1806  00:16.61   1825  00:16.20   1548  00:16.68
four        2015  02:00.42   1878  00:46.60   1883  00:28.70   1599  00:34.24
four        2055  01:17.54   1879  00:47.06   1883  00:39.45   1599  00:41.09
four        2053  01:21.53   1879  00:40.91   1883  00:37.90   1599  00:41.45
four        2045  01:20.48   1879  00:30.58   1883  00:39.59   1599  00:56.45
four        2064  01:26.11   1879  00:30.77   1890  00:47.71   1599  00:40.74
four        2071  01:29.01   1879  00:40.78   1890  00:44.53   1610  00:40.87
four        2053  01:30.59   1879  00:38.44   1890  00:39.31   1610  00:34.12
four        2056  01:28.11   1879  00:29.79   1890  00:39.53   1610  00:53.65
CPU1             15:23.00         02:20.00         02:43.00         02:21.00
CPU final        26:50.43         08:13.99         09:20.86         09:09.05
The start size is the size of the slapd process right after startup, with the process totally idle. The id2entry DB is 1.3GB with about 360,000 entries, BDB cache at 512M, entry cache at 70,000 entries, cachefree 7000. The subsequent statistics are the size of the slapd process after running a single search filtering on an unindexed attribute, basically spanning the entire DB. The entries range in size from a few K to a couple megabytes.
The process sizes are in megabytes. There are 380836 entries in the DB. The CPU1 line is the amount of CPU time the slapd process accumulated after the last single-threaded test. The multi-threaded test consists of starting the identical search 4 times in the background of a shell script, using "wait" to wait for all of them to complete, and timing the script. The CPU final line is the total amount of CPU time the slapd process used at the end of all tests.
Here are the numbers for HEAD as of today:

              glibc            hoard            umem             tcmalloc
            size  time       size  time       size  time       size  time
start        649              655              651              660
single      1305  01:02.21   1746  00:32.73   1726  00:21.93   1364  00:19.97
single      1575  00:13.53   1786  00:12.57   1753  00:13.74   1396  00:12.92
single      1748  00:14.23   1797  00:13.34   1757  00:14.54   1455  00:13.84
single      1744  00:14.06   1797  00:13.45   1777  00:14.44   1473  00:13.92
single      1533  01:48.20   1798  00:13.45   1777  00:14.15   1473  00:13.90
single      1532  01:27.63   1797  00:13.44   1790  00:14.14   1473  00:13.89
single      1531  01:29.70   1798  00:13.42   1790  00:14.10   1473  00:13.87
single      1749  00:14.45   1798  00:13.41   1790  00:14.11   1473  00:13.87
four        2202  00:33.63   1863  00:23.11   1843  00:23.49   1551  00:23.37
four        2202  00:38.63   1880  00:23.23   1859  00:22.59   1551  00:23.71
four        2202  00:39.24   1880  00:23.34   1859  00:22.77   1564  00:23.57
four        2196  00:38.72   1880  00:23.23   1859  00:22.71   1564  00:23.65
four        2196  00:39.41   1881  00:23.40   1859  00:22.67   1564  00:23.96
four        2196  00:38.82   1880  00:23.13   1859  00:22.79   1564  00:23.41
four        2196  00:39.02   1881  00:23.18   1859  00:22.83   1564  00:23.27
four        2196  00:38.90   1880  00:23.12   1859  00:22.82   1564  00:23.48
CPU1             06:44.07         01:53.00         02:01.00         01:56.34
CPU final        12:56.51         05:48.56         05:52.21         05:47.77
Looking at the glibc numbers really makes you wonder what it's doing, running "OK" for a while, then chewing up CPU, then coming back. Seems to be a pretty expensive garbage collection pass. As before, the system was otherwise idle and there was no paging/swapping/disk activity during the tests. The overall improvements probably come mainly from the new cache-replacement code, and splitting/rearranging some cache locks. (Oprofile has turned out to be quite handy for identifying problem areas... Though it seems that playing with the cacheline alignment only netted a 0.5-1% effect, pretty forgettable.)
It's worth noting that we pay a pretty high overhead cost for using BDB locks here, but that cost seems to amortize out as the number of threads increases. Put another way, this dual-core machine is scaling as if it were a quad - it takes about a 4x increase in job load to see a 2x increase in execution time. E.g., 16 concurrent searches complete in about 41 seconds for Hoard. 32 complete in only 57 seconds. That's starting to approach the speed of the back-hdb entry cache - when all the entries are in the entry cache (as opposed to the BDB cache) a single search completes in about .7 seconds, which is hitting around the 500,000 entries/second mark.
Oops. These November figures are for back-bdb. The new numbers in the previous post were from back-hdb, so not comparable. I have new numbers for back-bdb here. Interestingly, all of them are faster than back-hdb, except for tcmalloc, which got slower. Nothing's ever straightforward...
Howard Chu wrote:
November 22 figures:

              glibc            hoard            umem             tcmalloc
            size  time       size  time       size  time       size  time
start        732              740              734              744
single      1368  01:06.04   1802  00:24.52   1781  00:38.46   1454  00:23.86
single      1369  01:47.15   1804  00:17.12   1808  00:18.93   1531  00:17.10
single      1384  01:48.22   1805  00:16.66   1825  00:24.56   1548  00:16.58
single      1385  01:48.65   1805  00:16.61   1825  00:16.17   1548  00:16.61
single      1384  01:48.87   1805  00:16.50   1825  00:16.37   1548  00:16.74
single      1384  01:48.63   1806  00:16.50   1825  00:16.22   1548  00:16.78
single      1385  02:31.95   1806  00:16.50   1825  00:16.30   1548  00:16.67
single      1384  02:43.56   1806  00:16.61   1825  00:16.20   1548  00:16.68
four        2015  02:00.42   1878  00:46.60   1883  00:28.70   1599  00:34.24
four        2055  01:17.54   1879  00:47.06   1883  00:39.45   1599  00:41.09
four        2053  01:21.53   1879  00:40.91   1883  00:37.90   1599  00:41.45
four        2045  01:20.48   1879  00:30.58   1883  00:39.59   1599  00:56.45
four        2064  01:26.11   1879  00:30.77   1890  00:47.71   1599  00:40.74
four        2071  01:29.01   1879  00:40.78   1890  00:44.53   1610  00:40.87
four        2053  01:30.59   1879  00:38.44   1890  00:39.31   1610  00:34.12
four        2056  01:28.11   1879  00:29.79   1890  00:39.53   1610  00:53.65
CPU1             15:23.00         02:20.00         02:43.00         02:21.00
CPU final        26:50.43         08:13.99         09:20.86         09:09.05
back-bdb numbers for HEAD as of today:

              glibc            hoard            umem             tcmalloc
            size  time       size  time       size  time       size  time
start        655              655              651              660
single      1703  00:20.50   1703  00:20.50   1696  00:22.24   1350  00:21.54
single      1708  00:11.79   1708  00:11.79   1720  00:12.26   1364  00:12.76
single      1715  00:12.86   1715  00:12.86   1743  00:13.61   1440  00:14.01
single      1729  00:12.98   1729  00:12.98   1743  00:13.07   1445  00:14.11
single      1729  00:12.91   1729  00:12.91   1743  00:13.05   1445  00:14.09
single      1746  00:12.93   1746  00:12.93   1743  00:13.08   1445  00:14.13
single      1747  00:12.92   1747  00:12.92   1743  00:13.06   1458  00:14.11
single      1747  00:12.92   1747  00:12.92   1743  00:13.06   1458  00:14.10
four        1814  00:41.54   1820  00:22.01   1791  00:22.97   1519  00:24.05
four        1858  00:44.89   1852  00:22.14   1791  00:22.43   1541  00:24.15
four        2035  00:33.69   1852  00:22.11   1791  00:22.36   1548  00:24.24
four        2060  00:33.89   1853  00:22.02   1791  00:22.35   1548  00:23.68
four        2075  00:33.67   1853  00:22.01   1791  00:22.30   1548  00:24.68
four        2066  00:35.08   1853  00:22.11   1799  00:22.33   1548  00:23.78
four        2056  00:32.59   1853  00:21.88   1815  00:22.24   1548  00:23.87
four        2055  00:41.03   1853  00:22.21   1815  00:22.11   1548  00:23.75
CPU1             06:16.19         01:49.00         01:54.14         01:59.01
CPU final        12:04.24         05:32.77         05:40.43         05:51.68
Here are the numbers for HEAD as of today: (back-hdb)

              glibc            hoard            umem             tcmalloc
            size  time       size  time       size  time       size  time
start        649              655              651              660
single      1305  01:02.21   1746  00:32.73   1726  00:21.93   1364  00:19.97
single      1575  00:13.53   1786  00:12.57   1753  00:13.74   1396  00:12.92
single      1748  00:14.23   1797  00:13.34   1757  00:14.54   1455  00:13.84
single      1744  00:14.06   1797  00:13.45   1777  00:14.44   1473  00:13.92
single      1533  01:48.20   1798  00:13.45   1777  00:14.15   1473  00:13.90
single      1532  01:27.63   1797  00:13.44   1790  00:14.14   1473  00:13.89
single      1531  01:29.70   1798  00:13.42   1790  00:14.10   1473  00:13.87
single      1749  00:14.45   1798  00:13.41   1790  00:14.11   1473  00:13.87
four        2202  00:33.63   1863  00:23.11   1843  00:23.49   1551  00:23.37
four        2202  00:38.63   1880  00:23.23   1859  00:22.59   1551  00:23.71
four        2202  00:39.24   1880  00:23.34   1859  00:22.77   1564  00:23.57
four        2196  00:38.72   1880  00:23.23   1859  00:22.71   1564  00:23.65
four        2196  00:39.41   1881  00:23.40   1859  00:22.67   1564  00:23.96
four        2196  00:38.82   1880  00:23.13   1859  00:22.79   1564  00:23.41
four        2196  00:39.02   1881  00:23.18   1859  00:22.83   1564  00:23.27
four        2196  00:38.90   1880  00:23.12   1859  00:22.82   1564  00:23.48
CPU1             06:44.07         01:53.00         02:01.00         01:56.34
CPU final        12:56.51         05:48.56         05:52.21         05:47.77
And another oops. The single-thread numbers for glibc in this post were mis-copied (note they're identical to hoard's numbers). The multi-thread numbers and the CPU times are all correct.
The actual single-thread stats for glibc in that test run were:

            size  time
start        649
single      1281  01:02.25
single      1399  00:11.85
single      1404  01:43.01
single      1737  00:13.22
single      1748  00:12.80
single      1736  01:45.55
single      1764  00:12.79
single      1772  00:13.14
Still no idea why there are such erratic jumps in its behavior.
Howard Chu wrote:
back-bdb numbers for HEAD as of today:

              glibc            hoard            umem             tcmalloc
            size  time       size  time       size  time       size  time
start        655              655              651              660
single      1703  00:20.50   1703  00:20.50   1696  00:22.24   1350  00:21.54
single      1708  00:11.79   1708  00:11.79   1720  00:12.26   1364  00:12.76
single      1715  00:12.86   1715  00:12.86   1743  00:13.61   1440  00:14.01
single      1729  00:12.98   1729  00:12.98   1743  00:13.07   1445  00:14.11
single      1729  00:12.91   1729  00:12.91   1743  00:13.05   1445  00:14.09
single      1746  00:12.93   1746  00:12.93   1743  00:13.08   1445  00:14.13
single      1747  00:12.92   1747  00:12.92   1743  00:13.06   1458  00:14.11
single      1747  00:12.92   1747  00:12.92   1743  00:13.06   1458  00:14.10
four        1814  00:41.54   1820  00:22.01   1791  00:22.97   1519  00:24.05
four        1858  00:44.89   1852  00:22.14   1791  00:22.43   1541  00:24.15
four        2035  00:33.69   1852  00:22.11   1791  00:22.36   1548  00:24.24
four        2060  00:33.89   1853  00:22.02   1791  00:22.35   1548  00:23.68
four        2075  00:33.67   1853  00:22.01   1791  00:22.30   1548  00:24.68
four        2066  00:35.08   1853  00:22.11   1799  00:22.33   1548  00:23.78
four        2056  00:32.59   1853  00:21.88   1815  00:22.24   1548  00:23.87
four        2055  00:41.03   1853  00:22.21   1815  00:22.11   1548  00:23.75
CPU1             06:16.19         01:49.00         01:54.14         01:59.01
CPU final        12:04.24         05:32.77         05:40.43         05:51.68
Howard Chu wrote:
And another oops. The single-thread numbers for glibc in this post were mis-copied (note they're identical to hoard's numbers). The multi-thread numbers and the CPU times are all correct.
The actual single-thread stats for glibc in that test run were:
Still no idea why there are such erratic jumps in its behavior.
Found the explanation for this: http://sourceware.org/ml/libc-alpha/2006-03/msg00033.html
Running the glibc test with the MALLOC_MMAP_THRESHOLD_ environment variable set to 8MB drastically improved its performance. Aside from some high startup overhead, it's very nearly the fastest in the single-threaded case, which is more in line with what other folks have written about it. It's still the slowest for multi-threaded, though.
              glibc            hoard            umem             tcmalloc
            size  time       size  time       size  time       size  time
start        649              655              653              660
single      1278  01:01.95   1708  00:20.23   1693  00:22.33   1353  00:20.91
single      1278  00:26.69   1747  00:11.73   1731  00:12.37   1374  00:12.26
single      1389  00:11.33   1747  00:11.09   1736  00:11.75   1374  00:11.58
single      1412  00:11.12   1748  00:11.06   1736  00:11.71   1374  00:11.58
single      1412  00:11.09   1748  00:11.05   1736  00:11.62   1374  00:11.57
single      1408  00:11.12   1748  00:11.07   1736  00:11.65   1374  00:11.58
single      1412  00:11.11   1748  00:11.06   1736  00:11.54   1374  00:11.56
single      1402  00:11.11   1748  00:11.07   1736  00:11.51   1374  00:11.52
four        1504  00:27.48   1801  00:19.37   1786  00:19.44   1435  00:19.64
four        1502  00:24.24   1801  00:19.22   1799  00:20.90   1451  00:20.44
four        1501  00:23.67   1829  00:19.18   1807  00:19.41   1451  00:19.89
four        1497  00:23.47   1832  00:21.12   1821  00:19.84   1451  00:19.92
four        1498  00:23.68   1832  00:19.50   1821  00:21.27   1451  00:19.91
four        1498  00:23.71   1832  00:19.35   1821  00:19.63   1451  00:19.97
four        1512  00:23.71   1832  00:19.40   1821  00:19.36   1451  00:19.88
four        1512  00:24.53   1832  00:19.51   1821  00:19.59   1451  00:19.98
CPU1             02:36.32         01:38.54         01:45.14         01:42.62
CPU final        06:36.88         05:01.90         05:11.78         05:02.91
This is with current back-bdb (HEAD as of yesterday), slightly faster overall than the January 1 code. It still looks like a tradeoff between hoard for best speed and tcmalloc for best memory efficiency.
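For completeness: the same 8MB threshold can also be set from inside the process rather than through the environment. This is just a sketch of the glibc mallopt() call, not a change that exists in slapd:

/* Set glibc's mmap threshold to 8MB in-process, equivalent to running with
 * MALLOC_MMAP_THRESHOLD_=8388608 in the environment.  Large entry buffers
 * then stay on the heap and get recycled from the free lists instead of
 * being mmap()ed and munmap()ed on every allocation/free cycle. */
#include <malloc.h>

int raise_mmap_threshold(void)
{
    return mallopt(M_MMAP_THRESHOLD, 8 * 1024 * 1024);   /* nonzero on success */
}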