Happy New Year!
Howard Chu wrote:
I've tested again with libhoard 3.5.1, and it's actually superior to libumem for both speed and fragmentation. Here are some results comparing glibc 2.3.3, libhoard, libumem, and tcmalloc:
With the latest code in HEAD, the difference in speed between Hoard, Umem, and TCmalloc pretty much disappears. Compared to the November results, they're about 5-15% faster in the single-threaded case and significantly faster in the multi-threaded case. Looking at glibc's performance, it's pretty clear that our new Entry and Attribute slabs helped a fair amount, but they weren't a complete cure. At this point I think the only cure for glibc malloc is not to use it...
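For context on what the Entry/Attribute slabs buy us: the idea is just to carve fixed-size objects out of one big preallocated block and recycle them through a free list, so the hot allocation paths stop hammering the general-purpose malloc. Below is a minimal single-threaded sketch of the technique - an illustration only, not the actual back-hdb code; the names and the 4096-object block size are made up:

    #include <stdlib.h>

    /* One slab holds OBJS_PER_SLAB fixed-size objects carved from a single
     * malloc'd block; freed objects go onto a free list and are handed back
     * out without ever touching malloc/free again. */
    #define OBJS_PER_SLAB   4096

    typedef struct slab_obj {
        struct slab_obj *next;      /* free-list link, overlays the object */
    } slab_obj;

    typedef struct slab {
        slab_obj *free_list;        /* objects available for reuse */
        char     *block;            /* the one big allocation */
    } slab;

    static slab *slab_create(size_t objsize)
    {
        slab *s = malloc(sizeof(*s));
        size_t i;

        if (!s) return NULL;
        /* round the object size up so every object stays pointer-aligned */
        if (objsize < sizeof(slab_obj)) objsize = sizeof(slab_obj);
        objsize = (objsize + sizeof(void *) - 1) & ~(sizeof(void *) - 1);

        s->block = malloc(objsize * OBJS_PER_SLAB);
        if (!s->block) { free(s); return NULL; }

        /* thread every object onto the free list up front */
        s->free_list = NULL;
        for (i = 0; i < OBJS_PER_SLAB; i++) {
            slab_obj *o = (slab_obj *)(s->block + i * objsize);
            o->next = s->free_list;
            s->free_list = o;
        }
        return s;
    }

    static void *slab_alloc(slab *s)
    {
        slab_obj *o = s->free_list;
        if (!o) return NULL;        /* slab exhausted: fall back to malloc */
        s->free_list = o->next;
        return o;
    }

    static void slab_free(slab *s, void *p)
    {
        slab_obj *o = p;
        o->next = s->free_list;
        s->free_list = o;
    }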
November 22 figures:

                 glibc              hoard              umem             tcmalloc
            size    time       size    time       size    time       size    time
start        732                 740                734                 744
single      1368  01:06.04      1802  00:24.52     1781  00:38.46      1454  00:23.86
single      1369  01:47.15      1804  00:17.12     1808  00:18.93      1531  00:17.10
single      1384  01:48.22      1805  00:16.66     1825  00:24.56      1548  00:16.58
single      1385  01:48.65      1805  00:16.61     1825  00:16.17      1548  00:16.61
single      1384  01:48.87      1805  00:16.50     1825  00:16.37      1548  00:16.74
single      1384  01:48.63      1806  00:16.50     1825  00:16.22      1548  00:16.78
single      1385  02:31.95      1806  00:16.50     1825  00:16.30      1548  00:16.67
single      1384  02:43.56      1806  00:16.61     1825  00:16.20      1548  00:16.68
four        2015  02:00.42      1878  00:46.60     1883  00:28.70      1599  00:34.24
four        2055  01:17.54      1879  00:47.06     1883  00:39.45      1599  00:41.09
four        2053  01:21.53      1879  00:40.91     1883  00:37.90      1599  00:41.45
four        2045  01:20.48      1879  00:30.58     1883  00:39.59      1599  00:56.45
four        2064  01:26.11      1879  00:30.77     1890  00:47.71      1599  00:40.74
four        2071  01:29.01      1879  00:40.78     1890  00:44.53      1610  00:40.87
four        2053  01:30.59      1879  00:38.44     1890  00:39.31      1610  00:34.12
four        2056  01:28.11      1879  00:29.79     1890  00:39.53      1610  00:53.65
CPU1              15:23.00            02:20.00           02:43.00            02:21.00
CPU final         26:50.43            08:13.99           09:20.86            09:09.05
The start size is the size of the slapd process right after startup, with the process totally idle. The id2entry DB is 1.3GB with about 360,000 entries; the BDB cache is set to 512MB, the entry cache to 70,000 entries, and cachefree to 7000. The subsequent rows show the size of the slapd process after running a single search filtering on an unindexed attribute, which basically spans the entire DB. The entries range in size from a few KB to a couple of megabytes.
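For anyone who wants to reproduce the setup, those cache settings map onto slapd.conf and DB_CONFIG roughly as follows (the suffix and directory below are placeholders, not the actual test data):

    # slapd.conf, back-hdb database section
    database        hdb
    suffix          "dc=example,dc=com"
    directory       /var/lib/ldap/example
    # entry cache of 70,000 entries, freeing 7,000 at a time when full
    cachesize       70000
    cachefree       7000

    # DB_CONFIG in the database directory: 0GB + 536870912 bytes (512MB), 1 segment
    set_cachesize   0 536870912 1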
The process sizes are in megabytes. There are 380,836 entries in the DB. The CPU1 line is the amount of CPU time the slapd process had accumulated after the last single-threaded test. The multi-threaded test consists of starting the identical search 4 times in the background from a shell script, using "wait" to wait for all of them to complete, and timing the script. The CPU final line is the total amount of CPU time the slapd process had used by the end of all the tests.
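For the record, the four-way test is just "launch the same client four times, wait for the last one to exit, report the elapsed time." A rough C equivalent of that shell script is below; the base DN and the unindexed filter are placeholders, not the actual test data:

    #include <stdio.h>
    #include <sys/time.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define NCLIENTS 4

    int main(void)
    {
        struct timeval t0, t1;
        int i;

        gettimeofday(&t0, NULL);

        for (i = 0; i < NCLIENTS; i++) {
            pid_t pid = fork();
            if (pid == 0) {
                /* Child: run one full-DB search on an unindexed attribute,
                 * discarding the output so only the server is measured. */
                freopen("/dev/null", "w", stdout);
                execlp("ldapsearch", "ldapsearch",
                       "-x", "-H", "ldap://localhost",
                       "-b", "dc=example,dc=com",
                       "(description=*substring*)",  /* placeholder filter */
                       (char *)NULL);
                _exit(127);                           /* exec failed */
            }
        }

        /* the shell script's "wait": block until every child has exited */
        while (wait(NULL) > 0)
            ;

        gettimeofday(&t1, NULL);
        printf("elapsed: %.2f seconds\n",
               (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6);
        return 0;
    }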
Here are the numbers for HEAD as of today:

                 glibc              hoard              umem             tcmalloc
            size    time       size    time       size    time       size    time
start        649                 655                651                 660
single      1305  01:02.21      1746  00:32.73     1726  00:21.93      1364  00:19.97
single      1575  00:13.53      1786  00:12.57     1753  00:13.74      1396  00:12.92
single      1748  00:14.23      1797  00:13.34     1757  00:14.54      1455  00:13.84
single      1744  00:14.06      1797  00:13.45     1777  00:14.44      1473  00:13.92
single      1533  01:48.20      1798  00:13.45     1777  00:14.15      1473  00:13.90
single      1532  01:27.63      1797  00:13.44     1790  00:14.14      1473  00:13.89
single      1531  01:29.70      1798  00:13.42     1790  00:14.10      1473  00:13.87
single      1749  00:14.45      1798  00:13.41     1790  00:14.11      1473  00:13.87
four        2202  00:33.63      1863  00:23.11     1843  00:23.49      1551  00:23.37
four        2202  00:38.63      1880  00:23.23     1859  00:22.59      1551  00:23.71
four        2202  00:39.24      1880  00:23.34     1859  00:22.77      1564  00:23.57
four        2196  00:38.72      1880  00:23.23     1859  00:22.71      1564  00:23.65
four        2196  00:39.41      1881  00:23.40     1859  00:22.67      1564  00:23.96
four        2196  00:38.82      1880  00:23.13     1859  00:22.79      1564  00:23.41
four        2196  00:39.02      1881  00:23.18     1859  00:22.83      1564  00:23.27
four        2196  00:38.90      1880  00:23.12     1859  00:22.82      1564  00:23.48
CPU1              06:44.07            01:53.00           02:01.00            01:56.34
CPU final         12:56.51            05:48.56           05:52.21            05:47.77
Looking at the glibc numbers really makes you wonder what it's doing: running "OK" for a while, then chewing up CPU, then coming back. It seems to be making some pretty expensive garbage-collection-style pass. As before, the system was otherwise idle and there was no paging/swapping/disk activity during the tests. The overall improvements probably come mainly from the new cache-replacement code and from splitting/rearranging some of the cache locks. (OProfile has turned out to be quite handy for identifying problem areas... though it seems that playing with cacheline alignment only netted a 0.5-1% effect, which is pretty forgettable.)
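For reference, the cacheline-alignment experiment was the usual false-sharing tweak: align/pad hot, independently-updated structures so two cores aren't fighting over the same line. Something along these lines - purely illustrative, not the actual slapd structures, and the 64-byte line size is an assumption:

    #include <pthread.h>

    #define CACHE_LINE 64   /* assumed cache line size */

    /* Two frequently-updated but logically independent counters.  Without
     * the alignment they could share a cache line, so every update on one
     * core would invalidate the line on the other (false sharing). */
    struct hot_counter {
        pthread_mutex_t lock;
        unsigned long   count;
    } __attribute__((aligned(CACHE_LINE)));

    struct stats {
        struct hot_counter searches;    /* bumped by worker threads */
        struct hot_counter cache_hits;  /* bumped by the cache layer */
    };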
It's worth noting that we pay a pretty high overhead cost for using BDB locks here, but that cost seems to amortize out as the number of threads increases. Put another way, this dual-core machine is scaling as if it were a quad: it takes about a 4x increase in job load to see a 2x increase in execution time. E.g., 16 concurrent searches complete in about 41 seconds with Hoard, and 32 complete in only 57 seconds. That's starting to approach the speed of the back-hdb entry cache: when all the entries are in the entry cache (as opposed to the BDB cache), a single search completes in about 0.7 seconds, which is hitting around the 500,000 entries/second mark.