A bit of a summary of how the backend is shaping up. I've been testing with a variety of synthetic LDIFs as well as an actual application database (Zimbra accounts).
I noted before that back-mdb's write speeds on disk are quite slow. This is because a lot of its writes will be to random disk pages, and also the data writes in a transaction commit are followed by a meta page write, which always involves a seek to page 0 or page 1 of the DB file. For slapadd -q this effect can be somewhat hidden because the writes are done with MDB_NOSYNC specified, so no explicit flushes are performed. In my current tests with synchronous writes, back-mdb is one half the speed of back-bdb/hdb. (Even in fully synchronous mode, BDB only writes its transaction logs synchronously, and those are always sequential writes so there's no seek overhead to deal with.)
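As a rough sketch of what that looks like at the API level (not the actual slapadd code - the path is a placeholder and error handling is minimal), a bulk load along these lines opens the environment with MDB_NOSYNC and syncs once at the end:

    /* Sketch only, not the actual slapadd code: open the MDB environment
     * with MDB_NOSYNC so commits skip the explicit flush, then sync once
     * when the load is finished.  Assumes a 64-bit build. */
    #include "mdb.h"

    int bulk_open_example(void)
    {
        MDB_env *env;
        int rc;

        if ((rc = mdb_env_create(&env)) != 0)
            return rc;
        mdb_env_set_mapsize(env, (size_t)32 * 1024 * 1024 * 1024);  /* 32GB map */
        rc = mdb_env_open(env, "/path/to/db", MDB_NOSYNC, 0664);
        if (rc == 0) {
            /* ... run the import transactions here ... */
            mdb_env_sync(env, 1);    /* one explicit flush at the end */
        }
        mdb_env_close(env);
        return rc;
    }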
With that said, slapadd -q for a 3.2M entry database on a tmpfs:
back-hdb:  real 75m32.678s   user 84m31.733s   sys 1m0.316s
back-mdb:  real 63m51.048s   user 50m23.125s   sys 13m27.958s
For back-hdb, BDB was configured with a 32GB environment cache. The resulting DB directory consumed 14951004KB including data files and environment files.
For back-mdb, MDB was configured with a 32GB mapsize. The resulting DB directory consumed 18299832KB. The input LDIF was 2.7GB, and there were 29 attributes indexed. Currently MDB is somewhat wasteful with space when dealing with the sorted-duplicate databases that are used for indexing; there's definitely room for improvement here.
Also this slapadd was done with tool-threads set to 1, because back-mdb only allows one writer at a time anyway. There is also obviously room for improvement here, in terms of a bulk-loading API for the MDB library.
With the DB loaded, the time to execute a search that scans every entry in the DB was measured against each server.
Initially back-hdb was only configured with a cachesize of 10000 and IDLcachesize of 10000. It was tested again using a cachesize of 5,000,000 (which is more than was needed since the DB only contained 3,200,100 entries). In each configuration a search was performed twice - once to measure the time to go from an empty cache to a fully primed cache, and again to measure the time for the fully cached search.
                      first       second      slapd size
back-hdb, 10K cache   3m6.906s    1m39.835s   7.3GB
back-hdb, 5M cache    3m12.596s   0m10.984s   46.8GB
back-mdb              0m19.420s   0m16.625s   7.0GB
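For reference, the back-hdb cache settings being varied here are the usual slapd.conf directives; a minimal sketch of the first configuration (the rest of the database section is omitted):

    # back-hdb entry and IDL cache settings (first configuration above)
    database        hdb
    cachesize       10000
    idlcachesize    10000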
Next, the time to execute multiple instances of this search was measured, using 2, 4, 8, and 16 ldapsearch instances running concurrently.

average result time    2            4            8            16
back-hdb, 5M           0m14.147s    0m17.384s    0m45.665s    17m15.114s
back-mdb               0m16.701s    0m16.688s    0m16.621s    0m16.955s
I don't recall doing this test against back-hdb on ada.openldap.org before; certainly the total blowup at 16 searches was unexpected. But as you can see, with no read locks in back-mdb, search performance is pretty much independent of load. At 16 threads back-mdb slowed down measurably, but that's understandable given that the rest of the system still needed CPU cycles here and there. Otherwise, slapd was running at 1600% CPU the entire time. For back-hdb, slapd maxed out at 1467% CPU; the lock overhead drove it into the ground.
So far I'm pretty pleased with the results; for the most part back-mdb is delivering on what I expected. Decoding each entry every time is a bit of a slowdown, compared to having entries fully cached. But the cost disappears as soon as you get more than a couple requests running at once.
Overall I believe it proves the basic philosophy - in this day and age, it's a waste of application developers' time to incorporate a caching layer into their own code. The OS already does it and does it well. Give yourself as transparent a path as possible between RAM and disk using mmap, and don't fuss with it any further.
back-mdb was first feature-complete a week ago. Between then and now I've spent a bit of time profiling its behavior and eliminating hot spots. For this work I used a test database of 250,000 synthetic entries, running under valgrind's callgrind tool and my FunctionCheck profiler. (Occasionally the two tools would disagree on where hot spots were, so I wound up targeting both.) The callgrind output is available on http://highlandsun.com/hyc/mdb_search/ for reference.
With the initial back-mdb code, which was basically back-bdb with all of its caching logic removed, an ldapsearch that scanned the entire DB ran in 8,875,046,744 instructions. The biggest hot spot was memnrcmp(), which is the libmdb function for comparing two strings in reverse byte order. The top 10 functions were:
1,828,659,111  memnrcmp
1,151,097,812  strncasecmp
  911,117,166  mdb_search_node
  628,388,775  avl_find
  612,549,560  ad_keystring
  600,502,442  entry_decode
  494,045,070  slap_bv2ad
  376,177,636  mdb_search_page
  285,882,273  attr_index_name_cmp
  199,751,242  mdb_cursor_set
(That basically corresponded to commit e5b1dce6a7904e0eb31029959865730fc813ce57)
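For anyone unfamiliar with it, a reverse-byte-order comparison just walks the two keys from their last bytes back toward their first. A rough sketch of the idea (not the exact memnrcmp code):

    /* Sketch of a reverse-byte-order comparison, not the exact memnrcmp:
     * compare the two byte strings starting from their last bytes. */
    #include <stddef.h>

    static int reverse_cmp(const unsigned char *a, size_t alen,
                           const unsigned char *b, size_t blen)
    {
        const unsigned char *pa = a + alen;
        const unsigned char *pb = b + blen;
        size_t n = alen < blen ? alen : blen;

        while (n--) {
            int diff = *--pa - *--pb;
            if (diff)
                return diff;
        }
        return alen < blen ? -1 : alen > blen ? 1 : 0;
    }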
-=-=-=-
The next step was to eliminate slap_bv2ad() from the entry_decode() path, using numeric IDs for attributeDescriptions in the database. Rewriting slapd entry_decode as mdb_entry_decode() with this change brought the total search execution down to 5,410,551,759 instructions. The top 10 functions were:
1,823,974,563  memnrcmp
  796,099,511  mdb_search_node
  528,502,136  mdb_entry_decode
  376,251,165  mdb_search_page
  199,751,329  mdb_cursor_set
  167,097,364  strncasecmp
  151,524,069  attr_clean
  143,741,422  mdb_get_page
   87,494,764  cursor_push_page
   83,500,204  is_ad_subtype

(commit 1e32fcf099ba8c15333365fe68aefa5217ae3d8c)
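The idea, roughly (illustrative types only, not the actual back-mdb structures): each attribute in a stored entry carries a small numeric ID that indexes a table built when the database is opened, so decoding resolves it with one array lookup instead of the string parse/lookup that slap_bv2ad() had to do:

    /* Illustrative sketch only: resolving an attribute by its stored
     * numeric ID is a single bounds-checked array lookup. */
    #include <stddef.h>

    struct attr_desc;                 /* stands in for AttributeDescription */

    struct ad_table {
        struct attr_desc **ad;        /* filled in when the DB is opened */
        size_t             nad;
    };

    static struct attr_desc *resolve_ad(const struct ad_table *t, size_t adid)
    {
        return adid < t->nad ? t->ad[adid] : NULL;
    }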
-=-=-=-
Next was to eliminate some redundant navigation of the dn2id index. It was doing essentially the same traversal twice on each search candidate - once to determine if the candidate belonged to the search scope, and once to assemble the entryDN. With this change the total search came down to 3,767,565,531 instructions. The top functions were:
1,018,812,205  memnrcmp
  528,501,424  mdb_entry_decode
  463,442,142  mdb_search_node
  212,601,386  mdb_search_page
  167,097,240  strncasecmp
  151,523,868  attr_clean
   93,001,026  mdb_cursor_set
   83,500,204  is_ad_subtype
   80,889,277  avl_find
   80,496,022  mdb_get_page

(commit 6c8e4f2671b6aed41cd5098725048dbe2513612c)
-=-=-=-
The next step was a minor libmdb cleanup, restructuring it so that key/data pairs were always guaranteed to start on a 2-byte aligned address. (While x86 didn't seem to care, CPUs like SPARC would SIGBUS otherwise.) This restructuring brought execution down to 3,441,377,693 instructions - making code more portable is always a good thing, even if the impact is minor. The top functions were:
1,018,873,702  memnrcmp
  537,251,494  mdb_entry_decode
  463,465,251  mdb_search_node
  212,597,818  mdb_search_page
  151,523,868  attr_clean
   93,001,026  mdb_cursor_set
   83,500,204  is_ad_subtype
   80,496,022  mdb_get_page
   63,750,282  attrs_alloc
   62,500,669  mdb_search

(commit 293df78b2be77d6d153fd7052cc62d3377dc5501)
-=-=-=-
Next, it was finally time to do something about memnrcmp. The first step was simply writing a more integer-oriented function, cintcmp, which operates on unaligned integers a byte at a time. This had only a small impact, bringing execution down to 3,342,205,373 instructions.
919,700,412  cintcmp

(commit f9c8796d0b3ed9bc0f51c76bb28609121b1e2eec)

The rest of the trace profile is basically identical to the previous one.
-=-=-=-
Next was a bit of libmdb code cleanup and restructuring. The performance change was minimal, bringing total execution down to 3,310,510,255 instructions. The trace profile is mostly the same as the previous.

(commit dac3fae3b540841ae753bea16f3b353e2124c43d)
-=-=-=-
Next was a further speedup of cintcmp, changing it to operate on unsigned shorts instead of just chars, now that we had guaranteed 2-byte alignment. This brought execution down to 2,832,817,987 instructions, and shook up the profile a little bit:
537,251,494  mdb_entry_decode
475,922,450  cintcmp
457,377,978  mdb_search_node
176,828,204  mdb_search_page_root
151,523,868  attr_clean
 83,500,204  is_ad_subtype

(commit 1b69295a48cca409ed0c2f3fe655325e00f55ce2)
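A rough sketch of the idea (not the exact cintcmp code; it assumes equal-length keys stored least-significant word first, so the scan runs from the end of the key):

    /* Sketch only, not the exact cintcmp: with 2-byte alignment guaranteed,
     * equal-length integer keys can be compared an unsigned short at a time
     * instead of a byte at a time. */
    #include <stddef.h>

    static int short_key_cmp(const unsigned short *a, const unsigned short *b,
                             size_t nshorts)
    {
        const unsigned short *pa = a + nshorts;
        const unsigned short *pb = b + nshorts;

        while (pa > a) {
            int diff = *--pa - *--pb;
            if (diff)
                return diff;
        }
        return 0;
    }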
-=-=-=-
Next was a further rewrite of mdb_entry_decode, using tmpmem allocs instead of the slapd central entry_alloc/attrs_alloc functions. This brought execution down to 2,483,077,294 instructions. The profile is much like the previous, but attr_clean and its associated functions disappear. The top functions were:
535,751,482  mdb_entry_decode
475,922,450  cintcmp
457,377,978  mdb_search_node
176,828,204  mdb_search_page_root
 83,500,204  is_ad_subtype

(commit f72d65b77aa6cd4439ee0ad80b498f4ed707a278)
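The gain here is mostly about what no longer has to happen: with a per-operation temporary allocator, the decoded entry is carved out of one slab and released all at once, so there's no per-attribute malloc/free or cleanup pass. A rough sketch of that allocation pattern (not slapd's actual slab allocator):

    /* Illustrative sketch of the per-operation temporary allocator idea:
     * allocations are bumped out of one preallocated block and released
     * all at once when the operation ends. */
    #include <stddef.h>

    struct tmpmem {
        char  *base;
        size_t size, used;
    };

    static void *tmp_alloc(struct tmpmem *t, size_t n)
    {
        void *p;

        n = (n + 7) & ~(size_t)7;           /* keep 8-byte alignment */
        if (t->used + n > t->size)
            return NULL;                    /* a real allocator would grow here */
        p = t->base + t->used;
        t->used += n;
        return p;
    }

    static void tmp_reset(struct tmpmem *t)  /* "free" everything at once */
    {
        t->used = 0;
    }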
-=-=-=-
Next was another tweak for mdb_search(), keeping the cursor on the id2entry database for the duration of the search. This eliminated a lot of mdb_search_page overhead since usually the cursor was already on the right page when the next entry was being fetched. This change brought execution down to 2,241,832,009 instructions. The top functions were:
535,751,482  mdb_entry_decode
391,064,166  cintcmp
350,350,674  mdb_search_node
139,905,148  mdb_search_page_root
 93,256,473  mdb_cursor_set
 83,500,204  is_ad_subtype

(commit 54ced52c047425b432075946dd2997c52f020de0)
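The shape of the loop, roughly (a sketch with placeholder types and candidate list, not the actual back-mdb search code):

    /* Sketch of the idea only: keep a single cursor open on the id2entry
     * database for the whole search, so fetching the next candidate usually
     * finds the cursor already on the right page. */
    #include <stddef.h>
    #include "mdb.h"

    static int scan_candidates(MDB_txn *txn, MDB_dbi id2entry,
                               const unsigned long *ids, size_t nids)
    {
        MDB_cursor *mc;
        MDB_val key, data;
        size_t i;
        int rc = mdb_cursor_open(txn, id2entry, &mc);

        if (rc != 0)
            return rc;
        for (i = 0; i < nids; i++) {
            unsigned long id = ids[i];
            key.mv_size = sizeof(id);
            key.mv_data = &id;
            rc = mdb_cursor_get(mc, &key, &data, MDB_SET);
            if (rc == MDB_NOTFOUND) {
                rc = 0;                 /* skip missing IDs */
                continue;
            }
            if (rc != 0)
                break;
            /* decode the entry in data.mv_data and test it against the filter */
        }
        mdb_cursor_close(mc);
        return rc;
    }

-=-=-=-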
Finally (as of this morning) a bit of cleanup and restructuring in libmdb, to eliminate a bunch of cruft in the previous data structure layout. This change was more for esthetic reasons than for performance, but it still offered a small gain, with execution at 2,232,560,312 instructions. The top functions:
535,751,482  mdb_entry_decode
391,064,166  cintcmp
342,681,498  mdb_search_node
129,900,173  mdb_search_page_root
 90,006,339  mdb_cursor_set
 83,500,204  is_ad_subtype

(commit 25529a4c36903d0456b1251712de32f665850029)
-=-=-=-
At the outset libmdb had its clumsy parts. Now the libmdb/mdb.o text+data is only 31255 bytes - it's tight and very efficient. The whole DB engine can execute within a CPU's L1 instruction cache, with room to spare.
Howard Chu wrote:
With that said, slapadd -q for a 3.2M entry database on a tmpfs:
back-hdb:  real 75m32.678s   user 84m31.733s   sys 1m0.316s
back-mdb:  real 63m51.048s   user 50m23.125s   sys 13m27.958s
On an XFS partition, the same job took:

back-hdb:  real 80m34.403s   user 86m2.439s    sys 1m39.662s
back-mdb:  real 85m48.598s   user 49m40.606s   sys 14m48.668s
Note that back-hdb runs a trickle-sync thread to flush dirty DB pages in the background, which is why its user time is greater than the real time.
back-mdb actually completed the load in 64m16.19s according to slapadd's progress meter, but back-mdb performs an mdb_env_sync() on close, and that sync took the remaining 21 minutes. (back-hdb does much the same - a checkpoint on close - but it skips it in Quick mode, so to be apples-to-apples, back-mdb's final sync should have been omitted as well.)
A couple new results for back-mdb as of today.
                      first       second      slapd size
back-hdb, 10K cache   3m6.906s    1m39.835s   7.3GB
back-hdb, 5M cache    3m12.596s   0m10.984s   46.8GB
back-mdb (previous)   0m19.420s   0m16.625s   7.0GB
back-mdb (today)      0m15.041s   0m12.356s   7.8GB

Next, the time to execute multiple instances of this search was measured, using 2, 4, 8, and 16 ldapsearch instances running concurrently.

average result time    2            4            8            16
back-hdb, 5M           0m14.147s    0m17.384s    0m45.665s    17m15.114s
back-mdb (previous)    0m16.701s    0m16.688s    0m16.621s    0m16.955s
back-mdb (today)       0m12.009s    0m11.978s    0m12.048s    0m12.506s
So back-mdb is faster than back-hdb whenever there's more than one query running. Also with result times of 0m11.699s measured, back-mdb is within 7% of back-hdb's speed even in the single-query case, where hdb has zero lock contention and all it has to do is dump cached entries from RAM (i.e., back-hdb is doing practically zero work at all).
Howard Chu wrote:
So back-mdb is faster than back-hdb whenever there's more than one query running. Also with result times of 0m11.699s measured, back-mdb is within 7% of back-hdb's speed even in the single-query case, where hdb has zero lock contention and all it has to do is dump cached entries from RAM (i.e., back-hdb is doing practically zero work at all).
For comparison, the time to dd the raw DB file to /dev/null:

hyc@ada:~$ time dd if=/mnt/hyc/data/ldap/mdb/db/data.mdb of=/dev/null bs=1024k
18643+1 records in
18643+1 records out
19548712960 bytes (20 GB) copied, 11.0087 s, 1.8 GB/s

real    0m11.019s
user    0m0.000s
sys     0m11.009s
So effectively, back-mdb with all of slapd wrapped around it only adds 10% overhead compared to just copying the raw data as fast as possible.
Howard Chu wrote:
Next, the time to execute multiple instances of this search was measured, using 2, 4, 8, and 16 ldapsearch instances running concurrently.

average result time    2            4            8            16
back-hdb, 5M           0m14.147s    0m17.384s    0m45.665s    17m15.114s
back-mdb (previous)    0m16.701s    0m16.688s    0m16.621s    0m16.955s
back-mdb (today)       0m12.009s    0m11.978s    0m12.048s    0m12.506s
This result for back-hdb just didn't make any sense. Going back, I discovered that I'd made a newbie mistake - my slapd was using the libdb-4.7.so that Debian bundled, instead of the one I had built in /usr/local/lib. Apparently my LD_LIBRARY_PATH setting that I usually have in my .profile was commented out when I was working on some other stuff.
While loading a 5 million entry DB for SLAMD testing, I went back and rechecked these results and got much more reasonable numbers for hdb. Most likely the main difference is that Debian builds BDB with its default configuration for mutexes, which is a hybrid that begins with a spinlock and eventually falls back to a pthread mutex. Spinlocks are nice and fast, but only for a small number of processors. Since they use atomic instructions that are meant to lock the memory bus, the coherency traffic they generate is quite heavy, and it increases geometrically with the number of processors involved.
I always build BDB with an explicit --with-mutex=POSIX/pthreads to avoid the spinlock code. Linux futexes are decently fast, and scale much better as the number of processors goes up.
With slapd linked against my build of BDB 4.7, and using the 5 million entry database instead of the 3.2M entry database I used before, the numbers make much more sense.
slapadd -q times:

back-hdb:  real 66m09.831s   user 115m52.374s   sys 5m15.860s
back-mdb:  real 29m33.212s   user 22m21.264s    sys 7m11.851s
ldapsearch scanning the entire DB:

                first       second      slapd size   DB size
back-hdb, 5M    4m15.395s   0m16.204s   26GB         15.6GB
back-mdb        0m14.725s   0m10.807s   10GB         12.8GB
multiple concurrent scans, average result time:

                2           4           8           16
back-hdb, 5M    0m24.617s   0m32.171s   1m04.817s   3m04.464s
back-mdb        0m10.789s   0m10.842s   0m10.931s   0m12.023s
You can see that up to 4 concurrent searches, the BDB spinlock probably would have been faster. Above that, you need to get rid of the spinlocks. If I had realized I was using the default BDB build, I could of course have configured the BDB environment with set_tas_spins in the DB_CONFIG file. We used to always set this to 1, overriding the BDB default of 50 * number of CPUs, before we decided to omit spinlocks entirely at configure time.
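For reference, that override is a one-line entry in the environment's DB_CONFIG file:

    # allow at most one test-and-set spin before blocking on the
    # underlying mutex, instead of BDB's default of 50 * number of CPUs
    set_tas_spins 1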
But I think this also illustrates another facet of MDB - reducing config complexity, so there's a much smaller range of mistakes that can be made.
Re: the slapd process size - in my original test I configured a 32GB BDB environment cache. This particular LDIF only needed an 8GB cache, so that's the size I used this time around. The 47GB size reported before was the Virtual size of the process, not the Resident size. That was also a mistake; all of the other numbers are Resident size.
When contemplating the design of MDB I had originally estimated that we could save somewhere around 2-3x the RAM compared to BDB. With slapd running 2.7x larger with BDB than MDB on this test DB, that estimate has been proved to be correct.
The MVCC approach has also proved its value, with no bottlenecks for readers and response time essentially flat as number of CPUs scales up.
There are still problem areas that need work, but it looks like we're on the right track, and what started out as possibly-a-good-idea is delivering on the promise.
Howard Chu wrote:
The MVCC approach has also proved its value, with no bottlenecks for readers and response time essentially flat as number of CPUs scales up.
slamd results have been interesting. The same set of clients that easily push back-hdb up to 62,000 searches/second at 1485% CPU use are gasping and dying, pushing back-mdb over 75,000 searches/second at only 1000% CPU use. Once again I need to bring some more load generator machines online in order to actually max out slapd.
Howard Chu wrote:
slamd results have been interesting. The same set of clients that easily push back-hdb up to 62,000 searches/second at 1485% CPU use are gasping and dying, pushing back-mdb over 75,000 searches/second at only 1000% CPU use. Once again I need to bring some more load generator machines online in order to actually max out slapd.
Here's a report from a slamd run I just completed.
http://highlandsun.com/hyc/slamd-mdb/
The report includes a ModRate job and a SearchRate job; both were run concurrently. Aside from the fact that the average search rate is over 85,000 searches/second (on a machine that I previously thought was maxed out at 63,000), more interesting is the peak of almost 107,000 searches/second. The result curve drops, flattens, and then rises again, which shows the influence of the writers occupying server threads and making them unavailable for readers, until the writer job finishes.
On this run slapd hit 1300% CPU. Core 0, which was fielding ethernet interrupts, was at 80% handling soft interrupts. I have no idea whether we can generate enough load to hit 90% or more there; it seems unlikely.
The write rate is pretty slow, as we already knew. I frankly don't see it improving very much, given the single-writer nature of MDB.
Gavin Henry wrote:
Wow! What are the test servers specs?
It's ada.openldap.org; you already have an account there and can see for yourself.
It's an HP DL585 G5 with 4 AMD Opteron 8354 quad-core processors. The same machine that produced these results last year:
http://highlandsun.com/hyc/slamd/
Howard Chu wrote:
The write rate is pretty slow, as we already knew. I frankly don't see it improving very much, given the single-writer nature of MDB.
I guess "slow" is relative. My previous modrate tests didn't touch any indices. I've added entryCSN,entryUUID eq indexing to the config (since that's common for a syncrepl environment) and gotten much different results.
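In slapd.conf terms that's just the usual index directive, applied to both backends for this test:

    index entryCSN,entryUUID eq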
Basically for back-hdb there's a continuous stream of deadlocks from updating the entryCSN index. This pushes the maximum mod rate down to only 3400/sec. With this configuration back-mdb was actually faster, at 3800/sec.
With accesslog thrown into the mix, back-mdb drops to 2500/sec while back-hdb drops to 1700/sec. So for any realistic scenario, back-mdb beats back-hdb all around.