Howard Chu wrote:
A couple of new results for back-mdb as of today.
                     first      second     slapd size
back-hdb, 10K cache  3m6.906s   1m39.835s   7.3GB
back-hdb, 5M cache   3m12.596s  0m10.984s  46.8GB
back-mdb             0m19.420s  0m16.625s   7.0GB
back-mdb             0m15.041s  0m12.356s   7.8GB
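For context, the "first" and "second" columns are the times for an ldapsearch scanning the entire DB, run back to back; the second run presumably benefits from warm caches. A rough sketch of that kind of search follows - the URI and base DN are illustrative, not the exact values used here:

    # full-scan search, timed; run it a second time for the warm-cache case
    time ldapsearch -x -H ldap://localhost -b "dc=example,dc=com" \
        '(objectClass=*)' > /dev/null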
Next, the time to execute multiple instances of this search was measured, using 2, 4, 8, and 16 ldapsearch instances running concurrently.

average result time
               2          4          8          16
back-hdb, 5M   0m14.147s  0m17.384s  0m45.665s  17m15.114s
back-mdb       0m16.701s  0m16.688s  0m16.621s   0m16.955s
back-mdb       0m12.009s  0m11.978s  0m12.048s   0m12.506s
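A sketch of how the concurrent case can be driven from the shell, reusing the same full-scan search as above (N, the URI, base DN, and file names are all illustrative):

    # launch N copies of the full scan and record each instance's elapsed time
    N=8
    for i in $(seq 1 $N); do
        ( time ldapsearch -x -H ldap://localhost -b "dc=example,dc=com" \
            '(objectClass=*)' > /dev/null ) 2> scan.$i.time &
    done
    wait
    # the per-instance "real" times in scan.*.time are then averaged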
That 17-minute result for back-hdb just didn't make any sense. Going back over the setup, I discovered I'd made a newbie mistake: slapd was using the libdb-4.7.so that Debian bundles instead of the one I had built in /usr/local/lib. The LD_LIBRARY_PATH setting I normally keep in my .profile had apparently been commented out while I was working on some other stuff.
While loading a 5 million entry DB for SLAMD testing, I went back and rechecked these results and got much more reasonable numbers for hdb. Most likely the main difference is that Debian builds BDB with its default mutex configuration, a hybrid that starts with a spinlock and eventually falls back to a pthread mutex. Spinlocks are nice and fast, but only for a small number of processors. Since they rely on atomic instructions that lock the memory bus, the coherency traffic they generate is quite heavy, and it grows geometrically with the number of processors involved.
I always build BDB with an explicit --with-mutex=POSIX/pthreads to avoid the spinlock code. Linux futexes are decently fast, and scale much better as the number of processors goes up.
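For anyone reproducing this, the build I mean is along these lines (the version and install prefix are shown only for illustration):

    # build BDB with pure POSIX mutexes instead of the hybrid spinlock/pthread default
    cd db-4.7.25/build_unix
    ../dist/configure --with-mutex=POSIX/pthreads --prefix=/usr/local
    make && make install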
With slapd linked against my build of BDB 4.7, and using the 5 million entry database instead of the 3.2M entry database I used before, the numbers make much more sense.
slapadd -q times
          real        user         sys
back-hdb  66m09.831s  115m52.374s  5m15.860s
back-mdb  29m33.212s   22m21.264s  7m11.851s
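The load itself was just a quick-mode slapadd, roughly like this (the config path and LDIF name are illustrative):

    time slapadd -q -f slapd.conf -l 5M-entries.ldif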
ldapsearch scanning the entire DB
              first      second     slapd size  DB size
back-hdb, 5M  4m15.395s  0m16.204s  26GB        15.6GB
back-mdb      0m14.725s  0m10.807s  10GB        12.8GB
multiple concurrent scans, average result time
              2          4          8          16
back-hdb, 5M  0m24.617s  0m32.171s  1m04.817s  3m04.464s
back-mdb      0m10.789s  0m10.842s  0m10.931s  0m12.023s
You can see that up to 4 concurrent searches, the BDB spinlocks would probably have been faster; above that, you need to get rid of them. If I had realized I was using the default BDB build, I could of course have configured the BDB environment with set_tas_spins in the DB_CONFIG file. We always used to set this to 1, overriding the BDB default of (50 * number of CPUs), before we decided to drop the spinlocks entirely at configure time.
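For completeness, that tweak is just a line in the environment's DB_CONFIG file, something like the following (I haven't re-checked the exact parameter spelling across all BDB releases):

    # disable test-and-set spins so mutexes fall straight through to pthread mutexes
    set_tas_spins 1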
But I think this also illustrates another facet of MDB: reduced configuration complexity, which leaves a much smaller range of mistakes that can be made.
Re: the slapd process size - in my original test I configured a 32GB BDB environment cache. This particular LDIF only needed an 8GB cache, so that's the size I used this time around. The 46.8GB size reported earlier was the Virtual size of the process, not the Resident size; that was also a mistake, since all of the other numbers are Resident sizes.
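For reference, the BDB cache is set in the same DB_CONFIG file; an 8GB cache would look something like the following (a single cache region, shown only as an illustration):

    # set_cachesize <gigabytes> <bytes> <number of cache regions>
    set_cachesize 8 0 1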
When contemplating the design of MDB, I had originally estimated that we could save somewhere around 2-3x in RAM compared to BDB. With slapd's footprint 2.7x larger under BDB than under MDB on this test database, that estimate has proved to be correct.
The MVCC approach has also proved its value, with no bottlenecks for readers and response times essentially flat as the number of CPUs in use scales up.
There are still problem areas that need work, but it looks like we're on the right track, and what started out as possibly-a-good-idea is delivering on the promise.