Rick Jones wrote:
There are definitely interrupt coalescing settings available with tg3-driven cards, as well as bnx2-driven ones:
ftp://ftp.cup.hp.com/dist/networking/briefs/nic_latency_vs_tput.txt
Yep, that helped. Raising rx-usecs from default 20 to 1000, and rx-frames from default 5 to 100, I'm getting 43k auths/sec with back-null (in 4 separate thread pools) and the core fielding the interrupts is only about 80% busy now instead of 100%. I'm afraid my load generators may be maxed out now, because I can't seem to drive up the load on the server any higher even though there's more idle CPU.
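(Aside, for anyone who wants to script the same tweak rather than run ethtool by hand: a minimal sketch of the SIOCETHTOOL ioctl that sets those two coalescing values. The interface name "eth0" is a placeholder, it needs root, and the driver has to support coalescing, as tg3/bnx2 do.)

  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <sys/socket.h>
  #include <net/if.h>
  #include <linux/ethtool.h>
  #include <linux/sockios.h>

  int main(void)
  {
      struct ifreq ifr;
      struct ethtool_coalesce ec;
      int fd = socket(AF_INET, SOCK_DGRAM, 0);

      memset(&ifr, 0, sizeof(ifr));
      memset(&ec, 0, sizeof(ec));
      strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);   /* placeholder interface */
      ifr.ifr_data = (char *)&ec;

      ec.cmd = ETHTOOL_GCOALESCE;                    /* fetch current settings */
      if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) { perror("GCOALESCE"); return 1; }

      ec.rx_coalesce_usecs = 1000;                   /* driver default was 20 here */
      ec.rx_max_coalesced_frames = 100;              /* driver default was 5 here */
      ec.cmd = ETHTOOL_SCOALESCE;                    /* write the new values back */
      if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) { perror("SCOALESCE"); return 1; }

      close(fd);
      return 0;
  }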
The current code in HEAD (with only 1 thread pool) is reaching 36k auths/sec with back-null, so it's actually not far off from my experimental peak rate. Considering that HEAD was at 25k/sec last week (and now in 2.4.6) that's pretty decent.
With back-bdb and 1 million users I'm getting 26.1k/sec with plaintext passwords (up from 19.3k/sec last week). With {SSHA} passwords that drops to 25.7k/sec (~1.5% difference).
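(For a sense of why {SSHA} costs so little: each bind verification is just one SHA-1 over password||salt plus a 20-byte compare. A rough sketch using OpenSSL; ssha_verify is a hypothetical helper that takes the already base64-decoded digest and salt, not slapd's actual code.)

  #include <string.h>
  #include <openssl/sha.h>

  /* Hypothetical helper: stored_digest and salt are the base64-decoded
   * pieces of a "{SSHA}..." userPassword value (20-byte SHA-1 digest
   * followed by the salt). */
  int ssha_verify(const char *password,
                  const unsigned char *stored_digest,
                  const unsigned char *salt, size_t saltlen)
  {
      unsigned char d[SHA_DIGEST_LENGTH];
      SHA_CTX ctx;

      SHA1_Init(&ctx);
      SHA1_Update(&ctx, password, strlen(password));
      SHA1_Update(&ctx, salt, saltlen);     /* the salt is appended to the password */
      SHA1_Final(d, &ctx);
      return memcmp(d, stored_digest, SHA_DIGEST_LENGTH) == 0;
  }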
I have to put this tinkering on hold for a bit, to run some authrate tests against ActiveDirectory on this machine (using W2K3sp2 X64). Later on we'll do a W2K3 OpenLDAP build for comparison as well. Should be entertaining...
Howard Chu wrote:
Yep, that helped. Raising rx-usecs from default 20 to 1000, and rx-frames from default 5 to 100, I'm getting 43k auths/sec with back-null (in 4 separate thread pools) and the core fielding the interrupts is only about 80% busy now instead of 100%. I'm afraid my load generators may be maxed out now, because I can't seem to drive up the load on the server any higher even though there's more idle CPU.
The current code in HEAD (with only 1 thread pool) is reaching 36k auths/sec with back-null, so it's actually not far off from my experimental peak rate. Considering that HEAD was at 25k/sec last week (and now in 2.4.6) that's pretty decent.
With back-bdb and 1 million users I'm getting 26.1k/sec with plaintext passwords (up from 19.3k/sec last week). With {SSHA} passwords that drops to 25.7k/sec (~1.5% difference).
I have to put this tinkering on hold for a bit, to run some authrate tests against ActiveDirectory on this machine (using W2K3sp2 X64). Later on we'll do a W2K3 OpenLDAP build for comparison as well. Should be entertaining...
Just for reference, using slapadd with tool-threads set to 4, it took 7:05.17 (just over seven minutes) to load an LDIF with 1 million user objects. These user objects had plaintext passwords. When I later decided to change them to {SSHA} passwords it took 10:12.38 to ldapmodify all of them.
This machine came with a pair of Maxtor 36GB 10k RPM SCSI drives. We added a pair of IBM 146GB 10k RPM SCSI drives. One of the 36GB drives has FedoraCore6 on it. We installed Windows 2003 SP2 Enterprise Edition for x86_64 on the other 36GB drive.
We split the 146GB drives into two partitions each, with each partition occupying half of the drive. The partitions are assigned such that both Windows and Linux get equivalent layouts:
/dev/sdc1 - NTFS, AD logs
/dev/sdc2 - XFS, OpenLDAP data
/dev/sdd1 - XFS, OpenLDAP logs
/dev/sdd2 - NTFS, AD data
My assumption here is that the transaction log partition will get more frequent activity, and the data partition will just get the occasional flush. So, I chose to place the log partitions on the outer tracks of the drives where they should have higher throughput and lower latency.
Anyway, importing the same 1 million user LDIF with Microsoft's ldifde tool, using 8 threads, took 4:23:46.85 (yes, that's over 4 hours for MS AD vs. about 7 minutes for OpenLDAP). By the way, we configured the server as noted in this Microsoft document http://www.microsoft.com/downloads/details.aspx?FamilyID=52e7c3bd-570a-475c-... in Appendix B: Setup Instructions Step 1. That allowed us to import regular inetOrgPerson entries with userPassword attributes and have AD treat them as actual user accounts. (Otherwise we would have had to convert all the entries to use the Microsoft unicodePwd attribute instead.)
Unfortunately, the accounts imported this way were all initially disabled. So we had to ldapmodify their userAccountControl attribute to enable them all before we could proceed with the authentication tests. It took 20:57.017 (about 21 minutes) to ldapmodify all 1 million user records.
Finally we got to running the actual authrate tests, which yielded a peak rate of 4526 auths/second with 40 client threads. The rate declined from there as more clients were added; AD clearly isn't capable of handling very many concurrent sessions. It also appears that most of the CPUs were idle; perhaps 3 out of 8 cores were actually doing any work. I.e., AD doesn't scale well across multiple CPUs.
Unfortunately the native AD server runs as a privileged process and Windows doesn't allow you to alter its processor affinity settings, so there's no way to directly measure how it scales from one core up to eight. But I guess there's really nothing interesting to see here anyway. (For reference, even when restricted to only a single core on this machine, OpenLDAP 2.4.5 handled about 8800 auths/second, coming from even more client threads. And that was before any other tweaks.)
The numbers speak for themselves.
It's enlightening to look at the actual CPU time used during the import tasks. For ldifde on W2K3 we got:
time ldifde.exe -i -f examp3.ldif -h -q 8
261.10u 140.73s 4:23:46.85 2.5%
For slapadd on FC6 we got:
time ./slapadd -f slapd.conf.slam -q -l example.ldif.1mil
260.75u 80.86s 7:05.17 80%
One interesting part here is that the amount of user CPU time is nearly identical in both cases. That implies that both slapadd and ldifde are doing about the same amount of work to parse the input LDIF. (For all we know they could be doing *exactly* the same work, using our own code. Or it could just be an interesting coincidence.)
Comparing the rest of the time isn't really fair since it seems that ldifde just feeds data into a running server using LDAP, while slapadd simply writes to the DB directly. I guess for the sake of fairness we'll have to time an OpenLDAP import using ldapadd next.
We'll remove AD and test ADAM next. Since ADAM runs as a normal user process, we should at least be able to tweak its processor affinity and plot how it scales with the number of cores. Later we'll build a 64 bit OpenLDAP on Windows and see how it fares. My experience with 32 bit Windows has been that slapd runs about as fast on Windows as it does on Linux. But with the silly limit that Windows places on how many sockets a process can have open (64, IIRC), you really can't subject it to as heavy a load in production use.
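(The 64 above is presumably Winsock's default FD_SETSIZE for select(); it can only be raised at compile time, by defining it before winsock2.h is pulled in. A Windows-only sketch:)

  /* Winsock sizes fd_set for 64 sockets unless FD_SETSIZE is defined larger
   * before winsock2.h is included; select() still scans the set linearly. */
  #define FD_SETSIZE 4096
  #include <winsock2.h>
  #include <stdio.h>

  int main(void)
  {
      fd_set fds;
      FD_ZERO(&fds);
      printf("select() fd_set sized for %d sockets\n", FD_SETSIZE);
      return 0;
  }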
At this point I'd have a few choice things to say about Microsoft in general and AD in particular, but I think the numbers speak for themselves.
Howard Chu wrote:
Finally we got to running the actual authrate tests, which yielded a peak rate of 4526 auths/second with 40 client threads. The rate declined from there as more clients were added; AD clearly isn't capable of handling very many concurrent sessions.
Correction, the peak rate was with only 24 client threads. The rate dropped to 4400/sec and stayed there as more client threads were added after that point.
It's enlightening to look at the actual CPU time used during the import tasks. For ldifde on W2K3 we got:
time ldifde.exe -i -f examp3.ldif -h -q 8
261.10u 140.73s 4:23:46.85 2.5%
For slapadd on FC6 we got:
time ./slapadd -f slapd.conf.slam -q -l example.ldif.1mil
260.75u 80.86s 7:05.17 80%
One interesting part here is that the amount of user CPU time is nearly identical in both cases. That implies that both slapadd and ldifde are doing about the same amount of work to parse the input LDIF.
Eh, I take that back. The slapadd time includes both parsing and all of the BDB manipulation, while the ldifde time just includes parsing and encoding to BER. The database manipulation time in the server isn't reflected at all in these numbers. So really there's not much comparable in these figures at all...
Comparing the rest of the time isn't really fair since it seems that ldifde just feeds data into a running server using LDAP, while slapadd simply writes to the DB directly. I guess for the sake of fairness we'll have to time an OpenLDAP import using ldapadd next.
For completeness we would have to time both ldapadd against AD and ldifde against OpenLDAP. Since ldifde only exists for Windows we'd have to do those runs all on Windows, against a Windows build of OpenLDAP. I guess that can come later; we already know that ldapadd in OpenLDAP is now pretty well optimized.
One interesting feature of ldifde is that it also supports multiple threads, unlike our ldapadd/ldapmodify. Of course, the MS implementation is pretty braindead: if you have your entire tree in a single LDIF file, and try to import it with multiple threads in parallel, it will fail because there's no check to make sure that the thread importing a parent entry completes before the threads that import child entries start. So, to take advantage of the multithreading, you need to break out all of the parent entries into a separate file, import them single-threaded, and then do all the children in a separate invocation.
It seems that whoever added that feature to ldifde didn't really think about how Directories work, or what's actually useful for a directory admin.
Howard Chu wrote:
Howard Chu wrote:
Finally we got to running the actual authrate tests, which yielded a peak rate of 4526 auths/second with 40 client threads. The rate declined from there as more clients were added; AD clearly isn't capable of handling very many concurrent sessions.
Correction, the peak rate was with only 24 client threads. The rate dropped to 4400/sec and stayed there as more client threads were added after that point.
It's enlightening to look at the actual CPU time used during the import tasks. For ldifde on W2K3 we got:
time ldifde.exe -i -f examp3.ldif -h -q 8
261.10u 140.73s 4:23:46.85 2.5%
For slapadd on FC6 we got:
time ./slapadd -f slapd.conf.slam -q -l example.ldif.1mil
260.75u 80.86s 7:05.17 80%
One interesting part here is that the amount of user CPU time is nearly identical in both cases. That implies that both slapadd and ldifde are doing about the same amount of work to parse the input LDIF.
Eh, I take that back. The slapadd time includes both parsing and all of the BDB manipulation, while the ldifde time just includes parsing and encoding to BER. The database manipulation time in the server isn't reflected at all in these numbers. So really there's not much comparable in these figures at all...
Comparing the rest of the time isn't really fair since it seems that ldifde just feeds data into a running server using LDAP, while slapadd simply writes to the DB directly. I guess for the sake of fairness we'll have to time an OpenLDAP import using ldapadd next.
For completeness we would have to time both ldapadd against AD and ldifde against OpenLDAP. Since ldifde only exists for Windows we'd have to do those runs all on Windows, against a Windows build of OpenLDAP. I guess that can come later; we already know that ldapadd in OpenLDAP is now pretty well optimized.
One interesting feature of ldifde is that it also supports multiple threads, unlike our ldapadd/ldapmodify. Of course, the MS implementation is pretty braindead: if you have your entire tree in a single LDIF file, and try to import it with multiple threads in parallel, it will fail because there's no check to make sure that the thread importing a parent entry completes before the threads that import child entries start. So, to take advantage of the multithreading, you need to break out all of the parent entries into a separate file, import them single-threaded, and then do all the children in a separate invocation.
It seems that whoever added that feature to ldifde didn't really think about how Directories work, or what's actually useful for a directory admin.
Thanks for all this Howard. It certainly makes it clear where OpenLDAP lies in the LDAP world (looking down from the top and around at everyone else ;-) ).
Gavin Henry wrote:
Thanks for all this Howard. It certainly makes it clear where OpenLDAP lies in the LDAP world (looking down from the top and around at everyone else ;-) ).
Hmm...not so sure about the 'everyone' part, because I added this feature (parallel import with correctly implemented parent/child entry interlocking) to the product that is now SunDS and FedoraDS in 1999.
David Boreham wrote:
Gavin Henry wrote:
Thanks for all this Howard. It certainly makes it clear where OpenLDAP lies in the LDAP world (looking down from the top and around at everyone else ;-) ).
Hmm...not so sure about the 'everyone' part, because I added this feature (parallel import with correctly implemented parent/child entry interlocking) to the product that is now SunDS and FedoraDS in 1999.
With all due respect David, you've done some nice work on those products but OpenLDAP is still significantly faster than either SunDS or FedoraDS, both in importing data and in the authentication rates.
IIRC we had this conversation about parallel import before on this list. I had tried parallel entry importing before and gave up on it because it showed no benefit on my tests. Then based on your hints I tested a variety of other approaches. What I eventually settled on for OL 2.3 - serial entry import with parallel indexing - has been the fastest approach in my tests.
One obvious reason is that no entry locking is required because only a single thread ever operates on any given DB. With parallel writes of entries to the same DB, you still need full locking to prevent data corruption. Another reason is that BerkeleyDB's B-trees are optimized for sequential writes; parallelizing the entry writes defeats that optimization because it partially randomizes the I/Os. (I.e., it requires seeks instead of just sequential access.)
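(To make the shape of that concrete, here's a toy pthreads sketch of the idea; it is not OpenLDAP's actual code. One thread stores entries in ID order, and each index DB is owned by exactly one worker, so no database ever has two writers and the entry DB always sees ascending keys. The counts and printfs are placeholders.)

  #include <pthread.h>
  #include <stdio.h>

  #define NENTRIES 8
  #define NINDEXES 3                 /* e.g. one thread each for uid/cn/mail */

  struct worker {
      pthread_t tid;
      pthread_mutex_t mtx;
      pthread_cond_t cv;
      int q[NENTRIES];               /* entry IDs queued for this index DB */
      int head, tail, done;
      int index_no;
  } workers[NINDEXES];

  static void *index_thread(void *arg)
  {
      struct worker *w = arg;
      int id;
      for (;;) {
          pthread_mutex_lock(&w->mtx);
          while (w->head == w->tail && !w->done)
              pthread_cond_wait(&w->cv, &w->mtx);
          if (w->head == w->tail) {  /* queue drained and producer finished */
              pthread_mutex_unlock(&w->mtx);
              return NULL;
          }
          id = w->q[w->head++];
          pthread_mutex_unlock(&w->mtx);
          /* only this thread ever writes index DB #index_no: no DB-level locking needed */
          printf("index %d: add keys for entry %d\n", w->index_no, id);
      }
  }

  int main(void)
  {
      int i, id;

      for (i = 0; i < NINDEXES; i++) {
          pthread_mutex_init(&workers[i].mtx, NULL);
          pthread_cond_init(&workers[i].cv, NULL);
          workers[i].index_no = i;
          pthread_create(&workers[i].tid, NULL, index_thread, &workers[i]);
      }
      for (id = 1; id <= NENTRIES; id++) {
          /* single writer: the entry DB receives keys in strictly ascending order */
          printf("entry DB: store entry %d\n", id);
          for (i = 0; i < NINDEXES; i++) {
              pthread_mutex_lock(&workers[i].mtx);
              workers[i].q[workers[i].tail++] = id;
              pthread_cond_signal(&workers[i].cv);
              pthread_mutex_unlock(&workers[i].mtx);
          }
      }
      for (i = 0; i < NINDEXES; i++) {
          pthread_mutex_lock(&workers[i].mtx);
          workers[i].done = 1;
          pthread_cond_signal(&workers[i].cv);
          pthread_mutex_unlock(&workers[i].mtx);
          pthread_join(workers[i].tid, NULL);
      }
      return 0;
  }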
It's too bad that SunDS still disallows publishing benchmark results. The current license terms just tell the world they have something to hide.
David Boreham wrote:
Gavin Henry wrote:
Thanks for all this Howard. It certainly makes it clear where OpenLDAP lies in the LDAP world (looking down from the top and around at everyone else ;-) ).
Hmm...not so sure about the 'everyone' part, because I added this feature (parallel import with correctly implemented parent/child entry interlocking) to the product that is now SunDS and FedoraDS in 1999.
I wasn't talking about a particular feature, but OpenLDAP in general.
On Oct 29, 2007, at 13:54 , Gavin Henry wrote:
Thanks for all this Howard. It certainly makes it clear where OpenLDAP lies in the LDAP world (looking down from the top and around at everyone else ;-) ).
It's just a shame that most people don't even know this, they only think of the commercial packages when anyone brings up directory services.
jens
Jens Vagelpohl wrote:
On Oct 29, 2007, at 13:54 , Gavin Henry wrote:
Thanks for all this Howard. It certainly makes it clear where OpenLDAP lies in the LDAP world (looking down from the top and around at everyone else ;-) ).
It's just a shame that most people don't even know this, they only think of the commercial packages when anyone brings up directory services.
jens
It's because the people in the *know* keep it their secret and don't want others to benefit ;-)
Jens Vagelpohl wrote:
On Oct 29, 2007, at 13:54 , Gavin Henry wrote:
Thanks for all this Howard. It certainly makes it clear where OpenLDAP lies in the LDAP world (looking down from the top and around at everyone else ;-) ).
It's just a shame that most people don't even know this, they only think of the commercial packages when anyone brings up directory services.
Yes, refreshed magazine coverage and other media attention would help. I recall that we contacted NetworkWorld (http://www.networkworld.com/reviews/2000/0515rev2.html) about revisiting this topic a couple of years ago and they declined. I suppose if a lot of their readers were to write in about it, that might have more influence than any of us on the Project would.
Howard Chu wrote:
We'll remove AD and test ADAM next. Since ADAM runs as a normal user process, we should at least be able to tweak its processor affinity and plot how it scales with the number of cores. Later we'll build a 64 bit OpenLDAP on Windows and see how it fares. My experience with 32 bit Windows has been that slapd runs about as fast on Windows as it does on Linux. But with the silly limit that Windows places on how many sockets a process can have open (64, IIRC), you really can't subject it to as heavy a load in production use.
Ultimately, after several more runs, AD peaked at 4800 auths/sec (a bit faster than the 4400 we saw before; apparently it took a while for their caches to get fully primed). We then removed AD and installed ADAM. The import went a bit faster this time, thankfully:
time ldifde.exe -i -h -f examp3.ldif -s localhost -q 8
152.62u 99.32s 27:33.84 15.2%
Using 8 threads, 27-1/2 minutes. The peak authentication rate we got was just under 5500 auths/sec using 52 client threads. At that load a single auth took an average of 9.5 milliseconds, up from 1.5ms at 4 client threads. (slapd's average latency is only 0.5ms at 8 clients, 2.2ms at 52 clients.)
It'll be even more fun when we come back and run the tests on a 5 million entry DB. It's pretty clear from looking at their memory usage that they won't be able to cache all of that in the 16GB of RAM on this box, while OpenLDAP will. As such, their throughput rate will just reflect the speed of our disks, while slapd will still be running at full RAM speed.
At this point I'd have a few choice things to say about Microsoft in general and AD in particular, but I think the numbers speak for themselves.
Howard Chu writes:
Ultimately after several more repeated runs, AD peaked at 4800 auths/sec (a bit faster than the 4400 we saw before, apparently it took a while for their caches to get fully primed). We then removed AD and installed ADAM. The import went a bit faster this time, thankfully: (...)
Are you collecting more info than the average time, e.g. standard deviation? Are there occasional requests that take very much longer than the average, or even auth failures because the server is too busy? At those speeds, that can be as important as, or more important than, the average time.
Hallvard B Furuseth wrote:
Howard Chu writes:
Ultimately after several more repeated runs, AD peaked at 4800 auths/sec (a bit faster than the 4400 we saw before, apparently it took a while for their caches to get fully primed). We then removed AD and installed ADAM. The import went a bit faster this time, thankfully: (...)
Are you collecting more info than the average time, e.g. standard deviation? Are there occasional requests that take very much longer than the average, or even auth failures because the server is too busy? At those speeds, that can be as important as, or more important than, the average time.
Yes, slamd records the standard deviation as well. There are no failures, just longer queuing delays.
Howard Chu wrote:
Ultimately, after several more runs, AD peaked at 4800 auths/sec (a bit faster than the 4400 we saw before; apparently it took a while for their caches to get fully primed). We then removed AD and installed ADAM. The import went a bit faster this time, thankfully:
time ldifde.exe -i -h -f examp3.ldif -s localhost -q 8
152.62u 99.32s 27:33.84 15.2%
Using 8 threads, 27-1/2 minutes. The peak authentication rate we got was just under 5500 auths/sec using 52 client threads. At that load a single auth took an average of 9.5 milliseconds, up from 1.5ms at 4 client threads. (slapd's average latency is only 0.5ms at 8 clients, 2.2ms at 52 clients.)
A slapadd -q for back-hdb on 1M entries took only 5 minutes:
time ../servers/slapd/slapd -Ta -f slapd.conf.slam -q -l ~hyc/ldif/example.ldif.1mil
real 5m2.669s user 3m59.224s sys 1m55.482s
(Using BDB 4.6.21 and libtcmalloc)
An ldapadd (with back-bdb) of those 1M entries took 19 minutes:
time ../clients/tools/ldapmodify -a -x -D dc=example,dc=com -w secret -f ~hyc/ldif/example.ldif.1mil
real 19m3.044s user 1m12.435s sys 0m38.401s
So, even restricted to a single thread, we still import entries faster than AD or ADAM with 8 threads. Online or offline, they don't even come close. (Which should be no surprise...)
It'll be even more fun when we come back and run the tests on a 5 million entry DB. It's pretty clear from looking at their memory usage that they won't be able to cache all of that in the 16GB of RAM on this box, while OpenLDAP will. As such, their throughput rate will just reflect the speed of our disks, while slapd will still be running at full RAM speed.
The time to slapadd 5 million entries for back-hdb was 29 minutes:
time ../servers/slapd/slapd -Ta -f slapd.conf.5M -l ~hyc/5Mssha.ldif -q
real 29m15.828s user 20m22.110s sys 6m56.433s
For ADAM it was about 3 hours:
time ldifde.exe -i -h -f 5M.ldif -s localhost -q 8
743.57u 475.56s 2:59:44.78 11.3%
The authrate for back-hdb was again 25,700/sec with SSHA passwords, pretty much the same as for 1 million entries. The authrate for ADAM was 2359/sec.
It's also interesting to note the database and process sizes. The 5M ldif was about 2.9GB; the hdb database was about 6.4GB. The ADAM DB was about 22GB.
The slapd process size was about 8GB. The ADAM process size was about 14GB. We could easily handle a 10 million entry database on this machine and still be running at full cache speed; ADAM is already disk limited here.
This weekend we'll try loading 5M into AD. I figure we'll start up the import at the end of the day today and come back to look at it again in a couple of days...
Howard Chu wrote:
The time to slapadd 5 million entries for back-hdb was 29 minutes:
time ../servers/slapd/slapd -Ta -f slapd.conf.5M -l ~hyc/5Mssha.ldif -q
real 29m15.828s user 20m22.110s sys 6m56.433s
Using a shared memory BDB cache instead of mmap'd files brought this time down to 16 minutes. (For 1M entries, it was only 3 minutes.)
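(For reference, "shared memory cache" here means a BDB environment opened with DB_SYSTEM_MEM and a SysV shm key, which is what back-hdb's shm_key directive requests, instead of the default file-backed, mmap'd region files. A sketch against the BDB 4.x API; the cache size, key, and path are placeholders.)

  #include <stdio.h>
  #include <db.h>

  int main(void)
  {
      DB_ENV *env;
      int rc = db_env_create(&env, 0);
      if (rc) { fprintf(stderr, "create: %s\n", db_strerror(rc)); return 1; }

      env->set_cachesize(env, 4, 0, 1);     /* 4 GB cache in one region (placeholder) */
      env->set_shm_key(env, 0x4c444150);    /* any unused SysV shm base key */

      rc = env->open(env, "/var/openldap-data",   /* placeholder DB directory */
                     DB_CREATE | DB_INIT_MPOOL | DB_INIT_LOCK | DB_INIT_LOG |
                     DB_INIT_TXN | DB_THREAD | DB_SYSTEM_MEM, 0600);
      if (rc) { fprintf(stderr, "open: %s\n", db_strerror(rc)); return 1; }

      env->close(env, 0);
      return 0;
  }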
For ADAM it was about 3 hours:
time ldifde.exe -i -h -f 5M.ldif -s localhost -q 8
743.57u 475.56s 2:59:44.78 11.3%
For regular Active Directory it was just under 24 hours.
I noticed something familiar while looking at ldapsearch output from AD and ADAM. Even though their ISAM database uses B-trees for its indices, and they use integers for their primary keys (they call them DNTs, DN Tags, but they're essentially the same as our entryIDs, though theirs are fixed at 32 bits and ours are 64 bits on a 64 bit machine), they don't preserve the entry input order. In fact, it looks like they're passing their 4-byte integer keys directly to their B-tree routines in native little-endian byte order, which totally screws up the sorting and locality properties of the B-tree. I noticed this same misbehavior in the back-bdb code within about a week of starting to work on it. (Patched it in November 2001 in back-bdb/init.c:1.42, and made the corresponding fix to back-ldbm/init.c:1.28 in December 2001...)
So essentially they have a fundamental design flaw that is costing them a huge amount of both write and read performance. It's even more amusing that they recognized they had a problem, but rather than fix the actual problem, they wrote a Defragmentation process that has to run on a regular schedule (or can be invoked manually when the server is shut down) to try and put a bandaid over the problem. Good old Microsoft.
Summary: B-trees are balanced trees that use ordered keys. If you store the keys 1 2 3 4 in that order, a retrieval should return them in the same order: 1 2 3 4.
This characteristic offers a number of performance advantages: when you're doing a bulk load, it is essentially a sequential write, so there's very little disk seeking needed. When you're trying to find a range of entries within a node, you can use binary search to locate the desired targets instead of having to do a pure linear search. And if you're retrieving a range of entries, they'll reside next to each other, which makes caching more effective.
But this only works if you insert keys in their natural, sequential order, or if your B-tree sort/comparison function knows something special about your keys' characteristics. Most B-tree implementations just use memcmp for this purpose. What happens when you feed little-endian integers to memcmp is incredibly bad, although the effects won't be seen until you get to more than 256 items in the database. The first 255 entries will all get appended sequentially, without difficulty. But when #256 comes, instead of just tacking on after #255, it has to seek back to the beginning and do an insertion.
The sort order with little-endian integers looks like (in hex) 0100 0200 ... ff00
for entries 1-255. Then 256 comes along as 0001 which clearly belongs at the beginning of the above list. And 257, 0101 slots in right after entry #1. If all of the nodes were optimally filled up to this point, the majority of them are going to need to be split as you add entries from 256-511. And the same rewind seeking/splitting will occur again for 512-767, and so on for everything up to entry #65535. Then entry #65536 aggravates the problem even further, as does entry #16777216.
I.e., this flaw turns what should be the best possible insert order for a B-tree into the worst possible insert order. It's something that's easily overlooked, when you don't think about what you're doing, but its effect is so obvious you really can't miss it. No wonder fragmentation is such a severe problem in AD and ADAM...
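(For illustration, the kind of fix involved, though not necessarily the exact 2001 patch mentioned above: either store the IDs big-endian so that memcmp gives numeric order, or hand BerkeleyDB a bt_compare callback that reads the key bytes back as the integer they are. A sketch against the BDB 4.x API; the file name is a placeholder.)

  #include <string.h>
  #include <db.h>

  /* Order keys numerically even though they're stored as native little-endian
   * 32-bit integers (BDB 4.x bt_compare signature). */
  static int id_compare(DB *db, const DBT *a, const DBT *b)
  {
      unsigned int x, y;
      memcpy(&x, a->data, sizeof(x));
      memcpy(&y, b->data, sizeof(y));
      return x < y ? -1 : (x > y ? 1 : 0);
  }

  int main(void)
  {
      DB *db;
      if (db_create(&db, NULL, 0))
          return 1;
      db->set_bt_compare(db, id_compare);   /* must be set before open */
      if (db->open(db, NULL, "id2entry.db", NULL, DB_BTREE, DB_CREATE, 0600))
          return 1;
      /* inserts of IDs 1, 2, 3, ... now always land at the tail of the tree */
      db->close(db, 0);
      return 0;
  }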
I think this also explains to a large degree why, even though all of these packages use B-trees, OpenLDAP imports data an order of magnitude faster than ADAM, and nearly two orders of magnitude faster than AD. It also explains why our import speeds scale pretty much linearly with the number of objects, without extensive multithreading tricks, while theirs scales sub-linearly, even using multiple threads in parallel. (When you compare actual CPU time used, instead of wall clock time, the difference is another 3-4x greater at 1M entries. More than that at 5M.)
By now there's not much point in testing ActiveDirectory any further, it's so miserably bad there'd be nothing to gain. ADAM isn't quite as bulky as AD but it still has a lot of the same design constraints and flaws.
Howard Chu wrote:
Howard Chu wrote:
The time to slapadd 5 million entries for back-hdb was 29 minutes:
time ../servers/slapd/slapd -Ta -f slapd.conf.5M -l ~hyc/5Mssha.ldif -q
real 29m15.828s user 20m22.110s sys 6m56.433s
Using a shared memory BDB cache instead of mmap'd files brought this time down to 16 minutes. (For 1M entries, it was only 3 minutes.)
I recently tested Isode's M-Vault 14. For the most part there were no surprises; we were 5-6 times faster on the 1M entry DB. Load times were comparable, with OpenLDAP at around 3 minutes (as before) and Isode at around 4 minutes. While OpenLDAP delivered over 29,000 auths/sec using 8 cores, Isode delivered only 4600 auths/sec. (And there's still the interesting result that OpenLDAP delivered almost 31,000/sec using only 7 cores, leaving one core reserved for the ethernet driver.)
But Steve Kille asked "how do they perform on a database much larger than the server memory, like 50 million entries?" and that yielded an unexpected result. For OpenLDAP 2.4.7, BDB 4.6.21, and back-hdb it took 11:27:40 (hh:mm:ss) to slapadd the database, while it only took Isode 3:47:33 to bulk load. Naturally I was curious about why there was such a big difference. Unfortunately the Isode numbers were not repeatable; I was using XFS and the filesystem kept hanging on my subsequent rerun attempts.
I then recreated the filesystem using EXT2fs instead. For this load, Isode took only 3:18:29 while OpenLDAP took 6:59:25. I was astonished at how much XFS was costing us, but that still didn't explain the discrepancy. After all, this is a DB that's 50x larger, but a load time that's 220x longer than the 1M entry case.
Finally I noticed that there were large periods of time during the slapadd when the CPU was 100% busy but no entries were being added, and traced this down to BerkeleyDB's env_alloc_free function. So, the issue Jong raised in ITS#3851 still hasn't been completely addressed in BDB 4.6. If you're working with data sets much larger than your BDB cache, performance will plummet after the cache fills and starts needing to dump and replace pages.
I found the final proof of this conclusion by tweaking slapadd to use DB_PRIVATE when creating the environment. This option just uses malloc for the BDB caches, instead of shared memory. I also ran with tcmalloc. This time the slapadd took only 3:04:21. It would probably be worthwhile to default to DB_PRIVATE when using the -q option. Since slapadd is not extremely heavily threaded, even the default system malloc() will probably work fine. (I'll retest that as well.) One downside to this approach - when BDB creates its shared memory caches, it allocates exactly the amount of memory you specify, which is good. But when using malloc, it tracks how many bytes it requested, but can't account for any overhead that malloc itself requires to track its allocations. As such, I had to decrease my 12GB BDB cache to only 10GB in order for the slapadd to complete successfully (and it was using over 14GB of RAM out of the 16GB on the box when it completed).
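(For reference, a sketch of the environment open that such a tweaked, quick-mode-only import could use, assuming -q doesn't need the locking/transaction subsystems. With DB_PRIVATE, all the regions, including the cache, live on the process heap, which is exactly why malloc and its bookkeeping overhead enter the picture. Cache size and path are placeholders.)

  #include <stdio.h>
  #include <db.h>

  int main(void)
  {
      DB_ENV *env;
      int rc = db_env_create(&env, 0);
      if (rc) { fprintf(stderr, "create: %s\n", db_strerror(rc)); return 1; }

      env->set_cachesize(env, 10, 0, 1);    /* 10 GB, as in the run above */

      /* DB_PRIVATE: regions are allocated in this process only, no shared
       * memory, so nothing else can attach; fine for a single offline import. */
      rc = env->open(env, "/var/openldap-data",
                     DB_CREATE | DB_INIT_MPOOL | DB_PRIVATE, 0600);
      if (rc) { fprintf(stderr, "open: %s\n", db_strerror(rc)); return 1; }

      env->close(env, 0);
      return 0;
  }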
It would also be worthwhile to revisit Jong's patch in ITS#3851...
(The 50M entry DB occupied 69GB on disk. The last time I tested a DB of this size was on an SGI Altix with 480GB of RAM and 32 CPUs available. Testing it on a machine with only 16GB of RAM was not a lot of fun, it turns into mainly a test of disk speed. OpenLDAP delivered only 160 auths/second on XFS, and 200 auths/second on EXT2FS. Isode delivered 8 auths/sec on EXT2FS, and I never got a test result for XFS.)
It would also be worthwhile to revisit Jong's patch in ITS#3851...
(The 50M entry DB occupied 69GB on disk. The last time I tested a DB of this size was on an SGI Altix with 480GB of RAM and 32 CPUs available. Testing it on a machine with only 16GB of RAM was not a lot of fun, it turns into mainly a test of disk speed. OpenLDAP delivered only 160 auths/second on XFS, and 200 auths/second on EXT2FS. Isode delivered 8 auths/sec on EXT2FS, and I never got a test result for XFS.)
Not much to add, other than the fact that I was just about to download the Isode LDAP offering to play with...
Gavin.
Time to update this?
http://www.isode.com/whitepapers/m-vault-benchmarking.html
p.
Ing. Pierangelo Masarati OpenLDAP Core Team
<quote who="Pierangelo Masarati">
Time to update this?
http://www.isode.com/whitepapers/m-vault-benchmarking.html
p.
"OpenLDAP is a popular open source LDAP implementation. It is a straightforward and simple implementation of LDAP, and thus provides a useful independent reference point."
Lovely ;-)
Gavin Henry wrote:
<quote who="Pierangelo Masarati"> > Time to update this? > > <http://www.isode.com/whitepapers/m-vault-benchmarking.html> >
Yes, I certainly had that in mind. Their paper has no date or version numbers listed. I think it was using M-Vault 10, from quite a few years back, and most likely OpenLDAP 1.x.
The Isode guys were all very helpful throughout these tests, and everything was friendly/collegial. I found a couple bugs in their stuff, found this latent problem in BDB, and we all learned from the experience.
"OpenLDAP is a popular open source LDAP implementation. It is a straightforward and simple implementation of LDAP, and thus provides a useful independent reference point."
Lovely ;-)
Well, it certainly is a useful reference point. ;)
<quote who="Howard Chu">
Gavin Henry wrote:
<quote who="Pierangelo Masarati"> > Time to update this? > > <http://www.isode.com/whitepapers/m-vault-benchmarking.html> >
Yes, I certainly had that in mind. Their paper has no date or version numbers listed. I think it was using M-Vault 10, from quite a few years back, and most likely OpenLDAP 1.x.
The Isode guys were all very helpful throughout these tests, and everything was friendly/collegial. I found a couple bugs in their stuff, found this latent problem in BDB, and we all learned from the experience.
That's very encouraging. I wish other vendors were more like this and didn't hide behind licenses that forbid publishing benchmarks.
Everyone benefits and learns something valuable in the end.
"OpenLDAP is a popular open source LDAP implementation. It is a straightforward and simple implementation of LDAP, and thus provides a useful independent reference point."
Lovely ;-)
Well, it certainly is a useful reference point. ;)
Top of the reference points! ;-)
Howard Chu wrote:
I found the final proof of this conclusion by tweaking slapadd to use DB_PRIVATE when creating the environment. This option just uses malloc for the BDB caches, instead of shared memory. I also ran with tcmalloc. This time the slapadd took only 3:04:21. It would probably be worthwhile to default to DB_PRIVATE when using the -q option. Since slapadd is not extremely heavily threaded, even the default system malloc() will probably work fine. (I'll retest that as well.)
Using glibc 2.7's malloc took 3:16:10. I didn't bother to rerun the test multiple times, but this figure is really not out of line.
Howard Chu wrote:
Howard Chu wrote:
I found the final proof of this conclusion by tweaking slapadd to use DB_PRIVATE when creating the environment. This option just uses malloc for the BDB caches, instead of shared memory. I also ran with tcmalloc. This time the slapadd took only 3:04:21. It would probably be worthwhile to default to DB_PRIVATE when using the -q option. Since slapadd is not extremely heavily threaded, even the default system malloc() will probably work fine. (I'll retest that as well.)
Using glibc 2.7's malloc took 3:16:10. I didn't bother to rerun the test multiple times, but this figure is really not out of line.
With a patched BDB 4.7.13 and a regular shared environment, slapadd took 4 hours. (Unpatched 4.7.13 took 8.5 hours.) So it looks like the next BDB release will finally have a decent memory manager. Still not as efficient as malloc, but better than before. Unfortunately it doesn't work with OpenLDAP without additional patching; I'll address that later when BDB 4.7 becomes generally available.
Now if they'd just do something about the (lack of) concurrency in their global lock table... (In a read-only test, slapd with back-hdb is about 3x faster and scales to much heavier client loads with BDB locking disabled. This is also on the Sun T5120.)