hyc@OpenLDAP.org wrote:
Update of /repo/OpenLDAP/pkg/ldap/servers/slapd
Modified Files:
    connection.c   1.410 -> 1.411
    daemon.c       1.414 -> 1.415
    proto-slap.h   1.743 -> 1.744
    syncrepl.c     1.370 -> 1.371
Log Message: Streamlined Winsock connection management
This patch eliminates all the #ifdef WINSOCK special cases in connection.c, and hides most of the dependencies in daemon.c.
Winsock's select() implementation is quite inefficient. Unfortunately, using the Microsoft-recommended asynchronous functions would implicitly set all the sockets to non-blocking, and libraries like OpenSSL don't behave well with non-blocking sockets.
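(For the curious: WSAEventSelect() and WSAAsyncSelect() silently flip the socket into non-blocking mode, and undoing that takes two explicit steps. A minimal sketch of the workaround, for illustration only - socket_set_blocking is a hypothetical helper, not the committed code:)

    #include <winsock2.h>

    /* Illustration only: after WSAEventSelect() has put a socket into
     * non-blocking mode, restoring blocking mode takes two steps. */
    static int
    socket_set_blocking(SOCKET s)
    {
        u_long nonblock = 0;    /* FIONBIO argument: 0 = blocking */

        /* The event association must be cleared first; until then
         * ioctlsocket(FIONBIO) fails with WSAEINVAL. */
        if (WSAEventSelect(s, NULL, 0) == SOCKET_ERROR)
            return SOCKET_ERROR;
        return ioctlsocket(s, FIONBIO, &nonblock);
    }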
For reference, the peak throughput with back-null on the previous code was only 7,800 auths/sec (with 8 client threads). With this patch it's 11,140 auths/sec. In both cases the throughput declines as more client threads are used. (Compare to 35,553 auths/sec for the same machine running Linux, and no drop in throughput all the way up to hundreds/thousands of connections.)
Peak throughput on the new code with back-hdb is 7,972 auths/sec (with 12 client threads). With the previous code it was 6,252 auths/sec (with 8 client threads). (The 7,972 figure is also after setting processor affinities for the threads, pinning the listener to core #0 and the worker threads to cores #1-7. Without that tweak, the peak is only 7,717/sec.)
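(The affinity tweak itself is one call per thread. A hedged sketch assuming Win32 threads on an 8-core box - pin_threads is a hypothetical helper, not part of the patch:)

    #include <windows.h>

    /* Hypothetical helper: listener on core #0, workers on cores #1-7. */
    static void
    pin_threads(HANDLE listener, HANDLE *workers, int nworkers)
    {
        int i;

        SetThreadAffinityMask(listener, (DWORD_PTR)1);     /* core #0 */
        for (i = 0; i < nworkers; i++)
            SetThreadAffinityMask(workers[i],
                (DWORD_PTR)1 << (1 + (i % 7)));            /* cores #1-7 */
    }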
Howard Chu wrote:
Peak throughput on the new code with back-hdb is 7,972 auths/sec (with 12 client threads).
Ah, I read off the preliminary result, oops. The final rate was 8,030 auths/sec.
I forgot to note that this is using an experimental build of gcc 4.3.0 (because earlier versions don't really support the Win64 ABI), and all optimization is turned off (due to some nasty bugs that make gcc 4.3.0's optimizer unusable). We're tracking the bugs on the mingw-w64 mailing list; hopefully we'll have a fix soon.
This is also using BerkeleyDB 4.6.21. The 1M entry DB loads in about 8 minutes here (vs 3 minutes on Linux) and I doubt that the optimizer is going to make up a significant chunk of that difference. I.e., there are multiple aspects of this OS (Windows Server 2003 SP2 Enterprise Edition x86_64) that are much slower than Linux - not just the connection handling or disk I/O, but also mutexes, thread scheduling, etc.
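(To make the mutex point concrete: a crude uncontended lock/unlock microbenchmark along these lines - my own illustration, not part of this test setup - is easy to run on both OSes; the Linux counterpart would use pthread_mutex_lock/unlock with clock_gettime.)

    #include <windows.h>
    #include <stdio.h>

    /* Crude microbenchmark: cost of an uncontended lock/unlock pair. */
    int
    main(void)
    {
        CRITICAL_SECTION cs;
        LARGE_INTEGER freq, t0, t1;
        long i, iters = 10000000L;

        InitializeCriticalSection(&cs);
        QueryPerformanceFrequency(&freq);
        QueryPerformanceCounter(&t0);
        for (i = 0; i < iters; i++) {
            EnterCriticalSection(&cs);
            LeaveCriticalSection(&cs);
        }
        QueryPerformanceCounter(&t1);
        printf("%.1f ns per lock/unlock pair\n",
            (t1.QuadPart - t0.QuadPart) * 1e9 / freq.QuadPart / iters);
        DeleteCriticalSection(&cs);
        return 0;
    }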
E.g., this search command against the Linux OpenLDAP build:

time ./ldapsearch -x -H ldap://sihu -D dc=example,dc=com -w "secret" -LLL -b ou=people,dc=example,dc=com -E pr=1000/noprompt 1.1 > dn1

real    0m17.766s
user    0m5.337s
sys     0m7.831s

I got this result against the Windows OpenLDAP build:

time ./ldapsearch -x -H ldap://sihu:9000 -D dc=example,dc=com -w "secret" -LLL -b ou=people,dc=example,dc=com -E pr=1000/noprompt 1.1 > dn1

real    0m36.553s
user    0m5.612s
sys     0m4.541s
This is with the DB fully cached, so there's no disk I/O, and the number of network round trips is identical in both cases. (I guess I should measure that again on Linux without the optimizer, to make the comparison fairer.)
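(The -E pr=1000/noprompt option requests the paged-results control of RFC 2696. At the libldap level the client-side loop looks roughly like this - a hedged sketch with error handling trimmed; paged_dn_search is a hypothetical helper:)

    #include <ldap.h>

    /* Hypothetical helper: fetch DNs only ("1.1"), 1000 entries/page. */
    static int
    paged_dn_search(LDAP *ld, const char *base)
    {
        struct berval *cookie = NULL;
        char *attrs[] = { "1.1", NULL };    /* request no attributes */
        int rc;

        do {
            LDAPControl *page = NULL, *sctrls[2] = { NULL, NULL };
            LDAPControl **rctrls = NULL;
            LDAPMessage *res = NULL;
            ber_int_t count;

            rc = ldap_create_page_control(ld, 1000, cookie, 0, &page);
            if (rc != LDAP_SUCCESS) break;
            sctrls[0] = page;

            rc = ldap_search_ext_s(ld, base, LDAP_SCOPE_SUBTREE,
                "(objectClass=*)", attrs, 0, sctrls, NULL, NULL, 0, &res);
            ldap_control_free(page);
            if (rc != LDAP_SUCCESS) break;

            if (cookie) { ber_bvfree(cookie); cookie = NULL; }
            ldap_parse_result(ld, res, &rc, NULL, NULL, NULL, &rctrls, 0);
            if (rctrls) {
                /* count is the server's size estimate; the cookie
                 * drives the loop - empty cookie means done. */
                ldap_parse_page_control(ld, rctrls, &count, &cookie);
                ldap_controls_free(rctrls);
            }
            ldap_msgfree(res);
        } while (cookie != NULL && cookie->bv_len > 0);

        if (cookie) ber_bvfree(cookie);
        return rc;
    }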
Howard Chu wrote:
Compare to 35,553 auths/sec for the same machine running Linux, and no drop in throughput all the way up to hundreds/thousands of connections.
Re-running on Linux with a non-optimized build, peaked at 40,101 auths/sec. (I guess HEAD has sped up a bit more in the past week or so...)
(I guess I should measure that again on Linux without the optimizer, to make the comparison fairer.)
With the non-optimized Linux build I got:

time ./ldapsearch -x -H ldap://sihu -D dc=example,dc=com -w "secret" -LLL -b ou=people,dc=example,dc=com -E pr=1000/noprompt 1.1 > dn1

real    0m24.424s
user    0m5.366s
sys     0m4.230s
So I guess the gcc optimizer could make up to a 30% difference here (24.424s down to 17.766s is about a 27% reduction in wall-clock time).
Howard Chu wrote:
Re-running on Linux with a non-optimized build, peaked at 40,101 auths/sec. (I guess HEAD has sped up a bit more in the past week or so...)
OK, this is odd. The code compiled without optimization peaks at 40K auths/sec at around 124-132 client threads. The code compiled with -O2 peaks at 37K auths/sec at around 128 client threads.
The -O2 build is faster from about 4 to 24 client threads. From 28 on up, the non-optimized code is faster at every load level. I was originally using gcc 4.1.2, but I'm seeing the same result now with gcc 4.2.2. Also, slapd is configured with only 8 worker threads in all of these tests. Strange that whatever optimizations the compiler generated speed things up under lighter load but work against it under heavier load.
On Tue, Nov 27, 2007 at 05:17:04AM -0800, Howard Chu wrote:
First-level instruction-cache thrashing due to function inlining?
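(A quick rebuild with -O2 -fno-inline would be one way to test that.)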
Volker
Howard Chu writes:
Strange that whatever optimizations the compiler generated speed things up under lighter load but work against it under heavier load.
Not really. Lots of possible optimizations are trade-offs between unguessable guesstimates - cache usage, branch prediction, whatever. Maybe some small piece of code got unluckily optimized and dominates the rest under heavy load. With a bit of luck, the difference between light and heavy runs will stand out with some sort of profiling (gprof, cachegrind, helgrind, whatever).
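E.g., running both builds under valgrind --tool=cachegrind and comparing the I1 miss counts at heavy load might show where they diverge.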
Hallvard B Furuseth wrote:
The difference is small enough that I'm not really concerned, just curious. Compiling with -Os to optimize for space yielded about the same result as -O2. Interestingly, compiling with -O3 got a peak rate of around 39K auths/sec, but reached it much more slowly: it took until 276 client connections before throughput finally stopped increasing. That's good news for servers that regularly have large numbers of active clients.