Has anyone got a dual or quad socket Intel Xeon based server for testing? I've been testing on two AMD systems, one quad socket dual core and one dual socket quad core. There are a lot of different ways to tune these systems...
slapd currently uses a single listener thread and a pool of some number of worker threads. I've found that performance improves significantly when the listener thread is pinned to a single core, and no other threads are allowed to run there. I've also found that performance improves somewhat when all worker threads are pinned to specific cores, instead of being free to run on any of the remaining cores. This has made testing a bit more complicated than I expected.
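(For concreteness, this is roughly the mechanism involved, sketched with the Linux affinity calls; it's an illustration only, not the actual slapd code, and the core numbers and thread count are arbitrary:)

    /*
     * Sketch only, not slapd source: pin a listener thread to core 0 and
     * give each worker thread its own core from the remaining set.
     * Requires _GNU_SOURCE for pthread_setaffinity_np().
     */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    #define NWORKERS 7

    static void *listener(void *arg) { /* accept + dispatch loop */ return NULL; }
    static void *worker(void *arg)   { /* pull operations off the queue */ return NULL; }

    static int pin_thread(pthread_t t, int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(t, sizeof(set), &set);
    }

    int main(void)
    {
        pthread_t lt, wt[NWORKERS];
        int i;

        pthread_create(&lt, NULL, listener, NULL);
        pin_thread(lt, 0);              /* listener owns core 0, nothing else runs there */

        for (i = 0; i < NWORKERS; i++) {
            pthread_create(&wt[i], NULL, worker, NULL);
            pin_thread(wt[i], i + 1);   /* workers on cores 1..7 */
        }

        pthread_join(lt, NULL);
        for (i = 0; i < NWORKERS; i++)
            pthread_join(wt[i], NULL);
        return 0;
    }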
I originally was just pinning the entire process to a set number of cores (first 1, then 2, incrementing up to 8) to see how performance changed with additional cores. But due to the motherboard layout and the fact that the I/O bridges are directly attached to particular sockets, it makes a big difference exactly which cores you use.
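(Whole-process pinning is simpler; the sketch below is roughly equivalent to launching slapd under taskset -c, with the core count taken from the command line. Again, just an illustration of the mechanism:)

    /*
     * Sketch: confine the calling process (and any threads it later creates)
     * to the first N cores -- roughly what `taskset -c 0-3 slapd ...` does.
     * Which physical package "core 0..N-1" lands on depends on how the
     * kernel numbers the cores on a given board.
     */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int ncores = (argc > 1) ? atoi(argv[1]) : 1;
        cpu_set_t set;
        int i;

        CPU_ZERO(&set);
        for (i = 0; i < ncores; i++)
            CPU_SET(i, &set);

        if (sched_setaffinity(0, sizeof(set), &set) != 0) {  /* 0 = this process */
            perror("sched_setaffinity");
            return 1;
        }
        printf("restricted to the first %d core(s)\n", ncores);
        /* ...start the real workload here... */
        return 0;
    }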
Another item I noticed is that while we scale perfectly linearly from 1 core to 2 cores in a socket (with a dual-core processor), as we start spreading across multiple sockets the scaling tapers off drastically. That makes sense given the constraints of the HyperTransport connections between the sockets.
On the quad-core system we scale pretty linearly from 1 to 4 cores (in one socket) but again the improvement tapers off drastically when the 2nd socket is added in.
I don't have any Xeon systems to test on at the moment, but I'm curious to see how they do given that all CPUs should have equal access to the northbridge. (Of course, given that both memory and I/O traffic go over the bus, I'm not expecting any miracles...)
The quad-core system I'm using is a Supermicro AS-2021M-UR+B; it's based on an Nvidia MCP55 chipset. The gigabit ethernet is integrated in this chipset. Using back-null we can drive this machine to over 54,000 authentications/second, at which point 100% of a core is consumed by interrupt processing in the ethernet driver. The driver doesn't support interrupt coalescing, unfortunately. (By the way, that represents somewhere between 324,000pps and 432,000pps. While there are only 5 LDAP packets per transaction, some of the client machines choose to send separate TCP ACKs, while others don't, which makes the packet count somewhere between 5 and 8 packets per transaction. I hadn't taken those ACKs into account when I discussed these figures before. At these packet sizes (80-140 bytes), I think the network would be 100% saturated at around 900,000pps.)
Interestingly, while 2 cores can get over 13,000 auths/second, and 4 cores can get around 25,000 auths/second (using back-hdb), with all 8 cores it's only peaking at 29,000 auths/second. This tells me it's better to run two separate slapds in a mirrormode configuration on this box (4 cores per process) than to run a single process across all of the cores. Then I'd expect to hit 50,000 auths/second total, pretty close to the limits of the ethernet device/driver.
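(For anyone who wants to try the same split, the rough shape would be two configs that differ only in serverID, listener port, and syncrepl provider, with each instance pinned to one socket. The suffix, ports, paths, and credentials below are placeholders, not the actual test configuration:)

    # slapd-a.conf (instance A, cores 0-3) -- sketch only, placeholder values
    serverID    1
    database    hdb
    directory   /var/tmp/db-a          # each instance needs its own DB directory
    suffix      "dc=example,dc=com"
    mirrormode  on
    syncrepl    rid=001
                provider=ldap://localhost:9012
                type=refreshAndPersist
                searchbase="dc=example,dc=com"
                bindmethod=simple
                binddn="cn=replica,dc=example,dc=com"
                credentials=secret
                retry="5 +"

    # slapd-b.conf is the same except: serverID 2, directory /var/tmp/db-b,
    # and provider=ldap://localhost:9011

    # launch, one socket each:
    taskset -c 0-3 slapd -f slapd-a.conf -h ldap://:9011
    taskset -c 4-7 slapd -f slapd-b.conf -h ldap://:9012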
Hi Howard,
I should be able to get a hold of an 8-way Xeon system in January sometime. I will be able to place the order for it on the 2nd.
Cheers, Alex
Alex Karasulu wrote:
Hi Howard,
I should be able to get a hold of an 8-way Xeon system in January sometime. I will be able to place the order for it on the 2nd.
Sounds great, thanks Alex.
Interestingly, while 2 cores can get over 13,000 auths/second, and 4 cores can get around 25,000 auths/second (using back-hdb), with all 8 cores it's only peaking at 29,000 auths/second. This tells me it's better to run two separate slapds in a mirrormode configuration on this box (4 cores per process) than to run a single process across all of the cores. Then I'd expect to hit 50,000 auths/second total, pretty close to the limits of the ethernet device/driver.
Well, I guessed wrong. The peak with two separate processes was still the same, 29,000 auths/second. Interestingly, the peak with only 6 cores was about 31,000 auths/second. It appears that we're memory bandwidth-limited.
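(One crude way to confirm that would be to run copies of a memory-bound loop pinned to more and more cores and watch the aggregate rate flatten out; something like the probe below, which is only an illustration, not what these numbers were measured with:)

    /*
     * Crude memory-bandwidth probe, illustration only. Run one copy per core,
     * e.g. `taskset -c 3 ./membw`, and add up the reported rates; when the
     * total stops growing as cores are added, you're bandwidth-bound.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (16 * 1024 * 1024)            /* 128MB per array, well past cache */

    int main(void)
    {
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));
        struct timespec t0, t1;
        double secs, mbytes;
        long i;

        if (!a || !b) return 1;
        for (i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < N; i++)
            a[i] = a[i] + 3.0 * b[i];       /* read a, read b, write a */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        secs   = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        mbytes = 3.0 * N * sizeof(double) / 1e6;
        printf("%.0f MB/s (check %g)\n", mbytes / secs, a[N / 2]);
        free(a); free(b);
        return 0;
    }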
Howard Chu wrote:
Has anyone got a dual or quad socket Intel Xeon based server for testing? I've been testing on two AMD systems, one quad socket dual core and one dual socket quad core. There are a lot of different ways to tune these systems...
Thanks to Matt Ezell at the University of Tennessee for loaning me some of his servers over the holidays. He gave me access to a Dell 2950 with a pair of Intel Xeon 5345s (2.33GHz quad-core processors) for testing. The test results are available here:
http://connexitor.com/blog/pivot/entry.php?id=191#body
(My apologies to the researchers running their simulations on the machines we used for load generators. Java can be pretty unforgiving as a resource hog...)
slapd currently uses a single listener thread and a pool of some number of worker threads. I've found that performance improves significantly when the listener thread is pinned to a single core, and no other threads are allowed to run there. I've also found that performance improves somewhat when all worker threads are pinned to specific cores, instead of being free to run on any of the remaining cores. This has made testing a bit more complicated than I expected.
The Intel system behaved differently (of course). In general, the system delivered best performance when the listener thread was free to be assigned to any core of the active set. However, when testing with 6 or 7 cores, performance was extremely erratic, and only stabilized when the listener thread was pinned to the first CPU socket.
I originally was just pinning the entire process to a set number of cores (first 1, then 2, incrementing up to 8) to see how performance changed with additional cores. But due to the motherboard layout and the fact that the I/O bridges are directly attached to particular sockets, it makes a big difference exactly which cores you use.
I custom-built a kernel (2.6.24-rc3 with Jeff Garzik's patched device drivers) with Intel I/OAT DMA support, which seemed to allow the ethernet driver a lot more leeway. Instead of the ethernet driver consuming a single CPU core, its cycles were evenly distributed across all 8 cores. A pity there's no such DMA engine on the AMD systems. (Without this DMA engine, the ethernet driver also consumed a single core on the Intel system.)
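(For reference, the relevant kernel options, as far as I recall, are the DMA engine bits below; verify the exact symbols against your own tree, since they were being reorganized around this time:)

    # kernel .config fragment (names from memory -- double-check in menuconfig)
    CONFIG_DMADEVICES=y        # "DMA Engine support" menu
    CONFIG_INTEL_IOATDMA=m     # Intel I/OAT DMA engine driver (ioatdma)
    CONFIG_NET_DMA=y           # let the TCP stack offload receive copies to the engine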
Another item I noticed is that while we scale perfectly linearly from 1 core to 2 cores in a socket (with a dual-core processor), as we start spreading across multiple sockets the scaling tapers off drastically. That makes sense given the constraints of the HyperTransport connections between the sockets.
Note that on the 4P AMD system, two of the processors have unused HyperTransport links. It would definitely improve things if those two CPUs were connected by that extra link; IIRC some other 4P AMD systems do that.
On the quad-core system we scale pretty linearly from 1 to 4 cores (in one socket) but again the improvement tapers off drastically when the 2nd socket is added in.
Well. The AMD quad-core Opteron certainly scales much better than the other two systems.
I don't have any Xeon systems to test on at the moment, but I'm curious to see how they do given that all CPUs should have equal access to the northbridge. (Of course, given that both memory and I/O traffic go over the bus, I'm not expecting any miracles...)
As expected, the Xeon's multithreaded/multicore performance is hampered by its FSB architecture.
The quad-core system I'm using is a Supermicro AS-2021M-UR+B; it's based on an Nvidia MCP55 chipset. The gigabit ethernet is integrated in this chipset. Using back-null we can drive this machine to over 54,000 authentications/second, at which point 100% of a core is consumed by interrupt processing in the ethernet driver. The driver doesn't support interrupt coalescing, unfortunately. (By the way, that represents somewhere between 324,000pps and 432,000pps. While there are only 5 LDAP packets per transaction, some of the client machines choose to send separate TCP ACKs, while others don't, which makes the packet count somewhere between 5 and 8 packets per transaction. I hadn't taken those ACKs into account when I discussed these figures before. At these packet sizes (80-140 bytes), I think the network would be 100% saturated at around 900,000pps.)
For these interfaces and interrupt rates, with such small packets, 50% utilization really isn't that bad.
Interestingly, while 2 cores can get over 13,000 auths/second, and 4 cores can get around 25,000 auths/second (using back-hdb), with all 8 cores it's only peaking at 29,000 auths/second. This tells me it's better to run two separate slapds in a mirrormode configuration on this box (4 cores per process) than to run a single process across all of the cores. Then I'd expect to hit 50,000 auths/second total, pretty close to the limits of the ethernet device/driver.
50K/sec is definitely out of reach given how many CPU cycles the ethernet driver consumes. But still, the quad-core Opteron delivers amazing performance, especially given that its core clock speed is so much lower than the other systems'. Just goes to show: in a database-oriented workload, memory bandwidth rules.