I work at a place with a fairly large OpenLDAP setup (2.3.32). We have 3 large read servers: Dell R900, 32 GB RAM, 2x quad-core, HW RAID1 disks for the LDAP volume. The entire database takes about 13 GB of physical disk space (the BDB files) and has a few million entries. DB_CONFIG is configured to hold the entire DB in memory (for speed), and the slapd.conf cachesize is set to a million entries, to make the most effective use of the 32 GB of RAM these boxes have. We have them behind an F5 BigIP hardware load balancer (6400 series), and find that during peak times of the day we get "connection deferred: binding" in our slapd logs (loglevel set to "none" (misnomer)), and a client request (or series of them) fails. If we use round-robin DNS instead, we rarely see those errors. CPU usage is low, even during peak times, hovering at 20-50% of one core (the other 7 are idle).
The interesting thing is that it seems to happen (the "connection deferred: binding") only after a certain load threshold is reached (busiest time of day), and only when behind the F5s. I suspect it might be the "conn_max_pending" or "conn_max_pending_auth" defaults (100 and 1000 respectively), as when behind the F5 all the connections appear to come from the F5's addresses, vs. RR DNS where they come from a wide range of sources (each of the servers; well over 100).
We tried experimenting with a higher number of threads previously, but that didn't seem to have a positive effect. Can any OpenLDAP gurus suggest some things to set/look for, e.g. a higher number of threads, or higher values for conn_max_pending and conn_max_pending_auth?
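For reference, the knobs in question are global directives in slapd.conf. A minimal sketch with purely illustrative values (not recommendations; the stock defaults are threads 16, conn_max_pending 100, conn_max_pending_auth 1000):

```
# slapd.conf -- illustrative values only, not recommendations
threads               32       # default is 16
conn_max_pending      1000     # default 100; max pending ops on an anonymous session
conn_max_pending_auth 10000    # default 1000; max pending ops on an authenticated session
```

Note that conn_max_pending applies per connection, which matters when many clients are funneled through one source (e.g. a NATing load balancer) but connections themselves are still per-client TCP sessions.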
Any ideas on what the theoretical performance limit of a machine of this caliber should be? i.e., how many reqs/sec, how far it will scale, etc.
We have plans to upgrade to 2.4, but it's a "down the road" item, and mgmt is demanding answers to "how far can this design scale as it is?"...
Thanks!
-- David J. Andruczyk
Why bother with the load balancer? I am curious; I am sure there is a reason, but it isn't making a lot of sense to me. You can either do round-robin DNS, or just pass out the 3 read servers' addresses to the clients for failover (and change the order for real poor man's load balancing).
conn_max_pending is what I had to adjust to raise the connection limit, but I suspect you may have indexing issues, queries returning too many responses, etc.
-------------------------------------- Sean O'Malley, Information Technologist Michigan State University -------------------------------------
This is a large production environment (several hundred servers, thousands of requests per minute), and the F5 LB is used to balance the load and take care of a node needing to be taken out of service for maintenance for any reason. If a server is slow (for whatever reason: backups, etc.), the F5 notices that and adjusts the connection distribution as needed; RR DNS can't do that. As far as indexes, the environment had been performing extremely well until recently, when a few hundred thousand more users were added along with significantly higher activity, at which point we began seeing issues behind the load balancers during peak times of day. The LB vendor says the issue is with OpenLDAP, and those settings, conn_max_pending/conn_max_pending_auth, were the only ones that seemed to stick out, though the documentation on them is rather ambiguous.
-- David J. Andruczyk
----- Original Message ---- From: Sean O'Malley omalleys@msu.edu To: David J. Andruczyk djandruczyk@yahoo.com Cc: openldap-software@openldap.org Sent: Tuesday, July 21, 2009 2:24:58 PM Subject: Re: performance issue behind a a load balancer 2.3.32
--On Tuesday, July 21, 2009 12:39 PM -0700 "David J. Andruczyk" djandruczyk@yahoo.com wrote:
We've certainly seen that F5 load balancers cause problems just like you're seeing when used with LDAP. They just slow things down way too much to be worthwhile.
--Quanah
--
Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc -------------------- Zimbra :: the leader in open source messaging and collaboration
Why bother with the load balancer? I am curious, I am sure there is a reason, but it isn't making a lot of sense to me. You can either do round robin dns, or just pass out the 3 read server addy's to the clients for failover (and change the order for real poor mans load balancing.)
Things DNS RR does not allow that a load balancer does (just off the top of my head):
1. Dynamically removing a node if it goes down/crashes (DNS RR slows things down because clients need to time out for a failed server, assuming clients do properly time out and fail over, which is not a guarantee by any means - lots of broken clients wrt DNS RR out there).
2. Easily removing a node for maintenance (DNS RR requires modifying DNS, waiting for TTLs, hoping none of the clients ignore TTLs - again, lots of broken clients out there, etc).
3. Accounting for differing load or connection levels on the backend servers.
4. Hiding the actual servers and/or number of servers in the cluster.
I'm sure there are other benefits I've forgotten.
On Tue, 2009-07-21 at 12:39 -0700, David J. Andruczyk wrote:
This is a large production environment (several hundred servers, thousands of requests per minute) and the F5-LB is used to balance the load and take care if a node needs to be taken out of service for maint
Granted, I run a smaller environment, but even on a several-years-old v2.2.x install I regularly see several thousand requests per second -- not minute -- with tons of logging enabled, handled while hardly touching the CPU. Are you sure you really need multiple machines?
John
We've certainly seen that F5 load balancers cause problems just like you're seeing when used with LDAP. They just slow things down way too much to be worthwhile.
Do you have any facts/numbers to back this up? I've never seen F5s slow things down noticeably. The most common problem load balancers introduce is idle timeout mismatches: the LB times out and drops an idle connection from its table, the client and server don't know it was dropped, and lots of orphaned connections are left hanging open on both sides, which in turn does cause problems - one of which could be performance. Trivial to fix with proper idle timeout configuration/coordination on the backend server and LB.
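A minimal sketch of that coordination on the slapd side (values purely illustrative; the LB-side idle timeout lives in the load balancer's own connection/TCP profile settings):

```
# slapd.conf -- illustrative only. Close idle connections on the server
# side *before* the load balancer silently drops them, so both ends see
# a clean TCP close instead of stranding half-open connections.
# If the LB's idle timeout were, say, 300 seconds:
idletimeout 240
```

The ordering is the point: whichever side times out first should be the one that actually sends a close through the other, and a silent table drop on the LB does not.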
--On Tuesday, July 21, 2009 4:51 PM -0400 "Clowser, Jeff" jeff_clowser@fanniemae.com wrote:
Do you have any facts/numbers to back this up? I've never seen F5s slow things down noticeably. The most common problem load balancers introduce is idle timeout mismatches: the LB times out and drops an idle connection from its table, the client and server don't know it was dropped, and lots of orphaned connections are left hanging open on both sides, which in turn does cause problems - one of which could be performance. Trivial to fix with proper idle timeout configuration/coordination on the backend server and LB.
We've had F5's be the root of the problem with several clients who load balanced their LDAP servers, and pointed postfix at the F5 for delivery. They added just a few milliseconds of time to each LDAP query, but that was enough to completely back up their mail delivery system. Removing the F5 from the picture allowed mail to flow smoothly, no more problems.
--Quanah
--
Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc -------------------- Zimbra :: the leader in open source messaging and collaboration
Do you have any more specifics to back that up? We were running just fine until recently, when we seemed to cross a threshold. (Hence my interest in conn_max_pending*, as all devices behind the F5 (when on the same subnet as the clients using the balancer) will see all connections coming from the F5's IPs.)
F5 load is low (less than 18% CPU); it's a reasonably high-powered model (6400), and we aren't hitting traffic/packet limitations in it, according to traces we've sent to F5 for analysis.
-- David J. Andruczyk
----- Original Message ---- From: Quanah Gibson-Mount quanah@zimbra.com To: David J. Andruczyk djandruczyk@yahoo.com; openldap-software@openldap.org Sent: Tuesday, July 21, 2009 4:38:29 PM Subject: Re: performance issue behind a a load balancer 2.3.32
Yep, for a production environment, running on only one is a surefire way to earn myself a sparkling new pink slip...
-- David J. Andruczyk
----- Original Message ---- From: John Madden jmadden@ivytech.edu To: David J. Andruczyk djandruczyk@yahoo.com Cc: openldap-software@openldap.org Sent: Tuesday, July 21, 2009 4:47:07 PM Subject: Re: performance issue behind a a load balancer 2.3.32
On Tue, Jul 21, 2009 at 01:54:25PM -0700, Quanah Gibson-Mount wrote:
We've had F5's be the root of the problem with several clients who load balanced their LDAP servers, and pointed postfix at the F5 for delivery. They added just a few milliseconds of time to each LDAP query, but that was enough to completely back up their mail delivery system. Removing the F5 from the picture allowed mail to flow smoothly, no more problems.
I can't speak for any other clients that Quanah may be referencing, but we experienced this with our Zimbra deployment. However, I emphatically disagree with his stance against running LDAP services behind a hardware load balancer.
We have F5 BigIPs in front of nearly every service we provide, for the reasons cited by others. In the past, we've had load balancers from Cisco (CSS), and Alteon (ACEdirector, IIRC, and now owned by Nortel) and our BigIPs have been the most transparent and have worked the best.
That said, we did encounter throughput problems with Zimbra's Postfix MTAs due to BigIP configuration. When incoming mail volume started to ramp up for the day, Postfix's queue size would slowly build. We ruled out (host) CPU consumption, disk I/O load, syslogging bottlenecks, and a host of other usual and unusual suspects on the hosts themselves.
I'm not sure if Quanah heard the final resolution, which was to change the LDAP VIP type from Standard to "Performance (Layer 4)." This solved the problem immediately. I didn't see the final response from F5, but my impression was that Performance (Layer 4) bypasses a lot of the hooks that let you manipulate packets and connections. Interestingly, CPU consumption on our BigIPs was low and therefore didn't prompt us to troubleshoot from that angle. This was the first we've seen this behavior; our non-Zimbra OpenLDAP nodes have a higher operation rate (~12k operations/sec aggregate) and had been servicing a similar mail infrastructure before we started moving to Zimbra's software.
On Tue, Jul 21, 2009 at 05:56:48AM -0700, David J. Andruczyk wrote:
We had tried experimenting with a higher number of threads previously, but that didn't seem to have a positive effect. Can any openLDAP guru's suggest some things to set/look for, i.e. (higher number of threads, higher defaults for conn_max_pending, conn_max_pending_auth).
Any ideas on what a theoretical performance limit should be of a machine of this caliber? i.e. how many reqs/sec, how far will it scale, etc..
Sounds like you're doing NAT on inbound connections (so connections offered to your LDAP nodes are sourced from the BigIP), so I'm not sure if this alternate VIP type would preclude doing that. If you have OneConnect enabled, you might try disabling that, too. I generally see it used with HTTP, but perhaps it's usable with other protocols?
AFAICT, increasing conn_max_pending_auth shouldn't be helpful unless your application(s) are doing a lot of asynchronous operations simultaneously (i.e., submit many LDAP operations at once and have them pending simultaneously). If they're primarily submitting an operation and waiting for a response, lather rinse repeat, I don't see how a connection could accumulate pending operations.
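That distinction can be sketched with a toy model (this is NOT slapd's code, just an illustration of why pending operations only accumulate for asynchronous clients; max_pending stands in for the idea behind conn_max_pending):

```python
# Toy model: a connection with a cap on simultaneously pending operations.
class Connection:
    def __init__(self, max_pending=100):
        self.max_pending = max_pending
        self.pending = 0

    def submit(self):
        """Client sends an operation; it is deferred if too many are pending."""
        if self.pending >= self.max_pending:
            return "deferred"
        self.pending += 1
        return "accepted"

    def complete(self):
        """Client reads a result, freeing a pending slot."""
        self.pending -= 1

# A synchronous client (submit, wait, repeat) never has more than 1 pending:
sync = Connection()
sync_outcomes = []
for _ in range(500):
    sync_outcomes.append(sync.submit())
    sync.complete()

# An async client firing 150 operations before reading any results
# trips the limit on the 101st:
burst = Connection()
burst_outcomes = [burst.submit() for _ in range(150)]
print(sync_outcomes.count("deferred"), burst_outcomes.count("deferred"))  # → 0 50
```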
As far as scalability, I see no reason OpenLDAP shouldn't scale reasonably to the limits of your hardware (CPU consumption and disk I/O). It bodes well for your OpenLDAP build, tuning, etc. that it can handle your current workload when using round-robin DNS. What kind of LDAP ops/sec are these machines taking?
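One way to answer the ops/sec question, assuming slapd was built with the monitor backend (back-monitor) enabled and cn=Monitor configured: sample the completed-operation counters twice and divide by the interval. A hedged sketch (hostname hypothetical):

```
ldapsearch -x -H ldap://ldap1.example.com \
    -b "cn=Operations,cn=Monitor" "(objectClass=*)" monitorOpCompleted
```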
john
--On Tuesday, July 21, 2009 10:48 PM -0400 John Morrissey jwm@horde.net wrote:
I can't speak for any other clients that Quanah may be referencing, but we experienced this with our Zimbra deployment. However, I emphatically disagree with his stance against running LDAP services behind a hardware load balancer.
Eh, it was against running it behind an F5, not a stance against load balancing in general. ;)
I'm not sure if Quanah heard the final resolution, which was to change the LDAP VIP type from Standard to "Performance (Layer 4)." This solved the problem immediately. I didn't see the final response from F5, but my impression was that Performance (Layer 4) bypasses a lot of the hooks that let you manipulate packets and connections. Interestingly, CPU consumption on our BigIPs was low and therefore didn't prompt us to troubleshoot from that angle. This was the first we've seen this behavior; our non-Zimbra OpenLDAP nodes have a higher operation rate (~12k operations/sec aggregate) and had been servicing a similar mail infrastructure before we started moving to Zimbra's software.
Nope, I wasn't aware of this eventual solution. The last I heard, the postfix part was load balancing against the LDAP urls. So it sounds like F5's can be just fine with that caveat. ;)
--Quanah
--
Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc -------------------- Zimbra :: the leader in open source messaging and collaboration
On Tue, Jul 21, 2009 at 01:54:25PM -0700, Quanah Gibson-Mount wrote:
We've had F5's be the root of the problem with several clients who load balanced their LDAP servers, and pointed postfix at the F5 for delivery. They added just a few milliseconds of time to each LDAP query, but that was enough to completely back up their mail delivery system. <...>
Given the reported log message, this (latency) is very likely to be the cause of the problem. "connection deferred: binding" means that the server received a request on a connection that was in the middle of processing a bind. This means that the client sends a bind and then additional request(s) without waiting for the bind result. That's a violation by the client of the LDAP protocol specification, RFC 4511, section 4.2.1, paragraph 2:
After sending a BindRequest, clients MUST NOT send further LDAP PDUs until receiving the BindResponse. Similarly, servers SHOULD NOT process or respond to requests received while processing a BindRequest.
The log message is slapd saying "I'm obeying that SHOULD NOT for this connection, loser". It should be obvious now why the conn_max_pending* options have no effect.
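The deferral rule just described can be sketched as a toy event loop (an illustration of the RFC 4511 behavior, NOT slapd's actual implementation; PDU names are made up for the sketch):

```python
# Toy sketch: while a bind is in flight on a connection, further requests
# on that connection are deferred rather than processed.
def serve(pdus):
    binding = False
    log = []
    for pdu in pdus:
        if pdu == "bind":
            binding = True
            log.append("processing bind")
        elif pdu == "bind result":      # server finishes the bind
            binding = False
            log.append("bind result sent")
        elif binding:
            log.append("connection deferred: binding")
        else:
            log.append("processed " + pdu)
    return log

# Compliant client: waits for the bind result before searching.
print(serve(["bind", "bind result", "search"]))
# Pipelining client (or one whose packets got bunched behind added latency):
print(serve(["bind", "search", "bind result"]))
```

The second call shows the search arriving mid-bind and being deferred, which is exactly the log line being reported.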
Understanding _why_ clients are violating the spec by sending further requests while a bind is outstanding may help you understand how the F5 or the clients should be tuned (or beaten with sticks, etc).
You presumably don't notice this under normal circumstances or with RR DNS because the server completes the BIND before the next request is received. My understanding (perhaps suspect) is that the F5 will increase the 'bunching' of packets on individual connections (because the first packet after a pause will see a higher latency than the succeeding packets).
So, are you measuring latency through the F5? I would *strongly* suggest doing so *before* tuning the F5 in any way, such as by the VIP type mentioned by John Morrissey, so that you can wave that in front of management (and under the nose of the F5 saleman when negotiating your next support renewal...)
Philip Guenther
Philip Guenther wrote:
All true, but in certain versions of OpenLDAP, slapd would send the Bind result to the client before it was done with its internal bookkeeping. So it's possible that, on a very busy slapd, a very fast well-behaved client could get the Bind result and send its next request before slapd was finished marking the connection as "no longer Binding". (See ITS#3850 and #6189). Still, none of this will result in much additional latency within slapd (beyond any latency already imposed by the CPU load, number of available threads, etc...)
You presumably don't notice this under normal circumstances or with RR DNS because the server completes the BIND before the next request is received. My understanding (perhaps suspect) is that the F5 will increase the 'bunching' of packets on individual connections (because the first packet after a pause will see a higher latency than the succeeding packets).
So, are you measuring latency through the F5? I would *strongly* suggest doing so *before* tuning the F5 in any way, such as by the VIP type mentioned by John Morrissey, so that you can wave that in front of management (and under the nose of the F5 saleman when negotiating your next support renewal...)
Yes, we have been measuring latency under the F5 vs. RR. When we switched to RR DNS, it DID drop quite a bit, from around 100 ms to about 20 ms. We do NOT yet have the VIP set to Performance (Layer 4), however; it was at "standard". F5 has since suggested Performance (Layer 4), but we have not implemented it yet, because the "connection deferred: binding" messages cause severe annoyances and lots of CS calls from users of the system (auth failures, misc issues), so mgmt is wary of trying anything else until they have proof that whatever we do WILL DEFINITELY WORK beforehand. (Yes, cart before the horse, I know, but they sign the checks as well...)
When behind the F5, in the LDAP server logs all connections appear to come from the F5's IP. So, when pumping a hundred servers' connections through that one IP, there are going to be many, many binds/unbinds going on constantly, all coming from the same IP (the F5). So why doesn't it throw "connection deferred: binding" constantly, since the connection load is certainly very, very high? It only throws them occasionally (every few seconds), but that's enough to cause a major impact in terms of failed queries. Are you saying the F5 is dropping part of the session after binding on a port and retrying to bind (i.e., trying to reuse an already open port that hasn't been closed cleanly)? Can this be due to an idle timeout difference between slapd and the F5? Where is the idle timeout defined on the F5, specific to the LDAP virtual server/pool? (slapd.conf has it set relatively low, 20 seconds.)
-- David J. Andruczyk
----- Original Message ---- From: Philip Guenther guenther+ldapsoft@sendmail.com To: David J. Andruczyk djandruczyk@yahoo.com Cc: openldap-software@openldap.org Sent: Wednesday, July 22, 2009 12:54:53 AM Subject: Re: performance issue behind a a load balancer 2.3.32
On Tue, 2009-07-21 at 19:03 -0700, David J. Andruczyk wrote:
Yep, for a production environment, running on only one is a surefire way to earn myself a sparkling new pink slip...
That's pretty vague... Only one machine is not the same as "asking for a pink slip." There are plenty of ways to accomplish HA in this scenario, if that's what you're getting at.
John
When behind the F5 in the LDAP server logs all connections appear to come from the F5's IP
This strikes me as odd. Load balancers (including the F5) typically preserve the client IP. The most common case I've seen of this is when the load balancer is proxying a request vs. rerouting it to a server in the pool, which tends to happen when you are using the F5 also as an SSL accelerator (i.e., the client does SSL to the F5, then the F5 load balances in clear text from itself to a backend server). Are you doing something like this (and if so, when you use RR DNS, are you doing SSL on the LDAP server)? Or is there something else going on that is causing the F5 to replace the originating client IP with its own?
The other case I can think of is if the servers are not "behind" the load balancers (i.e., the LB is not the default gateway that traffic to them is routed through). In cases like that, the LB may need to proxy like this to avoid an asymmetric routing issue, but that's really not a good way to use load balancers, because of problems like this (this kind of setup tends to cause all kinds of problems).
i.e. usually they are set up in-line:

    client ---------- [ LB ] ---------- server

But if they are set up one-armed, with the LB hanging off the same segment as the clients and servers:

    client ----------+---------- server
                     |
                   [ LB ]

you need to do some unpleasant tricks to avoid routing issues.
On Wed, Jul 22, 2009 at 05:37:30AM -0700, David J. Andruczyk wrote:
yes, we have been measuring latency under the F5 vs. RR. When we switched to RR DNS, it DID drop quite a bit, from around 100 ms to about 20 ms.
FWIW and IIRC, after switching to Performance (Layer 4), the observed latency for LDAP operations to the VIP and to the nodes themselves was essentially the same. I can't say what the latency difference was, since I wasn't the one who was troubleshooting the BigIPs and don't have the numbers handy.
We do NOT yet have the VIP set to Performance (Layer 4), however; it is at "Standard". F5 has since suggested Performance (Layer 4), but we have not implemented it yet, because the "connection deferred: binding" messages cause severe annoyances and lots of CS calls from users of the system (auth failures, misc issues), so mgmt is wary of trying anything else until they have proof that whatever we do WILL DEFINITELY WORK beforehand. (Yes, cart before the horse, I know, but they sign the checks as well...)
That seems short-sighted, unless you're implying that you've moved all LDAP traffic off your BigIPs until you have a solution in hand that you *know* will solve the problem.
They may sign the checks, but that doesn't mean that informed argument shouldn't carry weight.
When behind the F5, all connections in the LDAP server logs appear to come from the F5's IP. So, when pumping a hundred servers' connections through that one IP, there are going to be many, many binds/unbinds going on constantly, all coming from the same IP (the F5). Why doesn't it throw "connection deferred: binding" constantly, since the connection load is certainly very high? It only throws them occasionally (every few seconds), but that's enough to cause a major impact in terms of failed queries. Are you saying the F5 is dropping part of the session after binding on a port and retrying the bind?
+1 on what Philip mentioned:
On Tue, 21 Jul 2009 21:54:53 -0700, Philip Guenther wrote:
Given the reported log message, this (latency) is very likely to be the cause of the problem. "connection deferred: binding" means that the server received a request on a connection that was in the middle of processing a bind. This means that the client sends a bind and then additional request(s) without waiting for the bind result. That's a violation by the client of the LDAP protocol specification, RFC 4511, section 4.2.1, paragraph 2:
[snip]
Understanding _why_ clients are violating the spec by sending further requests while a bind is outstanding may help you understand how the F5 or the clients should be tuned (or beaten with sticks, etc).
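To make the failure mode concrete, here is a toy model (my own sketch, not slapd's actual code) of the per-connection rule Philip describes: while a bind is outstanding on a connection, any further request on that same connection gets deferred:

```python
class Connection:
    """Toy model of slapd's per-connection bind handling (a sketch,
    not the real implementation)."""

    def __init__(self):
        self.binding = False  # is a bind currently being processed?

    def receive(self, op):
        if self.binding:
            # A request arrived before the bind result went out --
            # this is when slapd logs "connection deferred: binding".
            return "connection deferred: binding"
        if op == "bind":
            self.binding = True
        return "processing " + op

    def bind_complete(self):
        self.binding = False  # bind result sent; resume normal service

conn = Connection()
conn.receive("bind")           # server starts processing the bind
print(conn.receive("search"))  # client pipelined a search too early: deferred
conn.bind_complete()
print(conn.receive("search"))  # after the bind result, processed normally
```

A spec-abiding client never hits the deferred branch, because it waits for the bind result before issuing the search.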
What I'm parsing from:
https://support.f5.com/kb/en-us/solutions/public/8000/000/sol8082.html
(only accessible with an F5 support contract, unfortunately), is that with the "Standard" VIP type, the BigIP will wait for a three-way TCP handshake before establishing a connection with the load-balanced node. The BigIP becomes a "man in the middle" and establishes two independent connections: one facing the client, another facing the load balanced node.
With "Performance (Layer 4)", the BigIP forwards packets between clients and load-balanced nodes as they're received. As Philip says, the packet "bunching" due to the MITM nature of the Standard VIP type is probably teaming up with your LDAP client misbehavior. Under heavy load, the likelihood of bunching increases and you "win" this race condition.
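The "bunching" effect is easy to demonstrate in miniature with a plain stream socket (a toy sketch -- the "protocol" here is fake; real LDAP uses BER-encoded PDUs): if the client writes a bind and a search back to back without waiting, a single read on the server side can deliver both at once, so the search is already queued while the bind is still being processed:

```python
import socket

# Toy demonstration of pipelining/bunching on a stream socket.
server, client = socket.socketpair()

client.sendall(b"BIND\n")
client.sendall(b"SEARCH\n")  # sent without waiting for the bind result

# One read can deliver both requests together: by the time the server
# starts on the BIND, the SEARCH is already sitting in its buffer.
data = server.recv(4096)
print(data)  # both requests arrive in one chunk

server.close()
client.close()
```

A proxy that buffers and re-emits the stream (as the Standard VIP type does) makes this kind of coalescing more likely, which is exactly the race being described.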
Out of curiosity, what LDAP client SDK is involved here?
john
All LDAP traffic currently is using RR DNS.
The network is essentially "flat": the LDAP servers and the systems requiring LDAP are on the same subnet, hence why, when using the F5s for LDAP balancing, all traffic appears to come from the F5; otherwise you'd have an asynchronous routing issue. The F5 has VIPs on both the "inside" and the "outside" (the outside addresses are in the DMZ behind the perimeter firewalls, and are for balancing traffic to other server clusters, i.e. web, etc.).
Mgmt is of the mindset of "if it works (even if it doesn't provide proper redundancy right now), then leave it be", which is OK, if servers never, ever crash. I'm of the opinion of finding out WHY the LDAP servers log "connection deferred: binding" when behind the F5s, and ONLY when past a certain arbitrary load threshold (i.e. for an hour or two around the busiest time of day it throws those warnings every few seconds/minutes, but below that point all is well). Hence my focus on conn_max_pending and conn_max_pending_auth, though I haven't heard a concrete response yet saying, "Yes, in your case, where all the traffic will appear to come from the F5 due to the network layout, those parameters are too low and likely to throttle connections at some arbitrary level."
I think the first test will be to try Performance (Layer 4) on the F5, and if there is still an issue, to try doubling the values of conn_max_pending and conn_max_pending_auth.
-- David J. Andruczyk
On Thu, Jul 30, 2009 at 05:34:39AM -0700, David J. Andruczyk wrote:
Mgmt is of the mindset of "if it works (even if it doesn't provide proper redundancy right now), then leave it be", which is OK, if servers never, ever crash. I'm of the opinion of finding out WHY the LDAP servers log "connection deferred: binding" when behind the F5s, and ONLY when past a certain arbitrary load threshold.
nod, a good attitude to take. Especially because at some point, you're going to have an outage that round-robin DNS can't handle and your management is going to come to you asking why that traffic isn't load balanced. ^_____^
hence my focus on conn_max_pending and conn_max_pending_auth, though I haven't heard a concrete response yet saying, "Yes, in your case, where all the traffic will appear to come from the F5 due to the network layout, those parameters are too low and likely to throttle connections at some arbitrary level."
At least two people (Philip and Howard) have said the exact opposite: conn_max_pending{,_auth} are not going to have any effect on this situation. These directives control the number of pending operations *for each connection*.
In your case, yes, slapd sees all connections as originating from your BigIPs. Unless the BigIP is doing some deep magic LDAP connection pooling, there are numerous connections open, one for each LDAP client connection. These directives are per-connection and *do not* apply to the total number of operations pending across all connections.
More importantly, the error message you're getting indicates that increasing these values will have no effect. The problem is that slapd is receiving another LDAP operation on a given connection while a bind operation is still being processed for that connection. As Philip said, this is a violation of RFC 4511 and slapd correctly rejects it.
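To illustrate the per-connection scoping, here is a toy sketch (my own model, not slapd code) using the 2.3 defaults of 100 and 1000: two connections from the same source IP each get their own counter, so traffic funneled through one F5 address does not share a single quota:

```python
# Toy model: conn_max_pending(_auth) are per-connection quotas,
# not a server-wide budget. Values are the OpenLDAP 2.3 defaults.
CONN_MAX_PENDING = 100        # default, anonymous connections
CONN_MAX_PENDING_AUTH = 1000  # default, authenticated connections

class Conn:
    def __init__(self, authenticated=False):
        self.limit = CONN_MAX_PENDING_AUTH if authenticated else CONN_MAX_PENDING
        self.pending = 0

    def submit(self):
        """Queue one operation; refuse if this connection's quota is full."""
        if self.pending >= self.limit:
            return False  # only THIS connection is throttled
        self.pending += 1
        return True

# Two connections from the same client IP (e.g. both arriving via the F5):
a, b = Conn(), Conn()
for _ in range(CONN_MAX_PENDING):
    a.submit()
print(a.submit())  # False: connection a is at its own quota
print(b.submit())  # True: connection b is unaffected
```

Which is why raising these values does nothing for a "deferred: binding" symptom: that deferral is triggered by a request arriving mid-bind, not by a full pending queue.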
The behavior you're seeing could also be the result of software bugs in slapd that have since been fixed. Have you made sure your OpenLDAP build is more recent than/patched for ITS#3850 and #6189, as Howard mentioned?
john
On Thu, Jul 30, 2009 at 8:34 AM, David J. Andruczykdjandruczyk@yahoo.com wrote:
The network is essentially "flat": the LDAP servers and the systems requiring LDAP are on the same subnet, hence why, when using the F5s for LDAP balancing, all traffic appears to come from the F5
Have you tried enabling source NAT on the VIP, so the connections then are seen as coming from the clients, instead of the bigIP?
Wes
We are currently running 2.3.32; we can't upgrade to 2.4 yet as we are still using slurpd. (Yes, we are behind the times...) Are those two bugs, ITS#3850 and ITS#6189, fixed in the latest 2.3.x release?
-- David J. Andruczyk
That won't work when the clients and servers are all on the same subnet (they are in this environment), as it will cause an async routing problem (I already tried it). It would work if the LDAP servers and clients were on different subnets, but that is not something easily changed in a 24x7 running environment.
-- David J. Andruczyk