All LDAP traffic is currently using RR DNS.
The network is essentially "flat": the LDAP servers and the systems that need LDAP are on the same subnet, which is why, when using the F5s for LDAP balancing, all traffic appears to come from the F5; otherwise you'd have an asymmetric-routing issue. The F5 has VIPs on both the "inside" and the outside. (The outside addresses are in the DMZ behind the perimeter firewalls, and are for balancing traffic to other server clusters, i.e. web, etc.)
Mgmt is of the mindset of "if it works (even if it doesn't provide proper redundancy right now), then leave it be", which is OK if servers never, ever crash. I'm of the opinion of finding out WHY the LDAP servers log "connection deferred: binding" when behind the F5s, and ONLY when past a certain arbitrary load threshold (i.e. for an hour or two around the busiest time of day it throws those warnings every few seconds/minutes, but below that point all is well). Hence my focus on conn_max_pending and conn_max_pending_auth, though I haven't heard a concrete response yet saying, "Yes, in your case, where all the traffic will appear to come from the F5 due to the network layout, those parameters are too low and likely to throttle connections at some arbitrary level."
I think the first test will be to try Performance (Layer 4) on the F5 and, if there still happens to be an issue, to then try doubling the values of conn_max_pending and conn_max_pending_auth.
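For concreteness, a slapd.conf sketch of that doubling (a sketch only; it assumes the stock defaults of 100 and 1000, and note that slapd.conf(5) describes these as per-session limits on pending requests, not per-source-IP limits):

    # Doubled from the compiled-in defaults (100 and 1000).
    # Max pending requests on an anonymous session:
    conn_max_pending 200
    # Max pending requests on an authenticated session:
    conn_max_pending_auth 2000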
-- David J. Andruczyk
----- Original Message ----
From: John Morrissey <jwm@horde.net>
To: David J. Andruczyk <djandruczyk@yahoo.com>
Cc: openldap-software@openldap.org; Philip Guenther <guenther+ldapsoft@sendmail.com>
Sent: Wednesday, July 29, 2009 1:20:44 PM
Subject: Re: performance issue behind a load balancer 2.3.32
On Wed, Jul 22, 2009 at 05:37:30AM -0700, David J. Andruczyk wrote:
Yes, we have been measuring latency under the F5 vs. RR. When we switched to RR DNS it DID drop quite a bit, from around 100 ms to about 20 ms.
FWIW and IIRC, after switching to Performance (Layer 4), the observed latency for LDAP operations to the VIP and to the nodes themselves was essentially the same. I can't say what the latency difference was, since I wasn't the one who was troubleshooting the BigIPs and don't have the numbers handy.
We do NOT yet have the VIP set to Performance (Layer 4), however; it was at "Standard". F5 has since suggested Performance (Layer 4), but we have not implemented it yet, because the "connection deferred: binding" messages cause severe annoyances and lots of CS calls from users of the system (auth failures, misc issues), and mgmt is wary of trying anything else until they have proof that whatever we do WILL DEFINITELY WORK beforehand. (Yes, cart before the horse, I know, but they sign the checks as well...)
That seems short-sighted, unless you're implying that you've moved all LDAP traffic off your BigIPs until you have a solution in hand that you *know* will solve the problem.
They may sign the checks, but that doesn't mean that informed argument shouldn't carry weight.
When behind the F5, all connections in the LDAP server logs appear to come from the F5's IP. So, when pumping a hundred servers' connections through that one IP, there are going to be many, many binds/unbinds going on constantly, all coming from the same IP (the F5). Why, then, doesn't it throw "connection deferred: binding" constantly, since the connection load is certainly very, very high? It only throws them occasionally (every few seconds), but that's enough to cause a major impact in terms of failed queries. Are you saying the F5 is dropping part of the session after binding on a port and then retrying the bind?
+1 on what Philip mentioned:
On Tue, 21 Jul 2009 21:54:53 -0700, Philip Guenther wrote:
Given the reported log message, this (latency) is very likely to be the cause of the problem. "connection deferred: binding" means that the server received a request on a connection that was in the middle of processing a bind. This means that the client sends a bind and then additional request(s) without waiting for the bind result. That's a violation by the client of the LDAP protocol specification, RFC 4511, section 4.2.1, paragraph 2:
[snip]
Understanding _why_ clients are violating the spec by sending further requests while a bind is outstanding may help you understand how the F5 or the clients should be tuned (or beaten with sticks, etc).
You presumably don't notice this under normal circumstances or with RR DNS because the server completes the BIND before the next request is received. My understanding (perhaps suspect) is that the F5 will increase the 'bunching' of packets on individual connections (because the first packet after a pause will see a higher latency than the succeeding packets).
So, are you measuring latency through the F5? I would *strongly* suggest doing so *before* tuning the F5 in any way, such as changing the VIP type John Morrissey mentioned, so that you can wave that in front of management (and under the nose of the F5 salesman when negotiating your next support renewal...)
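To make the pattern Philip describes concrete, here's a rough python-ldap sketch (hostname and DNs are made up, and your clients may well use a different SDK entirely, hence my question below); the "misbehaving" half is what produces "connection deferred: binding":

    # Sketch only -- hypothetical host/DNs, assuming a python-ldap client.
    import ldap

    conn = ldap.initialize("ldap://ldap-vip.example.com")

    # Misbehaving pattern (the RFC 4511 4.2.1 violation): asynchronous bind,
    # then a search sent before the bind result has come back.  With RR DNS
    # the bind usually completes first; add the BigIP's extra latency and
    # packet bunching and the search can reach slapd mid-bind.
    msgid = conn.simple_bind("cn=app,dc=example,dc=com", "secret")
    conn.search("dc=example,dc=com", ldap.SCOPE_SUBTREE, "(uid=jdoe)")

    # Compliant pattern: wait for the bind result before issuing anything else.
    msgid = conn.simple_bind("cn=app,dc=example,dc=com", "secret")
    conn.result(msgid)
    conn.search_s("dc=example,dc=com", ldap.SCOPE_SUBTREE, "(uid=jdoe)")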
What I'm parsing from:
https://support.f5.com/kb/en-us/solutions/public/8000/000/sol8082.html
(only accessible with an F5 support contract, unfortunately), is that with the "Standard" VIP type, the BigIP waits for the client's three-way TCP handshake to complete before establishing a connection with the load-balanced node. The BigIP becomes a "man in the middle" and establishes two independent connections: one facing the client, another facing the load-balanced node.
With "Performance (Layer 4)", the BigIP forwards packets between clients and load-balanced nodes as they're received. As Philip says, the packet "bunching" due to the MITM nature of the Standard VIP type is probably teaming up with your LDAP client misbehavior. Under heavy load, the likelihood of bunching increases and you "win" this race condition.
Out of curiosity, what LDAP client SDK is involved here?
john