The "problem" (I use the term lightly; it's just the situation we have to
work with) as I see it is that a persistent search may legitimately have
no traffic for quite some time. At Rutgers, we first saw issues with
keepalives on slaves that refreshAndPersisted a portion of the DIT
reflecting network configuration, which is to say a portion that didn't
change that much. The idletimeout on the provider was under two hours (the
default TCP keepalive), and libldap wasn't requesting SO_KEEPALIVE in the
first place, so there was no accounting for this at either the protocol or the
application level. Adding it at the protocol level was easy: the simple
patch in ITS#4708 was sufficient.
In our case, we then tuned TCP keepalives (on the client) below the
provider's idletimeout value, and we got the desired behavior -- the
persistent search connection remains available, except in the case of a
genuine failure.

Admittedly, per your message, these clients are slightly aggressive in the
case of failure, and this may not be desirable behavior. But moving
forward in the post-ITS#4440 world, I'm not sure how serious this will be.
Now that we can give replication DNs their own idletimeout, we should be
able to keep the SO_KEEPALIVE connection at the system defaults (two
hours), reducing the load in the server-down case, and we should no longer
lose the connection in the server-up case.
...of course, this could all be swept under the carpet with some
application layer keepalive, as you discuss. I guess my point is that I'm
not sure what we gain. If there is to be any application layer keepalive,
I'd be more interested in the refreshAndPersist provider occasionally
sending it, since that's the flow that we need working for the next time a
MOD hits. But it's kind of wrong to think this is only affecting syncrepl
-- it's broad across many LDAP clients, and is probably a deeply ingrained
assumption in many of them.

Should application layer keepalives be published as "The Way To Do It"?
Would it make sense for this method to be in (an OpenLDAP extension to)
libldap for other affected applications to use? Or does it make more sense
to just say "OpenLDAP Software depends on the OS/network/firewalls to do
their job, make sure they are configured to detect networking failures and
pass them upwards, we will reconnect when told to"?

On Mon, 17 Sep 2007, Howard Chu wrote:

Following on from ITS#5133, there are a couple of different scenarios to
consider:

1) the remote network segment has disappeared (or the remote server has died)
2) an intervening firewall has killed the connection

Neither case is really distinguishable from the consumer side. In the case of
a hardware failure, where either the remote host or the network to the host
has failed, there's little to be gained by setting an aggressive retry
policy. Failures of that sort tend to take a non-trivial amount of time to
repair. I've seen some app guides recommending keepalives be sent once a
minute or so; to me that is way overdoing things.
In the case of a firewall closing an idle connection, you really have to ask
yourself what you're trying to accomplish - are you trying to send probes
frequently enough to prevent the connection from closing, or are you just
trying to detect that it has closed? This may be giving too much credit to
the firewall admins, but I'd guess that they've set an idle timeout that is
appropriate for the load that their networks see. Artificially inflating
traffic on a connection to prevent it from appearing idle would just be an
abuse of network resources. It's also possible that a stateful firewall will
start dropping connections because it's been overwhelmed by traffic, and
simply doesn't have the memory to track all the live connections. Keeping the
connection open in these circumstances would just make a bad situation worse.
As such, it seems to me that you don't really want to be setting very short
keepalive timeouts anywhere. The default of 2 hours that most systems use
seems pretty reasonable.
On the other hand, it would probably be useful to be able to prod the
consumer and have it kick the connection on request. In the past I've
implemented this sort of thing using Search requests with magic filters.
I.e., treat the Search operation as an RPC call: the target object is simply
an embedded method, and the AVAs in the filter comprise a named parameter
list.

So e.g. one might do a search on "cn=Sync Consumers,cn=monitor" with filter
(|(objectclass=*)(kick=TRUE)) to cause every active consumer to probe its
connection.

I like this approach a lot better than Modifying an object, because you can
hit many objects at once with a Search request, and receive all of their
execution results as attributes of the returned entries.
-- Howard Chu
Chief Architect, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/