The "problem" (I use the term lightly; it's just the situation we have to
work with) as I see it is that a persistent search may legitimately have
no traffic for quite some time. At Rutgers, we first saw issues with
keepalives on slaves that refreshAndPersisted a portion of the DIT
reflecting network configuration, which is to say a portion that didn't
change that much. The idletimeout on the provider was under two hours (the
default TCP keepalive), and libldap wasn't requesting SO_KEEPALIVE in the
first place, so there was no accounting for this at either the protocol or the
application level. Adding it at the protocol level was easy: the simple
patch in ITS#4708 was sufficient.
In our case, we then tuned TCP keepalives (on the client) below the
provider's idletimeout value, and we got the desired behavior -- the
persistent search connection remains available, except in the case of a
genuine failure.

Admittedly, per your message, these clients are slightly aggressive in the
case of failure, and this may not be desirable behavior. But moving
forward in the post-ITS#4440 world, I'm not sure how serious this will be.
Now that we can give replication DNs their own idletimeout, we should be
able to keep the SO_KEEPALIVE connection at the system defaults (two
hours), reducing the load in the server-down case, and we should no longer
lose the connection in the server-up case.
...of course, this could all be swept under the carpet with some
application layer keepalive, as you discuss. I guess my point is that I'm
not sure what we gain. If there is to be any application layer keepalive,
I'd be more interested in the refreshAndPersist provider occasionally
sending it, since that's the flow that we need working for the next time a
MOD hits. But it's kind of wrong to think this is only affecting syncrepl
-- it's broad across many LDAP clients, and is probably a deeply ingrained
assumption in many of them.

Should application layer keepalives be published as "The Way To Do It"?
Would it make sense for this method to be in (an OpenLDAP extension to)
libldap for other affected applications to use? Or does it make more sense
to just say "OpenLDAP Software depends on the OS/network/firewalls to do
their job, make sure they are configured to detect networking failures and
pass them upwards, we will reconnect when told to"?

On Mon, 17 Sep 2007, Howard Chu wrote:

Following on from ITS#5133, there are a couple of different scenarios to
consider:

1) the remote network segment has disappeared (or the remote server has died)
2) an intervening firewall has killed the connection

Neither case is really distinguishable from the consumer side. In the case of
a hardware failure, where either the remote host or the network to the host
has failed, there's little to be gained by setting an aggressive retry
policy. Failures of that sort tend to take a non-trivial amount of time to
repair. I've seen some app guides recommending keepalives be sent once a
minute or so; to me that is way overdoing things.
In the case of a firewall closing an idle connection, you really have to ask
yourself what you're trying to accomplish - are you trying to send probes
frequently enough to prevent the connection from closing, or are you just
trying to detect that it has closed? This may be giving too much credit to
the firewall admins, but I'd guess that they've set an idle timeout that is
appropriate for the load that their networks see. Artificially inflating
traffic on a connection to prevent it from appearing idle would just be an
abuse of network resources. It's also possible that a stateful firewall will
start dropping connections because it's been overwhelmed by traffic, and
simply doesn't have the memory to track all the live connections. Keeping the
connection open in these circumstances would just make a bad situation worse.
As such, it seems to me that you don't really want to be setting very short
keepalive timeouts anywhere. The default of 2 hours that most systems use
seems pretty reasonable.
On the other hand, it would probably be useful to be able to prod the
consumer and have it kick the connection on request. In the past I've
implemented this sort of thing using Search requests with magic filters.
I.e., treat the Search operation as an RPC call: the target object is simply
an embedded method, and the AVAs in the filter comprise a named parameter
list.

So e.g. one might do a search on "cn=Sync Consumers,cn=monitor" with filter
(|(objectclass=*)(kick=TRUE)) to cause every active consumer to probe its
connection.

I like this approach a lot better than Modifying an object, because you can
hit many objects at once with a Search request, and receive all of their
execution results as attributes of the returned entries.
-- Howard Chu
Chief Architect, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/