I'm running into the following scenario. Shortly after slapd is hit by a
burst of operations (from several different clients) on existing
connections, it suddenly hangs. The connection count is well under the
maximum at that point, about 3000 out of 16384. slapd stops accepting new
connections and stops processing operations on existing ones. Load
average is near zero during this time, so the process appears to be idle
rather than busy. After 20 minutes (the idletimeout setting), slapd frees
a batch of connections (perhaps around 1000) and resumes working as if
nothing had happened.

The load pattern that triggers this occurs every hour, almost exactly on
the hour (most likely associated with nslcd and cron jobs, which we're
looking to mitigate by other means). Another strange thing is that slapd
survives one such burst without hanging, but then hangs on the *next*
hour's burst.

Are there any resources other than file descriptors that are freed
during idletimeout processing? Are there any parameters besides
idletimeout that can be tuned here? Could this be a deadlock somewhere,
with something grabbing all the locks? Would settings like
set_lk_max_locks be relevant to investigate? Are there any log level
settings that might reveal more of what's happening?
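
One check that might help narrow this down during a hang is to tally the
TCP states of the sockets on the LDAP port. The following is only a
sketch under assumptions not stated above: that the server runs Linux,
that `ss -tan` output is available, and that slapd listens on port 389.
The helper name `count_states` is made up for illustration.

```python
from collections import Counter


def count_states(ss_output, port=389):
    """Count TCP states for sockets whose local port matches `port`.

    Expects text in the column layout printed by `ss -tan`:
    State Recv-Q Send-Q Local-Address:Port Peer-Address:Port
    """
    counts = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # blank or truncated line
        state, local = fields[0], fields[3]
        # The port is whatever follows the last colon; this also handles
        # IPv6 forms such as [::1]:389. The header line never matches.
        if local.rsplit(":", 1)[-1] == str(port):
            counts[state] += 1
    return counts
```

Feeding it the text captured from `ss -tan` while slapd is hung, and again
after it recovers, would show whether the stuck connections are mostly
ESTAB or CLOSE-WAIT. A large CLOSE-WAIT count would mean the clients have
already closed their end and slapd has not yet closed its own, which may or
may not be what idletimeout is cleaning up.
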
Thanks for any suggestions on things to look at and try.
-Kartik