On 06/01/2011 08:43 AM, Kartik Subbarao wrote:
I'm running into the following scenario. Shortly after slapd gets bombarded by a burst of operations (from several different clients) on existing connections (well under the max number of connections, about 3000 out of 16384), it suddenly hangs. It doesn't respond to new connections, and doesn't process operations on existing connections. Load average is near zero during this time, so it's not doing anything. After 20 minutes (idletimeout), slapd frees several connections (roughly 1000), and resumes working again as if nothing happened.
The load pattern that gets it into this state happens every hour, almost on the hour (most likely associated with nslcd and cron jobs, which we're looking to mitigate by other means). Another strange thing is that slapd will survive one hour's burst without hanging, but then hang on the *next* hour's burst.
Are there any resources other than file descriptors that are freed up during the idletimeout processing? Are there any other parameters that can be tuned besides idletimeout here? Could it possibly be a case of deadlock somewhere, something grabbing all the locks? Would things like set_lk_max_locks be relevant to investigate here? Any log level settings that might reveal more of what's happening here?
I have noticed similar behavior on a handful of occasions with 2.4.23 and bdb-4.7.25p4.
When this happens, the last log entry I typically see is a search that misses the indexes (e.g. (mail=*a*)).
The server has the default idletimeout (disabled).
I have not yet been able to force the hang, though I have not tried heavier loads with SLAMD. It has also been a while since I last saw it, so I do not have a stack trace handy.
I just wanted to add this anecdotal evidence of the hang. I hope at some point I'll be able to get a usable stack trace. Of course, I should also try newer versions of OpenLDAP and BDB.
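
In case it helps anyone else chasing this, here is a rough sketch of how one might pull the searches that were still in flight out of the log after a hang, to check whether an unindexed search such as (mail=*a*) is always the last thing outstanding. It assumes slapd is logging at the stats level and that the lines look like the usual "conn=N op=M SRCH base=... filter=..." and "conn=N op=M SEARCH RESULT ..." entries. The function name and the log path on the command line are only illustrative, and I have not run this against a log from an actual hang.

```python
import re
import sys

# Patterns assume the usual stats-level slapd log lines (an assumption;
# adjust them if your syslog output looks different).
SRCH = re.compile(r'conn=(\d+) op=(\d+) SRCH base="([^"]*)".*filter="(.*)"')
RESULT = re.compile(r'conn=(\d+) op=(\d+) SEARCH RESULT')
CLOSED = re.compile(r'conn=(\d+) fd=\d+ closed')


def pending_searches(lines):
    """Return {(conn, op): (base, filter)} for searches with no logged result."""
    pending = {}
    for line in lines:
        m = SRCH.search(line)
        if m:
            # Search started: remember it until its result shows up.
            pending[(int(m.group(1)), int(m.group(2)))] = (m.group(3), m.group(4))
            continue
        m = RESULT.search(line)
        if m:
            # Search finished: no longer outstanding.
            pending.pop((int(m.group(1)), int(m.group(2))), None)
            continue
        m = CLOSED.search(line)
        if m:
            # Connection closed: drop anything still recorded for it.
            conn = int(m.group(1))
            for key in [k for k in pending if k[0] == conn]:
                del pending[key]
    return pending


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python pending.py <path-to-slapd-log>
    with open(sys.argv[1]) as log:
        for (conn, op), (base, flt) in sorted(pending_searches(log).items()):
            print(f"conn={conn} op={op} base={base!r} filter={flt!r}")
```

Anything it prints for the minutes before the hang is a search that started but never logged a result, which would be a reasonable thing to line up against a stack trace once one is available.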