Strange hang scenario, resumes after idletimeout, but plenty of FDs available

Kartik Subbarao

1 Jun 2011

5:43 a.m.

I'm running into the following scenario. Shortly after slapd gets bombarded by a burst of operations (from several different clients) on existing connections (well under the max number of connections, about 3000 out of 16384), it suddenly hangs. It's not responsive to any new connections, and doesn't process operations on existing connections. Load average is near zero during this time, so it's not doing anything. After 20 minutes (idletimeout), slapd frees several connections (maybe say 1000), and resumes working again as if nothing happened.

The load pattern that gets it into this state happens every hour, almost on the hour (most likely associated with nslcd and cron jobs, which we're looking to mitigate elsewise). Another strange thing is that slapd will survive one instance's worth of bombardment without hanging, but the *next* hour will go into a hang state.

Are there any resources other than file descriptors that are freed up during the idletimeout processing? Are there any other parameters that can be tuned besides idletimeout here? Could it possibly be a case of deadlock somewhere, something grabbing all the locks? Would things like set_lk_max_locks be relevant to investigate here? Any log level settings that might reveal more of what's happening here?

Thanks for any suggestions on things to look at and try.

-Kartik


Kartik Subbarao

1 Jun

6:08 a.m.

On 06/01/2011 08:43 AM, Kartik Subbarao wrote:

...

Are there any resources other than file descriptors that are freed up during the idletimeout processing? Are there any other parameters that can be tuned besides idletimeout here? Could it possibly be a case of deadlock somewhere, something grabbing all the locks? Would things like set_lk_max_locks be relevant to investigate here? Any log level settings that might reveal more of what's happening here?

Update: It's not a locks issue. I ran db4.8_stat -C A while one of the servers was in this hang state and there are plenty of locks available. Also another clarification -- all of the operations during the "bombardment" times in question are read operations from clients, no writes.

-Kartik

Kartik Subbarao

9:19 a.m.

On 06/01/2011 09:08 AM, Kartik Subbarao wrote:

...

Update: It's not a locks issue.

Another update -- after I reduced the idletimeout to 60 seconds, the problem seems to have gone away. It would still be useful to know what might be causing this problem and to be able to support higher levels of idletimeout, but at least I have another workable option now.

-Kartik

Howard Chu

11:02 a.m.

Kartik Subbarao wrote:

...

On 06/01/2011 09:08 AM, Kartik Subbarao wrote:

...
Update: It's not a locks issue.

Another update -- after I reduced the idletimeout to 60 seconds, the problem seems to have gone away. It would still be useful to know what might be causing this problem and to be able to support higher levels of idletimeout, but at least I have another workable option now.

As usual, when investigating a hang, you should attach to slapd with gdb and get a snapshot of what all the threads are doing. Working around it before you know what caused it isn't all that useful in the long run.

--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/
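
A minimal sketch of the thread snapshot suggested above, written in Python purely for illustration. It assumes gdb is installed and that the caller is allowed to attach to the process; the helper names gdb_backtrace_command and dump_thread_backtraces are hypothetical and not part of OpenLDAP. It runs gdb in batch mode with "thread apply all bt", which prints a backtrace for every thread.

```python
import subprocess


def gdb_backtrace_command(pid):
    """Build a gdb invocation that attaches to `pid`, dumps every thread's
    stack, and exits (batch mode)."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def dump_thread_backtraces(pid, timeout=120):
    """Run gdb against a live process and return its textual output.

    Assumption: gdb is on PATH and the caller has permission to attach
    (typically root or the user that owns the process).
    """
    completed = subprocess.run(
        gdb_backtrace_command(pid),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,
    )
    return completed.stdout
```

Calling dump_thread_backtraces() with the slapd PID while the server is hung should give the kind of snapshot asked for here. The process is paused while gdb is attached, and the frames are far more readable once debugging symbols are installed.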

Kartik Subbarao

11:22 a.m.

On 06/01/2011 02:02 PM, Howard Chu wrote:

...

Kartik Subbarao wrote:

...
On 06/01/2011 09:08 AM, Kartik Subbarao wrote:

...
Update: It's not a locks issue.

Another update -- after I reduced the idletimeout to 60 seconds, the problem seems to have gone away. It would still be useful to know what might be causing this problem and to be able to support higher levels of idletimeout, but at least I have another workable option now.

As usual, when investigating a hang, you should attach to slapd with gdb and get a snapshot of what all the threads are doing. Working around it before you know what caused it isn't all that useful in the long run.

Unfortunately this instance of slapd doesn't have debugging symbols enabled (it's a stock Debian 2.4.25 slapd package), so I didn't think gdb would be that helpful...But now that you mention it, I could probably get some basic stack trace info that might yield some clues.

Thanks for the reminder :-)

-Kartik

Quanah Gibson-Mount

11:49 a.m.

--On Wednesday, June 01, 2011 2:22 PM -0400 Kartik Subbarao <subbarao@computer.org> wrote:

...

On 06/01/2011 02:02 PM, Howard Chu wrote:

...
Kartik Subbarao wrote:

...
On 06/01/2011 09:08 AM, Kartik Subbarao wrote:

...
Update: It's not a locks issue.

Another update -- after I reduced the idletimeout to 60 seconds, the problem seems to have gone away. It would still be useful to know what might be causing this problem and to be able to support higher levels of idletimeout, but at least I have another workable option now.

As usual, when investigating a hang, you should attach to slapd with gdb and get a snapshot of what all the threads are doing. Working around it before you know what caused it isn't all that useful in the long run.

Unfortunately this instance of slapd doesn't have debugging symbols enabled (it's a stock Debian 2.4.25 slapd package), so I didn't think gdb would be that helpful...But now that you mention it, I could probably get some basic stack trace info that might yield some clues.

Thanks for the reminder :-)

Debian ships the debugging symbols as a separate package. All you need to do is install that package, and you can get a full backtrace.

p slapd-dbg - Debugging information for the OpenLDAP server (slapd)

--Quanah

--
Quanah Gibson-Mount
Sr. Member of Technical Staff
Zimbra, Inc
A Division of VMware, Inc.
--------------------
Zimbra :: the leader in open source messaging and collaboration

Kartik Subbarao

12:01 p.m.

On 06/01/2011 02:49 PM, Quanah Gibson-Mount wrote:

...

...
Unfortunately this instance of slapd doesn't have debugging symbols enabled (it's a stock Debian 2.4.25 slapd package), so I didn't think gdb would be that helpful...But now that you mention it, I could probably get some basic stack trace info that might yield some clues.

Thanks for the reminder :-)

Debian ships the debugging symbols as a separate package. All you need to do is install that package, and you can get a full backtrace.

p slapd-dbg - Debugging information for the OpenLDAP server (slapd)

Even better! Thanks for the info.

-Kartik

Kartik Subbarao

2 Jun

6:20 a.m.

On 06/01/2011 02:02 PM, Howard Chu wrote:

...

Kartik Subbarao wrote:

...
On 06/01/2011 09:08 AM, Kartik Subbarao wrote:

...
Update: It's not a locks issue.

Another update -- after I reduced the idletimeout to 60 seconds, the problem seems to have gone away. It would still be useful to know what might be causing this problem and to be able to support higher levels of idletimeout, but at least I have another workable option now.

As usual, when investigating a hang, you should attach to slapd with gdb and get a snapshot of what all the threads are doing. Working around it before you know what caused it isn't all that useful in the long run.

Attached is the gdb stack trace from the hang state. It looks like the threads are stuck in pthread_cond_wait() from send_ldap_ber(). Are there other relevant variables/structures to inspect for this scenario?

-Kartik
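
With a dump like this in hand, one quick way to see how many threads are parked in the same place is to tally frames per thread. The sketch below is illustrative only: it assumes the usual "Thread N (...)" headers and "#N ... func (...)" frame lines that gdb prints for "thread apply all bt", and the helper names are hypothetical.

```python
import re

# "Thread 12 (Thread 0x7f... (LWP 4321)):" starts a new thread section.
THREAD_RE = re.compile(r"^Thread (\d+) ")
# "#1  0x0000... in send_ldap_ber (...)" or "#0  func (...) at file.c:10".
FRAME_RE = re.compile(r"^#\d+\s+(?:0x[0-9a-fA-F]+ in )?([^\s(]+) \(")


def frames_by_thread(bt_text):
    """Map each gdb thread number to the list of function names on its stack."""
    threads = {}
    current = None
    for line in bt_text.splitlines():
        header = THREAD_RE.match(line)
        if header:
            current = int(header.group(1))
            threads[current] = []
            continue
        frame = FRAME_RE.match(line)
        if frame and current is not None:
            threads[current].append(frame.group(1))
    return threads


def count_threads_in(bt_text, function):
    """Count the threads that have `function` anywhere on their stack."""
    stacks = frames_by_thread(bt_text).values()
    return sum(1 for frames in stacks if function in frames)
```

For example, count_threads_in(dump, "send_ldap_ber") reports how many threads have that function on their stack, which can be compared against the total number of threads in the dump.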

Kartik Subbarao

6:40 a.m.

On 06/02/2011 09:20 AM, Kartik Subbarao wrote:

...

Attached is the gdb stack trace from the hang state. It looks like the threads are stuck in pthread_cond_wait() from send_ldap_ber(). Are there other relevant variables/structures to inspect for this scenario?

Looking at the source a bit more, this is the line that the threads are stuck on (line 372 in result.c, in the send_ldap_ber() function):

ldap_pvt_thread_cond_wait( &conn->c_write2_cv, &conn->c_write2_mutex );

It would appear that there may be some deadlock around the c_write2_mutex that seems to be freed when the idletimeout processing kicks in and clears out the offending connection(s).

-Kartik
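
The behaviour described here (waiters that only wake up once idle-timeout processing tears the connection down) can be modelled with a plain condition variable. This is a toy model in Python, not slapd's code: the Connection class, its fields, and the idle_reaper function are all invented for illustration.

```python
import threading
import time


class Connection:
    """Toy stand-in for a server-side connection (not slapd's structure)."""

    def __init__(self):
        self.cv = threading.Condition()
        self.writable = False  # the client is not draining its socket
        self.closed = False


def writer(conn, result):
    # Worker thread: block until the connection is writable or torn down.
    with conn.cv:
        while not (conn.writable or conn.closed):
            conn.cv.wait()
        result.append("closed" if conn.closed else "sent")


def idle_reaper(conn, idle_timeout):
    # Stand-in for idle-timeout processing: close the connection, wake waiters.
    time.sleep(idle_timeout)
    with conn.cv:
        conn.closed = True
        conn.cv.notify_all()


def run_demo(idle_timeout=0.2):
    """Return how the writer was released and how long it stayed blocked."""
    conn = Connection()
    result = []
    start = time.monotonic()
    worker = threading.Thread(target=writer, args=(conn, result))
    reaper = threading.Thread(target=idle_reaper, args=(conn, idle_timeout))
    worker.start()
    reaper.start()
    worker.join()
    reaper.join()
    return result[0], time.monotonic() - start
```

In this model the writer stays blocked for roughly the idle timeout and is released only when the connection is closed, which matches the pattern of the hang lasting as long as idletimeout.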

Howard Chu

10:36 a.m.

Kartik Subbarao wrote:

...

On 06/01/2011 02:02 PM, Howard Chu wrote:

...
Kartik Subbarao wrote:

...
On 06/01/2011 09:08 AM, Kartik Subbarao wrote:

...
Update: It's not a locks issue.

Another update -- after I reduced the idletimeout to 60 seconds, the problem seems to have gone away. It would still be useful to know what might be causing this problem and to be able to support higher levels of idletimeout, but at least I have another workable option now.

As usual, when investigating a hang, you should attach to slapd with gdb and get a snapshot of what all the threads are doing. Working around it before you know what caused it isn't all that useful in the long run.

Attached is the gdb stack trace from the hang state. It looks like the threads are stuck in pthread_cond_wait() from send_ldap_ber(). Are there other relevant variables/structures to inspect for this scenario?

-Kartik

Also get a netstat -nA inet. Threads waiting in send_ldap_ber() means their output buffers got full: the clients didn't read the pending data.

--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/
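
The "output buffers got full" condition can be reproduced in isolation with a local socket pair. This is a generic sketch of socket behaviour, not slapd code, and the function name is hypothetical: one end keeps writing while the other end never reads, until the kernel refuses more data; once the reader drains the socket, writing works again.

```python
import socket


def demo_full_output_buffer(chunk=b"x" * 4096):
    """Fill a socket whose peer is not reading, then drain it.

    Returns (bytes_queued, bytes_drained, write_resumed).
    """
    server, client = socket.socketpair()
    try:
        server.setblocking(False)
        client.setblocking(False)

        # Server side: keep sending while the "client" reads nothing.
        queued = 0
        try:
            while True:
                queued += server.send(chunk)
        except BlockingIOError:
            # Buffer is full. A blocking writer would be stuck right here.
            pass

        # Client side finally reads everything that was pending.
        drained = 0
        try:
            while True:
                data = client.recv(65536)
                if not data:
                    break
                drained += len(data)
        except BlockingIOError:
            pass

        # With the backlog gone, the server can write again.
        write_resumed = server.send(b"x") == 1
        return queued, drained, write_resumed
    finally:
        server.close()
        client.close()
```

A server thread that waits for such a connection to become writable makes no progress until the client reads or the connection is closed.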

Kartik Subbarao

3 Jun

7:41 a.m.

On 06/02/2011 01:36 PM, Howard Chu wrote:

...

...
Attached is the gdb stack trace from the hang state. It looks like the threads are stuck in pthread_cond_wait() from send_ldap_ber(). Are there other relevant variables/structures to inspect for this scenario?

Also get a netstat -nA inet. Threads waiting in send_ldap_ber() means their output buffers got full: the clients didn't read the pending data.

(Un)fortunately that hang state hasn't occurred since yesterday (even with idletimeout set to 20 min).

I did run that netstat command periodically in any case, just to observe what happens during times of high traffic. The one thing I do notice sporadically is a bunch of connections in FIN_WAIT_1. But they do seem to clear up on their own within a minute or so. So it doesn't look like there's a long-term problem with accumulating TCP connections.

I'll try to capture the netstat during the hang time if/when it happens again.

-Kartik
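
When comparing periodic netstat captures like these, it can help to reduce each one to a count per TCP state plus the connections that still have unsent or unacknowledged data queued. A rough sketch, assuming the Linux net-tools column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State); the function name is hypothetical.

```python
from collections import Counter


def summarize_netstat(netstat_text):
    """Summarize TCP lines from `netstat -nA inet` output.

    Returns (states, backlog): a Counter of connection states, and a list of
    (local, foreign, send_q, state) tuples for sockets with a non-zero Send-Q.
    """
    states = Counter()
    backlog = []
    for line in netstat_text.splitlines():
        fields = line.split()
        # Skip headers, UDP sockets, and anything without a State column.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        _proto, _recv_q, send_q, local, foreign, state = fields[:6]
        states[state] += 1
        if int(send_q) > 0:
            backlog.append((local, foreign, int(send_q), state))
    return states, backlog
```

A large, persistent Send-Q on connections to particular clients during a hang would be consistent with clients not reading their pending responses, while a short-lived bump in one state that clears within a minute is less likely to be the culprit.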

Kartik Subbarao

1 Jun

11:02 a.m.

On 06/01/2011 12:19 PM, Kartik Subbarao wrote:

...

On 06/01/2011 09:08 AM, Kartik Subbarao wrote:

...
Update: It's not a locks issue.

Another update -- after I reduced the idletimeout to 60 seconds, the problem seems to have gone away. It would still be useful to know what might be causing this problem and to be able to support higher levels of idletimeout, but at least I have another workable option now.

Well, it looks like I spoke somewhat too soon. The problem hasn't entirely gone away, but now it happens less frequently and hangs for only 60 seconds instead of 20 minutes, which confirms my observation that this is idletimeout-related. Any thoughts/ideas would be greatly appreciated.

-Kartik

David Hawes

2 Jun

9:02 a.m.

On 06/01/2011 08:43 AM, Kartik Subbarao wrote:

...

I'm running into the following scenario. Shortly after slapd gets bombarded by a burst of operations (from several different clients) on existing connections (well under the max number of connections, about 3000 out of 16384), it suddenly hangs. It's not responsive to any new connections, and doesn't process operations on existing connections. Load average is near zero during this time, so it's not doing anything. After 20 minutes (idletimeout), slapd frees several connections (maybe say 1000), and resumes working again as if nothing happened.

The load pattern that gets it into this state happens every hour, almost on the hour (most likely associated with nslcd and cron jobs, which we're looking to mitigate elsewise). Another strange thing is that slapd will survive one instance's worth of bombardment without hanging, but the *next* hour will go into a hang state.

Are there any resources other than file descriptors that are freed up during the idletimeout processing? Are there any other parameters that can be tuned besides idletimeout here? Could it possibly be a case of deadlock somewhere, something grabbing all the locks? Would things like set_lk_max_locks be relevant to investigate here? Any log level settings that might reveal more of what's happening here?

I have noticed similar behavior on a handful of occasions with 2.4.23 and bdb-4.7.25p4.

When this happens, the last log entry I typically see is a search that misses the indexes (e.g. (mail=*a*)).

The server has the default idletimeout (disabled).

I have as yet been unable to force the hang, though I have not tried heavier loads with SLAMD. It has also been a while since I have seen this, so I do not have a stacktrace handy.

I just wanted to add this anecdotal evidence of the hang. I hope at some point I'll be able to get a working stacktrace. Of course, I should also try newer versions of OpenLDAP and BDB.

openldap-technical@openldap.org
