deferring operation: pending operations

List overview All Threads
Download

newer

older

[Fwd: [!! SPAM] Re:...

slapd crashing "randomly?"

Ben Beuchler

7 Apr 2007 7 Apr '07

3:07 p.m.

I'm getting a ton of these errors in a row, all on the same connection:

Apr 5 11:32:48 swizzle slapd[1626]: connection_input: conn=6232680 deferring operation: pending operations Apr 5 11:32:51 swizzle slapd[1626]: connection_input: conn=6232680 deferring operation: pending operations Apr 5 11:32:51 swizzle slapd[1626]: connection_input: conn=6232680 deferring operation: pending operations Apr 5 11:32:52 swizzle slapd[1626]: connection_input: conn=6232680 deferring operation: pending operations

What does "deferring operation: pending operations" actually indicate?

Thanks!

Show replies by date

Howard Chu

7 Apr 7 Apr

5:06 p.m.

Ben Beuchler wrote:

...

I'm getting a ton of these errors in a row, all on the same connection:

Where does it say there's an error here?

...

Apr 5 11:32:48 swizzle slapd[1626]: connection_input: conn=6232680 deferring operation: pending operations Apr 5 11:32:51 swizzle slapd[1626]: connection_input: conn=6232680 deferring operation: pending operations Apr 5 11:32:51 swizzle slapd[1626]: connection_input: conn=6232680 deferring operation: pending operations Apr 5 11:32:52 swizzle slapd[1626]: connection_input: conn=6232680 deferring operation: pending operations

What does "deferring operation: pending operations" actually indicate?

It means that an operation that was just received on that connection has been deferred, because there are other operations ahead of it that are still pending. I.e., there's a bunch of operations on that connection waiting to execute, and the newcomer is last in line.

-- -- Howard Chu Chief Architect, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/

Ben Beuchler

8 Apr 8 Apr

7:39 p.m.

On 4/7/07, Howard Chu hyc@symas.com wrote:

...

Ben Beuchler wrote:

...
I'm getting a ton of these errors in a row, all on the same connection:

Where does it say there's an error here?

Fair enough. s/errors/messages/g

...

...
Apr 5 11:32:52 swizzle slapd[1626]: connection_input: conn=6232680 deferring operation: pending operations

What does "deferring operation: pending operations" actually indicate?

It means that an operation that was just received on that connection has been deferred, because there are other operations ahead of it that are still pending. I.e., there's a bunch of operations on that connection waiting to execute, and the newcomer is last in line.

Hmmm. That particular connection rarely makes more than a couple search or bind requests per second. Well within what OpenLDAP should be able to handle on a nice, fast server like this. Is there a class of mistakes that the client could be making to cause operations to queue up like this?

-Ben

Howard Chu

9 Apr 9 Apr

3:54 a.m.

Ben Beuchler wrote:

...

On 4/7/07, Howard Chu hyc@symas.com wrote:

...
Ben Beuchler wrote:

...
Apr 5 11:32:52 swizzle slapd[1626]: connection_input: conn=6232680 deferring operation: pending operations

What does "deferring operation: pending operations" actually indicate?

It means that an operation that was just received on that connection has been deferred, because there are other operations ahead of it that are still pending. I.e., there's a bunch of operations on that connection waiting to execute, and the newcomer is last in line.

Hmmm. That particular connection rarely makes more than a couple search or bind requests per second. Well within what OpenLDAP should be able to handle on a nice, fast server like this. Is there a class of mistakes that the client could be making to cause operations to queue up like this?

The only time I've seen this occur on a frequent basis is with poorly written clients like Sendmail, which sends off a stream of a hundred or so search requests on a connection before it checks for any replies. The problem there is that the search replies build up in the network and socket buffers and once the socket buffer gets full we can't send anything else. At that point we stop executing operations on the connection; we just wait for the client to drain it.

-- -- Howard Chu Chief Architect, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/

Ben Beuchler

10 Apr 10 Apr

10:47 p.m.

On 4/7/07, Ben Beuchler insyte@gmail.com wrote:

...

I'm getting a ton of these errors in a row, all on the same connection:

Apr 5 11:32:48 swizzle slapd[1626]: connection_input: conn=6232680 deferring operation: pending operations

I've upgraded the client that was triggering the above failures. That client is still intermittently hanging on its auth attempts. The slapd frequently logs this message on connections owned by this client:

Apr 10 13:14:39 swizzle slapd[5944]: connection_input: conn=1696 deferring operation: binding

The client is making auth binds (anonymous search for dn, followed by binding with that dn) for numerous users. It does all of this over a single LDAP session that is kept open more or less until either the client or slapd is restarted. It rarely makes more than 3 - 5 requests per second, so it seems unlikely that OpenLDAP is actually unable to keep up.

What type of errors could the client make that would result in repeated deferrals?

Thanks!

Hallvard B Furuseth

11:50 p.m.

Ben Beuchler writes:

...

Apr 10 13:14:39 swizzle slapd[5944]: connection_input: conn=1696 deferring operation: binding

Include 256 in your loglevel, if you haven't already, and look for the '<date> swizzle slapd[<pid>]: conn=<num> ...' lines it produces. In particular, look for 'conn=<num> op=<num> RESULT' lines _after_ the 'deferring operation' line, where the matched received operation (same conn and op) was _before_ that line. Notice what the operation was, and what the result code was (RESULT ... err=). You'll find the result code descriptions in RFC 4511.

Note that logging is a bit sloppy - for requests with syntax errors or some other problems, only the response is logged. But since these requests fail without processing they shouldn't take time either.

-- Regards, Hallvard

Ben Beuchler

12 Apr 12 Apr

2:59 a.m.

...

Include 256 in your loglevel, if you haven't already, and look for the '<date> swizzle slapd[<pid>]: conn=<num> ...' lines it produces. In particular, look for 'conn=<num> op=<num> RESULT' lines _after_ the 'deferring operation' line, where the matched received operation (same conn and op) was _before_ that line. Notice what the operation was, and what the result code was (RESULT ... err=). You'll find the result code descriptions in RFC 4511.

The examples I could find all look like this:

Apr 11 15:06:43 cliff slapd[15355]: conn=1 op=36051 BIND anonymous mech=implicit ssf=0 Apr 11 15:06:43 cliff slapd[15355]: conn=1 op=36051 BIND dn="" method=128 Apr 11 15:06:43 cliff slapd[15355]: connection_input: conn=1 deferring operation: pending operations Apr 11 15:06:43 cliff slapd[15355]: conn=1 op=36051 RESULT tag=97 err=0 text=

I can't find a reference to '97' anywhere in the above mentioned RFC. Is there somewhere else I should look?

-Ben

Ben Beuchler

3:03 a.m.

...

The examples I could find all look like this:

Apr 11 15:06:43 cliff slapd[15355]: conn=1 op=36051 BIND anonymous mech=implicit ssf=0 Apr 11 15:06:43 cliff slapd[15355]: conn=1 op=36051 BIND dn="" method=128 Apr 11 15:06:43 cliff slapd[15355]: connection_input: conn=1 deferring operation: pending operations Apr 11 15:06:43 cliff slapd[15355]: conn=1 op=36051 RESULT tag=97 err=0 text=

I can't find a reference to '97' anywhere in the above mentioned RFC. Is there somewhere else I should look?

My apologies; I neglected to include an example of the "deferring operation: binding" log entries:

Apr 11 15:14:49 cliff slapd[15355]: conn=1 op=36640 BIND anonymous mech=implicit ssf=0 Apr 11 15:14:49 cliff slapd[15355]: conn=1 op=36640 BIND dn="uid=bbrogan,dc=users,dc=accounts,dc=ldap,dc=mcad,dc=edu" method=128 Apr 11 15:14:49 cliff slapd[15355]: conn=1 op=36640 BIND dn="uid=bbrogan,dc=users,dc=accounts,dc=ldap,dc=mcad,dc=edu" mech=SIMPLE ssf=0 Apr 11 15:14:49 cliff slapd[15355]: conn=1 op=36640 RESULT tag=97 err=0 text= Apr 11 15:14:50 cliff slapd[15355]: connection_input: conn=1 deferring operation: binding Apr 11 15:14:50 cliff slapd[15355]: conn=1 op=36641 BIND anonymous mech=implicit ssf=0 Apr 11 15:14:50 cliff slapd[15355]: conn=1 op=36641 BIND dn="" method=128 Apr 11 15:14:50 cliff slapd[15355]: conn=1 op=36641 RESULT tag=97 err=0 text=

It appears to my untrained eye that the operation both before and after the deferral completed successfully. Any idea what type of client behavior could trigger messages like this?

-Ben

Howard Chu

4:13 a.m.

Ben Beuchler wrote:

...

My apologies; I neglected to include an example of the "deferring operation: binding" log entries:

Apr 11 15:14:49 cliff slapd[15355]: conn=1 op=36640 BIND anonymous mech=implicit ssf=0 Apr 11 15:14:49 cliff slapd[15355]: conn=1 op=36640 BIND dn="uid=bbrogan,dc=users,dc=accounts,dc=ldap,dc=mcad,dc=edu" method=128 Apr 11 15:14:49 cliff slapd[15355]: conn=1 op=36640 BIND dn="uid=bbrogan,dc=users,dc=accounts,dc=ldap,dc=mcad,dc=edu" mech=SIMPLE ssf=0 Apr 11 15:14:49 cliff slapd[15355]: conn=1 op=36640 RESULT tag=97 err=0 text= Apr 11 15:14:50 cliff slapd[15355]: connection_input: conn=1 deferring operation: binding Apr 11 15:14:50 cliff slapd[15355]: conn=1 op=36641 BIND anonymous mech=implicit ssf=0 Apr 11 15:14:50 cliff slapd[15355]: conn=1 op=36641 BIND dn="" method=128 Apr 11 15:14:50 cliff slapd[15355]: conn=1 op=36641 RESULT tag=97 err=0 text=

It appears to my untrained eye that the operation both before and after the deferral completed successfully. Any idea what type of client behavior could trigger messages like this?

In this particular case it looks like the client has sent two Bind requests back to back. Possibly without waiting for the result from the first one to arrive. As Ando already explained in a previous response, Binds are required to be fully serialized - when a Bind is active on a connection, no other operations are allowed to run. Furthermore, if non-Bind operations are active on a connection and a Bind is received, all of those active operations are required to be immediately abandoned before the Bind proceeds.

One thing you never mentioned - what version of slapd are you running?

I still don't see any particular failures occurring here. If your client isn't having any problems I wouldn't waste any more time worrying about it. If you feel like rewriting your client, you can follow Ando's advice and use a dedicated connection for Binds and a separate connection for the normal operations, but you'll probably need to use proxy authorization in that case.

-- -- Howard Chu Chief Architect, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/

Ben Beuchler

8:07 a.m.

...

One thing you never mentioned - what version of slapd are you running?

2.3.32

...

I still don't see any particular failures occurring here. If your client isn't having any problems I wouldn't waste any more time worrying about it.

It's entirely possible that I'm chasing a red herring. My primary goal in posting here was to find out what, exactly, the deferral messages indicated and, from that, whether or not they could be responsible for the problems I'm seeing.

Thanks to the very clear, helpful responses from you, Pierangelo, Hallvard, and others I'm fairly confident I'll be able to rule the deferrals either in or out as relevant to my problem with a few more troubleshooting steps.

Thanks!

-Ben

Howard Chu

3:34 a.m.

Ben Beuchler wrote:

...

...
Include 256 in your loglevel, if you haven't already, and look for the '<date> swizzle slapd[<pid>]: conn=<num> ...' lines it produces. In particular, look for 'conn=<num> op=<num> RESULT' lines _after_ the 'deferring operation' line, where the matched received operation (same conn and op) was _before_ that line. Notice what the operation was, and what the result code was (RESULT ... err=). You'll find the result code descriptions in RFC 4511.

The examples I could find all look like this:

Apr 11 15:06:43 cliff slapd[15355]: conn=1 op=36051 BIND anonymous mech=implicit ssf=0 Apr 11 15:06:43 cliff slapd[15355]: conn=1 op=36051 BIND dn="" method=128 Apr 11 15:06:43 cliff slapd[15355]: connection_input: conn=1 deferring operation: pending operations Apr 11 15:06:43 cliff slapd[15355]: conn=1 op=36051 RESULT tag=97 err=0 text=

I can't find a reference to '97' anywhere in the above mentioned RFC. Is there somewhere else I should look?

Is there an actual problem showing up on your client? That looks like a pretty normal sequence, the received operations are still getting executed and returning success.

-- -- Howard Chu Chief Architect, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/

Ben Beuchler

8 a.m.

...

Is there an actual problem showing up on your client? That looks like a pretty normal sequence, the received operations are still getting executed and returning success.

There is an actual problem (intermittent hanging of the authentication process under load) but I have nothing more than anecdote to tie it to the deferrals. At this point they are the only lead I have. I'm hoping that whatever the client's doing to cause the deferrals is, eventually, producing the auth failures.

-Ben

Aaron Richton

3:39 a.m.

It looks like that client is trying a SASL bind, then issuing a simple bind prior to receiving the SASL bind response. I don't think that's permitted per RFC; hence, slapd defers the simple bind?

On Wed, 11 Apr 2007, Ben Beuchler wrote:

...

...
Include 256 in your loglevel, if you haven't already, and look for the '<date> swizzle slapd[<pid>]: conn=<num> ...' lines it produces. In particular, look for 'conn=<num> op=<num> RESULT' lines _after_ the 'deferring operation' line, where the matched received operation (same conn and op) was _before_ that line. Notice what the operation was, and what the result code was (RESULT ... err=). You'll find the result code descriptions in RFC 4511.

The examples I could find all look like this:

Apr 11 15:06:43 cliff slapd[15355]: conn=1 op=36051 BIND anonymous mech=implicit ssf=0 Apr 11 15:06:43 cliff slapd[15355]: conn=1 op=36051 BIND dn="" method=128 Apr 11 15:06:43 cliff slapd[15355]: connection_input: conn=1 deferring operation: pending operations Apr 11 15:06:43 cliff slapd[15355]: conn=1 op=36051 RESULT tag=97 err=0 text=

I can't find a reference to '97' anywhere in the above mentioned RFC. Is there somewhere else I should look?

-Ben

Hallvard B Furuseth

7:34 p.m.

Ben Beuchler writes:

...

Apr 11 15:06:43 cliff slapd[15355]: conn=1 op=36051 BIND anonymous mech=implicit ssf=0 Apr 11 15:06:43 cliff slapd[15355]: conn=1 op=36051 BIND dn="" method=128 Apr 11 15:06:43 cliff slapd[15355]: connection_input: conn=1 deferring operation: pending operations Apr 11 15:06:43 cliff slapd[15355]: conn=1 op=36051 RESULT tag=97 err=0 text=

I can't find a reference to '97' anywhere in the above mentioned RFC. Is there somewhere else I should look?

97 is BindResponse. But it's easier to take the conn=1 op=36051 in the RESULT, and find the matching conn=1 op=36051 request. Each request on a connection gets its own op number. Don't be confused by the several BINDs with the same conn+op - that means several things are happening with that request or slapd wanted to log more than one line's worth of data. In this case, the mech=implicit Bind is because any Bind first reverts the connection to anonymous if it is not already anonymous.

BTW, I notice the requests/responses you've posted were less than 2 or often 1 seconds apart - is that how long clients hang, or are there some really long hangs?

On another note, maybe you can dispatch the Binds to e.g. 4 processes and thus 4 connections, thus reducing much of the sequencing.

-- Regards, Hallvard

Ben Beuchler

9:19 p.m.

...

BTW, I notice the requests/responses you've posted were less than 2 or often 1 seconds apart - is that how long clients hang, or are there some really long hangs?

Really long hangs, until I restart the client. Based on feedback from this list, the author found a bug in his auth bind code, which fixed the sequencing on the binds and cleared up the deferrals. So far it appears to have fixed the prolonged hangs, too. It's possible, though, that when I re-linked the daemon against 2.3.32 that fixed something.

...

On another note, maybe you can dispatch the Binds to e.g. 4 processes and thus 4 connections, thus reducing much of the sequencing.

That's an interesting idea; I will propose it to the author.

Thanks!

-Ben

Pierangelo Masarati

11 Apr 11 Apr

12:02 a.m.

Ben Beuchler wrote:

...

On 4/7/07, Ben Beuchler insyte@gmail.com wrote:

...
I'm getting a ton of these errors in a row, all on the same connection:

Apr 5 11:32:48 swizzle slapd[1626]: connection_input: conn=6232680 deferring operation: pending operations

I've upgraded the client that was triggering the above failures. That client is still intermittently hanging on its auth attempts. The slapd frequently logs this message on connections owned by this client:

Apr 10 13:14:39 swizzle slapd[5944]: connection_input: conn=1696 deferring operation: binding

The client is making auth binds (anonymous search for dn, followed by binding with that dn) for numerous users. It does all of this over a single LDAP session that is kept open more or less until either the client or slapd is restarted. It rarely makes more than 3 - 5 requests per second, so it seems unlikely that OpenLDAP is actually unable to keep up.

What type of errors could the client make that would result in repeated deferrals?

I'd note that if your client is performing search + bind on the same connection in an asynchronous manner, then the server behavior as per RFC4511 is undefined, since a server is not supposed to be willing to handle operations intermixed with binds until a bind is concluded. That is, any operation can be multiplexed on a single LDAP connection __except__ binds, which must be serialized. A more appropriate client for that type of operations would need to use at least 2 connections, one for searches and one for binds, and the one for binds should take care of either performing synchronous binds, or in any case to wait for response to a bind before performing another. In case of failure (e.g. timeout) while waiting for bind response, the connection should be trashed (see ldap_unbind_ext(3)) and recreated. This is what is done, for example, by slapd-ldap(5).

Ing. Pierangelo Masarati OpenLDAP Core Team

SysNet s.r.l. via Dossi, 8 - 27100 Pavia - ITALIA http://www.sys-net.it --------------------------------------- Office: +39 02 23998309 Mobile: +39 333 4963172 Email: pierangelo.masarati@sys-net.it ---------------------------------------

6656

Age (days ago)

6661

Last active (days ago)

openldap-software@openldap.org

15 comments

5 participants

tags (0)

participants (5)

Aaron Richton
Ben Beuchler
Hallvard B Furuseth
Howard Chu
Pierangelo Masarati