Re: (ITS#6104) race condition with cancel operation - openldap-bugs

15 May 2009


      Howard Chu writes:
...
I'd prefer to condense it all into a bitfield.
Right... keep both o_abandon and o_cancel (maybe renamed to o_dummy) for
RE24 though, otherwise we have a silent binary incompatible change with
already-compiled slapd modules.  The change will in any case break
modules that play with o_cancel though.
...
We already ensure that
the c_mutex is held when setting these flags.
Not yet, but c_mutex it is:-)  Must protect the following code.
Or rather, bitflag code which will replace it:
- o_cancel access in cancel.c, connection.c:connection_operation(),
  back-relay/op.c.
- o_abandon setting in overlays/syncprov.c:syncprov_op_abandon(), and
  in overlays/retcode.c where hopefully the first case should have been
    if ( op2.o_abandon ) op->o_abandon = 1;
  so that we can never reset o_abandon once it is set.
...
It would be a lot of
overhead to grab the mutex just to read the flag,
and would require additional analysis to make sure the new locking
behavior doesn't introduce new deadlocks.
Checking o_abandon!=0 works unmutexed.  (Well, technically any
such unmutexed access is wrong, but in practice we should be OK.)
As long as we:
- never reset o_abandon like retcode above can do, and
- never unmutexed assume consistency betweeen abandon-related variables
  set by different threads.
So the equivalent of this should be OK:
  if ( op->o_abandon ) rs->sr_err = SLAPD_ABANDON;
  ...  
  if ( rs->sr_err == SLAPD_ABANDON ) {
     lock c_mutex;
     more careful o_abandon checks;
     unlock c_mutex;
  }
While I'm looking at this:
The ldap_pvt_thread_yield() in cancel.c is really ugly.  There is no
cond to attach to it to without breaking binary compatibility, but even
a even a global mutex+cond might be better, with threads broadcasting on
it when it might be relevant.
It's still ugly though.  Cancelling an expensive operation ought to be a
friendly operation, but instead it is unfriendly in that yet another
thread is occupied, solely to wait for that operation.  What would make
sense (for RE25) would be to redo Cancel so that the cancelled thread is
responsible for sending the Cancel response or resubmitting an operation
which will.
Also, the be_cancel implementation in at least back-relay looks wrong.
Cancel isn't dispatched on DN, and back-relay with no 'relay' directive
configured cannot know where to dispatch.  I haven't checked how it
works, but it seems a call to be_cancel means the Cancel operation
_might_ belong to this backend.  But it might not, which means
back-relay shouldn't set op->o_cancel in this case.  What it should do
I don't know though.  And I don't know if syncprov is doing any better.
-- 
Hallvard