Some more comments, review, planned changes, and questions.
First, for completeness, a summary of the bugs I've fixed in HEAD. (I was a bit too busy to write it up at the time.) This restates the CVS log comments - nothing new, but perhaps a bit clearer. Some of the issues may be tpool-only issues that cannot happen the way slapd uses tpool.
----------------------------------------
rev 1.63:
  ldap_pvt_thread_pool_submit(), after failed thread creation:
  - ltp_pending_count++ should be --.
  - should signal if there are no more threads, in case pool_destroy()
    is waiting for that.  (Not sure if that can happen.)
  ldap_int_thread_pool_wrapper():
  - The 'if (ltp_state...)' test was spurious: it would always be true,
    and if it were not, we would get an eternal loop.

rev 1.64:
  pool_pause():
  - would cancel states FINISHING and STOPPING; pool_resume() would set
    the state back to RUNNING.
  - could hang when waiting for another pool_pause(), since it did not
    signal ltp_pcond when it decremented ltp_active_count to 1.
  pool_wrapper():
  - could update thread_keys[] during a pause while initializing, and
    possibly while finishing.
  - just after initializing, could handle a new ctx during a pause.
  pool_destroy():
  - would go on with destruction if pool_resume() was called, since it
    used if () { wait; } instead of a while () loop, and both wait on
    the same condition variable.

rev 1.65:
  - pool_context() read thread_keys[] without mutex protection.
    pool_purgekey() did too, but it expected a pause to protect the
    keys.  pool_wrapper() held ltp_mutex while writing, but did not
    quite ensure a pause.

rev 1.66:
  - thread_keys[] is a (poor) open-addressed hash table, but it lacked
    a "deleted item" mark.  So keys (contexts) could become hidden and
    re-emerge later if threads exited and new ones were then started.
  - TID_HASH() did its hash calculation in signed arithmetic, thus
    depending on friendly integer overflow.

rev 1.67:
  - setkey/getkey expected ldap_int_thread_userctx_t.ltu_key[] to be
    NULL-terminated when not full.  purgekey() could set a NULL in the
    middle, hiding later keys until a new key overwrote the NULL.
  API changes:
  - setkey(data=NULL, kfree!=NULL) searched as if intended to reset the
    key, but updated by setting the key.  Now it always updates.
  - setkey(key=<not found>, data=NULL) could return either success or
    failure.  Now it succeeds iff (data == NULL && kfree == NULL).

rev 1.68:
  - ltp_max_count <= 0 meant "allow an unlimited number of threads",
    but thread #LDAP_MAXTHR+1 and later would loop forever looking for
    a free thread_keys[] slot.  (Not fixed for multi-pool applications.)
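The rev 1.66 TID_HASH fix comes down to keeping the whole calculation in unsigned types, since unsigned overflow wraps by definition while signed overflow is undefined behavior in C. A minimal sketch of the idea - this is an illustrative variant, not the exact tpool.c macro:

```c
#include <stddef.h>

/* Hash the bytes of an opaque thread id in unsigned arithmetic.
 * Every intermediate value is unsigned, so overflow wraps modulo
 * 2^N instead of being undefined, which was the 1.66 bug. */
static unsigned int tid_hash(const void *tid, size_t len)
{
    const unsigned char *p = (const unsigned char *)tid;
    unsigned int h = 0;
    size_t i;

    for (i = 0; i < len; i++)
        h = (h << 5) + h + p[i];   /* h * 33 + byte, all unsigned */
    return h;
}
```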
----------------------------------------
I wrote in the tpool.c 1.66 -> 1.67 patch:
API changes, need review - I'm not sure what's intended here:
- setkey(data=NULL, kfree!=NULL) searched as if intended to reset the key, but updated by setting the key. Now always updates.
Right. That matches the one place it can happen: in bdb_locker_id() with setkey(,,data=(void *)(long)lockid,). I don't know whether lockid can be 0, but if it can, the cast turns data into NULL.
But it would be cleaner to ch_malloc an integer instead of using the (void *) cast. Then nothing depends on integer->pointer->integer conversions preserving the value either. BTW, is there a reason for the cast to long? lockid is u_int32_t, and the 'long' getkey result is cast back to u_int32_t.
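A minimal sketch of that suggestion - the function names here are illustrative stand-ins, not slapd code: store a heap-allocated copy of the 32-bit lock id as the key's data, so a lockid of 0 no longer collides with "no key set" (NULL data).

```c
#include <stdint.h>
#include <stdlib.h>

/* Free function matching tpool's ltk_free signature; releases the
 * heap copy made by make_locker_data() below. */
static void free_locker_id(void *key, void *data)
{
    (void)key;
    free(data);
}

/* Allocate a copy of the lock id instead of casting it to void*.
 * The value is preserved exactly, even when lockid == 0, and no
 * integer->pointer->integer round-trip is involved. */
static void *make_locker_data(uint32_t lockid)
{
    uint32_t *p = malloc(sizeof *p);

    if (p != NULL)
        *p = lockid;
    return p;
}
```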
Howard Chu wrote:
Also if ctx is invalid (another thread handled it) and a new pending request has gotten the now-unused ctx, then the 'if (pool->ltp_open_count == 0)' code will free that new pending request instead of the old one.
Most likely this is due to unlocking and relocking the mutex around the semaphore code. Axe that, remove the gratuitous unlock/relock, and then that problem goes away.
Was that an instruction/permission to axe it or not? If not I'll just move the unlock/lock into the #ifdef for now.
Thus far we've only had to worry about a single pool. If anyone can come up with a reason to need multiple pools, I guess we need to hear it.
I'll insert "static int initialized; assert(!initialized++);" in pool_init() and see if anyone complains.
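A sketch of that tripwire. In slapd it would be the literal "static int initialized; assert(!initialized++);" inside pool_init(); here the guard returns a status instead of aborting so the behavior can be exercised, and pool_init_guard() is an illustrative name:

```c
/* One-shot guard: detects a second pool being initialized.
 * The real version would assert() and abort instead of returning. */
static int pool_initialized;

static int pool_init_guard(void)
{
    /* first call: 0 (ok); any later call: -1 (a second pool) */
    return pool_initialized++ ? -1 : 0;
}
```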
- setkey does not free the old data with ltk_free. Possibly that's intentional, but if so that requires weird code elsewhere.
Yes, that's intentional. ltk_free is only intended for cleanup if a key is still set by the time a thread exits. All of the keys we set are live for the entire life of a thread, they don't get re-assigned.
Even the (void*)slap_sasl_bind key? lutil_passwd() looks like it can be recursive: Register a password scheme which finds an entry in LDAP and authenticates via that entry's userPassword - which again could call lutil_passwd().
If that's legal, this code in be_isroot_pw()/slap_passwd_check() can break:

  pool_setkey( ,slap_sasl_bind, <ctx>, <freefunc> );
  ...
  /* slap_passwd_check() does access_allowed() here */
  lutil_passwd();
  ...
  pool_setkey( ,slap_sasl_bind, NULL, NULL );

The final setkey would need to restore the old key.
I'm not sure of contrib/slapd-modules/smbk5pwd/smbk5pwd.c + some overlay either. rwm + back-relay which uses another smbk5pwd or something. (And I have no idea of the BDB cache stuff, but since you wrote it I assume you know:-)
If it's a problem, we can replace code like that with

  void *data = <ctx>;
  ldap_pvt_thread_pool_keyfree_t *kfree = <freefunc>;
  pool_swapkey( ,slap_sasl_bind, &data, &kfree );  /* new function */
  lutil_passwd();
  pool_setkey( ,slap_sasl_bind, data, kfree );

or, if setkey reverts to ignoring kfree when it checks whether we are setting or deleting a key, the kfree function can be passed directly.
I wrote:
But this reminds me: How much should an ltk_free() function be allowed to do?
- pool_context_reset() calls ltk_free(). Should pool_wrapper() go active (wait for !pause, increment ltp_active_count, unlock ltp_mutex) before calling pool_context_reset()?
I think so. It makes slapi safer, and I don't see that it can hurt: context_reset() can call connection_fake_destroy(), which via slapi_int_free_object_extensions() calls slapi module destructors, which can do who-knows-what.
Another stray thought: It would similarly be safer for context_reset() and purgekey() to delete the key before calling ltk_free, so that function can't access the key's data after freeing it. OTOH that would break any slapi module whose ltk_free calls a function which uses the key before freeing it. I'll leave it alone unless someone says otherwise.
Other matters:
pool->ltp_active_list is maintained but not used for anything. Do I delete it? Or does tpool maintain the list to ease debugging?
Use of tpool:
Another bug: slapd does not always check whether pool_submit() and pool_setkey() succeed, even though failure is not a programming error - both can run into resource limits at runtime.
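The pattern the callers should follow, sketched with illustrative stand-ins (pool_submit_stub(), submit_or_defer() and run_task() are not slapd code): a nonzero return from pool_submit() - e.g. the pool hit ltp_max_count or thread creation failed - is an expected condition and must be handled, not ignored.

```c
#include <stddef.h>

typedef int (*task_fn)(void *);

/* Stand-in for ldap_pvt_thread_pool_submit(): fails when the pool
 * is at a limit; otherwise runs the task inline for this sketch. */
static int pool_submit_stub(task_fn fn, void *arg, int pool_full)
{
    if (pool_full)
        return -1;          /* resource limit reached - not a bug */
    return fn(arg);
}

static int run_task(void *arg)
{
    (void)arg;
    return 0;
}

/* The caller checks the return value and copes, e.g. by deferring
 * the operation or returning LDAP_BUSY to the client. */
static int submit_or_defer(int pool_full)
{
    if (pool_submit_stub(run_task, NULL, pool_full) != 0)
        return 1;           /* deferred */
    return 0;               /* submitted */
}
```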
I do not understand the calls to ldap_pvt_thread_pool_context_reset() in slapd (init.c and bconfig.c). That frees the thread's sl_malloc memory with slap_sl_mem_destroy(), which seems rather hazardous until the thread is really, really done.
For that matter, I guess that means context_reset() had better release keys in the reverse order of how they were set. That usually happened, but not always, and my clear_key_idx() made the "not always" more common. So to close the gap from a deleted key I'll make it move all later keys forward, instead of just moving the latest key to the gap.
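The planned compaction can be sketched like this, using a simplified stand-in for ldap_int_thread_userctx_t's ltu_key[] rather than the real clear_key_idx(): shifting every later key one slot forward keeps the keys in the order they were set, so context_reset() can release them strictly in reverse order.

```c
#include <stddef.h>

#define MAXKEYS 32

/* Simplified key slot; ltk_key == NULL marks an unused slot. */
struct key_slot {
    void *ltk_key;
    void *ltk_data;
};

/* Close the gap left by deleting keys[idx] by moving all later keys
 * forward one slot, instead of moving only the latest key into the
 * gap (which would reorder the keys). */
static void clear_key_idx(struct key_slot keys[], int idx)
{
    for (; idx < MAXKEYS - 1 && keys[idx + 1].ltk_key != NULL; idx++)
        keys[idx] = keys[idx + 1];
    keys[idx].ltk_key = keys[idx].ltk_data = NULL;
}
```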
Bdb uses the database environment address as a key to pool_setkey(). So there can be at most ~30 BDB databases with an active key at the same time in the same thread. What does that mean - 30 overlays all using bdb? Access control groups spanning 30 databases?
Nitpick: Slapd uses both function pointers and data pointers as keys to pool_[sg]etkey(). That makes it possible for two "different" keys to compare equal, on a host where functions and data live in different segments that otherwise share the same address space. So I guess I should clean up and replace (void*)<function> keys with the address of static char <function>_key;
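A sketch of that cleanup, with illustrative _key names: each setkey user gets a dedicated static object whose address is the key. Two distinct static objects are guaranteed distinct addresses by the C standard, while a (void *)<function> cast and a data pointer could in principle compare equal on a segmented architecture.

```c
/* Dedicated key objects; only their addresses matter, never their
 * contents.  Names are illustrative, not actual slapd identifiers. */
static char slap_sasl_bind_key;
static char bdb_locker_id_key;

/* Callers would then pass e.g. &slap_sasl_bind_key to pool_setkey()
 * instead of (void *)slap_sasl_bind. */
static const void *sasl_key(void) { return &slap_sasl_bind_key; }
static const void *bdb_key(void)  { return &bdb_locker_id_key; }
```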