h.b.furuseth@usit.uio.no wrote:
OK, here's where we stand now, with my TODOs, questions and some new issues. 3+ pages. I thought this would be a brief message :-( Hopefully I'll get to some of it this weekend, but no promises.
TODOs (unless someone protests) and questions:
- When a thread is finishing, make it go active for context_reset(). Should make slapi module destructors safer, and I don't see that it can hurt.
I don't understand this.
- I'd like to rename thread contexts to tasks(?) and user contexts back to contexts (as in RE23). The current terminology is confusing.
I don't care about this either way.
- Save and restore the old data for key slap_sasl_bind and, just in case (I can't tell if it is needed), for smbk5pwd.
Might as well do it everywhere we have a setkey / do something / delete key sequence in one function call. Since setkey(non-null) isn't supposed to overwrite existing keys, the change should not be able to hurt. It could protect something I didn't think of, or a future slapd change.
API: the smallest change would be to call pool_getkey() first, but I prefer a pool_swapkey(ctx, key, &data, &kfree) function. Then in RE24, maybe remove the &kfree parameter to pool_getkey().
Could omit the &kfree parameter to swapkey() if new data was guaranteed non-NULL, but that'd require a malloc in bdb (silly says Howard) or passing data in a &union{integer, ptr} (cumbersome say I).
Never use malloc unless absolutely necessary. We have enough problems with heap fragmentation already.
- Axe semaphore code in tpool.c, the ldap_lazy_sem_init() call in slapd/daemon.c, and with them LDAP_PVT_THREAD_POOL_SEM_LOAD_CONTROL in ldap_pvt_thread.h and SLAP_SEM_LOAD_CONTROL in slap.h.
OK.
- Is the ldap_pvt_thread_pool_context_reset() call in slapd/init.c indeed safe - i.e. its slap_sl_mem_destroy() call? That's not during startup, but in slap_destroy(). Slapd does various stuff after that.
Yes.
API changes:
- Do we include the assert(no multiple pools) in the import to RE23?
- Remove keys from ltu_key[] _before_ calling ltk_free() with them? That prevents functions called from ltk_free from seeing freed thread data, but it can also break ltk_free functions that call such functions _before_ freeing the data. Probably not for RE23.
- Scheduling: If several threads call pool_pause(), then once there is a pause, tpool does not schedule them all. They could get handled then, or another thread could undo the pause, so tpool would wait to pause again. Is that deliberate?
I don't understand the question. The point of the pause is to prevent any other thread (in the pool) from running. Why should tpool schedule any other threads at this point?
Dubious code, new issues, old ones I'm not sure we finished with:
thread_keys[] still has problems.
- pool_context() breaks if several ldap_pvt_thread_t bit patterns can represent the same thread ID: TID_HASH() would give different hashes for the same thread, and pool_context() stops searching when it hits a slot with ctx == NULL. (My previous bug report was a bit wrong.)
How can a single thread ID be represented by multiple bit patterns?
The best fix would be to use real thread-specific data instead. Just one key with the ctx for now; that minimizes the changes. OTOH it also means we'll do thread-specific key lookup twice - first in pthread to get the ctx, and then in ltu_key[] to find our data. Anyway, I said I'd do that later, but I might as well get on with it, at least for pthreads. Except I don't know if that's OS-dependent enough that it should wait for RE24?
The best fix may be to just use the address of the thread's stack as the thread ID. That's actually perfect for our purpose.
When that's not implemented, I'd prefer to cast the thread ID to an integer when that is safe, and hash that instead of its bytes. That won't work for structs/unions, and is not formally guaranteed for pointers - or, in the case of pthreads, even for integers. BTW, when it doesn't work, we've got a problem in ITS#4983 (tls.c id_callback) as well. When that's not possible, we could search the entire table before giving up. We could make thread_keys[] smaller and reduce the maximum allowed #threads on such hosts. (What should the limit be?)
- thread_keys is a poor hash table implementation with a poor hash function.
Ideally it would not be a hash table at all, but a direct lookup based on the thread ID.
Oops - for caching, maybe I shouldn't have removed the thread id from thread_keys[]. Or maybe the hash value should be stored there instead, so we can compare with == instead of ldap_pvt_thread_equal.
- I wonder if there are long-lived ldap_pvt_thread_cond_wait(), [r]mutex_lock() or rdwr_[rw]lock() calls which should make the thread go inactive first (after waiting for any pause to finish), so that (a) a pool_pause() will not block waiting for them, and (b) all threads in the pool can't be blocked in long waits without tpool knowing about it.
Long-lived lock calls would be a quite noticeable bug. This seems to be a non-issue.
back-bdb/tools.c uses pool_submit() to start long-lived threads: bdb_tool_trickle_task() and bdb_tool_index_task(). They don't exit until slapd is stopping, so they'll block slapd if a thread requests a pause. Also, the pool thinks it has more available threads than it does, so scheduled tasks might never be executed.
Possibly that's not a problem with slaptools. As complex as slapd is now with overlays and plugins and whatnot, I have no idea.
Tool threads are completely separate from slapd threads. This is not an issue.
slapd/bconfig.c maintains its own idea of the max number of threads in connection_pool_max and slap_tool_thread_max, which does not match tpool's idea if the limit is <=0 or larger than LDAP_MAXTHR.
Since slapd acts on its own count, I imagine that can cause breakage. Simplest fix might be to set max #threads and then read it back, I haven't checked. Would need to return LDAP_MAXTHR instead of 0 for "as many as allowed". Note however also the issue above: tpool's idea of actually available threads can be too high too.
The practical limits are actually pretty low. Even if you're on a 1024 processor machine with many terabytes of RAM, it's not a good idea to have many thousands of threads... Scheduler overhead would overshadow any useful work.
- I said my new ltp_pause waits might be too aggressive. Maybe I should clarify: I don't see any instances where they can be dropped, but it's possible that slapd expects pauses to be less aggressive somewhere.
I don't understand this comment.
About pool_submit() and the ltp_mutex unlock-relock:
Some history and notes, just to finish up previous discussion. Some possible TODOs, but they are not needed since it's not broken:
Originally, pool_submit() unlocked ltp_mutex before thread_create() and re-locked it afterwards for further maintenance. That was not because of the semaphore code. It looks like it could work even in non-threaded mode where thread_create(,,fn,arg) just called fn(arg), though tpool.c was always wrapped in an #ifndef <--without-threads>.
I don't think non-threaded would have worked anyway since rev 1.24 (apr 2003), which introduced thread_keys[]. Caller and callee would have gotten the same thread ID, so tpool would have confused their contexts. Previously the new context was pushed onto ltp_active_list, and would have temporarily hidden the old context.
That's also when thread_create() was moved after the ltp_mutex re-lock, which would mean the mutex was still held while pool_wrapper() ran and tried to re-lock it. And now, my rev.171 has removed the unlock-relock.
With that I can also remove the search for ctx in ltp_pending_list when backing out of pool_submit(); it will be at the head of that list.
With tpool only supporting threaded mode, another cleanup could be for pool_submit to start a thread before inserting ctx in ltp_pending_list instead of after. That done, it can fail if ltp_open_count is still 0, and we won't need the code for backing out of pool_submit().
OK.
Oh, the comment /* no open threads at all?!? */: That's not strange. Happens in the first pool_submit() call(s) if thread_create() fails. Though in that case there was no danger of removing the wrong ctx from ltp_pending_list after all (fixed in rev 1.63).
(Sure, but thread_create() should never fail if you have a sane configuration.)