Some more comments, review, planned changes, and questions.
First, for completeness, a summary of the bugs I've fixed in HEAD. (I was a bit too busy to write it up at the time.) This restates the CVS log comments - nothing new, but perhaps a bit clearer. Some of the issues may be tpool-only issues that cannot happen the way slapd uses tpool.
----------------------------------------
rev 1.63:
  ldap_pvt_thread_pool_submit(), after failed thread creation:
  - ltp_pending_count++ should be --.
  - should signal if there are no more threads, in case pool_destroy()
    is waiting for that.  (Not sure if that can happen.)
  ldap_int_thread_pool_wrapper():
  - The 'if (ltp_state...)' test was spurious: it would always be true,
    and if it were not, we would get an eternal loop.

rev 1.64:
  pool_pause():
  - would cancel states FINISHING and STOPPING; pool_resume() would set
    the state back to RUNNING.
  - could hang when waiting for another pool_pause(), since it did not
    signal ltp_pcond when it decremented ltp_active_count to 1.
  pool_wrapper():
  - could update thread_keys[] during a pause while initializing, and
    possibly while finishing.
  - just after initializing, could handle a new ctx during a pause.
  pool_destroy():
  - would go on with destruction if pool_resume() was called, since it
    used if () { wait; } instead of a while () loop, and both wait on
    the same condition variable.

rev 1.65:
  - pool_context() read thread_keys[] without mutex protection.
    pool_purgekey() did too, but it expected a pause to protect the
    keys.  pool_wrapper() held ltp_mutex while writing, but did not
    quite ensure a pause.

rev 1.66:
  - thread_keys[] is a (poor) open-addressed hash table, but it lacked
    a "deleted item" mark.  So keys (contexts) could become hidden and
    re-emerge later if threads exited and new ones were then started.
  - TID_HASH() did its hash calculation in signed arithmetic, thus
    depending on friendly integer overflow.

rev 1.67:
  - setkey/getkey expected ldap_int_thread_userctx_t.ltu_key[] to be
    NULL-terminated when not full.  purgekey() could set a NULL in the
    middle, hiding later keys until a new key overwrote the NULL.
  API changes:
  - setkey(data=NULL, kfree!=NULL) searched as if intended to reset the
    key, but updated by setting the key.  Now it always updates.
  - setkey(key=<not found>, data=NULL) could return either success or
    failure.  Now it succeeds iff (data == NULL && kfree == NULL).

rev 1.68:
  - ltp_max_count <= 0 meant "allow an unlimited number of threads",
    but thread #LDAP_MAXTHR+1 and later would loop forever looking for
    a free thread_keys[] slot.  (Not fixed for multi-pool applications.)
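The rev 1.66 TID_HASH fix comes down to keeping the whole calculation in unsigned types, since unsigned overflow wraps by definition while signed overflow is undefined behavior in C. A minimal sketch of the idea - this is an illustrative variant, not the exact tpool.c macro:

```c
#include <stddef.h>

/* Hash the bytes of an opaque thread id in unsigned arithmetic.
 * Every intermediate value is unsigned, so overflow wraps modulo
 * 2^N instead of being undefined, which was the 1.66 bug. */
static unsigned int tid_hash(const void *tid, size_t len)
{
    const unsigned char *p = (const unsigned char *)tid;
    unsigned int h = 0;
    size_t i;

    for (i = 0; i < len; i++)
        h = (h << 5) + h + p[i];   /* h * 33 + byte, all unsigned */
    return h;
}
```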
----------------------------------------
I wrote in the tpool.c 1.66 -> 1.67 patch:
API changes, need review - I'm not sure what's intended here:
- setkey(data=NULL, kfree!=NULL) searched as if intended to reset the key, but updated by setting the key. Now always updates.
Right. That matches the one place it can happen: in bdb_locker_id() with setkey(,,data=(void *)(long)lockid,). I don't know whether lockid can be 0, but if it can, the cast turns data into NULL.
But it would be cleaner to ch_malloc an integer instead of using the (void *) cast. Then nothing depends on integer->pointer->integer conversions preserving the value either. BTW, is there a reason for the cast to long? lockid is u_int32_t, and the 'long' getkey result is cast back to u_int32_t.
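A minimal sketch of that suggestion - the function names here are illustrative stand-ins, not slapd code: store a heap-allocated copy of the 32-bit lock id as the key's data, so a lockid of 0 no longer collides with "no key set" (NULL data).

```c
#include <stdint.h>
#include <stdlib.h>

/* Free function matching tpool's ltk_free signature; releases the
 * heap copy made by make_locker_data() below. */
static void free_locker_id(void *key, void *data)
{
    (void)key;
    free(data);
}

/* Allocate a copy of the lock id instead of casting it to void*.
 * The value is preserved exactly, even when lockid == 0, and no
 * integer->pointer->integer round-trip is involved. */
static void *make_locker_data(uint32_t lockid)
{
    uint32_t *p = malloc(sizeof *p);

    if (p != NULL)
        *p = lockid;
    return p;
}
```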
Howard Chu wrote:
Also if ctx is invalid (another thread handled it) and a new pending request has gotten the now-unused ctx, then the 'if (pool->ltp_open_count == 0)' code will free that new pending request instead of the old one.
Most likely this is due to unlocking and relocking the mutex around the semaphore code. Axe that, remove the gratuitous unlock/relock, and then that problem goes away.
Was that an instruction/permission to axe it or not? If not I'll just move the unlock/lock into the #ifdef for now.
Thus far we've only had to worry about a single pool. If anyone can come up with a reason to need multiple pools, I guess we need to hear it.
I'll insert "static int initialized; assert(!initialized++);" in pool_init() and see if anyone complains.
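A sketch of that tripwire. In slapd it would be the literal "static int initialized; assert(!initialized++);" inside pool_init(); here the guard returns a status instead of aborting so the behavior can be exercised, and pool_init_guard() is an illustrative name:

```c
/* One-shot guard: detects a second pool being initialized.
 * The real version would assert() and abort instead of returning. */
static int pool_initialized;

static int pool_init_guard(void)
{
    /* first call: 0 (ok); any later call: -1 (a second pool) */
    return pool_initialized++ ? -1 : 0;
}
```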
- setkey does not free the old data with ltk_free. Possibly that's intentional, but if so that requires weird code elsewhere.
Yes, that's intentional. ltk_free is only intended for cleanup if a key is still set by the time a thread exits. All of the keys we set are live for the entire life of a thread, they don't get re-assigned.
Even the (void*)slap_sasl_bind key? lutil_passwd() looks like it can be recursive: Register a password scheme which finds an entry in LDAP and authenticates via that entry's userPassword - which again could call lutil_passwd().
If that's legal, this code in be_isroot_pw()/slap_passwd_check() can break:

  pool_setkey( ,slap_sasl_bind, <ctx>, <freefunc> );
  ...
  /* slap_passwd_check() does access_allowed() here */
  lutil_passwd();
  ...
  pool_setkey( ,slap_sasl_bind, NULL, NULL );

The final setkey would need to restore the old key.
I'm not sure of contrib/slapd-modules/smbk5pwd/smbk5pwd.c + some overlay either. rwm + back-relay which uses another smbk5pwd or something. (And I have no idea of the BDB cache stuff, but since you wrote it I assume you know:-)
If it's a problem, we can replace code like that with

  void *data = <ctx>;
  ldap_pvt_thread_pool_keyfree_t *kfree = <freefunc>;
  pool_swapkey( ,slap_sasl_bind, &data, &kfree );  /* new function */
  lutil_passwd();
  pool_setkey( ,slap_sasl_bind, data, kfree );

or, if setkey reverts to ignoring kfree when it checks whether we are setting or deleting a key, the kfree function can be passed directly.
I wrote:
But this reminds me: How much should an ltk_free() function be allowed to do?
- pool_context_reset() calls ltk_free(). Should pool_wrapper() go active (wait for !pause, increment ltp_active_count, unlock ltp_mutex) before calling pool_context_reset()?
I think so. It makes slapi safer, and I don't see that it can hurt: context_reset() can call connection_fake_destroy(), which via slapi_int_free_object_extensions() calls slapi module destructors, which can do who-knows-what.
Another stray thought: It would similarly be safer for context_reset() and purgekey() to delete the key before calling ltk_free, so that function can't access the key's data after freeing it. OTOH that would break any slapi module whose ltk_free calls a function which uses the key before freeing it. I'll leave it alone unless someone says otherwise.
Other matters:
pool->ltp_active_list is maintained but not used for anything. Do I delete it? Or does tpool maintain the list to ease debugging?
Use of tpool:
Another bug: slapd does not always check whether pool_submit() and pool_setkey() succeed, even though failure is not a programming error - both can run into resource limits at runtime.
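The pattern the callers should follow, sketched with illustrative stand-ins (pool_submit_stub(), submit_or_defer() and run_task() are not slapd code): a nonzero return from pool_submit() - e.g. the pool hit ltp_max_count or thread creation failed - is an expected condition and must be handled, not ignored.

```c
#include <stddef.h>

typedef int (*task_fn)(void *);

/* Stand-in for ldap_pvt_thread_pool_submit(): fails when the pool
 * is at a limit; otherwise runs the task inline for this sketch. */
static int pool_submit_stub(task_fn fn, void *arg, int pool_full)
{
    if (pool_full)
        return -1;          /* resource limit reached - not a bug */
    return fn(arg);
}

static int run_task(void *arg)
{
    (void)arg;
    return 0;
}

/* The caller checks the return value and copes, e.g. by deferring
 * the operation or returning LDAP_BUSY to the client. */
static int submit_or_defer(int pool_full)
{
    if (pool_submit_stub(run_task, NULL, pool_full) != 0)
        return 1;           /* deferred */
    return 0;               /* submitted */
}
```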
I do not understand the calls to ldap_pvt_thread_pool_context_reset() in slapd (init.c and bconfig.c). That frees the thread's sl_malloc memory with slap_sl_mem_destroy(), which seems rather hazardous until the thread is really, really done.
For that matter, I guess that means context_reset() had better release keys in the reverse order of how they were set. That usually happened, but not always, and my clear_key_idx() made the "not always" more common. So to close the gap from a deleted key I'll make it move all later keys forward, instead of just moving the latest key to the gap.
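The planned compaction can be sketched like this, using a simplified stand-in for ldap_int_thread_userctx_t's ltu_key[] rather than the real clear_key_idx(): shifting every later key one slot forward keeps the keys in the order they were set, so context_reset() can release them strictly in reverse order.

```c
#include <stddef.h>

#define MAXKEYS 32

/* Simplified key slot; ltk_key == NULL marks an unused slot. */
struct key_slot {
    void *ltk_key;
    void *ltk_data;
};

/* Close the gap left by deleting keys[idx] by moving all later keys
 * forward one slot, instead of moving only the latest key into the
 * gap (which would reorder the keys). */
static void clear_key_idx(struct key_slot keys[], int idx)
{
    for (; idx < MAXKEYS - 1 && keys[idx + 1].ltk_key != NULL; idx++)
        keys[idx] = keys[idx + 1];
    keys[idx].ltk_key = keys[idx].ltk_data = NULL;
}
```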
Bdb uses the database environment address as a key to pool_setkey(). So there can be at most ~30 BDB databases with an active key at the same time in the same thread. What does that mean - 30 overlays all using bdb? Access control groups spanning 30 databases?
Nitpick: Slapd uses both function pointers and data pointers as keys to pool_[sg]etkey(). That makes it possible for two "different" keys to compare equal, on a host where functions and data live in different segments that otherwise share the same address space. So I guess I should clean up and replace (void*)<function> keys with the address of static char <function>_key;
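A sketch of that cleanup, with illustrative _key names: each setkey user gets a dedicated static object whose address is the key. Two distinct static objects are guaranteed distinct addresses by the C standard, while a (void *)<function> cast and a data pointer could in principle compare equal on a segmented architecture.

```c
/* Dedicated key objects; only their addresses matter, never their
 * contents.  Names are illustrative, not actual slapd identifiers. */
static char slap_sasl_bind_key;
static char bdb_locker_id_key;

/* Callers would then pass e.g. &slap_sasl_bind_key to pool_setkey()
 * instead of (void *)slap_sasl_bind. */
static const void *sasl_key(void) { return &slap_sasl_bind_key; }
static const void *bdb_key(void)  { return &bdb_locker_id_key; }
```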