Howard Chu writes:
The obvious fix is to adopt the same strategies that tcmalloc uses. (And unfortunately we can't simply rely on tcmalloc always being available, or always being stable in a given environment.)
Good, though I'd like to see these slapd re-implementations of system features (like malloc) #ifdeffed with a fallback to the system feature. Then one can compile with -D<revert to system feature> either when the system feature is as good as or better than slapd's, or to simplify debugging. Configure can make a guess too, e.g. by detecting tcmalloc.
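To illustrate the kind of compile-time switch I mean, here is a minimal sketch. The macro DEMO_USE_SYSTEM_MALLOC and the demo_alloc()/demo_free() names are made up for this example and are not existing slapd symbols; the "custom" branch is only a stand-in for a real caching allocator.

```c
#include <stdlib.h>

#ifdef DEMO_USE_SYSTEM_MALLOC
/* Fallback: map the wrappers straight onto the system allocator. */
#define demo_alloc(size) malloc(size)
#define demo_free(ptr)   free(ptr)
#define DEMO_ALLOC_NAME  "system"
#else
/* Stand-in for the custom implementation; a real one would add caching. */
static void *demo_alloc(size_t size)
{
    return malloc(size);
}

static void demo_free(void *ptr)
{
    free(ptr);
}
#define DEMO_ALLOC_NAME  "custom"
#endif
```

Callers only ever use demo_alloc()/demo_free(), so building with -DDEMO_USE_SYSTEM_MALLOC swaps the implementation without touching them.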
The new entry_free() plus tcmalloc may be better than plain tcmalloc, I don't know. It retains the global mutex though, which presumably is or someday will be a pessimization compared to _some_ malloc out there.
I.e., use per-thread cached free lists. We maintain some small number of free objects per thread; this per-thread free list can be used without locking. When the number of free objects on a given thread exceeds a particular threshold
...or there is no thread key for the mutex (e.g. when the current thread is not from the thread pool)...
It might be convenient to let slapd register init-thread and cleanup-thread functions in the thread pool. These could create/destroy these mutexes, and maybe some other per-thread slapd variables too.
(Preferably the init function would be able to fail and cause the pool thread to die, but that'd mess up the pool logic which assumes once a thread has been created it will be able to handle submitted tasks. Except slapd often doesn't check for malloc/mutex_init success anyway, so demanding success would be no worse than what slapd does now.)
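A rough sketch of what such registration could look like, assuming a fixed-size hook table and an init function that is allowed to fail. All names here (pool_register_hooks, pool_thread_start, pool_thread_end, demo_init, demo_cleanup) are hypothetical and not part of the existing thread pool API, and C11 _Thread_local stands in for whatever per-thread storage the pool would really use.

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_HOOKS 8

typedef int  (*thread_init_fn)(void);      /* returns 0 on success */
typedef void (*thread_cleanup_fn)(void);

static struct {
    thread_init_fn init;
    thread_cleanup_fn cleanup;
} hooks[MAX_HOOKS];
static int nhooks;

/* Register before any pool thread is started; not thread-safe. */
int pool_register_hooks(thread_init_fn init, thread_cleanup_fn cleanup)
{
    if (nhooks == MAX_HOOKS)
        return -1;
    hooks[nhooks].init = init;
    hooks[nhooks].cleanup = cleanup;
    nhooks++;
    return 0;
}

/* Called at the top of each new pool thread.
 * On failure, the hooks that already ran are undone in reverse order. */
int pool_thread_start(void)
{
    for (int i = 0; i < nhooks; i++) {
        if (hooks[i].init != NULL && hooks[i].init() != 0) {
            while (--i >= 0)
                if (hooks[i].cleanup != NULL)
                    hooks[i].cleanup();
            return -1;
        }
    }
    return 0;
}

/* Called just before a pool thread exits. */
void pool_thread_end(void)
{
    for (int i = nhooks - 1; i >= 0; i--)
        if (hooks[i].cleanup != NULL)
            hooks[i].cleanup();
}

/* Example hook pair: one mutex per thread. */
static _Thread_local pthread_mutex_t thread_mutex;
static _Thread_local int thread_mutex_ready;

static int demo_init(void)
{
    if (pthread_mutex_init(&thread_mutex, NULL) != 0)
        return -1;
    thread_mutex_ready = 1;
    return 0;
}

static void demo_cleanup(void)
{
    pthread_mutex_destroy(&thread_mutex);
    thread_mutex_ready = 0;
}
```

Whether a failing pool_thread_start() should kill the thread is exactly the open question above; the sketch only reports the failure to the caller.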
then we obtain the global lock to return some number of objects to the global list.
In practice this threshold can be very small - any given thread typically needs no more than 4 entries at a time. (ModDN is the worst case at 3 entries locked at once. LDAP TXNs would distort this figure but not in any critical fashion.) For attributes the typical usage is much more variable, but any number we pick will be an improvement over the current code.
Add a few more for overlays, in particular syncrepl. Otherwise even a single overlay doing entry_dup() reduces performance.
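For reference, the per-thread caching scheme discussed above could be sketched roughly like this. It is only an illustration under assumptions: the obj type, the obj_get()/obj_put() names and the LOCAL_MAX/LOCAL_SPILL values are invented for the example, and C11 _Thread_local is used in place of slapd's thread keys, so the "no thread key" case does not arise here.

```c
#include <pthread.h>
#include <stdlib.h>

#define LOCAL_MAX   8   /* hypothetical per-thread threshold */
#define LOCAL_SPILL 4   /* objects returned to the global list on overflow */

typedef struct obj {
    struct obj *next;
    char payload[64];
} obj;

static pthread_mutex_t global_mutex = PTHREAD_MUTEX_INITIALIZER;
static obj *global_free;                 /* protected by global_mutex */

static _Thread_local obj *local_free;    /* touched only by the owning thread */
static _Thread_local int local_count;

obj *obj_get(void)
{
    obj *o = local_free;

    if (o != NULL) {                     /* fast path: no locking */
        local_free = o->next;
        local_count--;
        return o;
    }

    /* Local cache is empty: try the global list, then the system allocator. */
    pthread_mutex_lock(&global_mutex);
    o = global_free;
    if (o != NULL)
        global_free = o->next;
    pthread_mutex_unlock(&global_mutex);

    if (o == NULL)
        o = malloc(sizeof *o);
    return o;                            /* NULL if malloc failed */
}

void obj_put(obj *o)
{
    o->next = local_free;
    local_free = o;
    if (++local_count <= LOCAL_MAX)
        return;                          /* still under the threshold */

    /* Over the threshold: detach LOCAL_SPILL objects without the lock,
     * then splice them onto the global list while holding it. */
    obj *head = local_free, *tail = head;
    for (int i = 1; i < LOCAL_SPILL; i++)
        tail = tail->next;
    local_free = tail->next;
    local_count -= LOCAL_SPILL;

    pthread_mutex_lock(&global_mutex);
    tail->next = global_free;
    global_free = head;
    pthread_mutex_unlock(&global_mutex);
}
```

The global mutex is only taken when a thread's cache is empty or overflows, so with a threshold a little above the typical working set most get/put pairs should stay on the lock-free path.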