I have a "thread" branch on ssh://ada.openldap.org/~hyc/OD/head/.git with some preliminary patches to improve multi-threaded indexing in back-mdb slapadd. So far the scale of improvement is small, but there may be ways to enhance it further from here.
Currently back-mdb's multi-threaded indexing code is largely copied from back-bdb/hdb, and it is always slower than regular single-threaded slapadd. The slowdown comes from a number of reasons. 1) Thread synchronization overhead is huge. Much greater than the actual cost of index processing. 2) Index processing is awkward since MDB doesn't support multi-threaded writes (while BDB does).
1) is because we're using cond_wait/cond_broadcast to perform synchronization, and that requires every waiting thread to successfully acquire the condition mutex before progressing any further. So it is inherently serial, even using cond_broadcast. Also, we're doing this twice, once to start all of the index processing for an entry, and once at the end of index processing. The latter one is required so that we will know it's safe to dispose of the current entry and move on to the next.
The new code uses a single pthread_barrier for synchronization, and it is only used at the start of index processing. Index cleanup is deferred, and all of the relevant data is double-buffered. This allows us to only need to worry about the start of processing; the processing can take as long as it needs. This restructure requires a small change in slapadd as well, to maintain two Entry pointers instead of just one, and free them in a staggered fashion. The back-bdb/hdb threaded indexer can also be restructured along these lines for a similar benefit.
2) Index results were being gathered in several malloc'd structures and then individually freed after being written to the DB. Now I'm using sl_malloc, and simply resetting the memctx between entries, thus eliminating all cost of free() operations.
These changes are sufficient to bring multithreaded performance down to just on par with single-threaded slapadd. In fact there's very little to gain here when all of the DB writes are still single-threaded.
One further tweak: most of our index keys are 4-byte hash values. Using the MDB_INTEGERKEY flag allows them to be compared word-at-a-time instead of byte-at-a-time, which gives a further speedup. At present this is the most significant improvement.
With the original back-mdb code currently in git master, slapadd of our test LDIF (4.9GB, 6326513 entries, 31 indexed attributes) was taking 1h50m on ada. (Using -q. Without -q, 2h46m.)
With no indexing, it takes only 9m11s. Even though sizewise indices don't account for much, they cost a lot in numbers of keys. 6M entries means 6M keys in the id2entry DB and 12M keys in the dn2id DB. Indexing on 31 attributes with multiple values, substrings, and other such parameters thrown in, amounts to hundreds of millions of keys, and each of these needs multiple comparisons to be inserted into their proper index.
I tried a further hack on MDB_APPEND to eliminate some more of this overhead. Currently, MDB_APPEND only affects the behavior of mdb_page_split, causing the newly added key to simply be added as the first node of a new page rather than splitting the existing page in half and copying half of the keys to the new page. (This in itself is a pretty major speedup for slapadd, but obviously already factored in.) The new code also simply appends new keys to the end of the DB's last page (rather than searching for its insertion point), which again is useful for eliminating unnecessary comparisons. This is only effective when adding an entryID to an already existing index key. This brought the slapadd -q time down to 1h46m.
Turning on the MDB_INTEGERKEY flag brought the slapadd -q time down to 1h32m.
Adding the threading rewrite to that, the time came down to 1h29m with tool-threads set to 4. Using more threads actually slows things down, again because thread synchronization becomes too expensive.
Only the index key generation is being done in the extra threads, and the reality is that index key generation is not a very significant cost in the overall workload.
But anyway, some of the work done here can be ported back into back-bdb/hdb to improve things there. In particular, the use of two synchronization events per entry can be reduced to one by adding double-buffering there too. And some malloc/free overhead in the index threads can be removed by using sl_malloc in each thread. Of course we first need to add pthread_barrier detection to the configure script, as well as wrapper functions in libldap_r. (Anyone interested in handling this? And writing the compatibility code for Windows?)
In the meantime, I'm considering an MDB environment option to support multiple threads in a single write TXN, as long as each thread is operating on separate databases. This would hopefully allow us to further distribute the load of indexing without adding too much new complexity to libmdb.
If you have an account on ada.openldap.org I encourage you to checkout this code and think about what's suitable to merge back into master.