MDB slapadd potential improvements - openldap-devel

14 Apr 2012


      I have a "thread" branch on ssh://ada.openldap.org/~hyc/OD/head/.git with some 
preliminary patches to improve multi-threaded indexing in back-mdb slapadd. So 
far the scale of improvement is small, but there may be ways to enhance it 
further from here.
Currently back-mdb's multi-threaded indexing code is largely copied from 
back-bdb/hdb, and it is always slower than regular single-threaded slapadd. 
The slowdown comes from a number of reasons.
   1) Thread synchronization overhead is huge. Much greater than the actual 
cost of index processing.
   2) Index processing is awkward since MDB doesn't support multi-threaded 
writes (while BDB does).
1) is because we're using cond_wait/cond_broadcast to perform synchronization, 
and that requires every waiting thread to successfully acquire the condition 
mutex before progressing any further. So it is inherently serial, even using 
cond_broadcast. Also, we're doing this twice, once to start all of the index 
processing for an entry, and once at the end of index processing. The latter 
one is required so that we will know it's safe to dispose of the current entry 
and move on to the next.
The new code uses a single pthread_barrier for synchronization, and it is only 
used at the start of index processing. Index cleanup is deferred, and all of 
the relevant data is double-buffered. This allows us to only need to worry 
about the start of processing; the processing can take as long as it needs. 
This restructure requires a small change in slapadd as well, to maintain two 
Entry pointers instead of just one, and free them in a staggered fashion. The 
back-bdb/hdb threaded indexer can also be restructured along these lines for a 
similar benefit.
2) Index results were being gathered in several malloc'd structures and then 
individually freed after being written to the DB. Now I'm using sl_malloc, and 
simply resetting the memctx between entries, thus eliminating all cost of 
free() operations.
These changes are sufficient to bring multithreaded performance down to just 
on par with single-threaded slapadd. In fact there's very little to gain here 
when all of the DB writes are still single-threaded.
One further tweak: most of our index keys are 4-byte hash values. Using the 
MDB_INTEGERKEY flag allows them to be compared word-at-a-time instead of 
byte-at-a-time, which gives a further speedup. At present this is the most 
significant improvement.
With the original back-mdb code currently in git master, slapadd of our test 
LDIF (4.9GB, 6326513 entries, 31 indexed attributes) was taking 1h50m on ada.
(Using -q. Without -q, 2h46m.)
With no indexing, it takes only 9m11s. Even though sizewise indices don't 
account for much, they cost a lot in numbers of keys. 6M entries means 6M keys 
in the id2entry DB and 12M keys in the dn2id DB. Indexing on 31 attributes 
with multiple values, substrings, and other such parameters thrown in, amounts 
to hundreds of millions of keys, and each of these needs multiple comparisons 
to be inserted into their proper index.
I tried a further hack on MDB_APPEND to eliminate some more of this overhead. 
Currently, MDB_APPEND only affects the behavior of mdb_page_split, causing the 
newly added key to simply be added as the first node of a new page rather than 
splitting the existing page in half and copying half of the keys to the new 
page. (This in itself is a pretty major speedup for slapadd, but obviously 
already factored in.) The new code also simply appends new keys to the end of 
the DB's last page (rather than searching for its insertion point), which 
again is useful for eliminating unnecessary comparisons. This is only 
effective when adding an entryID to an already existing index key. This 
brought the slapadd -q time down to 1h46m.
Turning on the MDB_INTEGERKEY flag brought the slapadd -q time down to 1h32m.
Adding the threading rewrite to that, the time came down to 1h29m with 
tool-threads set to 4. Using more threads actually slows things down, again 
because thread synchronization becomes too expensive.
Only the index key generation is being done in the extra threads, and the 
reality is that index key generation is not a very significant cost in the 
overall workload.
But anyway, some of the work done here can be ported back into back-bdb/hdb to 
improve things there. In particular, the use of two synchronization events per 
entry can be reduced to one by adding double-buffering there too. And some 
malloc/free overhead in the index threads can be removed by using sl_malloc in 
each thread. Of course we first need to add pthread_barrier detection to the 
configure script, as well as wrapper functions in libldap_r.
(Anyone interested in handling this? And writing the compatibility code for 
Windows?)
In the meantime, I'm considering an MDB environment option to support multiple 
threads in a single write TXN, as long as each thread is operating on separate 
databases. This would hopefully allow us to further distribute the load of 
indexing without adding too much new complexity to libmdb.
If you have an account on ada.openldap.org I encourage you to checkout this 
code and think about what's suitable to merge back into master.
-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/