There are two main problems addressed by the mdb approach: eliminating multiple layers of cache copying, and eliminating multiple layers of locking.
Currently we have per-entry mutexes and also BDB locks for every cached item. The mutex overhead is ridiculous: 40-some bytes per mutex x 10 million entries is a lot of wasted storage, especially since a single slapd thread can only actually hold 2-3 locks at a time - i.e., we only need on the order of 100 locks, ever. The sane thing to do here is continue to use the BDB approach to locking, but with fewer options.
By default BDB allocates a new locker for every database handle and every cursor that's created. This actually causes us a lot of trouble, since it allows two cursors created by the same thread to deadlock each other. (And it's why we had to go to the trouble of wrapping read operations in a TXN, even though we don't need any of the ACID properties for reads.) For mdb we will want an API for which (essentially) threadID == lockerID, and cursors and such will not have their own lockerIDs.
Actual locks will be managed in separate tables. The most common case will be locks indexed by entryID. For this case we can easily structure the table to minimize lock and cache contention. E.g., for 16 threads, use a 16x16 array of structures, where each element contains a vector of 16 (mutex,locker,entryID) tuples. The element to lock for a given entryID is ((entryID & 0xff0) >> 4). In the worst case of all 16 threads going after the same entry, they will all be waiting on a single mutex, which is no worse than what we have now. If all 16 threads hash to the same element but want different entries, they simply use the other tuples in that element's lock vector.
But in the typical case, all threads will be going after different mutexes residing in different processor cachelines. And, a thread that is iterating sequentially through the database will be re-using the same mutex/cacheline for 16 entries at a time.
(16 was chosen just for the purpose of this example. The actual numbers should be derived from the maximum number of threads allowed to access the database.)
Howard Chu wrote:
Currently we have per-entry mutexes and also BDB locks for every cached item. The mutex overhead is ridiculous, 40-some bytes per mutex x 10 million entries is a lot of wasted storage. Especially since a single slapd can only actually lock 2-3 items per thread - i.e., we only need on the order of 100 locks, ever. The sane thing to do here is continue to use the BDB approach to locking, but with fewer options.
Note that we can use this locking approach in back-bdb too, but aside from the memory savings I don't see it giving any performance gain. The real gain comes from completely eliminating one layer of locks, not just from streamlining one layer while keeping both.