Hi Martin,
Just thought you'd like to know about a project I've been working on for a
couple of months. My current code started with your append-only B-tree source.
It's just about in usable shape now:
https://gitorious.org/mdb
Also, I'll be presenting details at LDAPCon in Heidelberg this October:
http://www.daasi.de/ldapcon2011/index.php?site=program
I started with your code and removed the page cache. Instead, the entire DB is
accessed through a read-only mmap region. As such, there is no longer any cache
management at the DB level (it's all done by the OS/VM). I also removed the
prefix-compression logic, because it made rebalancing/merging unreliable. The
mmap approach avoids a ton of malloc/memcpy overhead. It also makes overflow
pages quite cheap to manage.
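To make the idea concrete, here's a toy Python sketch of reading a DB file through a read-only mmap (the file name and contents are mine, purely for illustration):

```python
# Toy sketch: map the DB file read-only and read records in place,
# leaving all caching/eviction to the OS/VM. No library-level page
# cache, no malloc/memcpy per lookup. File name is illustrative.
import mmap, os

path = "demo.db"
with open(path, "wb") as f:
    f.write(b"hello, mmap")            # stand-in for real DB pages

with open(path, "rb") as f:
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    data = bytes(m[:])                 # read straight from the mapping
    m.close()

os.remove(path)
assert data == b"hello, mmap"
```

In the real library the mapping is long-lived and pages are dereferenced in place, but the shape of the idea is the same: the VM layer is the cache.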
Instead of writing a new meta-page at the tail of the file, I ping-pong
between two meta pages at the head of the file. (Double-buffering.) This
provides most of the MVCC benefits of the append-only approach, but without
the wasted space or the need to search for the most recent meta page.
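A toy model of the meta-page ping-pong (field names are mine, not the actual on-disk layout):

```python
# Double-buffered meta pages: two fixed slots at the head of the file.
# A commit overwrites the older slot; readers pick whichever slot holds
# the newer transaction ID. No tail-scan needed to find the latest meta.
metas = [{"txnid": 0, "root": None}, {"txnid": 0, "root": None}]

def commit(txnid, root):
    # Ping-pong: overwrite the slot for this txnid. The other slot still
    # holds the previous committed state, so a crash mid-write always
    # leaves one valid meta page intact.
    metas[txnid % 2] = {"txnid": txnid, "root": root}

def current_meta():
    # Just compare two fixed slots -- no searching through the file.
    return max(metas, key=lambda m: m["txnid"])

commit(1, "rootpage-A")
commit(2, "rootpage-B")
assert current_meta()["root"] == "rootpage-B"
commit(3, "rootpage-C")   # overwrites slot 1; slot 0 still holds txn 2
assert metas[0]["txnid"] == 2 and current_meta()["txnid"] == 3
```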
I also added tracking of outstanding read transactions, and tracking of free
pages. Reader tracking is done without locks; readers are never blocked when
accessing the DB (unless the OS itself is busy servicing page faults).
This way the library can quickly check when a copied page is no longer
referenced, and re-use those pages, so the DB no longer grows without bound.
This completely
removes the need for the compaction logic. Since active data is never
overwritten, the DB can never be corrupted, so no write-ahead logging is
needed, nor any recovery procedures.
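Here's a toy model of the reader-tracking idea (the slot structure and the exact reuse condition are simplified guesses on my part, just to show the shape):

```python
# Each reader publishes the txn ID it has open in its own slot -- one
# writer per slot, so no locks are needed. A page freed in txn N can be
# reused once no live reader is looking at a snapshot older than N.
reader_slots = {}          # reader id -> txnid it has open (toy dict)
free_pages = []            # list of (freed_in_txn, page_number)

def begin_read(reader_id, txnid):
    reader_slots[reader_id] = txnid

def end_read(reader_id):
    del reader_slots[reader_id]

def reusable_pages(current_txn):
    # Safe to reuse only pages freed at or before the oldest live reader.
    oldest = min(reader_slots.values(), default=current_txn)
    return [pg for txn, pg in free_pages if txn <= oldest]

free_pages = [(5, 101), (8, 102)]
begin_read("r1", 7)                               # reader still at txn 7
assert reusable_pages(current_txn=10) == [101]    # page from txn 8 pinned
end_read("r1")
assert reusable_pages(current_txn=10) == [101, 102]
```

The point is that a stalled reader only pins pages (delaying reuse); it never blocks anyone.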
I also added several ideas from BerkeleyDB, so that I can drop it into
OpenLDAP more easily. The DB is now a "DB environment" with support for
multiple databases within an environment. This was necessary because I didn't
want to have to manage multiple separate mmap's for multiple little index
databases and other misc. usages. Also the free list is itself a sub-DB in the
environment. I also added support for sorted-duplicate data items for a given
key, which OpenLDAP's back-hdb relies on.
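To sketch how that hangs together (none of these names are the real mdb API; the free-list sub-DB name and the classes are made up for illustration):

```python
# Toy model of a "DB environment": one file/mmap holding multiple named
# sub-databases, with the free list as just another sub-DB. Sorted
# duplicates keep multiple values per key in order, as back-hdb's index
# databases expect. All names here are illustrative.
import bisect

class Env:
    def __init__(self):
        self.subdbs = {"FREE_DBI": {}}   # free list lives in the env too

    def open_db(self, name):
        return self.subdbs.setdefault(name, {})

def put_dup(db, key, value):
    # Sorted-duplicate support: each key maps to an ordered value list.
    bisect.insort(db.setdefault(key, []), value)

env = Env()
idx = env.open_db("objectClass-index")
put_dup(idx, "person", 42)
put_dup(idx, "person", 7)
assert idx["person"] == [7, 42]
assert "FREE_DBI" in env.subdbs          # free list is a sub-DB of the env
```

One mapping serves all the little index databases, instead of one mmap per file.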
I'm just now getting started adapting our back-hdb code to this mdb library.
It looks like the new backend will be vastly simplified, both in real code and
in configuration, so it will be much friendlier to sysadmins, while at the
same time giving superior performance to BerkeleyDB and excellent reliability.
Of course the code is still pretty raw, and I haven't done any heavy load
testing on it yet, so it remains to be seen how much of the promise is realized.
I was originally targeting a design where the mmap resides at a fixed memory
address. That way slapd can store its entries as-is, instead of flattening
them into a storable structure. There's a hook for a relocation function,
which would be used to relocate an entry if it gets shifted around during
adds/deletes/rebalances. I haven't implemented this yet because I'm not sure
it will actually work well in real use. For slapd it might be OK if all
entries wind up in overflow pages, since those pages aren't touched by tree
balancing activity. But if average entry sizes are small, it would become a
serious hassle.
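To show what the relocation hook would have to do, here's a hypothetical sketch -- this is not the actual hook interface, just the problem in miniature:

```python
# If entries are stored as-is with absolute pointers, those pointers are
# only valid while the entry stays put in a fixed-address map. A
# relocation hook would patch them whenever a rebalance moves the entry.
# Everything here is hypothetical.

BASE = 0x100000          # pretend fixed mmap address

def make_entry(offset, name):
    # Entry stores an absolute "pointer" to its own name field.
    return {"name_ptr": BASE + offset, "name": name, "offset": offset}

def relocate(entry, new_offset):
    # Hook: shift internal pointers by the distance the entry moved.
    delta = new_offset - entry["offset"]
    entry["name_ptr"] += delta
    entry["offset"] = new_offset

e = make_entry(0x40, "cn=example")
relocate(e, 0x80)        # entry shifted by a page rebalance
assert e["name_ptr"] == BASE + 0x80
```

With many small entries, every split/merge means a pass of these fix-ups, which is the hassle I mean.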
I'd be interested to hear your comments on this.
--
-- Howard Chu
CTO, Symas Corp.
http://www.symas.com
Director, Highland Sun
http://highlandsun.com/hyc/
Chief Architect, OpenLDAP
http://www.openldap.org/project/