Hi Martin,
Just thought you'd like to know about a project I've been working on for a
couple of months. My current code started with your append-only B-tree source.
It's just about in usable shape now:
https://gitorious.org/mdb
Also, I'll be presenting details at LDAPCon in Heidelberg this October:
http://www.daasi.de/ldapcon2011/index.php?site=program
I started with your code and removed the page cache. Instead, the entire DB is
accessed through a read-only mmap region. As such, there is no longer any cache
management at the DB level (it's all done by the OS/VM). I also removed the
prefix-compression logic, because it made rebalancing/merging unreliable. The
mmap approach avoids a ton of malloc/memcpy overhead. It also makes overflow
pages quite cheap to manage.
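To make the idea concrete, here's a toy Python sketch of reading a DB file through a read-only mmap (the file name and contents are mine, purely for illustration):

```python
# Toy sketch: map the DB file read-only and read records in place,
# leaving all caching/eviction to the OS/VM. No library-level page
# cache, no malloc/memcpy per lookup. File name is illustrative.
import mmap, os

path = "demo.db"
with open(path, "wb") as f:
    f.write(b"hello, mmap")            # stand-in for real DB pages

with open(path, "rb") as f:
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    data = bytes(m[:])                 # read straight from the mapping
    m.close()

os.remove(path)
assert data == b"hello, mmap"
```

In the real library the mapping is long-lived and pages are dereferenced in place, but the shape of the idea is the same: the VM layer is the cache.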
Instead of writing a new meta-page at the tail of the file, I ping-pong
between two meta pages at the head of the file. (Double-buffering.) This
provides most of the MVCC benefits of the append-only approach, but without
the wasted space or the need to search for the most recent meta page.
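A toy model of the meta-page ping-pong (field names are mine, not the actual on-disk layout):

```python
# Double-buffered meta pages: two fixed slots at the head of the file.
# A commit overwrites the older slot; readers pick whichever slot holds
# the newer transaction ID. No tail-scan needed to find the latest meta.
metas = [{"txnid": 0, "root": None}, {"txnid": 0, "root": None}]

def commit(txnid, root):
    # Ping-pong: overwrite the slot for this txnid. The other slot still
    # holds the previous committed state, so a crash mid-write always
    # leaves one valid meta page intact.
    metas[txnid % 2] = {"txnid": txnid, "root": root}

def current_meta():
    # Just compare two fixed slots -- no searching through the file.
    return max(metas, key=lambda m: m["txnid"])

commit(1, "rootpage-A")
commit(2, "rootpage-B")
assert current_meta()["root"] == "rootpage-B"
commit(3, "rootpage-C")   # overwrites slot 1; slot 0 still holds txn 2
assert metas[0]["txnid"] == 2 and current_meta()["txnid"] == 3
```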
I also added tracking of outstanding read transactions, and tracking of free
pages. Reader tracking is done without locks; readers are never blocked when
accessing the DB (unless the OS itself is busy servicing page faults).
This way the library can quickly check when a copied page is no longer
referenced, and re-use those pages, so the DB no longer grows without bound.
This completely
removes the need for the compaction logic. Since active data is never
overwritten, the DB can never be corrupted, so no write-ahead logging is
needed, nor any recovery procedures.
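Here's a toy model of the reader-tracking idea (the slot structure and the exact reuse condition are simplified guesses on my part, just to show the shape):

```python
# Each reader publishes the txn ID it has open in its own slot -- one
# writer per slot, so no locks are needed. A page freed in txn N can be
# reused once no live reader is looking at a snapshot older than N.
reader_slots = {}          # reader id -> txnid it has open (toy dict)
free_pages = []            # list of (freed_in_txn, page_number)

def begin_read(reader_id, txnid):
    reader_slots[reader_id] = txnid

def end_read(reader_id):
    del reader_slots[reader_id]

def reusable_pages(current_txn):
    # Safe to reuse only pages freed at or before the oldest live reader.
    oldest = min(reader_slots.values(), default=current_txn)
    return [pg for txn, pg in free_pages if txn <= oldest]

free_pages = [(5, 101), (8, 102)]
begin_read("r1", 7)                               # reader still at txn 7
assert reusable_pages(current_txn=10) == [101]    # page from txn 8 pinned
end_read("r1")
assert reusable_pages(current_txn=10) == [101, 102]
```

The point is that a stalled reader only pins pages (delaying reuse); it never blocks anyone.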
I also added several ideas from BerkeleyDB, so that I can drop it into
OpenLDAP more easily. The DB is now a "DB environment" with support for
multiple databases within an environment. This was necessary because I didn't
want to have to manage multiple separate mmap's for multiple little index
databases and other misc. usages. Also the free list is itself a sub-DB in the
environment. I also added support for sorted-duplicate data items for a given
key, which OpenLDAP's back-hdb relies on.
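To sketch how that hangs together (none of these names are the real mdb API; the free-list sub-DB name and the classes are made up for illustration):

```python
# Toy model of a "DB environment": one file/mmap holding multiple named
# sub-databases, with the free list as just another sub-DB. Sorted
# duplicates keep multiple values per key in order, as back-hdb's index
# databases expect. All names here are illustrative.
import bisect

class Env:
    def __init__(self):
        self.subdbs = {"FREE_DBI": {}}   # free list lives in the env too

    def open_db(self, name):
        return self.subdbs.setdefault(name, {})

def put_dup(db, key, value):
    # Sorted-duplicate support: each key maps to an ordered value list.
    bisect.insort(db.setdefault(key, []), value)

env = Env()
idx = env.open_db("objectClass-index")
put_dup(idx, "person", 42)
put_dup(idx, "person", 7)
assert idx["person"] == [7, 42]
assert "FREE_DBI" in env.subdbs          # free list is a sub-DB of the env
```

One mapping serves all the little index databases, instead of one mmap per file.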
I'm just now getting started adapting our back-hdb code to this mdb library.
It looks like the new backend will be vastly simplified, both in real code and
in configuration, so it will be much friendlier to sysadmins, while at the
same time giving superior performance to BerkeleyDB and excellent reliability.
Of course the code is still pretty raw, and I haven't done any heavy load
testing on it yet, so it remains to be seen how much of the promise is realized.
I was originally targeting a design where the mmap resides at a fixed memory
address. That way slapd can store its entries as-is, instead of flattening
them into a storable structure. There's a hook for a relocation function,
which would be used to relocate an entry if it gets shifted around during
adds/deletes/rebalances. I haven't implemented this yet because I'm not sure
it will actually work well in real use. For slapd it might be OK if all
entries wind up in overflow pages, since those pages aren't touched by tree
balancing activity. But if average entry sizes are small, it would become a
serious hassle.
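To show what the relocation hook would have to do, here's a hypothetical sketch -- this is not the actual hook interface, just the problem in miniature:

```python
# If entries are stored as-is with absolute pointers, those pointers are
# only valid while the entry stays put in a fixed-address map. A
# relocation hook would patch them whenever a rebalance moves the entry.
# Everything here is hypothetical.

BASE = 0x100000          # pretend fixed mmap address

def make_entry(offset, name):
    # Entry stores an absolute "pointer" to its own name field.
    return {"name_ptr": BASE + offset, "name": name, "offset": offset}

def relocate(entry, new_offset):
    # Hook: shift internal pointers by the distance the entry moved.
    delta = new_offset - entry["offset"]
    entry["name_ptr"] += delta
    entry["offset"] = new_offset

e = make_entry(0x40, "cn=example")
relocate(e, 0x80)        # entry shifted by a page rebalance
assert e["name_ptr"] == BASE + 0x80
```

With many small entries, every split/merge means a pass of these fix-ups, which is the hassle I mean.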
I'd be interested to hear your comments on this.
--
-- Howard Chu
CTO, Symas Corp.
http://www.symas.com
Director, Highland Sun
http://highlandsun.com/hyc/
Chief Architect, OpenLDAP
http://www.openldap.org/project/