back-mdb notes - openldap-devel

5 Mar 2011


      Thought this was an interesting read:
http://www.varnish-cache.org/trac/wiki/ArchitectNotes
Too bad he talks about his approach being "2006 era" programming. In fact the 
single-level store is 1964-era, from Multics.
http://en.wikipedia.org/wiki/Single_level_store
I guess they'll have to tweak Henry Spencer's quote ("Those who do not 
understand UNIX are condemned to reinvent it, poorly.") to Multics instead...
I've been working on a new "in-memory" B-tree library that operates on an 
mmap'd file. It is a copy-on-write design; it supports MVCC and is immune to 
corruption and requires no recovery procedure. It is not an append-only 
design, since that requires explicit compaction, and also is not amenable to 
mmap usage. Also the append-only approach requires total serialization of 
write operations, which would be quite poor for throughput.
The current approach simply reserves space for two root node pointers and flip 
flops between them. So, multiple writes may be outstanding at once, but 
commits are of course serialized; each commit causes the currently unused root 
node pointer to become the currently valid root node pointer. Transaction 
aborts are pretty much free; there's nothing to rollback. Read transactions 
begin by snapshotting the current root pointer and then can run without any 
interference from any other operations.
Public commits have been waiting for our official transition to git, but since 
that's been going nowhere I will probably start publishing on github.com in 
the next couple of weeks. (With St. Patrick's Day right around the corner it 
may have to wait a bit.)
Unfortunately I realized that not all application-level caching can be 
eliminated - with the hierarchical DB approach, we don't store full entry DNs 
in the DB so they still need to be generated in main memory, and they probably 
should be cached. But that's a detail to be addressed later; it may well be 
that the cost of always constructing them on the fly (no caching) is acceptable.
This backend should perform much better in all aspects (memory, CPU, and I/O 
usage) than the current BerkeleyDB code. It eliminates two levels of caching, 
entries pulled from the DB require zero decoding, readers require no locks, 
writes require no write-ahead-logging overhead. There are only two 
configurable parameters (the pathname to the DB file, and the size) so this 
will be far simpler for admins.
Potential downside - on a 32 bit machine with only 2GB of addressable memory 
the maximum usable DB size is around 1.6GB. On a 64 bit machine, I doubt the 
limits will pose any problem. ("64 bits should be enough for anyone...")
re: configuring the size of the DB file - this is most likely not a value that 
can be changed on an existing DB. I.e., if you configure a DB and find that 
you need to grow it later, you will probably need to slapcat/slapadd it again. 
At DB creation time the file is mmap'd with address NULL so that the OS picks 
the address, and the address is recorded in the DB. On subsequent opens the 
file is mmap'd at the recorded address. If the size is changed, and the 
process' address space is already full of other mappings, it may not be 
possible to simply grow the mapping at its current address. Since the DB 
records contain actual memory pointers based on the region address, any change 
in the mapping address would render the DB unusable.
If this restriction turns out to be too impractical, we may have to resort to 
just storing array offsets, but that will then imply a decoding phase and the 
re-introduction of entry caching, which I really really want to avoid.
-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/