--On Saturday, March 05, 2011 5:05 AM -0800 Howard Chu hyc@symas.com wrote:
Thought this was an interesting read:
http://www.varnish-cache.org/trac/wiki/ArchitectNotes
Too bad he talks about his approach being "2006 era" programming. In fact the single-level store is 1964-era, from Multics.
http://en.wikipedia.org/wiki/Single_level_store
I guess they'll have to tweak Henry Spencer's quote ("Those who do not understand UNIX are condemned to reinvent it, poorly.") to Multics instead...
I've been working on a new "in-memory" B-tree library that operates on an mmap'd file. It is a copy-on-write design; it supports MVCC and is immune to corruption and requires no recovery procedure. It is not an append-only design, since that requires explicit compaction, and also is not amenable to mmap usage. Also the append-only approach requires total serialization of write operations, which would be quite poor for throughput.
The current approach simply reserves space for two root node pointers and flip flops between them. So, multiple writes may be outstanding at once, but commits are of course serialized; each commit causes the currently unused root node pointer to become the currently valid root node pointer. Transaction aborts are pretty much free; there's nothing to rollback. Read transactions begin by snapshotting the current root pointer and then can run without any interference from any other operations.
Public commits have been waiting for our official transition to git, but since that's been going nowhere I will probably start publishing on github.com in the next couple of weeks. (With St. Patrick's Day right around the corner it may have to wait a bit.)
Unfortunately I realized that not all application-level caching can be eliminated - with the hierarchical DB approach, we don't store full entry DNs in the DB so they still need to be generated in main memory, and they probably should be cached. But that's a detail to be addressed later; it may well be that the cost of always constructing them on the fly (no caching) is acceptable.
This backend should perform much better in all aspects (memory, CPU, and I/O usage) than the current BerkeleyDB code. It eliminates two levels of caching, entries pulled from the DB require zero decoding, readers require no locks, writes require no write-ahead-logging overhead. There are only two configurable parameters (the pathname to the DB file, and the size) so this will be far simpler for admins.
Let me know when you think it is in a state where I can start doing some perf testing with it. ;)
--Quanah
--
Quanah Gibson-Mount Sr. Member of Technical Staff Zimbra, Inc A Division of VMware, Inc. -------------------- Zimbra :: the leader in open source messaging and collaboration