Re: back-mdb - futures...

19 May 2009


Francis Swasey wrote:
...
What protections would be needed to deal with the (hopefully infrequent) system crash if we are
depending on the OS filesystem cache to get things from memory to disk?  What would be the
recovery mechanism in the worst-case system crash?
Same as always. Transactions will use write-ahead logging. Recovery will be 
automatic.
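
To make "write-ahead logging" and "automatic recovery" concrete, here is a minimal redo-only sketch in C. It is not the design of any existing or planned backend: the record layout and the names wal_rec, wal_append, data_apply and wal_recover are all hypothetical. A real log would also need checksums, transaction boundaries and log truncation.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical log record: "store value at byte offset off in the data file". */
typedef struct wal_rec {
    uint64_t off;
    uint64_t value;
} wal_rec;

/* Step 1: append the intended change to the log and force it to disk
 * BEFORE the data file is touched. log_fd must be positioned at the
 * end of the log. */
int wal_append(int log_fd, const wal_rec *rec)
{
    assert(rec != NULL);
    if (write(log_fd, rec, sizeof *rec) != (ssize_t)sizeof *rec)
        return -1;
    return fsync(log_fd);
}

/* Step 2: apply the change to the data file. This write may still be
 * sitting in the OS cache when the system crashes. */
int data_apply(int data_fd, const wal_rec *rec)
{
    if (pwrite(data_fd, &rec->value, sizeof rec->value, (off_t)rec->off)
            != (ssize_t)sizeof rec->value)
        return -1;
    return 0;
}

/* Recovery: replay every complete record from the start of the log.
 * Replaying a record twice is harmless because it stores an absolute
 * value. A short (torn) record at the tail is ignored.
 * Returns the number of records replayed, or -1 on error. */
int wal_recover(int log_fd, int data_fd)
{
    wal_rec rec;
    off_t pos = 0;
    ssize_t got;
    int n = 0;

    while ((got = pread(log_fd, &rec, sizeof rec, pos)) == (ssize_t)sizeof rec) {
        if (data_apply(data_fd, &rec) != 0)
            return -1;
        pos += (off_t)sizeof rec;
        n++;
    }
    if (got < 0 || fsync(data_fd) != 0)
        return -1;
    return n;
}
```

Running wal_recover() at every startup is what makes recovery automatic in this sketch: after a clean shutdown the replay simply rewrites values that are already in place.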
...
On 5/17/09 12:27 AM, Howard Chu wrote:
...
Just some thoughts on what I'd like to see in a new memory-based backend...
One of the complaints about back-bdb/hdb is the complexity in the
tuning; there are a number of different components that need to be
balanced against each other and the proper balance point varies
depending on data size and workload. One of the directions we were
investigating a couple of years back was mechanisms for self-tuning of the
caches. (This was essentially the thrust of Jong-Hyuk Choi's work with
zoned allocs for the back-bdb entry cache; it would allow large chunks
of the entry cache to be discarded on demand when system memory pressure
increased.) Unfortunately Jong hasn't been active on the project in a
while and it doesn't appear that anyone else was tracking that work.
Self-tuning is still a goal but it seems to me to be attacking the wrong
problem.
One of the things that annoys me with the current BerkeleyDB-based
design is that we have 3 levels of cache operating at all times -
filesystem, BDB, and slapd. This means at least 2 memory copy operations
to get any piece of data from disk into working memory, and you have to
play games with the OS to minimize the waste in the FS cache. (E.g. on
Linux, tweak the swappiness setting.)
Back in the 80s I spent a lot of time working on the Apollo DOMAIN OS,
which was based on the M68K platform. One of their (many) claims to fame
was the notion of a single-level store: the processor architecture
supported a full 32-bit address space but it was uncommon for systems to
have more than 24 bits worth of that populated, and nobody had anywhere
near 1GB of disk space on their entire network. As such, every byte of
available disk space could be directly mapped to a virtual memory
address, and all disk I/O was done through mmaps and demand paging. As a
result, memory management was completely unified and memory usage was
extremely efficient.
These days you could still take that sort of approach, though on a
32-bit machine a DB limit of 1-2GB may not be so useful any more. However,
with the ubiquity of 64-bit machines, the idea becomes quite attractive
again.
The basic idea is to construct a database that is always mmap'd to a
fixed virtual address, and which returns its mmap'd data pages directly
to the caller (instead of copying them to a newly allocated buffer).
Given a fixed address, it becomes feasible to make the on-disk record
format identical to the in-memory format. Today we have to convert from
a BER-like encoding into our in-memory format, and while that conversion
is fast it still takes up a measurable amount of time. (Which is one
reason our slapd entry cache is still so much faster than just using
BDB's cache.) So instead of storing offsets into a flattened data
record, we store actual pointers (since they all simply reside in the
mmap'd space).
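
As a rough illustration of the fixed-address idea, the C sketch below maps a file at a constant virtual address and stores a record containing a real pointer into that mapping. Everything named here (DB_BASE, DB_SIZE, db_entry, db_map, db_put, db_get) is hypothetical, and the base address is an assumption that this range is free in a 64-bit process. The address is passed to mmap() as a hint rather than with MAP_FIXED, so an existing mapping is never overwritten; the sketch simply gives up if the kernel places the mapping elsewhere.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical fixed base and size; assumes a 64-bit address space
 * in which this range is unused. */
#define DB_BASE ((void *)(uintptr_t)0x100000000000ULL)
#define DB_SIZE ((size_t)1 << 20)

/* Hypothetical record whose on-disk and in-memory forms are identical:
 * it holds a real pointer, not an offset. */
typedef struct db_entry {
    const char *name;      /* points into the mapping itself */
    char        data[56];
} db_entry;

/* Map the file at DB_BASE, or return NULL if that is not possible. */
void *db_map(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    void *p;

    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t)DB_SIZE) != 0) {
        close(fd);
        return NULL;
    }
    p = mmap(DB_BASE, DB_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                     /* the mapping outlives the descriptor */
    if (p == MAP_FAILED)
        return NULL;
    if (p != DB_BASE) {            /* stored pointers would be wrong here */
        munmap(p, DB_SIZE);
        return NULL;
    }
    return p;
}

/* Store a record at the start of the map; its pointer field refers
 * to its own data inside the mapping. */
db_entry *db_put(void *base, const char *name)
{
    db_entry *e = base;

    assert(base == DB_BASE);
    strncpy(e->data, name, sizeof e->data - 1);
    e->data[sizeof e->data - 1] = '\0';
    e->name = e->data;
    return e;
}

/* Return the record in place: no allocation, no copy, no decoding. */
const db_entry *db_get(void *base)
{
    return base;
}
```

After an munmap() and a fresh db_map() of the same file, the pointer stored in the record is still valid, because the page it refers to comes back at the same address. That property is what lets the stored form double as the working form.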
Using this directly mmap'd approach immediately eliminates the 3 layers
of caching and brings it down to 1. As another benefit, the DB would
require *zero* cache configuration/tuning - it would be entirely under
the control of the OS memory manager, and its resident set size would
grow or shrink dynamically without any outside intervention.
It's not clear to me that we can modify BDB to operate in this manner.
It currently supports mmap access for read-only DBs, but it doesn't map
to fixed addresses and still does alloc/copy before returning data to
the caller.
Also, while BDB development continues, the new development is mainly
occurring in areas that don't matter to us (e.g. BDB replication) and
the areas we care about (B-tree performance) haven't really changed much
in quite a while. I've mentioned B-link trees a few times before on this
list; they have much lower lock contention than plain B-trees and thus
can support even greater concurrency. I've also mentioned them to the
BDB team a few times and as yet they have no plans to implement them.
(Here's a good reference:
http://www.springerlink.com/content/eurxct8ewt0h3rxm/ )
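
For readers who have not met B-link trees, the following C sketch shows the two additions to a plain B-tree node, a high key and a right-sibling link, and the "move right" step a search performs when it reaches a node that was split after the search read its parent. It is a single-threaded illustration with hypothetical names (bl_node, bl_move_right, bl_leaf_contains, bl_split); it omits the latching and memory-ordering a concurrent implementation needs, and it covers leaves only.

```c
#include <assert.h>
#include <stddef.h>

#define BL_MAX_KEYS 4

/* Hypothetical B-link leaf: a B-tree leaf plus a high key and a right link. */
typedef struct bl_node {
    int             nkeys;
    int             keys[BL_MAX_KEYS];  /* sorted ascending */
    int             high_key;           /* largest key this node may hold */
    int             has_high;           /* 0 for the rightmost node */
    struct bl_node *right;              /* next node on the same level */
} bl_node;

/* "Move right": if the key lies beyond this node's high key, a split
 * has moved that range to a sibling, so follow the right links instead
 * of restarting the search from the root. */
bl_node *bl_move_right(bl_node *n, int key)
{
    while (n->has_high && key > n->high_key && n->right != NULL)
        n = n->right;
    return n;
}

/* Search starting from a leaf reached through a possibly stale parent
 * pointer. Returns 1 if the key is present, 0 otherwise. */
int bl_leaf_contains(bl_node *leaf, int key)
{
    int i;

    leaf = bl_move_right(leaf, key);
    for (i = 0; i < leaf->nkeys; i++)
        if (leaf->keys[i] == key)
            return 1;
    return 0;
}

/* Split a leaf: fill in the new sibling completely, then publish it
 * by updating the old node. The parent can be fixed up later. */
void bl_split(bl_node *left, bl_node *new_right)
{
    int half, i;

    assert(left->nkeys >= 2);
    half = left->nkeys / 2;

    new_right->nkeys = left->nkeys - half;
    for (i = 0; i < new_right->nkeys; i++)
        new_right->keys[i] = left->keys[half + i];
    new_right->high_key = left->high_key;
    new_right->has_high = left->has_high;
    new_right->right = left->right;

    left->nkeys = half;
    left->high_key = left->keys[half - 1];
    left->has_high = 1;
    left->right = new_right;
}
```

The lower lock contention comes from that last point: a search that arrives at the old node before the parent has been updated still finds its key by moving right, so a split does not have to hold the parent while it works.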
As such, it seems likely that we would have to write our own DB engine
to pursue this path. (Clearly such an engine must still provide full
ACID transaction support, so this is a non-trivial undertaking.) Whether
and when we embark on this is unclear; this is somewhat of an "ideal"
design and as always, "good enough" is the enemy of "perfect" ...
This isn't a backend we can simply add to the current slapd source base,
so it's probably an OpenLDAP 3.x target: in order to have a completely
canonical record on disk, we also need pointers to AttributeDescriptions
to be recorded in each entry, and those AttributeDescription pointers
must also be persistent. That means our current
AttributeDescription cache must be modified to also allocate its records
from a fixed mmap'd region. (And we'll have to include a
schema-generation stamp, so that if schema elements are deleted we can
force new AD pointers to be looked up when necessary.) (Of course, given
the self-contained nature of the AD cache, we can probably modify its
behavior in this way without impacting any other slapd code...)
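
One way the schema-generation stamp could work is sketched below in C. None of these types are slapd's real structures: attr_desc, schema, stored_attr, schema_find and stored_attr_ad are hypothetical stand-ins, and keeping the attribute name next to the pointer is an assumption made so that a stale pointer can be looked up again.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an AttributeDescription record. */
typedef struct attr_desc {
    const char *name;
} attr_desc;

/* Hypothetical schema state with a generation counter that is bumped
 * whenever a schema element is deleted. */
typedef struct schema {
    unsigned   generation;
    attr_desc *ads;
    size_t     nads;
} schema;

/* Hypothetical attribute slot inside a stored entry. */
typedef struct stored_attr {
    char       name[32];     /* kept so the pointer can be re-resolved */
    attr_desc *ad;           /* persistent pointer into the AD region */
    unsigned   generation;   /* schema generation when ad was resolved */
} stored_attr;

/* Look up a description by name; NULL if the schema no longer has it. */
attr_desc *schema_find(schema *s, const char *name)
{
    size_t i;

    for (i = 0; i < s->nads; i++)
        if (strcmp(s->ads[i].name, name) == 0)
            return &s->ads[i];
    return NULL;
}

/* Trust the stored pointer only while its stamp matches the current
 * schema generation; otherwise look it up again and restamp it. */
attr_desc *stored_attr_ad(stored_attr *a, schema *s)
{
    assert(a != NULL && s != NULL);
    if (a->generation != s->generation) {
        a->ad = schema_find(s, a->name);
        a->generation = s->generation;
    }
    return a->ad;
}
```

In the common case the check is a single integer comparison, so the stored pointer is used directly; the by-name lookup only runs after the schema has actually changed.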
There's also a potential risk to leaving all memory management up to the
OS - the native memory manager on some OSes (e.g. Windows) is abysmal,
and the CLOCK-based cache replacement code we now use in the entry cache
is more efficient than the LRU schemes that some older OS versions use.
So we may get into this and decide we still need to play games with
mlock() etc. to control the cache management. That would be an
unfortunate complication, but it would still allow us to do simpler
tuning than we currently need. Still, establishing a 1:1 correspondence
between virtual memory addresses and disk addresses is a big win for
performance, scalability, and reduced complexity (== greater
reliability)...
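
For comparison with LRU, here is a textbook CLOCK (second-chance) replacement sketch in C. It is a generic illustration, not the slapd entry cache code; clock_cache, clock_access and NSLOTS are hypothetical names, and the linear scan on lookup stands in for whatever index a real cache would use.

```c
#include <assert.h>
#include <stddef.h>

#define NSLOTS 3

/* Hypothetical fixed-size cache managed with the CLOCK policy. */
typedef struct clock_cache {
    int    key[NSLOTS];
    int    used[NSLOTS];        /* slot currently holds a key */
    int    referenced[NSLOTS];  /* set on access, cleared by the hand */
    size_t hand;                /* next slot the sweep will examine */
} clock_cache;

/* Access a key. Returns 1 on a hit, 0 on a miss; on a miss the key is
 * inserted, evicting a victim if the cache is full. */
int clock_access(clock_cache *c, int key)
{
    size_t i;

    assert(c != NULL);
    for (i = 0; i < NSLOTS; i++) {
        if (c->used[i] && c->key[i] == key) {
            c->referenced[i] = 1;   /* a hit only sets a bit */
            return 1;
        }
    }

    /* Miss: advance the hand, clearing reference bits, until it finds
     * an empty slot or one that was not referenced since the last sweep. */
    for (;;) {
        i = c->hand;
        c->hand = (c->hand + 1) % NSLOTS;
        if (!c->used[i] || !c->referenced[i]) {
            c->key[i] = key;
            c->used[i] = 1;
            c->referenced[i] = 1;
            return 0;
        }
        c->referenced[i] = 0;       /* give this slot a second chance */
    }
}
```

A hit sets one flag and never reorders a shared list, which is the usual reason CLOCK is cheaper than a strict LRU list when many threads hit the cache at once.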
(And yes, by the way, we have planning for LDAPCon2009 this September in
the works; I imagine the Call For Papers will go out in a week or two.
So now's a good time to pull up whatever other ideas you've had in the
back of your mind for a while...)
-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/
