Luke Kenneth Casson Leighton wrote:
>> We fell for the fantasy of parallel writes with BerkeleyDB, but after a
>> dozen+ years of poking, profiling, and benchmarking, it all becomes clear -
>> all of that locking overhead+deadlock detection/recovery is just a waste of
>> resources.
>
> ... which is why tdb went to the other extreme, to show it could be done.
But even tdb only allows one write transaction at a time. I looked into
writing a back-tdb for OpenLDAP back in 2009, before I started writing LMDB. I
know pretty well how tdb works...
>> https://twitter.com/hyc_symas/status/451763166985613312
>
> quote:
>
> "The new code is faster at indexing and searching, but not so much
> faster it would blow you away, even using
> LMDB. Turns out the slowness of Python looping trumps the speed of a
> fast datastore :(. The difference
> might be bigger on a big index; I'm going to run experiments on the
> Enron dataset and see."
>
> interesting. so why is read up at 5,000,000 per second under python
> (in a python loop, obviously) but write isn't? something odd there.
Good question. I'd guess there's some memory allocation overhead involved in
writes. The Whoosh guys have some more perf stats here
https://bitbucket.org/mchaput/whoosh/wiki/Whoosh3
(their test.Tokyo / All Keys result is highly suspect though; the timing is
the same for 100,000 keys as for 1M keys. Probably a bug in their test code.)
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
py-lmdb
by Luke Kenneth Casson Leighton
19 May '14
hi all, and a special hello to howard, i had forgotten you're working
on openldap.
i've just discovered lmdb and its python bindings, and i am, to be
absolutely honest, completely astounded. that's for two reasons: the
first is how stupidly quick lmdb is, and the second is that, bizarrely,
although it walks all over the alternatives, it isn't more widely
adopted. mariadb added leveldb last year, and mongodb, i believe, is
looking at a leveldb port.
i have been looking at a *lot* of key-value database stores recently,
to find the fastest possible one, after realising that standard SQL and
NOSQL databases simply aren't fast enough. there is something called
structchunk which, instead of storing the object in leveldb, just
stores an mmap file offset as the value and keeps the actual object
in an mmap'd file... and it's *still* nowhere near as quick as lmdb.
so i am both impressed and puzzled :) i have created a debian RFP for
the python bindings, in case it helps:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=748373 - can i
suggest that anyone wishing to see py-lmdb packaged send a message
seconding the request? as a general rule debian developers do not like
my blunt honesty and tend to ignore my advice, so if you would like to
see it in the archive someone needs to speak up.
questions:
i wrote some python evaluation code that stored 5,000 records with
8-byte keys and 100-byte values before doing a transaction commit: it
managed 900,000 records per second (which is ridiculously fast, even
for python). however, when i enabled append mode (on the cursor
put)... and yes, i used the db stats to create a key that each time was
1 bigger lexicographically than all other keys... bizarrely, things
*slowed down* ever so slightly - maybe about 3 to 5%.
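(for reference, a rough sketch of the kind of loop i mean, using the
py-lmdb api - this isn't my actual script, and the path, map size and
record counts here are just placeholders:)

    import struct, time, lmdb

    env = lmdb.open('/tmp/bench-lmdb', map_size=1 << 30)
    value = b'v' * 100

    def store(start, count, use_append=False):
        i = start
        t0 = time.time()
        while i < start + count:
            with env.begin(write=True) as txn:     # commit every 5,000 puts
                curs = txn.cursor()
                for _ in range(5000):
                    key = struct.pack('>Q', i)     # 8-byte key, strictly increasing
                    curs.put(key, value, append=use_append)
                    i += 1
        print('%.0f records/sec' % (count / (time.time() - t0)))

    store(0, 1000000)                         # plain puts
    store(1000000, 1000000, use_append=True)  # append mode: new keys sort after all existing ones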
what gives, there? the benchmarks show that this is supposed to be
faster (a *lot* faster) and that is simply not happening. is the
overhead from python so large that it wipes out the speed advantages?
l.
Luke Kenneth Casson Leighton wrote:
>> Not surprising. Remember, I've been writing world's-fastest <whatevers>
>> since the 1980s. There is no other KV store that is full ACID and anywhere
>> near as small or as fast as LMDB.
>
> ... or does range as well... *and* duplicate values per key. you
> have a long way to go before beating tdb for number of lines of code
> though :) written by rusty russell, andrew tridgell and others - did
> you see it does hash buckets and then spin-locks (file-based) on each
> hash chain, so you can have multiple simultaneous writers not just
> readers? good trick, that - i mention it just in case it's something
> that could be deployed to good effect in lmdb; it would be awesome
> to have parallel writes sped up per core as well.
>
> ... yes i realise that lmdb is read-optimised, but hey its being
> adopted elsewhere as well
Fair enough, tdb is smaller, but it's also missing all those other goodies
that we couldn't live without in OpenLDAP. And it doesn't perform.
Something that the world needs to understand - there is no such thing as
parallel writes, not in a transactional database. Every system out there gets
serialized sooner or later - most just get serialized in their WAL. All the
fine-grained locking they do before that point is just mental masturbation in
the face of that inevitable bottleneck. You should also notice there are
plenty of benchmarks where LMDB's write performance is far faster than
so-called write-optimized databases. When you jettison all the overhead of
those fine-grained locks your write path can get a lot faster, and with LMDB's
zero-copy writes it gets faster still.
We fell for the fantasy of parallel writes with BerkeleyDB, but after a dozen+
years of poking, profiling, and benchmarking, it all becomes clear - all of
that locking overhead+deadlock detection/recovery is just a waste of
resources. back-mdb isn't just faster than back-bdb/hdb for reads, it's also
several times faster for writes, and the absence of ubiquitous locks is a good
part of that.
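You can see the single-writer model directly from Python, by the way. A
minimal sketch with py-lmdb (the path and sleep times are arbitrary): the
second thread's write txn simply blocks until the first one commits.

    import threading, time, lmdb

    env = lmdb.open('/tmp/demo-lmdb', map_size=1 << 24)

    def writer(name, hold):
        with env.begin(write=True) as txn:   # acquires the single writer lock
            txn.put(name.encode(), b'x')
            time.sleep(hold)                 # hold the write txn open
        print(name, 'committed at', time.time())

    t1 = threading.Thread(target=writer, args=('first', 1.0))
    t2 = threading.Thread(target=writer, args=('second', 0.0))
    t1.start(); time.sleep(0.1); t2.start()  # 'second' waits until 'first' commits
    t1.join(); t2.join()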
>>> i wrote some python evaluation code that stored 5,000 records with
>>> 8-byte keys and 100-byte values before doing a transaction commit: it
>>> managed 900,000 records per second (which is ridiculously fast even
>>> what gives, there? the benchmarks show that this is supposed to be
>>> faster (a *lot* faster) and that is simply not happening. is the
>>> overhead from python that large it wipes out the speed advantages?
>> No idea. I don't use python enough to have any insight there.
But these folks have some thoughts on it
https://twitter.com/hyc_symas/status/451763166985613312
> ok.. is there some c-based benchmark code somewhere where i can check
> how to do sequential writes and compare it with the python bindings?
> just to make sure. it is very puzzling that there's a slow-down rather
> than a speed-up.
All of the source code for the microbenchmarks is linked from the microbench
page. (Minus the actual LevelDB source tree, which you also need.)
http://symas.com/mdb/microbench
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
On 5/5/14 1:46 PM, Hallvard Breien Furuseth wrote:
> On 05/02/2014 11:45 AM, Emmanuel Lécharny wrote:
>> I don't know what's inside the Free List in LMDB, but I can only assume
>> it's close to what we have in Mavibot. Let me explain what we do (or
>> will do!) in Mavibot, and what data structure we use for that.
>
> Thanks.
>
> > (...)
>> I don't know how it fits with LMDB implementation, I just wanted to give
>> you an insight on what we are working on. Hope it helps.
>
> LMDB's freeDB is simpler: A mapping
> {ID of transaction which freed the pages => [page numbers]}.
We used a tree because we need to be able to quickly find the pages copied
by a given transaction for a specific BTree. If we are to look for any
older transaction belonging to the same btree, {ID, page numbers} is not
enough.
> A write transaction may only use page numbers from freeDB records
> older than the oldest snapshot.
Yes, this is the current state. In Mavibot, we distinguish the elements
in the copiedPagesBTree from the pages in the freePages list (which is a
simple linked list). The rationale is that it's less costly to grab a
page from a simple linked list than from a b-tree, assuming that some
pages will be freed without having to be put into the copiedPagesBtree:
for instance, the root page can always be released if there is no read
transaction using the N-1 version, and likewise the pages used by the
b-tree headers (same reason).
At this point, those 'optimisations' are quite pulled out of thin air, I
must admit.
> It scans the reader table for
> that: Each read-only txn has a slot where it puts its snapshot's
> txnID.
We do have the same thing in memory.
> Also the datafile has metapages for the last 2 snapshots.
We have two pointers in the RecordManager header (RMH), one pointing at
the current BtreeOfBtrees and CopiedPagesBtree, and one at the previous
version of those.
We keep track of the previous pointers until the cleanup is done, then
we rewrite the RecordManager header with only the current version.
If we have a crash before the first RMH write, we restart with the
current version (which is the version before we updated whatever btrees).
If the crash occurs in the middle of the two RMH writes, then at
startup, we discover that we have both the previous and current pointers,
so we know that we have to restart the cleanup.
Otherwise, we are safe with the new version.
In any case, we are in a state from which we can restart with no breakage
and in minimal time, except that we may have lost the latest modification.
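Expressed as a small sketch (invented names, and in Python rather than in
Mavibot's actual Java), the startup decision is roughly:

    def recover(rmh):
        # rmh.current / rmh.previous: the two pointer sets in the RecordManager header
        if rmh.previous is None:
            # only the current pointers are present: either the last update
            # completed, or we crashed before the first RMH write - start from current
            return rmh.current
        # both previous and current are present: we crashed between the two RMH
        # writes, so restart the cleanup, then rewrite the header with only current
        restart_cleanup(rmh.previous, rmh.current)   # hypothetical helper
        rmh.previous = None
        rmh.write()
        return rmh.current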
I do think that LMDB and Mavibot are quite similar in the base algorithm
(there is not much room for many smart and different approaches here ;),
but the implementation *is* different, due to the fact that we are in
Java, whereas LMDB is in C. Typically, serialization/deserialization is a
real burden for us, as is the cache. We don't use MemoryMappedFile atm,
but we could.
> There are no concurrent write txns.
>
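To summarize the LMDB side as Hallvard describes it, a rough sketch of the
reuse rule (made-up names, not actual LMDB code): a write transaction may only
reuse pages whose freeing transaction is older than the oldest snapshot still
held by a reader.

    def reusable_pages(free_db, reader_table):
        # free_db:      {txn_id_that_freed_them: [page_numbers]}
        # reader_table: snapshot txn_ids registered by the live read-only txns
        oldest_reader = min(reader_table) if reader_table else float('inf')
        pages = []
        for freeing_txn, page_numbers in free_db.items():
            if freeing_txn < oldest_reader:   # no live reader can still see these pages
                pages.extend(page_numbers)
        return pages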
--
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com
Was chatting with Emmanuel Lecharny (who is currently working on Mavibot for
ApacheDS, an MVCC backend similar to LMDB) and had an interesting realization:
we can avoid the current issue of long-lived reader txns preventing page
reclamation.
The key point is this - currently for any txn, we store in the freeDB a list
of the pageIDs that were freed in that txn (due to being copied or deleted).
All we need is to know that any of these pages has been copied *twice* since
the txn of the outstanding reader. At that point, any such page is so new that
the reader would never have seen it, and if no other readers are outstanding
then the page can be safely reclaimed.
Currently mavibot maintains this revision history in a dedicated Btree (much
like our freeDB but with more info contained). I'm thinking, since we're
already going to add a txnID to every page's page header, we can simply add a
2nd txnID, recording the txnID of the previous change to this page's ancestor.
Then, any page where this prevTxnID is >= the outstanding reader's txnID can
be reclaimed.
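In rough pseudo-code (field names invented), the per-page test would be
something like:

    def reclaimable(page, oldest_reader_txn):
        # page.prev_txn_id: txnID of the previous change to this page's ancestor.
        # If that change happened at or after the reader's snapshot, the reader is
        # looking at an even older version and can never have seen this page.
        return page.prev_txn_id >= oldest_reader_txn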
Still thinking about the actual implementation of this; it may make more sense
to store the prevTxnID in the freeDB than in each page header. Ideally we want
to be able to grab a chunk of pageIDs unambiguously, instead of having to
iterate thru each page and read its header to determine if it's safe.
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/