Luke Kenneth Casson Leighton wrote:
>> We fell for the fantasy of parallel writes with BerkeleyDB, but after a
>> dozen+ years of poking, profiling, and benchmarking, it all becomes clear -
>> all of that locking overhead+deadlock detection/recovery is just a waste of
>> resources.
>
> ... which is why tdb went to the other extreme, to show it could be done.
But even tdb only allows one write transaction at a time. I looked into
writing a back-tdb for OpenLDAP back in 2009, before I started writing LMDB. I
know pretty well how tdb works...
>> https://twitter.com/hyc_symas/status/451763166985613312
>
> quote:
>
> "The new code is faster at indexing and searching, but not so much
> faster it would blow you away, even using
> LMDB. Turns out the slowness of Python looping trumps the speed of a
> fast datastore :(. The difference
> might be bigger on a big index; I'm going to run experiments on the
> Enron dataset and see."
>
> interesting. so why is read up at 5,000,000 per second under python
> (in a python loop, obviously) but write isn't? something odd there.
Good question. I'd guess there's some memory allocation overhead involved in
writes. The Whoosh guys have some more perf stats here
https://bitbucket.org/mchaput/whoosh/wiki/Whoosh3
(their test.Tokyo / All Keys result is highly suspect though; the timing is
the same for 100,000 keys as for 1M keys. Probably a bug in their test code.)
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
py-lmdb
by Luke Kenneth Casson Leighton
19 May '14
hi all, and a special hello to howard, i had forgotten you're working
on openldap.
i've just discovered lmdb and its python bindings, and i am, to be
absolutely honest, completely astounded. that's for two reasons: the
first is how stupidly quick lmdb is, and the second is that, bizarrely,
although it walks all over the alternatives, it isn't more widely
adopted. mariadb added leveldb last year, and mongodb, i believe, is
looking at a leveldb port.
i have been looking at a *lot* of key-value database stores recently,
to find the fastest possible one, after realising that standard SQL and
NOSQL databases simply aren't fast enough. there is something called
structchunk which, instead of storing the object in leveldb, just
stores an mmap file offset as the value and keeps the actual object
in an mmap'd file... and it's *still* nowhere near as quick as lmdb.
so i am both impressed and puzzled :) i have created a debian RFP for
the python bindings, in case it helps:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=748373 - can i
suggest that anyone wishing to see py-lmdb packaged send a message
seconding the request? as a general rule debian developers do not like
my blunt honesty and tend to ignore my advice, so if you would like to
see it in the archive someone needs to speak up.
questions:
i wrote some python evaluation code that stored 5,000 records with
8-byte keys and 100-byte values before doing a transaction commit: it
managed 900,000 records per second (which is ridiculously fast, even
for python). however, when i enabled append mode (on the cursor
put)... and yes, i used the db stats to create a key that each time was
1 bigger lexicographically than all other keys... bizarrely, things
*slowed down* ever so slightly - maybe about 3 to 5%.
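(for reference, a rough sketch of the kind of loop i mean, using the
py-lmdb api - this isn't my actual script, and the path, map size and
record counts here are just placeholders:)

    import struct, time, lmdb

    env = lmdb.open('/tmp/bench-lmdb', map_size=1 << 30)
    value = b'v' * 100

    def store(start, count, use_append=False):
        i = start
        t0 = time.time()
        while i < start + count:
            with env.begin(write=True) as txn:     # commit every 5,000 puts
                curs = txn.cursor()
                for _ in range(5000):
                    key = struct.pack('>Q', i)     # 8-byte key, strictly increasing
                    curs.put(key, value, append=use_append)
                    i += 1
        print('%.0f records/sec' % (count / (time.time() - t0)))

    store(0, 1000000)                         # plain puts
    store(1000000, 1000000, use_append=True)  # append mode: new keys sort after all existing ones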
what gives, there? the benchmarks show that this is supposed to be
faster (a *lot* faster) and that is simply not happening. is the
overhead from python so large that it wipes out the speed advantages?
l.
Luke Kenneth Casson Leighton wrote:
>> Not surprising. Remember, I've been writing world's-fastest <whatevers>
>> since the 1980s. There is no other KV store that is full ACID and anywhere
>> near as small or as fast as LMDB.
>
> ... or does range as well... *and* duplicate values per key. you
> have a long way to go before beating tdb for number of lines of code
> though :) written by rusty russell, andrew tridgell and others - did
> you see it does hash buckets and then spin-locks (file-based) on each
> hash chain, so you can have multiple simultaneous writers not just
> readers? good trick, that - i mention it just in case it's something
> that could be deployed to good effect in lmdb; it would be awesome
> to have parallel writes sped up per core as well.
>
> ... yes i realise that lmdb is read-optimised, but hey its being
> adopted elsewhere as well
Fair enough, tdb is smaller, but it's also missing all those other goodies
that we couldn't live without in OpenLDAP. And it doesn't perform.
Something that the world needs to understand - there is no such thing as
parallel writes, not in a transactional database. Every system out there gets
serialized sooner or later - most just get serialized in their WAL. All the
fine-grained locking they do before that point is just mental masturbation in
the face of that inevitable bottleneck. You should also notice there are
plenty of benchmarks where LMDB's write performance is far faster than
so-called write-optimized databases. When you jettison all the overhead of
those fine-grained locks your write path can get a lot faster, and with LMDB's
zero-copy writes it gets faster still.
We fell for the fantasy of parallel writes with BerkeleyDB, but after a dozen+
years of poking, profiling, and benchmarking, it all becomes clear - all of
that locking overhead+deadlock detection/recovery is just a waste of
resources. back-mdb isn't just faster than back-bdb/hdb for reads, it's also
several times faster for writes, and the absence of ubiquitous locks is a good
part of that.
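You can see the single-writer model directly from Python, by the way. A
minimal sketch with py-lmdb (the path and sleep times are arbitrary): the
second thread's write txn simply blocks until the first one commits.

    import threading, time, lmdb

    env = lmdb.open('/tmp/demo-lmdb', map_size=1 << 24)

    def writer(name, hold):
        with env.begin(write=True) as txn:   # acquires the single writer lock
            txn.put(name.encode(), b'x')
            time.sleep(hold)                 # hold the write txn open
        print(name, 'committed at', time.time())

    t1 = threading.Thread(target=writer, args=('first', 1.0))
    t2 = threading.Thread(target=writer, args=('second', 0.0))
    t1.start(); time.sleep(0.1); t2.start()  # 'second' waits until 'first' commits
    t1.join(); t2.join()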
>>> i wrote some python evaluation code that stored 5,000 records with
>>> 8-byte keys and 100-byte values before doing a transaction commit: it
>>> managed 900,000 records per second (which is ridiculously fast even
>>> what gives, there? the benchmarks show that this is supposed to be
>>> faster (a *lot* faster) and that is simply not happening. is the
>>> overhead from python that large it wipes out the speed advantages?
>> No idea. I don't use python enough to have any insight there.
But these folks have some thoughts on it
https://twitter.com/hyc_symas/status/451763166985613312
> ok.. is there some c-based benchmark code somewhere where i can check
> how to do sequential writes and compare it with the python bindings?
> just to make sure. it is very puzzling that there's a slow-down rather
> than a speed-up.
All of the source code for the microbenchmarks is linked from the microbench
page. (Minus the actual LevelDB source tree, which you also need.)
http://symas.com/mdb/microbench
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
On 5/5/14 1:46 PM, Hallvard Breien Furuseth wrote:
> On 05/02/2014 11:45 AM, Emmanuel Lécharny wrote:
>> I don't know what's inside the Free List in LMDB, but I can only assume
>> it's close to what we have in Mavibot. Let me explain what we do (or
>> will do!) in Mavibot, and what data structure we use for that.
>
> Thanks.
>
> > (...)
>> I don't know how it fits with LMDB implementation, I just wanted to give
>> you an insight on what we are working on. Hope it helps.
>
> LMDB's freeDB is simpler: A mapping
> {ID of transaction which freed the pages => [page numbers]}.
We used a tree because we need to be able to quickly find the pages copied
by a given transaction for a specific BTree. If we are to look for any
older transaction belonging to the same btree, {ID, page numbers} is not
enough.
> A write transaction may only use page numbers from freeDB records
> older than the oldest snapshot.
Yes, this is the current state. In Mavibot, we distinguish the elements
in the copiedPagesBTree from the pages in the freePages list (which is a
simple linked list). The rationale is that it's less costly to grab a
page from a simple linked list than from a b-tree, assuming that some
pages will be freed without having to be put into the copiedPagesBtree:
for instance, the root page can always be released if there is no read
transaction using the N-1 version, and likewise the pages used by the
b-tree headers (same reason).
At this point, those 'optimisations' are quite pulled out of thin air, I
must admit.
> It scans the reader table for
> that: Each read-only txn has a slot where it puts its snapshot's
> txnID.
We do have the same thing in memory.
> Also the datafile has metapages for the last 2 snapshots.
We have two pointers in the RecordManager header (RMH), one pointing at
the current BtreeOfBtrees and CopiedPagesBtree, and one at the previous
version of those.
We keep track of the previous pointers until the cleanup is done, then
we rewrite the RecordManager header with only the current version.
If we have a crash before the first RMH write, we restart with the
current version (which is the version before we updated whatever btrees).
If the crash occurs in the middle of the two RMH writes, then at
startup, we discover that we have both the previous and current pointers,
so we know that we have to restart the cleanup.
Otherwise, we are safe with the new version.
In any case, we are in a state from which we can restart with no breakage
and in minimal time, except that we may have lost the latest modification.
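Expressed as a small sketch (invented names, and in Python rather than in
Mavibot's actual Java), the startup decision is roughly:

    def recover(rmh):
        # rmh.current / rmh.previous: the two pointer sets in the RecordManager header
        if rmh.previous is None:
            # only the current pointers are present: either the last update
            # completed, or we crashed before the first RMH write - start from current
            return rmh.current
        # both previous and current are present: we crashed between the two RMH
        # writes, so restart the cleanup, then rewrite the header with only current
        restart_cleanup(rmh.previous, rmh.current)   # hypothetical helper
        rmh.previous = None
        rmh.write()
        return rmh.current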
I do think that LMDB and Mavibot are quite similar in the base algorithm
(there is not much room for many smart and different approaches here ;),
but the implementation *is* different, due to the fact that we are in
Java, whereas LMDB is in C. Typically, serialization/deserialization is a
real burden for us, as is the cache. We don't use MemoryMappedFile atm,
but we could.
> There are no concurrent write txns.
>
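To summarize the LMDB side as Hallvard describes it, a rough sketch of the
reuse rule (made-up names, not actual LMDB code): a write transaction may only
reuse pages whose freeing transaction is older than the oldest snapshot still
held by a reader.

    def reusable_pages(free_db, reader_table):
        # free_db:      {txn_id_that_freed_them: [page_numbers]}
        # reader_table: snapshot txn_ids registered by the live read-only txns
        oldest_reader = min(reader_table) if reader_table else float('inf')
        pages = []
        for freeing_txn, page_numbers in free_db.items():
            if freeing_txn < oldest_reader:   # no live reader can still see these pages
                pages.extend(page_numbers)
        return pages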
--
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com
Was chatting with Emmanuel Lecharny (who is currently working on Mavibot for
ApacheDS, an MVCC backend similar to LMDB) and had an interesting realization:
we can avoid the current issue of long-lived reader txns preventing page
reclamation.
The key point is this - currently for any txn, we store in the freeDB a list
of the pageIDs that were freed in that txn (due to being copied or deleted).
All we need is to know that any of these pages has been copied *twice* since
the txn of the outstanding reader. At that point, any such page is so new that
the reader would never have seen it, and if no other readers are outstanding
then the page can be safely reclaimed.
Currently mavibot maintains this revision history in a dedicated Btree (much
like our freeDB but with more info contained). I'm thinking, since we're
already going to add a txnID to every page's page header, we can simply add a
2nd txnID, recording the txnID of the previous change to this page's ancestor.
Then, any page where this prevTxnID is >= the outstanding reader's txnID can
be reclaimed.
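In rough pseudo-code (field names invented), the per-page test would be
something like:

    def reclaimable(page, oldest_reader_txn):
        # page.prev_txn_id: txnID of the previous change to this page's ancestor.
        # If that change happened at or after the reader's snapshot, the reader is
        # looking at an even older version and can never have seen this page.
        return page.prev_txn_id >= oldest_reader_txn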
Still thinking about the actual implementation of this; it may make more sense
to store the prevTxnID in the freeDB than in each page header. Ideally we want
to be able to grab a chunk of pageIDs unambiguously, instead of having to
iterate thru each page and read its header to determine if it's safe.
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/