ext3/ext4 fsync hack (was: openldap.git branch mdb.master updated. 0018eeb2c3b2239c30def9d47c9d194a4ebf35fe)
by Hallvard Breien Furuseth
On 18/12/14 05:40, openldap-commit2devel(a)OpenLDAP.org wrote:
> commit 0018eeb2c3b2239c30def9d47c9d194a4ebf35fe
> Author: Howard Chu <hyc(a)openldap.org>
> Date: Thu Dec 18 04:38:53 2014 +0000
>
> Hack for potential ext3/ext4 corruption issue
>
> Use regular fsync() if we think this commit grew the DB file.
This does not catch all cases:
If the new pages below mt_next_pgno were freed instead of
written, me_size becomes too big. Later, when the file does
grow, me_size may be >= the actual file size, so the commit
fdatasync()s when it should fsync().
Similar to b09e46904c1c059bd5086243e3915b6be510e57d
"ITS#7886 fix mdb_copy write size".
We can fix me_size, grow the file anyway (ftruncate), or
give the pages back to mt_next_pgno in mdb_freelist_save().
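A minimal sketch of the ftruncate() option (hypothetical helper and
names, an illustration of the idea rather than actual LMDB code):

    /* If me_size was advanced past the real end of the file because
     * the tail pages were freed instead of written, force the file to
     * really reach that length so later size comparisons stay valid. */
    #include <sys/types.h>
    #include <unistd.h>

    static int ensure_length(int fd, off_t believed_size, off_t real_size)
    {
        if (real_size < believed_size) {
            if (ftruncate(fd, believed_size) < 0)
                return -1;
            return fsync(fd);   /* the length change is metadata */
        }
        return 0;
    }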
Another issue: After an MDB_NOSYNC commit, mdb_env_sync()
only fdatasync()s. It does not know when the file grew.
The planned "group commits" may get the same problem if
the user checkpoints with mdb_env_sync().
8 years, 8 months
LMDB crash consistency, again
by Howard Chu
This paper
https://www.usenix.org/conference/osdi14/technical-sessions/presentation/...
describes a potential crash vulnerability in LMDB due to its use of fdatasync
instead of fsync when syncing writes to the data file. The vulnerability
exists because fdatasync omits syncing the file metadata; if the data file
needed to grow as a result of any writes, then that growth requires a
metadata update.
This is a well-understood issue in LMDB; we briefly touched on it in this
earlier email thread
http://www.openldap.org/lists/openldap-technical/201402/msg00111.html and it's
been a topic of discussion on IRC ever since the first multi-FS
microbenchmarks we conducted back in 2012. http://symas.com/mdb/microbench/july/
It's worth noting that this vulnerability doesn't exist on Windows, MacOSX,
Android, or *BSD, because none of these OSs have a function equivalent to
fdatasync in the first place - they always use fsync (or the Windows
equivalent). (Android is an oddball; the underlying Linux kernel of course
supports fdatasync, but the C library, bionic, does not.)
We have a couple of approaches for Linux:
1) provide an option to preallocate the file, using fallocate().
Unfortunately this doesn't completely eliminate metadata updates - filesystem
drivers tend to try to be "smart" and make fallocate cheap; they allocate the
space in the FS metadata but also mark it as "unseen." The first time a
process accesses an unseen page, it gets zeroed out; until that point, the
old contents of the disk page are still present. Flipping a page from
"unseen" to "seen" requires a metadata update of its own.
We had a discussion of this FS mis-feature a while ago, but it was fruitless.
https://lkml.org/lkml/2012/12/7/396
2) preallocate the file by explicitly writing zeros to it. This has a
couple of other disadvantages:
a) on SSDs, doing such a write needlessly contributes to wearout of the
flash.
b) Windows detects all-zero writes and compresses them out, creating a
sparse file, thus defeating the attempt at preallocation.
3) track the allocated size of the file, and toggle between fsync and
fdatasync depending on whether the allocated size actually grows or not
(see the sketch below). This is the approach I'm currently taking in a
development branch. Whether we add this to a new 0.9.x release, or just
to 1.0, I haven't yet decided.
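As a rough illustration of approach 3 (hypothetical struct and helper
names; a simplification, not necessarily what the development branch
actually does):

    #include <sys/types.h>
    #include <unistd.h>

    struct env {
        int   fd;          /* data file descriptor */
        off_t synced_size; /* file length covered by the last fsync() */
    };

    /* Sync a commit whose highest write ends at commit_end. */
    static int commit_sync(struct env *e, off_t commit_end)
    {
        if (commit_end > e->synced_size) {
            if (fsync(e->fd) < 0)    /* file grew: metadata must hit disk */
                return -1;
            e->synced_size = commit_end;
            return 0;
        }
        return fdatasync(e->fd);     /* no growth: data-only sync suffices */
    }

As the "ext3/ext4 fsync hack" thread above points out, the tracked size
must also be corrected when pages past the old end of the file are freed
rather than written, or a later commit that really grows the file may
wrongly take the fdatasync() path.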
As another footnote, I plan to add support for LMDB on a raw partition in 1.x.
Naturally, fsync vs fdatasync will be irrelevant in that case.
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
8 years, 8 months
RE24 testing call #1 (2.4.41), LMDB RE0.9 testing call #1 (0.9.15)
by Quanah Gibson-Mount
The following changes have been made to OpenLDAP directly for 2.4.41:
OpenLDAP 2.4.41 Engineering
Fixed libldap double free of request during abandon (ITS#7967)
Fixed libldap segfault in ldap_sync_initialize (ITS#8001)
Fixed libldap ldif-wrap off by one error (ITS#8003)
Fixed slapd syncrepl delta-mmr issue with overlays and slapd.conf
(ITS#7976)
Fixed slapd syncrepl mutex for cookie state (ITS#7968)
Fixed slapd-mdb one-level search (ITS#7975)
Fixed slapd-mdb heap corruption (ITS#7965)
Fixed slapd-mdb crash after deleting in-use schema (ITS#7995)
Fixed slapo-collect segfault (ITS#7797)
Build Environment
Enhanced contrib modules build paths (ITS#7782)
Fixed contrib/autogroup internal operation identity
(ITS#8006)
Fixed contrib/passwd/sha2 compiler warning (ITS#8000)
Fixed contrib/noopsrch compiler warning (ITS#7998)
Fixed contrib/dupent compiler warnings (ITS#7997)
Contrib
Added pbkdf2 sha256 and sha512 schemes (ITS#7977)
The following changes have been made to LMDB:
LMDB 0.9.15 Release Engineering
Fix txn init (ITS#7961,#7987)
Fix MDB_PREV_DUP (ITS#7955,#7671)
Fix compact of empty env (ITS#7956)
Fix potential fdatasync bug on ext3fs
Build
Don't use -fPIC for static lib
Update .gitignore (ITS#7952,#7953)
Cleanup for "make test" (ITS#7841)
Misc. Android/Windows cleanup
Documentation
Fix MDB_APPEND doc
Clarify mdb_dbi_open doc
Please test and report back!
--Quanah
--
Quanah Gibson-Mount
Platform Architect
Zimbra, Inc.
--------------------
Zimbra :: the leader in open source messaging and collaboration
8 years, 9 months
Freeing unused pages between two revisions...
by Emmanuel Lécharny
Hi guys,
we have had long discussions with Howard on LMDB about the release of
pages that are no longer in use. This is a critical part of an MVCC-based
B-tree, as each update copies many pages, and we want to re-use the
copied ones.
The problem is that those copied pages may still be in use by older
readers. Let's quickly explain how it all works.
At Ti, we are updating the B-tree, and that creates many new pages by
copying pages from previous revisions. We will denote those new pages by
the level of the B-tree they are stored at: P[k, a] is a page at level k
whose previous version was written at revision 'a' (where a <= i-1).
P[0, i-1] is the root page, at level 0; as the root page is copied on
every update, its previous version is necessarily from the immediately
preceding revision.
One more important point is that a copied page may be split into two new
pages (if the previous page was full), and that we may copy the content of
two old pages into one single new page (when the two old pages contain only
N/2 elements each and we removed one element from one of them). Last but
not least, we may have to remove one level, if the root page is removed.
All in all, what is important here is to *know* which pages have been
copied during the update operation, because those are the pages that
will have to be freed at some point.
Now, assuming we keep track of all the pages copied at a revision R,
we can also assume we can free all those pages when this revision R is
no longer in use (i.e. no reader thread is using R *and* no other reader
thread is using a revision below R). This is what LMDB does, AFAICT.
Can we go a bit further, and get rid of copied pages even while a reader
is still using a revision < R? I believe so.
Let's say that at revision R we have copied a page P[k, c], and let's say
we have a reader using a revision P < c (obviously, if a reader uses a
revision >= c and < R, we can't free P[k, c], because it's fully part of
that reader's B-tree). Since c > P, the copied page was created after
reader P started, so reader P has never seen it; once it has been copied
at R, no reader will ever use it again, and we can get rid of this copy.
case 1 : P is newer than the copied page

                      P           R
                      |           |
                      v           v
o--------[c]---------[P]---------[n]------> time
          ^                       |
          |                       +--> at R we create a new revision, which
          |                            copies page [c] and creates page [n]
          |
          +--> revision P is still using page [c]
Here, we can't get rid of page [c], as it's still part of revision P's B-tree.
case 2 : P is older than the copied page

          P           Q           R
          |           |           |
          v           v           v
o--------[P]---------[c]---------[n]------> time
                      ^           |
                      |           +--> at R we create a new revision, which
                      |                copies page [c] and creates page [n]
                      |
                      +--> page [c] was created by the update at Q;
                           reader P never sees it
Here, the update at revision Q has created a page [c] which was then
copied by the update at revision R. We can get rid of page [c] because
no reader will ever use it.
That is all clear and easy. Now, the real problem is: how do we determine
the pages we can free, based on the pages copied at revision R? At
revision R, we know which pages we have copied, and each page carries a
unique revision number. That lets us know which pages to free almost
immediately: we just free all the copied pages whose revision is newer
than the most recent reader's revision.
For instance, we can immediately get rid of the root page of any
revision not in use, because the root page is copied on every update.
That holds for any update being done when we don't have recent readers
(i.e., if we do many updates, this algorithm works very well).
But what if we have more than one pending reader? There is little we
can do then; we have to wait for one of the readers to be released,
because that is the point at which we will be able to free many pages.
Ok, let's see what happens when we release a reader revision now.
We know which pages this reader is holding: it's the list of pages
copied by the update which occurred at the same revision. When we release
this reader, all those copied pages have to be checked to see if they
are still in use.
     M                   P       Q    R
     |                   |       |    |
     v                   v       v    v
o---[M]---[c]----[d]----[P]-----[Q]--[n]------> time
           ^      ^                   |
           |      |                   +--> at R we create a new revision,
           |      |                        which copies page [c] and
           |      |                        creates page [n]
           |      +--> page [d] was copied by the update at Q, but is
           |           still held, because reader P is using it
           |
           +--> page [c] is still used by reader P (reader M is older
                than both [c] and [d])
Here, we will release the reader P. We should expect that every page
copied between M and P can now be released (in the graphic, revision Q
has copied page [d], which is still held, because revision P is using
it). So when we release P, we know we should free pages [c] and [d]. The
difficulty here is to list all the pages that are held by revision P,
and this is a costly process:
- check every revision > P which has copied pages whose revision is
above M and below P.
In order to do that we must have a data structure that holds all the
readers in use, ordered from the oldest to the newest, and we must have a
data structure keeping track of all the pages copied by each revision.
We have both, but the second one is not in memory (well, we can't assume
it's in memory, as this data structure is stored in a B-tree itself).
So basically, it's a matter of browsing the copied-pages B-tree looking
for all the revisions between [M] and [R], and for each of those
revisions, we can free all the copied pages whose revision is > M.
Not exactly inexpensive, but still acceptable, IMO.
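As a rough sketch of that scan (hypothetical types and helper names, not
actual LMDB code), following the rule stated above:

    typedef unsigned long rev_t;

    struct copied_page {
        rev_t page_rev;           /* revision at which this page version
                                   * was written */
        struct copied_page *next;
    };

    /* pages superseded (copied) by the update at revision r */
    extern struct copied_page *copied_pages_for(rev_t r);
    extern void free_page(struct copied_page *p);

    /* Called when the reader at revision 'released' (P above) goes away.
     * 'older' is the next older active reader (M), 'newest' the most
     * recent revision (R). */
    static void on_reader_release(rev_t released, rev_t older, rev_t newest)
    {
        rev_t r;
        struct copied_page *p;

        for (r = released + 1; r <= newest; r++)      /* revisions > P */
            for (p = copied_pages_for(r); p; p = p->next)
                if (p->page_rev > older && p->page_rev < released)
                    free_page(p);                     /* above M, below P */
    }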
Wdyt ?
Emmanuel Lécharny
8 years, 9 months
[LMDB] inconsistent mutex behavior and problems for M:N threading
by David Barbour
The code setting up the locks is `mdb_env_setup_locks`, circa line 4250 of
mdb.c. This code has a bunch of #ifdefs for POSIX vs Win32. A major
difference is that the POSIX locks are non-recursive by default (unless you
set a specific flag), whereas the CreateMutex call used on Win32 creates
recursive locks.
The MDB_NOTLS flag only affects readers.
When two writers operate in the same thread using M:N threading, a bug will
occur on Windows whenever two different lightweight 'writer' threads grab
the same mutex in the same OS thread, because both acquisitions will succeed.
Conversely, on POSIX systems a different bug will occur: the second writer
will attempt to grab the mutex, and this will lock up the OS thread...
which, unfortunately, happens to be the same thread that the first writer
needs to use to unlock the mutex. There is no way for a lightweight writer
to migrate to another thread and unlock the mutex there.
I believe both systems would behave more consistently and usefully if we
switched from mutexes to semaphores (of size 1). The difference is that a
semaphore doesn't track an 'owner' thread. This is also the suggested way
to avoid the recursive mutex behavior of Windows [1].
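As a minimal POSIX illustration (not LMDB code) of the key property: a
binary semaphore has no owner, so it can be released from a different OS
thread than the one that acquired it, and a second acquisition simply
blocks instead of succeeding recursively.

    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>

    static sem_t write_lock;

    static void *other_thread(void *arg)
    {
        (void)arg;
        sem_post(&write_lock);        /* release a lock taken by main() */
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        sem_init(&write_lock, 0, 1);  /* binary semaphore, initial value 1 */
        sem_wait(&write_lock);        /* "writer" acquires in this thread */
        pthread_create(&t, NULL, other_thread, NULL);
        pthread_join(&t, NULL);       /* released by the other thread */
        sem_wait(&write_lock);        /* and can be acquired again */
        printf("reacquired after cross-thread release\n");
        sem_destroy(&write_lock);
        return 0;
    }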
I would like to hear other opinions on this matter.
Regards,
Dave
[1]
http://stackoverflow.com/questions/1988324/how-to-alter-the-recursive-loc...
8 years, 9 months
List charter
by Howard Chu
Recent emails imply that a reminder is in order - this email list is for
discussion of actual development of OpenLDAP software. Discussion about
how OpenLDAP software works or how to use it belongs on the -technical
list. If you're not actively engaged in writing code for inclusion in
the OpenLDAP repos then you should not be posting questions here.
On a slightly more pointed note - if you post to this list and your
message is deemed off-topic or insufficiently researched, you *will* be
chided, mocked, and denigrated. There *is* such a thing as a stupid
question. If you don't read what's in front of you, if you ignore the
list charter, or the text of the welcome message that is sent to every
new subscriber, you will be publicly mocked and made unwelcome.
This is entirely intentional and by design. Ask good questions within
the bounds of the list charter and you will get good answers. If you're
not conscientious enough to pay attention to details then I don't want
you here. Demonstrating a basic ability to read is the minimum price of
admission.
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
8 years, 9 months
Valgrind report an error in sample-mdb.c
by yi huang
I'm not sure how this happens. Is it a problem?
$ make
$ gcc sample-mdb.c -L. -llmdb
$ valgrind ./a.out
==13010== Memcheck, a memory error detector
==13010== Copyright (C) 2002-2012, and GNU GPL'd, by Julian Seward et al.
==13010== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==13010== Command: ./a.out
==13010==
==13010== Syscall param pwrite64(buf) points to uninitialised byte(s)
==13010== at 0x42319D3: pwrite (pwrite.c:46)
==13010== by 0x403AB25: mdb_page_flush (mdb.c:3160)
==13010== by 0x4043AD0: mdb_txn_commit (mdb.c:3416)
==13010== by 0x8048A13: main (in /home/yihuang/src/mdb/libraries/liblmdb/a.out)
==13010== Address 0x43c35b4 is 4,084 bytes inside a block of size 4,096 alloc'd
==13010== at 0x402BE18: malloc (vg_replace_malloc.c:270)
==13010== by 0x403AE47: mdb_page_malloc.isra.13 (mdb.c:1615)
==13010== by 0x403D281: mdb_page_alloc.isra.22 (mdb.c:2120)
==13010== by 0x403D446: mdb_page_touch (mdb.c:2252)
==13010== by 0x403EEED: mdb_cursor_touch (mdb.c:6019)
==13010==
key: 0x4643fdc 020 , data: 0x4643fe0 020 3141592 foo bar
==13010==
==13010== HEAP SUMMARY:
==13010== in use at exit: 0 bytes in 0 blocks
==13010== total heap usage: 15 allocs, 15 frees, 1,586,385 bytes allocated
==13010==
==13010== All heap blocks were freed -- no leaks are possible
==13010==
==13010== For counts of detected and suppressed errors, rerun with: -v
==13010== Use --track-origins=yes to see where uninitialised values come from
==13010== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
8 years, 9 months
LMDB Getting transaction ID (feature request)
by David Barbour
At the moment, LMDB provides a means to view the last committed transaction
on the environment, but does not provide any means to reliably acquire the
ID of the 'current' transaction. I cannot check that the `me_last_txnid`
returned from mdb_env_info() matches the snapshot I'm actually looking at.
I can think of several scenarios where accessing the transaction ID
specific to the snapshot a transaction is viewing could be very useful.
First, if I want to promote a read-only transaction to a read-write
transaction, a very simple mechanism is to abort the read, create the
read-write transaction and compare the transaction IDs. If they're the
same, I can immediately continue since there is no risk of any changes.
Otherwise, I can restart the transaction with write enabled.
Second, if I want to build a layer above LMDB modeling concurrent
read-write transactions, I can potentially track (in several read-only
transactions) intended writes together with a bloom-filter for the keys
I've read. (There are other ways to do this, too, e.g. via a proper STM
layer.) It is feasible to logically serialize and merge a bunch of
non-conflicting write transactions as one larger write transaction. Such a
layer can potentially improve throughput and amortize synchronization
costs, albeit with a tradeoff for memory overheads and latency.
Third, if I want to poll the database, the ability to peek at the
transaction ID and see that nothing has changed could be very useful. In an
HTTP server, the transaction ID could serve as an (admittedly coarse
grained) e-tag.
Fortunately, implementing this feature request is trivial. It uses only
information we already have: mt_txnid. I'm asking for access to this value,
e.g. via:
txnid_t mdb_txn_id(MDB_txn* txn);
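For example, the promotion scenario above could then be written as the
following sketch against the proposed accessor (mdb_txn_id() is the
requested, hypothetical API; whether an unchanged snapshot shows up as an
equal ID or as seen + 1 on the write transaction is a detail of the
proposal):

    /* Promote a read-only txn to read-write, signalling the caller to
     * restart if anything was committed in between. */
    int promote(MDB_env *env, MDB_txn **txn)
    {
        txnid_t seen = mdb_txn_id(*txn);  /* snapshot we were reading */
        int rc;

        mdb_txn_abort(*txn);
        *txn = NULL;
        rc = mdb_txn_begin(env, NULL, 0, txn);
        if (rc)
            return rc;
        if (mdb_txn_id(*txn) != seen)     /* a writer committed in between */
            return -1;                    /* caller must redo its reads */
        return 0;                         /* unchanged: continue as a writer */
    }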
Will you make this happen?
Regards,
Dave
8 years, 9 months