ext3/ext4 fsync hack (was: openldap.git branch mdb.master updated. 0018eeb2c3b2239c30def9d47c9d194a4ebf35fe)
by Hallvard Breien Furuseth
On 18/12/14 05:40, openldap-commit2devel(a)OpenLDAP.org wrote:
> commit 0018eeb2c3b2239c30def9d47c9d194a4ebf35fe
> Author: Howard Chu <hyc(a)openldap.org>
> Date: Thu Dec 18 04:38:53 2014 +0000
>
> Hack for potential ext3/ext4 corruption issue
>
> Use regular fsync() if we think this commit grew the DB file.
This does not catch all cases:
If the new pages below mt_next_pgno were freed instead of
written, me_size becomes too big. Later, when the file does
grow, me_size may be >= the actual file size, so the commit
fdatasync()s when it should fsync().
Similar to b09e46904c1c059bd5086243e3915b6be510e57d
"ITS#7886 fix mdb_copy write size".
We can fix me_size, grow the file anyway (ftruncate), or
give the pages back to mt_next_pgno in mdb_freelist_save().
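A minimal sketch of the ftruncate() option (hypothetical helper and
names, an illustration of the idea rather than actual LMDB code):

    /* If me_size was advanced past the real end of the file because
     * the tail pages were freed instead of written, force the file to
     * really reach that length so later size comparisons stay valid. */
    #include <sys/types.h>
    #include <unistd.h>

    static int ensure_length(int fd, off_t believed_size, off_t real_size)
    {
        if (real_size < believed_size) {
            if (ftruncate(fd, believed_size) < 0)
                return -1;
            return fsync(fd);   /* the length change is metadata */
        }
        return 0;
    }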
Another issue: After an MDB_NOSYNC commit, mdb_env_sync()
only fdatasync()s. It does not know when the file grew.
The planned "group commits" may get the same problem if
the user checkpoints with mdb_env_sync().
8 years, 8 months
LMDB crash consistency, again
by Howard Chu
This paper
https://www.usenix.org/conference/osdi14/technical-sessions/presentation/...
describes a potential crash vulnerability in LMDB due to its use of fdatasync
instead of fsync when syncing writes to the data file. The vulnerability
exists because fdatasync omits syncing the file metadata; if the data file
needed to grow as a result of any writes, then that growth requires a
metadata update.
This is a well-understood issue in LMDB; we briefly touched on it in this
earlier email thread
http://www.openldap.org/lists/openldap-technical/201402/msg00111.html and it's
been a topic of discussion on IRC ever since the first multi-FS
microbenchmarks we conducted back in 2012. http://symas.com/mdb/microbench/july/
It's worth noting that this vulnerability doesn't exist on Windows, MacOSX,
Android, or *BSD, because none of these OSs have a function equivalent to
fdatasync in the first place - they always use fsync (or the Windows
equivalent). (Android is an oddball; the underlying Linux kernel of course
supports fdatasync, but the C library, bionic, does not.)
We have a couple of approaches for Linux:
1) provide an option to preallocate the file, using fallocate().
Unfortunately this doesn't completely eliminate metadata updates - filesystem
drivers tend to try to be "smart" and make fallocate cheap; they allocate the
space in the FS metadata but also mark it as "unseen." The first time a
process accesses an unseen page, it gets zeroed out; until that point, the
old contents of the disk page are still present. Flipping a page from
"unseen" to "seen" requires a metadata update of its own.
We had a discussion of this FS mis-feature a while ago, but it was fruitless.
https://lkml.org/lkml/2012/12/7/396
2) preallocate the file by explicitly writing zeros to it. This has a
couple of other disadvantages:
a) on SSDs, doing such a write needlessly contributes to wearout of the
flash.
b) Windows detects all-zero writes and compresses them out, creating a
sparse file, thus defeating the attempt at preallocation.
3) track the allocated size of the file, and toggle between fsync and
fdatasync depending on whether the allocated size actually grows or not
(see the sketch below). This is the approach I'm currently taking in a
development branch. Whether we add this to a new 0.9.x release, or just
to 1.0, I haven't yet decided.
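As a rough illustration of approach 3 (hypothetical struct and helper
names; a simplification, not necessarily what the development branch
actually does):

    #include <sys/types.h>
    #include <unistd.h>

    struct env {
        int   fd;          /* data file descriptor */
        off_t synced_size; /* file length covered by the last fsync() */
    };

    /* Sync a commit whose highest write ends at commit_end. */
    static int commit_sync(struct env *e, off_t commit_end)
    {
        if (commit_end > e->synced_size) {
            if (fsync(e->fd) < 0)    /* file grew: metadata must hit disk */
                return -1;
            e->synced_size = commit_end;
            return 0;
        }
        return fdatasync(e->fd);     /* no growth: data-only sync suffices */
    }

As the "ext3/ext4 fsync hack" thread above points out, the tracked size
must also be corrected when pages past the old end of the file are freed
rather than written, or a later commit that really grows the file may
wrongly take the fdatasync() path.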
As another footnote, I plan to add support for LMDB on a raw partition in 1.x.
Naturally, fsync vs fdatasync will be irrelevant in that case.
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
8 years, 8 months
RE24 testing call #1 (2.4.41), LMDB RE0.9 testing call #1 (0.9.15)
by Quanah Gibson-Mount
The following changes have been made to OpenLDAP directly for 2.4.41:
OpenLDAP 2.4.41 Engineering
Fixed libldap double free of request during abandon (ITS#7967)
Fixed libldap segfault in ldap_sync_initialize (ITS#8001)
Fixed libldap ldif-wrap off by one error (ITS#8003)
Fixed slapd syncrepl delta-mmr issue with overlays and slapd.conf
(ITS#7976)
Fixed slapd syncrepl mutex for cookie state (ITS#7968)
Fixed slapd-mdb one-level search (ITS#7975)
Fixed slapd-mdb heap corruption (ITS#7965)
Fixed slapd-mdb crash after deleting in-use schema (ITS#7995)
Fixed slapo-collect segfault (ITS#7797)
Build Environment
Enhanced contrib modules build paths (ITS#7782)
Fixed contrib/autogroup internal operation identity
(ITS#8006)
Fixed contrib/passwd/sha2 compiler warning (ITS#8000)
Fixed contrib/noopsrch compiler warning (ITS#7998)
Fixed contrib/dupent compiler warnings (ITS#7997)
Contrib
Added pbkdf2 sha256 and sha512 schemes (ITS#7977)
The following changes have been made to LMDB:
LMDB 0.9.15 Release Engineering
Fix txn init (ITS#7961,#7987)
Fix MDB_PREV_DUP (ITS#7955,#7671)
Fix compact of empty env (ITS#7956)
Fix potential fdatasync bug on ext3fs
Build
Don't use -fPIC for static lib
Update .gitignore (ITS#7952,#7953)
Cleanup for "make test" (ITS#7841)
Misc. Android/Windows cleanup
Documentation
Fix MDB_APPEND doc
Clarify mdb_dbi_open doc
Please test and report back!
--Quanah
--
Quanah Gibson-Mount
Platform Architect
Zimbra, Inc.
--------------------
Zimbra :: the leader in open source messaging and collaboration
8 years, 9 months
Freeing unused pages between two revisions...
by Emmanuel Lécharny
Hi guys,
we have had long discussions with Howard on LMDB about the release of
pages that are no longer in use. This is a critical part of an MVCC-based
B-tree, as each update copies many pages, and we want to re-use the
copied ones.
The problem is that those copied pages may still be in use by older
readers. Let's quickly explain how it all works.
At Ti, we are updating the B-tree, and that creates many new pages by
copying pages from previous revisions. We will denote those new pages by
the level of the B-tree they are stored at: P[k, a] is a page at level k
whose previous version was written at revision 'a' (where a <= i-1).
P[0, i-1] is the root page, at level 0; as the root page is copied on
every update, its previous version is necessarily from the immediately
preceding revision.
One more important point is that a copied page may be split into two new
pages (if the previous page was full), and that we may copy the content of
two old pages into one single new page (when the two old pages contain only
N/2 elements each and we removed one element from one of them). Last but
not least, we may have to remove one level, if the root page is removed.
All in all, what is important here is to *know* which pages have been
copied during the update operation, because those are the pages that
will have to be freed at some point.
Now, assuming we keep track of all the pages copied at a revision R,
we can also assume we can free all those pages when this revision R is
no longer in use (i.e. no reader thread is using R *and* no other reader
thread is using a revision below R). This is what LMDB does, AFAICT.
Can we go a bit further, and get rid of copied pages even while a reader
is still using a revision < R? I believe so.
Let's say that at revision R we have copied a page P[k, c], and let's say
we have a reader using a revision P < c (obviously, if a reader uses a
revision >= c and < R, we can't free P[k, c], because it's fully part of
that reader's B-tree). Since c > P, the copied page was created after
reader P started, so reader P has never seen it; once it has been copied
at R, no reader will ever use it again, and we can get rid of this copy.
case 1 : P is newer than the copied page

                      P           R
                      |           |
                      v           v
o--------[c]---------[P]---------[n]------> time
          ^                       |
          |                       +--> at R we create a new revision, which
          |                            copies page [c] and creates page [n]
          |
          +--> revision P is still using page [c]
Here, we can't get rid of page [c], as it's still part of revision P's B-tree.
case 2 : P is older than the copied page

          P           Q           R
          |           |           |
          v           v           v
o--------[P]---------[c]---------[n]------> time
                      ^           |
                      |           +--> at R we create a new revision, which
                      |                copies page [c] and creates page [n]
                      |
                      +--> page [c] was created by the update at Q;
                           reader P never sees it
Here, the update at revision Q has created a page [c] which was then
copied by the update at revision R. We can get rid of page [c] because
no reader will ever use it.
That is all clear and easy. Now, the real problem is: how do we determine
the pages we can free, based on the pages copied at revision R? At
revision R, we know which pages we have copied, and each page carries a
unique revision number. That lets us know which pages to free almost
immediately: we just free all the copied pages whose revision is newer
than the most recent reader's revision.
For instance, we can immediately get rid of the root page of any
revision not in use, because the root page is copied on every update.
That holds for any update being done when we don't have recent readers
(i.e., if we do many updates, this algorithm works very well).
But what if we have more than one pending reader? There is little we
can do then; we have to wait for one of the readers to be released,
because that is the point at which we will be able to free many pages.
Ok, let's see what happens when we release a reader revision now.
We know which pages this reader is holding: it's the list of pages
copied by the update which occurred at the same revision. When we release
this reader, all those copied pages have to be checked to see if they
are still in use.
     M                   P       Q    R
     |                   |       |    |
     v                   v       v    v
o---[M]---[c]----[d]----[P]-----[Q]--[n]------> time
           ^      ^                   |
           |      |                   +--> at R we create a new revision,
           |      |                        which copies page [c] and
           |      |                        creates page [n]
           |      +--> page [d] was copied by the update at Q, but is
           |           still held, because reader P is using it
           |
           +--> page [c] is still used by reader P (reader M is older
                than both [c] and [d])
Here, we will release the reader P. We should expect that every page
copied between M and P can now be released (in the graphic, revision Q
has copied page [d], which is still held, because revision P is using
it). So when we release P, we know we should free pages [c] and [d]. The
difficulty here is to list all the pages that are held by revision P,
and this is a costly process:
- check every revision > P which has copied pages whose revision is
above M and below P.
In order to do that we must have a data structure that holds all the
readers in use, ordered from the oldest to the newest, and we must have a
data structure keeping track of all the pages copied by each revision.
We have both, but the second one is not in memory (well, we can't assume
it's in memory, as this data structure is stored in a B-tree itself).
So basically, it's a matter of browsing the copied-pages B-tree looking
for all the revisions between [M] and [R], and for each of those
revisions, we can free all the copied pages whose revision is > M.
Not exactly inexpensive, but still acceptable, IMO.
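As a rough sketch of that scan (hypothetical types and helper names, not
actual LMDB code), following the rule stated above:

    typedef unsigned long rev_t;

    struct copied_page {
        rev_t page_rev;           /* revision at which this page version
                                   * was written */
        struct copied_page *next;
    };

    /* pages superseded (copied) by the update at revision r */
    extern struct copied_page *copied_pages_for(rev_t r);
    extern void free_page(struct copied_page *p);

    /* Called when the reader at revision 'released' (P above) goes away.
     * 'older' is the next older active reader (M), 'newest' the most
     * recent revision (R). */
    static void on_reader_release(rev_t released, rev_t older, rev_t newest)
    {
        rev_t r;
        struct copied_page *p;

        for (r = released + 1; r <= newest; r++)      /* revisions > P */
            for (p = copied_pages_for(r); p; p = p->next)
                if (p->page_rev > older && p->page_rev < released)
                    free_page(p);                     /* above M, below P */
    }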
Wdyt ?
Emmanuel Lécharny
8 years, 9 months
[LMDB] inconsistent mutex behavior and problems for M:N threading
by David Barbour
The code setting up the locks is `mdb_env_setup_locks`, circa line 4250 of
mdb.c. This code has a bunch of #ifdefs for POSIX vs Win32. A major
difference is that the POSIX locks are non-recursive by default (unless you
set a specific flag), whereas the CreateMutex call used on Win32 creates
recursive locks.
The MDB_NOTLS flag only affects readers.
When two writers operate in the same thread using M:N threading, a bug will
occur on Windows whenever two different lightweight 'writer' threads grab
the same mutex in the same OS thread, because both acquisitions will succeed.
Conversely, on POSIX systems a different bug will occur: the second writer
will attempt to grab the mutex, and this will lock up the OS thread...
which, unfortunately, happens to be the same thread that the first writer
needs to use to unlock the mutex. There is no way for a lightweight writer
to migrate to another thread and unlock the mutex there.
I believe both systems would behave more consistently and usefully if we
switched from mutexes to semaphores (of size 1). The difference is that a
semaphore doesn't track an 'owner' thread. This is also the suggested way
to avoid the recursive mutex behavior of Windows [1].
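As a minimal POSIX illustration (not LMDB code) of the key property: a
binary semaphore has no owner, so it can be released from a different OS
thread than the one that acquired it, and a second acquisition simply
blocks instead of succeeding recursively.

    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>

    static sem_t write_lock;

    static void *other_thread(void *arg)
    {
        (void)arg;
        sem_post(&write_lock);        /* release a lock taken by main() */
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        sem_init(&write_lock, 0, 1);  /* binary semaphore, initial value 1 */
        sem_wait(&write_lock);        /* "writer" acquires in this thread */
        pthread_create(&t, NULL, other_thread, NULL);
        pthread_join(&t, NULL);       /* released by the other thread */
        sem_wait(&write_lock);        /* and can be acquired again */
        printf("reacquired after cross-thread release\n");
        sem_destroy(&write_lock);
        return 0;
    }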
I would like to hear other opinions on this matter.
Regards,
Dave
[1]
http://stackoverflow.com/questions/1988324/how-to-alter-the-recursive-loc...
8 years, 9 months
List charter
by Howard Chu
Recent emails imply that a reminder is in order - this email list is for
discussion of actual development of OpenLDAP software. Discussion about
how OpenLDAP software works or how to use it belongs on the -technical
list. If you're not actively engaged in writing code for inclusion in
the OpenLDAP repos then you should not be posting questions here.
On a slightly more pointed note - if you post to this list and your
message is deemed off-topic or insufficiently researched, you *will* be
chided, mocked, and denigrated. There *is* such a thing as a stupid
question. If you don't read what's in front of you, if you ignore the
list charter, or the text of the welcome message that is sent to every
new subscriber, you will be publicly mocked and made unwelcome.
This is entirely intentional and by design. Ask good questions within
the bounds of the list charter and you will get good answers. If you're
not conscientious enough to pay attention to details then I don't want
you here. Demonstrating a basic ability to read is the minimum price of
admission.
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
8 years, 9 months
Valgrind report an error in sample-mdb.c
by yi huang
I'm not sure how this happens. Is it a problem?
$ make
$ gcc sample-mdb.c -L. -llmdb
$ valgrind ./a.out
==13010== Memcheck, a memory error detector
==13010== Copyright (C) 2002-2012, and GNU GPL'd, by Julian Seward et al.
==13010== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==13010== Command: ./a.out
==13010==
==13010== Syscall param pwrite64(buf) points to uninitialised byte(s)
==13010== at 0x42319D3: pwrite (pwrite.c:46)
==13010== by 0x403AB25: mdb_page_flush (mdb.c:3160)
==13010== by 0x4043AD0: mdb_txn_commit (mdb.c:3416)
==13010== by 0x8048A13: main (in /home/yihuang/src/mdb/libraries/liblmdb/a.out)
==13010== Address 0x43c35b4 is 4,084 bytes inside a block of size 4,096 alloc'd
==13010== at 0x402BE18: malloc (vg_replace_malloc.c:270)
==13010== by 0x403AE47: mdb_page_malloc.isra.13 (mdb.c:1615)
==13010== by 0x403D281: mdb_page_alloc.isra.22 (mdb.c:2120)
==13010== by 0x403D446: mdb_page_touch (mdb.c:2252)
==13010== by 0x403EEED: mdb_cursor_touch (mdb.c:6019)
==13010==
key: 0x4643fdc 020 , data: 0x4643fe0 020 3141592 foo bar
==13010==
==13010== HEAP SUMMARY:
==13010== in use at exit: 0 bytes in 0 blocks
==13010== total heap usage: 15 allocs, 15 frees, 1,586,385 bytes allocated
==13010==
==13010== All heap blocks were freed -- no leaks are possible
==13010==
==13010== For counts of detected and suppressed errors, rerun with: -v
==13010== Use --track-origins=yes to see where uninitialised values come from
==13010== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
8 years, 9 months
LMDB Getting transaction ID (feature request)
by David Barbour
At the moment, LMDB provides a means to view the last committed transaction
on the environment, but does not provide any means to reliably acquire the
ID of the 'current' transaction. I cannot check that the `me_last_txnid`
returned from mdb_env_info() matches the snapshot I'm actually looking at.
I can think of several scenarios where accessing the transaction ID
specific to the snapshot a transaction is viewing could be very useful.
First, if I want to promote a read-only transaction to a read-write
transaction, a very simple mechanism is to abort the read, create the
read-write transaction and compare the transaction IDs. If they're the
same, I can immediately continue since there is no risk of any changes.
Otherwise, I can restart the transaction with write enabled.
Second, if I want to build a layer above LMDB modeling concurrent
read-write transactions, I can potentially track (in several read-only
transactions) intended writes together with a bloom-filter for the keys
I've read. (There are other ways to do this, too, e.g. via a proper STM
layer.) It is feasible to logically serialize and merge a bunch of
non-conflicting write transactions as one larger write transaction. Such a
layer can potentially improve throughput and amortize synchronization
costs, albeit with a tradeoff for memory overheads and latency.
Third, if I want to poll the database, the ability to peek at the
transaction ID and see that nothing has changed could be very useful. In an
HTTP server, the transaction ID could serve as an (admittedly coarse
grained) e-tag.
Fortunately, implementing this feature request is trivial. It uses only
information we already have: mt_txnid. I'm asking for access to this value,
e.g. via:
txnid_t mdb_txn_id(MDB_txn* txn);
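For example, the promotion scenario above could then be written as the
following sketch against the proposed accessor (mdb_txn_id() is the
requested, hypothetical API; whether an unchanged snapshot shows up as an
equal ID or as seen + 1 on the write transaction is a detail of the
proposal):

    /* Promote a read-only txn to read-write, signalling the caller to
     * restart if anything was committed in between. */
    int promote(MDB_env *env, MDB_txn **txn)
    {
        txnid_t seen = mdb_txn_id(*txn);  /* snapshot we were reading */
        int rc;

        mdb_txn_abort(*txn);
        *txn = NULL;
        rc = mdb_txn_begin(env, NULL, 0, txn);
        if (rc)
            return rc;
        if (mdb_txn_id(*txn) != seen)     /* a writer committed in between */
            return -1;                    /* caller must redo its reads */
        return 0;                         /* unchanged: continue as a writer */
    }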
Will you make this happen?
Regards,
Dave
8 years, 9 months