syncrepl consumer is slow
by Howard Chu
One thing I just noticed, while testing replication with 3 servers on my
laptop - during a refresh, the provider gets blocked waiting to write to
the consumers after writing about 4000 entries. I.e., the consumers
aren't processing fast enough to keep up with the search running on the
provider.
(That's actually not too surprising since reads are usually faster than
writes anyway.)
The consumer code already has plenty of problems; I'm just adding this
note to the pile.
I'm considering adding an option to the consumer to write its entries
with dbnosync during the refresh phase. The rationale being, there's
nothing to lose anyway if the refresh is interrupted. I.e., the consumer
can't update its contextCSN until the very end of the refresh, so any
partial refresh that gets interrupted is wasted effort - the consumer
will always have to start over from the beginning on its next refresh
attempt. As such, there's no point in safely/synchronously writing any
of the received entries - they're useless until the final contextCSN update.
The implementation approach would be to define a new control, e.g. "fast
write", for the consumer to pass to the underlying backend on any write
op. We would also have to add something like an MDB_TXN_NOSYNC flag to
mdb_txn_begin() (BDB already has the equivalent flag).
This would only be used for writes that are part of a refresh phase. In
persist mode the provider and consumers' write speeds should be more
closely matched so it wouldn't be necessary or useful.
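The trade-off described above can be illustrated at the plain file level.
This is only a minimal sketch under stated assumptions: it uses POSIX
write()/fsync() rather than any slapd or LMDB code, and write_records() and
its parameters are hypothetical names invented for the illustration. The
MDB_TXN_NOSYNC flag mentioned above is a proposal, so it is deliberately
not used here.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper (not part of slapd or LMDB): append `count`
 * fixed-size records to `path`.
 * sync_each != 0: fsync() after every record (the normal, safe mode).
 * sync_each == 0: fsync() once after the last record, mirroring the idea
 *                 that nothing is usable until the refresh completes.
 * Returns the number of fsync() calls made, or -1 on error. */
int write_records(const char *path, int count, int sync_each)
{
    char rec[64];
    int syncs = 0;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    memset(rec, 'x', sizeof rec);

    for (int i = 0; i < count; i++) {
        if (write(fd, rec, sizeof rec) != (ssize_t)sizeof rec)
            goto fail;
        if (sync_each) {
            if (fsync(fd) != 0)
                goto fail;
            syncs++;
        }
    }
    if (!sync_each) {
        /* Single durability point at the very end of the batch. */
        if (fsync(fd) != 0)
            goto fail;
        syncs++;
    }
    close(fd);
    return syncs;

fail:
    close(fd);
    return -1;
}
```

With 4000 records the first mode issues 4000 syncs and the second issues
one; if the batch is interrupted in the second mode, the partial file is
simply discarded, just as an interrupted refresh is.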
Comments?
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
8 years
RE24 testing call #2 (2.4.41), LMDB RE0.9 testing call #2 (0.9.15)
by Quanah Gibson-Mount
OpenLDAP 2.4.41 Engineering
Fixed libldap double free of request during abandon (ITS#7967)
Fixed libldap segfault in ldap_sync_initialize (ITS#8001)
Fixed libldap ldif-wrap off by one error (ITS#8003)
Fixed libldap handling of TLS in async mode (ITS#8022)
Fixed libldap null pointer dereference (ITS#8028)
Fixed slapd slapadd onetime leak with -w (ITS#8014)
Fixed slapd syncrepl delta-mmr issue with overlays and slapd.conf
(ITS#7976)
Fixed slapd syncrepl mutex for cookie state (ITS#7968)
Fixed slapd-mdb one-level search (ITS#7975)
Fixed slapd-mdb heap corruption (ITS#7965)
Fixed slapd-mdb crash after deleting in-use schema (ITS#7995)
Fixed slapd-mdb minor code cleanup (ITS#8011)
Fixed slapd-mdb to return errors when using incorrect env flags
(ITS#8016)
Fixed slapd-meta TLS initialization with ldaps URIs (ITS#8022)
Fixed slapo-collect segfault (ITS#7797)
Fixed slapo-constraint with 0 count constraint (ITS#7780,ITS#7781)
Fixed slapo-deref with empty attribute list (ITS#8027)
Fixed slapo-syncprov syncprov_matchops usage of test_filter
(ITS#8013)
Fixed slapo-syncprov segfault on disconnect/abandon
(ITS#5452,ITS#8012)
Build Environment
Enhanced contrib modules build paths (ITS#7782)
Fixed contrib/autogroup internal operation identity
(ITS#8006)
Fixed contrib/passwd/sha2 compiler warning (ITS#8000)
Fixed contrib/noopsrch compiler warning (ITS#7998)
Fixed contrib/dupent compiler warnings (ITS#7997)
Contrib
Added pbkdf2 sha256 and sha512 schemes (ITS#7977)
Documentation
Added ldap_get_option(3) LDAP_FEATURE_INFO_VERSION
information (ITS#8032)
Added ldap_get_option(3) LDAP_OPT_API_INFO_VERSION
information (ITS#8032)
LMDB 0.9.15 Release Engineering
Fix txn init (ITS#7961,#7987)
Fix MDB_PREV_DUP (ITS#7955,#7671)
Fix compact of empty env (ITS#7956)
Added workaround for fdatasync bug in ext3fs
Build
Don't use -fPIC for static lib
Update .gitignore (ITS#7952,#7953)
Cleanup for "make test" (ITS#7841)
Misc. Android/Windows cleanup
Documentation
Fix MDB_APPEND doc
Clarify mdb_dbi_open doc
--
Quanah Gibson-Mount
Platform Architect
Zimbra, Inc.
--------------------
Zimbra :: the leader in open source messaging and collaboration
8 years, 4 months
LMDB and text encoding
by Timur Kristóf
Hi Everyone,
I've been talking to Howard about this and he suggested posting it to
this mailing list. There are two things that I recently noticed about
how LMDB works with various encodings, and I think they are worth
discussing.
1. Database names
mdb_dbi_open treats its name parameter as a C string. This means UTF-8 on
unixes and ANSI on Windows, which is problematic for cross-platform
applications.
My suggestion is to create a variant of this function that also
accepts a length parameter (or just use MDB_val) so that instead of
treating it as a C string, it would treat it like a series of bytes,
allowing the user to use the encoding of their choice.
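To make the difference concrete, here is a minimal sketch, assuming a
hypothetical name_val struct shaped like MDB_val and two hypothetical
comparison helpers; none of these are LMDB functions. It shows two UTF-16LE
names that a NUL-terminated API cannot tell apart but a length-delimited
one can.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical length-delimited name, shaped like LMDB's MDB_val. */
typedef struct name_val {
    size_t      nv_size;  /* number of bytes in nv_data */
    const void *nv_data;  /* raw bytes, any encoding, may contain NULs */
} name_val;

/* Compare two names as raw byte strings. Returns 1 if equal, else 0. */
int name_equal(const name_val *a, const name_val *b)
{
    return a->nv_size == b->nv_size &&
           memcmp(a->nv_data, b->nv_data, a->nv_size) == 0;
}

/* Compare two names the way a C-string API sees them: everything after
 * the first NUL byte is ignored. Returns 1 if equal, else 0. */
int cstr_equal(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

For example, the UTF-16LE encodings of "ab" and "ac" both start with the
bytes 'a', 0, so cstr_equal() treats them as the same name, while
name_equal() compares all four bytes and tells them apart.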
2. Path names
Functions like mdb_env_open, mdb_env_get_path, mdb_env_copy and the
like accept a char* for path names. This is fine on most unixes, where
char* is a UTF-8 string, but unfortunately these functions call the
ANSI variants of the Windows API functions, making it impossible to
use Unicode path names with them.
I think we should switch to the widechar APIs instead, but that would
also mean changing the LMDB API to accept a wchar_t* parameter on
Windows instead of char*.
What do you guys think about all this?
Best regards,
Timur Kristóf
8 years, 4 months
Re: Question regarding MDB_NOLOCK
by David Barbour
(This is a quick follow-up to an earlier discussion, a note left for anyone
confused by MDB_NOLOCK.)
On Mon, Dec 1, 2014 at 6:55 AM, Hallvard Breien Furuseth
<h.b.furuseth(a)usit.uio.no> wrote:
>
> A write transaction frees pages which its new snapshot cannot see.
> A later writer will overwrite them, when no *known* readers can see
> them either. But with MDB_NOLOCK, writers don't know about old
> readers and might overwrite pages which old readers can see.
>
> Last snapshot is never overwritten. So readers which did begin/renew
> after latest commit(write txn) are safe from txn_begin(write txn).
>
> The same with the commit of the write txn before that. I think.
> MDB keeps the last two snapshots in the metapages.
I've been reading the MDB source code a bit more to verify this assumption.
It is not valid.
MDB does keep the last two metapages, but may begin to reclaim pages from
the elder of the two snapshots if there are no readers for it. With
MDB_NOLOCK, it simply assumes there are no readers for it (cf.
mdb_find_oldest). **Only the most recent snapshot is preserved.**
For my current use case, I believe that I can still achieve a sufficient
level of parallelism even if limited to double-buffering (whereas two
snapshots would give me triple-buffering). I'm not going to press for any
changes at this time.
8 years, 4 months
Fwd: Question regarding MDB_NOLOCK
by David Barbour
On Sat, Jan 31, 2015 at 12:54 AM, Howard Chu <hyc(a)symas.com> wrote:
>> My earlier assumption (before reading mdb_page_alloc) was that LMDB
>> would be aggressive about grabbing pages freed by transactions that are
>> not actively being read. If we're relying on `last < oldest` to create a
>> two page discrepancy, this means when we actually have readers on older
>> transactions that we're being a little more conservative than necessary.
>
> More than necessary? I don't think so.
You'll conserve exactly one more transaction's free pages than necessary in
the case where a reader-lock exists on any transaction older than the most
recent snapshot.
8 years, 4 months
wasteful data structures: AVL tree
by Howard Chu
ITS#8038 (syncrepl hanging onto its presentlist) only came to my
attention due to the amount of memory involved. On a refresh of a DB
with 2.8M entries I saw the consumer using about 320MB just for the
presentlist. This list consists solely of 16-byte entryUUIDs; 2.8M items
should have used no more than 48MB. An AVL node itself is 28 bytes on a
64-bit platform, plus 16 bytes for the struct berval wrapped around the
UUID.
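The arithmetic behind those figures can be checked with a small sketch. The
three sizes are the ones quoted above; the function names are hypothetical,
and adding node, berval and UUID together per entry is my reading of the
post, not a measurement of the real structures.

```c
/* Sizes quoted in the post (64-bit platform). */
enum {
    UUID_BYTES     = 16,  /* entryUUID payload */
    AVL_NODE_BYTES = 28,  /* one AVL node */
    BERVAL_BYTES   = 16   /* struct berval wrapped around the UUID */
};

/* Bytes needed if the UUIDs were simply stored back to back. */
unsigned long long packed_bytes(unsigned long long entries)
{
    return entries * UUID_BYTES;
}

/* Bytes needed for one AVL node + one berval + one UUID per entry,
 * ignoring any allocator overhead (assumption: three separate pieces). */
unsigned long long avl_bytes(unsigned long long entries)
{
    return entries * (AVL_NODE_BYTES + BERVAL_BYTES + UUID_BYTES);
}
```

For 2.8M entries, packed_bytes() gives 44,800,000 bytes, in line with the
"no more than 48MB" estimate, and avl_bytes() gives 168,000,000 bytes. The
gap between that and the observed 320MB is not explained by these three
sizes alone; per-allocation overhead and alignment are plausible
contributors, but that is an assumption.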
I'm looking into adding an in-memory B+tree library to liblutil. For the
type of fixed-size records we're usually storing in AVL trees, a Btree
will be much more compact and higher performance since it will need
rebalancing far less frequently.
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
8 years, 4 months
ext3/ext4 fsync hack (was: openldap.git branch mdb.master updated. 0018eeb2c3b2239c30def9d47c9d194a4ebf35fe)
by Hallvard Breien Furuseth
On 18/12/14 05:40, openldap-commit2devel(a)OpenLDAP.org wrote:
> commit 0018eeb2c3b2239c30def9d47c9d194a4ebf35fe
> Author: Howard Chu <hyc(a)openldap.org>
> Date: Thu Dec 18 04:38:53 2014 +0000
>
> Hack for potential ext3/ext4 corruption issue
>
> Use regular fsync() if we think this commit grew the DB file.
This does not catch all cases:
If the new pages below mt_next_pgno were freed instead of
written, me_size becomes too big. Later when the file does
grow, me_size may be >= actual filesize so it fdatasync()s.
Similar to b09e46904c1c059bd5086243e3915b6be510e57d
"ITS#7886 fix mdb_copy write size".
We can fix me_size, grow the file anyway (ftruncate), or
give the pages back to mt_next_pgno in mdb_freelist_save().
Another issue: After an MDB_NOSYNC commit, mdb_env_sync()
only fdatasync()s. It does not know when the file grew.
The planned "group commits" may get the same problem if
the user checkpoints with mdb_env_sync().
8 years, 4 months
LMDB crash consistency, again
by Howard Chu
This paper
https://www.usenix.org/conference/osdi14/technical-sessions/presentation/...
describes a potential crash vulnerability in LMDB due to its use of fdatasync
instead of fsync when syncing writes to the data file. The vulnerability
exists because fdatasync omits syncs of the file metadata; if the data file
needed to grow as a result of any writes then this requires a metadata update.
This is a well-understood issue in LMDB; we briefly touched on it in this
earlier email thread
http://www.openldap.org/lists/openldap-technical/201402/msg00111.html and it's
been a topic of discussion on IRC ever since the first multi-FS
microbenchmarks we conducted back in 2012. http://symas.com/mdb/microbench/july/
It's worth noting that this vulnerability doesn't exist on Windows, MacOSX,
Android, or *BSD, because none of these OSs have a function equivalent to
fdatasync in the first place - they always use fsync (or the Windows
equivalent). (Android is an oddball; the underlying Linux kernel of course
supports fdatasync, but the C library, bionic, does not.)
We have a few possible approaches for Linux:
1) provide an option to preallocate the file, using fallocate().
Unfortunately this doesn't completely eliminate metadata updates - filesystem
drivers tend to try to be "smart" and make fallocate cheap; they allocate the
space in the FS metadata but they also mark it as "unseen." The first time a
process accesses an unseen page, it gets zeroed out. Up until that point,
the old contents of the disk page are still present. Changing a page's
marking from "unseen" to "seen" requires a metadata update of its own.
We had a discussion of this FS mis-feature a while ago, but it was fruitless.
https://lkml.org/lkml/2012/12/7/396
2) preallocate the file by explicitly writing zeros to it. This has a
couple other disadvantages:
a) on SSDs, doing such a write needlessly contributes to wearout of the
flash.
b) Windows detects all-zero writes and compresses them out, creating a
sparse file, thus defeating the attempt at preallocation.
3) track the allocated size of the file, and toggle between fsync and
fdatasync depending on whether the allocated size actually grows or not. This
is the approach I'm currently taking in a development branch. Whether we add
this to a new 0.9.x release, or just in 1.0, I haven't yet decided.
As another footnote, I plan to add support for LMDB on a raw partition in 1.x.
Naturally, fsync vs fdatasync will be irrelevant in that case.
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
8 years, 5 months