openldap-devel January 2015

openldap-devel@openldap.org

8 participants
9 discussions

syncrepl consumer is slow
by Howard Chu 12 May '15


One thing I just noticed, while testing replication with 3 servers on my laptop - during a refresh, the provider gets blocked waiting to write to the consumers after writing about 4000 entries. I.e., the consumers aren't processing fast enough to keep up with the search running on the provider. (That's actually not too surprising since reads are usually faster than writes anyway.)

The consumer code has lots of problems as it is, just adding this note to the pile.

I'm considering adding an option to the consumer to write its entries with dbnosync during the refresh phase. The rationale being, there's nothing to lose anyway if the refresh is interrupted. I.e., the consumer can't update its contextCSN until the very end of the refresh, so any partial refresh that gets interrupted is wasted effort - the consumer will always have to start over from the beginning on its next refresh attempt. As such, there's no point in safely/synchronously writing any of the received entries - they're useless until the final contextCSN update.

The implementation approach would be to define a new control e.g. "fast write" for the consumer to pass to the underlying backend on any write op. We would also have to e.g. add an MDB_TXN_NOSYNC flag to mdb_txn_begin() (BDB already has the equivalent flag).

This would only be used for writes that are part of a refresh phase. In persist mode the provider and consumers' write speeds should be more closely matched so it wouldn't be necessary or useful.

Comments?

--
  -- Howard Chu
  CTO, Symas Corp. http://www.symas.com
  Director, Highland Sun http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP http://www.openldap.org/project/
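A minimal sketch of the trade-off described above, assuming plain POSIX file I/O rather than back-mdb or LMDB. The function name `refresh_write`, the record layout, and the file path are hypothetical; MDB_TXN_NOSYNC is only a proposal in this thread and is not used here. The sketch writes a batch of entries either with one fsync() per entry or with a single fsync() after the last entry, which is the point where a consumer would record its contextCSN.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical bulk writer: stores n fixed-size entries in the file at path.
 * sync_each != 0: fsync() after every entry (the fully synchronous path).
 * sync_each == 0: one fsync() after the last entry, i.e. just before the
 *                 point where the refresh would be marked complete.
 * Returns the number of fsync() calls issued, or -1 on error. */
static int refresh_write(const char *path, int n, int sync_each)
{
    char entry[64];
    int fd, i, syncs = 0;

    assert(path != NULL && n >= 0);
    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    for (i = 0; i < n; i++) {
        memset(entry, 0, sizeof(entry));
        snprintf(entry, sizeof(entry), "entry-%d", i);
        if (write(fd, entry, sizeof(entry)) != (ssize_t)sizeof(entry))
            goto fail;
        if (sync_each) {
            if (fsync(fd) != 0)
                goto fail;
            syncs++;
        }
    }
    if (!sync_each) {
        /* Deferred durability: everything becomes durable in one step. */
        if (fsync(fd) != 0)
            goto fail;
        syncs++;
    }
    close(fd);
    return syncs;

fail:
    close(fd);
    return -1;
}
```

Calling `refresh_write(path, 4000, 0)` issues one sync for the whole batch, where `refresh_write(path, 4000, 1)` issues 4000. If the deferred variant is interrupted, the partial file is simply discarded, which matches the observation that an interrupted refresh has to restart anyway.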

5 16

RE24 testing call #2 (2.4.41), LMDB RE0.9 testing call #2 (0.9.15)
by Quanah Gibson-Mount 03 Feb '15


OpenLDAP 2.4.41 Engineering
    Fixed libldap double free of request during abandon (ITS#7967)
    Fixed libldap segfault in ldap_sync_initialize (ITS#8001)
    Fixed libldap ldif-wrap off by one error (ITS#8003)
    Fixed libldap handling of TLS in async mode (ITS#8022)
    Fixed libldap null pointer dereference (ITS#8028)
    Fixed slapd slapadd onetime leak with -w (ITS#8014)
    Fixed slapd syncrepl delta-mmr issue with overlays and slapd.conf (ITS#7976)
    Fixed slapd syncrepl mutex for cookie state (ITS#7968)
    Fixed slapd-mdb one-level search (ITS#7975)
    Fixed slapd-mdb heap corruption (ITS#7965)
    Fixed slapd-mdb crash after deleting in-use schema (ITS#7995)
    Fixed slapd-mdb minor code cleanup (ITS#8011)
    Fixed slapd-mdb to return errors when using incorrect env flags (ITS#8016)
    Fixed slapd-meta TLS initialization with ldaps URIs (ITS#8022)
    Fixed slapo-collect segfault (ITS#7797)
    Fixed slapo-constraint with 0 count constraint (ITS#7780,ITS#7781)
    Fixed slapo-deref with empty attribute list (ITS#8027)
    Fixed slapo-syncprov syncprov_matchops usage of test_filter (ITS#8013)
    Fixed slapo-syncprov segfault on disconnect/abandon (ITS#5452,ITS#8012)
    Build Environment
        Enhanced contrib modules build paths (ITS#7782)
        Fixed contrib/autogroup internal operation identity (ITS#8006)
        Fixed contrib/passwd/sha2 compiler warning (ITS#8000)
        Fixed contrib/noopsrch compiler warning (ITS#7998)
        Fixed contrib/dupent compiler warnings (ITS#7997)
    Contrib
        Added pbkdf2 sha256 and sha512 schemes (ITS#7977)
    Documentation
        Added ldap_get_option(3) LDAP_FEATURE_INFO_VERSION information (ITS#8032)
        Added ldap_get_option(3) LDAP_OPT_API_INFO_VERSION information (ITS#8032)

LMDB 0.9.15 Release Engineering
    Fix txn init (ITS#7961,#7987)
    Fix MDB_PREV_DUP (ITS#7955,#7671)
    Fix compact of empty env (ITS#7956)
    Added workaround for fdatasync bug in ext3fs
    Build
        Don't use -fPIC for static lib
        Update .gitignore (ITS#7952,#7953)
        Cleanup for "make test" (ITS#7841)
        Misc. Android/Windows cleanup
    Documentation
        Fix MDB_APPEND doc
        Clarify mdb_dbi_open doc

--
Quanah Gibson-Mount
Platform Architect
Zimbra, Inc.
--------------------
Zimbra :: the leader in open source messaging and collaboration

4 3

LMDB and text encoding
by Timur Kristóf 02 Feb '15


Hi Everyone,

I've been talking to Howard about this and he suggested posting it to this mailing list. There are two things that I recently noticed about how LMDB works with various encodings, and I think they are worth discussing.

1. Database names

mdb_dbi_open treats its name parameter as a C string. This means UTF-8 on unixes and ANSI on Windows, which is problematic for cross-platform applications.

My suggestion is to create a variant of this function that also accepts a length parameter (or just uses MDB_val), so that instead of treating the name as a C string it would treat it as a series of bytes, allowing the user to use the encoding of their choice.

2. Path names

Functions like mdb_env_open, mdb_env_get_path, mdb_env_copy and the like accept a char* for path names. This is fine on most unixes, where char* is a UTF-8 string, but unfortunately these functions call the ANSI variants of the Windows API functions, making it impossible to use Unicode path names with them.

I think we should switch to the widechar APIs instead, but that would also mean changing the LMDB API to accept a wchar_t* parameter on Windows instead of char*.

What do you guys think about all this?

Best regards,
Timur Kristóf
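To illustrate point 1, here is a small sketch of why a C-string name breaks for encodings that contain NUL bytes, and how a length-delimited name avoids that. The `name_val` type and `name_equal` function are hypothetical; they are modeled on the size-plus-pointer shape of MDB_val and are not part of the LMDB API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical length-delimited name: a byte count plus a data pointer,
 * in the style of MDB_val. Not an LMDB type. */
typedef struct name_val {
    size_t      nv_size;
    const void *nv_data;
} name_val;

/* Compare two names as raw bytes; returns nonzero when they are equal. */
static int name_equal(const name_val *a, const name_val *b)
{
    assert(a != NULL && b != NULL);
    return a->nv_size == b->nv_size &&
           memcmp(a->nv_data, b->nv_data, a->nv_size) == 0;
}

/* "db1" and "db2" encoded as UTF-16LE: every second byte is NUL. */
static const char name1_utf16[] = { 'd', 0, 'b', 0, '1', 0 };
static const char name2_utf16[] = { 'd', 0, 'b', 0, '2', 0 };
```

Treated as C strings, both names end at the first NUL byte, so `strcmp(name1_utf16, name2_utf16)` reports them as equal and `strlen` reports a length of 1. Wrapped as `name_val` with `sizeof` as the length, all six bytes take part in the comparison and the two names are distinct.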

3 11

Re: Question regarding MDB_NOLOCK
by David Barbour 31 Jan '15


(This is a quick follow-up to an earlier discussion, a note left for anyone confused by MDB_NOLOCK.)

On Mon, Dec 1, 2014 at 6:55 AM, Hallvard Breien Furuseth <h.b.furuseth(a)usit.uio.no> wrote:

> A write transaction frees pages which its new snapshot cannot see.
> A later writer will overwrite them, when no *known* readers can see
> them either. But with MDB_NOLOCK, writers don't know about old
> readers and might overwrite pages which old readers can see.
>
> Last snapshot is never overwritten. So readers which did begin/renew
> after latest commit(write txn) are safe from txn_begin(write txn).
>
> The same with the commit of the write txn before that. I think.
> MDB keeps the last two snapshots in the metapages.

I've been reading the MDB source code a bit more to verify this assumption. It is not valid. MDB does keep the last two metapages, but may begin to dismantle the elder of the two for pages if there are no readers for it. With MDB_NOLOCK, it simply assumes there are no readers for it (cf. mdb_find_oldest).

**Only the most recent snapshot is preserved.**

For my current use case, I believe that I can still achieve a sufficient level of parallelism even if limited to double-buffering (whereas two snapshots would give me triple-buffering). I'm not going to press for any changes at this time.
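The behavior described above can be sketched with a simplified model. This is not the LMDB source: the function `find_oldest`, its parameters, and the convention that a slot value of 0 means "unused" are assumptions made for illustration, loosely following the role the post attributes to mdb_find_oldest. A NULL reader table stands in for MDB_NOLOCK, where the writer has no record of readers.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long txnid_t;

/* Simplified model of a writer computing the oldest snapshot it must
 * still treat as in use.
 *   write_txnid: ID of the current write transaction (must be > 0).
 *   readers:     snapshot IDs held by known readers, 0 for an unused
 *                slot; NULL when there is no reader table at all,
 *                which models MDB_NOLOCK.
 * The starting point is the snapshot the writer itself builds on
 * (write_txnid - 1); known readers can only push the result further
 * back. Without a reader table nothing pushes it back. */
static txnid_t find_oldest(txnid_t write_txnid,
                           const txnid_t *readers, size_t nreaders)
{
    txnid_t oldest;
    size_t i;

    assert(write_txnid > 0);
    oldest = write_txnid - 1;
    if (readers != NULL) {
        for (i = 0; i < nreaders; i++) {
            if (readers[i] != 0 && readers[i] < oldest)
                oldest = readers[i];
        }
    }
    return oldest;
}
```

With a reader table holding snapshots 7 and 9, a writer running as transaction 10 gets 7 and holds back accordingly. With no table it gets 9 even if some reader is in fact still on snapshot 7, which is the hazard the thread describes for MDB_NOLOCK.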

3 9

Fwd: Question regarding MDB_NOLOCK
by David Barbour 31 Jan '15


On Sat, Jan 31, 2015 at 12:54 AM, Howard Chu <hyc(a)symas.com> wrote:

>> My earlier assumption (before reading mdb_page_alloc) was that LMDB
>> would be aggressive about grabbing pages freed by transactions that are
>> not actively being read. If we're relying on `last < oldest` to create a
>> two page discrepancy, this means when we actually have readers on older
>> transactions that we're being little more conservative than necessary.
>
> More than necessary? I don't think so.

You'll conserve exactly one more transaction's free pages than necessary in the case where a reader-lock exists on any transaction older than the most recent snapshot.

1 0

wasteful data structures: AVL tree
by Howard Chu 29 Jan '15


ITS#8038 (syncrepl hanging onto its presentlist) only came to my attention due to the amount of memory involved. On a refresh of a DB with 2.8M entries I saw the consumer using about 320MB just for the presentlist. This list consists solely of 16 byte entryUUIDs; 2.8M items should have used no more than 48MB. An AVL node itself is 28 bytes on a 64-bit platform, plus 16 bytes for the struct berval wrapped around the UUID.

I'm looking into adding an in-memory B+tree library to liblutil. For the type of fixed-size records we're usually storing in AVL trees, a Btree will be much more compact and higher performance since it will need rebalancing far less frequently.

--
  -- Howard Chu
  CTO, Symas Corp. http://www.symas.com
  Director, Highland Sun http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP http://www.openldap.org/project/
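To make the space argument concrete, here is a sketch of the most compact layout for fixed-size records: a flat sorted array searched with bsearch(). This is deliberately not a B+tree and not the library proposed above; it is the simplest structure that stores each 16-byte UUID with no per-item node or berval wrapper. The names `uuid_set`, `uuid_set_add`, `uuid_set_sort`, and `uuid_set_contains` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define UUID_LEN 16

/* Flat array of 16-byte UUIDs: exactly UUID_LEN bytes per stored item,
 * plus unused capacity left over from geometric growth. */
typedef struct uuid_set {
    unsigned char (*items)[UUID_LEN];
    size_t count;
    size_t cap;
} uuid_set;

static int uuid_cmp(const void *a, const void *b)
{
    return memcmp(a, b, UUID_LEN);
}

/* Append one UUID (unsorted). Returns 0 on success, -1 on allocation failure. */
static int uuid_set_add(uuid_set *s, const unsigned char *uuid)
{
    assert(s != NULL && uuid != NULL);
    if (s->count == s->cap) {
        size_t ncap = s->cap ? s->cap * 2 : 1024;
        void *p = realloc(s->items, ncap * UUID_LEN);
        if (p == NULL)
            return -1;
        s->items = p;
        s->cap = ncap;
    }
    memcpy(s->items[s->count++], uuid, UUID_LEN);
    return 0;
}

/* Sort once after all items have been added. */
static void uuid_set_sort(uuid_set *s)
{
    if (s->count > 0)
        qsort(s->items, s->count, UUID_LEN, uuid_cmp);
}

/* Binary search; only valid after uuid_set_sort(). */
static int uuid_set_contains(const uuid_set *s, const unsigned char *uuid)
{
    if (s->count == 0)
        return 0;
    return bsearch(uuid, s->items, s->count, UUID_LEN, uuid_cmp) != NULL;
}
```

At 16 bytes per item, 2.8M UUIDs occupy 44.8MB of payload, in line with the 48MB bound quoted above. The trade-off is that a flat array suits a build-then-query pattern; interleaved inserts and lookups would need the kind of tree structure the post proposes.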

2 3

Re: openldap.git branch RE24 created. OPENLDAP_REL_ENG_2_4_40-99-gf5c0af5
by Howard Chu 18 Jan '15


openldap-commit2devel(a)OpenLDAP.org wrote:

> A ref change was pushed to the OpenLDAP (openldap.git) repository.
> It will be available in the public mirror shortly.
>
> The branch, RE24 has been created
> at f5c0af56d605fdad778db1eecad2ef12ca03760b (commit)

My mistake, disregard. Branch now deleted.

--
  -- Howard Chu
  CTO, Symas Corp. http://www.symas.com
  Director, Highland Sun http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP http://www.openldap.org/project/

1 0

ext3/ext4 fsync hack (was: openldap.git branch mdb.master updated. 0018eeb2c3b2239c30def9d47c9d194a4ebf35fe)
by Hallvard Breien Furuseth 06 Jan '15


On 18/12/14 05:40, openldap-commit2devel(a)OpenLDAP.org wrote:

> commit 0018eeb2c3b2239c30def9d47c9d194a4ebf35fe
> Author: Howard Chu <hyc(a)openldap.org>
> Date: Thu Dec 18 04:38:53 2014 +0000
>
> Hack for potential ext3/ext4 corruption issue
>
> Use regular fsync() if we think this commit grew the DB file.

This does not catch all cases: If the new pages below mt_next_pgno were freed instead of written, me_size becomes too big. Later when the file does grow, me_size may be >= actual filesize so it fdatasync()s. Similar to b09e46904c1c059bd5086243e3915b6be510e57d "ITS#7886 fix mdb_copy write size".

We can fix me_size, grow the file anyway (ftruncate), or give the pages back to mt_next_pgno in mdb_freelist_save().

Another issue: After an MDB_NOSYNC commit, mdb_env_sync() only fdatasync()s. It does not know when the file grew. The planned "group commits" may get the same problem if the user checkpoints with mdb_env_sync().

2 3

LMDB crash consistency, again
by Howard Chu 05 Jan '15


This paper https://www.usenix.org/conference/osdi14/technical-sessions/presentation/zh… describes a potential crash vulnerability in LMDB due to its use of fdatasync instead of fsync when syncing writes to the data file. The vulnerability exists because fdatasync omits syncs of the file metadata; if the data file needed to grow as a result of any writes then this requires a metadata update.

This is a well-understood issue in LMDB; we briefly touched on it in this earlier email thread http://www.openldap.org/lists/openldap-technical/201402/msg00111.html and it's been a topic of discussion on IRC ever since the first multi-FS microbenchmarks we conducted back in 2012. http://symas.com/mdb/microbench/july/

It's worth noting that this vulnerability doesn't exist on Windows, MacOSX, Android, or *BSD, because none of these OSs have a function equivalent to fdatasync in the first place - they always use fsync (or the Windows equivalent). (Android is an oddball; the underlying Linux kernel of course supports fdatasync, but the C library, bionic, does not.)

We have a couple approaches for Linux:

1) provide an option to preallocate the file, using fallocate(). Unfortunately this doesn't completely eliminate metadata updates - filesystem drivers tend to try to be "smart" and make fallocate cheap; they allocate the space in the FS metadata but they also mark it as "unseen." The first time a process accesses an unseen page, it gets zeroed out. Up until that point, whatever old contents of the disk page are still present. The act of marking a page from "unseen" to "seen" requires a metadata update of its own. We had a discussion of this FS mis-feature a while ago, but it was fruitless. https://lkml.org/lkml/2012/12/7/396

2) preallocate the file by explicitly writing zeros to it. This has a couple other disadvantages:
   a) on SSDs, doing such a write needlessly contributes to wearout of the flash.
   b) Windows detects all-zero writes and compresses them out, creating a sparse file, thus defeating the attempt at preallocation.

3) track the allocated size of the file, and toggle between fsync and fdatasync depending on whether the allocated size actually grows or not.

This is the approach I'm currently taking in a development branch. Whether we add this to a new 0.9.x release, or just in 1.0, I haven't yet decided.

As another footnote, I plan to add support for LMDB on a raw partition in 1.x. Naturally, fsync vs fdatasync will be irrelevant in that case.

--
  -- Howard Chu
  CTO, Symas Corp. http://www.symas.com
  Director, Highland Sun http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP http://www.openldap.org/project/
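A minimal sketch of approach 3, assuming a POSIX system that provides fdatasync() (Linux, per the discussion above). It is not the code from the development branch: the names `sync_state`, `open_data_file`, and `sync_data_file` are hypothetical, and the sketch asks the kernel for the current size with fstat() instead of tracking the size internally as LMDB would.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Remembers the file size as of the last full fsync(). */
typedef struct sync_state {
    off_t synced_size;
} sync_state;

/* Hypothetical helper: create or truncate a data file for the demo. */
static int open_data_file(const char *path)
{
    assert(path != NULL);
    return open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
}

/* Flush pending writes on fd.
 * If the file has grown since the last full sync, its metadata (size)
 * changed, so use fsync(). Otherwise only data changed and fdatasync()
 * is enough.
 * Returns 1 if fsync() was used, 0 if fdatasync() was used, -1 on error. */
static int sync_data_file(int fd, sync_state *st)
{
    struct stat sb;

    assert(st != NULL);
    if (fstat(fd, &sb) != 0)
        return -1;
    if (sb.st_size > st->synced_size) {
        if (fsync(fd) != 0)
            return -1;
        st->synced_size = sb.st_size;
        return 1;
    }
    if (fdatasync(fd) != 0)
        return -1;
    return 0;
}
```

A commit that appends a page takes the fsync() path once; later commits that only overwrite existing pages take the cheaper fdatasync() path until the file grows again.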

2 9
