On 05/16/2014 08:09 PM, Howard Chu wrote:
> rsbx(a)acm.org wrote:
>> Full_Name: Raymond S Brand
>> Version: mdb.master
>> OS: linux
>> URL: ftp://ftp.openldap.org/incoming/
>> Submission from: (NULL) (107.145.137.13)
>>
>>
>> [I would have uploaded this to the ftp.openldap.org but there was no
>> space.]
>>
>> As implemented, the mdb readers/writer exclusion protocol has a race
>> condition
>> that could result in a writer reclaiming and over-writing pages still
>> in use
>> by a reader.
>
> Yes, you're basically correct. But this is unlikely in practice because
> the reader thread has no blocking calls in its codepath, while the
> writer must acquire locks etc. Generally the only way for two write txns
> to complete while a reader thread is stalled is if you explicitly send a
> SIGSTOP to the reader thread while it's in its critical section.
Or just unlucky process scheduling.
>
>>
>> Pseudo code of the snapshot locking protocols of reader and writer
>> transactions. The labeled sections, for the purposes of this analysis,
>> can be
>> assumed to execute atomically.
>>
>> READER
>> ======
>>
>> *R1* t0 = meta[0].txid; t1 = meta[1].txid
>> if (t1 > t0)
>> t0 = t1
>>
>> *R2* readers[me].txid = t0
>>
>> *R3* snapshot_root = meta[t0&1].snapshot
>>
>> *R4* /* lookup data */
>>
>> *R5* readers[me].txid = -1 // Release snapshot
>>
>> WRITER
>> ======
>>
>> *W1* lock(writer, EXCLUSIVE)
>> curr_txid = meta[0].txid
>> if (meta[1].txid > curr_txid)
>> curr_txid = meta[1].txid
>>
>> *W2* oldest = curr_txid
>> for (i=0; i<reader_slots; i++)
>> t = readers[i].txid
>> if (t != -1 && t < oldest)
>> oldest = t
>>
>> *W3* reclaim_old_pages(oldest)
>>
>> /*
>> ** Commit new transaction
>> */
>>
>> *W4* unlock(writer)
>>
>> ---------------------------------------------------------------------------
>>
>>
>> Adversarial scheduling analysis:
>>
>> The following timeline demonstrates that a writer can reclaim pages in
>> use by a
>> reader.
>>
>> T0 T1 T2 T3 T4 T5 T6 T7 T8 T9 T10
>> R1 R2 R3 R4
>> W1 W2 W3 W4 * W1 W2 W3
>>
>> T0 Reader has found the latest txid, Xn.
>>
>> T1->T4 Reader is not scheduled to run and a write transaction
>> commits, Xn+1.
>>
>> T5 Reader is still not scheduled and another write transaction
>> starts.
>>
>> T6 Writer finds that the oldest referenced transaction is the last
>> committed transaction, Xn+1.
>>
>> T7 Reader records Xn in the reader table.
>>
>> T8 Reader gets the snapshot root page for transaction Xn.
>>
>> T9 Writer, believing that Xn+1 is the oldest reference transaction,
>> reclaims the pages of transaction Xn.
>>
>>
>> T10 Reader attempts to navigate pages of a transaction, Xn, that
>> has been
>> reclaimed and "Bad Things Happen".
>>
>> In fact, bad things will happen if an even number of write
>> transactions commit
>> between W4 and W1, represented by the '*', in the timeline above.
>>
>> The fix is to search for the highest txid in R1 and between R2 and R3.
>> Optionally, followed by recording the txid, of the snapshot root found
>> in R3,
>> in the reader's reader table slot to, possibly, increase the number of
>> reclaimable transactions.
>>
>> The lack of compiler and memory barriers in the implementation of the
>> locking
>> protocol is also of concern.
>>
>> Beyond the above, the code in mdb_txn_renew0() after the
>> "/* Copy the DB info and flags */"
>> comment appears to have a number of data races.
>>
>> ---
>> libraries/liblmdb/mdb.c | 24 +++++++++++++++---------
>> 1 file changed, 15 insertions(+), 9 deletions(-)
>>
>> diff --git a/libraries/liblmdb/mdb.c b/libraries/liblmdb/mdb.c
>> index e0f551e..908417c 100644
>> --- a/libraries/liblmdb/mdb.c
>> +++ b/libraries/liblmdb/mdb.c
>> @@ -533,11 +533,11 @@ typedef struct MDB_rxbody {
>> * started from so we can avoid overwriting any data
>> used in that
>> * particular version.
>> */
>> - txnid_t mrb_txnid;
>> + volatile txnid_t mrb_txnid;
>> /** The process ID of the process owning this reader txn. */
>> - MDB_PID_T mrb_pid;
>> + volatile MDB_PID_T mrb_pid;
>> /** The thread ID of the thread owning this txn. */
>> - pthread_t mrb_tid;
>> + volatile pthread_t mrb_tid;
>> } MDB_rxbody;
>>
>> /** The actual reader record, with cacheline padding. */
>> @@ -585,12 +585,12 @@ typedef struct MDB_txbody {
>> * This is recorded here only for convenience;
>> the value
>> can always
>> * be determined by reading the main database
>> meta pages.
>> */
>> - txnid_t mtb_txnid;
>> + volatile txnid_t mtb_txnid;
>> /** The number of slots that have been used in the
>> reader
>> table.
>> * This always records the maximum count, it is not
>> decremented
>> * when readers release their slots.
>> */
>> - unsigned mtb_numreaders;
>> + volatile unsigned mtb_numreaders;
>> } MDB_txbody;
>>
>> /** The actual reader table definition. */
>> @@ -854,7 +854,7 @@ typedef struct MDB_meta {
>> /** Any persistent environment flags. @ref mdb_env */
>> #define mm_flags mm_dbs[0].md_flags
>> pgno_t mm_last_pg; /**< last
>> used page in
>> file */
>> - txnid_t mm_txnid; /**< txnid that
>> committed this page */
>> + volatile txnid_t mm_txnid; /**< txnid that
>> committed this
>> page */
>> } MDB_meta;
>>
>> /** Buffer for a stack-allocated meta page.
>> @@ -2303,10 +2303,12 @@ mdb_txn_renew0(MDB_txn *txn)
>>
>> if (txn->mt_flags & MDB_TXN_RDONLY) {
>> if (!ti) {
>> + /* No readers table; app responsible for
>> locking */
>> meta = env->me_metas[ mdb_env_pick_meta(env) ];
>> txn->mt_txnid = meta->mm_txnid;
>> txn->mt_u.reader = NULL;
>> } else {
>> + /* Has readers table */
>> MDB_reader *r = (env->me_flags & MDB_NOTLS) ?
>> txn->mt_u.reader :
>> pthread_getspecific(env->me_txkey);
>> if (r) {
>> @@ -2347,8 +2349,12 @@ mdb_txn_renew0(MDB_txn *txn)
>> return rc;
>> }
>> }
>> - txn->mt_txnid = r->mr_txnid = ti->mti_txnid;
>> txn->mt_u.reader = r;
>> + r->mr_txnid = ti->mti_txnid;
>> +
>> + /* Really need a memory barrier here */
>> +
>> + txn->mt_txnid = r->mr_txnid = ti->mti_txnid;
>> meta = env->me_metas[txn->mt_txnid & 1];
>> }
>> } else {
>> @@ -2376,10 +2382,10 @@ mdb_txn_renew0(MDB_txn *txn)
>> }
>>
>> /* Copy the DB info and flags */
>> - memcpy(txn->mt_dbs, meta->mm_dbs, 2 * sizeof(MDB_db));
>> + memcpy(txn->mt_dbs, meta->mm_dbs, 2 * sizeof(MDB_db)); /*
>> Racy */
>>
>> /* Moved to here to avoid a data race in read TXNs */
>> - txn->mt_next_pgno = meta->mm_last_pg+1;
>> + txn->mt_next_pgno = meta->mm_last_pg+1; /*
>> Racy */
>>
>> for (i=2; i<txn->mt_numdbs; i++) {
>> x = env->me_dbflags[i];
>>
>
>
rsbx(a)acm.org wrote:
> Full_Name: Raymond S Brand
> Version: mdb.master
> OS: linux
> URL: ftp://ftp.openldap.org/incoming/
> Submission from: (NULL) (107.145.137.13)
>
>
> [I would have uploaded this to the ftp.openldap.org but there was no space.]
>
> As implemented, the mdb readers/writer exclusion protocol has a race condition
> that could result in a writer reclaiming and over-writing pages still in use
> by a reader.
Yes, you're basically correct. But this is unlikely in practice because the
reader thread has no blocking calls in its codepath, while the writer must
acquire locks etc. Generally the only way for two write txns to complete while
a reader thread is stalled is if you explicitly send a SIGSTOP to the reader
thread while it's in its critical section.
>
> Pseudo code of the snapshot locking protocols of reader and writer
> transactions. The labeled sections, for the purposes of this analysis, can be
> assumed to execute atomically.
>
> READER
> ======
>
> *R1* t0 = meta[0].txid; t1 = meta[1].txid
> if (t1 > t0)
> t0 = t1
>
> *R2* readers[me].txid = t0
>
> *R3* snapshot_root = meta[t0&1].snapshot
>
> *R4* /* lookup data */
>
> *R5* readers[me].txid = -1 // Release snapshot
>
> WRITER
> ======
>
> *W1* lock(writer, EXCLUSIVE)
> curr_txid = meta[0].txid
> if (meta[1].txid > curr_txid)
> curr_txid = meta[1].txid
>
> *W2* oldest = curr_txid
> for (i=0; i<reader_slots; i++)
> t = readers[i].txid
> if (t != -1 && t < oldest)
> oldest = t
>
> *W3* reclaim_old_pages(oldest)
>
> /*
> ** Commit new transaction
> */
>
> *W4* unlock(writer)
>
> ---------------------------------------------------------------------------
>
> Adversarial scheduling analysis:
>
> The following timeline demonstrates that a writer can reclaim pages in use by a
> reader.
>
> T0 T1 T2 T3 T4 T5 T6 T7 T8 T9 T10
> R1 R2 R3 R4
> W1 W2 W3 W4 * W1 W2 W3
>
> T0 Reader has found the latest txid, Xn.
>
> T1->T4 Reader is not scheduled to run and a write transaction commits, Xn+1.
>
> T5 Reader is still not scheduled and another write transaction starts.
>
> T6 Writer finds that the oldest referenced transaction is the last
> committed transaction, Xn+1.
>
> T7 Reader records Xn in the reader table.
>
> T8 Reader gets the snapshot root page for transaction Xn.
>
> T9 Writer, believing that Xn+1 is the oldest reference transaction,
> reclaims the pages of transaction Xn.
>
>
> T10 Reader attempts to navigate pages of a transaction, Xn, that has been
> reclaimed and "Bad Things Happen".
>
> In fact, bad things will happen if an even number of write transactions commit
> between W4 and W1, represented by the '*', in the timeline above.
>
> The fix is to search for the highest txid in R1 and between R2 and R3.
> Optionally, followed by recording the txid, of the snapshot root found in R3,
> in the reader's reader table slot to, possibly, increase the number of
> reclaimable transactions.
>
> The lack of compiler and memory barriers in the implementation of the locking
> protocol is also of concern.
>
> Beyond the above, the code in mdb_txn_renew0() after the
> "/* Copy the DB info and flags */"
> comment appears to have a number of data races.
>
> ---
> libraries/liblmdb/mdb.c | 24 +++++++++++++++---------
> 1 file changed, 15 insertions(+), 9 deletions(-)
>
> diff --git a/libraries/liblmdb/mdb.c b/libraries/liblmdb/mdb.c
> index e0f551e..908417c 100644
> --- a/libraries/liblmdb/mdb.c
> +++ b/libraries/liblmdb/mdb.c
> @@ -533,11 +533,11 @@ typedef struct MDB_rxbody {
> * started from so we can avoid overwriting any data used in that
> * particular version.
> */
> - txnid_t mrb_txnid;
> + volatile txnid_t mrb_txnid;
> /** The process ID of the process owning this reader txn. */
> - MDB_PID_T mrb_pid;
> + volatile MDB_PID_T mrb_pid;
> /** The thread ID of the thread owning this txn. */
> - pthread_t mrb_tid;
> + volatile pthread_t mrb_tid;
> } MDB_rxbody;
>
> /** The actual reader record, with cacheline padding. */
> @@ -585,12 +585,12 @@ typedef struct MDB_txbody {
> * This is recorded here only for convenience; the value
> can always
> * be determined by reading the main database meta pages.
> */
> - txnid_t mtb_txnid;
> + volatile txnid_t mtb_txnid;
> /** The number of slots that have been used in the reader
> table.
> * This always records the maximum count, it is not
> decremented
> * when readers release their slots.
> */
> - unsigned mtb_numreaders;
> + volatile unsigned mtb_numreaders;
> } MDB_txbody;
>
> /** The actual reader table definition. */
> @@ -854,7 +854,7 @@ typedef struct MDB_meta {
> /** Any persistent environment flags. @ref mdb_env */
> #define mm_flags mm_dbs[0].md_flags
> pgno_t mm_last_pg; /**< last used page in
> file */
> - txnid_t mm_txnid; /**< txnid that
> committed this page */
> + volatile txnid_t mm_txnid; /**< txnid that committed this
> page */
> } MDB_meta;
>
> /** Buffer for a stack-allocated meta page.
> @@ -2303,10 +2303,12 @@ mdb_txn_renew0(MDB_txn *txn)
>
> if (txn->mt_flags & MDB_TXN_RDONLY) {
> if (!ti) {
> + /* No readers table; app responsible for locking */
> meta = env->me_metas[ mdb_env_pick_meta(env) ];
> txn->mt_txnid = meta->mm_txnid;
> txn->mt_u.reader = NULL;
> } else {
> + /* Has readers table */
> MDB_reader *r = (env->me_flags & MDB_NOTLS) ?
> txn->mt_u.reader :
> pthread_getspecific(env->me_txkey);
> if (r) {
> @@ -2347,8 +2349,12 @@ mdb_txn_renew0(MDB_txn *txn)
> return rc;
> }
> }
> - txn->mt_txnid = r->mr_txnid = ti->mti_txnid;
> txn->mt_u.reader = r;
> + r->mr_txnid = ti->mti_txnid;
> +
> + /* Really need a memory barrier here */
> +
> + txn->mt_txnid = r->mr_txnid = ti->mti_txnid;
> meta = env->me_metas[txn->mt_txnid & 1];
> }
> } else {
> @@ -2376,10 +2382,10 @@ mdb_txn_renew0(MDB_txn *txn)
> }
>
> /* Copy the DB info and flags */
> - memcpy(txn->mt_dbs, meta->mm_dbs, 2 * sizeof(MDB_db));
> + memcpy(txn->mt_dbs, meta->mm_dbs, 2 * sizeof(MDB_db)); /* Racy */
>
> /* Moved to here to avoid a data race in read TXNs */
> - txn->mt_next_pgno = meta->mm_last_pg+1;
> + txn->mt_next_pgno = meta->mm_last_pg+1; /* Racy */
>
> for (i=2; i<txn->mt_numdbs; i++) {
> x = env->me_dbflags[i];
>
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
Armon Dadgar wrote:
> When I run the attached test case on my machine, IÂ’m hitting the failing case.
>
> Here is the test output: https://gist.github.com/armon/e529d7909fe301126fc6
You seem to be running obsolete code. I checked your gomdb github repo, you're
using mdb.c at rev fca18d2. Current mdb.master is 4844a72.
> My steps:
> $ clang 7844.c mdb.c midl.c
> $ mkdir testdb
> $ ./a.out
>
> $ clang -v
> Apple LLVM version 5.1 (clang-503.0.40) (based on LLVM 3.4svn)
> Target: x86_64-apple-darwin13.1.0
> Thread model: posix
>
> $ uname -a
> Darwin Armons-MacBook-Air.local 13.1.0 Darwin Kernel Version 13.1.0: Thu Jan
> 16 19:40:37 PST 2014; root:xnu-2422.90.20~2/RELEASE_X86_64 x86_64
>
> Best Regards,
> Armon Dadgar
>
> From: Howard Chu hyc(a)symas.com <mailto:hyc@symas.com>
> Reply: Howard Chu hyc(a)symas.com <mailto:hyc@symas.com>
> Date: May 16, 2014 at 11:06:49 AM
> To: armon.dadgar(a)gmail.com armon.dadgar(a)gmail.com
> <mailto:armon.dadgar@gmail.com>, openldap-its(a)openldap.org
> openldap-its(a)openldap.org <mailto:openldap-its@openldap.org>
> Subject: Re: (ITS#7844) LMDB Delete Cursor inconsistencies
>
>> armon.dadgar(a)gmail.com wrote:
>> > --5372ac85_8edbdab_1271
>> > Content-Type: text/plain; charset="utf-8"
>> > Content-Transfer-Encoding: quoted-printable
>> > Content-Disposition: inline
>> >
>> > =46or now, we have application code to retry the delete until no further =
>> > rows are removed.
>> > Still, it would be nice to have this resolved (and tested) in master=21
>>
>> Unable to reproduce the issue. I've attached my test program based on your
>> description.
>>
>> --
>> -- Howard Chu
>> CTO, Symas Corp. http://www.symas.com
>> Director, Highland Sun http://highlandsun.com/hyc/
>> Chief Architect, OpenLDAP http://www.openldap.org/project/
>> ------------------------------------------------------------------------------
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
Full_Name: Philip Guenther
Version: 2.4.39
OS: OpenBSD
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (76.253.0.176)
The ldap.conf(5) manpage says this about TLS_REQCERT
TLS_REQCERT <level>
Specifies what checks to perform on server certificates in a TLS
session, if any. The <level> can be specified as one of the
following keywords:
...
try The server certificate is requested. If no certificate is
provided, the session proceeds normally. If a bad
certificate is provided, the session is immediately
terminated.
demand | hard
These keywords are equivalent. The server certificate is
requested. If no certificate is provided, or a bad
certificate is provided, the session is immediately
terminated. This is the default setting.
In testing, I can find no difference in behavior between the 'try' and 'hard'
keywords. For the ldap* tools, both 'try' and 'hard' seem to place the same
requirements on the server. What does "if no certificate is provided" *mean* in
terms of server and/or client configuration?
openldap(a)semyon.org wrote:
> On 13-09-23 10:16 PM, Howard Chu wrote:
>
>> If you backup the DB with slapcat and reload it on another server with
>> slapadd, can you still reproduce the fault on the copy?
>
> Unfortunately, yes. It breaks trying to retrieve the same record as
> before, on the same instruction in back_mdb:
>
> yesterday:
> [126393.233615] slapd[19244]: segfault at 7f499f38d000 ip
> 00007f4da0e9b6a1 sp 00007f499f1fa540 error 6 in
> back_mdb-2.4.so.2.9.2[7f4da0e80000+34000]
>
> today:
> [517860.953779] slapd[24871]: segfault at 7fdeb50a0000 ip
> 00007fe2b63ad6a1 sp 00007fdeb4f0d540 error 6 in
> back_mdb-2.4.so.2.9.2[7fe2b6392000+34000]
>
> Semyon
A fix is now in git master.
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
rajesh(a)linux.vnet.ibm.com wrote:
> Full_Name: Rajeshkumar S
> Version: 2.4.39
> OS: Ubuntu 14.04
> URL: ftp://ftp.openldap.org/incoming/
> Submission from: (NULL) (122.248.161.59)
>
>
>
> I am trying to build the openldap package from the source following the
> release tarball from
> ftp://ftp.openldap.org/pub/OpenLDAP/openldap-release/openldap-2.4.39.tgz. in a
> new architecture ppc64le ( IBM PowerPC Little endian ). As the config.guess and
> libtool did not have the required patches to identify this new architecture, I
> did autoreconf -f -i in my build system whose latest automake and libtool has
> the patches
> of ppc64le.
> autoreconf fails with automake errors as shown below
>
> automake: error: no 'Makefile.am' found for any configure output
> autoreconf: automake failed with exit status: 1
That's to be expected, since there certainly are no Makefile.am files in our
source tree. We don't use automake. This is not a bug.
> As discussed in the mailing list, Quanah Gibson-Mount suggested me to request an
> update for auto-tools to support new architectures.
>
> The automake with version >= 1.13.4 has the correct config.guess with the
> required ppc64le fixes.
> Regarding the libtool, the last release is more than 2 years ago I believe, we
> managed to get an alpha source release which has all the latest patches to
> support the new architecture ppc64le
>
> An alpha version of libtool with ppc64le support is available at
> ftp://alpha.gnu.org/gnu/libtool/libtool-2.4.2.418.tar.gz.
Libtool is so perpetually problematic that we will certainly not integrate an
alpha version into our source tree.
> Hence I request for an update of the auto tools in your build system to support
> these new architectures.
>
> Thanks and Regards
> Rajeshkumar S
> Linux Technology Center, IBM
>
>
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
Full_Name: Rajeshkumar S
Version: 2.4.39
OS: Ubuntu 14.04
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (122.248.161.59)
I am trying to build the openldap package from the source following the
release tarball from
ftp://ftp.openldap.org/pub/OpenLDAP/openldap-release/openldap-2.4.39.tgz. in a
new architecture ppc64le ( IBM PowerPC Little endian ). As the config.guess and
libtool did not have the required patches to identify this new architecture, I
did autoreconf -f -i in my build system whose latest automake and libtool has
the patches
of ppc64le.
autoreconf fails with automake errors as shown below
automake: error: no 'Makefile.am' found for any configure output
autoreconf: automake failed with exit status: 1
As discussed in the mailing list, Quanah Gibson-Mount suggested me to request an
update for auto-tools to support new architectures.
The automake with version >= 1.13.4 has the correct config.guess with the
required ppc64le fixes.
Regarding the libtool, the last release is more than 2 years ago I believe, we
managed to get an alpha source release which has all the latest patches to
support the new architecture ppc64le
An alpha version of libtool with ppc64le support is available at
ftp://alpha.gnu.org/gnu/libtool/libtool-2.4.2.418.tar.gz.
Hence I request for an update of the auto tools in your build system to support
these new architectures.
Thanks and Regards
Rajeshkumar S
Linux Technology Center, IBM