zzgrim(a)gmail.com wrote:
> Full_Name: zgrim
> Version: 2.4.40
> OS: Linux
> URL: ftp://ftp.openldap.org/incoming/
> Submission from: (NULL) (79.115.186.237)
>
>
> Hello, thank you for lmdb.
> I'm using lmdb via a Go wrapper and under some not-yet-clear circumstances,
> under high concurrency stress testing,
> I sometimes get "double free or corruption" SIGABRT related to
> mdb_dbis_update().
> The setup is using the NOTLS and RDONLY flags on a 64bit 3.17.0 Arch Linux
> system.
> If I protect the free() inside mdb_dbis_update() (in mdb.c) with a mutex (code
> inline below), the symptom disappears.
> I'm sorry I can't provide a simple test case, but in my code, without the
> mutex, launching some tens of test clients simultaneously trigger the
> corruption fairly quick, i.e. a few minutes or even seconds.
Sounds like you have not read the mdb_dbi_open() doc closely. The code in
mdb_dbis_update() only triggers if you've opened new DBs in a transaction. The
doc for mdb_dbi_open() specifically says you cannot call it from multiple
concurrent transactions.
Closing this ITS.
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
>>> We are using the combination:
>>> envflags writemap nosync lifo
>>> checkpoint 0 1
>>>
>>> If the checkpoint is set in seconds, it gives us the assurance
>>> consistent state database on disk.
>>> However, without this patch meta-pages can be written by the kernel
>>> before the data.
>>>
>>> In fact, for a full guarantee in case of death slapd process,
>>> meta-page should be written explicitly.
>
>
> No, the DB can never go inconsistent due to a process crashing - the pages
> in OS cache are always correct. It can only go inconsistent if the OS
> crashes and a proper sync has not occurred.
Yes, Howard, you are right.
But apparently I need to be more precise.
Talking about "death" of slapd, I meant all the reasons, including power off.
For example, a power-off case:
- The main power is turned off and the system switches to the UPS.
- Given the notice, OS starts an emergency stop processes.
- For some reason (does not have enough time to stop) slapd receives SIGKILL.
- OS tries to write mmap-region of the DB-file and begins with the
lower address.
- let the meta-pages has written completely, but for the rest of the
data is not enough battery power.
- Now DB is completely destroyed on the disk.
To avoid this, the meta-pages should not be included in the rw-mapped
region, and should be written explicity after a data pages.
>>>>> commit 8ddd63161aeb2689822d1a8d27385d62e4e341ae
>>>>> Author: Leo Yuriev <leo(a)yuriev.ru>
>>>>> Date: 2014-09-19 22:47:19 +0400
>>>>>
>>>>> BUGFIX - lmdb: properly sync meta-pages in mdb_sync_env().
>>>>>
>>>>> Meta-pages may be updated during data-syncing in mdb_sync_env(),
>>>>> in this case database would be inconsistent.
>>>>>
>>>>> Check-and-retry if lead txn-id changed during flushing data in
>>>>> mdb_sync_env().
>
> Fundamentally, you are trying to make an inherently unsafe configuration
> "safer", but it's impossible. Assume you have mlock'd the meta pages into
> memory, so the OS never flushes them itself any more, and you're running
> with NOSYNC. That means, within 3 transactions, the data pages on disk will
> be out of sync with the meta pages on disk. If the OS crashes at that point,
> the entire DB will be lost.
Not a problem.
I had explained above - we should write meta-pages explicitly after
the data sync.
But also we should not perform reclaiming ahead of the last checkpoint.
> The only way to make this mode of operation somewhat safe is to defer
> reclaiming pages for even longer. E.g., instead of halting at current_txnid
> - 3, halt at current_txnid - 22, in which case the data pointed to by the
> on-disk meta pages cannot get obsolete until 20 transactions have occurred.
>
> Note that in combination with your LIFO patch, it's pretty much guaranteed
> that the on-disk meta pages will be useless after only 2 un-sync'd
> transactions.
Yes, Howard, you are right.
But I think there is confusion in the discussion because of mixing of
LIFO-feature and changes for checkpoints consistency in a NOSYNC and
WRITEMAP+NOSYNC modes.
For a "NOSYNC + checkpoints" topic I will submit a separate ITS (like
a 'volaile' related 7969,7970,7971).
My opinion - it is a flaw, and no reason to don't fix it.
Continuing the conversation about checkpoints in a LIFO context.
I saw the problem, that you specified, and thinking over its solution,
but have not yet found "golden ratio".
And since we are having a serious problem with syncrepl, then I put
off this task with an excuse "LIFO-patch not does worse than it was."
In general, we should do not reclaim anything ahead of the txn, that
is synced to the disk (let this be named a R-rule).
To do so we need a second field like mti_txnid, but which will be
update only at the end of mdb_env_write_meta().
Finally we should start search in mdb_find_oldest() from value of this
new field instead of the current txn number.
This seems to will be work fine.
However, I stopped on a reasoning - about the purpose of the
checkpoints, about design LMDB as a product, about the expectations of
the user and the necessary configuration parameters:
- checkpoints are needed ONLY in nosync modes;
- if the user does NOT activate the checkpoints, he do not care about
consistency;
- but if it is turned on, we MUST provide consistency on the checkpoints;
- otherwise a checkpoints feature is thoughtless and should be REMOVED.
Therefore implementation of checkpoints & reclaiming should be updated
to conform to the "R-rule", that noted above.
>From this point of view a LIFO-feature also should be refined, but
nevertheless can be very useful.
- SYNC mode = takes a benefit from storage with write-back cache
(assume powered by battery).
- ASYNC/NOSYNC without checkpoint = significant reduction of
committed/dirty pages and thereby much less write-iops.
- ASYNC/NOSYNC with checkpoint = seems to same as a SYNC case.
Total of all the above - I think first we need to fix a reclaiming or
delete the checkpoints, and then I will complete LIFO.
> -- Howard Chu
> CTO, Symas Corp. http://www.symas.com
> Director, Highland Sun http://highlandsun.com/hyc/
> Chief Architect, OpenLDAP http://www.openldap.org/project/
Thank for conversation.
Leonid.
Full_Name: zgrim
Version: 2.4.40
OS: Linux
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (79.115.186.237)
Hello, thank you for lmdb.
I'm using lmdb via a Go wrapper and under some not-yet-clear circumstances,
under high concurrency stress testing,
I sometimes get "double free or corruption" SIGABRT related to
mdb_dbis_update().
The setup is using the NOTLS and RDONLY flags on a 64bit 3.17.0 Arch Linux
system.
If I protect the free() inside mdb_dbis_update() (in mdb.c) with a mutex (code
inline below), the symptom disappears.
I'm sorry I can't provide a simple test case, but in my code, without the
mutex, launching some tens of test clients simultaneously trigger the
corruption fairly quick, i.e. a few minutes or even seconds.
I tried older releases of lmdb as well as the latest git tree,
9a8eb95674c7b500cfe5f44d03493ff76c9fc0c1 of mdb.master, the bahaviour is the
same.
I also tried several approaches in the higher level code, like either txn reset
and renew and a pool of txns or simplified the code to bare bones
to just abort after each transaction and remake new txns, it does not matter,
the corruption happens the same.
If the problem should lie in the higher level code, I'd welcome your insights
on possible culprits, I'm very new to lmdb.
Thanks.
against latest git mentioned above i applied something like,
@@ -2769,19 +2769,24 @@
MDB_dbi n = txn->mt_numdbs;
MDB_env *env = txn->mt_env;
unsigned char *tdbflags = txn->mt_dbflags;
+ int rc = 0, rx = 0;
for (i = n; --i >= 2;) {
if (tdbflags[i] & DB_NEW) {
if (keep) {
env->me_dbflags[i] = txn->mt_dbs[i].md_flags |
MDB_VALID;
} else {
- char *ptr = env->me_dbxs[i].md_name.mv_data;
- if (ptr) {
+ if (env->me_dbxs[i].md_name.mv_data != NULL) {
+ mdb_mutex_t *wmutex = MDB_MUTEX(env,
w);
+ rc = LOCK_MUTEX(rx, env, wmutex);
+
+ free(env->me_dbxs[i].md_name.mv_data);
env->me_dbxs[i].md_name.mv_data = NULL;
env->me_dbxs[i].md_name.mv_size = 0;
env->me_dbflags[i] = 0;
env->me_dbiseqs[i]++;
- free(ptr);
+
+ if (!rc) UNLOCK_MUTEX(wmutex);
}
}
}
Леонид Юрьев wrote:
> 2014-10-03 3:13 GMT+04:00 Howard Chu <hyc(a)symas.com>:
> 2014-10-04 0:04 GMT+04:00 Леонид Юрьев <leo(a)yuriev.ru>:
>> 2014-10-03 3:13 GMT+04:00 Howard Chu <hyc(a)symas.com>:
>>>> commit 841059330fd44769e93eb4b937c3ce42654fad6f
>>>> Author: Leo Yuriev <leo(a)yuriev.ru>
>>>> Date: 2014-09-20 07:16:15 +0400
>>>>
>>>> BUGFIX - lmdb: lock meta-pages in writemap-mode to avoid unexpected
>>>> write,
>>>> before the data pages would be synchronized.
>>>>
>>>> Without locking the meta-pages may be writen by OS before other
>>>> data,
>>>> in this case database would be inconsistent.
>>>
>>>
>>> Seems unnecessary. Won't happen by default; could happen with MDB_NOSYNC but
>>> that risk is already documented.
>>
>> We are using the combination:
>> envflags writemap nosync lifo
>> checkpoint 0 1
>>
>> If the checkpoint is set in seconds, it gives us the assurance
>> consistent state database on disk.
>> However, without this patch meta-pages can be written by the kernel
>> before the data.
>>
>> In fact, for a full guarantee in case of death slapd process,
>> meta-page should be written explicitly.
No, the DB can never go inconsistent due to a process crashing - the pages in
OS cache are always correct. It can only go inconsistent if the OS crashes and
a proper sync has not occurred.
>> But it requires a lot of changes and I do not do that.
>>>> commit 8ddd63161aeb2689822d1a8d27385d62e4e341ae
>>>> Author: Leo Yuriev <leo(a)yuriev.ru>
>>>> Date: 2014-09-19 22:47:19 +0400
>>>>
>>>> BUGFIX - lmdb: properly sync meta-pages in mdb_sync_env().
>>>>
>>>> Meta-pages may be updated during data-syncing in mdb_sync_env(),
>>>> in this case database would be inconsistent.
>>>>
>>>> Check-and-retry if lead txn-id changed during flushing data in
>>>> mdb_sync_env().
Fundamentally, you are trying to make an inherently unsafe configuration
"safer", but it's impossible. Assume you have mlock'd the meta pages into
memory, so the OS never flushes them itself any more, and you're running with
NOSYNC. That means, within 3 transactions, the data pages on disk will be out
of sync with the meta pages on disk. If the OS crashes at that point, the
entire DB will be lost.
The only way to make this mode of operation somewhat safe is to defer
reclaiming pages for even longer. E.g., instead of halting at current_txnid -
3, halt at current_txnid - 22, in which case the data pointed to by the
on-disk meta pages cannot get obsolete until 20 transactions have occurred.
Note that in combination with your LIFO patch, it's pretty much guaranteed
that the on-disk meta pages will be useless after only 2 un-sync'd transactions.
>>> Probably could simplify this, just obtain the write mutex unconditionally,
>>> then there's no need to loop or retry. But also, this depends on MDB_NOLOCK
>>> - if that's set, then do no locking at all.
>>
>> I did so for reasons of performance and less a lock retention time.
>>
>> Retries will be if there an intensive flow of changes.
>> In this case it will be a lot of updated pages, the record which will
>> take some time.
>>
>> However, in subsequent iterations (if a transactions had committed
>> while there was a record),
>> the modified pages will be much fewer, and the sync will be quick.
>>
>> Thus (and it was seen in tests) even when a substantial amount of the
>> transactions,
>> usually only two iterations of the cycle,
>> without locking and flow of changes are not suspended.
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
leo(a)yuriev.ru wrote:
> The attached files is derived from OpenLDAP Software. All of the modifications
> to OpenLDAP Software represented in the following patch(es) were developed by
> Peter-Service LLC, Moscow, Russia. Peter-Service LLC has not assigned rights
> and/or interest in this work to any party. I, Leonid Yuriev am authorized by
> Peter-Service LLC, my employer, to release this work under the following terms.
>
> Peter-Service LLC hereby places the following modifications to OpenLDAP Software
> (and only these modifications) into the public domain. Hence, these
> modifications may be freely used and/or redistributed for any purpose with or
> without attribution and/or other notice.
Committed to mdb.master. The whitespace is wrong in all of the patches you
submitted, had to apply them manually.
>
> commit 5e83ea8a76f81dac8d5a9dcd74d5b10a01330102
> Author: Leo Yuriev <leo(a)yuriev.ru>
> Date: 2014-09-12 01:32:13 +0400
>
> ITS#7969 for LMDB: volatile & __synchronize().
>
> diff --git a/libraries/liblmdb/mdb.c b/libraries/liblmdb/mdb.c
> index 6cc3433..3286ffb 100644
> --- a/libraries/liblmdb/mdb.c
> +++ b/libraries/liblmdb/mdb.c
> @@ -580,11 +580,11 @@ typedef struct MDB_rxbody {
> * started from so we can avoid overwriting any data used in that
> * particular version.
> */
> - txnid_t mrb_txnid;
> + volatile txnid_t mrb_txnid;
> /** The process ID of the process owning this reader txn. */
> - MDB_PID_T mrb_pid;
> + volatile MDB_PID_T mrb_pid;
> /** The thread ID of the thread owning this txn. */
> - MDB_THR_T mrb_tid;
> + volatile MDB_THR_T mrb_tid;
> } MDB_rxbody;
>
> /** The actual reader record, with cacheline padding. */
> @@ -632,12 +632,12 @@ typedef struct MDB_txbody {
> * This is recorded here only for convenience; the value can always
> * be determined by reading the main database meta pages.
> */
> - txnid_t mtb_txnid;
> + volatile txnid_t mtb_txnid;
> /** The number of slots that have been used in the reader table.
> * This always records the maximum count, it is not decremented
> * when readers release their slots.
> */
> - unsigned mtb_numreaders;
> + volatile unsigned mtb_numreaders;
> } MDB_txbody;
>
> /** The actual reader table definition. */
> @@ -908,7 +908,7 @@ typedef struct MDB_meta {
> /** Any persistent environment flags. @ref mdb_env */
> #define mm_flags mm_dbs[0].md_flags
> pgno_t mm_last_pg; /**< last used page in file */
> - txnid_t mm_txnid; /**< txnid that committed this page */
> + volatile txnid_t mm_txnid; /**< txnid that committed this page */
> } MDB_meta;
>
> /** Buffer for a stack-allocated meta page.
> @@ -3559,6 +3559,10 @@ mdb_env_write_meta(MDB_txn *txn)
> mp->mm_dbs[0] = txn->mt_dbs[0];
> mp->mm_dbs[1] = txn->mt_dbs[1];
> mp->mm_last_pg = txn->mt_next_pgno - 1;
> +#if !(defined(_MSC_VER) || defined(__i386__) || defined(__x86_64__))
> + /* LY: issue a memory barrier, if not x86. */
> + __sync_synchronize();
> +#endif
> mp->mm_txnid = txn->mt_txnid;
> if (!(env->me_flags & (MDB_NOMETASYNC|MDB_NOSYNC))) {
> unsigned meta_size = env->me_psize;
>
>
>
>
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
The attached files is derived from OpenLDAP Software. All of the modifications
to OpenLDAP Software represented in the following patch(es) were developed by
Peter-Service LLC, Moscow, Russia. Peter-Service LLC has not assigned rights
and/or interest in this work to any party. I, Leonid Yuriev am authorized by
Peter-Service LLC, my employer, to release this work under the following terms.
Peter-Service LLC hereby places the following modifications to OpenLDAP Software
(and only these modifications) into the public domain. Hence, these
modifications may be freely used and/or redistributed for any purpose with or
without attribution and/or other notice.
https://gist.github.com/leo-yuriev/6f1e1444c9e29bd94885
The attached files is derived from OpenLDAP Software. All of the modifications
to OpenLDAP Software represented in the following patch(es) were developed by
Peter-Service LLC, Moscow, Russia. Peter-Service LLC has not assigned rights
and/or interest in this work to any party. I, Leonid Yuriev am authorized by
Peter-Service LLC, my employer, to release this work under the following terms.
Peter-Service LLC hereby places the following modifications to OpenLDAP Software
(and only these modifications) into the public domain. Hence, these
modifications may be freely used and/or redistributed for any purpose with or
without attribution and/or other notice.
https://gist.github.com/leo-yuriev/fc00313b8035cddba6a7