Re: (ITS#9037) observing crash in mdb_cursor_put()
by vporof@mozilla.com
> On 1. Aug 2019, at 19:47, Howard Chu <hyc(a)symas.com> wrote:
>=20
> vporof(a)mozilla.com wrote:
>> Hey folks.
>>=20
>> =3D46rom Myk=3DE2=3D80=3D99s investigations in the previous followup, =
it seems =3D
>> that the suggested changes to `mdb_cursor_init` to avoid using an =3D
>> invalid DBI might not be solving the actual issue, given the =
behaviour =3D
>> of `mdb_page_search`.
>=20
> Agreed, that assert that I suggested isn't catching what we want.
FWIW, here's my findings when attempting to look into what was happening =
with regards to that test case: =
https://gist.github.com/victorporof/d1f98b8a52f79c7ff99f361d3a2adc3e
It's unclear how much overlap there is with the previous findings, and =
whether or not calling `mdb_put` should assert with the DBI previously =
opened via `mdb_dbi_open` with MDB_CREATE. Let me know if what I'm =
observing here is expected.
>=20
>> It=3DE2=3D80=3D99s also causing the seemingly correct test program to =
assert =3D
>> when it wasn=3DE2=3D80=3D99t before. It=3DE2=3D80=3D99s unclear =
whether this should =3D
>> be the case or not, perhaps Howard can confirm the expected =
behaviour.
>>=20
>> In any case, we=3DE2=3D80=3D99re wondering if there=3DE2=3D80=3D99s =
been any other =3D
>> progress, or if someone managed to reproduce this issue? Shipping new =
=3D
>> features built on top of LMDB in Firefox is currently blocked due to =
=3D
>> these failures, so any additional info would be helpful.
>=20
> Sorry, still not seeing this over here. What else do you know about =
the
> systems where this is occurring? RAM size, storage on HDD / SSD / USB =
?
Here's everything we know about: =
https://crash-stats.mozilla.org/report/index/5d77bd19-41ce-459f-9c1c-7f9fb=
0190324 See the "details" and "telemetry environment" sections for a =
breakdown.
Victor
>=20
> --=20
> -- Howard Chu
> CTO, Symas Corp. http://www.symas.com
> Director, Highland Sun http://highlandsun.com/hyc/
> Chief Architect, OpenLDAP http://www.openldap.org/project/
2 years, 9 months
(ITS#9060) lmdb corruption after many memory map resizes
by ruan.declercq@netronome.com
Full_Name: Ruan de Clercq
Version:
OS: Ubuntu 19.04
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (155.93.214.171)
Hi,
I am using 0.9.23-0ubuntu1 on ubuntu 19.04.
I have an application where the exact db size isn't known beforehand. Therefore,
I set the memory map size beforehand, and then increase the size when I run out
of space. However, when I use lmdb in concurrent processes, and frequently
increase the size of the memory map, I eventually get a corrupt database
(MDB_CORRUPTED). Here's a minimal code example:
#include <stdio.h>
#include <stdlib.h>
#include <inttypes.h>
#include <time.h>
#include "lmdb.h"
#include <cstring>
int resize(MDB_env *env) {
// read map size and increase
int val;
MDB_envinfo stat;
MDB_txn *txn;
if ((val = mdb_env_info(env, &stat) != MDB_SUCCESS) ||
(val = mdb_env_set_mapsize(env, stat.me_mapsize + 4096) != MDB_SUCCESS)
||
(val = mdb_env_info(env, &stat) != MDB_SUCCESS)) {
printf("set new mapsize %d\n", val);
return val;
}
printf("Resized to %lu\n", stat.me_mapsize);
if (((val = mdb_txn_begin(env, NULL, 0, &txn) != MDB_SUCCESS)) ||
((val = mdb_txn_commit(txn)))) {
printf("txn %d\n", val);
}
return val;
}
int main(int argc,char * argv[])
{
timespec tp;
clock_gettime(CLOCK_MONOTONIC, &tp);
srand(tp.tv_sec + tp.tv_nsec);
MDB_env *env = NULL;
MDB_txn *txn = NULL;
MDB_dbi dbi;
int val;
if ((val = mdb_env_create(&env) != MDB_SUCCESS) ||
(val = mdb_env_set_maxreaders(env, 126) != MDB_SUCCESS)) {
printf("Create error %d", val);
goto out;
}
if ((val = mdb_env_set_mapsize(env, 4096 * 128) != MDB_SUCCESS) ||
(val = mdb_env_set_maxdbs(env, 1) != MDB_SUCCESS) ||
(val = mdb_env_open(env, "./data/", MDB_MAPASYNC | MDB_WRITEMAP, 0664) !=
MDB_SUCCESS) ||
(val = mdb_txn_begin(env, NULL, 0, &txn) != MDB_SUCCESS) ||
(val = mdb_dbi_open(txn, "test", MDB_CREATE, &dbi) != MDB_SUCCESS) ||
(val = mdb_txn_commit(txn) != MDB_SUCCESS)) {
printf("create db %d\n", val);
goto out;
}
for (int i = 0; i < 1000; i++) {
char sval[32];
sprintf(sval, "%03x %d foo bar", 32, rand());
MDB_val mdb_key, mdb_val;
mdb_key.mv_size = strlen(sval);;
mdb_key.mv_data = sval;
mdb_val.mv_size = strlen(sval);
mdb_val.mv_data = sval;
if ((val = mdb_txn_begin(env, NULL, 0, &txn) != MDB_SUCCESS)) {
printf("mdb_txn_begin %d\n", val);
goto out;
}
if ((val = mdb_put(txn, dbi, &mdb_key, &mdb_val, 0)) != MDB_SUCCESS) {
mdb_txn_commit(txn);
if (val == MDB_MAP_FULL) {
val = resize(env);
continue;
}
printf("mdb_put %d\n", val);
goto out;
}
if ((val = mdb_txn_commit(txn)) != MDB_SUCCESS) {
if (val == MDB_MAP_FULL) {
val = resize(env);
continue;
}
printf("mdb_txn_commit %d\n", val);
goto out;
}
}
out:
if (env) {
mdb_dbi_close(env, dbi);
mdb_env_close(env);
}
printf("OK\n");
}
After compiling I run the following command a couple of times:
for i in {1..20}; do echo $i; (lmdb_test &); done
2 years, 9 months
Re: (ITS#9037) observing crash in mdb_cursor_put()
by hyc@symas.com
vporof(a)mozilla.com wrote:
> Hey folks.
>
> =46rom Myk=E2=80=99s investigations in the previous followup, it seems =
> that the suggested changes to `mdb_cursor_init` to avoid using an =
> invalid DBI might not be solving the actual issue, given the behaviour =
> of `mdb_page_search`.
Agreed, that assert that I suggested isn't catching what we want.
> It=E2=80=99s also causing the seemingly correct test program to assert =
> when it wasn=E2=80=99t before. It=E2=80=99s unclear whether this should =
> be the case or not, perhaps Howard can confirm the expected behaviour.
>
> In any case, we=E2=80=99re wondering if there=E2=80=99s been any other =
> progress, or if someone managed to reproduce this issue? Shipping new =
> features built on top of LMDB in Firefox is currently blocked due to =
> these failures, so any additional info would be helpful.
Sorry, still not seeing this over here. What else do you know about the
systems where this is occurring? RAM size, storage on HDD / SSD / USB ?
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
2 years, 9 months
Re: (ITS#9059) session log parsing triggers cascading REFRESH
by quanah@symas.com
--On Thursday, August 01, 2019 4:16 PM +0000 quanah(a)openldap.org wrote:
> Full_Name: Quanah Gibson-Mount
> Version: 2.4.47
> OS: N/A
> URL: ftp://ftp.openldap.org/incoming/
> Submission from: (NULL) (47.208.128.44)
>
>
> In investigating why a particular consumer went into REFRESH mode, I find
> that this was triggered by one of its providers currently parsing a
> sessionlog. Sessionlog parsing should *not* cause a consumer of that
> server to fall back to REFRESH.
Some notes on this specific environment and situation.
This is a 4-way MMR setup. Via the use of a load balancer, only
iam-ed-ldp-p01 recieves write operations. Thus there should never be any
REFRESHes occurring anywhere, so the fact that there is is already
problematic.
Servers are p01, p02, p03 (all in the same data center) and b01 (offsite DR
system).
At the time this issue hit:
p01 was upgraded to a new release of OpenLDAP @ 21:26:48, thus we see:
Jul 30 21:25:17 iam-ed-ldp-p01 slapd[30495]: slapd shutdown: waiting for 0
operations/tasks to finish
Jul 30 21:26:48 iam-ed-ldp-p01 slapd[20978]: @(#) $OpenLDAP: slapd 2.4.47
(Mar 22 2019 09:12:20)
$#012#011build@rhel7build.rb.symas.net:/home/build/sold-2.4.47.3/redhat7-x86_64/build/openldap.x86_64/servers/slapd
Jul 30 21:26:48 iam-ed-ldp-p01 slapd[20979]: slapd starting
Jul 30 21:26:48 iam-ed-ldp-p01 slapd[20979]: do_syncrep2: rid=002
LDAP_RES_INTERMEDIATE - REFRESH_DELETE
Jul 30 21:26:48 iam-ed-ldp-p01 slapd[20979]: do_syncrep2: rid=001
LDAP_RES_SEARCH_RESULT
Jul 30 21:26:48 iam-ed-ldp-p01 slapd[20979]: do_syncrep2: rid=001
delta-sync lost sync, switching to REFRESH
Jul 30 21:26:48 iam-ed-ldp-p01 slapd[20979]: do_syncrep2: rid=001 (4096)
Content Sync Refresh Required
Jul 30 21:26:48 iam-ed-ldp-p01 slapd[20979]: do_syncrep2: rid=003
LDAP_RES_INTERMEDIATE - REFRESH_DELETE
p03 was operating w/o issue. Interstingly enough, when it was restarted
for the software update @ 21:37:16, it got an error (53) talking to p02
instead of an error (4096) like p01 did. This is likely because p02 was
restarted for its software update @ 21:32:19, which wiped out its
sessionlog.
p02 had started doing session log parsing @ 21:26:17 with
Jul 30 21:26:17 iam-ed-ldp-p02 slapd[3750]: conn=18205759 op=1 SEARCH
RESULT tag=101 err=4096 duration=18.462ms nentries=0 text=sync cookie is
stale
We see the connection to p02 from p01 with a similar issue when it starts
up:
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: conn=18205790 op=1 SRCH
base="cn=accesslog" scope=2 deref=0
filter="(&(objectClass=auditWriteObject)(reqResult=0))"
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: conn=18205790 op=1 SRCH
attr=reqDN reqType reqMod reqNewRDN reqDeleteOldRDN reqNewSuperior entryCSN
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: conn=18205790 op=1 SEARCH
RESULT tag=101 err=4096 duration=0.431ms nentries=0 text=sync cookie is
stale
b01 had its software update @ 21:21:25
Generally it seems like restarting/wiping the sessionlog made things
incredibly bad across the cluster.
--
Quanah Gibson-Mount
Product Architect
Symas Corporation
Packaged, certified, and supported LDAP solutions powered by OpenLDAP:
<http://www.symas.com>
2 years, 9 months
(ITS#9059) session log parsing triggers cascading REFRESH
by quanah@openldap.org
Full_Name: Quanah Gibson-Mount
Version: 2.4.47
OS: N/A
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (47.208.128.44)
In investigating why a particular consumer went into REFRESH mode, I find that
this was triggered by one of its providers currently parsing a sessionlog.
Sessionlog parsing should *not* cause a consumer of that server to fall back to
REFRESH.
Server that got dropped into REFRESH incorrectly:
Jul 30 21:26:48 iam-ed-ldp-p01 slapd[20979]: do_syncrep2: rid=001
LDAP_RES_SEARCH_RESULT
Jul 30 21:26:48 iam-ed-ldp-p01 slapd[20979]: do_syncrep2: rid=001 delta-sync
lost sync, switching to REFRESH
Jul 30 21:26:48 iam-ed-ldp-p01 slapd[20979]: do_syncrep2: rid=001 (4096) Content
Sync Refresh Required
RID001 server:
....
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: log csn
20181127055648.177968Z#000000#001#000000
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: log csn
20181109151409.323374Z#000000#001#000000
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: log csn
20181127122502.661263Z#000000#001#000000
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: cmp -256, too old
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: cmp -256, too old
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: log csn
20181127061139.881198Z#000000#001#000000
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: cmp -256, too old
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: cmp -256, too old
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: log csn
20181127062500.514773Z#000000#001#000000
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: cmp -256, too old
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: cmp -256, too old
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: log csn
20181127061139.941766Z#000000#001#000000
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: log csn
20181127062500.622874Z#000000#001#000000
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: cmp -256, too old
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: cmp -256, too old
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: log csn
20181127061139.977196Z#000000#001#000000
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: log csn
20181127062500.656726Z#000000#001#000000
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: log csn
20181109151409.575116Z#000000#001#000000
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: cmp -256, too old
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: cmp -256, too old
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: log csn
20181127062500.672371Z#000000#001#000000
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: cmp -256, too old
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: cmp -256, too old
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: log csn
20181127062500.701608Z#000000#001#000000
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: log csn
20181109151409.833435Z#000000#001#000000
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: cmp -256, too old
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: cmp -256, too old
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: log csn
20181127062500.766277Z#000000#001#000000
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: log csn
20181109151410.102670Z#000000#001#000000
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: log csn
20181127055648.192922Z#000000#001#000000
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: log csn
20181127061140.091604Z#000000#001#000000
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: cmp -256, too old
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: cmp -256, too old
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: log csn
20181127062500.859669Z#000000#001#000000
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: log csn
20181127061140.160929Z#000000#001#000000
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: cmp -256, too old
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: log csn
20181127122502.664055Z#000000#001#000000
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: log csn
20181127061140.213051Z#000000#001#000000
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: cmp -256, too old
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: cmp -256, too old
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: cmp -256, too old
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: log csn
20181127122502.666873Z#000000#001#000000
Jul 30 21:26:48 iam-ed-ldp-p02 slapd[3750]: log csn
20181127062500.892859Z#000000#001#000000
....
2 years, 9 months