Re: (ITS#7958) LMDB: LIFO-reclaiming, write-performance improvement & bugfixes - openldap-bugs

12 Jan 2015

For information only: the 'lifo' and 'coalesce' features are now
implemented in the ReOpenLDAP fork.
1) lmdb: MDB_LIFORECLAIM & MDB_COALESCE modes.
https://github.com/ReOpen/ReOpenLDAP/commit/829c2063b602238b5c93ea36a981de3d0d7994bc
2) lmdb-backend: support config for 'lifo' and 'coalesce' envflags.
https://github.com/ReOpen/ReOpenLDAP/commit/08b4a41b5b837548444ef0fef7614940c41d882a
There are a couple of open issues:
1) lmdb in 'writemap' mode may be inconsistent even with checkpoints
https://github.com/ReOpen/ReOpenLDAP/issues/1
2) the lifo feature should be synchronized with checkpoints
https://github.com/ReOpen/ReOpenLDAP/issues/2
However, it currently gives a reasonable boost (5-10 times) to
write performance in our use case.
Leonid.
2014-10-20 0:27 GMT+03:00 Леонид Юрьев leo@yuriev.ru:
...
...
...
...
We are using the combination:
   envflags writemap nosync lifo
   checkpoint 0 1
If the checkpoint interval is set in seconds, it gives us an assurance
of a consistent database state on disk.
However, without this patch the meta-pages can be written by the kernel
before the data.
In fact, for a full guarantee in case the slapd process dies,
the meta-page should be written explicitly.
No, the DB can never go inconsistent due to a process crashing - the pages
in OS cache are always correct. It can only go inconsistent if the OS
crashes and a proper sync has not occurred.
Yes, Howard, you are right.
But apparently I need to be more precise.
By the "death" of slapd I meant all possible causes, including power off.
...
For example, a power-off case:

1) The main power is turned off and the system switches to the UPS.
2) Given the notice, the OS starts an emergency shutdown of processes.
3) For some reason (e.g. not enough time to stop cleanly) slapd receives SIGKILL.
4) The OS tries to write out the mmap region of the DB file, beginning with
   the lowest address.
5) Suppose the meta-pages are written completely, but there is not enough
   battery power left for the rest of the data.
6) Now the DB on disk is completely destroyed.

To avoid this, the meta-pages should not be included in the rw-mapped
region, and should be written explicitly after the data pages.
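The ordering argued for here can be sketched with plain POSIX calls. This is only a minimal illustration, not LMDB code: `write_data_then_meta` and its parameters are hypothetical names, and it assumes the data and the meta-page are written through a file descriptor. In a writemap setup the data flush would more likely be an msync() on the mapping instead of the first fsync().

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not part of LMDB: make the data durable first,
 * and only then publish the meta-page that points at it.
 * Returns 0 on success, -1 on any error (a short write counts as an error). */
int write_data_then_meta(int fd,
                         const void *data, size_t data_len, off_t data_off,
                         const void *meta, size_t meta_len, off_t meta_off)
{
    /* Step 1: write the data pages. */
    if (pwrite(fd, data, data_len, data_off) != (ssize_t)data_len)
        return -1;
    /* Step 2: barrier. The data must reach stable storage before the meta-page. */
    if (fsync(fd) != 0)
        return -1;
    /* Step 3: write the meta-page that references the new data. */
    if (pwrite(fd, meta, meta_len, meta_off) != (ssize_t)meta_len)
        return -1;
    /* Step 4: make the meta-page itself durable. */
    if (fsync(fd) != 0)
        return -1;
    return 0;
}
```

If power is lost before step 3, the old meta-page on disk still points at old, intact data. If it is lost after step 2, the new data is already durable, so whichever meta-page survives is consistent.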
...
...
...
...
...
commit 8ddd63161aeb2689822d1a8d27385d62e4e341ae
Author: Leo Yuriev leo@yuriev.ru
Date:   2014-09-19 22:47:19 +0400
   BUGFIX - lmdb: properly sync meta-pages in mdb_sync_env().

   Meta-pages may be updated during data-syncing in mdb_sync_env(),
   in this case database would be inconsistent.

   Check-and-retry if lead txn-id changed during flushing data in
   mdb_sync_env().
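The check-and-retry idea in this commit message can be illustrated with a toy model. This is a sketch under assumptions, not the code from the commit: `toy_env`, `toy_flush_data` and `toy_sync_env` are hypothetical names, and the "concurrent writer" is simulated by a counter so that the example is deterministic.

```c
#include <stddef.h>

/* Toy environment: hypothetical, not LMDB's real structures. */
typedef struct {
    size_t lead_txnid;     /* id of the latest committed transaction */
    int    racing_commits; /* simulation only: commits that land during upcoming flushes */
    int    flushes;        /* number of data flushes performed so far */
} toy_env;

/* Simulated data flush. While racing_commits > 0, a "concurrent" writer
 * commits during the flush and advances lead_txnid. */
int toy_flush_data(toy_env *env)
{
    env->flushes++;
    if (env->racing_commits > 0) {
        env->racing_commits--;
        env->lead_txnid++;
    }
    return 0;
}

/* Flush data until one flush completes with no commit in between.
 * On success, stores the txn id covered by that flush in *synced_txnid
 * and returns 0. Returns -1 if a flush fails or the attempts run out. */
int toy_sync_env(toy_env *env, int max_attempts, size_t *synced_txnid)
{
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        size_t before = env->lead_txnid;   /* snapshot the lead txn id */
        if (toy_flush_data(env) != 0)
            return -1;
        if (env->lead_txnid == before) {   /* nothing changed under us */
            *synced_txnid = before;
            return 0;
        }
        /* The lead txn id moved during the flush: retry. */
    }
    return -1;
}
```

With `lead_txnid = 10` and two racing commits, the third flush is the first quiet one, so the call succeeds and reports txn 12 as synced. A bounded number of attempts is a design choice of this sketch; under a constant write load an unbounded loop might never finish.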
Fundamentally, you are trying to make an inherently unsafe configuration
"safer", but it's impossible. Assume you have mlock'd the meta pages into
memory, so the OS never flushes them itself any more, and you're running
with NOSYNC. That means, within 3 transactions, the data pages on disk will
be out of sync with the meta pages on disk. If the OS crashes at that point,
the entire DB will be lost.
Not a problem.
As I explained above, we should write the meta-pages explicitly after
the data sync.
We also should not reclaim pages ahead of the last checkpoint.
...
The only way to make this mode of operation somewhat safe is to defer
reclaiming pages for even longer. E.g., instead of halting at current_txnid
- 3, halt at current_txnid - 22, in which case the data pointed to by the
on-disk meta pages cannot get obsolete until 20 transactions have occurred.
Note that in combination with your LIFO patch, it's pretty much guaranteed
that the on-disk meta pages will be useless after only 2 un-sync'd
transactions.
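The effect of a larger lag can be shown with a small calculation. This is only an illustration: `reclaim_halt_txnid` and `may_reclaim` are hypothetical helpers, and treating the halt point as an exclusive bound is an assumption of this sketch, not a statement about LMDB's code.

```c
#include <stddef.h>

/* Hypothetical helper: the txn id at which reclaiming halts.
 * `lag` is how far behind the current txn the halt point sits. */
size_t reclaim_halt_txnid(size_t current_txnid, size_t lag)
{
    /* Saturate at 0 so a young database never underflows. */
    return current_txnid > lag ? current_txnid - lag : 0;
}

/* Assumption of this sketch: pages freed by txn `freed_by` may be reused
 * only if that txn lies strictly below the halt point. */
int may_reclaim(size_t freed_by, size_t current_txnid, size_t lag)
{
    return freed_by < reclaim_halt_txnid(current_txnid, lag);
}
```

At txn 100, a lag of 3 halts reclaiming at txn 97, while a lag of 22 halts it at txn 78, so pages freed by txn 90 are reusable in the first case and protected in the second.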
Yes, Howard, you are right.
But I think the discussion is confused because it mixes the
LIFO feature with the changes for checkpoint consistency in the NOSYNC and
WRITEMAP+NOSYNC modes.
For the "NOSYNC + checkpoints" topic I will submit a separate ITS (as with
the 'volatile'-related 7969, 7970, 7971).
In my opinion it is a flaw, and there is no reason not to fix it.
Continuing the conversation about checkpoints in a LIFO context.
I see the problem you described and have been thinking over a solution,
but have not yet found the "golden mean".
And since we are having a serious problem with syncrepl, I put
off this task with the excuse "the LIFO patch does not make things worse
than they were."
In general, we should not reclaim anything ahead of the txn that
is synced to disk (let this be named the R-rule).
To do so we need a second field like mti_txnid, but one that is
updated only at the end of mdb_env_write_meta().
Finally, we should start the search in mdb_find_oldest() from the value of
this new field instead of the current txn number.
This seems like it will work fine.
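A toy model of the R-rule as described above might look like this. It is a sketch under assumptions, not a patch: only `mti_txnid`, `mdb_env_write_meta()` and `mdb_find_oldest()` come from the discussion, while `synced_txnid` and all `toy_*` names are hypothetical.

```c
#include <stddef.h>

/* Toy model of the shared txn info; not LMDB's real layout. */
typedef struct {
    size_t mti_txnid;    /* latest committed txn */
    size_t synced_txnid; /* hypothetical second field: latest txn whose
                            meta-page is known to be on disk */
} toy_txninfo;

/* A NOSYNC commit advances only the in-memory txn id. */
void toy_commit(toy_txninfo *ti)
{
    ti->mti_txnid++;
}

/* Stands in for the end of mdb_env_write_meta() at a checkpoint:
 * only now does the synced txn id catch up. */
void toy_checkpoint(toy_txninfo *ti)
{
    ti->synced_txnid = ti->mti_txnid;
}

/* Stands in for mdb_find_oldest(): start from the synced txn instead of
 * the current one, then lower the bound to the oldest active reader. */
size_t toy_find_oldest(const toy_txninfo *ti,
                       const size_t *reader_txnids, size_t nreaders)
{
    size_t oldest = ti->synced_txnid;
    for (size_t i = 0; i < nreaders; i++)
        if (reader_txnids[i] < oldest)
            oldest = reader_txnids[i];
    return oldest;
}
```

After three un-synced commits on top of txn 5, the bound stays at 5 until the next checkpoint, so nothing that the on-disk meta-page may still reference gets reclaimed. Active readers can only lower the bound further, as before.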
However, I paused to think about the purpose of
checkpoints, about the design of LMDB as a product, about the expectations
of the user and the necessary configuration parameters:

1) checkpoints are needed ONLY in nosync modes;
2) if the user does NOT activate checkpoints, he does not care about
   consistency;
3) but if they are turned on, we MUST provide consistency at the checkpoints;
4) otherwise the checkpoint feature is pointless and should be REMOVED.

Therefore the implementation of checkpoints & reclaiming should be updated
to conform to the "R-rule" noted above.
From this point of view the LIFO feature should also be refined, but
it can nevertheless be very useful:

1) SYNC mode = benefits from storage with a write-back cache
   (assumed to be battery-backed).
2) ASYNC/NOSYNC without checkpoints = a significant reduction in
   committed/dirty pages and thereby far fewer write IOPS.
3) ASYNC/NOSYNC with checkpoints = seems to be the same as the SYNC case.

To sum up: I think we first need to fix reclaiming or
remove checkpoints, and then I will complete LIFO.
...
-- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/
Thanks for the conversation.
Leonid.