Re: (ITS#7958) LMDB: LIFO-reclaiming, write-performance improvement & bugfixes - openldap-bugs

19 Oct 2014

...
We are using the combination:
   envflags writemap nosync lifo
   checkpoint 0 1
If the checkpoint is set in seconds, it gives us the assurance of a
consistent database state on disk.
However, without this patch the meta-pages can be written by the kernel
before the data.
In fact, for a full guarantee in case the slapd process dies,
the meta-page should be written explicitly.
No, the DB can never go inconsistent due to a process crashing - the pages
in OS cache are always correct. It can only go inconsistent if the OS
crashes and a proper sync has not occurred.
Yes, Howard, you are right.
But apparently I need to be more precise.
By the "death" of slapd I meant all possible causes, including power-off.
For example, a power-off case:
- The main power is turned off and the system switches to the UPS.
- On receiving the notice, the OS starts an emergency shutdown of processes.
- For some reason (e.g. it does not have enough time to stop) slapd receives SIGKILL.
- The OS tries to write out the mmap region of the DB file, beginning with the
lowest address.
- Suppose the meta-pages have been written completely, but there is not
enough battery power left for the rest of the data.
- Now the DB on disk is completely destroyed.
To avoid this, the meta-pages should not be included in the rw-mapped
region, and should be written explicitly after the data pages.
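
The ordering argued for here (data pages first, a flush, then the meta page, then another flush) can be sketched outside of LMDB with plain POSIX file I/O. This is only an illustration, not LMDB code: `write_all_at` and `commit_data_then_meta` are hypothetical helpers, and the sketch assumes that fsync() really reaches stable storage on the target system, which depends on the OS and the drive's write cache.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* exposes pwrite() under strict -std= modes */
#endif
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write len bytes at offset off, looping over short writes. */
int write_all_at(int fd, const void *buf, size_t len, off_t off)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, off);
        if (n <= 0)
            return -1;
        p += n;
        off += n;
        len -= (size_t)n;
    }
    return 0;
}

/*
 * Hypothetical helper: make the data durable before the meta page that
 * points at it. If power is lost between the two fsync() calls, the old
 * meta page is still on disk and still refers to intact data.
 */
int commit_data_then_meta(int fd,
                          const void *data, size_t data_len, off_t data_off,
                          const void *meta, size_t meta_len, off_t meta_off)
{
    if (write_all_at(fd, data, data_len, data_off) != 0)
        return -1;
    if (fsync(fd) != 0)                 /* barrier: data first */
        return -1;
    if (write_all_at(fd, meta, meta_len, meta_off) != 0)
        return -1;
    return fsync(fd);                   /* then the meta page */
}
```

With WRITEMAP the data pages live in a shared mapping, so the first step would be an msync() of the data range rather than pwrite(); the point of the sketch is only the order of the two barriers.
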
...
commit 8ddd63161aeb2689822d1a8d27385d62e4e341ae
Author: Leo Yuriev <leo@yuriev.ru>
Date:   2014-09-19 22:47:19 +0400
   BUGFIX - lmdb: properly sync meta-pages in mdb_sync_env().

   Meta-pages may be updated during data-syncing in mdb_sync_env(),
   in this case database would be inconsistent.

   Check-and-retry if lead txn-id changed during flushing data in
   mdb_sync_env().
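
The check-and-retry idea in this commit message can be illustrated with a small single-threaded simulation. It is not the patch itself: `struct toy_env`, `toy_flush_data` and `toy_sync_env` are hypothetical names, and the `racing_commits` counter merely stands in for writers that commit while the data flush is in progress.

```c
/* Hypothetical stand-in for the environment; only what the sketch needs. */
struct toy_env {
    unsigned long lead_txnid;    /* txn-id of the newest in-memory meta page */
    unsigned long synced_txnid;  /* txn-id covered by the last complete sync */
    int racing_commits;          /* simulation only: commits that will land mid-flush */
    int data_flushes;            /* simulation only: number of data flushes done */
};

/* Simulated data flush; a "concurrent" commit may advance lead_txnid meanwhile. */
void toy_flush_data(struct toy_env *env)
{
    env->data_flushes++;
    if (env->racing_commits > 0) {
        env->racing_commits--;
        env->lead_txnid++;
    }
}

/*
 * Flush the data, then accept the meta page only if the lead txn-id did not
 * move during the flush; otherwise the meta page may describe pages that
 * were not part of the flush, so flush again.
 * Returns 0 on success, -1 if the retry budget is exhausted.
 */
int toy_sync_env(struct toy_env *env, int max_retries)
{
    for (int attempt = 0; attempt <= max_retries; attempt++) {
        unsigned long before = env->lead_txnid;
        toy_flush_data(env);
        if (env->lead_txnid != before)
            continue;                    /* a commit raced with the flush */
        env->synced_txnid = before;      /* meta page for 'before' is now safe to sync */
        return 0;
    }
    return -1;
}
```

Under a steady write load such a loop may never observe a quiet moment, which is why the sketch carries an explicit retry limit; whether the real patch bounds the retries is not stated in this thread.
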
Fundamentally, you are trying to make an inherently unsafe configuration
"safer", but it's impossible. Assume you have mlock'd the meta pages into
memory, so the OS never flushes them itself any more, and you're running
with NOSYNC. That means, within 3 transactions, the data pages on disk will
be out of sync with the meta pages on disk. If the OS crashes at that point,
the entire DB will be lost.
Not a problem.
As I explained above, we should write the meta-pages explicitly after
the data sync.
But we also should not perform reclaiming ahead of the last checkpoint.
...
The only way to make this mode of operation somewhat safe is to defer
reclaiming pages for even longer. E.g., instead of halting at current_txnid - 3,
halt at current_txnid - 22, in which case the data pointed to by the on-disk
meta pages cannot get obsolete until 20 transactions have occurred.
Note that in combination with your LIFO patch, it's pretty much guaranteed
that the on-disk meta pages will be useless after only 2 un-sync'd
transactions.
Yes, Howard, you are right.
But I think there is confusion in the discussion because it mixes the
LIFO feature with the changes for checkpoint consistency in the NOSYNC and
WRITEMAP+NOSYNC modes.
For the "NOSYNC + checkpoints" topic I will submit a separate ITS (like
the 'volatile'-related 7969, 7970, 7971).
In my opinion it is a flaw, and there is no reason not to fix it.
Continuing the conversation about checkpoints in the LIFO context.
I saw the problem you pointed out and have been thinking over a solution,
but have not yet found the "golden ratio".
And since we are having a serious problem with syncrepl, I put
this task off with the excuse "the LIFO patch makes things no worse than they were."
In general, we should not reclaim anything ahead of the txn that
is synced to disk (call this the R-rule).
To do so we need a second field like mti_txnid, but one that is
updated only at the end of mdb_env_write_meta().
Finally, the search in mdb_find_oldest() should start from the value of this
new field instead of the current txn number.
This seems like it will work fine.
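
A rough sketch of the R-rule as described above, written as a standalone simulation rather than as a patch against mdb.c. The names `toy_txninfo`, `toy_reader`, `synced_txnid`, `toy_write_meta_done` and `toy_find_oldest` are hypothetical; only mti_txnid, mdb_env_write_meta() and mdb_find_oldest() come from the discussion, and the real reader table is laid out differently.

```c
#include <stddef.h>

typedef unsigned long txnid_t;

/* Hypothetical shared state: the existing counter plus the proposed field. */
struct toy_txninfo {
    txnid_t last_txnid;    /* plays the role of mti_txnid */
    txnid_t synced_txnid;  /* proposed: last txn whose meta page reached disk */
};

/* Hypothetical reader slot; 'active' marks a slot that is in use. */
struct toy_reader {
    int     active;
    txnid_t txnid;
};

/* Would be called at the very end of the meta-page write, after the sync. */
void toy_write_meta_done(struct toy_txninfo *ti, txnid_t written)
{
    ti->synced_txnid = written;
}

/*
 * Oldest txn-id that must stay readable. With obey_r_rule set, the scan
 * starts from the synced txn instead of the current one, so nothing newer
 * than the on-disk snapshot is ever offered for reclaiming.
 */
txnid_t toy_find_oldest(const struct toy_txninfo *ti,
                        const struct toy_reader *readers, size_t nreaders,
                        int obey_r_rule)
{
    txnid_t oldest = obey_r_rule ? ti->synced_txnid : ti->last_txnid;
    for (size_t i = 0; i < nreaders; i++) {
        if (readers[i].active && readers[i].txnid < oldest)
            oldest = readers[i].txnid;
    }
    return oldest;
}
```

The effect is that the synced snapshot behaves like one extra, permanent reader: pages it references are held back until the next checkpoint moves `synced_txnid` forward.
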
However, I paused to reason about the purpose of checkpoints, about the
design of LMDB as a product, about user expectations and the necessary
configuration parameters:
- checkpoints are needed ONLY in nosync modes;
- if the user does NOT activate checkpoints, he does not care about
consistency;
- but if they are turned on, we MUST provide consistency at the checkpoints;
- otherwise the checkpoints feature is pointless and should be REMOVED.
Therefore the implementation of checkpoints and reclaiming should be updated
to conform to the "R-rule" noted above.
...
From this point of view the LIFO feature should also be refined, but
it can nevertheless be very useful:
- SYNC mode = benefits from storage with a write-back cache
(assumed to be battery-backed).
- ASYNC/NOSYNC without checkpoint = significant reduction of
committed/dirty pages and thereby far fewer write IOPS.
- ASYNC/NOSYNC with checkpoint = seems to be the same as the SYNC case.
To sum up: I think we first need to fix reclaiming or
delete the checkpoints, and then I will complete LIFO.
...
-- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/
Thanks for the conversation.
Leonid.