Re: (ITS#7958) LMDB: LIFO-reclaiming, write-performance improvement & bugfixes - openldap-bugs

18 Oct 2014

      Леонид Юрьев wrote:
...
2014-10-03 3:13 GMT+04:00 Howard Chu hyc@symas.com:
2014-10-04 0:04 GMT+04:00 Леонид Юрьев leo@yuriev.ru:
commit 841059330fd44769e93eb4b937c3ce42654fad6f
Author: Leo Yuriev leo@yuriev.ru
Date:   2014-09-20 07:16:15 +0400
   BUGFIX - lmdb: lock meta-pages in writemap-mode to avoid unexpected write,
                 before the data pages would be synchronized.
   Without locking the meta-pages may be written by OS before other data,
       in this case database would be inconsistent.
Seems unnecessary. Won't happen by default; could happen with MDB_NOSYNC but
that risk is already documented.
We are using the combination:
   envflags writemap nosync lifo
   checkpoint 0 1
If the checkpoint is set in seconds, it assures us of a
consistent database state on disk.
However, without this patch the meta-pages can be written by the kernel
before the data.
In fact, for a full guarantee in case the slapd process dies,
the meta-page should be written explicitly.
No, the DB can never go inconsistent due to a process crashing - the pages in 
OS cache are always correct. It can only go inconsistent if the OS crashes and 
a proper sync has not occurred.
...
...
But that requires a lot of changes, so I did not do it.
...
...
...
...
commit 8ddd63161aeb2689822d1a8d27385d62e4e341ae
Author: Leo Yuriev leo@yuriev.ru
Date:   2014-09-19 22:47:19 +0400
   BUGFIX - lmdb: properly sync meta-pages in mdb_sync_env().

   Meta-pages may be updated during data-syncing in mdb_sync_env(),
   in this case database would be inconsistent.

   Check-and-retry if lead txn-id changed during flushing data in mdb_sync_env().
Fundamentally, you are trying to make an inherently unsafe configuration 
"safer", but it's impossible. Assume you have mlock'd the meta pages into 
memory, so the OS never flushes them itself any more, and you're running with 
NOSYNC. That means, within 3 transactions, the data pages on disk will be out 
of sync with the meta pages on disk. If the OS crashes at that point, the 
entire DB will be lost.
The only way to make this mode of operation somewhat safe is to defer 
reclaiming pages for even longer. E.g., instead of halting at current_txnid - 
3, halt at current_txnid - 22, in which case the data pointed to by the 
on-disk meta pages cannot get obsolete until 20 transactions have occurred.
Note that in combination with your LIFO patch, it's pretty much guaranteed 
that the on-disk meta pages will be useless after only 2 un-sync'd transactions.
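
The arithmetic above can be restated as a small helper. This is only a sketch: `unsynced_txn_budget` and `reclaim_lag` are hypothetical names, not LMDB identifiers, and the subtraction of 2 simply reproduces the one example given (a lag of 22 yields 20 transactions). Treating that as a general rule is an assumption.

```c
/*
 * Hypothetical helper, not part of LMDB.
 *
 * reclaim_lag: how far behind current_txnid page reclaiming halts
 *              (3 in the default case described above, 22 in the example).
 * Returns the number of un-synced transactions that can occur before the
 * data referenced by the on-disk meta pages may become obsolete.
 *
 * Assumption: budget = reclaim_lag - 2, generalised from "22 -> 20".
 */
static long unsynced_txn_budget(long reclaim_lag)
{
    long budget = reclaim_lag - 2;
    return budget > 0 ? budget : 0;
}
```

Under that assumption a lag of 3 leaves a budget of a single un-synced transaction, which is in line with the estimate that the on-disk meta pages go stale within 2 to 3 transactions.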
...
...
...
Probably could simplify this, just obtain the write mutex unconditionally,
then there's no need to loop or retry. But also, this depends on MDB_NOLOCK:
if that's set, then do no locking at all.

I did it this way for performance and to shorten the time the lock is held.
Retries happen only when there is an intensive flow of changes.
In that case there will be a lot of updated pages, and writing them out
will take some time.
However, in subsequent iterations (if transactions were committed
while the write was in progress),
there will be far fewer modified pages, and the sync will be quick.
Thus (and this was seen in tests), even with a substantial volume of
transactions,
the loop usually runs only two iterations,
without taking the lock and without suspending the flow of changes.
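
The check-and-retry idea discussed here can be sketched with a toy model. This is not the code from the patch and not LMDB's API: `struct toy_env`, `toy_flush_data`, `toy_flush_meta` and `toy_sync_env` are hypothetical names, and concurrent commits are simulated by a counter rather than real writer threads.

```c
/* Hypothetical stand-in for the environment; NOT LMDB's MDB_env. */
struct toy_env {
    unsigned long txnid;         /* id of the last committed transaction */
    int pending_commits;         /* commits that will land during flushes */
    int data_flushes;            /* number of data flushes performed */
    unsigned long synced_txnid;  /* txn id recorded by the last meta flush */
};

/* Simulated data flush: one writer commit may slip in while it runs. */
static void toy_flush_data(struct toy_env *env)
{
    env->data_flushes++;
    if (env->pending_commits > 0) {
        env->pending_commits--;
        env->txnid++;
    }
}

/* Simulated meta-page flush: remember which txn the disk now describes. */
static void toy_flush_meta(struct toy_env *env, unsigned long txnid)
{
    env->synced_txnid = txnid;
}

/*
 * Check-and-retry: flush the data, then flush the meta page only once a
 * whole data flush has completed with no commit in between.
 * Returns the number of loop iterations.
 */
static int toy_sync_env(struct toy_env *env)
{
    int iterations = 0;
    unsigned long before;

    do {
        before = env->txnid;      /* snapshot the txn id */
        toy_flush_data(env);      /* no write lock is held here */
        iterations++;
    } while (env->txnid != before);  /* a commit raced with us: retry */

    toy_flush_meta(env, before);
    return iterations;
}
```

With one commit arriving during the first flush, the loop runs twice, matching the "usually only two iterations" observation. A real implementation would need an atomic read of the txn id, and the loop has no upper bound on retries under a sustained write load, which is what taking the write mutex unconditionally would avoid.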
-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/