Re: (ITS#7958) LMDB: LIFO-reclaiming, write-performance improvement & bugfixes - openldap-bugs

3 Oct 2014

2014-10-03 3:13 GMT+04:00 Howard Chu hyc@symas.com:
...
...
commit 841059330fd44769e93eb4b937c3ce42654fad6f
Author: Leo Yuriev leo@yuriev.ru
Date:   2014-09-20 07:16:15 +0400
  BUGFIX - lmdb: lock meta-pages in writemap-mode to avoid unexpected write,
                before the data pages would be synchronized.
  Without locking the meta-pages may be written by OS before other data,
      in this case database would be inconsistent.
Seems unnecessary. Won't happen by default; could happen with MDB_NOSYNC but
that risk is already documented.
We are using the combination:
  envflags writemap nosync lifo
  checkpoint 0 1
With the checkpoint interval set in seconds, this gives us assurance
of a consistent database state on disk.
However, without this patch the meta-pages can be written by the kernel
before the data.
In fact, for a full guarantee in case the slapd process dies,
the meta-page should be written explicitly.
But that requires a lot of changes, so I did not do it.
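
To illustrate what "written explicitly" could mean, here is a minimal
sketch of the ordering only: flush the data page first, and touch the
meta page in the shared mapping only after that flush has returned. It
is not LMDB code. struct demo_meta, demo_map_file() and demo_commit()
are hypothetical names, page 0 is assumed to be the only meta page, and
page_size is assumed to be the value of sysconf(_SC_PAGESIZE) so that
msync() gets page-aligned addresses.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical meta-page contents; this is not LMDB's real MDB_meta. */
struct demo_meta {
    uint64_t txnid;      /* id of the last committed transaction */
    uint64_t last_page;  /* page written by that transaction */
};

/* Create (or truncate) a file of npages pages and map it shared and
 * writable. Returns the mapping, or NULL on error. */
static unsigned char *demo_map_file(const char *path, size_t npages,
                                    size_t page_size, int *fd_out)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return NULL;
    size_t len = npages * page_size;
    void *map = MAP_FAILED;
    if (ftruncate(fd, (off_t)len) == 0)
        map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return NULL;
    }
    *fd_out = fd;
    return map;
}

/* Commit one page of payload, then publish it through the meta page.
 * The meta page (page 0) is modified only after the data page has been
 * flushed, so it can never reach the disk ahead of the data it refers
 * to. Returns 0 on success, -1 on error. */
static int demo_commit(unsigned char *map, size_t npages, size_t page_size,
                       uint64_t txnid, size_t data_page,
                       const void *payload, size_t payload_len)
{
    assert(map != NULL && payload != NULL);
    if (data_page == 0 || data_page >= npages || payload_len > page_size)
        return -1;

    /* Step 1: write the data page and wait until it is on disk. */
    unsigned char *data = map + data_page * page_size;
    memcpy(data, payload, payload_len);
    if (msync(data, page_size, MS_SYNC) != 0)
        return -1;

    /* Step 2: only now update and flush the meta page. */
    struct demo_meta meta = { txnid, (uint64_t)data_page };
    memcpy(map, &meta, sizeof meta);
    return msync(map, page_size, MS_SYNC);
}
```

The sketch pays for two synchronous flushes per commit, which is exactly
the cost that the nosync mode discussed here is meant to avoid.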
...
...
commit 0c168d0e63ed78d13df3fc8a42f3667335678639
Author: Leo Yuriev leo@yuriev.ru
Date:   2014-09-20 10:13:28 +0400
  FEATURE - lmdb: MDB_LIFORECLAIM & MDB_COALESCE modes.

  Reclaim FreeDB in LIFO order - this is a main feature.
  Also aim to coalesce small FreeDB records.

Will spend more time looking at this closer.
I would suggest, but do not insist, reviewing this patch on GitHub.
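
As a rough illustration of the ordering difference only (the patch
itself is the reference), the sketch below assumes FreeDB records are
keyed by the id of the transaction that freed the pages, sorted in
ascending order, and that a record is reusable only if its id is below
the oldest reader's txn-id. demo_pick_record() is a hypothetical helper,
not a function from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Choose which free-list record to reclaim next.
 * txnids:        record keys, sorted ascending (assumed layout)
 * oldest_reader: txn-id of the oldest active reader; records with a
 *                key >= this value may still be in use and are skipped
 * lifo:          nonzero picks the newest eligible record, zero picks
 *                the oldest one (FIFO)
 * Returns the index of the chosen record, or -1 if none is eligible. */
static ptrdiff_t demo_pick_record(const uint64_t *txnids, size_t n,
                                  uint64_t oldest_reader, int lifo)
{
    assert(txnids != NULL || n == 0);

    /* Count the leading records that no reader can still see. */
    size_t eligible = 0;
    while (eligible < n && txnids[eligible] < oldest_reader)
        eligible++;

    if (eligible == 0)
        return -1;
    return lifo ? (ptrdiff_t)(eligible - 1) : 0;
}
```

With record keys 3, 5, 8, 12 and an oldest reader at txn-id 10, FIFO
order picks the record keyed 3 and LIFO order picks the record keyed 8.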
...
...
commit 8ddd63161aeb2689822d1a8d27385d62e4e341ae
Author: Leo Yuriev leo@yuriev.ru
Date:   2014-09-19 22:47:19 +0400
  BUGFIX - lmdb: properly sync meta-pages in mdb_sync_env().

  Meta-pages may be updated during data-syncing in mdb_sync_env(),
  in this case database would be inconsistent.

  Check-and-retry if lead txn-id changed during flushing data in mdb_sync_env().
Probably could simplify this, just obtain the write mutex unconditionally,
then there's no need to loop or retry. But also, this depends on MDB_NOLOCK:
if that's set, then do no locking at all.

I did it this way for performance and to keep the lock hold time short.
Retries happen only when there is an intensive flow of changes.
In that case there will be many updated pages, and writing them out
will take some time.
However, on subsequent iterations (if transactions were committed
while that write was in progress),
there will be far fewer modified pages, and the sync will be quick.
Thus (and this was seen in tests) even with a substantial volume of
transactions,
the loop usually takes only two iterations,
with no locking, and the flow of changes is not suspended.
...
...
commit 147f41a8110f28456bc32123bde86d47183f9c0a
Author: Leo Yuriev leo@yuriev.ru
Date:   2014-09-04 16:01:15 +0400
  FEATURE - lmdb: implementation of "checkpoint kbytes".

  Force flush when volume of the changes reached a configurable threshold.
Probably OK. Needs some typographical cleanup. Not sure "syncbytes" is a
good name.
Agree.
I just took the first choice and tried to retain the style.
Ideas?
...
...
commit fb82a0b688f4c31313d0790415feda8aaa18651c
Author: Leo Yuriev leo@yuriev.ru
Date:   2014-09-04 15:18:16 +0400
  CHANGE - lmdb-backend: checkpoint-interval in seconds instead of minutes.
Gratuitous change. We used minutes since the BDB backend uses minutes, and
the intention was to maintain parallel functionality. What's the
justification for this change?
As I wrote above, we are using the combination:
  envflags writemap nosync lifo
  checkpoint 0 1
If the interval is specified in minutes, it cannot be set to less
than one minute.
But that is too long a window in which updates could be lost.
With a synchronization interval of one second, however,
we reduce the losses in the event of a crash to an
acceptable level,
while the load on the storage system remains acceptable even for a
large flow of updates.
As a result, I have not found a better solution than simply replacing
minutes with seconds.
...
...
commit fc409d89e0d9dde20f612e34c2a463c8a81ea000
Author: Leo Yuriev leo@yuriev.ru
Date:   2014-09-20 06:51:04 +0400
  EXTENSION - lmdb: more useful info from mdb_stat tool.

A bit ambiguous. me_tail_txnid is actually the ID of the oldest reader, not
the "last" reader. I'm not convinced of the value of this patch, since you
can already view the readers list.
I agree that "tail" is the best choice.
But the main value of this patch is not to show the txn of the oldest
reader, but to show information about page usage,
especially the number of pages that are "blocked" by the oldest (laggard)
reader, and how many pages are actually available.
...
--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/
Thank you in advance.
BR.
Leonid Yuriev.