For information only: the 'lifo' and 'coalesce' features are now implemented in the ReOpenLDAP fork.
1) lmdb: MDB_LIFORECLAIM & MDB_COALESCE modes. https://github.com/ReOpen/ReOpenLDAP/commit/829c2063b602238b5c93ea36a981de3d0d7994bc
2) lmdb-backend: support config for 'lifo' and 'coalesce' envflags. https://github.com/ReOpen/ReOpenLDAP/commit/08b4a41b5b837548444ef0fef7614940c41d882a
There are a couple of open issues:
1) lmdb in 'writemap' mode may become inconsistent even with checkpoints. https://github.com/ReOpen/ReOpenLDAP/issues/1
2) The lifo feature should be synchronized with checkpoints. https://github.com/ReOpen/ReOpenLDAP/issues/2
However, it currently gives a reasonable boost (5-10 times) in write performance in our use case.
Leonid.
2014-10-20 0:27 GMT+03:00 Леонид Юрьев leo@yuriev.ru:
We are using the combination:
    envflags writemap nosync lifo
    checkpoint 0 1
If the checkpoint is set in seconds, it gives us assurance of a consistent database state on disk. However, without this patch the meta-pages can be written by the kernel before the data.
In fact, for a full guarantee in case the slapd process dies, the meta-pages should be written explicitly.
No, the DB can never go inconsistent due to a process crashing - the pages in OS cache are always correct. It can only go inconsistent if the OS crashes and a proper sync has not occurred.
Yes, Howard, you are right. But apparently I need to be more precise. Talking about the "death" of slapd, I meant all possible causes, including power off.
For example, a power-off case:
- The main power is turned off and the system switches to the UPS.
- Given the notice, the OS starts an emergency shutdown of processes.
- For some reason (e.g. it does not have enough time to stop) slapd receives SIGKILL.
- The OS tries to write the mmap region of the DB file and begins with the lowest address.
- Suppose the meta-pages have been written completely, but there is not enough battery power left for the rest of the data.
- Now the DB on disk is completely destroyed.
To avoid this, the meta-pages should not be included in the rw-mapped region, and should be written explicitly after the data pages.
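As a rough illustration of the "data first, meta last" ordering, here is a minimal sketch on a plain file descriptor. It is not LMDB code: `struct toy_meta`, `TOY_DATA_OFFSET`, `write_all()` and `toy_checkpoint()` are names invented for this example, and the on-disk layout (one meta record at offset 0, data after it) is only an assumption.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical toy meta record; NOT the real MDB_meta layout. */
struct toy_meta {
    uint64_t txnid;     /* id of the last durable transaction */
    uint64_t data_len;  /* bytes of data covered by this meta */
};

#define TOY_DATA_OFFSET 4096  /* assumed: data starts after the meta area */

/* Write the whole buffer at the given offset, coping with short writes. */
static int write_all(int fd, const void *buf, size_t len, off_t off)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, off);
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
        off += n;
    }
    return 0;
}

/* Make the data durable first, and only then publish the meta record. */
int toy_checkpoint(int fd, uint64_t txnid, const void *data, size_t len)
{
    struct toy_meta meta = { txnid, len };

    if (write_all(fd, data, len, TOY_DATA_OFFSET) != 0)
        return -1;
    if (fsync(fd) != 0)            /* barrier: data must reach the disk */
        return -1;
    if (write_all(fd, &meta, sizeof meta, 0) != 0)
        return -1;
    return fsync(fd);              /* now the meta may reach the disk */
}
```

If the power fails between the two fsync() calls, the old meta record is still on disk and still describes data that was fully written earlier, which is the property the power-off scenario above lacks.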
commit 8ddd63161aeb2689822d1a8d27385d62e4e341ae
Author: Leo Yuriev leo@yuriev.ru
Date: 2014-09-19 22:47:19 +0400

    BUGFIX - lmdb: properly sync meta-pages in mdb_sync_env().

    Meta-pages may be updated during data-syncing in mdb_sync_env(),
    in this case database would be inconsistent.
    Check-and-retry if lead txn-id changed during flushing data in mdb_sync_env().
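The check-and-retry idea from that commit message can be sketched roughly as follows. This is a single-threaded simulation, not the patch itself: `struct toy_env`, `toy_flush_data()`, `toy_write_meta()` and `toy_sync_env()` are hypothetical names, and the "concurrent commit" is faked by a counter. A real implementation would need proper atomics or locking around the txn-id.

```c
#include <stdint.h>

/* Hypothetical environment state; the real MDB_env is different. */
struct toy_env {
    uint64_t lead_txnid;        /* newest committed transaction */
    uint64_t synced_txnid;      /* txn id recorded by the last meta write */
    int commits_during_flush;   /* simulation knob: flushes that get raced */
    int flush_calls;            /* how many data flushes were performed */
};

/* Stand-in for msync()/fsync() of the data pages. */
static int toy_flush_data(struct toy_env *env)
{
    env->flush_calls++;
    if (env->commits_during_flush > 0) {
        env->commits_during_flush--;
        env->lead_txnid++;      /* simulate a writer committing mid-flush */
    }
    return 0;
}

/* Stand-in for the explicit meta-page write. */
static int toy_write_meta(struct toy_env *env, uint64_t txnid)
{
    env->synced_txnid = txnid;
    return 0;
}

/* Flush data, and write the meta only if no commit raced with the flush. */
int toy_sync_env(struct toy_env *env, int max_retries)
{
    for (int attempt = 0; attempt <= max_retries; attempt++) {
        uint64_t before = env->lead_txnid;
        int rc = toy_flush_data(env);
        if (rc != 0)
            return rc;
        if (env->lead_txnid == before)
            return toy_write_meta(env, before);
        /* The lead txn changed: its data may be only partly flushed. Retry. */
    }
    return -1;                  /* gave up; on-disk meta is left untouched */
}
```

The retry bound is an addition of this sketch, so that a constantly busy writer cannot keep the sync loop spinning forever.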
Fundamentally, you are trying to make an inherently unsafe configuration "safer", but it's impossible. Assume you have mlock'd the meta pages into memory, so the OS never flushes them itself any more, and you're running with NOSYNC. That means, within 3 transactions, the data pages on disk will be out of sync with the meta pages on disk. If the OS crashes at that point, the entire DB will be lost.
Not a problem. As I explained above, we should write the meta-pages explicitly after the data sync. But we also should not reclaim pages ahead of the last checkpoint.
The only way to make this mode of operation somewhat safe is to defer reclaiming pages for even longer. E.g., instead of halting at current_txnid - 3, halt at current_txnid - 22, in which case the data pointed to by the on-disk meta pages cannot get obsolete until 20 transactions have occurred.
Note that in combination with your LIFO patch, it's pretty much guaranteed that the on-disk meta pages will be useless after only 2 un-sync'd transactions.
Yes, Howard, you are right. But I think there is confusion in the discussion because the LIFO feature is being mixed with the changes for checkpoint consistency in the NOSYNC and WRITEMAP+NOSYNC modes. For the "NOSYNC + checkpoints" topic I will submit a separate ITS (like the 'volatile'-related 7969, 7970, 7971). In my opinion it is a flaw, and there is no reason not to fix it.
Continuing the conversation about checkpoints in the LIFO context: I saw the problem you described and have been thinking over a solution, but have not yet found the "golden ratio". And since we are having a serious problem with syncrepl, I put this task off with the excuse that "the LIFO patch does not make things worse than they were".
In general, we should not reclaim anything ahead of the txn that is synced to disk (let this be called the R-rule). To do so we need a second field like mti_txnid, but one which is updated only at the end of mdb_env_write_meta(). Finally, we should start the search in mdb_find_oldest() from the value of this new field instead of the current txn number. This seems like it will work fine.
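A minimal sketch of how the R-rule could look, assuming a separate "synced txn id" is tracked. The function names and the flat reader array are hypothetical and do not reflect the real mdb_find_oldest() or the LMDB reader table. The strict "less than" comparison is a deliberately conservative choice for the sketch.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical variant of the "find oldest" search: start from the txn id
 * that is known to be synced to disk (not from the current txn number) and
 * lower it further to the oldest snapshot still held by any reader.
 */
uint64_t toy_find_oldest(uint64_t synced_txnid,
                         const uint64_t *reader_txnids, size_t nreaders)
{
    uint64_t oldest = synced_txnid;
    for (size_t i = 0; i < nreaders; i++) {
        if (reader_txnids[i] < oldest)
            oldest = reader_txnids[i];
    }
    return oldest;
}

/*
 * R-rule check: pages freed by a txn may be reused only if that txn is
 * strictly older than both the synced txn and every active reader.
 */
int toy_can_reclaim(uint64_t freed_by_txnid, uint64_t oldest)
{
    return freed_by_txnid < oldest;
}
```

With the synced txn id at 10 and readers at 9 and 12, the limit is 9, so pages freed by txn 8 could be reused while pages freed by txn 9 or later would have to wait for the next checkpoint or for the reader to finish.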
However, I paused to reason about the purpose of checkpoints, about the design of LMDB as a product, about user expectations and the necessary configuration parameters:
- checkpoints are needed ONLY in nosync modes;
- if the user does NOT activate checkpoints, he does not care about consistency;
- but if they are turned on, we MUST provide consistency at the checkpoints;
- otherwise the checkpoint feature is pointless and should be REMOVED.
Therefore the implementation of checkpoints & reclaiming should be updated to conform to the "R-rule" noted above.
From this point of view the LIFO feature should also be refined, but it can nevertheless be very useful.
- SYNC mode = benefits from storage with a write-back cache (assumed to be battery-backed).
- ASYNC/NOSYNC without checkpoint = significant reduction of committed/dirty pages and thereby far fewer write IOPS.
- ASYNC/NOSYNC with checkpoint = seems to be the same as the SYNC case.
To sum up: I think we first need to fix reclaiming or delete the checkpoints, and then I will complete LIFO.
-- Howard Chu CTO, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/
Thanks for the conversation. Leonid.