Леонид Юрьев wrote:
2014-10-03 3:13 GMT+04:00 Howard Chu hyc@symas.com: 2014-10-04 0:04 GMT+04:00 Леонид Юрьев leo@yuriev.ru:
2014-10-03 3:13 GMT+04:00 Howard Chu hyc@symas.com:
commit 841059330fd44769e93eb4b937c3ce42654fad6f Author: Leo Yuriev leo@yuriev.ru Date: 2014-09-20 07:16:15 +0400
BUGFIX - lmdb: lock meta-pages in writemap-mode to avoid unexpected
write, before the data pages would be synchronized.
Without locking, the meta-pages may be written by the OS before other
data, in this case database would be inconsistent.
Seems unnecessary. Won't happen by default; could happen with MDB_NOSYNC but that risk is already documented.
We are using the combination:
    envflags writemap nosync lifo
    checkpoint 0 1
With the checkpoint interval set in seconds, this gives us assurance of a consistent database state on disk. However, without this patch the meta-pages can be written by the kernel before the data pages.
In fact, for a full guarantee in case the slapd process dies, the meta-page should be written explicitly.
No, the DB can never go inconsistent due to a process crashing - the pages in OS cache are always correct. It can only go inconsistent if the OS crashes and a proper sync has not occurred.
But that requires a lot of changes, so I did not do it.
commit 8ddd63161aeb2689822d1a8d27385d62e4e341ae Author: Leo Yuriev leo@yuriev.ru Date: 2014-09-19 22:47:19 +0400
BUGFIX - lmdb: properly sync meta-pages in mdb_sync_env(). Meta-pages may be updated during data-syncing in mdb_sync_env(), in this case database would be inconsistent. Check-and-retry if lead txn-id changed during flushing data in
mdb_sync_env().
Fundamentally, you are trying to make an inherently unsafe configuration "safer", but it's impossible. Assume you have mlock'd the meta pages into memory, so the OS never flushes them itself any more, and you're running with NOSYNC. That means, within 3 transactions, the data pages on disk will be out of sync with the meta pages on disk. If the OS crashes at that point, the entire DB will be lost.
The only way to make this mode of operation somewhat safe is to defer reclaiming pages for even longer. E.g., instead of halting at current_txnid - 3, halt at current_txnid - 22, in which case the data pointed to by the on-disk meta pages cannot get obsolete until 20 transactions have occurred.
Note that in combination with your LIFO patch, it's pretty much guaranteed that the on-disk meta pages will be useless after only 2 un-sync'd transactions.
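To make the arithmetic above concrete, here is a minimal sketch of the deferred-reclaim rule. It is not LMDB code: txnid_t is a stand-in typedef, reclaim_horizon() and can_reclaim() are hypothetical names, and treating the boundary as inclusive is an assumption.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t txnid_t;  /* stand-in for a transaction ID type (assumption) */

/* Newest transaction ID whose freed pages may be reused, given the current
 * transaction ID and the number of recent transactions to leave untouched.
 * Returns 0 when the history is too short for anything to be reclaimed. */
txnid_t reclaim_horizon(txnid_t current_txnid, txnid_t keep_back)
{
    if (current_txnid <= keep_back)
        return 0;
    return current_txnid - keep_back;
}

/* Nonzero if pages freed by transaction `freed_by` may be reused.
 * The inclusive comparison is an assumption made for this sketch. */
int can_reclaim(txnid_t freed_by, txnid_t current_txnid, txnid_t keep_back)
{
    txnid_t horizon = reclaim_horizon(current_txnid, keep_back);
    return horizon != 0 && freed_by <= horizon;
}

/* Usage: compare a lag of 3 with a lag of 22 at transaction 100. */
void horizon_examples(void)
{
    assert(reclaim_horizon(100, 3) == 97);
    assert(reclaim_horizon(100, 22) == 78);
    assert(can_reclaim(90, 100, 3));    /* reusable with the short lag */
    assert(!can_reclaim(90, 100, 22));  /* still protected with the long lag */
}
```

With the longer lag, pages freed by transaction 90 stay intact at transaction 100, so an on-disk meta page that still refers to them remains usable for longer.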
Probably could simplify this, just obtain the write mutex unconditionally, then there's no need to loop or retry. But also, this depends on MDB_NOLOCK: if that's set, then do no locking at all.
I did it this way for performance and to hold the lock for less time.
Retries happen only when there is an intensive flow of changes. In that case there will be many updated pages, and writing them will take some time.
However, in subsequent iterations (if transactions were committed while the write was in progress), far fewer pages will have been modified, and the sync will be quick.
Thus (and this was seen in tests), even with a substantial volume of transactions there are usually only two iterations of the loop, no lock is taken, and the flow of changes is not suspended.
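The check-and-retry idea discussed above can be sketched roughly as follows. This is a simulation under stated assumptions, not the code from the patch: sim_env, sim_flush_data() and sync_with_retry() are hypothetical names, and commits that land during a flush are modelled by a fixed array instead of real concurrent writers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t txnid_t;  /* stand-in for a transaction ID type (assumption) */

/* Simulated environment (hypothetical, for illustration only). */
typedef struct {
    txnid_t txnid;                   /* ID of the last committed transaction */
    txnid_t synced_txnid;            /* transaction the on-disk meta page describes */
    const int *commits_during_flush; /* commits landing during the i-th data flush */
    size_t n_flushes;                /* length of commits_during_flush */
    size_t flush_count;              /* data flushes performed so far */
} sim_env;

/* Simulated data flush: while it runs, writers may commit more transactions. */
static void sim_flush_data(sim_env *env)
{
    size_t i = env->flush_count++;
    if (i < env->n_flushes)
        env->txnid += (txnid_t)env->commits_during_flush[i];
}

/* Flush data pages without holding a lock, then write the meta page only if
 * no transaction committed during the flush; otherwise flush again.
 * Returns the number of passes used, or -1 after max_tries failed passes
 * (a caller could then fall back to taking the write mutex). */
int sync_with_retry(sim_env *env, int max_tries)
{
    assert(env != NULL && max_tries > 0);
    for (int pass = 1; pass <= max_tries; pass++) {
        txnid_t before = env->txnid;   /* snapshot the last committed txn */
        sim_flush_data(env);           /* may race with new commits */
        if (env->txnid == before) {
            env->synced_txnid = before; /* meta now matches the flushed data */
            return pass;
        }
        /* Commits landed during the flush: retry. The next pass only has
         * the newly dirtied pages left to write. */
    }
    return -1;
}
```

For example, with five commits landing during the first flush and none during the second, the loop finishes in two passes, which matches the behaviour described above. Taking the write mutex unconditionally would remove the loop, at the cost of blocking writers for the whole flush.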