Hello,
Working with overflow pages directly via pointers outside write transactions works great and it helps that they do not move "by design" in current versions as discussed in this thread.
I have two related scenarios that will give a substantial performance boost in my case.
*The first one* is updating a value in place via a pointer from an aborted write transaction. If I
1) use MDB_WRITEMAP, 2) from a **write** transaction find a record (which is small and not in an overflow page), 3) modify a part of its value (for duplicates this part is not used in the compare function) directly via the MDB_val data pointer (e.g. interlocked_increment or compare_and_swap), 4) and **abort** the transaction,
then readers later see the updated value via normal read transactions. Since I do the direct updates from inside a write transaction, all other writers should be blocked until I exit the transaction (abort, in this case), and no pages should move since the transaction is aborted. Is this correct? Does this work "by design" or "by accident" currently?
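To make this concrete, here is a minimal sketch of what I mean (a hypothetical helper, not production code: it assumes an environment opened with MDB_WRITEMAP, an already-opened dbi, an existing key, and a properly aligned counter at a known offset inside the value; error handling is trimmed):

```c
#include <stddef.h>
#include <stdint.h>
#include <lmdb.h>

/* Hypothetical sketch of scenario 1: bump a counter inside a small
 * (non-overflow) value through the pointer returned by mdb_get, then
 * abort so LMDB itself never dirties or copies the page. */
static int bump_counter(MDB_env *env, MDB_dbi dbi, MDB_val *key, size_t counter_offset)
{
    MDB_txn *txn;
    MDB_val data;
    int rc = mdb_txn_begin(env, NULL, 0, &txn);   /* write txn: serializes other writers */
    if (rc != 0)
        return rc;
    rc = mdb_get(txn, dbi, key, &data);           /* with MDB_WRITEMAP this points into the map */
    if (rc == 0) {
        uint64_t *ctr = (uint64_t *)((char *)data.mv_data + counter_offset);
        __sync_fetch_and_add(ctr, 1);             /* direct in-place write (GCC/Clang builtin) */
    }
    mdb_txn_abort(txn);                           /* abort: LMDB commits nothing, no page copy */
    return rc;
}
```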
*The second one* is about updating values in place from read transactions. If I
1) use MDB_WRITEMAP, 2) open a database with MDB_INTEGERKEY (and could use a dedicated environment with a single DB if that changes the answer), 3) add values to the DB *only* using MDB_APPEND | MDB_NOOVERWRITE, 4) modify a part of a value directly via the MDB_val data pointer,
is it possible that the page seen from the read transaction is replaced with a new one if there is a parallel write transaction?
There is a quote from Howard on GitHub (https://github.com/lmdbjava/benchmarks/issues/9#issuecomment-354184989): "When we do sequential inserts using MDB_APPEND, there is *no page split at all* - we just allocate a new page and fill it, sequentially." Does this mean that if there are no page splits then existing pages do not move either, and it is "safe" to use pointers outside of write transactions, as is the case with overflow pages?
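For reference, the access pattern I have in mind for this second scenario looks roughly like the sketch below (same hedges as above: hypothetical helpers, 64-bit keys, error handling trimmed; append_value is the only writer, and the direct write in touch_from_reader is exactly the part whose safety I am asking about):

```c
#include <stddef.h>
#include <stdint.h>
#include <lmdb.h>

/* Writer side: append-only inserts into an MDB_INTEGERKEY database. */
static int append_value(MDB_env *env, MDB_dbi dbi, uint64_t id, const void *buf, size_t len)
{
    MDB_txn *txn;
    MDB_val key = { sizeof(id), &id };
    MDB_val data = { len, (void *)buf };
    int rc = mdb_txn_begin(env, NULL, 0, &txn);
    if (rc != 0)
        return rc;
    rc = mdb_put(txn, dbi, &key, &data, MDB_APPEND | MDB_NOOVERWRITE);
    if (rc == 0)
        rc = mdb_txn_commit(txn);
    else
        mdb_txn_abort(txn);
    return rc;
}

/* The questionable part: write one byte in place through a pointer
 * obtained inside a read-only transaction (mechanically possible
 * because MDB_WRITEMAP maps the file read-write). */
static int touch_from_reader(MDB_env *env, MDB_dbi dbi, uint64_t id, size_t offset, uint8_t stage)
{
    MDB_txn *txn;
    MDB_val key = { sizeof(id), &id }, data;
    int rc = mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);
    if (rc != 0)
        return rc;
    rc = mdb_get(txn, dbi, &key, &data);
    if (rc == 0)
        ((uint8_t *)data.mv_data)[offset] = stage;  /* direct in-place write from a read txn */
    mdb_txn_abort(txn);
    return rc;
}
```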
In both cases I update fields of a struct that indicate, e.g., some lifecycle stage of an object the LMDB record refers to, and the stage transitions are idempotent. If a direct pointer write doesn't make it to disk due to a system failure, subsequent readers (workers) will see an older stage and repeat the stage transition.
Therefore missed direct writes do not break application logic; I only care about physical corruption of the entire DB. If I update values in place inside read transactions and the page becomes stale, this should not corrupt the DB, since the old page goes to the free list only after the read transaction is finished, so this "hack" should not break the DB. But then missed writes would be the norm, not a special case that only happens on OS failure. If pages do not move, however, all these "soft" updates could be done in parallel and be very fast.
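For illustration, the kind of idempotent, forward-only transition I mean (hypothetical field layout; stage_ptr would point at a 32-bit stage field inside the mapped value):

```c
#include <stdint.h>

/* Hypothetical stage-transition helper: it only advances the stage, and
 * a CAS that finds an unexpected value simply does nothing. If a direct
 * write is lost before a sync, the next worker just sees the older stage
 * and repeats the (idempotent) transition - nothing breaks logically. */
static void advance_stage(volatile uint32_t *stage_ptr, uint32_t from, uint32_t to)
{
    __sync_bool_compare_and_swap(stage_ptr, from, to);  /* GCC/Clang builtin */
}
```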
Unfortunately I cannot answer this myself from trying to read the mdb.c file. In the second scenario I'm specifically concerned about what happens when the DB becomes large and the tree needs rebalancing. At least in that case some pages need to move, but does the rebalancing replace/split existing pages?
Thanks & best regards, Victor
On Fri, Oct 30, 2015 at 9:26 PM, Howard Chu hyc@symas.com wrote:
Victor Baybekov wrote:
Thanks a lot! My proof-of-concept code works OK.
I do not understand all the subtle details of mmap reliability; could you please help with these two questions:
If I write data to a pointer to an opaque blob as discussed above, and my process crashes before mdb_env_sync, but OS doesn't crash - will that data be secure in the mmap file?
Of course. The OS owns the memory, it doesn't matter if your process crashes.
Also, am I correct that mdb_env_sync synchronizes all dirty pages in the mmap file as seen by the file system, regardless of how they were modified - either via the LMDB API or via direct pointer writes?
Yes.
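(For reference, a one-call sketch of how such direct writes would then be flushed, assuming the environment uses MDB_WRITEMAP and possibly MDB_NOSYNC:)

```c
#include <lmdb.h>

/* Flush every dirty page of the map to the file, regardless of whether
 * LMDB or a direct pointer write dirtied it; force = 1 makes the flush
 * synchronous even if the env was opened with MDB_NOSYNC / MDB_MAPASYNC. */
static int flush_direct_writes(MDB_env *env)
{
    return mdb_env_sync(env, 1);
}
```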
As for "you could at least set a callback to notify you that a block has
moved" - if that is implemented, it would be nice to have a notification /before/ a block is moved (with old and new address, so that right after the callback it is OK to use the new address), otherwise this non-intended but convenient use of LMDB won't work anymore.
"right after the callback it is OK to use the new address" - that's the point of the callback, it's job is to make the new address valid. So yes, when it returns, you use the new address.
Best regards, Victor
On Sat, Oct 3, 2015 at 1:27 AM, Howard Chu <hyc@symas.com> wrote:
Howard Chu wrote:

Victor Baybekov wrote:
Thank you! I understand this copy-on-write behavior, but am interested if I could control it a little. What if I use records that are always much bigger than a single page, e.g. 100 kb with 4 kb pages, and make sure that a record is never updated (via LMDB means) during the lifetime of an environment - is there any scenario in which the location of such a big record could change during the lifetime of an environment, without updating the record?

At this point in time, no, if you don't update a large record there is no reason that it will move. That is not to say that this won't change in the future. The documentation tells you what promises we are willing to make. Relying on any non-documented behavior is your own responsibility.

Note that the relocation functions in LMDB are intended to accommodate blocks being moved around. The actual guts of that API haven't been implemented, but probably in 1.x we'll flesh them out. Given that support, you could at least set a callback to notify you that a block has moved. But currently, overflow pages don't move if they're not modified.
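To make the large-record case concrete, a hedged sketch (hypothetical helper; it assumes MDB_WRITEMAP and that the record is never rewritten through the LMDB API afterwards, i.e. it leans on exactly the undocumented behavior described above):

```c
#include <stddef.h>
#include <string.h>
#include <lmdb.h>

/* Store a value much larger than a page (so it lands on overflow pages)
 * and keep the pointer after the commit. The leaf node only references
 * the overflow pages, so updates to neighboring keys copy the leaf page
 * but not the blob itself - which is why the pointer keeps working as
 * long as this particular record is never updated again. */
static void *put_big_blob(MDB_env *env, MDB_dbi dbi, MDB_val *key, size_t blob_size)
{
    MDB_txn *txn;
    MDB_val data = { blob_size, NULL };          /* e.g. 100 KB with 4 KB pages */
    void *ptr = NULL;
    if (mdb_txn_begin(env, NULL, 0, &txn) != 0)
        return NULL;
    if (mdb_put(txn, dbi, key, &data, MDB_RESERVE) == 0) {
        memset(data.mv_data, 0, blob_size);      /* initialize the reserved region in the map */
        ptr = data.mv_data;
    }
    if (mdb_txn_commit(txn) != 0)
        return NULL;
    return ptr;                                  /* later used as an opaque blob in the mmap */
}
```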
On Fri, Oct 2, 2015 at 4:38 PM, Howard Chu <hyc@symas.com> wrote:

Victor Baybekov wrote:
Hi, Docs for MDB_RESERVE say that a returned pointer to the reserved space is valid "before the next update operation or the transaction ends." Docs for MDB_WRITEMAP say that it "writes directly to the mmap instead of using malloc for pages." Does combining the two options return a pointer directly to a place in a mmap

Yes.

so that this pointer could be used after a transaction ends or after the next update?

No. Longer answer: maybe. Full answer: LMDB is copy-on-write. If you update another record on the same page, in a later transaction, the contents of that page will be copied to a new page and the original page will go onto the freelist. In that case, the pointer you got must not be used again.

If you don't directly update that page and cause it to be copied, then you might get lucky and be able to use the pointer for a while. It all depends on what other modifications you do and how they affect that node or neighboring nodes.
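In code, the combination being discussed looks roughly like this (a sketch only, for a small value that stays in a leaf page; the comments mark where copy-on-write invalidates the pointer):

```c
#include <stddef.h>
#include <string.h>
#include <lmdb.h>

/* Sketch of MDB_RESERVE + MDB_WRITEMAP for a small value in a leaf page.
 * The returned pointer aims straight into the mmap, but any later write
 * txn that touches the same page copies it (copy-on-write) and puts the
 * old page on the freelist - after that the pointer must not be used. */
static void *reserve_small(MDB_txn *write_txn, MDB_dbi dbi, MDB_val *key, size_t size)
{
    MDB_val data = { size, NULL };
    if (mdb_put(write_txn, dbi, key, &data, MDB_RESERVE) != 0)
        return NULL;
    memset(data.mv_data, 0, size);   /* fill the reserved space before the txn ends */
    return data.mv_data;             /* only valid "by luck" after the txn commits */
}
```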
I have a use case where I want to somewhat abuse LMDB safety for convenience. If I could get a pointer to a place inside a mmap, I could work with an LMDB value as an opaque blob or as a region inside the single big mmap. This could be more convenient than creating and opening hundreds of temporary memory-mapped files and keeping open handles to them. For example, Aeron terms could be stored like this: a stream id per LMDB db and a term id as a key in the db.
Thanks! Victor
--
Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
On Thu, Aug 16, 2018 at 7:45 PM, Howard Chu <hyc@symas.com> wrote:

Victor Baybekov wrote:
Hello,
Working with overflow pages directly via pointers outside write transactions works great and it helps that they do not move "by design" in current versions as discussed in this thread.
I have two related scenarios that will give a substantial performance boost in my case.
/The first one/ is updating a value in place via a pointer from an aborted write transaction. If I
1) use MDB_WRITEMAP, 2) from a **write** transaction find a record (which is small and not in an overflow page), 3) modify a part of its value (for duplicates this part is not used in the compare function) directly via the MDB_val data pointer (e.g. interlocked_increment or compare_and_swap), 4) and **abort** the transaction,
then readers later see the updated value via normal read transactions. Since I do the direct updates from inside a write transaction, all other writers should be blocked until I exit the transaction (abort, in this case), and no pages should move since the transaction is aborted. Is this correct? Does this work "by design" or "by accident" currently?
By accident. Note that LMDB 1.0 supports page checksums, and what you're doing here will break those checksums.
/The second one/ is about updating values in-place from read transactions. If I
Updating any value in a read transaction is ludicrous.
Basically you're abusing MDB_WRITEMAP, whose only purpose is to (potentially) optimize normal write transactions. I've no interest in even speculating on intentional misuse of the API.
By accident. Note that LMDB 1.0 supports page checksums, and what you're doing here will break those checksums.
Will the checksums be mandatory or optional? For the in-memory/NoSync case they could add overhead, and there are already many options for trading safety against performance (WRITEMAP, NOMETASYNC, etc.).
In-place updates in write transactions are 2x faster and work from several threads: the write transaction is effectively just a lock, and cumulative performance over N threads behaves just as it would with a shared lock.
For some reason, in-place updates from read transactions work (https://github.com/Spreads/Spreads.LMDB/blob/91292c20c6e5261051874d7b3831cb546eee64ab/test/Spreads.LMDB.Tests/LMDBTests.cs#L507) for the NoSync case (with 1M 16-byte values in the tests) and give another 2x performance. Yet I understand the ludicrosity of such attempts.
Basically you're abusing MDB_WRITEMAP, whose only purpose is to
(potentially) optimize normal write transactions. I've no interest in even speculating on intentional misuse of the API.
I successfully use the overflow-pages-do-not-move hack and already rely on it (since it was confirmed in this thread that, at least for a particular version, large pages do not move by design). LMDB acts like a shared memory allocator/pool, used to move buffers off-heap and make them accessible from different processes and containers. Checksums will break this case as well, but I'm fine staying with 0.9.x, because it's too cool to have a 500 GB NVMe-backed, fast, granular random-access memory pool with only 16 GB of RAM on the smallest AWS i3 instance, and similar cases. Still, having the checksums optional would be nice; I already use them at the application level.