New subject: [LMDB] What pointer is returned with combination of MDB_WRITEMAP and MDB_RESERVE?

16 Aug 2018


      Hello,
Working with overflow pages directly via pointers outside write
transactions works great and it helps that they do not move "by design" in
current versions as discussed in this thread.
I have two related scenarios that will give a substantial performance boost
in my case.
*The first one* is updating a value in-place via pointer from aborted write
transaction. If I
1) use MDB_WRITEMAP,
2) from **write** transaction find a record (which is small and not in an
overflow page),
3) modify a part of it's value (for duplicates this part is not used in the
compare function) directly via the MDB_VAL data pointer (e.g.
interlocked_increment or compare_and_swap),
4) and **abort** the transaction,
then readers see the updated value via normal read transactions later.
Since I do the direct updates from inside a write transaction all other
writers should be locked until I exit the transaction (abort in this case),
and no pages should move since the transaction is aborted. Is this correct?
Does this work "by design" or "by accident" currently?
*The second one* is about updating values in-place from read transactions.
If I
1) use MDB_WRITEMAP,
2) open a database with MDB_INTEGERKEY (and could use a dedicated
environment with a single DB if that changes the answer),
3) add values to the DB *only* using MDB_APPEND | MDB_NOOVERWRITE,
4) modify a part of a value directly via the MDB_VAL data pointer,
is it possible that the page from the read transaction is replaced with a
new one if there is a parallel write transaction?
There is a quote from Howard on GitHub (https://github.com/lmdbjava/
benchmarks/issues/9#issuecomment-354184989): "When we do sequential inserts
using MDB_APPEND, there is* no page split at all* - we just allocate a new
page and fill it, sequentially." Does this mean that if there are no page
splits then existing pages do not move as well, and it is "safe" to use
pointers outside of write transactions as is the case with the overflow
pages?
In both cases I update values of a struct that indicate e.g. some lifecycle
stage of an object the LMDB record refers to and stage transitions are
idempotent. If a direct pointer write doesn't make to disk due to system
failure subsequent readers (workers) will see an older stage and repeat
stage transition.
Therefore missed direct writes do not break application logic, I only care
about physical corruption of the entire DB. If I update the values in-place
inside read transactions and the page becomes stale this should not corrupt
DB since the old page will go to the free list only after the read
transaction is finished, so this "hack" should not break DB. But then
missed writes will be a norm and not a special-case on OS failure. But if
pages do not move, all these "soft" updates could be done in parallel and
be very fast.
Unfortunately I cannot answer this myself while trying to read the mbd.c
file. In the second scenario I'm specifically concerned what is happening
when DB becomes large and the tree needs rebalancing. At least in this case
some pages need to move, but does the rebalancing replace/split existing
pages?
Thanks & best regards,
Victor
On Fri, Oct 30, 2015 at 9:26 PM, Howard Chu hyc@symas.com wrote:
...
Victor Baybekov wrote:
...
Thanks a lot! My proof-of-concept code works OK.
I do not understand all subtle details of mmap reliability, could you
please
help with these two:
If I write data to a pointer to an opaque blob as discussed above, and my
process crashes before mdb_env_sync, but OS doesn't crash - will that
data be
secure in the mmap file?
Of course. The OS owns the memory, it doesn't matter if your process
crashes.
Also, am I correct that mdb_env_sync synchronizes all dirty pages in the
...
mmap
file as seen by a file system, regardless how they were modified - either
via
LMDB API or via a direct pointer writes?
Yes.
As for "you could at least set a callback to notify you that a block has
...
moved"  - if that is implemented, it would be nice to have a notification
/before/ a block is moved (with old and new address, so that right after
the
callback it is OK to use the new address), otherwise this non-intended but
convenient use of LMDB won't work anymore.
"right after the callback it is OK to use the new address" - that's the
point of the callback, it's job is to make the new address valid. So yes,
when it returns, you use the new address.
...
Best regards,
Victor
On Sat, Oct 3, 2015 at 1:27 AM, Howard Chu <hyc@symas.com
mailto:hyc@symas.com> wrote:
Howard Chu wrote:

    Victor Baybekov wrote:

        Thank you! I understand this copy-on-write behavior, but am
        interested if I
        could control it a little. What if I use records that are

always
            much bigger
            than a single page, e.g. 100 kb with 4kb pages, and make sure
that
            a record is
            never updated (via LMDB means) during a lifetime of an
            environment, - is there
            any scenario that the location of such a big record could be
            changed during a
            lifetime of an environment, without updating the record?
    At this point in time, no, if you don't update a large record

there is no
        reason that it will move. That is not to say that this won't
change in the
        future. The documentation tells you what promises we are willing
to make.
        Relying on any non-documented behavior is your own responsibility.
Note that the relocation functions in LMDB are intended to accommodate
blocks being moved around. The actual guts of that API haven't been
implemented, but probably in 1.x we'll flesh them out. Given that

support,
    you could at least set a callback to notify you that a block has
moved.
    But currently, overflow pages don't move if they're not modified.
        On Fri, Oct 2, 2015 at 4:38 PM, Howard Chu <hyc@symas.com
        <mailto:hyc@symas.com>
        <mailto:hyc@symas.com <mailto:hyc@symas.com>>> wrote:

             Victor Baybekov wrote:

                 Hi,

                 Docs for MDB_RESERVE say that a returned pointer to

the
            reserved
            space is
                     valid "before the next update operation or the
            transaction ends." Docs
                     for MDB_WRITEMAP say that it "writes directly to the
mmap
            instead of
            using
                     malloc for pages." Does combining the two options
return
            a pointer
                     directly to
                     a place in a mmap
             Yes.

                 so that this pointer could be used after a

transaction ends
                     or after the next update?
             No.

             Longer answer: maybe.

             Full answer: LMDB is copy-on-write. If you update another
        record on the
             same page, in a later transaction, the contents of that

page
            will be
                 copied to a new page and the original page will go onto
the
            freelist. In
                 that case, the pointer you got must not be used again.
             If you don't directly update that page and cause it to be
        copied, then you
             might get lucky and be able to use the pointer for a

while.
            It all depends
                 on what other modifications you do and how they affect
that
            node or
                 neighboring nodes.
                 I have a use case where I want to somewhat abuse LMDB
        safety for
                 convenience.
                 If I could get a pointer to a place inside a mmap I

could
            work with
                     LMDB value
                     as opaque blob or as a region inside the single big
mmap.
            This could
                     be more
                     convenient than creating and opening hundreds of
            temporary memory
                     mapped files
                     and keeping open handles to them. For example, Aeron
            terms could be
            stored
                     like this: a stream id per an LMDB db and a term id
for a
            key in the
            db.
                 Thanks!
                 Victor


--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/

Re: [LMDB] What pointer is returned with combination of MDB_WRITEMAP and MDB_RESERVE?