On 27/05/16 23:39, juerg.bircher@gmail.com wrote:
On 27/05/16 14:42, "Hallvard Breien Furuseth" h.b.furuseth@usit.uio.no wrote:
I think MDB_WORD is intended to be mdb_size_t, but it should be 'unsigned long': it's probably no optimization to use 'long long' when that is not a native integer type. But I haven't profiled that.
MDB_WORD is used to get the appropriate word size of the CPU for the optimised dbflag check. So on a 32-bit CPU it should be a 32-bit integer and on a 64-bit CPU a 64-bit integer.
That's why I suggested unsigned long. It seems the best guess for a native wide type. But it's hard to guess what kind of CPU is in use.
But mdb_size_t seems to always be 64 bits?
On a 32-bit host it's normally 32-bit, but you get 64-bit mdb_size_t anyway if you build and use LMDB with MDB_VL32 defined. I misread your patch, it should not use MDB_VL32.
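To make the two types being compared concrete, here is a minimal sketch. The mdb_size_t part reflects, roughly and from memory, how lmdb.h behaves when built with MDB_VL32; the MDB_WORD typedef is only the proposal from this thread, not existing LMDB code:

#include <stddef.h>
#include <stdint.h>

/* mdb_size_t: normally the native size_t, but forced to 64 bits when
 * LMDB is built with MDB_VL32 (large databases on 32-bit hosts). */
#ifdef MDB_VL32
typedef uint64_t mdb_size_t;
#else
typedef size_t   mdb_size_t;
#endif

/* MDB_WORD: meant to be a native machine word for the batched dbflag
 * scan, hence the suggestion of 'unsigned long' - usually 32 bits on
 * 32-bit hosts and 64 bits on 64-bit Unix hosts, though it stays
 * 32 bits on 64-bit Windows (LLP64), so it is still only a guess. */
typedef unsigned long MDB_WORD;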
The delay in picking up env->me_dbflags is a bit troubling, since it means creating a new txn is no longer quite atomic. I can't think of how it'll make a difference when DBIs are used correctly, but that's one thing people screw up. Maybe it'd be OK if renew0() just copies it into mt_dbs[].md_flags without the '&', and mdb_setup_db_info() fixes it up.
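To make that concrete: from memory, the flag-copying loop in mdb_txn_renew0() currently looks roughly like the sketch below, so the suggestion would be to drop the '& PERSISTENT_FLAGS' there and let the patch's mdb_setup_db_info() apply it later. Treat the details as approximate:

/* Rough sketch of the existing loop in mdb_txn_renew0() (from memory). */
for (i = CORE_DBS; i < txn->mt_numdbs; i++) {
	unsigned x = env->me_dbflags[i];
	txn->mt_dbs[i].md_flags = x & PERSISTENT_FLAGS;	/* the '&' in question */
	txn->mt_dbflags[i] = (x & MDB_VALID) ? DB_VALID|DB_USRVALID|DB_STALE : 0;
}
/* Proposed: txn->mt_dbs[i].md_flags = x;  and mask in mdb_setup_db_info(). */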
Is it really atomic in the original LMDB? The flags are copied from the environment and calculated. But while the dbflags are being copied from the environment they may change. So read-level consistency, and therefore atomicity, is not guaranteed.
I should have said, it's atomic when DBIs are used correctly. Such as obeying this limitation from the mdb_dbi_open() doc and a similar one for mdb_dbi_close():
 * A transaction that uses
 * this function must finish (either commit or abort) before
 * any other transaction in the process may use this function.
To guarantee atomicity we'd need locks around some DBI operations. To avoid that, LMDB simply calls unsafe use "user errors" - which can cause breakage. Hence those annoying DBI usage restrictions.
LMDB does attempt to catch user errors when it's cheap to try, see e.g. TXN_DBI_CHANGED(). That doesn't work at all for read-only txns, since they have txn->mt_dbiseqs == env->me_dbiseqs.
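For reference, TXN_DBI_CHANGED() is just a cheap sequence-number comparison; roughly, from memory:

/* Approximate definition from mdb.c: a DBI slot counts as "changed" if
 * the sequence number the txn recorded for it no longer matches the
 * environment's.  Read-only txns share the environment's array, so the
 * two sides are always equal and the check never fires. */
#define TXN_DBI_CHANGED(txn, dbi) \
	((txn)->mt_dbiseqs[dbi] != (txn)->mt_env->me_dbiseqs[dbi])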
So you can see why DBI cleverness makes me nervous. I think the MDB_WORD looping can go in now, but we'll need to think carefully about the lazy init.
But I guess it just does not matter. Picking up the flags lazily just extends the window in which the flags are copied from the environment, but it is basically the same. Atomicity and also isolation are a must with respect to the data, but I think not with respect to the database handles and their flags. Maybe I am overlooking something, as I do not know all the thoughts between the lines.
With ITS#8431 I can think of one thing:
There's no sync between mdb_dbi_open() in one txn and another txn using the new DBI. It'll be up to the user to synchronize DBIs. For that to work in write-txns, TXN_DBI_CHANGED() must go away, and writers will have no way to try to catch user errors with sync. Or if it should only work in read-only txns, then read-only txns and write-txns must have different definitions of what is a "user error" with DBIs.
Anyway, we'll need to think about the definition of user errors.
Other notes:
/** ... */ above a global declaration or /**< ... */ to the right of it tells Doxygen that the comment is the doc of the declared item. "#foo" inside a Doxygen comment is a link to the declaration of foo.
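A small illustration of the two placements and of the "#foo" link (the declarations here are placeholders, not LMDB code):

/** Doc placed above the item: a native word type, used together with #num_slots. */
typedef unsigned long dbflag_word;

unsigned num_slots;	/**< Doc placed to the right of the item. */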
Maybe mt_dbflags should have an extra byte or word at the end which does not have DB_UNUSED, and with the size rounded up to a multiple of sizeof(MDB_WORD). Then stepping to the next DBI can do something like this, unrelated to the check against mt_numdbs. Not sure if that's an improvement, I'm getting tired:
/** Return an #MDB_WORD where each byte is \b byte */
#define WORD_BYTES(byte)	((MDB_WORD)-1 / 0xff * (byte))
/** Step up to next used DBI */
#define DBI_STEP(txn, dbi) do { \
	if (! (++(dbi) & (sizeof(MDB_WORD)-1))) \
		while (*(MDB_WORD *)((txn)->mt_dbflags + (dbi)) == WORD_BYTES(DB_UNUSED)) \
			(dbi) += sizeof(MDB_WORD); \
} while (0)
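A sketch of how that might be used when walking a txn's DBI table, assuming the padding described above and assuming an unused slot's flag byte is exactly DB_UNUSED (which the word-wide compare requires anyway):

/* Sketch only: visit the used DBIs of a txn, letting DBI_STEP skip whole
 * words of unused slots.  DB_UNUSED and the padding of mt_dbflags up to a
 * multiple of sizeof(MDB_WORD) are assumptions from the proposal above;
 * the trailing non-DB_UNUSED padding stops the inner word scan. */
MDB_dbi dbi = CORE_DBS;
while (dbi < txn->mt_numdbs) {
	if (!(txn->mt_dbflags[dbi] & DB_UNUSED)) {
		/* ... work on this used DBI ... */
	}
	DBI_STEP(txn, dbi);	/* ++dbi, plus a word-wide skip at word boundaries */
}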