Quanah Gibson-Mount wrote:
--On Saturday, March 05, 2011 5:05 AM -0800 Howard Chu <hyc@symas.com> wrote:
I've been working on a new "in-memory" B-tree library that operates on an mmap'd file. It is a copy-on-write design that supports MVCC, is immune to corruption, and requires no recovery procedure. It is not an append-only design, since that requires explicit compaction and is also not amenable to mmap usage. The append-only approach also requires total serialization of write operations, which would be quite poor for throughput.
My experience with back-(bdb/hdb) and syncrepl was that the only reliable way to ensure consistent replication was to use delta-syncrepl which... serializes write operations. In fact, not forcing serialized writes for back-(bdb/hdb) was slower than serializing things, because of all the contention in the database. I understand this may not hold true for back-mdb, but thought I would note that currently our best write performance is already achieved by serialization.
I'm well aware of all of this, no need to remind me. Non-serialized writes in bdb/hdb tended to run into deadlocks all the time, and the retries are slow. (In fact, we intentionally slow them down with an exponential backoff. This feature is probably detrimental on a heavily loaded machine since the thread can't do any useful work during the backoff.)
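For readers unfamiliar with the retry pattern described above, here is a minimal Python sketch of deadlock retries with exponential backoff. It is not the slapd code: DeadlockError, run_with_backoff, and all parameter names and defaults are hypothetical, chosen only for illustration.

```python
import time


class DeadlockError(Exception):
    """Hypothetical stand-in for the deadlock error a transaction can return."""


def run_with_backoff(txn, max_retries=5, base_delay=0.001, sleep=time.sleep):
    """Run txn(), retrying on DeadlockError with a delay that doubles each time.

    Returns (result, delays), where delays lists the pause taken before
    each retry. The defaults are assumptions, not values from slapd.
    """
    delays = []
    for attempt in range(max_retries + 1):
        try:
            return txn(), delays
        except DeadlockError:
            if attempt == max_retries:
                raise  # give up after the last allowed retry
            delay = base_delay * (2 ** attempt)
            delays.append(delay)
            sleep(delay)  # the thread does no useful work while it waits
```

The sleep call is where the cost mentioned above shows up: the delay grows with each failed attempt, and the thread is idle for the whole pause.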
I expect the occurrence of deadlocks using MVCC to be drastically reduced. Readers will never be the cause of deadlocks in mdb, so that's half the problem gone already. Writers will hold locks and be able to block each other, so that possibility remains.
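As a rough illustration of why readers cannot cause deadlocks under MVCC, consider this toy copy-on-write store in Python. It is only a sketch under simplifying assumptions: CowStore and its methods are hypothetical names, and it copies a whole dict per write, whereas a copy-on-write B-tree would copy only the pages on the path from the root to the modified leaf.

```python
import threading


class CowStore:
    """Toy MVCC store: readers use immutable snapshots, writers copy and swap."""

    def __init__(self):
        self._root = {}                      # never mutated once published
        self._write_lock = threading.Lock()  # only writers ever take this

    def snapshot(self):
        # Readers take no lock; they just grab the current root reference.
        return self._root

    def put(self, key, value):
        with self._write_lock:               # writers can still block each other
            new_root = dict(self._root)      # copy instead of modifying in place
            new_root[key] = value
            self._root = new_root            # publish the new version
```

A reader that took a snapshot before a write keeps seeing the old version, so it never waits on a writer and a writer never waits on it. Only writer-versus-writer blocking remains.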
re: configuring the size of the DB file - this is most likely not a value that can be changed on an existing DB. I.e., if you configure a DB and find that you need to grow it later, you will probably need to slapcat/slapadd it again. At DB creation time the file is mmap'd with address NULL so that the OS picks the address, and the address is recorded in the DB. On subsequent opens the file is mmap'd at the recorded address. If the size is changed, and the process' address space is already full of other mappings, it may not be possible to simply grow the mapping at its current address. Since the DB records contain actual memory pointers based on the region address, any change in the mapping address would render the DB unusable.
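The dependence on the mapping address can be shown with a small simulation. This Python sketch does not call mmap and does not reflect the actual mdb record format; Region, write_record, and follow are hypothetical names, and the base addresses are made up. It only shows that a record holding an absolute address (base plus offset) stops resolving once the same bytes are mapped at a different base.

```python
import struct

PTR = struct.Struct("<Q")  # one 8-byte "pointer" at the start of each record


class Region:
    """Simulated mapping: a buffer plus the address it is 'mapped' at."""

    def __init__(self, data, base):
        self.data = data
        self.base = base

    def write_record(self, offset, payload, next_offset):
        # Store an absolute address, as a raw memory pointer would be.
        PTR.pack_into(self.data, offset, self.base + next_offset)
        start = offset + PTR.size
        self.data[start:start + len(payload)] = payload

    def follow(self, offset):
        # Resolve the stored pointer against the *current* base address.
        (addr,) = PTR.unpack_from(self.data, offset)
        target = addr - self.base
        if not 0 <= target < len(self.data):
            raise ValueError("pointer %#x is outside the mapping" % addr)
        return target
```

For example, a record written through Region(buf, base=0x7f0000000000) resolves correctly through that region, but following it through Region(buf, base=0x7f0000100000) raises ValueError, because the stored address no longer falls inside the mapping.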
How exactly does the DB file size for back-mdb relate to the existing size of the database? Do they have to match?
Not at all. This configures a maximum size that the DB will consume on disk. The DB can be whatever size, and grow to that limit.
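To illustrate the difference between the configured maximum and the actual size, here is a toy page allocator in Python. It is a sketch only: PagedFile, MapFullError, and the 4096-byte page size are assumptions for illustration, not mdb's API or settings.

```python
PAGE_SIZE = 4096  # assumed page size, for illustration only


class MapFullError(Exception):
    """Raised when an allocation would exceed the configured maximum size."""


class PagedFile:
    def __init__(self, max_size):
        self.max_pages = max_size // PAGE_SIZE  # configured ceiling
        self.used_pages = 0                     # what the DB actually occupies

    def allocate(self, n=1):
        """Reserve n pages and return the index of the first one."""
        if self.used_pages + n > self.max_pages:
            raise MapFullError("configured maximum size reached")
        first = self.used_pages
        self.used_pages += n
        return first

    @property
    def size_on_disk(self):
        return self.used_pages * PAGE_SIZE
```

The configured value only caps growth: size_on_disk starts at zero and increases with each allocation until the ceiling is hit, at which point further allocations fail.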
I.e., is this more like the DB_CONFIG cachesize, which can be more or less than the database size, or are they supposed to be an exact match? We have plenty of customers who have databases that are certainly not static in size, particularly if they are using an accesslog database for delta-syncrepl or other operations.
Obviously it would be stupid to require them to match.