Thought this was an interesting read:
http://www.varnish-cache.org/trac/wiki/ArchitectNotes
Too bad he talks about his approach being "2006 era" programming. In fact the single-level store is 1964-era, from Multics.
http://en.wikipedia.org/wiki/Single_level_store
I guess they'll have to tweak Henry Spencer's quote ("Those who do not understand UNIX are condemned to reinvent it, poorly.") to Multics instead...
I've been working on a new "in-memory" B-tree library that operates on an mmap'd file. It is a copy-on-write design; it supports MVCC, is immune to corruption, and requires no recovery procedure. It is not an append-only design, since that would require explicit compaction and is not amenable to mmap usage. The append-only approach also requires total serialization of write operations, which would be quite poor for throughput.
The current approach simply reserves space for two root node pointers and flip-flops between them. So multiple writes may be outstanding at once, but commits are of course serialized; each commit causes the currently unused root node pointer to become the currently valid root node pointer. Transaction aborts are pretty much free; there's nothing to roll back. Read transactions begin by snapshotting the current root pointer and can then run without any interference from other operations.
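Roughly, the commit scheme described above looks like the following sketch. This is illustrative only, not the actual mdb code: the names (DbHeader, Meta, read_begin, write_commit) are invented, and the durability ordering (flushing data pages before touching a meta slot) is omitted.

/*
 * Two "meta" slots in the DB header hold root pointers.  Readers snapshot
 * whichever slot carries the newer transaction ID; a commit publishes the
 * new copy-on-write root into the other slot.  A crash at any point leaves
 * at least one complete, valid root behind, so no recovery is needed.
 */
#include <stdint.h>

typedef struct BtreeNode BtreeNode;     /* opaque here */

typedef struct Meta {
    uint64_t   txnid;                   /* last committed transaction ID */
    BtreeNode *root;                    /* root of the committed B-tree  */
} Meta;

typedef struct DbHeader {
    Meta meta[2];                       /* the two root-pointer slots    */
} DbHeader;

/* Pick the slot holding the most recently committed root. */
static Meta *current_meta(DbHeader *h)
{
    return h->meta[0].txnid > h->meta[1].txnid ? &h->meta[0] : &h->meta[1];
}

/* A read transaction snapshots the current root pointer once; after that
 * it never looks at the header again, so writers cannot disturb it. */
typedef struct ReadTxn {
    uint64_t   txnid;
    BtreeNode *root;
} ReadTxn;

static void read_begin(DbHeader *h, ReadTxn *txn)
{
    Meta *m = current_meta(h);
    txn->txnid = m->txnid;
    txn->root  = m->root;
}

/* Commit: write the new root into the slot that is NOT currently valid,
 * then bump its transaction ID so it becomes the current one. */
static void write_commit(DbHeader *h, BtreeNode *newroot)
{
    Meta *cur   = current_meta(h);
    Meta *other = (cur == &h->meta[0]) ? &h->meta[1] : &h->meta[0];

    other->root  = newroot;
    other->txnid = cur->txnid + 1;
}

/* Abort is essentially free: the scratch copy-on-write pages are simply
 * discarded, and neither meta slot ever pointed at them. */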
Public commits have been waiting for our official transition to git, but since that's been going nowhere I will probably start publishing on github.com in the next couple of weeks. (With St. Patrick's Day right around the corner it may have to wait a bit.)
Unfortunately I realized that not all application-level caching can be eliminated - with the hierarchical DB approach, we don't store full entry DNs in the DB so they still need to be generated in main memory, and they probably should be cached. But that's a detail to be addressed later; it may well be that the cost of always constructing them on the fly (no caching) is acceptable.
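For illustration, constructing a DN in this scheme is just a walk up the parent links, concatenating RDNs until the suffix is reached; the structures and names below are invented, not the actual back-mdb code.

#include <string.h>

typedef struct HEntry {
    const char    *rdn;      /* e.g. "uid=jdoe" -- this much is in the DB */
    struct HEntry *parent;   /* NULL at the suffix entry                  */
} HEntry;

/* Build e.g. "uid=jdoe,ou=people,dc=example,dc=com" into buf (of size len).
 * This is the work that either happens on every lookup or gets cached. */
static void build_dn(const HEntry *e, char *buf, size_t len)
{
    buf[0] = '\0';
    for (; e != NULL; e = e->parent) {
        if (buf[0] != '\0')
            strncat(buf, ",", len - strlen(buf) - 1);
        strncat(buf, e->rdn, len - strlen(buf) - 1);
    }
}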
This backend should perform much better in all aspects (memory, CPU, and I/O usage) than the current BerkeleyDB code. It eliminates two levels of caching; entries pulled from the DB require zero decoding, readers require no locks, and writes incur no write-ahead-logging overhead. There are only two configurable parameters (the pathname of the DB file and its size), so this will be far simpler for admins.
Potential downside - on a 32 bit machine with only 2GB of addressable memory the maximum usable DB size is around 1.6GB. On a 64 bit machine, I doubt the limits will pose any problem. ("64 bits should be enough for anyone...")
re: configuring the size of the DB file - this is most likely not a value that can be changed on an existing DB. I.e., if you configure a DB and find that you need to grow it later, you will probably need to slapcat/slapadd it again. At DB creation time the file is mmap'd with address NULL so that the OS picks the address, and the address is recorded in the DB. On subsequent opens the file is mmap'd at the recorded address. If the size is changed, and the process' address space is already full of other mappings, it may not be possible to simply grow the mapping at its current address. Since the DB records contain actual memory pointers based on the region address, any change in the mapping address would render the DB unusable.
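A rough sketch of that open path, with invented names and most error handling trimmed. The key point is that an existing DB must map at exactly the recorded address, or every stored pointer inside it is wrong.

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

struct FileHeader {
    char    magic[8];
    void   *map_addr;        /* address the DB expects to live at */
    size_t  map_size;        /* total size fixed at creation time */
};

void *open_db(const char *path, size_t size)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    struct FileHeader hdr;
    ssize_t n = pread(fd, &hdr, sizeof(hdr), 0);

    if (n < (ssize_t)sizeof(hdr) || memcmp(hdr.magic, "DBSKETCH", 8) != 0) {
        /* New DB: let the OS pick the address, then record it in page 0. */
        if (ftruncate(fd, size) < 0)
            return NULL;
        void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
            return NULL;
        memcpy(hdr.magic, "DBSKETCH", 8);
        hdr.map_addr = p;
        hdr.map_size = size;
        memcpy(p, &hdr, sizeof(hdr));
        return p;
    }

    /* Existing DB: ask for the recorded address and size.  If the OS cannot
     * place the mapping there (the address range is already occupied, or the
     * requested size no longer fits), the DB is unusable in this process. */
    void *p = mmap(hdr.map_addr, hdr.map_size, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return NULL;
    if (p != hdr.map_addr) {          /* placed elsewhere: pointers invalid */
        munmap(p, hdr.map_size);
        return NULL;
    }
    return p;
}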
If this restriction turns out to be too impractical, we may have to resort to just storing array offsets, but that will then imply a decoding phase and the re-introduction of entry caching, which I really really want to avoid.
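To make the trade-off concrete, here is a tiny illustrative comparison of the two addressing schemes (invented names); the offset form is what would force a translation step on every access, or a decode-and-copy into an entry cache.

#include <stdint.h>
#include <stddef.h>

/* Scheme 1: stored pointer -- usable directly out of the map, but valid
 * only if the mapping always comes back at the same address. */
struct val_ptr { char *data; size_t len; };

static inline char *get_ptr(struct val_ptr *v)
{
    return v->data;
}

/* Scheme 2: stored offset -- survives relocation of the map, but every
 * reference must be translated against the map base address. */
struct val_off { uint64_t off; size_t len; };

static inline char *get_off(void *mapbase, struct val_off *v)
{
    return (char *)mapbase + v->off;
}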
Do we need any docs for this yet, drafted from your notes?
--On Saturday, March 05, 2011 5:05 AM -0800 Howard Chu hyc@symas.com wrote:
Thought this was an interesting read:
http://www.varnish-cache.org/trac/wiki/ArchitectNotes
Too bad he talks about his approach being "2006 era" programming. In fact the single-level store is 1964-era, from Multics.
http://en.wikipedia.org/wiki/Single_level_store
I guess they'll have to tweak Henry Spencer's quote ("Those who do not understand UNIX are condemned to reinvent it, poorly.") to Multics instead...
I've been working on a new "in-memory" B-tree library that operates on an mmap'd file. It is a copy-on-write design; it supports MVCC, is immune to corruption, and requires no recovery procedure. It is not an append-only design, since that would require explicit compaction and is not amenable to mmap usage. The append-only approach also requires total serialization of write operations, which would be quite poor for throughput.
The current approach simply reserves space for two root node pointers and flip-flops between them. So multiple writes may be outstanding at once, but commits are of course serialized; each commit causes the currently unused root node pointer to become the currently valid root node pointer. Transaction aborts are pretty much free; there's nothing to roll back. Read transactions begin by snapshotting the current root pointer and can then run without any interference from other operations.
Public commits have been waiting for our official transition to git, but since that's been going nowhere I will probably start publishing on github.com in the next couple of weeks. (With St. Patrick's Day right around the corner it may have to wait a bit.)
Unfortunately I realized that not all application-level caching can be eliminated - with the hierarchical DB approach, we don't store full entry DNs in the DB so they still need to be generated in main memory, and they probably should be cached. But that's a detail to be addressed later; it may well be that the cost of always constructing them on the fly (no caching) is acceptable.
This backend should perform much better in all aspects (memory, CPU, and I/O usage) than the current BerkeleyDB code. It eliminates two levels of caching; entries pulled from the DB require zero decoding, readers require no locks, and writes incur no write-ahead-logging overhead. There are only two configurable parameters (the pathname of the DB file and its size), so this will be far simpler for admins.
Let me know when you think it is in a state where I can start doing some perf testing with it. ;)
--Quanah
--
Quanah Gibson-Mount
Sr. Member of Technical Staff
Zimbra, Inc
A Division of VMware, Inc.
--------------------
Zimbra :: the leader in open source messaging and collaboration
--On Saturday, March 05, 2011 5:05 AM -0800 Howard Chu hyc@symas.com wrote:
I've been working on a new "in-memory" B-tree library that operates on an mmap'd file. It is a copy-on-write design; it supports MVCC, is immune to corruption, and requires no recovery procedure. It is not an append-only design, since that would require explicit compaction and is not amenable to mmap usage. The append-only approach also requires total serialization of write operations, which would be quite poor for throughput.
My experience with back-(bdb/hdb) and syncrepl was that the only reliable way to ensure consistent replication was to use delta-syncrepl, which... serializes write operations. In fact, not forcing serialized writes for back-(bdb/hdb) was slower than serializing things, because of all the contention in the database. I understand this may not hold true for back-mdb, but thought I would note that currently our best write performance is already achieved by serialization.
re: configuring the size of the DB file - this is most likely not a value that can be changed on an existing DB. I.e., if you configure a DB and find that you need to grow it later, you will probably need to slapcat/slapadd it again. At DB creation time the file is mmap'd with address NULL so that the OS picks the address, and the address is recorded in the DB. On subsequent opens the file is mmap'd at the recorded address. If the size is changed, and the process' address space is already full of other mappings, it may not be possible to simply grow the mapping at its current address. Since the DB records contain actual memory pointers based on the region address, any change in the mapping address would render the DB unusable.
How exactly does the DB file size for back-mdb relate to the existing size of the database? Do they have to match? I.e., is this more like the DB_CONFIG cachesize, which can be more or less than the database size, or are they supposed to be an exact match? We have plenty of customers whose databases are certainly not static in size, particularly if they are using an accesslog database for delta-syncrepl or other operations.
--Quanah
--
Quanah Gibson-Mount
Sr. Member of Technical Staff
Zimbra, Inc
A Division of VMware, Inc.
--------------------
Zimbra :: the leader in open source messaging and collaboration
Quanah Gibson-Mount wrote:
--On Saturday, March 05, 2011 5:05 AM -0800 Howard Chu hyc@symas.com wrote:
I've been working on a new "in-memory" B-tree library that operates on an mmap'd file. It is a copy-on-write design; it supports MVCC, is immune to corruption, and requires no recovery procedure. It is not an append-only design, since that would require explicit compaction and is not amenable to mmap usage. The append-only approach also requires total serialization of write operations, which would be quite poor for throughput.
My experience with back-(bdb/hdb) and syncrepl was that the only reliable way to ensure consistent replication was to use delta-syncrepl, which... serializes write operations. In fact, not forcing serialized writes for back-(bdb/hdb) was slower than serializing things, because of all the contention in the database. I understand this may not hold true for back-mdb, but thought I would note that currently our best write performance is already achieved by serialization.
I'm well aware of all of this; no need to remind me. Non-serialized writes in bdb/hdb tended to run into deadlocks all the time, and the retries are slow. (In fact, we intentionally slow them down with an exponential backoff. This feature is probably detrimental on a heavily loaded machine, since the thread can't do any useful work during the backoff.)
I expect the occurrence of deadlocks using MVCC to be drastically reduced. Readers will never be the cause of deadlocks in mdb, so that's half the problem gone already. Writers will hold locks and be able to block each other, so that possibility remains.
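For reference, the backoff pattern mentioned above is roughly the following sketch; it is not the actual slapd code, and the constants and names are made up. The usleep() is exactly the time during which the thread does no useful work.

#include <unistd.h>

#define MAX_RETRIES 8

/* Retry a write operation that may fail with a deadlock indication,
 * doubling the sleep between attempts. */
int write_with_backoff(int (*try_write)(void *), void *arg)
{
    useconds_t delay = 1000;               /* start at 1 ms */
    for (int i = 0; i < MAX_RETRIES; i++) {
        int rc = try_write(arg);
        if (rc == 0)                       /* success */
            return 0;
        usleep(delay);                     /* thread is idle here */
        delay *= 2;                        /* exponential backoff */
    }
    return -1;                             /* give up */
}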
re: configuring the size of the DB file - this is most likely not a value that can be changed on an existing DB. I.e., if you configure a DB and find that you need to grow it later, you will probably need to slapcat/slapadd it again. At DB creation time the file is mmap'd with address NULL so that the OS picks the address, and the address is recorded in the DB. On subsequent opens the file is mmap'd at the recorded address. If the size is changed, and the process' address space is already full of other mappings, it may not be possible to simply grow the mapping at its current address. Since the DB records contain actual memory pointers based on the region address, any change in the mapping address would render the DB unusable.
How exactly does the DB file size for back-mdb relate to the existing size of the database? Do they have to match?
Not at all. This configures the maximum size that the DB may consume on disk. The DB can be whatever size, and can grow up to that limit.
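In other words, the configured size acts as an address-space and on-disk reservation rather than a preallocation. A minimal sketch of the idea, assuming the file can be created sparse (invented names, not the actual back-mdb configuration code): pages only consume disk once the B-tree actually writes to them, so the live DB can be far smaller than the configured maximum and grow toward it without any remapping.

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Reserve maxsize bytes of file and address space up front; actual disk
 * usage starts near zero and grows only as pages are written. */
void *reserve_db(const char *path, size_t maxsize)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, maxsize) < 0)        /* sparse: no blocks allocated */
        return NULL;
    void *p = mmap(NULL, maxsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return p == MAP_FAILED ? NULL : p;
}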
I.e., is this more like the DB_CONFIG cachesize, which can be more or less than the database size, or are they supposed to be an exact match? We have plenty of customers whose databases are certainly not static in size, particularly if they are using an accesslog database for delta-syncrepl or other operations.
Obviously it would be stupid to require them to match.
Howard Chu writes:
Unfortunately I realized that not all application-level caching can be eliminated - with the hierarchical DB approach, we don't store full entry DNs in the DB so they still need to be generated in main memory, and they probably should be cached. But that's a detail to be addressed later; it may well be that the cost of always constructing them on the fly (no caching) is acceptable.
I previously complained that an mmapped database with no level of caching in the code would collect the effects of bugs quite efficiently. If some other module incorrectly modifies an entry in memory when it should have modified a private copy, the mod goes directly to the DB, even if nothing is doing an LDAP update operation. Restarting slapd does not clean up, and slapcat/slapadd may not work either. Is there something in this design which fixes that?
Hallvard B Furuseth wrote:
Howard Chu writes:
Unfortunately I realized that not all application-level caching can be eliminated - with the hierarchical DB approach, we don't store full entry DNs in the DB so they still need to be generated in main memory, and they probably should be cached. But that's a detail to be addressed later; it may well be that the cost of always constructing them on the fly (no caching) is acceptable.
I previously complained that an mmapped database with no level of caching in the code would collect the effects of bugs quite efficiently. If some other module incorrectly modifies an entry in memory when it should have modified a private copy, the mod goes directly to the DB, even if nothing is doing an LDAP update operation. Restarting slapd does not clean up, and slapcat/slapadd may not work either. Is there something in this design which fixes that?
Not at present.
You're just going to have to be very very careful about what modules you use...
Since the DB uses MVCC, all data in the DB can semantically be considered read-only. We could use mprotect() to enforce that, but I'm not that far along in the development of the mdb library yet, and I have no idea how much overhead it would impose.
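A minimal sketch of what that enforcement might look like, purely hypothetical since (as noted) it isn't implemented: keep the whole map read-only so a stray store from a buggy module faults immediately, and open a write window only around the pages a write transaction is actually touching. The page size and names are assumptions.

#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define PAGE 4096u                         /* assumed page size */

/* Default state: the entire map is read-only. */
static int protect_all(void *map, size_t size)
{
    return mprotect(map, size, PROT_READ);
}

/* Allow writes to just the page range a writer is updating... */
static int unlock_pages(void *addr, size_t len)
{
    uintptr_t start = (uintptr_t)addr & ~(uintptr_t)(PAGE - 1);
    uintptr_t end   = ((uintptr_t)addr + len + PAGE - 1) & ~(uintptr_t)(PAGE - 1);
    return mprotect((void *)start, end - start, PROT_READ | PROT_WRITE);
}

/* ...and drop back to read-only at commit.  Each mprotect() call is a
 * system call (plus possible TLB flushing), which is the unknown overhead
 * referred to above. */
static int relock_pages(void *addr, size_t len)
{
    uintptr_t start = (uintptr_t)addr & ~(uintptr_t)(PAGE - 1);
    uintptr_t end   = ((uintptr_t)addr + len + PAGE - 1) & ~(uintptr_t)(PAGE - 1);
    return mprotect((void *)start, end - start, PROT_READ);
}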