Just some thoughts on what I'd like to see in a new memory-based backend...
One of the complaints about back-bdb/hdb is the complexity in the tuning; there are a number of different components that need to be balanced against each other and the proper balance point varies depending on data size and workload. One of the directions we were investigating a couple years back was mechanisms for self-tuning of the caches. (This was essentially the thrust of Jong-Hyuk Choi's work with zoned allocs for the back-bdb entry cache; it would allow large chunks of the entry cache to be discarded on demand when system memory pressure increased.) Unfortunately Jong hasn't been active on the project in a while and it doesn't appear that anyone else was tracking that work. Self-tuning is still a goal but it seems to me to be attacking the wrong problem.
One of the things that annoys me with the current BerkeleyDB based design is that we have 3 levels of cache operating at all times - filesystem, BDB, and slapd. This means at least 2 memory copy operations to get any piece of data from disk into working memory, and you have to play games with the OS to minimize the waste in the FS cache. (E.g. on Linux, tweak the swappiness setting.)
Back in the 80s I spent a lot of time working on the Apollo DOMAIN OS, which was based on the M68K platform. One of their (many) claims to fame was the notion of a single-level store: the processor architecture supported a full 32 bit address space but it was uncommon for systems to have more than 24 bits worth of that populated, and nobody had anywhere near 1GB of disk space on their entire network. As such, every byte of available disk space could be directly mapped to a virtual memory address, and all disk I/O was done through mmaps and demand paging. As a result, memory management was completely unified and memory usage was extremely efficient.
These days you could still take that sort of approach, though on a 32 bit machine a DB limit of 1-2GB may not be so useful any more. However, with the ubiquity of 64 bit machines, the idea becomes quite attractive again.
The basic idea is to construct a database that is always mmap'd to a fixed virtual address, and which returns its mmap'd data pages directly to the caller (instead of copying them to a newly allocated buffer). Given a fixed address, it becomes feasible to make the on-disk record format identical to the in-memory format. Today we have to convert from a BER-like encoding into our in-memory format, and while that conversion is fast it still takes up a measurable amount of time. (Which is one reason our slapd entry cache is still so much faster than just using BDB's cache.) So instead of storing offsets into a flattened data record, we store actual pointers (since they all simply reside in the mmap'd space).
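To make that concrete, here is a minimal sketch in C of what the access path could look like. The base address, file layout, and record structure are hypothetical, invented purely for illustration; error handling and the write/transaction path are omitted.

```c
/* A minimal sketch of the fixed-address mmap idea. Hypothetical names and
 * layout only; not an actual backend. */
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define DB_FIXED_ADDR ((void *)0x100000000000UL)  /* assumed reserved region */

typedef struct DiskEntry {
    struct DiskEntry *next;    /* a real pointer into the mapped region */
    size_t            dn_len;  /* no offsets, no decode step on read    */
    char              dn[];    /* entry data stored inline              */
} DiskEntry;

void *db_map(const char *path, size_t *sizep)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return NULL;
    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return NULL;
    }
    /* MAP_FIXED pins the file at the agreed-upon address so that pointers
     * stored inside it stay valid across restarts. (A real implementation
     * would first make sure this address range is actually free, since
     * MAP_FIXED silently replaces any existing mapping there.) */
    void *base = mmap(DB_FIXED_ADDR, st.st_size, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_FIXED, fd, 0);
    close(fd);
    if (base == MAP_FAILED)
        return NULL;
    *sizep = (size_t)st.st_size;
    return base;   /* callers get pointers straight into this region */
}
```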
Using this directly mmap'd approach immediately eliminates the 3 layers of caching and brings it down to 1. As another benefit, the DB would require *zero* cache configuration/tuning - it would be entirely under the control of the OS memory manager, and its resident set size would grow or shrink dynamically without any outside intervention.
It's not clear to me that we can modify BDB to operate in this manner. It currently supports mmap access for read-only DBs, but it doesn't map to fixed addresses and still does alloc/copy before returning data to the caller.
Also, while BDB development continues, the new development is mainly occurring in areas that don't matter to us (e.g. BDB replication) and the areas we care about (B-tree performance) haven't really changed much in quite a while. I've mentioned B-link trees a few times before on this list; they have much lower lock contention than plain B-trees and thus can support even greater concurrency. I've also mentioned them to the BDB team a few times and as yet they have no plans to implement them. (Here's a good reference: http://www.springerlink.com/content/eurxct8ewt0h3rxm/ )
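For reference, the distinctive part of a B-link tree is easy to show in miniature. The sketch below uses hypothetical structures (not BDB code) and omits all locking, page pinning, and I/O; the point is only the Lehman/Yao "move right" rule: each node stores its high key and a right-sibling link, so a reader that races with a concurrent split simply steps rightward instead of lock-coupling with the parent.

```c
/* Sketch of B-link descent; illustrative structures, no concurrency control
 * shown. */
#include <string.h>

typedef struct BLinkNode {
    int                leaf;         /* nonzero for leaf nodes           */
    int                nkeys;
    const char        *keys[64];     /* separator keys (simplified)      */
    struct BLinkNode  *children[65]; /* child links; leaves ignore these */
    const char        *highkey;      /* upper bound of this node's range */
    struct BLinkNode  *right;        /* right-sibling link               */
} BLinkNode;

/* Find the leaf that should contain `key`. */
static BLinkNode *blink_descend(BLinkNode *node, const char *key)
{
    while (!node->leaf) {
        /* A concurrent split may have moved our key range rightward. */
        while (node->highkey && strcmp(key, node->highkey) > 0)
            node = node->right;
        int i = 0;
        while (i < node->nkeys && strcmp(key, node->keys[i]) > 0)
            i++;
        node = node->children[i];
    }
    while (node->highkey && strcmp(key, node->highkey) > 0)
        node = node->right;
    return node;
}
```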
As such, it seems likely that we would have to write our own DB engine to pursue this path. (Clearly such an engine must still provide full ACID transaction support, so this is a non-trivial undertaking.) Whether and when we embark on this is unclear; this is somewhat of an "ideal" design and as always, "good enough" is the enemy of "perfect" ...
This isn't a backend we can simply add to the current slapd source base, so it's probably an OpenLDAP 3.x target: In order to have a completely canonical record on disk, we also need pointers to AttributeDescriptions to be recorded in each entry and those AttributeDescription pointers must also be persistent. Which means that our current AttributeDescription cache must be modified to also allocate its records from a fixed mmap'd region. (And we'll have to include a schema-generation stamp, so that if schema elements are deleted we can force new AD pointers to be looked up when necessary.) (Of course, given the self-contained nature of the AD cache, we can probably modify its behavior in this way without impacting any other slapd code...)
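As a rough illustration of the schema-generation idea (the names and layout below are invented, not the actual slapd AttributeDescription structures): each persistent AD record could carry the generation of the schema it was created under, and any stored AD pointer whose generation no longer matches would be re-resolved by name.

```c
/* Hypothetical persistent AttributeDescription record living in the fixed
 * mmap'd region; not the real slapd structure. */
typedef struct PersistentAD {
    unsigned long ad_schema_gen;  /* generation this AD belongs to */
    char          ad_name[64];    /* attribute description string  */
} PersistentAD;

/* Bumped whenever schema elements are deleted. */
extern unsigned long schema_generation;

/* Trust a stored AD pointer only while its generation matches; otherwise
 * return NULL so the caller looks the name up again in the AD cache. */
static PersistentAD *ad_revalidate(PersistentAD *ad)
{
    if (ad == NULL || ad->ad_schema_gen != schema_generation)
        return NULL;
    return ad;
}
```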
There's also a potential risk to leaving all memory management up to the OS - the native memory manager on some OS's (e.g. Windows) is abysmal, and the CLOCK-based cache replacement code we now use in the entry cache is more efficient than the LRU schemes that some older OS versions use. So we may get into this and decide we still need to play games with mlock() etc. to control the cache management. That would be an unfortunate complication, but it would still allow us to do simpler tuning than we currently need. Still, establishing a 1:1 correspondence between virtual memory addresses and disk addresses is a big win for performance, scalability, and reduced complexity (== greater reliability)...
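If it ever did come to playing games with the pager, the mitigation itself is small; a sketch, assuming the caller somehow knows which part of the map is hot:

```c
#include <stddef.h>
#include <sys/mman.h>

/* Pin the hottest portion of the mapped database in RAM so the pager cannot
 * evict it; munlock() releases it again. What counts as "hot", and how much
 * to pin, would be the (much smaller) tuning problem left over. */
static int pin_hot_region(void *hot_start, size_t hot_len)
{
    return mlock(hot_start, hot_len);
}
```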
(And yes, by the way, we have planning for LDAPCon2009 this September in the works; I imagine the Call For Papers will go out in a week or two. So now's a good time to pull up whatever other ideas you've had in the back of your mind for a while...)
That sounds interesting. Now, you may consider another idea to be totally insane, but instead of writing your own DB engine implementation, what about relying on the FS? We discussed this idea recently in the Apache Directory community (we have pretty much the same concern: three levels of cache is just overkill). So if you take Window$ out of the picture (and even if you keep it in the full picture), many existing Linux/Unix filesystems are already implemented using a B-tree (EXT3/4, BTRFS, even NTFS!). What about using the underlying FS to store entries directly, instead of building a special file that acts as an intermediate layer? The main issue would be managing indexes, but that should not be a real problem. So every entry would be stored as a single file (it could even be in LDIF format :)
So far this is just a discussion we are having, but it might be worth a try at some point...
Does it sound insane?
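For what it's worth, the mechanical part of that proposal is tiny; a sketch of the entry-per-file idea in C, with the path scheme and the LDIF rendering left as assumptions (indexing, the real open question, is not addressed at all):

```c
#include <stdio.h>

/* Write one entry, already rendered as LDIF text, to its own file, e.g.
 * "<dbdir>/<some-encoding-of-the-DN>.ldif". The filesystem's own tree
 * structures then stand in for the backend's B-tree. */
static int store_entry_as_file(const char *path, const char *ldif_text)
{
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;
    int rc = (fputs(ldif_text, f) < 0) ? -1 : 0;
    if (fclose(f) != 0)
        rc = -1;
    return rc;
}
```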
I did try a dummy prototype a while back and it doesn't perform very well. You end up incurring too much overhead, and it doesn't pay off even when the underlying FS data is 100% cached. Plus, you can never truly control what happens with the FS cache; you can size it and influence it in some ways, but you cannot guarantee that your operation will hit cached data, which makes it difficult to deliver predictable response times. In other words, you have to accept I/O hits and widen your response window to the worst-case scenario for at least some percentage of operations. This can be optimized and made more predictable on a black box where you control the entire machine, but it's moot otherwise. The FS was ZFS, and just for the record, the performance didn't suck per se, but it didn't quite match traditional DB backend performance (especially with entry caches) either. I don't have SLAMD comparison data anymore to show you, unfortunately.
Emmanuel Lecharny wrote:
What about using the underlying FS to store entries directly, instead of building a special file that acts as an intermediate layer? The main issue would be managing indexes, but that should not be a real problem. So every entry would be stored as a single file (it could even be in LDIF format :)
Does it sound insane?
In fact we already have a back-ldif which does exactly this, but it's not intended for real use. It was only written to serve as a vehicle for back-config. (I.e., we wanted a simple, zero-config persistent store that could still behave like an LDAP database in very specifically defined use cases.) I'm pretty sure we've documented that it's not recommended for general purpose use, although some folks seem to want to misuse it that way regardless.
One of the main downsides - any such backend requires a couple system calls to access any given entry, and that generally means at least a few context switches. No matter how wonderfully efficient the FS itself is, anything that forces you to switch context between user mode and kernel mode for every entry is always a loss.
And no matter how wonderful these FSs are, to my knowledge none of them are using B-link trees, which means they all still have higher lock contention than necessary for reads, inserts, and deletes. In fact the only open B-link implementation I'm aware of is written in Java (bonus for you guys!), and some of the thornier issues of B-link management have only been solved in the past couple years. When I first started looking into them a few years ago the issue of Delete rebalancing hadn't actually been solved yet. This is all pretty new stuff. (In the original paper, the authors described how to do searches and inserts without any lock-coupling, which is a huge concurrency win. They had no solution for deletes though, and just allowed deleted nodes to accumulate in the tree.)
For a C implementation I'd try to re-use as much as possible of the existing BerkeleyDB code since it's quite mature and provides a lot of features we already like/want/need...
Anton Bobrov wrote:
You can never truly control what happens with the FS cache; you can size it and influence it in some ways, but you cannot guarantee that your operation will hit cached data, which makes it difficult to deliver predictable response times.
Also true, which is one of the reasons I wasn't too thrilled with Jong's original line of research here; it would degrade slapd's performance for the benefit of anything else on the box when other processes' resource demands increased. But in the face of a heavily overcommitted machine, all bets are off and you might as well go down gracefully instead of getting killed by OOM or somesuch.
Francis Swasey wrote:
What protections would be needed to deal with the (hopefully infrequent) system crash if we are depending on the OS filesystem cache to get things from memory to disk? What would be the recovery mechanism in the worst-case system crash?
Same as always. Transactions will use write-ahead logging. Recovery will be automatic.
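As a generic illustration of the write-ahead-logging discipline implied by that answer (this is only the textbook shape of WAL, not the log format or API of any engine discussed here): a record describing the change must reach stable storage before the corresponding database page may be modified in place, so crash recovery is just a replay of the log.

```c
#include <unistd.h>

/* Hypothetical log record: the after-image of one page change. */
struct log_rec {
    unsigned long txn_id;
    unsigned long page_no;
    unsigned int  len;      /* number of data bytes that follow */
};

/* Append one record and force it to stable storage. Only after this returns
 * may the corresponding DB page be modified in place; recovery then consists
 * of replaying the log from the last checkpoint. */
static int wal_append(int log_fd, const struct log_rec *rec, const void *data)
{
    if (write(log_fd, rec, sizeof(*rec)) != (ssize_t)sizeof(*rec))
        return -1;
    if (write(log_fd, data, rec->len) != (ssize_t)rec->len)
        return -1;
    return fsync(log_fd);   /* durability point: the log reaches disk first */
}
```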
Howard Chu writes:
The basic idea is to construct a database that is always mmap'd to a fixed virtual address, and which returns its mmap'd data pages directly to the caller (instead of copying them to a newly allocated buffer). Given a fixed address, it becomes feasible to make the on-disk record format identical to the in-memory format. (...)
One big problem, if I understand you correctly, is that such a database accumulates the results of bugs more efficiently than anything else. There's no layer between slapd and the on-disk database that might catch an error, or fail, before the error gets saved.
So if something does a wild pointer write which ends up in the database, it stays written. If something creates a wild pointer out from an entry in the database (ping ITS#5340), you've saved a "slapd will crash" state into the database itself. be_release() of the entry doesn't help, nor does stopping slapd. slapcat/slapadd might save you since it doesn't need the entire database, but it too is more fragile.
Hallvard B Furuseth writes:
More importantly, an Entry* from the database is a write handle into the database, which other modules *must not* make use of. If they do, the write happens even if the database never makes a decision to write anything - e.g. during a search operation, if the rwm overlay meddles a bit too deeply. Though that problem can be fixed with... sigh, an entry cache.
Not necessarily. We can mmap memory ranges with read-only access. That means whenever we modify an entry, we must duplicate it first, but that's already how things work now.
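A sketch of what that read-only discipline could look like in C (illustrative names only; the real entry structures and the write path are not shown): map the file PROT_READ so a stray write through an Entry* faults immediately instead of silently reaching the file, and copy an entry out of the map before anything modifies it.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

/* Map the database pages read-only; writers would go through a separate,
 * explicitly write-enabled path. */
void *db_map_readonly(int fd, size_t size, void *fixed_addr)
{
    return mmap(fixed_addr, size, PROT_READ, MAP_SHARED | MAP_FIXED, fd, 0);
}

/* Copy a mapped record into heap memory before anything modifies it;
 * essentially what slapd already does when it edits an entry. */
void *dup_for_modify(const void *mapped_rec, size_t len)
{
    void *copy = malloc(len);
    if (copy != NULL)
        memcpy(copy, mapped_rec, len);
    return copy;
}
```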
Howard Chu wrote:
The basic idea is to construct a database that is always mmap'd to a fixed virtual address, and which returns its mmap'd data pages directly to the caller (instead of copying them to a newly allocated buffer). Given a fixed address, it becomes feasible to make the on-disk record format identical to the in-memory format. (...)
One stumbling block: on Little-Endian machines, of which we seem to be cursed with an overabundance these days, the in-memory format for integers makes a terrible format for database keys. Byte-swapping them between the on-disk and in-memory formats would completely defeat the mmap'ing scheme. So there are two choices. The first: store them Little-Endian on disk and use a reverse-order key comparison function (which we did back in OpenLDAP 2.1). This would break portability of the database files to machines using Big-Endian format.
The other alternative is to store them in Big-Endian format, and just use them in their reversed order in memory. That would allow the database files to remain portable and eliminate the need for alternate key comparison functions. But it would require a custom iterator to do in-order traversals and entryID sorting comparisons.
At this point I'm leaning toward the former choice: store in native byte order and sacrifice portability. The alternative would have too big an impact on runtime performance. With the native byte order choice, this means that if you ever want to cluster a bunch of servers on the same database, they will all need to use the same byte order. (And of course, the same word size, which is the same requirement we have today.)
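A sketch of the kind of comparator the native-byte-order choice implies (the callback signature below is generic, not BDB's actual bt_compare prototype): keys are stored exactly as the integers sit in memory, and the comparison interprets them as integers rather than as byte strings, which is what a plain memcmp() would get wrong on a little-endian machine.

```c
#include <string.h>

typedef unsigned long ID;   /* entryID type, assumed to be word-sized */

/* Compare two entryID keys stored in native byte order. */
static int id_key_cmp(const void *a, const void *b)
{
    ID ida, idb;
    memcpy(&ida, a, sizeof(ID));
    memcpy(&idb, b, sizeof(ID));
    return (ida < idb) ? -1 : (ida > idb) ? 1 : 0;
}
```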
(Too bad C doesn't give us a "byteswapped" data attribute; some CPU architectures have instructions that can load a word from memory in a byte order that you choose. That would make life easier here, but if your CPU was that smart, it probably wouldn't be using brain-damaged byte order in the first place. Oh well...)
Reminder: LDAPCon2009 is just a couple weeks away!
I'm not sure how you expect a portable format anyway, when you are planning to store pointers in the database. Even on hosts with the same pointer format, I imagine which address ranges are available for the "fixed address" may differ.
Hallvard B Furuseth wrote:
I'm not sure how you expect a portable format anyway, when you are planning to store pointers in the database. Even on hosts with the same pointer format, I imagine which address ranges are available for the "fixed address" may differ.
Yeah, it would require pretty near identical machines and software configurations. And no address-space-layout-randomization. Never mind...
Howard Chu writes:
Yeah, it would require pretty near identical machines and software configurations. And no address-space-layout-randomization. Never mind...
BTW, I hope an mdb database will eventually be recoverable with slapcat if the fixed address becomes unavailable after an OS upgrade or whatever. E.g. store the address itself at the beginning of the file, and allow an operation mode which adjusts all pointers to the actual address. But that's a lot of pointer rewrites.
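The relocation pass suggested here is conceptually simple even if it touches a lot of data; a sketch with hypothetical layout names (the walk over every stored pointer in the file is elided):

```c
#include <stdint.h>

/* Hypothetical file header: remember which base address the stored
 * pointers were built against. */
struct db_header {
    uintptr_t built_for_addr;
};

/* Adjust one stored pointer from the old base to the address the file
 * actually got mapped at. A full relocation pass would apply this to
 * every pointer in the file. */
static void *relocate_ptr(void *stored, uintptr_t old_base, uintptr_t new_base)
{
    if (stored == NULL)
        return NULL;
    return (void *)((uintptr_t)stored - old_base + new_base);
}
```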
What will mdb do about data which can be shared by several slapd databases, like attribute descriptions?