hyc@symas.com wrote:
> marvin.mundry@uni-hamburg.de wrote:
>> Full_Name: Marvin Mundry
>> Version: 2.4.33
>> OS: Ubuntu 12.10
>> URL: https://idmswiki.rrz.uni-hamburg.de:8005/debug.tar.bz
>> Submission from: (NULL) (134.100.2.183)
> Thanks for the report. The crash has been fixed in git, but your test runs into another (known) issue in MDB.
>
> You're working with a very large entry, which libmdb stores in overflow pages. In the current version of libmdb, freespace management for overflow pages is not fully implemented, so every time you update the entry libmdb uses new pages instead of reusing old ones. Thus, after a few hundred operations, your 1GB map will be exhausted.
>
> It looks like you won't be able to use back-mdb until this feature is fully implemented in libmdb.
So the issue is how to find a contiguous run of pages large enough to satisfy the overflow page, in the current freelist. This takes us into the realm of malloc algorithms, first-fit/best-fit/..., etc.
I think first we scan whatever freelist we have in memory, to see if a suitable run of pages is already present.
If not, and there are additional freelists still available:
1) we could just merge all of them, and then search again, or
2) merge one at a time, and search again.
Leaning toward #2; I suspect we don't need to coalesce all freelists all the time.
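
To make that concrete, here is a minimal sketch of the scan plus the incremental merge of option #2, assuming the in-memory freelist is held as a sorted array of page numbers. The types and the merge_next_freelist() helper are illustrative stand-ins, not libmdb's actual internals.

/* Sketch only, not libmdb's real allocator. */
#include <stddef.h>

typedef unsigned long pgno_t;

typedef struct freelist {
    pgno_t *ids;     /* free page numbers, sorted ascending, no duplicates */
    size_t  count;
} freelist_t;

/* Return the first page of a run of `want` contiguous free pages,
 * or 0 if this list contains no such run (first-fit). */
static pgno_t find_run(const freelist_t *fl, size_t want)
{
    if (fl->count == 0)
        return 0;
    if (want <= 1)
        return fl->ids[0];                  /* any single free page will do */

    size_t run = 1;
    for (size_t i = 1; i < fl->count; i++) {
        run = (fl->ids[i] == fl->ids[i - 1] + 1) ? run + 1 : 1;
        if (run == want)
            return fl->ids[i - want + 1];   /* earliest suitable run wins */
    }
    return 0;
}

/* Option #2 above: pull in one additional stored freelist at a time
 * and rescan, instead of coalescing everything up front.
 * merge_next_freelist() is an assumed helper that merges the next
 * available freelist into fl and returns 0 when none are left. */
extern int merge_next_freelist(freelist_t *fl);

static pgno_t alloc_run(freelist_t *fl, size_t want)
{
    pgno_t pg;
    while ((pg = find_run(fl, want)) == 0)
        if (!merge_next_freelist(fl))
            return 0;                       /* caller falls back to new pages */
    return pg;
}

First-fit keeps the scan to a single pass; whether best-fit would be worth the extra bookkeeping depends on how fragmented the freelist gets in practice.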
--On Monday, November 05, 2012 10:10 AM -0800 Howard Chu hyc@symas.com wrote:
> If not, and there are additional freelists still available:
> 1) we could just merge all of them, and then search again, or
> 2) merge one at a time, and search again.
>
> Leaning toward #2; I suspect we don't need to coalesce all freelists all the time.
I like the sound of #2 as well. If you come up with a patch, I can test. ;)
--Quanah
Quanah Gibson-Mount wrote:
> --On Monday, November 05, 2012 10:10 AM -0800 Howard Chu hyc@symas.com wrote:
>> Leaning toward #2; I suspect we don't need to coalesce all freelists all the time.
>
> I like the sound of #2 as well. If you come up with a patch, I can test. ;)
Well, just as at the last LinuxCon, we've picked up some new input at this one. Theodore Ts'o (ext4 lead developer) raised the topic of Erase Blocks on flash-based storage devices. If we can ensure that our page allocations are aligned with the Erase Block size of the underlying device, we'll get higher write throughput on SSDs, MMC cards, etc. Erase Blocks are commonly 32KB or 64KB today, with 128KB coming soon.
So we may want to think about chunking our page allocations into power-of-two chunks, perhaps controlled by a separate environment flag.
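
As a rough illustration of the chunking idea, the sketch below rounds an overflow request up to a power-of-two page count and aligns it to the erase block. The page and erase-block sizes are example values only; in practice the erase-block size would have to come from a new (hypothetical) environment setting rather than any existing libmdb API.

/* Sketch only: erase-block-aligned, power-of-two chunking. */
#include <stddef.h>

typedef unsigned long pgno_t;

#define DB_PAGE_SIZE  4096UL                        /* example DB page size */
#define ERASE_BLOCK   (64UL * 1024)                 /* example 64KB erase block */
#define EB_PAGES      (ERASE_BLOCK / DB_PAGE_SIZE)  /* 16 DB pages per erase block */

/* Round a requested page count up to the allocation granularity:
 * small requests go to the next power of two, large requests to a
 * whole number of erase blocks. */
static size_t chunk_pages(size_t npages)
{
    if (npages >= EB_PAGES)
        return (npages + EB_PAGES - 1) & ~(EB_PAGES - 1);
    size_t p = 1;
    while (p < npages)
        p <<= 1;
    return p;
}

/* Align the first page of a run to an erase-block boundary. */
static pgno_t align_to_erase_block(pgno_t pgno)
{
    return (pgno + EB_PAGES - 1) & ~(pgno_t)(EB_PAGES - 1);
}

With 4KB pages and a 64KB erase block, a 5-page request would be rounded up to 8 pages and a 20-page request up to 32 pages.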
Even if we don't explicitly try to form 64KB chunks all the time, it may be best for us to fully coalesce all available freelists whenever a request for an overflow page arrives. That way the default case of single-page requests continues as before, overflow pages get some chance of reusing old pages, and otherwise they fall back to new pages, as they do today.
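
A sketch of that default policy, with all helper names assumed rather than taken from libmdb: single-page requests stay on today's cheap path, and only an overflow (multi-page) request pays for a full coalesce before falling back to new pages.

/* Sketch only; the helpers are illustrative, not real libmdb calls. */
#include <stddef.h>

typedef unsigned long pgno_t;
typedef struct freelist freelist_t;

pgno_t find_free_run(freelist_t *fl, size_t want); /* scan, as in the earlier sketch */
void   coalesce_all_freelists(freelist_t *fl);     /* merge every reclaimable list */
pgno_t use_new_pages(size_t want);                 /* allocate at the end of the map, as today */

static pgno_t alloc_pages(freelist_t *fl, size_t want)
{
    pgno_t pg = find_free_run(fl, want);   /* also covers the want == 1 case */
    if (pg == 0 && want > 1) {
        coalesce_all_freelists(fl);        /* only when an overflow request arrives */
        pg = find_free_run(fl, want);      /* some chance of reusing old pages */
    }
    return pg ? pg : use_new_pages(want);  /* otherwise fall back to new pages */
}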
Interestingly, the Red Hat folks expressed a desire to adopt MDB in RPM, which currently uses BerkeleyDB. Ironically, they'd like an option to run in pure append-only mode, to allow rolling back to the previous state if one package of a large upgrade fails and the user decides to abandon the entire upgrade.
It would be simple enough to add an environment flag for append-only mode, which would skip all of the freelist management entirely. I'm not interested in looking at that yet; it can be addressed later if/when there's movement on the RPM project.