A suggestion was made to use a read/write mmap (as an option), to allow writes to be performed with no syscall overhead. I'm thinking that might be ok as a completely separate version of the library, because a fair bit of the code would need to change to accommodate that update style, and it would push the library over the 32K boundary.
Also, this isn't as cool a suggestion as it sounds - it completely gives up MDB's current immunity to corruption, and in fact makes reliability much less predictable. When you write through an mmap, you have absolutely no idea when the OS is going to get around to flushing the data back to disk, or in what order the flushes will occur. You can force the OS's hand by calling msync() on every page you want to flush, in the order you want them flushed, but then you're back to having syscall overhead again, and by calling msync() in a particular order you defeat the underlying filesystem's ability to schedule the writes for optimal seeks.
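To make that second option concrete, here's a rough sketch (not actual MDB code - flush_pages_in_order, map, pgno_list, and pgsize are made-up names) of what explicitly ordered flushing through a writable mmap boils down to:

    #include <sys/mman.h>
    #include <stddef.h>

    /* Sketch only: flush a list of dirty pages in a fixed order through a
     * writable mmap. One msync() syscall per page, and the write order is
     * dictated by us instead of being left to the OS/filesystem to schedule. */
    static int
    flush_pages_in_order(void *map, const size_t *pgno_list, int npages,
        size_t pgsize)
    {
        int i;

        for (i = 0; i < npages; i++) {
            char *addr = (char *)map + pgno_list[i] * pgsize;
            /* MS_SYNC blocks until this one page is on disk */
            if (msync(addr, pgsize, MS_SYNC) != 0)
                return -1;
        }
        return 0;
    }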
Currently, by using writev, we can push a lot of data to the OS, and then when we call fdatasync() at the end, the OS schedules those writes as it sees fit. Right now the only ordering dependency MDB has is that all of the data pages must be flushed successfully before flushing the meta page, so we can afford to let the OS schedule all of the data page writes, and then do an explicitly synchronous write of the meta page.
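For comparison, the current commit path is roughly the following - a simplified illustration of the ordering only, not the actual mdb.c code; commit_pages, dfd, etc. are made-up names, and seeking to the correct file offsets is omitted:

    #include <sys/uio.h>
    #include <unistd.h>

    /* Sketch only: push all dirty data pages to the OS at once, let it
     * schedule the physical writes, then make the meta page write the
     * single ordered commit point. */
    static int
    commit_pages(int dfd, const struct iovec *iov, int iovcnt,
        const void *meta, size_t metalen, off_t metaofs)
    {
        /* 1. hand all the data pages to the OS in one syscall */
        if (writev(dfd, iov, iovcnt) < 0)
            return -1;
        /* 2. every data page must be durable before the meta page */
        if (fdatasync(dfd) != 0)
            return -1;
        /* 3. the meta page write is the commit point */
        if (pwrite(dfd, meta, metalen, metaofs) != (ssize_t)metalen)
            return -1;
        if (fdatasync(dfd) != 0)
            return -1;
        return 0;
    }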
So, with a writable mmap, we're stuck with the choice of either (a) not knowing at all whether our data has been flushed, or (b) being forced to explicitly flush every page ourselves, in a predetermined order that we have no way of knowing is optimal for the current disk layout.
It seems to me this can only be a viable mode of operation if you're always going to run asynch and don't care much about transaction durability or DB recoverability. Running in this mode offers absolutely zero crash resistance; the entire DB will almost always be irreparably damaged after a system crash.
Would you run like that, if it offered you the potential of maybe 10x faster write performance? (It could be useful for slapadd -q, certainly.)
Howard Chu wrote:
It seems to me this can only be a viable mode of operation if you're always going to run asynch and don't care much about transaction durability or DB recoverability. Running in this mode offers absolutely zero crash resistance; the entire DB will almost always be irreparably damaged after a system crash.
Would you run like that, if it offered you the potential of maybe 10x faster write performance? (It could be useful for slapadd -q, certainly.)
OK, a 10x speedup was far too optimistic. After quickly cobbling the changes together, it looks more like about a 70% speedup.
slapadd -q with 5 million entries took 24m16s as originally reported at LDAPCon last year.
With current mdb.master it takes 22m24s:
real    22m23.984s
user    26m1.658s
sys     8m17.415s
With the writable mmap and no msyncs it took 13m17s.
real    13m17.225s
user    22m15.511s
sys     1m12.533s
This code is currently available on the map2 branch of my git repo on ada.openldap.org. I'll clean it up a bit further then push it to mdb.master after some more testing.
Howard Chu wrote:
Howard Chu wrote:
It seems to me this can only be a viable mode of operation if you're always going to run asynch and don't care much about transaction durability or DB recoverability. Running in this mode offers absolutely zero crash resistance; the entire DB will almost always be irreparably damaged after a system crash.
Would you run like that, if it offered you the potential of maybe 10x faster write performance? (It could be useful for slapadd -q, certainly.)
OK, a 10x speedup was far too optimistic. After quickly cobbling the changes together, it looks more like about a 70% speedup.
slapadd -q with 5 million entries took 24m16s as originally reported at LDAPCon last year.
With current mdb.master it takes 22m24s:
real    22m23.984s
user    26m1.658s
sys     8m17.415s
With the writable mmap and no msyncs it took 13m17s.
real    13m17.225s
user    22m15.511s
sys     1m12.533s
This code is currently available on the map2 branch of my git repo on ada.openldap.org. I'll clean it up a bit further then push it to mdb.master after some more testing.
The speedup seems to be proportional to the number of indices that are defined on the database. With Quanah's torture-test LDIF (~6 million entries, 4.9GB), there's only a small difference between the two when no indices are defined:
Original mdb.master:
real    11m27.385s
user    15m5.489s
sys     6m47.825s
Writemap:
real    10m27.447s
user    14m35.663s
sys     6m10.767s
But with 31 indices defined...
Original:
real    94m35.862s
user    93m31.755s
sys     20m39.693s
Writemap:
real    42m53.499s
user    54m35.509s
sys     7m23.364s
Over a 2:1 speedup.
On 4/9/2012 5:14 PM, Howard Chu wrote:
It seems to me this can only be a viable mode of operation if you're always going to run asynch and don't care much about transaction durability or DB recoverability. Running in this mode offers absolutely zero crash resistance; the entire DB will almost always be irreparably damaged after a system crash.
Would you run like that, if it offered you the potential of maybe 10x faster write performance?
Absolutely not! DB reliability is crucial to us. Transaction durability too, esp. since we are using syncrepl in refreshAndPersist mode and numerous production systems (consumers) are affected.
Thanks for asking! Regards, Nick
Nikolaos Milas wrote:
On 4/9/2012 5:14 PM, Howard Chu wrote:
It seems to me this can only be a viable mode of operation if you're always going to run asynch and don't care much about transaction durability or DB recoverability. Running in this mode offers absolutely zero crash resistance; the entire DB will almost always be irreparably damaged after a system crash.
Would you run like that, if it offered you the potential of maybe 10x faster write performance?
Absolutely not! DB reliability is crucial to us. Transaction durability too, esp. since we are using syncrepl in refreshAndPersist mode and numerous production systems (consumers) are affected.
Thanks for asking!
Thanks for your response; that's my view on it as well. Still, there's room for experimentation and there are other folks using the MDB library in their own applications, who may be OK with different reliability levels. It looks like we can actually support all of these optional behaviors without too much fuss.
Howard Chu wrote:
Nikolaos Milas wrote:
On 4/9/2012 5:14 PM, Howard Chu wrote:
It seems to me this can only be a viable mode of operation if you're always going to run asynch and don't care much about transaction durability or DB recoverability. Running in this mode offers absolutely zero crash resistance; the entire DB will almost always be irreparably damaged after a system crash.
Would you run like that, if it offered you the potential of maybe 10x faster write performance?
Absolutely not! DB reliability is crucial to us. Transaction durability too, esp. since we are using syncrepl in refreshAndPersist mode and numerous production systems (consumers) are affected.
Thanks for asking!
Thanks for your response; that's my view on it as well. Still, there's room for experimentation and there are other folks using the MDB library in their own applications, who may be OK with different reliability levels. It looks like we can actually support all of these optional behaviors without too much fuss.
I've posted new microbenchmark results with the writable mmap.
http://highlandsun.com/hyc/mdb/microbench/
In asynchronous mode MDB writes are pretty much fastest all across the board. In fully synchronous mode it's slower but still quite competitive.
Now that these features are in the library, some additional work remains to allow configuring them in back-mdb. (So much for back-mdb having no performance tuning options. I guess that was never going to last...)
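For applications using the library directly, selecting these modes would look something like the sketch below - the flag and header names (MDB_WRITEMAP, MDB_NOSYNC, lmdb.h) and the path are illustrative assumptions, not a final API commitment:

    #include "lmdb.h"

    /* Sketch: open an environment with the writable mmap and no sync at
     * commit. Fast, but per the discussion above a system crash can leave
     * the DB unrecoverable in this mode. */
    int
    open_env_async(MDB_env **envp)
    {
        int rc;

        if ((rc = mdb_env_create(envp)) != 0)
            return rc;
        mdb_env_set_mapsize(*envp, 1UL << 30);  /* 1 GiB map, arbitrary */
        rc = mdb_env_open(*envp, "./testdb",
            MDB_WRITEMAP | MDB_NOSYNC, 0664);
        if (rc != 0) {
            mdb_env_close(*envp);
            *envp = NULL;
        }
        return rc;
    }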
--On Friday, September 07, 2012 6:04 AM -0700 Howard Chu <hyc@symas.com> wrote:
In asynchronous mode MDB writes are pretty much fastest all across the board. In fully synchronous mode it's slower but still quite competitive.
I enabled the new bits in my local RE24 build and ran a test against one of the largest real-data databases available to me: a 4.6 GB LDIF file. The test was done on an Ubuntu 10 LTS VM with 16 GB of RAM, using slapadd -q.
MDB prior to writemap:
time /opt/zimbra/openldap/sbin/slapadd -q -b "" -l /tmp/ldap.bak -F /opt/zimbra/data/ldap/config
-#################### 100.00% eta none elapsed 02h21m04s spd 567.5 k/s
Closing DB...
real 141m5.743s
MDB after writemap:
zimbra@zre-ldap001:~$ time /opt/zimbra/openldap-2.4.33.2z/sbin/slapadd -F /opt/zimbra/data/ldap/config -q -b "" -l /tmp/frontier.ldif
.#################### 100.00% eta none elapsed 45m19s spd 1.7 M/s
Closing DB...
real 45m19.682s
HDB (Using a 12GB SHM key):
zimbra@zre-ldap001:~$ time ./libexec/zmslapadd -q -F /opt/zimbra/data/ldap/config -b "" -l /tmp/frontier.ldif
_#################### 100.00% eta none elapsed 01h40m27s spd 797.0 k/s
Closing DB...
real 101m22.404s
So in my scenario, MDB is now 3.1 times faster than it used to be, and it is now 2.25 times faster than BDB. Nice!
--Quanah
--
Quanah Gibson-Mount
Sr. Member of Technical Staff
Zimbra, Inc
A Division of VMware, Inc.
--------------------
Zimbra :: the leader in open source messaging and collaboration