Full_Name: Kristopher William Zyp
Version: LMDB 0.9.23
OS: Windows
URL: https://github.com/kriszyp/node-lmdb/commit/7ff525ae57684a163d32af74a0ab9332...
Submission from: (NULL) (71.199.6.148)
We have seen very poor performance syncing commits on large databases in Windows. On a database with 2GB of data, in writemap mode, the sync of even small commits is consistently well over 100ms (without writemap it is faster, but still slow). It is expected that a sync should take some time while waiting for disk confirmation of the writes, but more concerning is that these sync operations (in writemap mode) are dominated by nearly 100% system CPU utilization: transactions whose b-tree updates take well under a millisecond then burn very large amounts of system CPU cycles during the sync phase.
I think the fundamental problem is that FlushViewOfFile appears to be an O(n) operation, where n is the size of the file (or map). I presume that Windows scans the entire map/file for dirty pages to flush, presumably because it does not keep an internal index of the dirty pages of every file/map-view in the OS disk cache. The flush therefore turns into an extremely expensive, CPU-bound scan of a large file just to find the dirty pages and initiate their writes, which is of course contrary to the whole goal of a scalable database system. FlushFileBuffers is also relatively slow. We have attempted to batch as many operations into a single transaction as possible, but this overhead remains very large.
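To make the O(n) claim concrete, here is a minimal, self-contained sketch (not part of the patch; the file name and 2GB size are placeholders, error checks omitted, 64-bit build assumed) that dirties a single page of a large mapping and times a flush of the whole view:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    const ULONGLONG size = 2ULL << 30;  /* 2GB, matching the report */
    HANDLE f = CreateFileA("big.db", GENERIC_READ | GENERIC_WRITE,
                           0, NULL, OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    HANDLE m = CreateFileMappingA(f, NULL, PAGE_READWRITE,
                                  (DWORD)(size >> 32), (DWORD)size, NULL);
    char *p = MapViewOfFile(m, FILE_MAP_WRITE, 0, 0, 0);

    p[0] = 1;  /* dirty just one page */

    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);
    FlushViewOfFile(p, 0);  /* 0 = flush the entire view */
    QueryPerformanceCounter(&t1);

    /* If the cost grows with the size of the view rather than with the
       number of dirty pages, that is the behavior described above. */
    printf("flush took %.3f ms\n",
           (t1.QuadPart - t0.QuadPart) * 1000.0 / freq.QuadPart);

    UnmapViewOfFile(p);
    CloseHandle(m);
    CloseHandle(f);
    return 0;
}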
The Windows documentation for FlushFileBuffers itself warns about the inefficiency of this function (https://docs.microsoft.com/en-us/windows/desktop/api/fileapi/nf-fileapi-flus...). It also points to the solution: it is much faster to write out the dirty pages with WriteFile through a handle opened with FILE_FLAG_WRITE_THROUGH.
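As a hedged illustration of that suggestion (this is not the patch itself; the function name and parameters are hypothetical), writing a dirty page through a write-through handle looks roughly like this:

#include <windows.h>

BOOL write_page_through(const char *path, const void *page,
                        DWORD pagesize, ULONGLONG offset)
{
    /* FILE_FLAG_WRITE_THROUGH makes each WriteFile go through to the
       media, with no separate FlushFileBuffers pass needed. */
    HANDLE h = CreateFileA(path, GENERIC_WRITE,
                           FILE_SHARE_READ | FILE_SHARE_WRITE, NULL,
                           OPEN_EXISTING, FILE_FLAG_WRITE_THROUGH, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return FALSE;

    OVERLAPPED ov = {0};  /* used here only to carry the file offset */
    ov.Offset = (DWORD)offset;
    ov.OffsetHigh = (DWORD)(offset >> 32);

    DWORD written;
    BOOL ok = WriteFile(h, page, pagesize, &written, &ov);
    CloseHandle(h);
    return ok && written == pagesize;
}

In a real implementation the write-through handle would of course be opened once and reused; it is opened per call here only to keep the sketch self-contained.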
The associated patch (https://github.com/kriszyp/node-lmdb/commit/7ff525ae57684a163d32af74a0ab9332...) is my attempt at implementing this solution for Windows. Fortunately, with the design of LMDB, this is relatively straightforward: LMDB already supports writing out dirty pages with WriteFile calls. I added a write-through handle for sending these writes directly to disk, and made that handle overlapped/asynchronous, so that all the writes for a commit can be started in overlapped mode and (at least theoretically) transfer to the drive in parallel, with GetOverlappedResult used to wait for their completion. So mdb_page_flush essentially becomes the sync. I extended the writing of dirty pages through WriteFile to writemap mode as well (including writing the meta page), so that write-through WriteFile can flush the data without ever calling FlushViewOfFile or FlushFileBuffers. I also implemented write gathering in writemap mode, where contiguous file positions imply contiguous memory (tracking the starting position with wdp and writing contiguous pages in a single operation). Sorting of the dirty list is maintained even in writemap mode for this purpose.
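For clarity, here is a rough sketch of that control flow, with a hypothetical dirty_page list and flush_pages function standing in for LMDB's real dirty list; the actual patch does this inside mdb_page_flush, with gathering of contiguous pages:

#include <windows.h>
#include <stdlib.h>

typedef struct { const void *data; DWORD len; ULONGLONG off; } dirty_page;

/* h must be opened with FILE_FLAG_OVERLAPPED | FILE_FLAG_WRITE_THROUGH. */
BOOL flush_pages(HANDLE h, const dirty_page *pages, int n)
{
    OVERLAPPED *ovs = calloc(n, sizeof *ovs);
    BOOL ok = TRUE;
    int i, issued = 0;

    if (!ovs)
        return FALSE;

    /* Phase 1: issue every write without waiting, so the transfers can
       (at least in principle) proceed in parallel on the device. */
    for (i = 0; ok && i < n; i++) {
        ovs[i].Offset = (DWORD)pages[i].off;
        ovs[i].OffsetHigh = (DWORD)(pages[i].off >> 32);
        ovs[i].hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);
        if (!WriteFile(h, pages[i].data, pages[i].len, NULL, &ovs[i])
            && GetLastError() != ERROR_IO_PENDING)
            ok = FALSE;
        else
            issued++;
    }

    /* Phase 2: wait for each completion. With write-through, completion
       means the data has reached the disk, so this loop replaces the
       FlushViewOfFile/FlushFileBuffers step entirely. */
    for (i = 0; i < issued; i++) {
        DWORD done;
        if (!GetOverlappedResult(h, &ovs[i], &done, TRUE)
            || done != pages[i].len)
            ok = FALSE;
    }
    for (i = 0; i < n; i++)
        if (ovs[i].hEvent)
            CloseHandle(ovs[i].hEvent);
    free(ovs);
    return ok;
}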
The performance benefits of this patch, in my testing, are considerable. Writing out/syncing transactions is typically over 5x faster in writemap mode and 2x faster in standard mode. Perhaps more importantly (especially in environments with many threads/processes), the efficiency gains are even larger, particularly in writemap mode, where this patch can reduce system CPU usage by 50-100x. This brings Windows performance with synced transactions in LMDB back into the range of "lightning" performance :).
All of the changes in the associated patch should only affect Windows. I had actually started with the approach of using a flag to indicate write-through behavior (here: https://github.com/kriszyp/node-lmdb/commit/435ca423d0e13936f2a5f0193e994f54..., if anyone wants to test it or play with it further). However, I didn't see any substantive improvement on unix. It is possible that issuing dsync writes asynchronously in parallel and waiting for their completion (aio_write/aio_suspend) might offer a performance opportunity, but I was less interested in this, and I assume you have already thoroughly explored these options. In the end, I think it makes more sense to simply make this the default behavior on Windows, where the major improvement is, and leave the unix behavior unchanged. Also, I don't think you can write to files that have an open writable mapped region, so the implementation would be limited on unix anyway.
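For what it's worth, the unix variant I'm speculating about would look something like the following untested sketch (fd_dsync, dirty_page, and flush_pages_aio are all hypothetical names, and this ignores the mapped-region problem just mentioned):

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>

typedef struct { const void *data; size_t len; off_t off; } dirty_page;

/* fd_dsync is assumed to be open(path, O_WRONLY | O_DSYNC). */
int flush_pages_aio(int fd_dsync, const dirty_page *pages, int n)
{
    struct aiocb *cbs = calloc(n, sizeof *cbs);
    int i, rc = 0, issued = 0;

    if (!cbs)
        return -1;

    /* Issue all dsync writes without waiting. */
    for (i = 0; i < n; i++) {
        cbs[i].aio_fildes = fd_dsync;
        cbs[i].aio_buf = (void *)pages[i].data;
        cbs[i].aio_nbytes = pages[i].len;
        cbs[i].aio_offset = pages[i].off;
        if (aio_write(&cbs[i]) != 0) {
            rc = -1;
            break;
        }
        issued++;
    }

    /* Wait for each to finish; because of O_DSYNC, completion implies
       the data is on stable storage. */
    for (i = 0; i < issued; i++) {
        const struct aiocb *one[1] = { &cbs[i] };
        while (aio_error(&cbs[i]) == EINPROGRESS)
            aio_suspend(one, 1, NULL);
        if (aio_return(&cbs[i]) != (ssize_t)pages[i].len)
            rc = -1;
    }
    free(cbs);
    return rc;
}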
Anyway, this is certainly a more involved patch, in a sophisticated code-base, so I humbly present it for consideration. I have tested both the performance and the sync safety of this code. Our application has a high enough intensity of db interaction that it is actually pretty easy to reproduce LMDB data corruption by powering off a VM with the app running when sync mode is not enabled, and with my testing so far, sync mode with this patch seems to preserve the rock-solid crash-proof design of LMDB. I'd be glad to make any changes or cleanup needed, or to submit the patch differently if you prefer (though using patches from my node repo fork seems to have worked in the past).