Full_Name: Kristopher William Zyp
Version: LMDB 0.9.23
OS: Windows
URL: https://github.com/kriszyp/node-lmdb/commit/7ff525ae57684a163d32af74a0ab9332...
Submission from: (NULL) (71.199.6.148)
We have seen very poor performance syncing commits on large databases in Windows. On a database with 2GB of data, in writemap mode, the sync of even small commits is consistently well over 100ms (without writemap it is faster, but still slow). It is expected that a sync should take some time while waiting for disk confirmation of the writes, but more concerning is that these sync operations (in writemap mode) are dominated by nearly 100% system CPU utilization: transactions whose b-tree updates take well under a millisecond then burn very large amounts of system CPU cycles during the sync phase.
I think the fundamental problem is that FlushViewOfFile appears to be an O(n) operation, where n is the size of the file (or map). I presume that Windows scans the entire map/file for dirty pages to flush, presumably because it does not keep an internal index of the dirty pages of every file/map-view in the OS disk cache. The flush therefore turns into an extremely expensive, CPU-bound scan of a large file just to find the dirty pages and initiate their writes, which is of course contrary to the whole goal of a scalable database system. FlushFileBuffers is also relatively slow. We have attempted to batch as many operations into a single transaction as possible, but this overhead remains very large.
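To make the O(n) claim concrete, here is a minimal, self-contained sketch (not part of the patch; the file name and 2GB size are placeholders, error checks omitted, 64-bit build assumed) that dirties a single page of a large mapping and times a flush of the whole view:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    const ULONGLONG size = 2ULL << 30;  /* 2GB, matching the report */
    HANDLE f = CreateFileA("big.db", GENERIC_READ | GENERIC_WRITE,
                           0, NULL, OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    HANDLE m = CreateFileMappingA(f, NULL, PAGE_READWRITE,
                                  (DWORD)(size >> 32), (DWORD)size, NULL);
    char *p = MapViewOfFile(m, FILE_MAP_WRITE, 0, 0, 0);

    p[0] = 1;  /* dirty just one page */

    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);
    FlushViewOfFile(p, 0);  /* 0 = flush the entire view */
    QueryPerformanceCounter(&t1);

    /* If the cost grows with the size of the view rather than with the
       number of dirty pages, that is the behavior described above. */
    printf("flush took %.3f ms\n",
           (t1.QuadPart - t0.QuadPart) * 1000.0 / freq.QuadPart);

    UnmapViewOfFile(p);
    CloseHandle(m);
    CloseHandle(f);
    return 0;
}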
The Windows documentation for FlushFileBuffers itself warns about the inefficiency of this function (https://docs.microsoft.com/en-us/windows/desktop/api/fileapi/nf-fileapi-flus...). It also points to the solution: it is much faster to write out the dirty pages with WriteFile through a handle opened with FILE_FLAG_WRITE_THROUGH.
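As a hedged illustration of that suggestion (this is not the patch itself; the function name and parameters are hypothetical), writing a dirty page through a write-through handle looks roughly like this:

#include <windows.h>

BOOL write_page_through(const char *path, const void *page,
                        DWORD pagesize, ULONGLONG offset)
{
    /* FILE_FLAG_WRITE_THROUGH makes each WriteFile go through to the
       media, with no separate FlushFileBuffers pass needed. */
    HANDLE h = CreateFileA(path, GENERIC_WRITE,
                           FILE_SHARE_READ | FILE_SHARE_WRITE, NULL,
                           OPEN_EXISTING, FILE_FLAG_WRITE_THROUGH, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return FALSE;

    OVERLAPPED ov = {0};  /* used here only to carry the file offset */
    ov.Offset = (DWORD)offset;
    ov.OffsetHigh = (DWORD)(offset >> 32);

    DWORD written;
    BOOL ok = WriteFile(h, page, pagesize, &written, &ov);
    CloseHandle(h);
    return ok && written == pagesize;
}

In a real implementation the write-through handle would of course be opened once and reused; it is opened per call here only to keep the sketch self-contained.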
The associated patch (https://github.com/kriszyp/node-lmdb/commit/7ff525ae57684a163d32af74a0ab9332...) is my attempt at implementing this solution for Windows. Fortunately, with the design of LMDB, this is relatively straightforward: LMDB already supports writing out dirty pages with WriteFile calls. I added a write-through handle for sending these writes directly to disk, and made that handle overlapped/asynchronous, so that all the writes for a commit can be started in overlapped mode and (at least theoretically) transfer to the drive in parallel, with GetOverlappedResult used to wait for their completion. So mdb_page_flush essentially becomes the sync. I extended the writing of dirty pages through WriteFile to writemap mode as well (including writing the meta page), so that write-through WriteFile can flush the data without ever calling FlushViewOfFile or FlushFileBuffers. I also implemented write gathering in writemap mode, where contiguous file positions imply contiguous memory (tracking the starting position with wdp and writing contiguous pages in a single operation). Sorting of the dirty list is maintained even in writemap mode for this purpose.
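For clarity, here is a rough sketch of that control flow, with a hypothetical dirty_page list and flush_pages function standing in for LMDB's real dirty list; the actual patch does this inside mdb_page_flush, with gathering of contiguous pages:

#include <windows.h>
#include <stdlib.h>

typedef struct { const void *data; DWORD len; ULONGLONG off; } dirty_page;

/* h must be opened with FILE_FLAG_OVERLAPPED | FILE_FLAG_WRITE_THROUGH. */
BOOL flush_pages(HANDLE h, const dirty_page *pages, int n)
{
    OVERLAPPED *ovs = calloc(n, sizeof *ovs);
    BOOL ok = TRUE;
    int i, issued = 0;

    if (!ovs)
        return FALSE;

    /* Phase 1: issue every write without waiting, so the transfers can
       (at least in principle) proceed in parallel on the device. */
    for (i = 0; ok && i < n; i++) {
        ovs[i].Offset = (DWORD)pages[i].off;
        ovs[i].OffsetHigh = (DWORD)(pages[i].off >> 32);
        ovs[i].hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);
        if (!WriteFile(h, pages[i].data, pages[i].len, NULL, &ovs[i])
            && GetLastError() != ERROR_IO_PENDING)
            ok = FALSE;
        else
            issued++;
    }

    /* Phase 2: wait for each completion. With write-through, completion
       means the data has reached the disk, so this loop replaces the
       FlushViewOfFile/FlushFileBuffers step entirely. */
    for (i = 0; i < issued; i++) {
        DWORD done;
        if (!GetOverlappedResult(h, &ovs[i], &done, TRUE)
            || done != pages[i].len)
            ok = FALSE;
    }
    for (i = 0; i < n; i++)
        if (ovs[i].hEvent)
            CloseHandle(ovs[i].hEvent);
    free(ovs);
    return ok;
}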
The performance benefits of this patch, in my testing, are considerable. Writing out/syncing transactions is typically over 5x faster in writemap mode and 2x faster in standard mode. Perhaps more importantly (especially in environments with many threads/processes), the efficiency gains are even larger, particularly in writemap mode, where this patch can reduce system CPU usage by 50-100x. This brings Windows performance with synced transactions in LMDB back into the range of "lightning" performance :).
All of the changes in the associated patch should only affect Windows. I had actually started with the approach of using a flag to indicate write-through behavior (here: https://github.com/kriszyp/node-lmdb/commit/435ca423d0e13936f2a5f0193e994f54..., if anyone wants to test it or play with it further). However, I didn't see any substantive improvement on unix. It is possible that issuing dsync writes asynchronously in parallel and waiting for their completion (aio_write/aio_suspend) might offer a performance opportunity, but I was less interested in this, and I assume you have already thoroughly explored these options. In the end, I think it makes more sense to simply make this the default behavior on Windows, where the major improvement is, and leave the unix behavior unchanged. Also, I don't think you can write to files that have an open writable mapped region, so the implementation would be limited on unix anyway.
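For what it's worth, the unix variant I'm speculating about would look something like the following untested sketch (fd_dsync, dirty_page, and flush_pages_aio are all hypothetical names, and this ignores the mapped-region problem just mentioned):

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>

typedef struct { const void *data; size_t len; off_t off; } dirty_page;

/* fd_dsync is assumed to be open(path, O_WRONLY | O_DSYNC). */
int flush_pages_aio(int fd_dsync, const dirty_page *pages, int n)
{
    struct aiocb *cbs = calloc(n, sizeof *cbs);
    int i, rc = 0, issued = 0;

    if (!cbs)
        return -1;

    /* Issue all dsync writes without waiting. */
    for (i = 0; i < n; i++) {
        cbs[i].aio_fildes = fd_dsync;
        cbs[i].aio_buf = (void *)pages[i].data;
        cbs[i].aio_nbytes = pages[i].len;
        cbs[i].aio_offset = pages[i].off;
        if (aio_write(&cbs[i]) != 0) {
            rc = -1;
            break;
        }
        issued++;
    }

    /* Wait for each to finish; because of O_DSYNC, completion implies
       the data is on stable storage. */
    for (i = 0; i < issued; i++) {
        const struct aiocb *one[1] = { &cbs[i] };
        while (aio_error(&cbs[i]) == EINPROGRESS)
            aio_suspend(one, 1, NULL);
        if (aio_return(&cbs[i]) != (ssize_t)pages[i].len)
            rc = -1;
    }
    free(cbs);
    return rc;
}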
Anyway, this is certainly a more involved patch, in a sophisticated code-base, so I humbly present it for consideration. I have tested both the performance and the sync safety of this code. Our application has a high enough intensity of db interaction that it is actually pretty easy to reproduce LMDB data corruption by powering off a VM with the app running when sync mode is not enabled, and with my testing so far, sync mode with this patch seems to preserve the rock-solid crash-proof design of LMDB. I'd be glad to make any changes or cleanup needed, or to submit the patch differently if you prefer (though using patches from my node repo fork seems to have worked in the past).