kriszyp@gmail.com wrote:
Full_Name: Kristopher William Zyp
Version: LMDB 0.9.23
OS: Windows
URL: https://github.com/kriszyp/node-lmdb/commit/7ff525ae57684a163d32af74a0ab9332...
Submission from: (NULL) (71.199.6.148)
We have seen very poor performance on the sync of commits on large databases in Windows. On databases with 2GB of data, in writemap mode, the sync of even small commits is consistently well over 100ms (without writemap it is faster, but still slow). It is expected that a sync should take some time while waiting for disk confirmation of the writes, but more concerning is that these sync operations (in writemap mode) run at nearly 100% system CPU utilization, so commits whose b-tree updates take less than a millisecond end up dominated by a very large number of system CPU cycles spent in the sync phase.
I think that the fundamental problem is that FlushViewOfFile seems to be an O(n) operation where n is the size of the file (or map). I presume that Windows is scanning the entire map/file for dirty pages to flush, I'm guessing because it doesn't have an internal index of all the dirty pages for every file/map-view in the OS disk cache. Therefore, this turns into an extremely expensive, CPU-bound operation to find the dirty pages of a large file and initiate their writes, which, of course, is contrary to the whole goal of a scalable database system. FlushFileBuffers is also relatively slow. We have attempted to batch as many operations into a single transaction as possible, but this is still a very large overhead.
The Windows docs for FlushFileBuffers themselves warn about the inefficiencies of this function (https://docs.microsoft.com/en-us/windows/desktop/api/fileapi/nf-fileapi-flus...), and they also point to the solution: it is much faster to write out the dirty pages with WriteFile through a sync file handle (FILE_FLAG_WRITE_THROUGH).
The associated patch (https://github.com/kriszyp/node-lmdb/commit/7ff525ae57684a163d32af74a0ab9332...) is my attempt at implementing this solution for Windows. Fortunately, with the design of LMDB, this is relatively straightforward. LMDB already supports writing out dirty pages with WriteFile calls. I added a write-through handle for sending these writes directly to disk. I then made that file handle overlapped/asynchronous, so that all the writes for a commit can be started in overlapped mode and (at least theoretically) transfer in parallel to the drive, and then used GetOverlappedResult to wait for their completion. So basically mdb_page_flush becomes the sync. I extended the writing of dirty pages through WriteFile to writemap mode as well (for writing meta too), so that WriteFile with write-through can be used to flush the data without ever needing to call FlushViewOfFile or FlushFileBuffers. I also implemented support for write gathering in writemap mode, where contiguous file positions imply contiguous memory (by tracking the starting position with wdp and writing contiguous pages in single operations). Sorting of the dirty list is maintained even in writemap mode for this purpose.
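To illustrate the write-gathering idea only, here is a minimal, portable sketch of coalescing a sorted dirty list into contiguous runs, each of which could then be submitted as a single write. It is not the code from the patch: the `dirty_page` and `write_run` types and the `gather_runs` function are hypothetical names made up for this example, and it assumes every dirty-list entry is exactly one page (no overflow pages).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types for illustration; these are not LMDB's internal structures. */
typedef struct {
    size_t pgno;   /* page number (file offset / page size) */
    char  *data;   /* address of the page contents in memory */
} dirty_page;

typedef struct {
    size_t first_pgno; /* first page of the run */
    size_t npages;     /* number of contiguous pages in the run */
    char  *data;       /* start of the run in memory */
} write_run;

/*
 * Coalesce a dirty list, sorted by ascending page number, into runs that
 * are contiguous both in the file and in memory. In writemap mode the
 * second condition follows from the first, since page N lives at
 * map + N * psize. Each run can then be issued as one write.
 * 'out' must have room for n entries. Returns the number of runs.
 */
size_t gather_runs(const dirty_page *dl, size_t n, size_t psize, write_run *out)
{
    size_t nruns = 0;

    for (size_t i = 0; i < n; i++) {
        /* Precondition: the dirty list is sorted with no duplicates. */
        assert(i == 0 || dl[i].pgno > dl[i - 1].pgno);

        if (nruns > 0) {
            write_run *r = &out[nruns - 1];
            /* Extend the current run if this page directly follows it. */
            if (dl[i].pgno == r->first_pgno + r->npages &&
                dl[i].data == r->data + r->npages * psize) {
                r->npages++;
                continue;
            }
        }
        /* Otherwise start a new run at this page. */
        out[nruns].first_pgno = dl[i].pgno;
        out[nruns].npages = 1;
        out[nruns].data = dl[i].data;
        nruns++;
    }
    return nruns;
}
```

Under these assumptions, each resulting run would correspond to one overlapped WriteFile call of `npages * psize` bytes at file offset `first_pgno * psize`, which is why the dirty list has to stay sorted even in writemap mode.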
What is the point of using writemap mode if you still need to use WriteFile on every individual page?
The performance benefits of this patch, in my testing, are considerable. Writing out/syncing transactions is typically over 5x faster in writemap mode, and 2x faster in standard mode. And perhaps more importantly (especially in environments with many threads/processes), the efficiency benefits are even larger, particularly in writemap mode, where there can be a 50-100x reduction in system CPU usage with this patch. This brings Windows performance with sync'ed transactions in LMDB back into the range of "lightning" performance :).
What is the performance difference between your patch using writemap, and just not using writemap in the first place?