Sorry to keep pestering, but just pinging about this patch again, as I still think this fix could benefit Windows users. And at this point, I think I can say we have tested it pretty well, running on our servers for almost a year :).
Thanks,
Kris
On Wed, Sep 18, 2019 at 12:56 PM Kris Zyp <kriszyp@gmail.com> wrote:
Checking on this again, is this still a possibility for merging into LMDB? This fix is still working great (improved performance) on our systems.
Thanks,
Kris
On Mon, Jun 17, 2019 at 1:04 PM Kris Zyp <kriszyp@gmail.com> wrote:
Is this still being considered/reviewed? Let me know if there are any other changes you would like me to make. This patch has continued to yield significant and reliable performance improvements for us, and it seems like it would be nice for this to be available for other Windows users.
On Fri, May 3, 2019 at 3:52 PM Kris Zyp <kriszyp@gmail.com> wrote:
For the sake of putting this in the email thread (other code discussion in GitHub), here is the latest squashed commit of the proposed patch (with the on-demand, retained overlapped array to reduce re-malloc and opening event handles): https://github.com/kriszyp/node-lmdb/commit/726a9156662c703bf3d453aab75ee222072b990f
Thanks, Kris
From: Kris Zyp <kriszyp@gmail.com>
Sent: April 30, 2019 12:43 PM
To: Howard Chu <hyc@symas.com>; openldap-its@OpenLDAP.org
Subject: RE: (ITS#9017) Improving performance of commit sync in Windows
> What is the point of using writemap mode if you still need to use WriteFile
> on every individual page?
As I understood from the documentation, and have observed, using writemap mode is faster (and uses less temporary memory) because it doesn't require mallocs to allocate pages (docs: "This is faster and uses fewer mallocs"). To be clear though, LMDB is so incredibly fast and efficient that in sync mode it takes enormous transactions before the time spent allocating and creating the dirty pages with the updated b-tree is anywhere even remotely close to the time it takes to wait for disk flushing, even with an SSD. But the more pertinent question is efficiency, measured in CPU cycles rather than time spent (efficiency is more important than just time spent). When I ran my tests this morning of 100 (sync) transactions with 100 puts per transaction, times varied quite a bit, but running with writemap enabled typically averaged about 500ms of CPU, and with writemap disabled it typically averaged around 600ms. Not a huge difference, but still definitely worthwhile, I think.
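(As an aside on method: kernel+user CPU time, as opposed to wall time, can be read on Windows with GetProcessTimes. A minimal sketch of that, illustrative rather than the actual test harness:)

    #include <windows.h>

    /* Combined kernel+user CPU time for this process, in milliseconds;
       sample before and after the transaction loop and subtract.
       FILETIME counts 100ns ticks, hence the /10000. */
    static double process_cpu_ms(void)
    {
        FILETIME creation, exited, kernel, user;
        ULARGE_INTEGER k, u;
        GetProcessTimes(GetCurrentProcess(), &creation, &exited, &kernel, &user);
        k.LowPart = kernel.dwLowDateTime; k.HighPart = kernel.dwHighDateTime;
        u.LowPart = user.dwLowDateTime;   u.HighPart = user.dwHighDateTime;
        return (k.QuadPart + u.QuadPart) / 10000.0;
    }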
Caveat emptor: measuring LMDB performance with sync interactions on Windows is one of the most frustratingly erratic things to measure. It is sunny outside right now; times could be different when it starts raining later. But this is what I saw this morning...
> What is the performance difference between your patch using writemap,
> and just not using writemap in the first place?
Running 1000 sync transactions on a 3GB db with a single put per transaction, without writemap and without the patch, took about 60 seconds. And it took about 1 second with the patch with writemap mode enabled! (There is no significant difference in sync times with writemap enabled or disabled once the patch is applied.) So the difference was huge in my test. And not only that: without the patch, the CPU usage was actually _higher_ during those 60 seconds (close to 100% of a core) than during the one-second execution with the patch (close to 50%). There are certainly tests I have run where the differences are not as large (doing small commits on large dbs accentuates the differences), but the patch always seems to win. It could also be that my particular configuration causes bigger differences (an SSD drive, and maybe a more fragmented file?).
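For reproducibility, the shape of that benchmark is roughly the following (a sketch against the standard LMDB C API; the path, map size, and value size are just examples, and error checking is omitted):

    #include <stdio.h>
    #include <string.h>
    #include "lmdb.h"

    int main(void)
    {
        MDB_env *env; MDB_txn *txn; MDB_dbi dbi;
        MDB_val key, val;
        char kbuf[32], vbuf[64] = {0};

        mdb_env_create(&env);
        mdb_env_set_mapsize(env, 4ULL << 30);               /* room for a ~3GB db */
        mdb_env_open(env, "./testdb", MDB_WRITEMAP, 0664);  /* drop MDB_WRITEMAP to compare */

        for (int i = 0; i < 1000; i++) {                    /* 1000 sync commits */
            mdb_txn_begin(env, NULL, 0, &txn);
            if (i == 0) mdb_dbi_open(txn, NULL, 0, &dbi);
            sprintf(kbuf, "key-%d", i);
            key.mv_size = strlen(kbuf); key.mv_data = kbuf;
            val.mv_size = sizeof vbuf;  val.mv_data = vbuf;
            mdb_put(txn, dbi, &key, &val, 0);               /* the single put */
            mdb_txn_commit(txn);                            /* the sync cost lands here */
        }
        mdb_env_close(env);
        return 0;
    }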
Anyway, I added error handling for the malloc, and fixed/changed the other things you suggested. I'd be happy to make any other changes you want.
The updated patch is here:
https://github.com/kriszyp/node-lmdb/commit/25366dea9453749cf6637f43ec17b9b62094acde
> OVERLAPPED* ov = malloc((pagecount - keep) * sizeof(OVERLAPPED));

> Probably this ought to just be pre-allocated based on the maximum
> number of dirty pages a txn allows.
I wasn't sure I understood this comment. Are you suggesting we malloc(MDB_IDL_UM_MAX * sizeof(OVERLAPPED)) for each environment, and retain it for the life of the environment? I think that is 4MB, if my math is right, which seems like a lot of memory to keep allocated (we usually have a lot of open environments). If the goal is to reduce the number of mallocs, how about we retain the OVERLAPPED array, and only free and re-malloc if the previous allocation wasn't large enough? Then there isn't unnecessary allocation, and we only malloc when there is a bigger transaction than any previous one. I put this together in a separate commit, as I wasn't sure if this is what you wanted (can squash if you prefer): https://github.com/kriszyp/node-lmdb/commit/2fe68fb5269c843e2e789746a17a4b2adefaac40
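In code, the idea in that commit is roughly the following (a sketch with made-up names, not the literal patch; in the real commit the pointer and capacity would live on the environment):

    #include <windows.h>
    #include <stdlib.h>

    /* Retained across transactions (imagined as fields on the env). */
    static OVERLAPPED *ov_array;  /* grows on demand, never shrinks */
    static int ov_capacity;       /* entries currently allocated */

    static OVERLAPPED *get_ov_array(int needed)
    {
        if (needed > ov_capacity) {
            OVERLAPPED *p = malloc(needed * sizeof(OVERLAPPED));
            if (!p)
                return NULL;      /* caller turns this into ENOMEM */
            free(ov_array);
            ov_array = p;
            ov_capacity = needed;
        }
        return ov_array;          /* usually just reused, no malloc */
    }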
Thank you for the review!
Thanks, Kris
From: Howard Chu <hyc@symas.com>
Sent: April 30, 2019 7:12 AM
To: kriszyp@gmail.com; openldap-its@OpenLDAP.org
Subject: Re: (ITS#9017) Improving performance of commit sync in Windows
kriszyp@gmail.com wrote:
> Full_Name: Kristopher William Zyp
> Version: LMDB 0.9.23
> OS: Windows
> URL: https://github.com/kriszyp/node-lmdb/commit/7ff525ae57684a163d32af74a0ab9332b7fc4ce9
> Submission from: (NULL) (71.199.6.148)
> We have seen very poor performance on the sync of commits on large
> databases in Windows. On databases with 2GB of data, in writemap mode,
> the sync of even small commits is consistently well over 100ms (without
> writemap it is faster, but still slow). It is expected that a sync
> should take some time while waiting for disk confirmation of the
> writes, but more concerning is that these sync operations (in writemap
> mode) are instead dominated by nearly 100% system CPU utilization, so
> operations that require sub-millisecond b-tree updates are then
> dominated by very large amounts of system CPU cycles during the sync
> phase.
> I think that the fundamental problem is that FlushViewOfFile seems to
> be an O(n) operation, where n is the size of the file (or map). I
> presume that Windows is scanning the entire map/file for dirty pages to
> flush, I'm guessing because it doesn't have an internal index of all
> the dirty pages for every file/map-view in the OS disk cache.
> Therefore, this turns into an extremely expensive, CPU-bound operation
> to find the dirty pages for a (large) file and initiate their writes,
> which, of course, is contrary to the whole goal of a scalable database
> system. FlushFileBuffers is also relatively slow. We have attempted to
> batch as many operations into a single transaction as possible, but
> this is still a very large overhead.
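For context, the writemap-mode sync path being described is essentially this (illustrative shape, not the exact LMDB code):

    #include <windows.h>

    /* Flush a writemap-mode commit the conventional way. */
    static int sync_mapped(void *map, size_t mapsize, HANDLE fd)
    {
        /* Scans the whole view for dirty pages: the O(n), CPU-bound part. */
        if (!FlushViewOfFile(map, mapsize))
            return -1;
        /* Waits for the device to acknowledge the writes: the slow part. */
        if (!FlushFileBuffers(fd))
            return -1;
        return 0;
    }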
> The Windows docs for FlushFileBuffers itself warn about the
> inefficiencies of this function
> (https://docs.microsoft.com/en-us/windows/desktop/api/fileapi/nf-fileapi-flushfilebuffers),
> which also points to the solution: it is much faster to write out the
> dirty pages with WriteFile through a sync file handle
> (FILE_FLAG_WRITE_THROUGH).
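Concretely, such a handle can be opened like this (the flags are real Win32 flags; the file name and sharing mode are only examples, and FILE_FLAG_OVERLAPPED is included because the patch described below also issues these writes asynchronously):

    #include <windows.h>

    /* A write-through handle of the kind described; OVERLAPPED is added
       so the writes can also be issued asynchronously (see below). */
    HANDLE h = CreateFileA("data.mdb",
            GENERIC_WRITE,
            FILE_SHARE_READ | FILE_SHARE_WRITE,
            NULL,                 /* default security */
            OPEN_EXISTING,
            FILE_FLAG_WRITE_THROUGH | FILE_FLAG_OVERLAPPED,
            NULL);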
> The associated patch
> (https://github.com/kriszyp/node-lmdb/commit/7ff525ae57684a163d32af74a0ab9332b7fc4ce9)
> is my attempt at implementing this solution for Windows. Fortunately,
> with the design of LMDB, this is relatively straightforward. LMDB
> already supports writing out dirty pages with WriteFile calls. I added
> a write-through handle for sending these writes directly to disk. I
> then made that file handle overlapped/asynchronous, so all the writes
> for a commit can be started in overlapped mode and (at least
> theoretically) transfer to the drive in parallel, and then used
> GetOverlappedResult to wait for their completion. So basically
> mdb_page_flush becomes the sync. I extended the writing of dirty pages
> through WriteFile to writemap mode as well (for writing meta too), so
> that WriteFile with write-through can be used to flush the data without
> ever needing to call FlushViewOfFile or FlushFileBuffers. I also
> implemented support for write gathering in writemap mode, where
> contiguous file positions imply contiguous memory (by tracking the
> starting position with wdp and writing contiguous pages in single
> operations). Sorting of the dirty list is maintained even in writemap
> mode for this purpose.
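Put together, the flush described above looks roughly like this (hypothetical names and struct; the real patch also coalesces contiguous pages into single writes and retains the event handles):

    #include <windows.h>
    #include <string.h>

    typedef struct {               /* hypothetical: one queued page write */
        ULONGLONG offset;
        void     *data;
        DWORD     size;
    } PageWrite;

    static int flush_pages(HANDLE h, PageWrite *pw, int n, OVERLAPPED *ov)
    {
        DWORD written;
        int i, queued = 0, rc = 0;

        for (i = 0; i < n; i++) {
            memset(&ov[i], 0, sizeof ov[i]);
            ov[i].Offset     = (DWORD)pw[i].offset;
            ov[i].OffsetHigh = (DWORD)(pw[i].offset >> 32);
            /* One event per request so each completion can be awaited. */
            ov[i].hEvent = CreateEventA(NULL, TRUE, FALSE, NULL);
            /* Queues the write; with an overlapped handle this normally
               returns FALSE with ERROR_IO_PENDING. */
            if (!WriteFile(h, pw[i].data, pw[i].size, NULL, &ov[i])
                && GetLastError() != ERROR_IO_PENDING) {
                CloseHandle(ov[i].hEvent);
                rc = -1;
                break;
            }
            queued++;
        }
        /* bWait=TRUE blocks until each write-through write reaches the
           disk, so finishing this loop is effectively the commit's sync. */
        for (i = 0; i < queued; i++) {
            if (!GetOverlappedResult(h, &ov[i], &written, TRUE))
                rc = -1;
            CloseHandle(ov[i].hEvent);
        }
        return rc;
    }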
What is the point of using writemap mode if you still need to use WriteFile
on every individual page?
> The performance benefits of this patch, in my testing, are
> considerable. Writing out/syncing transactions is typically over 5x
> faster in writemap mode, and 2x faster in standard mode. And perhaps
> more importantly (especially in environments with many
> threads/processes), the efficiency benefits are even larger,
> particularly in writemap mode, where there can be a 50-100x reduction
> in system CPU usage by using this patch. This brings Windows
> performance with sync'ed transactions in LMDB back into the range of
> "lightning" performance :).
What is the performance difference between your patch using writemap, and just
not using writemap in the first place?
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/