> What is the point of using writemap mode if you still need to use WriteFile
> on every individual page?
As I understood from the documentation, and have observed, using writemap mode is faster (and uses less temporary memory) because it doesn't require mallocs to allocate pages (docs: "This is faster and uses fewer mallocs"). To be clear, though, LMDB is so fast and efficient that in sync mode it takes enormous transactions before the time spent allocating and building the dirty pages for the updated b-tree comes anywhere close to the time spent waiting for the disk flush, even on an SSD. The more pertinent question is efficiency: CPU cycles consumed, rather than just wall-clock time. When I ran my tests this morning of 100 (sync) transactions with 100 puts per transaction, times varied quite a bit, but running with writemap enabled typically averaged about 500ms of CPU, versus about 600ms with writemap disabled. Not a huge difference, but still definitely worthwhile, I think.
Caveat emptor: measuring LMDB performance with sync transactions on Windows is one of the most frustratingly erratic things to measure. It is sunny outside right now; times could be different when it starts raining later, but this is what I saw this morning...
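For reference, the timing loops in both tests (the 100x100 run above and the 1000x1 run below) have roughly this shape. This is only a sketch against the standard LMDB C API: the path, key/value contents, and map size are made up, and error checking and the actual timing code are omitted.

#include <stdio.h>
#include "lmdb.h"

/* Sketch: `txns` sync transactions with `puts_per_txn` puts each;
 * env_flags is 0 or MDB_WRITEMAP. Error checking omitted for brevity. */
static void run_bench(unsigned txns, unsigned puts_per_txn, unsigned env_flags)
{
    MDB_env *env;
    MDB_dbi dbi;
    MDB_txn *txn;
    MDB_val key, data;
    char kbuf[32], vbuf[64];

    mdb_env_create(&env);
    mdb_env_set_mapsize(env, (size_t)4096 << 20);    /* 4GB map */
    mdb_env_open(env, "./testdb", env_flags, 0664);  /* assumes ./testdb exists */

    for (unsigned t = 0; t < txns; t++) {
        mdb_txn_begin(env, NULL, 0, &txn);
        if (t == 0)
            mdb_dbi_open(txn, NULL, 0, &dbi);        /* main (unnamed) DB */
        for (unsigned p = 0; p < puts_per_txn; p++) {
            key.mv_size  = sprintf(kbuf, "key-%u-%u", t, p);
            key.mv_data  = kbuf;
            data.mv_size = sprintf(vbuf, "value-%u-%u", t, p);
            data.mv_data = vbuf;
            mdb_put(txn, dbi, &key, &data, 0);
        }
        mdb_txn_commit(txn);  /* sync commit -- the flush cost being measured */
    }
    mdb_env_close(env);
}

The comparison above is essentially run_bench(100, 100, 0) versus run_bench(100, 100, MDB_WRITEMAP), timed externally.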
> What is the performance difference between your patch using writemap, and
> just not using writemap in the first place?
Running 1000 sync transactions on a 3GB db with a single put per transaction, without writemap and without the patch, took about 60 seconds. With the patch and writemap mode enabled it took about 1 second! (There is no significant difference in sync times with writemap enabled or disabled once the patch is applied.) So the difference was huge in my test. And not only that: without the patch, CPU usage was actually _higher_ during those 60 seconds (close to 100% of a core) than during the one-second run with the patch (close to 50%). There are certainly tests I have run where the differences are not as large (doing small commits on large dbs accentuates them), but the patch always seems to win. It could also be that my particular configuration produces bigger differences (an SSD drive, and maybe a more fragmented file?).
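For anyone following the discussion, the write pattern the patch relies on looks roughly like the sketch below. It is written against the plain Win32 API and is not the literal mdb_page_flush code from the commit; the function name and parameters (pages, offsets, pagesize) are made up for illustration.

#include <windows.h>
#include <stdlib.h>

/* Sketch of the overlapped, write-through flush: open one handle with
 * FILE_FLAG_WRITE_THROUGH | FILE_FLAG_OVERLAPPED, start all page writes,
 * then wait for each to complete. */
static BOOL flush_pages_write_through(const char *path, char **pages,
                                      ULONGLONG *offsets, DWORD npages,
                                      DWORD pagesize)
{
    BOOL ok = TRUE;
    HANDLE fd = CreateFileA(path, GENERIC_WRITE,
                            FILE_SHARE_READ | FILE_SHARE_WRITE, NULL,
                            OPEN_EXISTING,
                            FILE_FLAG_WRITE_THROUGH | FILE_FLAG_OVERLAPPED, NULL);
    if (fd == INVALID_HANDLE_VALUE)
        return FALSE;

    OVERLAPPED *ov = calloc(npages, sizeof(OVERLAPPED));
    if (!ov) { CloseHandle(fd); return FALSE; }

    DWORD started = 0;
    for (DWORD i = 0; i < npages; i++) {
        ov[i].Offset     = (DWORD)(offsets[i] & 0xffffffffu);
        ov[i].OffsetHigh = (DWORD)(offsets[i] >> 32);
        ov[i].hEvent     = CreateEvent(NULL, TRUE, FALSE, NULL);
        if (!ov[i].hEvent) { ok = FALSE; break; }
        /* Queue the write; ERROR_IO_PENDING just means it is in flight. */
        if (!WriteFile(fd, pages[i], pagesize, NULL, &ov[i]) &&
            GetLastError() != ERROR_IO_PENDING) { ok = FALSE; break; }
        started++;
    }

    /* Wait for every started write to reach the disk; with write-through
     * there is no FlushViewOfFile/FlushFileBuffers needed afterwards. */
    for (DWORD i = 0; i < started; i++) {
        DWORD written;
        if (!GetOverlappedResult(fd, &ov[i], &written, TRUE))
            ok = FALSE;
    }

    for (DWORD i = 0; i < npages; i++)
        if (ov[i].hEvent) CloseHandle(ov[i].hEvent);
    free(ov);
    CloseHandle(fd);
    return ok;
}

One event per OVERLAPPED keeps the per-write completion waits unambiguous when several writes are outstanding on the same handle.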
Anyway, I added error handling for the malloc, and fixed/changed the other things you suggested. I'd be happy to make any other changes you want. The updated patch is here: https://github.com/kriszyp/node-lmdb/commit/25366dea9453749cf6637f43ec17b9b62094acde
    OVERLAPPED* ov = malloc((pagecount - keep) * sizeof(OVERLAPPED));
> Probably this ought to just be pre-allocated based on the maximum number
> of dirty pages a txn allows.
I wasn't sure I understood this comment. Are you suggesting we malloc(MDB_IDL_UM_MAX * sizeof(OVERLAPPED)) for each environment, and retain it for the life of the environment? I think that is 4MB, if my math is right, which seems like a lot of memory to keep allocated (we usually have a lot of open environments). If the goal is to reduce the number of mallocs, how about we retain the OVERLAPPED array, and only free and re-malloc if the previous allocation wasn't large enough? Then there is no unnecessary allocation, and we only malloc when there is a bigger transaction than any previous one. I put this together in a separate commit, as I wasn't sure if this is what you wanted (I can squash if you prefer): https://github.com/kriszyp/node-lmdb/commit/2fe68fb5269c843e2e789746a17a4b2adefaac40
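To make the idea concrete, here is a minimal sketch of the grow-only reuse I have in mind. The struct and field names (ov_cache, ov_buf, ov_cap) are hypothetical; the real change would keep these fields on the LMDB environment struct and free the buffer when the environment is closed.

#include <windows.h>   /* OVERLAPPED */
#include <stdlib.h>

/* Hypothetical holder for the retained buffer. */
typedef struct ov_cache {
    OVERLAPPED *ov_buf;   /* retained across transactions */
    size_t      ov_cap;   /* slots currently allocated */
} ov_cache;

/* Return a buffer with at least `needed` OVERLAPPED slots, reallocating
 * only when this transaction is bigger than any previous one. */
static OVERLAPPED *get_ov_array(ov_cache *c, size_t needed)
{
    if (needed > c->ov_cap) {
        OVERLAPPED *nbuf = malloc(needed * sizeof(OVERLAPPED));
        if (!nbuf)
            return NULL;      /* caller maps this to ENOMEM */
        free(c->ov_buf);      /* old contents aren't needed */
        c->ov_buf = nbuf;
        c->ov_cap = needed;
    }
    return c->ov_buf;
}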
Thank you for the review!
Thanks, Kris
From: Howard Chu <hyc@symas.com>
Sent: April 30, 2019 7:12 AM
To: kriszyp@gmail.com; openldap-its@OpenLDAP.org
Subject: Re: (ITS#9017) Improving performance of commit sync in Windows
kriszyp@gmail.com wrote:
> Full_Name: Kristopher William Zyp
> Version: LMDB 0.9.23
> OS: Windows
> URL: https://github.com/kriszyp/node-lmdb/commit/7ff525ae57684a163d32af74a0ab9332b7fc4ce9
> Submission from: (NULL) (71.199.6.148)
>
> We have seen very poor performance on the sync of commits on large
> databases in Windows. On databases with 2GB of data, in writemap mode,
> the sync of even small commits is consistently well over 100ms (without
> writemap it is faster, but still slow). It is expected that a sync
> should take some time while waiting for disk confirmation of the
> writes, but more concerning is that these sync operations (in writemap
> mode) are instead dominated by nearly 100% system CPU utilization, so
> operations that require sub-millisecond b-tree updates are then
> dominated by very large amounts of system CPU cycles during the sync
> phase.
>
> I think that the fundamental problem is that FlushViewOfFile seems to
> be an O(n) operation, where n is the size of the file (or map). I
> presume that Windows is scanning the entire map/file for dirty pages to
> flush, I'm guessing because it doesn't have an internal index of all
> the dirty pages for every file/map-view in the OS disk cache.
> Therefore, this turns into an extremely expensive, CPU-bound operation
> to find the dirty pages of a (large) file and initiate their writes,
> which, of course, is contrary to the whole goal of a scalable database
> system. FlushFileBuffers is relatively slow as well. We have attempted
> to batch as many operations into a single transaction as possible, but
> this is still a very large overhead.
>
> The Windows docs for FlushFileBuffers itself warn about the
> inefficiencies of this function
> (https://docs.microsoft.com/en-us/windows/desktop/api/fileapi/nf-fileapi-flushfilebuffers),
> which also points to the solution: it is much faster to write out the
> dirty pages with WriteFile through a sync file handle
> (FILE_FLAG_WRITE_THROUGH).
>
> The associated patch
> (https://github.com/kriszyp/node-lmdb/commit/7ff525ae57684a163d32af74a0ab9332b7fc4ce9)
> is my attempt at implementing this solution for Windows. Fortunately,
> with the design of LMDB, this is relatively straightforward: LMDB
> already supports writing out dirty pages with WriteFile calls. I added
> a write-through handle for sending these writes directly to disk. I
> then made that file handle overlapped/asynchronous, so all the writes
> for a commit can be started in overlapped mode and (at least
> theoretically) transfer to the drive in parallel, and then used
> GetOverlappedResult to wait for their completion. So basically
> mdb_page_flush becomes the sync. I extended the writing of dirty pages
> through WriteFile to writemap mode as well (for writing meta too), so
> that WriteFile with write-through can be used to flush the data without
> ever needing to call FlushViewOfFile or FlushFileBuffers. I also
> implemented support for write gathering in writemap mode, where
> contiguous file positions imply contiguous memory (by tracking the
> starting position with wdp and writing contiguous pages in single
> operations). Sorting of the dirty list is maintained even in writemap
> mode for this purpose.
What is the point of using writemap mode if you still need to use WriteFile on every individual page?
> The performance benefits of this patch, in my testing, are
> considerable. Writing out/syncing transactions is typically over 5x
> faster in writemap mode, and 2x faster in standard mode. Perhaps more
> importantly (especially in environments with many threads/processes),
> the efficiency benefits are even larger, particularly in writemap mode,
> where there can be a 50-100x reduction in system CPU usage with this
> patch. This brings Windows performance with sync'ed transactions in
> LMDB back into the range of "lightning" performance :).
What is the performance difference between your patch using writemap, and
just not using writemap in the first place?
--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/