For the sake of putting this in the email thread (other code discussion in GitHub), here is the latest squashed commit of the proposed patch (with the on-demand, retained overlapped array to reduce re-malloc and opening event handles): https://github.com/kriszyp/node-lmdb/commit/726a9156662c703bf3d453aab75ee222072b990f
Thanks,
Kris
From: Kris Zyp
Sent: April 30, 2019 12:43 PM
To: Howard Chu; openldap-its@OpenLDAP.org
Subject: RE: (ITS#9017) Improving performance of commit sync in Windows
> What is the point of using writemap mode if you still need to use WriteFile
> on every individual page?
As I understood from the documentation, and have observed, using writemap mode is faster (and uses less temporary memory) because it doesn't require mallocs to allocate pages (docs: "This is faster and uses fewer mallocs"). To be clear though, LMDB is so incredibly fast and efficient that, in sync mode, it takes enormous transactions before the time spent allocating and creating the dirty pages with the updated b-tree is anywhere even remotely close to the time it takes to wait for disk flushing, even with an SSD. But the more pertinent question is efficiency, measured in CPU cycles rather than elapsed time. When I ran my tests this morning of 100 (sync) transactions with 100 puts per transaction, times varied quite a bit, but running with writemap enabled typically averaged about 500ms of CPU, and with writemap disabled it typically averaged around 600ms. Not a huge difference, but still definitely worthwhile, I think.
Caveat emptor: Measuring LMDB performance with sync interactions on Windows is one of the most frustratingly erratic things to measure. It is sunny outside right now; times could be different when it starts raining later. But this is what I saw this morning...
> What is the performance difference between your patch using writemap, and just
> not using writemap in the first place?
Running 1000 sync transactions on a 3GB db with a single put per transaction, without writemap mode, without the patch, took about 60 seconds. And it took about 1 second with the patch with writemap mode enabled! (There is no significant difference in sync times with writemap enabled or disabled with the patch.) So the difference was huge in my test. And not only that: without the patch, the CPU usage was actually _higher_ during that 60 seconds (close to 100% of a core) than during the one-second execution with the patch (close to 50%). Anyway, there are certainly tests I have run where the differences are not as large (doing small commits on large dbs accentuates the differences), but the patch always seems to win. It could also be that my particular configuration causes bigger differences (on an SSD drive, and maybe a more fragmented file?).
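For concreteness, the test loop is essentially the following (a minimal sketch against the standard LMDB C API; the map size, key naming, and error handling are simplified, and the ~3GB on-disk state is assumed to already exist):

    #include <stdio.h>
    #include <string.h>
    #include "lmdb.h"

    int main(void)
    {
        MDB_env *env;
        MDB_dbi dbi;
        MDB_txn *txn;
        MDB_val key, val;
        char kbuf[32];

        mdb_env_create(&env);
        mdb_env_set_mapsize(env, (size_t)4 << 30);  /* room for a ~3GB db */
        /* Toggle MDB_WRITEMAP (vs. 0) here to compare the two modes. */
        mdb_env_open(env, "./testdb", MDB_WRITEMAP, 0664);

        for (int i = 0; i < 1000; i++) {
            mdb_txn_begin(env, NULL, 0, &txn);
            if (i == 0)
                mdb_dbi_open(txn, NULL, 0, &dbi);  /* dbi persists after commit */
            snprintf(kbuf, sizeof kbuf, "key%d", i);
            key.mv_size = strlen(kbuf);
            key.mv_data = kbuf;
            val.mv_size = 8;
            val.mv_data = "somedata";
            mdb_put(txn, dbi, &key, &val, 0);
            mdb_txn_commit(txn);                   /* synchronous flush happens here */
        }
        mdb_env_close(env);
        return 0;
    }

With the default flags (no MDB_NOSYNC or MDB_NOMETASYNC), each mdb_txn_commit waits for durability, which is exactly the path the patch speeds up.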
Anyway, I added error handling for the malloc, and fixed/changed the other things you suggested. I'd be happy to make any other changes you want. The updated patch is here: https://github.com/kriszyp/node-lmdb/commit/25366dea9453749cf6637f43ec17b9b62094acde
> OVERLAPPED* ov = malloc((pagecount - keep) * sizeof(OVERLAPPED));
> Probably this ought to just be pre-allocated based on the maximum number
> of dirty pages a txn allows.
I wasn't sure I understood this comment. Are you suggesting we malloc(MDB_IDL_UM_MAX * sizeof(OVERLAPPED)) for each environment, and retain it for the life of the environment? I think that is 4MB, if my math is right, which seems like a lot of memory to keep allocated (we usually have a lot of open environments). If the goal is to reduce the number of mallocs, how about we retain the OVERLAPPED array, and only free and re-malloc if the previous allocation wasn't large enough? Then there isn't unnecessary allocation, and we only malloc when there is a bigger transaction than any previous. I put this together in a separate commit, as I wasn't sure if this is what you wanted (I can squash if you prefer): https://github.com/kriszyp/node-lmdb/commit/2fe68fb5269c843e2e789746a17a4b2adefaac40
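A minimal sketch of that grow-only strategy (the names here are illustrative; the commit itself hangs the retained array off the environment struct):

    #include <stdlib.h>
    #include <windows.h>

    static OVERLAPPED *ov_array;   /* retained for the life of the env */
    static size_t      ov_count;   /* current capacity, in entries */

    /* Return scratch OVERLAPPED space for `needed` writes, reallocating
     * only when this transaction is bigger than any seen before. */
    static OVERLAPPED *get_ov_array(size_t needed)
    {
        if (needed > ov_count) {
            OVERLAPPED *ov = malloc(needed * sizeof(OVERLAPPED));
            if (!ov)
                return NULL;       /* caller turns this into ENOMEM */
            free(ov_array);        /* old contents are per-commit scratch */
            ov_array = ov;
            ov_count = needed;
        }
        return ov_array;
    }

The allocation count therefore converges to zero in steady state: once the largest transaction has been seen, every later commit reuses the same block.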
Thank you for the review!
Thanks,
Kris
From: Howard Chu
Sent: April 30, 2019 7:12 AM
To: kriszyp@gmail.com; openldap-its@OpenLDAP.org
Subject: Re: (ITS#9017) Improving performance of commit sync in Windows
kriszyp@gmail.com wrote:
> Full_Name: Kristopher William Zyp
> Version: LMDB 0.9.23
> OS: Windows
> URL: https://github.com/kriszyp/node-lmdb/commit/7ff525ae57684a163d32af74a0ab9332b7fc4ce9
> Submission from: (NULL) (71.199.6.148)
>
> We have seen very poor performance on the sync of commits on large databases in
> Windows. On databases with 2GB of data, in writemap mode, the sync of even small
> commits is consistently well over 100ms (without writemap it is faster, but
> still slow). It is expected that a sync should take some time while waiting for
> disk confirmation of the writes, but more concerning is that these sync
> operations (in writemap mode) are instead dominated by nearly 100% system CPU
> utilization, so operations that require sub-millisecond b-tree updates are then
> dominated by very large amounts of system CPU cycles during the sync phase.
>
> I think that the fundamental problem is that FlushViewOfFile seems to be an O(n)
> operation, where n is the size of the file (or map). I presume that Windows is
> scanning the entire map/file for dirty pages to flush, I'm guessing because it
> doesn't have an internal index of all the dirty pages for every file/map-view in
> the OS disk cache. Therefore, this turns into an extremely expensive, CPU-bound
> operation to find the dirty pages for a (large) file and initiate their writes,
> which, of course, is contrary to the whole goal of a scalable database system.
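One quick way to sanity-check the O(n) claim above is a micro-benchmark along these lines (hypothetical test.map file, 64-bit build assumed; cleanup and error checking omitted):

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        const size_t MAP_SIZE = (size_t)2 << 30;   /* 2GB, as in the report */
        HANDLE f = CreateFileA("test.map", GENERIC_READ | GENERIC_WRITE, 0,
                               NULL, OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
        HANDLE m = CreateFileMappingA(f, NULL, PAGE_READWRITE,
                                      (DWORD)((unsigned long long)MAP_SIZE >> 32),
                                      (DWORD)MAP_SIZE, NULL);
        char *p = (char *)MapViewOfFile(m, FILE_MAP_WRITE, 0, 0, 0);
        LARGE_INTEGER freq, t0, t1;
        QueryPerformanceFrequency(&freq);

        p[0] = 1;                                  /* dirty a single page */
        QueryPerformanceCounter(&t0);
        FlushViewOfFile(p, 0);                     /* 0 = flush whole view */
        QueryPerformanceCounter(&t1);
        printf("whole-view flush: %.3f ms\n",
               (t1.QuadPart - t0.QuadPart) * 1000.0 / freq.QuadPart);

        p[0] = 2;                                  /* dirty the page again */
        QueryPerformanceCounter(&t0);
        FlushViewOfFile(p, 4096);                  /* flush just one page */
        QueryPerformanceCounter(&t1);
        printf("one-page flush:   %.3f ms\n",
               (t1.QuadPart - t0.QuadPart) * 1000.0 / freq.QuadPart);
        return 0;
    }

If the hypothesis holds, the first number grows with MAP_SIZE even though only one page is dirty, while the second stays flat.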
> And FlushFileBuffers is relatively slow as well. We have attempted to batch
> as many operations into a single transaction as possible, but this is still
> a very large overhead.
>
> The Windows documentation for FlushFileBuffers itself warns about the
> inefficiencies of this function
> (https://docs.microsoft.com/en-us/windows/desktop/api/fileapi/nf-fileapi-flushfilebuffers).
> Which also points to the solution: it is much faster to write out the dirty
> pages with WriteFile through a sync file handle (FILE_FLAG_WRITE_THROUGH).
> The associated patch
> (https://github.com/kriszyp/node-lmdb/commit/7ff525ae57684a163d32af74a0ab9332b7fc4ce9)
> is my attempt at implementing this solution for Windows. Fortunately, with the
> design of LMDB, this is relatively straightforward. LMDB already supports
> writing out dirty pages with WriteFile calls. I added a write-through handle for
> sending these writes directly to disk. I then made that file handle
> overlapped/asynchronous, so all the writes for a commit can be started in
> overlap mode and (at least theoretically) transfer in parallel to the drive,
> and then GetOverlappedResult is used to wait for their completion. So basically
> mdb_page_flush becomes the sync. I extended the writing of dirty pages through
> WriteFile to writemap mode as well (for writing meta too), so that WriteFile
> with write-through can be used to flush the data without ever needing to call
> FlushViewOfFile or FlushFileBuffers. I also implemented support for write
> gathering in writemap mode, where contiguous file positions imply contiguous
> memory (by tracking the starting position with wdp and writing contiguous pages
> in single operations). Sorting of the dirty list is maintained even in writemap
> mode for this purpose.
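A rough sketch of the flush path described above; the DirtyPage type, flush_dirty name, fixed 4KB page size, and per-call event handling are illustrative simplifications of what the patch does inside mdb_page_flush:

    #include <windows.h>
    #include <string.h>

    /* Hypothetical stand-ins for LMDB's dirty list and page size. */
    typedef struct DirtyPage { size_t pgno; char *data; } DirtyPage;
    #define PAGE_SIZE 4096

    /* The write-through handle is opened once per environment, e.g.:
     *   HANDLE fd = CreateFileA(path, GENERIC_WRITE,
     *       FILE_SHARE_READ | FILE_SHARE_WRITE, NULL, OPEN_EXISTING,
     *       FILE_FLAG_WRITE_THROUGH | FILE_FLAG_OVERLAPPED, NULL);
     */

    /* Start one overlapped write per run of contiguous pages (in writemap
     * mode, contiguous file offsets imply contiguous memory, so a run can
     * go out in a single WriteFile), then wait on every write; the waits
     * are what makes the flush double as the commit's sync. */
    static BOOL flush_dirty(HANDLE fd, DirtyPage *list, int n, OVERLAPPED *ov)
    {
        int i = 0, nov = 0;
        BOOL ok = TRUE;

        while (i < n) {
            int run = 1;                       /* extend the contiguous run */
            while (i + run < n && list[i + run].pgno == list[i].pgno + run)
                run++;
            ULONGLONG off = (ULONGLONG)list[i].pgno * PAGE_SIZE;
            memset(&ov[nov], 0, sizeof(OVERLAPPED));
            ov[nov].Offset     = (DWORD)off;
            ov[nov].OffsetHigh = (DWORD)(off >> 32);
            /* The final commit retains these events; created per call here. */
            ov[nov].hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);
            if (!WriteFile(fd, list[i].data, (DWORD)(run * PAGE_SIZE), NULL,
                           &ov[nov]) && GetLastError() != ERROR_IO_PENDING) {
                CloseHandle(ov[nov].hEvent);   /* hard failure: no I/O queued */
                ov[nov].hEvent = NULL;
                ok = FALSE;
            }
            nov++;
            i += run;
        }
        for (i = 0; i < nov; i++) {            /* this wait is the "sync" */
            DWORD written;
            if (!ov[i].hEvent)
                continue;
            if (!GetOverlappedResult(fd, &ov[i], &written, TRUE))
                ok = FALSE;
            CloseHandle(ov[i].hEvent);
        }
        return ok;
    }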
What is the point of using writemap mode if you still need to use WriteFile on every individual page?
> The performance benefits of this patch, in my testing, are considerable.
> Writing out/syncing transactions is typically over 5x faster in writemap mode,
> and 2x faster in standard mode. And perhaps more importantly (especially in
> environments with many threads/processes), the efficiency benefits are even
> larger, particularly in writemap mode, where there can be a 50-100x reduction
> in the system CPU usage by using this patch. This brings Windows performance
> with sync'ed transactions in LMDB back into the range of "lightning"
> performance :).
What is the performance difference between your patch using writemap, and just
not using writemap in the first place?
--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/