4, so for a "one byte" page update there should be 4 page updates, plus
one meta page update; the write amplification should be 5 rather than ~9. Let me know
if I missed something.
There are two Btrees to update - the user data and the freeDB data.
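As a rough check of the figures quoted in this thread, write amplification can be taken as device writes per second divided by application operations per second. The sketch below assumes each iostat write request corresponds to one page write, which iostat does not guarantee (requests may be merged); the function name is illustrative only.

```python
def write_amplification(device_writes_per_sec, app_ops_per_sec):
    """Ratio of writes seen by the device to operations issued by the application."""
    return device_writes_per_sec / app_ops_per_sec

# Figures reported in this thread (iostat w/s vs. benchmark ops/sec).
overwrite_wa = write_amplification(4666, 521)
fillrandsync_wa = write_amplification(8093, 1020)

print(f"overwrite:    {overwrite_wa:.2f}")     # 8.96
print(f"fillrandsync: {fillrandsync_wa:.2f}")  # 7.93
```

Both ratios are well above the 5 expected from a single 4-level tree plus one meta page, which is consistent with a second btree also being updated on every commit.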
Please read
rather than spending time asking
questions that are already fully documented.
There is no such thing. Or, it is entirely variable.
Cheers,
xinxin
-----Original Message-----
From: Леонид Юрьев [mailto:leo@yuriev.ru]
Sent: Tuesday, May 05, 2015 6:16 PM
To: Shu, Xinxin
Cc: openldap-technical(a)openldap.org
Subject: Re: large write amplification
Hm, ANY change needs a btree update.
Suppose there is an item with key=K, data=A.
Then overwrite A with B, so now key=K, data=B.
This is simply a "one byte" change, but several db-pages need to be cloned and
updated:
- the page that contains data=B and the records around it.
- the btree page that holds a pointer/reference to the page containing data=B and
the records around it.
- all pages on the leaf-to-root path in the btree above that page.
- ...
- the new root pages of mainDB and freeDB.
- the pointer to the "new root" in the meta-page, that lay in the house that Jack built ;)
So, by design LMDB is optimized for high-load reading, but not for writes.
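
The path cloning described above can be sketched with a toy copy-on-write tree. This is not LMDB code: the `Page` class, the fanout of 4, and the tree depths (4 for user data, 2 for the free list) are all assumptions chosen for illustration. It only shows that overwriting one value clones every page from leaf to root, while untouched subtrees stay shared with the old snapshot.

```python
from dataclasses import dataclass

FANOUT = 4  # assumed toy branching factor; a real 4K page holds far more entries

@dataclass(frozen=True)
class Page:
    children: tuple = ()  # child pages (branch page)
    values: tuple = ()    # stored data (leaf page)

def build(depth, fill="A"):
    """Build a full tree of the given depth with every slot set to `fill`."""
    if depth == 1:
        return Page(values=(fill,) * FANOUT)
    return Page(children=tuple(build(depth - 1, fill) for _ in range(FANOUT)))

def lookup(page, depth, key):
    """Walk from the root to the leaf slot addressed by `key`."""
    while depth > 1:
        idx, key = divmod(key, FANOUT ** (depth - 1))
        page = page.children[idx]
        depth -= 1
    return page.values[key]

def overwrite(page, depth, key, value):
    """Return (new_page, pages_cloned); the old tree is left untouched."""
    if depth == 1:
        vals = list(page.values)
        vals[key] = value
        return Page(values=tuple(vals)), 1
    idx, rest = divmod(key, FANOUT ** (depth - 1))
    child, cloned = overwrite(page.children[idx], depth - 1, rest, value)
    kids = list(page.children)
    kids[idx] = child  # only the child on the path is replaced
    return Page(children=tuple(kids)), cloned + 1

USER_DEPTH, FREE_DEPTH = 4, 2  # assumed depths, not measured from a real db
user_root = build(USER_DEPTH)
free_root = build(FREE_DEPTH, fill=0)

new_user_root, user_cloned = overwrite(user_root, USER_DEPTH, 37, "B")
new_free_root, free_cloned = overwrite(free_root, FREE_DEPTH, 5, 1)

total = user_cloned + free_cloned + 1  # +1 for the meta page pointing at the new roots
print(user_cloned, free_cloned, total)  # 4 2 7
```

In this model the old root still sees data=A for the key while the new root sees data=B, and the page count per commit grows with the depth of both trees rather than with the size of the change.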
Leonid.
2015-05-05 10:26 GMT+03:00 Shu, Xinxin <xinxin.shu(a)intel.com>:
> Hi leonid,
>
> Thanks for your reply. I observed another scenario: I also tested
> "overwrite mode". I slightly modified the source code to change the default
> behavior (set dbflags_ = SYNC, so data is flushed to disk once a transaction
> is committed) and collected iostat data. The overwrite rate is ~521
> ops/sec, but iostat shows that w/s is ~4666, so the write amplification
> is ~9. To my understanding, overwriting an existing value does not adjust
> the btree, so why is the write amplification so large? Could you help explain?
> Thanks
>
> Cheers,
> xinxin
>
> -----Original Message-----
> From: Леонид Юрьев [mailto:leo@yuriev.ru]
> Sent: Monday, May 04, 2015 6:59 PM
> To: Shu, Xinxin
> Cc: openldap-technical(a)openldap.org
> Subject: Re: large write amplification
>
> Hi, Xinxin.
>
> I will try to answer briefly, without details:
>
> - So that readers are never blocked by a writer, LMDB provides a snapshot of the data,
> indexes and directory for each completed transaction.
>
> - Most db-pages (those not changed by a particular
> transaction) are "shared" between such snapshots. But any change to the data
> itself, and its reflection in the btree indexes (including a particular table, free-db, main-db and
> so forth), requires new pages to be used and written to disk.
>
> - In a large db a small "one-byte" change may make a lot
> of db-pages (usually 4K each) "dirty". For example, one add/del/mod operation in an LDAP db
> a few GB in size requires about 50-100 page-level IOPS.
>
> Leonid.
>
> P.S.
> For high-load use cases I made a few changes in our fork of OpenLDAP/LMDB.
> One of these features we call "LIFO reclaiming".
> It gives us a 10-50 times performance boost, especially by exploiting the
> write-back cache of the storage subsystem.
> We now use it in our production (telco) environment.
> But currently it is not safe for all cases, see
> https://github.com/ReOpen/ReOpenLDAP/issues/2 and
> https://github.com/ReOpen/ReOpenLDAP/issues/1.
>
> 2015-05-04 5:31 GMT+03:00 Shu, Xinxin <xinxin.shu(a)intel.com>:
>> Hi list,
>>
>> Recently I ran micro tests on LMDB on a DC3700 (200GB) using the bench
>> code at https://github.com/hyc/leveldb/tree/benches . I tested fillrandsync mode
>> and collected iostat data, and found that the write amplification is large. For the fillrandsync case:
>>
>> IOPS : 1020 ops/sec
>>
>> Iostat data shows that w/s on that SSD is 8093, avgqu-sz is ~1, and
>> await time is about 0.16 ms, so the write amplification is ~8, which
>> seems large to me. Can someone help explain why the write amplification is
>> so large? Thanks
>>
>>
>> Cheers,
>> xinxin
>>
>>
--
-- Howard Chu
CTO, Symas Corp.