large write amplification


Shu, Xinxin

4 May 2015

4:31 a.m.

Hi list,

Recently I ran micro-benchmarks on LMDB on a DC3700 (200GB) SSD, using the bench code at https://github.com/hyc/leveldb/tree/benches. I tested fillrandsync mode and collected iostat data, and found that the write amplification is large. For the fillrandsync case:

IOPS : 1020 ops/sec

iostat shows that w/s on that SSD is 8093, avgqu-sz is ~1, and await is about 0.16 ms, so the write amplification is ~8 (8093 / 1020), which seems large to me. Can someone help explain why the write amplification is so large? Thanks.

Cheers, xinxin
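The ~8 figure above is the ratio of device write requests to application operations. A minimal sketch of that arithmetic follows, assuming (as the post does) that iostat's w/s can be compared directly with the benchmark's ops/sec. The function name is hypothetical, and the result is a request-count ratio, not a byte ratio.

```python
def write_amplification(device_writes_per_sec: float, app_ops_per_sec: float) -> float:
    """Device write requests issued per application-level operation."""
    if app_ops_per_sec <= 0:
        raise ValueError("app_ops_per_sec must be positive")
    return device_writes_per_sec / app_ops_per_sec


# Figures reported in the post: 8093 w/s from iostat, 1020 ops/sec from the benchmark.
fillrandsync = write_amplification(8093, 1020)
print(f"fillrandsync: {fillrandsync:.2f} device writes per operation")
```

If the block layer merges or splits requests, w/s no longer counts pages one-for-one, so the ratio is only an approximation of pages written per operation.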


Леонид Юрьев

4 May 2015

12:58 p.m.

Hi, Xinxin.

I will try to answer briefly, without details:

- So that readers are never blocked by a writer, LMDB provides a snapshot of the data, indexes and directory for each completed transaction.

- Most db-pages (those not changed by a particular transaction) are "shared" between such snapshots. But any change to the data itself, and its reflection in the btree indexes (including the particular table, free-db, main-db and so forth), requires new pages to be used and written to disk.

- In a large db, a small "one-byte" change may dirty a lot of db-pages (usually 4K each). For example, one add/del/mod operation in an LDAP db a few GB in size requires about 50-100 page-level IOPS.

Leonid.

P.S. For high-load use cases I made a few changes in our fork of OpenLDAP/LMDB. One of these features we call "LIFO reclaiming". It gives us a 10-50x performance boost, especially by exploiting the write-back cache of the storage subsystem. We now use it in our production (telco) environment. But it is currently not safe in all cases; see https://github.com/ReOpen/ReOpenLDAP/issues/2 and https://github.com/ReOpen/ReOpenLDAP/issues/1.


Shu, Xinxin

5 May 2015

9:26 a.m.

Hi Leonid,

Thanks for your reply. I observed another scenario: I also tested "overwrite" mode. I slightly modified the source code to change the default behavior (set dbflags_ = SYNC, so data is flushed to disk once a transaction is committed) and again collected iostat data. The overwrite rate is ~521 ops/sec, but iostat shows that w/s is ~4666, so the write amplification is ~9. To my understanding, overwriting an existing value does not adjust the btree, so why is the write amplification so large? Could you help explain? Thanks.

Cheers, xinxin


Howard Chu

9:42 a.m.

Леонид Юрьев wrote:

...

Hi, Xinxin.

I will try to answer briefly, without details:

So that readers are never blocked by a writer, LMDB provides a snapshot of the data, indexes and directory for each completed transaction.

Most db-pages (those not changed by a particular transaction) are "shared" between such snapshots. But any change to the data itself, and its reflection in the btree indexes (including the particular table, free-db, main-db and so forth), requires new pages to be used and written to disk.

In a large db, a small "one-byte" change may dirty a lot of db-pages (usually 4K each). For example, one add/del/mod operation in an LDAP db a few GB in size requires about 50-100 page-level IOPS.

Correct, up to this last point. The degree of amplification is greatly overstated.

See http://symas.com/mdb/ondisk/

The number of pages touched depends on the height of the B+tree, which is O(logN) of the number of records. Even a tree of multiple terabytes is unlikely to reach beyond a height of 5.

The minimum write amplification may be on the order of 8 pages for a trivial write. But that also tends to be the maximum write amplification.
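The logarithmic growth in height can be illustrated with a rough estimate. This is a sketch only: the function name and the fanout values (32 records per leaf page, 150 keys per branch page) are assumptions chosen for illustration, not figures from LMDB or from this thread.

```python
def estimated_height(num_records: int, records_per_leaf: int, keys_per_branch: int) -> int:
    """Estimate B+tree height: one leaf level plus as many branch levels as needed."""
    # Number of leaf pages, using integer ceiling division.
    pages = (num_records + records_per_leaf - 1) // records_per_leaf
    height = 1
    # Add branch levels until a single root page remains.
    while pages > 1:
        pages = (pages + keys_per_branch - 1) // keys_per_branch
        height += 1
    return height


# Assumed fanouts; real values depend on key size, value size and page size.
for n in (1_000_000, 1_000_000_000):
    print(n, estimated_height(n, records_per_leaf=32, keys_per_branch=150))
```

Under these assumed fanouts, growing the record count a thousandfold adds a single level, which is why the number of pages touched per write stays small and nearly constant.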

...

Leonid.

P.S. For high-load use cases I made a few changes in our fork of OpenLDAP/LMDB. One of these features we call "LIFO reclaiming". It gives us a 10-50x performance boost, especially by exploiting the write-back cache of the storage subsystem. We now use it in our production (telco) environment. But it is currently not safe in all cases; see https://github.com/ReOpen/ReOpenLDAP/issues/2 and https://github.com/ReOpen/ReOpenLDAP/issues/1.

The LIFO approach inherently breaks the safety guarantees of the LMDB concurrency design, as I have already explained.

--
Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/

Howard Chu

9:58 a.m.

Shu, Xinxin wrote:

...

Hi Leonid,

Thanks for your reply. I observed another scenario: I also tested "overwrite" mode. I slightly modified the source code to change the default behavior (set dbflags_ = SYNC, so data is flushed to disk once a transaction is committed) and again collected iostat data. The overwrite rate is ~521 ops/sec, but iostat shows that w/s is ~4666, so the write amplification is ~9. To my understanding, overwriting an existing value does not adjust the btree, so why is the write amplification so large? Could you help explain? Thanks.

Your understanding is wrong. LMDB is clearly documented as a copy-on-write design. It does not modify values in-place.


Леонид Юрьев

12:16 p.m.

Hm, ANY change needs a btree update.

Suppose we have an item key=K, data=A. Then we overwrite A with B, so now key=K, data=B.

This is a simple "one-byte" change, but a few db-pages need to be cloned and updated:

- the page which contains data=B and the records around it.

- the page in the btree that holds a pointer/reference to that page.

- all pages on the leaf-to-root path in the btree above it.

- ...

- new root pages of mainDB and freeDB.

- the pointer to the "new root" in the meta-page, that lay in the house that Jack built ;)

So, by design LMDB is optimized for high-load reading, but not for writes.

Leonid.
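The cloning of the leaf-to-root path can be sketched with a toy immutable tree. This is not LMDB's code: the `Leaf` and `Branch` classes and the `get` and `put` helpers are hypothetical, the tree handles overwrites only (no page splits), and the freeDB and meta-page updates are left out.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union


@dataclass(frozen=True)
class Leaf:
    items: Tuple[Tuple[str, str], ...]  # sorted (key, value) pairs


@dataclass(frozen=True)
class Branch:
    seps: Tuple[str, ...]               # seps[i] is the smallest key under children[i + 1]
    children: Tuple["Node", ...]


Node = Union[Leaf, Branch]


def get(node: Node, key: str) -> Optional[str]:
    """Walk from a root down to the leaf and look the key up."""
    while isinstance(node, Branch):
        node = node.children[bisect_right(node.seps, key)]
    return dict(node.items).get(key)


def put(node: Node, key: str, value: str, copied: List[Node]) -> Node:
    """Return a new root; every page on the leaf-to-root path is cloned, none is modified."""
    if isinstance(node, Leaf):
        items = dict(node.items)
        items[key] = value
        new: Node = Leaf(tuple(sorted(items.items())))
    else:
        i = bisect_right(node.seps, key)
        child = put(node.children[i], key, value, copied)
        new = Branch(node.seps, node.children[:i] + (child,) + node.children[i + 1:])
    copied.append(new)
    return new


# A tree of height 3 holding eight records.
left = Branch(("c",), (Leaf((("a", "1"), ("b", "2"))), Leaf((("c", "3"), ("d", "4")))))
right = Branch(("g",), (Leaf((("e", "5"), ("f", "6"))), Leaf((("g", "7"), ("h", "8")))))
root = Branch(("e",), (left, right))

copied: List[Node] = []
new_root = put(root, "d", "B", copied)
print(len(copied), get(root, "d"), get(new_root, "d"))
```

The old root still reads the old value, which is the snapshot a concurrent reader would keep, while the new root shares the untouched subtree with the old one. The number of cloned pages equals the tree height.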


Shu, Xinxin

7 May 2015

8:40 a.m.

For the overwrite case, I checked the height of the B-tree of my LMDB database. The height is 4, so for a "one-byte" update there should be 4 page updates plus one meta-page update; the write amplification should be 5 rather than ~9. Let me know if I missed something.

By the way, how can I get the degree of the B-tree of an LMDB database?

Cheers, xinxin


Howard Chu

11:14 a.m.

Shu, Xinxin wrote:

...

For the overwrite case, I checked the height of the B-tree of my LMDB database. The height is 4, so for a "one-byte" update there should be 4 page updates plus one meta-page update; the write amplification should be 5 rather than ~9. Let me know if I missed something.

There are two Btrees to update - the user data and the freeDB data.
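The effect of the second tree on the page count can be sketched as simple arithmetic. The function name is hypothetical. The user-tree height of 4 comes from the question above, but the freeDB height was not reported in the thread, so the values tried below are assumptions.

```python
def pages_per_commit(user_tree_height: int, freedb_tree_height: int, meta_pages: int = 1) -> int:
    """Pages rewritten by one small copy-on-write commit: both tree paths plus the meta page."""
    return user_tree_height + freedb_tree_height + meta_pages


# User-tree height 4 is from the thread; the freeDB heights are assumed.
for freedb_height in (1, 2, 3, 4):
    print(freedb_height, pages_per_commit(4, freedb_height))
```

Counting only the user tree gives 5 pages; adding a freeDB path of a few levels moves the total into the range of the ~8 to ~9 writes per operation measured with iostat.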

Please read http://symas.com/mdb/#pubs rather than spending time asking questions that are already fully documented.

...

by the way, how can I get the degree of B-Tree of lmdb database?

There is no such thing. Or, it is entirely variable.
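Although there is no fixed degree, an average fanout can be derived from page counters. The sketch below assumes a dict keyed by the field names of LMDB's `MDB_stat` structure (`ms_depth`, `ms_branch_pages`, `ms_leaf_pages`, `ms_entries`); the function name and the sample numbers are hypothetical and not taken from this thread.

```python
from typing import Dict, Tuple


def average_fanout(stat: Dict[str, int]) -> Tuple[float, float]:
    """Return (average entries per leaf page, average children per branch page)."""
    leaf_pages = stat["ms_leaf_pages"]
    branch_pages = stat["ms_branch_pages"]
    entries_per_leaf = stat["ms_entries"] / leaf_pages if leaf_pages else 0.0
    # Every branch or leaf page except the root is the child of exactly one branch page.
    children = leaf_pages + branch_pages - 1
    children_per_branch = children / branch_pages if branch_pages else 0.0
    return entries_per_leaf, children_per_branch


# Hypothetical counters for a database of one million small records.
sample = {"ms_depth": 4, "ms_branch_pages": 212, "ms_leaf_pages": 31250, "ms_entries": 1_000_000}
entries_per_leaf, children_per_branch = average_fanout(sample)
print(f"{entries_per_leaf:.1f} entries/leaf, {children_per_branch:.1f} children/branch")
```

These are averages over the whole tree. Individual pages hold different numbers of keys depending on key and value sizes, which is why the degree is variable.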

