We have an environment with no flags that contains a database with no flags. The database is append-only: no deletions or modifications. It is written using a single RW transaction, in the absence of any RO transactions. We observe that when we commit and recreate the RW transaction every 2000 insertion ops, the data.mdb file size on disk is 2x larger than when committing every 64000 insertion ops. The mdb_copy -c utility shrinks the large 2k-ops-commit file to almost the same file size as the 64k-commit one. mdb_stat -e on the data.mdb shows that when we have more commits and a bigger file, the number of pages used is larger by the same proportion.
In production we will have several large DBs (>1TB) on an NVMe card and we do not have the 2x space for periodic mdb_copy -c compactifications (and we cannot stop the writing process). We also need to commit every 2000 write ops, because there will be short-lived RO transactions that need to see the DB updates every 2000 writes.
1. Why is the file size on disk dependent on the commit frequency? (I suppose because with less frequent commits it can allocate data between pages more efficiently)?
2. What can we do to reduce data.mdb, if we must commit frequently? Can we use any environment, transaction or db flags, or anything else?
We are on Linux 5.4.0 / ext4 fs. The DB that grows 2x faster with more frequent commits has a bytearr key -> u32 val structure (the bytearray key is between 31 and 36 bytes). Another DB that has the reverse u32 key -> bytearr structure only grows 10% larger in the more frequent commits regime.
Alec Matusis wrote:
We have an environment with no flags that contains a database with no flags. The database is append-only: no deletions or modifications. It is written using a single RW transaction, in the absence of any RO transactions. We observe that when we commit and recreate the RW transaction every 2000 insertion ops, the data.mdb file size on disk is 2x larger than when committing every 64000 insertion ops. The mdb_copy -c utility shrinks the large 2k-ops-commit file to almost the same file size as the 64k-commit one. mdb_stat -e on the data.mdb shows that when we have more commits and a bigger file, the number of pages used is larger by the same proportion.
In production we will have several large DBs (>1TB) on an NVMe card and we do not have the 2x space for periodic mdb_copy -c compactifications (and we cannot stop the writing process). We also need to commit every 2000 write ops, because there will be short-lived RO transactions that need to see the DB updates every 2000 writes.
1. Why is the file size on disk dependent on the commit frequency? (I suppose because with less frequent commits it can allocate data between pages more efficiently)?
LMDB does copy-on-write. Every time you start a new transaction, any page you modify must be copied first. If you do many operations in the same transaction, the modified pages can be reused as-is, instead of needing to be copied again.
2. What can we do to reduce data.mdb, if we must commit frequently? Can we use any environment, transaction or db flags, or anything else?
If it is truly, strictly append-only use, which means every newly inserted key is greater than all existing keys, then you should use the MDB_APPEND flag. That will cut growth by half.
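A minimal sketch of what that could look like (the zero-padded counter key and COMMIT_EVERY are only illustrative stand-ins for your 31-36 byte keys and 2000-op batches; error handling is reduced to a macro and leaks the txn on failure):

#include <stdint.h>
#include <stdio.h>
#include "lmdb.h"

/* Sketch only: strictly ordered inserts with MDB_APPEND, committing every
 * COMMIT_EVERY puts so short-lived RO transactions see fresh data. */
#define COMMIT_EVERY 2000
#define CHECK(call) do { int rc_ = (call); if (rc_ != MDB_SUCCESS) return rc_; } while (0)

int append_records(MDB_env *env, MDB_dbi dbi, uint64_t nrecords)
{
    MDB_txn *txn;
    MDB_val key, val;
    char keybuf[36];
    uint32_t v;
    uint64_t i;

    CHECK(mdb_txn_begin(env, NULL, 0, &txn));
    for (i = 0; i < nrecords; i++) {
        /* Zero-padded decimal sorts the same way memcmp does, so each new
         * key is greater than all existing keys, as MDB_APPEND requires. */
        snprintf(keybuf, sizeof(keybuf), "%032llu", (unsigned long long)i);
        v = (uint32_t)i;
        key.mv_data = keybuf; key.mv_size = 32;
        val.mv_data = &v;     val.mv_size = sizeof(v);

        /* MDB_APPEND fills each leaf page completely instead of splitting it half-full. */
        CHECK(mdb_put(txn, dbi, &key, &val, MDB_APPEND));

        if ((i + 1) % COMMIT_EVERY == 0) {
            CHECK(mdb_txn_commit(txn));
            CHECK(mdb_txn_begin(env, NULL, 0, &txn)); /* recreate the RW txn */
        }
    }
    return mdb_txn_commit(txn);
}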
We are on Linux 5.4.0 / ext4 fs. The DB that grows 2x faster with more frequent commits has a bytearr key -> u32 val structure (the bytearray key is between 31 and 36 bytes). Another DB that has the reverse u32 key -> bytearr structure only grows 10% larger in the more frequent commits regime.
A follow-up question:
We have an environment with one database with MDB_INTEGERKEY | MDB_DUPSORT flags. The keys and values are never deleted or modified (append-only). There is a single RW transaction on the environment that writes the data. It also needs to read some prior data in order to write new data. It is committed and then recreated after every 1000 to 3000 mdb_cursor_put() ops. All keys and values in the mdb_cursor_put ops are about the same size.
I am printing and plotting mdb_stat after every commit. I have observed a sudden Free Pages jump, as shown in the top graph here: https://imgur.com/a/RZi1Vjn The ratio of free/used pages has a similar jump in the second graph. The number of used pages (computed as MDB_envinfo.me_last_pgno - free_pages, or equivalently "Number of pages used" - "Free pages" from the mdb_stat -fe utility) just grows monotonically in the third graph, as expected.
We commit about the same data each time (same size * same number of put ops).
What might cause this jump?
We have other databases, but they do not have this behavior: the ratio of free/used pages is more or less constant after each commit, oscillating around 0.001.
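For reference, a rough sketch of how the free/used page numbers can be gathered in-process, mirroring what mdb_stat -fe does (it walks LMDB's internal freelist database, DBI 0, whose records each start with the count of page numbers that follow; minimal error handling):

#include <stddef.h>
#include "lmdb.h"

/* Sketch only: compute the "Free pages" figure the way mdb_stat -fe does,
 * then derive used pages from MDB_envinfo.me_last_pgno. */
int count_pages(MDB_env *env, size_t *free_out, size_t *used_out)
{
    MDB_txn *txn;
    MDB_cursor *cur;
    MDB_val key, data;
    MDB_envinfo mei;
    size_t freepages = 0;
    int rc;

    rc = mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);
    if (rc != MDB_SUCCESS)
        return rc;
    rc = mdb_cursor_open(txn, 0, &cur);      /* DBI 0 = internal freelist DB */
    if (rc != MDB_SUCCESS) {
        mdb_txn_abort(txn);
        return rc;
    }
    while ((rc = mdb_cursor_get(cur, &key, &data, MDB_NEXT)) == MDB_SUCCESS) {
        size_t *entry = data.mv_data;
        freepages += entry[0];               /* first element = page count */
    }
    mdb_cursor_close(cur);
    mdb_env_info(env, &mei);
    mdb_txn_abort(txn);

    *free_out = freepages;
    /* mdb_stat -e reports me_last_pgno + 1 as "Number of pages used" */
    *used_out = (size_t)mei.me_last_pgno + 1 - freepages;
    return rc == MDB_NOTFOUND ? MDB_SUCCESS : rc;
}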
Alec Matusis wrote:
A follow-up question:
We have an environment with one database with MDB_INTEGERKEY | MDB_DUPSORT flags. The keys and values are never deleted or modified (append-only).
That is not what append-only means in LMDB. I already defined it in my previous email.
There is a single RW transaction on the environment that writes the data. It also needs to read some prior data in order to write new data. It is committed and then recreated after every 1000 to 3000 mdb_cursor_put() ops. All keys and values in the mdb_cursor_put ops are about the same size.
I am printing and plotting mdb_stat after every commit. I have observed a sudden Free Pages jump, as shown in the top graph here: https://imgur.com/a/RZi1Vjn The ratio of free/used pages has a similar jump in the second graph. The number of used pages (computed as MDB_envinfo.me_last_pgno - free_pages, or equivalently "Number of pages used" - "Free pages" from the mdb_stat -fe utility) just grows monotonically in the third graph, as expected.
We commit about the same data each time (same size * same number of put ops).
What might cause this jump?
Tree rebalancing: when a page gets full, the split causes a ripple effect back up to the root of the tree.
In a pure append-only workload (using the MDB_APPEND write flag) this doesn't happen. In a workload where keys are inserted in random order, it does.
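If the duplicate values under each key do arrive in sorted order, the cursor put can take MDB_APPENDDUP (and MDB_APPEND as well, when the keys themselves are increasing). A sketch for that MDB_INTEGERKEY | MDB_DUPSORT case, with illustrative names and minimal error handling, not a drop-in for your writer:

#include <stddef.h>
#include <stdint.h>
#include "lmdb.h"

/* Sketch only: putting into an MDB_INTEGERKEY | MDB_DUPSORT database with
 * MDB_APPENDDUP. This assumes the duplicate values for a given key arrive
 * already sorted; if they do not, the flag should not be used. */
int put_sorted_dup(MDB_txn *txn, MDB_dbi dbi, uint32_t k,
                   const void *v, size_t vlen)
{
    MDB_cursor *cur;
    MDB_val key, data;
    int rc;

    rc = mdb_cursor_open(txn, dbi, &cur);
    if (rc != MDB_SUCCESS)
        return rc;

    key.mv_data = &k;          key.mv_size = sizeof(k);  /* MDB_INTEGERKEY: native unsigned int */
    data.mv_data = (void *)v;  data.mv_size = vlen;

    /* Sorted duplicates are appended to the end of the key's dup list,
     * filling pages completely instead of splitting them half-full. */
    rc = mdb_cursor_put(cur, &key, &data, MDB_APPENDDUP);

    mdb_cursor_close(cur);
    return rc;
}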
We have other databases, but they do not have this behavior: the ratio of free/used pages is more or less constant after each commit, oscillating around 0.001.
--
  -- Howard Chu
  CTO, Symas Corp.               http://www.symas.com
  Director, Highland Sun         http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP      http://www.openldap.org/project/