Hi there,
While doing some integration testing of LMDB I noticed that there may be an issue with out of order transaction handling.
The scenario is:
Open Read TXN A Open Write TXN X, and change values of the DB Commit X Open Read TXN B Open Write TXN Y, and change values of the DB Commit Y Open Read TXN C
Abort/Close TXN B.
At this point, because of the page touch between A -> B and B -> C, B now believes that the pages of A are the "last time" they are available as they were all subsequently copied for TXN C. The pages of A are then added to the freelists when B closes. When TXN A is read from the data may have been altered due to future writes as LMDB attempts to use previously allocated pages first.
This situation is more likely to arise on large batch writes, but could manifest with smaller series of writes. This would be a silent issue, as the over-written pages may be valid, and could cause data to "silently vanish" inside of the read transaction A leading to unpredictable results.
I hope that this report helps you to diagnose and resolve the issue.
William Brown wrote:
Hi there,
While doing some integration testing of LMDB I noticed that there may be an issue with out of order transaction handling.
May be? Do you have sample code that demonstrates the existence of the issue?
The scenario is:
Open Read TXN A Open Write TXN X, and change values of the DB Commit X Open Read TXN B Open Write TXN Y, and change values of the DB Commit Y Open Read TXN C
Abort/Close TXN B.
At this point, because of the page touch between A -> B and B -> C,
What page touch? Read txns don't do any such thing,
B now believes that the pages of A are the "last time" they are available as they were all subsequently copied for TXN C. The pages of A are then added to the freelists when B closes.
That's not how it works at all. The freelist is only modified when write txns commit. Notthing changes when a read txn ends, either by commit or abort.
When TXN A is read from the data may have been altered due to future writes as LMDB attempts to use previously allocated pages first.
That's not how it works. As long as read txn A is open none of its pages will be reused.
This situation is more likely to arise on large batch writes, but could manifest with smaller series of writes. This would be a silent issue, as the over-written pages may be valid, and could cause data to "silently vanish" inside of the read transaction A leading to unpredictable results.
I hope that this report helps you to diagnose and resolve the issue.
Hi,
Shortly, no issue.
LMDB provides MVCC via COW, therefore any read transaction see the constant snapshot of DB, which is correspond to last committed write transaction at the time when reading was started.
Any further write transactions will create the new snapshots, which will be visible for further reading, but not for read transactions that was started before.
Regards, Leonid.
2018-08-16 13:03 GMT+03:00 William Brown william@blackhats.net.au:
Hi there,
While doing some integration testing of LMDB I noticed that there may be an issue with out of order transaction handling.
The scenario is:
Open Read TXN A Open Write TXN X, and change values of the DB Commit X Open Read TXN B Open Write TXN Y, and change values of the DB Commit Y Open Read TXN C
Abort/Close TXN B.
At this point, because of the page touch between A -> B and B -> C, B now believes that the pages of A are the "last time" they are available as they were all subsequently copied for TXN C. The pages of A are then added to the freelists when B closes. When TXN A is read from the data may have been altered due to future writes as LMDB attempts to use previously allocated pages first.
This situation is more likely to arise on large batch writes, but could manifest with smaller series of writes. This would be a silent issue, as the over-written pages may be valid, and could cause data to "silently vanish" inside of the read transaction A leading to unpredictable results.
I hope that this report helps you to diagnose and resolve the issue.
-- Sincerely,
William
On Fri, 2018-08-17 at 01:34 +0300, Леонид Юрьев wrote:
Hi,
Shortly, no issue.
LMDB provides MVCC via COW, therefore any read transaction see the constant snapshot of DB, which is correspond to last committed write transaction at the time when reading was started.
Any further write transactions will create the new snapshots, which will be visible for further reading, but not for read transactions that was started before.
Hi,
I'm quite aware that it is COW - this issue is specific to COW trees. Transactions must be removed in order, they can not be removed out of order. It is about how pages are reclaimed for the freelist.
If you have tree state A, you have nodes say 1 through 4. We'll say 1 is the root, and 2 - 4 are the leaves.
We take the transaction at A.
Now we conduct a write with TXN X. We'll copy nodes 1 (r) and 4 to 5 (r) and 6 now and make our update. We now commit. At this point this is the "last generation" where 1 and 4 exist.
We begin the transaction at B
Now we condact a write with TXN Y. We'll copy nodes 2, 3 and 5 (r). 5 Being the root from X, 2,3 still in use at A. We now commit.
We begin a read transaction C.
At this point we now close transaction B.
B now clears it's resources - Because B is the last location where 2,3 were "alive", they are at this point freed. however, they are *still required* by TXN A for a valid and complete tree.
At this point, TXN A now consists of nodes 1, 4, and nodes 2,3 are on the freelist where they are possibly able to be reused.
You begin a new write, and commit many values. This will over-write the content of nodes 2,3 (as they are in the free list), causing the reader of TXN A to now percieve invalid data.
I have previously seen this with my in-memory only tree, but I was able to recreate this with LMDB. I'm not able to release the POC at this time. The issue arises because transactions *must* be cleared in order from oldest to newest, as a fundamental part of COW is the shared resources, and understanding when node lifetimes expire.
I hope that this explains the issue more thoroughly.
Regards, Leonid.
2018-08-16 13:03 GMT+03:00 William Brown william@blackhats.net.au:
Hi there,
While doing some integration testing of LMDB I noticed that there may be an issue with out of order transaction handling.
The scenario is:
Open Read TXN A Open Write TXN X, and change values of the DB Commit X Open Read TXN B Open Write TXN Y, and change values of the DB Commit Y Open Read TXN C
Abort/Close TXN B.
At this point, because of the page touch between A -> B and B -> C, B now believes that the pages of A are the "last time" they are available as they were all subsequently copied for TXN C. The pages of A are then added to the freelists when B closes. When TXN A is read from the data may have been altered due to future writes as LMDB attempts to use previously allocated pages first.
This situation is more likely to arise on large batch writes, but could manifest with smaller series of writes. This would be a silent issue, as the over-written pages may be valid, and could cause data to "silently vanish" inside of the read transaction A leading to unpredictable results.
I hope that this report helps you to diagnose and resolve the issue.
William Brown wrote:
On Fri, 2018-08-17 at 01:34 +0300, Леонид Юрьев wrote:
Hi,
Shortly, no issue.
LMDB provides MVCC via COW, therefore any read transaction see the constant snapshot of DB, which is correspond to last committed write transaction at the time when reading was started.
Any further write transactions will create the new snapshots, which will be visible for further reading, but not for read transactions that was started before.
Hi,
I'm quite aware that it is COW - this issue is specific to COW trees. Transactions must be removed in order, they can not be removed out of order. It is about how pages are reclaimed for the freelist
Incorrect.
If you have tree state A, you have nodes say 1 through 4. We'll say 1 is the root, and 2 - 4 are the leaves.
We take the transaction at A.
Now we conduct a write with TXN X. We'll copy nodes 1 (r) and 4 to 5 (r) and 6 now and make our update. We now commit. At this point this is the "last generation" where 1 and 4 exist.
We begin the transaction at B
Now we condact a write with TXN Y. We'll copy nodes 2, 3 and 5 (r). 5 Being the root from X, 2,3 still in use at A. We now commit.
We begin a read transaction C.
At this point we now close transaction B.
B now clears it's resources - Because B is the last location where 2,3 were "alive", they are at this point freed. however, they are *still required* by TXN A for a valid and complete tree.
Closing read txn B has no effect on the tree. Closing read txns has no effect on the on-disk structures.
At this point, TXN A now consists of nodes 1, 4, and nodes 2,3 are on the freelist where they are possibly able to be reused.
False. Closing a read txn doesn't put any nodes on the freelist. Only committing a write txn can do so.
You begin a new write, and commit many values. This will over-write the content of nodes 2,3 (as they are in the free list), causing the reader of TXN A to now percieve invalid data.
False.
I have previously seen this with my in-memory only tree,
but I was able to recreate this with LMDB.
Impossible. There is no code path in LMDB that can do what you have described.
Unless... you have disabled LMDB's lock management and are doing your own thing, and you simply got it wrong.
I'm not able to release the POC at this time. The issue arises because transactions *must* be cleared in order from oldest to newest, as a fundamental part of COW is the shared resources, and understanding when node lifetimes expire.
Your assertions have no bearing on reality, as far as LMDB goes.
I hope that this explains the issue more thoroughly.
Regards, Leonid.
2018-08-16 13:03 GMT+03:00 William Brown william@blackhats.net.au:
Hi there,
While doing some integration testing of LMDB I noticed that there may be an issue with out of order transaction handling.
The scenario is:
Open Read TXN A Open Write TXN X, and change values of the DB Commit X Open Read TXN B Open Write TXN Y, and change values of the DB Commit Y Open Read TXN C
Abort/Close TXN B.
At this point, because of the page touch between A -> B and B -> C, B now believes that the pages of A are the "last time" they are available as they were all subsequently copied for TXN C. The pages of A are then added to the freelists when B closes. When TXN A is read from the data may have been altered due to future writes as LMDB attempts to use previously allocated pages first.
This situation is more likely to arise on large batch writes, but could manifest with smaller series of writes. This would be a silent issue, as the over-written pages may be valid, and could cause data to "silently vanish" inside of the read transaction A leading to unpredictable results.
I hope that this report helps you to diagnose and resolve the issue.
On Fri, 2018-08-17 at 07:06 +0100, Howard Chu wrote:
I'm quite aware that it is COW - this issue is specific to COW trees. Transactions must be removed in order, they can not be removed out of order. It is about how pages are reclaimed for the freelist
Incorrect.
We may have to "agree to disagree" then.
Thanks for your time,
William Brown wrote:
On Fri, 2018-08-17 at 07:06 +0100, Howard Chu wrote:
I'm quite aware that it is COW - this issue is specific to COW trees. Transactions must be removed in order, they can not be removed out of order. It is about how pages are reclaimed for the freelist
Incorrect.
We may have to "agree to disagree" then.
Pure, utter nonsense. You're simply wrong, and at this point you're spewing FUD.
Leonid is one of the sharpest code reviewers around, and also one of the most accurate critics of this code, and even he agrees that there's no issue here.
This function guarantees that the oldest outstanding read txn is protected, regardless of txn commit order:
http://www.openldap.org/devel/gitweb.cgi?p=openldap.git;a=blob;f=libraries/l...
The source code doesn't lie, but for some reason you are. Either that or you're using some broken knockoff that claims to be LMDB but isn't.
openldap-technical@openldap.org