https://bugs.openldap.org/show_bug.cgi?id=10395
Issue ID: 10395 Summary: Support multiple readers on uncommitted changes Product: LMDB Version: unspecified Hardware: All OS: All Status: UNCONFIRMED Keywords: needs_review Severity: normal Priority: --- Component: liblmdb Assignee: bugs@openldap.org Reporter: renault.cle@gmail.com Target Milestone: ---
Created attachment 1086 --> https://bugs.openldap.org/attachment.cgi?id=1086&action=edit A patch to support multiple readers on uncommitted changes
Hello,
The attached patch is not meant to be merged immediately into LMDB. Still, it demonstrates how I added a helpful feature to the key-value store: reading uncommitted changes from multiple transactions. I am conscious that the patch still requires some work and uses non-C99 features, i.e., atomics are C11, which could be a blocker for it to be merged upstream. I would also be delighted to merge these changes under a flag/define to ensure we don't impact other users with non-C99 stuff.
The main feature we need at Meilisearch is to read uncommitted changes from multiple threads, compute parallel post-processing of different data structures [1], and speed up the search requests. We could have done the post-processing in a following transaction by opening multiple read transactions, but this would mean that the post-processed data structure would not include newly inserted or modified document IDs. Both data structures would be desync.
Regarding the design choice, I decided to follow the same design as the nested write transactions: use the parent argument of the mdb_txn_begin [2], and allow the MDB_RDONLY flag, which was disallowed when the parent argument was non-NULL [3]. I find it clear enough that, by calling the mdb_txn_begin function with these arguments, you can call it multiple times (I need to update the doc) to obtain nested read-only transactions from the parent write transaction.
ret = mdb_txn_begin(env, parent_txn, MDB_RDONLY, new_nested_rtxn);
Note that this early proposal lacks security and error handling. The generated transactions are fake-read-only and actually write transactions that share the underlying parent allocations and data structures. This is unsafe and must be changed or reviewed carefully, but most importantly, we need to add read-only capabilities to these transactions to disallow writes. Using a Rust wrapper on top of LMDB, I wrapped the fake read-only transactions into ReadTxn, which disallows any writes at compile time. However, I haven't checked the conflict database creations or openings.
The main issues I encountered were concurrent free of the main shared data structures when the different threads owning the transactions were dropping the transactions simultaneously. So, I decided to implement the equivalent of an ARC to free resources only when the last nested transaction was freed.
I can share numbers about how this feature improves the post-processing by 4x-9x or from 1200s to 120s [1]. You can look at this PR, which I would be happy to merge once an improved version of this patch lands on LMDB upstream.
I would be very happy if you could guide me a bit on how I could improve this patch to make it mergeable into LMDB. We want to contribute useful features like this to LMDB and not keep a deviant fork. LMDB works great; we are happy about it, and its performance is predictable.
Have a lovely week, kero
[1]: https://github.com/meilisearch/meilisearch/pull/5307 [2]: https://github.com/LMDB/lmdb/blob/14d6629bc8a9fe40d8a6bee1bf71c45afe7576b6/l... [3]: https://github.com/LMDB/lmdb/blob/14d6629bc8a9fe40d8a6bee1bf71c45afe7576b6/l...
https://bugs.openldap.org/show_bug.cgi?id=10395
--- Comment #1 from Howard Chu hyc@openldap.org --- I don't see intentionally breaking ACID as being a good idea. Reading uncommitted changes breaks Atomicity and Isolation at least, and may break Consistency in various situations.
The only thing that ought to be able to see uncommitted changes is the active write transaction. If anything else does, then you don't have a transactional DB engine any more. In ACID operation "newly inserted or modified document IDs" *don't exist* until their transaction commits.
Using multiple reader txns is the correct approach. When the writer commits, you can simply txn_reset() all the readers to get them all on the latest data.
https://bugs.openldap.org/show_bug.cgi?id=10395
--- Comment #2 from kero renault.cle@gmail.com ---
I don't see intentionally breaking ACID as being a good idea. Reading uncommitted changes breaks Atomicity and Isolation at least, and may break Consistency in various situations.
I may have been unclear about the changes I made, or I may have made an error in the code. The nested transactions do the same as the write transaction: they _read their own writes_. It is already supported. The only difference is that you can _read your writes_ in parallel, now. Additionally, the parent write transaction is unusable and only an abort can be performed, and only after the nested transactions are aborted. If it's not the case, I need to implement that as we mentioned in this Mastodon thread [1].
The goal of this feature is not related to parallel writes, but rather to allow parallel reads on uncommitted changes *from within the write transaction*. Nothing can go outside. To write into the write transaction, you have to abort the many nested read-only transactions and use the original parent transaction. You must store data somewhere (RAM, disk) before being able to write to LMDB. I don't want to break ACID, here.
The only thing that ought to be able to see uncommitted changes is the active write transaction. If anything else does, then you don't have a transactional DB engine any more. In ACID operation "newly inserted or modified document IDs" *don't exist* until their transaction commits.
Currently, IIRC, nested transactions can see uncommitted changes as well, as can the outer transaction. This PR proposes to do the same, but from multiple read-only transactions that are hooked to the parent transaction. You can only have one write-active nested transaction, and in the case of this very feature, there is not even the possibility to write while the nested read-only transactions are active.
Using multiple reader txns is the correct approach. When the writer commits, you can simply txn_reset() all the readers to get them all on the latest data.
That's the issue we have with Meilisearch: we would like to be able to read uncommitted changes to avoid desync issues, and in parallel, to do it quickly and scale with the number of CPUs.
[1]: https://fosstodon.org/@kero/113465979956049984
https://bugs.openldap.org/show_bug.cgi?id=10395
--- Comment #3 from kero renault.cle@gmail.com --- Currently, IIRC, nested transactions can see uncommitted changes as well, as can the *nested* transactions [..]
https://bugs.openldap.org/show_bug.cgi?id=10395
Quanah Gibson-Mount quanah@openldap.org changed:
What |Removed |Added ---------------------------------------------------------------------------- Keywords|needs_review | Target Milestone|--- |1.0.0
https://bugs.openldap.org/show_bug.cgi?id=10395
kero renault.cle@gmail.com changed:
What |Removed |Added ---------------------------------------------------------------------------- Attachment #1086|0 |1 is obsolete| |
--- Comment #4 from kero renault.cle@gmail.com --- Created attachment 1088 --> https://bugs.openldap.org/attachment.cgi?id=1088&action=edit Support nested RDONLY transactions
Hey Howard,
I took more time to work on this patch to introduce the possibility of creating multiple nested RDONLY transactions from a parent one. This was already the case with my previous patch, but now: - The nested transactions are RDONLY and disallow writes. - The parent transaction prevents creating nested transactions, unless they are all RDONLY. - Nested RDONLY transactions now avoid leaking me_pgstate/me_pghead by not allocating or copying them.
However, I still encounter a SIGBUS error when using a cursor on a dbi created by the current parent transaction but not yet committed. It seems that the issue originates from mc->mc_pg and MDB_page.mp2_flags not being properly initialized.
I wrote a reproducer at the end of this very message. Would you have a moment to review and help me identify the source of the issue?
Have a nice end of the week, kero
#include <stdio.h> #include <stdlib.h> #include "lmdb.h"
#define E(expr) CHECK((rc = (expr)) == MDB_SUCCESS, #expr) #define RES(err, expr) ((rc = expr) == (err) || (CHECK(!rc, #expr), 0)) #define CHECK(test, msg) ((test) ? (void)0 : ((void)fprintf(stderr, \ "%s:%d: %s: %s\n", __FILE__, __LINE__, msg, mdb_strerror(rc)), abort()))
int main(int argc,char * argv[]) { int rc; MDB_env *env; MDB_dbi dbi; MDB_val key, data; MDB_txn *txn, *nested; MDB_cursor *cursor;
E(mdb_env_create(&env)); E(mdb_env_set_maxdbs(env, 1)); E(mdb_env_set_maxreaders(env, 1)); E(mdb_env_set_mapsize(env, 10485760)); E(mdb_env_open(env, "./testdb", 0, 0664));
E(mdb_txn_begin(env, NULL, 0, &txn)); E(mdb_dbi_open(txn, "brand-new-db", MDB_CREATE, &dbi));
key.mv_data = "test-key\0"; key.mv_size = sizeof("test-key\0");
data.mv_data = "test-data\0"; data.mv_size = sizeof("test-data\0");
RES(MDB_KEYEXIST, mdb_put(txn, dbi, &key, &data, MDB_NOOVERWRITE));
E(mdb_txn_begin(env, txn, MDB_RDONLY, &nested)); E(mdb_cursor_open(nested, dbi, &cursor));
/* Crashes with a SIGBUS, a wrongly initialized mc->mc_pg */ RES(MDB_NOTFOUND, mdb_cursor_get(cursor, &key, &data, MDB_NEXT));
return 0; }
https://bugs.openldap.org/show_bug.cgi?id=10395
kero renault.cle@gmail.com changed:
What |Removed |Added ---------------------------------------------------------------------------- Attachment #1088|0 |1 is obsolete| |
--- Comment #5 from kero renault.cle@gmail.com --- Created attachment 1089 --> https://bugs.openldap.org/attachment.cgi?id=1089&action=edit Attachments Support nested RDONLY transactions
Hello again,
I identified the SIGBUS error when reading a freshly created DBI from the parent transaction. The bug was located in the mdb_page_get function, specifically in the condition that specified whether we had to read the dirty list. I propose a new patch that addresses memory leaks and SIGBUS issues.
Have a nice end of the week, kero
https://bugs.openldap.org/show_bug.cgi?id=10395
kero renault.cle@gmail.com changed:
What |Removed |Added ---------------------------------------------------------------------------- Attachment #1089|0 |1 is obsolete| |
--- Comment #6 from kero renault.cle@gmail.com --- Created attachment 1096 --> https://bugs.openldap.org/attachment.cgi?id=1096&action=edit Last patch to allow multiple nested read transactions from a write transaction
Hey Howard,
I have finalized the last version of this patch. Would you mind taking a look at it, please? We are using this patched version of LMDB in production with large datasets, and it performs exceptionally well. We are capable of multi-threading the reads we perform on uncommitted changes.
If you are not eager to merge this patch, could you please explain why? I am not quite sure I understand: We are not introducing concurrent modifications of the write transaction's content, nor allowing external read transactions to read uncommitted content. Only the program owning the write transaction can perform reads in parallel, which it was already allowed to do, but not in parallel before.
Side note: Would you mind updating the GitHub mirror of LMDB, please? There is a missing commit[1].
[1]: https://github.com/LMDB/lmdb/commit/1db2a61ade6ae087dc175b1f44303548a4aa7926
Have a nice end of the week, Thank you, kero
https://bugs.openldap.org/show_bug.cgi?id=10395
--- Comment #7 from Howard Chu hyc@openldap.org --- We can't use atomics in this code.
Haven't had time to review the rest.
https://bugs.openldap.org/show_bug.cgi?id=10395
kero renault.cle@gmail.com changed:
What |Removed |Added ---------------------------------------------------------------------------- Attachment #1096|0 |1 is obsolete| |
--- Comment #8 from kero renault.cle@gmail.com --- Created attachment 1098 --> https://bugs.openldap.org/attachment.cgi?id=1098&action=edit Allow multiple nested read transactions from a write transaction using a mutex
Hello Howard, and thank you for the early review,
I found the time to replace the C11 atomics with C99 mutexes. However, I am not entirely sure if this is the correct way to use mutexes in this code, or if we should instead use macros or defines to ensure compatibility across all platforms. Anyway, this code is relatively small, and it would be easy to patch. It compiles and works on all platforms, as confirmed by the numerous CI tests we have on Heed [1] and Meilisearch [2].
Thank you very much, Have a nice day
[1]: https://github.com/meilisearch/heed/pull/307/commits/464a9c33f76e0679115eef7... [2]: https://github.com/meilisearch/meilisearch/pull/6042/checks