https://bugs.openldap.org/show_bug.cgi?id=9920
Issue ID: 9920
Summary: MDB_PAGE_FULL with master3 (encryption) because there is no room for the authentication data (MAC)
Product: LMDB
Version: unspecified
Hardware: x86_64
OS: Mac OS
Status: UNCONFIRMED
Keywords: needs_review
Severity: normal
Priority: ---
Component: liblmdb
Assignee: bugs@openldap.org
Reporter: info@parlepeuple.fr
Target Milestone: ---
Created attachment 915 --> https://bugs.openldap.org/attachment.cgi?id=915&action=edit proposed patch
Hello,
On master3, using the encryption-at-rest feature, I am testing as follows:
- On a new named database, I set the encryption function with mdb_env_set_encrypt(env, encfunc, &enckey, 32).
- Note that I chose a size parameter ("The size of authentication data in bytes, if any. Set this to zero for unauthenticated encryption mechanisms.") of 32 bytes.
- I add 2 entries to the DB, trying to saturate the first page. I chose a key of 33 bytes and a value of 1977 bytes, so the size of each node is 2010 bytes (obviously the 2 keys are different).
- This passes, and the DB has just one leaf_pages, no overflow_pages, no branch_pages, and a depth of 1.
- If I add one byte to the values I insert (starting again from a blank DB), then, instead of seeing 2 overflow_pages, I get an error: MDB_PAGE_FULL.
- This clearly should not have happened.
- Here is some tracing:
add to leaf page 2 index 0, data size 48 key size 7 [74657374646200]
add to leaf page 3 index 0, data size 1978 key size 33 [000000000000000000000000000000000000000000000000000000000000000000]
add to branch page 5 index 0, data size 0 key size 0 [null]
add to branch page 5 index 1, data size 0 key size 33 [000000000000000000000000000000000000000000000000000000000000000000]
add to leaf page 4 index 0, data size 1978 key size 33 [000000000000000000000000000000000000000000000000000000000000000000]
add to leaf page 4 index 1, data size 1978 key size 33 [020202020202020202020202020202020202020202020202020202020202020202]
not enough room in page 4, got 1 ptrs
upper-lower = 2020 - 2 = 2016
node size = 2020
Looking at the code, I understand that there is a problem at line 9005:
} else if (node_size + data->mv_size > mc->mc_txn->mt_env->me_nodemax) {
where me_nodemax is incorrect, because it does not take into account the bytes needed for the MAC (authentication code), whose size is in env->me_esumsize.
me_nodemax is calculated at line 5349:
env->me_nodemax = (((env->me_psize - PAGEHDRSZ ) / MDB_MINKEYS) & -2) - sizeof(indx_t);
So I subtract me_esumsize there, by adding "- env->me_esumsize":
env->me_nodemax = (((env->me_psize - PAGEHDRSZ - env->me_esumsize) / MDB_MINKEYS) & -2) - sizeof(indx_t);
I also subtract it from me_maxfree_1pg in the line above, and from pmax at line 10435.
I do not know if my patch is correct, but it solves the issue. Maybe there are other places in the code where me_esumsize should be subtracted from the available size. For example, the calculation of the number of overflow pages in OVPAGES does not take me_esumsize into account, but I think that is OK, because there is only one MAC for the entire set of OV pages, and there is room for it in the first OV page.
See the attached proposed patch.
--- Comment #1 from NikoPLP info@parlepeuple.fr --- Hello again,
Some more input about this.
I am really bumping into nasty errors when using the authenticated encryption feature of master3.
What I have noticed now is that some entries in the DB are corrupted when authentication (a MAC tag) is used.
At the end of some entries, I find trailing zeros instead of the data I added.
In fact, the number of trailing zeros is equal to the size of the MAC minus 1 or 2 (depending on whether it is a bigdata or not).
I was not able to trace down the problem exactly.
But I can confirm that the data on disk is not corrupted. The problem occurs at the moment of the read (mdb_get).
The buffer that is passed to the encryption function already contains the zeros. It is not a problem with my encryption function (which is just doing a memcpy for now, to facilitate debugging).
The size of the buffers is correct, and there is no segfault.
But I find those zeros at the end of the value.
Disabling the MAC authentication solves the problem.
I could not find at which point those zeros are added to the end of the buffer.
Your help and some pointers on how to solve that would be greatly appreciated. Thanks!
--- Comment #2 from Howard Chu hyc@openldap.org --- Please provide minimal sample code demonstrating the problem.
--- Comment #3 from NikoPLP info@parlepeuple.fr --- Hello,
Could you please already review the first part of the issue, which I posted on the 24th? It includes pseudocode, and I would like to get your input on it.
For the second part, I will try to extract some code. But we are already deep into our application code. Many calls are made to the LMDB library before the issue arises, so I can reproduce it easily while testing my app, but I don't have a set of LMDB API calls in one file to give you. That is much more work for me.
I look forward to your feedback on the descriptions I have already posted.
Thanks
--- Comment #4 from Howard Chu hyc@openldap.org --- For the first part, I've created a different patch from yours
https://git.openldap.org/openldap/openldap/-/merge_requests/567
Will try to duplicate your results later.
Quanah Gibson-Mount quanah@openldap.org changed:
What    |Removed     |Added
----------------------------------------------------------------------------
Keywords|needs_review|
--- Comment #5 from Howard Chu hyc@openldap.org --- (In reply to NikoPLP from comment #3)
I'm unable to reproduce the data corruption you described. The pagesize patch has been committed as b9db2582cb31aa0ec88371db388095cc31ceb2f4
--- Comment #6 from NikoPLP info@parlepeuple.fr --- Thanks, Howard, for the pagesize fix. I had tried the merge request back then, and it fixed the MDB_PAGE_FULL issue, but it had no impact on the other issue of corrupted data. About that corruption of values during read operations, I will soon send you a C file with a reproducible test case. Sorry I couldn't do it recently, as I was very busy with other things (and I deactivated authenticated data temporarily so I could continue working). I will come back to you within a week or two. Thanks again.
Howard Chu hyc@openldap.org changed:
What          |Removed    |Added
----------------------------------------------------------------------------
Ever confirmed|0          |1
Status        |UNCONFIRMED|IN_PROGRESS
--- Comment #7 from Howard Chu hyc@openldap.org --- Have you got a reproducer test case for us?
--- Comment #8 from kero renault.cle@gmail.com --- Hey Howard and Niko,
I recently tried the AEAD implementation of master3 again with my Rust LMDB wrapper[3], and I found some issues when using authenticated encryption. At first, I thought it was me and how I was using LMDB and encryption API, but the more I searched for a usage error on my side, the more I thought it was on LMDB's side.
When disabling the authenticated encryption (setting the size of authentication data in bytes to zero) and replacing the encryption algorithm with a simple memcpy, everything works. I also tried using a simple memcpy and setting the auth data to a constant number, which failed (MDB_CRYPTO_FAIL).
I tried changing the auth size to something like 16 in the default simple example [1], but the test seems too small to trigger an error, which confirms what I've seen so far. I need to run LMDB with encryption on a large program to make it break (currently, the redb benchmarks). As the simple encryption example does not use the auth data, I was expecting it to work, and that was the case.
So, the reproducer will likely be in Rust. My heed wrapper is only designed for AEAD (so authenticated) encryption, and I am not proficient in C (not anymore). I also checked the module.c/h files[2], and they are, indeed, taking the size of the authentication data into account. However, have you tried large workloads with auth data?
[1]: https://github.com/LMDB/lmdb/blob/fd3c2adae70d2ed65017100db45e0b3babfe342a/l...
[2]: https://github.com/LMDB/lmdb/blob/fd3c2adae70d2ed65017100db45e0b3babfe342a/l...
[3]: https://github.com/meilisearch/heed/pull/278
--- Comment #9 from NikoPLP info@parlepeuple.fr --- Hello Howard and Kero,
Sorry for the lack of answer to your last message, Howard. This issue is 2 years old. I saw your message of 2023, but I have been too busy until now, so I couldn't dedicate any time to it.
The fact is that I am not using LMDB anymore for NextGraph. Two years ago, after I found the issue with AEAD, and after another issue popped up when compiling on OpenBSD (an issue with semaphores that I couldn't resolve; the data was corrupted in a mysterious way), I switched to RocksDB for the storage backend of NextGraph.
It is not that RocksDB is better in terms of performance (it isn't), but it is just more suitable for me, as it compiles on more platforms and the encryption plugin was working (even though I had to implement it myself, and it doesn't offer AEAD). I also had another problem with LMDB, in that it relies on memory-mapped pages (mapping handled by the OS), and this is clearly not something that will work with WASM.
As our storage backend has to work in web browsers too, I didn't want to invest more time in LMDB. RocksDB isn't ready for WASM either, but the fact that it relies on simple reads and writes to static files on disk makes it more suitable to be adapted to IndexedDB or the newest File API in the browser, even with encryption at rest.
Anyway, I love LMDB for its simplicity, performance, and elegance. But the code is a mess. I am sorry to say that, but the fact that it is written in C is not an excuse for very poor inline documentation and obscure variable names.
I tried to debug the code of the encryption part several times. For the issue at hand, it gets very complex, as it apparently involves a race condition (or at least a case that isn't covered by your test suite).
It seems that, all in all, the master3 branch is not used (I am happily surprised that Meilisearch wants to have AEAD with LMDB), because the branch is very difficult to find (I think there is a mention of it in an old tweet, and that's pretty much all there is; there is no documentation either).
Finally, coming back to this issue:
I was also using LMDB via a Rust binding (the one from Firefox), but I don't think the problem is related to the Rust binding.
I couldn't extract a reproducible test easily: first, because the code was complex and I didn't have time to extract a list of C API calls to give to Howard; second, because I stopped using LMDB; and third, because, as Kero just said, the bug is only triggered when a fair amount of data has been entered in the database. I cannot say how much data is needed. In my original report I tried to describe everything I knew about the zeros I found in my data (which the LMDB code puts there after the page arrives from the OS, because, as I said, the data on disk doesn't have the zeros; it smells like a buffer overrun somewhere, or a buffer index that is shifted).
The code in question is not from Howard, but from someone else who worked on this part several years ago. I think it would be worthwhile to find that person and ask him what he did and what he thinks about the issue we describe.
With the lack of documentation and inline comments, and the obscure coding, if Howard is not fully aware of what is happening in this code, maybe it would be worth just throwing it away and starting the encryption part anew.
It also depends on whether there is some interest or not.
The only thing I can advise Kero is to avoid AEAD for now, until it is fixed (if Howard can find the bug, but he will also need a reproducible test case. If Kero cannot produce one, I might find some time in October to extract one, just for the sake of benevolence towards LMDB).
But yes Kero, I confirm that you are hitting the same bug as I did back then. It is not a problem with your code.
--- Comment #10 from kero renault.cle@gmail.com --- Thank you, Niko, for your message (even though I found it a little harsh),
I will try to find a reproducer in Rust because that's the language used by the AEAD library that fails with auth data[1]. However, I will make it as close as possible to C by only using raw bindings instead of heed.
I'll try my best so that Howard can reproduce and eventually fix it.
[1]: https://github.com/RustCrypto/AEADs/tree/master
--- Comment #11 from NikoPLP info@parlepeuple.fr --- I didn't want to be harsh. I just wanted to explain why the bug is still here 2 years after the bug report. It is because the code of LMDB encryption at rest is hard to debug, as it is undocumented and hard to read. If that weren't the case, I would probably have been able to submit a patch, like I did for another bug, or at least give more hints to Howard so he could find the real bug. I did extensive debugging and tracing back then, and the experience was frustrating. This is not a criticism addressed to Howard, because the code is from someone else, whose name I forgot (check the commits) and who seems to have disappeared. He is not here to answer for his code. And I am not sure how much control Howard has over that part of the code. As I said earlier, if this part of the code is not under Howard's control, it should probably be dropped and reimplemented. But if the current maintainer and original creator (Howard) is capable of fixing it, then that would be great, of course. Looking forward to your reproducible test case :)
--- Comment #12 from NikoPLP info@parlepeuple.fr --- If I recommend tossing away the code, it is because we are talking about encryption. If it is wrongly implemented, as it seems to be, this will have bad consequences for the safety of the encrypted data. If the author of the code is unreachable and didn't document his work, and if the current maintainer is not fully aware of how the code works, it should be removed and redone. I wouldn't use an encryption library that isn't maintained, or whose codebase nobody understands. And I am under the impression that this part of the code isn't really under anyone's control. Where is the contributor who submitted this code and then disappeared? Is Howard totally in control of that part of the code? It is encryption. And I am now worried it hasn't been implemented well.
--- Comment #13 from Howard Chu hyc@openldap.org --- Just a reminder - mdb.master3 is a development branch, not a release branch. It's there so that developers can try it and find problems. If you expected it to be production-ready you were misinformed. Only mdb.RE/xx is released for general use.
--- Comment #14 from NikoPLP info@parlepeuple.fr --- That's a good reminder, Howard! And indeed it is clear that it isn't released, as it is even hard to find the code itself. I really don't want to say bad things about LMDB, which I find, like many others, to be an otherwise very good piece of software. I hope you guys will be able to find the bug! Could you clarify the situation a bit regarding the contributor who left after doing the encryption part? Are you fully in control of his code?
I think the reminder that this is not production-ready is also important for Kero: if AEAD encryption at rest was planned to make it into an upcoming release of Meilisearch, then I think you should reconsider. Even if this bug is fixed soon, there is no ad hoc test suite that can make this piece of code production-ready yet. I was right to raise the question. And your clarification, Howard, is important.
kero renault.cle@gmail.com changed:
What|Removed|Added
----------------------------------------------------------------------------
CC  |       |renault.cle@gmail.com
--- Comment #15 from kero renault.cle@gmail.com --- Created attachment 1028 --> https://bugs.openldap.org/attachment.cgi?id=1028&action=edit A reproducer for the encryption bug with large entries
This is a reproducer for the ITS#9920 encryption issue.
The database becomes corrupted when you insert a lot of (64k) entries with keys of length >= 19, define an auth data size >= 8, and use values of size [1;150] bytes.
--- Comment #16 from kero renault.cle@gmail.com --- Hey Howard,
I was able to reproduce the issue. It took me maybe less than one hour. I modified mtest_enc.c to use larger keys and values. It seems that when you insert a lot of (64k) entries with keys of length >= 19, define an auth data size >= 8, and use values of size [1;150] bytes, the database becomes corrupted.
Note that I am not very fluent in C and didn't take the time to deallocate some mallocs. I also kept the key_lengths area even though keys have a constant length of KEY_LENGTH = 24. I don't think it impacts the reproducer. You can search for ITS#9920 in the code to find the interesting parameters that are corrupting the database.
I hope you'll find a fix. Have a nice day. kero
--- Comment #17 from kero renault.cle@gmail.com --- It's me again. After further investigation, I tried using eax encryption with an auth data size of 4 and 0. Both failed, so the issue doesn't seem to relate only to the auth data size.
When trying with an auth data size of 0, it went further in the process but then failed in different ways: sometimes with an MDB_CURSOR_FULL, MDB_PAGE_NOTFOUND, or MDB_CORRUPTED error, sometimes with a bus error, and at other times even with the following assertion:
mdb.c:9072: Assertion 'MP_UPPER(mp) >= MP_LOWER(mp)' failed in mdb_node_add()
MDB_CURSOR_FULL: Internal error - cursor stack limit reached
MDB_CURSOR_FULL: Internal error - cursor stack limit reached
MDB_CORRUPTED: Located page was wrong type
In conclusion, the auth data size is not the only problem. Even if I set it to 0 and never write the tag out, it corrupts the database. However, I couldn't reproduce the issue with a zero-length auth data buffer in the mtest_enc.c test even with 6.4M entries.
--- Comment #18 from kero renault.cle@gmail.com --- I come back with more interesting information: The database seems corrupted when we use threads.
The following is the output of the redb benchmarks. The database becomes corrupted when multiple threads are spawned to read from the database simultaneously, with one new transaction per thread.
I also made my program work correctly, without any LMDB error, by spawning the write transaction on the main thread instead of a dedicated thread. Unfortunately, this freezes the user interface but works correctly (a little slow, but that's probably due to decryption).
I still don't know if the problem arises from spawning a write transaction in another thread or because we are trying to read while writing. However, I'm sure that reading with multiple read transactions simultaneously corrupts the database/env.
redb: Bulk loaded 1000000 items in 2974ms
redb: Wrote 100 individual items in 437ms
redb: Wrote 100 x 1000 items in 4178ms
redb: len() in 0ms
redb: Random read 1000000 items in 714ms
redb: Random read 1000000 items in 730ms
redb: Random range read 10000000 elements in 1645ms
redb: Random range read 10000000 elements in 1638ms
redb: Random read (1 threads) 1000000 items in 723ms
redb: Random read (4 threads) 1000000 items in 209ms
redb: Random read (8 threads) 1000000 items in 184ms
redb: Random read (16 threads) 1000000 items in 185ms
redb: Random read (32 threads) 1000000 items in 185ms
redb: Removed 500000 items in 2764ms
heed: Bulk loaded 1000000 items in 7092ms
heed: Wrote 100 individual items in 299ms
heed: Wrote 100 x 1000 items in 79816ms
heed: len() in 0ms
heed: Random read 1000000 items in 7855ms
heed: Random read 1000000 items in 911ms
heed: Random range read 10000000 elements in 1298ms
heed: Random range read 10000000 elements in 1306ms
heed: Random read (1 threads) 1000000 items in 7598ms
thread 'thread '<unnamed><unnamed>' panicked at '' panicked at 'called `Result::unwrap()` on an `Err` value: Mdb(Corrupted)called `Result::unwrap()` on an `Err` value: Mdb(Corrupted)', ', benches/common.rs:419:36
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
benches/common.rs:419:36
--- Comment #19 from kero renault.cle@gmail.com --- Further information that I forgot in my previous comment (as we cannot edit them). I added a helpful draft PR[1] that helps reproduce the bug with the redb benchmarks (you need my fork of redb and a specific branch of heed).
Additional information from my fix for my program: I tried many different AEAD encryption libraries, from eax with a zero-length auth data buffer to a good chacha20poly1305 with an auth data buffer of 16 bytes, and everything worked fine as long as there was no thread involved. Note that it was working perfectly fine with LMDB 0.9.
[1]: https://github.com/Kerollmops/redb/pull/1
--- Comment #20 from kero renault.cle@gmail.com --- I found another strange behavior when using this feature with my current program.
When I tried to access the first database entry containing large values (size of images, around 300KiB) either by using a get or a cursor (the get uses a cursor internally), it crashed. It works perfectly fine with the other entries and smaller entries from the other database (basically integers or small sets of integers).
I discovered that LMDB called the encrypt callback with the wrong input-output buffers. So, I logged them, and it seems that the size of those values (MDB_val.mv_size) is wrong here and is equal to 2^32 - 32.
input ptr: 0x000000010138c010, length: 4294967264
output ptr: 0x00000001284c8010, length: 4294967264
thread 'main' panicked at library/core/src/panicking.rs:219:5:
unsafe precondition(s) violated: ptr::copy_nonoverlapping requires that both pointer arguments are aligned and non-null and the specified memory ranges do not overlap
The last message is from the Rust std library, which uses a wrapper around memcpy when compiling in debug to detect these faulty usages.
--- Comment #21 from NikoPLP info@parlepeuple.fr --- That's a lot of tracing and debugging! As for thread safety, I don't remember encountering this kind of issue. All my cases were synchronous. I was using this Rust wrapper: https://git.nextgraph.org/NextGraph/lmdb-rs.git
Maybe there is also an issue with your rust wrapper, that adds even more instability to your tests.
Would you mind recreating your test suite with this lmdb-rs wrapper? git clone --recursive https://git.nextgraph.org/NextGraph/lmdb-rs.git
It doesn't have the MAC activated in its current state, but try it, and then I can add the AEAD back.
--- Comment #22 from kero renault.cle@gmail.com --- Hey Niko,
I just looked into your `lmdb-crypto-rs` library, which is a very old library that I completely rewrote. It's indeed the original design done by the Mozilla team. Using traits, not range implementation, and not types is complex. I now use heed, and the design is much easier to understand and work with.
I don't think I will have the time to rewrite the redb benchmarks with this library, but if you want, you can try to use `lmdb-crypto-rs` on my fork[1].
> Maybe there is also an issue with your rust wrapper, that adds even more instability to your tests.
Yup, maybe. But as I stated before, it works perfectly with the LMDB@0.9 version (heed itself). It doesn't mean much, but I did not share `RwTxn` between threads (!Sync + Send), and `RoTxn`s are also non-referenceable and non-moveable between threads (!Sync + !Send).
As you can see in this multi-threaded example[2] (db corresponds to a lmdb::Env in this context), the read transaction is created directly in the spawned thread and dropped (abort) at the end of it. There may be something wrong with sharing an env between threads, but it was never something in LMDB@0.9.
Just to be sure that I wasn't missing something, I re-read the documentation (that I hope is up-to-date) of mdb.master3, and there is nothing special about sharing envs between threads, and the same restriction as on mdb.master applies to the RoTxn and RwTxn. Also, I tried running the redb benchmarks with the MDB_NOTLS flag, and it corrupted the db similarly.
[1]: https://github.com/Kerollmops/redb/pull/1
[2]: https://github.com/Kerollmops/redb/blob/c93a74ab8da0a09c56ee68ec4f6a495f7851...
--- Comment #23 from kero renault.cle@gmail.com --- Created attachment 1030 --> https://bugs.openldap.org/attachment.cgi?id=1030&action=edit Reproducer of the corruption happening when using threads
Hey Howard and Niko,
I took the time to create another reproducer for the issue that occurs when using multiple threads in an encrypted environment with a non-zero auth data buffer ONLY.
However, I think the corruption happens when I define an auth data buffer of at least 6 bytes. I tried running with an auth data buffer of 4 with 10 to 100 threads, and it worked.
--- Comment #24 from NikoPLP info@parlepeuple.fr --- Hello Kero, That's an interesting finding! I hope Howard will find the time to look into all that. I am myself too busy at the moment to help you in any way. But if things are not solved by October, I will dedicate some time to helping you. Thanks for the link to the redb benchmark, and yes, I will have a look at heed; as I told you in private, I am curious about it. But again, not right now, as I am overwhelmed with work at the moment. Hopefully Howard will be able to help you soon. Keep up the good work!
Howard Chu hyc@openldap.org changed:
What  |Removed|Added
----------------------------------------------------------------------------
Blocks|       |9367
Referenced Issues:
https://bugs.openldap.org/show_bug.cgi?id=9367 [Issue 9367] back-mdb: encryption support
--- Comment #25 from Howard Chu hyc@openldap.org --- (In reply to kero from comment #15)
Hi, just starting to look at this. I note that you modified mtest_enc.c which does not use authenticated encryption. As such, you've told it to reserve 16 bytes of space that will always be uninitialized since the encryption function never touches it. mtest_enc2.c is the one that tests authenticated encryption.
--- Comment #26 from kero renault.cle@gmail.com ---
> As such, you've told it to reserve 16 bytes of space that will always be uninitialized since the encryption function never touches it.
Yes, I specifically asked it to reserve 16 bytes of auth data. I don't use them in the encryption algorithm, but they are just used to show that it breaks anyway. You can maybe try to reproduce it with mtest_enc2.c. I didn't try it much.
Apart from asking for a large enough buffer for the auth data, is there anything more LMDB can do to provide the buffer? It is not mandatory to use it. It works as a reproducer even though I don't use the 16 bytes, right?
--- Comment #27 from Howard Chu hyc@openldap.org --- Fixed in d61005822c120e5364dcb41893372d502a1bcc81 please test.
--- Comment #28 from kero renault.cle@gmail.com --- Thank you Howard for the patch,
Indeed, it is much better. I can now use different auth data buffer sizes. But it is still broken in all sorts of ways when I try to read from the database while writing to it. My numerous attempts ended in the following errors:
- MDB_BAD_DBI: So I decided to try to reopen the databases on the writing thread even though it was already created with a write transaction and committed on the main thread (the one opening the env).
- MDB_BAD_TXN: Transaction must abort, has a child, or is invalid.
- MDB_PAGE_NOTFOUND: Requested page not found.
- MDB_CRYPTO_FAIL: Page encryption or decryption failed.
Note that all these errors ONLY happen when I try to read the database while writing it in another thread. I run multiple (10) threads that block/wait to open a new write transaction. If I DO NOT perform ANY read while writing, every read or write attempt on the write transaction performs successfully.
When I write from the main thread without reading at the same time, everything works perfectly — at least as far as I tested. It blocks my program, so that's not a perfect situation, but I can open and read in the environment, and I don't have to open/create the databases twice.
--- Comment #29 from kero renault.cle@gmail.com --- I just had a SIGBUS error due to the encryption function trying to read into an invalid page by trying to read the first key of a database. I was just trying to read; no writes were being performed at the same time.
Here is the backtrace:
0 libsystem_platform.dylib 0x184abf3a8 _platform_memmove + 520
1 lil-flower 0x10446a10c heed3_encryption::env::encrypted::decrypt::h84b5ce28811df3e6 + 88
2 lil-flower 0x104455f64 std::panicking::try::h76c7716a4e73ebc3 + 116
3 lil-flower 0x104469698 heed3_encryption::env::encrypted::encrypt_func_wrapper::hf1ab12e6180da05f + 64
4 lil-flower 0x1044bbb18 mdb_rpage_encsum + 524
5 lil-flower 0x1044bad78 mdb_page_get + 856
6 lil-flower 0x1044b6b54 mdb_node_read + 252
7 lil-flower 0x1044b6498 mdb_cursor_set + 1040
8 lil-flower 0x1044b5f84 mdb_get + 344
--- Comment #30 from Howard Chu hyc@openldap.org --- Please try your tests under valgrind/helgrind or valgrind/drd and provide the valgrind output.
Note that since decryption requires copies of the DB pages, LMDB must maintain a cache of these pages, and the locking of this cache is probably missing something somewhere.
--- Comment #31 from Howard Chu hyc@openldap.org --- Created attachment 1033 --> https://bugs.openldap.org/attachment.cgi?id=1033&action=edit valgrind output from OpenLDAP test008
I've added encryption support to OpenLDAP slapd so that we can test multithreaded access using our existing test suite. There are a number of data races showing up in valgrind, both in LMDB itself and in OpenSSL. We'll avoid using OpenSSL in our sample crypto code going forward.
The valgrind output is attached for reference. This is run against a slapd configured for test008, with crypto options enabled.
Howard Chu hyc@openldap.org changed:
What      |Removed    |Added
----------------------------------------------------------------------------
Resolution|---        |TEST
Status    |IN_PROGRESS|RESOLVED
--- Comment #32 from Howard Chu hyc@openldap.org --- Should be OK now in git mdb.master3. test008 passes now.
The OpenSSL race conditions in the sample crypto.c module could not be fixed, so libsodium was used instead.
--- Comment #33 from kero renault.cle@gmail.com --- Thank you very much, Howard, for the great work. It works nearly perfectly now.
However, I still notice some MDB_BAD_DBI errors when using my program. They are rare but still there and triggered by the multiple indexing threads, probably also while reading the database. I haven't closed the DBI.
MDB_BAD_DBI: The specified DBI handle was closed/changed unexpectedly
Would you mind also updating the GitHub mirror branch, please? Or is it an automatic task somehow?
Have a nice day!
--- Comment #34 from Howard Chu hyc@openldap.org --- (In reply to kero from comment #33)
That sounds like an unrelated issue. Does the problem disappear when you disable encryption?
--- Comment #35 from Howard Chu hyc@openldap.org --- (In reply to Howard Chu from comment #34)
Try again, commit 1e7891c016a4518eaf0aa29f959b0cc8a22d4111
--- Comment #36 from kero renault.cle@gmail.com --- I tried again and updated my program a bit. I fixed some things, and it seems to work fine under heavy usage, with thousands of files inserted.
I do not understand why I triggered the MDB_BAD_DBI error. I am no longer able to reproduce it.
Thank you very much for the multiple fixes. This encryption-at-rest feature is so impressive.
Have a nice day!
Howard Chu hyc@openldap.org changed:
What      |Removed|Added
----------------------------------------------------------------------------
Resolution|TEST   |FIXED
Quanah Gibson-Mount quanah@openldap.org changed:
What  |Removed |Added
----------------------------------------------------------------------------
Status|RESOLVED|VERIFIED