https://bugs.openldap.org/show_bug.cgi?id=9486
          Issue ID: 9486
           Summary: EINVAL from mdb_txn_begin with MDB_NOSYNC
           Product: LMDB
           Version: unspecified
          Hardware: x86_64
                OS: Linux
            Status: UNCONFIRMED
          Severity: normal
          Priority: ---
         Component: liblmdb
          Assignee: bugs@openldap.org
          Reporter: igfoo@github.com
  Target Milestone: ---

Created attachment 800
  --> https://bugs.openldap.org/attachment.cgi?id=800&action=edit
Test case
With this script:
-----------------------
#!/bin/sh

set -e

rm -rf lmdb
git clone https://github.com/LMDB/lmdb.git
INSTALL_DIR="`pwd`/inst"
cd lmdb/libraries/liblmdb
make install prefix="$INSTALL_DIR"
cd ../../..

gcc -Iinst/include loop.c inst/lib/liblmdb.a -o loop -pthread
rm -f test.db test.db-lock
for i in `seq 1 10`
do
    ./loop $i &
done
wait
echo "All finished"
-----------------------
and the attached loop.c, I get output like:
-----------------------
1: Got error 22 from call 181 of mdb_txn_begin.
6: Got error 22 from call 129 of mdb_txn_begin.
9: Got error 22 from call 149 of mdb_txn_begin.
5: Got error 22 from call 140 of mdb_txn_begin.
3: Got error 22 from call 166 of mdb_txn_begin.
4: Got error 22 from call 156 of mdb_txn_begin.
2: Got error 22 from call 135 of mdb_txn_begin.
8: Got error 22 from call 136 of mdb_txn_begin.
7: Got error 22 from call 163 of mdb_txn_begin.
10: Got error 22 from call 163 of mdb_txn_begin.
All finished
-----------------------
(on some runs, some processes will report -30792 (MDB_MAP_FULL) rather than 22, but I have always seen at least some processes report 22).
Unless I am misunderstanding something, I do not think mdb_txn_begin should be failing with 22.
I haven't dug into this, but based on what I saw with my real code, I expect that this is coming from pthread_mutex_lock returning EINVAL.
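For reference, a minimal sketch of how the two kinds of code in the output above can be told apart. LMDB's own error codes are negative, while positive values such as 22 (EINVAL on Linux) are errno values passed through from the OS. The helper names `rc_origin` and `describe_rc` are made up for this illustration, and the literal -30792 is simply the value quoted above; real code would normally call liblmdb's mdb_strerror() and use the macros from lmdb.h instead.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Value quoted in the report above; real code should use the
 * MDB_MAP_FULL macro from lmdb.h instead of a literal. */
#define REPORTED_MAP_FULL (-30792)

/* Hypothetical helper: 0 = success, 1 = errno-style code from the OS,
 * -1 = LMDB-specific (negative) code. */
static int rc_origin(int rc)
{
    if (rc == 0)
        return 0;
    return rc > 0 ? 1 : -1;
}

/* Hypothetical helper: write a short description of rc into buf. */
static void describe_rc(int rc, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    if (rc == 0)
        snprintf(buf, len, "success");
    else if (rc > 0)
        snprintf(buf, len, "errno %d: %s", rc, strerror(rc));
    else if (rc == REPORTED_MAP_FULL)
        snprintf(buf, len, "LMDB error %d: MDB_MAP_FULL", rc);
    else
        snprintf(buf, len, "LMDB error %d", rc);
}
```

Calling describe_rc() on the value returned by mdb_txn_begin and printing the buffer would make it obvious at a glance whether a failure originated in the OS or in LMDB itself.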
I don't get any errors without MDB_NOSYNC.
This is on Linux, on an ext4 filesystem.
I'm seeing this with the current master:

commit 52bc29ee2efccf09c650598635cd42a50b6ecffe
Author: Howard Chu <hyc@openldap.org>
Date:   Thu Feb 11 11:34:57 2021 +0000

    ITS#9461 fix typo
Please let me know if any more details would be useful.
Howard Chu hyc@openldap.org changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |WONTFIX
--- Comment #1 from Howard Chu hyc@openldap.org --- (In reply to igfoo from comment #0)
Just taking a quick look at your loop.c - most likely you're hitting a race condition between when the lock file is created and when it gets an exclusive lock. This is already documented in "Caveats". Your use case is not supported. Open an env once and keep it open for the life of the process.
Quanah Gibson-Mount quanah@openldap.org changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |VERIFIED
--- Comment #2 from igfoo@github.com --- Thanks for the quick response. Unfortunately, in the real code, each process only opens an env once, but two processes might do so at the same time.
Are you referring to the "Opening a database can fail if another process is opening or closing it at exactly the same time." caveat? If so, presumably we could just wait a random amount of time and try again. Is there any documentation anywhere about what these failures might look like?
--- Comment #3 from Howard Chu hyc@openldap.org --- (In reply to igfoo from comment #2)
No, the behavior is undefined.
Your best bet is just to stagger the startup of your processes.
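A minimal sketch of what staggering could look like inside each worker, assuming every process is given a distinct 1-based index (like the `$i` passed to ./loop in the test script). The helper names and the spacing value are hypothetical. A fixed delay only makes overlapping opens less likely; it does not guarantee they never overlap.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <time.h>

/* Hypothetical helper: delay in milliseconds for worker number `index`
 * (1-based), with workers spaced `step_ms` apart. Worker 1 does not wait. */
static long stagger_delay_ms(int index, long step_ms)
{
    assert(index >= 1 && step_ms >= 0);
    return (long)(index - 1) * step_ms;
}

/* Hypothetical helper: sleep for this worker's delay before it opens the
 * env. Restarts nanosleep() when a signal interrupts it.
 * Returns 0 on success, -1 on any other error. */
static int stagger_startup(int index, long step_ms)
{
    long ms = stagger_delay_ms(index, step_ms);
    struct timespec ts;

    ts.tv_sec = ms / 1000;
    ts.tv_nsec = (ms % 1000) * 1000000L;
    while (nanosleep(&ts, &ts) == -1) {
        if (errno != EINTR)
            return -1;
    }
    return 0;
}
```

A worker would call something like stagger_startup(atoi(argv[1]), 200) before mdb_env_open; the 200 ms spacing is an arbitrary illustrative value, not a recommendation from the LMDB documentation.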
--- Comment #4 from Howard Chu hyc@openldap.org --- (In reply to Howard Chu from comment #3)
Your other option is to just run a dummy process to open the env before any other processes start. Then the env & lockfile will always be already initialized whenever any other process opens it, and the problem goes away. E.g.:
#include <lmdb.h>
#include <stdio.h>
#include <stdlib.h>

void err(char *func, int err) {
    fprintf(stderr, "%s failed with %d\n", func, err);
    exit(1);
}

int main(int argc, char **argv) {
    const char *db_file = "test.db";
    MDB_env* env = NULL;
    int e = mdb_env_create(&env);
    if(e) err("mdb_env_create", e);
    e = mdb_env_open(env, db_file, MDB_NOSUBDIR | MDB_NOSYNC, 0644);
    if(e) err("mdb_env_open", e);
    printf("Press return to exit: ");
    fflush(stdout);
    getchar();
    mdb_env_close(env);
    return 0;
}
--- Comment #5 from igfoo@github.com --- Thanks for the suggestion! I'll think about whether we can make something like that work for us.
But wouldn't another option be to wrap all calls to mdb_env_open and mdb_env_close with a lock/unlock function, like below?
And if that works, then couldn't mdb_env_open/mdb_env_close do it themselves, if I added | MDB_SAFE_PARALLEL_OPEN_CLOSE to the flags?
(a real solution would need better error handling, and perhaps a different locking mechanism on different platforms, but this should give the idea)
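    int fd = lock();
    e = mdb_env_open(env, db_file, MDB_NOSUBDIR | MDB_NOSYNC, 0644);
    unlock(fd);
    [...]
    fd = lock();
    mdb_env_close(env);
    unlock(fd);

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* `me` is assumed to be the per-process label string used in the
 * messages; it is not defined in this fragment. */

/* Take an exclusive lock on byte 0 of test.env-lock, blocking until it
 * is available. Returns the open fd, which unlock() later releases. */
int lock(void) {
    int r;
    int fd = open("test.env-lock", O_WRONLY | O_CREAT, 0666);
    if(fd == -1) {
        printf("%s: lock failed to open file: %d.\n", me, errno);
        exit(1);
    }
    struct flock lock_info;
    memset((void *)&lock_info, 0, sizeof(lock_info));
    lock_info.l_type = F_WRLCK;
    lock_info.l_whence = SEEK_SET;
    lock_info.l_start = 0;
    lock_info.l_len = 1;
    /* Retry if a signal interrupts the blocking lock request. */
    while((r = fcntl(fd, F_SETLKW, &lock_info)) && errno == EINTR)
        ;
    if(r) {
        printf("%s: lock failed to lock file: %d.\n", me, errno);
        exit(1);
    }
    return fd;
}

/* Release the lock taken by lock() and close the fd. */
void unlock(int fd) {
    int r;
    struct flock lock_info;
    memset((void *)&lock_info, 0, sizeof(lock_info));
    /* F_UNLCK is a lock type, not an fcntl command; it is applied with F_SETLK. */
    lock_info.l_type = F_UNLCK;
    lock_info.l_whence = SEEK_SET;
    lock_info.l_start = 0;
    lock_info.l_len = 1;
    while((r = fcntl(fd, F_SETLK, &lock_info)) && errno == EINTR)
        ;
    if(r) {
        printf("%s: unlock failed to unlock file: %d.\n", me, errno);
        exit(1);
    }
    r = close(fd);
    if(r == -1) {
        printf("%s: unlock failed to close file: %d.\n", me, errno);
        exit(1);
    }
}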
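On the "better error handling" point, a variant that reports failure to the caller instead of exiting might look roughly like the sketch below. The names `env_guard_acquire` and `env_guard_release` are invented for this illustration and are not part of liblmdb. It relies on the POSIX rule that closing a file descriptor releases the process's fcntl() locks on that file, which also means that closing any other descriptor for the same file in the same process would drop the lock early.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: take an exclusive fcntl() lock on byte 0 of `path`,
 * blocking until it is available.
 * Returns the open fd on success, or -1 with errno set. */
static int env_guard_acquire(const char *path)
{
    struct flock fl;
    int fd, r;

    assert(path != NULL);
    fd = open(path, O_WRONLY | O_CREAT, 0666);
    if (fd == -1)
        return -1;

    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 1;

    /* Retry if a signal interrupts the blocking lock request. */
    do {
        r = fcntl(fd, F_SETLKW, &fl);
    } while (r == -1 && errno == EINTR);

    if (r == -1) {
        int saved = errno;  /* close() may overwrite errno */
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}

/* Hypothetical helper: release the lock by closing the fd.
 * Returns 0 on success, -1 with errno set. */
static int env_guard_release(int fd)
{
    return close(fd);
}
```

The caller would wrap mdb_env_open and mdb_env_close between an acquire and a release, and decide for itself whether a failed acquire is fatal or worth retrying.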