https://bugs.openldap.org/show_bug.cgi?id=10095
Issue ID: 10095
Summary: Race condition causing corruption of mutexes when closing the database
Product: LMDB
Version: 0.9.30
Hardware: x86_64
OS: Linux
Status: UNCONFIRMED
Keywords: needs_review
Severity: normal
Priority: ---
Component: liblmdb
Assignee: bugs@openldap.org
Reporter: peter@peterzhu.ca
Target Milestone: ---
We're running into a race condition across multiple processes that corrupts the mutexes when a process closes the database. It is caused by the fix for https://bugs.openldap.org/show_bug.cgi?id=9278 (commit https://git.openldap.org/openldap/openldap/-/commit/f683ffdc81d0edb20437cb7d...).
Here's the interleaving of two processes (p0 and p1) that can cause this situation.
p0: Opens connection to database using mdb_env_create and mdb_env_open.
...some things happen in between...
p0: Begins closing the database using mdb_env_close:
    p0: Calls mdb_env_close0:
        p0: Acquires write lock on the file lock using mdb_env_excl_lock.
        p0: Calls pthread_mutex_destroy on the mutexes.
SWITCH TO p1
p1: Begins opening the database using mdb_env_create. Then calls mdb_env_open. In mdb_env_open:
    p1: Calls mdb_env_setup_locks:
        p1: Calls mdb_env_excl_lock, but it's unable to acquire a write file lock because p0 holds the write file lock. It waits to acquire a read file lock.
SWITCH TO p0
p0: Calls close on the file descriptor which releases the write lock.
SWITCH TO p1
p1: Acquires the read file lock.
p1: Does NOT call pthread_mutex_init since it did not acquire a write file lock.
...some things happen in between...
p1: Tries to lock the mutex using pthread_mutex_lock. This call fails with EINVAL because the mutex has been destroyed.
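The interleaving above can be replayed deterministically in a single process. The following is only a rough sketch of the ordering, not LMDB code: replay_race, try_excl_lock and struct shared_region are hypothetical names, the mutex is modeled as a boolean flag so that nothing actually locks a destroyed mutex, and flock() is used on two separate descriptors purely so that "p0" and "p1" can conflict inside one process (LMDB itself does not lock this way on Linux).

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1
#endif
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdlib.h>
#include <sys/file.h>
#include <unistd.h>

/* Hypothetical stand-in for the process-shared mutexes in the lock file. */
struct shared_region {
    bool mutex_valid;   /* true after "init", false after "destroy" */
};

/* Non-blocking attempt at the exclusive (write) file lock. */
static bool try_excl_lock(int fd)
{
    return flock(fd, LOCK_EX | LOCK_NB) == 0;
}

/*
 * Replays the reported interleaving in one process.
 * Returns EINVAL if p1 ends up using a destroyed mutex, 0 if the mutex
 * is valid, and -1 if the temporary lock file cannot be set up.
 */
int replay_race(void)
{
    char path[] = "/tmp/lock-race-XXXXXX";
    struct shared_region region = { .mutex_valid = false };

    int p0 = mkstemp(path);            /* p0's descriptor */
    if (p0 < 0)
        return -1;
    int p1 = open(path, O_RDWR);       /* p1's own open file description */
    if (p1 < 0) {
        close(p0);
        unlink(path);
        return -1;
    }

    /* p0 opens: it wins the write lock, initializes, then downgrades. */
    if (try_excl_lock(p0))
        region.mutex_valid = true;
    flock(p0, LOCK_SH);

    /* p0 starts closing: as the only user it regains the write lock and
     * destroys the mutex, but has not closed its descriptor yet. */
    if (try_excl_lock(p0))
        region.mutex_valid = false;

    /* p1 opens: the write-lock attempt fails because p0 still holds it. */
    bool p1_excl = try_excl_lock(p1);

    /* p0 finishes closing: close() releases its file lock. */
    close(p0);

    /* p1 initializes only if it got the write lock; otherwise it takes
     * the read lock and skips initialization. */
    if (p1_excl)
        region.mutex_valid = true;
    else
        flock(p1, LOCK_SH);

    /* p1 later tries to lock the mutex. */
    int rc = region.mutex_valid ? 0 : EINVAL;

    close(p1);
    unlink(path);
    return rc;
}
```

In this model, p1 holds a read lock on the file but never re-initializes the mutex, so replay_race() reports EINVAL, matching the failure seen from pthread_mutex_lock.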
I'm not sure how to actually solve this problem. We're currently mitigating it by reverting the commit linked above (so no mutexes get destroyed).