https://bugs.openldap.org/show_bug.cgi?id=9397
Issue ID: 9397 Summary: LMDB: A second process opening a file with MDB_WRITEMAP can cause the first to SIGBUS Product: LMDB Version: 0.9.26 Hardware: All OS: All Status: UNCONFIRMED Severity: normal Priority: --- Component: liblmdb Assignee: bugs@openldap.org Reporter: github@nicwatson.org Target Milestone: ---
Created attachment 780 --> https://bugs.openldap.org/attachment.cgi?id=780&action=edit Full reproduction of SIGBUS MDB_WRITEMAP issue (works on Linux only)
The fundamental problem is that a ftruncate() on Linux that makes a file smaller will cause accesses past the new end of the file to SIGBUS (see the mmap man page).
The sequence that causes a SIGBUS involves two processes.
1. The first process opens a new LMDB file with MDB_WRITEMAP. 2. The second process opens the same LMDB file with MDB_WRITEMAP and with an explicit map_size smaller than the first process's map size. * This causes an ftruncate that makes the underlying file *smaller*. 3. (Optional) The second process closes the environment and exits. 4. The first process opens a write transaction and writes a bunch of data. 5. The first process commits the transaction. This causes a memory read from the mapped memory that's now past the end of the file. On Linux, this triggers a SIGBUS.
Attached is code that fully reproduces the problem on Linux.
The most straightforward solution is to allow ftruncate to *reduce* the file size if it is the only reader. Another possibility is check the file size and ftruncate if necessary every time a write transaction is opened. A third possibility is to catch the SIGBUS signal.
Repro note: I used clone() to create the subprocess to most straightforwardly demonstrate that the problem is not due to inherited file descriptors. The problem still manifests when the processes are completely independent.
https://bugs.openldap.org/show_bug.cgi?id=9397
--- Comment #1 from Howard Chu hyc@openldap.org --- (In reply to github@nicwatson.org from comment #0)
The most straightforward solution is to allow ftruncate to *reduce* the file size if it is the only reader.
Yes. Patch welcome.
Another possibility is check the file size and ftruncate if necessary every time a write transaction is opened.
No, too slow, and wouldn't catch the problem every time. You would need to check on every actual write operation all the way up to the final commit.
A third possibility is to catch the SIGBUS signal.
No, libraries should never muck with signal handlers. Particularly in a multithreaded app where the signal may have occurred due to some other thread completely unrelated to this library. Only the main app should ever touch signal controls.
https://bugs.openldap.org/show_bug.cgi?id=9397
--- Comment #2 from github@nicwatson.org github@nicwatson.org --- Thank you for your prompt attention.
I took a look at implementing a patch to prevent ftruncate from reducing the filesize. I could use your advice in resolving races.
There are two paths to consider to the ftruncate call in mdb_env_map: from mdb_env_open2, and from mdb_env_set_mapsize.
In the former case, if the file is the only reader, the file is locked for exclusive use (via a file lock on the first byte of the lock file), so no races are possible. If the file is already open, the issues are the same as the next case.
For the user explicitly calling mdb_env_set_mapsize or opening an already-opened file, there's no exclusive lock held, so another process might be in the middle of its own call to ftruncate, and any detection of other readers would itself be racy.
So, we either need a new locking scheme for setting the file size, or need a safer (that doesn't reduce the file size) call to make than ftruncate. Taking the write transaction lock was my first guess, but that would prevent a new process from opening a file while a write transaction was held, which seems unacceptable.
Another option is we could create a new exclusive file lock from the lockfile. This lock would be held only for the period of time of checking the file size and calling ftruncate. That has backwards compatibility implications.
IMHO, this seems like a fundamental problem in the POSIX ftruncate API. Alternatives would be writing a "safe" byte to the end of the map, for example with pwrite(). I'm not really sure what a safe byte would be. (writing 0 bytes does not increase the file size, unfortunately).
Another option is using posix_fallocate for the last byte of the map. This would technically force a disk page to be allocated, which wouldn't hurt anything. The downside is availability of the API at all (for example it is not available on MacOS), and that it downgrades to a racy manual implementation if fallocate isn't available (which is I think everything but Linux). On the plus side, the race would be much smaller than it exists today, and completely eliminated on Linux.
Thoughts?
https://bugs.openldap.org/show_bug.cgi?id=9397
--- Comment #3 from Howard Chu hyc@openldap.org --- (In reply to github@nicwatson.org from comment #2)
Thank you for your prompt attention.
I took a look at implementing a patch to prevent ftruncate from reducing the filesize. I could use your advice in resolving races.
There are two paths to consider to the ftruncate call in mdb_env_map: from mdb_env_open2, and from mdb_env_set_mapsize.
In the former case, if the file is the only reader, the file is locked for exclusive use (via a file lock on the first byte of the lock file), so no races are possible. If the file is already open, the issues are the same as the next case.
For the user explicitly calling mdb_env_set_mapsize or opening an already-opened file, there's no exclusive lock held, so another process might be in the middle of its own call to ftruncate, and any detection of other readers would itself be racy.
So, we either need a new locking scheme for setting the file size, or need a safer (that doesn't reduce the file size) call to make than ftruncate. Taking the write transaction lock was my first guess, but that would prevent a new process from opening a file while a write transaction was held, which seems unacceptable.
Another option is we could create a new exclusive file lock from the lockfile. This lock would be held only for the period of time of checking the file size and calling ftruncate. That has backwards compatibility implications.
IMHO, this seems like a fundamental problem in the POSIX ftruncate API. Alternatives would be writing a "safe" byte to the end of the map, for example with pwrite(). I'm not really sure what a safe byte would be. (writing 0 bytes does not increase the file size, unfortunately).
Another option is using posix_fallocate for the last byte of the map. This would technically force a disk page to be allocated, which wouldn't hurt anything. The downside is availability of the API at all (for example it is not available on MacOS), and that it downgrades to a racy manual implementation if fallocate isn't available (which is I think everything but Linux). On the plus side, the race would be much smaller than it exists today, and completely eliminated on Linux.
Thoughts?
1st: "Don't do that." The docs for mdb_env_mapsize() make it clear that it's up to the caller to ensure that no other txns are active at the time it is called. We can first expand this statement in the docs and say that the caller is responsible to ensure that no other *users* are active - processes, threads, txns, whatever.
2nd: We can simply skip ftruncate on any requests to shrink the map. There is no pressing reason to perform the operation. Since we don't use fallocate, the excess size isn't actually consuming any disk space yet, so ftruncate doesn't save any disk space.
https://bugs.openldap.org/show_bug.cgi?id=9397
--- Comment #4 from github@nicwatson.org github@nicwatson.org --- (In reply to Howard Chu from comment #3) ...
1st: "Don't do that." The docs for mdb_env_mapsize() make it clear that it's up to the caller to ensure that no other txns are active at the time it is called. We can first expand this statement in the docs and say that the caller is responsible to ensure that no other *users* are active - processes, threads, txns, whatever.
I have a finite view on the users of LMDB, but I know of 3 libraries built on LMDB this change in contract would impact. It is common (sense) to dynamically increase the map size when the map gets full. We would essentially be asking our downstreams to implement external file locking.
2nd: We can simply skip ftruncate on any requests to shrink the map. There is no pressing reason to perform the operation. Since we don't use fallocate, the excess size isn't actually consuming any disk space yet, so ftruncate doesn't save any disk space.
I was planning on passing the "excl" value from mdb_env_open2 to a new parameter to mdb_env_map, and only allow the filesize to decrease when excl is True (and the running process has an exclusive lock). This would preserve the existing behavior with a single process user.
The file size/ftruncate race still remains though. In fact, any two processes that call ftruncate on the same file will have this problem.
Now that I think about it, an fcntl exclusive lock on the first byte of the data file would work (just for the period between checking the file size and ftruncate), and it does somewhat capture the semantics of protecting the file.
https://bugs.openldap.org/show_bug.cgi?id=9397
--- Comment #5 from Howard Chu hyc@openldap.org --- (In reply to github@nicwatson.org from comment #4)
(In reply to Howard Chu from comment #3)
2nd: We can simply skip ftruncate on any requests to shrink the map. There is no pressing reason to perform the operation. Since we don't use fallocate, the excess size isn't actually consuming any disk space yet, so ftruncate doesn't save any disk space.
I was planning on passing the "excl" value from mdb_env_open2 to a new parameter to mdb_env_map, and only allow the filesize to decrease when excl is True (and the running process has an exclusive lock). This would preserve the existing behavior with a single process user.
This still ignores the fact that there is never any actual need to shrink the filesize.
https://bugs.openldap.org/show_bug.cgi?id=9397
--- Comment #6 from github@nicwatson.org github@nicwatson.org --- (In reply to Howard Chu from comment #5)
(In reply to github@nicwatson.org from comment #4)
(In reply to Howard Chu from comment #3)
2nd: We can simply skip ftruncate on any requests to shrink the map. There is no pressing reason to perform the operation. Since we don't use fallocate, the excess size isn't actually consuming any disk space yet, so ftruncate doesn't save any disk space.
I was planning on passing the "excl" value from mdb_env_open2 to a new parameter to mdb_env_map, and only allow the filesize to decrease when excl is True (and the running process has an exclusive lock). This would preserve the existing behavior with a single process user.
This still ignores the fact that there is never any actual need to shrink the filesize.
HFS+ filesystems on MacOS do not support sparse files. ftruncate allocates (and deallocates) real disk space on these platforms.
https://bugs.openldap.org/show_bug.cgi?id=9397
--- Comment #7 from tina@tina.pm --- Hi,
I am the person who reported the original problem to Nic, after getting my DB corrupted the minute I started doing multiprocessing.
I wanted to comment on this:
(In reply to Howard Chu from comment #3)
1st: "Don't do that." The docs for mdb_env_mapsize() make it clear that it's up to the caller to ensure that no other txns are active at the time it is called. We can first expand this statement in the docs and say that the caller is responsible to ensure that no other *users* are active - processes, threads, txns, whatever.
I think that is far from clear in the documentation, and it seems to be many projects out there (including mine) doing dynamic resizing assuming that no other txns are active *in the current process*! I would urge you to update the documentation then, as this might be a DB corruption waiting to happen in many places.
Having said that, instead of delegating this to users, which would result in reinventing the wheel and falling for the many pitfalls of file locking across architectures, is there any good reason not to obtain an exclusive lock on the lock file when mdb_env_mapsize() is called?
Alternatively, would you consider offering in the API a exclusive locking function using the current mechanisms?
Thanks.
https://bugs.openldap.org/show_bug.cgi?id=9397
--- Comment #8 from Howard Chu hyc@openldap.org --- (In reply to tina from comment #7)
Hi,
I am the person who reported the original problem to Nic, after getting my DB corrupted the minute I started doing multiprocessing.
I wanted to comment on this:
(In reply to Howard Chu from comment #3)
1st: "Don't do that." The docs for mdb_env_mapsize() make it clear that it's up to the caller to ensure that no other txns are active at the time it is called. We can first expand this statement in the docs and say that the caller is responsible to ensure that no other *users* are active - processes, threads, txns, whatever.
I think that is far from clear in the documentation, and it seems to be many projects out there (including mine) doing dynamic resizing assuming that no other txns are active *in the current process*! I would urge you to update the documentation then, as this might be a DB corruption waiting to happen in many places.
There's no danger of corruption if you only resize to grow the DB, which is the only functionality you should need in an active application. Shrinking the DB should only be an administrative action, not a live runtime operation.
https://bugs.openldap.org/show_bug.cgi?id=9397
--- Comment #9 from tina@tina.pm ---
There's no danger of corruption if you only resize to grow the DB, which is the only functionality you should need in an active application. Shrinking the DB should only be an administrative action, not a live runtime operation.
The problem here happens because the other process does not see the updated mapsize properly. This is a combination of two things: one is a bug in the python library that was passing a default map_size on environment creation, and the other is that the mdb.c code ignores the persisted map_size in that situation and sets it to the last pageno used plus 1.
In any case, that information about the risk of shrinking the database would be incredibly useful in the docs too.
https://bugs.openldap.org/show_bug.cgi?id=9397
--- Comment #10 from github@nicwatson.org github@nicwatson.org --- (In reply to Howard Chu from comment #8)
(In reply to tina from comment #7)
Hi,
I am the person who reported the original problem to Nic, after getting my DB corrupted the minute I started doing multiprocessing.
I wanted to comment on this:
(In reply to Howard Chu from comment #3)
1st: "Don't do that." The docs for mdb_env_mapsize() make it clear that it's up to the caller to ensure that no other txns are active at the time it is called. We can first expand this statement in the docs and say that the caller is responsible to ensure that no other *users* are active - processes, threads, txns, whatever.
I think that is far from clear in the documentation, and it seems to be many projects out there (including mine) doing dynamic resizing assuming that no other txns are active *in the current process*! I would urge you to update the documentation then, as this might be a DB corruption waiting to happen in many places.
There's no danger of corruption if you only resize to grow the DB, which is the only functionality you should need in an active application. Shrinking the DB should only be an administrative action, not a live runtime operation.
"only resize to grow the DB" is impossible without external synchronization. Without an external lock, there's always a window between checking the file size and truncating the file when another process might have done the same.
Imagine two processes A and B that both call mdb_env_set_mapsize at the same time on an environment pointing at the same file. Process A calls the function with a size of 20 MiB, process B calls it with a size of 10 MiB. Before the call, the file size was 5 MiB.
I believe the follow sequence is possible.
Process A: 0. mdb_env_set_mapsize begins
Process B: 1. 0. mdb_env_set_mapsize begins
Process A: 1. ftruncate (inside mdb_env_set_mapsize)(20MiB) 2. mmap(20MiB) (inside mdb_env_set_mapsize) 3. mdb_env_set_mapsize completes 4. mdb_txn_begin 5. mdb_cursor_put
Process B: 6. ftruncate(10MiB)
Process A: 7. mdb_txn_commit (bus error due to access past end of file)
If step 7 happens before step 6, then we have potential data loss and a corrupted DB if the process A transaction wrote data between the 10MiB and 20 MiB offsets in the file.
In other words, you can have two separate processes increasing the map size and still truncate real data from the file and/or bus error.