https://bugs.openldap.org/show_bug.cgi?id=9397
--- Comment #3 from Howard Chu hyc@openldap.org --- (In reply to github@nicwatson.org from comment #2)
Thank you for your prompt attention.
I took a look at implementing a patch to prevent ftruncate from reducing the filesize. I could use your advice in resolving races.
There are two paths to consider to the ftruncate call in mdb_env_map: from mdb_env_open2, and from mdb_env_set_mapsize.
In the former case, if the file is the only reader, the file is locked for exclusive use (via a file lock on the first byte of the lock file), so no races are possible. If the file is already open, the issues are the same as the next case.
For the user explicitly calling mdb_env_set_mapsize or opening an already-opened file, there's no exclusive lock held, so another process might be in the middle of its own call to ftruncate, and any detection of other readers would itself be racy.
So, we either need a new locking scheme for setting the file size, or need a safer (that doesn't reduce the file size) call to make than ftruncate. Taking the write transaction lock was my first guess, but that would prevent a new process from opening a file while a write transaction was held, which seems unacceptable.
Another option is we could create a new exclusive file lock from the lockfile. This lock would be held only for the period of time of checking the file size and calling ftruncate. That has backwards compatibility implications.
IMHO, this seems like a fundamental problem in the POSIX ftruncate API. Alternatives would be writing a "safe" byte to the end of the map, for example with pwrite(). I'm not really sure what a safe byte would be. (writing 0 bytes does not increase the file size, unfortunately).
Another option is using posix_fallocate for the last byte of the map. This would technically force a disk page to be allocated, which wouldn't hurt anything. The downside is availability of the API at all (for example it is not available on MacOS), and that it downgrades to a racy manual implementation if fallocate isn't available (which is I think everything but Linux). On the plus side, the race would be much smaller than it exists today, and completely eliminated on Linux.
Thoughts?
1st: "Don't do that." The docs for mdb_env_mapsize() make it clear that it's up to the caller to ensure that no other txns are active at the time it is called. We can first expand this statement in the docs and say that the caller is responsible to ensure that no other *users* are active - processes, threads, txns, whatever.
2nd: We can simply skip ftruncate on any requests to shrink the map. There is no pressing reason to perform the operation. Since we don't use fallocate, the excess size isn't actually consuming any disk space yet, so ftruncate doesn't save any disk space.