--- Comment #3 from Howard Chu <hyc(a)openldap.org> ---
(In reply to github(a)nicwatson.org from comment #2)
Thank you for your prompt attention.
I took a look at implementing a patch to prevent ftruncate from reducing the
filesize. I could use your advice in resolving races.
There are two paths to consider to the ftruncate call in mdb_env_map: from
mdb_env_open2, and from mdb_env_set_mapsize.
In the former case, if the file is the only reader, the file is locked for
exclusive use (via a file lock on the first byte of the lock file), so no
races are possible. If the file is already open, the issues are the same as
the next case.
For the user explicitly calling mdb_env_set_mapsize or opening an
already-opened file, there's no exclusive lock held, so another process
might be in the middle of its own call to ftruncate, and any detection of
other readers would itself be racy.
So, we either need a new locking scheme for setting the file size, or need a
safer (that doesn't reduce the file size) call to make than ftruncate.
Taking the write transaction lock was my first guess, but that would prevent
a new process from opening a file while a write transaction was held, which
Another option is we could create a new exclusive file lock from the
lockfile. This lock would be held only for the period of time of checking
the file size and calling ftruncate. That has backwards compatibility
IMHO, this seems like a fundamental problem in the POSIX ftruncate API.
Alternatives would be writing a "safe" byte to the end of the map, for
example with pwrite(). I'm not really sure what a safe byte would be.
(writing 0 bytes does not increase the file size, unfortunately).
Another option is using posix_fallocate for the last byte of the map. This
would technically force a disk page to be allocated, which wouldn't hurt
anything. The downside is availability of the API at all (for example it is
not available on MacOS), and that it downgrades to a racy manual
implementation if fallocate isn't available (which is I think everything but
Linux). On the plus side, the race would be much smaller than it exists
today, and completely eliminated on Linux.
"Don't do that." The docs for mdb_env_mapsize() make it clear that it's
up to the caller to ensure that no other txns are active at the time it
is called. We can first expand this statement in the docs and say that
the caller is responsible to ensure that no other *users* are active -
processes, threads, txns, whatever.
We can simply skip ftruncate on any requests to shrink the map. There
is no pressing reason to perform the operation. Since we don't use
fallocate, the excess size isn't actually consuming any disk space yet,
so ftruncate doesn't save any disk space.
You are receiving this mail because:
You are on the CC list for the issue.