There's been a long-running discussion about the need to have APIs in liblmdb for displaying the reader table and clearing out stale slots. Quite a few open questions on the topic:
1) What should the API look like for examining the table? My initial instinct is to provide an iterator function that returns info about the next slot each time it's called. I'm not sure that's necessary or the most convenient option, though. Another possibility is just a one-shot function that walks the table itself and dumps the output as a formatted string to stdout, stderr, or a custom output callback.
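For concreteness, the callback variant might look something like the sketch below. The names (reader_table_list, MDB_reader_func) are placeholders made up for illustration, not a committed interface; an iterator-style API would instead return a struct describing one slot per call.

    #include <stdio.h>
    #include "lmdb.h"

    /* Hypothetical callback type: called once per formatted line (or once
     * per reader slot); a negative return would abort the walk. */
    typedef int (MDB_reader_func)(const char *msg, void *ctx);

    /* Hypothetical one-shot walk of the reader table: format each slot
     * (pid, tid, txnid) and hand the text to the caller's function. */
    int reader_table_list(MDB_env *env, MDB_reader_func *func, void *ctx);

    /* Example callback that just forwards each line to stderr. */
    static int dump_to_stderr(const char *msg, void *ctx)
    {
        (void)ctx;
        return fputs(msg, stderr);
    }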
2) What should the API look like for clearing out stale slots? Should it just be implicit inside the library, with no externally visible API? I.e., should the library periodically check on its own, with no outside intervention? Or should there be an API that lets a user explicitly request that a particular slot be freed? The latter sounds pretty dangerous, since freeing a slot that's actually still in use would allow a reader's view of the DB to be corrupted.
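If we do expose an explicit call, a safer shape than "free slot N" might be a single check-and-reap entry point: the caller asks the library to scan for dead readers, and only slots the library itself judges stale get cleared. Again, the name and signature below are purely illustrative.

    #include "lmdb.h"

    /* Hypothetical: scan the reader table, clear any slots whose owner is
     * determined to be dead, and report how many were freed.  Nothing here
     * lets the caller free an arbitrary (possibly live) slot. */
    int reader_table_check(MDB_env *env, int *dead);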
3) What approach should be used for automatic detection of stale slots? Currently we record the process ID and thread ID of a reader in the table. It's not clear to me that the thread ID has anything more than informational value; since we register a per-thread destructor for slots, exiting threads should never be leaving stale slots in the first place. I'm also not sure there are good APIs for an outside caller to determine the liveness of a given thread ID. The process ID is also prone to wraparound; it's still very common for Linux systems to use 15-bit process IDs. So just checking that a pid is still alive doesn't guarantee that it's the same process that was using an LMDB environment at any point in time. There are two main approaches to work around this latter issue:
A) Set a byte-range lock for every process attached to the environment. This is what slapd's alock.c already does for the BDB- and LDBM-based backends. The code is fairly portable and has the desirable property that file locks automatically go away when a process exits. But:
a) On Windows, the OS can take several minutes to clean up the locks of an exited process, so just checking for the presence of a lock could erroneously consider a process to be alive long after it had actually died.
b) File lock syscalls are fairly slow to execute; if we check liveness frequently there will be a noticeable performance hit. Their performance also degrades exponentially with the number of processes locking concurrently, and degrades further still if networked filesystems are involved.
c) This approach won't tell us whether a process is in Zombie state.
(A sketch of the lock-based liveness probe follows point (B) below.)
B) Check the process ID and process start time. This appears to be a fairly reliable approach, and reasonably fast, but there is no standard POSIX API for obtaining this process information. Methods for obtaining the info are fairly well documented across a variety of platforms (AIX, HPUX, multiple BSDs, Linux, Solaris, etc.), but they are all different. It appears that we can implement this compactly for each of those systems, but it means carrying around a dozen or so different implementations.
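For reference, the flavor of the lock-based probe in (A) would be roughly the following: each registered process holds an exclusive fcntl lock on "its" byte of the lockfile, and a checker tests that byte with F_GETLK; if the kernel reports it unlocked, the owner is gone. This is a minimal sketch assuming one lock byte per reader slot, not tested code.

    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Held for the lifetime of the registering process; the kernel drops
     * the lock automatically when the process exits. */
    static int hold_slot_lock(int fd, off_t slot)
    {
        struct flock fl;
        memset(&fl, 0, sizeof(fl));
        fl.l_type = F_WRLCK;
        fl.l_whence = SEEK_SET;
        fl.l_start = slot;      /* assumes one byte per reader slot */
        fl.l_len = 1;
        return fcntl(fd, F_SETLK, &fl);
    }

    /* Probe the same byte: if F_GETLK reports F_UNLCK, nobody holds it any
     * more, so the slot's owner is presumed dead.  Note that F_GETLK
     * ignores locks held by the calling process itself.
     * Returns 1 = alive, 0 = dead, -1 = probe failed. */
    static int slot_owner_alive(int fd, off_t slot)
    {
        struct flock fl;
        memset(&fl, 0, sizeof(fl));
        fl.l_type = F_WRLCK;
        fl.l_whence = SEEK_SET;
        fl.l_start = slot;
        fl.l_len = 1;
        if (fcntl(fd, F_GETLK, &fl) < 0)
            return -1;
        return fl.l_type != F_UNLCK;
    }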
Also, assuming we want to support shared LMDB access across NFS (as discussed in an earlier thread), it seems we're going to have to use a lock-based solution anyway, since process IDs won't be meaningful across host boundaries.
We can implement approach (A) fairly easily, with no major repercussions. For (B) we would need to add a field to the reader table records to store the process start time. (Thus a lockfile format change.)
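To make the portability cost of (B) concrete: on Linux the start time is field 22 ("starttime") of /proc/<pid>/stat, and a stale check would store that value in the reader record at registration time and compare it with a fresh read later. The following is a Linux-only illustration, not proposed library code; every other platform would need its own small equivalent.

    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>

    /* Fetch the kernel's start time for a pid (in clock ticks since boot)
     * from /proc/<pid>/stat.  We parse from the last ')' because the comm
     * field may itself contain spaces or parentheses.  Returns 0 on
     * success, -1 if the pid is gone or the file can't be parsed. */
    static int proc_start_time(pid_t pid, unsigned long long *start)
    {
        char path[64], buf[1024];
        FILE *f;
        char *p;
        int i;

        snprintf(path, sizeof(path), "/proc/%d/stat", (int)pid);
        f = fopen(path, "r");
        if (!f)
            return -1;
        p = fgets(buf, sizeof(buf), f);
        fclose(f);
        if (!p)
            return -1;

        p = strrchr(buf, ')');
        /* starttime is the 20th whitespace-separated field after ')' */
        for (i = 0; p && i < 20; i++)
            p = strchr(p + 1, ' ');
        if (!p)
            return -1;
        return sscanf(p + 1, "%llu", start) == 1 ? 0 : -1;
    }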
(Note: the performance of fcntl locks vs. checking process start time was measured with some simple code on my laptop running Linux. These operations are all highly OS-dependent, so the performance ratios may vary quite a lot from system to system.)
The relative performance may not even be an issue in general, since we would only need to trigger a scan if a writer actually finds that some reader txn is preventing it from using free pages from the freeDB. Most of the time that won't be happening, but if there were a legitimate long-running read txn (e.g., for mdb_env_copy) we might find ourselves checking fairly often.