good evening;
i am looking for an explanation for a situation which we encountered with an lmdb database, library version 0.9.17-3. the database's condition was such that all attempts to open it for reading failed. in at least some cases the error appears to have occurred during the operation which checks for stale readers. a problem was also evident when attempting to copy the database:
@nl12:~# mdb_copy /srv/dydra/catalog/repositories/d2141030-9495-c040-b1a7-9e19edbeb491/ /srv/dydra/backups/public-data__rev
mdb_copy: copying failed, error 131 (State not recoverable)
it was first evident in running processes which already had the database open, but in which new threads were reopening the database environment. the only remarkable circumstance may have been that, while several dozen threads were simultaneously preparing to read and in the process of opening the environment, a monitoring thread took a snapshot of the threads, which entailed interrupting each one to generate its stack trace.
once we identified and terminated all processes which had the environment open, subsequent open attempts succeeded. when we examined the content, however, although the space was still occupied on disk and the transaction id reflected the prior content, the indices were empty.
- what condition does that message intend to describe?
- what can cause that condition?
- would there have been some way to have recovered the old state, despite that message?
James Anderson wrote:
@nl12:~# mdb_copy /srv/dydra/catalog/repositories/d2141030-9495-c040-b1a7-9e19edbeb491/ /srv/dydra/backups/public-data__rev
mdb_copy: copying failed, error 131 (State not recoverable)
131 is not an LMDB error code. Most likely your underlying storage system failed.
good evening;
On 2020-05-29, at 20:19:02, Howard Chu hyc@symas.com wrote:
James Anderson wrote:
@nl12:~# mdb_copy /srv/dydra/catalog/repositories/d2141030-9495-c040-b1a7-9e19edbeb491/ /srv/dydra/backups/public-data__rev
mdb_copy: copying failed, error 131 (State not recoverable)
131 is not an LMDB error code. Most likely your underlying storage system failed.
which would mean, where that storage system is zfs, i should turn to it. ok.
is there anything to be said, independent of that, about the other two questions?
best regards, from berlin,
-- -- Howard Chu CTO, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/
good evening;
On 2020-05-29, at 20:26:47, James Anderson anderson.james.1955@gmail.com wrote:
good evening;
On 2020-05-29, at 20:19:02, Howard Chu hyc@symas.com wrote:
James Anderson wrote:
@nl12:~# mdb_copy /srv/dydra/catalog/repositories/d2141030-9495-c040-b1a7-9e19edbeb491/ /srv/dydra/backups/public-data__rev
mdb_copy: copying failed, error 131 (State not recoverable)
131 is not an LMDB error code.
could it be that it was returned from mdb_reader_check on account of an orphaned mutex?
what would have been the appropriate means to rectify that?
Most likely your underlying storage system failed.
which would mean, where that storage system is zfs, i should turn to it. ok.
is there anything to be said, independent of that, about the other two questions?
best regards, from berlin,
James Anderson wrote:
good evening;
On 2020-05-29, at 20:19:02, Howard Chu hyc@symas.com wrote:
James Anderson wrote:
@nl12:~# mdb_copy /srv/dydra/catalog/repositories/d2141030-9495-c040-b1a7-9e19edbeb491/ /srv/dydra/backups/public-data__rev
mdb_copy: copying failed, error 131 (State not recoverable)
131 is not an LMDB error code. Most likely your underlying storage system failed.
which would mean, where that storage system is zfs, i should turn to it. ok.
is there anything to be said, independent of that, about the other two questions?
No.
Once the storage system has failed, there is nothing else you can conclude.
This is the ENOTRECOVERABLE error (/usr/include/asm-generic/errno.h), i.e. a robust mutex is in an unrecoverable/corrupted state (perhaps due to an LMDB bug).
Regards, Leonid.
On Fri, May 29, 2020 at 9:19 PM Howard Chu hyc@symas.com wrote:
131 is not an LMDB error code. Most likely your underlying storage system failed.
Леонид Юрьев wrote:
This is ENOTRECOVERABLE error (/usr/include/asm-generic/errno.h), i.e.
Good catch, thanks.
robust mutex is in unrecoverable/corrupted state (perhaps due to an LMDB bug).
There are no instances of mutex usage in LMDB that don't check for EOWNERDEAD and properly recover.
Note the info about robust mutexes:
PTHREAD_MUTEX_ROBUST
    If a mutex is initialized with the PTHREAD_MUTEX_ROBUST attribute and its owner dies without unlocking it, any future attempts to call pthread_mutex_lock(3) on this mutex will succeed and return EOWNERDEAD to indicate that the original owner no longer exists and the mutex is in an inconsistent state. Usually after EOWNERDEAD is returned, the next owner should call pthread_mutex_consistent(3) on the acquired mutex to make it consistent again before using it any further.

    If the next owner unlocks the mutex using pthread_mutex_unlock(3) before making it consistent, the mutex will be permanently unusable and any subsequent attempts to lock it using pthread_mutex_lock(3) will fail with the error ENOTRECOVERABLE. The only permitted operation on such a mutex is pthread_mutex_destroy(3).

    If the next owner terminates before calling pthread_mutex_consistent(3), further pthread_mutex_lock(3) operations on this mutex will still return EOWNERDEAD.
The only way for the mutex to become unrecoverable is by calling pthread_mutex_unlock() on it before calling pthread_mutex_consistent(), and LMDB will never do that. If the process dies before calling pthread_mutex_consistent(), the mutex remains in the EOWNERDEAD state. LMDB never breaks this mutex protocol, so something else in the system is broken.
Regards, Leonid.
On Fri, May 29, 2020 at 9:19 PM Howard Chu hyc@symas.com wrote:
131 is not an LMDB error code. Most likely your underlying storage system failed.
openldap-technical@openldap.org