Aaron Richton wrote:
itself. Again, we can't really tell without single-stepping through the BDB library code. It may not be worth the effort, but that's your call.
The lock was
env_region.c:290 MUTEX_LOCK(dbenv, &renv->mutex);
but that wasn't making much sense... and after a couple of minutes in dbx I realized that I've been killing myself with the attempts at db_stat. Yesterday's attempts were running db_* binaries built against a wrong (but compatible) ABI. It'd be nice if Sleepycat had more (or earlier) checks for that, but oh well...
Kinda figured that that's what happened.
So anyway, I corrupted base2 on slave4 by running the wrong db_stat, but that left three other bases on slave4 and all three bases on slave6. I ran db_stat -l on them; the output is:
BTW, this ABI screwup shouldn't be the root cause of the failures... I hadn't tried any db tools until I started debugging this. These are AUTOREMOVE, so a stray db_archive run is unlikely to be involved, for instance.
It's still rather suspicious that slave4 and slave6 both had identical log status for base1 (1/188113) but different requested locations (1/8730339 vs 1/8730401). If they're identically configured slaves then they ought to be in lock-step. Then again, obviously they're not identical since slave6 doesn't show base4 in your log.
Do you have the db_stat output from an uncorrupted slave? What about the master?
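For lining up the numbers from each slave (and the master, once available), here is a rough Python sketch of the comparison described above. It assumes the file/offset notation used in this thread (e.g. 1/188113) is a log file number followed by a byte offset; the names LSN, parse_lsn and offset_gap are made up for illustration and are not part of BDB or db_stat. The only values used are the ones already quoted for base1.

```python
from typing import NamedTuple, Optional


class LSN(NamedTuple):
    """A log position written as file/offset, e.g. "1/188113" (assumed format)."""
    file: int
    offset: int


def parse_lsn(text: str) -> LSN:
    # Split "file/offset" into two integers; raises ValueError on bad input.
    file_part, offset_part = text.strip().split("/")
    return LSN(int(file_part), int(offset_part))


def offset_gap(a: LSN, b: LSN) -> Optional[int]:
    """Byte distance from a to b if both are in the same log file, else None."""
    if a.file != b.file:
        return None
    return b.offset - a.offset


# Figures quoted in this thread for base1; the dict keys are my own labels.
slave4 = {"log": parse_lsn("1/188113"), "requested": parse_lsn("1/8730339")}
slave6 = {"log": parse_lsn("1/188113"), "requested": parse_lsn("1/8730401")}

same_log = slave4["log"] == slave6["log"]
requested_gap = offset_gap(slave4["requested"], slave6["requested"])

print("log status identical:", same_log)
print("requested offsets differ by:", requested_gap)
```

With more hosts, the same two helpers can be run pairwise over each base's figures to spot which environments have drifted apart.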