Re: (ITS#5171) hdb txn_checkpoint failures - openldap-bugs

8 Oct 2007


Aaron Richton wrote:
...
...
>> itself. Again, we can't really tell without single-stepping thru the BDB
>> library code. It may not be worth the effort, but that's your call.
> The lock was
> env_region.c:290         MUTEX_LOCK(dbenv, &renv->mutex);
> but that wasn't making much sense....and after a couple minutes in dbx I
> realized that I've been killing myself with the attempts at db_stat.
> Yesterday's attempts were running db_* binaries with a wrong (but
> compatible) ABI. It'd be nice if Sleepycat had some more/earlier checks
> for that, but oh well...
Kinda figured that that's what happened.
...
> So anyway, I corrupted base2/slave4 by running the wrong db_stat, but that
> left three other bases on slave4 and all three bases on slave6. I ran
> db_stat -l on them, the output is:
> https://www.nbcs.rutgers.edu/~richton/its5171_dbstatl
...
> BTW, this ABI screwup shouldn't be the root cause of the failures...I
> haven't tried any db tools until the course of debugging this. These are
> AUTOREMOVE, so db_archive is unlikely, for instance.
It's still rather suspicious that slave4 and slave6 both had identical log 
status for base1 (1/188113) but different requested locations (1/8730339 vs
1/8730401). If they're identically configured slaves then they ought to be in 
lock-step. Then again, obviously they're not identical since slave6 doesn't 
show base4 in your log.
Do you have the db_stat output from an uncorrupted slave? What about the master?
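
For anyone wanting to compare these positions mechanically, here is a minimal sketch. It assumes the "1/188113" notation above is a file-number/offset pair, and the names (LSN, parse_lsn, offset_gap) are hypothetical helpers for illustration, not part of BDB or its tools. It only compares the three values quoted in this message.

```python
from typing import NamedTuple, Optional


class LSN(NamedTuple):
    """A log position, assumed here to be written as file/offset."""
    file: int
    offset: int


def parse_lsn(text: str) -> LSN:
    """Parse a string such as '1/8730339' into an LSN tuple."""
    file_part, offset_part = text.strip().split("/")
    return LSN(int(file_part), int(offset_part))


def offset_gap(a: LSN, b: LSN) -> Optional[int]:
    """Byte distance between two positions in the same log file.

    Returns None when the positions are in different files, since the
    distance then depends on the log file size, which is unknown here.
    """
    if a.file != b.file:
        return None
    return abs(a.offset - b.offset)


# Values quoted in this thread for base1.
log_status = parse_lsn("1/188113")          # identical on slave4 and slave6
slave4_requested = parse_lsn("1/8730339")
slave6_requested = parse_lsn("1/8730401")

# Tuples compare file first, then offset.
in_lock_step = slave4_requested == slave6_requested
gap = offset_gap(slave4_requested, slave6_requested)
requested_past_log = log_status < slave4_requested

print("requested positions match:", in_lock_step)
print("gap in bytes:", gap)
print("requested position is beyond the reported log status:", requested_past_log)
```

With the numbers above, the two requested positions sit in the same file but differ by a small byte count, and both lie well past the reported log status, which is the mismatch being asked about.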
-- 
   -- Howard Chu
   Chief Architect, Symas Corp.  http://www.symas.com
   Director, Highland Sun        http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP     http://www.openldap.org/project/