Re: (ITS#5171) hdb txn_checkpoint failures

8 Oct 2007


Aaron Richton wrote:
...
...
It's still rather suspicious that slave4 and slave6 both had identical log 
status for base1 (1/188113) but different requested locations (1/8730339 vs
1/8730401). If they're identically configured slaves then they ought to be in 
lock-step. Then again, obviously they're not identical since slave6 doesn't 
show base4 in your log.
Identical is relative. They've got the same OpenLDAP and supporting 
binaries running on the same patches of Solaris 9 running identical 
turn-up scripts with identical configuration files. But this is 
production, so we've got data changes over time. For instance, the slaves 
bootstrap with a slapadd -q, and the underlying slapcat could easily 
differ between slave4 and slave6 (the most recent one is automatically 
used). I'd imagine this would look different at the db layer, even once 
syncrepl eventually converged the logical data?
...
Do you have the db_stat output from an uncorrupted slave? What about the 
master?
Sure... https://www.nbcs.rutgers.edu/~richton/its5171_dbstatl2
Judging from the LSNs in use on these other servers, it sure looks like 
somebody went in and zeroed out your logs on slave4 and slave6. I don't think 
the environment spontaneously corrupted itself and reset the log offsets...
One more thing to check is just using "ls -l" to see if the actual size of the 
log files corresponds with the db_stat offsets. E.g. if slave6 base1's 
log.0000001 is really 8MB but the LSN is only 233KB, then we have to look for 
a weird in-memory corruption. If not, then somebody reset your logs.
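
A rough sketch of that check, in case it helps to script it across several slaves. It assumes the LSN is written as "file/offset" the way it appears above (e.g. 1/188113); the function names, the result labels, and the 64KB slack are made up for illustration and are not part of db_stat or any OpenLDAP tool. You supply the path to the log file yourself.

```python
import os


def parse_lsn(text):
    """Split an LSN written as 'file/offset' (e.g. '1/188113') into two ints."""
    file_no, offset = text.strip().split("/")
    return int(file_no), int(offset)


def check_log_against_lsn(log_path, lsn_text, slack=64 * 1024):
    """Compare the on-disk size of a log file with the offset in an LSN.

    slack is an arbitrary tolerance (an assumption, not a Berkeley DB value)
    so that small differences are not flagged.

    Returns 'size-exceeds-lsn' when the file is much larger than the LSN
    offset (the case that would point at in-memory corruption), otherwise
    'size-matches-lsn' (the case that would point at the logs being reset).
    """
    _, offset = parse_lsn(lsn_text)
    size = os.path.getsize(log_path)  # same number "ls -l" reports
    if size > offset + slack:
        return "size-exceeds-lsn"
    return "size-matches-lsn"
```

Call it with the real log file path and the LSN string copied from the db_stat output for that environment; it only automates the ls -l comparison and does not inspect the log contents.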
-- 
   -- Howard Chu
   Chief Architect, Symas Corp.  http://www.symas.com
   Director, Highland Sun        http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP     http://www.openldap.org/project/
