Have you got backups from just before these occurrences? Can you see what the last valid transaction log files were before this? Or perhaps you can get db_stat output from any other slaves that are still running OK? The idea is to see whether the current valid CSNs on an equivalent slave are anywhere near the numbers being logged here, e.g. 1/188113 or 1/8730339.
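If it helps, this is roughly the sort of comparison I mean (an untested sketch; the environment path is a placeholder and the exact db_stat output labels may vary by BDB version):

#!/usr/bin/env python3
# Rough sketch: run "db_stat -l" against a slave's BDB environment and pull
# out the current log file / offset lines, so the values can be compared
# across slaves.  The environment path is a placeholder; the "Current log
# file ..." labels are what my BDB version prints and may differ on yours.
import subprocess
import sys

DEFAULT_ENV_DIR = "/var/ldap/bdb"   # hypothetical BDB environment home

def current_log_position(env_dir):
    """Return the db_stat -l lines describing the current log file/offset."""
    out = subprocess.run(["db_stat", "-l", "-h", env_dir],
                         capture_output=True, text=True, check=True).stdout
    return [line.strip() for line in out.splitlines()
            if "Current log file" in line]

if __name__ == "__main__":
    env_dir = sys.argv[1] if len(sys.argv) > 1 else DEFAULT_ENV_DIR
    for line in current_log_position(env_dir):
        print(line)

Running that on a healthy slave and on the two misbehaving ones should show whether the log positions are anywhere near each other.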
Have you actually run out of disk space on the partitions holding the logs? It's rather suspicious that two machines would act up at the same time unless some admin specifically disturbed the log files on those two systems at around that time.
I don't have backups of the slave bdb logs. The master slapcat output is considered sacred data; the slave bdb log files are considered derivable from it and don't get backed up (we'd sooner just replace the entire slave if it acts up). The odds of the partitions filling are minimal; Solaris logs that condition at kern.notice (which on our configuration is serious enough to mean a write to NVRAM), and logs extending back before September 24 show no such messages.
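(For reference, that check was just a scan of the archived syslogs for the ufs "file system full" notices, roughly as below; the log path and message text are from our setup and may differ elsewhere.)

#!/usr/bin/env python3
# Roughly how the "did a partition fill?" check looks: scan the current and
# rotated syslog files for the ufs "file system full" notices.  The log
# location and the message text are assumptions based on our Solaris boxes.
import glob

PATTERN = "file system full"              # ufs NOTICE text on our systems
LOG_GLOB = "/var/adm/messages*"           # current + rotated syslog files

hits = []
for path in sorted(glob.glob(LOG_GLOB)):
    with open(path, errors="replace") as fh:
        hits.extend(line.rstrip() for line in fh if PATTERN in line)

print("\n".join(hits) if hits else "no 'file system full' notices found")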
That said, about "some admin specifically disturbed the log files around that time": the logs show that I was the only person in a position to do so (unless somebody broke in and covered their tracks; we'll ignore that theoretical possibility). On September 24, I reconfigured the slaves to use a different IP address for the master, replacing the existing connection. The times are too coincidental to be unrelated:
(slave4) reconfigured Sep 24 09:41 (first syslog complaint 09:43)
(slave6) reconfigured Sep 24 09:39 (first syslog complaint 09:44)
So... is there something that's cued off the (reverse?) name service entries for the master? Does the master IP hash into a CSN somehow? If that is indeed the root cause, then, quite honestly, I think assuming the name service database will remain constant for the lifetime of a slapd instance is a fallacy. Furthermore, if this is the case, it should be absolutely trivial for me to reproduce: I can perform a DR on slave4/6 and reconfigure their network again.
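For bookkeeping around that test, I'd snapshot the replication state from the master and both slaves before and after the renumbering, along these lines (hostnames and the suffix are placeholders, and it assumes these are syncrepl consumers whose contextCSN on the suffix entry is readable with an anonymous bind):

#!/usr/bin/env python3
# Sketch: grab contextCSN from the master and each slave, timestamped, so a
# before/after comparison around the renumbering shows any sudden jump.
# Hostnames and suffix are placeholders; anonymous read of contextCSN is
# assumed to be permitted.
import subprocess
import time

SUFFIX = "dc=example,dc=com"                    # placeholder suffix
HOSTS = ["master", "slave4", "slave6"]          # placeholder hostnames

def context_csn(host):
    """Read contextCSN from the suffix entry on one server via ldapsearch."""
    out = subprocess.run(
        ["ldapsearch", "-x", "-LLL", "-H", "ldap://%s" % host,
         "-s", "base", "-b", SUFFIX, "contextCSN"],
        capture_output=True, text=True, check=True).stdout
    return [line.split(":", 1)[1].strip()
            for line in out.splitlines() if line.startswith("contextCSN:")]

if __name__ == "__main__":
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    for host in HOSTS:
        print(stamp, host, context_csn(host))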
With that in mind, I'll likely test this reproduction early next week. I can still get db_stat from all slaves (working and not) at this point if that's interesting. Comments?