Aaron Richton wrote:
It's still rather suspicious that slave4 and slave6 both had identical log status for base1 (1/188113) but different requested locations (1/8730339 vs 1/8730401). If they're identically configured slaves then they ought to be in lock-step. Then again, obviously they're not identical since slave6 doesn't show base4 in your log.
Identical is relative. They've got the same OpenLDAP and supporting binaries running on the same patch level of Solaris 9, with identical turn-up scripts and identical configuration files. But this is production, so we've got data changes over time. For instance, the slaves bootstrap with a slapadd -q, and the underlying slapcat output could easily differ between slave4 and slave6 (the most recent one is used automatically). I'd imagine this would look different at the db layer, even after syncrepl eventually converges the logical data?
Do you have the db_stat output from an uncorrupted slave? What about the master?
Sure... https://www.nbcs.rutgers.edu/~richton/its5171_dbstatl2
Judging from the LSNs in use on these other servers, it sure looks like somebody went in and zeroed out your logs on slave4 and slave6. I don't think the environment spontaneously corrupted itself and reset the log offsets...
One more thing to check: use "ls -l" to see whether the actual size of the log files corresponds with the db_stat offsets. E.g. if slave6 base1's log.0000001 is really 8MB but the LSN is only 233KB, then we have to look for a weird in-memory corruption. If instead the file size matches the small offset, then somebody reset your logs.
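If you'd rather script that comparison than eyeball "ls -l", here is a rough sketch in Python. It assumes you copy the current log offset out of the db_stat output by hand; the function names, the verdict strings, and the 64KB slack (to allow for log records that are still buffered and not yet written to the file) are all made up for illustration and are not part of BDB or OpenLDAP.

```python
import os

# Hypothetical slack: tolerate this much disagreement between the on-disk
# size and the LSN offset (e.g. records still sitting in the log buffer).
DEFAULT_SLACK = 64 * 1024


def compare_size_to_lsn(file_size, lsn_offset, slack=DEFAULT_SLACK):
    """Classify how a log file's on-disk size relates to the LSN offset.

    file_size  -- size of the log file in bytes (what "ls -l" shows)
    lsn_offset -- offset part of the current LSN, copied from db_stat
    """
    if file_size > lsn_offset + slack:
        # File holds far more data than the environment admits to:
        # the in-memory corruption case.
        return "file_larger_than_lsn"
    if lsn_offset > file_size + slack:
        # LSN points well past the end of the file on disk.
        return "lsn_beyond_file"
    # Size and offset agree, so the log really is that short.
    return "consistent"


def check_log_file(path, lsn_offset, slack=DEFAULT_SLACK):
    """Same check, reading the size from an actual log file on disk."""
    return compare_size_to_lsn(os.path.getsize(path), lsn_offset, slack)
```

With the numbers from the example above, compare_size_to_lsn(8 * 1024 * 1024, 233 * 1024) comes back as "file_larger_than_lsn", which would point at in-memory corruption; a "consistent" result on a suspiciously small log would instead point at the logs having been reset.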