On Wednesday 05 March 2008 00:49:21 Aaron Richton wrote:
[Gavin says]
Dig the main source. servers/slapd/syncrepl.c and servers/slapd/overlays/syncprov.c
Hmm, wrong source files. Try libraries/liblutil/csn.c, which sayeth:
- These routines are (loosely) based upon draft-ietf-ldup-model-03.txt,
- A WORK IN PROGRESS. The format will likely change.
- The format of a CSN string is: yyyymmddhhmmssz#s#r#c
- where s is a counter of operations within a timeslice, r is
- the replica id (normally zero), and c is a counter of
- modifications within this operation. s, r, and c are
- represented in hex and zero padded to lengths of 6, 3, and
- 6, respectively. (In previous implementations r was only 2 digits.)
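
For anyone scripting against this, here is a rough Python sketch of splitting a CSN string into its parts, following only the comment quoted above. It assumes the trailing "z" of the timestamp is a literal "Z" (UTC), and it accepts either 2 or 3 hex digits for r because of the note about previous implementations. The names parse_csn and CSN are made up for illustration; they are not anything from the OpenLDAP tree.

```python
import re
from datetime import datetime, timezone
from typing import NamedTuple


class CSN(NamedTuple):
    """Hypothetical container for the four CSN fields (not an OpenLDAP type)."""
    timestamp: datetime   # yyyymmddhhmmss, assumed to be UTC
    op_counter: int       # s: counter of operations within a timeslice
    replica_id: int       # r: replica id (normally zero)
    mod_counter: int      # c: counter of modifications within the operation


# s and c are 6 hex digits; r is 3 hex digits (2 in previous implementations).
_CSN_RE = re.compile(
    r"^(\d{14})Z#([0-9a-fA-F]{6})#([0-9a-fA-F]{2,3})#([0-9a-fA-F]{6})$"
)


def parse_csn(text: str) -> CSN:
    """Parse a CSN of the form yyyymmddhhmmssZ#s#r#c, or raise ValueError."""
    match = _CSN_RE.match(text.strip())
    if match is None:
        raise ValueError("not a CSN in the expected format: %r" % text)
    stamp = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return CSN(
        timestamp=stamp.replace(tzinfo=timezone.utc),
        op_counter=int(match.group(2), 16),
        replica_id=int(match.group(3), 16),
        mod_counter=int(match.group(4), 16),
    )
```

Since the result is a plain tuple starting with the timestamp, two parsed CSNs can be compared with the usual < and > operators. As the quoted comment warns, the format is likely to change, so treat the regular expression as a starting point only.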
We use http://www.openldap.org/lists/openldap-software/200602/msg00158.html, maybe with a small mod or two (I forget), to check that contextCSN isn't wedged.
I use: http://staff.telkomsa.net/~bgmilne/hobbit/ . However, I don't have a reliable algorithm for the case where the replica is marginally out of sync (e.g. one change hasn't replicated, the replica is refreshOnly, and the change previous to the one that hasn't replicated was above the threshold for "critical replication delay"). Since some databases have high rates of change (4 mods/sec average), and others don't (1/week average), I get false positives on the more idle databases ...
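
One possible way around the idle-database false positives, sketched in Python purely as an idea and not as what the script above does: instead of measuring the gap between the two contextCSN timestamps, measure how long the provider's newest change has been waiting. The function names, the three state labels and the 5 minute grace period are all assumptions, and the sketch assumes a single contextCSN value per server and synchronized clocks.

```python
from datetime import datetime, timedelta, timezone


def csn_time(csn: str) -> datetime:
    """Return the timestamp part of a CSN string as an aware UTC datetime."""
    stamp = csn.split("#", 1)[0]                 # text before the first '#'
    stamp = stamp.rstrip("Z").split(".", 1)[0]   # drop 'Z' and any fraction
    return datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def replication_state(provider_csn: str, consumer_csn: str, now: datetime,
                      grace: timedelta = timedelta(minutes=5)) -> str:
    """Classify a consumer as 'ok', 'pending' or 'critical' (hypothetical labels).

    The age of the provider's newest change is used, not the distance between
    the two CSNs, so a database that changes once a week is not flagged just
    because its previous change is a week old.
    """
    if provider_csn == consumer_csn:
        return "ok"
    pending_for = now - csn_time(provider_csn)
    return "critical" if pending_for > grace else "pending"
```

For a refreshOnly consumer the grace period would presumably need to be longer than the refresh interval, otherwise every change looks critical until the next poll.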
This only works when the syncrepl thread is completely borked. A better check would be something along the lines of the Net::LDAP ldifdiff to make sure that nothing's different.
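
To make the "nothing's different" idea concrete, here is a small Python sketch of the comparison step only. It is not the Net::LDAP ldifdiff tool, just a stand-in for what such a check would compute. It assumes both data sets have already been loaded into dictionaries mapping DN to {attribute: [values]}, with DNs normalized the same way on both sides; diff_entries is a made-up name.

```python
def diff_entries(provider, consumer):
    """Compare two {dn: {attribute: [values]}} mappings.

    Returns the DNs missing from the consumer, the DNs only the consumer has,
    and the DNs present on both sides whose attributes differ.
    """
    def normalized(entry):
        # Attribute names are case-insensitive and value order is not
        # significant, so fold both before comparing.
        return {attr.lower(): sorted(values) for attr, values in entry.items()}

    missing = sorted(dn for dn in provider if dn not in consumer)
    extra = sorted(dn for dn in consumer if dn not in provider)
    changed = sorted(
        dn for dn in provider
        if dn in consumer and normalized(provider[dn]) != normalized(consumer[dn])
    )
    return {"missing": missing, "extra": extra, "changed": changed}
```

A monitoring plugin could go red when any of the three lists is non-empty. Attribute values are compared as exact strings here, which is stricter than the real matching rules.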
How often would you want to run such a thing, and how long would it take to run? ldapsearch -z0 | grep/wc/awk usually takes a significant amount of CPU time here (orders of magnitude more than slapd does to provide the entire data set).
Of course this has race condition issues (not that we make writes all that often, but on paper at least).
Some of which could be solved by an appropriate search filter?
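
As a guess at what such a filter might look like (a Python sketch, nothing tested against a live server): only compare entries whose modifyTimestamp is older than some settle time, so that writes still in flight are left out of the comparison. The function name and the 5 minute default are assumptions.

```python
from datetime import datetime, timedelta, timezone


def settled_filter(now: datetime, settle: timedelta = timedelta(minutes=5)) -> str:
    """Build an LDAP filter matching entries last modified before now - settle.

    'now' is expected to be a timezone-aware datetime; the cutoff is written
    as a generalized time value in UTC.
    """
    cutoff = (now - settle).astimezone(timezone.utc)
    return "(modifyTimestamp<=%s)" % cutoff.strftime("%Y%m%d%H%M%SZ")
```

For example, with now at 2008-03-05 00:49:21 UTC this gives (modifyTimestamp<=20080305004421Z). It only covers some of the race: an entry changed on the provider inside the window drops out of the provider's result, while the consumer may still return its older copy, so one-sided hits would need a second look before alerting.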
If anybody has something like that as a monitoring plugin, you'd erase one line off my perpetual todo list...
(Yes, that would be of great interest to me. ~93% of syncrepl bugs we've seen involve very very very slight errors that only result in an entry or two being wrong. contextCSN being wrong...we pretty much only see that in the field when tcp keepalives fail to indicate the need for a reconnection.)
There are other possible causes ...
Regards, Buchan