I'm using OpenLDAP 2.4.7 on Solaris-10.
I have a database that is part of 2 master servers running in mirror mode and (eventually) some slave servers too. I'm looking for a way to tell if a particular server is "up to date". By that I mean it is synchronized within X seconds with the master that gets updates. The masters will be behind a load-balancer (level 4 switch) and look like a single IP address. I can have the slaves access them directly if need be, but will try to have them talk to the virtual master server.
This is mostly a concern when talking to a slave server since they may not always be able to talk to the master servers because of network partitioning.
Is there something I can query, like contextCSN, to indicate the last time syncrepl successfully finished resyncing the particular database?
Any other way to do this or am I just trying to do something that is impossible?
Any help or pointers would be appreciated. Thanks. Roy
Is there something I can query, like contextCSN, to indicate the last time syncrepl successfully finished resyncing the particular database?
Any other way to do this or am I just trying to do something that is impossible?
See:
http://www.openldap.org/doc/admin24/replication.html#Syncrepl%20Details
"The consumer also stores its replica state, which is the provider's contextCSN received as a synchronization cookie, in the contextCSN attribute of the suffix entry. The replica state maintained by a consumer server is used as the synchronization state indicator when it performs subsequent incremental synchronization with the provider server. It is also used as a provider-side synchronization state indicator when it functions as a secondary provider server in a cascading replication configuration. Since the consumer and provider state information are maintained in the same location within their respective databases, any consumer can be promoted to a provider (and vice versa) without any special actions."
And more detail at:
http://www.rfc-editor.org/rfc/rfc4533.txt
For example:
[ghenry@suretec-master admin]$ ldapsearch -x -H ldap://127.0.0.1 -s 'base' contextCSN
# extended LDIF
#
# LDAPv3
# base <dc=suretecsystems, dc=com> (default) with scope baseObject
# filter: (objectclass=*)
# requesting: contextCSN
#

# suretecsystems.com
dn: dc=suretecsystems,dc=com
contextCSN: 20080228163422.801358Z#000000#000#000000

# search result
search: 2
result: 0 Success

# numResponses: 2
# numEntries: 1

[ghenry@suretec-slave admin]$ ldapsearch -x -H ldap://127.0.0.1 -s 'base' contextCSN
# extended LDIF
#
# LDAPv3
# base <dc=suretecsystems, dc=com> (default) with scope baseObject
# filter: (objectclass=*)
# requesting: contextCSN
#

# suretecsystems.com
dn: dc=suretecsystems,dc=com
contextCSN: 20080228163422.801358Z#000000#000#000000

# search result
search: 2
result: 0 Success

# numResponses: 2
# numEntries: 1
I saw the admin24 section, but thought that the contextCSN was meant to be opaque.
I've looked at rfc4533, but reading sections 2.1.2 through 2.3 doesn't give me enough information to parse the contextCSN value into something I can understand. Do you know where I could get (or write) some code to decode the result? Thanks for your help. Roy
-----Original Message----- From: Gavin Henry [mailto:ghenry@suretecsystems.com] Sent: Tuesday, March 04, 2008 10:49 AM To: Marantz, Roy Cc: openldap-software@openldap.org Subject: Re: Testing the state of replicates
<quote who="Marantz, Roy">
I saw the admin24 section, but thought that the contextCSN was meant to be opaque.
I've looked at rfc4533, but reading sections 2.1.2 through 2.3 doesn't give me enough information to parse the contextCSN value into something I can understand. Do you know where I could get (or write) some code to decode the result?
Dig the main source. servers/slapd/syncrepl.c and servers/slapd/overlays/syncprov.c
Those files plus various header files don't seem to correspond to what is being returned, but as I would expect, to various internal representations of the contextCSN.
contextCSN: 20080304184021.024760Z#000000#001#000000
This seems to be a date and timestamp, YYYYmmDDhhMMss.partOfSecond, followed by ???, then the rid, then ???.
From the code and struct definitions I would have expected to see a sid and some state indicator too.
I went looking in the schema files to see if contextCSN was mentioned with no success.
Obviously I don't know my way around the source code for OpenLDAP, but I'm willing to keep looking if you would give me another hint. I guess I would start trying to find where the contextCSN gets returned in the answer to a query, but this sounds like a difficult task going through lots of layers of the code. Again any help would be appreciated. Thanks. Roy
-----Original Message----- From: Gavin Henry [mailto:ghenry@suretecsystems.com] Sent: Tuesday, March 04, 2008 12:30 PM To: Marantz, Roy Cc: openldap-software@openldap.org Subject: RE: Testing the state of replicates
On Tue, 4 Mar 2008, Marantz, Roy wrote:
Again any help would be appreciated.
Here's a simple perl script I run on my ldap replicas to check their state. Seems to work for me, although I can't guarantee I didn't misunderstand something...
#! /usr/bin/perl

use Net::LDAP ();

my $ldap_master = Net::LDAP->new("ldap-master.csupomona.edu", timeout => 10) or
    do {
        print STDERR "Error: failed to connect to LDAP master: $@\n";
        exit(1);
    };

my $ldap_slave = Net::LDAP->new("localhost", timeout => 10) or
    do {
        print STDERR "Error: failed to connect to LDAP slave: $@\n";
        exit(1);
    };

my $master_csn = lookup_csn($ldap_master, 'master');
my $slave_csn = lookup_csn($ldap_slave, 'slave');

# Only investigate further if the two contextCSN values differ.
if ($master_csn ne $slave_csn) {
    my $prev_master_csn = $master_csn;
    my $prev_slave_csn = $slave_csn;

    # Give replication time to catch up, then look again.
    sleep 120;

    $master_csn = lookup_csn($ldap_master, 'master');
    $slave_csn = lookup_csn($ldap_slave, 'slave');

    if ($master_csn ne $slave_csn) {

        if ($prev_slave_csn ne $slave_csn) {
            # The replica is making progress. Strip everything from the "Z"
            # onward to leave the numeric timestamp portion of each CSN.
            my $master_csn_date = $master_csn; $master_csn_date =~ s/Z.*//;
            my $slave_csn_date = $slave_csn; $slave_csn_date =~ s/Z.*//;

            # Crude threshold: the timestamps are subtracted as plain numbers,
            # so the result is not a count of seconds (1000 corresponds to
            # ten minutes within the same hour).
            if ($master_csn_date - $slave_csn_date > 1000) {
                print STDERR "Warning: LDAP replica out of synchronization\n";
                print STDERR "    master_csn = $master_csn, slave_csn = $slave_csn\n";
                exit(1);
            }
        }
        else {
            # The replica's CSN did not move at all during the wait.
            print STDERR "Error: LDAP replica out of synchronization, no apparent update progress seen\n";
            print STDERR "    master_csn = $master_csn, slave_csn = $slave_csn\n";
            exit(1);
        }
    }
}

# Read the contextCSN attribute from the suffix entry of the given server.
sub lookup_csn {
    my ($ldap, $server_name) = @_;

    my $search = $ldap->search(
        scope  => 'base',
        base   => "dc=csupomona,dc=edu",
        filter => "(objectclass=*)",
        attrs  => [ 'contextCSN' ]
    );

    $search->code() and
        do {
            print STDERR "Error: failed to execute search on $server_name: " .
                         $search->error() . " (" . $search->code() . ")\n";
            exit(1);
        };

    my $entry = $search->shift_entry() or
        do {
            print STDERR "Error: search on $server_name failed to find entry\n";
            exit(1);
        };

    my $csn = $entry->get_value('contextCSN');
    if (!defined($csn)) {
        print STDERR "Error: no contextCSN attribute found in $server_name entry\n";
        exit(1);
    }

    return $csn;
}
[Gavin says]
Dig the main source. servers/slapd/syncrepl.c and servers/slapd/overlays/syncprov.c
Hmm, wrong source files. Try libraries/liblutil/csn.c, which sayeth:
 * These routines are (loosly) based upon draft-ietf-ldup-model-03.txt,
 * A WORK IN PROGRESS. The format will likely change.
 *
 * The format of a CSN string is: yyyymmddhhmmssz#s#r#c
 * where s is a counter of operations within a timeslice, r is
 * the replica id (normally zero), and c is a counter of
 * modifications within this operation. s, r, and c are
 * represented in hex and zero padded to lengths of 6, 3, and
 * 6, respectively. (In previous implementations r was only 2 digits.)
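To make the quoted format concrete, here is a minimal Python sketch that splits a CSN string into its four fields. It is not taken from OpenLDAP. The names `CSN` and `parse_csn` are made up for illustration, and the optional fractional-seconds part is an assumption based on the 2.4 values shown earlier in this thread (e.g. 20080304184021.024760Z), which the comment above does not mention.

```python
import re
from datetime import datetime, timezone
from typing import NamedTuple

# timestamp, optional fraction, "Z", then three hex fields separated by "#".
# The fraction is assumed from the 2.4 sample values quoted in the thread.
CSN_RE = re.compile(
    r"^(?P<ts>\d{14})(?:\.(?P<frac>\d{1,6}))?Z"
    r"#(?P<s>[0-9a-fA-F]+)#(?P<r>[0-9a-fA-F]+)#(?P<c>[0-9a-fA-F]+)$"
)


class CSN(NamedTuple):
    """Hypothetical container for the fields named in the csn.c comment."""
    time: datetime      # yyyymmddhhmmss (UTC)
    op_counter: int     # s: operations within a timeslice
    replica_id: int     # r: replica id
    mod_counter: int    # c: modifications within this operation


def parse_csn(value: str) -> CSN:
    """Split a CSN string into its timestamp and three hex counters."""
    m = CSN_RE.match(value.strip())
    if m is None:
        raise ValueError("not a recognisable CSN: %r" % value)
    when = datetime.strptime(m["ts"], "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    if m["frac"]:
        # Pad to six digits so ".5" becomes 500000 microseconds.
        when = when.replace(microsecond=int(m["frac"].ljust(6, "0")))
    return CSN(when, int(m["s"], 16), int(m["r"], 16), int(m["c"], 16))
```

With the value Roy posted, `parse_csn("20080304184021.024760Z#000000#001#000000")` gives a timestamp of 2008-03-04 18:40:21 UTC and a replica id of 1, which matches his guess that the third field is the rid.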
We use http://www.openldap.org/lists/openldap-software/200602/msg00158.html, maybe with a small mod or two (I forget), to check that contextCSN isn't wedged. This only works when the syncrepl thread is completely borked. A better check would be something along the lines of the Net::LDAP ldifdiff to make sure that nothing's different. Of course this has race condition issues (not that we make writes all that often, but on paper at least). If anybody has something like that as a monitoring plugin, you'd erase one line off my perpetual todo list...
(Yes, that would be of great interest to me. ~93% of syncrepl bugs we've seen involve very very very slight errors that only result in an entry or two being wrong. contextCSN being wrong...we pretty much only see that in the field when tcp keepalives fail to indicate the need for a reconnection.)
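As a rough idea of what such an ldifdiff-style check could look like, here is a Python sketch that compares two directory snapshots held in memory. It is an assumption-laden illustration, not the Net::LDAP tool: `diff_directories` is a hypothetical name, each snapshot is assumed to be a dict mapping an already-normalized DN to a dict of attribute name to list of values, and fetching the snapshots is left out. It has the same race condition mentioned above if writes land between the two fetches.

```python
from typing import Dict, FrozenSet, List, Mapping

# One entry: attribute name -> list of values (hypothetical in-memory shape).
Entry = Mapping[str, List[str]]


def _normalise(entry: Entry) -> Dict[str, FrozenSet[str]]:
    """Treat attribute names as case-insensitive and values as unordered."""
    return {attr.lower(): frozenset(values) for attr, values in entry.items()}


def diff_directories(master: Mapping[str, Entry],
                     replica: Mapping[str, Entry]) -> Dict[str, List[str]]:
    """Report DNs missing from, extra on, or different on the replica."""
    missing = sorted(dn for dn in master if dn not in replica)
    extra = sorted(dn for dn in replica if dn not in master)
    changed = sorted(
        dn for dn in master
        if dn in replica and _normalise(master[dn]) != _normalise(replica[dn])
    )
    return {"missing": missing, "extra": extra, "changed": changed}
```

A monitoring check would alert when any of the three lists is non-empty, which would catch the "an entry or two being wrong" case that comparing contextCSN alone does not.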
<quote who="Aaron Richton">
[Gavin says]
Dig the main source. servers/slapd/syncrepl.c and servers/slapd/overlays/syncprov.c
Hmm, wrong source files. Try libraries/liblutil/csn.c, which sayeth:
- These routines are (loosly) based upon draft-ietf-ldup-model-03.txt,
- A WORK IN PROGRESS. The format will likely change.
- The format of a CSN string is: yyyymmddhhmmssz#s#r#c
- where s is a counter of operations within a timeslice, r is
- the replica id (normally zero), and c is a counter of
- modifications within this operation. s, r, and c are
- represented in hex and zero padded to lengths of 6, 3, and
- 6, respectively. (In previous implementations r was only 2 digits.)
Ah, many thanks.
We use http://www.openldap.org/lists/openldap-software/200602/msg00158.html, maybe with a small mod or two (I forget), to check that contextCSN isn't wedged. This only works when the syncrepl thread is completely borked. A better check would be something along the lines of the Net::LDAP ldifdiff to make sure that nothing's different. Of course this has race condition issues (not that we make writes all that often, but on paper at least). If anybody has something like that as a monitoring plugin, you'd erase one line off my perpetual todo list...
;-) Plugin for what?
(Yes, that would be of great interest to me. ~93% of syncrepl bugs we've seen involve very very very slight errors that only result in an entry or two being wrong. contextCSN being wrong...we pretty much only see that in the field when tcp keepalives fail to indicate the need for a reconnection.)
So the entryCSN would be wrong?
So it seems easy to do this monitoring via some external agent/program. Can I do something (short of writing an overlay) to get this information with an LDAP query? i.e., some query which would give me the difference between the current contextCSN of the machine I'm talking to and that of the master server. AFAICT, the existing overlays won't let me create this kind of synthesized value.
Alternatively, I think I'd be happy with a query to tell me if the server thinks it is having trouble talking to the master. Thanks for everything. Roy
-----Original Message----- From: Gavin Henry [mailto:ghenry@suretecsystems.com] Sent: Wednesday, March 05, 2008 3:28 AM To: Aaron Richton Cc: Marantz, Roy; openldap-software@openldap.org Subject: RE: Testing the state of replicates
Well, all you need is a simple search:
$ ldapsearch -xLLLZZ -s base -b "dc=rutgers,dc=edu" contextCSN
dn: dc=rutgers,dc=edu
contextCSN: 20070613170816Z#000000#00#000000
(no, that's not wedged :)
Try the python script that was in the URI I put in my previous email (if not the actual code, at least the theory). It sounds like you might be missing that this is out of band of slapd -- you don't need an overlay/module/backend or anything, just some basic over-the-network LDAP queries and arithmetic operations to figure out if they're desirable values.
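The "queries and arithmetic" idea can be sketched in a few lines of Python. This is not the script from the URI above, only an illustration under stated assumptions: the function names are made up, `fetch_context_csns` shells out to ldapsearch with the same kind of flags shown in this thread (anonymous simple bind, no StartTLS), and the lag is taken as the difference between the timestamp portions of two CSNs.

```python
import subprocess
from datetime import datetime
from typing import List


def extract_context_csns(ldif: str) -> List[str]:
    """Pull every contextCSN value out of ldapsearch -LLL output."""
    return [
        line.split(":", 1)[1].strip()
        for line in ldif.splitlines()
        if line.lower().startswith("contextcsn:")
    ]


def fetch_context_csns(uri: str, base: str) -> List[str]:
    """Run a base-scope search for contextCSN (hypothetical helper)."""
    result = subprocess.run(
        ["ldapsearch", "-x", "-LLL", "-H", uri, "-s", "base", "-b", base, "contextCSN"],
        check=True, capture_output=True, text=True,
    )
    return extract_context_csns(result.stdout)


def csn_time(csn: str) -> datetime:
    """Timestamp portion of a CSN, i.e. everything before the 'Z'."""
    whole, _, frac = csn.split("Z", 1)[0].partition(".")
    when = datetime.strptime(whole, "%Y%m%d%H%M%S")
    if frac:
        when = when.replace(microsecond=int(frac.ljust(6, "0")[:6]))
    return when


def lag_seconds(master_csn: str, replica_csn: str) -> float:
    """Seconds by which the replica's CSN timestamp trails the master's."""
    return (csn_time(master_csn) - csn_time(replica_csn)).total_seconds()
```

An external check would call `fetch_context_csns` against the master and the replica, then compare `lag_seconds` with whatever "X seconds" threshold is acceptable. Note that this measures the gap between the last changes each server has seen, not how long the replica has actually been behind.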
On Wednesday 05 March 2008 00:49:21 Aaron Richton wrote:
We use http://www.openldap.org/lists/openldap-software/200602/msg00158.html, maybe with a small mod or two (I forget), to check that contextCSN isn't wedged.
I use: http://staff.telkomsa.net/~bgmilne/hobbit/ . However, I don't have a reliable algorithm for the case where the replica is marginally out of sync (e.g. one change hasn't replicated, the replica is refreshOnly, and the change previous to the one that hasn't replicated was above the threshold for "critical replication delay"). Since some databases have high rates of change (4 mods/sec average), and others don't (1/week average), I get false positives on the more idle databases ...
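One possible way around the idle-database false positives, offered only as a sketch and not as a proven algorithm, is to time how long a mismatch has persisted on the monitor's own clock instead of subtracting CSN timestamps. It follows the same idea as the recheck in the perl script earlier in this thread. The class name `StalenessTracker` and its `observe` method are hypothetical, and the tracker has to be kept alive (or its state persisted) between polls.

```python
import time
from typing import Optional


class StalenessTracker:
    """Track how long a replica's contextCSN has differed from the master's."""

    def __init__(self) -> None:
        # Monitor-clock time at which the current mismatch was first seen.
        self._behind_since: Optional[float] = None

    def observe(self, master_csn: str, replica_csn: str,
                now: Optional[float] = None) -> float:
        """Record one poll and return seconds the mismatch has persisted."""
        now = time.time() if now is None else now
        if master_csn == replica_csn:
            # Caught up: forget any earlier mismatch.
            self._behind_since = None
            return 0.0
        if self._behind_since is None:
            self._behind_since = now
        return now - self._behind_since
```

On a database that changes once a week, a single unreplicated change then reports a delay of a few seconds rather than a week, because the age of the previous change never enters the calculation. The trade-off is that the result is only as fine-grained as the polling interval.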
This only works when the syncrepl thread is completely borked. A better check would be something along the lines of the Net::LDAP ldifdiff to make sure that nothing's different.
How often would you want to run such a thing, and how long would it take to run? ldapsearch -z0 | grep/wc/awk/ usually takes a significant amount of CPU time here (orders of magnitude more than slapd does to provide the entire data set).
Of course this has race condition issues (not that we make writes all that often, but on paper at least).
Some of which could be solved by an appropriate search filter?
If anybody has something like that as a monitoring plugin, you'd erase one line off my perpetual todo list...
(Yes, that would be of great interest to me. ~93% of syncrepl bugs we've seen involve very very very slight errors that only result in an entry or two being wrong. contextCSN being wrong...we pretty much only see that in the field when tcp keepalives fail to indicate the need for a reconnection.)
There are other possible causes ...
Regards, Buchan
Marantz, Roy wrote:
Those files plus various header files don't seem to correspond to what is being returned, but as I would expect, to various internal representations of the contextCSN.
contextCSN: 20080304184021.024760Z#000000#001#000000 Seems to be a date and timestamp, YYYYmmDDhhMMss.partOfSecond, followed by ???, then the rid, then ???.
From the code and struct definitions I would have expected to see a sid and some state indicator too.
You're thinking about this too hard.
Use ldapsearch to retrieve the contextCSN attributes from all of the servers. If they match, they are in sync. If not, then not. That *is* the state indicator.
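The comparison described above needs no parsing at all, as this small Python sketch shows. The function names and the input shape (a dict mapping a server name to the contextCSN values read from it) are assumptions for illustration. Values are compared as sets because an entry may carry more than one contextCSN value.

```python
from typing import Iterable, List, Mapping


def servers_in_sync(csns_by_server: Mapping[str, Iterable[str]]) -> bool:
    """True when every server reports the same set of contextCSN values."""
    distinct = {frozenset(values) for values in csns_by_server.values()}
    return len(distinct) <= 1


def servers_behind(csns_by_server: Mapping[str, Iterable[str]],
                   reference: str) -> List[str]:
    """Names of servers whose contextCSN set differs from the reference's."""
    expected = frozenset(csns_by_server[reference])
    return sorted(
        name for name, values in csns_by_server.items()
        if frozenset(values) != expected
    )
```

For example, feeding it the master and slave values from the ldapsearch output earlier in the thread (both 20080228163422.801358Z#000000#000#000000) reports the two servers as in sync.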