HI!
Are contextCSN values on all replicas really in sync if changes were correctly replicated?
I've implemented a monitoring check used with a normal MMR setup (OpenLDAP 2.4.35, own build on Debian Squeeze) which also checks the contextCSN values on all replicas, compared by server-id.
Sometimes we observe, even in isolated tests, that contextCSN values for a certain server-id differ for quite a while (up to hours) even though the changes coming from that server were definitely replicated to all other replicas. After a while the contextCSN values get suddenly updated. Unfortunately this does not always happen.
Any hint is highly appreciated.
Ciao, Michael.
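A check of the kind described above could be sketched roughly as follows. This is not Michael's actual monitoring code; the function names and the replica names in the usage note are hypothetical. It assumes the usual contextCSN layout of timestamp, change count, server ID and modification count separated by `#`, and it assumes that fixed-width fields let plain string comparison order the CSNs of one server ID.

```python
import re

# Assumed contextCSN layout: <timestamp>Z#<change count>#<server ID>#<mod count>
CSN_RE = re.compile(
    r"^\d{14}\.\d{6}Z#[0-9a-fA-F]{6}#([0-9a-fA-F]{3})#[0-9a-fA-F]{6}$"
)


def csn_by_sid(csn_values):
    """Map server ID -> newest CSN found among one replica's contextCSN values."""
    result = {}
    for raw in csn_values:
        csn = raw.strip()
        match = CSN_RE.match(csn)
        if match is None:
            raise ValueError("unexpected contextCSN format: %r" % raw)
        sid = match.group(1).lower()
        # Fixed-width fields: string comparison is assumed to order CSNs.
        if sid not in result or csn > result[sid]:
            result[sid] = csn
    return result


def find_divergent_sids(replica_csns):
    """Return {sid: {replica: csn or None}} for server IDs whose CSN differs."""
    per_replica = {name: csn_by_sid(vals) for name, vals in replica_csns.items()}
    all_sids = set()
    for mapping in per_replica.values():
        all_sids.update(mapping)
    divergent = {}
    for sid in sorted(all_sids):
        seen = {name: mapping.get(sid) for name, mapping in per_replica.items()}
        if len(set(seen.values())) > 1:
            divergent[sid] = seen
    return divergent
```

Called with something like `find_divergent_sids({"ldap1": [...], "ldap2": [...]})`, an empty result means every replica reports the same contextCSN for every server ID.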
Hi,
On Sat, 10 Aug 2013, Michael Ströder wrote:
> HI!
> Are contextCSN values on all replicas really in sync if changes were correctly replicated?
> I've implemented a monitoring check used with normal MMR setup (OpenLDAP 2.4.35, own build on Debian Squeeze) which also checks the contextCSN values on all replicas compared by server-id.
> Sometimes we observe, even in isolated tests, that contextCSN values for a certain server-id differ for quite a while (up to hours) even though the changes coming from that server were definitely replicated to all other replicas. After a while the contextCSN values get suddenly updated. Unfortunately this does not always happen.
> Any hint is highly appreciated.
I have always suspected that this is due to the specific setting of:
    syncprov-checkpoint <ops> <minutes>
        After a write operation has succeeded, write the contextCSN to the underlying database if <ops> write operations or more than <minutes> time have passed since the last checkpoint. Checkpointing is disabled by default.
Not sure though.
Greetings Christian
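Read literally, the checkpoint rule quoted above is a simple predicate over two counters. The toy model below only restates that wording; it is not slapd's code, the function name is made up, and the boundary handling (at least <ops> operations, strictly more than <minutes>) is an assumption taken from the phrasing of the description. It only models when the contextCSN would be written to the underlying database.

```python
def checkpoint_due(ops_since_last, minutes_since_last, ops, minutes):
    """Toy model of the quoted syncprov-checkpoint rule (not slapd's code).

    ops and minutes stand for the two directive arguments. The comparison
    operators follow the wording "<ops> write operations or more than
    <minutes> time" and are an assumption.
    """
    return ops_since_last >= ops or minutes_since_last > minutes


# Hypothetical setting: "syncprov-checkpoint 100 10"
print(checkpoint_due(ops_since_last=3, minutes_since_last=2, ops=100, minutes=10))   # False
print(checkpoint_due(ops_since_last=3, minutes_since_last=11, ops=100, minutes=10))  # True
```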
Christian Kratzer wrote:
> I have always suspected that this is due to the specific setting of:
>
>     syncprov-checkpoint <ops> <minutes>
>         After a write operation has succeeded, write the contextCSN to the underlying database if <ops> write operations or more than <minutes> time have passed since the last checkpoint. Checkpointing is disabled by default.
Thanks for following-up.
AFAICS the above directive specifies when to write the contextCSN to the DB on disk, similar to the checkpoint directives for DB backends. So if a server crashes, you still have a fairly recent contextCSN with the server-id of that particular server.
But since all replicas are up and running and I query the contextCSN values via LDAP I presume this is not relevant for my problem. Well, one never knows though...
=> will try to play with this (I don't need a high write rate on those systems).
Ciao, Michael.
On Aug 10, 2013, at 11:08 AM, Michael Ströder michael@stroeder.com wrote:
> => will try to play with this (I don't need a high write rate on those systems).
> Ciao, Michael.
I always set the syncprov checkpoint on all servers, replicas or masters.
--Quanah
On Aug 10, 2013, at 12:01 PM, Quanah Gibson-Mount quanah@zimbra.com wrote:
> I always set the syncprov checkpoint on all servers, replicas or masters.
> --Quanah
Correction. I load slapo-syncprov on all replicated db backends. I don't set the checkpoint.
Quanah Gibson-Mount wrote:
> I always set the syncprov checkpoint on all servers, replicas or masters.
> Correction. I load slapo-syncprov on all replicated db backends. I don't set the checkpoint.
As said, this is an MMR setup => slapo-syncprov is loaded on all replicated DB backends.
Anything else that could explain why contextCSN is not updated?
Ciao, Michael.
On Aug 11, 2013, at 7:41 AM, Michael Ströder michael@stroeder.com wrote:
> As said, this is an MMR setup => slapo-syncprov is loaded on all replicated DB backends.
> Anything else that could explain why contextCSN is not updated?
> Ciao, Michael.
Doesn't happen to me with delta-syncrepl based MMR.
--Quanah
Quanah Gibson-Mount wrote:
> Doesn't happen to me with delta-syncrepl based MMR.
Are you continuously comparing the contextCSN values with a monitoring component?
Ciao, Michael.
Christian Kratzer ck-lists@cksoft.de wrote on 10.08.2013 at 19:43 in
message alpine.BSF.2.00.1308101942120.18017@pohjola.cksoft.de:
> I have always suspected that this is due to the specific setting of:
>
>     syncprov-checkpoint <ops> <minutes>
>         After a write operation has succeeded, write the contextCSN to the underlying database if <ops> write operations or more than <minutes> time have passed since the last checkpoint. Checkpointing is disabled by default.
>
> Not sure though.
Hi,
do you "query" by slapcat or by an LDAP search? For the former it's documented that contextCSN is updated lazily. For the latter I'm not sure.
Regards, Ulrich
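For the LDAP-search variant of the query, a monitoring script could read contextCSN from the suffix entry with ldapsearch and parse the LDIF output. The following is only a sketch: the function names, the URI and the suffix are hypothetical, it assumes anonymous read access to the attribute is permitted by the ACLs, and it relies on contextCSN being an operational attribute that has to be requested by name.

```python
import subprocess


def ldapsearch_argv(uri, suffix):
    """Build an ldapsearch command line reading contextCSN from the suffix entry."""
    return [
        "ldapsearch", "-x", "-LLL",
        "-H", uri,
        "-b", suffix,
        "-s", "base",
        "contextCSN",  # operational attribute, so it is requested explicitly
    ]


def parse_context_csns(ldif_text):
    """Extract all contextCSN values from LDIF text."""
    values = []
    for line in ldif_text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "contextcsn":
            values.append(value.strip())
    return values


def fetch_context_csns(uri, suffix):
    """Run ldapsearch against one replica and return its contextCSN values."""
    done = subprocess.run(
        ldapsearch_argv(uri, suffix),
        capture_output=True, text=True, check=True,
    )
    return parse_context_csns(done.stdout)
```

A call such as `fetch_context_csns("ldap://ldap1.example.com", "dc=example,dc=com")` would be repeated for every replica, and the resulting lists compared per server ID.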
Ulrich Windl wrote:
> do you "query" by slapcat or by an LDAP search? For the former it's documented that contextCSN is updated lazily. For the latter I'm not sure.
Using slapcat in a monitoring check invoked every minute would be pretty dumb. So of course I'm retrieving the contextCSN values from replicated DBs via LDAP.
Ciao, Michael.
Hi,
On Mon, 12 Aug 2013, Ulrich Windl wrote: <snipp/>
> Hi,
> do you "query" by slapcat or by an LDAP search? For the former it's documented that contextCSN is updated lazily. For the latter I'm not sure.
I have had the same use case Michael is getting at in the back of my head for some time. I would also like to verify replication status by checking the contextCSN on all servers via an ldapsearch from a monitoring script.
I would expect an ldapsearch of the contextCSN to deliver a current and valid value that should be identical over all servers.
It seems this is not the case. I will run a couple of tests to verify this myself in my testbed of 2 MMR masters and 2 slaves.
Greetings Christian
Christian Kratzer wrote:
> I have had the same use case Michael is getting at in the back of my head for some time. I would also like to verify replication status by checking the contextCSN on all servers via an ldapsearch from a monitoring script.
> I would expect an ldapsearch of the contextCSN to deliver a current and valid value that should be identical over all servers.
> It seems this is not the case. I will run a couple of tests to verify this myself in my testbed of 2 MMR masters and 2 slaves.
Even worse, my results with 3 MMR providers and 2 read-only consumer replicas are sometimes not very pleasant regarding data consistency. At the moment these test servers are running as VMs on an ESX server. I will repeat my tests on real hardware to make sure nothing is wrong because of the virtual environment.
Using delta-sync MMR does not make sense in my case because entries will be completely added, modified or deleted. (I have no influence on the client app writing there.)
Ciao, Michael.
Hi,
On Mon, 12 Aug 2013, Michael Ströder wrote:
> Even worse, my results with 3 MMR providers and 2 read-only consumer replicas are sometimes not very pleasant regarding data consistency. At the moment these test servers are running as VMs on an ESX server. I will repeat my tests on real hardware to make sure nothing is wrong because of the virtual environment.
I have a mixed hardware / virtualisation based environment available. Will check the contextCSN over there and let you know.
> Using delta-sync MMR does not make sense in my case because entries will be completely added, modified or deleted. (I have no influence on the client app writing there.)
Same here. I also like the added bonus of giving the customer a stable and easy recovery roadmap of:
- stop slapd
- rm -f /var/lib/ldap/*
- start slapd
The customer has a relatively small dataset, so replication is a matter of seconds or at least under a couple of minutes.
Greetings Christian
Hi Michael,
On Mon, 12 Aug 2013, Michael Ströder wrote:
> Even worse, my results with 3 MMR providers and 2 read-only consumer replicas are sometimes not very pleasant regarding data consistency. At the moment these test servers are running as VMs on an ESX server. I will repeat my tests on real hardware to make sure nothing is wrong because of the virtual environment.
I just verified the status on 5 relatively identical setups.
One with 3 MMR masters and 2 read only consumers.
The others with 2 MMR masters and 2 read only consumers.
The contextCSN of the data and cn=config databases were all in sync from what I could initially see.
I tried disturbing the peace by running some updates and restarting some slaves but could not immediately produce a difference.
This is all openldap-2.4.35 on CentOS or RedHat EL running on a mix of hardware, VMware and KVM virtualisation.
These are all copies of the same setup with different data so the configuration is the same.
Do you have a checkup script I could deploy against my setups to closely monitor contextCSN values? Perhaps my manual checks were just too slow.
Greetings Christian
On Mon, 12 Aug 2013 12:22:49 +0200 (CEST) Christian Kratzer ck-lists@cksoft.de wrote
> I just verified the status on 5 relatively identical setups.
> One with 3 MMR masters and 2 read only consumers.
> The others with 2 MMR masters and 2 read only consumers.
> The contextCSN of the data and cn=config databases were all in sync from what I could initially see.
Unfortunately this happens only occasionally. I can't reproduce every time.
> Do you have a checkup script I could deploy against my setups to closely monitor contextCSN values? Perhaps my manual checks were just too slow.
If it happens, you can observe it for several minutes. We had some cases where it lasted hours.
Ciao, Michael.
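Since the divergence is said to persist for minutes up to hours, a periodic check could record how long each server ID has been out of sync instead of only reporting a momentary difference. The class below is a hedged sketch of that idea; the class name and its interface are invented for illustration, and it expects the caller to supply the set of currently divergent server IDs on every polling run.

```python
import time


class DivergenceTracker:
    """Remember when each server ID was first seen out of sync."""

    def __init__(self):
        self._first_seen = {}

    def update(self, divergent_sids, now=None):
        """Record one polling result; return {sid: seconds out of sync}."""
        now = time.time() if now is None else now
        current = set(divergent_sids)
        # Forget server IDs that are back in sync.
        for sid in list(self._first_seen):
            if sid not in current:
                del self._first_seen[sid]
        # Start the clock for newly divergent server IDs.
        for sid in current:
            self._first_seen.setdefault(sid, now)
        return {sid: now - started for sid, started in self._first_seen.items()}
```

A check invoked every minute could then raise an alarm only once a duration exceeds some site-specific threshold, which is a policy choice and not something stated in this thread.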
Still seeing this issue with OpenLDAP 2.4.36 that contextCSN values retrieved via LDAP differ for quite a while.
Restarting slapd immediately updates the contextCSN values.
Any idea?
Ciao, Michael.
--On Thursday, September 26, 2013 10:38 AM +0200 Michael Ströder michael@stroeder.com wrote:
> Still seeing this issue with OpenLDAP 2.4.36 that contextCSN values retrieved via LDAP differ for quite a while.
> Restarting slapd immediately updates the contextCSN values.
> Any idea?
Still not seeing this on my servers. :/ So it is not clear to me why you do.
--Quanah
--
Quanah Gibson-Mount
Lead Engineer
Zimbra Software, LLC
--------------------
Zimbra :: the leader in open source messaging and collaboration
On Thu, 26 Sep 2013 08:41:10 -0700 Quanah Gibson-Mount quanah@zimbra.com wrote
> Still not seeing this on my servers. :/ So it is not clear to me why you do.
I've managed to reproduce it deterministically by triggering internal ops in slapo-memberof.
See this ITS: http://www.OpenLDAP.org/its/index.cgi?findid=7710
Ciao, Michael.
--On Thursday, September 26, 2013 5:50 PM +0200 Michael Ströder michael@stroeder.com wrote:
> I've managed to reproduce it deterministically by triggering internal ops in slapo-memberof.
> See this ITS: http://www.OpenLDAP.org/its/index.cgi?findid=7710
Ah, makes sense. I don't use slapo-memberof.
--Quanah
On Thu, 26 Sep 2013 08:41:10 -0700 Quanah Gibson-Mount quanah@zimbra.com wrote
> Still not seeing this on my servers. :/ So it is not clear to me why you do.
This is a really critical issue for this deployment. We want to have correct contextCSN monitoring, but false alarms will make the admins quite unhappy (they are called at night if an alarm goes red).
Also I suspect that this might be the cause of occasional slapd lockups when renaming/moving entries in this MMR setup.
Ciao, Michael.
openldap-technical@openldap.org