I am a newbie with MMR. I've just set up my first test of this. I have found it to be almost impossible to determine (via the contextCSN values) if the masters (let alone all the replicas) are actually synched.
How are you folks, who are already using MMR, checking/verifying that the MMR participants (and their replicas) are actually in sync?
Thanks,
Frank Swasey <Frank.Swasey@uvm.edu> wrote on 27.05.2016 at 14:45 in message
alpine.OSX.2.20.1605270839380.38351@vc104198.hiz.rqh:
> I am a newbie with MMR. I've just set up my first test of this. I have found it to be almost impossible to determine (via the contextCSN values) if the masters (let alone all the replicas) are actually synched.
>
> How are you folks, who are already using MMR, checking/verifying that the MMR participants (and their replicas) are actually in sync?
Dump, sort, and compare the databases, not to mention using logging (while testing) to see when, and what, actually gets synced.
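For example, roughly like this (the hostnames and suffix below are placeholders, and slapcat has to run locally on each server):

# Rough sketch of the dump/sort/compare approach; ldap1..ldap3 and the
# suffix are placeholders, and slapcat must run locally on each server.
for host in ldap1 ldap2 ldap3; do
    ssh "$host" 'slapcat -b "dc=example,dc=com"' | sort > "/tmp/dump.$host"
done

# Any non-empty diff means the databases differ (a server still catching up
# will also show a differing contextCSN line here).
diff /tmp/dump.ldap1 /tmp/dump.ldap2
diff /tmp/dump.ldap1 /tmp/dump.ldap3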
Today at 9:28am, Ulrich Windl wrote:
> Dump, sort, and compare the databases, not to mention using logging (while testing) to see when, and what, actually gets synced.
That's really not something that a nagios check should be doing...
2016-05-27 16:36 GMT+02:00 Frank Swasey Frank.Swasey@uvm.edu:
>> How are you folks, who are already using MMR, checking/verifying that the MMR participants (and their replicas) are actually in sync?
>
> That's really not something that a nagios check should be doing...
You can use this Nagios script, but it relies on the contextCSN values: http://ltb-project.org/wiki/documentation/nagios-plugins/check_ldap_syncrepl...
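Since it relies on contextCSN, the heart of such a check is roughly the following (hostnames and suffix are placeholders). Keep in mind that with MMR there is one contextCSN value per serverID, so the whole set has to match on every server:

# Hand-rolled sketch of a contextCSN comparison (hostnames and suffix are
# placeholders). With MMR there is one contextCSN value per serverID, so
# all of them have to match on every server before they can be called
# "in sync".
for host in ldap1.example.com ldap2.example.com; do
    printf '%s:\n' "$host"
    ldapsearch -LLL -x -H "ldaps://$host/" -b "dc=example,dc=com" -s base contextCSN \
        | grep '^contextCSN:' | sort
done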
Clément.
Frank Swasey <Frank.Swasey@uvm.edu> wrote on 27.05.2016 at 16:36 in message
alpine.OSX.2.20.1605271036110.38351@vc104198.hiz.rqh:
>> Dump, sort, and compare the databases, not to mention using logging (while testing) to see when, and what, actually gets synced.
>
> That's really not something that a nagios check should be doing...
Then create a fake object, update it periodically, and query it on each server.
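For example (all DNs, credentials and hostnames below are placeholders):

# Write a timestamp into a dedicated canary entry on the current write
# master, then check that every server has picked it up. DN, attribute,
# credentials and hostnames are all placeholders.
STAMP=$(date +%s)
CANARY="cn=repl-canary,dc=example,dc=com"

ldapmodify -x -H ldaps://write-master.example.com/ \
    -D "cn=Manager,dc=example,dc=com" -w secret <<EOF
dn: $CANARY
changetype: modify
replace: description
description: $STAMP
EOF

sleep 5   # give replication a moment

for host in ldap1 ldap2 ldap3 ldap4; do
    seen=$(ldapsearch -LLL -x -H "ldaps://$host.example.com/" \
        -b "$CANARY" -s base description | awk '/^description:/ {print $2}')
    if [ "$seen" = "$STAMP" ]; then
        echo "$host: in sync"
    else
        echo "$host: stale (saw '$seen')"
    fi
done

Run from cron, the same loop doubles as a crude replication-lag alarm.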
On 27/05/2016 13:45, Frank Swasey wrote:
> How are you folks, who are already using MMR, checking/verifying that the MMR participants (and their replicas) are actually in sync?
I check the contextCSN and the dn count. It's not foolproof, but our database is small enough that counting the DNs isn't an appreciable overhead; in many cases doing that wouldn't be desirable or practical.
To verify that replication is really working, I have a process that updates an attribute in a reserved record and verifies that the value and the contextCSN have updated everywhere.
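For what it's worth, the dn-count side of that boils down to something like this (hostnames and suffix are placeholders, and it assumes no size limit truncates the search for the bind used):

# Count entries per server and compare; hostnames and suffix are
# placeholders, and this assumes no server-side size limit truncates the
# search. On a big DIT this full-subtree search is exactly the overhead
# mentioned above.
for host in ldap1 ldap2; do
    count=$(ldapsearch -LLL -x -H "ldaps://$host.example.com/" \
        -b "dc=example,dc=com" -s sub '(objectClass=*)' dn | grep -c '^dn:')
    echo "$host: $count entries"
done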
On Fri, May 27, 2016 at 08:45:28AM -0400, Frank Swasey wrote:
> How are you folks, who are already using MMR, checking/verifying that the MMR participants (and their replicas) are actually in sync?
To clarify, are you directing all writes to one master, or are you actually spreading writes across all of them simultaneously?
I have my systems set up with MMR, but with a load balancer in front such that only one node is ever actually receiving writes. I was going to include the perl script we use to verify they are in sync, but if you're actually running MMR with distributed writes, I'm not sure whether it would work. I've never had an issue with it when writes flipped between nodes (on purpose or otherwise 8-/ ), and I'm sure there have been occasions when two nodes got a few writes at the same time, but I've never had multiple nodes receiving writes simultaneously for an extended period.
Although, thanks to management that believes more in audit check boxes than actual security, we're going to be turning on account lockouts soon, which will generate a write load on each individual server, so I guess I'll get to see what happens to my sync check script in that case :).
Hi,
I'd discussed this in the IRC channel a while back and we were all a bit stumped so I thought it would be worth sending it to the mailing list. If I use delta-syncrepl in an MMR cluster with >2 nodes I get the following behaviour when adding and deleting objects:
Jun 3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: do_syncrep2: rid=033 cookie=rid=033,sid=006,csn=20160603133734.289843Z#000000#006#000000
Jun 3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: slap_queue_csn: queueing 0x4659f80 20160603133734.289843Z#000000#006#000000
Jun 3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: slap_graduate_commit_csn: removing 0x4659bc0 20160603133551.231411Z#000000#006#000000
Jun 3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: slap_queue_csn: queueing 0x4659e80 20160603133734.289843Z#000000#006#000000
Jun 3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: syncprov_matchops: skipping original sid 006
Jun 3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: slap_graduate_commit_csn: removing 0x4659e80 20160603133734.289843Z#000000#006#000000
Jun 3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: slap_graduate_commit_csn: removing 0x4659f80 20160603133734.289843Z#000000#006#000000
Jun 3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: syncrepl_message_to_op: rid=033 be_add cn=marksgroup2,ou=ug,ou=iti,ou=is,dc=authorise,dc=ed,dc=ac,dc=uk (0)
Jun 3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: slap_queue_csn: queueing 0x4659fc0 20160603133734.289843Z#000000#006#000000
Jun 3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: syncprov_sendresp: to=004, cookie=rid=032,sid=005,csn=20160603133734.289843Z#000000#006#000000
Jun 3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: do_syncrep2: rid=031 cookie=rid=031,sid=004
Jun 3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: slap_queue_csn: queueing 0x46599c0 20160603133734.289843Z#000000#006#000000
Jun 3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: slap_queue_csn: queueing 0xb58a140 20160603133734.289843Z#000000#006#000000
Jun 3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: slap_graduate_commit_csn: removing 0x46599c0 20160603133734.289843Z#000000#006#000000
Jun 3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: slap_graduate_commit_csn: removing 0x4659fc0 20160603133734.289843Z#000000#006#000000
Jun 3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: slap_graduate_commit_csn: removing 0xb58a140 20160603133734.289843Z#000000#006#000000
Jun 3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: syncrepl_message_to_op: rid=031 be_add cn=marksgroup2,ou=ug,ou=iti,ou=is,dc=authorise,dc=ed,dc=ac,dc=uk (68)
Jun 3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: do_syncrep2: rid=031 delta-sync lost sync on (reqStart=20160603133734.000001Z,cn=accesslog), switching to REFRESH
Jun 3 14:37:34 oak.authorise.is.ed.ac.uk slapd[29618]: do_syncrep2: rid=031 LDAP_RES_INTERMEDIATE - SYNC_ID_SET
It looks very much like a time sync issue but I've confirmed that my servers are all synced against a single common NTP server and are as closely synced time-wise as could be reasonably expected:
[root@alder ~]# ntpstat
synchronised to NTP server (129.215.205.191) at stratum 3
   time correct to within 39 ms
   polling server every 64 s
[root@beech ~]# ntpstat
synchronised to NTP server (129.215.205.191) at stratum 3
   time correct to within 38 ms
   polling server every 64 s
[root@oak ~]# ntpstat
synchronised to NTP server (129.215.205.191) at stratum 3
   time correct to within 38 ms
   polling server every 64 s
[root@rowan ~]# ntpstat
synchronised to NTP server (129.215.205.191) at stratum 3
   time correct to within 39 ms
   polling server every 64 s
In terms of server config the olcSyncrepl directives are of the form:
olcSyncrepl: {0}rid=31 provider=ldaps://beech.authorise.is.ed.ac.uk/
  realm=EASE.ED.AC.UK bindmethod=sasl saslmech=gssapi
  authcid="replicator.authorise.is.ed.ac.uk@EASE.ED.AC.UK"
  searchbase="dc=authorise,dc=ed,dc=ac,dc=uk" logbase="cn=accesslog"
  logfilter="(&(objectClass=auditWriteObject)(reqResult=0))"
  schemachecking=on type=refreshAndPersist retry="60 +" tls_reqcert=never
  timeout=1 keepalive=300:12:5 syncdata=accesslog
with a line for each of the 4 servers.
and my syncprov config on the database being replicated is:

dn: olcOverlay={2}syncprov,olcDatabase={1}mdb,cn=config
objectClass: olcOverlayConfig
objectClass: olcConfig
objectClass: top
objectClass: olcSyncProvConfig
olcOverlay: {2}syncprov
structuralObjectClass: olcSyncProvConfig
With 2 servers in the cluster it works fine; it also works fine with pull-based replication (refreshOnly), and with traditional (non-delta) syncrepl using refreshAndPersist.
I'm using OpenLDAP 2.4.44 on CentOS 6 and 7 x86_64, with the same behaviour on both.
Any ideas? For now I've gone live with pull-based replication, but it would be nice to get push-based replication working. Also, while I still have the "old" servers sitting there, it makes sense to use them for testing/debugging before retiring them.
--On Friday, June 03, 2016 4:54 PM +0100 Mark Cairney Mark.Cairney@ed.ac.uk wrote:
> Hi,
>
> I'd discussed this in the IRC channel a while back and we were all a bit stumped so I thought it would be worth sending it to the mailing list. If I use delta-syncrepl in an MMR cluster with >2 nodes I get the following behaviour when adding and deleting objects:
Likely http://www.openldap.org/its/index.cgi/?findid=8432
--Quanah
On Fri, Jun 03, 2016 at 04:06:45PM -0700, Quanah Gibson-Mount wrote:
This is a new issue with 2.4.44? We've been running a 4 node MMR system under 2.4.43 that's been very stable and were planning to update to 2.4.44 this summer. Would it be better to hold off on such an update?
Thanks...
Hi,
I see the same behaviour as before in 2.4.43 (log output from one of the servers is attached). That ITS looks similar to what I'm seeing, but not entirely the same: in my case it seems to stabilise after about 5 seconds (which is still not the desired behaviour).
I had a previous issue with 2.4.43 that caused one of my servers to segfault suddenly, so rolling back to it in production isn't an option anyway.
Just in case there's anything amiss in my config my syncprov and accesslog overlays have the following:
# {2}syncprov, {1}mdb, config
dn: olcOverlay={2}syncprov,olcDatabase={1}mdb,cn=config
objectClass: olcOverlayConfig
objectClass: olcConfig
objectClass: top
objectClass: olcSyncProvConfig
olcOverlay: {2}syncprov

# {3}accesslog, {1}mdb, config
dn: olcOverlay={3}accesslog,olcDatabase={1}mdb,cn=config
objectClass: olcOverlayConfig
objectClass: olcAccessLogConfig
olcOverlay: {3}accesslog
olcAccessLogDB: cn=accesslog
olcAccessLogOps: writes
olcAccessLogPurge: 07+00:00 01+00:00
olcAccessLogSuccess: TRUE
And the config on my accesslog DB is:
# {2}mdb, config
dn: olcDatabase={2}mdb,cn=config
objectClass: olcDatabaseConfig
objectClass: olcMdbConfig
olcDatabase: {2}mdb
olcDbDirectory: /usr/local/authz/var/openldap-data/accesslog
olcSuffix: cn=accesslog
olcAccess: {0}to * by dn.exact="uid=replicator.authorise.is.ed.ac.uk,ou=people,ou=central,dc=authorise,dc=ed,dc=ac,dc=uk" write by dn="cn=Manager,dc=authorise,dc=ed,dc=ac,dc=uk" write
olcLimits: {0}dn.exact="uid=replicator.authorise.is.ed.ac.uk,ou=people,ou=central,dc=authorise,dc=ed,dc=ac,dc=uk" time.soft=unlimited time.hard=unlimited size.soft=unlimited size.hard=unlimited
olcLimits: {1}dn.exact="cn=Manager,dc=authorise,dc=ed,dc=ac,dc=uk" time.soft=unlimited time.hard=unlimited size.soft=unlimited size.hard=unlimited
olcRootDN: cn=Manager,cn=accesslog
olcRootPW: <-----SNIP------>
olcDbIndex: default eq
olcDbIndex: entryCSN,objectClass,reqEnd,reqResult,reqStart,reqDN
olcDbMaxReaders: 96
olcDbMaxSize: 32212254720
olcDbMode: 0600
olcDbSearchStack: 16

# {0}syncprov, {2}mdb, config
dn: olcOverlay={0}syncprov,olcDatabase={2}mdb,cn=config
objectClass: olcOverlayConfig
objectClass: olcConfig
objectClass: top
objectClass: olcSyncProvConfig
olcOverlay: {0}syncprov
olcSpNoPresent: TRUE
olcSpReloadHint: TRUE
On 04/06/16 23:01, Paul B. Henson wrote:
> On Fri, Jun 03, 2016 at 04:06:45PM -0700, Quanah Gibson-Mount wrote:
>
> This is a new issue with 2.4.44? We've been running a 4 node MMR system under 2.4.43 that's been very stable and were planning to update to 2.4.44 this summer. Would it be better to hold off on such an update?
>
> Thanks...
--On Saturday, June 04, 2016 4:01 PM -0700 "Paul B. Henson" henson@acm.org wrote:
> On Fri, Jun 03, 2016 at 04:06:45PM -0700, Quanah Gibson-Mount wrote:
>
> This is a new issue with 2.4.44? We've been running a 4 node MMR system under 2.4.43 that's been very stable and were planning to update to 2.4.44 this summer. Would it be better to hold off on such an update?
>
> Thanks...
No, it's not new. The fix for ITS#8432 is now in OpenLDAP head. I backported it to my builds, and the issue went away. Note that, as per Howard's comments, the problem may still exist for those who do not use delta-syncrepl MMR.
--Quanah
On Thu, 2 Jun 2016 at 11:43pm, Paul B. Henson wrote:
> To clarify, are you directing all writes to one master, or are you actually spreading writes across all of them simultaneously?
I am intending that only one will officially be active at a time. However, I am doing that by activating the service IP addresses on the system that I want to be active instead of using a load balancer.
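Concretely, the failover amounts to moving the service address between nodes; something like the sketch below (the address, interface, and arping step are illustrative assumptions rather than our exact procedure):

# Illustrative manual failover (address, interface and arping usage are
# assumptions, not an exact description of the setup described above):
ip addr del 192.0.2.10/24 dev eth0    # on the node giving up the service IP
ip addr add 192.0.2.10/24 dev eth0    # on the node taking over
arping -c 3 -U -I eth0 192.0.2.10     # gratuitous ARP so clients/switches update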
On Tue, Jun 07, 2016 at 08:02:14AM -0400, Frank Swasey wrote:
> I am intending that only one will officially be active at a time. However, I am doing that by activating the service IP addresses on the system that I want to be active instead of using a load balancer.
Cool. In that case, here is the script we've been running for years to check that our 4 MMR servers are synced. It occasionally blips when we do a major idm run that creates or deletes 20-30 thousand accounts, but other than that I rarely if ever get alerts from it. It runs on each of the four servers, and our config management server populates it with the names of the other three to check.
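To give the general shape of it (this is not that script, just an illustrative sketch; the hostnames, suffix and anonymous bind below are assumptions, and config management would fill in the real peer list):

#!/bin/bash
# Not the script referenced above -- just a sketch of the general shape of
# such a check. Peer hostnames, the suffix and the anonymous bind are all
# placeholders/assumptions; config management would fill in PEERS.
SUFFIX="dc=example,dc=com"
PEERS="ldap2.example.com ldap3.example.com ldap4.example.com"

local_csn=$(ldapsearch -LLL -x -H ldapi:/// -b "$SUFFIX" -s base contextCSN \
            | grep '^contextCSN:' | sort)

status=0
for peer in $PEERS; do
    peer_csn=$(ldapsearch -LLL -x -H "ldaps://$peer/" -b "$SUFFIX" -s base contextCSN \
               | grep '^contextCSN:' | sort)
    if [ "$peer_csn" != "$local_csn" ]; then
        echo "CRITICAL: contextCSN mismatch with $peer"
        status=2
    fi
done
[ "$status" -eq 0 ] && echo "OK: contextCSN matches on all peers"
exit "$status"

Exit code 2 maps straight onto a Nagios CRITICAL, which is all most monitoring setups need.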