Hello,
I've been running a multi-master configuration without any problem for years. These servers run on five Ubuntu 14.04 machines with OpenLDAP 2.4.43.
The configuration is:
dn: olcDatabase={3}mdb,cn=config
...
olcSyncrepl: {0}rid=31 provider=ldap://canis31.um.es binddn=<repl user dn>
 bindmethod=simple credentials=XXXXXXX searchbase=dc=Telematica
 type=refreshAndPersist retry="300 +" timeout=1
olcSyncrepl: {1}rid=32 provider=ldap://canis32.um.es binddn=<repl user dn>
 bindmethod=simple credentials=XXXXXXX searchbase=dc=Telematica
 type=refreshAndPersist retry="300 +" timeout=1
olcSyncrepl: {2}rid=33 provider=ldap://canis33.um.es binddn=<repl user dn>
 bindmethod=simple credentials=XXXXXXX searchbase=dc=Telematica
 type=refreshAndPersist retry="300 +" timeout=1
olcSyncrepl: {3}rid=34 provider=ldap://canis34.um.es binddn=<repl user dn>
 bindmethod=simple credentials=XXXXXXX searchbase=dc=Telematica
 type=refreshAndPersist retry="300 +" timeout=1

dn: olcOverlay={0}dynlist,olcDatabase={3}mdb,cn=config
objectClass: olcDynamicList
objectClass: olcOverlayConfig
objectClass: olcConfig
olcOverlay: {0}dynlist
olcDlAttrSet: {0}labeledURIObject labeledURI

dn: olcOverlay={1}ppolicy,olcDatabase={3}mdb,cn=config
objectClass: olcPPolicyConfig
objectClass: olcOverlayConfig
objectClass: olcConfig
olcOverlay: {1}ppolicy
olcPPolicyDefault: cn=default,ou=policies,dc=Telematica

dn: olcOverlay={2}syncprov,olcDatabase={3}mdb,cn=config
objectClass: olcSyncProvConfig
objectClass: olcOverlayConfig
objectClass: olcConfig
olcOverlay: {2}syncprov
olcSpCheckpoint: 100 10
olcSpSessionlog: 100
where each server has 4 olcSyncrepl attributes pointing to the other servers.
I had no problem with this configuration for years.
Now I'm deploying an update of these servers on a new Ubuntu 18.04 server with OpenLDAP 2.4.47. To synchronize entries between the two farms, I have linked this new server to one of the existing ones (and that one back to the new one).
Configuration in the new one (named canis41) is:
dn: olcDatabase={3}mdb,cn=config
...
olcSyncrepl: {0}rid=39 provider=ldap://canis39.um.es binddn=<repl user dn>
 bindmethod=simple credentials=XXXXXXXX searchbase=dc=Telematica
 type=refreshAndPersist retry="60 +" timeout=1 schemachecking=off scope=sub
olcSyncrepl: {1}rid=42 provider=ldap://canis42.um.es binddn=<repl user dn>
 bindmethod=simple credentials=XXXXXXXX searchbase=dc=Telematica
 type=refreshAndPersist retry="30 +" timeout=1 logbase="cn=log"
 logfilter="(&(objectClass=auditWriteObject)(reqResult=0))"
 schemachecking=on syncdata=accesslog exattrs="pwdFailureTime"

dn: olcOverlay={0}syncprov,olcDatabase={3}mdb,cn=config
objectClass: olcSyncProvConfig
objectClass: olcOverlayConfig
objectClass: olcConfig
olcOverlay: {0}syncprov
olcSpCheckpoint: 100 10
olcSpSessionlog: 100

dn: olcOverlay={1}ppolicy,olcDatabase={3}mdb,cn=config
objectClass: olcPPolicyConfig
objectClass: olcOverlayConfig
objectClass: olcConfig
olcOverlay: {1}ppolicy
olcPPolicyDefault: cn=default,ou=policies,dc=Telematica

dn: olcOverlay={2}dynlist,olcDatabase={3}mdb,cn=config
objectClass: olcDynamicList
objectClass: olcOverlayConfig
objectClass: olcConfig
olcOverlay: {2}dynlist
olcDlAttrSet: {0}labeledURIObject labeledURI
where canis39 is one of the former servers and canis42 is another new server, synchronizing only with canis41.
My problem is that synchronization works, but sometimes modifications made in the canis3x farm take a long time to be replicated to the new server (sometimes on the order of 40-60 minutes).
I have sync-level logging enabled, but I haven't found much information about these log messages. Is there any way to debug this? How?
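For reference, the sync log level can be raised via cn=config; a minimal sketch, assuming slapd listens on ldapi:/// and cn=config is writable through the EXTERNAL SASL mechanism:

ldapmodify -Y EXTERNAL -H ldapi:/// <<'EOF'
dn: cn=config
changetype: modify
replace: olcLogLevel
olcLogLevel: stats sync
EOF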
--On Monday, March 18, 2019 12:21 PM +0100 "Ángel L. Mateo" amateo@um.es wrote:
I have sync-level logging enabled, but I haven't found much information about these log messages. Is there any way to debug this? How?
Unfortunately, you've not really provided any actionable information, so it's hard to say. I can say that I've had MMR deployments for years, using delta-syncrepl without delay issues like you are reporting. There are several examples of setting up N-way delta-syncrepl MMR servers in the tests/data/regressions area, for example the regression tests for ITS8444 and ITS8800.
Regards, Quanah
--
Quanah Gibson-Mount Product Architect Symas Corporation Packaged, certified, and supported LDAP solutions powered by OpenLDAP: http://www.symas.com
On 20/3/19 at 19:26, Quanah Gibson-Mount wrote:
--On Monday, March 18, 2019 12:21 PM +0100 "Ángel L. Mateo" amateo@um.es wrote:
I have sync-level logging enabled, but I haven't found much information about these log messages. Is there any way to debug this? How?
Unfortunately, you've not really provided any actionable information, so it's hard to say. I can say that I've had MMR deployments for years, using delta-syncrepl without delay issues like you are reporting. There are several examples of setting up N-way delta-syncrepl MMR servers in the tests/data/regressions area, for example the regression tests for ITS8444 and ITS8800.
I've been running MMR (not delta) for years without problems, too.
Now the problematic server works fine for days, but then it starts delaying syncs.
The problem I have is that I don't know how to debug it.
Is there any documentation about how to debug it, or about the sync log messages?
On 21/3/19 at 20:26, Michael Ströder wrote:
On 3/21/19 8:22 AM, Ángel L. Mateo wrote:
Now the problematic server works fine for days, but then it starts delaying syncs.
How do you detect this?
Checking the contextCSN attribute on all LDAP servers. I get something like this:
contextCSN: 20190322064915.077600Z#000000#01f#000000
contextCSN: 20190322065006.637604Z#000000#020#000000
contextCSN: 20190322065002.859879Z#000000#021#000000
contextCSN: 20190322065000.303715Z#000000#022#000000
contextCSN: 20190301102558.398349Z#000000#027#000000
contextCSN: 20190314080533.305657Z#000000#029#000000
There is one value for every server. When everything is OK, these values are the same on all servers. But sometimes on the new server they are different, with values older than those on the others.
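One way to collect those values is a base-scope search for the operational attribute contextCSN on each server; a minimal sketch, assuming the suffix is dc=Telematica as in the syncrepl stanzas and that the bind used may read the attribute:

for h in canis39 canis41 canis42; do
  echo "== $h"
  ldapsearch -LLL -x -H "ldap://$h.um.es" -s base -b dc=Telematica contextCSN
done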
On 3/22/19 7:54 AM, Angel L. Mateo wrote:
On 21/3/19 at 20:26, Michael Ströder wrote:
On 3/21/19 8:22 AM, Ángel L. Mateo wrote:
Now the problematic server works fine for days, but then it starts delaying syncs.
How do you detect this?
Checking the contextCSN attribute on all LDAP servers. I get something like this:
contextCSN: 20190322064915.077600Z#000000#01f#000000
contextCSN: 20190322065006.637604Z#000000#020#000000
contextCSN: 20190322065002.859879Z#000000#021#000000
contextCSN: 20190322065000.303715Z#000000#022#000000
contextCSN: 20190301102558.398349Z#000000#027#000000
contextCSN: 20190314080533.305657Z#000000#029#000000
There is one value for every server. When everything is OK, these values are the same on all servers. But sometimes on the new server they are different, with values older than those on the others.
Most of the time this is caused by an OpenLDAP bug. I see this quite often with MMR providers even though the entries have been correctly replicated to the other providers. Hence I asked for your detection method.
I've double-checked the code of my monitoring script very often!
And I'm not the only one seeing this false alarm in the monitoring. E.g. two guys approached me after my OpenLDAP monitoring lightning talk at FOSDEM reporting the same issue. And they use another monitoring tool.
So please check whether changes were correctly replicated instead. Yes, that's nearly impossible if you have many changes on many entries.
I've considered searching for the highest entryCSN value per provider ID (server-side sorting on entryCSN, search limit 1) and comparing it against the accompanying contextCSN value. But during my first superficial tests I only got strange results. I have to investigate further before I can come up with detailed results.
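As a rough sketch of that idea (assuming the provider has the sssvlv overlay loaded so server-side sorting on entryCSN is available, and that admin/search limits allow it), the entry with the highest entryCSN on one server could be fetched like this; splitting the result per provider SID would need additional filtering, and as noted above the approach has so far given strange results:

ldapsearch -LLL -x -H ldap://canis41.um.es -b dc=Telematica -z 1 \
  -E '!sss=-entryCSN' '(objectClass=*)' entryCSN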
Ciao, Michael.
On 22/3/19 at 17:19, Michael Ströder wrote:
Checking the contextCSN attribute on all LDAP servers. I get something like this:
contextCSN: 20190322064915.077600Z#000000#01f#000000
contextCSN: 20190322065006.637604Z#000000#020#000000
contextCSN: 20190322065002.859879Z#000000#021#000000
contextCSN: 20190322065000.303715Z#000000#022#000000
contextCSN: 20190301102558.398349Z#000000#027#000000
contextCSN: 20190314080533.305657Z#000000#029#000000
There is one value for every server. When everything is OK, these values are the same on all servers. But sometimes on the new server they are different, with values older than those on the others.
Most of the time this is caused by an OpenLDAP bug. I see this quite often with MMR providers even though the entries have been correctly replicated to the other providers. Hence I asked for your detection method.
I don't know whether there is a bug related to this (I haven't had any false alerts from my running farm), but when I get this alert I check the latest changes (taken from the accesslog) and they aren't replicated. So the delay is really happening.
--On Thursday, March 21, 2019 9:22 AM +0100 "Ángel L. Mateo" amateo@um.es wrote:
I've been running MMR (not delta) for years without problems, too.
Now the problematic server works fine for days, but then it starts delaying syncs.
The problem I have is that I don't know how to debug it.
Is there any documentation about how to debug it, or about the sync log messages?
I think one of the key differences to understand between syncrepl and delta-syncrepl is that syncrepl will simply overwrite entries that have discrepancies while delta-syncrepl will go into REFRESH mode. Since you started out with an OpenLDAP version that has numerous replication bugs, it may be your masters are not in accordance with one another, and the "delays" you are seeing are actually systems going into refresh.
The first thing you should look at is if there are any instances of REFRESH in your logs, something along the lines of:
do_syncrep2: rid=001 delta-sync lost sync, switching to REFRESH
If you see that, the issue isn't delayed replication, the issue is that your databases are out of sync.
If you don't see instances of REFRESH mode being triggered in your log, we can discuss further what the logs indicate and other things that may need looking at.
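A quick way to scan for that message (a sketch; assumes slapd logs to syslog as /var/log/syslog on Ubuntu, or to the journal under the slapd unit):

grep 'delta-sync lost sync' /var/log/syslog*
journalctl -u slapd | grep 'delta-sync lost sync'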
--Quanah
--
Quanah Gibson-Mount Product Architect Symas Corporation Packaged, certified, and supported LDAP solutions powered by OpenLDAP: http://www.symas.com
On 21/3/19 at 22:51, Quanah Gibson-Mount wrote:
--On Thursday, March 21, 2019 9:22 AM +0100 "Ángel L. Mateo" amateo@um.es wrote:
I've been running MMR (not delta) for years without problems, too.
Now the problematic server works fine for days, but then it starts delaying syncs.
The problem I have is that I don't know how to debug it.
Is there any documentation about how to debug it, or about the sync log messages?
I think one of the key differences to understand between syncrepl and delta-syncrepl is that syncrepl will simply overwrite entries that have discrepancies while delta-syncrepl will go into REFRESH mode. Since you started out with an OpenLDAP version that has numerous replication bugs, it may be your masters are not in accordance with one another, and the "delays" you are seeing are actually systems going into refresh.
replication is using syncrepl, not delta-syncrepl.
The first thing you should look at is if there are any instances of REFRESH in your logs, something along the lines of:
do_syncrep2: rid=001 delta-sync lost sync, switching to REFRESH
If you see that, the issue isn't delayed replication, the issue is that your databases are out of sync.
I can't find that exact message, but I have these in the consumer:
Mar 21 10:00:26 canis41 slapd[68402]: do_syncrep2: rid=039 LDAP_RES_INTERMEDIATE - REFRESH_PRESENT
Mar 21 10:40:56 canis41 slapd[68402]: do_syncrep2: rid=039 LDAP_RES_INTERMEDIATE - REFRESH_PRESENT
Mar 21 15:45:52 canis41 slapd[68402]: do_syncrep2: rid=039 LDAP_RES_INTERMEDIATE - REFRESH_PRESENT
If you don't see instances of REFRESH mode being triggered in your log, we can discuss further what the logs indicate and other things that may need looking at.
--On Friday, March 22, 2019 9:01 AM +0100 "Angel L. Mateo" amateo@um.es wrote:
replication is using syncrepl, not delta-syncrepl.
One of your quoted syncrepl stanzas clearly shows delta-syncrepl. I cannot magically intuit which version you're having problems with.
olcSyncrepl: {1}rid=42 provider=ldap://canis42.um.es binddn=<repl user dn>
 bindmethod=simple credentials=XXXXXXXX searchbase=dc=Telematica
 type=refreshAndPersist retry="30 +" timeout=1 logbase="cn=log"
 logfilter="(&(objectClass=auditWriteObject)(reqResult=0))"
 schemachecking=on syncdata=accesslog exattrs="pwdFailureTime"
                   ^^^^^^^^^^^^^^^^^^
I can't find that exact message, but I have these in the consumer:
Mar 21 10:00:26 canis41 slapd[68402]: do_syncrep2: rid=039 LDAP_RES_INTERMEDIATE - REFRESH_PRESENT
Mar 21 10:40:56 canis41 slapd[68402]: do_syncrep2: rid=039 LDAP_RES_INTERMEDIATE - REFRESH_PRESENT
Mar 21 15:45:52 canis41 slapd[68402]: do_syncrep2: rid=039 LDAP_RES_INTERMEDIATE - REFRESH_PRESENT
So that's not the same thing, and since you're not using delta-syncrepl you wouldn't find the string I was asking about.
--Quanah
--
Quanah Gibson-Mount Product Architect Symas Corporation Packaged, certified, and supported LDAP solutions powered by OpenLDAP: http://www.symas.com
On 22/3/19 at 19:22, Quanah Gibson-Mount wrote:
--On Friday, March 22, 2019 9:01 AM +0100 "Angel L. Mateo" amateo@um.es wrote:
replication is using syncrepl, not delta-syncrepl.
One of your quoted syncrepl stanzas clearly shows delta-syncrepl. I cannot magically intuit which version you're having problems with.
olcSyncrepl: {1}rid=42 provider=ldap://canis42.um.es binddn=<repl user dn>
 bindmethod=simple credentials=XXXXXXXX searchbase=dc=Telematica
 type=refreshAndPersist retry="30 +" timeout=1 logbase="cn=log"
 logfilter="(&(objectClass=auditWriteObject)(reqResult=0))"
 schemachecking=on syncdata=accesslog exattrs="pwdFailureTime"
Maybe I haven't clearly explained the scenario.
* I have a farm with syncrepl running for years without any problem. This farm is running 2.4.43. Servers in this farm are named canis3x.
* I have a new farm with servers named canis41 and canis42 (with 2.4.47), where:
** canis41 uses syncrepl with canis39 (to keep both farms synchronized)
** canis41 and canis42 use delta-syncrepl between them (my idea is to switch off the canis3x farm and then be left with just this delta-syncrepl)
The problem I'm having is with syncrepl between canis39 and canis41.
--On Monday, March 25, 2019 9:34 AM +0100 "Ángel L. Mateo" amateo@um.es wrote:
- I have a farm with syncrepl running for years without any problem. This farm is running 2.4.43. Servers in this farm are named canis3x.
- I have a new farm with servers named canis41 and canis42 (with 2.4.47), where:
** canis41 uses syncrepl with canis39 (to keep both farms synchronized)
** canis41 and canis42 use delta-syncrepl between them (my idea is to switch off the canis3x farm and then be left with just this delta-syncrepl)
The problem I'm having is with syncrepl between canis39 and canis41.
None of your configurations set a keepalive. If you have any sort of load balancer/switch in between, they will often cut idle connections, and replication will stop until slapd is restarted.
I recommend always defining a keepalive interval in the syncrepl stanza to avoid these types of issues.
keepalive=240:10:30
usually works well, unless the device is configured to close idle connections at less than a 5 minute interval.
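For reference, per slapd.conf(5) the three colon-separated values are the idle time in seconds before probes are sent, the number of probes, and the interval between probes, so 240:10:30 starts probing after four idle minutes. A minimal sketch of the result, using the rid=39 consumer stanza quoted earlier with only the extra keyword appended (remember that an olcSyncrepl value has to be rewritten as a whole when modified via cn=config):

olcSyncrepl: {0}rid=39 provider=ldap://canis39.um.es binddn=<repl user dn>
 bindmethod=simple credentials=XXXXXXXX searchbase=dc=Telematica
 type=refreshAndPersist retry="60 +" timeout=1 schemachecking=off scope=sub
 keepalive=240:10:30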
--Quanah
--
Quanah Gibson-Mount Product Architect Symas Corporation Packaged, certified, and supported LDAP solutions powered by OpenLDAP: http://www.symas.com
Quanah Gibson-Mount quanah@symas.com wrote on 25.03.2019 at 19:56 in message <CF0FD0042D3D17C49C943E62@[192.168.1.39]>:
--On Monday, March 25, 2019 9:34 AM +0100 "Ángel L. Mateo" amateo@um.es wrote:
- I have a farm with syncrepl running for years without any problem. This farm is running 2.4.43. Servers in this farm are named canis3x.
- I have a new farm with servers named canis41 and canis42 (with 2.4.47), where:
** canis41 uses syncrepl with canis39 (to keep both farms synchronized)
** canis41 and canis42 use delta-syncrepl between them (my idea is to switch off the canis3x farm and then be left with just this delta-syncrepl)
The problem I'm having is with syncrepl between canis39 and canis41.
None of your configurations set a keepalive. If you have any sort of load balancer/switch in between, they will often cut idle connections, and replication will stop until slapd is restarted.
I recommend always defining a keepalive interval in the syncrepl stanza to avoid these types of issues.
keepalive=240:10:30
usually works well, unless the device is configured to close idle connections at less than a 5 minute interval.
Hi!
I don't understand that: Keepalive (as I know it) has nothing to do with idle connections, but only with dead connections. And what's the magic with 5 minutes?
Regards, Ulrich
--Quanah
--
Quanah Gibson-Mount Product Architect Symas Corporation Packaged, certified, and supported LDAP solutions powered by OpenLDAP: http://www.symas.com
On 26.03.19 at 07:58, Ulrich Windl wrote:
I don't understand that: Keepalive (as I know it) has nothing to do with idle connections, but only with dead connections.
Any stateful firewall will reset/drop/delete/discard a connection's state information from its internal connection table after some time if no traffic was seen (for example, for 5 minutes).
Andreas
"A. Schulze" sca@andreasschulze.de schrieb am 26.03.2019 um 14:59 in
Nachricht 89c3ef23-e670-d4ae-a189-74135643c014@andreasschulze.de:
Am 26.03.19 um 07:58 schrieb Ulrich Windl:
I don't understand that: Keepalive (as I know it) has nothing to do with
idle
connections, but only with dead connections.
Any statefull firewall will reset/drop/delete/discard a connections state information from it's internal connection table after some time if no traffic was seen (for example for 5 minutes)
Actually a bad idea: A few years again I wrote a TCP redirector that had problems: First, I had no timeout, so dead connections could not be detected, so I added a timeout. Second, idle connections timed out, so I added keep-alive. Then I was happy.
It seems LDAP connections here also live for a very long time, sometimes idle for hours...
Regards, Ulrich
--On Tuesday, March 26, 2019 8:58 AM +0100 Ulrich Windl Ulrich.Windl@rz.uni-regensburg.de wrote:
I recommend always defining a keepalive interval in the syncrepl stanza to avoid these types of issues.
keepalive=240:10:30
usually works well, unless the device is configured to close idle connections at less than a 5 minute interval.
Hi!
I don't understand that: Keepalive (as I know it) has nothing to do with idle connections, but only with dead connections. And what's the magic with 5 minutes?
Hello,
Your understanding is generally incorrect. The parameter is called "keep alive" and not "kill dead" for a reason. When using refreshAndPersist connections in an environment where you have a traffic device (such as an F5), the devices are often configured to terminate idle connections after some period of time (which often defaults to 5 minutes). By configuring a keepalive in the syncrepl stanza, the connection appears "alive" to the traffic device even if there is no replication traffic occurring. Thus "keep alive".
There is a *secondary* benefit, in that it does also allow slapd to detect if a connection was severed, and so re-establish a new connection in that case, but the primary purpose is to keep it from ever getting severed in the first place.
If you have no "keepalive" set, and the traffic device severs the connection, slapd cannot detect this, and will think it's still connected and no replication will occur until slapd is restarted.
I.e., the primary purpose is to keep the connection alive.
--Quanah
--
Quanah Gibson-Mount Product Architect Symas Corporation Packaged, certified, and supported LDAP solutions powered by OpenLDAP: http://www.symas.com
Although I'm not using any intermediate device between the LDAP servers (they are all on the same network), I have added the keepalive option to syncrepl, without any result.
On 26/3/19 at 15:06, Quanah Gibson-Mount wrote:
--On Tuesday, March 26, 2019 8:58 AM +0100 Ulrich Windl Ulrich.Windl@rz.uni-regensburg.de wrote:
I recommend always defining a keepalive interval in the syncrepl stanza to avoid these types of issues.
keepalive=240:10:30
usually works well, unless the device is configured to close idle connections at less than a 5 minute interval.
Hi!
I don't understand that: Keepalive (as I know it) has nothing to do with idle connections, but only with dead connections. And what's the magic with 5 minutes?
Hello,
Your understanding is generally incorrect. The parameter is called "keep alive" and not "kill dead" for a reason. When using refreshAndPersist connections in an environment where you have a traffic device (such as an F5), the devices are often configured to terminate idle connections after some period of time (which often defaults to 5 minutes). By configuring a keepalive in the syncrepl stanza, the connection appears "alive" to the traffic device even if there is no replication traffic occurring. Thus "keep alive".
There is a *secondary* benefit, in that it does also allow slapd to detect if a connection was severed, and so re-establish a new connection in that case, but the primary purpose is to keep it from ever getting severed in the first place.
If you have no "keepalive" set, and the traffic device severs the connection, slapd cannot detect this, and will think it's still connected and no replication will occur until slapd is restarted.
I.e., the primary purpose is to keep the connection alive.
--Quanah
--
Quanah Gibson-Mount Product Architect Symas Corporation Packaged, certified, and supported LDAP solutions powered by OpenLDAP: http://www.symas.com
--On Wednesday, March 27, 2019 9:24 AM +0100 "Ángel L. Mateo" amateo@um.es wrote:
Although I'm not using any intermediate device between the LDAP servers (they are all on the same network), I have added the keepalive option to syncrepl, without any result.
Yeah, it's more of a safeguard to ensure one avoids those types of issues.
As for your current situation, I did have a customer the other day who was reporting multi-hour replication delays. But when I looked into it, the issue was not unexpected. I don't know if this has any relevance to your situation, but here is what they were doing.
They had a 4-way MMR setup. They were doing 50k ADDs to each of the masters. This is one of the worst things one could do, because it will massively slow down replication and cause all the servers to get behind due to the contention caused by the massive number of ADD ops occurring on each server while also trying to replicate the ADDs from the other servers. I.e., the best route would be to simply ADD all 150k entries to a single member node and let them replicate out from there.
You've noted before wanting some information on what replication lines mean. It's hard to answer that exactly, as what's important to examine and what isn't important can be difficult to determine without looking at the actual logs and having some knowledge of what's going on, but this may be of some help. Here are some different sets of logs and scenarios. All logs, etc, are based on delta-syncrepl being in use.
Scenario #1:
Skipping a change that's too far in the past. This should have been processed already from the original change on the original master (SID 066). This change request came from SID 0c9.
Mar 25 12:19:23 master1 slapd[29939]: do_syncrep2: rid=201 cookie=rid=201,sid=0c9,csn=20190325191922.072386Z#000000#066#000000
Mar 25 12:19:23 master1 slapd[29939]: do_syncrep2: rid=201 CSN too old, ignoring 20190325191922.072386Z#000000#066#000000 (reqStart=20190325191922.000185Z,cn=deltalog)
Scenario #2:
Applying a change that originated on another server:
Mar 25 12:19:23 master1 slapd[29939]: do_syncrep2: rid=102 cookie=rid=102,sid=066,csn=20190325191923.177761Z#000000#066#000000
Mar 25 12:19:23 master1 slapd[29939]: slap_queue_csn: queueing 0x7f9ab88ce0f0 20190325191923.177761Z#000000#066#000000
Mar 25 12:19:23 master1 slapd[29939]: syncrepl_message_to_op: rid=102 tid bcffc700
Mar 25 12:19:23 master1 slapd[29939]: slap_queue_csn: queueing 0x7f9ab842eda0 20190325191923.177761Z#000000#066#000000
Mar 25 12:19:23 master1 slapd[29939]: syncprov_sendresp: to=12d, cookie=rid=101,sid=065,csn=20190325191923.177761Z#000000#066#000000
Mar 25 12:19:23 master1 slapd[29939]: syncprov_matchops: skipping original sid 066
Mar 25 12:19:23 master1 slapd[29939]: slap_graduate_commit_csn: removing 0x7f9ab842eda0 20190325191923.177761Z#000000#066#000000
Mar 25 12:19:23 master1 slapd[29939]: slap_graduate_commit_csn: removing 0x7f9ab88ce0f0 20190325191923.177761Z#000000#066#000000
Mar 25 12:19:23 master1 slapd[29939]: syncprov_sendresp: to=0c9, cookie=rid=101,sid=065,csn=20190325191923.177761Z#000000#066#000000
Mar 25 12:19:23 master1 slapd[29939]: syncrepl_message_to_op: rid=102 be_add cn=QA_MACHINE_9F06E7769C9B49B49D737414662C9089,ou=machines,dc=example,dc=com (0)
Mar 25 12:19:23 master1 slapd[29939]: slap_queue_csn: queueing 0x7f9ab8a3b930 20190325191923.177761Z#000000#066#000000
Mar 25 12:19:23 master1 slapd[29939]: slap_graduate_commit_csn: removing 0x7f9ab8a3b930 20190325191923.177761Z#000000#066#000000
So in the above, this master (SID065) got a change from another master (SID066). It was not a change it had received yet from 066, so it applies it. It skips sending this change back to the originating server (Thus the "skipping original sid 066" line). It forwards the change to its other listeners (0c9, 12d) in case they do not have a direct link to 066, and commits the change locally.
Scenario #3:
Originating a change:
Mar 25 03:58:24 master1 slapd[29939]: conn=5788689 op=75 ADD dn="cn=ATOM Virtual:843a4b737992,ou=identities,dc=example,dc=com"
Mar 25 03:58:24 master1 slapd[29939]: slap_queue_csn: queueing 0x7f9ab053ed40 20190325105824.517526Z#000000#065#000000
Mar 25 03:58:24 master1 slapd[29939]: slap_queue_csn: queueing 0x7f9ab106c0f0 20190325105824.517526Z#000000#065#000000
Mar 25 03:58:24 master1 slapd[29939]: syncprov_sendresp: to=12d, cookie=rid=101,sid=065,csn=20190325105824.517526Z#000000#065#000000
Mar 25 03:58:24 master1 slapd[29939]: syncprov_sendresp: to=0c9, cookie=rid=101,sid=065,csn=20190325105824.517526Z#000000#065#000000
Mar 25 03:58:24 master1 slapd[29939]: syncprov_sendresp: to=066, cookie=rid=101,sid=065,csn=20190325105824.517526Z#000000#065#000000
Mar 25 03:58:24 master1 slapd[29939]: slap_graduate_commit_csn: removing 0x7f9ab106c0f0 20190325105824.517526Z#000000#065#000000
Mar 25 03:58:24 master1 slapd[29939]: conn=5788689 op=75 RESULT tag=105 err=0 text=
Mar 25 03:58:24 master1 slapd[29939]: slap_graduate_commit_csn: removing 0x7f9ab053ed40 20190325105824.517526Z#000000#065#000000
So in the above, the master originating this change (065) creates the relevant CSN and forwards this change to its listeners (SIDs 12d, 0c9, 066), commits, and cleans up.
To read a CSN value:
20190325105824.517526Z#000000#065#000000
The first part is the time (Zulu) that the change occurred. So in the above case, the change occurred at:
2019/03/25 @ 10:58:24.517526 seconds Zulu time (Server time as we see above was 03:58:24), so this server is 7 hours behind Zulu time.
The server where the modification was made was the one with a SID value of 065.
If you're looking at the contextCSN values on a server, it essentially tells you the last change it processed from each master that it talks to (and its own current change value). If the values aren't matching between servers, then you need to see if those servers are still processing changes from the other masters.
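As an illustration, the per-master values can be pulled apart at the '#' separators to show which provider's CSN a server last processed; a minimal sketch (hostname and suffix taken from earlier in this thread, readability of the operational attribute assumed):

ldapsearch -LLL -x -H ldap://canis41.um.es -s base -b dc=Telematica contextCSN \
  | awk -F'[ #]+' '/^contextCSN:/ {print "sid=" $4, "last_change=" $2}'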
--Quanah
--
Quanah Gibson-Mount Product Architect Symas Corporation Packaged, certified, and supported LDAP solutions powered by OpenLDAP: http://www.symas.com
On 29/3/19 at 18:49, Quanah Gibson-Mount wrote:
--On Wednesday, March 27, 2019 9:24 AM +0100 "Ángel L. Mateo" amateo@um.es wrote:
Although I'm not using any intermediate device between the LDAP servers (they are all on the same network), I have added the keepalive option to syncrepl, without any result.
Yeah, it's more of a safeguard to ensure one avoids those types of issues.
As for your current situation, I did have a customer the other day who was reporting multi-hour replication delays. But when I looked into it, the issue was not unexpected. I don't know if this has any relevance to your situation, but here is what they were doing.
They had a 4-way MMR setup. They were doing 50k ADDs to each of the masters. This is one of the worst things one could do, because it will massively slow down replication and cause all the servers to get behind due to the contention caused by the massive number of ADD ops occurring on each server while also trying to replicate the ADDs from the other servers. I.e., the best route would be to simply ADD all 150k entries to a single member node and let them replicate out from there.
In my case I'm not doing massive ADDs. The only modifications I could have are related to the password policy overlay in a directory of more than 40k users.
--On Monday, April 01, 2019 9:40 AM +0200 "Ángel L. Mateo" amateo@um.es wrote:
In my case I'm not doing massive ADDs. The only modifications I could have are related to the password policy overlay in a directory of more than 40k users.
Hi,
Unfortunately there's been near zero information provided that's actionable on this end, so unless you're willing to be more forthcoming, there's not really any additional assistance I can provide.
Regards, Quanah
--
Quanah Gibson-Mount Product Architect Symas Corporation Packaged, certified, and supported LDAP solutions powered by OpenLDAP: http://www.symas.com
On 25/3/19 at 19:56, Quanah Gibson-Mount wrote:
None of your configurations set a keepalive. If you have any sort of load balancer/switch in between, they will often cut idle connections, and replication will stop until slapd is restarted.
I recommend always defining a keepalive interval in the syncrepl stanza to avoid these types of issues.
keepalive=240:10:30
usually works well, unless the device is configured to close idle connections at less than a 5 minute interval.
Finally, it seems that the problem could be related to connections being closed, although I don't have any network device between the servers.
I have found messages like:
Apr 3 14:31:48 canis41 slapd[122414]: conn=211822 fd=21 closed (idletimeout)
in my logs for replica connections.
I tried to configure keepalive when you suggested it, but I configured it in only one of the two replication connections.
Now I have configured it in both consumers, and since then (yesterday) I have had no closed connections and no replication delays.
Now I have to find out why my system is closing these connections. The difference between the old farm and this new server is the Ubuntu version (from 14.04 to 18.04). I don't know if systemd (used in Ubuntu 18.04) could be the reason.
--On Thursday, April 04, 2019 2:53 PM +0200 "Ángel L. Mateo" amateo@um.es wrote:
Apr 3 14:31:48 canis41 slapd[122414]: conn=211822 fd=21 closed (idletimeout)
in my logs for replica connections.
This indicates that there is an "idletimeout" directive configured and that the "limits" directive has not been configured for the replication DN to bypass that limit. The replication DN generally should not be subject to time or size limits.
--Quanah
--
Quanah Gibson-Mount Product Architect Symas Corporation Packaged, certified, and supported LDAP solutions powered by OpenLDAP: http://www.symas.com
Quanah Gibson-Mount wrote:
--On Thursday, April 04, 2019 2:53 PM +0200 "Ángel L. Mateo" amateo@um.es wrote:
Apr 3 14:31:48 canis41 slapd[122414]: conn=211822 fd=21 closed (idletimeout)
in my logs for replica connections.
This indicates that there is an "idletimeout" directive configured and that the "limits" directive has not been configured for the replication DN to bypass that limit. The replication DN generally should not be subject to time or size limits.
slapd doesn't enforce idletimeout on syncrepl connections. If that message is showing up it must be for some other connection. But if setting keepalive fixes the issue, that still implies that something in the network layer is not forwarding traffic as it should.
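One way to see which connection that idletimeout message actually belonged to is to correlate the conn number with its earlier ACCEPT/BIND lines in the same log; a sketch, assuming stats-level logging and that slapd logs to /var/log/syslog:

grep 'conn=211822 ' /var/log/syslog | grep -E 'ACCEPT|BIND|closed'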
On 4/4/19 at 16:32, Quanah Gibson-Mount wrote:
--On Thursday, April 04, 2019 2:53 PM +0200 "Ángel L. Mateo" amateo@um.es wrote:
Apr 3 14:31:48 canis41 slapd[122414]: conn=211822 fd=21 closed (idletimeout)
in my logs for replica connections.
This indicates that there is an "idletimeout" directive configured and that the "limits" directive has not been configured for the replication DN to bypass that limit. The replication DN generally should not be subject to time or size limits.
I have an idletimeout directive:
dn: cn=config
olcIdleTimeout: 600
but I also have the limits directive for the replica user:
olcLimits: {6}dn.exact=cn=Replica,ou=People,ou=Management,o=SlapdRoot size.hard=unlimited size.soft=unlimited time.hard=unlimited time.soft=unlimited
so, according to Howard Chu, it must be another reason.
--On Friday, April 05, 2019 9:26 AM +0200 "Angel L. Mateo" amateo@um.es wrote:
dn: cn=config
olcIdleTimeout: 600
but I also have the limits directive for the replica user:
olcLimits: {6}dn.exact=cn=Replica,ou=People,ou=Management,o=SlapdRoot size.hard=unlimited size.soft=unlimited time.hard=unlimited time.soft=unlimited
so, according to Howard Chu, it must be another reason.
Yes, apparently so. But it wouldn't be something on the OpenLDAP side that's causing the connections to be broken. I'm glad setting the keepalive parameter resolved the issue.
--Quanah
--
Quanah Gibson-Mount Product Architect Symas Corporation Packaged, certified, and supported LDAP solutions powered by OpenLDAP: http://www.symas.com
On 5/4/19 at 15:23, Quanah Gibson-Mount wrote:
--On Friday, April 05, 2019 9:26 AM +0200 "Angel L. Mateo" amateo@um.es wrote:
dn: cn=config
olcIdleTimeout: 600
but I also have the limits directive for the replica user:
olcLimits: {6}dn.exact=cn=Replica,ou=People,ou=Management,o=SlapdRoot size.hard=unlimited size.soft=unlimited time.hard=unlimited time.soft=unlimited
so, according to Howard Chu, it must be another reason.
Yes, apparently so. But it wouldn't be something on the OpenLDAP side that's causing the connections to be broken.
The only difference between the old servers and the new one, besides the OpenLDAP version, is the OS version (Ubuntu 14.04 vs Ubuntu 18.04). I'm looking for something in the OS closing idle connections, but I can't find it.
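A few places worth checking on the 18.04 box for idle-connection handling; a sketch, noting that the conntrack sysctl only exists when the nf_conntrack module is loaded and that ':ldap' assumes the standard port 389:

# kernel TCP keepalive defaults (only used once SO_KEEPALIVE is enabled on the socket)
sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_probes net.ipv4.tcp_keepalive_intvl
# connection-tracking timeout for established TCP sessions, if netfilter conntrack is in use
sysctl net.netfilter.nf_conntrack_tcp_timeout_established
# established LDAP connections together with their timer state
ss -tno state established '( sport = :ldap or dport = :ldap )'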
On 21/3/19 at 22:51, Quanah Gibson-Mount wrote:
The first thing you should look at is if there are any instances of REFRESH in your logs, something along the lines of:
do_syncrep2: rid=001 delta-sync lost sync, switching to REFRESH
If you see that, the issue isn't delayed replication, the issue is that your databases are out of sync.
Yesterday I had delays with replication and there were the refresh logs I have already sent.
But now I'm having delays again (for the last 10 minutes), and I don't have any refresh log.
--On Friday, March 22, 2019 10:55 AM +0100 "Angel L. Mateo" amateo@um.es wrote:
Yesterday I had delays with replication and there were the refresh logs I have already sent.
But now I'm having delays again (for the last 10 minutes), and I don't have any refresh log.
How are you determining that you have a delay?
--Quanah
--
Quanah Gibson-Mount Product Architect Symas Corporation Packaged, certified, and supported LDAP solutions powered by OpenLDAP: http://www.symas.com
On 3/22/19 7:23 PM, Quanah Gibson-Mount wrote:
--On Friday, March 22, 2019 10:55 AM +0100 "Angel L. Mateo" amateo@um.es wrote:
Yesterday I had delays with replication and there were the refresh logs I have already sent.
But now I'm having delays again (for the last 10 minutes), and I don't have any refresh log.
How are you determining that you have a delay?
He already responded that he's comparing contextCSN values.
See also my response. It's the usual monitoring-shows-contextCSN-diff while replication just works. It's a bug I've been observing for years now.
Ciao, Michael.
--On Friday, March 22, 2019 8:35 PM +0100 Michael Ströder michael@stroeder.com wrote:
See also my response. It's the usual monitoring-shows-contextCSN-diff while replication just works. It's a bug I've been observing for years now.
It's not a bug.
syncprov-checkpoint <ops> <minutes>
  After a write operation has succeeded, write the contextCSN to the underlying database if <ops> write operations or more than <minutes> time have passed since the last checkpoint. Checkpointing is disabled by default.
Note that he has a 10 minute interval set on the checkpoint:
olcSpCheckpoint: 100 10
--Quanah
--
Quanah Gibson-Mount Product Architect Symas Corporation Packaged, certified, and supported LDAP solutions powered by OpenLDAP: http://www.symas.com
On 3/22/19 8:41 PM, Quanah Gibson-Mount wrote:
--On Friday, March 22, 2019 8:35 PM +0100 Michael Ströder michael@stroeder.com wrote:
See also my response. It's the usual monitoring-shows-contextCSN-diff while replication just works. It's a bug I've been observing for years now.
It's not a bug.
syncprov-checkpoint <ops> <minutes>
  After a write operation has succeeded, write the contextCSN to the underlying database if <ops> write operations or more than <minutes> time have passed since the last checkpoint.
Does that really mean that the contextCSN values retrieved via LDAP are not current? That would also break replication.
Up to now I understood this option to be similar to slapd-mdb's checkpoint directive, which would mean the data retrieved via LDAP is current but the DB files might not be fsynced yet.
Also in Æ-DIR the default is:
syncprov-checkpoint 1000 2
So according to the man-page you cited it means the gap must not be greater than two minutes. But it's sometimes several hours and it lasts for hours even if you write again something.
IMO it's simply a bug.
Ciao, Michael.
--On Friday, March 22, 2019 10:05 PM +0100 Michael Ströder michael@stroeder.com wrote:
So according to the man-page you cited it means the gap must not be greater than two minutes. But it's sometimes several hours and it lasts for hours even if you write again something.
IMO it's simply a bug.
I would suggest creating a repeatable test case and filing an ITS then, since one generally does need to be able to validate existing contextCSNs across a cluster. ;)
--Quanah
--
Quanah Gibson-Mount Product Architect Symas Corporation Packaged, certified, and supported LDAP solutions powered by OpenLDAP: http://www.symas.com
On 3/22/19 9:07 PM, Quanah Gibson-Mount wrote:
--On Friday, March 22, 2019 10:05 PM +0100 Michael Ströder michael@stroeder.com wrote:
So according to the man-page you cited it means the gap must not be greater than two minutes. But it's sometimes several hours and it lasts for hours even if you write again something.
IMO it's simply a bug.
I would suggest creating a repeatable test case and filing an ITS then, since one generally does need to be able to validate existing contextCSNs across a cluster. ;)
If it were that easy I would have done it already.
There are no obvious conditions when this happens. It just happens every now and then.
Ciao, Michael.
On 22/3/19 at 19:23, Quanah Gibson-Mount wrote:
--On Friday, March 22, 2019 10:55 AM +0100 "Angel L. Mateo" amateo@um.es wrote:
Yesterday I had delays with replication and there were the refresh logs I have already sent.
But now I'm having delays again (for the last 10 minutes), and I don't have any refresh log.
How are you determining that you have a delay?
I manually get changes from the accesslog database (of the provider LDAP server) and look for them in the consumer LDAP.
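Roughly like this, for example (a sketch: the accesslog suffix cn=log and the filter come from the delta-syncrepl stanza quoted earlier, while the hostnames, the reqStart timestamp, and the example entry DN are placeholders; the accesslog may require an authenticated bind to read):

# on the provider: recent successful writes recorded in the accesslog
ldapsearch -LLL -x -H ldap://canis41.um.es -b cn=log \
  '(&(objectClass=auditWriteObject)(reqResult=0)(reqStart>=20190322060000.000000Z))' \
  reqStart reqType reqDN

# on the consumer: check whether one of those entries already carries the new entryCSN
ldapsearch -LLL -x -H ldap://canis42.um.es -s base -b 'uid=someuser,dc=Telematica' entryCSN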
On 25/3/19 at 8:28, Ángel L. Mateo wrote:
On 22/3/19 at 19:23, Quanah Gibson-Mount wrote:
How are you determining that you have a delay?
I manually get changes from the accesslog database (of the provider LDAP server) and look for them in the consumer LDAP.
To be honest, when the contextCSN values are not synchronized, I check the entryCSN of recently changed entries.
And I don't know if it could be a problem with syncprov-checkpoint, but, as I understand it, that is a parameter for the provider; the provider could have changes not yet written to its contextCSN, but once it writes them, all consumers should have this value replicated, shouldn't they?
In my scenario, all the canis3x servers have the same contextCSN attributes (so they have consolidated them), but canis41 hasn't synced them.