Hi,
I'm experiencing an issue between my 3 providers and multiple consumer setup and delta sync repl. We manage a primary, or active, provider and send all writes to the primary as long as it's healthy letting the two others replicate and be standby providers ready to take over in the event of a failure. All consumers replicate from all providers and all providers replicate from all providers. After the system was running healthily for over a week a standby provider was restarted. This caused my consumers to re-establish the persistent sync connection. Upon re-establishing the connection, some consumers began a sync refresh with the following message.
Jun 28 18:32:55 openldap-hdb-consumer slapd[15746]: do_syncrep1: rid=003 starting refresh (sending cookie=rid=003,csn=20210331192036.214412Z#000000#000#000000;20210119225955.133811Z#000000#001#000000;20210128213906.596429Z#000000#002#000000;20210226190704.219043Z#000000#005#000000;20210412181659.152626Z#000000#065#000000;20210610231714.990702Z#000000#066#000000;20210614191744.122968Z#000000#44d#000000;20210412175600.595586Z#000000#835#000000;20210423182110.684843Z#000000#836#000000;20210331193249.570935Z#000000#ce5#000000) Jun 28 18:32:55 openldap-hdb-consumer slapd[15746]: do_syncrep2: rid=003 LDAP_RES_SEARCH_RESULT Jun 28 18:32:55 openldap-hdb-consumer slapd[15746]: do_syncrep2: rid=003 delta-sync lost sync, switching to REFRESH Jun 28 18:32:55 openldap-hdb-consumer slapd[15746]: do_syncrep2: rid=003 (4096) Content Sync Refresh Required
This was re-establishing a connection with rid=003 which is "20210412175600.595586Z#000000#835#000000" (a standby system) however we have only been sending writes to server #44d# (the primary provider). We see 44d CSN is over 7 days old, beyond our providers access log period. On the consumer that did not trigger sync refresh we see
Jun 28 18:32:55 openldap-hdb-consumer slapd[24439]: do_syncrep1: rid=003 starting refresh (sending cookie=rid=003,csn=20210331192036.214412Z#000000#000#000000;20210119225955.133811Z#000000#001#000000;20210128213906.596429Z#000000#002#000000;20210226190704.219043Z#000000#005#000000;20210412181659.152626Z#000000#065#000000;20210621212459.620195Z#000000#066#000000;20210621214400.407867Z#000000#44d#000000;20210412175600.595586Z#000000#835#000000;20210423182110.684843Z#000000#836#000000;20210331193249.570935Z#000000#ce5#000000) Here we see 20210621214400.407867Z#000000#44d#000000 is much more recent and did not trigger a full resync, although it is close to the 7 day threshold at this point. We notice the rid=003 835 csn is the same as the consumer experiencing the problem which makes me believe the #44d# csn being old is what causes this sync refresh.
I am concerned why when the standby provider is restarted the connection is getting re-established with old provider CSNs, when I search the CSNs on the consumers they look newer than the ones used to reestablish the connection. If we restart slapd on the providers after running consumers for 7 days it seems like it will trigger a sync refresh. How can we make the consumers re-establish the connection with the most recent CSN? Replication is working as expected, just the CSNs seem to remain old in this connection message. The sync refresh behavior causes a large load on the consumers and providers spiking bind times and degrading service making this concerning for our production environment.
Configuration details are below. Your insights are appreciated.
@(#) $OpenLDAP: slapd 2.4.56 (Dec 14 2020 17:31:23) $ @57af0b7ce0ba:/root/rpmbuild/BUILD/openldap-2.4.56/openldap-2.4.56/servers/slapd Provider config Primary DB
# {2}mdb, config dn: olcDatabase={2}mdb,cn=config objectClass: olcDatabaseConfig objectClass: olcMdbConfig olcDatabase: {2}mdb olcDbDirectory: /var/lib/ldap ... olcSizeLimit: unlimited olcSyncrepl: {0}rid=1 provider=ldap://10.20.0.4 bindmethod=simple binddn ="cn=replicant,dc=example,dc=com" credentials="" searchbase="dc=example,dc=com" schemachecking=on type=refresh AndPersist retry="5 10 60 +" sizelimit=unlimited timelimit=unlimited starttls=critical tls_reqcert=demand logbase="cn=accesslog" logfilter=" (&(objectClass=auditWriteObject)(reqResult=0))" syncdata=accesslog olcSyncrepl: {1}rid=2 provider=ldap://10.20.1.151 bindmethod=simple bind dn="cn=replicant,dc=example,dc=com" credentials="" searchbase="dc=example,dc=com" schemachecking=on type=refre shAndPersist retry="5 10 60 +" sizelimit=unlimited timelimit=unlimited starttls=critical tls_reqcert=demand logbase="cn=accesslog" logfilter ="(&(objectClass=auditWriteObject)(reqResult=0))" syncdata=accesslog olcSyncrepl: {2}rid=3 provider=ldap://10.20.2.240 bindmethod=simple bind dn="cn=replicant,dc=example,dc=com" credentials="" searchbase="dc=example,dc=com" schemachecking=on type=refre shAndPersist retry="5 10 60 +" sizelimit=unlimited timelimit=unlimited starttls=critical tls_reqcert=demand logbase="cn=accesslog" logfilter ="(&(objectClass=auditWriteObject)(reqResult=0))" syncdata=accesslog olcMirrorMode: TRUE ... olcDbMaxSize: 15000000000 Sync Prov # {1}syncprov, {2}mdb, config dn: olcOverlay={1}syncprov,olcDatabase={2}mdb,cn=config objectClass: olcOverlayConfig objectClass: olcSyncProvConfig olcOverlay: {1}syncprov olcSpCheckpoint: 1 1 Access Log # {3}accesslog, {2}mdb, config dn: olcOverlay={3}accesslog,olcDatabase={2}mdb,cn=config objectClass: olcOverlayConfig objectClass: olcAccessLogConfig olcOverlay: {3}accesslog olcAccessLogDB: cn=accesslog olcAccessLogOps: writes olcAccessLogPurge: 07+00:00 01+00:00 olcAccessLogSuccess: TRUE # {3}mdb, config dn: olcDatabase={3}mdb,cn=config objectClass: olcDatabaseConfig objectClass: olcMdbConfig olcDatabase: {3}mdb olcDbDirectory: /var/lib/ldap/accesslog olcSuffix: cn=accesslog olcRootDN: cn=admin,dc=example,dc=com olcDbIndex: default eq olcDbIndex: entryCSN,objectClass,reqEnd,reqResult,reqStart olcDbMaxSize: 15000000000 # {0}syncprov, {3}mdb, config dn: olcOverlay={0}syncprov,olcDatabase={3}mdb,cn=config objectClass: olcOverlayConfig objectClass: olcSyncProvConfig olcOverlay: {0}syncprov olcSpNoPresent: TRUE olcSpReloadHint: TRUE
Consumer Config: # {2}hdb, config dn: olcDatabase={2}hdb,cn=config objectClass: olcDatabaseConfig objectClass: olcHdbConfig olcDatabase: {2}hdb olcDbDirectory: /var/lib/ldap olcSuffix: dc=example,dc=com olcRootDN: cn=admin,dc=example,dc=com olcRootPW: {SSHA} olcSyncrepl: {0}rid=1 provider=ldap://10.20.0.4 bindmethod=simple binddn ="cn=replicant,dc=example,dc=com" credentials="" searchbase="dc=example,dc=com" schemachecking=on type=refresh AndPersist retry="60 +" sizelimit=unlimited timelimit=unlimited start tls=critical tls_reqcert=demand logbase="cn=accesslog" logfilter="(&(ob jectClass=auditWriteObject)(reqResult=0))" syncdata=accesslog olcSyncrepl: {1}rid=2 provider=ldap://10.20.1.151 bindmethod=simple bind dn="cn=replicant,dc=example,dc=com" credentials="" searchbase="dc=example,dc=com" schemachecking=on type=refre shAndPersist retry="60 +" sizelimit=unlimited timelimit=unlimited sta rttls=critical tls_reqcert=demand logbase="cn=accesslog" logfilter="(&( objectClass=auditWriteObject)(reqResult=0))" syncdata=accesslog olcSyncrepl: {2}rid=3 provider=ldap://10.20.2.240 bindmethod=simple bind dn="cn=replicant,dc=example,dc=com" credentials="" searchbase="dc=example,dc=com" schemachecking=on type=refre shAndPersist retry="60 +" sizelimit=unlimited timelimit=unlimited sta rttls=critical tls_reqcert=demand logbase="cn=accesslog" logfilter="(&( objectClass=auditWriteObject)(reqResult=0))" syncdata=accesslog olcDbCheckpoint: 102400 10
--On Tuesday, June 29, 2021 6:50 PM +0000 thomaswilliampritchard@gmail.com wrote:
Hi,
I'm experiencing an issue between my 3 providers and multiple consumer setup and delta sync repl. We manage a primary, or active, provider and send all writes to the primary as long as it's healthy letting the two others replicate and be standby providers ready to take over in the event of a failure. All consumers replicate from all providers and all providers replicate from all providers. After the system was running healthily for over a week a standby provider was restarted. This caused my consumers to re-establish the persistent sync connection. Upon re-establishing the connection, some consumers began a sync refresh with the following message.
Jun 28 18:32:55 openldap-hdb-consumer slapd[15746]: do_syncrep1: rid=003 starting refresh (sending cookie=rid=003,csn=20210331192036.214412Z#000000#000#000000;2021011922595 5.133811Z#000000#001#000000;20210128213906.596429Z#000000#002#000000;2021 0226190704.219043Z#000000#005#000000;20210412181659.152626Z#000000#065#00 0000;20210610231714.990702Z#000000#066#000000;20210614191744.122968Z#0000 00#44d#000000;20210412175600.595586Z#000000#835#000000;20210423182110.684 843Z#000000#836#000000;20210331193249.570935Z#000000#ce5#000000) Jun 28 18:32:55 openldap-hdb-consumer slapd[15746]: do_syncrep2: rid=003 LDAP_RES_SEARCH_RESULT Jun 28 18:32:55 openldap-hdb-consumer slapd[15746]: do_syncrep2: rid=003 delta-sync lost sync, switching to REFRESH Jun 28 18:32:55 openldap-hdb-consumer slapd[15746]: do_syncrep2: rid=003 (4096) Content Sync Refresh Required
This was re-establishing a connection with rid=003 which is "20210412175600.595586Z#000000#835#000000" (a standby system) however we have only been sending writes to server #44d# (the primary provider). We see 44d CSN is over 7 days old, beyond our providers access log period. On the consumer that did not trigger sync refresh we see
Jun 28 18:32:55 openldap-hdb-consumer slapd[24439]: do_syncrep1: rid=003 starting refresh (sending cookie=rid=003,csn=20210331192036.214412Z#000000#000#000000;2021011922595 5.133811Z#000000#001#000000;20210128213906.596429Z#000000#002#000000;2021 0226190704.219043Z#000000#005#000000;20210412181659.152626Z#000000#065#00 0000;20210621212459.620195Z#000000#066#000000;20210621214400.407867Z#0000 00#44d#000000;20210412175600.595586Z#000000#835#000000;20210423182110.684 843Z#000000#836#000000;20210331193249.570935Z#000000#ce5#000000) Here we see 20210621214400.407867Z#000000#44d#000000 is much more recent and did not trigger a full resync, although it is close to the 7 day threshold at this point. We notice the rid=003 835 csn is the same as the consumer experiencing the problem which makes me believe the #44d# csn being old is what causes this sync refresh.
I am concerned why when the standby provider is restarted the connection is getting re-established with old provider CSNs, when I search the CSNs on the consumers they look newer than the ones used to reestablish the connection. If we restart slapd on the providers after running consumers for 7 days it seems like it will trigger a sync refresh. How can we make the consumers re-establish the connection with the most recent CSN? Replication is working as expected, just the CSNs seem to remain old in this connection message. The sync refresh behavior causes a large load on the consumers and providers spiking bind times and degrading service making this concerning for our production environment.
The actual age of the CSN is generally immaterial, as long as that is what current CSN on the provider is. I.e., if the CSN on 835 provider *for itself* matches what was on the consumer, that's fine. The real issue seems to be that the consumer stopped recieving updates for CSN 44d, so when the session was bounced for any given provider, the consumer was going to go into REFRESH. What you need to determine is why that consumer stopped receiving updates, as this would trigger a refresh no matter which provider got bounced since none of the providers would have the data available in their accesslog.
I generally advise using some type of monitoring on the CSNs for each server so you can quickly be notified when such an issue has arisen. I would note that your syncrepl configurations do not specify any keepalive settings which is generally recommended so that if some type of network device (load balancers and other traffic management systems do this) closes the syncrepl connection, slapd can detect this and re-establish it.
Regards, Quanah
--
Quanah Gibson-Mount Product Architect Symas Corporation Packaged, certified, and supported LDAP solutions powered by OpenLDAP: http://www.symas.com
Hi Quanah,
Thank you for the considerations. We have systems in place monitoring replication state 24/7. The consumers are checked every minute to ensure they have a very close CSN to the provider CSNs. Manually searching the context CSN confirmed that the CSNs on the provider were newer than the CSN used to re-establish the persistent connection. It seems just the sync-repl service, when re-establishing the dropped delta sync refreshAndPersist connection, is using an old CSN even though the consumer has an updated CSN.
Another example of this occurred today.
Re-establishing persistent connection after provider was restarted.
Jul 6 22:57:37 openldap-hdb-consumer-1 slapd[5749]: do_syncrep1: rid=002 starting refresh (sending cookie=rid=002,csn=20210331192036.214412Z#000000#000#000000;20210119225955.133811Z#000000#001#000000;20210128213906.596429Z#000000#002#000000;20210226190704.219043Z#000000#005#000000;20210412181659.152626Z#000000#065#000000;20210623181421.795352Z#000000#066#000000;20210706153144.905110Z#000000#44d#000000;20210412175600.595586Z#000000#835#000000;20210423182110.684843Z#000000#836#000000;20210331193249.570935Z#000000#ce5#000000)
We see 20210623181421.795352Z#000000#066#000000
Then immediately after a systemctl restart slapd, upon establishing the first connection after restart -
Jul 6 22:57:38 openldap-hdb-consumer-1 slapd[23892]: do_syncrep1: rid=002 starting refresh (sending cookie=rid=002,csn=20210331192036.214412Z#000000#000#000000;20210119225955.133811Z#000000#001#000000;20210128213906.596429Z#000000#002#000000;20210226190704.219043Z#000000#005#000000;20210412181659.152626Z#000000#065#000000;20210706225509.538827Z#000000#066#000000;20210706153144.905110Z#000000#44d#000000;20210412175600.595586Z#000000#835#000000;20210423182110.684843Z#000000#836#000000;20210331193249.570935Z#000000#ce5#000000)
20210706225509.538827Z#000000#066#000000
We see after the system reboots sync repl establishes the connection with a more up to date CSN. It's almost as if sync repl cache's the persistent search message and re-uses it later without checking if the CSNs are updated when it re-establishes the connection.
Does that seem possible?
Correction * Manually searching the context CSN confirmed that the CSNs on the consumers were newer than the CSN used to re-establish the persistent connection.
--On Wednesday, July 7, 2021 1:18 AM +0000 thomaswilliampritchard@gmail.com wrote:
Correction * Manually searching the context CSN confirmed that the CSNs on the consumers were newer than the CSN used to re-establish the persistent connection.
Have you updated yet to 2.4.59 or are you still on 2.4.56? I seem to recall a fix for something like this, but I don't recall if it went into 2.4 or was 2.5 only. I'd definitely update to 2.4.59 as a first step if not there yet.
--Quanah
--
Quanah Gibson-Mount Product Architect Symas Corporation Packaged, certified, and supported LDAP solutions powered by OpenLDAP: http://www.symas.com
openldap-technical@openldap.org