--On Tuesday, June 29, 2021 6:50 PM +0000 thomaswilliampritchard@gmail.com wrote:
Hi,
I'm experiencing an issue between my 3 providers and multiple consumer setup and delta sync repl. We manage a primary, or active, provider and send all writes to the primary as long as it's healthy letting the two others replicate and be standby providers ready to take over in the event of a failure. All consumers replicate from all providers and all providers replicate from all providers. After the system was running healthily for over a week a standby provider was restarted. This caused my consumers to re-establish the persistent sync connection. Upon re-establishing the connection, some consumers began a sync refresh with the following message.
Jun 28 18:32:55 openldap-hdb-consumer slapd[15746]: do_syncrep1: rid=003 starting refresh (sending cookie=rid=003,csn=20210331192036.214412Z#000000#000#000000;2021011922595 5.133811Z#000000#001#000000;20210128213906.596429Z#000000#002#000000;2021 0226190704.219043Z#000000#005#000000;20210412181659.152626Z#000000#065#00 0000;20210610231714.990702Z#000000#066#000000;20210614191744.122968Z#0000 00#44d#000000;20210412175600.595586Z#000000#835#000000;20210423182110.684 843Z#000000#836#000000;20210331193249.570935Z#000000#ce5#000000) Jun 28 18:32:55 openldap-hdb-consumer slapd[15746]: do_syncrep2: rid=003 LDAP_RES_SEARCH_RESULT Jun 28 18:32:55 openldap-hdb-consumer slapd[15746]: do_syncrep2: rid=003 delta-sync lost sync, switching to REFRESH Jun 28 18:32:55 openldap-hdb-consumer slapd[15746]: do_syncrep2: rid=003 (4096) Content Sync Refresh Required
This was re-establishing a connection with rid=003 which is "20210412175600.595586Z#000000#835#000000" (a standby system) however we have only been sending writes to server #44d# (the primary provider). We see 44d CSN is over 7 days old, beyond our providers access log period. On the consumer that did not trigger sync refresh we see
Jun 28 18:32:55 openldap-hdb-consumer slapd[24439]: do_syncrep1: rid=003 starting refresh (sending cookie=rid=003,csn=20210331192036.214412Z#000000#000#000000;2021011922595 5.133811Z#000000#001#000000;20210128213906.596429Z#000000#002#000000;2021 0226190704.219043Z#000000#005#000000;20210412181659.152626Z#000000#065#00 0000;20210621212459.620195Z#000000#066#000000;20210621214400.407867Z#0000 00#44d#000000;20210412175600.595586Z#000000#835#000000;20210423182110.684 843Z#000000#836#000000;20210331193249.570935Z#000000#ce5#000000) Here we see 20210621214400.407867Z#000000#44d#000000 is much more recent and did not trigger a full resync, although it is close to the 7 day threshold at this point. We notice the rid=003 835 csn is the same as the consumer experiencing the problem which makes me believe the #44d# csn being old is what causes this sync refresh.
I am concerned why when the standby provider is restarted the connection is getting re-established with old provider CSNs, when I search the CSNs on the consumers they look newer than the ones used to reestablish the connection. If we restart slapd on the providers after running consumers for 7 days it seems like it will trigger a sync refresh. How can we make the consumers re-establish the connection with the most recent CSN? Replication is working as expected, just the CSNs seem to remain old in this connection message. The sync refresh behavior causes a large load on the consumers and providers spiking bind times and degrading service making this concerning for our production environment.
The actual age of the CSN is generally immaterial, as long as that is what current CSN on the provider is. I.e., if the CSN on 835 provider *for itself* matches what was on the consumer, that's fine. The real issue seems to be that the consumer stopped recieving updates for CSN 44d, so when the session was bounced for any given provider, the consumer was going to go into REFRESH. What you need to determine is why that consumer stopped receiving updates, as this would trigger a refresh no matter which provider got bounced since none of the providers would have the data available in their accesslog.
I generally advise using some type of monitoring on the CSNs for each server so you can quickly be notified when such an issue has arisen. I would note that your syncrepl configurations do not specify any keepalive settings which is generally recommended so that if some type of network device (load balancers and other traffic management systems do this) closes the syncrepl connection, slapd can detect this and re-establish it.
Regards, Quanah
--
Quanah Gibson-Mount Product Architect Symas Corporation Packaged, certified, and supported LDAP solutions powered by OpenLDAP: http://www.symas.com