Re: Consumer Delta Sync Lost After Provider Restarted

6 Jul 2021


--On Tuesday, June 29, 2021 6:50 PM +0000 thomaswilliampritchard@gmail.com wrote:
...
Hi,
I'm experiencing an issue with delta-sync replication in a setup of 3
providers and multiple consumers. We manage a primary, or active,
provider and send all writes to it as long as it's healthy, letting the
two others replicate and act as standby providers ready to take over in
the event of a failure. All consumers replicate from all providers and
all providers replicate from all providers. After the system had been
running healthily for over a week, a standby provider was restarted. This
caused my consumers to re-establish the persistent sync connection. Upon
re-establishing the connection, some consumers began a sync refresh with
the following messages.
Jun 28 18:32:55 openldap-hdb-consumer slapd[15746]: do_syncrep1: rid=003 starting refresh (sending cookie=rid=003,csn=20210331192036.214412Z#000000#000#000000;20210119225955.133811Z#000000#001#000000;20210128213906.596429Z#000000#002#000000;20210226190704.219043Z#000000#005#000000;20210412181659.152626Z#000000#065#000000;20210610231714.990702Z#000000#066#000000;20210614191744.122968Z#000000#44d#000000;20210412175600.595586Z#000000#835#000000;20210423182110.684843Z#000000#836#000000;20210331193249.570935Z#000000#ce5#000000)
Jun 28 18:32:55 openldap-hdb-consumer slapd[15746]: do_syncrep2: rid=003 LDAP_RES_SEARCH_RESULT
Jun 28 18:32:55 openldap-hdb-consumer slapd[15746]: do_syncrep2: rid=003 delta-sync lost sync, switching to REFRESH
Jun 28 18:32:55 openldap-hdb-consumer slapd[15746]: do_syncrep2: rid=003 (4096) Content Sync Refresh Required
This was re-establishing a connection with rid=003, which is
"20210412175600.595586Z#000000#835#000000" (a standby system); however,
we have only been sending writes to server #44d# (the primary provider).
We see the 44d CSN is over 7 days old, beyond our providers' access log
period. On the consumer that did not trigger a sync refresh we see:
Jun 28 18:32:55 openldap-hdb-consumer slapd[24439]: do_syncrep1: rid=003 starting refresh (sending cookie=rid=003,csn=20210331192036.214412Z#000000#000#000000;20210119225955.133811Z#000000#001#000000;20210128213906.596429Z#000000#002#000000;20210226190704.219043Z#000000#005#000000;20210412181659.152626Z#000000#065#000000;20210621212459.620195Z#000000#066#000000;20210621214400.407867Z#000000#44d#000000;20210412175600.595586Z#000000#835#000000;20210423182110.684843Z#000000#836#000000;20210331193249.570935Z#000000#ce5#000000)
Here we see 20210621214400.407867Z#000000#44d#000000 is much more recent
and did not trigger a full resync, although it is close to the 7 day
threshold at this point. We notice the rid=003 835 csn is the same as on
the consumer experiencing the problem, which makes me believe the #44d#
csn being old is what causes this sync refresh.
I am concerned that when the standby provider is restarted, the
connection is re-established with old provider CSNs. When I search the
CSNs on the consumers, they look newer than the ones used to re-establish
the connection. If we restart slapd on the providers after the consumers
have been running for 7 days, it seems it will trigger a sync refresh.
How can we make the consumers re-establish the connection with the most
recent CSN? Replication is working as expected; only the CSNs in this
connection message seem to remain old. The sync refresh causes a large
load on the consumers and providers, spiking bind times and degrading
service, which makes this concerning for our production environment.

The actual age of the CSN is generally immaterial, as long as that is what
the current CSN on the provider is.  I.e., if the CSN on the 835 provider
*for itself* matches what was on the consumer, that's fine.  The real issue
seems to be that the consumer stopped receiving updates for CSN 44d, so
when the session was bounced for any given provider, the consumer was going
to go into REFRESH.  What you need to determine is why that consumer
stopped receiving updates, as this would trigger a refresh no matter which
provider got bounced, since none of the providers would have the data
available in their accesslog.
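
To see which server IDs differ between the two consumers, the cookies from
the log lines quoted above can be split per SID and compared.  This is only
a rough sketch: the function names are made up for illustration, and it
assumes the cookie has the "rid=...,csn=...;..." shape shown in the logs,
with each CSN laid out as timestamp#count#sid#mod.

```python
def parse_cookie_csns(cookie: str) -> dict[str, str]:
    """Map each server ID (SID) in a syncrepl cookie to its CSN."""
    # Cookie fields are comma separated, e.g. "rid=003,csn=<csn>;<csn>;..."
    fields = dict(part.split("=", 1) for part in cookie.split(","))
    csns = {}
    for csn in fields["csn"].split(";"):
        # Assumed CSN layout: timestamp#count#sid#mod
        sid = csn.split("#")[2]
        csns[sid] = csn
    return csns


def diff_cookies(a: str, b: str) -> dict[str, tuple]:
    """Return {sid: (csn_in_a, csn_in_b)} for every SID whose CSN differs."""
    ca, cb = parse_cookie_csns(a), parse_cookie_csns(b)
    return {
        sid: (ca.get(sid), cb.get(sid))
        for sid in sorted(set(ca) | set(cb))
        if ca.get(sid) != cb.get(sid)
    }


# CSNs that are identical in both cookies quoted in this thread.
COMMON_HEAD = (
    "20210331192036.214412Z#000000#000#000000;"
    "20210119225955.133811Z#000000#001#000000;"
    "20210128213906.596429Z#000000#002#000000;"
    "20210226190704.219043Z#000000#005#000000;"
    "20210412181659.152626Z#000000#065#000000;"
)
COMMON_TAIL = (
    "20210412175600.595586Z#000000#835#000000;"
    "20210423182110.684843Z#000000#836#000000;"
    "20210331193249.570935Z#000000#ce5#000000"
)

# Cookie sent by the consumer that fell back to REFRESH (slapd[15746]).
refreshing = (
    "rid=003,csn=" + COMMON_HEAD
    + "20210610231714.990702Z#000000#066#000000;"
    + "20210614191744.122968Z#000000#44d#000000;"
    + COMMON_TAIL
)
# Cookie sent by the consumer that stayed in sync (slapd[24439]).
healthy = (
    "rid=003,csn=" + COMMON_HEAD
    + "20210621212459.620195Z#000000#066#000000;"
    + "20210621214400.407867Z#000000#44d#000000;"
    + COMMON_TAIL
)

for sid, (stale, fresh) in diff_cookies(refreshing, healthy).items():
    print(f"SID {sid}: refreshing consumer has {stale}, healthy has {fresh}")
```

Run against the two cookies above, this reports the SIDs where the
refreshing consumer is behind, while the 835 CSN is the same on both.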
I generally advise using some type of monitoring on the CSNs for each
server so you can quickly be notified when such an issue has arisen.  I
would note that your syncrepl configurations do not specify any keepalive
settings, which are generally recommended so that if some type of network
device (load balancers and other traffic management systems do this) closes
the syncrepl connection, slapd can detect this and re-establish it.
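
As a rough illustration of that kind of CSN monitoring, the sketch below
compares per-SID CSN values from a provider and a consumer and flags any
SID where the consumer is behind by more than a tolerance.  It is not an
existing tool: the function names and the 300 second tolerance are
hypothetical, and it assumes the contextCSN values of each server have
already been collected as strings (for example with a base-scope search of
the suffix entry).  The sample values are taken from the cookies in this
thread purely as stand-ins.

```python
from datetime import datetime, timezone


def csn_time(csn: str) -> datetime:
    """Parse the timestamp portion of a CSN, e.g. 20210621214400.407867Z."""
    stamp = csn.split("#", 1)[0]
    parsed = datetime.strptime(stamp, "%Y%m%d%H%M%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def by_sid(csns: list[str]) -> dict[str, str]:
    """Index CSN strings by server ID (assumed layout timestamp#count#sid#mod)."""
    return {csn.split("#")[2]: csn for csn in csns}


def lagging_sids(provider_csns, consumer_csns, tolerance_seconds=300.0):
    """Return {sid: lag_seconds} for SIDs where the consumer trails the provider.

    A SID known to the provider but missing on the consumer maps to None.
    """
    prov, cons = by_sid(provider_csns), by_sid(consumer_csns)
    lag = {}
    for sid, provider_csn in prov.items():
        consumer_csn = cons.get(sid)
        if consumer_csn is None:
            lag[sid] = None
            continue
        delta = (csn_time(provider_csn) - csn_time(consumer_csn)).total_seconds()
        if delta > tolerance_seconds:
            lag[sid] = delta
    return lag


# Stand-in values: the "provider" side reuses the CSNs seen on the healthy
# consumer, the "consumer" side reuses those of the consumer that refreshed.
provider = [
    "20210621214400.407867Z#000000#44d#000000",
    "20210412175600.595586Z#000000#835#000000",
]
consumer = [
    "20210614191744.122968Z#000000#44d#000000",
    "20210412175600.595586Z#000000#835#000000",
]

for sid, seconds in lagging_sids(provider, consumer).items():
    print(f"SID {sid} is behind by {seconds} seconds")
```

Here only SID 44d is flagged; the 835 CSN is old but matches on both sides,
so it is not reported.  A check like this compares a consumer against the
provider rather than against the wall clock.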
Regards,
Quanah
--
Quanah Gibson-Mount
Product Architect
Symas Corporation
Packaged, certified, and supported LDAP solutions powered by OpenLDAP:
http://www.symas.com
