Hi,
I'm experiencing an issue between my 3 providers and multiple consumer setup and delta sync repl. We manage a primary, or active, provider and send all writes to the primary as long as it's healthy letting the two others replicate and be standby providers ready to take over in the event of a failure. All consumers replicate from all providers and all providers replicate from all providers. After the system was running healthily for over a week a standby provider was restarted. This caused my consumers to re-establish the persistent sync connection. Upon re-establishing the connection, some consumers began a sync refresh with the following message.
Jun 28 18:32:55 openldap-hdb-consumer slapd[15746]: do_syncrep1: rid=003 starting refresh (sending cookie=rid=003,csn=20210331192036.214412Z#000000#000#000000;20210119225955.133811Z#000000#001#000000;20210128213906.596429Z#000000#002#000000;20210226190704.219043Z#000000#005#000000;20210412181659.152626Z#000000#065#000000;20210610231714.990702Z#000000#066#000000;20210614191744.122968Z#000000#44d#000000;20210412175600.595586Z#000000#835#000000;20210423182110.684843Z#000000#836#000000;20210331193249.570935Z#000000#ce5#000000)
Jun 28 18:32:55 openldap-hdb-consumer slapd[15746]: do_syncrep2: rid=003 LDAP_RES_SEARCH_RESULT
Jun 28 18:32:55 openldap-hdb-consumer slapd[15746]: do_syncrep2: rid=003 delta-sync lost sync, switching to REFRESH
Jun 28 18:32:55 openldap-hdb-consumer slapd[15746]: do_syncrep2: rid=003 (4096) Content Sync Refresh Required
This was re-establishing a connection with rid=003 which is "20210412175600.595586Z#000000#835#000000" (a standby system)
however we have only been sending writes to server #44d# (the primary provider). We see 44d CSN is over 7 days old, beyond our providers access log period.
On the consumer that did not trigger sync refresh we see
Jun 28 18:32:55 openldap-hdb-consumer slapd[24439]: do_syncrep1: rid=003 starting refresh (sending cookie=rid=003,csn=20210331192036.214412Z#000000#000#000000;20210119225955.133811Z#000000#001#000000;20210128213906.596429Z#000000#002#000000;20210226190704.219043Z#000000#005#000000;20210412181659.152626Z#000000#065#000000;20210621212459.620195Z#000000#066#000000;20210621214400.407867Z#000000#44d#000000;20210412175600.595586Z#000000#835#000000;20210423182110.684843Z#000000#836#000000;20210331193249.570935Z#000000#ce5#000000)
Here we see 20210621214400.407867Z#000000#44d#000000 is much more recent and did not trigger a full resync, although it is close to the 7 day threshold at this point. We notice the rid=003 835 csn is the same as the consumer experiencing the problem which makes me believe the #44d# csn being old is what causes this sync refresh.
I am concerned why when the standby provider is restarted the connection is getting re-established with old provider CSNs, when I search the CSNs on the consumers they look newer than the ones used to reestablish the connection. If we restart slapd on the providers after running consumers for 7 days it seems like it will trigger a sync refresh. How can we make the consumers re-establish the connection with the most recent CSN? Replication is working as expected, just the CSNs seem to remain old in this connection message. The sync refresh behavior causes a large load on the consumers and providers spiking bind times and degrading service making this concerning for our production environment.
Configuration details are below.
Your insights are appreciated.
@(#) $OpenLDAP: slapd 2.4.56 (Dec 14 2020 17:31:23) $
@57af0b7ce0ba:/root/rpmbuild/BUILD/openldap-2.4.56/openldap-2.4.56/servers/slapd
Provider config
Primary DB
# {2}mdb, config
dn: olcDatabase={2}mdb,cn=config
objectClass: olcDatabaseConfig
objectClass: olcMdbConfig
olcDatabase: {2}mdb
olcDbDirectory: /var/lib/ldap
...
olcSizeLimit: unlimited
olcSyncrepl: {0}rid=1 provider=ldap://10.20.0.4 bindmethod=simple binddn
="cn=replicant,dc=example,dc=com" credentials="" searchbase="dc=example,dc=com" schemachecking=on type=refresh
AndPersist retry="5 10 60 +" sizelimit=unlimited timelimit=unlimited
starttls=critical tls_reqcert=demand logbase="cn=accesslog" logfilter="
(&(objectClass=auditWriteObject)(reqResult=0))" syncdata=accesslog
olcSyncrepl: {1}rid=2 provider=ldap://10.20.1.151 bindmethod=simple bind
dn="cn=replicant,dc=example,dc=com" credentials="" searchbase="dc=example,dc=com" schemachecking=on type=refre
shAndPersist retry="5 10 60 +" sizelimit=unlimited timelimit=unlimited
starttls=critical tls_reqcert=demand logbase="cn=accesslog" logfilter
="(&(objectClass=auditWriteObject)(reqResult=0))" syncdata=accesslog
olcSyncrepl: {2}rid=3 provider=ldap://10.20.2.240 bindmethod=simple bind
dn="cn=replicant,dc=example,dc=com" credentials="" searchbase="dc=example,dc=com" schemachecking=on type=refre
shAndPersist retry="5 10 60 +" sizelimit=unlimited timelimit=unlimited
starttls=critical tls_reqcert=demand logbase="cn=accesslog" logfilter
="(&(objectClass=auditWriteObject)(reqResult=0))" syncdata=accesslog
olcMirrorMode: TRUE
...
olcDbMaxSize: 15000000000
Sync Prov
# {1}syncprov, {2}mdb, config
dn: olcOverlay={1}syncprov,olcDatabase={2}mdb,cn=config
objectClass: olcOverlayConfig
objectClass: olcSyncProvConfig
olcOverlay: {1}syncprov
olcSpCheckpoint: 1 1
Access Log
# {3}accesslog, {2}mdb, config
dn: olcOverlay={3}accesslog,olcDatabase={2}mdb,cn=config
objectClass: olcOverlayConfig
objectClass: olcAccessLogConfig
olcOverlay: {3}accesslog
olcAccessLogDB: cn=accesslog
olcAccessLogOps: writes
olcAccessLogPurge: 07+00:00 01+00:00
olcAccessLogSuccess: TRUE
# {3}mdb, config
dn: olcDatabase={3}mdb,cn=config
objectClass: olcDatabaseConfig
objectClass: olcMdbConfig
olcDatabase: {3}mdb
olcDbDirectory: /var/lib/ldap/accesslog
olcSuffix: cn=accesslog
olcRootDN: cn=admin,dc=example,dc=com
olcDbIndex: default eq
olcDbIndex: entryCSN,objectClass,reqEnd,reqResult,reqStart
olcDbMaxSize: 15000000000
# {0}syncprov, {3}mdb, config
dn: olcOverlay={0}syncprov,olcDatabase={3}mdb,cn=config
objectClass: olcOverlayConfig
objectClass: olcSyncProvConfig
olcOverlay: {0}syncprov
olcSpNoPresent: TRUE
olcSpReloadHint: TRUE
Consumer Config:
# {2}hdb, config
dn: olcDatabase={2}hdb,cn=config
objectClass: olcDatabaseConfig
objectClass: olcHdbConfig
olcDatabase: {2}hdb
olcDbDirectory: /var/lib/ldap
olcSuffix: dc=example,dc=com
olcRootDN: cn=admin,dc=example,dc=com
olcRootPW: {SSHA}
olcSyncrepl: {0}rid=1 provider=ldap://10.20.0.4 bindmethod=simple binddn
="cn=replicant,dc=example,dc=com" credentials="" searchbase="dc=example,dc=com" schemachecking=on type=refresh
AndPersist retry="60 +" sizelimit=unlimited timelimit=unlimited start
tls=critical tls_reqcert=demand logbase="cn=accesslog" logfilter="(&(ob
jectClass=auditWriteObject)(reqResult=0))" syncdata=accesslog
olcSyncrepl: {1}rid=2 provider=ldap://10.20.1.151 bindmethod=simple bind
dn="cn=replicant,dc=example,dc=com" credentials="" searchbase="dc=example,dc=com" schemachecking=on type=refre
shAndPersist retry="60 +" sizelimit=unlimited timelimit=unlimited sta
rttls=critical tls_reqcert=demand logbase="cn=accesslog" logfilter="(&(
objectClass=auditWriteObject)(reqResult=0))" syncdata=accesslog
olcSyncrepl: {2}rid=3 provider=ldap://10.20.2.240 bindmethod=simple bind
dn="cn=replicant,dc=example,dc=com" credentials="" searchbase="dc=example,dc=com" schemachecking=on type=refre
shAndPersist retry="60 +" sizelimit=unlimited timelimit=unlimited sta
rttls=critical tls_reqcert=demand logbase="cn=accesslog" logfilter="(&(
objectClass=auditWriteObject)(reqResult=0))" syncdata=accesslog
olcDbCheckpoint: 102400 10