Hi,
I'm experiencing an issue between my 3 providers and multiple consumer setup and delta sync repl. We manage a primary, or active, provider and send all writes to the primary as long as it's healthy letting the two others replicate and be standby providers ready to take over in the event of a failure. All consumers replicate from all providers and all providers replicate from all providers. After the system was running healthily for over a week a standby provider was restarted. This caused my consumers to re-establish the persistent sync connection. Upon re-establishing the connection, some consumers began a sync refresh with the following message.
Jun 28 18:32:55 openldap-hdb-consumer slapd[15746]: do_syncrep1: rid=003 starting refresh (sending cookie=rid=003,csn=20210331192036.214412Z#000000#000#000000;20210119225955.133811Z#000000#001#000000;20210128213906.596429Z#000000#002#000000;20210226190704.219043Z#000000#005#000000;20210412181659.152626Z#000000#065#000000;20210610231714.990702Z#000000#066#000000;20210614191744.122968Z#000000#44d#000000;20210412175600.595586Z#000000#835#000000;20210423182110.684843Z#000000#836#000000;20210331193249.570935Z#000000#ce5#000000) Jun 28 18:32:55 openldap-hdb-consumer slapd[15746]: do_syncrep2: rid=003 LDAP_RES_SEARCH_RESULT Jun 28 18:32:55 openldap-hdb-consumer slapd[15746]: do_syncrep2: rid=003 delta-sync lost sync, switching to REFRESH Jun 28 18:32:55 openldap-hdb-consumer slapd[15746]: do_syncrep2: rid=003 (4096) Content Sync Refresh Required
This was re-establishing a connection with rid=003 which is "20210412175600.595586Z#000000#835#000000" (a standby system) however we have only been sending writes to server #44d# (the primary provider). We see 44d CSN is over 7 days old, beyond our providers access log period. On the consumer that did not trigger sync refresh we see
Jun 28 18:32:55 openldap-hdb-consumer slapd[24439]: do_syncrep1: rid=003 starting refresh (sending cookie=rid=003,csn=20210331192036.214412Z#000000#000#000000;20210119225955.133811Z#000000#001#000000;20210128213906.596429Z#000000#002#000000;20210226190704.219043Z#000000#005#000000;20210412181659.152626Z#000000#065#000000;20210621212459.620195Z#000000#066#000000;20210621214400.407867Z#000000#44d#000000;20210412175600.595586Z#000000#835#000000;20210423182110.684843Z#000000#836#000000;20210331193249.570935Z#000000#ce5#000000) Here we see 20210621214400.407867Z#000000#44d#000000 is much more recent and did not trigger a full resync, although it is close to the 7 day threshold at this point. We notice the rid=003 835 csn is the same as the consumer experiencing the problem which makes me believe the #44d# csn being old is what causes this sync refresh.
I am concerned why when the standby provider is restarted the connection is getting re-established with old provider CSNs, when I search the CSNs on the consumers they look newer than the ones used to reestablish the connection. If we restart slapd on the providers after running consumers for 7 days it seems like it will trigger a sync refresh. How can we make the consumers re-establish the connection with the most recent CSN? Replication is working as expected, just the CSNs seem to remain old in this connection message. The sync refresh behavior causes a large load on the consumers and providers spiking bind times and degrading service making this concerning for our production environment.
Configuration details are below. Your insights are appreciated.
@(#) $OpenLDAP: slapd 2.4.56 (Dec 14 2020 17:31:23) $ @57af0b7ce0ba:/root/rpmbuild/BUILD/openldap-2.4.56/openldap-2.4.56/servers/slapd Provider config Primary DB
# {2}mdb, config dn: olcDatabase={2}mdb,cn=config objectClass: olcDatabaseConfig objectClass: olcMdbConfig olcDatabase: {2}mdb olcDbDirectory: /var/lib/ldap ... olcSizeLimit: unlimited olcSyncrepl: {0}rid=1 provider=ldap://10.20.0.4 bindmethod=simple binddn ="cn=replicant,dc=example,dc=com" credentials="" searchbase="dc=example,dc=com" schemachecking=on type=refresh AndPersist retry="5 10 60 +" sizelimit=unlimited timelimit=unlimited starttls=critical tls_reqcert=demand logbase="cn=accesslog" logfilter=" (&(objectClass=auditWriteObject)(reqResult=0))" syncdata=accesslog olcSyncrepl: {1}rid=2 provider=ldap://10.20.1.151 bindmethod=simple bind dn="cn=replicant,dc=example,dc=com" credentials="" searchbase="dc=example,dc=com" schemachecking=on type=refre shAndPersist retry="5 10 60 +" sizelimit=unlimited timelimit=unlimited starttls=critical tls_reqcert=demand logbase="cn=accesslog" logfilter ="(&(objectClass=auditWriteObject)(reqResult=0))" syncdata=accesslog olcSyncrepl: {2}rid=3 provider=ldap://10.20.2.240 bindmethod=simple bind dn="cn=replicant,dc=example,dc=com" credentials="" searchbase="dc=example,dc=com" schemachecking=on type=refre shAndPersist retry="5 10 60 +" sizelimit=unlimited timelimit=unlimited starttls=critical tls_reqcert=demand logbase="cn=accesslog" logfilter ="(&(objectClass=auditWriteObject)(reqResult=0))" syncdata=accesslog olcMirrorMode: TRUE ... olcDbMaxSize: 15000000000 Sync Prov # {1}syncprov, {2}mdb, config dn: olcOverlay={1}syncprov,olcDatabase={2}mdb,cn=config objectClass: olcOverlayConfig objectClass: olcSyncProvConfig olcOverlay: {1}syncprov olcSpCheckpoint: 1 1 Access Log # {3}accesslog, {2}mdb, config dn: olcOverlay={3}accesslog,olcDatabase={2}mdb,cn=config objectClass: olcOverlayConfig objectClass: olcAccessLogConfig olcOverlay: {3}accesslog olcAccessLogDB: cn=accesslog olcAccessLogOps: writes olcAccessLogPurge: 07+00:00 01+00:00 olcAccessLogSuccess: TRUE # {3}mdb, config dn: olcDatabase={3}mdb,cn=config objectClass: olcDatabaseConfig objectClass: olcMdbConfig olcDatabase: {3}mdb olcDbDirectory: /var/lib/ldap/accesslog olcSuffix: cn=accesslog olcRootDN: cn=admin,dc=example,dc=com olcDbIndex: default eq olcDbIndex: entryCSN,objectClass,reqEnd,reqResult,reqStart olcDbMaxSize: 15000000000 # {0}syncprov, {3}mdb, config dn: olcOverlay={0}syncprov,olcDatabase={3}mdb,cn=config objectClass: olcOverlayConfig objectClass: olcSyncProvConfig olcOverlay: {0}syncprov olcSpNoPresent: TRUE olcSpReloadHint: TRUE
Consumer Config: # {2}hdb, config dn: olcDatabase={2}hdb,cn=config objectClass: olcDatabaseConfig objectClass: olcHdbConfig olcDatabase: {2}hdb olcDbDirectory: /var/lib/ldap olcSuffix: dc=example,dc=com olcRootDN: cn=admin,dc=example,dc=com olcRootPW: {SSHA} olcSyncrepl: {0}rid=1 provider=ldap://10.20.0.4 bindmethod=simple binddn ="cn=replicant,dc=example,dc=com" credentials="" searchbase="dc=example,dc=com" schemachecking=on type=refresh AndPersist retry="60 +" sizelimit=unlimited timelimit=unlimited start tls=critical tls_reqcert=demand logbase="cn=accesslog" logfilter="(&(ob jectClass=auditWriteObject)(reqResult=0))" syncdata=accesslog olcSyncrepl: {1}rid=2 provider=ldap://10.20.1.151 bindmethod=simple bind dn="cn=replicant,dc=example,dc=com" credentials="" searchbase="dc=example,dc=com" schemachecking=on type=refre shAndPersist retry="60 +" sizelimit=unlimited timelimit=unlimited sta rttls=critical tls_reqcert=demand logbase="cn=accesslog" logfilter="(&( objectClass=auditWriteObject)(reqResult=0))" syncdata=accesslog olcSyncrepl: {2}rid=3 provider=ldap://10.20.2.240 bindmethod=simple bind dn="cn=replicant,dc=example,dc=com" credentials="" searchbase="dc=example,dc=com" schemachecking=on type=refre shAndPersist retry="60 +" sizelimit=unlimited timelimit=unlimited sta rttls=critical tls_reqcert=demand logbase="cn=accesslog" logfilter="(&( objectClass=auditWriteObject)(reqResult=0))" syncdata=accesslog olcDbCheckpoint: 102400 10