On Mon, Feb 27, 2023 at 08:12:44PM +0100, Geert Hendrickx wrote:
On Mon, Feb 27, 2023 at 19:18:38 +0100, Ondřej Kuzník wrote:
Hi Geert, you didn't answer the question of whether you also monitor the accesslog's contextCSN. In deltasync, the combination of both is important.
Ok, we don't. I'll take a look next time things are drifting.
In a stable environment, the accesslog's contextCSN is identical to the main db's contextCSN, for every SID.
Hi Geert, yes, accesslog's contextCSN should always be in sync with its main DB.
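For illustration, a minimal Python sketch of such a per-SID comparison. It assumes the contextCSN values have already been read from the suffix entries of the main DB and the accesslog DB (e.g. with ldapsearch); the function names and the sample CSNs are made up for this example.

```python
def csn_by_sid(csns):
    """Map each SID to its newest CSN. CSN layout: time#count#sid#mod."""
    newest = {}
    for csn in csns:
        parts = csn.split("#")
        if len(parts) != 4:
            raise ValueError(f"not a CSN: {csn!r}")
        sid = parts[2]
        # CSN fields are fixed-width, so for one SID a plain string
        # comparison orders the values by timestamp, then counter.
        if sid not in newest or csn > newest[sid]:
            newest[sid] = csn
    return newest


def compare_context_csns(main_csns, accesslog_csns):
    """Return {sid: (main_csn, accesslog_csn)} for every SID that differs."""
    main = csn_by_sid(main_csns)
    log = csn_by_sid(accesslog_csns)
    return {
        sid: (main.get(sid), log.get(sid))
        for sid in sorted(set(main) | set(log))
        if main.get(sid) != log.get(sid)
    }


# Made-up sample values: SID 001 matches, SID 002 lags in the accesslog.
main_db = [
    "20230227191244.000001Z#000000#001#000000",
    "20230227190000.000002Z#000000#002#000000",
]
accesslog_db = [
    "20230227191244.000001Z#000000#001#000000",
    "20230227185500.000000Z#000000#002#000000",
]
print(compare_context_csns(main_db, accesslog_db))
```

An empty result means both databases agree for every SID; anything else lists the SIDs worth looking at.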
My emphasis on "incremental". Usually when contextCSN and cookie are found to be incompatible (missing sids from cookie, or even tighter constraints when configured with syncprov-sessionlog-source), it tells the consumer to step down from deltasync or start without a cookie.
Ok, I assumed the consumer decides on its own if it's in sync with a given provider, by comparing its contextCSN to the provider's, and only if it's NOT in sync, queries the provider's accesslog for delta sync from the CSN it's currently at, if possible.
Except for initial sync (no data in consumer), the consumer always tries deltasync first. Provider then proceeds accordingly or tells the consumer to fall back to plain syncrepl (the most common reason to see e-syncRefreshRequired - 4096).
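A heavily simplified Python model of just the "missing sids from cookie" rule mentioned above. This is not slapd's actual logic, which applies further checks (such as the tighter constraints with syncprov-sessionlog-source); the function names and sample CSNs are hypothetical.

```python
def sid_of(csn):
    """Extract the SID field from a CSN (layout: time#count#sid#mod)."""
    return csn.split("#")[2]


def sids_missing_from_cookie(provider_context_csns, cookie_csns):
    """SIDs the provider knows about that the consumer's cookie does not mention."""
    provider_sids = {sid_of(c) for c in provider_context_csns}
    cookie_sids = {sid_of(c) for c in cookie_csns}
    return sorted(provider_sids - cookie_sids)


def cookie_covers_provider(provider_context_csns, cookie_csns):
    """True if the cookie mentions every SID in the provider's contextCSN set.

    In this toy model, False stands for the case where the provider would
    tell the consumer to step down from deltasync or start without a cookie.
    """
    return not sids_missing_from_cookie(provider_context_csns, cookie_csns)


# Made-up sample: the cookie has no CSN for SID 002.
provider = [
    "20230227191244.000001Z#000000#001#000000",
    "20230227190000.000002Z#000000#002#000000",
]
cookie = ["20230227191244.000001Z#000000#001#000000"]
print(sids_missing_from_cookie(provider, cookie))
print(cookie_covers_provider(provider, cookie))
```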
At least for the main db, it makes a significant performance difference if the accesslog gets too large. Therefore we run mdb_copy -c on the database from time to time. We do this on one server, then distribute this mdb to other servers and drop their accesslog, since it doesn't match the (imported) main db anymore. But then other replicas start logging "Content Sync Refresh Required" for the corresponding rid, even if no updates are coming in through *that* server, so its contextCSN is static.
You mean accesslog DB or accesslog freelist? Also now you're saying you're obliterating the whole accesslog (and compacting the mainDB), where previously you said you were compacting accesslog.
We are compacting (mdb_copy -c) the main db on one server, AND throwing away the accesslog on other servers where we import this mdb, because it then no longer matches the local accesslog.
Context at https://openldap.org/lists/openldap-technical/201708/msg00049.html (although this was about a different LDAP database than the one we're currently talking about.)
Unless your entries are larger than pagesize *and* you have massive churn on those, you don't want to do this. Are you confident that's the case? What is your number of overflow pages? What kind of entries is it down to? If it's entries with a large number of values in an attribute (e.g. groups), you might also want to look into sortvals (see man 5 slapd.conf) and multival (man 5 slapd-mdb) to store them more efficiently.
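To put a number on the overflow pages question, here is a small Python sketch that summarizes page counts from text in the layout mdb_stat prints ("Status of <name>" followed by "Branch pages:", "Leaf pages:", "Overflow pages:" lines). The sample text and its numbers are invented for illustration, and the function name is hypothetical.

```python
import re


def overflow_report(mdb_stat_text):
    """Return {db_name: (overflow_pages, total_pages)} from mdb_stat-style text."""
    per_db = {}
    counts = None
    for raw in mdb_stat_text.splitlines():
        line = raw.strip()
        header = re.match(r"Status of (.+)", line)
        if header:
            # Start collecting page counts for a new (sub-)database.
            counts = per_db.setdefault(header.group(1), {})
            continue
        field = re.match(r"(Branch|Leaf|Overflow) pages:\s*(\d+)", line)
        if field and counts is not None:
            counts[field.group(1)] = int(field.group(2))
    return {
        name: (c.get("Overflow", 0), sum(c.values()))
        for name, c in per_db.items()
    }


# Invented sample in mdb_stat's layout; real numbers will differ.
sample = """\
Status of Main DB
  Tree depth: 1
  Branch pages: 0
  Leaf pages: 1
  Overflow pages: 0
  Entries: 6
Status of id2e
  Tree depth: 3
  Branch pages: 12
  Leaf pages: 900
  Overflow pages: 88
  Entries: 50000
"""
for name, (overflow, total) in overflow_report(sample).items():
    print(f"{name}: {overflow} of {total} pages are overflow pages")
```

A small share of overflow pages in the entry database suggests most entries fit in a page, in which case compacting is unlikely to buy much.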
This always went fine, but now turns out to confuse other consumers in an MMR environment. Should we instead run mdb_copy -c locally on each server? (This can be a pretty slow operation.) Or is there another "clean" way to copy mdb databases between replicas? Include the corresponding accesslog?
It should be safe to include the accesslog *if* the server was shut down cleanly and everything was flushed into both.
Do you configure persistent or in-memory sessionlog?
The other scenario is after large batch updates, when the accesslog has grown much bigger than usual (which is not a problem in itself): logpurge then leaves a large freelist in the accesslog as well. So as a precaution, we "clean up" here as well by just dropping that accesslog, obviously at a quiet time and when all servers are in sync. This turned out to be a mistake.
Are your accesslog entries so large that they don't fit a page? If not, just let the freelist be reused for the next time you have a large batch of updates again. That's what it's there for. And even then, accesslog in particular shouldn't really suffer from fragmentation as much as the main DB would.
If this is the case then yeah, you're removing every way the servers could have performed an efficient resync after reconnect/restart and that will take time and processing power to perform (probably running a refresh present, which is only one step up from a total resync). This makes little sense operationally.
Ok, so far we only looked at the contextCSN of the main DIT, assuming this told the whole story.
Yeah, in δ-multiprovider both main DB and accesslog (and their contextCSNs) are used together and should be monitored as such.
Regards,