Hi
We're on OpenLDAP 2.5.14, built from source on EL 8.
After dropping the accesslog on an MMR master, we keep seeing on all replicas:
do_syncrep2: rid=xxx (4096) Content Sync Refresh Required
and all replicas keep falling back to refresh mode every few minutes.
It's strange because we do this regularly (typically after batch updates, which leave a large freelist in the accesslog after pruning), but it's the first time we've hit this issue. We stopped the master (at a time when no updates were coming in), dropped the accesslog, and started it again.
Does it make sense to wait until this fixes itself? Or should I fix this manually somehow? I already tried mdb_copying the provider db to the replicas, but the same thing keeps happening. In the meantime, replication is working very slowly...
Geert
On Fri, Feb 24, 2023 at 11:08:26PM +0100, Geert Hendrickx wrote:
We're on OpenLDAP 2.5.14, built from source on EL 8.
After dropping the accesslog on an MMR master, we keep seeing on all replicas:
do_syncrep2: rid=xxx (4096) Content Sync Refresh Required
and all replicas keep falling back to refresh mode every few minutes.
It's strange because we do this regularly (typically after batch updates, which leave a large freelist in the accesslog after pruning), but it's the first time we've hit this issue. We stopped the master (at a time when no updates were coming in), dropped the accesslog, and started it again.
Does it make sense to wait until this fixes itself? Or should I fix this manually somehow? I already tried mdb_copying the provider db to the replicas, but the same thing keeps happening. In the meantime, replication is working very slowly...
Hi Geert, I would start any investigation by comparing the contextCSNs between nodes (both the DB and its accesslog). Also check why the provider sends 4096.
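For reference, something along these lines shows both values (host name, bind DN and suffixes are just placeholders for your setup):

    ldapsearch -LLL -x -H ldap://ldap1.example.com -D cn=admin,dc=example,dc=com -W \
        -s base -b dc=example,dc=com contextCSN
    ldapsearch -LLL -x -H ldap://ldap1.example.com -D cn=admin,dc=example,dc=com -W \
        -s base -b cn=accesslog contextCSN

Run that against every node and compare the values per SID.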
Also, I'm not sure you need to touch the accesslog so often; why not size your storage to handle the extra capacity properly? Having a large freelist shouldn't be considered a problem in and of itself.
Regards,
On Mon, Feb 27, 2023 at 15:32:39 +0100, Ondřej Kuzník wrote:
On Fri, Feb 24, 2023 at 11:08:26PM +0100, Geert Hendrickx wrote:
do_syncrep2: rid=xxx (4096) Content Sync Refresh Required
Hi Geert, I would start any investigation by comparing the contextCSNs between nodes (both the DB and its accesslog). Also check why the provider sends 4096.
We monitor and compare the contextCSNs continuously; that's how we noticed the replication was not continuous anymore, but in "bursts". It seemed to reinitiate a full sync all the time (every 5 to 10 minutes) as long as new updates were coming in. It only got back to regular delta sync once we had a long enough period during the night with no updates.
What exactly is the meaning of the 4096 ?
Also, I'm not sure you need to touch the accesslog so often; why not size your storage to handle the extra capacity properly? Having a large freelist shouldn't be considered a problem in and of itself.
At least for the main db, it makes a significant performance difference if the accesslog gets too large. Therefore we mdb_copy -c the database from time to time. We do this on one server, then distribute this mdb to the other servers and drop their accesslog, since it no longer matches the (imported) main db. But then other replicas start logging "Content Sync Refresh Required" for the corresponding rid, even if no updates are coming in through *that* server, so its contextCSN is static.
Geert
On Mon, Feb 27, 2023 at 03:53:49PM +0100, Geert Hendrickx wrote:
We monitor and compare the contextCSNs continuously; that's how we noticed the replication was not continuous anymore, but in "bursts". It seemed to reinitiate a full sync all the time (every 5 to 10 minutes) as long as new updates were coming in. It only got back to regular delta sync once we had a long enough period during the night with no updates.
Hi Geert, you didn't answer the question of whether you also monitor the accesslog's contextCSN. In deltasync, the combination of both is important.
What exactly is the meaning of the 4096 ?
See RFC 4533: "The server is REQUIRED to: ... c) indicate that the *incremental* convergence is not possible by returning e-syncRefreshRequired,"
My emphasis on "incremental". Usually when the contextCSN and cookie are found to be incompatible (SIDs missing from the cookie, or even tighter constraints when configured with syncprov-sessionlog-source), it tells the consumer to step down from deltasync or start without a cookie.
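To illustrate (the values below are made up): each CSN embeds the SID of the server that originated the write, and the consumer's cookie carries one CSN per SID, e.g.

    contextCSN: 20230227143000.123456Z#000000#001#000000
    contextCSN: 20230227143005.654321Z#000000#002#000000
    cookie:     rid=101,sid=001,csn=20230227143000.123456Z#000000#001#000000;20230227143005.654321Z#000000#002#000000

If the cookie lacks a CSN for a SID the provider knows about, or the gap can no longer be covered, e-syncRefreshRequired is the only answer left.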
Also, I'm not sure you need to touch the accesslog so often; why not size your storage to handle the extra capacity properly? Having a large freelist shouldn't be considered a problem in and of itself.
At least for the main db, it makes a significant performance difference if the accesslog gets too large. Therefore we mdb_copy -c the database from time to time. We do this on one server, then distribute this mdb to the other servers and drop their accesslog, since it no longer matches the (imported) main db. But then other replicas start logging "Content Sync Refresh Required" for the corresponding rid, even if no updates are coming in through *that* server, so its contextCSN is static.
You mean the accesslog DB or the accesslog freelist? Also, now you're saying you're obliterating the whole accesslog (and compacting the main DB), whereas previously you said you were compacting the accesslog.
If this is the case then yeah, you're removing every way the servers could have performed an efficient resync after a reconnect/restart, and that will take time and processing power (probably running a refresh-present phase, which is only one step up from a total resync). This makes little sense operationally.
Regards,
On Mon, Feb 27, 2023 at 19:18:38 +0100, Ondřej Kuzník wrote:
Hi Geert, you didn't answer the question of whether you also monitor the accesslog's contextCSN. In deltasync, the combination of both is important.
Ok, we don't. I'll take a look next time things are drifting.
In a stable environment, the accesslog's contextCSN is identical to the main db's contextCSN, for every SID.
What exactly is the meaning of the 4096 ?
See RFC 4533: "The server is REQUIRED to: ... c) indicate that the *incremental* convergence is not possible by returning e-syncRefreshRequired,"
My emphasis on "incremental". Usually when the contextCSN and cookie are found to be incompatible (SIDs missing from the cookie, or even tighter constraints when configured with syncprov-sessionlog-source), it tells the consumer to step down from deltasync or start without a cookie.
Ok, I assumed the consumer decides on its own whether it's in sync with a given provider by comparing its contextCSN to the provider's, and that only if it's NOT in sync does it query the provider's accesslog for a delta sync from the CSN it's currently at, if possible.
At least for the main db, it makes a significant performance difference if the accesslog gets too large. Therefore we mdb_copy -c the database from time to time. We do this on one server, then distribute this mdb to the other servers and drop their accesslog, since it no longer matches the (imported) main db. But then other replicas start logging "Content Sync Refresh Required" for the corresponding rid, even if no updates are coming in through *that* server, so its contextCSN is static.
You mean the accesslog DB or the accesslog freelist? Also, now you're saying you're obliterating the whole accesslog (and compacting the main DB), whereas previously you said you were compacting the accesslog.
We are compacting (mdb_copy -c) the main db on one server, AND throwing away the accesslog on the other servers where we import this mdb, because the imported main db no longer matches the local accesslog.
Context at https://openldap.org/lists/openldap-technical/201708/msg00049.html (although this was about a different LDAP database than the one we're currently talking about.)
This always went fine, but now turns out to confuse other consumers in an MMR environment. Should we instead run mdb_copy -c locally on each server (this can be a pretty slow operation)? Or is there another "clean" way to copy mdb databases between replicas? Include the corresponding accesslog?
The other scenario was that after large batch updates, when the accesslog has grown much bigger than usual (which is not a problem in itself), the logpurge leaves a large freelist in the accesslog as well. So as a precaution we "clean up" here as well, by just dropping that accesslog - obviously at a quiet time and when all servers are in sync. This turned out to be a mistake.
If this is the case then yeah, you're removing every way the servers could have performed an efficient resync after a reconnect/restart, and that will take time and processing power (probably running a refresh-present phase, which is only one step up from a total resync). This makes little sense operationally.
Ok, so far we only looked at the contextCSN of the main DIT, assuming this told the whole story.
Geert
On Mon, Feb 27, 2023 at 08:12:44PM +0100, Geert Hendrickx wrote:
On Mon, Feb 27, 2023 at 19:18:38 +0100, Ondřej Kuzník wrote:
Hi Geert, you didn't answer the question of whether you also monitor the accesslog's contextCSN. In deltasync, the combination of both is important.
Ok, we don't. I'll take a look next time things are drifting.
In a stable environment, the accesslog's contextCSN is identical to the main db's contextCSN, for every SID.
Hi Geert, yes, accesslog's contextCSN should always be in sync with its main DB.
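A quick sketch of how you could watch both on every node (hosts, credentials and suffixes below are placeholders):

    for h in ldap1.example.com ldap2.example.com; do
        echo "== $h =="
        ldapsearch -LLL -x -H "ldap://$h" -D cn=monitor,dc=example,dc=com \
            -y /etc/openldap/monitor.pw -s base -b dc=example,dc=com contextCSN
        ldapsearch -LLL -x -H "ldap://$h" -D cn=monitor,dc=example,dc=com \
            -y /etc/openldap/monitor.pw -s base -b cn=accesslog contextCSN
    done

If the two sets of values diverge on a node, that node's accesslog is not in step with its main DB.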
My emphasis on "incremental". Usually when the contextCSN and cookie are found to be incompatible (SIDs missing from the cookie, or even tighter constraints when configured with syncprov-sessionlog-source), it tells the consumer to step down from deltasync or start without a cookie.
Ok, I assumed the consumer decides on its own whether it's in sync with a given provider by comparing its contextCSN to the provider's, and that only if it's NOT in sync does it query the provider's accesslog for a delta sync from the CSN it's currently at, if possible.
Except for an initial sync (no data in the consumer), the consumer always tries deltasync first. The provider then proceeds accordingly or tells the consumer to fall back to plain syncrepl (the most common reason to see e-syncRefreshRequired - 4096).
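For reference, the deltasync attempt is driven by the logbase/logfilter/syncdata settings in the consumer's syncrepl stanza; a typical configuration looks roughly like this (names and credentials are placeholders):

    syncrepl rid=101
        provider=ldap://provider.example.com
        bindmethod=simple
        binddn="cn=replicator,dc=example,dc=com"
        credentials=secret
        searchbase="dc=example,dc=com"
        logbase="cn=accesslog"
        logfilter="(&(objectClass=auditWriteObject)(reqResult=0))"
        syncdata=accesslog
        type=refreshAndPersist
        retry="60 +"

When the provider answers with e-syncRefreshRequired, the consumer falls back to a plain refresh against searchbase instead of replaying from the accesslog.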
At least for the main db, it makes a significant performance difference if the accesslog gets too large. Therefore we mdb_copy -c the database from time to time. We do this on one server, then distribute this mdb to the other servers and drop their accesslog, since it no longer matches the (imported) main db. But then other replicas start logging "Content Sync Refresh Required" for the corresponding rid, even if no updates are coming in through *that* server, so its contextCSN is static.
You mean the accesslog DB or the accesslog freelist? Also, now you're saying you're obliterating the whole accesslog (and compacting the main DB), whereas previously you said you were compacting the accesslog.
We are compacting (mdb_copy -c) the main db on one server, AND throwing away the accesslog on the other servers where we import this mdb, because the imported main db no longer matches the local accesslog.
Context at https://openldap.org/lists/openldap-technical/201708/msg00049.html (although this was about a different LDAP database than the one we're currently talking about.)
Unless your entries are larger than the page size *and* you have massive churn on those, you don't want to do this. Are you confident that's the case? What is your number of overflow pages? What kind of entries is it down to? If it's entries with a large number of values in an attribute (e.g. groups), you might also want to look into sortvals (see man 5 slapd.conf) and multival (man 5 slapd-mdb) to store them more efficiently.
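If it helps, overflow pages are visible per sub-database with mdb_stat, and the two directives would look roughly like this (the path, attribute names and thresholds are only examples; check slapd.conf(5) and slapd-mdb(5) for the exact syntax):

    mdb_stat -a /var/lib/ldap/data    # look at "Overflow pages:" per sub-database

    # slapd.conf, global section:
    sortvals member memberOf
    # slapd.conf, inside the "database mdb" section:
    multival member 100,10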
This always went fine, but now turns out to confuse other consumers in an MMR environment. Should we instead run mdb_copy -c locally on each server (this can be a pretty slow operation)? Or is there another "clean" way to copy mdb databases between replicas? Include the corresponding accesslog?
It should be safe to include the accesslog *if* the server was shut down cleanly and everything was flushed into both.
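So a safer version of the copy would be something along these lines (paths and service names are placeholders; the point is a clean shutdown and shipping both environments together):

    systemctl stop slapd
    mkdir -p /srv/xfer/data /srv/xfer/accesslog
    mdb_copy -c /var/lib/ldap/data /srv/xfer/data
    mdb_copy -c /var/lib/ldap/accesslog /srv/xfer/accesslog
    systemctl start slapd
    # copy both to the target host, put them in place while its slapd is
    # stopped, then start it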
Do you configure persistent or in-memory sessionlog?
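Both are syncprov settings; the values below are examples only, and the commented-out persistent variant needs a version that supports it:

    overlay syncprov
    syncprov-checkpoint 100 10
    # in-memory session log, sized in number of operations:
    syncprov-sessionlog 10000
    # persistent alternative, backed by the accesslog database:
    # syncprov-sessionlog-source cn=accesslog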
The other scenario was that after large batch updates, when the accesslog has grown much bigger than usual (which is not a problem in itself), the logpurge leaves a large freelist in the accesslog as well. So as a precaution we "clean up" here as well, by just dropping that accesslog - obviously at a quiet time and when all servers are in sync. This turned out to be a mistake.
Are your accesslog entries so large that they don't fit in a page? If not, just let the freelist be reused the next time you have a large batch of updates. That's what it's there for. And even then, the accesslog in particular shouldn't really suffer from fragmentation as much as the main DB would.
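The freelist is easy to keep an eye on if you want to watch how much of it gets reused (the path is a placeholder):

    mdb_stat -ef /var/lib/ldap/accesslog    # -e: environment info, -f: freelist status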
If this is the case then yeah, you're removing every way the servers could have performed an efficient resync after a reconnect/restart, and that will take time and processing power (probably running a refresh-present phase, which is only one step up from a total resync). This makes little sense operationally.
Ok, so far we only looked at the contextCSN of the main DIT, assuming this told the whole story.
Yeah, in δ-multiprovider both main DB and accesslog (and their contextCSNs) are used together and should be monitored as such.
Regards,
On Tue, Feb 28, 2023 at 11:16:45 +0100, Ondřej Kuzník wrote:
Unless your entries are larger than the page size *and* you have massive churn on those, you don't want to do this. Are you confident that's the case? What is your number of overflow pages? What kind of entries is it down to? If it's entries with a large number of values in an attribute (e.g. groups), you might also want to look into sortvals (see man 5 slapd.conf) and multival (man 5 slapd-mdb) to store them more efficiently.
Hi
We've had (and still have) this issue with large attributes and large multi-valued attributes with Zimbra (see previous discussion with Quanah), where we applied sortvals and multival. But in this scenario that's not the case; all objects are of similar, small size, with (mostly) single-valued attributes. Yet our freelist reaches 200K+ free pages during periods of heavy updates (mostly deletes/adds), which has a measurable impact on write performance.
For batch migrations we recently tried combining multiple updates into LDAP transactions, which is significantly faster on a clean db, but makes the freelist performance impact *worse* once the freelist is large enough. Could it be because transactions require larger runs of free pages, which makes it go through the entire freelist?
It should be safe to include the accesslog *if* the server was shut down cleanly and everything was flushed into both.
Should nightly backups then include the accesslog as well? (Implying we can no longer make simple mdb_copy backups while slapd is running... Or is it good enough to dump the accesslog *after* the main db, so it includes the relevant AND newer accesslog data?)
Do you configure persistent or in-memory sessionlog?
in memory
Are your accesslog entries so large that they don't fit in a page? If not, just let the freelist be reused the next time you have a large batch of updates. That's what it's there for. And even then, the accesslog in particular shouldn't really suffer from fragmentation as much as the main DB would.
Ok. We're seeing 1M+ free pages in the accesslog after large batch jobs and the subsequent logpurge. It could be completely innocent like you say; we just clean it up as a precaution, due to the proven main db perf issue mentioned above. We'll hold off on this for now and see.
Yeah, in δ-multiprovider both main DB and accesslog (and their contextCSNs) are used together and should be monitored as such.
Thanks for your advice, will revise the monitoring.
Geert
On Tue, Feb 28, 2023 at 01:42:20PM +0100, Geert Hendrickx wrote:
We've had (and still have) this issue with large attributes and large multi-valued attributes with Zimbra (see previous discussion with Quanah), where we applied sortvals and multival. But in this scenario that's not the case; all objects are of similar, small size, with (mostly) single-valued attributes. Yet our freelist reaches 200K+ free pages during periods of heavy updates (mostly deletes/adds), which has a measurable impact on write performance.
Hi Geert, are you sure it's the freelist and not random access as pages become non-contiguous? The former would represent a constant decline in performance, whereas the latter would eventually taper from high (best-case) performance down to the regular performance you should be able to expect. Have you been able to rule that out?
For batch migrations we recently tried combining multiple updates into LDAP transactions, which is significantly faster on a clean db, but makes the freelist performance impact *worse* once the freelist is large enough. Could it be because transactions require larger runs of free pages, which makes it go through the entire freelist?
Someone else needs to comment on that.
It should be safe to include the accesslog *if* the server was shut down cleanly and everything was flushed into both.
Should nightly backups then include the accesslog as well? (Implying we can no longer make simple mdb_copy backups while slapd is running... Or is it good enough to dump the accesslog *after* the main db, so it includes the relevant AND newer accesslog data?)
Disaster recovery does not need the accesslog unless you need it for auditing purposes, but given you are happy to wipe it, I don't think that's the case. What you're doing here is not disaster recovery, and you can't do it online.
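A sketch of what a nightly backup of the main DB alone could look like (paths and the database number are placeholders; mdb_copy without -c takes a consistent read snapshot and is fine while slapd is running):

    mkdir -p /backup/ldap-$(date +%F)
    mdb_copy /var/lib/ldap/data /backup/ldap-$(date +%F)
    # or, as a portable LDIF dump:
    slapcat -n 1 -l /backup/ldap-$(date +%F).ldif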
Do you configure persistent or in-memory sessionlog?
in memory
After you kill the accesslog, you disable deltasync. Since you're also restarting, the provider has no data on how to replay anything and needs to send the list of all entries (at least their UUIDs). This is expensive and slow. Replication seems to proceed in slow leaps that cost a *lot* of processing on the provider and a fair amount of bandwidth. Isn't that what you're seeing?
Are your accesslog entries so large that they don't fit in a page? If not, just let the freelist be reused the next time you have a large batch of updates. That's what it's there for. And even then, the accesslog in particular shouldn't really suffer from fragmentation as much as the main DB would.
Ok. We're seeing 1M+ free pages in the accesslog after large batch jobs and the subsequent logpurge. It could be completely innocent like you say; we just clean it up as a precaution, due to the proven main db perf issue mentioned above. We'll hold off on this for now and see.
As I always suggest, when you have a hypothesis, see if you can test it before you implement something in production. If you can also report here how it went, we can help you confirm/form a better one. In this case, I think performance settles and you can adjust resources to match your requirements.
Yeah, in δ-multiprovider both main DB and accesslog (and their contextCSNs) are used together and should be monitored as such.
Thanks for your advice, will revise the monitoring.
Regards,
On Tue, Feb 28, 2023 at 16:12:25 +0100, Ondřej Kuzník wrote:
On Tue, Feb 28, 2023 at 01:42:20PM +0100, Geert Hendrickx wrote:
We've had (and still have) this issue with large attributes and large multi-valued attributes with Zimbra (see previous discussion with Quanah), where we applied sortvals and multival. But in this scenario that's not the case; all objects are of similar, small size, with (mostly) single-valued attributes. Yet our freelist reaches 200K+ free pages during periods of heavy updates (mostly deletes/adds), which has a measurable impact on write performance.
Hi Geert, are you sure it's the freelist and not random access as pages become non-contiguous? The former would represent a constant decline in performance, whereas the latter would eventually taper from high (best-case) performance down to the regular performance you should be able to expect. Have you been able to rule that out?
mdb_copy -c fixes it, so I assume it's only the freelist size, not actual fragmentation (mdb_copy doesn't reorder any data, right?). Random access shouldn't matter much, as it's all on an SSD-based SAN.
Also, the decline isn't constant. In normal operations the freelist stays fairly small (it is "consumed" all the time by regular updates). Only during batch updates (because of a currently ongoing migration) does it explode and not get "consumed" in time for the next batch update, causing performance degradation for subsequent batches.
After you kill the accesslog, you disable deltasync. Since you're also restarting, the provider has no data on how to replay anything and needs to send the list of all entries (at least their UUIDs). This is expensive and slow. Replication seems to proceed in slow leaps that cost a *lot* of processing on the provider and a fair amount of bandwidth. Isn't that what you're seeing?
Yes, this is indeed the case, and it keeps doing that as long as updates are coming in. Once there are no updates for a full refresh cycle (e.g. during the night, or because we pause updates), it is able to revert to delta sync.
After you kill the accesslog, you disable deltasync.
This is the essential part. I always assumed it could proceed with deltasync if the provider and replica have the same contextCSN, even with an empty accesslog.
This probably went unnoticed for a long time, since dropping the accesslog on a non-active master causes no (visible) delays - only on an active master.
Thanks for your insights, things are much clearer now, and we have adjusted our processes accordingly.
Geert