On Tue, Apr 23, 2019 at 04:48:28PM +0000, quanah@openldap.org wrote:
In testing a particular use case/setup scenario, I found that it's possible to cause a replica to slam a provider with unending requests. In this specific case, I was setting up delta-syncrepl MMR, but I believe the issue applies to standard syncrepl, and is not MMR specific. The scenario looks like this:
Initially we have a standalone server with no overlays in place. The configuration is done via cn=config, which allows us to update the configuration without a server restart.
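For context, a delta-syncrepl consumer in this kind of setup is typically configured along these lines (slapd.conf syntax for brevity; the cn=config equivalent is an olcSyncrepl value, and the hostnames, DNs and credentials below are placeholders):

```
syncrepl rid=001
  provider=ldap://provider.example.com
  bindmethod=simple
  binddn="cn=replicator,dc=example,dc=com"
  credentials=secret
  searchbase="dc=example,dc=com"
  logbase="cn=accesslog"
  logfilter="(&(objectClass=auditWriteObject)(reqResult=0))"
  syncdata=accesslog
  type=refreshAndPersist
  retry="60 +"
```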
[...]
I believe the problem is that the root entry for the database contains no contextCSN. This is likely due to the fact that:
[...]
However, I think the overall behavior is undesirable. The absence of a contextCSN should not lead replication clients to mount what amounts to a DoS against the provider. In this case it also generated ~60GB of logs at loglevel stats in a single day.
Ok, I think this is the consumer's fault and limited to refreshAndPersist delta-syncrepl (with or without MMR).
Going by what I remember of what the consumer code did:
- on setup, it finds there is no cookie to go by, so it goes into refresh on the main DB
- the main DB responds with success but no/empty cookie
- the consumer starts over but again finds itself with no cookie, so it goes back to step one
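The loop above can be modeled with a short sketch (names are illustrative, not the actual OpenLDAP symbols): because a successful refresh that returns an empty cookie is treated the same as having no cookie at all, the consumer restarts the refresh forever and never reaches the persist phase.

```python
def buggy_consumer(max_rounds=5):
    """Model of the broken loop; returns how many refresh rounds ran.

    The cap stands in for what is, in reality, an endless loop hammering
    the provider with one refresh search after another.
    """
    cookie = None                  # no prior sync state on the consumer
    rounds = 0
    while rounds < max_rounds:
        rounds += 1
        # Refresh against the main DB: the provider has no contextCSN,
        # so it answers "success" with an empty cookie.
        cookie = ""
        if not cookie:             # empty cookie treated as "no cookie"...
            continue               # ...so the refresh starts over
        break                      # never reached
    return rounds

print(buggy_consumer())  # hits the cap: the loop never settles
```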
But the consumer is actually up to date at that point, as the search suggested, so it should just go ahead and do the refreshAndPersist on the accesslog as it originally planned. And as operations hit the main DB, they will replicate accordingly, even if that were to happen after the original search and before this one[0].
So in that case the behaviour would be as follows:
- on setup, it finds there is no cookie to go by, so it goes into refresh on the main DB
- the main DB responds with success but no/empty cookie
- the consumer starts over, remembering that its cookie (albeit empty) is valid, so it sends a refreshAndPersist search on the accesslog DB
- that yields no traffic, but will give it the right data once anything replication-worthy happens; job done
- unless the connection is actually severed (restarts, ...) before anything needs replicating, in which case we start from step one, but no overhead was incurred and we're still fine
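The proposed behaviour can be sketched the same way (again with illustrative names): a successful refresh that yields an empty cookie marks the cookie as valid, so the consumer proceeds to the persist phase on the accesslog DB instead of looping.

```python
def fixed_consumer():
    """Model of the proposed fix; returns the phases gone through."""
    phases = []
    cookie = None                     # no prior sync state
    phases.append("refresh main DB")
    cookie = ""                       # provider: success, empty cookie
    cookie_valid = True               # empty but valid: we are up to date
    if cookie_valid:
        # Proceed as originally planned; no further load on the provider
        # until something replication-worthy is actually written.
        phases.append("refreshAndPersist on accesslog")
    return phases

print(fixed_consumer())
```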
[0]. Unless so much time has elapsed between the two searches (which happen on the same connection, BTW) that some accesslog ops have already expired. Expiration is usually configured in days, not seconds, and an admin who doesn't notice a consumer going AWOL for that long probably deserves it.
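The expiration in question is the accesslog overlay's logpurge setting on the provider; a typical provider-side fragment (values illustrative) looks like this, purging entries older than 7 days with a daily check:

```
overlay accesslog
logdb cn=accesslog
logops writes
logsuccess TRUE
logpurge 07+00:00 01+00:00
```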