On Tue, Apr 23, 2019 at 04:48:28PM +0000, quanah@openldap.org wrote:
In testing a particular use case/setup scenario, I found that it's possible to cause a replica to slam a provider with unending requests. In this specific case, I was setting up delta-syncrepl MMR, but I believe the issue applies to standard syncrepl, and is not MMR specific. The scenario looks like this:
Initially we have a standalone server with no overlays in place. The configuration is done via cn=config, which allows us to update the configuration without a server restart.
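For context, a delta-syncrepl consumer in this kind of setup is typically configured along these lines (slapd.conf syntax for brevity; the cn=config equivalent is an olcSyncrepl value, and the hostnames, DNs and credentials below are placeholders):

```
syncrepl rid=001
  provider=ldap://provider.example.com
  bindmethod=simple
  binddn="cn=replicator,dc=example,dc=com"
  credentials=secret
  searchbase="dc=example,dc=com"
  logbase="cn=accesslog"
  logfilter="(&(objectClass=auditWriteObject)(reqResult=0))"
  syncdata=accesslog
  type=refreshAndPersist
  retry="60 +"
```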
[...]
I believe the problem is that the root entry for the database contains no contextCSN. This is likely due to the fact that:
[...]
However, I think the overall behavior is undesirable. The absence of a contextCSN should not lead replication clients to mount what amounts to a DoS against the provider. In this case it also generated ~60GB of logs at loglevel stats in a single day.
Ok, I think this is the consumer's fault and limited to refreshAndPersist delta-syncrepl (with or without MMR).
Going by what I remember of what the consumer code did:
- on setup, it finds there is no cookie to go by, so it goes into refresh on the main DB
- the main DB responds with success but no/empty cookie
- the consumer starts over but again finds itself with no cookie, so it goes back to step one
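The loop above can be modeled with a short sketch (names are illustrative, not the actual OpenLDAP symbols): because a successful refresh that returns an empty cookie is treated the same as having no cookie at all, the consumer restarts the refresh forever and never reaches the persist phase.

```python
def buggy_consumer(max_rounds=5):
    """Model of the broken loop; returns how many refresh rounds ran.

    The cap stands in for what is, in reality, an endless loop hammering
    the provider with one refresh search after another.
    """
    cookie = None                  # no prior sync state on the consumer
    rounds = 0
    while rounds < max_rounds:
        rounds += 1
        # Refresh against the main DB: the provider has no contextCSN,
        # so it answers "success" with an empty cookie.
        cookie = ""
        if not cookie:             # empty cookie treated as "no cookie"...
            continue               # ...so the refresh starts over
        break                      # never reached
    return rounds

print(buggy_consumer())  # hits the cap: the loop never settles
```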
But the consumer is actually up to date at that point, as the search suggested, so it should just go ahead and do the refreshAndPersist on the accesslog as it originally planned. And as operations hit the main DB, they will replicate accordingly, even if that were to happen after the original search and before this one[0].
So in that case the behaviour would be as follows:
- on setup, it finds there is no cookie to go by, so it goes into refresh on the main DB
- the main DB responds with success but no/empty cookie
- the consumer starts over, remembering that its cookie (albeit empty) is valid, so it sends a refreshAndPersist search on the accesslog DB
- that yields no traffic, but will give it the right data once anything replication-worthy happens; job done
- unless the connection is actually severed (restarts, ...) before anything needs replicating, in which case we start from step one, but no overhead was incurred and we're still fine
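The proposed behaviour can be sketched the same way (again with illustrative names): a successful refresh that yields an empty cookie marks the cookie as valid, so the consumer proceeds to the persist phase on the accesslog DB instead of looping.

```python
def fixed_consumer():
    """Model of the proposed fix; returns the phases gone through."""
    phases = []
    cookie = None                     # no prior sync state
    phases.append("refresh main DB")
    cookie = ""                       # provider: success, empty cookie
    cookie_valid = True               # empty but valid: we are up to date
    if cookie_valid:
        # Proceed as originally planned; no further load on the provider
        # until something replication-worthy is actually written.
        phases.append("refreshAndPersist on accesslog")
    return phases

print(fixed_consumer())
```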
[0]. Unless so much time has elapsed between the two searches (which happen on the same connection, BTW) that some accesslog ops have already expired. Expiration is usually configured in days, not seconds, and an admin who doesn't notice a consumer going AWOL for that long probably deserves it.
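The expiration in question is the accesslog overlay's logpurge setting on the provider; a typical provider-side fragment (values illustrative) looks like this, purging entries older than 7 days with a daily check:

```
overlay accesslog
logdb cn=accesslog
logops writes
logsuccess TRUE
logpurge 07+00:00 01+00:00
```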