On Mon, Oct 15, 2018 at 01:53:30PM +0000, hyc@symas.com wrote:
ondra@mistotebe.net wrote:
This is my understanding of the above discussion:
- deltasync consumer has just switched to full refresh (but is ahead of this provider in some ways)
- provider sends the present list
- consumer deletes extra entries, builds a new cookie
- the problem is that the new cookie is built to reflect the union of both the local and received cookies, even though we may have undone some of the changes, which we will then ignore because the cookie claims we already have them
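To make the cookie problem above concrete, here is a rough sketch in Python, not the actual syncrepl code. It assumes CSNs of the form timestamp#count#sid#mod that compare as plain strings, and the names (to_vector, merge_union) are made up for illustration.

```python
# Sketch only: models a cookie as a map of SID -> newest CSN for that SID.
# Assumption: CSNs look like "YYYYmmddHHMMSS.uuuuuuZ#count#sid#mod" and
# equal-format CSNs order correctly under plain string comparison.

def sid_of(csn):
    """Return the SID field (third '#'-separated field) of a CSN."""
    return csn.split("#")[2]

def to_vector(csns):
    """Collapse a list of CSNs into SID -> newest CSN."""
    vec = {}
    for csn in csns:
        sid = sid_of(csn)
        if sid not in vec or csn > vec[sid]:
            vec[sid] = csn
    return vec

def merge_union(local, received):
    """Per-SID maximum of both cookies (the behaviour described above)."""
    merged = dict(local)
    for sid, csn in received.items():
        if sid not in merged or csn > merged[sid]:
            merged[sid] = csn
    return merged

# The consumer is ahead of the provider for SID 002, behind for SID 001.
local = to_vector([
    "20181015120000.000000Z#000000#001#000000",
    "20181015130000.000000Z#000000#002#000000",
])
received = to_vector([
    "20181015120500.000000Z#000000#001#000000",
    "20181015110000.000000Z#000000#002#000000",
])

merged = merge_union(local, received)

# SIDs where the merged cookie claims more than the provider actually sent.
# If the present phase deleted entries written under these SIDs, the cookie
# now overstates what the database contains.
overstated = {sid for sid in merged if merged[sid] > received.get(sid, "")}
print(sorted(overstated))
```

After the present phase the database matches the provider's view, but the merged cookie still carries the newer local CSN for SID 002, so those changes would be skipped if they were offered again.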
If that's accurate, there are some approaches that could fix it:
(1) A simple one is to remember the actual cookie we got from the server and refuse to delete entries with entryCSN ahead of the provided CSN set. The problem is that we get even further from being able to replicate from a generic RFC4533 provider.
(2) Instead, when the present phase is initiated, we might terminate all other sessions, adopt the complete CSN set and restart them only once the new CSN set has been fully established.
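For reference, the first approach above might look roughly like this. It is a sketch in Python under the same assumption about CSN format and string ordering, not a patch; entries_to_delete and is_ahead_of_provider are hypothetical names, and treating an SID that is missing from the provider's cookie as "ahead" is my own assumption.

```python
# Sketch only: decide which local entries a present phase may delete when we
# keep the provider's own cookie around instead of the merged one.

def sid_of(csn):
    """Return the SID field (third '#'-separated field) of a CSN."""
    return csn.split("#")[2]

def is_ahead_of_provider(entry_csn, provider_cookie):
    """True if the entry's CSN is newer than what the provider reported
    for that SID, or the provider reported nothing for that SID."""
    provider_csn = provider_cookie.get(sid_of(entry_csn))
    return provider_csn is None or entry_csn > provider_csn

def entries_to_delete(local_entries, present_dns, provider_cookie):
    """local_entries maps DN -> entryCSN; present_dns is the present list."""
    doomed = []
    for dn, entry_csn in local_entries.items():
        if dn in present_dns:
            continue  # provider still has it
        if is_ahead_of_provider(entry_csn, provider_cookie):
            continue  # provider may simply not have seen it yet
        doomed.append(dn)
    return sorted(doomed)

provider_cookie = {"001": "20181015120000.000000Z#000000#001#000000"}
local_entries = {
    "cn=old,dc=example,dc=com": "20181015110000.000000Z#000000#001#000000",
    "cn=kept,dc=example,dc=com": "20181015113000.000000Z#000000#001#000000",
    "cn=new,dc=example,dc=com": "20181015130000.000000Z#000000#002#000000",
}
present_dns = {"cn=kept,dc=example,dc=com"}

result = entries_to_delete(local_entries, present_dns, provider_cookie)
print(result)
```

Only the entry that is both absent from the present list and covered by the provider's cookie is deleted. The check depends on entryCSN semantics, which is exactly what moves us away from a generic RFC4533 provider.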
(2) makes sense.
Also, whenever we fall back from deltasync into plain syncrepl, we should make sure that the accesslog entries we generate from this are never used for further replication, though that might be thought to be a separate issue.
That should already be the case, since none of these ops will have a valid CSN.
I faintly remember Quanah seeing these accesslog entries used by consumers at some point, but I might be mistaken.
The more general point is making sure that a potential syncrepl consumer of this server does not even try to use the accesslog entries we added before these - the refresh has created a strange gap in the middle (or worse, duplicated ops if a contextCSN element jumped backwards). But if we enforced that, the question is how to get modifications originating from this replica replicated elsewhere - unless we decide they can't be salvaged?
And should the contextCSN reset terminate not just all inbound syncrepl sessions, but the outbound ones as well?