It's becoming untenable that a plain refresh cannot be represented in accesslog in a way capable of serving a deltasync session. Whatever happens, we have lost a fair amount of the information needed to run a proper deltasync, yet if we don't want to abandon this functionality, we have to try and fill some of it in.
The current issues:
With refreshDeletes=FALSE we get a list of entries in no particular order (each can be tagged with the CSN of its latest change), and then have to synthesize a list of deletes with no idea what CSN they belong to at all; the best we can do is record the whole CSN vector (we certainly *cannot* pick a single CSN for any of them, not even the one with the highest timestamp).
If we're receiving refreshDeletes=TRUE, we have lost less information: there are no implied deletes in the normal case, that is, when no subtree rename removes a whole subtree from view (we fail on that case at the moment). Updates are still liable to be returned in arbitrary order, but at least we know they're tagged with some sort of CSN, most likely the CSN of the last change to the entry, except for entries that have been deleted.
Some of the above is due to how the provider decides to send information down to us, and we are free to change that. The major gap the provider cannot fill is the deletes implied by the present phase ending: the entries deleted there cannot be tagged with a single CSN, since we need to remember the whole vector to make a consistent decision on whether or not to recreate an entry in the future.
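To illustrate why the whole vector matters, here is a sketch with made-up names (not actual slapd code): the decision to honour a later change over an implied delete can only be made against the CSN recorded for that change's own SID, so a single CSN stored with the delete would be wrong for every other SID.

    /* Illustrative sketch only: an implied delete remembers the whole CSN
     * vector that was in effect when the present phase ended, so a later
     * change can be compared against the CSN of its own SID. */
    #include <string.h>

    #define MAX_SIDS 16

    struct csn_vector {
        int  nvals;
        int  sid[MAX_SIDS];         /* replica SID */
        char csn[MAX_SIDS][64];     /* highest CSN seen from that SID */
    };

    /* CSNs are fixed-width "YYYYmmddHHMMSS.uuuuuuZ#cc#sid#mm" strings, so a
     * plain lexicographic comparison orders them by timestamp, then counter. */
    static int csn_newer(const char *a, const char *b)
    {
        return strcmp(a, b) > 0;
    }

    /* Should an incoming change (entryCSN 'csn' originated at 'sid') be
     * applied over an entry removed as an implied delete tagged with 'vec'?
     * Only if it post-dates what we knew from that SID at delete time. */
    int apply_over_implied_delete(const struct csn_vector *vec,
                                  int sid, const char *csn)
    {
        int i;
        for (i = 0; i < vec->nvals; i++) {
            if (vec->sid[i] == sid)
                return csn_newer(csn, vec->csn[i]);
        }
        return 1; /* no knowledge of this SID at delete time: treat as newer */
    }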
Since we already deviate from RFC 4533 in many respects (e.g. we send cookies in the middle of a refresh, and our cookies are often unusable on their own), we can just own up to this and define an extension to the protocol if needed.
On Tue, Jun 14, 2022 at 01:40:56PM +0200, Ondřej Kuzník wrote:
It's becoming untenable that a plain refresh cannot be represented in accesslog in a way capable of serving a deltasync session. Whatever happens, we have lost a fair amount of the information needed to run a proper deltasync, yet if we don't want to abandon this functionality, we have to try and fill some of it in.
We had a discussion on how to address this and several proposals have been floated so far; I'll broadly keep them to one per thread. This is one of them, call it proposal 1 if you will.
Changes are contained to the provider: during a refresh (both present and delete phase), we send entries and deletes ordered by CSN, which leaves the consumer's accesslog in better shape for deltasync consumption. Currently, if we run a delete phase, we send deletes first, then updates. This proposal requires that we change that[0], mixing the output from the sessionlog and the internal search; for an accesslog-based sessionlog that calls for two concurrent searches being run while processing a single operation, which is currently impossible. We also need an efficient server-side sorting implementation that does this without reading all entries first and only then sorting them.
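A rough sketch of the merge this calls for, under the assumption (not current syncprov behaviour) that both the sessionlog cursor and the internal search can be made to yield their results in non-descending CSN order:

    /* Two-way merge of CSN-ordered streams: sessionlog deletes and the
     * internal search are interleaved on the fly, instead of sending all
     * deletes first and all updates afterwards.  Names are illustrative. */
    #include <string.h>

    struct change {
        const char *dn;
        const char *csn;        /* entryCSN / sessionlog CSN, string-ordered */
        int         is_delete;
    };

    /* Both cursors are assumed to yield changes in non-descending CSN order
     * and return NULL when exhausted. */
    typedef const struct change *(*cursor_fn)(void);

    void merged_refresh(cursor_fn next_from_sessionlog,
                        cursor_fn next_from_search,
                        void (*send)(const struct change *))
    {
        const struct change *d = next_from_sessionlog();
        const struct change *u = next_from_search();

        while (d || u) {
            if (!u || (d && strcmp(d->csn, u->csn) <= 0)) {
                send(d);
                d = next_from_sessionlog();
            } else {
                send(u);
                u = next_from_search();
            }
        }
    }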
This way we never explicitly send changes out of order. There remains, however, an implicit out-of-order change at the end of a present phase: the consumer still has to log those implied deletes somehow, and choosing any single CSN leads to scenarios where divergence is still inevitable. On the other hand, any part of this change can be introduced at the same time as most of the other proposals.
[0]. RFC 4533 is extremely vague about when updates are sent at all during refresh, so it seems we are within our rights to do this.
On Tue, Jun 14, 2022 at 01:40:56PM +0200, Ondřej Kuzník wrote:
It's becoming untenable that a plain refresh cannot be represented in accesslog in a way capable of serving a deltasync session. Whatever happens, we have lost a fair amount of the information needed to run a proper deltasync, yet if we don't want to abandon this functionality, we have to try and fill some of it in.
This is another proposal; let's call it proposal 2 for the sake of argument.
The final accesslog content is similar to proposal 1, except most of the work is done inside the consumer:
- the consumer will need a separate scratch space
- a consumer entering a plain refresh (present or delete phase, it doesn't matter) will make the changes to the DB as instructed, but makes sure they are not logged into accesslog; they go into the separate scratch space, and when the refresh is done, that scratch DB is sorted in CSN order and injected into accesslog
- alternatively, accesslog is told to keep these entries as scratch and reorder them later; syncprov doesn't see them until they have been reordered
This requires tight(er) coupling between syncrepl consumer and accesslog to facilitate the signalling.
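A minimal sketch of the reordering step, assuming a hypothetical scratch area and an accesslog_append() hook, neither of which exists today:

    /* Changes applied during a plain refresh are queued in a scratch area
     * instead of being logged, then sorted by CSN and only afterwards
     * injected into accesslog where syncprov can see them. */
    #include <stdlib.h>
    #include <string.h>

    struct scratch_op {
        char csn[64];
        /* ... the logged change itself (DN, mods or delete) ... */
    };

    static int by_csn(const void *a, const void *b)
    {
        return strcmp(((const struct scratch_op *)a)->csn,
                      ((const struct scratch_op *)b)->csn);
    }

    /* Called once the refresh has completed successfully. */
    void scratch_flush(struct scratch_op *ops, size_t nops,
                       void (*accesslog_append)(const struct scratch_op *))
    {
        size_t i;
        qsort(ops, nops, sizeof(*ops), by_csn);
        for (i = 0; i < nops; i++)
            accesslog_append(&ops[i]);
    }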
If the plain refresh is interrupted for whatever reason, we need a way to abort the changes to the replicated DB. That is a pretty tall order, so alternatively we need the ability to resume the refresh with a provider that would give us a cookie that is the same as or newer in all respects than the one we would have received from the original refresh. That's not currently possible either, but it is fixable: we'd need to extend our cookie signalling so that a provider starts the session with a cookie committing it to the CSN set that will be given at the end of the refresh. That could also be the cookie the consumer sends when retrying.
It is an open question what needs to be done with locally originated changes during this time and whether they can be accepted at all. As with proposal 1, how to log the implicit deletes coming from a present phase also remains open.
On Tue, Jun 14, 2022 at 01:40:56PM +0200, Ondřej Kuzník wrote:
It's becoming untenable that a plain refresh cannot be represented in accesslog in a way capable of serving a deltasync session. Whatever happens, we have lost a fair amount of the information needed to run a proper deltasync, yet if we don't want to abandon this functionality, we have to try and fill some of it in.
Now for proposal 3.
Most of the work is at the provider again: if a consumer gets into a plain refresh, it passes that information down in such a way that accesslog (and the syncprov running on top of it) finds out. When syncprov sees this message, it ends any running sessions with e-syncRefreshRequired, making them fall back to a plain refresh too.
Accesslog also needs to record this (and the finalisation of that refresh) in a way that lets future consumers understand whether they need to fall back to a plain refresh, even in the case of connection loss or other malfunction.
This accesslog message also acts like a poison that spreads through the cluster: eventually all members will end up in plain refreshes. Taking this approach requires a design that is guaranteed to upgrade everyone back to regular deltasync in a timely manner.
One possible approach might reuse an idea floated in the other proposals: have the provider commit up front to the CSN the refresh will end with, and record that as a time limit on the poison we put into our accesslog.
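A sketch of how such a self-expiring marker could be interpreted (illustrative only; neither the record nor the check exists today):

    /* The provider commits up front to the CSN the refresh will end with;
     * a node reading the accesslog treats the marker as active only until
     * it has replayed that CSN, after which it may resume deltasync. */
    #include <string.h>

    struct refresh_marker {
        char until_csn[64];   /* CSN the provider committed to end the refresh with */
    };

    /* 'newest_seen' is the highest CSN this node has replayed from the log. */
    int marker_still_active(const struct refresh_marker *m, const char *newest_seen)
    {
        return strcmp(newest_seen, m->until_csn) < 0;
    }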
On Tue, Jun 14, 2022 at 01:40:56PM +0200, Ondřej Kuzník wrote:
It's becoming untenable that a plain refresh cannot be represented in accesslog in a way capable of serving a deltasync session. Whatever happens, we have lost a fair amount of the information needed to run a proper deltasync, yet if we don't want to abandon this functionality, we have to try and fill some of it in.
Continuing with the theme, proposal 4 is analogous to the previous one, but the onus is on the consumer.
The provider still notifies its accesslog of entering (and eventually exiting) a plain refresh on the replicated DB. A delta consumer processing this entry notes it down in its own accesslog and transitions into a fallback state. That state is in many ways a plain-style refresh: other outbound sessions are abandoned, but we keep reading the accesslog, processing its entries regardless of their CSN and not committing the cookie either.
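A sketch of what the fallback state boils down to on the consumer side (hypothetical names; apply() and commit_cookie() stand in for the real machinery):

    /* While the fallback flag is set, entries read from the provider's
     * accesslog are applied even if their CSNs arrive out of order, but the
     * consumer's committed cookie is never advanced, so a retry can still
     * start from a safe point. */
    struct consumer_state {
        int fallback;   /* set when the "plain refresh in progress" record is seen */
    };

    void process_log_entry(struct consumer_state *st, const char *csn,
                           void (*apply)(const char *csn),
                           void (*commit_cookie)(const char *csn))
    {
        apply(csn);                 /* apply the change regardless of ordering */
        if (!st->fallback)
            commit_cookie(csn);     /* only advance the cookie in normal mode */
    }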
This leaves the same issue of getting out of the fallback as proposal 3, with the complication that we could still be in a persist phase when this happens and end up in a circular stalemate with everyone stuck in this zombie state. Again, we need to be able to guarantee this won't linger, as it could isolate a subset of the environment from the rest.
Another issue is that we would have introduced a new mode into the protocol, something between plain and delta syncrepl, and that mode would need to be properly understood.
On Tue, Jun 14, 2022 at 01:40:56PM +0200, Ondřej Kuzník wrote:
It's becoming untenable that a plain refresh cannot be represented in accesslog in a way capable of serving a deltasync session. Whatever happens, we have lost a fair amount of the information needed to run a proper deltasync, yet if we don't want to abandon this functionality, we have to try and fill some of it in.
There is no record of the expectations for deltasync in a multiprovider environment, so it's probably worth writing down what they are from my point of view; not necessarily Howard's, who actually wrote the thing.
The intention is convergence[0] first and foremost, while sending the changes rather than the full entries.
Since conflicting writes will always happen and every node only has its own view of the DB at the time, different nodes might make different decisions. In deltasync, each records the final modification into its accesslog, and those records might differ, with the divergence cascading, bounded only by the number of hosts involved[1]. In the end, we need all hosts, reading various versions and subsections of the log in varying order, to always converge.
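For what it's worth, a toy illustration of why convergence is achievable at all, sketching only the principle rather than the actual conflict-resolution code: as long as the rule that picks the surviving change depends only on the competing CSNs, every node reaches the same verdict regardless of the order it saw the changes in.

    #include <string.h>

    /* CSNs are fixed-width "YYYYmmddHHMMSS.uuuuuuZ#cc#sid#mm" strings, so a
     * lexicographic comparison orders them by timestamp, then counter, then
     * SID, giving every node the same, order-independent winner. */
    const char *conflict_winner(const char *csn_a, const char *csn_b)
    {
        return strcmp(csn_a, csn_b) >= 0 ? csn_a : csn_b;
    }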
A historic expectation has always been that accesslog is written in order and relayed in the same order as written[2], implicitly assuming that CSNs for each SID are always stored in non-descending order. This is why some backends (back-ldif) are not suited to holding accesslog DBs. This non-descending storage expectation might have to be revisited if necessary, though hopefully not.
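The invariant can be stated as a simple check on append (illustrative only, not something accesslog does today):

    /* For each SID, the CSN of a newly appended log entry must never be
     * lower than the last one stored for that SID, otherwise a consumer
     * replaying the log in storage order would have to skip it. */
    #include <string.h>

    #define MAX_SIDS 4096

    struct log_tail {
        char last_csn[MAX_SIDS][64];  /* highest CSN per SID, zero-initialised */
    };

    /* Returns 0 and refuses the append if it would violate the invariant. */
    int accesslog_check_append(struct log_tail *t, int sid, const char *csn)
    {
        if (t->last_csn[sid][0] && strcmp(csn, t->last_csn[sid]) < 0)
            return 0;
        strncpy(t->last_csn[sid], csn, sizeof(t->last_csn[sid]) - 1);
        return 1;
    }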
Another expectation is that fallback behaviour be both graceful and efficient: similarly to convergence, sessions will eventually move back to deltasync if at an arbitrary point we were to stop introducing *conflicting* changes into the environment. At the same time, for the sake of convergence, we need to be tolerant of some or all links running a plain syncrepl refresh at some points in time.
We have to expect to be running in a real-world environment, where arbitrary[3] topologies might be in place and any number of links and/or nodes can be out of commission for any amount of time. When isolated nodes rejoin, they should be able to converge eventually, regardless of how long the isolation/partition lasted. Still, we can't let the accesslog grow without bound, so we need to be able to detect when we no longer retain the relevant data and work around it[4].
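A sketch of that detection step, keeping to a single CSN for brevity (a real cookie is a per-SID vector) and assuming a hypothetical record of the oldest change still retained in the log:

    #include <string.h>

    enum resync_mode { RESYNC_DELTA, RESYNC_FULL_REFRESH };

    /* When a consumer reconnects, compare its cookie with the oldest change
     * still present in the log; if the consumer is older than anything we
     * retain, we no longer have the data to serve it incrementally. */
    enum resync_mode choose_resync(const char *consumer_csn,
                                   const char *oldest_retained_csn)
    {
        if (strcmp(consumer_csn, oldest_retained_csn) < 0)
            return RESYNC_FULL_REFRESH;   /* relevant log entries already purged */
        return RESYNC_DELTA;              /* can replay from the log */
    }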
Each node's accesslog DB should always be self-consistent: if a read-only consumer starts with the same DB as the provider had at some point, it shall always be able to replay the provider's accesslog cleanly, regardless of what kinds of conflict resolution the provider had to go through. N.B. if it is impossible to write a self-consistent accesslog in certain situations, it is OK to pretend that certain parts of the accesslog have already been purged, e.g. by attaching meta-information understood by syncprov to that effect.
Regardless of the promises stated above, we should also expect administrators deploying any multiprovider environment to actively monitor it. Just like with backups, if replication is not checked routinely, it almost always breaks when you actually need it. There are multiple resources on how to do so, and more tools can and will be developed as the need is identified.
There are also some non-expectations, generally shared with plain syncrepl anyway:
- If a host or network link is underpowered for the amount of changes coming in, it might fall behind; this doesn't affect eventual convergence[0] and it is up to the administrator to size their environment correctly.
- Any host configured to accept writes will do so, allowing conflicts to arise; any or all of these writes might be (partially) reverted in the face of conflicting writes elsewhere in the environment. Note that this is already the case with plain syncrepl.
- We do not aim to minimise the number of "redundant" messages passed if there are multiple paths between nodes. LDAP semantics do not allow this to be done in a safe way with a CSN based replication system.
I hope I haven't missed anything important.
[0]. Let's take the usual definition of eventual convergence: if at an arbitrary point we were to stop introducing new changes to the environment and restore connectivity, all participating nodes will arrive at identical content in a finite number of steps (and there's a way to tell when that has happened).
[1]. Contrast this with "log replication" in Raft et al., where all members of a cluster coordinate to build a shared view of the actual history, not accepting a change until it has been accepted by a majority.
[2]. If this assumption is violated, as in ITS#9358, the consumer will have to skip some legitimate operations and diverge.
[3]. We can still assume that in the designed topology all the nodes that accept write operations belong to the same strongly connected component.
[4]. This is the assumption that was at the core of the issue described in ITS#9823.
Ondřej Kuzník wrote:
- We do not aim to minimise the number of "redundant" messages passed if there are multiple paths between nodes. LDAP semantics do not allow this to be done in a safe way with a CSN based replication system
We would like to reduce redundancy where possible. That's the aim of ITS#9356.
I hope I haven't missed anything important.