On Tue, Jun 14, 2022 at 01:40:56PM +0200, Ondřej Kuzník wrote:
It's becoming untenable that a plain refresh cannot be represented in the accesslog in a way that is capable of serving a deltasync session. Whatever happens, we have lost a fair amount of the information needed to run a proper deltasync, yet if we don't want to abandon this functionality, we have to try and fill some of it in.
There is no record of the expectations for deltasync in a multiprovider environment, so it is probably worth putting down what they are. These are from my point of view, not necessarily Howard's, who actually wrote the thing.
The intention is convergence[0] first and foremost, while sending the changes rather than the full entries.
Since conflicting writes will always happen and every node has its own view of the DB at the time, each might make a different decision. In deltasync, each node records the final modification it applied into its accesslog; these records might differ, and the differences cascade, bounded only by the number of hosts involved[1]. In the end, we need all hosts, which read various versions and subsections of the log in varying order, to always converge.
A long-standing expectation has been that the accesslog is written in order and relayed in the same order as written[2], implicitly assuming that CSNs for each SID will always be stored in non-descending order. This is why some backends (e.g. back-ldif) are not suited to holding accesslog DBs. This non-descending storage expectation might need to be revisited; hopefully that won't be necessary.
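To make the ordering expectation concrete, here is a minimal sketch (illustration only, not slapd code) of the invariant a reader of the log relies on: walking accesslog entries in storage/relay order, the CSN observed for any given SID must never go backwards. It assumes the usual fixed-width CSN layout (timestamp#count#sid#mod), so plain string comparison orders CSNs from the same SID correctly; the sample values are made up.

    # Illustration only: check the per-SID non-descending CSN invariant
    # over a sequence of CSNs read in accesslog storage/relay order.
    def check_csn_order(csns):
        """Return (index, csn) pairs where a SID's CSN went backwards."""
        latest = {}       # highest CSN seen so far, per SID
        violations = []
        for i, csn in enumerate(csns):
            sid = csn.split("#")[2]
            if sid in latest and csn < latest[sid]:
                violations.append((i, csn))
            else:
                latest[sid] = csn
        return violations

    # Made-up example: the second SID 001 entry is older than the first,
    # so a consumer reading this log in order would see SID 001 go backwards.
    log = [
        "20220614114100.000001Z#000000#001#000000",
        "20220614114101.000000Z#000000#002#000000",
        "20220614114059.000000Z#000000#001#000000",
    ]
    print(check_csn_order(log))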
Another expectation is that fallback behaviour be both graceful and efficient: similarly to convergence, sessions will eventually move back to deltasync if at an arbitrary point we were to stop introducing *conflicting* changes into the environment. At the same time, for the sake of convergence, we need to be tolerant of some/all links running a plain syncrepl refresh at some points in time.
We have to expect to be running in a real-world environment, where arbitrary[3] topologies might be in place and any number of links and/or nodes can be out of commission for any amount of time. When isolated nodes rejoin, they should be able to converge eventually, regardless of how long the isolation/partition lasted. Still, we can't require that the accesslog be unbounded in size, so we need to be able to detect when we no longer retain the relevant data and work around it[4].
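As an illustration of that detection (this is only the conceptual shape of the decision, not how slapd/syncprov actually implements it, and the names are made up): given the consumer's cookie and the oldest change still retained in the log for each SID, the provider has to abandon deltasync and fall back to a plain refresh whenever the consumer's state predates what the log retains.

    # Conceptual sketch only: decide whether a deltasync refresh can still be
    # served from the accesslog, or whether the needed entries have already
    # been purged and a plain refresh is required. Both arguments map
    # SID -> CSN string; fixed-width CSNs compare lexicographically.
    def can_serve_delta(consumer_cookie, oldest_logged):
        for sid, oldest in oldest_logged.items():
            # The consumer's state for this SID predates what the log still
            # retains (or it has never seen that SID): the intervening
            # changes are gone, so only a plain refresh is safe.
            if consumer_cookie.get(sid, "") < oldest:
                return False
        return True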
Each node's accesslog DB should always be self-consistent: if a read-only consumer starts with the same DB as the provider had at some point, it should always be able to replay the provider's accesslog cleanly, regardless of what kinds of conflict resolution the provider had to go through. N.B. If it is impossible to write a self-consistent accesslog in certain situations, it is OK to pretend that certain parts of the accesslog have already been purged, e.g. by attaching meta-information understood by syncprov to that effect.
Regardless of the promises stated above, we should also expect administrators deploying any multiprovider environment to actively monitor it. Just like backups, if replication is not checked routinely, it will almost always turn out to be broken when you actually need it. There are multiple resources on how to do so, and more tools can and will be developed as the need is identified.
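Purely as an example of what such a check can look like (the hosts, suffix and anonymous bind below are placeholders, and python-ldap is just one convenient way to do it): comparing the contextCSN values held by each provider is the usual first indication of whether the environment has converged.

    # Hypothetical monitoring sketch: compare contextCSN values across providers.
    # Equal contextCSNs on all nodes is the usual quick convergence check.
    import ldap

    PROVIDERS = ["ldap://provider1.example.com", "ldap://provider2.example.com"]
    SUFFIX = "dc=example,dc=com"

    def context_csns(uri):
        conn = ldap.initialize(uri)
        # anonymous read; real deployments will want TLS and credentials
        res = conn.search_s(SUFFIX, ldap.SCOPE_BASE, "(objectClass=*)", ["contextCSN"])
        _dn, attrs = res[0]
        return sorted(v.decode() for v in attrs.get("contextCSN", []))

    seen = {uri: context_csns(uri) for uri in PROVIDERS}
    if len({tuple(csns) for csns in seen.values()}) == 1:
        print("contextCSN values match across providers")
    else:
        for uri, csns in seen.items():
            print(uri, csns)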
There are also some non-expectations, generally shared with plain syncrepl anyway:
- If a host/network link is underpowered for the amount of changes coming in, it might fall behind; this doesn't affect eventual convergence[0] and it is up to the administrator to size their environment correctly
- Any host configured to accept writes will do so, allowing conflicts to arise; any/all of these might be (partially) reverted in the face of conflicting writes elsewhere in the environment (note that this is already the case with plain syncrepl)
- We do not aim to minimise the number of "redundant" messages passed when there are multiple paths between nodes; LDAP semantics do not allow this to be done safely with a CSN based replication system
I hope I haven't missed anything important.
[0]. Let's take the usual definition of eventual convergence: if at an arbitrary point we were to stop introducing new changes to the environment and restore connectivity, all participating nodes will arrive at identical content in a finite number of steps (and there's a way to tell when that's happened)
[1]. Contrast this with "log replication" in Raft et al., where all members of a cluster coordinate to build a shared view of the actual history, not accepting a change until it has been accepted by a majority
[2]. If this assumption is violated, as in ITS#9358, the consumer will have to skip some legitimate operations and diverge
[3]. We can still assume that in the designed topology all the nodes that accept write operations belong to the same strongly connected component
[4]. This is the assumption that was at the core of the issue described in ITS#9823