Ulrich Windl wrote:
Rick van Rein rick@openfortress.nl wrote on 08.12.2016 at 22:37 in
message 5849D28D.8050408@openfortress.nl:
Hello,
I've been thinking about replication schemes lately. I very much like SyncRepl, especially that it can be used in master-master mode, but it also has a downside -- it returns before a majority of replicated servers have agreed on a change. One might say that the change is committed before certainty has been established.
It seems you are mixing a quorum protocol with a synchronisation protocol: RAID1 with sequential (not parallel) writes has the same weaknesses, BTW.
If a change was committed to one server, the change is there, certainly. What is your "certainty"?
This is different with the replication scheme that is now built into BerkeleyDB, by Oracle; this has a scheme based on majority voting, and with automatic resumption after downtime based on this majority. The BerkeleyDB will not return cheerfully from a commit until a majority has confirmed.
Generally the majority can be wrong: Assume you have a network failure in a three-node MMR configuration: You update one node while the other two are unreachable. When communication resumes, do you expect the change on the one node to be reverted to match the majority, or should the majority be updated from the one node that has more recent data?
Indeed. In syncrepl, "voting" is irrelevant. Changes will be accepted by any provider node that a client can reach. When connectivity is restored all nodes will bring each other up to date. In majority-based voting, you will lose any writes to the minority node, which leaves you with unresolvable inconsistencies. I.e., data is removed but the clients believe it was written.
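The two outcomes described above can be sketched in a toy model. This is only an illustration, not OpenLDAP code: each node is a plain dict mapping an entry DN to a `(csn, value)` pair, where `csn` stands in for a change sequence number (a simplification of what OpenLDAP records in entryCSN), and the function names `write`, `sync` and `revert_to_majority` are hypothetical.

```python
from collections import Counter

def write(node, dn, csn, value):
    """Accept a write on whichever node the client can reach; no voting."""
    node[dn] = (csn, value)

def sync(nodes):
    """On reconnect, every node ends up with the newest version of each entry."""
    merged = {}
    for node in nodes:
        for dn, (csn, value) in node.items():
            if dn not in merged or csn > merged[dn][0]:
                merged[dn] = (csn, value)
    for node in nodes:
        node.clear()
        node.update(merged)

def revert_to_majority(nodes):
    """Hypothetical majority rule: the state held by more than half the nodes wins."""
    for dn in set().union(*nodes):
        votes = Counter(node.get(dn) for node in nodes)
        state, count = votes.most_common(1)[0]
        if 2 * count <= len(nodes):
            continue  # no strict majority; leave the entry untouched
        for node in nodes:
            if state is None:
                node.pop(dn, None)  # the majority never had the entry
            else:
                node[dn] = state

# Three-node example: node a is cut off and accepts a write on its own.
a = {"cn=x": (1, "old")}
b, c = dict(a), dict(a)
write(a, "cn=x", 2, "new")

merged = [dict(a), dict(b), dict(c)]
sync(merged)                 # all three nodes now hold (2, "new")

voted = [dict(a), dict(b), dict(c)]
revert_to_majority(voted)    # all three nodes hold (1, "old"); the write is gone
```

In the second case the client that wrote to node a was told the write succeeded, yet the data disappears once the partition heals.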
AFAIK the HDB backend that was once the default for OpenLDAP has been replaced with the plain vanilla BerkeleyDB... so it seems that the replication scheme of the latter can be used. Is that right / does anyone see problems with that / ...?
back-hdb and back-bdb both use BerkeleyDB. BerkeleyDB is now deprecated/obsolete, and LMDB is the default backend.
BDB's replication is page-oriented, so it would consume far more network resources than syncrepl. We have never recommended its use. At this point, with the licensing incompatibility, it's all moot.
Oracle licensing conditions may be a different thing. Despite all that, LDAP replication is not done at the database level.
Was the intention of your message to say: "Oracle has the better databases"?
Regards, Ulrich
Hey,
Generally the majority can be wrong: Assume you have a network failure in a three-node MMR configuration: You update one node while the other two are unreachable. When communication resumes, do you expect the change on the one node to be reverted to match the majority, or should the majority be updated from the one node that has more recent data?
Indeed. In syncrepl, "voting" is irrelevant. Changes will be accepted by any provider node that a client can reach. When connectivity is restored all nodes will bring each other up to date. In majority-based voting, you will lose any writes to the minority node, which leaves you with unresolvable inconsistencies. I.e., data is removed but the clients believe it was written.
This turns out to be a matter of choice -- I would not go for majority voting without getting confirmation from a majority about the success of a transaction, which is what BerkeleyDB does.
What I'm hearing here is that this "formal" approach leads to more delays, and it doesn't add much in practice -- just the *certainty* about data having been stored with the quality level assured by replication. The certainty comes at the cost of a write delay, and is only of use when lightning strikes just after a write to one master.
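To make the alternative I had in mind concrete, here is a rough sketch of a commit that only returns success once a strict majority has stored the change. It is a toy model under my own assumptions, not BerkeleyDB's actual API: `logs` maps a node name to its list of stored changes, `reachable` is the set of nodes that can currently be contacted, and `quorum_commit` is a made-up name.

```python
def quorum_commit(logs, reachable, change):
    """Store `change` and return True only if a strict majority can confirm it."""
    acks = [name for name in logs if name in reachable]
    if 2 * len(acks) <= len(logs):
        # No majority: refuse, so the client is never told a write
        # succeeded that could later be thrown away.
        return False
    for name in acks:
        logs[name].append(change)
    return True

logs = {"a": [], "b": [], "c": []}

# Node a is isolated: the write is refused instead of being lost later.
isolated = quorum_commit(logs, {"a"}, "mod-1")

# Two of three nodes reachable: the write is confirmed by a majority.
confirmed = quorum_commit(logs, {"a", "b"}, "mod-2")
```

The price is visible in the sketch too: every commit waits on other nodes, and a client that can only reach a minority cannot write at all.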
Interestingly, OpenStack Swift takes the same approach -- commit a write based on local storage, then replicate later.
back-hdb and back-bdb both use BerkeleyDB. BerkeleyDB is now deprecated/obsolete, and LMDB is the default backend.
I'm preparing new installations, so I suppose I will get to see it as the default.
BDB's replication is page-oriented, so it would consume far more network resources than syncrepl. We have never recommended its use.
It was indeed a design consideration that I was weighing. I think the trade-off recommended here is clear, and makes sense. I don't flush after every disk use either, after all.
Thanks, -Rick
openldap-technical@openldap.org