Hey,
Generally the majority can be wrong: assume you have a network failure in a three-node MMR configuration, and you update one node while the other two are unreachable. When communication resumes, do you expect the change on the one node to be reverted to the majority's state, or should the majority be updated from the one node, which has the more recent data?
Indeed. In syncrepl, "voting" is irrelevant. Any provider node a client can reach will accept changes. When connectivity is restored, all nodes bring each other up to date. With majority-based voting you would lose any writes made to the minority node, which leaves you with an unresolvable inconsistency: data is removed even though the clients believe it was written.
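The convergence behavior described above can be sketched as a toy model: every write carries a change sequence number (CSN), and on reconnect each node keeps the newest value per entry. This is a deliberately simplified illustration, not OpenLDAP's actual syncrepl algorithm; the entry names and CSN values are made up.

```python
# Toy model of convergence after a network partition, loosely modeled on
# syncrepl's "newest change wins" behavior. Not the real algorithm.

def merge(replicas):
    """Bring all replicas up to date: for each entry, the newest CSN wins."""
    merged = {}
    for rep in replicas:
        for entry, (csn, value) in rep.items():
            if entry not in merged or csn > merged[entry][0]:
                merged[entry] = (csn, value)
    return merged

# Three-node MMR: nodes B and C were unreachable while A took a write.
node_a = {"uid=rick": (20240101120500, "new-mail")}   # updated during partition
node_b = {"uid=rick": (20240101120000, "old-mail")}
node_c = {"uid=rick": (20240101120000, "old-mail")}

# When connectivity resumes, the minority node's newer write propagates
# to the majority rather than being reverted.
state = merge([node_a, node_b, node_c])
assert state["uid=rick"][1] == "new-mail"
```

The point of the sketch is the direction of repair: the "minority" write wins because it is newer, not because of any vote.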
This turns out to be a matter of choice -- if I went for majority voting at all, it would only be with confirmation from a majority that the transaction succeeded, which is what BerkeleyDB's replication can provide.
What I'm hearing here is that this "formal" approach adds latency without gaining much in practice -- only the *certainty* that the data has been stored with the durability level that replication promises. That certainty costs a delay on every write, and it only pays off when lightning strikes just after a write to a single master.
Interestingly, OpenStack Swift takes the same approach -- commit a write based on local storage, then replicate later.
back-hdb and back-bdb both use BerkeleyDB. BerkeleyDB is now deprecated/obsolete, and LMDB (back-mdb) is the default backend.
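For a new installation, a back-mdb database stanza looks roughly like this (directive names as documented in slapd-mdb(5); the suffix, rootdn, directory, and size values are placeholders to adapt):

```
# Minimal back-mdb stanza for slapd.conf -- values are examples only.
database    mdb
suffix      "dc=example,dc=com"
rootdn      "cn=admin,dc=example,dc=com"
directory   /var/lib/ldap
maxsize     1073741824          # LMDB map size in bytes; set an upper bound
index       objectClass eq
```

Unlike BDB, LMDB needs an explicit maximum map size up front, which is the main tuning knob to be aware of when switching.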
I'm preparing new installations, so I suppose I will get to see it as the default.
BDB's replication is page-oriented, so it would consume far more network resources than syncrepl. We have never recommended its use.
That was indeed a design consideration I was weighing. The trade-off recommended here is clear and makes sense -- after all, I don't flush to disk after every write either.
Thanks, -Rick