In the example with 5 nodes and a split, it is my understanding that the two nodes can't elect a leader.
While the candidate in the smaller split receives votes from a majority of that split, that is not a true majority of the cluster, so no leader is elected. The cluster is configured with the total number of nodes.
What could happen is that an already elected leader continues to think it's the leader for a while, while the rest of the cluster elects a new leader. The split-off leader will however fail to commit its log entries, and will throw them away once it rejoins.
Another important detail that's missing is that a node only votes once per term, and only for a candidate whose term is at least as high as its own. It will never vote twice or vote for an outdated node.
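To make that voting rule concrete, here is a minimal sketch of a vote handler, with invented type and field names (it also omits the log up-to-date check from section 5.4.1 of the paper):

    package raftsketch

    // node holds only the state needed for the voting rule described above.
    type node struct {
        currentTerm int
        votedFor    int // -1 means no vote cast in currentTerm
    }

    type voteRequest struct {
        term        int
        candidateID int
    }

    // grantVote returns true only if the candidate's term is at least as high
    // as ours and we have not already voted for a different candidate in that term.
    func (n *node) grantVote(req voteRequest) bool {
        if req.term < n.currentTerm {
            return false // never vote for an outdated candidate
        }
        if req.term > n.currentTerm {
            // a higher term starts a new election; our old vote no longer counts
            n.currentTerm = req.term
            n.votedFor = -1
        }
        if n.votedFor == -1 || n.votedFor == req.candidateID {
            n.votedFor = req.candidateID // at most one vote per term
            return true
        }
        return false
    }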
Changing the configuration is in fact handled specially at the end of the Raft paper, in a way that avoids split-brain.
[edit] Oh, the 2-node split in fact already contained the leader, so it does exactly what I described. Dur...
Given that nodes are trusted and communicate securely, why would a node be dishonest? (I'm thinking about this in a standard server setting: is your point that you can't use it for distributed peer-to-peer stuff on random user machines?)
It seems to me that homogeneity amongst nodes in a distributed consensus mechanism is of utmost importance; once you implement a hierarchical power structure dishonesty becomes difficult to deal with... especially when "voting" for a "leader" is involved in mapping the node landscape.
Can someone explain to me how Raft differs from Viewstamped Replication? From reading both papers (VR Revisited) it looks like Raft just renamed all of VR's nomenclature without changing anything significant. Paxos is fairly different since it only relies on a distinguished leader for guaranteed progress; it "works" without one. Under the hood the mechanism is still similar though, as opposed to something like chain replication.
> VR uses a leader-based approach with many similarities to Raft.
> However, Raft has less mechanism than VR or ZooKeeper because it minimizes the functionality in non-leaders. For example, log entries in Raft flow in only one direction: outward from the leader in AppendEntries RPCs. In VR log entries flow in both directions (leaders can receive log entries during the election process); this results in additional mechanism and complexity.
> Raft has fewer message types than any other algorithm for consensus-based log replication that we are aware of. For example, we counted the message types VR and ZooKeeper use for basic consensus and membership changes (excluding log compaction and client interaction, as these are nearly independent of the algorithms). VR and ZooKeeper each define 10 different message types, while Raft has only 4 message types (two RPC requests and their responses).
Raft itself is very nice, and the paper (https://ramcloud.stanford.edu/wiki/download/attachments/1137...) does an excellent job of explaining it, but I have some problems with this claim:
"A user study with 43 students at two universities shows that Raft is significantly easier to understand than Paxos: after learning both algorithms, 33 of these students were able to answer questions about Raft better than questions about Paxos."
Honestly, what I think happened is this: they first explained Paxos to the poor students, then asked questions. In a later session they explained Raft and asked questions.
Couldn't it be that the students kept processing the problem of distributed consensus between the sessions, so they got a better grasp of the topic? This would mean Paxos helped them understand Raft better.
Anyway I'm nitpicking etc etc.
> Each student watched one video, took the corresponding quiz, watched the second video, and took the second quiz. About half of the participants did the Paxos portion first and the other half did the Raft portion first in order to account for both individual differences in performance and experience gained from the first portion of the study. We compared participants’ scores on each quiz to determine whether participants showed a better understanding of Raft.
This seems to be a good strategy, but it is difficult to factor out the quality of the explanations.
The reason I'm so picky about this claim is that before you know it, you have mythical pseudo-statistical claims like "some programmers are more than 10 times as good as others"
that will live a life of their own. CS has way too many of those.
Can't this lose commits? It looks like the message to commit the log entry to the followers happens after the message to the client to say that the commit is confirmed. The client can't actually tell when their commit has been replicated. If the leader dies before sending that confirm to the followers, the client will end up thinking the new leader has a commit which it's going to have to roll back.
Is this something the full algorithm handles differently to the way the diagrams would indicate?
The response to the client is only sent after a majority of followers have the log entry. That's described in the text in the "Protocol Overview", and nicely animated in "Log Replication".
Yes, they have the log entry, but it can still be rolled back, can't it? Here's how I understand the process:
1. Client sends log entry to leader
2. Leader appends log entry, forwards it to followers
3. Majority of followers confirm
4. Leader commits the log entry
5. Leader confirms the commit to the client
6. Followers commit on the next heartbeat
What happens if the leader goes away between 5 and 6? To my eyes, it looks like the followers will time out, elect a new leader, and have to roll back the last log entry.
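For what it's worth, the commit rule in steps 3-5 can be sketched roughly like this (names invented for illustration, not from any real codebase): the leader treats an entry as committed, and only then replies to the client, once a strict majority of the whole cluster (leader included) has stored it.

    package raftsketch

    // isCommitted reports whether the entry at entryIndex has been stored by a
    // strict majority of the cluster. matchIndex holds, for each follower, the
    // highest log index known to be replicated on that follower.
    func isCommitted(matchIndex []int, entryIndex int) bool {
        acks := 1 // the leader itself has the entry
        for _, m := range matchIndex {
            if m >= entryIndex {
                acks++
            }
        }
        clusterSize := len(matchIndex) + 1 // followers plus the leader
        return acks > clusterSize/2        // strict majority of the whole cluster
    }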
If an entry has been replicated to a majority of followers, then the new leader is guaranteed to have that entry and therefore it won't be rolled back.
That is correct. The solution to this is given in section 5.4.1 (election restriction), section 5.4.2 (Committing entries from previous terms) and section 8 (Client interaction) of the Raft paper.
Roughly, a newly elected leader will have all committed entries (guaranteed by the "election restriction", 5.4.1), but it does not know precisely which entries are committed. The new leader will commit a no-op log entry (section 8), and after it has received replies from a majority of the cluster it will know which entries have already been committed.
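A rough sketch of that rule, with invented names (not from any particular implementation): the new leader only advances its commit index by counting replicas of entries from its own term, which is exactly why it commits a no-op from the current term right after winning the election; committing the no-op implicitly commits everything before it.

    package raftsketch

    // advanceCommitIndex applies the section 5.4.2 restriction: only entries from
    // the leader's current term are committed by counting replicas; earlier
    // entries become committed indirectly once such an entry commits.
    func advanceCommitIndex(logTerms []int, matchIndex []int, currentTerm, commitIndex int) int {
        clusterSize := len(matchIndex) + 1 // followers plus the leader
        for n := commitIndex + 1; n <= len(logTerms); n++ {
            if logTerms[n-1] != currentTerm {
                continue // never commit an old-term entry directly
            }
            acks := 1 // the leader has its own entry
            for _, m := range matchIndex {
                if m >= n {
                    acks++
                }
            }
            if acks > clusterSize/2 {
                commitIndex = n // this also commits every entry before index n
            }
        }
        return commitIndex
    }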
This is the whole idea of distributed storage. You add many nodes (which are only confirmed to have the transaction in memory) to reduce the probability of all nodes crashing during one commit, and this is considered more reliable than locking the whole data store to wait for a confirmed disk write. I'm assuming you mean commit == write to disk?
i.e. P(client queues up too many requests and crashes because the db is too slow waiting on disk) > P(all 5 distributed db nodes holding the transaction in memory crash simultaneously before it is written to disk)
At most one side of a split can gain a majority (possibly neither side can). Therefore you won't get a success message until you can be sure that the data won't get overwritten after the split is reconnected.
Horrible presentation - the enforced pauses (even hiding the continue button during them!) cause frustration after everysinglesentence.
Why can't I read at a normal pace instead of being interrupted all the time and having to wait while the next sentence is shown?
[edit] This behavior would be very suitable if I was making a presentation to an audience with this content - but it's quite contrary to what's needed for the audience to view the content themselves at their own pacing.
It would be even nicer if the "Continue" button had a permanent position (and if I could use enter/space/pagedown/... instead of mouse), but I didn't notice that it was hidden between animations. I guess I am slower than you are. :)
I am not sure if the concept is valid (some other comments have issue with that), but it was well presented. Good job, OP - keep it up!
The presentation originally moved forward on its own but as you get into later sections there's a lot going on visually so it's easy to miss key points in the explanation. There's also an issue that the presentation is attached to the wall clock in later sections so you need the visual in a certain state before moving forward on the explanation.
In hindsight, I agree that there are better ways to present this information. Part of this project is to help me learn how to best communicate complicated topics like distributed consensus in the most effective way. Most existing resources are 20 page PhD dissertations which are not very accessible to beginners so I'm just figuring this out as I go.
I'm changing the format in future presentations. I'm working on infrastructure now to allow D3.js visualizations to be embedded into Medium blog posts. That'll give the best of both worlds -- read at your own pace and interact with mini visualizations as you go along.
I'm always looking for CS topics to visualize and help explain better. If you (or anyone) has any suggestions, please let me know. I'm @benbjohnson on Twitter.
I liked it very much. This works well for individuals whose reading speed exceeds their speed of comprehension. For example, I tend to read way too fast for my own good and need something to pace me (only for technical texts, not fiction). This works very well. It allows me to focus on one thing at a time.
Much of this feedback misses the point -- the presentation is meant to supplement other forms of study. Read the paper a couple of times, and then come back to this. You'll appreciate its dynamic nature then.
This may be a bit off-topic, but I fail to understand why the top 2 textbooks on Distributed Computing - Tanenbaum and Coulouris - don't have a dedicated section on Consensus Algorithms. I learned distributed computing from Tanenbaum and can't recall encountering it.
Contrary to some of the folks here, I found the presentation very cool. But that may be because I'm a slow learner.
I accidentally clicked twice on "Continue" and there wasn't any way to go back to read the comment I missed. This slide-show player wants to be clean and simple and ends up with holes in its functionality.
In the network partition example, you say that in the smaller partition, changes cannot be committed because they cannot be replicated to a majority of nodes (as the smaller partition is... smaller). How is the partition to know this? The system can't tell the difference between a node leaving the network and a node undergoing a (temporary) partition.
To give an example, say I have n machines in datacenter A, and n*.99 in datacenter B. datacenter A gets destroyed, permanently. Does datacenter B now reject all (EDIT: where reject = not commit) requests until a human comes along to tell it that datacenter A isn't coming back?
> To give an example, say I have n machines in datacenter A, and n*.99 in datacenter B. datacenter A gets destroyed, permanently. Does datacenter B now reject all (EDIT: where reject = not commit) requests until a human comes along to tell it that datacenter A isn't coming back?
Of CAP, you are now choosing CP with Raft. So yes, the system is unavailable until an external agent fixes it. In other words, the system needs to have a majority of nodes online to be "available".
What would happen if nodes were to be added to each side of a network partition (unknown to the other side), so that each side believed they had a majority? Or is the "writing" side of the partition determined at partition time, and not changed until they are restored?
* A new node needs a round of Raft to notify the other nodes of its presence
So you can only add new nodes (automatically) when you have a 'live' system.
majority = ceil((2n + 1)/2), so by counting the number of reachable nodes in its partition, a node can figure out whether it is in the majority or minority cluster.
See section 6 in the paper for details of its implementation.
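As a rough illustration (an assumed helper, not something from the paper or a specific implementation), a node can compare the number of peers it can currently reach against the configured cluster size:

    package raftsketch

    // inMajorityPartition reports whether this node, together with the peers it
    // can currently reach, forms a strict majority of the configured cluster size.
    // With clusterSize = 2n + 1 the threshold works out to n + 1, i.e. ceil((2n + 1)/2).
    func inMajorityPartition(reachablePeers, clusterSize int) bool {
        visible := reachablePeers + 1 // count this node as well
        return visible > clusterSize/2
    }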
In the goraft implementation https://github.com/goraft they use an explicit leave command that gets added to the log. This way if a leave command was not received, a partition can be assumed.
Just a nitpick here with the qualifications that I'm one of the authors of Serf: Serf doesn't use Raft. Serf is masterless and the distributed messaging protocol used is SWIM (a gossip protocol).
If I've understood the presentation correctly, Raft is a master-slave protocol that determines how to choose a master.
It basically relies on random chance (i.e. who receives the message first) to elect a master, has basically no real way of resolving a conflict in an election (i.e. if two nodes receive the same number of votes, we do a re-election ad infinitum), and does not address the situation of two nodes having conflicting sets of data (for instance from a network partition).
Considering all that, this protocol doesn't seem very interesting (from a use-case point of view).
The conflicting data under partition is covered, I think: the side that doesn't meet quorum won't accept writes, and it will see that it has a lower election term than the other partition when it re-joins (see the sketch at the end of this comment).
EDIT:
From a use case point of view, it simplifies the construction of CP systems. This has directly led to etcd and consul, which would be many times more complex had their authors had to implement Paxos.
Both etcd and consul are still young software, but if you take a look at the 'Call me Maybe' series of blog posts it's pretty apparent that there's a massive deficiency in current systems' handling of network partitions.
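Here is the rejoin sketch mentioned above, with made-up types for illustration: when a node from the minority side receives entries from the higher-term leader, any entry that conflicts with the leader's log (same index, different term) is deleted along with everything after it, so the minority's uncommitted writes are discarded.

    package raftsketch

    type entry struct {
        term    int
        command string
    }

    // reconcile truncates the follower's log at the first index where it disagrees
    // with the leader's entries starting after prevIndex, then appends the leader's
    // entries. This mirrors the AppendEntries conflict rule from the paper.
    func reconcile(followerLog []entry, prevIndex int, leaderEntries []entry) []entry {
        for i, e := range leaderEntries {
            idx := prevIndex + 1 + i // 0-based index into the follower's log
            if idx < len(followerLog) && followerLog[idx].term != e.term {
                followerLog = followerLog[:idx] // drop the conflicting suffix
            }
            if idx >= len(followerLog) {
                followerLog = append(followerLog, e)
            }
        }
        return followerLog
    }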
> and does not address the situation of two nodes having conflicting sets of data (for instance from a network partition)
I believe it does address this. Each log entry is either committed or not; an entry can only be committed if it has been replicated to a majority of nodes. Any node that lacks a committed entry cannot be elected master because of the election rules: a node will not vote for another node less complete than itself. Since a committed entry has been replicated to a majority, a node lacking that entry cannot receive a majority of the votes. (Thus the committed log entries will always be the same on all nodes (though some may be behind, and may only have a subset), which is the purpose of the protocol.)
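Concretely, "less complete" is the up-to-date check from section 5.4.1 of the paper; a minimal sketch, with invented names:

    package raftsketch

    // candidateUpToDate reports whether a candidate's log is at least as up-to-date
    // as the voter's: the log ending in the higher term wins; if the last terms are
    // equal, the longer log wins. A voter refuses candidates that fail this test, so
    // a candidate missing a committed (majority-replicated) entry cannot win.
    func candidateUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex int) bool {
        if candLastTerm != myLastTerm {
            return candLastTerm > myLastTerm
        }
        return candLastIndex >= myLastIndex
    }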
> It basically relies on random chance (i.e. who receives the message first) to elect a master, has basically no real way of resolving a conflict in an election
This is mostly true. The PDF slides I link to below recommend that the election timeout be much greater than the broadcast time, the idea being that things should work out in the long run.
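The usual mitigation (sketched below with example numbers that are my assumption, not something the paper prescribes beyond broadcast time << election timeout) is to pick each node's election timeout at random from a range well above the expected round-trip and retransmission times, so that one node almost always times out, wins, and starts sending heartbeats before the others fire:

    package raftsketch

    import (
        "math/rand"
        "time"
    )

    // electionTimeout picks a random timeout in [base, 2*base). Randomization makes
    // repeated split votes unlikely: some node usually times out first and collects
    // a majority before the rest start their own elections. The 300ms base is just
    // an example value chosen to sit well above typical broadcast/retransmission times.
    func electionTimeout() time.Duration {
        const base = 300 * time.Millisecond
        return base + time.Duration(rand.Int63n(int64(base)))
    }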
> I believe it does address this. Each log entry is either committed or not; an entry can only be committed if it has been replicated to a majority of nodes. Any node that lacks a committed entry cannot be elected master because of the election rules: a node will not vote for another node less complete than itself. Since a committed entry has been replicated to a majority, a node lacking that entry cannot receive a majority of the votes. (Thus the committed log entries will always be the same on all nodes (though some may be behind, and may only have a subset), which is the purpose of the protocol.)
What happens if (for instance) a 4-node cluster splits into two 2-node clusters (i.e. a network fault between two data centers) - does each cluster choose a leader? How is "majority" calculated? Is the Raft protocol unable to handle half of its nodes being taken down?
What happens if two clusters break off, both choose a leader (if it's possible), both get new writes, and then both clusters come back together?
> I'd love to hear of alternatives.
I know of no protocols per se, but for implementations of a master-slave protocol, there's Mongo's replica-set algorithm (one notable difference is that each node can have a priority).
There are also master-master implementations (such as Cassandra's) that require no election and, IMO, serve more interesting use-cases.
> What happens if (for instance) a 4-node cluster splits into two 2-node clusters (i.e. a network fault between two data centers) - does each cluster choose a leader?
A Raft cluster must have an odd number of nodes.
> How is "majority" calculated?
ceil(nodes/2).
> Is the Raft protocol unable to handle half of its nodes being taken down? What happens if two clusters break off, both choose a leader (if it's possible), both get new writes, and then both clusters come back together?
> A Raft cluster must have an odd number of nodes.
Why must a Raft cluster have an odd number of nodes?
> > How is "majority" calculated?
> ceil(nodes/2).
A majority is defined as having more than half the votes. I.e., you need floor(nodes / 2) + 1 votes, or more simply votes > nodes / 2. Even in an even-sized cluster that can only hold true for one node, so it cannot cause splits.
Not sure I understand. A node in each split cluster would need at least 4 votes to be elected leader. Hence no node can be elected leader since all split clusters have strictly fewer than 4 nodes.
Theorem. With 2n + 1 nodes, there cannot be two separate majorities after a network split.
Proof. By way of contradiction, assume there are two separate majorities. Each would contain at least ceil((2n + 1)/2) = n + 1 nodes. This implies that there are at least 2(n + 1) = 2n + 2 nodes in total, a contradiction.
> What happens if (for instance) a 4-node cluster splits into two 2-node clusters (i.e. a network fault between two data centers) - does each cluster choose a leader?
No majority is possible here.
> How is "majority" calculated?
The definition of majority is greater than half the set (that is the meaning of the word). If you have four members, 3 is the lowest such number that is greater than 4 / 2.
It does seem like the election timeout needs to be way longer than 150ms. The (default) TCP min_rto parameter on Linux is 200ms. A single lost packet could cause an election with the timeout from the paper. You'd either need to use longer timeouts or reduce min_rto to something sensible. And of course the equation changes for geographically distant servers.