# Raft Distributed Consensus Algorithm

This text was copied from its original PDF form during implementation, for reference.

http://ramcloud.stanford.edu/raft.pdf

## 5 The Raft consensus algorithm

Raft is an algorithm for managing a replicated log of the form described in Section 2. Figure 2 summarizes the algorithm in condensed form for reference, and Figure 3 lists key properties of the algorithm; the elements of these figures are discussed piecewise over the rest of this section.

Raft implements consensus by first electing a distinguished leader, then giving the leader complete responsibility for managing the replicated log. The leader accepts log entries from clients, replicates them on other servers, and tells servers when it is safe to apply log entries to their state machines. Having a leader simplifies the management of the replicated log. For example, the leader can decide where to place new entries in the log without consulting other servers, and data flows in a simple fashion from the leader to other servers. A leader can fail or become disconnected from the other servers, in which case a new leader is elected.

Given the leader approach, Raft decomposes the consensus problem into three relatively independent subproblems, which are discussed in the subsections that follow:

• Leader election: a new leader must be chosen when an existing leader fails (Section 5.2).

• Log replication: the leader must accept log entries from clients and replicate them across the cluster, forcing the other logs to agree with its own (Section 5.3).

• Safety: the key safety property for Raft is the State Machine Safety Property in Figure 3: if any server has applied a particular log entry to its state machine, then no other server may apply a different command for the same log index. Section 5.4 describes how Raft ensures this property; the solution involves an additional restriction on the election mechanism described in Section 5.2.

After presenting the consensus algorithm, this section discusses the issue of availability and the role of timing in the system.

### 5.1 Raft basics

A Raft cluster contains several servers; five is a typical number, which allows the system to tolerate two failures. At any given time each server is in one of three states: leader, follower, or candidate. In normal operation there is exactly one leader and all of the other servers are followers. Followers are passive: they issue no requests on their own but simply respond to requests from leaders and candidates. The leader handles all client requests (if a client contacts a follower, the follower redirects it to the leader). The third state, candidate, is used to elect a new leader as described in Section 5.2. Figure 4 shows the states and their transitions; the transitions are discussed below.

Raft divides time into terms of arbitrary length, as shown in Figure 5. Terms are numbered with consecutive integers. Each term begins with an election, in which one or more candidates attempt to become leader as described in Section 5.2. If a candidate wins the election, then it serves as leader for the rest of the term. In some situations an election will result in a split vote. In this case the term will end with no leader; a new term (with a new election) will begin shortly. Raft ensures that there is at most one leader in a given term.

Different servers may observe the transitions between terms at different times, and in some situations a server may not observe an election or even entire terms. Terms act as a logical clock [14] in Raft, and they allow servers to detect obsolete information such as stale leaders. Each server stores a current term number, which increases monotonically over time. Current terms are exchanged whenever servers communicate; if one server's current term is smaller than the other's, then it updates its current term to the larger value. If a candidate or leader discovers that its term is out of date, it immediately reverts to follower state. If a server receives a request with a stale term number, it rejects the request.

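In an implementation, these term rules collapse into a small guard that runs on every RPC send and receive. Below is a minimal Go sketch; the Server type and all names are illustrative assumptions for this document, not identifiers from the paper.

```go
type State int

const (
	Follower State = iota
	Candidate
	Leader
)

// Server holds the per-server state from Figure 2 that the term rules
// need. Later sketches assume it is extended with the rest of the
// Figure 2 state (log, commitIndex, and so on).
type Server struct {
	id          string
	currentTerm uint64 // latest term this server has seen
	votedFor    string // candidate voted for in currentTerm ("" if none)
	state       State
}

// observeTerm applies the Section 5.1 rule to a term carried by any RPC
// request or response: a larger term is adopted immediately, and a
// candidate or leader that sees one reverts to follower.
func (s *Server) observeTerm(rpcTerm uint64) {
	if rpcTerm > s.currentTerm {
		s.currentTerm = rpcTerm
		s.votedFor = "" // no vote cast yet in the new term
		s.state = Follower
	}
}

// isStale reports whether an incoming request carries an out-of-date
// term and must therefore be rejected.
func (s *Server) isStale(rpcTerm uint64) bool {
	return rpcTerm < s.currentTerm
}
```
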
Raft servers communicate using remote procedure calls (RPCs), and the basic consensus algorithm requires only two types of RPCs. RequestVote RPCs are initiated by candidates during elections (Section 5.2), and AppendEntries RPCs are initiated by leaders to replicate log entries and to provide a form of heartbeat (Section 5.3). Section 7 adds a third RPC for transferring snapshots between servers. Servers retry RPCs if they do not receive a response in a timely manner, and they issue RPCs in parallel for best performance.

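Since Figure 2 (reproduced at the end of this document) fixes the exact fields of both RPCs, they can be written down directly as Go structs. The concrete types here (uint64 terms and indexes, string server IDs) are assumptions of this sketch:

```go
// LogEntry is one slot in the replicated log (Section 5.3): a state
// machine command plus the term in which the leader received it.
type LogEntry struct {
	Index   uint64 // position in the log (first index is 1)
	Term    uint64 // term when the entry was received by the leader
	Command []byte // opaque state machine command
}

// AppendEntriesArgs mirrors the AppendEntries argument list in Figure 2.
type AppendEntriesArgs struct {
	Term         uint64     // leader's term
	LeaderID     string     // so the follower can redirect clients
	PrevLogIndex uint64     // index of entry immediately preceding new ones
	PrevLogTerm  uint64     // term of the prevLogIndex entry
	Entries      []LogEntry // empty for a heartbeat
	LeaderCommit uint64     // leader's commitIndex
}

type AppendEntriesReply struct {
	Term    uint64 // currentTerm, for the leader to update itself
	Success bool   // true if follower matched prevLogIndex/prevLogTerm
}

// RequestVoteArgs mirrors the RequestVote argument list in Figure 2.
type RequestVoteArgs struct {
	Term         uint64 // candidate's term
	CandidateID  string // candidate requesting the vote
	LastLogIndex uint64 // index of candidate's last log entry
	LastLogTerm  uint64 // term of candidate's last log entry
}

type RequestVoteReply struct {
	Term        uint64 // currentTerm, for the candidate to update itself
	VoteGranted bool   // true means the candidate received the vote
}
```
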
### 5.2 Leader election

Raft uses a heartbeat mechanism to trigger leader election. When servers start up, they begin as followers. A server remains in follower state as long as it receives valid RPCs from a leader or candidate. Leaders send periodic heartbeats (AppendEntries RPCs that carry no log entries) to all followers in order to maintain their authority. If a follower receives no communication over a period of time called the election timeout, then it assumes there is no viable leader and begins an election to choose a new leader.

To begin an election, a follower increments its current term and transitions to candidate state. It then votes for itself and issues RequestVote RPCs in parallel to each of the other servers in the cluster. A candidate continues in this state until one of three things happens: (a) it wins the election, (b) another server establishes itself as leader, or (c) a period of time goes by with no winner. These outcomes are discussed separately in the paragraphs below.

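As a sketch of the transition just described, assuming the Server and RPC types from Section 5.1 and an injected transport function (an assumption of this sketch, since the paper does not prescribe one):

```go
// startElection implements the candidate transition of Section 5.2:
// increment the term, vote for ourselves, and solicit votes from every
// peer in parallel. lastLogIndex/lastLogTerm describe our own log and
// are checked by voters (Section 5.4.1).
func (s *Server) startElection(
	peers []string,
	lastLogIndex, lastLogTerm uint64,
	send func(peer string, args RequestVoteArgs) RequestVoteReply,
) bool {
	s.currentTerm++
	s.state = Candidate
	s.votedFor = s.id // a candidate first votes for itself

	args := RequestVoteArgs{
		Term:         s.currentTerm,
		CandidateID:  s.id,
		LastLogIndex: lastLogIndex,
		LastLogTerm:  lastLogTerm,
	}

	replies := make(chan RequestVoteReply, len(peers))
	for _, p := range peers {
		go func(p string) { replies <- send(p, args) }(p) // RPCs in parallel
	}

	granted := 1 // our own vote
	for range peers {
		r := <-replies
		s.observeTerm(r.Term) // a newer term demotes us to follower
		if s.state != Candidate {
			return false // our term went stale or a leader appeared
		}
		if r.VoteGranted {
			granted++
		}
		if granted > (len(peers)+1)/2 { // majority of the full cluster
			s.state = Leader
			return true
		}
	}
	return false // split vote: retry after a fresh randomized timeout
}
```
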
A candidate wins an election if it receives votes from a majority of the servers in the full cluster for the same term. Each server will vote for at most one candidate in a given term, on a first-come-first-served basis (note: Section 5.4 adds an additional restriction on votes). The majority rule ensures that at most one candidate can win the election for a particular term (the Election Safety Property in Figure 3). Once a candidate wins an election, it becomes leader. It then sends heartbeat messages to all of the other servers to establish its authority and prevent new elections.

While waiting for votes, a candidate may receive an AppendEntries RPC from another server claiming to be leader. If the leader's term (included in its RPC) is at least as large as the candidate's current term, then the candidate recognizes the leader as legitimate and returns to follower state. If the term in the RPC is smaller than the candidate's current term, then the candidate rejects the RPC and continues in candidate state.

The third possible outcome is that a candidate neither wins nor loses the election: if many followers become candidates at the same time, votes could be split so that no candidate obtains a majority. When this happens, each candidate will time out and start a new election by incrementing its term and initiating another round of RequestVote RPCs. However, without extra measures split votes could repeat indefinitely.

Raft uses randomized election timeouts to ensure that split votes are rare and that they are resolved quickly. To prevent split votes in the first place, election timeouts are chosen randomly from a fixed interval (e.g., 150–300ms). This spreads out the servers so that in most cases only a single server will time out; it wins the election and sends heartbeats before any other servers time out. The same mechanism is used to handle split votes. Each candidate restarts its randomized election timeout at the start of an election, and it waits for that timeout to elapse before starting the next election; this reduces the likelihood of another split vote in the new election. Section 9.3 shows that this approach elects a leader rapidly.

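The randomization itself is one line of code. A minimal sketch using the example interval above (the helper name is illustrative):

```go
import (
	"math/rand"
	"time"
)

// electionTimeout draws a fresh, randomized timeout from [150ms, 300ms),
// the example interval from Section 5.2. Re-randomizing on every reset
// is what makes repeated split votes unlikely.
func electionTimeout() time.Duration {
	const lo, hi = 150 * time.Millisecond, 300 * time.Millisecond
	return lo + time.Duration(rand.Int63n(int64(hi-lo)))
}
```

A follower would re-arm a timer with a new value from this function after every valid RPC from a leader or candidate, and start an election when the timer fires.
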
Elections are an example of how understandability guided our choice between design alternatives. Initially we planned to use a ranking system: each candidate was assigned a unique rank, which was used to select between competing candidates. If a candidate discovered another candidate with higher rank, it would return to follower state so that the higher ranking candidate could more easily win the next election. We found that this approach created subtle issues around availability (a lower-ranked server might need to time out and become a candidate again if a higher-ranked server fails, but if it does so too soon, it can reset progress towards electing a leader). We made adjustments to the algorithm several times, but after each adjustment new corner cases appeared. Eventually we concluded that the randomized retry approach is more obvious and understandable.

### 5.3 Log replication

Once a leader has been elected, it begins servicing client requests. Each client request contains a command to be executed by the replicated state machines. The leader appends the command to its log as a new entry, then issues AppendEntries RPCs in parallel to each of the other servers to replicate the entry. When the entry has been safely replicated (as described below), the leader applies the entry to its state machine and returns the result of that execution to the client. If followers crash or run slowly, or if network packets are lost, the leader retries AppendEntries RPCs indefinitely (even after it has responded to the client) until all followers eventually store all log entries.

Logs are organized as shown in Figure 6. Each log entry stores a state machine command along with the term number when the entry was received by the leader. The term numbers in log entries are used to detect inconsistencies between logs and to ensure some of the properties in Figure 3. Each log entry also has an integer index identifying its position in the log.

The leader decides when it is safe to apply a log entry to the state machines; such an entry is called committed. Raft guarantees that committed entries are durable and will eventually be executed by all of the available state machines. A log entry is committed once the leader that created the entry has replicated it on a majority of the servers (e.g., entry 7 in Figure 6). This also commits all preceding entries in the leader's log, including entries created by previous leaders. Section 5.4 discusses some subtleties when applying this rule after leader changes, and it also shows that this definition of commitment is safe. The leader keeps track of the highest index it knows to be committed, and it includes that index in future AppendEntries RPCs (including heartbeats) so that the other servers eventually find out. Once a follower learns that a log entry is committed, it applies the entry to its local state machine (in log order).

We designed the Raft log mechanism to maintain a high level of coherency between the logs on different servers. Not only does this simplify the system's behavior and make it more predictable, but it is an important component of ensuring safety. Raft maintains the following properties, which together constitute the Log Matching Property in Figure 3:

• If two entries in different logs have the same index and term, then they store the same command.

• If two entries in different logs have the same index and term, then the logs are identical in all preceding entries.

The first property follows from the fact that a leader creates at most one entry with a given log index in a given term, and log entries never change their position in the log. The second property is guaranteed by a simple consistency check performed by AppendEntries. When sending an AppendEntries RPC, the leader includes the index and term of the entry in its log that immediately precedes the new entries. If the follower does not find an entry in its log with the same index and term, then it refuses the new entries. The consistency check acts as an induction step: the initial empty state of the logs satisfies the Log Matching Property, and the consistency check preserves the Log Matching Property whenever logs are extended. As a result, whenever AppendEntries returns successfully, the leader knows that the follower's log is identical to its own log up through the new entries.

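The follower's side of the consistency check is a few lines. A sketch, reusing the LogEntry type above and assuming a 1-indexed log slice with an unused sentinel at position 0:

```go
// hasMatchingEntry implements the AppendEntries consistency check of
// Section 5.3: the follower accepts new entries only if its log contains
// an entry at prevLogIndex whose term equals prevLogTerm.
func hasMatchingEntry(log []LogEntry, prevLogIndex, prevLogTerm uint64) bool {
	if prevLogIndex == 0 {
		return true // the empty prefix trivially matches
	}
	if prevLogIndex >= uint64(len(log)) {
		return false // our log is too short to contain that entry
	}
	return log[prevLogIndex].Term == prevLogTerm
}
```
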
During normal operation, the logs of the leader and followers stay consistent, so the AppendEntries consistency check never fails. However, leader crashes can leave the logs inconsistent (the old leader may not have fully replicated all of the entries in its log). These inconsistencies can compound over a series of leader and follower crashes. Figure 7 illustrates the ways in which followers' logs may differ from that of a new leader. A follower may be missing entries that are present on the leader, it may have extra entries that are not present on the leader, or both. Missing and extraneous entries in a log may span multiple terms.

In Raft, the leader handles inconsistencies by forcing the followers' logs to duplicate its own. This means that conflicting entries in follower logs will be overwritten with entries from the leader's log. Section 5.4 will show that this is safe when coupled with one more restriction.

To bring a follower's log into consistency with its own, the leader must find the latest log entry where the two logs agree, delete any entries in the follower's log after that point, and send the follower all of the leader's entries after that point. All of these actions happen in response to the consistency check performed by AppendEntries RPCs. The leader maintains a nextIndex for each follower, which is the index of the next log entry the leader will send to that follower. When a leader first comes to power, it initializes all nextIndex values to the index just after the last one in its log (11 in Figure 7). If a follower's log is inconsistent with the leader's, the AppendEntries consistency check will fail in the next AppendEntries RPC. After a rejection, the leader decrements nextIndex and retries the AppendEntries RPC. Eventually nextIndex will reach a point where the leader and follower logs match. When this happens, AppendEntries will succeed, which removes any conflicting entries in the follower's log and appends entries from the leader's log (if any). Once AppendEntries succeeds, the follower's log is consistent with the leader's, and it will remain that way for the rest of the term.

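The leader's side of this repair loop might look as follows. This sketch assumes the Server type carries the Figure 2 leader state as fields (log as a 1-indexed []LogEntry with a sentinel at 0, plus commitIndex); the transport function is again an assumption:

```go
// replicateTo drives one follower's log toward the leader's (Section
// 5.3): send everything from nextIndex onward, and on rejection back
// nextIndex up by one and retry.
func (s *Server) replicateTo(
	peer string,
	nextIndex map[string]uint64,
	send func(peer string, args AppendEntriesArgs) AppendEntriesReply,
) {
	for {
		ni := nextIndex[peer]
		args := AppendEntriesArgs{
			Term:         s.currentTerm,
			LeaderID:     s.id,
			PrevLogIndex: ni - 1,
			PrevLogTerm:  s.log[ni-1].Term, // entry just before the new ones
			Entries:      s.log[ni:],       // everything the follower lacks
			LeaderCommit: s.commitIndex,
		}
		reply := send(peer, args)
		if reply.Term > s.currentTerm {
			s.observeTerm(reply.Term) // we are a deposed leader: step down
			return
		}
		if reply.Success {
			nextIndex[peer] = uint64(len(s.log)) // follower is caught up
			return
		}
		nextIndex[peer] = ni - 1 // consistency check failed: back up, retry
	}
}
```
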
If desired, the protocol can be optimized to reduce the number of rejected AppendEntries RPCs. For example, when rejecting an AppendEntries request, the follower can include the term of the conflicting entry and the first index it stores for that term. With this information, the leader can decrement nextIndex to bypass all of the conflicting entries in that term; one AppendEntries RPC will be required for each term with conflicting entries, rather than one RPC per entry. In practice, we doubt this optimization is necessary, since failures happen infrequently and it is unlikely that there will be many inconsistent entries.

With this mechanism, a leader does not need to take any special actions to restore log consistency when it comes to power. It just begins normal operation, and the logs automatically converge in response to failures of the AppendEntries consistency check. A leader never overwrites or deletes entries in its own log (the Leader Append-Only Property in Figure 3).

This log replication mechanism exhibits the desirable consensus properties described in Section 2: Raft can accept, replicate, and apply new log entries as long as a majority of the servers are up; in the normal case a new entry can be replicated with a single round of RPCs to a majority of the cluster; and a single slow follower will not impact performance.

### 5.4 Safety

The previous sections described how Raft elects leaders and replicates log entries. However, the mechanisms described so far are not quite sufficient to ensure that each state machine executes exactly the same commands in the same order. For example, a follower might be unavailable while the leader commits several log entries, then it could be elected leader and overwrite these entries with new ones; as a result, different state machines might execute different command sequences.

This section completes the Raft algorithm by adding a restriction on which servers may be elected leader. The restriction ensures that the leader for any given term contains all of the entries committed in previous terms (the Leader Completeness Property from Figure 3). Given the election restriction, we then make the rules for commitment more precise. Finally, we present a proof sketch for the Leader Completeness Property and show how it leads to correct behavior of the replicated state machine.

#### 5.4.1 Election restriction

In any leader-based consensus algorithm, the leader must eventually store all of the committed log entries. In some consensus algorithms, such as Viewstamped Replication [22], a leader can be elected even if it doesn't initially contain all of the committed entries. These algorithms contain additional mechanisms to identify the missing entries and transmit them to the new leader, either during the election process or shortly afterwards. Unfortunately, this results in considerable additional mechanism and complexity. Raft uses a simpler approach where it guarantees that all the committed entries from previous terms are present on each new leader from the moment of its election, without the need to transfer those entries to the leader. This means that log entries only flow in one direction, from leaders to followers, and leaders never overwrite existing entries in their logs.

Raft uses the voting process to prevent a candidate from winning an election unless its log contains all committed entries. A candidate must contact a majority of the cluster in order to be elected, which means that every committed entry must be present in at least one of those servers. If the candidate's log is at least as up-to-date as any other log in that majority (where "up-to-date" is defined precisely below), then it will hold all the committed entries. The RequestVote RPC implements this restriction: the RPC includes information about the candidate's log, and the voter denies its vote if its own log is more up-to-date than that of the candidate.

Raft determines which of two logs is more up-to-date by comparing the index and term of the last entries in the logs. If the logs have last entries with different terms, then the log with the later term is more up-to-date. If the logs end with the same term, then whichever log is longer is more up-to-date.

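This comparison is small enough to state directly in code; a self-contained sketch:

```go
// candidateIsUpToDate implements the Section 5.4.1 comparison: a later
// last term wins, and with equal last terms the longer log wins. A voter
// grants its vote only if the candidate's log is at least as up-to-date
// as its own, which this predicate decides.
func candidateIsUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex uint64) bool {
	if candLastTerm != myLastTerm {
		return candLastTerm > myLastTerm
	}
	return candLastIndex >= myLastIndex
}
```
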
#### 5.4.2 Committing entries from previous terms

As described in Section 5.3, a leader knows that an entry from its current term is committed once that entry is stored on a majority of the servers. If a leader crashes before committing an entry, future leaders will attempt to finish replicating the entry. However, a leader cannot immediately conclude that an entry from a previous term is committed once it is stored on a majority of servers. Figure 8 illustrates a situation where an old log entry is stored on a majority of servers, yet can still be overwritten by a future leader.

To eliminate problems like the one in Figure 8, Raft never commits log entries from previous terms by counting replicas. Only log entries from the leader's current term are committed by counting replicas; once an entry from the current term has been committed in this way, then all prior entries are committed indirectly because of the Log Matching Property. There are some situations where a leader could safely conclude that an older log entry is committed (for example, if that entry is stored on every server), but Raft takes a more conservative approach for simplicity.

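In code, the conservative rule is a term check inside the commit scan (this is also the last "Leaders" rule in Figure 2). The sketch assumes the Server fields described earlier; matchIndex covers the followers only:

```go
// advanceCommitIndex finds the largest N > commitIndex that is stored on
// a majority of the cluster AND belongs to the leader's current term
// (Section 5.4.2). Committing N implicitly commits all earlier entries
// via the Log Matching Property.
func (s *Server) advanceCommitIndex(matchIndex map[string]uint64, clusterSize int) {
	last := uint64(len(s.log) - 1) // 1-indexed log with sentinel at 0
	for n := last; n > s.commitIndex; n-- {
		if s.log[n].Term != s.currentTerm {
			break // never commit old-term entries by counting replicas
		}
		count := 1 // the leader itself stores the entry
		for _, m := range matchIndex {
			if m >= n {
				count++
			}
		}
		if count > clusterSize/2 {
			s.commitIndex = n
			return
		}
	}
}
```
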
Raft incurs this extra complexity in the commitment rules because log entries retain their original term numbers when a leader replicates entries from previous terms. In other consensus algorithms, if a new leader rereplicates entries from prior "terms," it must do so with its new "term number." Raft's approach makes it easier to reason about log entries, since they maintain the same term number over time and across logs. In addition, new leaders in Raft send fewer log entries from previous terms than in other algorithms (other algorithms must send redundant log entries to renumber them before they can be committed).

#### 5.4.3 Safety argument

Given the complete Raft algorithm, we can now argue more precisely that the Leader Completeness Property holds (this argument is based on the safety proof; see Section 9.2). We assume that the Leader Completeness Property does not hold and then derive a contradiction. Suppose the leader for term T (leaderT) commits a log entry from its term, but that log entry is not stored by the leader of some future term. Consider the smallest term U > T whose leader (leaderU) does not store the entry.

1. The committed entry must have been absent from leaderU's log at the time of its election (leaders never delete or overwrite entries).

2. leaderT replicated the entry on a majority of the cluster, and leaderU received votes from a majority of the cluster. Thus, at least one server ("the voter") both accepted the entry from leaderT and voted for leaderU, as shown in Figure 9. The voter is key to reaching a contradiction.

3. The voter must have accepted the committed entry from leaderT before voting for leaderU; otherwise it would have rejected the AppendEntries request from leaderT (its current term would have been higher than T).

4. The voter still stored the entry when it voted for leaderU, since every intervening leader contained the entry (by assumption), leaders never remove entries, and followers only remove entries if they conflict with the leader.

5. The voter granted its vote to leaderU, so leaderU's log must have been as up-to-date as the voter's. This leads to one of two contradictions.

6. First, if the voter and leaderU shared the same last log term, then leaderU's log must have been at least as long as the voter's, so its log contained every entry in the voter's log. This is a contradiction, since the voter contained the committed entry and leaderU was assumed not to.

7. Otherwise, leaderU's last log term must have been larger than the voter's. Moreover, it was larger than T, since the voter's last log term was at least T (it contains the committed entry from term T). The earlier leader that created leaderU's last log entry must have contained the committed entry in its log (by assumption). Then, by the Log Matching Property, leaderU's log must also contain the committed entry, which is a contradiction.

8. This completes the contradiction. Thus, the leaders of all terms greater than T must contain all entries from term T that are committed in term T.

9. The Log Matching Property guarantees that future leaders will also contain entries that are committed indirectly, such as index 2 in Figure 8(d).

Given the Leader Completeness Property, we can prove the State Machine Safety Property from Figure 3, which states that if a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for the same index. At the time a server applies a log entry to its state machine, its log must be identical to the leader's log up through that entry and the entry must be committed. Now consider the lowest term in which any server applies a given log index; the Leader Completeness Property guarantees that the leaders for all higher terms will store that same log entry, so servers that apply the index in later terms will apply the same value. Thus, the State Machine Safety Property holds.

Finally, Raft requires servers to apply entries in log index order. Combined with the State Machine Safety Property, this means that all servers will apply exactly the same set of log entries to their state machines, in the same order.

### 5.5 Follower and candidate crashes

Until this point we have focused on leader failures. Follower and candidate crashes are much simpler to handle than leader crashes, and they are both handled in the same way. If a follower or candidate crashes, then future RequestVote and AppendEntries RPCs sent to it will fail. Raft handles these failures by retrying indefinitely; if the crashed server restarts, then the RPC will complete successfully. If a server crashes after completing an RPC but before responding, then it will receive the same RPC again after it restarts. Raft RPCs are idempotent, so this causes no harm. For example, if a follower receives an AppendEntries request that includes log entries already present in its log, it ignores those entries in the new request.

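The idempotence comes from how a follower merges incoming entries (receiver rules 3 and 4 in Figure 2): entries it already holds are skipped, and only a genuine term conflict causes truncation. A sketch, reusing the types above:

```go
// appendEntries merges incoming entries into a follower's 1-indexed log.
// A re-delivered entry is recognized by matching index and term and is
// skipped, which makes the RPC idempotent; a conflicting term deletes
// the existing entry and everything after it.
func appendEntries(log []LogEntry, prevLogIndex uint64, entries []LogEntry) []LogEntry {
	for i, e := range entries {
		idx := prevLogIndex + 1 + uint64(i)
		if idx < uint64(len(log)) {
			if log[idx].Term == e.Term {
				continue // already have this entry; ignore the duplicate
			}
			log = log[:idx] // conflict: truncate from here on
		}
		log = append(log, entries[i:]...) // append the remaining new entries
		break
	}
	return log
}
```
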
### 5.6 Timing and availability

One of our requirements for Raft is that safety must not depend on timing: the system must not produce incorrect results just because some event happens more quickly or slowly than expected. However, availability (the ability of the system to respond to clients in a timely manner) must inevitably depend on timing. For example, if message exchanges take longer than the typical time between server crashes, candidates will not stay up long enough to win an election; without a steady leader, Raft cannot make progress.

Leader election is the aspect of Raft where timing is most critical. Raft will be able to elect and maintain a steady leader as long as the system satisfies the following timing requirement:

broadcastTime ≪ electionTimeout ≪ MTBF

In this inequality broadcastTime is the average time it takes a server to send RPCs in parallel to every server in the cluster and receive their responses; electionTimeout is the election timeout described in Section 5.2; and MTBF is the average time between failures for a single server. The broadcast time should be an order of magnitude less than the election timeout so that leaders can reliably send the heartbeat messages required to keep followers from starting elections; given the randomized approach used for election timeouts, this inequality also makes split votes unlikely. The election timeout should be a few orders of magnitude less than MTBF so that the system makes steady progress. When the leader crashes, the system will be unavailable for roughly the election timeout; we would like this to represent only a small fraction of overall time.

The broadcast time and MTBF are properties of the underlying system, while the election timeout is something we must choose. Raft's RPCs typically require the recipient to persist information to stable storage, so the broadcast time may range from 0.5ms to 20ms, depending on storage technology. As a result, the election timeout is likely to be somewhere between 10ms and 500ms. Typical server MTBFs are several months or more, which easily satisfies the timing requirement.

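As a concrete instantiation, the following example values are consistent with the numbers above (illustrative defaults for a sketch, not values prescribed by the text):

```go
import "time"

// Example settings satisfying broadcastTime << electionTimeout << MTBF.
// The election timeout range matches the 150-300ms example in Section
// 5.2; the broadcast estimate is the upper end of the 0.5-20ms range.
const (
	broadcastTimeEstimate = 20 * time.Millisecond
	electionTimeoutMin    = 150 * time.Millisecond // >> broadcast time
	electionTimeoutMax    = 300 * time.Millisecond
	assumedServerMTBF     = 30 * 24 * time.Hour // months >> election timeout
)
```
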
## 6 Cluster membership changes

Up until now we have assumed that the cluster configuration (the set of servers participating in the consensus algorithm) is fixed. In practice, it will occasionally be necessary to change the configuration, for example to replace servers when they fail or to change the degree of replication. Although this can be done by taking the entire cluster off-line, updating configuration files, and then restarting the cluster, this would leave the cluster unavailable during the changeover. In addition, if there are any manual steps, they risk operator error. In order to avoid these issues, we decided to automate configuration changes and incorporate them into the Raft consensus algorithm.

For the configuration change mechanism to be safe, there must be no point during the transition where it is possible for two leaders to be elected for the same term. Unfortunately, any approach where servers switch directly from the old configuration to the new configuration is unsafe. It isn't possible to atomically switch all of the servers at once, so the cluster can potentially split into two independent majorities during the transition (see Figure 10).

In order to ensure safety, configuration changes must use a two-phase approach. There are a variety of ways to implement the two phases. For example, some systems (e.g., [22]) use the first phase to disable the old configuration so it cannot process client requests; then the second phase enables the new configuration. In Raft the cluster first switches to a transitional configuration we call joint consensus; once the joint consensus has been committed, the system then transitions to the new configuration. The joint consensus combines both the old and new configurations:

• Log entries are replicated to all servers in both configurations.

• Any server from either configuration may serve as leader.

• Agreement (for elections and entry commitment) requires separate majorities from both the old and new configurations (see the sketch below).

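The third rule is the heart of joint consensus: during C_old,new, a single counted majority is not enough. A self-contained sketch of that agreement check (names are illustrative):

```go
// jointQuorum implements the agreement rule for C_old,new (Section 6):
// a decision (a vote grant or an entry commitment) needs separate
// majorities in the old and the new configurations. approved is the set
// of server IDs that granted the vote or acknowledged the entry.
func jointQuorum(oldConfig, newConfig []string, approved map[string]bool) bool {
	return majority(oldConfig, approved) && majority(newConfig, approved)
}

func majority(config []string, approved map[string]bool) bool {
	count := 0
	for _, id := range config {
		if approved[id] {
			count++
		}
	}
	return count > len(config)/2
}
```
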
The joint consensus allows individual servers to transition between configurations at different times without compromising safety. Furthermore, joint consensus allows the cluster to continue servicing client requests throughout the configuration change.

Cluster configurations are stored and communicated using special entries in the replicated log; Figure 11 illustrates the configuration change process. When the leader receives a request to change the configuration from C_old to C_new, it stores the configuration for joint consensus (C_old,new in the figure) as a log entry and replicates that entry using the mechanisms described previously. Once a given server adds the new configuration entry to its log, it uses that configuration for all future decisions (a server always uses the latest configuration in its log, regardless of whether the entry is committed). This means that the leader will use the rules of C_old,new to determine when the log entry for C_old,new is committed. If the leader crashes, a new leader may be chosen under either C_old or C_old,new, depending on whether the winning candidate has received C_old,new. In any case, C_new cannot make unilateral decisions during this period.

Once C_old,new has been committed, neither C_old nor C_new can make decisions without approval of the other, and the Leader Completeness Property ensures that only servers with the C_old,new log entry can be elected as leader. It is now safe for the leader to create a log entry describing C_new and replicate it to the cluster. Again, this configuration will take effect on each server as soon as it is seen. When the new configuration has been committed under the rules of C_new, the old configuration is irrelevant and servers not in the new configuration can be shut down. As shown in Figure 11, there is no time when C_old and C_new can both make unilateral decisions; this guarantees safety.

There are three more issues to address for reconfiguration. The first issue is that new servers may not initially store any log entries. If they are added to the cluster in this state, it could take quite a while for them to catch up, during which time it might not be possible to commit new log entries. In order to avoid availability gaps, Raft introduces an additional phase before the configuration change, in which the new servers join the cluster as non-voting members (the leader replicates log entries to them, but they are not considered for majorities). Once the new servers have caught up with the rest of the cluster, the reconfiguration can proceed as described above.

The second issue is that the cluster leader may not be part of the new configuration. In this case, the leader steps down (returns to follower state) once it has committed the C_new log entry. This means that there will be a period of time (while it is committing C_new) when the leader is managing a cluster that does not include itself; it replicates log entries but does not count itself in majorities. The leader transition occurs when C_new is committed because this is the first point when the new configuration can operate independently (it will always be possible to choose a leader from C_new). Before this point, it may be the case that only a server from C_old can be elected leader.

The third issue is that removed servers (those not in C_new) can disrupt the cluster. These servers will not receive heartbeats, so they will time out and start new elections. They will then send RequestVote RPCs with new term numbers, and this will cause the current leader to revert to follower state. A new leader will eventually be elected, but the removed servers will time out again and the process will repeat, resulting in poor availability.

To prevent this problem, servers disregard RequestVote RPCs when they believe a current leader exists. Specifically, if a server receives a RequestVote RPC within the minimum election timeout of hearing from a current leader, it does not update its term or grant its vote. This does not affect normal elections, where each server waits at least a minimum election timeout before starting an election. However, it helps avoid disruptions from removed servers: if a leader is able to get heartbeats to its cluster, then it will not be deposed by larger term numbers.

## 7 Log compaction

Raft's log grows during normal operation to incorporate more client requests, but in a practical system, it cannot grow without bound. As the log grows longer, it occupies more space and takes more time to replay. This will eventually cause availability problems without some mechanism to discard obsolete information that has accumulated in the log.

Snapshotting is the simplest approach to compaction. In snapshotting, the entire current system state is written to a snapshot on stable storage, then the entire log up to that point is discarded. Snapshotting is used in Chubby and ZooKeeper, and the remainder of this section describes snapshotting in Raft.

Incremental approaches to compaction, such as log cleaning [36] and log-structured merge trees [30, 5], are also possible. These operate on a fraction of the data at once, so they spread the load of compaction more evenly over time. They first select a region of data that has accumulated many deleted and overwritten objects, then they rewrite the live objects from that region more compactly and free the region. This requires significant additional mechanism and complexity compared to snapshotting, which simplifies the problem by always operating on the entire data set. While log cleaning would require modifications to Raft, state machines can implement LSM trees using the same interface as snapshotting.

Figure 12 shows the basic idea of snapshotting in Raft. Each server takes snapshots independently, covering just the committed entries in its log. Most of the work consists of the state machine writing its current state to the snapshot. Raft also includes a small amount of metadata in the snapshot: the last included index is the index of the last entry in the log that the snapshot replaces (the last entry the state machine had applied), and the last included term is the term of this entry. These are preserved to support the AppendEntries consistency check for the first log entry following the snapshot, since that entry needs a previous log index and term. To enable cluster membership changes (Section 6), the snapshot also includes the latest configuration in the log as of last included index. Once a server completes writing a snapshot, it may delete all log entries up through the last included index, as well as any prior snapshot.

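The snapshot metadata described above translates into a small struct plus a compaction step. A sketch, reusing the LogEntry type; the field layout and []string configuration encoding are assumptions:

```go
// Snapshot carries the Section 7 metadata alongside the state machine
// image: the index and term of the last entry the snapshot replaces
// (needed for the AppendEntries consistency check on the next entry)
// and the latest configuration as of that index (needed for Section 6).
type Snapshot struct {
	LastIncludedIndex uint64
	LastIncludedTerm  uint64
	Configuration     []string // server IDs in the latest configuration
	StateMachineData  []byte   // opaque serialized state machine state
}

// compactThrough discards log entries covered by the snapshot and
// re-bases the slice so position 0 is again a sentinel carrying the
// last included index and term.
func compactThrough(log []LogEntry, snap Snapshot) []LogEntry {
	keep := []LogEntry{{Index: snap.LastIncludedIndex, Term: snap.LastIncludedTerm}}
	for _, e := range log {
		if e.Index > snap.LastIncludedIndex {
			keep = append(keep, e)
		}
	}
	return keep
}
```
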
Although servers normally take snapshots independently, the leader must occasionally send snapshots to followers that lag behind. This happens when the leader has already discarded the next log entry that it needs to send to a follower. Fortunately, this situation is unlikely in normal operation: a follower that has kept up with the leader would already have this entry. However, an exceptionally slow follower or a new server joining the cluster (Section 6) would not. The way to bring such a follower up-to-date is for the leader to send it a snapshot over the network.

The leader uses a new RPC called InstallSnapshot to send snapshots to followers that are too far behind; see Figure 13. When a follower receives a snapshot with this RPC, it must decide what to do with its existing log entries. Usually the snapshot will contain new information not already in the recipient's log. In this case, the follower discards its entire log; it is all superseded by the snapshot and may possibly have uncommitted entries that conflict with the snapshot. If instead the follower receives a snapshot that describes a prefix of its log (due to retransmission or by mistake), then log entries covered by the snapshot are deleted but entries following the snapshot are still valid and must be retained.

This snapshotting approach departs from Raft's strong leader principle, since followers can take snapshots without the knowledge of the leader. However, we think this departure is justified. While having a leader helps avoid conflicting decisions in reaching consensus, consensus has already been reached when snapshotting, so no decisions conflict. Data still flows only from leaders to followers; it is just that followers can now reorganize their own data.

We considered an alternative leader-based approach in which only the leader would create a snapshot, then it would send this snapshot to each of its followers. However, this has two disadvantages. First, sending the snapshot to each follower would waste network bandwidth and slow the snapshotting process. Each follower already has the information needed to produce its own snapshots, and it is typically much cheaper for a server to produce a snapshot from its local state than it is to send and receive one over the network. Second, the leader's implementation would be more complex. For example, the leader would need to send snapshots to followers in parallel with replicating new log entries to them, so as not to block new client requests.

There are two more issues that impact snapshotting performance. First, servers must decide when to snapshot. If a server snapshots too often, it wastes disk bandwidth and energy; if it snapshots too infrequently, it risks exhausting its storage capacity, and it increases the time required to replay the log during restarts. One simple strategy is to take a snapshot when the log reaches a fixed size in bytes. If this size is set to be significantly larger than the expected size of a snapshot, then the disk bandwidth overhead for snapshotting will be small.

The second performance issue is that writing a snapshot can take a significant amount of time, and we do not want this to delay normal operations. The solution is to use copy-on-write techniques so that new updates can be accepted without impacting the snapshot being written. For example, state machines built with functional data structures naturally support this. Alternatively, the operating system's copy-on-write support (e.g., fork on Linux) can be used to create an in-memory snapshot of the entire state machine (our implementation uses this approach).

## 8 Client interaction

This section describes how clients interact with Raft, including how clients find the cluster leader and how Raft supports linearizable semantics [10]. These issues apply to all consensus-based systems, and Raft's solutions are similar to other systems.

Clients of Raft send all of their requests to the leader. When a client first starts up, it connects to a randomly chosen server. If the client's first choice is not the leader, that server will reject the client's request and supply information about the most recent leader it has heard from (AppendEntries requests include the network address of the leader). If the leader crashes, client requests will time out; clients then try again with randomly chosen servers.

Our goal for Raft is to implement linearizable semantics (each operation appears to execute instantaneously, exactly once, at some point between its invocation and its response). However, as described so far Raft can execute a command multiple times: for example, if the leader crashes after committing the log entry but before responding to the client, the client will retry the command with a new leader, causing it to be executed a second time. The solution is for clients to assign unique serial numbers to every command. Then, the state machine tracks the latest serial number processed for each client, along with the associated response. If it receives a command whose serial number has already been executed, it responds immediately without re-executing the request.

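The deduplication table is small. A sketch (all names illustrative; it assumes each client has at most one command outstanding, so any duplicate it sends is its most recent command):

```go
// session records, per client, the newest serial number executed and the
// response it produced (Section 8).
type session struct {
	lastSerial uint64
	response   []byte
}

// StateMachine wraps the real state machine transition with duplicate
// detection, turning at-least-once delivery into exactly-once execution.
type StateMachine struct {
	sessions map[string]*session     // keyed by client ID
	apply    func(cmd []byte) []byte // the underlying state transition
}

func (sm *StateMachine) Apply(clientID string, serial uint64, cmd []byte) []byte {
	if s, ok := sm.sessions[clientID]; ok && serial <= s.lastSerial {
		return s.response // duplicate: answer without re-executing
	}
	resp := sm.apply(cmd)
	sm.sessions[clientID] = &session{lastSerial: serial, response: resp}
	return resp
}
```
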
Read-only operations can be handled without writing anything into the log. However, with no additional measures, this would run the risk of returning stale data, since the leader responding to the request might have been superseded by a newer leader of which it is unaware. Linearizable reads must not return stale data, and Raft needs two extra precautions to guarantee this without using the log. First, a leader must have the latest information on which entries are committed. The Leader Completeness Property guarantees that a leader has all committed entries, but at the start of its term, it may not know which those are. To find out, it needs to commit an entry from its term. Raft handles this by having each leader commit a blank no-op entry into the log at the start of its term. Second, a leader must check whether it has been deposed before processing a read-only request (its information may be stale if a more recent leader has been elected). Raft handles this by having the leader exchange heartbeat messages with a majority of the cluster before responding to read-only requests. Alternatively, the leader could rely on the heartbeat mechanism to provide a form of lease [9], but this would rely on timing for safety (it assumes bounded clock skew).

## Figures

### Figure 2

#### State

Persistent state on all servers (updated on stable storage before responding to RPCs):

• currentTerm: latest term server has seen (initialized to 0 on first boot, increases monotonically)

• votedFor: candidateId that received vote in current term (or null if none)

• log[]: log entries; each entry contains command for state machine, and term when entry was received by leader (first index is 1)

Volatile state on all servers:

• commitIndex: index of highest log entry known to be committed (initialized to 0, increases monotonically)

• lastApplied: index of highest log entry applied to state machine (initialized to 0, increases monotonically)

Volatile state on leaders (reinitialized after election):

• nextIndex[]: for each server, index of the next log entry to send to that server (initialized to leader last log index + 1)

• matchIndex[]: for each server, index of highest log entry known to be replicated on server (initialized to 0, increases monotonically)

#### AppendEntries RPC

Invoked by leader to replicate log entries (§5.3); also used as heartbeat (§5.2).

Arguments:

• term: leader's term

• leaderId: so follower can redirect clients

• prevLogIndex: index of log entry immediately preceding new ones

• prevLogTerm: term of prevLogIndex entry

• entries[]: log entries to store (empty for heartbeat; may send more than one for efficiency)

• leaderCommit: leader's commitIndex

Results:

• term: currentTerm, for leader to update itself

• success: true if follower contained entry matching prevLogIndex and prevLogTerm

Receiver implementation:

1. Reply false if term < currentTerm (§5.1)

2. Reply false if log doesn't contain an entry at prevLogIndex whose term matches prevLogTerm (§5.3)

3. If an existing entry conflicts with a new one (same index but different terms), delete the existing entry and all that follow it (§5.3)

4. Append any new entries not already in the log

5. If leaderCommit > commitIndex, set commitIndex = min(leaderCommit, index of last new entry)

#### RequestVote RPC

Invoked by candidates to gather votes (§5.2).

Arguments:

• term: candidate's term

• candidateId: candidate requesting vote

• lastLogIndex: index of candidate's last log entry (§5.4)

• lastLogTerm: term of candidate's last log entry (§5.4)

Results:

• term: currentTerm, for candidate to update itself

• voteGranted: true means candidate received vote

Receiver implementation:

1. Reply false if term < currentTerm (§5.1)

2. If votedFor is null or candidateId, and candidate's log is at least as up-to-date as receiver's log, grant vote (§5.2, §5.4)

#### Rules for Servers

All Servers:

• If commitIndex > lastApplied: increment lastApplied, apply log[lastApplied] to state machine (§5.3)

• If RPC request or response contains term T > currentTerm: set currentTerm = T, convert to follower (§5.1)

Followers (§5.2):

• Respond to RPCs from candidates and leaders

• If election timeout elapses without receiving AppendEntries RPC from current leader or granting vote to candidate: convert to candidate

Candidates (§5.2):

• On conversion to candidate, start election:
  • Increment currentTerm
  • Vote for self
  • Reset election timer
  • Send RequestVote RPCs to all other servers

• If votes received from majority of servers: become leader

• If AppendEntries RPC received from new leader: convert to follower

• If election timeout elapses: start new election

Leaders:

• Upon election: send initial empty AppendEntries RPCs (heartbeat) to each server; repeat during idle periods to prevent election timeouts (§5.2)

• If command received from client: append entry to local log, respond after entry applied to state machine (§5.3)

• If last log index ≥ nextIndex for a follower: send AppendEntries RPC with log entries starting at nextIndex
  • If successful: update nextIndex and matchIndex for follower (§5.3)
  • If AppendEntries fails because of log inconsistency: decrement nextIndex and retry (§5.3)

• If there exists an N such that N > commitIndex, a majority of matchIndex[i] ≥ N, and log[N].term == currentTerm: set commitIndex = N (§5.3, §5.4).