885 lines
46 KiB
Markdown
885 lines
46 KiB
Markdown
# Raft Distributed Consensus Algorithm
|
||
|
||
This text is copied from it's original PDF form during implementation for reference.
|
||
|
||
http://ramcloud.stanford.edu/raft.pdf
|
||
|
||
|
||
## 5 The Raft consensus algorithm
|
||
|
||
Raft is an algorithm for managing a replicated log of the form described in
|
||
Section 2. Figure 2 summarizes the algorithm in condensed form for reference,
|
||
and Figure 3 lists key properties of the algorithm; the elements of these
|
||
figures are discussed piecewise over the rest of this section.
|
||
|
||
Raft implements consensus by first electing a distinguished leader, then
|
||
giving the leader complete responsibility for managing the replicated log. The
|
||
leader accepts log entries from clients, replicates them on other servers, and
|
||
tells servers when it is safe to apply log entries to their state machines.
|
||
Having a leader simplifies the management of the replicated log. For example,
|
||
the leader can decide where to place new entries in the log without consulting
|
||
other servers, and data flows in a simple fashion from the leader to other
|
||
servers. A leader can fail or become disconnected from the other servers, in
|
||
which case a new leader is elected.
|
||
|
||
Given the leader approach, Raft decomposes the consensus problem into three
|
||
relatively independent subproblems, which are discussed in the subsections
|
||
that follow:
|
||
|
||
• Leader election: a new leader must be chosen when an existing leader fails
|
||
(Section 5.2).
|
||
|
||
• Log replication: the leader must accept log entries from clients and
|
||
replicate them across the cluster, forcing the other logs to agree with its
|
||
own (Section 5.3).
|
||
|
||
• Safety: the key safety property for Raft is the State Machine Safety
|
||
Property in Figure 3: if any server has applied a particular log entry to its
|
||
state machine, then no other server may apply a different command for the same
|
||
log index. Section 5.4 describes how Raft ensures this property; the solution
|
||
involves an additional restriction on the election mechanism described in
|
||
Section 5.2.
|
||
|
||
After presenting the consensus algorithm, this section discusses the issue of
|
||
availability and the role of timing in the system.
|
||
|
||
### 5.1 Raft basics
|
||
|
||
A Raft cluster contains several servers; five is a typical number, which
|
||
allows the system to tolerate two failures. At any given time each server is
|
||
in one of three states: leader, follower, or candidate. In normal operation
|
||
there is exactly one leader and all of the other servers are followers.
|
||
Followers are passive: they issue no requests on their own but simply respond
|
||
to requests from leaders and candidates. The leader handles all client
|
||
requests (if a client contacts a follower, the follower redirects it to the
|
||
leader). The third state, candidate, is used to elect a new leader as
|
||
described in Section 5.2. Figure 4 shows the states and their transitions; the
|
||
transitions are discussed below.
|
||
|
||
Raft divides time into terms of arbitrary length, as shown in Figure 5. Terms
|
||
are numbered with consecutive integers. Each term begins with an election, in
|
||
which one or more candidates attempt to become leader as described in Section
|
||
5.2. If a candidate wins the election, then it serves as leader for the rest
|
||
of the term. In some situations an election will result in a split vote. In
|
||
this case the term will end with no leader; a new term (with a new election)
|
||
|
||
will begin shortly. Raft ensures that there is at most one leader in a given
|
||
term.
|
||
|
||
Different servers may observe the transitions between terms at different
|
||
times, and in some situations a server may not observe an election or even
|
||
entire terms. Terms act as a logical clock [14] in Raft, and they allow
|
||
servers to detect obsolete information such as stale leaders. Each server
|
||
stores a current term number, which increases monotonically over time. Current
|
||
terms are exchanged whenever servers communicate; if one server’s current term
|
||
is smaller than the other’s, then it updates its current term to the larger
|
||
value. If a candidate or leader discovers that its term is out of date, it
|
||
immediately reverts to follower state. If a server receives a request with a
|
||
stale term number, it rejects the request.
|
||
|
||
Raft servers communicate using remote procedure calls (RPCs), and the basic
|
||
consensus algorithm requires only two types of RPCs. RequestVote RPCs are
|
||
initiated by candidates during elections (Section 5.2), and AppendEntries RPCs
|
||
are initiated by leaders to replicate log entries and to provide a form of
|
||
heartbeat (Section 5.3). Section 7 adds a third RPC for transferring snapshots
|
||
between servers. Servers retry RPCs if they do not receive a response in a
|
||
timely manner, and they issue RPCs in parallel for best performance.
|
||
|
||
### 5.2 Leader election
|
||
|
||
Raft uses a heartbeat mechanism to trigger leader election. When servers start
|
||
up, they begin as followers. A server remains in follower state as long as it
|
||
receives valid RPCs from a leader or candidate. Leaders send periodic
|
||
heartbeats (AppendEntries RPCs that carry no log entries) to all followers in
|
||
order to maintain their authority. If a follower receives no communication
|
||
over a period of time called the election timeout, then it assumes there is no
|
||
viable leader and begins an election to choose a new leader.
|
||
|
||
To begin an election, a follower increments its current term and transitions
|
||
to candidate state. It then votes for itself and issues RequestVote RPCs in
|
||
parallel to each of the other servers in the cluster. A candidate continues in
|
||
this state until one of three things happens: (a) it wins the election, (b)
|
||
another server establishes itself as leader, or (c) a period of time goes by
|
||
with no winner. These outcomes are discussed separately in the paragraphs
|
||
below.
|
||
|
||
A candidate wins an election if it receives votes from a majority of the
|
||
servers in the full cluster for the same term. Each server will vote for at
|
||
most one candidate in a given term, on a first-come-first-served basis (note:
|
||
Section 5.4 adds an additional restriction on votes). The majority rule
|
||
ensures that at most one candidate can win the election for a particular term
|
||
(the Election Safety Property in Figure 3). Once a candidate wins an election,
|
||
it becomes leader. It then sends heartbeat messages to all of the other
|
||
servers to establish its authority and prevent new elections.
|
||
|
||
While waiting for votes, a candidate may receive an AppendEntries RPC from
|
||
another server claiming to be leader. If the leader’s term (included in its
|
||
RPC) is at least as large as the candidate’s current term, then the candidate
|
||
recognizes the leader as legitimate and returns to follower state. If the term
|
||
in the RPC is smaller than the candidate’s current term, then the candidate
|
||
rejects the RPC and continues in candidate state.
|
||
|
||
The third possible outcome is that a candidate neither wins nor loses the
|
||
election: if many followers become candidates at the same time, votes could be
|
||
split so that no candidate obtains a majority. When this happens, each
|
||
candidate will time out and start a new election by incrementing its term and
|
||
initiating another round of RequestVote RPCs. However, without extra measures
|
||
split votes could repeat indefinitely.
|
||
|
||
Raft uses randomized election timeouts to ensure that split votes are rare and
|
||
that they are resolved quickly. To prevent split votes in the first place,
|
||
election timeouts are chosen randomly from a fixed interval (e.g., 150–300ms).
|
||
This spreads out the servers so that in most cases only a single server will
|
||
time out; it wins the election and sends heartbeats before any other servers
|
||
time out. The same mechanism is used to handle split votes. Each candidate
|
||
restarts its randomized election timeout at the start of an election, and it
|
||
waits for that timeout to elapse before starting the next election; this
|
||
reduces the likelihood of another split vote in the new election. Section 9.3
|
||
shows that this approach elects a leader rapidly.
|
||
|
||
Elections are an example of how understandability guided our choice between
|
||
design alternatives. Initially we planned to use a ranking system: each
|
||
candidate was assigned a unique rank, which was used to select between
|
||
competing candidates. If a candidate discovered another candidate with higher
|
||
rank, it would return to follower state so that the higher ranking candidate
|
||
could more easily win the next election. We found that this approach created
|
||
subtle issues around availability (a lower-ranked server might need to time
|
||
out and become a candidate again if a higher-ranked server fails, but if it
|
||
does so too soon, it can reset progress towards electing a leader). We made
|
||
adjustments to the algorithm several times, but after each adjustment new
|
||
corner cases appeared. Eventually we concluded that the randomized retry
|
||
approach is more obvious and understandable.
|
||
|
||
|
||
### 5.3 Log replication
|
||
|
||
Once a leader has been elected, it begins servicing client requests. Each
|
||
client request contains a command to be executed by the replicated state
|
||
machines. The leader appends the command to its log as a new entry, then
|
||
issues AppendEntries RPCs in parallel to each of the other servers to
|
||
replicate the entry. When the entry has been safely replicated (as described
|
||
below), the leader applies the entry to its state machine and returns the
|
||
result of that execution to the client. If followers crash or run slowly, or
|
||
if network packets are lost, the leader retries AppendEntries RPCs
|
||
indefinitely (even after it has responded to the client) until all followers
|
||
eventually store all log entries.
|
||
|
||
Logs are organized as shown in Figure 6. Each log entry stores a state machine
|
||
command along with the term number when the entry was received by the leader.
|
||
The term numbers in log entries are used to detect inconsistencies between
|
||
logs and to ensure some of the properties in Figure 3. Each log entry also has
|
||
an integer index identifying its position in the log.
|
||
|
||
The leader decides when it is safe to apply a log entry to the state machines;
|
||
such an entry is called committed. Raft guarantees that committed entries are
|
||
durable and will eventually be executed by all of the available state
|
||
machines. A log entry is committed once the leader that created the entry has
|
||
replicated it on a majority of the servers (e.g., entry 7 in Figure 6). This
|
||
also commits all preceding entries in the leader’s log, including entries
|
||
created by previous leaders. Section 5.4 discusses some subtleties when
|
||
applying this rule after leader changes, and it also shows that this
|
||
definition of commitment is safe. The leader keeps track of the highest index
|
||
it knows to be committed, and it includes that index in future AppendEntries
|
||
RPCs (including heartbeats) so that the other servers eventually find out.
|
||
Once a follower learns that a log entry is committed, it applies the entry to
|
||
its local state machine (in log order).
|
||
|
||
We designed the Raft log mechanism to maintain a high level of coherency
|
||
between the logs on different servers. Not only does this simplify the
|
||
system’s behavior and make it more predictable, but it is an important
|
||
component of ensuring safety. Raft maintains the following properties, which
|
||
together constitute the Log Matching Property in Figure 3:
|
||
|
||
• If two entries in different logs have the same index and term, then they
|
||
store the same command.
|
||
|
||
• If two entries in different logs have the same index and term, then the logs
|
||
are identical in all preceding entries.
|
||
|
||
The first property follows from the fact that a leader creates at most one
|
||
entry with a given log index in a given term, and log entries never change
|
||
their position in the log. The second property is guaranteed by a simple
|
||
consistency check performed by AppendEntries. When sending an AppendEntries
|
||
RPC, the leader includes the index and term of the entry in its log that
|
||
immediately precedes the new entries. If the follower does not find an entry
|
||
in its log with the same index and term, then it refuses the new entries. The
|
||
consistency check acts as an induction step: the initial empty state of the
|
||
logs satisfies the Log Matching Property, and the consistency check preserves
|
||
the Log Matching Property whenever logs are extended. As a result, whenever
|
||
AppendEntries returns successfully, the leader knows that the follower’s log
|
||
is identical to its own log up through the new entries.
|
||
|
||
During normal operation, the logs of the leader and followers stay consistent,
|
||
so the AppendEntries consistency check never fails. However, leader crashes
|
||
can leave the logs inconsistent (the old leader may not have fully replicated
|
||
all of the entries in its log). These inconsistencies can compound over a
|
||
series of leader and follower crashes. Figure 7 illustrates the ways in which
|
||
followers’ logs may differ from that of a new leader. A follower may be
|
||
missing entries that are present on the leader, it may have extra entries that
|
||
are not present on the leader, or both. Missing and extraneous entries in a
|
||
log may span multiple terms.
|
||
|
||
In Raft, the leader handles inconsistencies by forcing the followers’ logs to
|
||
duplicate its own. This means that conflicting entries in follower logs will
|
||
be overwritten with entries from the leader’s log. Section 5.4 will show that
|
||
this is safe when coupled with one more restriction.
|
||
|
||
To bring a follower’s log into consistency with its own, the leader must find
|
||
the latest log entry where the two logs agree, delete any entries in the
|
||
follower’s log after that point, and send the follower all of the leader’s
|
||
entries after that point. All of these actions happen in response to the
|
||
consistency check performed by AppendEntries RPCs. The leader maintains a
|
||
nextIndex for each follower, which is the index of the next log entry the
|
||
leader will send to that follower. When a leader first comes to power, it
|
||
initializes all nextIndex values to the index just after the last one in its
|
||
log (11 in Figure 7). If a follower’s log is inconsistent with the leader’s,
|
||
the AppendEntries consistency check will fail in the next AppendEntries RPC.
|
||
After a rejection, the leader decrements nextIndex and retries the
|
||
AppendEntries RPC. Eventually nextIndex will reach a point where the leader
|
||
and follower logs match. When this happens, AppendEntries will succeed, which
|
||
removes any conflicting entries in the follower’s log and appends entries from
|
||
the leader’s log (if any). Once AppendEntries succeeds, the follower’s log is
|
||
consistent with the leader’s, and it will remain that way for the rest of the
|
||
term.
|
||
|
||
If desired, the protocol can be optimized to reduce the number of rejected
|
||
AppendEntries RPCs. For example, when rejecting an AppendEntries request, the
|
||
follower With this mechanism, a leader does not need to take any special
|
||
actions to restore log consistency when it comes to power. It just begins
|
||
normal operation, and the logs automatically converge in response to failures
|
||
of the AppendEntries consistency check. A leader never overwrites or deletes
|
||
entries in its own log (the Leader Append-Only Property in Figure 3).
|
||
|
||
This log replication mechanism exhibits the desirable consensus properties
|
||
described in Section 2: Raft can accept, replicate, and apply new log entries
|
||
as long as a majority of the servers are up; in the normal case a new entry
|
||
can be replicated with a single round of RPCs to a majority of the cluster;
|
||
and a single slow follower will not impact performance.
|
||
|
||
|
||
### 5.4 Safety
|
||
|
||
The previous sections described how Raft elects leaders and replicates log
|
||
entries. However, the mechanisms described so far are not quite sufficient to
|
||
ensure that each state machine executes exactly the same commands in the same
|
||
order. For example, a follower might be unavailable while the leader commits
|
||
several log entries, then it could be elected leader and overwrite these
|
||
entries with new ones; as a result, different state machines might execute
|
||
different command sequences.
|
||
|
||
This section completes the Raft algorithm by adding a restriction on which
|
||
servers may be elected leader. The restriction ensures that the leader for any
|
||
given term contains all of the entries committed in previous terms (the Leader
|
||
Completeness Property from Figure 3). Given the election restriction, we then
|
||
make the rules for commitment more precise. Finally, we present a proof sketch
|
||
for the Leader Completeness Property and show how it leads to correct behavior
|
||
of the replicated state machine.
|
||
|
||
|
||
#### 5.4.1 Election restriction
|
||
|
||
In any leader-based consensus algorithm, the leader must eventually store all
|
||
of the committed log entries. In some consensus algorithms, such as
|
||
Viewstamped Replication [22], a leader can be elected even if it doesn’t
|
||
initially contain all of the committed entries. These algorithms contain
|
||
additional mechanisms to identify the missing entries and transmit them to the
|
||
new leader, either during the election process or shortly afterwards.
|
||
Unfortunately, this results in considerable additional mechanism and
|
||
complexity. Raft uses a simpler approach where it guarantees that all the
|
||
committed entries from previous terms are present on each new leader from the
|
||
moment of its election, without the need to transfer those entries to the
|
||
leader. This means that log entries only flow in one direction, from leaders
|
||
to followers, and leaders never overwrite existing entries in their logs.
|
||
|
||
Raft uses the voting process to prevent a candidate from winning an election
|
||
unless its log contains all committed entries. A candidate must contact a
|
||
majority of the cluster in order to be elected, which means that every
|
||
committed entry must be present in at least one of those servers. If the
|
||
candidate’s log is at least as up-to-date as any other log in that majority
|
||
(where “up-to-date” is defined precisely below), then it will hold all the
|
||
committed entries. The RequestVote RPC implements this restriction: the RPC
|
||
includes information about the candidate’s log, and the voter denies its vote
|
||
if its own log is more up-to-date than that of the candidate.
|
||
|
||
Raft determines which of two logs is more up-to-date by comparing the index
|
||
and term of the last entries in the logs. If the logs have last entries with
|
||
different terms, then the log with the later term is more up-to-date. If the
|
||
logs end with the same term, then whichever log is longer is more up-to-date.
|
||
|
||
#### 5.4.2 Committing entries from previous terms
|
||
|
||
As described in Section 5.3, a leader knows that an entry from its current
|
||
term is committed once that entry is stored on a majority of the servers. If a
|
||
leader crashes before committing an entry, future leaders will attempt to
|
||
finish replicating the entry. However, a leader cannot immediately conclude
|
||
that an entry from a previous term is committed once it is stored on a
|
||
majority of servers. Figure 8 illustrates a situation where an old log entry
|
||
is stored on a majority of servers, yet can still be overwritten by a future
|
||
leader.
|
||
|
||
To eliminate problems like the one in Figure 8, Raft never commits log entries
|
||
from previous terms by counting replicas. Only log entries from the leader’s
|
||
current term are committed by counting replicas; once an entry from the
|
||
current term has been committed in this way, then all prior entries are
|
||
committed indirectly because of the Log Matching Property. There are some
|
||
situations where a leader could safely conclude that an older log entry is
|
||
committed (for example, if that entry is stored on every server), but Raft
|
||
takes a more conservative approach for simplicity.
|
||
|
||
Raft incurs this extra complexity in the commitment rules because log entries
|
||
retain their original term numbers when a leader replicates entries from
|
||
previous terms. In other consensus algorithms, if a new leader rereplicates
|
||
entries from prior “terms,” it must do so with its new “term number.” Raft’s
|
||
approach makes it easier to reason about log entries, since they maintain the
|
||
same term number over time and across logs. In addition, new leaders in Raft
|
||
send fewer log entries from previous terms than in other algorithms (other
|
||
algorithms must send redundant log entries to renumber them before they can be
|
||
committed).
|
||
|
||
#### 5.4.3 Safety argument
|
||
|
||
Given the complete Raft algorithm, we can now argue more precisely that the
|
||
Leader Completeness Property holds (this argument is based on the safety
|
||
proof; see Section 9.2). We assume that the Leader Completeness Property does
|
||
not hold, then we prove a contradiction. Suppose the leader for term T
|
||
(leaderT) commits a log entry from its term, but that log entry is not stored
|
||
by the leader of some future term. Consider the smallest term U > T whose
|
||
leader (leaderU) does not store the entry.
|
||
|
||
1. The committed entry must have been absent from leaderU’s log at the time of
|
||
its election (leaders never delete or overwrite entries).
|
||
|
||
2. leaderT replicated the entry on a majority of the cluster, and leaderU
|
||
received votes from a majority of the cluster. Thus, at least one server (“the
|
||
voter”) both accepted the entry from leaderT and voted for leaderU, as shown
|
||
in Figure 9. The voter is key to reaching a contradiction.
|
||
|
||
3. The voter must have accepted the committed entry from leaderT before voting
|
||
for leaderU; otherwise it would have rejected the AppendEntries request from
|
||
leaderT (its current term would have been higher than T).
|
||
|
||
4. The voter still stored the entry when it voted for leaderU, since every
|
||
intervening leader contained the entry (by assumption), leaders never remove
|
||
entries, and followers only remove entries if they conflict with the leader.
|
||
|
||
5. The voter granted its vote to leaderU, so leaderU’s log must have been as
|
||
up-to-date as the voter’s. This leads to one of two contradictions.
|
||
|
||
6. First, if the voter and leaderU shared the same last log term, then
|
||
leaderU’s log must have been at least as long as the voter’s, so its log
|
||
contained every entry in the voter’s log. This is a contradiction, since the
|
||
voter contained the committed entry and leaderU was assumed not to.
|
||
|
||
7. Otherwise, leaderU’s last log term must have been larger than the voter’s.
|
||
Moreover, it was larger than T, since the voter’s last log term was at least T
|
||
(it contains the committed entry from term T). The earlier leader that created
|
||
leaderU’s last log entry must have contained the committed entry in its log
|
||
(by assumption). Then, by the Log Matching Property, leaderU’s log must also
|
||
contain the committed entry, which is a contradiction.
|
||
|
||
8. This completes the contradiction. Thus, the leaders of all terms greater
|
||
than T must contain all entries from term T that are committed in term T.
|
||
|
||
9. The Log Matching Property guarantees that future leaders will also contain
|
||
entries that are committed indirectly, such as index 2 in Figure 8(d).
|
||
|
||
Given the Leader Completeness Property, we can prove the State Machine Safety
|
||
Property from Figure 3, which states that if a server has applied a log entry
|
||
at a given index to its state machine, no other server will ever apply a
|
||
different log entry for the same index. At the time a server applies a log
|
||
entry to its state machine, its log must be identical to the leader’s log up
|
||
through that entry and the entry must be committed. Now consider the lowest
|
||
term in which any server applies a given log index; the Log Completeness
|
||
Property guarantees that the leaders for all higher terms will store that same
|
||
log entry, so servers that apply the index in later terms will apply the same
|
||
value. Thus, the State Machine Safety Property holds.
|
||
|
||
Finally, Raft requires servers to apply entries in log index order. Combined
|
||
with the State Machine Safety Property, this means that all servers will apply
|
||
exactly the same set of log entries to their state machines, in the same
|
||
order.
|
||
|
||
### 5.5 Follower and candidate crashes
|
||
|
||
Until this point we have focused on leader failures. Follower and candidate
|
||
crashes are much simpler to handle than leader crashes, and they are both
|
||
handled in the same way. If a follower or candidate crashes, then future
|
||
RequestVote and AppendEntries RPCs sent to it will fail. Raft handles these
|
||
failures by retrying indefinitely; if the crashed server restarts, then the
|
||
RPC will complete successfully. If a server crashes after completing an RPC
|
||
but before responding, then it will receive the same RPC again after it
|
||
restarts. Raft RPCs are idempotent, so this causes no harm. For example, if a
|
||
follower receives an AppendEntries request that includes log entries already
|
||
present in its log, it ignores those entries in the new request.
|
||
|
||
### 5.6 Timing and availability
|
||
|
||
One of our requirements for Raft is that safety must not depend on timing: the
|
||
system must not produce incorrect results just because some event happens more
|
||
quickly or slowly than expected. However, availability (the ability of the
|
||
system to respond to clients in a timely manner) must inevitably depend on
|
||
timing. For example, if message exchanges take longer than the typical time
|
||
between server crashes, candidates will not stay up long enough to win an
|
||
election; without a steady leader, Raft cannot make progress.
|
||
|
||
Leader election is the aspect of Raft where timing is most critical. Raft will
|
||
be able to elect and maintain a steady leader as long as the system satisfies
|
||
the following timing requirement:
|
||
|
||
broadcastTime ≪ electionTimeout ≪ MTBF
|
||
|
||
In this inequality broadcastTime is the average time it takes a server to send
|
||
RPCs in parallel to every server in the cluster and receive their responses;
|
||
electionTimeout is the election timeout described in Section 5.2; and MTBF is
|
||
the average time between failures for a single server. The broadcast time
|
||
should be an order of magnitude less than the election timeout so that leaders
|
||
can reliably send the heartbeat messages required to keep followers from
|
||
starting elections; given the randomized approach used for election timeouts,
|
||
this inequality also makes split votes unlikely. The election timeout should
|
||
be a few orders of magnitude less than MTBF so that the system makes steady
|
||
progress. When the leader crashes, the system will be unavailable for roughly
|
||
the election timeout; we would like this to represent only a small fraction of
|
||
overall time.
|
||
|
||
The broadcast time and MTBF are properties of the underlying system, while the
|
||
election timeout is something we must choose. Raft’s RPCs typically require
|
||
the recipient to persist information to stable storage, so the broadcast time
|
||
may range from 0.5ms to 20ms, depending on storage technology. As a result,
|
||
the election timeout is likely to be somewhere between 10ms and 500ms. Typical
|
||
server MTBFs are several months or more, which easily satisfies the timing
|
||
requirement.
|
||
|
||
## 6 Cluster membership changes
|
||
|
||
Up until now we have assumed that the cluster configuration (the set of
|
||
servers participating in the consensus algorithm) is fixed. In practice, it
|
||
will occasionally be necessary to change the configuration, for example to
|
||
replace servers when they fail or to change the degree of replication.
|
||
Although this can be done by taking the entire cluster off-line, updating
|
||
configuration files, and then restarting the cluster, this would leave the
|
||
cluster unavailable during the changeover. In addition, if there are any
|
||
manual steps, they risk operator error. In order to avoid these issues, we
|
||
decided to automate configuration changes and incorporate them into the Raft
|
||
consensus algorithm.
|
||
|
||
For the configuration change mechanism to be safe, there must be no point
|
||
during the transition where it is possible for two leaders to be elected for
|
||
the same term. Unfortunately, any approach where servers switch directly from
|
||
the old configuration to the new configuration is unsafe. It isn’t possible to
|
||
atomically switch all of the servers at once, so the cluster can potentially
|
||
split into two independent majorities during the transition (see Figure 10).
|
||
|
||
In order to ensure safety, configuration changes must use a two-phase
|
||
approach. There are a variety of ways to implement the two phases. For
|
||
example, some systems (e.g., [22]) use the first phase to disable the old
|
||
configuration so it cannot process client requests; then the second phase
|
||
enables the new configuration. In Raft the cluster first switches to a
|
||
transitional configuration we call joint consensus; once the joint consensus
|
||
has been committed, the system then transitions to the new configuration. The
|
||
joint consensus combines both the old and new configurations:
|
||
|
||
• Log entries are replicated to all servers in both configurations.
|
||
|
||
• Any server from either configuration may serve as leader.
|
||
|
||
• Agreement (for elections and entry commitment) requires separate majorities
|
||
from both the old and new configurations.
|
||
|
||
The joint consensus allows individual servers to transition between
|
||
configurations at different times without compromising safety. Furthermore,
|
||
joint consensus allows the cluster to continue servicing client requests
|
||
throughout the configuration change.
|
||
|
||
Cluster configurations are stored and communicated using special entries in
|
||
the replicated log; Figure 11 illustrates the configuration change process.
|
||
When the leader receives a request to change the configuration from Cold to
|
||
Cnew, it stores the configuration for joint consensus (Cold,new in the figure)
|
||
as a log entry and replicates that entry using the mechanisms described
|
||
previously. Once a given server adds the new configuration entry to its log,
|
||
it uses that configuration for all future decisions (a server always uses the
|
||
latest configuration in its log, regardless of whether the entry is
|
||
committed). This means that the leader will use the rules of Cold,new to
|
||
determine when the log entry for Cold,new is committed. If the leader crashes,
|
||
a new leader may be chosen under either Cold or Cold,new, depending on whether
|
||
the winning candidate has received Cold,new. In any case, Cnew cannot make
|
||
unilateral decisions during this period.
|
||
|
||
Once Cold,new has been committed, neither Cold nor Cnew can make decisions
|
||
without approval of the other, and the Leader Completeness Property ensures
|
||
that only servers with the Cold,new log entry can be elected as leader. It is
|
||
now safe for the leader to create a log entry describing Cnew and replicate it
|
||
to the cluster. Again, this configuration will take effect on each server as
|
||
soon as it is seen. When the new configuration has been committed under the
|
||
rules of Cnew, the old configuration is irrelevant and servers not in the new
|
||
configuration can be shut down. As shown in Figure 11, there is no time when
|
||
Cold and Cnew can both make unilateral decisions; this guarantees safety.
|
||
|
||
There are three more issues to address for reconfiguration. The first issue is
|
||
that new servers may not initially store any log entries. If they are added to
|
||
the cluster in this state, it could take quite a while for them to catch up,
|
||
during which time it might not be possible to commit new log entries. In order
|
||
to avoid availability gaps, Raft introduces an additional phase before the
|
||
configuration change, in which the new servers join the cluster as non-voting
|
||
members (the leader replicates log entries to them, but they are not
|
||
considered for majorities). Once the new servers have caught up with the rest
|
||
of the cluster, the reconfiguration can proceed as described above.
|
||
|
||
The second issue is that the cluster leader may not be part of the new
|
||
configuration. In this case, the leader steps down (returns to follower state)
|
||
once it has committed the Cnew log entry. This means that there will be a
|
||
period of time (while it is committing Cnew ) when the leader is managing a
|
||
cluster that does not include itself; it replicates log entries but does not
|
||
count itself in majorities. The leader transition occurs when Cnew is
|
||
committed because this is the first point when the new configuration can
|
||
operate independently (it will always be possible to choose a leader from Cnew
|
||
). Before this point, it may be the case that only a server from Cold can be
|
||
elected leader.
|
||
|
||
The third issue is that removed servers (those not in Cnew) can disrupt the
|
||
cluster. These servers will not receive heartbeats, so they will time out and
|
||
start new elections. They will then send RequestVote RPCs with new term
|
||
numbers, and this will cause the current leader to revert to follower state. A
|
||
new leader will eventually be elected, but the removed servers will time out
|
||
again and the process will repeat, resulting in poor availability.
|
||
|
||
To prevent this problem, servers disregard RequestVote RPCs when they believe
|
||
a current leader exists. Specifically, if a server receives a RequestVote RPC
|
||
within the minimum election timeout of hearing from a current leader, it does
|
||
not update its term or grant its vote. This does not affect normal elections,
|
||
where each server waits at least a minimum election timeout before starting an
|
||
election. However, it helps avoid disruptions from removed servers: if a
|
||
leader is able to get heartbeats to its cluster, then it will not be deposed
|
||
by larger term numbers.
|
||
|
||
|
||
## 7 Log compaction
|
||
|
||
Raft’s log grows during normal operation to incorporate more client requests,
|
||
but in a practical system, it cannot grow without bound. As the log grows
|
||
longer, it occupies more space and takes more time to replay. This will
|
||
eventually cause availability problems without some mechanism to discard
|
||
obsolete information that has accumulated in the log.
|
||
|
||
Snapshotting is the simplest approach to compaction. In snapshotting, the
|
||
entire current system state is written to a snapshot on stable storage, then
|
||
the entire log up to that point is discarded. Snapshotting is used in Chubby
|
||
and ZooKeeper, and the remainder of this section describes snapshotting in
|
||
Raft.
|
||
|
||
Incremental approaches to compaction, such as log cleaning [36] and
|
||
log-structured merge trees [30, 5], are also possible. These operate on a
|
||
fraction of the data at once, so they spread the load of compaction more
|
||
evenly over time. They first select a region of data that has accumulated many
|
||
deleted and overwritten objects, then they rewrite the live objects from that
|
||
region more compactly and free the region. This requires significant
|
||
additional mechanism and complexity compared to snapshotting, which simplifies
|
||
the problem by always operating on the entire data set. While log cleaning
|
||
would require modifications to Raft, state machines can implement LSM trees
|
||
using the same interface as snapshotting.
|
||
|
||
Figure 12 shows the basic idea of snapshotting in Raft. Each server takes
|
||
snapshots independently, covering just the committed entries in its log. Most
|
||
of the work consists of the state machine writing its current state to the
|
||
snapshot. Raft also includes a small amount of metadata in the snapshot: the
|
||
last included index is the index of the last entry in the log that the
|
||
snapshot replaces (the last entry the state machine had applied), and the last
|
||
included term is the term of this entry. These are preserved to support the
|
||
AppendEntries consistency check for the first log entry following the
|
||
snapshot, since that entry needs a previous log index and term. To enable
|
||
cluster membership changes (Section 6), the snapshot also includes the latest
|
||
configuration in the log as of last included index. Once a server completes
|
||
writing a snapshot, it may delete all log entries up through the last included
|
||
index, as well as any prior snapshot.
|
||
|
||
Although servers normally take snapshots independently, the leader must
|
||
occasionally send snapshots to followers that lag behind. This happens when
|
||
the leader has already discarded the next log entry that it needs to send to a
|
||
follower. Fortunately, this situation is unlikely in normal operation: a
|
||
follower that has kept up with the leader would already have this entry.
|
||
However, an exceptionally slow follower or a new server joining the cluster
|
||
(Section 6) would not. The way to bring such a follower up-to-date is for the
|
||
leader to send it a snapshot over the network.
|
||
|
||
The leader uses a new RPC called InstallSnapshot to send snapshots to
|
||
followers that are too far behind; see Figure 13. When a follower receives a
|
||
snapshot with this RPC, it must decide what to do with its existing log
|
||
entries. Usually the snapshot will contain new information not already in the
|
||
recipient’s log. In this case, the follower discards its entire log; it is all
|
||
superseded by the snapshot and may possibly have uncommitted entries that
|
||
conflict with the snapshot. If instead the follower receives a snapshot that
|
||
describes a prefix of its log (due to retransmission or by mistake), then log
|
||
entries covered by the snapshot are deleted but entries following the snapshot
|
||
are still valid and must be retained.
|
||
|
||
This snapshotting approach departs from Raft’s strong leader principle, since
|
||
followers can take snapshots without the knowledge of the leader. However, we
|
||
think this departure is justified. While having a leader helps avoid
|
||
conflicting decisions in reaching consensus, consensus has already been
|
||
reached when snapshotting, so no decisions conflict. Data still only flows
|
||
from leaders to followers, just followers can now reorganize their data.
|
||
|
||
We considered an alternative leader-based approach in which only the leader
|
||
would create a snapshot, then it would send this snapshot to each of its
|
||
followers. However, this has two disadvantages. First, sending the snapshot to
|
||
each follower would waste network bandwidth and slow the snapshotting process.
|
||
Each follower already has the information needed to produce its own snapshots,
|
||
and it is typically much cheaper for a server to produce a snapshot from its
|
||
local state than it is to send and receive one over the network. Second, the
|
||
leader’s implementation would be more complex. For example, the leader would
|
||
need to send snapshots to followers in parallel with replicating new log
|
||
entries to them, so as not to block new client requests.
|
||
|
||
There are two more issues that impact snapshotting performance. First, servers
|
||
must decide when to snapshot. If a server snapshots too often, it wastes disk
|
||
bandwidth and energy; if it snapshots too infrequently, it risks exhausting
|
||
its storage capacity, and it increases the time required to replay the log
|
||
during restarts. One simple strategy is to take a snapshot when the log
|
||
reaches a fixed size in bytes. If this size is set to be significantly larger
|
||
than the expected size of a snapshot, then the disk bandwidth overhead for
|
||
snapshotting will be small.
|
||
|
||
The second performance issue is that writing a snapshot can take a significant
|
||
amount of time, and we do not want this to delay normal operations. The
|
||
solution is to use copy-on-write techniques so that new updates can be
|
||
accepted without impacting the snapshot being written. For example, state
|
||
machines built with functional data structures naturally support this.
|
||
Alternatively, the operating system’s copy-on-write support (e.g., fork on
|
||
Linux) can be used to create an in-memory snapshot of the entire state machine
|
||
(our implementation uses this approach).
|
||
|
||
|
||
## 8 Client Interaction
|
||
|
||
This section describes how clients interact with Raft, including how clients
|
||
find the cluster leader and how Raft supports linearizable semantics [10].
|
||
These issues apply to all consensus-based systems, and Raft’s solutions are
|
||
similar to other systems.
|
||
|
||
Clients of Raft send all of their requests to the leader. When a client first
|
||
starts up, it connects to a randomlychosen server. If the client’s first
|
||
choice is not the leader, that server will reject the client’s request and
|
||
supply information about the most recent leader it has heard from
|
||
(AppendEntries requests include the network address of the leader). If the
|
||
leader crashes, client requests will time out; clients then try again with
|
||
randomly-chosen servers.
|
||
|
||
Our goal for Raft is to implement linearizable semantics (each operation
|
||
appears to execute instantaneously, exactly once, at some point between its
|
||
invocation and its response). However, as described so far Raft can execute a
|
||
command multiple times: for example, if the leader crashes after committing
|
||
the log entry but before responding to the client, the client will retry the
|
||
command with a new leader, causing it to be executed a second time. The
|
||
solution is for clients to assign unique serial numbers to every command.
|
||
Then, the state machine tracks the latest serial number processed for each
|
||
client, along with the associated response. If it receives a command whose
|
||
serial number has already been executed, it responds immediately without
|
||
re-executing the request.
|
||
|
||
Read-only operations can be handled without writing anything into the log.
|
||
However, with no additional measures, this would run the risk of returning
|
||
stale data, since the leader responding to the request might have been
|
||
superseded by a newer leader of which it is unaware. Linearizable reads must
|
||
not return stale data, and Raft needs two extra precautions to guarantee this
|
||
without using the log. First, a leader must have the latest information on
|
||
which entries are committed. The Leader Completeness Property guarantees that
|
||
a leader has all committed entries, but at the start of its term, it may not
|
||
know which those are. To find out, it needs to commit an entry from its term.
|
||
Raft handles this by having each leader commit a blank no-op entry into the
|
||
log at the start of its term. Second, a leader must check whether it has been
|
||
deposed before processing a read-only request (its information may be stale if
|
||
a more recent leader has been elected). Raft handles this by having the leader
|
||
exchange heartbeat messages with a majority of the cluster before responding
|
||
to read-only requests. Alternatively, the leader could rely on the heartbeat
|
||
mechanism to provide a form of lease [9], but this would rely on timing for
|
||
safety (it assumes bounded clock skew).
|
||
|
||
|
||
## Figures
|
||
|
||
### Figure 2
|
||
|
||
#### State
|
||
|
||
Persistent state on all servers: (Updated on stable storage before responding to RPCs)
|
||
|
||
currentTerm latest term server has seen (initialized to 0 on first boot,
|
||
increases monotonically)
|
||
|
||
votedFor candidateId that received vote in current term (or null if none)
|
||
|
||
log[] log entries; each entry contains command for state machine, and
|
||
term when entry was received by leader (first index is 1)
|
||
|
||
|
||
Volatile state on all servers:
|
||
|
||
commitIndex index of highest log entry known to be committed
|
||
(initialized to 0, increases monotonically)
|
||
|
||
lastApplied index of highest log entry applied to state machine
|
||
(initialized to 0, increases monotonically)
|
||
|
||
|
||
Volatile state on leaders: (Reinitialized after election)
|
||
|
||
nextIndex[] for each server, index of the next log entry to send to that
|
||
server (initialized to leader last log index + 1)
|
||
|
||
matchIndex[] for each server, index of highest log entry known to be
|
||
replicated on server (initialized to 0, increases monotonically)
|
||
|
||
|
||
#### AppendEntries RPC
|
||
|
||
Invoked by leader to replicate log entries (§5.3); also used as heartbeat (§5.2).
|
||
|
||
Arguments:
|
||
|
||
term leader’s term
|
||
leaderId so follower can redirect clients
|
||
prevLogIndex index of log entry immediately preceding new ones
|
||
prevLogTerm term of prevLogIndex entry
|
||
entries[] log entries to store (empty for heartbeat; may send more
|
||
than one for efficiency)
|
||
leaderCommit leader’s commitIndex
|
||
|
||
Results:
|
||
|
||
term currentTerm, for leader to update itself
|
||
success true if follower contained entry matching prevLogIndex
|
||
and prevLogTerm
|
||
|
||
Receiver implementation:
|
||
|
||
1. Reply false if term < currentTerm (§5.1)
|
||
|
||
2. Reply false if log doesn’t contain an entry at prevLogIndex whose term
|
||
matches prevLogTerm (§5.3)
|
||
|
||
3. If an existing entry conflicts with a new one (same index but different
|
||
terms), delete the existing entry and all that follow it (§5.3)
|
||
|
||
4. Append any new entries not already in the log
|
||
|
||
5. If leaderCommit > commitIndex, set commitIndex = min(leaderCommit, index of last new entry)
|
||
|
||
|
||
#### RequestVote RPC
|
||
|
||
Invoked by candidates to gather votes (§5.2).
|
||
|
||
Arguments:
|
||
|
||
term candidate’s term
|
||
candidateId candidate requesting vote
|
||
lastLogIndex index of candidate’s last log entry (§5.4)
|
||
lastLogTerm term of candidate’s last log entry (§5.4)
|
||
|
||
Results:
|
||
|
||
term currentTerm, for candidate to update itself
|
||
voteGranted true means candidate received vote
|
||
|
||
|
||
Receiver implementation:
|
||
|
||
1. Reply false if term < currentTerm (§5.1)
|
||
|
||
2. If votedFor is null or candidateId, and candidate’s log is at least as
|
||
up-to-date as receiver’s log, grant vote (§5.2, §5.4)
|
||
|
||
|
||
#### Rules for Servers
|
||
|
||
All Servers:
|
||
|
||
• If commitIndex > lastApplied: increment lastApplied, apply log[lastApplied]
|
||
to state machine (§5.3)
|
||
|
||
• If RPC request or response contains term T > currentTerm:
|
||
set currentTerm = T, convert to follower (§5.1)
|
||
|
||
Followers (§5.2):
|
||
|
||
• Respond to RPCs from candidates and leaders
|
||
|
||
• If election timeout elapses without receiving AppendEntries
|
||
|
||
RPC from current leader or granting vote to candidate: convert to candidate
|
||
|
||
Candidates (§5.2):
|
||
|
||
• On conversion to candidate, start election:
|
||
|
||
• Increment currentTerm
|
||
|
||
• Vote for self
|
||
|
||
• Reset election timer
|
||
|
||
• Send RequestVote RPCs to all other servers
|
||
|
||
• If votes received from majority of servers: become leader
|
||
|
||
• If AppendEntries RPC received from new leader: convert to follower
|
||
|
||
• If election timeout elapses: start new election
|
||
|
||
Leaders:
|
||
|
||
• Upon election: send initial empty AppendEntries RPCs (heartbeat) to each
|
||
server; repeat during idle periods to prevent election timeouts (§5.2)
|
||
|
||
• If command received from client: append entry to local log, respond after
|
||
entry applied to state machine (§5.3)
|
||
|
||
• If last log index ≥ nextIndex for a follower: send AppendEntries RPC with log
|
||
entries starting at nextIndex
|
||
|
||
• If successful: update nextIndex and matchIndex for follower (§5.3)
|
||
|
||
• If AppendEntries fails because of log inconsistency: decrement nextIndex and
|
||
retry (§5.3)
|
||
|
||
• If there exists an N such that N > commitIndex, a majority of
|
||
matchIndex[i] ≥ N, and log[N].term == currentTerm:
|
||
set commitIndex = N (§5.3, §5.4).
|
||
|
||
|
||
|
||
---
|
||
|
||
## Streaming Raft RPC (Non-standard)
|
||
|
||
|
||
### Heartbeat
|
||
|
||
Sent from the leader to the follower to establish dominance.
|
||
|
||
Arguments
|
||
|
||
term the leader's current term
|
||
leaderID so follower can redirect clients
|
||
commitIndex the current commit index
|
||
|
||
Results
|
||
|
||
term currentTerm, for leader to update itself.
|
||
index highest index written to disk.
|
||
|
||
|
||
|
||
### Stream
|
||
|
||
Request from the follower to the leader.
|
||
|
||
Arguments
|
||
|
||
term the follower's current term
|
||
prevLogIndex index of log entry immediately preceding new ones
|
||
prevLogTerm term of prevLogIndex entry
|
||
|
||
Results
|
||
|
||
none, streaming only
|