---
layout: "docs"
page_title: "Consensus Protocol"
sidebar_current: "docs-internals-consensus"
description: |-
  Consul uses a consensus protocol to provide Consistency as defined by CAP. The consensus protocol is based on Raft: In Search of an Understandable Consensus Algorithm. For a visual explanation of Raft, see The Secret Lives of Data.
---

# Consensus Protocol

Consul uses a [consensus protocol](https://en.wikipedia.org/wiki/Consensus_(computer_science))
to provide [Consistency (as defined by CAP)](https://en.wikipedia.org/wiki/CAP_theorem).
The consensus protocol is based on
["Raft: In Search of an Understandable Consensus Algorithm"](https://raft.github.io/raft.pdf).
For a visual explanation of Raft, see [The Secret Lives of Data](http://thesecretlivesofdata.com/raft).
## Raft Protocol Overview

Raft is a consensus algorithm that is based on
[Paxos](https://en.wikipedia.org/wiki/Paxos_%28computer_science%29). Compared
to Paxos, Raft is designed to have fewer states and a simpler, more
understandable algorithm.

There are a few key terms to know when discussing Raft:

* Log - The primary unit of work in a Raft system is a log entry. The problem
of consistency can be decomposed into a *replicated log*. A log is an ordered
sequence of entries. Entries include any cluster change: adding nodes, adding services, new key-value pairs, etc. We consider the log consistent
if all members agree on the entries and their order.
* FSM - [Finite State Machine](https://en.wikipedia.org/wiki/Finite-state_machine).
An FSM is a collection of finite states with transitions between them. As new logs
are applied, the FSM is allowed to transition between states. Application of the
same sequence of logs must result in the same state, meaning behavior must be deterministic.

* Peer set - The peer set is the set of all members participating in log replication.
For Consul's purposes, all server nodes are in the peer set of the local datacenter.
* Quorum - A quorum is a majority of members from a peer set: for a set of size `n`,
quorum requires at least `(n/2)+1` members, where the division is integer division.
For example, if there are 5 members in the peer set, we would need 3 nodes
to form a quorum (see the short sketch after this list). If a quorum of nodes is unavailable for any reason, the
cluster becomes *unavailable* and no new logs can be committed.

* Committed Entry - An entry is considered *committed* when it is durably stored
on a quorum of nodes. Once an entry is committed it can be applied.

* Leader - At any given time, the peer set elects a single node to be the leader.
The leader is responsible for ingesting new log entries, replicating to followers,
and managing when an entry is considered committed.
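
The quorum arithmetic above is simple enough to spell out. The following Go
sketch (illustrative only, not part of Consul) computes the quorum size and
the resulting failure tolerance for a few peer set sizes; its output matches
the [deployment table](#deployment_table) at the bottom of this page.

```go
package main

import "fmt"

// quorumSize returns the majority needed for a peer set of n servers:
// (n/2)+1, using integer division.
func quorumSize(n int) int {
    return n/2 + 1
}

// faultTolerance is the number of servers that can fail while a quorum
// of the original peer set remains reachable.
func faultTolerance(n int) int {
    return n - quorumSize(n)
}

func main() {
    for _, n := range []int{1, 2, 3, 5, 7} {
        fmt.Printf("servers=%d quorum=%d tolerates=%d failure(s)\n",
            n, quorumSize(n), faultTolerance(n))
    }
}
```

Note that a peer set of 2 still has a quorum of 2, so it tolerates zero
failures; this is why the two-peer example later on this page ends in
unavailability.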

Raft is a complex protocol and will not be covered here in detail (for those who
desire a more comprehensive treatment, the full specification is available in this
[paper](https://raft.github.io/raft.pdf)).
We will, however, attempt to provide a high level description which may be useful
for building a mental model.

Raft nodes are always in one of three states: follower, candidate, or leader. All
nodes initially start out as followers. In this state, nodes can accept log entries
from a leader and cast votes. If no entries are received for some time, nodes
self-promote to the candidate state. In the candidate state, nodes request votes from
their peers. If a candidate receives a quorum of votes, then it is promoted to leader.
The leader must accept new log entries and replicate to all the other followers.
In addition, if stale reads are not acceptable, all queries must also be performed on
the leader.
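
As a rough mental model, the three states and the transitions described above
can be sketched in a few lines of Go. This is a toy illustration of the
description, not Consul's (or hashicorp/raft's) implementation; terms, timers,
and log replication are all omitted.

```go
package main

import "fmt"

// State models the three roles a Raft node can hold.
type State int

const (
    Follower  State = iota // accepts log entries from a leader and casts votes
    Candidate              // requests votes after an election timeout
    Leader                 // accepts new log entries and replicates them
)

// onElectionTimeout fires when a follower hears nothing from a leader for
// longer than its election timeout: the node promotes itself to candidate.
func onElectionTimeout(s State) State {
    if s == Follower {
        return Candidate
    }
    return s
}

// onElectionResult promotes a candidate that received votes from a quorum
// of its peers; a losing candidate falls back to the follower state.
func onElectionResult(s State, votes, quorum int) State {
    if s != Candidate {
        return s
    }
    if votes >= quorum {
        return Leader
    }
    return Follower
}

func main() {
    s := Follower
    s = onElectionTimeout(s)      // follower -> candidate
    s = onElectionResult(s, 3, 3) // candidate -> leader (3 of 5 peers voted yes)
    fmt.Println(s == Leader)      // true
}
```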

Once a cluster has a leader, it is able to accept new log entries. A client can
request that a leader append a new log entry (from Raft's perspective, a log entry
is an opaque binary blob). The leader then writes the entry to durable storage and
attempts to replicate to a quorum of followers. Once the log entry is considered
*committed*, it can be *applied* to a finite state machine. The finite state machine
is application specific; in Consul's case, we use
[MemDB](https://github.com/hashicorp/go-memdb) to maintain cluster state. Consul
blocks a write until the entry is both _committed_ and _applied_. This achieves read-after-write semantics
when used with the [consistent](/api/features/consistency.html#consistent) mode for queries.
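
To make the commit/apply split concrete, here is a minimal, hypothetical FSM
in Go. It only sketches the abstraction described above: a deterministic
`Apply` over already-committed entries. Consul's real FSM is built on
go-memdb and its log entries encode much more than key/value writes.

```go
package main

import "fmt"

// Entry is a committed log entry. From Raft's point of view an entry is an
// opaque blob; it gets a tiny structure here (a key/value write) purely for
// illustration.
type Entry struct {
    Key, Value string
}

// KVStore is a toy deterministic FSM; the map stands in for Consul's MemDB.
type KVStore struct {
    data map[string]string
}

func NewKVStore() *KVStore {
    return &KVStore{data: make(map[string]string)}
}

// Apply is invoked only for entries that Raft has already committed, i.e.
// durably stored on a quorum of servers. Because Apply is deterministic,
// every server that applies the same log ends up with the same state.
func (s *KVStore) Apply(e Entry) {
    s.data[e.Key] = e.Value
}

func main() {
    fsm := NewKVStore()
    for _, e := range []Entry{{"foo", "1"}, {"foo", "2"}} {
        fsm.Apply(e)
    }
    fmt.Println(fsm.data["foo"]) // "2" on every server that applied this log
}
```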

Obviously, it would be undesirable to allow a replicated log to grow in an unbounded
fashion. Raft provides a mechanism by which the current state is snapshotted and the
log is compacted. Because of the FSM abstraction, restoring the state of the FSM must
result in the same state as a replay of old logs. This allows Raft to capture the FSM
state at a point in time and then remove all the logs that were used to reach that
state. This is performed automatically without user intervention and prevents unbounded
disk usage while also minimizing time spent replaying logs. One of the advantages of
using MemDB is that it allows Consul to continue accepting new transactions even while
old state is being snapshotted, preventing any availability issues.
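
The invariant that makes compaction safe is worth spelling out: restoring a
snapshot and then applying only the log entries written after it must yield
the same state as replaying the entire log. The sketch below demonstrates
this with the same toy key/value FSM idea as above; the names and types are
illustrative, not Consul's snapshot code.

```go
package main

import (
    "fmt"
    "reflect"
)

type entry struct{ key, value string }

// apply mutates a toy FSM (a string map) with one committed entry.
func apply(state map[string]string, e entry) { state[e.key] = e.value }

// snapshot captures the FSM state at a point in time.
func snapshot(state map[string]string) map[string]string {
    out := make(map[string]string, len(state))
    for k, v := range state {
        out[k] = v
    }
    return out
}

func main() {
    log := []entry{{"a", "1"}, {"b", "2"}, {"a", "3"}, {"c", "4"}}

    // Path 1: replay every entry ever written.
    full := map[string]string{}
    for _, e := range log {
        apply(full, e)
    }

    // Path 2: snapshot after the first three entries, discard them, and
    // later restore the snapshot plus only the remaining log suffix.
    upToSnap := map[string]string{}
    for _, e := range log[:3] {
        apply(upToSnap, e)
    }
    restored := snapshot(upToSnap) // log[:3] can now be deleted
    for _, e := range log[3:] {
        apply(restored, e)
    }

    // Both paths must agree, so compaction does not change the FSM.
    fmt.Println(reflect.DeepEqual(full, restored)) // true
}
```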

Consensus is fault-tolerant up to the point where quorum is available.
If a quorum of nodes is unavailable, it is impossible to process log entries or reason
about peer membership. For example, suppose there are only 2 peers: A and B. The quorum
size is also 2, meaning both nodes must agree to commit a log entry. If either A or B
fails, it is now impossible to reach quorum. This means the cluster is unable to add
or remove a node or to commit any additional log entries. This results in
*unavailability*. At this point, manual intervention would be required to remove
either A or B and to restart the remaining node in bootstrap mode.

A Raft cluster of 3 nodes can tolerate a single node failure while a cluster
of 5 can tolerate 2 node failures. The recommended configuration is to run
either 3 or 5 Consul servers per datacenter. This maximizes availability without
greatly sacrificing performance. The [deployment table](#deployment_table) below
summarizes the potential cluster size options and the fault tolerance of each.

In terms of performance, Raft is comparable to Paxos. Assuming stable leadership,
committing a log entry requires a single round trip to half of the cluster.
Thus, performance is bound by disk I/O and network latency. Although Consul is
not designed to be a high-throughput write system, it should handle on the order
of hundreds to thousands of transactions per second depending on network and
hardware configuration.

## Raft in Consul

Only Consul server nodes participate in Raft and are part of the peer set. All
client nodes forward requests to servers. Part of the reason for this design is
that, as more members are added to the peer set, the size of the quorum also increases.
This introduces performance problems as you may be waiting for hundreds of machines
to agree on an entry instead of a handful.

When getting started, a single Consul server is put into "bootstrap" mode. This mode
allows it to self-elect as a leader. Once a leader is elected, other servers can be
added to the peer set in a way that preserves consistency and safety. Eventually,
once the first few servers are added, bootstrap mode can be disabled. See [this
document](/docs/install/bootstrapping.html) for more details.

Since all servers participate as part of the peer set, they all know the current
leader. When an RPC request arrives at a non-leader server, the request is
forwarded to the leader. If the RPC is a *query* type, meaning it is read-only,
the leader generates the result based on the current state of the FSM. If
the RPC is a *transaction* type, meaning it modifies state, the leader
generates a new log entry and applies it using Raft. Once the log entry is committed
and applied to the FSM, the transaction is complete.
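
The request flow above can be summarized in a short sketch. The types and the
in-process "forwarding" below are purely illustrative, not Consul's RPC layer;
the point is the split between forwarding to the leader, read-only queries
answered from the FSM, and transactions that must go through Raft first.

```go
package main

import (
    "errors"
    "fmt"
)

// Request is a toy RPC; IsQuery distinguishes read-only queries from
// state-changing transactions.
type Request struct {
    IsQuery bool
    Key     string
}

// Server stands in for a Consul server: it either is the leader or knows
// who the leader is.
type Server struct {
    isLeader bool
    leader   *Server
    fsm      map[string]string // stand-in for the real FSM (MemDB in Consul)
}

// Handle mirrors the flow described above.
func (s *Server) Handle(r Request) (string, error) {
    if !s.isLeader {
        if s.leader == nil {
            return "", errors.New("no known leader")
        }
        return s.leader.Handle(r) // forward the RPC to the leader
    }
    if r.IsQuery {
        return s.fsm[r.Key], nil // query: answer from the current FSM state
    }
    // Transaction: append a new log entry, wait until it is committed and
    // applied to the FSM, then respond. The Raft machinery is elided here.
    return "", nil
}

func main() {
    leader := &Server{isLeader: true, fsm: map[string]string{"foo": "bar"}}
    follower := &Server{leader: leader}

    v, err := follower.Handle(Request{IsQuery: true, Key: "foo"})
    fmt.Println(v, err) // "bar" <nil>, served by the leader after forwarding
}
```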

Because of the nature of Raft's replication, performance is sensitive to network
latency. For this reason, each datacenter elects an independent leader and maintains
a disjoint peer set. Data is partitioned by datacenter, so each leader is responsible
only for data in its datacenter. When a request is received for a remote datacenter,
the request is forwarded to the correct leader. This design allows for lower latency
transactions and higher availability without sacrificing consistency.

## Consistency Modes

Although all writes to the replicated log go through Raft, reads are more
flexible. To support various trade-offs that developers may want, Consul
supports three different consistency modes for reads.

The three read modes are:

* `default` - Raft makes use of leader leasing, providing a time window
  in which the leader assumes its role is stable. However, if a leader
  is partitioned from the remaining peers, a new leader may be elected
  while the old leader is holding the lease. This means there are 2 leader
  nodes. There is no risk of a split-brain since the old leader will be
  unable to commit new logs. However, if the old leader services any reads,
  the values are potentially stale. The default consistency mode relies only
  on leader leasing, exposing clients to potentially stale values. We make
  this trade-off because reads are fast, usually strongly consistent, and
  only stale in a hard-to-trigger situation. The time window of stale reads
  is also bounded since the leader will step down due to the partition.

* `consistent` - This mode is strongly consistent without caveats. It requires
  that a leader verify with a quorum of peers that it is still leader. This
  introduces an additional round trip to all server nodes. The trade-off is
  always-consistent reads but increased latency due to the extra round trip.

* `stale` - This mode allows any server to service the read regardless of whether
  it is the leader. This means reads can be arbitrarily stale but are generally
  within 50 milliseconds of the leader. The trade-off is very fast and scalable
  reads but with stale values. This mode allows reads without a leader, meaning
  a cluster that is unavailable will still be able to respond.

For more documentation about using these various modes, see the
[HTTP API](/api/features/consistency.html).
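
As a quick illustration, a mode is selected per request with a query
parameter on read endpoints: `?consistent` or `?stale`, with `default` used
when neither is given. The Go snippet below issues the same KV read in all
three modes; it assumes a local agent listening on the default HTTP port
8500 and an existing key named `foo`.

```go
package main

import (
    "fmt"
    "io"
    "net/http"
)

func main() {
    base := "http://127.0.0.1:8500/v1/kv/foo"

    // default: leader leasing, usually consistent but possibly stale.
    // consistent: the leader confirms leadership with a quorum first.
    // stale: any server may answer, trading freshness for speed.
    for _, url := range []string{base, base + "?consistent", base + "?stale"} {
        resp, err := http.Get(url)
        if err != nil {
            fmt.Println(url, err)
            continue
        }
        body, _ := io.ReadAll(resp.Body)
        resp.Body.Close()
        fmt.Println(url, resp.Status, len(body), "bytes")
    }
}
```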

## <a name="deployment_table"></a>Deployment Table

Below is a table that shows quorum size and failure tolerance for various
cluster sizes. The recommended deployment is either 3 or 5 servers. A single
server deployment is _**highly**_ discouraged as data loss is inevitable in a
failure scenario.

<table class="table table-bordered table-striped">
  <tr>
    <th>Servers</th>
    <th>Quorum Size</th>
    <th>Failure Tolerance</th>
  </tr>
  <tr>
    <td>1</td>
    <td>1</td>
    <td>0</td>
  </tr>
  <tr>
    <td>2</td>
    <td>2</td>
    <td>0</td>
  </tr>
  <tr class="warning">
    <td>3</td>
    <td>2</td>
    <td>1</td>
  </tr>
  <tr>
    <td>4</td>
    <td>3</td>
    <td>1</td>
  </tr>
  <tr class="warning">
    <td>5</td>
    <td>3</td>
    <td>2</td>
  </tr>
  <tr>
    <td>6</td>
    <td>4</td>
    <td>2</td>
  </tr>
  <tr>
    <td>7</td>
    <td>4</td>
    <td>3</td>
  </tr>
</table>