# Raft library

Raft is a protocol with which a cluster of nodes can maintain a replicated state machine.
The state machine is kept in sync through the use of a replicated log.
For more details on Raft, see "In Search of an Understandable Consensus Algorithm"
(https://raft.github.io/raft.pdf) by Diego Ongaro and John Ousterhout.

This Raft library is stable and feature complete. As of 2016, it is **the most widely used** Raft library in production, serving tens of thousands of clusters each day. It powers distributed systems such as etcd, Kubernetes, Docker Swarm, Cloud Foundry Diego, CockroachDB, TiDB, Project Calico, Flannel, Hyperledger and more.

Most Raft implementations have a monolithic design, including storage handling, message serialization, and network transport. This library instead follows a minimalistic design philosophy by only implementing the core Raft algorithm. This minimalism buys flexibility, determinism, and performance.

To keep the codebase small as well as provide flexibility, the library only implements the Raft algorithm; both network and disk IO are left to the user. Library users must implement their own transport layer for message passing between Raft peers over the wire. Similarly, users must implement their own storage layer to persist the Raft log and state.

In order to easily test the Raft library, its behavior should be deterministic. To achieve this determinism, the library models Raft as a state machine. The state machine takes a `Message` as input. A message can either be a local timer update or a network message sent from a remote peer. The state machine's output is a 3-tuple `{[]Messages, []LogEntries, NextState}` consisting of an array of `Messages`, `log entries`, and `Raft state changes`. For state machines with the same state, the same state machine input should always generate the same state machine output.
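
In code, this output tuple corresponds to the `Ready` struct returned from `Node.Ready()` (described in the Usage section below). A minimal orientation sketch, matching the library's `Ready` field names:

```go
// One iteration of output from the raft state machine.
rd := <-n.Ready()

_ = rd.Entries          // log entries to append to stable storage
_ = rd.HardState        // Raft state changes (term, vote, commit) to persist
_ = rd.Messages         // messages to send to remote peers
_ = rd.CommittedEntries // entries ready to be applied by the application
```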

A simple example application, _raftexample_, is also available to help illustrate how to use this package in practice: https://github.com/etcd-io/etcd/tree/master/contrib/raftexample

# Features

This Raft implementation is a full-featured implementation of the Raft protocol. Features include:

- Leader election
- Log replication
- Log compaction
- Membership changes
- Leadership transfer extension
- Efficient linearizable read-only queries served by both the leader and followers (see the sketch after this list)
  - the leader checks with quorum and bypasses the Raft log before processing read-only queries
  - followers ask the leader to get a safe read index before processing read-only queries
- More efficient lease-based linearizable read-only queries served by both the leader and followers
  - the leader bypasses the Raft log and processes read-only queries locally
  - followers ask the leader to get a safe read index before processing read-only queries
  - this approach relies on the clocks of all the machines in the raft group
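
Linearizable reads are requested with `Node.ReadIndex`, and the resulting safe read index is delivered back in `Ready.ReadStates`. A minimal sketch, where `rctx` is an application-chosen unique identifier for the request and the surrounding `Config`/`Ready` handling follows the Usage section below:

```go
// The read-only mode is selected via Config.ReadOnlyOption:
// raft.ReadOnlySafe (quorum check) or raft.ReadOnlyLeaseBased
// (which additionally requires CheckQuorum to be enabled).

// Ask for a read index; rctx identifies this particular request.
rctx := []byte("unique-request-id")
n.ReadIndex(ctx, rctx)

// Later, Ready.ReadStates reports the safe index: once the applied index
// reaches rs.Index, the read can be served from the local state machine.
for _, rs := range rd.ReadStates {
  if bytes.Equal(rs.RequestCtx, rctx) {
    _ = rs.Index // safe index for this read-only query
  }
}
```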

This raft implementation also includes a few optional enhancements (several are tuned through `Config` fields, as sketched after this list):

- Optimistic pipelining to reduce log replication latency
- Flow control for log replication
- Batching Raft messages to reduce synchronized network I/O calls
- Batching log entries to reduce synchronized disk I/O
- Writing to the leader's disk in parallel
- Internal proposal redirection from followers to the leader
- Automatic stepping down when the leader loses quorum
- Protection against unbounded log growth when quorum is lost
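
A minimal sketch of the `Config` knobs behind some of these enhancements (the values shown are purely illustrative):

```go
c := &raft.Config{
  // ... ID, ElectionTick, HeartbeatTick, Storage, etc. as in the Usage section ...
  MaxSizePerMsg:             4096,    // batch multiple log entries into one append message
  MaxInflightMsgs:           256,     // flow control: cap optimistic, in-flight append messages
  MaxUncommittedEntriesSize: 1 << 30, // bound uncommitted log growth (e.g. while quorum is lost)
  CheckQuorum:               true,    // leader steps down if it cannot hear from a quorum
}
```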

## Notable Users

- [cockroachdb](https://github.com/cockroachdb/cockroach) A Scalable, Survivable, Strongly-Consistent SQL Database
- [dgraph](https://github.com/dgraph-io/dgraph) A Scalable, Distributed, Low Latency, High Throughput Graph Database
- [etcd](https://github.com/etcd-io/etcd) A distributed reliable key-value store
- [tikv](https://github.com/pingcap/tikv) A Distributed transactional key value database powered by Rust and Raft
- [swarmkit](https://github.com/docker/swarmkit) A toolkit for orchestrating distributed systems at any scale.
- [chain core](https://github.com/chain/chain) Software for operating permissioned, multi-asset blockchain networks

## Usage

The primary object in raft is a Node. Either start a Node from scratch using raft.StartNode or start a Node from some initial state using raft.RestartNode.

To start a three-node cluster:
```go
  storage := raft.NewMemoryStorage()
  c := &raft.Config{
    ID:              0x01,
    ElectionTick:    10,
    HeartbeatTick:   1,
    Storage:         storage,
    MaxSizePerMsg:   4096,
    MaxInflightMsgs: 256,
  }
  // Set peer list to the other nodes in the cluster.
  // Note that they need to be started separately as well.
  n := raft.StartNode(c, []raft.Peer{{ID: 0x02}, {ID: 0x03}})
```

Start a single-node cluster, like so:
```go
  // Create storage and config as shown above.
  // Set peer list to itself, so this node can become the leader of this single-node cluster.
  peers := []raft.Peer{{ID: 0x01}}
  n := raft.StartNode(c, peers)
```

To allow a new node to join this cluster, do not pass in any peers. First, add the node to the existing cluster by calling `ProposeConfChange` on any existing node inside the cluster. Then, start the node with an empty peer list, like so:
```go
  // Create storage and config as shown above.
  n := raft.StartNode(c, nil)
```

To restart a node from previous state:
```go
  storage := raft.NewMemoryStorage()

  // Recover the in-memory storage from persistent snapshot, state and entries.
  storage.ApplySnapshot(snapshot)
  storage.SetHardState(state)
  storage.Append(entries)

  c := &raft.Config{
    ID:              0x01,
    ElectionTick:    10,
    HeartbeatTick:   1,
    Storage:         storage,
    MaxSizePerMsg:   4096,
    MaxInflightMsgs: 256,
  }

  // Restart raft without peer information.
  // Peer information is already included in the storage.
  n := raft.RestartNode(c)
```

After creating a Node, the user has a few responsibilities:

First, read from the Node.Ready() channel and process the updates it contains. These steps may be performed in parallel, except as noted in step 2.

1. Write Entries, HardState and Snapshot to persistent storage in order, i.e. Entries first, then HardState and Snapshot if they are not empty. If persistent storage supports atomic writes then all of them can be written together. Note that when writing an Entry with Index i, any previously-persisted entries with Index >= i must be discarded.

2. Send all Messages to the nodes named in the To field. It is important that no messages be sent until the latest HardState has been persisted to disk, along with all Entries written by any previous Ready batch (Messages may be sent while entries from the same batch are being persisted). To reduce I/O latency, an optimization can be applied to make the leader write to disk in parallel with its followers (as explained in section 10.2.1 of the Raft thesis). If any Message has type MsgSnap, call Node.ReportSnapshot() after it has been sent (these messages may be large). Note: Marshalling messages is not thread-safe; it is important to make sure that no new entries are persisted while marshalling. The easiest way to achieve this is to serialise the messages directly inside the main raft loop.

3. Apply Snapshot (if any) and CommittedEntries to the state machine. If any committed Entry has Type EntryConfChange, call Node.ApplyConfChange() to apply it to the node. The configuration change may be cancelled at this point by setting the NodeID field to zero before calling ApplyConfChange (but ApplyConfChange must be called one way or the other, and the decision to cancel must be based solely on the state machine and not external information such as the observed health of the node); see the sketch after this list.

4. Call Node.Advance() to signal readiness for the next batch of updates. This may be done at any time after step 1, although all updates must be processed in the order they were returned by Ready.
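
Regarding step 3, cancelling a configuration change looks roughly like the following sketch; `rejectedByStateMachine` is a hypothetical application predicate standing in for a decision derived from the state machine itself:

```go
  var cc raftpb.ConfChange
  cc.Unmarshal(entry.Data)
  if rejectedByStateMachine(cc) {
    cc.NodeID = raft.None // a zero NodeID cancels the change
  }
  n.ApplyConfChange(cc)   // must be called either way
```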

Second, all persisted log entries must be made available via an implementation of the Storage interface. The provided MemoryStorage type can be used for this (if repopulating its state upon a restart), or a custom disk-backed implementation can be supplied.
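
For orientation, the Storage interface consists of the methods below (see storage.go for the authoritative definition and the exact method contracts):

```go
type Storage interface {
  // InitialState returns the saved HardState and ConfState information.
  InitialState() (raftpb.HardState, raftpb.ConfState, error)
  // Entries returns a slice of log entries in the range [lo, hi),
  // limited in total byte size by maxSize.
  Entries(lo, hi, maxSize uint64) ([]raftpb.Entry, error)
  // Term returns the term of entry i.
  Term(i uint64) (uint64, error)
  // LastIndex returns the index of the last entry in the log.
  LastIndex() (uint64, error)
  // FirstIndex returns the index of the first log entry that is
  // possibly available via Entries.
  FirstIndex() (uint64, error)
  // Snapshot returns the most recent snapshot.
  Snapshot() (raftpb.Snapshot, error)
}
```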

Third, after receiving a message from another node, pass it to Node.Step:

```go
	func recvRaftRPC(ctx context.Context, m raftpb.Message) {
		n.Step(ctx, m)
	}
```

Finally, call `Node.Tick()` at regular intervals (probably via a `time.Ticker`). Raft has two important timeouts: heartbeat and the election timeout. However, internally to the raft package time is represented by an abstract "tick".

The total state machine handling loop will look something like this:

```go
  for {
    select {
    case <-s.Ticker:
      n.Tick()
    case rd := <-s.Node.Ready():
      saveToStorage(rd.HardState, rd.Entries, rd.Snapshot)
      send(rd.Messages)
      if !raft.IsEmptySnap(rd.Snapshot) {
        processSnapshot(rd.Snapshot)
      }
      for _, entry := range rd.CommittedEntries {
        process(entry)
        if entry.Type == raftpb.EntryConfChange {
          var cc raftpb.ConfChange
          cc.Unmarshal(entry.Data)
          s.Node.ApplyConfChange(cc)
        }
      }
      s.Node.Advance()
    case <-s.done:
      return
    }
  }
```
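
The `saveToStorage`, `send`, `processSnapshot` and `process` helpers above are application-provided. As a rough sketch only: `saveToStorage` could mirror the restart example if `MemoryStorage` is used (a durable implementation must also follow the ordering rules in step 1), and `send` hands messages to the application's transport (`transport.Send` is an assumed application API, not part of this library):

```go
func saveToStorage(hs raftpb.HardState, entries []raftpb.Entry, snap raftpb.Snapshot) {
  if !raft.IsEmptySnap(snap) {
    storage.ApplySnapshot(snap)
  }
  if !raft.IsEmptyHardState(hs) {
    storage.SetHardState(hs)
  }
  storage.Append(entries)
}

func send(msgs []raftpb.Message) {
  for _, m := range msgs {
    err := transport.Send(m) // assumed application transport
    if m.Type == raftpb.MsgSnap {
      // Report the outcome of snapshot delivery so raft can follow up.
      status := raft.SnapshotFinish
      if err != nil {
        status = raft.SnapshotFailure
      }
      s.Node.ReportSnapshot(m.To, status)
    }
  }
}
```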

To propose changes to the state machine, serialize the application data into a byte slice and call:

```go
	n.Propose(ctx, data)
```

If the proposal is committed, the data will appear in committed entries with type raftpb.EntryNormal. There is no guarantee that a proposed command will be committed; the command may have to be reproposed after a timeout.
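
One possible pattern for this, sketched here with a hypothetical `waitCommitted` helper that watches the committed entries for the proposed data:

```go
for {
  // Bound each proposal attempt with a deadline.
  ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
  err := n.Propose(ctx, data)
  cancel()
  if err == nil && waitCommitted(data) {
    break // the proposal was committed
  }
  // Otherwise, re-propose and try again.
}
```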

To add or remove a node in a cluster, build a ConfChange struct 'cc' and call:

```go
	n.ProposeConfChange(ctx, cc)
```
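
For example, a ConfChange that adds a node might be built as follows (the node ID 0x04 is illustrative):

```go
	cc := raftpb.ConfChange{
		Type:   raftpb.ConfChangeAddNode,
		NodeID: 0x04,
	}
	n.ProposeConfChange(ctx, cc)
```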

After the config change is committed, some committed entry with type raftpb.EntryConfChange will be returned. This must be applied to the node through:

```go
	var cc raftpb.ConfChange
	cc.Unmarshal(data)
	n.ApplyConfChange(cc)
```

Note: An ID represents a unique node in a cluster for all time. A given ID MUST be used only once even if the old node has been removed. This means that, for example, IP addresses make poor node IDs since they may be reused. Node IDs must be non-zero.

## Implementation notes

This implementation is up to date with the final Raft thesis (https://github.com/ongardie/dissertation/blob/master/stanford.pdf), although this implementation of the membership change protocol differs somewhat from that described in chapter 4. The key invariant that membership changes happen one node at a time is preserved, but in our implementation the membership change takes effect when its entry is applied, not when it is added to the log (so the entry is committed under the old membership instead of the new). This is equivalent in terms of safety, since the old and new configurations are guaranteed to overlap.

To ensure there is no attempt to commit two membership changes at once by matching log positions (which would be unsafe since they should have different quorum requirements), any proposed membership change is simply disallowed while any uncommitted change appears in the leader's log.

This approach introduces a problem when removing a member from a two-member cluster: if one of the members dies before the other one receives the commit of the confchange entry, then the member cannot be removed any more since the cluster cannot make progress. For this reason it is highly recommended to use three or more nodes in every cluster.