---
layout: "docs"
page_title: "Gossip Protocol"
sidebar_current: "docs-internals-gossip"
description: |-
  Serf uses a gossip protocol to broadcast messages to the cluster. This page documents the details of this internal protocol. The gossip protocol is based on SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol, with a few minor adaptations, mostly to increase propagation speed and convergence rate.
---

# Gossip Protocol

Serf uses a [gossip protocol](https://en.wikipedia.org/wiki/Gossip_protocol)
to broadcast messages to the cluster. This page documents the details of
this internal protocol. The gossip protocol is based on
["SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol"](https://www.cs.cornell.edu/projects/Quicksilver/public_pdfs/SWIM.pdf),
with a few minor adaptations, mostly to increase propagation speed
and convergence rate.

~> **Advanced Topic!** This page covers the technical details of
the internals of Serf. You don't need to know these details to effectively
operate and use Serf. These details are documented here for those who wish
to learn about them without having to go spelunking through the source code.

## SWIM Protocol Overview

Serf begins by joining an existing cluster or starting a new
cluster. If starting a new cluster, additional nodes are expected to join
it. New nodes in an existing cluster must be given the address of at
least one existing member in order to join the cluster. The new member
does a full state sync with the existing member over TCP and begins gossiping its
existence to the cluster.
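For illustration, here is a minimal sketch of starting a node and joining it to an existing cluster with the Serf Go library. The node name and peer address below are placeholders, and the configuration is kept to the bare minimum rather than anything production-ready:

```go
package main

import (
	"log"

	"github.com/hashicorp/serf/serf"
)

func main() {
	// Start a new Serf instance with mostly default settings.
	// "node-b" is a placeholder name for this example.
	conf := serf.DefaultConfig()
	conf.NodeName = "node-b"

	s, err := serf.Create(conf)
	if err != nil {
		log.Fatalf("failed to start serf: %v", err)
	}
	defer s.Shutdown()

	// Join by contacting at least one known member (placeholder address).
	// This triggers the full TCP state sync described above, after which
	// the new node's existence is gossiped to the rest of the cluster.
	n, err := s.Join([]string{"10.0.0.1:7946"}, false)
	if err != nil {
		log.Fatalf("join failed: %v", err)
	}
	log.Printf("contacted %d nodes during join", n)
}
```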

Gossip is done over UDP with a configurable but fixed fanout and interval.
This ensures that network usage is constant with regard to the number of nodes.
Complete state exchanges with a random node are done periodically over
TCP, but much less often than gossip messages. This increases the likelihood
that the membership list converges properly since the full state is exchanged
and merged. The interval between full state exchanges is configurable or can
be disabled entirely.
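Serf's gossip layer is provided by the memberlist library, where these knobs are exposed as configuration fields. The sketch below only shows where the fanout, gossip interval, and full state sync (push/pull) interval live; the values are illustrative, not recommendations:

```go
package main

import (
	"fmt"
	"time"

	"github.com/hashicorp/memberlist"
)

func main() {
	// Start from the LAN defaults and adjust the gossip knobs.
	conf := memberlist.DefaultLANConfig()

	// Fixed fanout: how many random nodes receive each gossip round.
	conf.GossipNodes = 3

	// How often a gossip round is sent over UDP.
	conf.GossipInterval = 200 * time.Millisecond

	// How often a full TCP state exchange (push/pull) is done with a
	// random node. Setting this to zero disables full state syncs.
	conf.PushPullInterval = 30 * time.Second

	fmt.Printf("fanout=%d, gossip every %s, full sync every %s\n",
		conf.GossipNodes, conf.GossipInterval, conf.PushPullInterval)
}
```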

Failure detection is done by periodic random probing using a configurable interval.
If the node fails to ack within a reasonable time (typically some multiple
of RTT), then an indirect probe is attempted. An indirect probe asks a
configurable number of random nodes to probe the same node, in case there
are network issues causing our own node to fail the probe. If both our
probe and the indirect probes fail within a reasonable time, then the
node is marked "suspicious" and this knowledge is gossiped to the cluster.
A suspicious node is still considered a member of the cluster. If the suspect member
of the cluster does not dispute the suspicion within a configurable period of
time, the node is finally considered dead, and this state is then gossiped
to the cluster.
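The control flow of one probe round looks roughly like the sketch below. This is an illustrative outline only, not the memberlist implementation; the probe functions are stubbed out with random results so the flow is runnable:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// directProbe stands in for sending a UDP ping to the target and waiting
// for an ack. It is stubbed with a random outcome so the sketch runs.
func directProbe(target string, timeout time.Duration) bool {
	return rand.Intn(4) != 0
}

// indirectProbe stands in for asking another member to probe the target
// on our behalf and relay the ack. Also stubbed.
func indirectProbe(via, target string, timeout time.Duration) bool {
	return rand.Intn(4) != 0
}

// probeRound mirrors the failure-detection flow described above: a direct
// probe, then indirect probes through random peers, then suspicion.
func probeRound(target string, peers []string, indirectChecks int, timeout time.Duration) string {
	if directProbe(target, timeout) {
		return "alive"
	}
	// Our own path to the target may be the problem, so ask a configurable
	// number of random peers to probe it for us.
	for i := 0; i < indirectChecks && i < len(peers); i++ {
		via := peers[rand.Intn(len(peers))]
		if indirectProbe(via, target, timeout) {
			return "alive (indirect ack)"
		}
	}
	// No ack by any path: the target is marked suspect and that fact is
	// gossiped. It stays a member until the suspicion timeout expires
	// without a refutation, at which point it is declared dead.
	return "suspect"
}

func main() {
	peers := []string{"node-b", "node-c", "node-d"}
	fmt.Println(probeRound("node-x", peers, 3, 500*time.Millisecond))
}
```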

This is a brief and incomplete description of the protocol. For a better idea,
please read the
[SWIM paper](https://www.cs.cornell.edu/projects/Quicksilver/public_pdfs/SWIM.pdf)
in its entirety, along with the Serf source code.

## SWIM Modifications

As mentioned earlier, the gossip protocol is based on SWIM but includes
minor changes, mostly to increase propagation speed and convergence rates.

The changes from SWIM are noted here:

* Serf does a full state sync over TCP periodically. SWIM only propagates
  changes over gossip. While both are eventually consistent, Serf is able to
  more quickly reach convergence, as well as gracefully recover from network
  partitions.

* Serf has a dedicated gossip layer separate from the failure detection
  protocol. SWIM only piggybacks gossip messages on top of probe/ack messages.
  Serf uses piggybacking along with dedicated gossip messages. This
  feature lets you have a higher gossip rate (for example once per 200ms)
  and a slower failure detection rate (such as once per second), resulting
  in overall faster convergence rates and data propagation speeds, as
  sketched after this list.

* Serf keeps the state of dead nodes around for a set amount of time,
  so that when full syncs are requested, the requester also receives information
  about dead nodes. Because SWIM doesn't do full syncs, SWIM deletes dead node
  state immediately upon learning that the node is dead. This change again helps
  the cluster converge more quickly.
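The second point above shows up directly in the gossip layer's configuration: the gossip interval and the failure-detection probe interval are independent knobs. Continuing the memberlist-based sketch from earlier (values illustrative only):

```go
package main

import (
	"fmt"
	"time"

	"github.com/hashicorp/memberlist"
)

func main() {
	conf := memberlist.DefaultLANConfig()

	// Dedicated gossip messages can be sent frequently...
	conf.GossipInterval = 200 * time.Millisecond

	// ...while failure-detection probes run on their own, slower cycle.
	conf.ProbeInterval = 1 * time.Second

	fmt.Printf("gossip every %s, probe a random node every %s\n",
		conf.GossipInterval, conf.ProbeInterval)
}
```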

<a name="lifeguard"></a>
## Lifeguard Enhancements

SWIM makes the assumption that the local node is healthy in the sense
that soft real-time processing of packets is possible. However, in cases
where the local node is experiencing CPU or network exhaustion this assumption
can be violated. The result is that the node health can occasionally flap,
resulting in false monitoring alarms, adding noise to telemetry, and simply
causing the overall cluster to waste CPU and network resources diagnosing a
failure that may not truly exist.

Serf 0.8 added Lifeguard, which completely resolves this issue with novel
enhancements to SWIM.

The first extension introduces a "nack" message to probe queries. If the
probing node realizes it is missing "nack" messages then it becomes aware
that it may be degraded and slows down its failure detector. As nack messages
begin arriving, the failure detector is sped back up.
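One way to picture this is as a local health score that stretches the failure detector's timing while the local node may itself be the problem. The sketch below is illustrative only; the type, field, and function names are made up for this example and it is not the Lifeguard or memberlist implementation:

```go
package main

import (
	"fmt"
	"time"
)

// localHealth is a hypothetical health score: 0 means healthy, larger
// values mean the local node suspects it is degraded.
type localHealth struct {
	score, max int
}

// missedNack is called when an expected nack never arrived; the local
// node takes this as evidence it is processing packets too slowly.
func (h *localHealth) missedNack() {
	if h.score < h.max {
		h.score++
	}
}

// sawTimelyAck is called when probes are being acked promptly again.
func (h *localHealth) sawTimelyAck() {
	if h.score > 0 {
		h.score--
	}
}

// probeTimeout stretches the base timeout as the score rises, slowing the
// failure detector down while the local node may be degraded.
func (h *localHealth) probeTimeout(base time.Duration) time.Duration {
	return base * time.Duration(h.score+1)
}

func main() {
	h := &localHealth{max: 8}
	base := 500 * time.Millisecond

	h.missedNack()
	h.missedNack()
	fmt.Println("degraded timeout:", h.probeTimeout(base)) // 1.5s

	h.sawTimelyAck()
	h.sawTimelyAck()
	fmt.Println("recovered timeout:", h.probeTimeout(base)) // 500ms
}
```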

The second change introduces a dynamically changing suspicion timeout
before declaring another node as failed. The probing node will initially
start with a very long suspicion timeout. As other nodes in the cluster confirm
a node is suspect, the timer accelerates. During normal operations the
detection time is actually the same as in previous versions of Serf. However,
if a node is degraded and doesn't get confirmations, there is a long timeout
which allows the suspected node to refute its status and remain healthy.
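The Lifeguard paper describes this as a timeout that decays from a maximum toward a minimum as independent confirmations arrive, roughly on a logarithmic scale. A sketch of that idea follows; the function and parameter names are ours, and the exact curve used by Serf may differ:

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// suspicionTimeout decays from max toward min as independent confirmations
// of the suspicion arrive. With zero confirmations the full (long) timeout
// applies; once the expected number of confirmations is reached, the
// minimum applies. The logarithmic shape follows the idea in the Lifeguard
// paper, though this exact formula is only illustrative.
func suspicionTimeout(confirmations, expected int, min, max time.Duration) time.Duration {
	frac := math.Log(float64(confirmations)+1) / math.Log(float64(expected)+1)
	timeout := max - time.Duration(frac*float64(max-min))
	if timeout < min {
		timeout = min
	}
	return timeout
}

func main() {
	min, max := 3*time.Second, 30*time.Second
	for c := 0; c <= 3; c++ {
		fmt.Printf("%d confirmations -> %s\n", c, suspicionTimeout(c, 3, min, max))
	}
}
```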

These two mechanisms combine to make Serf much more robust to degraded nodes in a
cluster, while keeping failure detection performance unchanged. There is no
additional configuration for Lifeguard; it tunes itself automatically.

For more details about Lifeguard, please see the
[Making Gossip More Robust with Lifeguard](https://www.hashicorp.com/blog/making-gossip-more-robust-with-lifeguard/)
blog post, which provides a high level overview of the HashiCorp Research paper
[Lifeguard: SWIM-ing with Situational Awareness](https://arxiv.org/abs/1707.00788).

## Serf-Specific Messages

On top of the SWIM-based gossip layer, Serf sends some custom message types.

Serf makes heavy use of [Lamport clocks](https://en.wikipedia.org/wiki/Lamport_timestamps)
to maintain some notion of message ordering despite being eventually
consistent. Every message sent by Serf contains a Lamport clock time.
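A Lamport clock is simply a monotonically increasing counter that is incremented when a message is sent and fast-forwarded whenever a larger value is observed on a received message. A minimal sketch of the general technique (not Serf's exact implementation) follows:

```go
package main

import "fmt"

// lamportClock is a minimal Lamport clock: a counter that is incremented
// for each message sent and pushed past any larger time observed on a
// received message.
type lamportClock struct {
	time uint64
}

// Increment is called when sending a message; the returned value is
// attached to the outgoing message.
func (c *lamportClock) Increment() uint64 {
	c.time++
	return c.time
}

// Witness is called with the clock value carried by a received message,
// ensuring the local clock never falls behind anything it has seen.
func (c *lamportClock) Witness(v uint64) {
	if v >= c.time {
		c.time = v + 1
	}
}

func main() {
	var c lamportClock
	msgTime := c.Increment() // attach to an outgoing message
	c.Witness(41)            // saw a message stamped with time 41
	fmt.Println(msgTime, c.Increment())
}
```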

When a node gracefully leaves the cluster, Serf sends a _leave intent_ through
the gossip layer. The underlying gossip layer makes no differentiation
between a node leaving the cluster and a node being detected as failed, so
the leave intent is what allows the higher level Serf layer to distinguish
a graceful leave from a failure.

When a node joins the cluster, Serf sends a _join intent_. The purpose
of this intent is solely to attach a Lamport clock time to a join so that
it can be ordered properly in case a leave comes out of order.

For custom events and queries, Serf sends either a _user event_ or a
_user query_ message. This message contains a Lamport time, event name, and event payload.
Because user events are sent along the gossip layer, which uses UDP,
the payload and entire message framing must fit within a single UDP packet.

The maximum payload size is controlled by the `UserEventSizeLimit`
configuration option, with a hard upper limit of `9KB`. It is up to the user
to make sure that events of that size can actually traverse the network
path between nodes, given their MTU and any other packet-size constraints.
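For reference, firing a custom event from the Serf Go library looks roughly like this; the event name and payload are placeholders and error handling is trimmed to the essentials:

```go
package main

import (
	"log"

	"github.com/hashicorp/serf/serf"
)

func main() {
	s, err := serf.Create(serf.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	defer s.Shutdown()

	// Send a named user event with a small payload. The name plus payload
	// must stay small enough to fit, with framing, in a single UDP packet;
	// oversized events are rejected with an error. The final argument
	// allows Serf to coalesce identical events delivered close together.
	if err := s.UserEvent("deploy", []byte("v1.2.3"), true); err != nil {
		log.Fatalf("event rejected: %v", err)
	}
}
```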