---
title: High-availability for store instances
type: proposal
menu: proposals
status: rejected
owner: mattbostock
---

## Summary

This proposal makes sense and meets our goals when gossip is used. However, there is a much simpler solution
to this problem: configure a single static entry that points at any load balancer, such as a Kubernetes Service,
which balances requests across the Store Gateways. Store Gateways are technically stateless, so each request can fetch its data independently.

## Motivation

Thanos store instances currently have no explicit support for
high-availability; query instances treat all store instances equally. If
multiple store instances are used as gateways to a single bucket in an object
store, Thanos query instances will wait for all instances to respond (subject
to timeouts) before returning a response.

## Goals

- Explicitly support and document high availability for store instances.

- Reduce the query latency incurred by failing store instances when other store
  instances could return the same response faster.

## Proposal

Thanos supports deduplication of metrics retrieved from multiple Prometheus
servers to avoid gaps in query responses where a single Prometheus server
failed but similar data was recorded by another Prometheus server in the same
failure domain. To support deduplication, Thanos must wait for all Thanos
sidecar servers to return their data (subject to timeouts) before returning a
response to a client.

When retrieving data from Thanos bucket store instances, however, the desired
behaviour is different; we want Thanos to use the first successful response it
receives, on the assumption that all bucket store instances that communicate
with the same bucket have access to the same data.

To support the desired behaviour for bucket store instances while still
allowing for deduplication, we propose to expand the [InfoResponse
Protobuf](https://github.com/thanos-io/thanos/blob/b67aa3a709062be97215045f7488df67a9af2c66/pkg/store/storepb/rpc.proto#L28-L32)
used by the Store API by adding two fields:

- a string identifier that can be used to group store instances

- an enum representing the [peer type as defined in the cluster
  package](https://github.com/thanos-io/thanos/blob/673614d9310f3f90fdb4585ca6201496ff92c697/pkg/cluster/cluster.go#L51-L64)

For example:

```diff
--- before	2018-07-02 15:49:09.000000000 +0100
+++ after	2018-07-02 15:49:13.000000000 +0100
@@ -1,5 +1,12 @@
 message InfoResponse {
   repeated Label labels = 1 [(gogoproto.nullable) = false];
   int64 min_time        = 2;
   int64 max_time        = 3;
+  string store_group_id = 4;
+  enum PeerType {
+    STORE  = 0;
+    SOURCE = 1;
+    QUERY  = 2;
+  }
+  PeerType peer_type    = 5;
 }
```

For the purpose of querying data from store instances, store instances will be
grouped by:

- labels, as returned as part of `InfoResponse`
- the new `store_group_id` string identifier

Therefore, stores having identical sets of labels and identical values for
`store_group_id` will belong in the same group for the purpose of querying
data. Stores having an empty `store_group_id` field and matching labels will be
considered to be part of the same group. Stores having an empty
`store_group_id` field and empty label sets will also be considered part of the
same group.
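
The grouping rule above can be sketched in Go as follows. This is only an
illustration under stated assumptions: `StoreInfo`, `groupKey` and
`groupStores` are hypothetical names that do not exist in Thanos, the addresses
and the `store_group_id` value are made up, and the key format is an arbitrary
choice.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Label mirrors the name/value pairs returned in InfoResponse.labels.
type Label struct {
	Name, Value string
}

// StoreInfo is a hypothetical, simplified view of one store's InfoResponse.
type StoreInfo struct {
	Addr         string
	Labels       []Label
	StoreGroupID string
}

// groupKey builds a deterministic key from store_group_id and the label set.
// Two stores share a group only if both parts match, including when both
// parts are empty.
func groupKey(s StoreInfo) string {
	lbls := make([]string, 0, len(s.Labels))
	for _, l := range s.Labels {
		lbls = append(lbls, fmt.Sprintf("%q=%q", l.Name, l.Value))
	}
	sort.Strings(lbls) // label order must not affect group membership
	return fmt.Sprintf("%q|%s", s.StoreGroupID, strings.Join(lbls, ","))
}

// groupStores buckets store addresses by their group key.
func groupStores(stores []StoreInfo) map[string][]string {
	groups := make(map[string][]string)
	for _, s := range stores {
		k := groupKey(s)
		groups[k] = append(groups[k], s.Addr)
	}
	return groups
}

func main() {
	stores := []StoreInfo{
		// Assumed example values; the ID format is not prescribed here.
		{Addr: "store-a:10901", StoreGroupID: "s3.example.com/metrics"},
		{Addr: "store-b:10901", StoreGroupID: "s3.example.com/metrics"},
		{Addr: "sidecar-a:10901", Labels: []Label{{"replica", "a"}}},
	}
	for key, addrs := range groupStores(stores) {
		fmt.Println(key, addrs)
	}
}
```

In this sketch the two bucket stores land in one group, while the sidecar,
which has a different label set and an empty `store_group_id`, forms its own.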

If a service implementing the store API (a 'store instance') has a `STORE` or
`QUERY` peer type, query instances will treat each store instance in the same
group as having access to the same data. Query instances will randomly pick any
two store instances[1][] from the same group and use the first response
returned.

[1]: https://www.eecs.harvard.edu/~michaelm/postscripts/mythesis.pdf
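
A minimal Go sketch of this "pick two, use the first response" behaviour is
shown below. The `fetchFunc` type, `firstOfTwo`, and the `demo` helper with its
simulated stores are assumptions made for illustration; they are not part of
the Thanos code base, and a real implementation would issue Store API calls
instead. The sketch also assumes that "first response" means the first
successful one, as described earlier in this proposal.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// fetchFunc is a hypothetical stand-in for a Store API call to one instance.
type fetchFunc func(ctx context.Context, addr string) (string, error)

// firstOfTwo queries up to two randomly chosen instances from one group and
// returns the first successful response, cancelling the slower request.
func firstOfTwo(ctx context.Context, group []string, fetch fetchFunc) (string, error) {
	if len(group) == 0 {
		return "", errors.New("empty store group")
	}
	// Copy and shuffle so the caller's slice is left untouched.
	picked := append([]string(nil), group...)
	rand.Shuffle(len(picked), func(i, j int) { picked[i], picked[j] = picked[j], picked[i] })
	if len(picked) > 2 {
		picked = picked[:2]
	}

	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // stops the request that lost the race

	type result struct {
		data string
		err  error
	}
	// Buffered so that the losing goroutine can always send and exit.
	results := make(chan result, len(picked))
	for _, addr := range picked {
		go func(addr string) {
			data, err := fetch(ctx, addr)
			results <- result{data, err}
		}(addr)
	}

	var lastErr error
	for range picked {
		r := <-results
		if r.err == nil {
			return r.data, nil
		}
		lastErr = r.err
	}
	return "", fmt.Errorf("all picked stores failed: %w", lastErr)
}

// demo runs firstOfTwo against simulated stores; store-b always fails.
func demo(group []string) (string, error) {
	fetch := func(ctx context.Context, addr string) (string, error) {
		if addr == "store-b:10901" {
			return "", errors.New("connection refused") // simulated failure
		}
		select {
		case <-time.After(10 * time.Millisecond):
			return "series from " + addr, nil
		case <-ctx.Done():
			return "", ctx.Err()
		}
	}
	return firstOfTwo(context.Background(), group, fetch)
}

func main() {
	fmt.Println(demo([]string{"store-a:10901", "store-b:10901", "store-c:10901"}))
}
```

Because only two instances are contacted per query, a failing instance costs at
most one wasted request, and the query still succeeds as long as the other
picked instance responds.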

Otherwise, for the `SOURCE` peer type, query instances will wait for all
instances within the same group to respond (subject to existing timeouts)
before returning a response, consistent with the current behaviour. This is
necessary to collect all data available for the purposes of deduplication and
to fill gaps in data where an individual Prometheus server failed to ingest
data for a period of time.
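
For contrast, the `SOURCE` behaviour can be sketched as a fan-out that waits
for every instance in the group, bounded by a timeout. As before, `fetchFunc`,
`fetchAll` and `demo` are hypothetical names, the sample strings are invented,
and the sketch assumes that each fetch honours context cancellation.
Deduplication itself is left out.

```go
package main

import (
	"context"
	"fmt"
	"sort"
	"sync"
	"time"
)

// fetchFunc is a hypothetical stand-in for a Store API call to one instance.
type fetchFunc func(ctx context.Context, addr string) ([]string, error)

// fetchAll queries every instance in the group and merges whatever arrives
// before the timeout. It relies on fetch returning once ctx is done.
func fetchAll(ctx context.Context, group []string, timeout time.Duration, fetch fetchFunc) []string {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	var (
		mu     sync.Mutex
		wg     sync.WaitGroup
		merged []string
	)
	for _, addr := range group {
		wg.Add(1)
		go func(addr string) {
			defer wg.Done()
			series, err := fetch(ctx, addr)
			if err != nil {
				return // a failed or timed-out instance contributes nothing
			}
			mu.Lock()
			merged = append(merged, series...)
			mu.Unlock()
		}(addr)
	}
	wg.Wait()
	sort.Strings(merged) // stable order for the caller
	return merged
}

// demo simulates two healthy sidecars and one that never answers.
func demo() []string {
	fetch := func(ctx context.Context, addr string) ([]string, error) {
		switch addr {
		case "sidecar-a:10901":
			return []string{"sample@t1", "sample@t3"}, nil
		case "sidecar-b:10901":
			return []string{"sample@t2"}, nil
		default:
			<-ctx.Done() // hangs until the timeout fires
			return nil, ctx.Err()
		}
	}
	group := []string{"sidecar-a:10901", "sidecar-b:10901", "sidecar-c:10901"}
	return fetchAll(context.Background(), group, 50*time.Millisecond, fetch)
}

func main() {
	fmt.Println(demo())
}
```

Here the gap at `t2` in the first sidecar's data is filled by the second
sidecar, which is why the query cannot stop at the first response.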

Each service implementing the store API must determine what value the
`store_group_id` should return. For bucket stores, `store_group_id` should
contain the concatenation of the object store URL and bucket name. For all
other existing services implementing the store API, we will use an empty string
for `store_group_id` until a reason exists to use it.

Multiple buckets or object stores will be supported by setting the
`store_group_id`.

Existing instances running older versions of Thanos will be assumed to have
an empty string for `store_group_id` and a `SOURCE` peer type, which will
retain existing behaviour when awaiting responses.

### Scope

Horizontal scaling should be handled separately and is out of scope for this
proposal.

## User experience

From a user's point of view, query responses should be faster and more reliable:

- Running multiple bucket store instances will allow the query to be served even
  if a single store instance fails.

- Query latency should be lower since the response will be served from the
  first bucket store instance to reply.

The user experience for query responses involving only Thanos sidecars will be
unaffected.

## Alternatives considered

### Implicitly relying on store labels

Rather than expanding the `InfoResponse` Protobuf, we had originally considered
relying on an empty set of store labels to determine that a store instance was
acting as a gateway.

We decided against this approach as it would make debugging harder due to its
implicit nature, and would be likely to cause bugs in future.

### Using boolean fields to determine query behaviour

We rejected the idea of adding a `gateway` or `deduplicated` boolean field to
`InfoResponse` in the store RPC API. The value of these fields would have had
the same effect on query behaviour as returning the peer type field as proposed
above and would have been more explicit, but the fields were specific to this
use case.

The peer type field in `InfoResponse` proposed above could be used for other
use cases aside from determining query behaviour.

## Related future work

### Sharing data between store instances

Thanos bucket stores download index and metadata from the object store on
start-up. If multiple instances of a bucket store are used to provide high
availability, each instance will download the same files for its own use. These
file sizes can be in the order of gigabytes.

Ideally, the overhead of each store instance downloading its own data would be
avoided. We decided that it would be more appropriate to tackle sharing data as
part of future work to support the horizontal scaling of store instances.