---
title: High-availability for store instances
type: proposal
menu: proposals
status: rejected
owner: mattbostock
---

## Summary

This proposal makes sense and meets our goals when gossip is used. However,
there is a much simpler solution to the problem: configure a single static
entry that points at a load balancer, such as a Kubernetes Service, and let it
balance requests across the Store Gateways. Store Gateways are technically
stateless, so each request can fetch its data independently.

## Motivation

Thanos store instances currently have no explicit support for
high-availability; query instances treat all store instances equally. If
multiple store instances are used as gateways to a single bucket in an object
store, Thanos query instances will wait for all instances to respond (subject
to timeouts) before returning a response.

## Goals

- Explicitly support and document high availability for store instances.

- Reduce the query latency incurred by failing store instances when other store
  instances could return the same response faster.

## Proposal

Thanos supports deduplication of metrics retrieved from multiple Prometheus
servers to avoid gaps in query responses where a single Prometheus server
failed but similar data was recorded by another Prometheus server in the same
failure domain. To support deduplication, Thanos must wait for all Thanos
sidecar servers to return their data (subject to timeouts) before returning a
response to a client.

When retrieving data from Thanos bucket store instances, however, the desired
behaviour is different; we want Thanos to use the first successful response it
receives, on the assumption that all bucket store instances that communicate
with the same bucket have access to the same data.

To support the desired behaviour for bucket store instances while still
allowing for deduplication, we propose to expand the [InfoResponse
Protobuf](https://github.com/thanos-io/thanos/blob/b67aa3a709062be97215045f7488df67a9af2c66/pkg/store/storepb/rpc.proto#L28-L32)
used by the Store API by adding two fields:

- a string identifier that can be used to group store instances

- an enum representing the [peer type as defined in the cluster
  package](https://github.com/thanos-io/thanos/blob/673614d9310f3f90fdb4585ca6201496ff92c697/pkg/cluster/cluster.go#L51-L64)

For example:

```diff
--- before 2018-07-02 15:49:09.000000000 +0100
+++ after 2018-07-02 15:49:13.000000000 +0100
@@ -1,5 +1,12 @@
 message InfoResponse {
   repeated Label labels = 1 [(gogoproto.nullable) = false];
   int64 min_time = 2;
   int64 max_time = 3;
+  string store_group_id = 4;
+  enum PeerType {
+    STORE = 0;
+    SOURCE = 1;
+    QUERY = 2;
+  }
+  PeerType peer_type = 5;
 }
```

For the purpose of querying data from store instances, store instances will be
grouped by:

- labels, as returned as part of `InfoResponse`
- the new `store_group_id` string identifier

Therefore, stores having identical sets of labels and identical values for
`store_group_id` will belong in the same group for the purpose of querying
data. Stores having an empty `store_group_id` field and matching labels will be
considered to be part of the same group. Stores having an empty
`store_group_id` field and empty label sets will also be considered part of the
same group.

If a service implementing the store API (a 'store instance') has a `STORE` or
`QUERY` peer type, query instances will treat each store instance in the same
group as having access to the same data. Query instances will randomly pick any
two store instances[1][] from the same group and use the first response
returned.

[1]: https://www.eecs.harvard.edu/~michaelm/postscripts/mythesis.pdf

Otherwise, for the `SOURCE` peer type, query instances will wait for all
instances within the same group to respond (subject to existing timeouts)
before returning a response, consistent with the current behaviour. This is
necessary to collect all data available for the purposes of deduplication and
to fill gaps in data where an individual Prometheus server failed to ingest
data for a period of time.

Each service implementing the store API must determine what value the
`store_group_id` should return. For bucket stores, `store_group_id` should
contain the concatenation of the object store URL and bucket name. For all
other existing services implementing the store API, we will use an empty string
for `store_group_id` until a reason exists to use it.

Multiple buckets or object stores will be supported by setting the
`store_group_id`.

Existing instances running older versions of Thanos will be assumed to have
an empty string for `store_group_id` and a `SOURCE` peer type, which will
retain existing behaviour when awaiting responses.

### Scope

Horizontal scaling should be handled separately and is out of scope for this
proposal.

## User experience

From a user's point of view, query responses should be faster and more reliable:

- Running multiple bucket store instances will allow the query to be served even
  if a single store instance fails.

- Query latency should be lower since the response will be served from the
  first bucket store instance to reply.

The user experience for query responses involving only Thanos sidecars will be
unaffected.

## Alternatives considered

### Implicitly relying on store labels

Rather than expanding the `InfoResponse` Protobuf, we had originally considered
relying on an empty set of store labels to determine that a store instance was
acting as a gateway.

We decided against this approach as it would make debugging harder due to its
implicit nature, and is likely to cause bugs in future.

### Using boolean fields to determine query behaviour

We rejected the idea of adding a `gateway` or `deduplicated` boolean field to
`InfoResponse` in the store RPC API. These fields would have had the same
effect on query behaviour as the peer type field proposed above and would have
been more explicit, but they would have been specific to this use case.

The peer type field in `InfoResponse` proposed above could be used for other
use cases aside from determining query behaviour.

## Related future work

### Sharing data between store instances

Thanos bucket stores download index and metadata from the object store on
start-up. If multiple instances of a bucket store are used to provide high
availability, each instance will download the same files for its own use. These
file sizes can be in the order of gigabytes.

Ideally, the overhead of each store instance downloading its own data would be
avoided. We decided that it would be more appropriate to tackle sharing data as
part of future work to support the horizontal scaling of store instances.