# Alerts

Here are some example alerts configured for a Kubernetes environment.

## Compaction

[embedmd]:# (../tmp/thanos-compact.yaml yaml)
```yaml
name: thanos-compact
rules:
- alert: ThanosCompactMultipleRunning
  annotations:
    description: No more than one Thanos Compact instance should be running at once.
      There are {{$value}} instances running.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanoscompactmultiplerunning
    summary: Thanos Compact has multiple instances running.
  expr: sum by (job) (up{job=~".*thanos-compact.*"}) > 1
  for: 5m
  labels:
    severity: warning
- alert: ThanosCompactHalted
  annotations:
    description: Thanos Compact {{$labels.job}} has failed to run and now is halted.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanoscompacthalted
    summary: Thanos Compact has failed to run and is now halted.
  expr: thanos_compact_halted{job=~".*thanos-compact.*"} == 1
  for: 5m
  labels:
    severity: warning
- alert: ThanosCompactHighCompactionFailures
  annotations:
    description: Thanos Compact {{$labels.job}} is failing to execute {{$value | humanize}}%
      of compactions.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanoscompacthighcompactionfailures
    summary: Thanos Compact is failing to execute compactions.
  expr: |
    (
      sum by (job) (rate(thanos_compact_group_compactions_failures_total{job=~".*thanos-compact.*"}[5m]))
    /
      sum by (job) (rate(thanos_compact_group_compactions_total{job=~".*thanos-compact.*"}[5m]))
    * 100 > 5
    )
  for: 15m
  labels:
    severity: warning
- alert: ThanosCompactBucketHighOperationFailures
  annotations:
    description: Thanos Compact {{$labels.job}} Bucket is failing to execute {{$value
      | humanize}}% of operations.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanoscompactbuckethighoperationfailures
    summary: Thanos Compact Bucket is having a high number of operation failures.
  expr: |
    (
      sum by (job) (rate(thanos_objstore_bucket_operation_failures_total{job=~".*thanos-compact.*"}[5m]))
    /
      sum by (job) (rate(thanos_objstore_bucket_operations_total{job=~".*thanos-compact.*"}[5m]))
    * 100 > 5
    )
  for: 15m
  labels:
    severity: warning
- alert: ThanosCompactHasNotRun
  annotations:
    description: Thanos Compact {{$labels.job}} has not uploaded anything for 24 hours.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanoscompacthasnotrun
    summary: Thanos Compact has not uploaded anything for the last 24 hours.
  expr: (time() - max by (job) (max_over_time(thanos_objstore_bucket_last_successful_upload_time{job=~".*thanos-compact.*"}[24h])))
    / 60 / 60 > 24
  labels:
    severity: warning
```
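
Each block on this page is a single rule group (a `name` plus its `rules`). Prometheus rule files expect such groups nested under a top-level `groups` key, so a group needs that wrapper before it can be loaded. A minimal sketch, assuming a hypothetical file name `thanos-compact.rules.yaml` and showing only the `ThanosCompactHalted` rule with its annotations left out for brevity:

```yaml
# thanos-compact.rules.yaml (hypothetical file name)
groups:
- name: thanos-compact
  rules:
  - alert: ThanosCompactHalted
    # Fires once the compactor has reported a halted state for 5 minutes.
    expr: thanos_compact_halted{job=~".*thanos-compact.*"} == 1
    for: 5m
    labels:
      severity: warning
```

A file in this shape can be checked with `promtool check rules thanos-compact.rules.yaml` before it is deployed.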

## Ruler

For Thanos Ruler, we run some alerts in a local Prometheus to make sure that Thanos Ruler is working:

[embedmd]:# (../tmp/thanos-rule.yaml yaml)
```yaml
name: thanos-rule
rules:
- alert: ThanosRuleQueueIsDroppingAlerts
  annotations:
    description: Thanos Rule {{$labels.instance}} is failing to queue alerts.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosrulequeueisdroppingalerts
    summary: Thanos Rule is failing to queue alerts.
  expr: |
    sum by (job, instance) (rate(thanos_alert_queue_alerts_dropped_total{job=~".*thanos-rule.*"}[5m])) > 0
  for: 5m
  labels:
    severity: critical
- alert: ThanosRuleSenderIsFailingAlerts
  annotations:
    description: Thanos Rule {{$labels.instance}} is failing to send alerts to alertmanager.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosrulesenderisfailingalerts
    summary: Thanos Rule is failing to send alerts to alertmanager.
  expr: |
    sum by (job, instance) (rate(thanos_alert_sender_alerts_dropped_total{job=~".*thanos-rule.*"}[5m])) > 0
  for: 5m
  labels:
    severity: critical
- alert: ThanosRuleHighRuleEvaluationFailures
  annotations:
    description: Thanos Rule {{$labels.instance}} is failing to evaluate rules.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosrulehighruleevaluationfailures
    summary: Thanos Rule is failing to evaluate rules.
  expr: |
    (
      sum by (job, instance) (rate(prometheus_rule_evaluation_failures_total{job=~".*thanos-rule.*"}[5m]))
    /
      sum by (job, instance) (rate(prometheus_rule_evaluations_total{job=~".*thanos-rule.*"}[5m]))
    * 100 > 5
    )
  for: 5m
  labels:
    severity: critical
- alert: ThanosRuleHighRuleEvaluationWarnings
  annotations:
    description: Thanos Rule {{$labels.instance}} has high number of evaluation warnings.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosrulehighruleevaluationwarnings
    summary: Thanos Rule has high number of evaluation warnings.
  expr: |
    sum by (job, instance) (rate(thanos_rule_evaluation_with_warnings_total{job=~".*thanos-rule.*"}[5m])) > 0
  for: 15m
  labels:
    severity: info
- alert: ThanosRuleRuleEvaluationLatencyHigh
  annotations:
    description: Thanos Rule {{$labels.instance}} has higher evaluation latency than
      interval for {{$labels.rule_group}}.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosruleruleevaluationlatencyhigh
    summary: Thanos Rule has high rule evaluation latency.
  expr: |
    (
      sum by (job, instance, rule_group) (prometheus_rule_group_last_duration_seconds{job=~".*thanos-rule.*"})
    >
      sum by (job, instance, rule_group) (prometheus_rule_group_interval_seconds{job=~".*thanos-rule.*"})
    )
  for: 5m
  labels:
    severity: warning
- alert: ThanosRuleGrpcErrorRate
  annotations:
    description: Thanos Rule {{$labels.job}} is failing to handle {{$value | humanize}}%
      of requests.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosrulegrpcerrorrate
    summary: Thanos Rule is failing to handle gRPC requests.
  expr: |
    (
      sum by (job, instance) (rate(grpc_server_handled_total{grpc_code=~"Unknown|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded", job=~".*thanos-rule.*"}[5m]))
    /
      sum by (job, instance) (rate(grpc_server_started_total{job=~".*thanos-rule.*"}[5m]))
    * 100 > 5
    )
  for: 5m
  labels:
    severity: warning
- alert: ThanosRuleConfigReloadFailure
  annotations:
    description: Thanos Rule {{$labels.job}} has not been able to reload its configuration.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosruleconfigreloadfailure
    summary: Thanos Rule has not been able to reload configuration.
  expr: avg by (job, instance) (thanos_rule_config_last_reload_successful{job=~".*thanos-rule.*"})
    != 1
  for: 5m
  labels:
    severity: info
- alert: ThanosRuleQueryHighDNSFailures
  annotations:
    description: Thanos Rule {{$labels.job}} has {{$value | humanize}}% of failing
      DNS queries for query endpoints.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosrulequeryhighdnsfailures
    summary: Thanos Rule is having high number of DNS failures.
  expr: |
    (
      sum by (job, instance) (rate(thanos_rule_query_apis_dns_failures_total{job=~".*thanos-rule.*"}[5m]))
    /
      sum by (job, instance) (rate(thanos_rule_query_apis_dns_lookups_total{job=~".*thanos-rule.*"}[5m]))
    * 100 > 1
    )
  for: 15m
  labels:
    severity: warning
- alert: ThanosRuleAlertmanagerHighDNSFailures
  annotations:
    description: Thanos Rule {{$labels.instance}} has {{$value | humanize}}% of failing
      DNS queries for Alertmanager endpoints.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosrulealertmanagerhighdnsfailures
    summary: Thanos Rule is having high number of DNS failures.
  expr: |
    (
      sum by (job, instance) (rate(thanos_rule_alertmanagers_dns_failures_total{job=~".*thanos-rule.*"}[5m]))
    /
      sum by (job, instance) (rate(thanos_rule_alertmanagers_dns_lookups_total{job=~".*thanos-rule.*"}[5m]))
    * 100 > 1
    )
  for: 15m
  labels:
    severity: warning
- alert: ThanosRuleNoEvaluationFor10Intervals
  annotations:
    description: Thanos Rule {{$labels.job}} has {{$value | humanize}}% rule groups
      that did not evaluate for at least 10x of their expected interval.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosrulenoevaluationfor10intervals
    summary: Thanos Rule has rule groups that did not evaluate for 10 intervals.
  expr: |
    time() -  max by (job, instance, group) (prometheus_rule_group_last_evaluation_timestamp_seconds{job=~".*thanos-rule.*"})
    >
    10 * max by (job, instance, group) (prometheus_rule_group_interval_seconds{job=~".*thanos-rule.*"})
  for: 5m
  labels:
    severity: info
- alert: ThanosNoRuleEvaluations
  annotations:
    description: Thanos Rule {{$labels.instance}} did not perform any rule evaluations
      in the past 10 minutes.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosnoruleevaluations
    summary: Thanos Rule did not perform any rule evaluations.
  expr: |
    sum by (job, instance) (rate(prometheus_rule_evaluations_total{job=~".*thanos-rule.*"}[5m])) <= 0
      and
    sum by (job, instance) (thanos_rule_loaded_rules{job=~".*thanos-rule.*"}) > 0
  for: 5m
  labels:
    severity: critical
```
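
The rules above select series with `job=~".*thanos-rule.*"`, so the local Prometheus has to scrape Thanos Ruler under a matching job name and load the rule file. A sketch of the relevant `prometheus.yml` fragment follows; the rule file path, the target addresses, and the job name are assumptions for illustration (10902 is the default Thanos HTTP port):

```yaml
# prometheus.yml (fragment); paths and addresses are hypothetical
rule_files:
- /etc/prometheus/rules/thanos-rule.rules.yaml

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - alertmanager.monitoring.svc:9093

scrape_configs:
# The job name must match the `.*thanos-rule.*` regex used in the alert expressions.
- job_name: thanos-rule
  static_configs:
  - targets:
    - thanos-rule.monitoring.svc:10902
```

If the job name does not match the regex, the expressions return no series and the alerts stay silent, so it is worth confirming the `job` label on the scraped metrics.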

## Store Gateway

[embedmd]:# (../tmp/thanos-store.yaml yaml)
```yaml
name: thanos-store
rules:
- alert: ThanosStoreGrpcErrorRate
  annotations:
    description: Thanos Store {{$labels.job}} is failing to handle {{$value | humanize}}%
      of requests.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosstoregrpcerrorrate
    summary: Thanos Store is failing to handle gRPC requests.
  expr: |
    (
      sum by (job) (rate(grpc_server_handled_total{grpc_code=~"Unknown|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded", job=~".*thanos-store.*"}[5m]))
    /
      sum by (job) (rate(grpc_server_started_total{job=~".*thanos-store.*"}[5m]))
    * 100 > 5
    )
  for: 5m
  labels:
    severity: warning
- alert: ThanosStoreSeriesGateLatencyHigh
  annotations:
    description: Thanos Store {{$labels.job}} has a 99th percentile latency of {{$value}}
      seconds for store series gate requests.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosstoreseriesgatelatencyhigh
    summary: Thanos Store has high latency for store series gate requests.
  expr: |
    (
      histogram_quantile(0.99, sum by (job, le) (rate(thanos_bucket_store_series_gate_duration_seconds_bucket{job=~".*thanos-store.*"}[5m]))) > 2
    and
      sum by (job) (rate(thanos_bucket_store_series_gate_duration_seconds_count{job=~".*thanos-store.*"}[5m])) > 0
    )
  for: 10m
  labels:
    severity: warning
- alert: ThanosStoreBucketHighOperationFailures
  annotations:
    description: Thanos Store {{$labels.job}} Bucket is failing to execute {{$value
      | humanize}}% of operations.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosstorebuckethighoperationfailures
    summary: Thanos Store Bucket is failing to execute operations.
  expr: |
    (
      sum by (job) (rate(thanos_objstore_bucket_operation_failures_total{job=~".*thanos-store.*"}[5m]))
    /
      sum by (job) (rate(thanos_objstore_bucket_operations_total{job=~".*thanos-store.*"}[5m]))
    * 100 > 5
    )
  for: 15m
  labels:
    severity: warning
- alert: ThanosStoreObjstoreOperationLatencyHigh
  annotations:
    description: Thanos Store {{$labels.job}} Bucket has a 99th percentile latency
      of {{$value}} seconds for the bucket operations.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosstoreobjstoreoperationlatencyhigh
    summary: Thanos Store is having high latency for bucket operations.
  expr: |
    (
      histogram_quantile(0.99, sum by (job, le) (rate(thanos_objstore_bucket_operation_duration_seconds_bucket{job=~".*thanos-store.*"}[5m]))) > 2
    and
      sum by (job) (rate(thanos_objstore_bucket_operation_duration_seconds_count{job=~".*thanos-store.*"}[5m])) > 0
    )
  for: 10m
  labels:
    severity: warning
```

## Sidecar

[embedmd]:# (../tmp/thanos-sidecar.yaml yaml)
```yaml
name: thanos-sidecar
rules:
- alert: ThanosSidecarPrometheusDown
  annotations:
    description: Thanos Sidecar {{$labels.instance}} cannot connect to Prometheus.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarprometheusdown
    summary: Thanos Sidecar cannot connect to Prometheus
  expr: |
    thanos_sidecar_prometheus_up{job=~".*thanos-sidecar.*"} == 0
  for: 5m
  labels:
    severity: critical
- alert: ThanosSidecarBucketOperationsFailed
  annotations:
    description: Thanos Sidecar {{$labels.instance}} bucket operations are failing
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarbucketoperationsfailed
    summary: Thanos Sidecar bucket operations are failing
  expr: |
    sum by (job, instance) (rate(thanos_objstore_bucket_operation_failures_total{job=~".*thanos-sidecar.*"}[5m])) > 0
  for: 5m
  labels:
    severity: critical
- alert: ThanosSidecarUnhealthy
  annotations:
    description: Thanos Sidecar {{$labels.instance}} is unhealthy for more than {{$value}}
      seconds.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarunhealthy
    summary: Thanos Sidecar is unhealthy.
  expr: |
    time() - max by (job, instance) (timestamp(thanos_sidecar_last_heartbeat_success_time_seconds{job=~".*thanos-sidecar.*"})) >= 240
  labels:
    severity: critical
```
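
Alert expressions like these can be exercised offline with `promtool test rules`. The sketch below checks that `ThanosSidecarPrometheusDown` fires once `thanos_sidecar_prometheus_up` has been 0 for longer than the 5m `for` clause. It assumes the group above has been saved under a top-level `groups` key in a hypothetical file `thanos-sidecar.rules.yaml`; the test file name and the `job` and `instance` values are made up for the test.

```yaml
# thanos-sidecar.test.yaml (hypothetical file name)
rule_files:
- thanos-sidecar.rules.yaml

evaluation_interval: 1m

tests:
- interval: 1m
  input_series:
  # Eleven samples of 0 at 1m spacing: the sidecar never reaches Prometheus.
  - series: 'thanos_sidecar_prometheus_up{job="thanos-sidecar", instance="sidecar-0"}'
    values: '0 0 0 0 0 0 0 0 0 0 0'
  alert_rule_test:
  - eval_time: 10m
    alertname: ThanosSidecarPrometheusDown
    exp_alerts:
    - exp_labels:
        severity: critical
        job: thanos-sidecar
        instance: sidecar-0
      # Annotations are compared too, so they are spelled out as rendered.
      exp_annotations:
        description: Thanos Sidecar sidecar-0 cannot connect to Prometheus.
        runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarprometheusdown
        summary: Thanos Sidecar cannot connect to Prometheus
```

Running `promtool test rules thanos-sidecar.test.yaml` evaluates the rule against the synthetic series and reports any mismatch between expected and actual alerts.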

## Query

[embedmd]:# (../tmp/thanos-query.yaml yaml)
```yaml
name: thanos-query
rules:
- alert: ThanosQueryHttpRequestQueryErrorRateHigh
  annotations:
    description: Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}%
      of "query" requests.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosqueryhttprequestqueryerrorratehigh
    summary: Thanos Query is failing to handle requests.
  expr: |
    (
      sum by (job) (rate(http_requests_total{code=~"5..", job=~".*thanos-query.*", handler="query"}[5m]))
    /
      sum by (job) (rate(http_requests_total{job=~".*thanos-query.*", handler="query"}[5m]))
    ) * 100 > 5
  for: 5m
  labels:
    severity: critical
- alert: ThanosQueryHttpRequestQueryRangeErrorRateHigh
  annotations:
    description: Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}%
      of "query_range" requests.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosqueryhttprequestqueryrangeerrorratehigh
    summary: Thanos Query is failing to handle requests.
  expr: |
    (
      sum by (job) (rate(http_requests_total{code=~"5..", job=~".*thanos-query.*", handler="query_range"}[5m]))
    /
      sum by (job) (rate(http_requests_total{job=~".*thanos-query.*", handler="query_range"}[5m]))
    ) * 100 > 5
  for: 5m
  labels:
    severity: critical
- alert: ThanosQueryGrpcServerErrorRate
  annotations:
    description: Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}%
      of requests.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosquerygrpcservererrorrate
    summary: Thanos Query is failing to handle requests.
  expr: |
    (
      sum by (job) (rate(grpc_server_handled_total{grpc_code=~"Unknown|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded", job=~".*thanos-query.*"}[5m]))
    /
      sum by (job) (rate(grpc_server_started_total{job=~".*thanos-query.*"}[5m]))
    * 100 > 5
    )
  for: 5m
  labels:
    severity: warning
- alert: ThanosQueryGrpcClientErrorRate
  annotations:
    description: Thanos Query {{$labels.job}} is failing to send {{$value | humanize}}%
      of requests.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosquerygrpcclienterrorrate
    summary: Thanos Query is failing to send requests.
  expr: |
    (
      sum by (job) (rate(grpc_client_handled_total{grpc_code!="OK", job=~".*thanos-query.*"}[5m]))
    /
      sum by (job) (rate(grpc_client_started_total{job=~".*thanos-query.*"}[5m]))
    ) * 100 > 5
  for: 5m
  labels:
    severity: warning
- alert: ThanosQueryHighDNSFailures
  annotations:
    description: Thanos Query {{$labels.job}} has {{$value | humanize}}% of failing
      DNS queries for store endpoints.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosqueryhighdnsfailures
    summary: Thanos Query is having high number of DNS failures.
  expr: |
    (
      sum by (job) (rate(thanos_query_store_apis_dns_failures_total{job=~".*thanos-query.*"}[5m]))
    /
      sum by (job) (rate(thanos_query_store_apis_dns_lookups_total{job=~".*thanos-query.*"}[5m]))
    ) * 100 > 1
  for: 15m
  labels:
    severity: warning
- alert: ThanosQueryInstantLatencyHigh
  annotations:
    description: Thanos Query {{$labels.job}} has a 99th percentile latency of {{$value}}
      seconds for instant queries.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosqueryinstantlatencyhigh
    summary: Thanos Query has high latency for queries.
  expr: |
    (
      histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket{job=~".*thanos-query.*", handler="query"}[5m]))) > 40
    and
      sum by (job) (rate(http_request_duration_seconds_bucket{job=~".*thanos-query.*", handler="query"}[5m])) > 0
    )
  for: 10m
  labels:
    severity: critical
- alert: ThanosQueryRangeLatencyHigh
  annotations:
    description: Thanos Query {{$labels.job}} has a 99th percentile latency of {{$value}}
      seconds for range queries.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosqueryrangelatencyhigh
    summary: Thanos Query has high latency for queries.
  expr: |
    (
      histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket{job=~".*thanos-query.*", handler="query_range"}[5m]))) > 90
    and
      sum by (job) (rate(http_request_duration_seconds_count{job=~".*thanos-query.*", handler="query_range"}[5m])) > 0
    )
  for: 10m
  labels:
    severity: critical
```

## Receive

[embedmd]:# (../tmp/thanos-receive.yaml yaml)
```yaml
name: thanos-receive
rules:
- alert: ThanosReceiveHttpRequestErrorRateHigh
  annotations:
    description: Thanos Receive {{$labels.job}} is failing to handle {{$value | humanize}}%
      of requests.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivehttprequesterrorratehigh
    summary: Thanos Receive is failing to handle requests.
  expr: |
    (
      sum by (job) (rate(http_requests_total{code=~"5..", job=~".*thanos-receive.*", handler="receive"}[5m]))
    /
      sum by (job) (rate(http_requests_total{job=~".*thanos-receive.*", handler="receive"}[5m]))
    ) * 100 > 5
  for: 5m
  labels:
    severity: critical
- alert: ThanosReceiveHttpRequestLatencyHigh
  annotations:
    description: Thanos Receive {{$labels.job}} has a 99th percentile latency of {{
      $value }} seconds for requests.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivehttprequestlatencyhigh
    summary: Thanos Receive has high HTTP requests latency.
  expr: |
    (
      histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket{job=~".*thanos-receive.*", handler="receive"}[5m]))) > 10
    and
      sum by (job) (rate(http_request_duration_seconds_count{job=~".*thanos-receive.*", handler="receive"}[5m])) > 0
    )
  for: 10m
  labels:
    severity: critical
- alert: ThanosReceiveHighReplicationFailures
  annotations:
    description: Thanos Receive {{$labels.job}} is failing to replicate {{$value |
      humanize}}% of requests.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivehighreplicationfailures
    summary: Thanos Receive is having high number of replication failures.
  expr: |
    thanos_receive_replication_factor > 1
      and
    (
      (
        sum by (job) (rate(thanos_receive_replications_total{result="error", job=~".*thanos-receive.*"}[5m]))
      /
        sum by (job) (rate(thanos_receive_replications_total{job=~".*thanos-receive.*"}[5m]))
      )
      >
      (
        max by (job) (floor((thanos_receive_replication_factor{job=~".*thanos-receive.*"}+1) / 2))
      /
        max by (job) (thanos_receive_hashring_nodes{job=~".*thanos-receive.*"})
      )
    ) * 100
  for: 5m
  labels:
    severity: warning
- alert: ThanosReceiveHighForwardRequestFailures
  annotations:
    description: Thanos Receive {{$labels.job}} is failing to forward {{$value | humanize}}%
      of requests.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivehighforwardrequestfailures
    summary: Thanos Receive is failing to forward requests.
  expr: |
    (
      sum by (job) (rate(thanos_receive_forward_requests_total{result="error", job=~".*thanos-receive.*"}[5m]))
    /
      sum by (job) (rate(thanos_receive_forward_requests_total{job=~".*thanos-receive.*"}[5m]))
    ) * 100 > 20
  for: 5m
  labels:
    severity: info
- alert: ThanosReceiveHighHashringFileRefreshFailures
  annotations:
    description: Thanos Receive {{$labels.job}} is failing to refresh hashring file,
      {{$value | humanize}} of attempts failed.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivehighhashringfilerefreshfailures
    summary: Thanos Receive is failing to refresh hashring file.
  expr: |
    (
      sum by (job) (rate(thanos_receive_hashrings_file_errors_total{job=~".*thanos-receive.*"}[5m]))
    /
      sum by (job) (rate(thanos_receive_hashrings_file_refreshes_total{job=~".*thanos-receive.*"}[5m]))
    > 0
    )
  for: 15m
  labels:
    severity: warning
- alert: ThanosReceiveConfigReloadFailure
  annotations:
    description: Thanos Receive {{$labels.job}} has not been able to reload hashring
      configurations.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceiveconfigreloadfailure
    summary: Thanos Receive has not been able to reload configuration.
  expr: avg by (job) (thanos_receive_config_last_reload_successful{job=~".*thanos-receive.*"})
    != 1
  for: 5m
  labels:
    severity: warning
- alert: ThanosReceiveNoUpload
  annotations:
    description: Thanos Receive {{$labels.instance}} has not uploaded latest data
      to object storage.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivenoupload
    summary: Thanos Receive has not uploaded latest data to object storage.
  expr: |
    (up{job=~".*thanos-receive.*"} - 1)
    + on (job, instance) # filters to only alert on current instance last 3h
    (sum by (job, instance) (increase(thanos_shipper_uploads_total{job=~".*thanos-receive.*"}[3h])) == 0)
  for: 3h
  labels:
    severity: critical
```
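
The rules on this page carry a `severity` label of `critical`, `warning`, or `info`, which Alertmanager can use for routing. A minimal sketch of an `alertmanager.yml` routing tree follows, assuming Alertmanager 0.22 or later for the `matchers` syntax; the receiver names are hypothetical and their notification settings are left out:

```yaml
# alertmanager.yml (fragment); receiver names are hypothetical
route:
  receiver: default
  group_by: ['alertname', 'job']
  routes:
  # Page on-call only for critical alerts.
  - matchers:
    - severity="critical"
    receiver: oncall-pager
  # Send lower severities to a chat channel.
  - matchers:
    - severity=~"warning|info"
    receiver: team-chat

receivers:
- name: default
- name: oncall-pager
- name: team-chat
```

Splitting routes by severity this way is one possible policy, not something the rules themselves require.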

## Replicate

[embedmd]:# (../tmp/thanos-bucket-replicate.yaml yaml)
```yaml
name: thanos-bucket-replicate
rules:
- alert: ThanosBucketReplicateErrorRate
  annotations:
    description: Thanos Replicate is failing to run, {{$value | humanize}}% of attempts
      failed.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosbucketreplicateerrorrate
    summary: Thanos Replicate is failing to run.
  expr: |
    (
      sum by (job) (rate(thanos_replicate_replication_runs_total{result="error", job=~".*thanos-bucket-replicate.*"}[5m]))
    / on (job) group_left
      sum by (job) (rate(thanos_replicate_replication_runs_total{job=~".*thanos-bucket-replicate.*"}[5m]))
    ) * 100 >= 10
  for: 5m
  labels:
    severity: critical
- alert: ThanosBucketReplicateRunLatency
  annotations:
    description: Thanos Replicate {{$labels.job}} has a 99th percentile latency of
      {{$value}} seconds for the replicate operations.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosbucketreplicaterunlatency
    summary: Thanos Replicate has a high latency for replicate operations.
  expr: |
    (
      histogram_quantile(0.99, sum by (job, le) (rate(thanos_replicate_replication_run_duration_seconds_bucket{job=~".*thanos-bucket-replicate.*"}[5m]))) > 20
    and
      sum by (job) (rate(thanos_replicate_replication_run_duration_seconds_bucket{job=~".*thanos-bucket-replicate.*"}[5m])) > 0
    )
  for: 5m
  labels:
    severity: critical
```

## Extras

### Absent Rules

[embedmd]:# (../tmp/thanos-component-absent.yaml yaml)
```yaml
name: thanos-component-absent
rules:
- alert: ThanosBucketReplicateIsDown
  annotations:
    description: ThanosBucketReplicate has disappeared. Prometheus target for the
      component cannot be discovered.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosbucketreplicateisdown
    summary: Thanos component has disappeared.
  expr: |
    absent(up{job=~".*thanos-bucket-replicate.*"} == 1)
  for: 5m
  labels:
    severity: critical
- alert: ThanosCompactIsDown
  annotations:
    description: ThanosCompact has disappeared. Prometheus target for the component
      cannot be discovered.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanoscompactisdown
    summary: Thanos component has disappeared.
  expr: |
    absent(up{job=~".*thanos-compact.*"} == 1)
  for: 5m
  labels:
    severity: critical
- alert: ThanosQueryIsDown
  annotations:
    description: ThanosQuery has disappeared. Prometheus target for the component
      cannot be discovered.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosqueryisdown
    summary: Thanos component has disappeared.
  expr: |
    absent(up{job=~".*thanos-query.*"} == 1)
  for: 5m
  labels:
    severity: critical
- alert: ThanosReceiveIsDown
  annotations:
    description: ThanosReceive has disappeared. Prometheus target for the component
      cannot be discovered.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceiveisdown
    summary: Thanos component has disappeared.
  expr: |
    absent(up{job=~".*thanos-receive.*"} == 1)
  for: 5m
  labels:
    severity: critical
- alert: ThanosRuleIsDown
  annotations:
    description: ThanosRule has disappeared. Prometheus target for the component cannot
      be discovered.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosruleisdown
    summary: Thanos component has disappeared.
  expr: |
    absent(up{job=~".*thanos-rule.*"} == 1)
  for: 5m
  labels:
    severity: critical
- alert: ThanosSidecarIsDown
  annotations:
    description: ThanosSidecar has disappeared. Prometheus target for the component
      cannot be discovered.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarisdown
    summary: Thanos component has disappeared.
  expr: |
    absent(up{job=~".*thanos-sidecar.*"} == 1)
  for: 5m
  labels:
    severity: critical
- alert: ThanosStoreIsDown
  annotations:
    description: ThanosStore has disappeared. Prometheus target for the component
      cannot be discovered.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosstoreisdown
    summary: Thanos component has disappeared.
  expr: |
    absent(up{job=~".*thanos-store.*"} == 1)
  for: 5m
  labels:
    severity: critical
```
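
In a Kubernetes environment that runs the Prometheus Operator, rule groups can be deployed as `PrometheusRule` objects instead of plain files. The sketch below wraps the `ThanosQueryIsDown` rule from the group above; the namespace and the `role` label are assumptions and have to match whatever `ruleSelector` the target Prometheus uses:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: thanos-component-absent
  namespace: monitoring   # hypothetical namespace
  labels:
    role: alert-rules     # hypothetical; must match the Prometheus ruleSelector
spec:
  groups:
  - name: thanos-component-absent
    rules:
    - alert: ThanosQueryIsDown
      annotations:
        description: ThanosQuery has disappeared. Prometheus target for the component
          cannot be discovered.
        runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosqueryisdown
        summary: Thanos component has disappeared.
      expr: |
        absent(up{job=~".*thanos-query.*"} == 1)
      for: 5m
      labels:
        severity: critical
```

The remaining rules in the group can be added under the same `rules` list without further changes.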