# Alerts

Here are some example alerts configured for a Kubernetes environment.

## Compaction

[embedmd]:# (../tmp/thanos-compact.yaml yaml)
```yaml
name: thanos-compact
rules:
- alert: ThanosCompactMultipleRunning
  annotations:
    description: No more than one Thanos Compact instance should be running at once.
      There are {{$value}} instances running.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanoscompactmultiplerunning
    summary: Thanos Compact has multiple instances running.
  expr: sum by (job) (up{job=~".*thanos-compact.*"}) > 1
  for: 5m
  labels:
    severity: warning
- alert: ThanosCompactHalted
  annotations:
    description: Thanos Compact {{$labels.job}} has failed to run and now is halted.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanoscompacthalted
    summary: Thanos Compact has failed to run and is now halted.
  expr: thanos_compact_halted{job=~".*thanos-compact.*"} == 1
  for: 5m
  labels:
    severity: warning
- alert: ThanosCompactHighCompactionFailures
  annotations:
    description: Thanos Compact {{$labels.job}} is failing to execute {{$value | humanize}}%
      of compactions.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanoscompacthighcompactionfailures
    summary: Thanos Compact is failing to execute compactions.
  expr: |
    (
      sum by (job) (rate(thanos_compact_group_compactions_failures_total{job=~".*thanos-compact.*"}[5m]))
    /
      sum by (job) (rate(thanos_compact_group_compactions_total{job=~".*thanos-compact.*"}[5m]))
    * 100 > 5
    )
  for: 15m
  labels:
    severity: warning
- alert: ThanosCompactBucketHighOperationFailures
  annotations:
    description: Thanos Compact {{$labels.job}} Bucket is failing to execute {{$value
      | humanize}}% of operations.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanoscompactbuckethighoperationfailures
    summary: Thanos Compact Bucket is having a high number of operation failures.
  expr: |
    (
      sum by (job) (rate(thanos_objstore_bucket_operation_failures_total{job=~".*thanos-compact.*"}[5m]))
    /
      sum by (job) (rate(thanos_objstore_bucket_operations_total{job=~".*thanos-compact.*"}[5m]))
    * 100 > 5
    )
  for: 15m
  labels:
    severity: warning
- alert: ThanosCompactHasNotRun
  annotations:
    description: Thanos Compact {{$labels.job}} has not uploaded anything for 24 hours.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanoscompacthasnotrun
    summary: Thanos Compact has not uploaded anything for the last 24 hours.
  expr: (time() - max by (job) (max_over_time(thanos_objstore_bucket_last_successful_upload_time{job=~".*thanos-compact.*"}[24h])))
    / 60 / 60 > 24
  labels:
    severity: warning
```

## Ruler

For Thanos Ruler, we run some alerts in a local Prometheus to make sure that Thanos Ruler is working:

[embedmd]:# (../tmp/thanos-rule.yaml yaml)
```yaml
name: thanos-rule
rules:
- alert: ThanosRuleQueueIsDroppingAlerts
  annotations:
    description: Thanos Rule {{$labels.instance}} is failing to queue alerts.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosrulequeueisdroppingalerts
    summary: Thanos Rule is failing to queue alerts.
  expr: |
    sum by (job, instance) (rate(thanos_alert_queue_alerts_dropped_total{job=~".*thanos-rule.*"}[5m])) > 0
  for: 5m
  labels:
    severity: critical
- alert: ThanosRuleSenderIsFailingAlerts
  annotations:
    description: Thanos Rule {{$labels.instance}} is failing to send alerts to alertmanager.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosrulesenderisfailingalerts
    summary: Thanos Rule is failing to send alerts to alertmanager.
  expr: |
    sum by (job, instance) (rate(thanos_alert_sender_alerts_dropped_total{job=~".*thanos-rule.*"}[5m])) > 0
  for: 5m
  labels:
    severity: critical
- alert: ThanosRuleHighRuleEvaluationFailures
  annotations:
    description: Thanos Rule {{$labels.instance}} is failing to evaluate rules.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosrulehighruleevaluationfailures
    summary: Thanos Rule is failing to evaluate rules.
  expr: |
    (
      sum by (job, instance) (rate(prometheus_rule_evaluation_failures_total{job=~".*thanos-rule.*"}[5m]))
    /
      sum by (job, instance) (rate(prometheus_rule_evaluations_total{job=~".*thanos-rule.*"}[5m]))
    * 100 > 5
    )
  for: 5m
  labels:
    severity: critical
- alert: ThanosRuleHighRuleEvaluationWarnings
  annotations:
    description: Thanos Rule {{$labels.instance}} has high number of evaluation warnings.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosrulehighruleevaluationwarnings
    summary: Thanos Rule has high number of evaluation warnings.
  expr: |
    sum by (job, instance) (rate(thanos_rule_evaluation_with_warnings_total{job=~".*thanos-rule.*"}[5m])) > 0
  for: 15m
  labels:
    severity: info
- alert: ThanosRuleRuleEvaluationLatencyHigh
  annotations:
    description: Thanos Rule {{$labels.instance}} has higher evaluation latency than
      interval for {{$labels.rule_group}}.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosruleruleevaluationlatencyhigh
    summary: Thanos Rule has high rule evaluation latency.
  expr: |
    (
      sum by (job, instance, rule_group) (prometheus_rule_group_last_duration_seconds{job=~".*thanos-rule.*"})
    >
      sum by (job, instance, rule_group) (prometheus_rule_group_interval_seconds{job=~".*thanos-rule.*"})
    )
  for: 5m
  labels:
    severity: warning
- alert: ThanosRuleGrpcErrorRate
  annotations:
    description: Thanos Rule {{$labels.job}} is failing to handle {{$value | humanize}}%
      of requests.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosrulegrpcerrorrate
    summary: Thanos Rule is failing to handle gRPC requests.
  expr: |
    (
      sum by (job, instance) (rate(grpc_server_handled_total{grpc_code=~"Unknown|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded", job=~".*thanos-rule.*"}[5m]))
    /
      sum by (job, instance) (rate(grpc_server_started_total{job=~".*thanos-rule.*"}[5m]))
    * 100 > 5
    )
  for: 5m
  labels:
    severity: warning
- alert: ThanosRuleConfigReloadFailure
  annotations:
    description: Thanos Rule {{$labels.job}} has not been able to reload its configuration.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosruleconfigreloadfailure
    summary: Thanos Rule has not been able to reload configuration.
  expr: avg by (job, instance) (thanos_rule_config_last_reload_successful{job=~".*thanos-rule.*"})
    != 1
  for: 5m
  labels:
    severity: info
- alert: ThanosRuleQueryHighDNSFailures
  annotations:
    description: Thanos Rule {{$labels.job}} has {{$value | humanize}}% of failing
      DNS queries for query endpoints.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosrulequeryhighdnsfailures
    summary: Thanos Rule is having high number of DNS failures.
  expr: |
    (
      sum by (job, instance) (rate(thanos_rule_query_apis_dns_failures_total{job=~".*thanos-rule.*"}[5m]))
    /
      sum by (job, instance) (rate(thanos_rule_query_apis_dns_lookups_total{job=~".*thanos-rule.*"}[5m]))
    * 100 > 1
    )
  for: 15m
  labels:
    severity: warning
- alert: ThanosRuleAlertmanagerHighDNSFailures
  annotations:
    description: Thanos Rule {{$labels.instance}} has {{$value | humanize}}% of failing
      DNS queries for Alertmanager endpoints.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosrulealertmanagerhighdnsfailures
    summary: Thanos Rule is having high number of DNS failures.
  expr: |
    (
      sum by (job, instance) (rate(thanos_rule_alertmanagers_dns_failures_total{job=~".*thanos-rule.*"}[5m]))
    /
      sum by (job, instance) (rate(thanos_rule_alertmanagers_dns_lookups_total{job=~".*thanos-rule.*"}[5m]))
    * 100 > 1
    )
  for: 15m
  labels:
    severity: warning
- alert: ThanosRuleNoEvaluationFor10Intervals
  annotations:
    description: Thanos Rule {{$labels.job}} has {{$value | humanize}}% rule groups
      that did not evaluate for at least 10x of their expected interval.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosrulenoevaluationfor10intervals
    summary: Thanos Rule has rule groups that did not evaluate for 10 intervals.
  expr: |
    time() - max by (job, instance, group) (prometheus_rule_group_last_evaluation_timestamp_seconds{job=~".*thanos-rule.*"})
    >
    10 * max by (job, instance, group) (prometheus_rule_group_interval_seconds{job=~".*thanos-rule.*"})
  for: 5m
  labels:
    severity: info
- alert: ThanosNoRuleEvaluations
  annotations:
    description: Thanos Rule {{$labels.instance}} did not perform any rule evaluations
      in the past 10 minutes.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosnoruleevaluations
    summary: Thanos Rule did not perform any rule evaluations.
  expr: |
    sum by (job, instance) (rate(prometheus_rule_evaluations_total{job=~".*thanos-rule.*"}[5m])) <= 0
      and
    sum by (job, instance) (thanos_rule_loaded_rules{job=~".*thanos-rule.*"}) > 0
  for: 5m
  labels:
    severity: critical
```

## Store Gateway

[embedmd]:# (../tmp/thanos-store.yaml yaml)
```yaml
name: thanos-store
rules:
- alert: ThanosStoreGrpcErrorRate
  annotations:
    description: Thanos Store {{$labels.job}} is failing to handle {{$value | humanize}}%
      of requests.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosstoregrpcerrorrate
    summary: Thanos Store is failing to handle gRPC requests.
  expr: |
    (
      sum by (job) (rate(grpc_server_handled_total{grpc_code=~"Unknown|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded", job=~".*thanos-store.*"}[5m]))
    /
      sum by (job) (rate(grpc_server_started_total{job=~".*thanos-store.*"}[5m]))
    * 100 > 5
    )
  for: 5m
  labels:
    severity: warning
- alert: ThanosStoreSeriesGateLatencyHigh
  annotations:
    description: Thanos Store {{$labels.job}} has a 99th percentile latency of {{$value}}
      seconds for store series gate requests.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosstoreseriesgatelatencyhigh
    summary: Thanos Store has high latency for store series gate requests.
  expr: |
    (
      histogram_quantile(0.99, sum by (job, le) (rate(thanos_bucket_store_series_gate_duration_seconds_bucket{job=~".*thanos-store.*"}[5m]))) > 2
    and
      sum by (job) (rate(thanos_bucket_store_series_gate_duration_seconds_count{job=~".*thanos-store.*"}[5m])) > 0
    )
  for: 10m
  labels:
    severity: warning
- alert: ThanosStoreBucketHighOperationFailures
  annotations:
    description: Thanos Store {{$labels.job}} Bucket is failing to execute {{$value
      | humanize}}% of operations.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosstorebuckethighoperationfailures
    summary: Thanos Store Bucket is failing to execute operations.
  expr: |
    (
      sum by (job) (rate(thanos_objstore_bucket_operation_failures_total{job=~".*thanos-store.*"}[5m]))
    /
      sum by (job) (rate(thanos_objstore_bucket_operations_total{job=~".*thanos-store.*"}[5m]))
    * 100 > 5
    )
  for: 15m
  labels:
    severity: warning
- alert: ThanosStoreObjstoreOperationLatencyHigh
  annotations:
    description: Thanos Store {{$labels.job}} Bucket has a 99th percentile latency
      of {{$value}} seconds for the bucket operations.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosstoreobjstoreoperationlatencyhigh
    summary: Thanos Store is having high latency for bucket operations.
  expr: |
    (
      histogram_quantile(0.99, sum by (job, le) (rate(thanos_objstore_bucket_operation_duration_seconds_bucket{job=~".*thanos-store.*"}[5m]))) > 2
    and
      sum by (job) (rate(thanos_objstore_bucket_operation_duration_seconds_count{job=~".*thanos-store.*"}[5m])) > 0
    )
  for: 10m
  labels:
    severity: warning
```

## Sidecar

[embedmd]:# (../tmp/thanos-sidecar.yaml yaml)
```yaml
name: thanos-sidecar
rules:
- alert: ThanosSidecarPrometheusDown
  annotations:
    description: Thanos Sidecar {{$labels.instance}} cannot connect to Prometheus.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarprometheusdown
    summary: Thanos Sidecar cannot connect to Prometheus
  expr: |
    thanos_sidecar_prometheus_up{job=~".*thanos-sidecar.*"} == 0
  for: 5m
  labels:
    severity: critical
- alert: ThanosSidecarBucketOperationsFailed
  annotations:
    description: Thanos Sidecar {{$labels.instance}} bucket operations are failing
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarbucketoperationsfailed
    summary: Thanos Sidecar bucket operations are failing
  expr: |
    sum by (job, instance) (rate(thanos_objstore_bucket_operation_failures_total{job=~".*thanos-sidecar.*"}[5m])) > 0
  for: 5m
  labels:
    severity: critical
- alert: ThanosSidecarUnhealthy
  annotations:
    description: Thanos Sidecar {{$labels.instance}} is unhealthy for more than {{$value}}
      seconds.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarunhealthy
    summary: Thanos Sidecar is unhealthy.
  expr: |
    time() - max by (job, instance) (timestamp(thanos_sidecar_last_heartbeat_success_time_seconds{job=~".*thanos-sidecar.*"})) >= 240
  labels:
    severity: critical
```

## Query

[embedmd]:# (../tmp/thanos-query.yaml yaml)
```yaml
name: thanos-query
rules:
- alert: ThanosQueryHttpRequestQueryErrorRateHigh
  annotations:
    description: Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}%
      of "query" requests.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosqueryhttprequestqueryerrorratehigh
    summary: Thanos Query is failing to handle requests.
  expr: |
    (
      sum by (job) (rate(http_requests_total{code=~"5..", job=~".*thanos-query.*", handler="query"}[5m]))
    /
      sum by (job) (rate(http_requests_total{job=~".*thanos-query.*", handler="query"}[5m]))
    ) * 100 > 5
  for: 5m
  labels:
    severity: critical
- alert: ThanosQueryHttpRequestQueryRangeErrorRateHigh
  annotations:
    description: Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}%
      of "query_range" requests.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosqueryhttprequestqueryrangeerrorratehigh
    summary: Thanos Query is failing to handle requests.
  expr: |
    (
      sum by (job) (rate(http_requests_total{code=~"5..", job=~".*thanos-query.*", handler="query_range"}[5m]))
    /
      sum by (job) (rate(http_requests_total{job=~".*thanos-query.*", handler="query_range"}[5m]))
    ) * 100 > 5
  for: 5m
  labels:
    severity: critical
- alert: ThanosQueryGrpcServerErrorRate
  annotations:
    description: Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}%
      of requests.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosquerygrpcservererrorrate
    summary: Thanos Query is failing to handle requests.
  expr: |
    (
      sum by (job) (rate(grpc_server_handled_total{grpc_code=~"Unknown|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded", job=~".*thanos-query.*"}[5m]))
    /
      sum by (job) (rate(grpc_server_started_total{job=~".*thanos-query.*"}[5m]))
    * 100 > 5
    )
  for: 5m
  labels:
    severity: warning
- alert: ThanosQueryGrpcClientErrorRate
  annotations:
    description: Thanos Query {{$labels.job}} is failing to send {{$value | humanize}}%
      of requests.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosquerygrpcclienterrorrate
    summary: Thanos Query is failing to send requests.
  expr: |
    (
      sum by (job) (rate(grpc_client_handled_total{grpc_code!="OK", job=~".*thanos-query.*"}[5m]))
    /
      sum by (job) (rate(grpc_client_started_total{job=~".*thanos-query.*"}[5m]))
    ) * 100 > 5
  for: 5m
  labels:
    severity: warning
- alert: ThanosQueryHighDNSFailures
  annotations:
    description: Thanos Query {{$labels.job}} has {{$value | humanize}}% of failing
      DNS queries for store endpoints.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosqueryhighdnsfailures
    summary: Thanos Query is having high number of DNS failures.
  expr: |
    (
      sum by (job) (rate(thanos_query_store_apis_dns_failures_total{job=~".*thanos-query.*"}[5m]))
    /
      sum by (job) (rate(thanos_query_store_apis_dns_lookups_total{job=~".*thanos-query.*"}[5m]))
    ) * 100 > 1
  for: 15m
  labels:
    severity: warning
- alert: ThanosQueryInstantLatencyHigh
  annotations:
    description: Thanos Query {{$labels.job}} has a 99th percentile latency of {{$value}}
      seconds for instant queries.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosqueryinstantlatencyhigh
    summary: Thanos Query has high latency for queries.
  expr: |
    (
      histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket{job=~".*thanos-query.*", handler="query"}[5m]))) > 40
    and
      sum by (job) (rate(http_request_duration_seconds_bucket{job=~".*thanos-query.*", handler="query"}[5m])) > 0
    )
  for: 10m
  labels:
    severity: critical
- alert: ThanosQueryRangeLatencyHigh
  annotations:
    description: Thanos Query {{$labels.job}} has a 99th percentile latency of {{$value}}
      seconds for range queries.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosqueryrangelatencyhigh
    summary: Thanos Query has high latency for queries.
  expr: |
    (
      histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket{job=~".*thanos-query.*", handler="query_range"}[5m]))) > 90
    and
      sum by (job) (rate(http_request_duration_seconds_count{job=~".*thanos-query.*", handler="query_range"}[5m])) > 0
    )
  for: 10m
  labels:
    severity: critical
```

## Receive

[embedmd]:# (../tmp/thanos-receive.yaml yaml)
```yaml
name: thanos-receive
rules:
- alert: ThanosReceiveHttpRequestErrorRateHigh
  annotations:
    description: Thanos Receive {{$labels.job}} is failing to handle {{$value | humanize}}%
      of requests.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivehttprequesterrorratehigh
    summary: Thanos Receive is failing to handle requests.
  expr: |
    (
      sum by (job) (rate(http_requests_total{code=~"5..", job=~".*thanos-receive.*", handler="receive"}[5m]))
    /
      sum by (job) (rate(http_requests_total{job=~".*thanos-receive.*", handler="receive"}[5m]))
    ) * 100 > 5
  for: 5m
  labels:
    severity: critical
- alert: ThanosReceiveHttpRequestLatencyHigh
  annotations:
    description: Thanos Receive {{$labels.job}} has a 99th percentile latency of {{
      $value }} seconds for requests.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivehttprequestlatencyhigh
    summary: Thanos Receive has high HTTP requests latency.
  expr: |
    (
      histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket{job=~".*thanos-receive.*", handler="receive"}[5m]))) > 10
    and
      sum by (job) (rate(http_request_duration_seconds_count{job=~".*thanos-receive.*", handler="receive"}[5m])) > 0
    )
  for: 10m
  labels:
    severity: critical
- alert: ThanosReceiveHighReplicationFailures
  annotations:
    description: Thanos Receive {{$labels.job}} is failing to replicate {{$value |
      humanize}}% of requests.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivehighreplicationfailures
    summary: Thanos Receive is having high number of replication failures.
  expr: |
    thanos_receive_replication_factor > 1
      and
    (
      (
        sum by (job) (rate(thanos_receive_replications_total{result="error", job=~".*thanos-receive.*"}[5m]))
      /
        sum by (job) (rate(thanos_receive_replications_total{job=~".*thanos-receive.*"}[5m]))
      )
      >
      (
        max by (job) (floor((thanos_receive_replication_factor{job=~".*thanos-receive.*"}+1) / 2))
      /
        max by (job) (thanos_receive_hashring_nodes{job=~".*thanos-receive.*"})
      )
    ) * 100
  for: 5m
  labels:
    severity: warning
- alert: ThanosReceiveHighForwardRequestFailures
  annotations:
    description: Thanos Receive {{$labels.job}} is failing to forward {{$value | humanize}}%
      of requests.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivehighforwardrequestfailures
    summary: Thanos Receive is failing to forward requests.
  expr: |
    (
      sum by (job) (rate(thanos_receive_forward_requests_total{result="error", job=~".*thanos-receive.*"}[5m]))
    /
      sum by (job) (rate(thanos_receive_forward_requests_total{job=~".*thanos-receive.*"}[5m]))
    ) * 100 > 20
  for: 5m
  labels:
    severity: info
- alert: ThanosReceiveHighHashringFileRefreshFailures
  annotations:
    description: Thanos Receive {{$labels.job}} is failing to refresh hashring file,
      {{$value | humanize}} of attempts failed.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivehighhashringfilerefreshfailures
    summary: Thanos Receive is failing to refresh hashring file.
  expr: |
    (
      sum by (job) (rate(thanos_receive_hashrings_file_errors_total{job=~".*thanos-receive.*"}[5m]))
    /
      sum by (job) (rate(thanos_receive_hashrings_file_refreshes_total{job=~".*thanos-receive.*"}[5m]))
    > 0
    )
  for: 15m
  labels:
    severity: warning
- alert: ThanosReceiveConfigReloadFailure
  annotations:
    description: Thanos Receive {{$labels.job}} has not been able to reload hashring
      configurations.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceiveconfigreloadfailure
    summary: Thanos Receive has not been able to reload configuration.
  expr: avg by (job) (thanos_receive_config_last_reload_successful{job=~".*thanos-receive.*"})
    != 1
  for: 5m
  labels:
    severity: warning
- alert: ThanosReceiveNoUpload
  annotations:
    description: Thanos Receive {{$labels.instance}} has not uploaded latest data
      to object storage.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivenoupload
    summary: Thanos Receive has not uploaded latest data to object storage.
  expr: |
    (up{job=~".*thanos-receive.*"} - 1)
    + on (job, instance) # filters to only alert on current instance last 3h
    (sum by (job, instance) (increase(thanos_shipper_uploads_total{job=~".*thanos-receive.*"}[3h])) == 0)
  for: 3h
  labels:
    severity: critical
```

## Replicate

[embedmd]:# (../tmp/thanos-bucket-replicate.yaml yaml)
```yaml
name: thanos-bucket-replicate
rules:
- alert: ThanosBucketReplicateErrorRate
  annotations:
    description: Thanos Replicate is failing to run, {{$value | humanize}}% of attempts
      failed.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosbucketreplicateerrorrate
    summary: Thanos Replicate is failing to run.
  expr: |
    (
      sum by (job) (rate(thanos_replicate_replication_runs_total{result="error", job=~".*thanos-bucket-replicate.*"}[5m]))
    / on (job) group_left
      sum by (job) (rate(thanos_replicate_replication_runs_total{job=~".*thanos-bucket-replicate.*"}[5m]))
    ) * 100 >= 10
  for: 5m
  labels:
    severity: critical
- alert: ThanosBucketReplicateRunLatency
  annotations:
    description: Thanos Replicate {{$labels.job}} has a 99th percentile latency of
      {{$value}} seconds for the replicate operations.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosbucketreplicaterunlatency
    summary: Thanos Replicate has a high latency for replicate operations.
  expr: |
    (
      histogram_quantile(0.99, sum by (job, le) (rate(thanos_replicate_replication_run_duration_seconds_bucket{job=~".*thanos-bucket-replicate.*"}[5m]))) > 20
    and
      sum by (job) (rate(thanos_replicate_replication_run_duration_seconds_bucket{job=~".*thanos-bucket-replicate.*"}[5m])) > 0
    )
  for: 5m
  labels:
    severity: critical
```

## Extras

### Absent Rules

[embedmd]:# (../tmp/thanos-component-absent.yaml yaml)
```yaml
name: thanos-component-absent
rules:
- alert: ThanosBucketReplicateIsDown
  annotations:
    description: ThanosBucketReplicate has disappeared. Prometheus target for the
      component cannot be discovered.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosbucketreplicateisdown
    summary: Thanos component has disappeared.
  expr: |
    absent(up{job=~".*thanos-bucket-replicate.*"} == 1)
  for: 5m
  labels:
    severity: critical
- alert: ThanosCompactIsDown
  annotations:
    description: ThanosCompact has disappeared. Prometheus target for the component
      cannot be discovered.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanoscompactisdown
    summary: Thanos component has disappeared.
  expr: |
    absent(up{job=~".*thanos-compact.*"} == 1)
  for: 5m
  labels:
    severity: critical
- alert: ThanosQueryIsDown
  annotations:
    description: ThanosQuery has disappeared. Prometheus target for the component
      cannot be discovered.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosqueryisdown
    summary: Thanos component has disappeared.
  expr: |
    absent(up{job=~".*thanos-query.*"} == 1)
  for: 5m
  labels:
    severity: critical
- alert: ThanosReceiveIsDown
  annotations:
    description: ThanosReceive has disappeared. Prometheus target for the component
      cannot be discovered.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceiveisdown
    summary: Thanos component has disappeared.
  expr: |
    absent(up{job=~".*thanos-receive.*"} == 1)
  for: 5m
  labels:
    severity: critical
- alert: ThanosRuleIsDown
  annotations:
    description: ThanosRule has disappeared. Prometheus target for the component cannot
      be discovered.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosruleisdown
    summary: Thanos component has disappeared.
  expr: |
    absent(up{job=~".*thanos-rule.*"} == 1)
  for: 5m
  labels:
    severity: critical
- alert: ThanosSidecarIsDown
  annotations:
    description: ThanosSidecar has disappeared. Prometheus target for the component
      cannot be discovered.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarisdown
    summary: Thanos component has disappeared.
  expr: |
    absent(up{job=~".*thanos-sidecar.*"} == 1)
  for: 5m
  labels:
    severity: critical
- alert: ThanosStoreIsDown
  annotations:
    description: ThanosStore has disappeared. Prometheus target for the component
      cannot be discovered.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosstoreisdown
    summary: Thanos component has disappeared.
  expr: |
    absent(up{job=~".*thanos-store.*"} == 1)
  for: 5m
  labels:
    severity: critical
```
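
### Example Rule File

Each block on this page is a single rule group (`name` plus `rules`) and does not include the top-level `groups:` key that a Prometheus rule file expects. The following is a minimal sketch of how one group might be wrapped into a loadable file. The file name `thanos-alerts.yaml` is a hypothetical choice, and only one alert from the absent rules is reproduced here, with the `runbook_url` annotation left out for brevity:

```yaml
# thanos-alerts.yaml (hypothetical file name, chosen for illustration)
groups:
- name: thanos-component-absent
  rules:
  - alert: ThanosCompactIsDown
    annotations:
      description: ThanosCompact has disappeared. Prometheus target for the component
        cannot be discovered.
      summary: Thanos component has disappeared.
    # absent() returns a result only when no matching `up` series equals 1.
    expr: |
      absent(up{job=~".*thanos-compact.*"} == 1)
    for: 5m
    labels:
      severity: critical
```

A file laid out this way can be checked for syntax errors with `promtool check rules thanos-alerts.yaml` before it is loaded. The `job=~".*thanos-compact.*"` selector assumes your scrape job names contain the component name, so adjust it if your jobs are named differently.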