---
stage: Platforms
group: Scalability
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
---

# GitLab Application Service Level Indicators (SLIs)

> [Introduced](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/525) in GitLab 14.4

You can define [Service Level Indicators
(SLIs)](https://en.wikipedia.org/wiki/Service_level_indicator)
directly in the Ruby codebase. This keeps the definition of operations
and their success close to the implementation, and lets the people
building features define how those features should be monitored.

Defining an SLI causes two
[Prometheus
counters](https://prometheus.io/docs/concepts/metric_types/#counter)
to be emitted from the Rails application:

- `gitlab_sli:<sli_name>:total`: incremented for each operation.
- `gitlab_sli:<sli_name>:success_total`: incremented for successful
  operations.

## Existing SLIs

1. [`rails_request_apdex`](rails_request_apdex.md)

## Defining a new SLI

An SLI can be defined using the `Gitlab::Metrics::Sli` class.

Before the first scrape, it is important to have [initialized the SLI
with all possible
label-combinations](https://prometheus.io/docs/practices/instrumentation/#avoid-missing-metrics). This
avoids confusing results when using these counters in calculations.

To initialize an SLI, use the `.initialize_sli` class method, for
example:

```ruby
Gitlab::Metrics::Sli.initialize_sli(:received_email, [
  {
    feature_category: :team_planning,
    email_type: :create_issue
  },
  {
    feature_category: :service_desk,
    email_type: :service_desk
  },
  {
    feature_category: :code_review,
    email_type: :create_merge_request
  }
])
```

Metrics must be initialized before they get
scraped for the first time.
This can be done when the process that emits
them starts, in which case we need to be careful
not to increase the application's boot time too much. This is preferable
if possible.

Alternatively, if initializing would take too long, it can be done
during the first scrape. We need to make sure we don't do it for every
scrape. This can be done as follows:

```ruby
def initialize_request_slis_if_needed!
  # Skip the work on every scrape after the first one.
  return if Gitlab::Metrics::Sli.initialized?(:rails_request_apdex)

  Gitlab::Metrics::Sli.initialize_sli(:rails_request_apdex, possible_request_labels)
end
```

Make sure to do this for each of the metrics endpoints.
Currently these are the
[`WebExporter`](https://gitlab.com/gitlab-org/gitlab/blob/master/lib/gitlab/metrics/exporter/web_exporter.rb)
and the
[`HealthController`](https://gitlab.com/gitlab-org/gitlab/blob/master/app/controllers/health_controller.rb)
for Rails, and the
[`SidekiqExporter`](https://gitlab.com/gitlab-org/gitlab/blob/master/lib/gitlab/metrics/exporter/sidekiq_exporter.rb)
for Sidekiq.

## Tracking operations for an SLI

Tracking an operation in the newly defined SLI can be done like this:

```ruby
Gitlab::Metrics::Sli[:received_email].increment(
  labels: {
    feature_category: :service_desk,
    email_type: :service_desk
  },
  success: issue_created?
)
```

Calling `#increment` on this SLI increments the total Prometheus counter:

```prometheus
gitlab_sli:received_email:total{ feature_category='service_desk', email_type='service_desk' }
```

If the `success:` argument passed is truthy, then the success counter
is also incremented:

```prometheus
gitlab_sli:received_email:success_total{ feature_category='service_desk', email_type='service_desk' }
```

So far, only tracking `apdex` using a success rate is supported.
If you
need to track errors this way, please upvote
[this issue](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1395)
and leave a comment so we can prioritize this.

## Using the SLI in service monitoring and alerts

When the application is emitting metrics for a new SLI, they need
to be consumed by the [metrics catalog](https://gitlab.com/gitlab-com/runbooks/-/tree/master/metrics-catalog)
to result in alerts, and to be included in the error budget for stage
groups and GitLab.com's overall availability.

Start by adding the new SLI to the
[Application-SLI library](https://gitlab.com/gitlab-com/runbooks/-/blob/d109886dfd5170793eeb8de3d69aafd4a9da78f6/metrics-catalog/gitlab-slis/library.libsonnet#L4).
After that, add the following information:

- `name`: the name of the SLI as defined in code. For example
  `received_email`.
- `significantLabels`: an array of Prometheus labels that belong to the
  metrics. For example: `["email_type"]`. If the significant labels
  for the SLI include `feature_category`, the metrics will also
  feed into the
  [error budgets for stage groups](../stage_group_dashboards.md#error-budget).
- `featureCategory`: if the SLI applies to a single feature category,
  you can specify it statically through this field to feed the SLI
  into the error budgets for stage groups.
- `description`: a Markdown string explaining the SLI. It will
  be shown on dashboards and alerts.
- `kind`: the kind of indicator. Only `sliDefinition.apdexKind` is supported at the moment.
  Reach out in
  [this issue](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1395)
  if you want to implement an SLI for success or error rates.

When done, run `make generate` to generate recording rules for
the new SLI. This command creates recordings for all services
emitting these metrics, aggregated over `significantLabels`.
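
To illustrate what aggregating over `significantLabels` means, here is a minimal Ruby sketch. It is not the recording-rule generator from the runbooks project, and it ignores rates and time windows. The sample data, the `pod` label, and the `aggregate_by` helper are all hypothetical. The sketch sums counter samples by the significant labels and drops every other label:

```ruby
# Hypothetical samples for `gitlab_sli:received_email:total`.
# Label values, the `pod` label, and the counts are made up for illustration.
samples = [
  { labels: { feature_category: 'service_desk', email_type: 'service_desk', pod: 'web-1' }, value: 10 },
  { labels: { feature_category: 'service_desk', email_type: 'service_desk', pod: 'web-2' }, value: 5 },
  { labels: { feature_category: 'team_planning', email_type: 'create_issue', pod: 'web-1' }, value: 7 }
]

# Sum sample values per combination of the significant labels,
# discarding all labels that are not listed.
def aggregate_by(samples, significant_labels)
  samples
    .group_by { |sample| sample[:labels].slice(*significant_labels) }
    .transform_values { |group| group.sum { |sample| sample[:value] } }
end

aggregated = aggregate_by(samples, [:email_type])
```

With `[:email_type]` as the only significant label, the two `service_desk` samples collapse into a single series, and the `feature_category` and `pod` labels are no longer available for querying. This is why `feature_category` has to be listed as a significant label if the aggregated metrics should feed into stage group error budgets.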

Open a merge request with these changes and request a review from a Scalability
team member.

When these changes are merged and the aggregations are recorded in
[Thanos](https://thanos.gitlab.net), query Thanos to see
the success ratio of the new aggregated metrics. For example:

```prometheus
sum by (environment, stage, type)(gitlab_sli_aggregation:rails_request_apdex:apdex:success:rate_1h)
/
sum by (environment, stage, type)(gitlab_sli_aggregation:rails_request_apdex:apdex:weight:rate_1h)
```

This shows the success ratio, which can guide you to set an
appropriate SLO when adding this SLI to a service.

Then, add the SLI to the appropriate service
catalog file. For example, the [`web` service](https://gitlab.com/gitlab-com/runbooks/-/blob/2b7be37a006c236bd684a4e6a1fbf4c66158292a/metrics-catalog/services/web.jsonnet#L198):

```jsonnet
rails_requests:
  sliLibrary.get('rails_request_apdex')
  .generateServiceLevelIndicator({ job: 'gitlab-rails' })
```

To pass extra selectors and override properties of the SLI, see the
[service monitoring documentation](https://gitlab.com/gitlab-com/runbooks/blob/master/metrics-catalog/README.md).

Alerts for SLIs with statically defined feature categories can already
be routed to specified Slack channels. For more information, read the
[alert routing documentation](https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/uncategorized/alert-routing.md).
In [this project](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/614)
we are extending this so alerts for SLIs with a `feature_category`
label in the source metrics can also be routed.

If you have any questions, please don't hesitate to create an issue in
[the Scalability issue tracker](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues)
or come find us in
[#g_scalability](https://gitlab.slack.com/archives/CMMF8TKR9) on Slack.