---
stage: Platforms
group: Scalability
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
---

# GitLab Application Service Level Indicators (SLIs)

> [Introduced](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/525) in GitLab 14.4

You can define [Service Level Indicators
(SLIs)](https://en.wikipedia.org/wiki/Service_level_indicator)
directly in the Ruby codebase. This keeps the definition of operations
and their success close to the implementation, and lets the people
building features define how these features should be
monitored.

Defining an SLI causes two
[Prometheus
counters](https://prometheus.io/docs/concepts/metric_types/#counter)
to be emitted from the Rails application:

- `gitlab_sli:<sli_name>:total`: incremented for each operation.
- `gitlab_sli:<sli_name>:success_total`: incremented for successful
  operations.
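
As a rough illustration of how the two counters relate, the following
sketch models an SLI with two in-memory hashes. `SketchSli` is a
hypothetical stand-in written for this page. It is not the
`Gitlab::Metrics::Sli` implementation and it does not talk to Prometheus:

```ruby
# Hypothetical stand-in for illustration only; not the real Gitlab::Metrics::Sli.
class SketchSli
  attr_reader :name, :totals, :successes

  def initialize(name)
    @name = name
    # One counter value per label set, defaulting to 0.
    @totals = Hash.new(0)
    @successes = Hash.new(0)
  end

  # Every operation bumps the total; only successful ones bump the success counter.
  def increment(labels:, success:)
    @totals[labels] += 1
    @successes[labels] += 1 if success
  end

  # Metric names following the naming pattern described above.
  def total_counter_name
    "gitlab_sli:#{name}:total"
  end

  def success_counter_name
    "gitlab_sli:#{name}:success_total"
  end
end

sli = SketchSli.new(:received_email)
labels = { feature_category: :service_desk, email_type: :service_desk }
sli.increment(labels: labels, success: true)
sli.increment(labels: labels, success: false)
```

After these two calls, the sketch holds a total of 2 and a success count
of 1 for that label set.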

## Existing SLIs

1. [`rails_request_apdex`](rails_request_apdex.md)

## Defining a new SLI

An SLI can be defined using the `Gitlab::Metrics::Sli` class.

Before the first scrape, it is important to have [initialized the SLI
with all possible
label-combinations](https://prometheus.io/docs/practices/instrumentation/#avoid-missing-metrics). This
avoids confusing results when using these counters in calculations.

To initialize an SLI, use the `.initialize_sli` class method, for
example:

```ruby
Gitlab::Metrics::Sli.initialize_sli(:received_email, [
  {
    feature_category: :team_planning,
    email_type: :create_issue
  },
  {
    feature_category: :service_desk,
    email_type: :service_desk
  },
  {
    feature_category: :code_review,
    email_type: :create_merge_request
  }
])
```
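
Writing out every label combination by hand can become error-prone as
the number of labels grows. One possible approach, sketched here,
derives the combinations from a single mapping. The constants
`EMAIL_TYPE_CATEGORIES` and `POSSIBLE_EMAIL_LABELS` are hypothetical
names used for illustration and are not part of the GitLab codebase:

```ruby
# Hypothetical mapping from email type to the feature category that owns it.
EMAIL_TYPE_CATEGORIES = {
  create_issue: :team_planning,
  service_desk: :service_desk,
  create_merge_request: :code_review
}.freeze

# Build one label hash per known email type. The result has the same
# shape as the array passed to `initialize_sli` in the example above.
POSSIBLE_EMAIL_LABELS = EMAIL_TYPE_CATEGORIES.map do |email_type, feature_category|
  { feature_category: feature_category, email_type: email_type }
end.freeze
```

Keeping the mapping in one place means that a new email type is picked up
by the initialization without a second list to maintain.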

Metrics must be initialized before they are
scraped for the first time. This can be done at the start of the
process that emits them, in which case we need to take care
not to increase the application's boot time too much. This is the
preferred approach when possible.

Alternatively, if initializing would take too long, it can be done
during the first scrape. We need to make sure we don't do it for every
scrape. This can be done as follows:

```ruby
def initialize_request_slis_if_needed!
  return if Gitlab::Metrics::Sli.initialized?(:rails_request_apdex)

  Gitlab::Metrics::Sli.initialize_sli(:rails_request_apdex, possible_request_labels)
end
```
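
To see why the guard clause matters, the following self-contained sketch
simulates several scrapes against an in-memory registry.
`SketchSliRegistry` is a hypothetical stand-in for
`Gitlab::Metrics::Sli`, and the label set passed in is made up for the
example:

```ruby
# Hypothetical in-memory registry; stands in for Gitlab::Metrics::Sli.
module SketchSliRegistry
  @initialized = {}
  @initialization_count = Hash.new(0)

  class << self
    attr_reader :initialization_count

    def initialized?(name)
      @initialized.key?(name)
    end

    def initialize_sli(name, label_combinations)
      @initialized[name] = label_combinations
      @initialization_count[name] += 1
    end
  end
end

# Called at the start of every scrape; only the first call does the work.
def initialize_request_slis_if_needed!(possible_labels)
  return if SketchSliRegistry.initialized?(:rails_request_apdex)

  SketchSliRegistry.initialize_sli(:rails_request_apdex, possible_labels)
end

# Simulate three scrapes in a row.
3.times { initialize_request_slis_if_needed!([{ feature_category: :not_owned }]) }
```

In this sketch the initialization runs once, no matter how many scrapes
follow.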

Make sure to do this for each of the metrics
endpoints we have. Currently, these are the
[`WebExporter`](https://gitlab.com/gitlab-org/gitlab/blob/master/lib/gitlab/metrics/exporter/web_exporter.rb)
and the
[`HealthController`](https://gitlab.com/gitlab-org/gitlab/blob/master/app/controllers/health_controller.rb)
for Rails, and the
[`SidekiqExporter`](https://gitlab.com/gitlab-org/gitlab/blob/master/lib/gitlab/metrics/exporter/sidekiq_exporter.rb)
for Sidekiq.

## Tracking operations for an SLI

Tracking an operation in the newly defined SLI can be done like this:

```ruby
Gitlab::Metrics::Sli[:received_email].increment(
  labels: {
    feature_category: :service_desk,
    email_type: :service_desk
  },
  success: issue_created?
)
```

Calling `#increment` on this SLI increments the total Prometheus counter:

```prometheus
gitlab_sli:received_email:total{ feature_category='service_desk', email_type='service_desk' }
```

If the `success:` argument passed is truthy, the success counter
is also incremented:

```prometheus
gitlab_sli:received_email:success_total{ feature_category='service_desk', email_type='service_desk' }
```
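
Dividing the success counter by the total counter gives a success
ratio. In practice, Prometheus computes this over rates of the counters.
The plain-Ruby sketch below only illustrates the arithmetic for a single
label set. `sli_success_ratio` is a hypothetical helper, not a GitLab
method:

```ruby
# Success ratio for one label set, given the values of the two counters.
# Returns nil when no operations were recorded, to avoid dividing by zero.
def sli_success_ratio(success_total:, total:)
  return nil if total.zero?

  success_total.fdiv(total)
end

# Example: 3 of 4 received emails resulted in a created issue.
sli_success_ratio(success_total: 3, total: 4) # => 0.75
```

The `nil` case mirrors why the counters should be initialized up front:
a ratio is only meaningful when both series exist.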

So far, only tracking `apdex` using a success rate is supported. If you
need to track errors this way, upvote
[this issue](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1395)
and leave a comment so we can prioritize it.

## Using the SLI in service monitoring and alerts

When the application is emitting metrics for a new SLI, they need
to be consumed by the [metrics catalog](https://gitlab.com/gitlab-com/runbooks/-/tree/master/metrics-catalog)
so that they result in alerts and are included in the error budgets for stage
groups and in GitLab.com's overall availability.

Start by adding the new SLI to the
[Application-SLI library](https://gitlab.com/gitlab-com/runbooks/-/blob/d109886dfd5170793eeb8de3d69aafd4a9da78f6/metrics-catalog/gitlab-slis/library.libsonnet#L4).
After that, add the following information:

- `name`: the name of the SLI as defined in code. For example,
  `received_email`.
- `significantLabels`: an array of Prometheus labels that belong to the
  metrics. For example: `["email_type"]`. If the significant labels
  for the SLI include `feature_category`, the metrics also
  feed into the
  [error budgets for stage groups](../stage_group_dashboards.md#error-budget).
- `featureCategory`: if the SLI applies to a single feature category,
  you can specify it statically through this field to feed the SLI
  into the error budgets for stage groups.
- `description`: a Markdown string explaining the SLI. It is
  shown on dashboards and alerts.
- `kind`: the kind of indicator. Only `sliDefinition.apdexKind` is supported at the moment.
  Reach out in
  [this issue](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1395)
  if you want to implement an SLI for success or error rates.

When done, run `make generate` to generate recording rules for
the new SLI. This command creates recording rules for all services
emitting these metrics, aggregated over `significantLabels`.
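
Conceptually, aggregating over `significantLabels` means summing the
series that share the same values for those labels and dropping every
other label. The real recording rules are generated by the runbooks
tooling and operate on rates in Prometheus. The plain-Ruby sketch below
only illustrates the idea, and the `aggregate_over` helper and the `pod`
label are assumptions made up for the example:

```ruby
# Sum per-label-set values, keeping only the significant labels.
def aggregate_over(samples, significant_labels)
  samples.each_with_object(Hash.new(0)) do |(labels, value), aggregated|
    aggregated[labels.slice(*significant_labels)] += value
  end
end

# Hypothetical samples: label set => value.
samples = {
  { email_type: :service_desk, pod: "web-1" } => 5,
  { email_type: :service_desk, pod: "web-2" } => 7,
  { email_type: :create_issue, pod: "web-1" } => 2
}

aggregate_over(samples, [:email_type])
# => { { email_type: :service_desk } => 12, { email_type: :create_issue } => 2 }
```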

Open a merge request with these changes and request a review from a Scalability
team member.

When these changes are merged and the aggregations are recorded in
[Thanos](https://thanos.gitlab.net), query Thanos to see
the success ratio of the new aggregated metrics. For example:

```prometheus
sum by (environment, stage, type)(gitlab_sli_aggregation:rails_request_apdex:apdex:success:rate_1h)
/
sum by (environment, stage, type)(gitlab_sli_aggregation:rails_request_apdex:apdex:weight:rate_1h)
```

This shows the success ratio, which can guide you to set an
appropriate SLO when adding this SLI to a service.

Then, add the SLI to the appropriate service
catalog file. For example, the [`web` service](https://gitlab.com/gitlab-com/runbooks/-/blob/2b7be37a006c236bd684a4e6a1fbf4c66158292a/metrics-catalog/services/web.jsonnet#L198):

```jsonnet
rails_requests:
  sliLibrary.get('rails_request_apdex')
    .generateServiceLevelIndicator({ job: 'gitlab-rails' })
```

To pass extra selectors and override properties of the SLI, see the
[service monitoring documentation](https://gitlab.com/gitlab-com/runbooks/blob/master/metrics-catalog/README.md).

For SLIs with statically defined feature categories, alerts can already
be routed to specified Slack channels. For more information, read the
[alert routing documentation](https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/uncategorized/alert-routing.md).
In [this project](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/614)
we are extending this so alerts for SLIs with a `feature_category`
label in the source metrics can also be routed.

If you have any questions, create an issue in
[the Scalability issue tracker](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues)
or find us in
[#g_scalability](https://gitlab.slack.com/archives/CMMF8TKR9) on Slack.