---
stage: none
group: unassigned
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
---

# GitLab scalability

This section describes the current architecture of GitLab as it relates to
scalability and reliability.

## Reference Architecture Overview

![Reference Architecture Diagram](img/reference_architecture.png)

_[diagram source - GitLab employees only](https://docs.google.com/drawings/d/1RTGtuoUrE0bDT-9smoHbFruhEMI4Ys6uNrufe5IA-VI/edit)_

The diagram above shows a GitLab reference architecture scaled up for 50,000
users. We discuss each component below.

## Components

### PostgreSQL

The PostgreSQL database holds all metadata for projects, issues, merge
requests, users, and so on. The schema is managed by the Rails application
in [db/structure.sql](https://gitlab.com/gitlab-org/gitlab/-/blob/master/db/structure.sql).

GitLab Web/API servers and Sidekiq nodes talk directly to the database by using a
Rails object relational model (ORM). Most SQL queries are generated through this
ORM, although some custom SQL is also written for performance or to
exploit advanced PostgreSQL features (like recursive CTEs or LATERAL JOINs).
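
To illustrate the two access patterns, here is a minimal sketch that could be run in a
Rails console. The `Issue` query and the `namespaces` hierarchy walk are simplified
examples, not the exact queries GitLab issues.

```ruby
# Typical access goes through the ActiveRecord ORM; model, column, and
# filter values here are illustrative.
open_issues = Issue.where(project_id: 1, state: "opened").limit(20)

# Hand-written SQL is used where the ORM falls short, for example a
# recursive CTE that walks a hierarchy of namespaces.
descendant_ids = ActiveRecord::Base.connection.select_values(<<~SQL)
  WITH RECURSIVE descendants AS (
    SELECT id FROM namespaces WHERE id = 1
    UNION
    SELECT namespaces.id
    FROM namespaces
    INNER JOIN descendants ON namespaces.parent_id = descendants.id
  )
  SELECT id FROM descendants
SQL
```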

The application has a tight coupling to the database schema. When the
application starts, Rails queries the database schema and caches the tables and
column types for the data requested. Because of this schema cache, dropping a
column or table while the application is running can produce 500 errors for
users. This is why we have a
[process for dropping columns and other no-downtime changes](avoiding_downtime_in_migrations.md).
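
The linked guide describes the full process. As a rough sketch of the underlying Rails
mechanism (the column name below is hypothetical, and GitLab wraps this in its own
migration helpers), the column is first ignored, then dropped in a later release:

```ruby
# Step 1: stop referencing the column so the schema cache no longer
# includes it. Deploy this before dropping anything.
class User < ApplicationRecord
  self.ignored_columns = %w[legacy_flag] # hypothetical column
end

# Step 2: in a follow-up release, once all nodes run the code above,
# the column can be dropped without causing 500 errors.
class RemoveLegacyFlagFromUsers < ActiveRecord::Migration[6.0]
  def change
    remove_column :users, :legacy_flag, :boolean
  end
end
```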

#### Multi-tenancy

A single database is used to store all customer data. Each user can belong to
many groups or projects, and the access level (such as guest, developer, or
maintainer) for those groups and projects determines what users can see and
what they can access.

Users with the administrator role can access all projects and even impersonate
users.
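
At the query level, this membership-based authorization looks roughly like the sketch
below. The models, associations, and access-level values are simplified stand-ins, not
the actual GitLab schema.

```ruby
# Simplified stand-in models: a join table records each user's membership
# in a project together with an access level.
class ProjectMember < ApplicationRecord
  belongs_to :user
  belongs_to :project

  ACCESS_LEVELS = { guest: 10, developer: 30, maintainer: 40 }.freeze
end

class Project < ApplicationRecord
  has_many :project_members

  # Because all tenants share one database, authorization happens in SQL
  # rather than by separating customers into different databases.
  scope :visible_to, ->(user, minimum_level) {
    joins(:project_members)
      .where(project_members: { user_id: user.id })
      .where("project_members.access_level >= ?", minimum_level)
  }
end

# Project.visible_to(current_user, ProjectMember::ACCESS_LEVELS[:developer])
```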

#### Sharding and partitioning

The database is not divided up in any way; currently all data lives in
one database in many different tables. This works for simple
applications, but as the data set grows, it becomes more challenging to
maintain and support one database with tables that have many rows.

There are two ways to deal with this:

- Partitioning. Split up table data locally within the same database.
- Sharding. Distribute data across multiple databases.

Partitioning is a built-in PostgreSQL feature and requires minimal changes
in the application. However, it
[requires PostgreSQL 11](https://www.2ndquadrant.com/en/blog/partitioning-evolution-postgresql-11/).

A natural way to partition is to
[partition tables by dates](https://gitlab.com/groups/gitlab-org/-/epics/2023). For example,
the `events` and `audit_events` tables are natural candidates for this
kind of partitioning.
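
As a sketch of what date-based partitioning could look like (the table definition is
trimmed and simplified, not the real `audit_events` schema), a Rails migration can drop
down to raw SQL and use PostgreSQL 11 declarative partitioning:

```ruby
class PartitionAuditEvents < ActiveRecord::Migration[6.0]
  def up
    execute <<~SQL
      -- Parent table is partitioned by range on created_at; the primary
      -- key must include the partition key.
      CREATE TABLE audit_events_part (
        id bigserial NOT NULL,
        author_id bigint NOT NULL,
        created_at timestamptz NOT NULL,
        details text,
        PRIMARY KEY (id, created_at)
      ) PARTITION BY RANGE (created_at);

      -- One partition per month; old months can be detached or dropped cheaply.
      CREATE TABLE audit_events_part_2020_01
        PARTITION OF audit_events_part
        FOR VALUES FROM ('2020-01-01') TO ('2020-02-01');
    SQL
  end

  def down
    execute "DROP TABLE audit_events_part"
  end
end
```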

Sharding is likely more difficult and requires significant changes
to the schema and application. For example, if we have to store projects
in many different databases, we immediately run into the question, "How
can we retrieve data across different projects?" One answer to this is
to abstract data access into API calls that abstract the database from
the application, but this is a significant amount of work.

There are solutions that may help abstract the sharding to some extent
from the application. For example, we want to look closely at
[Citus Data](https://www.citusdata.com/product/community). Citus Data
provides a Rails plugin that adds a
[tenant ID to ActiveRecord models](https://www.citusdata.com/blog/2017/01/05/easily-scale-out-multi-tenant-apps/).
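
A rough sketch of that approach, assuming the `activerecord-multi-tenant` gem from the
blog post above (model names are illustrative):

```ruby
# Gemfile: gem "activerecord-multi-tenant"

class Organization < ApplicationRecord
  multi_tenant :organization # this model is the tenant
end

class Issue < ApplicationRecord
  multi_tenant :organization # adds and scopes an organization_id column
end

# Queries inside the block are automatically scoped by the tenant ID,
# which is what lets Citus route them to the right shard.
MultiTenant.with(Organization.find(42)) do
  Issue.where(state: "opened").count
end
```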

Sharding can also be done based on feature verticals. This is the
microservice approach to sharding, where each service represents a
bounded context and operates on its own service-specific database
cluster. In that model, data wouldn't be distributed according to some
internal key (such as tenant IDs) but based on team and product
ownership. It shares a lot of challenges with traditional, data-oriented
sharding, however. For instance, joining data has to happen in the
application itself rather than on the query layer (although additional
layers like GraphQL might mitigate that), and it requires true
parallelism to run efficiently (that is, a scatter-gather model to collect,
then zip up data records), which is a challenge in itself in Ruby-based
systems.

#### Database size

A recent
[database checkup shows a breakdown of the table sizes on GitLab.com](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/8022#master-1022016101-8).
Since `merge_request_diff_files` contains over 1 TB of data, we want to
reduce or eliminate this table first. GitLab has support for
[storing diffs in object storage](../administration/merge_request_diffs.md), which we
[want to do on GitLab.com](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/7356).

#### High availability

There are several strategies to provide high availability and redundancy:

- Write-ahead logs (WAL) streamed to object storage (for example, S3 or Google Cloud
  Storage).
- Read replicas (hot backups).
- Delayed replicas.

To restore a database to a point in time, a base backup needs to have
been taken prior to that incident. Once a database has been restored from
that backup, the database can apply the WAL in order until it
has reached the target time.

On GitLab.com, Consul and Patroni work together to coordinate failovers with
the read replicas. [Omnibus ships with Patroni](../administration/postgresql/replication_and_failover.md).

#### Load-balancing

GitLab EE has [application support for load balancing using read replicas](../administration/postgresql/database_load_balancing.md).
This load balancer performs checks that aren't traditionally available in standard load balancers. For
example, the application considers a replica only if its replication lag is low
(for example, WAL data behind by less than 100 MB).
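
As an illustration of the kind of check involved (this is not the actual load balancer
code, and the hosts are placeholders), an application can compare the primary's current
WAL position with how far a replica has replayed:

```ruby
require "pg"

MAX_LAG_BYTES = 100 * 1024 * 1024 # 100 MB, mirroring the example above

primary = PG.connect(host: "primary.example.com", dbname: "gitlabhq_production")
replica = PG.connect(host: "replica-1.example.com", dbname: "gitlabhq_production")

# Current write position on the primary.
primary_lsn = primary.exec("SELECT pg_current_wal_insert_lsn()").getvalue(0, 0)

# How many bytes of WAL the replica still has to replay to catch up.
lag_bytes = replica.exec_params(
  "SELECT pg_wal_lsn_diff($1::pg_lsn, pg_last_wal_replay_lsn())", [primary_lsn]
).getvalue(0, 0).to_i

use_replica = lag_bytes < MAX_LAG_BYTES
```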

More [details are in a blog post](https://about.gitlab.com/blog/2017/10/02/scaling-the-gitlab-database/).

### PgBouncer

As PostgreSQL forks a backend process for each connection, PostgreSQL has a
finite limit of connections that it can support, typically around 300 by
default. Without a connection pooler like PgBouncer, it's quite possible to
hit connection limits. Once the limit is reached, GitLab generates
errors or slows down as it waits for a connection to become available.

#### High availability

PgBouncer is a single-threaded process. Under heavy traffic, PgBouncer can
saturate a single core, which can result in slower response times for
background jobs and/or Web requests. There are two ways to address this
limitation:

- Run multiple PgBouncer instances.
- Use a multi-threaded connection pooler (for example,
  [Odyssey](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/7776)).

On some Linux systems, it's possible to run
[multiple PgBouncer instances on the same port](https://gitlab.com/gitlab-org/omnibus-gitlab/-/issues/4796).

On GitLab.com, we run multiple PgBouncer instances on different ports to
avoid saturating a single core.

In addition, the PgBouncer instances that communicate with the primary
and secondaries are set up a bit differently:

- Multiple PgBouncer instances in different availability zones talk to the
  PostgreSQL primary.
- Multiple PgBouncer processes are colocated with PostgreSQL read replicas.

For replicas, colocating is advantageous because it reduces network hops
and hence latency. However, for the primary, colocating is
disadvantageous because PgBouncer would become a single point of failure
and cause errors. When a failover occurs, one of two things could
happen:

- The primary disappears from the network.
- The primary becomes a replica.

In the first case, if PgBouncer is colocated with the primary, database
connections would time out or fail to connect, and downtime would
occur. Having multiple PgBouncer instances in front of a load balancer
talking to the primary can mitigate this.

In the second case, existing connections to the newly demoted primary (now
a replica) may execute a write query, which would fail. During a failover, it may
be advantageous to shut down the PgBouncer talking to the primary to
ensure no more traffic arrives for it. The alternative would be to make
the application aware of the failover event and terminate its
connections gracefully.

### Redis

There are three ways Redis is used in GitLab:

- Queues: Sidekiq marshals job arguments into JSON payloads and pushes them onto queues.
- Persistent state: Session data and exclusive leases.
- Cache: Repository data (like branch and tag names) and view partials.

For GitLab instances running at scale, splitting Redis usage into
separate Redis clusters helps for two reasons:

- Each has different persistence requirements.
- Load isolation.

For example, the cache instance can behave like a least-recently used
(LRU) cache by setting the `maxmemory` configuration option. That option
should not be set for the queues or persistent clusters because data
would be evicted from memory at random times. This would cause jobs to
be dropped on the floor, which would cause many problems (like merges
not running or builds not updating).
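
A sketch of that separation with the `redis` gem (hostnames and sizes are illustrative;
in practice `maxmemory` is set in the Redis configuration rather than at runtime):

```ruby
require "redis"

# One connection per usage class, each pointing at a dedicated instance.
redis_queues = Redis.new(url: "redis://redis-queues.internal:6379")
redis_state  = Redis.new(url: "redis://redis-state.internal:6379")
redis_cache  = Redis.new(url: "redis://redis-cache.internal:6379")

# Only the cache instance is allowed to evict: cap its memory and make it
# behave like an LRU cache.
redis_cache.call("CONFIG", "SET", "maxmemory", (8 * 1024**3).to_s)
redis_cache.call("CONFIG", "SET", "maxmemory-policy", "allkeys-lru")

# The queue and persistent-state instances keep the default noeviction
# policy so jobs, sessions, and leases are never silently dropped.
```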

Sidekiq also polls its queues quite frequently, and this activity can
slow down other queries. For this reason, having a dedicated Redis
cluster for Sidekiq can help improve performance and reduce load on the
Redis process.

#### High availability/Risks

Single-core: Like PgBouncer, a single Redis process can only use one
core. It does not support multi-threading.

Dumb secondaries: Redis secondaries (also known as replicas) don't actually
handle any load. Unlike PostgreSQL secondaries, they don't even serve
read queries. They simply replicate data from the primary and take over
only when the primary fails.

### Redis Sentinels

[Redis Sentinel](https://redis.io/topics/sentinel) provides high
availability for Redis by watching the primary. If multiple Sentinels
detect that the primary has gone away, the Sentinels perform an
election to determine a new leader.

#### Failure Modes

No leader: A Redis cluster can get into a mode where there are no
primaries. For example, this can happen if Redis nodes are misconfigured
to follow the wrong node. Sometimes this requires forcing one node to
become a primary by using the
[`REPLICAOF NO ONE` command](https://redis.io/commands/replicaof).
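
A sketch of how a client and an operator interact with a Sentinel-managed group
(hostnames and the `gitlab-redis` group name are placeholders):

```ruby
require "redis"

SENTINELS = [
  { host: "sentinel-1.internal", port: 26379 },
  { host: "sentinel-2.internal", port: 26379 },
  { host: "sentinel-3.internal", port: 26379 }
].freeze

# The client asks the Sentinels where the current primary is, so a
# failover is transparent to the application.
redis = Redis.new(url: "redis://gitlab-redis", sentinels: SENTINELS, role: :master)
redis.ping

# Recovering from the "no leader" state described above: force one node
# to stop replicating and become the primary.
stuck_node = Redis.new(host: "redis-1.internal", port: 6379)
stuck_node.call("REPLICAOF", "NO", "ONE")
```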

### Sidekiq

Sidekiq is a multi-threaded background job processing system used in
Ruby on Rails applications. In GitLab, Sidekiq performs the heavy
lifting of many activities, including:

- Updating merge requests after a push.
- Sending email messages.
- Updating user authorizations.
- Processing CI builds and pipelines.

The full list of jobs can be found in the
[`app/workers`](https://gitlab.com/gitlab-org/gitlab/-/tree/master/app/workers)
and
[`ee/app/workers`](https://gitlab.com/gitlab-org/gitlab/-/tree/master/ee/app/workers)
directories in the GitLab codebase.
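
For context, a worker has roughly the following shape. This is a simplified, hypothetical
example; real workers in `app/workers` include additional GitLab-specific concerns.

```ruby
class ExpireProjectCacheWorker
  include Sidekiq::Worker

  sidekiq_options queue: :default, retry: 3

  # Arguments must be JSON-serializable because jobs are stored in Redis.
  def perform(project_id)
    Rails.cache.delete("project:#{project_id}:summary") # hypothetical work
  end
end

# Enqueue asynchronously: Sidekiq pushes a JSON payload onto the Redis
# queue, and a worker thread picks it up later.
ExpireProjectCacheWorker.perform_async(42)
```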

#### Runaway Queues

As jobs are added to the Sidekiq queue, Sidekiq worker threads need to
pull these jobs from the queue and finish them at a rate faster than
they are added. When an imbalance occurs (for example, delays in the database
or slow jobs), Sidekiq queues can balloon and lead to runaway queues.

In recent months, many of these queues have ballooned due to delays in
PostgreSQL, PgBouncer, and Redis. For example, PgBouncer saturation can
cause jobs to wait a few seconds before obtaining a database connection,
which can cascade into a large slowdown. Optimizing these basic
interconnections comes first.

However, there are a number of strategies to ensure queues get drained
in a timely manner:

- Add more processing capacity. This can be done by spinning up more
  instances of Sidekiq or [Sidekiq Cluster](../administration/operations/extra_sidekiq_processes.md).
- Split jobs into smaller units of work. For example, `PostReceive`
  used to process each commit message in the push, but now it farms this
  out to `ProcessCommitWorker` (see the sketch after this list).
- Redistribute/gerrymander Sidekiq processes by queue
  types. Long-running jobs (for example, relating to project import) can often
  squeeze out jobs that run fast (for example, delivering email). [This technique
  was used to optimize our existing Sidekiq deployment](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/7219#note_218019483).
- Optimize jobs. Eliminating unnecessary work, reducing network calls
  (including SQL and Gitaly), and optimizing processor time can yield significant
  benefits.
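
The fan-out pattern from the second item looks roughly like the following. This is a
simplified sketch with hypothetical class names, not the actual `PostReceive`
implementation.

```ruby
class PushEventWorker
  include Sidekiq::Worker

  def perform(project_id, commit_ids)
    # Fan out: many small, fast jobs drain more predictably than one large
    # job that can occupy a worker thread for minutes.
    commit_ids.each do |commit_id|
      SingleCommitWorker.perform_async(project_id, commit_id)
    end
  end
end

class SingleCommitWorker
  include Sidekiq::Worker

  def perform(project_id, commit_id)
    # Process a single commit: create cross-references, send notifications,
    # and so on.
  end
end
```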

From the Sidekiq logs, it's possible to see which jobs run the most
frequently and/or take the longest. For example, this Kibana
visualization shows the jobs that consume the most total time:

![Most time-consuming Sidekiq jobs](img/sidekiq_most_time_consuming_jobs.png)

_[visualization source - GitLab employees only](https://log.gitlab.net/goto/2c036582dfc3219eeaa49a76eab2564b)_

This visualization shows the jobs that had the longest durations:

![Longest running Sidekiq jobs](img/sidekiq_longest_running_jobs.png)

_[visualization source - GitLab employees only](https://log.gitlab.net/goto/494f6c8afb61d98c4ff264520d184416)_