---
stage: none
group: unassigned
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
---

# GitLab scalability

This section describes the current architecture of GitLab as it relates to
scalability and reliability.

## Reference Architecture Overview

![Reference Architecture Diagram](img/reference_architecture.png)

_[diagram source - GitLab employees only](https://docs.google.com/drawings/d/1RTGtuoUrE0bDT-9smoHbFruhEMI4Ys6uNrufe5IA-VI/edit)_

The diagram above shows a GitLab reference architecture scaled up for 50,000
users. We discuss each component below.

## Components

### PostgreSQL

The PostgreSQL database holds all metadata for projects, issues, merge
requests, users, and so on. The schema is managed by the Rails application
and lives in
[`db/structure.sql`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/db/structure.sql).

GitLab Web/API servers and Sidekiq nodes talk directly to the database through
the Rails object-relational mapper (ORM). Most SQL queries are generated through
this ORM, although some custom SQL is also written for performance or for
exploiting advanced PostgreSQL features (like recursive CTEs or LATERAL JOINs).

The application has a tight coupling to the database schema. When the
application starts, Rails queries the database schema, caching the tables and
column types for the data requested. Because of this schema cache, dropping a
column or table while the application is running can produce 500 errors to the
user. This is why we have a [process for dropping columns and other
no-downtime changes](avoiding_downtime_in_migrations.md).

#### Multi-tenancy

A single database is used to store all customer data. Each user can belong to
many groups or projects, and the access level (including guest, developer, or
maintainer) to groups and projects determines what users can see and
what they can access.

Users with the administrator role can access all projects and even impersonate
users.

#### Sharding and partitioning

The database is not divided up in any way; currently all data lives in
one database in many different tables. This works for simple
applications, but as the data set grows, it becomes more challenging to
maintain and support one database with many large tables.

There are two ways to deal with this:

- Partitioning. Locally split up table data.
- Sharding. Distribute data across multiple databases.

Partitioning is a built-in PostgreSQL feature and requires minimal changes
in the application. However, it [requires PostgreSQL
11](https://www.2ndquadrant.com/en/blog/partitioning-evolution-postgresql-11/).

A natural way to partition is to [partition tables by
dates](https://gitlab.com/groups/gitlab-org/-/epics/2023). For example,
the `events` and `audit_events` tables are natural candidates for this
kind of partitioning.
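As a rough illustration of what date-based partitioning looks like with
PostgreSQL 11 declarative partitioning, here is a minimal Rails migration
sketch. The table name, columns, and partition bounds are made up for the
example, and GitLab's actual migrations use dedicated partitioning helpers
rather than raw DDL like this:

```ruby
# Illustrative only: a date-partitioned table loosely resembling audit_events.
class CreatePartitionedAuditEventsExample < ActiveRecord::Migration[6.0]
  def up
    # The partition key (created_at) must be part of the primary key.
    execute <<~SQL
      CREATE TABLE audit_events_example (
        id bigserial NOT NULL,
        author_id bigint NOT NULL,
        details text,
        created_at timestamptz NOT NULL,
        PRIMARY KEY (id, created_at)
      ) PARTITION BY RANGE (created_at)
    SQL

    # One partition per month; new partitions must be created ahead of time.
    execute <<~SQL
      CREATE TABLE audit_events_example_202001 PARTITION OF audit_events_example
        FOR VALUES FROM ('2020-01-01') TO ('2020-02-01')
    SQL
  end

  def down
    execute 'DROP TABLE audit_events_example'
  end
end
```

Queries that filter on `created_at` then only scan the relevant partitions,
and old partitions can be dropped cheaply instead of being deleted row by row.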
Sharding is likely more difficult and requires significant changes
to the schema and application. For example, if we have to store projects
in many different databases, we immediately run into the question, "How
can we retrieve data across different projects?" One answer is to
abstract data access into API calls that hide the database from
the application, but this is a significant amount of work.

There are solutions that may help abstract the sharding to some extent
from the application. For example, we want to look closely at [Citus
Data](https://www.citusdata.com/product/community). Citus Data
provides a Rails plugin that adds a [tenant ID to ActiveRecord
models](https://www.citusdata.com/blog/2017/01/05/easily-scale-out-multi-tenant-apps/).

Sharding can also be done based on feature verticals. This is the
microservice approach to sharding, where each service represents a
bounded context and operates on its own service-specific database
cluster. In that model, data wouldn't be distributed according to some
internal key (such as tenant IDs) but based on team and product
ownership. It shares a lot of challenges with traditional, data-oriented
sharding, however. For instance, joining data has to happen in the
application itself rather than on the query layer (although additional
layers like GraphQL might mitigate that), and it requires true
parallelism to run efficiently (that is, a scatter-gather model to collect,
then zip up data records), which is a challenge in itself in Ruby-based
systems.

#### Database size

A recent [database checkup shows a breakdown of the table sizes on
GitLab.com](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/8022#master-1022016101-8).
Since `merge_request_diff_files` contains over 1 TB of data, we want to
reduce/eliminate this table first. GitLab has support for [storing diffs in
object storage](../administration/merge_request_diffs.md), which we [want to do on
GitLab.com](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/7356).

#### High availability

There are several strategies to provide high availability and redundancy:

- Write-ahead logs (WAL) streamed to object storage (for example, S3 or Google Cloud
  Storage).
- Read-replicas (hot backups).
- Delayed replicas.

To restore a database to a point in time, a base backup needs to have
been taken prior to that incident. Once the database has been restored from
that backup, it can apply the WAL in order until it reaches the target time.

On GitLab.com, Consul and Patroni work together to coordinate failovers with
the read replicas. [Omnibus ships with Patroni](../administration/postgresql/replication_and_failover.md).

#### Load-balancing

GitLab EE has [application support for load balancing using read
replicas](../administration/postgresql/database_load_balancing.md). This load
balancer performs some checks that aren't traditionally available in standard
load balancers. For example, the application considers a replica only if its
replication lag is low (for example, WAL data behind by less than 100 MB).

More [details are in a blog
post](https://about.gitlab.com/blog/2017/10/02/scaling-the-gitlab-database/).

### PgBouncer

As PostgreSQL forks a backend process for each connection, PostgreSQL has a
finite limit of connections that it can support, typically around 300 by
default. Without a connection pooler like PgBouncer, it's quite possible to
hit connection limits. Once the limit is reached, GitLab generates
errors or slows down as it waits for a connection to become available.
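To make the arithmetic concrete, here is a small illustrative calculation.
All node counts and pool sizes below are made-up assumptions, not GitLab.com
figures; the point is only how quickly direct connections add up:

```ruby
# Illustrative numbers only; real deployments differ.
web_nodes       = 20  # Rails web/API hosts
web_db_pool     = 10  # database connections per web host
sidekiq_nodes   = 10  # Sidekiq hosts
sidekiq_threads = 25  # one connection per Sidekiq worker thread

direct = web_nodes * web_db_pool + sidekiq_nodes * sidekiq_threads
puts direct # => 450 direct connections, already past a 300-connection limit

# A pooler such as PgBouncer in transaction pooling mode multiplexes those
# client connections onto a much smaller number of PostgreSQL backends.
pgbouncer_pool = 100
puts (direct.to_f / pgbouncer_pool).round(1) # => 4.5 clients per backend on average
```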
#### High availability

PgBouncer is a single-threaded process. Under heavy traffic, PgBouncer can
saturate a single core, which can result in slower response times for
background jobs and web requests. There are two ways to address this
limitation:

- Run multiple PgBouncer instances.
- Use a multi-threaded connection pooler (for example,
  [Odyssey](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/7776)).

On some Linux systems, it's possible to run [multiple PgBouncer instances on
the same port](https://gitlab.com/gitlab-org/omnibus-gitlab/-/issues/4796).

On GitLab.com, we run multiple PgBouncer instances on different ports to
avoid saturating a single core.

In addition, the PgBouncer instances that communicate with the primary
and secondaries are set up a bit differently:

- Multiple PgBouncer instances in different availability zones talk to the
  PostgreSQL primary.
- Multiple PgBouncer processes are colocated with PostgreSQL read replicas.

For replicas, colocating is advantageous because it reduces network hops
and hence latency. However, for the primary, colocating is
disadvantageous because PgBouncer would become a single point of failure
and cause errors. When a failover occurs, one of two things could
happen:

- The primary disappears from the network.
- The primary becomes a replica.

In the first case, if PgBouncer is colocated with the primary, database
connections would time out or fail to connect, and downtime would
occur. Having multiple PgBouncer instances in front of a load balancer
talking to the primary can mitigate this.

In the second case, existing connections to the newly-demoted replica
may execute a write query, which would fail. During a failover, it may
be advantageous to shut down the PgBouncer talking to the primary to
ensure no more traffic arrives for it. The alternative would be to make
the application aware of the failover event and terminate its
connections gracefully.

### Redis

There are three ways Redis is used in GitLab:

- Queues: Sidekiq marshals jobs into JSON payloads and stores them in queues.
- Persistent state: Session data and exclusive leases.
- Cache: Repository data (like branch and tag names) and view partials.

For GitLab instances running at scale, splitting Redis usage into
separate Redis clusters helps for two reasons:

- Each use case has different persistence requirements.
- Load isolation.

For example, the cache instance can behave like a least-recently used
(LRU) cache by setting the `maxmemory` configuration option. That option
should not be set for the queues or persistent clusters because data
would be evicted from memory at random times. This would cause jobs to
be dropped on the floor, which would cause many problems (like merges
not running or builds not updating).
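The exclusive leases mentioned above are a good illustration of why the
persistent instance must not evict keys: a lease is just a key with a TTL that
is set only if it doesn't already exist, and an evicted lease key would let
two processes hold the same lease. A minimal sketch with the Ruby `redis` gem
(the key name and work being protected are made up for the example):

```ruby
require 'redis'
require 'securerandom'

redis = Redis.new # assumes a local Redis; adjust the connection URL as needed

lease_key = 'exclusive_lease:project_housekeeping:42' # illustrative key name
token     = SecureRandom.uuid

# SET with NX and EX: take the lease only if nobody holds it, for 10 minutes.
if redis.set(lease_key, token, nx: true, ex: 600)
  begin
    puts 'lease acquired, running the protected work once'
  ensure
    # Release only if we still own it. (A production implementation would do
    # this compare-and-delete atomically, for example with a Lua script.)
    redis.del(lease_key) if redis.get(lease_key) == token
  end
else
  puts 'lease held by another process, skipping'
end
```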
Sidekiq also polls its queues quite frequently, and this activity can
slow down other queries. For this reason, having a dedicated Redis
cluster for Sidekiq can help improve performance and reduce load on the
Redis process.

#### High availability/Risks

Single-core: Like PgBouncer, a single Redis process can only use one
core. It does not support multi-threading.

Dumb secondaries: Redis secondaries (also known as replicas) don't actually
handle any load. Unlike PostgreSQL secondaries, they don't even serve
read queries. They simply replicate data from the primary and take over
only when the primary fails.

### Redis Sentinels

[Redis Sentinel](https://redis.io/topics/sentinel) provides high
availability for Redis by watching the primary. If multiple Sentinels
detect that the primary has gone away, the Sentinels perform an
election to determine a new leader.

#### Failure Modes

No leader: A Redis cluster can get into a mode where there are no
primaries. For example, this can happen if Redis nodes are misconfigured
to follow the wrong node. Sometimes this requires forcing one node to
become a primary by using the [`REPLICAOF NO ONE`
command](https://redis.io/commands/replicaof).

### Sidekiq

Sidekiq is a multi-threaded background job processing system used in
Ruby on Rails applications. In GitLab, Sidekiq performs the heavy
lifting of many activities, including:

- Updating merge requests after a push.
- Sending email messages.
- Updating user authorizations.
- Processing CI builds and pipelines.

The full list of jobs can be found in the
[`app/workers`](https://gitlab.com/gitlab-org/gitlab/-/tree/master/app/workers)
and
[`ee/app/workers`](https://gitlab.com/gitlab-org/gitlab/-/tree/master/ee/app/workers)
directories in the GitLab codebase.

#### Runaway Queues

As jobs are added to the Sidekiq queue, Sidekiq worker threads need to
pull these jobs from the queue and finish them at a rate faster than
they are added. When an imbalance occurs (for example, delays in the database
or slow jobs), Sidekiq queues can balloon and lead to runaway queues.

In recent months, many of these queues have ballooned due to delays in
PostgreSQL, PgBouncer, and Redis. For example, PgBouncer saturation can
cause jobs to wait a few seconds before obtaining a database connection,
which can cascade into a large slowdown. Optimizing these basic
interconnections comes first.

However, there are a number of strategies to ensure queues get drained
in a timely manner:

- Add more processing capacity. This can be done by spinning up more
  instances of Sidekiq or [Sidekiq Cluster](../administration/operations/extra_sidekiq_processes.md).
- Split jobs into smaller units of work. For example, `PostReceive`
  used to process each commit message in the push, but now it farms
  this work out to `ProcessCommitWorker` (see the sketch after this list).
- Redistribute/gerrymander Sidekiq processes by queue
  types. Long-running jobs (for example, relating to project import) can often
  squeeze out jobs that run fast (for example, delivering email). [This technique
  was used to optimize our existing Sidekiq deployment](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/7219#note_218019483).
- Optimize jobs. Eliminating unnecessary work, reducing network calls
  (including SQL and Gitaly), and optimizing processor time can yield significant
  benefits.
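As an illustration of the job-splitting strategy, here is a minimal sketch of
a parent job that fans out one small job per commit instead of processing the
whole push inline. The worker class names are hypothetical and only loosely
modeled on `PostReceive` and `ProcessCommitWorker`:

```ruby
require 'sidekiq'

# The parent job stays small and fast: it only enqueues child jobs.
class PushReceivedWorker
  include Sidekiq::Worker

  def perform(project_id, commit_ids)
    commit_ids.each do |commit_id|
      SingleCommitWorker.perform_async(project_id, commit_id)
    end
  end
end

# Each commit becomes its own unit of work that can be retried independently
# and spread across many Sidekiq threads and nodes.
class SingleCommitWorker
  include Sidekiq::Worker

  def perform(project_id, commit_id)
    # Process one commit: reference issues, trigger hooks, and so on.
  end
end
```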
From the Sidekiq logs, it's possible to see which jobs run the most
frequently and/or take the longest. For example, these Kibana
visualizations show the jobs that consume the most total time:

![Most time-consuming Sidekiq jobs](img/sidekiq_most_time_consuming_jobs.png)

_[visualization source - GitLab employees only](https://log.gitlab.net/goto/2c036582dfc3219eeaa49a76eab2564b)_

This shows the jobs that had the longest durations:

![Longest running Sidekiq jobs](img/sidekiq_longest_running_jobs.png)

_[visualization source - GitLab employees only](https://log.gitlab.net/goto/494f6c8afb61d98c4ff264520d184416)_
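Sidekiq already logs job start and completion with elapsed time, and a server
middleware is the standard extension point for adding custom timing or
metadata. The following is an illustrative sketch only, not GitLab's actual
structured logging pipeline:

```ruby
# config/initializers/sidekiq.rb — illustrative timing middleware.
class JobTimingLogger
  # Sidekiq server middleware interface: wraps every job execution.
  def call(_worker, job, queue)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
  ensure
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    Sidekiq.logger.info("class=#{job['class']} queue=#{queue} duration_s=#{elapsed.round(3)}")
  end
end

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add JobTimingLogger
  end
end
```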