---
stage: none
group: unassigned
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
---

# Generating chaos in a test GitLab instance

<!-- vale gitlab.Spelling = NO -->

As [Werner Vogels](https://twitter.com/Werner), the CTO at Amazon Web Services, famously put it, **Everything fails, all the time**.

<!-- vale gitlab.Spelling = YES -->

As a developer, it's as important to consider the failure modes in which your software may operate as it is to consider normal operation. Doing so can mean the difference between a minor hiccup leading to a scattering of `500` errors experienced by a tiny fraction of users, and a full site outage that affects all users for an extended period.

To paraphrase [Tolstoy](https://en.wikipedia.org/wiki/Anna_Karenina_principle), _all happy servers are alike, but all failing servers are failing in their own way_. Luckily, there are ways we can attempt to simulate these failure modes, and the chaos endpoints are tools for assisting in this process.

Currently, there are endpoints for simulating the following conditions:

- Slow requests.
- CPU-bound requests.
- Memory leaks.
- Unexpected process crashes.

## Enabling chaos endpoints

For obvious reasons, these endpoints are not enabled by default on `production`.
They are enabled by default on **development** environments.

WARNING:
You must secure access to the chaos endpoints using a secret token.
You should not enable them in production unless you absolutely know what you're doing.

A secret token can be set through the `GITLAB_CHAOS_SECRET` environment variable.
For example, when using the [GDK](https://gitlab.com/gitlab-org/gitlab-development-kit)
this can be done with the following command:

```shell
GITLAB_CHAOS_SECRET=secret gdk start
```

Replace `secret` with your own secret token.

## Invoking chaos

After you have enabled the chaos endpoints and restarted the application, you can start testing using the endpoints.

By default, when invoking a chaos endpoint, the web worker process that receives the request handles it. This means, for example, that if the Kill
operation is invoked, the Puma worker process handling the request is killed. To test these operations in Sidekiq, the `async` parameter on
each endpoint can be set to `true`. This runs the chaos process in a Sidekiq worker.

## Memory leaks

To simulate a memory leak in your application, use the `/-/chaos/leakmem` endpoint.

The memory is not retained after the request finishes. After the request has completed, the Ruby garbage collector attempts to recover the memory.

```plaintext
GET /-/chaos/leakmem
GET /-/chaos/leakmem?memory_mb=1024
GET /-/chaos/leakmem?memory_mb=1024&duration_s=50
GET /-/chaos/leakmem?memory_mb=1024&duration_s=50&async=true
```

| Attribute    | Type    | Required | Description                                                                          |
| ------------ | ------- | -------- | ------------------------------------------------------------------------------------ |
| `memory_mb`  | integer | no       | How much memory, in MB, should be leaked. Defaults to 100MB.                         |
| `duration_s` | integer | no       | Minimum duration, in seconds, that the memory should be retained. Defaults to 30s.   |
| `async`      | boolean | no       | Set to true to leak memory in a Sidekiq background worker process.                   |

```shell
curl "http://localhost:3000/-/chaos/leakmem?memory_mb=1024&duration_s=10" \
  --header 'X-Chaos-Secret: secret'
curl "http://localhost:3000/-/chaos/leakmem?memory_mb=1024&duration_s=10&token=secret"
```

## CPU spin

This endpoint attempts to fully utilize a single core, at 100%, for the given period.

Depending on your rack server setup, your request may time out after a predetermined period (normally 60 seconds).

```plaintext
GET /-/chaos/cpu_spin
GET /-/chaos/cpu_spin?duration_s=50
GET /-/chaos/cpu_spin?duration_s=50&async=true
```

| Attribute    | Type    | Required | Description                                                           |
| ------------ | ------- | -------- | --------------------------------------------------------------------- |
| `duration_s` | integer | no       | Duration, in seconds, that the core is used. Defaults to 30s.         |
| `async`      | boolean | no       | Set to true to consume CPU in a Sidekiq background worker process.    |

```shell
curl "http://localhost:3000/-/chaos/cpu_spin?duration_s=60" \
  --header 'X-Chaos-Secret: secret'
curl "http://localhost:3000/-/chaos/cpu_spin?duration_s=60&token=secret"
```

## DB spin

This endpoint attempts to fully utilize a single core, and interleaves it with DB requests, for the given period.
This endpoint can be used to model yielding execution to other threads when running concurrently.

Depending on your rack server setup, your request may time out after a predetermined period (normally 60 seconds).

```plaintext
GET /-/chaos/db_spin
GET /-/chaos/db_spin?duration_s=50
GET /-/chaos/db_spin?duration_s=50&async=true
```

| Attribute    | Type    | Required | Description                                                                  |
| ------------ | ------- | -------- | ---------------------------------------------------------------------------- |
| `interval_s` | float   | no       | Interval, in seconds, for every DB request. Defaults to 1s.                  |
| `duration_s` | integer | no       | Duration, in seconds, that the core is used. Defaults to 30s.                |
| `async`      | boolean | no       | Set to true to perform the operation in a Sidekiq background worker process. |

```shell
curl "http://localhost:3000/-/chaos/db_spin?interval_s=1&duration_s=60" \
  --header 'X-Chaos-Secret: secret'
curl "http://localhost:3000/-/chaos/db_spin?interval_s=1&duration_s=60&token=secret"
```

## Sleep

This endpoint is similar to the CPU Spin endpoint but simulates off-processor activity, such as network calls to backend services. It sleeps for a given `duration_s`.

As with the CPU Spin endpoint, this may lead to your request timing out if `duration_s` exceeds the configured limit.

```plaintext
GET /-/chaos/sleep
GET /-/chaos/sleep?duration_s=50
GET /-/chaos/sleep?duration_s=50&async=true
```

| Attribute    | Type    | Required | Description                                                            |
| ------------ | ------- | -------- | ---------------------------------------------------------------------- |
| `duration_s` | integer | no       | Duration, in seconds, that the request sleeps for. Defaults to 30s.    |
| `async`      | boolean | no       | Set to true to sleep in a Sidekiq background worker process.           |

```shell
curl "http://localhost:3000/-/chaos/sleep?duration_s=60" \
  --header 'X-Chaos-Secret: secret'
curl "http://localhost:3000/-/chaos/sleep?duration_s=60&token=secret"
```

## Kill

This endpoint simulates the unexpected death of a worker process using the `KILL` signal.

Because this endpoint uses the `KILL` signal, the process isn't given an
opportunity to clean up or shut down.
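
As a local illustration that does not involve GitLab, the following sketch shows why a process that receives `KILL` cannot clean up: its `trap` handlers never run. The temporary log file and the variable names are assumptions made for this example.

```shell
# Hypothetical local demo: a background subshell registers a cleanup handler,
# then receives KILL. File and variable names are illustrative only.
cleanup_log="$(mktemp)"

(
  # This handler would run on TERM or on a normal exit, but never on KILL.
  trap 'echo "cleaned up" >> "$cleanup_log"' EXIT TERM
  while :; do sleep 1; done
) &
pid=$!

sleep 1               # Give the subshell time to install its trap.
kill -KILL "$pid"     # KILL cannot be caught, blocked, or ignored.

# Collect the exit status without aborting if the shell runs with `set -e`.
status=0
wait "$pid" || status=$?

echo "exit status: $status"
if [ -s "$cleanup_log" ]; then
  echo "cleanup handler ran"
else
  echo "cleanup handler did not run"
fi
```

On common shells the reported status is 137 (128 plus signal number 9), and the log file stays empty. Sending `TERM` instead of `KILL` would let the handler run.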

```plaintext
GET /-/chaos/kill
GET /-/chaos/kill?async=true
```

| Attribute    | Type    | Required | Description                                                            |
| ------------ | ------- | -------- | ---------------------------------------------------------------------- |
| `async`      | boolean | no       | Set to true to signal a Sidekiq background worker process.             |

```shell
curl "http://localhost:3000/-/chaos/kill" --header 'X-Chaos-Secret: secret'
curl "http://localhost:3000/-/chaos/kill?token=secret"
```

## Quit

This endpoint simulates the unexpected death of a worker process using the `QUIT` signal.
Unlike `KILL`, the `QUIT` signal also attempts to write a core dump.
See [core(5)](https://man7.org/linux/man-pages/man5/core.5.html) for more information.

```plaintext
GET /-/chaos/quit
GET /-/chaos/quit?async=true
```

| Attribute    | Type    | Required | Description                                                            |
| ------------ | ------- | -------- | ---------------------------------------------------------------------- |
| `async`      | boolean | no       | Set to true to signal a Sidekiq background worker process.             |

```shell
curl "http://localhost:3000/-/chaos/quit" --header 'X-Chaos-Secret: secret'
curl "http://localhost:3000/-/chaos/quit?token=secret"
```

## Run garbage collector

This endpoint triggers a GC run on the worker handling the request and returns its worker ID
plus GC stats as JSON. This is mostly useful when running Puma in standalone mode, because
otherwise the worker handling the request is not known upfront.

Endpoint:

```plaintext
POST /-/chaos/gc
```

Example request:

```shell
curl --request POST "http://localhost:3000/-/chaos/gc" \
  --header 'X-Chaos-Secret: secret'
curl --request POST "http://localhost:3000/-/chaos/gc?token=secret"
```

Example response:

```json
{
  "worker_id": "puma_1",
  "gc_stat": {
    "count": 94,
    "heap_allocated_pages": 9077,
    "heap_sorted_length": 9077,
    "heap_allocatable_pages": 0,
    "heap_available_slots": 3699720,
    "heap_live_slots": 2827510,
    "heap_free_slots": 872210,
    "heap_final_slots": 0,
    "heap_marked_slots": 2827509,
    "heap_eden_pages": 9077,
    "heap_tomb_pages": 0,
    "total_allocated_pages": 9077,
    "total_freed_pages": 0,
    "total_allocated_objects": 14229357,
    "total_freed_objects": 11401847,
    "malloc_increase_bytes": 8192,
    "malloc_increase_bytes_limit": 30949538,
    "minor_gc_count": 71,
    "major_gc_count": 23,
    "compact_count": 0,
    "remembered_wb_unprotected_objects": 41685,
    "remembered_wb_unprotected_objects_limit": 83370,
    "old_objects": 2617806,
    "old_objects_limit": 5235612,
    "oldmalloc_increase_bytes": 8192,
    "oldmalloc_increase_bytes_limit": 122713697
  }
}
```
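
A response like the one above can be summarized from the shell. The following sketch assumes only `grep` and `sed` are available. It extracts two counters from a trimmed copy of the example response and computes the share of heap slots that are live. The `gc_field` helper is hypothetical, and a JSON-aware tool would be more robust for real use.

```shell
# Trimmed copy of the example response; in practice this would be the
# body returned by POST /-/chaos/gc.
response='{"worker_id": "puma_1", "gc_stat": {"count": 94, "heap_available_slots": 3699720, "heap_live_slots": 2827510}}'

# gc_field (hypothetical helper): print the integer value stored under a key
# by pattern matching. This is not a real JSON parser.
gc_field() {
  printf '%s\n' "$1" | grep -o "\"$2\": *[0-9][0-9]*" | head -n 1 | sed 's/.*: *//'
}

live=$(gc_field "$response" heap_live_slots)
available=$(gc_field "$response" heap_available_slots)

# Integer percentage of available heap slots that hold live objects.
live_pct=$(( live * 100 / available ))
echo "live slots: ${live_pct}%"
```

Comparing this percentage before and after a `leakmem` request is one possible way to see whether the garbage collector recovered the leaked memory.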