---
stage: Enablement
group: Database
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
---

# Iterating tables in batches

Rails provides a method called `in_batches` that can be used to iterate over
rows in batches. For example:

```ruby
User.in_batches(of: 10) do |relation|
  relation.update_all(updated_at: Time.now)
end
```

Unfortunately, this method is implemented in a way that is not very efficient,
in terms of both query and memory usage.

To work around this, you can include the `EachBatch` module into your models,
then use the `each_batch` class method. For example:

```ruby
class User < ActiveRecord::Base
  include EachBatch
end

User.each_batch(of: 10) do |relation|
  relation.update_all(updated_at: Time.now)
end
```

This produces queries such as:

```plaintext
User Load (0.7ms)  SELECT "users"."id" FROM "users" WHERE ("users"."id" >= 41654) ORDER BY "users"."id" ASC LIMIT 1 OFFSET 1000
  (0.7ms)  SELECT COUNT(*) FROM "users" WHERE ("users"."id" >= 41654) AND ("users"."id" < 42687)
```

The API of this method is similar to `in_batches`, though it doesn't support
all of the arguments that `in_batches` supports. You should always use
`each_batch` _unless_ you have a specific need for `in_batches`.

## Avoid iterating over non-unique columns

Proceed with extra caution, and possibly avoid iterating over a column that can contain
duplicate values. When you iterate over an attribute that is not unique, even with an applied
maximum batch size, there is no guarantee that the resulting batches do not surpass it. The
following snippet demonstrates this situation: when one attempts to select `Ci::Build` entries
for users with `id` between `1` and `10,000`, the database returns `1,215,178` matching rows.

```ruby
[ gstg ] production> Ci::Build.where(user_id: (1..10_000)).size
=> 1215178
```

This happens because the built relation is translated into the following query:

```ruby
[ gstg ] production> puts Ci::Build.where(user_id: (1..10_000)).to_sql
SELECT "ci_builds".* FROM "ci_builds" WHERE "ci_builds"."type" = 'Ci::Build' AND "ci_builds"."user_id" BETWEEN 1 AND 10000
=> nil
```

Queries that filter a non-unique column by a range (`WHERE "ci_builds"."user_id" BETWEEN ? AND ?`)
limit the range size to a certain threshold (`10,000` in the previous example), but this threshold
does not translate to the size of the returned dataset. That's because, given `n` possible values
of an attribute, one can't tell for sure that the number of records containing them is less
than `n`.

## Column definition

`EachBatch` uses the primary key of the model for the iteration by default. This works in most
cases; however, in some cases you might want to use a different column for the iteration.

```ruby
Project.distinct.each_batch(column: :creator_id, of: 10) do |relation|
  puts User.where(id: relation.select(:creator_id)).map(&:id)
end
```

The query above iterates over the project creators and prints them out without duplicates.

NOTE:
If the column is not unique (no unique index definition), calling the `distinct` method on
the relation is necessary. Using a non-unique column without `distinct` may result in `each_batch`
falling into an endless loop, as described in the following
[issue](https://gitlab.com/gitlab-org/gitlab/-/issues/285097).

## `EachBatch` in data migrations

When dealing with data migrations, the preferred way to iterate over a large volume of data is
using `EachBatch`.

A special case of data migration is a [background migration](background_migrations.md#scheduling)
where the actual data modification is executed in a background job.
The migration code that
determines the data ranges (slices) and schedules the background jobs uses `each_batch`.

## Efficient usage of `each_batch`

`EachBatch` helps to iterate over large tables. It's important to highlight that `EachBatch`
does not magically solve all iteration-related performance problems, and it might not help at
all in some scenarios. From the database point of view, correctly configured database indexes are
also necessary to make `EachBatch` perform well.

### Example 1: Simple iteration

Let's consider that we want to iterate over the `users` table and print the `User` records to the
standard output. The `users` table contains millions of records, thus running one query to fetch
the users would likely time out.

![Users table overview](img/each_batch_users_table_v13_7.png)

This is a simplified version of the `users` table which contains several rows. We have a few
smaller gaps in the `id` column to make the example a bit more realistic (a few records were
already deleted). Currently, we have one index on the `id` field.

Loading all users into memory (avoid):

```ruby
users = User.all

users.each { |user| puts user.inspect }
```

Use `each_batch`:

```ruby
# Note: for this example we picked 5 as the batch size, the default is 1_000
User.each_batch(of: 5) do |relation|
  relation.each { |user| puts user.inspect }
end
```

#### How `each_batch` works

As the first step, it finds the lowest `id` (start `id`) in the table by executing the following
database query:

```sql
SELECT "users"."id" FROM "users" ORDER BY "users"."id" ASC LIMIT 1
```

![Reading the start `id` value](img/each_batch_users_table_iteration_1_v13_7.png)

Notice that the query only reads data from the index (`INDEX ONLY SCAN`); the table is not
accessed.
Database indexes are sorted, so taking out the first item is a very cheap operation.

The next step is to find the next `id` (end `id`) which should respect the batch size
configuration. In this example we used a batch size of 5. `EachBatch` uses the `OFFSET` clause
to get a "shifted" `id` value.

```sql
SELECT "users"."id" FROM "users" WHERE "users"."id" >= 1 ORDER BY "users"."id" ASC LIMIT 1 OFFSET 5
```

![Reading the end `id` value](img/each_batch_users_table_iteration_2_v13_7.png)

Again, the query only looks into the index. The `OFFSET 5` takes out the sixth `id` value: this
query reads a maximum of six items from the index regardless of the table size or the iteration
count.

At this point, we know the `id` range for the first batch. Now it's time to construct the query
for the `relation` block.

```sql
SELECT "users".* FROM "users" WHERE "users"."id" >= 1 AND "users"."id" < 302
```

![Reading the rows from the `users` table](img/each_batch_users_table_iteration_3_v13_7.png)

Notice the `<` sign. Previously six items were read from the index, and in this query the last
value is "excluded". The query looks at the index to get the location of the five `user`
rows on the disk and then reads the rows from the table. The returned array is processed in Ruby.

The first iteration is done. For the next iteration, the last `id` value is reused from the
previous iteration to find the next end `id` value.

```sql
SELECT "users"."id" FROM "users" WHERE "users"."id" >= 302 ORDER BY "users"."id" ASC LIMIT 1 OFFSET 5
```

![Reading the second end `id` value](img/each_batch_users_table_iteration_4_v13_7.png)

Now we can easily construct the `users` query for the second iteration.

```sql
SELECT "users".* FROM "users" WHERE "users"."id" >= 302 AND "users"."id" < 353
```

![Reading the rows for the second iteration from the users table](img/each_batch_users_table_iteration_5_v13_7.png)

### Example 2: Iteration with filters

Building on top of the previous example, we want to print users with zero sign-in count. We keep
track of the number of sign-ins in the `sign_in_count` column, so we write the following code:

```ruby
users = User.where(sign_in_count: 0)

users.each_batch(of: 5) do |relation|
  relation.each { |user| puts user.inspect }
end
```

`each_batch` produces the following SQL query for the start `id` value:

```sql
SELECT "users"."id" FROM "users" WHERE "users"."sign_in_count" = 0 ORDER BY "users"."id" ASC LIMIT 1
```

Selecting only the `id` column and ordering by `id` "forces" the database to use the
index on the `id` (primary key index) column. However, we also have an extra condition on the
`sign_in_count` column. The column is not part of the index, so the database needs to look into
the actual table to find the first matching row.

![Reading the index with extra filter](img/each_batch_users_table_filter_v13_7.png)

NOTE:
The number of scanned rows depends on the data distribution in the table.

- Best case scenario: the first user has never signed in. The database reads only one row.
- Worst case scenario: all users have signed in at least once. The database reads all rows.

In this particular example, the database had to read 10 rows (regardless of our batch size setting)
to determine the first `id` value. In a "real-world" application it's hard to predict whether the
filtering is going to cause problems or not.
In the case of GitLab, verifying the data on a
production replica is a good start, but keep in mind that data distribution on GitLab.com can be
different from self-managed instances.

#### Improve filtering with `each_batch`

##### Specialized conditional index

```sql
CREATE INDEX index_on_users_never_logged_in ON users (id) WHERE sign_in_count = 0
```

This is what our table and the newly created index look like:

![Reading the specialized index](img/each_batch_users_table_filtered_index_v13_7.png)

This index definition covers the conditions on the `id` and `sign_in_count` columns, thus it makes
the `each_batch` queries very effective (similar to the simple iteration example).

It's rare for a user to have never signed in, so we anticipate a small index size. Including only
the `id` in the index definition also helps to keep the index size small.

##### Index on columns

Later on, we might want to iterate over the table filtering for different `sign_in_count` values. In
those cases, we cannot use the previously suggested conditional index, because the `WHERE` condition
does not match our new filter (`sign_in_count > 10`).

To address this problem, we have two options:

- Create another, conditional index to cover the new query.
- Replace the index with a more generalized configuration.

NOTE:
Having multiple indexes on the same table and on the same columns could be a performance bottleneck
when writing data.

Let's consider the following index (avoid):

```sql
CREATE INDEX index_on_users_never_logged_in ON users (id, sign_in_count)
```

The index definition starts with the `id` column, which makes the index very inefficient from a data
selectivity point of view.

```sql
SELECT "users"."id" FROM "users" WHERE "users"."sign_in_count" = 0 ORDER BY "users"."id" ASC LIMIT 1
```

Executing the query above results in an `INDEX ONLY SCAN`. However, the query still needs to
iterate over an unknown number of entries in the index, and then find the first item where the
`sign_in_count` is `0`.

![Reading an ineffective index](img/each_batch_users_table_bad_index_v13_7.png)

We can improve the query significantly by swapping the columns in the index definition (prefer).

```sql
CREATE INDEX index_on_users_never_logged_in ON users (sign_in_count, id)
```

![Reading a good index](img/each_batch_users_table_good_index_v13_7.png)

The following index definition does not work well with `each_batch` (avoid).

```sql
CREATE INDEX index_on_users_never_logged_in ON users (sign_in_count)
```

Since `each_batch` builds range queries based on the `id` column, this index cannot be used
efficiently. The DB reads the rows from the table or uses a bitmap search where the primary
key index is also read.

##### "Slow" iteration

Slow iteration means that we use a good index configuration to iterate over the table and
apply filtering on the yielded relation.

```ruby
User.each_batch(of: 5) do |relation|
  relation.where(sign_in_count: 0).each { |user| puts user.inspect }
end
```

The iteration uses the primary key index (on the `id` column), which makes it safe from statement
timeouts. The filter (`sign_in_count: 0`) is applied on the `relation` where the `id` is already
constrained (range), so the number of rows is limited.

Slow iteration generally takes more time to finish. The iteration count is higher and
one iteration could yield fewer records than the batch size. Iterations may even yield
0 records.
This is not an optimal solution; however, in some cases (especially when
dealing with large tables) this is the only viable option.

### Using Subqueries

Using subqueries in your `each_batch` query does not work well in most cases. Consider the following example:

```ruby
projects = Project.where(creator_id: Issue.where(confidential: true).select(:author_id))

projects.each_batch do |relation|
  # do something
end
```

The iteration uses the `id` column of the `projects` table. The batching does not affect the
subquery. This means for each iteration, the subquery is executed by the database. This adds a
constant "load" on the query, which often results in statement timeouts. We have an unknown number
of [confidential issues](../user/project/issues/confidential_issues.md); the execution time
and the accessed database rows depend on the data distribution in the `issues` table.

NOTE:
Using subqueries works only when the subquery returns a small number of rows.

#### Improving Subqueries

When dealing with subqueries, a slow iteration approach could work: the filter on `creator_id`
can be part of the generated `relation` object.

```ruby
projects = Project.all

projects.each_batch do |relation|
  relation.where(creator_id: Issue.where(confidential: true).select(:author_id))
end
```

If the query on the `issues` table itself is not performant enough, a nested loop could be
constructed. Try to avoid it when possible.

```ruby
projects = Project.all

projects.each_batch do |relation|
  issues = Issue.where(confidential: true)

  issues.each_batch do |issues_relation|
    relation.where(creator_id: issues_relation.select(:author_id))
  end
end
```

If we know that the `issues` table has many more rows than `projects`, it would make sense to flip
the queries, where the `issues` table is batched first.
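
A flipped version could look like the following sketch. This is an illustration only, not code from the codebase: it assumes the same `Project` and `Issue` models as the examples above, and batches the larger `issues` table in the outer loop so each inner query only references a bounded set of `author_id` values.

```ruby
# Sketch (assumption: plain ActiveRecord models as in the examples above).
# The larger table (`issues`) is batched in the outer loop, so the
# `creator_id` filter below only ever receives a bounded subquery.
Issue.where(confidential: true).each_batch do |issues_relation|
  author_ids = issues_relation.select(:author_id)

  Project.each_batch do |projects_relation|
    # Do something with the filtered relation, for example:
    projects_relation.where(creator_id: author_ids)
  end
end
```

Note that this multiplies the iteration count (issue batches times project batches), so it only pays off when each individual query would otherwise be too expensive.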

### Using `JOIN` and `EXISTS`

When to use `JOIN`s:

- When there's a 1:1 or 1:N relationship between the tables where we know that the joined record
  (almost) always exists. This works well for "extension-like" tables:
  - `projects` - `project_settings`
  - `users` - `user_details`
  - `users` - `user_statuses`
- `LEFT JOIN` works well in this case. Conditions on the joined table need to go to the yielded
  relation so the iteration is not affected by the data distribution in the joined table.

Example:

```ruby
users = User.joins("LEFT JOIN personal_access_tokens on personal_access_tokens.user_id = users.id")

users.each_batch do |relation|
  relation.where("personal_access_tokens.name = 'name'")
end
```

`EXISTS` queries should be added only to the inner `relation` of the `each_batch` query:

```ruby
User.each_batch do |relation|
  relation.where("EXISTS (SELECT 1 FROM ...")
end
```

### Complex queries on the relation object

When the `relation` object has several extra conditions, the execution plans might become
"unstable".

Example:

```ruby
Issue.each_batch do |relation|
  relation
    .joins(:metrics)
    .joins(:merge_requests_closing_issues)
    .where("id IN (SELECT ...)")
    .where(confidential: true)
end
```

Here, we expect that the `relation` query reads the `BATCH_SIZE` of issue records and then
filters down the results according to the provided queries. The planner might decide that
using a bitmap index lookup with the index on the `confidential` column is a better way to
execute the query. This can cause an unexpectedly high amount of rows to be read and the
query could time out.

The problem: we know for sure that the relation returns at most `BATCH_SIZE` records;
however, the planner does not know this.

A common table expression (CTE) trick can force the range query to execute first:

```ruby
Issue.each_batch(of: 1000) do |relation|
  cte = Gitlab::SQL::CTE.new(:batched_relation, relation.limit(1000))

  scope = cte
    .apply_to(Issue.all)
    .joins(:metrics)
    .joins(:merge_requests_closing_issues)
    .where("id IN (SELECT ...)")
    .where(confidential: true)

  puts scope.to_a
end
```

### `EachBatch` vs `BatchCount`

When adding new counters for Service Ping, the preferred way to count records is using the
`Gitlab::Database::BatchCount` class. The iteration logic implemented in `BatchCount`
has performance characteristics similar to `EachBatch`. Most of the tips and suggestions
for improving `EachBatch` mentioned above apply to `BatchCount` as well.

## Iterate with keyset pagination

There are a few special cases where iterating with `EachBatch` does not work. `EachBatch`
requires one distinct column (usually the primary key), which makes the iteration impossible
for timestamp columns and tables with composite primary keys.

Where `EachBatch` does not work, you can use
[keyset pagination](database/pagination_guidelines.md#keyset-pagination) to iterate over the
table or a range of rows. The scaling and performance characteristics are very similar to
`EachBatch`.

Examples:

- Iterate over the table in a specific order (timestamp columns) in combination with a tie-breaker
  if the column used to sort by does not contain unique values.
- Iterate over the table with composite primary keys.

### Iterate over the issues in a project by creation date

You can use keyset pagination to iterate over any database column in a specific order (for example,
`created_at DESC`). To ensure a consistent order of the returned records with the same values for
`created_at`, use a tie-breaker column with unique values (for example, `id`).

Assume you have the following index in the `issues` table:

```sql
"idx_issues_on_project_id_and_created_at_and_id" btree (project_id, created_at, id)
```

### Fetching records for further processing

The following snippet iterates over issue records within the project using the specified order
(`created_at, id`).

```ruby
scope = Issue.where(project_id: 278964).order(:created_at, :id) # id is the tie-breaker

iterator = Gitlab::Pagination::Keyset::Iterator.new(scope: scope)

iterator.each_batch(of: 100) do |records|
  puts records.map(&:id)
end
```

You can add extra filters to the query. This example only lists the issue IDs created in the last
30 days:

```ruby
scope = Issue.where(project_id: 278964).where('created_at > ?', 30.days.ago).order(:created_at, :id) # id is the tie-breaker

iterator = Gitlab::Pagination::Keyset::Iterator.new(scope: scope)

iterator.each_batch(of: 100) do |records|
  puts records.map(&:id)
end
```

### Updating records in the batch

For complex `ActiveRecord` queries, the `.update_all` method does not work well, because it
generates an incorrect `UPDATE` statement.
You can use raw SQL for updating records in batches:

```ruby
scope = Issue.where(project_id: 278964).order(:created_at, :id) # id is the tie-breaker

iterator = Gitlab::Pagination::Keyset::Iterator.new(scope: scope)

iterator.each_batch(of: 100) do |records|
  ApplicationRecord.connection.execute("UPDATE issues SET updated_at=NOW() WHERE issues.id in (#{records.dup.reselect(:id).to_sql})")
end
```

NOTE:
To keep the iteration stable and predictable, avoid updating the columns in the `ORDER BY` clause.
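
To make the cursor logic the keyset `Iterator` relies on more concrete, here is a small, framework-free Ruby sketch: rows are ordered by `(created_at, id)`, and the last row of each batch becomes the cursor for the next one. The `Row` struct and `next_batch` helper are illustrative only; they are not part of the GitLab codebase.

```ruby
# Framework-free illustration of keyset batching with a tie-breaker.
Row = Struct.new(:id, :created_at)

rows = [
  Row.new(1, 10), Row.new(2, 10), Row.new(3, 10),
  Row.new(4, 20), Row.new(5, 20), Row.new(6, 30)
].sort_by { |row| [row.created_at, row.id] }

# Returns the next `size` rows strictly after the cursor in
# (created_at, id) order; `id` breaks ties between equal timestamps.
def next_batch(rows, cursor, size)
  candidates =
    if cursor
      rows.select { |row| ([row.created_at, row.id] <=> [cursor.created_at, cursor.id]) > 0 }
    else
      rows
    end

  candidates.first(size)
end

cursor = nil
batches = []

loop do
  batch = next_batch(rows, cursor, 2)
  break if batch.empty?

  batches << batch.map(&:id)
  cursor = batch.last # the cursor is the last row itself, not a numeric offset
end

batches # => [[1, 2], [3, 4], [5, 6]]
```

Because the cursor is a row value rather than an `OFFSET`, each step stays cheap no matter how deep into the table the iteration is, which is what gives this approach scaling characteristics similar to `EachBatch`.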

### Iterate over the `merge_request_diff_commits` table

The `merge_request_diff_commits` table uses a composite primary key (`merge_request_diff_id,
relative_order`), which makes `EachBatch` impossible to use efficiently.

To paginate over the `merge_request_diff_commits` table, you can use the following snippet:

```ruby
# Custom order object configuration:
order = Gitlab::Pagination::Keyset::Order.build([
  Gitlab::Pagination::Keyset::ColumnOrderDefinition.new(
    attribute_name: 'merge_request_diff_id',
    order_expression: MergeRequestDiffCommit.arel_table[:merge_request_diff_id].asc,
    nullable: :not_nullable,
    distinct: false
  ),
  Gitlab::Pagination::Keyset::ColumnOrderDefinition.new(
    attribute_name: 'relative_order',
    order_expression: MergeRequestDiffCommit.arel_table[:relative_order].asc,
    nullable: :not_nullable,
    distinct: false
  )
])

MergeRequestDiffCommit.include(FromUnion) # keyset pagination generates UNION queries

scope = MergeRequestDiffCommit.order(order)

iterator = Gitlab::Pagination::Keyset::Iterator.new(scope: scope)

iterator.each_batch(of: 100) do |records|
  puts records.map { |record| [record.merge_request_diff_id, record.relative_order] }.inspect
end
```

### Order object configuration

Keyset pagination works well with simple `ActiveRecord` `order` scopes
([first example](iterating_tables_in_batches.md#iterate-over-the-issues-in-a-project-by-creation-date)).
However, in special cases, you need to describe the columns in the `ORDER BY` clause (second example)
for the underlying keyset pagination library. When the `ORDER BY` configuration cannot be
automatically determined by the keyset pagination library, an error is raised.

The code comments of the
[`Gitlab::Pagination::Keyset::Order`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/pagination/keyset/order.rb)
and [`Gitlab::Pagination::Keyset::ColumnOrderDefinition`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/pagination/keyset/column_order_definition.rb)
classes give an overview of the possible options for configuring the `ORDER BY` clause. You can
also find a few code examples in the
[keyset pagination](database/keyset_pagination.md#complex-order-configuration) documentation.