---
stage: Enablement
group: Database
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
---

# Iterating tables in batches

Rails provides a method called `in_batches` that can be used to iterate over
rows in batches. For example:

```ruby
User.in_batches(of: 10) do |relation|
  relation.update_all(updated_at: Time.now)
end
```

Unfortunately, this method is implemented in a way that is not very efficient,
both in terms of query performance and memory usage.
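
Under the hood, `in_batches` plucks the primary keys of each batch into memory and yields a
relation constrained by an `IN` list, so the generated queries look roughly like the following
(an illustrative shape, not an actual query log):

```plaintext
SELECT "users"."id" FROM "users" WHERE "users"."id" > 41654 ORDER BY "users"."id" ASC LIMIT 10
UPDATE "users" SET "updated_at" = '...' WHERE "users"."id" IN (41655, 41656, ...)
```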

To work around this, you can include the `EachBatch` module in your models,
then use the `each_batch` class method. For example:

```ruby
class User < ActiveRecord::Base
  include EachBatch
end

User.each_batch(of: 10) do |relation|
  relation.update_all(updated_at: Time.now)
end
```

This produces queries such as:

```plaintext
User Load (0.7ms)  SELECT  "users"."id" FROM "users" WHERE ("users"."id" >= 41654)  ORDER BY "users"."id" ASC LIMIT 1 OFFSET 1000
  (0.7ms)  SELECT COUNT(*) FROM "users" WHERE ("users"."id" >= 41654) AND ("users"."id" < 42687)
```

The API of this method is similar to `in_batches`, though it doesn't support
all of the arguments that `in_batches` supports. You should always use
`each_batch` _unless_ you have a specific need for `in_batches`.

## Avoid iterating over non-unique columns

Take extra care, and preferably avoid, iterating over a column that can contain
duplicate values. When you iterate over an attribute that is not unique, even with the applied max
batch size, there is no guarantee that the resulting batches do not surpass it. The following
snippet demonstrates this situation: when selecting `Ci::Build` entries for users with
`id` between `1` and `10,000`, the database returns `1,215,178` matching rows.

```ruby
[ gstg ] production> Ci::Build.where(user_id: (1..10_000)).size
=> 1215178
```

This happens because the built relation is translated into the following query:

```ruby
[ gstg ] production> puts Ci::Build.where(user_id: (1..10_000)).to_sql
SELECT "ci_builds".* FROM "ci_builds" WHERE "ci_builds"."type" = 'Ci::Build' AND "ci_builds"."user_id" BETWEEN 1 AND 10000
=> nil
```

Queries that filter a non-unique column by a range (`WHERE "ci_builds"."user_id" BETWEEN ? AND ?`)
limit the size of the range to a certain threshold (`10,000` in the previous example), but this
threshold does not translate to the size of the returned dataset. That's because when taking
`n` possible values of an attribute, there is no guarantee that the number of records containing
them is less than `n`.
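
One way to keep the amount of work per query bounded in this situation is to batch over the unique
side of the relationship and run the query once per batch. A minimal sketch, assuming `User`
includes `EachBatch` as shown earlier and that the goal is to process builds for a range of users:

```ruby
User.where(id: 1..10_000).each_batch(of: 100) do |relation|
  # Each batch may still match an unpredictable number of builds, but the
  # filter is now bounded to at most 100 user IDs per query.
  Ci::Build.where(user_id: relation.select(:id)).update_all(updated_at: Time.current) # example operation
end
```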

## Column definition

`EachBatch` uses the primary key of the model by default for the iteration. This works in most
cases; however, in some cases you might want to use a different column for the iteration.

```ruby
Project.distinct.each_batch(column: :creator_id, of: 10) do |relation|
  puts User.where(id: relation.select(:creator_id)).map(&:id)
end
```

The query above iterates over the project creators and prints them out without duplications.

NOTE:
If the column is not unique (there is no unique index definition), calling the `distinct` method on
the relation is necessary. Using a non-unique column without `distinct` may cause `each_batch` to
fall into an endless loop, as described in this
[issue](https://gitlab.com/gitlab-org/gitlab/-/issues/285097).
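
For reference, this is the variant to avoid; without `distinct`, batching over the non-unique
`creator_id` column may never make progress:

```ruby
# Avoid: may fall into an endless loop when creator_id contains duplicate values.
Project.each_batch(column: :creator_id, of: 10) do |relation|
  puts User.where(id: relation.select(:creator_id)).map(&:id)
end
```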

## `EachBatch` in data migrations

When dealing with data migrations, the preferred way to iterate over a large volume of data is
using `EachBatch`.

A special case of data migration is a [background migration](background_migrations.md#scheduling)
where the actual data modification is executed in a background job. The migration code that
determines the data ranges (slices) and schedules the background jobs uses `each_batch`.
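
A minimal sketch of such a scheduling migration is shown below. The worker name, the batch size,
and the isolated model are illustrative assumptions, not an actual GitLab migration:

```ruby
class ScheduleBackfillSomething < ActiveRecord::Migration[6.0]
  disable_ddl_transaction!

  # Isolated model, so the migration does not depend on the application model.
  class User < ActiveRecord::Base
    include EachBatch

    self.table_name = 'users'
  end

  def up
    User.each_batch(of: 10_000) do |relation|
      min, max = relation.pluck(Arel.sql('MIN(id), MAX(id)')).first

      BackfillSomethingWorker.perform_async(min, max) # hypothetical worker
    end
  end

  def down
    # no-op
  end
end
```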

## Efficient usage of `each_batch`

`EachBatch` helps to iterate over large tables. It's important to highlight that `EachBatch`
does not magically solve all iteration-related performance problems, and it might not help at
all in some scenarios. From the database point of view, correctly configured database indexes are
also necessary to make `EachBatch` perform well.

### Example 1: Simple iteration

Let's consider that we want to iterate over the `users` table and print the `User` records to the
standard output. The `users` table contains millions of records, thus running one query to fetch
the users will likely time out.

![Users table overview](img/each_batch_users_table_v13_7.png)

This is a simplified version of the `users` table which contains several rows. We have a few
smaller gaps in the `id` column to make the example a bit more realistic (a few records were
already deleted). Currently, we have one index on the `id` field.

Loading all users into memory (avoid):

```ruby
users = User.all

users.each { |user| puts user.inspect }
```

Use `each_batch`:

```ruby
# Note: for this example we picked 5 as the batch size; the default is 1_000
User.each_batch(of: 5) do |relation|
  relation.each { |user| puts user.inspect }
end
```

#### How `each_batch` works

As the first step, it finds the lowest `id` (start `id`) in the table by executing the following
database query:

```sql
SELECT "users"."id" FROM "users" ORDER BY "users"."id" ASC LIMIT 1
```

![Reading the start `id` value](img/each_batch_users_table_iteration_1_v13_7.png)

Notice that the query only reads data from the index (`INDEX ONLY SCAN`); the table is not
accessed. Database indexes are sorted, so taking out the first item is a very cheap operation.

The next step is to find the next `id` (end `id`) which should respect the batch size
configuration. In this example we used a batch size of 5. `EachBatch` uses the `OFFSET` clause
to get a "shifted" `id` value.

```sql
SELECT "users"."id" FROM "users" WHERE "users"."id" >= 1 ORDER BY "users"."id" ASC LIMIT 1 OFFSET 5
```

![Reading the end `id` value](img/each_batch_users_table_iteration_2_v13_7.png)

Again, the query only looks into the index. The `OFFSET 5` takes out the sixth `id` value: this
query reads a maximum of six items from the index regardless of the table size or the iteration
count.

At this point, we know the `id` range for the first batch. Now it's time to construct the query
for the `relation` block.

```sql
SELECT "users".* FROM "users" WHERE "users"."id" >= 1 AND "users"."id" < 302
```

![Reading the rows from the `users` table](img/each_batch_users_table_iteration_3_v13_7.png)

Notice the `<` sign. Previously six items were read from the index, and in this query the last
value is "excluded". The query looks at the index to get the location of the five `user`
rows on disk and then reads the rows from the table. The returned array is processed in Ruby.

The first iteration is done. For the next iteration, the last `id` value is reused from the
previous iteration to find the next end `id` value.

```sql
SELECT "users"."id" FROM "users" WHERE "users"."id" >= 302 ORDER BY "users"."id" ASC LIMIT 1 OFFSET 5
```

![Reading the second end `id` value](img/each_batch_users_table_iteration_4_v13_7.png)

Now we can easily construct the `users` query for the second iteration.

```sql
SELECT "users".* FROM "users" WHERE "users"."id" >= 302 AND "users"."id" < 353
```

![Reading the rows for the second iteration from the users table](img/each_batch_users_table_iteration_5_v13_7.png)

### Example 2: Iteration with filters

Building on top of the previous example, we want to print users with zero sign-in count. We keep
track of the number of sign-ins in the `sign_in_count` column, so we write the following code:

```ruby
users = User.where(sign_in_count: 0)

users.each_batch(of: 5) do |relation|
  relation.each { |user| puts user.inspect }
end
```

`each_batch` produces the following SQL query for the start `id` value:

```sql
SELECT "users"."id" FROM "users" WHERE "users"."sign_in_count" = 0 ORDER BY "users"."id" ASC LIMIT 1
```

Selecting only the `id` column and ordering by `id` "forces" the database to use the
index on the `id` column (the primary key index). However, we also have an extra condition on the
`sign_in_count` column. The column is not part of the index, so the database needs to look into
the actual table to find the first matching row.

![Reading the index with extra filter](img/each_batch_users_table_filter_v13_7.png)

NOTE:
The number of scanned rows depends on the data distribution in the table.

- Best case scenario: the first user has never signed in. The database reads only one row.
- Worst case scenario: all users have signed in at least once. The database reads all rows.

In this particular example, the database had to read 10 rows (regardless of our batch size setting)
to determine the first `id` value. In a "real-world" application it's hard to predict whether the
filtering is going to cause problems or not. In the case of GitLab, verifying the data on a
production replica is a good start, but keep in mind that data distribution on GitLab.com can be
different from self-managed instances.

#### Improve filtering with `each_batch`

##### Specialized conditional index

```sql
CREATE INDEX index_on_users_never_logged_in ON users (id) WHERE sign_in_count = 0
```

This is what our table and the newly created index look like:

![Reading the specialized index](img/each_batch_users_table_filtered_index_v13_7.png)

This index definition covers the conditions on the `id` and `sign_in_count` columns, thus making
the `each_batch` queries very effective (similar to the simple iteration example).

It's rare for a user to have never signed in, so we anticipate a small index size. Including only
the `id` in the index definition also helps to keep the index size small.

##### Index on columns

Later on, we might want to iterate over the table filtering for different `sign_in_count` values.
In those cases, we cannot use the previously suggested conditional index, because the `WHERE`
condition does not match our new filter (`sign_in_count > 10`).

To address this problem, we have two options:

- Create another conditional index to cover the new query.
- Replace the index with a more generalized configuration.

NOTE:
Having multiple indexes on the same table and on the same columns could be a performance bottleneck
when writing data.

Let's consider the following index (avoid):

```sql
CREATE INDEX index_on_users_never_logged_in ON users (id, sign_in_count)
```

The index definition starts with the `id` column, which makes the index very inefficient from a
data selectivity point of view.

```sql
SELECT "users"."id" FROM "users" WHERE "users"."sign_in_count" = 0 ORDER BY "users"."id" ASC LIMIT 1
```

Executing the query above results in an `INDEX ONLY SCAN`. However, the query still needs to
iterate over an unknown number of entries in the index, and then find the first item where the
`sign_in_count` is `0`.

![Reading an ineffective index](img/each_batch_users_table_bad_index_v13_7.png)

We can improve the query significantly by swapping the columns in the index definition (prefer):

```sql
CREATE INDEX index_on_users_never_logged_in ON users (sign_in_count, id)
```

![Reading a good index](img/each_batch_users_table_good_index_v13_7.png)

The following index definition does not work well with `each_batch` (avoid):

```sql
CREATE INDEX index_on_users_never_logged_in ON users (sign_in_count)
```

Since `each_batch` builds range queries based on the `id` column, this index cannot be used
efficiently. The database reads the rows from the table or uses a bitmap search where the primary
key index is also read.

##### "Slow" iteration

Slow iteration means that we use a good index configuration to iterate over the table and
apply filtering on the yielded relation.

```ruby
User.each_batch(of: 5) do |relation|
  relation.where(sign_in_count: 0).each { |user| puts user.inspect }
end
```

The iteration uses the primary key index (on the `id` column) which makes it safe from statement
timeouts. The filter (`sign_in_count: 0`) is applied on the `relation` where the `id` range is
already constrained, so the number of scanned rows is limited.
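
The batch queries generated for the yielded relation look roughly like this (illustrative,
reusing the `id` range from the first example):

```sql
SELECT "users".* FROM "users" WHERE "users"."id" >= 1 AND "users"."id" < 302 AND "users"."sign_in_count" = 0
```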

Slow iteration generally takes more time to finish. The iteration count is higher and
one iteration could yield fewer records than the batch size. Iterations may even yield
0 records. This is not an optimal solution; however, in some cases (especially when
dealing with large tables) this is the only viable option.

### Using subqueries

Using subqueries in your `each_batch` query does not work well in most cases. Consider the following example:

```ruby
projects = Project.where(creator_id: Issue.where(confidential: true).select(:author_id))

projects.each_batch do |relation|
  # do something
end
```

The iteration uses the `id` column of the `projects` table. The batching does not affect the
subquery. This means that for each iteration, the subquery is executed by the database. This adds
a constant "load" on the query, which often leads to statement timeouts. We have an unknown number
of [confidential issues](../user/project/issues/confidential_issues.md); the execution time
and the accessed database rows depend on the data distribution in the `issues` table.

NOTE:
Using subqueries works only when the subquery returns a small number of rows.

#### Improving subqueries

When dealing with subqueries, a slow iteration approach could work: the filter on `creator_id`
can be part of the generated `relation` object.

```ruby
projects = Project.all

projects.each_batch do |relation|
  relation.where(creator_id: Issue.where(confidential: true).select(:author_id))
end
```

If the query on the `issues` table itself is not performant enough, a nested loop could be
constructed. Try to avoid it when possible.

```ruby
projects = Project.all

projects.each_batch do |relation|
  issues = Issue.where(confidential: true)

  issues.each_batch do |issues_relation|
    relation.where(creator_id: issues_relation.select(:author_id))
  end
end
```

If we know that the `issues` table has many more rows than `projects`, it would make sense to flip
the queries, where the `issues` table is batched first.
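
A flipped version might look like the following sketch (assuming `Issue` also includes
`EachBatch`; the processing step remains a placeholder):

```ruby
issues = Issue.where(confidential: true)

issues.each_batch do |issues_relation|
  projects = Project.where(creator_id: issues_relation.select(:author_id))
  # do something with the projects of this batch
end
```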

### Using `JOIN` and `EXISTS`

When to use `JOIN`:

- When there's a 1:1 or 1:N relationship between the tables where we know that the joined record
(almost) always exists. This works well for "extension-like" tables:
  - `projects` - `project_settings`
  - `users` - `user_details`
  - `users` - `user_statuses`
- `LEFT JOIN` works well in this case. Conditions on the joined table need to go to the yielded
relation so the iteration is not affected by the data distribution in the joined table.

Example:

```ruby
users = User.joins("LEFT JOIN personal_access_tokens on personal_access_tokens.user_id = users.id")

users.each_batch do |relation|
  relation.where("personal_access_tokens.name = 'name'")
end
```

`EXISTS` queries should be added only to the inner `relation` of the `each_batch` query:

```ruby
User.each_batch do |relation|
  relation.where("EXISTS (SELECT 1 FROM ...")
end
```
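
For example, a complete (hypothetical) version of such a filter, reusing the
`personal_access_tokens` table from the `JOIN` example:

```ruby
User.each_batch do |relation|
  # The EXISTS subquery is correlated with the batched users relation.
  relation.where("EXISTS (SELECT 1 FROM personal_access_tokens WHERE personal_access_tokens.user_id = users.id)")
end
```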

### Complex queries on the relation object

When the `relation` object has several extra conditions, the execution plans might become
"unstable".

Example:

```ruby
Issue.each_batch do |relation|
  relation
    .joins(:metrics)
    .joins(:merge_requests_closing_issues)
    .where("id IN (SELECT ...)")
    .where(confidential: true)
end
```

Here, we expect that the `relation` query reads the `BATCH_SIZE` of issue records and then
filters down the results according to the provided queries. The planner might decide that
using a bitmap index lookup with the index on the `confidential` column is a better way to
execute the query. This can cause an unexpectedly high amount of rows to be read and the
query could time out.

Problem: we know for sure that the relation returns at most `BATCH_SIZE` records;
however, the planner does not know this.

A common table expression (CTE) trick can force the range query to execute first:

```ruby
Issue.each_batch(of: 1000) do |relation|
  cte = Gitlab::SQL::CTE.new(:batched_relation, relation.limit(1000))

  scope = cte
    .apply_to(Issue.all)
    .joins(:metrics)
    .joins(:merge_requests_closing_issues)
    .where("id IN (SELECT ...)")
    .where(confidential: true)

  puts scope.to_a
end
```

### `EachBatch` vs `BatchCount`

When adding new counters for Service Ping, the preferred way to count records is using the
`Gitlab::Database::BatchCount` class. The iteration logic implemented in `BatchCount`
has performance characteristics similar to `EachBatch`. Most of the tips and suggestions
for improving `each_batch` mentioned above apply to `BatchCount` as well.
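
For example, counting records in batches might look like this (a sketch; the relations and the
column are illustrative):

```ruby
# Batched count of users who never signed in.
Gitlab::Database::BatchCount.batch_count(User.where(sign_in_count: 0))

# Batched distinct count over a non-unique column.
Gitlab::Database::BatchCount.batch_distinct_count(Project, :creator_id)
```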

## Iterate with keyset pagination

There are a few special cases where iterating with `EachBatch` does not work. `EachBatch`
requires a distinct column (usually the primary key), which makes the iteration impossible
for timestamp columns and tables with composite primary keys.

Where `EachBatch` does not work, you can use
[keyset pagination](database/pagination_guidelines.md#keyset-pagination) to iterate over the
table or a range of rows. The scaling and performance characteristics are very similar to
`EachBatch`.

Examples:

- Iterate over the table in a specific order (timestamp columns) in combination with a tie-breaker
if the column used for ordering does not contain unique values.
- Iterate over the table with composite primary keys.

### Iterate over the issues in a project by creation date

You can use keyset pagination to iterate over any database column in a specific order (for example,
`created_at DESC`). To ensure consistent order of the returned records with the same values for
`created_at`, use a tie-breaker column with unique values (for example, `id`).

Assume you have the following index in the `issues` table:

```sql
"idx_issues_on_project_id_and_created_at_and_id" btree (project_id, created_at, id)
```

### Fetching records for further processing

The following snippet iterates over issue records within the project using the specified order
(`created_at, id`).

```ruby
scope = Issue.where(project_id: 278964).order(:created_at, :id) # id is the tie-breaker

iterator = Gitlab::Pagination::Keyset::Iterator.new(scope: scope)

iterator.each_batch(of: 100) do |records|
  puts records.map(&:id)
end
```

You can add extra filters to the query. This example only lists the issue IDs created in the last
30 days:

```ruby
scope = Issue.where(project_id: 278964).where('created_at > ?', 30.days.ago).order(:created_at, :id) # id is the tie-breaker

iterator = Gitlab::Pagination::Keyset::Iterator.new(scope: scope)

iterator.each_batch(of: 100) do |records|
  puts records.map(&:id)
end
```

### Updating records in the batch

For complex `ActiveRecord` queries, the `.update_all` method does not work well, because it
generates an incorrect `UPDATE` statement.
You can use raw SQL for updating records in batches:

```ruby
scope = Issue.where(project_id: 278964).order(:created_at, :id) # id is the tie-breaker

iterator = Gitlab::Pagination::Keyset::Iterator.new(scope: scope)

iterator.each_batch(of: 100) do |records|
  ApplicationRecord.connection.execute("UPDATE issues SET updated_at=NOW() WHERE issues.id in (#{records.dup.reselect(:id).to_sql})")
end
```

NOTE:
To keep the iteration stable and predictable, avoid updating the columns in the `ORDER BY` clause.

### Iterate over the `merge_request_diff_commits` table

The `merge_request_diff_commits` table uses a composite primary key (`merge_request_diff_id,
relative_order`), which makes `EachBatch` impossible to use efficiently.

To paginate over the `merge_request_diff_commits` table, you can use the following snippet:

```ruby
# Custom order object configuration:
order = Gitlab::Pagination::Keyset::Order.build([
  Gitlab::Pagination::Keyset::ColumnOrderDefinition.new(
    attribute_name: 'merge_request_diff_id',
    order_expression: MergeRequestDiffCommit.arel_table[:merge_request_diff_id].asc,
    nullable: :not_nullable,
    distinct: false
  ),
  Gitlab::Pagination::Keyset::ColumnOrderDefinition.new(
    attribute_name: 'relative_order',
    order_expression: MergeRequestDiffCommit.arel_table[:relative_order].asc,
    nullable: :not_nullable,
    distinct: false
  )
])
MergeRequestDiffCommit.include(FromUnion) # keyset pagination generates UNION queries

scope = MergeRequestDiffCommit.order(order)

iterator = Gitlab::Pagination::Keyset::Iterator.new(scope: scope)

iterator.each_batch(of: 100) do |records|
  puts records.map { |record| [record.merge_request_diff_id, record.relative_order] }.inspect
end
```

### Order object configuration

Keyset pagination works well with simple `ActiveRecord` `order` scopes
([first example](iterating_tables_in_batches.md#iterate-over-the-issues-in-a-project-by-creation-date)).
However, in special cases, you need to describe the columns in the `ORDER BY` clause (second example)
for the underlying keyset pagination library. When the `ORDER BY` configuration cannot be
automatically determined by the keyset pagination library, an error is raised.

The code comments of the
[`Gitlab::Pagination::Keyset::Order`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/pagination/keyset/order.rb)
and [`Gitlab::Pagination::Keyset::ColumnOrderDefinition`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/pagination/keyset/column_order_definition.rb)
classes give an overview of the possible options for configuring the `ORDER BY` clause. You can
also find a few code examples in the
[keyset pagination](database/keyset_pagination.md#complex-order-configuration) documentation.