1---
2stage: Plan
3group: Project Management
4info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
5---
6# Filtering by label
7
8## Introduction
9
10GitLab has [labels](../user/project/labels.md) that can be assigned to issues,
11merge requests, and epics. Labels on those objects are a many-to-many relation
12through the polymorphic `label_links` table.
13
14To filter these objects by multiple labels - for instance, 'all open
15issues with the label ~Plan and the label ~backend' - we generate a
16query containing a `GROUP BY` clause. In a simple form, this looks like:
17
18```sql
19SELECT
20    issues.*
21FROM
22    issues
23    INNER JOIN label_links ON label_links.target_id = issues.id
24        AND label_links.target_type = 'Issue'
25    INNER JOIN labels ON labels.id = label_links.label_id
26WHERE
27    issues.project_id = 13083
28    AND (issues.state IN ('opened'))
29    AND labels.title IN ('Plan',
30        'backend')
31GROUP BY
32    issues.id
33HAVING (COUNT(DISTINCT labels.title) = 2)
34ORDER BY
35    issues.updated_at DESC,
36    issues.id DESC
37LIMIT 20 OFFSET 0
38```
39
40In particular, note that:
41
421. We `GROUP BY issues.id` so that we can ...
431. Use the `HAVING (COUNT(DISTINCT labels.title) = 2)` condition to ensure that
44   all matched issues have both labels.
45
46This is more complicated than is ideal. It makes the query construction more
47prone to errors (such as
48[issue #15557](https://gitlab.com/gitlab-org/gitlab-foss/-/issues/15557)).
49
50## Attempt A: `WHERE EXISTS`
51
52### Attempt A1: use multiple subqueries with `WHERE EXISTS`
53
54In [issue #37137](https://gitlab.com/gitlab-org/gitlab-foss/-/issues/37137)
55and its associated [merge request](https://gitlab.com/gitlab-org/gitlab-foss/-/merge_requests/14022),
56we tried to replace the `GROUP BY` with multiple uses of `WHERE EXISTS`. For the
57example above, this would give:
58
59```sql
60WHERE (EXISTS (
61        SELECT
62            TRUE
63        FROM
64            label_links
65            INNER JOIN labels ON labels.id = label_links.label_id
66        WHERE
67            labels.title = 'Plan'
68            AND target_type = 'Issue'
69            AND target_id = issues.id))
70AND (EXISTS (
71        SELECT
72            TRUE
73        FROM
74            label_links
75            INNER JOIN labels ON labels.id = label_links.label_id
76        WHERE
77            labels.title = 'backend'
78            AND target_type = 'Issue'
79            AND target_id = issues.id))
80```
81
82While this worked without schema changes, and did improve readability somewhat,
83it did not improve query performance.
84
85### Attempt A2: use label IDs in the `WHERE EXISTS` clause
86
87In [merge request #34503](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/34503), we followed a similar approach to A1. But this time, we
88did a separate query to fetch the IDs of the labels used in the filter so that we avoid the `JOIN` in the `EXISTS` clause and filter directly by
89`label_links.label_id`. We also added a new index on `label_links` for the `target_id`, `label_id`, and `target_type` columns to speed up this query.
90
91Finding the label IDs wasn't straightforward because there could be multiple labels with the same title within a single root namespace. We solved
92this by grouping the label IDs by title and then using the array of IDs in the `EXISTS` clauses.
93
94This resulted in a significant performance improvement. However, this optimization could not be applied to the dashboard pages
95where we do not have a project or group context. We could not easily search for the label IDs here because that would mean searching across all
96projects and groups that the user has access to.
97
98## Attempt B: Denormalize using an array column
99
100Having [removed MySQL support in GitLab 12.1](https://about.gitlab.com/blog/2019/06/27/removing-mysql-support/),
101using [PostgreSQL's arrays](https://www.postgresql.org/docs/11/arrays.html) became more
102tractable as we didn't have to support two databases. We discussed denormalizing
103the `label_links` table for querying in
104[issue #49651](https://gitlab.com/gitlab-org/gitlab-foss/-/issues/49651),
105with two options: label IDs and titles.
106
107We can think of both of those as array columns on `issues`, `merge_requests`,
108and `epics`: `issues.label_ids` would be an array column of label IDs, and
109`issues.label_titles` would be an array of label titles.
110
111These array columns can be complemented with [GIN
112indexes](https://www.postgresql.org/docs/11/gin-intro.html) to improve
113matching.
114
115### Attempt B1: store label IDs for each object
116
117This has some strong advantages over titles:
118
1191. Unless a label is deleted, or a project is moved, we never need to
120   bulk-update the denormalized column.
1211. It uses less storage than the titles.
122
123Unfortunately, our application design makes this hard. If we were able to query
124just by label ID easily, we wouldn't need the `INNER JOIN labels` in the initial
125query at the start of this document. GitLab allows users to filter by label
126title across projects and even across groups, so a filter by the label ~Plan may
127include labels with multiple distinct IDs.
128
129We do not want users to have to know about the different IDs, which means that
130given this data set:
131
132| Project | ~Plan label ID | ~backend label ID |
133| ------- | -------------- | ----------------- |
134| A       | 11             | 12                |
135| B       | 21             | 22                |
136| C       | 31             | 32                |
137
138We would need something like:
139
140```sql
141WHERE
142    label_ids @> ARRAY[11, 12]
143    OR label_ids @> ARRAY[21, 22]
144    OR label_ids @> ARRAY[31, 32]
145```
146
147This can get even more complicated when we consider that in some cases, there
148might be two ~backend labels - with different IDs - that could apply to the same
149object, so the number of combinations would balloon further.
150
151### Attempt B2: store label titles for each object
152
153From the perspective of updating the object, this is the worst
154option. We have to bulk update the objects when:
155
1561. The objects are moved from one project to another.
1571. The project is moved from one group to another.
1581. The label is renamed.
1591. The label is deleted.
160
161It also uses much more storage. Querying is simple, though:
162
163```sql
164WHERE
165    label_titles @> ARRAY['Plan', 'backend']
166```
167
168And our [tests in issue #49651](https://gitlab.com/gitlab-org/gitlab-foss/-/issues/49651#note_188777346)
169showed that this could be fast.
170
171However, at present, the disadvantages outweigh the advantages.
172
173## Conclusion
174
175We found a method A2 that does not need denormalization and improves the query performance significantly. This
176did not apply to all cases, but we were able to apply method A1 to the rest of the cases so that we remove the
177`GROUP BY` and `HAVING` clauses in all scenarios.
178
179This simplified the query and improved the performance in the most common cases.
180