• Home
  • History
  • Annotate
Name Date Size #Lines LOC

..08-Nov-2021-

MakefileH A D08-Nov-2021583 2815

READMEH A D08-Nov-20219.1 KiB190152

brin.cH A D08-Nov-202150.2 KiB1,7681,028

brin_bloom.cH A D08-Nov-202125.4 KiB810333

brin_inclusion.cH A D08-Nov-202121.3 KiB658356

brin_minmax.cH A D08-Nov-20219.3 KiB318211

brin_minmax_multi.cH A D08-Nov-202186.5 KiB3,1641,573

brin_pageops.cH A D08-Nov-202125.3 KiB916504

brin_revmap.cH A D08-Nov-202118.9 KiB665384

brin_tuple.cH A D08-Nov-202118.4 KiB709399

brin_validate.cH A D08-Nov-20219.2 KiB282201

brin_xlog.cH A D08-Nov-20218.4 KiB362236

README

1Block Range Indexes (BRIN)
2==========================
3
4BRIN indexes intend to enable very fast scanning of extremely large tables.
5
6The essential idea of a BRIN index is to keep track of summarizing values in
7consecutive groups of heap pages (page ranges); for example, the minimum and
8maximum values for datatypes with a btree opclass, or the bounding box for
9geometric types.  These values can be used to avoid scanning such pages
10during a table scan, depending on query quals.
11
12The cost of this is having to update the stored summary values of each page
13range as tuples are inserted into them.
14
15
16Access Method Design
17--------------------
18
19Since item pointers are not stored inside indexes of this type, it is not
20possible to support the amgettuple interface.  Instead, we only provide
21amgetbitmap support.  The amgetbitmap routine returns a lossy TIDBitmap
22comprising all pages in those page ranges that match the query
23qualifications.  The recheck step in the BitmapHeapScan node prunes tuples
24that are not visible according to the query qualifications.
25
26An operator class must have the following entries:
27
28- generic support procedures (pg_amproc), identical to all opclasses:
29  * "opcinfo" (BRIN_PROCNUM_OPCINFO) initializes a structure for index
30    creation or scanning
31  * "addValue" (BRIN_PROCNUM_ADDVALUE) takes an index tuple and a heap item,
32    and possibly changes the index tuple so that it includes the heap item
33    values
34  * "consistent" (BRIN_PROCNUM_CONSISTENT) takes an index tuple and query
35    quals, and returns whether the index tuple values match the query quals.
36  * "union" (BRIN_PROCNUM_UNION) takes two index tuples and modifies the first
37    one so that it represents the union of the two.
38Procedure numbers up to 10 are reserved for future expansion.
39
40Additionally, each opclass needs additional support functions:
41- Minmax-style operator classes:
42  * Proc numbers 11-14 are used for the functions implementing inequality
43    operators for the type, in this order: less than, less or equal,
44    greater or equal, greater than.
45
46Opclasses using a different design will require different additional procedure
47numbers.
48
49Operator classes also need to have operator (pg_amop) entries so that the
50optimizer can choose the index to execute queries.
51- Minmax-style operator classes:
52  * The same operators as btree (<=, <, =, >=, >)
53
54Each index tuple stores some NULL bits and some opclass-specified values, which
55are stored in a single null bitmask of length twice the number of columns.  The
56generic NULL bits indicate, for each column:
57  * bt_hasnulls: Whether there's any NULL value at all in the page range
58  * bt_allnulls: Whether all values are NULLs in the page range
59
60The opclass-specified values are:
61- Minmax-style operator classes
62  * minimum value across all tuples in the range
63  * maximum value across all tuples in the range
64
65Note that the addValue and Union support procedures  must be careful to
66datumCopy() the values they want to store in the in-memory BRIN tuple, and
67must pfree() the old copies when replacing older ones.  Since some values
68referenced from the tuple persist and others go away, there is no
69well-defined lifetime for a memory context that would make this automatic.
70
71
72The Range Map
73-------------
74
75To find the index tuple for a particular page range, we have an internal
76structure we call the range map, or "revmap" for short.  This stores one TID
77per page range, which is the address of the index tuple summarizing that
78range.  Since the map entries are fixed size, it is possible to compute the
79address of the range map entry for any given heap page by simple arithmetic.
80
81When a new heap tuple is inserted in a summarized page range, we compare the
82existing index tuple with the new heap tuple.  If the heap tuple is outside
83the summarization data given by the index tuple for any indexed column (or
84if the new heap tuple contains null values but the index tuple indicates
85there are no nulls), the index is updated with the new values.  In many
86cases it is possible to update the index tuple in-place, but if the new
87index tuple is larger than the old one and there's not enough space in the
88page, it is necessary to create a new index tuple with the new values.  The
89range map can be updated quickly to point to it; the old index tuple is
90removed.
91
92If the range map points to an invalid TID, the corresponding page range is
93considered to be not summarized.  When tuples are added to unsummarized
94pages, nothing needs to happen.
95
96To scan a table following a BRIN index, we scan the range map sequentially.
97This yields index tuples in ascending page range order.  Query quals are
98matched to each index tuple; if they match, each page within the page range
99is returned as part of the output TID bitmap.  If there's no match, they are
100skipped.  Range map entries returning invalid index TIDs, that is
101unsummarized page ranges, are also returned in the TID bitmap.
102
103The revmap is stored in the first few blocks of the index main fork,
104immediately following the metapage.  Whenever the revmap needs to be
105extended by another page, existing tuples in that page are moved to some
106other page.
107
108Heap tuples can be removed from anywhere without restriction.  It might be
109useful to mark the corresponding index tuple somehow, if the heap tuple is
110one of the constraining values of the summary data (i.e. either min or max
111in the case of a btree-opclass-bearing datatype), so that in the future we
112are aware of the need to re-execute summarization on that range, leading to
113a possible tightening of the summary values.
114
115Summarization
116-------------
117
118At index creation time, the whole table is scanned; for each page range the
119summarizing values of each indexed column and nulls bitmap are collected and
120stored in the index.  The partially-filled page range at the end of the
121table is also summarized.
122
123As new tuples get inserted at the end of the table, they may update the
124index tuple that summarizes the partial page range at the end.  Eventually
125that page range is complete and new tuples belong in a new page range that
126hasn't yet been summarized.  Those insertions do not create a new index
127entry; instead, the page range remains unsummarized until later.
128
129Whenever VACUUM is run on the table, all unsummarized page ranges are
130summarized.  This action can also be invoked by the user via
131brin_summarize_new_values().  Both these procedures scan all the
132unsummarized ranges, and create a summary tuple.  Again, this includes the
133partially-filled page range at the end of the table.
134
135Vacuuming
136---------
137
138Since no heap TIDs are stored in a BRIN index, it's not necessary to scan the
139index when heap tuples are removed.  It might be that some summary values can
140be tightened if heap tuples have been deleted; but this would represent an
141optimization opportunity only, not a correctness issue.  It's simpler to
142represent this as the need to re-run summarization on the affected page range
143rather than "subtracting" values from the existing one.  This is not
144currently implemented.
145
146Note that if there are no indexes on the table other than the BRIN index,
147usage of maintenance_work_mem by vacuum can be decreased significantly, because
148no detailed index scan needs to take place (and thus it's not necessary for
149vacuum to save TIDs to remove).  It's unlikely that BRIN would be the only
150indexes in a table, though, because primary keys can be btrees only, and so
151we don't implement this optimization.
152
153
154Optimizer
155---------
156
157The optimizer selects the index based on the operator class' pg_amop
158entries for the column.
159
160
161Future improvements
162-------------------
163
164* Different-size page ranges?
165  In the current design, each "index entry" in a BRIN index covers the same
166  number of pages.  There's no hard reason for this; it might make sense to
167  allow the index to self-tune so that some index entries cover smaller page
168  ranges, if this allows the summary values to be more compact.  This would incur
169  larger BRIN overhead for the index itself, but might allow better pruning of
170  page ranges during scan.  In the limit of one index tuple per page, the index
171  itself would occupy too much space, even though we would be able to skip
172  reading the most heap pages, because the summary values are tight; in the
173  opposite limit of a single tuple that summarizes the whole table, we wouldn't
174  be able to prune anything even though the index is very small.  This can
175  probably be made to work by using the range map as an index in itself.
176
177* More compact representation for TIDBitmap?
178  TIDBitmap is the structure used to represent bitmap scans.  The
179  representation of lossy page ranges is not optimal for our purposes, because
180  it uses a Bitmapset to represent pages in the range; since we're going to return
181  all pages in a large range, it might be more convenient to allow for a
182  struct that uses start and end page numbers to represent the range, instead.
183
184* Better vacuuming?
185  It might be useful to enable passing more useful info to BRIN indexes during
186  vacuuming about tuples that are deleted, i.e. do not require the callback to
187  pass each tuple's TID.  For instance we might need a callback that passes a
188  block number instead of a TID.  That would help determine when to re-run
189  summarization on blocks that have seen lots of tuple deletions.
190