src/backend/access/spgist/README

SP-GiST is an abbreviation of space-partitioned GiST.  It provides a
generalized infrastructure for implementing space-partitioned data
structures, such as quadtrees, k-d trees, and radix trees (tries).  When
implemented in main memory, these structures are usually designed as a set of
dynamically allocated nodes linked by pointers.  This is not suitable for
storing directly on disk, since the chains of pointers can be rather long and
require too many disk accesses.  In contrast, disk-based data structures
should have a high fanout to minimize I/O.  The challenge is to map tree
nodes to disk pages in such a way that the search algorithm accesses only a
few disk pages, even if it traverses many nodes.


COMMON STRUCTURE DESCRIPTION

Logically, an SP-GiST tree is a set of tuples, each of which can be either
an inner or a leaf tuple.  Each inner tuple contains "nodes", which are
(label, pointer) pairs, where the pointer (ItemPointerData) points to
another inner tuple or to the head of a list of leaf tuples.  Inner tuples
can have different numbers of nodes (children).  Branches can be of
different depth (there is no control or code to support balancing), which
means that the tree is unbalanced.  However, leaf and inner tuples cannot
be intermixed at the same level: a downlink from a node of an inner tuple
leads either to one inner tuple, or to a list of leaf tuples.

The SP-GiST core requires that inner and leaf tuples fit on a single index
page, and even more stringently that the list of leaf tuples reached from a
single inner-tuple node all be stored on the same index page.  (Restricting
such lists to not cross pages reduces seeks, and allows the list links to be
stored as simple 2-byte OffsetNumbers.)  SP-GiST index opclasses should
therefore ensure that not too many nodes can be needed in one inner tuple,
and that inner-tuple prefixes and leaf-node datum values not be too large.

Inner and leaf tuples are stored separately: the former are stored only on
"inner" pages, the latter only on "leaf" pages.  Also, there are special
restrictions on the root page.  Early in an index's life, when there is only
one page's worth of data, the root page contains an unorganized set of leaf
tuples.  After the first page split has occurred, the root is required to
contain exactly one inner tuple.

When the search traversal algorithm reaches an inner tuple, it chooses a set
of nodes whose subtrees it will continue to descend into.  If it reaches a
leaf page, it scans the list of leaf tuples to find the ones that match the
query.  SP-GiST also supports ordered (nearest-neighbor) searches: during
such a scan, pending nodes are kept in a priority queue, so that traversal
proceeds in closest-first order.


The insertion algorithm descends the tree similarly, except it must choose
just one node to descend to from each inner tuple.  Insertion might also have
to modify the inner tuple before it can descend: it could add a new node, or
it could "split" the tuple to obtain a less-specific prefix that can match
the value to be inserted.  If it's necessary to append a new leaf tuple to a
list and there is no free space on the page, then SP-GiST creates a new inner
tuple and distributes the leaf tuples into a set of lists, possibly spanning
several pages.

An inner tuple consists of:

  optional prefix value - all successors must be consistent with it.
    Example:
        radix tree   - prefix value is a common prefix string
        quad tree    - centroid
        k-d tree     - one coordinate

  list of nodes, where a node is a (label, pointer) pair.
    Example of a label: a single character for a radix tree

A leaf tuple consists of:

  a leaf value
    Example:
        radix tree - the rest of the string (the postfix)
        quad and k-d tree - the point itself

  ItemPointer to the corresponding heap tuple
  nextOffset, the offset number of the next leaf tuple in its chain on the
    leaf page

  optional nulls bitmask
  optional INCLUDE-column values
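
To make the logical layout above concrete, here is a simplified C sketch of
what an inner tuple and a leaf tuple carry.  The type and field names below
are illustrative stand-ins, not the actual on-disk representation; the real
definitions live in src/include/access/spgist_private.h:

    #include <stdint.h>

    typedef uintptr_t Datum;         /* stand-in for PostgreSQL's Datum */
    typedef uint16_t OffsetNumber;   /* stand-in for an item offset on a page */

    typedef struct                   /* simplified stand-in for ItemPointerData */
    {
        uint32_t     blockNumber;
        OffsetNumber offsetNumber;
    } ItemPointerData;

    /* One node of an inner tuple: a (label, pointer) pair */
    typedef struct SketchNode
    {
        Datum           label;       /* e.g. one character in a radix tree */
        ItemPointerData downlink;    /* child inner tuple, or head of leaf chain */
    } SketchNode;

    /* Logical content of an inner tuple */
    typedef struct SketchInnerTuple
    {
        Datum       prefix;          /* optional prefix value */
        int         nNodes;          /* number of (label, pointer) nodes */
        SketchNode *nodes;
    } SketchInnerTuple;

    /* Logical content of a leaf tuple */
    typedef struct SketchLeafTuple
    {
        Datum           leafValue;   /* e.g. the postfix string in a radix tree */
        ItemPointerData heapPtr;     /* the indexed heap tuple */
        OffsetNumber    nextOffset;  /* next leaf tuple in its chain, or 0 */
    } SketchLeafTuple;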

For compatibility with pre-v14 indexes, a leaf tuple has a nulls bitmask
only if there are null values (among the leaf value and the INCLUDE values)
*and* there is at least one INCLUDE column.  The null-ness of the leaf
value can be inferred from whether the tuple is on a "nulls page" (see below),
so it is not necessary to represent it explicitly.  But we include it anyway
in a bitmask used with INCLUDE values, so that standard tuple deconstruction
code can be used.
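
Stated as a predicate (with purely illustrative names, not an actual helper
from the code), the rule is:

    #include <stdbool.h>

    /* A pre-v14-compatible leaf tuple carries a nulls bitmask only when
     * both conditions hold. */
    static bool
    sketch_leaf_has_nulls_mask(bool any_value_is_null, int n_include_columns)
    {
        return any_value_is_null && n_include_columns > 0;
    }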


NULLS HANDLING

We assume that SPGiST-indexable operators are strict (can never succeed for
null inputs).  It is still desirable to index nulls, so that whole-table
indexscans are possible and so that "x IS NULL" can be implemented by an
SPGiST indexscan.  However, we prefer that SPGiST index opclasses not have
to cope with nulls.  Therefore, the main tree of an SPGiST index does not
include any null entries.  We store null entries in a separate SPGiST tree
occupying a disjoint set of pages (in particular, its own root page).
Insertions and searches in the nulls tree do not use any of the
opclass-supplied functions, but just use hardwired logic comparable to
AllTheSame cases in the normal tree.


INSERTION ALGORITHM

The insertion algorithm is designed to keep the tree in a consistent state
at all times.  Here is a simplified specification of the insertion algorithm
(the numbers refer to the notes below):

  Start with the first tuple on the root page (1)

  loop:
    if (page is leaf) then
        if (enough space)
            insert on page and exit (5)
        else (7)
            call PickSplitFn() (2)
        end if
    else
        switch (chooseFn())
            case MatchNode  - descend through selected node
            case AddNode    - add node and then retry chooseFn (3, 6)
            case SplitTuple - split inner tuple to prefix and postfix, then
                              retry chooseFn with the prefix tuple (4, 6)
    end if

Notes:

(1) Initially, we just dump leaf tuples into the root page until it is full;
then we split it.  Once the root is not a leaf page, it can have only one
inner tuple, so as to keep the amount of free space on the root as large as
possible.  Both of these rules are meant to postpone doing PickSplit on the
root for as long as possible, so that the topmost partitioning of the search
space is as good as we can easily make it.

(2) The current implementation allows a picksplit and the insertion of the
new leaf tuple to be done in one operation, if the new list of leaf tuples
fits on one page.  This is always possible for trees with small nodes, such
as quad trees and k-d trees, but radix trees may require another picksplit.

(3) Adding a node must keep the inner tuple small enough to fit on a page.
Even so, after the addition the inner tuple could become too large to be
stored on its current page because of the other tuples already on that page.
In this case it is moved to another inner page (see the notes about page
management).  When moving a tuple to another page, we can't change the offset
numbers of the other tuples on the page, else we'd invalidate the downlink
pointers to them.  To prevent that, SP-GiST leaves a "placeholder" tuple,
which can be reused later whenever another tuple is added to the page.  See
also the Concurrency and Vacuum sections below.  Right now only radix trees
can add a node to an existing tuple; quad trees and k-d trees create all
possible nodes at once in the PickSplitFn() call.

(4) The prefix value might only partially match a new value, so the
SplitTuple action allows the current tree branch to be broken into upper and
lower sections.  Another way to say it is that we can split the current inner
tuple into "prefix" and "postfix" parts, where the prefix part is able to
match the incoming new value.  Consider an example of insertion into a radix
tree.  We use the following notation, where the tuple's id is just for
discussion (no such id is actually stored):

inner tuple: {tuple id}(prefix string)[ comma-separated list of node labels ]
leaf tuple: {tuple id}<value>

Suppose we need to insert the string 'www.gogo.com' into the inner tuple

    {1}(www.google.com/)[a, i]

The string does not match the prefix, so we cannot descend.  We must
split the inner tuple into two tuples:

    {2}(www.go)[o]  - prefix tuple
                |
                {3}(gle.com/)[a,i] - postfix tuple

On the next iteration of the loop we find that 'www.gogo.com' matches the
prefix, but not any node label, so we add a node [g] to tuple {2}:

                   NIL (no child exists yet)
                   |
    {2}(www.go)[o, g]
                |
                {3}(gle.com/)[a,i]

Now we can descend through the [g] node, which will cause us to update
the target string to just 'o.com'.  Finally, we'll insert a leaf tuple
bearing that string:

                  {4}<o.com>
                   |
    {2}(www.go)[o, g]
                |
                {3}(gle.com/)[a,i]

As we can see, the original tuple's node array moves to the postfix tuple
without any changes.  Note also that the SP-GiST core assumes that the prefix
tuple is not larger than the old inner tuple.  That allows us to store the
prefix tuple directly in place of the old inner tuple.  The SP-GiST core will
try to store the postfix tuple on the same page if possible, but will use
another page if there is not enough free space (see notes 5 and 6).
Currently, quad and k-d trees don't use this feature, because they have no
concept of a prefix being "inconsistent" with any new value.  They grow their
depth only via PickSplitFn() calls.

(5) If the pointer in the parent's node is a NIL pointer, the algorithm
chooses a leaf page to store the new leaf tuple on.  At first, it tries to
use the last-used leaf page with the largest free space (which we track in
each backend) to better utilize disk space.  If that's not large enough, the
algorithm allocates a new page.

(6) Management of inner pages is very similar to the management of leaf
pages described in (5).

(7) Actually, the current implementation can move the whole list of leaf
tuples plus the new tuple to another page, if the list is short enough.  This
improves space utilization, but doesn't change the basics of the algorithm.


CONCURRENCY

While descending the tree, the insertion algorithm holds exclusive locks on
two tree levels at a time, i.e., both the parent and the child page (but the
parent and child pages can be the same, see the notes above).  There is a
possibility of deadlock between two insertions if there are cross-referenced
pages in different branches.  That is, if an inner tuple on page M has a
child on page N while an inner tuple from another branch on page N has a
child on page M, then two insertions descending the two branches could
deadlock, since each will hold its parent page's lock while trying to get
the child page's lock.

Currently, we deal with this by conditionally locking buffers as we descend
the tree.  If we fail to get a lock on a buffer, we release both buffers and
restart the insertion process.  This is potentially inefficient, but the
locking costs of a more deterministic approach seem very high.
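
The conditional-locking retry can be illustrated with ordinary mutexes.  The
following toy sketch uses POSIX threads rather than PostgreSQL's
buffer-locking primitives, and the function is hypothetical; it shows a
single descent step that backs off instead of waiting for the second lock,
which is what prevents the deadlock:

    #include <pthread.h>
    #include <stdbool.h>

    /*
     * "parent" and "child" stand in for the exclusive locks on the two page
     * levels held during one descent step.  On failure the caller is
     * expected to restart the whole insertion descent from the root.
     */
    static bool
    sketch_descend_one_step(pthread_mutex_t *parent, pthread_mutex_t *child)
    {
        pthread_mutex_lock(parent);
        if (pthread_mutex_trylock(child) != 0)
        {
            /* Could not get the child lock without waiting: back off. */
            pthread_mutex_unlock(parent);
            return false;
        }

        /*
         * ... do the per-level work here; the real algorithm would then
         * treat "child" as the new parent and continue downward ...
         */
        pthread_mutex_unlock(parent);
        pthread_mutex_unlock(child);
        return true;
    }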

To reduce the number of cases where that happens, we introduce the concept of
"triple parity" of pages: if an inner tuple is on a page with BlockNumber N,
then its child tuples should be placed on the same page, or else on a page
with BlockNumber M where (N+1) mod 3 == M mod 3.  This rule ensures that
tuples on page M will have no children on page N, since
(M+1) mod 3 != N mod 3.  That makes it unlikely that two insertion processes
will conflict with each other while descending the tree.  It's not perfect
though: in the first place, we could still get a deadlock among three or more
insertion processes, and in the second place, it's impractical to preserve
this invariant in every case when we expand or split an inner tuple.  So we
still have to allow for deadlocks.
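
As a concrete restatement of the parity rule, here is a small illustrative C
predicate.  The function and type names are hypothetical, not part of the
actual SP-GiST page-allocation code:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t BlockNumber;   /* stand-in for PostgreSQL's BlockNumber */

    /*
     * Triple-parity rule: a child of an inner tuple stored on block "parent"
     * may live on the same block, or on a block whose parity group is
     * (parent + 1) mod 3.
     */
    static bool
    child_block_allowed(BlockNumber parent, BlockNumber child)
    {
        return child == parent || (parent + 1) % 3 == child % 3;
    }

With this rule, two distinct pages can never be parent and child of each
other: child_block_allowed(M, N) and child_block_allowed(N, M) cannot both be
true when M != N, which removes the simplest two-process deadlock.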

Insertion may also need to take locks on an additional inner and/or leaf page
to add tuples of the right type(s), when there's not enough room on the pages
it descended through.  However, we don't care exactly which such page we add
to, so deadlocks can be avoided by conditionally locking the additional
buffers: if we fail to get a lock on an additional page, just try another one.

The search traversal algorithm is rather traditional.  At each non-leaf
level, it share-locks the page, identifies which node(s) in the current inner
tuple need to be visited, and puts those addresses on a stack of pages to
examine later.  It then releases the lock on the current buffer before
visiting the next stack item.  So only one page is locked at a time, and no
deadlock is possible.  Instead, we have to worry about race conditions: by
the time we arrive at a pointed-to page, a concurrent insertion could have
replaced the target inner tuple (or leaf tuple chain) with data placed
elsewhere.  To handle that, whenever the insertion algorithm changes a
nonempty downlink in an inner tuple, it places a "redirect tuple" in place of
the lower-level inner tuple or leaf-tuple chain head that the link formerly
led to.  Scans (though not insertions) must be prepared to honor such
redirects.  Only a scan that had already visited the parent level could
possibly reach such a redirect tuple, so we can remove redirects once all
active transactions have been flushed out of the system.


DEAD TUPLES

Tuples on leaf pages can be in one of four states:

SPGIST_LIVE: normal, live pointer to a heap tuple.

SPGIST_REDIRECT: placeholder that contains a link to another place in the
index.  When a chain of leaf tuples has to be moved to another page, a
redirect tuple is inserted in place of the chain's head tuple.  The parent
inner tuple's downlink is updated when this happens, but concurrent scans
might be "in flight" from the parent page to the child page (since they
release the lock on the parent page before attempting to lock the child).
The redirect pointer serves to tell such a scan where to go.  A redirect
pointer is only needed for as long as such concurrent scans could be in
progress.  Eventually, it's converted into a PLACEHOLDER dead tuple by
VACUUM, and is then a candidate for replacement.  Searches that find such
a tuple (which should never be part of a chain) should immediately proceed
to the other place, forgetting about the redirect tuple.  Insertions that
reach such a tuple should raise an error, since a valid downlink should
never point to such a tuple.

SPGIST_DEAD: tuple is dead, but it cannot be removed or moved to a
different offset on the page because there is a link leading to it from
some inner tuple elsewhere in the index.  (Such a tuple is never part of a
chain, since we don't need one unless there is nothing live left in its
chain.)  Searches should ignore such entries.  If an insertion action
arrives at such a tuple, it should either replace it in place (if there's
room on the page to hold the desired new leaf tuple) or replace it with a
redirection pointer to wherever it puts the new leaf tuple.

SPGIST_PLACEHOLDER: tuple is dead, and there are known to be no links to
it from elsewhere.  When a live tuple is deleted or moved away, and not
replaced by a redirect pointer, it is replaced by a placeholder to keep
the offsets of later tuples on the same page from changing.  Placeholders
can be freely replaced when adding a new tuple to the page, and VACUUM
will also delete any that are at the end of the range of valid tuple
offsets.  Both searches and insertions should complain if a link from
elsewhere leads them to a placeholder tuple.
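
As an illustration of these states and of how a search follows a redirect,
here is a small self-contained C sketch.  The enum values, the array-based
"page", and the helper are hypothetical and do not reflect how the index
actually stores tuples:

    /* The four tuple states described above (representation is illustrative). */
    typedef enum SketchTupleState
    {
        SKETCH_LIVE,
        SKETCH_REDIRECT,
        SKETCH_DEAD,
        SKETCH_PLACEHOLDER
    } SketchTupleState;

    typedef struct SketchTuple
    {
        SketchTupleState state;
        int              redirectTo;   /* meaningful only for REDIRECT */
    } SketchTuple;

    /*
     * Follow redirect tuples until reaching a non-redirect tuple, as a scan
     * must do when it arrives via a stale parent-level pointer.  Returns the
     * final position, or -1 for a placeholder (real code would complain,
     * since no valid link should lead to a placeholder).
     */
    static int
    sketch_follow_redirects(const SketchTuple *tuples, int pos)
    {
        while (tuples[pos].state == SKETCH_REDIRECT)
            pos = tuples[pos].redirectTo;
        return (tuples[pos].state == SKETCH_PLACEHOLDER) ? -1 : pos;
    }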

When the root page is also a leaf, all its tuples should be in LIVE state;
there's no need for the other states, since there are no links to the root's
tuples and no need to preserve their offset numbers.

Tuples on inner pages can be in LIVE, REDIRECT, or PLACEHOLDER states.
The REDIRECT state has the same function as on leaf pages, to send
concurrent searches to the place where they need to go after an inner
tuple is moved to another page.  Expired REDIRECT pointers are converted
to PLACEHOLDER status by VACUUM, and are then candidates for replacement.
The DEAD state is not currently possible, since VACUUM does not attempt to
remove unused inner tuples.


VACUUM

VACUUM (or more precisely, spgbulkdelete) performs a single sequential scan
over the entire index.  On both leaf and inner pages, we can convert old
REDIRECT tuples into PLACEHOLDER status, and then remove any PLACEHOLDERs
that are at the end of the page (since they aren't needed to preserve the
offsets of any live tuples).  On leaf pages, we scan for tuples that need
to be deleted because their heap TIDs match a vacuum target TID.

If we find a deletable tuple that is not at the head of its chain, we
can simply replace it with a PLACEHOLDER, updating the chain links to
remove it from the chain.  If it is at the head of its chain, but there's
at least one live tuple remaining in the chain, we move that live tuple
to the head tuple's offset, leaving a PLACEHOLDER at its old position to
preserve the offsets of the other tuples.  This keeps the parent inner
tuple's downlink valid.  If we find ourselves deleting all live tuples in
a chain, we replace the head tuple with a DEAD tuple and the rest with
PLACEHOLDERS.  The parent inner tuple's downlink thus points to the DEAD
tuple, and the rules explained in the previous section keep everything
working.

VACUUM doesn't know a priori which tuples are the heads of their chains, but
it can easily figure that out by constructing a predecessor array that's
the reverse map of the nextOffset links (i.e., when we see that tuple x links
to tuple y, we set predecessor[y] = x).  Then the head tuples are the ones
with no predecessor.
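
The reverse-mapping trick might look like the following self-contained sketch
(the plain-array representation and the function name are illustrative only;
the real code works with OffsetNumbers on an actual index page):

    /*
     * Build the predecessor array.  Offsets are 1-based to mirror
     * OffsetNumber usage; 0 means "no link" or "no predecessor".  After the
     * loops, any i with predecessor[i] == 0 is the head of its chain.
     */
    static void
    sketch_find_chain_heads(const unsigned nextOffset[], unsigned ntuples,
                            unsigned predecessor[])
    {
        for (unsigned i = 1; i <= ntuples; i++)
            predecessor[i] = 0;

        for (unsigned x = 1; x <= ntuples; x++)
        {
            unsigned y = nextOffset[x];

            if (y != 0)
                predecessor[y] = x;     /* tuple x links to tuple y */
        }
    }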

Because insertions can occur while VACUUM runs, a pure sequential scan
could miss deleting some target leaf tuples, because they could get moved
from a not-yet-visited leaf page to an already-visited leaf page as a
consequence of a PickSplit or MoveLeafs operation.  Failing to delete any
target TID is not acceptable, so we have to extend the algorithm to cope
with such cases.  We recognize that such a move might have occurred when
we see a leaf-page REDIRECT tuple whose XID indicates it might have been
created after the VACUUM scan started.  We add the redirection target TID
to a "pending list" of places we need to recheck.  Between pages of the
main sequential scan, we empty the pending list by visiting each listed
TID.  If it points to an inner tuple (from a PickSplit), we add each
downlink TID to the pending list.  If it points to a leaf page, we vacuum
that page.  (We could just vacuum the single pointed-to chain, but
vacuuming the whole page simplifies the code and reduces the odds of
VACUUM having to modify the same page multiple times.)  To ensure that
pending-list processing can never get into an endless loop, even in the
face of concurrent index changes, we don't remove list entries while
processing; instead, we just mark items as done after processing them, and
remove them only once all pending-list processing is complete.  Adding a
TID that's already in the list is a no-op, whether or not that item is
marked done yet.
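
The "mark done, never remove while processing" rule can be illustrated with
a small self-contained sketch.  The structure and names below are
hypothetical; the real pending list stores ItemPointers and is maintained in
spgvacuum.c:

    #include <stdbool.h>

    typedef struct SketchPendingItem
    {
        int  tid;       /* toy stand-in for an ItemPointer */
        bool done;
    } SketchPendingItem;

    typedef struct SketchPendingList
    {
        SketchPendingItem items[1024];
        int               nitems;
    } SketchPendingList;

    /* Adding a TID that is already present is a no-op, done or not. */
    static void
    sketch_pending_add(SketchPendingList *list, int tid)
    {
        for (int i = 0; i < list->nitems; i++)
            if (list->items[i].tid == tid)
                return;
        if (list->nitems < 1024)            /* toy: just drop on overflow */
        {
            list->items[list->nitems].tid = tid;
            list->items[list->nitems].done = false;
            list->nitems++;
        }
    }

    /*
     * Process entries until none are left undone.  Processing an entry may
     * add more entries (e.g. the downlinks of an inner tuple), but because
     * items are only marked done -- never removed -- and duplicates are
     * ignored, the loop cannot revisit the same TID forever.  Entries are
     * discarded only after all processing is complete.
     */
    static void
    sketch_pending_drain(SketchPendingList *list,
                         void (*process)(SketchPendingList *list, int tid))
    {
        bool progress = true;

        while (progress)
        {
            progress = false;
            for (int i = 0; i < list->nitems; i++)
            {
                if (!list->items[i].done)
                {
                    list->items[i].done = true;
                    process(list, list->items[i].tid);
                    progress = true;
                }
            }
        }
        list->nitems = 0;
    }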

spgbulkdelete also updates the index's free space map.

Currently, spgvacuumcleanup has nothing to do if spgbulkdelete was
performed; otherwise, it does an spgbulkdelete scan with an empty target
list, so as to clean up redirections and placeholders, update the free
space map, and gather statistics.


LAST USED PAGE MANAGEMENT

The list of last-used pages contains four pages - a leaf page and three
inner pages, one from each "triple parity" group.  (Actually, there's one
such list for the main tree and a separate one for the nulls tree.)  This
list is stored between calls on the index meta page, but updates are never
WAL-logged, so as to decrease WAL traffic.  Incorrect data on the meta page
isn't critical, because we can simply allocate a new page at any time.


AUTHORS

    Teodor Sigaev <teodor@sigaev.ru>
    Oleg Bartunov <oleg@sai.msu.su>