src/backend/access/spgist/README

SP-GiST is an abbreviation of space-partitioned GiST.  It provides a
generalized infrastructure for implementing space-partitioned data
structures, such as quadtrees, k-d trees, and radix trees (tries).  When
implemented in main memory, these structures are usually designed as a set
of dynamically allocated nodes linked by pointers.  This is not suitable
for storing directly on disk, since the chains of pointers can be rather
long and require too many disk accesses.  In contrast, disk-based data
structures should have a high fanout to minimize I/O.  The challenge is to
map tree nodes to disk pages in such a way that the search algorithm
accesses only a few disk pages, even if it traverses many nodes.


COMMON STRUCTURE DESCRIPTION

Logically, an SP-GiST tree is a set of tuples, each of which can be either
an inner or a leaf tuple.  Each inner tuple contains "nodes", which are
(label,pointer) pairs, where the pointer (ItemPointerData) points to
another inner tuple or to the head of a list of leaf tuples.  Inner tuples
can have different numbers of nodes (children).  Branches can be of
different depth (actually, there is no control or code to support
balancing), which means that the tree is non-balanced.  However, leaf and
inner tuples cannot be intermixed at the same level: a downlink from a
node of an inner tuple leads either to one inner tuple, or to a list of
leaf tuples.

The SP-GiST core requires that inner and leaf tuples fit on a single index
page, and even more stringently that the list of leaf tuples reached from
a single inner-tuple node all be stored on the same index page.
(Restricting such lists to not cross pages reduces seeks, and allows the
list links to be stored as simple 2-byte OffsetNumbers.)  SP-GiST index
opclasses should therefore ensure that not too many nodes can be needed in
one inner tuple, and that inner-tuple prefixes and leaf-node datum values
not be too large.

Inner and leaf tuples are stored separately: the former are stored only on
"inner" pages, the latter only on "leaf" pages.  Also, there are special
restrictions on the root page.  Early in an index's life, when there is
only one page's worth of data, the root page contains an unorganized set
of leaf tuples.  After the first page split has occurred, the root is
required to contain exactly one inner tuple.

When the search traversal algorithm reaches an inner tuple, it chooses a
set of nodes through which to continue the traversal in depth.  If it
reaches a leaf page, it scans the list of leaf tuples to find the ones
that match the query.  SP-GiST also supports ordered (nearest-neighbor)
searches: during such a scan, pending nodes are kept in a priority queue,
so the traversal proceeds in closest-first order.

The insertion algorithm descends the tree similarly, except it must choose
just one node to descend to from each inner tuple.  Insertion might also
have to modify the inner tuple before it can descend: it could add a new
node, or it could "split" the tuple to obtain a less-specific prefix that
can match the value to be inserted.  If it's necessary to append a new
leaf tuple to a list and there is no free space on the page, then SP-GiST
creates a new inner tuple and distributes the leaf tuples into a set of
lists on, perhaps, several pages.

An inner tuple consists of:

  an optional prefix value - all successors must be consistent with it.
    Examples:
      radix tree - the prefix value is a common prefix string
      quad tree  - the centroid
      k-d tree   - one coordinate

  a list of nodes, where a node is a (label, pointer) pair.
    Example of a label: a single character, for a radix tree

A leaf tuple consists of:

  a leaf value
    Examples:
      radix tree        - the rest of the string (the postfix)
      quad and k-d tree - the point itself

  an ItemPointer to the heap tuple
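
To make this layout concrete, here is a simplified C sketch of the logical
content of the two tuple kinds.  This is only an illustration: the real
on-disk structs (SpGistInnerTupleData and SpGistLeafTupleData in
spgist_private.h) pack the same information much more tightly and carry
additional state bits, and the Demo* names below are invented for this
discussion.

    #include "postgres.h"
    #include "storage/itemptr.h"
    #include "storage/off.h"

    /* One node of an inner tuple: a (label, downlink) pair. */
    typedef struct DemoNode
    {
        Datum           label;      /* e.g. one character, for a radix tree */
        ItemPointerData downlink;   /* an inner tuple, or the head of a
                                     * list of leaf tuples */
    } DemoNode;

    /* An inner tuple: an optional prefix plus a set of nodes. */
    typedef struct DemoInnerTuple
    {
        bool     hasPrefix;
        Datum    prefix;            /* common prefix, centroid, coordinate */
        int      nNodes;
        DemoNode nodes[FLEXIBLE_ARRAY_MEMBER];
    } DemoInnerTuple;

    /* A leaf tuple: a value plus a pointer to the indexed heap tuple.
     * Leaf tuples in one list are chained by 2-byte offset links. */
    typedef struct DemoLeafTuple
    {
        Datum           value;      /* e.g. the rest of the string */
        ItemPointerData heapPtr;
        OffsetNumber    nextOffset; /* next leaf tuple in the list */
    } DemoLeafTuple;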


NULLS HANDLING

We assume that SPGiST-indexable operators are strict (can never succeed
for null inputs).  It is still desirable to index nulls, so that
whole-table indexscans are possible and so that "x IS NULL" can be
implemented by an SPGiST indexscan.  However, we prefer that SPGiST index
opclasses not have to cope with nulls.  Therefore, the main tree of an
SPGiST index does not include any null entries.  We store null entries in
a separate SPGiST tree occupying a disjoint set of pages (in particular,
its own root page).  Insertions and searches in the nulls tree do not use
any of the opclass-supplied functions, but just use hardwired logic
comparable to the AllTheSame cases in the normal tree.


INSERTION ALGORITHM

The insertion algorithm is designed to keep the tree in a consistent state
at any moment.  Here is a simplified specification of the insertion
algorithm (numbers refer to the notes below):

  Start with the first tuple on the root page (1)

  loop:
    if (page is leaf) then
      if (enough space)
        insert on page and exit (5)
      else (7)
        call PickSplitFn() (2)
      end if
    else
      switch (chooseFn())
        case MatchNode  - descend through the selected node
        case AddNode    - add a node and then retry chooseFn (3, 6)
        case SplitTuple - split the inner tuple into prefix and postfix,
                          then retry chooseFn with the prefix tuple (4, 6)
    end if
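
To make the three chooseFn outcomes concrete, here is a hedged, standalone
C sketch of the decision a radix-tree-style opclass might make at one
inner tuple.  It is not the actual spgtextproc.c code, and the names are
invented; the real opclass interface communicates through the spgChooseIn
and spgChooseOut structs declared in access/spgist.h.

    #include <string.h>

    typedef enum { MATCH_NODE, ADD_NODE, SPLIT_TUPLE } ChooseResult;

    /*
     * Decide what to do with 'key' at an inner tuple that has the given
     * prefix and single-character node labels.  '*out' returns the node
     * number to descend through (MATCH_NODE) or the label of the node
     * to add (ADD_NODE).
     */
    static ChooseResult
    choose(const char *prefix, const char *labels, int nlabels,
           const char *key, int *out)
    {
        size_t plen = strlen(prefix);

        /* Key doesn't start with the whole prefix: the inner tuple must
         * be split into a shorter prefix tuple and a postfix tuple. */
        if (strncmp(key, prefix, plen) != 0)
            return SPLIT_TUPLE;

        /* Prefix matched: look for a node labeled with the next byte. */
        for (int i = 0; i < nlabels; i++)
        {
            if (labels[i] == key[plen])
            {
                *out = i;           /* descend through node i */
                return MATCH_NODE;
            }
        }

        /* No such node yet: add one; chooseFn is then retried. */
        *out = (unsigned char) key[plen];
        return ADD_NODE;
    }

For example, with prefix 'www.google.com/' and key 'www.gogo.com' this
returns SPLIT_TUPLE, which is exactly the case worked through in note (4)
below.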

Notes:

(1) Initially, we just dump leaf tuples into the root page until it is
full; then we split it.  Once the root is not a leaf page, it can have
only one inner tuple, so as to keep the amount of free space on the root
as large as possible.  Both of these rules are meant to postpone doing
PickSplit on the root for as long as possible, so that the topmost
partitioning of the search space is as good as we can easily make it.

(2) The current implementation allows doing a picksplit and inserting the
new leaf tuple in one operation, if the new list of leaf tuples fits on
one page.  This is always possible for trees with small nodes, like quad
trees or k-d trees, but radix trees may require another picksplit.

(3) Adding a node must keep the inner tuple small enough to fit on a page.
Even so, after the addition the inner tuple could become too large for its
current page because of the other tuples already on that page.  In this
case it will be moved to another inner page (see the notes about page
management).  When moving a tuple to another page, we can't change the
offset numbers of the other tuples on the page, else we'd invalidate the
downlink pointers to them.  To prevent that, SP-GiST leaves a
"placeholder" tuple, which can be reused later whenever another tuple is
added to the page.  See also the Concurrency and Vacuum sections below.
Right now only radix trees can add a node to an existing tuple; quad trees
and k-d trees create all possible nodes at once in the PickSplitFn() call.

(4) The prefix value could only partially match a new value, so the
SplitTuple action allows breaking the current tree branch into upper and
lower sections.  Another way to say it is that we split the current inner
tuple into "prefix" and "postfix" parts, where the prefix part is able to
match the incoming new value.  Consider an example of insertion into a
radix tree.  We use the following notation, where a tuple's id is just for
discussion (no such id is actually stored):

inner tuple: {tuple id}(prefix string)[ comma-separated list of node labels ]
leaf tuple: {tuple id}<value>

Suppose we need to insert the string 'www.gogo.com' into the inner tuple

    {1}(www.google.com/)[a, i]

The string does not match the prefix, so we cannot descend.  We must split
the inner tuple into two tuples:

    {2}(www.go)[o]      - prefix tuple
     |
    {3}(gle.com/)[a,i]  - postfix tuple

On the next iteration of the loop we find that 'www.gogo.com' matches the
prefix, but not any node label, so we add a node [g] to tuple {2}:

    NIL (no child exists yet)
     |
    {2}(www.go)[o, g]
     |
    {3}(gle.com/)[a,i]

Now we can descend through the [g] node, which will cause us to update the
target string to just 'o.com'.  Finally, we'll insert a leaf tuple bearing
that string:

    {4}<o.com>
     |
    {2}(www.go)[o, g]
     |
    {3}(gle.com/)[a,i]

As we can see, the original tuple's node array moves to the postfix tuple
without any changes.  Note also that the SP-GiST core assumes that the
prefix tuple is not larger than the old inner tuple.  That allows us to
store the prefix tuple directly in place of the old inner tuple.  The
SP-GiST core will try to store the postfix tuple on the same page if
possible, but will use another page if there is not enough free space (see
notes 5 and 6).  Currently, quad and k-d trees don't use this feature,
because they have no concept of a prefix being "inconsistent" with any new
value.  They grow their depth only through PickSplitFn() calls.

(5) If the pointer from the parent node is NIL, the algorithm chooses a
leaf page to store the new leaf tuple on.  At first, it tries to use the
last-used leaf page with the largest free space (which we track in each
backend) to better utilize disk space.  If that's not large enough, then
the algorithm allocates a new page.

(6) Management of inner pages is very similar to management of leaf pages,
described in (5).

(7) Actually, the current implementation can move the whole list of leaf
tuples plus the new tuple to another page, if the list is short enough.
This improves space utilization, but doesn't change the basis of the
algorithm.


CONCURRENCY

While descending the tree, the insertion algorithm holds exclusive locks
on two tree levels at a time, i.e., both the parent and the child page
(but the parent and child pages can be the same, see the notes above).
There is a possibility of deadlock between two insertions if there are
cross-referenced pages in different branches.  That is, if an inner tuple
on page M has a child on page N while an inner tuple from another branch
is on page N and has a child on page M, then two insertions descending the
two branches could deadlock, since each will hold its parent page's lock
while trying to get its child page's lock.

Currently, we deal with this by conditionally locking buffers as we
descend the tree.  If we fail to get a lock on a buffer, we release both
buffers and restart the insertion process.  This is potentially
inefficient, but the locking costs of a more deterministic approach seem
very high.
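
A hedged sketch of that conditional-locking rule, using the backend's
buffer-manager primitives (the helper itself is hypothetical; the real
logic is woven through spgdoinsert.c and handles many more cases):

    #include "postgres.h"
    #include "storage/bufmgr.h"

    /*
     * Lock the parent exclusively, then try, but never wait, for the
     * child.  Returns true with both buffers locked; on failure,
     * releases everything and returns false so that the caller can
     * restart the whole insertion descent.
     */
    static bool
    lock_parent_and_child(Relation index,
                          BlockNumber parentBlkno, BlockNumber childBlkno,
                          Buffer *parentBuf, Buffer *childBuf)
    {
        *parentBuf = ReadBuffer(index, parentBlkno);
        LockBuffer(*parentBuf, BUFFER_LOCK_EXCLUSIVE);

        *childBuf = ReadBuffer(index, childBlkno);

        /* Waiting here while holding the parent's lock is what could
         * deadlock against a descent through cross-referenced pages in
         * the opposite order, so we only try the lock. */
        if (!ConditionalLockBuffer(*childBuf))
        {
            ReleaseBuffer(*childBuf);
            UnlockReleaseBuffer(*parentBuf);
            return false;       /* caller restarts the insertion */
        }
        return true;
    }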

To reduce the number of cases where that happens, we introduce the concept
of "triple parity" of pages: if an inner tuple is on a page with
BlockNumber N, then its child tuples should be placed on the same page, or
else on a page with BlockNumber M where (N+1) mod 3 == M mod 3.  This rule
ensures that tuples on page M will have no children on page N, since
(M+1) mod 3 != N mod 3, and thus makes it unlikely that two insertion
processes will conflict with each other while descending the tree (see the
sketch at the end of this section).  It's not perfect though: in the first
place, we could still get a deadlock among three or more insertion
processes, and in the second place, it's impractical to preserve this
invariant in every case when we expand or split an inner tuple.  So we
still have to allow for deadlocks.

Insertion may also need to take locks on an additional inner and/or leaf
page to add tuples of the right type(s), when there's not enough room on
the pages it descended through.  However, we don't care exactly which such
page we add to, so deadlocks can be avoided by conditionally locking the
additional buffers: if we fail to get a lock on an additional page, we
just try another one.

The search traversal algorithm is rather traditional.  At each non-leaf
level, it share-locks the page, identifies which node(s) in the current
inner tuple need to be visited, and puts those addresses on a stack of
pages to examine later.  It then releases the lock on the current buffer
before visiting the next stack item.  So only one page is locked at a
time, and no deadlock is possible.  But instead, we have to worry about
race conditions: by the time we arrive at a pointed-to page, a concurrent
insertion could have replaced the target inner tuple (or leaf tuple chain)
with data placed elsewhere.  To handle that, whenever the insertion
algorithm changes a nonempty downlink in an inner tuple, it places a
"redirect tuple" in place of the lower-level inner tuple or leaf-tuple
chain head that the link formerly led to.  Scans (though not insertions)
must be prepared to honor such redirects.  Only a scan that had already
visited the parent level could possibly reach such a redirect tuple, so we
can remove redirects once all active transactions have been flushed out of
the system.
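
The triple-parity rule is simple enough to state as code.  A minimal
illustrative helper (the name is invented; the real page-allocation logic
lives in spgutils.c):

    #include "postgres.h"
    #include "storage/block.h"

    /*
     * May an inner tuple on page 'parent' have a child on page 'child'?
     * Children go on the same page, or on a page whose block number is
     * one greater modulo 3, so no two distinct pages can ever be each
     * other's parents.
     */
    static bool
    triple_parity_ok(BlockNumber parent, BlockNumber child)
    {
        return child == parent || child % 3 == (parent + 1) % 3;
    }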


DEAD TUPLES

Tuples on leaf pages can be in one of four states:

SPGIST_LIVE: a normal, live pointer to a heap tuple.

SPGIST_REDIRECT: a placeholder that contains a link to another place in
the index.  When a chain of leaf tuples has to be moved to another page, a
redirect tuple is inserted in place of the chain's head tuple.  The parent
inner tuple's downlink is updated when this happens, but concurrent scans
might be "in flight" from the parent page to the child page (since they
release the lock on the parent page before attempting to lock the child).
The redirect pointer serves to tell such a scan where to go.  A redirect
pointer is only needed for as long as such concurrent scans could be in
progress.  Eventually, it's converted into a PLACEHOLDER dead tuple by
VACUUM, and is then a candidate for replacement.  Searches that find such
a tuple (which should never be part of a chain) should immediately proceed
to the other place, forgetting about the redirect tuple.  Insertions that
reach such a tuple should raise an error, since a valid downlink should
never point to such a tuple.

SPGIST_DEAD: the tuple is dead, but it cannot be removed or moved to a
different offset on the page, because there is a link leading to it from
some inner tuple elsewhere in the index.  (Such a tuple is never part of a
chain, since we don't need one unless there is nothing live left in its
chain.)  Searches should ignore such entries.  If an insertion action
arrives at such a tuple, it should either replace it in place (if there's
room on the page to hold the desired new leaf tuple) or replace it with a
redirection pointer to wherever it puts the new leaf tuple.

SPGIST_PLACEHOLDER: the tuple is dead, and there are known to be no links
to it from elsewhere.  When a live tuple is deleted or moved away, and not
replaced by a redirect pointer, it is replaced by a placeholder to keep
the offsets of later tuples on the same page from changing.  Placeholders
can be freely replaced when adding a new tuple to the page, and also
VACUUM will delete any that are at the end of the range of valid tuple
offsets.  Both searches and insertions should complain if a link from
elsewhere leads them to a placeholder tuple.

When the root page is also a leaf, all its tuples should be in LIVE state;
there's no need for the other states, since there are no links and no need
to preserve offset numbers.

Tuples on inner pages can be in LIVE, REDIRECT, or PLACEHOLDER states.
The REDIRECT state has the same function as on leaf pages: to send
concurrent searches to the place where they need to go after an inner
tuple is moved to another page.  Expired REDIRECT pointers are converted
to PLACEHOLDER status by VACUUM, and are then candidates for replacement.
The DEAD state is not currently possible, since VACUUM does not attempt to
remove unused inner tuples.


VACUUM

VACUUM (or more precisely, spgbulkdelete) performs a single sequential
scan over the entire index.  On both leaf and inner pages, we can convert
old REDIRECT tuples into PLACEHOLDER status, and then remove any
PLACEHOLDERs that are at the end of the page (since they aren't needed to
preserve the offsets of any live tuples).  On leaf pages, we scan for
tuples that need to be deleted because their heap TIDs match a vacuum
target TID.

If we find a deletable tuple that is not at the head of its chain, we can
simply replace it with a PLACEHOLDER, updating the chain links to remove
it from the chain.  If it is at the head of its chain, but there's at
least one live tuple remaining in the chain, we move that live tuple to
the head tuple's offset, replacing the moved tuple's old slot with a
PLACEHOLDER to preserve the offsets of the other tuples.  This keeps the
parent inner tuple's downlink valid.  If we find ourselves deleting all
live tuples in a chain, we replace the head tuple with a DEAD tuple and
the rest with PLACEHOLDERs.  The parent inner tuple's downlink thus points
to the DEAD tuple, and the rules explained in the previous section keep
everything working.

VACUUM doesn't know a priori which tuples are heads of their chains, but
it can easily figure that out by constructing a predecessor array that's
the reverse map of the nextOffset links (i.e., when we see that tuple x
links to tuple y, we set predecessor[y] = x).  Then the head tuples are
the ones with no predecessor.
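
A minimal sketch of that reverse-map construction (the array and function
names are illustrative; the real version is part of vacuumLeafPage() in
spgvacuum.c):

    #include "postgres.h"
    #include "storage/off.h"

    /*
     * nextOffset[i] is the chain link stored in the leaf tuple at
     * offset i, or InvalidOffsetNumber if that tuple ends its chain.
     * On return, predecessor[i] == InvalidOffsetNumber exactly for the
     * chain heads.
     */
    static void
    find_chain_heads(const OffsetNumber *nextOffset, OffsetNumber maxoff,
                     OffsetNumber *predecessor)
    {
        for (OffsetNumber i = FirstOffsetNumber; i <= maxoff; i++)
            predecessor[i] = InvalidOffsetNumber;

        for (OffsetNumber i = FirstOffsetNumber; i <= maxoff; i++)
        {
            OffsetNumber next = nextOffset[i];

            if (next != InvalidOffsetNumber)
                predecessor[next] = i;  /* tuple i links to tuple next */
        }
    }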

Because insertions can occur while VACUUM runs, a pure sequential scan
could miss deleting some target leaf tuples, because they could get moved
from a not-yet-visited leaf page to an already-visited leaf page as a
consequence of a PickSplit or MoveLeafs operation.  Failing to delete any
target TID is not acceptable, so we have to extend the algorithm to cope
with such cases.  We recognize that such a move might have occurred when
we see a leaf-page REDIRECT tuple whose XID indicates it might have been
created after the VACUUM scan started.  We add the redirection target TID
to a "pending list" of places we need to recheck.  Between pages of the
main sequential scan, we empty the pending list by visiting each listed
TID.  If it points to an inner tuple (from a PickSplit), we add each
downlink TID to the pending list.  If it points to a leaf page, we vacuum
that page.  (We could just vacuum the single pointed-to chain, but
vacuuming the whole page simplifies the code and reduces the odds of
VACUUM having to modify the same page multiple times.)  To ensure that
pending-list processing can never get into an endless loop, even in the
face of concurrent index changes, we don't remove list entries until we've
completed all pending-list processing; instead, we just mark items as done
after processing them.  Adding a TID that's already in the list is a
no-op, whether or not that item is marked done yet.

spgbulkdelete also updates the index's free space map.

Currently, spgvacuumcleanup has nothing to do if spgbulkdelete was
performed; otherwise, it does an spgbulkdelete scan with an empty target
list, so as to clean up redirections and placeholders, update the free
space map, and gather statistics.


LAST USED PAGE MANAGEMENT

The list of last-used pages contains four pages: a leaf page and three
inner pages, one from each "triple parity" group.  (Actually, there's one
such list for the main tree and a separate one for the nulls tree.)  This
list is stored between calls on the index meta page, but updates are never
WAL-logged, to decrease WAL traffic.  Incorrect data on the meta page
isn't critical, because we can simply allocate a new page at any moment.


AUTHORS

    Teodor Sigaev <teodor@sigaev.ru>
    Oleg Bartunov <oleg@sai.msu.su>