src/backend/access/nbtree/README

Btree Indexing
==============

This directory contains a correct implementation of Lehman and Yao's
high-concurrency B-tree management algorithm (P. Lehman and S. Yao,
Efficient Locking for Concurrent Operations on B-Trees, ACM Transactions
on Database Systems, Vol 6, No. 4, December 1981, pp 650-670). We also
use a simplified version of the deletion logic described in Lanin and
Shasha (V. Lanin and D. Shasha, A Symmetric Concurrent B-Tree Algorithm,
Proceedings of 1986 Fall Joint Computer Conference, pp 380-389).

The basic Lehman & Yao Algorithm
--------------------------------

Compared to a classic B-tree, L&Y adds a right-link pointer to each page,
pointing to the page's right sibling. It also adds a "high key" to each
page, which is an upper bound on the keys that are allowed on that page.
These two additions make it possible to detect a concurrent page split,
which allows the tree to be searched without holding any read locks
(except to keep a single page from being modified while reading it).

When a search follows a downlink to a child page, it compares the page's
high key with the search key. If the search key is greater than the high
key, the page must've been split concurrently, and you must follow the
right-link to find the new page containing the key range you're looking
for. This might need to be repeated, if the page has been split more than
once.

Differences to the Lehman & Yao algorithm
-----------------------------------------

We have made the following changes in order to incorporate the L&Y algorithm
into Postgres:

The requirement that all btree keys be unique is too onerous,
but the algorithm won't work correctly without it. Fortunately, it is
only necessary that keys be unique on a single tree level, because L&Y
only use the assumption of key uniqueness when re-finding a key in a
parent page (to determine where to insert the key for a split page).
Therefore, we can use the link field to disambiguate multiple
occurrences of the same user key: only one entry in the parent level
will be pointing at the page we had split. (Indeed we need not look at
the real "key" at all, just at the link field.) We can distinguish
items at the leaf level in the same way, by examining their links to
heap tuples; we'd never have two items for the same heap tuple.

Lehman and Yao assume that the key range for a subtree S is described
by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent
page. This does not work for nonunique keys (for example, if we have
enough equal keys to spread across several leaf pages, there *must* be
some equal bounding keys in the first level up). Therefore we assume
Ki <= v <= Ki+1 instead. A search that finds exact equality to a
bounding key in an upper tree level must descend to the left of that
key to ensure it finds any equal keys in the preceding page. An
insertion that sees that the high key of its target page is equal to the
key to be inserted has a choice whether or not to move right, since the
new key could go on either page. (Currently, we try to find a page where
there is room for the new key without a split.)

Lehman and Yao don't require read locks, but assume that in-memory
copies of tree pages are unshared. Postgres shares in-memory buffers
among backends. As a result, we do page-level read locking on btree
pages in order to guarantee that no record is modified while we are
examining it. This reduces concurrency but guarantees correct
behavior. An advantage is that when trading in a read lock for a
write lock, we need not re-read the page after getting the write lock.
Since we're also holding a pin on the shared buffer containing the
page, we know that buffer still contains the page and is up-to-date.
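
To make the move-right rule concrete, here is a minimal sketch in C.
All types and helper functions are hypothetical stand-ins, not the real
nbtree API; the actual logic lives in _bt_search() and _bt_moveright().

    #include <stdbool.h>

    typedef unsigned int BlockNumber;
    typedef struct Page Page;

    extern Page *read_page_share_locked(BlockNumber blkno);
    extern void release_page(Page *page);
    extern bool page_is_rightmost(const Page *page);
    extern BlockNumber page_right_link(const Page *page);
    /* returns > 0 if key sorts after the page's high key */
    extern int compare_to_high_key(const Page *page, const void *key);

    /*
     * After following a downlink (or sibling link) to blkno, move right
     * until we reach the page whose key space still contains key.  A
     * concurrent split is detected purely from the high key; no lock
     * was held while we were "in flight" to the page.
     */
    static Page *
    move_right_if_split(BlockNumber blkno, const void *key)
    {
        Page *page = read_page_share_locked(blkno);

        /* key > high key: the page was split after we saw its link */
        while (!page_is_rightmost(page) &&
               compare_to_high_key(page, key) > 0)
        {
            BlockNumber next = page_right_link(page);

            release_page(page);     /* moving right never deadlocks */
            page = read_page_share_locked(next);
        }
        return page;               /* locked page covering key's range */
    }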

We support the notion of an ordered "scan" of an index as well as
insertions, deletions, and simple lookups. A scan in the forward
direction is no problem: we just use the right-sibling pointers that
L&Y require anyway. (Thus, once we have descended the tree to the
correct start point for the scan, the scan looks only at leaf pages
and never at higher tree levels.) To support scans in the backward
direction, we also store a "left sibling" link much like the "right
sibling". (This adds an extra step to the L&Y split algorithm: while
holding the write lock on the page being split, we also lock its former
right sibling to update that page's left-link. This is safe since no
writer of that page can be interested in acquiring a write lock on our
page.) A backwards scan has one additional bit of complexity: after
following the left-link we must account for the possibility that the
left sibling page got split before we could read it. So, we have to
move right until we find a page whose right-link matches the page we
came from. (Actually, it's even harder than that; see the deletion
discussion below.)

Page read locks are held only for as long as a scan is examining a page.
To minimize lock/unlock traffic, an index scan always searches a leaf page
to identify all the matching items at once, copying their heap tuple IDs
into backend-local storage. The heap tuple IDs are then processed while
not holding any page lock within the index. We do continue to hold a pin
on the leaf page in some circumstances, to protect against concurrent
deletions (see below). In this state the scan is effectively stopped
"between" pages, either before or after the page it has pinned. This is
safe in the presence of concurrent insertions and even page splits, because
items are never moved across pre-existing page boundaries --- so the scan
cannot miss any items it should have seen, nor accidentally return the same
item twice. The scan must remember the page's right-link at the time it
was scanned, since that is the page to move right to; if we move right to
the current right-link then we'd re-scan any items moved by a page split.
We don't similarly remember the left-link, since it's best to use the most
up-to-date left-link when trying to move left (see the detailed move-left
algorithm below).

In most cases we release our lock and pin on a page before attempting
to acquire pin and lock on the page we are moving to. In a few places
it is necessary to lock the next page before releasing the current one.
This is safe when moving right or up, but not when moving left or down
(else we'd create the possibility of deadlocks).
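
As an illustration of this direction rule, here is a sketch of lock
coupling while stepping right (hypothetical helpers, not the real
buffer manager API). Taking the new lock before dropping the old one
cannot deadlock only because every such coupling goes rightward or
upward, so no cycle of waiters can form:

    typedef unsigned int BlockNumber;
    typedef struct Page Page;

    extern Page *read_page_share_locked(BlockNumber blkno);
    extern void release_page(Page *page);
    extern BlockNumber page_right_link(const Page *page);

    static Page *
    step_right_coupled(Page *cur)
    {
        Page *next = read_page_share_locked(page_right_link(cur));

        release_page(cur);  /* old lock dropped only after new one held */
        return next;
    }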

Lehman and Yao fail to discuss what must happen when the root page
becomes full and must be split. Our implementation is to split the
root in the same way that any other page would be split, then construct
a new root page holding pointers to both of the resulting pages (which
now become siblings on the next level of the tree). The new root page
is then installed by altering the root pointer in the meta-data page (see
below). This works because the root is not treated specially in any
other way --- in particular, searches will move right using its link
pointer if the link is set. Therefore, searches will find the data
that's been moved into the right sibling even if they read the meta-data
page before it got updated. This is the same reasoning that makes a
split of a non-root page safe. The locking considerations are similar too.

When an inserter recurses up the tree, splitting internal pages to insert
links to pages inserted on the level below, it is possible that it will
need to access a page above the level that was the root when it began its
descent (or more accurately, the level that was the root when it read the
meta-data page). In this case the stack it made while descending does not
help for finding the correct page. When this happens, we find the correct
place by re-descending the tree until we reach the level one above the
level we need to insert a link to, and then moving right as necessary.
(Typically this will take only two fetches, the meta-data page and the new
root, but in principle there could have been more than one root split
since we saw the root. We can identify the correct tree level by means of
the level numbers stored in each page. The situation is rare enough that
we do not need a more efficient solution.)

Lehman and Yao assume fixed-size keys, but we must deal with
variable-size keys. Therefore there is not a fixed maximum number of
keys per page; we just stuff in as many as will fit. When we split a
page, we try to equalize the number of bytes, not items, assigned to
each of the resulting pages. Note we must include the incoming item in
this calculation, otherwise it is possible to find that the incoming
item doesn't fit on the split page where it needs to go!
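
The following sketch shows the byte-equalization idea under simplifying
assumptions: item_sizes[] holds the byte size of every item that must
land on one of the two halves, with the incoming item already counted
at its would-be position, and high-key and per-page overhead are
ignored. The real logic, in _bt_findsplitloc(), is considerably more
careful.

    #include <stddef.h>

    static size_t
    choose_split_point(const size_t item_sizes[], size_t nitems)
    {
        size_t total = 0;
        size_t left = 0;
        size_t best = 1;
        size_t best_delta = (size_t) -1;

        for (size_t i = 0; i < nitems; i++)
            total += item_sizes[i];

        /* boundary i: items [0 .. i-1] go left, [i .. nitems-1] right */
        for (size_t i = 1; i < nitems; i++)
        {
            left += item_sizes[i - 1];

            size_t right = total - left;
            size_t delta = (left > right) ? left - right : right - left;

            if (delta < best_delta)
            {
                best_delta = delta;
                best = i;
            }
        }
        return best;    /* offset of the first item on the right page */
    }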

The Deletion Algorithm
----------------------

Before deleting a leaf item, we get a super-exclusive lock on the target
page, so that no other backend has a pin on the page when the deletion
starts. This is not necessary for correctness in terms of the btree index
operations themselves; as explained above, index scans logically stop
"between" pages and so can't lose their place. The reason we do it is to
provide an interlock between non-full VACUUM and indexscans. Since VACUUM
deletes index entries before reclaiming heap tuple line pointers, the
super-exclusive lock guarantees that VACUUM can't reclaim for re-use a
line pointer that an indexscanning process might be about to visit. This
guarantee works only for simple indexscans that visit the heap in sync
with the index scan, not for bitmap scans. We only need the guarantee
when using non-MVCC snapshot rules; when using an MVCC snapshot, it
doesn't matter if the heap tuple is replaced with an unrelated tuple at
the same TID, because the new tuple won't be visible to our scan anyway.
Therefore, a scan using an MVCC snapshot which has no other confounding
factors will not hold the pin after the page contents are read. The
current reasons for exceptions, where a pin is still needed, are if the
index is not WAL-logged or if the scan is an index-only scan. If later
work allows the pin to be dropped for all cases we will be able to
simplify the vacuum code, since the concept of a super-exclusive lock
for btree indexes will no longer be needed.

Because a pin is not always held, and a page can be split even while
someone does hold a pin on it, it is possible that an indexscan will
return items that are no longer stored on the page it has a pin on, but
rather somewhere to the right of that page. To ensure that VACUUM can't
prematurely remove such heap tuples, we require btbulkdelete to obtain a
super-exclusive lock on every leaf page in the index, even pages that
don't contain any deletable tuples. Any scan which could yield incorrect
results if the tuple at a TID matching the scan's range and filter
conditions were replaced by a different tuple while the scan is in
progress must hold the pin on each index page until all index entries read
from the page have been processed. This guarantees that the btbulkdelete
call cannot return while any indexscan is still holding a copy of a
deleted index tuple if the scan could be confused by that. Note that this
requirement does not say that btbulkdelete must visit the pages in any
particular order. (See also on-the-fly deletion, below.)

There is no such interlocking for deletion of items in internal pages,
since backends keep no lock nor pin on a page they have descended past.
Hence, when a backend is ascending the tree using its stack, it must
be prepared for the possibility that the item it wants is to the left of
the recorded position (but it can't have moved left out of the recorded
page). Since we hold a lock on the lower page (per L&Y) until we have
re-found the parent item that links to it, we can be assured that the
parent item does still exist and can't have been deleted. Also, because
we are matching downlink page numbers and not data keys, we don't have any
problem with possibly misidentifying the parent item.

Page Deletion
-------------

We consider deleting an entire page from the btree only when it's become
completely empty of items. (Merging partly-full pages would allow better
space reuse, but it seems impractical to move existing data items left or
right to make this happen --- a scan moving in the opposite direction
might miss the items if so.) Also, we *never* delete the rightmost page
on a tree level (this restriction simplifies the traversal algorithms, as
explained below). Page deletion always begins from an empty leaf page. An
internal page can only be deleted as part of a branch leading to a leaf
page, where each internal page has only one child and that child is also to
be deleted.

Deleting a leaf page is a two-stage process. In the first stage, the page
is unlinked from its parent, and marked as half-dead. The parent page must
be found using the same type of search as used to find the parent during an
insertion split. We lock the target and the parent pages, change the
target's downlink to point to the right sibling, and remove its old
downlink. This causes the target page's key space to effectively belong to
its right sibling. (Neither the left nor right sibling pages need to
change their "high key" if any; so there is no problem with possibly not
having enough space to replace a high key.) At the same time, we mark the
target page as half-dead, which causes any subsequent searches to ignore it
and move right (or left, in a backwards scan). This leaves the tree in a
similar state as during a page split: the page has no downlink pointing to
it, but it's still linked to its siblings.
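
As a rough illustration, a search or forward scan steps over such
ignorable pages simply by following right-links, since the key space of
a half-dead or deleted page has effectively moved to its right
sibling(s). Helper names here are hypothetical:

    #include <stdbool.h>

    typedef unsigned int BlockNumber;
    typedef struct Page Page;

    extern Page *read_page_share_locked(BlockNumber blkno);
    extern void release_page(Page *page);
    extern bool page_is_half_dead(const Page *page);
    extern bool page_is_deleted(const Page *page);
    extern BlockNumber page_right_link(const Page *page);

    static Page *
    skip_ignorable_pages(Page *page)
    {
        while (page_is_half_dead(page) || page_is_deleted(page))
        {
            BlockNumber next = page_right_link(page);

            release_page(page);
            page = read_page_share_locked(next);
        }
        return page;
    }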

(Note: Lanin and Shasha prefer to make the key space move left, but their
argument for doing so hinges on not having left-links, which we have
anyway. So we simplify the algorithm by moving key space right.)

To preserve consistency on the parent level, we cannot merge the key space
of a page into its right sibling unless the right sibling is a child of
the same parent --- otherwise, the parent's key space assignment changes
too, meaning we'd have to make bounding-key updates in its parent, and
perhaps all the way up the tree. Since we can't possibly do that
atomically, we forbid this case. That means that the rightmost child of a
parent node can't be deleted unless it's the only remaining child, in which
case we will delete the parent too (see below).

In the second stage, the half-dead leaf page is unlinked from its siblings.
We first lock the left sibling (if any) of the target, the target page
itself, and its right sibling (there must be one) in that order. Then we
update the side-links in the siblings, and mark the target page deleted.

When we're about to delete the last remaining child of a parent page, things
are slightly more complicated. In the first stage, we leave the immediate
parent of the leaf page alone, and remove the downlink to the parent page
instead, from the grandparent. If it's the last child of the grandparent
too, we recurse up until we find a parent with more than one child, and
remove the downlink of that page. The leaf page is marked as half-dead, and
the block number of the page whose downlink was removed is stashed in the
half-dead leaf page. This leaves us with a chain of internal pages, with
one downlink each, leading to the half-dead leaf page, and no downlink
pointing to the topmost page in the chain.

While we recurse up to find the topmost parent in the chain, we keep the
leaf page locked, but don't need to hold locks on the intermediate pages
between the leaf and the topmost parent -- insertions into upper tree levels
happen only as a result of splits of child pages, and that can't happen as
long as we're keeping the leaf locked. The internal pages in the chain
cannot acquire new children afterwards either, because the leaf page is
marked as half-dead and won't be split.

Removing the downlink to the top of the to-be-deleted chain effectively
transfers the key space to the right sibling for all the intermediate levels
too, in one atomic operation. A concurrent search might still visit the
intermediate pages, but it will move right when it reaches the half-dead page
at the leaf level.

In the second stage, the topmost page in the chain is unlinked from its
siblings, and the half-dead leaf page is updated to point to the next page
down in the chain. This is repeated until there are no internal pages left
in the chain. Finally, the half-dead leaf page itself is unlinked from its
siblings.

A deleted page cannot be reclaimed immediately, since there may be other
processes waiting to reference it (ie, search processes that just left the
parent, or scans moving right or left from one of the siblings). These
processes must observe that the page is marked dead and recover
accordingly. Searches and forward scans simply follow the right-link
until they find a non-dead page --- this will be where the deleted page's
key-space moved to.

Moving left in a backward scan is complicated because we must consider
the possibility that the left sibling was just split (meaning we must find
the rightmost page derived from the left sibling), plus the possibility
that the page we were just on has now been deleted and hence isn't in the
sibling chain at all anymore. So the move-left algorithm becomes:

0. Remember the page we are on as the "original page".
1. Follow the original page's left-link (we're done if this is zero).
2. If the current page is live and its right-link matches the "original
   page", we are done.
3. Otherwise, move right one or more times looking for a live page whose
   right-link matches the "original page". If found, we are done. (In
   principle we could scan all the way to the right end of the index, but
   in practice it seems better to give up after a small number of tries.
   It's unlikely the original page's sibling split more than a few times
   while we were in flight to it; if we do not find a matching link in a
   few tries, then most likely the original page is deleted.)
4. Return to the "original page". If it is still live, return to step 1
   (we guessed wrong about it being deleted, and should restart with its
   current left-link). If it is dead, move right until a non-dead page
   is found (there must be one, since rightmost pages are never deleted),
   mark that as the new "original page", and return to step 1.

This algorithm is correct because the live page found by step 4 will have
the same left keyspace boundary as the page we started from. Therefore,
when we ultimately exit, it must be on a page whose right keyspace
boundary matches the left boundary of where we started --- which is what
we need to be sure we don't miss or re-scan any items.
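
Below is a sketch of this loop in C. Helper names and the MAX_TRIES
bound are hypothetical; the real implementation is _bt_walk_left().
Note that we never hold locks on two pages at once while moving left,
since coupling locks leftward could deadlock:

    #include <stdbool.h>

    typedef unsigned int BlockNumber;
    typedef struct Page Page;

    #define INVALID_BLOCK  0
    #define MAX_TRIES      4   /* the "small number of tries" of step 3 */

    extern Page *read_page_share_locked(BlockNumber blkno);
    extern void release_page(Page *page);
    extern bool page_is_live(const Page *page);
    extern BlockNumber page_left_link(const Page *page);
    extern BlockNumber page_right_link(const Page *page);

    static BlockNumber
    move_left(BlockNumber orig)    /* step 0: orig is the original page */
    {
        for (;;)
        {
            Page *page = read_page_share_locked(orig);
            BlockNumber blkno = page_left_link(page);       /* step 1 */

            release_page(page);
            if (blkno == INVALID_BLOCK)
                return INVALID_BLOCK;   /* leftmost page; we are done */

            /* steps 2 and 3: seek a live page whose right-link is orig */
            for (int tries = 0; tries < MAX_TRIES; tries++)
            {
                page = read_page_share_locked(blkno);
                if (page_is_live(page) && page_right_link(page) == orig)
                {
                    release_page(page); /* a real caller keeps the lock */
                    return blkno;
                }
                blkno = page_right_link(page);
                release_page(page);
            }

            /* step 4: give up and re-examine the original page */
            page = read_page_share_locked(orig);
            while (!page_is_live(page))
            {
                /* deleted: move right to a live page; one must exist,
                 * since the rightmost page on a level is never deleted */
                orig = page_right_link(page);
                release_page(page);
                page = read_page_share_locked(orig);
            }
            release_page(page);
            /* retry from step 1 with the (possibly new) original page */
        }
    }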

A deleted page can only be reclaimed once there is no scan or search that
has a reference to it; until then, it must stay in place with its
right-link undisturbed. We implement this by waiting until all active
snapshots and registered snapshots as of the deletion are gone; this is
overly strong, but is simple to implement within Postgres. When marked
dead, a deleted page is labeled with the next-transaction counter value.
VACUUM can reclaim the page for re-use when this transaction number is
older than RecentGlobalXmin. As collateral damage, this implementation
also waits for running XIDs with no snapshots and for snapshots taken
until the next transaction to allocate an XID commits.

Reclaiming a page doesn't actually change its state on disk --- we simply
record it in the shared-memory free space map, from which it will be
handed out the next time a new page is needed for a page split. The
deleted page's contents will be overwritten by the split operation.
(Note: if we find a deleted page with an extremely old transaction
number, it'd be worthwhile to re-mark it with FrozenTransactionId so that
a later xid wraparound can't cause us to think the page is unreclaimable.
But in more normal situations this would be a waste of a disk write.)

Because we never delete the rightmost page of any level (and in particular
never delete the root), it's impossible for the height of the tree to
decrease. After massive deletions we might have a scenario in which the
tree is "skinny", with several single-page levels below the root.
Operations will still be correct in this case, but we'd waste cycles
descending through the single-page levels. To handle this we use an idea
from Lanin and Shasha: we keep track of the "fast root" level, which is
the lowest single-page level. The meta-data page keeps a pointer to this
level as well as to the true root. All ordinary operations initiate their
searches at the fast root, not the true root. When we split a page that is
alone on its level or delete the next-to-last page on a level (both cases
are easily detected), we have to make sure that the fast root pointer is
adjusted appropriately. In the split case, we do this work as part of the
atomic update for the insertion into the parent level; in the delete case,
as part of the atomic update for the delete (either way, the metapage has
to be the last page locked in the update to avoid deadlock risks). This
avoids race conditions if two such operations are executing concurrently.

VACUUM needs to do a linear scan of an index to search for deleted pages
that can be reclaimed because they are older than all open transactions.
For efficiency's sake, we'd like to use the same linear scan to search for
deletable tuples. Before Postgres 8.2, btbulkdelete scanned the leaf pages
in index order, but it is possible to visit them in physical order instead.
The tricky part of this is to avoid missing any deletable tuples in the
presence of concurrent page splits: a page split could easily move some
tuples from a page not yet passed over by the sequential scan to a
lower-numbered page already passed over. (This wasn't a concern for the
index-order scan, because splits always split right.) To implement this,
we provide a "vacuum cycle ID" mechanism that makes it possible to
determine whether a page has been split since the current btbulkdelete
cycle started. If btbulkdelete finds a page that has been split since
it started, and has a right-link pointing to a lower page number, then
it temporarily suspends its sequential scan and visits that page instead.
It must continue to follow right-links and vacuum dead tuples until
reaching a page that either hasn't been split since btbulkdelete started,
or is above the location of the outer sequential scan. Then it can resume
the sequential scan. This ensures that all tuples are visited. It may be
that some tuples are visited twice, but that has no worse effect than an
inaccurate index tuple count (and we can't guarantee an accurate count
anyway in the face of concurrent activity). Note that this still works
if the has-been-recently-split test has a small probability of false
positives, so long as it never gives a false negative. This makes it
possible to implement the test with a small counter value stored on each
index page.
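
Here is a sketch of the suspend-and-chase step, with hypothetical types
and helpers (the real code is in btvacuumscan() and btvacuumpage()).
Because the cycle ID is a small counter, the recently-split test may
give false positives, which only cause harmless extra visits; it never
gives false negatives:

    #include <stdbool.h>

    typedef unsigned int BlockNumber;
    typedef unsigned short CycleId;
    typedef struct Page Page;

    extern Page *read_page_super_exclusive(BlockNumber blkno);
    extern void release_page(Page *page);
    extern void vacuum_leaf_page(Page *page);   /* remove dead tuples */
    extern CycleId page_split_cycle_id(const Page *page);
    extern BlockNumber page_right_link(const Page *page);

    /*
     * Vacuum the leaf page at blkno; if it was split after this
     * bulk-delete cycle started and shed tuples to a lower-numbered
     * (already-passed) page, keep following right-links until the
     * chain can safely be abandoned.
     */
    static void
    vacuum_page_and_recent_splits(BlockNumber blkno,
                                  BlockNumber scan_pos,
                                  CycleId our_cycle)
    {
        for (;;)
        {
            Page *page = read_page_super_exclusive(blkno);

            vacuum_leaf_page(page);

            bool recent = (page_split_cycle_id(page) == our_cycle);
            BlockNumber next = page_right_link(page);

            release_page(page);

            /*
             * Stop once this page wasn't split during our cycle, or
             * once the right sibling lies at or beyond the outer scan's
             * position (the sequential scan will visit it anyway).
             */
            if (!recent || next >= scan_pos)
                break;
            blkno = next;
        }
    }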

Fastpath For Index Insertion
----------------------------

We optimize for the common case of insertion of increasing index key
values by caching the page to which this backend last inserted a value,
if that page was the rightmost leaf page. For the next insert, we can
then quickly check whether the cached page is still the rightmost leaf
page and also the correct place to hold the current value, and so avoid
the cost of walking down the tree in such common cases.

The optimization works on the assumption that there can only be one
non-ignorable leaf rightmost page, and so even a RecentGlobalXmin style
interlock isn't required. We cannot fail to detect that our hint was
invalidated, because there can only be one such page in the B-Tree at
any time. It's possible that the page will be deleted and recycled
without a backend's cached page also being detected as invalidated, but
only when we happen to recycle a block that once again gets used as the
rightmost leaf page.
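
A sketch of the hint check follows; all names are hypothetical (the
actual fastpath lives in _bt_doinsert()). The cached block is only a
hint: every condition is re-verified after locking the page, and any
failure just falls back to an ordinary descent from the root.

    #include <stdbool.h>

    typedef unsigned int BlockNumber;
    typedef struct Page Page;

    #define INVALID_BLOCK 0

    extern Page *write_lock_page(BlockNumber blkno);
    extern void release_page(Page *page);
    extern bool page_is_live_rightmost_leaf(const Page *page);
    extern bool page_has_room_for(const Page *page, const void *itup);
    extern bool key_belongs_at_end(const Page *page, const void *itup);

    static BlockNumber cached_block = INVALID_BLOCK;  /* backend-local */

    static Page *
    try_insert_fastpath(const void *itup)
    {
        if (cached_block == INVALID_BLOCK)
            return NULL;                /* no hint; do a normal descent */

        Page *page = write_lock_page(cached_block);

        /* recheck everything now that the page is locked */
        if (page_is_live_rightmost_leaf(page) &&
            page_has_room_for(page, itup) &&
            key_belongs_at_end(page, itup))
            return page;                /* caller inserts; lock is held */

        release_page(page);
        cached_block = INVALID_BLOCK;   /* stale hint; forget it */
        return NULL;
    }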

On-the-Fly Deletion Of Index Tuples
-----------------------------------

If a process visits a heap tuple and finds that it's dead and removable
(ie, dead to all open transactions, not only that process), then we can
return to the index and mark the corresponding index entry "known dead",
allowing subsequent index scans to skip visiting the heap tuple. The
"known dead" marking works by setting the index item's lp_flags state
to LP_DEAD. This is currently only done in plain indexscans, not bitmap
scans, because only plain scans visit the heap and index "in sync" and so
there's not a convenient way to do it for bitmap scans.

Once an index tuple has been marked LP_DEAD it can actually be removed
from the index immediately; since index scans only stop "between" pages,
no scan can lose its place from such a deletion. We separate the steps
because we allow LP_DEAD to be set with only a share lock (it's exactly
like a hint bit for a heap tuple), but physically removing tuples requires
exclusive lock. In the current code we try to remove LP_DEAD tuples when
we are otherwise faced with having to split a page to do an insertion (and
hence have exclusive lock on it already).

This leaves the index in a state where it has no entry for a dead tuple
that still exists in the heap. This is not a problem for the current
implementation of VACUUM, but it could be a problem for anything that
explicitly tries to find index entries for dead tuples. (However, the
same situation is created by REINDEX, since it doesn't enter dead
tuples into the index.)

It's sufficient to have an exclusive lock on the index page, not a
super-exclusive lock, to do deletion of LP_DEAD items. It might seem
that this breaks the interlock between VACUUM and indexscans, but that is
not so: as long as an indexscanning process has a pin on the page where
the index item used to be, VACUUM cannot complete its btbulkdelete scan
and so cannot remove the heap tuple. This is another reason why
btbulkdelete has to get a super-exclusive lock on every leaf page, not
only the ones where it actually sees items to delete. So that we can
handle the cases where we attempt LP_DEAD flagging for a page after we
have released its pin, we remember the LSN of the index page when we read
the index tuples from it; we do not attempt to flag index tuples as dead
if we didn't hold the pin the entire time and the LSN has changed.
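
A sketch of that LSN recheck, with hypothetical names (the real logic
is in _bt_killitems()):

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t XLogRecPtr;    /* a WAL position, ie a page LSN */
    typedef struct Page Page;

    extern XLogRecPtr page_lsn(const Page *page);
    extern void set_item_lp_dead(Page *page, int offset);

    static void
    flag_dead_items(Page *page,     /* share-locked by the caller */
                    bool held_pin_throughout,
                    XLogRecPtr lsn_when_read,
                    const int dead_offsets[], int ndead)
    {
        /*
         * A changed LSN means the page was modified since we read it;
         * our remembered offsets might now point at unrelated tuples,
         * so do nothing rather than flag the wrong entries.
         */
        if (!held_pin_throughout && page_lsn(page) != lsn_when_read)
            return;

        for (int i = 0; i < ndead; i++)
            set_item_lp_dead(page, dead_offsets[i]);
    }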

WAL Considerations
------------------

The insertion and deletion algorithms in themselves don't guarantee btree
consistency after a crash. To provide robustness, we depend on WAL
replay. A single WAL entry is effectively an atomic action --- we can
redo it from the log if it fails to complete.

Ordinary item insertions (that don't force a page split) are of course
single WAL entries, since they only affect one page. The same for
leaf-item deletions (if the deletion brings the leaf page to zero items,
it is now a candidate to be deleted, but that is a separate action).

An insertion that causes a page split is logged as a single WAL entry for
the changes occurring on the insertion's level --- including update of the
right sibling's left-link --- followed by a second WAL entry for the
insertion on the parent level (which might itself be a page split, requiring
an additional insertion above that, etc).

For a root split, the followon WAL entry is a "new root" entry rather than
an "insertion" entry, but details are otherwise much the same.

Because splitting involves multiple atomic actions, it's possible that the
system crashes between splitting a page and inserting the downlink for the
new half to the parent. After recovery, the downlink for the new page will
be missing. The search algorithm works correctly, as the page will be found
by following the right-link from its left sibling, although if a lot of
downlinks in the tree are missing, performance will suffer. A more serious
consequence is that if the page without a downlink gets split again, the
insertion algorithm will fail to find the location in the parent level to
insert the downlink.

Our approach is to create any missing downlinks on-the-fly, when searching
the tree for a new insertion. It could be done during searches, too, but
it seems best not to put any extra updates in what would otherwise be a
read-only operation (updating is not possible in hot standby mode anyway).
It would seem natural to add the missing downlinks in VACUUM, but since
inserting a downlink might require splitting a page, it might fail if you
run out of disk space. That would be bad during VACUUM --- the reason for
running VACUUM in the first place might be that you have run out of disk
space, and now VACUUM won't finish because you're out of disk space. In
contrast, an insertion can require enlarging the physical file anyway.

To identify missing downlinks, when a page is split, the left page is
flagged to indicate that the split is not yet complete (INCOMPLETE_SPLIT).
When the downlink is inserted to the parent, the flag is cleared atomically
with the insertion. The child page is kept locked until the insertion in
the parent is finished and the flag in the child cleared, but can be
released immediately after that, before recursing up the tree if the parent
also needs to be split. This ensures that incompletely split pages are
not seen under normal circumstances, but only if insertion to the parent
has failed for some reason.

We flag the left page, even though it's the right page that's missing the
downlink, because it's more convenient to know already when following the
right-link from the left page to the right page that it will need to have
its downlink inserted to the parent.

When splitting a non-root page that is alone on its level, the required
metapage update (of the "fast root" link) is performed and logged as part
of the insertion into the parent level. When splitting the root page, the
metapage update is handled as part of the "new root" action.

Each step in page deletion is logged as a separate WAL entry: marking the
leaf as half-dead and removing the downlink is one record, and unlinking a
page is a second record. If vacuum is interrupted for some reason, or the
system crashes, the tree is consistent for searches and insertions. The
next VACUUM will find the half-dead leaf page and continue the deletion.

Before 9.4, we used to keep track of incomplete splits and page deletions
during recovery and finish them immediately at end of recovery, instead of
doing it lazily at the next insertion or vacuum. However, that made the
recovery much more complicated, and only fixed the problem when crash
recovery was performed. An incomplete split can also occur if an otherwise
recoverable error, like out-of-memory or out-of-disk-space, happens while
inserting the downlink to the parent.

Scans during Recovery
---------------------

The btree index type can be safely used during recovery. During recovery
we have at most one writer and potentially many readers. In that
situation the locking requirements can be relaxed and we do not need
double locking during block splits. Each WAL record makes changes to a
single level of the btree using the correct locking sequence and so
is safe for concurrent readers. Some readers may observe a block split
in progress as they descend the tree, but they will simply move right
onto the correct page.

During recovery all index scans start with ignore_killed_tuples = false
and we never set kill_prior_tuple. We do this because the oldest xmin
on the standby server can be older than the oldest xmin on the master
server, which means tuples can be marked as killed even when they are
still visible on the standby. We don't WAL-log tuple killed bits, but
they can still appear in the standby because of full page writes. So
we must always ignore them on the standby, and that means it's not worth
setting them either.

Note that we talk about scans that are started during recovery. We go to
a little trouble to allow a scan to start during recovery and end during
normal running after recovery has completed. This is a key capability
because it allows running applications to continue while the standby
changes state into a normally running server.

The interlocking required to avoid returning incorrect results from
non-MVCC scans is not required on standby nodes. That is because
HeapTupleSatisfiesUpdate(), HeapTupleSatisfiesSelf(),
HeapTupleSatisfiesDirty() and HeapTupleSatisfiesVacuum() are only
ever used during write transactions, which cannot exist on the standby.
MVCC scans are already protected by definition, so HeapTupleSatisfiesMVCC()
is not a problem. That leaves concern only for HeapTupleSatisfiesToast().
HeapTupleSatisfiesToast() doesn't use MVCC semantics, though that's
because it doesn't need to --- if the main heap row is visible then the
toast rows will also be visible. So as long as we follow a toast
pointer from a visible (live) tuple the corresponding toast rows
will also be visible, so we do not need to recheck MVCC on them.
There is one minor exception, which is that the optimizer sometimes
looks at the boundaries of value ranges using SnapshotDirty, which
could result in returning a newer value for query statistics; this
would affect the query plan in rare cases, but not the correctness.
The risk window is small since the stats look at the min and max values
in the index, so the scan retrieves a tid then immediately uses it
to look in the heap. It is unlikely that the tid could have been
deleted, vacuumed and re-inserted in the time taken to look in the heap
via direct tid access. So we ignore that scan type as a problem.

Other Things That Are Handy to Know
-----------------------------------

Page zero of every btree is a meta-data page. This page stores the
location of the root page --- both the true root and the current effective
root ("fast" root). To avoid fetching the metapage for every single index
search, we cache a copy of the meta-data information in the index's
relcache entry (rd_amcache). This is a bit ticklish since using the cache
implies following a root page pointer that could be stale. However, a
backend following a cached pointer can sufficiently verify whether it
reached the intended page; either by checking the is-root flag when it
is going to the true root, or by checking that the page has no siblings
when going to the fast root. At worst, this could result in descending
some extra tree levels if we have a cached pointer to a fast root that is
now above the real fast root. Such cases shouldn't arise often enough to
be worth optimizing; and in any case we can expect a relcache flush will
discard the cached metapage before long, since a VACUUM that's moved the
fast root pointer can be expected to issue a statistics update for the
index.

The algorithm assumes we can fit at least three items per page
(a "high key" and two real data items). Therefore it's unsafe
to accept items larger than 1/3rd page size. Larger items would
work sometimes, but could cause failures later on depending on
what else gets put on their page.

"ScanKey" data structures are used in two fundamentally different ways
in this code, which we describe as "search" scankeys and "insertion"
scankeys. A search scankey is the kind passed to btbeginscan() or
btrescan() from outside the btree code. The sk_func pointers in a search
scankey point to comparison functions that return boolean, such as int4lt.
There might be more than one scankey entry for a given index column, or
none at all. (We require the keys to appear in index column order, but
the order of multiple keys for a given column is unspecified.) An
insertion scankey uses the same array-of-ScanKey data structure, but the
sk_func pointers point to btree comparison support functions (ie, 3-way
comparators that return int4 values interpreted as <0, =0, >0). In an
insertion scankey there is exactly one entry per index column. Insertion
scankeys are built within the btree code (eg, by _bt_mkscankey()) and are
used to locate the starting point of a scan, as well as for locating the
place to insert a new index tuple. (Note: in the case of an insertion
scankey built from a search scankey, there might be fewer keys than
index columns, indicating that we have no constraints for the remaining
index columns.) After we have located the starting point of a scan, the
original search scankey is consulted as each index entry is sequentially
scanned to decide whether to return the entry and whether the scan can
stop (see _bt_checkkeys()).
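
As a plain-C illustration of the difference (PostgreSQL really invokes
both kinds through its fmgr interface, with Datum arguments):

    #include <stdbool.h>
    #include <stdint.h>

    /* search-scankey style: a boolean operator, like int4lt */
    static bool
    int4lt_like(int32_t a, int32_t b)
    {
        return a < b;
    }

    /* insertion-scankey style: a 3-way comparison support function,
     * like btint4cmp; the single <0/=0/>0 result is what lets a
     * descent decide left-versus-right with one call per column */
    static int
    btint4cmp_like(int32_t a, int32_t b)
    {
        return (a < b) ? -1 : ((a > b) ? 1 : 0);
    }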

We use the term "pivot" index tuples to distinguish tuples which don't
point to heap tuples, but rather are used for tree navigation. Pivot
tuples include all tuples on non-leaf pages and high keys on leaf pages.
Note that pivot index tuples are only used to represent which part of the
key space belongs on each page, and can have attribute values copied from
non-pivot tuples that were deleted and killed by VACUUM some time ago. In
principle, we could truncate away attributes that are not needed for a
page high key during a leaf page split, provided that the remaining
attributes distinguish the last index tuple on the post-split left page
as belonging on the left page, and the first index tuple on the
post-split right page as belonging on the right page. This optimization
is sometimes called suffix truncation, and may appear in a future
release. Since the high key is subsequently reused as the downlink in the
parent page for the new right page, suffix truncation can increase index
fan-out considerably by keeping pivot tuples short. INCLUDE indexes
similarly truncate away non-key attributes at the time of a leaf page
split, increasing fan-out.

Notes About Data Representation
-------------------------------

The right-sibling link required by L&Y is kept in the page "opaque
data" area, as is the left-sibling link, the page level, and some flags.
The page level counts upwards from zero at the leaf level, to the tree
depth minus 1 at the root. (Counting up from the leaves ensures that we
don't need to renumber any existing pages when splitting the root.)

The Postgres disk block data format (an array of items) doesn't fit
Lehman and Yao's alternating-keys-and-pointers notion of a disk page,
so we have to play some games.

On a page that is not rightmost in its tree level, the "high key" is
kept in the page's first item, and real data items start at item 2.
The link portion of the "high key" item goes unused. A page that is
rightmost has no "high key", so data items start with the first item.
Putting the high key at the left, rather than the right, may seem odd,
but it avoids moving the high key as we add data items.

On a leaf page, the data items are simply links to (TIDs of) tuples
in the relation being indexed, with the associated key values.

On a non-leaf page, the data items are down-links to child pages with
bounding keys. The key in each data item is the *lower* bound for
keys on that child page, so logically the key is to the left of that
downlink. The high key (if present) is the upper bound for the last
downlink. The first data item on each such page has no lower bound
--- or lower bound of minus infinity, if you prefer. The comparison
routines must treat it accordingly. The actual key stored in the
item is irrelevant, and need not be stored at all. This arrangement
corresponds to the fact that an L&Y non-leaf page has one more pointer
than key.
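
As a summary of the representation notes above, here is a rough C
rendering of the per-page opaque data. Field names are modeled on, but
not identical to, BTPageOpaqueData in src/include/access/nbtree.h; in
the real structure a deleted page reuses the level field to store the
transaction number mentioned under "Page Deletion".

    #include <stdint.h>

    typedef uint32_t BlockNumber;

    typedef struct BtreePageOpaqueSketch
    {
        BlockNumber left_sibling;    /* zero if leftmost on its level */
        BlockNumber right_sibling;   /* zero if rightmost on its level */
        uint32_t    level;           /* zero at leaf, counting upward */
        uint16_t    flags;           /* leaf, root, half-dead, deleted */
        uint16_t    split_cycle_id;  /* vacuum cycle ID of last split */
    } BtreePageOpaqueSketch;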