src/backend/access/nbtree/README
2
3Btree Indexing
4==============
5
6This directory contains a correct implementation of Lehman and Yao's
7high-concurrency B-tree management algorithm (P. Lehman and S. Yao,
8Efficient Locking for Concurrent Operations on B-Trees, ACM Transactions
9on Database Systems, Vol 6, No. 4, December 1981, pp 650-670).  We also
10use a simplified version of the deletion logic described in Lanin and
11Shasha (V. Lanin and D. Shasha, A Symmetric Concurrent B-Tree Algorithm,
12Proceedings of 1986 Fall Joint Computer Conference, pp 380-389).
13
14The basic Lehman & Yao Algorithm
15--------------------------------
16
17Compared to a classic B-tree, L&Y adds a right-link pointer to each page,
18to the page's right sibling.  It also adds a "high key" to each page, which
19is an upper bound on the keys that are allowed on that page.  These two
additions make it possible to detect a concurrent page split, which allows the
21tree to be searched without holding any read locks (except to keep a single
22page from being modified while reading it).
23
24When a search follows a downlink to a child page, it compares the page's
25high key with the search key.  If the search key is greater than the high
26key, the page must've been split concurrently, and you must follow the
27right-link to find the new page containing the key range you're looking
28for.  This might need to be repeated, if the page has been split more than
29once.
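
A minimal C sketch of this move-right rule (simplified in-memory pages and
invented field names, not the actual nbtree code or page layout) may help
make the recovery step concrete:

    #include <stddef.h>
    #include <stdbool.h>

    /* Simplified page: integer keys, a high key, and a right sibling link. */
    typedef struct SimplePage
    {
        int                high_key;      /* upper bound for keys on this page */
        bool               is_rightmost;  /* rightmost pages have no high key  */
        struct SimplePage *right;         /* L&Y right-link                    */
    } SimplePage;

    /*
     * After following a downlink, step right until we reach the page whose
     * key range still covers search_key.  "search_key > high_key" implies
     * that the page was split after the downlink was read, so the keys we
     * want now live somewhere to the right.
     */
    static SimplePage *
    move_right_if_split(SimplePage *page, int search_key)
    {
        while (!page->is_rightmost && search_key > page->high_key)
            page = page->right;   /* may repeat if the page split repeatedly */
        return page;
    }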
30
31Lehman and Yao talk about alternating "separator" keys and downlinks in
32internal pages rather than tuples or records.  We use the term "pivot"
tuple to refer to tuples that don't point to heap tuples and are used
34only for tree navigation.  All tuples on non-leaf pages and high keys on
35leaf pages are pivot tuples.  Since pivot tuples are only used to represent
36which part of the key space belongs on each page, they can have attribute
37values copied from non-pivot tuples that were deleted and killed by VACUUM
38some time ago.  A pivot tuple may contain a "separator" key and downlink,
39just a separator key (i.e. the downlink value is implicitly undefined), or
40just a downlink (i.e. all attributes are truncated away).
41
42The requirement that all btree keys be unique is satisfied by treating heap
43TID as a tiebreaker attribute.  Logical duplicates are sorted in heap TID
44order.  This is necessary because Lehman and Yao also require that the key
45range for a subtree S is described by Ki < v <= Ki+1 where Ki and Ki+1 are
46the adjacent keys in the parent page (Ki must be _strictly_ less than v,
47which is assured by having reliably unique keys).  Keys are always unique
48on their level, with the exception of a leaf page's high key, which can be
49fully equal to the last item on the page.
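
The tiebreaker rule can be illustrated with a small, self-contained C
comparator (a single integer key attribute and a made-up TID layout, rather
than the real _bt_compare and item pointer format): compare the user key
first, and fall back to heap TID so that every tuple is unique on its level:

    #include <stdint.h>

    /* Hypothetical simplified heap TID and index entry. */
    typedef struct { uint32_t block; uint16_t offset; } TupleId;
    typedef struct { int key; TupleId tid; } IndexEntry;

    /* 3-way comparison: user key first, heap TID as tiebreaker attribute. */
    static int
    compare_entries(const IndexEntry *a, const IndexEntry *b)
    {
        if (a->key != b->key)
            return (a->key < b->key) ? -1 : 1;
        if (a->tid.block != b->tid.block)
            return (a->tid.block < b->tid.block) ? -1 : 1;
        if (a->tid.offset != b->tid.offset)
            return (a->tid.offset < b->tid.offset) ? -1 : 1;
        return 0;
    }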
50
51The Postgres implementation of suffix truncation must make sure that the
52Lehman and Yao invariants hold, and represents that absent/truncated
53attributes in pivot tuples have the sentinel value "minus infinity".  The
54later section on suffix truncation will be helpful if it's unclear how the
55Lehman & Yao invariants work with a real world example.
56
57Differences to the Lehman & Yao algorithm
58-----------------------------------------
59
60We have made the following changes in order to incorporate the L&Y algorithm
61into Postgres:
62
63Lehman and Yao don't require read locks, but assume that in-memory
64copies of tree pages are unshared.  Postgres shares in-memory buffers
65among backends.  As a result, we do page-level read locking on btree
66pages in order to guarantee that no record is modified while we are
67examining it.  This reduces concurrency but guarantees correct
68behavior.
69
70We support the notion of an ordered "scan" of an index as well as
71insertions, deletions, and simple lookups.  A scan in the forward
72direction is no problem, we just use the right-sibling pointers that
73L&Y require anyway.  (Thus, once we have descended the tree to the
74correct start point for the scan, the scan looks only at leaf pages
75and never at higher tree levels.)  To support scans in the backward
76direction, we also store a "left sibling" link much like the "right
77sibling".  (This adds an extra step to the L&Y split algorithm: while
78holding the write lock on the page being split, we also lock its former
79right sibling to update that page's left-link.  This is safe since no
80writer of that page can be interested in acquiring a write lock on our
81page.)  A backwards scan has one additional bit of complexity: after
82following the left-link we must account for the possibility that the
83left sibling page got split before we could read it.  So, we have to
84move right until we find a page whose right-link matches the page we
85came from.  (Actually, it's even harder than that; see page deletion
86discussion below.)
87
88Page read locks are held only for as long as a scan is examining a page.
89To minimize lock/unlock traffic, an index scan always searches a leaf page
90to identify all the matching items at once, copying their heap tuple IDs
91into backend-local storage.  The heap tuple IDs are then processed while
92not holding any page lock within the index.  We do continue to hold a pin
93on the leaf page in some circumstances, to protect against concurrent
94deletions (see below).  In this state the scan is effectively stopped
95"between" pages, either before or after the page it has pinned.  This is
96safe in the presence of concurrent insertions and even page splits, because
97items are never moved across pre-existing page boundaries --- so the scan
98cannot miss any items it should have seen, nor accidentally return the same
99item twice.  The scan must remember the page's right-link at the time it
100was scanned, since that is the page to move right to; if we move right to
101the current right-link then we'd re-scan any items moved by a page split.
102We don't similarly remember the left-link, since it's best to use the most
103up-to-date left-link when trying to move left (see detailed move-left
104algorithm below).
105
106In most cases we release our lock and pin on a page before attempting
107to acquire pin and lock on the page we are moving to.  In a few places
108it is necessary to lock the next page before releasing the current one.
109This is safe when moving right or up, but not when moving left or down
110(else we'd create the possibility of deadlocks).
111
112Lehman and Yao fail to discuss what must happen when the root page
113becomes full and must be split.  Our implementation is to split the
114root in the same way that any other page would be split, then construct
115a new root page holding pointers to both of the resulting pages (which
116now become siblings on the next level of the tree).  The new root page
117is then installed by altering the root pointer in the meta-data page (see
118below).  This works because the root is not treated specially in any
119other way --- in particular, searches will move right using its link
120pointer if the link is set.  Therefore, searches will find the data
121that's been moved into the right sibling even if they read the meta-data
122page before it got updated.  This is the same reasoning that makes a
123split of a non-root page safe.  The locking considerations are similar too.
124
125When an inserter recurses up the tree, splitting internal pages to insert
126links to pages inserted on the level below, it is possible that it will
127need to access a page above the level that was the root when it began its
128descent (or more accurately, the level that was the root when it read the
129meta-data page).  In this case the stack it made while descending does not
130help for finding the correct page.  When this happens, we find the correct
131place by re-descending the tree until we reach the level one above the
132level we need to insert a link to, and then moving right as necessary.
133(Typically this will take only two fetches, the meta-data page and the new
134root, but in principle there could have been more than one root split
135since we saw the root.  We can identify the correct tree level by means of
136the level numbers stored in each page.  The situation is rare enough that
137we do not need a more efficient solution.)
138
139Lehman and Yao must couple/chain locks as part of moving right when
140relocating a child page's downlink during an ascent of the tree.  This is
141the only point where Lehman and Yao have to simultaneously hold three
142locks (a lock on the child, the original parent, and the original parent's
143right sibling).  We don't need to couple internal page locks for pages on
144the same level, though.  We match a child's block number to a downlink
145from a pivot tuple one level up, whereas Lehman and Yao match on the
146separator key associated with the downlink that was followed during the
147initial descent.  We can release the lock on the original parent page
148before acquiring a lock on its right sibling, since there is never any
149need to deal with the case where the separator key that we must relocate
150becomes the original parent's high key.  Lanin and Shasha don't couple
151locks here either, though they also don't couple locks between levels
152during ascents.  They are willing to "wait and try again" to avoid races.
153Their algorithm is optimistic, which means that "an insertion holds no
154more than one write lock at a time during its ascent".  We more or less
155stick with Lehman and Yao's approach of conservatively coupling parent and
156child locks when ascending the tree, since it's far simpler.
157
158Lehman and Yao assume fixed-size keys, but we must deal with
159variable-size keys.  Therefore there is not a fixed maximum number of
160keys per page; we just stuff in as many as will fit.  When we split a
161page, we try to equalize the number of bytes, not items, assigned to
162pages (though suffix truncation is also considered).  Note we must include
163the incoming item in this calculation, otherwise it is possible to find
164that the incoming item doesn't fit on the split page where it needs to go!
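
The byte-balancing idea can be sketched in a few lines of standalone C
(this ignores suffix truncation and the real page layout; item sizes are
simply given as an array, with the incoming item already counted):

    #include <stdlib.h>

    /*
     * Given the sizes in bytes of all items that would be on the page once
     * the incoming item is included, return the index of the first item
     * that should go to the right half, minimizing the byte difference
     * between the halves.
     */
    static int
    choose_split_point(const size_t *item_sizes, int nitems)
    {
        size_t  total = 0, left = 0;
        long    best_delta = -1;
        int     best_split = 1;
        int     i;

        for (i = 0; i < nitems; i++)
            total += item_sizes[i];

        for (i = 1; i < nitems; i++)
        {
            long    delta;

            left += item_sizes[i - 1];
            delta = (long) left - (long) (total - left);
            if (delta < 0)
                delta = -delta;
            if (best_delta < 0 || delta < best_delta)
            {
                best_delta = delta;
                best_split = i;
            }
        }
        return best_split;   /* items [0, best_split) stay on the left page */
    }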
165
166Deleting index tuples during VACUUM
167-----------------------------------
168
169Before deleting a leaf item, we get a super-exclusive lock on the target
170page, so that no other backend has a pin on the page when the deletion
171starts.  This is not necessary for correctness in terms of the btree index
172operations themselves; as explained above, index scans logically stop
173"between" pages and so can't lose their place.  The reason we do it is to
174provide an interlock between VACUUM and indexscans.  Since VACUUM deletes
175index entries before reclaiming heap tuple line pointers, the
176super-exclusive lock guarantees that VACUUM can't reclaim for re-use a
177line pointer that an indexscanning process might be about to visit.  This
178guarantee works only for simple indexscans that visit the heap in sync
179with the index scan, not for bitmap scans.  We only need the guarantee
180when using non-MVCC snapshot rules; when using an MVCC snapshot, it
181doesn't matter if the heap tuple is replaced with an unrelated tuple at
182the same TID, because the new tuple won't be visible to our scan anyway.
183Therefore, a scan using an MVCC snapshot which has no other confounding
184factors will not hold the pin after the page contents are read.  The
185current reasons for exceptions, where a pin is still needed, are if the
186index is not WAL-logged or if the scan is an index-only scan.  If later
187work allows the pin to be dropped for all cases we will be able to
188simplify the vacuum code, since the concept of a super-exclusive lock
189for btree indexes will no longer be needed.
190
191Because a pin is not always held, and a page can be split even while
192someone does hold a pin on it, it is possible that an indexscan will
193return items that are no longer stored on the page it has a pin on, but
194rather somewhere to the right of that page.  To ensure that VACUUM can't
195prematurely remove such heap tuples, we require btbulkdelete to obtain a
196super-exclusive lock on every leaf page in the index, even pages that
197don't contain any deletable tuples.  Any scan which could yield incorrect
198results if the tuple at a TID matching the scan's range and filter
199conditions were replaced by a different tuple while the scan is in
200progress must hold the pin on each index page until all index entries read
201from the page have been processed.  This guarantees that the btbulkdelete
202call cannot return while any indexscan is still holding a copy of a
203deleted index tuple if the scan could be confused by that.  Note that this
204requirement does not say that btbulkdelete must visit the pages in any
205particular order.  (See also simple deletion and bottom-up deletion,
206below.)
207
208There is no such interlocking for deletion of items in internal pages,
209since backends keep no lock nor pin on a page they have descended past.
210Hence, when a backend is ascending the tree using its stack, it must
211be prepared for the possibility that the item it wants is to the left of
212the recorded position (but it can't have moved left out of the recorded
213page).  Since we hold a lock on the lower page (per L&Y) until we have
214re-found the parent item that links to it, we can be assured that the
215parent item does still exist and can't have been deleted.
216
217VACUUM's linear scan, concurrent page splits
218--------------------------------------------
219
220VACUUM accesses the index by doing a linear scan to search for deletable
221TIDs, while considering the possibility of deleting empty pages in
222passing.  This is in physical/block order, not logical/keyspace order.
223The tricky part of this is avoiding missing any deletable tuples in the
224presence of concurrent page splits: a page split could easily move some
225tuples from a page not yet passed over by the sequential scan to a
226lower-numbered page already passed over.
227
228To implement this, we provide a "vacuum cycle ID" mechanism that makes it
229possible to determine whether a page has been split since the current
230btbulkdelete cycle started.  If btbulkdelete finds a page that has been
231split since it started, and has a right-link pointing to a lower page
232number, then it temporarily suspends its sequential scan and visits that
233page instead.  It must continue to follow right-links and vacuum dead
234tuples until reaching a page that either hasn't been split since
235btbulkdelete started, or is above the location of the outer sequential
236scan.  Then it can resume the sequential scan.  This ensures that all
237tuples are visited.  It may be that some tuples are visited twice, but
238that has no worse effect than an inaccurate index tuple count (and we
239can't guarantee an accurate count anyway in the face of concurrent
240activity).  Note that this still works if the has-been-recently-split test
241has a small probability of false positives, so long as it never gives a
242false negative.  This makes it possible to implement the test with a small
243counter value stored on each index page.
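
A hedged C sketch of the detour test (field names invented; the real logic
lives in the VACUUM scan code in nbtree.c) looks roughly like this:

    #include <stdint.h>
    #include <stdbool.h>

    typedef uint32_t BlockNum;

    #define NO_BLOCK ((BlockNum) 0xFFFFFFFF)    /* stand-in for "rightmost" */

    /* Simplified per-page fields relevant to the check. */
    typedef struct
    {
        uint16_t cycle_id;     /* vacuum cycle ID stamped at split time, 0 if none */
        BlockNum right_link;   /* right sibling, or NO_BLOCK if rightmost          */
    } PageInfo;

    /*
     * True if the linear scan, currently at scan_block and running with
     * vacuum cycle ID current_cycle, must suspend itself and follow this
     * page's right-link: the page was split since the scan started and its
     * right sibling lives at a lower block number already passed over.
     */
    static bool
    must_follow_right_link(const PageInfo *page, BlockNum scan_block,
                           uint16_t current_cycle)
    {
        return page->cycle_id == current_cycle &&
               page->right_link != NO_BLOCK &&
               page->right_link < scan_block;
    }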
244
245Deleting entire pages during VACUUM
246-----------------------------------
247
248We consider deleting an entire page from the btree only when it's become
249completely empty of items.  (Merging partly-full pages would allow better
250space reuse, but it seems impractical to move existing data items left or
251right to make this happen --- a scan moving in the opposite direction
252might miss the items if so.)  Also, we *never* delete the rightmost page
253on a tree level (this restriction simplifies the traversal algorithms, as
254explained below).  Page deletion always begins from an empty leaf page.  An
255internal page can only be deleted as part of deleting an entire subtree.
256This is always a "skinny" subtree consisting of a "chain" of internal pages
257plus a single leaf page.  There is one page on each level of the subtree,
258and each level/page covers the same key space.
259
260Deleting a leaf page is a two-stage process.  In the first stage, the page
261is unlinked from its parent, and marked as half-dead.  The parent page must
262be found using the same type of search as used to find the parent during an
263insertion split.  We lock the target and the parent pages, change the
264target's downlink to point to the right sibling, and remove its old
265downlink.  This causes the target page's key space to effectively belong to
266its right sibling.  (Neither the left nor right sibling pages need to
267change their "high key" if any; so there is no problem with possibly not
268having enough space to replace a high key.)  At the same time, we mark the
269target page as half-dead, which causes any subsequent searches to ignore it
270and move right (or left, in a backwards scan).  This leaves the tree in a
271similar state as during a page split: the page has no downlink pointing to
272it, but it's still linked to its siblings.
273
274(Note: Lanin and Shasha prefer to make the key space move left, but their
275argument for doing so hinges on not having left-links, which we have
276anyway.  So we simplify the algorithm by moving the key space right.  This
277is only possible because we don't match on a separator key when ascending
278the tree during a page split, unlike Lehman and Yao/Lanin and Shasha -- it
279doesn't matter if the downlink is re-found in a pivot tuple whose separator
key does not match the one encountered when the inserter initially descended
281the tree.)
282
283To preserve consistency on the parent level, we cannot merge the key space
284of a page into its right sibling unless the right sibling is a child of
285the same parent --- otherwise, the parent's key space assignment changes
286too, meaning we'd have to make bounding-key updates in its parent, and
287perhaps all the way up the tree.  Since we can't possibly do that
288atomically, we forbid this case.  That means that the rightmost child of a
289parent node can't be deleted unless it's the only remaining child, in which
290case we will delete the parent too (see below).
291
In the second stage, the half-dead leaf page is unlinked from its siblings.
293We first lock the left sibling (if any) of the target, the target page
294itself, and its right sibling (there must be one) in that order.  Then we
295update the side-links in the siblings, and mark the target page deleted.
296
297When we're about to delete the last remaining child of a parent page, things
298are slightly more complicated.  In the first stage, we leave the immediate
299parent of the leaf page alone, and remove the downlink to the parent page
300instead, from the grandparent.  If it's the last child of the grandparent
301too, we recurse up until we find a parent with more than one child, and
302remove the downlink of that page.  The leaf page is marked as half-dead, and
303the block number of the page whose downlink was removed is stashed in the
304half-dead leaf page.  This leaves us with a chain of internal pages, with
305one downlink each, leading to the half-dead leaf page, and no downlink
306pointing to the topmost page in the chain.
307
308While we recurse up to find the topmost parent in the chain, we keep the
309leaf page locked, but don't need to hold locks on the intermediate pages
310between the leaf and the topmost parent -- insertions into upper tree levels
311happen only as a result of splits of child pages, and that can't happen as
312long as we're keeping the leaf locked.  The internal pages in the chain
313cannot acquire new children afterwards either, because the leaf page is
314marked as half-dead and won't be split.
315
316Removing the downlink to the top of the to-be-deleted subtree/chain
317effectively transfers the key space to the right sibling for all the
318intermediate levels too, in one atomic operation.  A concurrent search might
319still visit the intermediate pages, but it will move right when it reaches
320the half-dead page at the leaf level.  In particular, the search will move to
321the subtree to the right of the half-dead leaf page/to-be-deleted subtree,
322since the half-dead leaf page's right sibling must be a "cousin" page, not a
323"true" sibling page (or a second cousin page when the to-be-deleted chain
starts at the leaf page's grandparent page, and so on).
325
326In the second stage, the topmost page in the chain is unlinked from its
327siblings, and the half-dead leaf page is updated to point to the next page
328down in the chain.  This is repeated until there are no internal pages left
329in the chain.  Finally, the half-dead leaf page itself is unlinked from its
330siblings.
331
332A deleted page cannot be recycled immediately, since there may be other
333processes waiting to reference it (ie, search processes that just left the
334parent, or scans moving right or left from one of the siblings).  These
335processes must be able to observe a deleted page for some time after the
336deletion operation, in order to be able to at least recover from it (they
337recover by moving right, as with concurrent page splits).  Searchers never
338have to worry about concurrent page recycling.
339
340See "Placing deleted pages in the FSM" section below for a description of
341when and how deleted pages become safe for VACUUM to make recyclable.
342
343Page deletion and backwards scans
344---------------------------------
345
346Moving left in a backward scan is complicated because we must consider
347the possibility that the left sibling was just split (meaning we must find
348the rightmost page derived from the left sibling), plus the possibility
349that the page we were just on has now been deleted and hence isn't in the
350sibling chain at all anymore.  So the move-left algorithm becomes:
351
3520. Remember the page we are on as the "original page".
3531. Follow the original page's left-link (we're done if this is zero).
3542. If the current page is live and its right-link matches the "original
355   page", we are done.
3563. Otherwise, move right one or more times looking for a live page whose
357   right-link matches the "original page".  If found, we are done.  (In
358   principle we could scan all the way to the right end of the index, but
359   in practice it seems better to give up after a small number of tries.
360   It's unlikely the original page's sibling split more than a few times
361   while we were in flight to it; if we do not find a matching link in a
362   few tries, then most likely the original page is deleted.)
3634. Return to the "original page".  If it is still live, return to step 1
364   (we guessed wrong about it being deleted, and should restart with its
365   current left-link).  If it is dead, move right until a non-dead page
366   is found (there must be one, since rightmost pages are never deleted),
367   mark that as the new "original page", and return to step 1.
368
369This algorithm is correct because the live page found by step 4 will have
370the same left keyspace boundary as the page we started from.  Therefore,
371when we ultimately exit, it must be on a page whose right keyspace
372boundary matches the left boundary of where we started --- which is what
373we need to be sure we don't miss or re-scan any items.
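
For concreteness, here is a self-contained C sketch of the move-left steps
above (simplified in-memory pages, no locking, invented field names):

    #include <stddef.h>
    #include <stdbool.h>

    typedef struct LPage
    {
        bool          live;    /* false once the page has been deleted      */
        struct LPage *left;    /* left sibling link (NULL at leftmost page) */
        struct LPage *right;   /* right sibling link                        */
    } LPage;

    #define MAX_RIGHT_TRIES 4  /* give up after a few tries, per the text */

    /* Return the live left sibling of "origin", or NULL at the left end. */
    static LPage *
    step_left(LPage *origin)
    {
        for (;;)
        {
            LPage  *cur = origin->left;                 /* step 1 */
            int     tries = 0;

            if (cur == NULL)
                return NULL;                            /* reached leftmost page */

            for (; cur != NULL; cur = cur->right)       /* steps 2 and 3 */
            {
                if (cur->live && cur->right == origin)
                    return cur;
                if (++tries > MAX_RIGHT_TRIES)
                    break;                  /* origin has probably been deleted */
            }

            /* step 4: re-examine the original page */
            while (!origin->live)
                origin = origin->right;     /* rightmost pages are never deleted */
            /* if origin was still live, simply retry with its current left-link */
        }
    }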
374
375Page deletion and tree height
376-----------------------------
377
378Because we never delete the rightmost page of any level (and in particular
379never delete the root), it's impossible for the height of the tree to
380decrease.  After massive deletions we might have a scenario in which the
381tree is "skinny", with several single-page levels below the root.
382Operations will still be correct in this case, but we'd waste cycles
383descending through the single-page levels.  To handle this we use an idea
384from Lanin and Shasha: we keep track of the "fast root" level, which is
385the lowest single-page level.  The meta-data page keeps a pointer to this
386level as well as the true root.  All ordinary operations initiate their
387searches at the fast root not the true root.  When we split a page that is
388alone on its level or delete the next-to-last page on a level (both cases
389are easily detected), we have to make sure that the fast root pointer is
390adjusted appropriately.  In the split case, we do this work as part of the
391atomic update for the insertion into the parent level; in the delete case
392as part of the atomic update for the delete (either way, the metapage has
393to be the last page locked in the update to avoid deadlock risks).  This
394avoids race conditions if two such operations are executing concurrently.
395
396Placing deleted pages in the FSM
397--------------------------------
398
399Recycling a page is decoupled from page deletion.  A deleted page can only
400be put in the FSM to be recycled once there is no possible scan or search
401that has a reference to it; until then, it must stay in place with its
402sibling links undisturbed, as a tombstone that allows concurrent searches
403to detect and then recover from concurrent deletions (which are rather
404like concurrent page splits to searchers).  This design is an
405implementation of what Lanin and Shasha call "the drain technique".
406
407We implement the technique by waiting until all active snapshots and
registered snapshots as of the page deletion are gone, which is overly
409strong, but is simple to implement within Postgres.  When marked fully
410dead, a deleted page is labeled with the next-transaction counter value.
411VACUUM can reclaim the page for re-use when the stored XID is guaranteed
412to be "visible to everyone".  As collateral damage, we wait for snapshots
413taken until the next transaction to allocate an XID commits.  We also wait
414for running XIDs with no snapshots.
415
416Prior to PostgreSQL 14, VACUUM would only place _old_ deleted pages that
417it encounters during its linear scan (pages deleted by a previous VACUUM
418operation) in the FSM.  Newly deleted pages were never placed in the FSM,
419because that was assumed to _always_ be unsafe.  That assumption was
420unnecessarily pessimistic in practice, though -- it often doesn't take
421very long for newly deleted pages to become safe to place in the FSM.
422There is no truly principled way to predict when deleted pages will become
423safe to place in the FSM for recycling -- it might become safe almost
424immediately (long before the current VACUUM completes), or it might not
425even be safe by the time the next VACUUM takes place.  Recycle safety is
426purely a question of maintaining the consistency (or at least the apparent
427consistency) of a physical data structure.  The state within the backend
428running VACUUM is simply not relevant.
429
430PostgreSQL 14 added the ability for VACUUM to consider if it's possible to
431recycle newly deleted pages at the end of the full index scan where the
432page deletion took place.  It is convenient to check if it's safe at that
433point.  This does require that VACUUM keep around a little bookkeeping
434information about newly deleted pages, but that's very cheap.  Using
435in-memory state for this avoids the need to revisit newly deleted pages a
436second time later on -- we can just use safexid values from the local
437bookkeeping state to determine recycle safety in a deferred fashion.
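
In C, the bookkeeping might be sketched as follows (the types and the
visibility horizon are stand-ins, not the real 64-bit safexid machinery):

    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t BlockNum;
    typedef uint64_t FullXid;

    /* One entry per page deleted during the current VACUUM. */
    typedef struct
    {
        BlockNum block;
        FullXid  safexid;   /* recyclable once this XID is old enough */
    } DeletedPage;

    /*
     * At the end of the index scan, place in the free space map every newly
     * deleted page whose safexid already lies behind the "visible to
     * everyone" horizon; leave the rest as tombstones for a later VACUUM.
     */
    static void
    recycle_safe_pages(const DeletedPage *pages, int npages, FullXid horizon)
    {
        int     i;

        for (i = 0; i < npages; i++)
        {
            if (pages[i].safexid < horizon)
                printf("block %u is safe to recycle\n",
                       (unsigned) pages[i].block);
        }
    }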
438
439The need for additional FSM indirection after a page deletion operation
440takes place is a natural consequence of the highly permissive rules for
441index scans with Lehman and Yao's design.  In general an index scan
442doesn't have to hold a lock or even a pin on any page when it descends the
443tree (nothing that you'd usually think of as an interlock is held "between
444levels").  At the same time, index scans cannot be allowed to land on a
445truly unrelated page due to concurrent recycling (not to be confused with
446concurrent deletion), because that results in wrong answers to queries.
447Simpler approaches to page deletion that don't need to defer recycling are
448possible, but none seem compatible with Lehman and Yao's design.
449
450Placing an already-deleted page in the FSM to be recycled when needed
451doesn't actually change the state of the page.  The page will be changed
452whenever it is subsequently taken from the FSM for reuse.  The deleted
453page's contents will be overwritten by the split operation (it will become
454the new right sibling page).
455
456Fastpath For Index Insertion
457----------------------------
458
459We optimize for a common case of insertion of increasing index key
460values by caching the last page to which this backend inserted the last
461value, if this page was the rightmost leaf page. For the next insert, we
462can then quickly check if the cached page is still the rightmost leaf
463page and also the correct place to hold the current value. We can avoid
464the cost of walking down the tree in such common cases.
465
466The optimization works on the assumption that there can only be one
467non-ignorable leaf rightmost page, and so not even a visible-to-everyone
468style interlock is required.  We cannot fail to detect that our hint was
469invalidated, because there can only be one such page in the B-Tree at
470any time. It's possible that the page will be deleted and recycled
471without a backend's cached page also being detected as invalidated, but
472only when we happen to recycle a block that once again gets recycled as the
473rightmost leaf page.
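
A minimal sketch of the cache test in C (the fields below are illustrative
stand-ins, not the actual fastpath code in nbtinsert.c):

    #include <stddef.h>
    #include <stdbool.h>

    /* What the backend must re-verify about its cached rightmost leaf page. */
    typedef struct
    {
        bool is_rightmost_leaf;   /* still the rightmost leaf page?         */
        bool has_free_space;      /* room for the incoming tuple?           */
        int  low_key;             /* smallest key that belongs on the page  */
    } LeafStatus;

    /*
     * Decide whether the previously cached rightmost leaf page can be
     * reused for inserting "key", avoiding a descent from the root.  On a
     * miss the caller descends normally and refreshes its cache.
     */
    static bool
    can_use_cached_leaf(const LeafStatus *cached, int key)
    {
        return cached != NULL &&
               cached->is_rightmost_leaf &&
               cached->has_free_space &&
               key >= cached->low_key;
    }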
474
475Simple deletion
476---------------
477
478If a process visits a heap tuple and finds that it's dead and removable
479(ie, dead to all open transactions, not only that process), then we can
480return to the index and mark the corresponding index entry "known dead",
481allowing subsequent index scans to skip visiting the heap tuple.  The
482"known dead" marking works by setting the index item's lp_flags state
483to LP_DEAD.  This is currently only done in plain indexscans, not bitmap
484scans, because only plain scans visit the heap and index "in sync" and so
485there's not a convenient way to do it for bitmap scans.  Note also that
486LP_DEAD bits are often set when checking a unique index for conflicts on
487insert (this is simpler because it takes place when we hold an exclusive
488lock on the leaf page).
489
490Once an index tuple has been marked LP_DEAD it can actually be deleted
491from the index immediately; since index scans only stop "between" pages,
492no scan can lose its place from such a deletion.  We separate the steps
493because we allow LP_DEAD to be set with only a share lock (it's exactly
494like a hint bit for a heap tuple), but physically removing tuples requires
495exclusive lock.  Also, delaying the deletion often allows us to pick up
496extra index tuples that weren't initially safe for index scans to mark
497LP_DEAD.  We do this with index tuples whose TIDs point to the same table
498blocks as an LP_DEAD-marked tuple.  They're practically free to check in
499passing, and have a pretty good chance of being safe to delete due to
500various locality effects.
501
502We only try to delete LP_DEAD tuples (and nearby tuples) when we are
503otherwise faced with having to split a page to do an insertion (and hence
504have exclusive lock on it already).  Deduplication and bottom-up index
505deletion can also prevent a page split, but simple deletion is always our
506preferred approach.  (Note that posting list tuples can only have their
507LP_DEAD bit set when every table TID within the posting list is known
508dead.  This isn't much of a problem in practice because LP_DEAD bits are
509just a starting point for simple deletion -- we still manage to perform
510granular deletes of posting list TIDs quite often.)
511
512It's sufficient to have an exclusive lock on the index page, not a
513super-exclusive lock, to do deletion of LP_DEAD items.  It might seem
514that this breaks the interlock between VACUUM and indexscans, but that is
515not so: as long as an indexscanning process has a pin on the page where
516the index item used to be, VACUUM cannot complete its btbulkdelete scan
517and so cannot remove the heap tuple.  This is another reason why
518btbulkdelete has to get a super-exclusive lock on every leaf page, not only
519the ones where it actually sees items to delete.
520
521LP_DEAD setting by index scans cannot be sure that a TID whose index tuple
522it had planned on LP_DEAD-setting has not been recycled by VACUUM if it
523drops its pin in the meantime.  It must conservatively also remember the
524LSN of the page, and only act to set LP_DEAD bits when the LSN has not
525changed at all. (Avoiding dropping the pin entirely also makes it safe, of
526course.)
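
A hedged C sketch of that rule (simplified LSN type and fields, not the
actual kill_prior_tuple machinery):

    #include <stdint.h>
    #include <stdbool.h>

    typedef uint64_t Lsn;

    typedef struct
    {
        Lsn  lsn_at_read;   /* page LSN remembered when items were copied */
        bool pin_held;      /* did we keep our pin on the leaf page?      */
    } ScanPosState;

    /*
     * Setting LP_DEAD bits later is only safe if nothing can have changed
     * on the page in the meantime: either we never dropped our pin, or the
     * page LSN is exactly what it was when the items were read (so VACUUM
     * cannot have removed and recycled any of the TIDs we are marking).
     */
    static bool
    safe_to_set_lp_dead(const ScanPosState *pos, Lsn current_page_lsn)
    {
        return pos->pin_held || current_page_lsn == pos->lsn_at_read;
    }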
527
528Bottom-Up deletion
529------------------
530
531We attempt to delete whatever duplicates happen to be present on the page
532when the duplicates are suspected to be caused by version churn from
533successive UPDATEs.  This only happens when we receive an executor hint
534indicating that optimizations like heapam's HOT have not worked out for
535the index -- the incoming tuple must be a logically unchanged duplicate
536which is needed for MVCC purposes, suggesting that that might well be the
537dominant source of new index tuples on the leaf page in question.  (Also,
538bottom-up deletion is triggered within unique indexes in cases with
539continual INSERT and DELETE related churn, since that is easy to detect
540without any external hint.)
541
542Simple deletion will already have failed to prevent a page split when a
543bottom-up deletion pass takes place (often because no LP_DEAD bits were
544ever set on the page).  The two mechanisms have closely related
545implementations.  The same WAL records are used for each operation, and
546the same tableam infrastructure is used to determine what TIDs/tuples are
547actually safe to delete.  The implementations only differ in how they pick
548TIDs to consider for deletion, and whether or not the tableam will give up
549before accessing all table blocks (bottom-up deletion lives with the
550uncertainty of its success by keeping the cost of failure low).  Even
551still, the two mechanisms are clearly distinct at the conceptual level.
552
553Bottom-up index deletion is driven entirely by heuristics (whereas simple
554deletion is guaranteed to delete at least those index tuples that are
555already LP_DEAD marked -- there must be at least one).  We have no
556certainty that we'll find even one index tuple to delete.  That's why we
557closely cooperate with the tableam to keep the costs it pays in balance
558with the benefits we receive.  The interface that we use for this is
559described in detail in access/tableam.h.
560
561Bottom-up index deletion can be thought of as a backstop mechanism against
562unnecessary version-driven page splits.  It is based in part on an idea
563from generational garbage collection: the "generational hypothesis".  This
564is the empirical observation that "most objects die young".  Within
565nbtree, new index tuples often quickly appear in the same place, and then
566quickly become garbage.  There can be intense concentrations of garbage in
567relatively few leaf pages with certain workloads (or there could be in
568earlier versions of PostgreSQL without bottom-up index deletion, at
569least).  See doc/src/sgml/btree.sgml for a high-level description of the
570design principles behind bottom-up index deletion in nbtree, including
571details of how it complements VACUUM.
572
573We expect to find a reasonably large number of tuples that are safe to
574delete within each bottom-up pass.  If we don't then we won't need to
575consider the question of bottom-up deletion for the same leaf page for
576quite a while (usually because the page splits, which resolves the
577situation for the time being).  We expect to perform regular bottom-up
578deletion operations against pages that are at constant risk of unnecessary
579page splits caused only by version churn.  When the mechanism works well
580we'll constantly be "on the verge" of having version-churn-driven page
581splits, but never actually have even one.
582
583Our duplicate heuristics work well despite being fairly simple.
584Unnecessary page splits only occur when there are truly pathological
585levels of version churn (in theory a small amount of version churn could
586make a page split occur earlier than strictly necessary, but that's pretty
587harmless).  We don't have to understand the underlying workload; we only
588have to understand the general nature of the pathology that we target.
589Version churn is easy to spot when it is truly pathological.  Affected
590leaf pages are fairly homogeneous.
591
592WAL Considerations
593------------------
594
595The insertion and deletion algorithms in themselves don't guarantee btree
596consistency after a crash.  To provide robustness, we depend on WAL
597replay.  A single WAL entry is effectively an atomic action --- we can
598redo it from the log if it fails to complete.
599
600Ordinary item insertions (that don't force a page split) are of course
601single WAL entries, since they only affect one page.  The same for
602leaf-item deletions (if the deletion brings the leaf page to zero items,
603it is now a candidate to be deleted, but that is a separate action).
604
605An insertion that causes a page split is logged as a single WAL entry for
606the changes occurring on the insertion's level --- including update of the
607right sibling's left-link --- followed by a second WAL entry for the
608insertion on the parent level (which might itself be a page split, requiring
609an additional insertion above that, etc).
610
611For a root split, the follow-on WAL entry is a "new root" entry rather than
612an "insertion" entry, but details are otherwise much the same.
613
614Because splitting involves multiple atomic actions, it's possible that the
615system crashes between splitting a page and inserting the downlink for the
616new half to the parent.  After recovery, the downlink for the new page will
617be missing.  The search algorithm works correctly, as the page will be found
618by following the right-link from its left sibling, although if a lot of
619downlinks in the tree are missing, performance will suffer.  A more serious
620consequence is that if the page without a downlink gets split again, the
621insertion algorithm will fail to find the location in the parent level to
622insert the downlink.
623
624Our approach is to create any missing downlinks on-the-fly, when searching
625the tree for a new insertion.  It could be done during searches, too, but
626it seems best not to put any extra updates in what would otherwise be a
627read-only operation (updating is not possible in hot standby mode anyway).
628It would seem natural to add the missing downlinks in VACUUM, but since
629inserting a downlink might require splitting a page, it might fail if you
630run out of disk space.  That would be bad during VACUUM - the reason for
631running VACUUM in the first place might be that you run out of disk space,
632and now VACUUM won't finish because you're out of disk space.  In contrast,
633an insertion can require enlarging the physical file anyway.  There is one
634minor exception: VACUUM finishes interrupted splits of internal pages when
635deleting their children.  This allows the code for re-finding parent items
636to be used by both page splits and page deletion.
637
638To identify missing downlinks, when a page is split, the left page is
639flagged to indicate that the split is not yet complete (INCOMPLETE_SPLIT).
640When the downlink is inserted to the parent, the flag is cleared atomically
641with the insertion.  The child page is kept locked until the insertion in
642the parent is finished and the flag in the child cleared, but can be
643released immediately after that, before recursing up the tree if the parent
644also needs to be split.  This ensures that incompletely split pages should
645not be seen under normal circumstances; only if insertion to the parent
646has failed for some reason. (It's also possible for a reader to observe
647a page with the incomplete split flag set during recovery; see later
648section on "Scans during Recovery" for details.)
649
650We flag the left page, even though it's the right page that's missing the
651downlink, because it's more convenient to know already when following the
652right-link from the left page to the right page that it will need to have
653its downlink inserted to the parent.
654
655When splitting a non-root page that is alone on its level, the required
656metapage update (of the "fast root" link) is performed and logged as part
657of the insertion into the parent level.  When splitting the root page, the
658metapage update is handled as part of the "new root" action.
659
660Each step in page deletion is logged as a separate WAL entry: marking the
661leaf as half-dead and removing the downlink is one record, and unlinking a
662page is a second record.  If vacuum is interrupted for some reason, or the
663system crashes, the tree is consistent for searches and insertions.  The
664next VACUUM will find the half-dead leaf page and continue the deletion.
665
666Before 9.4, we used to keep track of incomplete splits and page deletions
667during recovery and finish them immediately at end of recovery, instead of
668doing it lazily at the next insertion or vacuum.  However, that made the
669recovery much more complicated, and only fixed the problem when crash
670recovery was performed.  An incomplete split can also occur if an otherwise
671recoverable error, like out-of-memory or out-of-disk-space, happens while
672inserting the downlink to the parent.
673
674Scans during Recovery
675---------------------
676
677nbtree indexes support read queries in Hot Standby mode. Every atomic
678action/WAL record makes isolated changes that leave the tree in a
679consistent state for readers. Readers lock pages according to the same
680rules that readers follow on the primary. (Readers may have to move
681right to recover from a "concurrent" page split or page deletion, just
682like on the primary.)
683
684However, there are a couple of differences in how pages are locked by
685replay/the startup process as compared to the original write operation
686on the primary. The exceptions involve page splits and page deletions.
687The first phase and second phase of a page split are processed
688independently during replay, since they are independent atomic actions.
689We do not attempt to recreate the coupling of parent and child page
690write locks that took place on the primary. This is safe because readers
691never care about the incomplete split flag anyway. Holding on to an
692extra write lock on the primary is only necessary so that a second
693writer cannot observe the incomplete split flag before the first writer
694finishes the split. If we let concurrent writers on the primary observe
695an incomplete split flag on the same page, each writer would attempt to
696complete the unfinished split, corrupting the parent page.  (Similarly,
697replay of page deletion records does not hold a write lock on the target
698leaf page throughout; only the primary needs to block out concurrent
699writers that insert on to the page being deleted.)
700
701WAL replay holds same-level locks in a way that matches the approach
taken during original execution, though. This prevents readers from
703observing same-level inconsistencies. It's probably possible to be more
704lax about how same-level locks are acquired during recovery (most kinds
705of readers could still move right to recover if we didn't couple
706same-level locks), but we prefer to be conservative here.
707
708During recovery all index scans start with ignore_killed_tuples = false
709and we never set kill_prior_tuple. We do this because the oldest xmin
710on the standby server can be older than the oldest xmin on the primary
711server, which means tuples can be marked LP_DEAD even when they are
712still visible on the standby. We don't WAL log tuple LP_DEAD bits, but
713they can still appear in the standby because of full page writes. So
714we must always ignore them in standby, and that means it's not worth
715setting them either.  (When LP_DEAD-marked tuples are eventually deleted
716on the primary, the deletion is WAL-logged.  Queries that run on a
717standby therefore get much of the benefit of any LP_DEAD setting that
718takes place on the primary.)
719
720Note that we talk about scans that are started during recovery. We go to
721a little trouble to allow a scan to start during recovery and end during
722normal running after recovery has completed. This is a key capability
723because it allows running applications to continue while the standby
724changes state into a normally running server.
725
726The interlocking required to avoid returning incorrect results from
727non-MVCC scans is not required on standby nodes. We still get a
728super-exclusive lock ("cleanup lock") when replaying VACUUM records
729during recovery, but recovery does not need to lock every leaf page
730(only those leaf pages that have items to delete). That is safe because
731HeapTupleSatisfiesUpdate(), HeapTupleSatisfiesSelf(),
732HeapTupleSatisfiesDirty() and HeapTupleSatisfiesVacuum() are only ever
733used during write transactions, which cannot exist on the standby. MVCC
734scans are already protected by definition, so HeapTupleSatisfiesMVCC()
735is not a problem. The optimizer looks at the boundaries of value ranges
736using HeapTupleSatisfiesNonVacuumable() with an index-only scan, which
737is also safe. That leaves concern only for HeapTupleSatisfiesToast().
738
739HeapTupleSatisfiesToast() doesn't use MVCC semantics, though that's
740because it doesn't need to - if the main heap row is visible then the
741toast rows will also be visible. So as long as we follow a toast
742pointer from a visible (live) tuple the corresponding toast rows
743will also be visible, so we do not need to recheck MVCC on them.
744
745Other Things That Are Handy to Know
746-----------------------------------
747
748Page zero of every btree is a meta-data page.  This page stores the
749location of the root page --- both the true root and the current effective
750root ("fast" root).  To avoid fetching the metapage for every single index
751search, we cache a copy of the meta-data information in the index's
752relcache entry (rd_amcache).  This is a bit ticklish since using the cache
753implies following a root page pointer that could be stale.  However, a
754backend following a cached pointer can sufficiently verify whether it
755reached the intended page; either by checking the is-root flag when it
756is going to the true root, or by checking that the page has no siblings
757when going to the fast root.  At worst, this could result in descending
758some extra tree levels if we have a cached pointer to a fast root that is
759now above the real fast root.  Such cases shouldn't arise often enough to
760be worth optimizing; and in any case we can expect a relcache flush will
761discard the cached metapage before long, since a VACUUM that's moved the
762fast root pointer can be expected to issue a statistics update for the
763index.
764
765The algorithm assumes we can fit at least three items per page
766(a "high key" and two real data items).  Therefore it's unsafe
767to accept items larger than 1/3rd page size.  Larger items would
768work sometimes, but could cause failures later on depending on
769what else gets put on their page.
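
With the default 8192-byte page size this works out to a limit in the
neighborhood of 2700 bytes per index tuple.  The figures below are an
illustrative approximation (the real ceiling also accounts for alignment
and is computed from the page layout macros), not the exact formula:

    #include <stdio.h>

    int
    main(void)
    {
        const int page_size    = 8192;  /* default block size              */
        const int page_header  = 24;    /* approximate page header size    */
        const int special_area = 16;    /* approximate btree opaque area   */
        const int per_item     = 4;     /* approximate line pointer size   */

        int     usable  = page_size - page_header - special_area;
        int     per_key = usable / 3 - per_item;

        /* Three items (a high key plus two data items) must always fit. */
        printf("max index tuple size is roughly %d bytes\n", per_key);
        return 0;
    }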
770
771"ScanKey" data structures are used in two fundamentally different ways
772in this code, which we describe as "search" scankeys and "insertion"
773scankeys.  A search scankey is the kind passed to btbeginscan() or
774btrescan() from outside the btree code.  The sk_func pointers in a search
775scankey point to comparison functions that return boolean, such as int4lt.
776There might be more than one scankey entry for a given index column, or
777none at all.  (We require the keys to appear in index column order, but
778the order of multiple keys for a given column is unspecified.)  An
779insertion scankey ("BTScanInsert" data structure) uses a similar
780array-of-ScanKey data structure, but the sk_func pointers point to btree
781comparison support functions (ie, 3-way comparators that return int4 values
782interpreted as <0, =0, >0).  In an insertion scankey there is at most one
783entry per index column.  There is also other data about the rules used to
784locate where to begin the scan, such as whether or not the scan is a
785"nextkey" scan.  Insertion scankeys are built within the btree code (eg, by
786_bt_mkscankey()) and are used to locate the starting point of a scan, as
787well as for locating the place to insert a new index tuple.  (Note: in the
788case of an insertion scankey built from a search scankey or built from a
789truncated pivot tuple, there might be fewer keys than index columns,
790indicating that we have no constraints for the remaining index columns.)
791After we have located the starting point of a scan, the original search
792scankey is consulted as each index entry is sequentially scanned to decide
793whether to return the entry and whether the scan can stop (see
794_bt_checkkeys()).
795
796Notes about suffix truncation
797-----------------------------
798
799We truncate away suffix key attributes that are not needed for a page high
800key during a leaf page split.  The remaining attributes must distinguish
801the last index tuple on the post-split left page as belonging on the left
802page, and the first index tuple on the post-split right page as belonging
803on the right page.  Tuples logically retain truncated key attributes,
804though they implicitly have "negative infinity" as their value, and have no
805storage overhead.  Since the high key is subsequently reused as the
806downlink in the parent page for the new right page, suffix truncation makes
807pivot tuples short.  INCLUDE indexes are guaranteed to have non-key
808attributes truncated at the time of a leaf page split, but may also have
809some key attributes truncated away, based on the usual criteria for key
810attributes.  They are not a special case, since non-key attributes are
811merely payload to B-Tree searches.
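
Reduced to its essentials, the decision is just "how many leading
attributes are needed to tell the last tuple on the left half from the
first tuple on the right half".  A standalone C sketch with fixed-size
integer attributes (the real _bt_truncate also deals with the heap TID
tiebreaker and variable-length datums):

    /*
     * Return how many leading attributes the new high key must keep so that
     * it still separates lastleft from firstright; attributes beyond that
     * can be truncated away.
     */
    static int
    attributes_needed(const int *lastleft, const int *firstright, int natts)
    {
        int     i;

        for (i = 0; i < natts; i++)
        {
            if (lastleft[i] != firstright[i])
                return i + 1;   /* first distinguishing attribute */
        }
        return natts;           /* fully equal: no key attribute can go */
    }

For example, with a two-attribute key, splitting between (300, 7) on the
left and (301, 2) on the right needs only the first attribute in the new
high key; the second is truncated away.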
812
813The goal of suffix truncation of key attributes is to improve index
814fan-out.  The technique was first described by Bayer and Unterauer (R.Bayer
815and K.Unterauer, Prefix B-Trees, ACM Transactions on Database Systems, Vol
8162, No. 1, March 1977, pp 11-26).  The Postgres implementation is loosely
817based on their paper.  Note that Postgres only implements what the paper
818refers to as simple prefix B-Trees.  Note also that the paper assumes that
819the tree has keys that consist of single strings that maintain the "prefix
820property", much like strings that are stored in a suffix tree (comparisons
821of earlier bytes must always be more significant than comparisons of later
822bytes, and, in general, the strings must compare in a way that doesn't
823break transitive consistency as they're split into pieces).  Suffix
824truncation in Postgres currently only works at the whole-attribute
825granularity, but it would be straightforward to invent opclass
826infrastructure that manufactures a smaller attribute value in the case of
827variable-length types, such as text.  An opclass support function could
828manufacture the shortest possible key value that still correctly separates
829each half of a leaf page split.
830
There are sophisticated criteria for choosing a leaf page split point.  The
832general idea is to make suffix truncation effective without unduly
833influencing the balance of space for each half of the page split.  The
834choice of leaf split point can be thought of as a choice among points
835*between* items on the page to be split, at least if you pretend that the
836incoming tuple was placed on the page already (you have to pretend because
837there won't actually be enough space for it on the page).  Choosing the
838split point between two index tuples where the first non-equal attribute
839appears as early as possible results in truncating away as many suffix
840attributes as possible.  Evenly balancing space among each half of the
841split is usually the first concern, but even small adjustments in the
842precise split point can allow truncation to be far more effective.
843
844Suffix truncation is primarily valuable because it makes pivot tuples
845smaller, which delays splits of internal pages, but that isn't the only
846reason why it's effective.  Even truncation that doesn't make pivot tuples
847smaller due to alignment still prevents pivot tuples from being more
848restrictive than truly necessary in how they describe which values belong
849on which pages.
850
851While it's not possible to correctly perform suffix truncation during
852internal page splits, it's still useful to be discriminating when splitting
an internal page.  From within an acceptable range around the
fillfactor-wise optimal split point, we choose the split point whose
implied parent downlink is smallest.  This idea also comes
856from the Prefix B-Tree paper.  This process has much in common with what
857happens at the leaf level to make suffix truncation effective.  The overall
858effect is that suffix truncation tends to produce smaller, more
859discriminating pivot tuples, especially early in the lifetime of the index,
860while biasing internal page splits makes the earlier, smaller pivot tuples
861end up in the root page, delaying root page splits.
862
863Logical duplicates are given special consideration.  The logic for
864selecting a split point goes to great lengths to avoid having duplicates
865span more than one page, and almost always manages to pick a split point
866between two user-key-distinct tuples, accepting a completely lopsided split
867if it must.  When a page that's already full of duplicates must be split,
868the fallback strategy assumes that duplicates are mostly inserted in
869ascending heap TID order.  The page is split in a way that leaves the left
870half of the page mostly full, and the right half of the page mostly empty.
871The overall effect is that leaf page splits gracefully adapt to inserts of
872large groups of duplicates, maximizing space utilization.  Note also that
873"trapping" large groups of duplicates on the same leaf page like this makes
874deduplication more efficient.  Deduplication can be performed infrequently,
875without merging together existing posting list tuples too often.
876
877Notes about deduplication
878-------------------------
879
880We deduplicate non-pivot tuples in non-unique indexes to reduce storage
881overhead, and to avoid (or at least delay) page splits.  Note that the
882goals for deduplication in unique indexes are rather different; see later
883section for details.  Deduplication alters the physical representation of
884tuples without changing the logical contents of the index, and without
885adding overhead to read queries.  Non-pivot tuples are merged together
886into a single physical tuple with a posting list (a simple array of heap
887TIDs with the standard item pointer format).  Deduplication is always
888applied lazily, at the point where it would otherwise be necessary to
889perform a page split.  It occurs only when LP_DEAD items have been
890removed, as our last line of defense against splitting a leaf page
891(bottom-up index deletion may be attempted first, as our second last line
892of defense).  We can set the LP_DEAD bit with posting list tuples, though
893only when all TIDs are known dead.
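
A simplified C sketch of the physical transformation (invented structures,
nothing like the real on-page posting list format handled in nbtdedup.c):

    #include <stdlib.h>
    #include <stdint.h>

    typedef struct { uint32_t block; uint16_t offset; } TupleId;

    /* One index entry before deduplication: a key and a single heap TID. */
    typedef struct { int key; TupleId tid; } PlainTuple;

    /* After deduplication: one key plus a posting list of heap TIDs. */
    typedef struct
    {
        int      key;
        int      ntids;
        TupleId *tids;      /* kept in ascending heap TID order */
    } PostingTuple;

    /*
     * Merge n equal-keyed plain tuples (already in heap TID order, as they
     * appear on the page) into a single posting tuple.  The caller owns
     * the returned TID array.
     */
    static PostingTuple
    merge_duplicates(const PlainTuple *dups, int n)
    {
        PostingTuple result;
        int          i;

        result.key   = dups[0].key;
        result.ntids = n;
        result.tids  = malloc(sizeof(TupleId) * (size_t) n);
        if (result.tids == NULL)
        {
            result.ntids = 0;   /* out of memory: return an empty posting list */
            return result;
        }
        for (i = 0; i < n; i++)
            result.tids[i] = dups[i].tid;
        return result;
    }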

Our lazy approach to deduplication allows the page space accounting used
during page splits to have absolutely minimal special case logic for
posting lists.  Posting lists can be thought of as extra payload that
suffix truncation will reliably truncate away as needed during page
splits, just like non-key columns from an INCLUDE index tuple.
Incoming/new tuples can generally be treated as non-overlapping plain
items (though see section on posting list splits for information about how
overlapping new/incoming items are really handled).

The representation of posting lists is almost identical to the posting
lists used by GIN, so it would be straightforward to apply GIN's varbyte
encoding compression scheme to individual posting lists.  Posting list
compression would break the assumptions made by posting list splits about
page space accounting (see later section), so it's not clear how
compression could be integrated with nbtree.  Besides, posting list
compression does not offer a compelling trade-off for nbtree, since in
general nbtree is optimized for consistent performance with many
concurrent readers and writers.  Compression would also make the deletion
of a subset of TIDs from a posting list slow and complicated, which would
be a big problem for workloads that depend heavily on bottom-up index
deletion.

A major goal of our lazy approach to deduplication is to limit the
performance impact of deduplication with random updates.  Even concurrent
append-only inserts of the same key value will tend to have inserts of
individual index tuples in an order that doesn't quite match heap TID
order.  Delaying deduplication minimizes page level fragmentation.

Deduplication in unique indexes
-------------------------------

Very often, the number of distinct values that can ever be placed on
almost any given leaf page in a unique index is fixed and permanent.  For
example, a primary key on an identity column will usually only have leaf
page splits caused by the insertion of new logical rows within the
rightmost leaf page.  If there is a split of a non-rightmost leaf page,
then the split must have been triggered by inserts associated with UPDATEs
of existing logical rows.  Splitting a leaf page purely to store multiple
versions is a false economy.  In effect, we're permanently degrading the
index structure just to absorb a temporary burst of duplicates.

Deduplication in unique indexes helps to prevent these pathological page
splits.  Storing duplicates in a space efficient manner is not the goal,
since in the long run there won't be any duplicates anyway.  Rather, we're
buying time for standard garbage collection mechanisms to run before a
page split is needed.

Unique index leaf pages only get a deduplication pass when an insertion
(that might have to split the page) observed an existing duplicate on the
page in passing.  This is based on the assumption that deduplication will
only work out when _all_ new insertions are duplicates from UPDATEs.  This
may mean that we miss an opportunity to delay a page split, but that's
okay because our ultimate goal is to delay leaf page splits _indefinitely_
(i.e. to prevent them altogether).  There is little point in trying to
delay a split that is probably inevitable anyway.  This allows us to avoid
the overhead of attempting to deduplicate with unique indexes that always
have few or no duplicates.
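
The trigger condition amounts to a single test, sketched below in
hypothetical C (the real check happens during insertion in nbtinsert.c;
the names here are invented for the example).

    #include <stdbool.h>

    /* Hypothetical sketch of the trigger condition described above. */
    static bool
    should_try_unique_dedup(bool page_would_split, bool saw_duplicate_of_new_key)
    {
        /*
         * A split caused purely by new logical rows is probably inevitable,
         * so only bother deduplicating when the incoming insertion already
         * noticed a duplicate of its key while checking uniqueness.
         */
        return page_would_split && saw_duplicate_of_new_key;
    }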

Note: Avoiding "unnecessary" page splits driven by version churn is also
the goal of bottom-up index deletion, which was added to PostgreSQL 14.
Bottom-up index deletion is now the preferred way to deal with this
problem (with all kinds of indexes, though especially with unique
indexes).  Still, deduplication can sometimes augment bottom-up index
deletion.  When deletion cannot free tuples (due to an old snapshot
holding up cleanup), falling back on deduplication provides additional
capacity.  Delaying the page split by deduplicating can allow a future
bottom-up deletion pass of the same page to succeed.

Posting list splits
-------------------

When the incoming tuple happens to overlap with an existing posting list,
a posting list split is performed.  Like a page split, a posting list
split resolves a situation where a new/incoming item "won't fit", while
inserting the incoming item in passing (i.e. as part of the same atomic
action).  It's possible (though not particularly likely) that an insert of
a new item on to an almost-full page will overlap with a posting list,
resulting in both a posting list split and a page split.  Even then, the
atomic action that splits the posting list also inserts the new item
(since page splits always insert the new item in passing).  Including the
posting list split in the same atomic action as the insert avoids problems
caused by concurrent inserts into the same posting list -- the exact
details of how we change the posting list depend upon the new item, and
vice-versa.  A single atomic action also minimizes the volume of extra
WAL required for a posting list split, since we don't have to explicitly
WAL-log the original posting list tuple.

Despite piggy-backing on the same atomic action that inserts a new tuple,
posting list splits can be thought of as a separate, extra action to the
insert itself (or to the page split itself).  Posting list splits
conceptually "rewrite" an insert that overlaps with an existing posting
list into an insert that adds its final new item just to the right of the
posting list instead.  The size of the posting list won't change, and so
page space accounting code does not need to care about posting list splits
at all.  This is an important upside of our design; the page split point
choice logic is very subtle even without it needing to deal with posting
list splits.

Only a few isolated extra steps are required to preserve the illusion that
the new item never overlapped with an existing posting list in the first
place: the incoming tuple has its heap TID replaced with the rightmost/max
heap TID from the existing/originally overlapping posting list.  Similarly,
the original incoming item's TID is relocated to the appropriate offset in
the posting list (we usually shift TIDs out of the way to make a hole for
it).  Finally, the posting-split-with-page-split case must generate a new
high key based on an imaginary version of the original page that has both
the final new item and the after-list-split posting tuple (page splits
usually just operate against an imaginary version that contains the new
item/item that won't fit).
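
The TID shuffle can be sketched in standalone C (invented types; the real
code is in nbtinsert.c and nbtdedup.c): the incoming TID displaces the
posting list's maximum TID, and that old maximum becomes the TID of the
item that actually gets inserted to the right of the posting list.

    /* Standalone sketch: types and layout are invented for the example. */
    #include <stdio.h>
    #include <stdint.h>

    typedef struct { uint32_t block; uint16_t offset; } TupleId;

    static int
    tid_cmp(TupleId a, TupleId b)
    {
        if (a.block != b.block)
            return (a.block < b.block) ? -1 : 1;
        if (a.offset != b.offset)
            return (a.offset < b.offset) ? -1 : 1;
        return 0;
    }

    /*
     * "Rewrite" an insert whose TID lands inside a sorted posting list:
     * swap the incoming TID into place and hand back the old maximum TID,
     * which becomes the TID of the final new item inserted to the right.
     */
    static void
    posting_list_split(TupleId *tids, int ntids, TupleId *newtid)
    {
        TupleId oldmax = tids[ntids - 1];
        int     i = ntids - 1;

        while (i > 0 && tid_cmp(tids[i - 1], *newtid) > 0)
        {
            tids[i] = tids[i - 1];      /* shift larger TIDs right */
            i--;
        }
        tids[i] = *newtid;              /* incoming TID joins the list */
        *newtid = oldmax;               /* caller inserts old max instead */
    }

    int
    main(void)
    {
        TupleId list[] = {{10, 1}, {10, 7}, {11, 3}};
        TupleId incoming = {10, 4};     /* falls inside the list's TID range */

        posting_list_split(list, 3, &incoming);
        printf("item to insert: (%u,%u)\n",
               (unsigned) incoming.block, (unsigned) incoming.offset);
        return 0;
    }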

This approach avoids inventing an "eager" atomic posting split operation
that splits the posting list without simultaneously finishing the insert
of the incoming item.  This alternative design might seem cleaner, but it
creates subtle problems for page space accounting.  In general, there
might not be enough free space on the page to split a posting list such
that the incoming/new item no longer overlaps with either posting list
half --- the operation could fail before the actual retail insert of the
new item even begins.  We'd end up having to handle posting list splits
that need a page split anyway.  Besides, supporting variable "split points"
while splitting posting lists won't actually improve overall space
utilization.

Notes About Data Representation
-------------------------------

The right-sibling link required by L&Y is kept in the page "opaque
data" area, as is the left-sibling link, the page level, and some flags.
The page level counts upwards from zero at the leaf level, to the tree
depth minus 1 at the root.  (Counting up from the leaves ensures that we
don't need to renumber any existing pages when splitting the root.)
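
A simplified sketch of that opaque data area follows; the authoritative
definition is BTPageOpaqueData in access/nbtree.h, and this copy omits
details (such as the vacuum cycle ID).

    /* Simplified sketch of the per-page opaque data; see nbtree.h for the
     * real BTPageOpaqueData. */
    typedef unsigned int BlockNumber;

    typedef struct BTPageOpaqueSketch
    {
        BlockNumber btpo_prev;      /* left sibling, or "none" if leftmost */
        BlockNumber btpo_next;      /* right sibling, or "none" if rightmost */
        unsigned int btpo_level;    /* tree level: zero at the leaf level */
        unsigned short btpo_flags;  /* flag bits: leaf, root, deleted, ... */
    } BTPageOpaqueSketch;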

The Postgres disk block data format (an array of items) doesn't fit
Lehman and Yao's alternating-keys-and-pointers notion of a disk page,
so we have to play some games.  (The alternating-keys-and-pointers
notion is important for internal page splits, which conceptually split
at the middle of an existing pivot tuple -- the tuple's "separator" key
goes on the left side of the split as the left side's new high key,
while the tuple's pointer/downlink goes on the right side as the
first/minus infinity downlink.)

On a page that is not rightmost in its tree level, the "high key" is
kept in the page's first item, and real data items start at item 2.
The link portion of the "high key" item goes unused.  A page that is
rightmost has no "high key" (it's implicitly positive infinity), so
data items start with the first item.  Putting the high key at the
left, rather than the right, may seem odd, but it avoids moving the
high key as we add data items.
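
In code form, a reader picks the first data item offset roughly as follows
(compare the P_HIKEY, P_FIRSTKEY and P_FIRSTDATAKEY macros in nbtree.h;
the names below are invented for the sketch).

    /* Sketch of the offset convention; nbtree.h expresses this with macros. */
    #define SKETCH_HIKEY     1      /* high key item, when the page has one */
    #define SKETCH_FIRSTKEY  2      /* first data item on non-rightmost pages */

    static int
    first_data_offset(int page_is_rightmost)
    {
        /* rightmost pages have no high key, so data starts at item 1 */
        return page_is_rightmost ? SKETCH_HIKEY : SKETCH_FIRSTKEY;
    }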

On a leaf page, the data items are simply links to (TIDs of) tuples
in the relation being indexed, with the associated key values.

On a non-leaf page, the data items are down-links to child pages with
bounding keys.  The key in each data item is a strict lower bound for
keys on that child page, so logically the key is to the left of that
downlink.  The high key (if present) is the upper bound for the last
downlink.  The first data item on each such page has no lower bound
--- or lower bound of minus infinity, if you prefer.  The comparison
routines must treat it accordingly.  The actual key stored in the
item is irrelevant, and need not be stored at all.  This arrangement
corresponds to the fact that an L&Y non-leaf page has one more pointer
than key.  Suffix truncation's negative infinity attributes behave in
the same way.
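
For example, a comparison routine might special-case the minus infinity
item along these lines (a sketch only; the real special case lives in
_bt_compare() in nbtsearch.c, and the parameter names here are invented).

    /* Sketch: treat an internal page's first data item as minus infinity. */
    static int
    compare_to_item(int page_is_leaf, int offnum, int first_data_offnum,
                    int ordinary_key_cmp)
    {
        if (!page_is_leaf && offnum == first_data_offnum)
            return 1;               /* scan key > minus infinity, always */
        return ordinary_key_cmp;    /* otherwise compare keys normally */
    }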