access/heap/README.HOT

src/backend/access/heap/README.HOT

Heap Only Tuples (HOT)
======================

The Heap Only Tuple (HOT) feature eliminates redundant index entries and
allows the re-use of space taken by DELETEd or obsoleted UPDATEd tuples
without performing a table-wide vacuum.  It does this by allowing
single-page vacuuming, also called "defragmentation".

Note: there is a Glossary at the end of this document that may be helpful
for first-time readers.


Technical Challenges
--------------------

Page-at-a-time vacuuming is normally impractical because of the costs of
finding and removing the index entries that link to the tuples to be
reclaimed.  Standard vacuuming scans the indexes to ensure all such index
entries are removed, amortizing the index scan cost across as many dead
tuples as possible; this approach does not scale down well to the case of
reclaiming just a few tuples.  In principle one could recompute the index
keys and do standard index searches to find the index entries, but this is
risky in the presence of possibly-buggy user-defined functions in
functional indexes.  An allegedly immutable function that in fact is not
immutable might prevent us from re-finding an index entry (and we cannot
throw an error for not finding it, in view of the fact that dead index
entries are sometimes reclaimed early).  That would lead to a seriously
corrupt index, in the form of entries pointing to tuple slots that by now
contain some unrelated content.  In any case we would prefer to be able
to do vacuuming without invoking any user-written code.

HOT solves this problem for a restricted but useful special case:
where a tuple is repeatedly updated in ways that do not change its
indexed columns.  (Here, "indexed column" means any column referenced
at all in an index definition, including for example columns that are
tested in a partial-index predicate but are not stored in the index.)

An additional property of HOT is that it reduces index size by avoiding
the creation of identically-keyed index entries.  This improves search
speeds.


Update Chains With a Single Index Entry
---------------------------------------

Without HOT, every version of a row in an update chain has its own index
entries, even if all indexed columns are the same.  With HOT, a new tuple
placed on the same page and with all indexed columns the same as its
parent row version does not get new index entries.  This means there is
only one index entry for the entire update chain on the heap page.
An index-entry-less tuple is marked with the HEAP_ONLY_TUPLE flag.
The prior row version is marked HEAP_HOT_UPDATED, and (as always in an
update chain) its t_ctid field links forward to the newer version.

For example:

	Index points to 1
	lp [1]  [2]

	[111111111]->[2222222222]

In the above diagram, the index points to line pointer 1, and tuple 1 is
marked as HEAP_HOT_UPDATED.  Tuple 2 is a HOT tuple, meaning it has
no index entry pointing to it, and is marked as HEAP_ONLY_TUPLE.
Although tuple 2 is not directly referenced by the index, it can still be
found by an index search: after traversing from the index to tuple 1,
the index search proceeds forward to child tuples as long as it sees the
HEAP_HOT_UPDATED flag set.  Since we restrict the HOT chain to lie within
a single page, this requires no additional page fetches and doesn't
introduce much performance penalty.

Eventually, tuple 1 will no longer be visible to any transaction.
At that point its space could be reclaimed, but its line pointer cannot,
since the index still links to that line pointer and we still need to
be able to find tuple 2 in an index search.  HOT handles this by turning
line pointer 1 into a "redirecting line pointer", which links to tuple 2
but has no actual tuple attached.  This state of affairs looks like

	Index points to 1
	lp [1]->[2]

	[2222222222]

If now the row is updated again, to version 3, the page looks like this:

	Index points to 1
	lp [1]->[2]  [3]

	[2222222222]->[3333333333]

At some later time when no transaction can see tuple 2 in its snapshot,
tuple 2 and its line pointer can be pruned entirely:

	Index points to 1
	lp [1]------>[3]

	[3333333333]

This is safe because no index entry points to line pointer 2.  Subsequent
insertions into the page can now recycle both line pointer 2 and the
space formerly used by tuple 2.

If an update changes any indexed column, or there is not room on the
same page for the new tuple, then the HOT chain ends: the last member
has a regular t_ctid link to the next version and is not marked
HEAP_HOT_UPDATED.  (In principle we could continue a HOT chain across
pages, but this would destroy the desired property of being able to
reclaim space with just page-local manipulations.  Anyway, we don't
want to have to chase through multiple heap pages to get from an index
entry to the desired tuple, so it seems better to create a new index
entry for the new tuple.)  If further updates occur, the next version
could become the root of a new HOT chain.

Line pointer 1 has to remain as long as there is any non-dead member of
the chain on the page.  When there is not, it is marked "dead".
This lets us reclaim the last child line pointer and associated tuple
immediately.  The next regular VACUUM pass can reclaim the index entries
pointing at the line pointer and then the line pointer itself.  Since a
line pointer is small compared to a tuple, this does not represent an
undue space cost.

Note: we can use a "dead" line pointer for any DELETEd tuple,
whether it was part of a HOT chain or not.  This allows space reclamation
in advance of running VACUUM for plain DELETEs as well as HOT updates.

The requirement for doing a HOT update is that none of the indexed
columns are changed.  This is checked at execution time by comparing the
binary representation of the old and new values.  We insist on bitwise
equality rather than using datatype-specific equality routines.  The
main reason to avoid the latter is that there might be multiple notions
of equality for a datatype, and we don't know exactly which one is
relevant for the indexes at hand.  We assume that bitwise equality
guarantees equality for all purposes.


Abort Cases
-----------

If a heap-only tuple's xmin is aborted, then it can be removed immediately:
it was never visible to any other transaction, and all descendant row
versions must be aborted as well.  Therefore we need not consider it part
of a HOT chain.  By the same token, if a HOT-updated tuple's xmax is
aborted, there is no need to follow the chain link.  However, there is a
race condition here: the transaction that did the HOT update might abort
between the time we inspect the HOT-updated tuple and the time we reach
the descendant heap-only tuple.  It is conceivable that someone prunes
the heap-only tuple before that, and even conceivable that the line pointer
is re-used for another purpose.  Therefore, when following a HOT chain,
it is always necessary to be prepared for the possibility that the
linked-to line pointer is unused, dead, or redirected; and if it is a
normal line pointer, we still have to check that XMIN of the tuple matches
the XMAX of the tuple we left.  Otherwise we should assume that we have
come to the end of the HOT chain.  Note that this sort of XMIN/XMAX
matching is required when following ordinary update chains anyway.

(Early versions of the HOT code assumed that holding pin on the page
buffer while following a HOT link would prevent this type of problem,
but checking XMIN/XMAX matching is a much more robust solution.)


Index/Sequential Scans
----------------------

When doing an index scan, whenever we reach a HEAP_HOT_UPDATED tuple whose
xmax is not aborted, we need to follow its t_ctid link and check that
entry as well; possibly repeatedly until we reach the end of the HOT
chain.  (When using an MVCC snapshot it is possible to optimize this a
bit: there can be at most one visible tuple in the chain, so we can stop
when we find it.  This rule does not work for non-MVCC snapshots, though.)

Sequential scans do not need to pay attention to the HOT links because
they scan every line pointer on the page anyway.  The same goes for a
bitmap heap scan with a lossy bitmap.


Pruning
-------

HOT pruning means updating line pointers so that HOT chains are
reduced in length, by collapsing out line pointers for intermediate dead
tuples.  Although this makes those line pointers available for re-use,
it does not immediately make the space occupied by their tuples available.


Defragmentation
---------------

Defragmentation centralizes unused space.  After we have converted root
line pointers to redirected line pointers and pruned away any dead
intermediate line pointers, the tuples they linked to are free space.
But unless that space is adjacent to the central "hole" on the page
(the pd_lower-to-pd_upper area) it cannot be used by tuple insertion.
Defragmentation moves the surviving tuples to coalesce all the free
space into one "hole".  This is done with the same PageRepairFragmentation
function that regular VACUUM uses.


When can/should we prune or defragment?
---------------------------------------

This is the most interesting question in HOT implementation, since there
is no simple right answer: we must use heuristics to determine when it's
most efficient to perform pruning and/or defragmenting.

We cannot prune or defragment unless we can get a "buffer cleanup lock"
on the target page; otherwise, pruning might destroy line pointers that
other backends have live references to, and defragmenting might move
tuples that other backends have live pointers to.  Thus the general
approach must be to heuristically decide if we should try to prune
or defragment, and if so try to acquire the buffer cleanup lock without
blocking.  If we succeed we can proceed with our housekeeping work.
If we cannot get the lock (which should not happen often, except under
very heavy contention) then the housekeeping has to be postponed till
some other time.  The worst-case consequence of this is only that an
UPDATE cannot be made HOT but has to link to a new tuple version placed on
some other page, for lack of centralized space on the original page.

Ideally we would do defragmenting only when we are about to attempt
heap_update on a HOT-safe tuple.  The difficulty with this approach
is that the update query has certainly got a pin on the old tuple, and
therefore our attempt to acquire a buffer cleanup lock will always fail.
(This corresponds to the idea that we don't want to move the old tuple
out from under where the query's HeapTuple pointer points.  It might
be possible to finesse that, but it seems fragile.)

Pruning, however, is potentially useful even when we are not about to
insert a new tuple, since shortening a HOT chain reduces the cost of
subsequent index searches.  However it is unclear that this gain is
large enough to accept any extra maintenance burden for.

The currently planned heuristic is to prune and defrag when first accessing
a page that potentially has prunable tuples (as flagged by the pd_prune_xid
page hint field) and that either has free space less than MAX(fillfactor
target free space, BLCKSZ/10) *or* has recently had an UPDATE fail to
find enough free space to store an updated tuple version.  (These rules
are subject to change.)

We have effectively implemented the "truncate dead tuples to just line
pointer" idea that has been proposed and rejected before because of fear
of line pointer bloat: we might end up with huge numbers of line pointers
and just a few actual tuples on a page.  To limit the damage in the worst
case, and to keep various work arrays as well as the bitmaps in bitmap
scans reasonably sized, the maximum number of line pointers per page
is arbitrarily capped at MaxHeapTuplesPerPage (the most tuples that
could fit without HOT pruning).

Effectively, space reclamation happens during tuple retrieval when the
page is nearly full (<10% free) and a buffer cleanup lock can be
acquired.  This means that UPDATE, DELETE, and SELECT can trigger space
reclamation, but often not during INSERT ... VALUES because it does
not retrieve a row.


VACUUM
------

There is little change to regular vacuum.  It performs pruning to remove
dead heap-only tuples, and cleans up any dead line pointers as if they were
regular dead tuples.


Statistics
----------

Currently, we count HOT updates the same as cold updates for statistics
purposes, though there is an additional per-table counter that counts
only HOT updates.  When a page pruning operation is able to remove a
physical tuple by eliminating an intermediate heap-only tuple or
replacing a physical root tuple by a redirect pointer, a decrement in
the table's number of dead tuples is reported to pgstats, which may
postpone autovacuuming.  Note that we do not count replacing a root tuple
by a DEAD line pointer as decrementing n_dead_tuples; we still want
autovacuum to run to clean up the index entries and DEAD item.

This area probably needs further work ...


CREATE INDEX
------------

CREATE INDEX presents a problem for HOT updates.  While the existing HOT
chains all have the same index values for existing indexes, the columns
in the new index might change within a pre-existing HOT chain, creating
a "broken" chain that can't be indexed properly.

To address this issue, regular (non-concurrent) CREATE INDEX makes the
new index usable only by new transactions and transactions that don't
have snapshots older than the CREATE INDEX command.  This prevents
queries that can see the inconsistent HOT chains from trying to use the
new index and getting incorrect results.  Queries that can see the index
can only see the rows that were visible after the index was created,
hence the HOT chains are consistent for them.

Entries in the new index point to root tuples (tuples with current index
pointers) so that our index uses the same index pointers as all other
indexes on the table.  However the row we want to index is actually at
the *end* of the chain, ie, the most recent live tuple on the HOT chain.
That is the one we compute the index entry values for, but the TID
we put into the index is that of the root tuple.  Since queries that
will be allowed to use the new index cannot see any of the older tuple
versions in the chain, the fact that they might not match the index entry
isn't a problem.  (Such queries will check the tuple visibility
information of the older versions and ignore them, without ever looking at
their contents, so the content inconsistency is OK.)  Subsequent updates
to the live tuple will be allowed to extend the HOT chain only if they are
HOT-safe for all the indexes.

Because we have ShareLock on the table, any DELETE_IN_PROGRESS or
INSERT_IN_PROGRESS tuples should have come from our own transaction.
Therefore we can consider them committed since if the CREATE INDEX
commits, they will be committed, and if it aborts the index is discarded.
An exception to this is that early lock release is customary for system
catalog updates, and so we might find such tuples when reindexing a system
catalog.  In that case we deal with it by waiting for the source
transaction to commit or roll back.  (We could do that for user tables
too, but since the case is unexpected we prefer to throw an error.)

Practically, we prevent certain transactions from using the new index by
setting pg_index.indcheckxmin to TRUE.  Transactions are allowed to use
such an index only after pg_index.xmin is below their TransactionXmin
horizon, thereby ensuring that any incompatible rows in HOT chains are
dead to them. (pg_index.xmin will be the XID of the CREATE INDEX
transaction.  The reason for using xmin rather than a normal column is
that the regular vacuum freezing mechanism will take care of converting
xmin to FrozenTransactionId before it can wrap around.)

This means in particular that the transaction creating the index will be
unable to use the index if the transaction has old snapshots.  We
alleviate that problem somewhat by not setting indcheckxmin unless the
table actually contains HOT chains with RECENTLY_DEAD members.

Another unpleasant consequence is that it is now risky to use SnapshotAny
in an index scan: if the index was created more recently than the last
vacuum, it's possible that some of the visited tuples do not match the
index entry they are linked to.  This does not seem to be a fatal
objection, since there are few users of SnapshotAny and most use seqscans.
The only exception at this writing is CLUSTER, which is okay because it
does not require perfect ordering of the indexscan readout (and especially
so because CLUSTER tends to write recently-dead tuples out of order anyway).


CREATE INDEX CONCURRENTLY
-------------------------

In the concurrent case we must take a different approach.  We create the
pg_index entry immediately, before we scan the table.  The pg_index entry
is marked as "not ready for inserts".  Then we commit and wait for any
transactions which have the table open to finish.  This ensures that no
new HOT updates will change the key value for our new index, because all
transactions will see the existence of the index and will respect its
constraint on which updates can be HOT.  Other transactions must include
such an index when determining HOT-safety of updates, even though they
must ignore it for both insertion and searching purposes.

We must do this to avoid making incorrect index entries.  For example,
suppose we are building an index on column X and we make an index entry for
a non-HOT tuple with X=1.  Then some other backend, unaware that X is an
indexed column, HOT-updates the row to have X=2, and commits.  We now have
an index entry for X=1 pointing at a HOT chain whose live row has X=2.
We could make an index entry with X=2 during the validation pass, but
there is no nice way to get rid of the wrong entry with X=1.  So we must
have the HOT-safety property enforced before we start to build the new
index.

After waiting for transactions which had the table open, we build the index
for all rows that are valid in a fresh snapshot.  Any tuples visible in the
snapshot will have only valid forward-growing HOT chains.  (They might have
older HOT updates behind them which are broken, but this is OK for the same
reason it's OK in a regular index build.)  As above, we point the index
entry at the root of the HOT-update chain but we use the key value from the
live tuple.

We mark the index open for inserts (but still not ready for reads) then
we again wait for transactions which have the table open.  Then we take
a second reference snapshot and validate the index.  This searches for
tuples missing from the index, and inserts any missing ones.  Again,
the index entries have to have TIDs equal to HOT-chain root TIDs, but
the value to be inserted is the one from the live tuple.

Then we wait until every transaction that could have a snapshot older than
the second reference snapshot is finished.  This ensures that nobody is
alive any longer who could need to see any tuples that might be missing
from the index, as well as ensuring that no one can see any inconsistent
rows in a broken HOT chain (the first condition is stronger than the
second).  Finally, we can mark the index valid for searches.

Note that we do not need to set pg_index.indcheckxmin in this code path,
because we have outwaited any transactions that would need to avoid using
the index.  (indcheckxmin is only needed because non-concurrent CREATE
INDEX doesn't want to wait; its stronger lock would create too much risk of
deadlock if it did.)


DROP INDEX CONCURRENTLY
-----------------------

DROP INDEX CONCURRENTLY is sort of the reverse sequence of CREATE INDEX
CONCURRENTLY.  We first mark the index as not indisvalid, and then wait for
any transactions that could be using it in queries to end.  (During this
time, index updates must still be performed as normal, since such
transactions might expect freshly inserted tuples to be findable.)
Then, we clear indisready and indislive, and again wait for transactions
that could be updating the index to end.  Finally we can drop the index
normally (though taking only ShareUpdateExclusiveLock on its parent table).

The reason we need the pg_index.indislive flag is that after the second
wait step begins, we don't want transactions to be touching the index at
all; otherwise they might suffer errors if the DROP finally commits while
they are reading catalog entries for the index.  If we had only indisvalid
and indisready, this state would be indistinguishable from the first stage
of CREATE INDEX CONCURRENTLY --- but in that state, we *do* want
transactions to examine the index, since they must consider it in
HOT-safety checks.


Limitations and Restrictions
----------------------------

It is worth noting that HOT forever forecloses alternative approaches
to vacuuming, specifically the recompute-the-index-keys approach alluded
to in Technical Challenges above.  It'll be tough to recompute the index
keys for a root line pointer you don't have data for anymore ...


Glossary
--------

Broken HOT Chain

	A HOT chain in which the key value for an index has changed.

	This is not allowed to occur normally but if a new index is created
	it can happen.  In that case various strategies are used to ensure
	that no transaction for which the older tuples are visible can
	use the index.

Cold update

	A normal, non-HOT update, in which index entries are made for
	the new version of the tuple.

Dead line pointer

	A stub line pointer, that does not point to anything, but cannot
	be removed or reused yet because there are index pointers to it.
	Semantically same as a dead tuple.  It has state LP_DEAD.

Heap-only tuple

	A heap tuple with no index pointers, which can only be reached
	from indexes indirectly through its ancestral root tuple.
	Marked with HEAP_ONLY_TUPLE flag.

HOT-safe

	A proposed tuple update is said to be HOT-safe if it changes
	none of the tuple's indexed columns.  It will only become an
	actual HOT update if we can find room on the same page for
	the new tuple version.

HOT update

	An UPDATE where the new tuple becomes a heap-only tuple, and no
	new index entries are made.

HOT-updated tuple

	An updated tuple, for which the next tuple in the chain is a
	heap-only tuple.  Marked with HEAP_HOT_UPDATED flag.

Indexed column

	A column used in an index definition.  The column might not
	actually be stored in the index --- it could be used in a
	functional index's expression, or used in a partial index
	predicate.  HOT treats all these cases alike.

Redirecting line pointer

	A line pointer that points to another line pointer and has no
	associated tuple.  It has the special lp_flags state LP_REDIRECT,
	and lp_off is the OffsetNumber of the line pointer it links to.
	This is used when a root tuple becomes dead but we cannot prune
	the line pointer because there are non-dead heap-only tuples
	further down the chain.

Root tuple

	The first tuple in a HOT update chain; the one that indexes point to.

Update chain

	A chain of updated tuples, in which each tuple's ctid points to
	the next tuple in the chain. A HOT update chain is an update chain
	(or portion of an update chain) that consists of a root tuple and
	one or more heap-only tuples.  A complete update chain can contain
	both HOT and non-HOT (cold) updated tuples.