1src/backend/access/heap/README.HOT 2 3Heap Only Tuples (HOT) 4====================== 5 6The Heap Only Tuple (HOT) feature eliminates redundant index entries and 7allows the re-use of space taken by DELETEd or obsoleted UPDATEd tuples 8without performing a table-wide vacuum. It does this by allowing 9single-page vacuuming, also called "defragmentation". 10 11Note: there is a Glossary at the end of this document that may be helpful 12for first-time readers. 13 14 15Technical Challenges 16-------------------- 17 18Page-at-a-time vacuuming is normally impractical because of the costs of 19finding and removing the index entries that link to the tuples to be 20reclaimed. Standard vacuuming scans the indexes to ensure all such index 21entries are removed, amortizing the index scan cost across as many dead 22tuples as possible; this approach does not scale down well to the case of 23reclaiming just a few tuples. In principle one could recompute the index 24keys and do standard index searches to find the index entries, but this is 25risky in the presence of possibly-buggy user-defined functions in 26functional indexes. An allegedly immutable function that in fact is not 27immutable might prevent us from re-finding an index entry (and we cannot 28throw an error for not finding it, in view of the fact that dead index 29entries are sometimes reclaimed early). That would lead to a seriously 30corrupt index, in the form of entries pointing to tuple slots that by now 31contain some unrelated content. In any case we would prefer to be able 32to do vacuuming without invoking any user-written code. 33 34HOT solves this problem for a restricted but useful special case: 35where a tuple is repeatedly updated in ways that do not change its 36indexed columns. (Here, "indexed column" means any column referenced 37at all in an index definition, including for example columns that are 38tested in a partial-index predicate but are not stored in the index.) 39 40An additional property of HOT is that it reduces index size by avoiding 41the creation of identically-keyed index entries. This improves search 42speeds. 43 44 45Update Chains With a Single Index Entry 46--------------------------------------- 47 48Without HOT, every version of a row in an update chain has its own index 49entries, even if all indexed columns are the same. With HOT, a new tuple 50placed on the same page and with all indexed columns the same as its 51parent row version does not get new index entries. This means there is 52only one index entry for the entire update chain on the heap page. 53An index-entry-less tuple is marked with the HEAP_ONLY_TUPLE flag. 54The prior row version is marked HEAP_HOT_UPDATED, and (as always in an 55update chain) its t_ctid field links forward to the newer version. 56 57For example: 58 59 Index points to 1 60 lp [1] [2] 61 62 [111111111]->[2222222222] 63 64In the above diagram, the index points to line pointer 1, and tuple 1 is 65marked as HEAP_HOT_UPDATED. Tuple 2 is a HOT tuple, meaning it has 66no index entry pointing to it, and is marked as HEAP_ONLY_TUPLE. 67Although tuple 2 is not directly referenced by the index, it can still be 68found by an index search: after traversing from the index to tuple 1, 69the index search proceeds forward to child tuples as long as it sees the 70HEAP_HOT_UPDATED flag set. Since we restrict the HOT chain to lie within 71a single page, this requires no additional page fetches and doesn't 72introduce much performance penalty. 73 74Eventually, tuple 1 will no longer be visible to any transaction. 75At that point its space could be reclaimed, but its line pointer cannot, 76since the index still links to that line pointer and we still need to 77be able to find tuple 2 in an index search. HOT handles this by turning 78line pointer 1 into a "redirecting line pointer", which links to tuple 2 79but has no actual tuple attached. This state of affairs looks like 80 81 Index points to 1 82 lp [1]->[2] 83 84 [2222222222] 85 86If now the row is updated again, to version 3, the page looks like this: 87 88 Index points to 1 89 lp [1]->[2] [3] 90 91 [2222222222]->[3333333333] 92 93At some later time when no transaction can see tuple 2 in its snapshot, 94tuple 2 and its line pointer can be pruned entirely: 95 96 Index points to 1 97 lp [1]------>[3] 98 99 [3333333333] 100 101This is safe because no index entry points to line pointer 2. Subsequent 102insertions into the page can now recycle both line pointer 2 and the 103space formerly used by tuple 2. 104 105If an update changes any indexed column, or there is not room on the 106same page for the new tuple, then the HOT chain ends: the last member 107has a regular t_ctid link to the next version and is not marked 108HEAP_HOT_UPDATED. (In principle we could continue a HOT chain across 109pages, but this would destroy the desired property of being able to 110reclaim space with just page-local manipulations. Anyway, we don't 111want to have to chase through multiple heap pages to get from an index 112entry to the desired tuple, so it seems better to create a new index 113entry for the new tuple.) If further updates occur, the next version 114could become the root of a new HOT chain. 115 116Line pointer 1 has to remain as long as there is any non-dead member of 117the chain on the page. When there is not, it is marked "dead". 118This lets us reclaim the last child line pointer and associated tuple 119immediately. The next regular VACUUM pass can reclaim the index entries 120pointing at the line pointer and then the line pointer itself. Since a 121line pointer is small compared to a tuple, this does not represent an 122undue space cost. 123 124Note: we can use a "dead" line pointer for any DELETEd tuple, 125whether it was part of a HOT chain or not. This allows space reclamation 126in advance of running VACUUM for plain DELETEs as well as HOT updates. 127 128The requirement for doing a HOT update is that none of the indexed 129columns are changed. This is checked at execution time by comparing the 130binary representation of the old and new values. We insist on bitwise 131equality rather than using datatype-specific equality routines. The 132main reason to avoid the latter is that there might be multiple notions 133of equality for a datatype, and we don't know exactly which one is 134relevant for the indexes at hand. We assume that bitwise equality 135guarantees equality for all purposes. 136 137 138Abort Cases 139----------- 140 141If a heap-only tuple's xmin is aborted, then it can be removed immediately: 142it was never visible to any other transaction, and all descendant row 143versions must be aborted as well. Therefore we need not consider it part 144of a HOT chain. By the same token, if a HOT-updated tuple's xmax is 145aborted, there is no need to follow the chain link. However, there is a 146race condition here: the transaction that did the HOT update might abort 147between the time we inspect the HOT-updated tuple and the time we reach 148the descendant heap-only tuple. It is conceivable that someone prunes 149the heap-only tuple before that, and even conceivable that the line pointer 150is re-used for another purpose. Therefore, when following a HOT chain, 151it is always necessary to be prepared for the possibility that the 152linked-to line pointer is unused, dead, or redirected; and if it is a 153normal line pointer, we still have to check that XMIN of the tuple matches 154the XMAX of the tuple we left. Otherwise we should assume that we have 155come to the end of the HOT chain. Note that this sort of XMIN/XMAX 156matching is required when following ordinary update chains anyway. 157 158(Early versions of the HOT code assumed that holding pin on the page 159buffer while following a HOT link would prevent this type of problem, 160but checking XMIN/XMAX matching is a much more robust solution.) 161 162 163Index/Sequential Scans 164---------------------- 165 166When doing an index scan, whenever we reach a HEAP_HOT_UPDATED tuple whose 167xmax is not aborted, we need to follow its t_ctid link and check that 168entry as well; possibly repeatedly until we reach the end of the HOT 169chain. (When using an MVCC snapshot it is possible to optimize this a 170bit: there can be at most one visible tuple in the chain, so we can stop 171when we find it. This rule does not work for non-MVCC snapshots, though.) 172 173Sequential scans do not need to pay attention to the HOT links because 174they scan every line pointer on the page anyway. The same goes for a 175bitmap heap scan with a lossy bitmap. 176 177 178Pruning 179------- 180 181HOT pruning means updating line pointers so that HOT chains are 182reduced in length, by collapsing out line pointers for intermediate dead 183tuples. Although this makes those line pointers available for re-use, 184it does not immediately make the space occupied by their tuples available. 185 186 187Defragmentation 188--------------- 189 190Defragmentation centralizes unused space. After we have converted root 191line pointers to redirected line pointers and pruned away any dead 192intermediate line pointers, the tuples they linked to are free space. 193But unless that space is adjacent to the central "hole" on the page 194(the pd_lower-to-pd_upper area) it cannot be used by tuple insertion. 195Defragmentation moves the surviving tuples to coalesce all the free 196space into one "hole". This is done with the same PageRepairFragmentation 197function that regular VACUUM uses. 198 199 200When can/should we prune or defragment? 201--------------------------------------- 202 203This is the most interesting question in HOT implementation, since there 204is no simple right answer: we must use heuristics to determine when it's 205most efficient to perform pruning and/or defragmenting. 206 207We cannot prune or defragment unless we can get a "buffer cleanup lock" 208on the target page; otherwise, pruning might destroy line pointers that 209other backends have live references to, and defragmenting might move 210tuples that other backends have live pointers to. Thus the general 211approach must be to heuristically decide if we should try to prune 212or defragment, and if so try to acquire the buffer cleanup lock without 213blocking. If we succeed we can proceed with our housekeeping work. 214If we cannot get the lock (which should not happen often, except under 215very heavy contention) then the housekeeping has to be postponed till 216some other time. The worst-case consequence of this is only that an 217UPDATE cannot be made HOT but has to link to a new tuple version placed on 218some other page, for lack of centralized space on the original page. 219 220Ideally we would do defragmenting only when we are about to attempt 221heap_update on a HOT-safe tuple. The difficulty with this approach 222is that the update query has certainly got a pin on the old tuple, and 223therefore our attempt to acquire a buffer cleanup lock will always fail. 224(This corresponds to the idea that we don't want to move the old tuple 225out from under where the query's HeapTuple pointer points. It might 226be possible to finesse that, but it seems fragile.) 227 228Pruning, however, is potentially useful even when we are not about to 229insert a new tuple, since shortening a HOT chain reduces the cost of 230subsequent index searches. However it is unclear that this gain is 231large enough to accept any extra maintenance burden for. 232 233The currently planned heuristic is to prune and defrag when first accessing 234a page that potentially has prunable tuples (as flagged by the pd_prune_xid 235page hint field) and that either has free space less than MAX(fillfactor 236target free space, BLCKSZ/10) *or* has recently had an UPDATE fail to 237find enough free space to store an updated tuple version. (These rules 238are subject to change.) 239 240We have effectively implemented the "truncate dead tuples to just line 241pointer" idea that has been proposed and rejected before because of fear 242of line pointer bloat: we might end up with huge numbers of line pointers 243and just a few actual tuples on a page. To limit the damage in the worst 244case, and to keep various work arrays as well as the bitmaps in bitmap 245scans reasonably sized, the maximum number of line pointers per page 246is arbitrarily capped at MaxHeapTuplesPerPage (the most tuples that 247could fit without HOT pruning). 248 249Effectively, space reclamation happens during tuple retrieval when the 250page is nearly full (<10% free) and a buffer cleanup lock can be 251acquired. This means that UPDATE, DELETE, and SELECT can trigger space 252reclamation, but often not during INSERT ... VALUES because it does 253not retrieve a row. 254 255 256VACUUM 257------ 258 259There is little change to regular vacuum. It performs pruning to remove 260dead heap-only tuples, and cleans up any dead line pointers as if they were 261regular dead tuples. 262 263 264Statistics 265---------- 266 267Currently, we count HOT updates the same as cold updates for statistics 268purposes, though there is an additional per-table counter that counts 269only HOT updates. When a page pruning operation is able to remove a 270physical tuple by eliminating an intermediate heap-only tuple or 271replacing a physical root tuple by a redirect pointer, a decrement in 272the table's number of dead tuples is reported to pgstats, which may 273postpone autovacuuming. Note that we do not count replacing a root tuple 274by a DEAD line pointer as decrementing n_dead_tuples; we still want 275autovacuum to run to clean up the index entries and DEAD item. 276 277This area probably needs further work ... 278 279 280CREATE INDEX 281------------ 282 283CREATE INDEX presents a problem for HOT updates. While the existing HOT 284chains all have the same index values for existing indexes, the columns 285in the new index might change within a pre-existing HOT chain, creating 286a "broken" chain that can't be indexed properly. 287 288To address this issue, regular (non-concurrent) CREATE INDEX makes the 289new index usable only by new transactions and transactions that don't 290have snapshots older than the CREATE INDEX command. This prevents 291queries that can see the inconsistent HOT chains from trying to use the 292new index and getting incorrect results. Queries that can see the index 293can only see the rows that were visible after the index was created, 294hence the HOT chains are consistent for them. 295 296Entries in the new index point to root tuples (tuples with current index 297pointers) so that our index uses the same index pointers as all other 298indexes on the table. However the row we want to index is actually at 299the *end* of the chain, ie, the most recent live tuple on the HOT chain. 300That is the one we compute the index entry values for, but the TID 301we put into the index is that of the root tuple. Since queries that 302will be allowed to use the new index cannot see any of the older tuple 303versions in the chain, the fact that they might not match the index entry 304isn't a problem. (Such queries will check the tuple visibility 305information of the older versions and ignore them, without ever looking at 306their contents, so the content inconsistency is OK.) Subsequent updates 307to the live tuple will be allowed to extend the HOT chain only if they are 308HOT-safe for all the indexes. 309 310Because we have ShareLock on the table, any DELETE_IN_PROGRESS or 311INSERT_IN_PROGRESS tuples should have come from our own transaction. 312Therefore we can consider them committed since if the CREATE INDEX 313commits, they will be committed, and if it aborts the index is discarded. 314An exception to this is that early lock release is customary for system 315catalog updates, and so we might find such tuples when reindexing a system 316catalog. In that case we deal with it by waiting for the source 317transaction to commit or roll back. (We could do that for user tables 318too, but since the case is unexpected we prefer to throw an error.) 319 320Practically, we prevent certain transactions from using the new index by 321setting pg_index.indcheckxmin to TRUE. Transactions are allowed to use 322such an index only after pg_index.xmin is below their TransactionXmin 323horizon, thereby ensuring that any incompatible rows in HOT chains are 324dead to them. (pg_index.xmin will be the XID of the CREATE INDEX 325transaction. The reason for using xmin rather than a normal column is 326that the regular vacuum freezing mechanism will take care of converting 327xmin to FrozenTransactionId before it can wrap around.) 328 329This means in particular that the transaction creating the index will be 330unable to use the index if the transaction has old snapshots. We 331alleviate that problem somewhat by not setting indcheckxmin unless the 332table actually contains HOT chains with RECENTLY_DEAD members. 333 334Another unpleasant consequence is that it is now risky to use SnapshotAny 335in an index scan: if the index was created more recently than the last 336vacuum, it's possible that some of the visited tuples do not match the 337index entry they are linked to. This does not seem to be a fatal 338objection, since there are few users of SnapshotAny and most use seqscans. 339The only exception at this writing is CLUSTER, which is okay because it 340does not require perfect ordering of the indexscan readout (and especially 341so because CLUSTER tends to write recently-dead tuples out of order anyway). 342 343 344CREATE INDEX CONCURRENTLY 345------------------------- 346 347In the concurrent case we must take a different approach. We create the 348pg_index entry immediately, before we scan the table. The pg_index entry 349is marked as "not ready for inserts". Then we commit and wait for any 350transactions which have the table open to finish. This ensures that no 351new HOT updates will change the key value for our new index, because all 352transactions will see the existence of the index and will respect its 353constraint on which updates can be HOT. Other transactions must include 354such an index when determining HOT-safety of updates, even though they 355must ignore it for both insertion and searching purposes. 356 357We must do this to avoid making incorrect index entries. For example, 358suppose we are building an index on column X and we make an index entry for 359a non-HOT tuple with X=1. Then some other backend, unaware that X is an 360indexed column, HOT-updates the row to have X=2, and commits. We now have 361an index entry for X=1 pointing at a HOT chain whose live row has X=2. 362We could make an index entry with X=2 during the validation pass, but 363there is no nice way to get rid of the wrong entry with X=1. So we must 364have the HOT-safety property enforced before we start to build the new 365index. 366 367After waiting for transactions which had the table open, we build the index 368for all rows that are valid in a fresh snapshot. Any tuples visible in the 369snapshot will have only valid forward-growing HOT chains. (They might have 370older HOT updates behind them which are broken, but this is OK for the same 371reason it's OK in a regular index build.) As above, we point the index 372entry at the root of the HOT-update chain but we use the key value from the 373live tuple. 374 375We mark the index open for inserts (but still not ready for reads) then 376we again wait for transactions which have the table open. Then we take 377a second reference snapshot and validate the index. This searches for 378tuples missing from the index, and inserts any missing ones. Again, 379the index entries have to have TIDs equal to HOT-chain root TIDs, but 380the value to be inserted is the one from the live tuple. 381 382Then we wait until every transaction that could have a snapshot older than 383the second reference snapshot is finished. This ensures that nobody is 384alive any longer who could need to see any tuples that might be missing 385from the index, as well as ensuring that no one can see any inconsistent 386rows in a broken HOT chain (the first condition is stronger than the 387second). Finally, we can mark the index valid for searches. 388 389Note that we do not need to set pg_index.indcheckxmin in this code path, 390because we have outwaited any transactions that would need to avoid using 391the index. (indcheckxmin is only needed because non-concurrent CREATE 392INDEX doesn't want to wait; its stronger lock would create too much risk of 393deadlock if it did.) 394 395 396DROP INDEX CONCURRENTLY 397----------------------- 398 399DROP INDEX CONCURRENTLY is sort of the reverse sequence of CREATE INDEX 400CONCURRENTLY. We first mark the index as not indisvalid, and then wait for 401any transactions that could be using it in queries to end. (During this 402time, index updates must still be performed as normal, since such 403transactions might expect freshly inserted tuples to be findable.) 404Then, we clear indisready and indislive, and again wait for transactions 405that could be updating the index to end. Finally we can drop the index 406normally (though taking only ShareUpdateExclusiveLock on its parent table). 407 408The reason we need the pg_index.indislive flag is that after the second 409wait step begins, we don't want transactions to be touching the index at 410all; otherwise they might suffer errors if the DROP finally commits while 411they are reading catalog entries for the index. If we had only indisvalid 412and indisready, this state would be indistinguishable from the first stage 413of CREATE INDEX CONCURRENTLY --- but in that state, we *do* want 414transactions to examine the index, since they must consider it in 415HOT-safety checks. 416 417 418Limitations and Restrictions 419---------------------------- 420 421It is worth noting that HOT forever forecloses alternative approaches 422to vacuuming, specifically the recompute-the-index-keys approach alluded 423to in Technical Challenges above. It'll be tough to recompute the index 424keys for a root line pointer you don't have data for anymore ... 425 426 427Glossary 428-------- 429 430Broken HOT Chain 431 432 A HOT chain in which the key value for an index has changed. 433 434 This is not allowed to occur normally but if a new index is created 435 it can happen. In that case various strategies are used to ensure 436 that no transaction for which the older tuples are visible can 437 use the index. 438 439Cold update 440 441 A normal, non-HOT update, in which index entries are made for 442 the new version of the tuple. 443 444Dead line pointer 445 446 A stub line pointer, that does not point to anything, but cannot 447 be removed or reused yet because there are index pointers to it. 448 Semantically same as a dead tuple. It has state LP_DEAD. 449 450Heap-only tuple 451 452 A heap tuple with no index pointers, which can only be reached 453 from indexes indirectly through its ancestral root tuple. 454 Marked with HEAP_ONLY_TUPLE flag. 455 456HOT-safe 457 458 A proposed tuple update is said to be HOT-safe if it changes 459 none of the tuple's indexed columns. It will only become an 460 actual HOT update if we can find room on the same page for 461 the new tuple version. 462 463HOT update 464 465 An UPDATE where the new tuple becomes a heap-only tuple, and no 466 new index entries are made. 467 468HOT-updated tuple 469 470 An updated tuple, for which the next tuple in the chain is a 471 heap-only tuple. Marked with HEAP_HOT_UPDATED flag. 472 473Indexed column 474 475 A column used in an index definition. The column might not 476 actually be stored in the index --- it could be used in a 477 functional index's expression, or used in a partial index 478 predicate. HOT treats all these cases alike. 479 480Redirecting line pointer 481 482 A line pointer that points to another line pointer and has no 483 associated tuple. It has the special lp_flags state LP_REDIRECT, 484 and lp_off is the OffsetNumber of the line pointer it links to. 485 This is used when a root tuple becomes dead but we cannot prune 486 the line pointer because there are non-dead heap-only tuples 487 further down the chain. 488 489Root tuple 490 491 The first tuple in a HOT update chain; the one that indexes point to. 492 493Update chain 494 495 A chain of updated tuples, in which each tuple's ctid points to 496 the next tuple in the chain. A HOT update chain is an update chain 497 (or portion of an update chain) that consists of a root tuple and 498 one or more heap-only tuples. A complete update chain can contain 499 both HOT and non-HOT (cold) updated tuples. 500