README.HOT

src/backend/access/heap/README.HOT

Heap Only Tuples (HOT)
======================

The Heap Only Tuple (HOT) feature eliminates redundant index entries and
allows the re-use of space taken by DELETEd or obsoleted UPDATEd tuples
without performing a table-wide vacuum.  It does this by allowing
single-page vacuuming, also called "defragmentation".

Note: there is a Glossary at the end of this document that may be helpful
for first-time readers.


Technical Challenges
--------------------

Page-at-a-time vacuuming is normally impractical because of the costs of
finding and removing the index entries that link to the tuples to be
reclaimed.  Standard vacuuming scans the indexes to ensure all such index
entries are removed, amortizing the index scan cost across as many dead
tuples as possible; this approach does not scale down well to the case of
reclaiming just a few tuples.  In principle one could recompute the index
keys and do standard index searches to find the index entries, but this is
risky in the presence of possibly-buggy user-defined functions in
functional indexes.  An allegedly immutable function that in fact is not
immutable might prevent us from re-finding an index entry (and we cannot
throw an error for not finding it, in view of the fact that dead index
entries are sometimes reclaimed early).  That would lead to a seriously
corrupt index, in the form of entries pointing to tuple slots that by now
contain some unrelated content.  In any case we would prefer to be able
to do vacuuming without invoking any user-written code.

HOT solves this problem for a restricted but useful special case:
where a tuple is repeatedly updated in ways that do not change its
indexed columns.  (Here, "indexed column" means any column referenced
at all in an index definition, including for example columns that are
tested in a partial-index predicate but are not stored in the index.)

An additional property of HOT is that it reduces index size by avoiding
the creation of identically-keyed index entries.  This improves search
speeds.


Update Chains With a Single Index Entry
---------------------------------------

Without HOT, every version of a row in an update chain has its own index
entries, even if all indexed columns are the same.  With HOT, a new tuple
placed on the same page and with all indexed columns the same as its
parent row version does not get new index entries.  This means there is
only one index entry for the entire update chain on the heap page.
An index-entry-less tuple is marked with the HEAP_ONLY_TUPLE flag.
The prior row version is marked HEAP_HOT_UPDATED, and (as always in an
update chain) its t_ctid field links forward to the newer version.

For example:

	Index points to 1
	lp [1]  [2]

	[111111111]->[2222222222]

In the above diagram, the index points to line pointer 1, and tuple 1 is
marked as HEAP_HOT_UPDATED.  Tuple 2 is a HOT tuple, meaning it has
no index entry pointing to it, and is marked as HEAP_ONLY_TUPLE.
Although tuple 2 is not directly referenced by the index, it can still be
found by an index search: after traversing from the index to tuple 1,
the index search proceeds forward to child tuples as long as it sees the
HEAP_HOT_UPDATED flag set.  Since we restrict the HOT chain to lie within
a single page, this requires no additional page fetches and doesn't
introduce much performance penalty.

Eventually, tuple 1 will no longer be visible to any transaction.
At that point its space could be reclaimed, but its line pointer cannot,
since the index still links to that line pointer and we still need to
be able to find tuple 2 in an index search.  HOT handles this by turning
line pointer 1 into a "redirecting line pointer", which links to tuple 2
but has no actual tuple attached.  This state of affairs looks like

	Index points to 1
	lp [1]->[2]

	[2222222222]

If now the row is updated again, to version 3, the page looks like this:

	Index points to 1
	lp [1]->[2]  [3]

	[2222222222]->[3333333333]

At some later time when no transaction can see tuple 2 in its snapshot,
tuple 2 and its line pointer can be pruned entirely:

	Index points to 1
	lp [1]------>[3]

	[3333333333]

This is safe because no index entry points to line pointer 2.  Subsequent
insertions into the page can now recycle both line pointer 2 and the
space formerly used by tuple 2.

If an update changes any indexed column, or there is not room on the
same page for the new tuple, then the HOT chain ends: the last member
has a regular t_ctid link to the next version and is not marked
HEAP_HOT_UPDATED.  (In principle we could continue a HOT chain across
pages, but this would destroy the desired property of being able to
reclaim space with just page-local manipulations.  Anyway, we don't
want to have to chase through multiple heap pages to get from an index
entry to the desired tuple, so it seems better to create a new index
entry for the new tuple.)  If further updates occur, the next version
could become the root of a new HOT chain.

Line pointer 1 has to remain as long as there is any non-dead member of
the chain on the page.  When there is not, it is marked "dead".
This lets us reclaim the last child line pointer and associated tuple
immediately.  The next regular VACUUM pass can reclaim the index entries
pointing at the line pointer and then the line pointer itself.  Since a
line pointer is small compared to a tuple, this does not represent an
undue space cost.

Note: we can use a "dead" line pointer for any DELETEd tuple,
whether it was part of a HOT chain or not.  This allows space reclamation
in advance of running VACUUM for plain DELETEs as well as HOT updates.

The requirement for doing a HOT update is that none of the indexed
columns are changed.  This is checked at execution time by comparing the
binary representation of the old and new values.  We insist on bitwise
equality rather than using datatype-specific equality routines.  The
main reason to avoid the latter is that there might be multiple notions
of equality for a datatype, and we don't know exactly which one is
relevant for the indexes at hand.  We assume that bitwise equality
guarantees equality for all purposes.

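The bitwise-equality rule above can be sketched as a small C function.
This is an illustrative toy, not the actual executor code: the
IndexedValue struct and update_is_hot_safe() are invented names standing
in for a flattened view of the old and new values of each indexed column.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flattened view of one indexed column's value. */
typedef struct IndexedValue
{
	bool		isnull;
	size_t		len;
	const void *data;
} IndexedValue;

/*
 * Return true if the proposed update is HOT-safe: every indexed column
 * must be bitwise identical between old and new versions.  As the text
 * explains, datatype-specific equality operators are deliberately not
 * consulted.
 */
static bool
update_is_hot_safe(const IndexedValue *oldvals,
				   const IndexedValue *newvals, int ncols)
{
	for (int i = 0; i < ncols; i++)
	{
		if (oldvals[i].isnull != newvals[i].isnull)
			return false;
		if (oldvals[i].isnull)
			continue;			/* two NULLs count as unchanged */
		if (oldvals[i].len != newvals[i].len ||
			memcmp(oldvals[i].data, newvals[i].data, oldvals[i].len) != 0)
			return false;
	}
	return true;
}
```

Any difference, even one that a datatype's own equality operator would
deem insignificant, disqualifies the update from being HOT.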

Abort Cases
-----------

If a heap-only tuple's xmin is aborted, then it can be removed immediately:
it was never visible to any other transaction, and all descendant row
versions must be aborted as well.  Therefore we need not consider it part
of a HOT chain.  By the same token, if a HOT-updated tuple's xmax is
aborted, there is no need to follow the chain link.  However, there is a
race condition here: the transaction that did the HOT update might abort
between the time we inspect the HOT-updated tuple and the time we reach
the descendant heap-only tuple.  It is conceivable that someone prunes
the heap-only tuple before that, and even conceivable that the line pointer
is re-used for another purpose.  Therefore, when following a HOT chain,
it is always necessary to be prepared for the possibility that the
linked-to item pointer is unused, dead, or redirected; and if it is a
normal item pointer, we still have to check that XMIN of the tuple matches
the XMAX of the tuple we left.  Otherwise we should assume that we have
come to the end of the HOT chain.  Note that this sort of XMIN/XMAX
matching is required when following ordinary update chains anyway.

(Early versions of the HOT code assumed that holding pin on the page
buffer while following a HOT link would prevent this type of problem,
but checking XMIN/XMAX matching is a much more robust solution.)

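The chain-following discipline described above can be modeled with a toy
page.  All names here (ToyItem, follow_hot_chain, etc.) are invented for
this sketch; the real logic lives in heapam.c and works on actual page
item arrays.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Toy line-pointer states mirroring LP_UNUSED/LP_DEAD/LP_REDIRECT/LP_NORMAL. */
typedef enum { TOY_UNUSED, TOY_DEAD, TOY_REDIRECT, TOY_NORMAL } ToyLpState;

typedef struct ToyTuple
{
	TransactionId xmin;
	TransactionId xmax;			/* 0 = no updater */
	bool		heap_hot_updated;	/* HEAP_HOT_UPDATED analogue */
	int			ctid_next;		/* offset of next version on same page */
} ToyTuple;

typedef struct ToyItem
{
	ToyLpState	state;
	int			redirect_to;	/* valid when state == TOY_REDIRECT */
	ToyTuple	tup;			/* valid when state == TOY_NORMAL */
} ToyItem;

/*
 * Walk the HOT chain starting at the slot an index entry points to and
 * return the offset of the last trustworthy member, or -1 if even the
 * start is gone.  Per the text, a forward link is trusted only if the
 * target slot holds a normal tuple whose XMIN equals the XMAX of the
 * tuple we are leaving; anything else means the chain ended (perhaps
 * because the updater aborted and the child was pruned).
 */
static int
follow_hot_chain(const ToyItem *items, int start)
{
	int			off = start;

	if (items[off].state == TOY_REDIRECT)
		off = items[off].redirect_to;
	if (items[off].state != TOY_NORMAL)
		return -1;
	while (items[off].tup.heap_hot_updated)
	{
		int			next = items[off].tup.ctid_next;

		if (items[next].state != TOY_NORMAL ||
			items[next].tup.xmin != items[off].tup.xmax)
			break;				/* treat as end of chain */
		off = next;
	}
	return off;
}
```

Note how the XMIN/XMAX match, not the HEAP_HOT_UPDATED flag alone, is
what makes the traversal safe against concurrent pruning and slot reuse.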

Index/Sequential Scans
----------------------

When doing an index scan, whenever we reach a HEAP_HOT_UPDATED tuple whose
xmax is not aborted, we need to follow its t_ctid link and check that
entry as well; possibly repeatedly until we reach the end of the HOT
chain.  (When using an MVCC snapshot it is possible to optimize this a
bit: there can be at most one visible tuple in the chain, so we can stop
when we find it.  This rule does not work for non-MVCC snapshots, though.)

Sequential scans do not need to pay attention to the HOT links because
they scan every item pointer on the page anyway.  The same goes for a
bitmap heap scan with a lossy bitmap.


Pruning
-------

HOT pruning means updating item pointers so that HOT chains are
reduced in length, by collapsing out line pointers for intermediate dead
tuples.  Although this makes those line pointers available for re-use,
it does not immediately make the space occupied by their tuples available.


Defragmentation
---------------

Defragmentation centralizes unused space.  After we have converted root
line pointers to redirected line pointers and pruned away any dead
intermediate line pointers, the tuples they linked to are free space.
But unless that space is adjacent to the central "hole" on the page
(the pd_lower-to-pd_upper area) it cannot be used by tuple insertion.
Defragmentation moves the surviving tuples to coalesce all the free
space into one "hole".  This is done with the same PageRepairFragmentation
function that regular VACUUM uses.

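The compaction step can be sketched on a toy page.  This is not the real
PageRepairFragmentation (which works on actual page headers and item
ids); ToySlot, toy_repair_fragmentation and TOY_PAGE_SIZE are invented
for illustration.  The key trick is the same, though: move surviving
tuples toward the end of the page, highest offset first, so the free
space collapses into one contiguous hole.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 256

typedef struct ToySlot
{
	size_t		off;			/* current offset of tuple data */
	size_t		len;			/* length of tuple data */
	int			live;			/* survives defragmentation? */
} ToySlot;

/* Sort slots by descending offset so each move's destination is at or
 * above its source, never clobbering data we have yet to move. */
static int
toy_slot_cmp_desc(const void *a, const void *b)
{
	const ToySlot *x = a, *y = b;

	return (x->off < y->off) - (x->off > y->off);
}

/*
 * Pack all live tuples against the end of the page and return the new
 * upper bound of the hole; everything below it (down to the toy page's
 * "pd_lower") is now one contiguous free area.
 */
static size_t
toy_repair_fragmentation(char *page, ToySlot *slots, int nslots)
{
	size_t		upper = TOY_PAGE_SIZE;

	qsort(slots, (size_t) nslots, sizeof(ToySlot), toy_slot_cmp_desc);
	for (int i = 0; i < nslots; i++)
	{
		if (!slots[i].live)
			continue;
		upper -= slots[i].len;
		memmove(page + upper, page + slots[i].off, slots[i].len);
		slots[i].off = upper;	/* slot now points at the moved data */
	}
	return upper;
}
```

Because tuples physically move, this is exactly the operation that
requires the buffer cleanup lock discussed below: no other backend may
hold a pointer into the page while it runs.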

When can/should we prune or defragment?
---------------------------------------

This is the most interesting question in HOT implementation, since there
is no simple right answer: we must use heuristics to determine when it's
most efficient to perform pruning and/or defragmenting.

We cannot prune or defragment unless we can get a "buffer cleanup lock"
on the target page; otherwise, pruning might destroy line pointers that
other backends have live references to, and defragmenting might move
tuples that other backends have live pointers to.  Thus the general
approach must be to heuristically decide if we should try to prune
or defragment, and if so try to acquire the buffer cleanup lock without
blocking.  If we succeed we can proceed with our housekeeping work.
If we cannot get the lock (which should not happen often, except under
very heavy contention) then the housekeeping has to be postponed till
some other time.  The worst-case consequence of this is only that an
UPDATE cannot be made HOT but has to link to a new tuple version placed on
some other page, for lack of centralized space on the original page.

Ideally we would do defragmenting only when we are about to attempt
heap_update on a HOT-safe tuple.  The difficulty with this approach
is that the update query has certainly got a pin on the old tuple, and
therefore our attempt to acquire a buffer cleanup lock will always fail.
(This corresponds to the idea that we don't want to move the old tuple
out from under where the query's HeapTuple pointer points.  It might
be possible to finesse that, but it seems fragile.)

Pruning, however, is potentially useful even when we are not about to
insert a new tuple, since shortening a HOT chain reduces the cost of
subsequent index searches.  However it is unclear that this gain is
large enough to accept any extra maintenance burden for.

The currently planned heuristic is to prune and defrag when first accessing
a page that potentially has prunable tuples (as flagged by the pd_prune_xid
page hint field) and that either has free space less than MAX(fillfactor
target free space, BLCKSZ/10) *or* has recently had an UPDATE fail to
find enough free space to store an updated tuple version.  (These rules
are subject to change.)

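The heuristic in the previous paragraph can be written out directly.
This is a sketch under the stated rules, not the actual heap code;
should_attempt_prune and its parameters are invented names, and the
callers are assumed to signal a recent failed UPDATE via update_failed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define TOY_BLCKSZ 8192			/* stand-in for the real BLCKSZ */

/*
 * Decide whether a page is worth trying to prune on first access:
 * the pd_prune_xid hint must indicate potentially prunable tuples
 * (0 here meaning "nothing prunable"), and the page must either be
 * short on free space or have recently failed to fit an updated
 * tuple version.
 */
static bool
should_attempt_prune(TransactionId pd_prune_xid,
					 size_t free_space,
					 size_t fillfactor_target_free,
					 bool update_failed)
{
	size_t		threshold = fillfactor_target_free;

	/* MAX(fillfactor target free space, BLCKSZ/10) */
	if (threshold < TOY_BLCKSZ / 10)
		threshold = TOY_BLCKSZ / 10;

	if (pd_prune_xid == 0)
		return false;			/* no prunable tuples flagged */
	return free_space < threshold || update_failed;
}
```

If this returns true, the caller would then try the non-blocking buffer
cleanup lock acquisition described above, and simply skip the
housekeeping when the lock is unavailable.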
We have effectively implemented the "truncate dead tuples to just line
pointer" idea that has been proposed and rejected before because of fear
of line pointer bloat: we might end up with huge numbers of line pointers
and just a few actual tuples on a page.  To limit the damage in the worst
case, and to keep various work arrays as well as the bitmaps in bitmap
scans reasonably sized, the maximum number of line pointers per page
is arbitrarily capped at MaxHeapTuplesPerPage (the most tuples that
could fit without HOT pruning).

Effectively, space reclamation happens during tuple retrieval when the
page is nearly full (<10% free) and a buffer cleanup lock can be
acquired.  This means that UPDATE, DELETE, and SELECT can trigger space
reclamation, but often not during INSERT ... VALUES because it does
not retrieve a row.


VACUUM
------

There is little change to regular vacuum.  It performs pruning to remove
dead heap-only tuples, and cleans up any dead line pointers as if they were
regular dead tuples.


Statistics
----------

Currently, we count HOT updates the same as cold updates for statistics
purposes, though there is an additional per-table counter that counts
only HOT updates.  When a page pruning operation is able to remove a
physical tuple by eliminating an intermediate heap-only tuple or
replacing a physical root tuple by a redirect pointer, a decrement in
the table's number of dead tuples is reported to pgstats, which may
postpone autovacuuming.  Note that we do not count replacing a root tuple
by a DEAD item pointer as decrementing n_dead_tuples; we still want
autovacuum to run to clean up the index entries and DEAD item.

This area probably needs further work ...


CREATE INDEX
------------

CREATE INDEX presents a problem for HOT updates.  While the existing HOT
chains all have the same index values for existing indexes, the columns
in the new index might change within a pre-existing HOT chain, creating
a "broken" chain that can't be indexed properly.

To address this issue, regular (non-concurrent) CREATE INDEX makes the
new index usable only by new transactions and transactions that don't
have snapshots older than the CREATE INDEX command.  This prevents
queries that can see the inconsistent HOT chains from trying to use the
new index and getting incorrect results.  Queries that can see the index
can only see the rows that were visible after the index was created,
hence the HOT chains are consistent for them.

Entries in the new index point to root tuples (tuples with current index
pointers) so that our index uses the same index pointers as all other
indexes on the table.  However the row we want to index is actually at
the *end* of the chain, ie, the most recent live tuple on the HOT chain.
That is the one we compute the index entry values for, but the TID
we put into the index is that of the root tuple.  Since queries that
will be allowed to use the new index cannot see any of the older tuple
versions in the chain, the fact that they might not match the index entry
isn't a problem.  (Such queries will check the tuple visibility
information of the older versions and ignore them, without ever looking at
their contents, so the content inconsistency is OK.)  Subsequent updates
to the live tuple will be allowed to extend the HOT chain only if they are
HOT-safe for all the indexes.

Because we have ShareLock on the table, any DELETE_IN_PROGRESS or
INSERT_IN_PROGRESS tuples should have come from our own transaction.
Therefore we can consider them committed since if the CREATE INDEX
commits, they will be committed, and if it aborts the index is discarded.
An exception to this is that early lock release is customary for system
catalog updates, and so we might find such tuples when reindexing a system
catalog.  In that case we deal with it by waiting for the source
transaction to commit or roll back.  (We could do that for user tables
too, but since the case is unexpected we prefer to throw an error.)

Practically, we prevent certain transactions from using the new index by
setting pg_index.indcheckxmin to TRUE.  Transactions are allowed to use
such an index only after pg_index.xmin is below their TransactionXmin
horizon, thereby ensuring that any incompatible rows in HOT chains are
dead to them.  (pg_index.xmin will be the XID of the CREATE INDEX
transaction.  The reason for using xmin rather than a normal column is
that the regular vacuum freezing mechanism will take care of converting
xmin to FrozenTransactionId before it can wrap around.)

This means in particular that the transaction creating the index will be
unable to use the index if the transaction has old snapshots.  We
alleviate that problem somewhat by not setting indcheckxmin unless the
table actually contains HOT chains with RECENTLY_DEAD members.

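The indcheckxmin rule reduces to a small predicate.  This sketch ignores
XID wraparound (the real comparison uses TransactionIdPrecedes, and
freezing keeps wraparound at bay, as noted above); index_usable is an
invented name for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Simplified, wraparound-free sketch of the rule: an index whose
 * pg_index row has indcheckxmin = true is usable only once that row's
 * xmin (the XID of the CREATE INDEX transaction) is older than the
 * consulting transaction's TransactionXmin horizon, guaranteeing any
 * incompatible rows in broken HOT chains are dead to it.
 */
static bool
index_usable(bool indcheckxmin,
			 TransactionId index_xmin,
			 TransactionId my_transaction_xmin)
{
	if (!indcheckxmin)
		return true;			/* no broken HOT chains to worry about */
	return index_xmin < my_transaction_xmin;
}
```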
Another unpleasant consequence is that it is now risky to use SnapshotAny
in an index scan: if the index was created more recently than the last
vacuum, it's possible that some of the visited tuples do not match the
index entry they are linked to.  This does not seem to be a fatal
objection, since there are few users of SnapshotAny and most use seqscans.
The only exception at this writing is CLUSTER, which is okay because it
does not require perfect ordering of the indexscan readout (and especially
so because CLUSTER tends to write recently-dead tuples out of order anyway).


CREATE INDEX CONCURRENTLY
-------------------------

In the concurrent case we must take a different approach.  We create the
pg_index entry immediately, before we scan the table.  The pg_index entry
is marked as "not ready for inserts".  Then we commit and wait for any
transactions which have the table open to finish.  This ensures that no
new HOT updates will change the key value for our new index, because all
transactions will see the existence of the index and will respect its
constraint on which updates can be HOT.  Other transactions must include
such an index when determining HOT-safety of updates, even though they
must ignore it for both insertion and searching purposes.

We must do this to avoid making incorrect index entries.  For example,
suppose we are building an index on column X and we make an index entry for
a non-HOT tuple with X=1.  Then some other backend, unaware that X is an
indexed column, HOT-updates the row to have X=2, and commits.  We now have
an index entry for X=1 pointing at a HOT chain whose live row has X=2.
We could make an index entry with X=2 during the validation pass, but
there is no nice way to get rid of the wrong entry with X=1.  So we must
have the HOT-safety property enforced before we start to build the new
index.

After waiting for transactions which had the table open, we build the index
for all rows that are valid in a fresh snapshot.  Any tuples visible in the
snapshot will have only valid forward-growing HOT chains.  (They might have
older HOT updates behind them which are broken, but this is OK for the same
reason it's OK in a regular index build.)  As above, we point the index
entry at the root of the HOT-update chain but we use the key value from the
live tuple.

We mark the index open for inserts (but still not ready for reads) then
we again wait for transactions which have the table open.  Then we take
a second reference snapshot and validate the index.  This searches for
tuples missing from the index, and inserts any missing ones.  Again,
the index entries have to have TIDs equal to HOT-chain root TIDs, but
the value to be inserted is the one from the live tuple.

Then we wait until every transaction that could have a snapshot older than
the second reference snapshot is finished.  This ensures that nobody is
alive any longer who could need to see any tuples that might be missing
from the index, as well as ensuring that no one can see any inconsistent
rows in a broken HOT chain (the first condition is stronger than the
second).  Finally, we can mark the index valid for searches.

Note that we do not need to set pg_index.indcheckxmin in this code path,
because we have outwaited any transactions that would need to avoid using
the index.  (indcheckxmin is only needed because non-concurrent CREATE
INDEX doesn't want to wait; its stronger lock would create too much risk of
deadlock if it did.)


DROP INDEX CONCURRENTLY
-----------------------

DROP INDEX CONCURRENTLY is sort of the reverse sequence of CREATE INDEX
CONCURRENTLY.  We first mark the index as not indisvalid, and then wait for
any transactions that could be using it in queries to end.  (During this
time, index updates must still be performed as normal, since such
transactions might expect freshly inserted tuples to be findable.)
Then, we clear indisready and indislive, and again wait for transactions
that could be updating the index to end.  Finally we can drop the index
normally (though taking only ShareUpdateExclusiveLock on its parent table).

The reason we need the pg_index.indislive flag is that after the second
wait step begins, we don't want transactions to be touching the index at
all; otherwise they might suffer errors if the DROP finally commits while
they are reading catalog entries for the index.  If we had only indisvalid
and indisready, this state would be indistinguishable from the first stage
of CREATE INDEX CONCURRENTLY --- but in that state, we *do* want
transactions to examine the index, since they must consider it in
HOT-safety checks.


Limitations and Restrictions
----------------------------

It is worth noting that HOT forever forecloses alternative approaches
to vacuuming, specifically the recompute-the-index-keys approach alluded
to in Technical Challenges above.  It'll be tough to recompute the index
keys for a root line pointer you don't have data for anymore ...


Glossary
--------

Broken HOT Chain

	A HOT chain in which the key value for an index has changed.

	This is not allowed to occur normally but if a new index is created
	it can happen.  In that case various strategies are used to ensure
	that no transaction for which the older tuples are visible can
	use the index.

Cold update

	A normal, non-HOT update, in which index entries are made for
	the new version of the tuple.

Dead line pointer

	A stub line pointer, that does not point to anything, but cannot
	be removed or reused yet because there are index pointers to it.
	Semantically same as a dead tuple.  It has state LP_DEAD.

Heap-only tuple

	A heap tuple with no index pointers, which can only be reached
	from indexes indirectly through its ancestral root tuple.
	Marked with HEAP_ONLY_TUPLE flag.

HOT-safe

	A proposed tuple update is said to be HOT-safe if it changes
	none of the tuple's indexed columns.  It will only become an
	actual HOT update if we can find room on the same page for
	the new tuple version.

HOT update

	An UPDATE where the new tuple becomes a heap-only tuple, and no
	new index entries are made.

HOT-updated tuple

	An updated tuple, for which the next tuple in the chain is a
	heap-only tuple.  Marked with HEAP_HOT_UPDATED flag.

Indexed column

	A column used in an index definition.  The column might not
	actually be stored in the index --- it could be used in a
	functional index's expression, or used in a partial index
	predicate.  HOT treats all these cases alike.

Redirecting line pointer

	A line pointer that points to another line pointer and has no
	associated tuple.  It has the special lp_flags state LP_REDIRECT,
	and lp_off is the OffsetNumber of the line pointer it links to.
	This is used when a root tuple becomes dead but we cannot prune
	the line pointer because there are non-dead heap-only tuples
	further down the chain.

Root tuple

	The first tuple in a HOT update chain; the one that indexes point to.

Update chain

	A chain of updated tuples, in which each tuple's ctid points to
	the next tuple in the chain.  A HOT update chain is an update chain
	(or portion of an update chain) that consists of a root tuple and
	one or more heap-only tuples.  A complete update chain can contain
	both HOT and non-HOT (cold) updated tuples.


README.tuplock

Locking tuples
--------------

Locking tuples is not as easy as locking tables or other database objects.
The problem is that transactions might want to lock large numbers of tuples at
any one time, so it's not possible to keep the lock objects in shared memory.
To work around this limitation, we use a two-level mechanism.  The first level
is implemented by storing locking information in the tuple header: a tuple is
marked as locked by setting the current transaction's XID as its XMAX, and
setting additional infomask bits to distinguish this case from the more normal
case of having deleted the tuple.  When multiple transactions concurrently
lock a tuple, a MultiXact is used; see below.  This mechanism can accommodate
arbitrarily large numbers of tuples being locked simultaneously.

When it is necessary to wait for a tuple-level lock to be released, the basic
delay is provided by XactLockTableWait or MultiXactIdWait on the contents of
the tuple's XMAX.  However, that mechanism will release all waiters
concurrently, so there would be a race condition as to which waiter gets the
tuple, potentially leading to indefinite starvation of some waiters.  The
possibility of share-locking makes the problem much worse --- a steady stream
of share-lockers can easily block an exclusive locker forever.  To provide
more reliable semantics about who gets a tuple-level lock first, we use the
standard lock manager, which implements the second level mentioned above.  The
protocol for waiting for a tuple-level lock is really

     LockTuple()
     XactLockTableWait()
     mark tuple as locked by me
     UnlockTuple()

When there are multiple waiters, arbitration of who is to get the lock next
is provided by LockTuple().  However, at most one tuple-level lock will
be held or awaited per backend at any time, so we don't risk overflow
of the lock table.  Note that incoming share-lockers are required to
do LockTuple as well, if there is any conflict, to ensure that they don't
starve out waiting exclusive-lockers.  However, if there is not any active
conflict for a tuple, we don't incur any extra overhead.

We make an exception to the above rule for those lockers that already hold
some lock on a tuple and attempt to acquire a stronger one on it.  In that
case, we skip the LockTuple() call even when there are conflicts, provided
that the target tuple is being locked, updated or deleted by multiple sessions
concurrently.  Failing to skip the lock would risk a deadlock, e.g., between a
session that was first to record its weaker lock in the tuple header and would
be waiting on the LockTuple() call to upgrade to the stronger lock level, and
another session that has already done LockTuple() and is waiting for the first
session's transaction to release its tuple header-level lock.

We provide four levels of tuple locking strength: SELECT FOR UPDATE obtains an
exclusive lock which prevents any kind of modification of the tuple. This is
the lock level that is implicitly taken by DELETE operations, and also by
UPDATE operations if they modify any of the tuple's key fields. SELECT FOR NO
KEY UPDATE likewise obtains an exclusive lock, but only prevents tuple removal
and modifications which might alter the tuple's key. This is the lock that is
implicitly taken by UPDATE operations which leave all key fields unchanged.
SELECT FOR SHARE obtains a shared lock which prevents any kind of tuple
modification. Finally, SELECT FOR KEY SHARE obtains a shared lock which only
prevents tuple removal and modifications of key fields. This last mode is
just strong enough to implement RI checks: it ensures that tuples do not go
away from under a check, without blocking some other transaction that wants
to update the tuple without changing its key.

The conflict table is:

                  UPDATE       NO KEY UPDATE    SHARE        KEY SHARE
UPDATE           conflict        conflict      conflict      conflict
NO KEY UPDATE    conflict        conflict      conflict
SHARE            conflict        conflict
KEY SHARE        conflict

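The conflict table has a compact encoding: if the four modes are ordered
weakest to strongest, two locks conflict exactly when their combined
strength is high enough.  The enum and function names here are invented
for this sketch; the real checks live in the heap access method.

```c
#include <stdbool.h>

/* Tuple lock modes, ordered weakest to strongest. */
typedef enum
{
	TUPLOCK_KEY_SHARE,			/* SELECT FOR KEY SHARE */
	TUPLOCK_SHARE,				/* SELECT FOR SHARE */
	TUPLOCK_NO_KEY_UPDATE,		/* SELECT FOR NO KEY UPDATE */
	TUPLOCK_UPDATE				/* SELECT FOR UPDATE */
} TupleLockMode;

/*
 * Two tuple locks conflict exactly when the table above says so:
 * FOR UPDATE conflicts with everything, FOR KEY SHARE only with
 * FOR UPDATE, and so on.  With the ordering above, the whole table
 * reduces to: conflict iff the strengths sum to at least 3.
 */
static bool
tuplock_conflicts(TupleLockMode a, TupleLockMode b)
{
	return (int) a + (int) b >= 3;
}
```

Checking a few cells against the table: UPDATE (3) conflicts with
KEY SHARE (0) since 3 + 0 >= 3, while SHARE (1) does not conflict with
itself since 1 + 1 < 3.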
When there is a single locker in a tuple, we can just store the locking info
in the tuple itself.  We do this by storing the locker's Xid in XMAX, and
setting infomask bits specifying the locking strength.  There is one exception
here: since infomask space is limited, we do not provide a separate bit
for SELECT FOR SHARE, so we have to use the extended info in a MultiXact in
that case.  (The other cases, SELECT FOR UPDATE and SELECT FOR KEY SHARE, are
presumably more commonly used due to being the standards-mandated locking
mechanism, or heavily used by the RI code, so we want to provide fast paths
for those.)

MultiXacts
----------

A tuple header provides very limited space for storing information about tuple
locking and updates: there is room only for a single Xid and a small number of
infomask bits.  Whenever we need to store more than one lock, we replace the
first locker's Xid with a new MultiXactId.  Each MultiXact provides extended
locking data; it comprises an array of Xids plus some flag bits for each one.
The flags are currently used to store the locking strength of each member
transaction.  (The flags also distinguish a pure locker from an updater.)

In earlier PostgreSQL releases, a MultiXact always meant that the tuple was
locked in shared mode by multiple transactions.  This is no longer the case; a
MultiXact may contain an update or delete Xid.  (Keep in mind that tuple locks
in a transaction do not conflict with other tuple locks in the same
transaction, so it's possible to have otherwise conflicting locks in a
MultiXact if they belong to the same transaction).

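The member-array-plus-flags shape described above can be modeled as
follows.  This is a toy data structure with invented names
(ToyMultiXactMember, TOYMXACT_IS_UPDATER, etc.), not the on-disk
pg_multixact format; it only illustrates how per-member flags let a
single MultiXact carry both pure lockers and at most one updater.

```c
#include <stdint.h>

typedef uint32_t TransactionId;

/* Toy per-member flag: distinguishes an updater from a pure locker.
 * The real flags also encode each member's locking strength. */
#define TOYMXACT_IS_UPDATER 0x01

typedef struct ToyMultiXactMember
{
	TransactionId xid;			/* member transaction */
	uint8_t		flags;			/* per-member flag bits */
} ToyMultiXactMember;

/*
 * A MultiXact can contain at most one update/delete Xid among its
 * members; return it, or 0 if every member is a pure locker.
 */
static TransactionId
toy_multixact_updater(const ToyMultiXactMember *members, int nmembers)
{
	for (int i = 0; i < nmembers; i++)
	{
		if (members[i].flags & TOYMXACT_IS_UPDATER)
			return members[i].xid;
	}
	return 0;
}
```

A future reader of the tuple uses exactly this kind of lookup to decide
whose commit status determines the tuple's fate, which is why the
MultiXact data must survive crashes, as the next paragraph explains.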
Note that each lock is attributed to the subtransaction that acquires it.
This means that a subtransaction that aborts is seen as though it releases the
locks it acquired; concurrent transactions can then proceed without having to
wait for the main transaction to finish.  It also means that a subtransaction
can upgrade to a stronger lock level than an earlier subtransaction had, and
if the subxact aborts, the earlier, weaker lock is kept.

The possibility of having an update within a MultiXact means that they must
persist across crashes and restarts: a future reader of the tuple needs to
figure out whether the update committed or aborted.  So we have a requirement
that pg_multixact needs to retain pages of its data until we're certain that
the MultiXacts in them are no longer of interest.

VACUUM is in charge of removing old MultiXacts at the time of tuple freezing.
The lower bound used by vacuum (that is, the value below which all multixacts
are removed) is stored as pg_class.relminmxid for each table; the minimum of
all such values is stored in pg_database.datminmxid.  The minimum across
all databases, in turn, is recorded in checkpoint records, and CHECKPOINT
removes pg_multixact/ segments older than that value once the checkpoint
record has been flushed.

Infomask Bits
-------------

The following infomask bits are applicable:

- HEAP_XMAX_INVALID
  Any tuple with this bit set does not have a valid value stored in XMAX.

- HEAP_XMAX_IS_MULTI
  This bit is set if the tuple's Xmax is a MultiXactId (as opposed to a
  regular TransactionId).

- HEAP_XMAX_LOCK_ONLY
  This bit is set when the XMAX is a locker only; that is, if it's a
  multixact, it does not contain an update among its members.  It's set when
  the XMAX is a plain Xid that locked the tuple, as well.

- HEAP_XMAX_KEYSHR_LOCK
- HEAP_XMAX_SHR_LOCK
- HEAP_XMAX_EXCL_LOCK
  These bits indicate the strength of the lock acquired; they are useful when
  the XMAX is not a MultiXactId.  If it's a multi, the info is to be found in
  the member flags.  If HEAP_XMAX_IS_MULTI is not set and HEAP_XMAX_LOCK_ONLY
  is set, then one of these *must* be set as well.

  Note that HEAP_XMAX_EXCL_LOCK does not distinguish FOR NO KEY UPDATE from
  FOR UPDATE; this is implemented by the HEAP_KEYS_UPDATED bit.

- HEAP_KEYS_UPDATED
  This bit lives in t_infomask2.  If set, it indicates that the operation(s)
  done by the XMAX compromise the tuple key, such as a SELECT FOR UPDATE, an
  UPDATE that modifies the columns of the key, or a DELETE.  It's set
  regardless of whether the XMAX is a TransactionId or a MultiXactId.

We currently never set the HEAP_XMAX_COMMITTED bit when the HEAP_XMAX_IS_MULTI
bit is set.

156