Locking tuples
--------------

Locking tuples is not as easy as locking tables or other database objects.
The problem is that transactions might want to lock large numbers of tuples at
any one time, so it's not possible to keep the lock objects in shared memory.
To work around this limitation, we use a two-level mechanism.  The first level
is implemented by storing locking information in the tuple header: a tuple is
marked as locked by setting the current transaction's XID as its XMAX, and
setting additional infomask bits to distinguish this case from the more normal
case of having deleted the tuple.  When multiple transactions concurrently
lock a tuple, a MultiXact is used; see below.  This mechanism can accommodate
arbitrarily large numbers of tuples being locked simultaneously.

When it is necessary to wait for a tuple-level lock to be released, the basic
delay is provided by XactLockTableWait or MultiXactIdWait on the contents of
the tuple's XMAX.  However, that mechanism will release all waiters
concurrently, so there would be a race condition as to which waiter gets the
tuple, potentially leading to indefinite starvation of some waiters.  The
possibility of share-locking makes the problem much worse --- a steady stream
of share-lockers can easily block an exclusive locker forever.  To provide
more reliable semantics about who gets a tuple-level lock first, we use the
standard lock manager, which implements the second level mentioned above.  The
protocol for waiting for a tuple-level lock is really

     LockTuple()
     XactLockTableWait()
     mark tuple as locked by me
     UnlockTuple()

When there are multiple waiters, arbitration of who is to get the lock next
is provided by LockTuple().  However, at most one tuple-level lock will
be held or awaited per backend at any time, so we don't risk overflow
of the lock table.  Note that incoming share-lockers are required to
do LockTuple as well, if there is any conflict, to ensure that they don't
starve out waiting exclusive-lockers.  However, if there is not any active
conflict for a tuple, we don't incur any extra overhead.
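
The arbitration protocol above can be illustrated with a toy, single-threaded
Python model (all names here are made up for illustration; in reality
LockTuple() is a heavyweight-lock acquisition and waiters block inside
XactLockTableWait()):

```python
from collections import deque

class TupleLockModel:
    """Toy model of the two-level protocol.  A FIFO queue stands in for
    the heavyweight lock taken by LockTuple(); the xmax field stands in
    for the locking information in the tuple header."""

    def __init__(self):
        self.queue = deque()   # backends waiting in LockTuple()
        self.holder = None     # backend holding the heavyweight lock
        self.xmax = None       # backend recorded in the tuple header

    def lock_tuple(self, backend):
        # LockTuple(): queue up; the first in line gets the heavyweight
        # lock and proceeds to XactLockTableWait() on the current xmax
        self.queue.append(backend)
        self._grant()

    def xmax_released(self):
        # The transaction recorded in xmax ends: the heavyweight-lock
        # holder wakes, marks the tuple as locked by itself ("mark tuple
        # as locked by me"), then does UnlockTuple()
        winner = self.holder
        if winner is not None:
            self.xmax = winner
            self.holder = None
            self._grant()
        return winner

    def _grant(self):
        if self.holder is None and self.queue:
            self.holder = self.queue.popleft()
```

Because the heavyweight lock is granted in arrival order, waiters obtain the
tuple lock first-come first-served instead of racing when the xmax transaction
ends, which is the starvation problem the second level exists to solve.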

We make an exception to the above rule for those lockers that already hold
some lock on a tuple and attempt to acquire a stronger one on it.  In that
case, we skip the LockTuple() call even when there are conflicts, provided
that the target tuple is being locked, updated or deleted by multiple sessions
concurrently.  Failing to skip the lock would risk a deadlock, e.g., between a
session that was first to record its weaker lock in the tuple header and would
be waiting on the LockTuple() call to upgrade to the stronger lock level, and
another session that has already done LockTuple() and is waiting for the first
session transaction to release its tuple header-level lock.

We provide four levels of tuple locking strength: SELECT FOR UPDATE obtains an
exclusive lock which prevents any kind of modification of the tuple. This is
the lock level that is implicitly taken by DELETE operations, and also by
UPDATE operations if they modify any of the tuple's key fields. SELECT FOR NO
KEY UPDATE likewise obtains an exclusive lock, but only prevents tuple removal
and modifications which might alter the tuple's key. This is the lock that is
implicitly taken by UPDATE operations which leave all key fields unchanged.
SELECT FOR SHARE obtains a shared lock which prevents any kind of tuple
modification. Finally, SELECT FOR KEY SHARE obtains a shared lock which only
prevents tuple removal and modifications of key fields. This last mode is
just strong enough to implement RI checks, i.e. it ensures that tuples do not
go away from under a check, without blocking other transactions that want to
update the tuple without changing its key.

The conflict table is:

                  UPDATE       NO KEY UPDATE    SHARE        KEY SHARE
UPDATE           conflict        conflict      conflict      conflict
NO KEY UPDATE    conflict        conflict      conflict
SHARE            conflict        conflict
KEY SHARE        conflict
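
As an illustration, the table can be encoded as a small lookup (a sketch in
Python, not the actual implementation):

```python
# Conflict table for the four tuple lock modes; symmetric, as above.
TUPLE_LOCK_CONFLICTS = {
    "UPDATE":        {"UPDATE", "NO KEY UPDATE", "SHARE", "KEY SHARE"},
    "NO KEY UPDATE": {"UPDATE", "NO KEY UPDATE", "SHARE"},
    "SHARE":         {"UPDATE", "NO KEY UPDATE"},
    "KEY SHARE":     {"UPDATE"},
}

def tuple_locks_conflict(a, b):
    """Return True if tuple lock modes a and b conflict."""
    return b in TUPLE_LOCK_CONFLICTS[a] or a in TUPLE_LOCK_CONFLICTS[b]
```

Note that the two shared modes never conflict with themselves: any number of
concurrent SHARE and KEY SHARE lockers can coexist on one tuple.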

When there is a single locker in a tuple, we can just store the locking info
in the tuple itself.  We do this by storing the locker's Xid in XMAX, and
setting infomask bits specifying the locking strength.  There is one exception
here: since infomask space is limited, we do not provide a separate bit
for SELECT FOR SHARE, so we have to use the extended info in a MultiXact in
that case.  (The other cases, SELECT FOR UPDATE and SELECT FOR KEY SHARE, are
presumably more commonly used due to being the standards-mandated locking
mechanism, or heavily used by the RI code, so we want to provide fast paths
for those.)

MultiXacts
----------

A tuple header provides very limited space for storing information about tuple
locking and updates: there is room only for a single Xid and a small number of
infomask bits.  Whenever we need to store more than one lock, we replace the
first locker's Xid with a new MultiXactId.  Each MultiXact provides extended
locking data; it comprises an array of Xids plus some flag bits for each one.
The flags are currently used to store the locking strength of each member
transaction.  (The flags also distinguish a pure locker from an updater.)
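
A sketch of this structure in Python (status names are modeled on the
MultiXactStatus codes in the real code; the values here are illustrative):

```python
from dataclasses import dataclass

# Pure-locker statuses first, then the two updater statuses
FOR_KEY_SHARE, FOR_SHARE, FOR_NO_KEY_UPDATE, FOR_UPDATE, NO_KEY_UPDATE, UPDATE = range(6)

@dataclass
class MultiXactMember:
    xid: int     # member (sub)transaction
    status: int  # locking strength, or one of the updater statuses

def multixact_is_lock_only(members):
    """True when no member is an updater (cf. HEAP_XMAX_LOCK_ONLY)."""
    return all(m.status <= FOR_UPDATE for m in members)
```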

In earlier PostgreSQL releases, a MultiXact always meant that the tuple was
locked in shared mode by multiple transactions.  This is no longer the case; a
MultiXact may contain an update or delete Xid.  (Keep in mind that tuple locks
in a transaction do not conflict with other tuple locks in the same
transaction, so it's possible to have otherwise conflicting locks in a
MultiXact if they belong to the same transaction.)

Note that each lock is attributed to the subtransaction that acquires it.
This means that a subtransaction that aborts is seen as though it releases the
locks it acquired; concurrent transactions can then proceed without having to
wait for the main transaction to finish.  It also means that a subtransaction
can upgrade to a stronger lock level than an earlier transaction had, and if
the subxact aborts, the earlier, weaker lock is kept.
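
A small sketch of this attribution rule (hypothetical helper, Python):

```python
LOCK_STRENGTH = {"KEY SHARE": 0, "SHARE": 1, "NO KEY UPDATE": 2, "UPDATE": 3}

def effective_lock(acquisitions, aborted_subxacts):
    """Given (subxact_id, mode) lock acquisitions on one tuple by a single
    top-level transaction, return the strongest mode still in effect after
    the given subtransactions abort, or None if nothing survives."""
    live = [mode for subxact, mode in acquisitions
            if subxact not in aborted_subxacts]
    return max(live, key=LOCK_STRENGTH.get, default=None)
```

For example, if subxact 1 takes KEY SHARE and subxact 2 upgrades to UPDATE
but then aborts, the tuple remains locked in KEY SHARE mode only.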

The possibility of having an update within a MultiXact means that MultiXacts
must persist across crashes and restarts: a future reader of the tuple needs
to figure out whether the update committed or aborted.  So we have a
requirement that pg_multixact needs to retain pages of its data until we're
certain that the MultiXacts in them are no longer of interest.

VACUUM is in charge of removing old MultiXacts at the time of tuple freezing.
The lower bound used by vacuum (that is, the value below which all multixacts
are removed) is stored as pg_class.relminmxid for each table; the minimum of
all such values is stored in pg_database.datminmxid.  The minimum across
all databases, in turn, is recorded in checkpoint records, and CHECKPOINT
removes pg_multixact/ segments older than that value once the checkpoint
record has been flushed.
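
The derivation of the truncation point can be sketched as follows
(hypothetical helper names, illustrating only the minimum-of-minimums logic):

```python
def datminmxid(relminmxid_by_table):
    """A database's datminmxid is the minimum relminmxid over its tables."""
    return min(relminmxid_by_table.values())

def multixact_truncation_bound(cluster):
    """The cluster-wide bound recorded in checkpoint records: pg_multixact/
    segments containing only MultiXacts older than this can be removed."""
    return min(datminmxid(tables) for tables in cluster.values())
```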

Infomask Bits
-------------

The following infomask bits are applicable:

- HEAP_XMAX_INVALID
  Any tuple with this bit set does not have a valid value stored in XMAX.

- HEAP_XMAX_IS_MULTI
  This bit is set if the tuple's Xmax is a MultiXactId (as opposed to a
  regular TransactionId).

- HEAP_XMAX_LOCK_ONLY
  This bit is set when the XMAX is a locker only; that is, if it's a
  multixact, it does not contain an update among its members.  It's set when
  the XMAX is a plain Xid that locked the tuple, as well.

- HEAP_XMAX_KEYSHR_LOCK
- HEAP_XMAX_SHR_LOCK
- HEAP_XMAX_EXCL_LOCK
  These bits indicate the strength of the lock acquired; they are useful when
  the XMAX is not a MultiXactId.  If it's a multi, the info is to be found in
  the member flags.  If HEAP_XMAX_IS_MULTI is not set and HEAP_XMAX_LOCK_ONLY
  is set, then one of these *must* be set as well.

  Note that HEAP_XMAX_EXCL_LOCK does not distinguish FOR NO KEY UPDATE from
  FOR UPDATE; this distinction is made by the HEAP_KEYS_UPDATED bit.

- HEAP_KEYS_UPDATED
  This bit lives in t_infomask2.  If set, it indicates that the operation(s)
  done by the XMAX compromise the tuple key, such as a SELECT FOR UPDATE, an
  UPDATE that modifies the columns of the key, or a DELETE.  It's set
  regardless of whether the XMAX is a TransactionId or a MultiXactId.

We currently never set HEAP_XMAX_COMMITTED when the HEAP_XMAX_IS_MULTI bit
is set.
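
Putting the bits together, decoding the lock strength of a plain-Xid XMAX
can be sketched as follows (bit values as in access/htup_details.h, shown
here for illustration; consult the header for the authoritative definitions):

```python
HEAP_XMAX_KEYSHR_LOCK = 0x0010
HEAP_XMAX_EXCL_LOCK   = 0x0040
HEAP_XMAX_SHR_LOCK    = HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_KEYSHR_LOCK
HEAP_XMAX_LOCK_ONLY   = 0x0080
HEAP_XMAX_IS_MULTI    = 0x1000
HEAP_KEYS_UPDATED     = 0x2000   # lives in t_infomask2, not t_infomask

def xmax_lock_strength(infomask, infomask2):
    """Decode the strength bits for a plain-Xid XMAX; returns None for a
    MultiXact, whose strength lives in the member flags instead."""
    if infomask & HEAP_XMAX_IS_MULTI:
        return None
    if (infomask & HEAP_XMAX_SHR_LOCK) == HEAP_XMAX_SHR_LOCK:
        return "FOR SHARE"
    if infomask & HEAP_XMAX_KEYSHR_LOCK:
        return "FOR KEY SHARE"
    if infomask & HEAP_XMAX_EXCL_LOCK:
        # HEAP_XMAX_EXCL_LOCK alone is ambiguous: HEAP_KEYS_UPDATED
        # distinguishes FOR UPDATE from FOR NO KEY UPDATE
        return "FOR UPDATE" if infomask2 & HEAP_KEYS_UPDATED else "FOR NO KEY UPDATE"
    return None
```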
156