• Home
  • History
  • Annotate
Name Date Size #Lines LOC

..08-Nov-2021-

.gitignoreH A D08-Nov-202130 32

MakefileH A D08-Nov-20211.1 KiB4322

READMEH A D08-Nov-202140 KiB734625

README-SSIH A D08-Nov-202129 KiB630507

README.barrierH A D08-Nov-20219 KiB198156

deadlock.cH A D08-Nov-202134.7 KiB1,171621

generate-lwlocknames.plH A D08-Nov-20211.6 KiB6844

lmgr.cH A D08-Nov-202126.7 KiB1,113595

lock.cH A D08-Nov-2021138.2 KiB4,6162,564

lwlock.cH A D08-Nov-202151.4 KiB1,891997

lwlocknames.cH A D08-Nov-20211.2 KiB5350

lwlocknames.hH A D08-Nov-20212.5 KiB5045

predicate.cH A D08-Nov-2021160 KiB5,0812,797

proc.cH A D08-Nov-202155.2 KiB1,853875

s_lock.cH A D08-Nov-202110 KiB379199

spin.cH A D08-Nov-20214 KiB17281

README

1src/backend/storage/lmgr/README
2
3Locking Overview
4================
5
6Postgres uses four types of interprocess locks:
7
8* Spinlocks.  These are intended for *very* short-term locks.  If a lock
9is to be held more than a few dozen instructions, or across any sort of
10kernel call (or even a call to a nontrivial subroutine), don't use a
11spinlock. Spinlocks are primarily used as infrastructure for lightweight
12locks. They are implemented using a hardware atomic-test-and-set
13instruction, if available.  Waiting processes busy-loop until they can
14get the lock. There is no provision for deadlock detection, automatic
15release on error, or any other nicety.  There is a timeout if the lock
16cannot be gotten after a minute or so (which is approximately forever in
17comparison to the intended lock hold time, so this is certainly an error
18condition).
19
20* Lightweight locks (LWLocks).  These locks are typically used to
21interlock access to datastructures in shared memory.  LWLocks support
22both exclusive and shared lock modes (for read/write and read-only
23access to a shared object). There is no provision for deadlock
24detection, but the LWLock manager will automatically release held
25LWLocks during elog() recovery, so it is safe to raise an error while
26holding LWLocks.  Obtaining or releasing an LWLock is quite fast (a few
27dozen instructions) when there is no contention for the lock.  When a
28process has to wait for an LWLock, it blocks on a SysV semaphore so as
29to not consume CPU time.  Waiting processes will be granted the lock in
30arrival order.  There is no timeout.
31
32* Regular locks (a/k/a heavyweight locks).  The regular lock manager
33supports a variety of lock modes with table-driven semantics, and it has
34full deadlock detection and automatic release at transaction end.
35Regular locks should be used for all user-driven lock requests.
36
37* SIReadLock predicate locks.  See separate README-SSI file for details.
38
39Acquisition of either a spinlock or a lightweight lock causes query
40cancel and die() interrupts to be held off until all such locks are
41released. No such restriction exists for regular locks, however.  Also
42note that we can accept query cancel and die() interrupts while waiting
43for a regular lock, but we will not accept them while waiting for
44spinlocks or LW locks. It is therefore not a good idea to use LW locks
45when the wait time might exceed a few seconds.
46
47The rest of this README file discusses the regular lock manager in detail.
48
49
50Lock Data Structures
51--------------------
52
53Lock methods describe the overall locking behavior.  Currently there are
54two lock methods: DEFAULT and USER.
55
56Lock modes describe the type of the lock (read/write or shared/exclusive).
57In principle, each lock method can have its own set of lock modes with
58different conflict rules, but currently DEFAULT and USER methods use
59identical lock mode sets. See src/include/storage/lock.h for more details.
60(Lock modes are also called lock types in some places in the code and
61documentation.)
62
63There are two main methods for recording locks in shared memory.  The primary
64mechanism uses two main structures: the per-lockable-object LOCK struct, and
65the per-lock-and-requestor PROCLOCK struct.  A LOCK object exists for each
66lockable object that currently has locks held or requested on it.  A PROCLOCK
67struct exists for each backend that is holding or requesting lock(s) on each
68LOCK object.
69
70There is also a special "fast path" mechanism which backends may use to
71record a limited number of locks with very specific characteristics: they must
72use the DEFAULT lockmethod; they must represent a lock on a database relation
73(not a shared relation), they must be a "weak" lock which is unlikely to
74conflict (AccessShareLock, RowShareLock, or RowExclusiveLock); and the system
75must be able to quickly verify that no conflicting locks could possibly be
76present.  See "Fast Path Locking", below, for more details.
77
78Each backend also maintains an unshared LOCALLOCK structure for each lockable
79object and lock mode that it is currently holding or requesting.  The shared
80lock structures only allow a single lock grant to be made per lockable
81object/lock mode/backend.  Internally to a backend, however, the same lock may
82be requested and perhaps released multiple times in a transaction, and it can
83also be held both transactionally and session-wide.  The internal request
84counts are held in LOCALLOCK so that the shared data structures need not be
85accessed to alter them.
86
87---------------------------------------------------------------------------
88
89The lock manager's LOCK objects contain:
90
91tag -
92    The key fields that are used for hashing locks in the shared memory
93    lock hash table.  The contents of the tag essentially define an
94    individual lockable object.  See include/storage/lock.h for details
95    about the supported types of lockable objects.  This is declared as
96    a separate struct to ensure that we always zero out the correct number
97    of bytes.  It is critical that any alignment-padding bytes the compiler
98    might insert in the struct be zeroed out, else the hash computation
99    will be random.  (Currently, we are careful to define struct LOCKTAG
100    so that there are no padding bytes.)
101
102grantMask -
103    This bitmask indicates what types of locks are currently held on the
104    given lockable object.  It is used (against the lock table's conflict
105    table) to determine if a new lock request will conflict with existing
106    lock types held.  Conflicts are determined by bitwise AND operations
107    between the grantMask and the conflict table entry for the requested
108    lock type.  Bit i of grantMask is 1 if and only if granted[i] > 0.
109
110waitMask -
111    This bitmask shows the types of locks being waited for.  Bit i of waitMask
112    is 1 if and only if requested[i] > granted[i].
113
114procLocks -
115    This is a shared memory queue of all the PROCLOCK structs associated with
116    the lock object.  Note that both granted and waiting PROCLOCKs are in this
117    list (indeed, the same PROCLOCK might have some already-granted locks and
118    be waiting for more!).
119
120waitProcs -
121    This is a shared memory queue of all PGPROC structures corresponding to
122    backends that are waiting (sleeping) until another backend releases this
123    lock.  The process structure holds the information needed to determine
124    if it should be woken up when the lock is released.
125
126nRequested -
127    Keeps a count of how many times this lock has been attempted to be
128    acquired.  The count includes attempts by processes which were put
129    to sleep due to conflicts.  It also counts the same backend twice
130    if, for example, a backend process first acquires a read and then
131    acquires a write.  (But multiple acquisitions of the same lock/lock mode
132    within a backend are not multiply counted here; they are recorded
133    only in the backend's LOCALLOCK structure.)
134
135requested -
136    Keeps a count of how many locks of each type have been attempted.  Only
137    elements 1 through MAX_LOCKMODES-1 are used as they correspond to the lock
138    type defined constants.  Summing the values of requested[] should come out
139    equal to nRequested.
140
141nGranted -
142    Keeps count of how many times this lock has been successfully acquired.
143    This count does not include attempts that are waiting due to conflicts.
144    Otherwise the counting rules are the same as for nRequested.
145
146granted -
147    Keeps count of how many locks of each type are currently held.  Once again
148    only elements 1 through MAX_LOCKMODES-1 are used (0 is not).  Also, like
149    requested[], summing the values of granted[] should total to the value
150    of nGranted.
151
152We should always have 0 <= nGranted <= nRequested, and
1530 <= granted[i] <= requested[i] for each i.  When all the request counts
154go to zero, the LOCK object is no longer needed and can be freed.
155
156---------------------------------------------------------------------------
157
158The lock manager's PROCLOCK objects contain:
159
160tag -
161    The key fields that are used for hashing entries in the shared memory
162    PROCLOCK hash table.  This is declared as a separate struct to ensure that
163    we always zero out the correct number of bytes.  It is critical that any
164    alignment-padding bytes the compiler might insert in the struct be zeroed
165    out, else the hash computation will be random.  (Currently, we are careful
166    to define struct PROCLOCKTAG so that there are no padding bytes.)
167
168    tag.myLock
169        Pointer to the shared LOCK object this PROCLOCK is for.
170
171    tag.myProc
172        Pointer to the PGPROC of backend process that owns this PROCLOCK.
173
174    Note: it's OK to use pointers here because a PROCLOCK never outlives
175    either its lock or its proc.  The tag is therefore unique for as long
176    as it needs to be, even though the same tag values might mean something
177    else at other times.
178
179holdMask -
180    A bitmask for the lock modes successfully acquired by this PROCLOCK.
181    This should be a subset of the LOCK object's grantMask, and also a
182    subset of the PGPROC object's heldLocks mask (if the PGPROC is
183    currently waiting for another lock mode on this lock).
184
185releaseMask -
186    A bitmask for the lock modes due to be released during LockReleaseAll.
187    This must be a subset of the holdMask.  Note that it is modified without
188    taking the partition LWLock, and therefore it is unsafe for any
189    backend except the one owning the PROCLOCK to examine/change it.
190
191lockLink -
192    List link for shared memory queue of all the PROCLOCK objects for the
193    same LOCK.
194
195procLink -
196    List link for shared memory queue of all the PROCLOCK objects for the
197    same backend.
198
199---------------------------------------------------------------------------
200
201
202Lock Manager Internal Locking
203-----------------------------
204
205Before PostgreSQL 8.2, all of the shared-memory data structures used by
206the lock manager were protected by a single LWLock, the LockMgrLock;
207any operation involving these data structures had to exclusively lock
208LockMgrLock.  Not too surprisingly, this became a contention bottleneck.
209To reduce contention, the lock manager's data structures have been split
210into multiple "partitions", each protected by an independent LWLock.
211Most operations only need to lock the single partition they are working in.
212Here are the details:
213
214* Each possible lock is assigned to one partition according to a hash of
215its LOCKTAG value.  The partition's LWLock is considered to protect all the
216LOCK objects of that partition as well as their subsidiary PROCLOCKs.
217
218* The shared-memory hash tables for LOCKs and PROCLOCKs are organized
219so that different partitions use different hash chains, and thus there
220is no conflict in working with objects in different partitions.  This
221is supported directly by dynahash.c's "partitioned table" mechanism
222for the LOCK table: we need only ensure that the partition number is
223taken from the low-order bits of the dynahash hash value for the LOCKTAG.
224To make it work for PROCLOCKs, we have to ensure that a PROCLOCK's hash
225value has the same low-order bits as its associated LOCK.  This requires
226a specialized hash function (see proclock_hash).
227
228* Formerly, each PGPROC had a single list of PROCLOCKs belonging to it.
229This has now been split into per-partition lists, so that access to a
230particular PROCLOCK list can be protected by the associated partition's
231LWLock.  (This rule allows one backend to manipulate another backend's
232PROCLOCK lists, which was not originally necessary but is now required in
233connection with fast-path locking; see below.)
234
235* The other lock-related fields of a PGPROC are only interesting when
236the PGPROC is waiting for a lock, so we consider that they are protected
237by the partition LWLock of the awaited lock.
238
239For normal lock acquisition and release, it is sufficient to lock the
240partition containing the desired lock.  Deadlock checking needs to touch
241multiple partitions in general; for simplicity, we just make it lock all
242the partitions in partition-number order.  (To prevent LWLock deadlock,
243we establish the rule that any backend needing to lock more than one
244partition at once must lock them in partition-number order.)  It's
245possible that deadlock checking could be done without touching every
246partition in typical cases, but since in a properly functioning system
247deadlock checking should not occur often enough to be performance-critical,
248trying to make this work does not seem a productive use of effort.
249
250A backend's internal LOCALLOCK hash table is not partitioned.  We do store
251a copy of the locktag hash code in LOCALLOCK table entries, from which the
252partition number can be computed, but this is a straight speed-for-space
253tradeoff: we could instead recalculate the partition number from the LOCKTAG
254when needed.
255
256
257Fast Path Locking
258-----------------
259
260Fast path locking is a special purpose mechanism designed to reduce the
261overhead of taking and releasing certain types of locks which are taken
262and released very frequently but rarely conflict.  Currently, this includes
263two categories of locks:
264
265(1) Weak relation locks.  SELECT, INSERT, UPDATE, and DELETE must acquire a
266lock on every relation they operate on, as well as various system catalogs
267that can be used internally.  Many DML operations can proceed in parallel
268against the same table at the same time; only DDL operations such as
269CLUSTER, ALTER TABLE, or DROP -- or explicit user action such as LOCK TABLE
270-- will create lock conflicts with the "weak" locks (AccessShareLock,
271RowShareLock, RowExclusiveLock) acquired by DML operations.
272
273(2) VXID locks.  Every transaction takes a lock on its own virtual
274transaction ID.  Currently, the only operations that wait for these locks
275are CREATE INDEX CONCURRENTLY and Hot Standby (in the case of a conflict),
276so most VXID locks are taken and released by the owner without anyone else
277needing to care.
278
279The primary locking mechanism does not cope well with this workload.  Even
280though the lock manager locks are partitioned, the locktag for any given
281relation still falls in one, and only one, partition.  Thus, if many short
282queries are accessing the same relation, the lock manager partition lock for
283that partition becomes a contention bottleneck.  This effect is measurable
284even on 2-core servers, and becomes very pronounced as core count increases.
285
286To alleviate this bottleneck, beginning in PostgreSQL 9.2, each backend is
287permitted to record a limited number of locks on unshared relations in an
288array within its PGPROC structure, rather than using the primary lock table.
289This mechanism can only be used when the locker can verify that no conflicting
290locks exist at the time of taking the lock.
291
292A key point of this algorithm is that it must be possible to verify the
293absence of possibly conflicting locks without fighting over a shared LWLock or
294spinlock.  Otherwise, this effort would simply move the contention bottleneck
295from one place to another.  We accomplish this using an array of 1024 integer
296counters, which are in effect a 1024-way partitioning of the lock space.
297Each counter records the number of "strong" locks (that is, ShareLock,
298ShareRowExclusiveLock, ExclusiveLock, and AccessExclusiveLock) on unshared
299relations that fall into that partition.  When this counter is non-zero, the
300fast path mechanism may not be used to take new relation locks within that
301partition.  A strong locker bumps the counter and then scans each per-backend
302array for matching fast-path locks; any which are found must be transferred to
303the primary lock table before attempting to acquire the lock, to ensure proper
304lock conflict and deadlock detection.
305
306On an SMP system, we must guarantee proper memory synchronization.  Here we
307rely on the fact that LWLock acquisition acts as a memory sequence point: if
308A performs a store, A and B both acquire an LWLock in either order, and B
309then performs a load on the same memory location, it is guaranteed to see
310A's store.  In this case, each backend's fast-path lock queue is protected
311by an LWLock.  A backend wishing to acquire a fast-path lock grabs this
312LWLock before examining FastPathStrongRelationLocks to check for the presence
313of a conflicting strong lock.  And the backend attempting to acquire a strong
314lock, because it must transfer any matching weak locks taken via the fast-path
315mechanism to the shared lock table, will acquire every LWLock protecting a
316backend fast-path queue in turn.  So, if we examine
317FastPathStrongRelationLocks and see a zero, then either the value is truly
318zero, or if it is a stale value, the strong locker has yet to acquire the
319per-backend LWLock we now hold (or, indeed, even the first per-backend LWLock)
320and will notice any weak lock we take when it does.
321
322Fast-path VXID locks do not use the FastPathStrongRelationLocks table.  The
323first lock taken on a VXID is always the ExclusiveLock taken by its owner.
324Any subsequent lockers are share lockers waiting for the VXID to terminate.
325Indeed, the only reason VXID locks use the lock manager at all (rather than
326waiting for the VXID to terminate via some other method) is for deadlock
327detection.  Thus, the initial VXID lock can *always* be taken via the fast
328path without checking for conflicts.  Any subsequent locker must check
329whether the lock has been transferred to the main lock table, and if not,
330do so.  The backend owning the VXID must be careful to clean up any entry
331made in the main lock table at end of transaction.
332
333Deadlock detection does not need to examine the fast-path data structures,
334because any lock that could possibly be involved in a deadlock must have
335been transferred to the main tables beforehand.
336
337
338The Deadlock Detection Algorithm
339--------------------------------
340
341Since we allow user transactions to request locks in any order, deadlock
342is possible.  We use a deadlock detection/breaking algorithm that is
343fairly standard in essence, but there are many special considerations
344needed to deal with Postgres' generalized locking model.
345
346A key design consideration is that we want to make routine operations
347(lock grant and release) run quickly when there is no deadlock, and
348avoid the overhead of deadlock handling as much as possible.  We do this
349using an "optimistic waiting" approach: if a process cannot acquire the
350lock it wants immediately, it goes to sleep without any deadlock check.
351But it also sets a delay timer, with a delay of DeadlockTimeout
352milliseconds (typically set to one second).  If the delay expires before
353the process is granted the lock it wants, it runs the deadlock
354detection/breaking code. Normally this code will determine that there is
355no deadlock condition, and then the process will go back to sleep and
356wait quietly until it is granted the lock.  But if a deadlock condition
357does exist, it will be resolved, usually by aborting the detecting
358process' transaction.  In this way, we avoid deadlock handling overhead
359whenever the wait time for a lock is less than DeadlockTimeout, while
360not imposing an unreasonable delay of detection when there is an error.
361
362Lock acquisition (routines LockAcquire and ProcSleep) follows these rules:
363
3641. A lock request is granted immediately if it does not conflict with
365any existing or waiting lock request, or if the process already holds an
366instance of the same lock type (eg, there's no penalty to acquire a read
367lock twice).  Note that a process never conflicts with itself, eg one
368can obtain read lock when one already holds exclusive lock.
369
3702. Otherwise the process joins the lock's wait queue.  Normally it will
371be added to the end of the queue, but there is an exception: if the
372process already holds locks on this same lockable object that conflict
373with the request of any pending waiter, then the process will be
374inserted in the wait queue just ahead of the first such waiter.  (If we
375did not make this check, the deadlock detection code would adjust the
376queue order to resolve the conflict, but it's relatively cheap to make
377the check in ProcSleep and avoid a deadlock timeout delay in this case.)
378Note special case when inserting before the end of the queue: if the
379process's request does not conflict with any existing lock nor any
380waiting request before its insertion point, then go ahead and grant the
381lock without waiting.
382
383When a lock is released, the lock release routine (ProcLockWakeup) scans
384the lock object's wait queue.  Each waiter is awoken if (a) its request
385does not conflict with already-granted locks, and (b) its request does
386not conflict with the requests of prior un-wakable waiters.  Rule (b)
387ensures that conflicting requests are granted in order of arrival. There
388are cases where a later waiter must be allowed to go in front of
389conflicting earlier waiters to avoid deadlock, but it is not
390ProcLockWakeup's responsibility to recognize these cases; instead, the
391deadlock detection code will re-order the wait queue when necessary.
392
393To perform deadlock checking, we use the standard method of viewing the
394various processes as nodes in a directed graph (the waits-for graph or
395WFG).  There is a graph edge leading from process A to process B if A
396waits for B, ie, A is waiting for some lock and B holds a conflicting
397lock.  There is a deadlock condition if and only if the WFG contains a
398cycle.  We detect cycles by searching outward along waits-for edges to
399see if we return to our starting point.  There are three possible
400outcomes:
401
4021. All outgoing paths terminate at a running process (which has no
403outgoing edge).
404
4052. A deadlock is detected by looping back to the start point.  We
406resolve such a deadlock by canceling the start point's lock request and
407reporting an error in that transaction, which normally leads to
408transaction abort and release of that transaction's held locks.  Note
409that it's sufficient to cancel one request to remove the cycle; we don't
410need to kill all the transactions involved.
411
4123. Some path(s) loop back to a node other than the start point.  This
413indicates a deadlock, but one that does not involve our starting
414process. We ignore this condition on the grounds that resolving such a
415deadlock is the responsibility of the processes involved --- killing our
416start-point process would not resolve the deadlock.  So, cases 1 and 3
417both report "no deadlock".
418
419Postgres' situation is a little more complex than the standard discussion
420of deadlock detection, for two reasons:
421
4221. A process can be waiting for more than one other process, since there
423might be multiple PROCLOCKs of (non-conflicting) lock types that all
424conflict with the waiter's request.  This creates no real difficulty
425however; we simply need to be prepared to trace more than one outgoing
426edge.
427
4282. If a process A is behind a process B in some lock's wait queue, and
429their requested locks conflict, then we must say that A waits for B, since
430ProcLockWakeup will never awaken A before B.  This creates additional
431edges in the WFG.  We call these "soft" edges, as opposed to the "hard"
432edges induced by locks already held.  Note that if B already holds any
433locks conflicting with A's request, then their relationship is a hard edge
434not a soft edge.
435
436A "soft" block, or wait-priority block, has the same potential for
437inducing deadlock as a hard block.  However, we may be able to resolve
438a soft block without aborting the transactions involved: we can instead
439rearrange the order of the wait queue.  This rearrangement reverses the
440direction of the soft edge between two processes with conflicting requests
441whose queue order is reversed.  If we can find a rearrangement that
442eliminates a cycle without creating new ones, then we can avoid an abort.
443Checking for such possible rearrangements is the trickiest part of the
444algorithm.
445
446The workhorse of the deadlock detector is a routine FindLockCycle() which
447is given a starting point process (which must be a waiting process).
448It recursively scans outward across waits-for edges as discussed above.
449If it finds no cycle involving the start point, it returns "false".
450(As discussed above, we can ignore cycles not involving the start point.)
451When such a cycle is found, FindLockCycle() returns "true", and as it
452unwinds it also builds a list of any "soft" edges involved in the cycle.
453If the resulting list is empty then there is a hard deadlock and the
454configuration cannot succeed.  However, if the list is not empty, then
455reversing any one of the listed edges through wait-queue rearrangement
456will eliminate that cycle.  Since such a reversal might create cycles
457elsewhere, we may need to try every possibility.  Therefore, we need to
458be able to invoke FindLockCycle() on hypothetical configurations (wait
459orders) as well as the current real order.
460
461The easiest way to handle this seems to be to have a lookaside table that
462shows the proposed new queue order for each wait queue that we are
463considering rearranging.  This table is checked by FindLockCycle, and it
464believes the proposed queue order rather than the real order for each lock
465that has an entry in the lookaside table.
466
467We build a proposed new queue order by doing a "topological sort" of the
468existing entries.  Each soft edge that we are currently considering
469reversing creates a property of the partial order that the topological sort
470has to enforce.  We must use a sort method that preserves the input
471ordering as much as possible, so as not to gratuitously break arrival
472order for processes not involved in a deadlock.  (This is not true of the
473tsort method shown in Knuth, for example, but it's easily done by a simple
474doubly-nested-loop method that emits the first legal candidate at each
475step.  Fortunately, we don't need a highly efficient sort algorithm, since
476the number of partial order constraints is not likely to be large.)  Note
477that failure of the topological sort tells us we have conflicting ordering
478constraints, and therefore that the last-added soft edge reversal
479conflicts with a prior edge reversal.  We need to detect this case to
480avoid an infinite loop in the case where no possible rearrangement will
481work: otherwise, we might try a reversal, find that it still leads to
482a cycle, then try to un-reverse the reversal while trying to get rid of
483that cycle, etc etc.  Topological sort failure tells us the un-reversal
484is not a legitimate move in this context.
485
486So, the basic step in our rearrangement method is to take a list of
487soft edges in a cycle (as returned by FindLockCycle()) and successively
488try the reversal of each one as a topological-sort constraint added to
489whatever constraints we are already considering.  We recursively search
490through all such sets of constraints to see if any one eliminates all
491the deadlock cycles at once.  Although this might seem impossibly
492inefficient, it shouldn't be a big problem in practice, because there
493will normally be very few, and not very large, deadlock cycles --- if
494any at all.  So the combinatorial inefficiency isn't going to hurt us.
495Besides, it's better to spend some time to guarantee that we've checked
496all possible escape routes than to abort a transaction when we didn't
497really have to.
498
499Each edge reversal constraint can be viewed as requesting that the waiting
500process A be moved to before the blocking process B in the wait queue they
501are both in.  This action will reverse the desired soft edge, as well as
502any other soft edges between A and other processes it is advanced over.
503No other edges will be affected (note this is actually a constraint on our
504topological sort method to not re-order the queue more than necessary.)
505Therefore, we can be sure we have not created any new deadlock cycles if
506neither FindLockCycle(A) nor FindLockCycle(B) discovers any cycle.  Given
507the above-defined behavior of FindLockCycle, each of these searches is
508necessary as well as sufficient, since FindLockCycle starting at the
509original start point will not complain about cycles that include A or B
510but not the original start point.
511
512In short then, a proposed rearrangement of the wait queue(s) is determined
513by one or more broken soft edges A->B, fully specified by the output of
514topological sorts of each wait queue involved, and then tested by invoking
515FindLockCycle() starting at the original start point as well as each of
516the mentioned processes (A's and B's).  If none of the tests detect a
517cycle, then we have a valid configuration and can implement it by
518reordering the wait queues per the sort outputs (and then applying
519ProcLockWakeup on each reordered queue, in case a waiter has become wakable).
520If any test detects a soft cycle, we can try to resolve it by adding each
521soft link in that cycle, in turn, to the proposed rearrangement list.
522This is repeated recursively until we either find a workable rearrangement
523or determine that none exists.  In the latter case, the outer level
524resolves the deadlock by aborting the original start-point transaction.
525
526The particular order in which rearrangements are tried depends on the
527order FindLockCycle() happens to scan in, so if there are multiple
528workable rearrangements of the wait queues, then it is unspecified which
529one will be chosen.  What's more important is that we guarantee to try
530every queue rearrangement that could lead to success.  (For example,
531if we have A before B before C and the needed order constraints are
532C before A and B before C, we would first discover that A before C
533doesn't work and try the rearrangement C before A before B.  This would
534eventually lead to the discovery of the additional constraint B before C.)
535
536Got that?
537
538Miscellaneous Notes
539-------------------
540
5411. It is easily proven that no deadlock will be missed due to our
542asynchronous invocation of deadlock checking.  A deadlock cycle in the WFG
543is formed when the last edge in the cycle is added; therefore the last
544process in the cycle to wait (the one from which that edge is outgoing) is
545certain to detect and resolve the cycle when it later runs CheckDeadLock.
546This holds even if that edge addition created multiple cycles; the process
547may indeed abort without ever noticing those additional cycles, but we
548don't particularly care.  The only other possible creation of deadlocks is
549during deadlock resolution's rearrangement of wait queues, and we already
550saw that that algorithm will prove that it creates no new deadlocks before
551it attempts to actually execute any rearrangement.
552
5532. It is not certain that a deadlock will be resolved by aborting the
554last-to-wait process.  If earlier waiters in the cycle have not yet run
555CheckDeadLock, then the first one to do so will be the victim.
556
5573. No live (wakable) process can be missed by ProcLockWakeup, since it
558examines every member of the wait queue (this was not true in the 7.0
559implementation, BTW).  Therefore, if ProcLockWakeup is always invoked
560after a lock is released or a wait queue is rearranged, there can be no
561failure to wake a wakable process.  One should also note that
562LockErrorCleanup (abort a waiter due to outside factors) must run
563ProcLockWakeup, in case the canceled waiter was soft-blocking other
564waiters.
565
5664. We can minimize excess rearrangement-trial work by being careful to
567scan the wait queue from the front when looking for soft edges.  For
568example, if we have queue order A,B,C and C has deadlock conflicts with
569both A and B, we want to generate the "C before A" constraint first,
570rather than wasting time with "C before B", which won't move C far
571enough up.  So we look for soft edges outgoing from C starting at the
572front of the wait queue.
573
5745. The working data structures needed by the deadlock detection code can
575be limited to numbers of entries computed from MaxBackends.  Therefore,
576we can allocate the worst-case space needed during backend startup. This
577seems a safer approach than trying to allocate workspace on the fly; we
578don't want to risk having the deadlock detector run out of memory, else
579we really have no guarantees at all that deadlock will be detected.
580
5816. We abuse the deadlock detector to implement autovacuum cancellation.
582When we run the detector and we find that there's an autovacuum worker
583involved in the waits-for graph, we store a pointer to its PGPROC, and
584return a special return code (unless a hard deadlock has been detected).
585The caller can then send a cancellation signal.  This implements the
586principle that autovacuum has a low locking priority (eg it must not block
587DDL on the table).
588
589Group Locking
590-------------
591
592As if all of that weren't already complicated enough, PostgreSQL now supports
593parallelism (see src/backend/access/transam/README.parallel), which means that
594we might need to resolve deadlocks that occur between gangs of related
595processes rather than individual processes.  This doesn't change the basic
596deadlock detection algorithm very much, but it makes the bookkeeping more
597complicated.
598
599We choose to regard locks held by processes in the same parallel group as
600non-conflicting.  This means that two processes in a parallel group can hold a
601self-exclusive lock on the same relation at the same time, or one process can
602acquire an AccessShareLock while the other already holds AccessExclusiveLock.
603This might seem dangerous and could be in some cases (more on that below), but
604if we didn't do this then parallel query would be extremely prone to
605self-deadlock.  For example, a parallel query against a relation on which the
606leader already had AccessExclusiveLock would hang, because the workers would
607try to lock the same relation and be blocked by the leader; yet the leader
608can't finish until it receives completion indications from all workers.  An
609undetected deadlock results.  This is far from the only scenario where such a
610problem happens.  The same thing will occur if the leader holds only
611AccessShareLock, the worker seeks AccessShareLock, but between the time the
612leader attempts to acquire the lock and the time the worker attempts to
613acquire it, some other process queues up waiting for an AccessExclusiveLock.
614In this case, too, an indefinite hang results.
615
616It might seem that we could predict which locks the workers will attempt to
617acquire and ensure before going parallel that those locks would be acquired
618successfully.  But this is very difficult to make work in a general way.  For
619example, a parallel worker's portion of the query plan could involve an
620SQL-callable function which generates a query dynamically, and that query
621might happen to hit a table on which the leader happens to hold
622AccessExclusiveLock.  By imposing enough restrictions on what workers can do,
623we could eventually create a situation where their behavior can be adequately
624restricted, but these restrictions would be fairly onerous, and even then, the
625system required to decide whether the workers will succeed at acquiring the
626necessary locks would be complex and possibly buggy.
627
628So, instead, we take the approach of deciding that locks within a lock group
629do not conflict.  This eliminates the possibility of an undetected deadlock,
630but also opens up some problem cases: if the leader and worker try to do some
631operation at the same time which would ordinarily be prevented by the
632heavyweight lock mechanism, undefined behavior might result.  In practice, the
633dangers are modest.  The leader and worker share the same transaction,
634snapshot, and combo CID hash, and neither can perform any DDL or, indeed,
635write any data at all.  Thus, for either to read a table locked exclusively by
636the other is safe enough.  Problems would occur if the leader initiated
637parallelism from a point in the code at which it had some backend-private
638state that made table access from another process unsafe, for example after
639calling SetReindexProcessing and before calling ResetReindexProcessing,
640catastrophe could ensue, because the worker won't have that state.  Similarly,
641problems could occur with certain kinds of non-relation locks, such as
642relation extension locks.  It's no safer for two related processes to extend
643the same relation at the time than for unrelated processes to do the same.
644However, since parallel mode is strictly read-only at present, neither this
645nor most of the similar cases can arise at present.  To allow parallel writes,
646we'll either need to (1) further enhance the deadlock detector to handle those
647types of locks in a different way than other types; or (2) have parallel
648workers use some other mutual exclusion method for such cases; or (3) revise
649those cases so that they no longer use heavyweight locking in the first place
650(which is not a crazy idea, given that such lock acquisitions are not expected
651to deadlock and that heavyweight lock acquisition is fairly slow anyway).
652
653Group locking adds three new members to each PGPROC: lockGroupLeader,
654lockGroupMembers, and lockGroupLink. A PGPROC's lockGroupLeader is NULL for
655processes not involved in parallel query. When a process wants to cooperate
656with parallel workers, it becomes a lock group leader, which means setting
657this field to point to its own PGPROC. When a parallel worker starts up, it
658points this field at the leader. The lockGroupMembers field is only used in
659the leader; it is a list of the member PGPROCs of the lock group (the leader
660and all workers). The lockGroupLink field is the list link for this list.
661
662All three of these fields are considered to be protected by a lock manager
663partition lock.  The partition lock that protects these fields within a given
664lock group is chosen by taking the leader's pgprocno modulo the number of lock
665manager partitions.  This unusual arrangement has a major advantage: the
666deadlock detector can count on the fact that no lockGroupLeader field can
667change while the deadlock detector is running, because it knows that it holds
668all the lock manager locks.  Also, holding this single lock allows safe
669manipulation of the lockGroupMembers list for the lock group.
670
671We need an additional interlock when setting these fields, because a newly
672started parallel worker has to try to join the leader's lock group, but it
673has no guarantee that the group leader is still alive by the time it gets
674started.  We try to ensure that the parallel leader dies after all workers
675in normal cases, but also that the system could survive relatively intact
676if that somehow fails to happen.  This is one of the precautions against
677such a scenario: the leader relays its PGPROC and also its PID to the
678worker, and the worker fails to join the lock group unless the given PGPROC
679still has the same PID and is still a lock group leader.  We assume that
680PIDs are not recycled quickly enough for this interlock to fail.
681
682
683User Locks (Advisory Locks)
684---------------------------
685
686User locks are handled totally on the application side as long term
687cooperative locks which may extend beyond the normal transaction boundaries.
688Their purpose is to indicate to an application that someone is `working'
689on an item.  So it is possible to put an user lock on a tuple's oid,
690retrieve the tuple, work on it for an hour and then update it and remove
691the lock.  While the lock is active other clients can still read and write
692the tuple but they can be aware that it has been locked at the application
693level by someone.
694
695User locks and normal locks are completely orthogonal and they don't
696interfere with each other.
697
698User locks can be acquired either at session level or transaction level.
699A session-level lock request is not automatically released at transaction
700end, but must be explicitly released by the application.  (However, any
701remaining locks are always released at session end.)  Transaction-level
702user lock requests behave the same as normal lock requests, in that they
703are released at transaction end and do not need explicit unlocking.
704
705Locking during Hot Standby
706--------------------------
707
708The Startup process is the only backend that can make changes during
709recovery, all other backends are read only.  As a result the Startup
710process does not acquire locks on relations or objects except when the lock
711level is AccessExclusiveLock.
712
713Regular backends are only allowed to take locks on relations or objects
714at RowExclusiveLock or lower. This ensures that they do not conflict with
715each other or with the Startup process, unless AccessExclusiveLocks are
716requested by the Startup process.
717
718Deadlocks involving AccessExclusiveLocks are not possible, so we need
719not be concerned that a user initiated deadlock can prevent recovery from
720progressing.
721
722AccessExclusiveLocks on the primary or master node generate WAL records
723that are then applied by the Startup process. Locks are released at end
724of transaction just as they are in normal processing. These locks are
725held by the Startup process, acting as a proxy for the backends that
726originally acquired these locks. Again, these locks cannot conflict with
727one another, so the Startup process cannot deadlock itself either.
728
729Although deadlock is not possible, a regular backend's weak lock can
730prevent the Startup process from making progress in applying WAL, which is
731usually not something that should be tolerated for very long.  Mechanisms
732exist to forcibly cancel a regular backend's query if it blocks the
733Startup process for too long.
734

README-SSI

1src/backend/storage/lmgr/README-SSI
2
3Serializable Snapshot Isolation (SSI) and Predicate Locking
4===========================================================
5
6This code is in the lmgr directory because about 90% of it is an
7implementation of predicate locking, which is required for SSI,
8rather than being directly related to SSI itself.  When another use
9for predicate locking justifies the effort to tease these two things
10apart, this README file should probably be split.
11
12
13Credits
14-------
15
16This feature was developed by Kevin Grittner and Dan R. K. Ports,
17with review and suggestions from Joe Conway, Heikki Linnakangas, and
18Jeff Davis.  It is based on work published in these papers:
19
20	Michael J. Cahill, Uwe Röhm, and Alan D. Fekete. 2008.
21	Serializable isolation for snapshot databases.
22	In SIGMOD '08: Proceedings of the 2008 ACM SIGMOD
23	international conference on Management of data,
24	pages 729-738, New York, NY, USA. ACM.
25	http://doi.acm.org/10.1145/1376616.1376690
26
27	Michael James Cahill. 2009.
28	Serializable Isolation for Snapshot Databases.
29	Sydney Digital Theses.
30	University of Sydney, School of Information Technologies.
31	http://hdl.handle.net/2123/5353
32
33
34Overview
35--------
36
37With true serializable transactions, if you can show that your
38transaction will do the right thing if there are no concurrent
39transactions, it will do the right thing in any mix of serializable
40transactions or be rolled back with a serialization failure.  This
41feature has been implemented in PostgreSQL using SSI.
42
43
44Serializable and Snapshot Transaction Isolation Levels
45------------------------------------------------------
46
47Serializable transaction isolation is attractive for shops with
48active development by many programmers against a complex schema
49because it guarantees data integrity with very little staff time --
50if a transaction can be shown to always do the right thing when it is
51run alone (before or after any other transaction), it will always do
52the right thing in any mix of concurrent serializable transactions.
53Where conflicts with other transactions would result in an
54inconsistent state within the database or an inconsistent view of
55the data, a serializable transaction will block or roll back to
56prevent the anomaly. The SQL standard provides a specific SQLSTATE
57for errors generated when a transaction rolls back for this reason,
58so that transactions can be retried automatically.
59
60Before version 9.1, PostgreSQL did not support a full serializable
61isolation level. A request for serializable transaction isolation
62actually provided snapshot isolation. This has well known anomalies
63which can allow data corruption or inconsistent views of the data
64during concurrent transactions; although these anomalies only occur
65when certain patterns of read-write dependencies exist within a set
66of concurrent transactions. Where these patterns exist, the anomalies
67can be prevented by introducing conflicts through explicitly
68programmed locks or otherwise unnecessary writes to the database.
69Snapshot isolation is popular because performance is better than
70serializable isolation and the integrity guarantees which it does
71provide allow anomalies to be avoided or managed with reasonable
72effort in many environments.
73
74
75Serializable Isolation Implementation Strategies
76------------------------------------------------
77
78Techniques for implementing full serializable isolation have been
79published and in use in many database products for decades. The
80primary technique which has been used is Strict Two-Phase Locking
81(S2PL), which operates by blocking writes against data which has been
82read by concurrent transactions and blocking any access (read or
83write) against data which has been written by concurrent
84transactions. A cycle in a graph of blocking indicates a deadlock,
85requiring a rollback. Blocking and deadlocks under S2PL in high
86contention workloads can be debilitating, crippling throughput and
87response time.
88
89A new technique for implementing full serializable isolation in an
90MVCC database appears in the literature beginning in 2008. This
91technique, known as Serializable Snapshot Isolation (SSI) has many of
92the advantages of snapshot isolation. In particular, reads don't
93block anything and writes don't block reads. Essentially, it runs
94snapshot isolation but monitors the read-write conflicts between
95transactions to identify dangerous structures in the transaction
96graph which indicate that a set of concurrent transactions might
97produce an anomaly, and rolls back transactions to ensure that no
98anomalies occur. It will produce some false positives (where a
99transaction is rolled back even though there would not have been an
100anomaly), but will never let an anomaly occur. In the two known
101prototype implementations, performance for many workloads (even with
102the need to restart transactions which are rolled back) is very close
103to snapshot isolation and generally far better than an S2PL
104implementation.
105
106
107Apparent Serial Order of Execution
108----------------------------------
109
110One way to understand when snapshot anomalies can occur, and to
111visualize the difference between the serializable implementations
112described above, is to consider that among transactions executing at
113the serializable transaction isolation level, the results are
114required to be consistent with some serial (one-at-a-time) execution
115of the transactions [1]. How is that order determined in each?
116
117In S2PL, each transaction locks any data it accesses. It holds the
118locks until committing, preventing other transactions from making
119conflicting accesses to the same data in the interim. Some
120transactions may have to be rolled back to prevent deadlock. But
121successful transactions can always be viewed as having occurred
122sequentially, in the order they committed.
123
124With snapshot isolation, reads never block writes, nor vice versa, so
125more concurrency is possible. The order in which transactions appear
126to have executed is determined by something more subtle than in S2PL:
127read/write dependencies. If a transaction reads data, it appears to
128execute after the transaction that wrote the data it is reading.
129Similarly, if it updates data, it appears to execute after the
130transaction that wrote the previous version. These dependencies, which
131we call "wr-dependencies" and "ww-dependencies", are consistent with
132the commit order, because the first transaction must have committed
133before the second starts. However, there can also be dependencies
134between two *concurrent* transactions, i.e. where one was running when
135the other acquired its snapshot.  These "rw-conflicts" occur when one
136transaction attempts to read data which is not visible to it because
137the transaction which wrote it (or will later write it) is
138concurrent. The reading transaction appears to have executed first,
139regardless of the actual sequence of transaction starts or commits,
140because it sees a database state prior to that in which the other
141transaction leaves it.
142
143Anomalies occur when a cycle is created in the graph of dependencies:
144when a dependency or series of dependencies causes transaction A to
145appear to have executed before transaction B, but another series of
146dependencies causes B to appear before A. If that's the case, then
147the results can't be consistent with any serial execution of the
148transactions.
149
150
151SSI Algorithm
152-------------
153
154As of 9.1, serializable transactions in PostgreSQL are implemented using
155Serializable Snapshot Isolation (SSI), based on the work of Cahill
156et al. Fundamentally, this allows snapshot isolation to run as it
157previously did, while monitoring for conditions which could create a
158serialization anomaly.
159
160SSI is based on the observation [2] that each snapshot isolation
161anomaly corresponds to a cycle that contains a "dangerous structure"
162of two adjacent rw-conflict edges:
163
164      Tin ------> Tpivot ------> Tout
165            rw             rw
166
167SSI works by watching for this dangerous structure, and rolling
168back a transaction when needed to prevent any anomaly. This means it
169only needs to track rw-conflicts between concurrent transactions, not
170wr- and ww-dependencies. It also means there is a risk of false
171positives, because not every dangerous structure is embedded in an
172actual cycle.  The number of false positives is low in practice, so
173this represents an acceptable tradeoff for keeping the detection
174overhead low.
175
176The PostgreSQL implementation uses two additional optimizations:
177
178* Tout must commit before any other transaction in the cycle
179  (see proof of Theorem 2.1 of [2]). We only roll back a transaction
180  if Tout commits before Tpivot and Tin.
181
182* if Tin is read-only, there can only be an anomaly if Tout committed
183  before Tin takes its snapshot. This optimization is an original
184  one. Proof:
185
186  - Because there is a cycle, there must be some transaction T0 that
187    precedes Tin in the cycle. (T0 might be the same as Tout.)
188
189  - The edge between T0 and Tin can't be a rw-conflict or ww-dependency,
190    because Tin was read-only, so it must be a wr-dependency.
191    Those can only occur if T0 committed before Tin took its snapshot,
192    else Tin would have ignored T0's output.
193
194  - Because Tout must commit before any other transaction in the
195    cycle, it must commit before T0 commits -- and thus before Tin
196    starts.
197
198
199PostgreSQL Implementation
200-------------------------
201
202    * Since this technique is based on Snapshot Isolation (SI), those
203areas in PostgreSQL which don't use SI can't be brought under SSI.
204This includes system tables, temporary tables, sequences, hint bit
205rewrites, etc.  SSI can not eliminate existing anomalies in these
206areas.
207
208    * Any transaction which is run at a transaction isolation level
209other than SERIALIZABLE will not be affected by SSI.  If you want to
210enforce business rules through SSI, all transactions should be run at
211the SERIALIZABLE transaction isolation level, and that should
212probably be set as the default.
213
214    * If all transactions are run at the SERIALIZABLE transaction
215isolation level, business rules can be enforced in triggers or
216application code without ever having a need to acquire an explicit
217lock or to use SELECT FOR SHARE or SELECT FOR UPDATE.
218
219    * Those who want to continue to use snapshot isolation without
220the additional protections of SSI (and the associated costs of
221enforcing those protections), can use the REPEATABLE READ transaction
222isolation level.  This level retains its legacy behavior, which
223is identical to the old SERIALIZABLE implementation and fully
224consistent with the standard's requirements for the REPEATABLE READ
225transaction isolation level.
226
227    * Performance under this SSI implementation will be significantly
228improved if transactions which don't modify permanent tables are
229declared to be READ ONLY before they begin reading data.
230
231    * Performance under SSI will tend to degrade more rapidly with a
232large number of active database transactions than under less strict
233isolation levels.  Limiting the number of active transactions through
234use of a connection pool or similar techniques may be necessary to
235maintain good performance.
236
237    * Any transaction which must be rolled back to prevent
238serialization anomalies will fail with SQLSTATE 40001, which has a
239standard meaning of "serialization failure".
240
241    * This SSI implementation makes an effort to choose the
242transaction to be canceled such that an immediate retry of the
243transaction will not fail due to conflicts with exactly the same
244transactions.  Pursuant to this goal, no transaction is canceled
245until one of the other transactions in the set of conflicts which
246could generate an anomaly has successfully committed.  This is
247conceptually similar to how write conflicts are handled.  To fully
248implement this guarantee there needs to be a way to roll back the
249active transaction for another process with a serialization failure
250SQLSTATE, even if it is "idle in transaction".
251
252
253Predicate Locking
254-----------------
255
256Both S2PL and SSI require some form of predicate locking to handle
257situations where reads conflict with later inserts or with later
258updates which move data into the selected range.  PostgreSQL didn't
259already have predicate locking, so it needed to be added to support
260full serializable transactions under either strategy. Practical
261implementations of predicate locking generally involve acquiring
262locks against data as it is accessed, using multiple granularities
263(tuple, page, table, etc.) with escalation as needed to keep the lock
264count to a number which can be tracked within RAM structures.  This
265approach was used in PostgreSQL.  Coarse granularities can cause some
266false positive indications of conflict. The number of false positives
267can be influenced by plan choice.
268
269
270Implementation overview
271-----------------------
272
273New RAM structures, inspired by those used to track traditional locks
274in PostgreSQL, but tailored to the needs of SIREAD predicate locking,
275are used.  These refer to physical objects actually accessed in the
276course of executing the query, to model the predicates through
277inference.  Anyone interested in this subject should review the
278Hellerstein, Stonebraker and Hamilton paper [3], along with the
279locking papers referenced from that and the Cahill papers.
280
281Because the SIREAD locks don't block, traditional locking techniques
282have to be modified.  Intent locking (locking higher level objects
283before locking lower level objects) doesn't work with non-blocking
284"locks" (which are, in some respects, more like flags than locks).
285
286A configurable amount of shared memory is reserved at postmaster
287start-up to track predicate locks (see the max_pred_locks_per_transaction
288setting). This size cannot be changed without a restart.
289
290To prevent resource exhaustion, multiple fine-grained locks may
291be promoted to a single coarser-grained lock as needed.
292
293An attempt to acquire an SIREAD lock on a tuple when the same
294transaction already holds an SIREAD lock on the page or the relation
295will be ignored. Likewise, an attempt to lock a page when the
296relation is locked will be ignored, and the acquisition of a coarser
297lock will result in the automatic release of all finer-grained locks
298it covers.
299
300
301Heap locking
302------------
303
304Predicate locks will be acquired for the heap based on the following:
305
306    * For a table scan, the entire relation will be locked.
307
308    * Each tuple read which is visible to the reading transaction
309will be locked, whether or not it meets selection criteria; except
310that there is no need to acquire an SIREAD lock on a tuple when the
311transaction already holds a write lock on any tuple representing the
312row, since a rw-conflict would also create a ww-dependency which
313has more aggressive enforcement and thus will prevent any anomaly.
314
315    * Modifying a heap tuple creates a rw-conflict with any transaction
316that holds a SIREAD lock on that tuple, or on the page or relation
317that contains it.
318
319    * Inserting a new tuple creates a rw-conflict with any transaction
320holding a SIREAD lock on the entire relation. It doesn't conflict with
321page-level locks, because page-level locks are only used to aggregate
322tuple locks. Unlike index page locks, they don't lock "gaps" on the page.
323
324
325Index AM implementations
326------------------------
327
328Since predicate locks only exist to detect writes which conflict with
329earlier reads, and heap tuple locks are acquired to cover all heap
330tuples actually read, including those read through indexes, the index
331tuples which were actually scanned are not of interest in themselves;
332we only care about their "new neighbors" -- later inserts into the
333index which would have been included in the scan had they existed at
334the time.  Conceptually, we want to lock the gaps between and
335surrounding index entries within the scanned range.
336
337Correctness requires that any insert into an index generates a
338rw-conflict with a concurrent serializable transaction if, after that
339insert, re-execution of any index scan of the other transaction would
340access the heap for a row not accessed during the previous execution.
341Note that a non-HOT update which expires an old index entry covered
342by the scan and adds a new entry for the modified row's new tuple
343need not generate a conflict, although an update which "moves" a row
344into the scan must generate a conflict.  While correctness allows
345false positives, they should be minimized for performance reasons.
346
347Several optimizations are possible, though not all are implemented yet:
348
349    * An index scan which is just finding the right position for an
350index insertion or deletion need not acquire a predicate lock.
351
352    * An index scan which is comparing for equality on the entire key
353for a unique index need not acquire a predicate lock as long as a key
354is found corresponding to a visible tuple which has not been modified
355by another transaction -- there are no "between or around" gaps to
356cover.
357
358    * As long as built-in foreign key enforcement continues to use
359its current "special tricks" to deal with MVCC issues, predicate
360locks should not be needed for scans done by enforcement code.
361
362    * If a search determines that no rows can be found regardless of
363index contents because the search conditions are contradictory (e.g.,
364x = 1 AND x = 2), then no predicate lock is needed.
365
366Other index AM implementation considerations:
367
368    * For an index AM that doesn't have support for predicate locking,
369we just acquire a predicate lock on the whole index for any search.
370
371    * B-tree index searches acquire predicate locks only on the
372index *leaf* pages needed to lock the appropriate index range. If,
373however, a search discovers that no root page has yet been created, a
374predicate lock on the index relation is required.  (See the sketch
375after this list.)
375
376    * GiST searches can determine that there are no matches at any
377level of the index, so there must be a predicate lock at each index
378level during a GiST search. An index insert at the leaf level can
379then be trusted to ripple up to all levels and locations where
380conflicting predicate locks may exist.
381
382    * The effects of page splits, overflows, consolidations, and
383removals must be carefully reviewed to ensure that predicate locks
384aren't "lost" during those operations, or kept with pages which could
385get re-used for different parts of the index.
386
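The B-tree leaf-page locking described above can be sketched as follows.
The entry points are those declared in storage/predicate.h in recent
releases, but the wrapper functions and comments here are purely
illustrative; the real logic lives in the btree AM (see, for example,
_bt_readpage() and _bt_doinsert()) and is considerably more involved.

    #include "postgres.h"
    #include "storage/bufmgr.h"
    #include "storage/predicate.h"
    #include "utils/snapshot.h"

    /* While scanning: SIREAD-lock each leaf page the scan visits, so that
     * later inserts into the covered key range can be detected. */
    static void
    scan_leaf_page(Relation rel, Buffer leafbuf, Snapshot snapshot)
    {
        PredicateLockPage(rel, BufferGetBlockNumber(leafbuf), snapshot);
        /* ... then examine the index items on the page as usual ... */
    }

    /* While inserting: flag a rw-conflict with any serializable transaction
     * holding a predicate lock on this leaf page (or on the whole index). */
    static void
    insert_into_leaf_page(Relation rel, Buffer leafbuf)
    {
        CheckForSerializableConflictIn(rel, NULL, BufferGetBlockNumber(leafbuf));
        /* ... then perform the actual insertion ... */
    }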
387
388Innovations
389-----------
390
391The PostgreSQL implementation of Serializable Snapshot Isolation
392differs from what is described in the cited papers for several
393reasons:
394
395   1. PostgreSQL didn't have any existing predicate locking. It had
396to be added from scratch.
397
398   2. The existing in-memory lock structures were not suitable for
399tracking SIREAD locks.
400          * In PostgreSQL, tuple level locks are not held in RAM for
401any length of time; lock information is written to the tuples
402involved in the transactions.
403          * In PostgreSQL, existing lock structures have pointers to
404memory which is related to a session. SIREAD locks need to persist
405past the end of the originating transaction and even the session
406which ran it.
407          * PostgreSQL needs to be able to tolerate a large number of
408transactions executing while one long-running transaction stays open
409-- the in-RAM techniques discussed in the papers wouldn't support
410that.
411
412   3. Unlike the database products used for the prototypes described
413in the papers, PostgreSQL didn't already have a true serializable
414isolation level distinct from snapshot isolation.
415
416   4. PostgreSQL supports subtransactions -- an issue not mentioned
417in the papers.
418
419   5. PostgreSQL doesn't assign a transaction number to a database
420transaction until and unless necessary (normally, when the transaction
421attempts to modify data).
422
423   6. PostgreSQL has pluggable data types with user-definable
424operators, as well as pluggable index types, not all of which are
425based around data types which support ordering.
426
427   7. Some possible optimizations became apparent during development
428and testing.
429
430Differences from the implementation described in the papers are
431listed below.
432
433    * New structures needed to be created in shared memory to track
434the proper information for serializable transactions and their SIREAD
435locks.
436
437    * Because PostgreSQL does not have the same concept of an "oldest
438transaction ID" for all serializable transactions as assumed in the
439Cahill thesis, we track the oldest snapshot xmin among serializable
440transactions, and a count of how many active transactions use that
441xmin. When the count hits zero we find the new oldest xmin and run a
442clean-up based on that.
443
444    * Because reads in a subtransaction may cause that subtransaction
445to roll back, thereby affecting what is written by the top level
446transaction, predicate locks must survive a subtransaction rollback.
447As a consequence, all xid usage in SSI, including predicate locking,
448is based on the top level xid.  When looking at an xid that comes
449from a tuple's xmin or xmax, for example, we always call
450SubTransGetTopmostTransaction() before doing much else with it.
451
452    * PostgreSQL does not use "update in place" with a rollback log
453for its MVCC implementation.  Where possible it uses "HOT" updates on
454the same page (if there is room and no indexed value is changed).
455For non-HOT updates the old tuple is expired in place and a new tuple
456is inserted at a new location.  Because of this difference, a tuple
457lock in PostgreSQL doesn't automatically lock any other versions of a
458row.  We don't try to copy or expand a tuple lock to any other
459versions of the row, based on the following proof that any additional
460serialization failures we would get from that would be false
461positives:
462
463          o If transaction T1 reads a row version (thus acquiring a
464predicate lock on it) and a second transaction T2 updates that row
465version (thus creating a rw-conflict graph edge from T1 to T2), must a
466third transaction T3 which re-updates the new version of the row also
467have a rw-conflict in from T1 to prevent anomalies?  In other words,
468does it matter whether we recognize the edge T1 -> T3?
469
470          o If T1 has a conflict in, it certainly doesn't. Adding the
471edge T1 -> T3 would create a dangerous structure, but we already had
472one from the edge T1 -> T2, so we would have aborted something anyway.
473(T2 has already committed, else T3 could not have updated its output;
474but we would have aborted either T1 or T1's predecessor(s).  Hence
475no cycle involving T1 and T3 can survive.)
476
477          o Now let's consider the case where T1 doesn't have a
478rw-conflict in. If that's the case, for this edge T1 -> T3 to make a
479difference, T3 must have a rw-conflict out that induces a cycle in the
480dependency graph, i.e. a conflict out to some transaction preceding T1
481in the graph. (A conflict out to T1 itself would be problematic too,
482but that would mean T1 has a conflict in, the case we already
483eliminated.)
484
485          o So now we're trying to figure out if there can be an
486rw-conflict edge T3 -> T0, where T0 is some transaction that precedes
487T1. For T0 to precede T1, there has to be some edge, or sequence of
488edges, from T0 to T1. At least the last edge has to be a wr-dependency
489or ww-dependency rather than a rw-conflict, because T1 doesn't have a
490rw-conflict in. And that gives us enough information about the order
491of transactions to see that T3 can't have a rw-conflict to T0:
492 - T0 committed before T1 started (the wr/ww-dependency implies this)
493 - T1 started before T2 committed (the T1->T2 rw-conflict implies this)
494 - T2 committed before T3 started (otherwise, T3 would get aborted
495                                   because of an update conflict)
496
497          o That means T0 committed before T3 started, and therefore
498there can't be a rw-conflict from T3 to T0.
499
500          o So in all cases, we don't need the T1 -> T3 edge to
501recognize cycles.  Therefore it's not necessary for T1's SIREAD lock
502on the original tuple version to cover later versions as well.
503
504    * Predicate locking in PostgreSQL starts at the tuple level
505when possible. Multiple fine-grained locks are promoted to a single
506coarser-granularity lock as needed to avoid resource exhaustion.  The
507amount of memory used for these structures is configurable, to balance
508RAM usage against SIREAD lock granularity.
509
510    * Each backend keeps a process-local table of the locks it holds.
511To support granularity promotion decisions with low CPU and locking
512overhead, this table also includes the coarser covering locks and the
513number of finer-granularity locks they cover.
514
515    * Conflicts are identified by looking for predicate locks
516when tuples are written, and by looking at the MVCC information when
517tuples are read. There is no matching between two RAM-based locks.
518
519    * Because write locks are stored in the heap tuples rather than a
520RAM-based lock table, the optimization described in the Cahill thesis
521which eliminates an SIREAD lock where there is a write lock is
522implemented by the following:
523         1. When checking a heap write for conflicts against existing
524predicate locks, a tuple lock on the tuple being written is removed.
525         2. When acquiring a predicate lock on a heap tuple, we
526return quickly without doing anything if it is a tuple written by the
527reading transaction.
528
529    * Rather than using conflictIn and conflictOut pointers which use
530NULL to indicate no conflict and a self-reference to indicate
531multiple conflicts or conflicts with committed transactions, we use a
532list of rw-conflicts. With the more complete information, false
533positives are reduced and we have sufficient data for more aggressive
534clean-up and other optimizations:
535
536          o We can avoid ever rolling back a transaction until and
537unless there is a pivot where a transaction on the conflict *out*
538side of the pivot committed before either of the other transactions.
539
540          o We can avoid ever rolling back a transaction when the
541transaction on the conflict *in* side of the pivot is explicitly or
542implicitly READ ONLY unless the transaction on the conflict *out*
543side of the pivot committed before the READ ONLY transaction acquired
544its snapshot. (An implicit READ ONLY transaction is one which
545committed without writing, even though it was not explicitly declared
546to be READ ONLY.)
547
548          o We can more aggressively clean up conflicts, predicate
549locks, and SSI transaction information.
550
551    * We allow a READ ONLY transaction to "opt out" of SSI if there are
552no READ WRITE transactions which could cause the READ ONLY
553transaction to ever become part of a "dangerous structure" of
554overlapping transaction dependencies.
555
556    * We allow the user to request that a READ ONLY transaction wait
557until the conditions are right for it to start in the "opt out" state
558described above. We add a DEFERRABLE state to transactions, which is
559specified and maintained in a way similar to READ ONLY. It is
560ignored for transactions that are not SERIALIZABLE and READ ONLY.
561
562    * When a transaction must be rolled back, we pick among the
563active transactions such that an immediate retry will not fail again
564on conflicts with the same transactions.
565
566    * We use the PostgreSQL SLRU system to hold summarized
567information about older committed transactions to put an upper bound
568on RAM used. Beyond that limit, information spills to disk.
569Performance can degrade in a pessimal situation, but it should be
570tolerable, and transactions won't need to be canceled or blocked
571from starting.
572
573
574R&D Issues
575----------
576
577This is intended to be the place to record specific issues which need
578more detailed review or analysis.
579
580    * WAL file replay. While serializable implementations using S2PL
581can guarantee that the write-ahead log contains commits in a sequence
582consistent with some serial execution of serializable transactions,
583SSI cannot make that guarantee. While the WAL replay is no less
584consistent than under snapshot isolation, it is possible that under
585PITR recovery or hot standby a database could reach a readable state
586where some transactions appear before other transactions which would
587have had to precede them to maintain serializable consistency. In
588essence, if we do nothing, WAL replay will be at snapshot isolation
589even for serializable transactions. Is this OK? If not, how do we
590address it?
591
592    * External replication. Look at how this impacts external
593replication solutions, like Postgres-R, Slony, pgpool, HS/SR, etc.
594This is related to the "WAL file replay" issue.
595
596    * UNIQUE btree search for equality on all columns. Since a search
597of a UNIQUE index using equality tests on all columns will lock the
598heap tuple if an entry is found, it appears that there is no need to
599get a predicate lock on the index in that case. A predicate lock is
600still needed for such a search if a matching index entry which points
601to a visible tuple is not found.
602
603    * Minimize touching of shared memory. Should lists in shared
604memory push entries which have just been returned to the front of the
605available list, so they will be popped back off soon and some memory
606might never be touched, or should we keep adding returned items to
607the end of the available list?
608
609
610References
611----------
612
613[1] http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt
614Search for serial execution to find the relevant section.
615
616[2] A. Fekete et al. Making Snapshot Isolation Serializable. In ACM
617Transactions on Database Systems 30:2, Jun. 2005.
618http://dx.doi.org/10.1145/1071610.1071615
619
620[3] Joseph M. Hellerstein, Michael Stonebraker and James Hamilton. 2007.
621Architecture of a Database System. Foundations and Trends(R) in
622Databases Vol. 1, No. 2 (2007) 141-259.
623http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf
624  Of particular interest:
625    * 6.1 A Note on ACID
626    * 6.2 A Brief Review of Serializability
627    * 6.3 Locking and Latching
628    * 6.3.1 Transaction Isolation Levels
629    * 6.5.3 Next-Key Locking: Physical Surrogates for Logical Properties
630

README.barrier

1Memory Barriers
2===============
3
4Modern CPUs make extensive use of pipelining and out-of-order execution,
5meaning that the CPU is often executing more than one instruction at a
6time, and not necessarily in the order that the source code would suggest.
7Furthermore, even before the CPU gets a chance to reorder operations, the
8compiler may (and often does) reorganize the code for greater efficiency,
9particularly at higher optimization levels.  Optimizing compilers and
10out-of-order execution are both critical for good performance, but they
11can lead to surprising results when multiple processes access the same
12memory space.
13
14Example
15=======
16
17Suppose x is a pointer to a structure stored in shared memory, and that the
18entire structure has been initialized to zero bytes.  One backend executes
19the following code fragment:
20
21    x->foo = 1;
22    x->bar = 1;
23
24Meanwhile, at approximately the same time, another backend executes this
25code fragment:
26
27    bar = x->bar;
28    foo = x->foo;
29
30The second backend might end up with foo = 1 and bar = 1 (if it executes
31both statements after the first backend), or with foo = 0 and bar = 0 (if
32it executes both statements before the first backend), or with foo = 1 and
33bar = 0 (if the first backend executes the first statement, the second
34backend executes both statements, and then the first backend executes the
35second statement).
36
37Surprisingly, however, the second backend could also end up with foo = 0
38and bar = 1.  The compiler might swap the order of the two stores performed
39by the first backend, or the two loads performed by the second backend.
40Even if it doesn't, on a machine with weak memory ordering (such as PowerPC
41or Itanium) the CPU might choose to execute either the loads or the stores
42out of order.  This surprising result can lead to bugs.
43
44A common pattern where this actually does result in a bug is when adding items
45onto a queue.  The writer does this:
46
47    q->items[q->num_items] = new_item;
48    ++q->num_items;
49
50The reader does this:
51
52    num_items = q->num_items;
53    for (i = 0; i < num_items; ++i)
54        /* do something with q->items[i] */
55
56This code turns out to be unsafe, because the writer might increment
57q->num_items before it finishes storing the new item into the appropriate slot.
58More subtly, the reader might prefetch the contents of the q->items array
59before reading q->num_items.  Thus, there's still a bug here *even if the
60writer does everything in the order we expect*.  We need the writer to update
61the array before bumping the item counter, and the reader to examine the item
62counter before examining the array.
63
64Note that these types of highly counterintuitive bugs can *only* occur when
65multiple processes are interacting with the same memory segment.  A given
66process always perceives its *own* writes to memory in program order.
67
68Avoiding Memory Ordering Bugs
69=============================
70
71The simplest (and often best) way to avoid memory ordering bugs is to
72protect the data structures involved with an lwlock.  For more details, see
73src/backend/storage/lmgr/README.  For instance, in the above example, the
74writer could acquire an lwlock in exclusive mode before appending to the
75queue, and each reader could acquire the same lock in shared mode before
76reading it.  If the data structure is not heavily trafficked, this solution is
77generally entirely adequate.
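
For instance, assuming the queue structure carries its own LWLock (the
field name q->lock is made up for this illustration), the writer and the
readers from the earlier example could use the functions declared in
storage/lwlock.h like this:

    /* writer */
    LWLockAcquire(&q->lock, LW_EXCLUSIVE);
    q->items[q->num_items] = new_item;
    ++q->num_items;
    LWLockRelease(&q->lock);

    /* reader */
    LWLockAcquire(&q->lock, LW_SHARED);
    num_items = q->num_items;
    for (i = 0; i < num_items; ++i)
    {
        /* do something with q->items[i] */
    }
    LWLockRelease(&q->lock);

Because both the update and the read happen while the lock is held, the
ordering problem described above cannot arise and no explicit memory
barrier is needed.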
78
79However, in some cases, it is desirable to avoid the overhead of acquiring
80and releasing locks.  In this case, memory barriers may be used to ensure
81that the apparent order of execution is as the programmer desires.   In
82PostgreSQL backend code, the pg_memory_barrier() macro may be used to achieve
83this result.  In the example above, we can prevent the reader from seeing a
84garbage value by having the writer do this:
85
86    q->items[q->num_items] = new_item;
87    pg_memory_barrier();
88    ++q->num_items;
89
90And by having the reader do this:
91
92    num_items = q->num_items;
93    pg_memory_barrier();
94    for (i = 0; i < num_items; ++i)
95        /* do something with q->items[i] */
96
97The pg_memory_barrier() macro will (1) prevent the compiler from rearranging
98the code in such a way as to allow the memory accesses to occur out of order
99and (2) generate any code (often, inline assembly) that is needed to prevent
100the CPU from executing the memory accesses out of order.  Specifically, the
101barrier prevents loads and stores written after the barrier from being
102performed before the barrier, and vice-versa.
103
104Although this code will work, it is needlessly inefficient.  On systems with
105strong memory ordering (such as x86), the CPU never reorders loads with other
106loads, nor stores with other stores.  It can, however, allow a load to be
107performed before a preceding store has become visible.  To avoid emitting unnecessary memory
108instructions, we provide two additional primitives: pg_read_barrier(), and
109pg_write_barrier().  When a memory barrier is being used to separate two
110loads, use pg_read_barrier(); when it is separating two stores, use
111pg_write_barrier(); when it is separating a load and a store (in either
112order), use pg_memory_barrier().  pg_memory_barrier() can always substitute
113for either a read or a write barrier, but is typically more expensive, and
114therefore should be used only when needed.
115
116With these guidelines in mind, the writer can do this:
117
118    q->items[q->num_items] = new_item;
119    pg_write_barrier();
120    ++q->num_items;
121
122And the reader can do this:
123
124    num_items = q->num_items;
125    pg_read_barrier();
126    for (i = 0; i < num_items; ++i)
127        /* do something with q->items[i] */
128
129On machines with strong memory ordering, these weaker barriers will simply
130prevent compiler rearrangement, without emitting any actual machine code.
131On machines with weak memory ordering, they will prevent compiler
132reordering and also emit whatever hardware barrier may be required.  Even
133on machines with weak memory ordering, a read or write barrier may be able
134to use a less expensive instruction than a full barrier.
135
136Weaknesses of Memory Barriers
137=============================
138
139While memory barriers are a powerful tool, and much cheaper than locks, they
140are also much less capable than locks.  Here are some of the problems.
141
1421. Concurrent writers are unsafe.  In the above example of a queue, using
143memory barriers doesn't make it safe for two processes to add items to the
144same queue at the same time.  If more than one process can write to the queue,
145a spinlock or lwlock must be used to synchronize access. The readers can
146perhaps proceed without any lock, but the writers may not.
147
148Even very simple write operations often require additional synchronization.
149For example, it's not safe for multiple writers to simultaneously execute
150this code (supposing x is a pointer into shared memory):
151
152    x->foo++;
153
154Although this may compile down to a single machine-language instruction,
155the CPU will execute that instruction by reading the current value of foo,
156adding one to it, and then storing the result back to the original address.
157If two CPUs try to do this simultaneously, both may do their reads before
158either one does its write, and one of the two increments will be lost.  Such a
159case can be made safe by using an atomic variable and an atomic add; see port/atomics.h and the sketch after this list.
160
1612. Eight-byte loads and stores aren't necessarily atomic.  We assume in
162various places in the source code that an aligned four-byte load or store is
163atomic, and that other processes therefore won't see a half-set value.
164Sadly, the same can't be said for eight-byte values: on some platforms, an
165aligned eight-byte load or store will generate two four-byte operations.  If
166you need an atomic eight-byte read or write, you must either serialize access
167with a lock or use an atomic variable (see the sketch after this list).
168
1693. No ordering guarantees.  While memory barriers ensure that any given
170process performs loads and stores to shared memory in order, they don't
171guarantee synchronization.  In the queue example above, we can use memory
172barriers to be sure that readers won't see garbage, but there's nothing to
173say whether a given reader will run before or after a given writer.  If this
174matters in a given situation, some other mechanism must be used instead of
175or in addition to memory barriers.
176
1774. Barrier proliferation.  Many algorithms that at first seem appealing
178require multiple barriers.  If the number of barriers required is more than
179one or two, you may be better off just using a lock.  Keep in mind that, on
180some platforms, a barrier may be implemented by acquiring and releasing a
181backend-private spinlock.  This may be better than a centralized lock under
182contention, but it may also be slower in the uncontended case.
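
To illustrate the advice in points 1 and 2, a counter that may be updated
by several processes can be declared as a pg_atomic_uint32 (or, when eight
bytes are needed, a pg_atomic_uint64) and manipulated with the functions
from port/atomics.h.  The structure and field names below are invented for
the example; only the atomics API is real:

    #include "port/atomics.h"

    typedef struct SharedCounters
    {
        pg_atomic_uint32    nrequests;      /* incremented concurrently */
        pg_atomic_uint64    total_bytes;    /* needs atomic 8-byte access */
    } SharedCounters;

    /* once, while initializing the shared memory area */
    pg_atomic_init_u32(&counters->nrequests, 0);
    pg_atomic_init_u64(&counters->total_bytes, 0);

    /* in any process, with no lock held */
    pg_atomic_fetch_add_u32(&counters->nrequests, 1);

    /* atomic even on platforms lacking native 8-byte loads and stores */
    uint64      seen = pg_atomic_read_u64(&counters->total_bytes);
    pg_atomic_write_u64(&counters->total_bytes, seen + 1024);

The separate read and write of total_bytes are shown only to demonstrate
atomic eight-byte access; if several processes could update that field
concurrently, pg_atomic_fetch_add_u64() would be the appropriate call.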
183
184Further Reading
185===============
186
187Much of the documentation about memory barriers appears to be quite
188Linux-specific.  The following papers may be helpful:
189
190Memory Ordering in Modern Microprocessors, by Paul E. McKenney
191* http://www.rdrop.com/users/paulmck/scalability/paper/ordering.2007.09.19a.pdf
192
193Memory Barriers: a Hardware View for Software Hackers, by Paul E. McKenney
194* http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.06.07c.pdf
195
196The Linux kernel also has some useful documentation on this topic.  Start
197with Documentation/memory-barriers.txt
198