src/backend/access/transam/README

The Transaction System
======================

PostgreSQL's transaction system is a three-layer system.  The bottom layer
implements low-level transactions and subtransactions, on top of which rests
the mainloop's control code, which in turn implements user-visible
transactions and savepoints.

The middle layer of code is called by postgres.c before and after the
processing of each query, or after detecting an error:

		StartTransactionCommand
		CommitTransactionCommand
		AbortCurrentTransaction

Meanwhile, the user can alter the system's state by issuing the SQL commands
BEGIN, COMMIT, ROLLBACK, SAVEPOINT, ROLLBACK TO or RELEASE.  The traffic cop
redirects these calls to the toplevel routines

		BeginTransactionBlock
		EndTransactionBlock
		UserAbortTransactionBlock
		DefineSavepoint
		RollbackToSavepoint
		ReleaseSavepoint

respectively.  Depending on the current state of the system, these functions
call low level functions to activate the real transaction system:

		StartTransaction
		CommitTransaction
		AbortTransaction
		CleanupTransaction
		StartSubTransaction
		CommitSubTransaction
		AbortSubTransaction
		CleanupSubTransaction

Additionally, within a transaction, CommandCounterIncrement is called to
increment the command counter, which allows future commands to "see" the
effects of previous commands within the same transaction.  Note that this is
done automatically by CommitTransactionCommand after each query inside a
transaction block, but some utility functions also do it internally to allow
some operations (usually in the system catalogs) to be seen by future
operations in the same utility command.  (For example, in DefineRelation it is
done after creating the heap so the pg_class row is visible, to be able to
lock it.)
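
The visibility rule that the command counter enables can be sketched with a
toy model.  The names below (curcid, tuple_visible) are illustrative
stand-ins, not PostgreSQL's actual implementation: a tuple written by command
N becomes visible only once the counter has been advanced past N.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t CommandId;

static CommandId curcid = 0;	/* toy backend-local command counter */

/* toy stand-in for CommandCounterIncrement */
static void
command_counter_increment(void)
{
	curcid++;
}

/* a tuple written by command `tuple_cid` is visible only to later commands */
static bool
tuple_visible(CommandId tuple_cid)
{
	return tuple_cid < curcid;
}
```

This mirrors why DefineRelation must increment the counter after creating the
heap: without the increment, the new pg_class row would still be invisible to
the rest of the same utility command.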


For example, consider the following sequence of user commands:

1)		BEGIN
2)		SELECT * FROM foo
3)		INSERT INTO foo VALUES (...)
4)		COMMIT

In the main processing loop, this results in the following function call
sequence:

     /  StartTransactionCommand;
    /       StartTransaction;
1) <    ProcessUtility;                 << BEGIN
    \       BeginTransactionBlock;
     \  CommitTransactionCommand;

    /   StartTransactionCommand;
2) /    PortalRunSelect;                << SELECT ...
   \    CommitTransactionCommand;
    \       CommandCounterIncrement;

    /   StartTransactionCommand;
3) /    ProcessQuery;                   << INSERT ...
   \    CommitTransactionCommand;
    \       CommandCounterIncrement;

     /  StartTransactionCommand;
    /   ProcessUtility;                 << COMMIT
4) <        EndTransactionBlock;
    \   CommitTransactionCommand;
     \      CommitTransaction;

The point of this example is to demonstrate the need for
StartTransactionCommand and CommitTransactionCommand to be state smart -- they
should call CommandCounterIncrement between the calls to BeginTransactionBlock
and EndTransactionBlock and outside these calls they need to do normal start,
commit or abort processing.

Furthermore, suppose the "SELECT * FROM foo" caused an abort condition. In
this case AbortCurrentTransaction is called, and the transaction is put in
aborted state.  In this state, any user input is ignored except for
transaction-termination statements, or ROLLBACK TO <savepoint> commands.

Transaction aborts can occur in two ways:

1) system dies from some internal cause  (syntax error, etc)
2) user types ROLLBACK

The reason we have to distinguish them is illustrated by the following two
situations:

        case 1                                  case 2
        ------                                  ------
1) user types BEGIN                     1) user types BEGIN
2) user does something                  2) user does something
3) user does not like what              3) system aborts for some reason
   she sees and types ABORT                (syntax error, etc)

In case 1, we want to abort the transaction and return to the default state.
In case 2, there may be more commands coming our way which are part of the
same transaction block; we have to ignore these commands until we see a COMMIT
or ROLLBACK.

Internal aborts are handled by AbortCurrentTransaction, while user aborts are
handled by UserAbortTransactionBlock.  Both of them rely on AbortTransaction
to do all the real work.  The only difference is what state we enter after
AbortTransaction does its work:

* AbortCurrentTransaction leaves us in TBLOCK_ABORT,
* UserAbortTransactionBlock leaves us in TBLOCK_ABORT_END

Low-level transaction abort handling is divided into two phases:
* AbortTransaction executes as soon as we realize the transaction has
  failed.  It should release all shared resources (locks etc) so that we do
  not delay other backends unnecessarily.
* CleanupTransaction executes when we finally see a user COMMIT
  or ROLLBACK command; it cleans things up and gets us out of the transaction
  completely.  In particular, we mustn't destroy TopTransactionContext until
  this point.

Also, note that when a transaction is committed, we don't close it right away.
Rather it's put in TBLOCK_END state, which means that when
CommitTransactionCommand is called after the query has finished processing,
the transaction has to be closed.  The distinction is subtle but important,
because it means that control will leave the xact.c code with the transaction
open, and the main loop will be able to keep processing inside the same
transaction.  So, in a sense, transaction commit is also handled in two
phases, the first at EndTransactionBlock and the second at
CommitTransactionCommand (which is where CommitTransaction is actually
called).

The rest of the code in xact.c consists of routines to support the creation
and finishing of transactions and subtransactions.  For example, AtStart_Memory
takes care of initializing the memory subsystem at main transaction start.


Subtransaction Handling
-----------------------

Subtransactions are implemented using a stack of TransactionState structures,
each of which has a pointer to its parent transaction's struct.  When a new
subtransaction is to be opened, PushTransaction is called, which creates a new
TransactionState, with its parent link pointing to the current transaction.
StartSubTransaction is in charge of initializing the new TransactionState to
sane values, and properly initializing other subsystems (AtSubStart routines).

When closing a subtransaction, either CommitSubTransaction has to be called
(if the subtransaction is committing), or AbortSubTransaction and
CleanupSubTransaction (if it's aborting).  In either case, PopTransaction is
called so the system returns to the parent transaction.
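
The stack discipline above can be sketched in miniature.  This is a toy
model, not xact.c's real TransactionState (whose struct carries far more
state); only the parent-link/push/pop shape is the point:

```c
#include <assert.h>
#include <stdlib.h>

/* toy model of the TransactionState stack; field names are illustrative */
typedef struct ToyTransactionState
{
	int			nestingLevel;
	struct ToyTransactionState *parent;
} ToyTransactionState;

static ToyTransactionState topState = {1, NULL};
static ToyTransactionState *curState = &topState;

/* analogous to PushTransaction: open a new subtransaction level */
static void
toy_push(void)
{
	ToyTransactionState *s = malloc(sizeof(ToyTransactionState));

	s->nestingLevel = curState->nestingLevel + 1;
	s->parent = curState;		/* parent link points at current xact */
	curState = s;
}

/* analogous to PopTransaction: return to the parent transaction */
static void
toy_pop(void)
{
	ToyTransactionState *s = curState;

	curState = s->parent;
	free(s);
}
```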

One important point regarding subtransaction handling is that several may need
to be closed in response to a single user command.  That's because savepoints
have names, and we allow committing or rolling back a savepoint by name, which
is not necessarily the one that was last opened.  Also a COMMIT or ROLLBACK
command must be able to close out the entire stack.  We handle this by having
the utility command subroutine mark all the state stack entries as
commit-pending or abort-pending, and then when the main loop reaches
CommitTransactionCommand, the real work is done.  The main point of doing
things this way is that if we get an error while popping state stack entries,
the remaining stack entries still show what we need to do to finish up.

In the case of ROLLBACK TO <savepoint>, we abort all the subtransactions up
through the one identified by the savepoint name, and then re-create that
subtransaction level with the same name.  So it's a completely new
subtransaction as far as the internals are concerned.

Other subsystems are allowed to start "internal" subtransactions, which are
handled by BeginInternalSubTransaction.  This is to allow implementing
exception handling, e.g. in PL/pgSQL.  ReleaseCurrentSubTransaction and
RollbackAndReleaseCurrentSubTransaction allow the subsystem to close said
subtransactions.  The main difference between this and the savepoint/release
path is that we execute the complete state transition immediately in each
subroutine, rather than deferring some work until CommitTransactionCommand.
Another difference is that BeginInternalSubTransaction is allowed when no
explicit transaction block has been established, while DefineSavepoint is not.


Transaction and Subtransaction Numbering
----------------------------------------

Transactions and subtransactions are assigned permanent XIDs only when/if
they first do something that requires one --- typically, insert/update/delete
a tuple, though there are a few other places that need an XID assigned.
If a subtransaction requires an XID, we always first assign one to its
parent.  This maintains the invariant that child transactions have XIDs later
than their parents, which is assumed in a number of places.

The subsidiary actions of obtaining a lock on the XID and entering it into
pg_subtrans and PG_PROC are done at the time it is assigned.

A transaction that has no XID still needs to be identified for various
purposes, notably holding locks.  For this purpose we assign a "virtual
transaction ID" or VXID to each top-level transaction.  VXIDs are formed from
two fields, the backendID and a backend-local counter; this arrangement allows
assignment of a new VXID at transaction start without any contention for
shared memory.  To ensure that a VXID isn't re-used too soon after backend
exit, we store the last local counter value into shared memory at backend
exit, and initialize it from the previous value for the same backendID slot
at backend start.  All these counters go back to zero at shared memory
re-initialization, but that's OK because VXIDs never appear anywhere on-disk.
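
The two-field scheme can be sketched as follows.  This is an illustrative
toy (the real type lives in storage/lock/lock.h); the point is only that the
counter is backend-local, so handing out a VXID needs no shared lock:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy VXID: backendID plus a backend-local counter.  Because the counter
 * is private to the backend, no shared-memory contention is needed to
 * assign a new VXID at transaction start.
 */
typedef struct
{
	int			backendId;
	uint64_t	localTransactionId;
} ToyVirtualTransactionId;

static uint64_t nextLocalXid = 1;	/* backend-local; no lock required */

static ToyVirtualTransactionId
toy_assign_vxid(int backendId)
{
	ToyVirtualTransactionId vxid = {backendId, nextLocalXid++};

	return vxid;
}
```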

Internally, a backend needs a way to identify subtransactions whether or not
they have XIDs; but this need only lasts as long as the parent top transaction
endures.  Therefore, we have SubTransactionId, which is somewhat like
CommandId in that it's generated from a counter that we reset at the start of
each top transaction.  The top-level transaction itself has SubTransactionId 1,
and subtransactions have IDs 2 and up.  (Zero is reserved for
InvalidSubTransactionId.)  Note that subtransactions do not have their
own VXIDs; they use the parent top transaction's VXID.
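
A minimal sketch of the counter behavior described above (toy names, not the
real xact.c code):

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t ToySubTransactionId;

#define TOY_INVALID_SUBXID 0	/* models InvalidSubTransactionId */

static ToySubTransactionId nextSubxid;

/* reset at the start of each top-level transaction; the top level gets 1 */
static ToySubTransactionId
toy_start_top_xact(void)
{
	nextSubxid = 1;
	return nextSubxid++;
}

/* each subtransaction gets the next value: 2, 3, ... */
static ToySubTransactionId
toy_start_subxact(void)
{
	return nextSubxid++;
}
```

Resetting per top-level transaction keeps the IDs small and cheap, at the
cost of them being meaningless outside the parent transaction, which is
exactly the lifetime the text says they need.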

Interlocking Transaction Begin, Transaction End, and Snapshots
--------------------------------------------------------------

We try hard to minimize the amount of overhead and lock contention involved
in the frequent activities of beginning/ending a transaction and taking a
snapshot.  Unfortunately, we must have some interlocking for this, because
we must ensure consistency about the commit order of transactions.
For example, suppose an UPDATE in xact A is blocked by xact B's prior
update of the same row, and xact B is doing commit while xact C gets a
snapshot.  Xact A can complete and commit as soon as B releases its locks.
If xact C's GetSnapshotData sees xact B as still running, then it had
better see xact A as still running as well, or it will be able to see two
tuple versions - one deleted by xact B and one inserted by xact A.  Another
reason why this would be bad is that C would see (in the row inserted by A)
earlier changes by B, and it would be inconsistent for C not to see any
of B's changes elsewhere in the database.

Formally, the correctness requirement is "if a snapshot A considers
transaction X as committed, and any of transaction X's snapshots considered
transaction Y as committed, then snapshot A must consider transaction Y as
committed".

What we actually enforce is strict serialization of commits and rollbacks
with snapshot-taking: we do not allow any transaction to exit the set of
running transactions while a snapshot is being taken.  (This rule is
stronger than necessary for consistency, but is relatively simple to
enforce, and it assists with some other issues as explained below.)  The
implementation of this is that GetSnapshotData takes the ProcArrayLock in
shared mode (so that multiple backends can take snapshots in parallel),
but ProcArrayEndTransaction must take the ProcArrayLock in exclusive mode
while clearing the ProcGlobal->xids[] entry at transaction end (either
commit or abort). (To reduce context switching, when multiple transactions
commit nearly simultaneously, we have one backend take ProcArrayLock and
clear the XIDs of multiple processes at once.)

ProcArrayEndTransaction also holds the lock while advancing the shared
latestCompletedXid variable.  This allows GetSnapshotData to use
latestCompletedXid + 1 as xmax for its snapshot: there can be no
transaction >= this xid value that the snapshot needs to consider as
completed.

In short, then, the rule is that no transaction may exit the set of
currently-running transactions between the time we fetch latestCompletedXid
and the time we finish building our snapshot.  However, this restriction
only applies to transactions that have an XID --- read-only transactions
can end without acquiring ProcArrayLock, since they don't affect anyone
else's snapshot nor latestCompletedXid.

Transaction start, per se, doesn't have any interlocking with these
considerations, since we no longer assign an XID immediately at transaction
start.  But when we do decide to allocate an XID, GetNewTransactionId must
store the new XID into the shared ProcArray before releasing XidGenLock.
This ensures that all top-level XIDs <= latestCompletedXid are either
present in the ProcArray, or not running anymore.  (This guarantee doesn't
apply to subtransaction XIDs, because of the possibility that there's not
room for them in the subxid array; instead we guarantee that they are
present or the overflow flag is set.)  If a backend released XidGenLock
before storing its XID into ProcGlobal->xids[], then it would be possible for
another backend to allocate and commit a later XID, causing latestCompletedXid
to pass the first backend's XID, before that value became visible in the
ProcArray.  That would break ComputeXidHorizons, as discussed below.

We allow GetNewTransactionId to store the XID into ProcGlobal->xids[] (or the
subxid array) without taking ProcArrayLock.  This was once necessary to
avoid deadlock; while that is no longer the case, it's still beneficial for
performance.  We are thereby relying on fetch/store of an XID to be atomic,
else other backends might see a partially-set XID.  This also means that
readers of the ProcArray xid fields must be careful to fetch a value only
once, rather than assume they can read it multiple times and get the same
answer each time.  (Use volatile-qualified pointers when doing this, to
ensure that the C compiler does exactly what you tell it to.)

Another important activity that uses the shared ProcArray is
ComputeXidHorizons, which must determine a lower bound for the oldest xmin
of any active MVCC snapshot, system-wide.  Each individual backend
advertises the smallest xmin of its own snapshots in MyProc->xmin, or zero
if it currently has no live snapshots (eg, if it's between transactions or
hasn't yet set a snapshot for a new transaction).  ComputeXidHorizons takes
the MIN() of the valid xmin fields.  It does this with only shared lock on
ProcArrayLock, which means there is a potential race condition against other
backends doing GetSnapshotData concurrently: we must be certain that a
concurrent backend that is about to set its xmin does not compute an xmin
less than what ComputeXidHorizons determines.  We ensure that by including
all the active XIDs into the MIN() calculation, along with the valid xmins.
The rule that transactions can't exit without taking exclusive ProcArrayLock
ensures that concurrent holders of shared ProcArrayLock will compute the
same minimum of currently-active XIDs: no xact, in particular not the
oldest, can exit while we hold shared ProcArrayLock.  So
ComputeXidHorizons's view of the minimum active XID will be the same as that
of any concurrent GetSnapshotData, and so it can't produce an overestimate.
If there is no active transaction at all, ComputeXidHorizons uses
latestCompletedXid + 1, which is a lower bound for the xmin that might
be computed by concurrent or later GetSnapshotData calls.  (We know that no
XID less than this could be about to appear in the ProcArray, because of the
XidGenLock interlock discussed above.)

As GetSnapshotData is performance critical, it does not perform an accurate
oldest-xmin calculation (it used to, until v14).  The contents of a snapshot
only depend on the xids of other backends, not their xmin.  As a backend's
xmin changes much more often than its xid, having GetSnapshotData look at
xmins can lead to a lot of unnecessary cacheline ping-pong.  Instead,
GetSnapshotData updates approximate thresholds (one that guarantees that all
deleted rows older than it can be removed, another determining that deleted
rows newer than it can not be removed).  GlobalVisTest* uses those thresholds
to make visibility decisions, falling back to ComputeXidHorizons if
necessary.

Note that while it is certain that two concurrent executions of
GetSnapshotData will compute the same xmin for their own snapshots, there is
no such guarantee for the horizons computed by ComputeXidHorizons.  This is
because we allow XID-less transactions to clear their MyProc->xmin
asynchronously (without taking ProcArrayLock), so one execution might see
what had been the oldest xmin, and another not.  This is OK since the
thresholds need only be a valid lower bound.  As noted above, we are already
assuming that fetch/store of the xid fields is atomic, so assuming it for
xmin as well is no extra risk.


pg_xact and pg_subtrans
-----------------------

pg_xact and pg_subtrans are permanent (on-disk) storage of transaction related
information.  There is a limited number of pages of each kept in memory, so
in many cases there is no need to actually read from disk.  However, if
there's a long running transaction or a backend sitting idle with an open
transaction, it may be necessary to be able to read and write this information
from disk.  They also allow information to be permanent across server restarts.

pg_xact records the commit status for each transaction that has been assigned
an XID.  A transaction can be in progress, committed, aborted, or
"sub-committed".  This last state means that it's a subtransaction that's no
longer running, but its parent has not updated its state yet.  It is not
necessary to update a subtransaction's transaction status to subcommit, so we
can just defer it until main transaction commit.  The main role of marking
transactions as sub-committed is to provide an atomic commit protocol when
transaction status is spread across multiple clog pages.  As a result, whenever
transaction status spreads across multiple pages we must use a two-phase commit
protocol: the first phase is to mark the subtransactions as sub-committed, then
we mark the top level transaction and all its subtransactions committed (in
that order).  Thus, subtransactions that have not aborted appear as in-progress
even when they have already finished, and the subcommit status appears as a
very short transitory state during main transaction commit.  Subtransaction
abort is always marked in clog as soon as it occurs.  When the transaction
statuses all fit in a single CLOG page, we atomically mark them all as
committed without bothering with the intermediate sub-commit state.
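
The multi-page ordering above can be sketched with a toy clog (one status
slot per XID; the real clog packs two bits per XID and the real code works
page by page, which this toy does not model):

```c
#include <assert.h>

/* toy clog: one status slot per XID (illustrative, not the real layout) */
typedef enum
{
	TOY_IN_PROGRESS,
	TOY_COMMITTED,
	TOY_ABORTED,
	TOY_SUB_COMMITTED
} ToyXidStatus;

#define TOY_MAX_XID 16
static ToyXidStatus toyClog[TOY_MAX_XID];	/* all TOY_IN_PROGRESS */

/*
 * Two-phase commit marking for a transaction tree whose statuses may span
 * clog pages: phase 1 marks the subtransactions sub-committed; phase 2
 * marks the top-level XID committed, then the subtransactions.
 */
static void
toy_commit_tree(int topXid, const int *subXids, int nsub)
{
	for (int i = 0; i < nsub; i++)
		toyClog[subXids[i]] = TOY_SUB_COMMITTED;	/* phase 1 */
	toyClog[topXid] = TOY_COMMITTED;				/* phase 2: top first */
	for (int i = 0; i < nsub; i++)
		toyClog[subXids[i]] = TOY_COMMITTED;
}
```

If a crash happens after phase 1, the subtransactions are sub-committed but
the top XID is still in-progress, so the whole tree correctly reads as
uncommitted on recovery.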

Savepoints are implemented using subtransactions.  A subtransaction is a
transaction inside a transaction; its commit or abort status is not only
dependent on whether it committed itself, but also whether its parent
transaction committed.  To implement multiple savepoints in a transaction we
allow unlimited transaction nesting depth, so any particular subtransaction's
commit state is dependent on the commit status of each and every ancestor
transaction.

The "subtransaction parent" (pg_subtrans) mechanism records, for each
transaction with an XID, the TransactionId of its parent transaction.  This
information is stored as soon as the subtransaction is assigned an XID.
Top-level transactions do not have a parent, so they leave their pg_subtrans
entries set to the default value of zero (InvalidTransactionId).
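
The parent map can be sketched as a simple array keyed by XID.  This toy
ignores the on-disk SLRU paging that the real pg_subtrans uses; the
parent-chain walk is the point:

```c
#include <assert.h>

#define TOY_INVALID_XID 0	/* models InvalidTransactionId */
#define TOY_MAX_XID 16

/*
 * Toy pg_subtrans: parent XID of each subtransaction XID.  Top-level
 * transactions keep the default TOY_INVALID_XID.
 */
static int toyParent[TOY_MAX_XID];	/* zero-initialized */

static void
toy_set_parent(int xid, int parentXid)
{
	toyParent[xid] = parentXid;
}

/* walk up the parent chain to find the top-level ancestor of an XID */
static int
toy_toplevel_of(int xid)
{
	while (toyParent[xid] != TOY_INVALID_XID)
		xid = toyParent[xid];
	return xid;
}
```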

pg_subtrans is used to check whether the transaction in question is still
running --- the main Xid of a transaction is recorded in ProcGlobal->xids[],
with a copy in PGPROC->xid, but since we allow arbitrary nesting of
subtransactions, we can't fit all Xids in shared memory, so we have to store
them on disk.  Note, however, that for each transaction we keep a "cache" of
Xids that are known to be part of the transaction tree, so we can skip looking
at pg_subtrans unless we know the cache has been overflowed.  See
storage/ipc/procarray.c for the gory details.

slru.c is the supporting mechanism for both pg_xact and pg_subtrans.  It
implements the LRU policy for in-memory buffer pages.  The high-level routines
for pg_xact are implemented in transam.c, while the low-level functions are in
clog.c.  pg_subtrans is contained completely in subtrans.c.


Write-Ahead Log Coding
----------------------

The WAL subsystem (also called XLOG in the code) exists to guarantee crash
recovery.  It can also be used to provide point-in-time recovery, as well as
hot-standby replication via log shipping.  Here are some notes about
non-obvious aspects of its design.

A basic assumption of a write AHEAD log is that log entries must reach stable
storage before the data-page changes they describe.  This ensures that
replaying the log to its end will bring us to a consistent state where there
are no partially-performed transactions.  To guarantee this, each data page
(either heap or index) is marked with the LSN (log sequence number --- in
practice, a WAL file location) of the latest XLOG record affecting the page.
Before the bufmgr can write out a dirty page, it must ensure that xlog has
been flushed to disk at least up to the page's LSN.  This low-level
interaction improves performance by not waiting for XLOG I/O until necessary.
The LSN check exists only in the shared-buffer manager, not in the local
buffer manager used for temp tables; hence operations on temp tables must not
be WAL-logged.

During WAL replay, we can check the LSN of a page to detect whether the change
recorded by the current log entry is already applied (it has been, if the page
LSN is >= the log entry's WAL location).
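
That idempotence check reduces to a single comparison.  A minimal sketch,
with toy names standing in for the real PageGetLSN/XLogRecPtr machinery:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t ToyLsn;

/*
 * Replay-time idempotence check: the change described by a WAL record is
 * already applied if the page's LSN is >= the record's WAL location, so
 * redo can skip re-applying it.
 */
static bool
toy_change_already_applied(ToyLsn pageLsn, ToyLsn recordLsn)
{
	return pageLsn >= recordLsn;
}
```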

Usually, log entries contain just enough information to redo a single
incremental update on a page (or small group of pages).  This will work only
if the filesystem and hardware implement data page writes as atomic actions,
so that a page is never left in a corrupt partly-written state.  Since that's
often an untenable assumption in practice, we log additional information to
allow complete reconstruction of modified pages.  The first WAL record
affecting a given page after a checkpoint is made to contain a copy of the
entire page, and we implement replay by restoring that page copy instead of
redoing the update.  (This is more reliable than the data storage itself would
be because we can check the validity of the WAL record's CRC.)  We can detect
the "first change after checkpoint" by noting whether the page's old LSN
precedes the end of WAL as of the last checkpoint (the RedoRecPtr).

The general schema for executing a WAL-logged action is

1. Pin and exclusive-lock the shared buffer(s) containing the data page(s)
to be modified.

2. START_CRIT_SECTION()  (Any error during the next three steps must cause a
PANIC because the shared buffers will contain unlogged changes, which we
have to ensure don't get to disk.  Obviously, you should check conditions
such as whether there's enough free space on the page before you start the
critical section.)

3. Apply the required changes to the shared buffer(s).

4. Mark the shared buffer(s) as dirty with MarkBufferDirty().  (This must
happen before the WAL record is inserted; see notes in SyncOneBuffer().)
Note that marking a buffer dirty with MarkBufferDirty() should happen if
and only if you write a WAL record; see Writing Hints below.

5. If the relation requires WAL-logging, build a WAL record using
XLogBeginInsert and XLogRegister* functions, and insert it.  (See
"Constructing a WAL record" below).  Then update the page's LSN using the
returned XLOG location.  For instance,

		XLogBeginInsert();
		XLogRegisterBuffer(...);
		XLogRegisterData(...);
		recptr = XLogInsert(rmgr_id, info);

		PageSetLSN(dp, recptr);

6. END_CRIT_SECTION()

7. Unlock and unpin the buffer(s).

Complex changes (such as a multilevel index insertion) normally need to be
described by a series of atomic-action WAL records.  The intermediate states
must be self-consistent, so that if the replay is interrupted between any
two actions, the system is fully functional.  In btree indexes, for example,
a page split requires a new page to be allocated, and an insertion of a new
key in the parent btree level, but for locking reasons this has to be
reflected by two separate WAL records.  Replaying the first record, to
allocate the new page and move tuples to it, sets a flag on the page to
indicate that the key has not been inserted to the parent yet.  Replaying the
second record clears the flag.  This intermediate state is never seen by
other backends during normal operation, because the lock on the child page
is held across the two actions, but will be seen if the operation is
interrupted before writing the second WAL record.  The search algorithm works
with the intermediate state as normal, but if an insertion encounters a page
with the incomplete-split flag set, it will finish the interrupted split by
inserting the key to the parent, before proceeding.


Constructing a WAL record
-------------------------

A WAL record consists of a header common to all WAL record types,
record-specific data, and information about the data blocks modified.  Each
modified data block is identified by an ID number, and can optionally have
more record-specific data associated with the block.  If XLogInsert decides
that a full-page image of a block needs to be taken, the data associated
with that block is not included.

The API for constructing a WAL record consists of five functions:
XLogBeginInsert, XLogRegisterBuffer, XLogRegisterData, XLogRegisterBufData,
and XLogInsert.  First, call XLogBeginInsert().  Then register all the buffers
modified, and data needed to replay the changes, using XLogRegister*
functions.  Finally, insert the constructed record to the WAL by calling
XLogInsert().

	XLogBeginInsert();

	/* register buffers modified as part of this WAL-logged action */
	XLogRegisterBuffer(0, lbuffer, REGBUF_STANDARD);
	XLogRegisterBuffer(1, rbuffer, REGBUF_STANDARD);

	/* register data that is always included in the WAL record */
	XLogRegisterData(&xlrec, SizeOfFictionalAction);

	/*
	 * register data associated with a buffer. This will not be included
	 * in the record if a full-page image is taken.
	 */
	XLogRegisterBufData(0, tuple->data, tuple->len);

	/* more data associated with the buffer */
	XLogRegisterBufData(0, data2, len2);

	/*
	 * Ok, all the data and buffers to include in the WAL record have
	 * been registered. Insert the record.
	 */
	recptr = XLogInsert(RM_FOO_ID, XLOG_FOOBAR_DO_STUFF);

530Details of the API functions:
531
532void XLogBeginInsert(void)
533
534    Must be called before XLogRegisterBuffer and XLogRegisterData.
535
536void XLogResetInsertion(void)
537
538    Clear any currently registered data and buffers from the WAL record
539    construction workspace.  This is only needed if you have already called
540    XLogBeginInsert(), but decide to not insert the record after all.
541
542void XLogEnsureRecordSpace(int max_block_id, int ndatas)
543
544    Normally, the WAL record construction buffers have the following limits:
545
546    * highest block ID that can be used is 4 (allowing five block references)
547    * Max 20 chunks of registered data
548
549    These default limits are enough for most record types that change some
550    on-disk structures.  For the odd case that requires more data, or needs to
551    modify more buffers, these limits can be raised by calling
552    XLogEnsureRecordSpace().  XLogEnsureRecordSpace() must be called before
553    XLogBeginInsert(), and outside a critical section.
554
void XLogRegisterBuffer(uint8 block_id, Buffer buf, uint8 flags);

    XLogRegisterBuffer adds information about a data block to the WAL record.
    block_id is an arbitrary number used to identify this page reference in
    the redo routine.  The information needed to re-find the page at redo -
    relfilenode, fork, and block number - is included in the WAL record.

    XLogInsert will automatically include a full copy of the page contents, if
    this is the first modification of the buffer since the last checkpoint.
    It is important to register every buffer modified by the action with
    XLogRegisterBuffer, to avoid torn-page hazards.

    The flags control when and how the buffer contents are included in the
    WAL record.  Normally, a full-page image is taken only if the page has not
    been modified since the last checkpoint, and only if full_page_writes=on
    or an online backup is in progress.  The REGBUF_FORCE_IMAGE flag can be
    used to force a full-page image to always be included; that is useful
    e.g. for an operation that rewrites most of the page, so that tracking the
    details is not worth it.  For the rare case where it is not necessary to
    protect from torn pages, the REGBUF_NO_IMAGE flag can be used to suppress
    a full-page image from being taken.  REGBUF_WILL_INIT also suppresses a
    full-page image, but the redo routine must re-generate the page from
    scratch, without looking at the old page contents.  Re-initializing the
    page protects from torn-page hazards like a full-page image does.

    The REGBUF_STANDARD flag can be specified together with the other flags to
    indicate that the page follows the standard page layout.  It causes the
    area between pd_lower and pd_upper to be left out from the image, reducing
    WAL volume.

    If the REGBUF_KEEP_DATA flag is given, any per-buffer data registered with
    XLogRegisterBufData() is included in the WAL record even if a full-page
    image is taken.

void XLogRegisterData(char *data, int len);

    XLogRegisterData is used to include arbitrary data in the WAL record.  If
    XLogRegisterData() is called multiple times, the data are appended, and
    will be made available to the redo routine as one contiguous chunk.

void XLogRegisterBufData(uint8 block_id, char *data, int len);

    XLogRegisterBufData is used to include data associated with a particular
    buffer that was registered earlier with XLogRegisterBuffer().  If
    XLogRegisterBufData() is called multiple times with the same block ID, the
    data are appended, and will be made available to the redo routine as one
    contiguous chunk.

    If a full-page image of the buffer is taken at insertion, the data is not
    included in the WAL record, unless the REGBUF_KEEP_DATA flag is used.


Writing a REDO routine
----------------------

A REDO routine uses the data and page references included in the WAL record
to reconstruct the new state of the page.  The record decoding functions
and macros in xlogreader.c/h can be used to extract the data from the record.

When replaying a WAL record that describes changes on multiple pages, you
must be careful to lock the pages properly to prevent concurrent Hot Standby
queries from seeing an inconsistent state.  If this requires that two
or more buffer locks be held concurrently, you must lock the pages in
appropriate order, and not release the locks until all the changes are done.

Note that we must only use PageSetLSN/PageGetLSN() when we know the action
is serialised.  Only the startup process may modify data blocks during
recovery, so the startup process may execute PageGetLSN() without fear of
serialisation problems.  All other processes must only call PageSet/GetLSN
when holding either an exclusive buffer lock or a shared lock plus buffer
header lock, or be writing the data block directly rather than through
shared buffers while holding AccessExclusiveLock on the relation.

Writing Hints
-------------

In some cases, we write additional information to data blocks without
writing a preceding WAL record.  This should only happen if the data can
be reconstructed later following a crash and the action is simply a way
of optimising for performance.  When a hint is written we use
MarkBufferDirtyHint() to mark the block dirty.

If the buffer is clean and checksums are in use then MarkBufferDirtyHint()
inserts an XLOG_FPI_FOR_HINT record to ensure that we take a full-page image
that includes the hint.  We do this to avoid a partial page write when we
write the dirtied page.  WAL is not written during recovery, so we simply skip
dirtying blocks because of hints when in recovery.

If you do decide to optimise away a WAL record, then any calls to
MarkBufferDirty() must be replaced by MarkBufferDirtyHint(),
otherwise you will expose yourself to the risk of partial page writes.

Write-Ahead Logging for Filesystem Actions
------------------------------------------

The previous section described how to WAL-log actions that only change page
contents within shared buffers.  For that type of action it is generally
possible to check all likely error cases (such as insufficient space on the
page) before beginning to make the actual change.  Therefore we can make
the change and the creation of the associated WAL log record "atomic" by
wrapping them into a critical section --- the odds of failure partway
through are low enough that PANIC is acceptable if it does happen.

Clearly, that approach doesn't work for cases where there's a significant
probability of failure within the action to be logged, such as creation
of a new file or database.  We don't want to PANIC, and we especially don't
want to PANIC after having already written a WAL record that says we did
the action --- if we did, replay of the record would probably fail again
and PANIC again, making the failure unrecoverable.  This means that the
ordinary WAL rule of "write WAL before the changes it describes" doesn't
work, and we need a different design for such cases.

There are several basic types of filesystem actions that have this
issue.  Here is how we deal with each:

1. Adding a disk page to an existing table.

This action isn't WAL-logged at all.  We extend a table by writing a page
of zeroes at its end.  We must actually do this write so that we are sure
the filesystem has allocated the space.  If the write fails we can just
error out normally.  Once the space is known allocated, we can initialize
and fill the page via one or more normal WAL-logged actions.  Because it's
possible that we crash between extending the file and writing out the WAL
entries, we have to treat discovery of an all-zeroes page in a table or
index as being a non-error condition.  In such cases we can just reclaim
the space for re-use.

2. Creating a new table, which requires a new file in the filesystem.

We try to create the file, and if successful we make a WAL record saying
we did it.  If not successful, we can just throw an error.  Notice that
there is a window where we have created the file but not yet written any
WAL about it to disk.  If we crash during this window, the file remains
on disk as an "orphan".  It would be possible to clean up such orphans
by having database restart search for files that don't have any committed
entry in pg_class, but that currently isn't done because of the possibility
of deleting data that is useful for forensic analysis of the crash.
Orphan files are harmless --- at worst they waste a bit of disk space ---
because we check for on-disk collisions when allocating new relfilenode
OIDs.  So cleaning up isn't really necessary.

3. Deleting a table, which requires an unlink() that could fail.

Our approach here is to WAL-log the operation first, but to treat failure
of the actual unlink() call as a warning rather than error condition.
Again, this can leave an orphan file behind, but that's cheap compared to
the alternatives.  Since we can't actually do the unlink() until after
we've committed the DROP TABLE transaction, throwing an error would be out
of the question anyway.  (It may be worth noting that the WAL entry about
the file deletion is actually part of the commit record for the dropping
transaction.)

4. Creating and deleting databases and tablespaces, which requires creating
and deleting directories and entire directory trees.

These cases are handled similarly to creating individual files, ie, we
try to do the action first and then write a WAL entry if it succeeded.
The potential amount of wasted disk space is rather larger, of course.
In the creation case we try to delete the directory tree again if creation
fails, so as to reduce the risk of wasted space.  Failure partway through
a deletion operation results in a corrupt database: the DROP failed, but
some of the data is gone anyway.  There is little we can do about that,
though, and in any case it was presumably data the user no longer wants.

In all of these cases, if WAL replay fails to redo the original action
we must panic and abort recovery.  The DBA will have to manually clean up
(for instance, free up some disk space or fix directory permissions) and
then restart recovery.  This is part of the reason for not writing a WAL
entry until we've successfully done the original action.


Skipping WAL for New RelFileNode
--------------------------------

Under wal_level=minimal, if a change modifies a relfilenode that ROLLBACK
would unlink, in-tree access methods write no WAL for that change.  Code that
writes WAL without calling RelationNeedsWAL() must check for this case.  This
skipping is mandatory.  If a WAL-writing change preceded a WAL-skipping change
for the same block, REDO could overwrite the WAL-skipping change.  If a
WAL-writing change followed a WAL-skipping change for the same block, a
related problem would arise.  When a WAL record contains no full-page image,
REDO expects the page to match its contents from just before record insertion.
A WAL-skipping change may not reach disk at all, violating REDO's expectation
under full_page_writes=off.  For any access method, CommitTransaction() writes
and fsyncs affected blocks before recording the commit.

Prefer to do the same in future access methods.  However, two other approaches
can work.  First, an access method can irreversibly transition a given fork
from WAL-skipping to WAL-writing by calling FlushRelationBuffers() and
smgrimmedsync().  Second, an access method can opt to write WAL
unconditionally for permanent relations.  Under these approaches, the access
method callbacks must not call functions that react to RelationNeedsWAL().

This applies only to WAL records whose replay would modify bytes stored in the
new relfilenode.  It does not apply to other records about the relfilenode,
such as XLOG_SMGR_CREATE.  Because it operates at the level of individual
relfilenodes, RelationNeedsWAL() can differ for tightly-coupled relations.
Consider "CREATE TABLE t (); BEGIN; ALTER TABLE t ADD c text; ..." in which
ALTER TABLE adds a TOAST relation.  The TOAST relation will skip WAL, while
the table owning it will not.  ALTER TABLE SET TABLESPACE will cause a table
to skip WAL, but that won't affect its indexes.


Asynchronous Commit
-------------------

As of PostgreSQL 8.3 it is possible to perform asynchronous commits - i.e.,
we don't wait while the WAL record for the commit is fsync'ed.
We perform an asynchronous commit when synchronous_commit = off.  Instead
of performing an XLogFlush() up to the LSN of the commit, we merely note
the LSN in shared memory.  The backend then continues with other work.
We record the LSN only for an asynchronous commit, not an abort; there's
never any need to flush an abort record, since the presumption after a
crash would be that the transaction aborted anyway.

We always force synchronous commit when the transaction is deleting
relations, to ensure the commit record is down to disk before the relations
are removed from the filesystem.  Also, certain utility commands that have
non-roll-backable side effects (such as filesystem changes) force sync
commit to minimize the window in which the filesystem change has been made
but the transaction isn't guaranteed committed.

The walwriter regularly wakes up (via wal_writer_delay) or is woken up
(via its latch, which is set by backends committing asynchronously) and
performs an XLogBackgroundFlush().  This checks the location of the last
completely filled WAL page.  If that has moved forwards, then we write all
the changed buffers up to that point, so that under full load we write
only whole buffers.  If there has been a break in activity and the current
WAL page is the same as before, then we find out the LSN of the most
recent asynchronous commit, and write up to that point, if required (i.e.
if it's in the current WAL page).  If more than wal_writer_delay has
passed, or more than wal_writer_flush_after blocks have been written, since
the last flush, WAL is also flushed up to the current location.  This
arrangement in itself would guarantee that an async commit record reaches
disk after at most two times wal_writer_delay after the transaction
completes.  However, we also allow XLogFlush to write/flush full buffers
"flexibly" (ie, not wrapping around at the end of the circular WAL buffer
area), so as to minimize the number of writes issued under high load when
multiple WAL pages are filled per walwriter cycle.  This makes the worst-case
delay three wal_writer_delay cycles.

There are some other subtle points to consider with asynchronous commits.
First, for each page of CLOG we must remember the LSN of the latest commit
affecting the page, so that we can enforce the same flush-WAL-before-write
rule that we do for ordinary relation pages.  Otherwise the record of the
commit might reach disk before the WAL record does.  Again, abort records
need not factor into this consideration.

In fact, we store more than one LSN for each clog page.  This relates to
the way we set transaction status hint bits during visibility tests.
We must not set a transaction-committed hint bit on a relation page and
have that record make it to disk prior to the WAL record of the commit.
Since visibility tests are normally made while holding buffer share locks,
we do not have the option of changing the page's LSN to guarantee WAL
synchronization.  Instead, we defer the setting of the hint bit if we have
not yet flushed WAL as far as the LSN associated with the transaction.
This requires tracking the LSN of each unflushed async commit.  It is
convenient to associate this data with clog buffers: because we will flush
WAL before writing a clog page, we know that we do not need to remember a
transaction's LSN longer than the clog page holding its commit status
remains in memory.  However, the naive approach of storing an LSN for each
clog position is unattractive: the LSNs are 32x bigger than the two-bit
commit status fields, and so we'd need 256K of additional shared memory for
each 8K clog buffer page.  We choose instead to store a smaller number of
LSNs per page, where each LSN is the highest LSN associated with any
transaction commit in a contiguous range of transaction IDs on that page.
This saves storage at the price of some possibly-unnecessary delay in
setting transaction hint bits.

How many transactions should share the same cached LSN (N)?  If the
system's workload consists only of small async-commit transactions, then
it's reasonable to have N similar to the number of transactions per
walwriter cycle, since that is the granularity with which transactions will
become truly committed (and thus hintable) anyway.  The worst case is where
a sync-commit xact shares a cached LSN with an async-commit xact that
commits a bit later; even though we paid to sync the first xact to disk,
we won't be able to hint its outputs until the second xact is sync'd, up to
three walwriter cycles later.  This argues for keeping N (the group size)
as small as possible.  For the moment we are setting the group size to 32,
which makes the LSN cache space the same size as the actual clog buffer
space (independently of BLCKSZ).

It is useful that we can run both synchronous and asynchronous commit
transactions concurrently, but the safety of this is perhaps not
immediately obvious.  Assume we have two transactions, T1 and T2.  The Log
Sequence Number (LSN) is the point in the WAL sequence where a transaction
commit is recorded, so LSN1 and LSN2 are the commit records of those
transactions.  If T2 can see changes made by T1 then when T2 commits it
must be true that LSN2 follows LSN1.  Thus when T2 commits it is certain
that all of the changes made by T1 are also now recorded in the WAL.  This
is true whether T1 was asynchronous or synchronous.  As a result, it is
safe for asynchronous commits and synchronous commits to work concurrently
without endangering data written by synchronous commits.  Sub-transactions
are not important here since the final write to disk only occurs at the
commit of the top level transaction.

Changes to data blocks cannot reach disk unless WAL is flushed up to the
point of the LSN of the data blocks.  Any attempt to write unsafe data to
disk will trigger a write which ensures the safety of all data written by
that and prior transactions.  Data blocks and clog pages are both protected
by LSNs.

Changes to a temp table are not WAL-logged, hence could reach disk in
advance of T1's commit, but we don't care since temp table contents don't
survive crashes anyway.

Database writes that skip WAL for new relfilenodes are also safe.  In these
cases it's entirely possible for the data to reach disk before T1's commit,
because T1 will fsync it down to disk without any sort of interlock.  However,
all these paths are designed to write data that no other transaction can see
until after T1 commits.  The situation is thus not different from ordinary
WAL-logged updates.

Transaction Emulation during Recovery
-------------------------------------

During Recovery we replay transaction changes in the order they occurred.
As part of this replay we emulate some transactional behaviour, so that
read only backends can take MVCC snapshots.  We do this by maintaining a
list of XIDs belonging to transactions that are being replayed, so that
each transaction that has recorded WAL records for database writes exists
in the array until it commits.  Further details are given in comments in
procarray.c.

Many actions write no WAL records at all, for example read only transactions.
These have no effect on MVCC in recovery and we can pretend they never
occurred at all.  Subtransaction commit does not write a WAL record either
and has very little effect, since lock waiters need to wait for the
parent transaction to complete.

Not all transactional behaviour is emulated, for example we do not insert
a transaction entry into the lock table, nor do we maintain the transaction
stack in memory.  Clog, multixact and commit_ts entries are made normally.
Subtrans is maintained during recovery but the details of the transaction
tree are ignored and all subtransactions reference the top-level TransactionId
directly.  Since commit is atomic this provides correct lock wait behaviour
yet simplifies emulation of subtransactions considerably.

Further details on locking mechanics in recovery are given in comments
with the Lock rmgr code.


README.parallel

Overview
========

PostgreSQL provides some simple facilities to make writing parallel algorithms
easier.  Using a data structure called a ParallelContext, you can arrange to
launch background worker processes, initialize their state to match that of
the backend which initiated parallelism, communicate with them via dynamic
shared memory, and write reasonably complex code that can run either in the
user backend or in one of the parallel workers without needing to be aware of
where it's running.

The backend which starts a parallel operation (hereafter, the initiating
backend) starts by creating a dynamic shared memory segment which will last
for the lifetime of the parallel operation.  This dynamic shared memory segment
will contain (1) a shm_mq that can be used to transport errors (and other
messages reported via elog/ereport) from the worker back to the initiating
backend; (2) serialized representations of the initiating backend's private
state, so that the worker can synchronize its state with that of the
initiating backend; and (3) any other data structures which a particular
user of the ParallelContext data structure may wish to add for its own
purposes.  Once the initiating backend has initialized the dynamic shared
memory segment, it asks the postmaster to launch the appropriate number of
parallel workers.  These workers then connect to the dynamic shared memory
segment, initialize their state, and then invoke the appropriate entrypoint,
as further detailed below.

Error Reporting
===============

When started, each parallel worker begins by attaching the dynamic shared
memory segment and locating the shm_mq to be used for error reporting; it
redirects all of its protocol messages to this shm_mq.  Prior to this point,
any failure of the background worker will not be reported to the initiating
backend; from the point of view of the initiating backend, the worker simply
failed to start.  The initiating backend must anyway be prepared to cope
with fewer parallel workers than it originally requested, so catering to
this case imposes no additional burden.

Whenever a new message (or partial message; very large messages may wrap) is
sent to the error-reporting queue, PROCSIG_PARALLEL_MESSAGE is sent to the
initiating backend.  This causes the next CHECK_FOR_INTERRUPTS() in the
initiating backend to read and rethrow the message.  For the most part, this
makes error reporting in parallel mode "just work".  Of course, to work
properly, it is important that the code the initiating backend is executing
call CHECK_FOR_INTERRUPTS() regularly and avoid blocking interrupt processing
for long periods of time, but those are good things to do anyway.

(A currently-unsolved problem is that some messages may get written to the
system log twice, once in the backend where the report was originally
generated, and again when the initiating backend rethrows the message.  If
we decide to suppress one of these reports, it should probably be the second
one; otherwise, if the worker is for some reason unable to propagate the
message back to the initiating backend, the message will be lost altogether.)

State Sharing
=============

It's possible to write C code which works correctly without parallelism, but
which fails when parallelism is used.  No parallel infrastructure can
completely eliminate this problem, because any global variable is a risk.
There's no general mechanism for ensuring that every global variable in the
worker will have the same value that it does in the initiating backend; even
if we could ensure that, some function we're calling could update the variable
after each call, and only the backend where that update is performed will see
the new value.  Similar problems can arise with any more-complex data
structure we might choose to use.  For example, a pseudo-random number
generator should, given a particular seed value, produce the same predictable
series of values every time.  But it does this by relying on some private
state which won't automatically be shared between cooperating backends.  A
parallel-safe PRNG would need to store its state in dynamic shared memory, and
would require locking.  The parallelism infrastructure has no way of knowing
whether the user intends to call code that has this sort of problem, and can't
do anything about it anyway.

Instead, we take a more pragmatic approach.  First, we try to make as many of
the operations that are safe outside of parallel mode work correctly in
parallel mode as well.  Second, we try to prohibit common unsafe operations
via suitable error checks.  These checks are intended to catch 100% of
unsafe things that a user might do from the SQL interface, but code written
in C can do unsafe things that won't trigger these checks.  The error checks
are engaged via EnterParallelMode(), which should be called before creating
a parallel context, and disarmed via ExitParallelMode(), which should be
called after all parallel contexts have been destroyed.  The most
significant restriction imposed by parallel mode is that all operations must
be strictly read-only; we allow no writes to the database and no DDL.  We
might try to relax these restrictions in the future.

To make as many operations as possible safe in parallel mode, we try to copy
the most important pieces of state from the initiating backend to each parallel
worker.  This includes:

  - The set of libraries dynamically loaded by dfmgr.c.

  - The authenticated user ID and current database.  Each parallel worker
    will connect to the same database as the initiating backend, using the
    same user ID.

  - The values of all GUCs.  Accordingly, permanent changes to the value of
    any GUC are forbidden while in parallel mode; but temporary changes,
    such as entering a function with non-NULL proconfig, are OK.

  - The current subtransaction's XID, the top-level transaction's XID, and
    the list of XIDs considered current (that is, they are in-progress or
    subcommitted).  This information is needed to ensure that tuple visibility
    checks return the same results in the worker as they do in the
    initiating backend.  See also the section Transaction Integration, below.

  - The combo CID mappings.  This is needed to ensure consistent answers to
    tuple visibility checks.  The need to synchronize this data structure is
    a major reason why we can't support writes in parallel mode: such writes
    might create new combo CIDs, and we have no way to let other workers
    (or the initiating backend) know about them.

  - The transaction snapshot.

  - The active snapshot, which might be different from the transaction
    snapshot.

  - The currently active user ID and security context.  Note that this is
    the fourth user ID we restore: the initial step of binding to the correct
    database also involves restoring the authenticated user ID.  When GUC
    values are restored, this incidentally sets SessionUserId and OuterUserId
    to the correct values.  This final step restores CurrentUserId.

  - State related to pending REINDEX operations, which prevents access to
    an index that is currently being rebuilt.

  - Active relmapper.c mapping state.  This is needed to allow consistent
    answers when fetching the current relfilenode for relation oids of
    mapped relations.

To prevent unprincipled deadlocks when running in parallel mode, this code
also arranges for the leader and all workers to participate in group
locking.  See src/backend/storage/lmgr/README for more details.

Transaction Integration
=======================

Regardless of what the TransactionState stack looks like in the parallel
leader, each parallel worker ends up with a stack of depth 1.  This stack
entry is marked with the special transaction block state
TBLOCK_PARALLEL_INPROGRESS so that it's not confused with an ordinary
toplevel transaction.  The XID of this TransactionState is set to the XID of
the innermost currently-active subtransaction in the initiating backend.  The
initiating backend's toplevel XID, and all current (in-progress or
subcommitted) XIDs, are stored separately from the TransactionState stack,
but in such a way that GetTopTransactionId(), GetTopTransactionIdIfAny(), and
TransactionIdIsCurrentTransactionId() return the same values that they would
in the initiating backend.  We could copy the entire transaction state stack,
but most of it would be useless: for example, you can't roll back to a
savepoint from within a parallel worker, and there are no resources
associated with the memory contexts or resource owners of intermediate
subtransactions.

No meaningful change to the transaction state can be made while in parallel
mode.  No XIDs can be assigned, and no subtransactions can start or end,
because we have no way of communicating these state changes to cooperating
backends, or of synchronizing them.  It's clearly unworkable for the initiating
backend to exit any transaction or subtransaction that was in progress when
parallelism was started before all parallel workers have exited; and it's even
more clearly crazy for a parallel worker to try to subcommit or subabort the
current subtransaction and execute in some other transaction context than was
present in the initiating backend.  It might be practical to allow internal
sub-transactions (e.g. to implement a PL/pgSQL EXCEPTION block) to be used in
parallel mode, provided that they are XID-less, because other backends
wouldn't really need to know about those transactions or do anything
differently because of them.  Right now, we don't even allow that.

At the end of a parallel operation, which can happen either because it
completed successfully or because it was interrupted by an error, parallel
workers associated with that operation exit.  In the error case, transaction
abort processing in the parallel leader kills off any remaining workers, and
the parallel leader then waits for them to die.  In the case of a successful
parallel operation, the parallel leader does not send any signals, but must
wait for workers to complete and exit of their own volition.  In either
case, it is very important that all workers actually exit before the
parallel leader cleans up the (sub)transaction in which they were created;
otherwise, chaos can ensue.  For example, if the leader is rolling back the
transaction that created the relation being scanned by a worker, the
relation could disappear while the worker is still busy scanning it.  That's
not safe.

Generally, the cleanup performed by each worker at this point is similar to
top-level commit or abort.  Each backend has its own resource owners: buffer
pins, catcache or relcache reference counts, tuple descriptors, and so on
are managed separately by each backend, and must be freed before exiting.
There are, however, some important differences between parallel worker
commit or abort and a real top-level transaction commit or abort.  Most
importantly:

  - No commit or abort record is written; the initiating backend is
    responsible for this.

  - Cleanup of pg_temp namespaces is not done.  Parallel workers cannot
    safely access the initiating backend's pg_temp namespace, and should
    not create one of their own.

Coding Conventions
===================

Before beginning any parallel operation, call EnterParallelMode(); after all
parallel operations are completed, call ExitParallelMode().  To actually
parallelize a particular operation, use a ParallelContext.  The basic coding
pattern looks like this:

	EnterParallelMode();		/* prohibit unsafe state changes */

	pcxt = CreateParallelContext("library_name", "function_name", nworkers);

	/* Allow space for application-specific data here. */
	shm_toc_estimate_chunk(&pcxt->estimator, size);
	shm_toc_estimate_keys(&pcxt->estimator, keys);

	InitializeParallelDSM(pcxt);	/* create DSM and copy state to it */

	/* Store the data for which we reserved space. */
	space = shm_toc_allocate(pcxt->toc, size);
	shm_toc_insert(pcxt->toc, key, space);

	LaunchParallelWorkers(pcxt);

	/* do parallel stuff */

	WaitForParallelWorkersToFinish(pcxt);

	/* read any final results from dynamic shared memory */

	DestroyParallelContext(pcxt);

	ExitParallelMode();

If desired, after WaitForParallelWorkersToFinish() has been called, the
context can be reset so that workers can be launched anew using the same
parallel context.  To do this, first call ReinitializeParallelDSM() to
reinitialize state managed by the parallel context machinery itself; then,
perform any other necessary resetting of state; after that, you can again
call LaunchParallelWorkers.
