src/backend/access/transam/README

The Transaction System
======================

PostgreSQL's transaction system is a three-layer system.  The bottom layer
implements low-level transactions and subtransactions, on top of which rests
the mainloop's control code, which in turn implements user-visible
transactions and savepoints.

The middle layer of code is called by postgres.c before and after the
processing of each query, or after detecting an error:

        StartTransactionCommand
        CommitTransactionCommand
        AbortCurrentTransaction

Meanwhile, the user can alter the system's state by issuing the SQL commands
BEGIN, COMMIT, ROLLBACK, SAVEPOINT, ROLLBACK TO or RELEASE.  The traffic cop
redirects these calls to the toplevel routines

        BeginTransactionBlock
        EndTransactionBlock
        UserAbortTransactionBlock
        DefineSavepoint
        RollbackToSavepoint
        ReleaseSavepoint

respectively.  Depending on the current state of the system, these functions
call low level functions to activate the real transaction system:

        StartTransaction
        CommitTransaction
        AbortTransaction
        CleanupTransaction
        StartSubTransaction
        CommitSubTransaction
        AbortSubTransaction
        CleanupSubTransaction

Additionally, within a transaction, CommandCounterIncrement is called to
increment the command counter, which allows future commands to "see" the
effects of previous commands within the same transaction.  Note that this is
done automatically by CommitTransactionCommand after each query inside a
transaction block, but some utility functions also do it internally to allow
some operations (usually in the system catalogs) to be seen by future
operations in the same utility command.  (For example, in DefineRelation it is
done after creating the heap so the pg_class row is visible, to be able to
lock it.)


For example, consider the following sequence of user commands:

1)      BEGIN
2)      SELECT * FROM foo
3)      INSERT INTO foo VALUES (...)
4)      COMMIT

In the main processing loop, this results in the following function call
sequence:

     /  StartTransactionCommand;
    /       StartTransaction;
1) <    ProcessUtility;                 << BEGIN
    \       BeginTransactionBlock;
     \  CommitTransactionCommand;

    /   StartTransactionCommand;
2) /    PortalRunSelect;                << SELECT ...
   \    CommitTransactionCommand;
    \       CommandCounterIncrement;

    /   StartTransactionCommand;
3) /    ProcessQuery;                   << INSERT ...
   \    CommitTransactionCommand;
    \       CommandCounterIncrement;

     /  StartTransactionCommand;
    /   ProcessUtility;                 << COMMIT
4) <        EndTransactionBlock;
    \   CommitTransactionCommand;
     \      CommitTransaction;

The point of this example is to demonstrate the need for
StartTransactionCommand and CommitTransactionCommand to be state smart -- they
should call CommandCounterIncrement between the calls to BeginTransactionBlock
and EndTransactionBlock and outside these calls they need to do normal start,
commit or abort processing.

Furthermore, suppose the "SELECT * FROM foo" caused an abort condition. In
this case AbortCurrentTransaction is called, and the transaction is put in
aborted state.  In this state, any user input is ignored except for
transaction-termination statements, or ROLLBACK TO <savepoint> commands.

Transaction aborts can occur in two ways:

1) system dies from some internal cause  (syntax error, etc)
2) user types ROLLBACK

The reason we have to distinguish them is illustrated by the following two
situations:

        case 1                                  case 2
        ------                                  ------
1) user types BEGIN                     1) user types BEGIN
2) user does something                  2) user does something
3) user does not like what              3) system aborts for some reason
   she sees and types ABORT                (syntax error, etc)

In case 1, we want to abort the transaction and return to the default state.
In case 2, there may be more commands coming our way which are part of the
same transaction block; we have to ignore these commands until we see a COMMIT
or ROLLBACK.

Internal aborts are handled by AbortCurrentTransaction, while user aborts are
handled by UserAbortTransactionBlock.  Both of them rely on AbortTransaction
to do all the real work.  The only difference is what state we enter after
AbortTransaction does its work:

* AbortCurrentTransaction leaves us in TBLOCK_ABORT,
* UserAbortTransactionBlock leaves us in TBLOCK_ABORT_END

Low-level transaction abort handling is divided in two phases:
* AbortTransaction executes as soon as we realize the transaction has
  failed.  It should release all shared resources (locks etc) so that we do
  not delay other backends unnecessarily.
* CleanupTransaction executes when we finally see a user COMMIT
  or ROLLBACK command; it cleans things up and gets us out of the transaction
  completely.  In particular, we mustn't destroy TopTransactionContext until
  this point.

Also, note that when a transaction is committed, we don't close it right away.
Rather it's put in TBLOCK_END state, which means that when
CommitTransactionCommand is called after the query has finished processing,
the transaction has to be closed.  The distinction is subtle but important,
because it means that control will leave the xact.c code with the transaction
open, and the main loop will be able to keep processing inside the same
transaction.  So, in a sense, transaction commit is also handled in two
phases, the first at EndTransactionBlock and the second at
CommitTransactionCommand (which is where CommitTransaction is actually
called).

The rest of the code in xact.c consists of routines to support the creation
and finishing of transactions and subtransactions.  For example, AtStart_Memory
takes care of initializing the memory subsystem at main transaction start.


Subtransaction Handling
-----------------------

Subtransactions are implemented using a stack of TransactionState structures,
each of which has a pointer to its parent transaction's struct.  When a new
subtransaction is to be opened, PushTransaction is called, which creates a new
TransactionState, with its parent link pointing to the current transaction.
StartSubTransaction is in charge of initializing the new TransactionState to
sane values, and properly initializing other subsystems (AtSubStart routines).

When closing a subtransaction, either CommitSubTransaction has to be called
(if the subtransaction is committing), or AbortSubTransaction and
CleanupSubTransaction (if it's aborting).  In either case, PopTransaction is
called so the system returns to the parent transaction.

One important point regarding subtransaction handling is that several may need
to be closed in response to a single user command.  That's because savepoints
have names, and we allow committing or rolling back a savepoint by name, which
is not necessarily the one that was last opened.  Also a COMMIT or ROLLBACK
command must be able to close out the entire stack.  We handle this by having
the utility command subroutine mark all the state stack entries as commit-
pending or abort-pending, and then when the main loop reaches
CommitTransactionCommand, the real work is done.  The main point of doing
things this way is that if we get an error while popping state stack entries,
the remaining stack entries still show what we need to do to finish up.

In the case of ROLLBACK TO <savepoint>, we abort all the subtransactions up
through the one identified by the savepoint name, and then re-create that
subtransaction level with the same name.  So it's a completely new
subtransaction as far as the internals are concerned.

Other subsystems are allowed to start "internal" subtransactions, which are
handled by BeginInternalSubtransaction.  This is to allow implementing
exception handling, e.g. in PL/pgSQL.  ReleaseCurrentSubTransaction and
RollbackAndReleaseCurrentSubTransaction allow the subsystem to close said
subtransactions.  The main difference between this and the savepoint/release
path is that we execute the complete state transition immediately in each
subroutine, rather than deferring some work until CommitTransactionCommand.
Another difference is that BeginInternalSubtransaction is allowed when no
explicit transaction block has been established, while DefineSavepoint is not.


Transaction and Subtransaction Numbering
----------------------------------------

Transactions and subtransactions are assigned permanent XIDs only when/if
they first do something that requires one --- typically, insert/update/delete
a tuple, though there are a few other places that need an XID assigned.
If a subtransaction requires an XID, we always first assign one to its
parent.  This maintains the invariant that child transactions have XIDs later
than their parents, which is assumed in a number of places.

The subsidiary actions of obtaining a lock on the XID and entering it into
pg_subtrans and PG_PROC are done at the time it is assigned.

A transaction that has no XID still needs to be identified for various
purposes, notably holding locks.  For this purpose we assign a "virtual
transaction ID" or VXID to each top-level transaction.  VXIDs are formed from
two fields, the backendID and a backend-local counter; this arrangement allows
assignment of a new VXID at transaction start without any contention for
shared memory.  To ensure that a VXID isn't re-used too soon after backend
exit, we store the last local counter value into shared memory at backend
exit, and initialize it from the previous value for the same backendID slot
at backend start.  All these counters go back to zero at shared memory
re-initialization, but that's OK because VXIDs never appear anywhere on-disk.

Internally, a backend needs a way to identify subtransactions whether or not
they have XIDs; but this need only lasts as long as the parent top transaction
endures.  Therefore, we have SubTransactionId, which is somewhat like
CommandId in that it's generated from a counter that we reset at the start of
each top transaction.  The top-level transaction itself has SubTransactionId 1,
and subtransactions have IDs 2 and up.  (Zero is reserved for
InvalidSubTransactionId.)  Note that subtransactions do not have their
own VXIDs; they use the parent top transaction's VXID.


Interlocking Transaction Begin, Transaction End, and Snapshots
--------------------------------------------------------------

We try hard to minimize the amount of overhead and lock contention involved
in the frequent activities of beginning/ending a transaction and taking a
snapshot.  Unfortunately, we must have some interlocking for this, because
we must ensure consistency about the commit order of transactions.
For example, suppose an UPDATE in xact A is blocked by xact B's prior
update of the same row, and xact B is doing commit while xact C gets a
snapshot.  Xact A can complete and commit as soon as B releases its locks.
If xact C's GetSnapshotData sees xact B as still running, then it had
better see xact A as still running as well, or it will be able to see two
tuple versions - one deleted by xact B and one inserted by xact A.  Another
reason why this would be bad is that C would see (in the row inserted by A)
earlier changes by B, and it would be inconsistent for C not to see any
of B's changes elsewhere in the database.

Formally, the correctness requirement is "if a snapshot A considers
transaction X as committed, and any of transaction X's snapshots considered
transaction Y as committed, then snapshot A must consider transaction Y as
committed".

What we actually enforce is strict serialization of commits and rollbacks
with snapshot-taking: we do not allow any transaction to exit the set of
running transactions while a snapshot is being taken.  (This rule is
stronger than necessary for consistency, but is relatively simple to
enforce, and it assists with some other issues as explained below.)  The
implementation of this is that GetSnapshotData takes the ProcArrayLock in
shared mode (so that multiple backends can take snapshots in parallel),
but ProcArrayEndTransaction must take the ProcArrayLock in exclusive mode
while clearing MyPgXact->xid at transaction end (either commit or abort).
(To reduce context switching, when multiple transactions commit nearly
simultaneously, we have one backend take ProcArrayLock and clear the XIDs
of multiple processes at once.)

ProcArrayEndTransaction also holds the lock while advancing the shared
latestCompletedXid variable.  This allows GetSnapshotData to use
latestCompletedXid + 1 as xmax for its snapshot: there can be no
transaction >= this xid value that the snapshot needs to consider as
completed.

In short, then, the rule is that no transaction may exit the set of
currently-running transactions between the time we fetch latestCompletedXid
and the time we finish building our snapshot.  However, this restriction
only applies to transactions that have an XID --- read-only transactions
can end without acquiring ProcArrayLock, since they don't affect anyone
else's snapshot nor latestCompletedXid.

Transaction start, per se, doesn't have any interlocking with these
considerations, since we no longer assign an XID immediately at transaction
start.  But when we do decide to allocate an XID, GetNewTransactionId must
store the new XID into the shared ProcArray before releasing XidGenLock.
This ensures that all top-level XIDs <= latestCompletedXid are either
present in the ProcArray, or not running anymore.  (This guarantee doesn't
apply to subtransaction XIDs, because of the possibility that there's not
room for them in the subxid array; instead we guarantee that they are
present or the overflow flag is set.)  If a backend released XidGenLock
before storing its XID into MyPgXact, then it would be possible for another
backend to allocate and commit a later XID, causing latestCompletedXid to
pass the first backend's XID, before that value became visible in the
ProcArray.  That would break GetOldestXmin, as discussed below.

We allow GetNewTransactionId to store the XID into MyPgXact->xid (or the
subxid array) without taking ProcArrayLock.  This was once necessary to
avoid deadlock; while that is no longer the case, it's still beneficial for
performance.  We are thereby relying on fetch/store of an XID to be atomic,
else other backends might see a partially-set XID.  This also means that
readers of the ProcArray xid fields must be careful to fetch a value only
once, rather than assume they can read it multiple times and get the same
answer each time.  (Use volatile-qualified pointers when doing this, to
ensure that the C compiler does exactly what you tell it to.)

Another important activity that uses the shared ProcArray is GetOldestXmin,
which must determine a lower bound for the oldest xmin of any active MVCC
snapshot, system-wide.  Each individual backend advertises the smallest
xmin of its own snapshots in MyPgXact->xmin, or zero if it currently has no
live snapshots (eg, if it's between transactions or hasn't yet set a
snapshot for a new transaction).  GetOldestXmin takes the MIN() of the
valid xmin fields.  It does this with only shared lock on ProcArrayLock,
which means there is a potential race condition against other backends
doing GetSnapshotData concurrently: we must be certain that a concurrent
backend that is about to set its xmin does not compute an xmin less than
what GetOldestXmin returns.  We ensure that by including all the active
XIDs into the MIN() calculation, along with the valid xmins.  The rule that
transactions can't exit without taking exclusive ProcArrayLock ensures that
concurrent holders of shared ProcArrayLock will compute the same minimum of
currently-active XIDs: no xact, in particular not the oldest, can exit
while we hold shared ProcArrayLock.  So GetOldestXmin's view of the minimum
active XID will be the same as that of any concurrent GetSnapshotData, and
so it can't produce an overestimate.  If there is no active transaction at
all, GetOldestXmin returns latestCompletedXid + 1, which is a lower bound
for the xmin that might be computed by concurrent or later GetSnapshotData
calls.  (We know that no XID less than this could be about to appear in
the ProcArray, because of the XidGenLock interlock discussed above.)

GetSnapshotData also performs an oldest-xmin calculation (which had better
match GetOldestXmin's) and stores that into RecentGlobalXmin, which is used
for some tuple age cutoff checks where a fresh call of GetOldestXmin seems
too expensive.  Note that while it is certain that two concurrent
executions of GetSnapshotData will compute the same xmin for their own
snapshots, as argued above, it is not certain that they will arrive at the
same estimate of RecentGlobalXmin.  This is because we allow XID-less
transactions to clear their MyPgXact->xmin asynchronously (without taking
ProcArrayLock), so one execution might see what had been the oldest xmin,
and another not.  This is OK since RecentGlobalXmin need only be a valid
lower bound.  As noted above, we are already assuming that fetch/store
of the xid fields is atomic, so assuming it for xmin as well is no extra
risk.


pg_xact and pg_subtrans
-----------------------

pg_xact and pg_subtrans are permanent (on-disk) storage of transaction related
information.  There is a limited number of pages of each kept in memory, so
in many cases there is no need to actually read from disk.  However, if
there's a long running transaction or a backend sitting idle with an open
transaction, it may be necessary to be able to read and write this information
from disk.  They also allow information to be permanent across server restarts.

pg_xact records the commit status for each transaction that has been assigned
an XID.  A transaction can be in progress, committed, aborted, or
"sub-committed".  This last state means that it's a subtransaction that's no
longer running, but its parent has not updated its state yet.  It is not
necessary to update a subtransaction's transaction status to sub-committed
immediately, so we can just defer it until main transaction commit.  The main
role of marking transactions as sub-committed is to provide an atomic commit
protocol when transaction status is spread across multiple clog pages.  As a
result, whenever transaction status spreads across multiple pages we must use
a two-phase commit protocol: the first phase is to mark the subtransactions as
sub-committed, then we mark the top level transaction and all its
subtransactions committed (in that order).  Thus, subtransactions that have
not aborted appear as in-progress even when they have already finished, and
the sub-commit status appears as a very short transitory state during main
transaction commit.  Subtransaction abort is always marked in clog as soon as
it occurs.  When the transaction statuses all fit on a single CLOG page, we
atomically mark them all as committed without bothering with the intermediate
sub-commit state.

Savepoints are implemented using subtransactions.  A subtransaction is a
transaction inside a transaction; its commit or abort status is not only
dependent on whether it committed itself, but also whether its parent
transaction committed.  To implement multiple savepoints in a transaction we
allow unlimited transaction nesting depth, so any particular subtransaction's
commit state is dependent on the commit status of each and every ancestor
transaction.

The "subtransaction parent" (pg_subtrans) mechanism records, for each
transaction with an XID, the TransactionId of its parent transaction.  This
information is stored as soon as the subtransaction is assigned an XID.
Top-level transactions do not have a parent, so they leave their pg_subtrans
entries set to the default value of zero (InvalidTransactionId).

pg_subtrans is used to check whether the transaction in question is still
running --- the main Xid of a transaction is recorded in the PGXACT struct,
but since we allow arbitrary nesting of subtransactions, we can't fit all Xids
in shared memory, so we have to store them on disk.  Note, however, that for
each transaction we keep a "cache" of Xids that are known to be part of the
transaction tree, so we can skip looking at pg_subtrans unless we know the
cache has been overflowed.  See storage/ipc/procarray.c for the gory details.

slru.c is the supporting mechanism for both pg_xact and pg_subtrans.  It
implements the LRU policy for in-memory buffer pages.  The high-level routines
for pg_xact are implemented in transam.c, while the low-level functions are in
clog.c.  pg_subtrans is contained completely in subtrans.c.


Write-Ahead Log Coding
----------------------

The WAL subsystem (also called XLOG in the code) exists to guarantee crash
recovery.  It can also be used to provide point-in-time recovery, as well as
hot-standby replication via log shipping.  Here are some notes about
non-obvious aspects of its design.

A basic assumption of a write AHEAD log is that log entries must reach stable
storage before the data-page changes they describe.  This ensures that
replaying the log to its end will bring us to a consistent state where there
are no partially-performed transactions.  To guarantee this, each data page
(either heap or index) is marked with the LSN (log sequence number --- in
practice, a WAL file location) of the latest XLOG record affecting the page.
Before the bufmgr can write out a dirty page, it must ensure that xlog has
been flushed to disk at least up to the page's LSN.  This low-level
interaction improves performance by not waiting for XLOG I/O until necessary.
The LSN check exists only in the shared-buffer manager, not in the local
buffer manager used for temp tables; hence operations on temp tables must not
be WAL-logged.

During WAL replay, we can check the LSN of a page to detect whether the change
recorded by the current log entry is already applied (it has been, if the page
LSN is >= the log entry's WAL location).

Usually, log entries contain just enough information to redo a single
incremental update on a page (or small group of pages).  This will work only
if the filesystem and hardware implement data page writes as atomic actions,
so that a page is never left in a corrupt partly-written state.  Since that's
often an untenable assumption in practice, we log additional information to
allow complete reconstruction of modified pages.  The first WAL record
affecting a given page after a checkpoint is made to contain a copy of the
entire page, and we implement replay by restoring that page copy instead of
redoing the update.  (This is more reliable than the data storage itself would
be because we can check the validity of the WAL record's CRC.)  We can detect
the "first change after checkpoint" by noting whether the page's old LSN
precedes the end of WAL as of the last checkpoint (the RedoRecPtr).

The general schema for executing a WAL-logged action is

1. Pin and exclusive-lock the shared buffer(s) containing the data page(s)
to be modified.

2. START_CRIT_SECTION()  (Any error during the next three steps must cause a
PANIC because the shared buffers will contain unlogged changes, which we
have to ensure don't get to disk.  Obviously, you should check conditions
such as whether there's enough free space on the page before you start the
critical section.)

3. Apply the required changes to the shared buffer(s).

4. Mark the shared buffer(s) as dirty with MarkBufferDirty().  (This must
happen before the WAL record is inserted; see notes in SyncOneBuffer().)
Note that marking a buffer dirty with MarkBufferDirty() should happen
if and only if you write a WAL record; see Writing Hints below.

5. If the relation requires WAL-logging, build a WAL record using
XLogBeginInsert and XLogRegister* functions, and insert it.  (See
"Constructing a WAL record" below).  Then update the page's LSN using the
returned XLOG location.  For instance,

		XLogBeginInsert();
		XLogRegisterBuffer(...);
		XLogRegisterData(...);
		recptr = XLogInsert(rmgr_id, info);

		PageSetLSN(dp, recptr);

6. END_CRIT_SECTION()

7. Unlock and unpin the buffer(s).

Complex changes (such as a multilevel index insertion) normally need to be
described by a series of atomic-action WAL records.  The intermediate states
must be self-consistent, so that if the replay is interrupted between any
two actions, the system is fully functional.  In btree indexes, for example,
a page split requires a new page to be allocated, and an insertion of a new
key in the parent btree level, but for locking reasons this has to be
reflected by two separate WAL records.  Replaying the first record, to
allocate the new page and move tuples to it, sets a flag on the page to
indicate that the key has not been inserted to the parent yet.  Replaying the
second record clears the flag.  This intermediate state is never seen by
other backends during normal operation, because the lock on the child page
is held across the two actions, but will be seen if the operation is
interrupted before writing the second WAL record.  The search algorithm works
with the intermediate state as normal, but if an insertion encounters a page
with the incomplete-split flag set, it will finish the interrupted split by
inserting the key to the parent, before proceeding.


Constructing a WAL record
-------------------------

A WAL record consists of a header common to all WAL record types,
record-specific data, and information about the data blocks modified.  Each
modified data block is identified by an ID number, and can optionally have
more record-specific data associated with the block.  If XLogInsert decides
that a full-page image of a block needs to be taken, the data associated
with that block is not included.

The API for constructing a WAL record consists of five functions:
XLogBeginInsert, XLogRegisterBuffer, XLogRegisterData, XLogRegisterBufData,
and XLogInsert.  First, call XLogBeginInsert().  Then register all the buffers
modified, and data needed to replay the changes, using XLogRegister*
functions.  Finally, insert the constructed record to the WAL by calling
XLogInsert().

	XLogBeginInsert();

	/* register buffers modified as part of this WAL-logged action */
	XLogRegisterBuffer(0, lbuffer, REGBUF_STANDARD);
	XLogRegisterBuffer(1, rbuffer, REGBUF_STANDARD);

	/* register data that is always included in the WAL record */
	XLogRegisterData(&xlrec, SizeOfFictionalAction);

	/*
	 * register data associated with a buffer. This will not be included
	 * in the record if a full-page image is taken.
	 */
	XLogRegisterBufData(0, tuple->data, tuple->len);

	/* more data associated with the buffer */
	XLogRegisterBufData(0, data2, len2);

	/*
	 * Ok, all the data and buffers to include in the WAL record have
	 * been registered. Insert the record.
	 */
	recptr = XLogInsert(RM_FOO_ID, XLOG_FOOBAR_DO_STUFF);
520
521Details of the API functions:
522
523void XLogBeginInsert(void)
524
525    Must be called before XLogRegisterBuffer and XLogRegisterData.
526
527void XLogResetInsertion(void)
528
529    Clear any currently registered data and buffers from the WAL record
530    construction workspace.  This is only needed if you have already called
531    XLogBeginInsert(), but decide to not insert the record after all.
532
533void XLogEnsureRecordSpace(int max_block_id, int nrdatas)
534
535    Normally, the WAL record construction buffers have the following limits:
536
537    * highest block ID that can be used is 4 (allowing five block references)
538    * Max 20 chunks of registered data
539
540    These default limits are enough for most record types that change some
541    on-disk structures.  For the odd case that requires more data, or needs to
542    modify more buffers, these limits can be raised by calling
543    XLogEnsureRecordSpace().  XLogEnsureRecordSpace() must be called before
544    XLogBeginInsert(), and outside a critical section.
545
546void XLogRegisterBuffer(uint8 block_id, Buffer buf, uint8 flags);
547
548    XLogRegisterBuffer adds information about a data block to the WAL record.
549    block_id is an arbitrary number used to identify this page reference in
550    the redo routine.  The information needed to re-find the page at redo -
551    relfilenode, fork, and block number - are included in the WAL record.
552
553    XLogInsert will automatically include a full copy of the page contents, if
554    this is the first modification of the buffer since the last checkpoint.
555    It is important to register every buffer modified by the action with
556    XLogRegisterBuffer, to avoid torn-page hazards.
557
    The flags control when and how the buffer contents are included in the
    WAL record.  Normally, a full-page image is taken only if the page has not
    been modified since the last checkpoint, and only if full_page_writes=on
    or an online backup is in progress.  The REGBUF_FORCE_IMAGE flag can be
    used to force a full-page image to always be included; that is useful
    e.g. for an operation that rewrites most of the page, so that tracking the
    details is not worth it.  For the rare case where it is not necessary to
    protect from torn pages, the REGBUF_NO_IMAGE flag can be used to suppress
    the full-page image.  REGBUF_WILL_INIT also suppresses a full-page image,
    but the redo routine must then re-generate the page from scratch, without
    looking at the old page contents.  Re-initializing the page protects from
    torn-page hazards just as a full-page image does.

    The REGBUF_STANDARD flag can be specified together with the other flags to
    indicate that the page follows the standard page layout.  It causes the
    area between pd_lower and pd_upper to be left out of the image, reducing
    WAL volume.

    If the REGBUF_KEEP_DATA flag is given, any per-buffer data registered with
    XLogRegisterBufData() is included in the WAL record even if a full-page
    image is taken.

void XLogRegisterData(char *data, int len);

    XLogRegisterData is used to include arbitrary data in the WAL record.  If
    XLogRegisterData() is called multiple times, the data are appended, and
    will be made available to the redo routine as one contiguous chunk.

void XLogRegisterBufData(uint8 block_id, char *data, int len);

    XLogRegisterBufData is used to include data associated with a particular
    buffer that was registered earlier with XLogRegisterBuffer().  If
    XLogRegisterBufData() is called multiple times with the same block ID, the
    data are appended, and will be made available to the redo routine as one
    contiguous chunk.

    If a full-page image of the buffer is taken at insertion, the data is not
    included in the WAL record, unless the REGBUF_KEEP_DATA flag is used.

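    As a sketch (RM_FOO_ID, XLOG_FOO_OPERATION, and the record struct xlrec
    are hypothetical names for illustration), the registration functions are
    used together between XLogBeginInsert() and XLogInsert(), inside a
    critical section, roughly like this:

	XLogBeginInsert();
	XLogRegisterData((char *) &xlrec, sizeof(xlrec));
	XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
	XLogRegisterBufData(0, (char *) payload, payloadlen);
	recptr = XLogInsert(RM_FOO_ID, XLOG_FOO_OPERATION);
	PageSetLSN(BufferGetPage(buffer), recptr);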

Writing a REDO routine
----------------------

A REDO routine uses the data and page references included in the WAL record
to reconstruct the new state of the page.  The record decoding functions
and macros in xlogreader.c/h can be used to extract the data from the record.

When replaying a WAL record that describes changes on multiple pages, you
must be careful to lock the pages properly to prevent concurrent Hot Standby
queries from seeing an inconsistent state.  If this requires that two
or more buffer locks be held concurrently, you must lock the pages in
appropriate order, and not release the locks until all the changes are done.

Note that we must only use PageSetLSN/PageGetLSN() when we know the action
is serialised.  Only the startup process may modify data blocks during
recovery, so it may execute PageGetLSN() without fear of serialisation
problems.  All other processes must only call PageSet/GetLSN when holding
either an exclusive buffer lock or a shared lock plus buffer header lock,
or be writing the data block directly rather than through shared buffers
while holding AccessExclusiveLock on the relation.
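
A minimal redo routine for a single-page change might look like this sketch
(foo_redo and the record contents are hypothetical; XLogReadBufferForRedo,
BufferGetPage, PageSetLSN, and MarkBufferDirty are the interfaces involved):

	void
	foo_redo(XLogReaderState *record)
	{
		Buffer		buffer;

		/* Apply the change only if no full-page image restored the page. */
		if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
		{
			Page		page = BufferGetPage(buffer);

			/* ... apply the change described by the record to "page" ... */

			PageSetLSN(page, record->EndRecPtr);
			MarkBufferDirty(buffer);
		}
		if (BufferIsValid(buffer))
			UnlockReleaseBuffer(buffer);
	}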


Writing Hints
-------------

In some cases, we write additional information to data blocks without
writing a preceding WAL record.  This should only happen if the data can
be reconstructed later following a crash, and the action is simply a way
of optimising for performance.  When a hint is written we use
MarkBufferDirtyHint() to mark the block dirty.

If the buffer is clean and checksums are in use then
MarkBufferDirtyHint() inserts an XLOG_FPI record to ensure that we
take a full page image that includes the hint.  We do this to avoid
a partial page write when we write out the dirtied page.  WAL is not
written during recovery, so we simply skip dirtying blocks because
of hints when in recovery.

If you do decide to optimise away a WAL record, then any calls to
MarkBufferDirty() must be replaced by MarkBufferDirtyHint(),
otherwise you will expose the risk of partial page writes.
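
For example, setting a hint bit under a buffer share lock follows this
pattern (a sketch; the tuple-level details are elided, but
MarkBufferDirtyHint() and its buffer_std parameter, which indicates a
standard page layout, are the real interface):

	/* buffer is share-locked; we may set hint bits but not move data */
	tuple->t_infomask |= HEAP_XMIN_COMMITTED;
	MarkBufferDirtyHint(buffer, true);	/* true = standard page layout */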


Write-Ahead Logging for Filesystem Actions
------------------------------------------

The previous section described how to WAL-log actions that only change page
contents within shared buffers.  For that type of action it is generally
possible to check all likely error cases (such as insufficient space on the
page) before beginning to make the actual change.  Therefore we can make
the change and the creation of the associated WAL log record "atomic" by
wrapping them into a critical section --- the odds of failure partway
through are low enough that PANIC is acceptable if it does happen.

Clearly, that approach doesn't work for cases where there's a significant
probability of failure within the action to be logged, such as creation
of a new file or database.  We don't want to PANIC, and we especially don't
want to PANIC after having already written a WAL record that says we did
the action --- if we did, replay of the record would probably fail again
and PANIC again, making the failure unrecoverable.  This means that the
ordinary WAL rule of "write WAL before the changes it describes" doesn't
work, and we need a different design for such cases.

There are several basic types of filesystem actions that have this
issue.  Here is how we deal with each:

1. Adding a disk page to an existing table.

This action isn't WAL-logged at all.  We extend a table by writing a page
of zeroes at its end.  We must actually do this write so that we are sure
the filesystem has allocated the space.  If the write fails we can just
error out normally.  Once the space is known allocated, we can initialize
and fill the page via one or more normal WAL-logged actions.  Because it's
possible that we crash between extending the file and writing out the WAL
entries, we have to treat discovery of an all-zeroes page in a table or
index as being a non-error condition.  In such cases we can just reclaim
the space for re-use.

2. Creating a new table, which requires a new file in the filesystem.

We try to create the file, and if successful we make a WAL record saying
we did it.  If not successful, we can just throw an error.  Notice that
there is a window where we have created the file but not yet written any
WAL about it to disk.  If we crash during this window, the file remains
on disk as an "orphan".  It would be possible to clean up such orphans
by having the server search at restart for files that don't have any
committed entry in pg_class, but that currently isn't done because of the
possibility of deleting data that is useful for forensic analysis of the
crash.  Orphan files are harmless --- at worst they waste a bit of disk
space --- because we check for on-disk collisions when allocating new
relfilenode OIDs.  So cleaning up isn't really necessary.

3. Deleting a table, which requires an unlink() that could fail.

Our approach here is to WAL-log the operation first, but to treat failure
of the actual unlink() call as a warning rather than error condition.
Again, this can leave an orphan file behind, but that's cheap compared to
the alternatives.  Since we can't actually do the unlink() until after
we've committed the DROP TABLE transaction, throwing an error would be out
of the question anyway.  (It may be worth noting that the WAL entry about
the file deletion is actually part of the commit record for the dropping
transaction.)

4. Creating and deleting databases and tablespaces, which requires creating
and deleting directories and entire directory trees.

These cases are handled similarly to creating individual files, i.e., we
try to do the action first and then write a WAL entry if it succeeded.
The potential amount of wasted disk space is rather larger, of course.
In the creation case we try to delete the directory tree again if creation
fails, so as to reduce the risk of wasted space.  Failure partway through
a deletion operation results in a corrupt database: the DROP failed, but
some of the data is gone anyway.  There is little we can do about that,
though, and in any case it was presumably data the user no longer wants.

In all of these cases, if WAL replay fails to redo the original action
we must panic and abort recovery.  The DBA will have to manually clean up
(for instance, free up some disk space or fix directory permissions) and
then restart recovery.  This is part of the reason for not writing a WAL
entry until we've successfully done the original action.


Asynchronous Commit
-------------------

As of PostgreSQL 8.3 it is possible to perform asynchronous commits - i.e.,
we don't wait while the WAL record for the commit is fsync'ed.
We perform an asynchronous commit when synchronous_commit = off.  Instead
of performing an XLogFlush() up to the LSN of the commit, we merely note
the LSN in shared memory.  The backend then continues with other work.
We record the LSN only for an asynchronous commit, not an abort; there's
never any need to flush an abort record, since the presumption after a
crash would be that the transaction aborted anyway.

We always force synchronous commit when the transaction is deleting
relations, to ensure the commit record is down to disk before the relations
are removed from the filesystem.  Also, certain utility commands that have
non-roll-backable side effects (such as filesystem changes) force sync
commit to minimize the window in which the filesystem change has been made
but the transaction isn't guaranteed committed.

The walwriter regularly wakes up (via wal_writer_delay) or is woken up
(via its latch, which is set by backends committing asynchronously) and
performs an XLogBackgroundFlush().  This checks the location of the last
completely filled WAL page.  If that has moved forwards, then we write all
the changed buffers up to that point, so that under full load we write
only whole buffers.  If there has been a break in activity and the current
WAL page is the same as before, then we find out the LSN of the most
recent asynchronous commit, and write up to that point, if required (i.e.
if it's in the current WAL page).  If more than wal_writer_delay has
passed, or more than wal_writer_flush_after blocks have been written, since
the last flush, WAL is also flushed up to the current location.  This
arrangement in itself would guarantee that an async commit record reaches
disk after at most two times wal_writer_delay after the transaction
completes.  However, we also allow XLogFlush to write/flush full buffers
"flexibly" (i.e., not wrapping around at the end of the circular WAL buffer
area), so as to minimize the number of writes issued under high load when
multiple WAL pages are filled per walwriter cycle.  This makes the
worst-case delay three wal_writer_delay cycles.

There are some other subtle points to consider with asynchronous commits.
First, for each page of CLOG we must remember the LSN of the latest commit
affecting the page, so that we can enforce the same flush-WAL-before-write
rule that we do for ordinary relation pages.  Otherwise the record of the
commit might reach disk before the WAL record does.  Again, abort records
need not factor into this consideration.

In fact, we store more than one LSN for each clog page.  This relates to
the way we set transaction status hint bits during visibility tests.
We must not set a transaction-committed hint bit on a relation page and
have that record make it to disk prior to the WAL record of the commit.
Since visibility tests are normally made while holding buffer share locks,
we do not have the option of changing the page's LSN to guarantee WAL
synchronization.  Instead, we defer the setting of the hint bit if we have
not yet flushed WAL as far as the LSN associated with the transaction.
This requires tracking the LSN of each unflushed async commit.  It is
convenient to associate this data with clog buffers: because we will flush
WAL before writing a clog page, we know that we do not need to remember a
transaction's LSN longer than the clog page holding its commit status
remains in memory.  However, the naive approach of storing an LSN for each
clog position is unattractive: the LSNs are 32x bigger than the two-bit
commit status fields, and so we'd need 256K of additional shared memory for
each 8K clog buffer page.  We choose instead to store a smaller number of
LSNs per page, where each LSN is the highest LSN associated with any
transaction commit in a contiguous range of transaction IDs on that page.
This saves storage at the price of some possibly-unnecessary delay in
setting transaction hint bits.

How many transactions should share the same cached LSN (N)?  If the
system's workload consists only of small async-commit transactions, then
it's reasonable to have N similar to the number of transactions per
walwriter cycle, since that is the granularity with which transactions will
become truly committed (and thus hintable) anyway.  The worst case is where
a sync-commit xact shares a cached LSN with an async-commit xact that
commits a bit later; even though we paid to sync the first xact to disk,
we won't be able to hint its outputs until the second xact is sync'd, up to
three walwriter cycles later.  This argues for keeping N (the group size)
as small as possible.  For the moment we are setting the group size to 32,
which makes the LSN cache space the same size as the actual clog buffer
space (independently of BLCKSZ).
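
The arithmetic behind that claim can be sketched in standalone C (the
constant names mirror those used in clog.c, but the snippet simply assumes
8K pages, two status bits per transaction, and 8-byte LSNs):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of the clog LSN-group arithmetic.  The names mirror clog.c, but
 * the values here just assume 8K pages and two status bits per xact.
 */
#define BLCKSZ					8192
#define CLOG_BITS_PER_XACT		2
#define CLOG_XACTS_PER_BYTE		(8 / CLOG_BITS_PER_XACT)
#define CLOG_XACTS_PER_PAGE		(BLCKSZ * CLOG_XACTS_PER_BYTE)
#define CLOG_XACTS_PER_LSN_GROUP 32
#define CLOG_LSNS_PER_PAGE		(CLOG_XACTS_PER_PAGE / CLOG_XACTS_PER_LSN_GROUP)

/* Which cached-LSN slot on its clog page covers the given xid? */
static int
lsn_group_for_xid(uint32_t xid)
{
	return (int) ((xid % CLOG_XACTS_PER_PAGE) / CLOG_XACTS_PER_LSN_GROUP);
}
```

With these numbers, each 8K clog page holds 32768 transaction statuses and
1024 cached LSNs of 8 bytes each, i.e. 8K of LSN cache per 8K of clog
buffer; and since the 32768 cancels against BLCKSZ, the 1:1 ratio holds for
any BLCKSZ.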

It is useful that we can run both synchronous and asynchronous commit
transactions concurrently, but the safety of this is perhaps not
immediately obvious.  Assume we have two transactions, T1 and T2.  The Log
Sequence Number (LSN) is the point in the WAL sequence where a transaction
commit is recorded, so LSN1 and LSN2 are the commit records of those
transactions.  If T2 can see changes made by T1, then when T2 commits it
must be true that LSN2 follows LSN1.  Thus when T2 commits it is certain
that all of the changes made by T1 are also now recorded in the WAL.  This
is true whether T1 was asynchronous or synchronous.  As a result, it is
safe for asynchronous commits and synchronous commits to work concurrently
without endangering data written by synchronous commits.  Sub-transactions
are not important here since the final write to disk only occurs at the
commit of the top-level transaction.

Changes to data blocks cannot reach disk unless WAL is flushed up to the
point of the LSN of the data blocks.  Any attempt to write unsafe data to
disk will trigger a write which ensures the safety of all data written by
that and prior transactions.  Data blocks and clog pages are both protected
by LSNs.

Changes to a temp table are not WAL-logged, hence could reach disk in
advance of T1's commit, but we don't care since temp table contents don't
survive crashes anyway.

Database writes made via any of the paths we have introduced to avoid WAL
overhead for bulk updates are also safe.  In these cases it's entirely
possible for the data to reach disk before T1's commit, because T1 will
fsync it down to disk without any sort of interlock, as soon as it finishes
the bulk update.  However, all these paths are designed to write data that
no other transaction can see until after T1 commits.  The situation is thus
not different from ordinary WAL-logged updates.

Transaction Emulation during Recovery
-------------------------------------

During recovery we replay transaction changes in the order they occurred.
As part of this replay we emulate some transactional behaviour, so that
read-only backends can take MVCC snapshots.  We do this by maintaining a
list of XIDs belonging to transactions that are being replayed, so that
each transaction that has recorded WAL records for database writes exists
in the array until it commits.  Further details are given in comments in
procarray.c.

Many actions write no WAL records at all, for example read-only transactions.
These have no effect on MVCC in recovery and we can pretend they never
occurred at all.  Subtransaction commit does not write a WAL record either,
and has very little effect, since lock waiters need to wait for the
parent transaction to complete.

Not all transactional behaviour is emulated; for example, we do not insert
a transaction entry into the lock table, nor do we maintain the transaction
stack in memory.  Clog, multixact and commit_ts entries are made normally.
Subtrans is maintained during recovery, but the details of the transaction
tree are ignored and all subtransactions reference the top-level TransactionId
directly.  Since commit is atomic, this provides correct lock wait behaviour
yet simplifies emulation of subtransactions considerably.

Further details on locking mechanics in recovery are given in comments
with the Lock rmgr code.


README.parallel

Overview
========

PostgreSQL provides some simple facilities to make writing parallel algorithms
easier.  Using a data structure called a ParallelContext, you can arrange to
launch background worker processes, initialize their state to match that of
the backend which initiated parallelism, communicate with them via dynamic
shared memory, and write reasonably complex code that can run either in the
user backend or in one of the parallel workers without needing to be aware of
where it's running.

The backend which starts a parallel operation (hereafter, the initiating
backend) starts by creating a dynamic shared memory segment which will last
for the lifetime of the parallel operation.  This dynamic shared memory segment
will contain (1) a shm_mq that can be used to transport errors (and other
messages reported via elog/ereport) from the worker back to the initiating
backend; (2) serialized representations of the initiating backend's private
state, so that the worker can synchronize its state with that of the initiating
backend; and (3) any other data structures which a particular user of the
ParallelContext data structure may wish to add for its own purposes.  Once
the initiating backend has initialized the dynamic shared memory segment, it
asks the postmaster to launch the appropriate number of parallel workers.
These workers then connect to the dynamic shared memory segment, initialize
their state, and then invoke the appropriate entrypoint, as further detailed
below.
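
The entrypoint is an ordinary C function in a loadable library, looked up by
library and function name; the parallel machinery invokes it in each worker,
passing the attached segment and its table of contents.  A sketch (the
function name, MY_KEY, and my_shared_state are hypothetical):

	void
	my_parallel_worker_main(dsm_segment *seg, shm_toc *toc)
	{
		/* Look up shared state stored by the initiating backend. */
		my_shared_state *state = shm_toc_lookup(toc, MY_KEY, false);

		/* ... do this worker's share of the work ... */
	}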

Error Reporting
===============

When started, each parallel worker begins by attaching the dynamic shared
memory segment and locating the shm_mq to be used for error reporting; it
redirects all of its protocol messages to this shm_mq.  Prior to this point,
any failure of the background worker will not be reported to the initiating
backend; from the point of view of the initiating backend, the worker simply
failed to start.  The initiating backend must anyway be prepared to cope
with fewer parallel workers than it originally requested, so catering to
this case imposes no additional burden.

Whenever a new message (or partial message; very large messages may wrap) is
sent to the error-reporting queue, PROCSIG_PARALLEL_MESSAGE is sent to the
initiating backend.  This causes the next CHECK_FOR_INTERRUPTS() in the
initiating backend to read and rethrow the message.  For the most part, this
makes error reporting in parallel mode "just work".  Of course, to work
properly, it is important that the code the initiating backend is executing
call CHECK_FOR_INTERRUPTS() regularly and avoid blocking interrupt processing
for long periods of time, but those are good things to do anyway.

(A currently-unsolved problem is that some messages may get written to the
system log twice, once in the backend where the report was originally
generated, and again when the initiating backend rethrows the message.  If
we decide to suppress one of these reports, it should probably be the second
one; otherwise, if the worker is for some reason unable to propagate the
message back to the initiating backend, the message will be lost altogether.)

State Sharing
=============

It's possible to write C code which works correctly without parallelism, but
which fails when parallelism is used.  No parallel infrastructure can
completely eliminate this problem, because any global variable is a risk.
There's no general mechanism for ensuring that every global variable in the
worker will have the same value that it does in the initiating backend; even
if we could ensure that, some function we're calling could update the variable
after each call, and only the backend where that update is performed will see
the new value.  Similar problems can arise with any more-complex data
structure we might choose to use.  For example, a pseudo-random number
generator should, given a particular seed value, produce the same predictable
series of values every time.  But it does this by relying on some private
state which won't automatically be shared between cooperating backends.  A
parallel-safe PRNG would need to store its state in dynamic shared memory, and
would require locking.  The parallelism infrastructure has no way of knowing
whether the user intends to call code that has this sort of problem, and can't
do anything about it anyway.

Instead, we take a more pragmatic approach.  First, we try to make as many of
the operations that are safe outside of parallel mode work correctly in
parallel mode as well.  Second, we try to prohibit common unsafe operations
via suitable error checks.  These checks are intended to catch 100% of
unsafe things that a user might do from the SQL interface, but code written
in C can do unsafe things that won't trigger these checks.  The error checks
are engaged via EnterParallelMode(), which should be called before creating
a parallel context, and disarmed via ExitParallelMode(), which should be
called after all parallel contexts have been destroyed.  The most
significant restriction imposed by parallel mode is that all operations must
be strictly read-only; we allow no writes to the database and no DDL.  We
might try to relax these restrictions in the future.

To make as many operations as possible safe in parallel mode, we try to copy
the most important pieces of state from the initiating backend to each parallel
worker.  This includes:

  - The set of libraries dynamically loaded by dfmgr.c.

  - The authenticated user ID and current database.  Each parallel worker
    will connect to the same database as the initiating backend, using the
    same user ID.

  - The values of all GUCs.  Accordingly, permanent changes to the value of
    any GUC are forbidden while in parallel mode; but temporary changes,
    such as entering a function with non-NULL proconfig, are OK.

  - The current subtransaction's XID, the top-level transaction's XID, and
    the list of XIDs considered current (that is, they are in-progress or
    subcommitted).  This information is needed to ensure that tuple visibility
    checks return the same results in the worker as they do in the
    initiating backend.  See also the section Transaction Integration, below.

  - The combo CID mappings.  This is needed to ensure consistent answers to
    tuple visibility checks.  The need to synchronize this data structure is
    a major reason why we can't support writes in parallel mode: such writes
    might create new combo CIDs, and we have no way to let other workers
    (or the initiating backend) know about them.

  - The transaction snapshot.

  - The active snapshot, which might be different from the transaction
    snapshot.

  - The currently active user ID and security context.  Note that this is
    the fourth user ID we restore: the initial step of binding to the correct
    database also involves restoring the authenticated user ID.  When GUC
    values are restored, this incidentally sets SessionUserId and OuterUserId
    to the correct values.  This final step restores CurrentUserId.

To prevent undetected or unprincipled deadlocks when running in parallel mode,
this code should eventually handle heavyweight locks in some way.  This is
not implemented yet.

Transaction Integration
=======================

Regardless of what the TransactionState stack looks like in the parallel
leader, each parallel worker ends up with a stack of depth 1.  This stack
entry is marked with the special transaction block state
TBLOCK_PARALLEL_INPROGRESS so that it's not confused with an ordinary
toplevel transaction.  The XID of this TransactionState is set to the XID of
the innermost currently-active subtransaction in the initiating backend.  The
initiating backend's toplevel XID, and the XIDs of all current (in-progress
or subcommitted) XIDs are stored separately from the TransactionState stack,
but in such a way that GetTopTransactionId(), GetTopTransactionIdIfAny(), and
TransactionIdIsCurrentTransactionId() return the same values that they would
in the initiating backend.  We could copy the entire transaction state stack,
but most of it would be useless: for example, you can't roll back to a
savepoint from within a parallel worker, and there are no resources
associated with the memory contexts or resource owners of intermediate
subtransactions.

No meaningful change to the transaction state can be made while in parallel
mode.  No XIDs can be assigned, and no subtransactions can start or end,
because we have no way of communicating these state changes to cooperating
backends, or of synchronizing them.  It's clearly unworkable for the initiating
backend to exit any transaction or subtransaction that was in progress when
parallelism was started before all parallel workers have exited; and it's even
more clearly crazy for a parallel worker to try to subcommit or subabort the
current subtransaction and execute in some other transaction context than was
present in the initiating backend.  It might be practical to allow internal
sub-transactions (e.g. to implement a PL/pgSQL EXCEPTION block) to be used in
parallel mode, provided that they are XID-less, because other backends
wouldn't really need to know about those transactions or do anything
differently because of them.  Right now, we don't even allow that.

At the end of a parallel operation, which can happen either because it
completed successfully or because it was interrupted by an error, parallel
workers associated with that operation exit.  In the error case, transaction
abort processing in the parallel leader kills off any remaining workers, and
the parallel leader then waits for them to die.  In the case of a successful
parallel operation, the parallel leader does not send any signals, but must
wait for workers to complete and exit of their own volition.  In either
case, it is very important that all workers actually exit before the
parallel leader cleans up the (sub)transaction in which they were created;
otherwise, chaos can ensue.  For example, if the leader is rolling back the
transaction that created the relation being scanned by a worker, the
relation could disappear while the worker is still busy scanning it.  That's
not safe.

Generally, the cleanup performed by each worker at this point is similar to
top-level commit or abort.  Each backend has its own resource owners: buffer
pins, catcache or relcache reference counts, tuple descriptors, and so on
are managed separately by each backend, and each backend must release them
before exiting.  There are, however, some important differences between
parallel worker commit or abort and a real top-level transaction commit or
abort.  Most importantly:

  - No commit or abort record is written; the initiating backend is
    responsible for this.

  - Cleanup of pg_temp namespaces is not done.  Parallel workers cannot
    safely access the initiating backend's pg_temp namespace, and should
    not create one of their own.

Coding Conventions
==================

Before beginning any parallel operation, call EnterParallelMode(); after all
parallel operations are completed, call ExitParallelMode().  To actually
parallelize a particular operation, use a ParallelContext.  The basic coding
pattern looks like this:

	EnterParallelMode();		/* prohibit unsafe state changes */

	pcxt = CreateParallelContext("library_name", "function_name", nworkers);

	/* Allow space for application-specific data here. */
	shm_toc_estimate_chunk(&pcxt->estimator, size);
	shm_toc_estimate_keys(&pcxt->estimator, keys);

	InitializeParallelDSM(pcxt);	/* create DSM and copy state to it */

	/* Store the data for which we reserved space. */
	space = shm_toc_allocate(pcxt->toc, size);
	shm_toc_insert(pcxt->toc, key, space);

	LaunchParallelWorkers(pcxt);

	/* do parallel stuff */

	WaitForParallelWorkersToFinish(pcxt);

	/* read any final results from dynamic shared memory */

	DestroyParallelContext(pcxt);

	ExitParallelMode();

If desired, after WaitForParallelWorkersToFinish() has been called, the
context can be reset so that workers can be launched anew using the same
parallel context.  To do this, first call ReinitializeParallelDSM() to
reinitialize state managed by the parallel context machinery itself; then,
perform any other necessary resetting of state; after that, you can again
call LaunchParallelWorkers.

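For example, reusing one parallel context across several batches of work
looks roughly like this (a sketch; the batch loop and the
application-specific reset step are illustrative):

	LaunchParallelWorkers(pcxt);
	/* ... first batch of parallel work ... */
	WaitForParallelWorkersToFinish(pcxt);

	for (i = 1; i < nbatches; i++)
	{
		ReinitializeParallelDSM(pcxt);	/* reset context-managed state */
		/* ... reset any application-specific shared state here ... */
		LaunchParallelWorkers(pcxt);
		/* ... next batch of parallel work ... */
		WaitForParallelWorkersToFinish(pcxt);
	}

	DestroyParallelContext(pcxt);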