src/backend/access/transam/README

The Transaction System
======================

PostgreSQL's transaction system is a three-layer system. The bottom layer
implements low-level transactions and subtransactions, on top of which rests
the mainloop's control code, which in turn implements user-visible
transactions and savepoints.

The middle layer of code is called by postgres.c before and after the
processing of each query, or after detecting an error:

        StartTransactionCommand
        CommitTransactionCommand
        AbortCurrentTransaction

Meanwhile, the user can alter the system's state by issuing the SQL commands
BEGIN, COMMIT, ROLLBACK, SAVEPOINT, ROLLBACK TO or RELEASE. The traffic cop
redirects these calls to the toplevel routines

        BeginTransactionBlock
        EndTransactionBlock
        UserAbortTransactionBlock
        DefineSavepoint
        RollbackToSavepoint
        ReleaseSavepoint

respectively. Depending on the current state of the system, these functions
call low level functions to activate the real transaction system:

        StartTransaction
        CommitTransaction
        AbortTransaction
        CleanupTransaction
        StartSubTransaction
        CommitSubTransaction
        AbortSubTransaction
        CleanupSubTransaction

Additionally, within a transaction, CommandCounterIncrement is called to
increment the command counter, which allows future commands to "see" the
effects of previous commands within the same transaction. Note that this is
done automatically by CommitTransactionCommand after each query inside a
transaction block, but some utility functions also do it internally to allow
some operations (usually in the system catalogs) to be seen by future
operations in the same utility command. (For example, in DefineRelation it is
done after creating the heap so the pg_class row is visible, to be able to
lock it.)
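
The effect of CommandCounterIncrement described above can be pictured with a
small toy model. This is only a sketch: the Toy* types and toy_* functions
are hypothetical and are not PostgreSQL's data structures. It assumes each
row remembers the command that inserted it and is visible only to later
commands of the same transaction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy types; these are not PostgreSQL's definitions. */
typedef uint32_t ToyCommandId;

typedef struct ToyXact
{
    ToyCommandId current_cid;   /* counter value of the running command */
} ToyXact;

typedef struct ToyRow
{
    ToyCommandId cmin;          /* command that inserted the row */
} ToyRow;

/* Stamp a new row with the command that is inserting it. */
static ToyRow
toy_insert(const ToyXact *xact)
{
    ToyRow      row = { xact->current_cid };

    return row;
}

/* A row is visible only to commands later than the one that inserted it. */
static bool
toy_row_visible(const ToyXact *xact, const ToyRow *row)
{
    return row->cmin < xact->current_cid;
}

/* Advance the counter so that later commands see earlier effects. */
static void
toy_command_counter_increment(ToyXact *xact)
{
    assert(xact->current_cid < UINT32_MAX);     /* toy model: no wraparound */
    xact->current_cid++;
}
```

In this model a row inserted by the current command stays invisible until
toy_command_counter_increment() has been called, which mirrors why a utility
command must bump the counter before it can look at a catalog row it has
just created.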


For example, consider the following sequence of user commands:

1)      BEGIN
2)      SELECT * FROM foo
3)      INSERT INTO foo VALUES (...)
4)      COMMIT

In the main processing loop, this results in the following function call
sequence:

     /  StartTransactionCommand;
    /       StartTransaction;
1) <        ProcessUtility;                 << BEGIN
    \       BeginTransactionBlock;
     \  CommitTransactionCommand;

    /   StartTransactionCommand;
2) /        PortalRunSelect;                << SELECT ...
   \    CommitTransactionCommand;
    \       CommandCounterIncrement;

    /   StartTransactionCommand;
3) /        ProcessQuery;                   << INSERT ...
   \    CommitTransactionCommand;
    \       CommandCounterIncrement;

     /  StartTransactionCommand;
    /       ProcessUtility;                 << COMMIT
4) <        EndTransactionBlock;
    \   CommitTransactionCommand;
     \      CommitTransaction;

The point of this example is to demonstrate the need for
StartTransactionCommand and CommitTransactionCommand to be state smart -- they
should call CommandCounterIncrement between the calls to BeginTransactionBlock
and EndTransactionBlock and outside these calls they need to do normal start,
commit or abort processing.

Furthermore, suppose the "SELECT * FROM foo" caused an abort condition. In
this case AbortCurrentTransaction is called, and the transaction is put in
aborted state. In this state, any user input is ignored except for
transaction-termination statements, or ROLLBACK TO <savepoint> commands.

Transaction aborts can occur in two ways:

1) system dies from some internal cause (syntax error, etc)
2) user types ROLLBACK

The reason we have to distinguish them is illustrated by the following two
situations:

        case 1                          case 2
        ------                          ------
1) user types BEGIN                 1) user types BEGIN
2) user does something              2) user does something
3) user does not like what          3) system aborts for some reason
   she sees and types ABORT            (syntax error, etc)

In case 1, we want to abort the transaction and return to the default state.
In case 2, there may be more commands coming our way which are part of the
same transaction block; we have to ignore these commands until we see a COMMIT
or ROLLBACK.

Internal aborts are handled by AbortCurrentTransaction, while user aborts are
handled by UserAbortTransactionBlock. Both of them rely on AbortTransaction
to do all the real work. The only difference is what state we enter after
AbortTransaction does its work:

* AbortCurrentTransaction leaves us in TBLOCK_ABORT,
* UserAbortTransactionBlock leaves us in TBLOCK_ABORT_END

Low-level transaction abort handling is divided into two phases:
* AbortTransaction executes as soon as we realize the transaction has
  failed. It should release all shared resources (locks etc) so that we do
  not delay other backends unnecessarily.
* CleanupTransaction executes when we finally see a user COMMIT
  or ROLLBACK command; it cleans things up and gets us out of the transaction
  completely. In particular, we mustn't destroy TopTransactionContext until
  this point.

Also, note that when a transaction is committed, we don't close it right away.
Rather it's put in TBLOCK_END state, which means that when
CommitTransactionCommand is called after the query has finished processing,
the transaction has to be closed. The distinction is subtle but important,
because it means that control will leave the xact.c code with the transaction
open, and the main loop will be able to keep processing inside the same
transaction. So, in a sense, transaction commit is also handled in two
phases, the first at EndTransactionBlock and the second at
CommitTransactionCommand (which is where CommitTransaction is actually
called).

The rest of the code in xact.c consists of routines to support the creation
and finishing of transactions and subtransactions. For example, AtStart_Memory
takes care of initializing the memory subsystem at main transaction start.
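
The two-phase abort and two-phase commit handling described above can be
sketched as a tiny state machine. This is a simplified illustration only:
the TOY_* states and toy_* functions are hypothetical stand-ins loosely
named after the states mentioned in this section, and the real state set in
xact.c is larger.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified block states; xact.c has more of them. */
typedef enum ToyBlockState
{
    TOY_DEFAULT,        /* idle, no transaction block */
    TOY_INPROGRESS,     /* inside a live transaction block */
    TOY_END,            /* COMMIT seen, close when the command finishes */
    TOY_ABORT,          /* failed, waiting for COMMIT or ROLLBACK */
    TOY_ABORT_END       /* ROLLBACK seen, clean up when the command finishes */
} ToyBlockState;

typedef struct ToyXact
{
    ToyBlockState state;
    bool        resources_released; /* phase 1 (AbortTransaction) ran */
    bool        context_destroyed;  /* phase 2 (cleanup) or commit ran */
} ToyXact;

static void
toy_begin(ToyXact *x)
{
    x->state = TOY_INPROGRESS;
    x->resources_released = false;
    x->context_destroyed = false;
}

/* Models AbortCurrentTransaction: release resources, then wait. */
static void
toy_internal_abort(ToyXact *x)
{
    if (x->state == TOY_INPROGRESS)
    {
        x->resources_released = true;
        x->state = TOY_ABORT;
    }
}

/* Models UserAbortTransactionBlock: the user typed ROLLBACK. */
static void
toy_user_rollback(ToyXact *x)
{
    if (x->state == TOY_INPROGRESS)
        x->resources_released = true;
    if (x->state == TOY_INPROGRESS || x->state == TOY_ABORT)
        x->state = TOY_ABORT_END;
}

/* Models EndTransactionBlock: only marks the block for closing. */
static void
toy_user_commit(ToyXact *x)
{
    if (x->state == TOY_INPROGRESS)
        x->state = TOY_END;
    else if (x->state == TOY_ABORT)
        x->state = TOY_ABORT_END;   /* COMMIT of a failed block rolls back */
}

/* Models CommitTransactionCommand: the deferred work happens here. */
static void
toy_commit_command(ToyXact *x)
{
    if (x->state == TOY_END)
        x->resources_released = true;
    if (x->state == TOY_END || x->state == TOY_ABORT_END)
    {
        x->context_destroyed = true;
        x->state = TOY_DEFAULT;
    }
}
```

Note how a failed block keeps its context alive in TOY_ABORT until the user
ends the block, while shared resources are already released.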


Subtransaction Handling
-----------------------

Subtransactions are implemented using a stack of TransactionState structures,
each of which has a pointer to its parent transaction's struct. When a new
subtransaction is to be opened, PushTransaction is called, which creates a new
TransactionState, with its parent link pointing to the current transaction.
StartSubTransaction is in charge of initializing the new TransactionState to
sane values, and properly initializing other subsystems (AtSubStart routines).

When closing a subtransaction, either CommitSubTransaction has to be called
(if the subtransaction is committing), or AbortSubTransaction and
CleanupSubTransaction (if it's aborting). In either case, PopTransaction is
called so the system returns to the parent transaction.

One important point regarding subtransaction handling is that several may need
to be closed in response to a single user command. That's because savepoints
have names, and we allow committing or rolling back a savepoint by name, which
is not necessarily the one that was last opened. Also a COMMIT or ROLLBACK
command must be able to close out the entire stack. We handle this by having
the utility command subroutine mark all the state stack entries as
commit-pending or abort-pending, and then when the main loop reaches
CommitTransactionCommand, the real work is done. The main point of doing
things this way is that if we get an error while popping state stack entries,
the remaining stack entries still show what we need to do to finish up.

In the case of ROLLBACK TO <savepoint>, we abort all the subtransactions up
through the one identified by the savepoint name, and then re-create that
subtransaction level with the same name. So it's a completely new
subtransaction as far as the internals are concerned.
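
The mark-then-process scheme for ROLLBACK TO <savepoint> might be sketched
as follows. This is a toy model under stated assumptions: the Toy* types and
toy_* functions are hypothetical, only one pending state is modeled, and the
real TransactionState struct carries far more information.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef enum ToySubState
{
    TOY_SUB_INPROGRESS,
    TOY_SUB_ABORT_PENDING
} ToySubState;

/* Hypothetical stand-in for a stack entry with a parent link. */
typedef struct ToyTransactionState
{
    const char *name;           /* savepoint name, NULL at top level */
    int         nesting_level;  /* 1 for the top-level transaction */
    ToySubState state;
    struct ToyTransactionState *parent;
} ToyTransactionState;

/* Open a new level whose parent link points at the current one. */
static ToyTransactionState *
toy_push_transaction(ToyTransactionState *current, const char *name)
{
    ToyTransactionState *s = malloc(sizeof(*s));

    assert(s != NULL);
    s->name = name;
    s->nesting_level = (current != NULL) ? current->nesting_level + 1 : 1;
    s->state = TOY_SUB_INPROGRESS;
    s->parent = current;
    return s;
}

/* Close the current level and return to its parent. */
static ToyTransactionState *
toy_pop_transaction(ToyTransactionState *current)
{
    ToyTransactionState *parent = current->parent;

    free(current);
    return parent;
}

/*
 * Phase 1 (utility command): mark every level up through the named
 * savepoint as abort-pending.  Returns -1 if no such savepoint exists,
 * in which case nothing is marked.
 */
static int
toy_mark_rollback_to(ToyTransactionState *current, const char *name)
{
    ToyTransactionState *target;
    ToyTransactionState *s;

    for (target = current; target != NULL; target = target->parent)
        if (target->name != NULL && strcmp(target->name, name) == 0)
            break;
    if (target == NULL)
        return -1;

    for (s = current;; s = s->parent)
    {
        s->state = TOY_SUB_ABORT_PENDING;
        if (s == target)
            break;
    }
    return 0;
}

/* Phase 2 (end of command): pop whatever is still marked as pending. */
static ToyTransactionState *
toy_finish_pending(ToyTransactionState *current)
{
    while (current != NULL && current->state == TOY_SUB_ABORT_PENDING)
        current = toy_pop_transaction(current);
    return current;
}
```

After toy_finish_pending() returns, the caller would push a fresh level with
the same savepoint name. Because the marks live in the stack entries
themselves, an error partway through phase 2 leaves the remaining entries
still labeled with the work that is left.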

Other subsystems are allowed to start "internal" subtransactions, which are
handled by BeginInternalSubtransaction. This is to allow implementing
exception handling, e.g. in PL/pgSQL. ReleaseCurrentSubTransaction and
RollbackAndReleaseCurrentSubTransaction allow the subsystem to close said
subtransactions. The main difference between this and the savepoint/release
path is that we execute the complete state transition immediately in each
subroutine, rather than deferring some work until CommitTransactionCommand.
Another difference is that BeginInternalSubtransaction is allowed when no
explicit transaction block has been established, while DefineSavepoint is not.


Transaction and Subtransaction Numbering
----------------------------------------

Transactions and subtransactions are assigned permanent XIDs only when/if
they first do something that requires one --- typically, insert/update/delete
a tuple, though there are a few other places that need an XID assigned.
If a subtransaction requires an XID, we always first assign one to its
parent. This maintains the invariant that child transactions have XIDs later
than their parents, which is assumed in a number of places.

The subsidiary actions of obtaining a lock on the XID and entering it into
pg_subtrans and PG_PROC are done at the time it is assigned.

A transaction that has no XID still needs to be identified for various
purposes, notably holding locks. For this purpose we assign a "virtual
transaction ID" or VXID to each top-level transaction. VXIDs are formed from
two fields, the backendID and a backend-local counter; this arrangement allows
assignment of a new VXID at transaction start without any contention for
shared memory.
To ensure that a VXID isn't re-used too soon after backend
exit, we store the last local counter value into shared memory at backend
exit, and initialize it from the previous value for the same backendID slot
at backend start. All these counters go back to zero at shared memory
re-initialization, but that's OK because VXIDs never appear anywhere on-disk.

Internally, a backend needs a way to identify subtransactions whether or not
they have XIDs; but this need only lasts as long as the parent top transaction
endures. Therefore, we have SubTransactionId, which is somewhat like
CommandId in that it's generated from a counter that we reset at the start of
each top transaction. The top-level transaction itself has SubTransactionId 1,
and subtransactions have IDs 2 and up. (Zero is reserved for
InvalidSubTransactionId.) Note that subtransactions do not have their
own VXIDs; they use the parent top transaction's VXID.


Interlocking Transaction Begin, Transaction End, and Snapshots
--------------------------------------------------------------

We try hard to minimize the amount of overhead and lock contention involved
in the frequent activities of beginning/ending a transaction and taking a
snapshot. Unfortunately, we must have some interlocking for this, because
we must ensure consistency about the commit order of transactions.
For example, suppose an UPDATE in xact A is blocked by xact B's prior
update of the same row, and xact B is doing commit while xact C gets a
snapshot. Xact A can complete and commit as soon as B releases its locks.
If xact C's GetSnapshotData sees xact B as still running, then it had
better see xact A as still running as well, or it will be able to see two
tuple versions - one deleted by xact B and one inserted by xact A.
Another reason why this would be bad is that C would see (in the row
inserted by A) earlier changes by B, and it would be inconsistent for C not
to see any of B's changes elsewhere in the database.

Formally, the correctness requirement is "if a snapshot A considers
transaction X as committed, and any of transaction X's snapshots considered
transaction Y as committed, then snapshot A must consider transaction Y as
committed".

What we actually enforce is strict serialization of commits and rollbacks
with snapshot-taking: we do not allow any transaction to exit the set of
running transactions while a snapshot is being taken. (This rule is
stronger than necessary for consistency, but is relatively simple to
enforce, and it assists with some other issues as explained below.) The
implementation of this is that GetSnapshotData takes the ProcArrayLock in
shared mode (so that multiple backends can take snapshots in parallel),
but ProcArrayEndTransaction must take the ProcArrayLock in exclusive mode
while clearing MyPgXact->xid at transaction end (either commit or abort).
(To reduce context switching, when multiple transactions commit nearly
simultaneously, we have one backend take ProcArrayLock and clear the XIDs
of multiple processes at once.)

ProcArrayEndTransaction also holds the lock while advancing the shared
latestCompletedXid variable. This allows GetSnapshotData to use
latestCompletedXid + 1 as xmax for its snapshot: there can be no
transaction >= this xid value that the snapshot needs to consider as
completed.

In short, then, the rule is that no transaction may exit the set of
currently-running transactions between the time we fetch latestCompletedXid
and the time we finish building our snapshot.
However, this restriction
only applies to transactions that have an XID --- read-only transactions
can end without acquiring ProcArrayLock, since they don't affect anyone
else's snapshot nor latestCompletedXid.

Transaction start, per se, doesn't have any interlocking with these
considerations, since we no longer assign an XID immediately at transaction
start. But when we do decide to allocate an XID, GetNewTransactionId must
store the new XID into the shared ProcArray before releasing XidGenLock.
This ensures that all top-level XIDs <= latestCompletedXid are either
present in the ProcArray, or not running anymore. (This guarantee doesn't
apply to subtransaction XIDs, because of the possibility that there's not
room for them in the subxid array; instead we guarantee that they are
present or the overflow flag is set.) If a backend released XidGenLock
before storing its XID into MyPgXact, then it would be possible for another
backend to allocate and commit a later XID, causing latestCompletedXid to
pass the first backend's XID, before that value became visible in the
ProcArray. That would break GetOldestXmin, as discussed below.

We allow GetNewTransactionId to store the XID into MyPgXact->xid (or the
subxid array) without taking ProcArrayLock. This was once necessary to
avoid deadlock; while that is no longer the case, it's still beneficial for
performance. We are thereby relying on fetch/store of an XID to be atomic,
else other backends might see a partially-set XID. This also means that
readers of the ProcArray xid fields must be careful to fetch a value only
once, rather than assume they can read it multiple times and get the same
answer each time. (Use volatile-qualified pointers when doing this, to
ensure that the C compiler does exactly what you tell it to.)
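
The snapshot rules above (xmax taken from latestCompletedXid + 1, and each
XID fetched exactly once through a volatile-qualified access) can be
illustrated with a single-threaded toy model. Everything here is a
hypothetical sketch: the Toy* types and toy_* functions are not the real
ProcArray code, locking appears only as comments, and XIDs are compared as
plain integers, ignoring wraparound.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MAX_PROCS 8

typedef uint32_t ToyXid;        /* 0 means "no XID assigned" */

/* Hypothetical stand-in for the shared array of per-backend XIDs. */
typedef struct ToyProcArray
{
    volatile ToyXid xids[TOY_MAX_PROCS];
    ToyXid      latest_completed_xid;
} ToyProcArray;

typedef struct ToySnapshot
{
    ToyXid      xmin;           /* oldest XID seen as running */
    ToyXid      xmax;           /* first XID treated as not yet completed */
    int         nrunning;
    ToyXid      running[TOY_MAX_PROCS];
} ToySnapshot;

static void
toy_get_snapshot(const ToyProcArray *arr, ToySnapshot *snap)
{
    int         i;

    /* Real code would hold ProcArrayLock in shared mode from here on. */
    snap->xmax = arr->latest_completed_xid + 1;
    snap->xmin = snap->xmax;
    snap->nrunning = 0;

    for (i = 0; i < TOY_MAX_PROCS; i++)
    {
        ToyXid      xid = arr->xids[i];     /* fetch exactly once */

        if (xid == 0 || xid >= snap->xmax)
            continue;           /* no XID, or covered by the xmax test */
        if (xid < snap->xmin)
            snap->xmin = xid;
        snap->running[snap->nrunning++] = xid;
    }
    /* ... and would release the lock here. */
}

static void
toy_end_transaction(ToyProcArray *arr, int slot)
{
    ToyXid      xid = arr->xids[slot];

    /* Real code would hold ProcArrayLock in exclusive mode here. */
    arr->xids[slot] = 0;
    if (xid > arr->latest_completed_xid)
        arr->latest_completed_xid = xid;
}

/* Does the snapshot treat this XID as still in progress? */
static int
toy_xid_in_progress(const ToySnapshot *snap, ToyXid xid)
{
    int         i;

    if (xid >= snap->xmax)
        return 1;
    for (i = 0; i < snap->nrunning; i++)
        if (snap->running[i] == xid)
            return 1;
    return 0;
}
```

Copying each slot into a local variable before testing it is what keeps the
xmax test and the xmin update consistent even if another backend changes the
slot in between.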

Another important activity that uses the shared ProcArray is GetOldestXmin,
which must determine a lower bound for the oldest xmin of any active MVCC
snapshot, system-wide. Each individual backend advertises the smallest
xmin of its own snapshots in MyPgXact->xmin, or zero if it currently has no
live snapshots (eg, if it's between transactions or hasn't yet set a
snapshot for a new transaction). GetOldestXmin takes the MIN() of the
valid xmin fields. It does this with only shared lock on ProcArrayLock,
which means there is a potential race condition against other backends
doing GetSnapshotData concurrently: we must be certain that a concurrent
backend that is about to set its xmin does not compute an xmin less than
what GetOldestXmin returns. We ensure that by including all the active
XIDs into the MIN() calculation, along with the valid xmins. The rule that
transactions can't exit without taking exclusive ProcArrayLock ensures that
concurrent holders of shared ProcArrayLock will compute the same minimum of
currently-active XIDs: no xact, in particular not the oldest, can exit
while we hold shared ProcArrayLock. So GetOldestXmin's view of the minimum
active XID will be the same as that of any concurrent GetSnapshotData, and
so it can't produce an overestimate. If there is no active transaction at
all, GetOldestXmin returns latestCompletedXid + 1, which is a lower bound
for the xmin that might be computed by concurrent or later GetSnapshotData
calls. (We know that no XID less than this could be about to appear in
the ProcArray, because of the XidGenLock interlock discussed above.)

GetSnapshotData also performs an oldest-xmin calculation (which had better
match GetOldestXmin's) and stores that into RecentGlobalXmin, which is used
for some tuple age cutoff checks where a fresh call of GetOldestXmin seems
too expensive.
Note that while it is certain that two concurrent
executions of GetSnapshotData will compute the same xmin for their own
snapshots, as argued above, it is not certain that they will arrive at the
same estimate of RecentGlobalXmin. This is because we allow XID-less
transactions to clear their MyPgXact->xmin asynchronously (without taking
ProcArrayLock), so one execution might see what had been the oldest xmin,
and another not. This is OK since RecentGlobalXmin need only be a valid
lower bound. As noted above, we are already assuming that fetch/store
of the xid fields is atomic, so assuming it for xmin as well is no extra
risk.


pg_xact and pg_subtrans
-----------------------

pg_xact and pg_subtrans are permanent (on-disk) storage of transaction related
information. There is a limited number of pages of each kept in memory, so
in many cases there is no need to actually read from disk. However, if
there's a long running transaction or a backend sitting idle with an open
transaction, it may be necessary to be able to read and write this information
from disk. They also allow information to be permanent across server restarts.

pg_xact records the commit status for each transaction that has been assigned
an XID. A transaction can be in progress, committed, aborted, or
"sub-committed". This last state means that it's a subtransaction that's no
longer running, but its parent has not updated its state yet. It is not
necessary to update a subtransaction's transaction status to subcommit, so we
can just defer it until main transaction commit. The main role of marking
transactions as sub-committed is to provide an atomic commit protocol when
transaction status is spread across multiple clog pages.
As a result, whenever
transaction status spreads across multiple pages we must use a two-phase commit
protocol: the first phase is to mark the subtransactions as sub-committed, then
we mark the top level transaction and all its subtransactions committed (in
that order). Thus, subtransactions that have not aborted appear as in-progress
even when they have already finished, and the subcommit status appears as a
very short transitory state during main transaction commit. Subtransaction
abort is always marked in clog as soon as it occurs. When the transaction
statuses all fit in a single CLOG page, we atomically mark them all as
committed without bothering with the intermediate sub-commit state.

Savepoints are implemented using subtransactions. A subtransaction is a
transaction inside a transaction; its commit or abort status is not only
dependent on whether it committed itself, but also whether its parent
transaction committed. To implement multiple savepoints in a transaction we
allow unlimited transaction nesting depth, so any particular subtransaction's
commit state is dependent on the commit status of each and every ancestor
transaction.

The "subtransaction parent" (pg_subtrans) mechanism records, for each
transaction with an XID, the TransactionId of its parent transaction. This
information is stored as soon as the subtransaction is assigned an XID.
Top-level transactions do not have a parent, so they leave their pg_subtrans
entries set to the default value of zero (InvalidTransactionId).

pg_subtrans is used to check whether the transaction in question is still
running --- the main Xid of a transaction is recorded in the PGXACT struct,
but since we allow arbitrary nesting of subtransactions, we can't fit all Xids
in shared memory, so we have to store them on disk.
Note, however, that for
each transaction we keep a "cache" of Xids that are known to be part of the
transaction tree, so we can skip looking at pg_subtrans unless we know the
cache has been overflowed. See storage/ipc/procarray.c for the gory details.

slru.c is the supporting mechanism for both pg_xact and pg_subtrans. It
implements the LRU policy for in-memory buffer pages. The high-level routines
for pg_xact are implemented in transam.c, while the low-level functions are in
clog.c. pg_subtrans is contained completely in subtrans.c.


Write-Ahead Log Coding
----------------------

The WAL subsystem (also called XLOG in the code) exists to guarantee crash
recovery. It can also be used to provide point-in-time recovery, as well as
hot-standby replication via log shipping. Here are some notes about
non-obvious aspects of its design.

A basic assumption of a write AHEAD log is that log entries must reach stable
storage before the data-page changes they describe. This ensures that
replaying the log to its end will bring us to a consistent state where there
are no partially-performed transactions. To guarantee this, each data page
(either heap or index) is marked with the LSN (log sequence number --- in
practice, a WAL file location) of the latest XLOG record affecting the page.
Before the bufmgr can write out a dirty page, it must ensure that xlog has
been flushed to disk at least up to the page's LSN. This low-level
interaction improves performance by not waiting for XLOG I/O until necessary.
The LSN check exists only in the shared-buffer manager, not in the local
buffer manager used for temp tables; hence operations on temp tables must not
be WAL-logged.

During WAL replay, we can check the LSN of a page to detect whether the change
recorded by the current log entry is already applied (it has been, if the page
LSN is >= the log entry's WAL location).
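
The two LSN rules above, flushing WAL up to the page's LSN before writing
the page and skipping replay when the page LSN is already at or past the
record, can be sketched with a toy model. The Toy* types and toy_* functions
are hypothetical and stand in for the buffer manager and WAL machinery; they
are not the real interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t ToyLsn;

/* Hypothetical WAL state: how far records exist and how far they are durable. */
typedef struct ToyWal
{
    ToyLsn      inserted_upto;
    ToyLsn      flushed_upto;
} ToyWal;

typedef struct ToyPage
{
    ToyLsn      lsn;            /* latest WAL record affecting the page */
    bool        dirty;
    int         disk_writes;    /* how often the page was written out */
} ToyPage;

/* Make WAL durable at least up to 'upto'; a no-op if it already is. */
static void
toy_wal_flush(ToyWal *wal, ToyLsn upto)
{
    assert(upto <= wal->inserted_upto);
    if (wal->flushed_upto < upto)
        wal->flushed_upto = upto;
}

/* Write-ahead rule: the log reaches disk before the page does. */
static void
toy_write_dirty_page(ToyWal *wal, ToyPage *page)
{
    if (!page->dirty)
        return;
    toy_wal_flush(wal, page->lsn);
    page->disk_writes++;
    page->dirty = false;
}

/* Replay rule: the change is already on the page if its LSN is not older. */
static bool
toy_redo_already_applied(ToyLsn page_lsn, ToyLsn record_lsn)
{
    return page_lsn >= record_lsn;
}
```

Because the flush happens only when a dirty page is about to be written, a
page whose LSN is already covered by an earlier flush costs no extra WAL I/O.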

Usually, log entries contain just enough information to redo a single
incremental update on a page (or small group of pages). This will work only
if the filesystem and hardware implement data page writes as atomic actions,
so that a page is never left in a corrupt partly-written state. Since that's
often an untenable assumption in practice, we log additional information to
allow complete reconstruction of modified pages. The first WAL record
affecting a given page after a checkpoint is made to contain a copy of the
entire page, and we implement replay by restoring that page copy instead of
redoing the update. (This is more reliable than the data storage itself would
be because we can check the validity of the WAL record's CRC.) We can detect
the "first change after checkpoint" by noting whether the page's old LSN
precedes the end of WAL as of the last checkpoint (the RedoRecPtr).

The general schema for executing a WAL-logged action is

1. Pin and exclusive-lock the shared buffer(s) containing the data page(s)
to be modified.

2. START_CRIT_SECTION() (Any error during the next three steps must cause a
PANIC because the shared buffers will contain unlogged changes, which we
have to ensure don't get to disk. Obviously, you should check conditions
such as whether there's enough free space on the page before you start the
critical section.)

3. Apply the required changes to the shared buffer(s).

4. Mark the shared buffer(s) as dirty with MarkBufferDirty(). (This must
happen before the WAL record is inserted; see notes in SyncOneBuffer().)
Note that marking a buffer dirty with MarkBufferDirty() should happen if and
only if you write a WAL record; see Writing Hints below.

5. If the relation requires WAL-logging, build a WAL record using
XLogBeginInsert and XLogRegister* functions, and insert it. (See
"Constructing a WAL record" below).
Then update the page's LSN using the
returned XLOG location. For instance,

        XLogBeginInsert();
        XLogRegisterBuffer(...);
        XLogRegisterData(...);
        recptr = XLogInsert(rmgr_id, info);

        PageSetLSN(dp, recptr);

6. END_CRIT_SECTION()

7. Unlock and unpin the buffer(s).

Complex changes (such as a multilevel index insertion) normally need to be
described by a series of atomic-action WAL records. The intermediate states
must be self-consistent, so that if the replay is interrupted between any
two actions, the system is fully functional. In btree indexes, for example,
a page split requires a new page to be allocated, and an insertion of a new
key in the parent btree level, but for locking reasons this has to be
reflected by two separate WAL records. Replaying the first record, to
allocate the new page and move tuples to it, sets a flag on the page to
indicate that the key has not been inserted to the parent yet. Replaying the
second record clears the flag. This intermediate state is never seen by
other backends during normal operation, because the lock on the child page
is held across the two actions, but will be seen if the operation is
interrupted before writing the second WAL record. The search algorithm works
with the intermediate state as normal, but if an insertion encounters a page
with the incomplete-split flag set, it will finish the interrupted split by
inserting the key to the parent, before proceeding.


Constructing a WAL record
-------------------------

A WAL record consists of a header common to all WAL record types,
record-specific data, and information about the data blocks modified. Each
modified data block is identified by an ID number, and can optionally have
more record-specific data associated with the block.
If XLogInsert decides
that a full-page image of a block needs to be taken, the data associated
with that block is not included.

The API for constructing a WAL record consists of five functions:
XLogBeginInsert, XLogRegisterBuffer, XLogRegisterData, XLogRegisterBufData,
and XLogInsert. First, call XLogBeginInsert(). Then register all the buffers
modified, and data needed to replay the changes, using XLogRegister*
functions. Finally, insert the constructed record to the WAL by calling
XLogInsert().

        XLogBeginInsert();

        /* register buffers modified as part of this WAL-logged action */
        XLogRegisterBuffer(0, lbuffer, REGBUF_STANDARD);
        XLogRegisterBuffer(1, rbuffer, REGBUF_STANDARD);

        /* register data that is always included in the WAL record */
        XLogRegisterData(&xlrec, SizeOfFictionalAction);

        /*
         * register data associated with a buffer. This will not be included
         * in the record if a full-page image is taken.
         */
        XLogRegisterBufData(0, tuple->data, tuple->len);

        /* more data associated with the buffer */
        XLogRegisterBufData(0, data2, len2);

        /*
         * Ok, all the data and buffers to include in the WAL record have
         * been registered. Insert the record.
         */
        recptr = XLogInsert(RM_FOO_ID, XLOG_FOOBAR_DO_STUFF);

Details of the API functions:

void XLogBeginInsert(void)

    Must be called before XLogRegisterBuffer and XLogRegisterData.

void XLogResetInsertion(void)

    Clear any currently registered data and buffers from the WAL record
    construction workspace. This is only needed if you have already called
    XLogBeginInsert(), but decide to not insert the record after all.

void XLogEnsureRecordSpace(int max_block_id, int nrdatas)

    Normally, the WAL record construction buffers have the following limits:

    * highest block ID that can be used is 4 (allowing five block references)
    * Max 20 chunks of registered data

    These default limits are enough for most record types that change some
    on-disk structures. For the odd case that requires more data, or needs to
    modify more buffers, these limits can be raised by calling
    XLogEnsureRecordSpace(). XLogEnsureRecordSpace() must be called before
    XLogBeginInsert(), and outside a critical section.

void XLogRegisterBuffer(uint8 block_id, Buffer buf, uint8 flags);

    XLogRegisterBuffer adds information about a data block to the WAL record.
    block_id is an arbitrary number used to identify this page reference in
    the redo routine. The information needed to re-find the page at redo -
    relfilenode, fork, and block number - are included in the WAL record.

    XLogInsert will automatically include a full copy of the page contents, if
    this is the first modification of the buffer since the last checkpoint.
    It is important to register every buffer modified by the action with
    XLogRegisterBuffer, to avoid torn-page hazards.

    The flags control when and how the buffer contents are included in the
    WAL record. Normally, a full-page image is taken only if the page has not
    been modified since the last checkpoint, and only if full_page_writes=on
    or an online backup is in progress. The REGBUF_FORCE_IMAGE flag can be
    used to force a full-page image to always be included; that is useful
    e.g. for an operation that rewrites most of the page, so that tracking the
    details is not worth it. For the rare case where it is not necessary to
    protect from torn pages, REGBUF_NO_IMAGE flag can be used to suppress
    full page image from being taken.
    REGBUF_WILL_INIT also suppresses a full
    page image, but the redo routine must re-generate the page from scratch,
    without looking at the old page contents. Re-initializing the page
    protects from torn page hazards like a full page image does.

    The REGBUF_STANDARD flag can be specified together with the other flags to
    indicate that the page follows the standard page layout. It causes the
    area between pd_lower and pd_upper to be left out from the image, reducing
    WAL volume.

    If the REGBUF_KEEP_DATA flag is given, any per-buffer data registered with
    XLogRegisterBufData() is included in the WAL record even if a full-page
    image is taken.

void XLogRegisterData(char *data, int len);

    XLogRegisterData is used to include arbitrary data in the WAL record. If
    XLogRegisterData() is called multiple times, the data are appended, and
    will be made available to the redo routine as one contiguous chunk.

void XLogRegisterBufData(uint8 block_id, char *data, int len);

    XLogRegisterBufData is used to include data associated with a particular
    buffer that was registered earlier with XLogRegisterBuffer(). If
    XLogRegisterBufData() is called multiple times with the same block ID, the
    data are appended, and will be made available to the redo routine as one
    contiguous chunk.

    If a full-page image of the buffer is taken at insertion, the data is not
    included in the WAL record, unless the REGBUF_KEEP_DATA flag is used.


Writing a REDO routine
----------------------

A REDO routine uses the data and page references included in the WAL record
to reconstruct the new state of the page. The record decoding functions
and macros in xlogreader.c/h can be used to extract the data from the record.
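
As a rough illustration of what a redo routine has to decide for each block
reference, consider the toy model below. It is a sketch under stated
assumptions, not the xlogreader API: the Toy* types, the TOY_BLK_* result
codes and toy_redo_block() are all hypothetical. It assumes a block
reference carries either a full-page image or one contiguous chunk of
per-block data, as described for XLogRegisterBufData.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 16

typedef struct ToyPage
{
    uint64_t    lsn;
    unsigned char bytes[TOY_PAGE_SIZE];
} ToyPage;

/* Hypothetical decoded block reference from a WAL record. */
typedef struct ToyBlockRef
{
    int         has_image;      /* a full-page image was logged */
    unsigned char image[TOY_PAGE_SIZE];
    size_t      offset;         /* where the incremental change applies */
    const unsigned char *data;  /* contiguous per-block data */
    size_t      data_len;
} ToyBlockRef;

typedef enum ToyRedoResult
{
    TOY_BLK_RESTORED,           /* page replaced by the logged image */
    TOY_BLK_DONE,               /* change was already on the page */
    TOY_BLK_REDONE              /* incremental change re-applied */
} ToyRedoResult;

static ToyRedoResult
toy_redo_block(ToyPage *page, const ToyBlockRef *ref, uint64_t record_lsn)
{
    if (ref->has_image)
    {
        /* Restore the page copy instead of redoing the update. */
        memcpy(page->bytes, ref->image, TOY_PAGE_SIZE);
        page->lsn = record_lsn;
        return TOY_BLK_RESTORED;
    }

    if (page->lsn >= record_lsn)
        return TOY_BLK_DONE;

    assert(ref->offset + ref->data_len <= TOY_PAGE_SIZE);
    memcpy(page->bytes + ref->offset, ref->data, ref->data_len);
    page->lsn = record_lsn;
    return TOY_BLK_REDONE;
}
```

Replaying the same incremental record twice is harmless in this model,
because the second attempt sees the updated page LSN and does nothing.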

When replaying a WAL record that describes changes on multiple pages, you
must be careful to lock the pages properly to prevent concurrent Hot Standby
queries from seeing an inconsistent state. If this requires that two
or more buffer locks be held concurrently, you must lock the pages in
appropriate order, and not release the locks until all the changes are done.

Note that we must only use PageSetLSN/PageGetLSN() when we know the action
is serialised. Only the Startup process may modify data blocks during
recovery, so the Startup process may execute PageGetLSN() without fear of
serialisation problems. All other processes must only call
PageSetLSN/PageGetLSN() when holding either an exclusive buffer lock or a
shared lock plus buffer header lock, or be writing the data block directly
rather than through shared buffers while holding AccessExclusiveLock on the
relation.


Writing Hints
-------------

In some cases, we write additional information to data blocks without
writing a preceding WAL record. This should happen only if the data can
be reconstructed later following a crash and the action is simply a way
of optimising for performance. When a hint is written we use
MarkBufferDirtyHint() to mark the block dirty.

If the buffer is clean and checksums are in use then
MarkBufferDirtyHint() inserts an XLOG_FPI record to ensure that we
take a full page image that includes the hint. We do this to avoid
a partial page write when we write out the dirtied page. WAL is not
written during recovery, so we simply skip dirtying blocks because
of hints when in recovery.

If you do decide to optimise away a WAL record, then any calls to
MarkBufferDirty() must be replaced by MarkBufferDirtyHint(),
otherwise you will be exposed to the risk of partial page writes.
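
The hint rules above can be summarised as a small decision function. This
is only a reading of the text, not the logic of MarkBufferDirtyHint()
itself: the ToyHintAction enum and toy_mark_dirty_hint() are hypothetical
names, and the sketch assumes that the recovery check matters only in the
case where a WAL record would otherwise have been written.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcomes of setting a hint on a shared buffer. */
typedef enum ToyHintAction
{
    TOY_HINT_SKIPPED,           /* block is not dirtied for this hint */
    TOY_HINT_DIRTY_ONLY,        /* block is dirtied, no WAL is written */
    TOY_HINT_DIRTY_WITH_FPI     /* full-page image is logged, then dirtied */
} ToyHintAction;

ToyHintAction
toy_mark_dirty_hint(bool buffer_clean, bool checksums_on, bool in_recovery)
{
    /*
     * Assumption: only a clean buffer with checksums in use needs a
     * full-page image to guard against a partial page write.
     */
    if (!buffer_clean || !checksums_on)
        return TOY_HINT_DIRTY_ONLY;

    /* WAL cannot be written during recovery, so the hint is not persisted. */
    if (in_recovery)
        return TOY_HINT_SKIPPED;

    return TOY_HINT_DIRTY_WITH_FPI;
}
```

Skipping a hint is always safe under the rule stated at the top of this
section, because a hint must be reconstructible later.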


Write-Ahead Logging for Filesystem Actions
------------------------------------------

The previous section described how to WAL-log actions that only change page
contents within shared buffers. For that type of action it is generally
possible to check all likely error cases (such as insufficient space on the
page) before beginning to make the actual change. Therefore we can make
the change and the creation of the associated WAL record "atomic" by
wrapping them into a critical section --- the odds of failure partway
through are low enough that PANIC is acceptable if it does happen.

Clearly, that approach doesn't work for cases where there's a significant
probability of failure within the action to be logged, such as creation
of a new file or database. We don't want to PANIC, and we especially don't
want to PANIC after having already written a WAL record that says we did
the action --- if we did, replay of the record would probably fail again
and PANIC again, making the failure unrecoverable. This means that the
ordinary WAL rule of "write WAL before the changes it describes" doesn't
work, and we need a different design for such cases.

There are several basic types of filesystem actions that have this
issue. Here is how we deal with each:

1. Adding a disk page to an existing table.

This action isn't WAL-logged at all. We extend a table by writing a page
of zeroes at its end. We must actually do this write so that we are sure
the filesystem has allocated the space. If the write fails we can just
error out normally. Once the space is known to be allocated, we can
initialize and fill the page via one or more normal WAL-logged actions.
Because it's possible that we crash between extending the file and writing
out the WAL entries, we have to treat discovery of an all-zeroes page in a
table or index as being a non-error condition.
In such cases we can just reclaim the space for re-use.

2. Creating a new table, which requires a new file in the filesystem.

We try to create the file, and if successful we make a WAL record saying
we did it. If not successful, we can just throw an error. Notice that
there is a window where we have created the file but not yet written any
WAL about it to disk. If we crash during this window, the file remains
on disk as an "orphan". It would be possible to clean up such orphans
by having database restart search for files that don't have any committed
entry in pg_class, but that currently isn't done because of the possibility
of deleting data that is useful for forensic analysis of the crash.
Orphan files are harmless --- at worst they waste a bit of disk space ---
because we check for on-disk collisions when allocating new relfilenode
OIDs. So cleaning up isn't really necessary.

3. Deleting a table, which requires an unlink() that could fail.

Our approach here is to WAL-log the operation first, but to treat failure
of the actual unlink() call as a warning rather than error condition.
Again, this can leave an orphan file behind, but that's cheap compared to
the alternatives. Since we can't actually do the unlink() until after
we've committed the DROP TABLE transaction, throwing an error would be out
of the question anyway. (It may be worth noting that the WAL entry about
the file deletion is actually part of the commit record for the dropping
transaction.)

4. Creating and deleting databases and tablespaces, which requires creating
and deleting directories and entire directory trees.

These cases are handled similarly to creating individual files, ie, we
try to do the action first and then write a WAL entry if it succeeded.
The potential amount of wasted disk space is rather larger, of course.
In the creation case we try to delete the directory tree again if creation
fails, so as to reduce the risk of wasted space. Failure partway through
a deletion operation results in a corrupt database: the DROP failed, but
some of the data is gone anyway. There is little we can do about that,
though, and in any case it was presumably data the user no longer wants.

In all of these cases, if WAL replay fails to redo the original action
we must panic and abort recovery. The DBA will have to manually clean up
(for instance, free up some disk space or fix directory permissions) and
then restart recovery. This is part of the reason for not writing a WAL
entry until we've successfully done the original action.


Asynchronous Commit
-------------------

As of PostgreSQL 8.3 it is possible to perform asynchronous commits - i.e.,
we don't wait while the WAL record for the commit is fsync'ed.
We perform an asynchronous commit when synchronous_commit = off. Instead
of performing an XLogFlush() up to the LSN of the commit, we merely note
the LSN in shared memory. The backend then continues with other work.
We record the LSN only for an asynchronous commit, not an abort; there's
never any need to flush an abort record, since the presumption after a
crash would be that the transaction aborted anyway.

We always force synchronous commit when the transaction is deleting
relations, to ensure the commit record is down to disk before the relations
are removed from the filesystem. Also, certain utility commands that have
non-roll-backable side effects (such as filesystem changes) force sync
commit to minimize the window in which the filesystem change has been made
but the transaction isn't guaranteed committed.
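
The choice between flushing and merely noting the commit LSN can be
sketched as follows. This is a toy model under stated assumptions rather
than the commit path in xact.c: ToyCommit, ToyWalState and both functions
are hypothetical, and advancing flushed_upto stands in for a real
XLogFlush() call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t ToyLsn;        /* hypothetical stand-in for a WAL position */

typedef struct ToyCommit
{
    ToyLsn commit_lsn;                  /* LSN of the commit record */
    bool   synchronous_commit;          /* synchronous_commit setting */
    bool   deletes_relations;           /* transaction drops relations */
    bool   has_nonrollbackable_effects; /* e.g. filesystem changes */
} ToyCommit;

typedef struct ToyWalState
{
    ToyLsn flushed_upto;        /* WAL known to be on disk */
    ToyLsn async_commit_lsn;    /* latest async commit noted in shared memory */
} ToyWalState;

/* A commit waits for WAL flush if requested, or if it is forced to. */
bool
toy_commit_is_sync(const ToyCommit *c)
{
    return c->synchronous_commit ||
           c->deletes_relations ||
           c->has_nonrollbackable_effects;
}

void
toy_finish_commit(ToyWalState *wal, const ToyCommit *c)
{
    if (toy_commit_is_sync(c))
    {
        /* Stand-in for XLogFlush(commit_lsn): wait until WAL is on disk. */
        if (wal->flushed_upto < c->commit_lsn)
            wal->flushed_upto = c->commit_lsn;
    }
    else
    {
        /* Only note the LSN; a background flush will catch up later. */
        if (wal->async_commit_lsn < c->commit_lsn)
            wal->async_commit_lsn = c->commit_lsn;
    }
}
```

Aborts never pass through either branch in this model, matching the rule
that an abort record never needs to be flushed.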

The walwriter regularly wakes up (via wal_writer_delay) or is woken up
(via its latch, which is set by backends committing asynchronously) and
performs an XLogBackgroundFlush(). This checks the location of the last
completely filled WAL page. If that has moved forwards, then we write all
the changed buffers up to that point, so that under full load we write
only whole buffers. If there has been a break in activity and the current
WAL page is the same as before, then we find out the LSN of the most
recent asynchronous commit, and write up to that point, if required (i.e.
if it's in the current WAL page). If more than wal_writer_delay has
passed, or more than wal_writer_flush_after blocks have been written, since
the last flush, WAL is also flushed up to the current location. This
arrangement in itself would guarantee that an async commit record reaches
disk within at most two times wal_writer_delay after the transaction
completes. However, we also allow XLogFlush to write/flush full buffers
"flexibly" (ie, not wrapping around at the end of the circular WAL buffer
area), so as to minimize the number of writes issued under high load when
multiple WAL pages are filled per walwriter cycle. This makes the worst-case
delay three wal_writer_delay cycles.

There are some other subtle points to consider with asynchronous commits.
First, for each page of CLOG we must remember the LSN of the latest commit
affecting the page, so that we can enforce the same flush-WAL-before-write
rule that we do for ordinary relation pages. Otherwise the clog record of
the commit might reach disk before the commit's WAL record does. Again,
abort records need not factor into this consideration.

In fact, we store more than one LSN for each clog page. This relates to
the way we set transaction status hint bits during visibility tests.
We must not set a transaction-committed hint bit on a relation page and
have that record make it to disk prior to the WAL record of the commit.
Since visibility tests are normally made while holding buffer share locks,
we do not have the option of changing the page's LSN to guarantee WAL
synchronization. Instead, we defer the setting of the hint bit if we have
not yet flushed WAL as far as the LSN associated with the transaction.
This requires tracking the LSN of each unflushed async commit. It is
convenient to associate this data with clog buffers: because we will flush
WAL before writing a clog page, we know that we do not need to remember a
transaction's LSN longer than the clog page holding its commit status
remains in memory. However, the naive approach of storing an LSN for each
clog position is unattractive: the LSNs are 32x bigger than the two-bit
commit status fields, and so we'd need 256K of additional shared memory for
each 8K clog buffer page. We choose instead to store a smaller number of
LSNs per page, where each LSN is the highest LSN associated with any
transaction commit in a contiguous range of transaction IDs on that page.
This saves storage at the price of some possibly-unnecessary delay in
setting transaction hint bits.

How many transactions should share the same cached LSN (N)? If the
system's workload consists only of small async-commit transactions, then
it's reasonable to have N similar to the number of transactions per
walwriter cycle, since that is the granularity with which transactions will
become truly committed (and thus hintable) anyway. The worst case is where
a sync-commit xact shares a cached LSN with an async-commit xact that
commits a bit later; even though we paid to sync the first xact to disk,
we won't be able to hint its outputs until the second xact is sync'd, up to
three walwriter cycles later.
This argues for keeping N (the group size) as small as possible. For the
moment we are setting the group size to 32, which makes the LSN cache space
the same size as the actual clog buffer space (independently of BLCKSZ).

It is useful that we can run both synchronous and asynchronous commit
transactions concurrently, but the safety of this is perhaps not
immediately obvious. Assume we have two transactions, T1 and T2. The Log
Sequence Number (LSN) is the point in the WAL sequence where a transaction
commit is recorded, so LSN1 and LSN2 are the commit records of those
transactions. If T2 can see changes made by T1 then when T2 commits it
must be true that LSN2 follows LSN1. Thus when T2 commits it is certain
that all of the changes made by T1 are also now recorded in the WAL. This
is true whether T1 was asynchronous or synchronous. As a result, it is
safe for asynchronous commits and synchronous commits to work concurrently
without endangering data written by synchronous commits. Sub-transactions
are not important here since the final write to disk only occurs at the
commit of the top level transaction.

Changes to data blocks cannot reach disk unless WAL is flushed up to the
point of the LSN of the data blocks. Any attempt to write unsafe data to
disk will trigger a WAL write which ensures the safety of all data written
by that and prior transactions. Data blocks and clog pages are both
protected by LSNs.

Changes to a temp table are not WAL-logged, hence could reach disk in
advance of T1's commit, but we don't care since temp table contents don't
survive crashes anyway.

Database writes made via any of the paths we have introduced to avoid WAL
overhead for bulk updates are also safe.
In these cases it's entirely possible for the data to reach disk before
T1's commit, because T1 will fsync it down to disk without any sort of
interlock, as soon as it finishes the bulk update. However, all these paths
are designed to write data that no other transaction can see until after T1
commits. The situation is thus not different from ordinary WAL-logged
updates.


Transaction Emulation during Recovery
-------------------------------------

During Recovery we replay transaction changes in the order they occurred.
As part of this replay we emulate some transactional behaviour, so that
read only backends can take MVCC snapshots. We do this by maintaining a
list of XIDs belonging to transactions that are being replayed, so that
each transaction that has recorded WAL records for database writes exists
in the array until it commits. Further details are given in comments in
procarray.c.

Many actions write no WAL records at all, for example read only transactions.
These have no effect on MVCC in recovery and we can pretend they never
occurred at all. Subtransaction commit does not write a WAL record either
and has very little effect, since lock waiters need to wait for the
parent transaction to complete.

Not all transactional behaviour is emulated; for example, we do not insert
a transaction entry into the lock table, nor do we maintain the transaction
stack in memory. Clog, multixact and commit_ts entries are made normally.
Subtrans is maintained during recovery but the details of the transaction
tree are ignored and all subtransactions reference the top-level TransactionId
directly. Since commit is atomic this provides correct lock wait behaviour
yet simplifies emulation of subtransactions considerably.

Further details on locking mechanics in recovery are given in comments
with the Lock rmgr code.
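
The bookkeeping of replayed XIDs described in this section can be
illustrated with a toy array. This is not the data structure used in
procarray.c: ToyKnownXids, its fixed capacity and the three functions are
hypothetical, and removing an XID on commit by swapping in the last entry
is a simplification chosen for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAX_XIDS 64         /* hypothetical fixed capacity */

typedef uint32_t ToyXid;

typedef struct ToyKnownXids
{
    ToyXid xids[TOY_MAX_XIDS];  /* XIDs seen in WAL and not yet finished */
    int    count;
} ToyKnownXids;

/* Would a snapshot taken now treat this XID as still in progress? */
bool
toy_xid_is_running(const ToyKnownXids *k, ToyXid xid)
{
    for (int i = 0; i < k->count; i++)
        if (k->xids[i] == xid)
            return true;
    return false;
}

/* Replay of a WAL record for a database write made by xid. */
bool
toy_replay_write(ToyKnownXids *k, ToyXid xid)
{
    if (toy_xid_is_running(k, xid))
        return true;            /* already tracked */
    if (k->count == TOY_MAX_XIDS)
        return false;           /* toy model simply refuses when full */
    k->xids[k->count++] = xid;
    return true;
}

/* Replay of the commit (or abort) record for xid. */
void
toy_replay_commit(ToyKnownXids *k, ToyXid xid)
{
    for (int i = 0; i < k->count; i++)
    {
        if (k->xids[i] == xid)
        {
            k->xids[i] = k->xids[--k->count];   /* order is not preserved */
            return;
        }
    }
}
```

A read only transaction never appears in WAL, so it never enters the
array, which is why it has no effect on snapshots taken during recovery.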