Overview
========

PostgreSQL provides some simple facilities to make writing parallel algorithms
easier. Using a data structure called a ParallelContext, you can arrange to
launch background worker processes, initialize their state to match that of
the backend which initiated parallelism, communicate with them via dynamic
shared memory, and write reasonably complex code that can run either in the
user backend or in one of the parallel workers without needing to be aware of
where it's running.

The backend which starts a parallel operation (hereafter, the initiating
backend) begins by creating a dynamic shared memory segment which will last
for the lifetime of the parallel operation. This dynamic shared memory segment
will contain (1) a shm_mq that can be used to transport errors (and other
messages reported via elog/ereport) from the worker back to the initiating
backend; (2) serialized representations of the initiating backend's private
state, so that the worker can synchronize its state with that of the initiating
backend; and (3) any other data structures which a particular user of the
ParallelContext data structure may wish to add for its own purposes. Once
the initiating backend has initialized the dynamic shared memory segment, it
asks the postmaster to launch the appropriate number of parallel workers.
These workers then connect to the dynamic shared memory segment, initialize
their state, and then invoke the appropriate entrypoint, as further detailed
below.

Error Reporting
===============

When started, each parallel worker begins by attaching to the dynamic shared
memory segment and locating the shm_mq to be used for error reporting; it
redirects all of its protocol messages to this shm_mq. Prior to this point,
any failure of the background worker will not be reported to the initiating
backend; from the point of view of the initiating backend, the worker simply
failed to start.
The initiating backend must in any case be prepared to cope
with fewer parallel workers than it originally requested, so catering to
this case imposes no additional burden.

Whenever a new message (or partial message; very large messages may wrap) is
sent to the error-reporting queue, PROCSIG_PARALLEL_MESSAGE is sent to the
initiating backend. This causes the next CHECK_FOR_INTERRUPTS() in the
initiating backend to read and rethrow the message. For the most part, this
makes error reporting in parallel mode "just work". Of course, to work
properly, it is important that the code the initiating backend is executing
calls CHECK_FOR_INTERRUPTS() regularly and avoids blocking interrupt
processing for long periods of time, but those are good things to do anyway.

(A currently-unsolved problem is that some messages may get written to the
system log twice, once in the backend where the report was originally
generated, and again when the initiating backend rethrows the message. If
we decide to suppress one of these reports, it should probably be the second
one; otherwise, if the worker is for some reason unable to propagate the
message back to the initiating backend, the message will be lost altogether.)

State Sharing
=============

It's possible to write C code which works correctly without parallelism, but
which fails when parallelism is used. No parallel infrastructure can
completely eliminate this problem, because any global variable is a risk.
There's no general mechanism for ensuring that every global variable in the
worker will have the same value that it does in the initiating backend; even
if we could ensure that, some function we're calling could update the variable
after each call, and only the backend where that update is performed will see
the new value. Similar problems can arise with any more-complex data
structure we might choose to use.
For example, a pseudo-random number
generator should, given a particular seed value, produce the same predictable
series of values every time. But it does this by relying on some private
state which won't automatically be shared between cooperating backends. A
parallel-safe PRNG would need to store its state in dynamic shared memory, and
would require locking. The parallelism infrastructure has no way of knowing
whether the user intends to call code that has this sort of problem, and can't
do anything about it anyway.

Instead, we take a more pragmatic approach. First, we try to make as many of
the operations that are safe outside of parallel mode work correctly in
parallel mode as well. Second, we try to prohibit common unsafe operations
via suitable error checks. These checks are intended to catch 100% of
unsafe things that a user might do from the SQL interface, but code written
in C can do unsafe things that won't trigger these checks. The error checks
are engaged via EnterParallelMode(), which should be called before creating
a parallel context, and disarmed via ExitParallelMode(), which should be
called after all parallel contexts have been destroyed. The most
significant restriction imposed by parallel mode is that all operations must
be strictly read-only; we allow no writes to the database and no DDL. We
might try to relax these restrictions in the future.

To make as many operations as possible safe in parallel mode, we try to copy
the most important pieces of state from the initiating backend to each parallel
worker. This includes:

  - The set of libraries dynamically loaded by dfmgr.c.

  - The authenticated user ID and current database. Each parallel worker
    will connect to the same database as the initiating backend, using the
    same user ID.

  - The values of all GUCs.
    Accordingly, permanent changes to the value of
    any GUC are forbidden while in parallel mode; but temporary changes,
    such as entering a function with non-NULL proconfig, are OK.

  - The current subtransaction's XID, the top-level transaction's XID, and
    the list of XIDs considered current (that is, they are in-progress or
    subcommitted). This information is needed to ensure that tuple visibility
    checks return the same results in the worker as they do in the
    initiating backend. See also the section Transaction Integration, below.

  - The combo CID mappings. This is needed to ensure consistent answers to
    tuple visibility checks. The need to synchronize this data structure is
    a major reason why we can't support writes in parallel mode: such writes
    might create new combo CIDs, and we have no way to let other workers
    (or the initiating backend) know about them.

  - The transaction snapshot.

  - The active snapshot, which might be different from the transaction
    snapshot.

  - The currently active user ID and security context. Note that this is
    the fourth user ID we restore: the initial step of binding to the correct
    database also involves restoring the authenticated user ID. When GUC
    values are restored, this incidentally sets SessionUserId and OuterUserId
    to the correct values. This final step restores CurrentUserId.

  - State related to pending REINDEX operations, which prevents access to
    an index that is currently being rebuilt.

To prevent unprincipled deadlocks when running in parallel mode, this code
also arranges for the leader and all workers to participate in group
locking. See src/backend/storage/lmgr/README for more details.

Transaction Integration
=======================

Regardless of what the TransactionState stack looks like in the parallel
leader, each parallel worker ends up with a stack of depth 1.
This stack
entry is marked with the special transaction block state
TBLOCK_PARALLEL_INPROGRESS so that it's not confused with an ordinary
toplevel transaction. The XID of this TransactionState is set to the XID of
the innermost currently-active subtransaction in the initiating backend. The
initiating backend's toplevel XID, and all current (in-progress or
subcommitted) XIDs, are stored separately from the TransactionState stack,
but in such a way that GetTopTransactionId(), GetTopTransactionIdIfAny(), and
TransactionIdIsCurrentTransactionId() return the same values that they would
in the initiating backend. We could copy the entire transaction state stack,
but most of it would be useless: for example, you can't roll back to a
savepoint from within a parallel worker, and there are no resources
associated with the memory contexts or resource owners of intermediate
subtransactions.

No meaningful change to the transaction state can be made while in parallel
mode. No XIDs can be assigned, and no subtransactions can start or end,
because we have no way of communicating these state changes to cooperating
backends, or of synchronizing them. It's clearly unworkable for the initiating
backend to exit any transaction or subtransaction that was in progress when
parallelism was started before all parallel workers have exited; and it's even
more clearly crazy for a parallel worker to try to subcommit or subabort the
current subtransaction and execute in some other transaction context than was
present in the initiating backend. It might be practical to allow internal
sub-transactions (e.g. to implement a PL/pgSQL EXCEPTION block) to be used in
parallel mode, provided that they are XID-less, because other backends
wouldn't really need to know about those transactions or do anything
differently because of them. Right now, we don't even allow that.

At the end of a parallel operation, which can happen either because it
completed successfully or because it was interrupted by an error, parallel
workers associated with that operation exit. In the error case, transaction
abort processing in the parallel leader kills off any remaining workers, and
the parallel leader then waits for them to die. In the case of a successful
parallel operation, the parallel leader does not send any signals, but must
wait for workers to complete and exit of their own volition. In either
case, it is very important that all workers actually exit before the
parallel leader cleans up the (sub)transaction in which they were created;
otherwise, chaos can ensue. For example, if the leader is rolling back the
transaction that created the relation being scanned by a worker, the
relation could disappear while the worker is still busy scanning it. That's
not safe.

Generally, the cleanup performed by each worker at this point is similar to
top-level commit or abort. Each backend has its own resource owners: buffer
pins, catcache or relcache reference counts, tuple descriptors, and so on
are managed separately by each backend, which must free them before exiting.
There are, however, some important differences between parallel worker
commit or abort and a real top-level transaction commit or abort. Most
importantly:

  - No commit or abort record is written; the initiating backend is
    responsible for this.

  - Cleanup of pg_temp namespaces is not done. Parallel workers cannot
    safely access the initiating backend's pg_temp namespace, and should
    not create one of their own.

Coding Conventions
==================

Before beginning any parallel operation, call EnterParallelMode(); after all
parallel operations are completed, call ExitParallelMode(). To actually
parallelize a particular operation, use a ParallelContext.
The basic coding
pattern looks like this:

    EnterParallelMode();        /* prohibit unsafe state changes */

    pcxt = CreateParallelContext("library_name", "function_name", nworkers);

    /* Allow space for application-specific data here. */
    shm_toc_estimate_chunk(&pcxt->estimator, size);
    shm_toc_estimate_keys(&pcxt->estimator, keys);

    InitializeParallelDSM(pcxt);    /* create DSM and copy state to it */

    /* Store the data for which we reserved space. */
    space = shm_toc_allocate(pcxt->toc, size);
    shm_toc_insert(pcxt->toc, key, space);

    LaunchParallelWorkers(pcxt);

    /* do parallel stuff */

    WaitForParallelWorkersToFinish(pcxt);

    /* read any final results from dynamic shared memory */

    DestroyParallelContext(pcxt);

    ExitParallelMode();

If desired, after WaitForParallelWorkersToFinish() has been called, the
context can be reset so that workers can be launched anew using the same
parallel context. To do this, first call ReinitializeParallelDSM() to
reinitialize state managed by the parallel context machinery itself; then,
perform any other necessary resetting of state; after that, you can again
call LaunchParallelWorkers().
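
The relaunch sequence just described can be modelled as a small loop. The
following is only a sketch of the call order, not code from the PostgreSQL
tree: the ParallelContext struct and the three functions below are stand-in
stubs (assumptions) so that the fragment compiles on its own, and
run_rounds() is a hypothetical caller. Real code would include
access/parallel.h and use the real definitions.

```c
#include <assert.h>

/*
 * Stand-in for the real ParallelContext (assumption): it only records what
 * the stubs below were asked to do, so the call order can be checked.
 */
typedef struct ParallelContext
{
    int launches;   /* number of LaunchParallelWorkers() calls */
    int reinits;    /* number of ReinitializeParallelDSM() calls */
    int running;    /* 1 between launch and wait, else 0 */
} ParallelContext;

/* Stub: the real function asks the postmaster to start workers. */
void LaunchParallelWorkers(ParallelContext *pcxt)
{
    assert(!pcxt->running);     /* previous batch must have finished */
    pcxt->running = 1;
    pcxt->launches++;
}

/* Stub: the real function blocks until all workers have exited. */
void WaitForParallelWorkersToFinish(ParallelContext *pcxt)
{
    pcxt->running = 0;
}

/* Stub: the real function resets state owned by the parallel machinery. */
void ReinitializeParallelDSM(ParallelContext *pcxt)
{
    assert(!pcxt->running);     /* only valid after waiting for workers */
    pcxt->reinits++;
}

/* Hypothetical caller: reuse one parallel context for several rounds. */
void run_rounds(ParallelContext *pcxt, int nrounds)
{
    for (int round = 0; round < nrounds; round++)
    {
        if (round > 0)
        {
            ReinitializeParallelDSM(pcxt);
            /* reset any application-specific state in the DSM here */
        }

        LaunchParallelWorkers(pcxt);
        /* do parallel stuff */
        WaitForParallelWorkersToFinish(pcxt);
        /* read this round's results from dynamic shared memory */
    }
}
```

In real code the loop would still sit between InitializeParallelDSM() and
DestroyParallelContext(), inside the EnterParallelMode() and
ExitParallelMode() pair, as in the basic pattern above.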