1src/backend/executor/README 2 3The Postgres Executor 4===================== 5 6The executor processes a tree of "plan nodes". The plan tree is essentially 7a demand-pull pipeline of tuple processing operations. Each node, when 8called, will produce the next tuple in its output sequence, or NULL if no 9more tuples are available. If the node is not a primitive relation-scanning 10node, it will have child node(s) that it calls in turn to obtain input 11tuples. 12 13Refinements on this basic model include: 14 15* Choice of scan direction (forwards or backwards). Caution: this is not 16currently well-supported. It works for primitive scan nodes, but not very 17well for joins, aggregates, etc. 18 19* Rescan command to reset a node and make it generate its output sequence 20over again. 21 22* Parameters that can alter a node's results. After adjusting a parameter, 23the rescan command must be applied to that node and all nodes above it. 24There is a moderately intelligent scheme to avoid rescanning nodes 25unnecessarily (for example, Sort does not rescan its input if no parameters 26of the input have changed, since it can just reread its stored sorted data). 27 28For a SELECT, it is only necessary to deliver the top-level result tuples 29to the client. For INSERT/UPDATE/DELETE, the actual table modification 30operations happen in a top-level ModifyTable plan node. If the query 31includes a RETURNING clause, the ModifyTable node delivers the computed 32RETURNING rows as output, otherwise it returns nothing. Handling INSERT 33is pretty straightforward: the tuples returned from the plan tree below 34ModifyTable are inserted into the correct result relation. For UPDATE, 35the plan tree returns the computed tuples to be updated, plus a "junk" 36(hidden) CTID column identifying which table row is to be replaced by each 37one. For DELETE, the plan tree need only deliver a CTID column, and the 38ModifyTable node visits each of those rows and marks the row deleted. 39 40XXX a great deal more documentation needs to be written here... 41 42 43Plan Trees and State Trees 44-------------------------- 45 46The plan tree delivered by the planner contains a tree of Plan nodes (struct 47types derived from struct Plan). During executor startup we build a parallel 48tree of identical structure containing executor state nodes --- generally, 49every plan node type has a corresponding executor state node type. Each node 50in the state tree has a pointer to its corresponding node in the plan tree, 51plus executor state data as needed to implement that node type. This 52arrangement allows the plan tree to be completely read-only so far as the 53executor is concerned: all data that is modified during execution is in the 54state tree. Read-only plan trees make life much simpler for plan caching and 55reuse. 56 57A corresponding executor state node may not be created during executor startup 58if the executor determines that an entire subplan is not required due to 59execution time partition pruning determining that no matching records will be 60found there. This currently only occurs for Append and MergeAppend nodes. In 61this case the non-required subplans are ignored and the executor state's 62subnode array will become out of sequence to the plan's subplan list. 63 64Each Plan node may have expression trees associated with it, to represent 65its target list, qualification conditions, etc. These trees are also 66read-only to the executor, but the executor state for expression evaluation 67does not mirror the Plan expression's tree shape, as explained below. 68Rather, there's just one ExprState node per expression tree, although this 69may have sub-nodes for some complex expression node types. 70 71Altogether there are four classes of nodes used in these trees: Plan nodes, 72their corresponding PlanState nodes, Expr nodes, and ExprState nodes. 73(Actually, there are also List nodes, which are used as "glue" in all 74three tree-based representations.) 75 76 77Expression Trees and ExprState nodes 78------------------------------------ 79 80Expression trees, in contrast to Plan trees, are not mirrored into a 81corresponding tree of state nodes. Instead each separately executable 82expression tree (e.g. a Plan's qual or targetlist) is represented by one 83ExprState node. The ExprState node contains the information needed to 84evaluate the expression in a compact, linear form. That compact form is 85stored as a flat array in ExprState->steps[] (an array of ExprEvalStep, 86not ExprEvalStep *). 87 88The reasons for choosing such a representation include: 89- commonly the amount of work needed to evaluate one Expr-type node is 90 small enough that the overhead of having to perform a tree-walk 91 during evaluation is significant. 92- the flat representation can be evaluated non-recursively within a single 93 function, reducing stack depth and function call overhead. 94- such a representation is usable both for fast interpreted execution, 95 and for compiling into native code. 96 97The Plan-tree representation of an expression is compiled into an 98ExprState node by ExecInitExpr(). As much complexity as possible should 99be handled by ExecInitExpr() (and helpers), instead of execution time 100where both interpreted and compiled versions would need to deal with the 101complexity. Besides duplicating effort between execution approaches, 102runtime initialization checks also have a small but noticeable cost every 103time the expression is evaluated. Therefore, we allow ExecInitExpr() to 104precompute information that we do not expect to vary across execution of a 105single query, for example the set of CHECK constraint expressions to be 106applied to a domain type. This could not be done at plan time without 107greatly increasing the number of events that require plan invalidation. 108(Previously, some information of this kind was rechecked on each 109expression evaluation, but that seems like unnecessary overhead.) 110 111 112Expression Initialization 113------------------------- 114 115During ExecInitExpr() and similar routines, Expr trees are converted 116into the flat representation. Each Expr node might be represented by 117zero, one, or more ExprEvalSteps. 118 119Each ExprEvalStep's work is determined by its opcode (of enum ExprEvalOp) 120and it stores the result of its work into the Datum variable and boolean 121null flag variable pointed to by ExprEvalStep->resvalue/resnull. 122Complex expressions are performed by chaining together several steps. 123For example, "a + b" (one OpExpr, with two Var expressions) would be 124represented as two steps to fetch the Var values, and one step for the 125evaluation of the function underlying the + operator. The steps for the 126Vars would have their resvalue/resnull pointing directly to the appropriate 127arg[] and argnull[] array elements in the FunctionCallInfoData struct that 128is used by the function evaluation step, thus avoiding extra work to copy 129the result values around. 130 131The last entry in a completed ExprState->steps array is always an 132EEOP_DONE step; this removes the need to test for end-of-array while 133iterating. Also, if the expression contains any variable references (to 134user columns of the ExprContext's INNER, OUTER, or SCAN tuples), the steps 135array begins with EEOP_*_FETCHSOME steps that ensure that the relevant 136tuples have been deconstructed to make the required columns directly 137available (cf. slot_getsomeattrs()). This allows individual Var-fetching 138steps to be little more than an array lookup. 139 140Most of ExecInitExpr()'s work is done by the recursive function 141ExecInitExprRec() and its subroutines. ExecInitExprRec() maps one Expr 142node into the steps required for execution, recursing as needed for 143sub-expressions. 144 145Each ExecInitExprRec() call has to specify where that subexpression's 146results are to be stored (via the resv/resnull parameters). This allows 147the above scenario of evaluating a (sub-)expression directly into 148fcinfo->arg/argnull, but also requires some care: target Datum/isnull 149variables may not be shared with another ExecInitExprRec() unless the 150results are only needed by steps executing before further usages of those 151target Datum/isnull variables. Due to the non-recursiveness of the 152ExprEvalStep representation that's usually easy to guarantee. 153 154ExecInitExprRec() pushes new operations into the ExprState->steps array 155using ExprEvalPushStep(). To keep the steps as a consecutively laid out 156array, ExprEvalPushStep() has to repalloc the entire array when there's 157not enough space. Because of that it is *not* allowed to point directly 158into any of the steps during expression initialization. Therefore, the 159resv/resnull for a subexpression usually point to some storage that is 160palloc'd separately from the steps array. For instance, the 161FunctionCallInfoData for a function call step is separately allocated 162rather than being part of the ExprEvalStep array. The overall result 163of a complete expression is typically returned into the resvalue/resnull 164fields of the ExprState node itself. 165 166Some steps, e.g. boolean expressions, allow skipping evaluation of 167certain subexpressions. In the flat representation this amounts to 168jumping to some later step rather than just continuing consecutively 169with the next step. The target for such a jump is represented by 170the integer index in the ExprState->steps array of the step to execute 171next. (Compare the EEO_NEXT and EEO_JUMP macros in execExprInterp.c.) 172 173Typically, ExecInitExprRec() has to push a jumping step into the steps 174array, then recursively generate steps for the subexpression that might 175get skipped over, then go back and fix up the jump target index using 176the now-known length of the subexpression's steps. This is handled by 177adjust_jumps lists in execExpr.c. 178 179The last step in constructing an ExprState is to apply ExecReadyExpr(), 180which readies it for execution using whichever execution method has been 181selected. 182 183 184Expression Evaluation 185--------------------- 186 187To allow for different methods of expression evaluation, and for 188better branch/jump target prediction, expressions are evaluated by 189calling ExprState->evalfunc (via ExprEvalExpr() and friends). 190 191ExprReadyExpr() can choose the method of interpretation by setting 192evalfunc to an appropriate function. The default execution function, 193ExecInterpExpr, is implemented in execExprInterp.c; see its header 194comment for details. Special-case evalfuncs are used for certain 195especially-simple expressions. 196 197Note that a lot of the more complex expression evaluation steps, which are 198less performance-critical than the simpler ones, are implemented as 199separate functions outside the fast-path of expression execution, allowing 200their implementation to be shared between interpreted and compiled 201expression evaluation. This means that these helper functions are not 202allowed to perform expression step dispatch themselves, as the method of 203dispatch will vary based on the caller. The helpers therefore cannot call 204for the execution of subexpressions; all subexpression results they need 205must be computed by earlier steps. And dispatch to the following 206expression step must be performed after returning from the helper. 207 208 209Targetlist Evaluation 210--------------------- 211 212ExecBuildProjectionInfo builds an ExprState that has the effect of 213evaluating a targetlist into ExprState->resultslot. A generic targetlist 214expression is executed by evaluating it as discussed above (storing the 215result into the ExprState's resvalue/resnull fields) and then using an 216EEOP_ASSIGN_TMP step to move the result into the appropriate tts_values[] 217and tts_isnull[] array elements of the result slot. There are special 218fast-path step types (EEOP_ASSIGN_*_VAR) to handle targetlist entries that 219are simple Vars using only one step instead of two. 220 221 222Memory Management 223----------------- 224 225A "per query" memory context is created during CreateExecutorState(); 226all storage allocated during an executor invocation is allocated in that 227context or a child context. This allows easy reclamation of storage 228during executor shutdown --- rather than messing with retail pfree's and 229probable storage leaks, we just destroy the memory context. 230 231In particular, the plan state trees and expression state trees described 232in the previous section are allocated in the per-query memory context. 233 234To avoid intra-query memory leaks, most processing while a query runs 235is done in "per tuple" memory contexts, which are so-called because they 236are typically reset to empty once per tuple. Per-tuple contexts are usually 237associated with ExprContexts, and commonly each PlanState node has its own 238ExprContext to evaluate its qual and targetlist expressions in. 239 240 241Query Processing Control Flow 242----------------------------- 243 244This is a sketch of control flow for full query processing: 245 246 CreateQueryDesc 247 248 ExecutorStart 249 CreateExecutorState 250 creates per-query context 251 switch to per-query context to run ExecInitNode 252 AfterTriggerBeginQuery 253 ExecInitNode --- recursively scans plan tree 254 ExecInitNode 255 recurse into subsidiary nodes 256 CreateExprContext 257 creates per-tuple context 258 ExecInitExpr 259 260 ExecutorRun 261 ExecProcNode --- recursively called in per-query context 262 ExecEvalExpr --- called in per-tuple context 263 ResetExprContext --- to free memory 264 265 ExecutorFinish 266 ExecPostprocessPlan --- run any unfinished ModifyTable nodes 267 AfterTriggerEndQuery 268 269 ExecutorEnd 270 ExecEndNode --- recursively releases resources 271 FreeExecutorState 272 frees per-query context and child contexts 273 274 FreeQueryDesc 275 276Per above comments, it's not really critical for ExecEndNode to free any 277memory; it'll all go away in FreeExecutorState anyway. However, we do need to 278be careful to close relations, drop buffer pins, etc, so we do need to scan 279the plan state tree to find these sorts of resources. 280 281 282The executor can also be used to evaluate simple expressions without any Plan 283tree ("simple" meaning "no aggregates and no sub-selects", though such might 284be hidden inside function calls). This case has a flow of control like 285 286 CreateExecutorState 287 creates per-query context 288 289 CreateExprContext -- or use GetPerTupleExprContext(estate) 290 creates per-tuple context 291 292 ExecPrepareExpr 293 temporarily switch to per-query context 294 run the expression through expression_planner 295 ExecInitExpr 296 297 Repeatedly do: 298 ExecEvalExprSwitchContext 299 ExecEvalExpr --- called in per-tuple context 300 ResetExprContext --- to free memory 301 302 FreeExecutorState 303 frees per-query context, as well as ExprContext 304 (a separate FreeExprContext call is not necessary) 305 306 307EvalPlanQual (READ COMMITTED Update Checking) 308--------------------------------------------- 309 310For simple SELECTs, the executor need only pay attention to tuples that are 311valid according to the snapshot seen by the current transaction (ie, they 312were inserted by a previously committed transaction, and not deleted by any 313previously committed transaction). However, for UPDATE and DELETE it is not 314cool to modify or delete a tuple that's been modified by an open or 315concurrently-committed transaction. If we are running in SERIALIZABLE 316isolation level then we just raise an error when this condition is seen to 317occur. In READ COMMITTED isolation level, we must work a lot harder. 318 319The basic idea in READ COMMITTED mode is to take the modified tuple 320committed by the concurrent transaction (after waiting for it to commit, 321if need be) and re-evaluate the query qualifications to see if it would 322still meet the quals. If so, we regenerate the updated tuple (if we are 323doing an UPDATE) from the modified tuple, and finally update/delete the 324modified tuple. SELECT FOR UPDATE/SHARE behaves similarly, except that its 325action is just to lock the modified tuple and return results based on that 326version of the tuple. 327 328To implement this checking, we actually re-run the query from scratch for 329each modified tuple (or set of tuples, for SELECT FOR UPDATE), with the 330relation scan nodes tweaked to return only the current tuples --- either 331the original ones, or the updated (and now locked) versions of the modified 332tuple(s). If this query returns a tuple, then the modified tuple(s) pass 333the quals (and the query output is the suitably modified update tuple, if 334we're doing UPDATE). If no tuple is returned, then the modified tuple(s) 335fail the quals, so we ignore the current result tuple and continue the 336original query. 337 338In UPDATE/DELETE, only the target relation needs to be handled this way. 339In SELECT FOR UPDATE, there may be multiple relations flagged FOR UPDATE, 340so we obtain lock on the current tuple version in each such relation before 341executing the recheck. 342 343It is also possible that there are relations in the query that are not 344to be locked (they are neither the UPDATE/DELETE target nor specified to 345be locked in SELECT FOR UPDATE/SHARE). When re-running the test query 346we want to use the same rows from these relations that were joined to 347the locked rows. For ordinary relations this can be implemented relatively 348cheaply by including the row TID in the join outputs and re-fetching that 349TID. (The re-fetch is expensive, but we're trying to optimize the normal 350case where no re-test is needed.) We have also to consider non-table 351relations, such as a ValuesScan or FunctionScan. For these, since there 352is no equivalent of TID, the only practical solution seems to be to include 353the entire row value in the join output row. 354 355We disallow set-returning functions in the targetlist of SELECT FOR UPDATE, 356so as to ensure that at most one tuple can be returned for any particular 357set of scan tuples. Otherwise we'd get duplicates due to the original 358query returning the same set of scan tuples multiple times. Likewise, 359SRFs are disallowed in an UPDATE's targetlist. There, they would have the 360effect of the same row being updated multiple times, which is not very 361useful --- and updates after the first would have no effect anyway. 362