1src/backend/executor/README 2 3The Postgres Executor 4===================== 5 6The executor processes a tree of "plan nodes". The plan tree is essentially 7a demand-pull pipeline of tuple processing operations. Each node, when 8called, will produce the next tuple in its output sequence, or NULL if no 9more tuples are available. If the node is not a primitive relation-scanning 10node, it will have child node(s) that it calls in turn to obtain input 11tuples. 12 13Refinements on this basic model include: 14 15* Choice of scan direction (forwards or backwards). Caution: this is not 16currently well-supported. It works for primitive scan nodes, but not very 17well for joins, aggregates, etc. 18 19* Rescan command to reset a node and make it generate its output sequence 20over again. 21 22* Parameters that can alter a node's results. After adjusting a parameter, 23the rescan command must be applied to that node and all nodes above it. 24There is a moderately intelligent scheme to avoid rescanning nodes 25unnecessarily (for example, Sort does not rescan its input if no parameters 26of the input have changed, since it can just reread its stored sorted data). 27 28For a SELECT, it is only necessary to deliver the top-level result tuples 29to the client. For INSERT/UPDATE/DELETE, the actual table modification 30operations happen in a top-level ModifyTable plan node. If the query 31includes a RETURNING clause, the ModifyTable node delivers the computed 32RETURNING rows as output, otherwise it returns nothing. Handling INSERT 33is pretty straightforward: the tuples returned from the plan tree below 34ModifyTable are inserted into the correct result relation. For UPDATE, 35the plan tree returns the computed tuples to be updated, plus a "junk" 36(hidden) CTID column identifying which table row is to be replaced by each 37one. For DELETE, the plan tree need only deliver a CTID column, and the 38ModifyTable node visits each of those rows and marks the row deleted. 39 40XXX a great deal more documentation needs to be written here... 41 42 43Plan Trees and State Trees 44-------------------------- 45 46The plan tree delivered by the planner contains a tree of Plan nodes (struct 47types derived from struct Plan). During executor startup we build a parallel 48tree of identical structure containing executor state nodes --- every plan 49node type has a corresponding executor state node type. Each node in the 50state tree has a pointer to its corresponding node in the plan tree, plus 51executor state data as needed to implement that node type. This arrangement 52allows the plan tree to be completely read-only so far as the executor is 53concerned: all data that is modified during execution is in the state tree. 54Read-only plan trees make life much simpler for plan caching and reuse. 55 56Each Plan node may have expression trees associated with it, to represent 57its target list, qualification conditions, etc. These trees are also 58read-only to the executor, but the executor state for expression evaluation 59does not mirror the Plan expression's tree shape, as explained below. 60Rather, there's just one ExprState node per expression tree, although this 61may have sub-nodes for some complex expression node types. 62 63Altogether there are four classes of nodes used in these trees: Plan nodes, 64their corresponding PlanState nodes, Expr nodes, and ExprState nodes. 65(Actually, there are also List nodes, which are used as "glue" in all 66three tree-based representations.) 67 68 69Expression Trees and ExprState nodes 70------------------------------------ 71 72Expression trees, in contrast to Plan trees, are not mirrored into a 73corresponding tree of state nodes. Instead each separately executable 74expression tree (e.g. a Plan's qual or targetlist) is represented by one 75ExprState node. The ExprState node contains the information needed to 76evaluate the expression in a compact, linear form. That compact form is 77stored as a flat array in ExprState->steps[] (an array of ExprEvalStep, 78not ExprEvalStep *). 79 80The reasons for choosing such a representation include: 81- commonly the amount of work needed to evaluate one Expr-type node is 82 small enough that the overhead of having to perform a tree-walk 83 during evaluation is significant. 84- the flat representation can be evaluated non-recursively within a single 85 function, reducing stack depth and function call overhead. 86- such a representation is usable both for fast interpreted execution, 87 and for compiling into native code. 88 89The Plan-tree representation of an expression is compiled into an 90ExprState node by ExecInitExpr(). As much complexity as possible should 91be handled by ExecInitExpr() (and helpers), instead of execution time 92where both interpreted and compiled versions would need to deal with the 93complexity. Besides duplicating effort between execution approaches, 94runtime initialization checks also have a small but noticeable cost every 95time the expression is evaluated. Therefore, we allow ExecInitExpr() to 96precompute information that we do not expect to vary across execution of a 97single query, for example the set of CHECK constraint expressions to be 98applied to a domain type. This could not be done at plan time without 99greatly increasing the number of events that require plan invalidation. 100(Previously, some information of this kind was rechecked on each 101expression evaluation, but that seems like unnecessary overhead.) 102 103 104Expression Initialization 105------------------------- 106 107During ExecInitExpr() and similar routines, Expr trees are converted 108into the flat representation. Each Expr node might be represented by 109zero, one, or more ExprEvalSteps. 110 111Each ExprEvalStep's work is determined by its opcode (of enum ExprEvalOp) 112and it stores the result of its work into the Datum variable and boolean 113null flag variable pointed to by ExprEvalStep->resvalue/resnull. 114Complex expressions are performed by chaining together several steps. 115For example, "a + b" (one OpExpr, with two Var expressions) would be 116represented as two steps to fetch the Var values, and one step for the 117evaluation of the function underlying the + operator. The steps for the 118Vars would have their resvalue/resnull pointing directly to the appropriate 119arg[] and argnull[] array elements in the FunctionCallInfoData struct that 120is used by the function evaluation step, thus avoiding extra work to copy 121the result values around. 122 123The last entry in a completed ExprState->steps array is always an 124EEOP_DONE step; this removes the need to test for end-of-array while 125iterating. Also, if the expression contains any variable references (to 126user columns of the ExprContext's INNER, OUTER, or SCAN tuples), the steps 127array begins with EEOP_*_FETCHSOME steps that ensure that the relevant 128tuples have been deconstructed to make the required columns directly 129available (cf. slot_getsomeattrs()). This allows individual Var-fetching 130steps to be little more than an array lookup. 131 132Most of ExecInitExpr()'s work is done by the recursive function 133ExecInitExprRec() and its subroutines. ExecInitExprRec() maps one Expr 134node into the steps required for execution, recursing as needed for 135sub-expressions. 136 137Each ExecInitExprRec() call has to specify where that subexpression's 138results are to be stored (via the resv/resnull parameters). This allows 139the above scenario of evaluating a (sub-)expression directly into 140fcinfo->arg/argnull, but also requires some care: target Datum/isnull 141variables may not be shared with another ExecInitExprRec() unless the 142results are only needed by steps executing before further usages of those 143target Datum/isnull variables. Due to the non-recursiveness of the 144ExprEvalStep representation that's usually easy to guarantee. 145 146ExecInitExprRec() pushes new operations into the ExprState->steps array 147using ExprEvalPushStep(). To keep the steps as a consecutively laid out 148array, ExprEvalPushStep() has to repalloc the entire array when there's 149not enough space. Because of that it is *not* allowed to point directly 150into any of the steps during expression initialization. Therefore, the 151resv/resnull for a subexpression usually point to some storage that is 152palloc'd separately from the steps array. For instance, the 153FunctionCallInfoData for a function call step is separately allocated 154rather than being part of the ExprEvalStep array. The overall result 155of a complete expression is typically returned into the resvalue/resnull 156fields of the ExprState node itself. 157 158Some steps, e.g. boolean expressions, allow skipping evaluation of 159certain subexpressions. In the flat representation this amounts to 160jumping to some later step rather than just continuing consecutively 161with the next step. The target for such a jump is represented by 162the integer index in the ExprState->steps array of the step to execute 163next. (Compare the EEO_NEXT and EEO_JUMP macros in execExprInterp.c.) 164 165Typically, ExecInitExprRec() has to push a jumping step into the steps 166array, then recursively generate steps for the subexpression that might 167get skipped over, then go back and fix up the jump target index using 168the now-known length of the subexpression's steps. This is handled by 169adjust_jumps lists in execExpr.c. 170 171The last step in constructing an ExprState is to apply ExecReadyExpr(), 172which readies it for execution using whichever execution method has been 173selected. 174 175 176Expression Evaluation 177--------------------- 178 179To allow for different methods of expression evaluation, and for 180better branch/jump target prediction, expressions are evaluated by 181calling ExprState->evalfunc (via ExprEvalExpr() and friends). 182 183ExprReadyExpr() can choose the method of interpretation by setting 184evalfunc to an appropriate function. The default execution function, 185ExecInterpExpr, is implemented in execExprInterp.c; see its header 186comment for details. Special-case evalfuncs are used for certain 187especially-simple expressions. 188 189Note that a lot of the more complex expression evaluation steps, which are 190less performance-critical than the simpler ones, are implemented as 191separate functions outside the fast-path of expression execution, allowing 192their implementation to be shared between interpreted and compiled 193expression evaluation. This means that these helper functions are not 194allowed to perform expression step dispatch themselves, as the method of 195dispatch will vary based on the caller. The helpers therefore cannot call 196for the execution of subexpressions; all subexpression results they need 197must be computed by earlier steps. And dispatch to the following 198expression step must be performed after returning from the helper. 199 200 201Targetlist Evaluation 202--------------------- 203 204ExecBuildProjectionInfo builds an ExprState that has the effect of 205evaluating a targetlist into ExprState->resultslot. A generic targetlist 206expression is executed by evaluating it as discussed above (storing the 207result into the ExprState's resvalue/resnull fields) and then using an 208EEOP_ASSIGN_TMP step to move the result into the appropriate tts_values[] 209and tts_isnull[] array elements of the result slot. There are special 210fast-path step types (EEOP_ASSIGN_*_VAR) to handle targetlist entries that 211are simple Vars using only one step instead of two. 212 213 214Memory Management 215----------------- 216 217A "per query" memory context is created during CreateExecutorState(); 218all storage allocated during an executor invocation is allocated in that 219context or a child context. This allows easy reclamation of storage 220during executor shutdown --- rather than messing with retail pfree's and 221probable storage leaks, we just destroy the memory context. 222 223In particular, the plan state trees and expression state trees described 224in the previous section are allocated in the per-query memory context. 225 226To avoid intra-query memory leaks, most processing while a query runs 227is done in "per tuple" memory contexts, which are so-called because they 228are typically reset to empty once per tuple. Per-tuple contexts are usually 229associated with ExprContexts, and commonly each PlanState node has its own 230ExprContext to evaluate its qual and targetlist expressions in. 231 232 233Query Processing Control Flow 234----------------------------- 235 236This is a sketch of control flow for full query processing: 237 238 CreateQueryDesc 239 240 ExecutorStart 241 CreateExecutorState 242 creates per-query context 243 switch to per-query context to run ExecInitNode 244 AfterTriggerBeginQuery 245 ExecInitNode --- recursively scans plan tree 246 CreateExprContext 247 creates per-tuple context 248 ExecInitExpr 249 250 ExecutorRun 251 ExecProcNode --- recursively called in per-query context 252 ExecEvalExpr --- called in per-tuple context 253 ResetExprContext --- to free memory 254 255 ExecutorFinish 256 ExecPostprocessPlan --- run any unfinished ModifyTable nodes 257 AfterTriggerEndQuery 258 259 ExecutorEnd 260 ExecEndNode --- recursively releases resources 261 FreeExecutorState 262 frees per-query context and child contexts 263 264 FreeQueryDesc 265 266Per above comments, it's not really critical for ExecEndNode to free any 267memory; it'll all go away in FreeExecutorState anyway. However, we do need to 268be careful to close relations, drop buffer pins, etc, so we do need to scan 269the plan state tree to find these sorts of resources. 270 271 272The executor can also be used to evaluate simple expressions without any Plan 273tree ("simple" meaning "no aggregates and no sub-selects", though such might 274be hidden inside function calls). This case has a flow of control like 275 276 CreateExecutorState 277 creates per-query context 278 279 CreateExprContext -- or use GetPerTupleExprContext(estate) 280 creates per-tuple context 281 282 ExecPrepareExpr 283 temporarily switch to per-query context 284 run the expression through expression_planner 285 ExecInitExpr 286 287 Repeatedly do: 288 ExecEvalExprSwitchContext 289 ExecEvalExpr --- called in per-tuple context 290 ResetExprContext --- to free memory 291 292 FreeExecutorState 293 frees per-query context, as well as ExprContext 294 (a separate FreeExprContext call is not necessary) 295 296 297EvalPlanQual (READ COMMITTED Update Checking) 298--------------------------------------------- 299 300For simple SELECTs, the executor need only pay attention to tuples that are 301valid according to the snapshot seen by the current transaction (ie, they 302were inserted by a previously committed transaction, and not deleted by any 303previously committed transaction). However, for UPDATE and DELETE it is not 304cool to modify or delete a tuple that's been modified by an open or 305concurrently-committed transaction. If we are running in SERIALIZABLE 306isolation level then we just raise an error when this condition is seen to 307occur. In READ COMMITTED isolation level, we must work a lot harder. 308 309The basic idea in READ COMMITTED mode is to take the modified tuple 310committed by the concurrent transaction (after waiting for it to commit, 311if need be) and re-evaluate the query qualifications to see if it would 312still meet the quals. If so, we regenerate the updated tuple (if we are 313doing an UPDATE) from the modified tuple, and finally update/delete the 314modified tuple. SELECT FOR UPDATE/SHARE behaves similarly, except that its 315action is just to lock the modified tuple and return results based on that 316version of the tuple. 317 318To implement this checking, we actually re-run the query from scratch for 319each modified tuple (or set of tuples, for SELECT FOR UPDATE), with the 320relation scan nodes tweaked to return only the current tuples --- either 321the original ones, or the updated (and now locked) versions of the modified 322tuple(s). If this query returns a tuple, then the modified tuple(s) pass 323the quals (and the query output is the suitably modified update tuple, if 324we're doing UPDATE). If no tuple is returned, then the modified tuple(s) 325fail the quals, so we ignore the current result tuple and continue the 326original query. 327 328In UPDATE/DELETE, only the target relation needs to be handled this way. 329In SELECT FOR UPDATE, there may be multiple relations flagged FOR UPDATE, 330so we obtain lock on the current tuple version in each such relation before 331executing the recheck. 332 333It is also possible that there are relations in the query that are not 334to be locked (they are neither the UPDATE/DELETE target nor specified to 335be locked in SELECT FOR UPDATE/SHARE). When re-running the test query 336we want to use the same rows from these relations that were joined to 337the locked rows. For ordinary relations this can be implemented relatively 338cheaply by including the row TID in the join outputs and re-fetching that 339TID. (The re-fetch is expensive, but we're trying to optimize the normal 340case where no re-test is needed.) We have also to consider non-table 341relations, such as a ValuesScan or FunctionScan. For these, since there 342is no equivalent of TID, the only practical solution seems to be to include 343the entire row value in the join output row. 344 345We disallow set-returning functions in the targetlist of SELECT FOR UPDATE, 346so as to ensure that at most one tuple can be returned for any particular 347set of scan tuples. Otherwise we'd get duplicates due to the original 348query returning the same set of scan tuples multiple times. Likewise, 349SRFs are disallowed in an UPDATE's targetlist. There, they would have the 350effect of the same row being updated multiple times, which is not very 351useful --- and updates after the first would have no effect anyway. 352