1src/backend/executor/README
2
3The Postgres Executor
4=====================
5
6The executor processes a tree of "plan nodes". The plan tree is essentially
7a demand-pull pipeline of tuple processing operations. Each node, when
8called, will produce the next tuple in its output sequence, or NULL if no
9more tuples are available. If the node is not a primitive relation-scanning
10node, it will have child node(s) that it calls in turn to obtain input
11tuples.
12
13Refinements on this basic model include:
14
15* Choice of scan direction (forwards or backwards). Caution: this is not
16currently well-supported. It works for primitive scan nodes, but not very
17well for joins, aggregates, etc.
18
19* Rescan command to reset a node and make it generate its output sequence
20over again.
21
22* Parameters that can alter a node's results. After adjusting a parameter,
23the rescan command must be applied to that node and all nodes above it.
24There is a moderately intelligent scheme to avoid rescanning nodes
25unnecessarily (for example, Sort does not rescan its input if no parameters
26of the input have changed, since it can just reread its stored sorted data).
27
28For a SELECT, it is only necessary to deliver the top-level result tuples
29to the client. For INSERT/UPDATE/DELETE, the actual table modification
30operations happen in a top-level ModifyTable plan node. If the query
31includes a RETURNING clause, the ModifyTable node delivers the computed
32RETURNING rows as output, otherwise it returns nothing. Handling INSERT
33is pretty straightforward: the tuples returned from the plan tree below
34ModifyTable are inserted into the correct result relation. For UPDATE,
35the plan tree returns the computed tuples to be updated, plus a "junk"
36(hidden) CTID column identifying which table row is to be replaced by each
37one. For DELETE, the plan tree need only deliver a CTID column, and the
38ModifyTable node visits each of those rows and marks the row deleted.
39
40XXX a great deal more documentation needs to be written here...
41
42
43Plan Trees and State Trees
44--------------------------
45
46The plan tree delivered by the planner contains a tree of Plan nodes (struct
47types derived from struct Plan). During executor startup we build a parallel
48tree of identical structure containing executor state nodes --- generally,
49every plan node type has a corresponding executor state node type. Each node
50in the state tree has a pointer to its corresponding node in the plan tree,
51plus executor state data as needed to implement that node type. This
52arrangement allows the plan tree to be completely read-only so far as the
53executor is concerned: all data that is modified during execution is in the
54state tree. Read-only plan trees make life much simpler for plan caching and
55reuse.
56
57A corresponding executor state node may not be created during executor startup
58if the executor determines that an entire subplan is not required due to
59execution time partition pruning determining that no matching records will be
60found there. This currently only occurs for Append and MergeAppend nodes. In
61this case the non-required subplans are ignored and the executor state's
62subnode array will become out of sequence to the plan's subplan list.
63
64Each Plan node may have expression trees associated with it, to represent
65its target list, qualification conditions, etc. These trees are also
66read-only to the executor, but the executor state for expression evaluation
67does not mirror the Plan expression's tree shape, as explained below.
68Rather, there's just one ExprState node per expression tree, although this
69may have sub-nodes for some complex expression node types.
70
71Altogether there are four classes of nodes used in these trees: Plan nodes,
72their corresponding PlanState nodes, Expr nodes, and ExprState nodes.
73(Actually, there are also List nodes, which are used as "glue" in all
74three tree-based representations.)
75
76
77Expression Trees and ExprState nodes
78------------------------------------
79
80Expression trees, in contrast to Plan trees, are not mirrored into a
81corresponding tree of state nodes. Instead each separately executable
82expression tree (e.g. a Plan's qual or targetlist) is represented by one
83ExprState node. The ExprState node contains the information needed to
84evaluate the expression in a compact, linear form. That compact form is
85stored as a flat array in ExprState->steps[] (an array of ExprEvalStep,
86not ExprEvalStep *).
87
88The reasons for choosing such a representation include:
89- commonly the amount of work needed to evaluate one Expr-type node is
90 small enough that the overhead of having to perform a tree-walk
91 during evaluation is significant.
92- the flat representation can be evaluated non-recursively within a single
93 function, reducing stack depth and function call overhead.
94- such a representation is usable both for fast interpreted execution,
95 and for compiling into native code.
96
97The Plan-tree representation of an expression is compiled into an
98ExprState node by ExecInitExpr(). As much complexity as possible should
99be handled by ExecInitExpr() (and helpers), instead of execution time
100where both interpreted and compiled versions would need to deal with the
101complexity. Besides duplicating effort between execution approaches,
102runtime initialization checks also have a small but noticeable cost every
103time the expression is evaluated. Therefore, we allow ExecInitExpr() to
104precompute information that we do not expect to vary across execution of a
105single query, for example the set of CHECK constraint expressions to be
106applied to a domain type. This could not be done at plan time without
107greatly increasing the number of events that require plan invalidation.
108(Previously, some information of this kind was rechecked on each
109expression evaluation, but that seems like unnecessary overhead.)
110
111
112Expression Initialization
113-------------------------
114
115During ExecInitExpr() and similar routines, Expr trees are converted
116into the flat representation. Each Expr node might be represented by
117zero, one, or more ExprEvalSteps.
118
119Each ExprEvalStep's work is determined by its opcode (of enum ExprEvalOp)
120and it stores the result of its work into the Datum variable and boolean
121null flag variable pointed to by ExprEvalStep->resvalue/resnull.
122Complex expressions are performed by chaining together several steps.
123For example, "a + b" (one OpExpr, with two Var expressions) would be
124represented as two steps to fetch the Var values, and one step for the
125evaluation of the function underlying the + operator. The steps for the
126Vars would have their resvalue/resnull pointing directly to the appropriate
127args[].value .isnull elements in the FunctionCallInfoBaseData struct that
128is used by the function evaluation step, thus avoiding extra work to copy
129the result values around.
130
131The last entry in a completed ExprState->steps array is always an
132EEOP_DONE step; this removes the need to test for end-of-array while
133iterating. Also, if the expression contains any variable references (to
134user columns of the ExprContext's INNER, OUTER, or SCAN tuples), the steps
135array begins with EEOP_*_FETCHSOME steps that ensure that the relevant
136tuples have been deconstructed to make the required columns directly
137available (cf. slot_getsomeattrs()). This allows individual Var-fetching
138steps to be little more than an array lookup.
139
140Most of ExecInitExpr()'s work is done by the recursive function
141ExecInitExprRec() and its subroutines. ExecInitExprRec() maps one Expr
142node into the steps required for execution, recursing as needed for
143sub-expressions.
144
145Each ExecInitExprRec() call has to specify where that subexpression's
146results are to be stored (via the resv/resnull parameters). This allows
147the above scenario of evaluating a (sub-)expression directly into
148fcinfo->args[].value/isnull, but also requires some care: target Datum/isnull
149variables may not be shared with another ExecInitExprRec() unless the
150results are only needed by steps executing before further usages of those
151target Datum/isnull variables. Due to the non-recursiveness of the
152ExprEvalStep representation that's usually easy to guarantee.
153
154ExecInitExprRec() pushes new operations into the ExprState->steps array
155using ExprEvalPushStep(). To keep the steps as a consecutively laid out
156array, ExprEvalPushStep() has to repalloc the entire array when there's
157not enough space. Because of that it is *not* allowed to point directly
158into any of the steps during expression initialization. Therefore, the
159resv/resnull for a subexpression usually point to some storage that is
160palloc'd separately from the steps array. For instance, the
161FunctionCallInfoBaseData for a function call step is separately allocated
162rather than being part of the ExprEvalStep array. The overall result
163of a complete expression is typically returned into the resvalue/resnull
164fields of the ExprState node itself.
165
166Some steps, e.g. boolean expressions, allow skipping evaluation of
167certain subexpressions. In the flat representation this amounts to
168jumping to some later step rather than just continuing consecutively
169with the next step. The target for such a jump is represented by
170the integer index in the ExprState->steps array of the step to execute
171next. (Compare the EEO_NEXT and EEO_JUMP macros in execExprInterp.c.)
172
173Typically, ExecInitExprRec() has to push a jumping step into the steps
174array, then recursively generate steps for the subexpression that might
175get skipped over, then go back and fix up the jump target index using
176the now-known length of the subexpression's steps. This is handled by
177adjust_jumps lists in execExpr.c.
178
179The last step in constructing an ExprState is to apply ExecReadyExpr(),
180which readies it for execution using whichever execution method has been
181selected.
182
183
184Expression Evaluation
185---------------------
186
187To allow for different methods of expression evaluation, and for
188better branch/jump target prediction, expressions are evaluated by
189calling ExprState->evalfunc (via ExprEvalExpr() and friends).
190
191ExprReadyExpr() can choose the method of interpretation by setting
192evalfunc to an appropriate function. The default execution function,
193ExecInterpExpr, is implemented in execExprInterp.c; see its header
194comment for details. Special-case evalfuncs are used for certain
195especially-simple expressions.
196
197Note that a lot of the more complex expression evaluation steps, which are
198less performance-critical than the simpler ones, are implemented as
199separate functions outside the fast-path of expression execution, allowing
200their implementation to be shared between interpreted and compiled
201expression evaluation. This means that these helper functions are not
202allowed to perform expression step dispatch themselves, as the method of
203dispatch will vary based on the caller. The helpers therefore cannot call
204for the execution of subexpressions; all subexpression results they need
205must be computed by earlier steps. And dispatch to the following
206expression step must be performed after returning from the helper.
207
208
209Targetlist Evaluation
210---------------------
211
212ExecBuildProjectionInfo builds an ExprState that has the effect of
213evaluating a targetlist into ExprState->resultslot. A generic targetlist
214expression is executed by evaluating it as discussed above (storing the
215result into the ExprState's resvalue/resnull fields) and then using an
216EEOP_ASSIGN_TMP step to move the result into the appropriate tts_values[]
217and tts_isnull[] array elements of the result slot. There are special
218fast-path step types (EEOP_ASSIGN_*_VAR) to handle targetlist entries that
219are simple Vars using only one step instead of two.
220
221
222Memory Management
223-----------------
224
225A "per query" memory context is created during CreateExecutorState();
226all storage allocated during an executor invocation is allocated in that
227context or a child context. This allows easy reclamation of storage
228during executor shutdown --- rather than messing with retail pfree's and
229probable storage leaks, we just destroy the memory context.
230
231In particular, the plan state trees and expression state trees described
232in the previous section are allocated in the per-query memory context.
233
234To avoid intra-query memory leaks, most processing while a query runs
235is done in "per tuple" memory contexts, which are so-called because they
236are typically reset to empty once per tuple. Per-tuple contexts are usually
237associated with ExprContexts, and commonly each PlanState node has its own
238ExprContext to evaluate its qual and targetlist expressions in.
239
240
241Query Processing Control Flow
242-----------------------------
243
244This is a sketch of control flow for full query processing:
245
246 CreateQueryDesc
247
248 ExecutorStart
249 CreateExecutorState
250 creates per-query context
251 switch to per-query context to run ExecInitNode
252 AfterTriggerBeginQuery
253 ExecInitNode --- recursively scans plan tree
254 ExecInitNode
255 recurse into subsidiary nodes
256 CreateExprContext
257 creates per-tuple context
258 ExecInitExpr
259
260 ExecutorRun
261 ExecProcNode --- recursively called in per-query context
262 ExecEvalExpr --- called in per-tuple context
263 ResetExprContext --- to free memory
264
265 ExecutorFinish
266 ExecPostprocessPlan --- run any unfinished ModifyTable nodes
267 AfterTriggerEndQuery
268
269 ExecutorEnd
270 ExecEndNode --- recursively releases resources
271 FreeExecutorState
272 frees per-query context and child contexts
273
274 FreeQueryDesc
275
276Per above comments, it's not really critical for ExecEndNode to free any
277memory; it'll all go away in FreeExecutorState anyway. However, we do need to
278be careful to close relations, drop buffer pins, etc, so we do need to scan
279the plan state tree to find these sorts of resources.
280
281
282The executor can also be used to evaluate simple expressions without any Plan
283tree ("simple" meaning "no aggregates and no sub-selects", though such might
284be hidden inside function calls). This case has a flow of control like
285
286 CreateExecutorState
287 creates per-query context
288
289 CreateExprContext -- or use GetPerTupleExprContext(estate)
290 creates per-tuple context
291
292 ExecPrepareExpr
293 temporarily switch to per-query context
294 run the expression through expression_planner
295 ExecInitExpr
296
297 Repeatedly do:
298 ExecEvalExprSwitchContext
299 ExecEvalExpr --- called in per-tuple context
300 ResetExprContext --- to free memory
301
302 FreeExecutorState
303 frees per-query context, as well as ExprContext
304 (a separate FreeExprContext call is not necessary)
305
306
307EvalPlanQual (READ COMMITTED Update Checking)
308---------------------------------------------
309
310For simple SELECTs, the executor need only pay attention to tuples that are
311valid according to the snapshot seen by the current transaction (ie, they
312were inserted by a previously committed transaction, and not deleted by any
313previously committed transaction). However, for UPDATE and DELETE it is not
314cool to modify or delete a tuple that's been modified by an open or
315concurrently-committed transaction. If we are running in SERIALIZABLE
316isolation level then we just raise an error when this condition is seen to
317occur. In READ COMMITTED isolation level, we must work a lot harder.
318
319The basic idea in READ COMMITTED mode is to take the modified tuple
320committed by the concurrent transaction (after waiting for it to commit,
321if need be) and re-evaluate the query qualifications to see if it would
322still meet the quals. If so, we regenerate the updated tuple (if we are
323doing an UPDATE) from the modified tuple, and finally update/delete the
324modified tuple. SELECT FOR UPDATE/SHARE behaves similarly, except that its
325action is just to lock the modified tuple and return results based on that
326version of the tuple.
327
328To implement this checking, we actually re-run the query from scratch for
329each modified tuple (or set of tuples, for SELECT FOR UPDATE), with the
330relation scan nodes tweaked to return only the current tuples --- either
331the original ones, or the updated (and now locked) versions of the modified
332tuple(s). If this query returns a tuple, then the modified tuple(s) pass
333the quals (and the query output is the suitably modified update tuple, if
334we're doing UPDATE). If no tuple is returned, then the modified tuple(s)
335fail the quals, so we ignore the current result tuple and continue the
336original query.
337
338In UPDATE/DELETE, only the target relation needs to be handled this way.
339In SELECT FOR UPDATE, there may be multiple relations flagged FOR UPDATE,
340so we obtain lock on the current tuple version in each such relation before
341executing the recheck.
342
343It is also possible that there are relations in the query that are not
344to be locked (they are neither the UPDATE/DELETE target nor specified to
345be locked in SELECT FOR UPDATE/SHARE). When re-running the test query
346we want to use the same rows from these relations that were joined to
347the locked rows. For ordinary relations this can be implemented relatively
348cheaply by including the row TID in the join outputs and re-fetching that
349TID. (The re-fetch is expensive, but we're trying to optimize the normal
350case where no re-test is needed.) We have also to consider non-table
351relations, such as a ValuesScan or FunctionScan. For these, since there
352is no equivalent of TID, the only practical solution seems to be to include
353the entire row value in the join output row.
354
355We disallow set-returning functions in the targetlist of SELECT FOR UPDATE,
356so as to ensure that at most one tuple can be returned for any particular
357set of scan tuples. Otherwise we'd get duplicates due to the original
358query returning the same set of scan tuples multiple times. Likewise,
359SRFs are disallowed in an UPDATE's targetlist. There, they would have the
360effect of the same row being updated multiple times, which is not very
361useful --- and updates after the first would have no effect anyway.
362