• Home
  • History
  • Annotate
Name Date Size #Lines LOC

..03-May-2022-

MakefileH A D08-Nov-20211.4 KiB3522

READMEH A D08-Nov-202116.8 KiB352284

execAmi.cH A D08-Nov-202114.7 KiB601390

execCurrent.cH A D08-Nov-202112.8 KiB423224

execExpr.cH A D08-Nov-202179.9 KiB2,7301,635

execExprInterp.cH A D08-Nov-202195.1 KiB3,5952,261

execGrouping.cH A D08-Nov-202116.2 KiB561289

execIndexing.cH A D08-Nov-202128.6 KiB902446

execJunk.cH A D08-Nov-20218.1 KiB320149

execMain.cH A D08-Nov-2021105.9 KiB3,5641,942

execParallel.cH A D08-Nov-202131.1 KiB1,018595

execProcnode.cH A D08-Nov-202119.7 KiB773447

execReplication.cH A D08-Nov-202115.3 KiB591358

execSRF.cH A D08-Nov-202126.3 KiB931561

execScan.cH A D08-Nov-202110.2 KiB372167

execTuples.cH A D08-Nov-202136.9 KiB1,315548

execUtils.cH A D08-Nov-202127.5 KiB1,028463

functions.cH A D08-Nov-202156.2 KiB1,9421,135

instrument.cH A D08-Nov-20216.4 KiB229160

nodeAgg.cH A D08-Nov-2021133.5 KiB4,2852,247

nodeAppend.cH A D08-Nov-20217.6 KiB304106

nodeBitmapAnd.cH A D08-Nov-20215.5 KiB22496

nodeBitmapHeapscan.cH A D08-Nov-202127.9 KiB1,039573

nodeBitmapIndexscan.cH A D08-Nov-20219.6 KiB337148

nodeBitmapOr.cH A D08-Nov-20215.9 KiB242111

nodeCtescan.cH A D08-Nov-20219.9 KiB355144

nodeCustom.cH A D08-Nov-20216.1 KiB235152

nodeForeignscan.cH A D08-Nov-202111 KiB391184

nodeFunctionscan.cH A D08-Nov-202116.5 KiB620314

nodeGather.cH A D08-Nov-202113.5 KiB461225

nodeGatherMerge.cH A D08-Nov-202121.6 KiB766403

nodeGroup.cH A D08-Nov-20216.1 KiB260105

nodeHash.cH A D08-Nov-202149.7 KiB1,729893

nodeHashjoin.cH A D08-Nov-202127.2 KiB965490

nodeIndexonlyscan.cH A D08-Nov-202123.4 KiB744332

nodeIndexscan.cH A D08-Nov-202152 KiB1,7851,027

nodeLimit.cH A D08-Nov-202111.6 KiB461229

nodeLockRows.cH A D08-Nov-202112.5 KiB468246

nodeMaterial.cH A D08-Nov-20219.8 KiB373148

nodeMergeAppend.cH A D08-Nov-20218.5 KiB326150

nodeMergejoin.cH A D08-Nov-202148.5 KiB1,680778

nodeModifyTable.cH A D08-Nov-202175.1 KiB2,4331,256

nodeNamedtuplestorescan.cH A D08-Nov-20215.5 KiB20578

nodeNestloop.cH A D08-Nov-202110.7 KiB416179

nodeProjectSet.cH A D08-Nov-20218.5 KiB335145

nodeRecursiveunion.cH A D08-Nov-20219 KiB333160

nodeResult.cH A D08-Nov-20216.8 KiB278101

nodeSamplescan.cH A D08-Nov-202115 KiB596324

nodeSeqscan.cH A D08-Nov-20219.5 KiB358141

nodeSetOp.cH A D08-Nov-202117.1 KiB651347

nodeSort.cH A D08-Nov-20218.3 KiB337140

nodeSubplan.cH A D08-Nov-202138.2 KiB1,282720

nodeSubqueryscan.cH A D08-Nov-20215.4 KiB20764

nodeTableFuncscan.cH A D08-Nov-202113.9 KiB527279

nodeTidscan.cH A D08-Nov-202114.6 KiB578327

nodeUnique.cH A D08-Nov-20215.3 KiB20270

nodeValuesscan.cH A D08-Nov-20219.3 KiB348143

nodeWindowAgg.cH A D08-Nov-202184.9 KiB2,8131,680

nodeWorktablescan.cH A D08-Nov-20216.1 KiB21777

spi.cH A D08-Nov-202169.9 KiB2,8181,765

tqueue.cH A D08-Nov-202137.9 KiB1,284772

tstoreReceiver.cH A D08-Nov-20215.4 KiB222132

README

1src/backend/executor/README
2
3The Postgres Executor
4=====================
5
6The executor processes a tree of "plan nodes".  The plan tree is essentially
7a demand-pull pipeline of tuple processing operations.  Each node, when
8called, will produce the next tuple in its output sequence, or NULL if no
9more tuples are available.  If the node is not a primitive relation-scanning
10node, it will have child node(s) that it calls in turn to obtain input
11tuples.
12
13Refinements on this basic model include:
14
15* Choice of scan direction (forwards or backwards).  Caution: this is not
16currently well-supported.  It works for primitive scan nodes, but not very
17well for joins, aggregates, etc.
18
19* Rescan command to reset a node and make it generate its output sequence
20over again.
21
22* Parameters that can alter a node's results.  After adjusting a parameter,
23the rescan command must be applied to that node and all nodes above it.
24There is a moderately intelligent scheme to avoid rescanning nodes
25unnecessarily (for example, Sort does not rescan its input if no parameters
26of the input have changed, since it can just reread its stored sorted data).
27
28For a SELECT, it is only necessary to deliver the top-level result tuples
29to the client.  For INSERT/UPDATE/DELETE, the actual table modification
30operations happen in a top-level ModifyTable plan node.  If the query
31includes a RETURNING clause, the ModifyTable node delivers the computed
32RETURNING rows as output, otherwise it returns nothing.  Handling INSERT
33is pretty straightforward: the tuples returned from the plan tree below
34ModifyTable are inserted into the correct result relation.  For UPDATE,
35the plan tree returns the computed tuples to be updated, plus a "junk"
36(hidden) CTID column identifying which table row is to be replaced by each
37one.  For DELETE, the plan tree need only deliver a CTID column, and the
38ModifyTable node visits each of those rows and marks the row deleted.
39
40XXX a great deal more documentation needs to be written here...
41
42
43Plan Trees and State Trees
44--------------------------
45
46The plan tree delivered by the planner contains a tree of Plan nodes (struct
47types derived from struct Plan).  During executor startup we build a parallel
48tree of identical structure containing executor state nodes --- every plan
49node type has a corresponding executor state node type.  Each node in the
50state tree has a pointer to its corresponding node in the plan tree, plus
51executor state data as needed to implement that node type.  This arrangement
52allows the plan tree to be completely read-only so far as the executor is
53concerned: all data that is modified during execution is in the state tree.
54Read-only plan trees make life much simpler for plan caching and reuse.
55
56Each Plan node may have expression trees associated with it, to represent
57its target list, qualification conditions, etc.  These trees are also
58read-only to the executor, but the executor state for expression evaluation
59does not mirror the Plan expression's tree shape, as explained below.
60Rather, there's just one ExprState node per expression tree, although this
61may have sub-nodes for some complex expression node types.
62
63Altogether there are four classes of nodes used in these trees: Plan nodes,
64their corresponding PlanState nodes, Expr nodes, and ExprState nodes.
65(Actually, there are also List nodes, which are used as "glue" in all
66three tree-based representations.)
67
68
69Expression Trees and ExprState nodes
70------------------------------------
71
72Expression trees, in contrast to Plan trees, are not mirrored into a
73corresponding tree of state nodes.  Instead each separately executable
74expression tree (e.g. a Plan's qual or targetlist) is represented by one
75ExprState node.  The ExprState node contains the information needed to
76evaluate the expression in a compact, linear form.  That compact form is
77stored as a flat array in ExprState->steps[] (an array of ExprEvalStep,
78not ExprEvalStep *).
79
80The reasons for choosing such a representation include:
81- commonly the amount of work needed to evaluate one Expr-type node is
82  small enough that the overhead of having to perform a tree-walk
83  during evaluation is significant.
84- the flat representation can be evaluated non-recursively within a single
85  function, reducing stack depth and function call overhead.
86- such a representation is usable both for fast interpreted execution,
87  and for compiling into native code.
88
89The Plan-tree representation of an expression is compiled into an
90ExprState node by ExecInitExpr().  As much complexity as possible should
91be handled by ExecInitExpr() (and helpers), instead of execution time
92where both interpreted and compiled versions would need to deal with the
93complexity.  Besides duplicating effort between execution approaches,
94runtime initialization checks also have a small but noticeable cost every
95time the expression is evaluated.  Therefore, we allow ExecInitExpr() to
96precompute information that we do not expect to vary across execution of a
97single query, for example the set of CHECK constraint expressions to be
98applied to a domain type.  This could not be done at plan time without
99greatly increasing the number of events that require plan invalidation.
100(Previously, some information of this kind was rechecked on each
101expression evaluation, but that seems like unnecessary overhead.)
102
103
104Expression Initialization
105-------------------------
106
107During ExecInitExpr() and similar routines, Expr trees are converted
108into the flat representation.  Each Expr node might be represented by
109zero, one, or more ExprEvalSteps.
110
111Each ExprEvalStep's work is determined by its opcode (of enum ExprEvalOp)
112and it stores the result of its work into the Datum variable and boolean
113null flag variable pointed to by ExprEvalStep->resvalue/resnull.
114Complex expressions are performed by chaining together several steps.
115For example, "a + b" (one OpExpr, with two Var expressions) would be
116represented as two steps to fetch the Var values, and one step for the
117evaluation of the function underlying the + operator.  The steps for the
118Vars would have their resvalue/resnull pointing directly to the appropriate
119arg[] and argnull[] array elements in the FunctionCallInfoData struct that
120is used by the function evaluation step, thus avoiding extra work to copy
121the result values around.
122
123The last entry in a completed ExprState->steps array is always an
124EEOP_DONE step; this removes the need to test for end-of-array while
125iterating.  Also, if the expression contains any variable references (to
126user columns of the ExprContext's INNER, OUTER, or SCAN tuples), the steps
127array begins with EEOP_*_FETCHSOME steps that ensure that the relevant
128tuples have been deconstructed to make the required columns directly
129available (cf. slot_getsomeattrs()).  This allows individual Var-fetching
130steps to be little more than an array lookup.
131
132Most of ExecInitExpr()'s work is done by the recursive function
133ExecInitExprRec() and its subroutines.  ExecInitExprRec() maps one Expr
134node into the steps required for execution, recursing as needed for
135sub-expressions.
136
137Each ExecInitExprRec() call has to specify where that subexpression's
138results are to be stored (via the resv/resnull parameters).  This allows
139the above scenario of evaluating a (sub-)expression directly into
140fcinfo->arg/argnull, but also requires some care: target Datum/isnull
141variables may not be shared with another ExecInitExprRec() unless the
142results are only needed by steps executing before further usages of those
143target Datum/isnull variables.  Due to the non-recursiveness of the
144ExprEvalStep representation that's usually easy to guarantee.
145
146ExecInitExprRec() pushes new operations into the ExprState->steps array
147using ExprEvalPushStep().  To keep the steps as a consecutively laid out
148array, ExprEvalPushStep() has to repalloc the entire array when there's
149not enough space.  Because of that it is *not* allowed to point directly
150into any of the steps during expression initialization.  Therefore, the
151resv/resnull for a subexpression usually point to some storage that is
152palloc'd separately from the steps array.  For instance, the
153FunctionCallInfoData for a function call step is separately allocated
154rather than being part of the ExprEvalStep array.  The overall result
155of a complete expression is typically returned into the resvalue/resnull
156fields of the ExprState node itself.
157
158Some steps, e.g. boolean expressions, allow skipping evaluation of
159certain subexpressions.  In the flat representation this amounts to
160jumping to some later step rather than just continuing consecutively
161with the next step.  The target for such a jump is represented by
162the integer index in the ExprState->steps array of the step to execute
163next.  (Compare the EEO_NEXT and EEO_JUMP macros in execExprInterp.c.)
164
165Typically, ExecInitExprRec() has to push a jumping step into the steps
166array, then recursively generate steps for the subexpression that might
167get skipped over, then go back and fix up the jump target index using
168the now-known length of the subexpression's steps.  This is handled by
169adjust_jumps lists in execExpr.c.
170
171The last step in constructing an ExprState is to apply ExecReadyExpr(),
172which readies it for execution using whichever execution method has been
173selected.
174
175
176Expression Evaluation
177---------------------
178
179To allow for different methods of expression evaluation, and for
180better branch/jump target prediction, expressions are evaluated by
181calling ExprState->evalfunc (via ExprEvalExpr() and friends).
182
183ExprReadyExpr() can choose the method of interpretation by setting
184evalfunc to an appropriate function.  The default execution function,
185ExecInterpExpr, is implemented in execExprInterp.c; see its header
186comment for details.  Special-case evalfuncs are used for certain
187especially-simple expressions.
188
189Note that a lot of the more complex expression evaluation steps, which are
190less performance-critical than the simpler ones, are implemented as
191separate functions outside the fast-path of expression execution, allowing
192their implementation to be shared between interpreted and compiled
193expression evaluation.  This means that these helper functions are not
194allowed to perform expression step dispatch themselves, as the method of
195dispatch will vary based on the caller.  The helpers therefore cannot call
196for the execution of subexpressions; all subexpression results they need
197must be computed by earlier steps.  And dispatch to the following
198expression step must be performed after returning from the helper.
199
200
201Targetlist Evaluation
202---------------------
203
204ExecBuildProjectionInfo builds an ExprState that has the effect of
205evaluating a targetlist into ExprState->resultslot.  A generic targetlist
206expression is executed by evaluating it as discussed above (storing the
207result into the ExprState's resvalue/resnull fields) and then using an
208EEOP_ASSIGN_TMP step to move the result into the appropriate tts_values[]
209and tts_isnull[] array elements of the result slot.  There are special
210fast-path step types (EEOP_ASSIGN_*_VAR) to handle targetlist entries that
211are simple Vars using only one step instead of two.
212
213
214Memory Management
215-----------------
216
217A "per query" memory context is created during CreateExecutorState();
218all storage allocated during an executor invocation is allocated in that
219context or a child context.  This allows easy reclamation of storage
220during executor shutdown --- rather than messing with retail pfree's and
221probable storage leaks, we just destroy the memory context.
222
223In particular, the plan state trees and expression state trees described
224in the previous section are allocated in the per-query memory context.
225
226To avoid intra-query memory leaks, most processing while a query runs
227is done in "per tuple" memory contexts, which are so-called because they
228are typically reset to empty once per tuple.  Per-tuple contexts are usually
229associated with ExprContexts, and commonly each PlanState node has its own
230ExprContext to evaluate its qual and targetlist expressions in.
231
232
233Query Processing Control Flow
234-----------------------------
235
236This is a sketch of control flow for full query processing:
237
238	CreateQueryDesc
239
240	ExecutorStart
241		CreateExecutorState
242			creates per-query context
243		switch to per-query context to run ExecInitNode
244		AfterTriggerBeginQuery
245		ExecInitNode --- recursively scans plan tree
246			CreateExprContext
247				creates per-tuple context
248			ExecInitExpr
249
250	ExecutorRun
251		ExecProcNode --- recursively called in per-query context
252			ExecEvalExpr --- called in per-tuple context
253			ResetExprContext --- to free memory
254
255	ExecutorFinish
256		ExecPostprocessPlan --- run any unfinished ModifyTable nodes
257		AfterTriggerEndQuery
258
259	ExecutorEnd
260		ExecEndNode --- recursively releases resources
261		FreeExecutorState
262			frees per-query context and child contexts
263
264	FreeQueryDesc
265
266Per above comments, it's not really critical for ExecEndNode to free any
267memory; it'll all go away in FreeExecutorState anyway.  However, we do need to
268be careful to close relations, drop buffer pins, etc, so we do need to scan
269the plan state tree to find these sorts of resources.
270
271
272The executor can also be used to evaluate simple expressions without any Plan
273tree ("simple" meaning "no aggregates and no sub-selects", though such might
274be hidden inside function calls).  This case has a flow of control like
275
276	CreateExecutorState
277		creates per-query context
278
279	CreateExprContext	-- or use GetPerTupleExprContext(estate)
280		creates per-tuple context
281
282	ExecPrepareExpr
283		temporarily switch to per-query context
284		run the expression through expression_planner
285		ExecInitExpr
286
287	Repeatedly do:
288		ExecEvalExprSwitchContext
289			ExecEvalExpr --- called in per-tuple context
290		ResetExprContext --- to free memory
291
292	FreeExecutorState
293		frees per-query context, as well as ExprContext
294		(a separate FreeExprContext call is not necessary)
295
296
297EvalPlanQual (READ COMMITTED Update Checking)
298---------------------------------------------
299
300For simple SELECTs, the executor need only pay attention to tuples that are
301valid according to the snapshot seen by the current transaction (ie, they
302were inserted by a previously committed transaction, and not deleted by any
303previously committed transaction).  However, for UPDATE and DELETE it is not
304cool to modify or delete a tuple that's been modified by an open or
305concurrently-committed transaction.  If we are running in SERIALIZABLE
306isolation level then we just raise an error when this condition is seen to
307occur.  In READ COMMITTED isolation level, we must work a lot harder.
308
309The basic idea in READ COMMITTED mode is to take the modified tuple
310committed by the concurrent transaction (after waiting for it to commit,
311if need be) and re-evaluate the query qualifications to see if it would
312still meet the quals.  If so, we regenerate the updated tuple (if we are
313doing an UPDATE) from the modified tuple, and finally update/delete the
314modified tuple.  SELECT FOR UPDATE/SHARE behaves similarly, except that its
315action is just to lock the modified tuple and return results based on that
316version of the tuple.
317
318To implement this checking, we actually re-run the query from scratch for
319each modified tuple (or set of tuples, for SELECT FOR UPDATE), with the
320relation scan nodes tweaked to return only the current tuples --- either
321the original ones, or the updated (and now locked) versions of the modified
322tuple(s).  If this query returns a tuple, then the modified tuple(s) pass
323the quals (and the query output is the suitably modified update tuple, if
324we're doing UPDATE).  If no tuple is returned, then the modified tuple(s)
325fail the quals, so we ignore the current result tuple and continue the
326original query.
327
328In UPDATE/DELETE, only the target relation needs to be handled this way.
329In SELECT FOR UPDATE, there may be multiple relations flagged FOR UPDATE,
330so we obtain lock on the current tuple version in each such relation before
331executing the recheck.
332
333It is also possible that there are relations in the query that are not
334to be locked (they are neither the UPDATE/DELETE target nor specified to
335be locked in SELECT FOR UPDATE/SHARE).  When re-running the test query
336we want to use the same rows from these relations that were joined to
337the locked rows.  For ordinary relations this can be implemented relatively
338cheaply by including the row TID in the join outputs and re-fetching that
339TID.  (The re-fetch is expensive, but we're trying to optimize the normal
340case where no re-test is needed.)  We have also to consider non-table
341relations, such as a ValuesScan or FunctionScan.  For these, since there
342is no equivalent of TID, the only practical solution seems to be to include
343the entire row value in the join output row.
344
345We disallow set-returning functions in the targetlist of SELECT FOR UPDATE,
346so as to ensure that at most one tuple can be returned for any particular
347set of scan tuples.  Otherwise we'd get duplicates due to the original
348query returning the same set of scan tuples multiple times.  Likewise,
349SRFs are disallowed in an UPDATE's targetlist.  There, they would have the
350effect of the same row being updated multiple times, which is not very
351useful --- and updates after the first would have no effect anyway.
352