• Home
  • History
  • Annotate
Name Date Size #Lines LOC

..03-May-2022-

MakefileH A D08-Nov-20211.5 KiB8370

READMEH A D08-Nov-202119.8 KiB406329

execAmi.cH A D08-Nov-202116.5 KiB663423

execAsync.cH A D08-Nov-20214 KiB15588

execCurrent.cH A D08-Nov-202112.9 KiB427228

execExpr.cH A D08-Nov-2021116.5 KiB3,9662,433

execExprInterp.cH A D08-Nov-2021118.2 KiB4,3762,772

execGrouping.cH A D08-Nov-202117 KiB561298

execIndexing.cH A D08-Nov-202133.1 KiB1,045514

execJunk.cH A D08-Nov-20217.7 KiB305143

execMain.cH A D08-Nov-202184.8 KiB2,8871,548

execParallel.cH A D08-Nov-202146.3 KiB1,499939

execPartition.cH A D08-Nov-202165.7 KiB2,1081,099

execProcnode.cH A D08-Nov-202126.2 KiB982550

execReplication.cH A D08-Nov-202116.7 KiB630401

execSRF.cH A D08-Nov-202128.1 KiB981583

execScan.cH A D08-Nov-20219.1 KiB343141

execTuples.cH A D08-Nov-202162.3 KiB2,3511,318

execUtils.cH A D08-Nov-202136.7 KiB1,352680

functions.cH A D08-Nov-202160.7 KiB2,1041,214

instrument.cH A D08-Nov-20218 KiB280198

nodeAgg.cH A D08-Nov-2021148.2 KiB4,8302,663

nodeAppend.cH A D08-Nov-202132.8 KiB1,187595

nodeBitmapAnd.cH A D08-Nov-20215.5 KiB22496

nodeBitmapHeapscan.cH A D08-Nov-202126.3 KiB955531

nodeBitmapIndexscan.cH A D08-Nov-20219.4 KiB331148

nodeBitmapOr.cH A D08-Nov-20215.9 KiB242111

nodeCtescan.cH A D08-Nov-20219.8 KiB352145

nodeCustom.cH A D08-Nov-20216 KiB229151

nodeForeignscan.cH A D08-Nov-202114.1 KiB486232

nodeFunctionscan.cH A D08-Nov-202116.5 KiB621316

nodeGather.cH A D08-Nov-202113.9 KiB478230

nodeGatherMerge.cH A D08-Nov-202122.2 KiB790408

nodeGroup.cH A D08-Nov-20216.1 KiB256105

nodeHash.cH A D08-Nov-2021103.4 KiB3,4351,907

nodeHashjoin.cH A D08-Nov-202146.8 KiB1,552782

nodeIncrementalSort.cH A D08-Nov-202142 KiB1,258536

nodeIndexonlyscan.cH A D08-Nov-202123.4 KiB738334

nodeIndexscan.cH A D08-Nov-202150.9 KiB1,7441,015

nodeLimit.cH A D08-Nov-202113.8 KiB559294

nodeLockRows.cH A D08-Nov-202110.5 KiB404212

nodeMaterial.cH A D08-Nov-20219.7 KiB369147

nodeMemoize.cH A D08-Nov-202133 KiB1,124535

nodeMergeAppend.cH A D08-Nov-202110.6 KiB390194

nodeMergejoin.cH A D08-Nov-202148.5 KiB1,679778

nodeModifyTable.cH A D08-Nov-2021100.7 KiB3,2421,684

nodeNamedtuplestorescan.cH A D08-Nov-20215.5 KiB20279

nodeNestloop.cH A D08-Nov-202110.7 KiB412179

nodeProjectSet.cH A D08-Nov-20219.3 KiB352149

nodeRecursiveunion.cH A D08-Nov-20219 KiB332163

nodeResult.cH A D08-Nov-20216.7 KiB273100

nodeSamplescan.cH A D08-Nov-20219.4 KiB379193

nodeSeqscan.cH A D08-Nov-20218.3 KiB315129

nodeSetOp.cH A D08-Nov-202117.4 KiB652355

nodeSort.cH A D08-Nov-202111.1 KiB431193

nodeSubplan.cH A D08-Nov-202139.3 KiB1,314751

nodeSubqueryscan.cH A D08-Nov-20215.8 KiB21470

nodeTableFuncscan.cH A D08-Nov-202113.9 KiB524281

nodeTidrangescan.cH A D08-Nov-202110.5 KiB414214

nodeTidscan.cH A D08-Nov-202114.1 KiB559320

nodeUnique.cH A D08-Nov-20215.1 KiB19369

nodeValuesscan.cH A D08-Nov-202110 KiB362148

nodeWindowAgg.cH A D08-Nov-2021108.4 KiB3,4642,119

nodeWorktablescan.cH A D08-Nov-20216.3 KiB22480

spi.cH A D08-Nov-202181.5 KiB3,2812,041

tqueue.cH A D08-Nov-20215.1 KiB21199

tstoreReceiver.cH A D08-Nov-20217.6 KiB284172

README

1src/backend/executor/README
2
3The Postgres Executor
4=====================
5
6The executor processes a tree of "plan nodes".  The plan tree is essentially
7a demand-pull pipeline of tuple processing operations.  Each node, when
8called, will produce the next tuple in its output sequence, or NULL if no
9more tuples are available.  If the node is not a primitive relation-scanning
10node, it will have child node(s) that it calls in turn to obtain input
11tuples.
12
13Refinements on this basic model include:
14
15* Choice of scan direction (forwards or backwards).  Caution: this is not
16currently well-supported.  It works for primitive scan nodes, but not very
17well for joins, aggregates, etc.
18
19* Rescan command to reset a node and make it generate its output sequence
20over again.
21
22* Parameters that can alter a node's results.  After adjusting a parameter,
23the rescan command must be applied to that node and all nodes above it.
24There is a moderately intelligent scheme to avoid rescanning nodes
25unnecessarily (for example, Sort does not rescan its input if no parameters
26of the input have changed, since it can just reread its stored sorted data).
27
28For a SELECT, it is only necessary to deliver the top-level result tuples
29to the client.  For INSERT/UPDATE/DELETE, the actual table modification
30operations happen in a top-level ModifyTable plan node.  If the query
31includes a RETURNING clause, the ModifyTable node delivers the computed
32RETURNING rows as output, otherwise it returns nothing.  Handling INSERT
33is pretty straightforward: the tuples returned from the plan tree below
34ModifyTable are inserted into the correct result relation.  For UPDATE,
35the plan tree returns the new values of the updated columns, plus "junk"
36(hidden) column(s) identifying which table row is to be updated.  The
37ModifyTable node must fetch that row to extract values for the unchanged
38columns, combine the values into a new row, and apply the update.  (For a
39heap table, the row-identity junk column is a CTID, but other things may
40be used for other table types.)  For DELETE, the plan tree need only deliver
41junk row-identity column(s), and the ModifyTable node visits each of those
42rows and marks the row deleted.
43
44XXX a great deal more documentation needs to be written here...
45
46
47Plan Trees and State Trees
48--------------------------
49
50The plan tree delivered by the planner contains a tree of Plan nodes (struct
51types derived from struct Plan).  During executor startup we build a parallel
52tree of identical structure containing executor state nodes --- generally,
53every plan node type has a corresponding executor state node type.  Each node
54in the state tree has a pointer to its corresponding node in the plan tree,
55plus executor state data as needed to implement that node type.  This
56arrangement allows the plan tree to be completely read-only so far as the
57executor is concerned: all data that is modified during execution is in the
58state tree.  Read-only plan trees make life much simpler for plan caching and
59reuse.
60
61A corresponding executor state node may not be created during executor startup
62if the executor determines that an entire subplan is not required due to
63execution time partition pruning determining that no matching records will be
64found there.  This currently only occurs for Append and MergeAppend nodes.  In
65this case the non-required subplans are ignored and the executor state's
66subnode array will become out of sequence to the plan's subplan list.
67
68Each Plan node may have expression trees associated with it, to represent
69its target list, qualification conditions, etc.  These trees are also
70read-only to the executor, but the executor state for expression evaluation
71does not mirror the Plan expression's tree shape, as explained below.
72Rather, there's just one ExprState node per expression tree, although this
73may have sub-nodes for some complex expression node types.
74
75Altogether there are four classes of nodes used in these trees: Plan nodes,
76their corresponding PlanState nodes, Expr nodes, and ExprState nodes.
77(Actually, there are also List nodes, which are used as "glue" in all
78three tree-based representations.)
79
80
81Expression Trees and ExprState nodes
82------------------------------------
83
84Expression trees, in contrast to Plan trees, are not mirrored into a
85corresponding tree of state nodes.  Instead each separately executable
86expression tree (e.g. a Plan's qual or targetlist) is represented by one
87ExprState node.  The ExprState node contains the information needed to
88evaluate the expression in a compact, linear form.  That compact form is
89stored as a flat array in ExprState->steps[] (an array of ExprEvalStep,
90not ExprEvalStep *).
91
92The reasons for choosing such a representation include:
93- commonly the amount of work needed to evaluate one Expr-type node is
94  small enough that the overhead of having to perform a tree-walk
95  during evaluation is significant.
96- the flat representation can be evaluated non-recursively within a single
97  function, reducing stack depth and function call overhead.
98- such a representation is usable both for fast interpreted execution,
99  and for compiling into native code.
100
101The Plan-tree representation of an expression is compiled into an
102ExprState node by ExecInitExpr().  As much complexity as possible should
103be handled by ExecInitExpr() (and helpers), instead of execution time
104where both interpreted and compiled versions would need to deal with the
105complexity.  Besides duplicating effort between execution approaches,
106runtime initialization checks also have a small but noticeable cost every
107time the expression is evaluated.  Therefore, we allow ExecInitExpr() to
108precompute information that we do not expect to vary across execution of a
109single query, for example the set of CHECK constraint expressions to be
110applied to a domain type.  This could not be done at plan time without
111greatly increasing the number of events that require plan invalidation.
112(Previously, some information of this kind was rechecked on each
113expression evaluation, but that seems like unnecessary overhead.)
114
115
116Expression Initialization
117-------------------------
118
119During ExecInitExpr() and similar routines, Expr trees are converted
120into the flat representation.  Each Expr node might be represented by
121zero, one, or more ExprEvalSteps.
122
123Each ExprEvalStep's work is determined by its opcode (of enum ExprEvalOp)
124and it stores the result of its work into the Datum variable and boolean
125null flag variable pointed to by ExprEvalStep->resvalue/resnull.
126Complex expressions are performed by chaining together several steps.
127For example, "a + b" (one OpExpr, with two Var expressions) would be
128represented as two steps to fetch the Var values, and one step for the
129evaluation of the function underlying the + operator.  The steps for the
130Vars would have their resvalue/resnull pointing directly to the appropriate
131args[].value .isnull elements in the FunctionCallInfoBaseData struct that
132is used by the function evaluation step, thus avoiding extra work to copy
133the result values around.
134
135The last entry in a completed ExprState->steps array is always an
136EEOP_DONE step; this removes the need to test for end-of-array while
137iterating.  Also, if the expression contains any variable references (to
138user columns of the ExprContext's INNER, OUTER, or SCAN tuples), the steps
139array begins with EEOP_*_FETCHSOME steps that ensure that the relevant
140tuples have been deconstructed to make the required columns directly
141available (cf. slot_getsomeattrs()).  This allows individual Var-fetching
142steps to be little more than an array lookup.
143
144Most of ExecInitExpr()'s work is done by the recursive function
145ExecInitExprRec() and its subroutines.  ExecInitExprRec() maps one Expr
146node into the steps required for execution, recursing as needed for
147sub-expressions.
148
149Each ExecInitExprRec() call has to specify where that subexpression's
150results are to be stored (via the resv/resnull parameters).  This allows
151the above scenario of evaluating a (sub-)expression directly into
152fcinfo->args[].value/isnull, but also requires some care: target Datum/isnull
153variables may not be shared with another ExecInitExprRec() unless the
154results are only needed by steps executing before further usages of those
155target Datum/isnull variables.  Due to the non-recursiveness of the
156ExprEvalStep representation that's usually easy to guarantee.
157
158ExecInitExprRec() pushes new operations into the ExprState->steps array
159using ExprEvalPushStep().  To keep the steps as a consecutively laid out
160array, ExprEvalPushStep() has to repalloc the entire array when there's
161not enough space.  Because of that it is *not* allowed to point directly
162into any of the steps during expression initialization.  Therefore, the
163resv/resnull for a subexpression usually point to some storage that is
164palloc'd separately from the steps array.  For instance, the
165FunctionCallInfoBaseData for a function call step is separately allocated
166rather than being part of the ExprEvalStep array.  The overall result
167of a complete expression is typically returned into the resvalue/resnull
168fields of the ExprState node itself.
169
170Some steps, e.g. boolean expressions, allow skipping evaluation of
171certain subexpressions.  In the flat representation this amounts to
172jumping to some later step rather than just continuing consecutively
173with the next step.  The target for such a jump is represented by
174the integer index in the ExprState->steps array of the step to execute
175next.  (Compare the EEO_NEXT and EEO_JUMP macros in execExprInterp.c.)
176
177Typically, ExecInitExprRec() has to push a jumping step into the steps
178array, then recursively generate steps for the subexpression that might
179get skipped over, then go back and fix up the jump target index using
180the now-known length of the subexpression's steps.  This is handled by
181adjust_jumps lists in execExpr.c.
182
183The last step in constructing an ExprState is to apply ExecReadyExpr(),
184which readies it for execution using whichever execution method has been
185selected.
186
187
188Expression Evaluation
189---------------------
190
191To allow for different methods of expression evaluation, and for
192better branch/jump target prediction, expressions are evaluated by
193calling ExprState->evalfunc (via ExecEvalExpr() and friends).
194
195ExecReadyExpr() can choose the method of interpretation by setting
196evalfunc to an appropriate function.  The default execution function,
197ExecInterpExpr, is implemented in execExprInterp.c; see its header
198comment for details.  Special-case evalfuncs are used for certain
199especially-simple expressions.
200
201Note that a lot of the more complex expression evaluation steps, which are
202less performance-critical than the simpler ones, are implemented as
203separate functions outside the fast-path of expression execution, allowing
204their implementation to be shared between interpreted and compiled
205expression evaluation.  This means that these helper functions are not
206allowed to perform expression step dispatch themselves, as the method of
207dispatch will vary based on the caller.  The helpers therefore cannot call
208for the execution of subexpressions; all subexpression results they need
209must be computed by earlier steps.  And dispatch to the following
210expression step must be performed after returning from the helper.
211
212
213Targetlist Evaluation
214---------------------
215
216ExecBuildProjectionInfo builds an ExprState that has the effect of
217evaluating a targetlist into ExprState->resultslot.  A generic targetlist
218expression is executed by evaluating it as discussed above (storing the
219result into the ExprState's resvalue/resnull fields) and then using an
220EEOP_ASSIGN_TMP step to move the result into the appropriate tts_values[]
221and tts_isnull[] array elements of the result slot.  There are special
222fast-path step types (EEOP_ASSIGN_*_VAR) to handle targetlist entries that
223are simple Vars using only one step instead of two.
224
225
226Memory Management
227-----------------
228
229A "per query" memory context is created during CreateExecutorState();
230all storage allocated during an executor invocation is allocated in that
231context or a child context.  This allows easy reclamation of storage
232during executor shutdown --- rather than messing with retail pfree's and
233probable storage leaks, we just destroy the memory context.
234
235In particular, the plan state trees and expression state trees described
236in the previous section are allocated in the per-query memory context.
237
238To avoid intra-query memory leaks, most processing while a query runs
239is done in "per tuple" memory contexts, which are so-called because they
240are typically reset to empty once per tuple.  Per-tuple contexts are usually
241associated with ExprContexts, and commonly each PlanState node has its own
242ExprContext to evaluate its qual and targetlist expressions in.
243
244
245Query Processing Control Flow
246-----------------------------
247
248This is a sketch of control flow for full query processing:
249
250	CreateQueryDesc
251
252	ExecutorStart
253		CreateExecutorState
254			creates per-query context
255		switch to per-query context to run ExecInitNode
256		AfterTriggerBeginQuery
257		ExecInitNode --- recursively scans plan tree
258			ExecInitNode
259				recurse into subsidiary nodes
260			CreateExprContext
261				creates per-tuple context
262			ExecInitExpr
263
264	ExecutorRun
265		ExecProcNode --- recursively called in per-query context
266			ExecEvalExpr --- called in per-tuple context
267			ResetExprContext --- to free memory
268
269	ExecutorFinish
270		ExecPostprocessPlan --- run any unfinished ModifyTable nodes
271		AfterTriggerEndQuery
272
273	ExecutorEnd
274		ExecEndNode --- recursively releases resources
275		FreeExecutorState
276			frees per-query context and child contexts
277
278	FreeQueryDesc
279
280Per above comments, it's not really critical for ExecEndNode to free any
281memory; it'll all go away in FreeExecutorState anyway.  However, we do need to
282be careful to close relations, drop buffer pins, etc, so we do need to scan
283the plan state tree to find these sorts of resources.
284
285
286The executor can also be used to evaluate simple expressions without any Plan
287tree ("simple" meaning "no aggregates and no sub-selects", though such might
288be hidden inside function calls).  This case has a flow of control like
289
290	CreateExecutorState
291		creates per-query context
292
293	CreateExprContext	-- or use GetPerTupleExprContext(estate)
294		creates per-tuple context
295
296	ExecPrepareExpr
297		temporarily switch to per-query context
298		run the expression through expression_planner
299		ExecInitExpr
300
301	Repeatedly do:
302		ExecEvalExprSwitchContext
303			ExecEvalExpr --- called in per-tuple context
304		ResetExprContext --- to free memory
305
306	FreeExecutorState
307		frees per-query context, as well as ExprContext
308		(a separate FreeExprContext call is not necessary)
309
310
311EvalPlanQual (READ COMMITTED Update Checking)
312---------------------------------------------
313
314For simple SELECTs, the executor need only pay attention to tuples that are
315valid according to the snapshot seen by the current transaction (ie, they
316were inserted by a previously committed transaction, and not deleted by any
317previously committed transaction).  However, for UPDATE and DELETE it is not
318cool to modify or delete a tuple that's been modified by an open or
319concurrently-committed transaction.  If we are running in SERIALIZABLE
320isolation level then we just raise an error when this condition is seen to
321occur.  In READ COMMITTED isolation level, we must work a lot harder.
322
323The basic idea in READ COMMITTED mode is to take the modified tuple
324committed by the concurrent transaction (after waiting for it to commit,
325if need be) and re-evaluate the query qualifications to see if it would
326still meet the quals.  If so, we regenerate the updated tuple (if we are
327doing an UPDATE) from the modified tuple, and finally update/delete the
328modified tuple.  SELECT FOR UPDATE/SHARE behaves similarly, except that its
329action is just to lock the modified tuple and return results based on that
330version of the tuple.
331
332To implement this checking, we actually re-run the query from scratch for
333each modified tuple (or set of tuples, for SELECT FOR UPDATE), with the
334relation scan nodes tweaked to return only the current tuples --- either
335the original ones, or the updated (and now locked) versions of the modified
336tuple(s).  If this query returns a tuple, then the modified tuple(s) pass
337the quals (and the query output is the suitably modified update tuple, if
338we're doing UPDATE).  If no tuple is returned, then the modified tuple(s)
339fail the quals, so we ignore the current result tuple and continue the
340original query.
341
342In UPDATE/DELETE, only the target relation needs to be handled this way.
343In SELECT FOR UPDATE, there may be multiple relations flagged FOR UPDATE,
344so we obtain lock on the current tuple version in each such relation before
345executing the recheck.
346
347It is also possible that there are relations in the query that are not
348to be locked (they are neither the UPDATE/DELETE target nor specified to
349be locked in SELECT FOR UPDATE/SHARE).  When re-running the test query
350we want to use the same rows from these relations that were joined to
351the locked rows.  For ordinary relations this can be implemented relatively
352cheaply by including the row TID in the join outputs and re-fetching that
353TID.  (The re-fetch is expensive, but we're trying to optimize the normal
354case where no re-test is needed.)  We have also to consider non-table
355relations, such as a ValuesScan or FunctionScan.  For these, since there
356is no equivalent of TID, the only practical solution seems to be to include
357the entire row value in the join output row.
358
359We disallow set-returning functions in the targetlist of SELECT FOR UPDATE,
360so as to ensure that at most one tuple can be returned for any particular
361set of scan tuples.  Otherwise we'd get duplicates due to the original
362query returning the same set of scan tuples multiple times.  Likewise,
363SRFs are disallowed in an UPDATE's targetlist.  There, they would have the
364effect of the same row being updated multiple times, which is not very
365useful --- and updates after the first would have no effect anyway.
366
367
368Asynchronous Execution
369----------------------
370
371In cases where a node is waiting on an event external to the database system,
372such as a ForeignScan awaiting network I/O, it's desirable for the node to
373indicate that it cannot return any tuple immediately but may be able to do so
374at a later time.  A process which discovers this type of situation can always
375handle it simply by blocking, but this may waste time that could be spent
376executing some other part of the plan tree where progress could be made
377immediately.  This is particularly likely to occur when the plan tree contains
378an Append node.  Asynchronous execution runs multiple parts of an Append node
379concurrently rather than serially to improve performance.
380
381For asynchronous execution, an Append node must first request a tuple from an
382async-capable child node using ExecAsyncRequest.  Next, it must execute the
383asynchronous event loop using ExecAppendAsyncEventWait.  Eventually, when a
384child node to which an asynchronous request has been made produces a tuple,
385the Append node will receive it from the event loop via ExecAsyncResponse.  In
386the current implementation of asynchronous execution, the only node type that
387requests tuples from an async-capable child node is an Append, while the only
388node type that might be async-capable is a ForeignScan.
389
390Typically, the ExecAsyncResponse callback is the only one required for nodes
391that wish to request tuples asynchronously.  On the other hand, async-capable
392nodes generally need to implement three methods:
393
3941. When an asynchronous request is made, the node's ExecAsyncRequest callback
395   will be invoked; it should use ExecAsyncRequestPending to indicate that the
396   request is pending for a callback described below.  Alternatively, it can
397   instead use ExecAsyncRequestDone if a result is available immediately.
398
3992. When the event loop wishes to wait or poll for file descriptor events, the
400   node's ExecAsyncConfigureWait callback will be invoked to configure the
401   file descriptor event for which the node wishes to wait.
402
4033. When the file descriptor becomes ready, the node's ExecAsyncNotify callback
404   will be invoked; like #1, it should use ExecAsyncRequestPending for another
405   callback or ExecAsyncRequestDone to return a result immediately.
406