src/backend/optimizer/README

Optimizer
=========

These directories take the Query structure returned by the parser, and
generate a plan used by the executor.  The /plan directory generates the
actual output plan, the /path code generates all possible ways to join the
tables, and /prep handles various preprocessing steps for special cases.
/util is utility stuff.  /geqo is the separate "genetic optimization" planner
--- it does a semi-random search through the join tree space, rather than
exhaustively considering all possible join trees.  (But each join considered
by /geqo is given to /path to create paths for, so we consider all possible
implementation paths for each specific join pair even in GEQO mode.)


Paths and Join Pairs
--------------------

During the planning/optimizing process, we build "Path" trees representing
the different ways of doing a query.  We select the cheapest Path that
generates the desired relation and turn it into a Plan to pass to the
executor.  (There is pretty nearly a one-to-one correspondence between the
Path and Plan trees, but Path nodes omit info that won't be needed during
planning, and include info needed for planning that won't be needed by the
executor.)

The optimizer builds a RelOptInfo structure for each base relation used in
the query.  Base rels are either primitive tables, or subquery subselects
that are planned via a separate recursive invocation of the planner.  A
RelOptInfo is also built for each join relation that is considered during
planning.  A join rel is simply a combination of base rels.  There is only
one join RelOptInfo for any given set of baserels --- for example, the join
{A B C} is represented by the same RelOptInfo no matter whether we build it
by joining A and B first and then adding C, or joining B and C first and
then adding A, etc.  These different means of building the joinrel are
represented as Paths.  For each RelOptInfo we build a list of Paths that
represent plausible ways to implement the scan or join of that relation.
Once we've considered all the plausible Paths for a rel, we select the one
that is cheapest according to the planner's cost estimates.  The final plan
is derived from the cheapest Path for the RelOptInfo that includes all the
base rels of the query.

Possible Paths for a primitive table relation include plain old sequential
scan, plus index scans for any indexes that exist on the table, plus bitmap
index scans using one or more indexes.  Specialized RTE types, such as
function RTEs, may have only one possible Path.

Joins always occur using two RelOptInfos.  One is outer, the other inner.
Outers drive lookups of values in the inner.  In a nested loop, lookups of
values in the inner occur by scanning the inner path once per outer tuple
to find each matching inner row.  In a mergejoin, inner and outer rows are
ordered, and are accessed in order, so only one scan is required to perform
the entire join: both inner and outer paths are scanned in-sync.  (There's
not a lot of difference between inner and outer in a mergejoin...)  In a
hashjoin, the inner is scanned first and all its rows are entered in a
hashtable, then the outer is scanned and for each row we look up the join
key in the hashtable.

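To make the three strategies concrete, here is a toy Python sketch (not the
executor's C code) joining two lists of (key, value) rows; it only
illustrates how the inner input is consumed in each case.

    # Toy illustration of the three join strategies.  A and B are lists of
    # (key, value) rows; each function returns (key, outer value, inner value).

    def nestloop_join(outer, inner):
        # rescan the inner input once per outer row
        return [(k, ov, iv) for k, ov in outer for ik, iv in inner if ik == k]

    def merge_join(outer, inner):
        # both inputs must be in key order and are scanned in sync, only once
        outer, inner = sorted(outer), sorted(inner)
        result, i = [], 0
        for k, ov in outer:
            while i < len(inner) and inner[i][0] < k:
                i += 1                      # advance past smaller inner keys
            j = i                           # don't move i: next outer key may repeat
            while j < len(inner) and inner[j][0] == k:
                result.append((k, ov, inner[j][1]))
                j += 1
        return result

    def hash_join(outer, inner):
        # build a hash table on the inner input, then probe it once per outer row
        table = {}
        for k, iv in inner:
            table.setdefault(k, []).append(iv)
        return [(k, ov, iv) for k, ov in outer for iv in table.get(k, [])]

    A = [(1, 'a1'), (2, 'a2')]
    B = [(1, 'b1'), (1, 'b2'), (3, 'b3')]
    assert nestloop_join(A, B) == merge_join(A, B) == hash_join(A, B)
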
A Path for a join relation is actually a tree structure, with the topmost
Path node representing the last-applied join method.  It has left and right
subpaths that represent the scan or join methods used for the two input
relations.


Join Tree Construction
----------------------

The optimizer generates optimal query plans by doing a more-or-less
exhaustive search through the ways of executing the query.  The best Path
tree is found by a recursive process:

1) Take each base relation in the query, and make a RelOptInfo structure
for it.  Find each potentially useful way of accessing the relation,
including sequential and index scans, and make Paths representing those
ways.  All the Paths made for a given relation are placed in its
RelOptInfo.pathlist.  (Actually, we discard Paths that are obviously
inferior alternatives before they ever get into the pathlist --- what
ends up in the pathlist is the cheapest way of generating each potentially
useful sort ordering and parameterization of the relation.)  Also create a
RelOptInfo.joininfo list including all the join clauses that involve this
relation.  For example, the WHERE clause "tab1.col1 = tab2.col1" generates
entries in both tab1 and tab2's joininfo lists.

If we have only a single base relation in the query, we are done.
Otherwise we have to figure out how to join the base relations into a
single join relation.

2) Normally, any explicit JOIN clauses are "flattened" so that we just
have a list of relations to join.  However, FULL OUTER JOIN clauses are
never flattened, and other kinds of JOIN might not be either, if the
flattening process is stopped by join_collapse_limit or from_collapse_limit
restrictions.  Therefore, we end up with a planning problem that contains
lists of relations to be joined in any order, where any individual item
might be a sub-list that has to be joined together before we can consider
joining it to its siblings.  We process these sub-problems recursively,
bottom up.  Note that the join list structure constrains the possible join
orders, but it doesn't constrain the join implementation method at each
join (nestloop, merge, hash), nor does it say which rel is considered outer
or inner at each join.  We consider all these possibilities in building
Paths.  We generate a Path for each feasible join method, and select the
cheapest Path.

For each planning problem, therefore, we will have a list of relations
that are either base rels or joinrels constructed per sub-join-lists.
We can join these rels together in any order the planner sees fit.
The standard (non-GEQO) planner does this as follows:

Consider joining each RelOptInfo to each other RelOptInfo for which there
is a usable joinclause, and generate a Path for each possible join method
for each such pair.  (If we have a RelOptInfo with no join clauses, we have
no choice but to generate a clauseless Cartesian-product join; so we
consider joining that rel to each other available rel.  But in the presence
of join clauses we will only consider joins that use available join
clauses.  Note that join-order restrictions induced by outer joins and
IN/EXISTS clauses are also checked, to ensure that we find a workable join
order in cases where those restrictions force a clauseless join to be done.)

If we only had two relations in the list, we are done: we just pick
the cheapest path for the join RelOptInfo.  If we had more than two, we now
need to consider ways of joining join RelOptInfos to each other to make
join RelOptInfos that represent more than two list items.

The join tree is constructed using a "dynamic programming" algorithm:
in the first pass (already described) we consider ways to create join rels
representing exactly two list items.  The second pass considers ways
to make join rels that represent exactly three list items; the next pass,
four items, etc.  The last pass considers how to make the final join
relation that includes all list items --- obviously there can be only one
join rel at this top level, whereas there can be more than one join rel
at lower levels.  At each level we use joins that follow available join
clauses, if possible, just as described for the first level.

For example:

    SELECT  *
    FROM    tab1, tab2, tab3, tab4
    WHERE   tab1.col = tab2.col AND
        tab2.col = tab3.col AND
        tab3.col = tab4.col

    Tables 1, 2, 3, and 4 are joined as:
    {1 2},{2 3},{3 4}
    {1 2 3},{2 3 4}
    {1 2 3 4}
    (other possibilities will be excluded for lack of join clauses)

    SELECT  *
    FROM    tab1, tab2, tab3, tab4
    WHERE   tab1.col = tab2.col AND
        tab1.col = tab3.col AND
        tab1.col = tab4.col

    Tables 1, 2, 3, and 4 are joined as:
    {1 2},{1 3},{1 4}
    {1 2 3},{1 3 4},{1 2 4}
    {1 2 3 4}

We consider left-handed plans (the outer rel of an upper join is a joinrel,
but the inner is always a single list item); right-handed plans (outer rel
is always a single item); and bushy plans (both inner and outer can be
joins themselves).  For example, when building {1 2 3 4} we consider
joining {1 2 3} to {4} (left-handed), {4} to {1 2 3} (right-handed), and
{1 2} to {3 4} (bushy), among other choices.  Although the jointree
scanning code produces these potential join combinations one at a time,
all the ways to produce the same set of joined base rels will share the
same RelOptInfo, so the paths produced from different join combinations
that produce equivalent joinrels will compete in add_path().

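The level-by-level search can be sketched in Python as follows.  This is a
toy model only: the real standard_join_search() works with RelOptInfos,
Paths, and join-legality checks, and also falls back to clauseless joins
when necessary, which this sketch omits.  The connected() test and relation
names are invented for the illustration; the printed levels match the first
example above.

    def join_search(base_rels, connected):
        """connected(s1, s2) -> True if some join clause links the two sets."""
        levels = {1: [frozenset([r]) for r in base_rels]}
        for level in range(2, len(base_rels) + 1):
            found = set()
            for k in range(1, level // 2 + 1):
                for outer in levels[level - k]:
                    for inner in levels[k]:
                        if outer & inner:
                            continue              # inputs overlap: not a valid pair
                        if not connected(outer, inner):
                            continue              # would be a clauseless join
                        found.add(outer | inner)  # same joinrel however it was built
            levels[level] = sorted(found, key=sorted)
        return levels

    # WHERE tab1.col = tab2.col AND tab2.col = tab3.col AND tab3.col = tab4.col
    clauses = [({'t1'}, {'t2'}), ({'t2'}, {'t3'}), ({'t3'}, {'t4'})]
    def connected(s1, s2):
        return any((a <= s1 and b <= s2) or (a <= s2 and b <= s1) for a, b in clauses)

    for level, rels in join_search(['t1', 't2', 't3', 't4'], connected).items():
        print(level, [''.join(sorted(r)) for r in rels])
    # level 2: ['t1t2', 't2t3', 't3t4'], level 3: ['t1t2t3', 't2t3t4'],
    # level 4: ['t1t2t3t4']
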
The dynamic-programming approach has an important property that's not
immediately obvious: we will finish constructing all paths for a given
relation before we construct any paths for relations containing that rel.
This means that we can reliably identify the "cheapest path" for each rel
before higher-level relations need to know that.  Also, we can safely
discard a path when we find that another path for the same rel is better,
without worrying that maybe there is already a reference to that path in
some higher-level join path.  Without this, memory management for paths
would be much more complicated.

Once we have built the final join rel, we use either the cheapest path
for it or the cheapest path with the desired ordering (if that's cheaper
than applying a sort to the cheapest other path).

If the query contains one-sided outer joins (LEFT or RIGHT joins), or
IN or EXISTS WHERE clauses that were converted to semijoins or antijoins,
then some of the possible join orders may be illegal.  These are excluded
by having join_is_legal consult a side list of such "special" joins to see
whether a proposed join is illegal.  (The same consultation allows it to
see which join style should be applied for a valid join, ie, JOIN_INNER,
JOIN_LEFT, etc.)


Valid OUTER JOIN Optimizations
------------------------------

The planner's treatment of outer join reordering is based on the following
identities:

1.	(A leftjoin B on (Pab)) innerjoin C on (Pac)
	= (A innerjoin C on (Pac)) leftjoin B on (Pab)

where Pac is a predicate referencing A and C, etc (in this case, clearly
Pac cannot reference B, or the transformation is nonsensical).

2.	(A leftjoin B on (Pab)) leftjoin C on (Pac)
	= (A leftjoin C on (Pac)) leftjoin B on (Pab)

3.	(A leftjoin B on (Pab)) leftjoin C on (Pbc)
	= A leftjoin (B leftjoin C on (Pbc)) on (Pab)

Identity 3 only holds if predicate Pbc must fail for all-null B rows
(that is, Pbc is strict for at least one column of B).  If Pbc is not
strict, the first form might produce some rows with nonnull C columns
where the second form would make those entries null.

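The strictness requirement in identity 3 can be checked mechanically.  The
following Python sketch (an illustration, not planner code) evaluates both
join orders over one-row relations; with a strict Pbc the two forms agree,
while a Pbc that accepts null B values makes them differ in exactly the way
described above.

    def left_join(left, right, pred):
        # rows are dicts; an unmatched left row is null-extended with the
        # right side's column names (None plays the role of NULL)
        right_cols = {k for r in right for k in r}
        out = []
        for l in left:
            matches = [{**l, **r} for r in right if pred(l, r)]
            out.extend(matches or [{**l, **{k: None for k in right_cols}}])
        return out

    A, B, C = [{'a': 1}], [{'b': 2}], [{'c': 3}]
    Pab = lambda l, r: l['a'] == r['b']
    Pbc_strict = lambda l, r: l['b'] == r['c']                    # fails on null b
    Pbc_lax    = lambda l, r: l['b'] is None or l['b'] == r['c']  # accepts null b

    for Pbc in (Pbc_strict, Pbc_lax):
        form1 = left_join(left_join(A, B, Pab), C, Pbc)   # (A leftjoin B) leftjoin C
        form2 = left_join(A, left_join(B, C, Pbc), Pab)   #  A leftjoin (B leftjoin C)
        print(form1 == form2)    # True for the strict predicate, False for the lax one
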
RIGHT JOIN is equivalent to LEFT JOIN after switching the two input
tables, so the same identities work for right joins.

An example of a case that does *not* work is moving an innerjoin into or
out of the nullable side of an outer join:

	A leftjoin (B join C on (Pbc)) on (Pab)
	!= (A leftjoin B on (Pab)) join C on (Pbc)

SEMI joins work a little bit differently.  A semijoin can be reassociated
into or out of the lefthand side of another semijoin, left join, or
antijoin, but not into or out of the righthand side.  Likewise, an inner
join, left join, or antijoin can be reassociated into or out of the
lefthand side of a semijoin, but not into or out of the righthand side.

ANTI joins work approximately like LEFT joins, except that identity 3
fails if the join to C is an antijoin (even if Pbc is strict, and in
both the cases where the other join is a leftjoin and where it is an
antijoin).  So we can't reorder antijoins into or out of the RHS of a
leftjoin or antijoin, even if the relevant clause is strict.

The current code does not attempt to re-order FULL JOINs at all.
FULL JOIN ordering is enforced by not collapsing FULL JOIN nodes when
translating the jointree to "joinlist" representation.  Other types of
JOIN nodes are normally collapsed so that they participate fully in the
join order search.  To avoid generating illegal join orders, the planner
creates a SpecialJoinInfo node for each non-inner join, and join_is_legal
checks this list to decide if a proposed join is legal.

What we store in SpecialJoinInfo nodes are the minimum sets of Relids
required on each side of the join to form the outer join.  Note that
these are minimums; there's no explicit maximum, since joining other
rels to the OJ's syntactic rels may be legal.  Per identities 1 and 2,
non-FULL joins can be freely associated into the lefthand side of an
OJ, but in some cases they can't be associated into the righthand side.
So the restriction enforced by join_is_legal is that a proposed join
can't join a rel within or partly within an RHS boundary to one outside
the boundary, unless the proposed join is a LEFT join that can associate
into the SpecialJoinInfo's RHS using identity 3.

The use of minimum Relid sets has some pitfalls; consider a query like
	A leftjoin (B leftjoin (C innerjoin D) on (Pbcd)) on Pa
where Pa doesn't mention B/C/D at all.  In this case a naive computation
would give the upper leftjoin's min LHS as {A} and min RHS as {C,D} (since
we know that the innerjoin can't associate out of the leftjoin's RHS, and
enforce that by including its relids in the leftjoin's min RHS).  And the
lower leftjoin has min LHS of {B} and min RHS of {C,D}.  Given such
information, join_is_legal would think it's okay to associate the upper
join into the lower join's RHS, transforming the query to
	B leftjoin (A leftjoin (C innerjoin D) on Pa) on (Pbcd)
which yields totally wrong answers.  We prevent that by forcing the min RHS
for the upper join to include B.  This is perhaps overly restrictive, but
such cases don't arise often so it's not clear that it's worth developing a
more complicated system.


Pulling Up Subqueries
---------------------

As we described above, a subquery appearing in the range table is planned
independently and treated as a "black box" during planning of the outer
query.  This is necessary when the subquery uses features such as
aggregates, GROUP, or DISTINCT.  But if the subquery is just a simple
scan or join, treating the subquery as a black box may produce a poor plan
compared to considering it as part of the entire plan search space.
Therefore, at the start of the planning process the planner looks for
simple subqueries and pulls them up into the main query's jointree.

Pulling up a subquery may result in FROM-list joins appearing below the top
of the join tree.  Each FROM-list is planned using the dynamic-programming
search method described above.

If pulling up a subquery produces a FROM-list as a direct child of another
FROM-list, then we can merge the two FROM-lists together.  Once that's
done, the subquery is an absolutely integral part of the outer query and
will not constrain the join tree search space at all.  However, that could
result in unpleasant growth of planning time, since the dynamic-programming
search has runtime exponential in the number of FROM-items considered.
Therefore, we don't merge FROM-lists if the result would have too many
FROM-items in one list.


Optimizer Functions
-------------------

The primary entry point is planner().

planner()
set up for recursive handling of subqueries
-subquery_planner()
 pull up sublinks and subqueries from rangetable, if possible
 canonicalize qual
     Attempt to simplify WHERE clause to the most useful form; this includes
     flattening nested AND/ORs and detecting clauses that are duplicated in
     different branches of an OR.
 simplify constant expressions
 process sublinks
 convert Vars of outer query levels into Params
--grouping_planner()
  preprocess target list for non-SELECT queries
  handle UNION/INTERSECT/EXCEPT, GROUP BY, HAVING, aggregates,
	ORDER BY, DISTINCT, LIMIT
---query_planner()
   make list of base relations used in query
   split up the qual into restrictions (a=1) and joins (b=c)
   find qual clauses that enable merge and hash joins
----make_one_rel()
     set_base_rel_pathlists()
      find seqscan and all index paths for each base relation
      find selectivity of columns used in joins
     make_rel_from_joinlist()
      hand off join subproblems to a plugin, GEQO, or standard_join_search()
------standard_join_search()
      call join_search_one_level() for each level of join tree needed
      join_search_one_level():
        For each joinrel of the prior level, do make_rels_by_clause_joins()
        if it has join clauses, or make_rels_by_clauseless_joins() if not.
        Also generate "bushy plan" joins between joinrels of lower levels.
      Back at standard_join_search(), generate gather paths if needed for
      each newly constructed joinrel, then apply set_cheapest() to extract
      the cheapest path for it.
      Loop back if this wasn't the top join level.
  Back at grouping_planner:
  do grouping (GROUP BY) and aggregation
  do window functions
  make unique (DISTINCT)
  do sorting (ORDER BY)
  do limit (LIMIT/OFFSET)
Back at planner():
convert finished Path tree into a Plan tree
do final cleanup after planning


Optimizer Data Structures
-------------------------

PlannerGlobal   - global information for a single planner invocation

PlannerInfo     - information for planning a particular Query (we make
                  a separate PlannerInfo node for each sub-Query)

RelOptInfo      - a relation or joined relations

 RestrictInfo   - WHERE clauses, like "x = 3" or "y = z"
                  (note the same structure is used for restriction and
                   join clauses)

 Path           - every way to generate a RelOptInfo(sequential,index,joins)
  A plain Path node can represent several simple plans, per its pathtype:
    T_SeqScan   - sequential scan
    T_SampleScan - tablesample scan
    T_FunctionScan - function-in-FROM scan
    T_TableFuncScan - table function scan
    T_ValuesScan - VALUES scan
    T_CteScan   - CTE (WITH) scan
    T_NamedTuplestoreScan - ENR scan
    T_WorkTableScan - scan worktable of a recursive CTE
    T_Result    - childless Result plan node (used for FROM-less SELECT)
  IndexPath     - index scan
  BitmapHeapPath - top of a bitmapped index scan
  TidPath       - scan by CTID
  TidRangePath  - scan a contiguous range of CTIDs
  SubqueryScanPath - scan a subquery-in-FROM
  ForeignPath   - scan a foreign table, foreign join or foreign upper-relation
  CustomPath    - for custom scan providers
  AppendPath    - append multiple subpaths together
  MergeAppendPath - merge multiple subpaths, preserving their common sort order
  GroupResultPath - childless Result plan node (used for degenerate grouping)
  MaterialPath  - a Material plan node
  MemoizePath   - a Memoize plan node for caching tuples from sub-paths
  UniquePath    - remove duplicate rows (either by hashing or sorting)
  GatherPath    - collect the results of parallel workers
  GatherMergePath - collect parallel results, preserving their common sort order
  ProjectionPath - a Result plan node with child (used for projection)
  ProjectSetPath - a ProjectSet plan node applied to some sub-path
  SortPath      - a Sort plan node applied to some sub-path
  IncrementalSortPath - an IncrementalSort plan node applied to some sub-path
  GroupPath     - a Group plan node applied to some sub-path
  UpperUniquePath - a Unique plan node applied to some sub-path
  AggPath       - an Agg plan node applied to some sub-path
  GroupingSetsPath - an Agg plan node used to implement GROUPING SETS
  MinMaxAggPath - a Result plan node with subplans performing MIN/MAX
  WindowAggPath - a WindowAgg plan node applied to some sub-path
  SetOpPath     - a SetOp plan node applied to some sub-path
  RecursiveUnionPath - a RecursiveUnion plan node applied to two sub-paths
  LockRowsPath  - a LockRows plan node applied to some sub-path
  ModifyTablePath - a ModifyTable plan node applied to some sub-path(s)
  LimitPath     - a Limit plan node applied to some sub-path
  NestPath      - nested-loop joins
  MergePath     - merge joins
  HashPath      - hash joins

 EquivalenceClass - a data structure representing a set of values known equal

 PathKey        - a data structure representing the sort ordering of a path

The optimizer spends a good deal of its time worrying about the ordering
of the tuples returned by a path.  The reason this is useful is that by
knowing the sort ordering of a path, we may be able to use that path as
the left or right input of a mergejoin and avoid an explicit sort step.
Nestloops and hash joins don't really care what the order of their inputs
is, but mergejoin needs suitably ordered inputs.  Therefore, all paths
generated during the optimization process are marked with their sort order
(to the extent that it is known) for possible use by a higher-level merge.

It is also possible to avoid an explicit sort step to implement a user's
ORDER BY clause if the final path has the right ordering already, so the
sort ordering is of interest even at the top level.  grouping_planner() will
look for the cheapest path with a sort order matching the desired order,
then compare its cost to the cost of using the cheapest-overall path and
doing an explicit sort on that.

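As a toy illustration of that comparison (invented cost numbers and helper
names, not the planner's cost model):

    def choose_final_path(paths, query_pathkeys, sort_cost):
        # paths are (total_cost, pathkeys) pairs; sort_cost() stands in for
        # the planner's cost_sort() estimate
        presorted = [p for p in paths if satisfies(p[1], query_pathkeys)]
        best_presorted = min(presorted, key=lambda p: p[0], default=None)
        cheapest = min(paths, key=lambda p: p[0])
        explicit_sort = (cheapest[0] + sort_cost(cheapest), query_pathkeys)
        if best_presorted is not None and best_presorted[0] <= explicit_sort[0]:
            return best_presorted
        return explicit_sort

    def satisfies(pathkeys, wanted):
        # a path's ordering satisfies the query if the wanted keys are a prefix
        return pathkeys[:len(wanted)] == wanted

    paths = [(100.0, ['x']), (80.0, [])]
    print(choose_final_path(paths, ['x'], sort_cost=lambda p: 30.0))
    # -> (100.0, ['x']): cheaper than sorting the 80-cost path for 30 more
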
When we are generating paths for a particular RelOptInfo, we discard a path
if it is more expensive than another known path that has the same or better
sort order.  We will never discard a path that is the only known way to
achieve a given sort order (without an explicit sort, that is).  In this
way, the next level up will have the maximum freedom to build mergejoins
without sorting, since it can pick from any of the paths retained for its
inputs.

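The retention rule amounts to a dominance test, sketched below in Python.
This is illustrative only: the real add_path() also considers startup cost,
rowcount, parameterization, and parallel safety.

    def add_path(pathlist, new):
        # a path is (total_cost, pathkeys); keep "new" only if no kept path
        # is both cheaper and at least as well sorted, and prune kept paths
        # that "new" dominates
        def dominates(p, q):
            return p[0] <= q[0] and provides_ordering(p[1], q[1])
        if any(dominates(old, new) for old in pathlist):
            return pathlist
        return [old for old in pathlist if not dominates(new, old)] + [new]

    def provides_ordering(pk1, pk2):
        # pk1's ordering provides pk2's if pk2 is a prefix of pk1
        return pk1[:len(pk2)] == pk2

    paths = []
    for cand in [(100.0, ['x']), (90.0, []), (120.0, ['x'])]:
        paths = add_path(paths, cand)
    print(paths)   # [(100.0, ['x']), (90.0, [])]: the 120-cost path is dominated
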

EquivalenceClasses
------------------

During the deconstruct_jointree() scan of the query's qual clauses, we look
for mergejoinable equality clauses A = B whose applicability is not delayed
by an outer join; these are called "equivalence clauses".  When we find
one, we create an EquivalenceClass containing the expressions A and B to
record this knowledge.  If we later find another equivalence clause B = C,
we add C to the existing EquivalenceClass for {A B}; this may require
merging two existing EquivalenceClasses.  At the end of the scan, we have
sets of values that are known all transitively equal to each other.  We can
therefore use a comparison of any pair of the values as a restriction or
join clause (when these values are available at the scan or join, of
course); furthermore, we need test only one such comparison, not all of
them.  Therefore, equivalence clauses are removed from the standard qual
distribution process.  Instead, when preparing a restriction or join clause
list, we examine each EquivalenceClass to see if it can contribute a
clause, and if so we select an appropriate pair of values to compare.  For
example, if we are trying to join A's relation to C's, we can generate the
clause A = C, even though this appeared nowhere explicitly in the original
query.  This may allow us to explore join paths that otherwise would have
been rejected as requiring Cartesian-product joins.

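Mechanically, the merging and clause-generation behavior described above
looks like the following Python sketch (just the set-merging idea; the real
EquivalenceClass machinery also tracks opfamilies, relids, sort operators,
and the clauses actually generated).

    class ECSet:
        def __init__(self):
            self.classes = []          # list of sets of expressions

        def add_equivalence(self, a, b):
            # find existing classes mentioning a or b and merge them
            touched = [c for c in self.classes if a in c or b in c]
            merged = (set().union(*touched) | {a, b}) if touched else {a, b}
            self.classes = [c for c in self.classes if c not in touched] + [merged]

        def join_clause(self, rel1_exprs, rel2_exprs):
            # pick one representative pair usable as a join clause, if any
            for c in self.classes:
                left, right = c & rel1_exprs, c & rel2_exprs
                if left and right:
                    return (next(iter(left)), next(iter(right)))
            return None

    ecs = ECSet()
    ecs.add_equivalence('a.x', 'b.y')
    ecs.add_equivalence('b.y', 'c.z')            # merges into {a.x, b.y, c.z}
    print(ecs.join_clause({'a.x'}, {'c.z'}))     # ('a.x', 'c.z'), even though
                                                 # the query never wrote a.x = c.z
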
Sometimes an EquivalenceClass may contain a pseudo-constant expression
(i.e., one not containing Vars or Aggs of the current query level, nor
volatile functions).  In this case we do not follow the policy of
dynamically generating join clauses: instead, we dynamically generate
restriction clauses "var = const" wherever one of the variable members of
the class can first be computed.  For example, if we have A = B and B = 42,
we effectively generate the restriction clauses A = 42 and B = 42, and then
we need not bother with explicitly testing the join clause A = B when the
relations are joined.  In effect, all the class members can be tested at
relation-scan level and there's never a need for join tests.

The precise technical interpretation of an EquivalenceClass is that it
asserts that at any plan node where more than one of its member values
can be computed, output rows in which the values are not all equal may
be discarded without affecting the query result.  (We require all levels
of the plan to enforce EquivalenceClasses, hence a join need not recheck
equality of values that were computable by one of its children.)  For an
ordinary EquivalenceClass that is "valid everywhere", we can further infer
that the values are all non-null, because all mergejoinable operators are
strict.  However, we also allow equivalence clauses that appear below the
nullable side of an outer join to form EquivalenceClasses; for these
classes, the interpretation is that either all the values are equal, or
all (except pseudo-constants) have gone to null.  (This requires a
limitation that non-constant members be strict, else they might not go
to null when the other members do.)  Consider for example

	SELECT *
	  FROM a LEFT JOIN
	       (SELECT * FROM b JOIN c ON b.y = c.z WHERE b.y = 10) ss
	       ON a.x = ss.y
	  WHERE a.x = 42;

We can form the below-outer-join EquivalenceClass {b.y c.z 10} and thereby
apply c.z = 10 while scanning c.  (The reason we disallow outerjoin-delayed
clauses from forming EquivalenceClasses is exactly that we want to be able
to push any derived clauses as far down as possible.)  But once above the
outer join it's no longer necessarily the case that b.y = 10, and thus we
cannot use such EquivalenceClasses to conclude that sorting is unnecessary
(see discussion of PathKeys below).

In this example, notice also that a.x = ss.y (really a.x = b.y) is not an
equivalence clause because its applicability to b is delayed by the outer
join; thus we do not try to insert b.y into the equivalence class {a.x 42}.
But since we see that a.x has been equated to 42 above the outer join, we
are able to form a below-outer-join class {b.y 42}; this restriction can be
added because no b/c row not having b.y = 42 can contribute to the result
of the outer join, and so we need not compute such rows.  Now this class
will get merged with {b.y c.z 10}, leading to the contradiction 10 = 42,
which lets the planner deduce that the b/c join need not be computed at all
because none of its rows can contribute to the outer join.  (This gets
implemented as a gating Result filter, since more usually the potential
contradiction involves Param values rather than just Consts, and thus has
to be checked at runtime.)

To aid in determining the sort ordering(s) that can work with a mergejoin,
we mark each mergejoinable clause with the EquivalenceClasses of its left
and right inputs.  For an equivalence clause, these are of course the same
EquivalenceClass.  For a non-equivalence mergejoinable clause (such as an
outer-join qualification), we generate two separate EquivalenceClasses for
the left and right inputs.  This may result in creating single-item
equivalence "classes", though of course these are still subject to merging
if other equivalence clauses are later found to bear on the same
expressions.

Another way that we may form a single-item EquivalenceClass is in creation
of a PathKey to represent a desired sort order (see below).  This is a bit
different from the above cases because such an EquivalenceClass might
contain an aggregate function or volatile expression.  (A clause containing
a volatile function will never be considered mergejoinable, even if its top
operator is mergejoinable, so there is no way for a volatile expression to
get into EquivalenceClasses otherwise.  Aggregates are disallowed in WHERE
altogether, so will never be found in a mergejoinable clause.)  This is just
a convenience to maintain a uniform PathKey representation: such an
EquivalenceClass will never be merged with any other.  Note in particular
that a single-item EquivalenceClass {a.x} is *not* meant to imply an
assertion that a.x = a.x; the practical effect of this is that a.x could
be NULL.

An EquivalenceClass also contains a list of btree opfamily OIDs, which
determines what the equalities it represents actually "mean".  All the
equivalence clauses that contribute to an EquivalenceClass must have
equality operators that belong to the same set of opfamilies.  (Note: most
of the time, a particular equality operator belongs to only one family, but
it's possible that it belongs to more than one.  We keep track of all the
families to ensure that we can make use of an index belonging to any one of
the families for mergejoin purposes.)

An EquivalenceClass can contain "em_is_child" members, which are copies
of members that contain appendrel parent relation Vars, transposed to
contain the equivalent child-relation variables or expressions.  These
members are *not* full-fledged members of the EquivalenceClass and do not
affect the class's overall properties at all.  They are kept only to
simplify matching of child-relation expressions to EquivalenceClasses.
Most operations on EquivalenceClasses should ignore child members.


PathKeys
--------

The PathKeys data structure represents what is known about the sort order
of the tuples generated by a particular Path.  A path's pathkeys field is a
list of PathKey nodes, where the n'th item represents the n'th sort key of
the result.  Each PathKey contains these fields:

	* a reference to an EquivalenceClass
	* a btree opfamily OID (must match one of those in the EC)
	* a sort direction (ascending or descending)
	* a nulls-first-or-last flag

The EquivalenceClass represents the value being sorted on.  Since the
various members of an EquivalenceClass are known equal according to the
opfamily, we can consider a path sorted by any one of them to be sorted by
any other too; this is what justifies referencing the whole
EquivalenceClass rather than just one member of it.

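A PathKey can be pictured as a small record referencing a canonical
EquivalenceClass, as in this Python sketch; the field and function names
here are invented to mirror the description, not the actual C structs.
Because canonical PathKeys are shared objects, comparing orderings reduces
to cheap identity and prefix checks.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class PathKey:
        eclass: str             # stand-in for a pointer to a canonical EquivalenceClass
        opfamily: int           # a btree opfamily OID; must match one in the EC
        descending: bool = False
        nulls_first: bool = False

    def pathkeys_contained_in(keys1, keys2):
        # is ordering keys1 provided by ordering keys2 (i.e. a prefix of it)?
        return keys2[:len(keys1)] == keys1

    # illustrative values only
    idx_order = [PathKey('x_ec', 1976), PathKey('y_ec', 1976)]
    query_order = [PathKey('x_ec', 1976)]
    print(pathkeys_contained_in(query_order, idx_order))   # True: no sort needed
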
In single/base relation RelOptInfo's, the Paths represent various ways
of scanning the relation and the resulting ordering of the tuples.
Sequential scan Paths have NIL pathkeys, indicating no known ordering.
Index scans have Path.pathkeys that represent the chosen index's ordering,
if any.  A single-key index would create a single-PathKey list, while a
multi-column index generates a list with one element per key index column.
Non-key columns specified in the INCLUDE clause of covering indexes don't
have corresponding PathKeys in the list, because they have no influence on
index ordering.  (Actually, since an index can be scanned either forward or
backward, there are two possible sort orders and two possible PathKey lists
it can generate.)

Note that a bitmap scan has NIL pathkeys since we can say nothing about
the overall order of its result.  Also, an indexscan on an unordered type
of index generates NIL pathkeys.  However, we can always create a pathkey
by doing an explicit sort.  The pathkeys for a Sort plan's output just
represent the sort key fields and the ordering operators used.

Things get more interesting when we consider joins.  Suppose we do a
mergejoin between A and B using the mergeclause A.X = B.Y.  The output
of the mergejoin is sorted by X --- but it is also sorted by Y.  Again,
this can be represented by a PathKey referencing an EquivalenceClass
containing both X and Y.

With a little further thought, it becomes apparent that nestloop joins
can also produce sorted output.  For example, if we do a nestloop join
between outer relation A and inner relation B, then any pathkeys relevant
to A are still valid for the join result: we have not altered the order of
the tuples from A.  Even more interesting, if there was an equivalence clause
A.X=B.Y, and A.X was a pathkey for the outer relation A, then we can assert
that B.Y is a pathkey for the join result; X was ordered before and still
is, and the joined values of Y are equal to the joined values of X, so Y
must now be ordered too.  This is true even though we used neither an
explicit sort nor a mergejoin on Y.  (Note: hash joins cannot be counted
on to preserve the order of their outer relation, because the executor
might decide to "batch" the join, so we always set pathkeys to NIL for
a hashjoin path.)  Exception: a RIGHT or FULL join doesn't preserve the
ordering of its outer relation, because it might insert nulls at random
points in the ordering.

In general, we can justify using EquivalenceClasses as the basis for
pathkeys because, whenever we scan a relation containing multiple
EquivalenceClass members or join two relations each containing
EquivalenceClass members, we apply restriction or join clauses derived from
the EquivalenceClass.  This guarantees that any two values listed in the
EquivalenceClass are in fact equal in all tuples emitted by the scan or
join, and therefore that if the tuples are sorted by one of the values,
they can be considered sorted by any other as well.  It does not matter
whether the test clause is used as a mergeclause, or merely enforced
after-the-fact as a qpqual filter.

Note that there is no particular difficulty in labeling a path's sort
order with a PathKey referencing an EquivalenceClass that contains
variables not yet joined into the path's output.  We can simply ignore
such entries as not being relevant (yet).  This makes it possible to
use the same EquivalenceClasses throughout the join planning process.
In fact, by being careful not to generate multiple identical PathKey
objects, we can reduce comparison of EquivalenceClasses and PathKeys
to simple pointer comparison, which is a huge savings because add_path
has to make a large number of PathKey comparisons in deciding whether
competing Paths are equivalently sorted.

Pathkeys are also useful to represent an ordering that we wish to achieve,
since they are easily compared to the pathkeys of a potential candidate
path.  So, SortGroupClause lists are turned into pathkeys lists for use
inside the optimizer.

An additional refinement we can make is to insist that canonical pathkey
lists (sort orderings) do not mention the same EquivalenceClass more than
once.  For example, in all these cases the second sort column is redundant,
because it cannot distinguish values that are the same according to the
first sort column:
	SELECT ... ORDER BY x, x
	SELECT ... ORDER BY x, x DESC
	SELECT ... WHERE x = y ORDER BY x, y
Although a user probably wouldn't write "ORDER BY x,x" directly, such
redundancies are more probable once equivalence classes have been
considered.  Also, the system may generate redundant pathkey lists when
computing the sort ordering needed for a mergejoin.  By eliminating the
redundancy, we save time and improve planning, since the planner will more
easily recognize equivalent orderings as being equivalent.

Another interesting property is that if the underlying EquivalenceClass
contains a constant and is not below an outer join, then the pathkey is
completely redundant and need not be sorted by at all!  Every row must
contain the same constant value, so there's no need to sort.  (If the EC is
below an outer join, we still have to sort, since some of the rows might
have gone to null and others not.  In this case we must be careful to pick
a non-const member to sort by.  The assumption that all the non-const
members go to null at the same plan level is critical here, else they might
not produce the same sort order.)  This might seem pointless because users
are unlikely to write "... WHERE x = 42 ORDER BY x", but it allows us to
recognize when particular index columns are irrelevant to the sort order:
if we have "... WHERE x = 42 ORDER BY y", scanning an index on (x,y)
produces correctly ordered data without a sort step.  We used to have very
ugly ad-hoc code to recognize that in limited contexts, but discarding
constant ECs from pathkeys makes it happen cleanly and automatically.

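Both redundancy rules (a repeated EquivalenceClass within one ordering, and
a constant EC that is not below an outer join) amount to a simple filter
over the pathkey list, sketched here in Python with plain identifiers
standing in for ECs.

    def canonicalize_pathkeys(pathkeys, const_ecs):
        """pathkeys: list of EC ids, in sort-key order.
        const_ecs: EC ids known to contain a constant (and not below an OJ)."""
        result, seen = [], set()
        for ec in pathkeys:
            if ec in const_ecs:
                continue        # every row has the same value: nothing to sort
            if ec in seen:
                continue        # cannot distinguish rows the earlier key ties
            seen.add(ec)
            result.append(ec)
        return result

    # ... WHERE x = 42 ORDER BY x, y   with ECs  {x, 42} -> 'x_ec',  {y} -> 'y_ec'
    print(canonicalize_pathkeys(['x_ec', 'y_ec'], const_ecs={'x_ec'}))   # ['y_ec']
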
You might object that a below-outer-join EquivalenceClass doesn't always
represent the same values at every level of the join tree, and so using
it to uniquely identify a sort order is dubious.  This is true, but we
can avoid dealing with the fact explicitly because we always consider that
an outer join destroys any ordering of its nullable inputs.  Thus, even
if a path was sorted by {a.x} below an outer join, we'll re-sort if that
sort ordering was important; and so using the same PathKey for both sort
orderings doesn't create any real problem.


Order of processing for EquivalenceClasses and PathKeys
-------------------------------------------------------

As alluded to above, there is a specific sequence of phases in the
processing of EquivalenceClasses and PathKeys during planning.  During the
initial scanning of the query's quals (deconstruct_jointree followed by
reconsider_outer_join_clauses), we construct EquivalenceClasses based on
mergejoinable clauses found in the quals.  At the end of this process,
we know all we can know about equivalence of different variables, so
subsequently there will be no further merging of EquivalenceClasses.
At that point it is possible to consider the EquivalenceClasses as
"canonical" and build canonical PathKeys that reference them.  At this
time we construct PathKeys for the query's ORDER BY and related clauses.
(Any ordering expressions that do not appear elsewhere will result in
the creation of new EquivalenceClasses, but this cannot result in merging
existing classes, so canonical-ness is not lost.)

Because all the EquivalenceClasses are known before we begin path
generation, we can use them as a guide to which indexes are of interest:
if an index's column is not mentioned in any EquivalenceClass then that
index's sort order cannot possibly be helpful for the query.  This allows
short-circuiting of much of the processing of create_index_paths() for
irrelevant indexes.

There are some cases where planner.c constructs additional
EquivalenceClasses and PathKeys after query_planner has completed.
In these cases, the extra ECs/PKs are needed to represent sort orders
that were not considered during query_planner.  Such situations should be
minimized since it is impossible for query_planner to return a plan
producing such a sort order, meaning an explicit sort will always be needed.
Currently this happens only for queries involving multiple window functions
with different orderings, for which extra sorts are needed anyway.


Parameterized Paths
-------------------

The naive way to join two relations using a clause like WHERE A.X = B.Y
is to generate a nestloop plan like this:

	NestLoop
		Filter: A.X = B.Y
		-> Seq Scan on A
		-> Seq Scan on B

We can make this better by using a merge or hash join, but it still
requires scanning all of both input relations.  If A is very small and B is
very large, but there is an index on B.Y, it can be enormously better to do
something like this:

	NestLoop
		-> Seq Scan on A
		-> Index Scan using B_Y_IDX on B
			Index Condition: B.Y = A.X

Here, we are expecting that for each row scanned from A, the nestloop
plan node will pass down the current value of A.X into the scan of B.
That allows the indexscan to treat A.X as a constant for any one
invocation, and thereby use it as an index key.  This is the only plan type
that can avoid fetching all of B, and for small numbers of rows coming from
A, that will dominate every other consideration.  (As A gets larger, this
gets less attractive, and eventually a merge or hash join will win instead.
So we have to cost out all the alternatives to decide what to do.)

It can be useful for the parameter value to be passed down through
intermediate layers of joins, for example:

	NestLoop
		-> Seq Scan on A
		Hash Join
			Join Condition: B.Y = C.W
			-> Seq Scan on B
			-> Index Scan using C_Z_IDX on C
				Index Condition: C.Z = A.X

If all joins are plain inner joins then this is usually unnecessary,
because it's possible to reorder the joins so that a parameter is used
immediately below the nestloop node that provides it.  But in the
presence of outer joins, such join reordering may not be possible.

Also, the bottom-level scan might require parameters from more than one
other relation.  In principle we could join the other relations first
so that all the parameters are supplied from a single nestloop level.
But if those other relations have no join clause in common (which is
common in star-schema queries for instance), the planner won't consider
joining them directly to each other.  In such a case we need to be able
to create a plan like

    NestLoop
        -> Seq Scan on SmallTable1 A
        NestLoop
            -> Seq Scan on SmallTable2 B
            -> Index Scan using XYIndex on LargeTable C
                 Index Condition: C.X = A.AID and C.Y = B.BID

so we should be willing to pass down A.AID through a join even though
there is no join order constraint forcing the plan to look like this.

Before version 9.2, Postgres used ad-hoc methods for planning and
executing nestloop queries of this kind, and those methods could not
handle passing parameters down through multiple join levels.

To plan such queries, we now use a notion of a "parameterized path",
which is a path that makes use of a join clause to a relation that's not
scanned by the path.  In the example two above, we would construct a
path representing the possibility of doing this:

	-> Index Scan using C_Z_IDX on C
		Index Condition: C.Z = A.X

This path will be marked as being parameterized by relation A.  (Note that
this is only one of the possible access paths for C; we'd still have a
plain unparameterized seqscan, and perhaps other possibilities.)  The
parameterization marker does not prevent joining the path to B, so one of
the paths generated for the joinrel {B C} will represent

	Hash Join
		Join Condition: B.Y = C.W
		-> Seq Scan on B
		-> Index Scan using C_Z_IDX on C
			Index Condition: C.Z = A.X

This path is still marked as being parameterized by A.  When we attempt to
join {B C} to A to form the complete join tree, such a path can only be
used as the inner side of a nestloop join: it will be ignored for other
possible join types.  So we will form a join path representing the query
plan shown above, and it will compete in the usual way with paths built
from non-parameterized scans.

While all ordinary paths for a particular relation generate the same set
of rows (since they must all apply the same set of restriction clauses),
parameterized paths typically generate fewer rows than less-parameterized
paths, since they have additional clauses to work with.  This means we
must consider the number of rows generated as an additional figure of
merit.  A path that costs more than another, but generates fewer rows,
must be kept since the smaller number of rows might save work at some
intermediate join level.  (It would not save anything if joined
immediately to the source of the parameters.)

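A rough Python sketch of this bookkeeping follows; the type and field names
are invented approximations of the concepts, not the planner's API.  It
shows the two key points: a parameterized path can be joined only when the
outer side supplies the rels it needs, and only as the inner side of a
nestloop.

    from dataclasses import dataclass

    @dataclass
    class ToyPath:
        rel: frozenset              # base rels scanned by this path
        required_outer: frozenset   # rels that must supply parameters (empty if none)
        rows: float
        total_cost: float

    def can_join(outer, inner, jointype='nestloop'):
        if inner.required_outer - outer.rel:
            return False            # needed parameters are not available here
        if inner.required_outer and jointype != 'nestloop':
            return False            # only a nestloop can pass down parameters
        return not (outer.required_outer & inner.rel)   # no circular dependency

    b_seq = ToyPath(frozenset({'B'}), frozenset(), rows=1_000_000, total_cost=15000)
    b_idx = ToyPath(frozenset({'B'}), frozenset({'A'}), rows=10, total_cost=40)
    a_seq = ToyPath(frozenset({'A'}), frozenset(), rows=100, total_cost=5)
    print(can_join(a_seq, b_idx))                    # True: nestloop with inner indexscan
    print(can_join(a_seq, b_idx, jointype='hash'))   # False: must be a nestloop inner
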
To keep cost estimation rules relatively simple, we make an implementation
restriction that all paths for a given relation of the same parameterization
(i.e., the same set of outer relations supplying parameters) must have the
same rowcount estimate.  This is justified by insisting that each such path
apply *all* join clauses that are available with the named outer relations.
Different paths might, for instance, choose different join clauses to use
as index clauses; but they must then apply any other join clauses available
from the same outer relations as filter conditions, so that the set of rows
returned is held constant.  This restriction doesn't degrade the quality of
the finished plan: it amounts to saying that we should always push down
movable join clauses to the lowest possible evaluation level, which is a
good thing anyway.  The restriction is useful in particular to support
pre-filtering of join paths in add_path_precheck.  Without this rule we
could never reject a parameterized path in advance of computing its rowcount
estimate, which would greatly reduce the value of the pre-filter mechanism.

To limit planning time, we have to avoid generating an unreasonably large
number of parameterized paths.  We do this by only generating parameterized
relation scan paths for index scans, and then only for indexes for which
suitable join clauses are available.  There are also heuristics in join
planning that try to limit the number of parameterized paths considered.

In particular, there's been a deliberate policy decision to favor hash
joins over merge joins for parameterized join steps (those occurring below
a nestloop that provides parameters to the lower join's inputs).  While we
do not ignore merge joins entirely, joinpath.c does not fully explore the
space of potential merge joins with parameterized inputs.  Also, add_path
treats parameterized paths as having no pathkeys, so that they compete
only on cost and rowcount; they don't get preference for producing a
special sort order.  This creates additional bias against merge joins,
since we might discard a path that could have been useful for performing
a merge without an explicit sort step.  Since a parameterized path must
ultimately be used on the inside of a nestloop, where its sort order is
uninteresting, these choices do not affect any requirement for the final
output order of a query --- they only make it harder to use a merge join
at a lower level.  The savings in planning work justifies that.

Similarly, parameterized paths do not normally get preference in add_path
for having cheap startup cost; that's seldom of much value when on the
inside of a nestloop, so it seems not worth keeping extra paths solely for
that.  An exception occurs for parameterized paths for the RHS relation of
a SEMI or ANTI join: in those cases, we can stop the inner scan after the
first match, so it's primarily startup not total cost that we care about.


LATERAL subqueries
------------------

As of 9.3 we support SQL-standard LATERAL references from subqueries in
FROM (and also functions in FROM).  The planner implements these by
generating parameterized paths for any RTE that contains lateral
references.  In such cases, *all* paths for that relation will be
parameterized by at least the set of relations used in its lateral
references.  (And in turn, join relations including such a subquery might
not have any unparameterized paths.)  All the other comments made above for
parameterized paths still apply, though; in particular, each such path is
still expected to enforce any join clauses that can be pushed down to it,
so that all paths of the same parameterization have the same rowcount.

We also allow LATERAL subqueries to be flattened (pulled up into the parent
query) by the optimizer, but only when this does not introduce lateral
references into JOIN/ON quals that would refer to relations outside the
lowest outer join at/above that qual.  The semantics of such a qual would
be unclear.  Note that even with this restriction, pullup of a LATERAL
subquery can result in creating PlaceHolderVars that contain lateral
references to relations outside their syntactic scope.  We still evaluate
such PHVs at their syntactic location or lower, but the presence of such a
PHV in the quals or targetlist of a plan node requires that node to appear
on the inside of a nestloop join relative to the rel(s) supplying the
lateral reference.  (Perhaps now that that stuff works, we could relax the
pullup restriction?)


Security-level constraints on qual clauses
------------------------------------------

To support row-level security and security-barrier views efficiently,
we mark qual clauses (RestrictInfo nodes) with a "security_level" field.
The basic concept is that a qual with a lower security_level must be
evaluated before one with a higher security_level.  This ensures that
"leaky" quals that might expose sensitive data are not evaluated until
after the security barrier quals that are supposed to filter out
security-sensitive rows.  However, many qual conditions are "leakproof",
that is we trust the functions they use to not expose data.  To avoid
unnecessarily inefficient plans, a leakproof qual is not delayed by
security-level considerations, even if it has a higher syntactic
security_level than another qual.

In a query that contains no use of RLS or security-barrier views, all
quals will have security_level zero, so that none of these restrictions
kick in; we don't even need to check leakproofness of qual conditions.

If there are security-barrier quals, they get security_level zero (and
possibly higher, if there are multiple layers of barriers).  Regular quals
coming from the query text get a security_level one more than the highest
level used for barrier quals.

When new qual clauses are generated by EquivalenceClass processing,
they must be assigned a security_level.  This is trickier than it seems.
One's first instinct is that it would be safe to use the largest level
found among the source quals for the EquivalenceClass, but that isn't
safe at all, because it allows unwanted delays of security-barrier quals.
Consider a barrier qual "t.x = t.y" plus a query qual "t.x = constant",
and suppose there is another query qual "leaky_function(t.z)" that
we mustn't evaluate before the barrier qual has been checked.
We will have an EC {t.x, t.y, constant} which will lead us to replace
the EC quals with "t.x = constant AND t.y = constant".  (We do not want
to give up that behavior, either, since the latter condition could allow
use of an index on t.y, which we would never discover from the original
quals.)  If these generated quals are assigned the same security_level as
the query quals, then it's possible for the leaky_function qual to be
evaluated first, allowing leaky_function to see data from rows that
possibly don't pass the barrier condition.

Instead, our handling of security levels with ECs works like this:
* Quals are not accepted as source clauses for ECs in the first place
unless they are leakproof or have security_level zero.
* EC-derived quals are assigned the minimum (not maximum) security_level
found among the EC's source clauses.
* If the maximum security_level found among the EC's source clauses is
above zero, then the equality operators selected for derived quals must
be leakproof.  When no such operator can be found, the EC is treated as
"broken" and we fall back to emitting its source clauses without any
additional derived quals.

These rules together ensure that an untrusted qual clause (one with
security_level above zero) cannot cause an EC to generate a leaky derived
clause.  This makes it safe to use the minimum not maximum security_level
for derived clauses.  The rules could result in poor plans due to not
being able to generate derived clauses at all, but the risk of that is
small in practice because most btree equality operators are leakproof.
Also, by making exceptions for level-zero quals, we ensure that there is
no plan degradation when no barrier quals are present.

Once we have security levels assigned to all clauses, enforcement
of barrier-qual ordering restrictions boils down to two rules:

* Table scan plan nodes must not select quals for early execution
(for example, use them as index qualifiers in an indexscan) unless
they are leakproof or have security_level no higher than any other
qual that is due to be executed at the same plan node.  (Use the
utility function restriction_is_securely_promotable() to check
whether it's okay to select a qual for early execution.)

* Normal execution of a list of quals must execute them in an order
that satisfies the same security rule, ie higher security_levels must
be evaluated later unless leakproof.  (This is handled in a single place
by order_qual_clauses() in createplan.c.)

order_qual_clauses() uses a heuristic to decide exactly what to do with
leakproof clauses.  Normally it sorts clauses by security_level then cost,
being careful that the sort is stable so that we don't reorder clauses
without a clear reason.  But this could result in a very expensive qual
being done before a cheaper one that is of higher security_level.
If the cheaper qual is leaky we have no choice, but if it is leakproof
we could put it first.  We choose to sort leakproof quals as if they
have security_level zero, but only when their cost is less than 10X
cpu_operator_cost; that restriction alleviates the opposite problem of
doing expensive quals first just because they're leakproof.

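That heuristic condenses to a sort key, roughly as in this Python sketch (a
paraphrase of the rule, not the code in createplan.c; the qual names other
than leaky_function are invented, and cpu_operator_cost's default of 0.0025
is shown as an explicit parameter).

    def order_quals(quals, cpu_operator_cost=0.0025):
        """quals: list of dicts with 'security_level', 'cost', 'leakproof'.
        Python's sort is stable, so ties keep their original order."""
        def effective_level(q):
            cheap = q['cost'] < 10 * cpu_operator_cost
            return 0 if (q['leakproof'] and cheap) else q['security_level']
        return sorted(quals, key=lambda q: (effective_level(q), q['cost']))

    quals = [
        {'name': 'leaky_function(t.z)', 'security_level': 1, 'cost': 0.0025, 'leakproof': False},
        {'name': 't.x = t.y',           'security_level': 0, 'cost': 0.0025, 'leakproof': False},
        {'name': 'cheap_safe(t.w)',     'security_level': 1, 'cost': 0.0025, 'leakproof': True},
    ]
    print([q['name'] for q in order_quals(quals)])
    # ['t.x = t.y', 'cheap_safe(t.w)', 'leaky_function(t.z)'] -- barrier qual first,
    # the cheap leakproof qual promoted to level zero, the leaky qual last
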
982Additional rules will be needed to support safe handling of join quals
983when there is a mix of security levels among join quals; for example, it
984will be necessary to prevent leaky higher-security-level quals from being
985evaluated at a lower join level than other quals of lower security level.
986Currently there is no need to consider that since security-prioritized
987quals can only be single-table restriction quals coming from RLS policies
988or security-barrier views, and security-barrier view subqueries are never
989flattened into the parent query.  Hence enforcement of security-prioritized
990quals only happens at the table scan level.  With extra rules for safe
991handling of security levels among join quals, it should be possible to let
992security-barrier views be flattened into the parent query, allowing more
993flexibility of planning while still preserving required ordering of qual
994evaluation.  But that will come later.
995
996
Post scan/join planning
-----------------------

So far we have discussed only scan/join planning, that is, implementation
of the FROM and WHERE clauses of a SQL query.  But the planner must also
determine how to deal with GROUP BY, aggregation, and other higher-level
features of queries; and in many cases there are multiple ways to do these
steps and thus opportunities for optimization choices.  These steps, like
scan/join planning, are handled by constructing Paths representing the
different ways to do a step, then choosing the cheapest Path.

Since all Paths require a RelOptInfo as "parent", we create RelOptInfos
representing the outputs of these upper-level processing steps.  These
RelOptInfos are mostly dummy, but their pathlist lists hold all the Paths
considered useful for each step.  Currently, we may create these types of
additional RelOptInfos during upper-level planning:

UPPERREL_SETOP              result of UNION/INTERSECT/EXCEPT, if any
UPPERREL_PARTIAL_GROUP_AGG  result of partial grouping/aggregation, if any
UPPERREL_GROUP_AGG          result of grouping/aggregation, if any
UPPERREL_WINDOW             result of window functions, if any
UPPERREL_DISTINCT           result of "SELECT DISTINCT", if any
UPPERREL_ORDERED            result of ORDER BY, if any
UPPERREL_FINAL              result of any remaining top-level actions

UPPERREL_FINAL is used to represent any final processing steps, currently
LockRows (SELECT FOR UPDATE), LIMIT/OFFSET, and ModifyTable.  There is no
flexibility about the order in which these steps are done, and thus no need
to subdivide this stage more finely.

These "upper relations" are identified by the UPPERREL enum values shown
above, plus a relids set, which allows there to be more than one upperrel
of the same kind.  We use NULL for the relids if there's no need for more
than one upperrel of the same kind.  Currently, in fact, the relids set
is vestigial because it's always NULL, but that's expected to change in
the future.  For example, in planning set operations, we might need the
relids to denote which subset of the leaf SELECTs has been combined in a
particular group of Paths that are competing with each other.

The result of subquery_planner() is always returned as a set of Paths
stored in the UPPERREL_FINAL rel with NULL relids.  The other types of
upperrels are created only if needed for the particular query.

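To illustrate the lookup-or-create behavior of these upperrels, here is a
stripped-down standalone sketch.  It is not the real fetch_upper_rel()
implementation, and the relids argument is omitted entirely because, as
noted above, it is currently always NULL:

    #include <stdio.h>
    #include <stdlib.h>

    typedef enum UpperRelationKind
    {
        UPPERREL_SETOP,
        UPPERREL_PARTIAL_GROUP_AGG,
        UPPERREL_GROUP_AGG,
        UPPERREL_WINDOW,
        UPPERREL_DISTINCT,
        UPPERREL_ORDERED,
        UPPERREL_FINAL,
        NUM_UPPERREL_KINDS
    } UpperRelationKind;

    /* Stand-in for the mostly-dummy upper-rel RelOptInfo. */
    typedef struct UpperRel
    {
        UpperRelationKind kind;
        /* the real structure's pathlist of candidate Paths would live here */
    } UpperRel;

    static UpperRel *upper_rels[NUM_UPPERREL_KINDS];

    /* Return the upperrel of the given kind, creating it only on first use. */
    static UpperRel *
    fetch_upper(UpperRelationKind kind)
    {
        if (upper_rels[kind] == NULL)
        {
            upper_rels[kind] = calloc(1, sizeof(UpperRel));
            upper_rels[kind]->kind = kind;
        }
        return upper_rels[kind];
    }

    int
    main(void)
    {
        /* A simple GROUP BY query might touch only these two upperrels. */
        fetch_upper(UPPERREL_GROUP_AGG);
        fetch_upper(UPPERREL_FINAL);

        for (int k = 0; k < NUM_UPPERREL_KINDS; k++)
            printf("kind %d: %s\n", k,
                   upper_rels[k] ? "created" : "not needed");
        return 0;
    }
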

Parallel Query and Partial Paths
--------------------------------

Parallel query involves dividing up the work that needs to be performed
either by an entire query or some portion of the query in such a way that
some of that work can be done by one or more worker processes, which are
called parallel workers.  Parallel workers are a subtype of dynamic
background workers; see src/backend/access/transam/README.parallel for a
fuller description.  The academic literature on parallel query suggests
that parallel execution strategies can be divided into essentially two
categories: pipelined parallelism, where the execution of the query is
divided into multiple stages and each stage is handled by a separate
process; and partitioning parallelism, where the data is split between
multiple processes and each process handles a subset of it.  The
literature, however, suggests that gains from pipeline parallelism are
often very limited due to the difficulty of avoiding pipeline stalls.
Consequently, we do not currently attempt to generate query plans that
use this technique.

Instead, we focus on partitioning parallelism, which does not require
that the underlying table be partitioned.  It only requires that (1)
there is some method of dividing the data from at least one of the base
tables involved in the relation across multiple processes, (2) each
process can handle its own portion of the data, and then (3) the results
can be collected.  Requirements (2) and (3) are satisfied by the
executor node Gather (or GatherMerge), which launches any number of worker
processes and executes its single child plan in all of them, and perhaps
in the leader also, if the children aren't generating enough data to keep
the leader busy.  Requirement (1) is handled by the table scan node: when
invoked with parallel_aware = true, this node will, in effect, partition
the table on a block-by-block basis, returning a subset of the tuples from
the relation in each worker where that scan node is executed.

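The block-by-block division of a scan can be modeled outside the server
with nothing more than a shared counter.  The standalone sketch below (an
illustration only, not PostgreSQL code) uses threads in place of parallel
workers and a final join loop in place of the Gather step; each worker
atomically claims the next unscanned block, so every block is handed out
exactly once across all workers:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define NBLOCKS  1000
    #define NWORKERS 4

    static atomic_int next_block;              /* shared scan position */
    static long blocks_scanned[NWORKERS];      /* per-worker statistics */

    static void *
    worker(void *arg)
    {
        int id = *(int *) arg;
        int blkno;

        /* Claim blocks until the whole table has been handed out. */
        while ((blkno = atomic_fetch_add(&next_block, 1)) < NBLOCKS)
        {
            (void) blkno;          /* the tuples of block blkno would be
                                    * returned from this worker here */
            blocks_scanned[id]++;
        }
        return NULL;
    }

    int
    main(void)
    {
        pthread_t workers[NWORKERS];
        int       ids[NWORKERS];
        long      total = 0;

        for (int i = 0; i < NWORKERS; i++)
        {
            ids[i] = i;
            pthread_create(&workers[i], NULL, worker, &ids[i]);
        }
        for (int i = 0; i < NWORKERS; i++)
        {
            pthread_join(workers[i], NULL);    /* "Gather" the results */
            total += blocks_scanned[i];
        }
        printf("blocks scanned by all workers: %ld (table has %d)\n",
               total, NBLOCKS);
        return 0;
    }
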
Just as we do for non-parallel access methods, we build Paths to
represent access strategies that can be used in a parallel plan.  These
are, in essence, the same strategies that are available in the
non-parallel plan, but there is an important difference: a path that
will run beneath a Gather node returns only a subset of the query
results in each worker, not all of them.  To form a path that can
actually be executed, the (rather large) cost of the Gather node must be
accounted for.  For this reason among others, paths intended to run
beneath a Gather node - which we call "partial" paths since they return
only a subset of the results in each worker - must be kept separate from
ordinary paths (see RelOptInfo's partial_pathlist and the function
add_partial_path).

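Some back-of-the-envelope arithmetic shows why a partial path cannot be
compared against an ordinary path until the Gather is added.  The scan
costs and row count below are invented; only the default settings of
parallel_setup_cost (1000) and parallel_tuple_cost (0.1) are taken from
the real cost model:

    #include <stdio.h>

    int
    main(void)
    {
        double rows = 1000000.0;               /* rows returned by the scan */
        double serial_cost = 10000.0;          /* invented: ordinary seqscan */
        double partial_cost = 3000.0;          /* invented: per-worker share */
        double parallel_setup_cost = 1000.0;   /* default GUC value */
        double parallel_tuple_cost = 0.1;      /* default GUC value */

        /* A partial path only becomes executable once a Gather is added. */
        double gathered_cost = partial_cost + parallel_setup_cost +
            parallel_tuple_cost * rows;

        printf("serial: %.0f  partial: %.0f  partial + Gather: %.0f\n",
               serial_cost, partial_cost, gathered_cost);
        return 0;
    }

Per worker the partial path looks much cheaper than the serial path, but
once the cost of funneling every row through the Gather is added, the
parallel plan loses in this made-up example.  Keeping partial paths in
their own list defers that comparison until the Gather has been costed.
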
One of the keys to making parallel query effective is to run as much of
the query in parallel as possible.  Therefore, we expect it to generally
be desirable to postpone the Gather stage until as near to the top of the
plan as possible.  Expanding the range of cases in which more work can be
pushed below the Gather (and costing them accurately) is likely to keep us
busy for a long time to come.

Partitionwise joins
-------------------

A join between two similarly partitioned tables can be broken down into
joins between their matching partitions if there exists an equi-join
condition between the partition keys of the joining tables.  The equi-join
between partition keys implies that all join partners for a given row in
one partitioned table must be in the corresponding partition of the other
partitioned table.  Because of this, the join between partitioned tables
can be broken down into joins between the matching partitions.  The
resultant join is partitioned in the same way as the joining relations,
thus allowing an N-way join between similarly partitioned tables having an
equi-join condition between their partition keys to be broken down into
N-way joins between their matching partitions.  This technique of breaking
down a join between partitioned tables into joins between their partitions
is called partitionwise join.  We will use the term "partitioned relation"
for either a partitioned table or a join between compatibly partitioned
tables.

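The claim that an equi-join on the partition key can be computed partition
by partition is easy to check with a toy example.  In the standalone sketch
below (not planner code), two integer "tables" are hash-partitioned on the
join key by key % NPARTS; joining them as a whole and joining matching
partitions and then appending the results produce the same number of rows:

    #include <stdio.h>

    #define NPARTS 4
    #define LEN(x) ((int) (sizeof(x) / sizeof((x)[0])))

    /* Two toy "tables", both partitioned on the join key by key % NPARTS. */
    static const int a[] = {1, 2, 3, 4, 5, 6, 7, 8};
    static const int b[] = {2, 4, 4, 6, 8, 8, 10, 12};

    /*
     * Count equi-join matches; part = -1 joins the whole tables, part >= 0
     * joins only the rows that fall into that partition on both sides.
     */
    static int
    join_count(int part)
    {
        int matches = 0;

        for (int i = 0; i < LEN(a); i++)
        {
            if (part >= 0 && a[i] % NPARTS != part)
                continue;
            for (int j = 0; j < LEN(b); j++)
            {
                if (part >= 0 && b[j] % NPARTS != part)
                    continue;
                if (a[i] == b[j])
                    matches++;
            }
        }
        return matches;
    }

    int
    main(void)
    {
        int whole = join_count(-1);
        int partitionwise = 0;

        /* Join each partition of A with the matching partition of B, append. */
        for (int p = 0; p < NPARTS; p++)
            partitionwise += join_count(p);

        printf("whole-table join: %d rows, partitionwise join: %d rows\n",
               whole, partitionwise);
        return 0;
    }
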
Even if the joining relations don't have exactly the same partition bounds,
partitionwise join can still be applied by using an advanced
partition-matching algorithm.  For each of the joining relations, the
algorithm checks that every partition matches at most one partition of the
other joining relation.  In such a case the join between the joining
relations can be broken down into joins between the matching partitions.
The join relation can then be considered partitioned.  The algorithm
produces the pairs of the matching partitions, plus the partition bounds
for the join relation, to allow partitionwise join for computing the join.
The algorithm is implemented in partition_bounds_merge().  For an N-way
join relation considered partitioned this way, not every pair of joining
relations can use partitionwise join.  For example:

    (A leftjoin B on (Pab)) innerjoin C on (Pac)

where A, B, and C are partitioned tables, and A has an extra partition
compared to B and C.  When considering partitionwise join for the join
{A B}, the extra partition of A doesn't have a matching partition on the
nullable side, which is the case that the current implementation of
partitionwise join can't handle.  So {A B} is not considered partitioned,
and the pair of {A B} and C considered for the 3-way join can't use
partitionwise join.  On the other hand, the pair of {A C} and B can use
partitionwise join because {A C} is considered partitioned by eliminating
the extra partition (see identity 1 on outer join reordering).  Whether an
N-way join can use partitionwise join is determined based on the first
pair of joining relations that are both partitioned and can use
partitionwise join.

The partitioning properties of a partitioned relation are stored in its
RelOptInfo.  Information about the data types of the partition keys is
stored in a PartitionSchemeData structure.  The planner maintains a list of
canonical partition schemes (distinct PartitionSchemeData objects) so that
the RelOptInfos of any two partitioned relations with the same partitioning
scheme point to the same PartitionSchemeData object.  This reduces memory
consumed by PartitionSchemeData objects and makes it easy to compare the
partition schemes of joining relations.

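The canonical list of partition schemes is essentially an interning scheme:
equal schemes are collapsed onto a single shared object, after which scheme
comparisons reduce to pointer comparisons.  A minimal standalone sketch of
that idea follows; the PartScheme fields are invented stand-ins, not the
real PartitionSchemeData layout:

    #include <stdio.h>
    #include <string.h>

    typedef struct PartScheme
    {
        char strategy;          /* 'r'ange, 'l'ist or 'h'ash */
        int  partnatts;         /* number of partition key columns */
        int  key_types[4];      /* toy identifiers for the key column types */
    } PartScheme;

    static PartScheme canon[16];
    static int        ncanon;

    /* Return the canonical object for this scheme, adding it if it is new. */
    static PartScheme *
    canonical_scheme(const PartScheme *s)
    {
        for (int i = 0; i < ncanon; i++)
            if (canon[i].strategy == s->strategy &&
                canon[i].partnatts == s->partnatts &&
                memcmp(canon[i].key_types, s->key_types,
                       sizeof(int) * s->partnatts) == 0)
                return &canon[i];
        canon[ncanon] = *s;
        return &canon[ncanon++];
    }

    int
    main(void)
    {
        PartScheme t1 = {'r', 1, {42}};    /* range partitioning, one column */
        PartScheme t2 = {'r', 1, {42}};    /* a second table, same scheme */

        /* Same scheme => same pointer, so comparing schemes is trivial. */
        printf("schemes shared: %s\n",
               canonical_scheme(&t1) == canonical_scheme(&t2) ? "yes" : "no");
        return 0;
    }
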
Partitionwise aggregates/grouping
---------------------------------

If the GROUP BY clause contains all of the partition keys, all the rows
that belong to a given group must come from a single partition; therefore,
aggregation can be done completely separately for each partition.
Otherwise, partial aggregates can be computed for each partition, and then
finalized after appending the results from the individual partitions.  This
technique of breaking down aggregation or grouping over a partitioned
relation into aggregation or grouping over its partitions is called
partitionwise aggregation.  Especially when the partition keys match the
GROUP BY clause, this can be significantly faster than the regular method.
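
Both cases can be illustrated with a small standalone sketch (invented
data, not executor code).  Here the grouping column is not the partition
key, so each partition produces partial states (a sum and a count per
group) that are combined across partitions and only then finalized; if the
grouping column were the partition key, the combine step would be
unnecessary because each partition's aggregate would already be final:

    #include <stdio.h>

    #define NPARTS  2
    #define NGROUPS 2

    typedef struct AggState
    {
        double sum;
        long   count;
    } AggState;

    /* Toy rows, already divided into partitions: (group, value) pairs. */
    static const struct { int grp; double val; } rows[NPARTS][3] = {
        {{0, 1.0}, {1, 2.0}, {0, 3.0}},    /* partition 0 */
        {{1, 4.0}, {0, 5.0}, {1, 6.0}},    /* partition 1 */
    };

    int
    main(void)
    {
        AggState partial[NPARTS][NGROUPS] = {{{0}}};
        AggState combined[NGROUPS] = {{0}};

        /* Step 1: partial aggregation, done independently per partition. */
        for (int p = 0; p < NPARTS; p++)
            for (int r = 0; r < 3; r++)
            {
                partial[p][rows[p][r].grp].sum += rows[p][r].val;
                partial[p][rows[p][r].grp].count++;
            }

        /* Step 2: append the partitions' partial states and combine them. */
        for (int p = 0; p < NPARTS; p++)
            for (int g = 0; g < NGROUPS; g++)
            {
                combined[g].sum += partial[p][g].sum;
                combined[g].count += partial[p][g].count;
            }

        /* Step 3: finalize, here turning (sum, count) into an average. */
        for (int g = 0; g < NGROUPS; g++)
            printf("group %d: avg = %.2f\n", g,
                   combined[g].sum / combined[g].count);
        return 0;
    }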