<!-- doc/src/sgml/perform.sgml -->

<chapter id="performance-tips">
 <title>Performance Tips</title>

 <indexterm zone="performance-tips">
  <primary>performance</primary>
 </indexterm>

 <para>
  Query performance can be affected by many things. Some of these can
  be controlled by the user, while others are fundamental to the underlying
  design of the system.  This chapter provides some hints about understanding
  and tuning <productname>PostgreSQL</productname> performance.
 </para>

<sect1 id="using-explain">
 <title>Using <command>EXPLAIN</command></title>

  <indexterm zone="using-explain">
   <primary>EXPLAIN</primary>
  </indexterm>

  <indexterm zone="using-explain">
   <primary>query plan</primary>
  </indexterm>

  <para>
   <productname>PostgreSQL</productname> devises a <firstterm>query
   plan</firstterm> for each query it receives.  Choosing the right
   plan to match the query structure and the properties of the data
   is absolutely critical for good performance, so the system includes
   a complex <firstterm>planner</firstterm> that tries to choose good plans.
   You can use the <xref linkend="sql-explain"/> command
   to see what query plan the planner creates for any query.
   Plan-reading is an art that requires some experience to master,
   but this section attempts to cover the basics.
  </para>

  <para>
   Examples in this section are drawn from the regression test database
   after doing a <command>VACUUM ANALYZE</command>, using 9.3 development sources.
   You should be able to get similar results if you try the examples
   yourself, but your estimated costs and row counts might vary slightly
   because <command>ANALYZE</command>'s statistics are random samples rather
   than exact, and because costs are inherently somewhat platform-dependent.
  </para>

  <para>
   The examples use <command>EXPLAIN</command>'s default <quote>text</quote> output
   format, which is compact and convenient for humans to read.
   If you want to feed <command>EXPLAIN</command>'s output to a program for further
   analysis, you should use one of its machine-readable output formats
   (XML, JSON, or YAML) instead.
  </para>
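
  <para>
   For instance, JSON output can be requested like this (a brief
   illustration, not one of the worked examples below):

<screen>
EXPLAIN (FORMAT JSON) SELECT * FROM tenk1;
</screen>

   <literal>FORMAT XML</literal> and <literal>FORMAT YAML</literal> are
   requested the same way.
  </para>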

 <sect2 id="using-explain-basics">
  <title><command>EXPLAIN</command> Basics</title>

  <para>
   The structure of a query plan is a tree of <firstterm>plan nodes</firstterm>.
   Nodes at the bottom level of the tree are scan nodes: they return raw rows
   from a table.  There are different types of scan nodes for different
   table access methods: sequential scans, index scans, and bitmap index
   scans.  There are also non-table row sources, such as <literal>VALUES</literal>
   clauses and set-returning functions in <literal>FROM</literal>, which have their
   own scan node types.
   If the query requires joining, aggregation, sorting, or other
   operations on the raw rows, then there will be additional nodes
   above the scan nodes to perform these operations.  Again,
   there is usually more than one possible way to do these operations,
   so different node types can appear here too.  The output
   of <command>EXPLAIN</command> has one line for each node in the plan
   tree, showing the basic node type plus the cost estimates that the planner
   made for the execution of that plan node.  Additional lines might appear,
   indented from the node's summary line,
   to show additional properties of the node.
   The very first line (the summary line for the topmost
   node) has the estimated total execution cost for the plan; it is this
   number that the planner seeks to minimize.
  </para>

  <para>
   Here is a trivial example, just to show what the output looks like:

<screen>
EXPLAIN SELECT * FROM tenk1;

                         QUERY PLAN
-------------------------------------------------------------
 Seq Scan on tenk1  (cost=0.00..458.00 rows=10000 width=244)
</screen>
  </para>

  <para>
   Since this query has no <literal>WHERE</literal> clause, it must scan all the
   rows of the table, so the planner has chosen to use a simple sequential
   scan plan.  The numbers that are quoted in parentheses are (left
   to right):

   <itemizedlist>
    <listitem>
     <para>
      Estimated start-up cost.  This is the time expended before the output
      phase can begin, e.g., time to do the sorting in a sort node.
     </para>
    </listitem>

    <listitem>
     <para>
      Estimated total cost.  This is stated on the assumption that the plan
      node is run to completion, i.e., all available rows are retrieved.
      In practice a node's parent node might stop short of reading all
      available rows (see the <literal>LIMIT</literal> example below).
     </para>
    </listitem>

    <listitem>
     <para>
      Estimated number of rows output by this plan node.  Again, the node
      is assumed to be run to completion.
     </para>
    </listitem>

    <listitem>
     <para>
      Estimated average width of rows output by this plan node (in bytes).
     </para>
    </listitem>
   </itemizedlist>
  </para>

  <para>
   The costs are measured in arbitrary units determined by the planner's
   cost parameters (see <xref linkend="runtime-config-query-constants"/>).
   Traditional practice is to measure the costs in units of disk page
   fetches; that is, <xref linkend="guc-seq-page-cost"/> is conventionally
   set to <literal>1.0</literal> and the other cost parameters are set relative
   to that.  The examples in this section are run with the default cost
   parameters.
  </para>

  <para>
   It's important to understand that the cost of an upper-level node includes
   the cost of all its child nodes.  It's also important to realize that
   the cost only reflects things that the planner cares about.
   In particular, the cost does not consider the time spent transmitting
   result rows to the client, which could be an important
   factor in the real elapsed time; but the planner ignores it because
   it cannot change it by altering the plan.  (Every correct plan will
   output the same row set, we trust.)
  </para>

  <para>
   The <literal>rows</literal> value is a little tricky because it is
   not the number of rows processed or scanned by the
   plan node, but rather the number emitted by the node.  This is often
   less than the number scanned, as a result of filtering by any
   <literal>WHERE</literal>-clause conditions that are being applied at the node.
   Ideally the top-level rows estimate will approximate the number of rows
   actually returned, updated, or deleted by the query.
  </para>

  <para>
   Returning to our example:

<screen>
EXPLAIN SELECT * FROM tenk1;

                         QUERY PLAN
-------------------------------------------------------------
 Seq Scan on tenk1  (cost=0.00..458.00 rows=10000 width=244)
</screen>
  </para>

  <para>
   These numbers are derived very straightforwardly.  If you do:

<programlisting>
SELECT relpages, reltuples FROM pg_class WHERE relname = 'tenk1';
</programlisting>

   you will find that <classname>tenk1</classname> has 358 disk
   pages and 10000 rows.  The estimated cost is computed as (disk pages read *
   <xref linkend="guc-seq-page-cost"/>) + (rows scanned *
   <xref linkend="guc-cpu-tuple-cost"/>).  By default,
   <varname>seq_page_cost</varname> is 1.0 and <varname>cpu_tuple_cost</varname> is 0.01,
   so the estimated cost is (358 * 1.0) + (10000 * 0.01) = 458.
  </para>
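
  <para>
   If you want to cross-check this arithmetic, the same figure can be
   computed directly from the statistics row (assuming the default cost
   settings mentioned above):

<screen>
SELECT relpages * 1.0 + reltuples * 0.01 AS estimated_cost
FROM pg_class WHERE relname = 'tenk1';
</screen>

   With the statistics shown above, this yields 458, matching the plan's
   total cost.
  </para>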

  <para>
   Now let's modify the query to add a <literal>WHERE</literal> condition:

<screen>
EXPLAIN SELECT * FROM tenk1 WHERE unique1 &lt; 7000;

                         QUERY PLAN
------------------------------------------------------------
 Seq Scan on tenk1  (cost=0.00..483.00 rows=7001 width=244)
   Filter: (unique1 &lt; 7000)
</screen>

   Notice that the <command>EXPLAIN</command> output shows the <literal>WHERE</literal>
   clause being applied as a <quote>filter</quote> condition attached to the Seq
   Scan plan node.  This means that
   the plan node checks the condition for each row it scans, and outputs
   only the ones that pass the condition.
   The estimate of output rows has been reduced because of the
   <literal>WHERE</literal> clause.
   However, the scan will still have to visit all 10000 rows, so the cost
   hasn't decreased; in fact it has gone up a bit (by 10000 * <xref
   linkend="guc-cpu-operator-cost"/>, to be exact) to reflect the extra CPU
   time spent checking the <literal>WHERE</literal> condition.
  </para>

  <para>
   The actual number of rows this query would select is 7000, but the <literal>rows</literal>
   estimate is only approximate.  If you try to duplicate this experiment,
   you will probably get a slightly different estimate; moreover, it can
   change after each <command>ANALYZE</command> command, because the
   statistics produced by <command>ANALYZE</command> are taken from a
   randomized sample of the table.
  </para>

  <para>
   Now, let's make the condition more restrictive:

<screen>
EXPLAIN SELECT * FROM tenk1 WHERE unique1 &lt; 100;

                                  QUERY PLAN
------------------------------------------------------------------------------
 Bitmap Heap Scan on tenk1  (cost=5.07..229.20 rows=101 width=244)
   Recheck Cond: (unique1 &lt; 100)
   -&gt;  Bitmap Index Scan on tenk1_unique1  (cost=0.00..5.04 rows=101 width=0)
         Index Cond: (unique1 &lt; 100)
</screen>

   Here the planner has decided to use a two-step plan: the child plan
   node visits an index to find the locations of rows matching the index
   condition, and then the upper plan node actually fetches those rows
   from the table itself.  Fetching rows separately is much more
   expensive than reading them sequentially, but because not all the pages
   of the table have to be visited, this is still cheaper than a sequential
   scan.  (The reason for using two plan levels is that the upper plan
   node sorts the row locations identified by the index into physical order
   before reading them, to minimize the cost of separate fetches.
   The <quote>bitmap</quote> mentioned in the node names is the mechanism that
   does the sorting.)
  </para>

  <para>
   Now let's add another condition to the <literal>WHERE</literal> clause:

<screen>
EXPLAIN SELECT * FROM tenk1 WHERE unique1 &lt; 100 AND stringu1 = 'xxx';

                                  QUERY PLAN
------------------------------------------------------------------------------
 Bitmap Heap Scan on tenk1  (cost=5.04..229.43 rows=1 width=244)
   Recheck Cond: (unique1 &lt; 100)
   Filter: (stringu1 = 'xxx'::name)
   -&gt;  Bitmap Index Scan on tenk1_unique1  (cost=0.00..5.04 rows=101 width=0)
         Index Cond: (unique1 &lt; 100)
</screen>

   The added condition <literal>stringu1 = 'xxx'</literal> reduces the
   output row count estimate, but not the cost because we still have to visit
   the same set of rows.  Notice that the <literal>stringu1</literal> clause
   cannot be applied as an index condition, since this index is only on
   the <literal>unique1</literal> column.  Instead it is applied as a filter on
   the rows retrieved by the index.  Thus the cost has actually gone up
   slightly to reflect this extra checking.
  </para>

  <para>
   In some cases the planner will prefer a <quote>simple</quote> index scan plan:

<screen>
EXPLAIN SELECT * FROM tenk1 WHERE unique1 = 42;

                                 QUERY PLAN
-----------------------------------------------------------------------------
 Index Scan using tenk1_unique1 on tenk1  (cost=0.29..8.30 rows=1 width=244)
   Index Cond: (unique1 = 42)
</screen>

   In this type of plan the table rows are fetched in index order, which
   makes them even more expensive to read, but there are so few that the
   extra cost of sorting the row locations is not worth it.  You'll most
   often see this plan type for queries that fetch just a single row.  It's
   also often used for queries that have an <literal>ORDER BY</literal> condition
   that matches the index order, because then no extra sorting step is needed
   to satisfy the <literal>ORDER BY</literal>.
  </para>

  <para>
   If there are separate indexes on several of the columns referenced
   in <literal>WHERE</literal>, the planner might choose to use an AND or OR
   combination of the indexes:

<screen>
EXPLAIN SELECT * FROM tenk1 WHERE unique1 &lt; 100 AND unique2 &gt; 9000;

                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Bitmap Heap Scan on tenk1  (cost=25.08..60.21 rows=10 width=244)
   Recheck Cond: ((unique1 &lt; 100) AND (unique2 &gt; 9000))
   -&gt;  BitmapAnd  (cost=25.08..25.08 rows=10 width=0)
         -&gt;  Bitmap Index Scan on tenk1_unique1  (cost=0.00..5.04 rows=101 width=0)
               Index Cond: (unique1 &lt; 100)
         -&gt;  Bitmap Index Scan on tenk1_unique2  (cost=0.00..19.78 rows=999 width=0)
               Index Cond: (unique2 &gt; 9000)
</screen>

   But this requires visiting both indexes, so it's not necessarily a win
   compared to using just one index and treating the other condition as
   a filter.  If you vary the ranges involved you'll see the plan change
   accordingly.
  </para>
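
  <para>
   For example, narrowing the <literal>unique2</literal> range makes that
   condition much more selective; a query such as the following (the exact
   plan you get will depend on your statistics) is likely to be planned with
   just the <literal>tenk1_unique2</literal> index, treating the other
   condition as a filter:

<screen>
EXPLAIN SELECT * FROM tenk1 WHERE unique1 &lt; 100 AND unique2 &gt; 9990;
</screen>
  </para>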

  <para>
   Here is an example showing the effects of <literal>LIMIT</literal>:

<screen>
EXPLAIN SELECT * FROM tenk1 WHERE unique1 &lt; 100 AND unique2 &gt; 9000 LIMIT 2;

                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Limit  (cost=0.29..14.48 rows=2 width=244)
   -&gt;  Index Scan using tenk1_unique2 on tenk1  (cost=0.29..71.27 rows=10 width=244)
         Index Cond: (unique2 &gt; 9000)
         Filter: (unique1 &lt; 100)
</screen>
  </para>

  <para>
   This is the same query as above, but we added a <literal>LIMIT</literal> so that
   not all the rows need be retrieved, and the planner changed its mind about
   what to do.  Notice that the total cost and row count of the Index Scan
   node are shown as if it were run to completion.  However, the Limit node
   is expected to stop after retrieving only a fifth of those rows, so its
   total cost is only a fifth as much, and that's the actual estimated cost
   of the query.  This plan is preferred over adding a Limit node to the
   previous plan because the Limit could not avoid paying the startup cost
   of the bitmap scan, so the total cost would be something over 25 units
   with that approach.
  </para>
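
  <para>
   As an informal check of the Limit node's numbers: it is expected to fetch
   2 of the Index Scan's 10 rows, so the startup cost is paid in full and
   only the remaining run cost is scaled down:

<programlisting>
0.29 + (2 / 10) * (71.27 - 0.29) = 14.49  (approximately)
</programlisting>

   which agrees with the 14.48 shown above to within display rounding.
  </para>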

  <para>
   Let's try joining two tables, using the columns we have been discussing:

<screen>
EXPLAIN SELECT *
FROM tenk1 t1, tenk2 t2
WHERE t1.unique1 &lt; 10 AND t1.unique2 = t2.unique2;

                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Nested Loop  (cost=4.65..118.62 rows=10 width=488)
   -&gt;  Bitmap Heap Scan on tenk1 t1  (cost=4.36..39.47 rows=10 width=244)
         Recheck Cond: (unique1 &lt; 10)
         -&gt;  Bitmap Index Scan on tenk1_unique1  (cost=0.00..4.36 rows=10 width=0)
               Index Cond: (unique1 &lt; 10)
   -&gt;  Index Scan using tenk2_unique2 on tenk2 t2  (cost=0.29..7.91 rows=1 width=244)
         Index Cond: (unique2 = t1.unique2)
</screen>
  </para>

  <para>
   In this plan, we have a nested-loop join node with two table scans as
   inputs, or children.  The indentation of the node summary lines reflects
   the plan tree structure.  The join's first, or <quote>outer</quote>, child
   is a bitmap scan similar to those we saw before.  Its cost and row count
   are the same as we'd get from <literal>SELECT ... WHERE unique1 &lt; 10</literal>
   because we are
   applying the <literal>WHERE</literal> clause <literal>unique1 &lt; 10</literal>
   at that node.
   The <literal>t1.unique2 = t2.unique2</literal> clause is not relevant yet,
   so it doesn't affect the row count of the outer scan.  The nested-loop
   join node will run its second,
   or <quote>inner</quote> child once for each row obtained from the outer child.
   Column values from the current outer row can be plugged into the inner
   scan; here, the <literal>t1.unique2</literal> value from the outer row is available,
   so we get a plan and costs similar to what we saw above for a simple
   <literal>SELECT ... WHERE t2.unique2 = <replaceable>constant</replaceable></literal> case.
   (The estimated cost is actually a bit lower than what was seen above,
   as a result of caching that's expected to occur during the repeated
   index scans on <literal>t2</literal>.)  The
   costs of the loop node are then set on the basis of the cost of the outer
   scan, plus one repetition of the inner scan for each outer row (10 * 7.91,
   here), plus a little CPU time for join processing.
  </para>

  <para>
   In this example the join's output row count is the same as the product
   of the two scans' row counts, but that's not true in all cases because
   there can be additional <literal>WHERE</literal> clauses that mention both tables
   and so can only be applied at the join point, not to either input scan.
   Here's an example:

<screen>
EXPLAIN SELECT *
FROM tenk1 t1, tenk2 t2
WHERE t1.unique1 &lt; 10 AND t2.unique2 &lt; 10 AND t1.hundred &lt; t2.hundred;

                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Nested Loop  (cost=4.65..49.46 rows=33 width=488)
   Join Filter: (t1.hundred &lt; t2.hundred)
   -&gt;  Bitmap Heap Scan on tenk1 t1  (cost=4.36..39.47 rows=10 width=244)
         Recheck Cond: (unique1 &lt; 10)
         -&gt;  Bitmap Index Scan on tenk1_unique1  (cost=0.00..4.36 rows=10 width=0)
               Index Cond: (unique1 &lt; 10)
   -&gt;  Materialize  (cost=0.29..8.51 rows=10 width=244)
         -&gt;  Index Scan using tenk2_unique2 on tenk2 t2  (cost=0.29..8.46 rows=10 width=244)
               Index Cond: (unique2 &lt; 10)
</screen>

   The condition <literal>t1.hundred &lt; t2.hundred</literal> can't be
   tested in the <literal>tenk2_unique2</literal> index, so it's applied at the
   join node.  This reduces the estimated output row count of the join node,
   but does not change either input scan.
  </para>

  <para>
   Notice that here the planner has chosen to <quote>materialize</quote> the inner
   relation of the join, by putting a Materialize plan node atop it.  This
   means that the <literal>t2</literal> index scan will be done just once, even
   though the nested-loop join node needs to read that data ten times, once
   for each row from the outer relation.  The Materialize node saves the data
   in memory as it's read, and then returns the data from memory on each
   subsequent pass.
  </para>

  <para>
   When dealing with outer joins, you might see join plan nodes with both
   <quote>Join Filter</quote> and plain <quote>Filter</quote> conditions attached.
   Join Filter conditions come from the outer join's <literal>ON</literal> clause,
   so a row that fails the Join Filter condition could still get emitted as
   a null-extended row.  But a plain Filter condition is applied after the
   outer-join rules and so acts to remove rows unconditionally.  In an inner
   join there is no semantic difference between these types of filters.
  </para>
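
  <para>
   For example, in a query like this one (an illustrative query, not part of
   this section's prepared examples), the cross-table comparison in the
   <literal>ON</literal> clause cannot be checked until a pair of rows has been
   formed, so it would typically show up as a Join Filter condition:

<screen>
EXPLAIN SELECT *
FROM tenk1 t1 LEFT JOIN tenk2 t2
  ON t1.unique2 = t2.unique2 AND t1.ten &lt; t2.ten;
</screen>
  </para>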
  <para>
   If we change the query's selectivity a bit, we might get a very different
   join plan:

<screen>
EXPLAIN SELECT *
FROM tenk1 t1, tenk2 t2
WHERE t1.unique1 &lt; 100 AND t1.unique2 = t2.unique2;

                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Hash Join  (cost=230.47..713.98 rows=101 width=488)
   Hash Cond: (t2.unique2 = t1.unique2)
   -&gt;  Seq Scan on tenk2 t2  (cost=0.00..445.00 rows=10000 width=244)
   -&gt;  Hash  (cost=229.20..229.20 rows=101 width=244)
         -&gt;  Bitmap Heap Scan on tenk1 t1  (cost=5.07..229.20 rows=101 width=244)
               Recheck Cond: (unique1 &lt; 100)
               -&gt;  Bitmap Index Scan on tenk1_unique1  (cost=0.00..5.04 rows=101 width=0)
                     Index Cond: (unique1 &lt; 100)
</screen>
  </para>

  <para>
   Here, the planner has chosen to use a hash join, in which rows of one
   table are entered into an in-memory hash table, after which the other
   table is scanned and the hash table is probed for matches to each row.
   Again note how the indentation reflects the plan structure: the bitmap
   scan on <literal>tenk1</literal> is the input to the Hash node, which constructs
   the hash table.  That's then returned to the Hash Join node, which reads
   rows from its outer child plan and searches the hash table for each one.
  </para>

  <para>
   Another possible type of join is a merge join, illustrated here:

<screen>
EXPLAIN SELECT *
FROM tenk1 t1, onek t2
WHERE t1.unique1 &lt; 100 AND t1.unique2 = t2.unique2;

                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Merge Join  (cost=198.11..268.19 rows=10 width=488)
   Merge Cond: (t1.unique2 = t2.unique2)
   -&gt;  Index Scan using tenk1_unique2 on tenk1 t1  (cost=0.29..656.28 rows=101 width=244)
         Filter: (unique1 &lt; 100)
   -&gt;  Sort  (cost=197.83..200.33 rows=1000 width=244)
         Sort Key: t2.unique2
         -&gt;  Seq Scan on onek t2  (cost=0.00..148.00 rows=1000 width=244)
</screen>
  </para>

  <para>
   Merge join requires its input data to be sorted on the join keys.  In this
   plan the <literal>tenk1</literal> data is sorted by using an index scan to visit
   the rows in the correct order, but a sequential scan and sort is preferred
   for <literal>onek</literal>, because there are many more rows to be visited in
   that table.
   (Sequential-scan-and-sort frequently beats an index scan for sorting many rows,
   because of the nonsequential disk access required by the index scan.)
  </para>

  <para>
   One way to look at variant plans is to force the planner to disregard
   whatever strategy it thought was the cheapest, using the enable/disable
   flags described in <xref linkend="runtime-config-query-enable"/>.
   (This is a crude tool, but useful.  See
   also <xref linkend="explicit-joins"/>.)
   For example, if we're unconvinced that sequential-scan-and-sort is the best way to
   deal with table <literal>onek</literal> in the previous example, we could try

<screen>
SET enable_sort = off;

EXPLAIN SELECT *
FROM tenk1 t1, onek t2
WHERE t1.unique1 &lt; 100 AND t1.unique2 = t2.unique2;

                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Merge Join  (cost=0.56..292.65 rows=10 width=488)
   Merge Cond: (t1.unique2 = t2.unique2)
   -&gt;  Index Scan using tenk1_unique2 on tenk1 t1  (cost=0.29..656.28 rows=101 width=244)
         Filter: (unique1 &lt; 100)
   -&gt;  Index Scan using onek_unique2 on onek t2  (cost=0.28..224.79 rows=1000 width=244)
</screen>

   which shows that the planner thinks that sorting <literal>onek</literal> by
   index-scanning is about 9% more expensive than sequential-scan-and-sort
   (292.65 versus 268.19 total cost units).
   Of course, the next question is whether it's right about that.
   We can investigate that using <command>EXPLAIN ANALYZE</command>, as discussed
   below.
  </para>
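
  <para>
   Note that <varname>enable_sort</varname> stays off for the rest of the
   session and will affect all subsequent planning, so turn it back on when
   you are done experimenting:

<screen>
RESET enable_sort;
</screen>
  </para>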

 </sect2>

 <sect2 id="using-explain-analyze">
  <title><command>EXPLAIN ANALYZE</command></title>

  <para>
   It is possible to check the accuracy of the planner's estimates
   by using <command>EXPLAIN</command>'s <literal>ANALYZE</literal> option.  With this
   option, <command>EXPLAIN</command> actually executes the query, and then displays
   the true row counts and true run time accumulated within each plan node,
   along with the same estimates that a plain <command>EXPLAIN</command>
   shows.  For example, we might get a result like this:

<screen>
EXPLAIN ANALYZE SELECT *
FROM tenk1 t1, tenk2 t2
WHERE t1.unique1 &lt; 10 AND t1.unique2 = t2.unique2;

                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=4.65..118.62 rows=10 width=488) (actual time=0.128..0.377 rows=10 loops=1)
   -&gt;  Bitmap Heap Scan on tenk1 t1  (cost=4.36..39.47 rows=10 width=244) (actual time=0.057..0.121 rows=10 loops=1)
         Recheck Cond: (unique1 &lt; 10)
         -&gt;  Bitmap Index Scan on tenk1_unique1  (cost=0.00..4.36 rows=10 width=0) (actual time=0.024..0.024 rows=10 loops=1)
               Index Cond: (unique1 &lt; 10)
   -&gt;  Index Scan using tenk2_unique2 on tenk2 t2  (cost=0.29..7.91 rows=1 width=244) (actual time=0.021..0.022 rows=1 loops=10)
         Index Cond: (unique2 = t1.unique2)
 Planning time: 0.181 ms
 Execution time: 0.501 ms
</screen>

   Note that the <quote>actual time</quote> values are in milliseconds of
   real time, whereas the <literal>cost</literal> estimates are expressed in
   arbitrary units; so they are unlikely to match up.
   The thing that's usually most important to look for is whether the
   estimated row counts are reasonably close to reality.  In this example
   the estimates were all dead-on, but that's quite unusual in practice.
  </para>

  <para>
   In some query plans, it is possible for a subplan node to be executed more
   than once.  For example, the inner index scan will be executed once per
   outer row in the above nested-loop plan.  In such cases, the
   <literal>loops</literal> value reports the
   total number of executions of the node, and the actual time and rows
   values shown are averages per-execution.  This is done to make the numbers
   comparable with the way that the cost estimates are shown.  Multiply by
   the <literal>loops</literal> value to get the total time actually spent in
   the node.  In the above example, we spent a total of 0.220 milliseconds
   executing the index scans on <literal>tenk2</literal>.
  </para>

  <para>
   In some cases <command>EXPLAIN ANALYZE</command> shows additional execution
   statistics beyond the plan node execution times and row counts.
   For example, Sort and Hash nodes provide extra information:

<screen>
EXPLAIN ANALYZE SELECT *
FROM tenk1 t1, tenk2 t2
WHERE t1.unique1 &lt; 100 AND t1.unique2 = t2.unique2 ORDER BY t1.fivethous;

                                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=717.34..717.59 rows=101 width=488) (actual time=7.761..7.774 rows=100 loops=1)
   Sort Key: t1.fivethous
   Sort Method: quicksort  Memory: 77kB
   -&gt;  Hash Join  (cost=230.47..713.98 rows=101 width=488) (actual time=0.711..7.427 rows=100 loops=1)
         Hash Cond: (t2.unique2 = t1.unique2)
         -&gt;  Seq Scan on tenk2 t2  (cost=0.00..445.00 rows=10000 width=244) (actual time=0.007..2.583 rows=10000 loops=1)
         -&gt;  Hash  (cost=229.20..229.20 rows=101 width=244) (actual time=0.659..0.659 rows=100 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 28kB
               -&gt;  Bitmap Heap Scan on tenk1 t1  (cost=5.07..229.20 rows=101 width=244) (actual time=0.080..0.526 rows=100 loops=1)
                     Recheck Cond: (unique1 &lt; 100)
                     -&gt;  Bitmap Index Scan on tenk1_unique1  (cost=0.00..5.04 rows=101 width=0) (actual time=0.049..0.049 rows=100 loops=1)
                           Index Cond: (unique1 &lt; 100)
 Planning time: 0.194 ms
 Execution time: 8.008 ms
</screen>

   The Sort node shows the sort method used (in particular, whether the sort
   was in-memory or on-disk) and the amount of memory or disk space needed.
   The Hash node shows the number of hash buckets and batches as well as the
   peak amount of memory used for the hash table.  (If the number of batches
   exceeds one, there will also be disk space usage involved, but that is not
   shown.)
  </para>

  <para>
   Another type of extra information is the number of rows removed by a
   filter condition:

<screen>
EXPLAIN ANALYZE SELECT * FROM tenk1 WHERE ten &lt; 7;

                                               QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Seq Scan on tenk1  (cost=0.00..483.00 rows=7000 width=244) (actual time=0.016..5.107 rows=7000 loops=1)
   Filter: (ten &lt; 7)
   Rows Removed by Filter: 3000
 Planning time: 0.083 ms
 Execution time: 5.905 ms
</screen>

   These counts can be particularly valuable for filter conditions applied at
   join nodes.  The <quote>Rows Removed</quote> line only appears when at least
   one scanned row, or potential join pair in the case of a join node,
   is rejected by the filter condition.
  </para>

  <para>
   A case similar to filter conditions occurs with <quote>lossy</quote>
   index scans.  For example, consider this search for polygons containing a
   specific point:

<screen>
EXPLAIN ANALYZE SELECT * FROM polygon_tbl WHERE f1 @&gt; polygon '(0.5,2.0)';

                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Seq Scan on polygon_tbl  (cost=0.00..1.05 rows=1 width=32) (actual time=0.044..0.044 rows=0 loops=1)
   Filter: (f1 @&gt; '((0.5,2))'::polygon)
   Rows Removed by Filter: 4
 Planning time: 0.040 ms
 Execution time: 0.083 ms
</screen>

   The planner thinks (quite correctly) that this sample table is too small
   to bother with an index scan, so we have a plain sequential scan in which
   all the rows got rejected by the filter condition.  But if we force an
   index scan to be used, we see:

<screen>
SET enable_seqscan TO off;

EXPLAIN ANALYZE SELECT * FROM polygon_tbl WHERE f1 @&gt; polygon '(0.5,2.0)';

                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Index Scan using gpolygonind on polygon_tbl  (cost=0.13..8.15 rows=1 width=32) (actual time=0.062..0.062 rows=0 loops=1)
   Index Cond: (f1 @&gt; '((0.5,2))'::polygon)
   Rows Removed by Index Recheck: 1
 Planning time: 0.034 ms
 Execution time: 0.144 ms
</screen>

   Here we can see that the index returned one candidate row, which was
   then rejected by a recheck of the index condition.  This happens because a
   GiST index is <quote>lossy</quote> for polygon containment tests: it actually
   returns the rows with polygons that overlap the target, and then we have
   to do the exact containment test on those rows.
  </para>
692
693   <para>
694    <command>EXPLAIN</command> has a <literal>BUFFERS</literal> option that can be used with
695    <literal>ANALYZE</literal> to get even more run time statistics:
696
697<screen>
698EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM tenk1 WHERE unique1 &lt; 100 AND unique2 &gt; 9000;
699
700                                                           QUERY PLAN
701---------------------------------------------------------------------------------------------------------------------------------
702 Bitmap Heap Scan on tenk1  (cost=25.08..60.21 rows=10 width=244) (actual time=0.323..0.342 rows=10 loops=1)
703   Recheck Cond: ((unique1 &lt; 100) AND (unique2 &gt; 9000))
704   Buffers: shared hit=15
705   -&gt;  BitmapAnd  (cost=25.08..25.08 rows=10 width=0) (actual time=0.309..0.309 rows=0 loops=1)
706         Buffers: shared hit=7
707         -&gt;  Bitmap Index Scan on tenk1_unique1  (cost=0.00..5.04 rows=101 width=0) (actual time=0.043..0.043 rows=100 loops=1)
708               Index Cond: (unique1 &lt; 100)
709               Buffers: shared hit=2
710         -&gt;  Bitmap Index Scan on tenk1_unique2  (cost=0.00..19.78 rows=999 width=0) (actual time=0.227..0.227 rows=999 loops=1)
711               Index Cond: (unique2 &gt; 9000)
712               Buffers: shared hit=5
713 Planning time: 0.088 ms
714 Execution time: 0.423 ms
715</screen>
716
717    The numbers provided by <literal>BUFFERS</literal> help to identify which parts
718    of the query are the most I/O-intensive.
719   </para>
720
721   <para>
722    Keep in mind that because <command>EXPLAIN ANALYZE</command> actually
723    runs the query, any side-effects will happen as usual, even though
724    whatever results the query might output are discarded in favor of
725    printing the <command>EXPLAIN</command> data.  If you want to analyze a
726    data-modifying query without changing your tables, you can
727    roll the command back afterwards, for example:
728
729<screen>
730BEGIN;
731
732EXPLAIN ANALYZE UPDATE tenk1 SET hundred = hundred + 1 WHERE unique1 &lt; 100;
733
734                                                           QUERY PLAN
735--------------------------------------------------------------------------------------------------------------------------------
736 Update on tenk1  (cost=5.07..229.46 rows=101 width=250) (actual time=14.628..14.628 rows=0 loops=1)
737   -&gt;  Bitmap Heap Scan on tenk1  (cost=5.07..229.46 rows=101 width=250) (actual time=0.101..0.439 rows=100 loops=1)
738         Recheck Cond: (unique1 &lt; 100)
739         -&gt;  Bitmap Index Scan on tenk1_unique1  (cost=0.00..5.04 rows=101 width=0) (actual time=0.043..0.043 rows=100 loops=1)
740               Index Cond: (unique1 &lt; 100)
741 Planning time: 0.079 ms
742 Execution time: 14.727 ms
743
744ROLLBACK;
745</screen>
746   </para>
747
748   <para>
749    As seen in this example, when the query is an <command>INSERT</command>,
750    <command>UPDATE</command>, or <command>DELETE</command> command, the actual work of
751    applying the table changes is done by a top-level Insert, Update,
752    or Delete plan node.  The plan nodes underneath this node perform
753    the work of locating the old rows and/or computing the new data.
754    So above, we see the same sort of bitmap table scan we've seen already,
755    and its output is fed to an Update node that stores the updated rows.
756    It's worth noting that although the data-modifying node can take a
757    considerable amount of run time (here, it's consuming the lion's share
758    of the time), the planner does not currently add anything to the cost
759    estimates to account for that work.  That's because the work to be done is
760    the same for every correct query plan, so it doesn't affect planning
761    decisions.
762   </para>
763
764   <para>
765    When an <command>UPDATE</command> or <command>DELETE</command> command affects an
766    inheritance hierarchy, the output might look like this:
767
768<screen>
769EXPLAIN UPDATE parent SET f2 = f2 + 1 WHERE f1 = 101;
770                                    QUERY PLAN
771-----------------------------------------------------------------------------------
772 Update on parent  (cost=0.00..24.53 rows=4 width=14)
773   Update on parent
774   Update on child1
775   Update on child2
776   Update on child3
777   -&gt;  Seq Scan on parent  (cost=0.00..0.00 rows=1 width=14)
778         Filter: (f1 = 101)
779   -&gt;  Index Scan using child1_f1_key on child1  (cost=0.15..8.17 rows=1 width=14)
780         Index Cond: (f1 = 101)
781   -&gt;  Index Scan using child2_f1_key on child2  (cost=0.15..8.17 rows=1 width=14)
782         Index Cond: (f1 = 101)
783   -&gt;  Index Scan using child3_f1_key on child3  (cost=0.15..8.17 rows=1 width=14)
784         Index Cond: (f1 = 101)
785</screen>
786
787    In this example the Update node needs to consider three child tables as
788    well as the originally-mentioned parent table.  So there are four input
789    scanning subplans, one per table.  For clarity, the Update node is
790    annotated to show the specific target tables that will be updated, in the
791    same order as the corresponding subplans.  (These annotations are new as
792    of <productname>PostgreSQL</productname> 9.5; in prior versions the reader had to
793    intuit the target tables by inspecting the subplans.)
794   </para>
795
796   <para>
797    The <literal>Planning time</literal> shown by <command>EXPLAIN
798    ANALYZE</command> is the time it took to generate the query plan from the
799    parsed query and optimize it. It does not include parsing or rewriting.
800   </para>
801
802   <para>
803    The <literal>Execution time</literal> shown by <command>EXPLAIN
804    ANALYZE</command> includes executor start-up and shut-down time, as well
805    as the time to run any triggers that are fired, but it does not include
806    parsing, rewriting, or planning time.
807    Time spent executing <literal>BEFORE</literal> triggers, if any, is included in
808    the time for the related Insert, Update, or Delete node; but time
809    spent executing <literal>AFTER</literal> triggers is not counted there because
810    <literal>AFTER</literal> triggers are fired after completion of the whole plan.
811    The total time spent in each trigger
812    (either <literal>BEFORE</literal> or <literal>AFTER</literal>) is also shown separately.
813    Note that deferred constraint triggers will not be executed
814    until end of transaction and are thus not considered at all by
815    <command>EXPLAIN ANALYZE</command>.
816   </para>
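
   <para>
    As an illustration, the separate per-trigger timing can be observed by
    attaching a trigger and repeating the earlier <command>UPDATE</command>
    experiment.  (The trigger and function shown here are hypothetical
    no-ops created just for the demonstration, and the timings will of
    course vary.)

<programlisting>
CREATE FUNCTION noop_trig() RETURNS trigger AS
$$BEGIN RETURN NEW; END$$ LANGUAGE plpgsql;

CREATE TRIGGER t_hundred BEFORE UPDATE ON tenk1
  FOR EACH ROW EXECUTE FUNCTION noop_trig();

BEGIN;
EXPLAIN ANALYZE UPDATE tenk1 SET hundred = hundred + 1 WHERE unique1 &lt; 100;
ROLLBACK;
</programlisting>

    With the trigger in place, the output gains a line of the form
    <literal>Trigger t_hundred: time=... calls=100</literal> after the
    per-node statistics.
   </para>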
817
818  </sect2>
819
820  <sect2 id="using-explain-caveats">
821   <title>Caveats</title>
822
823   <para>
824    There are two significant ways in which run times measured by
825    <command>EXPLAIN ANALYZE</command> can deviate from normal execution of
826    the same query.  First, since no output rows are delivered to the client,
827    network transmission costs and I/O conversion costs are not included.
828    Second, the measurement overhead added by <command>EXPLAIN
829    ANALYZE</command> can be significant, especially on machines with slow
830    <function>gettimeofday()</function> operating-system calls. You can use the
831    <xref linkend="pgtesttiming"/> tool to measure the overhead of timing
832    on your system.
833   </para>
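
   <para>
    For example, a quick check of the timing overhead on the local machine
    might look like this (the duration chosen here is arbitrary):

<screen>
$ pg_test_timing -d 3
</screen>

    The tool reports the cost of a single timing call and a histogram of
    observed call durations.
   </para>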
834
835   <para>
836    <command>EXPLAIN</command> results should not be extrapolated to situations
837    much different from the one you are actually testing; for example,
838    results on a toy-sized table cannot be assumed to apply to large tables.
839    The planner's cost estimates are not linear and so it might choose
840    a different plan for a larger or smaller table.  An extreme example
841    is that on a table that only occupies one disk page, you'll nearly
842    always get a sequential scan plan whether indexes are available or not.
843    The planner realizes that it's going to take one disk page read to
844    process the table in any case, so there's no value in expending additional
845    page reads to look at an index.  (We saw this happening in the
846    <literal>polygon_tbl</literal> example above.)
847   </para>
848
849   <para>
850    There are cases in which the actual and estimated values won't match up
851    well, but nothing is really wrong.  One such case occurs when
852    plan node execution is stopped short by a <literal>LIMIT</literal> or similar
853    effect.  For example, in the <literal>LIMIT</literal> query we used before,
854
855<screen>
856EXPLAIN ANALYZE SELECT * FROM tenk1 WHERE unique1 &lt; 100 AND unique2 &gt; 9000 LIMIT 2;
857
858                                                          QUERY PLAN
859-------------------------------------------------------------------------------------------------------------------------------
860 Limit  (cost=0.29..14.71 rows=2 width=244) (actual time=0.177..0.249 rows=2 loops=1)
861   -&gt;  Index Scan using tenk1_unique2 on tenk1  (cost=0.29..72.42 rows=10 width=244) (actual time=0.174..0.244 rows=2 loops=1)
862         Index Cond: (unique2 &gt; 9000)
863         Filter: (unique1 &lt; 100)
864         Rows Removed by Filter: 287
865 Planning time: 0.096 ms
866 Execution time: 0.336 ms
867</screen>
868
869    the estimated cost and row count for the Index Scan node are shown as
870    though it were run to completion.  But in reality the Limit node stopped
871    requesting rows after it got two, so the actual row count is only 2 and
872    the run time is less than the cost estimate would suggest.  This is not
873    an estimation error, only a discrepancy in the way the estimates and true
874    values are displayed.
875   </para>
876
877   <para>
878    Merge joins also have measurement artifacts that can confuse the unwary.
879    A merge join will stop reading one input if it's exhausted the other input
880    and the next key value in the one input is greater than the last key value
881    of the other input; in such a case there can be no more matches and so no
882    need to scan the rest of the first input.  This results in not reading all
883    of one child, with results like those mentioned for <literal>LIMIT</literal>.
884    Also, if the outer (first) child contains rows with duplicate key values,
885    the inner (second) child is backed up and rescanned for the portion of its
886    rows matching that key value.  <command>EXPLAIN ANALYZE</command> counts these
887    repeated emissions of the same inner rows as if they were real additional
888    rows.  When there are many outer duplicates, the reported actual row count
889    for the inner child plan node can be significantly larger than the number
890    of rows that are actually in the inner relation.
891   </para>
892
893   <para>
894    BitmapAnd and BitmapOr nodes always report their actual row counts as zero,
895    due to implementation limitations.
896   </para>
897
898   <para>
899    Normally, <command>EXPLAIN</command> will display every plan node
900    created by the planner.  However, there are cases where the executor
901    can determine that certain nodes need not be executed because they
902    cannot produce any rows, based on parameter values that were not
903    available at planning time.  (Currently this can only happen for child
904    nodes of an Append or MergeAppend node that is scanning a partitioned
905    table.)  When this happens, those plan nodes are omitted from
906    the <command>EXPLAIN</command> output and a <literal>Subplans
907    Removed: <replaceable>N</replaceable></literal> annotation appears
908    instead.
909   </para>
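
   <para>
    A sketch of how this can arise, using a hypothetical range-partitioned
    table and a prepared statement (a generic plan is forced so that the
    comparison value is not known until execution time):

<programlisting>
CREATE TABLE measurement (logdate date, peaktemp int)
  PARTITION BY RANGE (logdate);
CREATE TABLE measurement_y2024 PARTITION OF measurement
  FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
CREATE TABLE measurement_y2025 PARTITION OF measurement
  FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');

SET plan_cache_mode = force_generic_plan;
PREPARE q(date) AS SELECT * FROM measurement WHERE logdate &gt;= $1;
EXPLAIN (ANALYZE) EXECUTE q('2025-06-01');
</programlisting>

    If the executor can determine at startup that the
    <literal>measurement_y2024</literal> partition cannot contain matching
    rows, the Append node's output will include a <literal>Subplans
    Removed: 1</literal> annotation in place of that partition's scan.
   </para>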
910  </sect2>
911
912 </sect1>
913
914 <sect1 id="planner-stats">
915  <title>Statistics Used by the Planner</title>
916
917  <indexterm zone="planner-stats">
918   <primary>statistics</primary>
919   <secondary>of the planner</secondary>
920  </indexterm>
921
922  <sect2>
923   <title>Single-Column Statistics</title>
924  <para>
925   As we saw in the previous section, the query planner needs to estimate
926   the number of rows retrieved by a query in order to make good choices
927   of query plans.  This section provides a quick look at the statistics
928   that the system uses for these estimates.
929  </para>
930
931  <para>
932   One component of the statistics is the total number of entries in
933   each table and index, as well as the number of disk blocks occupied
934   by each table and index.  This information is kept in the table
935   <link linkend="catalog-pg-class"><structname>pg_class</structname></link>,
936   in the columns <structfield>reltuples</structfield> and
937   <structfield>relpages</structfield>.  We can look at it with
938   queries similar to this one:
939
940<screen>
941SELECT relname, relkind, reltuples, relpages
942FROM pg_class
943WHERE relname LIKE 'tenk1%';
944
945       relname        | relkind | reltuples | relpages
946----------------------+---------+-----------+----------
947 tenk1                | r       |     10000 |      358
948 tenk1_hundred        | i       |     10000 |       30
949 tenk1_thous_tenthous | i       |     10000 |       30
950 tenk1_unique1        | i       |     10000 |       30
951 tenk1_unique2        | i       |     10000 |       30
952(5 rows)
953</screen>
954
955   Here we can see that <structname>tenk1</structname> contains 10000
956   rows, as do its indexes, but the indexes are (unsurprisingly) much
957   smaller than the table.
958  </para>
959
960  <para>
961   For efficiency reasons, <structfield>reltuples</structfield>
962   and <structfield>relpages</structfield> are not updated on-the-fly,
963   and so they usually contain somewhat out-of-date values.
964   They are updated by <command>VACUUM</command>, <command>ANALYZE</command>, and a
965   few DDL commands such as <command>CREATE INDEX</command>.  A <command>VACUUM</command>
966   or <command>ANALYZE</command> operation that does not scan the entire table
967   (which is commonly the case) will incrementally update the
968   <structfield>reltuples</structfield> count on the basis of the part
969   of the table it did scan, resulting in an approximate value.
970   In any case, the planner
971   will scale the values it finds in <structname>pg_class</structname>
972   to match the current physical table size, thus obtaining a closer
973   approximation.
974  </para>
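
  <para>
   The scaling works on the principle that row density should remain
   stable even when the total table size does not.  A rough and purely
   illustrative version of the computation can be written in SQL,
   multiplying the recorded rows-per-page density by the table's current
   size in pages:

<programlisting>
SELECT relname,
       reltuples,
       reltuples / relpages *
         (pg_relation_size(oid) / current_setting('block_size')::int)
         AS scaled_estimate
FROM pg_class
WHERE relname = 'tenk1';
</programlisting>

   The planner performs the equivalent calculation internally, using the
   table's true current page count rather than the possibly-stale
   <structfield>relpages</structfield> value.
  </para>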
975
976  <indexterm>
977   <primary>pg_statistic</primary>
978  </indexterm>
979
980  <para>
981   Most queries retrieve only a fraction of the rows in a table, due
982   to <literal>WHERE</literal> clauses that restrict the rows to be
983   examined.  The planner thus needs to make an estimate of the
984   <firstterm>selectivity</firstterm> of <literal>WHERE</literal> clauses, that is,
985   the fraction of rows that match each condition in the
986   <literal>WHERE</literal> clause.  The information used for this task is
987   stored in the
988   <link linkend="catalog-pg-statistic"><structname>pg_statistic</structname></link>
989   system catalog.  Entries in <structname>pg_statistic</structname>
990   are updated by the <command>ANALYZE</command> and <command>VACUUM
991   ANALYZE</command> commands, and are always approximate even when freshly
992   updated.
993  </para>
994
995  <indexterm>
996   <primary>pg_stats</primary>
997  </indexterm>
998
999  <para>
1000   Rather than look at <structname>pg_statistic</structname> directly,
1001   it's better to look at its view
1002   <link linkend="view-pg-stats"><structname>pg_stats</structname></link>
1003   when examining the statistics manually.  <structname>pg_stats</structname>
1004   is designed to be more easily readable.  Furthermore,
1005   <structname>pg_stats</structname> is readable by all, whereas
1006   <structname>pg_statistic</structname> is only readable by a superuser.
1007   (This prevents unprivileged users from learning something about
1008   the contents of other people's tables from the statistics.  The
1009   <structname>pg_stats</structname> view is restricted to show only
1010   rows about tables that the current user can read.)
1011   For example, we might do:
1012
1013<screen>
1014SELECT attname, inherited, n_distinct,
1015       array_to_string(most_common_vals, E'\n') as most_common_vals
1016FROM pg_stats
1017WHERE tablename = 'road';
1018
1019 attname | inherited | n_distinct |          most_common_vals
1020---------+-----------+------------+------------------------------------
1021 name    | f         |  -0.363388 | I- 580                        Ramp+
1022         |           |            | I- 880                        Ramp+
1023         |           |            | Sp Railroad                       +
1024         |           |            | I- 580                            +
1025         |           |            | I- 680                        Ramp
1026 name    | t         |  -0.284859 | I- 880                        Ramp+
1027         |           |            | I- 580                        Ramp+
1028         |           |            | I- 680                        Ramp+
1029         |           |            | I- 580                            +
1030         |           |            | State Hwy 13                  Ramp
1031(2 rows)
1032</screen>
1033
1034   Note that two rows are displayed for the same column, one corresponding
1035   to the complete inheritance hierarchy starting at the
1036   <literal>road</literal> table (<literal>inherited</literal>=<literal>t</literal>),
1037   and another one including only the <literal>road</literal> table itself
1038   (<literal>inherited</literal>=<literal>f</literal>).
1039  </para>
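
  <para>
   Other per-column details are available in
   <structname>pg_stats</structname> as well.  For instance, a column's
   histogram boundaries can be examined like this (a sketch against the
   regression database):

<programlisting>
SELECT attname, n_distinct, histogram_bounds
FROM pg_stats
WHERE tablename = 'tenk1' AND attname = 'unique1';
</programlisting>

   For a unique column such as this one there is no
   <structfield>most_common_vals</structfield> list, and the histogram
   alone describes the distribution.
  </para>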
1040
1041  <para>
1042   The amount of information stored in <structname>pg_statistic</structname>
1043   by <command>ANALYZE</command>, in particular the maximum number of entries in the
1044   <structfield>most_common_vals</structfield> and <structfield>histogram_bounds</structfield>
1045   arrays for each column, can be set on a
1046   column-by-column basis using the <command>ALTER TABLE SET STATISTICS</command>
1047   command, or globally by setting the
1048   <xref linkend="guc-default-statistics-target"/> configuration variable.
1049   The default limit is presently 100 entries.  Raising the limit
1050   might allow more accurate planner estimates to be made, particularly for
1051   columns with irregular data distributions, at the price of consuming
1052   more space in <structname>pg_statistic</structname> and slightly more
1053   time to compute the estimates.  Conversely, a lower limit might be
1054   sufficient for columns with simple data distributions.
1055  </para>
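
  <para>
   For example, to raise the limit for a single column and then rebuild
   its statistics (the target value shown is arbitrary):

<programlisting>
ALTER TABLE tenk1 ALTER COLUMN hundred SET STATISTICS 500;
ANALYZE tenk1;
</programlisting>

   Setting the target back to <literal>-1</literal> reverts the column to
   using the <xref linkend="guc-default-statistics-target"/> setting.
  </para>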
1056
1057  <para>
1058   Further details about the planner's use of statistics can be found in
1059   <xref linkend="planner-stats-details"/>.
1060  </para>
1061  </sect2>
1062
1063  <sect2 id="planner-stats-extended">
1064   <title>Extended Statistics</title>
1065
1066   <indexterm zone="planner-stats-extended">
1067    <primary>statistics</primary>
1068    <secondary>of the planner</secondary>
1069   </indexterm>
1070
1071   <indexterm>
1072    <primary>correlation</primary>
1073    <secondary>in the query planner</secondary>
1074   </indexterm>
1075
1076   <indexterm>
1077    <primary>pg_statistic_ext</primary>
1078   </indexterm>
1079
1080   <indexterm>
1081    <primary>pg_statistic_ext_data</primary>
1082   </indexterm>
1083
1084   <para>
1085    It is common to see slow queries running bad execution plans because
1086    multiple columns used in the query clauses are correlated.
1087    The planner normally assumes that multiple conditions
1088    are independent of each other,
1089    an assumption that does not hold when column values are correlated.
1090    Regular statistics, because of their per-individual-column nature,
1091    cannot capture any knowledge about cross-column correlation.
1092    However, <productname>PostgreSQL</productname> has the ability to compute
1093    <firstterm>multivariate statistics</firstterm>, which can capture
1094    such information.
1095   </para>
1096
1097   <para>
1098    Because the number of possible column combinations is very large,
1099    it's impractical to compute multivariate statistics automatically.
1100    Instead, <firstterm>extended statistics objects</firstterm>, more often
1101    called just <firstterm>statistics objects</firstterm>, can be created to instruct
1102    the server to obtain statistics across interesting sets of columns.
1103   </para>
1104
1105   <para>
1106    Statistics objects are created using the
1107    <xref linkend="sql-createstatistics"/> command.
1108    Creation of such an object merely creates a catalog entry expressing
1109    interest in the statistics.  Actual data collection is performed
1110    by <command>ANALYZE</command> (either a manual command, or background
1111    auto-analyze).  The collected values can be examined in the
1112    <link linkend="catalog-pg-statistic-ext-data"><structname>pg_statistic_ext_data</structname></link>
1113    catalog.
1114   </para>
1115
1116   <para>
1117    <command>ANALYZE</command> computes extended statistics based on the same
1118    sample of table rows that it takes for computing regular single-column
1119    statistics.  Since the sample size is increased by increasing the
1120    statistics target for the table or any of its columns (as described in
1121    the previous section), a larger statistics target will normally result in
1122    more accurate extended statistics, as well as more time spent calculating
1123    them.
1124   </para>
1125
1126   <para>
1127    The following subsections describe the kinds of extended statistics
1128    that are currently supported.
1129   </para>
1130
1131   <sect3>
1132    <title>Functional Dependencies</title>
1133
1134    <para>
1135     The simplest kind of extended statistics tracks <firstterm>functional
1136     dependencies</firstterm>, a concept used in definitions of database normal forms.
1137     We say that column <structfield>b</structfield> is functionally dependent on
1138     column <structfield>a</structfield> if knowledge of the value of
1139     <structfield>a</structfield> is sufficient to determine the value
     of <structfield>b</structfield>; that is, no two rows can have the same
     value of <structfield>a</structfield> but different values
     of <structfield>b</structfield>.
1142     In a fully normalized database, functional dependencies should exist
1143     only on primary keys and superkeys. However, in practice many data sets
1144     are not fully normalized for various reasons; intentional
1145     denormalization for performance reasons is a common example.
1146     Even in a fully normalized database, there may be partial correlation
1147     between some columns, which can be expressed as partial functional
1148     dependency.
1149    </para>
1150
1151    <para>
1152     The existence of functional dependencies directly affects the accuracy
1153     of estimates in certain queries.  If a query contains conditions on
1154     both the independent and the dependent column(s), the
1155     conditions on the dependent columns do not further reduce the result
1156     size; but without knowledge of the functional dependency, the query
1157     planner will assume that the conditions are independent, resulting
1158     in underestimating the result size.
1159    </para>
1160
1161    <para>
1162     To inform the planner about functional dependencies, <command>ANALYZE</command>
1163     can collect measurements of cross-column dependency. Assessing the
1164     degree of dependency between all sets of columns would be prohibitively
1165     expensive, so data collection is limited to those groups of columns
1166     appearing together in a statistics object defined with
1167     the <literal>dependencies</literal> option.  It is advisable to create
1168     <literal>dependencies</literal> statistics only for column groups that are
1169     strongly correlated, to avoid unnecessary overhead in both
1170     <command>ANALYZE</command> and later query planning.
1171    </para>
1172
1173    <para>
1174     Here is an example of collecting functional-dependency statistics:
1175<programlisting>
1176CREATE STATISTICS stts (dependencies) ON city, zip FROM zipcodes;
1177
1178ANALYZE zipcodes;
1179
1180SELECT stxname, stxkeys, stxddependencies
1181  FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
1182  WHERE stxname = 'stts';
1183 stxname | stxkeys |             stxddependencies
1184---------+---------+------------------------------------------
1185 stts    | 1 5     | {"1 => 5": 1.000000, "5 => 1": 0.423130}
1186(1 row)
1187</programlisting>
     Here it can be seen that column 1 (zip code) fully determines column
     5 (city), so the coefficient is 1.0, while city only determines zip code
1190     about 42% of the time, meaning that there are many cities (58%) that are
1191     represented by more than a single ZIP code.
1192    </para>
1193
1194    <para>
1195     When computing the selectivity for a query involving functionally
1196     dependent columns, the planner adjusts the per-condition selectivity
1197     estimates using the dependency coefficients so as not to produce
1198     an underestimate.
1199    </para>
1200
1201    <sect4>
1202     <title>Limitations of Functional Dependencies</title>
1203
1204     <para>
1205      Functional dependencies are currently only applied when considering
1206      simple equality conditions that compare columns to constant values.
1207      They are not used to improve estimates for equality conditions
1208      comparing two columns or comparing a column to an expression, nor for
1209      range clauses, <literal>LIKE</literal> or any other type of condition.
1210     </para>
1211
1212     <para>
1213      When estimating with functional dependencies, the planner assumes that
1214      conditions on the involved columns are compatible and hence redundant.
1215      If they are incompatible, the correct estimate would be zero rows, but
1216      that possibility is not considered.  For example, given a query like
1217<programlisting>
1218SELECT * FROM zipcodes WHERE city = 'San Francisco' AND zip = '94105';
1219</programlisting>
1220      the planner will disregard the <structfield>city</structfield> clause as not
1221      changing the selectivity, which is correct.  However, it will make
1222      the same assumption about
1223<programlisting>
1224SELECT * FROM zipcodes WHERE city = 'San Francisco' AND zip = '90210';
1225</programlisting>
1226      even though there will really be zero rows satisfying this query.
1227      Functional dependency statistics do not provide enough information
1228      to conclude that, however.
1229     </para>
1230
1231     <para>
      In many practical situations, this assumption is satisfied;
1233      for example, there might be a GUI in the application that only allows
1234      selecting compatible city and ZIP code values to use in a query.
1235      But if that's not the case, functional dependencies may not be a viable
1236      option.
1237     </para>
1238    </sect4>
1239   </sect3>
1240
1241   <sect3>
1242    <title>Multivariate N-Distinct Counts</title>
1243
1244    <para>
1245     Single-column statistics store the number of distinct values in each
1246     column.  Estimates of the number of distinct values when combining more
1247     than one column (for example, for <literal>GROUP BY a, b</literal>) are
1248     frequently wrong when the planner only has single-column statistical
1249     data, causing it to select bad plans.
1250    </para>
1251
1252    <para>
1253     To improve such estimates, <command>ANALYZE</command> can collect n-distinct
1254     statistics for groups of columns.  As before, it's impractical to do
1255     this for every possible column grouping, so data is collected only for
1256     those groups of columns appearing together in a statistics object
1257     defined with the <literal>ndistinct</literal> option.  Data will be collected
1258     for each possible combination of two or more columns from the set of
1259     listed columns.
1260    </para>
1261
1262    <para>
1263     Continuing the previous example, the n-distinct counts in a
1264     table of ZIP codes might look like the following:
1265<programlisting>
1266CREATE STATISTICS stts2 (ndistinct) ON city, state, zip FROM zipcodes;
1267
1268ANALYZE zipcodes;
1269
1270SELECT stxkeys AS k, stxdndistinct AS nd
1271  FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
1272  WHERE stxname = 'stts2';
1273-[ RECORD 1 ]--------------------------------------------------------
1274k  | 1 2 5
1275nd | {"1, 2": 33178, "1, 5": 33178, "2, 5": 27435, "1, 2, 5": 33178}
1276(1 row)
1277</programlisting>
1278     This indicates that there are three combinations of columns that
1279     have 33178 distinct values: ZIP code and state; ZIP code and city;
1280     and ZIP code, city and state (the fact that they are all equal is
1281     expected given that ZIP code alone is unique in this table).  On the
1282     other hand, the combination of city and state has only 27435 distinct
1283     values.
1284    </para>
1285
1286    <para>
1287     It's advisable to create <literal>ndistinct</literal> statistics objects only
1288     on combinations of columns that are actually used for grouping, and
1289     for which misestimation of the number of groups is resulting in bad
1290     plans.  Otherwise, the <command>ANALYZE</command> cycles are just wasted.
1291    </para>
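
    <para>
     A statistics object that turns out not to pay its way can simply be
     dropped again:

<programlisting>
DROP STATISTICS stts2;
</programlisting>
    </para>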
1292   </sect3>
1293
1294   <sect3>
1295    <title>Multivariate MCV Lists</title>
1296
1297    <para>
     Another type of statistic stored for each column is a list of the most
     common values.  These lists allow very accurate estimates for individual
     columns, but can lead to significant misestimates for queries with
     conditions on multiple columns.
1302    </para>
1303
1304    <para>
1305     To improve such estimates, <command>ANALYZE</command> can collect MCV
1306     lists on combinations of columns.  Similarly to functional dependencies
1307     and n-distinct coefficients, it's impractical to do this for every
     possible column grouping.  This is even more true here, because the MCV
     list (unlike functional dependencies and n-distinct coefficients) stores
     the actual common column values.  So data is collected only for those groups
1311     of columns appearing together in a statistics object defined with the
1312     <literal>mcv</literal> option.
1313    </para>
1314
1315    <para>
1316     Continuing the previous example, the MCV list for a table of ZIP codes
1317     might look like the following (unlike for simpler types of statistics,
1318     a function is required for inspection of MCV contents):
1319
1320<programlisting>
1321CREATE STATISTICS stts3 (mcv) ON city, state FROM zipcodes;
1322
1323ANALYZE zipcodes;
1324
SELECT m.* FROM pg_statistic_ext JOIN pg_statistic_ext_data ON (oid = stxoid),
                pg_mcv_list_items(stxdmcv) m WHERE stxname = 'stts3';
1327
1328 index |         values         | nulls | frequency | base_frequency
1329-------+------------------------+-------+-----------+----------------
1330     0 | {Washington, DC}       | {f,f} |  0.003467 |        2.7e-05
1331     1 | {Apo, AE}              | {f,f} |  0.003067 |        1.9e-05
1332     2 | {Houston, TX}          | {f,f} |  0.002167 |       0.000133
1333     3 | {El Paso, TX}          | {f,f} |     0.002 |       0.000113
1334     4 | {New York, NY}         | {f,f} |  0.001967 |       0.000114
1335     5 | {Atlanta, GA}          | {f,f} |  0.001633 |        3.3e-05
1336     6 | {Sacramento, CA}       | {f,f} |  0.001433 |        7.8e-05
1337     7 | {Miami, FL}            | {f,f} |    0.0014 |          6e-05
1338     8 | {Dallas, TX}           | {f,f} |  0.001367 |        8.8e-05
1339     9 | {Chicago, IL}          | {f,f} |  0.001333 |        5.1e-05
1340   ...
1341(99 rows)
1342</programlisting>
     This indicates that the most common combination of city and state is
     Washington, DC, with an actual frequency (in the sample) of about 0.35%.
     The base frequency of the combination (as computed from the simple
     per-column frequencies) is only 0.0027%, so relying on per-column
     statistics alone would underestimate this combination by two orders of
     magnitude.
1348    </para>
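    <para>
     To see the effect of the MCV list on estimates, you can compare the
     planner's row estimate for a common combination before and after
     creating the statistics object, for example:
<programlisting>
EXPLAIN SELECT * FROM zipcodes WHERE city = 'Washington' AND state = 'DC';
</programlisting>
     Without the extended statistics, the estimated row count would reflect
     the base frequency; with them, it should be close to the actual
     frequency shown above.
    </para>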
1349
1350    <para>
     It's advisable to create <acronym>MCV</acronym> statistics objects only
     on combinations of columns that are actually used in conditions together,
     and for which misestimation of the combined selectivity is resulting in
     bad plans.  Otherwise, the <command>ANALYZE</command> and planning cycles
     are just wasted.
1356    </para>
1357   </sect3>
1358
1359  </sect2>
1360 </sect1>
1361
1362 <sect1 id="explicit-joins">
1363  <title>Controlling the Planner with Explicit <literal>JOIN</literal> Clauses</title>
1364
1365  <indexterm zone="explicit-joins">
1366   <primary>join</primary>
1367   <secondary>controlling the order</secondary>
1368  </indexterm>
1369
1370  <para>
1371   It is possible
1372   to control the query planner to some extent by using the explicit <literal>JOIN</literal>
1373   syntax.  To see why this matters, we first need some background.
1374  </para>
1375
1376  <para>
1377   In a simple join query, such as:
1378<programlisting>
1379SELECT * FROM a, b, c WHERE a.id = b.id AND b.ref = c.id;
1380</programlisting>
1381   the planner is free to join the given tables in any order.  For
1382   example, it could generate a query plan that joins A to B, using
1383   the <literal>WHERE</literal> condition <literal>a.id = b.id</literal>, and then
1384   joins C to this joined table, using the other <literal>WHERE</literal>
1385   condition.  Or it could join B to C and then join A to that result.
1386   Or it could join A to C and then join them with B &mdash; but that
1387   would be inefficient, since the full Cartesian product of A and C
1388   would have to be formed, there being no applicable condition in the
1389   <literal>WHERE</literal> clause to allow optimization of the join.  (All
1390   joins in the <productname>PostgreSQL</productname> executor happen
1391   between two input tables, so it's necessary to build up the result
1392   in one or another of these fashions.)  The important point is that
1393   these different join possibilities give semantically equivalent
1394   results but might have hugely different execution costs.  Therefore,
1395   the planner will explore all of them to try to find the most
1396   efficient query plan.
1397  </para>
1398
1399  <para>
1400   When a query only involves two or three tables, there aren't many join
1401   orders to worry about.  But the number of possible join orders grows
1402   exponentially as the number of tables expands.  Beyond ten or so input
1403   tables it's no longer practical to do an exhaustive search of all the
1404   possibilities, and even for six or seven tables planning might take an
1405   annoyingly long time.  When there are too many input tables, the
1406   <productname>PostgreSQL</productname> planner will switch from exhaustive
1407   search to a <firstterm>genetic</firstterm> probabilistic search
1408   through a limited number of possibilities.  (The switch-over threshold is
1409   set by the <xref linkend="guc-geqo-threshold"/> run-time
1410   parameter.)
1411   The genetic search takes less time, but it won't
1412   necessarily find the best possible plan.
1413  </para>
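  <para>
   For example, to raise the point at which the planner switches to genetic
   search (trading longer planning time for potentially better plans), you
   could temporarily set:
<programlisting>
SET geqo_threshold = 14;
</programlisting>
   (The default is 12; see <xref linkend="guc-geqo-threshold"/> for details.)
  </para>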
1414
1415  <para>
1416   When the query involves outer joins, the planner has less freedom
1417   than it does for plain (inner) joins. For example, consider:
1418<programlisting>
1419SELECT * FROM a LEFT JOIN (b JOIN c ON (b.ref = c.id)) ON (a.id = b.id);
1420</programlisting>
1421   Although this query's restrictions are superficially similar to the
1422   previous example, the semantics are different because a row must be
1423   emitted for each row of A that has no matching row in the join of B and C.
1424   Therefore the planner has no choice of join order here: it must join
1425   B to C and then join A to that result.  Accordingly, this query takes
1426   less time to plan than the previous query.  In other cases, the planner
1427   might be able to determine that more than one join order is safe.
1428   For example, given:
1429<programlisting>
1430SELECT * FROM a LEFT JOIN b ON (a.bid = b.id) LEFT JOIN c ON (a.cid = c.id);
1431</programlisting>
1432   it is valid to join A to either B or C first.  Currently, only
1433   <literal>FULL JOIN</literal> completely constrains the join order.  Most
1434   practical cases involving <literal>LEFT JOIN</literal> or <literal>RIGHT JOIN</literal>
1435   can be rearranged to some extent.
1436  </para>
1437
1438  <para>
1439   Explicit inner join syntax (<literal>INNER JOIN</literal>, <literal>CROSS
1440   JOIN</literal>, or unadorned <literal>JOIN</literal>) is semantically the same as
1441   listing the input relations in <literal>FROM</literal>, so it does not
1442   constrain the join order.
1443  </para>
1444
1445  <para>
1446   Even though most kinds of <literal>JOIN</literal> don't completely constrain
1447   the join order, it is possible to instruct the
1448   <productname>PostgreSQL</productname> query planner to treat all
1449   <literal>JOIN</literal> clauses as constraining the join order anyway.
1450   For example, these three queries are logically equivalent:
1451<programlisting>
1452SELECT * FROM a, b, c WHERE a.id = b.id AND b.ref = c.id;
1453SELECT * FROM a CROSS JOIN b CROSS JOIN c WHERE a.id = b.id AND b.ref = c.id;
1454SELECT * FROM a JOIN (b JOIN c ON (b.ref = c.id)) ON (a.id = b.id);
1455</programlisting>
1456   But if we tell the planner to honor the <literal>JOIN</literal> order,
1457   the second and third take less time to plan than the first.  This effect
1458   is not worth worrying about for only three tables, but it can be a
1459   lifesaver with many tables.
1460  </para>
1461
1462  <para>
1463   To force the planner to follow the join order laid out by explicit
1464   <literal>JOIN</literal>s,
1465   set the <xref linkend="guc-join-collapse-limit"/> run-time parameter to 1.
1466   (Other possible values are discussed below.)
1467  </para>
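  <para>
   For example:
<programlisting>
SET join_collapse_limit = 1;
SELECT * FROM a JOIN (b JOIN c ON (b.ref = c.id)) ON (a.id = b.id);
</programlisting>
   With this setting, the planner will join B to C first and then join A to
   the result, exactly as written.
  </para>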
1468
1469  <para>
1470   You do not need to constrain the join order completely in order to
1471   cut search time, because it's OK to use <literal>JOIN</literal> operators
1472   within items of a plain <literal>FROM</literal> list.  For example, consider:
1473<programlisting>
1474SELECT * FROM a CROSS JOIN b, c, d, e WHERE ...;
1475</programlisting>
1476   With <varname>join_collapse_limit</varname> = 1, this
1477   forces the planner to join A to B before joining them to other tables,
1478   but doesn't constrain its choices otherwise.  In this example, the
1479   number of possible join orders is reduced by a factor of 5.
1480  </para>
1481
1482  <para>
1483   Constraining the planner's search in this way is a useful technique
1484   both for reducing planning time and for directing the planner to a
1485   good query plan.  If the planner chooses a bad join order by default,
1486   you can force it to choose a better order via <literal>JOIN</literal> syntax
1487   &mdash; assuming that you know of a better order, that is.  Experimentation
1488   is recommended.
1489  </para>
1490
1491  <para>
1492   A closely related issue that affects planning time is collapsing of
1493   subqueries into their parent query.  For example, consider:
1494<programlisting>
1495SELECT *
1496FROM x, y,
1497    (SELECT * FROM a, b, c WHERE something) AS ss
1498WHERE somethingelse;
1499</programlisting>
1500   This situation might arise from use of a view that contains a join;
1501   the view's <literal>SELECT</literal> rule will be inserted in place of the view
1502   reference, yielding a query much like the above.  Normally, the planner
1503   will try to collapse the subquery into the parent, yielding:
1504<programlisting>
1505SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
1506</programlisting>
1507   This usually results in a better plan than planning the subquery
1508   separately.  (For example, the outer <literal>WHERE</literal> conditions might be such that
1509   joining X to A first eliminates many rows of A, thus avoiding the need to
1510   form the full logical output of the subquery.)  But at the same time,
1511   we have increased the planning time; here, we have a five-way join
1512   problem replacing two separate three-way join problems.  Because of the
1513   exponential growth of the number of possibilities, this makes a big
   difference.  The planner tries to avoid getting stuck in huge join search
   problems by not collapsing a subquery when doing so would produce more
   than <varname>from_collapse_limit</varname> <literal>FROM</literal> items
   in the parent query.  You can trade off planning time against quality of
   plan by adjusting this run-time parameter up or down.
1519  </para>
1520
1521  <para>
1522   <xref linkend="guc-from-collapse-limit"/> and <xref
1523   linkend="guc-join-collapse-limit"/>
1524   are similarly named because they do almost the same thing: one controls
1525   when the planner will <quote>flatten out</quote> subqueries, and the
1526   other controls when it will flatten out explicit joins.  Typically
1527   you would either set <varname>join_collapse_limit</varname> equal to
1528   <varname>from_collapse_limit</varname> (so that explicit joins and subqueries
1529   act similarly) or set <varname>join_collapse_limit</varname> to 1 (if you want
1530   to control join order with explicit joins).  But you might set them
1531   differently if you are trying to fine-tune the trade-off between planning
1532   time and run time.
1533  </para>
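  <para>
   For example, to make explicit joins and subqueries behave alike while
   allowing a somewhat larger search space, you might set both limits to the
   same value:
<programlisting>
SET from_collapse_limit = 12;
SET join_collapse_limit = 12;
</programlisting>
   (The default for both parameters is 8.)
  </para>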
1534 </sect1>
1535
1536 <sect1 id="populate">
1537  <title>Populating a Database</title>
1538
1539  <para>
1540   One might need to insert a large amount of data when first populating
1541   a database. This section contains some suggestions on how to make
1542   this process as efficient as possible.
1543  </para>
1544
1545  <sect2 id="disable-autocommit">
1546   <title>Disable Autocommit</title>
1547
1548   <indexterm>
1549    <primary>autocommit</primary>
1550    <secondary>bulk-loading data</secondary>
1551   </indexterm>
1552
1553   <para>
1554    When using multiple <command>INSERT</command>s, turn off autocommit and just do
1555    one commit at the end.  (In plain
1556    SQL, this means issuing <command>BEGIN</command> at the start and
1557    <command>COMMIT</command> at the end.  Some client libraries might
1558    do this behind your back, in which case you need to make sure the
1559    library does it when you want it done.)  If you allow each
1560    insertion to be committed separately,
1561    <productname>PostgreSQL</productname> is doing a lot of work for
1562    each row that is added.  An additional benefit of doing all
1563    insertions in one transaction is that if the insertion of one row
1564    were to fail then the insertion of all rows inserted up to that
1565    point would be rolled back, so you won't be stuck with partially
1566    loaded data.
1567   </para>
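   <para>
    In plain SQL, a batched load might look like this (the table and values
    are hypothetical, for illustration):
<programlisting>
BEGIN;
INSERT INTO mytable VALUES (1, 'one');
INSERT INTO mytable VALUES (2, 'two');
-- ... many more rows ...
COMMIT;
</programlisting>
   </para>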
1568  </sect2>
1569
1570  <sect2 id="populate-copy-from">
1571   <title>Use <command>COPY</command></title>
1572
1573   <para>
1574    Use <xref linkend="sql-copy"/> to load
1575    all the rows in one command, instead of using a series of
1576    <command>INSERT</command> commands.  The <command>COPY</command>
1577    command is optimized for loading large numbers of rows; it is less
1578    flexible than <command>INSERT</command>, but incurs significantly
1579    less overhead for large data loads. Since <command>COPY</command>
1580    is a single command, there is no need to disable autocommit if you
1581    use this method to populate a table.
1582   </para>
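   <para>
    For example (the table and file path are hypothetical):
<programlisting>
COPY mytable FROM '/path/to/data.csv' WITH (FORMAT csv);
</programlisting>
   </para>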
1583
1584   <para>
1585    If you cannot use <command>COPY</command>, it might help to use <xref
1586    linkend="sql-prepare"/> to create a
1587    prepared <command>INSERT</command> statement, and then use
1588    <command>EXECUTE</command> as many times as required.  This avoids
1589    some of the overhead of repeatedly parsing and planning
1590    <command>INSERT</command>. Different interfaces provide this facility
1591    in different ways; look for <quote>prepared statements</quote> in the interface
1592    documentation.
1593   </para>
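   <para>
    In plain SQL, that approach looks something like this (again using a
    hypothetical table):
<programlisting>
PREPARE ins (integer, text) AS INSERT INTO mytable VALUES ($1, $2);
EXECUTE ins(1, 'one');
EXECUTE ins(2, 'two');
</programlisting>
   </para>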
1594
1595   <para>
1596    Note that loading a large number of rows using
1597    <command>COPY</command> is almost always faster than using
1598    <command>INSERT</command>, even if <command>PREPARE</command> is used and
1599    multiple insertions are batched into a single transaction.
1600   </para>
1601
1602   <para>
1603    <command>COPY</command> is fastest when used within the same
1604    transaction as an earlier <command>CREATE TABLE</command> or
1605    <command>TRUNCATE</command> command. In such cases no WAL
1606    needs to be written, because in case of an error, the files
1607    containing the newly loaded data will be removed anyway.
    However, this consideration only applies for non-partitioned tables
    when <xref linkend="guc-wal-level"/> is <literal>minimal</literal>;
    otherwise, all commands must write WAL.
1611   </para>
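   <para>
    A sketch of this pattern (hypothetical table and file path):
<programlisting>
BEGIN;
TRUNCATE mytable;
COPY mytable FROM '/path/to/data.csv' WITH (FORMAT csv);
COMMIT;
</programlisting>
   </para>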
1612
1613  </sect2>
1614
1615  <sect2 id="populate-rm-indexes">
1616   <title>Remove Indexes</title>
1617
1618   <para>
1619    If you are loading a freshly created table, the fastest method is to
1620    create the table, bulk load the table's data using
1621    <command>COPY</command>, then create any indexes needed for the
1622    table.  Creating an index on pre-existing data is quicker than
1623    updating it incrementally as each row is loaded.
1624   </para>
1625
1626   <para>
1627    If you are adding large amounts of data to an existing table,
1628    it might be a win to drop the indexes,
1629    load the table, and then recreate the indexes.  Of course, the
1630    database performance for other users might suffer
1631    during the time the indexes are missing.  One should also think
1632    twice before dropping a unique index, since the error checking
1633    afforded by the unique constraint will be lost while the index is
1634    missing.
1635   </para>
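   <para>
    A sketch of this approach, using a hypothetical table and index:
<programlisting>
DROP INDEX mytable_col_idx;
COPY mytable FROM '/path/to/data.csv' WITH (FORMAT csv);
CREATE INDEX mytable_col_idx ON mytable (col);
</programlisting>
   </para>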
1636  </sect2>
1637
1638  <sect2 id="populate-rm-fkeys">
1639   <title>Remove Foreign Key Constraints</title>
1640
1641   <para>
1642    Just as with indexes, a foreign key constraint can be checked
1643    <quote>in bulk</quote> more efficiently than row-by-row.  So it might be
1644    useful to drop foreign key constraints, load data, and re-create
1645    the constraints.  Again, there is a trade-off between data load
1646    speed and loss of error checking while the constraint is missing.
1647   </para>
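   <para>
    A sketch of this approach, using hypothetical tables and a hypothetical
    constraint name:
<programlisting>
ALTER TABLE orders DROP CONSTRAINT orders_customer_id_fkey;
COPY orders FROM '/path/to/orders.csv' WITH (FORMAT csv);
ALTER TABLE orders ADD CONSTRAINT orders_customer_id_fkey
    FOREIGN KEY (customer_id) REFERENCES customers (id);
</programlisting>
    Re-adding the constraint validates all existing rows in a single pass.
   </para>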
1648
1649   <para>
1650    What's more, when you load data into a table with existing foreign key
1651    constraints, each new row requires an entry in the server's list of
1652    pending trigger events (since it is the firing of a trigger that checks
1653    the row's foreign key constraint).  Loading many millions of rows can
1654    cause the trigger event queue to overflow available memory, leading to
1655    intolerable swapping or even outright failure of the command.  Therefore
1656    it may be <emphasis>necessary</emphasis>, not just desirable, to drop and re-apply
1657    foreign keys when loading large amounts of data.  If temporarily removing
1658    the constraint isn't acceptable, the only other recourse may be to split
1659    up the load operation into smaller transactions.
1660   </para>
1661  </sect2>
1662
1663  <sect2 id="populate-work-mem">
1664   <title>Increase <varname>maintenance_work_mem</varname></title>
1665
1666   <para>
1667    Temporarily increasing the <xref linkend="guc-maintenance-work-mem"/>
1668    configuration variable when loading large amounts of data can
1669    lead to improved performance.  This will help to speed up <command>CREATE
1670    INDEX</command> commands and <command>ALTER TABLE ADD FOREIGN KEY</command> commands.
1671    It won't do much for <command>COPY</command> itself, so this advice is
1672    only useful when you are using one or both of the above techniques.
1673   </para>
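   <para>
    For example, to raise the setting for the current session only:
<programlisting>
SET maintenance_work_mem = '1GB';
</programlisting>
   </para>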
1674  </sect2>
1675
1676  <sect2 id="populate-max-wal-size">
1677   <title>Increase <varname>max_wal_size</varname></title>
1678
1679   <para>
1680    Temporarily increasing the <xref linkend="guc-max-wal-size"/>
1681    configuration variable can also
1682    make large data loads faster.  This is because loading a large
1683    amount of data into <productname>PostgreSQL</productname> will
1684    cause checkpoints to occur more often than the normal checkpoint
1685    frequency (specified by the <varname>checkpoint_timeout</varname>
1686    configuration variable). Whenever a checkpoint occurs, all dirty
1687    pages must be flushed to disk. By increasing
1688    <varname>max_wal_size</varname> temporarily during bulk
1689    data loads, the number of checkpoints that are required can be
1690    reduced.
1691   </para>
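   <para>
    <varname>max_wal_size</varname> cannot be changed with a plain
    <command>SET</command>; it can be raised via <command>ALTER
    SYSTEM</command> followed by a configuration reload, for example:
<programlisting>
ALTER SYSTEM SET max_wal_size = '10GB';
SELECT pg_reload_conf();
</programlisting>
   </para>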
1692  </sect2>
1693
1694  <sect2 id="populate-pitr">
1695   <title>Disable WAL Archival and Streaming Replication</title>
1696
1697   <para>
1698    When loading large amounts of data into an installation that uses
1699    WAL archiving or streaming replication, it might be faster to take a
1700    new base backup after the load has completed than to process a large
1701    amount of incremental WAL data.  To prevent incremental WAL logging
1702    while loading, disable archiving and streaming replication, by setting
1703    <xref linkend="guc-wal-level"/> to <literal>minimal</literal>,
1704    <xref linkend="guc-archive-mode"/> to <literal>off</literal>, and
1705    <xref linkend="guc-max-wal-senders"/> to zero.
1706    But note that changing these settings requires a server restart.
1707   </para>
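   <para>
    The corresponding <filename>postgresql.conf</filename> settings would be:
<programlisting>
wal_level = minimal
archive_mode = off
max_wal_senders = 0
</programlisting>
    followed by a server restart.
   </para>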
1708
1709   <para>
1710    Aside from avoiding the time for the archiver or WAL sender to
1711    process the WAL data,
1712    doing this will actually make certain commands faster, because they
1713    are designed not to write WAL at all if <varname>wal_level</varname>
1714    is <literal>minimal</literal>.  (They can guarantee crash safety more cheaply
1715    by doing an <function>fsync</function> at the end than by writing WAL.)
1716    This applies to the following commands:
1717    <itemizedlist>
1718     <listitem>
1719      <para>
1720       <command>CREATE TABLE AS SELECT</command>
1721      </para>
1722     </listitem>
1723     <listitem>
1724      <para>
1725       <command>CREATE INDEX</command> (and variants such as
1726       <command>ALTER TABLE ADD PRIMARY KEY</command>)
1727      </para>
1728     </listitem>
1729     <listitem>
1730      <para>
1731       <command>ALTER TABLE SET TABLESPACE</command>
1732      </para>
1733     </listitem>
1734     <listitem>
1735      <para>
1736       <command>CLUSTER</command>
1737      </para>
1738     </listitem>
1739     <listitem>
1740      <para>
1741       <command>COPY FROM</command>, when the target table has been
1742       created or truncated earlier in the same transaction
1743      </para>
1744     </listitem>
1745    </itemizedlist>
1746   </para>
1747  </sect2>
1748
1749  <sect2 id="populate-analyze">
1750   <title>Run <command>ANALYZE</command> Afterwards</title>
1751
1752   <para>
1753    Whenever you have significantly altered the distribution of data
1754    within a table, running <xref linkend="sql-analyze"/> is strongly recommended. This
1755    includes bulk loading large amounts of data into the table.  Running
1756    <command>ANALYZE</command> (or <command>VACUUM ANALYZE</command>)
1757    ensures that the planner has up-to-date statistics about the
1758    table.  With no statistics or obsolete statistics, the planner might
1759    make poor decisions during query planning, leading to poor
1760    performance on any tables with inaccurate or nonexistent
1761    statistics.  Note that if the autovacuum daemon is enabled, it might
1762    run <command>ANALYZE</command> automatically; see
1763    <xref linkend="vacuum-for-statistics"/>
1764    and <xref linkend="autovacuum"/> for more information.
1765   </para>
1766  </sect2>
1767
1768  <sect2 id="populate-pg-dump">
1769   <title>Some Notes about <application>pg_dump</application></title>
1770
1771   <para>
1772    Dump scripts generated by <application>pg_dump</application> automatically apply
1773    several, but not all, of the above guidelines.  To reload a
1774    <application>pg_dump</application> dump as quickly as possible, you need to
1775    do a few extra things manually.  (Note that these points apply while
1776    <emphasis>restoring</emphasis> a dump, not while <emphasis>creating</emphasis> it.
1777    The same points apply whether loading a text dump with
1778    <application>psql</application> or using <application>pg_restore</application> to load
1779    from a <application>pg_dump</application> archive file.)
1780   </para>
1781
1782   <para>
1783    By default, <application>pg_dump</application> uses <command>COPY</command>, and when
1784    it is generating a complete schema-and-data dump, it is careful to
1785    load data before creating indexes and foreign keys.  So in this case
1786    several guidelines are handled automatically.  What is left
1787    for you to do is to:
1788    <itemizedlist>
1789     <listitem>
1790      <para>
1791       Set appropriate (i.e., larger than normal) values for
1792       <varname>maintenance_work_mem</varname> and
1793       <varname>max_wal_size</varname>.
1794      </para>
1795     </listitem>
1796     <listitem>
1797      <para>
1798       If using WAL archiving or streaming replication, consider disabling
1799       them during the restore. To do that, set <varname>archive_mode</varname>
1800       to <literal>off</literal>,
1801       <varname>wal_level</varname> to <literal>minimal</literal>, and
1802       <varname>max_wal_senders</varname> to zero before loading the dump.
1803       Afterwards, set them back to the right values and take a fresh
1804       base backup.
1805      </para>
1806     </listitem>
1807     <listitem>
1808      <para>
1809       Experiment with the parallel dump and restore modes of both
1810       <application>pg_dump</application> and <application>pg_restore</application> and find the
1811       optimal number of concurrent jobs to use. Dumping and restoring in
       parallel by means of the <option>-j</option> option should give you
       significantly higher performance than the serial mode.
1814      </para>
1815     </listitem>
1816     <listitem>
1817      <para>
1818       Consider whether the whole dump should be restored as a single
1819       transaction.  To do that, pass the <option>-1</option> or
1820       <option>--single-transaction</option> command-line option to
1821       <application>psql</application> or <application>pg_restore</application>. When using this
       mode, even the smallest of errors will roll back the entire restore,
1823       possibly discarding many hours of processing.  Depending on how
1824       interrelated the data is, that might seem preferable to manual cleanup,
1825       or not.  <command>COPY</command> commands will run fastest if you use a single
1826       transaction and have WAL archiving turned off.
1827      </para>
1828     </listitem>
1829     <listitem>
1830      <para>
1831       If multiple CPUs are available in the database server, consider using
1832       <application>pg_restore</application>'s <option>--jobs</option> option.  This
1833       allows concurrent data loading and index creation.
1834      </para>
1835     </listitem>
1836     <listitem>
1837      <para>
1838       Run <command>ANALYZE</command> afterwards.
1839      </para>
1840     </listitem>
1841    </itemizedlist>
1842   </para>
1843
1844   <para>
1845    A data-only dump will still use <command>COPY</command>, but it does not
1846    drop or recreate indexes, and it does not normally touch foreign
1847    keys.
1848
1849     <footnote>
1850      <para>
1851       You can get the effect of disabling foreign keys by using
1852       the <option>--disable-triggers</option> option &mdash; but realize that
1853       that eliminates, rather than just postpones, foreign key
1854       validation, and so it is possible to insert bad data if you use it.
1855      </para>
1856     </footnote>
1857
1858    So when loading a data-only dump, it is up to you to drop and recreate
1859    indexes and foreign keys if you wish to use those techniques.
1860    It's still useful to increase <varname>max_wal_size</varname>
1861    while loading the data, but don't bother increasing
1862    <varname>maintenance_work_mem</varname>; rather, you'd do that while
1863    manually recreating indexes and foreign keys afterwards.
1864    And don't forget to <command>ANALYZE</command> when you're done; see
1865    <xref linkend="vacuum-for-statistics"/>
1866    and <xref linkend="autovacuum"/> for more information.
1867   </para>
1868  </sect2>
1869  </sect1>
1870
1871  <sect1 id="non-durability">
1872   <title>Non-Durable Settings</title>
1873
1874   <indexterm zone="non-durability">
1875    <primary>non-durable</primary>
1876   </indexterm>
1877
1878   <para>
1879    Durability is a database feature that guarantees the recording of
1880    committed transactions even if the server crashes or loses
1881    power.  However, durability adds significant database overhead,
1882    so if your site does not require such a guarantee,
1883    <productname>PostgreSQL</productname> can be configured to run
1884    much faster.  The following are configuration changes you can make
1885    to improve performance in such cases.  Except as noted below, durability
1886    is still guaranteed in case of a crash of the database software;
1887    only abrupt operating system stoppage creates a risk of data loss
1888    or corruption when these settings are used.
1889
1890    <itemizedlist>
1891     <listitem>
1892      <para>
1893       Place the database cluster's data directory in a memory-backed
1894       file system (i.e., <acronym>RAM</acronym> disk).  This eliminates all
1895       database disk I/O, but limits data storage to the amount of
1896       available memory (and perhaps swap).
1897      </para>
1898     </listitem>
1899
1900     <listitem>
1901      <para>
1902       Turn off <xref linkend="guc-fsync"/>;  there is no need to flush
1903       data to disk.
1904      </para>
1905     </listitem>
1906
1907     <listitem>
1908      <para>
1909       Turn off <xref linkend="guc-synchronous-commit"/>;  there might be no
1910       need to force <acronym>WAL</acronym> writes to disk on every
1911       commit.  This setting does risk transaction loss (though not data
1912       corruption) in case of a crash of the <emphasis>database</emphasis>.
1913      </para>
1914     </listitem>
1915
1916     <listitem>
1917      <para>
1918       Turn off <xref linkend="guc-full-page-writes"/>;  there is no need
1919       to guard against partial page writes.
1920      </para>
1921     </listitem>
1922
1923     <listitem>
1924      <para>
1925       Increase <xref linkend="guc-max-wal-size"/> and <xref
1926       linkend="guc-checkpoint-timeout"/>; this reduces the frequency
1927       of checkpoints, but increases the storage requirements of
1928       <filename>/pg_wal</filename>.
1929      </para>
1930     </listitem>
1931
1932     <listitem>
1933      <para>
1934       Create <link linkend="sql-createtable-unlogged">unlogged
1935       tables</link> to avoid <acronym>WAL</acronym> writes, though it
1936       makes the tables non-crash-safe.
1937      </para>
1938     </listitem>
1939
1940    </itemizedlist>
1941   </para>
1942  </sect1>
1943
1944 </chapter>
1945