1<!-- doc/src/sgml/indexam.sgml -->
2
3<chapter id="indexam">
4 <title>Index Access Method Interface Definition</title>
5
6 <indexterm>
7  <primary>Index Access Method</primary>
8 </indexterm>
9 <indexterm>
10  <primary>indexam</primary>
11  <secondary>Index Access Method</secondary>
12 </indexterm>
13
14  <para>
15   This chapter defines the interface between the core
16   <productname>PostgreSQL</productname> system and <firstterm>index access
17   methods</firstterm>, which manage individual index types.  The core system
18   knows nothing about indexes beyond what is specified here, so it is
19   possible to develop entirely new index types by writing add-on code.
20  </para>
21
22  <para>
23   All indexes in <productname>PostgreSQL</productname> are what are known
24   technically as <firstterm>secondary indexes</firstterm>; that is, the index is
25   physically separate from the table file that it describes.  Each index
26   is stored as its own physical <firstterm>relation</firstterm> and so is described
27   by an entry in the <structname>pg_class</structname> catalog.  The contents of an
28   index are entirely under the control of its index access method.  In
29   practice, all index access methods divide indexes into standard-size
30   pages so that they can use the regular storage manager and buffer manager
31   to access the index contents.  (All the existing index access methods
32   furthermore use the standard page layout described in <xref
33   linkend="storage-page-layout"/>, and most use the same format for index
34   tuple headers; but these decisions are not forced on an access method.)
35  </para>
36
37  <para>
38   An index is effectively a mapping from some data key values to
39   <firstterm>tuple identifiers</firstterm>, or <acronym>TIDs</acronym>, of row versions
40   (tuples) in the index's parent table.  A TID consists of a
41   block number and an item number within that block (see <xref
42   linkend="storage-page-layout"/>).  This is sufficient
43   information to fetch a particular row version from the table.
44   Indexes are not directly aware that under MVCC, there might be multiple
45   extant versions of the same logical row; to an index, each tuple is
46   an independent object that needs its own index entry.  Thus, an
47   update of a row always creates all-new index entries for the row, even if
48   the key values did not change.  (HOT tuples are an exception to this
49   statement; but indexes do not deal with those, either.)  Index entries for
50   dead tuples are reclaimed (by vacuuming) when the dead tuples themselves
51   are reclaimed.
52  </para>
53
54 <sect1 id="index-api">
55  <title>Basic API Structure for Indexes</title>
56
57  <para>
58   Each index access method is described by a row in the
59   <link linkend="catalog-pg-am"><structname>pg_am</structname></link>
60   system catalog.  The <structname>pg_am</structname> entry
61   specifies a name and a <firstterm>handler function</firstterm> for the index
62   access method.  These entries can be created and deleted using the
63   <xref linkend="sql-create-access-method"/> and
64   <xref linkend="sql-drop-access-method"/> SQL commands.
65  </para>
66
67  <para>
68   An index access method handler function must be declared to accept a
69   single argument of type <type>internal</type> and to return the
70   pseudo-type <type>index_am_handler</type>.  The argument is a dummy value that
71   simply serves to prevent handler functions from being called directly from
72   SQL commands.  The result of the function must be a palloc'd struct of
73   type <structname>IndexAmRoutine</structname>, which contains everything
74   that the core code needs to know to make use of the index access method.
75   The <structname>IndexAmRoutine</structname> struct, also called the access
76   method's <firstterm>API struct</firstterm>, includes fields specifying assorted
77   fixed properties of the access method, such as whether it can support
78   multicolumn indexes.  More importantly, it contains pointers to support
79   functions for the access method, which do all of the real work to access
80   indexes.  These support functions are plain C functions and are not
81   visible or callable at the SQL level.  The support functions are described
82   in <xref linkend="index-functions"/>.
83  </para>
84
85  <para>
86   The structure <structname>IndexAmRoutine</structname> is defined thus:
87<programlisting>
88typedef struct IndexAmRoutine
89{
90    NodeTag     type;
91
92    /*
93     * Total number of strategies (operators) by which we can traverse/search
94     * this AM.  Zero if AM does not have a fixed set of strategy assignments.
95     */
96    uint16      amstrategies;
97    /* total number of support functions that this AM uses */
98    uint16      amsupport;
99    /* does AM support ORDER BY indexed column's value? */
100    bool        amcanorder;
101    /* does AM support ORDER BY result of an operator on indexed column? */
102    bool        amcanorderbyop;
103    /* does AM support backward scanning? */
104    bool        amcanbackward;
105    /* does AM support UNIQUE indexes? */
106    bool        amcanunique;
107    /* does AM support multi-column indexes? */
108    bool        amcanmulticol;
109    /* does AM require scans to have a constraint on the first index column? */
110    bool        amoptionalkey;
111    /* does AM handle ScalarArrayOpExpr quals? */
112    bool        amsearcharray;
113    /* does AM handle IS NULL/IS NOT NULL quals? */
114    bool        amsearchnulls;
115    /* can index storage data type differ from column data type? */
116    bool        amstorage;
117    /* can an index of this type be clustered on? */
118    bool        amclusterable;
119    /* does AM handle predicate locks? */
120    bool        ampredlocks;
121    /* does AM support parallel scan? */
122    bool        amcanparallel;
123    /* does AM support columns included with clause INCLUDE? */
124    bool        amcaninclude;
125    /* type of data stored in index, or InvalidOid if variable */
126    Oid         amkeytype;
127
128    /* interface functions */
129    ambuild_function ambuild;
130    ambuildempty_function ambuildempty;
131    aminsert_function aminsert;
132    ambulkdelete_function ambulkdelete;
133    amvacuumcleanup_function amvacuumcleanup;
134    amcanreturn_function amcanreturn;   /* can be NULL */
135    amcostestimate_function amcostestimate;
136    amoptions_function amoptions;
137    amproperty_function amproperty;     /* can be NULL */
138    ambuildphasename_function ambuildphasename;   /* can be NULL */
139    amvalidate_function amvalidate;
140    ambeginscan_function ambeginscan;
141    amrescan_function amrescan;
142    amgettuple_function amgettuple;     /* can be NULL */
143    amgetbitmap_function amgetbitmap;   /* can be NULL */
144    amendscan_function amendscan;
145    ammarkpos_function ammarkpos;       /* can be NULL */
146    amrestrpos_function amrestrpos;     /* can be NULL */
147
148    /* interface functions to support parallel index scans */
149    amestimateparallelscan_function amestimateparallelscan;    /* can be NULL */
150    aminitparallelscan_function aminitparallelscan;    /* can be NULL */
151    amparallelrescan_function amparallelrescan;    /* can be NULL */
152} IndexAmRoutine;
153</programlisting>
154  </para>
155
156  <para>
157   To be useful, an index access method must also have one or more
158   <firstterm>operator families</firstterm> and
159   <firstterm>operator classes</firstterm> defined in
160   <link linkend="catalog-pg-opfamily"><structname>pg_opfamily</structname></link>,
161   <link linkend="catalog-pg-opclass"><structname>pg_opclass</structname></link>,
162   <link linkend="catalog-pg-amop"><structname>pg_amop</structname></link>, and
163   <link linkend="catalog-pg-amproc"><structname>pg_amproc</structname></link>.
164   These entries allow the planner
165   to determine what kinds of query qualifications can be used with
166   indexes of this access method.  Operator families and classes are described
167   in <xref linkend="xindex"/>, which is prerequisite material for reading
168   this chapter.
169  </para>
170
171  <para>
172   An individual index is defined by a
173   <link linkend="catalog-pg-class"><structname>pg_class</structname></link>
174   entry that describes it as a physical relation, plus a
175   <link linkend="catalog-pg-index"><structname>pg_index</structname></link>
176   entry that shows the logical content of the index &mdash; that is, the set
177   of index columns it has and the semantics of those columns, as captured by
178   the associated operator classes.  The index columns (key values) can be
179   either simple columns of the underlying table or expressions over the table
180   rows.  The index access method normally has no interest in where the index
181   key values come from (it is always handed precomputed key values) but it
182   will be very interested in the operator class information in
183   <structname>pg_index</structname>.  Both of these catalog entries can be
184   accessed as part of the <structname>Relation</structname> data structure that is
185   passed to all operations on the index.
186  </para>
187
188  <para>
189   Some of the flag fields of <structname>IndexAmRoutine</structname> have nonobvious
190   implications.  The requirements of <structfield>amcanunique</structfield>
191   are discussed in <xref linkend="index-unique-checks"/>.
192   The <structfield>amcanmulticol</structfield> flag asserts that the
193   access method supports multi-key-column indexes, while
194   <structfield>amoptionalkey</structfield> asserts that it allows scans
195   where no indexable restriction clause is given for the first index column.
196   When <structfield>amcanmulticol</structfield> is false,
197   <structfield>amoptionalkey</structfield> essentially says whether the
198   access method supports full-index scans without any restriction clause.
199   Access methods that support multiple index columns <emphasis>must</emphasis>
200   support scans that omit restrictions on any or all of the columns after
201   the first; however they are permitted to require some restriction to
202   appear for the first index column, and this is signaled by setting
203   <structfield>amoptionalkey</structfield> false.
204   One reason that an index AM might set
205   <structfield>amoptionalkey</structfield> false is if it doesn't index
206   null values.  Since most indexable operators are
207   strict and hence cannot return true for null inputs,
208   it is at first sight attractive to not store index entries for null values:
209   they could never be returned by an index scan anyway.  However, this
210   argument fails when an index scan has no restriction clause for a given
211   index column.  In practice this means that
212   indexes that have <structfield>amoptionalkey</structfield> true must
213   index nulls, since the planner might decide to use such an index
214   with no scan keys at all.  A related restriction is that an index
215   access method that supports multiple index columns <emphasis>must</emphasis>
216   support indexing null values in columns after the first, because the planner
217   will assume the index can be used for queries that do not restrict
218   these columns.  For example, consider an index on (a,b) and a query with
219   <literal>WHERE a = 4</literal>.  The system will assume the index can be
220   used to scan for rows with <literal>a = 4</literal>, which is wrong if the
221   index omits rows where <literal>b</literal> is null.
222   It is, however, OK to omit rows where the first indexed column is null.
223   An index access method that does index nulls may also set
224   <structfield>amsearchnulls</structfield>, indicating that it supports
225   <literal>IS NULL</literal> and <literal>IS NOT NULL</literal> clauses as search
226   conditions.
227  </para>
228
229  <para>
230   The <structfield>amcaninclude</structfield> flag indicates whether the
231   access method supports <quote>included</quote> columns, that is it can
232   store (without processing) additional columns beyond the key column(s).
233   The requirements of the preceding paragraph apply only to the key
234   columns.  In particular, the combination
235   of <structfield>amcanmulticol</structfield>=<literal>false</literal>
236   and <structfield>amcaninclude</structfield>=<literal>true</literal> is
237   sensible: it means that there can only be one key column, but there can
238   also be included column(s).  Also, included columns must be allowed to be
239   null, independently of <structfield>amoptionalkey</structfield>.
240  </para>
241
242 </sect1>
243
244 <sect1 id="index-functions">
245  <title>Index Access Method Functions</title>
246
247  <para>
248   The index construction and maintenance functions that an index access
249   method must provide in <structname>IndexAmRoutine</structname> are:
250  </para>
251
252  <para>
253<programlisting>
254IndexBuildResult *
255ambuild (Relation heapRelation,
256         Relation indexRelation,
257         IndexInfo *indexInfo);
258</programlisting>
259   Build a new index.  The index relation has been physically created,
260   but is empty.  It must be filled in with whatever fixed data the
261   access method requires, plus entries for all tuples already existing
262   in the table.  Ordinarily the <function>ambuild</function> function will call
263   <function>table_index_build_scan()</function> to scan the table for existing tuples
264   and compute the keys that need to be inserted into the index.
265   The function must return a palloc'd struct containing statistics about
266   the new index.
267  </para>
268
269  <para>
270<programlisting>
271void
272ambuildempty (Relation indexRelation);
273</programlisting>
274   Build an empty index, and write it to the initialization fork (<symbol>INIT_FORKNUM</symbol>)
275   of the given relation.  This method is called only for unlogged indexes; the
276   empty index written to the initialization fork will be copied over the main
277   relation fork on each server restart.
278  </para>
279
280  <para>
281<programlisting>
282bool
283aminsert (Relation indexRelation,
284          Datum *values,
285          bool *isnull,
286          ItemPointer heap_tid,
287          Relation heapRelation,
288          IndexUniqueCheck checkUnique,
289          IndexInfo *indexInfo);
290</programlisting>
291   Insert a new tuple into an existing index.  The <literal>values</literal> and
292   <literal>isnull</literal> arrays give the key values to be indexed, and
293   <literal>heap_tid</literal> is the TID to be indexed.
294   If the access method supports unique indexes (its
295   <structfield>amcanunique</structfield> flag is true) then
296   <literal>checkUnique</literal> indicates the type of uniqueness check to
297   perform.  This varies depending on whether the unique constraint is
298   deferrable; see <xref linkend="index-unique-checks"/> for details.
299   Normally the access method only needs the <literal>heapRelation</literal>
300   parameter when performing uniqueness checking (since then it will have to
301   look into the heap to verify tuple liveness).
302  </para>
303
304  <para>
305   The function's Boolean result value is significant only when
306   <literal>checkUnique</literal> is <literal>UNIQUE_CHECK_PARTIAL</literal>.
307   In this case a true result means the new entry is known unique, whereas
308   false means it might be non-unique (and a deferred uniqueness check must
309   be scheduled).  For other cases a constant false result is recommended.
310  </para>
311
312  <para>
313   Some indexes might not index all tuples.  If the tuple is not to be
314   indexed, <function>aminsert</function> should just return without doing anything.
315  </para>
316
317  <para>
318   If the index AM wishes to cache data across successive index insertions
319   within a SQL statement, it can allocate space
320   in <literal>indexInfo-&gt;ii_Context</literal> and store a pointer to the
321   data in <literal>indexInfo-&gt;ii_AmCache</literal> (which will be NULL
322   initially).
323  </para>
324
325  <para>
326<programlisting>
327IndexBulkDeleteResult *
328ambulkdelete (IndexVacuumInfo *info,
329              IndexBulkDeleteResult *stats,
330              IndexBulkDeleteCallback callback,
331              void *callback_state);
332</programlisting>
333   Delete tuple(s) from the index.  This is a <quote>bulk delete</quote> operation
334   that is intended to be implemented by scanning the whole index and checking
335   each entry to see if it should be deleted.
336   The passed-in <literal>callback</literal> function must be called, in the style
337   <literal>callback(<replaceable>TID</replaceable>, callback_state) returns bool</literal>,
338   to determine whether any particular index entry, as identified by its
339   referenced TID, is to be deleted.  Must return either NULL or a palloc'd
340   struct containing statistics about the effects of the deletion operation.
341   It is OK to return NULL if no information needs to be passed on to
342   <function>amvacuumcleanup</function>.
343  </para>
344
345  <para>
346   Because of limited <varname>maintenance_work_mem</varname>,
347   <function>ambulkdelete</function> might need to be called more than once when many
348   tuples are to be deleted.  The <literal>stats</literal> argument is the result
349   of the previous call for this index (it is NULL for the first call within a
350   <command>VACUUM</command> operation).  This allows the AM to accumulate statistics
351   across the whole operation.  Typically, <function>ambulkdelete</function> will
352   modify and return the same struct if the passed <literal>stats</literal> is not
353   null.
354  </para>
355
356  <para>
357<programlisting>
358IndexBulkDeleteResult *
359amvacuumcleanup (IndexVacuumInfo *info,
360                 IndexBulkDeleteResult *stats);
361</programlisting>
362   Clean up after a <command>VACUUM</command> operation (zero or more
363   <function>ambulkdelete</function> calls).  This does not have to do anything
364   beyond returning index statistics, but it might perform bulk cleanup
365   such as reclaiming empty index pages.  <literal>stats</literal> is whatever the
366   last <function>ambulkdelete</function> call returned, or NULL if
367   <function>ambulkdelete</function> was not called because no tuples needed to be
368   deleted.  If the result is not NULL it must be a palloc'd struct.
369   The statistics it contains will be used to update <structname>pg_class</structname>,
370   and will be reported by <command>VACUUM</command> if <literal>VERBOSE</literal> is given.
371   It is OK to return NULL if the index was not changed at all during the
372   <command>VACUUM</command> operation, but otherwise correct stats should
373   be returned.
374  </para>
375
376  <para>
377   As of <productname>PostgreSQL</productname> 8.4,
378   <function>amvacuumcleanup</function> will also be called at completion of an
379   <command>ANALYZE</command> operation.  In this case <literal>stats</literal> is always
380   NULL and any return value will be ignored.  This case can be distinguished
381   by checking <literal>info-&gt;analyze_only</literal>.  It is recommended
382   that the access method do nothing except post-insert cleanup in such a
383   call, and that only in an autovacuum worker process.
384  </para>
385
386  <para>
387<programlisting>
388bool
389amcanreturn (Relation indexRelation, int attno);
390</programlisting>
391   Check whether the index can support <link
392   linkend="indexes-index-only-scans"><firstterm>index-only scans</firstterm></link> on
393   the given column, by returning the column's original indexed value.
394   The attribute number is 1-based, i.e., the first column's attno is 1.
395   Returns true if supported, else false.
396   This function should always return true for included columns
397   (if those are supported), since there's little point in an included
398   column that can't be retrieved.
399   If the access method does not support index-only scans at all,
400   the <structfield>amcanreturn</structfield> field in its <structname>IndexAmRoutine</structname>
401   struct can be set to NULL.
402  </para>
403
404  <para>
405<programlisting>
406void
407amcostestimate (PlannerInfo *root,
408                IndexPath *path,
409                double loop_count,
410                Cost *indexStartupCost,
411                Cost *indexTotalCost,
412                Selectivity *indexSelectivity,
413                double *indexCorrelation,
414                double *indexPages);
415</programlisting>
416   Estimate the costs of an index scan.  This function is described fully
417   in <xref linkend="index-cost-estimation"/>, below.
418  </para>
419
420  <para>
421<programlisting>
422bytea *
423amoptions (ArrayType *reloptions,
424           bool validate);
425</programlisting>
426   Parse and validate the reloptions array for an index.  This is called only
427   when a non-null reloptions array exists for the index.
428   <parameter>reloptions</parameter> is a <type>text</type> array containing entries of the
429   form <replaceable>name</replaceable><literal>=</literal><replaceable>value</replaceable>.
430   The function should construct a <type>bytea</type> value, which will be copied
431   into the <structfield>rd_options</structfield> field of the index's relcache entry.
432   The data contents of the <type>bytea</type> value are open for the access
433   method to define; most of the standard access methods use struct
434   <structname>StdRdOptions</structname>.
435   When <parameter>validate</parameter> is true, the function should report a suitable
436   error message if any of the options are unrecognized or have invalid
437   values; when <parameter>validate</parameter> is false, invalid entries should be
438   silently ignored.  (<parameter>validate</parameter> is false when loading options
439   already stored in <structname>pg_catalog</structname>; an invalid entry could only
440   be found if the access method has changed its rules for options, and in
441   that case ignoring obsolete entries is appropriate.)
442   It is OK to return NULL if default behavior is wanted.
443  </para>
444
445  <para>
446<programlisting>
447bool
448amproperty (Oid index_oid, int attno,
449            IndexAMProperty prop, const char *propname,
450            bool *res, bool *isnull);
451</programlisting>
452   The <function>amproperty</function> method allows index access methods to override
453   the default behavior of <function>pg_index_column_has_property</function>
454   and related functions.
455   If the access method does not have any special behavior for index property
456   inquiries, the <structfield>amproperty</structfield> field in
457   its <structname>IndexAmRoutine</structname> struct can be set to NULL.
458   Otherwise, the <function>amproperty</function> method will be called with
459   <parameter>index_oid</parameter> and <parameter>attno</parameter> both zero for
460   <function>pg_indexam_has_property</function> calls,
461   or with <parameter>index_oid</parameter> valid and <parameter>attno</parameter> zero for
462   <function>pg_index_has_property</function> calls,
463   or with <parameter>index_oid</parameter> valid and <parameter>attno</parameter> greater than
464   zero for <function>pg_index_column_has_property</function> calls.
465   <parameter>prop</parameter> is an enum value identifying the property being tested,
466   while <parameter>propname</parameter> is the original property name string.
467   If the core code does not recognize the property name
468   then <parameter>prop</parameter> is <literal>AMPROP_UNKNOWN</literal>.
469   Access methods can define custom property names by
470   checking <parameter>propname</parameter> for a match (use <function>pg_strcasecmp</function>
471   to match, for consistency with the core code); for names known to the core
472   code, it's better to inspect <parameter>prop</parameter>.
473   If the <structfield>amproperty</structfield> method returns <literal>true</literal> then
474   it has determined the property test result: it must set <literal>*res</literal>
475   to the boolean value to return, or set <literal>*isnull</literal>
476   to <literal>true</literal> to return a NULL.  (Both of the referenced variables
477   are initialized to <literal>false</literal> before the call.)
478   If the <structfield>amproperty</structfield> method returns <literal>false</literal> then
479   the core code will proceed with its normal logic for determining the
480   property test result.
481  </para>
482
483  <para>
484   Access methods that support ordering operators should
485   implement <literal>AMPROP_DISTANCE_ORDERABLE</literal> property testing, as the
486   core code does not know how to do that and will return NULL.  It may
487   also be advantageous to implement <literal>AMPROP_RETURNABLE</literal> testing,
488   if that can be done more cheaply than by opening the index and calling
489   <function>amcanreturn</function>, which is the core code's default behavior.
490   The default behavior should be satisfactory for all other standard
491   properties.
492  </para>
493
494  <para>
495<programlisting>
496char *
497ambuildphasename (int64 phasenum);
498</programlisting>
499   Return the textual name of the given build phase number.
500   The phase numbers are those reported during an index build via the
501   <function>pgstat_progress_update_param</function> interface.
502   The phase names are then exposed in the
503   <structname>pg_stat_progress_create_index</structname> view.
504  </para>
505
506  <para>
507<programlisting>
508bool
509amvalidate (Oid opclassoid);
510</programlisting>
511   Validate the catalog entries for the specified operator class, so far as
512   the access method can reasonably do that.  For example, this might include
513   testing that all required support functions are provided.
514   The <function>amvalidate</function> function must return false if the opclass is
515   invalid.  Problems should be reported with <function>ereport</function> messages.
516  </para>
517
518
519  <para>
520   The purpose of an index, of course, is to support scans for tuples matching
521   an indexable <literal>WHERE</literal> condition, often called a
522   <firstterm>qualifier</firstterm> or <firstterm>scan key</firstterm>.  The semantics of
523   index scanning are described more fully in <xref linkend="index-scanning"/>,
524   below.  An index access method can support <quote>plain</quote> index scans,
525   <quote>bitmap</quote> index scans, or both.  The scan-related functions that an
526   index access method must or may provide are:
527  </para>
528
529  <para>
530<programlisting>
531IndexScanDesc
532ambeginscan (Relation indexRelation,
533             int nkeys,
534             int norderbys);
535</programlisting>
536   Prepare for an index scan.  The <literal>nkeys</literal> and <literal>norderbys</literal>
537   parameters indicate the number of quals and ordering operators that will be
538   used in the scan; these may be useful for space allocation purposes.
539   Note that the actual values of the scan keys aren't provided yet.
540   The result must be a palloc'd struct.
541   For implementation reasons the index access method
542   <emphasis>must</emphasis> create this struct by calling
543   <function>RelationGetIndexScan()</function>.  In most cases
544   <function>ambeginscan</function> does little beyond making that call and perhaps
545   acquiring locks;
546   the interesting parts of index-scan startup are in <function>amrescan</function>.
547  </para>
548
549  <para>
550<programlisting>
551void
552amrescan (IndexScanDesc scan,
553          ScanKey keys,
554          int nkeys,
555          ScanKey orderbys,
556          int norderbys);
557</programlisting>
558   Start or restart an index scan, possibly with new scan keys.  (To restart
559   using previously-passed keys, NULL is passed for <literal>keys</literal> and/or
560   <literal>orderbys</literal>.)  Note that it is not allowed for
561   the number of keys or order-by operators to be larger than
562   what was passed to <function>ambeginscan</function>.  In practice the restart
563   feature is used when a new outer tuple is selected by a nested-loop join
564   and so a new key comparison value is needed, but the scan key structure
565   remains the same.
566  </para>
567
568  <para>
569<programlisting>
570bool
571amgettuple (IndexScanDesc scan,
572            ScanDirection direction);
573</programlisting>
574   Fetch the next tuple in the given scan, moving in the given
575   direction (forward or backward in the index).  Returns true if a tuple was
576   obtained, false if no matching tuples remain.  In the true case the tuple
577   TID is stored into the <literal>scan</literal> structure.  Note that
578   <quote>success</quote> means only that the index contains an entry that matches
579   the scan keys, not that the tuple necessarily still exists in the heap or
580   will pass the caller's snapshot test.  On success, <function>amgettuple</function>
581   must also set <literal>scan-&gt;xs_recheck</literal> to true or false.
582   False means it is certain that the index entry matches the scan keys.
583   True means this is not certain, and the conditions represented by the
584   scan keys must be rechecked against the heap tuple after fetching it.
585   This provision supports <quote>lossy</quote> index operators.
586   Note that rechecking will extend only to the scan conditions; a partial
587   index predicate (if any) is never rechecked by <function>amgettuple</function>
588   callers.
589  </para>
590
591  <para>
592   If the index supports <link linkend="indexes-index-only-scans">index-only
593   scans</link> (i.e., <function>amcanreturn</function> returns true for any
594   of its columns),
595   then on success the AM must also check <literal>scan-&gt;xs_want_itup</literal>,
596   and if that is true it must return the originally indexed data for the
597   index entry.  Columns for which <function>amcanreturn</function> returns
598   false can be returned as nulls.
599   The data can be returned in the form of an
600   <structname>IndexTuple</structname> pointer stored at <literal>scan-&gt;xs_itup</literal>,
601   with tuple descriptor <literal>scan-&gt;xs_itupdesc</literal>; or in the form of
602   a <structname>HeapTuple</structname> pointer stored at <literal>scan-&gt;xs_hitup</literal>,
603   with tuple descriptor <literal>scan-&gt;xs_hitupdesc</literal>.  (The latter
604   format should be used when reconstructing data that might possibly not fit
605   into an <structname>IndexTuple</structname>.)  In either case,
606   management of the data referenced by the pointer is the access method's
607   responsibility.  The data must remain good at least until the next
608   <function>amgettuple</function>, <function>amrescan</function>, or <function>amendscan</function>
609   call for the scan.
610  </para>
611
612  <para>
613   The <function>amgettuple</function> function need only be provided if the access
614   method supports <quote>plain</quote> index scans.  If it doesn't, the
615   <structfield>amgettuple</structfield> field in its <structname>IndexAmRoutine</structname>
616   struct must be set to NULL.
617  </para>
618
619  <para>
620<programlisting>
621int64
622amgetbitmap (IndexScanDesc scan,
623             TIDBitmap *tbm);
624</programlisting>
625   Fetch all tuples in the given scan and add them to the caller-supplied
626   <type>TIDBitmap</type> (that is, OR the set of tuple IDs into whatever set is already
627   in the bitmap).  The number of tuples fetched is returned (this might be
628   just an approximate count, for instance some AMs do not detect duplicates).
629   While inserting tuple IDs into the bitmap, <function>amgetbitmap</function> can
630   indicate that rechecking of the scan conditions is required for specific
631   tuple IDs.  This is analogous to the <literal>xs_recheck</literal> output parameter
632   of <function>amgettuple</function>.  Note: in the current implementation, support
633   for this feature is conflated with support for lossy storage of the bitmap
634   itself, and therefore callers recheck both the scan conditions and the
635   partial index predicate (if any) for recheckable tuples.  That might not
636   always be true, however.
637   <function>amgetbitmap</function> and
638   <function>amgettuple</function> cannot be used in the same index scan; there
639   are other restrictions too when using <function>amgetbitmap</function>, as explained
640   in <xref linkend="index-scanning"/>.
641  </para>
642
643  <para>
644   The <function>amgetbitmap</function> function need only be provided if the access
645   method supports <quote>bitmap</quote> index scans.  If it doesn't, the
646   <structfield>amgetbitmap</structfield> field in its <structname>IndexAmRoutine</structname>
647   struct must be set to NULL.
648  </para>
649
650  <para>
651<programlisting>
652void
653amendscan (IndexScanDesc scan);
654</programlisting>
655   End a scan and release resources.  The <literal>scan</literal> struct itself
656   should not be freed, but any locks or pins taken internally by the
657   access method must be released, as well as any other memory allocated
658   by <function>ambeginscan</function> and other scan-related functions.
659  </para>
660
661  <para>
662<programlisting>
663void
664ammarkpos (IndexScanDesc scan);
665</programlisting>
666   Mark current scan position.  The access method need only support one
667   remembered scan position per scan.
668  </para>
669
670  <para>
671   The <function>ammarkpos</function> function need only be provided if the access
672   method supports ordered scans.  If it doesn't,
673   the <structfield>ammarkpos</structfield> field in its <structname>IndexAmRoutine</structname>
674   struct may be set to NULL.
675  </para>
676
677  <para>
678<programlisting>
679void
680amrestrpos (IndexScanDesc scan);
681</programlisting>
682   Restore the scan to the most recently marked position.
683  </para>
684
685  <para>
686   The <function>amrestrpos</function> function need only be provided if the access
687   method supports ordered scans.  If it doesn't,
688   the <structfield>amrestrpos</structfield> field in its <structname>IndexAmRoutine</structname>
689   struct may be set to NULL.
690  </para>
691
692  <para>
693   In addition to supporting ordinary index scans, some types of index
694   may wish to support <firstterm>parallel index scans</firstterm>, which allow
695   multiple backends to cooperate in performing an index scan.  The
696   index access method should arrange things so that each cooperating
697   process returns a subset of the tuples that would be performed by
698   an ordinary, non-parallel index scan, but in such a way that the
699   union of those subsets is equal to the set of tuples that would be
700   returned by an ordinary, non-parallel index scan.  Furthermore, while
701   there need not be any global ordering of tuples returned by a parallel
702   scan, the ordering of that subset of tuples returned within each
703   cooperating backend must match the requested ordering.  The following
704   functions may be implemented to support parallel index scans:
705  </para>
706
707  <para>
708<programlisting>
709Size
710amestimateparallelscan (void);
711</programlisting>
712   Estimate and return the number of bytes of dynamic shared memory which
713   the access method will be needed to perform a parallel scan.  (This number
714   is in addition to, not in lieu of, the amount of space needed for
715   AM-independent data in <structname>ParallelIndexScanDescData</structname>.)
716  </para>
717
718  <para>
719   It is not necessary to implement this function for access methods which
720   do not support parallel scans or for which the number of additional bytes
721   of storage required is zero.
722  </para>
723
724  <para>
725<programlisting>
726void
727aminitparallelscan (void *target);
728</programlisting>
729   This function will be called to initialize dynamic shared memory at the
730   beginning of a parallel scan.  <parameter>target</parameter> will point to at least
731   the number of bytes previously returned by
732   <function>amestimateparallelscan</function>, and this function may use that
733   amount of space to store whatever data it wishes.
734  </para>
735
736  <para>
737   It is not necessary to implement this function for access methods which
738   do not support parallel scans or in cases where the shared memory space
739   required needs no initialization.
740  </para>
741
742  <para>
743<programlisting>
744void
745amparallelrescan (IndexScanDesc scan);
746</programlisting>
747   This function, if implemented, will be called when a parallel index scan
748   must be restarted.  It should reset any shared state set up by
749   <function>aminitparallelscan</function> such that the scan will be restarted from
750   the beginning.
751  </para>
752
753 </sect1>
754
755 <sect1 id="index-scanning">
756  <title>Index Scanning</title>
757
758  <para>
759   In an index scan, the index access method is responsible for regurgitating
760   the TIDs of all the tuples it has been told about that match the
761   <firstterm>scan keys</firstterm>.  The access method is <emphasis>not</emphasis> involved in
762   actually fetching those tuples from the index's parent table, nor in
763   determining whether they pass the scan's visibility test or other
764   conditions.
765  </para>
766
767  <para>
768   A scan key is the internal representation of a <literal>WHERE</literal> clause of
769   the form <replaceable>index_key</replaceable> <replaceable>operator</replaceable>
770   <replaceable>constant</replaceable>, where the index key is one of the columns of the
771   index and the operator is one of the members of the operator family
772   associated with that index column.  An index scan has zero or more scan
773   keys, which are implicitly ANDed &mdash; the returned tuples are expected
774   to satisfy all the indicated conditions.
775  </para>
776
777  <para>
778   The access method can report that the index is <firstterm>lossy</firstterm>, or
779   requires rechecks, for a particular query.  This implies that the index
780   scan will return all the entries that pass the scan key, plus possibly
781   additional entries that do not.  The core system's index-scan machinery
782   will then apply the index conditions again to the heap tuple to verify
783   whether or not it really should be selected.  If the recheck option is not
784   specified, the index scan must return exactly the set of matching entries.
785  </para>
786
787  <para>
788   Note that it is entirely up to the access method to ensure that it
789   correctly finds all and only the entries passing all the given scan keys.
790   Also, the core system will simply hand off all the <literal>WHERE</literal>
791   clauses that match the index keys and operator families, without any
792   semantic analysis to determine whether they are redundant or
793   contradictory.  As an example, given
794   <literal>WHERE x &gt; 4 AND x &gt; 14</literal> where <literal>x</literal> is a b-tree
795   indexed column, it is left to the b-tree <function>amrescan</function> function
796   to realize that the first scan key is redundant and can be discarded.
797   The extent of preprocessing needed during <function>amrescan</function> will
798   depend on the extent to which the index access method needs to reduce
799   the scan keys to a <quote>normalized</quote> form.
800  </para>
801
802  <para>
803   Some access methods return index entries in a well-defined order, others
804   do not.  There are actually two different ways that an access method can
805   support sorted output:
806
807    <itemizedlist>
808     <listitem>
809      <para>
810       Access methods that always return entries in the natural ordering
811       of their data (such as btree) should set
812       <structfield>amcanorder</structfield> to true.
813       Currently, such access methods must use btree-compatible strategy
814       numbers for their equality and ordering operators.
815      </para>
816     </listitem>
817     <listitem>
818      <para>
819       Access methods that support ordering operators should set
820       <structfield>amcanorderbyop</structfield> to true.
821       This indicates that the index is capable of returning entries in
822       an order satisfying <literal>ORDER BY</literal> <replaceable>index_key</replaceable>
823       <replaceable>operator</replaceable> <replaceable>constant</replaceable>.  Scan modifiers
824       of that form can be passed to <function>amrescan</function> as described
825       previously.
826      </para>
827     </listitem>
828    </itemizedlist>
829  </para>
830
831  <para>
832   The <function>amgettuple</function> function has a <literal>direction</literal> argument,
833   which can be either <literal>ForwardScanDirection</literal> (the normal case)
834   or  <literal>BackwardScanDirection</literal>.  If the first call after
835   <function>amrescan</function> specifies <literal>BackwardScanDirection</literal>, then the
836   set of matching index entries is to be scanned back-to-front rather than in
837   the normal front-to-back direction, so <function>amgettuple</function> must return
838   the last matching tuple in the index, rather than the first one as it
839   normally would.  (This will only occur for access
840   methods that set <structfield>amcanorder</structfield> to true.)  After the
841   first call, <function>amgettuple</function> must be prepared to advance the scan in
842   either direction from the most recently returned entry.  (But if
843   <structfield>amcanbackward</structfield> is false, all subsequent
844   calls will have the same direction as the first one.)
845  </para>
846
847  <para>
848   Access methods that support ordered scans must support <quote>marking</quote> a
849   position in a scan and later returning to the marked position.  The same
850   position might be restored multiple times.  However, only one position need
851   be remembered per scan; a new <function>ammarkpos</function> call overrides the
852   previously marked position.  An access method that does not support ordered
853   scans need not provide <function>ammarkpos</function> and <function>amrestrpos</function>
854   functions in <structname>IndexAmRoutine</structname>; set those pointers to NULL
855   instead.
856  </para>
857
858  <para>
859   Both the scan position and the mark position (if any) must be maintained
860   consistently in the face of concurrent insertions or deletions in the
861   index.  It is OK if a freshly-inserted entry is not returned by a scan that
862   would have found the entry if it had existed when the scan started, or for
863   the scan to return such an entry upon rescanning or backing
864   up even though it had not been returned the first time through.  Similarly,
865   a concurrent delete might or might not be reflected in the results of a scan.
866   What is important is that insertions or deletions not cause the scan to
867   miss or multiply return entries that were not themselves being inserted or
868   deleted.
869  </para>
870
871  <para>
872   If the index stores the original indexed data values (and not some lossy
873   representation of them), it is useful to
874   support <link linkend="indexes-index-only-scans">index-only scans</link>, in
875   which the index returns the actual data not just the TID of the heap tuple.
876   This will only avoid I/O if the visibility map shows that the TID is on an
877   all-visible page; else the heap tuple must be visited anyway to check
878   MVCC visibility.  But that is no concern of the access method's.
879  </para>
880
881  <para>
882   Instead of using <function>amgettuple</function>, an index scan can be done with
883   <function>amgetbitmap</function> to fetch all tuples in one call.  This can be
884   noticeably more efficient than <function>amgettuple</function> because it allows
885   avoiding lock/unlock cycles within the access method.  In principle
886   <function>amgetbitmap</function> should have the same effects as repeated
887   <function>amgettuple</function> calls, but we impose several restrictions to
888   simplify matters.  First of all, <function>amgetbitmap</function> returns all
889   tuples at once and marking or restoring scan positions isn't
890   supported. Secondly, the tuples are returned in a bitmap which doesn't
891   have any specific ordering, which is why <function>amgetbitmap</function> doesn't
892   take a <literal>direction</literal> argument.  (Ordering operators will never be
893   supplied for such a scan, either.)
894   Also, there is no provision for index-only scans with
895   <function>amgetbitmap</function>, since there is no way to return the contents of
896   index tuples.
897   Finally, <function>amgetbitmap</function>
898   does not guarantee any locking of the returned tuples, with implications
899   spelled out in <xref linkend="index-locking"/>.
900  </para>
901
902  <para>
903   Note that it is permitted for an access method to implement only
904   <function>amgetbitmap</function> and not <function>amgettuple</function>, or vice versa,
905   if its internal implementation is unsuited to one API or the other.
906  </para>
907
908 </sect1>
909
910 <sect1 id="index-locking">
911  <title>Index Locking Considerations</title>
912
913  <para>
914   Index access methods must handle concurrent updates
915   of the index by multiple processes.
916   The core <productname>PostgreSQL</productname> system obtains
917   <literal>AccessShareLock</literal> on the index during an index scan, and
918   <literal>RowExclusiveLock</literal> when updating the index (including plain
919   <command>VACUUM</command>).  Since these lock types do not conflict, the access
920   method is responsible for handling any fine-grained locking it might need.
921   An <literal>ACCESS EXCLUSIVE</literal> lock on the index as a whole will be
922   taken only during index creation, destruction, or <command>REINDEX</command>
923   (<literal>SHARE UPDATE EXCLUSIVE</literal> is taken instead with
924   <literal>CONCURRENTLY</literal>).
925  </para>
926
927  <para>
928   Building an index type that supports concurrent updates usually requires
929   extensive and subtle analysis of the required behavior.  For the b-tree
930   and hash index types, you can read about the design decisions involved in
931   <filename>src/backend/access/nbtree/README</filename> and
932   <filename>src/backend/access/hash/README</filename>.
933  </para>
934
935  <para>
936   Aside from the index's own internal consistency requirements, concurrent
937   updates create issues about consistency between the parent table (the
938   <firstterm>heap</firstterm>) and the index.  Because
939   <productname>PostgreSQL</productname> separates accesses
940   and updates of the heap from those of the index, there are windows in
941   which the index might be inconsistent with the heap.  We handle this problem
942   with the following rules:
943
944    <itemizedlist>
945     <listitem>
946      <para>
947       A new heap entry is made before making its index entries.  (Therefore
948       a concurrent index scan is likely to fail to see the heap entry.
949       This is okay because the index reader would be uninterested in an
950       uncommitted row anyway.  But see <xref linkend="index-unique-checks"/>.)
951      </para>
952     </listitem>
953     <listitem>
954      <para>
955       When a heap entry is to be deleted (by <command>VACUUM</command>), all its
956       index entries must be removed first.
957      </para>
958     </listitem>
959     <listitem>
960      <para>
961       An index scan must maintain a pin
962       on the index page holding the item last returned by
963       <function>amgettuple</function>, and <function>ambulkdelete</function> cannot delete
964       entries from pages that are pinned by other backends.  The need
965       for this rule is explained below.
966      </para>
967     </listitem>
968    </itemizedlist>
969
970   Without the third rule, it is possible for an index reader to
971   see an index entry just before it is removed by <command>VACUUM</command>, and
972   then to arrive at the corresponding heap entry after that was removed by
973   <command>VACUUM</command>.
974   This creates no serious problems if that item
975   number is still unused when the reader reaches it, since an empty
976   item slot will be ignored by <function>heap_fetch()</function>.  But what if a
977   third backend has already re-used the item slot for something else?
978   When using an MVCC-compliant snapshot, there is no problem because
979   the new occupant of the slot is certain to be too new to pass the
980   snapshot test.  However, with a non-MVCC-compliant snapshot (such as
981   <literal>SnapshotAny</literal>), it would be possible to accept and return
982   a row that does not in fact match the scan keys.  We could defend
983   against this scenario by requiring the scan keys to be rechecked
984   against the heap row in all cases, but that is too expensive.  Instead,
985   we use a pin on an index page as a proxy to indicate that the reader
986   might still be <quote>in flight</quote> from the index entry to the matching
987   heap entry.  Making <function>ambulkdelete</function> block on such a pin ensures
988   that <command>VACUUM</command> cannot delete the heap entry before the reader
989   is done with it.  This solution costs little in run time, and adds blocking
990   overhead only in the rare cases where there actually is a conflict.
991  </para>
992
993  <para>
994   This solution requires that index scans be <quote>synchronous</quote>: we have
995   to fetch each heap tuple immediately after scanning the corresponding index
996   entry.  This is expensive for a number of reasons.  An
997   <quote>asynchronous</quote> scan in which we collect many TIDs from the index,
998   and only visit the heap tuples sometime later, requires much less index
999   locking overhead and can allow a more efficient heap access pattern.
1000   Per the above analysis, we must use the synchronous approach for
1001   non-MVCC-compliant snapshots, but an asynchronous scan is workable
1002   for a query using an MVCC snapshot.
1003  </para>
1004
1005  <para>
1006   In an <function>amgetbitmap</function> index scan, the access method does not
1007   keep an index pin on any of the returned tuples.  Therefore
1008   it is only safe to use such scans with MVCC-compliant snapshots.
1009  </para>
1010
1011  <para>
1012   When the <structfield>ampredlocks</structfield> flag is not set, any scan using that
1013   index access method within a serializable transaction will acquire a
1014   nonblocking predicate lock on the full index.  This will generate a
1015   read-write conflict with the insert of any tuple into that index by a
1016   concurrent serializable transaction.  If certain patterns of read-write
1017   conflicts are detected among a set of concurrent serializable
1018   transactions, one of those transactions may be canceled to protect data
1019   integrity.  When the flag is set, it indicates that the index access
1020   method implements finer-grained predicate locking, which will tend to
1021   reduce the frequency of such transaction cancellations.
1022  </para>
1023
1024 </sect1>
1025
1026 <sect1 id="index-unique-checks">
1027  <title>Index Uniqueness Checks</title>
1028
1029  <para>
1030   <productname>PostgreSQL</productname> enforces SQL uniqueness constraints
1031   using <firstterm>unique indexes</firstterm>, which are indexes that disallow
1032   multiple entries with identical keys.  An access method that supports this
1033   feature sets <structfield>amcanunique</structfield> true.
1034   (At present, only b-tree supports it.)  Columns listed in the
1035   <literal>INCLUDE</literal> clause are not considered when enforcing
1036   uniqueness.
1037  </para>
1038
1039  <para>
1040   Because of MVCC, it is always necessary to allow duplicate entries to
1041   exist physically in an index: the entries might refer to successive
1042   versions of a single logical row.  The behavior we actually want to
1043   enforce is that no MVCC snapshot could include two rows with equal
1044   index keys.  This breaks down into the following cases that must be
1045   checked when inserting a new row into a unique index:
1046
1047    <itemizedlist>
1048     <listitem>
1049      <para>
1050       If a conflicting valid row has been deleted by the current transaction,
1051       it's okay.  (In particular, since an UPDATE always deletes the old row
1052       version before inserting the new version, this will allow an UPDATE on
1053       a row without changing the key.)
1054      </para>
1055     </listitem>
1056     <listitem>
1057      <para>
1058       If a conflicting row has been inserted by an as-yet-uncommitted
1059       transaction, the would-be inserter must wait to see if that transaction
1060       commits.  If it rolls back then there is no conflict.  If it commits
1061       without deleting the conflicting row again, there is a uniqueness
1062       violation.  (In practice we just wait for the other transaction to
1063       end and then redo the visibility check in toto.)
1064      </para>
1065     </listitem>
1066     <listitem>
1067      <para>
1068       Similarly, if a conflicting valid row has been deleted by an
1069       as-yet-uncommitted transaction, the would-be inserter must wait
1070       for that transaction to commit or abort, and then repeat the test.
1071      </para>
1072     </listitem>
1073    </itemizedlist>
1074  </para>
1075
1076  <para>
1077   Furthermore, immediately before reporting a uniqueness violation
1078   according to the above rules, the access method must recheck the
1079   liveness of the row being inserted.  If it is committed dead then
1080   no violation should be reported.  (This case cannot occur during the
1081   ordinary scenario of inserting a row that's just been created by
1082   the current transaction.  It can happen during
1083   <command>CREATE UNIQUE INDEX CONCURRENTLY</command>, however.)
1084  </para>
1085
1086  <para>
1087   We require the index access method to apply these tests itself, which
1088   means that it must reach into the heap to check the commit status of
1089   any row that is shown to have a duplicate key according to the index
1090   contents.  This is without a doubt ugly and non-modular, but it saves
1091   redundant work: if we did a separate probe then the index lookup for
1092   a conflicting row would be essentially repeated while finding the place to
1093   insert the new row's index entry.  What's more, there is no obvious way
1094   to avoid race conditions unless the conflict check is an integral part
1095   of insertion of the new index entry.
1096  </para>
1097
1098  <para>
1099   If the unique constraint is deferrable, there is additional complexity:
1100   we need to be able to insert an index entry for a new row, but defer any
1101   uniqueness-violation error until end of statement or even later.  To
1102   avoid unnecessary repeat searches of the index, the index access method
1103   should do a preliminary uniqueness check during the initial insertion.
1104   If this shows that there is definitely no conflicting live tuple, we
1105   are done.  Otherwise, we schedule a recheck to occur when it is time to
1106   enforce the constraint.  If, at the time of the recheck, both the inserted
1107   tuple and some other tuple with the same key are live, then the error
1108   must be reported.  (Note that for this purpose, <quote>live</quote> actually
1109   means <quote>any tuple in the index entry's HOT chain is live</quote>.)
1110   To implement this, the <function>aminsert</function> function is passed a
1111   <literal>checkUnique</literal> parameter having one of the following values:
1112
1113    <itemizedlist>
1114     <listitem>
1115      <para>
1116       <literal>UNIQUE_CHECK_NO</literal> indicates that no uniqueness checking
1117       should be done (this is not a unique index).
1118      </para>
1119     </listitem>
1120     <listitem>
1121      <para>
1122       <literal>UNIQUE_CHECK_YES</literal> indicates that this is a non-deferrable
1123       unique index, and the uniqueness check must be done immediately, as
1124       described above.
1125      </para>
1126     </listitem>
1127     <listitem>
1128      <para>
1129       <literal>UNIQUE_CHECK_PARTIAL</literal> indicates that the unique
1130       constraint is deferrable. <productname>PostgreSQL</productname>
1131       will use this mode to insert each row's index entry.  The access
1132       method must allow duplicate entries into the index, and report any
1133       potential duplicates by returning false from <function>aminsert</function>.
1134       For each row for which false is returned, a deferred recheck will
1135       be scheduled.
1136      </para>
1137
1138      <para>
1139       The access method must identify any rows which might violate the
1140       unique constraint, but it is not an error for it to report false
1141       positives. This allows the check to be done without waiting for other
1142       transactions to finish; conflicts reported here are not treated as
1143       errors and will be rechecked later, by which time they may no longer
1144       be conflicts.
1145      </para>
1146     </listitem>
1147     <listitem>
1148      <para>
1149       <literal>UNIQUE_CHECK_EXISTING</literal> indicates that this is a deferred
1150       recheck of a row that was reported as a potential uniqueness violation.
1151       Although this is implemented by calling <function>aminsert</function>, the
1152       access method must <emphasis>not</emphasis> insert a new index entry in this
1153       case.  The index entry is already present.  Rather, the access method
1154       must check to see if there is another live index entry.  If so, and
1155       if the target row is also still live, report error.
1156      </para>
1157
1158      <para>
1159       It is recommended that in a <literal>UNIQUE_CHECK_EXISTING</literal> call,
1160       the access method further verify that the target row actually does
1161       have an existing entry in the index, and report error if not.  This
1162       is a good idea because the index tuple values passed to
1163       <function>aminsert</function> will have been recomputed.  If the index
1164       definition involves functions that are not really immutable, we
1165       might be checking the wrong area of the index.  Checking that the
1166       target row is found in the recheck verifies that we are scanning
1167       for the same tuple values as were used in the original insertion.
1168      </para>
1169     </listitem>
1170    </itemizedlist>
1171  </para>
1172
1173 </sect1>
1174
1175 <sect1 id="index-cost-estimation">
1176  <title>Index Cost Estimation Functions</title>
1177
1178  <para>
1179   The <function>amcostestimate</function> function is given information describing
1180   a possible index scan, including lists of WHERE and ORDER BY clauses that
1181   have been determined to be usable with the index.  It must return estimates
1182   of the cost of accessing the index and the selectivity of the WHERE
1183   clauses (that is, the fraction of parent-table rows that will be
1184   retrieved during the index scan).  For simple cases, nearly all the
1185   work of the cost estimator can be done by calling standard routines
1186   in the optimizer; the point of having an <function>amcostestimate</function> function is
1187   to allow index access methods to provide index-type-specific knowledge,
1188   in case it is possible to improve on the standard estimates.
1189  </para>
1190
1191  <para>
1192   Each <function>amcostestimate</function> function must have the signature:
1193
1194<programlisting>
1195void
1196amcostestimate (PlannerInfo *root,
1197                IndexPath *path,
1198                double loop_count,
1199                Cost *indexStartupCost,
1200                Cost *indexTotalCost,
1201                Selectivity *indexSelectivity,
1202                double *indexCorrelation,
1203                double *indexPages);
1204</programlisting>
1205
1206   The first three parameters are inputs:
1207
1208   <variablelist>
1209    <varlistentry>
1210     <term><parameter>root</parameter></term>
1211     <listitem>
1212      <para>
1213       The planner's information about the query being processed.
1214      </para>
1215     </listitem>
1216    </varlistentry>
1217
1218    <varlistentry>
1219     <term><parameter>path</parameter></term>
1220     <listitem>
1221      <para>
1222       The index access path being considered.  All fields except cost and
1223       selectivity values are valid.
1224      </para>
1225     </listitem>
1226    </varlistentry>
1227
1228    <varlistentry>
1229     <term><parameter>loop_count</parameter></term>
1230     <listitem>
1231      <para>
1232       The number of repetitions of the index scan that should be factored
1233       into the cost estimates.  This will typically be greater than one when
1234       considering a parameterized scan for use in the inside of a nestloop
1235       join.  Note that the cost estimates should still be for just one scan;
1236       a larger <parameter>loop_count</parameter> means that it may be appropriate
1237       to allow for some caching effects across multiple scans.
1238      </para>
1239     </listitem>
1240    </varlistentry>
1241   </variablelist>
1242  </para>
1243
1244  <para>
1245   The last five parameters are pass-by-reference outputs:
1246
1247   <variablelist>
1248    <varlistentry>
1249     <term><parameter>*indexStartupCost</parameter></term>
1250     <listitem>
1251      <para>
1252       Set to cost of index start-up processing
1253      </para>
1254     </listitem>
1255    </varlistentry>
1256
1257    <varlistentry>
1258     <term><parameter>*indexTotalCost</parameter></term>
1259     <listitem>
1260      <para>
1261       Set to total cost of index processing
1262      </para>
1263     </listitem>
1264    </varlistentry>
1265
1266    <varlistentry>
1267     <term><parameter>*indexSelectivity</parameter></term>
1268     <listitem>
1269      <para>
1270       Set to index selectivity
1271      </para>
1272     </listitem>
1273    </varlistentry>
1274
1275    <varlistentry>
1276     <term><parameter>*indexCorrelation</parameter></term>
1277     <listitem>
1278      <para>
1279       Set to correlation coefficient between index scan order and
1280       underlying table's order
1281      </para>
1282     </listitem>
1283    </varlistentry>
1284
1285    <varlistentry>
1286     <term><parameter>*indexPages</parameter></term>
1287     <listitem>
1288      <para>
1289       Set to number of index leaf pages
1290      </para>
1291     </listitem>
1292    </varlistentry>
1293   </variablelist>
1294  </para>
1295
1296  <para>
1297   Note that cost estimate functions must be written in C, not in SQL or
1298   any available procedural language, because they must access internal
1299   data structures of the planner/optimizer.
1300  </para>
1301
1302  <para>
1303   The index access costs should be computed using the parameters used by
1304   <filename>src/backend/optimizer/path/costsize.c</filename>: a sequential
1305   disk block fetch has cost <varname>seq_page_cost</varname>, a nonsequential fetch
1306   has cost <varname>random_page_cost</varname>, and the cost of processing one index
1307   row should usually be taken as <varname>cpu_index_tuple_cost</varname>.  In
1308   addition, an appropriate multiple of <varname>cpu_operator_cost</varname> should
1309   be charged for any comparison operators invoked during index processing
1310   (especially evaluation of the indexquals themselves).
1311  </para>
1312
1313  <para>
1314   The access costs should include all disk and CPU costs associated with
1315   scanning the index itself, but <emphasis>not</emphasis> the costs of retrieving or
1316   processing the parent-table rows that are identified by the index.
1317  </para>
1318
1319  <para>
1320   The <quote>start-up cost</quote> is the part of the total scan cost that
1321   must be expended before we can begin to fetch the first row.  For most
1322   indexes this can be taken as zero, but an index type with a high start-up
1323   cost might want to set it nonzero.
1324  </para>
1325
1326  <para>
1327   The <parameter>indexSelectivity</parameter> should be set to the estimated fraction of the parent
1328   table rows that will be retrieved during the index scan.  In the case
1329   of a lossy query, this will typically be higher than the fraction of
1330   rows that actually pass the given qual conditions.
1331  </para>
1332
1333  <para>
1334   The <parameter>indexCorrelation</parameter> should be set to the correlation (ranging between
1335   -1.0 and 1.0) between the index order and the table order.  This is used
1336   to adjust the estimate for the cost of fetching rows from the parent
1337   table.
1338  </para>
1339
1340  <para>
1341   The <parameter>indexPages</parameter> should be set to the number of leaf pages.
1342   This is used to estimate the number of workers for parallel index scan.
1343  </para>
1344
1345  <para>
1346   When <parameter>loop_count</parameter> is greater than one, the returned numbers
1347   should be averages expected for any one scan of the index.
1348  </para>
1349
1350  <procedure>
1351   <title>Cost Estimation</title>
1352   <para>
1353    A typical cost estimator will proceed as follows:
1354   </para>
1355
1356   <step>
1357    <para>
1358     Estimate and return the fraction of parent-table rows that will be visited
1359     based on the given qual conditions.  In the absence of any index-type-specific
1360     knowledge, use the standard optimizer function <function>clauselist_selectivity()</function>:
1361
1362<programlisting>
1363*indexSelectivity = clauselist_selectivity(root, path-&gt;indexquals,
1364                                           path-&gt;indexinfo-&gt;rel-&gt;relid,
1365                                           JOIN_INNER, NULL);
1366</programlisting>
1367    </para>
1368   </step>
1369
1370   <step>
1371    <para>
1372     Estimate the number of index rows that will be visited during the
1373     scan.  For many index types this is the same as <parameter>indexSelectivity</parameter> times
1374     the number of rows in the index, but it might be more.  (Note that the
1375     index's size in pages and rows is available from the
1376     <literal>path-&gt;indexinfo</literal> struct.)
1377    </para>
1378   </step>
1379
1380   <step>
1381    <para>
1382     Estimate the number of index pages that will be retrieved during the scan.
1383     This might be just <parameter>indexSelectivity</parameter> times the index's size in pages.
1384    </para>
1385   </step>
1386
1387   <step>
1388    <para>
1389     Compute the index access cost.  A generic estimator might do this:
1390
1391<programlisting>
1392/*
1393 * Our generic assumption is that the index pages will be read
1394 * sequentially, so they cost seq_page_cost each, not random_page_cost.
1395 * Also, we charge for evaluation of the indexquals at each index row.
1396 * All the costs are assumed to be paid incrementally during the scan.
1397 */
1398cost_qual_eval(&amp;index_qual_cost, path-&gt;indexquals, root);
1399*indexStartupCost = index_qual_cost.startup;
1400*indexTotalCost = seq_page_cost * numIndexPages +
1401    (cpu_index_tuple_cost + index_qual_cost.per_tuple) * numIndexTuples;
1402</programlisting>
1403
1404     However, the above does not account for amortization of index reads
1405     across repeated index scans.
1406    </para>
1407   </step>
1408
1409   <step>
1410    <para>
1411     Estimate the index correlation.  For a simple ordered index on a single
1412     field, this can be retrieved from pg_statistic.  If the correlation
1413     is not known, the conservative estimate is zero (no correlation).
1414    </para>
1415   </step>
1416  </procedure>
1417
1418  <para>
1419   Examples of cost estimator functions can be found in
1420   <filename>src/backend/utils/adt/selfuncs.c</filename>.
1421  </para>
1422 </sect1>
1423</chapter>
1424