1<!-- doc/src/sgml/indexam.sgml --> 2 3<chapter id="indexam"> 4 <title>Index Access Method Interface Definition</title> 5 6 <indexterm> 7 <primary>Index Access Method</primary> 8 </indexterm> 9 <indexterm> 10 <primary>indexam</primary> 11 <secondary>Index Access Method</secondary> 12 </indexterm> 13 14 <para> 15 This chapter defines the interface between the core 16 <productname>PostgreSQL</productname> system and <firstterm>index access 17 methods</firstterm>, which manage individual index types. The core system 18 knows nothing about indexes beyond what is specified here, so it is 19 possible to develop entirely new index types by writing add-on code. 20 </para> 21 22 <para> 23 All indexes in <productname>PostgreSQL</productname> are what are known 24 technically as <firstterm>secondary indexes</firstterm>; that is, the index is 25 physically separate from the table file that it describes. Each index 26 is stored as its own physical <firstterm>relation</firstterm> and so is described 27 by an entry in the <structname>pg_class</structname> catalog. The contents of an 28 index are entirely under the control of its index access method. In 29 practice, all index access methods divide indexes into standard-size 30 pages so that they can use the regular storage manager and buffer manager 31 to access the index contents. (All the existing index access methods 32 furthermore use the standard page layout described in <xref 33 linkend="storage-page-layout"/>, and most use the same format for index 34 tuple headers; but these decisions are not forced on an access method.) 35 </para> 36 37 <para> 38 An index is effectively a mapping from some data key values to 39 <firstterm>tuple identifiers</firstterm>, or <acronym>TIDs</acronym>, of row versions 40 (tuples) in the index's parent table. A TID consists of a 41 block number and an item number within that block (see <xref 42 linkend="storage-page-layout"/>). This is sufficient 43 information to fetch a particular row version from the table. 44 Indexes are not directly aware that under MVCC, there might be multiple 45 extant versions of the same logical row; to an index, each tuple is 46 an independent object that needs its own index entry. Thus, an 47 update of a row always creates all-new index entries for the row, even if 48 the key values did not change. (HOT tuples are an exception to this 49 statement; but indexes do not deal with those, either.) Index entries for 50 dead tuples are reclaimed (by vacuuming) when the dead tuples themselves 51 are reclaimed. 52 </para> 53 54 <sect1 id="index-api"> 55 <title>Basic API Structure for Indexes</title> 56 57 <para> 58 Each index access method is described by a row in the 59 <link linkend="catalog-pg-am"><structname>pg_am</structname></link> 60 system catalog. The <structname>pg_am</structname> entry 61 specifies a name and a <firstterm>handler function</firstterm> for the index 62 access method. These entries can be created and deleted using the 63 <xref linkend="sql-create-access-method"/> and 64 <xref linkend="sql-drop-access-method"/> SQL commands. 65 </para> 66 67 <para> 68 An index access method handler function must be declared to accept a 69 single argument of type <type>internal</type> and to return the 70 pseudo-type <type>index_am_handler</type>. The argument is a dummy value that 71 simply serves to prevent handler functions from being called directly from 72 SQL commands. The result of the function must be a palloc'd struct of 73 type <structname>IndexAmRoutine</structname>, which contains everything 74 that the core code needs to know to make use of the index access method. 75 The <structname>IndexAmRoutine</structname> struct, also called the access 76 method's <firstterm>API struct</firstterm>, includes fields specifying assorted 77 fixed properties of the access method, such as whether it can support 78 multicolumn indexes. More importantly, it contains pointers to support 79 functions for the access method, which do all of the real work to access 80 indexes. These support functions are plain C functions and are not 81 visible or callable at the SQL level. The support functions are described 82 in <xref linkend="index-functions"/>. 83 </para> 84 85 <para> 86 The structure <structname>IndexAmRoutine</structname> is defined thus: 87<programlisting> 88typedef struct IndexAmRoutine 89{ 90 NodeTag type; 91 92 /* 93 * Total number of strategies (operators) by which we can traverse/search 94 * this AM. Zero if AM does not have a fixed set of strategy assignments. 95 */ 96 uint16 amstrategies; 97 /* total number of support functions that this AM uses */ 98 uint16 amsupport; 99 /* does AM support ORDER BY indexed column's value? */ 100 bool amcanorder; 101 /* does AM support ORDER BY result of an operator on indexed column? */ 102 bool amcanorderbyop; 103 /* does AM support backward scanning? */ 104 bool amcanbackward; 105 /* does AM support UNIQUE indexes? */ 106 bool amcanunique; 107 /* does AM support multi-column indexes? */ 108 bool amcanmulticol; 109 /* does AM require scans to have a constraint on the first index column? */ 110 bool amoptionalkey; 111 /* does AM handle ScalarArrayOpExpr quals? */ 112 bool amsearcharray; 113 /* does AM handle IS NULL/IS NOT NULL quals? */ 114 bool amsearchnulls; 115 /* can index storage data type differ from column data type? */ 116 bool amstorage; 117 /* can an index of this type be clustered on? */ 118 bool amclusterable; 119 /* does AM handle predicate locks? */ 120 bool ampredlocks; 121 /* does AM support parallel scan? */ 122 bool amcanparallel; 123 /* does AM support columns included with clause INCLUDE? */ 124 bool amcaninclude; 125 /* type of data stored in index, or InvalidOid if variable */ 126 Oid amkeytype; 127 128 /* interface functions */ 129 ambuild_function ambuild; 130 ambuildempty_function ambuildempty; 131 aminsert_function aminsert; 132 ambulkdelete_function ambulkdelete; 133 amvacuumcleanup_function amvacuumcleanup; 134 amcanreturn_function amcanreturn; /* can be NULL */ 135 amcostestimate_function amcostestimate; 136 amoptions_function amoptions; 137 amproperty_function amproperty; /* can be NULL */ 138 ambuildphasename_function ambuildphasename; /* can be NULL */ 139 amvalidate_function amvalidate; 140 ambeginscan_function ambeginscan; 141 amrescan_function amrescan; 142 amgettuple_function amgettuple; /* can be NULL */ 143 amgetbitmap_function amgetbitmap; /* can be NULL */ 144 amendscan_function amendscan; 145 ammarkpos_function ammarkpos; /* can be NULL */ 146 amrestrpos_function amrestrpos; /* can be NULL */ 147 148 /* interface functions to support parallel index scans */ 149 amestimateparallelscan_function amestimateparallelscan; /* can be NULL */ 150 aminitparallelscan_function aminitparallelscan; /* can be NULL */ 151 amparallelrescan_function amparallelrescan; /* can be NULL */ 152} IndexAmRoutine; 153</programlisting> 154 </para> 155 156 <para> 157 To be useful, an index access method must also have one or more 158 <firstterm>operator families</firstterm> and 159 <firstterm>operator classes</firstterm> defined in 160 <link linkend="catalog-pg-opfamily"><structname>pg_opfamily</structname></link>, 161 <link linkend="catalog-pg-opclass"><structname>pg_opclass</structname></link>, 162 <link linkend="catalog-pg-amop"><structname>pg_amop</structname></link>, and 163 <link linkend="catalog-pg-amproc"><structname>pg_amproc</structname></link>. 164 These entries allow the planner 165 to determine what kinds of query qualifications can be used with 166 indexes of this access method. Operator families and classes are described 167 in <xref linkend="xindex"/>, which is prerequisite material for reading 168 this chapter. 169 </para> 170 171 <para> 172 An individual index is defined by a 173 <link linkend="catalog-pg-class"><structname>pg_class</structname></link> 174 entry that describes it as a physical relation, plus a 175 <link linkend="catalog-pg-index"><structname>pg_index</structname></link> 176 entry that shows the logical content of the index — that is, the set 177 of index columns it has and the semantics of those columns, as captured by 178 the associated operator classes. The index columns (key values) can be 179 either simple columns of the underlying table or expressions over the table 180 rows. The index access method normally has no interest in where the index 181 key values come from (it is always handed precomputed key values) but it 182 will be very interested in the operator class information in 183 <structname>pg_index</structname>. Both of these catalog entries can be 184 accessed as part of the <structname>Relation</structname> data structure that is 185 passed to all operations on the index. 186 </para> 187 188 <para> 189 Some of the flag fields of <structname>IndexAmRoutine</structname> have nonobvious 190 implications. The requirements of <structfield>amcanunique</structfield> 191 are discussed in <xref linkend="index-unique-checks"/>. 192 The <structfield>amcanmulticol</structfield> flag asserts that the 193 access method supports multi-key-column indexes, while 194 <structfield>amoptionalkey</structfield> asserts that it allows scans 195 where no indexable restriction clause is given for the first index column. 196 When <structfield>amcanmulticol</structfield> is false, 197 <structfield>amoptionalkey</structfield> essentially says whether the 198 access method supports full-index scans without any restriction clause. 199 Access methods that support multiple index columns <emphasis>must</emphasis> 200 support scans that omit restrictions on any or all of the columns after 201 the first; however they are permitted to require some restriction to 202 appear for the first index column, and this is signaled by setting 203 <structfield>amoptionalkey</structfield> false. 204 One reason that an index AM might set 205 <structfield>amoptionalkey</structfield> false is if it doesn't index 206 null values. Since most indexable operators are 207 strict and hence cannot return true for null inputs, 208 it is at first sight attractive to not store index entries for null values: 209 they could never be returned by an index scan anyway. However, this 210 argument fails when an index scan has no restriction clause for a given 211 index column. In practice this means that 212 indexes that have <structfield>amoptionalkey</structfield> true must 213 index nulls, since the planner might decide to use such an index 214 with no scan keys at all. A related restriction is that an index 215 access method that supports multiple index columns <emphasis>must</emphasis> 216 support indexing null values in columns after the first, because the planner 217 will assume the index can be used for queries that do not restrict 218 these columns. For example, consider an index on (a,b) and a query with 219 <literal>WHERE a = 4</literal>. The system will assume the index can be 220 used to scan for rows with <literal>a = 4</literal>, which is wrong if the 221 index omits rows where <literal>b</literal> is null. 222 It is, however, OK to omit rows where the first indexed column is null. 223 An index access method that does index nulls may also set 224 <structfield>amsearchnulls</structfield>, indicating that it supports 225 <literal>IS NULL</literal> and <literal>IS NOT NULL</literal> clauses as search 226 conditions. 227 </para> 228 229 <para> 230 The <structfield>amcaninclude</structfield> flag indicates whether the 231 access method supports <quote>included</quote> columns, that is it can 232 store (without processing) additional columns beyond the key column(s). 233 The requirements of the preceding paragraph apply only to the key 234 columns. In particular, the combination 235 of <structfield>amcanmulticol</structfield>=<literal>false</literal> 236 and <structfield>amcaninclude</structfield>=<literal>true</literal> is 237 sensible: it means that there can only be one key column, but there can 238 also be included column(s). Also, included columns must be allowed to be 239 null, independently of <structfield>amoptionalkey</structfield>. 240 </para> 241 242 </sect1> 243 244 <sect1 id="index-functions"> 245 <title>Index Access Method Functions</title> 246 247 <para> 248 The index construction and maintenance functions that an index access 249 method must provide in <structname>IndexAmRoutine</structname> are: 250 </para> 251 252 <para> 253<programlisting> 254IndexBuildResult * 255ambuild (Relation heapRelation, 256 Relation indexRelation, 257 IndexInfo *indexInfo); 258</programlisting> 259 Build a new index. The index relation has been physically created, 260 but is empty. It must be filled in with whatever fixed data the 261 access method requires, plus entries for all tuples already existing 262 in the table. Ordinarily the <function>ambuild</function> function will call 263 <function>table_index_build_scan()</function> to scan the table for existing tuples 264 and compute the keys that need to be inserted into the index. 265 The function must return a palloc'd struct containing statistics about 266 the new index. 267 </para> 268 269 <para> 270<programlisting> 271void 272ambuildempty (Relation indexRelation); 273</programlisting> 274 Build an empty index, and write it to the initialization fork (<symbol>INIT_FORKNUM</symbol>) 275 of the given relation. This method is called only for unlogged indexes; the 276 empty index written to the initialization fork will be copied over the main 277 relation fork on each server restart. 278 </para> 279 280 <para> 281<programlisting> 282bool 283aminsert (Relation indexRelation, 284 Datum *values, 285 bool *isnull, 286 ItemPointer heap_tid, 287 Relation heapRelation, 288 IndexUniqueCheck checkUnique, 289 IndexInfo *indexInfo); 290</programlisting> 291 Insert a new tuple into an existing index. The <literal>values</literal> and 292 <literal>isnull</literal> arrays give the key values to be indexed, and 293 <literal>heap_tid</literal> is the TID to be indexed. 294 If the access method supports unique indexes (its 295 <structfield>amcanunique</structfield> flag is true) then 296 <literal>checkUnique</literal> indicates the type of uniqueness check to 297 perform. This varies depending on whether the unique constraint is 298 deferrable; see <xref linkend="index-unique-checks"/> for details. 299 Normally the access method only needs the <literal>heapRelation</literal> 300 parameter when performing uniqueness checking (since then it will have to 301 look into the heap to verify tuple liveness). 302 </para> 303 304 <para> 305 The function's Boolean result value is significant only when 306 <literal>checkUnique</literal> is <literal>UNIQUE_CHECK_PARTIAL</literal>. 307 In this case a true result means the new entry is known unique, whereas 308 false means it might be non-unique (and a deferred uniqueness check must 309 be scheduled). For other cases a constant false result is recommended. 310 </para> 311 312 <para> 313 Some indexes might not index all tuples. If the tuple is not to be 314 indexed, <function>aminsert</function> should just return without doing anything. 315 </para> 316 317 <para> 318 If the index AM wishes to cache data across successive index insertions 319 within a SQL statement, it can allocate space 320 in <literal>indexInfo->ii_Context</literal> and store a pointer to the 321 data in <literal>indexInfo->ii_AmCache</literal> (which will be NULL 322 initially). 323 </para> 324 325 <para> 326<programlisting> 327IndexBulkDeleteResult * 328ambulkdelete (IndexVacuumInfo *info, 329 IndexBulkDeleteResult *stats, 330 IndexBulkDeleteCallback callback, 331 void *callback_state); 332</programlisting> 333 Delete tuple(s) from the index. This is a <quote>bulk delete</quote> operation 334 that is intended to be implemented by scanning the whole index and checking 335 each entry to see if it should be deleted. 336 The passed-in <literal>callback</literal> function must be called, in the style 337 <literal>callback(<replaceable>TID</replaceable>, callback_state) returns bool</literal>, 338 to determine whether any particular index entry, as identified by its 339 referenced TID, is to be deleted. Must return either NULL or a palloc'd 340 struct containing statistics about the effects of the deletion operation. 341 It is OK to return NULL if no information needs to be passed on to 342 <function>amvacuumcleanup</function>. 343 </para> 344 345 <para> 346 Because of limited <varname>maintenance_work_mem</varname>, 347 <function>ambulkdelete</function> might need to be called more than once when many 348 tuples are to be deleted. The <literal>stats</literal> argument is the result 349 of the previous call for this index (it is NULL for the first call within a 350 <command>VACUUM</command> operation). This allows the AM to accumulate statistics 351 across the whole operation. Typically, <function>ambulkdelete</function> will 352 modify and return the same struct if the passed <literal>stats</literal> is not 353 null. 354 </para> 355 356 <para> 357<programlisting> 358IndexBulkDeleteResult * 359amvacuumcleanup (IndexVacuumInfo *info, 360 IndexBulkDeleteResult *stats); 361</programlisting> 362 Clean up after a <command>VACUUM</command> operation (zero or more 363 <function>ambulkdelete</function> calls). This does not have to do anything 364 beyond returning index statistics, but it might perform bulk cleanup 365 such as reclaiming empty index pages. <literal>stats</literal> is whatever the 366 last <function>ambulkdelete</function> call returned, or NULL if 367 <function>ambulkdelete</function> was not called because no tuples needed to be 368 deleted. If the result is not NULL it must be a palloc'd struct. 369 The statistics it contains will be used to update <structname>pg_class</structname>, 370 and will be reported by <command>VACUUM</command> if <literal>VERBOSE</literal> is given. 371 It is OK to return NULL if the index was not changed at all during the 372 <command>VACUUM</command> operation, but otherwise correct stats should 373 be returned. 374 </para> 375 376 <para> 377 As of <productname>PostgreSQL</productname> 8.4, 378 <function>amvacuumcleanup</function> will also be called at completion of an 379 <command>ANALYZE</command> operation. In this case <literal>stats</literal> is always 380 NULL and any return value will be ignored. This case can be distinguished 381 by checking <literal>info->analyze_only</literal>. It is recommended 382 that the access method do nothing except post-insert cleanup in such a 383 call, and that only in an autovacuum worker process. 384 </para> 385 386 <para> 387<programlisting> 388bool 389amcanreturn (Relation indexRelation, int attno); 390</programlisting> 391 Check whether the index can support <link 392 linkend="indexes-index-only-scans"><firstterm>index-only scans</firstterm></link> on 393 the given column, by returning the column's original indexed value. 394 The attribute number is 1-based, i.e., the first column's attno is 1. 395 Returns true if supported, else false. 396 This function should always return true for included columns 397 (if those are supported), since there's little point in an included 398 column that can't be retrieved. 399 If the access method does not support index-only scans at all, 400 the <structfield>amcanreturn</structfield> field in its <structname>IndexAmRoutine</structname> 401 struct can be set to NULL. 402 </para> 403 404 <para> 405<programlisting> 406void 407amcostestimate (PlannerInfo *root, 408 IndexPath *path, 409 double loop_count, 410 Cost *indexStartupCost, 411 Cost *indexTotalCost, 412 Selectivity *indexSelectivity, 413 double *indexCorrelation, 414 double *indexPages); 415</programlisting> 416 Estimate the costs of an index scan. This function is described fully 417 in <xref linkend="index-cost-estimation"/>, below. 418 </para> 419 420 <para> 421<programlisting> 422bytea * 423amoptions (ArrayType *reloptions, 424 bool validate); 425</programlisting> 426 Parse and validate the reloptions array for an index. This is called only 427 when a non-null reloptions array exists for the index. 428 <parameter>reloptions</parameter> is a <type>text</type> array containing entries of the 429 form <replaceable>name</replaceable><literal>=</literal><replaceable>value</replaceable>. 430 The function should construct a <type>bytea</type> value, which will be copied 431 into the <structfield>rd_options</structfield> field of the index's relcache entry. 432 The data contents of the <type>bytea</type> value are open for the access 433 method to define; most of the standard access methods use struct 434 <structname>StdRdOptions</structname>. 435 When <parameter>validate</parameter> is true, the function should report a suitable 436 error message if any of the options are unrecognized or have invalid 437 values; when <parameter>validate</parameter> is false, invalid entries should be 438 silently ignored. (<parameter>validate</parameter> is false when loading options 439 already stored in <structname>pg_catalog</structname>; an invalid entry could only 440 be found if the access method has changed its rules for options, and in 441 that case ignoring obsolete entries is appropriate.) 442 It is OK to return NULL if default behavior is wanted. 443 </para> 444 445 <para> 446<programlisting> 447bool 448amproperty (Oid index_oid, int attno, 449 IndexAMProperty prop, const char *propname, 450 bool *res, bool *isnull); 451</programlisting> 452 The <function>amproperty</function> method allows index access methods to override 453 the default behavior of <function>pg_index_column_has_property</function> 454 and related functions. 455 If the access method does not have any special behavior for index property 456 inquiries, the <structfield>amproperty</structfield> field in 457 its <structname>IndexAmRoutine</structname> struct can be set to NULL. 458 Otherwise, the <function>amproperty</function> method will be called with 459 <parameter>index_oid</parameter> and <parameter>attno</parameter> both zero for 460 <function>pg_indexam_has_property</function> calls, 461 or with <parameter>index_oid</parameter> valid and <parameter>attno</parameter> zero for 462 <function>pg_index_has_property</function> calls, 463 or with <parameter>index_oid</parameter> valid and <parameter>attno</parameter> greater than 464 zero for <function>pg_index_column_has_property</function> calls. 465 <parameter>prop</parameter> is an enum value identifying the property being tested, 466 while <parameter>propname</parameter> is the original property name string. 467 If the core code does not recognize the property name 468 then <parameter>prop</parameter> is <literal>AMPROP_UNKNOWN</literal>. 469 Access methods can define custom property names by 470 checking <parameter>propname</parameter> for a match (use <function>pg_strcasecmp</function> 471 to match, for consistency with the core code); for names known to the core 472 code, it's better to inspect <parameter>prop</parameter>. 473 If the <structfield>amproperty</structfield> method returns <literal>true</literal> then 474 it has determined the property test result: it must set <literal>*res</literal> 475 to the boolean value to return, or set <literal>*isnull</literal> 476 to <literal>true</literal> to return a NULL. (Both of the referenced variables 477 are initialized to <literal>false</literal> before the call.) 478 If the <structfield>amproperty</structfield> method returns <literal>false</literal> then 479 the core code will proceed with its normal logic for determining the 480 property test result. 481 </para> 482 483 <para> 484 Access methods that support ordering operators should 485 implement <literal>AMPROP_DISTANCE_ORDERABLE</literal> property testing, as the 486 core code does not know how to do that and will return NULL. It may 487 also be advantageous to implement <literal>AMPROP_RETURNABLE</literal> testing, 488 if that can be done more cheaply than by opening the index and calling 489 <function>amcanreturn</function>, which is the core code's default behavior. 490 The default behavior should be satisfactory for all other standard 491 properties. 492 </para> 493 494 <para> 495<programlisting> 496char * 497ambuildphasename (int64 phasenum); 498</programlisting> 499 Return the textual name of the given build phase number. 500 The phase numbers are those reported during an index build via the 501 <function>pgstat_progress_update_param</function> interface. 502 The phase names are then exposed in the 503 <structname>pg_stat_progress_create_index</structname> view. 504 </para> 505 506 <para> 507<programlisting> 508bool 509amvalidate (Oid opclassoid); 510</programlisting> 511 Validate the catalog entries for the specified operator class, so far as 512 the access method can reasonably do that. For example, this might include 513 testing that all required support functions are provided. 514 The <function>amvalidate</function> function must return false if the opclass is 515 invalid. Problems should be reported with <function>ereport</function> messages. 516 </para> 517 518 519 <para> 520 The purpose of an index, of course, is to support scans for tuples matching 521 an indexable <literal>WHERE</literal> condition, often called a 522 <firstterm>qualifier</firstterm> or <firstterm>scan key</firstterm>. The semantics of 523 index scanning are described more fully in <xref linkend="index-scanning"/>, 524 below. An index access method can support <quote>plain</quote> index scans, 525 <quote>bitmap</quote> index scans, or both. The scan-related functions that an 526 index access method must or may provide are: 527 </para> 528 529 <para> 530<programlisting> 531IndexScanDesc 532ambeginscan (Relation indexRelation, 533 int nkeys, 534 int norderbys); 535</programlisting> 536 Prepare for an index scan. The <literal>nkeys</literal> and <literal>norderbys</literal> 537 parameters indicate the number of quals and ordering operators that will be 538 used in the scan; these may be useful for space allocation purposes. 539 Note that the actual values of the scan keys aren't provided yet. 540 The result must be a palloc'd struct. 541 For implementation reasons the index access method 542 <emphasis>must</emphasis> create this struct by calling 543 <function>RelationGetIndexScan()</function>. In most cases 544 <function>ambeginscan</function> does little beyond making that call and perhaps 545 acquiring locks; 546 the interesting parts of index-scan startup are in <function>amrescan</function>. 547 </para> 548 549 <para> 550<programlisting> 551void 552amrescan (IndexScanDesc scan, 553 ScanKey keys, 554 int nkeys, 555 ScanKey orderbys, 556 int norderbys); 557</programlisting> 558 Start or restart an index scan, possibly with new scan keys. (To restart 559 using previously-passed keys, NULL is passed for <literal>keys</literal> and/or 560 <literal>orderbys</literal>.) Note that it is not allowed for 561 the number of keys or order-by operators to be larger than 562 what was passed to <function>ambeginscan</function>. In practice the restart 563 feature is used when a new outer tuple is selected by a nested-loop join 564 and so a new key comparison value is needed, but the scan key structure 565 remains the same. 566 </para> 567 568 <para> 569<programlisting> 570bool 571amgettuple (IndexScanDesc scan, 572 ScanDirection direction); 573</programlisting> 574 Fetch the next tuple in the given scan, moving in the given 575 direction (forward or backward in the index). Returns true if a tuple was 576 obtained, false if no matching tuples remain. In the true case the tuple 577 TID is stored into the <literal>scan</literal> structure. Note that 578 <quote>success</quote> means only that the index contains an entry that matches 579 the scan keys, not that the tuple necessarily still exists in the heap or 580 will pass the caller's snapshot test. On success, <function>amgettuple</function> 581 must also set <literal>scan->xs_recheck</literal> to true or false. 582 False means it is certain that the index entry matches the scan keys. 583 True means this is not certain, and the conditions represented by the 584 scan keys must be rechecked against the heap tuple after fetching it. 585 This provision supports <quote>lossy</quote> index operators. 586 Note that rechecking will extend only to the scan conditions; a partial 587 index predicate (if any) is never rechecked by <function>amgettuple</function> 588 callers. 589 </para> 590 591 <para> 592 If the index supports <link linkend="indexes-index-only-scans">index-only 593 scans</link> (i.e., <function>amcanreturn</function> returns true for any 594 of its columns), 595 then on success the AM must also check <literal>scan->xs_want_itup</literal>, 596 and if that is true it must return the originally indexed data for the 597 index entry. Columns for which <function>amcanreturn</function> returns 598 false can be returned as nulls. 599 The data can be returned in the form of an 600 <structname>IndexTuple</structname> pointer stored at <literal>scan->xs_itup</literal>, 601 with tuple descriptor <literal>scan->xs_itupdesc</literal>; or in the form of 602 a <structname>HeapTuple</structname> pointer stored at <literal>scan->xs_hitup</literal>, 603 with tuple descriptor <literal>scan->xs_hitupdesc</literal>. (The latter 604 format should be used when reconstructing data that might possibly not fit 605 into an <structname>IndexTuple</structname>.) In either case, 606 management of the data referenced by the pointer is the access method's 607 responsibility. The data must remain good at least until the next 608 <function>amgettuple</function>, <function>amrescan</function>, or <function>amendscan</function> 609 call for the scan. 610 </para> 611 612 <para> 613 The <function>amgettuple</function> function need only be provided if the access 614 method supports <quote>plain</quote> index scans. If it doesn't, the 615 <structfield>amgettuple</structfield> field in its <structname>IndexAmRoutine</structname> 616 struct must be set to NULL. 617 </para> 618 619 <para> 620<programlisting> 621int64 622amgetbitmap (IndexScanDesc scan, 623 TIDBitmap *tbm); 624</programlisting> 625 Fetch all tuples in the given scan and add them to the caller-supplied 626 <type>TIDBitmap</type> (that is, OR the set of tuple IDs into whatever set is already 627 in the bitmap). The number of tuples fetched is returned (this might be 628 just an approximate count, for instance some AMs do not detect duplicates). 629 While inserting tuple IDs into the bitmap, <function>amgetbitmap</function> can 630 indicate that rechecking of the scan conditions is required for specific 631 tuple IDs. This is analogous to the <literal>xs_recheck</literal> output parameter 632 of <function>amgettuple</function>. Note: in the current implementation, support 633 for this feature is conflated with support for lossy storage of the bitmap 634 itself, and therefore callers recheck both the scan conditions and the 635 partial index predicate (if any) for recheckable tuples. That might not 636 always be true, however. 637 <function>amgetbitmap</function> and 638 <function>amgettuple</function> cannot be used in the same index scan; there 639 are other restrictions too when using <function>amgetbitmap</function>, as explained 640 in <xref linkend="index-scanning"/>. 641 </para> 642 643 <para> 644 The <function>amgetbitmap</function> function need only be provided if the access 645 method supports <quote>bitmap</quote> index scans. If it doesn't, the 646 <structfield>amgetbitmap</structfield> field in its <structname>IndexAmRoutine</structname> 647 struct must be set to NULL. 648 </para> 649 650 <para> 651<programlisting> 652void 653amendscan (IndexScanDesc scan); 654</programlisting> 655 End a scan and release resources. The <literal>scan</literal> struct itself 656 should not be freed, but any locks or pins taken internally by the 657 access method must be released, as well as any other memory allocated 658 by <function>ambeginscan</function> and other scan-related functions. 659 </para> 660 661 <para> 662<programlisting> 663void 664ammarkpos (IndexScanDesc scan); 665</programlisting> 666 Mark current scan position. The access method need only support one 667 remembered scan position per scan. 668 </para> 669 670 <para> 671 The <function>ammarkpos</function> function need only be provided if the access 672 method supports ordered scans. If it doesn't, 673 the <structfield>ammarkpos</structfield> field in its <structname>IndexAmRoutine</structname> 674 struct may be set to NULL. 675 </para> 676 677 <para> 678<programlisting> 679void 680amrestrpos (IndexScanDesc scan); 681</programlisting> 682 Restore the scan to the most recently marked position. 683 </para> 684 685 <para> 686 The <function>amrestrpos</function> function need only be provided if the access 687 method supports ordered scans. If it doesn't, 688 the <structfield>amrestrpos</structfield> field in its <structname>IndexAmRoutine</structname> 689 struct may be set to NULL. 690 </para> 691 692 <para> 693 In addition to supporting ordinary index scans, some types of index 694 may wish to support <firstterm>parallel index scans</firstterm>, which allow 695 multiple backends to cooperate in performing an index scan. The 696 index access method should arrange things so that each cooperating 697 process returns a subset of the tuples that would be performed by 698 an ordinary, non-parallel index scan, but in such a way that the 699 union of those subsets is equal to the set of tuples that would be 700 returned by an ordinary, non-parallel index scan. Furthermore, while 701 there need not be any global ordering of tuples returned by a parallel 702 scan, the ordering of that subset of tuples returned within each 703 cooperating backend must match the requested ordering. The following 704 functions may be implemented to support parallel index scans: 705 </para> 706 707 <para> 708<programlisting> 709Size 710amestimateparallelscan (void); 711</programlisting> 712 Estimate and return the number of bytes of dynamic shared memory which 713 the access method will be needed to perform a parallel scan. (This number 714 is in addition to, not in lieu of, the amount of space needed for 715 AM-independent data in <structname>ParallelIndexScanDescData</structname>.) 716 </para> 717 718 <para> 719 It is not necessary to implement this function for access methods which 720 do not support parallel scans or for which the number of additional bytes 721 of storage required is zero. 722 </para> 723 724 <para> 725<programlisting> 726void 727aminitparallelscan (void *target); 728</programlisting> 729 This function will be called to initialize dynamic shared memory at the 730 beginning of a parallel scan. <parameter>target</parameter> will point to at least 731 the number of bytes previously returned by 732 <function>amestimateparallelscan</function>, and this function may use that 733 amount of space to store whatever data it wishes. 734 </para> 735 736 <para> 737 It is not necessary to implement this function for access methods which 738 do not support parallel scans or in cases where the shared memory space 739 required needs no initialization. 740 </para> 741 742 <para> 743<programlisting> 744void 745amparallelrescan (IndexScanDesc scan); 746</programlisting> 747 This function, if implemented, will be called when a parallel index scan 748 must be restarted. It should reset any shared state set up by 749 <function>aminitparallelscan</function> such that the scan will be restarted from 750 the beginning. 751 </para> 752 753 </sect1> 754 755 <sect1 id="index-scanning"> 756 <title>Index Scanning</title> 757 758 <para> 759 In an index scan, the index access method is responsible for regurgitating 760 the TIDs of all the tuples it has been told about that match the 761 <firstterm>scan keys</firstterm>. The access method is <emphasis>not</emphasis> involved in 762 actually fetching those tuples from the index's parent table, nor in 763 determining whether they pass the scan's visibility test or other 764 conditions. 765 </para> 766 767 <para> 768 A scan key is the internal representation of a <literal>WHERE</literal> clause of 769 the form <replaceable>index_key</replaceable> <replaceable>operator</replaceable> 770 <replaceable>constant</replaceable>, where the index key is one of the columns of the 771 index and the operator is one of the members of the operator family 772 associated with that index column. An index scan has zero or more scan 773 keys, which are implicitly ANDed — the returned tuples are expected 774 to satisfy all the indicated conditions. 775 </para> 776 777 <para> 778 The access method can report that the index is <firstterm>lossy</firstterm>, or 779 requires rechecks, for a particular query. This implies that the index 780 scan will return all the entries that pass the scan key, plus possibly 781 additional entries that do not. The core system's index-scan machinery 782 will then apply the index conditions again to the heap tuple to verify 783 whether or not it really should be selected. If the recheck option is not 784 specified, the index scan must return exactly the set of matching entries. 785 </para> 786 787 <para> 788 Note that it is entirely up to the access method to ensure that it 789 correctly finds all and only the entries passing all the given scan keys. 790 Also, the core system will simply hand off all the <literal>WHERE</literal> 791 clauses that match the index keys and operator families, without any 792 semantic analysis to determine whether they are redundant or 793 contradictory. As an example, given 794 <literal>WHERE x > 4 AND x > 14</literal> where <literal>x</literal> is a b-tree 795 indexed column, it is left to the b-tree <function>amrescan</function> function 796 to realize that the first scan key is redundant and can be discarded. 797 The extent of preprocessing needed during <function>amrescan</function> will 798 depend on the extent to which the index access method needs to reduce 799 the scan keys to a <quote>normalized</quote> form. 800 </para> 801 802 <para> 803 Some access methods return index entries in a well-defined order, others 804 do not. There are actually two different ways that an access method can 805 support sorted output: 806 807 <itemizedlist> 808 <listitem> 809 <para> 810 Access methods that always return entries in the natural ordering 811 of their data (such as btree) should set 812 <structfield>amcanorder</structfield> to true. 813 Currently, such access methods must use btree-compatible strategy 814 numbers for their equality and ordering operators. 815 </para> 816 </listitem> 817 <listitem> 818 <para> 819 Access methods that support ordering operators should set 820 <structfield>amcanorderbyop</structfield> to true. 821 This indicates that the index is capable of returning entries in 822 an order satisfying <literal>ORDER BY</literal> <replaceable>index_key</replaceable> 823 <replaceable>operator</replaceable> <replaceable>constant</replaceable>. Scan modifiers 824 of that form can be passed to <function>amrescan</function> as described 825 previously. 826 </para> 827 </listitem> 828 </itemizedlist> 829 </para> 830 831 <para> 832 The <function>amgettuple</function> function has a <literal>direction</literal> argument, 833 which can be either <literal>ForwardScanDirection</literal> (the normal case) 834 or <literal>BackwardScanDirection</literal>. If the first call after 835 <function>amrescan</function> specifies <literal>BackwardScanDirection</literal>, then the 836 set of matching index entries is to be scanned back-to-front rather than in 837 the normal front-to-back direction, so <function>amgettuple</function> must return 838 the last matching tuple in the index, rather than the first one as it 839 normally would. (This will only occur for access 840 methods that set <structfield>amcanorder</structfield> to true.) After the 841 first call, <function>amgettuple</function> must be prepared to advance the scan in 842 either direction from the most recently returned entry. (But if 843 <structfield>amcanbackward</structfield> is false, all subsequent 844 calls will have the same direction as the first one.) 845 </para> 846 847 <para> 848 Access methods that support ordered scans must support <quote>marking</quote> a 849 position in a scan and later returning to the marked position. The same 850 position might be restored multiple times. However, only one position need 851 be remembered per scan; a new <function>ammarkpos</function> call overrides the 852 previously marked position. An access method that does not support ordered 853 scans need not provide <function>ammarkpos</function> and <function>amrestrpos</function> 854 functions in <structname>IndexAmRoutine</structname>; set those pointers to NULL 855 instead. 856 </para> 857 858 <para> 859 Both the scan position and the mark position (if any) must be maintained 860 consistently in the face of concurrent insertions or deletions in the 861 index. It is OK if a freshly-inserted entry is not returned by a scan that 862 would have found the entry if it had existed when the scan started, or for 863 the scan to return such an entry upon rescanning or backing 864 up even though it had not been returned the first time through. Similarly, 865 a concurrent delete might or might not be reflected in the results of a scan. 866 What is important is that insertions or deletions not cause the scan to 867 miss or multiply return entries that were not themselves being inserted or 868 deleted. 869 </para> 870 871 <para> 872 If the index stores the original indexed data values (and not some lossy 873 representation of them), it is useful to 874 support <link linkend="indexes-index-only-scans">index-only scans</link>, in 875 which the index returns the actual data not just the TID of the heap tuple. 876 This will only avoid I/O if the visibility map shows that the TID is on an 877 all-visible page; else the heap tuple must be visited anyway to check 878 MVCC visibility. But that is no concern of the access method's. 879 </para> 880 881 <para> 882 Instead of using <function>amgettuple</function>, an index scan can be done with 883 <function>amgetbitmap</function> to fetch all tuples in one call. This can be 884 noticeably more efficient than <function>amgettuple</function> because it allows 885 avoiding lock/unlock cycles within the access method. In principle 886 <function>amgetbitmap</function> should have the same effects as repeated 887 <function>amgettuple</function> calls, but we impose several restrictions to 888 simplify matters. First of all, <function>amgetbitmap</function> returns all 889 tuples at once and marking or restoring scan positions isn't 890 supported. Secondly, the tuples are returned in a bitmap which doesn't 891 have any specific ordering, which is why <function>amgetbitmap</function> doesn't 892 take a <literal>direction</literal> argument. (Ordering operators will never be 893 supplied for such a scan, either.) 894 Also, there is no provision for index-only scans with 895 <function>amgetbitmap</function>, since there is no way to return the contents of 896 index tuples. 897 Finally, <function>amgetbitmap</function> 898 does not guarantee any locking of the returned tuples, with implications 899 spelled out in <xref linkend="index-locking"/>. 900 </para> 901 902 <para> 903 Note that it is permitted for an access method to implement only 904 <function>amgetbitmap</function> and not <function>amgettuple</function>, or vice versa, 905 if its internal implementation is unsuited to one API or the other. 906 </para> 907 908 </sect1> 909 910 <sect1 id="index-locking"> 911 <title>Index Locking Considerations</title> 912 913 <para> 914 Index access methods must handle concurrent updates 915 of the index by multiple processes. 916 The core <productname>PostgreSQL</productname> system obtains 917 <literal>AccessShareLock</literal> on the index during an index scan, and 918 <literal>RowExclusiveLock</literal> when updating the index (including plain 919 <command>VACUUM</command>). Since these lock types do not conflict, the access 920 method is responsible for handling any fine-grained locking it might need. 921 An <literal>ACCESS EXCLUSIVE</literal> lock on the index as a whole will be 922 taken only during index creation, destruction, or <command>REINDEX</command> 923 (<literal>SHARE UPDATE EXCLUSIVE</literal> is taken instead with 924 <literal>CONCURRENTLY</literal>). 925 </para> 926 927 <para> 928 Building an index type that supports concurrent updates usually requires 929 extensive and subtle analysis of the required behavior. For the b-tree 930 and hash index types, you can read about the design decisions involved in 931 <filename>src/backend/access/nbtree/README</filename> and 932 <filename>src/backend/access/hash/README</filename>. 933 </para> 934 935 <para> 936 Aside from the index's own internal consistency requirements, concurrent 937 updates create issues about consistency between the parent table (the 938 <firstterm>heap</firstterm>) and the index. Because 939 <productname>PostgreSQL</productname> separates accesses 940 and updates of the heap from those of the index, there are windows in 941 which the index might be inconsistent with the heap. We handle this problem 942 with the following rules: 943 944 <itemizedlist> 945 <listitem> 946 <para> 947 A new heap entry is made before making its index entries. (Therefore 948 a concurrent index scan is likely to fail to see the heap entry. 949 This is okay because the index reader would be uninterested in an 950 uncommitted row anyway. But see <xref linkend="index-unique-checks"/>.) 951 </para> 952 </listitem> 953 <listitem> 954 <para> 955 When a heap entry is to be deleted (by <command>VACUUM</command>), all its 956 index entries must be removed first. 957 </para> 958 </listitem> 959 <listitem> 960 <para> 961 An index scan must maintain a pin 962 on the index page holding the item last returned by 963 <function>amgettuple</function>, and <function>ambulkdelete</function> cannot delete 964 entries from pages that are pinned by other backends. The need 965 for this rule is explained below. 966 </para> 967 </listitem> 968 </itemizedlist> 969 970 Without the third rule, it is possible for an index reader to 971 see an index entry just before it is removed by <command>VACUUM</command>, and 972 then to arrive at the corresponding heap entry after that was removed by 973 <command>VACUUM</command>. 974 This creates no serious problems if that item 975 number is still unused when the reader reaches it, since an empty 976 item slot will be ignored by <function>heap_fetch()</function>. But what if a 977 third backend has already re-used the item slot for something else? 978 When using an MVCC-compliant snapshot, there is no problem because 979 the new occupant of the slot is certain to be too new to pass the 980 snapshot test. However, with a non-MVCC-compliant snapshot (such as 981 <literal>SnapshotAny</literal>), it would be possible to accept and return 982 a row that does not in fact match the scan keys. We could defend 983 against this scenario by requiring the scan keys to be rechecked 984 against the heap row in all cases, but that is too expensive. Instead, 985 we use a pin on an index page as a proxy to indicate that the reader 986 might still be <quote>in flight</quote> from the index entry to the matching 987 heap entry. Making <function>ambulkdelete</function> block on such a pin ensures 988 that <command>VACUUM</command> cannot delete the heap entry before the reader 989 is done with it. This solution costs little in run time, and adds blocking 990 overhead only in the rare cases where there actually is a conflict. 991 </para> 992 993 <para> 994 This solution requires that index scans be <quote>synchronous</quote>: we have 995 to fetch each heap tuple immediately after scanning the corresponding index 996 entry. This is expensive for a number of reasons. An 997 <quote>asynchronous</quote> scan in which we collect many TIDs from the index, 998 and only visit the heap tuples sometime later, requires much less index 999 locking overhead and can allow a more efficient heap access pattern. 1000 Per the above analysis, we must use the synchronous approach for 1001 non-MVCC-compliant snapshots, but an asynchronous scan is workable 1002 for a query using an MVCC snapshot. 1003 </para> 1004 1005 <para> 1006 In an <function>amgetbitmap</function> index scan, the access method does not 1007 keep an index pin on any of the returned tuples. Therefore 1008 it is only safe to use such scans with MVCC-compliant snapshots. 1009 </para> 1010 1011 <para> 1012 When the <structfield>ampredlocks</structfield> flag is not set, any scan using that 1013 index access method within a serializable transaction will acquire a 1014 nonblocking predicate lock on the full index. This will generate a 1015 read-write conflict with the insert of any tuple into that index by a 1016 concurrent serializable transaction. If certain patterns of read-write 1017 conflicts are detected among a set of concurrent serializable 1018 transactions, one of those transactions may be canceled to protect data 1019 integrity. When the flag is set, it indicates that the index access 1020 method implements finer-grained predicate locking, which will tend to 1021 reduce the frequency of such transaction cancellations. 1022 </para> 1023 1024 </sect1> 1025 1026 <sect1 id="index-unique-checks"> 1027 <title>Index Uniqueness Checks</title> 1028 1029 <para> 1030 <productname>PostgreSQL</productname> enforces SQL uniqueness constraints 1031 using <firstterm>unique indexes</firstterm>, which are indexes that disallow 1032 multiple entries with identical keys. An access method that supports this 1033 feature sets <structfield>amcanunique</structfield> true. 1034 (At present, only b-tree supports it.) Columns listed in the 1035 <literal>INCLUDE</literal> clause are not considered when enforcing 1036 uniqueness. 1037 </para> 1038 1039 <para> 1040 Because of MVCC, it is always necessary to allow duplicate entries to 1041 exist physically in an index: the entries might refer to successive 1042 versions of a single logical row. The behavior we actually want to 1043 enforce is that no MVCC snapshot could include two rows with equal 1044 index keys. This breaks down into the following cases that must be 1045 checked when inserting a new row into a unique index: 1046 1047 <itemizedlist> 1048 <listitem> 1049 <para> 1050 If a conflicting valid row has been deleted by the current transaction, 1051 it's okay. (In particular, since an UPDATE always deletes the old row 1052 version before inserting the new version, this will allow an UPDATE on 1053 a row without changing the key.) 1054 </para> 1055 </listitem> 1056 <listitem> 1057 <para> 1058 If a conflicting row has been inserted by an as-yet-uncommitted 1059 transaction, the would-be inserter must wait to see if that transaction 1060 commits. If it rolls back then there is no conflict. If it commits 1061 without deleting the conflicting row again, there is a uniqueness 1062 violation. (In practice we just wait for the other transaction to 1063 end and then redo the visibility check in toto.) 1064 </para> 1065 </listitem> 1066 <listitem> 1067 <para> 1068 Similarly, if a conflicting valid row has been deleted by an 1069 as-yet-uncommitted transaction, the would-be inserter must wait 1070 for that transaction to commit or abort, and then repeat the test. 1071 </para> 1072 </listitem> 1073 </itemizedlist> 1074 </para> 1075 1076 <para> 1077 Furthermore, immediately before reporting a uniqueness violation 1078 according to the above rules, the access method must recheck the 1079 liveness of the row being inserted. If it is committed dead then 1080 no violation should be reported. (This case cannot occur during the 1081 ordinary scenario of inserting a row that's just been created by 1082 the current transaction. It can happen during 1083 <command>CREATE UNIQUE INDEX CONCURRENTLY</command>, however.) 1084 </para> 1085 1086 <para> 1087 We require the index access method to apply these tests itself, which 1088 means that it must reach into the heap to check the commit status of 1089 any row that is shown to have a duplicate key according to the index 1090 contents. This is without a doubt ugly and non-modular, but it saves 1091 redundant work: if we did a separate probe then the index lookup for 1092 a conflicting row would be essentially repeated while finding the place to 1093 insert the new row's index entry. What's more, there is no obvious way 1094 to avoid race conditions unless the conflict check is an integral part 1095 of insertion of the new index entry. 1096 </para> 1097 1098 <para> 1099 If the unique constraint is deferrable, there is additional complexity: 1100 we need to be able to insert an index entry for a new row, but defer any 1101 uniqueness-violation error until end of statement or even later. To 1102 avoid unnecessary repeat searches of the index, the index access method 1103 should do a preliminary uniqueness check during the initial insertion. 1104 If this shows that there is definitely no conflicting live tuple, we 1105 are done. Otherwise, we schedule a recheck to occur when it is time to 1106 enforce the constraint. If, at the time of the recheck, both the inserted 1107 tuple and some other tuple with the same key are live, then the error 1108 must be reported. (Note that for this purpose, <quote>live</quote> actually 1109 means <quote>any tuple in the index entry's HOT chain is live</quote>.) 1110 To implement this, the <function>aminsert</function> function is passed a 1111 <literal>checkUnique</literal> parameter having one of the following values: 1112 1113 <itemizedlist> 1114 <listitem> 1115 <para> 1116 <literal>UNIQUE_CHECK_NO</literal> indicates that no uniqueness checking 1117 should be done (this is not a unique index). 1118 </para> 1119 </listitem> 1120 <listitem> 1121 <para> 1122 <literal>UNIQUE_CHECK_YES</literal> indicates that this is a non-deferrable 1123 unique index, and the uniqueness check must be done immediately, as 1124 described above. 1125 </para> 1126 </listitem> 1127 <listitem> 1128 <para> 1129 <literal>UNIQUE_CHECK_PARTIAL</literal> indicates that the unique 1130 constraint is deferrable. <productname>PostgreSQL</productname> 1131 will use this mode to insert each row's index entry. The access 1132 method must allow duplicate entries into the index, and report any 1133 potential duplicates by returning false from <function>aminsert</function>. 1134 For each row for which false is returned, a deferred recheck will 1135 be scheduled. 1136 </para> 1137 1138 <para> 1139 The access method must identify any rows which might violate the 1140 unique constraint, but it is not an error for it to report false 1141 positives. This allows the check to be done without waiting for other 1142 transactions to finish; conflicts reported here are not treated as 1143 errors and will be rechecked later, by which time they may no longer 1144 be conflicts. 1145 </para> 1146 </listitem> 1147 <listitem> 1148 <para> 1149 <literal>UNIQUE_CHECK_EXISTING</literal> indicates that this is a deferred 1150 recheck of a row that was reported as a potential uniqueness violation. 1151 Although this is implemented by calling <function>aminsert</function>, the 1152 access method must <emphasis>not</emphasis> insert a new index entry in this 1153 case. The index entry is already present. Rather, the access method 1154 must check to see if there is another live index entry. If so, and 1155 if the target row is also still live, report error. 1156 </para> 1157 1158 <para> 1159 It is recommended that in a <literal>UNIQUE_CHECK_EXISTING</literal> call, 1160 the access method further verify that the target row actually does 1161 have an existing entry in the index, and report error if not. This 1162 is a good idea because the index tuple values passed to 1163 <function>aminsert</function> will have been recomputed. If the index 1164 definition involves functions that are not really immutable, we 1165 might be checking the wrong area of the index. Checking that the 1166 target row is found in the recheck verifies that we are scanning 1167 for the same tuple values as were used in the original insertion. 1168 </para> 1169 </listitem> 1170 </itemizedlist> 1171 </para> 1172 1173 </sect1> 1174 1175 <sect1 id="index-cost-estimation"> 1176 <title>Index Cost Estimation Functions</title> 1177 1178 <para> 1179 The <function>amcostestimate</function> function is given information describing 1180 a possible index scan, including lists of WHERE and ORDER BY clauses that 1181 have been determined to be usable with the index. It must return estimates 1182 of the cost of accessing the index and the selectivity of the WHERE 1183 clauses (that is, the fraction of parent-table rows that will be 1184 retrieved during the index scan). For simple cases, nearly all the 1185 work of the cost estimator can be done by calling standard routines 1186 in the optimizer; the point of having an <function>amcostestimate</function> function is 1187 to allow index access methods to provide index-type-specific knowledge, 1188 in case it is possible to improve on the standard estimates. 1189 </para> 1190 1191 <para> 1192 Each <function>amcostestimate</function> function must have the signature: 1193 1194<programlisting> 1195void 1196amcostestimate (PlannerInfo *root, 1197 IndexPath *path, 1198 double loop_count, 1199 Cost *indexStartupCost, 1200 Cost *indexTotalCost, 1201 Selectivity *indexSelectivity, 1202 double *indexCorrelation, 1203 double *indexPages); 1204</programlisting> 1205 1206 The first three parameters are inputs: 1207 1208 <variablelist> 1209 <varlistentry> 1210 <term><parameter>root</parameter></term> 1211 <listitem> 1212 <para> 1213 The planner's information about the query being processed. 1214 </para> 1215 </listitem> 1216 </varlistentry> 1217 1218 <varlistentry> 1219 <term><parameter>path</parameter></term> 1220 <listitem> 1221 <para> 1222 The index access path being considered. All fields except cost and 1223 selectivity values are valid. 1224 </para> 1225 </listitem> 1226 </varlistentry> 1227 1228 <varlistentry> 1229 <term><parameter>loop_count</parameter></term> 1230 <listitem> 1231 <para> 1232 The number of repetitions of the index scan that should be factored 1233 into the cost estimates. This will typically be greater than one when 1234 considering a parameterized scan for use in the inside of a nestloop 1235 join. Note that the cost estimates should still be for just one scan; 1236 a larger <parameter>loop_count</parameter> means that it may be appropriate 1237 to allow for some caching effects across multiple scans. 1238 </para> 1239 </listitem> 1240 </varlistentry> 1241 </variablelist> 1242 </para> 1243 1244 <para> 1245 The last five parameters are pass-by-reference outputs: 1246 1247 <variablelist> 1248 <varlistentry> 1249 <term><parameter>*indexStartupCost</parameter></term> 1250 <listitem> 1251 <para> 1252 Set to cost of index start-up processing 1253 </para> 1254 </listitem> 1255 </varlistentry> 1256 1257 <varlistentry> 1258 <term><parameter>*indexTotalCost</parameter></term> 1259 <listitem> 1260 <para> 1261 Set to total cost of index processing 1262 </para> 1263 </listitem> 1264 </varlistentry> 1265 1266 <varlistentry> 1267 <term><parameter>*indexSelectivity</parameter></term> 1268 <listitem> 1269 <para> 1270 Set to index selectivity 1271 </para> 1272 </listitem> 1273 </varlistentry> 1274 1275 <varlistentry> 1276 <term><parameter>*indexCorrelation</parameter></term> 1277 <listitem> 1278 <para> 1279 Set to correlation coefficient between index scan order and 1280 underlying table's order 1281 </para> 1282 </listitem> 1283 </varlistentry> 1284 1285 <varlistentry> 1286 <term><parameter>*indexPages</parameter></term> 1287 <listitem> 1288 <para> 1289 Set to number of index leaf pages 1290 </para> 1291 </listitem> 1292 </varlistentry> 1293 </variablelist> 1294 </para> 1295 1296 <para> 1297 Note that cost estimate functions must be written in C, not in SQL or 1298 any available procedural language, because they must access internal 1299 data structures of the planner/optimizer. 1300 </para> 1301 1302 <para> 1303 The index access costs should be computed using the parameters used by 1304 <filename>src/backend/optimizer/path/costsize.c</filename>: a sequential 1305 disk block fetch has cost <varname>seq_page_cost</varname>, a nonsequential fetch 1306 has cost <varname>random_page_cost</varname>, and the cost of processing one index 1307 row should usually be taken as <varname>cpu_index_tuple_cost</varname>. In 1308 addition, an appropriate multiple of <varname>cpu_operator_cost</varname> should 1309 be charged for any comparison operators invoked during index processing 1310 (especially evaluation of the indexquals themselves). 1311 </para> 1312 1313 <para> 1314 The access costs should include all disk and CPU costs associated with 1315 scanning the index itself, but <emphasis>not</emphasis> the costs of retrieving or 1316 processing the parent-table rows that are identified by the index. 1317 </para> 1318 1319 <para> 1320 The <quote>start-up cost</quote> is the part of the total scan cost that 1321 must be expended before we can begin to fetch the first row. For most 1322 indexes this can be taken as zero, but an index type with a high start-up 1323 cost might want to set it nonzero. 1324 </para> 1325 1326 <para> 1327 The <parameter>indexSelectivity</parameter> should be set to the estimated fraction of the parent 1328 table rows that will be retrieved during the index scan. In the case 1329 of a lossy query, this will typically be higher than the fraction of 1330 rows that actually pass the given qual conditions. 1331 </para> 1332 1333 <para> 1334 The <parameter>indexCorrelation</parameter> should be set to the correlation (ranging between 1335 -1.0 and 1.0) between the index order and the table order. This is used 1336 to adjust the estimate for the cost of fetching rows from the parent 1337 table. 1338 </para> 1339 1340 <para> 1341 The <parameter>indexPages</parameter> should be set to the number of leaf pages. 1342 This is used to estimate the number of workers for parallel index scan. 1343 </para> 1344 1345 <para> 1346 When <parameter>loop_count</parameter> is greater than one, the returned numbers 1347 should be averages expected for any one scan of the index. 1348 </para> 1349 1350 <procedure> 1351 <title>Cost Estimation</title> 1352 <para> 1353 A typical cost estimator will proceed as follows: 1354 </para> 1355 1356 <step> 1357 <para> 1358 Estimate and return the fraction of parent-table rows that will be visited 1359 based on the given qual conditions. In the absence of any index-type-specific 1360 knowledge, use the standard optimizer function <function>clauselist_selectivity()</function>: 1361 1362<programlisting> 1363*indexSelectivity = clauselist_selectivity(root, path->indexquals, 1364 path->indexinfo->rel->relid, 1365 JOIN_INNER, NULL); 1366</programlisting> 1367 </para> 1368 </step> 1369 1370 <step> 1371 <para> 1372 Estimate the number of index rows that will be visited during the 1373 scan. For many index types this is the same as <parameter>indexSelectivity</parameter> times 1374 the number of rows in the index, but it might be more. (Note that the 1375 index's size in pages and rows is available from the 1376 <literal>path->indexinfo</literal> struct.) 1377 </para> 1378 </step> 1379 1380 <step> 1381 <para> 1382 Estimate the number of index pages that will be retrieved during the scan. 1383 This might be just <parameter>indexSelectivity</parameter> times the index's size in pages. 1384 </para> 1385 </step> 1386 1387 <step> 1388 <para> 1389 Compute the index access cost. A generic estimator might do this: 1390 1391<programlisting> 1392/* 1393 * Our generic assumption is that the index pages will be read 1394 * sequentially, so they cost seq_page_cost each, not random_page_cost. 1395 * Also, we charge for evaluation of the indexquals at each index row. 1396 * All the costs are assumed to be paid incrementally during the scan. 1397 */ 1398cost_qual_eval(&index_qual_cost, path->indexquals, root); 1399*indexStartupCost = index_qual_cost.startup; 1400*indexTotalCost = seq_page_cost * numIndexPages + 1401 (cpu_index_tuple_cost + index_qual_cost.per_tuple) * numIndexTuples; 1402</programlisting> 1403 1404 However, the above does not account for amortization of index reads 1405 across repeated index scans. 1406 </para> 1407 </step> 1408 1409 <step> 1410 <para> 1411 Estimate the index correlation. For a simple ordered index on a single 1412 field, this can be retrieved from pg_statistic. If the correlation 1413 is not known, the conservative estimate is zero (no correlation). 1414 </para> 1415 </step> 1416 </procedure> 1417 1418 <para> 1419 Examples of cost estimator functions can be found in 1420 <filename>src/backend/utils/adt/selfuncs.c</filename>. 1421 </para> 1422 </sect1> 1423</chapter> 1424