1<!-- doc/src/sgml/gin.sgml --> 2 3<chapter id="gin"> 4<title>GIN Indexes</title> 5 6 <indexterm> 7 <primary>index</primary> 8 <secondary>GIN</secondary> 9 </indexterm> 10 11<sect1 id="gin-intro"> 12 <title>Introduction</title> 13 14 <para> 15 <acronym>GIN</acronym> stands for Generalized Inverted Index. 16 <acronym>GIN</acronym> is designed for handling cases where the items 17 to be indexed are composite values, and the queries to be handled by 18 the index need to search for element values that appear within 19 the composite items. For example, the items could be documents, 20 and the queries could be searches for documents containing specific words. 21 </para> 22 23 <para> 24 We use the word <firstterm>item</firstterm> to refer to a composite value that 25 is to be indexed, and the word <firstterm>key</firstterm> to refer to an element 26 value. <acronym>GIN</acronym> always stores and searches for keys, 27 not item values per se. 28 </para> 29 30 <para> 31 A <acronym>GIN</acronym> index stores a set of (key, posting list) pairs, 32 where a <firstterm>posting list</firstterm> is a set of row IDs in which the key 33 occurs. The same row ID can appear in multiple posting lists, since 34 an item can contain more than one key. Each key value is stored only 35 once, so a <acronym>GIN</acronym> index is very compact for cases 36 where the same key appears many times. 37 </para> 38 39 <para> 40 <acronym>GIN</acronym> is generalized in the sense that the 41 <acronym>GIN</acronym> access method code does not need to know the 42 specific operations that it accelerates. 43 Instead, it uses custom strategies defined for particular data types. 44 The strategy defines how keys are extracted from indexed items and 45 query conditions, and how to determine whether a row that contains 46 some of the key values in a query actually satisfies the query. 47 </para> 48 49 <para> 50 One advantage of <acronym>GIN</acronym> is that it allows the development 51 of custom data types with the appropriate access methods, by 52 an expert in the domain of the data type, rather than a database expert. 53 This is much the same advantage as using <acronym>GiST</acronym>. 54 </para> 55 56 <para> 57 The <acronym>GIN</acronym> 58 implementation in <productname>PostgreSQL</productname> is primarily 59 maintained by Teodor Sigaev and Oleg Bartunov. There is more 60 information about <acronym>GIN</acronym> on their 61 <ulink url="http://www.sai.msu.su/~megera/wiki/Gin">website</ulink>. 62 </para> 63</sect1> 64 65<sect1 id="gin-builtin-opclasses"> 66 <title>Built-in Operator Classes</title> 67 68 <para> 69 The core <productname>PostgreSQL</productname> distribution 70 includes the <acronym>GIN</acronym> operator classes shown in 71 <xref linkend="gin-builtin-opclasses-table"/>. 72 (Some of the optional modules described in <xref linkend="contrib"/> 73 provide additional <acronym>GIN</acronym> operator classes.) 74 </para> 75 76 <table id="gin-builtin-opclasses-table"> 77 <title>Built-in <acronym>GIN</acronym> Operator Classes</title> 78 <tgroup cols="3"> 79 <thead> 80 <row> 81 <entry>Name</entry> 82 <entry>Indexed Data Type</entry> 83 <entry>Indexable Operators</entry> 84 </row> 85 </thead> 86 <tbody> 87 <row> 88 <entry><literal>array_ops</literal></entry> 89 <entry><type>anyarray</type></entry> 90 <entry> 91 <literal>&&</literal> 92 <literal><@</literal> 93 <literal>=</literal> 94 <literal>@></literal> 95 </entry> 96 </row> 97 <row> 98 <entry><literal>jsonb_ops</literal></entry> 99 <entry><type>jsonb</type></entry> 100 <entry> 101 <literal>?</literal> 102 <literal>?&</literal> 103 <literal>?|</literal> 104 <literal>@></literal> 105 <literal>@?</literal> 106 <literal>@@</literal> 107 </entry> 108 </row> 109 <row> 110 <entry><literal>jsonb_path_ops</literal></entry> 111 <entry><type>jsonb</type></entry> 112 <entry> 113 <literal>@></literal> 114 <literal>@?</literal> 115 <literal>@@</literal> 116 </entry> 117 </row> 118 <row> 119 <entry><literal>tsvector_ops</literal></entry> 120 <entry><type>tsvector</type></entry> 121 <entry> 122 <literal>@@</literal> 123 <literal>@@@</literal> 124 </entry> 125 </row> 126 </tbody> 127 </tgroup> 128 </table> 129 130 <para> 131 Of the two operator classes for type <type>jsonb</type>, <literal>jsonb_ops</literal> 132 is the default. <literal>jsonb_path_ops</literal> supports fewer operators but 133 offers better performance for those operators. 134 See <xref linkend="json-indexing"/> for details. 135 </para> 136 137</sect1> 138 139<sect1 id="gin-extensibility"> 140 <title>Extensibility</title> 141 142 <para> 143 The <acronym>GIN</acronym> interface has a high level of abstraction, 144 requiring the access method implementer only to implement the semantics of 145 the data type being accessed. The <acronym>GIN</acronym> layer itself 146 takes care of concurrency, logging and searching the tree structure. 147 </para> 148 149 <para> 150 All it takes to get a <acronym>GIN</acronym> access method working is to 151 implement a few user-defined methods, which define the behavior of 152 keys in the tree and the relationships between keys, indexed items, 153 and indexable queries. In short, <acronym>GIN</acronym> combines 154 extensibility with generality, code reuse, and a clean interface. 155 </para> 156 157 <para> 158 There are two methods that an operator class for 159 <acronym>GIN</acronym> must provide: 160 161 <variablelist> 162 <varlistentry> 163 <term><function>Datum *extractValue(Datum itemValue, int32 *nkeys, 164 bool **nullFlags)</function></term> 165 <listitem> 166 <para> 167 Returns a palloc'd array of keys given an item to be indexed. The 168 number of returned keys must be stored into <literal>*nkeys</literal>. 169 If any of the keys can be null, also palloc an array of 170 <literal>*nkeys</literal> <type>bool</type> fields, store its address at 171 <literal>*nullFlags</literal>, and set these null flags as needed. 172 <literal>*nullFlags</literal> can be left <symbol>NULL</symbol> (its initial value) 173 if all keys are non-null. 174 The return value can be <symbol>NULL</symbol> if the item contains no keys. 175 </para> 176 </listitem> 177 </varlistentry> 178 179 <varlistentry> 180 <term><function>Datum *extractQuery(Datum query, int32 *nkeys, 181 StrategyNumber n, bool **pmatch, Pointer **extra_data, 182 bool **nullFlags, int32 *searchMode)</function></term> 183 <listitem> 184 <para> 185 Returns a palloc'd array of keys given a value to be queried; that is, 186 <literal>query</literal> is the value on the right-hand side of an 187 indexable operator whose left-hand side is the indexed column. 188 <literal>n</literal> is the strategy number of the operator within the 189 operator class (see <xref linkend="xindex-strategies"/>). 190 Often, <function>extractQuery</function> will need 191 to consult <literal>n</literal> to determine the data type of 192 <literal>query</literal> and the method it should use to extract key values. 193 The number of returned keys must be stored into <literal>*nkeys</literal>. 194 If any of the keys can be null, also palloc an array of 195 <literal>*nkeys</literal> <type>bool</type> fields, store its address at 196 <literal>*nullFlags</literal>, and set these null flags as needed. 197 <literal>*nullFlags</literal> can be left <symbol>NULL</symbol> (its initial value) 198 if all keys are non-null. 199 The return value can be <symbol>NULL</symbol> if the <literal>query</literal> contains no keys. 200 </para> 201 202 <para> 203 <literal>searchMode</literal> is an output argument that allows 204 <function>extractQuery</function> to specify details about how the search 205 will be done. 206 If <literal>*searchMode</literal> is set to 207 <literal>GIN_SEARCH_MODE_DEFAULT</literal> (which is the value it is 208 initialized to before call), only items that match at least one of 209 the returned keys are considered candidate matches. 210 If <literal>*searchMode</literal> is set to 211 <literal>GIN_SEARCH_MODE_INCLUDE_EMPTY</literal>, then in addition to items 212 containing at least one matching key, items that contain no keys at 213 all are considered candidate matches. (This mode is useful for 214 implementing is-subset-of operators, for example.) 215 If <literal>*searchMode</literal> is set to <literal>GIN_SEARCH_MODE_ALL</literal>, 216 then all non-null items in the index are considered candidate 217 matches, whether they match any of the returned keys or not. (This 218 mode is much slower than the other two choices, since it requires 219 scanning essentially the entire index, but it may be necessary to 220 implement corner cases correctly. An operator that needs this mode 221 in most cases is probably not a good candidate for a GIN operator 222 class.) 223 The symbols to use for setting this mode are defined in 224 <filename>access/gin.h</filename>. 225 </para> 226 227 <para> 228 <literal>pmatch</literal> is an output argument for use when partial match 229 is supported. To use it, <function>extractQuery</function> must allocate 230 an array of <literal>*nkeys</literal> <type>bool</type>s and store its address at 231 <literal>*pmatch</literal>. Each element of the array should be set to true 232 if the corresponding key requires partial match, false if not. 233 If <literal>*pmatch</literal> is set to <symbol>NULL</symbol> then GIN assumes partial match 234 is not required. The variable is initialized to <symbol>NULL</symbol> before call, 235 so this argument can simply be ignored by operator classes that do 236 not support partial match. 237 </para> 238 239 <para> 240 <literal>extra_data</literal> is an output argument that allows 241 <function>extractQuery</function> to pass additional data to the 242 <function>consistent</function> and <function>comparePartial</function> methods. 243 To use it, <function>extractQuery</function> must allocate 244 an array of <literal>*nkeys</literal> pointers and store its address at 245 <literal>*extra_data</literal>, then store whatever it wants to into the 246 individual pointers. The variable is initialized to <symbol>NULL</symbol> before 247 call, so this argument can simply be ignored by operator classes that 248 do not require extra data. If <literal>*extra_data</literal> is set, the 249 whole array is passed to the <function>consistent</function> method, and 250 the appropriate element to the <function>comparePartial</function> method. 251 </para> 252 253 </listitem> 254 </varlistentry> 255 </variablelist> 256 257 An operator class must also provide a function to check if an indexed item 258 matches the query. It comes in two flavors, a Boolean <function>consistent</function> 259 function, and a ternary <function>triConsistent</function> function. 260 <function>triConsistent</function> covers the functionality of both, so providing 261 <function>triConsistent</function> alone is sufficient. However, if the Boolean 262 variant is significantly cheaper to calculate, it can be advantageous to 263 provide both. If only the Boolean variant is provided, some optimizations 264 that depend on refuting index items before fetching all the keys are 265 disabled. 266 267 <variablelist> 268 <varlistentry> 269 <term><function>bool consistent(bool check[], StrategyNumber n, Datum query, 270 int32 nkeys, Pointer extra_data[], bool *recheck, 271 Datum queryKeys[], bool nullFlags[])</function></term> 272 <listitem> 273 <para> 274 Returns true if an indexed item satisfies the query operator with 275 strategy number <literal>n</literal> (or might satisfy it, if the recheck 276 indication is returned). This function does not have direct access 277 to the indexed item's value, since <acronym>GIN</acronym> does not 278 store items explicitly. Rather, what is available is knowledge 279 about which key values extracted from the query appear in a given 280 indexed item. The <literal>check</literal> array has length 281 <literal>nkeys</literal>, which is the same as the number of keys previously 282 returned by <function>extractQuery</function> for this <literal>query</literal> datum. 283 Each element of the 284 <literal>check</literal> array is true if the indexed item contains the 285 corresponding query key, i.e., if (check[i] == true) the i-th key of the 286 <function>extractQuery</function> result array is present in the indexed item. 287 The original <literal>query</literal> datum is 288 passed in case the <function>consistent</function> method needs to consult it, 289 and so are the <literal>queryKeys[]</literal> and <literal>nullFlags[]</literal> 290 arrays previously returned by <function>extractQuery</function>. 291 <literal>extra_data</literal> is the extra-data array returned by 292 <function>extractQuery</function>, or <symbol>NULL</symbol> if none. 293 </para> 294 295 <para> 296 When <function>extractQuery</function> returns a null key in 297 <literal>queryKeys[]</literal>, the corresponding <literal>check[]</literal> element 298 is true if the indexed item contains a null key; that is, the 299 semantics of <literal>check[]</literal> are like <literal>IS NOT DISTINCT 300 FROM</literal>. The <function>consistent</function> function can examine the 301 corresponding <literal>nullFlags[]</literal> element if it needs to tell 302 the difference between a regular value match and a null match. 303 </para> 304 305 <para> 306 On success, <literal>*recheck</literal> should be set to true if the heap 307 tuple needs to be rechecked against the query operator, or false if 308 the index test is exact. That is, a false return value guarantees 309 that the heap tuple does not match the query; a true return value with 310 <literal>*recheck</literal> set to false guarantees that the heap tuple does 311 match the query; and a true return value with 312 <literal>*recheck</literal> set to true means that the heap tuple might match 313 the query, so it needs to be fetched and rechecked by evaluating the 314 query operator directly against the originally indexed item. 315 </para> 316 </listitem> 317 </varlistentry> 318 319 <varlistentry> 320 <term><function>GinTernaryValue triConsistent(GinTernaryValue check[], StrategyNumber n, Datum query, 321 int32 nkeys, Pointer extra_data[], 322 Datum queryKeys[], bool nullFlags[])</function></term> 323 <listitem> 324 <para> 325 <function>triConsistent</function> is similar to <function>consistent</function>, 326 but instead of Booleans in the <literal>check</literal> vector, there are 327 three possible values for each 328 key: <literal>GIN_TRUE</literal>, <literal>GIN_FALSE</literal> and 329 <literal>GIN_MAYBE</literal>. <literal>GIN_FALSE</literal> and <literal>GIN_TRUE</literal> 330 have the same meaning as regular Boolean values, while 331 <literal>GIN_MAYBE</literal> means that the presence of that key is not known. 332 When <literal>GIN_MAYBE</literal> values are present, the function should only 333 return <literal>GIN_TRUE</literal> if the item certainly matches whether or 334 not the index item contains the corresponding query keys. Likewise, the 335 function must return <literal>GIN_FALSE</literal> only if the item certainly 336 does not match, whether or not it contains the <literal>GIN_MAYBE</literal> 337 keys. If the result depends on the <literal>GIN_MAYBE</literal> entries, i.e., 338 the match cannot be confirmed or refuted based on the known query keys, 339 the function must return <literal>GIN_MAYBE</literal>. 340 </para> 341 <para> 342 When there are no <literal>GIN_MAYBE</literal> values in the <literal>check</literal> 343 vector, a <literal>GIN_MAYBE</literal> return value is the equivalent of 344 setting the <literal>recheck</literal> flag in the 345 Boolean <function>consistent</function> function. 346 </para> 347 </listitem> 348 </varlistentry> 349 </variablelist> 350 </para> 351 352 <para> 353 In addition, GIN must have a way to sort the key values stored in the index. 354 The operator class can define the sort ordering by specifying a comparison 355 method: 356 357 <variablelist> 358 <varlistentry> 359 <term><function>int compare(Datum a, Datum b)</function></term> 360 <listitem> 361 <para> 362 Compares two keys (not indexed items!) and returns an integer less than 363 zero, zero, or greater than zero, indicating whether the first key is 364 less than, equal to, or greater than the second. Null keys are never 365 passed to this function. 366 </para> 367 </listitem> 368 </varlistentry> 369 </variablelist> 370 371 Alternatively, if the operator class does not provide a <function>compare</function> 372 method, GIN will look up the default btree operator class for the index 373 key data type, and use its comparison function. It is recommended to 374 specify the comparison function in a GIN operator class that is meant for 375 just one data type, as looking up the btree operator class costs a few 376 cycles. However, polymorphic GIN operator classes (such 377 as <literal>array_ops</literal>) typically cannot specify a single comparison 378 function. 379 </para> 380 381 <para> 382 An operator class for <acronym>GIN</acronym> can optionally supply the 383 following methods: 384 385 <variablelist> 386 <varlistentry> 387 <term><function>int comparePartial(Datum partial_key, Datum key, StrategyNumber n, 388 Pointer extra_data)</function></term> 389 <listitem> 390 <para> 391 Compare a partial-match query key to an index key. Returns an integer 392 whose sign indicates the result: less than zero means the index key 393 does not match the query, but the index scan should continue; zero 394 means that the index key does match the query; greater than zero 395 indicates that the index scan should stop because no more matches 396 are possible. The strategy number <literal>n</literal> of the operator 397 that generated the partial match query is provided, in case its 398 semantics are needed to determine when to end the scan. Also, 399 <literal>extra_data</literal> is the corresponding element of the extra-data 400 array made by <function>extractQuery</function>, or <symbol>NULL</symbol> if none. 401 Null keys are never passed to this function. 402 </para> 403 </listitem> 404 </varlistentry> 405 <varlistentry> 406 <term><function>void options(local_relopts *relopts)</function></term> 407 <listitem> 408 <para> 409 Defines a set of user-visible parameters that control operator class 410 behavior. 411 </para> 412 413 <para> 414 The <function>options</function> function is passed a pointer to a 415 <replaceable>local_relopts</replaceable> struct, which needs to be 416 filled with a set of operator class specific options. The options 417 can be accessed from other support functions using the 418 <literal>PG_HAS_OPCLASS_OPTIONS()</literal> and 419 <literal>PG_GET_OPCLASS_OPTIONS()</literal> macros. 420 </para> 421 422 <para> 423 Since both key extraction of indexed values and representation of the 424 key in <acronym>GIN</acronym> are flexible, they may depend on 425 user-specified parameters. 426 </para> 427 </listitem> 428 </varlistentry> 429 </variablelist> 430 </para> 431 432 <para> 433 To support <quote>partial match</quote> queries, an operator class must 434 provide the <function>comparePartial</function> method, and its 435 <function>extractQuery</function> method must set the <literal>pmatch</literal> 436 parameter when a partial-match query is encountered. See 437 <xref linkend="gin-partial-match"/> for details. 438 </para> 439 440 <para> 441 The actual data types of the various <literal>Datum</literal> values mentioned 442 above vary depending on the operator class. The item values passed to 443 <function>extractValue</function> are always of the operator class's input type, and 444 all key values must be of the class's <literal>STORAGE</literal> type. The type of 445 the <literal>query</literal> argument passed to <function>extractQuery</function>, 446 <function>consistent</function> and <function>triConsistent</function> is whatever is the 447 right-hand input type of the class member operator identified by the 448 strategy number. This need not be the same as the indexed type, so long as 449 key values of the correct type can be extracted from it. However, it is 450 recommended that the SQL declarations of these three support functions use 451 the opclass's indexed data type for the <literal>query</literal> argument, even 452 though the actual type might be something else depending on the operator. 453 </para> 454 455</sect1> 456 457<sect1 id="gin-implementation"> 458 <title>Implementation</title> 459 460 <para> 461 Internally, a <acronym>GIN</acronym> index contains a B-tree index 462 constructed over keys, where each key is an element of one or more indexed 463 items (a member of an array, for example) and where each tuple in a leaf 464 page contains either a pointer to a B-tree of heap pointers (a 465 <quote>posting tree</quote>), or a simple list of heap pointers (a <quote>posting 466 list</quote>) when the list is small enough to fit into a single index tuple along 467 with the key value. <xref linkend="gin-internals-figure"/> illustrates 468 these components of a GIN index. 469 </para> 470 471 <para> 472 As of <productname>PostgreSQL</productname> 9.1, null key values can be 473 included in the index. Also, placeholder nulls are included in the index 474 for indexed items that are null or contain no keys according to 475 <function>extractValue</function>. This allows searches that should find empty 476 items to do so. 477 </para> 478 479 <para> 480 Multicolumn <acronym>GIN</acronym> indexes are implemented by building 481 a single B-tree over composite values (column number, key value). The 482 key values for different columns can be of different types. 483 </para> 484 485 <figure id="gin-internals-figure"> 486 <title>GIN Internals</title> 487 <mediaobject> 488 <imageobject> 489 <imagedata fileref="images/gin.svg" format="SVG" width="100%"/> 490 </imageobject> 491 </mediaobject> 492 </figure> 493 494 <sect2 id="gin-fast-update"> 495 <title>GIN Fast Update Technique</title> 496 497 <para> 498 Updating a <acronym>GIN</acronym> index tends to be slow because of the 499 intrinsic nature of inverted indexes: inserting or updating one heap row 500 can cause many inserts into the index (one for each key extracted 501 from the indexed item). As of <productname>PostgreSQL</productname> 8.4, 502 <acronym>GIN</acronym> is capable of postponing much of this work by inserting 503 new tuples into a temporary, unsorted list of pending entries. 504 When the table is vacuumed or autoanalyzed, or when 505 <function>gin_clean_pending_list</function> function is called, or if the 506 pending list becomes larger than 507 <xref linkend="guc-gin-pending-list-limit"/>, the entries are moved to the 508 main <acronym>GIN</acronym> data structure using the same bulk insert 509 techniques used during initial index creation. This greatly improves 510 <acronym>GIN</acronym> index update speed, even counting the additional 511 vacuum overhead. Moreover the overhead work can be done by a background 512 process instead of in foreground query processing. 513 </para> 514 515 <para> 516 The main disadvantage of this approach is that searches must scan the list 517 of pending entries in addition to searching the regular index, and so 518 a large list of pending entries will slow searches significantly. 519 Another disadvantage is that, while most updates are fast, an update 520 that causes the pending list to become <quote>too large</quote> will incur an 521 immediate cleanup cycle and thus be much slower than other updates. 522 Proper use of autovacuum can minimize both of these problems. 523 </para> 524 525 <para> 526 If consistent response time is more important than update speed, 527 use of pending entries can be disabled by turning off the 528 <literal>fastupdate</literal> storage parameter for a 529 <acronym>GIN</acronym> index. See <xref linkend="sql-createindex"/> 530 for details. 531 </para> 532 </sect2> 533 534 <sect2 id="gin-partial-match"> 535 <title>Partial Match Algorithm</title> 536 537 <para> 538 GIN can support <quote>partial match</quote> queries, in which the query 539 does not determine an exact match for one or more keys, but the possible 540 matches fall within a reasonably narrow range of key values (within the 541 key sorting order determined by the <function>compare</function> support method). 542 The <function>extractQuery</function> method, instead of returning a key value 543 to be matched exactly, returns a key value that is the lower bound of 544 the range to be searched, and sets the <literal>pmatch</literal> flag true. 545 The key range is then scanned using the <function>comparePartial</function> 546 method. <function>comparePartial</function> must return zero for a matching 547 index key, less than zero for a non-match that is still within the range 548 to be searched, or greater than zero if the index key is past the range 549 that could match. 550 </para> 551 </sect2> 552 553</sect1> 554 555<sect1 id="gin-tips"> 556<title>GIN Tips and Tricks</title> 557 558 <variablelist> 559 <varlistentry> 560 <term>Create vs. insert</term> 561 <listitem> 562 <para> 563 Insertion into a <acronym>GIN</acronym> index can be slow 564 due to the likelihood of many keys being inserted for each item. 565 So, for bulk insertions into a table it is advisable to drop the GIN 566 index and recreate it after finishing bulk insertion. 567 </para> 568 569 <para> 570 As of <productname>PostgreSQL</productname> 8.4, this advice is less 571 necessary since delayed indexing is used (see <xref 572 linkend="gin-fast-update"/> for details). But for very large updates 573 it may still be best to drop and recreate the index. 574 </para> 575 </listitem> 576 </varlistentry> 577 578 <varlistentry> 579 <term><xref linkend="guc-maintenance-work-mem"/></term> 580 <listitem> 581 <para> 582 Build time for a <acronym>GIN</acronym> index is very sensitive to 583 the <varname>maintenance_work_mem</varname> setting; it doesn't pay to 584 skimp on work memory during index creation. 585 </para> 586 </listitem> 587 </varlistentry> 588 589 <varlistentry> 590 <term><xref linkend="guc-gin-pending-list-limit"/></term> 591 <listitem> 592 <para> 593 During a series of insertions into an existing <acronym>GIN</acronym> 594 index that has <literal>fastupdate</literal> enabled, the system will clean up 595 the pending-entry list whenever the list grows larger than 596 <varname>gin_pending_list_limit</varname>. To avoid fluctuations in observed 597 response time, it's desirable to have pending-list cleanup occur in the 598 background (i.e., via autovacuum). Foreground cleanup operations 599 can be avoided by increasing <varname>gin_pending_list_limit</varname> 600 or making autovacuum more aggressive. 601 However, enlarging the threshold of the cleanup operation means that 602 if a foreground cleanup does occur, it will take even longer. 603 </para> 604 <para> 605 <varname>gin_pending_list_limit</varname> can be overridden for individual 606 GIN indexes by changing storage parameters, which allows each 607 GIN index to have its own cleanup threshold. 608 For example, it's possible to increase the threshold only for the GIN 609 index which can be updated heavily, and decrease it otherwise. 610 </para> 611 </listitem> 612 </varlistentry> 613 614 <varlistentry> 615 <term><xref linkend="guc-gin-fuzzy-search-limit"/></term> 616 <listitem> 617 <para> 618 The primary goal of developing <acronym>GIN</acronym> indexes was 619 to create support for highly scalable full-text search in 620 <productname>PostgreSQL</productname>, and there are often situations when 621 a full-text search returns a very large set of results. Moreover, this 622 often happens when the query contains very frequent words, so that the 623 large result set is not even useful. Since reading many 624 tuples from the disk and sorting them could take a lot of time, this is 625 unacceptable for production. (Note that the index search itself is very 626 fast.) 627 </para> 628 <para> 629 To facilitate controlled execution of such queries, 630 <acronym>GIN</acronym> has a configurable soft upper limit on the 631 number of rows returned: the 632 <varname>gin_fuzzy_search_limit</varname> configuration parameter. 633 It is set to 0 (meaning no limit) by default. 634 If a non-zero limit is set, then the returned set is a subset of 635 the whole result set, chosen at random. 636 </para> 637 <para> 638 <quote>Soft</quote> means that the actual number of returned results 639 could differ somewhat from the specified limit, depending on the query 640 and the quality of the system's random number generator. 641 </para> 642 <para> 643 From experience, values in the thousands (e.g., 5000 — 20000) 644 work well. 645 </para> 646 </listitem> 647 </varlistentry> 648 </variablelist> 649 650</sect1> 651 652<sect1 id="gin-limit"> 653 <title>Limitations</title> 654 655 <para> 656 <acronym>GIN</acronym> assumes that indexable operators are strict. This 657 means that <function>extractValue</function> will not be called at all on a null 658 item value (instead, a placeholder index entry is created automatically), 659 and <function>extractQuery</function> will not be called on a null query 660 value either (instead, the query is presumed to be unsatisfiable). Note 661 however that null key values contained within a non-null composite item 662 or query value are supported. 663 </para> 664</sect1> 665 666<sect1 id="gin-examples"> 667 <title>Examples</title> 668 669 <para> 670 The core <productname>PostgreSQL</productname> distribution 671 includes the <acronym>GIN</acronym> operator classes previously shown in 672 <xref linkend="gin-builtin-opclasses-table"/>. 673 The following <filename>contrib</filename> modules also contain 674 <acronym>GIN</acronym> operator classes: 675 676 <variablelist> 677 <varlistentry> 678 <term><filename>btree_gin</filename></term> 679 <listitem> 680 <para>B-tree equivalent functionality for several data types</para> 681 </listitem> 682 </varlistentry> 683 684 <varlistentry> 685 <term><filename>hstore</filename></term> 686 <listitem> 687 <para>Module for storing (key, value) pairs</para> 688 </listitem> 689 </varlistentry> 690 691 <varlistentry> 692 <term><filename>intarray</filename></term> 693 <listitem> 694 <para>Enhanced support for <type>int[]</type></para> 695 </listitem> 696 </varlistentry> 697 698 <varlistentry> 699 <term><filename>pg_trgm</filename></term> 700 <listitem> 701 <para>Text similarity using trigram matching</para> 702 </listitem> 703 </varlistentry> 704 </variablelist> 705 </para> 706</sect1> 707 708</chapter> 709