1<!-- doc/src/sgml/gin.sgml -->
2
3<chapter id="gin">
4<title>GIN Indexes</title>
5
6   <indexterm>
7    <primary>index</primary>
8    <secondary>GIN</secondary>
9   </indexterm>
10
11<sect1 id="gin-intro">
12 <title>Introduction</title>
13
14 <para>
15  <acronym>GIN</acronym> stands for Generalized Inverted Index.
16  <acronym>GIN</acronym> is designed for handling cases where the items
17  to be indexed are composite values, and the queries to be handled by
18  the index need to search for element values that appear within
19  the composite items.  For example, the items could be documents,
20  and the queries could be searches for documents containing specific words.
21 </para>
22
23 <para>
24  We use the word <firstterm>item</firstterm> to refer to a composite value that
25  is to be indexed, and the word <firstterm>key</firstterm> to refer to an element
26  value.  <acronym>GIN</acronym> always stores and searches for keys,
27  not item values per se.
28 </para>
29
30 <para>
31  A <acronym>GIN</acronym> index stores a set of (key, posting list) pairs,
32  where a <firstterm>posting list</firstterm> is a set of row IDs in which the key
33  occurs.  The same row ID can appear in multiple posting lists, since
34  an item can contain more than one key.  Each key value is stored only
35  once, so a <acronym>GIN</acronym> index is very compact for cases
36  where the same key appears many times.
37 </para>
38
39 <para>
40  <acronym>GIN</acronym> is generalized in the sense that the
41  <acronym>GIN</acronym> access method code does not need to know the
42  specific operations that it accelerates.
43  Instead, it uses custom strategies defined for particular data types.
44  The strategy defines how keys are extracted from indexed items and
45  query conditions, and how to determine whether a row that contains
46  some of the key values in a query actually satisfies the query.
47 </para>
48
49 <para>
50  One advantage of <acronym>GIN</acronym> is that it allows the development
51  of custom data types with the appropriate access methods, by
52  an expert in the domain of the data type, rather than a database expert.
53  This is much the same advantage as using <acronym>GiST</acronym>.
54 </para>
55
56 <para>
57  The <acronym>GIN</acronym>
58  implementation in <productname>PostgreSQL</productname> is primarily
59  maintained by Teodor Sigaev and Oleg Bartunov. There is more
60  information about <acronym>GIN</acronym> on their
61  <ulink url="http://www.sai.msu.su/~megera/wiki/Gin">website</ulink>.
62 </para>
63</sect1>
64
65<sect1 id="gin-builtin-opclasses">
66 <title>Built-in Operator Classes</title>
67
68 <para>
69  The core <productname>PostgreSQL</productname> distribution
70  includes the <acronym>GIN</acronym> operator classes shown in
71  <xref linkend="gin-builtin-opclasses-table"/>.
72  (Some of the optional modules described in <xref linkend="contrib"/>
73  provide additional <acronym>GIN</acronym> operator classes.)
74 </para>
75
76  <table id="gin-builtin-opclasses-table">
77   <title>Built-in <acronym>GIN</acronym> Operator Classes</title>
78   <tgroup cols="3">
79    <thead>
80     <row>
81      <entry>Name</entry>
82      <entry>Indexed Data Type</entry>
83      <entry>Indexable Operators</entry>
84     </row>
85    </thead>
86    <tbody>
87     <row>
88      <entry><literal>array_ops</literal></entry>
89      <entry><type>anyarray</type></entry>
90      <entry>
91       <literal>&amp;&amp;</literal>
92       <literal>&lt;@</literal>
93       <literal>=</literal>
94       <literal>@&gt;</literal>
95      </entry>
96     </row>
97     <row>
98      <entry><literal>jsonb_ops</literal></entry>
99      <entry><type>jsonb</type></entry>
100      <entry>
101       <literal>?</literal>
102       <literal>?&amp;</literal>
103       <literal>?|</literal>
104       <literal>@&gt;</literal>
105       <literal>@?</literal>
106       <literal>@@</literal>
107      </entry>
108     </row>
109     <row>
110      <entry><literal>jsonb_path_ops</literal></entry>
111      <entry><type>jsonb</type></entry>
112      <entry>
113       <literal>@&gt;</literal>
114       <literal>@?</literal>
115       <literal>@@</literal>
116      </entry>
117     </row>
118     <row>
119      <entry><literal>tsvector_ops</literal></entry>
120      <entry><type>tsvector</type></entry>
121      <entry>
122       <literal>@@</literal>
123       <literal>@@@</literal>
124      </entry>
125     </row>
126    </tbody>
127   </tgroup>
128  </table>
129
130 <para>
131  Of the two operator classes for type <type>jsonb</type>, <literal>jsonb_ops</literal>
132  is the default.  <literal>jsonb_path_ops</literal> supports fewer operators but
133  offers better performance for those operators.
134  See <xref linkend="json-indexing"/> for details.
135 </para>
136
137</sect1>
138
139<sect1 id="gin-extensibility">
140 <title>Extensibility</title>
141
142 <para>
143   The <acronym>GIN</acronym> interface has a high level of abstraction,
144   requiring the access method implementer only to implement the semantics of
145   the data type being accessed.  The <acronym>GIN</acronym> layer itself
146   takes care of concurrency, logging and searching the tree structure.
147 </para>
148
149 <para>
150   All it takes to get a <acronym>GIN</acronym> access method working is to
151   implement a few user-defined methods, which define the behavior of
152   keys in the tree and the relationships between keys, indexed items,
153   and indexable queries. In short, <acronym>GIN</acronym> combines
154   extensibility with generality, code reuse, and a clean interface.
155 </para>
156
157 <para>
158   There are two methods that an operator class for
159   <acronym>GIN</acronym> must provide:
160
161  <variablelist>
162    <varlistentry>
163     <term><function>Datum *extractValue(Datum itemValue, int32 *nkeys,
164        bool **nullFlags)</function></term>
165     <listitem>
166      <para>
167       Returns a palloc'd array of keys given an item to be indexed.  The
168       number of returned keys must be stored into <literal>*nkeys</literal>.
169       If any of the keys can be null, also palloc an array of
170       <literal>*nkeys</literal> <type>bool</type> fields, store its address at
171       <literal>*nullFlags</literal>, and set these null flags as needed.
172       <literal>*nullFlags</literal> can be left <symbol>NULL</symbol> (its initial value)
173       if all keys are non-null.
174       The return value can be <symbol>NULL</symbol> if the item contains no keys.
175      </para>
176     </listitem>
177    </varlistentry>
178
179    <varlistentry>
180     <term><function>Datum *extractQuery(Datum query, int32 *nkeys,
181        StrategyNumber n, bool **pmatch, Pointer **extra_data,
182        bool **nullFlags, int32 *searchMode)</function></term>
183     <listitem>
184      <para>
185       Returns a palloc'd array of keys given a value to be queried; that is,
186       <literal>query</literal> is the value on the right-hand side of an
187       indexable operator whose left-hand side is the indexed column.
188       <literal>n</literal> is the strategy number of the operator within the
189       operator class (see <xref linkend="xindex-strategies"/>).
190       Often, <function>extractQuery</function> will need
191       to consult <literal>n</literal> to determine the data type of
192       <literal>query</literal> and the method it should use to extract key values.
193       The number of returned keys must be stored into <literal>*nkeys</literal>.
194       If any of the keys can be null, also palloc an array of
195       <literal>*nkeys</literal> <type>bool</type> fields, store its address at
196       <literal>*nullFlags</literal>, and set these null flags as needed.
197       <literal>*nullFlags</literal> can be left <symbol>NULL</symbol> (its initial value)
198       if all keys are non-null.
199       The return value can be <symbol>NULL</symbol> if the <literal>query</literal> contains no keys.
200      </para>
201
202      <para>
203       <literal>searchMode</literal> is an output argument that allows
204       <function>extractQuery</function> to specify details about how the search
205       will be done.
206       If <literal>*searchMode</literal> is set to
207       <literal>GIN_SEARCH_MODE_DEFAULT</literal> (which is the value it is
208       initialized to before call), only items that match at least one of
209       the returned keys are considered candidate matches.
210       If <literal>*searchMode</literal> is set to
211       <literal>GIN_SEARCH_MODE_INCLUDE_EMPTY</literal>, then in addition to items
212       containing at least one matching key, items that contain no keys at
213       all are considered candidate matches.  (This mode is useful for
214       implementing is-subset-of operators, for example.)
215       If <literal>*searchMode</literal> is set to <literal>GIN_SEARCH_MODE_ALL</literal>,
216       then all non-null items in the index are considered candidate
217       matches, whether they match any of the returned keys or not.  (This
218       mode is much slower than the other two choices, since it requires
219       scanning essentially the entire index, but it may be necessary to
220       implement corner cases correctly.  An operator that needs this mode
221       in most cases is probably not a good candidate for a GIN operator
222       class.)
223       The symbols to use for setting this mode are defined in
224       <filename>access/gin.h</filename>.
225      </para>
226
227      <para>
228       <literal>pmatch</literal> is an output argument for use when partial match
229       is supported.  To use it, <function>extractQuery</function> must allocate
230       an array of <literal>*nkeys</literal> <type>bool</type>s and store its address at
231       <literal>*pmatch</literal>.  Each element of the array should be set to true
232       if the corresponding key requires partial match, false if not.
233       If <literal>*pmatch</literal> is set to <symbol>NULL</symbol> then GIN assumes partial match
234       is not required.  The variable is initialized to <symbol>NULL</symbol> before call,
235       so this argument can simply be ignored by operator classes that do
236       not support partial match.
237      </para>
238
239      <para>
240       <literal>extra_data</literal> is an output argument that allows
241       <function>extractQuery</function> to pass additional data to the
242       <function>consistent</function> and <function>comparePartial</function> methods.
243       To use it, <function>extractQuery</function> must allocate
244       an array of <literal>*nkeys</literal> pointers and store its address at
245       <literal>*extra_data</literal>, then store whatever it wants to into the
246       individual pointers.  The variable is initialized to <symbol>NULL</symbol> before
247       call, so this argument can simply be ignored by operator classes that
248       do not require extra data.  If <literal>*extra_data</literal> is set, the
249       whole array is passed to the <function>consistent</function> method, and
250       the appropriate element to the <function>comparePartial</function> method.
251      </para>
252
253     </listitem>
254    </varlistentry>
255  </variablelist>
256
257  An operator class must also provide a function to check if an indexed item
258  matches the query. It comes in two flavors, a Boolean <function>consistent</function>
259  function, and a ternary <function>triConsistent</function> function.
260  <function>triConsistent</function> covers the functionality of both, so providing
261  <function>triConsistent</function> alone is sufficient. However, if the Boolean
262  variant is significantly cheaper to calculate, it can be advantageous to
263  provide both.  If only the Boolean variant is provided, some optimizations
264  that depend on refuting index items before fetching all the keys are
265  disabled.
266
267  <variablelist>
268    <varlistentry>
269     <term><function>bool consistent(bool check[], StrategyNumber n, Datum query,
270        int32 nkeys, Pointer extra_data[], bool *recheck,
271        Datum queryKeys[], bool nullFlags[])</function></term>
272     <listitem>
273      <para>
274       Returns true if an indexed item satisfies the query operator with
275       strategy number <literal>n</literal> (or might satisfy it, if the recheck
276       indication is returned).  This function does not have direct access
277       to the indexed item's value, since <acronym>GIN</acronym> does not
278       store items explicitly.  Rather, what is available is knowledge
279       about which key values extracted from the query appear in a given
280       indexed item.  The <literal>check</literal> array has length
281       <literal>nkeys</literal>, which is the same as the number of keys previously
282       returned by <function>extractQuery</function> for this <literal>query</literal> datum.
283       Each element of the
284       <literal>check</literal> array is true if the indexed item contains the
285       corresponding query key, i.e., if (check[i] == true) the i-th key of the
286       <function>extractQuery</function> result array is present in the indexed item.
287       The original <literal>query</literal> datum is
288       passed in case the <function>consistent</function> method needs to consult it,
289       and so are the <literal>queryKeys[]</literal> and <literal>nullFlags[]</literal>
290       arrays previously returned by <function>extractQuery</function>.
291       <literal>extra_data</literal> is the extra-data array returned by
292       <function>extractQuery</function>, or <symbol>NULL</symbol> if none.
293      </para>
294
295      <para>
296       When <function>extractQuery</function> returns a null key in
297       <literal>queryKeys[]</literal>, the corresponding <literal>check[]</literal> element
298       is true if the indexed item contains a null key; that is, the
299       semantics of <literal>check[]</literal> are like <literal>IS NOT DISTINCT
300       FROM</literal>.  The <function>consistent</function> function can examine the
301       corresponding <literal>nullFlags[]</literal> element if it needs to tell
302       the difference between a regular value match and a null match.
303      </para>
304
305      <para>
306       On success, <literal>*recheck</literal> should be set to true if the heap
307       tuple needs to be rechecked against the query operator, or false if
308       the index test is exact.  That is, a false return value guarantees
309       that the heap tuple does not match the query; a true return value with
310       <literal>*recheck</literal> set to false guarantees that the heap tuple does
311       match the query; and a true return value with
312       <literal>*recheck</literal> set to true means that the heap tuple might match
313       the query, so it needs to be fetched and rechecked by evaluating the
314       query operator directly against the originally indexed item.
315      </para>
316     </listitem>
317    </varlistentry>
318
319    <varlistentry>
320     <term><function>GinTernaryValue triConsistent(GinTernaryValue check[], StrategyNumber n, Datum query,
321        int32 nkeys, Pointer extra_data[],
322        Datum queryKeys[], bool nullFlags[])</function></term>
323     <listitem>
324      <para>
325       <function>triConsistent</function> is similar to <function>consistent</function>,
326       but instead of Booleans in the <literal>check</literal> vector, there are
327       three possible values for each
328       key: <literal>GIN_TRUE</literal>, <literal>GIN_FALSE</literal> and
329       <literal>GIN_MAYBE</literal>. <literal>GIN_FALSE</literal> and <literal>GIN_TRUE</literal>
330       have the same meaning as regular Boolean values, while
331       <literal>GIN_MAYBE</literal> means that the presence of that key is not known.
332       When <literal>GIN_MAYBE</literal> values are present, the function should only
333       return <literal>GIN_TRUE</literal> if the item certainly matches whether or
334       not the index item contains the corresponding query keys. Likewise, the
335       function must return <literal>GIN_FALSE</literal> only if the item certainly
336       does not match, whether or not it contains the <literal>GIN_MAYBE</literal>
337       keys. If the result depends on the <literal>GIN_MAYBE</literal> entries, i.e.,
338       the match cannot be confirmed or refuted based on the known query keys,
339       the function must return <literal>GIN_MAYBE</literal>.
340      </para>
341      <para>
342       When there are no <literal>GIN_MAYBE</literal> values in the <literal>check</literal>
343       vector, a <literal>GIN_MAYBE</literal> return value is the equivalent of
344       setting the <literal>recheck</literal> flag in the
345       Boolean <function>consistent</function> function.
346      </para>
347     </listitem>
348    </varlistentry>
349  </variablelist>
350 </para>
351
352 <para>
353  In addition, GIN must have a way to sort the key values stored in the index.
354  The operator class can define the sort ordering by specifying a comparison
355  method:
356
357  <variablelist>
358    <varlistentry>
359     <term><function>int compare(Datum a, Datum b)</function></term>
360     <listitem>
361      <para>
362       Compares two keys (not indexed items!) and returns an integer less than
363       zero, zero, or greater than zero, indicating whether the first key is
364       less than, equal to, or greater than the second.  Null keys are never
365       passed to this function.
366      </para>
367     </listitem>
368    </varlistentry>
369  </variablelist>
370
371  Alternatively, if the operator class does not provide a <function>compare</function>
372  method, GIN will look up the default btree operator class for the index
373  key data type, and use its comparison function.  It is recommended to
374  specify the comparison function in a GIN operator class that is meant for
375  just one data type, as looking up the btree operator class costs a few
376  cycles.  However, polymorphic GIN operator classes (such
377  as <literal>array_ops</literal>) typically cannot specify a single comparison
378  function.
379 </para>
380
381 <para>
382  An operator class for <acronym>GIN</acronym> can optionally supply the
383  following methods:
384
385  <variablelist>
386    <varlistentry>
387     <term><function>int comparePartial(Datum partial_key, Datum key, StrategyNumber n,
388                              Pointer extra_data)</function></term>
389     <listitem>
390      <para>
391       Compare a partial-match query key to an index key.  Returns an integer
392       whose sign indicates the result: less than zero means the index key
393       does not match the query, but the index scan should continue; zero
394       means that the index key does match the query; greater than zero
395       indicates that the index scan should stop because no more matches
396       are possible.  The strategy number <literal>n</literal> of the operator
397       that generated the partial match query is provided, in case its
398       semantics are needed to determine when to end the scan.  Also,
399       <literal>extra_data</literal> is the corresponding element of the extra-data
400       array made by <function>extractQuery</function>, or <symbol>NULL</symbol> if none.
401       Null keys are never passed to this function.
402      </para>
403     </listitem>
404    </varlistentry>
405    <varlistentry>
406     <term><function>void options(local_relopts *relopts)</function></term>
407     <listitem>
408      <para>
409       Defines a set of user-visible parameters that control operator class
410       behavior.
411      </para>
412
413      <para>
414       The <function>options</function> function is passed a pointer to a
415       <replaceable>local_relopts</replaceable> struct, which needs to be
416       filled with a set of operator class specific options.  The options
417       can be accessed from other support functions using the
418       <literal>PG_HAS_OPCLASS_OPTIONS()</literal> and
419       <literal>PG_GET_OPCLASS_OPTIONS()</literal> macros.
420      </para>
421
422      <para>
423       Since both key extraction of indexed values and representation of the
424       key in <acronym>GIN</acronym> are flexible, they may depend on
425       user-specified parameters.
426      </para>
427     </listitem>
428    </varlistentry>
429  </variablelist>
430 </para>
431
432 <para>
433  To support <quote>partial match</quote> queries, an operator class must
434  provide the <function>comparePartial</function> method, and its
435  <function>extractQuery</function> method must set the <literal>pmatch</literal>
436  parameter when a partial-match query is encountered.  See
437  <xref linkend="gin-partial-match"/> for details.
438 </para>
439
440 <para>
441  The actual data types of the various <literal>Datum</literal> values mentioned
442  above vary depending on the operator class.  The item values passed to
443  <function>extractValue</function> are always of the operator class's input type, and
444  all key values must be of the class's <literal>STORAGE</literal> type.  The type of
445  the <literal>query</literal> argument passed to <function>extractQuery</function>,
446  <function>consistent</function> and <function>triConsistent</function> is whatever is the
447  right-hand input type of the class member operator identified by the
448  strategy number.  This need not be the same as the indexed type, so long as
449  key values of the correct type can be extracted from it.  However, it is
450  recommended that the SQL declarations of these three support functions use
451  the opclass's indexed data type for the <literal>query</literal> argument, even
452  though the actual type might be something else depending on the operator.
453 </para>
454
455</sect1>
456
457<sect1 id="gin-implementation">
458 <title>Implementation</title>
459
460 <para>
461  Internally, a <acronym>GIN</acronym> index contains a B-tree index
462  constructed over keys, where each key is an element of one or more indexed
463  items (a member of an array, for example) and where each tuple in a leaf
464  page contains either a pointer to a B-tree of heap pointers (a
465  <quote>posting tree</quote>), or a simple list of heap pointers (a <quote>posting
466  list</quote>) when the list is small enough to fit into a single index tuple along
467  with the key value.  <xref linkend="gin-internals-figure"/> illustrates
468  these components of a GIN index.
469 </para>
470
471 <para>
472  As of <productname>PostgreSQL</productname> 9.1, null key values can be
473  included in the index.  Also, placeholder nulls are included in the index
474  for indexed items that are null or contain no keys according to
475  <function>extractValue</function>.  This allows searches that should find empty
476  items to do so.
477 </para>
478
479 <para>
480  Multicolumn <acronym>GIN</acronym> indexes are implemented by building
481  a single B-tree over composite values (column number, key value).  The
482  key values for different columns can be of different types.
483 </para>
484
485 <figure id="gin-internals-figure">
486  <title>GIN Internals</title>
487  <mediaobject>
488   <imageobject>
489    <imagedata fileref="images/gin.svg" format="SVG" width="100%"/>
490   </imageobject>
491  </mediaobject>
492 </figure>
493
494 <sect2 id="gin-fast-update">
495  <title>GIN Fast Update Technique</title>
496
497  <para>
498   Updating a <acronym>GIN</acronym> index tends to be slow because of the
499   intrinsic nature of inverted indexes: inserting or updating one heap row
500   can cause many inserts into the index (one for each key extracted
501   from the indexed item). As of <productname>PostgreSQL</productname> 8.4,
502   <acronym>GIN</acronym> is capable of postponing much of this work by inserting
503   new tuples into a temporary, unsorted list of pending entries.
504   When the table is vacuumed or autoanalyzed, or when
505   <function>gin_clean_pending_list</function> function is called, or if the
506   pending list becomes larger than
507   <xref linkend="guc-gin-pending-list-limit"/>, the entries are moved to the
508   main <acronym>GIN</acronym> data structure using the same bulk insert
509   techniques used during initial index creation.  This greatly improves
510   <acronym>GIN</acronym> index update speed, even counting the additional
511   vacuum overhead.  Moreover the overhead work can be done by a background
512   process instead of in foreground query processing.
513  </para>
514
515  <para>
516   The main disadvantage of this approach is that searches must scan the list
517   of pending entries in addition to searching the regular index, and so
518   a large list of pending entries will slow searches significantly.
519   Another disadvantage is that, while most updates are fast, an update
520   that causes the pending list to become <quote>too large</quote> will incur an
521   immediate cleanup cycle and thus be much slower than other updates.
522   Proper use of autovacuum can minimize both of these problems.
523  </para>
524
525  <para>
526   If consistent response time is more important than update speed,
527   use of pending entries can be disabled by turning off the
528   <literal>fastupdate</literal> storage parameter for a
529   <acronym>GIN</acronym> index.  See <xref linkend="sql-createindex"/>
530   for details.
531  </para>
532 </sect2>
533
534 <sect2 id="gin-partial-match">
535  <title>Partial Match Algorithm</title>
536
537  <para>
538   GIN can support <quote>partial match</quote> queries, in which the query
539   does not determine an exact match for one or more keys, but the possible
540   matches fall within a reasonably narrow range of key values (within the
541   key sorting order determined by the <function>compare</function> support method).
542   The <function>extractQuery</function> method, instead of returning a key value
543   to be matched exactly, returns a key value that is the lower bound of
544   the range to be searched, and sets the <literal>pmatch</literal> flag true.
545   The key range is then scanned using the <function>comparePartial</function>
546   method.  <function>comparePartial</function> must return zero for a matching
547   index key, less than zero for a non-match that is still within the range
548   to be searched, or greater than zero if the index key is past the range
549   that could match.
550  </para>
551 </sect2>
552
553</sect1>
554
555<sect1 id="gin-tips">
556<title>GIN Tips and Tricks</title>
557
558 <variablelist>
559  <varlistentry>
560   <term>Create vs. insert</term>
561   <listitem>
562    <para>
563     Insertion into a <acronym>GIN</acronym> index can be slow
564     due to the likelihood of many keys being inserted for each item.
565     So, for bulk insertions into a table it is advisable to drop the GIN
566     index and recreate it after finishing bulk insertion.
567    </para>
568
569    <para>
570     As of <productname>PostgreSQL</productname> 8.4, this advice is less
571     necessary since delayed indexing is used (see <xref
572     linkend="gin-fast-update"/> for details).  But for very large updates
573     it may still be best to drop and recreate the index.
574    </para>
575   </listitem>
576  </varlistentry>
577
578  <varlistentry>
579   <term><xref linkend="guc-maintenance-work-mem"/></term>
580   <listitem>
581    <para>
582     Build time for a <acronym>GIN</acronym> index is very sensitive to
583     the <varname>maintenance_work_mem</varname> setting; it doesn't pay to
584     skimp on work memory during index creation.
585    </para>
586   </listitem>
587  </varlistentry>
588
589  <varlistentry>
590   <term><xref linkend="guc-gin-pending-list-limit"/></term>
591   <listitem>
592    <para>
593     During a series of insertions into an existing <acronym>GIN</acronym>
594     index that has <literal>fastupdate</literal> enabled, the system will clean up
595     the pending-entry list whenever the list grows larger than
596     <varname>gin_pending_list_limit</varname>. To avoid fluctuations in observed
597     response time, it's desirable to have pending-list cleanup occur in the
598     background (i.e., via autovacuum).  Foreground cleanup operations
599     can be avoided by increasing <varname>gin_pending_list_limit</varname>
600     or making autovacuum more aggressive.
601     However, enlarging the threshold of the cleanup operation means that
602     if a foreground cleanup does occur, it will take even longer.
603    </para>
604    <para>
605     <varname>gin_pending_list_limit</varname> can be overridden for individual
606     GIN indexes by changing storage parameters, which allows each
607     GIN index to have its own cleanup threshold.
608     For example, it's possible to increase the threshold only for the GIN
609     index which can be updated heavily, and decrease it otherwise.
610    </para>
611   </listitem>
612  </varlistentry>
613
614  <varlistentry>
615   <term><xref linkend="guc-gin-fuzzy-search-limit"/></term>
616   <listitem>
617    <para>
618     The primary goal of developing <acronym>GIN</acronym> indexes was
619     to create support for highly scalable full-text search in
620     <productname>PostgreSQL</productname>, and there are often situations when
621     a full-text search returns a very large set of results.  Moreover, this
622     often happens when the query contains very frequent words, so that the
623     large result set is not even useful.  Since reading many
624     tuples from the disk and sorting them could take a lot of time, this is
625     unacceptable for production.  (Note that the index search itself is very
626     fast.)
627    </para>
628    <para>
629     To facilitate controlled execution of such queries,
630     <acronym>GIN</acronym> has a configurable soft upper limit on the
631     number of rows returned: the
632     <varname>gin_fuzzy_search_limit</varname> configuration parameter.
633     It is set to 0 (meaning no limit) by default.
634     If a non-zero limit is set, then the returned set is a subset of
635     the whole result set, chosen at random.
636    </para>
637    <para>
638     <quote>Soft</quote> means that the actual number of returned results
639     could differ somewhat from the specified limit, depending on the query
640     and the quality of the system's random number generator.
641    </para>
642    <para>
643     From experience, values in the thousands (e.g., 5000 &mdash; 20000)
644     work well.
645    </para>
646   </listitem>
647  </varlistentry>
648 </variablelist>
649
650</sect1>
651
652<sect1 id="gin-limit">
653 <title>Limitations</title>
654
655 <para>
656  <acronym>GIN</acronym> assumes that indexable operators are strict.  This
657  means that <function>extractValue</function> will not be called at all on a null
658  item value (instead, a placeholder index entry is created automatically),
659  and <function>extractQuery</function> will not be called on a null query
660  value either (instead, the query is presumed to be unsatisfiable).  Note
661  however that null key values contained within a non-null composite item
662  or query value are supported.
663 </para>
664</sect1>
665
666<sect1 id="gin-examples">
667 <title>Examples</title>
668
669 <para>
670  The core <productname>PostgreSQL</productname> distribution
671  includes the <acronym>GIN</acronym> operator classes previously shown in
672  <xref linkend="gin-builtin-opclasses-table"/>.
673  The following <filename>contrib</filename> modules also contain
674  <acronym>GIN</acronym> operator classes:
675
676 <variablelist>
677  <varlistentry>
678   <term><filename>btree_gin</filename></term>
679   <listitem>
680    <para>B-tree equivalent functionality for several data types</para>
681   </listitem>
682  </varlistentry>
683
684  <varlistentry>
685   <term><filename>hstore</filename></term>
686   <listitem>
687    <para>Module for storing (key, value) pairs</para>
688   </listitem>
689  </varlistentry>
690
691  <varlistentry>
692   <term><filename>intarray</filename></term>
693   <listitem>
694    <para>Enhanced support for <type>int[]</type></para>
695   </listitem>
696  </varlistentry>
697
698  <varlistentry>
699   <term><filename>pg_trgm</filename></term>
700   <listitem>
701    <para>Text similarity using trigram matching</para>
702   </listitem>
703  </varlistentry>
704 </variablelist>
705 </para>
706</sect1>
707
708</chapter>
709