1<!-- doc/src/sgml/pgtrgm.sgml -->
2
3<sect1 id="pgtrgm" xreflabel="pg_trgm">
4 <title>pg_trgm</title>
5
6 <indexterm zone="pgtrgm">
7  <primary>pg_trgm</primary>
8 </indexterm>
9
10 <para>
11  The <filename>pg_trgm</filename> module provides functions and operators
12  for determining the similarity of
13  alphanumeric text based on trigram matching, as
14  well as index operator classes that support fast searching for similar
15  strings.
16 </para>
17
18 <para>
19  This module is considered <quote>trusted</quote>, that is, it can be
20  installed by non-superusers who have <literal>CREATE</literal> privilege
21  on the current database.
22 </para>
23
24 <sect2>
25  <title>Trigram (or Trigraph) Concepts</title>
26
27  <para>
28   A trigram is a group of three consecutive characters taken
29   from a string.  We can measure the similarity of two strings by
30   counting the number of trigrams they share.  This simple idea
31   turns out to be very effective for measuring the similarity of
32   words in many natural languages.
33  </para>
34
35  <note>
36   <para>
37    <filename>pg_trgm</filename> ignores non-word characters
38    (non-alphanumerics) when extracting trigrams from a string.
39    Each word is considered to have two spaces
40    prefixed and one space suffixed when determining the set
41    of trigrams contained in the string.
42    For example, the set of trigrams in the string
43    <quote><literal>cat</literal></quote> is
44    <quote><literal>  c</literal></quote>,
45    <quote><literal> ca</literal></quote>,
46    <quote><literal>cat</literal></quote>, and
47    <quote><literal>at </literal></quote>.
48    The set of trigrams in the string
49    <quote><literal>foo|bar</literal></quote> is
50    <quote><literal>  f</literal></quote>,
51    <quote><literal> fo</literal></quote>,
52    <quote><literal>foo</literal></quote>,
53    <quote><literal>oo </literal></quote>,
54    <quote><literal>  b</literal></quote>,
55    <quote><literal> ba</literal></quote>,
56    <quote><literal>bar</literal></quote>, and
57    <quote><literal>ar </literal></quote>.
58   </para>
59  </note>
60 </sect2>
61
62 <sect2>
63  <title>Functions and Operators</title>
64
65  <para>
66   The functions provided by the <filename>pg_trgm</filename> module
67   are shown in <xref linkend="pgtrgm-func-table"/>, the operators
68   in <xref linkend="pgtrgm-op-table"/>.
69  </para>
70
71  <table id="pgtrgm-func-table">
72   <title><filename>pg_trgm</filename> Functions</title>
73    <tgroup cols="1">
74     <thead>
75      <row>
76       <entry role="func_table_entry"><para role="func_signature">
77        Function
78       </para>
79       <para>
80        Description
81       </para></entry>
82      </row>
83     </thead>
84
85     <tbody>
86      <row>
87       <entry role="func_table_entry"><para role="func_signature">
88        <indexterm><primary>similarity</primary></indexterm>
89        <function>similarity</function> ( <type>text</type>, <type>text</type> )
90        <returnvalue>real</returnvalue>
91       </para>
92       <para>
93        Returns a number that indicates how similar the two arguments are.
94        The range of the result is zero (indicating that the two strings are
95        completely dissimilar) to one (indicating that the two strings are
96        identical).
97       </para></entry>
98      </row>
99
100      <row>
101       <entry role="func_table_entry"><para role="func_signature">
102        <indexterm><primary>show_trgm</primary></indexterm>
103        <function>show_trgm</function> ( <type>text</type> )
104        <returnvalue>text[]</returnvalue>
105       </para>
106       <para>
107        Returns an array of all the trigrams in the given string.
108        (In practice this is seldom useful except for debugging.)
109       </para></entry>
110      </row>
111
112      <row>
113       <entry role="func_table_entry"><para role="func_signature">
114        <indexterm><primary>word_similarity</primary></indexterm>
115        <function>word_similarity</function> ( <type>text</type>, <type>text</type> )
116        <returnvalue>real</returnvalue>
117       </para>
118       <para>
119        Returns a number that indicates the greatest similarity between
120        the set of trigrams in the first string and any continuous extent
121        of an ordered set of trigrams in the second string.  For details, see
122        the explanation below.
123       </para></entry>
124      </row>
125
126      <row>
127       <entry role="func_table_entry"><para role="func_signature">
128        <indexterm><primary>strict_word_similarity</primary></indexterm>
129        <function>strict_word_similarity</function> ( <type>text</type>, <type>text</type> )
130        <returnvalue>real</returnvalue>
131       </para>
132       <para>
133        Same as <function>word_similarity</function>, but forces
134        extent boundaries to match word boundaries.  Since we don't have
135        cross-word trigrams, this function actually returns greatest similarity
136        between first string and any continuous extent of words of the second
137        string.
138       </para></entry>
139      </row>
140
141      <row>
142       <entry role="func_table_entry"><para role="func_signature">
143        <indexterm><primary>show_limit</primary></indexterm>
144        <function>show_limit</function> ()
145        <returnvalue>real</returnvalue>
146       </para>
147       <para>
148        Returns the current similarity threshold used by the <literal>%</literal>
149        operator.  This sets the minimum similarity between
150        two words for them to be considered similar enough to
151        be misspellings of each other, for example.
152        (<emphasis>Deprecated</emphasis>; instead use <command>SHOW</command>
153        <varname>pg_trgm.similarity_threshold</varname>.)
154       </para></entry>
155      </row>
156
157      <row>
158       <entry role="func_table_entry"><para role="func_signature">
159        <indexterm><primary>set_limit</primary></indexterm>
160        <function>set_limit</function> ( <type>real</type> )
161        <returnvalue>real</returnvalue>
162       </para>
163       <para>
164        Sets the current similarity threshold that is used by the <literal>%</literal>
165        operator.  The threshold must be between 0 and 1 (default is 0.3).
166        Returns the same value passed in.
167        (<emphasis>Deprecated</emphasis>; instead use <command>SET</command>
168        <varname>pg_trgm.similarity_threshold</varname>.)
169       </para></entry>
170      </row>
171     </tbody>
172    </tgroup>
173  </table>
174
175  <para>
176   Consider the following example:
177
178<programlisting>
179# SELECT word_similarity('word', 'two words');
180 word_similarity
181-----------------
182             0.8
183(1 row)
184</programlisting>
185
186   In the first string, the set of trigrams is
187   <literal>{"  w"," wo","wor","ord","rd "}</literal>.
188   In the second string, the ordered set of trigrams is
189   <literal>{"  t"," tw","two","wo ","  w"," wo","wor","ord","rds","ds "}</literal>.
190   The most similar extent of an ordered set of trigrams in the second string
191   is <literal>{"  w"," wo","wor","ord"}</literal>, and the similarity is
192   <literal>0.8</literal>.
193  </para>
194
195  <para>
196   This function returns a value that can be approximately understood as the
197   greatest similarity between the first string and any substring of the second
198   string.  However, this function does not add padding to the boundaries of
199   the extent.  Thus, the number of additional characters present in the
200   second string is not considered, except for the mismatched word boundaries.
201  </para>
202
203  <para>
204   At the same time, <function>strict_word_similarity</function>
205   selects an extent of words in the second string.  In the example above,
206   <function>strict_word_similarity</function> would select the
207   extent of a single word <literal>'words'</literal>, whose set of trigrams is
208   <literal>{"  w"," wo","wor","ord","rds","ds "}</literal>.
209
210<programlisting>
211# SELECT strict_word_similarity('word', 'two words'), similarity('word', 'words');
212 strict_word_similarity | similarity
213------------------------+------------
214               0.571429 |   0.571429
215(1 row)
216</programlisting>
217  </para>
218
219  <para>
220   Thus, the <function>strict_word_similarity</function> function
221   is useful for finding the similarity to whole words, while
222   <function>word_similarity</function> is more suitable for
223   finding the similarity for parts of words.
224  </para>
225
226  <table id="pgtrgm-op-table">
227   <title><filename>pg_trgm</filename> Operators</title>
228    <tgroup cols="1">
229     <thead>
230      <row>
231       <entry role="func_table_entry"><para role="func_signature">
232        Operator
233       </para>
234       <para>
235        Description
236       </para></entry>
237      </row>
238     </thead>
239
240     <tbody>
241      <row>
242       <entry role="func_table_entry"><para role="func_signature">
243        <type>text</type> <literal>%</literal> <type>text</type>
244        <returnvalue>boolean</returnvalue>
245       </para>
246       <para>
247        Returns <literal>true</literal> if its arguments have a similarity
248        that is greater than the current similarity threshold set by
249        <varname>pg_trgm.similarity_threshold</varname>.
250       </para></entry>
251      </row>
252
253      <row>
254       <entry role="func_table_entry"><para role="func_signature">
255        <type>text</type> <literal>&lt;%</literal> <type>text</type>
256        <returnvalue>boolean</returnvalue>
257       </para>
258       <para>
259        Returns <literal>true</literal> if the similarity between the trigram
260        set in the first argument and a continuous extent of an ordered trigram
261        set in the second argument is greater than the current word similarity
262        threshold set by <varname>pg_trgm.word_similarity_threshold</varname>
263        parameter.
264       </para></entry>
265      </row>
266
267      <row>
268       <entry role="func_table_entry"><para role="func_signature">
269        <type>text</type> <literal>%&gt;</literal> <type>text</type>
270        <returnvalue>boolean</returnvalue>
271       </para>
272       <para>
273        Commutator of the <literal>&lt;%</literal> operator.
274       </para></entry>
275      </row>
276
277      <row>
278       <entry role="func_table_entry"><para role="func_signature">
279        <type>text</type> <literal>&lt;&lt;%</literal> <type>text</type>
280        <returnvalue>boolean</returnvalue>
281       </para>
282       <para>
283        Returns <literal>true</literal> if its second argument has a continuous
284        extent of an ordered trigram set that matches word boundaries,
285        and its similarity to the trigram set of the first argument is greater
286        than the current strict word similarity threshold set by the
287        <varname>pg_trgm.strict_word_similarity_threshold</varname> parameter.
288       </para></entry>
289      </row>
290
291      <row>
292       <entry role="func_table_entry"><para role="func_signature">
293        <type>text</type> <literal>%&gt;&gt;</literal> <type>text</type>
294        <returnvalue>boolean</returnvalue>
295       </para>
296       <para>
297        Commutator of the <literal>&lt;&lt;%</literal> operator.
298       </para></entry>
299      </row>
300
301      <row>
302       <entry role="func_table_entry"><para role="func_signature">
303        <type>text</type> <literal>&lt;-&gt;</literal> <type>text</type>
304        <returnvalue>real</returnvalue>
305       </para>
306       <para>
307        Returns the <quote>distance</quote> between the arguments, that is
308        one minus the <function>similarity()</function> value.
309       </para></entry>
310      </row>
311
312      <row>
313       <entry role="func_table_entry"><para role="func_signature">
314        <type>text</type> <literal>&lt;&lt;-&gt;</literal> <type>text</type>
315        <returnvalue>real</returnvalue>
316       </para>
317       <para>
318        Returns the <quote>distance</quote> between the arguments, that is
319        one minus the <function>word_similarity()</function> value.
320       </para></entry>
321      </row>
322
323      <row>
324       <entry role="func_table_entry"><para role="func_signature">
325        <type>text</type> <literal>&lt;-&gt;&gt;</literal> <type>text</type>
326        <returnvalue>real</returnvalue>
327       </para>
328       <para>
329        Commutator of the <literal>&lt;&lt;-&gt;</literal> operator.
330       </para></entry>
331      </row>
332
333      <row>
334       <entry role="func_table_entry"><para role="func_signature">
335        <type>text</type> <literal>&lt;&lt;&lt;-&gt;</literal> <type>text</type>
336        <returnvalue>real</returnvalue>
337       </para>
338       <para>
339        Returns the <quote>distance</quote> between the arguments, that is
340        one minus the <function>strict_word_similarity()</function> value.
341       </para></entry>
342      </row>
343
344      <row>
345       <entry role="func_table_entry"><para role="func_signature">
346        <type>text</type> <literal>&lt;-&gt;&gt;&gt;</literal> <type>text</type>
347        <returnvalue>real</returnvalue>
348       </para>
349       <para>
350        Commutator of the <literal>&lt;&lt;&lt;-&gt;</literal> operator.
351       </para></entry>
352      </row>
353     </tbody>
354    </tgroup>
355  </table>
356 </sect2>
357
358 <sect2>
359  <title>GUC Parameters</title>
360
361  <variablelist>
362   <varlistentry id="guc-pgtrgm-similarity-threshold" xreflabel="pg_trgm.similarity_threshold">
363    <term>
364     <varname>pg_trgm.similarity_threshold</varname> (<type>real</type>)
365     <indexterm>
366      <primary><varname>pg_trgm.similarity_threshold</varname> configuration parameter</primary>
367     </indexterm>
368    </term>
369    <listitem>
370     <para>
371      Sets the current similarity threshold that is used by the <literal>%</literal>
372      operator.  The threshold must be between 0 and 1 (default is 0.3).
373     </para>
374    </listitem>
375   </varlistentry>
376   <varlistentry id="guc-pgtrgm-word-similarity-threshold" xreflabel="pg_trgm.word_similarity_threshold">
377    <term>
378     <varname>pg_trgm.word_similarity_threshold</varname> (<type>real</type>)
379     <indexterm>
380      <primary>
381       <varname>pg_trgm.word_similarity_threshold</varname> configuration parameter
382      </primary>
383     </indexterm>
384    </term>
385    <listitem>
386     <para>
387      Sets the current word similarity threshold that is used by the
388      <literal>&lt;%</literal> and <literal>%&gt;</literal> operators.  The threshold
389      must be between 0 and 1 (default is 0.6).
390     </para>
391    </listitem>
392   </varlistentry>
393   <varlistentry id="guc-pgtrgm-strict-word-similarity-threshold" xreflabel="pg_trgm.strict_word_similarity_threshold">
394    <term>
395     <varname>pg_trgm.strict_word_similarity_threshold</varname> (<type>real</type>)
396     <indexterm>
397      <primary>
398       <varname>pg_trgm.strict_word_similarity_threshold</varname> configuration parameter
399      </primary>
400     </indexterm>
401    </term>
402    <listitem>
403     <para>
404      Sets the current strict word similarity threshold that is used by the
405      <literal>&lt;&lt;%</literal> and <literal>%&gt;&gt;</literal> operators.  The threshold
406      must be between 0 and 1 (default is 0.5).
407     </para>
408    </listitem>
409   </varlistentry>
410  </variablelist>
411 </sect2>
412
413 <sect2>
414  <title>Index Support</title>
415
416  <para>
417   The <filename>pg_trgm</filename> module provides GiST and GIN index
418   operator classes that allow you to create an index over a text column for
419   the purpose of very fast similarity searches.  These index types support
420   the above-described similarity operators, and additionally support
421   trigram-based index searches for <literal>LIKE</literal>, <literal>ILIKE</literal>,
422   <literal>~</literal> and <literal>~*</literal> queries.  (These indexes do not
423   support equality nor simple comparison operators, so you may need a
424   regular B-tree index too.)
425  </para>
426
427  <para>
428   Example:
429
430<programlisting>
431CREATE TABLE test_trgm (t text);
432CREATE INDEX trgm_idx ON test_trgm USING GIST (t gist_trgm_ops);
433</programlisting>
434or
435<programlisting>
436CREATE INDEX trgm_idx ON test_trgm USING GIN (t gin_trgm_ops);
437</programlisting>
438  </para>
439
440  <para>
441   <literal>gist_trgm_ops</literal> GiST opclass approximates a set of
442   trigrams as a bitmap signature.  Its optional integer parameter
443   <literal>siglen</literal> determines the
444   signature length in bytes.  The default length is 12 bytes.
445   Valid values of signature length are between 1 and 2024 bytes.  Longer
446   signatures lead to a more precise search (scanning a smaller fraction of the index and
447   fewer heap pages), at the cost of a larger index.
448  </para>
449
450  <para>
451   Example of creating such an index with a signature length of 32 bytes:
452  </para>
453<programlisting>
454CREATE INDEX trgm_idx ON test_trgm USING GIST (t gist_trgm_ops(siglen=32));
455</programlisting>
456
457  <para>
458   At this point, you will have an index on the <structfield>t</structfield> column that
459   you can use for similarity searching.  A typical query is
460<programlisting>
461SELECT t, similarity(t, '<replaceable>word</replaceable>') AS sml
462  FROM test_trgm
463  WHERE t % '<replaceable>word</replaceable>'
464  ORDER BY sml DESC, t;
465</programlisting>
466   This will return all values in the text column that are sufficiently
467   similar to <replaceable>word</replaceable>, sorted from best match to worst.  The
468   index will be used to make this a fast operation even over very large data
469   sets.
470  </para>
471
472  <para>
473   A variant of the above query is
474<programlisting>
475SELECT t, t &lt;-&gt; '<replaceable>word</replaceable>' AS dist
476  FROM test_trgm
477  ORDER BY dist LIMIT 10;
478</programlisting>
479   This can be implemented quite efficiently by GiST indexes, but not
480   by GIN indexes.  It will usually beat the first formulation when only
481   a small number of the closest matches is wanted.
482  </para>
483
484  <para>
485   Also you can use an index on the <structfield>t</structfield> column for word
486   similarity or strict word similarity.  Typical queries are:
487<programlisting>
488SELECT t, word_similarity('<replaceable>word</replaceable>', t) AS sml
489  FROM test_trgm
490  WHERE '<replaceable>word</replaceable>' &lt;% t
491  ORDER BY sml DESC, t;
492</programlisting>
493   and
494<programlisting>
495SELECT t, strict_word_similarity('<replaceable>word</replaceable>', t) AS sml
496  FROM test_trgm
497  WHERE '<replaceable>word</replaceable>' &lt;&lt;% t
498  ORDER BY sml DESC, t;
499</programlisting>
500   This will return all values in the text column for which there is a
501   continuous extent in the corresponding ordered trigram set that is
502   sufficiently similar to the trigram set of <replaceable>word</replaceable>,
503   sorted from best match to worst.  The index will be used to make this
504   a fast operation even over very large data sets.
505  </para>
506
507  <para>
508   Possible variants of the above queries are:
509<programlisting>
510SELECT t, '<replaceable>word</replaceable>' &lt;&lt;-&gt; t AS dist
511  FROM test_trgm
512  ORDER BY dist LIMIT 10;
513</programlisting>
514   and
515<programlisting>
516SELECT t, '<replaceable>word</replaceable>' &lt;&lt;&lt;-&gt; t AS dist
517  FROM test_trgm
518  ORDER BY dist LIMIT 10;
519</programlisting>
520   This can be implemented quite efficiently by GiST indexes, but not
521   by GIN indexes.
522  </para>
523
524
525  <para>
526   Beginning in <productname>PostgreSQL</productname> 9.1, these index types also support
527   index searches for <literal>LIKE</literal> and <literal>ILIKE</literal>, for example
528<programlisting>
529SELECT * FROM test_trgm WHERE t LIKE '%foo%bar';
530</programlisting>
531   The index search works by extracting trigrams from the search string
532   and then looking these up in the index.  The more trigrams in the search
533   string, the more effective the index search is.  Unlike B-tree based
534   searches, the search string need not be left-anchored.
535  </para>
536
537  <para>
538   Beginning in <productname>PostgreSQL</productname> 9.3, these index types also support
539   index searches for regular-expression matches
540   (<literal>~</literal> and <literal>~*</literal> operators), for example
541<programlisting>
542SELECT * FROM test_trgm WHERE t ~ '(foo|bar)';
543</programlisting>
544   The index search works by extracting trigrams from the regular expression
545   and then looking these up in the index.  The more trigrams that can be
546   extracted from the regular expression, the more effective the index search
547   is.  Unlike B-tree based searches, the search string need not be
548   left-anchored.
549  </para>
550
551  <para>
552   For both <literal>LIKE</literal> and regular-expression searches, keep in mind
553   that a pattern with no extractable trigrams will degenerate to a full-index
554   scan.
555  </para>
556
557  <para>
558   The choice between GiST and GIN indexing depends on the relative
559   performance characteristics of GiST and GIN, which are discussed elsewhere.
560  </para>
561 </sect2>
562
563 <sect2>
564  <title>Text Search Integration</title>
565
566  <para>
567   Trigram matching is a very useful tool when used in conjunction
568   with a full text index.  In particular it can help to recognize
569   misspelled input words that will not be matched directly by the
570   full text search mechanism.
571  </para>
572
573  <para>
574   The first step is to generate an auxiliary table containing all
575   the unique words in the documents:
576
577<programlisting>
578CREATE TABLE words AS SELECT word FROM
579        ts_stat('SELECT to_tsvector(''simple'', bodytext) FROM documents');
580</programlisting>
581
582   where <structname>documents</structname> is a table that has a text field
583   <structfield>bodytext</structfield> that we wish to search.  The reason for using
584   the <literal>simple</literal> configuration with the <function>to_tsvector</function>
585   function, instead of using a language-specific configuration,
586   is that we want a list of the original (unstemmed) words.
587  </para>
588
589  <para>
590   Next, create a trigram index on the word column:
591
592<programlisting>
593CREATE INDEX words_idx ON words USING GIN (word gin_trgm_ops);
594</programlisting>
595
596   Now, a <command>SELECT</command> query similar to the previous example can
597   be used to suggest spellings for misspelled words in user search terms.
598   A useful extra test is to require that the selected words are also of
599   similar length to the misspelled word.
600  </para>
601
602  <note>
603   <para>
604    Since the <structname>words</structname> table has been generated as a separate,
605    static table, it will need to be periodically regenerated so that
606    it remains reasonably up-to-date with the document collection.
607    Keeping it exactly current is usually unnecessary.
608   </para>
609  </note>
610 </sect2>
611
612 <sect2>
613  <title>References</title>
614
615  <para>
616   GiST Development Site
617   <ulink url="http://www.sai.msu.su/~megera/postgres/gist/"></ulink>
618  </para>
619  <para>
620   Tsearch2 Development Site
621   <ulink url="http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/"></ulink>
622  </para>
623 </sect2>
624
625 <sect2>
626  <title>Authors</title>
627
628  <para>
629   Oleg Bartunov <email>oleg@sai.msu.su</email>, Moscow, Moscow University, Russia
630  </para>
631  <para>
632   Teodor Sigaev <email>teodor@sigaev.ru</email>, Moscow, Delta-Soft Ltd.,Russia
633  </para>
634  <para>
635   Alexander Korotkov <email>a.korotkov@postgrespro.ru</email>, Moscow, Postgres Professional, Russia
636  </para>
637  <para>
638   Documentation: Christopher Kings-Lynne
639  </para>
640  <para>
641   This module is sponsored by Delta-Soft Ltd., Moscow, Russia.
642  </para>
643 </sect2>
644
645</sect1>
646