1<!-- doc/src/sgml/pgtrgm.sgml --> 2 3<sect1 id="pgtrgm" xreflabel="pg_trgm"> 4 <title>pg_trgm</title> 5 6 <indexterm zone="pgtrgm"> 7 <primary>pg_trgm</primary> 8 </indexterm> 9 10 <para> 11 The <filename>pg_trgm</filename> module provides functions and operators 12 for determining the similarity of 13 alphanumeric text based on trigram matching, as 14 well as index operator classes that support fast searching for similar 15 strings. 16 </para> 17 18 <para> 19 This module is considered <quote>trusted</quote>, that is, it can be 20 installed by non-superusers who have <literal>CREATE</literal> privilege 21 on the current database. 22 </para> 23 24 <sect2> 25 <title>Trigram (or Trigraph) Concepts</title> 26 27 <para> 28 A trigram is a group of three consecutive characters taken 29 from a string. We can measure the similarity of two strings by 30 counting the number of trigrams they share. This simple idea 31 turns out to be very effective for measuring the similarity of 32 words in many natural languages. 33 </para> 34 35 <note> 36 <para> 37 <filename>pg_trgm</filename> ignores non-word characters 38 (non-alphanumerics) when extracting trigrams from a string. 39 Each word is considered to have two spaces 40 prefixed and one space suffixed when determining the set 41 of trigrams contained in the string. 42 For example, the set of trigrams in the string 43 <quote><literal>cat</literal></quote> is 44 <quote><literal> c</literal></quote>, 45 <quote><literal> ca</literal></quote>, 46 <quote><literal>cat</literal></quote>, and 47 <quote><literal>at </literal></quote>. 48 The set of trigrams in the string 49 <quote><literal>foo|bar</literal></quote> is 50 <quote><literal> f</literal></quote>, 51 <quote><literal> fo</literal></quote>, 52 <quote><literal>foo</literal></quote>, 53 <quote><literal>oo </literal></quote>, 54 <quote><literal> b</literal></quote>, 55 <quote><literal> ba</literal></quote>, 56 <quote><literal>bar</literal></quote>, and 57 <quote><literal>ar </literal></quote>. 58 </para> 59 </note> 60 </sect2> 61 62 <sect2> 63 <title>Functions and Operators</title> 64 65 <para> 66 The functions provided by the <filename>pg_trgm</filename> module 67 are shown in <xref linkend="pgtrgm-func-table"/>, the operators 68 in <xref linkend="pgtrgm-op-table"/>. 69 </para> 70 71 <table id="pgtrgm-func-table"> 72 <title><filename>pg_trgm</filename> Functions</title> 73 <tgroup cols="1"> 74 <thead> 75 <row> 76 <entry role="func_table_entry"><para role="func_signature"> 77 Function 78 </para> 79 <para> 80 Description 81 </para></entry> 82 </row> 83 </thead> 84 85 <tbody> 86 <row> 87 <entry role="func_table_entry"><para role="func_signature"> 88 <indexterm><primary>similarity</primary></indexterm> 89 <function>similarity</function> ( <type>text</type>, <type>text</type> ) 90 <returnvalue>real</returnvalue> 91 </para> 92 <para> 93 Returns a number that indicates how similar the two arguments are. 94 The range of the result is zero (indicating that the two strings are 95 completely dissimilar) to one (indicating that the two strings are 96 identical). 97 </para></entry> 98 </row> 99 100 <row> 101 <entry role="func_table_entry"><para role="func_signature"> 102 <indexterm><primary>show_trgm</primary></indexterm> 103 <function>show_trgm</function> ( <type>text</type> ) 104 <returnvalue>text[]</returnvalue> 105 </para> 106 <para> 107 Returns an array of all the trigrams in the given string. 108 (In practice this is seldom useful except for debugging.) 109 </para></entry> 110 </row> 111 112 <row> 113 <entry role="func_table_entry"><para role="func_signature"> 114 <indexterm><primary>word_similarity</primary></indexterm> 115 <function>word_similarity</function> ( <type>text</type>, <type>text</type> ) 116 <returnvalue>real</returnvalue> 117 </para> 118 <para> 119 Returns a number that indicates the greatest similarity between 120 the set of trigrams in the first string and any continuous extent 121 of an ordered set of trigrams in the second string. For details, see 122 the explanation below. 123 </para></entry> 124 </row> 125 126 <row> 127 <entry role="func_table_entry"><para role="func_signature"> 128 <indexterm><primary>strict_word_similarity</primary></indexterm> 129 <function>strict_word_similarity</function> ( <type>text</type>, <type>text</type> ) 130 <returnvalue>real</returnvalue> 131 </para> 132 <para> 133 Same as <function>word_similarity</function>, but forces 134 extent boundaries to match word boundaries. Since we don't have 135 cross-word trigrams, this function actually returns greatest similarity 136 between first string and any continuous extent of words of the second 137 string. 138 </para></entry> 139 </row> 140 141 <row> 142 <entry role="func_table_entry"><para role="func_signature"> 143 <indexterm><primary>show_limit</primary></indexterm> 144 <function>show_limit</function> () 145 <returnvalue>real</returnvalue> 146 </para> 147 <para> 148 Returns the current similarity threshold used by the <literal>%</literal> 149 operator. This sets the minimum similarity between 150 two words for them to be considered similar enough to 151 be misspellings of each other, for example. 152 (<emphasis>Deprecated</emphasis>; instead use <command>SHOW</command> 153 <varname>pg_trgm.similarity_threshold</varname>.) 154 </para></entry> 155 </row> 156 157 <row> 158 <entry role="func_table_entry"><para role="func_signature"> 159 <indexterm><primary>set_limit</primary></indexterm> 160 <function>set_limit</function> ( <type>real</type> ) 161 <returnvalue>real</returnvalue> 162 </para> 163 <para> 164 Sets the current similarity threshold that is used by the <literal>%</literal> 165 operator. The threshold must be between 0 and 1 (default is 0.3). 166 Returns the same value passed in. 167 (<emphasis>Deprecated</emphasis>; instead use <command>SET</command> 168 <varname>pg_trgm.similarity_threshold</varname>.) 169 </para></entry> 170 </row> 171 </tbody> 172 </tgroup> 173 </table> 174 175 <para> 176 Consider the following example: 177 178<programlisting> 179# SELECT word_similarity('word', 'two words'); 180 word_similarity 181----------------- 182 0.8 183(1 row) 184</programlisting> 185 186 In the first string, the set of trigrams is 187 <literal>{" w"," wo","wor","ord","rd "}</literal>. 188 In the second string, the ordered set of trigrams is 189 <literal>{" t"," tw","two","wo "," w"," wo","wor","ord","rds","ds "}</literal>. 190 The most similar extent of an ordered set of trigrams in the second string 191 is <literal>{" w"," wo","wor","ord"}</literal>, and the similarity is 192 <literal>0.8</literal>. 193 </para> 194 195 <para> 196 This function returns a value that can be approximately understood as the 197 greatest similarity between the first string and any substring of the second 198 string. However, this function does not add padding to the boundaries of 199 the extent. Thus, the number of additional characters present in the 200 second string is not considered, except for the mismatched word boundaries. 201 </para> 202 203 <para> 204 At the same time, <function>strict_word_similarity</function> 205 selects an extent of words in the second string. In the example above, 206 <function>strict_word_similarity</function> would select the 207 extent of a single word <literal>'words'</literal>, whose set of trigrams is 208 <literal>{" w"," wo","wor","ord","rds","ds "}</literal>. 209 210<programlisting> 211# SELECT strict_word_similarity('word', 'two words'), similarity('word', 'words'); 212 strict_word_similarity | similarity 213------------------------+------------ 214 0.571429 | 0.571429 215(1 row) 216</programlisting> 217 </para> 218 219 <para> 220 Thus, the <function>strict_word_similarity</function> function 221 is useful for finding the similarity to whole words, while 222 <function>word_similarity</function> is more suitable for 223 finding the similarity for parts of words. 224 </para> 225 226 <table id="pgtrgm-op-table"> 227 <title><filename>pg_trgm</filename> Operators</title> 228 <tgroup cols="1"> 229 <thead> 230 <row> 231 <entry role="func_table_entry"><para role="func_signature"> 232 Operator 233 </para> 234 <para> 235 Description 236 </para></entry> 237 </row> 238 </thead> 239 240 <tbody> 241 <row> 242 <entry role="func_table_entry"><para role="func_signature"> 243 <type>text</type> <literal>%</literal> <type>text</type> 244 <returnvalue>boolean</returnvalue> 245 </para> 246 <para> 247 Returns <literal>true</literal> if its arguments have a similarity 248 that is greater than the current similarity threshold set by 249 <varname>pg_trgm.similarity_threshold</varname>. 250 </para></entry> 251 </row> 252 253 <row> 254 <entry role="func_table_entry"><para role="func_signature"> 255 <type>text</type> <literal><%</literal> <type>text</type> 256 <returnvalue>boolean</returnvalue> 257 </para> 258 <para> 259 Returns <literal>true</literal> if the similarity between the trigram 260 set in the first argument and a continuous extent of an ordered trigram 261 set in the second argument is greater than the current word similarity 262 threshold set by <varname>pg_trgm.word_similarity_threshold</varname> 263 parameter. 264 </para></entry> 265 </row> 266 267 <row> 268 <entry role="func_table_entry"><para role="func_signature"> 269 <type>text</type> <literal>%></literal> <type>text</type> 270 <returnvalue>boolean</returnvalue> 271 </para> 272 <para> 273 Commutator of the <literal><%</literal> operator. 274 </para></entry> 275 </row> 276 277 <row> 278 <entry role="func_table_entry"><para role="func_signature"> 279 <type>text</type> <literal><<%</literal> <type>text</type> 280 <returnvalue>boolean</returnvalue> 281 </para> 282 <para> 283 Returns <literal>true</literal> if its second argument has a continuous 284 extent of an ordered trigram set that matches word boundaries, 285 and its similarity to the trigram set of the first argument is greater 286 than the current strict word similarity threshold set by the 287 <varname>pg_trgm.strict_word_similarity_threshold</varname> parameter. 288 </para></entry> 289 </row> 290 291 <row> 292 <entry role="func_table_entry"><para role="func_signature"> 293 <type>text</type> <literal>%>></literal> <type>text</type> 294 <returnvalue>boolean</returnvalue> 295 </para> 296 <para> 297 Commutator of the <literal><<%</literal> operator. 298 </para></entry> 299 </row> 300 301 <row> 302 <entry role="func_table_entry"><para role="func_signature"> 303 <type>text</type> <literal><-></literal> <type>text</type> 304 <returnvalue>real</returnvalue> 305 </para> 306 <para> 307 Returns the <quote>distance</quote> between the arguments, that is 308 one minus the <function>similarity()</function> value. 309 </para></entry> 310 </row> 311 312 <row> 313 <entry role="func_table_entry"><para role="func_signature"> 314 <type>text</type> <literal><<-></literal> <type>text</type> 315 <returnvalue>real</returnvalue> 316 </para> 317 <para> 318 Returns the <quote>distance</quote> between the arguments, that is 319 one minus the <function>word_similarity()</function> value. 320 </para></entry> 321 </row> 322 323 <row> 324 <entry role="func_table_entry"><para role="func_signature"> 325 <type>text</type> <literal><->></literal> <type>text</type> 326 <returnvalue>real</returnvalue> 327 </para> 328 <para> 329 Commutator of the <literal><<-></literal> operator. 330 </para></entry> 331 </row> 332 333 <row> 334 <entry role="func_table_entry"><para role="func_signature"> 335 <type>text</type> <literal><<<-></literal> <type>text</type> 336 <returnvalue>real</returnvalue> 337 </para> 338 <para> 339 Returns the <quote>distance</quote> between the arguments, that is 340 one minus the <function>strict_word_similarity()</function> value. 341 </para></entry> 342 </row> 343 344 <row> 345 <entry role="func_table_entry"><para role="func_signature"> 346 <type>text</type> <literal><->>></literal> <type>text</type> 347 <returnvalue>real</returnvalue> 348 </para> 349 <para> 350 Commutator of the <literal><<<-></literal> operator. 351 </para></entry> 352 </row> 353 </tbody> 354 </tgroup> 355 </table> 356 </sect2> 357 358 <sect2> 359 <title>GUC Parameters</title> 360 361 <variablelist> 362 <varlistentry id="guc-pgtrgm-similarity-threshold" xreflabel="pg_trgm.similarity_threshold"> 363 <term> 364 <varname>pg_trgm.similarity_threshold</varname> (<type>real</type>) 365 <indexterm> 366 <primary><varname>pg_trgm.similarity_threshold</varname> configuration parameter</primary> 367 </indexterm> 368 </term> 369 <listitem> 370 <para> 371 Sets the current similarity threshold that is used by the <literal>%</literal> 372 operator. The threshold must be between 0 and 1 (default is 0.3). 373 </para> 374 </listitem> 375 </varlistentry> 376 <varlistentry id="guc-pgtrgm-word-similarity-threshold" xreflabel="pg_trgm.word_similarity_threshold"> 377 <term> 378 <varname>pg_trgm.word_similarity_threshold</varname> (<type>real</type>) 379 <indexterm> 380 <primary> 381 <varname>pg_trgm.word_similarity_threshold</varname> configuration parameter 382 </primary> 383 </indexterm> 384 </term> 385 <listitem> 386 <para> 387 Sets the current word similarity threshold that is used by the 388 <literal><%</literal> and <literal>%></literal> operators. The threshold 389 must be between 0 and 1 (default is 0.6). 390 </para> 391 </listitem> 392 </varlistentry> 393 <varlistentry id="guc-pgtrgm-strict-word-similarity-threshold" xreflabel="pg_trgm.strict_word_similarity_threshold"> 394 <term> 395 <varname>pg_trgm.strict_word_similarity_threshold</varname> (<type>real</type>) 396 <indexterm> 397 <primary> 398 <varname>pg_trgm.strict_word_similarity_threshold</varname> configuration parameter 399 </primary> 400 </indexterm> 401 </term> 402 <listitem> 403 <para> 404 Sets the current strict word similarity threshold that is used by the 405 <literal><<%</literal> and <literal>%>></literal> operators. The threshold 406 must be between 0 and 1 (default is 0.5). 407 </para> 408 </listitem> 409 </varlistentry> 410 </variablelist> 411 </sect2> 412 413 <sect2> 414 <title>Index Support</title> 415 416 <para> 417 The <filename>pg_trgm</filename> module provides GiST and GIN index 418 operator classes that allow you to create an index over a text column for 419 the purpose of very fast similarity searches. These index types support 420 the above-described similarity operators, and additionally support 421 trigram-based index searches for <literal>LIKE</literal>, <literal>ILIKE</literal>, 422 <literal>~</literal> and <literal>~*</literal> queries. (These indexes do not 423 support equality nor simple comparison operators, so you may need a 424 regular B-tree index too.) 425 </para> 426 427 <para> 428 Example: 429 430<programlisting> 431CREATE TABLE test_trgm (t text); 432CREATE INDEX trgm_idx ON test_trgm USING GIST (t gist_trgm_ops); 433</programlisting> 434or 435<programlisting> 436CREATE INDEX trgm_idx ON test_trgm USING GIN (t gin_trgm_ops); 437</programlisting> 438 </para> 439 440 <para> 441 <literal>gist_trgm_ops</literal> GiST opclass approximates a set of 442 trigrams as a bitmap signature. Its optional integer parameter 443 <literal>siglen</literal> determines the 444 signature length in bytes. The default length is 12 bytes. 445 Valid values of signature length are between 1 and 2024 bytes. Longer 446 signatures lead to a more precise search (scanning a smaller fraction of the index and 447 fewer heap pages), at the cost of a larger index. 448 </para> 449 450 <para> 451 Example of creating such an index with a signature length of 32 bytes: 452 </para> 453<programlisting> 454CREATE INDEX trgm_idx ON test_trgm USING GIST (t gist_trgm_ops(siglen=32)); 455</programlisting> 456 457 <para> 458 At this point, you will have an index on the <structfield>t</structfield> column that 459 you can use for similarity searching. A typical query is 460<programlisting> 461SELECT t, similarity(t, '<replaceable>word</replaceable>') AS sml 462 FROM test_trgm 463 WHERE t % '<replaceable>word</replaceable>' 464 ORDER BY sml DESC, t; 465</programlisting> 466 This will return all values in the text column that are sufficiently 467 similar to <replaceable>word</replaceable>, sorted from best match to worst. The 468 index will be used to make this a fast operation even over very large data 469 sets. 470 </para> 471 472 <para> 473 A variant of the above query is 474<programlisting> 475SELECT t, t <-> '<replaceable>word</replaceable>' AS dist 476 FROM test_trgm 477 ORDER BY dist LIMIT 10; 478</programlisting> 479 This can be implemented quite efficiently by GiST indexes, but not 480 by GIN indexes. It will usually beat the first formulation when only 481 a small number of the closest matches is wanted. 482 </para> 483 484 <para> 485 Also you can use an index on the <structfield>t</structfield> column for word 486 similarity or strict word similarity. Typical queries are: 487<programlisting> 488SELECT t, word_similarity('<replaceable>word</replaceable>', t) AS sml 489 FROM test_trgm 490 WHERE '<replaceable>word</replaceable>' <% t 491 ORDER BY sml DESC, t; 492</programlisting> 493 and 494<programlisting> 495SELECT t, strict_word_similarity('<replaceable>word</replaceable>', t) AS sml 496 FROM test_trgm 497 WHERE '<replaceable>word</replaceable>' <<% t 498 ORDER BY sml DESC, t; 499</programlisting> 500 This will return all values in the text column for which there is a 501 continuous extent in the corresponding ordered trigram set that is 502 sufficiently similar to the trigram set of <replaceable>word</replaceable>, 503 sorted from best match to worst. The index will be used to make this 504 a fast operation even over very large data sets. 505 </para> 506 507 <para> 508 Possible variants of the above queries are: 509<programlisting> 510SELECT t, '<replaceable>word</replaceable>' <<-> t AS dist 511 FROM test_trgm 512 ORDER BY dist LIMIT 10; 513</programlisting> 514 and 515<programlisting> 516SELECT t, '<replaceable>word</replaceable>' <<<-> t AS dist 517 FROM test_trgm 518 ORDER BY dist LIMIT 10; 519</programlisting> 520 This can be implemented quite efficiently by GiST indexes, but not 521 by GIN indexes. 522 </para> 523 524 525 <para> 526 Beginning in <productname>PostgreSQL</productname> 9.1, these index types also support 527 index searches for <literal>LIKE</literal> and <literal>ILIKE</literal>, for example 528<programlisting> 529SELECT * FROM test_trgm WHERE t LIKE '%foo%bar'; 530</programlisting> 531 The index search works by extracting trigrams from the search string 532 and then looking these up in the index. The more trigrams in the search 533 string, the more effective the index search is. Unlike B-tree based 534 searches, the search string need not be left-anchored. 535 </para> 536 537 <para> 538 Beginning in <productname>PostgreSQL</productname> 9.3, these index types also support 539 index searches for regular-expression matches 540 (<literal>~</literal> and <literal>~*</literal> operators), for example 541<programlisting> 542SELECT * FROM test_trgm WHERE t ~ '(foo|bar)'; 543</programlisting> 544 The index search works by extracting trigrams from the regular expression 545 and then looking these up in the index. The more trigrams that can be 546 extracted from the regular expression, the more effective the index search 547 is. Unlike B-tree based searches, the search string need not be 548 left-anchored. 549 </para> 550 551 <para> 552 For both <literal>LIKE</literal> and regular-expression searches, keep in mind 553 that a pattern with no extractable trigrams will degenerate to a full-index 554 scan. 555 </para> 556 557 <para> 558 The choice between GiST and GIN indexing depends on the relative 559 performance characteristics of GiST and GIN, which are discussed elsewhere. 560 </para> 561 </sect2> 562 563 <sect2> 564 <title>Text Search Integration</title> 565 566 <para> 567 Trigram matching is a very useful tool when used in conjunction 568 with a full text index. In particular it can help to recognize 569 misspelled input words that will not be matched directly by the 570 full text search mechanism. 571 </para> 572 573 <para> 574 The first step is to generate an auxiliary table containing all 575 the unique words in the documents: 576 577<programlisting> 578CREATE TABLE words AS SELECT word FROM 579 ts_stat('SELECT to_tsvector(''simple'', bodytext) FROM documents'); 580</programlisting> 581 582 where <structname>documents</structname> is a table that has a text field 583 <structfield>bodytext</structfield> that we wish to search. The reason for using 584 the <literal>simple</literal> configuration with the <function>to_tsvector</function> 585 function, instead of using a language-specific configuration, 586 is that we want a list of the original (unstemmed) words. 587 </para> 588 589 <para> 590 Next, create a trigram index on the word column: 591 592<programlisting> 593CREATE INDEX words_idx ON words USING GIN (word gin_trgm_ops); 594</programlisting> 595 596 Now, a <command>SELECT</command> query similar to the previous example can 597 be used to suggest spellings for misspelled words in user search terms. 598 A useful extra test is to require that the selected words are also of 599 similar length to the misspelled word. 600 </para> 601 602 <note> 603 <para> 604 Since the <structname>words</structname> table has been generated as a separate, 605 static table, it will need to be periodically regenerated so that 606 it remains reasonably up-to-date with the document collection. 607 Keeping it exactly current is usually unnecessary. 608 </para> 609 </note> 610 </sect2> 611 612 <sect2> 613 <title>References</title> 614 615 <para> 616 GiST Development Site 617 <ulink url="http://www.sai.msu.su/~megera/postgres/gist/"></ulink> 618 </para> 619 <para> 620 Tsearch2 Development Site 621 <ulink url="http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/"></ulink> 622 </para> 623 </sect2> 624 625 <sect2> 626 <title>Authors</title> 627 628 <para> 629 Oleg Bartunov <email>oleg@sai.msu.su</email>, Moscow, Moscow University, Russia 630 </para> 631 <para> 632 Teodor Sigaev <email>teodor@sigaev.ru</email>, Moscow, Delta-Soft Ltd.,Russia 633 </para> 634 <para> 635 Alexander Korotkov <email>a.korotkov@postgrespro.ru</email>, Moscow, Postgres Professional, Russia 636 </para> 637 <para> 638 Documentation: Christopher Kings-Lynne 639 </para> 640 <para> 641 This module is sponsored by Delta-Soft Ltd., Moscow, Russia. 642 </para> 643 </sect2> 644 645</sect1> 646