1<!-- doc/src/sgml/textsearch.sgml --> 2 3<chapter id="textsearch"> 4 <title>Full Text Search</title> 5 6 <indexterm zone="textsearch"> 7 <primary>full text search</primary> 8 </indexterm> 9 10 <indexterm zone="textsearch"> 11 <primary>text search</primary> 12 </indexterm> 13 14 <sect1 id="textsearch-intro"> 15 <title>Introduction</title> 16 17 <para> 18 Full Text Searching (or just <firstterm>text search</firstterm>) provides 19 the capability to identify natural-language <firstterm>documents</firstterm> that 20 satisfy a <firstterm>query</firstterm>, and optionally to sort them by 21 relevance to the query. The most common type of search 22 is to find all documents containing given <firstterm>query terms</firstterm> 23 and return them in order of their <firstterm>similarity</firstterm> to the 24 query. Notions of <varname>query</varname> and 25 <varname>similarity</varname> are very flexible and depend on the specific 26 application. The simplest search considers <varname>query</varname> as a 27 set of words and <varname>similarity</varname> as the frequency of query 28 words in the document. 29 </para> 30 31 <para> 32 Textual search operators have existed in databases for years. 33 <productname>PostgreSQL</productname> has 34 <literal>~</literal>, <literal>~*</literal>, <literal>LIKE</literal>, and 35 <literal>ILIKE</literal> operators for textual data types, but they lack 36 many essential properties required by modern information systems: 37 </para> 38 39 <itemizedlist spacing="compact" mark="bullet"> 40 <listitem> 41 <para> 42 There is no linguistic support, even for English. Regular expressions 43 are not sufficient because they cannot easily handle derived words, e.g., 44 <literal>satisfies</literal> and <literal>satisfy</literal>. You might 45 miss documents that contain <literal>satisfies</literal>, although you 46 probably would like to find them when searching for 47 <literal>satisfy</literal>. It is possible to use <literal>OR</literal> 48 to search for multiple derived forms, but this is tedious and error-prone 49 (some words can have several thousand derivatives). 50 </para> 51 </listitem> 52 53 <listitem> 54 <para> 55 They provide no ordering (ranking) of search results, which makes them 56 ineffective when thousands of matching documents are found. 57 </para> 58 </listitem> 59 60 <listitem> 61 <para> 62 They tend to be slow because there is no index support, so they must 63 process all documents for every search. 64 </para> 65 </listitem> 66 </itemizedlist> 67 68 <para> 69 Full text indexing allows documents to be <emphasis>preprocessed</emphasis> 70 and an index saved for later rapid searching. Preprocessing includes: 71 </para> 72 73 <itemizedlist mark="none"> 74 <listitem> 75 <para> 76 <emphasis>Parsing documents into <firstterm>tokens</firstterm></emphasis>. It is 77 useful to identify various classes of tokens, e.g., numbers, words, 78 complex words, email addresses, so that they can be processed 79 differently. In principle token classes depend on the specific 80 application, but for most purposes it is adequate to use a predefined 81 set of classes. 82 <productname>PostgreSQL</productname> uses a <firstterm>parser</firstterm> to 83 perform this step. A standard parser is provided, and custom parsers 84 can be created for specific needs. 85 </para> 86 </listitem> 87 88 <listitem> 89 <para> 90 <emphasis>Converting tokens into <firstterm>lexemes</firstterm></emphasis>. 91 A lexeme is a string, just like a token, but it has been 92 <firstterm>normalized</firstterm> so that different forms of the same word 93 are made alike. For example, normalization almost always includes 94 folding upper-case letters to lower-case, and often involves removal 95 of suffixes (such as <literal>s</literal> or <literal>es</literal> in English). 96 This allows searches to find variant forms of the 97 same word, without tediously entering all the possible variants. 98 Also, this step typically eliminates <firstterm>stop words</firstterm>, which 99 are words that are so common that they are useless for searching. 100 (In short, then, tokens are raw fragments of the document text, while 101 lexemes are words that are believed useful for indexing and searching.) 102 <productname>PostgreSQL</productname> uses <firstterm>dictionaries</firstterm> to 103 perform this step. Various standard dictionaries are provided, and 104 custom ones can be created for specific needs. 105 </para> 106 </listitem> 107 108 <listitem> 109 <para> 110 <emphasis>Storing preprocessed documents optimized for 111 searching</emphasis>. For example, each document can be represented 112 as a sorted array of normalized lexemes. Along with the lexemes it is 113 often desirable to store positional information to use for 114 <firstterm>proximity ranking</firstterm>, so that a document that 115 contains a more <quote>dense</quote> region of query words is 116 assigned a higher rank than one with scattered query words. 117 </para> 118 </listitem> 119 </itemizedlist> 120 121 <para> 122 Dictionaries allow fine-grained control over how tokens are normalized. 123 With appropriate dictionaries, you can: 124 </para> 125 126 <itemizedlist spacing="compact" mark="bullet"> 127 <listitem> 128 <para> 129 Define stop words that should not be indexed. 130 </para> 131 </listitem> 132 133 <listitem> 134 <para> 135 Map synonyms to a single word using <application>Ispell</application>. 136 </para> 137 </listitem> 138 139 <listitem> 140 <para> 141 Map phrases to a single word using a thesaurus. 142 </para> 143 </listitem> 144 145 <listitem> 146 <para> 147 Map different variations of a word to a canonical form using 148 an <application>Ispell</application> dictionary. 149 </para> 150 </listitem> 151 152 <listitem> 153 <para> 154 Map different variations of a word to a canonical form using 155 <application>Snowball</application> stemmer rules. 156 </para> 157 </listitem> 158 </itemizedlist> 159 160 <para> 161 A data type <type>tsvector</type> is provided for storing preprocessed 162 documents, along with a type <type>tsquery</type> for representing processed 163 queries (<xref linkend="datatype-textsearch"/>). There are many 164 functions and operators available for these data types 165 (<xref linkend="functions-textsearch"/>), the most important of which is 166 the match operator <literal>@@</literal>, which we introduce in 167 <xref linkend="textsearch-matching"/>. Full text searches can be accelerated 168 using indexes (<xref linkend="textsearch-indexes"/>). 169 </para> 170 171 172 <sect2 id="textsearch-document"> 173 <title>What Is a Document?</title> 174 175 <indexterm zone="textsearch-document"> 176 <primary>document</primary> 177 <secondary>text search</secondary> 178 </indexterm> 179 180 <para> 181 A <firstterm>document</firstterm> is the unit of searching in a full text search 182 system; for example, a magazine article or email message. The text search 183 engine must be able to parse documents and store associations of lexemes 184 (key words) with their parent document. Later, these associations are 185 used to search for documents that contain query words. 186 </para> 187 188 <para> 189 For searches within <productname>PostgreSQL</productname>, 190 a document is normally a textual field within a row of a database table, 191 or possibly a combination (concatenation) of such fields, perhaps stored 192 in several tables or obtained dynamically. In other words, a document can 193 be constructed from different parts for indexing and it might not be 194 stored anywhere as a whole. For example: 195 196<programlisting> 197SELECT title || ' ' || author || ' ' || abstract || ' ' || body AS document 198FROM messages 199WHERE mid = 12; 200 201SELECT m.title || ' ' || m.author || ' ' || m.abstract || ' ' || d.body AS document 202FROM messages m, docs d 203WHERE m.mid = d.did AND m.mid = 12; 204</programlisting> 205 </para> 206 207 <note> 208 <para> 209 Actually, in these example queries, <function>coalesce</function> 210 should be used to prevent a single <literal>NULL</literal> attribute from 211 causing a <literal>NULL</literal> result for the whole document. 212 </para> 213 </note> 214 215 <para> 216 Another possibility is to store the documents as simple text files in the 217 file system. In this case, the database can be used to store the full text 218 index and to execute searches, and some unique identifier can be used to 219 retrieve the document from the file system. However, retrieving files 220 from outside the database requires superuser permissions or special 221 function support, so this is usually less convenient than keeping all 222 the data inside <productname>PostgreSQL</productname>. Also, keeping 223 everything inside the database allows easy access 224 to document metadata to assist in indexing and display. 225 </para> 226 227 <para> 228 For text search purposes, each document must be reduced to the 229 preprocessed <type>tsvector</type> format. Searching and ranking 230 are performed entirely on the <type>tsvector</type> representation 231 of a document — the original text need only be retrieved 232 when the document has been selected for display to a user. 233 We therefore often speak of the <type>tsvector</type> as being the 234 document, but of course it is only a compact representation of 235 the full document. 236 </para> 237 </sect2> 238 239 <sect2 id="textsearch-matching"> 240 <title>Basic Text Matching</title> 241 242 <para> 243 Full text searching in <productname>PostgreSQL</productname> is based on 244 the match operator <literal>@@</literal>, which returns 245 <literal>true</literal> if a <type>tsvector</type> 246 (document) matches a <type>tsquery</type> (query). 247 It doesn't matter which data type is written first: 248 249<programlisting> 250SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector @@ 'cat & rat'::tsquery; 251 ?column? 252---------- 253 t 254 255SELECT 'fat & cow'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector; 256 ?column? 257---------- 258 f 259</programlisting> 260 </para> 261 262 <para> 263 As the above example suggests, a <type>tsquery</type> is not just raw 264 text, any more than a <type>tsvector</type> is. A <type>tsquery</type> 265 contains search terms, which must be already-normalized lexemes, and 266 may combine multiple terms using AND, OR, NOT, and FOLLOWED BY operators. 267 (For syntax details see <xref linkend="datatype-tsquery"/>.) There are 268 functions <function>to_tsquery</function>, <function>plainto_tsquery</function>, 269 and <function>phraseto_tsquery</function> 270 that are helpful in converting user-written text into a proper 271 <type>tsquery</type>, primarily by normalizing words appearing in 272 the text. Similarly, <function>to_tsvector</function> is used to parse and 273 normalize a document string. So in practice a text search match would 274 look more like this: 275 276<programlisting> 277SELECT to_tsvector('fat cats ate fat rats') @@ to_tsquery('fat & rat'); 278 ?column? 279---------- 280 t 281</programlisting> 282 283 Observe that this match would not succeed if written as 284 285<programlisting> 286SELECT 'fat cats ate fat rats'::tsvector @@ to_tsquery('fat & rat'); 287 ?column? 288---------- 289 f 290</programlisting> 291 292 since here no normalization of the word <literal>rats</literal> will occur. 293 The elements of a <type>tsvector</type> are lexemes, which are assumed 294 already normalized, so <literal>rats</literal> does not match <literal>rat</literal>. 295 </para> 296 297 <para> 298 The <literal>@@</literal> operator also 299 supports <type>text</type> input, allowing explicit conversion of a text 300 string to <type>tsvector</type> or <type>tsquery</type> to be skipped 301 in simple cases. The variants available are: 302 303<programlisting> 304tsvector @@ tsquery 305tsquery @@ tsvector 306text @@ tsquery 307text @@ text 308</programlisting> 309 </para> 310 311 <para> 312 The first two of these we saw already. 313 The form <type>text</type> <literal>@@</literal> <type>tsquery</type> 314 is equivalent to <literal>to_tsvector(x) @@ y</literal>. 315 The form <type>text</type> <literal>@@</literal> <type>text</type> 316 is equivalent to <literal>to_tsvector(x) @@ plainto_tsquery(y)</literal>. 317 </para> 318 319 <para> 320 Within a <type>tsquery</type>, the <literal>&</literal> (AND) operator 321 specifies that both its arguments must appear in the document to have a 322 match. Similarly, the <literal>|</literal> (OR) operator specifies that 323 at least one of its arguments must appear, while the <literal>!</literal> (NOT) 324 operator specifies that its argument must <emphasis>not</emphasis> appear in 325 order to have a match. 326 For example, the query <literal>fat & ! rat</literal> matches documents that 327 contain <literal>fat</literal> but not <literal>rat</literal>. 328 </para> 329 330 <para> 331 Searching for phrases is possible with the help of 332 the <literal><-></literal> (FOLLOWED BY) <type>tsquery</type> operator, which 333 matches only if its arguments have matches that are adjacent and in the 334 given order. For example: 335 336<programlisting> 337SELECT to_tsvector('fatal error') @@ to_tsquery('fatal <-> error'); 338 ?column? 339---------- 340 t 341 342SELECT to_tsvector('error is not fatal') @@ to_tsquery('fatal <-> error'); 343 ?column? 344---------- 345 f 346</programlisting> 347 348 There is a more general version of the FOLLOWED BY operator having the 349 form <literal><<replaceable>N</replaceable>></literal>, 350 where <replaceable>N</replaceable> is an integer standing for the difference between 351 the positions of the matching lexemes. <literal><1></literal> is 352 the same as <literal><-></literal>, while <literal><2></literal> 353 allows exactly one other lexeme to appear between the matches, and so 354 on. The <literal>phraseto_tsquery</literal> function makes use of this 355 operator to construct a <literal>tsquery</literal> that can match a multi-word 356 phrase when some of the words are stop words. For example: 357 358<programlisting> 359SELECT phraseto_tsquery('cats ate rats'); 360 phraseto_tsquery 361------------------------------- 362 'cat' <-> 'ate' <-> 'rat' 363 364SELECT phraseto_tsquery('the cats ate the rats'); 365 phraseto_tsquery 366------------------------------- 367 'cat' <-> 'ate' <2> 'rat' 368</programlisting> 369 </para> 370 371 <para> 372 A special case that's sometimes useful is that <literal><0></literal> 373 can be used to require that two patterns match the same word. 374 </para> 375 376 <para> 377 Parentheses can be used to control nesting of the <type>tsquery</type> 378 operators. Without parentheses, <literal>|</literal> binds least tightly, 379 then <literal>&</literal>, then <literal><-></literal>, 380 and <literal>!</literal> most tightly. 381 </para> 382 383 <para> 384 It's worth noticing that the AND/OR/NOT operators mean something subtly 385 different when they are within the arguments of a FOLLOWED BY operator 386 than when they are not, because within FOLLOWED BY the exact position of 387 the match is significant. For example, normally <literal>!x</literal> matches 388 only documents that do not contain <literal>x</literal> anywhere. 389 But <literal>!x <-> y</literal> matches <literal>y</literal> if it is not 390 immediately after an <literal>x</literal>; an occurrence of <literal>x</literal> 391 elsewhere in the document does not prevent a match. Another example is 392 that <literal>x & y</literal> normally only requires that <literal>x</literal> 393 and <literal>y</literal> both appear somewhere in the document, but 394 <literal>(x & y) <-> z</literal> requires <literal>x</literal> 395 and <literal>y</literal> to match at the same place, immediately before 396 a <literal>z</literal>. Thus this query behaves differently from 397 <literal>x <-> z & y <-> z</literal>, which will match a 398 document containing two separate sequences <literal>x z</literal> and 399 <literal>y z</literal>. (This specific query is useless as written, 400 since <literal>x</literal> and <literal>y</literal> could not match at the same place; 401 but with more complex situations such as prefix-match patterns, a query 402 of this form could be useful.) 403 </para> 404 </sect2> 405 406 <sect2 id="textsearch-intro-configurations"> 407 <title>Configurations</title> 408 409 <para> 410 The above are all simple text search examples. As mentioned before, full 411 text search functionality includes the ability to do many more things: 412 skip indexing certain words (stop words), process synonyms, and use 413 sophisticated parsing, e.g., parse based on more than just white space. 414 This functionality is controlled by <firstterm>text search 415 configurations</firstterm>. <productname>PostgreSQL</productname> comes with predefined 416 configurations for many languages, and you can easily create your own 417 configurations. (<application>psql</application>'s <command>\dF</command> command 418 shows all available configurations.) 419 </para> 420 421 <para> 422 During installation an appropriate configuration is selected and 423 <xref linkend="guc-default-text-search-config"/> is set accordingly 424 in <filename>postgresql.conf</filename>. If you are using the same text search 425 configuration for the entire cluster you can use the value in 426 <filename>postgresql.conf</filename>. To use different configurations 427 throughout the cluster but the same configuration within any one database, 428 use <command>ALTER DATABASE ... SET</command>. Otherwise, you can set 429 <varname>default_text_search_config</varname> in each session. 430 </para> 431 432 <para> 433 Each text search function that depends on a configuration has an optional 434 <type>regconfig</type> argument, so that the configuration to use can be 435 specified explicitly. <varname>default_text_search_config</varname> 436 is used only when this argument is omitted. 437 </para> 438 439 <para> 440 To make it easier to build custom text search configurations, a 441 configuration is built up from simpler database objects. 442 <productname>PostgreSQL</productname>'s text search facility provides 443 four types of configuration-related database objects: 444 </para> 445 446 <itemizedlist spacing="compact" mark="bullet"> 447 <listitem> 448 <para> 449 <firstterm>Text search parsers</firstterm> break documents into tokens 450 and classify each token (for example, as words or numbers). 451 </para> 452 </listitem> 453 454 <listitem> 455 <para> 456 <firstterm>Text search dictionaries</firstterm> convert tokens to normalized 457 form and reject stop words. 458 </para> 459 </listitem> 460 461 <listitem> 462 <para> 463 <firstterm>Text search templates</firstterm> provide the functions underlying 464 dictionaries. (A dictionary simply specifies a template and a set 465 of parameters for the template.) 466 </para> 467 </listitem> 468 469 <listitem> 470 <para> 471 <firstterm>Text search configurations</firstterm> select a parser and a set 472 of dictionaries to use to normalize the tokens produced by the parser. 473 </para> 474 </listitem> 475 </itemizedlist> 476 477 <para> 478 Text search parsers and templates are built from low-level C functions; 479 therefore it requires C programming ability to develop new ones, and 480 superuser privileges to install one into a database. (There are examples 481 of add-on parsers and templates in the <filename>contrib/</filename> area of the 482 <productname>PostgreSQL</productname> distribution.) Since dictionaries and 483 configurations just parameterize and connect together some underlying 484 parsers and templates, no special privilege is needed to create a new 485 dictionary or configuration. Examples of creating custom dictionaries and 486 configurations appear later in this chapter. 487 </para> 488 489 </sect2> 490 491 </sect1> 492 493 <sect1 id="textsearch-tables"> 494 <title>Tables and Indexes</title> 495 496 <para> 497 The examples in the previous section illustrated full text matching using 498 simple constant strings. This section shows how to search table data, 499 optionally using indexes. 500 </para> 501 502 <sect2 id="textsearch-tables-search"> 503 <title>Searching a Table</title> 504 505 <para> 506 It is possible to do a full text search without an index. A simple query 507 to print the <structname>title</structname> of each row that contains the word 508 <literal>friend</literal> in its <structfield>body</structfield> field is: 509 510<programlisting> 511SELECT title 512FROM pgweb 513WHERE to_tsvector('english', body) @@ to_tsquery('english', 'friend'); 514</programlisting> 515 516 This will also find related words such as <literal>friends</literal> 517 and <literal>friendly</literal>, since all these are reduced to the same 518 normalized lexeme. 519 </para> 520 521 <para> 522 The query above specifies that the <literal>english</literal> configuration 523 is to be used to parse and normalize the strings. Alternatively we 524 could omit the configuration parameters: 525 526<programlisting> 527SELECT title 528FROM pgweb 529WHERE to_tsvector(body) @@ to_tsquery('friend'); 530</programlisting> 531 532 This query will use the configuration set by <xref 533 linkend="guc-default-text-search-config"/>. 534 </para> 535 536 <para> 537 A more complex example is to 538 select the ten most recent documents that contain <literal>create</literal> and 539 <literal>table</literal> in the <structname>title</structname> or <structname>body</structname>: 540 541<programlisting> 542SELECT title 543FROM pgweb 544WHERE to_tsvector(title || ' ' || body) @@ to_tsquery('create & table') 545ORDER BY last_mod_date DESC 546LIMIT 10; 547</programlisting> 548 549 For clarity we omitted the <function>coalesce</function> function calls 550 which would be needed to find rows that contain <literal>NULL</literal> 551 in one of the two fields. 552 </para> 553 554 <para> 555 Although these queries will work without an index, most applications 556 will find this approach too slow, except perhaps for occasional ad-hoc 557 searches. Practical use of text searching usually requires creating 558 an index. 559 </para> 560 561 </sect2> 562 563 <sect2 id="textsearch-tables-index"> 564 <title>Creating Indexes</title> 565 566 <para> 567 We can create a <acronym>GIN</acronym> index (<xref 568 linkend="textsearch-indexes"/>) to speed up text searches: 569 570<programlisting> 571CREATE INDEX pgweb_idx ON pgweb USING GIN (to_tsvector('english', body)); 572</programlisting> 573 574 Notice that the 2-argument version of <function>to_tsvector</function> is 575 used. Only text search functions that specify a configuration name can 576 be used in expression indexes (<xref linkend="indexes-expressional"/>). 577 This is because the index contents must be unaffected by <xref 578 linkend="guc-default-text-search-config"/>. If they were affected, the 579 index contents might be inconsistent because different entries could 580 contain <type>tsvector</type>s that were created with different text search 581 configurations, and there would be no way to guess which was which. It 582 would be impossible to dump and restore such an index correctly. 583 </para> 584 585 <para> 586 Because the two-argument version of <function>to_tsvector</function> was 587 used in the index above, only a query reference that uses the 2-argument 588 version of <function>to_tsvector</function> with the same configuration 589 name will use that index. That is, <literal>WHERE 590 to_tsvector('english', body) @@ 'a & b'</literal> can use the index, 591 but <literal>WHERE to_tsvector(body) @@ 'a & b'</literal> cannot. 592 This ensures that an index will be used only with the same configuration 593 used to create the index entries. 594 </para> 595 596 <para> 597 It is possible to set up more complex expression indexes wherein the 598 configuration name is specified by another column, e.g.: 599 600<programlisting> 601CREATE INDEX pgweb_idx ON pgweb USING GIN (to_tsvector(config_name, body)); 602</programlisting> 603 604 where <literal>config_name</literal> is a column in the <literal>pgweb</literal> 605 table. This allows mixed configurations in the same index while 606 recording which configuration was used for each index entry. This 607 would be useful, for example, if the document collection contained 608 documents in different languages. Again, 609 queries that are meant to use the index must be phrased to match, e.g., 610 <literal>WHERE to_tsvector(config_name, body) @@ 'a & b'</literal>. 611 </para> 612 613 <para> 614 Indexes can even concatenate columns: 615 616<programlisting> 617CREATE INDEX pgweb_idx ON pgweb USING GIN (to_tsvector('english', title || ' ' || body)); 618</programlisting> 619 </para> 620 621 <para> 622 Another approach is to create a separate <type>tsvector</type> column 623 to hold the output of <function>to_tsvector</function>. To keep this 624 column automatically up to date with its source data, use a stored 625 generated column. This example is a 626 concatenation of <literal>title</literal> and <literal>body</literal>, 627 using <function>coalesce</function> to ensure that one field will still be 628 indexed when the other is <literal>NULL</literal>: 629 630<programlisting> 631ALTER TABLE pgweb 632 ADD COLUMN textsearchable_index_col tsvector 633 GENERATED ALWAYS AS (to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))) STORED; 634</programlisting> 635 636 Then we create a <acronym>GIN</acronym> index to speed up the search: 637 638<programlisting> 639CREATE INDEX textsearch_idx ON pgweb USING GIN (textsearchable_index_col); 640</programlisting> 641 642 Now we are ready to perform a fast full text search: 643 644<programlisting> 645SELECT title 646FROM pgweb 647WHERE textsearchable_index_col @@ to_tsquery('create & table') 648ORDER BY last_mod_date DESC 649LIMIT 10; 650</programlisting> 651 </para> 652 653 <para> 654 One advantage of the separate-column approach over an expression index 655 is that it is not necessary to explicitly specify the text search 656 configuration in queries in order to make use of the index. As shown 657 in the example above, the query can depend on 658 <varname>default_text_search_config</varname>. Another advantage is that 659 searches will be faster, since it will not be necessary to redo the 660 <function>to_tsvector</function> calls to verify index matches. (This is more 661 important when using a GiST index than a GIN index; see <xref 662 linkend="textsearch-indexes"/>.) The expression-index approach is 663 simpler to set up, however, and it requires less disk space since the 664 <type>tsvector</type> representation is not stored explicitly. 665 </para> 666 667 </sect2> 668 669 </sect1> 670 671 <sect1 id="textsearch-controls"> 672 <title>Controlling Text Search</title> 673 674 <para> 675 To implement full text searching there must be a function to create a 676 <type>tsvector</type> from a document and a <type>tsquery</type> from a 677 user query. Also, we need to return results in a useful order, so we need 678 a function that compares documents with respect to their relevance to 679 the query. It's also important to be able to display the results nicely. 680 <productname>PostgreSQL</productname> provides support for all of these 681 functions. 682 </para> 683 684 <sect2 id="textsearch-parsing-documents"> 685 <title>Parsing Documents</title> 686 687 <para> 688 <productname>PostgreSQL</productname> provides the 689 function <function>to_tsvector</function> for converting a document to 690 the <type>tsvector</type> data type. 691 </para> 692 693 <indexterm> 694 <primary>to_tsvector</primary> 695 </indexterm> 696 697<synopsis> 698to_tsvector(<optional> <replaceable class="parameter">config</replaceable> <type>regconfig</type>, </optional> <replaceable class="parameter">document</replaceable> <type>text</type>) returns <type>tsvector</type> 699</synopsis> 700 701 <para> 702 <function>to_tsvector</function> parses a textual document into tokens, 703 reduces the tokens to lexemes, and returns a <type>tsvector</type> which 704 lists the lexemes together with their positions in the document. 705 The document is processed according to the specified or default 706 text search configuration. 707 Here is a simple example: 708 709<screen> 710SELECT to_tsvector('english', 'a fat cat sat on a mat - it ate a fat rats'); 711 to_tsvector 712----------------------------------------------------- 713 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4 714</screen> 715 </para> 716 717 <para> 718 In the example above we see that the resulting <type>tsvector</type> does not 719 contain the words <literal>a</literal>, <literal>on</literal>, or 720 <literal>it</literal>, the word <literal>rats</literal> became 721 <literal>rat</literal>, and the punctuation sign <literal>-</literal> was 722 ignored. 723 </para> 724 725 <para> 726 The <function>to_tsvector</function> function internally calls a parser 727 which breaks the document text into tokens and assigns a type to 728 each token. For each token, a list of 729 dictionaries (<xref linkend="textsearch-dictionaries"/>) is consulted, 730 where the list can vary depending on the token type. The first dictionary 731 that <firstterm>recognizes</firstterm> the token emits one or more normalized 732 <firstterm>lexemes</firstterm> to represent the token. For example, 733 <literal>rats</literal> became <literal>rat</literal> because one of the 734 dictionaries recognized that the word <literal>rats</literal> is a plural 735 form of <literal>rat</literal>. Some words are recognized as 736 <firstterm>stop words</firstterm> (<xref linkend="textsearch-stopwords"/>), which 737 causes them to be ignored since they occur too frequently to be useful in 738 searching. In our example these are 739 <literal>a</literal>, <literal>on</literal>, and <literal>it</literal>. 740 If no dictionary in the list recognizes the token then it is also ignored. 741 In this example that happened to the punctuation sign <literal>-</literal> 742 because there are in fact no dictionaries assigned for its token type 743 (<literal>Space symbols</literal>), meaning space tokens will never be 744 indexed. The choices of parser, dictionaries and which types of tokens to 745 index are determined by the selected text search configuration (<xref 746 linkend="textsearch-configuration"/>). It is possible to have 747 many different configurations in the same database, and predefined 748 configurations are available for various languages. In our example 749 we used the default configuration <literal>english</literal> for the 750 English language. 751 </para> 752 753 <para> 754 The function <function>setweight</function> can be used to label the 755 entries of a <type>tsvector</type> with a given <firstterm>weight</firstterm>, 756 where a weight is one of the letters <literal>A</literal>, <literal>B</literal>, 757 <literal>C</literal>, or <literal>D</literal>. 758 This is typically used to mark entries coming from 759 different parts of a document, such as title versus body. Later, this 760 information can be used for ranking of search results. 761 </para> 762 763 <para> 764 Because <function>to_tsvector</function>(<literal>NULL</literal>) will 765 return <literal>NULL</literal>, it is recommended to use 766 <function>coalesce</function> whenever a field might be null. 767 Here is the recommended method for creating 768 a <type>tsvector</type> from a structured document: 769 770<programlisting> 771UPDATE tt SET ti = 772 setweight(to_tsvector(coalesce(title,'')), 'A') || 773 setweight(to_tsvector(coalesce(keyword,'')), 'B') || 774 setweight(to_tsvector(coalesce(abstract,'')), 'C') || 775 setweight(to_tsvector(coalesce(body,'')), 'D'); 776</programlisting> 777 778 Here we have used <function>setweight</function> to label the source 779 of each lexeme in the finished <type>tsvector</type>, and then merged 780 the labeled <type>tsvector</type> values using the <type>tsvector</type> 781 concatenation operator <literal>||</literal>. (<xref 782 linkend="textsearch-manipulate-tsvector"/> gives details about these 783 operations.) 784 </para> 785 786 </sect2> 787 788 <sect2 id="textsearch-parsing-queries"> 789 <title>Parsing Queries</title> 790 791 <para> 792 <productname>PostgreSQL</productname> provides the 793 functions <function>to_tsquery</function>, 794 <function>plainto_tsquery</function>, 795 <function>phraseto_tsquery</function> and 796 <function>websearch_to_tsquery</function> 797 for converting a query to the <type>tsquery</type> data type. 798 <function>to_tsquery</function> offers access to more features 799 than either <function>plainto_tsquery</function> or 800 <function>phraseto_tsquery</function>, but it is less forgiving about its 801 input. <function>websearch_to_tsquery</function> is a simplified version 802 of <function>to_tsquery</function> with an alternative syntax, similar 803 to the one used by web search engines. 804 </para> 805 806 <indexterm> 807 <primary>to_tsquery</primary> 808 </indexterm> 809 810<synopsis> 811to_tsquery(<optional> <replaceable class="parameter">config</replaceable> <type>regconfig</type>, </optional> <replaceable class="parameter">querytext</replaceable> <type>text</type>) returns <type>tsquery</type> 812</synopsis> 813 814 <para> 815 <function>to_tsquery</function> creates a <type>tsquery</type> value from 816 <replaceable>querytext</replaceable>, which must consist of single tokens 817 separated by the <type>tsquery</type> operators <literal>&</literal> (AND), 818 <literal>|</literal> (OR), <literal>!</literal> (NOT), and 819 <literal><-></literal> (FOLLOWED BY), possibly grouped 820 using parentheses. In other words, the input to 821 <function>to_tsquery</function> must already follow the general rules for 822 <type>tsquery</type> input, as described in <xref 823 linkend="datatype-tsquery"/>. The difference is that while basic 824 <type>tsquery</type> input takes the tokens at face value, 825 <function>to_tsquery</function> normalizes each token into a lexeme using 826 the specified or default configuration, and discards any tokens that are 827 stop words according to the configuration. For example: 828 829<screen> 830SELECT to_tsquery('english', 'The & Fat & Rats'); 831 to_tsquery 832--------------- 833 'fat' & 'rat' 834</screen> 835 836 As in basic <type>tsquery</type> input, weight(s) can be attached to each 837 lexeme to restrict it to match only <type>tsvector</type> lexemes of those 838 weight(s). For example: 839 840<screen> 841SELECT to_tsquery('english', 'Fat | Rats:AB'); 842 to_tsquery 843------------------ 844 'fat' | 'rat':AB 845</screen> 846 847 Also, <literal>*</literal> can be attached to a lexeme to specify prefix matching: 848 849<screen> 850SELECT to_tsquery('supern:*A & star:A*B'); 851 to_tsquery 852-------------------------- 853 'supern':*A & 'star':*AB 854</screen> 855 856 Such a lexeme will match any word in a <type>tsvector</type> that begins 857 with the given string. 858 </para> 859 860 <para> 861 <function>to_tsquery</function> can also accept single-quoted 862 phrases. This is primarily useful when the configuration includes a 863 thesaurus dictionary that may trigger on such phrases. 864 In the example below, a thesaurus contains the rule <literal>supernovae 865 stars : sn</literal>: 866 867<screen> 868SELECT to_tsquery('''supernovae stars'' & !crab'); 869 to_tsquery 870--------------- 871 'sn' & !'crab' 872</screen> 873 874 Without quotes, <function>to_tsquery</function> will generate a syntax 875 error for tokens that are not separated by an AND, OR, or FOLLOWED BY 876 operator. 877 </para> 878 879 <indexterm> 880 <primary>plainto_tsquery</primary> 881 </indexterm> 882 883<synopsis> 884plainto_tsquery(<optional> <replaceable class="parameter">config</replaceable> <type>regconfig</type>, </optional> <replaceable class="parameter">querytext</replaceable> <type>text</type>) returns <type>tsquery</type> 885</synopsis> 886 887 <para> 888 <function>plainto_tsquery</function> transforms the unformatted text 889 <replaceable>querytext</replaceable> to a <type>tsquery</type> value. 890 The text is parsed and normalized much as for <function>to_tsvector</function>, 891 then the <literal>&</literal> (AND) <type>tsquery</type> operator is 892 inserted between surviving words. 893 </para> 894 895 <para> 896 Example: 897 898<screen> 899SELECT plainto_tsquery('english', 'The Fat Rats'); 900 plainto_tsquery 901----------------- 902 'fat' & 'rat' 903</screen> 904 905 Note that <function>plainto_tsquery</function> will not 906 recognize <type>tsquery</type> operators, weight labels, 907 or prefix-match labels in its input: 908 909<screen> 910SELECT plainto_tsquery('english', 'The Fat & Rats:C'); 911 plainto_tsquery 912--------------------- 913 'fat' & 'rat' & 'c' 914</screen> 915 916 Here, all the input punctuation was discarded as being space symbols. 917 </para> 918 919 <indexterm> 920 <primary>phraseto_tsquery</primary> 921 </indexterm> 922 923<synopsis> 924phraseto_tsquery(<optional> <replaceable class="parameter">config</replaceable> <type>regconfig</type>, </optional> <replaceable class="parameter">querytext</replaceable> <type>text</type>) returns <type>tsquery</type> 925</synopsis> 926 927 <para> 928 <function>phraseto_tsquery</function> behaves much like 929 <function>plainto_tsquery</function>, except that it inserts 930 the <literal><-></literal> (FOLLOWED BY) operator between 931 surviving words instead of the <literal>&</literal> (AND) operator. 932 Also, stop words are not simply discarded, but are accounted for by 933 inserting <literal><<replaceable>N</replaceable>></literal> operators rather 934 than <literal><-></literal> operators. This function is useful 935 when searching for exact lexeme sequences, since the FOLLOWED BY 936 operators check lexeme order not just the presence of all the lexemes. 937 </para> 938 939 <para> 940 Example: 941 942<screen> 943SELECT phraseto_tsquery('english', 'The Fat Rats'); 944 phraseto_tsquery 945------------------ 946 'fat' <-> 'rat' 947</screen> 948 949 Like <function>plainto_tsquery</function>, the 950 <function>phraseto_tsquery</function> function will not 951 recognize <type>tsquery</type> operators, weight labels, 952 or prefix-match labels in its input: 953 954<screen> 955SELECT phraseto_tsquery('english', 'The Fat & Rats:C'); 956 phraseto_tsquery 957----------------------------- 958 'fat' <-> 'rat' <-> 'c' 959</screen> 960 </para> 961 962<synopsis> 963websearch_to_tsquery(<optional> <replaceable class="parameter">config</replaceable> <type>regconfig</type>, </optional> <replaceable class="parameter">querytext</replaceable> <type>text</type>) returns <type>tsquery</type> 964</synopsis> 965 966 <para> 967 <function>websearch_to_tsquery</function> creates a <type>tsquery</type> 968 value from <replaceable>querytext</replaceable> using an alternative 969 syntax in which simple unformatted text is a valid query. 970 Unlike <function>plainto_tsquery</function> 971 and <function>phraseto_tsquery</function>, it also recognizes certain 972 operators. Moreover, this function should never raise syntax errors, 973 which makes it possible to use raw user-supplied input for search. 974 The following syntax is supported: 975 <itemizedlist spacing="compact" mark="bullet"> 976 <listitem> 977 <para> 978 <literal>unquoted text</literal>: text not inside quote marks will be 979 converted to terms separated by <literal>&</literal> operators, as 980 if processed by 981 <function>plainto_tsquery</function>. 982 </para> 983 </listitem> 984 <listitem> 985 <para> 986 <literal>"quoted text"</literal>: text inside quote marks will be 987 converted to terms separated by <literal><-></literal> 988 operators, as if processed by <function>phraseto_tsquery</function>. 989 </para> 990 </listitem> 991 <listitem> 992 <para> 993 <literal>OR</literal>: logical or will be converted to 994 the <literal>|</literal> operator. 995 </para> 996 </listitem> 997 <listitem> 998 <para> 999 <literal>-</literal>: the logical not operator, converted to the 1000 the <literal>!</literal> operator. 1001 </para> 1002 </listitem> 1003 </itemizedlist> 1004 </para> 1005 <para> 1006 Examples: 1007<screen> 1008SELECT websearch_to_tsquery('english', 'The fat rats'); 1009 websearch_to_tsquery 1010---------------------- 1011 'fat' & 'rat' 1012(1 row) 1013 1014SELECT websearch_to_tsquery('english', '"supernovae stars" -crab'); 1015 websearch_to_tsquery 1016---------------------------------- 1017 'supernova' <-> 'star' & !'crab' 1018(1 row) 1019 1020SELECT websearch_to_tsquery('english', '"sad cat" or "fat rat"'); 1021 websearch_to_tsquery 1022----------------------------------- 1023 'sad' <-> 'cat' | 'fat' <-> 'rat' 1024(1 row) 1025 1026SELECT websearch_to_tsquery('english', 'signal -"segmentation fault"'); 1027 websearch_to_tsquery 1028--------------------------------------- 1029 'signal' & !( 'segment' <-> 'fault' ) 1030(1 row) 1031 1032SELECT websearch_to_tsquery('english', '""" )( dummy \\ query <->'); 1033 websearch_to_tsquery 1034---------------------- 1035 'dummi' & 'queri' 1036(1 row) 1037</screen> 1038 </para> 1039 </sect2> 1040 1041 <sect2 id="textsearch-ranking"> 1042 <title>Ranking Search Results</title> 1043 1044 <para> 1045 Ranking attempts to measure how relevant documents are to a particular 1046 query, so that when there are many matches the most relevant ones can be 1047 shown first. <productname>PostgreSQL</productname> provides two 1048 predefined ranking functions, which take into account lexical, proximity, 1049 and structural information; that is, they consider how often the query 1050 terms appear in the document, how close together the terms are in the 1051 document, and how important is the part of the document where they occur. 1052 However, the concept of relevancy is vague and very application-specific. 1053 Different applications might require additional information for ranking, 1054 e.g., document modification time. The built-in ranking functions are only 1055 examples. You can write your own ranking functions and/or combine their 1056 results with additional factors to fit your specific needs. 1057 </para> 1058 1059 <para> 1060 The two ranking functions currently available are: 1061 1062 <variablelist> 1063 1064 <varlistentry> 1065 1066 <term> 1067 <indexterm> 1068 <primary>ts_rank</primary> 1069 </indexterm> 1070 1071 <literal>ts_rank(<optional> <replaceable class="parameter">weights</replaceable> <type>float4[]</type>, </optional> <replaceable class="parameter">vector</replaceable> <type>tsvector</type>, <replaceable class="parameter">query</replaceable> <type>tsquery</type> <optional>, <replaceable class="parameter">normalization</replaceable> <type>integer</type> </optional>) returns <type>float4</type></literal> 1072 </term> 1073 1074 <listitem> 1075 <para> 1076 Ranks vectors based on the frequency of their matching lexemes. 1077 </para> 1078 </listitem> 1079 </varlistentry> 1080 1081 <varlistentry> 1082 1083 <term> 1084 <indexterm> 1085 <primary>ts_rank_cd</primary> 1086 </indexterm> 1087 1088 <literal>ts_rank_cd(<optional> <replaceable class="parameter">weights</replaceable> <type>float4[]</type>, </optional> <replaceable class="parameter">vector</replaceable> <type>tsvector</type>, <replaceable class="parameter">query</replaceable> <type>tsquery</type> <optional>, <replaceable class="parameter">normalization</replaceable> <type>integer</type> </optional>) returns <type>float4</type></literal> 1089 </term> 1090 1091 <listitem> 1092 <para> 1093 This function computes the <firstterm>cover density</firstterm> 1094 ranking for the given document vector and query, as described in 1095 Clarke, Cormack, and Tudhope's "Relevance Ranking for One to Three 1096 Term Queries" in the journal "Information Processing and Management", 1097 1999. Cover density is similar to <function>ts_rank</function> ranking 1098 except that the proximity of matching lexemes to each other is 1099 taken into consideration. 1100 </para> 1101 1102 <para> 1103 This function requires lexeme positional information to perform 1104 its calculation. Therefore, it ignores any <quote>stripped</quote> 1105 lexemes in the <type>tsvector</type>. If there are no unstripped 1106 lexemes in the input, the result will be zero. (See <xref 1107 linkend="textsearch-manipulate-tsvector"/> for more information 1108 about the <function>strip</function> function and positional information 1109 in <type>tsvector</type>s.) 1110 </para> 1111 </listitem> 1112 </varlistentry> 1113 1114 </variablelist> 1115 1116 </para> 1117 1118 <para> 1119 For both these functions, 1120 the optional <replaceable class="parameter">weights</replaceable> 1121 argument offers the ability to weigh word instances more or less 1122 heavily depending on how they are labeled. The weight arrays specify 1123 how heavily to weigh each category of word, in the order: 1124 1125<synopsis> 1126{D-weight, C-weight, B-weight, A-weight} 1127</synopsis> 1128 1129 If no <replaceable class="parameter">weights</replaceable> are provided, 1130 then these defaults are used: 1131 1132<programlisting> 1133{0.1, 0.2, 0.4, 1.0} 1134</programlisting> 1135 1136 Typically weights are used to mark words from special areas of the 1137 document, like the title or an initial abstract, so they can be 1138 treated with more or less importance than words in the document body. 1139 </para> 1140 1141 <para> 1142 Since a longer document has a greater chance of containing a query term 1143 it is reasonable to take into account document size, e.g., a hundred-word 1144 document with five instances of a search word is probably more relevant 1145 than a thousand-word document with five instances. Both ranking functions 1146 take an integer <replaceable>normalization</replaceable> option that 1147 specifies whether and how a document's length should impact its rank. 1148 The integer option controls several behaviors, so it is a bit mask: 1149 you can specify one or more behaviors using 1150 <literal>|</literal> (for example, <literal>2|4</literal>). 1151 1152 <itemizedlist spacing="compact" mark="bullet"> 1153 <listitem> 1154 <para> 1155 0 (the default) ignores the document length 1156 </para> 1157 </listitem> 1158 <listitem> 1159 <para> 1160 1 divides the rank by 1 + the logarithm of the document length 1161 </para> 1162 </listitem> 1163 <listitem> 1164 <para> 1165 2 divides the rank by the document length 1166 </para> 1167 </listitem> 1168 <listitem> 1169 <para> 1170 4 divides the rank by the mean harmonic distance between extents 1171 (this is implemented only by <function>ts_rank_cd</function>) 1172 </para> 1173 </listitem> 1174 <listitem> 1175 <para> 1176 8 divides the rank by the number of unique words in document 1177 </para> 1178 </listitem> 1179 <listitem> 1180 <para> 1181 16 divides the rank by 1 + the logarithm of the number 1182 of unique words in document 1183 </para> 1184 </listitem> 1185 <listitem> 1186 <para> 1187 32 divides the rank by itself + 1 1188 </para> 1189 </listitem> 1190 </itemizedlist> 1191 1192 If more than one flag bit is specified, the transformations are 1193 applied in the order listed. 1194 </para> 1195 1196 <para> 1197 It is important to note that the ranking functions do not use any global 1198 information, so it is impossible to produce a fair normalization to 1% or 1199 100% as sometimes desired. Normalization option 32 1200 (<literal>rank/(rank+1)</literal>) can be applied to scale all ranks 1201 into the range zero to one, but of course this is just a cosmetic change; 1202 it will not affect the ordering of the search results. 1203 </para> 1204 1205 <para> 1206 Here is an example that selects only the ten highest-ranked matches: 1207 1208<screen> 1209SELECT title, ts_rank_cd(textsearch, query) AS rank 1210FROM apod, to_tsquery('neutrino|(dark & matter)') query 1211WHERE query @@ textsearch 1212ORDER BY rank DESC 1213LIMIT 10; 1214 title | rank 1215-----------------------------------------------+---------- 1216 Neutrinos in the Sun | 3.1 1217 The Sudbury Neutrino Detector | 2.4 1218 A MACHO View of Galactic Dark Matter | 2.01317 1219 Hot Gas and Dark Matter | 1.91171 1220 The Virgo Cluster: Hot Plasma and Dark Matter | 1.90953 1221 Rafting for Solar Neutrinos | 1.9 1222 NGC 4650A: Strange Galaxy and Dark Matter | 1.85774 1223 Hot Gas and Dark Matter | 1.6123 1224 Ice Fishing for Cosmic Neutrinos | 1.6 1225 Weak Lensing Distorts the Universe | 0.818218 1226</screen> 1227 1228 This is the same example using normalized ranking: 1229 1230<screen> 1231SELECT title, ts_rank_cd(textsearch, query, 32 /* rank/(rank+1) */ ) AS rank 1232FROM apod, to_tsquery('neutrino|(dark & matter)') query 1233WHERE query @@ textsearch 1234ORDER BY rank DESC 1235LIMIT 10; 1236 title | rank 1237-----------------------------------------------+------------------- 1238 Neutrinos in the Sun | 0.756097569485493 1239 The Sudbury Neutrino Detector | 0.705882361190954 1240 A MACHO View of Galactic Dark Matter | 0.668123210574724 1241 Hot Gas and Dark Matter | 0.65655958650282 1242 The Virgo Cluster: Hot Plasma and Dark Matter | 0.656301290640973 1243 Rafting for Solar Neutrinos | 0.655172410958162 1244 NGC 4650A: Strange Galaxy and Dark Matter | 0.650072921219637 1245 Hot Gas and Dark Matter | 0.617195790024749 1246 Ice Fishing for Cosmic Neutrinos | 0.615384618911517 1247 Weak Lensing Distorts the Universe | 0.450010798361481 1248</screen> 1249 </para> 1250 1251 <para> 1252 Ranking can be expensive since it requires consulting the 1253 <type>tsvector</type> of each matching document, which can be I/O bound and 1254 therefore slow. Unfortunately, it is almost impossible to avoid since 1255 practical queries often result in large numbers of matches. 1256 </para> 1257 1258 </sect2> 1259 1260 <sect2 id="textsearch-headline"> 1261 <title>Highlighting Results</title> 1262 1263 <para> 1264 To present search results it is ideal to show a part of each document and 1265 how it is related to the query. Usually, search engines show fragments of 1266 the document with marked search terms. <productname>PostgreSQL</productname> 1267 provides a function <function>ts_headline</function> that 1268 implements this functionality. 1269 </para> 1270 1271 <indexterm> 1272 <primary>ts_headline</primary> 1273 </indexterm> 1274 1275<synopsis> 1276ts_headline(<optional> <replaceable class="parameter">config</replaceable> <type>regconfig</type>, </optional> <replaceable class="parameter">document</replaceable> <type>text</type>, <replaceable class="parameter">query</replaceable> <type>tsquery</type> <optional>, <replaceable class="parameter">options</replaceable> <type>text</type> </optional>) returns <type>text</type> 1277</synopsis> 1278 1279 <para> 1280 <function>ts_headline</function> accepts a document along 1281 with a query, and returns an excerpt from 1282 the document in which terms from the query are highlighted. The 1283 configuration to be used to parse the document can be specified by 1284 <replaceable>config</replaceable>; if <replaceable>config</replaceable> 1285 is omitted, the 1286 <varname>default_text_search_config</varname> configuration is used. 1287 </para> 1288 1289 <para> 1290 If an <replaceable>options</replaceable> string is specified it must 1291 consist of a comma-separated list of one or more 1292 <replaceable>option</replaceable><literal>=</literal><replaceable>value</replaceable> pairs. 1293 The available options are: 1294 1295 <itemizedlist spacing="compact" mark="bullet"> 1296 <listitem> 1297 <para> 1298 <literal>MaxWords</literal>, <literal>MinWords</literal> (integers): 1299 these numbers determine the longest and shortest headlines to output. 1300 The default values are 35 and 15. 1301 </para> 1302 </listitem> 1303 <listitem> 1304 <para> 1305 <literal>ShortWord</literal> (integer): words of this length or less 1306 will be dropped at the start and end of a headline, unless they are 1307 query terms. The default value of three eliminates common English 1308 articles. 1309 </para> 1310 </listitem> 1311 <listitem> 1312 <para> 1313 <literal>HighlightAll</literal> (boolean): if 1314 <literal>true</literal> the whole document will be used as the 1315 headline, ignoring the preceding three parameters. The default 1316 is <literal>false</literal>. 1317 </para> 1318 </listitem> 1319 <listitem> 1320 <para> 1321 <literal>MaxFragments</literal> (integer): maximum number of text 1322 fragments to display. The default value of zero selects a 1323 non-fragment-based headline generation method. A value greater 1324 than zero selects fragment-based headline generation (see below). 1325 </para> 1326 </listitem> 1327 <listitem> 1328 <para> 1329 <literal>StartSel</literal>, <literal>StopSel</literal> (strings): 1330 the strings with which to delimit query words appearing in the 1331 document, to distinguish them from other excerpted words. The 1332 default values are <quote><literal><b></literal></quote> and 1333 <quote><literal></b></literal></quote>, which can be suitable 1334 for HTML output. 1335 </para> 1336 </listitem> 1337 <listitem> 1338 <para> 1339 <literal>FragmentDelimiter</literal> (string): When more than one 1340 fragment is displayed, the fragments will be separated by this string. 1341 The default is <quote><literal> ... </literal></quote>. 1342 </para> 1343 </listitem> 1344 </itemizedlist> 1345 1346 These option names are recognized case-insensitively. 1347 You must double-quote string values if they contain spaces or commas. 1348 </para> 1349 1350 <para> 1351 In non-fragment-based headline 1352 generation, <function>ts_headline</function> locates matches for the 1353 given <replaceable class="parameter">query</replaceable> and chooses a 1354 single one to display, preferring matches that have more query words 1355 within the allowed headline length. 1356 In fragment-based headline generation, <function>ts_headline</function> 1357 locates the query matches and splits each match 1358 into <quote>fragments</quote> of no more than <literal>MaxWords</literal> 1359 words each, preferring fragments with more query words, and when 1360 possible <quote>stretching</quote> fragments to include surrounding 1361 words. The fragment-based mode is thus more useful when the query 1362 matches span large sections of the document, or when it's desirable to 1363 display multiple matches. 1364 In either mode, if no query matches can be identified, then a single 1365 fragment of the first <literal>MinWords</literal> words in the document 1366 will be displayed. 1367 </para> 1368 1369 <para> 1370 For example: 1371 1372<screen> 1373SELECT ts_headline('english', 1374 'The most common type of search 1375is to find all documents containing given query terms 1376and return them in order of their similarity to the 1377query.', 1378 to_tsquery('english', 'query & similarity')); 1379 ts_headline 1380------------------------------------------------------------ 1381 containing given <b>query</b> terms + 1382 and return them in order of their <b>similarity</b> to the+ 1383 <b>query</b>. 1384 1385SELECT ts_headline('english', 1386 'Search terms may occur 1387many times in a document, 1388requiring ranking of the search matches to decide which 1389occurrences to display in the result.', 1390 to_tsquery('english', 'search & term'), 1391 'MaxFragments=10, MaxWords=7, MinWords=3, StartSel=<<, StopSel=>>'); 1392 ts_headline 1393------------------------------------------------------------ 1394 <<Search>> <<terms>> may occur + 1395 many times ... ranking of the <<search>> matches to decide 1396</screen> 1397 </para> 1398 1399 <para> 1400 <function>ts_headline</function> uses the original document, not a 1401 <type>tsvector</type> summary, so it can be slow and should be used with 1402 care. 1403 </para> 1404 1405 </sect2> 1406 1407 </sect1> 1408 1409 <sect1 id="textsearch-features"> 1410 <title>Additional Features</title> 1411 1412 <para> 1413 This section describes additional functions and operators that are 1414 useful in connection with text search. 1415 </para> 1416 1417 <sect2 id="textsearch-manipulate-tsvector"> 1418 <title>Manipulating Documents</title> 1419 1420 <para> 1421 <xref linkend="textsearch-parsing-documents"/> showed how raw textual 1422 documents can be converted into <type>tsvector</type> values. 1423 <productname>PostgreSQL</productname> also provides functions and 1424 operators that can be used to manipulate documents that are already 1425 in <type>tsvector</type> form. 1426 </para> 1427 1428 <variablelist> 1429 1430 <varlistentry> 1431 1432 <term> 1433 <indexterm> 1434 <primary>tsvector concatenation</primary> 1435 </indexterm> 1436 1437 <literal><type>tsvector</type> || <type>tsvector</type></literal> 1438 </term> 1439 1440 <listitem> 1441 <para> 1442 The <type>tsvector</type> concatenation operator 1443 returns a vector which combines the lexemes and positional information 1444 of the two vectors given as arguments. Positions and weight labels 1445 are retained during the concatenation. 1446 Positions appearing in the right-hand vector are offset by the largest 1447 position mentioned in the left-hand vector, so that the result is 1448 nearly equivalent to the result of performing <function>to_tsvector</function> 1449 on the concatenation of the two original document strings. (The 1450 equivalence is not exact, because any stop-words removed from the 1451 end of the left-hand argument will not affect the result, whereas 1452 they would have affected the positions of the lexemes in the 1453 right-hand argument if textual concatenation were used.) 1454 </para> 1455 1456 <para> 1457 One advantage of using concatenation in the vector form, rather than 1458 concatenating text before applying <function>to_tsvector</function>, is that 1459 you can use different configurations to parse different sections 1460 of the document. Also, because the <function>setweight</function> function 1461 marks all lexemes of the given vector the same way, it is necessary 1462 to parse the text and do <function>setweight</function> before concatenating 1463 if you want to label different parts of the document with different 1464 weights. 1465 </para> 1466 </listitem> 1467 </varlistentry> 1468 1469 <varlistentry> 1470 1471 <term> 1472 <indexterm> 1473 <primary>setweight</primary> 1474 </indexterm> 1475 1476 <literal>setweight(<replaceable class="parameter">vector</replaceable> <type>tsvector</type>, <replaceable class="parameter">weight</replaceable> <type>"char"</type>) returns <type>tsvector</type></literal> 1477 </term> 1478 1479 <listitem> 1480 <para> 1481 <function>setweight</function> returns a copy of the input vector in which every 1482 position has been labeled with the given <replaceable>weight</replaceable>, either 1483 <literal>A</literal>, <literal>B</literal>, <literal>C</literal>, or 1484 <literal>D</literal>. (<literal>D</literal> is the default for new 1485 vectors and as such is not displayed on output.) These labels are 1486 retained when vectors are concatenated, allowing words from different 1487 parts of a document to be weighted differently by ranking functions. 1488 </para> 1489 1490 <para> 1491 Note that weight labels apply to <emphasis>positions</emphasis>, not 1492 <emphasis>lexemes</emphasis>. If the input vector has been stripped of 1493 positions then <function>setweight</function> does nothing. 1494 </para> 1495 </listitem> 1496 </varlistentry> 1497 1498 <varlistentry> 1499 <term> 1500 <indexterm> 1501 <primary>length(tsvector)</primary> 1502 </indexterm> 1503 1504 <literal>length(<replaceable class="parameter">vector</replaceable> <type>tsvector</type>) returns <type>integer</type></literal> 1505 </term> 1506 1507 <listitem> 1508 <para> 1509 Returns the number of lexemes stored in the vector. 1510 </para> 1511 </listitem> 1512 </varlistentry> 1513 1514 <varlistentry> 1515 1516 <term> 1517 <indexterm> 1518 <primary>strip</primary> 1519 </indexterm> 1520 1521 <literal>strip(<replaceable class="parameter">vector</replaceable> <type>tsvector</type>) returns <type>tsvector</type></literal> 1522 </term> 1523 1524 <listitem> 1525 <para> 1526 Returns a vector that lists the same lexemes as the given vector, but 1527 lacks any position or weight information. The result is usually much 1528 smaller than an unstripped vector, but it is also less useful. 1529 Relevance ranking does not work as well on stripped vectors as 1530 unstripped ones. Also, 1531 the <literal><-></literal> (FOLLOWED BY) <type>tsquery</type> operator 1532 will never match stripped input, since it cannot determine the 1533 distance between lexeme occurrences. 1534 </para> 1535 </listitem> 1536 1537 </varlistentry> 1538 1539 </variablelist> 1540 1541 <para> 1542 A full list of <type>tsvector</type>-related functions is available 1543 in <xref linkend="textsearch-functions-table"/>. 1544 </para> 1545 1546 </sect2> 1547 1548 <sect2 id="textsearch-manipulate-tsquery"> 1549 <title>Manipulating Queries</title> 1550 1551 <para> 1552 <xref linkend="textsearch-parsing-queries"/> showed how raw textual 1553 queries can be converted into <type>tsquery</type> values. 1554 <productname>PostgreSQL</productname> also provides functions and 1555 operators that can be used to manipulate queries that are already 1556 in <type>tsquery</type> form. 1557 </para> 1558 1559 <variablelist> 1560 1561 <varlistentry> 1562 1563 <term> 1564 <literal><type>tsquery</type> && <type>tsquery</type></literal> 1565 </term> 1566 1567 <listitem> 1568 <para> 1569 Returns the AND-combination of the two given queries. 1570 </para> 1571 </listitem> 1572 1573 </varlistentry> 1574 1575 <varlistentry> 1576 1577 <term> 1578 <literal><type>tsquery</type> || <type>tsquery</type></literal> 1579 </term> 1580 1581 <listitem> 1582 <para> 1583 Returns the OR-combination of the two given queries. 1584 </para> 1585 </listitem> 1586 1587 </varlistentry> 1588 1589 <varlistentry> 1590 1591 <term> 1592 <literal>!! <type>tsquery</type></literal> 1593 </term> 1594 1595 <listitem> 1596 <para> 1597 Returns the negation (NOT) of the given query. 1598 </para> 1599 </listitem> 1600 1601 </varlistentry> 1602 1603 <varlistentry> 1604 1605 <term> 1606 <literal><type>tsquery</type> <-> <type>tsquery</type></literal> 1607 </term> 1608 1609 <listitem> 1610 <para> 1611 Returns a query that searches for a match to the first given query 1612 immediately followed by a match to the second given query, using 1613 the <literal><-></literal> (FOLLOWED BY) 1614 <type>tsquery</type> operator. For example: 1615 1616<screen> 1617SELECT to_tsquery('fat') <-> to_tsquery('cat | rat'); 1618 ?column? 1619---------------------------- 1620 'fat' <-> ( 'cat' | 'rat' ) 1621</screen> 1622 </para> 1623 </listitem> 1624 1625 </varlistentry> 1626 1627 <varlistentry> 1628 1629 <term> 1630 <indexterm> 1631 <primary>tsquery_phrase</primary> 1632 </indexterm> 1633 1634 <literal>tsquery_phrase(<replaceable class="parameter">query1</replaceable> <type>tsquery</type>, <replaceable class="parameter">query2</replaceable> <type>tsquery</type> [, <replaceable class="parameter">distance</replaceable> <type>integer</type> ]) returns <type>tsquery</type></literal> 1635 </term> 1636 1637 <listitem> 1638 <para> 1639 Returns a query that searches for a match to the first given query 1640 followed by a match to the second given query at a distance of exactly 1641 <replaceable>distance</replaceable> lexemes, using 1642 the <literal><<replaceable>N</replaceable>></literal> 1643 <type>tsquery</type> operator. For example: 1644 1645<screen> 1646SELECT tsquery_phrase(to_tsquery('fat'), to_tsquery('cat'), 10); 1647 tsquery_phrase 1648------------------ 1649 'fat' <10> 'cat' 1650</screen> 1651 </para> 1652 </listitem> 1653 1654 </varlistentry> 1655 1656 <varlistentry> 1657 1658 <term> 1659 <indexterm> 1660 <primary>numnode</primary> 1661 </indexterm> 1662 1663 <literal>numnode(<replaceable class="parameter">query</replaceable> <type>tsquery</type>) returns <type>integer</type></literal> 1664 </term> 1665 1666 <listitem> 1667 <para> 1668 Returns the number of nodes (lexemes plus operators) in a 1669 <type>tsquery</type>. This function is useful 1670 to determine if the <replaceable>query</replaceable> is meaningful 1671 (returns > 0), or contains only stop words (returns 0). 1672 Examples: 1673 1674<screen> 1675SELECT numnode(plainto_tsquery('the any')); 1676NOTICE: query contains only stopword(s) or doesn't contain lexeme(s), ignored 1677 numnode 1678--------- 1679 0 1680 1681SELECT numnode('foo & bar'::tsquery); 1682 numnode 1683--------- 1684 3 1685</screen> 1686 </para> 1687 </listitem> 1688 </varlistentry> 1689 1690 <varlistentry> 1691 1692 <term> 1693 <indexterm> 1694 <primary>querytree</primary> 1695 </indexterm> 1696 1697 <literal>querytree(<replaceable class="parameter">query</replaceable> <type>tsquery</type>) returns <type>text</type></literal> 1698 </term> 1699 1700 <listitem> 1701 <para> 1702 Returns the portion of a <type>tsquery</type> that can be used for 1703 searching an index. This function is useful for detecting 1704 unindexable queries, for example those containing only stop words 1705 or only negated terms. For example: 1706 1707<screen> 1708SELECT querytree(to_tsquery('!defined')); 1709 querytree 1710----------- 1711 1712</screen> 1713 </para> 1714 </listitem> 1715 </varlistentry> 1716 1717 </variablelist> 1718 1719 <sect3 id="textsearch-query-rewriting"> 1720 <title>Query Rewriting</title> 1721 1722 <indexterm zone="textsearch-query-rewriting"> 1723 <primary>ts_rewrite</primary> 1724 </indexterm> 1725 1726 <para> 1727 The <function>ts_rewrite</function> family of functions search a 1728 given <type>tsquery</type> for occurrences of a target 1729 subquery, and replace each occurrence with a 1730 substitute subquery. In essence this operation is a 1731 <type>tsquery</type>-specific version of substring replacement. 1732 A target and substitute combination can be 1733 thought of as a <firstterm>query rewrite rule</firstterm>. A collection 1734 of such rewrite rules can be a powerful search aid. 1735 For example, you can expand the search using synonyms 1736 (e.g., <literal>new york</literal>, <literal>big apple</literal>, <literal>nyc</literal>, 1737 <literal>gotham</literal>) or narrow the search to direct the user to some hot 1738 topic. There is some overlap in functionality between this feature 1739 and thesaurus dictionaries (<xref linkend="textsearch-thesaurus"/>). 1740 However, you can modify a set of rewrite rules on-the-fly without 1741 reindexing, whereas updating a thesaurus requires reindexing to be 1742 effective. 1743 </para> 1744 1745 <variablelist> 1746 1747 <varlistentry> 1748 1749 <term> 1750 <literal>ts_rewrite (<replaceable class="parameter">query</replaceable> <type>tsquery</type>, <replaceable class="parameter">target</replaceable> <type>tsquery</type>, <replaceable class="parameter">substitute</replaceable> <type>tsquery</type>) returns <type>tsquery</type></literal> 1751 </term> 1752 1753 <listitem> 1754 <para> 1755 This form of <function>ts_rewrite</function> simply applies a single 1756 rewrite rule: <replaceable class="parameter">target</replaceable> 1757 is replaced by <replaceable class="parameter">substitute</replaceable> 1758 wherever it appears in <replaceable 1759 class="parameter">query</replaceable>. For example: 1760 1761<screen> 1762SELECT ts_rewrite('a & b'::tsquery, 'a'::tsquery, 'c'::tsquery); 1763 ts_rewrite 1764------------ 1765 'b' & 'c' 1766</screen> 1767 </para> 1768 </listitem> 1769 </varlistentry> 1770 1771 <varlistentry> 1772 1773 <term> 1774 <literal>ts_rewrite (<replaceable class="parameter">query</replaceable> <type>tsquery</type>, <replaceable class="parameter">select</replaceable> <type>text</type>) returns <type>tsquery</type></literal> 1775 </term> 1776 1777 <listitem> 1778 <para> 1779 This form of <function>ts_rewrite</function> accepts a starting 1780 <replaceable>query</replaceable> and a SQL <replaceable>select</replaceable> command, which 1781 is given as a text string. The <replaceable>select</replaceable> must yield two 1782 columns of <type>tsquery</type> type. For each row of the 1783 <replaceable>select</replaceable> result, occurrences of the first column value 1784 (the target) are replaced by the second column value (the substitute) 1785 within the current <replaceable>query</replaceable> value. For example: 1786 1787<screen> 1788CREATE TABLE aliases (t tsquery PRIMARY KEY, s tsquery); 1789INSERT INTO aliases VALUES('a', 'c'); 1790 1791SELECT ts_rewrite('a & b'::tsquery, 'SELECT t,s FROM aliases'); 1792 ts_rewrite 1793------------ 1794 'b' & 'c' 1795</screen> 1796 </para> 1797 1798 <para> 1799 Note that when multiple rewrite rules are applied in this way, 1800 the order of application can be important; so in practice you will 1801 want the source query to <literal>ORDER BY</literal> some ordering key. 1802 </para> 1803 </listitem> 1804 </varlistentry> 1805 1806 </variablelist> 1807 1808 <para> 1809 Let's consider a real-life astronomical example. We'll expand query 1810 <literal>supernovae</literal> using table-driven rewriting rules: 1811 1812<screen> 1813CREATE TABLE aliases (t tsquery primary key, s tsquery); 1814INSERT INTO aliases VALUES(to_tsquery('supernovae'), to_tsquery('supernovae|sn')); 1815 1816SELECT ts_rewrite(to_tsquery('supernovae & crab'), 'SELECT * FROM aliases'); 1817 ts_rewrite 1818--------------------------------- 1819 'crab' & ( 'supernova' | 'sn' ) 1820</screen> 1821 1822 We can change the rewriting rules just by updating the table: 1823 1824<screen> 1825UPDATE aliases 1826SET s = to_tsquery('supernovae|sn & !nebulae') 1827WHERE t = to_tsquery('supernovae'); 1828 1829SELECT ts_rewrite(to_tsquery('supernovae & crab'), 'SELECT * FROM aliases'); 1830 ts_rewrite 1831--------------------------------------------- 1832 'crab' & ( 'supernova' | 'sn' & !'nebula' ) 1833</screen> 1834 </para> 1835 1836 <para> 1837 Rewriting can be slow when there are many rewriting rules, since it 1838 checks every rule for a possible match. To filter out obvious non-candidate 1839 rules we can use the containment operators for the <type>tsquery</type> 1840 type. In the example below, we select only those rules which might match 1841 the original query: 1842 1843<screen> 1844SELECT ts_rewrite('a & b'::tsquery, 1845 'SELECT t,s FROM aliases WHERE ''a & b''::tsquery @> t'); 1846 ts_rewrite 1847------------ 1848 'b' & 'c' 1849</screen> 1850 </para> 1851 1852 </sect3> 1853 1854 </sect2> 1855 1856 <sect2 id="textsearch-update-triggers"> 1857 <title>Triggers for Automatic Updates</title> 1858 1859 <indexterm> 1860 <primary>trigger</primary> 1861 <secondary>for updating a derived tsvector column</secondary> 1862 </indexterm> 1863 1864 <note> 1865 <para> 1866 The method described in this section has been obsoleted by the use of 1867 stored generated columns, as described in <xref 1868 linkend="textsearch-tables-index"/>. 1869 </para> 1870 </note> 1871 1872 <para> 1873 When using a separate column to store the <type>tsvector</type> representation 1874 of your documents, it is necessary to create a trigger to update the 1875 <type>tsvector</type> column when the document content columns change. 1876 Two built-in trigger functions are available for this, or you can write 1877 your own. 1878 </para> 1879 1880<synopsis> 1881tsvector_update_trigger(<replaceable class="parameter">tsvector_column_name</replaceable>, <replaceable class="parameter">config_name</replaceable>, <replaceable class="parameter">text_column_name</replaceable> <optional>, ... </optional>) 1882tsvector_update_trigger_column(<replaceable class="parameter">tsvector_column_name</replaceable>, <replaceable class="parameter">config_column_name</replaceable>, <replaceable class="parameter">text_column_name</replaceable> <optional>, ... </optional>) 1883</synopsis> 1884 1885 <para> 1886 These trigger functions automatically compute a <type>tsvector</type> 1887 column from one or more textual columns, under the control of 1888 parameters specified in the <command>CREATE TRIGGER</command> command. 1889 An example of their use is: 1890 1891<screen> 1892CREATE TABLE messages ( 1893 title text, 1894 body text, 1895 tsv tsvector 1896); 1897 1898CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE 1899ON messages FOR EACH ROW EXECUTE FUNCTION 1900tsvector_update_trigger(tsv, 'pg_catalog.english', title, body); 1901 1902INSERT INTO messages VALUES('title here', 'the body text is here'); 1903 1904SELECT * FROM messages; 1905 title | body | tsv 1906------------+-----------------------+---------------------------- 1907 title here | the body text is here | 'bodi':4 'text':5 'titl':1 1908 1909SELECT title, body FROM messages WHERE tsv @@ to_tsquery('title & body'); 1910 title | body 1911------------+----------------------- 1912 title here | the body text is here 1913</screen> 1914 1915 Having created this trigger, any change in <structfield>title</structfield> or 1916 <structfield>body</structfield> will automatically be reflected into 1917 <structfield>tsv</structfield>, without the application having to worry about it. 1918 </para> 1919 1920 <para> 1921 The first trigger argument must be the name of the <type>tsvector</type> 1922 column to be updated. The second argument specifies the text search 1923 configuration to be used to perform the conversion. For 1924 <function>tsvector_update_trigger</function>, the configuration name is simply 1925 given as the second trigger argument. It must be schema-qualified as 1926 shown above, so that the trigger behavior will not change with changes 1927 in <varname>search_path</varname>. For 1928 <function>tsvector_update_trigger_column</function>, the second trigger argument 1929 is the name of another table column, which must be of type 1930 <type>regconfig</type>. This allows a per-row selection of configuration 1931 to be made. The remaining argument(s) are the names of textual columns 1932 (of type <type>text</type>, <type>varchar</type>, or <type>char</type>). These 1933 will be included in the document in the order given. NULL values will 1934 be skipped (but the other columns will still be indexed). 1935 </para> 1936 1937 <para> 1938 A limitation of these built-in triggers is that they treat all the 1939 input columns alike. To process columns differently — for 1940 example, to weight title differently from body — it is necessary 1941 to write a custom trigger. Here is an example using 1942 <application>PL/pgSQL</application> as the trigger language: 1943 1944<programlisting> 1945CREATE FUNCTION messages_trigger() RETURNS trigger AS $$ 1946begin 1947 new.tsv := 1948 setweight(to_tsvector('pg_catalog.english', coalesce(new.title,'')), 'A') || 1949 setweight(to_tsvector('pg_catalog.english', coalesce(new.body,'')), 'D'); 1950 return new; 1951end 1952$$ LANGUAGE plpgsql; 1953 1954CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE 1955 ON messages FOR EACH ROW EXECUTE FUNCTION messages_trigger(); 1956</programlisting> 1957 </para> 1958 1959 <para> 1960 Keep in mind that it is important to specify the configuration name 1961 explicitly when creating <type>tsvector</type> values inside triggers, 1962 so that the column's contents will not be affected by changes to 1963 <varname>default_text_search_config</varname>. Failure to do this is likely to 1964 lead to problems such as search results changing after a dump and reload. 1965 </para> 1966 1967 </sect2> 1968 1969 <sect2 id="textsearch-statistics"> 1970 <title>Gathering Document Statistics</title> 1971 1972 <indexterm> 1973 <primary>ts_stat</primary> 1974 </indexterm> 1975 1976 <para> 1977 The function <function>ts_stat</function> is useful for checking your 1978 configuration and for finding stop-word candidates. 1979 </para> 1980 1981<synopsis> 1982ts_stat(<replaceable class="parameter">sqlquery</replaceable> <type>text</type>, <optional> <replaceable class="parameter">weights</replaceable> <type>text</type>, </optional> 1983 OUT <replaceable class="parameter">word</replaceable> <type>text</type>, OUT <replaceable class="parameter">ndoc</replaceable> <type>integer</type>, 1984 OUT <replaceable class="parameter">nentry</replaceable> <type>integer</type>) returns <type>setof record</type> 1985</synopsis> 1986 1987 <para> 1988 <replaceable>sqlquery</replaceable> is a text value containing an SQL 1989 query which must return a single <type>tsvector</type> column. 1990 <function>ts_stat</function> executes the query and returns statistics about 1991 each distinct lexeme (word) contained in the <type>tsvector</type> 1992 data. The columns returned are 1993 1994 <itemizedlist spacing="compact" mark="bullet"> 1995 <listitem> 1996 <para> 1997 <replaceable>word</replaceable> <type>text</type> — the value of a lexeme 1998 </para> 1999 </listitem> 2000 <listitem> 2001 <para> 2002 <replaceable>ndoc</replaceable> <type>integer</type> — number of documents 2003 (<type>tsvector</type>s) the word occurred in 2004 </para> 2005 </listitem> 2006 <listitem> 2007 <para> 2008 <replaceable>nentry</replaceable> <type>integer</type> — total number of 2009 occurrences of the word 2010 </para> 2011 </listitem> 2012 </itemizedlist> 2013 2014 If <replaceable>weights</replaceable> is supplied, only occurrences 2015 having one of those weights are counted. 2016 </para> 2017 2018 <para> 2019 For example, to find the ten most frequent words in a document collection: 2020 2021<programlisting> 2022SELECT * FROM ts_stat('SELECT vector FROM apod') 2023ORDER BY nentry DESC, ndoc DESC, word 2024LIMIT 10; 2025</programlisting> 2026 2027 The same, but counting only word occurrences with weight <literal>A</literal> 2028 or <literal>B</literal>: 2029 2030<programlisting> 2031SELECT * FROM ts_stat('SELECT vector FROM apod', 'ab') 2032ORDER BY nentry DESC, ndoc DESC, word 2033LIMIT 10; 2034</programlisting> 2035 </para> 2036 2037 </sect2> 2038 2039 </sect1> 2040 2041 <sect1 id="textsearch-parsers"> 2042 <title>Parsers</title> 2043 2044 <para> 2045 Text search parsers are responsible for splitting raw document text 2046 into <firstterm>tokens</firstterm> and identifying each token's type, where 2047 the set of possible types is defined by the parser itself. 2048 Note that a parser does not modify the text at all — it simply 2049 identifies plausible word boundaries. Because of this limited scope, 2050 there is less need for application-specific custom parsers than there is 2051 for custom dictionaries. At present <productname>PostgreSQL</productname> 2052 provides just one built-in parser, which has been found to be useful for a 2053 wide range of applications. 2054 </para> 2055 2056 <para> 2057 The built-in parser is named <literal>pg_catalog.default</literal>. 2058 It recognizes 23 token types, shown in <xref linkend="textsearch-default-parser"/>. 2059 </para> 2060 2061 <table id="textsearch-default-parser"> 2062 <title>Default Parser's Token Types</title> 2063 <tgroup cols="3"> 2064 <thead> 2065 <row> 2066 <entry>Alias</entry> 2067 <entry>Description</entry> 2068 <entry>Example</entry> 2069 </row> 2070 </thead> 2071 <tbody> 2072 <row> 2073 <entry><literal>asciiword</literal></entry> 2074 <entry>Word, all ASCII letters</entry> 2075 <entry><literal>elephant</literal></entry> 2076 </row> 2077 <row> 2078 <entry><literal>word</literal></entry> 2079 <entry>Word, all letters</entry> 2080 <entry><literal>mañana</literal></entry> 2081 </row> 2082 <row> 2083 <entry><literal>numword</literal></entry> 2084 <entry>Word, letters and digits</entry> 2085 <entry><literal>beta1</literal></entry> 2086 </row> 2087 <row> 2088 <entry><literal>asciihword</literal></entry> 2089 <entry>Hyphenated word, all ASCII</entry> 2090 <entry><literal>up-to-date</literal></entry> 2091 </row> 2092 <row> 2093 <entry><literal>hword</literal></entry> 2094 <entry>Hyphenated word, all letters</entry> 2095 <entry><literal>lógico-matemática</literal></entry> 2096 </row> 2097 <row> 2098 <entry><literal>numhword</literal></entry> 2099 <entry>Hyphenated word, letters and digits</entry> 2100 <entry><literal>postgresql-beta1</literal></entry> 2101 </row> 2102 <row> 2103 <entry><literal>hword_asciipart</literal></entry> 2104 <entry>Hyphenated word part, all ASCII</entry> 2105 <entry><literal>postgresql</literal> in the context <literal>postgresql-beta1</literal></entry> 2106 </row> 2107 <row> 2108 <entry><literal>hword_part</literal></entry> 2109 <entry>Hyphenated word part, all letters</entry> 2110 <entry><literal>lógico</literal> or <literal>matemática</literal> 2111 in the context <literal>lógico-matemática</literal></entry> 2112 </row> 2113 <row> 2114 <entry><literal>hword_numpart</literal></entry> 2115 <entry>Hyphenated word part, letters and digits</entry> 2116 <entry><literal>beta1</literal> in the context 2117 <literal>postgresql-beta1</literal></entry> 2118 </row> 2119 <row> 2120 <entry><literal>email</literal></entry> 2121 <entry>Email address</entry> 2122 <entry><literal>foo@example.com</literal></entry> 2123 </row> 2124 <row> 2125 <entry><literal>protocol</literal></entry> 2126 <entry>Protocol head</entry> 2127 <entry><literal>http://</literal></entry> 2128 </row> 2129 <row> 2130 <entry><literal>url</literal></entry> 2131 <entry>URL</entry> 2132 <entry><literal>example.com/stuff/index.html</literal></entry> 2133 </row> 2134 <row> 2135 <entry><literal>host</literal></entry> 2136 <entry>Host</entry> 2137 <entry><literal>example.com</literal></entry> 2138 </row> 2139 <row> 2140 <entry><literal>url_path</literal></entry> 2141 <entry>URL path</entry> 2142 <entry><literal>/stuff/index.html</literal>, in the context of a URL</entry> 2143 </row> 2144 <row> 2145 <entry><literal>file</literal></entry> 2146 <entry>File or path name</entry> 2147 <entry><literal>/usr/local/foo.txt</literal>, if not within a URL</entry> 2148 </row> 2149 <row> 2150 <entry><literal>sfloat</literal></entry> 2151 <entry>Scientific notation</entry> 2152 <entry><literal>-1.234e56</literal></entry> 2153 </row> 2154 <row> 2155 <entry><literal>float</literal></entry> 2156 <entry>Decimal notation</entry> 2157 <entry><literal>-1.234</literal></entry> 2158 </row> 2159 <row> 2160 <entry><literal>int</literal></entry> 2161 <entry>Signed integer</entry> 2162 <entry><literal>-1234</literal></entry> 2163 </row> 2164 <row> 2165 <entry><literal>uint</literal></entry> 2166 <entry>Unsigned integer</entry> 2167 <entry><literal>1234</literal></entry> 2168 </row> 2169 <row> 2170 <entry><literal>version</literal></entry> 2171 <entry>Version number</entry> 2172 <entry><literal>8.3.0</literal></entry> 2173 </row> 2174 <row> 2175 <entry><literal>tag</literal></entry> 2176 <entry>XML tag</entry> 2177 <entry><literal><a href="dictionaries.html"></literal></entry> 2178 </row> 2179 <row> 2180 <entry><literal>entity</literal></entry> 2181 <entry>XML entity</entry> 2182 <entry><literal>&amp;</literal></entry> 2183 </row> 2184 <row> 2185 <entry><literal>blank</literal></entry> 2186 <entry>Space symbols</entry> 2187 <entry>(any whitespace or punctuation not otherwise recognized)</entry> 2188 </row> 2189 </tbody> 2190 </tgroup> 2191 </table> 2192 2193 <note> 2194 <para> 2195 The parser's notion of a <quote>letter</quote> is determined by the database's 2196 locale setting, specifically <varname>lc_ctype</varname>. Words containing 2197 only the basic ASCII letters are reported as a separate token type, 2198 since it is sometimes useful to distinguish them. In most European 2199 languages, token types <literal>word</literal> and <literal>asciiword</literal> 2200 should be treated alike. 2201 </para> 2202 2203 <para> 2204 <literal>email</literal> does not support all valid email characters as 2205 defined by RFC 5322. Specifically, the only non-alphanumeric 2206 characters supported for email user names are period, dash, and 2207 underscore. 2208 </para> 2209 </note> 2210 2211 <para> 2212 It is possible for the parser to produce overlapping tokens from the same 2213 piece of text. As an example, a hyphenated word will be reported both 2214 as the entire word and as each component: 2215 2216<screen> 2217SELECT alias, description, token FROM ts_debug('foo-bar-beta1'); 2218 alias | description | token 2219-----------------+------------------------------------------+--------------- 2220 numhword | Hyphenated word, letters and digits | foo-bar-beta1 2221 hword_asciipart | Hyphenated word part, all ASCII | foo 2222 blank | Space symbols | - 2223 hword_asciipart | Hyphenated word part, all ASCII | bar 2224 blank | Space symbols | - 2225 hword_numpart | Hyphenated word part, letters and digits | beta1 2226</screen> 2227 2228 This behavior is desirable since it allows searches to work for both 2229 the whole compound word and for components. Here is another 2230 instructive example: 2231 2232<screen> 2233SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.html'); 2234 alias | description | token 2235----------+---------------+------------------------------ 2236 protocol | Protocol head | http:// 2237 url | URL | example.com/stuff/index.html 2238 host | Host | example.com 2239 url_path | URL path | /stuff/index.html 2240</screen> 2241 </para> 2242 2243 </sect1> 2244 2245 <sect1 id="textsearch-dictionaries"> 2246 <title>Dictionaries</title> 2247 2248 <para> 2249 Dictionaries are used to eliminate words that should not be considered in a 2250 search (<firstterm>stop words</firstterm>), and to <firstterm>normalize</firstterm> words so 2251 that different derived forms of the same word will match. A successfully 2252 normalized word is called a <firstterm>lexeme</firstterm>. Aside from 2253 improving search quality, normalization and removal of stop words reduce the 2254 size of the <type>tsvector</type> representation of a document, thereby 2255 improving performance. Normalization does not always have linguistic meaning 2256 and usually depends on application semantics. 2257 </para> 2258 2259 <para> 2260 Some examples of normalization: 2261 2262 <itemizedlist spacing="compact" mark="bullet"> 2263 2264 <listitem> 2265 <para> 2266 Linguistic - Ispell dictionaries try to reduce input words to a 2267 normalized form; stemmer dictionaries remove word endings 2268 </para> 2269 </listitem> 2270 <listitem> 2271 <para> 2272 <acronym>URL</acronym> locations can be canonicalized to make 2273 equivalent URLs match: 2274 2275 <itemizedlist spacing="compact" mark="bullet"> 2276 <listitem> 2277 <para> 2278 http://www.pgsql.ru/db/mw/index.html 2279 </para> 2280 </listitem> 2281 <listitem> 2282 <para> 2283 http://www.pgsql.ru/db/mw/ 2284 </para> 2285 </listitem> 2286 <listitem> 2287 <para> 2288 http://www.pgsql.ru/db/../db/mw/index.html 2289 </para> 2290 </listitem> 2291 </itemizedlist> 2292 </para> 2293 </listitem> 2294 <listitem> 2295 <para> 2296 Color names can be replaced by their hexadecimal values, e.g., 2297 <literal>red, green, blue, magenta -> FF0000, 00FF00, 0000FF, FF00FF</literal> 2298 </para> 2299 </listitem> 2300 <listitem> 2301 <para> 2302 If indexing numbers, we can 2303 remove some fractional digits to reduce the range of possible 2304 numbers, so for example <emphasis>3.14</emphasis>159265359, 2305 <emphasis>3.14</emphasis>15926, <emphasis>3.14</emphasis> will be the same 2306 after normalization if only two digits are kept after the decimal point. 2307 </para> 2308 </listitem> 2309 </itemizedlist> 2310 2311 </para> 2312 2313 <para> 2314 A dictionary is a program that accepts a token as 2315 input and returns: 2316 <itemizedlist spacing="compact" mark="bullet"> 2317 <listitem> 2318 <para> 2319 an array of lexemes if the input token is known to the dictionary 2320 (notice that one token can produce more than one lexeme) 2321 </para> 2322 </listitem> 2323 <listitem> 2324 <para> 2325 a single lexeme with the <literal>TSL_FILTER</literal> flag set, to replace 2326 the original token with a new token to be passed to subsequent 2327 dictionaries (a dictionary that does this is called a 2328 <firstterm>filtering dictionary</firstterm>) 2329 </para> 2330 </listitem> 2331 <listitem> 2332 <para> 2333 an empty array if the dictionary knows the token, but it is a stop word 2334 </para> 2335 </listitem> 2336 <listitem> 2337 <para> 2338 <literal>NULL</literal> if the dictionary does not recognize the input token 2339 </para> 2340 </listitem> 2341 </itemizedlist> 2342 </para> 2343 2344 <para> 2345 <productname>PostgreSQL</productname> provides predefined dictionaries for 2346 many languages. There are also several predefined templates that can be 2347 used to create new dictionaries with custom parameters. Each predefined 2348 dictionary template is described below. If no existing 2349 template is suitable, it is possible to create new ones; see the 2350 <filename>contrib/</filename> area of the <productname>PostgreSQL</productname> distribution 2351 for examples. 2352 </para> 2353 2354 <para> 2355 A text search configuration binds a parser together with a set of 2356 dictionaries to process the parser's output tokens. For each token 2357 type that the parser can return, a separate list of dictionaries is 2358 specified by the configuration. When a token of that type is found 2359 by the parser, each dictionary in the list is consulted in turn, 2360 until some dictionary recognizes it as a known word. If it is identified 2361 as a stop word, or if no dictionary recognizes the token, it will be 2362 discarded and not indexed or searched for. 2363 Normally, the first dictionary that returns a non-<literal>NULL</literal> 2364 output determines the result, and any remaining dictionaries are not 2365 consulted; but a filtering dictionary can replace the given word 2366 with a modified word, which is then passed to subsequent dictionaries. 2367 </para> 2368 2369 <para> 2370 The general rule for configuring a list of dictionaries 2371 is to place first the most narrow, most specific dictionary, then the more 2372 general dictionaries, finishing with a very general dictionary, like 2373 a <application>Snowball</application> stemmer or <literal>simple</literal>, which 2374 recognizes everything. For example, for an astronomy-specific search 2375 (<literal>astro_en</literal> configuration) one could bind token type 2376 <type>asciiword</type> (ASCII word) to a synonym dictionary of astronomical 2377 terms, a general English dictionary and a <application>Snowball</application> English 2378 stemmer: 2379 2380<programlisting> 2381ALTER TEXT SEARCH CONFIGURATION astro_en 2382 ADD MAPPING FOR asciiword WITH astrosyn, english_ispell, english_stem; 2383</programlisting> 2384 </para> 2385 2386 <para> 2387 A filtering dictionary can be placed anywhere in the list, except at the 2388 end where it'd be useless. Filtering dictionaries are useful to partially 2389 normalize words to simplify the task of later dictionaries. For example, 2390 a filtering dictionary could be used to remove accents from accented 2391 letters, as is done by the <xref linkend="unaccent"/> module. 2392 </para> 2393 2394 <sect2 id="textsearch-stopwords"> 2395 <title>Stop Words</title> 2396 2397 <para> 2398 Stop words are words that are very common, appear in almost every 2399 document, and have no discrimination value. Therefore, they can be ignored 2400 in the context of full text searching. For example, every English text 2401 contains words like <literal>a</literal> and <literal>the</literal>, so it is 2402 useless to store them in an index. However, stop words do affect the 2403 positions in <type>tsvector</type>, which in turn affect ranking: 2404 2405<screen> 2406SELECT to_tsvector('english', 'in the list of stop words'); 2407 to_tsvector 2408---------------------------- 2409 'list':3 'stop':5 'word':6 2410</screen> 2411 2412 The missing positions 1,2,4 are because of stop words. Ranks 2413 calculated for documents with and without stop words are quite different: 2414 2415<screen> 2416SELECT ts_rank_cd (to_tsvector('english', 'in the list of stop words'), to_tsquery('list & stop')); 2417 ts_rank_cd 2418------------ 2419 0.05 2420 2421SELECT ts_rank_cd (to_tsvector('english', 'list stop words'), to_tsquery('list & stop')); 2422 ts_rank_cd 2423------------ 2424 0.1 2425</screen> 2426 2427 </para> 2428 2429 <para> 2430 It is up to the specific dictionary how it treats stop words. For example, 2431 <literal>ispell</literal> dictionaries first normalize words and then 2432 look at the list of stop words, while <literal>Snowball</literal> stemmers 2433 first check the list of stop words. The reason for the different 2434 behavior is an attempt to decrease noise. 2435 </para> 2436 2437 </sect2> 2438 2439 <sect2 id="textsearch-simple-dictionary"> 2440 <title>Simple Dictionary</title> 2441 2442 <para> 2443 The <literal>simple</literal> dictionary template operates by converting the 2444 input token to lower case and checking it against a file of stop words. 2445 If it is found in the file then an empty array is returned, causing 2446 the token to be discarded. If not, the lower-cased form of the word 2447 is returned as the normalized lexeme. Alternatively, the dictionary 2448 can be configured to report non-stop-words as unrecognized, allowing 2449 them to be passed on to the next dictionary in the list. 2450 </para> 2451 2452 <para> 2453 Here is an example of a dictionary definition using the <literal>simple</literal> 2454 template: 2455 2456<programlisting> 2457CREATE TEXT SEARCH DICTIONARY public.simple_dict ( 2458 TEMPLATE = pg_catalog.simple, 2459 STOPWORDS = english 2460); 2461</programlisting> 2462 2463 Here, <literal>english</literal> is the base name of a file of stop words. 2464 The file's full name will be 2465 <filename>$SHAREDIR/tsearch_data/english.stop</filename>, 2466 where <literal>$SHAREDIR</literal> means the 2467 <productname>PostgreSQL</productname> installation's shared-data directory, 2468 often <filename>/usr/local/share/postgresql</filename> (use <command>pg_config 2469 --sharedir</command> to determine it if you're not sure). 2470 The file format is simply a list 2471 of words, one per line. Blank lines and trailing spaces are ignored, 2472 and upper case is folded to lower case, but no other processing is done 2473 on the file contents. 2474 </para> 2475 2476 <para> 2477 Now we can test our dictionary: 2478 2479<screen> 2480SELECT ts_lexize('public.simple_dict', 'YeS'); 2481 ts_lexize 2482----------- 2483 {yes} 2484 2485SELECT ts_lexize('public.simple_dict', 'The'); 2486 ts_lexize 2487----------- 2488 {} 2489</screen> 2490 </para> 2491 2492 <para> 2493 We can also choose to return <literal>NULL</literal>, instead of the lower-cased 2494 word, if it is not found in the stop words file. This behavior is 2495 selected by setting the dictionary's <literal>Accept</literal> parameter to 2496 <literal>false</literal>. Continuing the example: 2497 2498<screen> 2499ALTER TEXT SEARCH DICTIONARY public.simple_dict ( Accept = false ); 2500 2501SELECT ts_lexize('public.simple_dict', 'YeS'); 2502 ts_lexize 2503----------- 2504 2505 2506SELECT ts_lexize('public.simple_dict', 'The'); 2507 ts_lexize 2508----------- 2509 {} 2510</screen> 2511 </para> 2512 2513 <para> 2514 With the default setting of <literal>Accept</literal> = <literal>true</literal>, 2515 it is only useful to place a <literal>simple</literal> dictionary at the end 2516 of a list of dictionaries, since it will never pass on any token to 2517 a following dictionary. Conversely, <literal>Accept</literal> = <literal>false</literal> 2518 is only useful when there is at least one following dictionary. 2519 </para> 2520 2521 <caution> 2522 <para> 2523 Most types of dictionaries rely on configuration files, such as files of 2524 stop words. These files <emphasis>must</emphasis> be stored in UTF-8 encoding. 2525 They will be translated to the actual database encoding, if that is 2526 different, when they are read into the server. 2527 </para> 2528 </caution> 2529 2530 <caution> 2531 <para> 2532 Normally, a database session will read a dictionary configuration file 2533 only once, when it is first used within the session. If you modify a 2534 configuration file and want to force existing sessions to pick up the 2535 new contents, issue an <command>ALTER TEXT SEARCH DICTIONARY</command> command 2536 on the dictionary. This can be a <quote>dummy</quote> update that doesn't 2537 actually change any parameter values. 2538 </para> 2539 </caution> 2540 2541 </sect2> 2542 2543 <sect2 id="textsearch-synonym-dictionary"> 2544 <title>Synonym Dictionary</title> 2545 2546 <para> 2547 This dictionary template is used to create dictionaries that replace a 2548 word with a synonym. Phrases are not supported (use the thesaurus 2549 template (<xref linkend="textsearch-thesaurus"/>) for that). A synonym 2550 dictionary can be used to overcome linguistic problems, for example, to 2551 prevent an English stemmer dictionary from reducing the word <quote>Paris</quote> to 2552 <quote>pari</quote>. It is enough to have a <literal>Paris paris</literal> line in the 2553 synonym dictionary and put it before the <literal>english_stem</literal> 2554 dictionary. For example: 2555 2556<screen> 2557SELECT * FROM ts_debug('english', 'Paris'); 2558 alias | description | token | dictionaries | dictionary | lexemes 2559-----------+-----------------+-------+----------------+--------------+--------- 2560 asciiword | Word, all ASCII | Paris | {english_stem} | english_stem | {pari} 2561 2562CREATE TEXT SEARCH DICTIONARY my_synonym ( 2563 TEMPLATE = synonym, 2564 SYNONYMS = my_synonyms 2565); 2566 2567ALTER TEXT SEARCH CONFIGURATION english 2568 ALTER MAPPING FOR asciiword 2569 WITH my_synonym, english_stem; 2570 2571SELECT * FROM ts_debug('english', 'Paris'); 2572 alias | description | token | dictionaries | dictionary | lexemes 2573-----------+-----------------+-------+---------------------------+------------+--------- 2574 asciiword | Word, all ASCII | Paris | {my_synonym,english_stem} | my_synonym | {paris} 2575</screen> 2576 </para> 2577 2578 <para> 2579 The only parameter required by the <literal>synonym</literal> template is 2580 <literal>SYNONYMS</literal>, which is the base name of its configuration file 2581 — <literal>my_synonyms</literal> in the above example. 2582 The file's full name will be 2583 <filename>$SHAREDIR/tsearch_data/my_synonyms.syn</filename> 2584 (where <literal>$SHAREDIR</literal> means the 2585 <productname>PostgreSQL</productname> installation's shared-data directory). 2586 The file format is just one line 2587 per word to be substituted, with the word followed by its synonym, 2588 separated by white space. Blank lines and trailing spaces are ignored. 2589 </para> 2590 2591 <para> 2592 The <literal>synonym</literal> template also has an optional parameter 2593 <literal>CaseSensitive</literal>, which defaults to <literal>false</literal>. When 2594 <literal>CaseSensitive</literal> is <literal>false</literal>, words in the synonym file 2595 are folded to lower case, as are input tokens. When it is 2596 <literal>true</literal>, words and tokens are not folded to lower case, 2597 but are compared as-is. 2598 </para> 2599 2600 <para> 2601 An asterisk (<literal>*</literal>) can be placed at the end of a synonym 2602 in the configuration file. This indicates that the synonym is a prefix. 2603 The asterisk is ignored when the entry is used in 2604 <function>to_tsvector()</function>, but when it is used in 2605 <function>to_tsquery()</function>, the result will be a query item with 2606 the prefix match marker (see 2607 <xref linkend="textsearch-parsing-queries"/>). 2608 For example, suppose we have these entries in 2609 <filename>$SHAREDIR/tsearch_data/synonym_sample.syn</filename>: 2610<programlisting> 2611postgres pgsql 2612postgresql pgsql 2613postgre pgsql 2614gogle googl 2615indices index* 2616</programlisting> 2617 Then we will get these results: 2618<screen> 2619mydb=# CREATE TEXT SEARCH DICTIONARY syn (template=synonym, synonyms='synonym_sample'); 2620mydb=# SELECT ts_lexize('syn', 'indices'); 2621 ts_lexize 2622----------- 2623 {index} 2624(1 row) 2625 2626mydb=# CREATE TEXT SEARCH CONFIGURATION tst (copy=simple); 2627mydb=# ALTER TEXT SEARCH CONFIGURATION tst ALTER MAPPING FOR asciiword WITH syn; 2628mydb=# SELECT to_tsvector('tst', 'indices'); 2629 to_tsvector 2630------------- 2631 'index':1 2632(1 row) 2633 2634mydb=# SELECT to_tsquery('tst', 'indices'); 2635 to_tsquery 2636------------ 2637 'index':* 2638(1 row) 2639 2640mydb=# SELECT 'indexes are very useful'::tsvector; 2641 tsvector 2642--------------------------------- 2643 'are' 'indexes' 'useful' 'very' 2644(1 row) 2645 2646mydb=# SELECT 'indexes are very useful'::tsvector @@ to_tsquery('tst', 'indices'); 2647 ?column? 2648---------- 2649 t 2650(1 row) 2651</screen> 2652 </para> 2653 </sect2> 2654 2655 <sect2 id="textsearch-thesaurus"> 2656 <title>Thesaurus Dictionary</title> 2657 2658 <para> 2659 A thesaurus dictionary (sometimes abbreviated as <acronym>TZ</acronym>) is 2660 a collection of words that includes information about the relationships 2661 of words and phrases, i.e., broader terms (<acronym>BT</acronym>), narrower 2662 terms (<acronym>NT</acronym>), preferred terms, non-preferred terms, related 2663 terms, etc. 2664 </para> 2665 2666 <para> 2667 Basically a thesaurus dictionary replaces all non-preferred terms by one 2668 preferred term and, optionally, preserves the original terms for indexing 2669 as well. <productname>PostgreSQL</productname>'s current implementation of the 2670 thesaurus dictionary is an extension of the synonym dictionary with added 2671 <firstterm>phrase</firstterm> support. A thesaurus dictionary requires 2672 a configuration file of the following format: 2673 2674<programlisting> 2675# this is a comment 2676sample word(s) : indexed word(s) 2677more sample word(s) : more indexed word(s) 2678... 2679</programlisting> 2680 2681 where the colon (<symbol>:</symbol>) symbol acts as a delimiter between a 2682 phrase and its replacement. 2683 </para> 2684 2685 <para> 2686 A thesaurus dictionary uses a <firstterm>subdictionary</firstterm> (which 2687 is specified in the dictionary's configuration) to normalize the input 2688 text before checking for phrase matches. It is only possible to select one 2689 subdictionary. An error is reported if the subdictionary fails to 2690 recognize a word. In that case, you should remove the use of the word or 2691 teach the subdictionary about it. You can place an asterisk 2692 (<symbol>*</symbol>) at the beginning of an indexed word to skip applying 2693 the subdictionary to it, but all sample words <emphasis>must</emphasis> be known 2694 to the subdictionary. 2695 </para> 2696 2697 <para> 2698 The thesaurus dictionary chooses the longest match if there are multiple 2699 phrases matching the input, and ties are broken by using the last 2700 definition. 2701 </para> 2702 2703 <para> 2704 Specific stop words recognized by the subdictionary cannot be 2705 specified; instead use <literal>?</literal> to mark the location where any 2706 stop word can appear. For example, assuming that <literal>a</literal> and 2707 <literal>the</literal> are stop words according to the subdictionary: 2708 2709<programlisting> 2710? one ? two : swsw 2711</programlisting> 2712 2713 matches <literal>a one the two</literal> and <literal>the one a two</literal>; 2714 both would be replaced by <literal>swsw</literal>. 2715 </para> 2716 2717 <para> 2718 Since a thesaurus dictionary has the capability to recognize phrases it 2719 must remember its state and interact with the parser. A thesaurus dictionary 2720 uses these assignments to check if it should handle the next word or stop 2721 accumulation. The thesaurus dictionary must be configured 2722 carefully. For example, if the thesaurus dictionary is assigned to handle 2723 only the <literal>asciiword</literal> token, then a thesaurus dictionary 2724 definition like <literal>one 7</literal> will not work since token type 2725 <literal>uint</literal> is not assigned to the thesaurus dictionary. 2726 </para> 2727 2728 <caution> 2729 <para> 2730 Thesauruses are used during indexing so any change in the thesaurus 2731 dictionary's parameters <emphasis>requires</emphasis> reindexing. 2732 For most other dictionary types, small changes such as adding or 2733 removing stopwords does not force reindexing. 2734 </para> 2735 </caution> 2736 2737 <sect3 id="textsearch-thesaurus-config"> 2738 <title>Thesaurus Configuration</title> 2739 2740 <para> 2741 To define a new thesaurus dictionary, use the <literal>thesaurus</literal> 2742 template. For example: 2743 2744<programlisting> 2745CREATE TEXT SEARCH DICTIONARY thesaurus_simple ( 2746 TEMPLATE = thesaurus, 2747 DictFile = mythesaurus, 2748 Dictionary = pg_catalog.english_stem 2749); 2750</programlisting> 2751 2752 Here: 2753 <itemizedlist spacing="compact" mark="bullet"> 2754 <listitem> 2755 <para> 2756 <literal>thesaurus_simple</literal> is the new dictionary's name 2757 </para> 2758 </listitem> 2759 <listitem> 2760 <para> 2761 <literal>mythesaurus</literal> is the base name of the thesaurus 2762 configuration file. 2763 (Its full name will be <filename>$SHAREDIR/tsearch_data/mythesaurus.ths</filename>, 2764 where <literal>$SHAREDIR</literal> means the installation shared-data 2765 directory.) 2766 </para> 2767 </listitem> 2768 <listitem> 2769 <para> 2770 <literal>pg_catalog.english_stem</literal> is the subdictionary (here, 2771 a Snowball English stemmer) to use for thesaurus normalization. 2772 Notice that the subdictionary will have its own 2773 configuration (for example, stop words), which is not shown here. 2774 </para> 2775 </listitem> 2776 </itemizedlist> 2777 2778 Now it is possible to bind the thesaurus dictionary <literal>thesaurus_simple</literal> 2779 to the desired token types in a configuration, for example: 2780 2781<programlisting> 2782ALTER TEXT SEARCH CONFIGURATION russian 2783 ALTER MAPPING FOR asciiword, asciihword, hword_asciipart 2784 WITH thesaurus_simple; 2785</programlisting> 2786 </para> 2787 2788 </sect3> 2789 2790 <sect3 id="textsearch-thesaurus-examples"> 2791 <title>Thesaurus Example</title> 2792 2793 <para> 2794 Consider a simple astronomical thesaurus <literal>thesaurus_astro</literal>, 2795 which contains some astronomical word combinations: 2796 2797<programlisting> 2798supernovae stars : sn 2799crab nebulae : crab 2800</programlisting> 2801 2802 Below we create a dictionary and bind some token types to 2803 an astronomical thesaurus and English stemmer: 2804 2805<programlisting> 2806CREATE TEXT SEARCH DICTIONARY thesaurus_astro ( 2807 TEMPLATE = thesaurus, 2808 DictFile = thesaurus_astro, 2809 Dictionary = english_stem 2810); 2811 2812ALTER TEXT SEARCH CONFIGURATION russian 2813 ALTER MAPPING FOR asciiword, asciihword, hword_asciipart 2814 WITH thesaurus_astro, english_stem; 2815</programlisting> 2816 2817 Now we can see how it works. 2818 <function>ts_lexize</function> is not very useful for testing a thesaurus, 2819 because it treats its input as a single token. Instead we can use 2820 <function>plainto_tsquery</function> and <function>to_tsvector</function> 2821 which will break their input strings into multiple tokens: 2822 2823<screen> 2824SELECT plainto_tsquery('supernova star'); 2825 plainto_tsquery 2826----------------- 2827 'sn' 2828 2829SELECT to_tsvector('supernova star'); 2830 to_tsvector 2831------------- 2832 'sn':1 2833</screen> 2834 2835 In principle, one can use <function>to_tsquery</function> if you quote 2836 the argument: 2837 2838<screen> 2839SELECT to_tsquery('''supernova star'''); 2840 to_tsquery 2841------------ 2842 'sn' 2843</screen> 2844 2845 Notice that <literal>supernova star</literal> matches <literal>supernovae 2846 stars</literal> in <literal>thesaurus_astro</literal> because we specified 2847 the <literal>english_stem</literal> stemmer in the thesaurus definition. 2848 The stemmer removed the <literal>e</literal> and <literal>s</literal>. 2849 </para> 2850 2851 <para> 2852 To index the original phrase as well as the substitute, just include it 2853 in the right-hand part of the definition: 2854 2855<screen> 2856supernovae stars : sn supernovae stars 2857 2858SELECT plainto_tsquery('supernova star'); 2859 plainto_tsquery 2860----------------------------- 2861 'sn' & 'supernova' & 'star' 2862</screen> 2863 </para> 2864 2865 </sect3> 2866 2867 </sect2> 2868 2869 <sect2 id="textsearch-ispell-dictionary"> 2870 <title><application>Ispell</application> Dictionary</title> 2871 2872 <para> 2873 The <application>Ispell</application> dictionary template supports 2874 <firstterm>morphological dictionaries</firstterm>, which can normalize many 2875 different linguistic forms of a word into the same lexeme. For example, 2876 an English <application>Ispell</application> dictionary can match all declensions and 2877 conjugations of the search term <literal>bank</literal>, e.g., 2878 <literal>banking</literal>, <literal>banked</literal>, <literal>banks</literal>, 2879 <literal>banks'</literal>, and <literal>bank's</literal>. 2880 </para> 2881 2882 <para> 2883 The standard <productname>PostgreSQL</productname> distribution does 2884 not include any <application>Ispell</application> configuration files. 2885 Dictionaries for a large number of languages are available from <ulink 2886 url="https://www.cs.hmc.edu/~geoff/ispell.html">Ispell</ulink>. 2887 Also, some more modern dictionary file formats are supported — <ulink 2888 url="https://en.wikipedia.org/wiki/MySpell">MySpell</ulink> (OO < 2.0.1) 2889 and <ulink url="https://sourceforge.net/projects/hunspell/">Hunspell</ulink> 2890 (OO >= 2.0.2). A large list of dictionaries is available on the <ulink 2891 url="https://wiki.openoffice.org/wiki/Dictionaries">OpenOffice 2892 Wiki</ulink>. 2893 </para> 2894 2895 <para> 2896 To create an <application>Ispell</application> dictionary perform these steps: 2897 </para> 2898 <itemizedlist spacing="compact" mark="bullet"> 2899 <listitem> 2900 <para> 2901 download dictionary configuration files. <productname>OpenOffice</productname> 2902 extension files have the <filename>.oxt</filename> extension. It is necessary 2903 to extract <filename>.aff</filename> and <filename>.dic</filename> files, change 2904 extensions to <filename>.affix</filename> and <filename>.dict</filename>. For some 2905 dictionary files it is also needed to convert characters to the UTF-8 2906 encoding with commands (for example, for a Norwegian language dictionary): 2907<programlisting> 2908iconv -f ISO_8859-1 -t UTF-8 -o nn_no.affix nn_NO.aff 2909iconv -f ISO_8859-1 -t UTF-8 -o nn_no.dict nn_NO.dic 2910</programlisting> 2911 </para> 2912 </listitem> 2913 <listitem> 2914 <para> 2915 copy files to the <filename>$SHAREDIR/tsearch_data</filename> directory 2916 </para> 2917 </listitem> 2918 <listitem> 2919 <para> 2920 load files into PostgreSQL with the following command: 2921<programlisting> 2922CREATE TEXT SEARCH DICTIONARY english_hunspell ( 2923 TEMPLATE = ispell, 2924 DictFile = en_us, 2925 AffFile = en_us, 2926 Stopwords = english); 2927</programlisting> 2928 </para> 2929 </listitem> 2930 </itemizedlist> 2931 2932 <para> 2933 Here, <literal>DictFile</literal>, <literal>AffFile</literal>, and <literal>StopWords</literal> 2934 specify the base names of the dictionary, affixes, and stop-words files. 2935 The stop-words file has the same format explained above for the 2936 <literal>simple</literal> dictionary type. The format of the other files is 2937 not specified here but is available from the above-mentioned web sites. 2938 </para> 2939 2940 <para> 2941 Ispell dictionaries usually recognize a limited set of words, so they 2942 should be followed by another broader dictionary; for 2943 example, a Snowball dictionary, which recognizes everything. 2944 </para> 2945 2946 <para> 2947 The <filename>.affix</filename> file of <application>Ispell</application> has the following 2948 structure: 2949<programlisting> 2950prefixes 2951flag *A: 2952 . > RE # As in enter > reenter 2953suffixes 2954flag T: 2955 E > ST # As in late > latest 2956 [^AEIOU]Y > -Y,IEST # As in dirty > dirtiest 2957 [AEIOU]Y > EST # As in gray > grayest 2958 [^EY] > EST # As in small > smallest 2959</programlisting> 2960 </para> 2961 <para> 2962 And the <filename>.dict</filename> file has the following structure: 2963<programlisting> 2964lapse/ADGRS 2965lard/DGRS 2966large/PRTY 2967lark/MRS 2968</programlisting> 2969 </para> 2970 2971 <para> 2972 Format of the <filename>.dict</filename> file is: 2973<programlisting> 2974basic_form/affix_class_name 2975</programlisting> 2976 </para> 2977 2978 <para> 2979 In the <filename>.affix</filename> file every affix flag is described in the 2980 following format: 2981<programlisting> 2982condition > [-stripping_letters,] adding_affix 2983</programlisting> 2984 </para> 2985 2986 <para> 2987 Here, condition has a format similar to the format of regular expressions. 2988 It can use groupings <literal>[...]</literal> and <literal>[^...]</literal>. 2989 For example, <literal>[AEIOU]Y</literal> means that the last letter of the word 2990 is <literal>"y"</literal> and the penultimate letter is <literal>"a"</literal>, 2991 <literal>"e"</literal>, <literal>"i"</literal>, <literal>"o"</literal> or <literal>"u"</literal>. 2992 <literal>[^EY]</literal> means that the last letter is neither <literal>"e"</literal> 2993 nor <literal>"y"</literal>. 2994 </para> 2995 2996 <para> 2997 Ispell dictionaries support splitting compound words; 2998 a useful feature. 2999 Notice that the affix file should specify a special flag using the 3000 <literal>compoundwords controlled</literal> statement that marks dictionary 3001 words that can participate in compound formation: 3002 3003<programlisting> 3004compoundwords controlled z 3005</programlisting> 3006 3007 Here are some examples for the Norwegian language: 3008 3009<programlisting> 3010SELECT ts_lexize('norwegian_ispell', 'overbuljongterningpakkmesterassistent'); 3011 {over,buljong,terning,pakk,mester,assistent} 3012SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk'); 3013 {sjokoladefabrikk,sjokolade,fabrikk} 3014</programlisting> 3015 </para> 3016 3017 <para> 3018 <application>MySpell</application> format is a subset of <application>Hunspell</application>. 3019 The <filename>.affix</filename> file of <application>Hunspell</application> has the following 3020 structure: 3021<programlisting> 3022PFX A Y 1 3023PFX A 0 re . 3024SFX T N 4 3025SFX T 0 st e 3026SFX T y iest [^aeiou]y 3027SFX T 0 est [aeiou]y 3028SFX T 0 est [^ey] 3029</programlisting> 3030 </para> 3031 3032 <para> 3033 The first line of an affix class is the header. Fields of an affix rules are 3034 listed after the header: 3035 </para> 3036 <itemizedlist spacing="compact" mark="bullet"> 3037 <listitem> 3038 <para> 3039 parameter name (PFX or SFX) 3040 </para> 3041 </listitem> 3042 <listitem> 3043 <para> 3044 flag (name of the affix class) 3045 </para> 3046 </listitem> 3047 <listitem> 3048 <para> 3049 stripping characters from beginning (at prefix) or end (at suffix) of the 3050 word 3051 </para> 3052 </listitem> 3053 <listitem> 3054 <para> 3055 adding affix 3056 </para> 3057 </listitem> 3058 <listitem> 3059 <para> 3060 condition that has a format similar to the format of regular expressions. 3061 </para> 3062 </listitem> 3063 </itemizedlist> 3064 3065 <para> 3066 The <filename>.dict</filename> file looks like the <filename>.dict</filename> file of 3067 <application>Ispell</application>: 3068<programlisting> 3069larder/M 3070lardy/RT 3071large/RSPMYT 3072largehearted 3073</programlisting> 3074 </para> 3075 3076 <note> 3077 <para> 3078 <application>MySpell</application> does not support compound words. 3079 <application>Hunspell</application> has sophisticated support for compound words. At 3080 present, <productname>PostgreSQL</productname> implements only the basic 3081 compound word operations of Hunspell. 3082 </para> 3083 </note> 3084 3085 </sect2> 3086 3087 <sect2 id="textsearch-snowball-dictionary"> 3088 <title><application>Snowball</application> Dictionary</title> 3089 3090 <para> 3091 The <application>Snowball</application> dictionary template is based on a project 3092 by Martin Porter, inventor of the popular Porter's stemming algorithm 3093 for the English language. Snowball now provides stemming algorithms for 3094 many languages (see the <ulink url="https://snowballstem.org/">Snowball 3095 site</ulink> for more information). Each algorithm understands how to 3096 reduce common variant forms of words to a base, or stem, spelling within 3097 its language. A Snowball dictionary requires a <literal>language</literal> 3098 parameter to identify which stemmer to use, and optionally can specify a 3099 <literal>stopword</literal> file name that gives a list of words to eliminate. 3100 (<productname>PostgreSQL</productname>'s standard stopword lists are also 3101 provided by the Snowball project.) 3102 For example, there is a built-in definition equivalent to 3103 3104<programlisting> 3105CREATE TEXT SEARCH DICTIONARY english_stem ( 3106 TEMPLATE = snowball, 3107 Language = english, 3108 StopWords = english 3109); 3110</programlisting> 3111 3112 The stopword file format is the same as already explained. 3113 </para> 3114 3115 <para> 3116 A <application>Snowball</application> dictionary recognizes everything, whether 3117 or not it is able to simplify the word, so it should be placed 3118 at the end of the dictionary list. It is useless to have it 3119 before any other dictionary because a token will never pass through it to 3120 the next dictionary. 3121 </para> 3122 3123 </sect2> 3124 3125 </sect1> 3126 3127 <sect1 id="textsearch-configuration"> 3128 <title>Configuration Example</title> 3129 3130 <para> 3131 A text search configuration specifies all options necessary to transform a 3132 document into a <type>tsvector</type>: the parser to use to break text 3133 into tokens, and the dictionaries to use to transform each token into a 3134 lexeme. Every call of 3135 <function>to_tsvector</function> or <function>to_tsquery</function> 3136 needs a text search configuration to perform its processing. 3137 The configuration parameter 3138 <xref linkend="guc-default-text-search-config"/> 3139 specifies the name of the default configuration, which is the 3140 one used by text search functions if an explicit configuration 3141 parameter is omitted. 3142 It can be set in <filename>postgresql.conf</filename>, or set for an 3143 individual session using the <command>SET</command> command. 3144 </para> 3145 3146 <para> 3147 Several predefined text search configurations are available, and 3148 you can create custom configurations easily. To facilitate management 3149 of text search objects, a set of <acronym>SQL</acronym> commands 3150 is available, and there are several <application>psql</application> commands that display information 3151 about text search objects (<xref linkend="textsearch-psql"/>). 3152 </para> 3153 3154 <para> 3155 As an example we will create a configuration 3156 <literal>pg</literal>, starting by duplicating the built-in 3157 <literal>english</literal> configuration: 3158 3159<programlisting> 3160CREATE TEXT SEARCH CONFIGURATION public.pg ( COPY = pg_catalog.english ); 3161</programlisting> 3162 </para> 3163 3164 <para> 3165 We will use a PostgreSQL-specific synonym list 3166 and store it in <filename>$SHAREDIR/tsearch_data/pg_dict.syn</filename>. 3167 The file contents look like: 3168 3169<programlisting> 3170postgres pg 3171pgsql pg 3172postgresql pg 3173</programlisting> 3174 3175 We define the synonym dictionary like this: 3176 3177<programlisting> 3178CREATE TEXT SEARCH DICTIONARY pg_dict ( 3179 TEMPLATE = synonym, 3180 SYNONYMS = pg_dict 3181); 3182</programlisting> 3183 3184 Next we register the <productname>Ispell</productname> dictionary 3185 <literal>english_ispell</literal>, which has its own configuration files: 3186 3187<programlisting> 3188CREATE TEXT SEARCH DICTIONARY english_ispell ( 3189 TEMPLATE = ispell, 3190 DictFile = english, 3191 AffFile = english, 3192 StopWords = english 3193); 3194</programlisting> 3195 3196 Now we can set up the mappings for words in configuration 3197 <literal>pg</literal>: 3198 3199<programlisting> 3200ALTER TEXT SEARCH CONFIGURATION pg 3201 ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, 3202 word, hword, hword_part 3203 WITH pg_dict, english_ispell, english_stem; 3204</programlisting> 3205 3206 We choose not to index or search some token types that the built-in 3207 configuration does handle: 3208 3209<programlisting> 3210ALTER TEXT SEARCH CONFIGURATION pg 3211 DROP MAPPING FOR email, url, url_path, sfloat, float; 3212</programlisting> 3213 </para> 3214 3215 <para> 3216 Now we can test our configuration: 3217 3218<programlisting> 3219SELECT * FROM ts_debug('public.pg', ' 3220PostgreSQL, the highly scalable, SQL compliant, open source object-relational 3221database management system, is now undergoing beta testing of the next 3222version of our software. 3223'); 3224</programlisting> 3225 </para> 3226 3227 <para> 3228 The next step is to set the session to use the new configuration, which was 3229 created in the <literal>public</literal> schema: 3230 3231<screen> 3232=> \dF 3233 List of text search configurations 3234 Schema | Name | Description 3235---------+------+------------- 3236 public | pg | 3237 3238SET default_text_search_config = 'public.pg'; 3239SET 3240 3241SHOW default_text_search_config; 3242 default_text_search_config 3243---------------------------- 3244 public.pg 3245</screen> 3246 </para> 3247 3248 </sect1> 3249 3250 <sect1 id="textsearch-debugging"> 3251 <title>Testing and Debugging Text Search</title> 3252 3253 <para> 3254 The behavior of a custom text search configuration can easily become 3255 confusing. The functions described 3256 in this section are useful for testing text search objects. You can 3257 test a complete configuration, or test parsers and dictionaries separately. 3258 </para> 3259 3260 <sect2 id="textsearch-configuration-testing"> 3261 <title>Configuration Testing</title> 3262 3263 <para> 3264 The function <function>ts_debug</function> allows easy testing of a 3265 text search configuration. 3266 </para> 3267 3268 <indexterm> 3269 <primary>ts_debug</primary> 3270 </indexterm> 3271 3272<synopsis> 3273ts_debug(<optional> <replaceable class="parameter">config</replaceable> <type>regconfig</type>, </optional> <replaceable class="parameter">document</replaceable> <type>text</type>, 3274 OUT <replaceable class="parameter">alias</replaceable> <type>text</type>, 3275 OUT <replaceable class="parameter">description</replaceable> <type>text</type>, 3276 OUT <replaceable class="parameter">token</replaceable> <type>text</type>, 3277 OUT <replaceable class="parameter">dictionaries</replaceable> <type>regdictionary[]</type>, 3278 OUT <replaceable class="parameter">dictionary</replaceable> <type>regdictionary</type>, 3279 OUT <replaceable class="parameter">lexemes</replaceable> <type>text[]</type>) 3280 returns setof record 3281</synopsis> 3282 3283 <para> 3284 <function>ts_debug</function> displays information about every token of 3285 <replaceable class="parameter">document</replaceable> as produced by the 3286 parser and processed by the configured dictionaries. It uses the 3287 configuration specified by <replaceable 3288 class="parameter">config</replaceable>, 3289 or <varname>default_text_search_config</varname> if that argument is 3290 omitted. 3291 </para> 3292 3293 <para> 3294 <function>ts_debug</function> returns one row for each token identified in the text 3295 by the parser. The columns returned are 3296 3297 <itemizedlist spacing="compact" mark="bullet"> 3298 <listitem> 3299 <para> 3300 <replaceable>alias</replaceable> <type>text</type> — short name of the token type 3301 </para> 3302 </listitem> 3303 <listitem> 3304 <para> 3305 <replaceable>description</replaceable> <type>text</type> — description of the 3306 token type 3307 </para> 3308 </listitem> 3309 <listitem> 3310 <para> 3311 <replaceable>token</replaceable> <type>text</type> — text of the token 3312 </para> 3313 </listitem> 3314 <listitem> 3315 <para> 3316 <replaceable>dictionaries</replaceable> <type>regdictionary[]</type> — the 3317 dictionaries selected by the configuration for this token type 3318 </para> 3319 </listitem> 3320 <listitem> 3321 <para> 3322 <replaceable>dictionary</replaceable> <type>regdictionary</type> — the dictionary 3323 that recognized the token, or <literal>NULL</literal> if none did 3324 </para> 3325 </listitem> 3326 <listitem> 3327 <para> 3328 <replaceable>lexemes</replaceable> <type>text[]</type> — the lexeme(s) produced 3329 by the dictionary that recognized the token, or <literal>NULL</literal> if 3330 none did; an empty array (<literal>{}</literal>) means it was recognized as a 3331 stop word 3332 </para> 3333 </listitem> 3334 </itemizedlist> 3335 </para> 3336 3337 <para> 3338 Here is a simple example: 3339 3340<screen> 3341SELECT * FROM ts_debug('english', 'a fat cat sat on a mat - it ate a fat rats'); 3342 alias | description | token | dictionaries | dictionary | lexemes 3343-----------+-----------------+-------+----------------+--------------+--------- 3344 asciiword | Word, all ASCII | a | {english_stem} | english_stem | {} 3345 blank | Space symbols | | {} | | 3346 asciiword | Word, all ASCII | fat | {english_stem} | english_stem | {fat} 3347 blank | Space symbols | | {} | | 3348 asciiword | Word, all ASCII | cat | {english_stem} | english_stem | {cat} 3349 blank | Space symbols | | {} | | 3350 asciiword | Word, all ASCII | sat | {english_stem} | english_stem | {sat} 3351 blank | Space symbols | | {} | | 3352 asciiword | Word, all ASCII | on | {english_stem} | english_stem | {} 3353 blank | Space symbols | | {} | | 3354 asciiword | Word, all ASCII | a | {english_stem} | english_stem | {} 3355 blank | Space symbols | | {} | | 3356 asciiword | Word, all ASCII | mat | {english_stem} | english_stem | {mat} 3357 blank | Space symbols | | {} | | 3358 blank | Space symbols | - | {} | | 3359 asciiword | Word, all ASCII | it | {english_stem} | english_stem | {} 3360 blank | Space symbols | | {} | | 3361 asciiword | Word, all ASCII | ate | {english_stem} | english_stem | {ate} 3362 blank | Space symbols | | {} | | 3363 asciiword | Word, all ASCII | a | {english_stem} | english_stem | {} 3364 blank | Space symbols | | {} | | 3365 asciiword | Word, all ASCII | fat | {english_stem} | english_stem | {fat} 3366 blank | Space symbols | | {} | | 3367 asciiword | Word, all ASCII | rats | {english_stem} | english_stem | {rat} 3368</screen> 3369 </para> 3370 3371 <para> 3372 For a more extensive demonstration, we 3373 first create a <literal>public.english</literal> configuration and 3374 Ispell dictionary for the English language: 3375 </para> 3376 3377<programlisting> 3378CREATE TEXT SEARCH CONFIGURATION public.english ( COPY = pg_catalog.english ); 3379 3380CREATE TEXT SEARCH DICTIONARY english_ispell ( 3381 TEMPLATE = ispell, 3382 DictFile = english, 3383 AffFile = english, 3384 StopWords = english 3385); 3386 3387ALTER TEXT SEARCH CONFIGURATION public.english 3388 ALTER MAPPING FOR asciiword WITH english_ispell, english_stem; 3389</programlisting> 3390 3391<screen> 3392SELECT * FROM ts_debug('public.english', 'The Brightest supernovaes'); 3393 alias | description | token | dictionaries | dictionary | lexemes 3394-----------+-----------------+-------------+-------------------------------+----------------+------------- 3395 asciiword | Word, all ASCII | The | {english_ispell,english_stem} | english_ispell | {} 3396 blank | Space symbols | | {} | | 3397 asciiword | Word, all ASCII | Brightest | {english_ispell,english_stem} | english_ispell | {bright} 3398 blank | Space symbols | | {} | | 3399 asciiword | Word, all ASCII | supernovaes | {english_ispell,english_stem} | english_stem | {supernova} 3400</screen> 3401 3402 <para> 3403 In this example, the word <literal>Brightest</literal> was recognized by the 3404 parser as an <literal>ASCII word</literal> (alias <literal>asciiword</literal>). 3405 For this token type the dictionary list is 3406 <literal>english_ispell</literal> and 3407 <literal>english_stem</literal>. The word was recognized by 3408 <literal>english_ispell</literal>, which reduced it to the noun 3409 <literal>bright</literal>. The word <literal>supernovaes</literal> is 3410 unknown to the <literal>english_ispell</literal> dictionary so it 3411 was passed to the next dictionary, and, fortunately, was recognized (in 3412 fact, <literal>english_stem</literal> is a Snowball dictionary which 3413 recognizes everything; that is why it was placed at the end of the 3414 dictionary list). 3415 </para> 3416 3417 <para> 3418 The word <literal>The</literal> was recognized by the 3419 <literal>english_ispell</literal> dictionary as a stop word (<xref 3420 linkend="textsearch-stopwords"/>) and will not be indexed. 3421 The spaces are discarded too, since the configuration provides no 3422 dictionaries at all for them. 3423 </para> 3424 3425 <para> 3426 You can reduce the width of the output by explicitly specifying which columns 3427 you want to see: 3428 3429<screen> 3430SELECT alias, token, dictionary, lexemes 3431FROM ts_debug('public.english', 'The Brightest supernovaes'); 3432 alias | token | dictionary | lexemes 3433-----------+-------------+----------------+------------- 3434 asciiword | The | english_ispell | {} 3435 blank | | | 3436 asciiword | Brightest | english_ispell | {bright} 3437 blank | | | 3438 asciiword | supernovaes | english_stem | {supernova} 3439</screen> 3440 </para> 3441 3442 </sect2> 3443 3444 <sect2 id="textsearch-parser-testing"> 3445 <title>Parser Testing</title> 3446 3447 <para> 3448 The following functions allow direct testing of a text search parser. 3449 </para> 3450 3451 <indexterm> 3452 <primary>ts_parse</primary> 3453 </indexterm> 3454 3455<synopsis> 3456ts_parse(<replaceable class="parameter">parser_name</replaceable> <type>text</type>, <replaceable class="parameter">document</replaceable> <type>text</type>, 3457 OUT <replaceable class="parameter">tokid</replaceable> <type>integer</type>, OUT <replaceable class="parameter">token</replaceable> <type>text</type>) returns <type>setof record</type> 3458ts_parse(<replaceable class="parameter">parser_oid</replaceable> <type>oid</type>, <replaceable class="parameter">document</replaceable> <type>text</type>, 3459 OUT <replaceable class="parameter">tokid</replaceable> <type>integer</type>, OUT <replaceable class="parameter">token</replaceable> <type>text</type>) returns <type>setof record</type> 3460</synopsis> 3461 3462 <para> 3463 <function>ts_parse</function> parses the given <replaceable>document</replaceable> 3464 and returns a series of records, one for each token produced by 3465 parsing. Each record includes a <varname>tokid</varname> showing the 3466 assigned token type and a <varname>token</varname> which is the text of the 3467 token. For example: 3468 3469<screen> 3470SELECT * FROM ts_parse('default', '123 - a number'); 3471 tokid | token 3472-------+-------- 3473 22 | 123 3474 12 | 3475 12 | - 3476 1 | a 3477 12 | 3478 1 | number 3479</screen> 3480 </para> 3481 3482 <indexterm> 3483 <primary>ts_token_type</primary> 3484 </indexterm> 3485 3486<synopsis> 3487ts_token_type(<replaceable class="parameter">parser_name</replaceable> <type>text</type>, OUT <replaceable class="parameter">tokid</replaceable> <type>integer</type>, 3488 OUT <replaceable class="parameter">alias</replaceable> <type>text</type>, OUT <replaceable class="parameter">description</replaceable> <type>text</type>) returns <type>setof record</type> 3489ts_token_type(<replaceable class="parameter">parser_oid</replaceable> <type>oid</type>, OUT <replaceable class="parameter">tokid</replaceable> <type>integer</type>, 3490 OUT <replaceable class="parameter">alias</replaceable> <type>text</type>, OUT <replaceable class="parameter">description</replaceable> <type>text</type>) returns <type>setof record</type> 3491</synopsis> 3492 3493 <para> 3494 <function>ts_token_type</function> returns a table which describes each type of 3495 token the specified parser can recognize. For each token type, the table 3496 gives the integer <varname>tokid</varname> that the parser uses to label a 3497 token of that type, the <varname>alias</varname> that names the token type 3498 in configuration commands, and a short <varname>description</varname>. For 3499 example: 3500 3501<screen> 3502SELECT * FROM ts_token_type('default'); 3503 tokid | alias | description 3504-------+-----------------+------------------------------------------ 3505 1 | asciiword | Word, all ASCII 3506 2 | word | Word, all letters 3507 3 | numword | Word, letters and digits 3508 4 | email | Email address 3509 5 | url | URL 3510 6 | host | Host 3511 7 | sfloat | Scientific notation 3512 8 | version | Version number 3513 9 | hword_numpart | Hyphenated word part, letters and digits 3514 10 | hword_part | Hyphenated word part, all letters 3515 11 | hword_asciipart | Hyphenated word part, all ASCII 3516 12 | blank | Space symbols 3517 13 | tag | XML tag 3518 14 | protocol | Protocol head 3519 15 | numhword | Hyphenated word, letters and digits 3520 16 | asciihword | Hyphenated word, all ASCII 3521 17 | hword | Hyphenated word, all letters 3522 18 | url_path | URL path 3523 19 | file | File or path name 3524 20 | float | Decimal notation 3525 21 | int | Signed integer 3526 22 | uint | Unsigned integer 3527 23 | entity | XML entity 3528</screen> 3529 </para> 3530 3531 </sect2> 3532 3533 <sect2 id="textsearch-dictionary-testing"> 3534 <title>Dictionary Testing</title> 3535 3536 <para> 3537 The <function>ts_lexize</function> function facilitates dictionary testing. 3538 </para> 3539 3540 <indexterm> 3541 <primary>ts_lexize</primary> 3542 </indexterm> 3543 3544<synopsis> 3545ts_lexize(<replaceable class="parameter">dict</replaceable> <type>regdictionary</type>, <replaceable class="parameter">token</replaceable> <type>text</type>) returns <type>text[]</type> 3546</synopsis> 3547 3548 <para> 3549 <function>ts_lexize</function> returns an array of lexemes if the input 3550 <replaceable>token</replaceable> is known to the dictionary, 3551 or an empty array if the token 3552 is known to the dictionary but it is a stop word, or 3553 <literal>NULL</literal> if it is an unknown word. 3554 </para> 3555 3556 <para> 3557 Examples: 3558 3559<screen> 3560SELECT ts_lexize('english_stem', 'stars'); 3561 ts_lexize 3562----------- 3563 {star} 3564 3565SELECT ts_lexize('english_stem', 'a'); 3566 ts_lexize 3567----------- 3568 {} 3569</screen> 3570 </para> 3571 3572 <note> 3573 <para> 3574 The <function>ts_lexize</function> function expects a single 3575 <emphasis>token</emphasis>, not text. Here is a case 3576 where this can be confusing: 3577 3578<screen> 3579SELECT ts_lexize('thesaurus_astro', 'supernovae stars') is null; 3580 ?column? 3581---------- 3582 t 3583</screen> 3584 3585 The thesaurus dictionary <literal>thesaurus_astro</literal> does know the 3586 phrase <literal>supernovae stars</literal>, but <function>ts_lexize</function> 3587 fails since it does not parse the input text but treats it as a single 3588 token. Use <function>plainto_tsquery</function> or <function>to_tsvector</function> to 3589 test thesaurus dictionaries, for example: 3590 3591<screen> 3592SELECT plainto_tsquery('supernovae stars'); 3593 plainto_tsquery 3594----------------- 3595 'sn' 3596</screen> 3597 </para> 3598 </note> 3599 3600 </sect2> 3601 3602 </sect1> 3603 3604 <sect1 id="textsearch-indexes"> 3605 <title>GIN and GiST Index Types</title> 3606 3607 <indexterm zone="textsearch-indexes"> 3608 <primary>text search</primary> 3609 <secondary>indexes</secondary> 3610 </indexterm> 3611 3612 <para> 3613 There are two kinds of indexes that can be used to speed up full text 3614 searches. 3615 Note that indexes are not mandatory for full text searching, but in 3616 cases where a column is searched on a regular basis, an index is 3617 usually desirable. 3618 3619 <variablelist> 3620 3621 <varlistentry> 3622 3623 <term> 3624 <indexterm zone="textsearch-indexes"> 3625 <primary>index</primary> 3626 <secondary>GIN</secondary> 3627 <tertiary>text search</tertiary> 3628 </indexterm> 3629 3630 <literal>CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable> USING GIN (<replaceable>column</replaceable>);</literal> 3631 </term> 3632 3633 <listitem> 3634 <para> 3635 Creates a GIN (Generalized Inverted Index)-based index. 3636 The <replaceable>column</replaceable> must be of <type>tsvector</type> type. 3637 </para> 3638 </listitem> 3639 </varlistentry> 3640 3641 <varlistentry> 3642 3643 <term> 3644 <indexterm zone="textsearch-indexes"> 3645 <primary>index</primary> 3646 <secondary>GiST</secondary> 3647 <tertiary>text search</tertiary> 3648 </indexterm> 3649 3650 <literal>CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable> USING GIST (<replaceable>column</replaceable>);</literal> 3651 </term> 3652 3653 <listitem> 3654 <para> 3655 Creates a GiST (Generalized Search Tree)-based index. 3656 The <replaceable>column</replaceable> can be of <type>tsvector</type> or 3657 <type>tsquery</type> type. 3658 </para> 3659 </listitem> 3660 </varlistentry> 3661 3662 </variablelist> 3663 </para> 3664 3665 <para> 3666 GIN indexes are the preferred text search index type. As inverted 3667 indexes, they contain an index entry for each word (lexeme), with a 3668 compressed list of matching locations. Multi-word searches can find 3669 the first match, then use the index to remove rows that are lacking 3670 additional words. GIN indexes store only the words (lexemes) of 3671 <type>tsvector</type> values, and not their weight labels. Thus a table 3672 row recheck is needed when using a query that involves weights. 3673 </para> 3674 3675 <para> 3676 A GiST index is <firstterm>lossy</firstterm>, meaning that the index 3677 might produce false matches, and it is necessary 3678 to check the actual table row to eliminate such false matches. 3679 (<productname>PostgreSQL</productname> does this automatically when needed.) 3680 GiST indexes are lossy because each document is represented in the 3681 index by a fixed-length signature. The signature is generated by hashing 3682 each word into a single bit in an n-bit string, with all these bits OR-ed 3683 together to produce an n-bit document signature. When two words hash to 3684 the same bit position there will be a false match. If all words in 3685 the query have matches (real or false) then the table row must be 3686 retrieved to see if the match is correct. 3687 </para> 3688 3689 <para> 3690 A GiST index can be covering, i.e., use the <literal>INCLUDE</literal> 3691 clause. Included columns can have data types without any GiST operator 3692 class. Included attributes will be stored uncompressed. 3693 </para> 3694 3695 <para> 3696 Lossiness causes performance degradation due to unnecessary fetches of table 3697 records that turn out to be false matches. Since random access to table 3698 records is slow, this limits the usefulness of GiST indexes. The 3699 likelihood of false matches depends on several factors, in particular the 3700 number of unique words, so using dictionaries to reduce this number is 3701 recommended. 3702 </para> 3703 3704 <para> 3705 Note that <acronym>GIN</acronym> index build time can often be improved 3706 by increasing <xref linkend="guc-maintenance-work-mem"/>, while 3707 <acronym>GiST</acronym> index build time is not sensitive to that 3708 parameter. 3709 </para> 3710 3711 <para> 3712 Partitioning of big collections and the proper use of GIN and GiST indexes 3713 allows the implementation of very fast searches with online update. 3714 Partitioning can be done at the database level using table inheritance, 3715 or by distributing documents over 3716 servers and collecting external search results, e.g., via <link 3717 linkend="ddl-foreign-data">Foreign Data</link> access. 3718 The latter is possible because ranking functions use 3719 only local information. 3720 </para> 3721 3722 </sect1> 3723 3724 <sect1 id="textsearch-psql"> 3725 <title><application>psql</application> Support</title> 3726 3727 <para> 3728 Information about text search configuration objects can be obtained 3729 in <application>psql</application> using a set of commands: 3730<synopsis> 3731\dF{d,p,t}<optional>+</optional> <optional>PATTERN</optional> 3732</synopsis> 3733 An optional <literal>+</literal> produces more details. 3734 </para> 3735 3736 <para> 3737 The optional parameter <replaceable>PATTERN</replaceable> can be the name of 3738 a text search object, optionally schema-qualified. If 3739 <replaceable>PATTERN</replaceable> is omitted then information about all 3740 visible objects will be displayed. <replaceable>PATTERN</replaceable> can be a 3741 regular expression and can provide <emphasis>separate</emphasis> patterns 3742 for the schema and object names. The following examples illustrate this: 3743 3744<screen> 3745=> \dF *fulltext* 3746 List of text search configurations 3747 Schema | Name | Description 3748--------+--------------+------------- 3749 public | fulltext_cfg | 3750</screen> 3751 3752<screen> 3753=> \dF *.fulltext* 3754 List of text search configurations 3755 Schema | Name | Description 3756----------+---------------------------- 3757 fulltext | fulltext_cfg | 3758 public | fulltext_cfg | 3759</screen> 3760 3761 The available commands are: 3762 </para> 3763 3764 <variablelist> 3765 <varlistentry> 3766 <term><literal>\dF<optional>+</optional> <optional>PATTERN</optional></literal></term> 3767 <listitem> 3768 <para> 3769 List text search configurations (add <literal>+</literal> for more detail). 3770<screen> 3771=> \dF russian 3772 List of text search configurations 3773 Schema | Name | Description 3774------------+---------+------------------------------------ 3775 pg_catalog | russian | configuration for russian language 3776 3777=> \dF+ russian 3778Text search configuration "pg_catalog.russian" 3779Parser: "pg_catalog.default" 3780 Token | Dictionaries 3781-----------------+-------------- 3782 asciihword | english_stem 3783 asciiword | english_stem 3784 email | simple 3785 file | simple 3786 float | simple 3787 host | simple 3788 hword | russian_stem 3789 hword_asciipart | english_stem 3790 hword_numpart | simple 3791 hword_part | russian_stem 3792 int | simple 3793 numhword | simple 3794 numword | simple 3795 sfloat | simple 3796 uint | simple 3797 url | simple 3798 url_path | simple 3799 version | simple 3800 word | russian_stem 3801</screen> 3802 </para> 3803 </listitem> 3804 </varlistentry> 3805 3806 <varlistentry> 3807 <term><literal>\dFd<optional>+</optional> <optional>PATTERN</optional></literal></term> 3808 <listitem> 3809 <para> 3810 List text search dictionaries (add <literal>+</literal> for more detail). 3811<screen> 3812=> \dFd 3813 List of text search dictionaries 3814 Schema | Name | Description 3815------------+-----------------+----------------------------------------------------------- 3816 pg_catalog | arabic_stem | snowball stemmer for arabic language 3817 pg_catalog | danish_stem | snowball stemmer for danish language 3818 pg_catalog | dutch_stem | snowball stemmer for dutch language 3819 pg_catalog | english_stem | snowball stemmer for english language 3820 pg_catalog | finnish_stem | snowball stemmer for finnish language 3821 pg_catalog | french_stem | snowball stemmer for french language 3822 pg_catalog | german_stem | snowball stemmer for german language 3823 pg_catalog | hungarian_stem | snowball stemmer for hungarian language 3824 pg_catalog | indonesian_stem | snowball stemmer for indonesian language 3825 pg_catalog | irish_stem | snowball stemmer for irish language 3826 pg_catalog | italian_stem | snowball stemmer for italian language 3827 pg_catalog | lithuanian_stem | snowball stemmer for lithuanian language 3828 pg_catalog | nepali_stem | snowball stemmer for nepali language 3829 pg_catalog | norwegian_stem | snowball stemmer for norwegian language 3830 pg_catalog | portuguese_stem | snowball stemmer for portuguese language 3831 pg_catalog | romanian_stem | snowball stemmer for romanian language 3832 pg_catalog | russian_stem | snowball stemmer for russian language 3833 pg_catalog | simple | simple dictionary: just lower case and check for stopword 3834 pg_catalog | spanish_stem | snowball stemmer for spanish language 3835 pg_catalog | swedish_stem | snowball stemmer for swedish language 3836 pg_catalog | tamil_stem | snowball stemmer for tamil language 3837 pg_catalog | turkish_stem | snowball stemmer for turkish language 3838</screen> 3839 </para> 3840 </listitem> 3841 </varlistentry> 3842 3843 <varlistentry> 3844 <term><literal>\dFp<optional>+</optional> <optional>PATTERN</optional></literal></term> 3845 <listitem> 3846 <para> 3847 List text search parsers (add <literal>+</literal> for more detail). 3848<screen> 3849=> \dFp 3850 List of text search parsers 3851 Schema | Name | Description 3852------------+---------+--------------------- 3853 pg_catalog | default | default word parser 3854=> \dFp+ 3855 Text search parser "pg_catalog.default" 3856 Method | Function | Description 3857-----------------+----------------+------------- 3858 Start parse | prsd_start | 3859 Get next token | prsd_nexttoken | 3860 End parse | prsd_end | 3861 Get headline | prsd_headline | 3862 Get token types | prsd_lextype | 3863 3864 Token types for parser "pg_catalog.default" 3865 Token name | Description 3866-----------------+------------------------------------------ 3867 asciihword | Hyphenated word, all ASCII 3868 asciiword | Word, all ASCII 3869 blank | Space symbols 3870 email | Email address 3871 entity | XML entity 3872 file | File or path name 3873 float | Decimal notation 3874 host | Host 3875 hword | Hyphenated word, all letters 3876 hword_asciipart | Hyphenated word part, all ASCII 3877 hword_numpart | Hyphenated word part, letters and digits 3878 hword_part | Hyphenated word part, all letters 3879 int | Signed integer 3880 numhword | Hyphenated word, letters and digits 3881 numword | Word, letters and digits 3882 protocol | Protocol head 3883 sfloat | Scientific notation 3884 tag | XML tag 3885 uint | Unsigned integer 3886 url | URL 3887 url_path | URL path 3888 version | Version number 3889 word | Word, all letters 3890(23 rows) 3891</screen> 3892 </para> 3893 </listitem> 3894 </varlistentry> 3895 3896 <varlistentry> 3897 <term><literal>\dFt<optional>+</optional> <optional>PATTERN</optional></literal></term> 3898 <listitem> 3899 <para> 3900 List text search templates (add <literal>+</literal> for more detail). 3901<screen> 3902=> \dFt 3903 List of text search templates 3904 Schema | Name | Description 3905------------+-----------+----------------------------------------------------------- 3906 pg_catalog | ispell | ispell dictionary 3907 pg_catalog | simple | simple dictionary: just lower case and check for stopword 3908 pg_catalog | snowball | snowball stemmer 3909 pg_catalog | synonym | synonym dictionary: replace word by its synonym 3910 pg_catalog | thesaurus | thesaurus dictionary: phrase by phrase substitution 3911</screen> 3912 </para> 3913 </listitem> 3914 </varlistentry> 3915 </variablelist> 3916 3917 </sect1> 3918 3919 <sect1 id="textsearch-limitations"> 3920 <title>Limitations</title> 3921 3922 <para> 3923 The current limitations of <productname>PostgreSQL</productname>'s 3924 text search features are: 3925 <itemizedlist spacing="compact" mark="bullet"> 3926 <listitem> 3927 <para>The length of each lexeme must be less than 2K bytes</para> 3928 </listitem> 3929 <listitem> 3930 <para>The length of a <type>tsvector</type> (lexemes + positions) must be 3931 less than 1 megabyte</para> 3932 </listitem> 3933 <listitem> 3934 <!-- TODO: number of lexemes in what? This is unclear --> 3935 <para>The number of lexemes must be less than 3936 2<superscript>64</superscript></para> 3937 </listitem> 3938 <listitem> 3939 <para>Position values in <type>tsvector</type> must be greater than 0 and 3940 no more than 16,383</para> 3941 </listitem> 3942 <listitem> 3943 <para>The match distance in a <literal><<replaceable>N</replaceable>></literal> 3944 (FOLLOWED BY) <type>tsquery</type> operator cannot be more than 3945 16,384</para> 3946 </listitem> 3947 <listitem> 3948 <para>No more than 256 positions per lexeme</para> 3949 </listitem> 3950 <listitem> 3951 <para>The number of nodes (lexemes + operators) in a <type>tsquery</type> 3952 must be less than 32,768</para> 3953 </listitem> 3954 </itemizedlist> 3955 </para> 3956 3957 <para> 3958 For comparison, the <productname>PostgreSQL</productname> 8.1 documentation 3959 contained 10,441 unique words, a total of 335,420 words, and the most 3960 frequent word <quote>postgresql</quote> was mentioned 6,127 times in 655 3961 documents. 3962 </para> 3963 3964 <!-- TODO we need to put a date on these numbers? --> 3965 <para> 3966 Another example — the <productname>PostgreSQL</productname> mailing 3967 list archives contained 910,989 unique words with 57,491,343 lexemes in 3968 461,020 messages. 3969 </para> 3970 3971 </sect1> 3972 3973</chapter> 3974