<!-- doc/src/sgml/unaccent.sgml -->

<sect1 id="unaccent" xreflabel="unaccent">
 <title>unaccent</title>

 <indexterm zone="unaccent">
  <primary>unaccent</primary>
 </indexterm>

 <para>
  <filename>unaccent</filename> is a text search dictionary that removes accents
  (diacritic signs) from lexemes.
  It's a filtering dictionary, which means its output is
  always passed to the next dictionary (if any), unlike the normal
  behavior of dictionaries. This allows accent-insensitive processing
  for full text search.
 </para>

 <para>
  The current implementation of <filename>unaccent</filename> cannot be used as a
  normalizing dictionary for the <filename>thesaurus</filename> dictionary.
 </para>

 <para>
  This module is considered <quote>trusted</quote>, that is, it can be
  installed by non-superusers who have <literal>CREATE</literal> privilege
  on the current database.
 </para>

 <sect2>
  <title>Configuration</title>

  <para>
   An <literal>unaccent</literal> dictionary accepts the following options:
  </para>
  <itemizedlist>
   <listitem>
    <para>
     <literal>RULES</literal> is the base name of the file containing the list of
     translation rules. This file must be stored in
     <filename>$SHAREDIR/tsearch_data/</filename> (where <literal>$SHAREDIR</literal> means
     the <productname>PostgreSQL</productname> installation's shared-data directory).
     Its name must end in <literal>.rules</literal> (which is not to be included in
     the <literal>RULES</literal> parameter).
    </para>
   </listitem>
  </itemizedlist>
  <para>
   The rules file has the following format:
  </para>
  <itemizedlist>
   <listitem>
    <para>
     Each line represents one translation rule, consisting of a character with
     accent followed by a character without accent. The first is translated
     into the second.
     For example,
<programlisting>
À        A
Á        A
Â        A
Ã        A
Ä        A
Å        A
Æ        AE
</programlisting>
     The two characters must be separated by whitespace, and any leading or
     trailing whitespace on a line is ignored.
    </para>
   </listitem>

   <listitem>
    <para>
     Alternatively, if only one character is given on a line, instances of
     that character are deleted; this is useful in languages where accents
     are represented by separate characters.
    </para>
   </listitem>

   <listitem>
    <para>
     Actually, each <quote>character</quote> can be any string not containing
     whitespace, so <filename>unaccent</filename> dictionaries could be used for
     other sorts of substring substitutions besides diacritic removal.
    </para>
   </listitem>

   <listitem>
    <para>
     As with other <productname>PostgreSQL</productname> text search configuration files,
     the rules file must be stored in UTF-8 encoding. The data is
     automatically translated into the current database's encoding when
     loaded. Any lines containing untranslatable characters are silently
     ignored, so that rules files can contain rules that are not applicable in
     the current encoding.
    </para>
   </listitem>
  </itemizedlist>

  <para>
   A more complete example, which is directly useful for most European
   languages, can be found in <filename>unaccent.rules</filename>, which is installed
   in <filename>$SHAREDIR/tsearch_data/</filename> when the <filename>unaccent</filename>
   module is installed. This rules file translates characters with accents
   to the same characters without accents, and it also expands ligatures
   into the equivalent series of simple characters (for example, Æ to
   AE).
  </para>
 </sect2>

 <sect2>
  <title>Usage</title>

  <para>
   Installing the <literal>unaccent</literal> extension creates a text
   search template <literal>unaccent</literal> and a dictionary <literal>unaccent</literal>
   based on it.
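   As a brief illustration, assuming the module's files are already present
   in the installation, the extension could be created in the current
   database with:
<programlisting>
CREATE EXTENSION unaccent;
</programlisting>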
   The <literal>unaccent</literal> dictionary has the default
   parameter setting <literal>RULES='unaccent'</literal>, which makes it immediately
   usable with the standard <filename>unaccent.rules</filename> file.
   If you wish, you can alter the parameter, for example

<programlisting>
mydb=# ALTER TEXT SEARCH DICTIONARY unaccent (RULES='my_rules');
</programlisting>

   or create new dictionaries based on the template.
  </para>

  <para>
   To test the dictionary, you can try:
<programlisting>
mydb=# select ts_lexize('unaccent','Hôtel');
 ts_lexize
-----------
 {Hotel}
(1 row)
</programlisting>
  </para>

  <para>
   Here is an example showing how to insert the
   <filename>unaccent</filename> dictionary into a text search configuration:
<programlisting>
mydb=# CREATE TEXT SEARCH CONFIGURATION fr ( COPY = french );
mydb=# ALTER TEXT SEARCH CONFIGURATION fr
        ALTER MAPPING FOR hword, hword_part, word
        WITH unaccent, french_stem;
mydb=# select to_tsvector('fr','Hôtels de la Mer');
    to_tsvector
-------------------
 'hotel':1 'mer':4
(1 row)

mydb=# select to_tsvector('fr','Hôtel de la Mer') @@ to_tsquery('fr','Hotels');
 ?column?
----------
 t
(1 row)

mydb=# select ts_headline('fr','Hôtel de la Mer',to_tsquery('fr','Hotels'));
      ts_headline
------------------------
 &lt;b&gt;Hôtel&lt;/b&gt; de la Mer
(1 row)
</programlisting>
  </para>
 </sect2>

 <sect2>
  <title>Functions</title>

  <para>
   The <function>unaccent()</function> function removes accents (diacritic signs) from
   a given string. Basically, it's a wrapper around
   <filename>unaccent</filename>-type dictionaries, but it can be used outside normal
   text search contexts.
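   As a minimal sketch of such use, assuming a hypothetical table
   <literal>places</literal> with a <type>text</type> column
   <literal>name</literal> (neither is part of this module), an
   accent-insensitive comparison might be written as:
<programlisting>
-- places and name are hypothetical, for illustration only
SELECT name FROM places WHERE unaccent(name) = unaccent('Hôtel');
</programlisting>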
  </para>

  <indexterm>
   <primary>unaccent</primary>
  </indexterm>

<synopsis>
unaccent(<optional><replaceable class="parameter">dictionary</replaceable> <type>regdictionary</type>, </optional> <replaceable class="parameter">string</replaceable> <type>text</type>) returns <type>text</type>
</synopsis>

  <para>
   If the <replaceable class="parameter">dictionary</replaceable> argument is
   omitted, the text search dictionary named <literal>unaccent</literal> and
   appearing in the same schema as the <function>unaccent()</function>
   function itself is used.
  </para>

  <para>
   For example:
<programlisting>
SELECT unaccent('unaccent', 'Hôtel');
SELECT unaccent('Hôtel');
</programlisting>
  </para>
 </sect2>

</sect1>