<!-- doc/src/sgml/unaccent.sgml -->

<sect1 id="unaccent" xreflabel="unaccent">
 <title>unaccent</title>

 <indexterm zone="unaccent">
  <primary>unaccent</primary>
 </indexterm>

 <para>
  <filename>unaccent</filename> is a text search dictionary that removes accents
  (diacritic signs) from lexemes.
  It's a filtering dictionary, which means its output is
  always passed to the next dictionary (if any), unlike the normal
  behavior of dictionaries.  This allows accent-insensitive processing
  for full text search.
 </para>

 <para>
  The current implementation of <filename>unaccent</filename> cannot be used as a
  normalizing dictionary for the <filename>thesaurus</filename> dictionary.
 </para>

 <para>
  This module is considered <quote>trusted</quote>, that is, it can be
  installed by non-superusers who have <literal>CREATE</literal> privilege
  on the current database.
 </para>

 <sect2>
  <title>Configuration</title>

  <para>
   An <literal>unaccent</literal> dictionary accepts the following options:
  </para>
  <itemizedlist>
   <listitem>
    <para>
     <literal>RULES</literal> is the base name of the file containing the list of
     translation rules.  This file must be stored in
     <filename>$SHAREDIR/tsearch_data/</filename> (where <literal>$SHAREDIR</literal> means
     the <productname>PostgreSQL</productname> installation's shared-data directory).
     Its name must end in <literal>.rules</literal> (which is not to be included in
     the <literal>RULES</literal> parameter).
    </para>
   </listitem>
  </itemizedlist>
  <para>
   The rules file has the following format:
  </para>
  <itemizedlist>
   <listitem>
    <para>
     Each line represents one translation rule, consisting of a character with
     accent followed by a character without accent.  The first is translated
     into the second.  For example,
<programlisting>
&Agrave;        A
&Aacute;        A
&Acirc;        A
&Atilde;        A
&Auml;        A
&Aring;        A
&AElig;        AE
</programlisting>
     The two characters must be separated by whitespace, and any leading or
     trailing whitespace on a line is ignored.
    </para>
   </listitem>

   <listitem>
    <para>
     Alternatively, if only one character is given on a line, instances of
     that character are deleted; this is useful in languages where accents
     are represented by separate characters.
    </para>
   </listitem>

   <listitem>
    <para>
     Actually, each <quote>character</quote> can be any string not containing
     whitespace, so <filename>unaccent</filename> dictionaries could be used for
     other sorts of substring substitutions besides diacritic removal.
    </para>
   </listitem>

   <listitem>
    <para>
     As with other <productname>PostgreSQL</productname> text search configuration files,
     the rules file must be stored in UTF-8 encoding.  The data is
     automatically translated into the current database's encoding when
     loaded.  Any lines containing untranslatable characters are silently
     ignored, so that rules files can contain rules that are not applicable in
     the current encoding.
    </para>
   </listitem>
  </itemizedlist>
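  <para>
   As a sketch of the format described above, a small custom rules file
   might combine all three kinds of entries.  The file name
   <filename>my_rules.rules</filename> and the particular rules chosen here
   are assumptions made for illustration, not part of the standard
   distribution:
<programlisting>
&Eacute;        E
&szlig;        ss
&acute;
</programlisting>
   Under these rules, &Eacute; would be translated to E, the string &szlig;
   would be replaced by the two-character string ss, and every occurrence of
   the spacing acute accent (&acute;) would be deleted.
  </para>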

  <para>
   A more complete example, which is directly useful for most European
   languages, can be found in <filename>unaccent.rules</filename>, which is installed
   in <filename>$SHAREDIR/tsearch_data/</filename> when the <filename>unaccent</filename>
   module is installed.  This rules file translates characters with accents
   to the same characters without accents, and it also expands ligatures
   into the equivalent series of simple characters (for example, &AElig; to
   AE).
  </para>
 </sect2>

 <sect2>
  <title>Usage</title>

  <para>
   Installing the <literal>unaccent</literal> extension creates a text
   search template <literal>unaccent</literal> and a dictionary <literal>unaccent</literal>
   based on it.  The <literal>unaccent</literal> dictionary has the default
   parameter setting <literal>RULES='unaccent'</literal>, which makes it immediately
   usable with the standard <filename>unaccent.rules</filename> file.
   If you wish, you can alter the parameter, for example

<programlisting>
mydb=# ALTER TEXT SEARCH DICTIONARY unaccent (RULES='my_rules');
</programlisting>

   or create new dictionaries based on the template.
  </para>
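  <para>
   A separate dictionary built from the template could look like the
   following sketch.  The dictionary name <literal>my_unaccent</literal> is
   hypothetical, and <literal>my_rules</literal> is assumed to refer to a
   file <filename>my_rules.rules</filename> that you have placed in
   <filename>$SHAREDIR/tsearch_data/</filename>:
<programlisting>
mydb=# CREATE TEXT SEARCH DICTIONARY my_unaccent (
        TEMPLATE = unaccent,
        RULES = 'my_rules'
);
</programlisting>
   This leaves the default <literal>unaccent</literal> dictionary untouched,
   which may be preferable when other configurations already depend on it.
  </para>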

  <para>
   To test the dictionary, you can try:
<programlisting>
mydb=# select ts_lexize('unaccent','H&ocirc;tel');
 ts_lexize
-----------
 {Hotel}
(1 row)
</programlisting>
  </para>

  <para>
   Here is an example showing how to insert the
   <filename>unaccent</filename> dictionary into a text search configuration:
<programlisting>
mydb=# CREATE TEXT SEARCH CONFIGURATION fr ( COPY = french );
mydb=# ALTER TEXT SEARCH CONFIGURATION fr
        ALTER MAPPING FOR hword, hword_part, word
        WITH unaccent, french_stem;
mydb=# select to_tsvector('fr','H&ocirc;tels de la Mer');
    to_tsvector
-------------------
 'hotel':1 'mer':4
(1 row)

mydb=# select to_tsvector('fr','H&ocirc;tel de la Mer') @@ to_tsquery('fr','Hotels');
 ?column?
----------
 t
(1 row)

mydb=# select ts_headline('fr','H&ocirc;tel de la Mer',to_tsquery('fr','Hotels'));
      ts_headline
------------------------
 &lt;b&gt;H&ocirc;tel&lt;/b&gt; de la Mer
(1 row)
</programlisting>
  </para>
 </sect2>

 <sect2>
 <title>Functions</title>

 <para>
  The <function>unaccent()</function> function removes accents (diacritic signs) from
  a given string.  Basically, it's a wrapper around
  <filename>unaccent</filename>-type dictionaries, but it can be used outside normal
  text search contexts.
 </para>

 <indexterm>
  <primary>unaccent</primary>
 </indexterm>

<synopsis>
unaccent(<optional><replaceable class="parameter">dictionary</replaceable> <type>regdictionary</type>, </optional> <replaceable class="parameter">string</replaceable> <type>text</type>) returns <type>text</type>
</synopsis>

 <para>
  If the <replaceable class="parameter">dictionary</replaceable> argument is
  omitted, the text search dictionary named <literal>unaccent</literal> and
  appearing in the same schema as the <function>unaccent()</function>
  function itself is used.
 </para>

 <para>
  For example:
<programlisting>
SELECT unaccent('unaccent', 'H&ocirc;tel');
SELECT unaccent('H&ocirc;tel');
</programlisting>
 </para>
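 <para>
  Because <function>unaccent()</function> is an ordinary SQL function, it can
  also be applied to both sides of a comparison to get an accent-insensitive
  match.  The following is only a sketch: the table
  <literal>places</literal> and its column <literal>name</literal> are
  hypothetical, and the extension is assumed to be installed in the
  <literal>public</literal> schema:
<programlisting>
-- accent-insensitive, case-insensitive substring match
SELECT name
  FROM places
 WHERE unaccent(name) ILIKE unaccent('%h&ocirc;tel%');

-- the same function with the dictionary named explicitly
SELECT unaccent('public.unaccent', 'H&ocirc;tel');
</programlisting>
 </para>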
 </sect2>

</sect1>
