1.. _imap-developer-guidance-internationalization:
2
3..  Note: This document was converted from the original by Nic Bernstein
4    (Onlight).  Any formatting mistakes are my fault and not the
5    original author's.
6
7Cyrus IMAP Server: Internationalization
8=======================================
9
10introduction
11------------
12
13Cyrus currently transcodes characters to a canonical UTF-8 form for
14searching. The base spec of IMAP4 only requires understanding multiple
15character sets to properly implement SEARCH. Since the base spec came
16out, several extensions have been proposed that require further charset
17support: SORT, THREAD, and the Sieve subsystem. As of this writing,
18Cyrus doesn't correctly support these other commands.
19
20Cyrus currently only believes in 16-bit characters. Technically, Unicode
21has up to 21-bit characters (expressible in UTF-16 and 3-byte UTF-8) and
22ISO 10646 allows up to 31-bit characters (though ISO's current policy is
23to not allocate any characters outside of the 21-bit Unicode range). The
24lower 16-bit characters make up the basic multilingual plane (BMP) where
25the majority of languages live. This restriction is apparent in
26``charset.c:writeutf8()``, the UTF-8 decoders, and the Unicode
27canonicalization table used by Cyrus. Since Cyrus's known character sets
28(except for UTF-8) don't contain any characters off of the BMP this
29isn't seen to be a major problem.
30
31Throughout this text, Unicode and ISO 10646 will be used interchangeable
32to refer to the 16-bit character set of the BMP, regardless of encoding.
33"Character", unless otherwise specified, refers to a single Unicode
34character ``ffff`` or under.
35
36cyrus canonical form
37--------------------
38
39Since when users search e-mail messages it's much easier for them to
40eliminate false positives than realize there are hits that aren't
41displayed, the Cyrus searching algorithm errs on the side of more
42matches. Before comparing any two strings, Cyrus puts them in a
43canonical form. Logically, the process works as follows:
44
45-  the input string is translated into a sequence of Unicode characters.
46-  each character is transformed into lowercase. (For some characters, a
47   single uppercase character may transform into multiple lowercase
48   characters.)
49-  each character is fully decomposed.
50-  all whitespace (Unicode general categories starting with ``Z``) is
51   removed.
52-  combining diacritical marks, such as the accent on é, are removed.
53   (These are Unicode characters ``0300``-``03ff``.)
54-  certain characters are expanded to alternative spellings using ASCII
55   characters, such as "æ" to "ae".
56-  the output characters are then encoded in UTF-8.
57
58The actual transcoding does all of these steps at once with the aid of
59tables, carefully built at compile-time.
60
61The central part of Cyrus's internationalization support is it's
62transcoding routines in ``lib/charset.[ch]``, and
63``lib/chartable.[ch]``. Cyrus's transcoding routines are very elegant
64and very compact, thus somewhat intimidating. During compilation, Cyrus
65builds up a large number of tables (see `mkchartable <#mkchartable>`__)
66and uses them so that it never has to consider more than a single octet
67at a time while outputting the Cyrus canonical form for an input string.
68
69external interface
70------------------
71
72``lib/charset.h`` is the public interface for Cyrus lib clients to get
73character canonicalization and searching support. In contains the
74following functions:
75
76``char *charset_convert(const char *s, int charset, char *buf, int bufsz)``
77    Given a string *s* in charset *charset*, decode it into canonical
78    form in *buf*. *buf* must be reallocable and currently at least size
79    *bufsz*.
80``char *charset_decode_mimeheader(const char *s, char *buf, int bufsz)``
81    Given a string *s* containing possible MIME encoded substrings (per
82    RFC 2047), decode into canonical form in *buf*. *buf* must be
83    reallocable and currently at least size *bufsz*.
84``charset_index charset_lookupname(const char *name)``
85    Given *name* return the Cyrus charset index. 0 always represents
86    US-ASCII. The returned charset\_index may be saved in a file; it is
87    stable and is an integer. If this version of Cyrus does not support
88    the charset, ``CHARSET_UNKNOWN_CHARSET`` is returned.
89``comp_pat *charset_compilepat(const char *s)``
90    Compiles a NUL-terminated canonicalized string *s* into a
91    Boyer-Moore table for fast searching. I'll describe these `compiled
92    patterns <#comp_pat>`__ later.
93``void charset_freepat(comp_pat *pat)``
94    Frees a pattern previously return by ``charset_compilepat()``.
95``int charset_searchstring(const char *substr, comp_pat *pat,     const char *s, int len)``
96    Searches for a canonicalized string *substr* in the canonicalized
97    string *s*. *s* is of length *len*. *substr* must have been
98    previously compiled into *pat*. Returns non-zero for a hit, zero for
99    no match.
100``int charset_searchfile(const char *substr, comp_pat *pat,                               const char *msg_base, int mapnl, int len,                               charset_index charset, int encoding)``
101    Searches for the canonicalized string *substr* with compiled pattern
102    *pat* in a large buffer starting at *msg\_base* of length *len*. The
103    large buffer is of charset *charset* with the encoding *encoding*.
104    ``charset_searchfile()`` will dynamically unencode and canonicalize
105    the search text looking for *substr*. (If *mapnl* is set, the buffer
106    has only ``\n`` instead of ``\r\n``, but the length assumes that
107    each ``\n`` is dynamically converted to ``\r\n``. This feature is
108    deprecated.)
109``char *charset_decode_mimebody(const char *msg_base, int len,                                      int encoding, char **buf, int *bufsz,                                      int *outlen)``
110    Decode the MIME body part (per RFC 2045) located in the large buffer
111    starting at *msg\_base* of length *len*. The large buffer is of
112    encoding *encoding*. ``charset_decode_mimebody()`` will decode into
113    *buf*. *buf* must be reallocable and currently at least size
114    *bufsz*. The number of decoded bytes is returned in *outlen*.
115``charset_extractfile()``
116    Used by ``squatter`` and possibly other text indexing engines, but
117    not described here.
118
119the TRANSLATE macro: using the transcoding tables
120-------------------------------------------------
121
122The external interface is implemented with the help of the ``START`` and
123``TRANSLATE`` macros:
124
125``void START(struct decode_state *state, const unsigned char (*table)[256][4])``
126    ``START`` initializes *state* to be ready for transcoding of the
127    charset translation table given with *table*. The starting active
128    table is always the first one in the list passed in.
129``void TRANSLATE(struct decode_state *state, unsigned char input, unsigned char *outbuf, unsigned outindex)``
130    ``TRANSLATE`` takes four parameters: *state* is the current state of
131    the translation; it must have been initialized with ``START`` and is
132    modified by ``TRANSLATE``; *input* is one octet of input from the
133    stream to be transcoded; *outbuf* is a pointer to the start of the
134    buffer to write output characters; *outindex* is the index where
135    this translation should be written. The size of *outbuf* must be at
136    least *outindex + charset\_max\_translation*.
137
138Each charset consists of a set of one or more tables; the *table*
139parameter passed into ``START`` is the first of these tables and the
140others are adjacent in memory. Characters are transcoded by indexing
141into the active table with *input* and examining the 4 octet
142translation. The 4 octet translation may consist of 0–3 character
143translations followed by a control code or a series of control codes. In
144effect, the translation for a given octet is a mini-program that
145consists either of UTF-8 octets or control codes. One of the control
146codes RET, END, JSR, or JMP must occur in the 4 octet translation.
147
148control codes
149~~~~~~~~~~~~~
150
151Control codes are represented by uppercase US-ASCII characters since no
152uppercase characters can appear in the output translation (recall that
153Cyrus canonical form downcases). Any uppercase US-ASCII character
154(``[A .. Z]``) is thus interpreted specially by the ``TRANSLATE``
155virtual machine. Any other octet encountered as an output translation is
156presumed to be part of the UTF-8 output sequence and copied to the
157output.
158
159The names of control codes are actually C pre-processor defines to
160uppercase US-ASCII characters. As the mnenomics are easier to
161understand, I use them in discussing their semantics.
162
163control code reference
164~~~~~~~~~~~~~~~~~~~~~~
165
166``TRANSLATE`` recognizes the following "normal" control codes:
167
168XLT
169    This is the first octet of the four octet sequence, indicating that
170    the desired translation is larger than 3 UTF-8 octets. The next two
171    octets represent an offset to look up in the special
172    chartables\_long\_translations[] table. After that translation is
173    copied to the outbuf, the final octet is interpreted (it must be
174    either a RET or an END).
175JSR
176    The ``TRANSLATE`` virtual machine has a stack, fixed at size 1. A
177    JSR copies address of the current active table to the stack and
178    transitions to the active table given by the next two octets. (For
179    instance, table 1 would be the next table after the table given as a
180    parameter to ``START``.) Translation of the current octet stops
181    after encountering a JSR.
182
183    JSRs are useful for converting a two octet input character: the
184    first octet in the character will make a JSR to some table; the
185    second octet will produce a translation and RET to the current
186    table.
187
188    Since the virtual machine has a fixed size stack, it would be highly
189    unusual for the virtual machine to encounter two different JSRs
190    without an intervening RET.
191
192JMP
193    Similar to JSR, but does not change the stack. It is the equivalent
194    of a goto. JMPs are useful to deal with modal input character sets
195    (such as an escape in ISO-2022-JP, see `how the tables are
196    generated <#mkchartable>`__).
197RET
198    Indicates that we are done translating this input octet and we
199    should return to the previous active table. It might appear as the
200    first of the 4 translation octets, in which case this input
201    character translates into nothing (it might be whitespace, for
202    instance).
203END
204    Indicates we are done translating this input octet. When
205    ``TRANSLATE`` is next called, that input octet will be interpreted
206    against the current active table; the stack does not change.
207
208In addition, it recognizes the following "special" control codes for
209charsets that aren't easily represented by a set of tables, UTF-8 and
210UTF-7:
211
212U7F
213    UTF-7 consists of US-ASCII characters and a special escape character
214    that indicates a transition to base-64 encoded UTF-16 characters.
215    The virtual machine has built in code to handle the base64 decoding.
216    In UTF-7's base64, 8 input octets result in 3 characters, so the
217    tables would be rather large.
218U7N
219    This indicates that the current octet is the continuation of the
220    base-64 section.
221U83
222    One and two character UTF-8 sequences are handled normally in the
223    charset code. To keep the table size down, 3 octet sequences are
224    handled specially. U83 indicates that the current input octet is the
225    start of a three character sequence. It is also an implicit jump to
226    the 2nd table in the UTF-8 sequence, ending this translation.
227U83\_2
228    This input octet 2nd of 3-octet UTF-8 input, with an implicit jump
229    to the 3rd table.
230U83\_3
231    3rd octet of a 3-octet UTF-8 input. This produces the output
232    characters and has an implicit jump to the 1st table of UTF-8.
233
234Finally, it's useful to mention the special character ``EMPTY`` which is
235guaranteed not to match any character. It is also represented by an
236uppercase US-ASCII character.
237
238searching and compiled patterns
239-------------------------------
240
241boyer-moore
242~~~~~~~~~~~
243
244brief description of boyer-moore xxx
245
246cyrus implementation
247~~~~~~~~~~~~~~~~~~~~
248
249why two arrays? us-ascii optimization, really kinda useless now xxx
250
251meta-data stored at the end xxx
252
253generating the tables: ``mkchartable``
254--------------------------------------
255
256The program ``mkchartable`` is used to generate the charset transcoding
257tables used by TRANSLATE. These tables are carefully constructed so no
258more than a single octet need be examined at a time; this octet results
259in either an output stream of UTF-8 characters being generated or some
260sort of state change.
261
262``mkchartable`` uses three different sorts of input files to generate
263these tables. These files are located in the ``lib/charset`` directory.
264
265charset tables
266~~~~~~~~~~~~~~
267
268Each charset file maps a single charset to the corresponding Unicode
269characters. For the US-ASCII and ISO-8859-x character sets this is
270trivial: each input byte corresponds to a single Unicode character.
271(Actually, some ISO-8859-x octets do not map to any Unicode character.
272In that case, the file either does not contain that octet or map it to
273"``????``".)
274
275Other character sets are trickier. For instance, GB-2312 has both single
276and double byte characters, but is still a simple map from input
277character to output character. More complicated are modal character
278encodings. For instance, ISO-2022-JP starts in US-ASCII mode and uses
279``1B`` as an escape character followed by another two characters to
280select a new mode.
281
282The input charset labels modes with "``:``" followed by the mode name.
283The starting mode "``US-ASCII``" in ISO-2022-JP is preceded by
284"``:US-ASCII``". Mode transitions are denoted by a Unicode conversion of
285"``>newmode``" or "``:newmode``". To denote that the octet ``42``
286transitions into the "``US-ASCII``" mode, the charset file has
287"``42 >US-ASCII``". The mode names themselves are arbitrary labels and
288have no effect on the output.
289
290The input charset labels modes with ":" followed by the mode name. The
291mode name is optionally followed by a space and the "``<``" character.
292If the "``<``" character is present, then all translations will be
293followed by a RET instruction instead of an END instruction.
294
295The transition "``>newmode``" results in a JSR instruction being
296generated. A JMP instruction is generated by a transition of
297"``:newmode``".
298
299The input byte can be specified as "``*``". This is used to define the
300"default action" which is used for input bytes that are not otherwise
301defined for the mode. If the default action is not explicitly stated, it
302is a translation to EMPTY.
303
304unicode data table
305~~~~~~~~~~~~~~~~~~
306
307The ``unidata2.txt`` file is verbatim from the Unicode standard. More
308recent versions should be available `from their
309website <http://www.unicode.org/xxx>`__. Each entry in the file
310describers a Unicode character by the following properties, separated by
311semicolons:
312
313-  code point (16-bit character value) in hex
314-  character name (unused by Cyrus)
315-  general category, such as whitespace or punctuation
316-  the canonical combining class (unused)
317-  bidirection category (unused)
318-  character decomposition
319-  decimal digit value (unused)
320-  digit value (unused, and, no, I don't know the difference)
321-  numeric value including fractions (unused)
322-  mirrored character (unused)
323-  Unicode 1.0 name (unused)
324-  comment (unused)
325-  upper case equivalent (unused)
326-  lower case equivalent
327
328In general, Cyrus uses the lower case equivalent if there is one, and
329the decomposed value otherwise.
330
331unicode fixup table
332~~~~~~~~~~~~~~~~~~~
333
334The ``unifix.txt`` file contains Cyrus-specific mappings for characters.
335It overrides the ``unidata2.txt`` table. Each rule in the file is
336explained with a comment. It's helpful to remember that the Unicode
337general categories starting with ``Z`` represent whitespace, and
338whitespace is always removed.
339
340generating ``chartable.c``
341~~~~~~~~~~~~~~~~~~~~~~~~~~
342
343how ``mkchartable`` works: collapses the encoding modes, the unicode
344translations, and other normalizations into the output tables described
345above xxx
346
347for the future
348--------------
349
350Sieve/ACAP comparators
351~~~~~~~~~~~~~~~~~~~~~~
352
353adjustable normalization?
354~~~~~~~~~~~~~~~~~~~~~~~~~
355
356The use of uppercase US-ASCII characters is one of the annoyances in
357trying to generalize the charset transcoding. If we continue to restrict
358the characters under consideration to the BMP, switching to UTF-8
359control codes that start 4 or 5 byte sequences is possible.
360
361Another possibility is to use a NUL character as an escape sequence,
362though this increases the size of each control code by 1 octet.
363
364handle >2 octet input characters
365~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
366
367make UTF-8 more regular
368~~~~~~~~~~~~~~~~~~~~~~~
369
370consider whether we really need U83, U83\_2, U83\_3. also consider
371changing ``{ U83, 0, 0, 0 }`` translations to ``{ U83, JMP, 0, 1 }``
372sequences to at least eliminate the implicit jump.
373
374require minimal UTF-8 characters
375~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
376
377references
378----------
379
380xxx
381
382-  [UNICODE] Unicode / ISO 10646
383-  [UTF-8] utf-8 RFC
384-  [UTF-7] utf-7 RFC
385-  [BM] boyer-moore
386-  [ACAP] the comparators reference. see section XXX of RFC 2244.
387