1.. _imap-developer-guidance-internationalization: 2 3.. Note: This document was converted from the original by Nic Bernstein 4 (Onlight). Any formatting mistakes are my fault and not the 5 original author's. 6 7Cyrus IMAP Server: Internationalization 8======================================= 9 10introduction 11------------ 12 13Cyrus currently transcodes characters to a canonical UTF-8 form for 14searching. The base spec of IMAP4 only requires understanding multiple 15character sets to properly implement SEARCH. Since the base spec came 16out, several extensions have been proposed that require further charset 17support: SORT, THREAD, and the Sieve subsystem. As of this writing, 18Cyrus doesn't correctly support these other commands. 19 20Cyrus currently only believes in 16-bit characters. Technically, Unicode 21has up to 21-bit characters (expressible in UTF-16 and 3-byte UTF-8) and 22ISO 10646 allows up to 31-bit characters (though ISO's current policy is 23to not allocate any characters outside of the 21-bit Unicode range). The 24lower 16-bit characters make up the basic multilingual plane (BMP) where 25the majority of languages live. This restriction is apparent in 26``charset.c:writeutf8()``, the UTF-8 decoders, and the Unicode 27canonicalization table used by Cyrus. Since Cyrus's known character sets 28(except for UTF-8) don't contain any characters off of the BMP this 29isn't seen to be a major problem. 30 31Throughout this text, Unicode and ISO 10646 will be used interchangeable 32to refer to the 16-bit character set of the BMP, regardless of encoding. 33"Character", unless otherwise specified, refers to a single Unicode 34character ``ffff`` or under. 35 36cyrus canonical form 37-------------------- 38 39Since when users search e-mail messages it's much easier for them to 40eliminate false positives than realize there are hits that aren't 41displayed, the Cyrus searching algorithm errs on the side of more 42matches. Before comparing any two strings, Cyrus puts them in a 43canonical form. Logically, the process works as follows: 44 45- the input string is translated into a sequence of Unicode characters. 46- each character is transformed into lowercase. (For some characters, a 47 single uppercase character may transform into multiple lowercase 48 characters.) 49- each character is fully decomposed. 50- all whitespace (Unicode general categories starting with ``Z``) is 51 removed. 52- combining diacritical marks, such as the accent on é, are removed. 53 (These are Unicode characters ``0300``-``03ff``.) 54- certain characters are expanded to alternative spellings using ASCII 55 characters, such as "æ" to "ae". 56- the output characters are then encoded in UTF-8. 57 58The actual transcoding does all of these steps at once with the aid of 59tables, carefully built at compile-time. 60 61The central part of Cyrus's internationalization support is it's 62transcoding routines in ``lib/charset.[ch]``, and 63``lib/chartable.[ch]``. Cyrus's transcoding routines are very elegant 64and very compact, thus somewhat intimidating. During compilation, Cyrus 65builds up a large number of tables (see `mkchartable <#mkchartable>`__) 66and uses them so that it never has to consider more than a single octet 67at a time while outputting the Cyrus canonical form for an input string. 68 69external interface 70------------------ 71 72``lib/charset.h`` is the public interface for Cyrus lib clients to get 73character canonicalization and searching support. In contains the 74following functions: 75 76``char *charset_convert(const char *s, int charset, char *buf, int bufsz)`` 77 Given a string *s* in charset *charset*, decode it into canonical 78 form in *buf*. *buf* must be reallocable and currently at least size 79 *bufsz*. 80``char *charset_decode_mimeheader(const char *s, char *buf, int bufsz)`` 81 Given a string *s* containing possible MIME encoded substrings (per 82 RFC 2047), decode into canonical form in *buf*. *buf* must be 83 reallocable and currently at least size *bufsz*. 84``charset_index charset_lookupname(const char *name)`` 85 Given *name* return the Cyrus charset index. 0 always represents 86 US-ASCII. The returned charset\_index may be saved in a file; it is 87 stable and is an integer. If this version of Cyrus does not support 88 the charset, ``CHARSET_UNKNOWN_CHARSET`` is returned. 89``comp_pat *charset_compilepat(const char *s)`` 90 Compiles a NUL-terminated canonicalized string *s* into a 91 Boyer-Moore table for fast searching. I'll describe these `compiled 92 patterns <#comp_pat>`__ later. 93``void charset_freepat(comp_pat *pat)`` 94 Frees a pattern previously return by ``charset_compilepat()``. 95``int charset_searchstring(const char *substr, comp_pat *pat, const char *s, int len)`` 96 Searches for a canonicalized string *substr* in the canonicalized 97 string *s*. *s* is of length *len*. *substr* must have been 98 previously compiled into *pat*. Returns non-zero for a hit, zero for 99 no match. 100``int charset_searchfile(const char *substr, comp_pat *pat, const char *msg_base, int mapnl, int len, charset_index charset, int encoding)`` 101 Searches for the canonicalized string *substr* with compiled pattern 102 *pat* in a large buffer starting at *msg\_base* of length *len*. The 103 large buffer is of charset *charset* with the encoding *encoding*. 104 ``charset_searchfile()`` will dynamically unencode and canonicalize 105 the search text looking for *substr*. (If *mapnl* is set, the buffer 106 has only ``\n`` instead of ``\r\n``, but the length assumes that 107 each ``\n`` is dynamically converted to ``\r\n``. This feature is 108 deprecated.) 109``char *charset_decode_mimebody(const char *msg_base, int len, int encoding, char **buf, int *bufsz, int *outlen)`` 110 Decode the MIME body part (per RFC 2045) located in the large buffer 111 starting at *msg\_base* of length *len*. The large buffer is of 112 encoding *encoding*. ``charset_decode_mimebody()`` will decode into 113 *buf*. *buf* must be reallocable and currently at least size 114 *bufsz*. The number of decoded bytes is returned in *outlen*. 115``charset_extractfile()`` 116 Used by ``squatter`` and possibly other text indexing engines, but 117 not described here. 118 119the TRANSLATE macro: using the transcoding tables 120------------------------------------------------- 121 122The external interface is implemented with the help of the ``START`` and 123``TRANSLATE`` macros: 124 125``void START(struct decode_state *state, const unsigned char (*table)[256][4])`` 126 ``START`` initializes *state* to be ready for transcoding of the 127 charset translation table given with *table*. The starting active 128 table is always the first one in the list passed in. 129``void TRANSLATE(struct decode_state *state, unsigned char input, unsigned char *outbuf, unsigned outindex)`` 130 ``TRANSLATE`` takes four parameters: *state* is the current state of 131 the translation; it must have been initialized with ``START`` and is 132 modified by ``TRANSLATE``; *input* is one octet of input from the 133 stream to be transcoded; *outbuf* is a pointer to the start of the 134 buffer to write output characters; *outindex* is the index where 135 this translation should be written. The size of *outbuf* must be at 136 least *outindex + charset\_max\_translation*. 137 138Each charset consists of a set of one or more tables; the *table* 139parameter passed into ``START`` is the first of these tables and the 140others are adjacent in memory. Characters are transcoded by indexing 141into the active table with *input* and examining the 4 octet 142translation. The 4 octet translation may consist of 0–3 character 143translations followed by a control code or a series of control codes. In 144effect, the translation for a given octet is a mini-program that 145consists either of UTF-8 octets or control codes. One of the control 146codes RET, END, JSR, or JMP must occur in the 4 octet translation. 147 148control codes 149~~~~~~~~~~~~~ 150 151Control codes are represented by uppercase US-ASCII characters since no 152uppercase characters can appear in the output translation (recall that 153Cyrus canonical form downcases). Any uppercase US-ASCII character 154(``[A .. Z]``) is thus interpreted specially by the ``TRANSLATE`` 155virtual machine. Any other octet encountered as an output translation is 156presumed to be part of the UTF-8 output sequence and copied to the 157output. 158 159The names of control codes are actually C pre-processor defines to 160uppercase US-ASCII characters. As the mnenomics are easier to 161understand, I use them in discussing their semantics. 162 163control code reference 164~~~~~~~~~~~~~~~~~~~~~~ 165 166``TRANSLATE`` recognizes the following "normal" control codes: 167 168XLT 169 This is the first octet of the four octet sequence, indicating that 170 the desired translation is larger than 3 UTF-8 octets. The next two 171 octets represent an offset to look up in the special 172 chartables\_long\_translations[] table. After that translation is 173 copied to the outbuf, the final octet is interpreted (it must be 174 either a RET or an END). 175JSR 176 The ``TRANSLATE`` virtual machine has a stack, fixed at size 1. A 177 JSR copies address of the current active table to the stack and 178 transitions to the active table given by the next two octets. (For 179 instance, table 1 would be the next table after the table given as a 180 parameter to ``START``.) Translation of the current octet stops 181 after encountering a JSR. 182 183 JSRs are useful for converting a two octet input character: the 184 first octet in the character will make a JSR to some table; the 185 second octet will produce a translation and RET to the current 186 table. 187 188 Since the virtual machine has a fixed size stack, it would be highly 189 unusual for the virtual machine to encounter two different JSRs 190 without an intervening RET. 191 192JMP 193 Similar to JSR, but does not change the stack. It is the equivalent 194 of a goto. JMPs are useful to deal with modal input character sets 195 (such as an escape in ISO-2022-JP, see `how the tables are 196 generated <#mkchartable>`__). 197RET 198 Indicates that we are done translating this input octet and we 199 should return to the previous active table. It might appear as the 200 first of the 4 translation octets, in which case this input 201 character translates into nothing (it might be whitespace, for 202 instance). 203END 204 Indicates we are done translating this input octet. When 205 ``TRANSLATE`` is next called, that input octet will be interpreted 206 against the current active table; the stack does not change. 207 208In addition, it recognizes the following "special" control codes for 209charsets that aren't easily represented by a set of tables, UTF-8 and 210UTF-7: 211 212U7F 213 UTF-7 consists of US-ASCII characters and a special escape character 214 that indicates a transition to base-64 encoded UTF-16 characters. 215 The virtual machine has built in code to handle the base64 decoding. 216 In UTF-7's base64, 8 input octets result in 3 characters, so the 217 tables would be rather large. 218U7N 219 This indicates that the current octet is the continuation of the 220 base-64 section. 221U83 222 One and two character UTF-8 sequences are handled normally in the 223 charset code. To keep the table size down, 3 octet sequences are 224 handled specially. U83 indicates that the current input octet is the 225 start of a three character sequence. It is also an implicit jump to 226 the 2nd table in the UTF-8 sequence, ending this translation. 227U83\_2 228 This input octet 2nd of 3-octet UTF-8 input, with an implicit jump 229 to the 3rd table. 230U83\_3 231 3rd octet of a 3-octet UTF-8 input. This produces the output 232 characters and has an implicit jump to the 1st table of UTF-8. 233 234Finally, it's useful to mention the special character ``EMPTY`` which is 235guaranteed not to match any character. It is also represented by an 236uppercase US-ASCII character. 237 238searching and compiled patterns 239------------------------------- 240 241boyer-moore 242~~~~~~~~~~~ 243 244brief description of boyer-moore xxx 245 246cyrus implementation 247~~~~~~~~~~~~~~~~~~~~ 248 249why two arrays? us-ascii optimization, really kinda useless now xxx 250 251meta-data stored at the end xxx 252 253generating the tables: ``mkchartable`` 254-------------------------------------- 255 256The program ``mkchartable`` is used to generate the charset transcoding 257tables used by TRANSLATE. These tables are carefully constructed so no 258more than a single octet need be examined at a time; this octet results 259in either an output stream of UTF-8 characters being generated or some 260sort of state change. 261 262``mkchartable`` uses three different sorts of input files to generate 263these tables. These files are located in the ``lib/charset`` directory. 264 265charset tables 266~~~~~~~~~~~~~~ 267 268Each charset file maps a single charset to the corresponding Unicode 269characters. For the US-ASCII and ISO-8859-x character sets this is 270trivial: each input byte corresponds to a single Unicode character. 271(Actually, some ISO-8859-x octets do not map to any Unicode character. 272In that case, the file either does not contain that octet or map it to 273"``????``".) 274 275Other character sets are trickier. For instance, GB-2312 has both single 276and double byte characters, but is still a simple map from input 277character to output character. More complicated are modal character 278encodings. For instance, ISO-2022-JP starts in US-ASCII mode and uses 279``1B`` as an escape character followed by another two characters to 280select a new mode. 281 282The input charset labels modes with "``:``" followed by the mode name. 283The starting mode "``US-ASCII``" in ISO-2022-JP is preceded by 284"``:US-ASCII``". Mode transitions are denoted by a Unicode conversion of 285"``>newmode``" or "``:newmode``". To denote that the octet ``42`` 286transitions into the "``US-ASCII``" mode, the charset file has 287"``42 >US-ASCII``". The mode names themselves are arbitrary labels and 288have no effect on the output. 289 290The input charset labels modes with ":" followed by the mode name. The 291mode name is optionally followed by a space and the "``<``" character. 292If the "``<``" character is present, then all translations will be 293followed by a RET instruction instead of an END instruction. 294 295The transition "``>newmode``" results in a JSR instruction being 296generated. A JMP instruction is generated by a transition of 297"``:newmode``". 298 299The input byte can be specified as "``*``". This is used to define the 300"default action" which is used for input bytes that are not otherwise 301defined for the mode. If the default action is not explicitly stated, it 302is a translation to EMPTY. 303 304unicode data table 305~~~~~~~~~~~~~~~~~~ 306 307The ``unidata2.txt`` file is verbatim from the Unicode standard. More 308recent versions should be available `from their 309website <http://www.unicode.org/xxx>`__. Each entry in the file 310describers a Unicode character by the following properties, separated by 311semicolons: 312 313- code point (16-bit character value) in hex 314- character name (unused by Cyrus) 315- general category, such as whitespace or punctuation 316- the canonical combining class (unused) 317- bidirection category (unused) 318- character decomposition 319- decimal digit value (unused) 320- digit value (unused, and, no, I don't know the difference) 321- numeric value including fractions (unused) 322- mirrored character (unused) 323- Unicode 1.0 name (unused) 324- comment (unused) 325- upper case equivalent (unused) 326- lower case equivalent 327 328In general, Cyrus uses the lower case equivalent if there is one, and 329the decomposed value otherwise. 330 331unicode fixup table 332~~~~~~~~~~~~~~~~~~~ 333 334The ``unifix.txt`` file contains Cyrus-specific mappings for characters. 335It overrides the ``unidata2.txt`` table. Each rule in the file is 336explained with a comment. It's helpful to remember that the Unicode 337general categories starting with ``Z`` represent whitespace, and 338whitespace is always removed. 339 340generating ``chartable.c`` 341~~~~~~~~~~~~~~~~~~~~~~~~~~ 342 343how ``mkchartable`` works: collapses the encoding modes, the unicode 344translations, and other normalizations into the output tables described 345above xxx 346 347for the future 348-------------- 349 350Sieve/ACAP comparators 351~~~~~~~~~~~~~~~~~~~~~~ 352 353adjustable normalization? 354~~~~~~~~~~~~~~~~~~~~~~~~~ 355 356The use of uppercase US-ASCII characters is one of the annoyances in 357trying to generalize the charset transcoding. If we continue to restrict 358the characters under consideration to the BMP, switching to UTF-8 359control codes that start 4 or 5 byte sequences is possible. 360 361Another possibility is to use a NUL character as an escape sequence, 362though this increases the size of each control code by 1 octet. 363 364handle >2 octet input characters 365~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 366 367make UTF-8 more regular 368~~~~~~~~~~~~~~~~~~~~~~~ 369 370consider whether we really need U83, U83\_2, U83\_3. also consider 371changing ``{ U83, 0, 0, 0 }`` translations to ``{ U83, JMP, 0, 1 }`` 372sequences to at least eliminate the implicit jump. 373 374require minimal UTF-8 characters 375~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 376 377references 378---------- 379 380xxx 381 382- [UNICODE] Unicode / ISO 10646 383- [UTF-8] utf-8 RFC 384- [UTF-7] utf-7 RFC 385- [BM] boyer-moore 386- [ACAP] the comparators reference. see section XXX of RFC 2244. 387