.. _internationalization:

================================
 Mailman 3 Internationalization
================================

Mailman does not yet support IDNA (internationalized domain names, RFC
5890) or internationalized mailboxes (RFC 6531) in email addresses.
However, *display names* and *descriptions* are fully internationalized
in Mailman, using Unicode.  Email content is handled by the Python email
package, which provides robust handling of internationalized content
conforming to the MIME standard (RFCs 2045-2049 and others).

The encoding of URI components addressing a REST endpoint is Unicode
UTF-8.  Mailman does not currently handle normalization, and we
recommend consistently using normal form NFC.  (For some languages
NFKC is risky, as some users' personal names may be corrupted by this
normalization.)  Mailman does not check for confusables or check
repertoire.


Introduction to Unicode Concepts
================================

The Unicode Standard is intended to provide a universal set of
characters with a single, standard encoding providing an invertible
mapping of characters to integers (called *code points* in this
context).


Repertoires
-----------

A set of characters is called a *repertoire*.  Unicode itself is
intended to provide a universal repertoire sufficient to represent
all words in all written languages, but a system may handle a
restricted repertoire and still be considered conformant, as long as
it does not corrupt characters it does not handle, and does not emit
non-character code points.


Convertibility
--------------

Unicode is intended to provide a character for each character defined
in a national character set standard.
This is often controversial:
Chinese characters are often *unified* with Japanese characters that
appear somewhat different when displayed, while the Cyrillic and Greek
equivalents of the Latin character "A" are treated as separate
characters despite being pronounced the same way and being displayed
as identical glyphs.  These judgments are informed by the notion that
a text should *round-trip*.  That is, when a text is converted from
Unicode to another encoding, and then back to Unicode, the result
should be identical to the source text.


Normalization
-------------

For several reasons, Unicode provides for construction of characters
by appending *composable characters* (such as accents) to *base
characters* (typically letters).  But since most languages assign a
code point to each accented letter, the "round-tripping" requirement
described above implies that Unicode should provide a code point for
that accented letter, called a *precomposed character*.  This means that
for most accented characters, there are two or more ways to represent
them, using various combinations of base characters, precomposed
characters, and composable characters.

There are also a number of cases where equivalent characters have
different code points (in a few extreme cases, the same character has
different code points because the original national standard had
duplicates).  Such characters are called *compatibility characters*.

The Unicode Standard requires that the composed character sequence be
treated identically to the precomposed (single) character by all
text-processing algorithms.  For convenience in matching, an
application may choose to *normalize* texts.  There are two
normalizations.  The *NFC* normal form requires that every composition
to a precomposed character that can be done is done.  It has the
advantage that, in most cases, the length of a word in characters is
the number of code points in the word.
The *NFD* normal form requires that all
precomposed characters be decomposed into a sequence of a base
character followed by composable characters.  It is useful in contexts
where fuzzy matches (*i.e.*, ignoring accents) are desired.

Finally, in each of these two forms a compatibility character may be
replaced by its *canonical equivalent*; the resulting forms are denoted
*NFKC* and *NFKD*, respectively.


Using Unicode in Mailman
------------------------

In most cases in Mailman it is highly recommended that input be
encoded as UTF-8 in NFC form.  Although highly conformant systems
are becoming more common, there are still many systems that assume
that one code point is translated to one glyph on display.  On such
systems NFC will provide a smoother user experience than NFD.  Since
much of the text data that Mailman handles is user names, and users
frequently strongly prefer a particular compatibility character to its
canonical equivalent, NFKC (or NFKD) should be avoided.

There are two other considerations in using Unicode in Mailman.  The
first is the problem of confusables.  *Confusables* are characters
which are considered different but whose glyphs are indistinguishable,
such as Latin capital letter A and Greek capital letter Alpha.
The second is repertoire: many code points in Unicode are not yet
assigned characters, or are even defined as non-characters, and thus
are not part of the repertoire of characters represented by Unicode.

Mailman makes no attempt to detect inappropriate use of confusables or
non-characters (for example, to redirect users to a domain
disseminating malware).  The risks at present are vanishingly small
because the necessary support in the mail system itself is not yet
widespread, but this is likely to change in the near future.


.. _localization:

Localization
============

The GNU Mailman project uses `Weblate`_ for translations.
If you are interested in
translating Mailman into the language of your choice, please create an
account at `Weblate`_ and follow the instructions in the `weblate docs`_
for translating a project.

If you want to add a new language to Mailman or have any questions related to
translations, please reach out to us at mailman-developers@python.org.


.. _Weblate: https://hosted.weblate.org/projects/gnu-mailman/mailman/
.. _weblate docs: https://docs.weblate.org/en/latest/user/translating.html


Generating pot files
--------------------

This is the documentation for adding a new language or updating existing
``.pot`` files in Mailman source.

.. note:: This is only meant for Mailman developers.  If you are interested
   in translating, please see :ref:`localization` for instructions on
   how to translate.

This `gettext tutorial`_ is a good way to refresh your memory on how GNU
gettext works.

We use the xgettext_ tool to generate ``mailman.pot``::

    # from Mailman's root directory.
    $ ./update-pot.sh

This will generate or update the ``src/mailman/messages/mailman.pot`` file and
update all the existing ``.po`` files with the new untranslated strings.

Generating po files
-------------------

To generate a ``.po`` file for a new language::

    $ cd src/mailman/messages/
    $ mkdir -p <lang>/LC_MESSAGES/
    $ msginit -i mailman.pot -l <lang> --no-translator -o <lang>/LC_MESSAGES/mailman.po

Finally, before releasing a new version, run::

    $ ./generate_mo.sh

This script runs the ``msgfmt`` command on all the ``.po`` files in the source
and generates compiled ``.mo`` files, which are used at runtime.  The ``.mo``
files should not be checked into source control.


.. _gettext tutorial: https://www.labri.fr/perso/fleury/posts/programming/a-quick-gettext-tutorial.html
.. _xgettext: https://www.gnu.org/software/gettext/manual/html_node/xgettext-Invocation.html
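
As a rough illustration of how a compiled ``.mo`` catalog can be consumed at
runtime, the sketch below uses Python's standard ``gettext`` module.  It is
not Mailman's own loading code.  It assumes the compiled file sits next to
its ``.po`` source as ``<lang>/LC_MESSAGES/mailman.mo`` under
``src/mailman/messages/``, and ``load_catalog`` is a hypothetical helper
introduced only for this example.

```python
import gettext
from pathlib import Path


def load_catalog(localedir, lang, domain="mailman"):
    """Return a translations object for ``lang``.

    Looks for ``<localedir>/<lang>/LC_MESSAGES/<domain>.mo``.  The
    ``mailman`` domain name and this layout are assumptions based on the
    commands shown above.  With ``fallback=True`` a missing catalog gives
    a ``NullTranslations`` object, which returns messages unchanged.
    """
    return gettext.translation(
        domain,
        localedir=str(localedir),
        languages=[lang],
        fallback=True,
    )


# Hypothetical usage: if no compiled catalog exists for the language code,
# the message is returned untranslated instead of raising an error.
catalog = load_catalog(Path("src/mailman/messages"), "xx")
print(catalog.gettext("Hello"))
```

Using ``fallback=True`` keeps the lookup from failing for a language whose
``.po`` file has been created but not yet compiled.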