.. _internationalization:

================================
 Mailman 3 Internationalization
================================

Mailman does not yet support IDNA (internationalized domain names, RFC
5890) or internationalized mailboxes (RFC 6531) in email addresses.
But *display names* and *descriptions* are fully internationalized in
Mailman, using Unicode.  Email content is handled by the Python email
package, which provides robust handling of internationalized content
conforming to the MIME standard (RFCs 2045-2049 and others).

The encoding of URI components addressing a REST endpoint is Unicode
UTF-8.  Mailman does not currently handle normalization, and we
recommend consistently using normal form NFC.  (For some languages
NFKC is risky, as some users' personal names may be corrupted by this
normalization.)  Mailman does not check for confusables or check
repertoire.
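
As a minimal sketch of the recommendation above, a REST client could
normalize a value to NFC and then percent-encode its UTF-8 bytes before
placing it in a URI.  The helper name ``encode_uri_component`` and the
sample name are illustrative only and are not part of Mailman's API.

.. code-block:: python

   import unicodedata
   from urllib.parse import quote, unquote


   def encode_uri_component(text):
       """Normalize to NFC, then percent-encode the UTF-8 bytes."""
       nfc = unicodedata.normalize('NFC', text)
       # safe='' also escapes '/', so the value stays one component.
       return quote(nfc, safe='', encoding='utf-8')


   # 'e' followed by U+0301 COMBINING ACUTE ACCENT (a decomposed form).
   decomposed = 'Rene\u0301'
   encoded = encode_uri_component(decomposed)
   print(encoded)            # Ren%C3%A9
   print(unquote(encoded))   # René

Normalizing on the client side means that the same name always maps to
the same URI, whichever form the user's input method produced.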


Introduction to Unicode Concepts
================================

The Unicode Standard is intended to provide a universal set of
characters with a single, standard encoding providing an invertible
mapping of characters to integers (called *code points* in this
context).


Repertoires
-----------

A set of characters is called a *repertoire*.  Unicode itself is
intended to provide a universal repertoire sufficient to represent
all words in all written languages, but a system may handle a
restricted repertoire and still be considered conformant, as long as
it does not corrupt characters it does not handle, and does not emit
non-character code points.


Convertibility
--------------

Unicode is intended to provide a character for each character defined
in a national character set standard.  This is often controversial:
Chinese characters are often *unified* with Japanese characters that
appear somewhat different when displayed, while the Cyrillic and Greek
equivalents of the Latin character "A" are treated as separate
characters despite being pronounced the same way and being displayed
as identical glyphs.  These judgments are informed by the notion that
a text should *round-trip*.  That is, when a text is converted from
Unicode to another encoding, and then back to Unicode, the result
should be identical to the source text.
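
The round-trip property can be illustrated with Python's standard
codecs.  This sketch assumes ISO 8859-7 (Greek) as the national
character set.  Because that standard encodes Greek capital Alpha
separately from Latin capital A, Unicode has to keep the two letters
distinct for the conversion to be reversible.

.. code-block:: python

   latin_a = 'A'             # U+0041 LATIN CAPITAL LETTER A
   greek_alpha = '\u0391'    # U+0391 GREEK CAPITAL LETTER ALPHA

   # ISO 8859-7 assigns the two letters different bytes.
   print(latin_a.encode('iso-8859-7'))       # b'A'
   print(greek_alpha.encode('iso-8859-7'))   # b'\xc1'

   # Unicode -> ISO 8859-7 -> Unicode returns the original text.
   text = latin_a + greek_alpha
   round_tripped = text.encode('iso-8859-7').decode('iso-8859-7')
   print(round_tripped == text)              # True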


Normalization
-------------

For several reasons, Unicode provides for construction of characters
by appending *composable characters* (such as accents) to *base
characters* (typically letters).  But since most national character
sets assign a code point to each accented letter, the "round-tripping"
requirement described above implies that Unicode should provide a code
point for that accented letter, called a *precomposed character*.  This
means that for most accented characters, there are two or more ways to
represent them, using various combinations of base characters,
precomposed characters, and composable characters.

There are also a number of cases where equivalent characters have
different code points (in a few extreme cases, the same character has
different code points because the original national standard had
duplicates).  Such characters are called *compatibility* characters.

The Unicode Standard requires that the composed character sequence be
treated identically to the precomposed (single) character by all
text-processing algorithms.  For convenience in matching, an
application may choose to *normalize* texts.  There are two
normalizations.  The *NFC* normal form requires that all compositions
to precomposed characters that can be done should be done.  It has the
advantage that, for most text, the length of a word in characters is
the number of code points in the word.  The *NFD* normal form requires
that all precomposed characters be decomposed into a sequence of a base
character followed by composable characters.  It is useful in contexts
where fuzzy matches (*i.e.*, ignoring accents) are desired.
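
The difference between the two forms can be seen with Python's
standard ``unicodedata`` module.  This is an illustrative sketch, not
Mailman code; the sample word and the ``strip_accents`` helper are made
up for the example.

.. code-block:: python

   import unicodedata

   composed = 'caf\u00e9'      # 'é' as one precomposed code point
   decomposed = 'cafe\u0301'   # 'e' + U+0301 COMBINING ACUTE ACCENT

   # The strings display the same but compare unequal until normalized.
   print(composed == decomposed)                                # False
   print(unicodedata.normalize('NFC', decomposed) == composed)  # True
   print(unicodedata.normalize('NFD', composed) == decomposed)  # True

   # In NFC this word has one code point per visible character.
   print(len(composed), len(decomposed))                        # 4 5


   def strip_accents(text):
       """Fuzzy-match helper: decompose, then drop combining marks."""
       nfd = unicodedata.normalize('NFD', text)
       return ''.join(c for c in nfd if not unicodedata.combining(c))


   print(strip_accents(composed))                               # cafe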

Finally, in each of these two forms a compatibility character may also
be replaced by its *canonical equivalent*.  The resulting forms are
denoted *NFKC* and *NFKD*, respectively.
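
A short sketch of what the compatibility forms do, again using the
standard ``unicodedata`` module.  The two characters are arbitrary
examples chosen for illustration.

.. code-block:: python

   import unicodedata

   # U+FB01 LATIN SMALL LIGATURE FI and U+FF21 FULLWIDTH LATIN CAPITAL
   # LETTER A are compatibility characters.
   ligature = '\ufb01'
   fullwidth_a = '\uff21'

   # NFC leaves these compatibility characters alone.
   print(unicodedata.normalize('NFC', ligature) == ligature)   # True

   # NFKC replaces them; the original spelling cannot be recovered.
   print(unicodedata.normalize('NFKC', ligature))              # fi
   print(unicodedata.normalize('NFKC', fullwidth_a))           # A

Because the replacement is one-way, text normalized to NFKC or NFKD no
longer round-trips to what the user originally typed.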


Using Unicode in Mailman
------------------------

In most cases in Mailman it is highly recommended that input be
encoded as UTF-8 in NFC format.  Although highly conformant systems
are becoming more common, there are still many systems that assume
that one code point is translated to one glyph on display.  On such
systems NFC will provide a smoother user experience than NFD.  Since
much of the text data that Mailman handles is user names, and users
often strongly prefer a particular compatibility character to its
canonical equivalent, NFKC (or NFKD) should be avoided.

There are two other considerations in using Unicode in Mailman.  The
first is the problem of confusables.  *Confusables* are characters
which are considered different but whose glyphs are indistinguishable,
such as Latin capital letter A and Greek capital letter Alpha.  The
second is repertoire: many code points in Unicode are not yet assigned
characters, or are even defined as non-characters, and thus are not
part of the repertoire of characters represented by Unicode.

Mailman makes no attempt to detect inappropriate use of confusables or
non-characters (for example, to redirect users to a domain
disseminating malware).  The risks at present are vanishingly small
because the necessary support in the mail system itself is not yet
widespread, but this is likely to change in the near future.
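
Since Mailman performs no such checks, a site that wants them has to
add its own.  The following is only a rough sketch using the standard
``unicodedata`` module.  The function names are hypothetical, the
script test is a crude heuristic based on character names, and real
confusable detection would need dedicated confusables data rather than
this shortcut.

.. code-block:: python

   import unicodedata


   def scripts_used(text):
       """Return the first word of each letter's Unicode name.

       For letters this is usually the script, e.g. LATIN or GREEK.
       """
       return {
           unicodedata.name(c, 'UNKNOWN').split()[0]
           for c in text
           if c.isalpha()
       }


   def has_unassigned(text):
       """True if any code point has general category Cn.

       Cn covers unassigned code points and non-characters in the
       Unicode version bundled with the running Python.
       """
       return any(unicodedata.category(c) == 'Cn' for c in text)


   latin = 'ADMIN'
   mixed = '\u0391DMIN'   # first letter is GREEK CAPITAL LETTER ALPHA

   print(sorted(scripts_used(latin)))    # ['LATIN']
   print(sorted(scripts_used(mixed)))    # ['GREEK', 'LATIN']
   print(has_unassigned('abc\uffff'))    # True: U+FFFF is a non-character

A string that mixes scripts is not necessarily malicious, so a check
like this is better suited to flagging input for review than to
rejecting it outright.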


.. _localization:

Localization
============

The GNU Mailman project uses `Weblate`_ for translations.  If you are interested
in translating Mailman into the language of your choice, please create an
account at `Weblate`_ and follow the instructions in the `weblate docs`_ for
translating a project.

If you want to add a new language to Mailman or have any questions related to
translations, please reach out to us at mailman-developers@python.org.


.. _Weblate: https://hosted.weblate.org/projects/gnu-mailman/mailman/
.. _weblate docs: https://docs.weblate.org/en/latest/user/translating.html


Generating pot files
--------------------

This section documents how to add a new language or update the existing
``.pot`` files in the Mailman source.

.. note:: This is only meant for Mailman developers.  If you are interested in
          translating, please see :ref:`localization` for instructions on how
          to translate.

This `gettext tutorial`_ is a good refresher on how GNU gettext works.

We use the xgettext_ tool to generate ``mailman.pot``::

  # from Mailman's root directory.
  $ ./update-pot.sh

This will generate or update the ``src/mailman/messages/mailman.pot`` file and
update all the existing ``.po`` files with the new untranslated strings.

Generating po files
-------------------

To generate a ``.po`` file for a new language::

  $ cd src/mailman/messages/
  $ mkdir -p <lang>/LC_MESSAGES/
  $ msginit -i mailman.pot -l <lang> --no-translator -o <lang>/LC_MESSAGES/mailman.po

Finally, before releasing a new version, run::

  $ ./generate_mo.sh

This script runs the ``msgfmt`` command on all the ``.po`` files in the source
and generates compiled ``.mo`` files, which are used at runtime.  These should
not be checked into source control.
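
To show how a compiled catalog is consumed, here is a minimal sketch
using Python's standard ``gettext`` module.  It is not Mailman's actual
loading code.  The directory layout is assumed from the commands above,
and the language code and message string are placeholders.

.. code-block:: python

   import gettext

   # Assumed layout, taken from the commands above:
   #   src/mailman/messages/<lang>/LC_MESSAGES/mailman.mo
   catalog = gettext.translation(
       'mailman',                         # domain: base name of the .mo
       localedir='src/mailman/messages',
       languages=['fr'],                  # placeholder language code
       fallback=True,                     # no .mo found: keep msgids
   )
   _ = catalog.gettext
   print(_('Some untranslated string'))

With ``fallback=True`` the call succeeds even when no ``.mo`` file has
been generated yet, in which case strings are returned untranslated.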


.. _gettext tutorial: https://www.labri.fr/perso/fleury/posts/programming/a-quick-gettext-tutorial.html
.. _xgettext: https://www.gnu.org/software/gettext/manual/html_node/xgettext-Invocation.html