Unidecode, lossy ASCII transliterations of Unicode text
=======================================================

It often happens that you have text data in Unicode, but you need to
represent it in ASCII. This happens, for example, when integrating with
legacy code that doesn't support Unicode, or for ease of entry of non-Roman
names on a US keyboard, or when constructing ASCII machine identifiers from
human-readable Unicode strings that should still be somewhat intelligible
(a popular example of this is when making a URL slug from an article
title).

In most of these examples you could represent Unicode characters as ``???`` or
``\\15BA\\15A0\\1610``, to mention two extreme cases. But that's nearly useless
to someone who actually wants to read what the text says.

What Unidecode provides is a middle road: the function ``unidecode()`` takes
Unicode data and tries to represent it in ASCII characters (i.e., the
universally displayable characters between 0x00 and 0x7F), where the
compromises taken when mapping between two character sets are chosen to be
near what a human with a US keyboard would choose.

The quality of the resulting ASCII representation varies. For languages of
western origin it should be between perfect and good. On the other hand,
transliteration (i.e., conveying, in Roman letters, the pronunciation
expressed by the text in some other writing system) of languages like
Chinese, Japanese or Korean is a very complex issue and this library does
not even attempt to address it. It draws the line at context-free
character-by-character mapping. So a good rule of thumb is that the further
the script you are transliterating is from the Latin alphabet, the worse the
transliteration will be.

Note that this module generally produces better results than simply
stripping accents from characters (which can be done in Python with
built-in functions). It is based on hand-tuned character mappings that, for
example, also contain ASCII approximations for symbols and non-Latin
alphabets.

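For comparison, accent stripping with the standard library alone might look
like the sketch below. The helper name ``strip_accents`` is hypothetical and
not part of Unidecode; it only illustrates why decomposition-based stripping
falls short for letters that have no decomposed form.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Accented Latin letters decompose into a base letter plus a combining mark,
# so removing the marks leaves plain ASCII.
print(strip_accents("ko\u017eu\u0161\u010dek"))  # kozuscek

# U+0141 (L with stroke) has no decomposition, so it passes through
# unchanged and the result is still not ASCII.
print(strip_accents("\u0141\u00f3d\u017a").isascii())  # False
```

A hand-tuned mapping table is one way to cover such letters, as well as
symbols and non-Latin scripts.
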
This is a Python port of the ``Text::Unidecode`` Perl module by Sean M. Burke
<sburke@cpan.org>.


Module content
--------------

The module exports a function that takes a Unicode object (Python 2.x) or
string (Python 3.x) and returns a string (that can be encoded to ASCII bytes in
Python 3.x)::

    >>> from unidecode import unidecode
    >>> unidecode(u'ko\u017eu\u0161\u010dek')
    'kozuscek'
    >>> unidecode(u'30 \U0001d5c4\U0001d5c6/\U0001d5c1')
    '30 km/h'
    >>> unidecode(u"\u5317\u4EB0")
    'Bei Jing '

A utility is also included that allows you to transliterate text from the
command line in several ways. Reading from standard input::

    $ echo hello | unidecode
    hello

from a command line argument::

    $ unidecode -c hello
    hello

or from a file::

    $ unidecode hello.txt
    hello

The default encoding used by the utility depends on your system locale. You can
specify another encoding with the ``-e`` argument. See ``unidecode --help`` for
a full list of available options.

Requirements
------------

Nothing except Python itself. Unidecode supports Python 2.7 and 3.4 or later.

You need a Python build with "wide" Unicode characters (also called "UCS-4
build") in order for Unidecode to work correctly with characters outside of
the Basic Multilingual Plane (BMP). Common characters outside the BMP are
bold, italic, script, etc. variants of the Latin alphabet intended for
mathematical notation. Surrogate pair encoding of "narrow" builds is not
supported in Unidecode.

If your Python build supports "wide" Unicode, the following expression will
return ``True``::

    >>> import sys
    >>> sys.maxunicode > 0xffff
    True

See `PEP 261 <https://www.python.org/dev/peps/pep-0261/>`_ for details
regarding support for "wide" Unicode characters in Python.


Installation
------------

To install the latest version of Unidecode from the Python package index, use
this command::

    $ pip install unidecode

To install Unidecode from the source distribution and run unit tests, use::

    $ python setup.py install
    $ python setup.py test

Frequently asked questions
--------------------------

German umlauts are transliterated incorrectly
    Latin letters "a", "o" and "u" with diaeresis are transliterated by
    Unidecode as "a", "o", "u", *not* according to German rules "ae", "oe",
    "ue". This is intentional and will not be changed. The rationale is that
    these letters are used in languages other than German (for example,
    Finnish and Turkish). German text transliterated without the extra "e" is
    much more readable than other languages transliterated using German rules.
    A workaround is to do your own replacements of these characters before
    passing the string to ``unidecode()``.

Unidecode should support localization (e.g. a language or country parameter, inspecting system locale, etc.)
    Language-specific transliteration is a complicated problem and beyond the
    scope of this library. Changes related to this will not be accepted. Please
    consider using other libraries which do provide this capability, such as
    `Unihandecode <https://github.com/miurahr/unihandecode>`_.

Unidecode should use a permissive license such as MIT or the BSD license.
    The maintainer of Unidecode believes that providing access to source code
    on redistribution is a fair and reasonable request when basing products on
    voluntary work of many contributors. If the license is not suitable for
    you, please consider using other libraries, such as `text-unidecode
    <https://github.com/kmike/text-unidecode>`_.

Unidecode produces completely wrong results (e.g. "u" with diaeresis transliterating as "A 1/4 ")
    The strings you are passing to Unidecode have been wrongly decoded
    somewhere in your program. For example, you might be decoding utf-8 encoded
    strings as latin1. With a misconfigured terminal, locale and/or a text
    editor this might not be immediately apparent. Inspect your strings with
    ``repr()`` and consult the
    `Unicode HOWTO <https://docs.python.org/3/howto/unicode.html>`_.

I've upgraded Unidecode and now some URLs on my website return 404 Not Found.
    This is an issue with the software that is running your website, not
    Unidecode. Occasionally, new versions of the Unidecode library are released
    which contain improvements to the transliteration tables. This means that
    you cannot rely on ``unidecode()`` output staying the same across
    different versions of the Unidecode library. If you use ``unidecode()`` to
    generate URLs for your website, either generate the URL slug once and store
    it in the database or pin your Unidecode dependency to one specific
    version.

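The workaround for German umlauts mentioned above could be sketched as
follows. The table ``GERMAN_UMLAUTS`` and the helper ``expand_german_umlauts``
are hypothetical names, not part of Unidecode, and the sketch assumes the
input uses precomposed characters.

```python
# Hypothetical pre-processing table, not part of Unidecode: it maps the
# German umlauts to their "ae"/"oe"/"ue" spellings.
GERMAN_UMLAUTS = str.maketrans({
    "\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue",
    "\u00c4": "Ae", "\u00d6": "Oe", "\u00dc": "Ue",
})


def expand_german_umlauts(text: str) -> str:
    """Replace German umlauts before the text is passed to unidecode()."""
    return text.translate(GERMAN_UMLAUTS)


print(expand_german_umlauts("K\u00f6ln, \u00dcbung, M\u00e4dchen"))
# Koeln, Uebung, Maedchen
```

The result would then be passed to ``unidecode()`` as usual. Note that an
all-uppercase word would come out in mixed case with this table (an uppercase
"U" with diaeresis becomes "Ue"), so the mapping may need adjusting for your
data.
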
Some of the issues in this section are discussed in more detail in `this blog
post <https://www.tablix.org/~avian/blog/archives/2013/09/python_unidecode_release_0_04_14/>`_.

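The "completely wrong results" case above can be reproduced without Unidecode.
The following is a minimal sketch that deliberately decodes UTF-8 bytes as
latin1 and shows how inspecting the string reveals the problem; the variable
names are illustrative only.

```python
original = "\u00fc"  # LATIN SMALL LETTER U WITH DIAERESIS

# Encode correctly as UTF-8, then decode with the wrong codec.
wrongly_decoded = original.encode("utf-8").decode("latin-1")

# ascii() (like repr() on Python 2) makes the damage visible even on a
# misconfigured terminal: one character has become two.
print(ascii(original))         # '\xfc'
print(ascii(wrongly_decoded))  # '\xc3\xbc'

# Reversing the wrong decode step recovers the original text.
print(wrongly_decoded.encode("latin-1").decode("utf-8") == original)  # True
```

If a string looks like the second one before it reaches ``unidecode()``, the
decoding step earlier in the program is the place to fix.
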

Performance notes
-----------------

By default, ``unidecode()`` optimizes for the use case where most of the strings
passed to it are already ASCII-only and no transliteration is necessary (this
default might change in future versions).

For performance-critical applications, two additional functions are exposed:

``unidecode_expect_ascii()`` is optimized for ASCII-only inputs (approximately
5 times faster than ``unidecode_expect_nonascii()`` on 10-character strings,
more on longer strings), but slightly slower for non-ASCII inputs.

``unidecode_expect_nonascii()`` takes approximately the same amount of time on
ASCII and non-ASCII inputs, but is slightly faster for non-ASCII inputs than
``unidecode_expect_ascii()``.

Apart from differences in run time, both functions produce identical results.
For most users of Unidecode, the difference in performance should be
negligible.

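To check whether the difference matters for your own data, a small timing
harness along these lines may help. The helper ``time_transliterators`` is a
hypothetical name, and the sketch uses stand-in callables so that it runs
without Unidecode installed; in practice you would pass
``unidecode_expect_ascii`` and ``unidecode_expect_nonascii`` instead.

```python
import timeit
from typing import Callable, Dict


def time_transliterators(
    funcs: Dict[str, Callable[[str], str]],
    samples: Dict[str, str],
    number: int = 10000,
) -> Dict[str, Dict[str, float]]:
    """Return seconds taken for `number` calls of each function on each sample."""
    results: Dict[str, Dict[str, float]] = {}
    for func_name, func in funcs.items():
        results[func_name] = {
            sample_name: timeit.timeit(lambda: func(text), number=number)
            for sample_name, text in samples.items()
        }
    return results


# Stand-in callables keep this sketch runnable without Unidecode installed;
# replace them with unidecode_expect_ascii and unidecode_expect_nonascii.
timings = time_transliterators(
    {"upper": str.upper, "lower": str.lower},
    {"ascii": "hello world", "non_ascii": "ko\u017eu\u0161\u010dek"},
    number=1000,
)
for func_name, per_sample in timings.items():
    for sample_name, seconds in per_sample.items():
        print(f"{func_name:>6} {sample_name:>10}: {seconds:.6f}s")
```

Absolute numbers depend on the machine and Python version, so only the ratio
between the two functions on the same input is meaningful.
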

Source
------

You can get the latest development version of Unidecode with::

    $ git clone https://www.tablix.org/~avian/git/unidecode.git

There is also an official mirror of this repository on GitHub at
https://github.com/avian2/unidecode


Contact
-------

Please make sure to read the `Frequently asked questions`_ section above before
contacting the maintainer.

Bug reports, patches and suggestions for Unidecode can be sent to
tomaz.solc@tablix.org.

Alternatively, you can also open a ticket or pull request at
https://github.com/avian2/unidecode


Copyright
---------

Original character transliteration tables:

Copyright 2001, Sean M. Burke <sburke@cpan.org>, all rights reserved.

Python code and later additions:

Copyright 2019, Tomaz Solc <tomaz.solc@tablix.org>

This program is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the Free
Software Foundation; either version 2 of the License, or (at your option)
any later version.

This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
more details.

You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc., 51
Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.  The programs and
documentation in this dist are distributed in the hope that they will be
useful, but without any warranty; without even the implied warranty of
merchantability or fitness for a particular purpose.

..
    vim: set filetype=rst: