| Name | Date | Size | #Lines | LOC |
|---|---|---|---|---|
| .. | 03-May-2022 | - | | |
| ftfy/ | 24-May-2021 | - | 2,776 | 1,991 |
| ftfy.egg-info/ | 03-May-2022 | - | 154 | 111 |
| tests/ | 03-May-2022 | - | 1,272 | 1,159 |
| CHANGELOG.md | 16-May-2021 | 20.8 KiB | 606 | 377 |
| MANIFEST.in | 04-May-2021 | 78 | 7 | 4 |
| PKG-INFO | 24-May-2021 | 7.1 KiB | 154 | 111 |
| README.md | 04-May-2021 | 5.2 KiB | 130 | 88 |
| setup.cfg | 24-May-2021 | 95 | 11 | 7 |
| setup.py | 24-May-2021 | 1.8 KiB | 61 | 49 |

README.md

# ftfy: fixes text for you

[![PyPI package](https://badge.fury.io/py/ftfy.svg)](https://badge.fury.io/py/ftfy)
[![Docs](https://readthedocs.org/projects/ftfy/badge/?version=latest)](https://ftfy.readthedocs.org/en/latest/)

```python
>>> from ftfy import fix_encoding
>>> print(fix_encoding("(à¸‡'âŒ£')à¸‡"))
(ง'⌣')ง
```

The full documentation of ftfy is available at [ftfy.readthedocs.org](https://ftfy.readthedocs.org). The documentation covers a lot more than this README, so here are some links into it:

- [Fixing problems and getting explanations](https://ftfy.readthedocs.io/en/v6.0.1/explain.html)
- [Configuring ftfy](https://ftfy.readthedocs.io/en/v6.0.1/config.html)
- [Encodings ftfy can handle](https://ftfy.readthedocs.io/en/v6.0.1/encodings.html)
- [“Fixer” functions](https://ftfy.readthedocs.io/en/v6.0.1/fixes.html)
- [Is ftfy an encoding detector?](https://ftfy.readthedocs.io/en/v6.0.1/detect.html)
- [Heuristics for detecting mojibake](https://ftfy.readthedocs.io/en/v6.0.1/heuristics.html)
- [Support for “bad” encodings](https://ftfy.readthedocs.io/en/v6.0.1/bad_encodings.html)
- [Command-line usage](https://ftfy.readthedocs.io/en/v6.0.1/cli.html)
- [Citing ftfy](https://ftfy.readthedocs.io/en/v6.0.1/cite.html)


## Testimonials

- “My life is livable again!”
  — [@planarrowspace](https://twitter.com/planarrowspace)
- “A handy piece of magic”
  — [@simonw](https://twitter.com/simonw)
- “Saved me a large amount of frustrating dev work”
  — [@iancal](https://twitter.com/iancal)
- “ftfy did the right thing right away, with no faffing about. Excellent work, solving a very tricky real-world (whole-world!) problem.”
  — Brennan Young
- “I have no idea when I’m gonna need this, but I’m definitely bookmarking it.”
  — [/u/ocrow](https://reddit.com/u/ocrow)
- “9.2/10”
  — [pylint](https://bitbucket.org/logilab/pylint/)

## What it does

Here are some examples (found in the real world) of what ftfy can do:

ftfy can fix mojibake (encoding mix-ups), by detecting patterns of characters that were clearly meant to be UTF-8 but were decoded as something else:

    >>> import ftfy
    >>> ftfy.fix_text('âœ” No problems')
    '✔ No problems'

Does this sound impossible? It's really not. UTF-8 is a well-designed encoding that makes it obvious when it's being misused, and a string of mojibake usually contains all the information we need to recover the original string.
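
As a minimal sketch of why the original is usually recoverable, the following uses only Python's built-in codecs rather than ftfy itself. It assumes the mix-up was UTF-8 bytes decoded as Windows-1252; the variable names are illustrative.

```python
# Illustrative only: reproduce one layer of mojibake, then reverse it.
original = "\u2714 No problems"  # "✔ No problems"

# The mix-up: UTF-8 bytes read as if they were Windows-1252.
garbled = original.encode("utf-8").decode("windows-1252")

# The repair: re-encode with the wrong codec, decode with the right one.
recovered = garbled.encode("windows-1252").decode("utf-8")

print(garbled)
print(recovered)
```

The round trip loses nothing, which is why the fix is possible at all. Deciding when such a round trip is warranted is the harder part, and that is what ftfy's heuristics are for.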

ftfy can fix multiple layers of mojibake simultaneously:

    >>> ftfy.fix_text('The Mona Lisa doesnÃƒÂ¢Ã¢â€šÂ¬Ã¢â€žÂ¢t have eyebrows.')
    "The Mona Lisa doesn't have eyebrows."
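
To illustrate what "multiple layers" means, here is a rough sketch that stacks three UTF-8-as-Windows-1252 mix-ups and then peels them off one at a time. `peel_layers` is a hypothetical helper, not part of ftfy, and it assumes every layer used the same pair of encodings.

```python
def peel_layers(text, max_layers=5):
    """Undo repeated UTF-8-read-as-Windows-1252 mix-ups (hypothetical helper)."""
    layers = 0
    for _ in range(max_layers):
        try:
            candidate = text.encode("windows-1252").decode("utf-8")
        except UnicodeError:
            break  # no further layer can be undone this way
        if candidate == text:
            break  # nothing changed, e.g. plain ASCII
        text = candidate
        layers += 1
    return text, layers


clean = "The Mona Lisa doesn\u2019t have eyebrows."

# Build three layers of mojibake on top of the clean sentence.
garbled = clean
for _ in range(3):
    garbled = garbled.encode("utf-8").decode("windows-1252")

restored, count = peel_layers(garbled)
print(restored, count)
```

Unlike `ftfy.fix_text`, this sketch leaves the curly apostrophe in place and makes no attempt to judge whether the input was mojibake to begin with.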

It can fix mojibake that has had "curly quotes" applied on top of it, which cannot be consistently decoded until the quotes are uncurled:

    >>> ftfy.fix_text("l’humanitÃ©")
    "l'humanité"
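
The sketch below shows why the quotes get in the way, under the assumption that the underlying mix-up was UTF-8 decoded as Latin-1. The `UNCURL` table is a hypothetical stand-in for illustration and is not ftfy's own quote handling.

```python
# Hypothetical table: map curly quotes to their ASCII equivalents.
UNCURL = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})

text = "l\u2019humanit\u00c3\u00a9"  # curly apostrophe, then mojibake for "é"

# The curly apostrophe (U+2019) has no Latin-1 byte, so the round trip
# cannot even start while it is present.
try:
    text.encode("latin-1")
    encodable_as_is = True
except UnicodeEncodeError:
    encodable_as_is = False

# Once the quote is uncurled, the usual round trip works.
fixed = text.translate(UNCURL).encode("latin-1").decode("utf-8")
print(encodable_as_is, fixed)
```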

ftfy can fix mojibake that would have included the character U+A0 (non-breaking space), but the U+A0 was turned into an ASCII space and then combined with another following space:

    >>> ftfy.fix_text('Ã\xa0 perturber la rÃ©flexion')
    'à perturber la réflexion'
    >>> ftfy.fix_text('Ã perturber la rÃ©flexion')
    'à perturber la réflexion'
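
A rough sketch of the idea, assuming Latin-1 as the wrong encoding: "à" is the UTF-8 byte pair C3 A0, so once the A0 has become an ordinary space the bytes are no longer valid UTF-8 until it is put back. `restore_nbsp_then_decode` is a hypothetical helper that handles only this one pattern and is not how ftfy implements the fix.

```python
def restore_nbsp_then_decode(text):
    """Hypothetical helper: handles only the 'Ã' + space case."""
    # Put back the U+A0 that was lost, keeping the ordinary space after it.
    repaired = text.replace("\u00c3 ", "\u00c3\u00a0 ")
    return repaired.encode("latin-1").decode("utf-8")


garbled = "\u00c3 perturber la r\u00c3\u00a9flexion"

# Without the repair, the bytes C3 20 are not valid UTF-8.
try:
    garbled.encode("latin-1").decode("utf-8")
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False

fixed = restore_nbsp_then_decode(garbled)
print(naive_ok, fixed)
```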

ftfy can also decode HTML entities that appear outside of HTML, even in cases where the entity has been incorrectly capitalized:

    >>> # by the HTML 5 standard, only 'P&Eacute;REZ' is acceptable
    >>> ftfy.fix_text('P&EACUTE;REZ')
    'PÉREZ'
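
One way to approximate this leniency with the standard library is sketched below. `unescape_lenient` is a hypothetical function, and the rule it applies (look up an all-caps entity by its lowercase name, then uppercase the result) is an assumption made for illustration rather than a description of ftfy's internals.

```python
import re
from html.entities import html5

# Named entities only, with the trailing semicolon required.
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*;)")


def unescape_lenient(text):
    """Decode named entities, tolerating ALL-CAPS names (hypothetical)."""
    def replace(match):
        name = match.group(1)  # e.g. "Eacute;" or "EACUTE;"
        if name in html5:
            return html5[name]
        # Fallback: an all-caps name is treated as the lowercase entity,
        # with the decoded character uppercased.
        if name.isupper() and name.lower() in html5:
            return html5[name.lower()].upper()
        return match.group(0)  # unknown entity: leave untouched
    return ENTITY_RE.sub(replace, text)


print(unescape_lenient("P&Eacute;REZ"))
print(unescape_lenient("P&EACUTE;REZ"))
```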

These fixes are not applied in all cases, because ftfy has a strongly-held goal of avoiding false positives: it should never change correctly-decoded text to something else.

The following text could be encoded in Windows-1252 and decoded in UTF-8, and it would decode as 'MARQUɅ'. However, the original text is already sensible, so it is unchanged.

    >>> ftfy.fix_text('IL Y MARQUÉ…')
    'IL Y MARQUÉ…'
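
The false-positive risk can be reproduced with the standard codecs alone. This short sketch shows that a round trip which decodes without error is not, by itself, evidence that the text was mojibake.

```python
text = "IL Y MARQU\u00c9\u2026"  # "IL Y MARQUÉ…", already correct

# The same round trip that repairs mojibake also "succeeds" here,
# because the bytes C9 85 happen to form a valid UTF-8 sequence.
misread = text.encode("windows-1252").decode("utf-8")

print(misread)  # IL Y MARQUɅ
```

A fixer that accepted every successful decode would corrupt this string, which is why ftfy leaves it alone.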


## Installing

ftfy is a Python 3 package that can be installed using `pip`:

    pip install ftfy

(Or use `pip3 install ftfy` on systems where Python 2 and 3 are both globally
installed and `pip` refers to Python 2.)

You can also clone this Git repository and install it with
`python setup.py install`.


## Who maintains ftfy?

I'm Robyn Speer. You can find me [on GitHub](https://github.com/rspeer).
I created ftfy as part of my work at the text understanding company
[Luminoso](https://luminoso.com), and now maintain it independently.


## Citing ftfy

ftfy has been used as a crucial data processing step in major NLP research.

It's important to give credit appropriately to everyone whose work you build on
in research. This includes software, not just high-status contributions such as
mathematical models. All I ask when you use ftfy for research is that you cite
it.

ftfy has a citable record [on Zenodo](https://zenodo.org/record/2591652).
A citation of ftfy may look like this:

    Robyn Speer. (2019). ftfy (Version 5.5). Zenodo.
    http://doi.org/10.5281/zenodo.2591652

In BibTeX format, the citation is:

    @misc{speer-2019-ftfy,
      author       = {Robyn Speer},
      title        = {ftfy},
      note         = {Version 5.5},
      year         = 2019,
      howpublished = {Zenodo},
      doi          = {10.5281/zenodo.2591652},
      url          = {https://doi.org/10.5281/zenodo.2591652}
    }