|
|
Metadata-Version: 2.1
|
|
|
Name: ftfy
|
|
|
Version: 6.1.3
|
|
|
Summary: Fixes mojibake and other problems with Unicode, after the fact
|
|
|
License: Apache-2.0
|
|
|
Author: Robyn Speer
|
|
|
Author-email: rspeer@arborelia.net
|
|
|
Requires-Python: >=3.8,<4
|
|
|
Classifier: License :: OSI Approved :: Apache Software License
|
|
|
Classifier: Programming Language :: Python :: 3
|
|
|
Classifier: Programming Language :: Python :: 3.8
|
|
|
Classifier: Programming Language :: Python :: 3.9
|
|
|
Classifier: Programming Language :: Python :: 3.10
|
|
|
Classifier: Programming Language :: Python :: 3.11
|
|
|
Classifier: Programming Language :: Python :: 3.12
|
|
|
Requires-Dist: wcwidth (>=0.2.12,<0.3.0)
|
|
|
Description-Content-Type: text/markdown
|
|
|
|
|
|
# ftfy: fixes text for you
|
|
|
|
|
|
[![PyPI package](https://badge.fury.io/py/ftfy.svg)](https://badge.fury.io/py/ftfy)
|
|
|
[![Docs](https://readthedocs.org/projects/ftfy/badge/?version=latest)](https://ftfy.readthedocs.org/en/latest/)
|
|
|
|
|
|
```python
|
|
|
|
|
|
>>> from ftfy import fix_encoding
|
|
|
>>> print(fix_encoding("(ง'⌣')ง"))
|
|
|
(ง'⌣')ง
|
|
|
|
|
|
```
|
|
|
|
|
|
The full documentation of ftfy is available at [ftfy.readthedocs.org](https://ftfy.readthedocs.org). The documentation covers a lot more than this README, so here are
|
|
|
some links into it:
|
|
|
|
|
|
- [Fixing problems and getting explanations](https://ftfy.readthedocs.io/en/latest/explain.html)
|
|
|
- [Configuring ftfy](https://ftfy.readthedocs.io/en/latest/config.html)
|
|
|
- [Encodings ftfy can handle](https://ftfy.readthedocs.io/en/latest/encodings.html)
|
|
|
- [“Fixer” functions](https://ftfy.readthedocs.io/en/latest/fixes.html)
|
|
|
- [Is ftfy an encoding detector?](https://ftfy.readthedocs.io/en/latest/detect.html)
|
|
|
- [Heuristics for detecting mojibake](https://ftfy.readthedocs.io/en/latest/heuristic.html)
|
|
|
- [Support for “bad” encodings](https://ftfy.readthedocs.io/en/latest/bad_encodings.html)
|
|
|
- [Command-line usage](https://ftfy.readthedocs.io/en/latest/cli.html)
|
|
|
- [Citing ftfy](https://ftfy.readthedocs.io/en/latest/cite.html)
|
|
|
|
|
|
## Testimonials
|
|
|
|
|
|
- “My life is livable again!”
|
|
|
— [@planarrowspace](https://twitter.com/planarrowspace)
|
|
|
- “A handy piece of magic”
|
|
|
— [@simonw](https://twitter.com/simonw)
|
|
|
- “Saved me a large amount of frustrating dev work”
|
|
|
— [@iancal](https://twitter.com/iancal)
|
|
|
- “ftfy did the right thing right away, with no faffing about. Excellent work, solving a very tricky real-world (whole-world!) problem.”
|
|
|
— Brennan Young
|
|
|
- “I have no idea when I’m gonna need this, but I’m definitely bookmarking it.”
|
|
|
— [/u/ocrow](https://reddit.com/u/ocrow)
|
|
|
- “9.2/10”
|
|
|
— [pylint](https://bitbucket.org/logilab/pylint/)
|
|
|
|
|
|
## What it does
|
|
|
|
|
|
Here are some examples (found in the real world) of what ftfy can do:
|
|
|
|
|
|
ftfy can fix mojibake (encoding mix-ups), by detecting patterns of characters that were clearly meant to be UTF-8 but were decoded as something else:
|
|
|
|
|
|
>>> import ftfy
|
|
|
>>> ftfy.fix_text('✔ No problems')
|
|
|
'✔ No problems'
|
|
|
|
|
|
Does this sound impossible? It's really not. UTF-8 is a well-designed encoding that makes it obvious when it's being misused, and a string of mojibake usually contains all the information we need to recover the original string.
|
|
|
|
|
|
ftfy can fix multiple layers of mojibake simultaneously:
|
|
|
|
|
|
>>> ftfy.fix_text('The Mona Lisa doesn’t have eyebrows.')
|
|
|
"The Mona Lisa doesn't have eyebrows."
|
|
|
|
|
|
It can fix mojibake that has had "curly quotes" applied on top of it, which cannot be consistently decoded until the quotes are uncurled:
|
|
|
|
|
|
>>> ftfy.fix_text("l’humanité")
|
|
|
"l'humanité"
|
|
|
|
|
|
ftfy can fix mojibake that would have included the character U+A0 (non-breaking space), but the U+A0 was turned into an ASCII space and then combined with another following space:
|
|
|
|
|
|
>>> ftfy.fix_text('Ã\xa0 perturber la réflexion')
|
|
|
'à perturber la réflexion'
|
|
|
>>> ftfy.fix_text('à perturber la réflexion')
|
|
|
'à perturber la réflexion'
|
|
|
|
|
|
ftfy can also decode HTML entities that appear outside of HTML, even in cases where the entity has been incorrectly capitalized:
|
|
|
|
|
|
>>> # by the HTML 5 standard, only 'PÉREZ' is acceptable
|
|
|
>>> ftfy.fix_text('P&EACUTE;REZ')
|
|
|
'PÉREZ'
|
|
|
|
|
|
These fixes are not applied in all cases, because ftfy has a strongly-held goal of avoiding false positives -- it should never change correctly-decoded text to something else.
|
|
|
|
|
|
The following text could be encoded in Windows-1252 and decoded in UTF-8, and it would decode as 'MARQUɅ'. However, the original text is already sensible, so it is unchanged.
|
|
|
|
|
|
>>> ftfy.fix_text('IL Y MARQUÉ…')
|
|
|
'IL Y MARQUÉ…'
|
|
|
|
|
|
## Installing
|
|
|
|
|
|
ftfy is a Python 3 package that can be installed using `pip`:
|
|
|
|
|
|
pip install ftfy
|
|
|
|
|
|
(Or use `pip3 install ftfy` on systems where Python 2 and 3 are both globally
|
|
|
installed and `pip` refers to Python 2.)
|
|
|
|
|
|
### Local development
|
|
|
|
|
|
ftfy is developed using `poetry`. Its `setup.py` is vestigial and is not the
|
|
|
recommended way to install it.
|
|
|
|
|
|
[Install Poetry](https://python-poetry.org/docs/master/#installing-with-the-official-installer), check out this repository, and run `poetry install` to install ftfy for local development, such as experimenting with the heuristic or running tests.
|
|
|
|
|
|
## Who maintains ftfy?
|
|
|
|
|
|
I'm Robyn Speer, also known as Elia Robyn Lake. You can find me
|
|
|
[on GitHub](https://github.com/rspeer) or [Cohost](https://cohost.org/arborelia).
|
|
|
|
|
|
## Citing ftfy
|
|
|
|
|
|
ftfy has been used as a crucial data processing step in major NLP research.
|
|
|
|
|
|
It's important to give credit appropriately to everyone whose work you build on
|
|
|
in research. This includes software, not just high-status contributions such as
|
|
|
mathematical models. All I ask when you use ftfy for research is that you cite
|
|
|
it.
|
|
|
|
|
|
ftfy has a citable record [on Zenodo](https://zenodo.org/record/2591652).
|
|
|
A citation of ftfy may look like this:
|
|
|
|
|
|
Robyn Speer. (2019). ftfy (Version 5.5). Zenodo.
|
|
|
http://doi.org/10.5281/zenodo.2591652
|
|
|
|
|
|
In BibTeX format, the citation is::
|
|
|
|
|
|
@misc{speer-2019-ftfy,
|
|
|
author = {Robyn Speer},
|
|
|
title = {ftfy},
|
|
|
note = {Version 5.5},
|
|
|
year = 2019,
|
|
|
howpublished = {Zenodo},
|
|
|
doi = {10.5281/zenodo.2591652},
|
|
|
url = {https://doi.org/10.5281/zenodo.2591652}
|
|
|
}
|
|
|
|