textproc/py-guess-language/guess-language-0.2

Attempts to determine the natural language of a selection of Unicode (UTF-8) text.

Based on `guesslanguage.cpp
<http://websvn.kde.org/branches/work/sonnet-refactoring/common/nlp/guesslanguage.cpp?view=markup>`_
by Jacob R Rideout for KDE, which is itself based on
`Language::Guess <http://languid.cantbedone.org/>`_ by Maciej Ceglowski.

Detects over 60 languages - all languages listed in the `trigrams
<http://code.google.com/p/guess-language/source/browse/trunk/guess_language/trigrams/>`_
directory plus Japanese, Chinese, Korean and Greek.

guess_language uses heuristics based on the character set and trigrams in a sample text
to detect the language. It works better with longer samples and will be confused if
the sample text includes markup such as HTML tags.
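
To make the trigram idea concrete, here is a minimal sketch of trigram-based
detection. It is not the code shipped in guess_language: the function names,
the two toy reference texts and the rank-order ("out-of-place") distance are
all assumptions chosen for illustration, and the character-set heuristics
mentioned above are left out entirely.

```python
from collections import Counter


def trigram_profile(text, top_n=300):
    """Return the character trigrams of `text`, most frequent first."""
    # Lowercase and collapse whitespace so word boundaries become single spaces.
    normalized = " " + " ".join(text.lower().split()) + " "
    counts = Counter(normalized[i:i + 3] for i in range(len(normalized) - 2))
    return [gram for gram, _ in counts.most_common(top_n)]


def out_of_place_distance(sample_profile, model_profile):
    """Sum of rank differences; trigrams absent from the model cost a fixed penalty."""
    model_rank = {gram: rank for rank, gram in enumerate(model_profile)}
    penalty = len(model_profile)
    return sum(
        abs(rank - model_rank[gram]) if gram in model_rank else penalty
        for rank, gram in enumerate(sample_profile)
    )


# Toy reference texts invented for this example (assumption).
# A real model would be built from a large corpus per language.
TOY_MODELS = {
    "en": trigram_profile(
        "the quick brown fox jumps over the lazy dog "
        "and then the dog runs after the fox"
    ),
    "fr": trigram_profile(
        "le renard brun saute par dessus le chien paresseux "
        "et puis le chien court vers le renard"
    ),
}


def guess_toy(text):
    """Pick the toy model whose trigram ranking is closest to the sample's."""
    sample = trigram_profile(text)
    return min(
        TOY_MODELS,
        key=lambda lang: out_of_place_distance(sample, TOY_MODELS[lang]),
    )
```

With models this small the result is only meaningful for text that closely
resembles the reference sentences. A short sample yields few trigrams to
compare, which is one reason longer samples tend to work better.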

Usage
=====

The main entry points all take a single string as input and return a language identifier.
The string must be Unicode or UTF-8 text. Depending on the function, the language identifier
is the language name in English, the two- or three-letter IANA language code, a language ID,
or a tuple containing all three.

The primary entry points and their return values are as follows::

  guessLanguage(txt) - IANA language code
  guessLanguageTag(txt) - IANA language code (same as guessLanguage)
  guessLanguageName(txt) - Language name in English
  guessLanguageId(txt) - language ID
  guessLanguageInfo(txt) - tuple of (IANA code, id, name)
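
Since markup in the sample can confuse detection, a caller may want to strip
tags before using one of the entry points above. The sketch below does this
with the standard library's ``html.parser``. The helper names ``strip_markup``
and ``guess_from_html`` are hypothetical, and the import assumes the package
is installed and exposes ``guessLanguage`` at the top level of
``guess_language``.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags. Note that the contents of
        # <script> and <style> elements also arrive here.
        self.chunks.append(data)


def strip_markup(html):
    """Return the text of `html` with tags removed and whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Join with spaces so text from adjacent elements does not run together.
    return " ".join(" ".join(parser.chunks).split())


def guess_from_html(html, guess=None):
    """Strip tags from `html`, then pass the plain text to a guessing function."""
    if guess is None:
        # Assumption: guess_language is installed and importable like this.
        from guess_language import guessLanguage as guess
    return guess(strip_markup(html))
```

Passing a different callable as ``guess`` (for example ``guessLanguageName``)
selects a different kind of return value without changing the preprocessing.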