README

Attempts to determine the natural language of a selection of Unicode (UTF-8) text.

Based on `guesslanguage.cpp
<http://websvn.kde.org/branches/work/sonnet-refactoring/common/nlp/guesslanguage.cpp?view=markup>`_
by Jacob R Rideout for KDE which itself is based on
`Language::Guess <http://languid.cantbedone.org/>`_ by Maciej Ceglowski.

Detects over 60 languages - all languages listed in the `trigrams
<http://code.google.com/p/guess-language/source/browse/trunk/guess_language/trigrams/>`_
directory plus Japanese, Chinese, Korean and Greek.

guess_language uses heuristics based on the character set and trigrams in a sample text
to detect the language. It works better with longer samples and will be confused if
the sample text includes markup such as HTML tags.

Usage
=====

The main entry points all take a single string as input and return a language identifier.
The string must be Unicode or UTF-8 encoded text. The language identifier can be the language
name in English, the two- or three-letter IANA language code, a language ID, or a tuple
containing all three.
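
Because markup such as HTML tags confuses the detector, it can help to reduce a document to plain text before passing it to any of the entry points. The following is a minimal sketch of one way to do that with Python's standard ``html.parser`` module. The names ``strip_markup`` and ``_TextExtractor`` are hypothetical helpers written for this illustration; they are not part of guess_language.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is not inside a script or style element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space, then collapse whitespace runs.
        # Note: this also puts a space where an inline tag splits a word.
        return " ".join(" ".join(self._chunks).split())


def strip_markup(html):
    """Return the text content of an HTML fragment as one plain string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()
```

The returned string can then be used as the ``txt`` argument. Joining chunks with a space keeps adjacent block elements from running together, at the cost of splitting a word that contains an inline tag.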

The primary entry points and their return values are as follows::

  guessLanguage(txt) - IANA language code
  guessLanguageTag(txt) - IANA language code (same as guessLanguage)
  guessLanguageName(txt) - Language name in English
  guessLanguageId(txt) - language ID
  guessLanguageInfo(txt) - tuple of (IANA code, id, name)