Name                 Date         Size        #Lines     LOC

..                   03-May-2022  -
langclass/           03-May-2022  -               74,837  74,181
m4/                  08-Nov-2021  -                9,060   8,188
src/                 03-May-2022  -                3,904   2,423
ChangeLog            25-Feb-2021  2.4 KiB             71      57
LICENSE              25-Feb-2021  1.5 KiB             31      24
Makefile.am          25-Feb-2021  269                 12       7
Makefile.in          03-May-2022  28.3 KiB           890     793
README               08-Nov-2021  3.9 KiB            122      84
README.libtextcat    25-Feb-2021  5.3 KiB            146     101
TODO                 25-Feb-2021  366                  8       7
aclocal.m4           08-Nov-2021  42.8 KiB         1,179   1,071
compile              08-Nov-2021  7.2 KiB            349     259
config.guess         25-Feb-2021  43.1 KiB         1,487   1,294
config.sub           25-Feb-2021  30.7 KiB         1,791   1,636
configure            08-Nov-2021  432.9 KiB       14,789  12,376
configure.ac         08-Nov-2021  2.1 KiB             84      72
depcomp              08-Nov-2021  23 KiB             792     502
install-sh           25-Feb-2021  15.3 KiB           530     346
libexttextcat.pc.in  25-Feb-2021  358                 16      13
libexttextcat.vapi   25-Feb-2021  1.8 KiB             40      38
ltmain.sh            08-Nov-2021  316.6 KiB       11,150   7,980
missing              08-Nov-2021  6.7 KiB            216     143

README

libexttextcat is an N-Gram-Based Text Categorization library primarily intended
for language guessing.

Fundamentally this is an adaptation of the WiseGuys libtextcat, extended to be
UTF-8 aware. See README.libtextcat for details on the original libtextcat.

Building:

 * ./configure
 * make
 * make check

The tests can be run under valgrind's memcheck by exporting VALGRIND=memcheck,
e.g.

 * export VALGRIND=memcheck
 * make check

Quickstart: language guesser

 Assuming that you have successfully compiled the library, you need some
language models to start guessing languages. A collection of over 150 language
models, mostly derived by running the included "createfp" utility on UDHR
translations, is bundled, together with a matching configuration file, in the
langclass directory:

  * cd langclass/LM
  * ../../src/testtextcat ../fpdb.conf

Paste some text onto the command line, and watch it get classified.

Using the API:

Classifying the language of a text buffer can be as easy as:

 #include "textcat.h"
 ...
 void *h = textcat_Init( "fpdb.conf" );
 ...
 printf( "Language: %s\n", textcat_Classify(h, buffer, 400) );
 ...
 textcat_Done(h);

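For reference, here is the same sequence as a complete little program. This is
a minimal sketch, not part of the library: the NULL check on textcat_Init, the
sample text and the compile line below are assumptions.

 #include <stdio.h>
 #include <string.h>
 #include "textcat.h"

 int main(void)
 {
     const char *text = "Example input text; a few hundred bytes is plenty.";

     /* load the fingerprint database listed in fpdb.conf */
     void *h = textcat_Init( "fpdb.conf" );
     if (h == NULL) {
         fprintf(stderr, "could not read fpdb.conf\n");
         return 1;
     }

     /* hand the classifier the buffer and its length in bytes */
     printf( "Language: %s\n", textcat_Classify(h, text, strlen(text)) );

     textcat_Done(h);
     return 0;
 }

Assuming the pkg-config file shipped with the library (libexttextcat.pc),
something like "cc example.c $(pkg-config --cflags --libs libexttextcat)"
should build it; run it from a directory containing fpdb.conf and the .lm
files it lists.
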
Creating your own fingerprints:

The createfp program allows you to easily create your own document
fingerprints. Just feed it an example document on standard input, and store
the standard output:
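
 % createfp < mydocument.txt > myfingerprint.txt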

Put the names of your fingerprints in a configuration file, add some IDs and
you're ready to classify.

Here's a worked example. The Universal Declaration of Human Rights is available
in a massive pile of translations[4], and unicode.org makes many of these
available as plain text[5], so...

% cd langclass/ShortTexts/
% wget http://unicode.org/udhr/d/udhr_abk.txt
% tail -n+7 udhr_abk.txt > ab.txt # skip the English header; the name is the BCP-47 tag
% cd ../LM
% ../../src/createfp < ../ShortTexts/ab.txt > ab.lm
% echo "ab.lm       ab--utf8" >> ../fpdb.conf

Eventually we'll drop fpdb.conf and assume the name of the fingerprint .lm file
is the correct BCP-47 tag for the language it detects.

Performance tuning:

This library was made with efficiency in mind. There are a couple of
parameters you may wish to tweak if you intend to use it for tasks
other than language guessing.

The most important thing is buffer size. For reliable language
guessing the classifier only needs a couple of hundred bytes at most,
so don't feed it 100KB of text unless you are creating a fingerprint.

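For example (a minimal sketch; the guess_language helper and the 400-byte cap
are illustrative, not part of the library):

 #include <string.h>
 #include "textcat.h"

 /* Guess the language of a NUL-terminated buffer, handing the classifier
    at most the first 400 bytes. */
 static const char *guess_language(void *h, const char *buffer)
 {
     size_t len = strlen(buffer);
     return textcat_Classify(h, buffer, len < 400 ? len : 400);
 }
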
If you insist on feeding the classifier lots of text, try fiddling
with TABLEPOW, which determines the size of the hash table that is
used to store the n-grams. Making it too small will result in many
hash table collisions; making it too large will cause wild memory
behaviour; both are bad for performance.

Putting the most probable models at the top of the list in your config
file improves performance, because this will raise the threshold for
likely candidates more quickly.

Since the speed of the classifier is roughly linear in the number of
models, you should consider how many models you really need. In the
case of language guessing: do you really want to recognize every
language ever invented?

Acknowledgements

UTF-8 conversion and adaptation for OpenOffice.org: Jocelyn Merand.
Original libTextCat: Frank Scheelen & Rob de Wit at wise-guys.nl.
Original language models: copyright Gertjan van Noord.

References:

[1] The paper that started it all, "N-Gram-Based Text Categorization", can be
downloaded from John M. Trenkle's site:

http://www.novodynamics.com/trenkle/papers/sdair-94-bc.ps.gz

[2] The Perl implementation by Gertjan van Noord (code + language models),
downloadable from his website:

http://odur.let.rug.nl/~vannoord/TextCat/

[3] The original libtextcat implementation:

http://software.wise-guys.nl/libtextcat/

[4] http://www.ohchr.org/EN/UDHR/Pages/SearchByLang.aspx

[5] https://unicode.org/udhr/translations.html

Contact:

Questions or patches can be directed to libreoffice@lists.freedesktop.org.
Bugs can be reported at https://bugs.freedesktop.org.


README.libtextcat


  :: libTextCat 2.2 ::

  What is it?

     Libtextcat is a library with functions that implement the
     classification technique described in Cavnar & Trenkle, "N-Gram-Based
     Text Categorization" [1]. It was primarily developed for language
     guessing, a task on which it is known to perform with near-perfect
     accuracy.

     The central idea of the Cavnar & Trenkle technique is to calculate a
     "fingerprint" of a document with an unknown category, and compare this
     with the fingerprints of a number of documents of which the categories
     are known. The categories of the closest matches are output as the
     classification. A fingerprint is a list of the most frequent n-grams
     occurring in a document, ordered by frequency. Fingerprints are
     compared with a simple out-of-place metric. See the article for more
     details.

     Considerable effort went into making this implementation fast and
     efficient. The language guesser processes over 100 documents/second on
     a simple PC, which makes it practical for many uses. It was developed
     for use in our webcrawler and search engine software, in which it
     handles millions of documents a day.

  Download

     The library is released under the BSD License, which basically states
     that you can do anything you like with it as long as you mention us
     and make it clear that this library is covered by the BSD License. It
     also exempts us from any liability, should this library eat your hard
     disc, kill your cat or classify your attorney's e-mails as spam.

     The current version is 2.2.

     It can be downloaded from our website:

       http://software.wise-guys.nl/libtextcat/

     As yet there is no development version.

  Installation

     Do the familiar dance:

       tar xzf libtextcat-2.2.tar.gz
       cd libtextcat-2.2
       ./configure
       make
       make install

     This will install the library in /usr/local/lib/ and the createfp
     binary in /usr/local/bin/.

     The library is known to compile flawlessly on GNU/Linux for x86 and
     on IRIX64 (both 32- and 64-bit).

  Quickstart: language guesser

     Assuming that you have successfully compiled the library, you still
     need some language models to start guessing languages. If you don't
     feel like creating them yourself (cf. Creating your own fingerprints
     below), you can use the excellent collection of over 70 language
     models provided in Gertjan van Noord's "TextCat" package. You can
     find these models and a matching configuration file in the langclass
     directory:

       * cd libtextcat-2.2/langclass/
       * ../src/testtextcat conf.txt

     Paste some text onto the command line, and watch it get classified.

  Using the API

     Classifying the language of a text buffer can be as easy as:

       #include "textcat.h"
       ...
       void *h = textcat_Init( "conf.txt" );
       ...
       printf( "Language: %s\n", textcat_Classify(h, buffer, 400) );
       ...
       textcat_Done(h);

  Creating your own fingerprints

     The createfp program allows you to easily create your own document
     fingerprints. Just feed it an example document on standard input, and
     store the standard output:

       % createfp < mydocument.txt > myfingerprint.txt

     Put the names of your fingerprints in a configuration file, add some
     IDs and you're ready to classify.

  Performance tuning

     This library was made with efficiency in mind. There are a couple of
     parameters you may wish to tweak if you intend to use it for tasks
     other than language guessing.

     The most important thing is buffer size. For reliable language
     guessing the classifier only needs a couple of hundred bytes at most,
     so don't feed it 100KB of text unless you are creating a fingerprint.

     If you insist on feeding the classifier lots of text, try fiddling
     with TABLEPOW, which determines the size of the hash table that is
     used to store the n-grams. Making it too small will result in many
     hash table collisions; making it too large will cause wild memory
     behaviour; both are bad for performance.

     Putting the most probable models at the top of the list in your config
     file improves performance, because this will raise the threshold for
     likely candidates more quickly.

     Since the speed of the classifier is roughly linear in the number of
     models, you should consider how many models you really need. In the
     case of language guessing: do you really want to recognize every
     language ever invented?

  Acknowledgements

     The language models are copyright Gertjan van Noord.

  References

     [1] The paper that started it all, "N-Gram-Based Text Categorization",
     can be downloaded from John M. Trenkle's site:

       http://www.novodynamics.com/trenkle/papers/sdair-94-bc.ps.gz

     [2] The Perl implementation by Gertjan van Noord (code + language
     models), downloadable from his website:

       http://odur.let.rug.nl/~vannoord/TextCat/

  Contact

     Praise and flames may be directed at us through
     libtextcat@wise-guys.nl. If there is enough interest, we'll whip up
     a mailing list. The current project maintainer is Frank Scheelen.

  (c) 2003 WiseGuys Internet B.V.