README
libexttextcat is an N-Gram-Based Text Categorization library primarily intended
for language guessing.

Fundamentally this is an adaptation of wiseguys' libtextcat extended to be UTF-8
aware. See README.libtextcat for details on the original libtextcat.

Building:

 * ./configure
 * make
 * make check

The tests can be run under valgrind's memcheck by setting VALGRIND=memcheck,
e.g.

 * export VALGRIND=memcheck
 * make check

Quickstart: language guesser

Assuming that you have successfully compiled the library, you need some
language models to start guessing languages. A collection of over 150 language
models, mostly derived from using the included "createfp" utility on UDHR
translations, is bundled, with a matching configuration file, in the langclass
directory:

 * cd langclass/LM
 * ../../src/testtextcat ../fpdb.conf

Paste some text onto the command line, and watch it get classified.

Using the API:

Classifying the language of a text buffer can be as easy as:

 #include "textcat.h"
 ...
 void *h = textcat_Init( "fpdb.conf" );
 ...
 printf( "Language: %s\n", textcat_Classify(h, buffer, 400) );
 ...
 textcat_Done(h);

Creating your own fingerprints:

The createfp program allows you to easily create your own document
fingerprints. Just feed it an example document on standard input, and store the
standard output:

 % createfp < mydocument.txt > myfingerprint.txt

Put the names of your fingerprints in a configuration file, add some ids and
you're ready to classify.

Here's a worked example. The UN Declaration of Human Rights is available in a
massive pile of translations[4], and unicode.org makes many of these
available as plain text[5], so...

% cd langclass/ShortTexts/
% wget http://unicode.org/udhr/d/udhr_abk.txt
% tail -n+7 udhr_abk.txt > ab.txt # skip the English header; the name uses BCP-47
% cd ../LM
% ../../src/createfp < ../ShortTexts/ab.txt > ab.lm
% echo "ab.lm ab--utf8" >> ../fpdb.conf

Eventually we'll drop fpdb.conf and assume the name of the fingerprint .lm file
is the correct BCP-47 tag for the language it detects.

Performance tuning:

This library was made with efficiency in mind. There are a couple of
parameters you may wish to tweak if you intend to use it for tasks
other than language guessing.

The most important thing is buffer size. For reliable language
guessing the classifier only needs a couple of hundred bytes at most,
so don't feed it 100KB of text unless you are creating a fingerprint.

If you insist on feeding the classifier lots of text, try fiddling
with TABLEPOW, which determines the size of the hash table that is
used to store the n-grams. Making it too small will result in many
hash table collisions; making it too large will cause wild memory
behaviour. Both are bad for performance.

Putting the most probable models at the top of the list in your config
file improves performance, because this will raise the threshold for
likely candidates more quickly.

Since the speed of the classifier is roughly linear with respect to
the number of models, you should consider how many models you really
need. In the case of language guessing: do you really want to recognize
every language ever invented?

Acknowledgements

UTF-8 conversion and adaptation for OpenOffice.org: Jocelyn Merand.
Original libTextCat: Frank Scheelen & Rob de Wit at wise-guys.nl.
Original language models: copyright Gertjan van Noord.

References:

[1] The document that started it all can be downloaded at John M.
Trenkle's site: N-Gram-Based Text Categorization

http://www.novodynamics.com/trenkle/papers/sdair-94-bc.ps.gz

[2] The Perl implementation by Gertjan van Noord (code + language
models): downloadable from his website

http://odur.let.rug.nl/~vannoord/TextCat/

[3] Original libtextcat implementation at

http://software.wise-guys.nl/libtextcat/

[4] http://www.ohchr.org/EN/UDHR/Pages/SearchByLang.aspx

[5] https://unicode.org/udhr/translations.html

Contact:

Questions or patches can be directed to libreoffice@lists.freedesktop.org.
Bugs can be reported at https://bugs.freedesktop.org.

README.libtextcat

   :: libTextCat 2.2 ::

   What is it?

   Libtextcat is a library with functions that implement the
   classification technique described in Cavnar & Trenkle, "N-Gram-Based
   Text Categorization" [1]. It was primarily developed for language
   guessing, a task on which it is known to perform with near-perfect
   accuracy.

   The central idea of the Cavnar & Trenkle technique is to calculate a
   "fingerprint" of a document with an unknown category, and compare this
   with the fingerprints of a number of documents of which the categories
   are known. The categories of the closest matches are output as the
   classification. A fingerprint is a list of the most frequent n-grams
   occurring in a document, ordered by frequency. Fingerprints are
   compared with a simple out-of-place metric. See the article for more
   details.

   Considerable effort went into making this implementation fast and
   efficient. The language guesser processes over 100 documents/second on
   a simple PC, which makes it practical for many uses. It was developed
   for use in our webcrawler and search engine software, in which it
   handles millions of documents a day.

   Download

   The library is released under the BSD License, which basically states
   that you can do anything you like with it as long as you mention us
   and make it clear that this library is covered by the BSD License. It
   also exempts us from any liability, should this library eat your hard
   disc, kill your cat or classify your attorney's e-mails as spam.

   The current version is 2.2.

   It can be downloaded from our website:

   http://software.wise-guys.nl/libtextcat/

   As yet there is no development version.

   Installation

   Do the familiar dance:

     tar xzf libtextcat-2.2.tar.gz
     cd libtextcat-2.2
     ./configure
     make
     make install

   This will install the library in /usr/local/lib/ and the createfp
   binary in /usr/local/bin.

   The library is known to compile flawlessly on GNU/Linux for x86, and
   IRIX64 (both 32 and 64 bits).

   Quickstart: language guesser

   Assuming that you have successfully compiled the library, you still
   need some language models to start guessing languages. If you don't
   feel like creating them yourself (cf. Creating your own
   fingerprints below), you can use the excellent collection of over 70
   language models provided in Gertjan van Noord's "TextCat" package.
   You can find these models and a matching configuration file
   in the langclass directory:

   * cd libtextcat-2.2/langclass/
   * ../src/testtextcat conf.txt

   Paste some text onto the command line, and watch it get classified.

   Using the API

   Classifying the language of a text buffer can be as easy as:

     #include "textcat.h"
     ...
     void *h = textcat_Init( "conf.txt" );
     ...
     printf( "Language: %s\n", textcat_Classify(h, buffer, 400) );
     ...
     textcat_Done(h);

   Creating your own fingerprints

   The createfp program allows you to easily create your own document
   fingerprints. Just feed it an example document on standard input, and
   store the standard output:

     % createfp < mydocument.txt > myfingerprint.txt

   Put the names of your fingerprints in a configuration file, add some
   ids and you're ready to classify.

   Performance tuning

   This library was made with efficiency in mind. There are a couple of
   parameters you may wish to tweak if you intend to use it for tasks
   other than language guessing.

   The most important thing is buffer size. For reliable language
   guessing the classifier only needs a couple of hundred bytes at most,
   so don't feed it 100KB of text unless you are creating a fingerprint.

   If you insist on feeding the classifier lots of text, try fiddling
   with TABLEPOW, which determines the size of the hash table that is
   used to store the n-grams. Making it too small will result in many
   hash table collisions; making it too large will cause wild memory
   behaviour. Both are bad for performance.

   Putting the most probable models at the top of the list in your config
   file improves performance, because this will raise the threshold for
   likely candidates more quickly.

   Since the speed of the classifier is roughly linear with respect to
   the number of models, you should consider how many models you really
   need. In the case of language guessing: do you really want to recognize
   every language ever invented?

   Acknowledgements

   The language models are copyright Gertjan van Noord.

   References

   [1] The document that started it all can be downloaded at John M.
   Trenkle's site: N-Gram-Based Text Categorization

   http://www.novodynamics.com/trenkle/papers/sdair-94-bc.ps.gz

   [2] The Perl implementation by Gertjan van Noord (code + language
   models): downloadable from his website

   http://odur.let.rug.nl/~vannoord/TextCat/

   Contact

   Praise and flames may be directed at us through
   libtextcat@wise-guys.nl. If there is enough interest, we'll whip up
   a mailing list. The current project maintainer is Frank Scheelen.

   c. 2003 WiseGuys Internet B.V.