NAME
    AI::Categorizer - Automatic Text Categorization

SYNOPSIS
     use AI::Categorizer;
     my $c = AI::Categorizer->new(...parameters...);

     # Run a complete experiment - training on a corpus, testing on a test
     # set, printing a summary of results to STDOUT
     $c->run_experiment;

     # Or, run the parts of $c->run_experiment separately
     $c->scan_features;
     $c->read_training_set;
     $c->train;
     $c->evaluate_test_set;
     print $c->stats_table;

     # After training, use the Learner for categorization
     my $l = $c->learner;
     while (...) {
       my $d = ...create a document...
       my $hypothesis = $l->categorize($d);  # An AI::Categorizer::Hypothesis object
       print "Assigned categories: ", join(', ', $hypothesis->categories), "\n";
       print "Best category: ", $hypothesis->best_category, "\n";
     }

DESCRIPTION
    "AI::Categorizer" is a framework for automatic text categorization. It
    consists of a collection of Perl modules that implement common
    categorization tasks, and a set of defined relationships among those
    modules. The various details are flexible - for example, you can choose what
    categorization algorithm to use, what features (words or otherwise) of the
    documents should be used (or how to automatically choose these features),
    what format the documents are in, and so on.

    The basic process of using this module will typically involve obtaining a
    collection of pre-categorized documents, creating a "knowledge set"
    representation of those documents, training a categorizer on that knowledge
    set, and saving the trained categorizer for later use. There are several
    ways to carry out this process. The top-level "AI::Categorizer" module
    provides an umbrella class for high-level operations, or you may use the
    interfaces of the individual classes in the framework.

    A simple sample script that reads a training corpus, trains a categorizer,
    and tests the categorizer on a test corpus, is distributed as eg/demo.pl.

    Disclaimer: the results of any of the machine learning algorithms are far
    from infallible (close to fallible?). Categorization of documents is often a
    difficult task even for humans well-trained in the particular domain of
    knowledge, and there are many things a human would consider that none of
    these algorithms consider. These are only statistical tests - at best they
    are neat tricks or helpful assistants, and at worst they are totally
    unreliable. If you plan to use this module for anything really important,
    human supervision is essential, both of the categorization process and the
    final results.

    For the usage details, please see the documentation of each individual
    module.

FRAMEWORK COMPONENTS
    This section explains the major pieces of the "AI::Categorizer" object
    framework. We give a conceptual overview, but don't get into any of the
    details about interfaces or usage. See the documentation for the individual
    classes for more details.

    A diagram of the various classes in the framework can be seen in
    "doc/classes-overview.png", and a more detailed view of the same thing can
    be seen in "doc/classes.png".

  Knowledge Sets

    A "knowledge set" is defined as a collection of documents, together with
    some information on the categories each document belongs to. Note that this
    term is particular to this project - other sources may call it a "training
    corpus" or "prior knowledge". A knowledge set also contains some
    information on how documents will be parsed and how their features (words)
    will be extracted and turned into meaningful representations. In this sense,
    a knowledge set represents not only a collection of data, but a particular
    view on that data.

    A knowledge set is encapsulated by the "AI::Categorizer::KnowledgeSet"
    class. Before you can start playing with categorizers, you will have to
    start playing with knowledge sets, so that the categorizers have some data
    to train on. See the documentation for the "AI::Categorizer::KnowledgeSet"
    module for information on its interface.
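
    As a rough sketch (the parameter names here are illustrative - the
    "AI::Categorizer::KnowledgeSet" documentation is authoritative), building
    a knowledge set from a stored corpus might look like:

     use AI::Categorizer::KnowledgeSet;
     my $k = AI::Categorizer::KnowledgeSet->new
       (
        name => 'My Corpus',                    # an illustrative label
        load => { path => 'corpus/training' },  # read documents from disk
       );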

   Feature selection

    Deciding which features are the most important is a very large part of the
    categorization task - you cannot simply consider all the words in all the
    documents when training, and all the words in the document being
    categorized. There are two main reasons for this - first, it would mean that
    your training and categorizing processes would take forever and use tons of
    memory, and second, the significant stuff of the documents would get lost in
    the "noise" of the insignificant stuff.

    The process of selecting the most important features in the training set is
    called "feature selection". It is managed by the
    "AI::Categorizer::KnowledgeSet" class, and you will find the details of
    feature selection processes in that class's documentation.
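
    For instance (the exact semantics of "features_kept" should be checked
    against the "AI::Categorizer::KnowledgeSet" documentation - it is that
    class's knob for limiting the feature set):

     # Keep only the strongest features rather than every word seen
     my $k = AI::Categorizer::KnowledgeSet->new
       (
        name          => 'My Corpus',
        features_kept => 500,   # illustrative: retain about 500 features
       );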

  Collections

    Because documents may be stored in lots of different formats, a "collection"
    class has been created as an abstraction of a stored set of documents,
    together with a way to iterate through the set and return Document objects.
    A knowledge set contains a single collection object. A "Categorizer" doing a
    complete test run generally contains two collections, one for training and
    one for testing. A "Learner" can mass-categorize a collection.

    The "AI::Categorizer::Collection" class and its subclasses instantiate the
    idea of a collection in this sense.
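
    A hedged sketch of iterating over a stored corpus (the "Files" subclass
    and "path" parameter are taken as typical examples - see the Collection
    documentation for the real options):

     use AI::Categorizer::Collection::Files;
     my $collection = AI::Categorizer::Collection::Files->new
       ( path => 'corpus/test' );
     while ( my $doc = $collection->next ) {  # returns Document objects
       print $doc->name, "\n";
     }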

  Documents

    Each document is represented by an "AI::Categorizer::Document" object, or an
    object of one of its subclasses. Each document class contains methods for
    turning a bunch of data into a Feature Vector. Each document also has a
    method to report which categories it belongs to.
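
    Creating a document by hand might look roughly like this (the exact
    constructor parameters are documented in "AI::Categorizer::Document"):

     use AI::Categorizer::Document;
     my $d = AI::Categorizer::Document->new
       (
        name    => 'doc1',                              # illustrative values
        content => 'The quick brown fox jumped over the lazy dog.',
       );
     my $fv = $d->features;  # its AI::Categorizer::FeatureVector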

  Categories

    Each category is represented by an "AI::Categorizer::Category" object. Its
    main purpose is to keep track of which documents belong to it, though you
    can also examine statistical properties of an entire category, such as
    obtaining a Feature Vector representing an amalgamation of all the documents
    that belong to it.

  Machine Learning Algorithms

    There are lots of different ways to make the inductive leap from the
    training documents to unseen documents. The Machine Learning community has
    studied many algorithms for this purpose. To allow flexibility in choosing
    and configuring categorization algorithms, each such algorithm is a subclass
    of "AI::Categorizer::Learner". There are currently four categorizers
    included in the distribution:

    AI::Categorizer::Learner::NaiveBayes
        A pure-perl implementation of a Naive Bayes classifier. No dependencies
        on external modules or other resources. Naive Bayes is usually very fast
        to train and fast to make categorization decisions, but isn't always the
        most accurate categorizer.

    AI::Categorizer::Learner::SVM
        An interface to Corey Spencer's "Algorithm::SVM", which implements a
        Support Vector Machine classifier. SVMs can take a while to train
        (though in certain conditions there are optimizations to make them quite
        fast), but are pretty quick to categorize. They often have very good
        accuracy.

    AI::Categorizer::Learner::DecisionTree
        An interface to "AI::DecisionTree", which implements a Decision Tree
        classifier. Decision Trees generally take longer to train than Naive
        Bayes or SVM classifiers, but they are also quite fast when
        categorizing. Decision Trees have the advantage that you can scrutinize
        the structures of trained decision trees to see how decisions are being
        made.

    AI::Categorizer::Learner::Weka
        An interface to version 2 of the Weka Knowledge Analysis system that
        lets you use any of the machine learners it defines. This gives you
        access to lots and lots of machine learning algorithms in use by machine
        learning researchers. The main drawback is that Weka tends to be quite
        slow and use a lot of memory, and the current interface between Weka and
        "AI::Categorizer" is a bit clumsy.

    Other machine learning methods that may be implemented soonish include
    Neural Networks, k-Nearest-Neighbor, and/or a mixture-of-experts combiner
    for ensemble learning. No timetable for their creation has yet been set.

    Please see the documentation of these individual modules for more details on
    their guts and quirks. See the "AI::Categorizer::Learner" documentation for
    a description of the general categorizer interface.

    If you wish to create your own classifier, you should inherit from
    "AI::Categorizer::Learner" or "AI::Categorizer::Learner::Boolean", which are
    abstract classes that manage some of the work for you.
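
    Since the learners share one interface, switching algorithms is mostly a
    matter of naming a different class. A sketch (whether the top-level class
    accepts a "learner_class" parameter should be verified against its
    documentation; "Class::Container" conventionally supports such "*_class"
    parameters):

     my $c = AI::Categorizer->new
       (
        learner_class => 'AI::Categorizer::Learner::SVM',  # swap in another learner
        data_root     => 'corpus',
       );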

  Feature Vectors

    Most categorization algorithms don't deal directly with documents' data;
    they instead deal with a *vector representation* of a document's *features*.
    The features may be any properties of the document that seem helpful for
    determining its category, but they are usually some version of the "most
    important" words in the document. A list of features and their weights in
    each document is encapsulated by the "AI::Categorizer::FeatureVector" class.
    You may think of this class as roughly analogous to a Perl hash, where the
    keys are the names of features and the values are their weights.
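
    Continuing the hash analogy, a sketch (method names should be checked
    against the "AI::Categorizer::FeatureVector" documentation):

     use AI::Categorizer::FeatureVector;
     my $fv = AI::Categorizer::FeatureVector->new
       ( features => { hello => 2, world => 1 } );  # name => weight pairs
     my @features = $fv->names;  # like keys() on a hash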

  Hypotheses

    The result of asking a categorizer to categorize a previously unseen
    document is called a hypothesis, because it is some kind of "statistical
    guess" of what categories this document should be assigned to. Since you may
    be interested in any of several pieces of information about the hypothesis
    (for instance, which categories were assigned, which category was the single
    most likely category, the scores assigned to each category, etc.), the
    hypothesis is returned as an object of the "AI::Categorizer::Hypothesis"
    class, and you can use its object methods to get information about the
    hypothesis. See its class documentation for the details.
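
    For example, given a trained Learner $l and a Document $d (the first two
    methods appear in the synopsis above; "scores()" is listed on the
    understanding that the class documentation is authoritative):

     my $h = $l->categorize($d);
     print "Best category: ", $h->best_category, "\n";
     print "All assigned: ", join(', ', $h->categories), "\n";
     my @scores = $h->scores( $h->categories );  # per-category scores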

  Experiments

    The "AI::Categorizer::Experiment" class helps you organize the results of
    categorization experiments. As you get lots of categorization results
    (Hypotheses) back from the Learner, you can feed these results to the
    Experiment class, along with the correct answers. When all results have been
    collected, you can get a report on accuracy, precision, recall, F1, and so
    on, with both micro-averaging and macro-averaging over categories. We use
    the "Statistics::Contingency" module from CPAN to manage the calculations.
    See the docs for "AI::Categorizer::Experiment" for more details.
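
    In outline (the "add_result()" interface comes from
    "Statistics::Contingency"; the variable names here are placeholders):

     use AI::Categorizer::Experiment;
     my $e = AI::Categorizer::Experiment->new( categories => \@category_names );
     # For each test document, record the assigned and correct categories
     $e->add_result( \@assigned_names, \@correct_names, $doc_name );
     print $e->stats_table;  # accuracy, precision, recall, F1, ...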

METHODS
    new()
        Creates a new Categorizer object and returns it. Accepts lots of
        parameters controlling behavior. In addition to the parameters listed
        here, you may pass any parameter accepted by any class that we create
        internally (the KnowledgeSet, Learner, Experiment, or Collection
        classes), or any class that *they* create. This is managed by the
        "Class::Container" module, so see its documentation for the details of
        how this works.

        The specific parameters accepted here are:

        progress_file
            A string that indicates a place where objects will be saved during
            several of the methods of this class. The default value is the
            string "save", which means files like "save-01-knowledge_set" will
            get created. The exact names of these files may change in future
            releases, since they're just used internally to resume where we last
            left off.

        verbose
            If true, a few status messages will be printed during execution.

        training_set
            Specifies the "path" parameter that will be fed to the
            KnowledgeSet's "scan_features()" and "read()" methods during our
            "scan_features()" and "read_training_set()" methods.

        test_set
            Specifies the "path" parameter that will be used when creating a
            Collection during the "evaluate_test_set()" method.

        data_root
            A shortcut for setting the "training_set", "test_set", and
            "category_file" parameters separately. Sets "training_set" to
            "$data_root/training", "test_set" to "$data_root/test", and
            "category_file" (used by some of the Collection classes) to
            "$data_root/cats.txt".
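
        Tying these parameters together (the paths are illustrative):

         my $c = AI::Categorizer->new
           (
            data_root     => 'corpus',   # implies training_set, test_set,
                                         # and category_file as described above
            progress_file => 'myrun',    # produces files like myrun-01-knowledge_set
            verbose       => 1,
           );
         $c->run_experiment;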

    learner()
        Returns the Learner object associated with this Categorizer. Before
        "train()", the Learner will of course not be trained yet.

    knowledge_set()
        Returns the KnowledgeSet object associated with this Categorizer. If
        "read_training_set()" has not yet been called, the KnowledgeSet will not
        yet be populated with any training data.

    run_experiment()
        Runs a complete experiment on the training and testing data, reporting
        the results on "STDOUT". Internally, this is just a shortcut for calling
        the "scan_features()", "read_training_set()", "train()", and
        "evaluate_test_set()" methods, then printing the value of the
        "stats_table()" method.

    scan_features()
        Scans the Collection specified in the "training_set" parameter to
        determine the set of features (words) that will be considered when
        training the Learner. Internally, this calls the "scan_features()"
        method of the KnowledgeSet, then saves a list of the KnowledgeSet's
        features for later use.

        This step is not strictly necessary, but it can dramatically reduce
        memory requirements if you scan for features before reading the entire
        corpus into memory.

    read_training_set()
        Populates the KnowledgeSet with the data specified in the
        "training_set" parameter. Internally, this calls the "read()" method of
        the KnowledgeSet. Returns the KnowledgeSet. Also saves the KnowledgeSet
        object for later use.

    train()
        Calls the Learner's "train()" method, passing it the KnowledgeSet
        created during "read_training_set()". Returns the Learner object. Also
        saves the Learner object for later use.

    evaluate_test_set()
        Creates a Collection based on the value of the "test_set" parameter, and
        calls the Learner's "categorize_collection()" method using this
        Collection. Returns the resultant Experiment object. Also saves the
        Experiment object for later use in the "stats_table()" method.

    stats_table()
        Returns the value of the Experiment's (as created by
        "evaluate_test_set()") "stats_table()" method. This is a string that
        shows various statistics about the accuracy/precision/recall/F1/etc. of
        the assignments made during testing.

HISTORY
    This module is a revised and redesigned version of the previous
    "AI::Categorize" module by the same author. Note the added 'r' in the new
    name. The older module has a different interface, and no attempt at backward
    compatibility has been made - that's why I changed the name.

    You can have both "AI::Categorize" and "AI::Categorizer" installed at the
    same time on the same machine, if you want. They don't know about each other
    or use conflicting namespaces.

AUTHOR
    Ken Williams <ken@mathforum.org>

    Discussion about this module can be directed to the perl-AI list at
    <perl-ai@perl.org>. For more info about the list, see
    http://lists.perl.org/showlist.cgi?name=perl-ai

REFERENCES
    An excellent introduction to the academic field of Text Categorization is
    Fabrizio Sebastiani's "Machine Learning in Automated Text Categorization":
    ACM Computing Surveys, Vol. 34, No. 1, March 2002, pp. 1-47.

COPYRIGHT
    Copyright 2000-2003 Ken Williams. All rights reserved.

    This distribution is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself. These terms apply to every file in the
    distribution - if you have questions, please contact the author.