# Natural Language Toolkit: Classifiers
#
# Copyright (C) 2001-2019 NLTK Project
# Author: Edward Loper <edloper@gmail.com>
# URL: <http://nltk.org/>
# For license information, see LICENSE.TXT

"""
Classes and interfaces for labeling tokens with category labels (or
"class labels").  Typically, labels are represented with strings
(such as ``'health'`` or ``'sports'``).  Classifiers can be used to
perform a wide range of classification tasks.  For example,
classifiers can be used...

- to classify documents by topic
- to classify ambiguous words by which word sense is intended
- to classify acoustic signals by which phoneme they represent
- to classify sentences by their author

Features
========
In order to decide which category label is appropriate for a given
token, classifiers examine one or more 'features' of the token.  These
"features" are typically chosen by hand, and indicate which aspects
of the token are relevant to the classification decision.  For
example, a document classifier might use a separate feature for each
word, recording how often that word occurred in the document.

Featuresets
===========
The features describing a token are encoded using a "featureset",
which is a dictionary that maps from "feature names" to "feature
values".  Feature names are unique strings that indicate what aspect
of the token is encoded by the feature.  Examples include
``'prevword'``, for a feature whose value is the previous word; and
``'contains-word(library)'`` for a feature that is true when a document
contains the word ``'library'``.  Feature values are typically
booleans, numbers, or strings, depending on which feature they
describe.

Featuresets are typically constructed using a "feature detector"
(also known as a "feature extractor").  A feature detector is a
function that takes a token (and sometimes information about its
context) as its input, and returns a featureset describing that token.
For example, the following feature detector converts a document
(stored as a list of words) to a featureset describing the set of
words included in the document:

    >>> # Define a feature detector function.
    >>> def document_features(document):
    ...     return {'contains-word(%s)' % w: True for w in document}

Feature detectors are typically applied to each token before it is fed
to the classifier:

    >>> # Classify each Gutenberg document.
    >>> from nltk.corpus import gutenberg
    >>> for fileid in gutenberg.fileids(): # doctest: +SKIP
    ...     doc = gutenberg.words(fileid) # doctest: +SKIP
    ...     print(fileid, classifier.classify(document_features(doc))) # doctest: +SKIP

The parameters that a feature detector expects will vary, depending on
the task and the needs of the feature detector.  For example, a
feature detector for word sense disambiguation (WSD) might take a
sentence and the index of the word to be classified as its input,
and return a featureset for that word.  The following feature detector
for WSD includes features describing the left and right contexts of
the target word:

    >>> def wsd_features(sentence, index):
    ...     featureset = {}
    ...     # Up to three words before the target word.
    ...     for i in range(max(0, index-3), index):
    ...         featureset['left-context(%s)' % sentence[i]] = True
    ...     # Up to three words after the target word.
    ...     for i in range(index+1, min(index+4, len(sentence))):
    ...         featureset['right-context(%s)' % sentence[i]] = True
    ...     return featureset

Training Classifiers
====================
Most classifiers are built by training them on a list of hand-labeled
examples, known as the "training set".  Training sets are represented
as lists of ``(featureset, label)`` tuples.
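
As a minimal sketch of this representation, the following builds a
small training set from hand-labeled documents.  The toy documents,
their labels, and the helper name ``build_training_set`` are
illustrative assumptions and are not part of NLTK;
``document_features`` follows the feature detector defined earlier.

```python
def document_features(document):
    # One boolean feature per distinct word in the document.
    return {'contains-word(%s)' % w: True for w in document}


def build_training_set(labeled_documents):
    # Hypothetical helper: turn (document, label) pairs into the
    # (featureset, label) tuples that a classifier is trained on.
    return [(document_features(doc), label) for doc, label in labeled_documents]


# Made-up, hand-labeled toy documents (each one is a list of words).
labeled_documents = [
    (['the', 'team', 'won', 'the', 'match'], 'sports'),
    (['the', 'doctor', 'recommended', 'rest'], 'health'),
]

train_set = build_training_set(labeled_documents)
for featureset, label in train_set:
    print(label, sorted(featureset))
```

A list of this shape can then be handed to a trainer, for example
``NaiveBayesClassifier.train(train_set)``.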
"""

from nltk.classify.api import ClassifierI, MultiClassifierI
from nltk.classify.megam import config_megam, call_megam
from nltk.classify.weka import WekaClassifier, config_weka
from nltk.classify.naivebayes import NaiveBayesClassifier
from nltk.classify.positivenaivebayes import PositiveNaiveBayesClassifier
from nltk.classify.decisiontree import DecisionTreeClassifier
from nltk.classify.rte_classify import rte_classifier, rte_features, RTEFeatureExtractor
from nltk.classify.util import accuracy, apply_features, log_likelihood
from nltk.classify.scikitlearn import SklearnClassifier
from nltk.classify.maxent import (
    MaxentClassifier,
    BinaryMaxentFeatureEncoding,
    TypedMaxentFeatureEncoding,
    ConditionalExponentialClassifier,
)
from nltk.classify.senna import Senna
from nltk.classify.textcat import TextCat