NAME
    README Introduction to Ngram Statistics Package (Text-NSP)

SYNOPSIS
    This document provides a general introduction to the Ngram Statistics
    Package.

DESCRIPTION
  1. Introduction
    The Ngram Statistics Package (NSP) is a suite of programs that aids in
    analyzing Ngrams in text files. We define an Ngram as a sequence of 'n'
    tokens that occur within a window of at least 'n' tokens in the text;
    what constitutes a "token" can be defined by the user.

    In earlier versions (v0.1, v0.3, v0.4) this package was known as the
    Bigram Statistics Package (BSP). The name change reflects the widening
    scope of the package in moving beyond Bigrams to Ngrams.

    NSP consists of two core programs and several utilities:

    Program count.pl takes flat text files as input and generates a list of
    all the Ngrams that occur in those files. The Ngrams, along with their
    frequencies, are output in descending order of their frequency.

    Program statistic.pl takes as input a list of Ngrams with their
    frequencies (in the format output by count.pl) and runs a user-selected
    statistical measure of association to compute a "score" for each Ngram.
    The Ngrams, along with their scores, are output in descending order of
    this score. The statistical score computed for each Ngram can be used to
    decide whether or not there is enough evidence to reject the null
    hypothesis (that the Ngram is not a collocation) for that Ngram.

    Various utility programs are found in bin/utils/ and take as their input
    the results (output) from count.pl and/or statistic.pl.

    rank.pl takes as input two files output by statistic.pl and computes
    Spearman's rank correlation coefficient on the Ngrams that are common to
    both files. Typically the two files should be produced by applying
    statistic.pl to the same Ngram count file but with two different
    statistical measures. In such a scenario, the value output by rank.pl
    can be used to measure how similar the two measures are. A value close
    to 1 indicates that the two measures rank Ngrams in the same order, -1
    that the two orderings are exactly opposite to each other, and 0 that
    they are not related.

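    For reference, the underlying calculation can be sketched in a few
    lines of Perl. This is an illustrative sketch only (not the rank.pl
    implementation), and it assumes the two rank lists are already aligned
    on the common Ngrams and contain no tied ranks:

     use strict;
     use warnings;

     # Spearman's rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))
     sub spearman {
         my ($ranks1, $ranks2) = @_;   # refs to aligned rank arrays
         my $n = scalar @$ranks1;
         my $sum_d2 = 0;
         for my $i (0 .. $n - 1) {
             my $d = $ranks1->[$i] - $ranks2->[$i];
             $sum_d2 += $d * $d;
         }
         return 1 - (6 * $sum_d2) / ($n * ($n * $n - 1));
     }

     print spearman([1, 2, 3, 4], [1, 2, 3, 4]), "\n";  # identical orderings: 1
     print spearman([1, 2, 3, 4], [4, 3, 2, 1]), "\n";  # opposite orderings: -1
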
    kocos.pl takes as input a file output by count.pl or statistic.pl and
    uses that to identify kth order co-occurrences of a given word. A kth
    order co-occurrence of a target WORD is a word that co-occurs with a
    (k-1)th order co-occurrence of the given target WORD. So A is a 2nd
    order co-occurrence of X if X occurs with B and B occurs with A. Put
    more concretely, in "New York", "New" and "York" co-occur (they are 1st
    order co-occurrences). In "New Jack", "New" and "Jack" co-occur. Thus,
    "Jack" and "York" are second order co-occurrences because they both
    co-occur with "New".

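    The idea can be sketched in Perl as follows (illustrative only, not
    the kocos.pl implementation), using the "New York" / "New Jack"
    example above:

     use strict;
     use warnings;

     # build a symmetric co-occurrence map from word pairs
     my @pairs = (["New", "York"], ["New", "Jack"]);
     my %cooccur;
     for my $p (@pairs) {
         $cooccur{ $p->[0] }{ $p->[1] } = 1;
         $cooccur{ $p->[1] }{ $p->[0] } = 1;
     }

     # 2nd order co-occurrences of the target: words that co-occur
     # with a 1st order co-occurrence of the target
     my $target = "York";
     my %second;
     for my $first (keys %{ $cooccur{$target} }) {
         for my $word (keys %{ $cooccur{$first} }) {
             $second{$word} = 1 unless $word eq $target;
         }
     }
     print join(" ", sort keys %second), "\n";   # prints: Jack
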
    combig.pl will take the output of count.pl and find unordered counts of
    bigrams. Normally count.pl treats bigrams like "fine wine" and "wine
    fine" as distinct. combig.pl (combine bigram) will adjust the counts
    such that they do not depend on the order. So one could then go on to
    measure how much the words "fine" and "wine" are associated without
    respect to their order.

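    The counting adjustment can be sketched as follows (illustrative only,
    not the combig.pl implementation); the bigram counts here are made-up
    numbers:

     # merge the counts of "fine<>wine<>" and "wine<>fine<>" into a
     # single unordered count keyed on the sorted word pair
     my %count = ("fine<>wine<>" => 3, "wine<>fine<>" => 1);
     my %unordered;
     for my $bigram (keys %count) {
         my ($w1, $w2) = $bigram =~ /^(.+?)<>(.+?)<>$/;
         my $key = join("<>", sort $w1, $w2) . "<>";
         $unordered{$key} += $count{$bigram};
     }
     # %unordered now holds: fine<>wine<> => 4
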
    huge-count.pl allows a user to run count.pl on much larger corpora. It
    essentially takes the complete bigram list generated by count.pl with
    the --tokenlist option, splits that list into smaller pieces, and then
    sorts and merges the pieces to produce the final output. huge-count.pl
    uses bin/utils/huge-split.pl, bin/utils/huge-sort.pl,
    bin/utils/huge-merge.pl and bin/utils/huge-delete.pl.

    This README continues with an introduction to the basic definitions of
    tokens, the tokenization process and the Ngram formation process. This
    is followed by a description of the two main programs in this suite
    (count.pl and statistic.pl) and brief notes on how one could typically
    use each of them. The programs rank.pl, kocos.pl, and combig.pl are
    described in separate READMEs in the bin/utils/ directory.

  2. Tokens
    We define a token as a contiguous sequence of characters that matches
    one of a set of regular expressions. These regular expressions may be
    user-provided, or, if not provided, are assumed to be the following two
    regular expressions:

      \w+        -> this matches a contiguous sequence of alpha-numeric characters

      [\.,;:\?!] -> this matches a single punctuation mark

    For example, assume the following is a line of text:

    "the stock markets fell by 20 points today!"

    Then, using the above regular expressions, we get the following tokens:

        the       stock     markets
        fell      by        20
        points    today     !

    Now assume that the user provides the following lone regular expression:

      [a-zA-Z]+  -> this matches a contiguous sequence of alphabetic characters

    Then, we get the following tokens:

        the       stock     markets
        fell      by        points
        today

  3. The Tokenization Process:
    Given a text file and a set of regular expressions, the text is
    "tokenized", that is, broken up into tokens. To do so, the entire input
    text is considered as one long "input string" with new-line characters
    being replaced by space characters (this is the default behaviour and
    can be modified; see point 4 below). Then, the following is done:

     while the input string is non-empty

        foreach regular expression r
            if r is matched by a sequence of characters starting with the first
            character in the input string...
                quit this for loop
            end if
        end foreach

        if we have a matching regular expression r
            the portion of the input string matched by r is our next token. remove
            this token from the input string.
        else
            remove the first character from the input string
        end if

     end while

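    The loop above translates almost directly into Perl. The following is
    a minimal sketch of the process (not the count.pl implementation),
    assuming the regular expressions are supplied as precompiled patterns:

     sub tokenize {
         my ($text, @regexes) = @_;
         my @tokens;
         while (length $text) {
             my $matched = 0;
             for my $re (@regexes) {
                 # the match must start at the first character
                 if ($text =~ /\A($re)/) {
                     push @tokens, $1;
                     $text = substr($text, length $1);  # remove the token
                     $matched = 1;
                     last;   # stop at the first matching regular expression
                 }
             }
             # no match: drop the first character as a non-token
             $text = substr($text, 1) unless $matched;
         }
         return @tokens;
     }

     # prints: why s the stock falling ?
     print join(" ", tokenize("why's the stock falling?",
                              qr/\w+/, qr/[\.,;:\?!]/)), "\n";
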
   3.1 Notes:
    3.1.1. In looking for a regular expression that yields a successful
    match (in the foreach loop above), we want a regular expression that
    matches the input string starting with the first character of the input
    string. Thus, the regular expression /b/ matches the input string "be
    good" but not the input string " be good".

    3.1.2. If none of the regular expressions gives a successful match, then
    the first character in the input string is removed. This character is
    considered a "non-token" and is henceforth ignored.

    3.1.3. Since the matching process (the foreach loop above) stops at the
    first match, the order in which the regular expressions are tested is
    important. The order is exactly the order in which they are provided by
    the user, or, if the default regular expressions are used, the order in
    which they are listed above.

   3.2 Examples:
   3.2.1 Example 1:
    3.2.1.1. Input text:

        why's the stock falling?

    3.2.1.2. Regular expressions:

        \w+
        [\.,;:\?!]

    3.2.1.3. Resulting tokens:

        why       s         the
        stock     falling   ?

    3.2.1.4. Explanation:

    Initially our input string is the entire input text: "why's the stock
    falling?". The first token found is "why", which matches the regular
    expression /\w+/. This token is removed, and our input string becomes
    "'s the stock falling?".

    Now neither of the regular expressions can match the ' character. Thus
    this character is considered a non-token and is removed, leaving the
    input string like so: "s the stock falling?".

    "s" is now matched by /\w+/, and this forms our next token. Upon
    removing this token, we get the following input string: " the stock
    falling?".

    Again, neither of the regular expressions matches this input string, and
    the leading space character is removed as a non-token. Similarly the
    rest of the line is tokenized to yield the tokens "the", "stock",
    "falling" and "?".

   3.2.2 Example 2:
    3.2.2.1. Input text:

        why's the stock falling?

    3.2.2.2. Regular expressions:

        /fall/
        /falling/
        /stock/

    3.2.2.3. Resulting tokens:

        stock     fall

    3.2.2.4. Explanation:

    Initially our input string is the entire input text: "why's the stock
    falling?". None of the regular expressions match, and we remove the
    first character to get as input string the following: "hy's the stock
    falling?". Similarly, again the regular expressions don't match, and we
    have to remove the first character. This goes on until our input string
    becomes: "stock falling?".

    Now "stock" matches the regular expression /stock/, and this token is
    removed, leaving " falling?" as the input string. Since the space
    character does not form a token, it is removed. Now we have "falling?"
    as our input string.

    Now observe that we have two regular expressions, /fall/ and /falling/,
    both of which can match the input string. However, since /fall/ appears
    before /falling/ in the list, the token formed is "fall". This leaves
    our input string as: "ing?". None of the regular expressions match this
    or any of the subsequent input strings obtained by removing the first
    characters one by one. Hence we get as tokens "stock" and "fall".

   3.2.3 Example 3:
    3.2.3.1. Input text:

        why's the stock falling?

    3.2.3.2. Regular expressions:

        /falling/
        /fall/
        /stock/

    3.2.3.3. Resulting tokens:

        stock     falling

    3.2.3.4. Explanation:

    Observe that this example differs from the previous one only in the
    order of the regular expressions. The tokenization proceeds exactly as
    in the previous example, until we have as our input string "falling?".
    Here, we have /falling/ as our first regular expression, and so we get
    "falling" as our token.

    Examples 3.2.2 and 3.2.3 demonstrate the importance of the order in
    which the regular expressions are provided to the tokenization process.

   3.2.4 Example 4:
    3.2.4.1. Input text:

        why's the stock falling?

    3.2.4.2. Regular expressions:

        /the stock/
        /\w+/

    3.2.4.3. Resulting tokens:

        why       s       the stock
        falling

    3.2.4.4. Explanation:

    The thing to note here is that one of the regular expressions has an
    embedded space character in it. This causes no problems: our definition
    of a token allows embedded space characters! Once our input
    string is "the stock falling?", the regular expression /the stock/ is
    matched, and the string "the stock" forms our next token.

  4. Ngrams:
    An Ngram is a sequence of n tokens. We shall delimit tokens in an Ngram
    by the diamond symbol, i.e. "<>". Thus, "big<>boy<>" is a bigram whose
    tokens are "big" and "boy". Similarly, "stock<>falling<>?<>" is a
    trigram whose tokens are "stock", "falling" and "?". "the
    stock<>falling<>" is a bigram with tokens "the stock" and "falling".

    Given a piece of text, Ngrams are usually formed of contiguous tokens.
    For instance, let's take example 3.2.1, where our tokens, in the order
    in which they appear in the text, are the following:

        why      s      the      stock      falling      ?

    Then, the following are all the bigrams:

        why<>s<>            s<>the<>        the<>stock<>
        stock<>falling<>    falling<>?<>

    The following are all the trigrams:

        why<>s<>the<>           s<>the<>stock<>
        the<>stock<>falling<>   stock<>falling<>?<>

    The following are all the 4-grams:

        why<>s<>the<>stock<>
        s<>the<>stock<>falling<>
        the<>stock<>falling<>?<>

    Etcetera.

    The Ngrams shown above are all formed from contiguous tokens. Although
    this is the default, we also allow Ngrams to be formed from
    non-contiguous tokens.

    To do so, we first define a "window" of size k to be a sequence of k
    contiguous tokens, where the value of k is greater than or equal to the
    value of n for the Ngrams. An Ngram can be formed from any n tokens as
    long as all the tokens belong to a single window of size k. Further, the
    n tokens must occur in the Ngram in exactly the same order as they occur
    in the window.

    Put another way, given a window of k tokens, we drop k-n tokens from the
    window, and what remains is an Ngram!

    Thus for instance, taking example 3.2.1 again, recall that our tokens,
    in the order in which they occur in the text, are the following:

        why      s      the      stock      falling      ?

    Then, the following are all the bigrams with a window size of 3:

        why<>s<>               why<>the<>         s<>the<>
        s<>stock<>             the<>stock<>       the<>falling<>
        stock<>falling<>       stock<>?<>         falling<>?<>

    The following are all the bigrams with a window size of 4:

        why<>s<>               why<>the<>         why<>stock<>
        s<>the<>               s<>stock<>         s<>falling<>
        the<>stock<>           the<>falling<>     the<>?<>
        stock<>falling<>       stock<>?<>         falling<>?<>

    The following are all the trigrams with a window size of 4:

        why<>s<>the<>          why<>s<>stock<>     why<>the<>stock<>
        s<>the<>stock<>        s<>the<>falling<>   s<>stock<>falling<>
        the<>stock<>falling<>  the<>stock<>?<>     the<>falling<>?<>
        stock<>falling<>?<>

    Etc.

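    For the bigram case, the window scheme above amounts to pairing the
    first token of each window with every later token in that window. A
    minimal sketch (not the count.pl implementation):

     # all bigrams from @tokens within a window of size $k
     my @tokens = qw(why s the stock falling ?);
     my $k = 3;
     for my $i (0 .. $#tokens - 1) {
         my $last = $i + $k - 1;
         $last = $#tokens if $last > $#tokens;  # windows shrink at the end
         for my $j ($i + 1 .. $last) {
             print "$tokens[$i]<>$tokens[$j]<>\n";
         }
     }
     # prints the nine window-size-3 bigrams listed above
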
  5. Program count.pl:
    This program takes as input a flat ASCII text file and outputs all
    Ngrams, or token sequences of length 'n', where the value of 'n' can be
    decided by the user. Non-contiguous Ngrams within a window of size 'k'
    as described above can also be found and output. For every output Ngram,
    its frequency of occurrence as well as the frequencies of all the
    combinations of the tokens it is made up of are output. Details follow.

   5.1. Default Way to Run count.pl:
    The most basic way of running this program is the following:

    Example 5.1: count.pl output.txt input.txt

    where input.txt is the input text file in which to find the Ngrams and
    output.txt is the output file into which count.pl will put all the
    Ngrams with their frequencies.

   5.2. Changing the Length of Ngrams and the Size of the Window:
    Several default values are in use when the program is run this way. For
    example, it is assumed that one is counting bigrams, that is, the value
    of 'n' is 2. This can be changed by using the option --ngram N, where
    'N' is the number of tokens you want in each Ngram. Thus, to find all
    trigrams in input.txt, run count.pl thus:

    Example 5.2: count.pl --ngram 3 output.txt input.txt

    Another default value in use is the window size. Window size defaults to
    the value of 'n' for Ngrams. Thus, in example 5.1 the window size was 2,
    while in example 5.2, because of the --ngram 3 option, the window size
    was 3. This can be changed using the --window N option. Thus, for
    example, to find all bigrams within windows of size 3, one would run the
    program like so:

    Example 5.3a: count.pl --window 3 output.txt input.txt

    Similarly, to find all trigrams within a window of size 4:

    Example 5.3b: count.pl --ngram 3 --window 4 output.txt input.txt

   5.3. Using User-Provided Token Definitions:
    In all these examples, the tokenization and Ngram formation proceed as
    described in sections 3 and 4 above. In these examples, the default
    token definitions are used:

     \w+        -> this matches a contiguous sequence of alpha-numeric characters
     [\.,;:\?!] -> this matches a single punctuation mark

    As mentioned previously, these default token definitions can be
    over-ridden by using the option --token FILE, where FILE is the name of
    the file containing the regular expressions on which the token
    definitions will be based. Each regular expression in this FILE should
    be on a line of its own, and should be delimited by the forward slash
    '/'. Further, these should be valid Perl regular expressions, as defined
    in [1], which means for example that any occurrence of the forward slash
    '/' within the regular expression must be 'escaped'.

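    For example, a token file that reproduces the two default token
    definitions would contain exactly these two lines:

     /\w+/
     /[\.,;:\?!]/
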
   5.4. Removing Character Strings via the --nontoken Option:
    This option allows a user to define regular expressions that will match
    strings that should not be considered tokens. These strings will be
    removed from the data and not counted or included in Ngrams.

    The --nontoken option is recommended when there are predictable
    sequences of characters that you know should not be included as tokens
    for purposes of counting Ngrams, finding collocations, etc.

    For example, if mark-up symbols like <s>, <p>, [item], [/ptr] exist in
    the text being processed, you may want to include those in your list of
    nontoken items so they are discarded. If not, a simple regex such as
    /\w+/ will match 's', 'p', 'item', 'ptr' from these tags, leading
    to confusing results.

    The --nontoken option on the command line should be followed by a file
    name (NON_TOKEN). This file should contain Perl regular expressions
    delimited by forward slashes '/' that define non-tokens. Multiple
    expressions may be placed on separate lines or be separated via the '|'
    (Perl 'or'), as in /regex1|regex2|../

    The following are some examples of valid non-token definitions.

     /<\/?s|p>/ : will remove xml tags like <s>, <p>, </s>, </p>.

     /\[\w+\]/  : will remove all words which appear in square brackets, like
             [p], [item], [123] and so on.

    count.pl will first remove any string from the input data that matches
    the non-token regular expression, and only then match the remaining
    data against the token definitions. Thus, if by chance a string matches
    both the token and nontoken definitions, it will be removed, as
    --nontoken has a higher priority than --token or the default token
    definitions.

   5.5. The Output Format of count.pl:
    Assume that the following are the contents of the input text file to
    count.pl; let us call the file test.txt:

     first line of text
     second line
     and a third line of text

    Further assume that count.pl is run like so:

     count.pl test.cnt test.txt

    Thus, test.cnt will have all the bigrams found in file test.txt, using a
    window size of 2 and using the two default token definitions as above.
    Following then are the contents of file test.cnt:

     11
     line<>of<>2 3 2
     of<>text<>2 2 2
     second<>line<>1 1 3
     line<>and<>1 3 1
     and<>a<>1 1 1
     a<>third<>1 1 1
     first<>line<>1 1 3
     third<>line<>1 1 3
     text<>second<>1 1 1

    The number on the first line, 11, indicates that there were a total of
    11 bigrams in the input file.

    From the next line onwards, the various bigrams found are listed. Recall
    that the tokens of the Ngrams are delimited by the diamond signs: <>.
    Thus the bigram on the first line is line<>of<>, made up of the tokens
    "line" and "of" in that order; the bigram on the second line is
    of<>text<>, made up of the tokens "of" and "text", etc.

    After the diamond following the last token there are three numbers. The
    first of these numbers denotes the number of times this Ngram occurs in
    the input text file. Thus bigram line<>of<> occurs 2 times in the input
    file, as does bigram of<>text<>. The second number denotes in how many
    bigrams the token "line" occurs as the left-hand token. In this case,
    "line" occurs on the left of three bigrams, namely the two copies of
    bigram "line<>of<>" and the bigram "line<>and<>". Similarly, the third
    number denotes the number of bigrams in which the word "of" occurs as
    the right-hand token. In this case, "of" occurs on the right of two
    bigrams, namely the two copies of the bigram "line<>of<>".

    Similar output is obtained for trigrams. Assume again that the input
    file is as above, and assume that count.pl is run thusly:

     count.pl --ngram 3 test.cnt test.txt

    The output test.cnt file is as follows:

     10
     line<>of<>text<>2 3 2 2 2 2 2
     and<>a<>third<>1 1 1 1 1 1 1
     third<>line<>of<>1 1 3 2 1 1 2
     second<>line<>and<>1 1 3 1 1 1 1
     line<>and<>a<>1 3 1 1 1 1 1
     a<>third<>line<>1 1 1 2 1 1 1
     text<>second<>line<>1 1 1 2 1 1 1
     of<>text<>second<>1 1 1 1 1 1 1
     first<>line<>of<>1 1 3 2 1 1 2

    Once again, the number on the first line says that there are 10 trigrams
    in the input text file. The first trigram in the list is
    "line<>of<>text<>", made up of the tokens "line", "of" and "text" in
    that order. Similarly, the next trigram is "and<>a<>third<>", made up of
    the tokens "and", "a" and "third".

    Observe that this time there are more numbers after the last token. The
    first number denotes, as before, the number of times this trigram occurs
    in the input text file. Thus, "line<>of<>text<>" occurs twice in the
    input file while "and<>a<>third<>" occurs just once. The second, third
    and fourth numbers denote the number of trigrams in which the tokens
    "line", "of" and "text" appear in the first, second and third positions
    respectively. Thus, "line" occurs as the token in the first position in
    3 trigrams, namely the 2 copies of "line<>of<>text<>" and one copy of
    "line<>and<>a<>". Similarly, the tokens "of" and "text" appear as the
    second and third tokens respectively of two trigrams, namely the two
    copies of "line<>of<>text<>".

    The fifth number denotes the number of trigrams in which "line" occurs
    as the first token and "of" occurs as the second token. Once again,
    there are only two trigrams in which this happens: the two copies of
    "line<>of<>text<>". The sixth number denotes the number of trigrams in
    which "line" occurs as the token in the first place and "text" occurs as
    the token in the third place. The seventh number denotes the number of
    trigrams in which "of" occurs as the token in the second place and
    "text" occurs as the token in the third place.

    In general, assume we are dealing with Ngrams of size 'n'. Given an
    Ngram, denote its leftmost token as w[0], the next token as w[1], and so
    on until w[n-1]. Further, let f(a, b, ..., c) be the number of Ngrams
    that have token w[a] in position a, token w[b] in position b, ... and
    token w[c] in position c, where 0 <= a < b < ... < c < n.

    Then, given an Ngram, the first frequency value reported is f(0, 1, ...,
    n-1).

    This is followed by n frequency values, f(0), f(1), ..., f(n-1).

    This is followed by (n choose 2) values, f(0, 1), f(0, 2), ..., f(0,
    n-1), f(1, 2), ..., f(1, n-1), ..., f(n-2, n-1).

    This is followed by (n choose 3) values, f(0, 1, 2), f(0, 1, 3), ...,
    f(0, 1, n-1), f(0, 2, 3), ..., f(0, 2, n-1), ..., f(0, n-2, n-1), ...,
    f(1, 2, 3), ..., f(n-3, n-2, n-1).

    And so on, until (n choose n-1), that is n, frequency values: f(0, 1,
    ..., n-2), f(0, 1, ..., n-3, n-1), f(0, 1, ..., n-4, n-2, n-1), ...,
    f(1, 2, ..., n-1).

    This gives us a total of 2^n - 1 possible frequency values. We call each
    such frequency value a "frequency combination", since it expresses the
    number of Ngrams that have a given combination of one or more tokens in
    one or more fixed positions. By default all such combinations are
    printed, exactly in the order shown above. To see which combinations
    are being printed one could use the option --get_freq_combo FILE. This
    prints to the file the inputs to the imaginary 'f' function defined
    above, exactly in the order the frequency values occur in the main
    output. Thus for instance, running the program like so:

     count.pl --get_freq_combo freq_combo.txt test.cnt test.txt

    and assuming that the test.txt file is the one shown above, the
    following output is created in file freq_combo.txt:

     0 1
     0
     1

    and the following output in file test.cnt:

     11
     line<>of<>2 3 2
     of<>text<>2 2 2
     second<>line<>1 1 3
     line<>and<>1 3 1
     and<>a<>1 1 1
     a<>third<>1 1 1
     first<>line<>1 1 3
     third<>line<>1 1 3
     text<>second<>1 1 1

    Recall that since the option --ngram is not being used, the default
    value of n, 2, is being used here. After each bigram in the test.cnt
    file are three numbers; the first number corresponds to f(0, 1), the
    second number corresponds to f(0) and the third to f(1). Observe that
    line 'i' of the output in file freq_combo.txt represents the input to
    the imaginary 'f' function that creates the 'i'th frequency value on
    each line of the output in file test.cnt.

    Similarly, running the program thus:

     count.pl --ngram 3 --get_freq_combo freq_combo.txt test.cnt test.txt

    produces the following output in freq_combo.txt:

     0 1 2
     0
     1
     2
     0 1
     0 2
     1 2

    and the following output in file test.cnt:

     10
     line<>of<>text<>2 3 2 2 2 2 2
     and<>a<>third<>1 1 1 1 1 1 1
     third<>line<>of<>1 1 3 2 1 1 2
     second<>line<>and<>1 1 3 1 1 1 1
     line<>and<>a<>1 3 1 1 1 1 1
     a<>third<>line<>1 1 1 2 1 1 1
     text<>second<>line<>1 1 1 2 1 1 1
     of<>text<>second<>1 1 1 1 1 1 1
     first<>line<>of<>1 1 3 2 1 1 2

    The seven numbers after each trigram in file test.cnt correspond
    respectively to f(0, 1, 2), f(0), f(1), f(2), f(0, 1), f(0, 2) and f(1,
    2), as shown in the file freq_combo.txt.

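    The default ordering of these combinations can be reproduced with a
    short recursive enumeration. The following is an illustrative sketch
    (not the count.pl implementation):

     # size-$m subsets of ($start .. $n - 1), in lexicographic order
     sub combos {
         my ($start, $n, $m) = @_;
         return ([]) if $m == 0;
         my @result;
         for my $i ($start .. $n - $m) {
             push @result, [$i, @$_] for combos($i + 1, $n, $m - 1);
         }
         return @result;
     }

     # full set first, then all subsets of size 1, 2, ..., n-1
     my $n = 3;
     print join(" ", 0 .. $n - 1), "\n";
     for my $m (1 .. $n - 1) {
         print join(" ", @$_), "\n" for combos(0, $n, $m);
     }
     # prints the seven trigram combinations shown in freq_combo.txt above
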
    It is possible that the user may not require all the frequency values
    output by default, or that the user requires the frequency values in a
    different order. To change the default frequency values output, one may
    provide count.pl with a file containing the inputs to the 'f' function,
    using the option --set_freq_combo.

    Thus for instance, if the user wants to create trigrams, and only
    requires the frequencies of the trigrams and the frequency values of the
    three tokens in the trigrams (and not of the pairs of tokens), then he
    may create the following file (say, user_freq_combo.txt):

     0 1 2
     0
     1
     2

    and provide this file to the count.pl program thus:

    count.pl --ngram 3 --set_freq_combo user_freq_combo.txt test.cnt
    test.txt

    This produces the following test.cnt file:

     10
     line<>of<>text<>2 3 2 2
     and<>a<>third<>1 1 1 1
     third<>line<>of<>1 1 3 2
     second<>line<>and<>1 1 3 1
     line<>and<>a<>1 3 1 1
     a<>third<>line<>1 1 1 2
     text<>second<>line<>1 1 1 2
     of<>text<>second<>1 1 1 1
     first<>line<>of<>1 1 3 2

    Observe that the only difference between this output and the default
    output is that instead of reporting 7 frequency values per Ngram, only
    the 4 requested are output.

    count2huge.pl converts the output of count.pl to the format produced by
    huge-count.pl. The program sorts the bigrams in alphabetical order and
    generates the same output as huge-count.pl. The reason we sort the
    bigrams is that when we use the bigram list to generate a co-occurrence
    matrix for the vector relatedness measure of UMLS-Similarity, the input
    bigrams that start with the same term must be grouped together. Sorting
    the bigrams when creating the co-occurrence matrix improves efficiency.

   5.6. "Stopping" the Ngrams:
    The user may "stop" the Ngrams formed by count.pl by providing a list of
    stop-tokens through the option --stop FILE. Each stop token in FILE
    should be a Perl regular expression that occurs on a line by itself.
    This expression should be delimited by forward slashes, as in /REGEX/.
    All regular expression capabilities in Perl are supported except for
    regular expression modifiers (like the "i" in /REGEX/i).

    The following are a few examples of valid entries in the stop list.

     /^\d+$/
     /\bthe\b/
     /\b[Tt][Hh][Ee]\b/
     /^and$/
     /\bor\b/
     /^be(ing)?$/

    There are two modes in which a stop list can be used, AND and OR. The
    default mode is AND, which means that an Ngram must be made up entirely
    of words from the stoplist before it is eliminated. The OR mode
    eliminates an Ngram if any of the words that make up the Ngram are found
    in the stoplist.

    The mode is specified via an extended option that should appear on the
    first line of the stop file. For example,

     @stop.mode=AND
     /^for$/
     /^the$/
     /^\d+$/

    would eliminate bigrams such as 'for the', 'for 10', etc., where both
    elements of the bigram are from the stop list, but would not remove
    bigrams like '10 dollars' or 'of the'.

     @stop.mode=OR
     /^for$/
     /^the$/
     /^\d+$/

    would eliminate bigrams such as 'for our', '10 dollars', etc., where at
    least one element of the bigram is from the stop list.

    If the @stop.mode= option is not specified, the default value is AND.

    In both modes, Ngrams that are eliminated do not add to the various
    Ngram and individual word frequency counts. Ngrams that are "stoplisted"
    are treated as if they never existed and are not counted.

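    The AND/OR decision for a single Ngram can be sketched as follows
    (illustrative only, not the count.pl implementation):

     # returns true if the Ngram should be eliminated
     sub stopped {
         my ($mode, $stop_regexes, @words) = @_;
         my $hits = grep {
             my $w = $_;
             grep { $w =~ $_ } @$stop_regexes;
         } @words;
         return $mode eq "OR" ? $hits > 0          # any word matches
                              : $hits == @words;   # every word matches
     }

     my @stop = (qr/^for$/, qr/^the$/, qr/^\d+$/);
     print stopped("AND", \@stop, "for", "the")    ? "drop\n" : "keep\n"; # drop
     print stopped("AND", \@stop, "10", "dollars") ? "drop\n" : "keep\n"; # keep
     print stopped("OR",  \@stop, "10", "dollars") ? "drop\n" : "keep\n"; # drop
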
   5.6.1 Usage Notes for Regular Expressions in Stop Lists:
    (1) In Perl regular expressions, \b specifies a word boundary, and ^ and
    $ specify the start and end of a string (or line of text). These can be
    used in defining your stop list entries, but must be used somewhat
    carefully.

    count.pl examines each token individually, thereby treating each as a
    separate string or line. As a result, you can use either /\bregex\b/ or
    /^regex$/ to exactly match a token made up of alphanumeric characters,
    as in /\bcat\b/ or /^cat$/. However, please note that if a token
    consists of other characters (as in n.b.a.) they can behave differently.
    Suppose for example that your token is www.dot.com. If you have a stop
    list entry /\bwww\b/ it will match the 'www' portion of the token, since
    the '.' is considered to be a word boundary. /^www$/ would not have that
    problem.

    (2) If instead of /^the$/, the regex /the/ is used as a stop regex, then
    every token that matches /the/ will be removed. So tokens like 'there',
    'their', 'weather' and 'together' will be excluded with the stop regex
    /the/. On the other hand, with the regex /^the$/, all occurrences of
    only the word 'the' will be removed.

    (3) You can also use a stop regex /^the/ to remove tokens that begin
    with 'the' like 'their' or 'them' but not 'together'. Similarly, stop
    regex /the$/ will remove all tokens which end in 'the' like 'swathe' or
    'tithe' but not 'together' or 'their'.

    (4) Please note that stoplist handling changed as of version 0.53. If
    you use a stoplist developed for an earlier version of NSP, then it will
    not behave in the same way!!

    In earlier versions when you specified /regex/ as a stoplist item, we
    assumed that you really meant /\bregex\b/ and proceeded accordingly.
    However, since regular expressions are now fully supported, we require
    that you specify exactly what you mean. So if you include /is/ as a
    member of your stoplist, we will now assume that you mean any word that
    contains 'is' somewhere within it (like 'this' or 'kiss' or 'isthmus'
    ...). To preserve the functionality of your old stoplists, simply
    convert them from

     /the/
     /is/
     /of/

    to

     /\bthe\b/
     /\bis\b/
     /\bof\b/

    (5) Regex modifiers like i or g, which come after the end slash, as in:

     /regex/i
     /regex/g

    are not supported. See FAQ.txt for an explanation.

    This makes it slightly inconvenient to specify that you would like to
    stop any form of a given word. For example, if you wanted to stop 'THE',
    'The', 'THe', etc. you would have to specify a regex such as

     /[Tt][Hh][Ee]/

   5.6.2. Differences between --nontoken and --stop:
    In theory we can remove "unwanted" words using either the --nontoken
    option or the --stop option. However, these are rather different
    techniques.

    --stop only removes stop words after they are recognized as valid
    tokens. Thus, if you wish to remove some markup tags like [p] or [item]
    from the data using a stop list, you first need to recognize these as
    tokens (via a --token definition like /\[\w+\]/) and then remove them
    with a --stop list.

    In addition, the --stop option operates on an Ngram and does not remove
    individual words. It removes Ngrams (and reduces the count of the number
    of Ngrams in the sample). In other words, the --stop option only comes
    into effect after the Ngrams have been created.

    On the other hand, the --nontoken option eliminates individual
    occurrences of a non-token sequence before finding Ngrams.

    Some examples to clarify the distinction between --stop and --nontoken:

    -----------------------------------------------------------------------

    Consider an input file count.input =>

      [ptr] <s> this is a test written for count.pl </s> [/ptr]
      their them together wither tithe

    NontokenFile nontoken.regex =>

      /\[\/?\w+\]/
      /<\/?\w+>/

    case (a) StopFile stopfile.txt => /the/
    ----------------------------------------

    Running count.pl with the command:

     count.pl --stop stopfile.txt --nontoken nontoken.regex count.out count.input

    will first remove all nontokens from the input file. Hence the tokenized
    text from which the bigrams will be created will be =>

      this is a test written for count.pl
      their them together wither tithe

    Since the StopFile contains /the/, all tokens which include 'the' are
    eliminated. Thus, the bigrams:

     their<>them<>
     them<>together<>
     together<>wither<>
     wither<>tithe<>

    will all be removed. This is because each word in each bigram contains
    "the" and the default stop mode is AND. Note that if there were a bigram
    such as "on<>their<>" it would not be removed, since both words do not
    match the stoplist. The output file count.out will contain the
    following:

     count.out=>

     9
     test<>written<>1 1 1
     this<>is<>1 1 1
     a<>test<>1 1 1
     is<>a<>1 1 1
     for<>count<>1 1 1
     .<>pl<>1 1 1
     count<>.<>1 1 1
     written<>for<>1 1 1
     pl<>their<>1 1 1

    case (b) StopFile stopfile.txt => /^the/

    ----------------------------------------

    Running count.pl with the command:

     count.pl --stop stopfile.txt --nontoken nontoken.regex count.out count.input

    will first remove all nontokens from the input file. The tokenized text
    will be:

            this is a test written for count.pl
            their them together wither tithe

    Since the StopFile contains /^the/, all tokens which begin with "the"
    are eliminated. Thus, the bigram

     their<>them<>

    will be removed since it consists of two words that begin with "the".
    The output file count.out will contain the 12 bigrams shown below.

     count.out=>

     12
     test<>written<>1 1 1
     this<>is<>1 1 1
     a<>test<>1 1 1
     is<>a<>1 1 1
     for<>count<>1 1 1
     them<>together<>1 1 1
     .<>pl<>1 1 1
     count<>.<>1 1 1
     written<>for<>1 1 1
     pl<>their<>1 1 1
     wither<>tithe<>1 1 1
     together<>wither<>1 1 1

     case (c) StopFile stopfile.txt => @stop.mode=OR
              /the$/

    ------------------------------------------------

    Running count.pl with the command:

     count.pl --stop stopfile.txt --nontoken nontoken.regex count.out count.input

    will first remove all nontokens from the input file. Hence the tokenized
    text will be:

            this is a test written for count.pl
            their them together wither tithe

    As the StopFile contains /the$/, all tokens which end in 'the' are stop
    words. Thus, in the bigram

     wither<>tithe<>

    "tithe" will match the stoplist since it ends with "the". Therefore,
    this bigram will be eliminated, since the stop mode is OR (meaning that
    if either word is in the stop list then the bigram is eliminated). The
    output file count.out will contain the 12 bigrams shown below.

     count.out=>

     12
     test<>written<>1 1 1
     this<>is<>1 1 1
     a<>test<>1 1 1
     is<>a<>1 1 1
     for<>count<>1 1 1
     them<>together<>1 1 1
     .<>pl<>1 1 1
     their<>them<>1 1 1
     count<>.<>1 1 1
     written<>for<>1 1 1
     pl<>their<>1 1 1
     together<>wither<>1 1 1

   5.7. Removing and Not Displaying Low Frequency Ngrams:
    We allow the user either to remove or to not display low frequency
    Ngrams. The user can remove low frequency Ngrams by using the option
    --remove N, by which all Ngrams that occur less than N times are
    removed. The Ngram and individual frequency counts are adjusted
    accordingly upon the removal of these Ngrams.

    The user can choose not to display low frequency Ngrams by using the
    option --frequency N, by which Ngrams that occur less than N times are
    not displayed in the output. Note that this differs from the --remove
    option above in that the various frequency counts are not changed.
    Intuitively, we continue to believe that these Ngrams have occurred in
    the text - we are simply not interested in looking at them. By contrast,
    with the --remove option we want to act as though the Ngrams didn't
    occur in the text in the first place, and so we want our numbers to
    agree with that too!

   5.8. Extended Output:
    Observe that one may modify the actual counting process in various ways
    through the various options above. To keep a "record" of which options
    were used and with what values, one can turn the "extended" output on
    with the switch --extended. The extended output records the size of the
    Ngram, the size of the window, the frequency value at which the Ngrams
    were removed and a list of all the source files used to create the count
    output. If a switch was not used, the default value is printed.

   5.9. Histogram Output:
    The user can also generate a "histogram" output by using the --histogram
    FILE option. This histogram output shows how many Ngrams of a given
    frequency occurred. Following is a typical line out of a histogram
    output:

     Number of n-grams that occurred   5 time(s) =    14 (40.94 percent)

    This says that there were 14 distinct Ngrams that occurred 5 times each,
    and between themselves they make up around 41% of the total number of
    Ngrams.

   5.10. Searching for Source Files in Directories, Recursively if Need Be:
    One would usually provide a source file to create Ngrams from. One could
    also provide a directory name - all text files in that directory are
    then used to create Ngrams. If, along with a directory name, one also
    uses the switch --recurse, all subdirectories inside the source
    directory are searched for text files recursively, and all text files so
    found are used to create Ngrams.

  6. Program statistic.pl:
    Program statistic.pl takes as input a list of Ngrams with their
    frequencies in the format output by count.pl and runs a user-selected
    statistical measure of association to compute a "score" for each Ngram.
    The Ngrams, along with their scores, are output in descending order of
    this score.

    The statistical measures of association are implemented in separate
    Perl packages (files ending with the .pm extension). When running
    statistic.pl, the user needs to provide the name of a statistical
    measure (either from among the ones provided as a part of this
    distribution or those written by the user). Say the name of the
    statistic provided by the user is X. Program statistic.pl will then look
    for the Perl package X.pm (in the current directory, or, failing that,
    the system path). If found, this Perl package file will be loaded and
    then used to calculate the statistic on the list of Ngrams provided.

    Please remember to include the path of the Measures directory (in the
    main NSP package directory) in your system path. This will enable the
    statistic.pl program to find the modules provided with this package.

    As a part of this distribution, we provide statistical packages
    including dice, log-likelihood (ll), mutual information (mi), the
    chi-squared test (x2), and the left-Fisher test of associativity
    (leftFisher). All these packages follow a fixed set of rules as
    discussed below. It is hoped that these rules are easy to follow and
    that new packages may be written quickly and easily.

    In a sense, program statistic.pl is a framework. Its job is to take as
    input Ngrams with their frequencies, to provide those frequencies to the
    statistical library, and to format the output from that library. The
    heart of the statistical measure - the actual calculation - lies in the
    library that can be plugged in. This framework allows for quickly
    rigging up new measures; to do so one need worry only about the actual
    calculation, and not about the various mundane issues that are taken
    care of by statistic.pl.

    This section continues with details on how to run statistic.pl, followed
    by the format of the libraries and tips on how to write them.

   6.1. Default Way to Run statistic.pl:
    The default way to run statistic.pl is as follows:

    statistic.pl dice test.dice test.cnt

     where: dice      is the name of the statistic library to be loaded.
            test.dice is the name of the output file in which the results
                      of applying the dice coefficient will be stored.
            test.cnt  is the name of the input file containing the Ngrams
                      and their various frequency values.

    A Perl package with filename dice.pm is searched for in the Perl @INC
    path. Instead of writing just "dice" on the command line, one may also
    write the file name "dice.pm", or the full measure name
    "Text::NSP::Measures::2D::Dice::dice".

    Once such a file is found, it is imported into statistic.pl and tests
    are done to see if this file has the minimum requirements for a
    statistical library (more details below). If these tests fail,
    statistic.pl stops with an error message. Otherwise the library is
    initialized, and then for each Ngram in file test.cnt, its frequency
    values are passed to the library and the calculated value is noted.
    Finally, when all values have been calculated, the Ngrams are sorted on
    their statistic value and output to file test.dice.

    For example, assume our input test.cnt file is this:

      11
      line<>of<>2 3 2
      of<>text<>2 2 2
      second<>line<>1 1 3
      line<>and<>1 3 1
      and<>a<>1 1 1
      a<>third<>1 1 1
      first<>line<>1 1 3
      third<>line<>1 1 3
      text<>second<>1 1 1

    Thus there are 11 bigrams, the first of which is "line<>of<>", the
    second "of<>text<>", etc.

    Running statistic.pl thusly: statistic.pl dice test.dice test.cnt will
    produce the following test.dice file:

      11
      of<>text<>1 1.0000 2 2 2
      and<>a<>1 1.0000 1 1 1
      a<>third<>1 1.0000 1 1 1
      text<>second<>1 1.0000 1 1 1
      line<>of<>2 0.8000 2 3 2
      third<>line<>3 0.5000 1 1 3
      line<>and<>3 0.5000 1 3 1
      second<>line<>3 0.5000 1 1 3
      first<>line<>3 0.5000 1 1 3

    Once again, the first number is the total number of bigrams - 11. On the
    next line is the highest ranked bigram, "of<>text<>". The first number
    following this bigram, 1, is its rank. The next number, 1.0000, is its
    value computed using the dice statistic. The final three numbers are
    exactly the numbers associated with this Ngram in the test.cnt file.

    Observe that three other bigrams also have the same score of 1.0000 and
    so the same rank 1. The bigram with the next highest score of 0.8000,
    "line<>of<>", is ranked 2nd instead of 5th. This is a feature of our
    ranking mechanism; the fact that a bigram has a rank 'r' implies that
    there are r-1 distinct scores greater than the score of this Ngram. It
    does not imply that there are r-1 bigrams with higher scores.

   6.2. Changing the Default Ngram Size:
    By default, the Ngrams in the input file are assumed to be bigrams. This
    can however be changed by using the option --ngram. Given an Ngram size
    (either by default or by using the --ngram option), statistic.pl checks
    if there are exactly the correct number of tokens in each Ngram. If this
    is not true, an error is printed and statistic.pl halts.

   6.3. Defining the Meaning of the Frequency Values:
    The "meaning" of the various frequency values after each Ngram in the
    input file is important in that the statistic calculated depends on
    them. By default, the default meanings as defined by count.pl are
    assumed.

    count.pl and all statistical libraries (.pm modules) provided with this
    package are implemented such that they produce/accept the frequency
    values in the same order. So for an Ngram

                word1<>word2<>...wordn-1<>

    "the first frequency value reported is f(0,1,...n-1); this is the
    frequency of the Ngram itself. This is followed by n frequency values
    f(0), f(1),...f(n-1); these are the frequencies of the individual tokens
    in their specific positions in the given Ngram. This is followed by (n
    choose 2) values, f(0,1), f(0,2), ..., f(0,n-1), f(1,2), ..., f(1,n-1),
    ... f(n-2,n-1). This is followed by (n choose 3) values, f(0,1,2),
    f(0,1,3), ..., f(0,1,n-1), f(0,2,3), ... , f(0,2,n-1), ... f(0,n-2,n-1),
    f(1,2,3), ..., f(n-3,n-2,n-1). And so on, until (n choose n-1), that is
    n, frequency values f(0,1,...n-2), f(0,1,...n-3,n-1),
    f(0,1,...n-4,n-2,n-1), ..., f(1,2,...n-1)"

    (The above explanation is from "The Design, Implementation and Use of
    the Ngram Statistics Package" [2].)

    So the bigram output of count.pl/bigram input to any statistical library
    will be something like -

        word1<>word2<>f(0,1)<>f(0)<>f(1)

    Or you can also view this as

          word1<>word2<>n11<>n1p<>np1

    where n1p and np1 represent marginal totals in a 2x2 contingency table.

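    As a concrete illustration, the Dice coefficient is computed from just
    these three counts. A minimal sketch of the calculation (not the
    actual dice.pm module):

     # dice = 2 * n11 / (n1p + np1)
     sub dice {
         my ($n11, $n1p, $np1) = @_;
         return 2 * $n11 / ($n1p + $np1);
     }

     # for the bigram line<>of<>2 3 2 from the earlier example:
     printf "%.4f\n", dice(2, 3, 2);   # prints 0.8000, as in test.dice
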
    Similarly, the trigram output of count.pl/trigram input to a trigram
    measure such as the log-likelihood ratio
    (Text::NSP::Measures::3D::MI::ll) will be -

        word1<>word2<>word3<>f(0,1,2)<>f(0)<>f(1)<>f(2)<>f(0,1)<>f(0,2)<>f(1,2)

    Or you can also view this as

        word1<>word2<>word3<>n111<>n1pp<>np1p<>npp1<>n11p<>n1p1<>np11

    where n1pp, np1p, npp1, n11p, n1p1 and np11 represent marginal
    frequencies in a 2x2x2 contingency table.

    The frequency combinations being used can be output to a file by using
    the option --get_freq_combo.

    If count.pl was run with a set of user-defined frequency combinations
    different from the defaults, then the file containing these frequency
    combinations must be provided to statistic.pl using the option
    --set_freq_combo.

    If the number of frequency values does not match the number expected
    (either through the default frequency combinations or through the
    user-defined ones provided through the --set_freq_combo option) then an
    error is reported. Besides checking that the number of frequency values
    is correct, nothing else is checked.

   6.4. Modifying the Output of statistic.pl:
    One may request statistic.pl to ignore all Ngrams which have a frequency
    less than a user-defined threshold by using the --frequency option. To
    be able to do this however, the Ngram frequency should be present among
    the various frequency values in the input Ngram file. It is possible to
    set up a frequency combination file that prevents count.pl from printing
    the actual frequency of each Ngram; if such a file is given to
    statistic.pl, the frequency cut-off requested through option --frequency
    will be ignored and a warning issued to that effect.

    Once the statistical values for the Ngrams are calculated and the Ngrams
    have been ranked according to these values, one may request not to print
    Ngrams below a certain rank. This can be done using the option --rank.
    Unlike the frequency cut-off above, all calculations are done and then
    Ngrams that fall below a certain rank are cut off. In the frequency
    cut-off, calculations are not performed on the Ngrams that are ignored.

    The values returned by the statistic libraries may be floating point
    numbers; by default 4 places of decimal are shown. This can be changed
    by using the option --precision, through which the user can decide how
    many places of decimal he wishes to see. Note that the values returned
    by the library are rounded to the places of decimal requested by the
    user, and THEN the ranking is done. Thus two Ngrams that actually have
    different scores, but whose scores both round to the same number for
    the given precision, will get the same rank!

    The user can also use the statistical score to cut off Ngrams. Thus,
    using the option --score, one may request statistic.pl to not print
    Ngrams that get a score less than the given threshold.

    Similar to count.pl, the user can request statistic.pl to print extended
    information by using the --extended switch. Without this switch, all
    extended information already in the input file will be lost; with it,
    it will all be preserved and new extended data will be output.

    The output of statistic.pl is not formatted for human eyes - this can be
    done using the switch --format. Columns will be aligned as much as
    possible and the output is (often) neater than the default output.

   6.5. The Measures of Association Provided in This Distribution:
    We provide the following measures of association with this
    distribution. Thirteen are suitable for use with bigrams, four with
    trigrams, and one with 4-grams.

    The bigram measures are:

    *   Dice Coefficient (Text::NSP::Measures::2D::Dice::dice)

    *   Fisher's exact test - left sided
        (Text::NSP::Measures::2D::Fisher::left)

    *   Fisher's exact test - right sided
        (Text::NSP::Measures::2D::Fisher::right)

    *   Fisher's exact test - two-tailed
        (Text::NSP::Measures::2D::Fisher::twotailed)

    *   Jaccard Coefficient (Text::NSP::Measures::2D::Dice::jaccard)

    *   Log-likelihood ratio (Text::NSP::Measures::2D::MI::ll)

    *   Mutual Information (Text::NSP::Measures::2D::MI::tmi)

    *   Odds Ratio (Text::NSP::Measures::2D::odds)

    *   Pointwise Mutual Information (Text::NSP::Measures::2D::MI::pmi)

    *   Phi Coefficient (Text::NSP::Measures::2D::CHI::phi)

    *   Pearson's Chi Squared Test (Text::NSP::Measures::2D::CHI::x2)

    *   Poisson Stirling Measure (Text::NSP::Measures::2D::MI::ps)

    *   T-score (Text::NSP::Measures::2D::CHI::tscore)

    The trigram measures are:

    *   Log-likelihood ratio (Text::NSP::Measures::3D::MI::ll)

    *   Mutual Information (Text::NSP::Measures::3D::MI::tmi)

    *   Pointwise Mutual Information (Text::NSP::Measures::3D::MI::pmi)

    *   Poisson Stirling Measure (Text::NSP::Measures::3D::MI::ps)

    The 4-gram measure is:

    *   Log-likelihood ratio (Text::NSP::Measures::4D::MI::ll)

    Any of these measures can be used as follows:

      statistic.pl XXXX output.txt input.txt

    where XXXX is the name of the measure.

    More information on how to write a new statistic library is provided in
    the documentation (perldoc) of Text::NSP::Measures. A few additional
    details about the Measures can be found in their respective perldocs.

  7. Referencing:
    If you write a paper that has used NSP in some way, we'd certainly be
    grateful if you sent us a copy and referenced NSP. We have a published
    paper about NSP that provides a suitable reference:

     @inproceedings{BanerjeeP03,
            author = {Banerjee, S. and Pedersen, T.},
            title = {The Design, Implementation, and Use of the {N}gram {S}tatistic {P}ackage},
            booktitle = {Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics},
            pages = {370-381},
            year = {2003},
            month = {February},
            address = {Mexico City}}

    This paper can be found at:

    <http://cpansearch.perl.org/src/TPEDERSE/Text-NSP-1.13/doc/cicling2003.ps>

    or

    <http://cpansearch.perl.org/src/TPEDERSE/Text-NSP-1.13/doc/cicling2003.pdf>

AUTHORS
    Ted Pedersen, University of Minnesota, Duluth, tpederse at d.umn.edu

    Satanjeev Banerjee

    Amruta Purandare

    Saiyam Kohli

    Last modified by : $Id: README.pod,v 1.13 2010/11/12 19:13:41 btmcinnes
    Exp $

BUGS
    Please report to the NSP mailing list.

SEE ALSO
    *   NSP Home: <http://ngram.sourceforge.net>

    *   Mailing List: <http://groups.yahoo.com/group/ngram/>

  8. Acknowledgments:
    This work has been partially supported by a National Science Foundation
    Faculty Early CAREER Development award (#0092784) and by a Grant-in-Aid
    of Research, Artistry and Scholarship from the Office of the Vice
    President for Research and the Dean of the Graduate School of the
    University of Minnesota.

COPYRIGHT
    Copyright (C) 2000-2010, Ted Pedersen, Satanjeev Banerjee, Amruta
    Purandare, Bridget Thomson-McInnes, Saiyam Kohli, and Ying Liu

    This program is free software; you can redistribute it and/or modify it
    under the terms of the GNU General Public License as published by the
    Free Software Foundation; either version 2 of the License, or (at your
    option) any later version.

    This program is distributed in the hope that it will be useful, but
    WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
    Public License for more details.

    You should have received a copy of the GNU General Public License along
    with this program; if not, write to

        The Free Software Foundation, Inc.,
        59 Temple Place - Suite 330,
        Boston, MA  02111-1307, USA.

    Note: a copy of the GNU General Public License is available on the web
    at <http://www.gnu.org/licenses/gpl.txt> and is included in this
    distribution as GPL.txt.