NAME
  README Introduction to Ngram Statistics Package (Text-NSP)

SYNOPSIS
  This document provides a general introduction to the Ngram Statistics
  Package.

DESCRIPTION

1. Introduction
The Ngram Statistics Package (NSP) is a suite of programs that aids in
analyzing Ngrams in text files. We define an Ngram as a sequence of 'n'
tokens that occur within a window of at least 'n' tokens in the text;
what constitutes a "token" can be defined by the user.

In earlier versions (v0.1, v0.3, v0.4) this package was known as the
Bigram Statistics Package (BSP). The name change reflects the widening
scope of the package in moving beyond Bigrams to Ngrams.

NSP consists of two core programs and several utilities:

Program count.pl takes flat text files as input and generates a list of
all the Ngrams that occur in those files. The Ngrams, along with their
frequencies, are output in descending order of their frequency.

Program statistic.pl takes as input a list of Ngrams with their
frequencies (in the format output by count.pl) and runs a user-selected
statistical measure of association to compute a "score" for each Ngram.
The Ngrams, along with their scores, are output in descending order of
this score. The statistical score computed for each Ngram can be used to
decide whether or not there is enough evidence to reject the null
hypothesis (that the Ngram is not a collocation) for that Ngram.

The various utility programs are found in bin/utils/ and take as their
input the results (output) of count.pl and/or statistic.pl.

rank.pl takes as input two files output by statistic.pl and computes
Spearman's rank correlation coefficient on the Ngrams that are common to
both files. Typically the two files should be produced by applying
statistic.pl to the same Ngram count file but using two different
statistical measures. In such a scenario, the value output by rank.pl
can be used to measure how similar the two measures are. A value close
to 1 indicates that the two measures rank Ngrams in the same order, -1
that the two orderings are exactly opposite to each other, and 0 that
they are not related.
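For illustration, the coefficient that rank.pl reports can be computed
from two score lists as follows. This is a minimal Python sketch (NSP
itself is written in Perl, and this is not the rank.pl code): it assumes
score dictionaries keyed by Ngram, ranks by descending score as
statistic.pl does, and applies the standard Spearman formula without tie
correction. The function name and the sample scores are invented for the
example.

```python
def spearman(scores_a, scores_b):
    """Spearman's rank correlation over the Ngrams common to both
    score dictionaries (no tie correction)."""
    common = sorted(set(scores_a) & set(scores_b))

    def ranks(scores):
        # rank 1 = highest score, matching statistic.pl's descending sort
        order = sorted(common, key=lambda g: -scores[g])
        return {g: i + 1 for i, g in enumerate(order)}

    ra, rb = ranks(scores_a), ranks(scores_b)
    n = len(common)
    d2 = sum((ra[g] - rb[g]) ** 2 for g in common)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# hypothetical scores from two different measures
a = {"fine<>wine<>": 9.2, "stock<>market<>": 7.1, "the<>of<>": 0.3}
b = {"fine<>wine<>": 120, "stock<>market<>": 45, "the<>of<>": 2}
```

Here the two measures rank the three Ngrams identically, so the
coefficient is 1; reversing one ordering yields -1.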

kocos.pl takes as input a file output by count.pl or statistic.pl and
uses it to identify kth order co-occurrences of a given word. A kth
order co-occurrence of a target WORD is a word that co-occurs with a
(k-1)th order co-occurrence of that target WORD. So A is a 2nd order
co-occurrence of X if X occurs with B and B occurs with A. Put more
concretely, in "New York", "New" and "York" co-occur (they are 1st order
co-occurrences). In "New Jack", "New" and "Jack" co-occur. Thus, "Jack"
and "York" are 2nd order co-occurrences of each other, because they both
co-occur with "New".
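The chain of co-occurrences can be pictured as a breadth-first walk over
a co-occurrence graph. The following is a simplified Python illustration
of that idea, not the actual kocos.pl code (which reads count.pl or
statistic.pl output files): it treats co-occurrence as symmetric and
excludes words already seen at earlier orders, both of which are
assumptions made for the sketch.

```python
def kth_order_cooccurrences(bigrams, target, k):
    """Words at the k-th level of the co-occurrence chain rooted
    at `target` (a sketch, not kocos.pl itself)."""
    # build an undirected co-occurrence graph from bigram pairs
    neighbours = {}
    for w1, w2 in bigrams:
        neighbours.setdefault(w1, set()).add(w2)
        neighbours.setdefault(w2, set()).add(w1)
    frontier, seen = {target}, {target}
    for _ in range(k):
        # step one level out, dropping words found at earlier orders
        frontier = {n for w in frontier for n in neighbours.get(w, ())} - seen
        seen |= frontier
    return frontier

pairs = [("New", "York"), ("New", "Jack")]
# "New" is a 1st order co-occurrence of "York";
# "Jack" is then a 2nd order co-occurrence of "York", via "New"
```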

combig.pl will take the output of count.pl and find unordered counts of
bigrams. Normally count.pl treats bigrams like "fine wine" and "wine
fine" as distinct. combig.pl (combine bigram) will adjust the counts
such that they do not depend on the order. So one could then go on to
measure how much the words "fine" and "wine" are associated without
respect to their order.
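The count adjustment amounts to folding each ordered pair onto a single
unordered key. A hypothetical Python sketch of that idea, not
combig.pl's actual input or output handling:

```python
def combine_bigram_counts(counts):
    """Collapse ordered bigram counts into unordered pair counts,
    in the spirit of combig.pl (not its actual file format)."""
    unordered = {}
    for (w1, w2), n in counts.items():
        key = tuple(sorted((w1, w2)))   # canonical unordered key
        unordered[key] = unordered.get(key, 0) + n
    return unordered

# hypothetical ordered counts from count.pl
counts = {("fine", "wine"): 4, ("wine", "fine"): 1, ("stock", "market"): 2}
```

With these counts, "fine wine" and "wine fine" fold into a single
unordered pair with count 5.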

huge-count.pl allows a user to run count.pl on much larger corpora. It
generates the complete bigram list with count.pl's --tokenlist option,
splits that list into smaller pieces, and then sorts and merges the
pieces to produce the final output. huge-count.pl also uses
bin/utils/huge-split.pl, bin/utils/huge-sort.pl,
bin/utils/huge-merge.pl and bin/utils/huge-delete.pl.

This README continues with an introduction to the basic definitions of
tokens, the tokenization process and the Ngram formation process. This
is followed by a description of the two main programs in this suite
(count.pl and statistic.pl) and brief notes on how one could typically
use each of them. The programs rank.pl, kocos.pl, and combig.pl are
described in separate READMEs in the bin/utils/ directory.

2. Tokens
We define a token as a contiguous sequence of characters that matches
one of a set of regular expressions. These regular expressions may be
user-provided, or, if not provided, are assumed to be the following two
regular expressions:

  \w+        -> this matches a contiguous sequence of alpha-numeric characters

  [\.,;:\?!] -> this matches a single punctuation mark

For example, assume the following is a line of text:

  "the stock markets fell by 20 points today!"

Then, using the above regular expressions, we get the following tokens:

  the       stock     markets
  fell      by        20
  points    today     !

Now assume that the user provides the following lone regular expression:

  [a-zA-Z]+ -> this matches a contiguous sequence of alphabetic characters

Then, we get the following tokens:

  the       stock     markets
  fell      by        points
  today
3. The Tokenization Process:
Given a text file and a set of regular expressions, the text is
"tokenized", that is, broken up into tokens. To do so, the entire input
text is considered as one long "input string" with new-line characters
being replaced by space characters (this is the default behaviour and
can be modified; see point 4 below). Then, the following is done:

  while the input string is non-empty

    foreach regular expression r
      if r is matched by a sequence of characters starting with the
      first character in the input string...
        quit this for loop
      end if
    end foreach

    if we have a matching regular expression r
      the portion of the input string matched by r is our next token.
      remove this token from the input string.
    else
      remove the first character from the input string
    end if

  end while
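The loop above can be sketched directly in code. The following is a
minimal Python rendition of the process described (NSP itself is Perl,
and `tokenize` is an invented name, not a function in the package):

```python
import re

def tokenize(text, patterns):
    """Greedy first-match tokenizer following the loop above; a sketch
    of the process, not the code count.pl actually uses."""
    regexes = [re.compile(p) for p in patterns]
    tokens, pos = [], 0
    while pos < len(text):
        for r in regexes:
            m = r.match(text, pos)         # must match at the current position
            if m and m.group(0):
                tokens.append(m.group(0))  # matched portion is the next token
                pos = m.end()
                break
        else:
            pos += 1                       # no match: first character is a non-token
    return tokens

# the two default NSP token definitions
default = [r"\w+", r"[.,;:?!]"]
```

Because matching stops at the first regular expression that succeeds,
this sketch also reproduces the order-sensitivity demonstrated in
examples 3.2.2 and 3.2.3 below.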

3.1 Notes:
3.1.1. In looking for a regular expression that yields a successful
match (in the foreach loop above), we want a regular expression that
matches the input string starting with the first character of the input
string. Thus, the regular expression /b/ matches the input string "be
good" but not the input string " be good".

3.1.2. If none of the regular expressions give a successful match, then
the first character in the input string is removed. This character is
considered a "non-token" and is henceforth ignored.

3.1.3. Since the matching process (the foreach loop above) stops at the
first match, the order in which the regular expressions are tested is
important. The order is exactly the order in which they are provided by
the user, or, if the default regular expressions are used, the order in
which they are listed above.

3.2 Examples:
3.2.1 Example 1:
3.2.1.1. Input text:

  why's the stock falling?

3.2.1.2. Regular expressions:

  \w+
  [\.,;:\?!]

3.2.1.3. Resulting tokens:

  why       s         the
  stock     falling   ?

3.2.1.4. Explanation:

Initially our input string is the entire input text: "why's the stock
falling?". The first token found is "why", which matches the regular
expression /\w+/. This token is removed, and our input string becomes
"'s the stock falling?".

Now neither of the regular expressions can match the ' character. Thus
this character is considered a non-token and is removed, leaving the
input string like so: "s the stock falling?".

"s" is now matched by /\w+/, and this forms our next token. Upon
removing this token, we get the following input string: " the stock
falling?".

Again, neither of the regular expressions matches this input string, and
the leading space character is removed as a non-token. Similarly the
rest of the line is tokenized to yield the tokens "the", "stock",
"falling" and "?".

3.2.2 Example 2:
3.2.2.1. Input text:

  why's the stock falling?

3.2.2.2. Regular expressions:

  /fall/
  /falling/
  /stock/

3.2.2.3. Resulting tokens:

  stock     fall

3.2.2.4. Explanation:

Initially our input string is the entire input text: "why's the stock
falling?". None of the regular expressions match, and we remove the
first character to get as input string the following: "hy's the stock
falling?". Similarly, again the regular expressions don't match, and we
have to remove the first character. This goes on until our input string
becomes: "stock falling?".

Now "stock" matches the regular expression /stock/, and this token is
removed, leaving " falling?" as the input string. Since the space
character does not form a token, it is removed. Now we have "falling?"
as our input string.

Now observe that we have two regular expressions, /fall/ and /falling/,
both of which can match the input string. However, since /fall/ appears
before /falling/ in the list, the token formed is "fall". This leaves
our input string as: "ing?". None of the regular expressions match this
or any of the subsequent input strings obtained by removing the first
characters one by one. Hence we get as tokens "stock" and "fall".

3.2.3 Example 3:
3.2.3.1. Input text:

  why's the stock falling?

3.2.3.2. Regular expressions:

  /falling/
  /fall/
  /stock/

3.2.3.3. Resulting tokens:

  stock     falling

3.2.3.4. Explanation:

Observe that this example differs from the previous one only in the
order of the regular expressions. The tokenization proceeds exactly as
in the previous example, until we have as our input string "falling?".
Here, we have /falling/ as our first regular expression, and so we get
"falling" as our token.

Examples 3.2.2 and 3.2.3 demonstrate the importance of the order in
which the regular expressions are provided to the tokenization process.

3.2.4. Example 4:
3.2.4.1. Input text:

  why's the stock falling?

3.2.4.2. Regular expressions:

  /the stock/
  /\w+/

3.2.4.3. Resulting tokens:

  why       s         the stock
  falling

3.2.4.4. Explanation:

The thing to note here is that one of the regular expressions has an
embedded space character in it. This causes no problems: our definition
of a token allows embedded space characters! Once our input string is
"the stock falling?", the regular expression /the stock/ is matched, and
the string "the stock" forms our next token.

4. Ngrams:
An Ngram is a sequence of n tokens. We shall delimit tokens in an Ngram
by the diamond symbol, i.e. "<>". Thus, "big<>boy<>" is a bigram whose
tokens are "big" and "boy". Similarly, "stock<>falling<>?<>" is a
trigram whose tokens are "stock", "falling" and "?". "the
stock<>falling<>" is a bigram with tokens "the stock" and "falling".

Given a piece of text, Ngrams are usually formed of contiguous tokens.
For instance, let's take example 3.2.1, where our tokens, in the order
in which they appear in the text, are the following:

  why   s   the   stock   falling   ?

Then, the following are all the bigrams:

  why<>s<>            s<>the<>          the<>stock<>
  stock<>falling<>    falling<>?<>

The following are all the trigrams:

  why<>s<>the<>            s<>the<>stock<>
  the<>stock<>falling<>    stock<>falling<>?<>

The following are all the 4-grams:

  why<>s<>the<>stock<>
  s<>the<>stock<>falling<>
  the<>stock<>falling<>?<>

Etcetera.

The Ngrams shown above are all formed from contiguous tokens. Although
this is the default, we also allow Ngrams to be formed from
non-contiguous tokens.

To do so, we first define a "window" of size k to be a sequence of k
contiguous tokens, where the value of k is greater than or equal to the
value of n for the Ngrams. An Ngram can be formed from any n tokens as
long as all the tokens belong to a single window of size k. Further, the
n tokens must occur in the Ngram in exactly the same order as they occur
in the window.

Put another way, given a window of k tokens, we drop k-n tokens from the
window, and what remains is an Ngram!

Thus for instance, taking example 3.2.1 again, recall that our tokens in
the order in which they occur in the text are the following:

  why   s   the   stock   falling   ?

Then, the following are all the bigrams with a window size of 3:

  why<>s<>            why<>the<>        s<>the<>
  s<>stock<>          the<>stock<>      the<>falling<>
  stock<>falling<>    stock<>?<>        falling<>?<>

The following are all the bigrams with a window size of 4:

  why<>s<>            why<>the<>        why<>stock<>
  s<>the<>            s<>stock<>        s<>falling<>
  the<>stock<>        the<>falling<>    the<>?<>
  stock<>falling<>    stock<>?<>        falling<>?<>

The following are all the trigrams with a window size of 4:

  why<>s<>the<>            why<>s<>stock<>        why<>the<>stock<>
  s<>the<>stock<>          s<>the<>falling<>      s<>stock<>falling<>
  the<>stock<>falling<>    the<>stock<>?<>        the<>falling<>?<>
  stock<>falling<>?<>

Etc.
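One way to reproduce these listings is to slide a window across the
token list and keep every in-order selection of n tokens that includes
the window's first token, so that no Ngram is generated twice. A Python
sketch under that assumption (an illustration of the scheme, not the
actual count.pl logic):

```python
from itertools import combinations

def ngrams_in_windows(tokens, n, k=None):
    """All Ngrams of n tokens drawn, in order, from a window of k
    contiguous tokens. Each window contributes only Ngrams anchored at
    its first token, so nothing is listed twice."""
    k = n if k is None else k
    ngrams = []
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + k]          # truncated near the end
        for rest in combinations(range(1, len(window)), n - 1):
            ngrams.append("".join(window[i] + "<>" for i in (0,) + rest))
    return ngrams
```

With k left at its default of n this yields exactly the contiguous
Ngrams; with k = 3 it reproduces the nine bigrams listed above, and
with n = 3, k = 4 the ten trigrams.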

5. Program count.pl:
This program takes as input a flat ASCII text file and outputs all
Ngrams, or token sequences of length 'n', where the value of 'n' can be
decided by the user. Non-contiguous Ngrams within a window of size 'k'
as described above can also be found and output. For every output Ngram,
its frequency of occurrence, as well as the frequencies of all the
combinations of the tokens it is made up of, are output. Details follow.

5.1. Default Way to Run count.pl:
The most basic way of running this program is the following:

  Example 5.1: count.pl output.txt input.txt

where input.txt is the input text file in which to find the Ngrams and
output.txt is the output file into which count.pl will put all the
Ngrams with their frequencies.

5.2. Changing the Length of Ngrams and the Size of the Window:
Several default values are in use when the program is run this way. For
example, it is assumed that one is counting bigrams, that is, the value
of 'n' is 2. This can be changed by using the option --ngram N, where
'N' is the number of tokens you want in each Ngram. Thus, to find all
trigrams in input.txt, run count.pl thus:

  Example 5.2: count.pl --ngram 3 output.txt input.txt

Another default value in use is the window size. Window size defaults to
the value of 'n' for Ngrams. Thus, in example 5.1 the window size was 2,
while in example 5.2, because of the --ngram 3 option, the window size
was 3. This can be changed using the --window N option. Thus, for
example, to find all bigrams within windows of size 3, one would run the
program like so:

  Example 5.3a: count.pl --window 3 output.txt input.txt

Similarly, to find all trigrams within a window of size 4:

  Example 5.3b: count.pl --ngram 3 --window 4 output.txt input.txt

5.3. Using User-Provided Token Definitions:
In all these examples, the tokenization and Ngram formation proceed as
described in sections 3 and 4 above. In these examples, the default
token definitions are used:

  \w+        -> this matches a contiguous sequence of alpha-numeric characters
  [\.,;:\?!] -> this matches a single punctuation mark

As mentioned previously, these default token definitions can be
overridden by using the option --token FILE, where FILE is the name of
the file containing the regular expressions on which the token
definitions will be based. Each regular expression in this FILE should
be on a line of its own, and should be delimited by the forward slash
'/'. Further, these should be valid Perl regular expressions, as defined
in [1], which means for example that any occurrence of the forward slash
'/' within the regular expression must be 'escaped'.

5.4 Removing character strings via the --nontoken option:
This option allows a user to define regular expressions that will match
strings that should not be considered as tokens. These strings will be
removed from the data and not counted or included in Ngrams.

The --nontoken option is recommended when there are predictable
sequences of characters that you know should not be included as tokens
for purposes of counting Ngrams, finding collocations, etc.

For example, if mark-up symbols like <s>, <p>, [item], [/ptr] exist in
the text being processed, you may want to include those in your list of
nontoken items so they are discarded. If not, a simple regex such as
/\w+/ will match 's', 'p', 'item', 'ptr' from these tags, leading to
confusing results.

The --nontoken option on the command line should be followed by a file
name (NON_TOKEN). This file should contain Perl regular expressions
delimited by forward slashes '/' that define non-tokens. Multiple
expressions may be placed on separate lines or be separated via the '|'
(Perl 'or') as in /regex1|regex2|../

The following are some examples of valid non-token definitions.

  /<\/?[sp]>/ : will remove xml tags like <s>, <p>, </s>, </p>.

  /\[\w+\]/   : will remove all words which appear in square brackets,
                like [p], [item], [123] and so on.

count.pl will first remove any string from the input data that matches
the non-token regular expression, and only then match the remaining data
against the token definitions. Thus, if by chance a string matches both
the token and nontoken definitions, it will be removed, as --nontoken
has a higher priority than --token or the default token definition.
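This priority can be pictured as a preprocessing pass that deletes every
non-token match before tokenization begins. A hypothetical Python sketch
of that order of operations (not the count.pl implementation):

```python
import re

def strip_nontokens(text, nontoken_patterns):
    """Remove non-token strings before tokenization, mirroring the
    priority of --nontoken over --token."""
    for p in nontoken_patterns:
        text = re.sub(p, "", text)
    return text

# mark-up like [ptr], <s>, </s>, [/ptr] is deleted before any token
# definitions are consulted
line = "[ptr] <s> this is a test </s> [/ptr]"
cleaned = strip_nontokens(line, [r"\[/?\w+\]", r"</?\w+>"])
```

Only the remaining text ("this is a test") is then handed to the
tokenizer, so no fragment of the mark-up can ever become a token.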

5.5. The Output Format of count.pl:
Assume that the following are the contents of the input text file to
count.pl; let us call the file test.txt:

  first line of text
  second line
  and a third line of text

Further assume that count.pl is run like so:

  count.pl test.cnt test.txt

Thus, test.cnt will have all the bigrams found in file test.txt using a
window size of 2 and using the two default token definitions as above.
Following then are the contents of file test.cnt:

  11
  line<>of<>2 3 2
  of<>text<>2 2 2
  second<>line<>1 1 3
  line<>and<>1 3 1
  and<>a<>1 1 1
  a<>third<>1 1 1
  first<>line<>1 1 3
  third<>line<>1 1 3
  text<>second<>1 1 1

The number on the first line, 11, indicates that there were a total of
11 bigrams in the input file.

From the next line onwards, the various bigrams found are listed. Recall
that the tokens of the Ngrams are delimited by the diamond signs: <>.
Thus the bigram on the first line is line<>of<>, made up of the tokens
"line" and "of" in that order; the bigram on the second line is
of<>text<>, made up of the tokens "of" and "text", etc.

After the diamond following the last token there are three numbers. The
first of these numbers denotes the number of times this Ngram occurs in
the input text file. Thus bigram line<>of<> occurs 2 times in the input
file, as does bigram of<>text<>. The second number denotes in how many
bigrams the token "line" occurs as the left-hand token. In this case,
"line" occurs on the left of three bigrams, namely the two copies of the
bigram "line<>of<>" and the bigram "line<>and<>". Similarly, the third
number denotes the number of bigrams in which the word "of" occurs as
the right-hand token. In this case, "of" occurs on the right of two
bigrams, namely the two copies of the bigram "line<>of<>".
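For contiguous bigrams, these three numbers can be recomputed from
scratch with simple counters. A Python sketch of the arithmetic (an
illustration only, not count.pl's implementation or its file format):

```python
from collections import Counter

def bigram_counts(tokens):
    """For each contiguous bigram (w1, w2): its frequency, the number of
    bigrams with w1 as the left-hand token, and the number with w2 as
    the right-hand token."""
    pairs = list(zip(tokens, tokens[1:]))
    freq = Counter(pairs)
    left = Counter(w1 for w1, _ in pairs)    # w1 on the left
    right = Counter(w2 for _, w2 in pairs)   # w2 on the right
    return {(w1, w2): (n, left[w1], right[w2])
            for (w1, w2), n in freq.items()}

# the test.txt example, with newlines already replaced by spaces
tokens = "first line of text second line and a third line of text".split()
```

Running `bigram_counts(tokens)` reproduces the triples shown in
test.cnt, e.g. (2, 3, 2) for the bigram line<>of<>.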

Similar output is obtained for trigrams. Assume again that the input
file is as above, and assume that count.pl is run thus:

  count.pl --ngram 3 test.cnt test.txt

The output test.cnt file is as follows:

  10
  line<>of<>text<>2 3 2 2 2 2 2
  and<>a<>third<>1 1 1 1 1 1 1
  third<>line<>of<>1 1 3 2 1 1 2
  second<>line<>and<>1 1 3 1 1 1 1
  line<>and<>a<>1 3 1 1 1 1 1
  a<>third<>line<>1 1 1 2 1 1 1
  text<>second<>line<>1 1 1 2 1 1 1
  of<>text<>second<>1 1 1 1 1 1 1
  first<>line<>of<>1 1 3 2 1 1 2

Once again, the number on the first line says that there are 10 trigrams
in the input text file. The first trigram in the list is
"line<>of<>text<>", made up of the tokens "line", "of" and "text" in
that order. Similarly, the next trigram is "and<>a<>third<>", made up of
the tokens "and", "a" and "third".

Observe that this time there are more numbers after the last token. The
first number denotes, as before, the number of times this trigram occurs
in the input text file. Thus, "line<>of<>text<>" occurs twice in the
input file while "and<>a<>third<>" occurs just once. The second, third
and fourth numbers denote the number of trigrams in which the tokens
"line", "of" and "text" appear in the first, second and third positions
respectively. Thus, "line" occurs as the token in the first position in
3 trigrams, namely the 2 copies of "line<>of<>text<>" and the one copy
of "line<>and<>a<>". Similarly, the tokens "of" and "text" appear as the
second and third tokens respectively of two trigrams, namely the two
copies of "line<>of<>text<>".

The fifth number denotes the number of trigrams in which "line" occurs
as the first token and "of" occurs as the second token. Once again,
there are only two trigrams in which this happens: the two copies of
"line<>of<>text<>". The sixth number denotes the number of trigrams in
which "line" occurs as the token in the first place and "text" occurs as
the token in the third place. The seventh number denotes the number of
trigrams in which "of" occurs as the token in the second place and
"text" occurs as the token in the third place.

In general, assume we are dealing with Ngrams of size 'n'. Given an
Ngram, denote its leftmost token as w[0], the next token as w[1], and so
on until w[n-1]. Further, let f(a, b, ..., c) be the number of Ngrams
that have token w[a] in position a, token w[b] in position b, ... and
token w[c] in position c, where 0 <= a < b < ... < c < n.

Then, given an Ngram, the first frequency value reported is
f(0, 1, ..., n-1).

This is followed by n frequency values, f(0), f(1), ..., f(n-1).

This is followed by (n choose 2) values, f(0, 1), f(0, 2), ..., f(0,
n-1), f(1, 2), ..., f(1, n-1), ..., f(n-2, n-1).

This is followed by (n choose 3) values, f(0, 1, 2), f(0, 1, 3), ...,
f(0, 1, n-1), f(0, 2, 3), ..., f(0, 2, n-1), ..., f(0, n-2, n-1), ...,
f(1, 2, 3), ..., f(n-3, n-2, n-1).

And so on, until (n choose n-1), that is n, frequency values: f(0, 1,
..., n-2), f(0, 1, ..., n-3, n-1), f(0, 1, ..., n-4, n-2, n-1), ...,
f(1, 2, ..., n-1).
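This enumeration order (the full Ngram first, then all combinations of
size 1, then size 2, and so on, each size in lexicographic order) can be
generated mechanically. A Python sketch (`freq_combos` is an invented
name for this illustration):

```python
from itertools import combinations

def freq_combos(n):
    """The default frequency-combination order for Ngrams of size n:
    the full Ngram first, then every (n choose m) combination for
    m = 1 .. n-1, each in lexicographic order."""
    combos = [tuple(range(n))]
    for m in range(1, n):
        combos.extend(combinations(range(n), m))
    return combos
```

For n = 2 and n = 3 this reproduces the freq_combo.txt listings shown
below, and its length is always 2^n - 1.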

This gives us a total of 2^n - 1 possible frequency values. We call each
such frequency value a "frequency combination", since it expresses the
number of Ngrams that have a given combination of one or more tokens in
one or more fixed positions. By default all such combinations are
printed, exactly in the order shown above. To see which combinations are
being printed, one could use the option --get_freq_combo FILE. This
prints to the file the inputs to the imaginary 'f' function defined
above, exactly in the order the frequency values occur in the main
output. Thus for instance, running the program like so:

  count.pl --get_freq_combo freq_combo.txt test.cnt test.txt

and assuming that the test.txt file is the one shown above, the
following output is created in file freq_combo.txt:

  0 1
  0
  1

and the following output in file test.cnt:

  11
  line<>of<>2 3 2
  of<>text<>2 2 2
  second<>line<>1 1 3
  line<>and<>1 3 1
  and<>a<>1 1 1
  a<>third<>1 1 1
  first<>line<>1 1 3
  third<>line<>1 1 3
  text<>second<>1 1 1

Recall that since the option --ngram is not being used, the default
value of n, 2, is being used here. After each bigram in the test.cnt
file are three numbers; the first number corresponds to f(0, 1), the
second number corresponds to f(0) and the third to f(1). Observe that
line 'i' of the output in file freq_combo.txt represents the input to
the imaginary 'f' function that creates the 'i'th frequency value on
each line of the output in file test.cnt.

Similarly, running the program thus:

  count.pl --ngram 3 --get_freq_combo freq_combo.txt test.cnt test.txt

produces the following output in freq_combo.txt:

  0 1 2
  0
  1
  2
  0 1
  0 2
  1 2

and the following output in file test.cnt:

  10
  line<>of<>text<>2 3 2 2 2 2 2
  and<>a<>third<>1 1 1 1 1 1 1
  third<>line<>of<>1 1 3 2 1 1 2
  second<>line<>and<>1 1 3 1 1 1 1
  line<>and<>a<>1 3 1 1 1 1 1
  a<>third<>line<>1 1 1 2 1 1 1
  text<>second<>line<>1 1 1 2 1 1 1
  of<>text<>second<>1 1 1 1 1 1 1
  first<>line<>of<>1 1 3 2 1 1 2

The seven numbers after each trigram in file test.cnt correspond
respectively to f(0, 1, 2), f(0), f(1), f(2), f(0, 1), f(0, 2) and
f(1, 2), as shown in the file freq_combo.txt.

It is possible that the user may not require all the frequency values
output by default, or that the user requires the frequency values in a
different order. To change the default frequency values output, one may
provide count.pl with a file containing the inputs to the 'f' function
using the option --set_freq_combo.

Thus for instance, if the user wants to create trigrams, and only
requires the frequencies of the trigrams and the frequency values of the
three tokens in the trigrams (and not of the pairs of tokens), then he
may create the following file (say, user_freq_combo.txt):

  0 1 2
  0
  1
  2

and provide this file to the count.pl program thus:

  count.pl --ngram 3 --set_freq_combo user_freq_combo.txt test.cnt
  test.txt

This produces the following test.cnt file:

  10
  line<>of<>text<>2 3 2 2
  and<>a<>third<>1 1 1 1
  third<>line<>of<>1 1 3 2
  second<>line<>and<>1 1 3 1
  line<>and<>a<>1 3 1 1
  a<>third<>line<>1 1 1 2
  text<>second<>line<>1 1 1 2
  of<>text<>second<>1 1 1 1
  first<>line<>of<>1 1 3 2

Observe that the only difference between this output and the default
output is that instead of reporting 7 frequency values per Ngram, only
the 4 requested are output.

count2huge.pl converts the output of count.pl to the format produced by
huge-count.pl. The program sorts the bigrams in alphabetical order and
generates the same output as huge-count.pl. We sort the bigrams because
when the bigram list is used to generate the co-occurrence matrix for
the vector relatedness measure of UMLS-Similarity, input bigrams that
start with the same term must be grouped together. Sorting the bigrams
when creating the co-occurrence matrix improves efficiency.

5.6. "Stopping" the Ngrams:
The user may "stop" the Ngrams formed by count.pl by providing a list of
stop-tokens through the option --stop FILE. Each stop token in FILE
should be a Perl regular expression that occurs on a line by itself.
This expression should be delimited by forward slashes, as in /REGEX/.
All regular expression capabilities in Perl are supported except for
regular expression modifiers (like the "i" in /REGEX/i).

The following are a few examples of valid entries in the stop list.

  /^\d+$/
  /\bthe\b/
  /\b[Tt][Hh][Ee]\b/
  /^and$/
  /\bor\b/
  /^be(ing)?$/

There are two modes in which a stop list can be used, AND and OR. The
default mode is AND, which means that an Ngram must be made up entirely
of words from the stoplist before it is eliminated. The OR mode
eliminates an Ngram if any of the words that make up the Ngram are found
in the stoplist.

The mode is specified via an extended option that should appear on the
first line of the stop file. For example,

  @stop.mode=AND
  /^for$/
  /^the$/
  /^\d+$/

would eliminate bigrams such as 'for the', 'for 10', etc. (where both
elements of the bigram are from the stop list), but will not remove
bigrams like '10 dollars' or 'of the'.

  @stop.mode=OR
  /^for$/
  /^the$/
  /^\d+$/

would eliminate bigrams such as 'for our', '10 dollars', etc. (where at
least one element of the bigram is from the stop list).

If the @stop.mode= option is not specified, the default value is AND.

In both modes, Ngrams that are eliminated do not add to the various
Ngram and individual word frequency counts. Ngrams that are
"stoplisted" are treated as if they never existed and are not counted.
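The AND/OR semantics can be summarized in a few lines. A hypothetical
Python sketch of the elimination decision (not the count.pl code; the
Perl-style anchored regexes are passed here as plain pattern strings):

```python
import re

def stopped(ngram_tokens, stop_patterns, mode="AND"):
    """True if the Ngram should be eliminated under the given stop mode:
    AND requires every token to match some stop pattern, OR requires
    at least one token to match."""
    hits = [any(re.search(p, tok) for p in stop_patterns)
            for tok in ngram_tokens]
    return all(hits) if mode == "AND" else any(hits)

# the stop list from the examples above
stops = [r"^for$", r"^the$", r"^\d+$"]
```

Under AND, 'for the' is eliminated but '10 dollars' survives; under OR,
'10 dollars' is eliminated as well.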

5.6.1 Usage Notes for Regular Expressions in Stop Lists:
(1) In Perl regular expressions, \b specifies a word boundary, and ^ and
$ specify the start and end of a string (or line of text). These can be
used in defining your stop list entries, but must be used somewhat
carefully.

count.pl examines each token individually, thereby treating each as a
separate string or line. As a result, you can use either /\bregex\b/ or
/^regex$/ to exactly match a token made up of alphanumeric characters,
as in /\bcat\b/ or /^cat$/. However, please note that if a token
consists of other characters (as in n.b.a.) these can behave
differently. Suppose for example that your token is www.dot.com. If you
have a stop list entry /\bwww\b/ it will match the 'www' portion of the
token, since the '.' is considered to be a word boundary. /^www$/ would
not have that problem.

(2) If instead of /^the$/ the regex /the/ is used as a stop regex, then
every token that matches /the/ will be removed. So tokens like 'there',
'their', 'weather' and 'together' will be excluded with the stop regex
/the/. On the other hand, with the regex /^the$/, all occurrences of
only the word 'the' will be removed.

(3) You can also use a stop regex /^the/ to remove tokens that begin
with 'the' like 'their' or 'them' but not 'together'. Similarly, the
stop regex /the$/ will remove all tokens which end in 'the' like
'swathe' or 'tithe' but not 'together' or 'their'.
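Notes (2) and (3) can be checked mechanically against a small word list.
A Python illustration of which tokens each style of stop regex would
eliminate (the helper `removed_by` and the word list are invented for
this example):

```python
import re

words = ["the", "there", "their", "weather", "together", "swathe", "tithe"]

def removed_by(pattern):
    """The tokens a stop regex would eliminate: any token the
    pattern matches anywhere."""
    return [w for w in words if re.search(pattern, w)]
```

The fully anchored /^the$/ removes only 'the'; /^the/ additionally
removes 'there' and 'their'; /the$/ removes 'the', 'swathe' and 'tithe';
and the unanchored /the/ removes every word in the list.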

(4) Please note that stoplist handling changed as of version 0.53. If
you use a stoplist developed for an earlier version of NSP, then it will
not behave in the same way!!

In earlier versions, when you specified /regex/ as a stoplist item, we
assumed that you really meant /\bregex\b/ and proceeded accordingly.
However, since regular expressions are now fully supported, we require
that you specify exactly what you mean. So if you include /is/ as a
member of your stoplist, we will now assume that you mean any word that
contains 'is' somewhere within it (like 'this' or 'kiss' or 'isthmus'
...). To preserve the functionality of your old stoplists, simply
convert them from

  /the/
  /is/
  /of/

to

  /\bthe\b/
  /\bis\b/
  /\bof\b/

(5) Regex modifiers like i or g which come after the end slash, as in:

  /regex/i
  /regex/g

are not supported. See FAQ.txt for an explanation.

This makes it slightly inconvenient to specify that you would like to
stop any form of a given word. For example, if you wanted to stop 'THE',
'The', 'THe', etc. you would have to specify a regex such as

  /[Tt][Hh][Ee]/
777
778 5.6.2. Differences between --nontoken and --stop:
779 In theory we can remove "unwanted" words using either the --nontoken
780 option or the --stop option. However, these are rather different
781 techniques.
782
783 --stop only removes stop words after they are recognized as valid
784 tokens. Thus, if you wish to remove some markup tags like [p] or [item]
785 from the data using a stop list, you first need to recognize these as
786 tokens (via a --token definition like /\[\w+\]/) and then remove them
787 with a --stop list.
788
    In addition, the --stop option operates on whole Ngrams rather than
    on individual words: it removes entire Ngrams (and reduces the count
    of the number of Ngrams in the sample). In other words, the --stop
    option only comes into effect after the Ngrams have been created.

    The --nontoken option, on the other hand, eliminates individual
    occurrences of non-token sequences before the Ngrams are found.
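    The nontoken pass can be pictured as plain regex substitution
    applied before tokenization. A minimal Python sketch (illustrative;
    count.pl itself is Perl, and whitespace splitting is a
    simplification of its --token handling) using two markup-removing
    regexes like those in the example below:

```python
import re

line = "[ptr] <s> this is a test written for count.pl </s> [/ptr]"

# Remove [tag] / [/tag] and <tag> / </tag> sequences before tokenizing
for rx in (r"\[/?\w+\]", r"</?\w+>"):
    line = re.sub(rx, "", line)

tokens = line.split()
print(tokens)
# ['this', 'is', 'a', 'test', 'written', 'for', 'count.pl']
```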
796
    Some examples to clarify the distinction between --stop and
    --nontoken:
798
799 -----------------------------------------------------------------------
800
801 Consider an input file count.input =>
802
803 [ptr] <s> this is a test written for count.pl </s> [/ptr]
804 their them together wither tithe
805
806 NontokenFile nontoken.regex =>
807
808 /\[\/?\w+\]/
809 /<\/?\w+>/
810
811 case (a) StopFile stopfile.txt => /the/
812 ----------------------------------------
813
    Running count.pl with the command:
815
816 count.pl --stop stopfile.txt --nontoken nontoken.regex count.out count.input
817
818 will first remove all nontokens from the input file. Hence the tokenized
819 text from which the bigrams will be created will be =>
820
821 this is a test written for count.pl
822 their them together wither tithe
823
    Since the StopFile contains /the/, all tokens that include 'the' are
    eliminated. Thus, the bigrams:
826
827 their<>them<>
828 them<>together<>
829 together<>wither<>
830 wither<>tithe<>
831
    will all be removed, because every word in each of these bigrams
    contains "the" and the default stop mode is AND (a bigram is removed
    only if all of its words match the stop list). Note that a bigram
    such as "on<>their<>" would not be removed, since 'on' does not
    match the stoplist. The output file count.out will contain the
    following:
837
838 count.out=>
839
840 9
841 test<>written<>1 1 1
842 this<>is<>1 1 1
843 a<>test<>1 1 1
844 is<>a<>1 1 1
845 for<>count<>1 1 1
846 .<>pl<>1 1 1
847 count<>.<>1 1 1
848 written<>for<>1 1 1
849 pl<>their<>1 1 1
850
851 case (b) StopFile stopfile.txt => /^the/
852
853 ----------------------------------------
854
855 Running count.pl with the command:
856
857 count.pl --stop stopfile.txt --nontoken nontoken.regex count.out count.input
858
859 will first remove all nontokens from the input file. The tokenized text
860 will be:
861
862 this is a test written for count.pl
863 their them together wither tithe
864
865 Since the StopFile contains /^the/, all tokens which begin with "the"
866 are eliminated. Thus, the bigram
867
868 their<>them<>
869
870 will be removed since it consists of two words that begin with "the".
871 The output file count.out will contain the 12 bigrams as shown below.
872
873 count.out=>
874
875 12
876 test<>written<>1 1 1
877 this<>is<>1 1 1
878 a<>test<>1 1 1
879 is<>a<>1 1 1
880 for<>count<>1 1 1
881 them<>together<>1 1 1
882 .<>pl<>1 1 1
883 count<>.<>1 1 1
884 written<>for<>1 1 1
885 pl<>their<>1 1 1
886 wither<>tithe<>1 1 1
887 together<>wither<>1 1 1
888
889 case (c) StopFile stopfile.txt => @stop.mode=OR
890 /the$/
891
892 ------------------------------------------------
893
894 Running count.pl with the command:
895
896 count.pl --stop stopfile.txt --nontoken nontoken.regex count.out count.input
897
898 will first remove all nontokens from the input file. Hence the tokenized
899 text will be:
900
901 this is a test written for count.pl
902 their them together wither tithe
903
    As the StopFile contains /the$/, all tokens that end in 'the' are
    stop words. Thus, in the bigram
906
907 wither<>tithe<>
908
    "tithe" matches the stoplist since it ends with "the", and because
    the stop mode is OR (a bigram is eliminated if either of its words
    is in the stop list), this bigram is eliminated. The output file
    count.out will contain the 12 bigrams shown below.
913
914 count.out=>
915
916 12
917 test<>written<>1 1 1
918 this<>is<>1 1 1
919 a<>test<>1 1 1
920 is<>a<>1 1 1
921 for<>count<>1 1 1
922 them<>together<>1 1 1
923 .<>pl<>1 1 1
924 their<>them<>1 1 1
925 count<>.<>1 1 1
926 written<>for<>1 1 1
927 pl<>their<>1 1 1
928 together<>wither<>1 1 1
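    The AND/OR stop-mode behaviour exercised in the three cases above
    can be sketched as follows (an illustrative Python re-implementation
    of the behaviour described; count.pl itself is Perl and the helper
    name filter_bigrams is ours):

```python
import re

def filter_bigrams(bigrams, stop_regexes, mode="AND"):
    """Drop bigrams according to a stop list.

    AND mode (the default): drop a bigram only if EVERY word matches
    some stop regex.  OR mode: drop it if ANY word matches.
    """
    def is_stop(word):
        return any(re.search(rx, word) for rx in stop_regexes)

    kept = []
    for w1, w2 in bigrams:
        matches = [is_stop(w1), is_stop(w2)]
        drop = all(matches) if mode == "AND" else any(matches)
        if not drop:
            kept.append((w1, w2))
    return kept

bigrams = [("their", "them"), ("them", "together"),
           ("wither", "tithe"), ("on", "their")]

print(filter_bigrams(bigrams, [r"the"], mode="AND"))
# [('on', 'their')] -- 'on' does not contain 'the', so AND fails

print(filter_bigrams(bigrams, [r"the$"], mode="OR"))
# [('their', 'them'), ('them', 'together'), ('on', 'their')]
# -- 'tithe' ends in 'the', so wither<>tithe is dropped
```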
929
  5.7. Removing and Not Displaying Low Frequency Ngrams:
    The user can either remove or simply not display low frequency
    Ngrams. The option --remove N removes all Ngrams that occur less
    than N times; the Ngram and individual frequency counts are adjusted
    accordingly upon removal.

    The option --frequency N instead hides low frequency Ngrams: Ngrams
    that occur less than N times are not displayed in the output. Note
    that this differs from the --remove option in that the various
    frequency counts are not changed. Intuitively, we still believe
    that these Ngrams occurred in the text - we are simply not
    interested in looking at them. With --remove, by contrast, we want
    to act as if those Ngrams never occurred in the text at all, and so
    we want the counts to agree with that too!
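    The difference can be sketched like this (hypothetical helper names;
    a real count.pl run also adjusts the individual word frequencies,
    which this sketch omits by tracking only the total):

```python
def apply_remove(counts, n):
    """--remove N: drop Ngrams occurring fewer than N times and adjust
    the total as if they had never occurred."""
    kept = {g: c for g, c in counts.items() if c >= n}
    return kept, sum(kept.values())

def apply_frequency(counts, n):
    """--frequency N: hide Ngrams occurring fewer than N times from the
    output, but leave the total (and other counts) untouched."""
    shown = {g: c for g, c in counts.items() if c >= n}
    return shown, sum(counts.values())   # total unchanged

counts = {"line<>of<>": 2, "of<>text<>": 2, "and<>a<>": 1}

print(apply_remove(counts, 2))     # ({'line<>of<>': 2, 'of<>text<>': 2}, 4)
print(apply_frequency(counts, 2))  # ({'line<>of<>': 2, 'of<>text<>': 2}, 5)
```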
946
  5.8. Extended Output:
    Observe that one may modify the actual counting process in various
    ways through the options above. To keep a record of which options
    were used and with what values, one can turn on extended output with
    the switch --extended. The extended output records the size of the
    Ngram, the size of the window, the frequency value at which Ngrams
    were removed, and a list of all the source files used to create the
    count output. If a switch was not used, its default value is
    printed.
955
  5.9. Histogram Output:
    The user can also generate a histogram output by using the
    --histogram FILE option. This output shows how many Ngrams of each
    frequency occurred. The following is a typical line from a
    histogram output:
961
962 Number of n-grams that occurred 5 time(s) = 14 (40.94 percent)
963
    This says that there were 14 distinct Ngrams that occurred 5 times
    each, and that between them they account for around 41% of the total
    number of Ngram occurrences.
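    Such a histogram can be sketched in a few lines of Python
    (illustrative; the function name is ours, and we assume - as the
    example line above suggests - that the percentage is of total Ngram
    occurrences):

```python
from collections import Counter

def histogram_lines(ngram_counts):
    """Report how many distinct Ngrams occurred exactly k times, and
    what share of all Ngram occurrences they account for."""
    total = sum(ngram_counts.values())
    hist = Counter(ngram_counts.values())
    lines = []
    for freq in sorted(hist):
        n = hist[freq]
        pct = 100.0 * freq * n / total
        lines.append(f"Number of n-grams that occurred {freq} time(s) "
                     f"= {n} ({pct:.2f} percent)")
    return lines

counts = {"a<>b<>": 5, "c<>d<>": 5, "e<>f<>": 2, "g<>h<>": 1, "i<>j<>": 1}
for line in histogram_lines(counts):
    print(line)
```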
967
  5.10. Searching for Source Files in Directories, Recursively if Need Be:
    One would usually provide a source file from which to create Ngrams.
    One can also provide a directory name, in which case all text files
    in that directory are used. If, along with a directory name, one
    also uses the switch --recurse, all subdirectories of the source
    directory are searched recursively, and every text file so found is
    used to create Ngrams.
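    The directory handling can be pictured with a short sketch
    (illustrative Python; count.pl itself is Perl, the function name is
    ours, and for simplicity we take every file rather than testing for
    text files):

```python
import os

def find_source_files(path, recurse=False):
    """Collect input files the way a directory argument is handled:
    every file directly inside the directory and, with recurse=True
    (mimicking --recurse), the files in all subdirectories too."""
    if os.path.isfile(path):
        return [path]
    found = []
    for root, dirs, files in os.walk(path):
        found.extend(os.path.join(root, f) for f in sorted(files))
        if not recurse:
            dirs.clear()   # stop os.walk from descending further
    return found
```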
975
976 6. Program statistic.pl:
977 Program statistic.pl takes as input a list of Ngrams with their
978 frequencies in the format output by count.pl and runs a user-selected
979 statistical measure of association to compute a "score" for each Ngram.
980 The Ngrams, along with their scores, are output in descending order of
981 this score.
982
983 The statistical measures of association are implemented separately in
984 separate Perl packages (files ending with .pm extension). When running
985 statistic.pl, the user needs to provide the name of a statistical
986 measure (either from among the ones provided as a part of this
987 distribution or those written by the user). Say the name of the
988 statistic provided by the user is X. Program statistic.pl will then look
989 for Perl package X.pm (in the current directory, or, failing that, the
990 system path). If found, this Perl package file will be loaded and then
991 used to calculate the statistic on the list of Ngrams provided.
992
993 Please remember to include the path of Measures Directory (in the main
994 NSP Package directory) in your system path. This will enable the
995 statistic.pl program to find the modules provided with this package.
996
    As a part of this distribution, we provide statistical packages
    including the Dice coefficient (dice), the log-likelihood ratio
    (ll), mutual information (mi), the chi-squared test (x2), and the
    left-sided Fisher's exact test (leftFisher). All these packages
    follow a fixed set of rules, as discussed below. It is hoped that
    these rules are easy to follow and that new packages may be written
    quickly and easily.
1003
    In a sense, program statistic.pl is a framework. Its job is to take
    as input Ngrams with their frequencies, to provide those frequencies
    to the statistical library, and to format the output from that
    library. The heart of the statistical measure - the actual
    calculation - lies in the library that is plugged in. This framework
    allows new measures to be rigged up quickly; to do so, one need
    worry only about the actual calculation, and not about the various
    mundane issues that are taken care of by statistic.pl.

    The rest of this section describes how to run statistic.pl, then the
    format of the libraries and tips on how to write them.
1015
1016 6.1. Default Way to Run statistic.pl:
    The default way to run statistic.pl is as follows:
1018
1019 statistic.pl dice test.dice test.cnt
1020
1021 where: dice is the name of the statistic library to be loaded.
1022 test.dice is the name of the output file in which the results
1023 of applying the dice coefficient will be stored.
1024 test.cnt is the name of the input file containing the Ngrams
1025 and their various frequency values.
1026
1027 A Perl package with filename dice.pm is searched for in the Perl @INC
1028 path. Instead of writing just "dice" on the command line, one may also
1029 write the file name "dice.pm", or the full measure name
1030 "Text::NSP::Measures::2D::Dice::dice".
1031
    Once such a file is found, it is imported into statistic.pl and
    tests are done to see whether it meets the minimum requirements for
    a statistical library (more details below). If these tests fail,
    statistic.pl stops with an error message. Otherwise the library is
    initialized, and for each Ngram in file test.cnt its frequency
    values are passed to the library and the calculated value is noted.
    Finally, when all values have been calculated, the Ngrams are sorted
    on their statistic value and output to file test.dice.
1040
1041 For example, assume our input test.cnt file is this:
1042
1043 11
1044 line<>of<>2 3 2
1045 of<>text<>2 2 2
1046 second<>line<>1 1 3
1047 line<>and<>1 3 1
1048 and<>a<>1 1 1
1049 a<>third<>1 1 1
1050 first<>line<>1 1 3
1051 third<>line<>1 1 3
1052 text<>second<>1 1 1
1053
    Thus there are 11 bigrams in total (9 distinct), the first listed
    being "line<>of<>", the second "of<>text<>", and so on.
1056
    Running statistic.pl as above (statistic.pl dice test.dice test.cnt)
    will produce the following test.dice file:
1059
1060 11
1061 of<>text<>1 1.0000 2 2 2
1062 and<>a<>1 1.0000 1 1 1
1063 a<>third<>1 1.0000 1 1 1
1064 text<>second<>1 1.0000 1 1 1
1065 line<>of<>2 0.8000 2 3 2
1066 third<>line<>3 0.5000 1 1 3
1067 line<>and<>3 0.5000 1 3 1
1068 second<>line<>3 0.5000 1 1 3
1069 first<>line<>3 0.5000 1 1 3
1070
1071 Once again, the first number is the total number of bigrams - 11. On the
1072 next line is the highest ranked bigram "of<>text<>". The first number
1073 following this bigram, 1, is its rank. The next number, 1.0000, is its
1074 value computed using the dice statistic. The final three numbers are
1075 exactly the numbers associated with this Ngram in the test.cnt file.
1076
    Observe that three other bigrams also have the same score of 1.0000
    and so share rank 1. The bigram with the next highest score of
    0.8000, "line<>of<>", is ranked 2nd instead of 5th. This is a
    feature of our ranking mechanism: the fact that a bigram has rank
    'r' implies that there are r-1 distinct scores greater than the
    score of that Ngram, not that there are r-1 bigrams with higher
    scores.
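    The scoring and ranking just described can be sketched in Python
    (illustrative; in NSP the actual Dice calculation happens in the
    Perl measure module). Note that scores are rounded before ranking,
    and that tied scores share a rank:

```python
def dice(n11, n1p, np1):
    """Dice coefficient for a bigram: 2*f(0,1) / (f(0) + f(1))."""
    return 2.0 * n11 / (n1p + np1)

# (word1, word2, n11, n1p, np1) -- a few rows from the test.cnt example
bigrams = [
    ("line", "of", 2, 3, 2),
    ("of", "text", 2, 2, 2),
    ("second", "line", 1, 1, 3),
    ("and", "a", 1, 1, 1),
]

scored = sorted(((w1, w2, round(dice(n11, n1p, np1), 4))
                 for w1, w2, n11, n1p, np1 in bigrams),
                key=lambda t: -t[2])

# Rank r means there are r-1 distinct higher scores: ties share a rank.
ranked = []
rank, last = 0, None
for w1, w2, score in scored:
    if score != last:
        rank, last = rank + 1, score
    ranked.append((w1, w2, rank, score))

for w1, w2, r, s in ranked:
    print(f"{w1}<>{w2}<>{r} {s:.4f}")
# of<>text<>1 1.0000
# and<>a<>1 1.0000
# line<>of<>2 0.8000
# second<>line<>3 0.5000
```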
1083
1084 6.2. Changing the Default Ngram Size:
1085 By default, the Ngrams in the input file are assumed to be bigrams. This
1086 can however be changed by using the option --ngram. Given an Ngram size
1087 (either by default or by using the --ngram option), statistic.pl checks
1088 if there are exactly the correct number of tokens in each Ngram. If this
1089 is not true, an error is printed and statistic.pl halts.
1090
1091 6.3. Defining the Meaning of the Frequency Values:
    The "meaning" of the various frequency values after each Ngram in
    the input file is important, since the calculated statistic depends
    on them. By default, the meanings as defined by count.pl are
    assumed.
1096
1097 count.pl and all statistical libraries (.pm modules) provided with this
1098 package are implemented such that they produce/accept the frequency
1099 values in the same order. So for an ngram,
1100
1101 word1<>word2<>...wordn-1<>
1102
1103 "the first frequency value reported is f(0,1,...n-1); this is the
1104 frequency of the Ngram itself. This is followed by n frequency values
1105 f(0), f(1),...f(n-1); these are the frequencies of the individual tokens
1106 in their specific positions in the given Ngram. This is followed by (n
1107 choose 2) values, f(0,1), f(0,2), ..., f(0,n-1), f(1,2), ..., f(1,n-1),
1108 ... f(n-2,n-1). This is followed by (n choose 3) values, f(0,1,2),
1109 f(0,1,3), ..., f(0,1,n-1), f(0,2,3), ... , f(0,2,n-1), ... f(0,n-2,n-1),
    f(1,2,3), ..., f(n-3,n-2,n-1). And so on, until (n choose n-1), that
    is n, frequency values f(0,1,...n-2), f(0,1,...n-3,n-1),
    f(0,1,...n-4,n-2,n-1), ..., f(1,2,...n-1)"
1113
1114 (The above explanation is from "The Design, Implementation and Use of
1115 the Ngram Statistics Package" [2].)
1116
1117 So the bigram output of count.pl/bigram input to any statistical library
1118 will be something like -
1119
1120 word1<>word2<>f(0,1)<>f(0)<>f(1)
1121
1122 Or you can also view this as
1123
1124 word1<>word2<>n11<>n1p<>np1
1125
    where n1p, np1 represent marginal totals in a 2x2 contingency table.
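    Given these three values plus the total number of bigrams (the first
    line of the count file), the rest of the 2x2 contingency table can
    be reconstructed; a sketch (the function name is ours):

```python
def contingency_2x2(n11, n1p, np1, npp):
    """Fill in the full 2x2 table from the three values printed for a
    bigram plus npp, the total bigram count."""
    n12 = n1p - n11              # word1 followed by a word other than word2
    n21 = np1 - n11              # word2 preceded by a word other than word1
    n22 = npp - n1p - np1 + n11  # neither word in its position
    return [[n11, n12], [n21, n22]]

# 'line<>of<>2 3 2' from the earlier example, with 11 bigrams in total
print(contingency_2x2(2, 3, 2, 11))  # [[2, 1], [0, 8]]
```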
1127
    Similarly, the trigram output of count.pl (and the trigram input to
    a trigram measure such as Text::NSP::Measures::3D::MI::ll) will be -

        word1<>word2<>word3<>f(0,1,2)<>f(0)<>f(1)<>f(2)<>f(0,1)<>f(0,2)<>f(1,2)

    Or you can also view this as

        word1<>word2<>word3<>n111<>n1pp<>np1p<>npp1<>n11p<>n1p1<>np11

    where n1pp, np1p, npp1, n11p, n1p1, np11 represent marginal totals
    in a 2x2x2 contingency table.
1138
    The frequency combinations being used can be output to a file with
    the option --get_freq_combo.
1141
    If count.pl was run with a set of user-defined frequency
    combinations different from the defaults, then the file containing
    these frequency combinations must be provided to statistic.pl using
    the option --set_freq_combo.
1146
1147 If the number of frequency values does not match the number expected
1148 (either through the default frequency combinations or through the user
1149 defined ones provided through the set_freq_combo option) then an error
1150 is reported. Besides checking that the number of frequency values is
1151 correct, nothing else is checked.
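    The default ordering of frequency combinations quoted above can be
    generated with a short sketch (illustrative Python; an index tuple
    such as (0, 1) stands for the frequency value f(0,1)):

```python
from itertools import combinations

def default_freq_combos(n):
    """The order in which frequency values are emitted for an n-gram:
    the full Ngram first, then all size-1 index sets, size-2 sets, and
    so on up to size n-1, each group in lexicographic order."""
    combos = [tuple(range(n))]
    for size in range(1, n):
        combos.extend(combinations(range(n), size))
    return combos

print(default_freq_combos(2))
# [(0, 1), (0,), (1,)]  -> n11, n1p, np1

print(default_freq_combos(3))
# [(0, 1, 2), (0,), (1,), (2,), (0, 1), (0, 2), (1, 2)]
```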
1152
1153 6.4. Modifying the Output of statistic.pl:
1154 One may request statistic.pl to ignore all Ngrams which have a frequency
1155 less than a user-defined threshold by using the --frequency option. To
1156 be able to do this however, the Ngram frequency should be present among
1157 the various frequency values in the input Ngram file. It is possible to
1158 set up a frequency combination file that prevents count.pl from printing
1159 the actual frequency of each Ngram; if such a file is given to
1160 statistic.pl, the frequency cut-off requested through option --frequency
1161 will be ignored and a warning issued to that effect.
1162
1163 Once the statistical values for the Ngrams are calculated and the Ngrams
1164 have been ranked according to these values, one may request not to print
1165 Ngrams below a certain rank. This can be done using the option --rank.
1166 Unlike the frequency cut-off above, all calculations are done and then
1167 Ngrams that fall below a certain rank are cut-off. In the frequency
1168 cut-off, calculations are not performed on the Ngrams that are ignored.
1169
    The values returned by the statistic libraries may be floating point
    numbers; by default, 4 decimal places are shown. This can be changed
    with the option --precision, through which the user can decide how
    many decimal places to see. Note that the values returned by the
    library are rounded to the requested precision BEFORE the ranking is
    done. Thus two Ngrams that actually have different scores, but whose
    scores round to the same number at the given precision, will get the
    same rank!
1178
1179 The user can also use the statistical score to cut off Ngrams. Thus,
1180 using the option --score, one may request statistic.pl to not print
1181 Ngrams that get a score less than the given threshold.
1182
    As with count.pl, the user can request statistic.pl to print
    extended information by using the --extended switch. Without this
    switch, any extended information already in the input file will be
    lost; with it, that information is preserved and new extended data
    is output.
1187
    By default, the output of statistic.pl is not formatted for human
    reading; the switch --format aligns the columns as much as possible
    and (often) produces neater output than the default.
1191
1192 6.5. The Measures of Association Provided in This Distribution:
    We provide the following measures of association with this
    distribution: thirteen for bigrams, four for trigrams, and one for
    4-grams.
1195
1196 The bigram measures are:
1197
1198 * Dice Coefficient (Text::NSP::Measures::2D::Dice::dice)
1199
    * Fisher's exact test - left sided
      (Text::NSP::Measures::2D::Fisher::left)

    * Fisher's exact test - right sided
      (Text::NSP::Measures::2D::Fisher::right)

    * Fisher's exact test - two-tailed
      (Text::NSP::Measures::2D::Fisher::twotailed)
1208
1209 * Jaccard Coefficient (Text::NSP::Measures::2D::Dice::jaccard)
1210
1211 * Log-likelihood ratio (Text::NSP::Measures::2D::MI::ll)
1212
1213 * Mutual Information (Text::NSP::Measures::2D::MI::tmi)
1214
1215 * Odds Ratio (Text::NSP::Measures::2D::odds)
1216
1217 * Pointwise Mutual Information (Text::NSP::Measures::2D::MI::pmi)
1218
1219 * Phi Coefficient (Text::NSP::Measures::2D::CHI::phi)
1220
1221 * Pearson's Chi Squared Test (Text::NSP::Measures::2D::CHI::x2)
1222
1223 * Poisson Stirling Measure (Text::NSP::Measures::2D::MI::ps)
1224
1225 * T-score (Text::NSP::Measures::2D::CHI::tscore)
1226
1227 The trigram measures are:
1228
1229 * Log-likelihood ratio (Text::NSP::Measures::3D::MI::ll)
1230
1231 * Mutual Information (Text::NSP::Measures::3D::MI::tmi)
1232
1233 * Pointwise Mutual Information (Text::NSP::Measures::3D::MI::pmi)
1234
1235 * Poisson Stirling Measure (Text::NSP::Measures::3D::MI::ps)
1236
    The 4-gram measure is:
1238
1239 * Log-likelihood ratio (Text::NSP::Measures::4D::MI::ll)
1240
1241 Any of these measures can be used as follows:
1242
1243 statistic.pl XXXX output.txt input.txt
1244
1245 where XXXX is the name of the measure.
1246
1247 More information on how to write a new statistic library is provided in
1248 the documentation (perldoc) of Text::NSP::Measures. A few additional
1249 details about the Measures can be found in their respective perldocs.
1250
1251 7. Referencing:
1252 If you write a paper that has used NSP in some way, we'd certainly be
1253 grateful if you sent us a copy and referenced NSP. We have a published
1254 paper about NSP that provides a suitable reference:
1255
        @inproceedings{BanerjeeP03,
        author = {Banerjee, S. and Pedersen, T.},
        title = {The Design, Implementation, and Use of the {N}gram {S}tatistics {P}ackage},
        booktitle = {Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics},
        pages = {370--381},
        year = {2003},
        month = {February},
        address = {Mexico City}}
1264
    This paper can be found at:

    <http://cpansearch.perl.org/src/TPEDERSE/Text-NSP-1.13/doc/cicling2003.ps>

    or

    <http://cpansearch.perl.org/src/TPEDERSE/Text-NSP-1.13/doc/cicling2003.pdf>
1274
1275AUTHORS
1276 Ted Pedersen, University of Minnesota, Duluth tpederse at d.umn.edu
1277
1278 Satanjeev Banerjee
1279
1280 Amruta Purandare
1281
1282 Saiyam Kohli
1283
1284 Last modified by : $Id: README.pod,v 1.13 2010/11/12 19:13:41 btmcinnes
1285 Exp $
1286
1287BUGS
1288 Please report to the NSP mailing list
1289
1290SEE ALSO
1291 * NSP Home: <http://ngram.sourceforge.net>
1292
1293 * Mailing List : <http://groups.yahoo.com/group/ngram/>
1294
1295 8. Acknowledgments:
    This work has been partially supported by a National Science Foundation
    Faculty Early CAREER Development award (#0092784) and by a Grant-in-Aid
1298 of Research, Artistry and Scholarship from the Office of the Vice
1299 President for Research and the Dean of the Graduate School of the
1300 University of Minnesota.
1301
1302COPYRIGHT
    Copyright (C) 2000-2010, Ted Pedersen, Satanjeev Banerjee, Amruta
    Purandare, Bridget Thomson-McInnes, Saiyam Kohli, and Ying Liu
1305
1306 This program is free software; you can redistribute it and/or modify it
1307 under the terms of the GNU General Public License as published by the
1308 Free Software Foundation; either version 2 of the License, or (at your
1309 option) any later version.
1310
1311 This program is distributed in the hope that it will be useful, but
1312 WITHOUT ANY WARRANTY; without even the implied warranty of
1313 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
1314 Public License for more details.
1315
1316 You should have received a copy of the GNU General Public License along
1317 with this program; if not, write to
1318
1319 The Free Software Foundation, Inc.,
1320 59 Temple Place - Suite 330,
1321 Boston, MA 02111-1307, USA.
1322
1323 Note: a copy of the GNU General Public License is available on the web
1324 at <http://www.gnu.org/licenses/gpl.txt> and is included in this
1325 distribution as GPL.txt.
1326
1327