NAME
    TODO Things to do in the Ngram Statistics Package

SYNOPSIS
    Ngram Statistics Package Todo list

DESCRIPTION
    The following list describes some of the features that we'd like to
    include in NSP in the future. No particular priority is assigned to
    these items - they are all things we've discussed amongst ourselves
    or with users and agree would be good to add.

    If you have additional ideas, or would like to comment on something
    on the current list, please let us know via the ngram mailing list.

  WEB INTERFACE / WEB SERVER
    It would be nice to offer a web interface or web server for users who
    just want to run a few measures.

  UNICODE SUPPORT / ENCODING ISSUES
    NSP is geared for the Roman alphabet (Latin-1). Perl has increasingly
    better Unicode support with each passing release, and we will
    incorporate Unicode support in the future. We attempted to use the
    Unicode features in Perl 5.6, but found them to be incomplete. We
    have not yet attempted this with Perl 5.8 (the current version as of
    this writing), but its Unicode support is said to be considerably
    better.

    Perl's Unicode support includes language- and alphabet-specific
    definitions of regular expression character classes like \d+ or \w+
    (digits and word characters). So you should be able to use (in
    theory) the same regular expression definitions with any alphabet
    and have them match in a way that makes sense for that language.

    Our expertise in this area is fairly limited, so please let us know
    if we are missing something obvious or misunderstanding what Perl is
    attempting to do.

    In a discussion in Feb 2008, Richard Jelinek suggested the use of the
    Encode module; the discussion starts here:

    <http://tech.groups.yahoo.com/group/ngram/message/210>

    In that discussion some drawbacks to 'use locale' were pointed out,
    so for the moment we have made no changes, but it seems like fitting
    NSP with Encode support is a good idea.
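
    As a minimal sketch of what Encode support might look like (the
    byte string here is illustrative, not NSP code): raw bytes are
    decoded into Perl's internal character representation before
    tokenization, so that \w matches word characters in any alphabet.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode raw UTF-8 bytes into Perl's internal character representation
# before tokenizing; \w then matches accented letters as word characters.
my $bytes  = "caf\xc3\xa9 na\xc3\xafve";    # "café naïve" as UTF-8 bytes
my $text   = decode('UTF-8', $bytes);
my @tokens = $text =~ /(\w+)/g;
```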

  MORE EFFICIENT COUNTING
    Right now all the ngrams being counted are stored in memory. Each
    ngram is an element in a hash. This is fine for corpora of up to a
    few million words, but after that things really slow down. We would
    like to pursue the idea of using suffix arrays, which would greatly
    improve space utilization.
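
    For context, the current scheme is essentially the following (a
    minimal sketch, not NSP's actual code): every distinct ngram becomes
    a key in a single in-memory hash.

```perl
use strict;
use warnings;

# Minimal sketch of in-memory bigram counting: one hash entry per
# distinct ngram, which is what exhausts memory on large corpora.
my @tokens = split ' ', 'the cat sat on the cat';
my %count;
$count{"$tokens[$_] $tokens[$_ + 1]"}++ for 0 .. $#tokens - 1;
```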

    The use of suffix arrays for counting term frequencies is based on:

    Yamamoto, M. and Church, K. (2001) Using Suffix Arrays to Compute
    Term Frequency and Document Frequency for All Substrings in a Corpus,
    Computational Linguistics, vol 27:1, pp. 1-30, MIT Press.

    Find the article at:

     <http://acl.ldc.upenn.edu/J/J01/J01-1001.pdf>

     <http://www.research.att.com/~kwc/CL_suffix_array.pdf>

    In fact, they even provide a C implementation:

     <http://www.milab.is.tsukuba.ac.jp/~myama/tfdf/index.html>

    However, we would convert this into Perl and may need to modify it
    somewhat to fit into NSP.

    Another alternative would be to simply modify the count.pl program so
    that rather than using memory it used disk space to accumulate
    counts. This would be very slow but might suffice for certain
    situations. This is what huge-count.pl currently does.

    Another alternative would be to tie the hashes that are used in NSP
    to a database, and thereby reduce some memory use.

    Regardless of the changes we make to counting, we would continue to
    support counting in memory, which is perfectly adequate for smaller
    corpora.
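
    The tied-hash idea might look like this sketch (using the core
    SDBM_File module purely for illustration; DB_File or BerkeleyDB
    would scale better in practice):

```perl
use strict;
use warnings;
use Fcntl;
use SDBM_File;
use File::Temp qw(tempdir);

# Tie the count hash to an on-disk database so counts accumulate on
# disk instead of in memory; the counting code itself is unchanged.
my $dir = tempdir(CLEANUP => 1);
my %count;
tie %count, 'SDBM_File', "$dir/ngrams", O_RDWR | O_CREAT, 0666
    or die "cannot tie: $!";

$count{$_}++ for ('new york', 'new york', 'big apple');
```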

  GET COUNTS FROM WEB
    The web is a huge source of text, and we could get counts for words
    or ngrams from the web (probably using something like Perl's LWP
    module).

    Rather than running count.pl on a particular body of text (as is the
    case now), we'd probably have to run count.pl such that it looked for
    counts for a specific set of words as found on the web. Simply
    running count.pl on the entire web wouldn't really make sense. So
    perhaps we would run count.pl on one sample to get a list of the word
    types/ngrams that we are interested in, and then query the web to
    find their respective counts.

    [Our interest in this has been inspired by both Peter Turney (ACL-02
    paper) and Frank Keller (EMNLP-02 paper).]

  PARALLEL COUNTING
    Counting words and ngrams in large corpora could be parallelized.
    The trick is not so much in the counting, but in the combining of
    counts from various sources.

    This is something we might try to implement using MPI (Message
    Passing Interface).
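
    Whatever the transport (MPI or otherwise), the combining step
    reduces to summing per-worker count hashes key by key, as in this
    sketch:

```perl
use strict;
use warnings;

# Combine partial counts from several workers by summing values per key.
my @partials = (
    { 'new york' => 3 },
    { 'new york' => 2, 'big apple' => 1 },
);
my %total;
for my $part (@partials) {
    $total{$_} += $part->{$_} for keys %$part;
}
```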

  PROGRESS METER for count.pl
    When processing large files, count.pl gives no indication of how
    much of the file has been processed, or even whether it is still
    making progress. A "progress meter" could show how much of the file
    has been processed, or how many ngrams have been counted, or
    something else to indicate that progress is being made.
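
    A minimal version might report to STDERR every so many input lines.
    In this self-contained sketch an in-memory handle stands in for a
    large corpus file, and the messages are collected in an array rather
    than printed:

```perl
use strict;
use warnings;

# Report progress every $interval input lines; in real code the push
# would be a print to STDERR.
my $interval = 2;
my $data = "a\nb\nc\nd\ne\n";
open my $infh, '<', \$data or die $!;
my @messages;
while (my $line = <$infh>) {
    push @messages, "processed $. lines" if $. % $interval == 0;
}
```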

  OVERLY LONG LINE DETECTOR for count.pl
    If count.pl encounters a very long line of text (with literally
    thousands and thousands of words on a single line), it may operate
    very slowly. It would be good to let a user know that an overly long
    line (we'd need to define more precisely what "overly long" means)
    is being processed (this fits in with the progress meter mentioned
    above), so that the user can decide whether to continue, or possibly
    terminate processing and reformat the input file.
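
    The check itself is simple, as in this sketch (the threshold of 5
    just keeps the example small; the real cutoff is still to be
    decided):

```perl
use strict;
use warnings;

# Warn about lines whose token count exceeds a threshold, before
# spending time counting ngrams in them.
my $max_tokens = 5;
my (@warnings, $lineno);
for my $line ('a short line', 'w ' x 10) {
    $lineno++;
    my $ntokens = () = $line =~ /\S+/g;
    push @warnings, "line $lineno has $ntokens tokens"
        if $ntokens > $max_tokens;
}
```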

  GENERALIZE --newLine in count.pl
    The --newLine switch tells count.pl that ngrams may not cross over
    end-of-line markers. Presumably this would be used when each line of
    text consists of a sentence (thus the end of a line also marks the
    end of a sentence). However, if the text is not formatted and there
    may be multiple sentences per line, or sentences may extend across
    several lines, we may want to allow --newLine to include other
    characters that ngrams would not be allowed to cross.

    For example, we could have a switch --dontCross "\n\.,;\?" which
    would prevent ngrams from crossing the newline, the full stop, the
    comma, the semicolon, and the question mark.
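
    One way such a switch could work, sketched here: split the text on
    the "don't cross" character class first, then form ngrams only
    within each segment.

```perl
use strict;
use warnings;

# Form bigrams only within segments delimited by the boundary
# characters, so no ngram spans a newline or sentence punctuation.
my $dontCross = qr/[\n.,;?]/;
my $text = "the cat sat. the dog ran\nthe end";

my @bigrams;
for my $segment (split /$dontCross/, $text) {
    my @tokens = split ' ', $segment;
    push @bigrams, "$tokens[$_] $tokens[$_ + 1]" for 0 .. $#tokens - 1;
}
```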

  RECURSE-LIKE OPTION THAT CREATES MULTIPLE COUNT FILES
    Our current --recurse option creates a single count output file for
    all the words in all the texts found in a directory structure. We
    might want to be able to process all the files in a directory
    structure such that each file is treated separately and a separate
    count file is created for it.

    For example, suppose we have the directory /txts that contains the
    files text1 and text2. With

     count.pl --recurse output txts

    output will consist of the combined counts from txts/text1 and
    txts/text2.

    This new option would instead count these files separately and
    produce a separate count output file for each.
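
    The directory walk itself might look like this sketch, which pairs
    each file with its own output name (the ".count" suffix is just an
    illustrative convention, and the temporary directory stands in for
    txts):

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Build one (input, output) pair per file found under the directory.
my $dir = tempdir(CLEANUP => 1);
for my $f ('text1', 'text2') {
    open my $fh, '>', "$dir/$f" or die $!;
    print $fh "some words here\n";
    close $fh;
}

my @pairs;
find(sub {
    push @pairs, [ $File::Find::name, "$File::Find::name.count" ] if -f;
}, $dir);
@pairs = sort { $a->[0] cmp $b->[0] } @pairs;
```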

  OTHER CUTOFFS FOR count.pl
    DONE IN VERSION 1.13! (--uremove option): What about having a
    frequency cutoff for count.pl that removed any ngrams that occur
    more than some number of times? The idea here would be to eliminate
    high-frequency ngrams not through the use of a stoplist but rather
    through a frequency cutoff, based on the presumption that most very
    high-frequency ngrams will be made up of stop words.

    What about a percentage cutoff? In other words, eliminate some
    percentage of the least (or most) frequent ngrams?
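
    The upper-frequency cutoff amounts to the following sketch (counts
    and threshold are illustrative):

```perl
use strict;
use warnings;

# Drop ngrams that occur more than $uremove times, on the presumption
# that very frequent ngrams are stop-word-like. A percentage cutoff
# would instead sort types by frequency and drop, say, the bottom 10%.
my %count = ('of the' => 500, 'new york' => 12, 'big apple' => 3);
my $uremove = 100;
delete $count{$_} for grep { $count{$_} > $uremove } keys %count;
```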

  AUTOMATIC CREATION OF STOPLISTS
    It would be useful to allow NSP to automatically create a stoplist
    based on a combination of frequency counts and/or scores like
    tf/idf. While tf/idf depends on the idea of a document, we could
    simply chunk up a large corpus into 100-token-long pieces, consider
    each piece a document, and treat as stop words those words that
    occur in some number of these chunks.
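
    In miniature, that idea looks like this sketch (the chunk size of 4
    and the "more than half the chunks" threshold are illustrative; the
    text above suggests 100-token chunks):

```perl
use strict;
use warnings;

# Chunk the corpus into fixed-size pseudo-documents, compute document
# frequency, and call anything appearing in most chunks a stop word.
my @tokens = split ' ',
    lc 'the cat sat on the mat and the dog sat on the log';
my $chunk_size = 4;
my (%df, $nchunks);
while (@tokens) {
    my @chunk = splice @tokens, 0, $chunk_size;
    $nchunks++;
    my %seen;
    $df{$_}++ for grep { !$seen{$_}++ } @chunk;
}
my @stoplist = grep { $df{$_} / $nchunks > 0.5 } keys %df;
```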

  SUPPORT FOR STDIN/STDOUT
    Right now count.pl and statistic.pl operate such that the output
    file is designated first, followed by the input file.

    For example,

     count.pl outputfile inputfile

    However, there are advantages to allowing a user to redirect input
    and output, particularly in the Unix and MS-DOS world. (As Derek
    Jones pointed out to us, any Windows users are probably looking for
    a GUI, and they won't find much of one here!) This would enable the
    use of syntax such as

     count.pl inputfile > outfile

     cat inputfile | count.pl > outfile

    which would help in building scripts, etc.
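
    A sketch of how both calling styles could coexist: open the named
    files when they are given, and fall back to STDIN/STDOUT when they
    are not (the helper name is hypothetical).

```perl
use strict;
use warnings;

# Open the requested files, defaulting to STDIN/STDOUT when a filename
# is missing, so "count.pl out in" and "cat in | count.pl > out" both work.
sub open_io {
    my ($outfile, $infile) = @_;
    my $in = defined $infile
        ? do { open my $fh, '<', $infile or die "$infile: $!"; $fh }
        : \*STDIN;
    my $out = defined $outfile
        ? do { open my $fh, '>', $outfile or die "$outfile: $!"; $fh }
        : \*STDOUT;
    return ($out, $in);
}
```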

  INSTALL SCRIPT FOR UNIX
    Rather than have the user set paths by hand, provide a script that
    would ask the user questions to set things up properly. This might
    be especially useful if we want to maintain the "old" style of
    output/input file specification in count.pl and statistic.pl (see
    the point above) as well as STDIN/STDOUT. (Maybe a user could pick
    which one?) In addition, there may be other options that a user
    could specify this way (such as a default token definition, home
    directory, etc.).

  EXTEND huge-count.pl to Ngrams
    At present huge-count.pl is only able to count bigrams. It would be
    very useful to extend it so that it could count ngrams in general.
    Also, there is no support for windowing at present, so the bigrams
    it counts must be adjacent. It would be desirable to support
    windowing for bigrams and ngrams generally.
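
    For readers unfamiliar with windowing, this sketch shows the bigram
    case in miniature: within a window of $w tokens, the leading token
    is paired with each of the tokens that follow it in the window
    (adjacent-only counting is the special case $w = 2).

```perl
use strict;
use warnings;

# Windowed bigram counting: pair each token with every token that
# follows it within a window of $w positions.
my @tokens = split ' ', 'a b c d';
my $w = 3;
my %count;
for my $i (0 .. $#tokens - 1) {
    my $last = $i + $w - 1;
    $last = $#tokens if $last > $#tokens;
    $count{"$tokens[$i] $tokens[$_]"}++ for $i + 1 .. $last;
}
```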

  ERROR RETURN CODES
    At present all programs simply exit when they encounter an error. We
    will instead return an error code that can be detected by the
    calling program, so that abnormal termination is clear. This affects
    count.pl and statistic.pl particularly, but will also be changed in
    rank.pl, combig.pl, and kocos.pl.
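
    A sketch of the pattern (the specific code numbers are illustrative,
    not an NSP convention): name the exit codes and return them from
    failing operations so a caller can test $? in a shell or the return
    value of system() in Perl.

```perl
use strict;
use warnings;

# Named error codes that a calling program can detect, instead of a
# bare exit on failure.
use constant {
    EXIT_OK        => 0,
    EXIT_CANT_OPEN => 2,
};

sub process_file {
    my ($file) = @_;
    open my $fh, '<', $file or return EXIT_CANT_OPEN;
    close $fh;
    return EXIT_OK;
}
```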

  MODULAR COUNTING
    There is a certain amount of redundant code in count.pl,
    huge-count.pl, and kocos.pl. It would be useful to make these more
    modular, to allow for inheritance and code sharing, as well as
    (potentially) the use of objects.

  RANK.PL TIE HANDLING
    Right now rank.pl does not handle ties in any way other than ranking
    them such that all members of a tie receive the same rank, and the
    next rank after the tie is incremented by the number of tied items.
    Some sources advocate using Pearson's correlation coefficient on the
    ranks in the case of ties:

    <http://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient>

    Other sources prefer the use of Kendall's Tau over Spearman's:

    <http://rsscse.org.uk/ts/bts/noether/text.html>

    Our suggestion is that if you have data with numerous ties, you
    should look very carefully at alternatives to the methods described
    in rank.pl. However, typical collocation data collected from corpora
    usually doesn't have too many ties, so in general we feel rank.pl
    remains useful.
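
    For readers who want to try the tie-aware form themselves, a sketch
    (not rank.pl's current behavior): assign tied items their average
    ("mid") rank, then compute Pearson's r on the two rank vectors.

```perl
use strict;
use warnings;

# Assign 1-based midranks: tied values share the average of the
# positions they occupy in sorted order.
sub midranks {
    my @x = @_;
    my @order = sort { $x[$a] <=> $x[$b] } 0 .. $#x;
    my @rank;
    my $i = 0;
    while ($i <= $#order) {
        my $j = $i;
        $j++ while $j < $#order && $x[$order[$j + 1]] == $x[$order[$i]];
        my $avg = ($i + $j) / 2 + 1;
        $rank[$order[$_]] = $avg for $i .. $j;
        $i = $j + 1;
    }
    return @rank;
}

# Pearson's correlation coefficient; applied to midranks this is the
# tie-corrected Spearman's rho.
sub pearson {
    my ($x, $y) = @_;
    my $n = @$x;
    my ($sx, $sy, $sxx, $syy, $sxy) = (0) x 5;
    for my $k (0 .. $n - 1) {
        $sx  += $x->[$k];
        $sy  += $y->[$k];
        $sxx += $x->[$k] ** 2;
        $syy += $y->[$k] ** 2;
        $sxy += $x->[$k] * $y->[$k];
    }
    my $den = sqrt(($n * $sxx - $sx ** 2) * ($n * $syy - $sy ** 2));
    return ($n * $sxy - $sx * $sy) / $den;
}

my @a   = midranks(1, 2, 2, 4);    # scores with one tie
my @b   = midranks(1, 2, 3, 4);    # scores with no ties
my $rho = pearson(\@a, \@b);
```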

  UPDATE USAGE.pod
    USAGE.pod has not been updated since 2001, and is very basic.

AUTHOR
    Ted Pedersen, tpederse@d.umn.edu

    Last Updated : $Id: TODO,v 1.26 2015/10/03 12:22:59 tpederse Exp $

BUGS
SEE ALSO
     home page:    <http://www.d.umn.edu/~tpederse/nsp.html>

     mailing list: <http://groups.yahoo.com/group/ngram/>

COPYRIGHT
    Copyright (C) 2000-2010 Ted Pedersen

    Permission is granted to copy, distribute and/or modify this
    document under the terms of the GNU Free Documentation License,
    Version 1.2 or any later version published by the Free Software
    Foundation; with no Invariant Sections, no Front-Cover Texts, and no
    Back-Cover Texts.

    Note: a copy of the GNU Free Documentation License is available on
    the web at <http://www.gnu.org/copyleft/fdl.html> and is included in
    this distribution as FDL.txt.