NAME
    TODO Things to do in the Ngram Statistics Package

SYNOPSIS
    Ngram Statistics Package Todo list

DESCRIPTION
    The following list describes some of the features that we'd like to
    include in NSP in the future. No particular priority is assigned to
    these items - they are all things we've discussed amongst ourselves or
    with users and agree would be good to add.

    If you have additional ideas, or would like to comment on something on
    the current list, please let us know via the ngram mailing list.

WEB INTERFACE / WEB SERVER
    It would be nice to offer a web interface or web server for users who
    just want to run a few measures.

UNICODE SUPPORT / ENCODING ISSUES
    NSP is geared for the Roman alphabet (Latin-1). Perl has increasingly
    better Unicode support with each passing release, and we will
    incorporate Unicode support in the future. We attempted to use the
    Unicode features in Perl 5.6, but found them to be incomplete. We have
    not yet attempted this with Perl 5.8 (the now current version), but it
    is said to be considerably better.

    Perl support for Unicode will include language- and alphabet-specific
    definitions of regular expression character classes like \d+ or \w+
    (digits and word characters). So you should be able to use (in theory)
    the same regular expression definitions with any alphabet and have
    them match in a way that makes sense for that language.

    Our expertise in this area is fairly limited, so please let us know if
    we are missing something obvious or misunderstanding what Perl is
    attempting to do.

    In a discussion in Feb 2008, Richard Jelinek suggested the use of the
    Encode module; the discussion starts here:

    <http://tech.groups.yahoo.com/group/ngram/message/210>

    In that discussion some drawbacks to 'use locale' were pointed out, so
    for the moment we have made no changes, but it seems like fitting NSP
    with Encode support is a good idea.
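    To illustrate the behavior we are hoping for (sketched here in Python
    rather than Perl, since Python 3's regex engine already does this), a
    single \w+ pattern can tokenize text from several alphabets at once:

```python
import re

# In Python 3, \w matches Unicode word characters by default, so the
# same pattern finds tokens in Latin, CJK, and Cyrillic text alike.
text = "naive naïve 東京 Москва"
tokens = re.findall(r"\w+", text)
print(tokens)  # ['naive', 'naïve', '東京', 'Москва']
```

    This is the sense in which one regular expression definition could
    "match in a way that makes sense" for any language, without
    alphabet-specific tokenization code.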
MORE EFFICIENT COUNTING
    Right now all the ngrams being counted are stored in memory. Each
    ngram is an element in a hash. This is ok for corpora of up to a few
    million words, but after that things really slow down. We would like
    to pursue the idea of using suffix arrays, which would greatly improve
    space utilization.

    The use of suffix arrays for counting term frequencies is based on:

    Yamamoto, M. and Church, K. (2001) Using Suffix Arrays to compute Term
    Frequency and Document Frequency for All Substrings in a Corpus,
    Computational Linguistics, vol 27:1, pp. 1-30, MIT Press.

    Find the article at:

    <http://acl.ldc.upenn.edu/J/J01/J01-1001.pdf>

    <http://www.research.att.com/~kwc/CL_suffix_array.pdf>

    In fact, they even provide a C implementation:

    <http://www.milab.is.tsukuba.ac.jp/~myama/tfdf/index.html>

    However, we would convert this into Perl and may need to modify it
    somewhat to fit into NSP.

    Another alternative would be to simply modify the count.pl program
    such that rather than using memory it used disk space to accumulate
    counts. This would be very slow but might suffice for certain
    situations. This is what huge-count.pl currently does.

    Another alternative would be to tie the hashes that are used in NSP to
    a database, and thereby reduce some memory use.

    Regardless of the changes we make to counting, we would continue to
    support counting in memory, which is perfectly adequate for smaller
    corpora.

GET COUNTS FROM WEB
    The web is a huge source of text, and we could get counts for words or
    ngrams from the web (probably using something like Perl's LWP module).

    Rather than running count.pl on a particular body of text (as is the
    case now), we'd probably have to run count.pl such that it looked for
    counts for a specific set of words as found on the web. Simply running
    count.pl on the entire www wouldn't really make sense.
    So perhaps we would run count on one sample to get a list of the word
    types/ngrams that we are interested in, and then run count on the www
    to find out their respective counts.

    [Our interest in this has been inspired by both Peter Turney (ACL-02
    paper) and Frank Keller (EMNLP-02 paper).]

PARALLEL COUNTING
    Counting words and ngrams in large corpora could be parallelized. The
    trick is not so much in the counting, but in the combining of counts
    from various sources.

    This is something we might try to implement using MPI (Message
    Passing Interface).

PROGRESS METER for count.pl
    When processing large files, count.pl gives no indication of how much
    of the file has been processed, or even whether it is still making
    progress. A "progress meter" could show how much of the file has been
    processed, or how many ngrams have been counted, or something to
    indicate that progress is being made.

OVERLY LONG LINE DETECTOR for count.pl
    If count.pl encounters a very long line of text (with literally
    thousands and thousands of words on a single line) it may operate
    very, very slowly. It would be good to let a user know that an overly
    long line (we'd need to define more precisely what "overly long" is)
    is being processed (this fits into the progress meter mentioned above)
    so that a user can decide whether to continue, or possibly terminate
    processing and reformat the input file.

GENERALIZE --newLine in count.pl
    The --newLine switch tells count.pl that Ngrams may not cross over
    end-of-line markers. Presumably this would be used when each line of
    text consists of a sentence (thus the end of a line also marks the end
    of a sentence).
    However, if the text is not formatted and there may be multiple
    sentences per line, or sentences may extend across several lines, we
    may want to allow --newLine to include other characters that Ngrams
    would not be allowed to cross.

    For example, we could have the switch --dontCross "\n\.,;\?" which
    would prevent ngrams from crossing the newline, the full stop, the
    comma, the semicolon and the question mark.

RECURSE LIKE OPTION THAT CREATES MULTIPLE COUNT FILES
    Our current --recurse option creates a single count output file for
    all the words in all the texts found in a directory structure. We
    might want to be able to process all the files in a directory
    structure such that each file is treated separately and a separate
    count file is created for it.

    For example, suppose we have the directory /txts that contains the
    files text1 and text2.

    count.pl --recurse output txts

    output will consist of the combined counts from txts/text1 and
    txts/text2.

    This new option would count these files separately and produce
    separate count output files.

OTHER CUTOFFS FOR count.pl
    DONE IN VERSION 1.13! (--uremove option): What about having a
    frequency cutoff for count.pl that removed any ngrams that occur more
    than some number of times? The idea here would be to eliminate
    high-frequency ngrams not through the use of a stoplist but rather
    through a frequency cutoff, based on the presumption that most very
    high-frequency ngrams will be made up of stop words.

    What about a percentage cutoff? In other words, eliminate the least
    (or most) frequent ngrams?

AUTOMATIC CREATION OF STOPLISTS
    It would be useful to allow NSP to automatically create a stoplist
    based on a combination of frequency counts and/or scores like tf/idf.
    While tf/idf depends on the idea of a document, we would simply chunk
    up a large corpus into 100-token-long pieces, consider each piece a
    document, and treat as stop words those words that occur in some
    number of these chunks.

SUPPORT FOR STDIN/STDOUT
    Right now count.pl and statistic.pl operate such that the output file
    is designated first, followed by the input file.

    For example,

    count.pl outputfile inputfile

    However, there are advantages to allowing a user to redirect input and
    output, particularly in the Unix and MS-DOS world. As Derek Jones
    pointed out to us, if we have Windows users they are probably looking
    for a GUI (and they won't find much, will they!!). This would enable
    the use of syntax such as...

    count.pl inputfile > outputfile

    cat inputfile | count.pl > outputfile

    which would help in building scripts, etc.

INSTALL SCRIPT FOR UNIX
    Rather than have the user set paths, have a script that would ask the
    user questions to set things up properly. This might be especially
    useful if we want to maintain the "old" style of output/input file
    specifications in count.pl and statistic.pl (see point above) as well
    as STDIN/STDOUT. (Maybe a user could pick which one?) In addition,
    there may be other options that a user could specify this way (such as
    a default token definition, home directory, etc.)

EXTEND huge-count.pl to Ngrams
    At present huge-count.pl is only able to count bigrams. It would be
    very useful to extend it so that it could count Ngrams in general.
    Also, there is no support for windowing provided at present, so the
    bigrams it counts must be adjacent. It would be desirable to support
    windowing for bigrams and Ngrams generally.

ERROR RETURN CODES
    At present all programs simply exit when they encounter an error.
    We will return an error code that can be detected by the calling
    program, so that abnormal termination is clear. This affects count.pl
    and statistic.pl particularly, but will also be changed in rank.pl,
    combig.pl and kocos.pl.

MODULAR COUNTING
    There is a certain amount of redundant code in count.pl, huge-count.pl
    and kocos.pl. It would be useful to make these more modular, to allow
    for inheritance and code sharing, as well as the use of objects
    (potentially).

RANK.PL TIE HANDLING
    Right now rank.pl does not handle ties in any way other than ranking
    them such that all members of a tie have the same rank, and the next
    rank after the ties is incremented by the number of ties. Some sources
    advocate using Pearson's correlation coefficient on the ranks in case
    of ties:

    <http://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient>

    Other sources prefer the use of Kendall's Tau over Spearman's:

    <http://rsscse.org.uk/ts/bts/noether/text.html>

    Our suggestion is that if you have data with numerous ties, you want
    to look very carefully at alternatives to the methods described in
    rank.pl. However, typical collocation data collected from corpora
    usually doesn't have too many ties, so in general we feel rank.pl
    remains useful.

UPDATE USAGE.pod
    USAGE.pod has not been updated since 2001, and is very basic.
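    As a footnote to the RANK.PL TIE HANDLING item above: the usual
    alternative to the current scheme is to give tied items the average of
    the rank positions they occupy, so that Pearson's correlation on the
    ranks yields the tie-corrected Spearman coefficient. A minimal sketch
    (in Python, not NSP code; the helper name average_ranks is ours):

```python
def average_ranks(scores):
    """Rank scores (highest score = rank 1); tied scores share the
    average of the rank positions they would jointly occupy."""
    # Indices sorted by descending score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        # Extend `end` over the run of indices whose scores are tied.
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        # Average of the 1-based positions pos+1 .. end+1.
        avg = (pos + 1 + end + 1) / 2
        for i in order[pos:end + 1]:
            ranks[i] = avg
        pos = end + 1
    return ranks

print(average_ranks([10, 8, 8, 5]))  # [1.0, 2.5, 2.5, 4.0]
```

    Under rank.pl's current scheme the same scores would rank 1, 2, 2, 4;
    averaging instead gives the two tied items rank 2.5 each, which is
    what the tie-corrected formulas expect.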
AUTHOR
    Ted Pedersen, tpederse@d.umn.edu

    Last Updated : $Id: TODO,v 1.26 2015/10/03 12:22:59 tpederse Exp $

BUGS
SEE ALSO
    home page: <http://www.d.umn.edu/~tpederse/nsp.html>

    mailing list: <http://groups.yahoo.com/group/ngram/>

COPYRIGHT
    Copyright (C) 2000-2010 Ted Pedersen

    Permission is granted to copy, distribute and/or modify this document
    under the terms of the GNU Free Documentation License, Version 1.2 or
    any later version published by the Free Software Foundation; with no
    Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.

    Note: a copy of the GNU Free Documentation License is available on the
    web at <http://www.gnu.org/copyleft/fdl.html> and is included in this
    distribution as FDL.txt.