1#!/usr/local/bin/perl -w 2 3=head1 NAME 4 5rank.pl - Calculate Spearman's Correlation on two ranked lists output by count.pl or statistic.pl 6 7=head1 SYNOPSIS 8 9Program to calculate the rank correlation coefficient between the rankings 10generated by two different statistical measures on the same 11bigram-frequency (as output by count.pl). 12 13=head1 DESCRIPTION 14 15=head2 1. Introduction 16 17This is a program that is meant to be used to compare two different 18statistical measures of association. Given the same set of n-grams ranked 19in two different ways by two different statistical measures, this program 20computes Spearman's rank correlation coefficient between the two rankings. 21 22=head3 1.2. Typical Way to Run rank.pl: 23 24Assume that test.cnt is a list of n-grams with their frequencies as 25output by program count.pl. Assume that we wish to test the 26dis/similarity of the statistical measures 'dice' and 'x2' with 27respect to the n-grams contained in test.cnt. To do so, we must first 28rank the n-grams using these two statistical measures using program 29statistic.pl. 30 31 perl statistic.pl dice test.dice test.cnt 32 perl statistic.pl x2 test.x2 test.cnt 33 34Having obtained two different rankings of the n-grams in test.cnt in 35files test.dice and test.x2, we can now compute the Spearman's rank 36correlation coefficient using these two rankings like so: 37 38 perl rank.pl test.dice test.x2. 39 40This will output a floating point number between -1 and 1. A return of 41'1' indicates a perfect match in rankings, '-1' a completely reversed 42ranking and '0' a pair of rankings that are completely unrelated to 43each other. Numbers that lie between these numbers indicate various 44degrees of relatedness / un-relatedness / reverse-relatedness. 45 46=head3 1.3. Re-Ranking the Ngrams: 47 48Recall that program statistic.pl ranks n-grams in such a way that the 49fact that an ngram has a rank 'r' implies that there are 'r-1' 50distinct scores greater than the score of this ngram. Thus say if 'k' 51n-grams are tied at a score with rank 'a', then the next highest 52scoring n-grams is given a rank 'a+1' instead of 'a+k+1'. 53 54For example, observe the following file output by statistic.pl: 55 56 11 57 of<>text<>1 1.0000 2 2 2 58 and<>a<>1 1.0000 1 1 1 59 a<>third<>1 1.0000 1 1 1 60 text<>second<>1 1.0000 1 1 1 61 line<>of<>2 0.8000 2 3 2 62 third<>line<>3 0.5000 1 1 3 63 line<>and<>3 0.5000 1 3 1 64 second<>line<>3 0.5000 1 1 3 65 first<>line<>3 0.5000 1 1 3 66 67Observe that although 4 bigrams have a rank of 1, the next highest 68scoring bigram is not ranked 5, but instead 2. 69 70Spearman's rank correlation coefficient requires the more conventional 71kind of ranking. Thus the above file is first "re-ranked" to the 72following: 73 74 11 75 of<>text<>1 1.0000 2 2 2 76 and<>a<>1 1.0000 1 1 1 77 a<>third<>1 1.0000 1 1 1 78 text<>second<>1 1.0000 1 1 1 79 line<>of<>5 0.8000 2 3 2 80 third<>line<>6 0.5000 1 1 3 81 line<>and<>6 0.5000 1 3 1 82 second<>line<>6 0.5000 1 1 3 83 first<>line<>6 0.5000 1 1 3 84 85And then these rankings are used to compute the correlation 86coefficient. 87 88=head3 1.4. Dealing with Dissimilar Lists of N-grams: 89 90The two input files to rank.pl may not have the same set of n-grams. In 91particular, if one or both of the files generated using statistic.pl 92has been generated using a frequency, rank or score cut-off, then it 93is likely that the two files will have different sets of n-grams. In 94such a situation, n-grams that do not occur in both files are removed, 95the n-grams that remain are re-ranked and then the correlation 96coefficient is computed. 97 98For example assume the following two files output by statistic.pl 99using two fictitious statistical measures from a fictitious file 100output by program count.pl. 101 102The first file: 103 104 first<>bigram<>1 4.000 1 1 105 second<>bigram<>2 3.000 2 2 106 extra<>bigram1<>3 2.000 3 3 107 third<>bigram<>4 1.000 4 4 108 109The second file: 110 111 second<>bigram<>1 4.000 2 2 112 extra<>bigram2<>2 3.000 4 4 113 first<>bigram<>3 2.000 1 1 114 third<>bigram<>4 1.000 3 3 115 116Observe that the bigrams extra<>bigram1<> in the first file and 117extra<>bigram2<> in the second file are not present in both 118files. After removing these bigrams and re-ranking the rest, we get the 119following files: 120 121The modified first file: 122 123 first<>bigram<>1 4.000 1 1 124 second<>bigram<>2 3.000 2 2 125 third<>bigram<>3 1.000 4 4 126 127The modified second file: 128 129 second<>bigram<>1 4.000 2 2 130 first<>bigram<>2 2.000 1 1 131 third<>bigram<>3 1.000 3 3 132 133Since each ngram belongs to both files, the correlation coefficient 134may be computed on both files. 135 136=head3 1.5. Example Shell Script rank-script.sh: 137 138We provide c-shell script rank-script.sh that takes a bigram count 139file and the names of two libraries and then computes the Spearman's 140rank correlation coefficient by making use successively of programs 141statistic.pl and rank.pl. 142 143Run this script like so: rank-script.sh <lib1> <lib2> <file> 144 145 where <lib1> is the first library, say dice 146 <lib2> is the second library, say x2 147 <file> is the file of ngrams and their frequencies produced 148 by program count.pl. 149 150For example, if test.cnt contains bigrams and their frequencies, we 151can run it like so to compute the rank correlation coefficient between 152dice and x2: 153 154 csh rank-script.sh dice x2 test.cnt. 155 156This runs the following commands in succession: 157 158 perl statistic.pl dice out1 test.cnt 159 perl statistic.pl x2 out2 test.cnt 160 perl rank.pl out1 out2 161 162The intermediate files out1 and out2 are later destroyed. 163 164Note that since no command line options are utilized in the running of 165program statistic.pl here, this script only works for bigrams and 166enforces no cut-offs. However the script is simple enough to be 167manually modified to the user's requirements. 168 169=head1 AUTHORS 170 171 Ted Pedersen, tpederse@umn.edu 172 Satanjeev Banerjee, bane0025@d.umn.edu 173 Bridget McInnes, bthomson@umn.edu 174 175This work has been partially supported by a National Science Foundation 176Faculty Early CAREER Development award (\#0092784) and by a Grant-in-Aid 177of Research, Artistry and Scholarship from the Office of the Vice 178President for Research and the Dean of the Graduate School of the 179University of Minnesota. 180 181=head1 COPYRIGHT 182 183Copyright (C) 2000-2012, Ted Pedersen and Satanjeev Banerjee and Bridget T. McInnes 184 185This suite of programs is free software; you can redistribute it and/or 186modify it under the terms of the GNU General Public License as published by the 187Free Software Foundation; either version 2 of the License, or (at your option) 188any later version. 189 190This program is distributed in the hope that it will be useful, but WITHOUT ANY 191WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A 192PARTICULAR PURPOSE. See the GNU General Public License for more details. 193 194You should have received a copy of the GNU General Public License along with 195this program; if not, write to the Free Software Foundation, Inc., 59 Temple 196Place - Suite 330, Boston, MA 02111-1307, USA. 197 198Note: The text of the GNU General Public License is provided in the file 199GPL.txt that you should have received with this distribution. 200 201=cut 202 203# 204#----------------------------------------------------------------------------- 205# Start of Program! 206#----------------------------------------------------------------------------- 207 208# we have to use commandline options, so use the necessary package! 209use Getopt::Long; 210 211# first check if no commandline options have been provided... in which case 212# print out the usage notes! 213if ( $#ARGV == -1 ) 214{ 215 minimalUsageNotes(); 216 askHelp(); 217 exit; 218} 219 220# now get the options! 221GetOptions("version", "help", "precision=i", "N"); 222 223# if help has been requested, print out help and exit 224if (defined $opt_help) 225{ 226 $opt_help = 1; 227 showHelp(); 228 exit; 229} 230 231# if version has been requested, print out the version number and exit 232if (defined $opt_version) 233{ 234 $opt_version = 1; 235 showVersion(); 236 exit; 237} 238 239my $precision = 4; 240# if precision is defined, set precision 241if (defined $opt_precision) 242{ 243 $precision = $opt_precision; 244} 245 246# if N is defined set N 247if (defined $opt_N) 248{ 249 $opt_N = 1; 250} 251 252 253# initialize variables 254my $xfile = shift; 255my $yfile = shift; 256 257open(X, $xfile) || die "Could not open file: $xfile\n"; 258open(Y, $yfile) || die "Could not open file: $yfile\n"; 259 260my $ymean = 0; 261my $xmean = 0; 262 263my $ycount = 0; 264my $xcount = 0; 265 266my %xhash = (); 267my %yhash = (); 268 269my %xlist = (); 270my %ylist = (); 271 272my $xtotal = 0; 273my $ytotal = 0; 274 275my $xN = <X>; 276while(<X>) { 277 278 chomp; 279 280 my @array = split/<>/; 281 282 my $numbers = pop @array; 283 284 my @stats = split/\s+/, $numbers; 285 my $rank = shift @stats; 286 287 my $ngram = join "<>", @array; 288 289 $xtotal++; 290 291 push @{$xhash{$rank}}, $ngram; 292 $xlist{$ngram}++; 293} close X; 294 295my $yN = <Y>; 296while(<Y>) { 297 chomp; 298 my @array = split/<>/; 299 300 my $numbers = pop @array; 301 my @stats = split/\s+/, $numbers; 302 my $rank = shift @stats; 303 304 my $ngram = join "<>", @array; 305 306 $ytotal++; 307 308 push @{$yhash{$rank}}, $ngram; 309 $ylist{$ngram}++; 310} close Y; 311 312my %xrank = (); 313my %yrank = (); 314 315my $r = 0; 316foreach my $rank (sort {$b<=>$a} keys %xhash) { 317 318 my $count = 0; 319 my $computed_rank = 0; my $crank = $r + 1; 320 foreach my $term (@{$xhash{$rank}}) { 321 if(exists $ylist{$term}) { 322 $computed_rank += $crank; 323 $count++; $crank++; 324 } 325 } 326 327 if($count == 0) { next; } 328 329 $computed_rank = $computed_rank / $count; 330 331 foreach my $term (@{$xhash{$rank}}) { 332 if(! (exists $ylist{$term})) { next; } 333 $xrank{$term} = $computed_rank; 334 $xmean += $computed_rank; 335 $xcount++; 336 $r++; 337 } 338} 339 340$r = 0; 341foreach my $rank (sort {$b<=>$a} keys %yhash) { 342 343 my $count = 0; 344 my $computed_rank = 0; my $crank = $r + 1; 345 foreach my $term (@{$yhash{$rank}}) { 346 if(exists $xlist{$term}) { 347 $computed_rank += $crank; 348 $count++; $crank++; 349 } 350 } 351 352 if($count == 0) { next; } 353 354 $computed_rank = $computed_rank / $count; 355 356 foreach my $term (@{$yhash{$rank}}) { 357 if(! (exists $xlist{$term})) { next; } 358 $yrank{$term} = $computed_rank; 359 $ymean += $computed_rank; 360 $ycount++; 361 $r++; 362 } 363} 364 365if( ($xcount == 0) or ($ycount == 0) ) { 366 print "ERROR: There are no ngrams in common between the files.\n"; 367 ## showHelp(); 368 exit; 369} 370 371$xmean = $xmean/$xcount; 372$ymean = $ymean/$ycount; 373 374 375my $numerator = 0; 376my $xdenom = 0; 377my $ydenom = 0; 378foreach my $term (sort keys %xrank) { 379 my $xi = $xrank{$term}; 380 my $yi = $yrank{$term}; 381 382 $numerator += ( ($xi-$xmean) * ($yi-$ymean) ); 383 384 $xdenom += ( ($xi - $xmean)**2 ); 385 $ydenom += ( ($yi - $ymean)**2 ); 386} 387 388my $denominator = sqrt($xdenom * $ydenom); 389 390if($denominator <= 0) { 391 print STDERR "Correlation can not be calculated.\n"; 392 print STDERR "There are no ngrams in common between the files.\n"; 393 exit; 394} 395 396my $pearsons = $numerator / $denominator; 397 398my $floatformat = join '', '%', '.', $precision, 'f'; 399my $score = sprintf $floatformat, $pearsons; 400 401printf("Rank correlation coefficient = $floatformat", $score); 402 403if(defined $opt_N) { print " (N = $xcount)"; } 404 405print "\n"; 406 407 408# function to output a minimal usage note when the user has not provided any 409# commandline options 410sub minimalUsageNotes 411{ 412 print "Usage: rank.pl [OPTIONS] SOURCE_FILE1 SOURCE_FILE2\n"; 413} 414 415# function to output help messages for this program 416sub showHelp 417{ 418 print "Usage: rank.pl [OPTIONS] SOURCE_FILE1 SOURCE_FILE2\n\n"; 419 420 print "To compute the Spearman rank correlation coefficient between two lists\n"; 421 print "of ngrams produced by program statistic.pl using two (probably different)\n"; 422 print "statistical measures. SOURCE_FILE1 and SOURCE_FILE2 should contain the two\n"; 423 print "lists respectively. Ngrams occurring in both source files are chosen,\n"; 424 print "their ranks are adjusted and then these ranks are used to compute the\n"; 425 print "correlation coefficient.\n\n"; 426 427 print "OPTIONS:\n\n"; 428 429 print " --precision N Rounds coefficient to N places (default = 4)\n"; 430 print " --N Returns count of ngrams in common between files\n\n"; 431 432 print " --version Prints the version number.\n\n"; 433 434 print " --help Prints this help message.\n\n"; 435 436} 437 438 439# function to show the version number 440sub showVersion 441{ 442 print "rank.pl - version 0.05\n"; 443 print "Copyright (C) 2000-2012, Ted Pedersen & Satanjeev Banerjee & Bridget T McInnes\n"; 444} 445 446# function to output "ask for help" message when the user's goofed up! 447sub askHelp 448{ 449 print STDERR "Type rank.pl --help for help.\n"; 450} 451