1#!/usr/local/bin/perl -w
2
3=head1 NAME
4
5rank.pl - Calculate Spearman's Correlation on two ranked lists output by count.pl or statistic.pl
6
7=head1 SYNOPSIS
8
9Program to calculate the rank correlation coefficient between the rankings
10generated by two different statistical measures on the same
11bigram-frequency (as output by count.pl).
12
13=head1 DESCRIPTION
14
15=head2 1. Introduction
16
17This is a program that is meant to be used to compare two different
18statistical measures of association. Given the same set of n-grams ranked
19in two different ways by two different statistical measures, this program
20computes Spearman's rank correlation coefficient between the two rankings.
21
22=head3 1.2. Typical Way to Run rank.pl:
23
24Assume that test.cnt is a list of n-grams with their frequencies as
25output by program count.pl. Assume that we wish to test the
26dis/similarity of the statistical measures 'dice' and 'x2' with
27respect to the n-grams contained in test.cnt. To do so, we must first
28rank the n-grams using these two statistical measures using program
29statistic.pl.
30
31 perl statistic.pl dice test.dice test.cnt
32 perl statistic.pl x2 test.x2 test.cnt
33
34Having obtained two different rankings of the n-grams in test.cnt in
35files test.dice and test.x2, we can now compute the Spearman's rank
36correlation coefficient using these two rankings like so:
37
38 perl rank.pl test.dice test.x2.
39
40This will output a floating point number between -1 and 1. A return of
41'1' indicates a perfect match in rankings, '-1' a completely reversed
42ranking and '0' a pair of rankings that are completely unrelated to
43each other. Numbers that lie between these numbers indicate various
44degrees of relatedness / un-relatedness / reverse-relatedness.
45
46=head3 1.3. Re-Ranking the Ngrams:
47
48Recall that program statistic.pl ranks n-grams in such a way that the
49fact that an ngram has a rank 'r' implies that there are 'r-1'
50distinct scores greater than the score of this ngram. Thus say if 'k'
51n-grams are tied at a score with rank 'a', then the next highest
52scoring n-grams is given a rank 'a+1' instead of 'a+k+1'.
53
54For example, observe the following file output by statistic.pl:
55
56 11
57 of<>text<>1 1.0000 2 2 2
58 and<>a<>1 1.0000 1 1 1
59 a<>third<>1 1.0000 1 1 1
60 text<>second<>1 1.0000 1 1 1
61 line<>of<>2 0.8000 2 3 2
62 third<>line<>3 0.5000 1 1 3
63 line<>and<>3 0.5000 1 3 1
64 second<>line<>3 0.5000 1 1 3
65 first<>line<>3 0.5000 1 1 3
66
67Observe that although 4 bigrams have a rank of 1, the next highest
68scoring bigram is not ranked 5, but instead 2.
69
70Spearman's rank correlation coefficient requires the more conventional
71kind of ranking. Thus the above file is first "re-ranked" to the
72following:
73
74 11
75 of<>text<>1 1.0000 2 2 2
76 and<>a<>1 1.0000 1 1 1
77 a<>third<>1 1.0000 1 1 1
78 text<>second<>1 1.0000 1 1 1
79 line<>of<>5 0.8000 2 3 2
80 third<>line<>6 0.5000 1 1 3
81 line<>and<>6 0.5000 1 3 1
82 second<>line<>6 0.5000 1 1 3
83 first<>line<>6 0.5000 1 1 3
84
85And then these rankings are used to compute the correlation
86coefficient.
87
88=head3 1.4. Dealing with Dissimilar Lists of N-grams:
89
90The two input files to rank.pl may not have the same set of n-grams. In
91particular, if one or both of the files generated using statistic.pl
92has been generated using a frequency, rank or score cut-off, then it
93is likely that the two files will have different sets of n-grams. In
94such a situation, n-grams that do not occur in both files are removed,
95the n-grams that remain are re-ranked and then the correlation
96coefficient is computed.
97
98For example assume the following two files output by statistic.pl
99using two fictitious statistical measures from a fictitious file
100output by program count.pl.
101
102The first file:
103
104 first<>bigram<>1 4.000 1 1
105 second<>bigram<>2 3.000 2 2
106 extra<>bigram1<>3 2.000 3 3
107 third<>bigram<>4 1.000 4 4
108
109The second file:
110
111 second<>bigram<>1 4.000 2 2
112 extra<>bigram2<>2 3.000 4 4
113 first<>bigram<>3 2.000 1 1
114 third<>bigram<>4 1.000 3 3
115
116Observe that the bigrams extra<>bigram1<> in the first file and
117extra<>bigram2<> in the second file are not present in both
118files. After removing these bigrams and re-ranking the rest, we get the
119following files:
120
121The modified first file:
122
123 first<>bigram<>1 4.000 1 1
124 second<>bigram<>2 3.000 2 2
125 third<>bigram<>3 1.000 4 4
126
127The modified second file:
128
129 second<>bigram<>1 4.000 2 2
130 first<>bigram<>2 2.000 1 1
131 third<>bigram<>3 1.000 3 3
132
133Since each ngram belongs to both files, the correlation coefficient
134may be computed on both files.
135
136=head3 1.5. Example Shell Script rank-script.sh:
137
138We provide c-shell script rank-script.sh that takes a bigram count
139file and the names of two libraries and then computes the Spearman's
140rank correlation coefficient by making use successively of programs
141statistic.pl and rank.pl.
142
143Run this script like so: rank-script.sh <lib1> <lib2> <file>
144
145    where <lib1> is the first library, say dice
146          <lib2> is the second library, say x2
147	  <file> is the file of ngrams and their frequencies produced
148		 by program count.pl.
149
150For example, if test.cnt contains bigrams and their frequencies, we
151can run it like so to compute the rank correlation coefficient between
152dice and x2:
153
154    csh rank-script.sh dice x2 test.cnt.
155
156This runs the following commands in succession:
157
158 perl statistic.pl dice out1 test.cnt
159 perl statistic.pl x2 out2 test.cnt
160 perl rank.pl out1 out2
161
162The intermediate files out1 and out2 are later destroyed.
163
164Note that since no command line options are utilized in the running of
165program statistic.pl here, this script only works for bigrams and
166enforces no cut-offs. However the script is simple enough to be
167manually modified to the user's requirements.
168
169=head1 AUTHORS
170
171 Ted Pedersen, tpederse@umn.edu
172 Satanjeev Banerjee, bane0025@d.umn.edu
173 Bridget McInnes, bthomson@umn.edu
174
175This work has been partially supported by a National Science Foundation
176Faculty Early CAREER Development award (\#0092784) and by a Grant-in-Aid
177of Research, Artistry and Scholarship from the Office of the Vice
178President for Research and the Dean of the Graduate School of the
179University of Minnesota.
180
181=head1 COPYRIGHT
182
183Copyright (C) 2000-2012, Ted Pedersen and Satanjeev Banerjee and Bridget T. McInnes
184
185This suite of programs is free software; you can redistribute it and/or
186modify it under the terms of the GNU General Public License as published by the
187Free Software Foundation; either version 2 of the License, or (at your option)
188any later version.
189
190This program is distributed in the hope that it will be useful, but WITHOUT ANY
191WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
192PARTICULAR PURPOSE.  See the GNU General Public License for more details.
193
194You should have received a copy of the GNU General Public License along with
195this program; if not, write to the Free Software Foundation, Inc., 59 Temple
196Place - Suite 330, Boston, MA  02111-1307, USA.
197
198Note: The text of the GNU General Public License is provided in the file
199GPL.txt that you should have received with this distribution.
200
201=cut
202
203#
204#-----------------------------------------------------------------------------
205#                                 Start of Program!
206#-----------------------------------------------------------------------------
207
208# we have to use commandline options, so use the necessary package!
209use Getopt::Long;
210
211# first check if no commandline options have been provided... in which case
212# print out the usage notes!
213if ( $#ARGV == -1 )
214{
215    minimalUsageNotes();
216    askHelp();
217    exit;
218}
219
220# now get the options!
221GetOptions("version", "help", "precision=i", "N");
222
223# if help has been requested, print out help and exit
224if (defined $opt_help)
225{
226    $opt_help = 1;
227    showHelp();
228    exit;
229}
230
231# if version has been requested, print out the version number and exit
232if (defined $opt_version)
233{
234    $opt_version = 1;
235    showVersion();
236    exit;
237}
238
239my $precision = 4;
240#  if precision is defined, set precision
241if (defined $opt_precision)
242{
243    $precision = $opt_precision;
244}
245
246#  if N is defined set N
247if (defined $opt_N)
248{
249    $opt_N = 1;
250}
251
252
253#  initialize variables
254my $xfile = shift;
255my $yfile = shift;
256
257open(X, $xfile) || die "Could not open file: $xfile\n";
258open(Y, $yfile) || die "Could not open file: $yfile\n";
259
260my $ymean = 0;
261my $xmean = 0;
262
263my $ycount = 0;
264my $xcount = 0;
265
266my %xhash = ();
267my %yhash = ();
268
269my %xlist = ();
270my %ylist = ();
271
272my $xtotal = 0;
273my $ytotal = 0;
274
275my $xN = <X>;
276while(<X>) {
277
278    chomp;
279
280    my @array = split/<>/;
281
282    my $numbers = pop @array;
283
284    my @stats   = split/\s+/, $numbers;
285    my $rank    = shift @stats;
286
287    my $ngram = join "<>", @array;
288
289    $xtotal++;
290
291    push @{$xhash{$rank}}, $ngram;
292    $xlist{$ngram}++;
293} close X;
294
295my $yN = <Y>;
296while(<Y>) {
297    chomp;
298    my @array = split/<>/;
299
300    my $numbers = pop @array;
301    my @stats   = split/\s+/, $numbers;
302    my $rank    = shift @stats;
303
304    my $ngram = join "<>", @array;
305
306    $ytotal++;
307
308    push @{$yhash{$rank}}, $ngram;
309    $ylist{$ngram}++;
310} close Y;
311
312my %xrank = ();
313my %yrank = ();
314
315my $r = 0;
316foreach my $rank (sort {$b<=>$a} keys %xhash) {
317
318    my $count = 0;
319    my $computed_rank = 0; my $crank = $r + 1;
320    foreach my $term (@{$xhash{$rank}}) {
321	if(exists $ylist{$term}) {
322	    $computed_rank += $crank;
323	    $count++; $crank++;
324	}
325    }
326
327    if($count == 0) { next; }
328
329    $computed_rank = $computed_rank / $count;
330
331    foreach my $term (@{$xhash{$rank}}) {
332	if(! (exists $ylist{$term})) { next; }
333	$xrank{$term} = $computed_rank;
334	$xmean += $computed_rank;
335	$xcount++;
336	$r++;
337    }
338}
339
340$r = 0;
341foreach my $rank (sort {$b<=>$a} keys %yhash) {
342
343    my $count = 0;
344    my $computed_rank = 0; my $crank = $r + 1;
345    foreach my $term (@{$yhash{$rank}}) {
346	if(exists $xlist{$term}) {
347	    $computed_rank += $crank;
348	    $count++; $crank++;
349	}
350    }
351
352    if($count == 0) { next; }
353
354    $computed_rank = $computed_rank / $count;
355
356    foreach my $term (@{$yhash{$rank}}) {
357	if(! (exists $xlist{$term})) { next; }
358	$yrank{$term} = $computed_rank;
359	$ymean += $computed_rank;
360	$ycount++;
361	$r++;
362    }
363}
364
365if( ($xcount == 0) or ($ycount == 0) ) {
366    print "ERROR: There are no ngrams in common between the files.\n";
367    ##    showHelp();
368    exit;
369}
370
371$xmean = $xmean/$xcount;
372$ymean = $ymean/$ycount;
373
374
375my $numerator = 0;
376my $xdenom = 0;
377my $ydenom = 0;
378foreach my $term (sort keys %xrank) {
379    my $xi = $xrank{$term};
380    my $yi = $yrank{$term};
381
382    $numerator += ( ($xi-$xmean) * ($yi-$ymean) );
383
384    $xdenom += ( ($xi - $xmean)**2 );
385    $ydenom += ( ($yi - $ymean)**2 );
386}
387
388my $denominator = sqrt($xdenom * $ydenom);
389
390if($denominator <= 0) {
391    print STDERR "Correlation can not be calculated.\n";
392    print STDERR "There are no ngrams in common between the files.\n";
393    exit;
394}
395
396my $pearsons = $numerator / $denominator;
397
398my $floatformat = join '', '%', '.', $precision, 'f';
399my $score = sprintf $floatformat, $pearsons;
400
401printf("Rank correlation coefficient = $floatformat", $score);
402
403if(defined $opt_N) { print " (N = $xcount)"; }
404
405print "\n";
406
407
408# function to output a minimal usage note when the user has not provided any
409# commandline options
410sub minimalUsageNotes
411{
412    print "Usage: rank.pl [OPTIONS] SOURCE_FILE1 SOURCE_FILE2\n";
413}
414
415# function to output help messages for this program
416sub showHelp
417{
418    print "Usage: rank.pl [OPTIONS] SOURCE_FILE1 SOURCE_FILE2\n\n";
419
420    print "To compute the Spearman rank correlation coefficient between two lists\n";
421    print "of ngrams produced by program statistic.pl using two (probably different)\n";
422    print "statistical measures. SOURCE_FILE1 and SOURCE_FILE2 should contain the two\n";
423    print "lists respectively. Ngrams occurring in both source files are chosen,\n";
424    print "their ranks are adjusted and then these ranks are used to compute the\n";
425    print "correlation coefficient.\n\n";
426
427    print "OPTIONS:\n\n";
428
429    print "   --precision N   Rounds coefficient to N places (default = 4)\n";
430    print "   --N             Returns count of ngrams in common between files\n\n";
431
432    print "   --version       Prints the version number.\n\n";
433
434    print "   --help          Prints this help message.\n\n";
435
436}
437
438
439# function to show the version number
440sub showVersion
441{
442    print "rank.pl         -       version 0.05\n";
443    print "Copyright (C) 2000-2012, Ted Pedersen & Satanjeev Banerjee & Bridget T McInnes\n";
444}
445
446# function to output "ask for help" message when the user's goofed up!
447sub askHelp
448{
449    print STDERR "Type rank.pl --help for help.\n";
450}
451