#!/usr/local/bin/perl -w

=head1 NAME

huge-count.pl - Count all the bigrams in a huge text without using huge amounts of memory.

=head1 SYNOPSIS

huge-count.pl --tokenlist --split 100 destination-dir input

=head1 DESCRIPTION

Runs count.pl efficiently on large amounts of data by splitting the data into
separate files, counting each file separately, and then merging the counts
to get the overall results.

Up to two output files are created. destination-dir/huge-count.output contains
the bigram counts after applying --remove and --uremove, and is written only
when at least one of --remove, --uremove, --frequency or --ufrequency is given.
destination-dir/complete-huge-count.output provides the bigram counts as
if no --uremove or --remove cutoff were provided.

=head1 USAGE

huge-count.pl [OPTIONS] DESTINATION [SOURCE]+

=head1 INPUT

=head2 Required Arguments:

=head3 [SOURCE]+

Input to huge-count.pl should be one of the following:

=over

=item 1. Single plain text file

=item 2. Single flat directory containing multiple plain text files

=item 3. List of multiple plain text files

=back

=head3 DESTINATION

A complete path to a writable directory to which huge-count.pl can write all
intermediate and final output files. If DESTINATION does not exist, a new
directory is created; otherwise, the existing directory is used for writing
the output files.

NOTE: If DESTINATION already exists and the names of some of its files clash
with the names of the output files created by huge-count, these files will be
overwritten without prompting the user.

=head3 --tokenlist

This parameter is required. huge-count will call count.pl with --tokenlist
and print out every bigram that count.pl finds.

=head2 Optional Arguments:

=head4 --split N

This parameter should normally be given. huge-count divides the bigram
tokenlist generated by count.pl into parts, sorts each part, and recombines
the bigram counts from all these intermediate result files into a single
bigram output that shows bigram counts in SOURCE. If --split is omitted,
huge-count prints a warning and passes each tokenlist file to huge-sort.pl
unsplit.

Each part created with --split N will contain N lines. The value of N should
be chosen such that huge-sort.pl can be run efficiently on any part containing
N lines from the file that contains all the bigrams.

We suggest that N be equal to the number of KB of memory you have. If the
computer has 8 GB RAM, which is 8,000,000 KB, N should be set to 8000000. If
N is set too small, the split output file suffixes can be exhausted.
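
As a rough illustration of what --split N does to the tokenlist, the
following Perl sketch groups a list of lines into parts of at most N lines.
The helper name split_into_parts is hypothetical and is not part of
huge-split.pl, which works on files rather than on in-memory lists.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of huge-split.pl): group a list of
# tokenlist lines into parts of at most $n lines each.
sub split_into_parts {
    my ($n, @lines) = @_;
    die "N must be a positive integer\n" unless $n =~ /^[1-9]\d*$/;
    my @parts;
    while (@lines) {
        # splice removes and returns up to $n lines from the front
        push @parts, [ splice(@lines, 0, $n) ];
    }
    return @parts;
}

# Eight lines with N = 3 give parts of 3, 3 and 2 lines.
my @parts = split_into_parts(3, map("line$_", 1 .. 8));
print scalar(@parts), " parts\n";
```

The number of parts grows as N shrinks, which is why a very small N on a very
large tokenlist can use up the available output file suffixes.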

=head4 --token TOKENFILE

Specify a file containing Perl regular expressions that define the tokenization
scheme for counting. This will be provided to count.pl's --token option.

=head4 --nontoken NOTOKENFILE

Specify a file containing Perl regular expressions of non-token sequences
that are removed prior to tokenization. This will be provided to
count.pl's --nontoken option.

=head4 --stop STOPFILE

Specify a file of Perl regular expressions listing the stop words to be
omitted from the output bigrams. A stop list can be used in two modes:

AND mode, declared with '@stop.mode = AND' on the first line of the STOPFILE,

or

OR mode, declared with '@stop.mode = OR' on the first line of the STOPFILE.

In AND mode, bigrams in which both constituent words are stop words are
removed, while in OR mode, bigrams in which either or both constituent words
are stop words are removed from the output.
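
The difference between the two stop modes can be sketched in a few lines of
Perl. This is only an illustration under assumptions: the helper name
is_stopped and the two example regexes are hypothetical, and count.pl's own
handling of the STOPFILE is not reproduced here.

```perl
use strict;
use warnings;

# Hypothetical helper (not count.pl's code): decide whether a bigram is
# dropped. $mode is 'AND' or 'OR'; $stop is a reference to a list of
# compiled regexes, one per stop word.
sub is_stopped {
    my ($mode, $stop, $w1, $w2) = @_;
    my $s1 = grep { $w1 =~ $_ } @$stop;   # number of regexes matching word 1
    my $s2 = grep { $w2 =~ $_ } @$stop;   # number of regexes matching word 2
    return $mode eq 'AND' ? ($s1 && $s2) : ($s1 || $s2);
}

my @stop = (qr/^the$/, qr/^of$/);
for my $pair (['the', 'of'], ['the', 'cat'], ['black', 'cat']) {
    printf "%-10s AND=%d OR=%d\n", "@$pair",
        (is_stopped('AND', \@stop, @$pair) ? 1 : 0),
        (is_stopped('OR',  \@stop, @$pair) ? 1 : 0);
}
```

In this sketch a bigram made of one stop word and one content word survives
in AND mode but is dropped in OR mode.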

=head4 --window W

Tokens appearing within W positions from each other (with at most W-2
intervening words) will form bigrams. Same as count.pl's --window option.
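
The window rule can be made concrete with a small Perl sketch. The helper
name window_bigrams is hypothetical, and joining the two tokens with the
separator used in the example is an assumption made for readability; the
sketch only enumerates pairs and does no counting.

```perl
use strict;
use warnings;

# Hypothetical helper: list the ordered bigrams whose two tokens lie within
# a window of $w positions, so at most $w - 2 tokens sit between them.
sub window_bigrams {
    my ($w, @tokens) = @_;
    my @bigrams;
    for my $i (0 .. $#tokens) {
        my $last = $i + $w - 1;                  # furthest partner of token $i
        $last = $#tokens if $last > $#tokens;    # do not run past the end
        for my $j ($i + 1 .. $last) {
            push @bigrams, "$tokens[$i]<>$tokens[$j]";
        }
    }
    return @bigrams;
}

print join(' ', window_bigrams(3, qw(a b c d))), "\n";
```

With a window of 2 only adjacent tokens are paired; each increase of W by one
lets every token pair with one more following token.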

=head4 --remove L

Bigrams with counts less than L in the entire SOURCE data are removed from
the sample. The counts of the removed bigrams are not counted in any
marginal totals. This has the same effect as count.pl's --remove option.

=head4 --uremove L

Bigrams with counts more than L in the entire SOURCE data are removed from
the sample. The counts of the removed bigrams are not counted in any
marginal totals. This has the same effect as count.pl's --uremove option.

=head4 --frequency F

Bigrams with counts less than F in the entire SOURCE are not displayed.
The counts of the skipped bigrams ARE counted in the marginal totals. In other
words, --frequency in huge-count.pl has the same effect as count.pl's
--frequency option.

=head4 --ufrequency F

Bigrams with counts more than F in the entire SOURCE are not displayed.
The counts of the skipped bigrams ARE counted in the marginal totals. In other
words, --ufrequency in huge-count.pl has the same effect as count.pl's
--ufrequency option.

=head4 --newLine

Switches ON the --newLine option in count.pl. This will prevent bigrams from
spanning across lines.

=head3 Other Options:

=head4 --help

Displays this message.

=head4 --version

Displays the version information.

=head1 PROGRAM LOGIC

=over

=item * STEP 1

 # create output dir
 if(!-e DESTINATION) then
 mkdir DESTINATION;

=item * STEP 2

=over 3

=item 1. If SOURCE is a single plain file -

huge-count.pl calls count.pl with the --tokenlist option on the single
plain file and prints all bigrams into one file. The count outputs are
also created in DESTINATION.

=item 2. SOURCE is a single flat directory containing multiple plain files -

huge-count.pl calls count.pl with the --tokenlist option on each file
present in the SOURCE directory. All files in SOURCE are treated as
data files. If SOURCE contains sub-directories, these are simply skipped.
Intermediate bigram outputs are written in DESTINATION.

=item 3. SOURCE is a list of multiple plain files -

If more than two arguments are given, all arguments after the first are
considered SOURCE file names. count.pl is run separately on each of the SOURCE
files specified by argv[1], argv[2], ... argv[n] (skipping argv[0], which
should be DESTINATION). Intermediate results are created in DESTINATION.

=back

In summary, a large data file can be provided to huge-count in the form of

a. A single plain file

b. A directory containing several plain files

c. Multiple plain files directly specified as command line arguments

In all these cases, count.pl with --tokenlist is run separately on the SOURCE
files or parts of the SOURCE file, and intermediate results are written in
the DESTINATION directory.

=back

=over

=item * STEP 3

Split the output file generated by count.pl with --tokenlist into smaller
files of N bigrams each.

=item * STEP 4

huge-sort.pl counts the unique bigrams and sorts them in alphabetical order.

=item * STEP 5

huge-merge.pl merges the bigrams from the sorted bigram files.

=back
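
Steps 3 to 5 can be sketched on in-memory data as follows. This is a
simplified illustration, not the code of huge-split.pl, huge-sort.pl or
huge-merge.pl: the helper names count_part and merge_parts are hypothetical,
the tokenlist entries are assumed to be one bigram per element, and marginal
totals are left out.

```perl
use strict;
use warnings;

# Hypothetical helper for steps 3 and 4: count the unique bigrams
# that occur within one part of the tokenlist.
sub count_part {
    my %count;
    $count{$_}++ for @_;
    return \%count;
}

# Hypothetical helper for step 5: add up the per-part counts.
sub merge_parts {
    my %total;
    for my $part (@_) {
        $total{$_} += $part->{$_} for keys %$part;
    }
    return \%total;
}

my @tokenlist = ('a<>b', 'b<>c', 'a<>b', 'c<>d', 'a<>b', 'b<>c');
my $n = 2;    # plays the role of --split N
my @parts;
push @parts, count_part(splice(@tokenlist, 0, $n)) while @tokenlist;

my $total = merge_parts(@parts);
print "$_<>$total->{$_}\n" for sort keys %$total;
```

Only one part has to be held in memory while it is being counted, which is
the point of splitting before sorting and merging.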

=head1 OUTPUT

After huge-count finishes successfully, DESTINATION will contain -

=over

=item * Final bigram count file (huge-count.output) showing bigram counts in
the entire SOURCE after --remove and --uremove are applied.

=item * Final bigram count file (complete-huge-count.output) showing
bigram counts in the entire SOURCE without --remove and --uremove.

=back

=head1 BUGS

huge-count.pl doesn't consider bigrams at file boundaries. In other words,
the results of count.pl and huge-count.pl on the same data will
differ if --newLine is not used, because huge-count.pl runs count.pl
on multiple files separately and thus loses track of the bigrams
that span file boundaries. With --window not specified, one bigram is lost
at each file boundary, while with --window W, W bigrams are lost.

The functionality of huge-count with --tokenlist is the same as that of
count.pl only if --newLine is used and all files start and end on sentence
boundaries. In other words, no sentence should be broken across the start or
end of any file given to huge-count.
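
The boundary effect for the default window can be reproduced with a short
Perl sketch. The helper name adjacent_bigrams and the two token lists are
hypothetical; the sketch only pairs adjacent tokens and does not call
count.pl.

```perl
use strict;
use warnings;

# Hypothetical helper: adjacent-token bigrams of one token sequence,
# which corresponds to the default window of 2.
sub adjacent_bigrams {
    my @t = @_;
    my @bigrams;
    push @bigrams, "$t[$_]<>$t[$_ + 1]" for 0 .. $#t - 1;
    return @bigrams;
}

my @file1 = qw(the quick brown);
my @file2 = qw(fox jumps);

# Treating the text as one file versus two separately counted files.
my @whole    = adjacent_bigrams(@file1, @file2);
my @separate = (adjacent_bigrams(@file1), adjacent_bigrams(@file2));

printf "one file: %d bigrams, two files: %d bigrams\n",
    scalar(@whole), scalar(@separate);
```

The bigram that pairs the last token of the first list with the first token
of the second list appears only when the text is treated as one file.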

=head1 AUTHOR

Amruta Purandare, University of Minnesota, Duluth

Ted Pedersen, University of Minnesota, Duluth
tpederse at umn.edu

Ying Liu, University of Minnesota, Twin Cities
liux0395 at umn.edu

=head1 COPYRIGHT

Copyright (c) 2004-2010, Amruta Purandare, Ted Pedersen, and Ying Liu

This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation; either version 2 of the License, or (at your option) any later
version.

This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with
this program; if not, write to

The Free Software Foundation, Inc.,
59 Temple Place - Suite 330,
Boston, MA  02111-1307, USA.

=cut

###############################################################################


#$0 contains the program name along with
#the complete path. Extract just the program
#name and use in error messages
$0=~s/.*\/(.+)/$1/;

###############################################################################

#                           ================================
#                            COMMAND LINE OPTIONS AND USAGE
#                           ================================

# command line options
use Cwd;
use Getopt::Long;
GetOptions ("help","version","tokenlist","token=s","nontoken=s","remove=i","uremove=i", "window=i","stop=s","split=i","frequency=i","ufrequency=i", "newLine");
# show help option
if(defined $opt_help)
{
        $opt_help=1;
        &showhelp();
        exit;
}


# make sure tokenlist is used in huge-count.pl
if (!defined $opt_tokenlist)
{
	print "--tokenlist is required!\n";
	print STDERR "Type huge-count.pl --help for help.\n";
	exit;
}

if ((defined $opt_remove) and (defined $opt_uremove))
{
	if ($opt_remove > $opt_uremove)
	{
		print "--remove must be smaller than --uremove!\n";
		print STDERR "Type huge-count.pl --help for help.\n";
		exit;
	}
}

if ((defined $opt_frequency) and (defined $opt_ufrequency))
{
	if ($opt_frequency > $opt_ufrequency)
	{
		print "--frequency must be smaller than --ufrequency!\n";
		print STDERR "Type huge-count.pl --help for help.\n";
		exit;
	}
}

# show version information
if(defined $opt_version)
{
        $opt_version=1;
        &showversion();
        exit;
}

# show minimal usage message if fewer arguments
if($#ARGV<1)
{
        &showminimal();
        exit;
}


#############################################################################

#			========================
#			      CODE SECTION
#			========================

#accept the destination dir name
my $current_dir = getcwd;

$destdir=$ARGV[0];
if(-e $destdir)
{
	if(!-d $destdir)
	{
		print STDERR "ERROR($0):
	$destdir is not a directory.\n";
		exit;
	}
}
else
{
	system("mkdir $destdir");
}


# ----------
#  Counting
# ----------


# source = dir
if($#ARGV==1 && -d $ARGV[1])
{
	$sourcedir=$ARGV[1];
	opendir(DIR,$sourcedir) || die "ERROR($0):
	Error (code=$!) in opening Source Directory <$sourcedir>.\n";
	while(defined ($file=readdir DIR))
	{
		next if $file =~ /^\.\.?$/;
		# only plain files are counted; sub-directories are skipped
		if(-f "$sourcedir/$file")
		{
			&runcount("$sourcedir/$file",$destdir);
		}
	}
	closedir(DIR);
}
# source is a single file
elsif($#ARGV==1 && -f $ARGV[1])
{
	$source=$ARGV[1];

	# copy the source and any option files into destdir and then refer to
	# the copies by bare file name, because count.pl is run from inside
	# destdir, where the original relative paths would no longer resolve
	system("cp $source $destdir");
	$source=~s/.*\///;
	if(defined $opt_token)
	{
		system("cp $opt_token $destdir");
		$opt_token=~s/.*\///;
	}
	if(defined $opt_nontoken)
	{
		system("cp $opt_nontoken $destdir");
		$opt_nontoken=~s/.*\///;
	}
	if(defined $opt_stop)
	{
		system("cp $opt_stop $destdir");
		$opt_stop=~s/.*\///;
	}
	chdir $destdir;
	$chdir=1;
	&runcount($source,".");

}
# source contains multiple files
elsif($#ARGV > 1)
{
	foreach $i (1..$#ARGV)
	{
		if(-f $ARGV[$i])
		{
			&runcount($ARGV[$i],$destdir);
		}
		else
		{
			print STDERR "ERROR($0):
	ARGV[$i]=$ARGV[$i] should be a plain file.\n";
			exit;
		}
	}
}
# unexpected input
else
{
	&showminimal();
	exit;
}


# --------------------
# Split bigrams
# --------------------

if(!defined $chdir)
{
	chdir $destdir;
	$chdir = 1;
}
# current dir is now destdir
opendir(DIR,".") || die "ERROR($0):
       Error (code=$!) in opening Destination Directory <$destdir>.\n";

if (defined $opt_split)
{
	print "split the bigrams files...\n";
	while(defined ($file = readdir DIR))
	{
		# split each tokenlist file, then remove the unsplit original
		if($file=~/\.bigrams$/)
		{
			system("huge-split.pl --split $opt_split $file");
			system("/bin/rm $file");
		}
	}
}
else
{
	print STDERR "Warning($0): You can run huge-sort.pl directly on the \n";
	print STDERR "single tokenlist file if you don't want to split the tokenlist.\n";
}

# --------------------
# Sort bigrams
# --------------------

if (defined $opt_tokenlist)
{
	print "sort the bigrams files...\n";
	if(!defined $chdir)
	{
		chdir $destdir;
		$chdir = 1;
	}
	# current dir is now destdir
	opendir(DIR,".") || die "ERROR($0):
        Error (code=$!) in opening Destination Directory <$destdir>.\n";

	while(defined ($file = readdir DIR))
	{
		# sort every bigram file that has not been sorted yet
		if(($file=~/\.bigrams/) and ($file !~ /sorted$/))
		{
			system("huge-sort.pl $file");
		}
	}
}

# --------------------
# Combine bigrams
# --------------------

print "combine the bigrams files...\n";
if(defined $chdir)
{
	chdir $current_dir;
}

system("huge-merge.pl $destdir");


# --------------------
# Delete bigrams
# --------------------

print "delete the bigrams ...\n";

# collect the cutoff options that were given, in the order
# --remove, --uremove, --frequency, --ufrequency
my $delete_options="";
$delete_options.=" --remove $opt_remove" if defined $opt_remove;
$delete_options.=" --uremove $opt_uremove" if defined $opt_uremove;
$delete_options.=" --frequency $opt_frequency" if defined $opt_frequency;
$delete_options.=" --ufrequency $opt_ufrequency" if defined $opt_ufrequency;

# run huge-delete.pl only if at least one cutoff was given
if($delete_options ne "")
{
	system("huge-delete.pl$delete_options $destdir/merge* $destdir/finalmerge");
}

$output="complete-huge-count.output";
if($delete_options ne "")
{
	system("mv $destdir/merge.* $destdir/$output");
	system("mv $destdir/finalmerge $destdir/huge-count.output");
	print STDERR "Check the output in $destdir/huge-count.output\n";
}
else
{
	system("mv $destdir/merge.* $destdir/$output");
	print STDERR "Check the output in $destdir/$output\n";
}

exit;

##############################################################################

#                      ==========================
#                          SUBROUTINE SECTION
#                      ==========================

# runs count.pl on one source file and writes the result to
# <destdir>/<file name>.bigrams, passing along every option that was given
sub runcount
{
    my $file=shift;
    my $destdir=shift;
    my $justfile=$file;
    $justfile=~s/.*\/(.+)/$1/;

    # build the count.pl option string in a fixed order:
    # --tokenlist --newLine --window --token --nontoken --stop
    my $options="";
    $options.=" --tokenlist" if defined $opt_tokenlist;
    # GetOptions sets a plain flag to 1, so testing the value as well is
    # equivalent; naming the variable twice avoids a used-only-once warning
    $options.=" --newLine" if defined $opt_newLine and $opt_newLine;
    $options.=" --window $opt_window" if defined $opt_window;
    $options.=" --token $opt_token" if defined $opt_token;
    $options.=" --nontoken $opt_nontoken" if defined $opt_nontoken;
    $options.=" --stop $opt_stop" if defined $opt_stop;

    system("count.pl$options $destdir/$justfile.bigrams $file");

} # end of sub runcount()


#-----------------------------------------------------------------------------
#show minimal usage message
sub showminimal()
{
        print "Usage: huge-count.pl --tokenlist [OPTIONS] DESTINATION [SOURCE]+";
        print "\nTYPE huge-count.pl --help for help\n";
}

#-----------------------------------------------------------------------------
#show help
sub showhelp()
{
	print "Usage:  huge-count.pl --tokenlist [OPTIONS] DESTINATION [SOURCE]+

Efficiently runs count.pl on huge data.

SOURCE
	Can be one of -

		1. single plain file
		2. single flat directory containing multiple plain files
		3. list of plain files

DESTINATION
	Should be a directory where output is written.

REQUIRED PARAMETERS:

--tokenlist
	This option is required. Prints out the list of all bigrams.

OPTIONS:

--split N
	Number of bigrams in each separate bigrams file.

--token TOKENFILE
	Specify a file containing Perl regular expressions that define the
	tokenization scheme for counting.

--nontoken NOTOKENFILE
	Specify a file containing Perl regular expressions of non-token
	sequences that are removed prior to tokenization.

--stop STOPFILE
	Specify a file containing Perl regular expressions of stop words
	that are to be removed from the output bigrams.

--window W
	Specify the window size for counting.

--remove L
	Bigrams with counts less than L will be removed from the sample.
	remove must be smaller than uremove.

--uremove L
	Bigrams with counts more than L will be removed from the sample.
	uremove must be bigger than remove.

--frequency F
	Bigrams with counts less than F will not be displayed.
	frequency must be smaller than ufrequency.

--ufrequency F
	Bigrams with counts more than F will not be displayed.
	ufrequency must be bigger than frequency.

--newLine
	Prevents bigrams from spanning across the new-line characters.

--help
        Displays this message.

--version
        Displays the version information.

Type 'perldoc huge-count.pl' to view detailed documentation of huge-count.\n";
}

#------------------------------------------------------------------------------
#version information
sub showversion()
{
        print 'huge-count.pl  $Id: huge-count.pl,v 1.26 2011/03/31 23:04:04 tpederse Exp $';
        print "\nEfficiently runs count.pl on huge data.\n";
        print "Copyright (C) 2004-2011, Amruta Purandare, Ted Pedersen & Ying Liu.\n";
}

#############################################################################