#!/usr/local/bin/perl -w

=head1 NAME

huge-count.pl - Divide huge text into pieces and run count.pl separately on each (and then combine)

=head1 SYNOPSIS

Runs count.pl efficiently on huge data.

=head1 USAGE

huge-count.pl [OPTIONS] DESTINATION [SOURCE]+

=head1 INPUT

=head2 Required Arguments:

=head3 [SOURCE]+

Input to huge-count.pl should be one of the following:

=over

=item 1. Single plain text file

Or

=item 2. Single flat directory containing multiple plain text files

Or

=item 3. List of multiple plain text files

=back

=head3 DESTINATION

A complete path to a writable directory to which huge-count.pl can write all
intermediate and final output files. If DESTINATION does not exist,
a new directory is created; otherwise, the existing directory is used
for writing the output files.

NOTE: If DESTINATION already exists and the names of some of its existing
files clash with the names of the output files created by huge-count.pl,
those files will be overwritten without prompting the user.

=head2 Optional Arguments:

=head4 --split P

This option should be specified when SOURCE is a single plain file.
huge-count.pl will divide the given SOURCE file into P (approximately) equal
parts, run count.pl separately on each part, and then recombine the bigram
counts from all these intermediate result files into a single bigram output
that shows the bigram counts in SOURCE.

If the SOURCE file contains M lines, each part created with --split P will
contain approximately M/P lines. The value of P should be chosen such that
count.pl can be run efficiently on any part containing M/P lines from SOURCE.
As the number of words per line differs from file to file, it is recommended
that P be large enough that each part contains at most a million words in total.
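
The recommendation above can be turned into a rough rule of thumb. The sketch
below is only an illustration: the helper name choose_parts is hypothetical,
and it assumes the total word count of SOURCE is already known and that words
are spread fairly evenly across lines.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of huge-count.pl): smallest P such that
# each of P roughly equal parts holds at most $max_words words.
sub choose_parts
{
    my ($total_words, $max_words) = @_;
    # integer ceiling of $total_words / $max_words
    my $p = int(($total_words + $max_words - 1) / $max_words);
    return $p < 1 ? 1 : $p;
}

# A 12.5 million word SOURCE with a one million word limit per part
print choose_parts(12_500_000, 1_000_000), "\n";
```

For the values shown, this prints 13, which would then be passed as --split 13.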

=head4 --token TOKENFILE

Specify a file containing Perl regular expressions that define the tokenization
scheme for counting. This will be provided to count.pl's --token option.

=head4 --nontoken NONTOKENFILE

Specify a file containing Perl regular expressions of non-token sequences
that are removed prior to tokenization. This will be provided to
count.pl's --nontoken option.

=head4 --stop STOPFILE

Specify a file of Perl regular expressions that list the stop words to be
omitted from the output bigrams. A stop list can be used in two modes:

AND mode, declared with '@stop.mode = AND' on the first line of the STOPFILE

or

OR mode, declared with '@stop.mode = OR' on the first line of the STOPFILE.

In AND mode, bigrams in which both constituent words are stop words are removed,
while in OR mode, bigrams in which either or both constituent words are
stop words are removed from the output.
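
The difference between the two stop modes can be sketched as follows. This is
an illustration only, not count.pl's implementation: the helper is_stopped is
hypothetical, and a single compiled regex stands in for the patterns that a
STOPFILE would list.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the bigram ($w1, $w2) should be dropped
# under the given stop mode ('AND' or 'OR').
sub is_stopped
{
    my ($w1, $w2, $stop_re, $mode) = @_;
    my $s1 = ($w1 =~ $stop_re) ? 1 : 0;
    my $s2 = ($w2 =~ $stop_re) ? 1 : 0;
    return $mode eq 'AND' ? ($s1 && $s2) : ($s1 || $s2);
}

# Assumed stop list for the example
my $stop_re = qr/^(?:the|of|a)$/;

for my $mode ('AND', 'OR') {
    for my $pair (['of', 'the'], ['the', 'cat'], ['black', 'cat']) {
        my $verdict = is_stopped(@$pair, $stop_re, $mode) ? 'removed' : 'kept';
        print "$mode: @$pair -> $verdict\n";
    }
}
```

In this example "of the" is removed in both modes, "the cat" is removed only
in OR mode, and "black cat" is kept in both.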

=head4 --window W

Tokens appearing within W positions from each other (with at most W-2
intervening words) will form bigrams. Same as count.pl's --window option.
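
To make the window definition concrete, the sketch below lists the ordered
token pairs that fall within a window of W. The helper window_bigrams is
hypothetical and works on an already tokenized list; the pairs are joined
with <> purely for display.

```perl
use strict;
use warnings;

# Hypothetical helper: ordered token pairs whose positions differ by at
# most $window - 1, i.e. with at most $window - 2 tokens in between.
sub window_bigrams
{
    my ($window, @tokens) = @_;
    my @pairs;
    for my $i (0 .. $#tokens) {
        for my $j ($i + 1 .. $i + $window - 1) {
            last if $j > $#tokens;
            push @pairs, "$tokens[$i]<>$tokens[$j]";
        }
    }
    return @pairs;
}

print join(' ', window_bigrams(3, qw(a b c d))), "\n";
```

With W = 3 and the tokens a b c d, this prints a<>b a<>c b<>c b<>d c<>d.
With W = 2 only adjacent tokens are paired.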

=head4 --remove L

Bigrams with counts less than L in the entire SOURCE data are removed from
the sample. The counts of the removed bigrams are not counted in any
marginal totals. This has the same effect as count.pl's --remove option.

=head4 --frequency F

Bigrams with counts less than F in the entire SOURCE are not displayed.
The counts of the skipped bigrams ARE counted in the marginal totals. In other
words, --frequency in huge-count.pl has the same effect as count.pl's
--frequency option.
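
The contrast between --remove and --frequency can be sketched as below. This
is a simplified illustration: the helper apply_cutoff is hypothetical, and it
tracks only the overall bigram total rather than the per-word marginal counts
that count.pl also reports.

```perl
use strict;
use warnings;

# Hypothetical helper: applies a cutoff to a bigram => count hash.
# In 'remove' mode the dropped counts leave the total; in 'frequency'
# mode they are hidden from the output but still included in the total.
sub apply_cutoff
{
    my ($counts, $cutoff, $mode) = @_;
    my %shown = map  { $_ => $counts->{$_} }
                grep { $counts->{$_} >= $cutoff } keys %$counts;
    my $source = $mode eq 'remove' ? \%shown : $counts;
    my $total = 0;
    $total += $_ for values %$source;
    return ($total, \%shown);
}

my %counts = ('new<>york' => 5, 'york<>city' => 3, 'city<>hall' => 1);
for my $mode ('remove', 'frequency') {
    my ($total, $shown) = apply_cutoff(\%counts, 2, $mode);
    print "$mode: total=$total shown=", join(',', sort keys %$shown), "\n";
}
```

Both modes display the same two bigrams here, but the total is 8 in remove
mode and 9 in frequency mode.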

=head4 --newLine

Switches ON the --newLine option in count.pl. This will prevent bigrams from
spanning across the lines.

=head3 Other Options:

=head4 --help

Displays this message.

=head4 --version

Displays the version information.

=head1 PROGRAM LOGIC

=head2 STEP 1

 # create the output directory if it does not exist
 if (!-e DESTINATION) {
     mkdir DESTINATION;
 }

=head2 STEP 2

=over 4

=item 1. SOURCE is a single plain file -

Split SOURCE into P smaller files (as specified by --split P).
These files are created in the DESTINATION directory and their names are
formatted as SOURCE1, SOURCE2, ... SOURCEP.

Run count.pl on each of the P smaller files. The count outputs are also
created in DESTINATION and their names are formatted as SOURCE1.bigrams,
SOURCE2.bigrams, ... SOURCEP.bigrams.

=item 2. SOURCE is a single flat directory containing multiple plain files -

count.pl is run on each file present in the SOURCE directory. All files in
SOURCE are treated as data files. If SOURCE contains sub-directories,
these are simply skipped. Intermediate bigram outputs are written to
DESTINATION.

=item 3. SOURCE is a list of multiple plain files -

If more than two arguments are given, all arguments after the first are
treated as SOURCE file names. count.pl is run separately on each of the SOURCE
files specified by argv[1], argv[2], ... argv[n] (skipping argv[0], which
should be DESTINATION). Intermediate results are created in DESTINATION.

Files specified in the list of SOURCE should be relatively small
plain files with fewer than 1,000,000 words each.

=back

In summary, a large data file can be provided to huge-count.pl in the form of

a. A single plain file (along with --split P)

b. A directory containing several plain files

c. Multiple plain files directly specified as command line arguments

In all these cases, count.pl is run separately on the SOURCE files or on parts
of the SOURCE file, and intermediate results are written to the DESTINATION
directory.

=head2 STEP 3

Intermediate count results created in STEP 2 are recombined in a pair-wise
fashion such that, for P separate count output files C1, C2, C3, ..., CP,

C1 and C2 are first recombined and the result is written to huge-count.output.

Counts from each of C3, C4, ..., CP are then combined with (added to)
huge-count.output. Each time two files are recombined, the smaller of the
two is the one loaded.
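
The pair-wise recombination can be pictured with in-memory hashes. The sketch
below is an illustration only and does not reproduce huge-combine.pl: the
helper combine_counts is hypothetical, it works on plain bigram => count
hashes, and it ignores the marginal counts stored in real count.pl output.

```perl
use strict;
use warnings;

# Hypothetical helper: adds the counts of two bigram => count hashes.
# It copies the larger hash and walks the smaller one, loosely mirroring
# the idea of loading only the smaller of the two files.
sub combine_counts
{
    my ($x, $y) = @_;
    my ($small, $large) = keys(%$x) <= keys(%$y) ? ($x, $y) : ($y, $x);
    my %sum = %$large;
    $sum{$_} += $small->{$_} for keys %$small;
    return \%sum;
}

my @parts = (
    { 'new<>york' => 2, 'york<>city' => 1 },    # C1
    { 'new<>york' => 3 },                       # C2
    { 'city<>hall' => 4, 'york<>city' => 1 },   # C3
);
my $total = shift @parts;                         # start from C1
$total = combine_counts($total, $_) for @parts;   # fold in C2 .. CP
print "$_ $total->{$_}\n" for sort keys %$total;
```

For these three parts the combined counts are city<>hall 4, new<>york 5 and
york<>city 2.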

=head2 STEP 4

After all files are recombined, the resultant huge-count.output is sorted
in descending order of the bigram counts. If --remove is specified,
bigrams with counts less than the specified value in the final
huge-count.output file are removed from the sample and their counts are
deleted from the marginal totals. If --frequency is specified, bigrams with
counts less than the specified value are simply omitted from the output.

=head1 OUTPUT

After huge-count.pl finishes successfully, DESTINATION will contain:

=over

=item * Intermediate bigram count files (*.bigrams) created for each of the
given SOURCE files or split parts of the SOURCE file.

=item * Final bigram count file (huge-count.output) showing bigram counts in
the entire SOURCE.

=back

=head1 BUGS

huge-count.pl doesn't consider bigrams at file boundaries. In other words,
the results of count.pl and huge-count.pl on the same data file will
differ if --newLine is not used, because huge-count.pl runs count.pl
on multiple files separately and thus loses track of the bigrams
that span file boundaries. With --window not specified, one bigram is lost
at each file boundary; with --window W, every pair of tokens that lies within
the window but straddles the boundary is lost, which is up to W(W-1)/2
bigrams per boundary.

The functionality of huge-count.pl is the same as that of count.pl only if
--newLine is used and all files start and end on sentence boundaries. In other
words, no sentence should be split across the start or end of any file given
to huge-count.pl.
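
The boundary loss can be checked on a toy example. The sketch below is only an
illustration under the window definition given for --window: the helper
count_bigrams is hypothetical, it counts token pairs in an already tokenized
list, and it compares counting the whole text against counting two parts
separately.

```perl
use strict;
use warnings;

# Hypothetical helper: number of ordered token pairs whose positions
# differ by at most $window - 1.
sub count_bigrams
{
    my ($window, @tokens) = @_;
    my $n = 0;
    for my $i (0 .. $#tokens) {
        for my $j ($i + 1 .. $i + $window - 1) {
            last if $j > $#tokens;
            $n++;
        }
    }
    return $n;
}

my @part1 = qw(a b c d);
my @part2 = qw(e f g h);
for my $window (2, 3, 4) {
    my $whole = count_bigrams($window, @part1, @part2);
    my $split = count_bigrams($window, @part1) + count_bigrams($window, @part2);
    print "window=$window lost=", $whole - $split, "\n";
}
```

For this single boundary the sketch reports 1 lost pair for a window of 2,
3 for a window of 3 and 6 for a window of 4.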

=head1 AUTHOR

Amruta Purandare, Ted Pedersen.
University of Minnesota at Duluth.

=head1 COPYRIGHT

Copyright (c) 2004,

Amruta Purandare, University of Minnesota, Duluth.
pura0010@umn.edu

Ted Pedersen, University of Minnesota, Duluth.
tpederse@umn.edu

This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation; either version 2 of the License, or (at your option) any later
version.

This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with
this program; if not, write to

The Free Software Foundation, Inc.,
59 Temple Place - Suite 330,
Boston, MA  02111-1307, USA.

=cut

###############################################################################


#$0 contains the program name along with
#the complete path. Extract just the program
#name and use in error messages
$0=~s/.*\/(.+)/$1/;

###############################################################################

#                           ================================
#                            COMMAND LINE OPTIONS AND USAGE
#                           ================================

# command line options
use Getopt::Long;
GetOptions ("help","version","token=s","nontoken=s","remove=i","window=i","stop=s","split=i","frequency=i","newLine");
# show help option
if(defined $opt_help)
{
        $opt_help=1;
        &showhelp();
        exit;
}

# show version information
if(defined $opt_version)
{
        $opt_version=1;
        &showversion();
        exit;
}

# show minimal usage message if fewer arguments
if($#ARGV<1)
{
        &showminimal();
        exit;
}

if(defined $opt_frequency && defined $opt_remove)
{
	print STDERR "ERROR($0):
	Options --remove and --frequency can't be both used together.\n";
	exit;
}

#############################################################################

#			========================
#			      CODE SECTION
#			========================

#accept the destination dir name
$destdir=$ARGV[0];
if(-e $destdir)
{
	if(!-d $destdir)
	{
		print STDERR "ERROR($0):
	$destdir is not a directory.\n";
		exit;
	}
}
else
{
	# create the destination directory and stop if that fails
	mkdir($destdir) || die "ERROR($0):
	Error (code=$!) in creating Destination Directory <$destdir>.\n";
}

# ----------
#  Counting
# ----------

# source = dir
if($#ARGV==1 && -d $ARGV[1])
{
	$sourcedir=$ARGV[1];
	opendir(DIR,$sourcedir) || die "ERROR($0):
	Error (code=$!) in opening Source Directory <$sourcedir>.\n";
	while(defined ($file=readdir DIR))
	{
		next if $file =~ /^\.\.?$/;
		# sub-directories are skipped; only plain files are counted
		if(-f "$sourcedir/$file")
		{
			&runcount("$sourcedir/$file",$destdir);
		}
	}
	closedir DIR;
}
# source is a single file
elsif($#ARGV==1 && -f $ARGV[1])
{
	$source=$ARGV[1];
	if(defined $opt_split)
	{
		system("cp $source $destdir");
		# After the chdir below, only the copies inside DESTINATION are
		# used, so reduce each copied path to its bare file name.
		# Otherwise a SOURCE given with a directory part would not match
		# the split file names, and the cleanup would delete the
		# original SOURCE instead of the copy.
		$source=~s/.*\///;
		if(defined $opt_token)
		{
			system("cp $opt_token $destdir");
			$opt_token=~s/.*\///;
		}
		if(defined $opt_nontoken)
		{
			system("cp $opt_nontoken $destdir");
			$opt_nontoken=~s/.*\///;
		}
		if(defined $opt_stop)
		{
			system("cp $opt_stop $destdir");
			$opt_stop=~s/.*\///;
		}
		chdir $destdir;
		$chdir=1;
		system("split-data.pl --parts $opt_split $source");
		system("/bin/rm -r -f $source");
		opendir(DIR,".") || die "ERROR($0):
        Error (code=$!) in opening Destination Directory <$destdir>.\n";
		while(defined ($file=readdir DIR))
		{
			# run count.pl on each split part, skipping count outputs
			if($file=~/^\Q$source\E/ && $file!~/\.bigrams$/)
			{
				&runcount($file,".");
			}
		}
		closedir DIR;
	}
	else
	{
		print STDERR "Warning($0):
	You can run count.pl directly on the single source file if you don't
	want to split the source.\n";
		exit;
	}
}
# source contains multiple files
elsif($#ARGV > 1)
{
	foreach $i (1..$#ARGV)
	{
		if(-f $ARGV[$i])
		{
			&runcount($ARGV[$i],$destdir);
		}
		else
		{
			print STDERR "ERROR($0):
	ARGV[$i]=$ARGV[$i] should be a plain file.\n";
			exit;
		}
	}
}
# unexpected input
else
{
	&showminimal();
	exit;
}

# --------------------
# Recombining counts
# --------------------

if(!defined $chdir)
{
	chdir $destdir;
}

# current dir is now destdir
opendir(DIR,".") || die "ERROR($0):
        Error (code=$!) in opening Destination Directory <$destdir>.\n";

$output="huge-count.output";
$tempfile="tempfile" . time(). ".tmp";

if(-e $output)
{
	system("/bin/rm -r -f $output");
}

while(defined ($file=readdir DIR))
{
	if($file=~/\.bigrams$/)
	{
		# the first count file seeds the output; the rest are added to it
		if(!-e $output)
		{
			system("cp $file $output");
		}
		else
		{
			system("huge-combine.pl $file $output > $tempfile");
			system("mv $tempfile $output");
		}
	}
}

closedir DIR;

# ---------------------
# Sorting and Removing
# ---------------------

if(defined $opt_remove)
{
	system("sort-bigrams.pl --remove $opt_remove $output > $tempfile");
}
else
{
	if(defined $opt_frequency)
	{
		system("sort-bigrams.pl --frequency $opt_frequency $output > $tempfile");
	}
	else
	{
		system("sort-bigrams.pl $output > $tempfile");
	}
}
system("mv $tempfile $output");

print STDERR "Check the output in $destdir/$output.\n";
exit;

##############################################################################

#                      ==========================
#                          SUBROUTINE SECTION
#                      ==========================

# Runs count.pl on $file and writes the bigram counts to
# $destdir/<name of $file without its path>.bigrams, passing along
# whichever of --newLine, --window, --token, --nontoken and --stop
# were given to huge-count.pl.
sub runcount
{
    my $file=shift;
    my $destdir=shift;
    my $justfile=$file;
    $justfile=~s/.*\/(.+)/$1/;

    # build the option string in the order count.pl was originally called with
    my $options="";
    $options.=" --newLine" if(defined $opt_newLine);
    $options.=" --window $opt_window" if(defined $opt_window);
    $options.=" --token $opt_token" if(defined $opt_token);
    $options.=" --nontoken $opt_nontoken" if(defined $opt_nontoken);
    $options.=" --stop $opt_stop" if(defined $opt_stop);

    system("count.pl$options $destdir/$justfile.bigrams $file");
}


#-----------------------------------------------------------------------------
#show minimal usage message
sub showminimal()
{
        print "Usage: huge-count.pl [OPTIONS] DESTINATION [SOURCE]+";
        print "\nTYPE huge-count.pl --help for help\n";
}

#-----------------------------------------------------------------------------
#show help
sub showhelp()
{
	print "Usage:  huge-count.pl [OPTIONS] DESTINATION [SOURCE]+

Efficiently runs count.pl on huge data.

SOURCE
	Could be a -

		1. single plain file
		2. single flat directory containing multiple plain files
		3. list of plain files

DESTINATION
	Should be a directory where output is written.

OPTIONS:

--split P
	If SOURCE is a single plain file, --split has to be specified to
	split the source file into P parts and to run count.pl separately
	on each part.

--token TOKENFILE
	Specify a file containing Perl regular expressions that define the
	tokenization scheme for counting.

--nontoken NONTOKENFILE
	Specify a file containing Perl regular expressions of non-token
	sequences that are removed prior to tokenization.

--stop STOPFILE
	Specify a file containing Perl regular expressions of stop words
	that are to be removed from the output bigrams.

--window W
	Specify the window size for counting.

--remove L
	Bigrams with counts less than L will be removed from the sample.

--frequency F
	Bigrams with counts less than F will not be displayed.

--newLine
	Prevents bigrams from spanning across the new-line characters.

--help
        Displays this message.

--version
        Displays the version information.

Type 'perldoc huge-count.pl' to view detailed documentation of huge-count.\n";
}

#------------------------------------------------------------------------------
#version information
sub showversion()
{
        print "huge-count.pl      -       Version 0.03\n";
        print "Efficiently runs count.pl on huge data.\n";
        print "Copyright (C) 2004, Amruta Purandare & Ted Pedersen.\n";
        print "Date of Last Update:     03/30/2004\n";
}

#############################################################################

