#!/usr/bin/perl -w

# huge-count3.pl - Counts large numbers of trigrams

eval 'exec /usr/bin/perl -w -S $0 ${1+"$@"}'
    if 0; # not running under some shell

=head1 NAME

huge-count3.pl - Divide huge text into pieces and run count.pl for trigrams separately on each (and then combine)

=head1 SYNOPSIS

Runs count.pl efficiently on huge data.

=head1 USAGE

huge-count3.pl [OPTIONS] DESTINATION [SOURCE]+

=head1 INPUT

=head2 Required Arguments:

=head3 [SOURCE]+

Input to huge-count3.pl should be one of the following:

=over

=item 1. Single plain text file

Or

=item 2. Single flat directory containing multiple plain text files

Or

=item 3. List of multiple plain text files

=back

=head3 DESTINATION

A complete path to a writable directory to which huge-count3.pl can write all
intermediate and final output files. If DESTINATION does not exist, a new
directory is created; otherwise, the existing directory is used for writing
the output files.

NOTE: If DESTINATION already exists and the names of some of the existing
files in DESTINATION clash with the names of the output files created by
huge-count3.pl, these files will be overwritten without prompting the user.

=head2 Optional Arguments:

=head4 --split P

This option must be specified when SOURCE is a single plain file.
huge-count3.pl will divide the given SOURCE file into P (approximately) equal
parts, run count.pl separately on each part, and then recombine the trigram
counts from all these intermediate result files into a single trigram output
that shows trigram counts in SOURCE.

If the SOURCE file contains M lines, each part created with --split P will
contain approximately M/P lines. The value of P should be chosen such that
count.pl can be run efficiently on any part containing M/P lines from SOURCE.
As the number of words per line differs from file to file, it is recommended
that P be large enough that each part contains at most one million words in
total.
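
As a rough sketch of this guideline, the following derives the smallest P
that keeps each part at or below a word limit on average. The helper names
suggest_parts and count_words, and the use of whitespace to delimit words,
are assumptions made for illustration; they are not part of huge-count3.pl.

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Hypothetical helper: smallest P such that total_words / P <= max_words.
sub suggest_parts {
    my ($total_words, $max_words) = @_;
    return 1 if $total_words <= $max_words;
    return ceil($total_words / $max_words);
}

# Hypothetical helper: rough word count of a file, splitting on whitespace.
sub count_words {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "cannot open $path: $!";
    my $words = 0;
    while (my $line = <$fh>) {
        my @tokens = split ' ', $line;
        $words += @tokens;
    }
    close $fh;
    return $words;
}

my $parts = suggest_parts(2_500_000, 1_000_000);
print "$parts\n";   # prints 3
```

In practice the result of count_words on the SOURCE file would be passed to
suggest_parts, and the answer given to --split.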

=head4 --token TOKENFILE

Specify a file containing Perl regular expressions that define the tokenization
scheme for counting. This will be provided to count.pl's --token option.

=head4 --nontoken NOTOKENFILE

Specify a file containing Perl regular expressions of non-token sequences
that are removed prior to tokenization. This will be provided to
count.pl's --nontoken option.

=head4 --stop STOPFILE

Specify a file of Perl regular expressions containing the list of stop words
to be omitted from the output trigrams. The stop list can be used in two modes:

AND mode, declared with '@stop.mode = AND' on the first line of the STOPFILE

or

OR mode, declared with '@stop.mode = OR' on the first line of the STOPFILE.

In AND mode, trigrams whose constituent words are all stop words are removed,
while in OR mode, trigrams in which at least one constituent word is a stop
word are removed from the output.
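
The difference between the two modes can be sketched as follows. The function
is_stopped and the already-parsed list of stop patterns are hypothetical and
only illustrate the rule described above; this is not count.pl's own code.

```perl
use strict;
use warnings;

# Returns true if the trigram should be dropped under the given stop mode.
# $stop_regexes is a hypothetical, already-parsed list of stop patterns.
sub is_stopped {
    my ($mode, $stop_regexes, @words) = @_;
    my $hits = 0;
    foreach my $word (@words) {
        # count the word once if any stop pattern matches it
        $hits++ if grep { $word =~ $_ } @$stop_regexes;
    }
    # AND: every word is a stop word; OR: at least one word is a stop word
    return $mode eq 'AND' ? ($hits == @words) : ($hits > 0);
}

my @stop = (qr/^the$/, qr/^of$/, qr/^a$/);
my $and  = is_stopped('AND', \@stop, qw(the king of)) ? "removed" : "kept";
my $or   = is_stopped('OR',  \@stop, qw(the king of)) ? "removed" : "kept";
print "AND: $and, OR: $or\n";   # AND: kept, OR: removed
```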

=head4 --window W

Tokens appearing within a window of W consecutive positions (so a trigram has
at most W-3 intervening words) will form trigrams. Same as count.pl's --window
option.
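
A minimal sketch of windowed trigram formation follows. It assumes that each
token anchors a window of W tokens and is paired, in order, with any two of
the tokens that follow it inside that window. This is an assumed model of
--window written for illustration; window_trigrams is a hypothetical name and
the code is not taken from count.pl.

```perl
use strict;
use warnings;

# Trigrams anchored at each token, choosing two later tokens (in order)
# from the W-1 positions that follow it. Assumed model of --window.
sub window_trigrams {
    my ($window, @tokens) = @_;
    my @trigrams;
    for my $i (0 .. $#tokens) {
        # last index still inside the window that starts at $i
        my $last = $i + $window - 1;
        $last = $#tokens if $last > $#tokens;
        for my $j ($i + 1 .. $last) {
            for my $k ($j + 1 .. $last) {
                push @trigrams, join('<>', @tokens[$i, $j, $k]);
            }
        }
    }
    return @trigrams;
}

my @found = window_trigrams(4, qw(a b c d));
print "$_\n" for @found;
```

With W equal to 3 the same function yields only contiguous trigrams.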

=head4 --remove L

Trigrams with counts less than L in the entire SOURCE data are removed from
the sample. The counts of the removed trigrams are not counted in any
marginal totals. This has the same effect as count.pl's --remove option.

=head4 --frequency F

Trigrams with counts less than F in the entire SOURCE are not displayed.
The counts of the skipped trigrams ARE counted in the marginal totals. In other
words, --frequency in huge-count3.pl has the same effect as count.pl's
--frequency option. The --remove and --frequency options cannot be used
together.
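
The contrast between the two cutoffs can be illustrated with a toy table that
tracks only the overall total. Real count.pl output also carries marginal
counts for each trigram, which are omitted here; apply_cutoff and the sample
data are hypothetical.

```perl
use strict;
use warnings;

# Toy trigram counts (made-up data for illustration).
my %count = ('a<>b<>c' => 5, 'b<>c<>d' => 2, 'c<>d<>e' => 1);

# Returns (total, kept-trigrams hashref) for a cutoff applied in the given mode.
sub apply_cutoff {
    my ($mode, $cutoff, %counts) = @_;
    my %kept = map { $_ => $counts{$_} }
               grep { $counts{$_} >= $cutoff } keys %counts;
    # remove: total recomputed from the survivors only
    # frequency: total still includes the hidden trigrams
    my $source = $mode eq 'remove' ? \%kept : \%counts;
    my $total = 0;
    $total += $_ for values %$source;
    return ($total, \%kept);
}

my ($removed_total)   = apply_cutoff('remove',    2, %count);
my ($frequency_total) = apply_cutoff('frequency', 2, %count);
print "remove: $removed_total, frequency: $frequency_total\n";
# prints "remove: 7, frequency: 8"
```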

=head4 --newLine

Switches ON the --newLine option in count.pl. This will prevent trigrams from
spanning across lines.

=head3 Other Options:

=head4 --help

Displays the help message.

=head4 --version

Displays the version information.

=head1 PROGRAM LOGIC

=head2 STEP 1

 # create the output directory if it does not already exist
 if (!-e DESTINATION) {
     mkdir DESTINATION;
 }

=head2 STEP 2

=over 4

=item 1. If SOURCE is a single plain file -

Split SOURCE into P smaller files (as specified by --split P).
These files are created in the DESTINATION directory and their names are
formatted as SOURCE1, SOURCE2, ... SOURCEP.

Run count.pl on each of the P smaller files. The count outputs are also
created in DESTINATION and their names are formatted as SOURCE1.trigrams,
SOURCE2.trigrams, ... SOURCEP.trigrams.

=item 2. If SOURCE is a single flat directory containing multiple plain files -

count.pl is run on each file present in the SOURCE directory. All files in
SOURCE are treated as data files. If SOURCE contains sub-directories,
these are simply skipped. Intermediate trigram outputs are written in
DESTINATION.

=item 3. If SOURCE is a list of multiple plain files -

If more than two arguments are given, all arguments after the first are
treated as SOURCE file names. count.pl is run separately on each of the SOURCE
files specified by argv[1], argv[2], ... argv[n] (skipping argv[0], which
should be DESTINATION). Intermediate results are created in DESTINATION.

Files specified in the list of SOURCE should be relatively small
plain files with fewer than 1,000,000 words each.

=back

In summary, a large data file can be provided to huge-count3.pl in the form of

a. A single plain file (along with --split P)

b. A directory containing several plain files

c. Multiple plain files directly specified as command line arguments

In all these cases, count.pl is run separately on the SOURCE files or on parts
of the SOURCE file, and intermediate results are written in the DESTINATION
directory.
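
The three input forms handled in STEP 2 can be told apart with Perl's file
tests, roughly as sketched below. The function source_kind is a hypothetical
name used for illustration and does not appear in the script.

```perl
use strict;
use warnings;

# Classify the SOURCE arguments (everything after DESTINATION).
sub source_kind {
    my (@sources) = @_;
    return 'none'      if @sources == 0;
    return 'list'      if @sources > 1;       # several plain files
    return 'directory' if -d $sources[0];     # one flat directory
    return 'file'      if -f $sources[0];     # one plain file, needs --split
    return 'unknown';
}

my $kind = source_kind('.');
print "$kind\n";   # prints "directory"
```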

=head2 STEP 3

Intermediate count results created in STEP 2 are recombined in a pair-wise
fashion such that for P separate count output files, C1, C2, C3, ... CP,

C1 and C2 are first recombined and the result is written to huge-count3.output.

Counts from each of C3, C4, ... CP are then combined (added) into
huge-count3.output. Each time two files are recombined, the smaller of the
two is the one that is loaded.
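
The running recombination can be sketched with in-memory hashes, as below.
This only adds trigram counts; the real recombination must also update the
marginal counts stored with each trigram, which is not shown. The function
combine_counts and the sample tables are hypothetical.

```perl
use strict;
use warnings;

# Add the counts from one trigram table into a running total.
sub combine_counts {
    my ($running, $next) = @_;
    $running->{$_} += $next->{$_} for keys %$next;
    return $running;
}

# Made-up intermediate results standing in for C1, C2 and C3.
my @parts = (
    { 'a<>b<>c' => 2, 'b<>c<>d' => 1 },
    { 'a<>b<>c' => 1 },
    { 'c<>d<>e' => 4 },
);

my $total = {};
combine_counts($total, $_) for @parts;
print "$_ $total->{$_}\n" for sort keys %$total;
```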

=head2 STEP 4

After all files are recombined, the resultant huge-count3.output is sorted
in descending order of trigram counts. If --remove is specified, trigrams
with counts less than the specified value are removed from the sample in the
final huge-count3.output file, and their counts are deleted from the marginal
totals. If --frequency is specified, trigrams with counts less than the
specified value are simply omitted from the output.
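
Sorting by descending count can be sketched as follows. The alphabetical
tie-break is an assumption added here to make the order deterministic; the
sample counts are made up.

```perl
use strict;
use warnings;

my %count = ('a<>b<>c' => 3, 'b<>c<>d' => 1, 'c<>d<>e' => 4);

# Descending by count; ties broken alphabetically (an assumption of this sketch).
my @ordered = sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count;
print "$_ $count{$_}\n" for @ordered;
```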

=head1 OUTPUT

After huge-count3.pl finishes successfully, DESTINATION will contain -

=over

=item * Intermediate trigram count files (*.trigrams) created for each of the
given SOURCE files or split parts of the SOURCE file.

=item * Final trigram count file (huge-count3.output) showing trigram counts in
the entire SOURCE.

=back

=head1 BUGS

huge-count3.pl doesn't consider trigrams at file boundaries. In other words,
the results of count.pl and huge-count3.pl on the same data file will
differ if --newLine is not used, because huge-count3.pl runs count.pl
on multiple files separately and thus loses track of the trigrams
that span file boundaries. With --window not specified, the two contiguous
trigrams that straddle each file boundary are lost; with --window W, more
are lost.

The functionality of huge-count3.pl is the same as that of count.pl only if
--newLine is used and all files start and end on sentence boundaries. In other
words, no sentence should be broken across the start or end of any file given
to huge-count3.pl.
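
The boundary loss can be checked with simple arithmetic: a sequence of n
tokens has n-2 contiguous trigrams, so cutting it in two drops the trigrams
that straddle the cut. The sketch below assumes the default window of 3;
trigram_total is a hypothetical helper.

```perl
use strict;
use warnings;

# Number of contiguous trigrams in a token sequence (default window of 3).
sub trigram_total {
    my (@tokens) = @_;
    return @tokens >= 3 ? @tokens - 2 : 0;
}

my @tokens = qw(a b c d e f);
my $whole  = trigram_total(@tokens);
my $split  = trigram_total(@tokens[0 .. 2]) + trigram_total(@tokens[3 .. 5]);
print "whole: $whole, split: $split\n";   # prints "whole: 4, split: 2"
```

The two missing trigrams are the ones that contain tokens from both halves.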

=head1 AUTHOR

Amruta Purandare, Ted Pedersen.
University of Minnesota at Duluth.

=head1 COPYRIGHT

Copyright (c) 2004, 2009

Amruta Purandare, University of Minnesota, Duluth.
pura0010@umn.edu

Ted Pedersen, University of Minnesota, Duluth.
tpederse@umn.edu

Cyrus Shaoul, University of Alberta, Edmonton
cyrus.shaoul@ualberta.ca

This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation; either version 2 of the License, or (at your option) any later
version.

This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with
this program; if not, write to

The Free Software Foundation, Inc.,
59 Temple Place - Suite 330,
Boston, MA  02111-1307, USA.

=cut

###############################################################################


#$0 contains the program name along with
#the complete path. Extract just the program
#name and use in error messages
$0=~s/.*\/(.+)/$1/;

###############################################################################

#                           ================================
#                            COMMAND LINE OPTIONS AND USAGE
#                           ================================

# command line options
use Getopt::Long;
GetOptions ("help","version","token=s","nontoken=s","remove=i","window=i","stop=s","split=i","frequency=i","newLine");
# show help option
if(defined $opt_help)
{
        $opt_help=1;
        &showhelp();
        exit;
}

# show version information
if(defined $opt_version)
{
        $opt_version=1;
        &showversion();
        exit;
}


# show minimal usage message if fewer arguments
if($#ARGV<1)
{
        &showminimal();
        exit;
}

# --remove and --frequency are mutually exclusive
if(defined $opt_frequency && defined $opt_remove)
{
	print STDERR "ERROR($0):
	Options --remove and --frequency can't be used together.\n";
	exit 1;
}

#############################################################################

#			========================
#			      CODE SECTION
#			========================

#accept the destination dir name
$destdir=$ARGV[0];
if(-e $destdir)
{
	if(!-d $destdir)
	{
		print STDERR "ERROR($0):
	$destdir is not a directory.\n";
		exit 1;
	}
}
else
{
	# create the destination directory, stopping if that fails
	mkdir($destdir) || die "ERROR($0):
	Error (code=$!) in creating Destination Directory <$destdir>.\n";
}

# ----------
#  Counting
# ----------

# source = dir
if($#ARGV==1 && -d $ARGV[1])
{
	$sourcedir=$ARGV[1];
	opendir(DIR,$sourcedir) || die "ERROR($0):
	Error (code=$!) in opening Source Directory <$sourcedir>.\n";
	while(defined ($file=readdir DIR))
	{
		next if $file =~ /^\.\.?$/;
		# only plain files are counted; sub-directories are skipped
		if(-f "$sourcedir/$file")
		{
			&runcount("$sourcedir/$file",$destdir);
		}
	}
	closedir DIR;
}
# source is a single file
elsif($#ARGV==1 && -f $ARGV[1])
{
	$source=$ARGV[1];
	if(defined $opt_split)
	{
		# copy the source and any option files into the destination dir
		system("cp $source $destdir");
		if(defined $opt_token)
		{
			system("cp $opt_token $destdir");
		}
		if(defined $opt_nontoken)
		{
			system("cp $opt_nontoken $destdir");
		}
		if(defined $opt_stop)
		{
			system("cp $opt_stop $destdir");
		}
		chdir $destdir;
		$chdir=1;
		# split the copy into parts and delete the copy itself
		system("split-data.pl --parts $opt_split $source");
		system("/bin/rm -r -f $source");
		opendir(DIR,".") || die "ERROR($0):
        Error (code=$!) in opening Destination Directory <$destdir>.\n";
		while(defined ($file=readdir DIR))
		{
			# \Q...\E matches the source file name literally
			if($file=~/\Q$source\E/ && $file!~/\.trigrams/)
			{
				&runcount($file,".");
			}
		}
		closedir DIR;
	}
	else
	{
		print STDERR "Warning($0):
	You can run count.pl directly on the single source file if you don't
	want to split the source.\n";
		exit;
	}
}
# source contains multiple files
elsif($#ARGV > 1)
{
	foreach $i (1..$#ARGV)
	{
		if(-f $ARGV[$i])
		{
			&runcount($ARGV[$i],$destdir);
		}
		else
		{
			print STDERR "ERROR($0):
	ARGV[$i]=$ARGV[$i] should be a plain file.\n";
			exit 1;
		}
	}
}
# unexpected input
else
{
	&showminimal();
	exit;
}

# --------------------
# Recombining counts
# --------------------

if(!defined $chdir)
{
	chdir $destdir;
}

# current dir is now destdir
opendir(DIR,".") || die "ERROR($0):
        Error (code=$!) in opening Destination Directory <$destdir>.\n";

$output="huge-count3.output";
$tempfile="tempfile" . time(). ".tmp";

# start from a clean output file
if(-e $output)
{
	system("/bin/rm -r -f $output");
}

while(defined ($file=readdir DIR))
{
	if($file=~/\.trigrams$/)
	{
		# the first intermediate file seeds the output
		if(!-e $output)
		{
			system("cp $file $output");
		}
		# every later file is added into the running output
		else
		{
			system("huge-combine3.pl $file $output > $tempfile");
			system("mv $tempfile $output");
		}
	}
}

closedir DIR;

# ---------------------
# Sorting and Removing
# ---------------------

if(defined $opt_remove)
{
	system("sort-trigrams.pl --remove $opt_remove $output > $tempfile");
}
elsif(defined $opt_frequency)
{
	system("sort-trigrams.pl --frequency $opt_frequency $output > $tempfile");
}
else
{
	system("sort-trigrams.pl $output > $tempfile");
}
system("mv $tempfile $output");

print STDERR "Check the output in $destdir/$output.\n";
exit;

##############################################################################

#                      ==========================
#                          SUBROUTINE SECTION
#                      ==========================

# runs count.pl for trigrams on one file and writes the result to
# <destdir>/<file name>.trigrams
sub runcount
{
    my ($file, $destdir) = @_;

    # option variables are package globals set by Getopt::Long
    our ($opt_newLine, $opt_window, $opt_token, $opt_nontoken, $opt_stop);

    # strip any leading directory from the input file name
    my $justfile=$file;
    $justfile=~s/.*\/(.+)/$1/;

    # build the count.pl command, passing on only the options that were given
    my $command="count.pl --ngram 3";
    $command.=" --newLine"                 if defined $opt_newLine;
    $command.=" --window $opt_window"      if defined $opt_window;
    $command.=" --token $opt_token"        if defined $opt_token;
    $command.=" --nontoken $opt_nontoken"  if defined $opt_nontoken;
    $command.=" --stop $opt_stop"          if defined $opt_stop;
    $command.=" $destdir/$justfile.trigrams $file";

    system($command);
}


#-----------------------------------------------------------------------------
#show minimal usage message
sub showminimal()
{
        print "Usage: huge-count3.pl [OPTIONS] DESTINATION [SOURCE]+";
        print "\nTYPE huge-count3.pl --help for help\n";
}

#-----------------------------------------------------------------------------
#show help
sub showhelp()
{
	print "Usage:  huge-count3.pl [OPTIONS] DESTINATION [SOURCE]+

Efficiently runs count.pl for trigrams on huge data.

SOURCE
	Can be one of -

		1. single plain file
		2. single flat directory containing multiple plain files
		3. list of plain files

DESTINATION
	Should be a directory where output is written.

OPTIONS:

--split P
	If SOURCE is a single plain file, --split has to be specified to
	split the source file into P parts and to run count.pl separately
	on each part.

--token TOKENFILE
	Specify a file containing Perl regular expressions that define the
	tokenization scheme for counting.

--nontoken NOTOKENFILE
	Specify a file containing Perl regular expressions of non-token
	sequences that are removed prior to tokenization.

--stop STOPFILE
	Specify a file containing Perl regular expressions of stop words
	that are to be removed from the output trigrams.

--window W
	Specify the window size for counting.

--remove L
	Trigrams with counts less than L will be removed from the sample.

--frequency F
	Trigrams with counts less than F will not be displayed.

--newLine
	Prevents trigrams from spanning across the new-line characters.

--help
        Displays this message.

--version
        Displays the version information.

Type 'perldoc huge-count3.pl' to view detailed documentation of huge-count3.\n";
}

#------------------------------------------------------------------------------
#version information
sub showversion()
{
        print "huge-count3.pl      -       Version 0.03\n";
        print "Efficiently runs count.pl on huge data.\n";
        print "Copyright (C) 2004, Amruta Purandare & Ted Pedersen.\n";
        print "Date of Last Update:     03/30/2004\n";
}

#############################################################################