#!/usr/bin/perl -w

# huge-count3.pl - Counts large numbers of trigrams

eval 'exec /usr/bin/perl -w -S $0 ${1+"$@"}'
    if 0; # not running under some shell

=head1 NAME

huge-count3.pl - Divide huge text into pieces, run count.pl for trigrams separately on each piece, and then combine the counts

=head1 SYNOPSIS

Runs count.pl efficiently on huge data.

=head1 USAGE

huge-count3.pl [OPTIONS] DESTINATION [SOURCE]+

=head1 INPUT

=head2 Required Arguments:

=head3 [SOURCE]+

Input to huge-count3.pl should be one of the following -

=over

=item 1. Single plain text file

Or

=item 2. Single flat directory containing multiple plain text files

Or

=item 3. List of multiple plain text files

=back

=head3 DESTINATION

A complete path to a writable directory to which huge-count3.pl can write all
intermediate and final output files. If DESTINATION does not exist,
a new directory is created; otherwise, the existing directory is simply used
for writing the output files.

NOTE: If DESTINATION already exists and the names of some of the existing
files in DESTINATION clash with the names of the output files created by
huge-count3.pl, these files will be overwritten without prompting the user.

=head2 Optional Arguments:

=head4 --split P

This option must be specified when SOURCE is a single plain file. huge-count3.pl
will divide the given SOURCE file into P (approximately) equal parts,
run count.pl separately on each part, and then recombine the trigram
counts from all these intermediate result files into a single trigram output
that shows trigram counts in SOURCE.

If the SOURCE file contains M lines, each part created with --split P will
contain approximately M/P lines. The value of P should be chosen such that
count.pl can be efficiently run on any part containing M/P lines from SOURCE.
As the number of words per line differs from file to file, it is recommended
that P be large enough that each part contains at most a million words in total.

=head4 --token TOKENFILE

Specify a file containing Perl regular expressions that define the tokenization
scheme for counting. This will be provided to count.pl's --token option.

=head4 --nontoken NOTOKENFILE

Specify a file containing Perl regular expressions of non-token sequences
that are removed prior to tokenization. This will be provided to
count.pl's --nontoken option.

=head4 --stop STOPFILE

Specify a file of Perl regular expressions containing the list of stop words
to be omitted from the output trigrams. The stop list can be used in two modes -

AND mode, declared with '@stop.mode = AND' on the first line of the STOPFILE

or

OR mode, declared with '@stop.mode = OR' on the first line of the STOPFILE.

In AND mode, trigrams whose constituent words are all stop words are removed,
while in OR mode, trigrams in which one or more constituent words are
stop words are removed from the output.

=head4 --window W

Tokens appearing within W positions from each other (with at most W-2
intervening words) will form trigrams. Same as count.pl's --window option.

=head4 --remove L

Trigrams with counts less than L in the entire SOURCE data are removed from
the sample. The counts of the removed trigrams are not counted in any
marginal totals. This has the same effect as count.pl's --remove option.

=head4 --frequency F

Trigrams with counts less than F in the entire SOURCE are not displayed.
The counts of the skipped trigrams ARE counted in the marginal totals. In other
words, --frequency in huge-count3.pl has the same effect as count.pl's
--frequency option.

=head4 --newLine

Switches ON the --newLine option in count.pl. This prevents trigrams from
spanning across lines.

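
=head4 Example

As an illustration only, the options above might be combined as in the
following invocation. The names outdir and corpus.txt are hypothetical; the
run assumes that count.pl, split-data.pl, huge-combine3.pl and
sort-trigrams.pl can be found on the PATH, since huge-count3.pl calls them
by name.

    # split corpus.txt into 100 parts, keep trigrams within single lines,
    # and hide trigrams seen fewer than 5 times in the whole corpus
    huge-count3.pl --split 100 --newLine --frequency 5 outdir corpus.txt

If the run succeeds, the combined counts should be in
outdir/huge-count3.output. Note that --frequency and --remove cannot be
given together.
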

=head3 Other Options :

=head4 --help

Displays this message.

=head4 --version

Displays the version information.

=head1 PROGRAM LOGIC

=over

=item * STEP 1

    # create output dir
    if(!-e DESTINATION) then
        mkdir DESTINATION;

=item * STEP 2

=over 4

=item 1. If SOURCE is a single plain file -

Split SOURCE into P smaller files (as specified by --split P).
These files are created in the DESTINATION directory and their names are
formatted as SOURCE1, SOURCE2, ... SOURCEP.

Run count.pl on each of the P smaller files. The count outputs are also
created in DESTINATION and their names are formatted as SOURCE1.trigrams,
SOURCE2.trigrams, ... SOURCEP.trigrams.

=item 2. If SOURCE is a single flat directory containing multiple plain files -

count.pl is run on each file present in the SOURCE directory. All files in
SOURCE are treated as data files. If SOURCE contains sub-directories,
these are simply skipped. Intermediate trigram outputs are written to
DESTINATION.

=item 3. If SOURCE is a list of multiple plain files -

If more than two arguments are given, all arguments specified after the first
argument are considered to be SOURCE file names. count.pl is run separately on
each of the SOURCE files specified by argv[1], argv[2], ... argv[n] (skipping
argv[0], which should be DESTINATION). Intermediate results are created in
DESTINATION.

Files specified in the list of SOURCE should be relatively small
plain files with fewer than 1,000,000 words each.

=back

In summary, a large data file can be provided to huge-count3.pl in the form of

a. A single plain file (along with --split P)

b. A directory containing several plain files

c. Multiple plain files directly specified as command line arguments

In all these cases, count.pl is run separately on the SOURCE files or parts of
the SOURCE file, and intermediate results are written to the DESTINATION
directory.

=back

=head2 STEP 3

Intermediate count results created in STEP 2 are recombined in a pair-wise
fashion such that for P separate count output files, C1, C2, C3, ... CP,

C1 and C2 are first recombined and the result is written to huge-count3.output.

Counts from each of C3, C4, ... CP are then combined (added) into
huge-count3.output, and each time while recombining, the smaller of the
two files is always loaded.

=head2 STEP 4

After all files are recombined, the resultant huge-count3.output is sorted
in descending order of the trigram counts. If --remove is specified,
trigrams with counts less than the specified value of --remove in the final
huge-count3.output file are removed from the sample and their counts are
deleted from the marginal totals. If --frequency is specified, trigrams with
counts less than the specified value are simply omitted from the output.

=head1 OUTPUT

After huge-count3.pl finishes successfully, DESTINATION will contain -

=over

=item * Intermediate trigram count files (*.trigrams) created for each of the
given SOURCE files or split parts of the SOURCE file.

=item * Final trigram count file (huge-count3.output) showing trigram counts in
the entire SOURCE.

=back

=head1 BUGS

huge-count3.pl doesn't consider trigrams at file boundaries. In other words,
the results of count.pl and huge-count3.pl on the same data file will
differ if --newLine is not used, because huge-count3.pl runs count.pl
on multiple files separately and thus loses track of the trigrams
on file boundaries.
With --window not specified, one trigram is lost
at each file boundary, while W trigrams are lost with --window W.

The functionality of huge-count3.pl is the same as that of count.pl only if
--newLine is used and all files start and end on sentence boundaries. In other
words, no sentence should be broken across the start or end of any file given
to huge-count3.pl.

=head1 AUTHOR

Amruta Purandare, Ted Pedersen.
University of Minnesota at Duluth.

=head1 COPYRIGHT

Copyright (c) 2004, 2009

Amruta Purandare, University of Minnesota, Duluth.
pura0010@umn.edu

Ted Pedersen, University of Minnesota, Duluth.
tpederse@umn.edu

Cyrus Shaoul, University of Alberta, Edmonton
cyrus.shaoul@ualberta.ca

This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation; either version 2 of the License, or (at your option) any later
version.

This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with
this program; if not, write to

The Free Software Foundation, Inc.,
59 Temple Place - Suite 330,
Boston, MA 02111-1307, USA.

=cut

###############################################################################


#$0 contains the program name along with
#the complete path.
# Extract just the program
#name and use in error messages
$0=~s/.*\/(.+)/$1/;

###############################################################################

#                       ================================
#                        COMMAND LINE OPTIONS AND USAGE
#                       ================================

# command line options
use Getopt::Long;
GetOptions ("help","version","token=s","nontoken=s","remove=i","window=i","stop=s","split=i","frequency=i","newLine");
# show help option
if(defined $opt_help)
{
    $opt_help=1;
    &showhelp();
    exit;
}

# show version information
if(defined $opt_version)
{
    $opt_version=1;
    &showversion();
    exit;
}


# show minimal usage message if fewer arguments
if($#ARGV<1)
{
    &showminimal();
    exit;
}

if(defined $opt_frequency && defined $opt_remove)
{
    print STDERR "ERROR($0):
        Options --remove and --frequency can't be used together.\n";
    exit;
}

#############################################################################

#                       ========================
#                             CODE SECTION
#                       ========================

#accept the destination dir name
$destdir=$ARGV[0];
if(-e $destdir)
{
    if(!-d $destdir)
    {
        print STDERR "ERROR($0):
        $destdir is not a directory.\n";
        exit;
    }
}
else
{
    system("mkdir $destdir");
}

# ----------
#  Counting
# ----------

# source = dir
if($#ARGV==1 && -d $ARGV[1])
{
    $sourcedir=$ARGV[1];
    opendir(DIR,$sourcedir) || die "ERROR($0):
        Error (code=$!)
in opening Source Directory <$sourcedir>.\n";
    while(defined ($file=readdir DIR))
    {
        next if $file =~ /^\.\.?$/;
        if(-f "$sourcedir/$file")
        {
            &runcount("$sourcedir/$file",$destdir);
        }
    }
}
# source is a single file
elsif($#ARGV==1 && -f $ARGV[1])
{
    $source=$ARGV[1];
    if(defined $opt_split)
    {
        system("cp $source $destdir");
        if(defined $opt_token)
        {
            system("cp $opt_token $destdir");
        }
        if(defined $opt_nontoken)
        {
            system("cp $opt_nontoken $destdir");
        }
        if(defined $opt_stop)
        {
            system("cp $opt_stop $destdir");
        }
        chdir $destdir;
        $chdir=1;
        system("split-data.pl --parts $opt_split $source");
        system("/bin/rm -r -f $source");
        opendir(DIR,".") || die "ERROR($0):
        Error (code=$!) in opening Destination Directory <$destdir>.\n";
        while(defined ($file=readdir DIR))
        {
            if($file=~/$source/ && $file!~/\.trigrams/)
            {
                &runcount($file,".");
            }
        }
        # directory handles are closed with closedir, not close
        closedir DIR;
    }
    else
    {
        print STDERR "Warning($0):
        You can run count.pl directly on the single source file if you don't
        want to split the source.\n";
        exit;
    }
}
# source contains multiple files
elsif($#ARGV > 1)
{
    foreach $i (1..$#ARGV)
    {
        if(-f $ARGV[$i])
        {
            &runcount($ARGV[$i],$destdir);
        }
        else
        {
            print STDERR "ERROR($0):
        ARGV[$i]=$ARGV[$i] should be a plain file.\n";
            exit;
        }
    }
}
# unexpected input
else
{
    &showminimal();
    exit;
}

# --------------------
#  Recombining counts
# --------------------

if(!defined $chdir)
{
    chdir $destdir;
}

# current dir is now destdir
opendir(DIR,".") || die "ERROR($0):
        Error (code=$!) in opening Destination Directory <$destdir>.\n";

$output="huge-count3.output";
$tempfile="tempfile" . time().
".tmp";

if(-e $output)
{
    system("/bin/rm -r -f $output");
}

while(defined ($file=readdir DIR))
{
    if($file=~/\.trigrams$/)
    {
        # the first count file seeds the output; later ones are added to it
        if(!-e $output)
        {
            system("cp $file $output");
        }
        else
        {
            system("huge-combine3.pl $file $output > $tempfile");
            system("mv $tempfile $output");
        }
    }
}

# directory handles are closed with closedir, not close
closedir DIR;

# ---------------------
#  Sorting and Removing
# ---------------------

if(defined $opt_remove)
{
    system("sort-trigrams.pl --remove $opt_remove $output > $tempfile");
}
else
{
    if(defined $opt_frequency)
    {
        system("sort-trigrams.pl --frequency $opt_frequency $output > $tempfile");
    }
    else
    {
        system("sort-trigrams.pl $output > $tempfile");
    }
}
system("mv $tempfile $output");

print STDERR "Check the output in $destdir/$output.\n";
exit;

##############################################################################

#                       ==========================
#                          SUBROUTINE SECTION
#                       ==========================

# runs count.pl for trigrams on one file and writes FILE.trigrams in destdir
sub runcount()
{
    my $file=shift;
    my $destdir=shift;
    my $justfile=$file;
    $justfile=~s/.*\/(.+)/$1/;

    # build the count.pl command, adding only the options that were given;
    # the option order is --newLine, --window, --token, --nontoken, --stop
    my $command="count.pl --ngram 3";
    $command.=" --newLine" if(defined $opt_newLine);
    $command.=" --window $opt_window" if(defined $opt_window);
    $command.=" --token $opt_token" if(defined $opt_token);
    $command.=" --nontoken $opt_nontoken" if(defined $opt_nontoken);
    $command.=" --stop $opt_stop" if(defined $opt_stop);

    system("$command $destdir/$justfile.trigrams $file");
}


#-----------------------------------------------------------------------------
#show minimal usage message
sub showminimal()
{
    print "Usage: huge-count3.pl [OPTIONS] DESTINATION [SOURCE]+";
    print "\nTYPE huge-count3.pl --help for help\n";
}

#-----------------------------------------------------------------------------
#show help
sub showhelp()
{
    print "Usage:  huge-count3.pl [OPTIONS] DESTINATION [SOURCE]+

Efficiently runs count.pl for trigrams on huge data.

SOURCE
        Can be one of -

                1. single plain file
                2. single flat directory containing multiple plain files
                3. list of plain files

DESTINATION
        Should be a directory where output is written.

OPTIONS:

--split P
        If SOURCE is a single plain file, --split has to be specified to
        split the source file into P parts and to run count.pl separately
        on each part.

--token TOKENFILE
        Specify a file containing Perl regular expressions that define the
        tokenization scheme for counting.

--nontoken NOTOKENFILE
        Specify a file containing Perl regular expressions of non-token
        sequences that are removed prior to tokenization.

--stop STOPFILE
        Specify a file containing Perl regular expressions of stop words
        that are to be removed from the output trigrams.

--window W
        Specify the window size for counting.

--remove L
        Trigrams with counts less than L will be removed from the sample.

--frequency F
        Trigrams with counts less than F will not be displayed.

--newLine
        Prevents trigrams from spanning across the new-line characters.

--help
        Displays this message.

--version
        Displays the version information.

Type 'perldoc huge-count3.pl' to view detailed documentation of huge-count3.\n";
}

#------------------------------------------------------------------------------
#version information
sub showversion()
{
    print "huge-count3.pl      -       Version 0.03\n";
    print "Efficiently runs count.pl on huge data.\n";
    print "Copyright (C) 2004, Amruta Purandare & Ted Pedersen.\n";
    print "Date of Last Update:     03/30/2004\n";
}

#############################################################################
