#!/usr/local/bin/perl -w

=head1 NAME

huge-count.pl - Count all the bigrams in a huge text without using huge amounts of memory.

=head1 SYNOPSIS

huge-count.pl --tokenlist --split 100 destination-dir input

=head1 DESCRIPTION

Runs count.pl efficiently on large amounts of data by splitting the data into
separate files, counting each file separately, and then merging the counts to
get overall results.

Two output files are created. destination-dir/huge-count.output contains
the bigram counts after applying --uremove and --remove.
destination-dir/complete-huge-count.output provides the bigram counts as
if no --uremove or --remove cutoff were provided.

=head1 USAGE

huge-count.pl [OPTIONS] DESTINATION [SOURCE]+

=head1 INPUT

=head2 Required Arguments:

=head3 [SOURCE]+

Input to huge-count.pl should be one of the following:

=over

=item 1. Single plain text file

Or

=item 2. Single flat directory containing multiple plain text files

Or

=item 3. List of multiple plain text files

=back

=head3 DESTINATION

A complete path to a writable directory to which huge-count.pl can write all
intermediate and final output files. If DESTINATION does not exist,
a new directory is created; otherwise, the existing directory is simply used
for writing the output files.

NOTE: If DESTINATION already exists and the names of some of the existing
files in DESTINATION clash with the names of the output files created by
huge-count, these files will be overwritten without prompting the user.

=head3 --tokenlist

This parameter is required. huge-count will call count.pl and print out all
the bigrams that count.pl finds.

=head2 Optional Arguments:

=head4 --split N

This parameter is required.
huge-count will divide the bigram token list generated by count.pl into
parts, sort each part, and recombine the bigram counts from all these
intermediate result files into a single bigram output that shows the bigram
counts in SOURCE.

Each part created with --split N will contain N lines. The value of N should
be chosen such that huge-sort.pl can be run efficiently on any part containing
N lines from the file that contains all the bigrams.

We suggest that N be equal to the number of KB of memory you have. If the
computer has 8 GB RAM, which is 8,000,000 KB, N should be set to 8000000. If
N is set too small, the split output file suffixes will be exhausted.

=head4 --token TOKENFILE

Specify a file containing Perl regular expressions that define the tokenization
scheme for counting. This will be provided to count.pl's --token option.

=head4 --nontoken NOTOKENFILE

Specify a file containing Perl regular expressions of non-token sequences
that are removed prior to tokenization. This will be provided to
count.pl's --nontoken option.

=head4 --stop STOPFILE

Specify a file of Perl regexes containing the list of stop words to be
omitted from the output BIGRAMS. The stop list can be used in two modes -

AND mode, declared with '@stop.mode = AND' on the 1st line of the STOPFILE

or

OR mode, declared using '@stop.mode = OR' on the 1st line of the STOPFILE.

In AND mode, bigrams in which both constituent words are stop words are
removed, while in OR mode, bigrams in which either or both constituent words
are stop words are removed from the output.

=head4 --window W

Tokens appearing within W positions of each other (with at most W-2
intervening words) will form bigrams. Same as count.pl's --window option.

=head4 --remove L

Bigrams with counts less than L in the entire SOURCE data are removed from
the sample. The counts of the removed bigrams are not counted in any
marginal totals.
This has the same effect as count.pl's --remove option.

=head4 --uremove L

Bigrams with counts more than L in the entire SOURCE data are removed from
the sample. The counts of the removed bigrams are not counted in any
marginal totals. This has the same effect as count.pl's --uremove option.

=head4 --frequency F

Bigrams with counts less than F in the entire SOURCE are not displayed.
The counts of the skipped bigrams ARE counted in the marginal totals. In other
words, --frequency in huge-count.pl has the same effect as count.pl's
--frequency option.

=head4 --ufrequency F

Bigrams with counts more than F in the entire SOURCE are not displayed.
The counts of the skipped bigrams ARE counted in the marginal totals. In other
words, --ufrequency in huge-count.pl has the same effect as count.pl's
--ufrequency option.

=head4 --newLine

Switches ON the --newLine option in count.pl. This will prevent bigrams from
spanning across lines.

=head3 Other Options :

=head4 --help

Displays this message.

=head4 --version

Displays the version information.

=head1 PROGRAM LOGIC

=over

=item * STEP 1

 # create output dir
 if(!-e DESTINATION) then
 	mkdir DESTINATION;

=item * STEP 2

=over 3

=item 1. If SOURCE is a single plain file -

huge-count.pl with the --tokenlist option calls count.pl on the single
plain file and prints all bigrams into one file. The count outputs are
also created in DESTINATION.

=item 2. SOURCE is a single flat directory containing multiple plain files -

huge-count.pl with the --tokenlist option calls count.pl on each file
present in the SOURCE directory. All files in SOURCE are treated as
data files. If SOURCE contains sub-directories, these are simply skipped.
Intermediate bigram outputs are written in DESTINATION.

=item 3.
SOURCE is a list of multiple plain files -

If #arg > 2, all arguments specified after the first argument are considered
to be the SOURCE file names. count.pl is run separately on each of the SOURCE
files specified by argv[1], argv[2], ... argv[n] (skipping argv[0], which
should be DESTINATION). Intermediate results are created in DESTINATION.

=back

In summary, a large data file can be provided to huge-count in the form of

a. A single plain file

b. A directory containing several plain files

c. Multiple plain files directly specified as command line arguments

In all these cases, count.pl with --tokenlist is run separately on the SOURCE
files or parts of the SOURCE file, and intermediate results are written to the
DESTINATION dir.

=back

=over

=item * STEP 3

Split the output file generated by count.pl with --tokenlist into smaller
files of N bigrams each.

=item * STEP 4

huge-sort.pl counts the unique bigrams and sorts them in alphabetical order.

=item * STEP 5

huge-merge.pl merges the bigrams from each sorted bigrams file.

=back

=head1 OUTPUT

After huge-count finishes successfully, DESTINATION will contain -

=over

=item * Final bigram count file (huge-count.output) showing bigram counts in
the entire SOURCE after --remove and --uremove are applied.

=item * Final bigram count file (complete-huge-count.output) showing
bigram counts in the entire SOURCE without --remove and --uremove.

=back

=head1 BUGS

huge-count.pl doesn't consider bigrams at file boundaries. In other words,
the results of count.pl and huge-count.pl on the same data file will
differ if --newLine is not used, in that huge-count.pl runs count.pl
on multiple files separately and thus loses track of the bigrams
on file boundaries.
With --window not specified, one bigram is lost
at each file boundary, while W bigrams are lost with --window W.

The functionality of huge-count with --tokenlist is the same as that of
count.pl only if --newLine is used and all files start and end on sentence
boundaries. In other words, no sentence should be split across the start or
end of any file given to huge-count.

=head1 AUTHOR

Amruta Purandare, University of Minnesota, Duluth

Ted Pedersen, University of Minnesota, Duluth
tpederse at umn.edu

Ying Liu, University of Minnesota, Twin Cities
liux0395 at umn.edu

=head1 COPYRIGHT

Copyright (c) 2004-2010, Amruta Purandare, Ted Pedersen, and Ying Liu

This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation; either version 2 of the License, or (at your option) any later
version.

This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with
this program; if not, write to

The Free Software Foundation, Inc.,
59 Temple Place - Suite 330,
Boston, MA 02111-1307, USA.

=cut

###############################################################################


#$0 contains the program name along with
#the complete path.
#Extract just the program
#name and use in error messages
$0=~s/.*\/(.+)/$1/;

###############################################################################

#                       ================================
#                        COMMAND LINE OPTIONS AND USAGE
#                       ================================

# command line options
use Cwd;
use Getopt::Long;
GetOptions ("help","version","tokenlist","token=s","nontoken=s","remove=i","uremove=i",
            "window=i","stop=s","split=i","frequency=i","ufrequency=i","newLine");

# show help option
if(defined $opt_help)
{
	$opt_help=1;
	&showhelp();
	exit;
}


# make sure tokenlist is used in huge-count.pl
if (!defined $opt_tokenlist)
{
	print "--tokenlist is required!\n";
	print STDERR "Type huge-count.pl --help for help.\n";
	exit;
}

# the lower cutoff must not exceed the upper cutoff
if ((defined $opt_remove) and (defined $opt_uremove))
{
	if ($opt_remove > $opt_uremove)
	{
		print "--remove must be smaller than --uremove!\n";
		print STDERR "Type huge-count.pl --help for help.\n";
		exit;
	}
}

if ((defined $opt_frequency) and (defined $opt_ufrequency))
{
	if ($opt_frequency > $opt_ufrequency)
	{
		print "--frequency must be smaller than --ufrequency!\n";
		print STDERR "Type huge-count.pl --help for help.\n";
		exit;
	}
}

# show version information
if(defined $opt_version)
{
	$opt_version=1;
	&showversion();
	exit;
}

# show minimal usage message if fewer arguments
if($#ARGV<1)
{
	&showminimal();
	exit;
}


#############################################################################

#                       ========================
#                             CODE SECTION
#                       ========================

#accept the destination dir name
my $current_dir = getcwd;

$destdir=$ARGV[0];
if(-e $destdir)
{
	if(!-d $destdir)
	{
		print STDERR "ERROR($0):
	$destdir is not a directory.\n";
		exit;
	}
}
else
{
	# create the destination dir with Perl's built-in mkdir
	# instead of shelling out, and stop if it cannot be created
	mkdir($destdir) || die "ERROR($0):
	Error (code=$!) in creating Destination Directory <$destdir>.\n";
}


# ----------
#  Counting
# ----------


# source = dir
if($#ARGV==1 && -d $ARGV[1])
{
	$sourcedir=$ARGV[1];
	opendir(DIR,$sourcedir) || die "ERROR($0):
	Error (code=$!) in opening Source Directory <$sourcedir>.\n";
	while(defined ($file=readdir DIR))
	{
		# skip . and ..
		next if $file =~ /^\.\.?$/;
		# only plain files are counted; sub-directories are skipped
		if(-f "$sourcedir/$file")
		{
			&runcount("$sourcedir/$file",$destdir);
		}
	}
}
# source is a single file
elsif($#ARGV==1 && -f $ARGV[1])
{
	$source=$ARGV[1];

	# copy the source and any option files into destdir,
	# then run count.pl from inside destdir
	system("cp $source $destdir");
	if(defined $opt_token)
	{
		system("cp $opt_token $destdir");
	}
	if(defined $opt_nontoken)
	{
		system("cp $opt_nontoken $destdir");
	}
	if(defined $opt_stop)
	{
		system("cp $opt_stop $destdir");
	}
	chdir $destdir;
	$chdir=1;
	&runcount($source,".");

}
# source contains multiple files
elsif($#ARGV > 1)
{
	foreach $i (1..$#ARGV)
	{
		if(-f $ARGV[$i])
		{
			&runcount($ARGV[$i],$destdir);
		}
		else
		{
			print STDERR "ERROR($0):
	ARGV[$i]=$ARGV[$i] should be a plain file.\n";
			exit;
		}
	}
}
# unexpected input
else
{
	&showminimal();
	exit;
}


# --------------------
#  Split bigrams
# --------------------

if(!defined $chdir)
{
	chdir $destdir;
	$chdir = 1;
}
# current dir is now destdir
opendir(DIR,".") || die "ERROR($0):
	Error (code=$!)
in opening Destination Directory <$destdir>.\n";

if (defined $opt_split)
{
	print "split the bigrams files...\n";
	while(defined ($file = readdir DIR))
	{
		if($file=~/\.bigrams$/)
		{
			system("huge-split.pl --split $opt_split $file");
			system("/bin/rm $file");
		}
	}
}
else
{
	print STDERR "Warning($0): You can run huge-sort.pl directly on the \n";
	print STDERR "single tokenlist file if you don't want to split the tokenlist.\n";
}

# --------------------
#  Sort bigrams
# --------------------

if (defined $opt_tokenlist)
{
	print "sort the bigrams files...\n";
	if(!defined $chdir)
	{
		chdir $destdir;
		$chdir = 1;
	}
	# current dir is now destdir
	opendir(DIR,".") || die "ERROR($0):
	Error (code=$!) in opening Destination Directory <$destdir>.\n";

	while(defined ($file = readdir DIR))
	{
		# sort every bigrams file that has not been sorted yet
		if(($file=~/\.bigrams/) and ($file !~ /sorted$/))
		{
			system("huge-sort.pl $file");
		}
	}
}

# --------------------
#  Combine bigrams
# --------------------

print "combine the bigrams files...\n";
if(defined $chdir)
{
	chdir $current_dir;
}

system("huge-merge.pl $destdir");


# --------------------
#  Delete bigrams
# --------------------

print "delete the bigrams ...\n";

# Pass on only the cutoff options that were given, always in the order
# --remove, --uremove, --frequency, --ufrequency.
my @delete_opts = ();
push @delete_opts, "--remove $opt_remove"         if defined $opt_remove;
push @delete_opts, "--uremove $opt_uremove"       if defined $opt_uremove;
push @delete_opts, "--frequency $opt_frequency"   if defined $opt_frequency;
push @delete_opts, "--ufrequency $opt_ufrequency" if defined $opt_ufrequency;

# huge-delete.pl is run only if at least one cutoff was requested
if (@delete_opts)
{
	system("huge-delete.pl @delete_opts $destdir/merge* $destdir/finalmerge");
}

$output="complete-huge-count.output";
if ((defined $opt_remove ) or (defined $opt_uremove) or (defined $opt_frequency) or (defined $opt_ufrequency))
{
	system("mv $destdir/merge.* $destdir/$output");
	system("mv $destdir/finalmerge $destdir/huge-count.output");
	print STDERR "Check the output in $destdir/huge-count.output\n";
}
else
{
	system("mv $destdir/merge.* $destdir/$output");
	print STDERR "Check the output in $destdir/$output\n";
}

exit;

##############################################################################

#                      ==========================
#                          SUBROUTINE SECTION
#                      ==========================

# run count.pl on a single file and write <file>.bigrams into destdir
sub runcount
{
	my $file=shift;
	my $destdir=shift;
	my $justfile=$file;
	# strip the leading path from the file name
	$justfile=~s/.*\/(.+)/$1/;

	# Build the count.pl option list from the options that were given,
	# always in the order --tokenlist, --newLine, --window, --token,
	# --nontoken, --stop.
	my @count_opts = ();
	push @count_opts, "--tokenlist"              if defined $opt_tokenlist;
	push @count_opts, "--newLine"                if defined $opt_newLine;
	push @count_opts, "--window $opt_window"     if defined $opt_window;
	push @count_opts, "--token $opt_token"       if defined $opt_token;
	push @count_opts, "--nontoken $opt_nontoken" if defined $opt_nontoken;
	push @count_opts, "--stop $opt_stop"         if defined $opt_stop;

	system("count.pl @count_opts $destdir/$justfile.bigrams $file");

} # end of sub runcount()


#-----------------------------------------------------------------------------
#show minimal usage message
sub showminimal()
{
	print "Usage: huge-count.pl --tokenlist [OPTIONS] DESTINATION [SOURCE]+";
	print "\nTYPE huge-count.pl --help for help\n";
}

#-----------------------------------------------------------------------------
#show help
sub showhelp()
{
	print "Usage:  huge-count.pl --tokenlist [OPTIONS] DESTINATION [SOURCE]+

Efficiently runs count.pl on huge data.

SOURCE
	Could be a -

		1. single plain file
		2. single flat directory containing multiple plain files
		3. list of plain files

DESTINATION
	Should be a directory where output is written.

REQUIRED PARAMETERS:

--tokenlist
	This option is required. Prints out the list of all bigrams.

OPTIONS:

--split N
	Number of bigrams in each separated bigrams file.

--token TOKENFILE
	Specify a file containing Perl regular expressions that define the
	tokenization scheme for counting.

--nontoken NOTOKENFILE
	Specify a file containing Perl regular expressions of non-token
	sequences that are removed prior to tokenization.

--stop STOPFILE
	Specify a file containing Perl regular expressions of stop words
	that are to be removed from the output bigrams.

--window W
	Specify the window size for counting.

--remove L
	Bigrams with counts less than L will be removed from the sample.
	remove must be smaller than uremove.

--uremove L
	Bigrams with counts more than L will be removed from the sample.
	uremove must be bigger than remove.

--frequency F
	Bigrams with counts less than F will not be displayed.
	frequency must be smaller than ufrequency.

--ufrequency F
	Bigrams with counts more than F will not be displayed.
	ufrequency must be bigger than frequency.

--newLine
	Prevents bigrams from spanning across the new-line characters.

--help
	Displays this message.

--version
	Displays the version information.

Type 'perldoc huge-count.pl' to view detailed documentation of huge-count.\n";
}

#------------------------------------------------------------------------------
#version information
sub showversion()
{
	print 'huge-count.pl      $Id: huge-count.pl,v 1.26 2011/03/31 23:04:04 tpederse Exp $';
	print "\nEfficiently runs count.pl on huge data.\n";
	print "Copyright (C) 2004-2011, Amruta Purandare, Ted Pedersen & Ying Liu.\n";
}

#############################################################################