#!/usr/local/bin/perl -w

=head1 NAME

huge-count.pl - Divide huge text into pieces and run count.pl separately on each (and then combine)

=head1 SYNOPSIS

Runs count.pl efficiently on huge data.

=head1 USAGE

huge-count.pl [OPTIONS] DESTINATION [SOURCE]+

=head1 INPUT

=head2 Required Arguments:

=head3 [SOURCE]+

Input to huge-count.pl should be one of the following -

=over

=item 1. Single plain text file

Or

=item 2. Single flat directory containing multiple plain text files

Or

=item 3. List of multiple plain text files

=back

=head3 DESTINATION

A complete path to a writable directory to which huge-count.pl can write all
intermediate and final output files. If DESTINATION does not exist,
a new directory is created; otherwise, the existing directory is used
for writing the output files.

NOTE: If DESTINATION already exists and the names of some of the existing
files in DESTINATION clash with the names of the output files created by
huge-count, these files will be overwritten without prompting the user.

=head2 Optional Arguments:

=head4 --split P

This option should be specified when SOURCE is a single plain file. huge-count
will divide the given SOURCE file into P (approximately) equal parts,
will run count.pl separately on each part and will then recombine the bigram
counts from all these intermediate result files into a single bigram output
that shows bigram counts in SOURCE.

If the SOURCE file contains M lines, each part created with --split P will
contain approximately M/P lines. The value of P should be chosen such that
count.pl can be run efficiently on any part containing M/P lines from SOURCE.
Because the number of words per line varies from file to file, it is
recommended that P be large enough for each part to contain at most a million
words in total.

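
The guidance on choosing P can be turned into simple arithmetic. The sketch
below is only an illustration and is not part of huge-count.pl: the helper
name suggest_split, the one-million-word default and the corpus sizes are all
assumptions made for the example.

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Hypothetical helper (not part of huge-count.pl): pick the smallest P such
# that each of the P parts holds at most $max_words words on average.
sub suggest_split {
    my ($total_words, $max_words) = @_;
    $max_words = 1_000_000 unless defined $max_words;
    my $parts = ceil($total_words / $max_words);
    return $parts < 1 ? 1 : $parts;
}

# Assumed example corpus: 2,000,000 lines holding 25,000,000 words.
my ($lines, $words) = (2_000_000, 25_000_000);
my $p = suggest_split($words);
printf "--split %d gives about %d lines per part\n", $p, $lines / $p;
```

For the assumed corpus this suggests --split 25, i.e. roughly 80000 lines per
part. Since words per line vary, the real parts may be somewhat larger or
smaller than the average.
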

=head4 --token TOKENFILE

Specify a file containing Perl regular expressions that define the tokenization
scheme for counting. This will be provided to count.pl's --token option.

=head4 --nontoken NOTOKENFILE

Specify a file containing Perl regular expressions of non-token sequences
that are removed prior to tokenization. This will be provided to
count.pl's --nontoken option.

=head4 --stop STOPFILE

Specify a file of Perl regexes containing the list of stop words to be
omitted from the output BIGRAMS. The stop list can be used in two modes -

AND mode, declared with '@stop.mode = AND' on the first line of the STOPFILE

or

OR mode, declared with '@stop.mode = OR' on the first line of the STOPFILE.

In AND mode, bigrams in which both constituent words are stop words are
removed, while in OR mode, bigrams in which either or both constituent words
are stop words are removed from the output.

=head4 --window W

Tokens appearing within W positions from each other (with at most W-2
intervening words) will form bigrams. Same as count.pl's --window option.

=head4 --remove L

Bigrams with counts less than L in the entire SOURCE data are removed from
the sample. The counts of the removed bigrams are not counted in any
marginal totals. This has the same effect as count.pl's --remove option.

=head4 --frequency F

Bigrams with counts less than F in the entire SOURCE are not displayed.
The counts of the skipped bigrams ARE counted in the marginal totals. In other
words, --frequency in huge-count.pl has the same effect as count.pl's
--frequency option.

=head4 --newLine

Switches ON the --newLine option in count.pl. This will prevent bigrams from
spanning across lines.

=head3 Other Options :

=head4 --help

Displays this message.

=head4 --version

Displays the version information.

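
The difference between the AND and OR stop modes described under --stop can
be sketched as follows. This is only an illustration of the rule as stated
above, not the code used by count.pl: the helper name bigram_is_stopped, the
two stop regexes and the sample bigrams are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper, not taken from count.pl: returns true when the bigram
# ($w1, $w2) would be dropped under the given stop mode ('AND' or 'OR').
sub bigram_is_stopped {
    my ($mode, $w1, $w2, @stop_regexes) = @_;
    # count how many of the two words match at least one stop regex
    my $stopped = 0;
    for my $word ($w1, $w2) {
        $stopped++ if grep { $word =~ $_ } @stop_regexes;
    }
    # AND mode needs both words to be stop words, OR mode needs at least one
    return $mode eq 'AND' ? $stopped == 2 : $stopped >= 1;
}

# Assumed stop list with two entries.
my @stop = (qr/^the$/, qr/^of$/);
for my $pair (['of', 'the'], ['the', 'cat'], ['black', 'cat']) {
    my ($w1, $w2) = @$pair;
    printf "%-10s AND:%s OR:%s\n", "$w1 $w2",
        (bigram_is_stopped('AND', $w1, $w2, @stop) ? 'removed' : 'kept'),
        (bigram_is_stopped('OR',  $w1, $w2, @stop) ? 'removed' : 'kept');
}
```

With this stop list, "of the" is removed in both modes, "the cat" is removed
only in OR mode, and "black cat" is kept in both modes.
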

=head1 PROGRAM LOGIC

=over

=item * STEP 1

 # create output dir
 if(!-e DESTINATION) then
     mkdir DESTINATION;

=item * STEP 2

=over 4

=item 1. If SOURCE is a single plain file -

Split SOURCE into P smaller files (as specified by --split P).
These files are created in the DESTINATION directory and their names are
formatted as SOURCE1, SOURCE2, ... SOURCEP.

Run count.pl on each of the P smaller files. The count outputs are also
created in DESTINATION and their names are formatted as SOURCE1.bigrams,
SOURCE2.bigrams, .... SOURCEP.bigrams.

=item 2. SOURCE is a single flat directory containing multiple plain files -

count.pl is run on each file present in the SOURCE directory. All files in
SOURCE are treated as data files. If SOURCE contains sub-directories,
these are simply skipped. Intermediate bigram outputs are written in
DESTINATION.

=item 3. SOURCE is a list of multiple plain files -

If more than two arguments are given, all arguments specified after the first
argument are considered as the SOURCE file names. count.pl is run separately
on each of the SOURCE files specified by argv[1], argv[2], ... argv[n]
(skipping argv[0], which should be DESTINATION). Intermediate results are
created in DESTINATION.

Files specified in the list of SOURCE should be relatively small
plain files containing fewer than 1,000,000 words each.

=back

In summary, a large data file can be provided to huge-count in the form of

a. A single plain file (along with --split P)

b. A directory containing several plain files

c. Multiple plain files directly specified as command line arguments

In all these cases, count.pl is run separately on the SOURCE files or parts of
the SOURCE file and intermediate results are written in the DESTINATION dir.

=back

=head2 STEP 3

Intermediate count results created in STEP 2 are recombined in a pair-wise
fashion such that for P separate count output files, C1, C2, C3 ... , CP,

C1 and C2 are first recombined and the result is written to huge-count.output.

Counts from each of C3, C4, ... CP are then combined (added) to
huge-count.output, and each time, while recombining, the smaller of the
two files is always the one that is loaded.

=head2 STEP 4

After all files are recombined, the resultant huge-count.output is sorted
in descending order of the bigram counts. If --remove is specified,
bigrams with counts less than the specified value of --remove in the final
huge-count.output file are removed from the sample and their counts are
deleted from the marginal totals. If --frequency is specified, bigrams with
counts less than the specified value are simply omitted from the output.

=head1 OUTPUT

After huge-count finishes successfully, DESTINATION will contain -

=over

=item * Intermediate bigram count files (*.bigrams) created for each of the
given SOURCE files or split parts of the SOURCE file.

=item * Final bigram count file (huge-count.output) showing bigram counts in
the entire SOURCE.

=back

=head1 BUGS

huge-count.pl doesn't consider bigrams at file boundaries. In other words,
the results of count.pl and huge-count.pl on the same data file will
differ if --newLine is not used, because huge-count.pl runs count.pl
on multiple files separately and thus loses track of the bigrams
at file boundaries. With --window not specified, one bigram is lost
at each file boundary, while with --window W, W bigrams are lost.

The functionality of huge-count is the same as that of count only if --newLine
is used and all files start and end on sentence boundaries.
In other words, no sentence should be broken at the start or end of any file
given to huge-count.

=head1 AUTHOR

Amruta Purandare, Ted Pedersen.
University of Minnesota at Duluth.

=head1 COPYRIGHT

Copyright (c) 2004,

Amruta Purandare, University of Minnesota, Duluth.
pura0010@umn.edu

Ted Pedersen, University of Minnesota, Duluth.
tpederse@umn.edu

This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation; either version 2 of the License, or (at your option) any later
version.

This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with
this program; if not, write to

The Free Software Foundation, Inc.,
59 Temple Place - Suite 330,
Boston, MA  02111-1307, USA.

=cut

###############################################################################


# $0 contains the program name along with the complete path.
# Extract just the program name and use it in error messages.
$0=~s/.*\/(.+)/$1/;

###############################################################################

#                       ================================
#                        COMMAND LINE OPTIONS AND USAGE
#                       ================================

# command line options
use Getopt::Long;
GetOptions ("help","version","token=s","nontoken=s","remove=i","window=i","stop=s","split=i","frequency=i","newLine");
# show help option
if(defined $opt_help)
{
    $opt_help=1;
    &showhelp();
    exit;
}

# show version information
if(defined $opt_version)
{
    $opt_version=1;
    &showversion();
    exit;
}

# show minimal usage message if fewer arguments
if($#ARGV<1)
{
    &showminimal();
    exit;
}

if(defined $opt_frequency && defined $opt_remove)
{
    print STDERR "ERROR($0):
    Options --remove and --frequency cannot be used together.\n";
    exit;
}

#############################################################################

#                       ========================
#                             CODE SECTION
#                       ========================

# accept the destination dir name
$destdir=$ARGV[0];
if(-e $destdir)
{
    if(!-d $destdir)
    {
        print STDERR "ERROR($0):
    $destdir is not a directory.\n";
        exit;
    }
}
else
{
    system("mkdir $destdir");
}

# ----------
#  Counting
# ----------

# source = dir
if($#ARGV==1 && -d $ARGV[1])
{
    $sourcedir=$ARGV[1];
    opendir(DIR,$sourcedir) || die "ERROR($0):
    Error (code=$!) in opening Source Directory <$sourcedir>.\n";
    while(defined ($file=readdir DIR))
    {
        next if $file =~ /^\.\.?$/;
        # sub-directories are skipped, only plain files are counted
        if(-f "$sourcedir/$file")
        {
            &runcount("$sourcedir/$file",$destdir);
        }
    }
    closedir DIR;
}
# source is a single file
elsif($#ARGV==1 && -f $ARGV[1])
{
    $source=$ARGV[1];
    if(defined $opt_split)
    {
        # copy the source and the option files into the destination dir,
        # where the split parts and their counts are created
        system("cp $source $destdir");
        if(defined $opt_token)
        {
            system("cp $opt_token $destdir");
        }
        if(defined $opt_nontoken)
        {
            system("cp $opt_nontoken $destdir");
        }
        if(defined $opt_stop)
        {
            system("cp $opt_stop $destdir");
        }
        chdir $destdir;
        $chdir=1;
        system("split-data.pl --parts $opt_split $source");
        system("/bin/rm -r -f $source");
        opendir(DIR,".") || die "ERROR($0):
    Error (code=$!) in opening Destination Directory <$destdir>.\n";
        while(defined ($file=readdir DIR))
        {
            # count each split part of the source, skipping count outputs;
            # \Q..\E matches the source name literally
            if($file=~/\Q$source\E/ && $file!~/\.bigrams/)
            {
                &runcount($file,".");
            }
        }
        # DIR is a directory handle, so it is closed with closedir
        closedir DIR;
    }
    else
    {
        print STDERR "Warning($0):
    You can run count.pl directly on the single source file if you don't
    want to split the source.\n";
        exit;
    }
}
# source contains multiple files
elsif($#ARGV > 1)
{
    foreach $i (1..$#ARGV)
    {
        if(-f $ARGV[$i])
        {
            &runcount($ARGV[$i],$destdir);
        }
        else
        {
            print STDERR "ERROR($0):
    ARGV[$i]=$ARGV[$i] should be a plain file.\n";
            exit;
        }
    }
}
# unexpected input
else
{
    &showminimal();
    exit;
}

# --------------------
#  Recombining counts
# --------------------

if(!defined $chdir)
{
    chdir $destdir;
}

# current dir is now destdir
opendir(DIR,".") || die "ERROR($0):
    Error (code=$!) in opening Destination Directory <$destdir>.\n";

$output="huge-count.output";
$tempfile="tempfile" . time() . ".tmp";

if(-e $output)
{
    system("/bin/rm -r -f $output");
}

while(defined ($file=readdir DIR))
{
    if($file=~/\.bigrams$/)
    {
        # the first count file starts the output, the rest are added to it
        if(!-e $output)
        {
            system("cp $file $output");
        }
        else
        {
            system("huge-combine.pl $file $output > $tempfile");
            system("mv $tempfile $output");
        }
    }
}

closedir DIR;

# ---------------------
#  Sorting and Removing
# ---------------------

if(defined $opt_remove)
{
    system("sort-bigrams.pl --remove $opt_remove $output > $tempfile");
}
else
{
    if(defined $opt_frequency)
    {
        system("sort-bigrams.pl --frequency $opt_frequency $output > $tempfile");
    }
    else
    {
        system("sort-bigrams.pl $output > $tempfile");
    }
}
system("mv $tempfile $output");

print STDERR "Check the output in $destdir/$output.\n";
exit;

##############################################################################

#                      ==========================
#                          SUBROUTINE SECTION
#                      ==========================

# runs count.pl on the given file and writes FILE.bigrams in the given dir
sub runcount
{
    my $file=shift;
    my $destdir=shift;
    my $justfile=$file;
    $justfile=~s/.*\/(.+)/$1/;

    # build the count.pl command line, adding only those options that were
    # given to huge-count.pl
    my $options="";
    $options.=" --newLine" if(defined $opt_newLine);
    $options.=" --window $opt_window" if(defined $opt_window);
    $options.=" --token $opt_token" if(defined $opt_token);
    $options.=" --nontoken $opt_nontoken" if(defined $opt_nontoken);
    $options.=" --stop $opt_stop" if(defined $opt_stop);

    system("count.pl$options $destdir/$justfile.bigrams $file");
}


#-----------------------------------------------------------------------------
#show minimal usage message
sub showminimal()
{
    print "Usage: huge-count.pl [OPTIONS] DESTINATION [SOURCE]+";
    print "\nTYPE huge-count.pl --help for help\n";
}

#-----------------------------------------------------------------------------
#show help
sub showhelp()
{
    print "Usage:  huge-count.pl [OPTIONS] DESTINATION [SOURCE]+

Efficiently runs count.pl on huge data.

SOURCE
    Could be one of the following -

        1. single plain file
        2. single flat directory containing multiple plain files
        3. list of plain files

DESTINATION
    Should be a directory where output is written.

OPTIONS:

--split P
    If SOURCE is a single plain file, --split has to be specified to
    split the source file into P parts and to run count.pl separately
    on each part.

--token TOKENFILE
    Specify a file containing Perl regular expressions that define the
    tokenization scheme for counting.

--nontoken NOTOKENFILE
    Specify a file containing Perl regular expressions of non-token
    sequences that are removed prior to tokenization.

--stop STOPFILE
    Specify a file containing Perl regular expressions of stop words
    that are to be removed from the output bigrams.

--window W
    Specify the window size for counting.

--remove L
    Bigrams with counts less than L will be removed from the sample.

--frequency F
    Bigrams with counts less than F will not be displayed.

--newLine
    Prevents bigrams from spanning across the new-line characters.

--help
    Displays this message.

--version
    Displays the version information.

Type 'perldoc huge-count.pl' to view detailed documentation of huge-count.\n";
}

#------------------------------------------------------------------------------
#version information
sub showversion()
{
    print "huge-count.pl      -       Version 0.03\n";
    print "Efficiently runs count.pl on huge data.\n";
    print "Copyright (C) 2004, Amruta Purandare & Ted Pedersen.\n";
    print "Date of Last Update:     03/30/2004\n";
}

#############################################################################