1# 2# BioPerl module for Bio::Tools::Run::StandAloneBlast 3# 4# Copyright Peter Schattner 5# 6# You may distribute this module under the same terms as perl itself 7 8# POD documentation - main docs before the code 9 10=head1 NAME 11 12Bio::Tools::Run::StandAloneBlast - Object for the local execution 13of the NCBI BLAST program suite (blastall, blastpgp, bl2seq). 14There is experimental support for WU-Blast and NCBI rpsblast. 15 16=head1 SYNOPSIS 17 18 # Local-blast "factory object" creation and blast-parameter 19 # initialization: 20 @params = (-database => 'swissprot', -outfile => 'blast1.out'); 21 $factory = Bio::Tools::Run::StandAloneBlast->new(@params); 22 23 # Blast a sequence against a database: 24 $str = Bio::SeqIO->new(-file=>'t/amino.fa', -format => 'Fasta'); 25 $input = $str->next_seq(); 26 $input2 = $str->next_seq(); 27 $blast_report = $factory->blastall($input); 28 29 # Run an iterated Blast (psiblast) of a sequence against a database: 30 $factory->j(3); # 'j' is blast parameter for # of iterations 31 $factory->outfile('psiblast1.out'); 32 $factory = Bio::Tools::Run::StandAloneBlast->new(@params); 33 $blast_report = $factory->blastpgp($input); 34 35 # Use blast to align 2 sequences against each other: 36 $factory = Bio::Tools::Run::StandAloneBlast->new(-outfile => 'bl2seq.out'); 37 $factory->bl2seq($input, $input2); 38 39 # Experimental support for WU-Blast 2.0 40 my $factory = Bio::Tools::Run::StandAloneBlast->new(-program =>"wublastp", 41 -database =>"swissprot", 42 -e => 1e-20); 43 my $blast_report = $factory->wublast($seq); 44 45 # Experimental support for NCBI rpsblast 46 my $factory = Bio::Tools::Run::StandAloneBlast->new(-db => 'CDD/Cog', 47 -expect => 0.001); 48 $factory->F('T'); # turn on SEG filtering of query sequence 49 my $blast_report = $factory->rpsblast($seq); 50 51 # Use the experimental fast Blast parser, 'blast_pull' 52 my $factory = Bio::Tools::Run::StandAloneBlast->new(-_READMETHOD =>'blast_pull', 53 @other_params); 54 55 # Various additional options and input formats are available, 56 # see the DESCRIPTION section for details. 57 58=head1 DESCRIPTION 59 60This DESCRIPTION only documents Bio::Tools::Run::StandAloneBlast, a 61Bioperl object for running the NCBI standAlone BLAST package. Blast 62itself is a large & complex program - for more information regarding 63BLAST, please see the BLAST documentation which accompanies the BLAST 64distribution. BLAST is available from ftp://ncbi.nlm.nih.gov/blast/. 65 66A source of confusion in documenting a BLAST interface is that the 67term "program" is used in - at least - three different ways in the 68BLAST documentation. In this DESCRIPTION, "program" will refer to the 69BLAST routine set by the BLAST C<-p> parameter that can be set to blastn, 70blastp, tblastx etc. We will use the term Blast "executable" to refer 71to the various different executable files that may be called - ie. 72blastall, blastpgp or bl2seq. In addition, there are several BLAST 73capabilities, which are also referred to as "programs", and are 74implemented by using specific combinations of BLAST executables, 75programs and parameters. They will be referred by their specific 76names - eg PSIBLAST and PHIBLAST. 77 78Before running StandAloneBlast it is necessary: to install BLAST 79on your system, to edit set the environmental variable $BLASTDIR 80or your $PATH variable to point to the BLAST directory, and to 81ensure that users have execute privileges for the BLAST program. 82 83If the databases which will be searched by BLAST are located in the 84data subdirectory of the blast program directory (the default 85installation location), StandAloneBlast will find them; however, 86if the database files are located in any other location, environmental 87variable $BLASTDATADIR will need to be set to point to that directory. 88 89The use of the StandAloneBlast module is as follows: Initially, a 90local blast "factory object" is created. The constructor may be passed 91an optional array of (non-default) parameters to be used by the 92factory, eg: 93 94 @params = (-program => 'blastn', -database => 'ecoli.nt'); 95 $factory = Bio::Tools::Run::StandAloneBlast->new(@params); 96 97Any parameters not explicitly set will remain as the defaults of the 98BLAST executable. Note each BLAST executable has somewhat different 99parameters and options. See the BLAST Documentation for a description 100or run the BLAST executable from the command line followed solely with 101a "-" to see a list of options and default values for that executable; 102eg E<gt>blastall -. 103 104BLAST parameters can be changed and/or examined at any time after the 105factory has been created. The program checks that any 106parameter/switch being set/read is valid. Except where specifically 107noted, StandAloneBlast uses the same single-letter, case-sensitive 108parameter names as the actual blast program. Currently no checks are 109included to verify that parameters are of the proper type (e.g. string 110or numeric) or that their values are within the proper range. 111 112As an example, to change the value of the Blast parameter 'e' ('e' is 113the parameter for expectation-value cutoff) 114 115 $expectvalue = 0.01; 116 $factory->e($expectvalue); 117 118Note that for improved script readibility one can modify the name of 119the (ncbi) BLAST parameters as desired as long as the initial letter (and 120case) of the parameter are preserved, e.g.: 121 122 $factory->expectvalue($expectvalue); 123 124Unfortunately, some of the BLAST parameters are not the single 125letter one might expect (eg "iteration round" in blastpgp is 'j'). 126Again one can check by using, for example: 127 128 > blastpgp - 129 130Wublast parameters need to be complete (ie. don't truncate them to their 131first letter), but are case-insensitive. 132 133Once the factory has been created and the appropriate parameters set, 134one can call one of the supported blast executables. The input 135sequence(s) to these executables may be fasta file(s) as described in 136the BLAST documentation. 137 138 $inputfilename = 't/testquery.fa'; 139 $blast_report = $factory->blastall($inputfilename); 140 141In addition, sequence input may be in the form of either a Bio::Seq 142object or (a reference to) an array of Bio::Seq objects, e.g.: 143 144 $input = Bio::Seq->new(-id => "test query", 145 -seq => "ACTACCCTTTAAATCAGTGGGGG"); 146 $blast_report = $factory->blastall($input); 147 148NOTE: Use of the BPlite method has been deprecated and is no longer supported. 149 150For blastall and non-psiblast blastpgp runs, report object is a L<Bio::SearchIO> 151object, selected by the user with the parameter _READMETHOD. The leading 152underscore is needed to distinguish this option from options which are passed to 153the BLAST executable. The default parser is Bio::SearchIO::blast. In any case, 154the "raw" blast report is also available. The filename is set by the 'outfile' 155parameter and has the default value of "blastreport.out". 156 157For psiblast execution in the BLAST "jumpstart" mode, the program must 158be passed (in addition to the query sequence itself) an alignment 159containing the query sequence (in the form of a SimpleAlign object) as 160well as a "mask" specifying at what residues position-specific scoring 161matrices (PSSMs) are to used and at what residues default scoring 162matrices (eg BLOSUM) are to be used. See psiblast documentation for 163more details. The mask itself is a string of 0's and 1's which is the 164same length as each sequence in the alignment and has a "1" at 165locations where (PSSMs) are to be used and a "0" at all other 166locations. So for example: 167 168 $str = Bio::AlignIO->new(-file => "cysprot.msf", 169 -format => 'msf'); 170 $aln = $str->next_aln(); 171 $len = $aln->length_aln(); 172 $mask = '1' x $len; 173 # simple case where PSSM's to be used at all residues 174 $report = $factory->blastpgp("cysprot1.fa", $aln, $mask); 175 176For bl2seq execution, StandAloneBlast.pm can be combined with 177AlignIO.pm to directly produce a SimpleAlign object from the alignment 178of the two sequences produced by bl2seq as in: 179 180 # Get 2 sequences 181 $str = Bio::SeqIO->new(-file=>'t/amino.fa' , -format => 'Fasta'); 182 my $seq3 = $str->next_seq(); 183 my $seq4 = $str->next_seq(); 184 185 # Run bl2seq on them 186 $factory = Bio::Tools::Run::StandAloneBlast->new(-program => 'blastp', 187 -outfile => 'bl2seq.out'); 188 my $bl2seq_report = $factory->bl2seq($seq3, $seq4); 189 190 # Use AlignIO.pm to create a SimpleAlign object from the bl2seq report 191 $str = Bio::AlignIO->new(-file=> 'bl2seq.out',-format => 'bl2seq'); 192 $aln = $str->next_aln(); 193 194For more examples of syntax and use of StandAloneBlast.pm, the user is 195encouraged to run the scripts standaloneblast.pl in the bioperl 196examples/tools directory and StandAloneBlast.t in the bioperl t/ 197directory. 198 199=head1 FEEDBACK 200 201=head2 Mailing Lists 202 203User feedback is an integral part of the evolution of this and other 204Bioperl modules. Send your comments and suggestions preferably to one 205of the Bioperl mailing lists. Your participation is much appreciated. 206 207 bioperl-l@bioperl.org - General discussion 208 http://bioperl.org/wiki/Mailing_lists - About the mailing lists 209 210=head2 Support 211 212Please direct usage questions or support issues to the mailing list: 213 214I<bioperl-l@bioperl.org> 215 216rather than to the module maintainer directly. Many experienced and 217reponsive experts will be able look at the problem and quickly 218address it. Please include a thorough description of the problem 219with code and data examples if at all possible. 220 221=head2 Reporting Bugs 222 223Report bugs to the Bioperl bug tracking system to help us keep track 224the bugs and their resolution. Bug reports can be submitted via 225the web: 226 227 https://github.com/bioperl/bioperl-live/issues 228 229=head1 AUTHOR - Peter Schattner 230 231Email schattner at alum.mit.edu 232 233=head1 MAINTAINER - Torsten Seemann 234 235Email torsten at infotech.monash.edu.au 236 237=head1 CONTRIBUTORS 238 239Sendu Bala bix@sendu.me.uk (reimplementation) 240 241=head1 APPENDIX 242 243The rest of the documentation details each of the object 244methods. Internal methods are usually preceded with a _ 245 246=cut 247 248package Bio::Tools::Run::StandAloneBlast; 249 250use strict; 251use warnings; 252 253use Bio::Root::IO; 254use Bio::Seq; 255use Bio::SeqIO; 256use Bio::SearchIO; 257use File::Spec; 258 259use base qw(Bio::Tools::Run::WrapperBase Bio::Factory::ApplicationFactoryI); 260 261our $AUTOLOAD; 262our $DEFAULTBLASTTYPE = 'NCBI'; 263our $DEFAULTREADMETHOD = 'BLAST'; 264 265# If local BLAST databases are not stored in the standard 266# /data directory, the variable BLASTDATADIR will need to be 267# set explicitly 268our $DATADIR = $ENV{'BLASTDATADIR'} || $ENV{'BLASTDB'}; 269if (! defined $DATADIR && defined $ENV{'BLASTDIR'}) { 270 my $dir = Bio::Root::IO->catfile($ENV{'BLASTDIR'}, 'data'); 271 if (-d $dir) { 272 $DATADIR = $dir; 273 } 274 elsif ($ENV{'BLASTDIR'} =~ /bin/) { 275 $dir = $ENV{'BLASTDIR'}; 276 $dir =~ s/bin/data/; 277 $DATADIR = $dir if -d $dir; 278 } 279} 280 281=head2 new 282 283 Title : new 284 Usage : my $obj = Bio::Tools::Run::StandAloneBlast->new(); 285 Function: Builds a newBio::Tools::Run::StandAloneBlast object 286 Returns : Bio::Tools::Run::StandAloneNCBIBlast or StandAloneWUBlast 287 Args : -quiet => boolean # make program execution quiet 288 -_READMETHOD => 'BLAST' (default, synonym 'SearchIO') || 'blast_pull' 289 # the parsing method, case insensitive 290 291Essentially all BLAST parameters can be set via StandAloneBlast.pm. 292Some of the most commonly used parameters are listed below. All 293parameters have defaults and are optional except for -p in those programs that 294have it. For a complete listing of settable parameters, run the relevant 295executable BLAST program with the option "-" as in blastall - 296Note that the input parameters (-i, -j, -input) should not be set directly by 297you: this module sets them when you call one of the executable methods. 298 299Blastall 300 301 -p Program Name [String] 302 Input should be one of "blastp", "blastn", "blastx", 303 "tblastn", or "tblastx". 304 -d Database [String] default = nr 305 The database specified must first be formatted with formatdb. 306 Multiple database names (bracketed by quotations) will be accepted. 307 An example would be -d "nr est" 308 -e Expectation value (E) [Real] default = 10.0 309 -o BLAST report Output File [File Out] Optional, 310 default = ./blastreport.out ; set by StandAloneBlast.pm 311 -S Query strands to search against database (for blast[nx], and tblastx). 3 is both, 1 is top, 2 is bottom [Integer] 312 default = 3 313 314Blastpgp (including Psiblast) 315 316 -j is the maximum number of rounds (default 1; i.e., regular BLAST) 317 -h is the e-value threshold for including sequences in the 318 score matrix model (default 0.001) 319 -c is the "constant" used in the pseudocount formula specified in the paper (default 10) 320 -B Multiple alignment file for PSI-BLAST "jump start mode" Optional 321 -Q Output File for PSI-BLAST Matrix in ASCII [File Out] Optional 322 323rpsblast 324 325 -d Database [String] default = (none - you must specify a database) 326 The database specified must first be formatted with formatdb. 327 Multiple database names (bracketed by quotations) will be accepted. 328 An example would be -d "Cog Smart" 329 -e Expectation value (E) [Real] default = 10.0 330 -o BLAST report Output File [File Out] Optional, 331 default = ./blastreport.out ; set by StandAloneBlast.pm 332 333Bl2seq 334 335 -p Program name: blastp, blastn, blastx. For blastx 1st argument should be nucleotide [String] 336 default = blastp 337 -o alignment output file [File Out] default = stdout 338 -e Expectation value (E) [Real] default = 10.0 339 -S Query strands to search against database (blastn only). 3 is both, 1 is top, 2 is bottom [Integer] 340 default = 3 341 342WU-Blast 343 344 -p Program Name [String] 345 Input should be one of "wublastp", "wublastn", "wublastx", 346 "wutblastn", or "wutblastx". 347 -d Database [String] default = nr 348 The database specified must first be formatted with xdformat. 349 -E Expectation value (E) [Real] default = 10.0 350 -o BLAST report Output File [File Out] Optional, 351 default = ./blastreport.out ; set by StandAloneBlast.pm 352 353=cut 354 355sub new { 356 my ($caller, @args) = @_; 357 my $class = ref($caller) || $caller; 358 359 # Because of case-sensitivity issues, ncbi and wublast methods are 360 # mutually exclusive. We can't load ncbi methods if we start with wublast 361 # (and vice versa) since wublast e() and E() should be the same thing, 362 # whilst they must be different things in ncbi blast. 363 # 364 # Solution: split StandAloneBlast out into two more modules for NCBI and WU 365 366 if ($class =~ /NCBI|WU/) { 367 return $class->SUPER::new(@args); 368 } 369 370 my %args = @args; 371 my $blasttype = $DEFAULTBLASTTYPE; 372 while (my ($attr, $value) = each %args) { 373 if ($attr =~/^-?\s*program\s*$|^-?p$/) { 374 if ($value =~ /^wu*/) { 375 $blasttype = 'WU'; 376 } 377 } 378 } 379 380 my $module = "Bio::Tools::Run::StandAlone${blasttype}Blast"; 381 Bio::Root::Root->_load_module($module); 382 return $module->new(@args); 383} 384 385=head2 executable 386 387 Title : executable 388 Usage : my $exe = $blastfactory->executable('blastall'); 389 Function: Finds the full path to the executable 390 Returns : string representing the full path to the exe 391 Args : [optional] name of executable to set path to 392 [optional] boolean flag whether or not warn when exe is not found 393 394=cut 395 396sub executable { 397 my ($self, $exename, $exe, $warn) = @_; 398 $exename = 'blastall' unless (defined $exename || $self =~ /WUBlast/); 399 $self->program_name($exename); 400 401 if( defined $exe && -x $exe ) { 402 $self->{'_pathtoexe'}->{$exename} = $exe; 403 } 404 unless( defined $self->{'_pathtoexe'}->{$exename} ) { 405 my $f = $self->program_path($exename); 406 $exe = $self->{'_pathtoexe'}->{$exename} = $f if(-e $f && -x $f ); 407 408 # This is how I meant to split up these conditionals --jason 409 # if exe is null we will execute this (handle the case where 410 # PROGRAMDIR pointed to something invalid) 411 unless( $exe ) { # we didn't find it in that last conditional 412 if( ($exe = $self->io->exists_exe($exename)) && -x $exe ) { 413 $self->{'_pathtoexe'}->{$exename} = $exe; 414 } 415 else { 416 $self->warn("Cannot find executable for $exename") if $warn; 417 $self->{'_pathtoexe'}->{$exename} = undef; 418 } 419 } 420 } 421 return $self->{'_pathtoexe'}->{$exename}; 422} 423 424=head2 program_dir 425 426 Title : program_dir 427 Usage : my $dir = $factory->program_dir(); 428 Function: Abstract get method for dir of program. 429 Returns : string representing program directory 430 Args : none 431 432=cut 433 434sub program_dir { 435 my $self = shift; 436 $self =~ /NCBIBlast/? $ENV{'BLASTDIR'}: $ENV{'WUBLASTDIR'}; 437} 438 439sub program_name { 440 my $self = shift; 441 if (@_) { $self->{program_name} = shift } 442 return $self->{program_name} || ''; 443} 444 445sub program { 446 my $self = shift; 447 if( wantarray ) { 448 return ($self->executable, $self->p()); 449 } else { 450 return $self->executable(@_); 451 } 452} 453 454=head2 _setinput 455 456 Title : _setinput 457 Usage : Internal function, not to be called directly 458 Function: Create input file(s) for Blast executable 459 Example : 460 Returns : name of file containing Blast data input 461 Args : Seq object reference or input file name 462 463=cut 464 465sub _setinput { 466 my ($self, $executable, $input1, $input2) = @_; 467 my ($seq, $temp, $infilename1, $infilename2,$fh ) ; 468 # If $input1 is not a reference it better be the name of a file with 469 # the sequence/ alignment data... 470 $self->io->_io_cleanup(); 471 472 SWITCH: { 473 unless (ref $input1) { 474 $infilename1 = (-e $input1) ? $input1 : 0 ; 475 last SWITCH; 476 } 477 478 # $input may be an array of BioSeq objects... 479 if (ref($input1) =~ /ARRAY/i ) { 480 ($fh,$infilename1) = $self->io->tempfile(TEMPLATE=>'blastquery-XXXXXX', SUFFIX=>'.fasta'); 481 $temp = Bio::SeqIO->new(-fh=> $fh, -format => 'fasta'); 482 foreach $seq (@$input1) { 483 unless ($seq->isa("Bio::PrimarySeqI")) {return 0;} 484 $seq->display_id($seq->display_id); 485 $temp->write_seq($seq); 486 } 487 close $fh; 488 $fh = undef; 489 last SWITCH; 490 } 491 492 # $input may be a single BioSeq object... 493 elsif ($input1->isa("Bio::PrimarySeqI")) { 494 ($fh,$infilename1) = $self->io->tempfile(TEMPLATE=>'blastquery-XXXXXX', SUFFIX=>'.fasta'); 495 496 # just in case $input1 is taken from an alignment and has spaces (ie 497 # deletions) indicated within it, we have to remove them - otherwise 498 # the BLAST programs will be unhappy 499 my $seq_string = $input1->seq(); 500 $seq_string =~ s/\W+//g; # get rid of spaces in sequence 501 $input1->seq($seq_string); 502 $temp = Bio::SeqIO->new(-fh=> $fh, '-format' => 'fasta'); 503 $temp->write_seq($input1); 504 close $fh; 505 undef $fh; 506 last SWITCH; 507 } 508 509 $infilename1 = 0; # Set error flag if you get here 510 } 511 512 unless ($input2) { return $infilename1; } 513 514 SWITCH2: { 515 unless (ref $input2) { 516 $infilename2 = (-e $input2) ? $input2 : 0 ; 517 last SWITCH2; 518 } 519 if ($input2->isa("Bio::PrimarySeqI") && $executable eq 'bl2seq' ) { 520 ($fh,$infilename2) = $self->io->tempfile(TEMPLATE=>'blastquery-XXXXXX', SUFFIX=>'.fasta'); 521 522 $temp = Bio::SeqIO->new(-fh=> $fh, '-format' => 'Fasta'); 523 $temp->write_seq($input2); 524 close $fh; 525 undef $fh; 526 last SWITCH2; 527 } 528 529 # Option for using psiblast's pre-alignment "jumpstart" feature 530 elsif ($input2->isa("Bio::SimpleAlign") && $executable eq 'blastpgp' ) { 531 # a bit of a lie since it won't be a fasta file 532 ($fh,$infilename2) = $self->io->tempfile(TEMPLATE=>'blastquery-XXXXXX', SUFFIX=>'.fasta'); 533 534 # first we retrieve the "mask" that determines which residues should 535 # by scored according to their position and which should be scored 536 # using the non-position-specific matrices 537 my @mask = split("", shift ); # get mask 538 539 # then we have to convert all the residues in every sequence to upper 540 # case at the positions that we want psiblast to use position specific 541 # scoring 542 foreach $seq ( $input2->each_seq() ) { 543 my @seqstringlist = split("",$seq->seq()); 544 for (my $i = 0; $i < scalar(@mask); $i++) { 545 unless ( $seqstringlist[$i] =~ /[a-zA-Z]/ ) {next} 546 $seqstringlist[$i] = $mask[$i] ? uc $seqstringlist[$i]: lc $seqstringlist[$i] ; 547 } 548 my $newseqstring = join("", @seqstringlist); 549 $seq->seq($newseqstring); 550 } 551 552 # Now we need to write out the alignment to a file 553 # in the "psi format" which psiblast is expecting 554 $input2->map_chars('\.','-'); 555 $temp = Bio::AlignIO->new(-fh=> $fh, '-format' => 'psi'); 556 $temp->write_aln($input2); 557 close $fh; 558 undef $fh; 559 last SWITCH2; 560 } 561 562 $infilename2 = 0; # Set error flag if you get here 563 } 564 565 return ($infilename1, $infilename2); 566} 567 568=head1 Bio::Tools::Run::WrapperBase methods 569 570=cut 571 572=head2 no_param_checks 573 574 Title : no_param_checks 575 Usage : $obj->no_param_checks($newval) 576 Function: Boolean flag as to whether or not we should 577 trust the sanity checks for parameter values 578 Returns : value of no_param_checks 579 Args : newvalue (optional) 580 581=cut 582 583=head2 save_tempfiles 584 585 Title : save_tempfiles 586 Usage : $obj->save_tempfiles($newval) 587 Function: 588 Returns : value of save_tempfiles 589 Args : newvalue (optional) 590 591=cut 592 593=head2 outfile_name 594 595 Title : outfile_name 596 Usage : my $outfile = $tcoffee->outfile_name(); 597 Function: Get/Set the name of the output file for this run 598 (if you wanted to do something special) 599 Returns : string 600 Args : [optional] string to set value to 601 602=cut 603 604=head2 tempdir 605 606 Title : tempdir 607 Usage : my $tmpdir = $self->tempdir(); 608 Function: Retrieve a temporary directory name (which is created) 609 Returns : string which is the name of the temporary directory 610 Args : none 611 612=cut 613 614=head2 cleanup 615 616 Title : cleanup 617 Usage : $tcoffee->cleanup(); 618 Function: Will cleanup the tempdir directory after a PAML run 619 Returns : none 620 Args : none 621 622=cut 623 624=head2 io 625 626 Title : io 627 Usage : $obj->io($newval) 628 Function: Gets a Bio::Root::IO object 629 Returns : Bio::Root::IO 630 Args : none 631 632=cut 633 6341; 635