1=head1 NAME 2 3Bio::Search::BlastUtils - Utility functions for Bio::Search:: BLAST objects 4 5=head1 SYNOPSIS 6 7 # This module is just a collection of subroutines, not an object. 8 9See L<Bio::Search::Hit::BlastHit>. 10 11=head1 DESCRIPTION 12 13The BlastUtils.pm module is a collection of subroutines used primarily by 14Bio::Search::Hit::BlastHit objects for some of the additional 15functionality, such as HSP tiling. Right now, the BlastUtils is just a 16collection of methods, not an object, and it's tightly coupled to 17Bio::Search::Hit::BlastHit. A goal for the future is to generalize it 18to work based on the Bio::Search interfaces, then it can work with any 19objects that implements them. 20 21=head1 AUTHOR 22 23Steve Chervitz E<lt>sac@bioperl.orgE<gt> 24 25=cut 26 27#' 28 29package Bio::Search::BlastUtils; 30$Bio::Search::BlastUtils::VERSION = '1.7.7'; 31 32=head2 tile_hsps 33 34 Usage : tile_hsps( $sbjct ); 35 : This is called automatically by Bio::Search::Hit::BlastHit 36 : during object construction or 37 : as needed by methods that rely on having tiled data. 38 Purpose : Collect statistics about the aligned sequences in a set of HSPs. 39 : Calculates the following data across all HSPs: 40 : -- total alignment length 41 : -- total identical residues 42 : -- total conserved residues 43 Returns : n/a 44 Argument : A Bio::Search::Hit::BlastHit object 45 Throws : n/a 46 Comments : 47 : This method is *strongly* coupled to Bio::Search::Hit::BlastHit 48 : (it accesses BlastHit data members directly). 49 : TODO: Re-write this to the Bio::Search::Hit::HitI interface. 50 : 51 : This method performs more careful summing of data across 52 : all HSPs in the Sbjct object. Only HSPs that are in the same strand 53 : and frame are tiled. Simply summing the data from all HSPs 54 : in the same strand and frame will overestimate the actual 55 : length of the alignment if there is overlap between different HSPs 56 : (often the case). 57 : 58 : The strategy is to tile the HSPs and sum over the 59 : contigs, collecting data separately from overlapping and 60 : non-overlapping regions of each HSP. To facilitate this, the 61 : HSP.pm object now permits extraction of data from sub-sections 62 : of an HSP. 63 : 64 : Additional useful information is collected from the results 65 : of the tiling. It is possible that sub-sequences in 66 : different HSPs will overlap significantly. In this case, it 67 : is impossible to create a single unambiguous alignment by 68 : concatenating the HSPs. The ambiguity may indicate the 69 : presence of multiple, similar domains in one or both of the 70 : aligned sequences. This ambiguity is recorded using the 71 : ambiguous_aln() method. 72 : 73 : This method does not attempt to discern biologically 74 : significant vs. insignificant overlaps. The allowable amount of 75 : overlap can be set with the overlap() method or with the -OVERLAP 76 : parameter used when constructing the Blast & Sbjct objects. 77 : 78 : For a given hit, both the query and the sbjct sequences are 79 : tiled independently. 80 : 81 : -- If only query sequence HSPs overlap, 82 : this may suggest multiple domains in the sbjct. 83 : -- If only sbjct sequence HSPs overlap, 84 : this may suggest multiple domains in the query. 85 : -- If both query & sbjct sequence HSPs overlap, 86 : this suggests multiple domains in both. 87 : -- If neither query & sbjct sequence HSPs overlap, 88 : this suggests either no multiple domains in either 89 : sequence OR that both sequences have the same 90 : distribution of multiple similar domains. 91 : 92 : This method can deal with the special case of when multiple 93 : HSPs exactly overlap. 94 : 95 : Efficiency concerns: 96 : Speed will be an issue for sequences with numerous HSPs. 97 : 98 Bugs : Currently, tile_hsps() does not properly account for 99 : the number of non-tiled but overlapping HSPs, which becomes a problem 100 : as overlap() grows. Large values overlap() may thus lead to 101 : incorrect statistics for some hits. For best results, keep overlap() 102 : below 5 (DEFAULT IS 2). For more about this, see the "HSP Tiling and 103 : Ambiguous Alignments" section in L<Bio::Search::Hit::BlastHit>. 104 105See Also : L<_adjust_contigs>(), L<Bio::Search::Hit::BlastHit|Bio::Search::Hit::BlastHit> 106 107=cut 108 109#-------------- 110sub tile_hsps { 111#-------------- 112 my $sbjct = shift; 113 114 $sbjct->{'_tile_hsps'} = 1; 115 $sbjct->{'_gaps_query'} = 0; 116 $sbjct->{'_gaps_sbjct'} = 0; 117 118 ## Simple summation scheme. Valid if there is only one HSP. 119 if((defined($sbjct->{'_n'}) and $sbjct->{'_n'} == 1) or $sbjct->num_hsps == 1) { 120 my $hsp = $sbjct->hsp; 121 $sbjct->{'_length_aln_query'} = $hsp->length('query'); 122 $sbjct->{'_length_aln_sbjct'} = $hsp->length('sbjct'); 123 $sbjct->{'_length_aln_total'} = $hsp->length('total'); 124 ($sbjct->{'_totalIdentical'},$sbjct->{'_totalConserved'}) = $hsp->matches(); 125 $sbjct->{'_gaps_query'} = $hsp->gaps('query'); 126 $sbjct->{'_gaps_sbjct'} = $hsp->gaps('sbjct'); 127 128# print "_tile_hsps(): single HSP, easy stats.\n"; 129 return; 130 } else { 131# print STDERR "Sbjct: _tile_hsps: summing multiple HSPs\n"; 132 $sbjct->{'_length_aln_query'} = 0; 133 $sbjct->{'_length_aln_sbjct'} = 0; 134 $sbjct->{'_length_aln_total'} = 0; 135 $sbjct->{'_totalIdentical'} = 0; 136 $sbjct->{'_totalConserved'} = 0; 137 } 138 139 ## More than one HSP. Must tile HSPs. 140# print "\nTiling HSPs for $sbjct\n"; 141 my($hsp, $qstart, $qstop, $sstart, $sstop); 142 my($frame, $strand, $qstrand, $sstrand); 143 my(@qcontigs, @scontigs); 144 my $qoverlap = 0; 145 my $soverlap = 0; 146 my $max_overlap = $sbjct->{'_overlap'}; 147 148 foreach $hsp ($sbjct->hsps()) { 149# printf " HSP: %s\n%s\n",$hsp->name, $hsp->str('query'); 150# printf " Length = %d; Identical = %d; Conserved = %d; Conserved(1-10): %d",$hsp->length, $hsp->length(-TYPE=>'iden'), $hsp->length(-TYPE=>'cons'), $hsp->length(-TYPE=>'cons',-START=>0,-STOP=>10); 151 ($qstart, $qstop) = $hsp->range('query'); 152 ($sstart, $sstop) = $hsp->range('sbjct'); 153 $frame = $hsp->frame('hit'); 154 $frame = -1 unless defined $frame; 155 ($qstrand, $sstrand) = $hsp->strand; 156 157 my ($qgaps, $sgaps) = $hsp->gaps(); 158 $sbjct->{'_gaps_query'} += $qgaps; 159 $sbjct->{'_gaps_sbjct'} += $sgaps; 160 161 $sbjct->{'_length_aln_total'} += $hsp->length; 162 ## Collect contigs in the query sequence. 163 $qoverlap = &_adjust_contigs('query', $hsp, $qstart, $qstop, \@qcontigs, $max_overlap, $frame, $qstrand); 164 165 ## Collect contigs in the sbjct sequence (needed for domain data and gapped Blast). 166 $soverlap = &_adjust_contigs('sbjct', $hsp, $sstart, $sstop, \@scontigs, $max_overlap, $frame, $sstrand); 167 168 ## Collect overall start and stop data for query and sbjct over all HSPs. 169 if(not defined $sbjct->{'_queryStart'}) { 170 $sbjct->{'_queryStart'} = $qstart; 171 $sbjct->{'_queryStop'} = $qstop; 172 $sbjct->{'_sbjctStart'} = $sstart; 173 $sbjct->{'_sbjctStop'} = $sstop; 174 } else { 175 $sbjct->{'_queryStart'} = ($qstart < $sbjct->{'_queryStart'} ? $qstart : $sbjct->{'_queryStart'}); 176 $sbjct->{'_queryStop'} = ($qstop > $sbjct->{'_queryStop'} ? $qstop : $sbjct->{'_queryStop'}); 177 $sbjct->{'_sbjctStart'} = ($sstart < $sbjct->{'_sbjctStart'} ? $sstart : $sbjct->{'_sbjctStart'}); 178 $sbjct->{'_sbjctStop'} = ($sstop > $sbjct->{'_sbjctStop'} ? $sstop : $sbjct->{'_sbjctStop'}); 179 } 180 } 181 182 ## Collect data across the collected contigs. 183 184# print "\nQUERY CONTIGS:\n"; 185# print " gaps = $sbjct->{'_gaps_query'}\n"; 186 187 # TODO: Account for strand/frame issue! 188 # Strategy: collect data on a per strand+frame basis and save the most significant one. 189 my (%qctg_dat); 190 foreach(@qcontigs) { 191# print " query contig: $_->{'start'} - $_->{'stop'}\n"; 192# print " iden = $_->{'iden'}; cons = $_->{'cons'}\n"; 193 ($frame, $strand) = ($_->{'frame'}, $_->{'strand'}); 194 $qctg_dat{ "$frame$strand" }->{'length_aln_query'} += $_->{'stop'} - $_->{'start'} + 1; 195 $qctg_dat{ "$frame$strand" }->{'totalIdentical'} += $_->{'iden'}; 196 $qctg_dat{ "$frame$strand" }->{'totalConserved'} += $_->{'cons'}; 197 $qctg_dat{ "$frame$strand" }->{'qstrand'} = $strand; 198 } 199 200 # Find longest contig. 201 my @sortedkeys = reverse sort { $qctg_dat{ $a }->{'length_aln_query'} <=> $qctg_dat{ $b }->{'length_aln_query'} } keys %qctg_dat; 202 203 # Save the largest to the sbjct: 204 my $longest = $sortedkeys[0]; 205 $sbjct->{'_length_aln_query'} = $qctg_dat{ $longest }->{'length_aln_query'}; 206 $sbjct->{'_totalIdentical'} = $qctg_dat{ $longest }->{'totalIdentical'}; 207 $sbjct->{'_totalConserved'} = $qctg_dat{ $longest }->{'totalConserved'}; 208 $sbjct->{'_qstrand'} = $qctg_dat{ $longest }->{'qstrand'}; 209 210 ## Collect data for sbjct contigs. Important for gapped Blast. 211 ## The totalIdentical and totalConserved numbers will be the same 212 ## as determined for the query contigs. 213 214# print "\nSBJCT CONTIGS:\n"; 215# print " gaps = $sbjct->{'_gaps_sbjct'}\n"; 216 217 my (%sctg_dat); 218 foreach(@scontigs) { 219# print " sbjct contig: $_->{'start'} - $_->{'stop'}\n"; 220# print " iden = $_->{'iden'}; cons = $_->{'cons'}\n"; 221 ($frame, $strand) = ($_->{'frame'}, $_->{'strand'}); 222 $sctg_dat{ "$frame$strand" }->{'length_aln_sbjct'} += $_->{'stop'} - $_->{'start'} + 1; 223 $sctg_dat{ "$frame$strand" }->{'frame'} = $frame; 224 $sctg_dat{ "$frame$strand" }->{'sstrand'} = $strand; 225 } 226 227 @sortedkeys = reverse sort { $sctg_dat{ $a }->{'length_aln_sbjct'} <=> $sctg_dat{ $b }->{'length_aln_sbjct'} } keys %sctg_dat; 228 229 # Save the largest to the sbjct: 230 $longest = $sortedkeys[0]; 231 232 $sbjct->{'_length_aln_sbjct'} = $sctg_dat{ $longest }->{'length_aln_sbjct'}; 233 $sbjct->{'_frame'} = $sctg_dat{ $longest }->{'frame'}; 234 $sbjct->{'_sstrand'} = $sctg_dat{ $longest }->{'sstrand'}; 235 236 if($qoverlap) { 237 if($soverlap) { $sbjct->ambiguous_aln('qs'); 238# print "\n*** AMBIGUOUS ALIGNMENT: Query and Sbjct\n\n"; 239 } 240 else { $sbjct->ambiguous_aln('q'); 241# print "\n*** AMBIGUOUS ALIGNMENT: Query\n\n"; 242 } 243 } elsif($soverlap) { 244 $sbjct->ambiguous_aln('s'); 245# print "\n*** AMBIGUOUS ALIGNMENT: Sbjct\n\n"; 246 } 247 248 # Adjust length based on BLAST flavor. 249 my $prog = $sbjct->algorithm; 250 if($prog eq 'TBLASTN') { 251 $sbjct->{'_length_aln_sbjct'} /= 3; 252 } elsif($prog eq 'BLASTX' ) { 253 $sbjct->{'_length_aln_query'} /= 3; 254 } elsif($prog eq 'TBLASTX') { 255 $sbjct->{'_length_aln_query'} /= 3; 256 $sbjct->{'_length_aln_sbjct'} /= 3; 257 } 258} 259 260 261 262=head2 _adjust_contigs 263 264 Usage : n/a; called automatically during object construction. 265 Purpose : Builds HSP contigs for a given BLAST hit. 266 : Utility method called by _tile_hsps() 267 Returns : 268 Argument : 269 Throws : Exceptions propagated from Bio::Search::Hit::BlastHSP::matches() 270 : for invalid sub-sequence ranges. 271 Status : Experimental 272 Comments : This method does not currently support gapped alignments. 273 : Also, it does not keep track of the number of HSPs that 274 : overlap within the amount specified by overlap(). 275 : This will lead to significant tracking errors for large 276 : overlap values. 277 278See Also : L<tile_hsps>(), L<Bio::Search::Hit::BlastHSP::matches|Bio::Search::Hit::BlastHSP> 279 280=cut 281 282#------------------- 283sub _adjust_contigs { 284#------------------- 285 my ($seqType, $hsp, $start, $stop, $contigs_ref, $max_overlap, $frame, $strand) = @_; 286 287 my $overlap = 0; 288 my ($numID, $numCons); 289 290# print STDERR "Testing $seqType data: HSP (${\$hsp->name}); $start, $stop, strand=$strand, frame=$frame\n"; 291 foreach(@$contigs_ref) { 292# print STDERR " Contig: $_->{'start'} - $_->{'stop'}, strand=$_->{'strand'}, frame=$_->{'frame'}, iden= $_->{'iden'}, cons= $_->{'cons'}\n"; 293 294 # Don't merge things unless they have matching strand/frame. 295 next unless ($_->{'frame'} == $frame and $_->{'strand'} == $strand); 296 297 ## Test special case of a nested HSP. Skip it. 298 if($start >= $_->{'start'} and $stop <= $_->{'stop'}) { 299# print STDERR "----> Nested HSP. Skipping.\n"; 300 $overlap = 1; 301 next; 302 } 303 304 ## Test for overlap at beginning of contig. 305 if($start < $_->{'start'} and $stop > ($_->{'start'} + $max_overlap)) { 306# print STDERR "----> Overlaps beg: existing beg,end: $_->{'start'},$_->{'stop'}, new beg,end: $start,$stop\n"; 307 # Collect stats over the non-overlapping region. 308 eval { 309 ($numID, $numCons) = $hsp->matches(-SEQ =>$seqType, 310 -START =>$start, 311 -STOP =>$_->{'start'}-1); 312 }; 313 if($@) { warn "\a\n$@\n"; } 314 else { 315 $_->{'start'} = $start; # Assign a new start coordinate to the contig 316 $_->{'iden'} += $numID; # and add new data to #identical, #conserved. 317 $_->{'cons'} += $numCons; 318 $overlap = 1; 319 } 320 } 321 322 ## Test for overlap at end of contig. 323 if($stop > $_->{'stop'} and $start < ($_->{'stop'} - $max_overlap)) { 324# print STDERR "----> Overlaps end: existing beg,end: $_->{'start'},$_->{'stop'}, new beg,end: $start,$stop\n"; 325 # Collect stats over the non-overlapping region. 326 eval { 327 ($numID,$numCons) = $hsp->matches(-SEQ =>$seqType, 328 -START =>$_->{'stop'}, 329 -STOP =>$stop); 330 }; 331 if($@) { warn "\a\n$@\n"; } 332 else { 333 $_->{'stop'} = $stop; # Assign a new stop coordinate to the contig 334 $_->{'iden'} += $numID; # and add new data to #identical, #conserved. 335 $_->{'cons'} += $numCons; 336 $overlap = 1; 337 } 338 } 339 $overlap && do { 340# print STDERR " New Contig data:\n"; 341# print STDERR " Contig: $_->{'start'} - $_->{'stop'}, iden= $_->{'iden'}, cons= $_->{'cons'}\n"; 342 last; 343 }; 344 } 345 ## If there is no overlap, add the complete HSP data. 346 !$overlap && do { 347# print STDERR "No overlap. Adding new contig.\n"; 348 ($numID,$numCons) = $hsp->matches(-SEQ=>$seqType); 349 push @$contigs_ref, {'start'=>$start, 'stop'=>$stop, 350 'iden'=>$numID, 'cons'=>$numCons, 351 'strand'=>$strand, 'frame'=>$frame}; 352 }; 353 $overlap; 354} 355 356=head2 get_exponent 357 358 Usage : &get_exponent( number ); 359 Purpose : Determines the power of 10 exponent of an integer, float, 360 : or scientific notation number. 361 Example : &get_exponent("4.0e-206"); 362 : &get_exponent("0.00032"); 363 : &get_exponent("10."); 364 : &get_exponent("1000.0"); 365 : &get_exponent("e+83"); 366 Argument : Float, Integer, or scientific notation number 367 Returns : Integer representing the exponent part of the number (+ or -). 368 : If argument == 0 (zero), return value is "-999". 369 Comments : Exponents are rounded up (less negative) if the mantissa is >= 5. 370 : Exponents are rounded down (more negative) if the mantissa is <= -5. 371 372=cut 373 374#------------------ 375sub get_exponent { 376#------------------ 377 my $data = shift; 378 379 my($num, $exp) = split /[eE]/, $data; 380 381 if( defined $exp) { 382 $num = 1 if not $num; 383 $num >= 5 and $exp++; 384 $num <= -5 and $exp--; 385 } elsif( $num == 0) { 386 $exp = -999; 387 } elsif( not $num =~ /\./) { 388 $exp = CORE::length($num) -1; 389 } else { 390 $exp = 0; 391 $num .= '0' if $num =~ /\.$/; 392 my ($c); 393 my $rev = 0; 394 if($num !~ /^0/) { 395 $num = reverse($num); 396 $rev = 1; 397 } 398 do { $c = chop($num); 399 $c == 0 && $exp++; 400 } while( $c ne '.'); 401 402 $exp = -$exp if $num == 0 and not $rev; 403 $exp -= 1 if $rev; 404 } 405 return $exp; 406} 407 408=head2 collapse_nums 409 410 Usage : @cnums = collapse_nums( @numbers ); 411 Purpose : Collapses a list of numbers into a set of ranges of consecutive terms: 412 : Useful for condensing long lists of consecutive numbers. 413 : EXPANDED: 414 : 1 2 3 4 5 6 10 12 13 14 15 17 18 20 21 22 24 26 30 31 32 415 : COLLAPSED: 416 : 1-6 10 12-15 17 18 20-22 24 26 30-32 417 Argument : List of numbers sorted numerically. 418 Returns : List of numbers mixed with ranges of numbers (see above). 419 Throws : n/a 420 421See Also : L<Bio::Search::Hit::BlastHit::seq_inds()|Bio::Search::Hit::BlastHit> 422 423=cut 424 425#------------------ 426sub collapse_nums { 427#------------------ 428# This is probably not the slickest connectivity algorithm, but will do for now. 429 my @a = @_; 430 my ($from, $to, $i, @ca, $consec); 431 432 $consec = 0; 433 for($i=0; $i < @a; $i++) { 434 not $from and do{ $from = $a[$i]; next; }; 435 if($a[$i] == $a[$i-1]+1) { 436 $to = $a[$i]; 437 $consec++; 438 } else { 439 if($consec == 1) { $from .= ",$to"; } 440 else { $from .= $consec>1 ? "\-$to" : ""; } 441 push @ca, split(',', $from); 442 $from = $a[$i]; 443 $consec = 0; 444 $to = undef; 445 } 446 } 447 if(defined $to) { 448 if($consec == 1) { $from .= ",$to"; } 449 else { $from .= $consec>1 ? "\-$to" : ""; } 450 } 451 push @ca, split(',', $from) if $from; 452 453 @ca; 454} 455 456 457=head2 strip_blast_html 458 459 Usage : $boolean = &strip_blast_html( string_ref ); 460 : This method is exported. 461 Purpose : Removes HTML formatting from a supplied string. 462 : Attempts to restore the Blast report to enable 463 : parsing by Bio::SearchIO::blast.pm 464 Returns : Boolean: true if string was stripped, false if not. 465 Argument : string_ref = reference to a string containing the whole Blast 466 : report containing HTML formatting. 467 Throws : Croaks if the argument is not a scalar reference. 468 Comments : Based on code originally written by Alex Dong Li 469 : (ali@genet.sickkids.on.ca). 470 : This method does some Blast-specific stripping 471 : (adds back a '>' character in front of each HSP 472 : alignment listing). 473 : 474 : THIS METHOD IS VERY SENSITIVE TO BLAST FORMATTING CHANGES! 475 : 476 : Removal of the HTML tags and accurate reconstitution of the 477 : non-HTML-formatted report is highly dependent on structure of 478 : the HTML-formatted version. For example, it assumes that first 479 : line of each alignment section (HSP listing) starts with a 480 : <a name=..> anchor tag. This permits the reconstruction of the 481 : original report in which these lines begin with a ">". 482 : This is required for parsing. 483 : 484 : If the structure of the Blast report itself is not intended to 485 : be a standard, the structure of the HTML-formatted version 486 : is even less so. Therefore, the use of this method to 487 : reconstitute parsable Blast reports from HTML-format versions 488 : should be considered a temporary solution. 489 490=cut 491 492#-------------------- 493sub strip_blast_html { 494#-------------------- 495 # This may not best way to remove html tags. However, it is simple. 496 # it won't work under following conditions: 497 # 1) if quoted > appears in a tag (does this ever happen?) 498 # 2) if a tag is split over multiple lines and this method is 499 # used to process one line at a time. 500 501 my ($string_ref) = shift; 502 503 ref $string_ref eq 'SCALAR' or 504 croak ("Can't strip HTML: ". 505 "Argument is should be a SCALAR reference not a ${\ref $string_ref}\n"); 506 507 my $str = $$string_ref; 508 my $stripped = 0; 509 510 # Removing "<a name =...>" and adding the '>' character for 511 # HSP alignment listings. 512 $str =~ s/(\A|\n)<a name ?=[^>]+> ?/>/sgi and $stripped = 1; 513 514 # Removing all "<>" tags. 515 $str =~ s/<[^>]+>| //sgi and $stripped = 1; 516 517 # Re-uniting any lone '>' characters. 518 $str =~ s/(\A|\n)>\s+/\n\n>/sgi and $stripped = 1; 519 520 $$string_ref = $str; 521 $stripped; 522} 523 524 5251; 526 527 528