1=head1 NAME
2
3Bio::Search::BlastUtils - Utility functions for Bio::Search:: BLAST objects
4
5=head1 SYNOPSIS
6
7 # This module is just a collection of subroutines, not an object.
8
9See L<Bio::Search::Hit::BlastHit>.
10
11=head1 DESCRIPTION
12
13The BlastUtils.pm module is a collection of subroutines used primarily by
14Bio::Search::Hit::BlastHit objects for some of the additional
15functionality, such as HSP tiling. Right now, the BlastUtils is just a
16collection of methods, not an object, and it's tightly coupled to
17Bio::Search::Hit::BlastHit. A goal for the future is to generalize it
18to work based on the Bio::Search interfaces, then it can work with any
19objects that implements them.
20
21=head1 AUTHOR
22
23Steve Chervitz E<lt>sac@bioperl.orgE<gt>
24
25=cut
26
27#'
28
29package Bio::Search::BlastUtils;
30$Bio::Search::BlastUtils::VERSION = '1.7.7';
31
32=head2 tile_hsps
33
34 Usage     : tile_hsps( $sbjct );
35           : This is called automatically by Bio::Search::Hit::BlastHit
36           : during object construction or
37           : as needed by methods that rely on having tiled data.
38 Purpose   : Collect statistics about the aligned sequences in a set of HSPs.
39           : Calculates the following data across all HSPs:
40           :    -- total alignment length
41           :    -- total identical residues
42           :    -- total conserved residues
43 Returns   : n/a
44 Argument  : A Bio::Search::Hit::BlastHit object
45 Throws    : n/a
46 Comments  :
47 	   : This method is *strongly* coupled to Bio::Search::Hit::BlastHit
48 	   : (it accesses BlastHit data members directly).
49 	   : TODO: Re-write this to the Bio::Search::Hit::HitI interface.
50 	   :
51           : This method performs more careful summing of data across
52           : all HSPs in the Sbjct object. Only HSPs that are in the same strand
53           : and frame are tiled. Simply summing the data from all HSPs
54           : in the same strand and frame will overestimate the actual
55           : length of the alignment if there is overlap between different HSPs
56           : (often the case).
57           :
58           : The strategy is to tile the HSPs and sum over the
59           : contigs, collecting data separately from overlapping and
60           : non-overlapping regions of each HSP. To facilitate this, the
61           : HSP.pm object now permits extraction of data from sub-sections
62           : of an HSP.
63           :
64           : Additional useful information is collected from the results
65           : of the tiling. It is possible that sub-sequences in
66           : different HSPs will overlap significantly. In this case, it
67           : is impossible to create a single unambiguous alignment by
68           : concatenating the HSPs. The ambiguity may indicate the
69           : presence of multiple, similar domains in one or both of the
70           : aligned sequences. This ambiguity is recorded using the
71           : ambiguous_aln() method.
72           :
73           : This method does not attempt to discern biologically
74           : significant vs. insignificant overlaps. The allowable amount of
75           : overlap can be set with the overlap() method or with the -OVERLAP
76           : parameter used when constructing the Blast & Sbjct objects.
77           :
78           : For a given hit, both the query and the sbjct sequences are
79           : tiled independently.
80           :
81           :    -- If only query sequence HSPs overlap,
82           :          this may suggest multiple domains in the sbjct.
83           :    -- If only sbjct sequence HSPs overlap,
84           :          this may suggest multiple domains in the query.
85           :    -- If both query & sbjct sequence HSPs overlap,
86           :          this suggests multiple domains in both.
87           :    -- If neither query & sbjct sequence HSPs overlap,
88           :          this suggests either no multiple domains in either
89           :          sequence OR that both sequences have the same
90           :          distribution of multiple similar domains.
91           :
92           : This method can deal with the special case of when multiple
93           : HSPs exactly overlap.
94           :
95           : Efficiency concerns:
96           :  Speed will be an issue for sequences with numerous HSPs.
97           :
98 Bugs      : Currently, tile_hsps() does not properly account for
99           : the number of non-tiled but overlapping HSPs, which becomes a problem
100           : as overlap() grows. Large values overlap() may thus lead to
101           : incorrect statistics for some hits. For best results, keep overlap()
102           : below 5 (DEFAULT IS 2). For more about this, see the "HSP Tiling and
103           : Ambiguous Alignments" section in L<Bio::Search::Hit::BlastHit>.
104
105See Also   : L<_adjust_contigs>(), L<Bio::Search::Hit::BlastHit|Bio::Search::Hit::BlastHit>
106
107=cut
108
109#--------------
110sub tile_hsps {
111#--------------
112    my $sbjct = shift;
113
114    $sbjct->{'_tile_hsps'} = 1;
115    $sbjct->{'_gaps_query'} = 0;
116    $sbjct->{'_gaps_sbjct'} = 0;
117
118    ## Simple summation scheme. Valid if there is only one HSP.
119    if((defined($sbjct->{'_n'}) and $sbjct->{'_n'} == 1) or $sbjct->num_hsps == 1) {
120	my $hsp = $sbjct->hsp;
121	$sbjct->{'_length_aln_query'} = $hsp->length('query');
122	$sbjct->{'_length_aln_sbjct'} = $hsp->length('sbjct');
123	$sbjct->{'_length_aln_total'} = $hsp->length('total');
124	($sbjct->{'_totalIdentical'},$sbjct->{'_totalConserved'}) = $hsp->matches();
125	$sbjct->{'_gaps_query'} = $hsp->gaps('query');
126	$sbjct->{'_gaps_sbjct'} = $hsp->gaps('sbjct');
127
128#	print "_tile_hsps(): single HSP, easy stats.\n";
129	return;
130    } else {
131#	print STDERR "Sbjct: _tile_hsps: summing multiple HSPs\n";
132	$sbjct->{'_length_aln_query'} = 0;
133	$sbjct->{'_length_aln_sbjct'} = 0;
134	$sbjct->{'_length_aln_total'} = 0;
135	$sbjct->{'_totalIdentical'}   = 0;
136	$sbjct->{'_totalConserved'}   = 0;
137    }
138
139    ## More than one HSP. Must tile HSPs.
140#    print "\nTiling HSPs for $sbjct\n";
141    my($hsp, $qstart, $qstop, $sstart, $sstop);
142    my($frame, $strand, $qstrand, $sstrand);
143    my(@qcontigs, @scontigs);
144    my $qoverlap = 0;
145    my $soverlap = 0;
146    my $max_overlap = $sbjct->{'_overlap'};
147
148    foreach $hsp ($sbjct->hsps()) {
149#	printf "  HSP: %s\n%s\n",$hsp->name, $hsp->str('query');
150#	printf "  Length = %d; Identical = %d; Conserved = %d; Conserved(1-10): %d",$hsp->length, $hsp->length(-TYPE=>'iden'), $hsp->length(-TYPE=>'cons'), $hsp->length(-TYPE=>'cons',-START=>0,-STOP=>10);
151	($qstart, $qstop) = $hsp->range('query');
152	($sstart, $sstop) = $hsp->range('sbjct');
153	$frame = $hsp->frame('hit');
154	$frame = -1 unless defined $frame;
155	($qstrand, $sstrand) = $hsp->strand;
156
157	my ($qgaps, $sgaps)  = $hsp->gaps();
158	$sbjct->{'_gaps_query'} += $qgaps;
159	$sbjct->{'_gaps_sbjct'} += $sgaps;
160
161	$sbjct->{'_length_aln_total'} += $hsp->length;
162	## Collect contigs in the query sequence.
163	$qoverlap = &_adjust_contigs('query', $hsp, $qstart, $qstop, \@qcontigs, $max_overlap, $frame, $qstrand);
164
165	## Collect contigs in the sbjct sequence (needed for domain data and gapped Blast).
166	$soverlap = &_adjust_contigs('sbjct', $hsp, $sstart, $sstop, \@scontigs, $max_overlap, $frame, $sstrand);
167
168	## Collect overall start and stop data for query and sbjct over all HSPs.
169	if(not defined $sbjct->{'_queryStart'}) {
170	    $sbjct->{'_queryStart'} = $qstart;
171	    $sbjct->{'_queryStop'}  = $qstop;
172	    $sbjct->{'_sbjctStart'} = $sstart;
173	    $sbjct->{'_sbjctStop'}  = $sstop;
174	} else {
175	    $sbjct->{'_queryStart'} = ($qstart < $sbjct->{'_queryStart'} ? $qstart : $sbjct->{'_queryStart'});
176	    $sbjct->{'_queryStop'}  = ($qstop  > $sbjct->{'_queryStop'}  ? $qstop  : $sbjct->{'_queryStop'});
177	    $sbjct->{'_sbjctStart'} = ($sstart < $sbjct->{'_sbjctStart'} ? $sstart : $sbjct->{'_sbjctStart'});
178	    $sbjct->{'_sbjctStop'}  = ($sstop  > $sbjct->{'_sbjctStop'}  ? $sstop  : $sbjct->{'_sbjctStop'});
179	}
180    }
181
182    ## Collect data across the collected contigs.
183
184#    print "\nQUERY CONTIGS:\n";
185#    print "  gaps = $sbjct->{'_gaps_query'}\n";
186
187    # TODO: Account for strand/frame issue!
188    # Strategy: collect data on a per strand+frame basis and save the most significant one.
189    my (%qctg_dat);
190    foreach(@qcontigs) {
191#	print "  query contig: $_->{'start'} - $_->{'stop'}\n";
192#	print "         iden = $_->{'iden'}; cons = $_->{'cons'}\n";
193	($frame, $strand) = ($_->{'frame'}, $_->{'strand'});
194	$qctg_dat{ "$frame$strand" }->{'length_aln_query'} += $_->{'stop'} - $_->{'start'} + 1;
195	$qctg_dat{ "$frame$strand" }->{'totalIdentical'}   += $_->{'iden'};
196	$qctg_dat{ "$frame$strand" }->{'totalConserved'}   += $_->{'cons'};
197	$qctg_dat{ "$frame$strand" }->{'qstrand'}   = $strand;
198    }
199
200    # Find longest contig.
201    my @sortedkeys = reverse sort { $qctg_dat{ $a }->{'length_aln_query'} <=> $qctg_dat{ $b }->{'length_aln_query'} } keys %qctg_dat;
202
203    # Save the largest to the sbjct:
204    my $longest = $sortedkeys[0];
205    $sbjct->{'_length_aln_query'} = $qctg_dat{ $longest }->{'length_aln_query'};
206    $sbjct->{'_totalIdentical'}   = $qctg_dat{ $longest }->{'totalIdentical'};
207    $sbjct->{'_totalConserved'}   = $qctg_dat{ $longest }->{'totalConserved'};
208    $sbjct->{'_qstrand'} = $qctg_dat{ $longest }->{'qstrand'};
209
210    ## Collect data for sbjct contigs. Important for gapped Blast.
211    ## The totalIdentical and totalConserved numbers will be the same
212    ## as determined for the query contigs.
213
214#    print "\nSBJCT CONTIGS:\n";
215#    print "  gaps = $sbjct->{'_gaps_sbjct'}\n";
216
217    my (%sctg_dat);
218    foreach(@scontigs) {
219#	print "  sbjct contig: $_->{'start'} - $_->{'stop'}\n";
220#	print "         iden = $_->{'iden'}; cons = $_->{'cons'}\n";
221	($frame, $strand) = ($_->{'frame'}, $_->{'strand'});
222	$sctg_dat{ "$frame$strand" }->{'length_aln_sbjct'}   += $_->{'stop'} - $_->{'start'} + 1;
223	$sctg_dat{ "$frame$strand" }->{'frame'}  = $frame;
224	$sctg_dat{ "$frame$strand" }->{'sstrand'}  = $strand;
225    }
226
227    @sortedkeys = reverse sort { $sctg_dat{ $a }->{'length_aln_sbjct'} <=> $sctg_dat{ $b }->{'length_aln_sbjct'} } keys %sctg_dat;
228
229    # Save the largest to the sbjct:
230    $longest = $sortedkeys[0];
231
232    $sbjct->{'_length_aln_sbjct'} = $sctg_dat{ $longest }->{'length_aln_sbjct'};
233    $sbjct->{'_frame'} = $sctg_dat{ $longest }->{'frame'};
234    $sbjct->{'_sstrand'} = $sctg_dat{ $longest }->{'sstrand'};
235
236    if($qoverlap) {
237	if($soverlap) { $sbjct->ambiguous_aln('qs');
238#			print "\n*** AMBIGUOUS ALIGNMENT: Query and Sbjct\n\n";
239		      }
240	else { $sbjct->ambiguous_aln('q');
241#	       print "\n*** AMBIGUOUS ALIGNMENT: Query\n\n";
242	   }
243    } elsif($soverlap) {
244	$sbjct->ambiguous_aln('s');
245#	print "\n*** AMBIGUOUS ALIGNMENT: Sbjct\n\n";
246    }
247
248    # Adjust length based on BLAST flavor.
249    my $prog = $sbjct->algorithm;
250    if($prog eq 'TBLASTN') {
251	$sbjct->{'_length_aln_sbjct'} /= 3;
252    } elsif($prog eq 'BLASTX' ) {
253	$sbjct->{'_length_aln_query'} /= 3;
254    } elsif($prog eq 'TBLASTX') {
255	$sbjct->{'_length_aln_query'} /= 3;
256	$sbjct->{'_length_aln_sbjct'} /= 3;
257    }
258}
259
260
261
262=head2 _adjust_contigs
263
264 Usage     : n/a; called automatically during object construction.
265 Purpose   : Builds HSP contigs for a given BLAST hit.
266           : Utility method called by _tile_hsps()
267 Returns   :
268 Argument  :
269 Throws    : Exceptions propagated from Bio::Search::Hit::BlastHSP::matches()
270           : for invalid sub-sequence ranges.
271 Status    : Experimental
272 Comments  : This method does not currently support gapped alignments.
273           : Also, it does not keep track of the number of HSPs that
274           : overlap within the amount specified by overlap().
275           : This will lead to significant tracking errors for large
276           : overlap values.
277
278See Also   : L<tile_hsps>(), L<Bio::Search::Hit::BlastHSP::matches|Bio::Search::Hit::BlastHSP>
279
280=cut
281
282#-------------------
283sub _adjust_contigs {
284#-------------------
285    my ($seqType, $hsp, $start, $stop, $contigs_ref, $max_overlap, $frame, $strand) = @_;
286
287    my $overlap = 0;
288    my ($numID, $numCons);
289
290#    print STDERR "Testing $seqType data: HSP (${\$hsp->name});  $start, $stop, strand=$strand, frame=$frame\n";
291    foreach(@$contigs_ref) {
292#	print STDERR "  Contig: $_->{'start'} - $_->{'stop'}, strand=$_->{'strand'}, frame=$_->{'frame'}, iden= $_->{'iden'}, cons= $_->{'cons'}\n";
293
294	# Don't merge things unless they have matching strand/frame.
295	next unless ($_->{'frame'} == $frame and $_->{'strand'} == $strand);
296
297	## Test special case of a nested HSP. Skip it.
298	if($start >= $_->{'start'} and $stop <= $_->{'stop'}) {
299#	    print STDERR "----> Nested HSP. Skipping.\n";
300	    $overlap = 1;
301	    next;
302	}
303
304	## Test for overlap at beginning of contig.
305	if($start < $_->{'start'} and $stop > ($_->{'start'} + $max_overlap)) {
306#	    print STDERR "----> Overlaps beg: existing beg,end: $_->{'start'},$_->{'stop'}, new beg,end: $start,$stop\n";
307	    # Collect stats over the non-overlapping region.
308	    eval {
309		($numID, $numCons) = $hsp->matches(-SEQ   =>$seqType,
310						   -START =>$start,
311						   -STOP  =>$_->{'start'}-1);
312	    };
313	    if($@) { warn "\a\n$@\n"; }
314	    else {
315		$_->{'start'} = $start; # Assign a new start coordinate to the contig
316		$_->{'iden'} += $numID; # and add new data to #identical, #conserved.
317		$_->{'cons'} += $numCons;
318		$overlap     = 1;
319	    }
320	}
321
322	## Test for overlap at end of contig.
323	if($stop > $_->{'stop'} and $start < ($_->{'stop'} - $max_overlap)) {
324#	    print STDERR "----> Overlaps end: existing beg,end: $_->{'start'},$_->{'stop'}, new beg,end: $start,$stop\n";
325	    # Collect stats over the non-overlapping region.
326	    eval {
327		($numID,$numCons) = $hsp->matches(-SEQ   =>$seqType,
328						  -START =>$_->{'stop'},
329						  -STOP  =>$stop);
330	    };
331	    if($@) { warn "\a\n$@\n"; }
332	    else {
333		$_->{'stop'}  = $stop;  # Assign a new stop coordinate to the contig
334		$_->{'iden'} += $numID; # and add new data to #identical, #conserved.
335		$_->{'cons'} += $numCons;
336		$overlap    = 1;
337	    }
338	}
339	$overlap && do {
340#		print STDERR " New Contig data:\n";
341#		print STDERR "  Contig: $_->{'start'} - $_->{'stop'}, iden= $_->{'iden'}, cons= $_->{'cons'}\n";
342		last;
343	    };
344    }
345    ## If there is no overlap, add the complete HSP data.
346    !$overlap && do {
347#	print STDERR "No overlap. Adding new contig.\n";
348	($numID,$numCons) = $hsp->matches(-SEQ=>$seqType);
349	push @$contigs_ref, {'start'=>$start, 'stop'=>$stop,
350			     'iden'=>$numID,  'cons'=>$numCons,
351			     'strand'=>$strand, 'frame'=>$frame};
352    };
353    $overlap;
354}
355
356=head2 get_exponent
357
358 Usage     : &get_exponent( number );
359 Purpose   : Determines the power of 10 exponent of an integer, float,
360           : or scientific notation number.
361 Example   : &get_exponent("4.0e-206");
362           : &get_exponent("0.00032");
363           : &get_exponent("10.");
364           : &get_exponent("1000.0");
365           : &get_exponent("e+83");
366 Argument  : Float, Integer, or scientific notation number
367 Returns   : Integer representing the exponent part of the number (+ or -).
368           : If argument == 0 (zero), return value is "-999".
369 Comments  : Exponents are rounded up (less negative) if the mantissa is >= 5.
370           : Exponents are rounded down (more negative) if the mantissa is <= -5.
371
372=cut
373
374#------------------
375sub get_exponent {
376#------------------
377    my $data = shift;
378
379    my($num, $exp) = split /[eE]/, $data;
380
381    if( defined $exp) {
382	$num = 1 if not $num;
383	$num >= 5 and $exp++;
384	$num <= -5 and $exp--;
385    } elsif( $num == 0) {
386	$exp = -999;
387    } elsif( not $num =~ /\./) {
388	$exp = CORE::length($num) -1;
389    } else {
390	$exp = 0;
391	$num .= '0' if $num =~ /\.$/;
392	my ($c);
393	my $rev = 0;
394	if($num !~ /^0/) {
395	    $num = reverse($num);
396	    $rev = 1;
397	}
398	do { $c = chop($num);
399	     $c == 0 && $exp++;
400	 } while( $c ne '.');
401
402	$exp = -$exp if $num == 0 and not $rev;
403	$exp -= 1 if $rev;
404    }
405    return $exp;
406}
407
408=head2 collapse_nums
409
410 Usage     : @cnums = collapse_nums( @numbers );
411 Purpose   : Collapses a list of numbers into a set of ranges of consecutive terms:
412           : Useful for condensing long lists of consecutive numbers.
413           :  EXPANDED:
414           :     1 2 3 4 5 6 10 12 13 14 15 17 18 20 21 22 24 26 30 31 32
415           :  COLLAPSED:
416           :     1-6 10 12-15 17 18 20-22 24 26 30-32
417 Argument  : List of numbers sorted numerically.
418 Returns   : List of numbers mixed with ranges of numbers (see above).
419 Throws    : n/a
420
421See Also   : L<Bio::Search::Hit::BlastHit::seq_inds()|Bio::Search::Hit::BlastHit>
422
423=cut
424
425#------------------
426sub collapse_nums {
427#------------------
428# This is probably not the slickest connectivity algorithm, but will do for now.
429    my @a = @_;
430    my ($from, $to, $i, @ca, $consec);
431
432    $consec = 0;
433    for($i=0; $i < @a; $i++) {
434	not $from and do{ $from = $a[$i]; next; };
435	if($a[$i] == $a[$i-1]+1) {
436	    $to = $a[$i];
437	    $consec++;
438	} else {
439	    if($consec == 1) { $from .= ",$to"; }
440	    else { $from .= $consec>1 ? "\-$to" : ""; }
441	    push @ca, split(',', $from);
442	    $from =  $a[$i];
443	    $consec = 0;
444	    $to = undef;
445	}
446    }
447    if(defined $to) {
448	if($consec == 1) { $from .= ",$to"; }
449	else { $from .= $consec>1 ? "\-$to" : ""; }
450    }
451    push @ca, split(',', $from) if $from;
452
453    @ca;
454}
455
456
457=head2 strip_blast_html
458
459 Usage     : $boolean = &strip_blast_html( string_ref );
460           : This method is exported.
461 Purpose   : Removes HTML formatting from a supplied string.
462           : Attempts to restore the Blast report to enable
463           : parsing by Bio::SearchIO::blast.pm
464 Returns   : Boolean: true if string was stripped, false if not.
465 Argument  : string_ref = reference to a string containing the whole Blast
466           :              report containing HTML formatting.
467 Throws    : Croaks if the argument is not a scalar reference.
468 Comments  : Based on code originally written by Alex Dong Li
469           : (ali@genet.sickkids.on.ca).
470           : This method does some Blast-specific stripping
471           : (adds back a '>' character in front of each HSP
472           : alignment listing).
473           :
474           : THIS METHOD IS VERY SENSITIVE TO BLAST FORMATTING CHANGES!
475           :
476           : Removal of the HTML tags and accurate reconstitution of the
477           : non-HTML-formatted report is highly dependent on structure of
478           : the HTML-formatted version. For example, it assumes that first
479           : line of each alignment section (HSP listing) starts with a
480           : <a name=..> anchor tag. This permits the reconstruction of the
481           : original report in which these lines begin with a ">".
482           : This is required for parsing.
483           :
484           : If the structure of the Blast report itself is not intended to
485           : be a standard, the structure of the HTML-formatted version
486           : is even less so. Therefore, the use of this method to
487           : reconstitute parsable Blast reports from HTML-format versions
488           : should be considered a temporary solution.
489
490=cut
491
492#--------------------
493sub strip_blast_html {
494#--------------------
495      # This may not best way to remove html tags. However, it is simple.
496      # it won't work under following conditions:
497      #    1) if quoted > appears in a tag  (does this ever happen?)
498      #    2) if a tag is split over multiple lines and this method is
499      #       used to process one line at a time.
500
501    my ($string_ref) = shift;
502
503    ref $string_ref eq 'SCALAR' or
504	croak ("Can't strip HTML: ".
505	       "Argument is should be a SCALAR reference not a ${\ref $string_ref}\n");
506
507    my $str = $$string_ref;
508    my $stripped = 0;
509
510    # Removing "<a name =...>" and adding the '>' character for
511    # HSP alignment listings.
512    $str =~ s/(\A|\n)<a name ?=[^>]+> ?/>/sgi and $stripped = 1;
513
514    # Removing all "<>" tags.
515    $str =~ s/<[^>]+>|&nbsp//sgi and $stripped = 1;
516
517    # Re-uniting any lone '>' characters.
518    $str =~ s/(\A|\n)>\s+/\n\n>/sgi and $stripped = 1;
519
520    $$string_ref = $str;
521    $stripped;
522}
523
524
5251;
526
527
528