1# 2# BioPerl module for Bio::SeqIO::genbank 3# 4# Please direct questions and support issues to <bioperl-l@bioperl.org> 5# 6# Cared for by Bioperl project bioperl-l(at)bioperl.org 7# 8# Copyright Elia Stupka and contributors see AUTHORS section 9# 10# You may distribute this module under the same terms as perl itself 11 12# POD documentation - main docs before the code 13 14=head1 NAME 15 16Bio::SeqIO::genbank - GenBank sequence input/output stream 17 18=head1 SYNOPSIS 19 20It is probably best not to use this object directly, but 21rather go through the SeqIO handler: 22 23 $stream = Bio::SeqIO->new(-file => $filename, 24 -format => 'GenBank'); 25 26 while ( my $seq = $stream->next_seq ) { 27 # do something with $seq 28 } 29 30 31=head1 DESCRIPTION 32 33This object can transform Bio::Seq objects to and from GenBank flat 34file databases. 35 36There is some flexibility here about how to write GenBank output 37that is not fully documented. 38 39=head2 Optional functions 40 41=over 3 42 43=item _show_dna() 44 45(output only) shows the dna or not 46 47=item _post_sort() 48 49(output only) provides a sorting func which is applied to the FTHelpers 50before printing 51 52=item _id_generation_func() 53 54This is function which is called as 55 56 print "ID ", $func($seq), "\n"; 57 58To generate the ID line. If it is not there, it generates a sensible ID 59line using a number of tools. 60 61If you want to output annotations in Genbank format they need to be 62stored in a Bio::Annotation::Collection object which is accessible 63through the Bio::SeqI interface method L<annotation()|annotation>. 64 65The following are the names of the keys which are pulled from a 66L<Bio::Annotation::Collection> object: 67 68 reference - Should contain Bio::Annotation::Reference objects 69 comment - Should contain Bio::Annotation::Comment objects 70 dblink - Should contain a Bio::Annotation::DBLink object 71 segment - Should contain a Bio::Annotation::SimpleValue object 72 origin - Should contain a Bio::Annotation::SimpleValue object 73 wgs - Should contain a Bio::Annotation::SimpleValue object 74 75=back 76 77=head1 Where does the data go? 78 79Data parsed in Bio::SeqIO::genbank is stored in a variety of data 80fields in the sequence object that is returned. Here is a partial list 81of fields. 82 83Items listed as RichSeq or Seq or PrimarySeq and then NAME() tell you 84the top level object which defines a function called NAME() which 85stores this information. 86 87Items listed as Annotation 'NAME' tell you the data is stored the 88associated Bio::AnnotationCollectionI object which is associated with 89Bio::Seq objects. If it is explicitly requested that no annotations 90should be stored when parsing a record of course they will not be 91available when you try and get them. If you are having this problem 92look at the type of SeqBuilder that is being used to construct your 93sequence object. 94 95 Comments Annotation 'comment' 96 References Annotation 'reference' 97 Segment Annotation 'segment' 98 Origin Annotation 'origin' 99 Dbsource Annotation 'dblink' 100 101 Accessions PrimarySeq accession_number() 102 Secondary accessions RichSeq get_secondary_accessions() 103 GI number PrimarySeq primary_id() 104 LOCUS PrimarySeq display_id() 105 Keywords RichSeq get_keywords() 106 Dates RichSeq get_dates() 107 Molecule RichSeq molecule() 108 Seq Version RichSeq seq_version() 109 PID RichSeq pid() 110 Division RichSeq division() 111 Features Seq get_SeqFeatures() 112 Alphabet PrimarySeq alphabet() 113 Definition PrimarySeq description() or desc() 114 Version PrimarySeq version() 115 116 Sequence PrimarySeq seq() 117 118There is more information in the Feature-Annotation HOWTO about each 119field and how it is mapped to the Sequence object 120L<http://bioperl.org/howtos/Features_and_Annotations_HOWTO.html>. 121 122=head1 FEEDBACK 123 124=head2 Mailing Lists 125 126User feedback is an integral part of the evolution of this and other 127Bioperl modules. Send your comments and suggestions preferably to one 128of the Bioperl mailing lists. Your participation is much appreciated. 129 130 bioperl-l@bioperl.org - General discussion 131 http://bioperl.org/wiki/Mailing_lists - About the mailing lists 132 133=head2 Support 134 135Please direct usage questions or support issues to the mailing list: 136 137I<bioperl-l@bioperl.org> 138 139rather than to the module maintainer directly. Many experienced and 140reponsive experts will be able look at the problem and quickly 141address it. Please include a thorough description of the problem 142with code and data examples if at all possible. 143 144=head2 Reporting Bugs 145 146Report bugs to the Bioperl bug tracking system to help us keep track 147the bugs and their resolution. Bug reports can be submitted via the web: 148 149 https://github.com/bioperl/bioperl-live/issues 150 151=head1 AUTHOR - Bioperl Project 152 153bioperl-l at bioperl.org 154 155Original author Elia Stupka, elia -at- tigem.it 156 157=head1 CONTRIBUTORS 158 159Ewan Birney birney at ebi.ac.uk 160Jason Stajich jason at bioperl.org 161Chris Mungall cjm at fruitfly.bdgp.berkeley.edu 162Lincoln Stein lstein at cshl.org 163Heikki Lehvaslaiho, heikki at ebi.ac.uk 164Hilmar Lapp, hlapp at gmx.net 165Donald G. Jackson, donald.jackson at bms.com 166James Wasmuth, james.wasmuth at ed.ac.uk 167Brian Osborne, bosborne at alum.mit.edu 168Chris Fields, cjfields at bioperl dot org 169 170=head1 APPENDIX 171 172The rest of the documentation details each of the object 173methods. Internal methods are usually preceded with a _ 174 175=cut 176 177# Let the code begin... 178 179package Bio::SeqIO::genbank; 180$Bio::SeqIO::genbank::VERSION = '1.7.7'; 181use strict; 182 183use Bio::SeqIO::FTHelper; 184use Bio::SeqFeature::Generic; 185use Bio::Species; 186use Bio::Seq::SeqFactory; 187use Bio::Annotation::Collection; 188use Bio::Annotation::Comment; 189use Bio::Annotation::Reference; 190use Bio::Annotation::DBLink; 191 192use base qw(Bio::SeqIO); 193 194our $FTQUAL_LINE_LENGTH = 60; 195 196our %FTQUAL_NO_QUOTE = map {$_ => 1} qw( 197 anticodon citation 198 codon codon_start 199 cons_splice direction 200 evidence label 201 mod_base number 202 rpt_type rpt_unit 203 transl_except transl_table 204 usedin 205 ); 206 207our %DBSOURCE = map {$_ => 1} qw( 208 EchoBASE IntAct SWISS-2DPAGE ECO2DBASE ECOGENE TIGRFAMs 209 TIGR GO InterPro Pfam PROSITE SGD GermOnline 210 HSSP PhosSite Ensembl RGD AGD ArrayExpress KEGG 211 H-InvDB HGNC LinkHub PANTHER PRINTS SMART SMR 212 MGI MIM RZPD-ProtExp ProDom MEROPS TRANSFAC Reactome 213 UniGene GlycoSuiteDB PIRSF HSC-2DPAGE PHCI-2DPAGE 214 PMMA-2DPAGE Siena-2DPAGE Rat-heart-2DPAGE Aarhus/Ghent-2DPAGE 215 Biocyc MetaCyc Biocyc:Metacyc GenomeReviews FlyBase 216 TMHOBP COMPLUYEAST-2DPAGE OGP DictyBase HAMAP 217 PhotoList Gramene WormBase WormPep Genew ZFIN 218 PeroxiBase MaizeDB TAIR DrugBank REBASE HPA 219 swissprot GenBank GenPept REFSEQ embl PDB UniProtKB 220 DIP PeptideAtlas PRIDE CYGD HOGENOME Gene3D 221 Project); 222 223our %VALID_MOLTYPE = map {$_ => 1} qw(NA DNA RNA tRNA rRNA cDNA cRNA ms-DNA 224 mRNA uRNA ss-RNA ss-DNA snRNA snoRNA PRT); 225 226our %VALID_ALPHABET = ( 227 'bp' => 'dna', 228 'aa' => 'protein', 229 'rc' => '' # rc = release candidate; file has no sequences 230); 231 232sub _initialize { 233 my($self, @args) = @_; 234 235 $self->SUPER::_initialize(@args); 236 # hash for functions for decoding keys. 237 $self->{'_func_ftunit_hash'} = {}; 238 $self->_show_dna(1); # sets this to one by default. People can change it 239 if ( not defined $self->sequence_factory ) { 240 $self->sequence_factory 241 (Bio::Seq::SeqFactory->new(-verbose => $self->verbose, 242 -type => 'Bio::Seq::RichSeq')); 243 } 244} 245 246=head2 next_seq 247 248 Title : next_seq 249 Usage : $seq = $stream->next_seq() 250 Function: returns the next sequence in the stream 251 Returns : Bio::Seq object 252 Args : 253 254=cut 255 256sub next_seq { 257 my ($self, @args) = @_; 258 my %args = @args; 259 my $builder = $self->sequence_builder; 260 my $seq; 261 my %params; 262 263 RECORDSTART: 264 while (1) { 265 my $buffer; 266 my ( @acc, @features ); 267 my ( $display_id, $annotation ); 268 my $species; 269 270 # initialize; we may come here because of starting over 271 @features = (); 272 $annotation = undef; 273 @acc = (); 274 $species = undef; 275 %params = ( -verbose => $self->verbose ); # reset hash 276 local ($/) = "\n"; 277 while ( defined( $buffer = $self->_readline ) ) { 278 last if index( $buffer, 'LOCUS ' ) == 0; 279 } 280 return unless defined $buffer; # end of file 281 $buffer =~ /^LOCUS\s+(\S.*)$/o 282 or $self->throw( "GenBank stream with bad LOCUS line. " 283 . "Not GenBank in my book. Got '$buffer'"); 284 my @tokens = split( ' ', $1 ); 285 286 # this is important to have the id for display in e.g. FTHelper, 287 # otherwise you won't know which entry caused an error 288 $display_id = shift @tokens; 289 $params{'-display_id'} = $display_id; 290 291 # may still be useful if we don't want the seq 292 my $seqlength = shift @tokens; 293 if ( exists $VALID_ALPHABET{$seqlength} ) { 294 # moved one token too far. No locus name? 295 $self->warn( "Bad LOCUS name? Changing [$params{'-display_id'}] " 296 . "to 'unknown' and length to '$display_id'" 297 ); 298 $params{'-display_id'} = 'unknown'; 299 $params{'-length'} = $display_id; 300 301 # add token back... 302 unshift @tokens, $seqlength; 303 } 304 else { 305 $params{'-length'} = $seqlength; 306 } 307 308 # the alphabet of the entry 309 # shouldn't assign alphabet unless one 310 # is specifically designated (such as for rc files) 311 my $alphabet = lc( shift @tokens ); 312 $params{'-alphabet'} = 313 ( exists $VALID_ALPHABET{$alphabet} ) 314 ? $VALID_ALPHABET{$alphabet} 315 : $self->warn("Unknown alphabet: $alphabet"); 316 317 # for aa there is usually no 'molecule' (mRNA etc) 318 if ( $params{'-alphabet'} eq 'protein' ) { 319 $params{'-molecule'} = 'PRT'; 320 } 321 else { 322 $params{'-molecule'} = shift(@tokens); 323 } 324 325 # take care of lower case issues 326 if ( $params{'-molecule'} eq 'dna' or $params{'-molecule'} eq 'rna' ) { 327 $params{'-molecule'} = uc $params{'-molecule'}; 328 } 329 $self->debug( "Unrecognized molecule type: " . $params{'-molecule'} ) 330 if not exists( $VALID_MOLTYPE{ $params{'-molecule'} } ); 331 332 my $circ = shift @tokens; 333 if ( $circ eq 'circular' ) { 334 $params{'-is_circular'} = 1; 335 $params{'-division'} = shift @tokens; 336 } 337 else { 338 # 'linear' or 'circular' may actually be omitted altogether 339 $params{'-division'} = 340 ( CORE::length($circ) == 3 ) ? $circ : shift @tokens; 341 } 342 my $date = join( ' ', @tokens ); # we lump together the rest 343 344 # this is per request bug #1513 345 # we can handle: 346 # 9-10-2003 347 # 9-10-03 348 # 09-10-2003 349 # 09-10-03 350 if ( $date =~ s/\s*((\d{1,2})-(\w{3})-(\d{2,4})).*/$1/ ) { 351 if ( length($date) < 11 ) { 352 # improperly formatted date 353 # But we'll be nice and fix it for them 354 my ( $d, $m, $y ) = ( $2, $3, $4 ); 355 if ( length($d) == 1 ) { 356 $d = "0$d"; 357 } 358 359 # guess the century here 360 if ( length($y) == 2 ) { 361 if ( $y > 60 ) { # arbitrarily guess that '60' means 1960 362 $y = "19$y"; 363 } 364 else { 365 $y = "20$y"; 366 } 367 $self->warn( "Date was malformed, guessing the " 368 . "century for $date to be $y\n" 369 ); 370 } 371 $params{'-dates'} = [ join( '-', $d, $m, $y ) ]; 372 } 373 else { 374 $params{'-dates'} = [$date]; 375 } 376 } 377 378 # set them all at once 379 $builder->add_slot_value(%params); 380 %params = (); 381 382 # parse the rest if desired, otherwise start over 383 if ( not $builder->want_object ) { 384 $builder->make_object; 385 next RECORDSTART; 386 } 387 388 # set up annotation depending on what the builder wants 389 if ( $builder->want_slot('annotation') ) { 390 $annotation = Bio::Annotation::Collection->new; 391 } 392 393 $buffer = $self->_readline; 394 while ( defined( my $line = $buffer ) ) { 395 # Description line(s) 396 if ($line =~ /^DEFINITION\s+(\S.*\S)/) { 397 my @desc = ($1); 398 while ( defined( $line = $self->_readline ) ) { 399 if ($line =~ /^\s+(.*)/) { 400 push( @desc, $1 ); 401 next; 402 } 403 last; 404 } 405 $builder->add_slot_value( -desc => join( ' ', @desc ) ); 406 407 # we'll continue right here because DEFINITION 408 # always comes at the top of the entry 409 $buffer = $line; 410 } 411 412 # accession number (there can be multiple accessions) 413 if ($line =~ /^ACCESSION\s+(\S.*\S)/) { 414 push( @acc, split( /\s+/, $1 ) ); 415 while ( defined( $line = $self->_readline ) ) { 416 if ($line =~ /^\s+(.*)/) { 417 push( @acc, split( /\s+/, $1 ) ); 418 next; 419 } 420 last; 421 } 422 $buffer = $line; 423 next; 424 } 425 426 # PID 427 elsif ($line =~ /^PID\s+(\S+)/) { 428 $params{'-pid'} = $1; 429 } 430 431 # Version number 432 elsif ($line =~ /^VERSION\s+(\S.+)$/) { 433 my ( $acc, $gi ) = split( ' ', $1 ); 434 if ( $acc =~ /^\w+\.(\d+)/ ) { 435 $params{'-version'} = $1; 436 $params{'-seq_version'} = $1; 437 } 438 if ( $gi && ( index( $gi, "GI:" ) == 0 ) ) { 439 $params{'-primary_id'} = substr( $gi, 3 ); 440 } 441 } 442 443 # Keywords 444 elsif ($line =~ /^KEYWORDS\s+(\S.*)/) { 445 my @kw = split( /\s*\;\s*/, $1 ); 446 while ( defined( $line = $self->_readline ) ) { 447 chomp $line; 448 if ($line =~ /^\s+(.*)/) { 449 push( @kw, split( /\s*\;\s*/, $1 ) ); 450 next; 451 } 452 last; 453 } 454 455 @kw && $kw[-1] =~ s/\.$//; 456 $params{'-keywords'} = \@kw; 457 $buffer = $line; 458 next; 459 } 460 461 # Organism name and phylogenetic information 462 elsif ($line =~ /^SOURCE\s+\S/) { 463 if ( $builder->want_slot('species') ) { 464 $species = $self->_read_GenBank_Species( \$buffer ); 465 $builder->add_slot_value( -species => $species ); 466 } 467 else { 468 while ( defined( $buffer = $self->_readline ) ) { 469 last if substr( $buffer, 0, 1 ) ne ' '; 470 } 471 } 472 next; 473 } 474 475 # References 476 elsif ($line =~ /^REFERENCE\s+\S/) { 477 if ($annotation) { 478 my @refs = $self->_read_GenBank_References( \$buffer ); 479 foreach my $ref (@refs) { 480 $annotation->add_Annotation( 'reference', $ref ); 481 } 482 } 483 else { 484 while ( defined( $buffer = $self->_readline ) ) { 485 last if substr( $buffer, 0, 1 ) ne ' '; 486 } 487 } 488 next; 489 } 490 491 # Project 492 elsif ($line =~ /^PROJECT\s+(\S.*)/) { 493 if ($annotation) { 494 my $project = 495 Bio::Annotation::SimpleValue->new( -value => $1 ); 496 $annotation->add_Annotation( 'project', $project ); 497 } 498 } 499 500 # Comments may be plain text or Structured Comments. 501 # Structured Comments are made up of tag/value pairs and have beginning 502 # and end delimiters like ##*-Data-START## and ##*-Data-END## 503 elsif ($line =~ /^COMMENT\s+(\S.*)/) { 504 if ($annotation) { 505 my $comment = $1; 506 while ( defined( $line = $self->_readline ) ) { 507 last if ($line =~ /^\S/); 508 $comment .= $line; 509 } 510 $comment =~ s/ +/ /g; 511 # Structured Comment, do not remove returns in the tabular section 512 if ( my ( $text, $table )= $comment 513 =~ /([^#]*)(##\S+Data-START##.+?##\S+Data-END##)/is 514 ) { 515 $text =~ s/\n/ /g if $text; 516 $table =~ s/START##/START##\n/; 517 $table =~ s/^\s+//gm; 518 $comment = $text . "\n" . $table; 519 } 520 # Plain text, remove returns 521 else { 522 $comment =~ s/\n/ /g; 523 } 524 $annotation->add_Annotation( 525 'comment', 526 Bio::Annotation::Comment->new( 527 -text => $comment, 528 -tagname => 'comment' 529 ) 530 ); 531 $buffer = $line; 532 } 533 else { 534 while ( defined( $buffer = $self->_readline ) ) { 535 last if substr( $buffer, 0, 1 ) ne ' '; 536 } 537 } 538 next; 539 } 540 541 # Corresponding Genbank nucleotide id, Genpept only 542 elsif ($line =~ /^DB(?:SOURCE|LINK)\s+(\S.+)/) { 543 if ($annotation) { 544 my $dbsource = $1; 545 while ( defined( $line = $self->_readline ) ) { 546 last if ($line =~ /^\S/); 547 $dbsource .= $line; 548 } 549 550 # deal with UniProKB dbsources 551 if ( $dbsource =~ 552 s/(UniProt(?:KB)?|swissprot):\s+locus\s+(\S+)\,.+\n// 553 ) { 554 $annotation->add_Annotation( 555 'dblink', 556 Bio::Annotation::DBLink->new( 557 -primary_id => $2, 558 -database => $1, 559 -tagname => 'dblink' 560 ) 561 ); 562 if ( $dbsource =~ s/\s+created:\s+([^\.]+)\.\n// ) { 563 $annotation->add_Annotation( 564 'swissprot_dates', 565 Bio::Annotation::SimpleValue->new( 566 -tagname => 'date_created', 567 -value => $1 568 ) 569 ); 570 } 571 while ( $dbsource =~ 572 s/\s+(sequence|annotation)\s+updated:\s+([^\.]+)\.\n//g 573 ) { 574 $annotation->add_Annotation( 575 'swissprot_dates', 576 Bio::Annotation::SimpleValue->new( 577 -tagname => 'date_updated', 578 -value => $2 579 ) 580 ); 581 } 582 $dbsource =~ s/\n/ /g; 583 if ( $dbsource =~ 584 s/\s+xrefs:\s+((?:\S+,\s+)+\S+)\s+xrefs/xrefs/ 585 ) { 586 # will use $i to determine even or odd 587 # for swissprot the accessions are paired 588 my $i = 0; 589 for my $dbsrc ( split( /,\s+/, $1 ) ) { 590 if ( $dbsrc =~ /(\S+)\.(\d+)/ 591 or $dbsrc =~ /(\S+)/ 592 ) { 593 my ( $id, $version ) = ( $1, $2 ); 594 $version = '' unless defined $version; 595 my $db = ( $id =~ /^\d\S{3}/ ) ? 'PDB' 596 : ( $i++ % 2 ) ? 'GenPept' 597 : 'GenBank'; 598 599 $annotation->add_Annotation( 600 'dblink', 601 Bio::Annotation::DBLink->new( 602 -primary_id => $id, 603 -version => $version, 604 -database => $db, 605 -tagname => 'dblink' 606 ) 607 ); 608 } 609 } 610 } 611 elsif ( $dbsource =~ s/\s+xrefs:\s+(.+)\s+xrefs/xrefs/i ) { 612 # download screwed up and ncbi didn't put acc in for gi numbers 613 my $i = 0; 614 for my $id ( split( /\,\s+/, $1 ) ) { 615 my ( $acc, $db ); 616 if ( $id =~ /gi:\s+(\d+)/ ) { 617 $acc = $1; 618 $db = ( $i++ % 2 ) ? 'GenPept' : 'GenBank'; 619 } 620 elsif ( $id =~ /pdb\s+accession\s+(\S+)/ ) { 621 $acc = $1; 622 $db = 'PDB'; 623 } 624 else { 625 $acc = $id; 626 $db = ''; 627 } 628 $annotation->add_Annotation( 629 'dblink', 630 Bio::Annotation::DBLink->new( 631 -primary_id => $acc, 632 -database => $db, 633 -tagname => 'dblink' 634 ) 635 ); 636 } 637 } 638 else { 639 $self->debug("Cannot match $dbsource\n"); 640 } 641 if ( $dbsource =~ s/xrefs\s+ 642 \(non\-sequence\s+databases\):\s+ 643 ((?:\S+,\s+)+\S+)//x 644 ) { 645 for my $id ( split( /\,\s+/, $1 ) ) { 646 my $db; 647 648 # this is because GenBank dropped the spaces!!! 649 # I'm sure we're not going to get this right 650 ##if ( $id =~ s/^://i ) { 651 ## $db = $1; 652 ##} 653 $db = substr( $id, 0, index( $id, ':' ) ); 654 if ( not exists $DBSOURCE{$db} ) { 655 $db = ''; # do we want 'GenBank' here? 656 } 657 $id = substr( $id, index( $id, ':' ) + 1 ); 658 $annotation->add_Annotation( 659 'dblink', 660 Bio::Annotation::DBLink->new( 661 -primary_id => $id, 662 -database => $db, 663 -tagname => 'dblink' 664 ) 665 ); 666 } 667 } 668 } 669 else { 670 if ( $dbsource =~ 671 /^(\S*?):?\s*accession\s+(\S+)\.(\d+)/ 672 ) { 673 my ( $db, $id, $version ) = ( $1, $2, $3 ); 674 $annotation->add_Annotation( 675 'dblink', 676 Bio::Annotation::DBLink->new( 677 -primary_id => $id, 678 -version => $version, 679 -database => $db || 'GenBank', 680 -tagname => 'dblink' 681 ) 682 ); 683 } 684 elsif ( $dbsource =~ /^(\S*?):?\s*accession\s+(\S+)/ ) { 685 my ( $db, $id ) = ( $1, $2 ); 686 $annotation->add_Annotation( 687 'dblink', 688 Bio::Annotation::DBLink->new( 689 -primary_id => $id, 690 -database => $db || 'GenBank', 691 -tagname => 'dblink' 692 ) 693 ); 694 } 695 elsif ( $dbsource =~ /(\S+)([\.:])\s*(\S+)/ ) { 696 my ( $db, $version ); 697 my @ids = (); 698 if ( $2 eq ':' ) { 699 $db = $1; 700 701 # Genbank 192 release notes say this: "The second 702 # field can consist of multiple comma-separated 703 # identifiers, if a sequence record has multiple 704 # DBLINK cross-references of a given type." 705 # For example: DBLINK Project:100,200,300" 706 @ids = split( /,/, $3 ); 707 } 708 else { 709 ( $db, $version ) = ( 'GenBank', $3 ); 710 $ids[0] = $1; 711 } 712 713 foreach my $id (@ids) { 714 $annotation->add_Annotation( 715 'dblink', 716 Bio::Annotation::DBLink->new( 717 -primary_id => $id, 718 -version => $version, 719 -database => $db, 720 -tagname => 'dblink' 721 ) 722 ); 723 } 724 } 725 else { 726 $self->warn( 727 "Unrecognized DBSOURCE data: $dbsource\n"); 728 } 729 } 730 $buffer = $line; 731 } 732 else { 733 while ( defined( $buffer = $self->_readline ) ) { 734 last if substr( $buffer, 0, 1 ) ne ' '; 735 } 736 } 737 next; 738 } 739 740 # Exit at start of Feature table, or start of sequence 741 if ($line =~ /^(FEATURES|ORIGIN)/) { 742 my $trap; 743 } 744 last if ($line =~ /^(FEATURES|ORIGIN)/); 745 746 # Get next line and loop again 747 $buffer = $self->_readline; 748 } 749 return unless defined $buffer; 750 751 # add them all at once for efficiency 752 $builder->add_slot_value( 753 -accession_number => shift(@acc), 754 -secondary_accessions => \@acc, 755 %params 756 ); 757 $builder->add_slot_value( -annotation => $annotation ) if $annotation; 758 %params = (); # reset before possible re-use to avoid setting twice 759 760 # start over if we don't want to continue with this entry 761 if ( not $builder->want_object ) { 762 $builder->make_object; 763 next RECORDSTART; 764 } 765 766 # some "minimal" formats may not necessarily have a feature table 767 if ( $builder->want_slot('features') 768 and defined $buffer 769 and $buffer =~ /^FEATURES/o 770 ) { 771 # need to read the first line of the feature table 772 $buffer = $self->_readline; 773 774 # DO NOT read lines in the while condition -- this is done 775 # as a side effect in _read_FTHelper_GenBank! 776 777 # part of new circular spec: 778 # commented out for now until kinks worked out 779 #my $sourceEnd = 0; 780 #$sourceEnd = $2 if ($buffer =~ /(\d+?)\.\.(\d+?)$/); 781 782 while ( defined $buffer ) { 783 # check immediately -- not at the end of the loop 784 # note: GenPept entries obviously do not have a BASE line 785 last if ( $buffer =~ /^BASE|ORIGIN|CONTIG|WGS/o ); 786 787 # slurp in one feature at a time -- at return, the start of 788 # the next feature will have been read already, so we need 789 # to pass a reference, and the called method must set this 790 # to the last line read before returning 791 my $ftunit = $self->_read_FTHelper_GenBank( \$buffer ); 792 793 # implement new circular spec: features that cross the origin are now 794 # seamless instead of being 2 separate joined features 795 # commented out until kinks get worked out 796 #if ((! $args{'-nojoin'}) && $ftunit->{'loc'} =~ /^join\((\d+?)\.\.(\d+?),(\d+?)..(\d+?)\)$/ 797 #&& $sourceEnd == $2 && $3 == 1) { 798 #my $start = $1; 799 #my $end = $2 + $4; 800 #$ftunit->{'loc'} = "$start..$end"; 801 #} 802 803 # fix suggested by James Diggans 804 805 if ( not defined $ftunit ) { 806 # GRRRR. We have fallen over. Try to recover 807 $self->warn( "Unexpected error in feature table for " 808 . $params{'-display_id'} 809 . " Skipping feature, attempting to recover" ); 810 811 unless ( $buffer =~ /^\s{5,5}\S+/o 812 or $buffer =~ /^\S+/o 813 ) { 814 $buffer = $self->_readline; 815 } 816 next; # back to reading FTHelpers 817 } 818 819 # process ftunit 820 my $feat = 821 $ftunit->_generic_seqfeature( $self->location_factory, 822 $display_id ); 823 824 # add taxon_id from source if available 825 if ( $species 826 and $feat->primary_tag eq 'source' 827 and $feat->has_tag('db_xref') 828 and ( not $species->ncbi_taxid 829 or ( $species->ncbi_taxid 830 and $species->ncbi_taxid =~ /^list/ ) ) 831 ) { 832 foreach my $tagval ( $feat->get_tag_values('db_xref') ) { 833 if ( index( $tagval, "taxon:" ) == 0 ) { 834 $species->ncbi_taxid( substr( $tagval, 6 ) ); 835 last; 836 } 837 } 838 } 839 840 # add feature to list of features 841 push( @features, $feat ); 842 } 843 $builder->add_slot_value( -features => \@features ); 844 } 845 846 if ( defined $buffer ) { 847 # CONTIG lines: TODO, this needs to be cleaned up 848 if ($buffer =~/^CONTIG\s+(.*)/o) { 849 my $ctg = $1; 850 while ( defined( $buffer = $self->_readline ) ) { 851 last if $buffer =~ m{^ORIGIN|//}o; 852 $buffer =~ s/\s+(.*)/$1/; 853 $ctg .= $buffer; 854 } 855 if ($ctg) { 856 $annotation->add_Annotation( 857 Bio::Annotation::SimpleValue->new( 858 -tagname => 'contig', 859 -value => $ctg 860 ) 861 ); 862 } 863 } 864 elsif ($buffer =~ /^WGS|WGS_SCAFLD\s+/o) { # catch WGS/WGS_SCAFLD lines 865 while ( $buffer =~ s/(^WGS|WGS_SCAFLD)\s+// ) { # gulp lines 866 chomp $buffer; 867 $annotation->add_Annotation( 868 Bio::Annotation::SimpleValue->new( 869 -value => $buffer, 870 -tagname => lc $1 871 ) 872 ); 873 $buffer = $self->_readline; 874 } 875 } 876 elsif ( $buffer !~ m{^ORIGIN|//}o ) { # advance to the sequence, if any 877 while ( defined( $buffer = $self->_readline ) ) { 878 last if $buffer =~ m{^(ORIGIN|//)}; 879 } 880 } 881 } 882 if ( not $builder->want_object ) { 883 $builder->make_object; # implicit end-of-object 884 next RECORDSTART; 885 } 886 if ( $builder->want_slot('seq') ) { 887 # the fact that we want a sequence does not necessarily mean that 888 # there also is a sequence ... 889 if ( defined $buffer and $buffer =~ s/^ORIGIN\s+// ) { 890 if ( $annotation and length($buffer) > 0 ) { 891 $annotation->add_Annotation( 892 'origin', 893 Bio::Annotation::SimpleValue->new( 894 -tagname => 'origin', 895 -value => $buffer 896 ) 897 ); 898 } 899 my $seqc = ''; 900 while ( defined( $buffer = $self->_readline ) ) { 901 last if $buffer =~ m{^//}; 902 $buffer = uc $buffer; 903 $buffer =~ s/[^A-Za-z]//g; 904 $seqc .= $buffer; 905 } 906 907 $builder->add_slot_value( -seq => $seqc ); 908 } 909 } 910 elsif ( defined($buffer) and ( substr( $buffer, 0, 2 ) ne '//' ) ) { 911 # advance to the end of the record 912 while ( defined( $buffer = $self->_readline ) ) { 913 last if substr( $buffer, 0, 2 ) eq '//'; 914 } 915 } 916 917 # Unlikely, but maybe the sequence is so weird that we don't want it 918 # anymore. We don't want to return undef if the stream's not exhausted 919 # yet. 920 $seq = $builder->make_object; 921 next RECORDSTART unless $seq; 922 last RECORDSTART; 923 } # end while RECORDSTART 924 925 return $seq; 926} 927 928=head2 write_seq 929 930 Title : write_seq 931 Usage : $stream->write_seq($seq) 932 Function: writes the $seq object (must be seq) to the stream 933 Returns : 1 for success and 0 for error 934 Args : array of 1 to n Bio::SeqI objects 935 936=cut 937 938sub write_seq { 939 my ($self,@seqs) = @_; 940 941 foreach my $seq ( @seqs ) { 942 $self->throw("Attempting to write with no seq!") unless defined $seq; 943 944 if ( not ref $seq or not $seq->isa('Bio::SeqI') ) { 945 $self->warn(" $seq is not a SeqI compliant module. Attempting to dump, but may fail!"); 946 } 947 948 my $str = $seq->seq; 949 my $len = $seq->length; 950 my $alpha = $seq->alphabet; 951 952 my ($div, $mol); 953 if ( not $seq->can('division') 954 or not defined($div = $seq->division) 955 ) { 956 $div = 'UNK'; 957 } 958 if ( not $seq->can('molecule') 959 or not defined ($mol = $seq->molecule) 960 ) { 961 $mol = $alpha || 'DNA'; 962 } 963 964 my $circular = ($seq->is_circular) ? 'circular' : 'linear '; 965 966 local($^W) = 0; # suppressing warnings about uninitialized fields. 967 968 my $temp_line; 969 if ( $self->_id_generation_func ) { 970 $temp_line = &{$self->_id_generation_func}($seq); 971 } 972 else { 973 my $date = ''; 974 if ( $seq->can('get_dates') ) { 975 ($date) = $seq->get_dates; 976 } 977 978 $self->warn("No whitespace allowed in GenBank display id [". $seq->display_id. "]") 979 if $seq->display_id =~ /\s/; 980 981 my @data = ( lc($alpha) eq 'protein' ) ? ('aa', '', '') : ('bp', '', $mol); 982 $temp_line = sprintf ("%-12s%-15s%13s %s%4s%-8s%-8s %3s %-s\n", 983 'LOCUS', $seq->id, $len, 984 @data, $circular, $div, $date); 985 } 986 987 $self->_print($temp_line); 988 $self->_write_line_GenBank_regex("DEFINITION ", " ", 989 $seq->desc, "\\s\+\|\$",80); 990 991 # if there, write the accession line 992 993 if ( $self->_ac_generation_func ) { 994 $temp_line = &{$self->_ac_generation_func}($seq); 995 $self->_print("ACCESSION $temp_line\n"); 996 } 997 else { 998 my @acc = (); 999 push @acc, $seq->accession_number; 1000 if ( $seq->isa('Bio::Seq::RichSeqI') ) { 1001 push @acc, $seq->get_secondary_accessions; 1002 } 1003 $self->_print("ACCESSION ", join(" ", @acc), "\n"); 1004 # otherwise - cannot print <sigh> 1005 } 1006 1007 # if PID defined, print it 1008 if ($seq->isa('Bio::Seq::RichSeqI') and $seq->pid) { 1009 $self->_print("PID ", $seq->pid, "\n"); 1010 } 1011 1012 # if there, write the version line 1013 if ( defined $self->_sv_generation_func ) { 1014 $temp_line = &{$self->_sv_generation_func}($seq); 1015 if ( $temp_line ) { 1016 $self->_print("VERSION $temp_line\n"); 1017 } 1018 } 1019 elsif ($seq->isa('Bio::Seq::RichSeqI') and defined($seq->seq_version)) { 1020 my $id = $seq->primary_id; # this may be a GI number 1021 my $data = (defined $id and $id =~ /^\d+$/) ? " GI:$id" : ""; 1022 $self->_print("VERSION ", 1023 $seq->accession_number, ".", 1024 $seq->seq_version, $data, "\n"); 1025 } 1026 1027 # if there, write the PROJECT line 1028 for my $proj ( $seq->annotation->get_Annotations('project') ) { 1029 $self->_print("PROJECT ".$proj->value."\n"); 1030 } 1031 1032 # if there, write the DBSOURCE line 1033 foreach my $ref ( $seq->annotation->get_Annotations('dblink') ) { 1034 my ($db, $id) = ($ref->database, $ref->primary_id); 1035 my $prefix = $db eq 'Project' ? 'DBLINK' : 'DBSOURCE'; 1036 my $text = $db eq 'GenBank' ? '' 1037 : $db eq 'Project' ? "$db:$id" 1038 : "$db accession $id"; 1039 $self->_print(sprintf ("%-11s %s\n", $prefix, $text)); 1040 } 1041 1042 # if there, write the keywords line 1043 if ( defined $self->_kw_generation_func ) { 1044 $temp_line = &{$self->_kw_generation_func}($seq); 1045 $self->_print("KEYWORDS $temp_line\n"); 1046 } 1047 elsif ( $seq->can('keywords') ) { 1048 my $kw = $seq->keywords; 1049 $kw .= '.' if ( $kw !~ /\.$/ ); 1050 $self->_print("KEYWORDS $kw\n"); 1051 } 1052 1053 # SEGMENT if it exists 1054 foreach my $ref ( $seq->annotation->get_Annotations('segment') ) { 1055 $self->_print(sprintf ("%-11s %s\n",'SEGMENT', 1056 $ref->value)); 1057 } 1058 1059 # Organism lines 1060 if (my $spec = $seq->species) { 1061 my ($on, $sn, $cn) = ($spec->can('organelle') ? $spec->organelle : '', 1062 $spec->scientific_name, 1063 $spec->common_name); 1064 my @classification; 1065 if ($spec->isa('Bio::Species')) { 1066 @classification = $spec->classification; 1067 shift @classification; 1068 } 1069 else { 1070 # Bio::Taxon should have a DB handle of some type attached, so 1071 # derive the classification from that 1072 my $node = $spec; 1073 while ($node) { 1074 $node = $node->ancestor || last; 1075 unshift @classification, $node->node_name; 1076 #$node eq $root && last; 1077 } 1078 @classification = reverse @classification; 1079 } 1080 my $abname = $spec->name('abbreviated') ? # from genbank file 1081 $spec->name('abbreviated')->[0] : $sn; 1082 my $sl = $on ? "$on " : ''; 1083 $sl .= $cn ? "$abname ($cn)" : $abname; 1084 1085 $self->_write_line_GenBank_regex("SOURCE ", ' 'x12, $sl, "\\s\+\|\$", 80); 1086 $self->_print(" ORGANISM ", $spec->scientific_name, "\n"); 1087 my $OC = join('; ', reverse @classification) . '.'; 1088 $self->_write_line_GenBank_regex(' 'x12,' 'x12, $OC, "\\s\+\|\$", 80); 1089 } 1090 1091 # Reference lines 1092 my $count = 1; 1093 foreach my $ref ( $seq->annotation->get_Annotations('reference') ) { 1094 $temp_line = "REFERENCE $count"; 1095 if ($ref->start) { 1096 $temp_line .= sprintf (" (%s %d to %d)", 1097 ($seq->alphabet() eq "protein" ? 1098 "residues" : "bases"), 1099 $ref->start, $ref->end); 1100 } 1101 elsif ($ref->gb_reference) { 1102 $temp_line .= sprintf (" (%s)", $ref->gb_reference); 1103 } 1104 $self->_print("$temp_line\n"); 1105 $self->_write_line_GenBank_regex(" AUTHORS ", ' 'x12, 1106 $ref->authors, "\\s\+\|\$", 80); 1107 $self->_write_line_GenBank_regex(" CONSRTM ", ' 'x12, 1108 $ref->consortium, "\\s\+\|\$", 80) if $ref->consortium; 1109 $self->_write_line_GenBank_regex(" TITLE ", ' 'x12, 1110 $ref->title, "\\s\+\|\$", 80); 1111 $self->_write_line_GenBank_regex(" JOURNAL ", ' 'x12, 1112 $ref->location, "\\s\+\|\$", 80); 1113 if ( $ref->medline) { 1114 $self->_write_line_GenBank_regex(" MEDLINE ", ' 'x12, 1115 $ref->medline, "\\s\+\|\$", 80); 1116 # I am assuming that pubmed entries only exist when there 1117 # are also MEDLINE entries due to the indentation 1118 } 1119 # This could be a wrong assumption 1120 if ( $ref->pubmed ) { 1121 $self->_write_line_GenBank_regex(" PUBMED ", ' 'x12, 1122 $ref->pubmed, "\\s\+\|\$", 80); 1123 } 1124 # put remark at the end 1125 if ($ref->comment) { 1126 $self->_write_line_GenBank_regex(" REMARK ", ' 'x12, 1127 $ref->comment, "\\s\+\|\$", 80); 1128 } 1129 $count++; 1130 } 1131 1132 # Comment lines 1133 foreach my $comment ( $seq->annotation->get_Annotations('comment') ) { 1134 $self->_write_line_GenBank_regex("COMMENT ", ' 'x12, 1135 $comment->text, "\\s\+\|\$", 80); 1136 } 1137 1138 # FEATURES section 1139 $self->_print("FEATURES Location/Qualifiers\n"); 1140 1141 if ( defined $self->_post_sort ) { 1142 # we need to read things into an array. Process. Sort them. Print 'em 1143 my $post_sort_func = $self->_post_sort; 1144 my @fth; 1145 1146 foreach my $sf ( $seq->top_SeqFeatures ) { 1147 push @fth, Bio::SeqIO::FTHelper::from_SeqFeature($sf, $seq); 1148 } 1149 1150 @fth = sort { &$post_sort_func($a, $b) } @fth; 1151 1152 foreach my $fth ( @fth ) { 1153 $self->_print_GenBank_FTHelper($fth); 1154 } 1155 } 1156 else { 1157 # not post sorted. And so we can print as we get them. 1158 # lower memory load... 1159 foreach my $sf ( $seq->top_SeqFeatures ) { 1160 my @fth = Bio::SeqIO::FTHelper::from_SeqFeature($sf, $seq); 1161 foreach my $fth ( @fth ) { 1162 if ( ! $fth->isa('Bio::SeqIO::FTHelper') ) { 1163 $sf->throw("Cannot process FTHelper... $fth"); 1164 } 1165 $self->_print_GenBank_FTHelper($fth); 1166 } 1167 } 1168 } 1169 1170 # deal with WGS; WGS_SCAFLD present only if WGS is also present 1171 if ($seq->annotation->get_Annotations('wgs')) { 1172 foreach my $wgs (map {$seq->annotation->get_Annotations($_)} 1173 qw(wgs wgs_scaffold) 1174 ) { 1175 $self->_print(sprintf ("%-11s %s\n", 1176 uc($wgs->tagname), 1177 $wgs->value)); 1178 } 1179 $self->_show_dna(0); 1180 } 1181 if ($seq->annotation->get_Annotations('contig')) { 1182 my $ct = 0; 1183 my $cline; 1184 foreach my $contig ($seq->annotation->get_Annotations('contig')) { 1185 unless ($ct) { 1186 $cline = uc($contig->tagname) . " " . $contig->value . "\n"; 1187 } 1188 else { 1189 $cline = " " . $contig->value . "\n"; 1190 } 1191 $self->_print($cline); 1192 $ct++; 1193 } 1194 } 1195 if ( $seq->length == 0 ) { 1196 $self->_show_dna(0); 1197 } 1198 1199 if ( $self->_show_dna == 0 ) { 1200 $self->_print("\n//\n"); 1201 return; 1202 } 1203 1204 # finished printing features. 1205 1206 $str =~ tr/A-Z/a-z/; 1207 1208 my ($o) = $seq->annotation->get_Annotations('origin'); 1209 $self->_print(sprintf("%-12s%s\n", 1210 'ORIGIN', $o ? $o->value : '')); 1211 # print out the sequence 1212 my $nuc = 60; # Number of nucleotides per line 1213 my $whole_pat = 'a10' x 6; # Pattern for unpacking a whole line 1214 my $out_pat = 'A11' x 6; # Pattern for packing a line 1215 my $length = length $str; 1216 1217 # Calculate the number of nucleotides which fit on whole lines 1218 my $whole = int($length / $nuc) * $nuc; 1219 1220 # Print the whole lines 1221 my $i; 1222 for ($i = 0; $i < $whole; $i += $nuc) { 1223 my $blocks = pack $out_pat, 1224 unpack $whole_pat, 1225 substr($str, $i, $nuc); 1226 chop $blocks; 1227 $self->_print(sprintf("%9d $blocks\n", $i + $nuc - 59)); 1228 } 1229 1230 # Print the last line 1231 if (my $last = substr($str, $i)) { 1232 my $last_len = length($last); 1233 my $last_pat = 'a10' x int($last_len / 10) 1234 . 'a' . $last_len % 10; 1235 my $blocks = pack $out_pat, 1236 unpack($last_pat, $last); 1237 $blocks =~ s/ +$//; 1238 $self->_print(sprintf("%9d $blocks\n", 1239 $length - $last_len + 1)); 1240 } 1241 1242 $self->_print("//\n"); 1243 1244 $self->flush if $self->_flush_on_write && defined $self->_fh; 1245 return 1; 1246 } 1247} 1248 1249=head2 _print_GenBank_FTHelper 1250 1251 Title : _print_GenBank_FTHelper 1252 Usage : 1253 Function: 1254 Example : 1255 Returns : 1256 Args : 1257 1258=cut 1259 1260sub _print_GenBank_FTHelper { 1261 my ( $self, $fth ) = @_; 1262 1263 if ( not ref $fth or not $fth->isa('Bio::SeqIO::FTHelper') ) { 1264 $fth->warn( 1265 "$fth is not a FTHelper class. Attempting to print but there could be issues" 1266 ); 1267 } 1268 1269 my $spacer = ( length $fth->key >= 15 ) ? ' ' : ''; 1270 $self->_write_line_GenBank_regex( 1271 sprintf( " %-16s%s", $fth->key, $spacer ), 1272 " " x 21, $fth->loc, "\,\|\$", 80 ); 1273 1274 foreach my $tag ( sort keys %{ $fth->field } ) { 1275 # Account for hash structure in Annotation::DBLink, not the expected array 1276 if ( $tag eq 'db_xref' and grep /HASH/, @{ $fth->field->{$tag} }) { 1277 for my $ref ( @{ $fth->field->{$tag} } ) { 1278 my $db = $ref->{'database'}; 1279 my $id = $ref->{'primary_id'}; 1280 $self->_write_line_GenBank_regex 1281 ( " " x 21, " " x 21, 1282 "/$tag=\"$db:$id\"", "\.\|\$", 80 ); 1283 } 1284 } 1285 # The usual case, where all values are found in an array 1286 else { 1287 foreach my $value ( @{ $fth->field->{$tag} } ) { 1288 $value =~ s/\"/\"\"/g; 1289 if ( $value eq "_no_value" ) { 1290 $self->_write_line_GenBank_regex 1291 ( " " x 21, " " x 21, 1292 "/$tag", "\.\|\$", 80 ); 1293 } 1294 1295 # There are almost 3x more quoted qualifier values and they 1296 # are more common too so we take quoted ones first. 1297 # Long qualifiers, that will be line wrapped, are always quoted 1298 1299 # Note (cjf): any tags explicitly *not* quoted must 1300 # remain unquoted in all output for proper round trip 1301 # parsing: 1302 # see https://github.com/bioperl/bioperl-live/issues/321 1303 1304 # If this breaks *NCBI* examples please file an issue with 1305 # example files 1306 elsif ( not $FTQUAL_NO_QUOTE{$tag} ) { 1307 my ($pat) = ( $value =~ /\s/ ? '\s|$' : 1308 '.|$' ); 1309 $self->_write_line_GenBank_regex 1310 ( " " x 21, " " x 21, 1311 "/$tag=\"$value\"", $pat, 80 ); 1312 } 1313 else { 1314 my ($pat) = ( $value =~ /[,]/ ? '[,]|$' : 1315 '.|$' ); 1316 $self->_write_line_GenBank_regex 1317 ( " " x 21, " " x 21, 1318 "/$tag=$value", $pat, 80 ); 1319 } 1320 } 1321 } 1322 } 1323} 1324 1325=head2 _read_GenBank_References 1326 1327 Title : _read_GenBank_References 1328 Usage : 1329 Function: Reads references from GenBank format. Internal function really 1330 Returns : 1331 Args : 1332 1333=cut 1334 1335sub _read_GenBank_References { 1336 my ($self, $buffer) = @_; 1337 my (@refs); 1338 my $ref; 1339 1340 # assume things are starting with RN 1341 if ( $$buffer !~ /^REFERENCE/ ) { 1342 warn("Not parsing line '$$buffer' which maybe important"); 1343 } 1344 1345 my $line = $$buffer; 1346 1347 my (@title,@loc,@authors,@consort,@com,@medline,@pubmed); 1348 1349 REFLOOP: 1350 while( defined($line) or defined($line = $self->_readline) ) { 1351 if ($line =~ /^\s{2}AUTHORS\s+(.*)/o) { 1352 push @authors, $1; 1353 while ( defined($line = $self->_readline) ) { 1354 if ($line =~ /^\s{9,}(.*)/o) { 1355 push @authors, $1; 1356 next; 1357 } 1358 last; 1359 } 1360 $ref->authors(join(' ', @authors)); 1361 } 1362 1363 if ($line =~ /^\s{2}CONSRTM\s+(.*)/o) { 1364 push @consort, $1; 1365 while ( defined($line = $self->_readline) ) { 1366 if ($line =~ /^\s{9,}(.*)/o) { 1367 push @consort, $1; 1368 next; 1369 } 1370 last; 1371 } 1372 $ref->consortium(join(' ', @consort)); 1373 } 1374 1375 if ($line =~ /^\s{2}TITLE\s+(.*)/o) { 1376 push @title, $1; 1377 while ( defined($line = $self->_readline) ) { 1378 if ($line =~ /^\s{9,}(.*)/o) { 1379 push @title, $1; 1380 next; 1381 } 1382 last; 1383 } 1384 $ref->title(join(' ', @title)); 1385 } 1386 1387 if ($line =~ /^\s{2}JOURNAL\s+(.*)/o) { 1388 push @loc, $1; 1389 while ( defined($line = $self->_readline) ) { 1390 # we only match when there are at least 4 spaces 1391 # there is probably a better way to match this 1392 # as it assumes that the describing tag is short enough 1393 if ($line =~ /^\s{9,}(.*)/o) { 1394 push @loc, $1; 1395 next; 1396 } 1397 last; 1398 } 1399 $ref->location(join(' ', @loc)); 1400 redo REFLOOP; 1401 } 1402 1403 if ($line =~ /^\s{2}REMARK\s+(.*)/o) { 1404 push @com, $1; 1405 while ( defined($line = $self->_readline) ) { 1406 if ($line =~ /^\s{9,}(.*)/o) { 1407 push @com, $1; 1408 next; 1409 } 1410 last; 1411 } 1412 $ref->comment(join(' ', @com)); 1413 redo REFLOOP; 1414 } 1415 1416 if ( $line =~ /^\s{2}MEDLINE\s+(.*)/ ) { 1417 push @medline, $1; 1418 while ( defined($line = $self->_readline) ) { 1419 if ($line =~ /^\s{9,}(.*)/) { 1420 push @medline, $1; 1421 next; 1422 } 1423 last; 1424 } 1425 $ref->medline(join(' ', @medline)); 1426 redo REFLOOP; 1427 } 1428 1429 if ( $line =~ /^\s{3}PUBMED\s+(.*)/ ) { 1430 push @pubmed, $1; 1431 while ( defined($line = $self->_readline) ) { 1432 if ($line =~ /^\s{9,}(.*)/) { 1433 push @pubmed, $1; 1434 next; 1435 } 1436 last; 1437 } 1438 $ref->pubmed(join(' ', @pubmed)); 1439 redo REFLOOP; 1440 } 1441 1442 if ( $line =~ /^REFERENCE/o ) { 1443 # store current reference 1444 $self->_add_ref_to_array(\@refs,$ref) if defined $ref; 1445 1446 # reset 1447 @authors = (); 1448 @title = (); 1449 @loc = (); 1450 @com = (); 1451 @pubmed = (); 1452 @medline = (); 1453 1454 # create the new reference object 1455 $ref = Bio::Annotation::Reference->new(-tagname => 'reference'); 1456 1457 # check whether start and end base is given 1458 if ($line =~ /^REFERENCE\s+\d+\s+\([a-z]+ (\d+) to (\d+)\)/){ 1459 $ref->start($1); 1460 $ref->end($2); 1461 } 1462 elsif ($line =~ /^REFERENCE\s+\d+\s+\((.*)\)/) { 1463 $ref->gb_reference($1); 1464 } 1465 } 1466 1467 last if ($line =~ /^(FEATURES)|(COMMENT)/o); 1468 1469 $line = undef; # Empty $line to trigger read of next line 1470 } 1471 1472 # store last reference 1473 $self->_add_ref_to_array(\@refs, $ref) if defined $ref; 1474 1475 $$buffer = $line; 1476 1477 #print "\nnumber of references found: ", $#refs+1,"\n"; 1478 1479 return @refs; 1480} 1481 1482=head2 _add_ref_to_array 1483 1484Title: _add_ref_to_array 1485Usage: 1486Function: Adds a Reference object to an array of Reference objects, takes 1487 care of possible cleanups to be done (currently, only author and title 1488 will be chopped of trailing semicolons). 1489Args: A reference to an array of Reference objects and 1490 the Reference object to be added 1491Returns: nothing 1492 1493=cut 1494 1495sub _add_ref_to_array { 1496 my ($self, $refs, $ref) = @_; 1497 1498 # first, polish author and title by removing possible trailing semicolons 1499 my $au = $ref->authors; 1500 my $title = $ref->title; 1501 $au =~ s/;\s*$//g if $au; 1502 $title =~ s/;\s*$//g if $title; 1503 $ref->authors($au); 1504 $ref->title($title); 1505 # the rest should be clean already, so go ahead and add it 1506 push @{$refs}, $ref; 1507} 1508 1509=head2 _read_GenBank_Species 1510 1511 Title : _read_GenBank_Species 1512 Usage : 1513 Function: Reads the GenBank Organism species and classification 1514 lines. Able to deal with unconvential Organism naming 1515 formats, and varietas in plants 1516 Example : ORGANISM unknown marine gamma proteobacterium NOR5 1517 $genus = undef 1518 $species = unknown marine gamma proteobacterium NOR5 1519 1520 ORGANISM Drosophila sp. 'white tip scutellum' 1521 $genus = Drosophila 1522 $species = sp. 'white tip scutellum' 1523 (yes, this really is a species and that is its name) 1524 $subspecies = undef 1525 1526 ORGANISM Ajellomyces capsulatus var. farciminosus 1527 $genus = Ajellomyces 1528 $species = capsulatus 1529 $subspecies = var. farciminosus 1530 1531 ORGANISM Hepatitis delta virus 1532 $genus = undef (though this virus has a genus in its lineage, we 1533 cannot know that without a database lookup) 1534 $species = Hepatitis delta virus 1535 1536 Returns : A Bio::Species object 1537 Args : A reference to the current line buffer 1538 1539=cut 1540 1541sub _read_GenBank_Species { 1542 my ($self, $buffer) = @_; 1543 1544 my @unkn_names = ('other', 'unknown organism', 'not specified', 'not shown', 1545 'Unspecified', 'Unknown', 'None', 'unclassified', 1546 'unidentified organism', 'not supplied'); 1547 # dictionary of synonyms for taxid 32644 1548 my @unkn_genus = ('unknown', 'unclassified', 'uncultured', 'unidentified'); 1549 # all above can be part of valid species name 1550 1551 my $line = $$buffer; 1552 1553 my( $sub_species, $species, $genus, $sci_name, $common, 1554 $class_lines, $source_flag, $abbr_name, $organelle, $sl ); 1555 my %source = map { $_ => 1 } qw(SOURCE ORGANISM CLASSIFICATION); 1556 1557 # upon first entering the loop, we must not read a new line -- the SOURCE 1558 # line is already in the buffer (HL 05/10/2000) 1559 my ($ann, $tag, $data); 1560 while (defined($line) or defined($line = $self->_readline)) { 1561 # de-HTMLify (links that may be encountered here don't contain 1562 # escaped '>', so a simple-minded approach suffices) 1563 $line =~ s{<[^>]+>}{}g; 1564 if ($line =~ m{^(?:\s{0,2})(\w+)\s+(.+)?$}ox) { 1565 ($tag, $data) = ($1, $2 || ''); 1566 last if ($tag and not exists $source{$tag}); 1567 } 1568 else { 1569 return unless $tag; 1570 ($data = $line) =~ s{^\s+}{}; 1571 chomp $data; 1572 $tag = 'CLASSIFICATION' if ( $tag ne 'CLASSIFICATION' 1573 and $tag eq 'ORGANISM' 1574 # Don't match "str." or "var." (fix NC_021815), 1575 # and don't match ".1" (fix NC_021902) 1576 and $line =~ m{(?<!\bstr|\bvar)[;\.]+(?!\d)}); 1577 } 1578 (exists $ann->{$tag}) ? ($ann->{$tag} .= ' '.$data) : ($ann->{$tag} .= $data); 1579 $line = undef; 1580 } 1581 1582 ($sl, $class_lines, $sci_name) = ($ann->{SOURCE}, $ann->{CLASSIFICATION}, $ann->{ORGANISM}); 1583 1584 $$buffer = $line; 1585 1586 $sci_name or return; 1587 1588 # parse out organelle, common name, abbreviated name if present; 1589 # this should catch everything, but falls back to 1590 # entire SOURCE line just in case 1591 if ($sl =~ m{^(mitochondrion|chloroplast|plastid)? 1592 \s*(.*?) 1593 \s*(?: \( (.*?) \) )?\.? 1594 $ 1595 }xms 1596 ) { 1597 ($organelle, $abbr_name, $common) = ($1, $2, $3); # optional 1598 } 1599 else { 1600 $abbr_name = $sl; # nothing caught; this is a backup! 1601 } 1602 1603 # Convert data in classification lines into classification array. 1604 # only split on ';' or '.' so that classification that is 2 or more words will 1605 # still get matched, use map() to remove trailing/leading/intervening spaces 1606 my @class = map { $_ =~ s/^\s+//; 1607 $_ =~ s/\s+$//; 1608 $_ =~ s/\s{2,}/ /g; 1609 $_; } 1610 split /(?<!subgen)[;\.]+/, $class_lines; 1611 1612 # do we have a genus? 1613 my $possible_genus = quotemeta($class[-1]) 1614 . ($class[-2] ? "|" . quotemeta($class[-2]) : ''); 1615 if ($sci_name =~ /^($possible_genus)/) { 1616 $genus = $1; 1617 ($species) = $sci_name =~ /^$genus\s+(.+)/; 1618 } 1619 else { 1620 $species = $sci_name; 1621 } 1622 1623 # is this organism of rank species or is it lower? 1624 # (we don't catch everything lower than species, but it doesn't matter - 1625 # this is just so we abide by previous behaviour whilst not calling a 1626 # species a subspecies) 1627 if ($species and $species =~ /(.+)\s+((?:subsp\.|var\.).+)/) { 1628 ($species, $sub_species) = ($1, $2); 1629 } 1630 1631 # Don't make a species object if it's empty or "Unknown" or "None" 1632 # return unless $genus and $genus !~ /^(Unknown|None)$/oi; 1633 # Don't make a species object if it belongs to taxid 32644 1634# my $unkn = grep { $_ =~ /^\Q$sl\E$/; } @unkn_names; 1635 my $unkn = grep { $_ eq $sl } @unkn_names; 1636 return unless (defined $species or defined $genus) and $unkn == 0; 1637 1638 # Bio::Species array needs array in Species -> Kingdom direction 1639 push @class, $sci_name; 1640 @class = reverse @class; 1641 1642 my $make = Bio::Species->new; 1643 $make->scientific_name($sci_name); 1644 $make->classification(@class) if @class > 0; 1645 $make->common_name( $common ) if $common; 1646 $make->name('abbreviated', $abbr_name) if $abbr_name; 1647 $make->organelle($organelle) if $organelle; 1648 #$make->sub_species( $sub_species ) if $sub_species; 1649 return $make; 1650} 1651 1652=head2 _read_FTHelper_GenBank 1653 1654 Title : _read_FTHelper_GenBank 1655 Usage : _read_FTHelper_GenBank($buffer) 1656 Function: reads the next FT key line 1657 Example : 1658 Returns : Bio::SeqIO::FTHelper object 1659 Args : filehandle and reference to a scalar 1660 1661=cut 1662 1663sub _read_FTHelper_GenBank { 1664 my ($self, $buffer) = @_; 1665 1666 my ($key, # The key of the feature 1667 $loc # The location line from the feature 1668 ); 1669 my @qual = (); # An array of lines making up the qualifiers 1670 1671 if ($$buffer =~ /^\s{5}(\S+)\s+(.+?)\s*$/o) { 1672 $key = $1; 1673 $loc = $2; 1674 # Read all the lines up to the next feature 1675 while ( defined(my $line = $self->_readline) ) { 1676 if ($line =~ /^(\s+)(.+?)\s*$/o) { 1677 # Lines inside features are preceded by 21 spaces 1678 # A new feature is preceded by 5 spaces 1679 if (length($1) > 6) { 1680 # Add to qualifiers if we're in the qualifiers, or if it's 1681 # the first qualifier 1682 if (@qual or (index($2,'/') == 0)) { 1683 push @qual, $2; 1684 } 1685 # We're still in the location line, so append to location 1686 else { 1687 $loc .= $2; 1688 } 1689 } 1690 else { 1691 # We've reached the start of the next feature 1692 # Put the first line of the next feature into the buffer 1693 $$buffer = $line; 1694 last; 1695 } 1696 } 1697 else { 1698 # We're at the end of the feature table 1699 # Put the first line of the next feature into the buffer 1700 $$buffer = $line; 1701 last; 1702 } 1703 } 1704 } 1705 else { 1706 # No feature key 1707 $self->debug("no feature key!\n"); 1708 # change suggested by JDiggans to avoid infinite loop- 1709 # see bugreport 1062. 1710 # reset buffer to prevent infinite loop 1711 $$buffer = $self->_readline; 1712 return; 1713 } 1714 1715 # Make the new FTHelper object 1716 my $out = Bio::SeqIO::FTHelper->new; 1717 $out->verbose($self->verbose); 1718 $out->key($key); 1719 $out->loc($loc); 1720 1721 # Now parse and add any qualifiers. (@qual is kept 1722 # intact to provide informative error messages.) 1723 QUAL: 1724 for (my $i = 0; $i < @qual; $i++) { 1725 my $data = $qual[$i]; 1726 my ( $qualifier, $value ) = ($data =~ m{^/([^=]+)(?:=\s*(.+))?}) 1727 or $self->warn( "cannot see new qualifier in feature $key: " 1728 . $data); 1729 $qualifier = '' if not defined $qualifier; 1730 1731 if (defined $value) { 1732 # Do we have a quoted value? 1733 if (substr($value, 0, 1) eq '"') { 1734 # Keep adding to value until we find the trailing quote 1735 # and the quotes are balanced 1736 while ($value !~ /\"$/ or $value =~ tr/"/"/ % 2) { 1737 if ($i >= $#qual) { 1738 $self->warn( "Unbalanced quote in:\n" 1739 . join("\n", @qual) 1740 . "No further qualifiers will " 1741 . "be added for this feature"); 1742 last QUAL; 1743 } 1744 # modifying a for-loop variable inside of the loop 1745 # is not the best programming style ... 1746 $i++; 1747 my $next = $qual[$i]; 1748 1749 # add to value with a space unless the value appears 1750 # to be a sequence (translation for example) 1751 # if (($value.$next) =~ /[^A-Za-z\"\-]/o) { 1752 # changed to explicitly look for translation tag - cjf 06/8/29 1753 if ($qualifier !~ /^translation$/i ) { 1754 $value .= " "; 1755 } 1756 $value .= $next; 1757 } 1758 # Trim leading and trailing quotes 1759 $value =~ s/^"|"$//g; 1760 # Undouble internal quotes 1761 $value =~ s/""/\"/g; 1762 } 1763 elsif ( $value =~ /^\(/ ) { # values quoted by ()s 1764 # Keep adding to value until we find the trailing bracket 1765 # and the ()s are balanced 1766 my $left = ($value =~ tr/\(/\(/); # count left parens 1767 my $right = ($value =~ tr/\)/\)/); # count right parens 1768 1769 while( $left != $right ) { # was "$value !~ /\)$/ or $left != $right" 1770 if ( $i >= $#qual) { 1771 $self->warn( "Unbalanced parens in:\n" 1772 . join("\n", @qual) 1773 . "\nNo further qualifiers will " 1774 . "be added for this feature"); 1775 last QUAL; 1776 } 1777 $i++; 1778 my $next = $qual[$i]; 1779 $value .= $next; 1780 $left += ($next =~ tr/\(/\(/); 1781 $right += ($next =~ tr/\)/\)/); 1782 } 1783 } 1784 } 1785 else { 1786 $value = '_no_value'; 1787 } 1788 # Store the qualifier 1789 $out->field->{$qualifier} ||= []; 1790 push @{$out->field->{$qualifier}}, $value; 1791 } 1792 return $out; 1793} 1794 1795=head2 _write_line_GenBank 1796 1797 Title : _write_line_GenBank 1798 Usage : 1799 Function: internal function 1800 Example : 1801 Returns : 1802 Args : 1803 1804=cut 1805 1806sub _write_line_GenBank { 1807 my ($self, $pre1, $pre2, $line, $length) = @_; 1808 1809 $length or $self->throw("Miscalled write_line_GenBank without length. Programming error!"); 1810 my $subl = $length - length $pre2; 1811 my $linel = length $line; 1812 my $i; 1813 1814 my $subr = substr($line,0,$length - length $pre1); 1815 1816 $self->_print("$pre1$subr\n"); 1817 for($i = ($length - length $pre1); $i < $linel; $i += $subl) { 1818 $subr = substr($line, $i, $subl); 1819 $self->_print("$pre2$subr\n"); 1820 } 1821} 1822 1823=head2 _write_line_GenBank_regex 1824 1825 Title : _write_line_GenBank_regex 1826 Usage : 1827 Function: internal function for writing lines of specified 1828 length, with different first and the next line 1829 left hand headers and split at specific points in the 1830 text 1831 Example : 1832 Returns : nothing 1833 Args : file handle, 1834 first header, 1835 second header, 1836 text-line, 1837 regex for line breaks, 1838 total line length 1839 1840=cut 1841 1842sub _write_line_GenBank_regex { 1843 my ($self, $pre1, $pre2, $line, $regex, $length) = @_; 1844 1845 # print STDOUT "Going to print with $line!\n"; 1846 1847 $length or $self->throw("Miscalled write_line_GenBank without length. Programming error!"); 1848 1849 my $subl = $length - (length $pre1) - 2; 1850 my @lines = (); 1851 1852 CHUNK: 1853 while ($line) { 1854 foreach my $pat ($regex, '[,;\.\/-]\s|'.$regex, '[,;\.\/-]|'.$regex) { 1855 if ($line =~ m/^(.{0,$subl})($pat)(.*)/ ) { 1856 my $l = $1 . $2; 1857 $line = substr($line, length $l); 1858 # be strict about not padding spaces according to 1859 # genbank format 1860 $l =~ s/\s+$//; 1861 next CHUNK if ($l eq ''); 1862 push @lines, $l; 1863 next CHUNK; 1864 } 1865 } 1866 # if we get here none of the patterns matched $subl or less chars 1867 $self->warn( "trouble dissecting \"$line\"\n into chunks " 1868 . "of $subl chars or less - this tag won't print right"); 1869 # insert a space char to prevent infinite loops 1870 $line = substr($line, 0, $subl) . " " . substr($line, $subl); 1871 } 1872 my $s = shift @lines; 1873 $self->_print("$pre1$s\n") if $s; 1874 foreach my $s ( @lines ) { 1875 $self->_print("$pre2$s\n"); 1876 } 1877} 1878 1879=head2 _post_sort 1880 1881 Title : _post_sort 1882 Usage : $obj->_post_sort($newval) 1883 Function: 1884 Returns : value of _post_sort 1885 Args : newvalue (optional) 1886 1887=cut 1888 1889sub _post_sort { 1890 my ($obj,$value) = @_; 1891 if ( defined $value) { 1892 $obj->{'_post_sort'} = $value; 1893 } 1894 return $obj->{'_post_sort'}; 1895} 1896 1897=head2 _show_dna 1898 1899 Title : _show_dna 1900 Usage : $obj->_show_dna($newval) 1901 Function: 1902 Returns : value of _show_dna 1903 Args : newvalue (optional) 1904 1905=cut 1906 1907sub _show_dna { 1908 my ($obj,$value) = @_; 1909 if ( defined $value) { 1910 $obj->{'_show_dna'} = $value; 1911 } 1912 return $obj->{'_show_dna'}; 1913} 1914 1915=head2 _id_generation_func 1916 1917 Title : _id_generation_func 1918 Usage : $obj->_id_generation_func($newval) 1919 Function: 1920 Returns : value of _id_generation_func 1921 Args : newvalue (optional) 1922 1923=cut 1924 1925sub _id_generation_func { 1926 my ($obj,$value) = @_; 1927 if ( defined $value ) { 1928 $obj->{'_id_generation_func'} = $value; 1929 } 1930 return $obj->{'_id_generation_func'}; 1931} 1932 1933=head2 _ac_generation_func 1934 1935 Title : _ac_generation_func 1936 Usage : $obj->_ac_generation_func($newval) 1937 Function: 1938 Returns : value of _ac_generation_func 1939 Args : newvalue (optional) 1940 1941=cut 1942 1943sub _ac_generation_func { 1944 my ($obj,$value) = @_; 1945 if ( defined $value ) { 1946 $obj->{'_ac_generation_func'} = $value; 1947 } 1948 return $obj->{'_ac_generation_func'}; 1949} 1950 1951=head2 _sv_generation_func 1952 1953 Title : _sv_generation_func 1954 Usage : $obj->_sv_generation_func($newval) 1955 Function: 1956 Returns : value of _sv_generation_func 1957 Args : newvalue (optional) 1958 1959=cut 1960 1961sub _sv_generation_func { 1962 my ($obj,$value) = @_; 1963 if ( defined $value ) { 1964 $obj->{'_sv_generation_func'} = $value; 1965 } 1966 return $obj->{'_sv_generation_func'}; 1967} 1968 1969=head2 _kw_generation_func 1970 1971 Title : _kw_generation_func 1972 Usage : $obj->_kw_generation_func($newval) 1973 Function: 1974 Returns : value of _kw_generation_func 1975 Args : newvalue (optional) 1976 1977=cut 1978 1979sub _kw_generation_func { 1980 my ($obj,$value) = @_; 1981 if ( defined $value ) { 1982 $obj->{'_kw_generation_func'} = $value; 1983 } 1984 return $obj->{'_kw_generation_func'}; 1985} 1986 19871; 1988