1#
2# BioPerl module for Bio::SeqIO::genbank
3#
4# Please direct questions and support issues to <bioperl-l@bioperl.org>
5#
6# Cared for by Bioperl project bioperl-l(at)bioperl.org
7#
8# Copyright Elia Stupka and contributors see AUTHORS section
9#
10# You may distribute this module under the same terms as perl itself
11
12# POD documentation - main docs before the code
13
14=head1 NAME
15
16Bio::SeqIO::genbank - GenBank sequence input/output stream
17
18=head1 SYNOPSIS
19
20It is probably best not to use this object directly, but
21rather go through the SeqIO handler:
22
23    $stream = Bio::SeqIO->new(-file   => $filename,
24                              -format => 'GenBank');
25
26    while ( my $seq = $stream->next_seq ) {
27        # do something with $seq
28    }
29
30
31=head1 DESCRIPTION
32
33This object can transform Bio::Seq objects to and from GenBank flat
34file databases.
35
36There is some flexibility here about how to write GenBank output
37that is not fully documented.
38
39=head2 Optional functions
40
41=over 3
42
43=item _show_dna()
44
45(output only) shows the dna or not
46
47=item _post_sort()
48
49(output only) provides a sorting func which is applied to the FTHelpers
50before printing
51
52=item _id_generation_func()
53
54This is function which is called as
55
56   print "ID   ", $func($seq), "\n";
57
58To generate the ID line. If it is not there, it generates a sensible ID
59line using a number of tools.
60
61If you want to output annotations in Genbank format they need to be
62stored in a Bio::Annotation::Collection object which is accessible
63through the Bio::SeqI interface method L<annotation()|annotation>.
64
65The following are the names of the keys which are pulled from a
66L<Bio::Annotation::Collection> object:
67
68 reference       - Should contain Bio::Annotation::Reference objects
69 comment         - Should contain Bio::Annotation::Comment objects
70 dblink          - Should contain a Bio::Annotation::DBLink object
71 segment         - Should contain a Bio::Annotation::SimpleValue object
72 origin          - Should contain a Bio::Annotation::SimpleValue object
73 wgs             - Should contain a Bio::Annotation::SimpleValue object
74
75=back
76
77=head1 Where does the data go?
78
79Data parsed in Bio::SeqIO::genbank is stored in a variety of data
80fields in the sequence object that is returned. Here is a partial list
81of fields.
82
83Items listed as RichSeq or Seq or PrimarySeq and then NAME() tell you
84the top level object which defines a function called NAME() which
85stores this information.
86
87Items listed as Annotation 'NAME' tell you the data is stored the
88associated Bio::AnnotationCollectionI object which is associated with
89Bio::Seq objects.  If it is explicitly requested that no annotations
90should be stored when parsing a record of course they will not be
91available when you try and get them.  If you are having this problem
92look at the type of SeqBuilder that is being used to construct your
93sequence object.
94
95 Comments             Annotation 'comment'
96 References           Annotation 'reference'
97 Segment              Annotation 'segment'
98 Origin               Annotation 'origin'
99 Dbsource             Annotation 'dblink'
100
101 Accessions           PrimarySeq accession_number()
102 Secondary accessions RichSeq get_secondary_accessions()
103 GI number            PrimarySeq primary_id()
104 LOCUS                PrimarySeq display_id()
105 Keywords             RichSeq get_keywords()
106 Dates                RichSeq get_dates()
107 Molecule             RichSeq molecule()
108 Seq Version          RichSeq seq_version()
109 PID                  RichSeq pid()
110 Division             RichSeq division()
111 Features             Seq get_SeqFeatures()
112 Alphabet             PrimarySeq alphabet()
113 Definition           PrimarySeq description() or desc()
114 Version              PrimarySeq version()
115
116 Sequence             PrimarySeq seq()
117
118There is more information in the Feature-Annotation HOWTO about each
119field and how it is mapped to the Sequence object
120L<http://bioperl.org/howtos/Features_and_Annotations_HOWTO.html>.
121
122=head1 FEEDBACK
123
124=head2 Mailing Lists
125
126User feedback is an integral part of the evolution of this and other
127Bioperl modules. Send your comments and suggestions preferably to one
128of the Bioperl mailing lists.  Your participation is much appreciated.
129
130  bioperl-l@bioperl.org                  - General discussion
131  http://bioperl.org/wiki/Mailing_lists  - About the mailing lists
132
133=head2 Support
134
135Please direct usage questions or support issues to the mailing list:
136
137I<bioperl-l@bioperl.org>
138
139rather than to the module maintainer directly. Many experienced and
140reponsive experts will be able look at the problem and quickly
141address it. Please include a thorough description of the problem
142with code and data examples if at all possible.
143
144=head2 Reporting Bugs
145
146Report bugs to the Bioperl bug tracking system to help us keep track
147the bugs and their resolution. Bug reports can be submitted via the web:
148
149  https://github.com/bioperl/bioperl-live/issues
150
151=head1 AUTHOR - Bioperl Project
152
153bioperl-l at bioperl.org
154
155Original author Elia Stupka, elia -at- tigem.it
156
157=head1 CONTRIBUTORS
158
159Ewan Birney birney at ebi.ac.uk
160Jason Stajich jason at bioperl.org
161Chris Mungall cjm at fruitfly.bdgp.berkeley.edu
162Lincoln Stein lstein at cshl.org
163Heikki Lehvaslaiho, heikki at ebi.ac.uk
164Hilmar Lapp, hlapp at gmx.net
165Donald G. Jackson, donald.jackson at bms.com
166James Wasmuth, james.wasmuth at ed.ac.uk
167Brian Osborne, bosborne at alum.mit.edu
168Chris Fields, cjfields at bioperl dot org
169
170=head1 APPENDIX
171
172The rest of the documentation details each of the object
173methods. Internal methods are usually preceded with a _
174
175=cut
176
177# Let the code begin...
178
179package Bio::SeqIO::genbank;
180$Bio::SeqIO::genbank::VERSION = '1.7.7';
181use strict;
182
183use Bio::SeqIO::FTHelper;
184use Bio::SeqFeature::Generic;
185use Bio::Species;
186use Bio::Seq::SeqFactory;
187use Bio::Annotation::Collection;
188use Bio::Annotation::Comment;
189use Bio::Annotation::Reference;
190use Bio::Annotation::DBLink;
191
192use base qw(Bio::SeqIO);
193
194our $FTQUAL_LINE_LENGTH = 60;
195
196our %FTQUAL_NO_QUOTE = map {$_ => 1} qw(
197    anticodon           citation
198    codon               codon_start
199    cons_splice         direction
200    evidence            label
201    mod_base            number
202    rpt_type            rpt_unit
203    transl_except       transl_table
204    usedin
205    );
206
207our %DBSOURCE = map {$_ => 1} qw(
208    EchoBASE     IntAct    SWISS-2DPAGE    ECO2DBASE    ECOGENE    TIGRFAMs
209    TIGR    GO    InterPro    Pfam    PROSITE    SGD    GermOnline
210    HSSP    PhosSite    Ensembl    RGD    AGD    ArrayExpress    KEGG
211    H-InvDB    HGNC    LinkHub    PANTHER    PRINTS    SMART    SMR
212    MGI    MIM    RZPD-ProtExp    ProDom    MEROPS    TRANSFAC    Reactome
213    UniGene    GlycoSuiteDB    PIRSF    HSC-2DPAGE    PHCI-2DPAGE
214    PMMA-2DPAGE    Siena-2DPAGE    Rat-heart-2DPAGE    Aarhus/Ghent-2DPAGE
215    Biocyc    MetaCyc    Biocyc:Metacyc    GenomeReviews    FlyBase
216    TMHOBP    COMPLUYEAST-2DPAGE    OGP    DictyBase    HAMAP
217    PhotoList    Gramene    WormBase    WormPep    Genew    ZFIN
218    PeroxiBase    MaizeDB    TAIR    DrugBank    REBASE    HPA
219    swissprot    GenBank    GenPept    REFSEQ    embl    PDB    UniProtKB
220    DIP    PeptideAtlas    PRIDE    CYGD    HOGENOME    Gene3D
221    Project);
222
223our %VALID_MOLTYPE = map {$_ => 1} qw(NA DNA RNA tRNA rRNA cDNA cRNA ms-DNA
224    mRNA  uRNA  ss-RNA  ss-DNA  snRNA snoRNA PRT);
225
226our %VALID_ALPHABET = (
227    'bp' => 'dna',
228    'aa' => 'protein',
229    'rc' => '' # rc = release candidate; file has no sequences
230);
231
232sub _initialize {
233    my($self, @args) = @_;
234
235    $self->SUPER::_initialize(@args);
236    # hash for functions for decoding keys.
237    $self->{'_func_ftunit_hash'} = {};
238    $self->_show_dna(1); # sets this to one by default. People can change it
239    if ( not defined $self->sequence_factory ) {
240        $self->sequence_factory
241            (Bio::Seq::SeqFactory->new(-verbose => $self->verbose,
242                                       -type    => 'Bio::Seq::RichSeq'));
243    }
244}
245
246=head2 next_seq
247
248 Title   : next_seq
249 Usage   : $seq = $stream->next_seq()
250 Function: returns the next sequence in the stream
251 Returns : Bio::Seq object
252 Args    :
253
254=cut
255
256sub next_seq {
257    my ($self, @args) = @_;
258    my %args    = @args;
259    my $builder = $self->sequence_builder;
260    my $seq;
261    my %params;
262
263  RECORDSTART:
264    while (1) {
265        my $buffer;
266        my ( @acc,        @features );
267        my ( $display_id, $annotation );
268        my $species;
269
270        # initialize; we may come here because of starting over
271        @features   = ();
272        $annotation = undef;
273        @acc        = ();
274        $species    = undef;
275        %params     = ( -verbose => $self->verbose );    # reset hash
276        local ($/) = "\n";
277        while ( defined( $buffer = $self->_readline ) ) {
278            last if index( $buffer, 'LOCUS       ' ) == 0;
279        }
280        return unless defined $buffer;                   # end of file
281        $buffer =~ /^LOCUS\s+(\S.*)$/o
282            or $self->throw(  "GenBank stream with bad LOCUS line. "
283                            . "Not GenBank in my book. Got '$buffer'");
284        my @tokens = split( ' ', $1 );
285
286        # this is important to have the id for display in e.g. FTHelper,
287        # otherwise you won't know which entry caused an error
288        $display_id = shift @tokens;
289        $params{'-display_id'} = $display_id;
290
291        # may still be useful if we don't want the seq
292        my $seqlength = shift @tokens;
293        if ( exists $VALID_ALPHABET{$seqlength} ) {
294            # moved one token too far.  No locus name?
295            $self->warn(  "Bad LOCUS name?  Changing [$params{'-display_id'}] "
296                        . "to 'unknown' and length to '$display_id'"
297            );
298            $params{'-display_id'} = 'unknown';
299            $params{'-length'}     = $display_id;
300
301            # add token back...
302            unshift @tokens, $seqlength;
303        }
304        else {
305            $params{'-length'} = $seqlength;
306        }
307
308        # the alphabet of the entry
309        # shouldn't assign alphabet unless one
310        # is specifically designated (such as for rc files)
311        my $alphabet = lc( shift @tokens );
312        $params{'-alphabet'} =
313          ( exists $VALID_ALPHABET{$alphabet} )
314          ? $VALID_ALPHABET{$alphabet}
315          : $self->warn("Unknown alphabet: $alphabet");
316
317        # for aa there is usually no 'molecule' (mRNA etc)
318        if ( $params{'-alphabet'} eq 'protein' ) {
319            $params{'-molecule'} = 'PRT';
320        }
321        else {
322            $params{'-molecule'} = shift(@tokens);
323        }
324
325        # take care of lower case issues
326        if ( $params{'-molecule'} eq 'dna' or $params{'-molecule'} eq 'rna' ) {
327            $params{'-molecule'} = uc $params{'-molecule'};
328        }
329        $self->debug( "Unrecognized molecule type: " . $params{'-molecule'} )
330            if not exists( $VALID_MOLTYPE{ $params{'-molecule'} } );
331
332        my $circ = shift @tokens;
333        if ( $circ eq 'circular' ) {
334            $params{'-is_circular'} = 1;
335            $params{'-division'}    = shift @tokens;
336        }
337        else {
338            # 'linear' or 'circular' may actually be omitted altogether
339            $params{'-division'} =
340                ( CORE::length($circ) == 3 ) ? $circ : shift @tokens;
341        }
342        my $date = join( ' ', @tokens );    # we lump together the rest
343
344        # this is per request bug #1513
345        # we can handle:
346        # 9-10-2003
347        # 9-10-03
348        # 09-10-2003
349        # 09-10-03
350        if ( $date =~ s/\s*((\d{1,2})-(\w{3})-(\d{2,4})).*/$1/ ) {
351            if ( length($date) < 11 ) {
352                # improperly formatted date
353                # But we'll be nice and fix it for them
354                my ( $d, $m, $y ) = ( $2, $3, $4 );
355                if ( length($d) == 1 ) {
356                    $d = "0$d";
357                }
358
359                # guess the century here
360                if ( length($y) == 2 ) {
361                    if ( $y > 60 ) {    # arbitrarily guess that '60' means 1960
362                        $y = "19$y";
363                    }
364                    else {
365                        $y = "20$y";
366                    }
367                    $self->warn(  "Date was malformed, guessing the "
368                                . "century for $date to be $y\n"
369                    );
370                }
371                $params{'-dates'} = [ join( '-', $d, $m, $y ) ];
372            }
373            else {
374                $params{'-dates'} = [$date];
375            }
376        }
377
378        # set them all at once
379        $builder->add_slot_value(%params);
380        %params = ();
381
382        # parse the rest if desired, otherwise start over
383        if ( not $builder->want_object ) {
384            $builder->make_object;
385            next RECORDSTART;
386        }
387
388        # set up annotation depending on what the builder wants
389        if ( $builder->want_slot('annotation') ) {
390            $annotation = Bio::Annotation::Collection->new;
391        }
392
393        $buffer = $self->_readline;
394        while ( defined( my $line = $buffer ) ) {
395            # Description line(s)
396            if ($line =~ /^DEFINITION\s+(\S.*\S)/) {
397                my @desc = ($1);
398                while ( defined( $line = $self->_readline ) ) {
399                    if ($line =~ /^\s+(.*)/) {
400                        push( @desc, $1 );
401                        next;
402                    }
403                    last;
404                }
405                $builder->add_slot_value( -desc => join( ' ', @desc ) );
406
407                # we'll continue right here because DEFINITION
408                # always comes at the top of the entry
409                $buffer = $line;
410            }
411
412            # accession number (there can be multiple accessions)
413            if ($line =~ /^ACCESSION\s+(\S.*\S)/) {
414                push( @acc, split( /\s+/, $1 ) );
415                while ( defined( $line = $self->_readline ) ) {
416                    if ($line =~ /^\s+(.*)/) {
417                        push( @acc, split( /\s+/, $1 ) );
418                        next;
419                    }
420                    last;
421                }
422                $buffer = $line;
423                next;
424            }
425
426            # PID
427            elsif ($line =~ /^PID\s+(\S+)/) {
428                $params{'-pid'} = $1;
429            }
430
431            # Version number
432            elsif ($line =~ /^VERSION\s+(\S.+)$/) {
433                my ( $acc, $gi ) = split( ' ', $1 );
434                if ( $acc =~ /^\w+\.(\d+)/ ) {
435                    $params{'-version'}     = $1;
436                    $params{'-seq_version'} = $1;
437                }
438                if ( $gi && ( index( $gi, "GI:" ) == 0 ) ) {
439                    $params{'-primary_id'} = substr( $gi, 3 );
440                }
441            }
442
443            # Keywords
444            elsif ($line =~ /^KEYWORDS\s+(\S.*)/) {
445                my @kw = split( /\s*\;\s*/, $1 );
446                while ( defined( $line = $self->_readline ) ) {
447                    chomp $line;
448                    if ($line =~ /^\s+(.*)/) {
449                        push( @kw, split( /\s*\;\s*/, $1 ) );
450                        next;
451                    }
452                    last;
453                }
454
455                @kw && $kw[-1] =~ s/\.$//;
456                $params{'-keywords'} = \@kw;
457                $buffer = $line;
458                next;
459            }
460
461            # Organism name and phylogenetic information
462            elsif ($line =~ /^SOURCE\s+\S/) {
463                if ( $builder->want_slot('species') ) {
464                    $species = $self->_read_GenBank_Species( \$buffer );
465                    $builder->add_slot_value( -species => $species );
466                }
467                else {
468                    while ( defined( $buffer = $self->_readline ) ) {
469                        last if substr( $buffer, 0, 1 ) ne ' ';
470                    }
471                }
472                next;
473            }
474
475            # References
476            elsif ($line =~ /^REFERENCE\s+\S/) {
477                if ($annotation) {
478                    my @refs = $self->_read_GenBank_References( \$buffer );
479                    foreach my $ref (@refs) {
480                        $annotation->add_Annotation( 'reference', $ref );
481                    }
482                }
483                else {
484                    while ( defined( $buffer = $self->_readline ) ) {
485                        last if substr( $buffer, 0, 1 ) ne ' ';
486                    }
487                }
488                next;
489            }
490
491            # Project
492            elsif ($line =~ /^PROJECT\s+(\S.*)/) {
493                if ($annotation) {
494                    my $project =
495                        Bio::Annotation::SimpleValue->new( -value => $1 );
496                    $annotation->add_Annotation( 'project', $project );
497                }
498            }
499
500            # Comments may be plain text or Structured Comments.
501            # Structured Comments are made up of tag/value pairs and have beginning
502            # and end delimiters like ##*-Data-START## and ##*-Data-END##
503            elsif ($line =~ /^COMMENT\s+(\S.*)/) {
504                if ($annotation) {
505                    my $comment = $1;
506                    while ( defined( $line = $self->_readline ) ) {
507                        last if ($line =~ /^\S/);
508                        $comment .= $line;
509                    }
510                    $comment =~ s/  +/ /g;
511                    # Structured Comment, do not remove returns in the tabular section
512                    if ( my ( $text, $table )= $comment
513                        =~ /([^#]*)(##\S+Data-START##.+?##\S+Data-END##)/is
514                        ) {
515                        $text    =~ s/\n/ /g if $text;
516                        $table   =~ s/START##/START##\n/;
517                        $table   =~ s/^\s+//gm;
518                        $comment = $text . "\n" . $table;
519                    }
520                    # Plain text, remove returns
521                    else {
522                        $comment =~ s/\n/ /g;
523                    }
524                    $annotation->add_Annotation(
525                        'comment',
526                        Bio::Annotation::Comment->new(
527                            -text    => $comment,
528                            -tagname => 'comment'
529                        )
530                    );
531                    $buffer = $line;
532                }
533                else {
534                    while ( defined( $buffer = $self->_readline ) ) {
535                        last if substr( $buffer, 0, 1 ) ne ' ';
536                    }
537                }
538                next;
539            }
540
541            # Corresponding Genbank nucleotide id, Genpept only
542            elsif ($line =~ /^DB(?:SOURCE|LINK)\s+(\S.+)/) {
543                if ($annotation) {
544                    my $dbsource = $1;
545                    while ( defined( $line = $self->_readline ) ) {
546                        last if ($line =~ /^\S/);
547                        $dbsource .= $line;
548                    }
549
550                    # deal with UniProKB dbsources
551                    if ( $dbsource =~
552                        s/(UniProt(?:KB)?|swissprot):\s+locus\s+(\S+)\,.+\n//
553                        ) {
554                        $annotation->add_Annotation(
555                            'dblink',
556                            Bio::Annotation::DBLink->new(
557                                -primary_id => $2,
558                                -database   => $1,
559                                -tagname    => 'dblink'
560                            )
561                        );
562                        if ( $dbsource =~ s/\s+created:\s+([^\.]+)\.\n// ) {
563                            $annotation->add_Annotation(
564                                'swissprot_dates',
565                                Bio::Annotation::SimpleValue->new(
566                                    -tagname => 'date_created',
567                                    -value   => $1
568                                )
569                            );
570                        }
571                        while ( $dbsource =~
572                               s/\s+(sequence|annotation)\s+updated:\s+([^\.]+)\.\n//g
573                            ) {
574                            $annotation->add_Annotation(
575                                'swissprot_dates',
576                                Bio::Annotation::SimpleValue->new(
577                                    -tagname => 'date_updated',
578                                    -value   => $2
579                                )
580                            );
581                        }
582                        $dbsource =~ s/\n/ /g;
583                        if ( $dbsource =~
584                            s/\s+xrefs:\s+((?:\S+,\s+)+\S+)\s+xrefs/xrefs/
585                            ) {
586                            # will use $i to determine even or odd
587                            # for swissprot the accessions are paired
588                            my $i = 0;
589                            for my $dbsrc ( split( /,\s+/, $1 ) ) {
590                                if (   $dbsrc =~ /(\S+)\.(\d+)/
591                                    or $dbsrc =~ /(\S+)/
592                                    ) {
593                                    my ( $id, $version ) = ( $1, $2 );
594                                    $version = '' unless defined $version;
595                                    my $db = ( $id =~ /^\d\S{3}/ ) ? 'PDB'
596                                           : ( $i++ % 2 )          ? 'GenPept'
597                                           : 'GenBank';
598
599                                    $annotation->add_Annotation(
600                                        'dblink',
601                                        Bio::Annotation::DBLink->new(
602                                            -primary_id => $id,
603                                            -version    => $version,
604                                            -database   => $db,
605                                            -tagname    => 'dblink'
606                                        )
607                                    );
608                                }
609                            }
610                        }
611                        elsif ( $dbsource =~ s/\s+xrefs:\s+(.+)\s+xrefs/xrefs/i ) {
612                            # download screwed up and ncbi didn't put acc in for gi numbers
613                            my $i = 0;
614                            for my $id ( split( /\,\s+/, $1 ) ) {
615                                my ( $acc, $db );
616                                if ( $id =~ /gi:\s+(\d+)/ ) {
617                                    $acc = $1;
618                                    $db = ( $i++ % 2 ) ? 'GenPept' : 'GenBank';
619                                }
620                                elsif ( $id =~ /pdb\s+accession\s+(\S+)/ ) {
621                                    $acc = $1;
622                                    $db  = 'PDB';
623                                }
624                                else {
625                                    $acc = $id;
626                                    $db  = '';
627                                }
628                                $annotation->add_Annotation(
629                                    'dblink',
630                                    Bio::Annotation::DBLink->new(
631                                        -primary_id => $acc,
632                                        -database   => $db,
633                                        -tagname    => 'dblink'
634                                    )
635                                );
636                            }
637                        }
638                        else {
639                            $self->debug("Cannot match $dbsource\n");
640                        }
641                        if ( $dbsource =~ s/xrefs\s+
642                                            \(non\-sequence\s+databases\):\s+
643                                            ((?:\S+,\s+)+\S+)//x
644                            ) {
645                            for my $id ( split( /\,\s+/, $1 ) ) {
646                                my $db;
647
648                                # this is because GenBank dropped the spaces!!!
649                                # I'm sure we're not going to get this right
650                                ##if ( $id =~ s/^://i ) {
651                                ##    $db = $1;
652                                ##}
653                                $db = substr( $id, 0, index( $id, ':' ) );
654                                if ( not exists $DBSOURCE{$db} ) {
655                                    $db = '';    # do we want 'GenBank' here?
656                                }
657                                $id = substr( $id, index( $id, ':' ) + 1 );
658                                $annotation->add_Annotation(
659                                    'dblink',
660                                    Bio::Annotation::DBLink->new(
661                                        -primary_id => $id,
662                                        -database   => $db,
663                                        -tagname    => 'dblink'
664                                    )
665                                );
666                            }
667                        }
668                    }
669                    else {
670                        if ( $dbsource =~
671                            /^(\S*?):?\s*accession\s+(\S+)\.(\d+)/
672                            ) {
673                            my ( $db, $id, $version ) = ( $1, $2, $3 );
674                            $annotation->add_Annotation(
675                                'dblink',
676                                Bio::Annotation::DBLink->new(
677                                    -primary_id => $id,
678                                    -version    => $version,
679                                    -database   => $db || 'GenBank',
680                                    -tagname    => 'dblink'
681                                )
682                            );
683                        }
684                        elsif ( $dbsource =~ /^(\S*?):?\s*accession\s+(\S+)/ ) {
685                            my ( $db, $id ) = ( $1, $2 );
686                            $annotation->add_Annotation(
687                                'dblink',
688                                Bio::Annotation::DBLink->new(
689                                    -primary_id => $id,
690                                    -database   => $db || 'GenBank',
691                                    -tagname    => 'dblink'
692                                )
693                            );
694                        }
695                        elsif ( $dbsource =~ /(\S+)([\.:])\s*(\S+)/ ) {
696                            my ( $db, $version );
697                            my @ids = ();
698                            if ( $2 eq ':' ) {
699                                $db = $1;
700
701                                # Genbank 192 release notes say this: "The second
702                                # field can consist of multiple comma-separated
703                                # identifiers, if a sequence record has multiple
704                                # DBLINK cross-references of a given type."
705                                # For example: DBLINK      Project:100,200,300"
706                                @ids = split( /,/, $3 );
707                            }
708                            else {
709                                ( $db, $version ) = ( 'GenBank', $3 );
710                                $ids[0] = $1;
711                            }
712
713                            foreach my $id (@ids) {
714                                $annotation->add_Annotation(
715                                    'dblink',
716                                    Bio::Annotation::DBLink->new(
717                                        -primary_id => $id,
718                                        -version    => $version,
719                                        -database   => $db,
720                                        -tagname    => 'dblink'
721                                    )
722                                );
723                            }
724                        }
725                        else {
726                            $self->warn(
727                                "Unrecognized DBSOURCE data: $dbsource\n");
728                        }
729                    }
730                    $buffer = $line;
731                }
732                else {
733                    while ( defined( $buffer = $self->_readline ) ) {
734                        last if substr( $buffer, 0, 1 ) ne ' ';
735                    }
736                }
737                next;
738            }
739
740            # Exit at start of Feature table, or start of sequence
741            if ($line =~ /^(FEATURES|ORIGIN)/) {
742                my $trap;
743            }
744            last if ($line =~ /^(FEATURES|ORIGIN)/);
745
746            # Get next line and loop again
747            $buffer = $self->_readline;
748        }
749        return unless defined $buffer;
750
751        # add them all at once for efficiency
752        $builder->add_slot_value(
753            -accession_number     => shift(@acc),
754            -secondary_accessions => \@acc,
755            %params
756        );
757        $builder->add_slot_value( -annotation => $annotation ) if $annotation;
758        %params = ();    # reset before possible re-use to avoid setting twice
759
760        # start over if we don't want to continue with this entry
761        if ( not $builder->want_object ) {
762            $builder->make_object;
763            next RECORDSTART;
764        }
765
766        # some "minimal" formats may not necessarily have a feature table
767        if (    $builder->want_slot('features')
768            and defined $buffer
769            and $buffer =~ /^FEATURES/o
770            ) {
771            # need to read the first line of the feature table
772            $buffer = $self->_readline;
773
774            # DO NOT read lines in the while condition -- this is done
775            # as a side effect in _read_FTHelper_GenBank!
776
777            # part of new circular spec:
778            # commented out for now until kinks worked out
779            #my $sourceEnd = 0;
780            #$sourceEnd = $2 if ($buffer =~ /(\d+?)\.\.(\d+?)$/);
781
782            while ( defined $buffer ) {
783                # check immediately -- not at the end of the loop
784                # note: GenPept entries obviously do not have a BASE line
785                last if ( $buffer =~ /^BASE|ORIGIN|CONTIG|WGS/o );
786
787                # slurp in one feature at a time -- at return, the start of
788                # the next feature will have been read already, so we need
789                # to pass a reference, and the called method must set this
790                # to the last line read before returning
791                my $ftunit = $self->_read_FTHelper_GenBank( \$buffer );
792
793                # implement new circular spec: features that cross the origin are now
794                # seamless instead of being 2 separate joined features
795                # commented out until kinks get worked out
796                #if ((! $args{'-nojoin'}) && $ftunit->{'loc'} =~ /^join\((\d+?)\.\.(\d+?),(\d+?)..(\d+?)\)$/
797                #&& $sourceEnd == $2 && $3 == 1) {
798                #my $start = $1;
799                #my $end = $2 + $4;
800                #$ftunit->{'loc'} = "$start..$end";
801                #}
802
803                # fix suggested by James Diggans
804
805                if ( not defined $ftunit ) {
806                    # GRRRR. We have fallen over. Try to recover
807                    $self->warn(  "Unexpected error in feature table for "
808                                . $params{'-display_id'}
809                                . " Skipping feature, attempting to recover" );
810
811                    unless (   $buffer =~ /^\s{5,5}\S+/o
812                            or $buffer =~ /^\S+/o
813                        ) {
814                        $buffer = $self->_readline;
815                    }
816                    next;    # back to reading FTHelpers
817                }
818
819                # process ftunit
820                my $feat =
821                    $ftunit->_generic_seqfeature( $self->location_factory,
822                                                  $display_id );
823
824                # add taxon_id from source if available
825                if (   $species
826                    and $feat->primary_tag eq 'source'
827                    and $feat->has_tag('db_xref')
828                    and (    not $species->ncbi_taxid
829                         or (    $species->ncbi_taxid
830                             and $species->ncbi_taxid =~ /^list/ ) )
831                    ) {
832                    foreach my $tagval ( $feat->get_tag_values('db_xref') ) {
833                        if ( index( $tagval, "taxon:" ) == 0 ) {
834                            $species->ncbi_taxid( substr( $tagval, 6 ) );
835                            last;
836                        }
837                    }
838                }
839
840                # add feature to list of features
841                push( @features, $feat );
842            }
843            $builder->add_slot_value( -features => \@features );
844        }
845
846        if ( defined $buffer ) {
847            # CONTIG lines: TODO, this needs to be cleaned up
848            if ($buffer =~/^CONTIG\s+(.*)/o) {
849                my $ctg = $1;
850                while ( defined( $buffer = $self->_readline ) ) {
851                    last if $buffer =~ m{^ORIGIN|//}o;
852                    $buffer =~ s/\s+(.*)/$1/;
853                    $ctg .= $buffer;
854                }
855                if ($ctg) {
856                    $annotation->add_Annotation(
857                        Bio::Annotation::SimpleValue->new(
858                            -tagname => 'contig',
859                            -value   => $ctg
860                        )
861                    );
862                }
863            }
864            elsif ($buffer =~ /^WGS|WGS_SCAFLD\s+/o) {    # catch WGS/WGS_SCAFLD lines
865                while ( $buffer =~ s/(^WGS|WGS_SCAFLD)\s+// ) {    # gulp lines
866                    chomp $buffer;
867                    $annotation->add_Annotation(
868                        Bio::Annotation::SimpleValue->new(
869                            -value   => $buffer,
870                            -tagname => lc $1
871                        )
872                    );
873                    $buffer = $self->_readline;
874                }
875            }
876            elsif ( $buffer !~ m{^ORIGIN|//}o ) {    # advance to the sequence, if any
877                while ( defined( $buffer = $self->_readline ) ) {
878                    last if $buffer =~ m{^(ORIGIN|//)};
879                }
880            }
881        }
882        if ( not $builder->want_object ) {
883            $builder->make_object;        # implicit end-of-object
884            next RECORDSTART;
885        }
886        if ( $builder->want_slot('seq') ) {
887            # the fact that we want a sequence does not necessarily mean that
888            # there also is a sequence ...
889            if ( defined $buffer and $buffer =~ s/^ORIGIN\s+// ) {
890                if ( $annotation and length($buffer) > 0 ) {
891                    $annotation->add_Annotation(
892                        'origin',
893                        Bio::Annotation::SimpleValue->new(
894                            -tagname => 'origin',
895                            -value   => $buffer
896                        )
897                    );
898                }
899                my $seqc = '';
900                while ( defined( $buffer = $self->_readline ) ) {
901                    last if $buffer =~ m{^//};
902                    $buffer = uc $buffer;
903                    $buffer =~ s/[^A-Za-z]//g;
904                    $seqc .= $buffer;
905                }
906
907                $builder->add_slot_value( -seq => $seqc );
908            }
909        }
910        elsif ( defined($buffer) and ( substr( $buffer, 0, 2 ) ne '//' ) ) {
911            # advance to the end of the record
912            while ( defined( $buffer = $self->_readline ) ) {
913                last if substr( $buffer, 0, 2 ) eq '//';
914            }
915        }
916
917        # Unlikely, but maybe the sequence is so weird that we don't want it
918        # anymore. We don't want to return undef if the stream's not exhausted
919        # yet.
920        $seq = $builder->make_object;
921        next RECORDSTART unless $seq;
922        last RECORDSTART;
923    }    # end while RECORDSTART
924
925    return $seq;
926}
927
928=head2 write_seq
929
930 Title   : write_seq
931 Usage   : $stream->write_seq($seq)
932 Function: writes the $seq object (must be seq) to the stream
933 Returns : 1 for success and 0 for error
934 Args    : array of 1 to n Bio::SeqI objects
935
936=cut
937
938sub write_seq {
939    my ($self,@seqs) = @_;
940
941    foreach my $seq ( @seqs ) {
942        $self->throw("Attempting to write with no seq!") unless defined $seq;
943
944        if ( not ref $seq or not $seq->isa('Bio::SeqI') ) {
945            $self->warn(" $seq is not a SeqI compliant module. Attempting to dump, but may fail!");
946        }
947
948        my $str   = $seq->seq;
949        my $len   = $seq->length;
950        my $alpha = $seq->alphabet;
951
952        my ($div, $mol);
953        if (   not $seq->can('division')
954            or not defined($div = $seq->division)
955            ) {
956            $div = 'UNK';
957        }
958        if (   not $seq->can('molecule')
959            or not defined ($mol = $seq->molecule)
960           ) {
961            $mol =  $alpha || 'DNA';
962        }
963
964        my $circular = ($seq->is_circular) ? 'circular' : 'linear  ';
965
966        local($^W) = 0; # suppressing warnings about uninitialized fields.
967
968        my $temp_line;
969        if ( $self->_id_generation_func ) {
970            $temp_line = &{$self->_id_generation_func}($seq);
971        }
972        else {
973            my $date = '';
974            if ( $seq->can('get_dates') ) {
975                ($date) = $seq->get_dates;
976            }
977
978            $self->warn("No whitespace allowed in GenBank display id [". $seq->display_id. "]")
979                if $seq->display_id =~ /\s/;
980
981            my @data = ( lc($alpha) eq 'protein' ) ? ('aa', '', '') : ('bp', '', $mol);
982            $temp_line = sprintf ("%-12s%-15s%13s %s%4s%-8s%-8s %3s %-s\n",
983                                  'LOCUS', $seq->id, $len,
984                                  @data, $circular, $div, $date);
985        }
986
987        $self->_print($temp_line);
988        $self->_write_line_GenBank_regex("DEFINITION  ", "            ",
989                                         $seq->desc,     "\\s\+\|\$",80);
990
991        # if there, write the accession line
992
993        if ( $self->_ac_generation_func ) {
994            $temp_line = &{$self->_ac_generation_func}($seq);
995            $self->_print("ACCESSION   $temp_line\n");
996        }
997        else {
998            my @acc = ();
999            push @acc, $seq->accession_number;
1000            if ( $seq->isa('Bio::Seq::RichSeqI') ) {
1001                push @acc, $seq->get_secondary_accessions;
1002            }
1003            $self->_print("ACCESSION   ", join(" ", @acc), "\n");
1004            # otherwise - cannot print <sigh>
1005        }
1006
1007        # if PID defined, print it
1008        if ($seq->isa('Bio::Seq::RichSeqI') and $seq->pid) {
1009            $self->_print("PID         ", $seq->pid, "\n");
1010        }
1011
1012        # if there, write the version line
1013        if ( defined $self->_sv_generation_func ) {
1014            $temp_line = &{$self->_sv_generation_func}($seq);
1015            if ( $temp_line ) {
1016                $self->_print("VERSION     $temp_line\n");
1017            }
1018        }
1019        elsif ($seq->isa('Bio::Seq::RichSeqI') and defined($seq->seq_version)) {
1020            my $id = $seq->primary_id; # this may be a GI number
1021            my $data = (defined $id and $id =~ /^\d+$/) ? "  GI:$id" : "";
1022            $self->_print("VERSION     ",
1023                          $seq->accession_number, ".",
1024                          $seq->seq_version, $data, "\n");
1025        }
1026
1027        # if there, write the PROJECT line
1028        for my $proj ( $seq->annotation->get_Annotations('project') ) {
1029            $self->_print("PROJECT     ".$proj->value."\n");
1030        }
1031
1032        # if there, write the DBSOURCE line
1033        foreach my $ref ( $seq->annotation->get_Annotations('dblink') ) {
1034            my ($db, $id) = ($ref->database, $ref->primary_id);
1035            my $prefix = $db eq 'Project' ? 'DBLINK' : 'DBSOURCE';
1036            my $text   = $db eq 'GenBank' ? ''
1037                       : $db eq 'Project' ? "$db:$id"
1038                       : "$db accession $id";
1039            $self->_print(sprintf ("%-11s %s\n", $prefix, $text));
1040        }
1041
1042        # if there, write the keywords line
1043        if ( defined $self->_kw_generation_func ) {
1044            $temp_line = &{$self->_kw_generation_func}($seq);
1045            $self->_print("KEYWORDS    $temp_line\n");
1046        }
1047        elsif ( $seq->can('keywords') ) {
1048            my $kw = $seq->keywords;
1049            $kw .= '.' if ( $kw !~ /\.$/ );
1050            $self->_print("KEYWORDS    $kw\n");
1051        }
1052
1053        # SEGMENT if it exists
1054        foreach my $ref ( $seq->annotation->get_Annotations('segment') ) {
1055            $self->_print(sprintf ("%-11s %s\n",'SEGMENT',
1056                                   $ref->value));
1057        }
1058
1059        # Organism lines
1060        if (my $spec = $seq->species) {
1061            my ($on, $sn, $cn) = ($spec->can('organelle') ? $spec->organelle : '',
1062                                  $spec->scientific_name,
1063                                  $spec->common_name);
1064            my @classification;
1065            if ($spec->isa('Bio::Species')) {
1066                @classification = $spec->classification;
1067                shift @classification;
1068            }
1069            else {
1070                # Bio::Taxon should have a DB handle of some type attached, so
1071                # derive the classification from that
1072                my $node = $spec;
1073                while ($node) {
1074                    $node = $node->ancestor || last;
1075                    unshift @classification, $node->node_name;
1076                    #$node eq $root && last;
1077                }
1078                @classification = reverse @classification;
1079            }
1080            my $abname = $spec->name('abbreviated') ? # from genbank file
1081                         $spec->name('abbreviated')->[0] : $sn;
1082            my $sl = $on ? "$on "          : '';
1083            $sl   .= $cn ? "$abname ($cn)" : $abname;
1084
1085            $self->_write_line_GenBank_regex("SOURCE      ", ' 'x12, $sl, "\\s\+\|\$", 80);
1086            $self->_print("  ORGANISM  ", $spec->scientific_name, "\n");
1087            my $OC = join('; ', reverse @classification) . '.';
1088            $self->_write_line_GenBank_regex(' 'x12,' 'x12, $OC, "\\s\+\|\$", 80);
1089        }
1090
1091        # Reference lines
1092        my $count = 1;
1093        foreach my $ref ( $seq->annotation->get_Annotations('reference') ) {
1094            $temp_line = "REFERENCE   $count";
1095            if ($ref->start) {
1096                $temp_line .= sprintf ("  (%s %d to %d)",
1097                                       ($seq->alphabet() eq "protein" ?
1098                                        "residues" : "bases"),
1099                                       $ref->start, $ref->end);
1100            }
1101            elsif ($ref->gb_reference) {
1102                $temp_line .= sprintf ("  (%s)", $ref->gb_reference);
1103            }
1104            $self->_print("$temp_line\n");
1105            $self->_write_line_GenBank_regex("  AUTHORS   ", ' 'x12,
1106                                             $ref->authors,    "\\s\+\|\$", 80);
1107            $self->_write_line_GenBank_regex("  CONSRTM   ", ' 'x12,
1108                                             $ref->consortium, "\\s\+\|\$", 80) if $ref->consortium;
1109            $self->_write_line_GenBank_regex("  TITLE     ", ' 'x12,
1110                                             $ref->title,      "\\s\+\|\$", 80);
1111            $self->_write_line_GenBank_regex("  JOURNAL   ", ' 'x12,
1112                                             $ref->location,   "\\s\+\|\$", 80);
1113            if ( $ref->medline) {
1114                $self->_write_line_GenBank_regex("  MEDLINE   ", ' 'x12,
1115                                                 $ref->medline,  "\\s\+\|\$", 80);
1116                # I am assuming that pubmed entries only exist when there
1117                # are also MEDLINE entries due to the indentation
1118            }
1119            # This could be a wrong assumption
1120            if ( $ref->pubmed ) {
1121                $self->_write_line_GenBank_regex("   PUBMED   ", ' 'x12,
1122                                                 $ref->pubmed,   "\\s\+\|\$", 80);
1123            }
1124            # put remark at the end
1125            if ($ref->comment) {
1126                $self->_write_line_GenBank_regex("  REMARK    ", ' 'x12,
1127                                                 $ref->comment,  "\\s\+\|\$", 80);
1128            }
1129            $count++;
1130        }
1131
1132        # Comment lines
1133        foreach my $comment ( $seq->annotation->get_Annotations('comment') ) {
1134            $self->_write_line_GenBank_regex("COMMENT     ", ' 'x12,
1135                                             $comment->text, "\\s\+\|\$", 80);
1136        }
1137
1138        # FEATURES section
1139        $self->_print("FEATURES             Location/Qualifiers\n");
1140
1141        if ( defined $self->_post_sort ) {
1142            # we need to read things into an array. Process. Sort them. Print 'em
1143            my $post_sort_func = $self->_post_sort;
1144            my @fth;
1145
1146            foreach my $sf ( $seq->top_SeqFeatures ) {
1147                push @fth, Bio::SeqIO::FTHelper::from_SeqFeature($sf, $seq);
1148            }
1149
1150            @fth = sort { &$post_sort_func($a, $b) } @fth;
1151
1152            foreach my $fth ( @fth ) {
1153                $self->_print_GenBank_FTHelper($fth);
1154            }
1155        }
1156        else {
1157            # not post sorted. And so we can print as we get them.
1158            # lower memory load...
1159            foreach my $sf ( $seq->top_SeqFeatures ) {
1160                my @fth = Bio::SeqIO::FTHelper::from_SeqFeature($sf, $seq);
1161                foreach my $fth ( @fth ) {
1162                    if ( ! $fth->isa('Bio::SeqIO::FTHelper') ) {
1163                        $sf->throw("Cannot process FTHelper... $fth");
1164                    }
1165                    $self->_print_GenBank_FTHelper($fth);
1166                }
1167            }
1168        }
1169
1170        # deal with WGS; WGS_SCAFLD present only if WGS is also present
1171        if ($seq->annotation->get_Annotations('wgs')) {
1172            foreach my $wgs (map {$seq->annotation->get_Annotations($_)}
1173                             qw(wgs wgs_scaffold)
1174                ) {
1175                $self->_print(sprintf ("%-11s %s\n",
1176                                       uc($wgs->tagname),
1177                                       $wgs->value));
1178            }
1179            $self->_show_dna(0);
1180        }
1181        if ($seq->annotation->get_Annotations('contig')) {
1182            my $ct = 0;
1183            my $cline;
1184            foreach my $contig ($seq->annotation->get_Annotations('contig')) {
1185                unless ($ct) {
1186                    $cline = uc($contig->tagname) . "      " . $contig->value . "\n";
1187                }
1188                else {
1189                    $cline = "            " . $contig->value . "\n";
1190                }
1191                $self->_print($cline);
1192                $ct++;
1193            }
1194        }
1195        if ( $seq->length == 0 ) {
1196            $self->_show_dna(0);
1197        }
1198
1199        if ( $self->_show_dna == 0 ) {
1200            $self->_print("\n//\n");
1201            return;
1202        }
1203
1204        # finished printing features.
1205
1206        $str =~ tr/A-Z/a-z/;
1207
1208        my ($o) = $seq->annotation->get_Annotations('origin');
1209        $self->_print(sprintf("%-12s%s\n",
1210                              'ORIGIN', $o ? $o->value : ''));
1211        # print out the sequence
1212        my $nuc = 60;           # Number of nucleotides per line
1213        my $whole_pat = 'a10' x 6; # Pattern for unpacking a whole line
1214        my $out_pat   = 'A11' x 6; # Pattern for packing a line
1215        my $length = length $str;
1216
1217        # Calculate the number of nucleotides which fit on whole lines
1218        my $whole = int($length / $nuc) * $nuc;
1219
1220        # Print the whole lines
1221        my $i;
1222        for ($i = 0; $i < $whole; $i += $nuc) {
1223            my $blocks = pack $out_pat,
1224            unpack $whole_pat,
1225            substr($str, $i, $nuc);
1226            chop $blocks;
1227            $self->_print(sprintf("%9d $blocks\n", $i + $nuc - 59));
1228        }
1229
1230        # Print the last line
1231        if (my $last = substr($str, $i)) {
1232            my $last_len = length($last);
1233            my $last_pat = 'a10' x int($last_len / 10)
1234                         . 'a' . $last_len % 10;
1235            my $blocks = pack $out_pat,
1236            unpack($last_pat, $last);
1237            $blocks =~ s/ +$//;
1238            $self->_print(sprintf("%9d $blocks\n",
1239                                  $length - $last_len + 1));
1240        }
1241
1242        $self->_print("//\n");
1243
1244        $self->flush if $self->_flush_on_write && defined $self->_fh;
1245        return 1;
1246    }
1247}
1248
1249=head2 _print_GenBank_FTHelper
1250
1251 Title   : _print_GenBank_FTHelper
1252 Usage   :
1253 Function:
1254 Example :
1255 Returns :
1256 Args    :
1257
1258=cut
1259
1260sub _print_GenBank_FTHelper {
1261    my ( $self, $fth ) = @_;
1262
1263    if ( not ref $fth or not $fth->isa('Bio::SeqIO::FTHelper') ) {
1264        $fth->warn(
1265            "$fth is not a FTHelper class. Attempting to print but there could be issues"
1266        );
1267    }
1268
1269    my $spacer = ( length $fth->key >= 15 ) ? ' ' : '';
1270    $self->_write_line_GenBank_regex(
1271        sprintf( "     %-16s%s", $fth->key, $spacer ),
1272                 " " x 21, $fth->loc, "\,\|\$", 80 );
1273
1274    foreach my $tag ( sort keys %{ $fth->field } ) {
1275        # Account for hash structure in Annotation::DBLink, not the expected array
1276        if ( $tag eq 'db_xref' and grep /HASH/, @{ $fth->field->{$tag} }) {
1277            for my $ref ( @{ $fth->field->{$tag} } ) {
1278                my $db = $ref->{'database'};
1279                my $id = $ref->{'primary_id'};
1280                $self->_write_line_GenBank_regex
1281                    ( " " x 21, " " x 21,
1282                      "/$tag=\"$db:$id\"", "\.\|\$", 80 );
1283            }
1284        }
1285        # The usual case, where all values are found in an array
1286        else {
1287            foreach my $value ( @{ $fth->field->{$tag} } ) {
1288                $value =~ s/\"/\"\"/g;
1289                if ( $value eq "_no_value" ) {
1290                    $self->_write_line_GenBank_regex
1291                        ( " " x 21, " " x 21,
1292                          "/$tag", "\.\|\$", 80 );
1293                }
1294
1295               # There are almost 3x more quoted qualifier values and they
1296               # are more common too so we take quoted ones first.
1297               # Long qualifiers, that will be line wrapped, are always quoted
1298
1299               # Note (cjf): any tags explicitly *not* quoted must
1300               # remain unquoted in all output for proper round trip
1301               # parsing:
1302               # see https://github.com/bioperl/bioperl-live/issues/321
1303
1304               # If this breaks *NCBI* examples please file an issue with
1305               # example files
1306                elsif (   not $FTQUAL_NO_QUOTE{$tag} ) {
1307                    my ($pat) = ( $value =~ /\s/ ? '\s|$' :
1308                                  '.|$' );
1309                    $self->_write_line_GenBank_regex
1310                        ( " " x 21, " " x 21,
1311                          "/$tag=\"$value\"", $pat, 80 );
1312                }
1313                else {
1314                    my ($pat) = ( $value =~ /[,]/ ? '[,]|$' :
1315                                  '.|$' );
1316                    $self->_write_line_GenBank_regex
1317                        ( " " x 21, " " x 21,
1318                          "/$tag=$value", $pat, 80 );
1319                }
1320            }
1321        }
1322    }
1323}
1324
1325=head2 _read_GenBank_References
1326
1327 Title   : _read_GenBank_References
1328 Usage   :
1329 Function: Reads references from GenBank format. Internal function really
1330 Returns :
1331 Args    :
1332
1333=cut
1334
1335sub _read_GenBank_References {
1336    my ($self, $buffer) = @_;
1337    my (@refs);
1338    my $ref;
1339
1340    # assume things are starting with RN
1341    if ( $$buffer !~ /^REFERENCE/ ) {
1342        warn("Not parsing line '$$buffer' which maybe important");
1343    }
1344
1345    my $line = $$buffer;
1346
1347    my (@title,@loc,@authors,@consort,@com,@medline,@pubmed);
1348
1349  REFLOOP:
1350    while( defined($line) or defined($line = $self->_readline) ) {
1351        if ($line =~ /^\s{2}AUTHORS\s+(.*)/o) {
1352            push @authors, $1;
1353            while ( defined($line = $self->_readline) ) {
1354                if ($line =~ /^\s{9,}(.*)/o) {
1355                    push @authors, $1;
1356                    next;
1357                }
1358                last;
1359            }
1360            $ref->authors(join(' ', @authors));
1361        }
1362
1363        if ($line =~ /^\s{2}CONSRTM\s+(.*)/o) {
1364            push @consort, $1;
1365            while ( defined($line = $self->_readline) ) {
1366                if ($line =~ /^\s{9,}(.*)/o) {
1367                    push @consort, $1;
1368                    next;
1369                }
1370                last;
1371            }
1372            $ref->consortium(join(' ', @consort));
1373        }
1374
1375        if ($line =~ /^\s{2}TITLE\s+(.*)/o) {
1376            push @title, $1;
1377            while ( defined($line = $self->_readline) ) {
1378                if ($line =~ /^\s{9,}(.*)/o) {
1379                    push @title, $1;
1380                    next;
1381                }
1382                last;
1383            }
1384            $ref->title(join(' ', @title));
1385        }
1386
1387        if ($line =~ /^\s{2}JOURNAL\s+(.*)/o) {
1388            push @loc, $1;
1389            while ( defined($line = $self->_readline) ) {
1390                # we only match when there are at least 4 spaces
1391                # there is probably a better way to match this
1392                # as it assumes that the describing tag is short enough
1393                if ($line =~ /^\s{9,}(.*)/o) {
1394                    push @loc, $1;
1395                    next;
1396                }
1397                last;
1398            }
1399            $ref->location(join(' ', @loc));
1400            redo REFLOOP;
1401        }
1402
1403        if ($line =~ /^\s{2}REMARK\s+(.*)/o) {
1404            push @com, $1;
1405            while ( defined($line = $self->_readline) ) {
1406                if ($line =~ /^\s{9,}(.*)/o) {
1407                    push @com, $1;
1408                    next;
1409                }
1410                last;
1411            }
1412            $ref->comment(join(' ', @com));
1413            redo REFLOOP;
1414        }
1415
1416        if ( $line =~ /^\s{2}MEDLINE\s+(.*)/ ) {
1417            push @medline, $1;
1418            while ( defined($line = $self->_readline) ) {
1419                if ($line =~ /^\s{9,}(.*)/) {
1420                    push @medline, $1;
1421                    next;
1422                }
1423                last;
1424            }
1425            $ref->medline(join(' ', @medline));
1426            redo REFLOOP;
1427        }
1428
1429        if ( $line =~ /^\s{3}PUBMED\s+(.*)/ ) {
1430            push @pubmed, $1;
1431            while ( defined($line = $self->_readline) ) {
1432                if ($line =~ /^\s{9,}(.*)/) {
1433                    push @pubmed, $1;
1434                    next;
1435                }
1436                last;
1437            }
1438            $ref->pubmed(join(' ', @pubmed));
1439            redo REFLOOP;
1440        }
1441
1442        if ( $line =~ /^REFERENCE/o ) {
1443            # store current reference
1444            $self->_add_ref_to_array(\@refs,$ref) if defined $ref;
1445
1446            # reset
1447            @authors = ();
1448            @title   = ();
1449            @loc     = ();
1450            @com     = ();
1451            @pubmed  = ();
1452            @medline = ();
1453
1454            # create the new reference object
1455            $ref = Bio::Annotation::Reference->new(-tagname => 'reference');
1456
1457            # check whether start and end base is given
1458            if ($line =~ /^REFERENCE\s+\d+\s+\([a-z]+ (\d+) to (\d+)\)/){
1459                $ref->start($1);
1460                $ref->end($2);
1461            }
1462            elsif ($line =~ /^REFERENCE\s+\d+\s+\((.*)\)/) {
1463                $ref->gb_reference($1);
1464            }
1465        }
1466
1467        last if ($line =~ /^(FEATURES)|(COMMENT)/o);
1468
1469        $line = undef; # Empty $line to trigger read of next line
1470    }
1471
1472    # store last reference
1473    $self->_add_ref_to_array(\@refs, $ref) if defined $ref;
1474
1475    $$buffer = $line;
1476
1477    #print "\nnumber of references found: ", $#refs+1,"\n";
1478
1479    return @refs;
1480}
1481
1482=head2 _add_ref_to_array
1483
1484Title: _add_ref_to_array
1485Usage:
1486Function: Adds a Reference object to an array of Reference objects, takes
1487          care of possible cleanups to be done (currently, only author and title
1488          will be chopped of trailing semicolons).
1489Args:     A reference to an array of Reference objects and
1490          the Reference object to be added
1491Returns: nothing
1492
1493=cut
1494
1495sub _add_ref_to_array {
1496    my ($self, $refs, $ref) = @_;
1497
1498    # first, polish author and title by removing possible trailing semicolons
1499    my $au    = $ref->authors;
1500    my $title = $ref->title;
1501    $au    =~ s/;\s*$//g if $au;
1502    $title =~ s/;\s*$//g if $title;
1503    $ref->authors($au);
1504    $ref->title($title);
1505    # the rest should be clean already, so go ahead and add it
1506    push @{$refs}, $ref;
1507}
1508
1509=head2 _read_GenBank_Species
1510
1511 Title   : _read_GenBank_Species
1512 Usage   :
1513 Function: Reads the GenBank Organism species and classification
1514           lines. Able to deal with unconvential Organism naming
1515           formats, and varietas in plants
1516 Example : ORGANISM  unknown marine gamma proteobacterium NOR5
1517           $genus = undef
1518           $species = unknown marine gamma proteobacterium NOR5
1519
1520           ORGANISM  Drosophila sp. 'white tip scutellum'
1521           $genus = Drosophila
1522           $species = sp. 'white tip scutellum'
1523           (yes, this really is a species and that is its name)
1524           $subspecies = undef
1525
1526           ORGANISM  Ajellomyces capsulatus var. farciminosus
1527           $genus = Ajellomyces
1528           $species = capsulatus
1529           $subspecies = var. farciminosus
1530
1531           ORGANISM  Hepatitis delta virus
1532           $genus = undef (though this virus has a genus in its lineage, we
1533                           cannot know that without a database lookup)
1534           $species = Hepatitis delta virus
1535
1536 Returns : A Bio::Species object
1537 Args    : A reference to the current line buffer
1538
1539=cut
1540
1541sub _read_GenBank_Species {
1542    my ($self, $buffer) = @_;
1543
1544    my @unkn_names = ('other', 'unknown organism', 'not specified', 'not shown',
1545                      'Unspecified', 'Unknown', 'None', 'unclassified',
1546                      'unidentified organism', 'not supplied');
1547    # dictionary of synonyms for taxid 32644
1548    my @unkn_genus = ('unknown', 'unclassified', 'uncultured', 'unidentified');
1549    # all above can be part of valid species name
1550
1551    my $line = $$buffer;
1552
1553    my( $sub_species, $species, $genus, $sci_name, $common,
1554        $class_lines, $source_flag, $abbr_name, $organelle, $sl );
1555    my %source = map { $_ => 1 } qw(SOURCE ORGANISM CLASSIFICATION);
1556
1557    # upon first entering the loop, we must not read a new line -- the SOURCE
1558    # line is already in the buffer (HL 05/10/2000)
1559    my ($ann, $tag, $data);
1560    while (defined($line) or defined($line = $self->_readline)) {
1561        # de-HTMLify (links that may be encountered here don't contain
1562        # escaped '>', so a simple-minded approach suffices)
1563        $line =~ s{<[^>]+>}{}g;
1564        if ($line =~ m{^(?:\s{0,2})(\w+)\s+(.+)?$}ox) {
1565            ($tag, $data) = ($1, $2 || '');
1566            last if ($tag and not exists $source{$tag});
1567        }
1568        else {
1569            return unless $tag;
1570            ($data = $line) =~ s{^\s+}{};
1571            chomp $data;
1572            $tag = 'CLASSIFICATION' if (    $tag ne 'CLASSIFICATION'
1573                                        and $tag eq 'ORGANISM'
1574                                        # Don't match "str." or "var." (fix NC_021815),
1575                                        # and don't match ".1" (fix NC_021902)
1576                                        and $line =~ m{(?<!\bstr|\bvar)[;\.]+(?!\d)});
1577        }
1578        (exists $ann->{$tag}) ? ($ann->{$tag} .= ' '.$data) : ($ann->{$tag} .= $data);
1579        $line = undef;
1580    }
1581
1582    ($sl, $class_lines, $sci_name) = ($ann->{SOURCE}, $ann->{CLASSIFICATION}, $ann->{ORGANISM});
1583
1584    $$buffer = $line;
1585
1586    $sci_name or return;
1587
1588    # parse out organelle, common name, abbreviated name if present;
1589    # this should catch everything, but falls back to
1590    # entire SOURCE line just in case
1591    if ($sl =~ m{^(mitochondrion|chloroplast|plastid)?
1592                  \s*(.*?)
1593                  \s*(?: \( (.*?) \) )?\.?
1594                  $
1595                 }xms
1596        ) {
1597        ($organelle, $abbr_name, $common) = ($1, $2, $3); # optional
1598    }
1599    else {
1600        $abbr_name = $sl; # nothing caught; this is a backup!
1601    }
1602
1603    # Convert data in classification lines into classification array.
1604    # only split on ';' or '.' so that classification that is 2 or more words will
1605    # still get matched, use map() to remove trailing/leading/intervening spaces
1606    my @class = map { $_ =~ s/^\s+//;
1607                      $_ =~ s/\s+$//;
1608                      $_ =~ s/\s{2,}/ /g;
1609                      $_; }
1610                split /(?<!subgen)[;\.]+/, $class_lines;
1611
1612    # do we have a genus?
1613    my $possible_genus =  quotemeta($class[-1])
1614                       . ($class[-2] ? "|" . quotemeta($class[-2]) : '');
1615    if ($sci_name =~ /^($possible_genus)/) {
1616        $genus = $1;
1617        ($species) = $sci_name =~ /^$genus\s+(.+)/;
1618    }
1619    else {
1620        $species = $sci_name;
1621    }
1622
1623    # is this organism of rank species or is it lower?
1624    # (we don't catch everything lower than species, but it doesn't matter -
1625    # this is just so we abide by previous behaviour whilst not calling a
1626    # species a subspecies)
1627    if ($species and $species =~ /(.+)\s+((?:subsp\.|var\.).+)/) {
1628        ($species, $sub_species) = ($1, $2);
1629    }
1630
1631    # Don't make a species object if it's empty or "Unknown" or "None"
1632    # return unless $genus and  $genus !~ /^(Unknown|None)$/oi;
1633    # Don't make a species object if it belongs to taxid 32644
1634#    my $unkn = grep { $_ =~ /^\Q$sl\E$/; } @unkn_names;
1635    my $unkn = grep { $_ eq $sl } @unkn_names;
1636    return unless (defined $species or defined $genus) and $unkn == 0;
1637
1638    # Bio::Species array needs array in Species -> Kingdom direction
1639    push @class, $sci_name;
1640    @class = reverse @class;
1641
1642    my $make = Bio::Species->new;
1643    $make->scientific_name($sci_name);
1644    $make->classification(@class)          if @class > 0;
1645    $make->common_name( $common )          if $common;
1646    $make->name('abbreviated', $abbr_name) if $abbr_name;
1647    $make->organelle($organelle)           if $organelle;
1648    #$make->sub_species( $sub_species )     if $sub_species;
1649    return $make;
1650}
1651
1652=head2 _read_FTHelper_GenBank
1653
1654 Title   : _read_FTHelper_GenBank
1655 Usage   : _read_FTHelper_GenBank($buffer)
1656 Function: reads the next FT key line
1657 Example :
1658 Returns : Bio::SeqIO::FTHelper object
1659 Args    : filehandle and reference to a scalar
1660
1661=cut
1662
1663sub _read_FTHelper_GenBank {
1664    my ($self, $buffer) = @_;
1665
1666    my ($key, # The key of the feature
1667        $loc  # The location line from the feature
1668    );
1669    my @qual = (); # An array of lines making up the qualifiers
1670
1671    if ($$buffer =~ /^\s{5}(\S+)\s+(.+?)\s*$/o) {
1672        $key = $1;
1673        $loc = $2;
1674        # Read all the lines up to the next feature
1675        while ( defined(my $line = $self->_readline) ) {
1676            if ($line =~ /^(\s+)(.+?)\s*$/o) {
1677                # Lines inside features are preceded by 21 spaces
1678                # A new feature is preceded by 5 spaces
1679                if (length($1) > 6) {
1680                    # Add to qualifiers if we're in the qualifiers, or if it's
1681                    # the first qualifier
1682                    if (@qual or (index($2,'/') == 0)) {
1683                        push @qual, $2;
1684                    }
1685                    # We're still in the location line, so append to location
1686                    else {
1687                        $loc .= $2;
1688                    }
1689                }
1690                else {
1691                    # We've reached the start of the next feature
1692                    # Put the first line of the next feature into the buffer
1693                    $$buffer = $line;
1694                    last;
1695                }
1696            }
1697            else {
1698                # We're at the end of the feature table
1699                # Put the first line of the next feature into the buffer
1700                $$buffer = $line;
1701                last;
1702            }
1703        }
1704    }
1705    else {
1706        # No feature key
1707        $self->debug("no feature key!\n");
1708        # change suggested by JDiggans to avoid infinite loop-
1709        # see bugreport 1062.
1710        # reset buffer to prevent infinite loop
1711        $$buffer = $self->_readline;
1712        return;
1713    }
1714
1715    # Make the new FTHelper object
1716    my $out = Bio::SeqIO::FTHelper->new;
1717    $out->verbose($self->verbose);
1718    $out->key($key);
1719    $out->loc($loc);
1720
1721    # Now parse and add any qualifiers.  (@qual is kept
1722    # intact to provide informative error messages.)
1723  QUAL:
1724    for (my $i = 0; $i < @qual; $i++) {
1725        my $data = $qual[$i];
1726        my ( $qualifier, $value ) = ($data =~ m{^/([^=]+)(?:=\s*(.+))?})
1727            or $self->warn(  "cannot see new qualifier in feature $key: "
1728                           . $data);
1729        $qualifier = '' if not defined $qualifier;
1730
1731        if (defined $value) {
1732            # Do we have a quoted value?
1733            if (substr($value, 0, 1) eq '"') {
1734                # Keep adding to value until we find the trailing quote
1735                # and the quotes are balanced
1736                while ($value !~ /\"$/ or $value =~ tr/"/"/ % 2) {
1737                    if ($i >= $#qual) {
1738                        $self->warn(  "Unbalanced quote in:\n"
1739                                    . join("\n", @qual)
1740                                    . "No further qualifiers will "
1741                                    . "be added for this feature");
1742                        last QUAL;
1743                    }
1744                    # modifying a for-loop variable inside of the loop
1745                    # is not the best programming style ...
1746                    $i++;
1747                    my $next = $qual[$i];
1748
1749                    # add to value with a space unless the value appears
1750                    # to be a sequence (translation for example)
1751                    # if (($value.$next) =~ /[^A-Za-z\"\-]/o) {
1752                    # changed to explicitly look for translation tag - cjf 06/8/29
1753                    if ($qualifier !~ /^translation$/i ) {
1754                        $value .= " ";
1755                    }
1756                    $value .= $next;
1757                }
1758                # Trim leading and trailing quotes
1759                $value =~ s/^"|"$//g;
1760                # Undouble internal quotes
1761                $value =~ s/""/\"/g;
1762            }
1763            elsif ( $value =~ /^\(/ ) { # values quoted by ()s
1764                # Keep adding to value until we find the trailing bracket
1765                # and the ()s are balanced
1766                my $left  = ($value =~ tr/\(/\(/); # count left parens
1767                my $right = ($value =~ tr/\)/\)/); # count right parens
1768
1769                while( $left != $right ) { # was "$value !~ /\)$/ or $left != $right"
1770                    if ( $i >= $#qual) {
1771                        $self->warn(  "Unbalanced parens in:\n"
1772                                    . join("\n", @qual)
1773                                    . "\nNo further qualifiers will "
1774                                    . "be added for this feature");
1775                        last QUAL;
1776                    }
1777                    $i++;
1778                    my $next = $qual[$i];
1779                    $value .=  $next;
1780                    $left  += ($next =~ tr/\(/\(/);
1781                    $right += ($next =~ tr/\)/\)/);
1782                }
1783            }
1784        }
1785        else {
1786            $value = '_no_value';
1787        }
1788        # Store the qualifier
1789        $out->field->{$qualifier} ||= [];
1790        push @{$out->field->{$qualifier}}, $value;
1791    }
1792    return $out;
1793}
1794
1795=head2 _write_line_GenBank
1796
1797 Title   : _write_line_GenBank
1798 Usage   :
1799 Function: internal function
1800 Example :
1801 Returns :
1802 Args    :
1803
1804=cut
1805
1806sub _write_line_GenBank {
1807    my ($self, $pre1, $pre2, $line, $length) = @_;
1808
1809    $length or $self->throw("Miscalled write_line_GenBank without length. Programming error!");
1810    my $subl  = $length - length $pre2;
1811    my $linel = length $line;
1812    my $i;
1813
1814    my $subr = substr($line,0,$length - length $pre1);
1815
1816    $self->_print("$pre1$subr\n");
1817    for($i = ($length - length $pre1); $i < $linel; $i += $subl) {
1818        $subr = substr($line, $i, $subl);
1819        $self->_print("$pre2$subr\n");
1820    }
1821}
1822
1823=head2 _write_line_GenBank_regex
1824
1825 Title   : _write_line_GenBank_regex
1826 Usage   :
1827 Function: internal function for writing lines of specified
1828           length, with different first and the next line
1829           left hand headers and split at specific points in the
1830           text
1831 Example :
1832 Returns : nothing
1833 Args    : file handle,
1834           first header,
1835           second header,
1836           text-line,
1837           regex for line breaks,
1838           total line length
1839
1840=cut
1841
1842sub _write_line_GenBank_regex {
1843    my ($self, $pre1, $pre2, $line, $regex, $length) = @_;
1844
1845    # print STDOUT "Going to print with $line!\n";
1846
1847    $length or $self->throw("Miscalled write_line_GenBank without length. Programming error!");
1848
1849    my $subl  = $length - (length $pre1) - 2;
1850    my @lines = ();
1851
1852    CHUNK:
1853    while ($line) {
1854        foreach my $pat ($regex, '[,;\.\/-]\s|'.$regex, '[,;\.\/-]|'.$regex) {
1855            if ($line =~ m/^(.{0,$subl})($pat)(.*)/ ) {
1856                my $l = $1 . $2;
1857                $line = substr($line, length $l);
1858                # be strict about not padding spaces according to
1859                # genbank format
1860                $l =~ s/\s+$//;
1861                next CHUNK if ($l eq '');
1862                push @lines, $l;
1863                next CHUNK;
1864            }
1865        }
1866        # if we get here none of the patterns matched $subl or less chars
1867        $self->warn(  "trouble dissecting \"$line\"\n     into chunks "
1868                    . "of $subl chars or less - this tag won't print right");
1869        # insert a space char to prevent infinite loops
1870        $line = substr($line, 0, $subl) . " " . substr($line, $subl);
1871    }
1872    my $s = shift @lines;
1873    $self->_print("$pre1$s\n") if $s;
1874    foreach my $s ( @lines ) {
1875        $self->_print("$pre2$s\n");
1876    }
1877}
1878
1879=head2 _post_sort
1880
1881 Title   : _post_sort
1882 Usage   : $obj->_post_sort($newval)
1883 Function:
1884 Returns : value of _post_sort
1885 Args    : newvalue (optional)
1886
1887=cut
1888
1889sub _post_sort {
1890    my ($obj,$value) = @_;
1891    if ( defined $value) {
1892        $obj->{'_post_sort'} = $value;
1893    }
1894    return $obj->{'_post_sort'};
1895}
1896
1897=head2 _show_dna
1898
1899 Title   : _show_dna
1900 Usage   : $obj->_show_dna($newval)
1901 Function:
1902 Returns : value of _show_dna
1903 Args    : newvalue (optional)
1904
1905=cut
1906
1907sub _show_dna {
1908    my ($obj,$value) = @_;
1909    if ( defined $value) {
1910        $obj->{'_show_dna'} = $value;
1911    }
1912    return $obj->{'_show_dna'};
1913}
1914
1915=head2 _id_generation_func
1916
1917 Title   : _id_generation_func
1918 Usage   : $obj->_id_generation_func($newval)
1919 Function:
1920 Returns : value of _id_generation_func
1921 Args    : newvalue (optional)
1922
1923=cut
1924
1925sub _id_generation_func {
1926    my ($obj,$value) = @_;
1927    if ( defined $value ) {
1928        $obj->{'_id_generation_func'} = $value;
1929    }
1930    return $obj->{'_id_generation_func'};
1931}
1932
1933=head2 _ac_generation_func
1934
1935 Title   : _ac_generation_func
1936 Usage   : $obj->_ac_generation_func($newval)
1937 Function:
1938 Returns : value of _ac_generation_func
1939 Args    : newvalue (optional)
1940
1941=cut
1942
1943sub _ac_generation_func {
1944    my ($obj,$value) = @_;
1945    if ( defined $value ) {
1946        $obj->{'_ac_generation_func'} = $value;
1947    }
1948    return $obj->{'_ac_generation_func'};
1949}
1950
1951=head2 _sv_generation_func
1952
1953 Title   : _sv_generation_func
1954 Usage   : $obj->_sv_generation_func($newval)
1955 Function:
1956 Returns : value of _sv_generation_func
1957 Args    : newvalue (optional)
1958
1959=cut
1960
1961sub _sv_generation_func {
1962    my ($obj,$value) = @_;
1963    if ( defined $value ) {
1964        $obj->{'_sv_generation_func'} = $value;
1965    }
1966    return $obj->{'_sv_generation_func'};
1967}
1968
1969=head2 _kw_generation_func
1970
1971 Title   : _kw_generation_func
1972 Usage   : $obj->_kw_generation_func($newval)
1973 Function:
1974 Returns : value of _kw_generation_func
1975 Args    : newvalue (optional)
1976
1977=cut
1978
1979sub _kw_generation_func {
1980    my ($obj,$value) = @_;
1981    if ( defined $value ) {
1982        $obj->{'_kw_generation_func'} = $value;
1983    }
1984    return $obj->{'_kw_generation_func'};
1985}
1986
19871;
1988