seqio-1.2.2/doc/format.doc

SEQIO -- A Package for Sequence File I/O


FORMAT.DOC - The SEQIO File Formats
***********************************


The File Formats
================

This file describes the specific assumptions the SEQIO package makes
about the file formats it supports. The basic file formats are (with
alternative names in parens):

 o Raw
 o Plain
 o GenBank (gb)
 o EMBL
 o Swiss-Prot (swissprot, sprot)
 o PIR (CODATA)
 o NBRF
 o FASTA (Pearson)
 o IG/Stanford (IG, Stanford)
 o ASN.1 (ASN)
 o GCG
 o GCG-* (GCG-GenBank, GCG-PIR, GCG-EMBL, ...)
 o MSF
 o PHYLIP
 o PHYLIP-Seq (phylip-s, phylips)
 o PHYLIP-Int (phylip-i, phylipi)
 o Clustalw (clustal)
 o FASTA-output (fasta-out, fastaout, fout)
 o BLAST-output (blast-out, blastout, bout)

where `FASTA-output' and `BLAST-output' specify the output
produced by the programs in the FASTA and BLAST packages. The
`GCG-*' format actually refers to a set of formats which specify the
GCG forms of the GenBank, EMBL, Swiss-Prot, PIR, NBRF, FASTA and
IG/Stanford formats. These formats are included to distinguish the GCG
forms of these formats from the generic GCG format (where the header
lines of an entry are considered as unstructured comments). Any valid
name for one of the seven formats, plus their *-old variants given
below, can replace the `*' in `GCG-*'.

In addition to the basic file formats, there are four file "formats" which
use faster file reading implementations. They are specifically geared to
the formats of the GenBank, PIR, EMBL and Swiss-Prot databases, and
they are included to speed up database searches (they run about 30%
faster than the basic implementations, but at the cost of less error
checking and a requirement that the file format exactly match the
database's format):

 o gbfast
 o pirfast
 o emblfast
 o spfast

My advice is that these formats only be used when searching the actual
databases, and the basic file formats be used the rest of the time. The
difference in time only becomes significant when reading files in the
multi-megabyte range.

Finally, there are also format variants which have been added to
account for FASTA, NBRF and IG/Stanford format limitations commonly
in use. For FASTA and IG/Stanford, the limitation is that only one
header line (any line beginning with a '>' or ';') may appear in the entry.
For NBRF, the limitation is that no lines like "C;Accession:" or
"C;Comment:" may appear after the sequence. The formats below have
a different output function which outputs entries in these limited
formats (at the cost of losing some information about the sequences).
Thus, the package can output entries that are readable by other
programs which require the limited format.

 o NBRF-old (NBRFold)
 o FASTA-old (FASTAold)
 o Stanford-old (Stanfordold, IG-old, IGold)

These three format variants are included in the `GCG-*' set of formats.

File Format Types
=================

Each format is considered to be one of the following types, which gives
a basic description of the capabilities and common uses of the format:

T_SEQONLY
   The entries of the format contain only a sequence. It does not
   contain any place to store sequence information or comments.
   (Plain, Raw)
T_DATABANK
   The entries are used mainly to store unadorned sequences (i.e.,
   not used for sequences containing alignment characters).
   (GenBank, PIR, EMBL, Swiss-Prot, their GCG-* forms, ASN.1)
T_GENERAL
   The entries can contain both unadorned sequences and
   alignment sequences. In addition, there is a place to store
   sequence information and comments.
   (FASTA, NBRF, IG/Stanford, their GCG-* forms, GCG)
T_LIMITED
   The entries can contain both unadorned sequences and
   alignment sequences, but there is no place to store extra sequence
   information and comments.
   (FASTA-old, NBRF-old, IG-old, their GCG-* forms)
T_ALIGNMENT
   The entries are used mainly to store multiple sequence
   alignments. They are not considered to contain much sequence
   information and do not have any place to store comments.
   (PHYLIP, Clustalw, MSF)
T_OUTPUT
   The format is the output of an alignment program, and these
   formats are read-only formats.
   (FASTA-output, BLAST-output)

These types may be of some use when developing software that wishes
to perform different operations based on this file type information (the
"fmtseq" program included in the distribution is one such piece of
software).

(NOTE: Why is having someplace to store comments so important?
Well, one of the goals of this package is to try to unify all of the file
formats and be able to capture and transfer as much information from
one format to another. The plans are to use these comment sections as
the place to store any extra information for which there is no explicit
spot in the entry. And that can't happen if the file format doesn't have a
comment section. This is also the reason for the FASTA, NBRF and
IG/Stanford variants mentioned above.)

Automatically Determining the Format Type
=========================================

The SEQIO package has the ability to automatically determine the
format of a file, if that file is one of the following formats:

   Plain, GenBank, PIR, EMBL/Swiss-Prot, FASTA, NBRF,
   IG/Stanford, ASN.1, GCG, GCG-*, MSF, PHYLIP, Clustalw,
   FASTA-output, BLAST-output

The Raw format and all of the format variations (*-old, *fast) must be
explicitly specified in order to be used. The package makes the format
determination in two phases. The first phase looks at the initial
non-whitespace text of the file. The second phase looks at the text of
the first entry in the file. Both of these phases occur during the
opening of the file.

First Phase
+++++++++++

The first phase operation first skips over an e-mail header at the
beginning of the file, if the file begins with the string "From ". It then
looks for the first non-whitespace character of the file and attempts to
match that non-whitespace text to one of the following keywords
(where the matching is case-insensitive and the `?' character is a
wildcard which can match any character in the file):

    GenBank - "LOCUS ", "GB???.SEQ          Genetic Sequence Data Bank"
       NBRF - ">??;"
      FASTA - ">"
       EMBL - "ID   ", "CC ", "XX "
        PIR - "\\\", "ENTRY", "P R O T E I N  S E Q U E N C E  D A T A B A S E"
IG/Stanford - ";"
      ASN.1 - "Bioseq-set ::= {", "Seq-set ::= {"
  FASTA-out - "FASTA", "TFASTA", "SSEARCH", "LFASTA", "LALIGN", "ALIGN"
     PHYLIP - "0", "1", "2", "3", "4", "5", "6", "7", "8", "9"
   Clustalw - "CLUSTAL"
        MSF - "PileUp"
  BLAST-out - "BLASTN", "BLASTP", "BLASTX"

The keyword matching occurs in the order specified here, and the first
matching keyword specifies the file format. So, for NBRF and FASTA
files, if the first entry's header line has a ';' as the third character after
the initial '>', the file format is taken to be NBRF. Files without that
semi-colon are taken to be in FASTA format.

If there's a match, then the file format has been determined. Otherwise,
the file's format is considered to be `Plain' at this point.
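
The first-phase matching described above can be sketched in a few
lines of Python. This is only an illustration of the rules in this
section, not the package's own code: the names FIRST_PHASE_KEYWORDS,
keyword_matches and first_phase_format are hypothetical, and treating
the first blank line as the end of an e-mail header is an assumption.

```python
# Keyword table in the order given above; '?' matches any single character.
FIRST_PHASE_KEYWORDS = [
    ("GenBank", ["LOCUS ", "GB???.SEQ          Genetic Sequence Data Bank"]),
    ("NBRF", [">??;"]),
    ("FASTA", [">"]),
    ("EMBL", ["ID   ", "CC ", "XX "]),
    ("PIR", ["\\\\\\", "ENTRY",
             "P R O T E I N  S E Q U E N C E  D A T A B A S E"]),
    ("IG/Stanford", [";"]),
    ("ASN.1", ["Bioseq-set ::= {", "Seq-set ::= {"]),
    ("FASTA-output", ["FASTA", "TFASTA", "SSEARCH", "LFASTA", "LALIGN", "ALIGN"]),
    ("PHYLIP", list("0123456789")),
    ("Clustalw", ["CLUSTAL"]),
    ("MSF", ["PileUp"]),
    ("BLAST-output", ["BLASTN", "BLASTP", "BLASTX"]),
]


def keyword_matches(keyword, text):
    """Case-insensitive prefix match in which '?' matches any character."""
    if len(text) < len(keyword):
        return False
    return all(k == "?" or k.lower() == t.lower()
               for k, t in zip(keyword, text))


def first_phase_format(data):
    """Return the format name chosen by the first phase, or 'Plain'."""
    # Skip an e-mail header if the file begins with "From ".  Ending the
    # header at the first blank line is an assumption of this sketch.
    if data.startswith("From "):
        _, _, data = data.partition("\n\n")
    text = data.lstrip()
    # Keywords are tried in table order; the first match decides the format.
    for name, keywords in FIRST_PHASE_KEYWORDS:
        if any(keyword_matches(k, text) for k in keywords):
            return name
    return "Plain"
```

Because NBRF is tested before FASTA, a header such as ">P1;CCMST" is
classified as NBRF, while ">seq1 ..." falls through to FASTA.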

Second Phase
++++++++++++

The second phase distinguishes more subtle variations of the file
formats by looking in more detail at the text of the entries. The possible
changes in the determined format are the following:

 o For EMBL files, the "ID " line of each entry is scanned, and if it
   contains exactly 2 semi-colons, a period and the string "PRT"
   occurring before the second semi-colon, the entry is taken to be
   a Swiss-Prot entry.

   In addition, if the string occurring before the last semi-colon on
   the "ID " line is "EPD", then the entry identifier is taken to be an
   EPD database identifier, but the entry itself is still considered to
   be an EMBL formatted entry.

 o For each basic format that has a GCG-* form, if the entry's
   sequence lines are in the GCG format, then the entry is
   considered to be in the corresponding GCG-* format (so, a
   GenBank entry becomes a GCG-GenBank entry).

 o For PHYLIP files, each entry is checked to see if it is in the
   Interleaved or Sequential format. This checking is a complete
   match of the text to the two formats, so the likelihood of an
   incorrect determination is remote. See below in the description
   of the PHYLIP format for more details.

 o For Plain files (as determined by phase 1), the entry text is
   checked to see if a line ending with the string ".." occurs (or,
   more precisely, a line whose last non-whitespace characters are
   ".."). If so, the file is considered either a GCG or MSF file. If the
   line ending with the ".." contains the string "MSF:", then the
   entry is considered to be an MSF file. If not, the entry is
   considered to be a GCG file.
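
The last rule above (Plain versus GCG versus MSF) is simple enough to
sketch directly. The function name refine_plain_format is
hypothetical, and deciding on the first line that ends with ".." is
an assumption; the text above does not say which line wins if there
are several.

```python
def refine_plain_format(entry_text):
    """Return 'MSF', 'GCG' or 'Plain' for text that phase 1 called Plain."""
    for line in entry_text.splitlines():
        # A line whose last non-whitespace characters are "..".
        if line.rstrip().endswith(".."):
            return "MSF" if "MSF:" in line else "GCG"
    return "Plain"
```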


The SEQIO File Format Implementations
*************************************

The package has six main (internal) operations that encapsulate the
details of the file formats. Those operations are:

read
   Read the input file to find the beginning and end of the next
   entry in the file. Also, find the beginning of the lines containing
   the sequence and if the entry explicitly specifies a sequence
   length, get that value.
getseq
   Retrieve the sequence, if it exists, from the entry.
rawseq
   Retrieve the raw sequence, if it exists, from the entry. The raw
   sequence typically contains the sequence characters plus any
   alignment or notational characters.
getinfo
   Get one piece or all of the SEQINFO information from the entry.
putseq
   Given a sequence and SEQINFO structure, output a correctly
   formatted entry.
annotate
   Output an entry's text, adding new text to its comment section
   (creating a comment section, if none exists in the entry).

Each of the supported file formats will be described in terms of what
those six operations do for that format.

General Comments
================

 o There are no limits on lengths of anything (lines, entries,
   sequences, etc.), except for memory limitations and when
   outputting formats whose official descriptions specify a
   maximum line length (see below in the format descriptions).

 o When outputting formats that do have a maximum line length,
   long description/organism/comment lines are broken between
   word boundaries. That maximum line length is maintained unless
   there is a single word that is longer than the line length. That
   word is not broken up, but is output on a line that will be longer
   than the maximum length.

 o Except for gbfast, emblfast, spfast and pirfast, the case of the
   entry's keywords is irrelevant (they can be in upper or lower
   case, or any mixture of the two). The "fast" formats require
   keywords in upper case (as occurs in the databases).

 o When outputting in the Plain, FASTA, NBRF or IG/Stanford
   formats, the putseq operation looks at the sequence being
   output, and may add whitespace to the output sequence to make
   it look prettier. By default, the extra spaces are added when the
   sequence is DNA, RNA or Protein and when there are no
   non-alphabetic characters in the sequence (such as alignment
   characters).

   This prettying operation can be turned off or turned on for all
   sequences using the function `seqfsetpretty'.
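
The word-boundary rule for formats with a maximum line length can be
illustrated with Python's standard textwrap module, used here as a
stand-in for the package's own line breaking. The function name
break_long_text and the widths in the example are hypothetical.

```python
import textwrap


def break_long_text(text, max_len):
    """Break text between words; an over-long word stays whole on its own line."""
    return textwrap.wrap(text, width=max_len,
                         break_long_words=False,   # never split a word
                         break_on_hyphens=False)   # hyphenated words stay whole
```

With a maximum of 10, "a supercalifragilistic b" is broken into
three lines, and the middle line is longer than the maximum because
its single word is not split.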


Raw Format
**********

In the raw format, all of the characters of the file are the characters of
the sequence (including spaces, newlines, non-printable characters,
and so on).

The read operation simply reads the whole file. The getseq and rawseq
operations return that text. The getinfo operation merely stores the
filename in the description field. The putseq operation just outputs the
sequence characters. And there is no annotate operation.


Plain Format
************

In the plain format, all of the alphabetic characters of the file are taken
as the characters of the sequence, while spaces, newlines, position
numbers and other punctuation characters are ignored.

The read operation reads in the whole file. The getseq operation
extracts all of the alphabetic characters from the text. The rawseq
operation extracts all of the non-whitespace and non-numeric
characters from the text. The getinfo operation stores the filename in
the description field.
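
The difference between getseq and rawseq for the plain format can be
shown with a short Python sketch. The function names are
hypothetical, and restricting "alphabetic" to ASCII letters is an
assumption of this sketch.

```python
def plain_getseq(text):
    """Keep only the alphabetic characters (the sequence itself)."""
    return "".join(c for c in text if c.isascii() and c.isalpha())


def plain_rawseq(text):
    """Keep everything except whitespace and digits (sequence plus notation)."""
    return "".join(c for c in text if not c.isspace() and not c.isdigit())
```

For the text "  1 ACGT-ACGT 10", plain_getseq gives "ACGTACGT" while
plain_rawseq gives "ACGT-ACGT", keeping the alignment character.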

The putseq operation outputs the sequence in one of two formats,
depending on the sequence's alphabet. If the alphabet is DNA, RNA or
Protein, or the alphabet is Unknown but does not contain newline
characters, the sequence is output 60 sequence characters per line,
with interspersed spaces to improve the look of the output. If the
alphabet is Unknown and it contains newline characters, then it is
output as is.


GenBank Flat-File Format
************************

The read operation first looks for a "LOCUS" line and extracts the
sequence length from positions 23-29 of that line (if the text there
consists of digits). Then, it looks for the entry ending "//" line, along
with the "ORIGIN" line which specifies where the sequence lines begin.
The "ORIGIN" line is not required, however if it does not exist, the entry
is assumed to contain no sequence.

The getseq operation scans the sequence lines, from just after the
"ORIGIN" line to the "//" line. All alphabetic characters there are
assumed to be part of the sequence. No assumptions are made about
the format of these lines.

The rawseq operation is the same as the getseq operation, except that
all non-whitespace and non-numeric characters are considered part of
the sequence.

The getinfo operation looks first at the "LOCUS" line. It takes the
identifier from positions 13-22 (and assumes it's a GenBank id, unless
marked by an identifier prefix), the alphabet determination from
positions 37-40, whether it's circular from the existence of the keyword
"circular" at positions 43-52, and the date from positions 63-73. Then,
it looks for the "ACCESSION", "NID", "PID", "DEFINITION",
"COMMENT" and "SOURCE" lines, where `lines' here means one or
more text lines corresponding to that part of the entry and where the
lines can appear in any order. Accession numbers, NID numbers and
PID numbers are extracted from the "ACCESSION", "NID" and "PID"
lines, respectively. The description is taken from the "DEFINITION" line.
Comments are retrieved from the "COMMENT" line. The organism name
is taken from the "ORGANISM" sub-record of the "SOURCE" line. The
getinfo operation cannot determine the value of the isfragment field
(since that is not explicitly given anywhere in the entry).
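
As a rough illustration of the fixed-column reading of the "LOCUS"
line, the sketch below slices the line at the positions listed above,
reading them as 1-based and inclusive (which is consistent with the
example entry later in this section). The function name parse_locus
and the dictionary keys are hypothetical, not part of the package.

```python
def parse_locus(line):
    """Extract the fixed-column fields of a GenBank LOCUS line."""
    def col(first, last):
        # Convert 1-based inclusive positions to a Python slice.
        return line[first - 1:last].strip()

    length = col(23, 29)
    return {
        "id": col(13, 22),
        # The length is only taken if the text there consists of digits.
        "length": int(length) if length.isdigit() else None,
        "alphabet": col(37, 40),
        "circular": col(43, 52).lower() == "circular",
        "date": col(63, 73),
    }
```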

The putseq operation outputs an entry with the following lines (in
order): LOCUS, DEFINITION, ACCESSION, NID, SOURCE/ORGANISM,
COMMENT, BASE COUNT, ORIGIN, sequence lines, //. The form of
these lines follows that described in the GenBank Release Notes, with
the following exceptions:

 o Except for the LOCUS line and the ORIGIN-sequence-// lines, no
   lines are output if the SEQINFO information for that line does not
   exist.
 o Only a non-accession identifier 10 characters long or less is
   output on the LOCUS line. If there are no such identifiers in the
   idlist, then the keyword "Unknown" is output (or "(below)" to
   signal that the long identifiers occur in the COMMENT lines).
 o On the LOCUS line, the "bp" in positions 31-32 may be replaced
   with "aa" or "ch" if the alphabet is Protein or Unknown. The
   alphabet string in positions 37-40 could be "PRT" or "UNK" for
   the same reason. The output classification in positions 53-55 is
   "UNC" (Unclassified). And finally, the date in positions 63-73 is
   "01-JAN-0000" if no date is specified in the SEQINFO structure.
 o The history lines, and any extra references, are output at the end
   of the COMMENT lines (or a COMMENT line is added which
   contains those lines). Each of the added lines begins with the
   keyword "SEQIO".

The annotate operation replaces or appends to the COMMENT line, if it
exists. If no COMMENT line exists, then a new COMMENT line will be
inserted (or rather output between the existing lines of the entry) just
before one of the following lines (whichever comes first in the entry):
FEATURES, BASE COUNT or ORIGIN. One of those lines must appear
in the entry.

Example GenBank entry:

LOCUS       A02201        664 bp    DNA             UNC       10-MAR-1993
DEFINITION  Phage phi-105 DNA for immF polypeptide.
ACCESSION   A02201
SOURCE      .
  ORGANISM  Bacteriophage phi-105
COMMENT     NCBI gi: 345121

            SEQIO retrieval from GenBank database entry.   07-Feb-1996
BASE COUNT      237 a    111 c    144 g    172 t
ORIGIN
        1 tgatcaccta tctcctttac aacacatagt gcctcactgt gccactgtgt cttgtggcat
       61 gacacaatta tagtatccga atgtcggaaa tacaatacta aaaaagacgg aaatacaagt
      121 attttttagt aaattgacgg aaatacaaga taaatactct ctgaatcttt aaaatgcttg
      181 aatttcgtca aatttcgact tttacaaaat gtcgtgaata ccatacaatt tagacatacc
      241 ttaacgggag gtgataatca tgctggatgg gaaaaagctt ggggctttaa ttaaggacaa
      301 aagaaaagaa aagcacttga aacagacaga aatggcgaag gcactgggta tgtccagaac
      361 ttatctctct gatatcgaaa acggcagata tctgccgagt acaaaaacac tttccagaat
      421 agcgatttta ataaatctgg atttaaatgt gttaaaaatg acggaaatac aagtagttga
      481 ggagggtgga tatgatagag ctgccggcac atgtagaaga caggctttat gagattttta
      541 tgaaactatc agttccaagg ttgcttgaga aagaagccct ggagaaagga gagaagccga
      601 atgcggaaag aaaaggcgct tgacctcgcg gccttcttcg ctgaatttga acaaatgatg
      661 atca
//


GBFAST variation of GenBank
***************************

The read operation performs the same steps as the GenBank read,
however it makes some additional assumptions. First, all keywords
must appear in uppercase. Second, the sequence length must appear
in positions 23-29 on the "LOCUS" line. Third, an "ORIGIN" line must
appear in the entry (as must a sequence). Fourth, all of the lines of
sequence except the last must be in the format as described in the
Release Notes, and so must be 75 characters long (9 characters for the
position number, 60 characters of sequence, 6 spaces), plus the
newline characters. See the above example.

The getseq operation assumes that the sequence lines are in the format
described in the previous paragraph, and all of the characters in the
correct positions in that format are assumed to be characters of the
sequence. So, if the line format is incorrect, you will get garbage as the
sequence.
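
To make the "garbage" warning concrete, here is a minimal Python
sketch of position-based extraction under the layout described above
(a 9-character position number followed by six groups, each a space
plus 10 sequence characters). The function name gbfast_line_seq is
hypothetical; the real implementation may differ in detail.

```python
def gbfast_line_seq(line):
    """Pull sequence characters out of fixed columns, with no validation."""
    line = line.rstrip("\n")
    # Group g occupies 0-based columns 10+11*g .. 19+11*g; the space before
    # each group is skipped.  A short last line simply yields fewer characters.
    return "".join(line[10 + 11 * g:20 + 11 * g] for g in range(6))
```

A line that does not follow the layout, such as "1 tgatcaccta" with
no leading padding, is not rejected; the function just returns
whatever sits in those columns.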

The rawseq operation here is exactly the same as the getseq operation,
since the GenBank sequences don't contain other characters.

The getinfo, putseq and annotate functions are the same as in the
GenBank format.


PIR/CODATA Format
*****************

The read operation first looks for an "ENTRY" line. It then looks for the
entry ending "///" line, but during this scan it also looks for the
"SUMMARY" line and the "SEQUENCE" line. If the "SUMMARY" line is
found, the sequence length is extracted by scanning for "#length" on
the line, and then looking for digits after that keyword. The
"SEQUENCE" line specifies the beginning of the sequence lines
(starting on the next line), and no sequence is assumed to appear in
the entry if the "SEQUENCE" line is missing.
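
The "#length" scan on the "SUMMARY" line might look like the
following sketch. The function name pir_summary_length is
hypothetical, and the case-insensitive match reflects the general
rule (stated earlier) that keyword case is irrelevant in the basic
formats.

```python
import re


def pir_summary_length(line):
    """Return the number following "#length" on a SUMMARY line, or None."""
    match = re.search(r"#length\s+(\d+)", line, re.IGNORECASE)
    return int(match.group(1)) if match else None
```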

The getseq operation scans the sequence lines from just after the
"SEQUENCE" line to the "///" line ending the entry. All alphabetic
characters on those lines are assumed to be in the sequence. No
format for those lines is assumed.

The rawseq operation is the same as the getseq operation, except that
all non-whitespace and non-numeric characters are considered part of
the sequence.

The getinfo operation first looks at the "ENTRY" line. The next word
(i.e., non-whitespace string) after the "ENTRY" keyword is taken for an
identifier, and then the rest of the line is searched for a "#type" option.
If the word after "#type" is "fragment", the isfragment field is set to 1.
Then, the entry is searched for the "ACCESSIONS", "COMMENT",
"DATE", "ORGANISM" and "TITLE" lines, which can appear in any
order. The "ACCESSIONS" line holds accession numbers (and the
search for the "ACCESSIONS" line will also find lines beginning with
just "ACCESSION", for backward compatibility). The "COMMENT" lines
hold comments. The "DATE" line holds the date, and the date taken is
the last given on the line, with the assumption being that the dates on
the line are specified from oldest to newest (not absolutely accurate,
but handling dates better is on my TODO list). The "TITLE" line holds
the description, an optional organism name and possibly one of the
keywords "(fragment)", "(fragments)" or "(tentative sequence)". The text
before the string " - " is taken for the description, and the rest of the
text, except for a trailing keyword, is taken for the organism name. If the
keywords "(fragment)" or "(fragments)" appear at the end of the string,
isfragment is set to 1. If "(tentative sequence)" appears, it is considered
part of the description. The "ORGANISM" line holds an organism name
which is taken if the "TITLE" line does not specify an organism.
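
The "TITLE" line rules can be sketched as below. This is a
simplified illustration only: the function name parse_pir_title is
hypothetical, and the sketch handles the " - " split and the two
fragment keywords but does not model "(tentative sequence)".

```python
def parse_pir_title(title):
    """Split a TITLE text into (description, organism, isfragment)."""
    text = title.strip()
    isfragment = False
    # A trailing "(fragment)" or "(fragments)" sets isfragment and is
    # removed before the organism name is taken.
    for keyword in ("(fragment)", "(fragments)"):
        if text.lower().endswith(keyword):
            isfragment = True
            text = text[:-len(keyword)].rstrip()
            break
    # Text before " - " is the description; the rest is the organism.
    description, sep, organism = text.partition(" - ")
    return description.strip(), (organism.strip() if sep else None), isfragment
```

Note that only the string " - " (with spaces) separates the two
parts, so a hyphenated word such as "testis-specific" stays in the
description.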

The putseq operation outputs a PIR entry containing the following lines
(in order): ENTRY, TITLE, ORGANISM, DATE, ACCESSIONS,
COMMENT, SUMMARY, SEQUENCE, sequence lines, ///. The format of
those lines follows the PIR Release Notes, with the following
exceptions:

 o The TITLE, ORGANISM, DATE, ACCESSIONS and COMMENT
   lines may not appear, if the SEQINFO structure does not contain
   the appropriate information.
 o If no idlist is given, the keyword "UNKNWN" is output on the
   ENTRY line, instead of the sequence identifier.
 o The SEQIO package attempts to follow the guidelines for the
   TITLE line (i.e., description " - " organism, and an optional
   "(fragment)") as best it can. Depending on the text of the
   description and organism fields, this may or may not turn out
   well.
 o The organism name is output in the "#formal_name" field of the
   ORGANISM line, even though it may not be the formal name of
   the organism. (Better handling of the organism names is another
   thing on my TODO list.)
 o The SUMMARY line only contains the "#length" field on it.
 o The history lines, and any extra references, are output at the end
   of the COMMENT lines (or a COMMENT line is added which
   contains those lines). Each of the added lines begins with the
   keyword "SEQIO".

The annotate operation replaces or appends to the COMMENT line, if it
exists. If no COMMENT line exists, then a new COMMENT line will be
inserted just before one of the following lines (whichever comes first in
the entry): GENETIC, CLASSIFICATION, KEYWORDS, FEATURE,
SUMMARY or SEQUENCE. One of those lines must appear in the entry.

Example PIR entry:

ENTRY            CCMST       #type complete
TITLE            cytochrome c, testis-specific - mouse
ORGANISM         #formal_name mouse
DATE             04-Nov-1994
ACCESSIONS       B28160; A00012
COMMENT    Mammalian testis contains two forms of cytochrome c, one identical
           with the form found in somatic tissues and another that is
           expressed in a stage-specific manner during spermatogenic
           differentiation.

           SEQIO retrieval from PIR database entry.   07-Feb-1996
SUMMARY          #length 105
SEQUENCE
                5        10        15        20        25        30
      1 M G D A E A G K K I F V Q K C A Q C H T V E K G G K H K T G
     31 P N L W G L F G R K T G Q A P G F S Y T D A N K N K G V I W
     61 S E E T L M E Y L E N P K K Y I P G T K M I F A G I K K K S
     91 E R E D L I K Y L K Q A T S S
///


PIRFAST Variation of PIR
************************

The read operation performs the same steps as the PIR read, however it
makes some additional assumptions. First, all keywords must appear in
uppercase. Second, a "SUMMARY" line must appear in the entry, and
it must contain a "#length" field (although the field can appear
anywhere on the line). Third, a "SEQUENCE" line must appear in the
entry immediately after the "SUMMARY" line (and the entry must
contain a sequence). Fourth, the format of the sequence lines must be
as given in the PIR database, and so must be either 67 or 68 characters
long (7 characters for the position number, 30 characters of sequence,
30 or 31 spaces or notational characters), plus the newline character.
See the above example.

The getseq operation assumes that the sequence lines are in the format
described in the previous paragraph, and all of the characters in the
correct positions in that format are assumed to be characters of the
sequence. So, if the line format is incorrect, you will get garbage as the
sequence.
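
Under the PIR line layout described above (a 7-character position
number, then 30 pairs of one space or notational character followed
by one sequence character), position-based extraction reduces to a
strided slice. The function name pirfast_line_seq is hypothetical.

```python
def pirfast_line_seq(line):
    """Take every second character starting at 0-based column 8."""
    line = line.rstrip("\n")
    # Columns 8, 10, ..., 66 hold the 30 sequence characters; a short
    # last line simply yields fewer of them.
    return line[8:68:2]
```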

The rawseq operation here does not use the "fast" implementation, but
uses the rawseq operation of the basic PIR format.

The getinfo, putseq and annotate functions are the same as in the PIR
format.


EMBL/Swiss-Prot File Formats
****************************

 NOTE: The EMBL and Swiss-Prot file format implementations are
 essentially the same, differing only in their putseq and annotate
 operations. So, we'll describe them together.

 NOTE2: The EMBL read, getseq and getinfo implementations have
 been tested on, and are compatible with, the "EMBL" entries in
 the EMBL, EPD, aids-db, ENZYME, PROSITE and Swiss-Prot
 databases. Because of the variations of the entries in these
 databases, some of the assumptions made in the implementations
 will differ from the official EMBL or Swiss-Prot file format
 descriptions.

The read operation first looks for an "ID " line. It then looks for the
entry ending "//" line, but during this scan it also looks for an "SQ "
line and a line beginning with two spaces. If the "SQ " line is found and
the next word after "SQ Sequence" consists of digits, it is taken for the
sequence length. The first line beginning with two spaces is assumed
to be the beginning of the sequence lines, and if no such lines appear,
the entry is assumed to contain no sequence.

The getseq operation scans the sequence lines from the first line
beginning with two spaces to the "//" line ending the entry. All
alphabetic characters on those lines are assumed to be in the
sequence. No format for those lines is assumed.

The rawseq operation is the same as the getseq operation, except that
all non-whitespace and non-numeric characters are considered part of
the sequence.

The getinfo operation first looks at the "ID " line. The next word (i.e.,
non-whitespace string) after the "ID" keyword is taken for an identifier,
and an attempt is made to determine if it is an EMBL id, an EPD id, a
Swiss-Prot id, or something else. It does this by counting the number
of semi-colons on the line and checking whether the line ends with a
period. If three semi-colons and a period are found, then the string just
before the third semi-colon is checked, and the identifier is assumed to
be an EPD id if that string is "EPD" and is assumed to be an EMBL id
otherwise. If two semi-colons and a period are found, and the string
just before the second semi-colon is "PRT", the identifier is assumed
to be a Swiss-Prot id. Otherwise, the identifier is some other id. After
figuring out the type of identifier and extracting it from the line, the rest
of the line is searched for words that specify the alphabet ("DNA",
"RNA", "PRT", and so on) and whether the sequence is circular
("circular").
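
The identifier classification on the "ID " line can be sketched as
follows. The function name classify_id_line and the returned labels
are hypothetical; comparing the keywords without regard to case is
an assumption based on the general rule for the basic formats.

```python
def classify_id_line(line):
    """Return (identifier, kind) where kind is EMBL, EPD, Swiss-Prot or other."""
    body = line.rstrip()
    words = body.split()
    identifier = words[1] if len(words) > 1 else None
    fields = body.split(";")
    semicolons = len(fields) - 1

    def word_before(n):
        # Last word preceding the n-th semi-colon (1-based), upper-cased.
        parts = fields[n - 1].split()
        return parts[-1].upper() if parts else ""

    kind = "other"
    if body.endswith("."):
        if semicolons == 3:
            kind = "EPD" if word_before(3) == "EPD" else "EMBL"
        elif semicolons == 2 and word_before(2) == "PRT":
            kind = "Swiss-Prot"
    return identifier, kind
```

Applied to the "ID" lines of the two example entries at the end of
this section, the sketch reports an EMBL id for CM23SRIBR and a
Swiss-Prot id for 104K_THEPA.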

Then the rest of the entry is searched for the "AC ", "NI ", "PI ", "DT ",
"DE ", "OS ", "CC " and "XX " lines, which can appear in any order. The
"AC ", "NI " and "PI " lines contain accession, NID and PID numbers.
The "DT " lines contain dates, of which the date on the last "DT " line is
taken, under the assumption that the dates are given from oldest
to newest. The "DE " lines contain the description, and may end with
one of the keywords "(fragment)" or "(fragments)", in which
case isfragment is set to 1. The "OS " lines specify the organism name.
The "CC " and "XX " lines specify the comment lines, about which there
are a couple of things to note. First, an "XX " line is different from any line
beginning with "XX", in that three spaces must appear after the "XX"
and non-whitespace text must appear after that, in order for it to be
considered a comment line. These lines do not occur in the official
EMBL or Swiss-Prot formats, but do appear in some of the variations.
Second, more than one comment section can appear in an entry. When
a "CC " line is reached, the comment section beginning at that line is
assumed to consist of all "CC " and "XX" lines (note the lack of spaces
after the "XX") following that line, up to the first line not beginning with
"CC" or "XX" (and ignoring a trailing "XX" line). When an "XX " line is
seen, all following "XX " lines are considered part of that comment
section. The text of these sections is concatenated together to make
up the comment lines.

For the EMBL format, the putseq operation outputs an EMBL entry
containing the following lines (in order): ID, AC, NI, DT, DE, OS, CC,
SQ, sequence lines, //. In the output, XX lines are added between each
of the lines (except the sequence lines) as specified in the EMBL format.
The format of the lines follows the EMBL Release Notes, with the
following exceptions:

 o The AC, NI, DT, DE, OS, and CC lines may not appear if the
   SEQINFO structure does not contain the appropriate
   information.
 o On the ID line, if no idlist is given, the keyword "Unknown" is
   output instead of an identifier. The keyword "converted" is
   output instead of "standard" or "preliminary". The keyword
   "UNC" is output instead of the classification code. The keyword
   "UNK" might be output for the alphabet, if the alphabet is
   Unknown. And, the keyword "AA" or "CH" could appear after the
   sequence length, if the alphabet is Protein or Unknown.
 o There will be at most one DT line, and it will only contain the
   specified date.
 o Instead of outputting "XX" lines to specify a `blank' line in a
   comment, a line containing "CC " followed immediately by a
   newline is output (so, in my design of the comment sections, the
   comments are specified by the "CC " lines).
 o The history lines, and any extra references, are output at the end
   of the output comment section. Each of the added lines begins
   with the keyword "SEQIO".

For the Swiss-Prot format, the putseq operation outputs a Swiss-Prot
entry containing the following lines (in order): ID, AC, DT, DE, OS, CC,
SQ, sequence lines, //. The format of the lines follows the Swiss-Prot
Release Notes, with the following exceptions:

 o The AC, DT, DE, OS, and CC lines may not appear if the
   SEQINFO structure does not contain the appropriate
   information.
 o On the ID line, if no idlist is given, the keyword "Unknown" is
   output instead of an identifier. The keyword "converted" is
   output instead of "standard" or "preliminary".
 o The alphabet keyword could be "RNA", "DNA" or "UNK" if the
   alphabet is not Protein. And, the keyword "circular" could
   appear before the alphabet (if iscircular is 1). The keyword "BP"
   or "CH" could appear after the sequence length, if the alphabet
   is DNA, RNA or Unknown.
 o There will be at most one DT line, and it will only contain the
   specified date.
 o The history lines, and any extra references, are output at the end
   of the output comment section. Each of the added lines begins
   with the keyword "SEQIO".

For the EMBL format, the annotate operation replaces or appends to
the "CC " or "XX " lines, if they exist. The operation looks for the first
comment section, and will insert or replace at that point. If no comment
section exists, then a new comment section using "CC " lines will be
inserted (or rather output between the existing lines of the entry) as
follows. If a "DR ", "PR ", "FH " or "FT " line appears in the entry, the
comment is inserted just before the first of those lines. Otherwise, the
comment is inserted just before the "SQ " or "  " (i.e., sequence) lines.
One of these lines must appear in the entry.

For the Swiss-Prot format, the annotate operation replaces or appends
to the "CC " lines, if they exist. If no comment section exists, then a new
comment section will be inserted (or rather output between the existing
lines of the entry) as follows. If a "DR ", "KW " or "FT " line appears in
the entry, the comment is inserted just before the first of those lines.
Otherwise, the comment is inserted just before the "SQ " or sequence
lines. One of these lines must appear in the entry.

Example EMBL entry:

ID   CM23SRIBR  converted; DNA; UNC; 805 BP.
XX
AC   X80636;
XX
DT   22-MAR-1995
XX
DE   C.mucosalis gene for 23S ribosomal RNA (fragment)
XX
OS   Campylobacter mucosalis
XX
CC   SEQIO retrieval from EMBL-format entry.   07-Feb-1996
XX
SQ   Sequence 805 BP; 226 A; 158 C; 224 G; 194 T; 3 other;
     gattctgcgc ggaaaatata acggggctaa aatgagtacc gaagctttag acttagtttt        60
     actaagtggt aggagcgttc tattcagcgt tgaaggtgta ccggtaagga gcgctggagc       120
     ggatagaagt gagcatgcag gcatgagtag cgataattgg ggtgagaatc cccaacgccg       180
     taarcccaag gtttcctacg cgatgctcgt catcgtaggg ttagccgggt cctaagcaaa       240
     gtccgaaagg ggtatgcgat ggaaaattgg ttaatattcc aatgccaaca ttattgtgcg       300
     atggaaggac gcttagagtt aaaggagcca gctgatggaa gtgctggtcg aaaggtgtag       360
     gttgagttac aggcaaatcc gtaactcttt atccgagacc ccacaggcgt ttgaagttct       420
     tcggaatgga tgacgaatcc ttgatactgt cgagccaaga aaagtttcta agtttagata       480
     atgttgcccg taccgtaaac cgacacaggt gggtgggatg agtattctaa ggcgcgtgga       540
     agaactctct tcaaggaact ctgcaaaata gcaccgtatc ttcggtataa ggtgtgccta       600
     actttgtgaa ggatttactc cgtaagcatt gaaggttaca acaaagagtc cctcccgact       660
     gtttaccaaa aacacagcac tctgctaact cgtaagagga tgtatagggt gtgacgcctg       720
     cccggtgctc gaaggttaat tgatggggty agcagyaatg cgaagctctt gatcgaagcc       780
     cgagtaaacg gccgccgtaa ctata                                             805
//

Example Swiss-Prot entry:

ID   104K_THEPA  CONVERTED;      PRT;   924 AA.
AC   P15711;
DT   01-AUG-1992
DE   104 KD MICRONEME-RHOPTRY ANTIGEN.
OS   THEILERIA PARVA.
CC   -!- DEVELOPMENTAL STAGE: SPOROZOITE ANTIGEN.
CC   -!- SUBCELLULAR LOCATION: IN MICRONEME/RHOPTRY COMPLEXES.
CC
CC   SEQIO retrieval from Swiss-Prot database entry.   07-Feb-1996
SQ   SEQUENCE   924 AA;
     MKFLILLFNI LCLFPVLAAD NHGVGPQGAS GVDPITFDIN SNQTGPAFLT AVEMAGVKYL
     QVQHGSNVNI HRLVEGNVVI WENASTPLYT GAIVTNNDGP YMAYVEVLGD PNLQFFIKSG
     DAWVTLSEHE YLAKLQEIRQ AVHIESVFSL NMAFQLENNK YEVETHAKNG ANMVTFIPRN
     GHICKMVYHK NVRIYKATGN DTVTSVVGFF RGLRLLLINV FSIDDNGMMS NRYFQHVDDK
     YVPISQKNYE TGIVKLKDYK HAYHPVDLDI KDIDYTMFHL ADATYHEPCF KIIPNTGFCI
     TKLFDGDQVL YESFNPLIHC INEVHIYDRN NGSIICLHLN YSPPSYKAYL VLKDTGWEAT
     THPLLEEKIE ELQDQRACEL DVNFISDKDL YVAALTNADL NYTMVTPRPH RDVIRVSDGS
     EVLWYYEGLD NFLVCAWIYV SDGVASLVHL RIKDRIPANN DIYVLKGDLY WTRITKIQFT
     QEIKRLVKKS KKKLAPITEE DSDKHDEPPE GPGASGLPPK APGDKEGSEG HKGPSKGSDS
     SKEGKKPGSG KKPGPAREHK PSKIPTLSKK PSGPKDPKHP RDPKEPRKSK SPRTASPTRR
     PSPKLPQLSK LPKSTSPRSP PPPTRPSSPE RPEGTKIIKT SKPPSPKPPF DPSFKEKFYD
     DYSKAASRSK ETKTTVVLDE SFESILKETL PETPGTPFTT PRPVPPKRPR TPESPFEPPK
     DPDSPSTSPS EFFTPPESKR TRFHETPADT PLPDVTAELF KEPDVTAETK SPDEAMKRPR
     SPSEYEDTSP GDYPSLPMKR HRLERLRLTT TEMETDPGRM AKDASGKPVK LKRSKSFDDL
     TTVELAPEPK ASRIVVDDEG TEADDEETHP PEERQKTEVR RRRPPKKPSK SPRPSKPKKP
     KKPDSAYIPS ILAILVVSLI VGIL
//


EMBLFAST/SPFAST Variation of EMBL/Swiss-Prot
********************************************

The read operation performs the same steps as the EMBL/Swiss-Prot
read; however, it makes some additional assumptions. First, all
keywords must appear in uppercase, with one exception noted next.
Second, an "SQ Sequence" line must appear in the entry, although the
keyword "Sequence" can also appear in uppercase, as in "SQ SEQUENCE".
Third, the sequence length must be the next word after "SQ Sequence".
Fourth, the format of the sequence lines must occur as in the EMBL or
Swiss-Prot databases. The EMBL sequence lines are 80 characters
long (5 spaces, 60 sequence characters with 5 interspersed spaces,
and 10 characters with a right justified position number), plus the
newline character. The Swiss-Prot sequence lines are 70 characters
long (same as EMBL except no position numbers), plus the newline.

The getseq operation assumes that the sequence lines are in the format
described in the previous paragraph, and all of the characters in the
correct positions in that format are assumed to be characters of the
sequence. So, if the line format is incorrect, you will get garbage as the
sequence.
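
As a rough illustration of the fixed-column assumption, the following C
sketch copies the sequence columns out of one line laid out as described
above. It is not taken from the SEQIO source; the function name
fast_line_seq is hypothetical, and the choice to stop at the first blank
inside a group (to handle a short last line) is an assumption of this
sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of SEQIO: copy the sequence characters
 * out of one EMBL or Swiss-Prot sequence line, trusting the fixed layout
 * (5 leading spaces, then up to six 10-character groups separated by
 * single spaces).  Anything at or after column 70 (the EMBL position
 * number) is never looked at.  out must have room for 61 bytes.
 * Returns the number of characters copied.
 */
size_t fast_line_seq(const char *line, char *out)
{
    size_t len, n = 0;
    int group, done = 0;

    assert(line != NULL && out != NULL);
    len = strcspn(line, "\r\n");            /* length without the newline */

    for (group = 0; group < 6 && !done; group++) {
        size_t col = 5 + (size_t)group * 11; /* 10 characters + 1 space */
        size_t end = col + 10;

        for (; col < end && col < len; col++) {
            if (line[col] == ' ') {          /* short last line: stop */
                done = 1;
                break;
            }
            out[n++] = line[col];
        }
    }
    out[n] = '\0';
    return n;
}
```

Because the characters are taken purely by position, a line that does
not follow the layout yields whatever happens to sit in those columns.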

The rawseq operation here is exactly the same as the getseq operation,
since the EMBL and Swiss-Prot sequences don't contain other
characters.

The getinfo, putseq and annotate functions are the same as in the
EMBL/Swiss-Prot format.


FASTA/FASTA-old File Formats
****************************

 NOTE: The implementation of the FASTA format here follows the
 format described in the FASTA program documentation, with the
 exception that, at the beginning of the entry, multiple lines
 beginning with either '>' or ';' can appear. This was done in order
 to better distinguish the entry's header lines from the sequence
 lines (where comments beginning with ';' are permitted). This
 exception only occurs when reading FASTA entries. The FASTA
 output functions only use ';' for those additional header lines.

The read operation looks for a line beginning with '>'. That line is taken
as the header/description line for the entry. If that line has been
formatted using the standard one-line description format (see file
"user.doc"), then the sequence length is extracted from that line. The
operation then looks for the next line which does not begin with a '>'
and which does not begin with a ';'. If such a line occurs before the
next line with a '>', that line is the first line of the sequence. Finally, the
operation looks for the entry's end at either the next line which does
begin with a '>' or the end of the file.

The getseq operation scans the sequence lines (all of the lines not
beginning with '>'). All alphabetic characters on those lines are
assumed to be in the sequence, except that when a semi-colon
appears on a line, the rest of that line is considered a comment and not
part of the sequence. No format for those lines is assumed.
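
The per-line rule described above can be sketched in C as follows. This
is an illustration only, not the SEQIO implementation; the function name
fasta_line_seq and its buffer-based interface are assumptions made for
the example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of SEQIO: append the sequence characters
 * found on one line of a FASTA entry to out, a NUL-terminated buffer
 * with a capacity of cap bytes.  Lines beginning with '>' contribute
 * nothing, and a ';' ends the sequence portion of a line.
 * Returns the new length of out.
 */
size_t fasta_line_seq(const char *line, char *out, size_t cap)
{
    size_t n;
    const char *p;

    assert(line != NULL && out != NULL && cap > 0);
    n = strlen(out);

    if (line[0] == '>')                     /* header line */
        return n;

    for (p = line; *p != '\0' && *p != ';'; p++) {
        /* keep alphabetic characters only; skip digits, blanks, etc. */
        if (isalpha((unsigned char)*p) && n + 1 < cap)
            out[n++] = *p;
    }
    out[n] = '\0';
    return n;
}
```

Calling the function once per line of an entry, with out initially
empty, accumulates the sequence; characters that do not fit in the
buffer are silently dropped in this sketch.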

The rawseq operation is the same as the getseq operation, except that
all non-whitespace and non-numeric characters are considered part of
the sequence.

The getinfo operation first looks at the first header line of the entry, and
parses it according to the one-line description format specified in file
"user.doc". It then considers any following lines that begin either with a
'>' or a ';' as comment lines. Any other comments in the entry are
ignored.

In the FASTA format, the putseq operation outputs a first header line
according to the one-line description format. The comment/history
lines and the sequence identifiers are output as additional header lines
that begin with a ';'. Finally, the sequence is output.

In the FASTA-old format, the putseq operation only outputs the first
header line and the sequence lines. No comment/history lines are
output, and the identifiers appear in the header line.

In the FASTA format, the annotate operation either replaces, appends
or inserts the comment lines just after the first header line. There is no
annotate operation in the FASTA-old format.

Example FASTA entry:

>gb:A14666|acc:A14666 PRLB promoter - Bacteriophage lambda, 281 bp.
;
;NCBI gi: 579066
;
;SEQIO retrieval from GenBank database entry.   07-Feb-1996
  gatcagctgc gacacaacta gtttacttac tcgcttatta aaccagaccc acaatctttt
  acacagatac aatattttta gtggaaactt cttgacattt cggcccatga cctttactct
  gttataaatt acttttatgg gggacgatca cactagcaaa ggagttacct aagccccgaa
  tgttcaatgg gaagacttcc ccaatcatga cccacattac gggaccccaa gttgcggaga
  agaaggcgat gtaaactgtc aaagcaatca cagagatgat c

Example FASTA-old entry:

>gb:A14666|acc:A14666 PRLB promoter - Bacteriophage lambda, 281 bp.
  gatcagctgc gacacaacta gtttacttac tcgcttatta aaccagaccc acaatctttt
  acacagatac aatattttta gtggaaactt cttgacattt cggcccatga cctttactct
  gttataaatt acttttatgg gggacgatca cactagcaaa ggagttacct aagccccgaa
  tgttcaatgg gaagacttcc ccaatcatga cccacattac gggaccccaa gttgcggaga
  agaaggcgat gtaaactgtc aaagcaatca cagagatgat c


NBRF/NBRF-old File Formats
**************************

 NOTE: The implementation of the NBRF format follows the format
 descriptions given in the release notes of the VMS version of the
 PIR database, with the following exceptions:

   1. An identifier list (with identifiers separated by '|') can
    appear after the ';' on the first line of the entry, and there is
    no limitation to the length of that identifier list.
   2. The second line of the entry is treated as a full one-line
    description (so it can contain more than just the
    description and organism name).
   3. The NBRF header lines (which occur after the sequence)
    are assumed to begin at the first line whose second
    character is a ';', and run until the end of the entry. So, the
    sequence lines cannot contain such a line (or the
    sequence will only be partially read).
   4. Every "C;Comment: " line in the header lines is assumed to
    contain a space between the "C;Comment:" and the
    comment text. This space (or whatever character appears
    there) is not considered part of the comment text.

The read operation first looks for a line beginning with '>', which
contains a two-character code and database identifiers for the
sequence. The next line, which should not begin with a '>', contains a
one-line description of the sequence, and the operation attempts to
extract the sequence length from that line. After that, the operation
scans the sequence lines looking for the beginning of the header lines
or the end of the entry. The header lines begin with the first line whose
second character is ';', and they are not required to appear in an entry.
The end of the entry is either the first line which begins with a '>', or the
end of the file.

The getseq operation scans the sequence lines from just after the
description line to either the first occurrence of a '*', the beginning of
the header lines or the end of the entry. All alphabetic characters on
those lines are assumed to be in the sequence. No format for those
lines is assumed.

The rawseq operation is the same as the getseq operation, except that
all non-whitespace and non-numeric characters are considered part of
the sequence.

The getinfo operation first looks at the initial identification line. The
format of that line is ">??;..." where "??" is a two-character description
and "..." is a list of identifiers. Six forms of the two-character
description are recognized:

 o "P1" - Protein complete
 o "F1" - Protein fragment
 o "DL" - linear DNA
 o "DC" - circular DNA
 o "RL" - linear RNA
 o "RC" - circular RNA

and the appropriate alphabet, isfragment and iscircular values are
set.  The list of identifiers is added to mainid, mainacc and
idlist. If no identifier prefix is specified for an identifier (either
by the identifier itself or by the "IdPrefix" information field of the
database's BIOSEQ entry, if a database search is being performed),
then "oth" for Other is used.  The next line in the entry is parsed
according to the one-line description format. Then, if the header
lines were found in the entry during the read operation, they are
scanned, looking for lines beginning with "C;Accession:", "C;Comment:"
and "C;Date:" which give the accession numbers, comments and date,
respectively.
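
The mapping from the two-character description to the alphabet,
isfragment and iscircular values can be pictured as a small lookup
table. The C sketch below is an illustration under stated assumptions:
the struct nbrf_code type and the nbrf_decode function are hypothetical,
and the alphabet is shown as a string rather than whatever constants
SEQIO uses internally.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result type; field names mirror those used in the text. */
struct nbrf_code {
    const char *alphabet;   /* "Protein", "DNA" or "RNA" */
    int isfragment;
    int iscircular;
};

/*
 * Hypothetical helper, not part of SEQIO: decode the ">??;" prefix of
 * an NBRF identification line.  Returns 1 and fills *res if the code is
 * one of the six recognized forms, 0 otherwise.
 */
int nbrf_decode(const char *line, struct nbrf_code *res)
{
    static const struct {
        const char *code;
        struct nbrf_code val;
    } table[] = {
        { "P1", { "Protein", 0, 0 } },   /* protein, complete */
        { "F1", { "Protein", 1, 0 } },   /* protein, fragment */
        { "DL", { "DNA",     0, 0 } },   /* linear DNA        */
        { "DC", { "DNA",     0, 1 } },   /* circular DNA      */
        { "RL", { "RNA",     0, 0 } },   /* linear RNA        */
        { "RC", { "RNA",     0, 1 } },   /* circular RNA      */
    };
    size_t i;

    assert(line != NULL && res != NULL);
    if (strlen(line) < 4 || line[0] != '>' || line[3] != ';')
        return 0;
    for (i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (strncmp(line + 1, table[i].code, 2) == 0) {
            *res = table[i].val;
            return 1;
        }
    }
    return 0;
}
```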

In the NBRF format, the putseq operation outputs an initial identification
line of the appropriate form, containing one of the two-character
descriptions above (or "XX" if the alphabet is Unknown) and containing
the list of identifiers in idlist. It then outputs a one-line description
according to the one-line description format. The sequence is output
and terminated with a '*'. Finally, the date, accession numbers and
comments/history are output in lines beginning with "C;Date:",
"C;Accession:" and "C;Comment:", respectively.

In the NBRF-old format, the putseq operation only outputs the initial
identification line, the description line and the sequence lines. In
addition, only one identifier is placed on the initial identification line,
and if that identifier was not an accession number, the main accession
number is added to the beginning of the description line.

For the NBRF format, the annotate operation replaces or appends the
"C;Comment: " lines, if they exist. If no comment lines exist, then a
new comment section will be inserted (or rather output between the
existing lines of the entry) as follows. If a "C;Genetics:", "C;Complex:",
"C;Function:", "C;Superfamily:", "C;Keywords:" or "F;" line appears in
the entry, the comment is inserted just before the first of those lines.
Otherwise, the comment is inserted at the end of the entry.

There is no annotate operation in the NBRF-old format.

Example NBRF entry:

>DL;gb:A14666
PRLB promoter - Bacteriophage lambda, 281 bp.
  gatcagctgc gacacaacta gtttacttac tcgcttatta aaccagaccc acaatctttt
  acacagatac aatattttta gtggaaactt cttgacattt cggcccatga cctttactct
  gttataaatt acttttatgg gggacgatca cactagcaaa ggagttacct aagccccgaa
  tgttcaatgg gaagacttcc ccaatcatga cccacattac gggaccccaa gttgcggaga
  agaaggcgat gtaaactgtc aaagcaatca cagagatgat c*
C;Date: 18-AUG-1994
C;Accession: A14666
C;Comment: NCBI gi: 579066
C;Comment:
C;Comment: SEQIO retrieval from GenBank database entry.   23-Mar-1996

Example NBRF-old entry:

>DL;gb:A14666
~A14666 PRLB promoter - Bacteriophage lambda, 281 bp.
  gatcagctgc gacacaacta gtttacttac tcgcttatta aaccagaccc acaatctttt
  acacagatac aatattttta gtggaaactt cttgacattt cggcccatga cctttactct
  gttataaatt acttttatgg gggacgatca cactagcaaa ggagttacct aagccccgaa
  tgttcaatgg gaagacttcc ccaatcatga cccacattac gggaccccaa gttgcggaga
  agaaggcgat gtaaactgtc aaagcaatca cagagatgat c*


IG/Stanford, IG-old/Stanford-old File Formats
*********************************************

The read operation first looks for a line beginning with ';'. The
operation then looks for the next line which does not begin with a ';'.
All of the lines beginning with ';' make up the comment lines, and the
first line not beginning with ';' contains the sequence's description. If
the description line has been formatted using the standard one-line
description format (see file "user.doc"), then the sequence length is
extracted from that line. Finally, the operation looks for the entry's end
at either the next line which does begin with a ';' or the end of the file.

The getseq operation scans the sequence lines from just after the
description line until either the end of the entry is reached, or a '1' or a
'2' appears. All alphabetic characters on those lines are assumed to be
in the sequence. No format for those lines is assumed.

The rawseq operation is the same as the getseq operation, except that
all non-whitespace and non-numeric characters are considered part of
the sequence.

The getinfo operation first gets the comment lines at the beginning of
the entry, and then parses the description line according to the
one-line description format. Finally, it looks for a '1' or '2' at the end of
the sequence, and sets iscircular to 0 or 1, respectively.
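
The role of the trailing '1' or '2' can be sketched in C as follows.
This is only an illustration of the rule described above, not SEQIO
code; the function name ig_scan_seq and its interface are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of SEQIO: scan the text that follows
 * the description line of an IG/Stanford entry.  Alphabetic characters
 * are copied to out (capacity cap bytes) until a '1' or '2' terminator
 * or the end of the text is reached.  *iscircular is set to 0 for '1'
 * and to 1 for '2'; it is left unchanged if no terminator is found.
 * Returns the number of sequence characters copied.
 */
size_t ig_scan_seq(const char *text, char *out, size_t cap, int *iscircular)
{
    size_t n = 0;
    const char *p;

    assert(text != NULL && out != NULL && cap > 0 && iscircular != NULL);
    for (p = text; *p != '\0'; p++) {
        if (*p == '1' || *p == '2') {       /* end-of-sequence marker */
            *iscircular = (*p == '2');
            break;
        }
        if (isalpha((unsigned char)*p) && n + 1 < cap)
            out[n++] = *p;
    }
    out[n] = '\0';
    return n;
}
```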

In the IG/Stanford format, the putseq operation outputs any
comment/history lines (or just the line ";\n" if there are no
comment/history lines), a one-line description, the sequence and finally
either a '1' or '2' depending on the value of iscircular.

In the IG-old/Stanford-old format, the putseq operation outputs the
same text as in the IG/Stanford format except that exactly one
comment/history line is output.

In the IG/Stanford format, the annotate operation either replaces,
appends or inserts the comment lines at the beginning of the entry.
There is no annotate operation in the IG-old/Stanford-old format.

Example IG/Stanford entry:

;NCBI gi: 579066
;
;SEQIO retrieval from GenBank database entry.   07-Feb-1996
gb:A14666|acc:A14666 PRLB promoter - Bacteriophage lambda, 281 bp.
  gatcagctgc gacacaacta gtttacttac tcgcttatta aaccagaccc acaatctttt
  acacagatac aatattttta gtggaaactt cttgacattt cggcccatga cctttactct
  gttataaatt acttttatgg gggacgatca cactagcaaa ggagttacct aagccccgaa
  tgttcaatgg gaagacttcc ccaatcatga cccacattac gggaccccaa gttgcggaga
  agaaggcgat gtaaactgtc aaagcaatca cagagatgat c1

Example IG-old/Stanford-old entry:

;NCBI gi: 579066
gb:A14666|acc:A14666 PRLB promoter - Bacteriophage lambda, 281 bp.
  gatcagctgc gacacaacta gtttacttac tcgcttatta aaccagaccc acaatctttt
  acacagatac aatattttta gtggaaactt cttgacattt cggcccatga cctttactct
  gttataaatt acttttatgg gggacgatca cactagcaaa ggagttacct aagccccgaa
  tgttcaatgg gaagacttcc ccaatcatga cccacattac gggaccccaa gttgcggaga
  agaaggcgat gtaaactgtc aaagcaatca cagagatgat c1


ASN.1 Text File Format
**********************

 NOTE: This file format implementation is not nearly complete
 enough to handle all of the variations of ASN.1 text files. I
 concentrated the implementation on handling the "Bioseq"
 sequence records defined as part of the "Bioseq-set" structure,
 i.e., it looks for each "Bioseq-set.seq-set.seq" record in the file,
 where '.' separates the initial keywords for each level of
 sub-record. (See the NCBI toolkit for the definitions of the
 "Bioseq-set" and "Bioseq" syntax, and the values of those initial
 keywords).

 However, it does handle all of the syntactic requirements of the
 ASN.1 text format. It makes no assumptions on the structure of
 the file, handling a completely free-form file (with one exception
 listed below). It does assume that the format consists of a
 hierarchy of records, where a record consists of a text string
 identifier and then a pair of matching braces bounding the
 contents of the record (except for simple records which contain
 only one or more strings and numbers).

The read operation looks for the beginning of each
"Bioseq-set.seq-set.seq" record in the file. The operation assumes
that this record is a "Bioseq" record, and looks for the end of it. Also,
the read operation makes the syntactic requirement that the open
brace beginning the "seq" record is separated from its initial keyword
by exactly one space (i.e., the operation looks for the string "seq {").
After scanning to the end of the "seq" record, the operation looks for
the "seq.inst.length" sub-record. If found, the sequence length is
extracted from that sub-record.

The getseq operation looks for the "seq.inst.seq-data" sub-record in
the entry. If found, the sequence is extracted from that sub-record.
(NOTE: This operation can only handle sequences that have been
encoded in the `iupacna', `iupacaa', `ncbi2na' or `ncbi4na' formats.)

The rawseq operation is the same as the getseq operation, since the
`iupacna', `iupacaa', `ncbi2na' and `ncbi4na' formats do not contain
non-alphabetic characters.

The getinfo operation looks for a large number of possible sub-records
for information about the sequence. To find database identifiers, it
looks in the "seq.id" sub-record for the sub-sub-records "pir.name",
"pir.accession", "swissprot.name", "swissprot.accession",
"genbank.name", "genbank.accession", "embl.name",
"embl.accession", "ddbj.name", "ddbj.accession", "prf.name",
"prf.accession", "other.name", "other.accession", "pdb.mol", "gi",
"giim.id", "gibbsq" and "gibbmt". Any identifiers found are added to
the idlist. To find the date information, it looks in the "seq.descr"
sub-record to find the sub-sub-records "create-date",
"update-date", "genbank.date", "genbank.entry-date",
"embl.creation-date", "embl.update-date", "pir.date", "sp.created",
"sp.sequpd", "sp.annotupd" and "pdb.deposition".

Then, the operation searches for the description, organism and
comment information in the "seq.descr" sub-record. For the
description, the operation searches for the sub-sub-records "title",
"pdb.compound" and "name" and picks one of them for the description
("title" if found, else "pdb.compound", else "name"). For the organism,
the sub-sub-records "org.taxname", "org.common", "pir.source" and
"pdb.source" are searched. For the comments, all of the "comment"
sub-sub-records in "seq.descr" are concatenated together to make up
the comment lines.

Finally, the alphabet is picked up from the "seq.descr.mol-type",
"seq.descr.modif.dna", "seq.descr.modif.rna" or "seq.inst.mol"
sub-records, the isfragment field is set to 1 if "seq.descr.modif.partial"
exists, and the iscircular field is set to 1 if the data string in
"seq.inst.topology" is "circular".

The putseq operation outputs a "Bioseq" record for the sequence as
part of a "Bioseq-set" structure (i.e., the appropriate strings are output
before the first putseq operation, between the "Bioseq" records and
when the file is closed, so that the file consists of a correctly formatted
"Bioseq-set" record). The form of the file mirrors that of the Bioseq-set
example given in the NCBI toolkit.

(NOTE: Because some text must be output when the file is closed (i.e.,
when seqfclose is called), you MUST call seqfclose when writing an
ASN.1 file. If you don't call seqfclose, the text file will not be complete.)

The annotate operation either replaces, creates or appends the
comment lines in the "seq.descr" sub-record (i.e., the comment lines
are the "seq.descr.comment" records). If no "seq.descr" sub-record
exists, one is created in the most appropriate place in the "seq" record.
If the entry given to the annotate operation is not a Bioseq "seq"
record, an error occurs.

(NOTE: Using the annotate operation by itself will NOT create a valid
ASN.1 text file.) You must output the following strings before the first
entry, between entries, and after the last entry (again, assuming the
entries are "Bioseq" records taken from the "Bioseq-set" hierarchy):

   Before the first entry:  "Bioseq-set ::= {\n  seq-set {\n"
          Between entries:  " ,\n"
     After the last entry:  " } }\n"
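
A caller that collects annotated entries and writes them out itself
could wrap them with the three strings listed above roughly as follows.
This is a minimal sketch, not part of SEQIO: the function name
write_bioseq_set is hypothetical, and the entries are assumed to be
complete "seq { ... }" records held as C strings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not part of SEQIO: write n annotated entries to
 * fp, surrounded by the strings that turn them into one "Bioseq-set"
 * record.  Returns 0 on success, or -1 on a write error or if an entry
 * does not contain the string "seq {".
 */
int write_bioseq_set(FILE *fp, const char *const *entries, size_t n)
{
    size_t i;

    assert(fp != NULL && (entries != NULL || n == 0));

    if (fputs("Bioseq-set ::= {\n  seq-set {\n", fp) == EOF)
        return -1;
    for (i = 0; i < n; i++) {
        if (strstr(entries[i], "seq {") == NULL)
            return -1;                      /* not a Bioseq "seq" record */
        if (i > 0 && fputs(" ,\n", fp) == EOF)
            return -1;                      /* separator between entries */
        if (fputs(entries[i], fp) == EOF)
            return -1;
    }
    if (fputs(" } }\n", fp) == EOF)
        return -1;
    return 0;
}
```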

A Complete ASN.1 Text File:

Bioseq-set ::= {
  seq-set {
    seq {
      id {
        genbank {
          name "A14666" ,
          accession "A14666" } } ,
      descr {
        title "PRLB promoter" ,
        org {
          taxname "Bacteriophage lambda" } ,
        update-date
          str "18-AUG-1994" ,
        comment "NCBI gi: 579066" ,
        comment "SEQIO retrieval from GenBank database entry.  07-Feb-1996" } ,
      inst {
        repr raw ,
        mol dna ,
        length 281 ,
        seq-data
          iupacna "gatcagctgcgacacaactagtttacttactcgcttattaaaccagacccacaatcttt
tacacagatacaatatttttagtggaaacttcttgacatttcggcccatgacctttactctgttataaattactttta
tgggggacgatcacactagcaaaggagttacctaagccccgaatgttcaatgggaagacttccccaatcatgacccac
attacgggaccccaagttgcggagaagaaggcgatgtaaactgtcaaagcaatcacagagatgatc" } } } }


GCG Format
**********

The read operation first looks for a line that ends with the string ".." (or
more precisely, a line whose last non-whitespace characters are "..").
That line should be the GCG information line, and should look
something like the following:

  gb:A02201  Length: 664  June 21, 1996 18:42  Type: N  Check: 9896  ..

although any or all of this information (except the "..") can be missing.
If the line contains the "Length:" keyword, then the read operation will
extract the sequence length. The read operation then reads the rest of
the file, and assumes that those lines contain the sequence.
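
Recognizing the information line and pulling out the length might look
like the following C sketch. It illustrates the rules given above under
assumed names (is_gcg_info_line and gcg_info_length are hypothetical)
and is not the SEQIO implementation.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, not part of SEQIO: returns 1 if the last
 * non-whitespace characters of line are "..", 0 otherwise.
 */
int is_gcg_info_line(const char *line)
{
    size_t len;

    assert(line != NULL);
    len = strlen(line);
    while (len > 0 && isspace((unsigned char)line[len - 1]))
        len--;                              /* ignore trailing whitespace */
    return len >= 2 && line[len - 1] == '.' && line[len - 2] == '.';
}

/*
 * Hypothetical helper, not part of SEQIO: returns the number following
 * the "Length:" keyword, or -1 if the keyword or the number is missing.
 */
long gcg_info_length(const char *line)
{
    const char *p;
    char *end;
    long val;

    assert(line != NULL);
    p = strstr(line, "Length:");
    if (p == NULL)
        return -1;
    p += strlen("Length:");
    val = strtol(p, &end, 10);              /* skips leading blanks */
    if (end == p || val < 0)
        return -1;
    return val;
}
```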

The getseq operation scans the sequence lines. All alphabetic
characters on those lines are assumed to be in the sequence. No
format for those lines is assumed.

The rawseq operation is the same as the getseq operation, except that
all non-whitespace and non-numeric characters are considered part of
the sequence. During this operation, any period `.' appearing in the
sequence lines is assumed to be a gap character and translated into a
dash `-' (SEQIO's canonical gap character).

The getinfo operation takes the date and the alphabet from the GCG
information line (if the date and the "Type:" fields are there), sets the
description to the first word of the GCG information line (if it isn't
"Length:"), and then takes all of the lines up to the GCG information
line as the comment.

The putseq operation first outputs any comment lines, outputs a
complete GCG information line (with a valid checksum), and then
outputs the sequence lines in the default format shown below. Any
dash `-' appearing in the output sequence is assumed to be a gap
character and automatically translated into a period `.'.
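
The gap translation in both directions amounts to a single character
substitution over the sequence. A minimal C sketch follows; the function
name translate_gaps is hypothetical and the code is not taken from
SEQIO.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper, not part of SEQIO: replace every occurrence of
 * the character from with the character to, in place.
 */
void translate_gaps(char *seq, char from, char to)
{
    char *p;

    assert(seq != NULL && from != '\0' && to != '\0');
    for (p = strchr(seq, from); p != NULL; p = strchr(p + 1, from))
        *p = to;
}
```

With this helper, reading would correspond to translate_gaps(seq, '.',
'-') and writing to translate_gaps(seq, '-', '.').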

There currently is no annotate function.


GCG-* Formats
*************

The processing of the GCG-* formats essentially merges the
processing of the GCG format on the sequence lines with the
processing of the GenBank, PIR, EMBL, Swiss-Prot, FASTA,
FASTA-old, NBRF, NBRF-old, IG/Stanford and IG-old formats when
dealing with the header lines of each entry. So, see above for the
details on that processing.

The one exception to this rule is the relationship between the NBRF
and GCG-NBRF formats. Since the NBRF entries contain "header"
information that actually appears at the end of the entry, and the GCG
format requires that the last thing in an entry be the sequence, the
GCG and non-GCG forms of the NBRF entries differ more than the
other formats. In the GCG-NBRF format, the lines before the GCG
information line are assumed to contain the two header lines normally
found in the NBRF entries, immediately followed by the lines normally
appearing at the end of the file (the "C;Comment:", "C;Accession:" and
other lines). After those lines, the GCG information line and sequence
lines should appear, and be the last things in the entry. The fmtseq
program and SEQIO package have been implemented to make this
transformation between the NBRF and GCG-NBRF formats.

An example GCG-GenBank entry:

LOCUS       A14666        281 bp    DNA             PHG       18-AUG-1994
DEFINITION  PRLB promoter.
ACCESSION   A14666
KEYWORDS    .
SOURCE      Bacteriophage lambda.
  ORGANISM  Bacteriophage lambda
            Viridae; ds-DNA nonenveloped viruses; Siphoviridae.
REFERENCE   1  (bases 1 to 281)
  AUTHORS   Michiels,F., Delcour,J., Mahillon,J., Joos,H., Platteeuw,C. and
            Josson,K.
  TITLE     Transformed lactic acid bacteria
  JOURNAL   Patent: EP 0311469-A 10 12-APR-1989;
            PLANT GENETIC SYSTEMS N.V.; UNIVERSITE CATHOLIQUE DE LOUVAIN
COMMENT     NCBI gi: 579066
FEATURES             Location/Qualifiers
     source          1..281
                     /organism="Bacteriophage lambda"
     RBS             158..166
     CDS             180..254
                     /note="PRLB;  NCBI gi: 579067"
                     /codon_start=1
                     /translation="MFNGKTSPIMTHITGPQVAEKKAM"
BASE COUNT       89 a     67 c     52 g     73 t
ORIGIN

  gb:A14666  Length: 281  June 28, 1996 16:23  Type: N  Check: 2754  ..

       1 gatcagctgc gacacaacta gtttacttac tcgcttatta aaccagaccc

      51 acaatctttt acacagatac aatattttta gtggaaactt cttgacattt

     101 cggcccatga cctttactct gttataaatt acttttatgg gggacgatca

     151 cactagcaaa ggagttacct aagccccgaa tgttcaatgg gaagacttcc

     201 ccaatcatga cccacattac gggaccccaa gttgcggaga agaaggcgat

     251 gtaaactgtc aaagcaatca cagagatgat c

An example GCG-NBRF entry:

>DL;gb:A14666
PRLB promoter - Bacteriophage lambda, 281 bp.
C;Date: 18-AUG-1994
C;Accession: A14666
C;Comment: NCBI gi: 579066
C;Comment:
C;Comment: SEQIO retrieval from GenBank database.   28-Jun-1996

  gb:A14666  Length: 281  June 28, 1996 16:22  Type: N  Check: 2754  ..

       1 gatcagctgc gacacaacta gtttacttac tcgcttatta aaccagaccc

      51 acaatctttt acacagatac aatattttta gtggaaactt cttgacattt

     101 cggcccatga cctttactct gttataaatt acttttatgg gggacgatca

     151 cactagcaaa ggagttacct aagccccgaa tgttcaatgg gaagacttcc

     201 ccaatcatga cccacattac gggaccccaa gttgcggaga agaaggcgat

     251 gtaaactgtc aaagcaatca cagagatgat c


MSF Multiple Sequence Format
****************************

The read operation first looks for a GCG information line of the
following form:

 Pileup.Msf  MSF: 729  Type: N  June 21, 1996 15:02  Check: 3171 ..

although any or all of this information can be missing, except the ".."
and the "MSF: %d" section, the second of which the read operation
uses to get the sequence length. After the information line, the read
operation looks for the sequence name lines, which are of the form

 Name: Humhbbbpc        Len:   729  Check: 6463  Weight:  1.00

where the "Name: " field gives the sequence identifier and must appear
on any non-blank line in this section of the MSF file (the other fields
are ignored, and the length is assumed to be the same as the global
length). The sequence name lines section ends when a line beginning
with "//" appears. Any number of blank lines can be interspersed in this
section, but any non-blank line should contain the above format. The
rest of the file is assumed to contain the sequence lines, where each
sequence line begins with the sequence name followed by a space, as
in:

           401                                                450
Humhbbbpc  CAGAAATTTA GTGTTTTCTC AGTCAGTTAA CATTCCTTCA AC........
Humhbbbpd  CAGAAATTTA GTGTTTTCTC AGTCAGTTAA CATTCCTTCA AC........
Humhbbbpe  CAGACAAATG GAACAGAATA GAGAGCCCAG AAATAAGACC ACATG.....
Humhbbbpf  CAGACAAATG GAACAGAATA GAGAGCCCAG AAATAAGACC ACATG.....
Humhbbbpg  AATACAAAAT CAGTAGCATT TCATATATAA A......... ..........
Humhbbbph  AATACAAAAT CAGTAGCATT TCATATATAA A......... ..........
Humhbbbp1  AAGTGATGAA ATTGTGTATT CAATGTAGTC TCAAGAGAAT TGAAAACCAA
Humhbbbpa  AAATAAAAGG ATGGAGGAAG ATCTACCAAG CA........ ..........
Humhbbbpb  AAATAAAAGG ATGGAGGAAT ATCTACCAAG CA........ ..........
Humhbbbp2  AGCT.AAAGG ATTGTAAATG CACTAATCAG CACTCTGTGT CTAGCTCAAG

No format of the sequence lines or presence or absence of the position
number lines (401...450) is assumed, except for the initial sequence
name. The sequence lines run to the end of the file.

The getseq operation finds every sequence line beginning with the
corresponding sequence name (the sequences are ordered by the
order of sequence names in the sequence names section). All
alphabetic characters appearing after the sequence name are taken for
the sequence.
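
Selecting the lines that belong to one sequence can be sketched in C as
follows. The function name msf_line_seq and its interface are
assumptions made for this illustration; it is not the SEQIO
implementation.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of SEQIO: if line begins with name
 * followed by a space, append the alphabetic characters after the name
 * to out, a NUL-terminated buffer with a capacity of cap bytes.  Other
 * lines (different names, position-number lines, blank lines) are
 * ignored.  Returns the new length of out.
 */
size_t msf_line_seq(const char *line, const char *name,
                    char *out, size_t cap)
{
    size_t n, namelen;
    const char *p;

    assert(line != NULL && name != NULL && name[0] != '\0');
    assert(out != NULL && cap > 0);
    n = strlen(out);
    namelen = strlen(name);

    if (strncmp(line, name, namelen) != 0 || line[namelen] != ' ')
        return n;                           /* not a line of this sequence */

    for (p = line + namelen; *p != '\0'; p++) {
        if (isalpha((unsigned char)*p) && n + 1 < cap)
            out[n++] = *p;
    }
    out[n] = '\0';
    return n;
}
```

Requiring a space after the name keeps a name such as "Humhbbbp1" from
matching the lines of a longer name that merely starts with it.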

The rawseq operation is the same as the getseq operation, except that
all non-whitespace and non-numeric characters are considered part of
the sequence. During this operation, any period `.' appearing in the
sequence lines is assumed to be a gap character and translated into a
dash `-' (SEQIO's canonical gap character).

The getinfo operation takes the date and the alphabet from the GCG
information line (if the date and the "Type:" fields are there), sets the
description to the sequence name found in the sequence name section,
and then takes all of the lines up to the GCG information line as the
comment.

The putseq operation outputs an MSF file exactly mimicking the files
output by GCG using "PileUp" in its default mode, except that only the
keyword "PileUp" appears on the first line and no comments are
output. Any dashes `-' found in the sequences are assumed to be gap
characters and are automatically translated into periods `.'. If the
sequences are of different lengths, the putseq operation will pad the
smaller sequences with periods `.'.

(IMPORTANT: The one unusual feature about the putseq operation is
that, unlike all of the other putseq operations except Clustalw and
PHYLIP, the actual output does not occur until `seqfclose' is called to
close the file. Because the MSF format must know the number of entries
before it can begin the output, the sequences cannot be output at each
call to `seqfwrite'. What the putseq operation does, on each call to
`seqfwrite', is make a copy of the sequence and a sequence identifier
(either the main identifier, description or organism name). Then, when
`seqfclose' is called, all of the sequences are output in the correct
format.)

There currently is no annotate function.

An example MSF file:

PileUp


 pir.msf  MSF: 104  Type: P  June 28, 1996 17:04  Check: 3466  ..

 Name: pir:CCCZ         Len:   104  Check: 9501  Weight:  1.00
 Name: pir:CCMQR        Len:   104  Check: 9512  Weight:  1.00
 Name: pir:CCMKP        Len:   104  Check: 9066  Weight:  1.00
 Name: pir:CCRB         Len:   104  Check: 8395  Weight:  1.00
 Name: pir:CCGW         Len:   104  Check: 8496  Weight:  1.00
 Name: pir:CCCM         Len:   104  Check: 8496  Weight:  1.00

//

            1                                                   50
pir:CCCZ    GDVEKGKKIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQAPGYSYTA
pir:CCMQR   GDVEKGKKIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQAPGYSYTA
pir:CCMKP   GDVFKGKRIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQASGFTYTE
pir:CCRB    GDVEKGKKIF VQKCAQCHTV EKGGKHKTGP NLHGLFGRKT GQAVGFSYTD
pir:CCGW    GDVEKGKKIF VQKCAQCHTV EKGGKHKTGP NLHGLFGRKT GQAVGFSYTD
pir:CCCM    GDVEKGKKIF VQKCAQCHTV EKGGKHKTGP NLHGLFGRKT GQAVGFSYTD

            51                                                 100
pir:CCCZ    ANKNKGIIWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK
pir:CCMQR   ANKNKGITWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK
pir:CCMKP   ANKNKGIIWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK
pir:CCRB    ANKNKGITWG EDTLMEYLEN PKKYIPGTKM IFAGIKKKDE RADLIAYLKK
pir:CCGW    ANKNKGITWG EETLMEYLEN PKKYIPGTKM IFAGIKKKGE RADLIAYLKK
pir:CCCM    ANKNKGITWG EETLMEYLEN PKKYIPGTKM IFAGIKKKGE RADLIAYLKK

            101
pir:CCCZ    ATNE
pir:CCMQR   ATNE
pir:CCMKP   ATNE
pir:CCRB    ATNE
pir:CCGW    ATNE
pir:CCCM    ATNE


PHYLIP Interleaved and Sequential File Formats
**********************************************

 NOTE: The implementation here is more flexible than other
 implementations, but it is somewhat restrictive in its output.
 Specifically:

   1. Both interleaved and sequential formats are supported and
    rigorously distinguished. See below for the details.
   2. An input file in the PHYLIP format can contain one or more
    PHYLIP entries, where each entry must be separated only
    by whitespace. Mixed files (some interleaved entries, some
    sequential entries) are supported.
   3. Any number of blank lines or lines filled only with
    whitespace can be included in the file. Blank lines do not
    disrupt the parsing of the entries.
   4. The output operation does NOT output more than one
    entry per file, because I have yet to completely figure out
    the SEQIO interface issues. (Note that this may change in a
    future version.)
   5. This implementation was done using the documentation
    from Version 3.5c. Whether it works with earlier versions is
    not known.

The read operation first skips whitespace characters and then looks for
the number of sequences and the sequence length (those two numbers
must be the first thing in the entry). On that initial line, it also looks for
the option characters 'A', 'C', 'F', 'M', 'U', 'W'. If any of the options
except 'U' are found, the operation then skips any subsequent lines
that begin with a match to the character strings "ANCESTOR ",
"CATEGORIES", "FACTORS ", "MIXTURE ", or "WEIGHTS ". A line is
considered to match one of the strings if the first 10 characters of the
line contain a prefix of the string padded by spaces. Also, these lines
are skipped only if the corresponding option was given on that first
line.
(NOTE: This may cause some problems on an entry such as this one:

3 6 A
A         ABCDEF
B         BCDEFG
C         CDEFGH

because the second line of the entry is treated as an "ANCESTOR "
line, when in fact it was a sequence line. But, from looking at the
documentation, the PHYLIP programs would die on this entry, too. And
replacing "A " with something like "Alpha " eliminates the problem.)
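
The matching rule for these option lines can be sketched as follows.
This is a Python illustration, not the SEQIO code. The mapping from
option letter to keyword is an assumption inferred from the first
letters of the keywords, and the function name is hypothetical.

```python
# Assumed mapping from option letter to the keyword starting its line.
OPTION_KEYWORDS = {
    "A": "ANCESTOR",
    "C": "CATEGORIES",
    "F": "FACTORS",
    "M": "MIXTURE",
    "W": "WEIGHTS",
}


def is_option_line(line, options):
    """True if the first 10 characters hold a space-padded keyword prefix.

    `options` is the string of option letters found on the entry's
    first line; only keywords for those options are considered.
    """
    field = line[:10].ljust(10)
    token = field.rstrip(" ")
    # The prefix must be non-empty and must not be followed by more text.
    if not token or " " in token:
        return False
    return any(
        OPTION_KEYWORDS[o].startswith(token)
        for o in options
        if o in OPTION_KEYWORDS
    )
```

With option 'A' set, the line "A         ABCDEF" matches (the problem
case in the note above), while "Alpha     ABCDEF" does not.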

After skipping those initial lines, the read operation tries to match the
subsequent lines to the interleaved and sequential file formats. The
following criteria are the keys to distinguishing between the two
formats:

 1. The line giving the initial piece of a sequence must be at least 10
   characters long and there must be at least one non-whitespace
   character in those first ten characters. This should be the
   sequence identifier, and its characters are not counted as part of
   the sequence.
 2. In the Interleaved format, all of the sequence substrings in each
   block of the entry must have the same length. A block is a set of
   "number-of-sequences" lines (not counting blank lines) which
   contain a piece of each of the sequences.
 3. The end of each sequence must occur on its own line, without
   any additional non-whitespace text after the sequence
   characters.

If one format but not the other matches, or both formats match and the
input format has been specified as PHYLIP-Int or PHYLIP-Seq (instead
of just PHYLIP), then the entry format has been successfully
determined. Otherwise (if neither format matches, or both match and
the input format is just PHYLIP), a parse error is
triggered. However, given the above criteria and the fact that the
operation attempts to completely match both formats against the text,
the likelihood that the formats will match the same text is extremely
remote.
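
The decision rule in the preceding paragraph can be written as a small
function. This Python sketch follows the text literally and is not the
SEQIO code. The two boolean arguments stand for the results of trying
each layout against the entry, and all names are hypothetical.

```python
def resolve_phylip_layout(matches_interleaved, matches_sequential,
                          declared="PHYLIP"):
    """Pick the entry layout, or raise ValueError for a parse error."""
    declared = declared.lower()
    # Exactly one layout matched: that layout wins.
    if matches_interleaved and not matches_sequential:
        return "interleaved"
    if matches_sequential and not matches_interleaved:
        return "sequential"
    # Both matched: only an explicit input format can break the tie.
    if matches_interleaved and matches_sequential:
        if declared == "phylip-int":
            return "interleaved"
        if declared == "phylip-seq":
            return "sequential"
    raise ValueError("parse error: cannot determine PHYLIP layout")
```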

Finally, if the 'U' option has been set on the entry's first line, the read
operation skips the user trees listed in the entry, to get to the end of
the entry. The format of the user trees consists of a line giving the
number of trees, followed by any number of lines of text where each
user tree description is ended by a semi-colon (the operation just
counts the semi-colons it sees). The end of the entry is at the end of
the line containing the last semi-colon.
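
A sketch of the semi-colon counting described above, in Python. It is
an illustration rather than the SEQIO code. It assumes the entry is
already split into lines and that `start' indexes the line holding the
number of trees; the function name is hypothetical.

```python
def skip_user_trees(lines, start):
    """Return the index of the first line after the user trees.

    lines[start] must begin with the number of trees. Each tree ends
    with a semi-colon, so the semi-colons are simply counted.
    """
    count = int(lines[start].split()[0])
    seen = 0
    i = start + 1
    while seen < count and i < len(lines):
        seen += lines[i].count(";")
        i += 1
    if seen < count:
        raise ValueError("entry ended before all user trees were read")
    return i
```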

The getseq operation finds the first line of the appropriate sequence in
the entry (i.e., the `seqfseqno' sequence), skips the 10 character
identifier and retrieves the sequence. All alphabetic characters are
considered to be in the sequence.

The rawseq operation is the same as the getseq operation, except that
all non-whitespace and non-numeric characters are considered part of
the sequence.
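
The difference between the getseq and rawseq character rules can be
shown with two small filters. This Python sketch only illustrates the
rules as stated here; the function names are hypothetical and it is
not the SEQIO code.

```python
def getseq_chars(text):
    """Keep only alphabetic characters (the getseq rule)."""
    return "".join(c for c in text if c.isalpha())


def rawseq_chars(text):
    """Keep everything except whitespace and digits (the rawseq rule)."""
    return "".join(c for c in text if not c.isspace() and not c.isdigit())


if __name__ == "__main__":
    piece = "GDVEK-GK*K 10\n"
    print(getseq_chars(piece))
    print(rawseq_chars(piece))
```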

The getinfo operation takes the 10 character sequence identifier to be
the description of the sequence. No other information is retrieved.

The putseq operation outputs an Interleaved or Sequential entry
exactly as described in the PHYLIP program documentation. If the
sequences output are of different lengths, the putseq operation will pad
the smaller sequences with dashes `-'.

(IMPORTANT: The one unusual feature about the putseq operation is
that, unlike all of the other putseq operations except Clustalw and MSF,
the actual output does not occur until `seqfclose' is called to close the
file. Because the PHYLIP format must know the number of entries
before it can output the first line, the sequences cannot be output at
each call to `seqfwrite'. What the putseq operation does is, on each call
to `seqfwrite', it makes a copy of the sequence and a sequence
identifier (either the mainid, mainacc, description or organism name).
Then, when `seqfclose' is called, all of the sequences are output in the
correct format.)
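
The deferred output scheme can be sketched as a writer that buffers
every sequence and emits nothing until it is closed. This is a Python
illustration, not the SEQIO code. The class name is hypothetical and
the layout is simplified (one line per sequence, no grouping into
blocks of ten).

```python
import io


class DeferredPhylipWriter:
    """Buffer sequences; write the whole entry only on close()."""

    def __init__(self, out):
        self.out = out
        self.records = []          # (identifier, sequence) copies

    def write(self, ident, seq):
        # Nothing is output yet: the count line needs the final total.
        self.records.append((ident, seq))

    def close(self):
        width = max((len(s) for _, s in self.records), default=0)
        self.out.write(f"{len(self.records):6d} {width:6d}\n")
        for ident, seq in self.records:
            # 10-character identifier, shorter sequences padded with '-'.
            self.out.write(f"{ident[:10]:<10}{seq.ljust(width, '-')}\n")


if __name__ == "__main__":
    buf = io.StringIO()
    writer = DeferredPhylipWriter(buf)
    writer.write("seqA", "ACGT")
    writer.write("seqB", "AC")
    writer.close()
    print(buf.getvalue(), end="")
```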

There is no annotate function.

Example PHYLIP Interleaved entry:

     6    104
pir:CCCZ   GDVEKGKKIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQAPGYSYTA
pir:CCMQR  GDVEKGKKIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQAPGYSYTA
pir:CCMKP  GDVFKGKRIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQASGFTYTE
pir:CCRB   GDVEKGKKIF VQKCAQCHTV EKGGKHKTGP NLHGLFGRKT GQAVGFSYTD
pir:CCGW   GDVEKGKKIF VQKCAQCHTV EKGGKHKTGP NLHGLFGRKT GQAVGFSYTD
pir:CCCM   GDVEKGKKIF VQKCAQCHTV EKGGKHKTGP NLHGLFGRKT GQAVGFSYTD

           ANKNKGIIWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK
           ANKNKGITWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK
           ANKNKGIIWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK
           ANKNKGITWG EDTLMEYLEN PKKYIPGTKM IFAGIKKKDE RADLIAYLKK
           ANKNKGITWG EETLMEYLEN PKKYIPGTKM IFAGIKKKGE RADLIAYLKK
           ANKNKGITWG EETLMEYLEN PKKYIPGTKM IFAGIKKKGE RADLIAYLKK

           ATNE
           ATNE
           ATNE
           ATNE
           ATNE
           ATNE

Example PHYLIP Sequential entry:

     6    104
pir:CCCZ   GDVEKGKKIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQAPGYSYTA
           ANKNKGIIWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK
           ATNE
pir:CCMQR  GDVEKGKKIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQAPGYSYTA
           ANKNKGITWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK
           ATNE
pir:CCMKP  GDVFKGKRIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQASGFTYTE
           ANKNKGIIWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK
           ATNE
pir:CCRB   GDVEKGKKIF VQKCAQCHTV EKGGKHKTGP NLHGLFGRKT GQAVGFSYTD
           ANKNKGITWG EDTLMEYLEN PKKYIPGTKM IFAGIKKKDE RADLIAYLKK
           ATNE
pir:CCGW   GDVEKGKKIF VQKCAQCHTV EKGGKHKTGP NLHGLFGRKT GQAVGFSYTD
           ANKNKGITWG EETLMEYLEN PKKYIPGTKM IFAGIKKKGE RADLIAYLKK
           ATNE
pir:CCCM   GDVEKGKKIF VQKCAQCHTV EKGGKHKTGP NLHGLFGRKT GQAVGFSYTD
           ANKNKGITWG EETLMEYLEN PKKYIPGTKM IFAGIKKKGE RADLIAYLKK
           ATNE


Clustalw Format
***************

The read operation first skips the header line of the file, and then skips
any blank lines. The next non-blank line is assumed to begin the first
block. The sequence lines of each block contain first an identifier of 15
characters and then the rest of the line is sequence. Those sequence
lines must begin with a non-whitespace character. After the sequence
lines in each block, there is an additional line to highlight closely
related columns in the alignment, followed by zero or more blank lines.
This additional line and all of the lines occurring between blocks must
either be empty or begin with a whitespace character. There is only one
entry per file, and the whole file is assumed to consist of these
sequence blocks.

The getseq operation finds the first line of the appropriate sequence in
the entry (i.e., the `seqfseqno' sequence), skips the 15 character
identifier and retrieves the sequence. All alphabetic characters are
considered to be in the sequence.
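
The block structure and the getseq rule can be sketched together in
Python. This is an illustration only, not the SEQIO code. It collects
sequences by identifier, which assumes the identifiers are unique (the
package itself selects a sequence by its position in the block), and
the function name is hypothetical.

```python
def parse_clustalw(text):
    """Return {identifier: sequence} for a Clustalw-style alignment."""
    seqs = {}
    for line in text.splitlines()[1:]:      # skip the header line
        # Blank lines and the column-marker line start with whitespace.
        if not line or line[0].isspace():
            continue
        ident = line[:15].rstrip()
        residues = "".join(c for c in line[15:] if c.isalpha())
        seqs[ident] = seqs.get(ident, "") + residues
    return seqs
```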

The rawseq operation is the same as the getseq operation, except that
all non-whitespace and non-numeric characters are considered part of
the sequence.

The getinfo operation takes the 15 character sequence identifier to be
the description of the sequence. No other information is retrieved.

The putseq operation outputs a Clustalw entry exactly as the clustalw
program does, except that the version number is replaced with "*.**"
and the package does not look for closely related columns in the
output alignment (it simply outputs a line of whitespace without any '*'
or '.' characters). If the sequences are of different lengths, the putseq
operation will pad the smaller sequences with dashes '-'.

(IMPORTANT: The one unusual feature about the putseq operation is
that, unlike all of the other putseq operations except PHYLIP and MSF,
the actual output does not occur until `seqfclose' is called to close the
file. Because the Clustalw format must know the number of entries
before it can output the first line, the sequences cannot be output at
each call to `seqfwrite'. What the putseq operation does is, on each call
to `seqfwrite', it makes a copy of the sequence and a sequence
identifier (either the mainid, mainacc, description or organism name).
Then, when `seqfclose' is called, all of the sequences are output in the
correct format.)

There is no annotate function.

Example Clustalw file:

CLUSTAL W(*.**) multiple sequence alignment


pir:CCCZ       GDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWG
pir:CCMQR      GDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGITWG
pir:CCMKP      GDVFKGKRIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQASGFTYTEANKNKGIIWG
pir:CCRB       GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAVGFSYTDANKNKGITWG
pir:CCGW       GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAVGFSYTDANKNKGITWG
pir:CCCM       GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAVGFSYTDANKNKGITWG


pir:CCCZ       EDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE
pir:CCMQR      EDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE
pir:CCMKP      EDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE
pir:CCRB       EDTLMEYLENPKKYIPGTKMIFAGIKKKDERADLIAYLKKATNE
pir:CCGW       EETLMEYLENPKKYIPGTKMIFAGIKKKGERADLIAYLKKATNE
pir:CCCM       EETLMEYLENPKKYIPGTKMIFAGIKKKGERADLIAYLKKATNE


FASTA-output Formats
********************

 NOTE: With a few exceptions, this implementation can read
 and understand the output from the FASTA, TFASTA, SSEARCH,
 LFASTA, LALIGN and ALIGN programs which were run either in
 interactive or non-interactive mode, and where the output was
 formatted with MARKX option set to any of 0, 1, 2, 3 or 10.

 The exceptions are

   1. The program must have been run in non-interactive mode
    in order for the automatic format determination to work
    correctly. By "non-interactive", I mean that the initial
    header output by the program:

        FASTA searches a protein or DNA sequence data bank
        version 2.0u4 Feb., 1996
       Please cite:
        W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448
       .
       .
       .

    must appear in the text given as input.
   2. If FASTA, TFASTA or SSEARCH is run in interactive
    mode, no information will be known about the query
    sequence (its information is in the initial header, which is
    not included in the file specified to receive the program
    output).
   3. The ALIGN program must be run in non-interactive mode
    in order for the package to correctly parse it (i.e., that
    initial header must occur in the text). For the other
    programs, the package will parse its output correctly, if the
    file format is specified as `FASTA-output'.
   4. The implementation was tested against version 2.0u4. If the
    output was different in previous versions, the
    implementation may not work.

The read operation first scans the text occurring before the first
alignment in the file. This initial text is ignored, except where it gives
information about the sequences being aligned. The initial texts of
some of the output formats contain lines of the following form.

 >GT8.7 transl. of pa875.con, 19 to 675: 217 aa
 >musplfm transl. of musplfm.seq, 2 to 676 : 224 aa

(A) musplfm.aa >musplfm transl. of musplfm.seq, 2 to 676          - 224 aa
(B) lcbo.aa    >LCBO - Prolactin precursor - Bovine               - 229 aa

>musplfm transl. of musplfm.seq, 2 to 676           224 aa vs.
>LCBO - Prolactin precursor - Bovine                229 aa

The text after the '>' is parsed to extract the sequence id (the first word
after the '>'), a sequence description, the sequence length and
alphabet information about the sequence.
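
A rough Python sketch of how such a line could be picked apart with a
regular expression. It is not the SEQIO parser and covers only the
line shapes shown above. Treating "nt" as the nucleotide counterpart
of "aa" is an assumption, and the function name is hypothetical.

```python
import re

# id, description, optional ':' or '-', length, unit, optional "vs."
_HEADER = re.compile(
    r"^(\S+)\s*(.*?)\s*[:\-]?\s*(\d+)\s+(aa|nt)\b(?:\s+vs\.)?\s*$"
)
ALPHABETS = {"aa": "protein", "nt": "DNA"}   # "nt" is an assumption


def parse_fasta_header(line):
    """Extract id, description, length and alphabet after the '>'."""
    pos = line.find(">")
    if pos < 0:
        return None
    m = _HEADER.match(line[pos + 1:])
    if m is None:
        return None
    ident, desc, length, unit = m.groups()
    return {
        "id": ident,
        "description": desc,
        "length": int(length),
        "alphabet": ALPHABETS[unit],
    }
```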

Then, the read operation reads the "entries" of the file, where each
entry is considered to be the text describing an alignment between two
sequences. Different programs output different sets of alignments, but
all six of the FASTA programs supported output one or more
two-sequence alignments. Thus, every entry in this format contains
two sequences.

The getseq operation extracts the appropriate sequence from the entry
(the first or second sequence if the `seqfseqno' value is 1 or 2,
respectively). All alphabetic characters are considered part of the
sequence, except that if the output was generated with MARKX=2, then
any periods occurring in the second sequence are replaced with the
corresponding character of the first sequence.
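
The MARKX=2 period substitution amounts to a column-by-column merge of
the two aligned strings. A minimal Python sketch (not the SEQIO code),
assuming both strings are already extracted and of equal length; the
function name is hypothetical.

```python
def expand_markx2(first, second):
    """Replace each '.' in `second` with the matching char of `first`."""
    if len(first) != len(second):
        raise ValueError("aligned sequences must have equal length")
    return "".join(a if b == "." else b for a, b in zip(first, second))


if __name__ == "__main__":
    print(expand_markx2("GDVEKG", "..A..-"))
```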

The rawseq operation is the same as the getseq operation, except that
all non-whitespace and non-numeric characters are considered part of
the sequence (with the exception of period substitution mentioned
above).

The getinfo operation extracts a main identifier, a description and an
alphabet for the appropriate sequence, if available. It also constructs a
comment that begins with the following:

From SSEARCH output alignment of:
 >musplfm transl. of musplfm.seq, 2 to 676, 224 aa
 >LCBO - Prolactin precursor - Bovine, 229 aa

This gives the name of the program whose output is being parsed, and
the descriptions of the two sequences whose alignment produced the
current sequence. This text is then followed by any information from the
alignment describing the score of that pairwise alignment. The format of
this text depends on the FASTA program executed and the MARKX
value, as it is just copied from the program output.

There is no putseq or annotate operation.


BLAST-output Formats
********************

 NOTE: With one or two exceptions, this implementation can read
 and understand the output from the BLASTN, BLASTP or BLASTX
 programs (and possibly the TBLAST* programs, although that has
 not been tested yet). The exceptions are:

   1. Automatic recognition of the BLAST-output format
    requires that one of the keywords BLASTN, BLASTP or
    BLASTX be the first word in the file (possibly after an
    e-mail header). Many of the BLAST e-mail servers prepend
    a description of their service before the actual BLAST
    output, and so disrupt the recognition by the package. So,
    for output obtained from an e-mail server, the input format
    must be set explicitly.
   2. The implementation was tested on output generated by
    versions 1.2 and 1.4.9. If the output is different in version
    1.3 or 2.0, the implementation may not work (although the
    implementation can correctly handle gaps in the
    alignments, so that change from 1.* to 2.0 is handled).

The read operation first scans the text occurring before the first
alignment in the file. This initial text is ignored, except where it gives
information about the sequences being aligned. The initial texts of
some of the output formats contain lines of the following form.

Query=  gb:A02201|acc:A02201 Phage phi-105 DNA for immF plypeptide - Bacteriophage phi-
        (665 letters)

The text after "Query=" and before the line containing the "(... letters)"
is parsed as a oneline description, and the number inside the "(...
letters)" is taken as the length of the query sequence.
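
The "Query=" parsing can be sketched in Python as follows. This is an
illustration, not the SEQIO code. Accepting commas inside the letter
count is an assumption, and the function name is hypothetical.

```python
import re


def parse_blast_query(lines):
    """Return (description, length) from the Query= lines, or None."""
    parts = []
    started = False
    for line in lines:
        if not started:
            if line.startswith("Query="):
                started = True
                parts.append(line[len("Query="):].strip())
            continue
        m = re.match(r"\s*\(([\d,]+) letters\)", line)
        if m:
            # Everything before the "(... letters)" line is description.
            desc = " ".join(p for p in parts if p)
            return desc, int(m.group(1).replace(",", ""))
        parts.append(line.strip())
    return None
```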

Then, the read operation reads the "entries" of the file, where each
entry is considered to be the text describing an alignment between two
sequences. The BLAST alignment format consists of header lines
specifying the sequence that matches the query, followed by one or
more pairwise alignments of substrings of the matching sequence and
the query. The read operation first scans the header lines, which are of
the form:

>emb|X02799|NCPHI105 Bacillus subtilis phage phi 105 immunity region with
            repressor gene and ORF >emb|A11144|A11144 phage phi 105 repressor
            (ORF1)-Orf 2 genes and there flanking regions
            Length = 1306

where the "Length =" line ends the list of oneline descriptions of the
sequences that match the query (in the next pairwise alignment(s)). It
extracts the oneline description and length of the sequence.
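
A Python sketch of scanning such header lines up to the "Length ="
line. It is an illustration and not the SEQIO code. The function name
is hypothetical, and the comma handling is based on lengths written
like "40,937" in BLAST output.

```python
import re


def parse_blast_hit_header(lines):
    """Return (description, length) for one block of header lines."""
    parts = []
    for line in lines:
        m = re.match(r"\s*Length\s*=\s*([\d,]+)", line)
        if m:
            # Join the wrapped description lines; drop the leading '>'.
            desc = " ".join(parts).lstrip(">")
            return desc, int(m.group(1).replace(",", ""))
        parts.append(line.strip())
    return None
```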

The read operation considers an "entry" to consist only of the actual
score reporting text and pairwise alignment text. So, while the header
lines above are scanned for their information, the entry reported by the
package begins at the line containing either "Plus Strand HSPs:",
"Minus Strand HSPs:" or "Score =". And the entry ends just after the
last line of the pairwise alignment text. This is done to make the entry
text reported by the package more uniform. Thus, the following BLAST
output would be reported as two entries, the first beginning at the
"Plus Strand HSPs:" line and running through the first pairwise
alignment, and the second beginning with the "Score = 89..." line. The
header lines will not be reported in any alignment, and will only be
scanned to extract the oneline description and length information.

>emb|Z68118|CER01E6 Caenorhabditis elegans cosmid R01E6
            Length = 40,937

  Plus Strand HSPs:

 Score = 127 (35.1 bits), Expect = 3.2, Sum P(2) = 0.96
 Identities = 39/56 (69%), Positives = 39/56 (69%), Strand = Plus / Plus

Query:    426 ATTTTAATAAATCTGGATTTAAATGTGTTAAAAATGACGGAAATACAAGTAGTTGA 481
              ||||||||||||||    ||||||  | |||||||||  | || |    || || |
Sbjct:  35266 ATTTTAATAAATCTCATCTTAAATTAGATAAAAATGAATGCAAAATTTATATTTTA 35321

 Score = 89 (24.6 bits), Expect = 3.2, Sum P(2) = 0.96
 Identities = 25/34 (73%), Positives = 25/34 (73%), Strand = Plus / Plus

Query:     93 ACAATACTAAAAAAGACGGAAATACAAGTATTTT 126
              ||||||||||||||    | ||   || ||||||
Sbjct:  31613 ACAATACTAAAAAATCTTGTAAACAAAATATTTT 31646
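
The entry boundaries described above can be sketched in Python. This
is an illustration under simplifying assumptions, not the SEQIO code:
header lines start with '>', any text after the last alignment is
assumed to be absent, and the function name is hypothetical.

```python
def split_blast_entries(lines):
    """Split alignment text into entries (lists of lines)."""
    entries = []
    current = None
    after_strand = False     # a strand line was seen, Score not yet
    for line in lines:
        text = line.strip()
        if line.startswith(">"):
            # Header lines are scanned elsewhere; they join no entry.
            current, after_strand = None, False
            continue
        if text in ("Plus Strand HSPs:", "Minus Strand HSPs:"):
            current = [line]
            entries.append(current)
            after_strand = True
            continue
        if text.startswith("Score ="):
            if after_strand:
                after_strand = False   # belongs to the strand's entry
            else:
                current = [line]
                entries.append(current)
                continue
        if current is not None:
            current.append(line)
    # An entry ends just after its last alignment line.
    for entry in entries:
        while entry and not entry[-1].strip():
            entry.pop()
    return entries
```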

The getseq operation extracts the appropriate sequence from the entry
(the first or second sequence if the `seqfseqno' value is 1 or 2,
respectively). All alphabetic characters are considered part of the
sequence.

The rawseq operation is the same as the getseq operation, except that
all non-whitespace and non-numeric characters are considered part of
the sequence.

The getinfo operation extracts a main identifier, a description and an
alphabet for the appropriate sequence, if available. It also constructs a
comment that begins with the following:

From BLASTN/BLASTP/BLASTX output alignment of:
   >gb:A02201|acc:A02201 Phage phi-105 DNA for immF plypeptide - Bacteriophage phi
and
   >emb|X02799|NCPHI105 Bacillus subtilis phage phi 105 immunity
              region with repressor gene and ORF
   >emb|A11144|A11144 phage phi 105 repressor (ORF1)-Orf 2 genes
              and there flanking regions

This gives the name of the program whose output is being parsed, and
the descriptions of the two sequences whose alignment produced the
current sequence. This text is then followed by any information from the
alignment describing the score of that pairwise alignment.

There is no putseq or annotate operation.


James R. Knight, knight@cs.ucdavis.edu
June 28, 1996