1SEQIO -- A Package for Sequence File I/O
2
3
4FORMAT.DOC - The SEQIO File Formats
5***********************************
6
7
8
9The File Formats
10================
11
12This file describes the specific assumptions the SEQIO package makes
13about the file formats it supports. The basic file formats are (with
14alternative names in parens):
15
16 o Raw
17 o Plain
18 o GenBank (gb)
19 o EMBL
20 o Swiss-Prot (swissprot, sprot)
21 o PIR (CODATA)
22 o NBRF
23 o FASTA (Pearson)
24 o IG/Stanford (IG, Stanford)
25 o ASN.1 (ASN)
26 o GCG
27 o GCG-* (GCG-GenBank, GCG-PIR, GCG-EMBL, ...)
28 o MSF
29 o PHYLIP
30 o PHYLIP-Seq (phylip-s, phylips)
31 o PHYLIP-Int (phylip-i, phylipi)
32 o Clustalw (clustal)
33 o FASTA-output (fasta-out, fastaout, fout)
34 o BLAST-output (blast-out, blastout, bout)
35
36where `FASTA-output' and `BLAST-output' specify the output
37produced by the programs in the FASTA and BLAST packages. The
38`GCG-*' format actually refers to a set of formats which specify the
39GCG forms of the GenBank, EMBL, Swiss-Prot, PIR, NBRF, FASTA and
40IG/Stanford formats. These formats are included to distinguish the GCG
41forms of these formats from the generic GCG format (where the header
42lines of an entry are considered as unstructured comments). Any valid
43name for one of the seven formats, plus their *-old variants given
44below, can replace the `*' in `GCG-*'.
45
46In addition to the basic file formats, there are four file "formats" which
47use faster file reading implementations. They are specifically geared to
48the formats of the GenBank, PIR, EMBL and Swiss-Prot databases, and
they are included to speed up database searches (they run about 30%
faster than the basic implementations, but at the cost of less error
checking and a requirement that the file format exactly match the
database's format):
53
54 o gbfast
55 o pirfast
56 o emblfast
57 o spfast
58
59My advice is that these formats only be used when searching the actual
60databases, and the basic file formats be used the rest of the time. The
61difference in time only becomes significant when reading files in the
62multi-megabyte range.
63
64Finally, there are also format variants which have been added to
65account for FASTA, NBRF and IG/Stanford format limitations commonly
66in use. For FASTA and IG/Stanford, the limitation is that only one
67header line (any line beginning with a '>' or ';') may appear in the entry.
68For NBRF, the limitation is that no lines like "C;Accession:" or
69"C;Comment:" may appear after the sequence. The formats below have
70a different output function which outputs entries in these limited
71formats (at the cost of losing some information about the sequences).
72Thus, the package can output entries that are readable by other
73programs which require the limited format.
74
75 o NBRF-old (NBRFold)
76 o FASTA-old (FASTAold)
77 o Stanford-old (Stanfordold, IG-old, IGold)
78
79These three format variants are included in the `GCG-*' set of formats.
80
81File Format Types
82=================
83
84Each format is considered to be one of the following types, which gives
85a basic description of the capabilities and common uses of the format:
86
87T_SEQONLY
88   The entries of the format contain only a sequence. It does not
89   contain any place to store sequence information or comments.
90   (Plain, Raw)
91T_DATABANK
92   The entries are used mainly to store unadorned sequences (i.e.,
93   not used for sequences containing alignment characters).
94   (GenBank, PIR, EMBL, Swiss-Prot, their GCG-* forms, ASN.1)
95T_GENERAL
96   The entries can contain both unadorned sequences and
97   alignment sequences. In addition, there is a place to store
98   sequence information and comments.
99   (FASTA, NBRF, IG/Stanford, their GCG-* forms, GCG)
100T_LIMITED
   The entries can contain both unadorned sequences and
   alignment sequences, but there is no place to store extra sequence
   information and comments.
104   (FASTA-old, NBRF-old, IG-old, their GCG-* forms)
105T_ALIGNMENT
106   The entries are used mainly to store multiple sequence
107   alignments. They are not considered to contain much sequence
108   information and do not have any place to store comments.
109   (PHYLIP, Clustalw, MSF)
110T_OUTPUT
   The format is the output of an alignment program, and these
112   formats are read-only formats.
113   (FASTA-output, BLAST-output)
114
115These types may be of some use when developing software that wishes
116to perform different operations based on this file type information (the
117"fmtseq" program included in the distribution is one such piece of
118software).
119
120(NOTE: Why is having someplace to store comments so important?
121Well, one of the goals of this package is to try to unify all of the file
122formats and be able to capture and transfer as much information from
123one format to another. The plans are to use these comment sections as
the place to store any extra information for which there is no explicit
125spot in the entry. And that can't happen if the file format doesn't have a
126comment section. This is also the reason for the FASTA, NBRF and
127IG/Stanford variants mentioned above.)
128
129Automatically Determining the Format Type
130=========================================
131
132The SEQIO package has the ability to automatically determine the
133format of a file, if that file is one of the following formats:
134
135   Plain, GenBank, PIR, EMBL/Swiss-Prot, FASTA, NBRF,
136   IG/Stanford, ASN.1, GCG, GCG-*, MSF, PHYLIP, Clustalw,
137   FASTA-output, BLAST-output
138
139The Raw format and all of the format variations (*-old, *fast) must be
140explicitly specified in order to be used. The package makes the format
141determination in two phases. The first phase looks at the initial
142non-whitespace text of the file. The second phase looks at the text of
143the first entry in the file. Both of these phases occur during the
144opening of the file.
145
146First Phase
147+++++++++++
148
149The first phase operation first skips over an e-mail header at the
150beginning of the file, if the file begins with the string "From ". It then
151looks for the first non-whitespace character of the file and attempts to
152match that non-whitespace text to one of the following keywords
153(where the matching is case-insensitive and the `?' character is a
154wildcard which can match any character in the file):
155
156    GenBank - "LOCUS ", "GB???.SEQ          Genetic Sequence Data Bank"
157       NBRF - ">??;"
158      FASTA - ">"
159       EMBL - "ID   ", "CC ", "XX "
160        PIR - "\\\", "ENTRY", "P R O T E I N  S E Q U E N C E  D A T A B A S E"
161IG/Stanford - ";"
162      ASN.1 - "Bioseq-set ::= {", "Seq-set ::= {"
163  FASTA-out - "FASTA", "TFASTA", "SSEARCH", "LFASTA", "LALIGN", "ALIGN"
164     PHYLIP - "0", "1", "2", "3", "4", "5", "6", "7", "8", "9"
165   Clustalw - "CLUSTAL"
166        MSF - "PileUp"
167  BLAST-out - "BLASTN", "BLASTP", "BLASTX"
168
169The keyword matching occurs in the order specified here, and the first
170matching keyword specifies the file format. So, for NBRF and FASTA
171files, if the first entry's header line has a ';' as the third character after
172the initial '>', the file format is taken to be NBRF. Files without that
173semi-colon are taken to be in FASTA format.
174
175If there's a match, then the file format has been determined. Otherwise,
176the file's format is considered to be `Plain' at this point.
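
The first-phase matching can be sketched as follows. This is only an
illustration in Python, not the package's own code: the names KEYWORDS,
matches and guess_format are hypothetical, only part of the keyword
table is reproduced, and the e-mail header skipping is left out.

```python
# Keyword table in the order listed above; '?' matches any one character.
# Only some of the rows are reproduced here.
KEYWORDS = [
    ("GenBank", ["LOCUS "]),
    ("NBRF", [">??;"]),
    ("FASTA", [">"]),
    ("EMBL", ["ID   ", "CC ", "XX "]),
    ("PIR", ["\\\\\\", "ENTRY"]),      # "\\\\\\" is three backslashes
    ("IG/Stanford", [";"]),
    ("Clustalw", ["CLUSTAL"]),
    ("MSF", ["PileUp"]),
]


def matches(text, keyword):
    """Case-insensitive prefix match; '?' in the keyword matches anything."""
    if len(text) < len(keyword):
        return False
    return all(k == "?" or k.lower() == t.lower()
               for k, t in zip(keyword, text))


def guess_format(data):
    """Return the first-phase guess for the file contents in `data`."""
    text = data.lstrip()               # start at the first non-whitespace text
    for fmt, keywords in KEYWORDS:     # table order decides ties
        if any(matches(text, kw) for kw in keywords):
            return fmt
    return "Plain"                     # no keyword matched
```

Because NBRF is tested before FASTA, a header such as ">P1;CCMST" is
reported as NBRF, while any other line starting with '>' falls through
to FASTA.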
177
178Second Phase
179++++++++++++
180
181The second phase distinguishes more subtle variations of the file
182formats by looking in more detail at the text of the entries. The possible
183changes in the determined format are the following:
184
185 o For EMBL files, the "ID " line of each entry is scanned, and if it
186   contains exactly 2 semi-colons, a period and the string "PRT"
   occurring before the second semi-colon, the entry is taken to be
188   a Swiss-Prot entry.
189
190   In addition, if the string occurring before the last semi-colon on
191   the "ID " line is "EPD", then the entry identifier is taken to be an
192   EPD database identifier, but the entry itself is still considered to
193   be an EMBL formatted entry.
194
195 o For all of the basic formats of the GCG-* formats, if the entry's
196   sequence lines are in the GCG format, then the entry is
197   considered to be the corresponding GCG-* format (so, a
198   GenBank format becomes a GCG-GenBank format).
199
200 o For PHYLIP files, each entry is checked to see if it is in the
201   Interleaved or Sequential format. This checking is a complete
202   match of the text to the two formats, so the likelihood of an
203   incorrect determination is remote. See below in the description
204   of the PHYLIP format for more details.
205
206 o For Plain files (as determined by phase 1), the entry text is
207   checked to see if a line ending with the string ".." occurs (or,
208   more precisely, a line whose last non-whitespace characters are
209   ".."). If so, the file is considered either a GCG or MSF file. If the
210   line ending with the ".." contains the string "MSF:", then the
211   entry is considered to be an MSF file. If not, the entry is
212   considered to be a GCG file.
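
The last check in the list above (Plain versus GCG versus MSF) can be
sketched in a few lines of Python. The function name refine_plain is
hypothetical, and treating "MSF:" as case-sensitive is an assumption
that the text does not settle.

```python
def refine_plain(entry_text):
    """Second-phase check for a file that phase 1 classified as Plain."""
    for line in entry_text.splitlines():
        stripped = line.rstrip()           # ignore trailing whitespace
        if stripped.endswith(".."):
            # The ".." line decides between the two GCG-style formats.
            return "MSF" if "MSF:" in stripped else "GCG"
    return "Plain"                         # no ".." line was found
```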
213
214
215
216The SEQIO File Format Implementations
217*************************************
218
219The package has six main (internal) operations that encapsulate the
220details of the file formats. Those operations are:
221
222read
223   Read the input file to find the beginning and end of the next
224   entry in the file. Also, find the beginning of the lines containing
225   the sequence and if the entry explicitly specifies a sequence
226   length, get that value.
227getseq
228   Retrieve the sequence, if it exists, from the entry.
229rawseq
230   Retrieve the raw sequence, if it exists, from the entry. The raw
231   sequence typically contains the sequence characters plus any
232   alignment or notational characters.
233getinfo
234   Get one piece or all of the SEQINFO information from the entry.
235putseq
236   Given a sequence and SEQINFO structure, output a correctly
237   formatted entry.
238annotate
239   Output an entry's text, adding new text to its comment section
240   (creating a comment section, if none exists in the entry).
241
242Each of the supported file formats will be described in terms of what
243those six operations do for that format.
244
245General Comments
246================
247
248 o There are no limits on lengths of anything (lines, entries,
249   sequences, etc.), except for memory limitations and when
250   outputting formats whose official descriptions specify a
251   maximum line length (see below in the format descriptions).
252
253 o When outputting formats that do have a maximum line length,
254   long description/organism/comment lines are broken between
255   word boundaries. That maximum line length is maintained unless
256   there is a single word that is longer than the line length. That
257   word is not broken up, but is output on a line that will be longer
   than the maximum length.
259
260 o Except for gbfast, emblfast, spfast and pirfast, the case of the
261   entry's keywords is irrelevant (they can be in upper or lower
262   case, or any mixture of the two). The "fast" formats require
263   keywords in upper case (as occurs in the databases).
264
265 o When outputting in the Plain, FASTA, NBRF or IG/Stanford
266   formats, the putseq operation looks at the sequence being
267   output, and may add whitespace to the output sequence to make
268   it look prettier. By default, the extra spaces are added when the
269   sequence is DNA, RNA or Protein and when there are no
270   non-alphabetic characters in the sequence (such as alignment
271   characters).
272
273   This prettying operation can be turned off or turned on for all
274   sequences using the function `seqfsetpretty'.
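
As an illustration of the line-breaking rule given in the second item
above (break between words, never inside a word), here is a minimal
Python sketch using the standard textwrap module. The function name
wrap_field is hypothetical, and this is not how the package itself is
implemented.

```python
import textwrap


def wrap_field(text, max_len):
    """Break text at word boundaries; an over-long word keeps its own line."""
    return textwrap.wrap(text, width=max_len,
                         break_long_words=False,   # never split a word
                         break_on_hyphens=False)   # hyphenated words stay whole
```

With a maximum length of 8, the text "a verylongwordhere b" comes out as
three lines, the middle one longer than the maximum.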
275
276
277
278Raw Format
279**********
280
281In the raw format, all of the characters of the file are the characters of
282the sequence (including spaces, newlines, non-printable characters,
283and so on).
284
285The read operation simply reads the whole file. The getseq and rawseq
286operations return that text. The getinfo operation merely stores the
287filename in the description field. The putseq operation just outputs the
288sequence characters. And there is no annotate operation.
289
290
291
292Plain Format
293************
294
295In the plain format, all of the alphabetic characters of the file are taken
296as the characters of the sequence, while spaces, newlines, position
297numbers and other punctuation characters are ignored.
298
299The read operation reads in the whole file. The getseq operation
300extracts all of the alphabetic characters from the text. The rawseq
301operation extracts all of the non-whitespace and non-numeric
302characters from the text. The getinfo operation stores the filename in
303the description field.
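
The difference between getseq and rawseq for this format can be shown
with a short Python sketch. The function names are hypothetical, and
ASCII input is assumed.

```python
def plain_getseq(text):
    """Keep only the alphabetic characters of the file text."""
    return "".join(ch for ch in text if ch.isalpha())


def plain_rawseq(text):
    """Keep everything except whitespace and digits."""
    return "".join(ch for ch in text
                   if not ch.isspace() and not ch.isdigit())
```

For the text "  1 ACGT-ACGT 10", getseq would give "ACGTACGT" and rawseq
would give "ACGT-ACGT", so the alignment character survives only in the
raw sequence.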
304
305The putseq operation outputs the sequence in one of two formats,
306depending on the sequence's alphabet. If the alphabet is DNA, RNA or
307Protein, or the alphabet is Unknown but does not contain newline
308characters, the sequence is output 60 sequence characters per line,
309with interspersed spaces to improve the look of the output. If the
310alphabet is Unknown and it contains newline characters, then it is
311output as is.
312
313
314
315GenBank Flat-File Format
316************************
317
318The read operation first looks for a "LOCUS" line and extracts the
319sequence length from positions 23-29 of that line (if the text there
320consists of digits). Then, it looks for the entry ending "//" line, along
321with the "ORIGIN" line which specifies where the sequence lines begin.
The "ORIGIN" line is not required; however, if it does not exist, the entry
323is assumed to contain no sequence.
324
325The getseq operation scans the sequence lines, from just after the
326"ORIGIN" line to the "//" line. All alphabetic characters there are
327assumed to be part of the sequence. No assumptions are made about
328the format of these lines.
329
330The rawseq operation is the same as the getseq operation, except that
331all non-whitespace and non-numeric characters are considered part of
332the sequence.
333
334The getinfo operation looks first at the "LOCUS" line. It takes the
335identifier from positions 13-22 (and assumes it's a GenBank id, unless
336marked by an identifier prefix), the alphabet determination from
337positions 37-40, whether it's circular from the existence of the keyword
338"circular" at positions 43-52, and the date from positions 63-73. Then,
it looks for the "ACCESSION", "NID", "PID", "DEFINITION",
"COMMENT" and "SOURCE" lines, where `lines' here means one or
341more text lines corresponding to that part of the entry and where the
342lines can appear in any order. Accession numbers, NID numbers and
343PID numbers are extracted from the "ACCESSION", "NID" and "PID"
344lines, respectively. The description is taken from the "DEFINITION" line.
345Comments are retrieved from the "COMMENT" line. The organism name
346is taken from the "ORGANISM" sub-record of the "SOURCE" line. The
347getinfo operation cannot determine the value of the isfragment field
348(since that is not explicitly given anywhere in the entry).
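
The fixed-column reading of the "LOCUS" line can be sketched as below.
This is an illustrative Python fragment, not the package's code: the
name parse_locus and the dictionary keys are hypothetical, and the
column numbers are the 1-based positions quoted in this section.

```python
def parse_locus(line):
    """Slice the fixed columns of a LOCUS line (1-based positions)."""
    def col(start, end):
        # Convert 1-based inclusive positions to a Python slice.
        return line[start - 1:end].strip()

    length = col(23, 29)
    return {
        "id": col(13, 22),
        "length": int(length) if length.isdigit() else None,
        "alphabet": col(37, 40),
        "iscircular": "circular" in col(43, 52).lower(),
        "date": col(63, 73),
    }
```

A line shorter than 73 characters simply yields empty strings for the
missing columns, since Python slices do not fail past the end of a
string.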
349
350The putseq operation outputs an entry with the following lines (in
351order): LOCUS, DEFINITION, ACCESSION, NID, SOURCE/ORGANISM,
352COMMENT, BASE COUNT, ORIGIN, sequence lines, //. The form of
353these lines follows that described in the GenBank Release Notes, with
354the following exceptions:
355
356 o Except for the LOCUS line and the ORIGIN-sequence-// lines, no
357   lines are output if the SEQINFO information for that line does not
358   exist.
 o Only a non-accession identifier 10 characters long or less is
360   output on the LOCUS line. If there are no such identifiers in the
361   idlist, then the keyword "Unknown" is output (or "(below)" to
362   signal that the long identifiers occur in the COMMENT lines).
363 o On the LOCUS line, the "bp" in positions 31-32 may be replaced
364   with "aa" or "ch" if the alphabet is Protein or Unknown. The
365   alphabet string in positions 37-40 could be "PRT" or "UNK" for
366   the same reason. The output classification in positions 53-55 is
367   "UNC" (Unclassified). And finally, the date in positions 63-73 is
368   "01-JAN-0000" if no date is specified in the SEQINFO structure.
369 o The history lines, and any extra references, are output at the end
370   of the COMMENT lines (or a COMMENT line is added which
371   contains those lines). Each of the added lines begins with the
372   keyword "SEQIO".
373
374The annotate operation replaces or appends to the COMMENT line, if it
375exists. If no COMMENT line exists, then a new COMMENT line will be
376inserted (or rather output between the existing lines of the entry) just
377before one of the following lines (whichever comes first in the entry):
378FEATURES, BASE COUNT or ORIGIN. One of those lines must appear
379in the entry.
380
381Example GenBank entry:
382
383LOCUS       A02201        664 bp    DNA             UNC       10-MAR-1993
384DEFINITION  Phage phi-105 DNA for immF plypeptide.
385ACCESSION   A02201
386SOURCE      .
387  ORGANISM  Bacteriophage phi-105
388COMMENT     NCBI gi: 345121
389
390            SEQIO retrieval from GenBank database entry.   07-Feb-1996
391BASE COUNT      237 a    111 c    144 g    172 t
392ORIGIN
393        1 tgatcaccta tctcctttac aacacatagt gcctcactgt gccactgtgt cttgtggcat
394       61 gacacaatta tagtatccga atgtcggaaa tacaatacta aaaaagacgg aaatacaagt
395      121 attttttagt aaattgacgg aaatacaaga taaatactct ctgaatcttt aaaatgcttg
396      181 aatttcgtca aatttcgact tttacaaaat gtcgtgaata ccatacaatt tagacatacc
397      241 ttaacgggag gtgataatca tgctggatgg gaaaaagctt ggggctttaa ttaaggacaa
398      301 aagaaaagaa aagcacttga aacagacaga aatggcgaag gcactgggta tgtccagaac
399      361 ttatctctct gatatcgaaa acggcagata tctgccgagt acaaaaacac tttccagaat
400      421 agcgatttta ataaatctgg atttaaatgt gttaaaaatg acggaaatac aagtagttga
401      481 ggagggtgga tatgatagag ctgccggcac atgtagaaga caggctttat gagattttta
402      541 tgaaactatc agttccaagg ttgcttgaga aagaagccct ggagaaagga gagaagccga
403      601 atgcggaaag aaaaggcgct tgacctcgcg gccttcttcg ctgaatttga acaaatgatg
404      661 atca
405//
406
407
408
409GBFAST variation of GenBank
410***************************
411
The read operation performs the same steps as the GenBank read;
however, it makes some additional assumptions. First, all keywords
414must appear in uppercase. Second, the sequence length must appear
415in positions 23-29 on the "LOCUS" line. Third, an "ORIGIN" line must
416appear in the entry (as must a sequence). Fourth, all of the lines of
417sequence except the last must be in the format as described in the
418Release Notes, and so must be 75 characters long (9 characters for the
419position number, 60 characters of sequence, 6 spaces), plus the
420newline characters. See the above example.
421
422The getseq operation assumes that the sequence lines are in the format
423described in the previous paragraph, and all of the characters in the
424correct positions in that format are assumed to be characters of the
425sequence. So, if the line format is incorrect, you will get garbage as the
426sequence.
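
To make the "garbage" warning concrete, here is a Python sketch of
position-based extraction for one sequence line. The name
gbfast_line_seq is hypothetical, and the exact layout (a 9-character
position number, then six 10-character blocks each preceded by one
space) is inferred from the example entry above.

```python
def gbfast_line_seq(line):
    """Pull sequence characters from fixed columns, with no validation."""
    line = line.rstrip("\n")
    # 0-based block starts: 10, 21, 32, 43, 54, 65 (each 10 characters).
    return "".join(line[start:start + 10] for start in range(10, 76, 11))
```

A correctly formatted line gives back its sequence characters, but a
line such as "1 tgatcaccta", which lacks the leading padding, silently
yields "ta" because the columns no longer line up.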
427
428The rawseq operation here is exactly the same as the getseq operation,
429since the GenBank sequences don't contain other characters.
430
431The getinfo, putseq and annotate functions are the same as in the
432GenBank format.
433
434
435
436PIR/CODATA Format
437*****************
438
439The read operation first looks for an "ENTRY" line. It then looks for the
440entry ending "///" line, but during this scan it also looks for the
441"SUMMARY" line and the "SEQUENCE" line. If the "SUMMARY" line is
442found, the sequence length is extracted by scanning for "#length" on
443the line, and then looking for digits after that keyword. The
444"SEQUENCE" line specifies the beginning of the sequence lines
445(starting on the next line), and no sequence is assumed to appear in
446the entry if the "SEQUENCE" line is missing.
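
The "#length" scan on the "SUMMARY" line can be sketched with a regular
expression. This is a Python illustration only; the function name
pir_summary_length is hypothetical, and case-insensitive matching is
assumed here because the basic formats ignore keyword case.

```python
import re


def pir_summary_length(line):
    """Return the number following "#length" on a SUMMARY line, or None."""
    match = re.search(r"#length\s+(\d+)", line, re.IGNORECASE)
    return int(match.group(1)) if match else None
```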
447
The getseq operation scans the sequence lines from just after the
449"SEQUENCE" line to the "///" line ending the entry. All alphabetic
450characters on those lines are assumed to be in the sequence. No
451format for those lines is assumed.
452
453The rawseq operation is the same as the getseq operation, except that
454all non-whitespace and non-numeric characters are considered part of
455the sequence.
456
457The getinfo operation first looks at the "ENTRY" line. The next word
458(i.e., non-whitespace string) after the "ENTRY" keyword is taken for an
459identifier, and then the rest of the line is searched for a "#type" option.
460If the word after "#type" is "fragment", the isfragment field is set to 1.
461Then, the entry is searched for the "ACCESSIONS", "COMMENT",
462"DATE", "ORGANISM" and "TITLE" lines, which can appear in any
463order. The "ACCESSIONS" line holds accession numbers (and the
464search for the "ACCESSIONS" line will also find lines beginning with
465just "ACCESSION", for backward compatibility). The "COMMENT" lines
466hold comments. The "DATE" line holds the date, and the date taken is
467the last given on the line, with the assumption being that the dates on
468the line are specified from oldest to newest (not absolutely accurate,
469but handling dates better is on my TODO list). The "TITLE" line holds
470the description, an optional organism name and possibly one of the
keywords "(fragment)", "(fragments)" or "(tentative sequence)". The text
472before the string " - " is taken for the description, and the rest of the
473text, except for a trailing keyword, is taken for the organism name. If the
474keywords "(fragment)" or "(fragments)" appear at the end of the string,
475isfragment is set to 1. If "(tentative sequence)" appears, it is considered
476part of the description. The "ORGANISM" line holds an organism name
477which is taken if the "TITLE" line does not specify an organism.
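
The "TITLE" line handling can be sketched as follows. This Python
fragment is an approximation under stated assumptions: the name
parse_pir_title is hypothetical, the argument is the text after the
"TITLE" keyword, and the "(tentative sequence)" case is left out.

```python
def parse_pir_title(title):
    """Split TITLE text into (description, organism, isfragment)."""
    text = title.strip()
    isfragment = 0
    for keyword in ("(fragment)", "(fragments)"):
        if text.lower().endswith(keyword):
            # Drop the trailing keyword and remember that it was there.
            text = text[:-len(keyword)].rstrip()
            isfragment = 1
            break
    # Everything before the first " - " is the description.
    description, _, organism = text.partition(" - ")
    return description.strip(), organism.strip(), isfragment
```

Applied to the title of the example entry below, "cytochrome c,
testis-specific - mouse", this gives the description "cytochrome c,
testis-specific" and the organism "mouse"; the hyphen inside
"testis-specific" is not a separator because it has no surrounding
spaces.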
478
479The putseq operation outputs a PIR entry containing the following lines
480(in order): ENTRY, TITLE, ORGANISM, DATE, ACCESSIONS,
481COMMENT, SUMMARY, SEQUENCE, sequence lines, ///. The format of
482those lines follows the PIR Release Notes, with the following
483exceptions:
484
485 o The TITLE, ORGANISM, DATE, ACCESSIONS and COMMENT
486   lines may not appear, if the SEQINFO structure does not contain
487   the appropriate information.
488 o If no idlist is given, the keyword "UNKNWN" is output on the
489   ENTRY line, instead of the sequence identifier.
490 o The SEQIO package attempts to follow the guidelines for the
491   TITLE line (i.e., description " - " organism, and an optional
492   "(fragment)") as best it can. Depending on the text of the
493   description and organism fields, this may or may not turn out
494   well.
495 o The organism name is output in the "#formal_name" field of the
496   ORGANISM line, even though it may not be the formal name of
497   the organism. (Better handling of the organism names is another
498   thing on my TODO list.)
499 o The SUMMARY line only contains the "#length" field on it.
500 o The history lines, and any extra references, are output at the end
501   of the COMMENT lines (or a COMMENT line is added which
502   contains those lines). Each of the added lines begins with the
503   keyword "SEQIO".
504
505The annotate operation replaces or appends to the COMMENT line, if it
506exists. If no COMMENT line exists, then a new COMMENT line will be
507inserted just before one of the following lines (whichever comes first in
508the entry): GENETIC, CLASSIFICATION, KEYWORDS, FEATURE,
509SUMMARY or SEQUENCE. One of those lines must appear in the entry.
510
511Example PIR entry:
512
513ENTRY            CCMST       #type complete
514TITLE            cytochrome c, testis-specific - mouse
515ORGANISM         #formal_name mouse
516DATE             04-Nov-1994
517ACCESSIONS       B28160; A00012
518COMMENT    Mammalian testis contains two forms of cytochrome c, one identical
519           with the form found in somatic tissues and another that is
520           expressed in a stage-specific manner during spermatogenic
521           differentiation.
522
523           SEQIO retrieval from PIR database entry.   07-Feb-1996
524SUMMARY          #length 105
525SEQUENCE
526                5        10        15        20        25        30
527      1 M G D A E A G K K I F V Q K C A Q C H T V E K G G K H K T G
528     31 P N L W G L F G R K T G Q A P G F S Y T D A N K N K G V I W
529     61 S E E T L M E Y L E N P K K Y I P G T K M I F A G I K K K S
530     91 E R E D L I K Y L K Q A T S S
531///
532
533
534
535PIRFAST Variation of PIR
536************************
537
The read operation performs the same steps as the PIR read; however, it
makes some additional assumptions. First, all keywords must appear in
540uppercase. Second, a "SUMMARY" line must appear in the entry, and
541it must contain a "#length" field (although the field can appear
542anywhere on the line). Third, a "SEQUENCE" line must appear in the
543entry immediately after the "SUMMARY" line (and the entry must
544contain a sequence). Fourth, the format of the sequence lines must be
545as given in the PIR database, and so must be either 67 or 68 characters
546long (7 characters for the position number, 30 characters of sequence,
54730 or 31 spaces or notational characters), plus the newline character.
548See the above example.
549
550The getseq operation assumes that the sequence lines are in the format
551described in the previous paragraph, and all of the characters in the
552correct positions in that format are assumed to be characters of the
553sequence. So, if the line format is incorrect, you will get garbage as the
554sequence.
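
A comparable Python sketch for one PIR sequence line follows. The name
pirfast_line_seq is hypothetical, and the layout (a 7-character
position number, then each of up to 30 residues preceded by one space
or notational character) is inferred from the example entry above.

```python
def pirfast_line_seq(line):
    """Take every second character after the 7-character position number."""
    line = line.rstrip("\n")
    # Residues sit at 0-based offsets 8, 10, ..., 66.
    return line[8:67:2].rstrip()
```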
555
556The rawseq operation here does not use the "fast" implementation, but
557uses the rawseq operation of the basic PIR format.
558
559The getinfo, putseq and annotate functions are the same as in the PIR
560format.
561
562
563
564EMBL/Swiss-Prot File Formats
565****************************
566
567 NOTE: The EMBL and Swiss-Prot file format implementations are
568 essentially the same, differing only in their putseq and annotate
569 operations. So, we'll describe them together.
570
571 NOTE2: The EMBL read, getseq and getinfo implementations have
572 been tested on, and are compatible with, the "EMBL" entries in
573 the EMBL, EPD, aids-db, ENZYME, PROSITE and Swiss-Prot
574 databases. Because of the variations of the entries in these
575 databases, some of the assumptions made in the implementations
576 will differ from the official EMBL or Swiss-Prot file format
577 descriptions.
578
579The read operation first looks for an "ID " line. It then looks for the
580entry ending "//" line, but during this scan it also looks for an "SQ "
581line and a line beginning with two spaces. If the "SQ " line is found and
582the next word after "SQ Sequence" consists of digits, it is taken for the
583sequence length. The first line beginning with two spaces is assumed
584to be the beginning of the sequence lines, and if no such lines appear,
585the entry is assumed to contain no sequence.
586
The getseq operation scans the sequence lines from the first line
beginning with two spaces to the "//" line ending the entry. All
589alphabetic characters on those lines are assumed to be in the
590sequence. No format for those lines is assumed.
591
592The rawseq operation is the same as the getseq operation, except that
593all non-whitespace and non-numeric characters are considered part of
594the sequence.
595
596The getinfo operation first looks at the "ID " line. The next word (i.e.,
597non-whitespace string) after the "ID" keyword is taken for an identifier,
598and an attempt is made to determine if it is an EMBL id, an EPD id, a
599Swiss-Prot id, or something else. It does this by counting the number
600of semi-colons on the line and checking whether the line ends with a
601period. If three semi-colons and a period are found, then the string just
before the third semi-colon is checked, and the identifier is assumed to
603be an EPD id if that string is "EPD" and is assumed to be an EMBL id
604otherwise. If two semi-colons and a period are found, and the string
605just before the second semi-colon is "PRT", the identifier is assumed
606to be a Swiss-Prot id. Otherwise, the identifier is some other id. After
607figuring out the type of identifier and extracting it from the line, the rest
608of the line is searched for words that specify the alphabet ("DNA",
609"RNA", "PRT", and so on) and whether the sequence is circular
610("circular").
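
The identifier classification can be sketched in Python as below. The
names classify_id_line and last_word are hypothetical, and the sketch
checks the word just before the third semi-colon (for EPD) or the
second semi-colon (for Swiss-Prot), which is one reading of the rules
in the paragraph above.

```python
def last_word(text):
    """Return the last whitespace-separated word of text, or ""."""
    words = text.split()
    return words[-1] if words else ""


def classify_id_line(line):
    """Return (identifier, id_type) for an "ID" line."""
    fields = line.split()
    identifier = fields[1] if len(fields) > 1 else ""
    rest = line.rstrip()
    parts = rest.split(";")            # text between the semi-colons
    semis = len(parts) - 1
    if rest.endswith(".") and semis == 3:
        is_epd = last_word(parts[2]).upper() == "EPD"
        return identifier, "EPD" if is_epd else "EMBL"
    if rest.endswith(".") and semis == 2:
        if last_word(parts[1]).upper() == "PRT":
            return identifier, "Swiss-Prot"
    return identifier, "other"
```

For instance, a line of the form "ID   CYC_MOUSE      STANDARD;      PRT;
104 AA." has two semi-colons, a final period and "PRT" before the second
semi-colon, so its identifier is classified as a Swiss-Prot id.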
611
612Then the rest of the entry is searched for the "AC ", "NI ", "PI ", "DT ",
613"DE ", "OS ", "CC " and "XX " lines, which can appear in any order. The
614"AC ", "NI " and "PI " lines contain accession, NID and PID numbers.
615The "DT " lines contain dates, of which the date on the last "DT " line is
616taken, under the assumption that the dates are given from oldest
to newest. The "DE " lines contain the description, and may end with
one of the keywords "(fragment)" or "(fragments)", in which
case isfragment is set to 1. The "OS " lines specify the organism name.
The "CC " and "XX " lines specify the comment lines, about which there
are a couple of things to note. First, an "XX " line is different from any line
beginning with "XX", in that three spaces must appear after the "XX"
623and non-whitespace text must appear after that, in order for it to be
624considered a comment line. These lines do not occur in the official
625EMBL or Swiss-Prot formats, but do appear in some of the variations.
626Second, more than one comment section can appear in an entry. When
627a "CC " line is reached, the comment section beginning at that line is
628assumed to consist of all "CC " and "XX" lines (note the lack of spaces
after the "XX") following that line, up to the first line not beginning with
630"CC" or "XX" (and ignoring a trailing "XX" line). When an "XX " line is
631seen, all following "XX " lines are considered part of that comment
section. The text of these sections is concatenated together to make
633up the comment lines.
634
635For the EMBL format, the putseq operation outputs an EMBL entry
636containing the following lines (in order): ID, AC, NI, DT, DE, OS, CC,
637SQ, sequence lines, //. In the output, XX lines are added between each
638of the lines (except the sequence lines) as specified in the EMBL format.
639The format of the lines follows the EMBL Release Notes, with the
640following exceptions:
641
642 o The AC, NI, DT, DE, OS, and CC lines may not appear if the
643   SEQINFO structure does not contain the appropriate
644   information.
645 o On the ID line, if no idlist is given, the keyword "Unknown" is
646   output instead of an identifier. The keyword "converted" is
647   output instead of "standard" or "preliminary". The keyword
648   "UNC" is output instead of the classification code. The keyword
649   "UNK" might be output for the alphabet, if the alphabet is
650   Unknown. And, the keyword "AA" or "CH" could appear after the
651   sequence length, if the alphabet is Protein or Unknown.
652 o There will be at most one DT line, and it will only contain the
653   specified date.
654 o Instead of outputting "XX" lines to specify a `blank' line in a
655   comment, a line containing "CC " followed immediately by a
656   newline is output (so, in my design of the comment sections, the
657   comments are specified by the "CC " lines).
658 o The history lines, and any extra references, are output at the end
659   of the output comment section. Each of the added lines begins
660   with the keyword "SEQIO".
661
662For the Swiss-Prot format, the putseq operation outputs a Swiss-Prot
663entry containing the following lines (in order): ID, AC, DT, DE, OS, CC,
664SQ, sequence lines, //. The format of the lines follows the Swiss-Prot
665Release Notes, with the following exceptions:
666
667 o The AC, DT, DE, OS, and CC lines may not appear if the
668   SEQINFO structure does not contain the appropriate
669   information.
670 o On the ID line, if no idlist is given, the keyword "Unknown" is
671   output instead of an identifier. The keyword "converted" is
672   output instead of "standard" or "preliminary".
673 o The alphabet keyword could be "RNA", "DNA" or "UNK" if the
674   alphabet is not Protein. And, the keyword "circular" could
675   appear before the alphabet (if iscircular is 1). The keyword "BP"
676   or "CH" could appear after the sequence length, if the alphabet
677   is DNA, RNA or Unknown.
678 o There will be at most one DT line, and it will only contain the
679   specified date.
680 o The history lines, and any extra references, are output at the end
681   of the output comment section. Each of the added lines begins
682   with the keyword "SEQIO".
683
684For the EMBL format, the annotate operation replaces or appends to
the "CC " or "XX " lines, if they exist. The operation looks for the first
686comment section, and will insert or replace at that point. If no comment
687section exists, then a new comment section using "CC " lines will be
688inserted (or rather output between the existing lines of the entry) as
689follows. If a "DR ", "PR ", "FH " or "FT " line appears in the entry, the
690comment is inserted just before the first of those lines. Otherwise, the
691comment is inserted just before the "SQ ", or " " (i.e., sequence) lines.
692One of these lines must appear in the entry.
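
The placement rule above can be sketched in Python as follows. This is
only an illustration, not SEQIO's implementation: the function name
embl_comment_insert_index is hypothetical, the entry is assumed to be
held as a list of lines, and a sequence line is assumed to be any line
that begins with a space.

```python
def embl_comment_insert_index(entry_lines):
    """Return the index where a new "CC " comment section would go.

    Hypothetical helper for entries that have no comment section yet.
    """
    # First choice: just before the first "DR ", "PR ", "FH " or "FT " line.
    for index, line in enumerate(entry_lines):
        if line.startswith(("DR ", "PR ", "FH ", "FT ")):
            return index
    # Otherwise: just before the "SQ " line or the first sequence line
    # (assumed here to be any line beginning with a space).
    for index, line in enumerate(entry_lines):
        if line.startswith(("SQ ", " ")):
            return index
    raise ValueError("entry has no DR, PR, FH, FT, SQ or sequence line")


entry = [
    "ID   X  converted; DNA; UNC; 4 BP.",
    "XX",
    "DE   example",
    "XX",
    "FH   Key",
    "FT   source",
    "SQ   Sequence 4 BP;",
    "     acgt",
    "//",
]
insert_at = embl_comment_insert_index(entry)
```

For the Swiss-Prot format the same sketch would apply with the first
tuple changed to "DR ", "KW " and "FT ".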
693
694For the Swiss-Prot format, the annotate operation replaces or appends
695to the "CC " lines, if they exist. If no comment section exists, then a new
696comment section will be inserted (or rather output between the existing
697lines of the entry) as follows. If a "DR ", "KW " or "FT " line appears in
698the entry, the comment is inserted just before the first of those lines.
699Otherwise, the comment is inserted just before the "SQ " or sequence
700lines. One of these lines must appear in the entry.
701
702Example EMBL entry:
703
704ID   CM23SRIBR  converted; DNA; UNC; 805 BP.
705XX
706AC   X80636;
707XX
708DT   22-MAR-1995
709XX
710DE   C.mucosalis gene for 23S ribosomal RNA (fragment)
711XX
712OS   Campylobacter mucosalis
713XX
714CC   SEQIO retrieval from EMBL-format entry.   07-Feb-1996
715XX
716SQ   Sequence 805 BP; 226 A; 158 C; 224 G; 194 T; 3 other;
717     gattctgcgc ggaaaatata acggggctaa aatgagtacc gaagctttag acttagtttt        60
718     actaagtggt aggagcgttc tattcagcgt tgaaggtgta ccggtaagga gcgctggagc       120
719     ggatagaagt gagcatgcag gcatgagtag cgataattgg ggtgagaatc cccaacgccg       180
720     taarcccaag gtttcctacg cgatgctcgt catcgtaggg ttagccgggt cctaagcaaa       240
721     gtccgaaagg ggtatgcgat ggaaaattgg ttaatattcc aatgccaaca ttattgtgcg       300
722     atggaaggac gcttagagtt aaaggagcca gctgatggaa gtgctggtcg aaaggtgtag       360
723     gttgagttac aggcaaatcc gtaactcttt atccgagacc ccacaggcgt ttgaagttct       420
724     tcggaatgga tgacgaatcc ttgatactgt cgagccaaga aaagtttcta agtttagata       480
725     atgttgcccg taccgtaaac cgacacaggt gggtgggatg agtattctaa ggcgcgtgga       540
726     agaactctct tcaaggaact ctgcaaaata gcaccgtatc ttcggtataa ggtgtgccta       600
727     actttgtgaa ggatttactc cgtaagcatt gaaggttaca acaaagagtc cctcccgact       660
728     gtttaccaaa aacacagcac tctgctaact cgtaagagga tgtatagggt gtgacgcctg       720
729     cccggtgctc gaaggttaat tgatggggty agcagyaatg cgaagctctt gatcgaagcc       780
730     cgagtaaacg gccgccgtaa ctata                                             805
731//
732
733Example Swiss-Prot entry:
734
735ID   104K_THEPA  CONVERTED;      PRT;   924 AA.
736AC   P15711;
737DT   01-AUG-1992
738DE   104 KD MICRONEME-RHOPTRY ANTIGEN.
739OS   THEILERIA PARVA.
740CC   -!- DEVELOPMENTAL STAGE: SPOROZOITE ANTIGEN.
741CC   -!- SUBCELLULAR LOCATION: IN MICRONEME/RHOPTRY COMPLEXES.
742CC
743CC   SEQIO retrieval from Swiss-Prot database entry.   07-Feb-1996
744SQ   SEQUENCE   924 AA;
745     MKFLILLFNI LCLFPVLAAD NHGVGPQGAS GVDPITFDIN SNQTGPAFLT AVEMAGVKYL
746     QVQHGSNVNI HRLVEGNVVI WENASTPLYT GAIVTNNDGP YMAYVEVLGD PNLQFFIKSG
747     DAWVTLSEHE YLAKLQEIRQ AVHIESVFSL NMAFQLENNK YEVETHAKNG ANMVTFIPRN
748     GHICKMVYHK NVRIYKATGN DTVTSVVGFF RGLRLLLINV FSIDDNGMMS NRYFQHVDDK
749     YVPISQKNYE TGIVKLKDYK HAYHPVDLDI KDIDYTMFHL ADATYHEPCF KIIPNTGFCI
750     TKLFDGDQVL YESFNPLIHC INEVHIYDRN NGSIICLHLN YSPPSYKAYL VLKDTGWEAT
751     THPLLEEKIE ELQDQRACEL DVNFISDKDL YVAALTNADL NYTMVTPRPH RDVIRVSDGS
752     EVLWYYEGLD NFLVCAWIYV SDGVASLVHL RIKDRIPANN DIYVLKGDLY WTRITKIQFT
753     QEIKRLVKKS KKKLAPITEE DSDKHDEPPE GPGASGLPPK APGDKEGSEG HKGPSKGSDS
754     SKEGKKPGSG KKPGPAREHK PSKIPTLSKK PSGPKDPKHP RDPKEPRKSK SPRTASPTRR
755     PSPKLPQLSK LPKSTSPRSP PPPTRPSSPE RPEGTKIIKT SKPPSPKPPF DPSFKEKFYD
756     DYSKAASRSK ETKTTVVLDE SFESILKETL PETPGTPFTT PRPVPPKRPR TPESPFEPPK
757     DPDSPSTSPS EFFTPPESKR TRFHETPADT PLPDVTAELF KEPDVTAETK SPDEAMKRPR
758     SPSEYEDTSP GDYPSLPMKR HRLERLRLTT TEMETDPGRM AKDASGKPVK LKRSKSFDDL
759     TTVELAPEPK ASRIVVDDEG TEADDEETHP PEERQKTEVR RRRPPKKPSK SPRPSKPKKP
760     KKPDSAYIPS ILAILVVSLI VGIL
761//
762
763
764
765EMBLFAST/SPFAST Variation of EMBL/Swiss-Prot
766********************************************
767
The read operation performs the same steps as the EMBL/Swiss-Prot
read; however, it makes some additional assumptions. First, all
770keywords must appear in uppercase, with one exception noted next.
771Second, an "SQ Sequence" line must appear in the entry, although the
772keyword "Sequence" can appear in uppercase, as in "SQ SEQUENCE".
773Third, the sequence length must be the next word after "SQ Sequence".
774Fourth, the format of the sequence lines must occur as in the EMBL or
775Swiss-Prot databases. The EMBL sequence lines are 80 characters
776long (5 spaces, 60 sequence characters with 5 interspersed spaces,
777and 10 characters with a right justified position number), plus the
778newline character. The Swiss-Prot sequence lines are 70 characters
779long (same as EMBL except no position numbers), plus the newline.
780
781The getseq operation assumes that the sequence lines are in the format
782described in the previous paragraph, and all of the characters in the
783correct positions in that format are assumed to be characters of the
784sequence. So, if the line format is incorrect, you will get garbage as the
785sequence.
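
To make the fixed-column assumption concrete, here is a minimal Python
sketch of reading one sequence line by position only. It is not the
SEQIO code: the function name is hypothetical, columns are counted from
0, and dropping blanks from a short final line is an assumption of the
sketch.

```python
def emblfast_line_sequence(line):
    """Take the characters at the fixed sequence columns of one line."""
    # Six blocks of 10 characters start at columns 5, 16, 27, 38, 49 and 60
    # (0-based): 5 leading spaces, then blocks separated by single spaces.
    blocks = [line[start:start + 10] for start in range(5, 70, 11)]
    # Assumption: blanks in a short final line are padding, so drop them.
    return "".join(blocks).replace(" ", "")


blocks = ["gattctgcgc", "ggaaaatata", "acggggctaa",
          "aatgagtacc", "gaagctttag", "acttagtttt"]
# An 80-character EMBL line: 5 spaces, 6 blocks, right-justified number.
good_line = " " * 5 + " ".join(blocks) + "60".rjust(10)
# The same line with one leading space missing (an incorrect format).
shifted_line = good_line[1:]

good = emblfast_line_sequence(good_line)
garbage = emblfast_line_sequence(shifted_line)
```

With the correctly formatted line the 60 sequence characters come back
intact. With the shifted line the first character of every block is
silently lost, which is the kind of garbage the paragraph above warns
about.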
786
787The rawseq operation here is exactly the same as the getseq operation,
788since the EMBL and Swiss-Prot sequences don't contain other
789characters.
790
791The getinfo, putseq and annotate functions are the same as in the
792EMBL/Swiss-Prot format.
793
794
795
796FASTA/FASTA-old File Formats
797****************************
798
799 NOTE: The implementation of the FASTA format here follows the
800 format described in the FASTA program documentation, with the
801 exception that, at the beginning of the entry, multiple lines
802 beginning with either '>' or ';' can appear. This was done in order
803 to better distinguish the entry's header lines from the sequence
804 lines (where comments beginning with ';' are permitted). This
805 exception only occurs when reading FASTA entries. The FASTA
806 output functions only use ';' for those additional header lines.
807
808The read operation looks for a line beginning with '>'. That line is taken
809as the header/description line for the entry. If that line has been
formatted using the standard one-line description format (see file
"user.doc"), then the sequence length is extracted from that line. The
812operation then looks for the next line which does not begin with a '>'
813and which does not begin with a ';'. If such a line occurs before the
814next line with a '>', that line is the first line of the sequence. Finally, the
815operation looks for the entry's end at either the next line which does
816begin with a '>' or the end of the file.
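
The entry boundaries described above can be sketched in Python as
follows. This is an illustration under stated assumptions, not the
SEQIO read routine: the function name is hypothetical, the file is
assumed to be available as a list of lines without newlines, and blank
lines are assumed not to count as sequence lines.

```python
def split_fasta_entries(lines):
    """Group lines into entries using the '>' and ';' rules above."""
    entries = []
    current = []          # lines of the entry being collected
    in_sequence = False   # True once a sequence line has been seen
    for line in lines:
        # A '>' line starts a new entry only at the start of the file or
        # after the current entry's sequence has begun; otherwise it is
        # an additional header line of the current entry.
        starts_entry = line.startswith(">") and (in_sequence or not current)
        if starts_entry:
            if current:
                entries.append(current)
            current = [line]
            in_sequence = False
        elif current:
            current.append(line)
            if line.strip() and not line.startswith((">", ";")):
                in_sequence = True
    if current:
        entries.append(current)
    return entries


lines = [">a one", ">extra header", ";comment", "acgt",
         ";inline comment", "ttaa", ">b two", "gg"]
entries = split_fasta_entries(lines)
```

Here the second '>' line stays in the first entry because no sequence
line has been seen yet, while ">b two" opens a new entry.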
817
The getseq operation scans the sequence lines (all of the lines not
819beginning with '>'). All alphabetic characters on those lines are
820assumed to be in the sequence, except that when a semi-colon
821appears on a line, the rest of that line is considered a comment and not
822part of the sequence. No format for those lines is assumed.
823
824The rawseq operation is the same as the getseq operation, except that
825all non-whitespace and non-numeric characters are considered part of
826the sequence.
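
As a rough illustration of the difference between getseq and rawseq,
the following Python sketch applies both rules to the lines of one
entry. The function name and the raw flag are hypothetical and are not
part of the SEQIO interface.

```python
def fasta_sequence(entry_lines, raw=False):
    """Collect sequence characters from the lines of one FASTA entry."""
    kept = []
    for line in entry_lines:
        if line.startswith(">"):
            continue                      # header line, never sequence
        text = line.split(";", 1)[0]      # a ';' starts a comment
        for ch in text:
            if raw:
                # rawseq rule: keep everything except whitespace and digits
                keep = not ch.isspace() and not ch.isdigit()
            else:
                # getseq rule: keep alphabetic characters only
                keep = ch.isalpha()
            if keep:
                kept.append(ch)
    return "".join(kept)


entry = [
    ">gb:A14666 PRLB promoter, 12 bp.",
    ";NCBI gi: 579066",
    "  gatc-agct 10 ; a trailing comment",
    "ac*gt",
]
plain = fasta_sequence(entry)            # 'gatcagctacgt'
raw = fasta_sequence(entry, raw=True)    # 'gatc-agctac*gt'
```

The only difference in the two results is that the raw form keeps the
'-' and '*' characters.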
827
828The getinfo operation first looks at the first header line of the entry, and
parses it according to the one-line description format specified in file
"user.doc". It then considers any following lines that begin either with a
831'>' or a ';' as comment lines. Any other comments in the entry are
832ignored.
833
834In the FASTA format, the putseq operation outputs a first header line
835according to the one-line description format. The comment/history
836lines and the sequence identifiers are output as additional header lines
837that begin with a ';'. Finally, the sequence is output.
838
839In the FASTA-old format, the putseq operation only outputs the first
840header line and the sequence lines. No comment/history lines are
841output, and the identifiers appear in the header line.
842
843In the FASTA format, the annotate operation either replaces, appends
844or inserts the comment lines just after the first header line. There is no
845annotate operation in the FASTA-old format.
846
847Example FASTA entry:
848
849>gb:A14666|acc:A14666 PRLB promoter - Bacteriophage lambda, 281 bp.
850;
851;NCBI gi: 579066
852;
853;SEQIO retrieval from GenBank database entry.   07-Feb-1996
854  gatcagctgc gacacaacta gtttacttac tcgcttatta aaccagaccc acaatctttt
855  acacagatac aatattttta gtggaaactt cttgacattt cggcccatga cctttactct
856  gttataaatt acttttatgg gggacgatca cactagcaaa ggagttacct aagccccgaa
857  tgttcaatgg gaagacttcc ccaatcatga cccacattac gggaccccaa gttgcggaga
858  agaaggcgat gtaaactgtc aaagcaatca cagagatgat c
859
860Example FASTA-old entry:
861
862>gb:A14666|acc:A14666 PRLB promoter - Bacteriophage lambda, 281 bp.
863  gatcagctgc gacacaacta gtttacttac tcgcttatta aaccagaccc acaatctttt
864  acacagatac aatattttta gtggaaactt cttgacattt cggcccatga cctttactct
865  gttataaatt acttttatgg gggacgatca cactagcaaa ggagttacct aagccccgaa
866  tgttcaatgg gaagacttcc ccaatcatga cccacattac gggaccccaa gttgcggaga
867  agaaggcgat gtaaactgtc aaagcaatca cagagatgat c
868
869
870
871NBRF/NBRF-old File Formats
872**************************
873
874 NOTE: The implementation of the NBRF format follows the format
875 descriptions given in the release notes of the VMS version of the
876 PIR database, with the following exceptions:
877
878   1. An identifier list (with identifiers separated by '|') can
879    appear after the ';' on the first line of the entry, and there is
880    no limitation to the length of that identifier list.
881   2. The second line of the entry is treated as a full one-line
882    description (so it can contain more than just the
883    description and organism name).
884   3. The NBRF header lines (which occur after the sequence)
885    are assumed to begin at the first line whose second
886    character is a ';', and run until the end of the entry. So, the
887    sequence lines cannot contain such a line (or the
888    sequence will only be partially read).
889   4. Every "C;Comment: " line in the header lines is assumed to
890    contain a space between the "C;Comment:" and the
891    comment text. This space (or whatever character appears
892    there) is not considered part of the comment text.
893
894The read operation first looks for a line beginning with '>', which
895contains a two-character code and database identifiers for the
896sequence. The next line, which should not begin with a '>', contains a
897one-line description of the sequence, and the operation attempts to
898extract the sequence length from that line. After that, the operation
899scans the sequence lines looking for the beginning of the header lines
900or the end of the entry. The header lines begin with the first line whose
901second character is ';', and they are not required to appear in an entry.
902The end of the entry is either the first line which begins with a '>', or the
903end of the file.
904
The getseq operation scans the sequence lines from just after the
906description line to either the first occurrence of a '*', the beginning of
907the header lines or the end of the entry. All alphabetic characters on
908those lines are assumed to be in the sequence. No format for those
909lines is assumed.
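
The three stopping conditions can be sketched in Python as follows.
This is only an illustration: the function name is hypothetical, and
the entry is assumed to be a list of lines whose first two elements
are the identification line and the description line.

```python
def nbrf_getseq(entry_lines):
    """Collect the sequence of one NBRF entry."""
    kept = []
    for line in entry_lines[2:]:          # skip the two leading lines
        # The header lines begin at the first line whose second
        # character is ';'.
        if len(line) > 1 and line[1] == ";":
            break
        body, star, _ = line.partition("*")
        kept.extend(ch for ch in body if ch.isalpha())
        if star:                          # the first '*' ends the sequence
            break
    return "".join(kept)


entry = [
    ">DL;gb:A14666",
    "PRLB promoter - Bacteriophage lambda, 9 bp.",
    "  gatca gct",
    "  g*",
    "C;Date: 18-AUG-1994",
    "C;Comment: NCBI gi: 579066",
]
sequence = nbrf_getseq(entry)             # 'gatcagctg'
```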
910
911The rawseq operation is the same as the getseq operation, except that
912all non-whitespace and non-numeric characters are considered part of
913the sequence.
914
915The getinfo operation first looks at the initial identification line. The
format of that line is ">??;..." where "??" is a two-character description
and "..." is a list of identifiers. Six forms of the two-character
description are recognized:
919
920 o "P1" - Protein complete
921 o "F1" - Protein fragment
922 o "DL" - linear DNA
923 o "DC" - circular DNA
924 o "RL" - linear RNA
925 o "RC" - circular RNA
926
927and the appropriate alphabet, isfragment and iscircular values are
set.  The list of identifiers is added to mainid, mainacc and
929idlist. If no identifier prefix is specified for an identifier (either
930by the identifier itself or by the "IdPrefix" information field of the
931database's BIOSEQ entry, if a database search is being performed),
932then "oth" for Other is used.  The next line in the entry is parsed
933according to the one-line description format. Then, if the header
934lines were found in the entry during the read operation, they are
935scanned, looking for lines beginning with "C;Accession:", "C;Comment:"
936and "C;Date:" which give the accession numbers, comments and date,
937respectively.
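
A minimal Python sketch of how the identification line might be
decoded is given below. It is not the SEQIO code. The function and
table names are hypothetical, the result is returned as a plain
dictionary, and a ':' is assumed to separate an identifier from its
prefix (as in "gb:A14666" in the examples).

```python
# Two-character code -> (alphabet, isfragment, iscircular)
NBRF_CODES = {
    "P1": ("Protein", 0, 0),
    "F1": ("Protein", 1, 0),
    "DL": ("DNA", 0, 0),
    "DC": ("DNA", 0, 1),
    "RL": ("RNA", 0, 0),
    "RC": ("RNA", 0, 1),
}


def parse_nbrf_id_line(line):
    """Decode a ">??;id1|id2|..." identification line."""
    if len(line) < 4 or line[0] != ">" or line[3] != ";":
        raise ValueError("not an NBRF identification line")
    # Treating an unrecognized code as Unknown is a choice of this sketch.
    alphabet, isfragment, iscircular = NBRF_CODES.get(
        line[1:3], ("Unknown", 0, 0))
    idlist = []
    for ident in line[4:].strip().split("|"):
        if not ident:
            continue
        if ":" not in ident:
            ident = "oth:" + ident        # no prefix given: use "oth"
        idlist.append(ident)
    return {"alphabet": alphabet, "isfragment": isfragment,
            "iscircular": iscircular, "idlist": idlist}


info = parse_nbrf_id_line(">DL;gb:A14666")
```

The sketch ignores the "IdPrefix" information field mentioned above,
which would take precedence over "oth" during a database search.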
938
In the NBRF format, the putseq operation outputs an initial identification
line of the appropriate form, containing one of the two-character
descriptions above (or "XX" if the alphabet is Unknown) and containing
the list of identifiers in idlist. It then outputs a one-line description
according to the one-line description format. The sequence is output
and terminated with a '*'. Finally, the date, accession numbers and
comments/history are output in lines beginning with "C;Date:",
"C;Accession:" and "C;Comment:", respectively.
947
948In the NBRF-old format, the putseq operation only outputs the initial
949identification line, the description line and the sequence lines. In
950addition, only one identifier is placed on the initial identification line,
951and if that identifier was not an accession number, the main accession
952number is added to the beginning of the description line.
953
For the NBRF format, the annotate operation replaces or appends to the
"C;Comment: " lines, if they exist. If no comment lines exist, then a
new comment section will be inserted (or rather output between the
existing lines of the entry) as follows. If a "C;Genetics:", "C;Complex:",
"C;Function:", "C;Superfamily:", "C;Keywords:" or "F;" line appears in
959the entry, the comment is inserted just before the first of those lines.
960Otherwise, the comment is inserted at the end of the entry.
961
962There is no annotate operation in the NBRF-old format.
963
964Example NBRF entry:
965
966>DL;gb:A14666
967PRLB promoter - Bacteriophage lambda, 281 bp.
968  gatcagctgc gacacaacta gtttacttac tcgcttatta aaccagaccc acaatctttt
969  acacagatac aatattttta gtggaaactt cttgacattt cggcccatga cctttactct
970  gttataaatt acttttatgg gggacgatca cactagcaaa ggagttacct aagccccgaa
971  tgttcaatgg gaagacttcc ccaatcatga cccacattac gggaccccaa gttgcggaga
972  agaaggcgat gtaaactgtc aaagcaatca cagagatgat c*
973C;Date: 18-AUG-1994
974C;Accession: A14666
975C;Comment: NCBI gi: 579066
976C;Comment:
977C;Comment: SEQIO retrieval from GenBank database entry.   23-Mar-1996
978
979Example NBRF-old entry:
980
981>DL;gb:A14666
982~A14666 PRLB promoter - Bacteriophage lambda, 281 bp.
983  gatcagctgc gacacaacta gtttacttac tcgcttatta aaccagaccc acaatctttt
984  acacagatac aatattttta gtggaaactt cttgacattt cggcccatga cctttactct
985  gttataaatt acttttatgg gggacgatca cactagcaaa ggagttacct aagccccgaa
986  tgttcaatgg gaagacttcc ccaatcatga cccacattac gggaccccaa gttgcggaga
987  agaaggcgat gtaaactgtc aaagcaatca cagagatgat c*
988
989
990
991IG/Stanford, IG-old/Stanford-old File Formats
992*********************************************
993
994The read operation first looks for a line beginning with ';'. The
995operation then looks for the next line which does not begin with a ';'.
996All of the lines beginning with ';' make up the comment lines, and the
997first line not beginning with ';' contains the sequence's description. If
998the description line has been formatted using the standard one-line
999description format (see file "user.doc"), then the sequence length is
1000extracted from that line. Finally, the operation looks for the entry's end
1001at either the next line which does begin with a ';' or the end of the file.
1002
1003The getseq operation scans the sequence lines from just after the
1004description line until either the end of the entry is reached, or a '1' or a
1005'2' appears. All alphabetic characters on those lines are assumed to be
1006in the sequence. No format for those lines is assumed.
1007
1008The rawseq operation is the same as the getseq operation, except that
1009all non-whitespace and non-numeric characters are considered part of
1010the sequence.
1011
1012The getinfo operation first gets the comment lines at the beginning of
1013the entry, and then parses the description line according to the
1014one-line description format. Finally, it looks for a '1' or '2' at the end of
1015the sequence, and sets iscircular to 0 or 1, respectively.
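
The getseq scan and the '1'/'2' terminator can be sketched together in
Python. This is an illustration only: the function name is
hypothetical, and returning None when no terminator is found is a
choice made for the sketch rather than documented SEQIO behavior.

```python
def ig_sequence_and_topology(entry_lines):
    """Return (sequence, iscircular) for one IG/Stanford entry."""
    # Skip the leading ';' comment lines; the next line is the description.
    start = 0
    while start < len(entry_lines) and entry_lines[start].startswith(";"):
        start += 1
    kept = []
    for line in entry_lines[start + 1:]:
        for ch in line:
            if ch in "12":
                # '1' marks a linear sequence, '2' a circular one.
                return "".join(kept), (0 if ch == "1" else 1)
            if ch.isalpha():
                kept.append(ch)
    return "".join(kept), None


entry = [
    ";NCBI gi: 579066",
    ";",
    "gb:A14666 PRLB promoter, 10 bp.",
    "  gatca gctg",
    "  c2",
]
sequence, iscircular = ig_sequence_and_topology(entry)
```

Note that the digits on the description line are never examined,
since the scan starts on the line after it.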
1016
In the IG/Stanford format, the putseq operation outputs any
comment/history lines (or just the line ";\n" if there are no
comment/history lines), a one-line description, the sequence and finally
either a '1' or '2' depending on the value of iscircular.
1021
1022In the IG-old/Stanford-old format, the putseq operation outputs the
1023same text as in the IG/Stanford format except that exactly one
1024comment/history line is output.
1025
1026In the IG/Stanford format, the annotate operation either replaces,
1027appends or inserts the comment lines at the beginning of the entry.
1028There is no annotate operation in the IG-old/Stanford-old format.
1029
1030Example IG/Stanford entry:
1031
1032;NCBI gi: 579066
1033;
1034;SEQIO retrieval from GenBank database entry.   07-Feb-1996
1035gb:A14666|acc:A14666 PRLB promoter - Bacteriophage lambda, 281 bp.
1036  gatcagctgc gacacaacta gtttacttac tcgcttatta aaccagaccc acaatctttt
1037  acacagatac aatattttta gtggaaactt cttgacattt cggcccatga cctttactct
1038  gttataaatt acttttatgg gggacgatca cactagcaaa ggagttacct aagccccgaa
1039  tgttcaatgg gaagacttcc ccaatcatga cccacattac gggaccccaa gttgcggaga
1040  agaaggcgat gtaaactgtc aaagcaatca cagagatgat c1
1041
1042Example IG-old/Stanford-old entry:
1043
1044;NCBI gi: 579066
1045gb:A14666|acc:A14666 PRLB promoter - Bacteriophage lambda, 281 bp.
1046  gatcagctgc gacacaacta gtttacttac tcgcttatta aaccagaccc acaatctttt
1047  acacagatac aatattttta gtggaaactt cttgacattt cggcccatga cctttactct
1048  gttataaatt acttttatgg gggacgatca cactagcaaa ggagttacct aagccccgaa
1049  tgttcaatgg gaagacttcc ccaatcatga cccacattac gggaccccaa gttgcggaga
1050  agaaggcgat gtaaactgtc aaagcaatca cagagatgat c1
1051
1052
1053
1054ASN.1 Text File Format
1055**********************
1056
1057 NOTE: This file format implementation is not nearly complete
1058 enough to handle all of the variations of ASN.1 text files. I
1059 concentrated the implementation on handling the "Bioseq"
1060 sequence records defined as part of the "Bioseq-set" structure,
1061 i.e., it looks for each "Bioseq-set.seq-set.seq" record in the file,
1062 where '.' separates the initial keywords for each level of
1063 sub-record. (See the NCBI toolkit for the definitions of the
1064 "Bioseq-set" and "Bioseq" syntax, and the values of those initial
1065 keywords).
1066
1067 However, it does handle all of the syntactic requirements of the
1068 ASN.1 text format. It makes no assumptions on the structure of
1069 the file, handling a completely free-form file (with one exception
1070 listed below). It does assume that the format consists of a
1071 hierarchy of records, where a record consists of a text string
1072 identifier and then a pair of matching braces bounding the
1073 contents of the record (except for simple records which contain
1074 only one or more strings and numbers).
1075
1076The read operation looks for the beginning of each
1077"Bioseq-set.seq-set.seq" record in the file. The operation assumes
1078that this record is a "Bioseq" record, and looks for the end of it. Also,
the read operation makes the syntactic requirement that the open
1080brace beginning the "seq" record is separated from its initial keyword
1081by exactly one space (i.e., the operation looks for the string "seq {").
1082After scanning to the end of the "seq" record, the operation looks for
1083the "seq.inst.length" sub-record. If found, the sequence length is
1084extracted from that sub-record.
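
The search for "seq {" and its matching close brace can be sketched in
Python as follows. This is a simplified illustration, not the SEQIO
parser: the function names are hypothetical, the whole file is assumed
to be in one string, and the sketch does not check that the "seq"
record actually sits inside "Bioseq-set.seq-set".

```python
def matching_brace(text, open_index):
    """Return the index of the '}' matching the '{' at open_index, or -1."""
    depth = 0
    in_string = False
    for index in range(open_index, len(text)):
        ch = text[index]
        if ch == '"':
            in_string = not in_string     # braces inside strings are data
        elif not in_string:
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    return index
    return -1


def find_seq_records(text):
    """Return the text of each record that begins with "seq {"."""
    records = []
    position = 0
    while True:
        start = text.find("seq {", position)
        if start < 0:
            return records
        end = matching_brace(text, start + len("seq "))
        if end < 0:
            return records                # unterminated record: stop
        records.append(text[start:end + 1])
        position = end + 1


text = (
    'Bioseq-set ::= {\n  seq-set {\n    seq {\n'
    '      id { genbank { name "A{1" } } ,\n'
    '      inst { length 4 } } ,\n'
    '    seq {\n      inst { length 7 } } } }\n'
)
records = find_seq_records(text)
```

Tracking quoted strings matters because a brace inside a string, as in
the name "A{1" above, must not change the nesting depth.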
1085
1086The getseq operation looks for the "seq.inst.seq-data" sub-record in
1087the entry. If found, the sequence is extracted from that sub-record.
1088(NOTE: This operation can only handle sequences that have been
1089encoded in the `iupacna', `iupacaa', `ncbi2na' or `ncbi4na' formats.)
1090
1091The rawseq operation is the same as the getseq operation, since the
`iupacna', `iupacaa', `ncbi2na' and `ncbi4na' formats do not contain
1093non-alphabetic characters.
1094
1095The getinfo operation looks for a large number of possible sub-records
1096for information about the sequence. To find database identifiers, it
1097looks in the "seq.id" sub-record for the sub-sub-records "pir.name",
1098"pir.accession", "swissprot.name", "swissprot.accession",
1099"genbank.name", "genbank.accession", "embl.name",
1100"embl.accession", "ddbj.name", "ddbj.accession", "prf.name",
1101"prf.accession", "other.name", "other.accession", "pdb.mol", "gi",
1102"giim.id", "gibbsq" and "gibbmt". Any identifiers found are added to
1103the idlist. To find the date information, it looks in the "seq.descr"
1104sub-record to find the sub-sub-records "create-date",
1105"update-date", "genbank.date", "genbank.entry-date",
1106"embl.creation-date", "embl.update-date", "pir.date", "sp.created",
1107"sp.sequpd", "sp.annotupd" and "pdb.deposition".
1108
Then, the operation searches for the description, organism and
1110comment information in the "seq.descr" sub-record. For the
1111description, the operation searches for the sub-sub-records "title",
1112"pdb.compound" and "name" and picks one of them for the description
1113("title" if found, else "pdb.compound", else "name"). For the organism,
1114the sub-sub-records "org.taxname", "org.common", "pir.source" and
1115"pdb.source" are searched. For the comments, all of the "comment"
1116sub-sub-records in "seq.descr" are concatenated together to make up
1117the comment lines.
1118
1119Finally, the alphabet is picked up from the "seq.descr.mol-type",
1120"seq.descr.modif.dna", "seq.descr.modif.rna" or "seq.inst.mol"
1121sub-records, the isfragment field is set to 1 if "seq.descr.modif.partial"
exists, and the iscircular field is set to 1 if the data string in
1123"seq.inst.topology" is "circular".
1124
1125The putseq operation outputs a "Bioseq" record for the sequence as
1126part of a "Bioseq-set" structure (i.e., the appropriate strings are output
1127before the first putseq operation, between the "Bioseq" records and
1128when the file is closed, so that the file consists of a correctly formatted
1129"Bioseq-set" record). The form of the file mirrors that of the Bioseq-set
1130example given in the NCBI toolkit.
1131
1132(NOTE: Because some text must be output when the file is closed (i.e.,
1133when seqfclose is called), you MUST call seqfclose when writing an
1134ASN.1 file. If you don't call seqfclose, the text file will not be complete.)
1135
1136The annotate operation either replaces, creates or appends the
1137comment lines in the "seq.descr" sub-record (i.e., the comment lines
1138are the "seq.descr.comment" records). If no "seq.descr" sub-record
1139exists, one is created in the most appropriate place in the "seq" record.
1140If the entry given to the annotate operation is not a Bioseq "seq"
1141record, an error occurs.
1142
NOTE: Using the annotate operation by itself will NOT create a valid
1144ASN.1 text file. You must output the following strings before the first
1145entry, between entries, and after the last entry (again, assuming the
1146entries are "Bioseq" records taken from the "Bioseq-set" hierarchy):
1147
1148   Before the first entry:  "Bioseq-set ::= {\n  seq-set {\n"
1149          Between entries:  " ,\n"
1150     After the last entry:  " } }\n"
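
As an illustration of where those three strings go, the following
Python sketch writes already-annotated entries to a file-like object.
The function name is hypothetical, and each entry is assumed to be the
text of one "seq" record with no trailing newline needed.

```python
import io


def write_bioseq_set(entries, out):
    """Write "seq" record texts wrapped as one "Bioseq-set" structure."""
    out.write("Bioseq-set ::= {\n  seq-set {\n")      # before the first entry
    for index, entry in enumerate(entries):
        if index > 0:
            out.write(" ,\n")                         # between entries
        # Assumption: trailing newlines are dropped so the separator or
        # closing braces follow the record's final '}' on the same line.
        out.write(entry.rstrip("\n"))
    out.write(" } }\n")                               # after the last entry


buffer = io.StringIO()
write_bioseq_set(["    seq { inst { length 4 } }",
                  "    seq { inst { length 7 } }\n"], buffer)
result = buffer.getvalue()
```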
1151
1152A Complete ASN.1 Text File:
1153
1154Bioseq-set ::= {
1155  seq-set {
1156    seq {
1157      id {
1158        genbank {
1159          name "A14666" ,
1160          accession "A14666" } } ,
1161      descr {
1162        title "PRLB promoter" ,
1163        org {
1164          taxname "Bacteriophage lambda" } ,
1165        update-date
1166          str "18-AUG-1994" ,
1167        comment "NCBI gi: 579066" ,
1168        comment "SEQIO retrieval from GenBank database entry.  07-Feb-1996" } ,
1169      inst {
1170        repr raw ,
1171        mol dna ,
1172        length 281 ,
1173        seq-data
1174          iupacna "gatcagctgcgacacaactagtttacttactcgcttattaaaccagacccacaatcttt
1175tacacagatacaatatttttagtggaaacttcttgacatttcggcccatgacctttactctgttataaattactttta
1176tgggggacgatcacactagcaaaggagttacctaagccccgaatgttcaatgggaagacttccccaatcatgacccac
1177attacgggaccccaagttgcggagaagaaggcgatgtaaactgtcaaagcaatcacagagatgatc" } } } }
1178
1179
1180
1181GCG Format
1182**********
1183
1184The read operation first looks for a line that ends with the string ".." (or
1185more precisely, a line whose last non-whitespace characters are "..").
1186That line should be the GCG information line, and should look
1187something like the following:
1188
1189  gb:A02201  Length: 664  June 21, 1996 18:42  Type: N  Check: 9896  ..
1190
1191although any or all of this information (except the "..") can be missing.
1192If the line contains the "Length:" keyword, then the read operation will
1193extract the sequence length. The read operation then reads the rest of
1194the file, and assumes that those lines contain the sequence.
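
Recognizing the GCG information line and pulling out its optional
fields might look like the following Python sketch. It is not the
SEQIO code: the function name is hypothetical, and the regular
expressions are only one plausible reading of the "Length:", "Type:"
and "Check:" fields shown in the example line.

```python
import re


def parse_gcg_info_line(line):
    """Return the fields of a GCG information line, or None."""
    # The line qualifies only if its last non-whitespace characters are "..".
    if not line.rstrip().endswith(".."):
        return None
    info = {}
    match = re.search(r"Length:\s*(\d+)", line)
    if match:
        info["length"] = int(match.group(1))
    match = re.search(r"Type:\s*(\S+)", line)
    if match:
        info["type"] = match.group(1)
    match = re.search(r"Check:\s*(\d+)", line)
    if match:
        info["check"] = int(match.group(1))
    return info


example = ("  gb:A02201  Length: 664  June 21, 1996 18:42"
           "  Type: N  Check: 9896  ..")
fields = parse_gcg_info_line(example)
```

A line consisting of nothing but ".." still counts as an information
line in this sketch and simply yields no fields.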
1195
The getseq operation scans the sequence lines. All alphabetic
1197characters on those lines are assumed to be in the sequence. No
1198format for those lines is assumed.
1199
1200The rawseq operation is the same as the getseq operation, except that
1201all non-whitespace and non-numeric characters are considered part of
1202the sequence. During this operation, any period `.' appearing in the
1203sequence lines is assumed to be a gap character and translated into a
dash `-' (SEQIO's canonical gap character).
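
The rawseq rule with its gap translation can be sketched in Python as
follows. The function name is hypothetical, and the input is assumed
to be the list of lines that follow the GCG information line.

```python
def gcg_rawseq(sequence_lines):
    """Collect raw sequence characters, turning GCG '.' gaps into '-'."""
    kept = []
    for line in sequence_lines:
        for ch in line:
            if ch.isspace() or ch.isdigit():
                continue                  # position numbers and spacing
            kept.append("-" if ch == "." else ch)
    return "".join(kept)


raw = gcg_rawseq(["       1 gatca..gct", "", "      11 ac*"])
```

The putseq operation described below applies the reverse mapping,
writing each dash in the sequence as a period.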
1205
1206The getinfo operation takes the date and the alphabet from the GCG
1207information line (if the date and the "Type:" fields are there), sets the
1208description to the first word of the GCG information line (if it isn't
1209"Length:"), and then takes all of the lines up to the GCG information
1210line as the comment.
1211
1212The putseq operation first outputs any comment lines, outputs a
1213complete GCG information line (with a valid checksum), and then
1214outputs the sequence lines in the default format shown below. Any
1215dash `-' appearing in the output sequence is assumed to be a gap
1216character and automatically translated into a period `.'.
1217
1218There currently is no annotate function.
1219
1220
1221
1222GCG-* Formats
1223*************
1224
1225The processing of the GCG-* formats essentially merges the
1226processing of the GCG format on the sequence lines with the
1227processing of the GenBank, PIR, EMBL, Swiss-Prot, FASTA,
1228FASTA-old, NBRF, NBRF-old, IG/Stanford and IG-old formats when
1229dealing with the header lines of each entry. So, see above for the
1230details on that processing.
1231
1232The one exception to this rule is the relationship between the NBRF
1233and GCG-NBRF formats. Since the NBRF entries contain "header"
1234information that actually appears at the end of the entry, and the GCG
1235format requires that the last thing in an entry be the sequence, the
1236GCG and non-GCG forms of the NBRF entries differ more than the
1237other formats. In the GCG-NBRF format, the lines before the GCG
1238information line are assumed to contain the two header lines normally
1239found in the NBRF entries, immediately followed by the lines normally
1240appearing at the end of the file (the "C;Comment:", "C;Accession:" and
1241other lines). After those lines, the GCG information line and sequence
1242lines should appear, and be the last things in the entry. The fmtseq
1243program and SEQIO package have been implemented to make this
1244transformation between the NBRF and GCG-NBRF formats.
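
The reordering of lines between the two forms can be sketched in
Python as below. This shows only the rearrangement of the header
lines. It does not build the GCG information line or reformat the
sequence lines, and the function name is hypothetical.

```python
def gcg_nbrf_header_lines(nbrf_entry_lines):
    """Split an NBRF entry into GCG-NBRF header lines and sequence lines.

    The returned header holds the two leading NBRF lines followed by the
    lines that normally appear after the sequence.
    """
    leading = nbrf_entry_lines[:2]
    sequence_lines = []
    trailing_header = []
    in_header = False
    for line in nbrf_entry_lines[2:]:
        # The trailing header begins at the first line whose second
        # character is ';' and runs to the end of the entry.
        if not in_header and len(line) > 1 and line[1] == ";":
            in_header = True
        if in_header:
            trailing_header.append(line)
        else:
            sequence_lines.append(line)
    return leading + trailing_header, sequence_lines


entry = [
    ">DL;gb:A14666",
    "PRLB promoter - Bacteriophage lambda, 9 bp.",
    "  gatca gctg*",
    "C;Date: 18-AUG-1994",
    "C;Accession: A14666",
]
header, sequence_lines = gcg_nbrf_header_lines(entry)
```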
1245
An example GCG-GenBank entry:
1247
1248LOCUS       A14666        281 bp    DNA             PHG       18-AUG-1994
1249DEFINITION  PRLB promoter.
1250ACCESSION   A14666
1251KEYWORDS    .
1252SOURCE      Bacteriophage lambda.
1253  ORGANISM  Bacteriophage lambda
1254            Viridae; ds-DNA nonenveloped viruses; Siphoviridae.
1255REFERENCE   1  (bases 1 to 281)
1256  AUTHORS   Michiels,F., Delcour,J., Mahillon,J., Joos,H., Platteeuw,C. and
1257            Josson,K.
1258  TITLE     Transformed lactic acid bacteria
1259  JOURNAL   Patent: EP 0311469-A 10 12-APR-1989;
1260            PLANT GENETIC SYSTEMS N.V.; UNIVERSITE CATHOLIQUE DE LOUVAIN
1261COMMENT     NCBI gi: 579066
1262FEATURES             Location/Qualifiers
1263     source          1..281
1264                     /organism="Bacteriophage lambda"
1265     RBS             158..166
1266     CDS             180..254
1267                     /note="PRLB;  NCBI gi: 579067"
1268                     /codon_start=1
1269                     /translation="MFNGKTSPIMTHITGPQVAEKKAM"
1270BASE COUNT       89 a     67 c     52 g     73 t
1271ORIGIN
1272
1273  gb:A14666  Length: 281  June 28, 1996 16:23  Type: N  Check: 2754  ..
1274
1275       1 gatcagctgc gacacaacta gtttacttac tcgcttatta aaccagaccc
1276
1277      51 acaatctttt acacagatac aatattttta gtggaaactt cttgacattt
1278
1279     101 cggcccatga cctttactct gttataaatt acttttatgg gggacgatca
1280
1281     151 cactagcaaa ggagttacct aagccccgaa tgttcaatgg gaagacttcc
1282
1283     201 ccaatcatga cccacattac gggaccccaa gttgcggaga agaaggcgat
1284
1285     251 gtaaactgtc aaagcaatca cagagatgat c
1286
1287An example GCG-NBRF entry:
1288
1289>DL;gb:A14666
1290PRLB promoter - Bacteriophage lambda, 281 bp.
1291C;Date: 18-AUG-1994
1292C;Accession: A14666
1293C;Comment: NCBI gi: 579066
1294C;Comment:
1295C;Comment: SEQIO retrieval from GenBank database.   28-Jun-1996
1296
1297  gb:A14666  Length: 281  June 28, 1996 16:22  Type: N  Check: 2754  ..
1298
1299       1 gatcagctgc gacacaacta gtttacttac tcgcttatta aaccagaccc
1300
1301      51 acaatctttt acacagatac aatattttta gtggaaactt cttgacattt
1302
1303     101 cggcccatga cctttactct gttataaatt acttttatgg gggacgatca
1304
1305     151 cactagcaaa ggagttacct aagccccgaa tgttcaatgg gaagacttcc
1306
1307     201 ccaatcatga cccacattac gggaccccaa gttgcggaga agaaggcgat
1308
1309     251 gtaaactgtc aaagcaatca cagagatgat c
1310
1311
1312
1313
1314MSF Multiple Sequence Format
1315****************************
1316
1317The read operation first looks for a GCG information line of the
1318following form:
1319
1320 Pileup.Msf  MSF: 729  Type: N  June 21, 1996 15:02  Check: 3171 ..
1321
Any of this information may be missing except the ".." marker and the
"MSF: %d" field; the read operation takes the sequence length from the
latter. After the information line, the read operation looks for the
sequence name lines, which are of the form
1326
1327 Name: Humhbbbpc        Len:   729  Check: 6463  Weight:  1.00
1328
where the "Name: " field gives the sequence identifier and must appear
on every non-blank line in this section of the MSF file (the other fields
are ignored, and the length is assumed to be the same as the global
length). The sequence name section ends when a line beginning
with "//" appears. Any number of blank lines can be interspersed in this
section, but every non-blank line must follow the above format. The
rest of the file is assumed to contain the sequence lines, where each
sequence line begins with the sequence name followed by a space, as
in:
1338
1339           401                                                450
1340Humhbbbpc  CAGAAATTTA GTGTTTTCTC AGTCAGTTAA CATTCCTTCA AC........
1341Humhbbbpd  CAGAAATTTA GTGTTTTCTC AGTCAGTTAA CATTCCTTCA AC........
1342Humhbbbpe  CAGACAAATG GAACAGAATA GAGAGCCCAG AAATAAGACC ACATG.....
1343Humhbbbpf  CAGACAAATG GAACAGAATA GAGAGCCCAG AAATAAGACC ACATG.....
1344Humhbbbpg  AATACAAAAT CAGTAGCATT TCATATATAA A......... ..........
1345Humhbbbph  AATACAAAAT CAGTAGCATT TCATATATAA A......... ..........
1346Humhbbbp1  AAGTGATGAA ATTGTGTATT CAATGTAGTC TCAAGAGAAT TGAAAACCAA
1347Humhbbbpa  AAATAAAAGG ATGGAGGAAG ATCTACCAAG CA........ ..........
1348Humhbbbpb  AAATAAAAGG ATGGAGGAAT ATCTACCAAG CA........ ..........
1349Humhbbbp2  AGCT.AAAGG ATTGTAAATG CACTAATCAG CACTCTGTGT CTAGCTCAAG
1350
Nothing is assumed about the layout of the sequence lines, or about
whether the position number lines (401...450) are present, except that
each sequence line begins with its sequence name. The sequence lines
run to the end of the file.
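
The header rules above can be sketched as follows. This is only an
illustration in Python, not the package's own code: the function name
`parse_msf_header' is hypothetical, and the sketch assumes that the
".." marker and the "MSF: %d" field sit on the same line.

```python
import re

def parse_msf_header(lines):
    """Return (length, names, index of the first sequence line)."""
    length = None
    i = 0
    # Look for the GCG information line: it must carry "MSF: <n>" and "..".
    while i < len(lines):
        m = re.search(r"MSF:\s*(\d+)", lines[i])
        i += 1
        if m and ".." in lines[i - 1]:
            length = int(m.group(1))
            break
    if length is None:
        raise ValueError("no information line with 'MSF: n' and '..'")
    # Collect the "Name: " lines until the "//" divider is reached.
    names = []
    while i < len(lines):
        line = lines[i]
        i += 1
        if line.startswith("//"):
            return length, names, i
        if not line.strip():
            continue  # blank lines may be interspersed freely
        m = re.search(r"Name:\s*(\S+)", line)
        if m is None:
            raise ValueError("non-blank line without a 'Name: ' field")
        names.append(m.group(1))  # Len, Check and Weight are ignored
    raise ValueError("missing '//' line")
```

Everything before the information line is left untouched by this
sketch; the getinfo operation described below is what treats those
lines as the comment.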
1354
1355The getseq operation finds every sequence line beginning with the
1356corresponding sequence name (the sequences are ordered by the
1357order of sequence names in the sequence names section). All
1358alphabetic characters appearing after the sequence name are taken for
1359the sequence.
1360
1361The rawseq operation is the same as the getseq operation, except that
1362all non-whitespace and non-numeric characters are considered part of
1363the sequence. During this operation, any period `.' appearing in the
1364sequence lines is assumed to be a gap character and translated into a
dash `-' (SEQIO's canonical gap character).
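
As an illustration of the difference between the two operations, the
following Python sketch collects one sequence from the sequence lines.
It is not the package's code; the function name `msf_sequence' and its
`raw' flag are hypothetical.

```python
def msf_sequence(seq_lines, name, raw=False):
    """Collect the characters of the sequence called `name`."""
    out = []
    for line in seq_lines:
        # A sequence line begins with the name followed by a space.
        if not line.startswith(name + " "):
            continue
        for ch in line[len(name):]:
            if raw:
                # rawseq rule: keep everything but whitespace and digits,
                # translating the MSF gap '.' into '-'.
                if ch.isspace() or ch.isdigit():
                    continue
                out.append("-" if ch == "." else ch)
            elif ch.isalpha():
                # getseq rule: keep alphabetic characters only.
                out.append(ch)
    return "".join(out)
```

Requiring a space after the name keeps a short name such as
"Humhbbbp" from matching the lines of "Humhbbbpc".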
1366
1367The getinfo operation takes the date and the alphabet from the GCG
1368information line (if the date and the "Type:" fields are there), sets the
1369description to the sequence name found in the sequence name section,
1370and then takes all of the lines up to the GCG information line as the
1371comment.
1372
The putseq operation outputs an MSF file exactly mimicking the files
1374output by GCG using "PileUp" in its default mode, except that only the
1375keyword "PileUp" appears on the first line and no comments are
1376output. Any dashes `-' found in the sequences are assumed to be gap
1377characters and are automatically translated into periods `.'. If the
1378sequences are of different lengths, the putseq operation will pad the
1379smaller sequences with periods `.'.
1380
1381(IMPORTANT: The one unusual feature about the putseq operation is
1382that, unlike all of the other putseq operations except Clustalw and
1383PHYLIP, the actual output does not occur until `seqfclose' is called to
1384close the file. Because the MSF format must know the number of entries
1385before it can begin the output, the sequences cannot be output at each
1386call to `seqfwrite'. What the putseq operation does, on each call to
1387`seqfwrite', is make a copy of the sequence and a sequence identifier
1388(either the main identifier, description or organism name). Then, when
1389`seqfclose' is called, all of the sequences are output in the correct
1390format.)
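
The deferred output can be pictured with a small Python sketch. It is
an assumption-laden illustration, not the SEQIO implementation: the
class `DeferredWriter' and its methods are hypothetical stand-ins for
what happens behind `seqfwrite' and `seqfclose', and no real MSF text
(checksums, name lines) is produced.

```python
class DeferredWriter:
    """Buffer sequences on write; normalize them only at close."""

    def __init__(self):
        self.records = []

    def write(self, ident, seq):
        # Nothing is output here: only copies of the identifier
        # and the sequence are kept.
        self.records.append((str(ident), str(seq)))

    def close(self):
        # Only now is the number of sequences (and the longest length) known.
        width = max((len(seq) for _, seq in self.records), default=0)
        # Dashes become periods, and short sequences are padded with periods.
        return [(ident, seq.replace("-", ".").ljust(width, "."))
                for ident, seq in self.records]
```

The same buffering idea applies to any format whose first output line
depends on the full set of sequences.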
1391
1392There currently is no annotation function.
1393
1394An example MSF file:
1395
1396PileUp
1397
1398
1399 pir.msf  MSF: 104  Type: P  June 28, 1996 17:04  Check: 3466  ..
1400
1401 Name: pir:CCCZ         Len:   104  Check: 9501  Weight:  1.00
1402 Name: pir:CCMQR        Len:   104  Check: 9512  Weight:  1.00
1403 Name: pir:CCMKP        Len:   104  Check: 9066  Weight:  1.00
1404 Name: pir:CCRB         Len:   104  Check: 8395  Weight:  1.00
1405 Name: pir:CCGW         Len:   104  Check: 8496  Weight:  1.00
1406 Name: pir:CCCM         Len:   104  Check: 8496  Weight:  1.00
1407
1408//
1409
1410            1                                                   50
1411pir:CCCZ    GDVEKGKKIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQAPGYSYTA
1412pir:CCMQR   GDVEKGKKIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQAPGYSYTA
1413pir:CCMKP   GDVFKGKRIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQASGFTYTE
1414pir:CCRB    GDVEKGKKIF VQKCAQCHTV EKGGKHKTGP NLHGLFGRKT GQAVGFSYTD
1415pir:CCGW    GDVEKGKKIF VQKCAQCHTV EKGGKHKTGP NLHGLFGRKT GQAVGFSYTD
1416pir:CCCM    GDVEKGKKIF VQKCAQCHTV EKGGKHKTGP NLHGLFGRKT GQAVGFSYTD
1417
1418            51                                                 100
1419pir:CCCZ    ANKNKGIIWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK
1420pir:CCMQR   ANKNKGITWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK
1421pir:CCMKP   ANKNKGIIWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK
1422pir:CCRB    ANKNKGITWG EDTLMEYLEN PKKYIPGTKM IFAGIKKKDE RADLIAYLKK
1423pir:CCGW    ANKNKGITWG EETLMEYLEN PKKYIPGTKM IFAGIKKKGE RADLIAYLKK
1424pir:CCCM    ANKNKGITWG EETLMEYLEN PKKYIPGTKM IFAGIKKKGE RADLIAYLKK
1425
1426            101
1427pir:CCCZ    ATNE
1428pir:CCMQR   ATNE
1429pir:CCMKP   ATNE
1430pir:CCRB    ATNE
1431pir:CCGW    ATNE
1432pir:CCCM    ATNE
1433
1434
1435
1436
PHYLIP Interleaved and Sequential File Formats
**********************************************
1441
 NOTE: The implementation here is more flexible than other
 implementations when reading, but it is a bit restrictive in its
 output. In particular:
1445
1446   1. Both interleaved and sequential formats are supported and
1447    rigorously distinguished. See below for the details.
   2. An input file in the PHYLIP format can contain one or more
    PHYLIP entries, where the entries are separated only
    by whitespace. Mixed files (some interleaved entries, some
    sequential entries) are supported.
1452   3. Any number of blank lines or lines filled only with
1453    whitespace can be included in the file. Blank lines do not
1454    disrupt the parsing of the entries.
1455   4. The output operation does NOT output more than one
1456    entry per file, because I have yet to completely figure out
1457    the SEQIO interface issues. (Note that this may change in a
1458    future version.)
1459   5. This implementation was done using the documentation
1460    from Version 3.5c. Whether it works with earlier versions is
1461    not known.
1462
The read operation first skips whitespace characters and then looks for
the number of sequences and the sequence length (those two numbers
must be the first thing in the entry). On that initial line, it also looks for
the option characters 'A', 'C', 'F', 'M', 'U' and 'W'. If any option
other than 'U' is found, the operation then skips any subsequent lines
that begin with a match to the character strings "ANCESTOR ",
"CATEGORIES", "FACTORS ", "MIXTURE ", or "WEIGHTS ". A line is
considered to match one of the strings if the first 10 characters of the
line contain a prefix of the string padded by spaces. A line is
skipped only if the option corresponding to its string was given on that
first line.

(NOTE: This may cause some problems on an entry such as this one:
1475
14763 6 A
1477A         ABCDEF
1478B         BCDEFG
1479C         CDEFGH
1480
1481because the second line of the entry is treated as an "ANCESTOR "
1482line, when in fact it was a sequence line. But, from looking at the
1483documentation, the PHYLIP programs would die on this entry, too. And
1484replacing "A " with something like "Alpha " eliminates the problem.)
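
The matching rule, and the way it catches the entry above, can be
sketched in Python. This is an illustration only: the names
`OPTION_KEYWORDS' and `is_option_line' are hypothetical, and the
comparison is assumed to be case-sensitive with the padding on the
right.

```python
# Option character -> the keyword that starts its extra line.
OPTION_KEYWORDS = {
    "A": "ANCESTOR",
    "C": "CATEGORIES",
    "F": "FACTORS",
    "M": "MIXTURE",
    "W": "WEIGHTS",
}

def is_option_line(line, options):
    """True if `line` looks like an option line for one of `options`."""
    # Take the 10-character name field and drop its space padding.
    token = line[:10].rstrip()
    if not token:
        return False
    # The field matches if it is a prefix of the keyword of an option
    # that was actually given on the entry's first line.
    return any(OPTION_KEYWORDS[opt].startswith(token)
               for opt in options if opt in OPTION_KEYWORDS)
```

With the 'A' option set, the line "A         ABCDEF" is taken as an
option line because "A" is a prefix of "ANCESTOR", while a line
starting with "Alpha" is not.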
1485
1486After skipping those initial lines, the read operation tries to match the
1487subsequent lines to the interleaved and sequential file formats. The
1488following criteria are the keys to distinguishing between the two
1489formats:
1490
1491 1. The line giving the initial piece of a sequence must be at least 10
1492   characters long and there must be at least one non-whitespace
1493   character in those first ten characters. This should be the
1494   sequence identifier, and its characters are not counted as part of
1495   the sequence.
1496 2. In the Interleaved format, all of the sequence substrings in each
1497   block of the entry must have the same length. A block is a set of
1498   "number-of-sequences" lines (not counting blank lines) which
1499   contain a piece of each of the sequences.
1500 3. The end of each sequence must occur on its own line, without
1501   any additional non-whitespace text after the sequence
1502   characters.
1503
If exactly one format matches, or both formats match and the
input format has been specified as PHYLIP-Int or PHYLIP-Seq (instead
of just PHYLIP), then the entry format has been successfully
determined. Otherwise (if neither format matches, or both match and
the input format is plain PHYLIP), a parse error is triggered. However,
given the above criteria and the fact that the operation attempts to
completely match both formats against the text, the likelihood that both
formats will match the same text is extremely remote.
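
The decision can be written out as a short Python sketch. The function
`resolve_phylip_format' is hypothetical, and the two boolean arguments
stand for the results of the complete match attempts described above.

```python
def resolve_phylip_format(matches_interleaved, matches_sequential,
                          declared="PHYLIP"):
    """Pick the entry format, or raise ValueError for a parse error."""
    if matches_interleaved and matches_sequential:
        # Both match: only an explicitly declared format breaks the tie.
        if declared == "PHYLIP-Int":
            return "interleaved"
        if declared == "PHYLIP-Seq":
            return "sequential"
        raise ValueError("entry matches both PHYLIP formats")
    if matches_interleaved:
        return "interleaved"
    if matches_sequential:
        return "sequential"
    raise ValueError("entry matches neither PHYLIP format")
```

The sketch follows the text literally when exactly one format matches;
how a conflicting declared format is treated in that case is not stated
here, so the sketch simply returns the format that matched.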
1512
1513Finally, if the 'U' option has been set on the entry's first line, the read
1514operation skips the user trees listed in the entry, to get to the end of
1515the entry. The format of the user trees consists of a line giving the
1516number of trees, followed by any number of lines of text where each
1517user tree description is ended by a semi-colon (the operation just
1518counts the semi-colons it sees). The end of the entry is at the end of
1519the line containing the last semi-colon.
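
A minimal Python sketch of the semi-colon counting follows. The
function name `skip_user_trees' is hypothetical, and the sketch assumes
the tree count is the first token on its line.

```python
def skip_user_trees(lines, start):
    """Return the index of the first line after the user trees."""
    i = start
    # Skip blank lines before the line giving the number of trees.
    while i < len(lines) and not lines[i].strip():
        i += 1
    if i >= len(lines):
        raise ValueError("missing user tree count")
    remaining = int(lines[i].split()[0])
    i += 1
    # Each tree description ends with ';', so just count semi-colons.
    while remaining > 0:
        if i >= len(lines):
            raise ValueError("file ended before all user trees were seen")
        remaining -= lines[i].count(";")
        i += 1
    # The entry ends at the end of the line holding the last ';'.
    return i
```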
1520
1521The getseq operation finds the first line of the appropriate sequence in
1522the entry (i.e., the `seqfseqno' sequence), skips the 10 character
1523identifier and retrieves the sequence. All alphabetic characters are
1524considered to be in the sequence.
1525
1526The rawseq operation is the same as the getseq operation, except that
1527all non-whitespace and non-numeric characters are considered part of
1528the sequence.
1529
1530The getinfo operation takes the 10 character sequence identifier to be
1531the description of the sequence. No other information is retrieved.
1532
1533The putseq operation outputs an Interleaved or Sequential entry
1534exactly as described in the PHYLIP program documentation. If the
1535sequences output are of different lengths, the putseq operation will pad
1536the smaller sequences with dashes `-'.
1537
1538(IMPORTANT: The one unusual feature about the putseq operation is
1539that, unlike all of the other putseq operations except Clustalw and MSF,
1540the actual output does not occur until `seqfclose' is called to close the
1541file. Because the PHYLIP format must know the number of entries
1542before it can output the first line, the sequences cannot be output at
each call to `seqfwrite'. Instead, on each call to `seqfwrite', the putseq
operation makes a copy of the sequence and a sequence
identifier (either the mainid, mainacc, description or organism name).
1546Then, when `seqfclose' is called, all of the sequences are output in the
1547correct format.)
1548
1549There is no annotate function.
1550
1551Example PHYLIP Interleaved entry:
1552
1553     6    104
1554pir:CCCZ   GDVEKGKKIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQAPGYSYTA
1555pir:CCMQR  GDVEKGKKIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQAPGYSYTA
1556pir:CCMKP  GDVFKGKRIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQASGFTYTE
1557pir:CCRB   GDVEKGKKIF VQKCAQCHTV EKGGKHKTGP NLHGLFGRKT GQAVGFSYTD
1558pir:CCGW   GDVEKGKKIF VQKCAQCHTV EKGGKHKTGP NLHGLFGRKT GQAVGFSYTD
1559pir:CCCM   GDVEKGKKIF VQKCAQCHTV EKGGKHKTGP NLHGLFGRKT GQAVGFSYTD
1560
1561           ANKNKGIIWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK
1562           ANKNKGITWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK
1563           ANKNKGIIWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK
1564           ANKNKGITWG EDTLMEYLEN PKKYIPGTKM IFAGIKKKDE RADLIAYLKK
1565           ANKNKGITWG EETLMEYLEN PKKYIPGTKM IFAGIKKKGE RADLIAYLKK
1566           ANKNKGITWG EETLMEYLEN PKKYIPGTKM IFAGIKKKGE RADLIAYLKK
1567
1568           ATNE
1569           ATNE
1570           ATNE
1571           ATNE
1572           ATNE
1573           ATNE
1574
1575Example PHYLIP Sequential entry:
1576
1577     6    104
1578pir:CCCZ   GDVEKGKKIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQAPGYSYTA
1579           ANKNKGIIWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK
1580           ATNE
1581pir:CCMQR  GDVEKGKKIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQAPGYSYTA
1582           ANKNKGITWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK
1583           ATNE
1584pir:CCMKP  GDVFKGKRIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQASGFTYTE
1585           ANKNKGIIWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK
1586           ATNE
1587pir:CCRB   GDVEKGKKIF VQKCAQCHTV EKGGKHKTGP NLHGLFGRKT GQAVGFSYTD
1588           ANKNKGITWG EDTLMEYLEN PKKYIPGTKM IFAGIKKKDE RADLIAYLKK
1589           ATNE
1590pir:CCGW   GDVEKGKKIF VQKCAQCHTV EKGGKHKTGP NLHGLFGRKT GQAVGFSYTD
1591           ANKNKGITWG EETLMEYLEN PKKYIPGTKM IFAGIKKKGE RADLIAYLKK
1592           ATNE
1593pir:CCCM   GDVEKGKKIF VQKCAQCHTV EKGGKHKTGP NLHGLFGRKT GQAVGFSYTD
1594           ANKNKGITWG EETLMEYLEN PKKYIPGTKM IFAGIKKKGE RADLIAYLKK
1595           ATNE
1596
1597
1598
1599Clustalw Format
1600***************
1601
1602The read operation first skips the header line of the file, and then skips
1603any blank lines. The next non-blank line is assumed to begin the first
1604block. The sequence lines of each block contain first an identifier of 15
1605characters and then the rest of the line is sequence. Those sequence
1606lines must begin with a non-whitespace character. After the sequence
1607lines in each block, there is an additional line to highlight closely
1608related columns in the alignment, followed by zero or more blank lines.
1609This additional line and all of the lines occurring between blocks must
1610either be empty or begin with a whitespace character. There is only one
1611entry per file, and the whole file is assumed to consist of these
1612sequence blocks.
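
The block structure can be sketched in Python as follows. This is an
illustration under stated assumptions, not the package's code: the
function name `read_clustalw' is hypothetical, the header is assumed to
be the first line of the list, and the getseq rule (alphabetic
characters only) is applied to the sequence text.

```python
def read_clustalw(lines):
    """Return {identifier: sequence} built from all blocks of the file."""
    seqs = {}
    for line in lines[1:]:  # the first line is the header
        # Blank lines and the column-marking line are empty or begin
        # with whitespace; sequence lines never do.
        if not line.strip() or line[0].isspace():
            continue
        ident = line[:15].rstrip()  # 15-character identifier field
        piece = "".join(ch for ch in line[15:] if ch.isalpha())
        # Pieces from later blocks are appended to earlier ones.
        seqs[ident] = seqs.get(ident, "") + piece
    return seqs
```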
1613
1614The getseq operation finds the first line of the appropriate sequence in
1615the entry (i.e., the `seqfseqno' sequence), skips the 15 character
1616identifier and retrieves the sequence. All alphabetic characters are
1617considered to be in the sequence.
1618
1619The rawseq operation is the same as the getseq operation, except that
1620all non-whitespace and non-numeric characters are considered part of
1621the sequence.
1622
1623The getinfo operation takes the 15 character sequence identifier to be
1624the description of the sequence. No other information is retrieved.
1625
1626The putseq operation outputs a Clustalw entry exactly as the clustalw
1627program does, except that the version number is replaced with "*.**"
1628and the package does not look for closely related columns in the
1629output alignment (it simply outputs a line of whitespace without any '*'
1630or '.' characters). If the sequences are of different lengths, the putseq
1631operation will pad the smaller sequences with dashes '-'.
1632
1633(IMPORTANT: The one unusual feature about the putseq operation is
1634that, unlike all of the other putseq operations except PHYLIP and MSF,
1635the actual output does not occur until `seqfclose' is called to close the
1636file. Because the Clustalw format must know the number of entries
1637before it can output the first line, the sequences cannot be output at
each call to `seqfwrite'. Instead, on each call to `seqfwrite', the putseq
operation makes a copy of the sequence and a sequence
identifier (either the mainid, mainacc, description or organism name).
1641Then, when `seqfclose' is called, all of the sequences are output in the
1642correct format.)
1643
1644There is no annotate function.
1645
1646Example Clustalw file:
1647
1648CLUSTAL W(*.**) multiple sequence alignment
1649
1650
1651
1652pir:CCCZ       GDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWG
1653pir:CCMQR      GDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGITWG
1654pir:CCMKP      GDVFKGKRIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQASGFTYTEANKNKGIIWG
1655pir:CCRB       GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAVGFSYTDANKNKGITWG
1656pir:CCGW       GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAVGFSYTDANKNKGITWG
1657pir:CCCM       GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAVGFSYTDANKNKGITWG
1658
1659
1660pir:CCCZ       EDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE
1661pir:CCMQR      EDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE
1662pir:CCMKP      EDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE
1663pir:CCRB       EDTLMEYLENPKKYIPGTKMIFAGIKKKDERADLIAYLKKATNE
1664pir:CCGW       EETLMEYLENPKKYIPGTKMIFAGIKKKGERADLIAYLKKATNE
1665pir:CCCM       EETLMEYLENPKKYIPGTKMIFAGIKKKGERADLIAYLKKATNE
1666
1667
1668
1669
1670FASTA-output Formats
1671********************
1672
 NOTE: With a few exceptions, this implementation can read
1674 and understand the output from the FASTA, TFASTA, SSEARCH,
1675 LFASTA, LALIGN and ALIGN programs which were run either in
1676 interactive or non-interactive mode, and where the output was
1677 formatted with MARKX option set to any of 0, 1, 2, 3 or 10.
1678
1679 The exceptions are
1680
1681   1. The program must have been run in non-interactive mode
1682    in order for the automatic format determination to work
1683    correctly. By "non-interactive", I mean that the initial
1684    header output by the program:
1685
1686        FASTA searches a protein or DNA sequence data bank
1687        version 2.0u4 Feb., 1996
1688       Please cite:
1689        W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448
1690       .
1691       .
1692       .
1693
1694    must appear in the text given as input.
   2. If FASTA, TFASTA or SSEARCH is run in interactive
    mode, no information will be known about the query
    sequence (its information is in the initial header, which is
    not included in the file specified to receive the program
    output).
   3. The ALIGN program must be run in non-interactive mode
    in order for the package to correctly parse its output (i.e., that
    initial header must occur in the text). For the other
    programs, the package will parse the output correctly if the
    file format is specified as `FASTA-output'.
1705   4. The implementation was tested against version 2.0u4. If the
1706    output was different in previous versions, the
1707    implementation may not work.
1708
1709The read operation first scans the text occurring before the first
1710alignment in the file. This initial text is ignored, except where it gives
1711information about the sequences being aligned. The initial texts of
1712some of the output formats contain lines of the following form.
1713
1714 >GT8.7 transl. of pa875.con, 19 to 675: 217 aa
1715 >musplfm transl. of musplfm.seq, 2 to 676 : 224 aa
1716
1717(A) musplfm.aa >musplfm transl. of musplfm.seq, 2 to 676          - 224 aa
1718(B) lcbo.aa    >LCBO - Prolactin precursor - Bovine               - 229 aa
1719
1720>musplfm transl. of musplfm.seq, 2 to 676           224 aa vs.
1721>LCBO - Prolactin precursor - Bovine                229 aa
1722
1723The text after the '>' is parsed to extract the sequence id (the first word
1724after the '>'), a sequence description, the sequence length and
1725alphabet information about the sequence.
1726
1727Then, the read operation reads the "entries" of the file, where each
1728entry is considered to be the text describing an alignment between two
1729sequences. Different programs output different sets of alignments, but
1730all six of the FASTA programs supported output one or more
1731two-sequence alignments. Thus, every entry in this format contains
1732two sequences.
1733
1734The getseq operation extracts the appropriate sequence from the entry
1735(the first or second sequence if the `seqfseqno' value is 1 or 2,
1736respectively). All alphabetic characters are considered part of the
1737sequence, except that if the output was generated with MARKX=2, then
1738any periods occurring in the second sequence are replaced with the
1739corresponding character of the first sequence.
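
The MARKX=2 substitution amounts to the following Python sketch. The
function name `expand_markx2' is hypothetical, and the two arguments
are assumed to be aligned strings of equal length.

```python
def expand_markx2(first, second):
    """Replace each '.' in `second` with the aligned character of `first`."""
    # With MARKX=2 a period in the second sequence means "same as the
    # first sequence at this column".
    return "".join(a if b == "." else b for a, b in zip(first, second))
```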
1740
1741The rawseq operation is the same as the getseq operation, except that
1742all non-whitespace and non-numeric characters are considered part of
1743the sequence (with the exception of period substitution mentioned
1744above).
1745
1746The getinfo operation extracts a main identifier, a description and an
1747alphabet for the appropriate sequence, if available. It also constructs a
1748comment that begins with the following:
1749
1750From SSEARCH output alignment of:
1751 >musplfm transl. of musplfm.seq, 2 to 676, 224 aa
1752 >LCBO - Prolactin precursor - Bovine, 229 aa
1753
This gives the name of the program whose output is being parsed, and
the descriptions of the two sequences whose alignment produced the
current sequence. This text is then followed by any information from the
alignment describing the score of that pairwise alignment. The format of
this text depends on the FASTA program executed and the MARKX
value, as it is just copied from the program output.
1760
1761There is no putseq or annotate operation.
1762
1763
1764BLAST-output Formats
1765********************
1766
 NOTE: With one or two exceptions, this implementation can read
 and understand the output from the BLASTN, BLASTP and BLASTX
 programs (and possibly the TBLAST* programs as well, although that
 has not been tested yet). The exceptions are:
1771
1772   1. Automatic recognition of the BLAST-output format
1773    requires that one of the keywords BLASTN, BLASTP or
1774    BLASTX be the first word in the file (possibly after an
1775    e-mail header). Many of the BLAST e-mail servers prepend
1776    a description of their service before the actual BLAST
    output, and so disrupt the recognition by the package. So,
    for output obtained from an e-mail server, the input format
    must be set explicitly.
1780   2. The implementation was tested on output generated by
1781    versions 1.2 and 1.4.9. If the output is different in version
1782    1.3 or 2.0, the implementation may not work (although the
1783    implementation can correctly handle gaps in the
1784    alignments, so that change from 1.* to 2.0 is handled).
1785
1786The read operation first scans the text occurring before the first
1787alignment in the file. This initial text is ignored, except where it gives
1788information about the sequences being aligned. The initial texts of
1789some of the output formats contain lines of the following form.
1790
1791Query=  gb:A02201|acc:A02201 Phage phi-105 DNA for immF plypeptide - Bacteriophage phi-
1792        (665 letters)
1793
The text after "Query=" and before the line containing the "(... letters)"
is parsed as a one-line description, and the number inside the "(...
letters)" is taken as the length of the query sequence.
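
A minimal Python sketch of this step is shown here. The function name
`parse_query_header' is hypothetical, and accepting commas inside the
letter count is an assumption (the "Length =" lines shown later do use
them).

```python
import re

def parse_query_header(lines):
    """Return (one-line description, query length)."""
    parts = []
    started = False
    for line in lines:
        if not started:
            # Everything before the "Query=" line is ignored.
            if line.startswith("Query="):
                started = True
                parts.append(line[len("Query="):].strip())
            continue
        m = re.match(r"\s*\(([\d,]+) letters\)", line)
        if m:
            length = int(m.group(1).replace(",", ""))
            return " ".join(p for p in parts if p), length
        # Continuation lines of a wrapped description are joined in.
        parts.append(line.strip())
    raise ValueError("no 'Query=' block ending in '(n letters)'")
```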
1797
1798Then, the read operation reads the "entries" of the file, where each
1799entry is considered to be the text describing an alignment between two
sequences. The BLAST alignment format consists of header lines
specifying the sequence that matches the query, followed by one or
more pairwise alignments of substrings of the matching sequence and
the query. The read operation first scans the header lines, which are of
the form:
1805
1806>emb|X02799|NCPHI105 Bacillus subtilis phage phi 105 immunity region with
1807            repressor gene and ORF >emb|A11144|A11144 phage phi 105 repressor
1808            (ORF1)-Orf 2 genes and there flanking regions
1809            Length = 1306
1810
where the "Length =" line ends the list of one-line descriptions of the
sequences that match the query (in the next pairwise alignment(s)). It
extracts the one-line description and length of the sequence.
1814
The read operation considers an "entry" to consist only of the actual
score reporting text and pairwise alignment text. So, while the header
lines above are scanned for their information, the entry reported by the
package begins at the line containing either "Plus Strand HSPs:",
"Minus Strand HSPs:" or "Score =", and it ends just after the
last line of the pairwise alignment text. This is done to make the entry
text reported by the package more uniform. Thus, the following BLAST
output would be reported as two entries, the first beginning at the
"Plus Strand HSPs:" line and running through the first pairwise
alignment, and the second beginning with the "Score = 89..." line. The
header lines will not be reported in any entry, and will only be
scanned to extract the one-line description and length information.
1827
1828>emb|Z68118|CER01E6 Caenorhabditis elegans cosmid R01E6
1829            Length = 40,937
1830
1831  Plus Strand HSPs:
1832
1833 Score = 127 (35.1 bits), Expect = 3.2, Sum P(2) = 0.96
1834 Identities = 39/56 (69%), Positives = 39/56 (69%), Strand = Plus / Plus
1835
1836Query:    426 ATTTTAATAAATCTGGATTTAAATGTGTTAAAAATGACGGAAATACAAGTAGTTGA 481
1837              ||||||||||||||    ||||||  | |||||||||  | || |    || || |
1838Sbjct:  35266 ATTTTAATAAATCTCATCTTAAATTAGATAAAAATGAATGCAAAATTTATATTTTA 35321
1839
1840 Score = 89 (24.6 bits), Expect = 3.2, Sum P(2) = 0.96
1841 Identities = 25/34 (73%), Positives = 25/34 (73%), Strand = Plus / Plus
1842
1843Query:     93 ACAATACTAAAAAAGACGGAAATACAAGTATTTT 126
1844              ||||||||||||||    | ||   || ||||||
1845Sbjct:  31613 ACAATACTAAAAAATCTTGTAAACAAAATATTTT 31646
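
The entry boundaries in this example can be reproduced with the Python
sketch below. It is only one way to realize the rule, not the package's
code: the function name `split_blast_entries' is hypothetical, and the
sketch assumes that a "Score =" line directly following a
"... Strand HSPs:" line belongs to the entry that line started.

```python
def split_blast_entries(lines):
    """Split BLAST output lines into entries (lists of lines)."""
    entries, current = [], None
    has_score = in_header = False
    for line in lines:
        if line.startswith(">"):
            # Header lines close the current entry and belong to none.
            if current:
                entries.append(current)
            current, in_header = None, True
            continue
        if in_header:
            if line.strip().startswith("Length ="):
                in_header = False
            continue
        strand = "Strand HSPs:" in line
        score = "Score =" in line
        # A strand line always starts an entry; a score line does so
        # unless the open entry is still waiting for its first score.
        if strand or (score and (current is None or has_score)):
            if current:
                entries.append(current)
            current, has_score = [], False
        if score:
            has_score = True
        if current is not None:
            current.append(line)
    if current:
        entries.append(current)
    # An entry ends just after its last alignment line.
    for entry in entries:
        while entry and not entry[-1].strip():
            entry.pop()
    return entries
```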
1846
1847The getseq operation extracts the appropriate sequence from the entry
1848(the first or second sequence if the `seqfseqno' value is 1 or 2,
1849respectively). All alphabetic characters are considered part of the
1850sequence.
1851
1852The rawseq operation is the same as the getseq operation, except that
1853all non-whitespace and non-numeric characters are considered part of
1854the sequence.
1855
1856The getinfo operation extracts a main identifier, a description and an
1857alphabet for the appropriate sequence, if available. It also constructs a
1858comment that begins with the following:
1859
1860From BLASTN/BLASTP/BLASTX output alignment of:
1861   >gb:A02201|acc:A02201 Phage phi-105 DNA for immF plypeptide - Bacteriophage phi
1862and
1863   >emb|X02799|NCPHI105 Bacillus subtilis phage phi 105 immunity
1864              region with repressor gene and ORF
1865   >emb|A11144|A11144 phage phi 105 repressor (ORF1)-Orf 2 genes
1866              and there flanking regions
1867
This gives the name of the program whose output is being parsed, and
the descriptions of the two sequences whose alignment produced the
current sequence. This text is then followed by any information from the
alignment describing the score of that pairwise alignment.
1872
1873There is no putseq or annotate operation.
1874
1875
1876James R. Knight, knight@cs.ucdavis.edu
1877June 28, 1996
1878