SEQIO -- A Package for Sequence File I/O


FORMAT.DOC - The SEQIO File Formats
***********************************



The File Formats
================

This file describes the specific assumptions the SEQIO package makes
about the file formats it supports. The basic file formats are (with
alternative names in parens):

     o  Raw
     o  Plain
     o  GenBank  (gb)
     o  EMBL
     o  Swiss-Prot  (swissprot, sprot)
     o  PIR  (CODATA)
     o  NBRF
     o  FASTA  (Pearson)
     o  IG/Stanford  (IG, Stanford)
     o  ASN.1  (ASN)
     o  GCG
     o  GCG-*  (GCG-GenBank, GCG-PIR, GCG-EMBL, ...)
     o  MSF
     o  PHYLIP
     o  PHYLIP-Seq  (phylip-s, phylips)
     o  PHYLIP-Int  (phylip-i, phylipi)
     o  Clustalw  (clustal)
     o  FASTA-output  (fasta-out, fastaout, fout)
     o  BLAST-output  (blast-out, blastout, bout)

where `FASTA-output' and `BLAST-output' specify the output
produced by the programs in the FASTA and BLAST packages. The
`GCG-*' format actually refers to a set of formats which specify the
GCG forms of the GenBank, EMBL, Swiss-Prot, PIR, NBRF, FASTA and
IG/Stanford formats. These formats are included to distinguish the GCG
forms of these formats from the generic GCG format (where the header
lines of an entry are considered as unstructured comments). Any valid
name for one of the seven formats, plus their *-old variants given
below, can replace the `*' in `GCG-*'.

In addition to the basic file formats, there are four file "formats" which
use faster file reading implementations.
They are specifically geared to
the formats of the GenBank, PIR, EMBL and Swiss-Prot databases, and
they are included to speed up database searches (they run about 30%
faster than the basic implementations, but at the cost of less error
checking and a requirement that the file format exactly match the
database's format):

     o  gbfast
     o  pirfast
     o  emblfast
     o  spfast

My advice is that these formats only be used when searching the actual
databases, and the basic file formats be used the rest of the time. The
difference in time only becomes significant when reading files in the
multi-megabyte range.

Finally, there are also format variants which have been added to
account for FASTA, NBRF and IG/Stanford format limitations commonly
in use. For FASTA and IG/Stanford, the limitation is that only one
header line (any line beginning with a '>' or ';') may appear in the entry.
For NBRF, the limitation is that no lines like "C;Accession:" or
"C;Comment:" may appear after the sequence. The formats below have
a different output function which outputs entries in these limited
formats (at the cost of losing some information about the sequences).
Thus, the package can output entries that are readable by other
programs which require the limited format.

     o  NBRF-old  (NBRFold)
     o  FASTA-old  (FASTAold)
     o  Stanford-old  (Stanfordold, IG-old, IGold)

These three format variants are included in the `GCG-*' set of formats.

File Format Types
=================

Each format is considered to be one of the following types, which gives
a basic description of the capabilities and common uses of the format:

T_SEQONLY
     The entries of the format contain only a sequence. They do not
     contain any place to store sequence information or comments.
     (Plain, Raw)
T_DATABANK
     The entries are used mainly to store unadorned sequences (i.e.,
     not used for sequences containing alignment characters).
     (GenBank, PIR, EMBL, Swiss-Prot, their GCG-* forms, ASN.1)
T_GENERAL
     The entries can contain both unadorned sequences and
     alignment sequences. In addition, there is a place to store
     sequence information and comments.
     (FASTA, NBRF, IG/Stanford, their GCG-* forms, GCG)
T_LIMITED
     The entries can contain both unadorned sequences and
     alignment sequences, but there is no place to store extra
     sequence information and comments.
     (FASTA-old, NBRF-old, IG-old, their GCG-* forms)
T_ALIGNMENT
     The entries are used mainly to store multiple sequence
     alignments. They are not considered to contain much sequence
     information and do not have any place to store comments.
     (PHYLIP, Clustalw, MSF)
T_OUTPUT
     The format is the output of an alignment program, and these
     formats are read-only formats.
     (FASTA-output, BLAST-output)

These types may be of some use when developing software that wishes
to perform different operations based on this file type information (the
"fmtseq" program included in the distribution is one such piece of
software).

(NOTE: Why is having someplace to store comments so important?
Well, one of the goals of this package is to try to unify all of the file
formats and be able to capture and transfer as much information as
possible from one format to another. The plans are to use these
comment sections as the place to store any extra information for which
there is no explicit spot in the entry. And that can't happen if the file
format doesn't have a comment section. This is also the reason for the
FASTA, NBRF and IG/Stanford variants mentioned above.)
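
As an illustration of how software might branch on these types, here is
a minimal Python sketch. It is not part of SEQIO (which is a C package),
and the names `FormatType`, `format_type` and `is_writable`, as well as
the lookup table, are hypothetical. The table holds only the type
assignments listed above, and it assumes that a GCG-* form has the same
type as its base format.

```python
from enum import Enum


class FormatType(Enum):
    """The six format types named in the text above."""
    T_SEQONLY = 1
    T_DATABANK = 2
    T_GENERAL = 3
    T_LIMITED = 4
    T_ALIGNMENT = 5
    T_OUTPUT = 6


# Hypothetical lookup table, keyed by lower-cased format name.
_BASE_TYPES = {
    "plain": FormatType.T_SEQONLY,
    "raw": FormatType.T_SEQONLY,
    "genbank": FormatType.T_DATABANK,
    "pir": FormatType.T_DATABANK,
    "embl": FormatType.T_DATABANK,
    "swiss-prot": FormatType.T_DATABANK,
    "asn.1": FormatType.T_DATABANK,
    "fasta": FormatType.T_GENERAL,
    "nbrf": FormatType.T_GENERAL,
    "ig/stanford": FormatType.T_GENERAL,
    "gcg": FormatType.T_GENERAL,
    "fasta-old": FormatType.T_LIMITED,
    "nbrf-old": FormatType.T_LIMITED,
    "ig-old": FormatType.T_LIMITED,
    "phylip": FormatType.T_ALIGNMENT,
    "clustalw": FormatType.T_ALIGNMENT,
    "msf": FormatType.T_ALIGNMENT,
    "fasta-output": FormatType.T_OUTPUT,
    "blast-output": FormatType.T_OUTPUT,
}


def format_type(name):
    """Return the type of a format name; GCG-X is typed like X."""
    key = name.lower()
    if key.startswith("gcg-"):
        key = key[len("gcg-"):]
    return _BASE_TYPES[key]  # raises KeyError for names not in the table


def is_writable(name):
    """T_OUTPUT formats are read-only; every other type can be output."""
    return format_type(name) is not FormatType.T_OUTPUT
```

A conversion tool could, for example, refuse to write when
`is_writable("BLAST-output")` is false, or warn about lost comments when
the target format is of type T_LIMITED or T_ALIGNMENT.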

Automatically Determining the Format Type
=========================================

The SEQIO package has the ability to automatically determine the
format of a file, if that file is one of the following formats:

     Plain, GenBank, PIR, EMBL/Swiss-Prot, FASTA, NBRF,
     IG/Stanford, ASN.1, GCG, GCG-*, MSF, PHYLIP, Clustalw,
     FASTA-output, BLAST-output

The Raw format and all of the format variations (*-old, *fast) must be
explicitly specified in order to be used. The package makes the format
determination in two phases. The first phase looks at the initial
non-whitespace text of the file. The second phase looks at the text of
the first entry in the file. Both of these phases occur during the
opening of the file.

First Phase
+++++++++++

The first phase operation first skips over an e-mail header at the
beginning of the file, if the file begins with the string "From ". It then
looks for the first non-whitespace character of the file and attempts to
match that non-whitespace text to one of the following keywords
(where the matching is case-insensitive and the `?' character is a
wildcard which can match any character in the file):

         GenBank  -  "LOCUS ", "GB???.SEQ Genetic Sequence Data Bank"
            NBRF  -  ">??;"
           FASTA  -  ">"
            EMBL  -  "ID ", "CC ", "XX "
             PIR  -  "\\\", "ENTRY", "P R O T E I N S E Q U E N C E D A T A B A S E"
     IG/Stanford  -  ";"
           ASN.1  -  "Bioseq-set ::= {", "Seq-set ::= {"
       FASTA-out  -  "FASTA", "TFASTA", "SSEARCH", "LFASTA", "LALIGN", "ALIGN"
          PHYLIP  -  "0", "1", "2", "3", "4", "5", "6", "7", "8", "9"
        Clustalw  -  "CLUSTAL"
             MSF  -  "PileUp"
       BLAST-out  -  "BLASTN", "BLASTP", "BLASTX"

The keyword matching occurs in the order specified here, and the first
matching keyword specifies the file format.
So, for NBRF and FASTA files, if the first entry's header line has a ';'
as the third character after the initial '>', the file format is taken to
be NBRF. Files without that semi-colon are taken to be in FASTA format.

If there's a match, then the file format has been determined. Otherwise,
the file's format is considered to be `Plain' at this point.

Second Phase
++++++++++++

The second phase distinguishes more subtle variations of the file
formats by looking in more detail at the text of the entries. The possible
changes in the determined format are the following:

  o  For EMBL files, the "ID " line of each entry is scanned, and if it
     contains exactly 2 semi-colons, a period and the string "PRT"
     occurring before the second semi-colon, the entry is taken to be
     a Swiss-Prot entry.

     In addition, if the string occurring before the last semi-colon on
     the "ID " line is "EPD", then the entry identifier is taken to be an
     EPD database identifier, but the entry itself is still considered to
     be an EMBL formatted entry.

  o  For each of the basic formats that has a GCG-* form, if the entry's
     sequence lines are in the GCG format, then the entry is
     considered to be in the corresponding GCG-* format (so, a
     GenBank format becomes a GCG-GenBank format).

  o  For PHYLIP files, each entry is checked to see if it is in the
     Interleaved or Sequential format. This checking is a complete
     match of the text to the two formats, so the likelihood of an
     incorrect determination is remote. See below in the description
     of the PHYLIP format for more details.

  o  For Plain files (as determined by phase 1), the entry text is
     checked to see if a line ending with the string ".." occurs (or,
     more precisely, a line whose last non-whitespace characters are
     ".."). If so, the file is considered either a GCG or MSF file. If the
     line ending with the ".."
     contains the string "MSF:", then the
     entry is considered to be an MSF file. If not, the entry is
     considered to be a GCG file.



The SEQIO File Format Implementations
*************************************

The package has six main (internal) operations that encapsulate the
details of the file formats. Those operations are:

read
     Read the input file to find the beginning and end of the next
     entry in the file. Also, find the beginning of the lines containing
     the sequence and if the entry explicitly specifies a sequence
     length, get that value.
getseq
     Retrieve the sequence, if it exists, from the entry.
rawseq
     Retrieve the raw sequence, if it exists, from the entry. The raw
     sequence typically contains the sequence characters plus any
     alignment or notational characters.
getinfo
     Get one piece or all of the SEQINFO information from the entry.
putseq
     Given a sequence and SEQINFO structure, output a correctly
     formatted entry.
annotate
     Output an entry's text, adding new text to its comment section
     (creating a comment section, if none exists in the entry).

Each of the supported file formats will be described in terms of what
those six operations do for that format.

General Comments
================

  o  There are no limits on lengths of anything (lines, entries,
     sequences, etc.), except for memory limitations and when
     outputting formats whose official descriptions specify a
     maximum line length (see below in the format descriptions).

  o  When outputting formats that do have a maximum line length,
     long description/organism/comment lines are broken at
     word boundaries. That maximum line length is maintained unless
     there is a single word that is longer than the line length. That
     word is not broken up, but is output on a line that will be longer
     than the maximum length.

  o  Except for gbfast, emblfast, spfast and pirfast, the case of the
     entry's keywords is irrelevant (they can be in upper or lower
     case, or any mixture of the two). The "fast" formats require
     keywords in upper case (as occurs in the databases).

  o  When outputting in the Plain, FASTA, NBRF or IG/Stanford
     formats, the putseq operation looks at the sequence being
     output, and may add whitespace to the output sequence to make
     it look prettier. By default, the extra spaces are added when the
     sequence is DNA, RNA or Protein and when there are no
     non-alphabetic characters in the sequence (such as alignment
     characters).

     This prettying operation can be turned off or turned on for all
     sequences using the function `seqfsetpretty'.



Raw Format
**********

In the raw format, all of the characters of the file are the characters of
the sequence (including spaces, newlines, non-printable characters,
and so on).

The read operation simply reads the whole file. The getseq and rawseq
operations return that text. The getinfo operation merely stores the
filename in the description field. The putseq operation just outputs the
sequence characters. And there is no annotate operation.



Plain Format
************

In the plain format, all of the alphabetic characters of the file are taken
as the characters of the sequence, while spaces, newlines, position
numbers and other punctuation characters are ignored.

The read operation reads in the whole file. The getseq operation
extracts all of the alphabetic characters from the text. The rawseq
operation extracts all of the non-whitespace and non-numeric
characters from the text. The getinfo operation stores the filename in
the description field.

The putseq operation outputs the sequence in one of two formats,
depending on the sequence's alphabet.
If the alphabet is DNA, RNA or
Protein, or the alphabet is Unknown but does not contain newline
characters, the sequence is output 60 sequence characters per line,
with interspersed spaces to improve the look of the output. If the
alphabet is Unknown and it contains newline characters, then it is
output as is.



GenBank Flat-File Format
************************

The read operation first looks for a "LOCUS" line and extracts the
sequence length from positions 23-29 of that line (if the text there
consists of digits). Then, it looks for the entry ending "//" line, along
with the "ORIGIN" line which specifies where the sequence lines begin.
The "ORIGIN" line is not required; however, if it does not exist, the entry
is assumed to contain no sequence.

The getseq operation scans the sequence lines, from just after the
"ORIGIN" line to the "//" line. All alphabetic characters there are
assumed to be part of the sequence. No assumptions are made about
the format of these lines.

The rawseq operation is the same as the getseq operation, except that
all non-whitespace and non-numeric characters are considered part of
the sequence.

The getinfo operation looks first at the "LOCUS" line. It takes the
identifier from positions 13-22 (and assumes it's a GenBank id, unless
marked by an identifier prefix), the alphabet determination from
positions 37-40, whether it's circular from the existence of the keyword
"circular" at positions 43-52, and the date from positions 63-73. Then,
it looks for the "ACCESSION", "NID", "PID", "DEFINITION",
"COMMENT" and "SOURCE" lines, where `lines' here means one or
more text lines corresponding to that part of the entry and where the
lines can appear in any order. Accession numbers, NID numbers and
PID numbers are extracted from the "ACCESSION", "NID" and "PID"
lines, respectively. The description is taken from the "DEFINITION" line.
Comments are retrieved from the "COMMENT" line. The organism name
is taken from the "ORGANISM" sub-record of the "SOURCE" line. The
getinfo operation cannot determine the value of the isfragment field
(since that is not explicitly given anywhere in the entry).

The putseq operation outputs an entry with the following lines (in
order): LOCUS, DEFINITION, ACCESSION, NID, SOURCE/ORGANISM,
COMMENT, BASE COUNT, ORIGIN, sequence lines, //. The form of
these lines follows that described in the GenBank Release Notes, with
the following exceptions:

  o  Except for the LOCUS line and the ORIGIN-sequence-// lines, no
     lines are output if the SEQINFO information for that line does not
     exist.
  o  Only a non-accession identifier 10 characters long or less is
     output on the LOCUS line. If there are no such identifiers in the
     idlist, then the keyword "Unknown" is output (or "(below)" to
     signal that the long identifiers occur in the COMMENT lines).
  o  On the LOCUS line, the "bp" in positions 31-32 may be replaced
     with "aa" or "ch" if the alphabet is Protein or Unknown. The
     alphabet string in positions 37-40 could be "PRT" or "UNK" for
     the same reason. The output classification in positions 53-55 is
     "UNC" (Unclassified). And finally, the date in positions 63-73 is
     "01-JAN-0000" if no date is specified in the SEQINFO structure.
  o  The history lines, and any extra references, are output at the end
     of the COMMENT lines (or a COMMENT line is added which
     contains those lines). Each of the added lines begins with the
     keyword "SEQIO".

The annotate operation replaces or appends to the COMMENT line, if it
exists. If no COMMENT line exists, then a new COMMENT line will be
inserted (or rather output between the existing lines of the entry) just
before one of the following lines (whichever comes first in the entry):
FEATURES, BASE COUNT or ORIGIN.
One of those lines must appear
in the entry.

Example GenBank entry:

LOCUS       A02201        664 bp    DNA             UNC       10-MAR-1993
DEFINITION  Phage phi-105 DNA for immF plypeptide.
ACCESSION   A02201
SOURCE      .
  ORGANISM  Bacteriophage phi-105
COMMENT     NCBI gi: 345121

            SEQIO retrieval from GenBank database entry.  07-Feb-1996
BASE COUNT      237 a    111 c    144 g    172 t
ORIGIN
        1 tgatcaccta tctcctttac aacacatagt gcctcactgt gccactgtgt cttgtggcat
       61 gacacaatta tagtatccga atgtcggaaa tacaatacta aaaaagacgg aaatacaagt
      121 attttttagt aaattgacgg aaatacaaga taaatactct ctgaatcttt aaaatgcttg
      181 aatttcgtca aatttcgact tttacaaaat gtcgtgaata ccatacaatt tagacatacc
      241 ttaacgggag gtgataatca tgctggatgg gaaaaagctt ggggctttaa ttaaggacaa
      301 aagaaaagaa aagcacttga aacagacaga aatggcgaag gcactgggta tgtccagaac
      361 ttatctctct gatatcgaaa acggcagata tctgccgagt acaaaaacac tttccagaat
      421 agcgatttta ataaatctgg atttaaatgt gttaaaaatg acggaaatac aagtagttga
      481 ggagggtgga tatgatagag ctgccggcac atgtagaaga caggctttat gagattttta
      541 tgaaactatc agttccaagg ttgcttgaga aagaagccct ggagaaagga gagaagccga
      601 atgcggaaag aaaaggcgct tgacctcgcg gccttcttcg ctgaatttga acaaatgatg
      661 atca
//



GBFAST Variation of GenBank
***************************

The read operation performs the same steps as the GenBank read,
however it makes some additional assumptions. First, all keywords
must appear in uppercase. Second, the sequence length must appear
in positions 23-29 on the "LOCUS" line. Third, an "ORIGIN" line must
appear in the entry (as must a sequence). Fourth, all of the lines of
sequence except the last must be in the format as described in the
Release Notes, and so must be 75 characters long (9 characters for the
position number, 60 characters of sequence, 6 spaces), plus the
newline characters. See the above example.
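
The fixed 75-character layout described above can be sketched in a few
lines of Python. This is only an illustration of the column arithmetic,
not SEQIO's C implementation; the function name
`split_fixed_sequence_line` is hypothetical, and the sketch assumes that
each of the six 10-character groups is preceded by exactly one space, as
in the example entry.

```python
def split_fixed_sequence_line(line):
    """Split one full GenBank sequence line into (position, residues).

    Assumed layout: 9 characters of position number, then six groups,
    each made of one space followed by 10 sequence characters
    (9 + 6 * 11 = 75 characters, not counting the newline).
    """
    text = line.rstrip("\r\n")
    if len(text) != 75:
        # Only full lines have this shape; the last line of an entry
        # may be shorter and needs the general (slower) scan instead.
        raise ValueError("not a full 75-character sequence line")
    position = int(text[:9])
    # Group i occupies columns 10 + 11*i .. 19 + 11*i (0-based).
    groups = [text[10 + 11 * i : 20 + 11 * i] for i in range(6)]
    return position, "".join(groups)
```

Because the characters are taken purely by position, a line that does
not follow the layout yields wrong sequence characters rather than an
error, unless its length happens to differ from 75.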

The getseq operation assumes that the sequence lines are in the format
described in the previous paragraph, and all of the characters in the
correct positions in that format are assumed to be characters of the
sequence. So, if the line format is incorrect, you will get garbage as the
sequence.

The rawseq operation here is exactly the same as the getseq operation,
since the GenBank sequences don't contain other characters.

The getinfo, putseq and annotate functions are the same as in the
GenBank format.



PIR/CODATA Format
*****************

The read operation first looks for an "ENTRY" line. It then looks for the
entry ending "///" line, but during this scan it also looks for the
"SUMMARY" line and the "SEQUENCE" line. If the "SUMMARY" line is
found, the sequence length is extracted by scanning for "#length" on
the line, and then looking for digits after that keyword. The
"SEQUENCE" line specifies the beginning of the sequence lines
(starting on the next line), and no sequence is assumed to appear in
the entry if the "SEQUENCE" line is missing.

The getseq operation scans the sequence lines from just after the
"SEQUENCE" line to the "///" line ending the entry. All alphabetic
characters on those lines are assumed to be in the sequence. No
format for those lines is assumed.

The rawseq operation is the same as the getseq operation, except that
all non-whitespace and non-numeric characters are considered part of
the sequence.

The getinfo operation first looks at the "ENTRY" line. The next word
(i.e., non-whitespace string) after the "ENTRY" keyword is taken for an
identifier, and then the rest of the line is searched for a "#type" option.
If the word after "#type" is "fragment", the isfragment field is set to 1.
Then, the entry is searched for the "ACCESSIONS", "COMMENT",
"DATE", "ORGANISM" and "TITLE" lines, which can appear in any
order. The "ACCESSIONS" line holds accession numbers (and the
search for the "ACCESSIONS" line will also find lines beginning with
just "ACCESSION", for backward compatibility). The "COMMENT" lines
hold comments. The "DATE" line holds the date, and the date taken is
the last given on the line, with the assumption being that the dates on
the line are specified from oldest to newest (not absolutely accurate,
but handling dates better is on my TODO list). The "TITLE" line holds
the description, an optional organism name and possibly one of the
keywords "(fragment)", "(fragments)" or "(tentative sequence)". The text
before the string " - " is taken for the description, and the rest of the
text, except for a trailing keyword, is taken for the organism name. If the
keywords "(fragment)" or "(fragments)" appear at the end of the string,
isfragment is set to 1. If "(tentative sequence)" appears, it is considered
part of the description. The "ORGANISM" line holds an organism name
which is taken if the "TITLE" line does not specify an organism.

The putseq operation outputs a PIR entry containing the following lines
(in order): ENTRY, TITLE, ORGANISM, DATE, ACCESSIONS,
COMMENT, SUMMARY, SEQUENCE, sequence lines, ///. The format of
those lines follows the PIR Release Notes, with the following
exceptions:

  o  The TITLE, ORGANISM, DATE, ACCESSIONS and COMMENT
     lines may not appear, if the SEQINFO structure does not contain
     the appropriate information.
  o  If no idlist is given, the keyword "UNKNWN" is output on the
     ENTRY line, instead of the sequence identifier.
  o  The SEQIO package attempts to follow the guidelines for the
     TITLE line (i.e., description " - " organism, and an optional
     "(fragment)") as best it can.
     Depending on the text of the
     description and organism fields, this may or may not turn out
     well.
  o  The organism name is output in the "#formal_name" field of the
     ORGANISM line, even though it may not be the formal name of
     the organism. (Better handling of the organism names is another
     thing on my TODO list.)
  o  The SUMMARY line only contains the "#length" field on it.
  o  The history lines, and any extra references, are output at the end
     of the COMMENT lines (or a COMMENT line is added which
     contains those lines). Each of the added lines begins with the
     keyword "SEQIO".

The annotate operation replaces or appends to the COMMENT line, if it
exists. If no COMMENT line exists, then a new COMMENT line will be
inserted just before one of the following lines (whichever comes first in
the entry): GENETIC, CLASSIFICATION, KEYWORDS, FEATURE,
SUMMARY or SEQUENCE. One of those lines must appear in the entry.

Example PIR entry:

ENTRY           CCMST   #type complete
TITLE           cytochrome c, testis-specific - mouse
ORGANISM        #formal_name mouse
DATE            04-Nov-1994
ACCESSIONS      B28160; A00012
COMMENT         Mammalian testis contains two forms of cytochrome c, one identical
                with the form found in somatic tissues and another that is
                expressed in a stage-specific manner during spermatogenic
                differentiation.

                SEQIO retrieval from PIR database entry.  07-Feb-1996
SUMMARY         #length 105
SEQUENCE
                5        10        15        20        25        30
      1 M G D A E A G K K I F V Q K C A Q C H T V E K G G K H K T G
     31 P N L W G L F G R K T G Q A P G F S Y T D A N K N K G V I W
     61 S E E T L M E Y L E N P K K Y I P G T K M I F A G I K K K S
     91 E R E D L I K Y L K Q A T S S
///



PIRFAST Variation of PIR
************************

The read operation performs the same steps as the PIR read, however it
makes some additional assumptions. First, all keywords must appear in
uppercase.
Second, a "SUMMARY" line must appear in the entry, and
it must contain a "#length" field (although the field can appear
anywhere on the line). Third, a "SEQUENCE" line must appear in the
entry immediately after the "SUMMARY" line (and the entry must
contain a sequence). Fourth, the format of the sequence lines must be
as given in the PIR database, and so must be either 67 or 68 characters
long (7 characters for the position number, 30 characters of sequence,
30 or 31 spaces or notational characters), plus the newline character.
See the above example.

The getseq operation assumes that the sequence lines are in the format
described in the previous paragraph, and all of the characters in the
correct positions in that format are assumed to be characters of the
sequence. So, if the line format is incorrect, you will get garbage as the
sequence.

The rawseq operation here does not use the "fast" implementation, but
uses the rawseq operation of the basic PIR format.

The getinfo, putseq and annotate functions are the same as in the PIR
format.



EMBL/Swiss-Prot File Formats
****************************

     NOTE: The EMBL and Swiss-Prot file format implementations are
     essentially the same, differing only in their putseq and annotate
     operations. So, we'll describe them together.

     NOTE2: The EMBL read, getseq and getinfo implementations have
     been tested on, and are compatible with, the "EMBL" entries in
     the EMBL, EPD, aids-db, ENZYME, PROSITE and Swiss-Prot
     databases. Because of the variations of the entries in these
     databases, some of the assumptions made in the implementations
     will differ from the official EMBL or Swiss-Prot file format
     descriptions.

The read operation first looks for an "ID " line. It then looks for the
entry ending "//" line, but during this scan it also looks for an "SQ "
line and a line beginning with two spaces.
If the "SQ " line is found and
the next word after "SQ Sequence" consists of digits, it is taken for the
sequence length. The first line beginning with two spaces is assumed
to be the beginning of the sequence lines, and if no such lines appear,
the entry is assumed to contain no sequence.

The getseq operation scans the sequence lines from the first line
beginning with two spaces to the "//" line ending the entry. All
alphabetic characters on those lines are assumed to be in the
sequence. No format for those lines is assumed.

The rawseq operation is the same as the getseq operation, except that
all non-whitespace and non-numeric characters are considered part of
the sequence.

The getinfo operation first looks at the "ID " line. The next word (i.e.,
non-whitespace string) after the "ID" keyword is taken for an identifier,
and an attempt is made to determine if it is an EMBL id, an EPD id, a
Swiss-Prot id, or something else. It does this by counting the number
of semi-colons on the line and checking whether the line ends with a
period. If three semi-colons and a period are found, then the string just
before the third semi-colon is checked, and the identifier is assumed to
be an EPD id if that string is "EPD" and is assumed to be an EMBL id
otherwise. If two semi-colons and a period are found, and the string
just before the second semi-colon is "PRT", the identifier is assumed
to be a Swiss-Prot id. Otherwise, the identifier is some other id. After
figuring out the type of identifier and extracting it from the line, the rest
of the line is searched for words that specify the alphabet ("DNA",
"RNA", "PRT", and so on) and whether the sequence is circular
("circular").

Then the rest of the entry is searched for the "AC ", "NI ", "PI ", "DT ",
"DE ", "OS ", "CC " and "XX " lines, which can appear in any order.
The
"AC ", "NI " and "PI " lines contain accession, NID and PID numbers.
The "DT " lines contain dates, of which the date on the last "DT " line is
taken, under the assumption that the dates are given from oldest
to newest. The "DE " lines contain the description, and may end with
one of the keywords "(fragment)" or "(fragments)", in which
case isfragment is set to 1. The "OS " lines specify the organism name.
The "CC " and "XX " lines specify the comment lines, about which there
are a couple of things to note. First, an "XX " line is different from any
line beginning with "XX", in that three spaces must appear after the "XX"
and non-whitespace text must appear after that, in order for it to be
considered a comment line. These lines do not occur in the official
EMBL or Swiss-Prot formats, but do appear in some of the variations.
Second, more than one comment section can appear in an entry. When
a "CC " line is reached, the comment section beginning at that line is
assumed to consist of all "CC " and "XX" lines (note the lack of spaces
after the "XX") following that line, up to the first line not beginning with
"CC" or "XX" (and ignoring a trailing "XX" line). When an "XX " line is
seen, all following "XX " lines are considered part of that comment
section. The text of these sections is concatenated together to make
up the comment lines.

For the EMBL format, the putseq operation outputs an EMBL entry
containing the following lines (in order): ID, AC, NI, DT, DE, OS, CC,
SQ, sequence lines, //. In the output, XX lines are added between each
of the lines (except the sequence lines) as specified in the EMBL format.
The format of the lines follows the EMBL Release Notes, with the
following exceptions:

  o  The AC, NI, DT, DE, OS, and CC lines may not appear if the
     SEQINFO structure does not contain the appropriate
     information.
  o  On the ID line, if no idlist is given, the keyword "Unknown" is
     output instead of an identifier. The keyword "converted" is
     output instead of "standard" or "preliminary". The keyword
     "UNC" is output instead of the classification code. The keyword
     "UNK" might be output for the alphabet, if the alphabet is
     Unknown. And, the keyword "AA" or "CH" could appear after the
     sequence length, if the alphabet is Protein or Unknown.
  o  There will be at most one DT line, and it will only contain the
     specified date.
  o  Instead of outputting "XX" lines to specify a `blank' line in a
     comment, a line containing "CC " followed immediately by a
     newline is output (so, in my design of the comment sections, the
     comments are specified by the "CC " lines).
  o  The history lines, and any extra references, are output at the end
     of the output comment section. Each of the added lines begins
     with the keyword "SEQIO".

For the Swiss-Prot format, the putseq operation outputs a Swiss-Prot
entry containing the following lines (in order): ID, AC, DT, DE, OS, CC,
SQ, sequence lines, //. The format of the lines follows the Swiss-Prot
Release Notes, with the following exceptions:

  o  The AC, DT, DE, OS, and CC lines may not appear if the
     SEQINFO structure does not contain the appropriate
     information.
  o  On the ID line, if no idlist is given, the keyword "Unknown" is
     output instead of an identifier. The keyword "converted" is
     output instead of "standard" or "preliminary".
  o  The alphabet keyword could be "RNA", "DNA" or "UNK" if the
     alphabet is not Protein. And, the keyword "circular" could
     appear before the alphabet (if iscircular is 1). The keyword "BP"
     or "CH" could appear after the sequence length, if the alphabet
     is DNA, RNA or Unknown.
  o  There will be at most one DT line, and it will only contain the
     specified date.
    o  The history lines, and any extra references, are output at the end
       of the output comment section. Each of the added lines begins
       with the keyword "SEQIO".

For the EMBL format, the annotate operation replaces or appends to
the "CC   " or "XX   " lines, if they exist. The operation looks for the
first comment section, and will insert or replace at that point. If no
comment section exists, then a new comment section using "CC   " lines
will be inserted (or rather output between the existing lines of the
entry) as follows. If a "DR   ", "PR   ", "FH   " or "FT   " line appears in
the entry, the comment is inserted just before the first of those lines.
Otherwise, the comment is inserted just before the "SQ   " or "     "
(i.e., sequence) lines. One of these lines must appear in the entry.

For the Swiss-Prot format, the annotate operation replaces or appends
to the "CC   " lines, if they exist. If no comment section exists, then a
new comment section will be inserted (or rather output between the
existing lines of the entry) as follows. If a "DR   ", "KW   " or "FT   "
line appears in the entry, the comment is inserted just before the first
of those lines. Otherwise, the comment is inserted just before the
"SQ   " or sequence lines. One of these lines must appear in the entry.

Example EMBL entry:

ID   CM23SRIBR converted; DNA; UNC; 805 BP.
XX
AC   X80636;
XX
DT   22-MAR-1995
XX
DE   C.mucosalis gene for 23S ribosomal RNA (fragment)
XX
OS   Campylobacter mucosalis
XX
CC   SEQIO retrieval from EMBL-format entry. 07-Feb-1996
XX
SQ   Sequence 805 BP; 226 A; 158 C; 224 G; 194 T; 3 other;
     gattctgcgc ggaaaatata acggggctaa aatgagtacc gaagctttag acttagtttt        60
     actaagtggt aggagcgttc tattcagcgt tgaaggtgta ccggtaagga gcgctggagc       120
     ggatagaagt gagcatgcag gcatgagtag cgataattgg ggtgagaatc cccaacgccg       180
     taarcccaag gtttcctacg cgatgctcgt catcgtaggg ttagccgggt cctaagcaaa       240
     gtccgaaagg ggtatgcgat ggaaaattgg ttaatattcc aatgccaaca ttattgtgcg       300
     atggaaggac gcttagagtt aaaggagcca gctgatggaa gtgctggtcg aaaggtgtag       360
     gttgagttac aggcaaatcc gtaactcttt atccgagacc ccacaggcgt ttgaagttct       420
     tcggaatgga tgacgaatcc ttgatactgt cgagccaaga aaagtttcta agtttagata       480
     atgttgcccg taccgtaaac cgacacaggt gggtgggatg agtattctaa ggcgcgtgga       540
     agaactctct tcaaggaact ctgcaaaata gcaccgtatc ttcggtataa ggtgtgccta       600
     actttgtgaa ggatttactc cgtaagcatt gaaggttaca acaaagagtc cctcccgact       660
     gtttaccaaa aacacagcac tctgctaact cgtaagagga tgtatagggt gtgacgcctg       720
     cccggtgctc gaaggttaat tgatggggty agcagyaatg cgaagctctt gatcgaagcc       780
     cgagtaaacg gccgccgtaa ctata                                             805
//

Example Swiss-Prot entry:

ID   104K_THEPA CONVERTED; PRT; 924 AA.
AC   P15711;
DT   01-AUG-1992
DE   104 KD MICRONEME-RHOPTRY ANTIGEN.
OS   THEILERIA PARVA.
CC   -!- DEVELOPMENTAL STAGE: SPOROZOITE ANTIGEN.
CC   -!- SUBCELLULAR LOCATION: IN MICRONEME/RHOPTRY COMPLEXES.
CC
CC   SEQIO retrieval from Swiss-Prot database entry. 07-Feb-1996
SQ   SEQUENCE 924 AA;
     MKFLILLFNI LCLFPVLAAD NHGVGPQGAS GVDPITFDIN SNQTGPAFLT AVEMAGVKYL
     QVQHGSNVNI HRLVEGNVVI WENASTPLYT GAIVTNNDGP YMAYVEVLGD PNLQFFIKSG
     DAWVTLSEHE YLAKLQEIRQ AVHIESVFSL NMAFQLENNK YEVETHAKNG ANMVTFIPRN
     GHICKMVYHK NVRIYKATGN DTVTSVVGFF RGLRLLLINV FSIDDNGMMS NRYFQHVDDK
     YVPISQKNYE TGIVKLKDYK HAYHPVDLDI KDIDYTMFHL ADATYHEPCF KIIPNTGFCI
     TKLFDGDQVL YESFNPLIHC INEVHIYDRN NGSIICLHLN YSPPSYKAYL VLKDTGWEAT
     THPLLEEKIE ELQDQRACEL DVNFISDKDL YVAALTNADL NYTMVTPRPH RDVIRVSDGS
     EVLWYYEGLD NFLVCAWIYV SDGVASLVHL RIKDRIPANN DIYVLKGDLY WTRITKIQFT
     QEIKRLVKKS KKKLAPITEE DSDKHDEPPE GPGASGLPPK APGDKEGSEG HKGPSKGSDS
     SKEGKKPGSG KKPGPAREHK PSKIPTLSKK PSGPKDPKHP RDPKEPRKSK SPRTASPTRR
     PSPKLPQLSK LPKSTSPRSP PPPTRPSSPE RPEGTKIIKT SKPPSPKPPF DPSFKEKFYD
     DYSKAASRSK ETKTTVVLDE SFESILKETL PETPGTPFTT PRPVPPKRPR TPESPFEPPK
     DPDSPSTSPS EFFTPPESKR TRFHETPADT PLPDVTAELF KEPDVTAETK SPDEAMKRPR
     SPSEYEDTSP GDYPSLPMKR HRLERLRLTT TEMETDPGRM AKDASGKPVK LKRSKSFDDL
     TTVELAPEPK ASRIVVDDEG TEADDEETHP PEERQKTEVR RRRPPKKPSK SPRPSKPKKP
     KKPDSAYIPS ILAILVVSLI VGIL
//



EMBLFAST/SPFAST Variation of EMBL/Swiss-Prot
********************************************

The read operation performs the same steps as the EMBL/Swiss-Prot
read; however, it makes some additional assumptions. First, all
keywords must appear in uppercase, with one exception noted next.
Second, an "SQ   Sequence" line must appear in the entry, although the
keyword "Sequence" can appear in uppercase, as in "SQ   SEQUENCE".
Third, the sequence length must be the next word after "SQ   Sequence".
Fourth, the format of the sequence lines must occur as in the EMBL or
Swiss-Prot databases. The EMBL sequence lines are 80 characters
long (5 spaces, 60 sequence characters with 5 interspersed spaces,
and 10 characters with a right justified position number), plus the
newline character. The Swiss-Prot sequence lines are 70 characters
long (same as EMBL except no position numbers), plus the newline.

The getseq operation assumes that the sequence lines are in the format
described in the previous paragraph, and all of the characters in the
correct positions in that format are assumed to be characters of the
sequence. So, if the line format is incorrect, you will get garbage as the
sequence.

The rawseq operation here is exactly the same as the getseq operation,
since the EMBL and Swiss-Prot sequences don't contain other
characters.

The getinfo, putseq and annotate functions are the same as in the
EMBL/Swiss-Prot format.



FASTA/FASTA-old File Formats
****************************

    NOTE: The implementation of the FASTA format here follows the
    format described in the FASTA program documentation, with the
    exception that, at the beginning of the entry, multiple lines
    beginning with either '>' or ';' can appear. This was done in order
    to better distinguish the entry's header lines from the sequence
    lines (where comments beginning with ';' are permitted). This
    exception only occurs when reading FASTA entries. The FASTA
    output functions only use ';' for those additional header lines.

The read operation looks for a line beginning with '>'. That line is taken
as the header/description line for the entry. If that line has been
formatted using the standard one-line description format (see file
"user.doc"), then the sequence length is extracted from that line. The
operation then looks for the next line which does not begin with a '>'
and which does not begin with a ';'. If such a line occurs before the
next line with a '>', that line is the first line of the sequence. Finally, the
operation looks for the entry's end at either the next line which does
begin with a '>' or the end of the file.
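The entry-splitting steps of the FASTA read operation can be sketched in
Python as follows. This is only an illustration of the rules described
above, not SEQIO's own C code; the function name split_fasta_entries
is hypothetical, and the sketch assumes the whole file is already in
memory as one string.

```python
def split_fasta_entries(text):
    """Yield (header_lines, sequence_lines) for each entry in FASTA text.

    Hypothetical helper: it follows the reading rules described in the
    text, and is not taken from the SEQIO source.
    """
    lines = text.splitlines()
    i, n = 0, len(lines)
    while i < n:
        # Skip forward to the '>' line that opens an entry.
        if not lines[i].startswith(">"):
            i += 1
            continue
        headers = [lines[i]]
        i += 1
        # Additional header lines: any following line starting with '>' or ';'.
        while i < n and lines[i][:1] in (">", ";"):
            headers.append(lines[i])
            i += 1
        # Sequence lines run to the next '>' line or the end of the file.
        seq_lines = []
        while i < n and not lines[i].startswith(">"):
            seq_lines.append(lines[i])
            i += 1
        yield headers, seq_lines
```

Treating the header block as "every line up to the first one that starts
with neither '>' nor ';'" is one reading of the rule above; an entry with
no sequence lines simply yields an empty sequence list.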

The getseq operation scans the sequence lines (all of the lines not
beginning with '>'). All alphabetic characters on those lines are
assumed to be in the sequence, except that when a semi-colon
appears on a line, the rest of that line is considered a comment and not
part of the sequence. No format for those lines is assumed.

The rawseq operation is the same as the getseq operation, except that
all non-whitespace and non-numeric characters are considered part of
the sequence.

The getinfo operation first looks at the first header line of the entry, and
parses it according to the one-line description format specified in file
"user.doc". It then considers any following lines that begin either with a
'>' or a ';' as comment lines. Any other comments in the entry are
ignored.

In the FASTA format, the putseq operation outputs a first header line
according to the one-line description format. The comment/history
lines and the sequence identifiers are output as additional header lines
that begin with a ';'. Finally, the sequence is output.

In the FASTA-old format, the putseq operation only outputs the first
header line and the sequence lines. No comment/history lines are
output, and the identifiers appear in the header line.

In the FASTA format, the annotate operation either replaces, appends
or inserts the comment lines just after the first header line. There is no
annotate operation in the FASTA-old format.

Example FASTA entry:

>gb:A14666|acc:A14666 PRLB promoter - Bacteriophage lambda, 281 bp.
;
;NCBI gi: 579066
;
;SEQIO retrieval from GenBank database entry. 07-Feb-1996
 gatcagctgc gacacaacta gtttacttac tcgcttatta aaccagaccc acaatctttt
 acacagatac aatattttta gtggaaactt cttgacattt cggcccatga cctttactct
 gttataaatt acttttatgg gggacgatca cactagcaaa ggagttacct aagccccgaa
 tgttcaatgg gaagacttcc ccaatcatga cccacattac gggaccccaa gttgcggaga
 agaaggcgat gtaaactgtc aaagcaatca cagagatgat c

Example FASTA-old entry:

>gb:A14666|acc:A14666 PRLB promoter - Bacteriophage lambda, 281 bp.
 gatcagctgc gacacaacta gtttacttac tcgcttatta aaccagaccc acaatctttt
 acacagatac aatattttta gtggaaactt cttgacattt cggcccatga cctttactct
 gttataaatt acttttatgg gggacgatca cactagcaaa ggagttacct aagccccgaa
 tgttcaatgg gaagacttcc ccaatcatga cccacattac gggaccccaa gttgcggaga
 agaaggcgat gtaaactgtc aaagcaatca cagagatgat c



NBRF/NBRF-old File Formats
**************************

    NOTE: The implementation of the NBRF format follows the format
    descriptions given in the release notes of the VMS version of the
    PIR database, with the following exceptions:

       1. An identifier list (with identifiers separated by '|') can
          appear after the ';' on the first line of the entry, and there is
          no limitation to the length of that identifier list.
       2. The second line of the entry is treated as a full one-line
          description (so it can contain more than just the
          description and organism name).
       3. The NBRF header lines (which occur after the sequence)
          are assumed to begin at the first line whose second
          character is a ';', and run until the end of the entry. So, the
          sequence lines cannot contain such a line (or the
          sequence will only be partially read).
       4. Every "C;Comment: " line in the header lines is assumed to
          contain a space between the "C;Comment:" and the
          comment text. This space (or whatever character appears
          there) is not considered part of the comment text.
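Exceptions 1, 3 and 4 above can be sketched in Python as follows. This
is only an illustration under the stated rules, not SEQIO's C
implementation; the function names parse_id_line, split_nbrf_body and
comment_text are hypothetical.

```python
def parse_id_line(line):
    """Split '>??;id1|id2|...' into the two-character code and the id list."""
    code, _, ids = line[1:].partition(";")
    return code, [item for item in ids.split("|") if item]


def split_nbrf_body(lines):
    """Split the lines after the description line into
    (sequence_lines, header_lines).

    The header lines start at the first line whose second character
    is ';' and run to the end of the entry (exception 3).
    """
    for idx, line in enumerate(lines):
        if len(line) >= 2 and line[1] == ";":
            return lines[:idx], lines[idx:]
    return lines, []


def comment_text(line):
    """Return the text of a 'C;Comment:' line, or None for other lines.

    The single character after 'C;Comment:' is dropped, whatever it
    is (exception 4).
    """
    prefix = "C;Comment:"
    if not line.startswith(prefix):
        return None
    return line[len(prefix) + 1:]
```

Because the header test looks only at the second character, a sequence
line such as "a;cgt" would be mistaken for a header line, which is the
partial-read hazard that exception 3 warns about.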

The read operation first looks for a line beginning with '>', which
contains a two-character code and database identifiers for the
sequence. The next line, which should not begin with a '>', contains a
one-line description of the sequence, and the operation attempts to
extract the sequence length from that line. After that, the operation
scans the sequence lines looking for the beginning of the header lines
or the end of the entry. The header lines begin with the first line whose
second character is ';', and they are not required to appear in an entry.
The end of the entry is either the first line which begins with a '>', or the
end of the file.

The getseq operation scans the sequence lines from just after the
description line to either the first occurrence of a '*', the beginning of
the header lines or the end of the entry. All alphabetic characters on
those lines are assumed to be in the sequence. No format for those
lines is assumed.

The rawseq operation is the same as the getseq operation, except that
all non-whitespace and non-numeric characters are considered part of
the sequence.

The getinfo operation first looks at the initial identification line. The
format of that line is ">??;..." where "??" is a two character description
and "..." is a list of identifiers. Six forms of the two character
description are recognized:

    o  "P1" - Protein complete
    o  "F1" - Protein fragment
    o  "DL" - linear DNA
    o  "DC" - circular DNA
    o  "RL" - linear RNA
    o  "RC" - circular RNA

and the appropriate alphabet, isfragment and iscircular values are
set. The list of identifiers is added to mainid, mainacc and
idlist. If no identifier prefix is specified for an identifier (either
by the identifier itself or by the "IdPrefix" information field of the
database's BIOSEQ entry, if a database search is being performed),
then "oth" for Other is used. The next line in the entry is parsed
according to the one-line description format. Then, if the header
lines were found in the entry during the read operation, they are
scanned, looking for lines beginning with "C;Accession:", "C;Comment:"
and "C;Date:" which give the accession numbers, comments and date,
respectively.

In the NBRF format, the putseq operation outputs an initial identification
line of the appropriate form, containing one of the two character
descriptions above (or "XX" if the alphabet is Unknown) and containing
the list of identifiers in idlist. It then outputs a one-line description
according to the one-line description format. The sequence is output
and terminated with a '*'. Finally, the date, accession numbers and
comments/history are output in lines beginning with "C;Date:",
"C;Accession:" and "C;Comment:".

In the NBRF-old format, the putseq operation only outputs the initial
identification line, the description line and the sequence lines. In
addition, only one identifier is placed on the initial identification line,
and if that identifier was not an accession number, the main accession
number is added to the beginning of the description line.

For the NBRF format, the annotate operation replaces or appends the
"C;Comment: " lines, if they exist. If no comment lines exist, then a
new comment section will be inserted (or rather output between the
existing lines of the entry) as follows. If a "C;Genetics:", "C;Complex:",
"C;Function:", "C;Superfamily:", "C;Keywords:" or "F;" line appears in
the entry, the comment is inserted just before the first of those lines.
Otherwise, the comment is inserted at the end of the entry.

There is no annotate operation in the NBRF-old format.

Example NBRF entry:

>DL;gb:A14666
PRLB promoter - Bacteriophage lambda, 281 bp.
 gatcagctgc gacacaacta gtttacttac tcgcttatta aaccagaccc acaatctttt
 acacagatac aatattttta gtggaaactt cttgacattt cggcccatga cctttactct
 gttataaatt acttttatgg gggacgatca cactagcaaa ggagttacct aagccccgaa
 tgttcaatgg gaagacttcc ccaatcatga cccacattac gggaccccaa gttgcggaga
 agaaggcgat gtaaactgtc aaagcaatca cagagatgat c*
C;Date: 18-AUG-1994
C;Accession: A14666
C;Comment: NCBI gi: 579066
C;Comment:
C;Comment: SEQIO retrieval from GenBank database entry. 23-Mar-1996

Example NBRF-old entry:

>DL;gb:A14666
~A14666 PRLB promoter - Bacteriophage lambda, 281 bp.
 gatcagctgc gacacaacta gtttacttac tcgcttatta aaccagaccc acaatctttt
 acacagatac aatattttta gtggaaactt cttgacattt cggcccatga cctttactct
 gttataaatt acttttatgg gggacgatca cactagcaaa ggagttacct aagccccgaa
 tgttcaatgg gaagacttcc ccaatcatga cccacattac gggaccccaa gttgcggaga
 agaaggcgat gtaaactgtc aaagcaatca cagagatgat c*



IG/Stanford, IG-old/Stanford-old File Formats
*********************************************

The read operation first looks for a line beginning with ';'. The
operation then looks for the next line which does not begin with a ';'.
All of the lines beginning with ';' make up the comment lines, and the
first line not beginning with ';' contains the sequence's description. If
the description line has been formatted using the standard one-line
description format (see file "user.doc"), then the sequence length is
extracted from that line. Finally, the operation looks for the entry's end
at either the next line which does begin with a ';' or the end of the file.

The getseq operation scans the sequence lines from just after the
description line until either the end of the entry is reached, or a '1' or a
'2' appears. All alphabetic characters on those lines are assumed to be
in the sequence. No format for those lines is assumed.
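The IG/Stanford getseq scan can be sketched in Python as follows. This
is a minimal illustration of the rule just described, not SEQIO's C
code; the function name ig_getseq is hypothetical, and returning the
terminator character alongside the sequence is a choice made for this
sketch only.

```python
def ig_getseq(seq_lines):
    """Collect alphabetic characters until a '1' or '2' terminator appears.

    seq_lines holds the lines after the description line. Returns
    (sequence, terminator), where terminator is '1', '2', or None if
    the entry ended without one.
    """
    residues = []
    for line in seq_lines:
        for ch in line:
            if ch in ("1", "2"):
                # The terminator ends the sequence immediately.
                return "".join(residues), ch
            if ch.isalpha():
                residues.append(ch)
    return "".join(residues), None
```

Since no line format is assumed, whitespace and any other non-alphabetic
characters before the terminator are simply skipped.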

The rawseq operation is the same as the getseq operation, except that
all non-whitespace and non-numeric characters are considered part of
the sequence.

The getinfo operation first gets the comment lines at the beginning of
the entry, and then parses the description line according to the
one-line description format. Finally, it looks for a '1' or '2' at the end of
the sequence, and sets iscircular to 0 or 1, respectively.

In the IG/Stanford format, the putseq operation outputs any
comment/history lines (or just the line ";\n" if there are no
comment/history lines), a one-line description, the sequence and finally
either a '1' or '2' depending on the value of iscircular.

In the IG-old/Stanford-old format, the putseq operation outputs the
same text as in the IG/Stanford format except that exactly one
comment/history line is output.

In the IG/Stanford format, the annotate operation either replaces,
appends or inserts the comment lines at the beginning of the entry.
There is no annotate operation in the IG-old/Stanford-old format.

Example IG/Stanford entry:

;NCBI gi: 579066
;
;SEQIO retrieval from GenBank database entry. 07-Feb-1996
gb:A14666|acc:A14666 PRLB promoter - Bacteriophage lambda, 281 bp.
 gatcagctgc gacacaacta gtttacttac tcgcttatta aaccagaccc acaatctttt
 acacagatac aatattttta gtggaaactt cttgacattt cggcccatga cctttactct
 gttataaatt acttttatgg gggacgatca cactagcaaa ggagttacct aagccccgaa
 tgttcaatgg gaagacttcc ccaatcatga cccacattac gggaccccaa gttgcggaga
 agaaggcgat gtaaactgtc aaagcaatca cagagatgat c1

Example IG-old/Stanford-old entry:

;NCBI gi: 579066
gb:A14666|acc:A14666 PRLB promoter - Bacteriophage lambda, 281 bp.
 gatcagctgc gacacaacta gtttacttac tcgcttatta aaccagaccc acaatctttt
 acacagatac aatattttta gtggaaactt cttgacattt cggcccatga cctttactct
 gttataaatt acttttatgg gggacgatca cactagcaaa ggagttacct aagccccgaa
 tgttcaatgg gaagacttcc ccaatcatga cccacattac gggaccccaa gttgcggaga
 agaaggcgat gtaaactgtc aaagcaatca cagagatgat c1



ASN.1 Text File Format
**********************

    NOTE: This file format implementation is not nearly complete
    enough to handle all of the variations of ASN.1 text files. I
    concentrated the implementation on handling the "Bioseq"
    sequence records defined as part of the "Bioseq-set" structure,
    i.e., it looks for each "Bioseq-set.seq-set.seq" record in the file,
    where '.' separates the initial keywords for each level of
    sub-record. (See the NCBI toolkit for the definitions of the
    "Bioseq-set" and "Bioseq" syntax, and the values of those initial
    keywords).

    However, it does handle all of the syntactic requirements of the
    ASN.1 text format. It makes no assumptions on the structure of
    the file, handling a completely free-form file (with one exception
    listed below). It does assume that the format consists of a
    hierarchy of records, where a record consists of a text string
    identifier and then a pair of matching braces bounding the
    contents of the record (except for simple records which contain
    only one or more strings and numbers).

The read operation looks for the beginning of each
"Bioseq-set.seq-set.seq" record in the file. The operation assumes
that this record is a "Bioseq" record, and looks for the end of it. Also,
the read operation makes the syntactic requirement that the open
brace beginning the "seq" record is separated from its initial keyword
by exactly one space (i.e., the operation looks for the string "seq {").
After scanning to the end of the "seq" record, the operation looks for
the "seq.inst.length" sub-record. If found, the sequence length is
extracted from that sub-record.

The getseq operation looks for the "seq.inst.seq-data" sub-record in
the entry. If found, the sequence is extracted from that sub-record.
(NOTE: This operation can only handle sequences that have been
encoded in the `iupacna', `iupacaa', `ncbi2na' or `ncbi4na' formats.)

The rawseq operation is the same as the getseq operation, since the
`iupacna', `iupacaa', `ncbi2na' and `ncbi4na' formats do not contain
non-alphabetic characters.

The getinfo operation looks for a large number of possible sub-records
for information about the sequence. To find database identifiers, it
looks in the "seq.id" sub-record for the sub-sub-records "pir.name",
"pir.accession", "swissprot.name", "swissprot.accession",
"genbank.name", "genbank.accession", "embl.name",
"embl.accession", "ddbj.name", "ddbj.accession", "prf.name",
"prf.accession", "other.name", "other.accession", "pdb.mol", "gi",
"giim.id", "gibbsq" and "gibbmt". Any identifiers found are added to
the idlist. To find the date information, it looks in the "seq.descr"
sub-record to find the sub-sub-records "create-date",
"update-date", "genbank.date", "genbank.entry-date",
"embl.creation-date", "embl.update-date", "pir.date", "sp.created",
"sp.sequpd", "sp.annotupd" and "pdb.deposition".

Then, the operation searches for the description, organism and
comment information in the "seq.descr" sub-record. For the
description, the operation searches for the sub-sub-records "title",
"pdb.compound" and "name" and picks one of them for the description
("title" if found, else "pdb.compound", else "name"). For the organism,
the sub-sub-records "org.taxname", "org.common", "pir.source" and
"pdb.source" are searched. For the comments, all of the "comment"
sub-sub-records in "seq.descr" are concatenated together to make up
the comment lines.

Finally, the alphabet is picked up from the "seq.descr.mol-type",
"seq.descr.modif.dna", "seq.descr.modif.rna" or "seq.inst.mol"
sub-records, the isfragment field is set to 1 if "seq.descr.modif.partial"
exists, and the iscircular field is set to 1 if the data string in
"seq.inst.topology" is "circular".

The putseq operation outputs a "Bioseq" record for the sequence as
part of a "Bioseq-set" structure (i.e., the appropriate strings are output
before the first putseq operation, between the "Bioseq" records and
when the file is closed, so that the file consists of a correctly formatted
"Bioseq-set" record). The form of the file mirrors that of the Bioseq-set
example given in the NCBI toolkit.

(NOTE: Because some text must be output when the file is closed (i.e.,
when seqfclose is called), you MUST call seqfclose when writing an
ASN.1 file. If you don't call seqfclose, the text file will not be complete.)

The annotate operation either replaces, creates or appends the
comment lines in the "seq.descr" sub-record (i.e., the comment lines
are the "seq.descr.comment" records). If no "seq.descr" sub-record
exists, one is created in the most appropriate place in the "seq" record.
If the entry given to the annotate operation is not a Bioseq "seq"
record, an error occurs.

NOTE: Using the annotate operation by itself will NOT create a valid
ASN.1 text file. You must output the following strings before the first
entry, between entries, and after the last entry (again, assuming the
entries are "Bioseq" records taken from the "Bioseq-set" hierarchy):

    Before the first entry: "Bioseq-set ::= {\n seq-set {\n"
    Between entries: " ,\n"
    After the last entry: " } }\n"

A Complete ASN.1 Text File:

Bioseq-set ::= {
  seq-set {
    seq {
      id {
        genbank {
          name "A14666" ,
          accession "A14666" } } ,
      descr {
        title "PRLB promoter" ,
        org {
          taxname "Bacteriophage lambda" } ,
        update-date
          str "18-AUG-1994" ,
        comment "NCBI gi: 579066" ,
        comment "SEQIO retrieval from GenBank database entry. 07-Feb-1996" } ,
      inst {
        repr raw ,
        mol dna ,
        length 281 ,
        seq-data
          iupacna "gatcagctgcgacacaactagtttacttactcgcttattaaaccagacccacaatcttt
tacacagatacaatatttttagtggaaacttcttgacatttcggcccatgacctttactctgttataaattactttta
tgggggacgatcacactagcaaaggagttacctaagccccgaatgttcaatgggaagacttccccaatcatgacccac
attacgggaccccaagttgcggagaagaaggcgatgtaaactgtcaaagcaatcacagagatgatc" } } } }



GCG Format
**********

The read operation first looks for a line that ends with the string ".." (or
more precisely, a line whose last non-whitespace characters are "..").
That line should be the GCG information line, and should look
something like the following:

  gb:A02201  Length: 664  June 21, 1996 18:42  Type: N  Check: 9896  ..

although any or all of this information (except the "..") can be missing.
If the line contains the "Length:" keyword, then the read operation will
extract the sequence length. The read operation then reads the rest of
the file, and assumes that those lines contain the sequence.

The getseq operation scans the sequence lines. All alphabetic
characters on those lines are assumed to be in the sequence. No
format for those lines is assumed.
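Locating the GCG information line and extracting the sequence can be
sketched in Python as follows. This is an illustrative sketch under the
rules described above, not SEQIO's C implementation; the function names
find_gcg_info_line and gcg_getseq are hypothetical.

```python
import re


def find_gcg_info_line(lines):
    """Return (index, length) for the first line whose last
    non-whitespace characters are '..'.

    length is None when the line has no 'Length:' field; the result
    is (None, None) when no information line exists.
    """
    for idx, line in enumerate(lines):
        if line.rstrip().endswith(".."):
            match = re.search(r"Length:\s*(\d+)", line)
            return idx, int(match.group(1)) if match else None
    return None, None


def gcg_getseq(lines):
    """Join every alphabetic character found after the information line."""
    idx, _ = find_gcg_info_line(lines)
    if idx is None:
        return ""
    return "".join(ch for line in lines[idx + 1:] for ch in line if ch.isalpha())
```

Position numbers and blanks on the sequence lines drop out because only
alphabetic characters are kept, which is why no line layout has to be
assumed.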

The rawseq operation is the same as the getseq operation, except that
all non-whitespace and non-numeric characters are considered part of
the sequence. During this operation, any period `.' appearing in the
sequence lines is assumed to be a gap character and translated into a
dash `-' (SEQIO's canonical gap character).

The getinfo operation takes the date and the alphabet from the GCG
information line (if the date and the "Type:" fields are there), sets the
description to the first word of the GCG information line (if it isn't
"Length:"), and then takes all of the lines up to the GCG information
line as the comment.

The putseq operation first outputs any comment lines, outputs a
complete GCG information line (with a valid checksum), and then
outputs the sequence lines in the default format shown below. Any
dash `-' appearing in the output sequence is assumed to be a gap
character and automatically translated into a period `.'.

There currently is no annotate function.



GCG-* Formats
*************

The processing of the GCG-* formats essentially merges the
processing of the GCG format on the sequence lines with the
processing of the GenBank, PIR, EMBL, Swiss-Prot, FASTA,
FASTA-old, NBRF, NBRF-old, IG/Stanford and IG-old formats when
dealing with the header lines of each entry. So, see above for the
details on that processing.

The one exception to this rule is the relationship between the NBRF
and GCG-NBRF formats. Since the NBRF entries contain "header"
information that actually appears at the end of the entry, and the GCG
format requires that the last thing in an entry be the sequence, the
GCG and non-GCG forms of the NBRF entries differ more than the
other formats. In the GCG-NBRF format, the lines before the GCG
information line are assumed to contain the two header lines normally
found in the NBRF entries, immediately followed by the lines normally
appearing at the end of the file (the "C;Comment:", "C;Accession:" and
other lines). After those lines, the GCG information line and sequence
lines should appear, and be the last things in the entry. The fmtseq
program and SEQIO package have been implemented to make this
transformation between the NBRF and GCG-NBRF formats.

An example GCG-GenBank entry:

LOCUS       A14666        281 bp    DNA             PHG       18-AUG-1994
DEFINITION  PRLB promoter.
ACCESSION   A14666
KEYWORDS    .
SOURCE      Bacteriophage lambda.
  ORGANISM  Bacteriophage lambda
            Viridae; ds-DNA nonenveloped viruses; Siphoviridae.
REFERENCE   1 (bases 1 to 281)
  AUTHORS   Michiels,F., Delcour,J., Mahillon,J., Joos,H., Platteeuw,C. and
            Josson,K.
  TITLE     Transformed lactic acid bacteria
  JOURNAL   Patent: EP 0311469-A 10 12-APR-1989;
            PLANT GENETIC SYSTEMS N.V.; UNIVERSITE CATHOLIQUE DE LOUVAIN
COMMENT     NCBI gi: 579066
FEATURES             Location/Qualifiers
     source          1..281
                     /organism="Bacteriophage lambda"
     RBS             158..166
     CDS             180..254
                     /note="PRLB; NCBI gi: 579067"
                     /codon_start=1
                     /translation="MFNGKTSPIMTHITGPQVAEKKAM"
BASE COUNT       89 a     67 c     52 g     73 t
ORIGIN

  gb:A14666  Length: 281  June 28, 1996 16:23  Type: N  Check: 2754  ..

       1  gatcagctgc gacacaacta gtttacttac tcgcttatta aaccagaccc

      51  acaatctttt acacagatac aatattttta gtggaaactt cttgacattt

     101  cggcccatga cctttactct gttataaatt acttttatgg gggacgatca

     151  cactagcaaa ggagttacct aagccccgaa tgttcaatgg gaagacttcc

     201  ccaatcatga cccacattac gggaccccaa gttgcggaga agaaggcgat

     251  gtaaactgtc aaagcaatca cagagatgat c

An example GCG-NBRF entry:

>DL;gb:A14666
PRLB promoter - Bacteriophage lambda, 281 bp.
C;Date: 18-AUG-1994
C;Accession: A14666
C;Comment: NCBI gi: 579066
C;Comment:
C;Comment: SEQIO retrieval from GenBank database. 28-Jun-1996

  gb:A14666  Length: 281  June 28, 1996 16:22  Type: N  Check: 2754  ..

       1  gatcagctgc gacacaacta gtttacttac tcgcttatta aaccagaccc

      51  acaatctttt acacagatac aatattttta gtggaaactt cttgacattt

     101  cggcccatga cctttactct gttataaatt acttttatgg gggacgatca

     151  cactagcaaa ggagttacct aagccccgaa tgttcaatgg gaagacttcc

     201  ccaatcatga cccacattac gggaccccaa gttgcggaga agaaggcgat

     251  gtaaactgtc aaagcaatca cagagatgat c




MSF Multiple Sequence Format
****************************

The read operation first looks for a GCG information line of the
following form:

  Pileup.Msf  MSF: 729  Type: N  June 21, 1996 15:02  Check: 3171  ..

although any or all of this information can be missing, except the ".."
and the "MSF: %d" section, the second of which the read operation
uses to get the sequence length. After the information line, the read
operation looks for the sequence name lines, which are of the form

  Name: Humhbbbpc  Len: 729  Check: 6463  Weight: 1.00

where the "Name: " field gives the sequence identifier and must appear
on any non-blank line in this section of the MSF file (the other fields
are ignored, and the length is assumed to be the same as the global
length). The sequence name lines section ends when a line beginning
with "//" appears. Any number of blank lines can be interspersed in this
section, but any non-blank line should follow the above format. The
rest of the file is assumed to contain the sequence lines, where each
sequence line begins with the sequence name followed by a space, as
in:

           401                                                450
Humhbbbpc  CAGAAATTTA GTGTTTTCTC AGTCAGTTAA CATTCCTTCA AC........
Humhbbbpd  CAGAAATTTA GTGTTTTCTC AGTCAGTTAA CATTCCTTCA AC........
1342Humhbbbpe CAGACAAATG GAACAGAATA GAGAGCCCAG AAATAAGACC ACATG..... 1343Humhbbbpf CAGACAAATG GAACAGAATA GAGAGCCCAG AAATAAGACC ACATG..... 1344Humhbbbpg AATACAAAAT CAGTAGCATT TCATATATAA A......... .......... 1345Humhbbbph AATACAAAAT CAGTAGCATT TCATATATAA A......... .......... 1346Humhbbbp1 AAGTGATGAA ATTGTGTATT CAATGTAGTC TCAAGAGAAT TGAAAACCAA 1347Humhbbbpa AAATAAAAGG ATGGAGGAAG ATCTACCAAG CA........ .......... 1348Humhbbbpb AAATAAAAGG ATGGAGGAAT ATCTACCAAG CA........ .......... 1349Humhbbbp2 AGCT.AAAGG ATTGTAAATG CACTAATCAG CACTCTGTGT CTAGCTCAAG 1350 1351No format of the sequence lines or presence or absence of the position 1352number lines (401...450) is assumed, except for the initial sequence 1353name. The sequence lines run to the end of the file. 1354 1355The getseq operation finds every sequence line beginning with the 1356corresponding sequence name (the sequences are ordered by the 1357order of sequence names in the sequence names section). All 1358alphabetic characters appearing after the sequence name are taken for 1359the sequence. 1360 1361The rawseq operation is the same as the getseq operation, except that 1362all non-whitespace and non-numeric characters are considered part of 1363the sequence. During this operation, any period `.' appearing in the 1364sequence lines is assumed to be a gap character and translated into a 1365dash `-' (the SEQIO's canonical gap character). 1366 1367The getinfo operation takes the date and the alphabet from the GCG 1368information line (if the date and the "Type:" fields are there), sets the 1369description to the sequence name found in the sequence name section, 1370and then takes all of the lines up to the GCG information line as the 1371comment. 1372 1373The putseq operation outputs an MSF file exactly mimicing the files 1374output by GCG using "PileUp" in its default mode, except that only the 1375keyword "PileUp" appears on the first line and no comments are 1376output. 
Any dashes `-' found in the sequences are assumed to be gap
characters and are automatically translated into periods `.'. If the
sequences are of different lengths, the putseq operation will pad the
shorter sequences with periods `.'.

(IMPORTANT: The one unusual feature about the putseq operation is
that, unlike all of the other putseq operations except Clustalw and
PHYLIP, the actual output does not occur until `seqfclose' is called to
close the file. Because the MSF format must know the number of entries
before it can begin the output, the sequences cannot be output at each
call to `seqfwrite'. Instead, on each call to `seqfwrite', the putseq
operation makes a copy of the sequence and a sequence identifier
(either the main identifier, description or organism name). Then, when
`seqfclose' is called, all of the sequences are output in the correct
format.)

There is currently no annotate function.

An example MSF file:

PileUp


 pir.msf MSF: 104 Type: P June 28, 1996 17:04 Check: 3466 ..

 Name: pir:CCCZ  Len: 104 Check: 9501 Weight: 1.00
 Name: pir:CCMQR Len: 104 Check: 9512 Weight: 1.00
 Name: pir:CCMKP Len: 104 Check: 9066 Weight: 1.00
 Name: pir:CCRB  Len: 104 Check: 8395 Weight: 1.00
 Name: pir:CCGW  Len: 104 Check: 8496 Weight: 1.00
 Name: pir:CCCM  Len: 104 Check: 8496 Weight: 1.00

//

          1                                                   50
pir:CCCZ  GDVEKGKKIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQAPGYSYTA
pir:CCMQR GDVEKGKKIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQAPGYSYTA
pir:CCMKP GDVFKGKRIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQASGFTYTE
pir:CCRB  GDVEKGKKIF VQKCAQCHTV EKGGKHKTGP NLHGLFGRKT GQAVGFSYTD
pir:CCGW  GDVEKGKKIF VQKCAQCHTV EKGGKHKTGP NLHGLFGRKT GQAVGFSYTD
pir:CCCM  GDVEKGKKIF VQKCAQCHTV EKGGKHKTGP NLHGLFGRKT GQAVGFSYTD

          51                                                 100
pir:CCCZ  ANKNKGIIWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK
pir:CCMQR ANKNKGITWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK
pir:CCMKP ANKNKGIIWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK
pir:CCRB  ANKNKGITWG EDTLMEYLEN PKKYIPGTKM IFAGIKKKDE RADLIAYLKK
pir:CCGW  ANKNKGITWG EETLMEYLEN PKKYIPGTKM IFAGIKKKGE RADLIAYLKK
pir:CCCM  ANKNKGITWG EETLMEYLEN PKKYIPGTKM IFAGIKKKGE RADLIAYLKK

          101
pir:CCCZ  ATNE
pir:CCMQR ATNE
pir:CCMKP ATNE
pir:CCRB  ATNE
pir:CCGW  ATNE
pir:CCCM  ATNE




PHYLIP Interleaved and Sequential File Formats
**********************************************

    NOTE: The implementation here is more flexible than other
    implementations on input, but it is a bit restrictive in its
    output. Specifically:

       1. Both interleaved and sequential formats are supported and
          rigorously distinguished. See below for the details.
       2. An input file in the PHYLIP format can contain one or more
          PHYLIP entries, with the entries separated only by
          whitespace. Mixed files (some interleaved entries, some
          sequential entries) are supported.
       3. Any number of blank lines or lines filled only with
          whitespace can be included in the file. Blank lines do not
          disrupt the parsing of the entries.
       4. The output operation does NOT output more than one
          entry per file, because I have yet to completely figure out
          the SEQIO interface issues. (Note that this may change in a
          future version.)
       5. This implementation was done using the documentation
          from Version 3.5c. Whether it works with earlier versions is
          not known.

The read operation first skips whitespace characters and then looks for
the number of sequences and the sequence length (those two numbers
must be the first thing in the entry). On that initial line, it also looks for
the option characters 'A', 'C', 'F', 'M', 'U' and 'W'. If any of the options
except 'U' are found, the operation then skips any subsequent lines
that begin with a match to the character strings "ANCESTOR ",
"CATEGORIES", "FACTORS ", "MIXTURE ", or "WEIGHTS ". A line is
considered to match one of the strings if the first 10 characters of the
line contain a prefix of the string padded by spaces. Also, these lines
are skipped only if the corresponding option was given on that first
line.

(NOTE: This may cause some problems on an entry such as this one:

3 6 A
A         ABCDEF
B         BCDEFG
C         CDEFGH

because the second line of the entry is treated as an "ANCESTOR "
line, when in fact it is a sequence line. But, judging from the
documentation, the PHYLIP programs would fail on this entry, too, and
replacing "A " with something like "Alpha " eliminates the problem.)

After skipping those initial lines, the read operation tries to match the
subsequent lines to the interleaved and sequential file formats. The
following criteria are the keys to distinguishing between the two
formats:

   1. The line giving the initial piece of a sequence must be at least 10
      characters long and there must be at least one non-whitespace
      character in those first ten characters. This should be the
      sequence identifier, and its characters are not counted as part of
      the sequence.
   2. In the Interleaved format, all of the sequence substrings in each
      block of the entry must have the same length. A block is a set of
      "number-of-sequences" lines (not counting blank lines) which
      contain a piece of each of the sequences.
   3. The end of each sequence must occur on its own line, without
      any additional non-whitespace text after the sequence
      characters.

If one format but not the other matches, or both formats match and the
input format has been specified as PHYLIP-Int or PHYLIP-Seq (instead
of just PHYLIP), then the entry format has been successfully
determined. Otherwise (if neither format matches, or both match and
the input format is just PHYLIP), a parse error is triggered. However,
given the above criteria and the fact that the operation attempts to
completely match both formats against the text, the likelihood that
both formats will match the same text is extremely remote.

Finally, if the 'U' option has been set on the entry's first line, the read
operation skips the user trees listed in the entry, to get to the end of
the entry. The format of the user trees consists of a line giving the
number of trees, followed by any number of lines of text where each
user tree description is ended by a semi-colon (the operation just
counts the semi-colons it sees). The end of the entry is at the end of
the line containing the last semi-colon.

The getseq operation finds the first line of the appropriate sequence in
the entry (i.e., the `seqfseqno' sequence), skips the 10 character
identifier and retrieves the sequence. All alphabetic characters are
considered to be in the sequence.
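
As a rough illustration of the getseq rule above, here is a minimal
Python sketch. It is not SEQIO's C code: the function name
`phylip_getseq' and its arguments are hypothetical, and it assumes an
interleaved entry with no option lines and no user trees. It skips the
10 character identifier on the sequence's first line and keeps only
alphabetic characters.

```python
def phylip_getseq(entry, seqno):
    """Return sequence number `seqno` (1-based) from an interleaved entry.

    Hypothetical helper: assumes no option lines and no user trees.
    Blank lines are ignored, as in the read operation.
    """
    lines = [ln for ln in entry.splitlines() if ln.strip()]
    nseqs = int(lines[0].split()[0])       # first number: sequence count
    body = lines[1:]
    pieces = []
    # In an interleaved entry, every nseqs-th line belongs to one sequence.
    for i in range(seqno - 1, len(body), nseqs):
        line = body[i]
        if i < nseqs:
            line = line[10:]               # first block: skip the identifier
        pieces.append("".join(ch for ch in line if ch.isalpha()))
    return "".join(pieces)


# Identifiers are padded to 10 characters, as the format requires.
entry = "\n".join([
    " 3 14",
    "Alpha".ljust(10) + "ACGTACGTAC",
    "Beta".ljust(10) + "ACGTTCGTAC",
    "Gamma".ljust(10) + "ACG-ACGTAC",
    "",
    "GGTA",
    "GGTA",
    "GG-A",
])
print(phylip_getseq(entry, 3))   # the '-' gap characters are dropped
```

Because only alphabetic characters are kept, the gap characters in the
third sequence do not appear in the result.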

The rawseq operation is the same as the getseq operation, except that
all non-whitespace and non-numeric characters are considered part of
the sequence.

The getinfo operation takes the 10 character sequence identifier to be
the description of the sequence. No other information is retrieved.

The putseq operation outputs an Interleaved or Sequential entry
exactly as described in the PHYLIP program documentation. If the
sequences output are of different lengths, the putseq operation will pad
the shorter sequences with dashes `-'.

(IMPORTANT: The one unusual feature about the putseq operation is
that, unlike all of the other putseq operations except Clustalw and MSF,
the actual output does not occur until `seqfclose' is called to close the
file. Because the PHYLIP format must know the number of entries
before it can output the first line, the sequences cannot be output at
each call to `seqfwrite'. Instead, on each call to `seqfwrite', the putseq
operation makes a copy of the sequence and a sequence
identifier (either the mainid, mainacc, description or organism name).
Then, when `seqfclose' is called, all of the sequences are output in the
correct format.)

There is no annotate function.
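
The deferred output described above can be sketched as follows. This
is only an illustration in Python, not the SEQIO implementation: the
class `DeferredPhylipWriter', its methods and the `width' parameter
are all hypothetical. Each write call stores a copy, and the
interleaved entry is produced only on close, with shorter sequences
padded with dashes.

```python
import io


class DeferredPhylipWriter:
    """Buffer sequences and emit one interleaved entry on close()."""

    def __init__(self, stream, width=50):
        self.stream = stream
        self.width = width      # sequence characters per line (assumed)
        self.records = []       # stored (identifier, sequence) copies

    def write(self, ident, seq):
        # Plays the role of `seqfwrite' for this format: store a copy only.
        self.records.append((ident, seq))

    def close(self):
        # Plays the role of `seqfclose': the count and length are now known.
        length = max((len(s) for _, s in self.records), default=0)
        padded = [(i, s.ljust(length, "-")) for i, s in self.records]
        out = [f" {len(self.records)} {length}"]
        for start in range(0, length, self.width):
            if start:
                out.append("")  # blank line between blocks
            for ident, seq in padded:
                # Only the first block carries the 10 character identifier.
                name = ident[:10].ljust(10) if start == 0 else " " * 10
                out.append(name + seq[start:start + self.width])
        self.stream.write("\n".join(out) + "\n")


buf = io.StringIO()
writer = DeferredPhylipWriter(buf, width=4)
writer.write("Alpha", "ACGTAC")
writer.write("Beta", "ACG")
nothing_yet = buf.getvalue()    # still empty: output is deferred
writer.close()
print(buf.getvalue())
```

Buffering is what makes the first line possible: the sequence count
and the padded length are not known until the last sequence arrives.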

Example PHYLIP Interleaved entry:

 6 104
pir:CCCZ  GDVEKGKKIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQAPGYSYTA
pir:CCMQR GDVEKGKKIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQAPGYSYTA
pir:CCMKP GDVFKGKRIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQASGFTYTE
pir:CCRB  GDVEKGKKIF VQKCAQCHTV EKGGKHKTGP NLHGLFGRKT GQAVGFSYTD
pir:CCGW  GDVEKGKKIF VQKCAQCHTV EKGGKHKTGP NLHGLFGRKT GQAVGFSYTD
pir:CCCM  GDVEKGKKIF VQKCAQCHTV EKGGKHKTGP NLHGLFGRKT GQAVGFSYTD

          ANKNKGIIWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK
          ANKNKGITWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK
          ANKNKGIIWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK
          ANKNKGITWG EDTLMEYLEN PKKYIPGTKM IFAGIKKKDE RADLIAYLKK
          ANKNKGITWG EETLMEYLEN PKKYIPGTKM IFAGIKKKGE RADLIAYLKK
          ANKNKGITWG EETLMEYLEN PKKYIPGTKM IFAGIKKKGE RADLIAYLKK

          ATNE
          ATNE
          ATNE
          ATNE
          ATNE
          ATNE

Example PHYLIP Sequential entry:

 6 104
pir:CCCZ  GDVEKGKKIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQAPGYSYTA
          ANKNKGIIWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK
          ATNE
pir:CCMQR GDVEKGKKIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQAPGYSYTA
          ANKNKGITWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK
          ATNE
pir:CCMKP GDVFKGKRIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQASGFTYTE
          ANKNKGIIWG EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK
          ATNE
pir:CCRB  GDVEKGKKIF VQKCAQCHTV EKGGKHKTGP NLHGLFGRKT GQAVGFSYTD
          ANKNKGITWG EDTLMEYLEN PKKYIPGTKM IFAGIKKKDE RADLIAYLKK
          ATNE
pir:CCGW  GDVEKGKKIF VQKCAQCHTV EKGGKHKTGP NLHGLFGRKT GQAVGFSYTD
          ANKNKGITWG EETLMEYLEN PKKYIPGTKM IFAGIKKKGE RADLIAYLKK
          ATNE
pir:CCCM  GDVEKGKKIF VQKCAQCHTV EKGGKHKTGP NLHGLFGRKT GQAVGFSYTD
          ANKNKGITWG EETLMEYLEN PKKYIPGTKM IFAGIKKKGE RADLIAYLKK
          ATNE



Clustalw Format
***************

The read operation first skips the header line of the file, and then skips
any blank lines.
The next non-blank line is assumed to begin the first
block. Each sequence line of a block contains first an identifier of 15
characters, and the rest of the line is sequence. Those sequence
lines must begin with a non-whitespace character. After the sequence
lines in each block, there is an additional line to highlight closely
related columns in the alignment, followed by zero or more blank lines.
This additional line and all of the lines occurring between blocks must
either be empty or begin with a whitespace character. There is only one
entry per file, and the whole file is assumed to consist of these
sequence blocks.

The getseq operation finds the first line of the appropriate sequence in
the entry (i.e., the `seqfseqno' sequence), skips the 15 character
identifier and retrieves the sequence. All alphabetic characters are
considered to be in the sequence.

The rawseq operation is the same as the getseq operation, except that
all non-whitespace and non-numeric characters are considered part of
the sequence.

The getinfo operation takes the 15 character sequence identifier to be
the description of the sequence. No other information is retrieved.

The putseq operation outputs a Clustalw entry exactly as the clustalw
program does, except that the version number is replaced with "*.**"
and the package does not look for closely related columns in the
output alignment (it simply outputs a line of whitespace without any '*'
or '.' characters). If the sequences are of different lengths, the putseq
operation will pad the shorter sequences with dashes '-'.

(IMPORTANT: The one unusual feature about the putseq operation is
that, unlike all of the other putseq operations except PHYLIP and MSF,
the actual output does not occur until `seqfclose' is called to close the
file.
Because the Clustalw format must know the number of entries
before it can output the first line, the sequences cannot be output at
each call to `seqfwrite'. Instead, on each call to `seqfwrite', the putseq
operation makes a copy of the sequence and a sequence
identifier (either the mainid, mainacc, description or organism name).
Then, when `seqfclose' is called, all of the sequences are output in the
correct format.)

There is no annotate function.

Example Clustalw file:

CLUSTAL W(*.**) multiple sequence alignment



pir:CCCZ        GDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWG
pir:CCMQR       GDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGITWG
pir:CCMKP       GDVFKGKRIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQASGFTYTEANKNKGIIWG
pir:CCRB        GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAVGFSYTDANKNKGITWG
pir:CCGW        GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAVGFSYTDANKNKGITWG
pir:CCCM        GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAVGFSYTDANKNKGITWG


pir:CCCZ        EDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE
pir:CCMQR       EDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE
pir:CCMKP       EDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE
pir:CCRB        EDTLMEYLENPKKYIPGTKMIFAGIKKKDERADLIAYLKKATNE
pir:CCGW        EETLMEYLENPKKYIPGTKMIFAGIKKKGERADLIAYLKKATNE
pir:CCCM        EETLMEYLENPKKYIPGTKMIFAGIKKKGERADLIAYLKKATNE




FASTA-output Formats
********************

    NOTE: With a few exceptions, this implementation can read
    and understand the output from the FASTA, TFASTA, SSEARCH,
    LFASTA, LALIGN and ALIGN programs which were run either in
    interactive or non-interactive mode, and where the output was
    formatted with the MARKX option set to any of 0, 1, 2, 3 or 10.

    The exceptions are

       1. The program must have been run in non-interactive mode
          in order for the automatic format determination to work
          correctly.
          By "non-interactive", I mean that the initial
          header output by the program:

             FASTA searches a protein or DNA sequence data bank
              version 2.0u4 Feb., 1996
             Please cite:
              W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448
               .
               .
               .

          must appear in the text given as input.
       2. If the FASTA, TFASTA or SSEARCH program is run in
          interactive mode, no information will be known about the
          query sequence (its information is in the initial header,
          which is not included in the file specified to receive the
          program output).
       3. The ALIGN program must be run in non-interactive mode
          in order for the package to correctly parse its output
          (i.e., that initial header must occur in the text). For the
          other programs, the package will parse the output correctly
          if the file format is specified as `FASTA-output'.
       4. The implementation was tested against version 2.0u4. If the
          output was different in previous versions, the
          implementation may not work.

The read operation first scans the text occurring before the first
alignment in the file. This initial text is ignored, except where it gives
information about the sequences being aligned. The initial texts of
some of the output formats contain lines of the following forms:

 >GT8.7 transl. of pa875.con, 19 to 675: 217 aa
 >musplfm transl. of musplfm.seq, 2 to 676 : 224 aa

(A) musplfm.aa >musplfm transl. of musplfm.seq, 2 to 676 - 224 aa
(B) lcbo.aa >LCBO - Prolactin precursor - Bovine - 229 aa

>musplfm transl. of musplfm.seq, 2 to 676 224 aa vs.
>LCBO - Prolactin precursor - Bovine 229 aa

The text after the '>' is parsed to extract the sequence id (the first word
after the '>'), a sequence description, the sequence length and
alphabet information about the sequence.
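
The kind of parsing described above might look like the following
Python sketch. It is a guess at the logic, not SEQIO's code: the
function name `parse_fasta_out_header', the returned dictionary keys,
the choice to treat the text after the id as the description, and the
reading of "aa" as protein and "nt" as DNA are all assumptions made
for illustration.

```python
import re

# Trailing "<length> aa" or "<length> nt", optionally preceded by one of
# ':', ',' or '-' and optionally followed by "vs." (assumed layout).
_TAIL = re.compile(r"\s*[:,-]?\s*(\d+)\s+(aa|nt)\s*(?:vs\.)?\s*$")


def parse_fasta_out_header(line):
    """Split a '>' line into id, description, length and alphabet."""
    text = line[line.index(">") + 1:]
    length = alphabet = None
    match = _TAIL.search(text)
    if match:
        length = int(match.group(1))
        alphabet = "protein" if match.group(2) == "aa" else "dna"
        text = text[:match.start()]
    # The id is the first word after the '>'; the rest is the description.
    seq_id, _, description = text.strip().partition(" ")
    return {
        "id": seq_id,
        "description": description.strip(),
        "length": length,
        "alphabet": alphabet,
    }


info = parse_fasta_out_header(
    " >GT8.7 transl. of pa875.con, 19 to 675: 217 aa")
print(info["id"], info["length"], info["alphabet"])
```

Anchoring the pattern at the end of the line keeps numbers inside the
description, such as "19 to 675", from being mistaken for the length.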

Then, the read operation reads the "entries" of the file, where each
entry is considered to be the text describing an alignment between two
sequences. Different programs output different sets of alignments, but
all six of the supported FASTA programs output one or more
two-sequence alignments. Thus, every entry in this format contains
two sequences.

The getseq operation extracts the appropriate sequence from the entry
(the first or second sequence if the `seqfseqno' value is 1 or 2,
respectively). All alphabetic characters are considered part of the
sequence, except that if the output was generated with MARKX=2, then
any periods occurring in the second sequence are replaced with the
corresponding character of the first sequence.

The rawseq operation is the same as the getseq operation, except that
all non-whitespace and non-numeric characters are considered part of
the sequence (with the exception of the period substitution mentioned
above).

The getinfo operation extracts a main identifier, a description and an
alphabet for the appropriate sequence, if available. It also constructs a
comment that begins with the following:

From SSEARCH output alignment of:
 >musplfm transl. of musplfm.seq, 2 to 676, 224 aa
 >LCBO - Prolactin precursor - Bovine, 229 aa

This gives the name of the program whose output is being parsed, and
the descriptions of the two sequences whose alignment produced the
current sequence. This text is then followed by any information from the
alignment describing the score of that pairwise alignment. The format of
this text depends on the FASTA program executed and the MARKX
value, as it is just copied from the program output.

There is no putseq or annotate operation.
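
The MARKX=2 period substitution given for getseq above can be
illustrated with a short Python sketch. The function name
`fill_markx2_periods' is hypothetical, and the sketch assumes the two
aligned rows have already been extracted and have equal length.

```python
def fill_markx2_periods(first_row, second_row):
    """Replace each '.' in second_row with the same column of first_row."""
    if len(first_row) != len(second_row):
        raise ValueError("aligned rows must have the same length")
    return "".join(
        a if b == "." else b for a, b in zip(first_row, second_row)
    )


first = "GDVEKGKKIF"
second = "...F...R.."      # '.' marks a column identical to the first row
print(fill_markx2_periods(first, second))   # GDVFKGKRIF
```

After this substitution, the second sequence can be filtered like any
other, keeping only alphabetic characters.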


BLAST-output Formats
********************

    NOTE: With one or two exceptions, this implementation can read
    and understand the output from the BLASTN, BLASTP or BLASTX
    programs (and maybe even the TBLAST* programs, although that has
    not been tested yet). The exceptions are:

       1. Automatic recognition of the BLAST-output format
          requires that one of the keywords BLASTN, BLASTP or
          BLASTX be the first word in the file (possibly after an
          e-mail header). Many of the BLAST e-mail servers prepend
          a description of their service before the actual BLAST
          output, and so disrupt the recognition by the package. So,
          for output obtained from an e-mail server, the input format
          must be set explicitly.
       2. The implementation was tested on output generated by
          versions 1.2 and 1.4.9. If the output is different in version
          1.3 or 2.0, the implementation may not work (although the
          implementation can correctly handle gaps in the
          alignments, so that change from 1.* to 2.0 is handled).

The read operation first scans the text occurring before the first
alignment in the file. This initial text is ignored, except where it gives
information about the sequences being aligned. The initial texts of
some of the output formats contain lines of the following form:

Query= gb:A02201|acc:A02201 Phage phi-105 DNA for immF plypeptide - Bacteriophage phi-
 (665 letters)

The text after "Query=" and before the line containing the "(... letters)"
is parsed as a one-line description, and the number inside the "(...
letters)" is taken as the length of the query sequence.

Then, the read operation reads the "entries" of the file, where each
entry is considered to be the text describing an alignment between two
sequences.
The BLAST alignment format consists of header lines
specifying the sequence that matches the query, followed by one or
more pairwise alignments of substrings of the matching sequence and
the query. The read operation first scans the header lines, which are of
the form:

>emb|X02799|NCPHI105 Bacillus subtilis phage phi 105 immunity region with
 repressor gene and ORF >emb|A11144|A11144 phage phi 105 repressor
 (ORF1)-Orf 2 genes and there flanking regions
 Length = 1306

where the "Length =" line ends the list of one-line descriptions of the
sequences that match the query (in the next pairwise alignment(s)). It
extracts the one-line description and length of the sequence.

The read operation considers an "entry" to consist only of the actual
score reporting text and pairwise alignment text. So, while the header
lines above are scanned for their information, the entry reported by the
package begins at the line containing either "Plus Strand HSPs:",
"Minus Strand HSPs:" or "Score =". The entry ends just after the
last line of the pairwise alignment text. This is done to make the entry
text reported by the package more uniform. Thus, the following BLAST
output would be reported as two entries, the first beginning at the
"Plus Strand HSPs:" line and running through the first pairwise
alignment, and the second beginning with the "Score = 89..." line. The
header lines will not be reported as part of any entry, and will only be
scanned to extract the one-line description and length information.

>emb|Z68118|CER01E6 Caenorhabditis elegans cosmid R01E6
 Length = 40,937

 Plus Strand HSPs:

 Score = 127 (35.1 bits), Expect = 3.2, Sum P(2) = 0.96
 Identities = 39/56 (69%), Positives = 39/56 (69%), Strand = Plus / Plus

Query:   426 ATTTTAATAAATCTGGATTTAAATGTGTTAAAAATGACGGAAATACAAGTAGTTGA 481
             ||||||||||||||    ||||||  | |||||||||  | || |    || || |
Sbjct: 35266 ATTTTAATAAATCTCATCTTAAATTAGATAAAAATGAATGCAAAATTTATATTTTA 35321

 Score = 89 (24.6 bits), Expect = 3.2, Sum P(2) = 0.96
 Identities = 25/34 (73%), Positives = 25/34 (73%), Strand = Plus / Plus

Query:    93 ACAATACTAAAAAAGACGGAAATACAAGTATTTT 126
             ||||||||||||||    | ||   || ||||||
Sbjct: 31613 ACAATACTAAAAAATCTTGTAAACAAAATATTTT 31646

The getseq operation extracts the appropriate sequence from the entry
(the first or second sequence if the `seqfseqno' value is 1 or 2,
respectively). All alphabetic characters are considered part of the
sequence.

The rawseq operation is the same as the getseq operation, except that
all non-whitespace and non-numeric characters are considered part of
the sequence.

The getinfo operation extracts a main identifier, a description and an
alphabet for the appropriate sequence, if available. It also constructs a
comment that begins with the following:

From BLASTN/BLASTP/BLASTX output alignment of:
 >gb:A02201|acc:A02201 Phage phi-105 DNA for immF plypeptide - Bacteriophage phi
and
 >emb|X02799|NCPHI105 Bacillus subtilis phage phi 105 immunity
 region with repressor gene and ORF
 >emb|A11144|A11144 phage phi 105 repressor (ORF1)-Orf 2 genes
 and there flanking regions

This gives the name of the program whose output is being parsed, and
the descriptions of the two sequences whose alignment produced the
current sequence.
This text is then followed by any information from the
alignment describing the score of that pairwise alignment.

There is no putseq or annotate operation.


James R. Knight, knight@cs.ucdavis.edu
June 28, 1996