1""" 2FASTA/QUAL format (:mod:`skbio.io.format.fasta`) 3================================================ 4 5.. currentmodule:: skbio.io.format.fasta 6 7The FASTA file format (``fasta``) stores biological (i.e., nucleotide or 8protein) sequences in a simple plain text format that is both human-readable 9and easy to parse. The file format was first introduced and used in the FASTA 10software package [1]_. Additional descriptions of the file format can be found 11in [2]_ and [3]_. 12 13An example of a FASTA-formatted file containing two DNA sequences:: 14 15 >seq1 db-accession-149855 16 CGATGTCGATCGATCGATCGATCAG 17 >seq2 db-accession-34989 18 CATCGATCGATCGATGCATGCATGCATG 19 20The QUAL file format is an additional format related to FASTA. A FASTA file is 21sometimes accompanied by a QUAL file, particuarly when the FASTA file contains 22sequences generated on a high-throughput sequencing instrument. QUAL files 23store a Phred quality score (nonnegative integer) for each base in a sequence 24stored in FASTA format (see [4]_ for more details). scikit-bio supports reading 25and writing FASTA (and optionally QUAL) file formats. 26 27Format Support 28-------------- 29**Has Sniffer: Yes** 30 31+------+------+---------------------------------------------------------------+ 32|Reader|Writer| Object Class | 33+======+======+===============================================================+ 34|Yes |Yes |generator of :mod:`skbio.sequence.Sequence` objects | 35+------+------+---------------------------------------------------------------+ 36|Yes |Yes |:mod:`skbio.alignment.TabularMSA` | 37+------+------+---------------------------------------------------------------+ 38|Yes |Yes |:mod:`skbio.sequence.Sequence` | 39+------+------+---------------------------------------------------------------+ 40|Yes |Yes |:mod:`skbio.sequence.DNA` | 41+------+------+---------------------------------------------------------------+ 42|Yes |Yes |:mod:`skbio.sequence.RNA` | 43+------+------+---------------------------------------------------------------+ 44|Yes |Yes |:mod:`skbio.sequence.Protein` | 45+------+------+---------------------------------------------------------------+ 46 47.. note:: All readers and writers support an optional QUAL file via the 48 ``qual`` parameter. If one is provided, quality scores will be read/written 49 in addition to FASTA sequence data. 50 51Format Specification 52-------------------- 53The following sections define the FASTA and QUAL file formats in detail. 54 55FASTA Format 56^^^^^^^^^^^^ 57A FASTA file contains one or more biological sequences. The sequences are 58stored sequentially, with a *record* for each sequence (also referred to as a 59*FASTA record*). Each *record* consists of a single-line *header* (sometimes 60referred to as a *defline*, *label*, *description*, or *comment*) followed by 61the sequence data, optionally split over multiple lines. 62 63.. note:: Blank or whitespace-only lines are only allowed at the beginning of 64 the file, between FASTA records, or at the end of the file. A blank or 65 whitespace-only line after the header line, within the sequence (for FASTA 66 files), or within quality scores (for QUAL files) will raise an error. 67 68 scikit-bio will ignore leading and trailing whitespace characters on each 69 line while reading. 70 71.. note:: scikit-bio does not currently support legacy FASTA format (i.e., 72 headers/comments denoted with a semicolon). The format supported by 73 scikit-bio (described below in detail) most closely resembles the 74 description given in NCBI's BLAST documentation [3]_. See [2]_ for more 75 details on legacy FASTA format. If you would like legacy FASTA format 76 support added to scikit-bio, please consider submitting a feature request on 77 the 78 `scikit-bio issue tracker <https://github.com/biocore/scikit-bio/issues>`_ 79 (pull requests are also welcome!). 80 81Sequence Header 82~~~~~~~~~~~~~~~ 83Each sequence header consists of a single line beginning with a greater-than 84(``>``) symbol. Immediately following this is a sequence identifier (ID) and 85description separated by one or more whitespace characters. 86 87.. note:: When reading a FASTA-formatted file, the sequence ID and description 88 are stored in the sequence `metadata` attribute, under the `'id'` and 89 `'description'` keys, repectively. Both are optional. Each will be 90 represented as the empty string (``''``) in `metadata` if it is not present 91 in the header. 92 93 When writing a FASTA-formatted file, sequence `metadata` identified by keys 94 `'id'` and `'description'` will be converted to strings and written as the 95 sequence identifier and description, respectively. Each will be written as 96 the empty string if not present in sequence `metadata`. 97 98A sequence ID consists of a single *word*: all characters after the greater- 99than symbol and before the first whitespace character (if any) are taken as the 100sequence ID. Unique sequence IDs are not strictly enforced by the FASTA format 101itself. A single standardized ID format is similarly not enforced by the FASTA 102format, though it is often common to use a unique library accession number for 103a sequence ID (e.g., NCBI's FASTA defline format [5]_). 104 105If a description is present, it is taken as the remaining characters that 106follow the sequence ID and initial whitespace(s). The description is considered 107additional information about the sequence (e.g., comments about the source of 108the sequence or the molecule that it encodes). 109 110For example, consider the following header:: 111 112 >seq1 db-accession-149855 113 114``seq1`` is the sequence ID and ``db-accession-149855`` is the sequence 115description. 116 117.. note:: scikit-bio's readers will remove all leading and trailing whitespace 118 from the description. If a header line begins with whitespace following the 119 ``>``, the ID is assumed to be missing and the remainder of the line is 120 taken as the description. 121 122Sequence Data 123~~~~~~~~~~~~~ 124Biological sequence data follows the header, and can be split over multiple 125lines. The sequence data (i.e., nucleotides or amino acids) are stored using 126the standard IUPAC lexicon (single-letter codes). 127 128.. note:: scikit-bio supports both upper and lower case characters. 129 This functionality depends on the type of object the data is 130 being read into. For ``Sequence`` objects, sciki-bio doesn't care about the 131 case. Other sequence objects do, but all provide the `lowercase` parameter 132 to control case functionality. Refer to each class's respective constructor 133 documentation for details. 134 135 Both ``-`` and ``.`` are supported as gap characters when reading into 136 ``DNA``, ``RNA``, and ``Protein`` sequence objects. 137 138 Validation is performed when reading into scikit-bio sequence objects that 139 enforce an alphabet (e.g., ``DNA``, ``RNA``, ``Protein``). If any invalid 140 characters are found while reading from the FASTA file, an exception is 141 raised. 142 143QUAL Format 144^^^^^^^^^^^ 145A QUAL file contains quality scores for one or more biological sequences stored 146in a corresponding FASTA file. QUAL format is very similar to FASTA format: it 147stores records sequentially, with each record beginning with a header line 148containing a sequence ID and description. The same rules apply to QUAL headers 149as FASTA headers (see the above sections for details). scikit-bio processes 150FASTA and QUAL headers in exactly the same way. 151 152Instead of storing biological sequence data in each record, a QUAL file stores 153a Phred quality score for each base in the corresponding sequence. Quality 154scores are represented as nonnegative integers separated by whitespace 155(typically a single space or newline), and can span multiple lines. 156 157.. note:: When reading a QUAL-formatted file, quality scores are stored in the 158 sequence's `positional_metadata` attribute under the `'quality'` column. 159 160 When writing a QUAL-formatted file, a sequence's `positional_metadata` 161 `'quality'` column will be written as the quality scores. 162 163.. note:: When reading FASTA and QUAL files, scikit-bio requires records to be 164 in the same order in both files (i.e., each FASTA and QUAL record must have 165 the same ID and description after being parsed). In addition to having the 166 same order, the number of FASTA records must match the number of QUAL 167 records (i.e., missing or additonal records are not allowed). scikit-bio 168 also requires that the number of quality scores match the number of bases in 169 the corresponding sequence. 170 171 When writing FASTA and QUAL files, scikit-bio will maintain the same 172 ordering of records in both files (i.e., using the same ID and description 173 in both records) to support future reading. 174 175Format Parameters 176----------------- 177The following parameters are available to change how FASTA/QUAL files are read 178or written in scikit-bio. 179 180QUAL File Parameter (Readers and Writers) 181^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 182The ``qual`` parameter is available to all FASTA format readers and writers. It 183can be any file-like type supported by scikit-bio's I/O registry (e.g., file 184handle, file path, etc.). If ``qual`` is provided when reading, quality scores 185will be included in each in-memory ``Sequence`` object, in addition 186to sequence data stored in the FASTA file. When writing, quality scores will be 187written in QUAL format in addition to the sequence data being written in FASTA 188format. 189 190Reader-specific Parameters 191^^^^^^^^^^^^^^^^^^^^^^^^^^ 192The available reader parameters differ depending on which reader is used. 193 194Generator and TabularMSA Reader Parameters 195~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 196The ``constructor`` parameter can be used with the ``Sequence`` generator and 197``TabularMSA`` FASTA readers. ``constructor`` specifies the type of in-memory 198sequence object to read each sequence into. For example, if you know that the 199FASTA file you're reading contains protein sequences, you would pass 200``constructor=Protein`` to the reader call. 201 202When reading into a ``Sequence`` generator, ``constructor`` defaults to 203``Sequence`` and must be a subclass of ``Sequence`` if supplied. 204 205When reading into a ``TabularMSA``, ``constructor`` is a required format 206parameter and must be a subclass of ``GrammaredSequence`` (e.g., ``DNA``, 207``RNA``, ``Protein``). 208 209.. note:: The FASTA sniffer will not attempt to guess the ``constructor`` 210 parameter. 211 212Sequence Reader Parameters 213~~~~~~~~~~~~~~~~~~~~~~~~~~ 214The ``seq_num`` parameter can be used with the ``Sequence``, 215``DNA``, ``RNA``, and ``Protein`` FASTA readers. ``seq_num`` specifies which 216sequence to read from the FASTA file (and optional QUAL file), and defaults to 2171 (i.e., such that the first sequence is read). For example, to read the 50th 218sequence from a FASTA file, you would pass ``seq_num=50`` to the reader call. 219 220Writer-specific Parameters 221^^^^^^^^^^^^^^^^^^^^^^^^^^ 222The following parameters are available to all FASTA format writers: 223 224- ``id_whitespace_replacement``: string to replace **each** whitespace 225 character in a sequence ID. This parameter is useful for cases where an 226 in-memory sequence ID contains whitespace, which would result in an on-disk 227 representation that would not be read back into memory as the same ID (since 228 IDs in FASTA format cannot contain whitespace). Defaults to ``_``. If 229 ``None``, no whitespace replacement is performed and IDs are written as they 230 are stored in memory (this has the potential to create an invalid 231 FASTA-formatted file; see note below). This parameter also applies to a QUAL 232 file if one is provided. 233 234- ``description_newline_replacement``: string to replace **each** newline 235 character in a sequence description. Since a FASTA header must be a single 236 line, newlines are not allowed in sequence descriptions and must be replaced 237 in order to write a valid FASTA file. Defaults to a single space. If 238 ``None``, no newline replacement is performed and descriptions are written as 239 they are stored in memory (this has the potential to create an invalid 240 FASTA-formatted file; see note below). This parameter also applies to a QUAL 241 file if one is provided. 242 243- ``max_width``: integer specifying the maximum line width (i.e., number of 244 characters) for sequence data and/or quality scores. If a sequence or its 245 quality scores are longer than ``max_width``, it will be split across 246 multiple lines, each with a maximum width of ``max_width``. Note that there 247 are some caveats when splitting quality scores. A single quality score will 248 *never* be split across multiple lines, otherwise it would become two 249 different quality scores when read again. Thus, splitting only occurs 250 *between* quality scores. This makes it possible to have a single long 251 quality score written on its own line that exceeds ``max_width``. For 252 example, the quality score ``12345`` would not be split across multiple lines 253 even if ``max_width=3``. Thus, a 5-character line would be written. Default 254 behavior is to not split sequence data or quality scores across multiple 255 lines. 256 257- ``lowercase``: String or boolean array. If a string, it is treated as a key 258 into the positional metadata of the object. If a boolean array, it 259 indicates characters to write in lowercase. Characters in the sequence 260 corresponding to `True` values will be written in lowercase. The boolean 261 array must be the same length as the sequence. 262 263.. note:: The FASTA format writers will have noticeably better runtime 264 performance if ``id_whitespace_replacement`` and/or 265 ``description_newline_replacement`` are set to ``None`` so that whitespace 266 replacement is not performed during writing. However, this can potentially 267 create invalid FASTA files, especially if there are newline characters in 268 the IDs or descriptions. For IDs with whitespace, this can also affect how 269 the IDs are read into memory in a subsequent read operation. For example, if 270 an in-memory sequence ID is ``'seq 1'`` and 271 ``id_whitespace_replacement=None``, reading the FASTA file back into memory 272 would result in an ID of ``'seq'``, and ``'1'`` would be part of the 273 sequence description. 274 275Examples 276-------- 277 278Reading and Writing FASTA Files 279^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 280Suppose we have the following FASTA file with five equal-length sequences 281(example modified from [6]_):: 282 283 >seq1 Turkey 284 AAGCTNGGGCATTTCAGGGTGAGCCCGGGCAATACAGGGTAT 285 >seq2 Salmo gair 286 AAGCCTTGGCAGTGCAGGGTGAGCCGTGG 287 CCGGGCACGGTAT 288 >seq3 H. Sapiens 289 ACCGGTTGGCCGTTCAGGGTACAGGTTGGCCGTTCAGGGTAA 290 >seq4 Chimp 291 AAACCCTTGCCG 292 TTACGCTTAAAC 293 CGAGGCCGGGAC 294 ACTCAT 295 >seq5 Gorilla 296 AAACCCTTGCCGGTACGCTTAAACCATTGCCGGTACGCTTAA 297 298.. note:: Original copyright notice for the above example file: 299 300 *(c) Copyright 1986-2008 by The University of Washington. Written by Joseph 301 Felsenstein. Permission is granted to copy this document provided that no 302 fee is charged for it and that this copyright notice is not removed.* 303 304Note that the sequences are not required to be of equal length in order for the 305file to be a valid FASTA file (this depends on the object that you're reading 306the file into). Also note that some of the sequences occur on a single line, 307while others are split across multiple lines. 308 309Let's define this file in-memory as a ``StringIO``, though this could be a real 310file path, file handle, or anything that's supported by scikit-bio's I/O 311registry in practice: 312 313>>> fl = [">seq1 Turkey\\n", 314... "AAGCTNGGGCATTTCAGGGTGAGCCCGGGCAATACAGGGTAT\\n", 315... ">seq2 Salmo gair\\n", 316... "AAGCCTTGGCAGTGCAGGGTGAGCCGTGG\\n", 317... "CCGGGCACGGTAT\\n", 318... ">seq3 H. Sapiens\\n", 319... "ACCGGTTGGCCGTTCAGGGTACAGGTTGGCCGTTCAGGGTAA\\n", 320... ">seq4 Chimp\\n", 321... "AAACCCTTGCCG\\n", 322... "TTACGCTTAAAC\\n", 323... "CGAGGCCGGGAC\\n", 324... "ACTCAT\\n", 325... ">seq5 Gorilla\\n", 326... "AAACCCTTGCCGGTACGCTTAAACCATTGCCGGTACGCTTAA\\n"] 327 328Since these sequences are of equal length (presumably because they've been 329aligned), let's read the FASTA file into a ``TabularMSA`` object: 330 331>>> from skbio import TabularMSA, DNA 332>>> msa = TabularMSA.read(fl, constructor=DNA) 333>>> msa 334TabularMSA[DNA] 335------------------------------------------ 336Stats: 337 sequence count: 5 338 position count: 42 339------------------------------------------ 340AAGCTNGGGCATTTCAGGGTGAGCCCGGGCAATACAGGGTAT 341AAGCCTTGGCAGTGCAGGGTGAGCCGTGGCCGGGCACGGTAT 342ACCGGTTGGCCGTTCAGGGTACAGGTTGGCCGTTCAGGGTAA 343AAACCCTTGCCGTTACGCTTAAACCGAGGCCGGGACACTCAT 344AAACCCTTGCCGGTACGCTTAAACCATTGCCGGTACGCTTAA 345 346Note that we didn't specify a file format in the ``read`` call. The FASTA 347sniffer detected the correct file format for us! 348 349To write the ``TabularMSA`` in FASTA format: 350 351>>> from io import StringIO 352>>> with StringIO() as fh: 353... print(msa.write(fh).getvalue()) 354>seq1 Turkey 355AAGCTNGGGCATTTCAGGGTGAGCCCGGGCAATACAGGGTAT 356>seq2 Salmo gair 357AAGCCTTGGCAGTGCAGGGTGAGCCGTGGCCGGGCACGGTAT 358>seq3 H. Sapiens 359ACCGGTTGGCCGTTCAGGGTACAGGTTGGCCGTTCAGGGTAA 360>seq4 Chimp 361AAACCCTTGCCGTTACGCTTAAACCGAGGCCGGGACACTCAT 362>seq5 Gorilla 363AAACCCTTGCCGGTACGCTTAAACCATTGCCGGTACGCTTAA 364<BLANKLINE> 365 366``TabularMSA`` loads all of the sequences from the FASTA file into memory at 367once. If the FASTA file is large (which is often the case), this may be 368infeasible if you don't have enough memory. To work around this issue, you can 369stream the sequences using scikit-bio's generator-based FASTA reader and 370writer. The generator-based reader yields ``Sequence`` objects (or subclasses 371if ``constructor`` is supplied) one at a time, instead of loading all sequences 372into memory. For example, let's use the generator-based reader to process a 373single sequence at a time in a ``for`` loop: 374 375>>> import skbio.io 376>>> for seq in skbio.io.read(fl, format='fasta'): 377... seq 378... print('') 379Sequence 380------------------------------------------------ 381Metadata: 382 'description': 'Turkey' 383 'id': 'seq1' 384Stats: 385 length: 42 386------------------------------------------------ 3870 AAGCTNGGGC ATTTCAGGGT GAGCCCGGGC AATACAGGGT AT 388<BLANKLINE> 389Sequence 390------------------------------------------------ 391Metadata: 392 'description': 'Salmo gair' 393 'id': 'seq2' 394Stats: 395 length: 42 396------------------------------------------------ 3970 AAGCCTTGGC AGTGCAGGGT GAGCCGTGGC CGGGCACGGT AT 398<BLANKLINE> 399Sequence 400------------------------------------------------ 401Metadata: 402 'description': 'H. Sapiens' 403 'id': 'seq3' 404Stats: 405 length: 42 406------------------------------------------------ 4070 ACCGGTTGGC CGTTCAGGGT ACAGGTTGGC CGTTCAGGGT AA 408<BLANKLINE> 409Sequence 410------------------------------------------------ 411Metadata: 412 'description': 'Chimp' 413 'id': 'seq4' 414Stats: 415 length: 42 416------------------------------------------------ 4170 AAACCCTTGC CGTTACGCTT AAACCGAGGC CGGGACACTC AT 418<BLANKLINE> 419Sequence 420------------------------------------------------ 421Metadata: 422 'description': 'Gorilla' 423 'id': 'seq5' 424Stats: 425 length: 42 426------------------------------------------------ 4270 AAACCCTTGC CGGTACGCTT AAACCATTGC CGGTACGCTT AA 428<BLANKLINE> 429 430A single sequence can also be read into a ``Sequence`` (or subclass): 431 432>>> from skbio import Sequence 433>>> seq = Sequence.read(fl) 434>>> seq 435Sequence 436------------------------------------------------ 437Metadata: 438 'description': 'Turkey' 439 'id': 'seq1' 440Stats: 441 length: 42 442------------------------------------------------ 4430 AAGCTNGGGC ATTTCAGGGT GAGCCCGGGC AATACAGGGT AT 444 445By default, the first sequence in the FASTA file is read. This can be 446controlled with ``seq_num``. For example, to read the fifth sequence: 447 448>>> seq = Sequence.read(fl, seq_num=5) 449>>> seq 450Sequence 451------------------------------------------------ 452Metadata: 453 'description': 'Gorilla' 454 'id': 'seq5' 455Stats: 456 length: 42 457------------------------------------------------ 4580 AAACCCTTGC CGGTACGCTT AAACCATTGC CGGTACGCTT AA 459 460We can use the same API to read the fifth sequence into a ``DNA`` sequence: 461 462>>> dna_seq = DNA.read(fl, seq_num=5) 463>>> dna_seq 464DNA 465------------------------------------------------ 466Metadata: 467 'description': 'Gorilla' 468 'id': 'seq5' 469Stats: 470 length: 42 471 has gaps: False 472 has degenerates: False 473 has definites: True 474 GC-content: 50.00% 475------------------------------------------------ 4760 AAACCCTTGC CGGTACGCTT AAACCATTGC CGGTACGCTT AA 477 478Individual sequence objects can also be written in FASTA format: 479 480>>> with StringIO() as fh: 481... print(dna_seq.write(fh).getvalue()) 482>seq5 Gorilla 483AAACCCTTGCCGGTACGCTTAAACCATTGCCGGTACGCTTAA 484<BLANKLINE> 485 486Reading and Writing FASTA/QUAL Files 487^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 488In addition to reading and writing standalone FASTA files, scikit-bio supports 489reading and writing FASTA and QUAL files together. Suppose we have the 490following FASTA file:: 491 492 >seq1 db-accession-149855 493 CGATGTC 494 >seq2 db-accession-34989 495 CATCGTC 496 497Also suppose we have the following QUAL file:: 498 499 >seq1 db-accession-149855 500 40 39 39 4 501 50 1 100 502 >seq2 db-accession-34989 503 3 3 10 42 80 80 79 504 505>>> fasta_fl = [ 506... ">seq1 db-accession-149855\\n", 507... "CGATGTC\\n", 508... ">seq2 db-accession-34989\\n", 509... "CATCGTC\\n"] 510>>> qual_fl = [ 511... ">seq1 db-accession-149855\\n", 512... "40 39 39 4\\n", 513... "50 1 100\\n", 514... ">seq2 db-accession-34989\\n", 515... "3 3 10 42 80 80 79\\n"] 516 517To read in a single ``Sequence`` at a time, we can use the 518generator-based reader as we did above, providing both FASTA and QUAL files: 519 520>>> for seq in skbio.io.read(fasta_fl, qual=qual_fl, format='fasta'): 521... seq 522... print('') 523Sequence 524---------------------------------------- 525Metadata: 526 'description': 'db-accession-149855' 527 'id': 'seq1' 528Positional metadata: 529 'quality': <dtype: uint8> 530Stats: 531 length: 7 532---------------------------------------- 5330 CGATGTC 534<BLANKLINE> 535Sequence 536--------------------------------------- 537Metadata: 538 'description': 'db-accession-34989' 539 'id': 'seq2' 540Positional metadata: 541 'quality': <dtype: uint8> 542Stats: 543 length: 7 544--------------------------------------- 5450 CATCGTC 546<BLANKLINE> 547 548Note that the sequence objects have quality scores stored as positional 549metadata since we provided a QUAL file. The other FASTA readers operate in a 550similar manner. 551 552Now let's load the sequences and their quality scores into a ``TabularMSA``: 553 554>>> msa = TabularMSA.read(fasta_fl, qual=qual_fl, constructor=DNA) 555>>> msa 556TabularMSA[DNA] 557--------------------- 558Stats: 559 sequence count: 2 560 position count: 7 561--------------------- 562CGATGTC 563CATCGTC 564 565To write the sequence data and quality scores in the ``TabularMSA`` to FASTA 566and QUAL files, respectively: 567 568>>> new_fasta_fh = StringIO() 569>>> new_qual_fh = StringIO() 570>>> _ = msa.write(new_fasta_fh, qual=new_qual_fh) 571>>> print(new_fasta_fh.getvalue()) 572>seq1 db-accession-149855 573CGATGTC 574>seq2 db-accession-34989 575CATCGTC 576<BLANKLINE> 577>>> print(new_qual_fh.getvalue()) 578>seq1 db-accession-149855 57940 39 39 4 50 1 100 580>seq2 db-accession-34989 5813 3 10 42 80 80 79 582<BLANKLINE> 583>>> new_fasta_fh.close() 584>>> new_qual_fh.close() 585 586References 587---------- 588.. [1] Lipman, DJ; Pearson, WR (1985). "Rapid and sensitive protein similarity 589 searches". Science 227 (4693): 1435-41. 590.. [2] http://en.wikipedia.org/wiki/FASTA_format 591.. [3] http://blast.ncbi.nlm.nih.gov/blastcgihelp.shtml 592.. [4] https://www.broadinstitute.org/crd/wiki/index.php/Qual 593.. [5] Madden T. The BLAST Sequence Analysis Tool. 2002 Oct 9 594 [Updated 2003 Aug 13]. In: McEntyre J, Ostell J, editors. The NCBI Handbook 595 [Internet]. Bethesda (MD): National Center for Biotechnology Information 596 (US); 2002-. Chapter 16. Available from: 597 http://www.ncbi.nlm.nih.gov/books/NBK21097/ 598.. [6] http://evolution.genetics.washington.edu/phylip/doc/sequence.html 599 600""" 601 602# ---------------------------------------------------------------------------- 603# Copyright (c) 2013--, scikit-bio development team. 604# 605# Distributed under the terms of the Modified BSD License. 606# 607# The full license is in the file COPYING.txt, distributed with this software. 608# ---------------------------------------------------------------------------- 609 610import itertools 611import textwrap 612 613import numpy as np 614 615from skbio.io import create_format, FASTAFormatError, QUALFormatError 616from skbio.io.registry import FileSentinel 617from skbio.io.format._base import (_get_nth_sequence, 618 _parse_fasta_like_header, 619 _format_fasta_like_records, _line_generator, 620 _too_many_blanks) 621from skbio.util._misc import chunk_str 622from skbio.alignment import TabularMSA 623from skbio.sequence import Sequence, DNA, RNA, Protein 624 625 626fasta = create_format('fasta') 627 628 629@fasta.sniffer() 630def _fasta_sniffer(fh): 631 # Strategy: 632 # Ignore up to 5 blank/whitespace-only lines at the beginning of the 633 # file. Read up to 10 records. If at least one record is read (i.e. 634 # the file isn't empty) and no errors are thrown during reading, assume 635 # the file is in FASTA format. If a record appears to be QUAL, do *not* 636 # identify the file as FASTA since we don't want to sniff QUAL files as 637 # FASTA (technically they can be read as FASTA since the sequences may 638 # not be validated but it probably isn't what the user wanted). Also, if 639 # we add QUAL as its own file format in the future, we wouldn't want the 640 # FASTA and QUAL sniffers to both positively identify a QUAL file. 641 if _too_many_blanks(fh, 5): 642 return False, {} 643 644 num_records = 10 645 empty = True 646 try: 647 parser = _parse_fasta_raw(fh, _sniffer_data_parser, FASTAFormatError) 648 for _ in zip(range(num_records), parser): 649 empty = False 650 except FASTAFormatError: 651 return False, {} 652 653 if empty: 654 return False, {} 655 else: 656 return True, {} 657 658 659def _sniffer_data_parser(chunks): 660 data = _parse_sequence_data(chunks) 661 try: 662 _parse_quality_scores(chunks) 663 except QUALFormatError: 664 return data 665 else: 666 # used for flow control within sniffer, user should never see this 667 # message 668 raise FASTAFormatError('Data appear to be quality scores.') 669 670 671@fasta.reader(None) 672def _fasta_to_generator(fh, qual=FileSentinel, constructor=Sequence, **kwargs): 673 if qual is None: 674 for seq, id_, desc in _parse_fasta_raw(fh, _parse_sequence_data, 675 FASTAFormatError): 676 yield constructor(seq, metadata={'id': id_, 'description': desc}, 677 **kwargs) 678 else: 679 fasta_gen = _parse_fasta_raw(fh, _parse_sequence_data, 680 FASTAFormatError) 681 qual_gen = _parse_fasta_raw(qual, _parse_quality_scores, 682 QUALFormatError) 683 684 for fasta_rec, qual_rec in itertools.zip_longest(fasta_gen, qual_gen, 685 fillvalue=None): 686 if fasta_rec is None: 687 raise FASTAFormatError( 688 "QUAL file has more records than FASTA file.") 689 if qual_rec is None: 690 raise FASTAFormatError( 691 "FASTA file has more records than QUAL file.") 692 693 fasta_seq, fasta_id, fasta_desc = fasta_rec 694 qual_scores, qual_id, qual_desc = qual_rec 695 696 if fasta_id != qual_id: 697 raise FASTAFormatError( 698 "IDs do not match between FASTA and QUAL records: %r != %r" 699 % (str(fasta_id), str(qual_id))) 700 if fasta_desc != qual_desc: 701 raise FASTAFormatError( 702 "Descriptions do not match between FASTA and QUAL " 703 "records: %r != %r" % (str(fasta_desc), str(qual_desc))) 704 705 # sequence and quality scores lengths are checked in constructor 706 yield constructor( 707 fasta_seq, 708 metadata={'id': fasta_id, 'description': fasta_desc}, 709 positional_metadata={'quality': qual_scores}, **kwargs) 710 711 712@fasta.reader(Sequence) 713def _fasta_to_sequence(fh, qual=FileSentinel, seq_num=1, **kwargs): 714 return _get_nth_sequence( 715 _fasta_to_generator(fh, qual=qual, constructor=Sequence, **kwargs), 716 seq_num) 717 718 719@fasta.reader(DNA) 720def _fasta_to_dna(fh, qual=FileSentinel, seq_num=1, **kwargs): 721 return _get_nth_sequence( 722 _fasta_to_generator(fh, qual=qual, 723 constructor=DNA, **kwargs), 724 seq_num) 725 726 727@fasta.reader(RNA) 728def _fasta_to_rna(fh, qual=FileSentinel, seq_num=1, **kwargs): 729 return _get_nth_sequence( 730 _fasta_to_generator(fh, qual=qual, 731 constructor=RNA, **kwargs), 732 seq_num) 733 734 735@fasta.reader(Protein) 736def _fasta_to_protein(fh, qual=FileSentinel, seq_num=1, **kwargs): 737 return _get_nth_sequence( 738 _fasta_to_generator(fh, qual=qual, 739 constructor=Protein, **kwargs), 740 seq_num) 741 742 743@fasta.reader(TabularMSA) 744def _fasta_to_tabular_msa(fh, qual=FileSentinel, constructor=None, **kwargs): 745 if constructor is None: 746 raise ValueError("Must provide `constructor`.") 747 748 return TabularMSA( 749 _fasta_to_generator(fh, qual=qual, constructor=constructor, **kwargs)) 750 751 752@fasta.writer(None) 753def _generator_to_fasta(obj, fh, qual=FileSentinel, 754 id_whitespace_replacement='_', 755 description_newline_replacement=' ', max_width=None, 756 lowercase=None): 757 if max_width is not None: 758 if max_width < 1: 759 raise ValueError( 760 "Maximum line width must be greater than zero (max_width=%d)." 761 % max_width) 762 if qual is not None: 763 # define text wrapper for splitting quality scores here for 764 # efficiency. textwrap docs recommend reusing a TextWrapper 765 # instance when it is used many times. configure text wrapper to 766 # never break "words" (i.e., integer quality scores) across lines 767 qual_wrapper = textwrap.TextWrapper( 768 width=max_width, break_long_words=False, 769 break_on_hyphens=False) 770 771 formatted_records = _format_fasta_like_records( 772 obj, id_whitespace_replacement, description_newline_replacement, 773 qual is not None, lowercase) 774 for header, seq_str, qual_scores in formatted_records: 775 if max_width is not None: 776 seq_str = chunk_str(seq_str, max_width, '\n') 777 778 fh.write('>%s\n%s\n' % (header, seq_str)) 779 780 if qual is not None: 781 qual_str = ' '.join(np.asarray(qual_scores, dtype=np.str)) 782 if max_width is not None: 783 qual_str = qual_wrapper.fill(qual_str) 784 qual.write('>%s\n%s\n' % (header, qual_str)) 785 786 787@fasta.writer(Sequence) 788def _sequence_to_fasta(obj, fh, qual=FileSentinel, 789 id_whitespace_replacement='_', 790 description_newline_replacement=' ', max_width=None, 791 lowercase=None): 792 _sequences_to_fasta([obj], fh, qual, id_whitespace_replacement, 793 description_newline_replacement, max_width, lowercase) 794 795 796@fasta.writer(DNA) 797def _dna_to_fasta(obj, fh, qual=FileSentinel, id_whitespace_replacement='_', 798 description_newline_replacement=' ', max_width=None, 799 lowercase=None): 800 _sequences_to_fasta([obj], fh, qual, id_whitespace_replacement, 801 description_newline_replacement, max_width, lowercase) 802 803 804@fasta.writer(RNA) 805def _rna_to_fasta(obj, fh, qual=FileSentinel, id_whitespace_replacement='_', 806 description_newline_replacement=' ', max_width=None, 807 lowercase=None): 808 _sequences_to_fasta([obj], fh, qual, id_whitespace_replacement, 809 description_newline_replacement, max_width, lowercase) 810 811 812@fasta.writer(Protein) 813def _protein_to_fasta(obj, fh, qual=FileSentinel, 814 id_whitespace_replacement='_', 815 description_newline_replacement=' ', max_width=None, 816 lowercase=None): 817 _sequences_to_fasta([obj], fh, qual, id_whitespace_replacement, 818 description_newline_replacement, max_width, lowercase) 819 820 821@fasta.writer(TabularMSA) 822def _tabular_msa_to_fasta(obj, fh, qual=FileSentinel, 823 id_whitespace_replacement='_', 824 description_newline_replacement=' ', max_width=None, 825 lowercase=None): 826 _sequences_to_fasta(obj, fh, qual, id_whitespace_replacement, 827 description_newline_replacement, max_width, lowercase) 828 829 830def _parse_fasta_raw(fh, data_parser, error_type): 831 """Raw parser for FASTA or QUAL files. 832 833 Returns raw values (seq/qual, id, description). It is the responsibility of 834 the caller to construct the correct in-memory object to hold the data. 835 836 """ 837 # Skip any blank or whitespace-only lines at beginning of file 838 try: 839 seq_header = next(_line_generator(fh, skip_blanks=True)) 840 except StopIteration: 841 return 842 843 # header check inlined here and below for performance 844 if seq_header.startswith('>'): 845 id_, desc = _parse_fasta_like_header(seq_header) 846 else: 847 raise error_type( 848 "Found non-header line when attempting to read the 1st record:" 849 "\n%s" % seq_header) 850 851 data_chunks = [] 852 prev = seq_header 853 for line in _line_generator(fh, skip_blanks=False): 854 if line.startswith('>'): 855 # new header, so yield current record and reset state 856 yield data_parser(data_chunks), id_, desc 857 data_chunks = [] 858 id_, desc = _parse_fasta_like_header(line) 859 else: 860 if line: 861 # ensure no blank lines within a single record 862 if not prev: 863 raise error_type( 864 "Found blank or whitespace-only line within record.") 865 data_chunks.append(line) 866 prev = line 867 # yield last record in file 868 yield data_parser(data_chunks), id_, desc 869 870 871def _parse_sequence_data(chunks): 872 if not chunks: 873 raise FASTAFormatError("Found header without sequence data.") 874 return ''.join(chunks) 875 876 877def _parse_quality_scores(chunks): 878 if not chunks: 879 raise QUALFormatError("Found header without quality scores.") 880 881 qual_str = ' '.join(chunks) 882 try: 883 quality = np.asarray(qual_str.split(), dtype=int) 884 except ValueError: 885 raise QUALFormatError( 886 "Could not convert quality scores to integers:\n%s" 887 % str(qual_str)) 888 889 if (quality < 0).any(): 890 raise QUALFormatError( 891 "Encountered negative quality score(s). Quality scores must be " 892 "greater than or equal to zero.") 893 if (quality > 255).any(): 894 raise QUALFormatError( 895 "Encountered quality score(s) greater than 255. scikit-bio only " 896 "supports quality scores in the range 0-255 (inclusive) when " 897 "reading QUAL files.") 898 return quality.astype(np.uint8, casting='unsafe', copy=False) 899 900 901def _sequences_to_fasta(obj, fh, qual, id_whitespace_replacement, 902 description_newline_replacement, max_width, 903 lowercase=None): 904 def seq_gen(): 905 yield from obj 906 907 _generator_to_fasta( 908 seq_gen(), fh, qual=qual, 909 id_whitespace_replacement=id_whitespace_replacement, 910 description_newline_replacement=description_newline_replacement, 911 max_width=max_width, lowercase=lowercase) 912