1"""
2FASTA/QUAL format (:mod:`skbio.io.format.fasta`)
3================================================
4
5.. currentmodule:: skbio.io.format.fasta
6
7The FASTA file format (``fasta``) stores biological (i.e., nucleotide or
8protein) sequences in a simple plain text format that is both human-readable
9and easy to parse. The file format was first introduced and used in the FASTA
10software package [1]_. Additional descriptions of the file format can be found
11in [2]_ and [3]_.
12
13An example of a FASTA-formatted file containing two DNA sequences::
14
15    >seq1 db-accession-149855
16    CGATGTCGATCGATCGATCGATCAG
17    >seq2 db-accession-34989
18    CATCGATCGATCGATGCATGCATGCATG
19
20The QUAL file format is an additional format related to FASTA. A FASTA file is
21sometimes accompanied by a QUAL file, particuarly when the FASTA file contains
22sequences generated on a high-throughput sequencing instrument. QUAL files
23store a Phred quality score (nonnegative integer) for each base in a sequence
24stored in FASTA format (see [4]_ for more details). scikit-bio supports reading
25and writing FASTA (and optionally QUAL) file formats.
26
27Format Support
28--------------
29**Has Sniffer: Yes**
30
31+------+------+---------------------------------------------------------------+
32|Reader|Writer|                          Object Class                         |
33+======+======+===============================================================+
34|Yes   |Yes   |generator of :mod:`skbio.sequence.Sequence` objects            |
35+------+------+---------------------------------------------------------------+
36|Yes   |Yes   |:mod:`skbio.alignment.TabularMSA`                              |
37+------+------+---------------------------------------------------------------+
38|Yes   |Yes   |:mod:`skbio.sequence.Sequence`                                 |
39+------+------+---------------------------------------------------------------+
40|Yes   |Yes   |:mod:`skbio.sequence.DNA`                                      |
41+------+------+---------------------------------------------------------------+
42|Yes   |Yes   |:mod:`skbio.sequence.RNA`                                      |
43+------+------+---------------------------------------------------------------+
44|Yes   |Yes   |:mod:`skbio.sequence.Protein`                                  |
45+------+------+---------------------------------------------------------------+
46
47.. note:: All readers and writers support an optional QUAL file via the
48   ``qual`` parameter. If one is provided, quality scores will be read/written
49   in addition to FASTA sequence data.
50
51Format Specification
52--------------------
53The following sections define the FASTA and QUAL file formats in detail.
54
55FASTA Format
56^^^^^^^^^^^^
57A FASTA file contains one or more biological sequences. The sequences are
58stored sequentially, with a *record* for each sequence (also referred to as a
59*FASTA record*). Each *record* consists of a single-line *header* (sometimes
60referred to as a *defline*, *label*, *description*, or *comment*) followed by
61the sequence data, optionally split over multiple lines.
62
63.. note:: Blank or whitespace-only lines are only allowed at the beginning of
64   the file, between FASTA records, or at the end of the file. A blank or
65   whitespace-only line after the header line, within the sequence (for FASTA
66   files), or within quality scores (for QUAL files) will raise an error.
67
68   scikit-bio will ignore leading and trailing whitespace characters on each
69   line while reading.
70
71.. note:: scikit-bio does not currently support legacy FASTA format (i.e.,
72   headers/comments denoted with a semicolon). The format supported by
73   scikit-bio (described below in detail) most closely resembles the
74   description given in NCBI's BLAST documentation [3]_. See [2]_ for more
75   details on legacy FASTA format. If you would like legacy FASTA format
76   support added to scikit-bio, please consider submitting a feature request on
77   the
78   `scikit-bio issue tracker <https://github.com/biocore/scikit-bio/issues>`_
79   (pull requests are also welcome!).
80
81Sequence Header
82~~~~~~~~~~~~~~~
83Each sequence header consists of a single line beginning with a greater-than
84(``>``) symbol. Immediately following this is a sequence identifier (ID) and
85description separated by one or more whitespace characters.
86
87.. note:: When reading a FASTA-formatted file, the sequence ID and description
88   are stored in the sequence `metadata` attribute, under the `'id'` and
89   `'description'` keys, repectively. Both are optional. Each will be
90   represented as the empty string (``''``) in `metadata` if it is not present
91   in the header.
92
93   When writing a FASTA-formatted file, sequence `metadata` identified by keys
94   `'id'` and `'description'` will be converted to strings and written as the
95   sequence identifier and description, respectively. Each will be written as
96   the empty string if not present in sequence `metadata`.
97
98A sequence ID consists of a single *word*: all characters after the greater-
99than symbol and before the first whitespace character (if any) are taken as the
100sequence ID. Unique sequence IDs are not strictly enforced by the FASTA format
101itself. A single standardized ID format is similarly not enforced by the FASTA
102format, though it is often common to use a unique library accession number for
103a sequence ID (e.g., NCBI's FASTA defline format [5]_).
104
105If a description is present, it is taken as the remaining characters that
106follow the sequence ID and initial whitespace(s). The description is considered
107additional information about the sequence (e.g., comments about the source of
108the sequence or the molecule that it encodes).
109
110For example, consider the following header::
111
112    >seq1 db-accession-149855
113
114``seq1`` is the sequence ID and ``db-accession-149855`` is the sequence
115description.
116
117.. note:: scikit-bio's readers will remove all leading and trailing whitespace
118   from the description. If a header line begins with whitespace following the
119   ``>``, the ID is assumed to be missing and the remainder of the line is
120   taken as the description.
121
122Sequence Data
123~~~~~~~~~~~~~
124Biological sequence data follows the header, and can be split over multiple
125lines. The sequence data (i.e., nucleotides or amino acids) are stored using
126the standard IUPAC lexicon (single-letter codes).
127
128.. note:: scikit-bio supports both upper and lower case characters.
129   This functionality depends on the type of object the data is
130   being read into. For ``Sequence`` objects, sciki-bio doesn't care about the
131   case. Other sequence objects do, but all provide the `lowercase` parameter
132   to control case functionality. Refer to each class's respective constructor
133   documentation for details.
134
135   Both ``-`` and ``.`` are supported as gap characters when reading into
136   ``DNA``, ``RNA``, and ``Protein`` sequence objects.
137
138   Validation is performed when reading into scikit-bio sequence objects that
139   enforce an alphabet (e.g., ``DNA``, ``RNA``, ``Protein``). If any invalid
140   characters are found while reading from the FASTA file, an exception is
141   raised.
142
143QUAL Format
144^^^^^^^^^^^
145A QUAL file contains quality scores for one or more biological sequences stored
146in a corresponding FASTA file. QUAL format is very similar to FASTA format: it
147stores records sequentially, with each record beginning with a header line
148containing a sequence ID and description. The same rules apply to QUAL headers
149as FASTA headers (see the above sections for details). scikit-bio processes
150FASTA and QUAL headers in exactly the same way.
151
152Instead of storing biological sequence data in each record, a QUAL file stores
153a Phred quality score for each base in the corresponding sequence. Quality
154scores are represented as nonnegative integers separated by whitespace
155(typically a single space or newline), and can span multiple lines.
156
157.. note:: When reading a QUAL-formatted file, quality scores are stored in the
158   sequence's `positional_metadata` attribute under the `'quality'` column.
159
160   When writing a QUAL-formatted file, a sequence's `positional_metadata`
161   `'quality'` column will be written as the quality scores.
162
163.. note:: When reading FASTA and QUAL files, scikit-bio requires records to be
164   in the same order in both files (i.e., each FASTA and QUAL record must have
165   the same ID and description after being parsed). In addition to having the
166   same order, the number of FASTA records must match the number of QUAL
167   records (i.e., missing or additonal records are not allowed). scikit-bio
168   also requires that the number of quality scores match the number of bases in
169   the corresponding sequence.
170
171   When writing FASTA and QUAL files, scikit-bio will maintain the same
172   ordering of records in both files (i.e., using the same ID and description
173   in both records) to support future reading.
174
175Format Parameters
176-----------------
177The following parameters are available to change how FASTA/QUAL files are read
178or written in scikit-bio.
179
180QUAL File Parameter (Readers and Writers)
181^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
182The ``qual`` parameter is available to all FASTA format readers and writers. It
183can be any file-like type supported by scikit-bio's I/O registry (e.g., file
184handle, file path, etc.). If ``qual`` is provided when reading, quality scores
185will be included in each in-memory ``Sequence`` object, in addition
186to sequence data stored in the FASTA file. When writing, quality scores will be
187written in QUAL format in addition to the sequence data being written in FASTA
188format.
189
190Reader-specific Parameters
191^^^^^^^^^^^^^^^^^^^^^^^^^^
192The available reader parameters differ depending on which reader is used.
193
194Generator and TabularMSA Reader Parameters
195~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
196The ``constructor`` parameter can be used with the ``Sequence`` generator and
197``TabularMSA`` FASTA readers. ``constructor`` specifies the type of in-memory
198sequence object to read each sequence into. For example, if you know that the
199FASTA file you're reading contains protein sequences, you would pass
200``constructor=Protein`` to the reader call.
201
202When reading into a ``Sequence`` generator, ``constructor`` defaults to
203``Sequence`` and must be a subclass of ``Sequence`` if supplied.
204
205When reading into a ``TabularMSA``, ``constructor`` is a required format
206parameter and must be a subclass of ``GrammaredSequence`` (e.g., ``DNA``,
207``RNA``, ``Protein``).
208
209.. note:: The FASTA sniffer will not attempt to guess the ``constructor``
210   parameter.
211
212Sequence Reader Parameters
213~~~~~~~~~~~~~~~~~~~~~~~~~~
214The ``seq_num`` parameter can be used with the ``Sequence``,
215``DNA``, ``RNA``, and ``Protein`` FASTA readers. ``seq_num`` specifies which
216sequence to read from the FASTA file (and optional QUAL file), and defaults to
2171 (i.e., such that the first sequence is read). For example, to read the 50th
218sequence from a FASTA file, you would pass ``seq_num=50`` to the reader call.
219
220Writer-specific Parameters
221^^^^^^^^^^^^^^^^^^^^^^^^^^
222The following parameters are available to all FASTA format writers:
223
224- ``id_whitespace_replacement``: string to replace **each** whitespace
225  character in a sequence ID. This parameter is useful for cases where an
226  in-memory sequence ID contains whitespace, which would result in an on-disk
227  representation that would not be read back into memory as the same ID (since
228  IDs in FASTA format cannot contain whitespace). Defaults to ``_``. If
229  ``None``, no whitespace replacement is performed and IDs are written as they
230  are stored in memory (this has the potential to create an invalid
231  FASTA-formatted file; see note below). This parameter also applies to a QUAL
232  file if one is provided.
233
234- ``description_newline_replacement``: string to replace **each** newline
235  character in a sequence description. Since a FASTA header must be a single
236  line, newlines are not allowed in sequence descriptions and must be replaced
237  in order to write a valid FASTA file. Defaults to a single space. If
238  ``None``, no newline replacement is performed and descriptions are written as
239  they are stored in memory (this has the potential to create an invalid
240  FASTA-formatted file; see note below). This parameter also applies to a QUAL
241  file if one is provided.
242
243- ``max_width``: integer specifying the maximum line width (i.e., number of
244  characters) for sequence data and/or quality scores. If a sequence or its
245  quality scores are longer than ``max_width``, it will be split across
246  multiple lines, each with a maximum width of ``max_width``. Note that there
247  are some caveats when splitting quality scores. A single quality score will
248  *never* be split across multiple lines, otherwise it would become two
249  different quality scores when read again. Thus, splitting only occurs
250  *between* quality scores. This makes it possible to have a single long
251  quality score written on its own line that exceeds ``max_width``. For
252  example, the quality score ``12345`` would not be split across multiple lines
253  even if ``max_width=3``. Thus, a 5-character line would be written. Default
254  behavior is to not split sequence data or quality scores across multiple
255  lines.
256
257- ``lowercase``: String or boolean array. If a string, it is treated as a key
258  into the positional metadata of the object. If a boolean array, it
259  indicates characters to write in lowercase. Characters in the sequence
260  corresponding to `True` values will be written in lowercase. The boolean
261  array must be the same length as the sequence.
262
263.. note:: The FASTA format writers will have noticeably better runtime
264   performance if ``id_whitespace_replacement`` and/or
265   ``description_newline_replacement`` are set to ``None`` so that whitespace
266   replacement is not performed during writing. However, this can potentially
267   create invalid FASTA files, especially if there are newline characters in
268   the IDs or descriptions. For IDs with whitespace, this can also affect how
269   the IDs are read into memory in a subsequent read operation. For example, if
270   an in-memory sequence ID is ``'seq 1'`` and
271   ``id_whitespace_replacement=None``, reading the FASTA file back into memory
272   would result in an ID of ``'seq'``, and ``'1'`` would be part of the
273   sequence description.
274
275Examples
276--------
277
278Reading and Writing FASTA Files
279^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
280Suppose we have the following FASTA file with five equal-length sequences
281(example modified from [6]_)::
282
283    >seq1 Turkey
284    AAGCTNGGGCATTTCAGGGTGAGCCCGGGCAATACAGGGTAT
285    >seq2 Salmo gair
286    AAGCCTTGGCAGTGCAGGGTGAGCCGTGG
287    CCGGGCACGGTAT
288    >seq3 H. Sapiens
289    ACCGGTTGGCCGTTCAGGGTACAGGTTGGCCGTTCAGGGTAA
290    >seq4 Chimp
291    AAACCCTTGCCG
292    TTACGCTTAAAC
293    CGAGGCCGGGAC
294    ACTCAT
295    >seq5 Gorilla
296    AAACCCTTGCCGGTACGCTTAAACCATTGCCGGTACGCTTAA
297
298.. note:: Original copyright notice for the above example file:
299
300   *(c) Copyright 1986-2008 by The University of Washington. Written by Joseph
301   Felsenstein. Permission is granted to copy this document provided that no
302   fee is charged for it and that this copyright notice is not removed.*
303
304Note that the sequences are not required to be of equal length in order for the
305file to be a valid FASTA file (this depends on the object that you're reading
306the file into). Also note that some of the sequences occur on a single line,
307while others are split across multiple lines.
308
309Let's define this file in-memory as a ``StringIO``, though this could be a real
310file path, file handle, or anything that's supported by scikit-bio's I/O
311registry in practice:
312
313>>> fl = [">seq1 Turkey\\n",
314...       "AAGCTNGGGCATTTCAGGGTGAGCCCGGGCAATACAGGGTAT\\n",
315...       ">seq2 Salmo gair\\n",
316...       "AAGCCTTGGCAGTGCAGGGTGAGCCGTGG\\n",
317...       "CCGGGCACGGTAT\\n",
318...       ">seq3 H. Sapiens\\n",
319...       "ACCGGTTGGCCGTTCAGGGTACAGGTTGGCCGTTCAGGGTAA\\n",
320...       ">seq4 Chimp\\n",
321...       "AAACCCTTGCCG\\n",
322...       "TTACGCTTAAAC\\n",
323...       "CGAGGCCGGGAC\\n",
324...       "ACTCAT\\n",
325...       ">seq5 Gorilla\\n",
326...       "AAACCCTTGCCGGTACGCTTAAACCATTGCCGGTACGCTTAA\\n"]
327
328Since these sequences are of equal length (presumably because they've been
329aligned), let's read the FASTA file into a ``TabularMSA`` object:
330
331>>> from skbio import TabularMSA, DNA
332>>> msa = TabularMSA.read(fl, constructor=DNA)
333>>> msa
334TabularMSA[DNA]
335------------------------------------------
336Stats:
337    sequence count: 5
338    position count: 42
339------------------------------------------
340AAGCTNGGGCATTTCAGGGTGAGCCCGGGCAATACAGGGTAT
341AAGCCTTGGCAGTGCAGGGTGAGCCGTGGCCGGGCACGGTAT
342ACCGGTTGGCCGTTCAGGGTACAGGTTGGCCGTTCAGGGTAA
343AAACCCTTGCCGTTACGCTTAAACCGAGGCCGGGACACTCAT
344AAACCCTTGCCGGTACGCTTAAACCATTGCCGGTACGCTTAA
345
346Note that we didn't specify a file format in the ``read`` call. The FASTA
347sniffer detected the correct file format for us!
348
349To write the ``TabularMSA`` in FASTA format:
350
351>>> from io import StringIO
352>>> with StringIO() as fh:
353...     print(msa.write(fh).getvalue())
354>seq1 Turkey
355AAGCTNGGGCATTTCAGGGTGAGCCCGGGCAATACAGGGTAT
356>seq2 Salmo gair
357AAGCCTTGGCAGTGCAGGGTGAGCCGTGGCCGGGCACGGTAT
358>seq3 H. Sapiens
359ACCGGTTGGCCGTTCAGGGTACAGGTTGGCCGTTCAGGGTAA
360>seq4 Chimp
361AAACCCTTGCCGTTACGCTTAAACCGAGGCCGGGACACTCAT
362>seq5 Gorilla
363AAACCCTTGCCGGTACGCTTAAACCATTGCCGGTACGCTTAA
364<BLANKLINE>
365
366``TabularMSA`` loads all of the sequences from the FASTA file into memory at
367once. If the FASTA file is large (which is often the case), this may be
368infeasible if you don't have enough memory. To work around this issue, you can
369stream the sequences using scikit-bio's generator-based FASTA reader and
370writer. The generator-based reader yields ``Sequence`` objects (or subclasses
371if ``constructor`` is supplied) one at a time, instead of loading all sequences
372into memory. For example, let's use the generator-based reader to process a
373single sequence at a time in a ``for`` loop:
374
375>>> import skbio.io
376>>> for seq in skbio.io.read(fl, format='fasta'):
377...     seq
378...     print('')
379Sequence
380------------------------------------------------
381Metadata:
382    'description': 'Turkey'
383    'id': 'seq1'
384Stats:
385    length: 42
386------------------------------------------------
3870 AAGCTNGGGC ATTTCAGGGT GAGCCCGGGC AATACAGGGT AT
388<BLANKLINE>
389Sequence
390------------------------------------------------
391Metadata:
392    'description': 'Salmo gair'
393    'id': 'seq2'
394Stats:
395    length: 42
396------------------------------------------------
3970 AAGCCTTGGC AGTGCAGGGT GAGCCGTGGC CGGGCACGGT AT
398<BLANKLINE>
399Sequence
400------------------------------------------------
401Metadata:
402    'description': 'H. Sapiens'
403    'id': 'seq3'
404Stats:
405    length: 42
406------------------------------------------------
4070 ACCGGTTGGC CGTTCAGGGT ACAGGTTGGC CGTTCAGGGT AA
408<BLANKLINE>
409Sequence
410------------------------------------------------
411Metadata:
412    'description': 'Chimp'
413    'id': 'seq4'
414Stats:
415    length: 42
416------------------------------------------------
4170 AAACCCTTGC CGTTACGCTT AAACCGAGGC CGGGACACTC AT
418<BLANKLINE>
419Sequence
420------------------------------------------------
421Metadata:
422    'description': 'Gorilla'
423    'id': 'seq5'
424Stats:
425    length: 42
426------------------------------------------------
4270 AAACCCTTGC CGGTACGCTT AAACCATTGC CGGTACGCTT AA
428<BLANKLINE>
429
430A single sequence can also be read into a ``Sequence`` (or subclass):
431
432>>> from skbio import Sequence
433>>> seq = Sequence.read(fl)
434>>> seq
435Sequence
436------------------------------------------------
437Metadata:
438    'description': 'Turkey'
439    'id': 'seq1'
440Stats:
441    length: 42
442------------------------------------------------
4430 AAGCTNGGGC ATTTCAGGGT GAGCCCGGGC AATACAGGGT AT
444
445By default, the first sequence in the FASTA file is read. This can be
446controlled with ``seq_num``. For example, to read the fifth sequence:
447
448>>> seq = Sequence.read(fl, seq_num=5)
449>>> seq
450Sequence
451------------------------------------------------
452Metadata:
453    'description': 'Gorilla'
454    'id': 'seq5'
455Stats:
456    length: 42
457------------------------------------------------
4580 AAACCCTTGC CGGTACGCTT AAACCATTGC CGGTACGCTT AA
459
460We can use the same API to read the fifth sequence into a ``DNA`` sequence:
461
462>>> dna_seq = DNA.read(fl, seq_num=5)
463>>> dna_seq
464DNA
465------------------------------------------------
466Metadata:
467    'description': 'Gorilla'
468    'id': 'seq5'
469Stats:
470    length: 42
471    has gaps: False
472    has degenerates: False
473    has definites: True
474    GC-content: 50.00%
475------------------------------------------------
4760 AAACCCTTGC CGGTACGCTT AAACCATTGC CGGTACGCTT AA
477
478Individual sequence objects can also be written in FASTA format:
479
480>>> with StringIO() as fh:
481...     print(dna_seq.write(fh).getvalue())
482>seq5 Gorilla
483AAACCCTTGCCGGTACGCTTAAACCATTGCCGGTACGCTTAA
484<BLANKLINE>
485
486Reading and Writing FASTA/QUAL Files
487^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
488In addition to reading and writing standalone FASTA files, scikit-bio supports
489reading and writing FASTA and QUAL files together. Suppose we have the
490following FASTA file::
491
492    >seq1 db-accession-149855
493    CGATGTC
494    >seq2 db-accession-34989
495    CATCGTC
496
497Also suppose we have the following QUAL file::
498
499    >seq1 db-accession-149855
500    40 39 39 4
501    50 1 100
502    >seq2 db-accession-34989
503    3 3 10 42 80 80 79
504
505>>> fasta_fl = [
506...     ">seq1 db-accession-149855\\n",
507...     "CGATGTC\\n",
508...     ">seq2 db-accession-34989\\n",
509...     "CATCGTC\\n"]
510>>> qual_fl = [
511...     ">seq1 db-accession-149855\\n",
512...     "40 39 39 4\\n",
513...     "50 1 100\\n",
514...     ">seq2 db-accession-34989\\n",
515...     "3 3 10 42 80 80 79\\n"]
516
517To read in a single ``Sequence`` at a time, we can use the
518generator-based reader as we did above, providing both FASTA and QUAL files:
519
520>>> for seq in skbio.io.read(fasta_fl, qual=qual_fl, format='fasta'):
521...     seq
522...     print('')
523Sequence
524----------------------------------------
525Metadata:
526    'description': 'db-accession-149855'
527    'id': 'seq1'
528Positional metadata:
529    'quality': <dtype: uint8>
530Stats:
531    length: 7
532----------------------------------------
5330 CGATGTC
534<BLANKLINE>
535Sequence
536---------------------------------------
537Metadata:
538    'description': 'db-accession-34989'
539    'id': 'seq2'
540Positional metadata:
541    'quality': <dtype: uint8>
542Stats:
543    length: 7
544---------------------------------------
5450 CATCGTC
546<BLANKLINE>
547
548Note that the sequence objects have quality scores stored as positional
549metadata since we provided a QUAL file. The other FASTA readers operate in a
550similar manner.
551
552Now let's load the sequences and their quality scores into a ``TabularMSA``:
553
554>>> msa = TabularMSA.read(fasta_fl, qual=qual_fl, constructor=DNA)
555>>> msa
556TabularMSA[DNA]
557---------------------
558Stats:
559    sequence count: 2
560    position count: 7
561---------------------
562CGATGTC
563CATCGTC
564
565To write the sequence data and quality scores in the ``TabularMSA`` to FASTA
566and QUAL files, respectively:
567
568>>> new_fasta_fh = StringIO()
569>>> new_qual_fh = StringIO()
570>>> _ = msa.write(new_fasta_fh, qual=new_qual_fh)
571>>> print(new_fasta_fh.getvalue())
572>seq1 db-accession-149855
573CGATGTC
574>seq2 db-accession-34989
575CATCGTC
576<BLANKLINE>
577>>> print(new_qual_fh.getvalue())
578>seq1 db-accession-149855
57940 39 39 4 50 1 100
580>seq2 db-accession-34989
5813 3 10 42 80 80 79
582<BLANKLINE>
583>>> new_fasta_fh.close()
584>>> new_qual_fh.close()
585
586References
587----------
588.. [1] Lipman, DJ; Pearson, WR (1985). "Rapid and sensitive protein similarity
589   searches". Science 227 (4693): 1435-41.
590.. [2] http://en.wikipedia.org/wiki/FASTA_format
591.. [3] http://blast.ncbi.nlm.nih.gov/blastcgihelp.shtml
592.. [4] https://www.broadinstitute.org/crd/wiki/index.php/Qual
593.. [5] Madden T. The BLAST Sequence Analysis Tool. 2002 Oct 9
594   [Updated 2003 Aug 13]. In: McEntyre J, Ostell J, editors. The NCBI Handbook
595   [Internet]. Bethesda (MD): National Center for Biotechnology Information
596   (US); 2002-. Chapter 16. Available from:
597   http://www.ncbi.nlm.nih.gov/books/NBK21097/
598.. [6] http://evolution.genetics.washington.edu/phylip/doc/sequence.html
599
600"""
601
602# ----------------------------------------------------------------------------
603# Copyright (c) 2013--, scikit-bio development team.
604#
605# Distributed under the terms of the Modified BSD License.
606#
607# The full license is in the file COPYING.txt, distributed with this software.
608# ----------------------------------------------------------------------------
609
610import itertools
611import textwrap
612
613import numpy as np
614
615from skbio.io import create_format, FASTAFormatError, QUALFormatError
616from skbio.io.registry import FileSentinel
617from skbio.io.format._base import (_get_nth_sequence,
618                                   _parse_fasta_like_header,
619                                   _format_fasta_like_records, _line_generator,
620                                   _too_many_blanks)
621from skbio.util._misc import chunk_str
622from skbio.alignment import TabularMSA
623from skbio.sequence import Sequence, DNA, RNA, Protein
624
625
626fasta = create_format('fasta')
627
628
629@fasta.sniffer()
630def _fasta_sniffer(fh):
631    # Strategy:
632    #   Ignore up to 5 blank/whitespace-only lines at the beginning of the
633    #   file. Read up to 10 records. If at least one record is read (i.e.
634    #   the file isn't empty) and no errors are thrown during reading, assume
635    #   the file is in FASTA format. If a record appears to be QUAL, do *not*
636    #   identify the file as FASTA since we don't want to sniff QUAL files as
637    #   FASTA (technically they can be read as FASTA since the sequences may
638    #   not be validated but it probably isn't what the user wanted). Also, if
639    #   we add QUAL as its own file format in the future, we wouldn't want the
640    #   FASTA and QUAL sniffers to both positively identify a QUAL file.
641    if _too_many_blanks(fh, 5):
642        return False, {}
643
644    num_records = 10
645    empty = True
646    try:
647        parser = _parse_fasta_raw(fh, _sniffer_data_parser, FASTAFormatError)
648        for _ in zip(range(num_records), parser):
649            empty = False
650    except FASTAFormatError:
651        return False, {}
652
653    if empty:
654        return False, {}
655    else:
656        return True, {}
657
658
659def _sniffer_data_parser(chunks):
660    data = _parse_sequence_data(chunks)
661    try:
662        _parse_quality_scores(chunks)
663    except QUALFormatError:
664        return data
665    else:
666        # used for flow control within sniffer, user should never see this
667        # message
668        raise FASTAFormatError('Data appear to be quality scores.')
669
670
671@fasta.reader(None)
672def _fasta_to_generator(fh, qual=FileSentinel, constructor=Sequence, **kwargs):
673    if qual is None:
674        for seq, id_, desc in _parse_fasta_raw(fh, _parse_sequence_data,
675                                               FASTAFormatError):
676            yield constructor(seq, metadata={'id': id_, 'description': desc},
677                              **kwargs)
678    else:
679        fasta_gen = _parse_fasta_raw(fh, _parse_sequence_data,
680                                     FASTAFormatError)
681        qual_gen = _parse_fasta_raw(qual, _parse_quality_scores,
682                                    QUALFormatError)
683
684        for fasta_rec, qual_rec in itertools.zip_longest(fasta_gen, qual_gen,
685                                                         fillvalue=None):
686            if fasta_rec is None:
687                raise FASTAFormatError(
688                    "QUAL file has more records than FASTA file.")
689            if qual_rec is None:
690                raise FASTAFormatError(
691                    "FASTA file has more records than QUAL file.")
692
693            fasta_seq, fasta_id, fasta_desc = fasta_rec
694            qual_scores, qual_id, qual_desc = qual_rec
695
696            if fasta_id != qual_id:
697                raise FASTAFormatError(
698                    "IDs do not match between FASTA and QUAL records: %r != %r"
699                    % (str(fasta_id), str(qual_id)))
700            if fasta_desc != qual_desc:
701                raise FASTAFormatError(
702                    "Descriptions do not match between FASTA and QUAL "
703                    "records: %r != %r" % (str(fasta_desc), str(qual_desc)))
704
705            # sequence and quality scores lengths are checked in constructor
706            yield constructor(
707                fasta_seq,
708                metadata={'id': fasta_id, 'description': fasta_desc},
709                positional_metadata={'quality': qual_scores}, **kwargs)
710
711
712@fasta.reader(Sequence)
713def _fasta_to_sequence(fh, qual=FileSentinel, seq_num=1, **kwargs):
714    return _get_nth_sequence(
715        _fasta_to_generator(fh, qual=qual, constructor=Sequence, **kwargs),
716        seq_num)
717
718
719@fasta.reader(DNA)
720def _fasta_to_dna(fh, qual=FileSentinel, seq_num=1, **kwargs):
721    return _get_nth_sequence(
722        _fasta_to_generator(fh, qual=qual,
723                            constructor=DNA, **kwargs),
724        seq_num)
725
726
727@fasta.reader(RNA)
728def _fasta_to_rna(fh, qual=FileSentinel, seq_num=1, **kwargs):
729    return _get_nth_sequence(
730        _fasta_to_generator(fh, qual=qual,
731                            constructor=RNA, **kwargs),
732        seq_num)
733
734
735@fasta.reader(Protein)
736def _fasta_to_protein(fh, qual=FileSentinel, seq_num=1, **kwargs):
737    return _get_nth_sequence(
738        _fasta_to_generator(fh, qual=qual,
739                            constructor=Protein, **kwargs),
740        seq_num)
741
742
743@fasta.reader(TabularMSA)
744def _fasta_to_tabular_msa(fh, qual=FileSentinel, constructor=None, **kwargs):
745    if constructor is None:
746        raise ValueError("Must provide `constructor`.")
747
748    return TabularMSA(
749        _fasta_to_generator(fh, qual=qual, constructor=constructor, **kwargs))
750
751
752@fasta.writer(None)
753def _generator_to_fasta(obj, fh, qual=FileSentinel,
754                        id_whitespace_replacement='_',
755                        description_newline_replacement=' ', max_width=None,
756                        lowercase=None):
757    if max_width is not None:
758        if max_width < 1:
759            raise ValueError(
760                "Maximum line width must be greater than zero (max_width=%d)."
761                % max_width)
762        if qual is not None:
763            # define text wrapper for splitting quality scores here for
764            # efficiency. textwrap docs recommend reusing a TextWrapper
765            # instance when it is used many times. configure text wrapper to
766            # never break "words" (i.e., integer quality scores) across lines
767            qual_wrapper = textwrap.TextWrapper(
768                width=max_width, break_long_words=False,
769                break_on_hyphens=False)
770
771    formatted_records = _format_fasta_like_records(
772        obj, id_whitespace_replacement, description_newline_replacement,
773        qual is not None, lowercase)
774    for header, seq_str, qual_scores in formatted_records:
775        if max_width is not None:
776            seq_str = chunk_str(seq_str, max_width, '\n')
777
778        fh.write('>%s\n%s\n' % (header, seq_str))
779
780        if qual is not None:
781            qual_str = ' '.join(np.asarray(qual_scores, dtype=np.str))
782            if max_width is not None:
783                qual_str = qual_wrapper.fill(qual_str)
784            qual.write('>%s\n%s\n' % (header, qual_str))
785
786
787@fasta.writer(Sequence)
788def _sequence_to_fasta(obj, fh, qual=FileSentinel,
789                       id_whitespace_replacement='_',
790                       description_newline_replacement=' ', max_width=None,
791                       lowercase=None):
792    _sequences_to_fasta([obj], fh, qual, id_whitespace_replacement,
793                        description_newline_replacement, max_width, lowercase)
794
795
796@fasta.writer(DNA)
797def _dna_to_fasta(obj, fh, qual=FileSentinel, id_whitespace_replacement='_',
798                  description_newline_replacement=' ', max_width=None,
799                  lowercase=None):
800    _sequences_to_fasta([obj], fh, qual, id_whitespace_replacement,
801                        description_newline_replacement, max_width, lowercase)
802
803
804@fasta.writer(RNA)
805def _rna_to_fasta(obj, fh, qual=FileSentinel, id_whitespace_replacement='_',
806                  description_newline_replacement=' ', max_width=None,
807                  lowercase=None):
808    _sequences_to_fasta([obj], fh, qual, id_whitespace_replacement,
809                        description_newline_replacement, max_width, lowercase)
810
811
812@fasta.writer(Protein)
813def _protein_to_fasta(obj, fh, qual=FileSentinel,
814                      id_whitespace_replacement='_',
815                      description_newline_replacement=' ', max_width=None,
816                      lowercase=None):
817    _sequences_to_fasta([obj], fh, qual, id_whitespace_replacement,
818                        description_newline_replacement, max_width, lowercase)
819
820
821@fasta.writer(TabularMSA)
822def _tabular_msa_to_fasta(obj, fh, qual=FileSentinel,
823                          id_whitespace_replacement='_',
824                          description_newline_replacement=' ', max_width=None,
825                          lowercase=None):
826    _sequences_to_fasta(obj, fh, qual, id_whitespace_replacement,
827                        description_newline_replacement, max_width, lowercase)
828
829
830def _parse_fasta_raw(fh, data_parser, error_type):
831    """Raw parser for FASTA or QUAL files.
832
833    Returns raw values (seq/qual, id, description). It is the responsibility of
834    the caller to construct the correct in-memory object to hold the data.
835
836    """
837    # Skip any blank or whitespace-only lines at beginning of file
838    try:
839        seq_header = next(_line_generator(fh, skip_blanks=True))
840    except StopIteration:
841        return
842
843    # header check inlined here and below for performance
844    if seq_header.startswith('>'):
845        id_, desc = _parse_fasta_like_header(seq_header)
846    else:
847        raise error_type(
848            "Found non-header line when attempting to read the 1st record:"
849            "\n%s" % seq_header)
850
851    data_chunks = []
852    prev = seq_header
853    for line in _line_generator(fh, skip_blanks=False):
854        if line.startswith('>'):
855            # new header, so yield current record and reset state
856            yield data_parser(data_chunks), id_, desc
857            data_chunks = []
858            id_, desc = _parse_fasta_like_header(line)
859        else:
860            if line:
861                # ensure no blank lines within a single record
862                if not prev:
863                    raise error_type(
864                        "Found blank or whitespace-only line within record.")
865                data_chunks.append(line)
866        prev = line
867    # yield last record in file
868    yield data_parser(data_chunks), id_, desc
869
870
871def _parse_sequence_data(chunks):
872    if not chunks:
873        raise FASTAFormatError("Found header without sequence data.")
874    return ''.join(chunks)
875
876
877def _parse_quality_scores(chunks):
878    if not chunks:
879        raise QUALFormatError("Found header without quality scores.")
880
881    qual_str = ' '.join(chunks)
882    try:
883        quality = np.asarray(qual_str.split(), dtype=int)
884    except ValueError:
885        raise QUALFormatError(
886            "Could not convert quality scores to integers:\n%s"
887            % str(qual_str))
888
889    if (quality < 0).any():
890        raise QUALFormatError(
891            "Encountered negative quality score(s). Quality scores must be "
892            "greater than or equal to zero.")
893    if (quality > 255).any():
894        raise QUALFormatError(
895            "Encountered quality score(s) greater than 255. scikit-bio only "
896            "supports quality scores in the range 0-255 (inclusive) when "
897            "reading QUAL files.")
898    return quality.astype(np.uint8, casting='unsafe', copy=False)
899
900
901def _sequences_to_fasta(obj, fh, qual, id_whitespace_replacement,
902                        description_newline_replacement, max_width,
903                        lowercase=None):
904    def seq_gen():
905        yield from obj
906
907    _generator_to_fasta(
908        seq_gen(), fh, qual=qual,
909        id_whitespace_replacement=id_whitespace_replacement,
910        description_newline_replacement=description_newline_replacement,
911        max_width=max_width, lowercase=lowercase)
912