1# Copyright 2009-2020 by Peter Cock.  All rights reserved.
2#
3# This file is part of the Biopython distribution and governed by your
4# choice of the "Biopython License Agreement" or the "BSD 3-Clause License".
5# Please see the LICENSE file that should have been included as part of this
6# package.
7"""Bio.SeqIO support for the FASTQ and QUAL file formats.
8
9Note that you are expected to use this code via the Bio.SeqIO interface, as
10shown below.
11
12The FASTQ file format is used frequently at the Wellcome Trust Sanger Institute
13to bundle a FASTA sequence and its PHRED quality data (integers between 0 and
1490).  Rather than using a single FASTQ file, often paired FASTA and QUAL files
15are used containing the sequence and the quality information separately.
16
17The PHRED software reads DNA sequencing trace files, calls bases, and
18assigns a non-negative quality value to each called base using a logged
19transformation of the error probability, Q = -10 log10( Pe ), for example::
20
21    Pe = 1.0,         Q =  0
22    Pe = 0.1,         Q = 10
23    Pe = 0.01,        Q = 20
24    ...
25    Pe = 0.00000001,  Q = 80
26    Pe = 0.000000001, Q = 90
27
28In typical raw sequence reads, the PHRED quality valuea will be from 0 to 40.
29In the QUAL format these quality values are held as space separated text in
30a FASTA like file format.  In the FASTQ format, each quality values is encoded
31with a single ASCI character using chr(Q+33), meaning zero maps to the
32character "!" and for example 80 maps to "q".  For the Sanger FASTQ standard
33the allowed range of PHRED scores is 0 to 93 inclusive. The sequences and
34quality are then stored in pairs in a FASTA like format.
35
36Unfortunately there is no official document describing the FASTQ file format,
37and worse, several related but different variants exist. For more details,
38please read this open access publication::
39
40    The Sanger FASTQ file format for sequences with quality scores, and the
41    Solexa/Illumina FASTQ variants.
42    P.J.A.Cock (Biopython), C.J.Fields (BioPerl), N.Goto (BioRuby),
43    M.L.Heuer (BioJava) and P.M. Rice (EMBOSS).
44    Nucleic Acids Research 2010 38(6):1767-1771
45    https://doi.org/10.1093/nar/gkp1137
46
47The good news is that Roche 454 sequencers can output files in the QUAL format,
48and sensibly they use PHREP style scores like Sanger.  Converting a pair of
49FASTA and QUAL files into a Sanger style FASTQ file is easy. To extract QUAL
50files from a Roche 454 SFF binary file, use the Roche off instrument command
51line tool "sffinfo" with the -q or -qual argument.  You can extract a matching
52FASTA file using the -s or -seq argument instead.
53
54The bad news is that Solexa/Illumina did things differently - they have their
55own scoring system AND their own incompatible versions of the FASTQ format.
56Solexa/Illumina quality scores use Q = - 10 log10 ( Pe / (1-Pe) ), which can
57be negative.  PHRED scores and Solexa scores are NOT interchangeable (but a
58reasonable mapping can be achieved between them, and they are approximately
59equal for higher quality reads).
60
61Confusingly early Solexa pipelines produced a FASTQ like file but using their
62own score mapping and an ASCII offset of 64. To make things worse, for the
63Solexa/Illumina pipeline 1.3 onwards, they introduced a third variant of the
64FASTQ file format, this time using PHRED scores (which is more consistent) but
65with an ASCII offset of 64.
66
67i.e. There are at least THREE different and INCOMPATIBLE variants of the FASTQ
68file format: The original Sanger PHRED standard, and two from Solexa/Illumina.
69
70The good news is that as of CASAVA version 1.8, Illumina sequencers will
71produce FASTQ files using the standard Sanger encoding.
72
73You are expected to use this module via the Bio.SeqIO functions, with the
74following format names:
75
76    - "qual" means simple quality files using PHRED scores (e.g. from Roche 454)
77    - "fastq" means Sanger style FASTQ files using PHRED scores and an ASCII
78      offset of 33 (e.g. from the NCBI Short Read Archive and Illumina 1.8+).
79      These can potentially hold PHRED scores from 0 to 93.
80    - "fastq-sanger" is an alias for "fastq".
81    - "fastq-solexa" means old Solexa (and also very early Illumina) style FASTQ
82      files, using Solexa scores with an ASCII offset 64. These can hold Solexa
83      scores from -5 to 62.
84    - "fastq-illumina" means newer Illumina 1.3 to 1.7 style FASTQ files, using
85      PHRED scores but with an ASCII offset 64, allowing PHRED scores from 0
86      to 62.
87
88We could potentially add support for "qual-solexa" meaning QUAL files which
89contain Solexa scores, but thus far there isn't any reason to use such files.
90
91For example, consider the following short FASTQ file::
92
93    @EAS54_6_R1_2_1_413_324
94    CCCTTCTTGTCTTCAGCGTTTCTCC
95    +
96    ;;3;;;;;;;;;;;;7;;;;;;;88
97    @EAS54_6_R1_2_1_540_792
98    TTGGCAGGCCAAGGCCGATGGATCA
99    +
100    ;;;;;;;;;;;7;;;;;-;;;3;83
101    @EAS54_6_R1_2_1_443_348
102    GTTGCTTCTGGCGTGGGTGGGGGGG
103    +
104    ;;;;;;;;;;;9;7;;.7;393333
105
106This contains three reads of length 25.  From the read length these were
107probably originally from an early Solexa/Illumina sequencer but this file
108follows the Sanger FASTQ convention (PHRED style qualities with an ASCII
109offet of 33).  This means we can parse this file using Bio.SeqIO using
110"fastq" as the format name:
111
112>>> from Bio import SeqIO
113>>> for record in SeqIO.parse("Quality/example.fastq", "fastq"):
114...     print("%s %s" % (record.id, record.seq))
115EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC
116EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA
117EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG
118
119The qualities are held as a list of integers in each record's annotation:
120
121>>> print(record)
122ID: EAS54_6_R1_2_1_443_348
123Name: EAS54_6_R1_2_1_443_348
124Description: EAS54_6_R1_2_1_443_348
125Number of features: 0
126Per letter annotation for: phred_quality
127Seq('GTTGCTTCTGGCGTGGGTGGGGGGG')
128>>> print(record.letter_annotations["phred_quality"])
129[26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
130
131You can use the SeqRecord format method to show this in the QUAL format:
132
133>>> print(record.format("qual"))
134>EAS54_6_R1_2_1_443_348
13526 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18
13624 18 18 18 18
137<BLANKLINE>
138
139Or go back to the FASTQ format, use "fastq" (or "fastq-sanger"):
140
141>>> print(record.format("fastq"))
142@EAS54_6_R1_2_1_443_348
143GTTGCTTCTGGCGTGGGTGGGGGGG
144+
145;;;;;;;;;;;9;7;;.7;393333
146<BLANKLINE>
147
148Or, using the Illumina 1.3+ FASTQ encoding (PHRED values with an ASCII offset
149of 64):
150
151>>> print(record.format("fastq-illumina"))
152@EAS54_6_R1_2_1_443_348
153GTTGCTTCTGGCGTGGGTGGGGGGG
154+
155ZZZZZZZZZZZXZVZZMVZRXRRRR
156<BLANKLINE>
157
158You can also get Biopython to convert the scores and show a Solexa style
159FASTQ file:
160
161>>> print(record.format("fastq-solexa"))
162@EAS54_6_R1_2_1_443_348
163GTTGCTTCTGGCGTGGGTGGGGGGG
164+
165ZZZZZZZZZZZXZVZZMVZRXRRRR
166<BLANKLINE>
167
168Notice that this is actually the same output as above using "fastq-illumina"
169as the format! The reason for this is all these scores are high enough that
170the PHRED and Solexa scores are almost equal. The differences become apparent
171for poor quality reads. See the functions solexa_quality_from_phred and
172phred_quality_from_solexa for more details.
173
174If you wanted to trim your sequences (perhaps to remove low quality regions,
175or to remove a primer sequence), try slicing the SeqRecord objects.  e.g.
176
177>>> sub_rec = record[5:15]
178>>> print(sub_rec)
179ID: EAS54_6_R1_2_1_443_348
180Name: EAS54_6_R1_2_1_443_348
181Description: EAS54_6_R1_2_1_443_348
182Number of features: 0
183Per letter annotation for: phred_quality
184Seq('TTCTGGCGTG')
185>>> print(sub_rec.letter_annotations["phred_quality"])
186[26, 26, 26, 26, 26, 26, 24, 26, 22, 26]
187>>> print(sub_rec.format("fastq"))
188@EAS54_6_R1_2_1_443_348
189TTCTGGCGTG
190+
191;;;;;;9;7;
192<BLANKLINE>
193
194If you wanted to, you could read in this FASTQ file, and save it as a QUAL file:
195
196>>> from Bio import SeqIO
197>>> record_iterator = SeqIO.parse("Quality/example.fastq", "fastq")
198>>> with open("Quality/temp.qual", "w") as out_handle:
199...     SeqIO.write(record_iterator, out_handle, "qual")
2003
201
202You can of course read in a QUAL file, such as the one we just created:
203
204>>> from Bio import SeqIO
205>>> for record in SeqIO.parse("Quality/temp.qual", "qual"):
206...     print("%s read of length %d" % (record.id, len(record.seq)))
207EAS54_6_R1_2_1_413_324 read of length 25
208EAS54_6_R1_2_1_540_792 read of length 25
209EAS54_6_R1_2_1_443_348 read of length 25
210
211Notice that QUAL files don't have a proper sequence present!  But the quality
212information is there:
213
214>>> print(record)
215ID: EAS54_6_R1_2_1_443_348
216Name: EAS54_6_R1_2_1_443_348
217Description: EAS54_6_R1_2_1_443_348
218Number of features: 0
219Per letter annotation for: phred_quality
220Undefined sequence of length 25
221>>> print(record.letter_annotations["phred_quality"])
222[26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
223
224Just to keep things tidy, if you are following this example yourself, you can
225delete this temporary file now:
226
227>>> import os
228>>> os.remove("Quality/temp.qual")
229
230Sometimes you won't have a FASTQ file, but rather just a pair of FASTA and QUAL
231files.  Because the Bio.SeqIO system is designed for reading single files, you
232would have to read the two in separately and then combine the data.  However,
233since this is such a common thing to want to do, there is a helper iterator
234defined in this module that does this for you - PairedFastaQualIterator.
235
236Alternatively, if you have enough RAM to hold all the records in memory at once,
237then a simple dictionary approach would work:
238
239>>> from Bio import SeqIO
240>>> reads = SeqIO.to_dict(SeqIO.parse("Quality/example.fasta", "fasta"))
241>>> for rec in SeqIO.parse("Quality/example.qual", "qual"):
242...     reads[rec.id].letter_annotations["phred_quality"]=rec.letter_annotations["phred_quality"]
243
244You can then access any record by its key, and get both the sequence and the
245quality scores.
246
247>>> print(reads["EAS54_6_R1_2_1_540_792"].format("fastq"))
248@EAS54_6_R1_2_1_540_792
249TTGGCAGGCCAAGGCCGATGGATCA
250+
251;;;;;;;;;;;7;;;;;-;;;3;83
252<BLANKLINE>
253
254It is important that you explicitly tell Bio.SeqIO which FASTQ variant you are
255using ("fastq" or "fastq-sanger" for the Sanger standard using PHRED values,
256"fastq-solexa" for the original Solexa/Illumina variant, or "fastq-illumina"
257for the more recent variant), as this cannot be detected reliably
258automatically.
259
260To illustrate this problem, let's consider an artificial example:
261
262>>> from Bio.Seq import Seq
263>>> from Bio.SeqRecord import SeqRecord
264>>> test = SeqRecord(Seq("NACGTACGTA"), id="Test", description="Made up!")
265>>> print(test.format("fasta"))
266>Test Made up!
267NACGTACGTA
268<BLANKLINE>
269>>> print(test.format("fastq"))
270Traceback (most recent call last):
271 ...
272ValueError: No suitable quality scores found in letter_annotations of SeqRecord (id=Test).
273
274We created a sample SeqRecord, and can show it in FASTA format - but for QUAL
275or FASTQ format we need to provide some quality scores. These are held as a
276list of integers (one for each base) in the letter_annotations dictionary:
277
278>>> test.letter_annotations["phred_quality"] = [0, 1, 2, 3, 4, 5, 10, 20, 30, 40]
279>>> print(test.format("qual"))
280>Test Made up!
2810 1 2 3 4 5 10 20 30 40
282<BLANKLINE>
283>>> print(test.format("fastq"))
284@Test Made up!
285NACGTACGTA
286+
287!"#$%&+5?I
288<BLANKLINE>
289
290We can check this FASTQ encoding - the first PHRED quality was zero, and this
291mapped to a exclamation mark, while the final score was 40 and this mapped to
292the letter "I":
293
294>>> ord('!') - 33
2950
296>>> ord('I') - 33
29740
298>>> [ord(letter)-33 for letter in '!"#$%&+5?I']
299[0, 1, 2, 3, 4, 5, 10, 20, 30, 40]
300
301Similarly, we could produce an Illumina 1.3 to 1.7 style FASTQ file using PHRED
302scores with an offset of 64:
303
304>>> print(test.format("fastq-illumina"))
305@Test Made up!
306NACGTACGTA
307+
308@ABCDEJT^h
309<BLANKLINE>
310
311And we can check this too - the first PHRED score was zero, and this mapped to
312"@", while the final score was 40 and this mapped to "h":
313
314>>> ord("@") - 64
3150
316>>> ord("h") - 64
31740
318>>> [ord(letter)-64 for letter in "@ABCDEJT^h"]
319[0, 1, 2, 3, 4, 5, 10, 20, 30, 40]
320
321Notice how different the standard Sanger FASTQ and the Illumina 1.3 to 1.7 style
322FASTQ files look for the same data! Then we have the older Solexa/Illumina
323format to consider which encodes Solexa scores instead of PHRED scores.
324
325First let's see what Biopython says if we convert the PHRED scores into Solexa
326scores (rounding to one decimal place):
327
328>>> for q in [0, 1, 2, 3, 4, 5, 10, 20, 30, 40]:
329...     print("PHRED %i maps to Solexa %0.1f" % (q, solexa_quality_from_phred(q)))
330PHRED 0 maps to Solexa -5.0
331PHRED 1 maps to Solexa -5.0
332PHRED 2 maps to Solexa -2.3
333PHRED 3 maps to Solexa -0.0
334PHRED 4 maps to Solexa 1.8
335PHRED 5 maps to Solexa 3.3
336PHRED 10 maps to Solexa 9.5
337PHRED 20 maps to Solexa 20.0
338PHRED 30 maps to Solexa 30.0
339PHRED 40 maps to Solexa 40.0
340
341Now here is the record using the old Solexa style FASTQ file:
342
343>>> print(test.format("fastq-solexa"))
344@Test Made up!
345NACGTACGTA
346+
347;;>@BCJT^h
348<BLANKLINE>
349
350Again, this is using an ASCII offset of 64, so we can check the Solexa scores:
351
352>>> [ord(letter)-64 for letter in ";;>@BCJT^h"]
353[-5, -5, -2, 0, 2, 3, 10, 20, 30, 40]
354
355This explains why the last few letters of this FASTQ output matched that using
356the Illumina 1.3 to 1.7 format - high quality PHRED scores and Solexa scores
357are approximately equal.
358
359"""
360import warnings
361
362from math import log
363
364from Bio import BiopythonParserWarning
365from Bio import BiopythonWarning
366from Bio import StreamModeError
367from Bio.File import as_handle
368from Bio.Seq import Seq
369from Bio.SeqRecord import SeqRecord
370
371from .Interfaces import _clean
372from .Interfaces import _get_seq_string
373from .Interfaces import SequenceIterator
374from .Interfaces import SequenceWriter
375
376
377# define score offsets. See discussion for differences between Sanger and
378# Solexa offsets.
379SANGER_SCORE_OFFSET = 33
380SOLEXA_SCORE_OFFSET = 64
381
382
383def solexa_quality_from_phred(phred_quality):
384    """Covert a PHRED quality (range 0 to about 90) to a Solexa quality.
385
386    PHRED and Solexa quality scores are both log transformations of a
387    probality of error (high score = low probability of error). This function
388    takes a PHRED score, transforms it back to a probability of error, and
389    then re-expresses it as a Solexa score. This assumes the error estimates
390    are equivalent.
391
392    How does this work exactly? Well the PHRED quality is minus ten times the
393    base ten logarithm of the probability of error::
394
395        phred_quality = -10*log(error,10)
396
397    Therefore, turning this round::
398
399        error = 10 ** (- phred_quality / 10)
400
401    Now, Solexa qualities use a different log transformation::
402
403        solexa_quality = -10*log(error/(1-error),10)
404
405    After substitution and a little manipulation we get::
406
407         solexa_quality = 10*log(10**(phred_quality/10.0) - 1, 10)
408
409    However, real Solexa files use a minimum quality of -5. This does have a
410    good reason - a random base call would be correct 25% of the time,
411    and thus have a probability of error of 0.75, which gives 1.25 as the PHRED
412    quality, or -4.77 as the Solexa quality. Thus (after rounding), a random
413    nucleotide read would have a PHRED quality of 1, or a Solexa quality of -5.
414
415    Taken literally, this logarithic formula would map a PHRED quality of zero
416    to a Solexa quality of minus infinity. Of course, taken literally, a PHRED
417    score of zero means a probability of error of one (i.e. the base call is
418    definitely wrong), which is worse than random! In practice, a PHRED quality
419    of zero usually means a default value, or perhaps random - and therefore
420    mapping it to the minimum Solexa score of -5 is reasonable.
421
422    In conclusion, we follow EMBOSS, and take this logarithmic formula but also
423    apply a minimum value of -5.0 for the Solexa quality, and also map a PHRED
424    quality of zero to -5.0 as well.
425
426    Note this function will return a floating point number, it is up to you to
427    round this to the nearest integer if appropriate.  e.g.
428
429    >>> print("%0.2f" % round(solexa_quality_from_phred(80), 2))
430    80.00
431    >>> print("%0.2f" % round(solexa_quality_from_phred(50), 2))
432    50.00
433    >>> print("%0.2f" % round(solexa_quality_from_phred(20), 2))
434    19.96
435    >>> print("%0.2f" % round(solexa_quality_from_phred(10), 2))
436    9.54
437    >>> print("%0.2f" % round(solexa_quality_from_phred(5), 2))
438    3.35
439    >>> print("%0.2f" % round(solexa_quality_from_phred(4), 2))
440    1.80
441    >>> print("%0.2f" % round(solexa_quality_from_phred(3), 2))
442    -0.02
443    >>> print("%0.2f" % round(solexa_quality_from_phred(2), 2))
444    -2.33
445    >>> print("%0.2f" % round(solexa_quality_from_phred(1), 2))
446    -5.00
447    >>> print("%0.2f" % round(solexa_quality_from_phred(0), 2))
448    -5.00
449
450    Notice that for high quality reads PHRED and Solexa scores are numerically
451    equal. The differences are important for poor quality reads, where PHRED
452    has a minimum of zero but Solexa scores can be negative.
453
454    Finally, as a special case where None is used for a "missing value", None
455    is returned:
456
457    >>> print(solexa_quality_from_phred(None))
458    None
459    """
460    if phred_quality is None:
461        # Assume None is used as some kind of NULL or NA value; return None
462        # e.g. Bio.SeqIO gives Ace contig gaps a quality of None.
463        return None
464    elif phred_quality > 0:
465        # Solexa uses a minimum value of -5, which after rounding matches a
466        # random nucleotide base call.
467        return max(-5.0, 10 * log(10 ** (phred_quality / 10.0) - 1, 10))
468    elif phred_quality == 0:
469        # Special case, map to -5 as discussed in the docstring
470        return -5.0
471    else:
472        raise ValueError(
473            "PHRED qualities must be positive (or zero), not %r" % phred_quality
474        )
475
476
477def phred_quality_from_solexa(solexa_quality):
478    """Convert a Solexa quality (which can be negative) to a PHRED quality.
479
480    PHRED and Solexa quality scores are both log transformations of a
481    probality of error (high score = low probability of error). This function
482    takes a Solexa score, transforms it back to a probability of error, and
483    then re-expresses it as a PHRED score. This assumes the error estimates
484    are equivalent.
485
486    The underlying formulas are given in the documentation for the sister
487    function solexa_quality_from_phred, in this case the operation is::
488
489        phred_quality = 10*log(10**(solexa_quality/10.0) + 1, 10)
490
491    This will return a floating point number, it is up to you to round this to
492    the nearest integer if appropriate.  e.g.
493
494    >>> print("%0.2f" % round(phred_quality_from_solexa(80), 2))
495    80.00
496    >>> print("%0.2f" % round(phred_quality_from_solexa(20), 2))
497    20.04
498    >>> print("%0.2f" % round(phred_quality_from_solexa(10), 2))
499    10.41
500    >>> print("%0.2f" % round(phred_quality_from_solexa(0), 2))
501    3.01
502    >>> print("%0.2f" % round(phred_quality_from_solexa(-5), 2))
503    1.19
504
505    Note that a solexa_quality less then -5 is not expected, will trigger a
506    warning, but will still be converted as per the logarithmic mapping
507    (giving a number between 0 and 1.19 back).
508
509    As a special case where None is used for a "missing value", None is
510    returned:
511
512    >>> print(phred_quality_from_solexa(None))
513    None
514    """
515    if solexa_quality is None:
516        # Assume None is used as some kind of NULL or NA value; return None
517        return None
518    if solexa_quality < -5:
519        warnings.warn(
520            "Solexa quality less than -5 passed, %r" % solexa_quality, BiopythonWarning
521        )
522    return 10 * log(10 ** (solexa_quality / 10.0) + 1, 10)
523
524
525def _get_phred_quality(record):
526    """Extract PHRED qualities from a SeqRecord's letter_annotations (PRIVATE).
527
528    If there are no PHRED qualities, but there are Solexa qualities, those are
529    used instead after conversion.
530    """
531    try:
532        return record.letter_annotations["phred_quality"]
533    except KeyError:
534        pass
535    try:
536        return [
537            phred_quality_from_solexa(q)
538            for q in record.letter_annotations["solexa_quality"]
539        ]
540    except KeyError:
541        raise ValueError(
542            "No suitable quality scores found in "
543            "letter_annotations of SeqRecord (id=%s)." % record.id
544        ) from None
545
546
547# Only map 0 to 93, we need to give a warning on truncating at 93
548_phred_to_sanger_quality_str = {
549    qp: chr(min(126, qp + SANGER_SCORE_OFFSET)) for qp in range(0, 93 + 1)
550}
551# Only map -5 to 93, we need to give a warning on truncating at 93
552_solexa_to_sanger_quality_str = {
553    qs: chr(min(126, int(round(phred_quality_from_solexa(qs)) + SANGER_SCORE_OFFSET)))
554    for qs in range(-5, 93 + 1)
555}
556
557
558def _get_sanger_quality_str(record):
559    """Return a Sanger FASTQ encoded quality string (PRIVATE).
560
561    >>> from Bio.Seq import Seq
562    >>> from Bio.SeqRecord import SeqRecord
563    >>> r = SeqRecord(Seq("ACGTAN"), id="Test",
564    ...               letter_annotations = {"phred_quality":[50, 40, 30, 20, 10, 0]})
565    >>> _get_sanger_quality_str(r)
566    'SI?5+!'
567
568    If as in the above example (or indeed a SeqRecord parser with Bio.SeqIO),
569    the PHRED qualities are integers, this function is able to use a very fast
570    pre-cached mapping. However, if they are floats which differ slightly, then
571    it has to do the appropriate rounding - which is slower:
572
573    >>> r2 = SeqRecord(Seq("ACGTAN"), id="Test2",
574    ...      letter_annotations = {"phred_quality":[50.0, 40.05, 29.99, 20, 9.55, 0.01]})
575    >>> _get_sanger_quality_str(r2)
576    'SI?5+!'
577
578    If your scores include a None value, this raises an exception:
579
580    >>> r3 = SeqRecord(Seq("ACGTAN"), id="Test3",
581    ...               letter_annotations = {"phred_quality":[50, 40, 30, 20, 10, None]})
582    >>> _get_sanger_quality_str(r3)
583    Traceback (most recent call last):
584       ...
585    TypeError: A quality value of None was found
586
587    If (strangely) your record has both PHRED and Solexa scores, then the PHRED
588    scores are used in preference:
589
590    >>> r4 = SeqRecord(Seq("ACGTAN"), id="Test4",
591    ...               letter_annotations = {"phred_quality":[50, 40, 30, 20, 10, 0],
592    ...                                     "solexa_quality":[-5, -4, 0, None, 0, 40]})
593    >>> _get_sanger_quality_str(r4)
594    'SI?5+!'
595
596    If there are no PHRED scores, but there are Solexa scores, these are used
597    instead (after the appropriate conversion):
598
599    >>> r5 = SeqRecord(Seq("ACGTAN"), id="Test5",
600    ...      letter_annotations = {"solexa_quality":[40, 30, 20, 10, 0, -5]})
601    >>> _get_sanger_quality_str(r5)
602    'I?5+$"'
603
604    Again, integer Solexa scores can be looked up in a pre-cached mapping making
605    this very fast. You can still use approximate floating point scores:
606
607    >>> r6 = SeqRecord(Seq("ACGTAN"), id="Test6",
608    ...      letter_annotations = {"solexa_quality":[40.1, 29.7, 20.01, 10, 0.0, -4.9]})
609    >>> _get_sanger_quality_str(r6)
610    'I?5+$"'
611
612    Notice that due to the limited range of printable ASCII characters, a
613    PHRED quality of 93 is the maximum that can be held in an Illumina FASTQ
614    file (using ASCII 126, the tilde). This function will issue a warning
615    in this situation.
616    """
617    # TODO - This functions works and is fast, but it is also ugly
618    # and there is considerable repetition of code for the other
619    # two FASTQ variants.
620    try:
621        # These take priority (in case both Solexa and PHRED scores found)
622        qualities = record.letter_annotations["phred_quality"]
623    except KeyError:
624        # Fall back on solexa scores...
625        pass
626    else:
627        # Try and use the precomputed mapping:
628        try:
629            return "".join(_phred_to_sanger_quality_str[qp] for qp in qualities)
630        except KeyError:
631            # Could be a float, or a None in the list, or a high value.
632            pass
633        if None in qualities:
634            raise TypeError("A quality value of None was found")
635        if max(qualities) >= 93.5:
636            warnings.warn(
637                "Data loss - max PHRED quality 93 in Sanger FASTQ", BiopythonWarning
638            )
639        # This will apply the truncation at 93, giving max ASCII 126
640        return "".join(
641            chr(min(126, int(round(qp)) + SANGER_SCORE_OFFSET)) for qp in qualities
642        )
643    # Fall back on the Solexa scores...
644    try:
645        qualities = record.letter_annotations["solexa_quality"]
646    except KeyError:
647        raise ValueError(
648            "No suitable quality scores found in "
649            "letter_annotations of SeqRecord (id=%s)." % record.id
650        ) from None
651    # Try and use the precomputed mapping:
652    try:
653        return "".join(_solexa_to_sanger_quality_str[qs] for qs in qualities)
654    except KeyError:
655        # Either no PHRED scores, or something odd like a float or None
656        pass
657    if None in qualities:
658        raise TypeError("A quality value of None was found")
659    # Must do this the slow way, first converting the PHRED scores into
660    # Solexa scores:
661    if max(qualities) >= 93.5:
662        warnings.warn(
663            "Data loss - max PHRED quality 93 in Sanger FASTQ", BiopythonWarning
664        )
665    # This will apply the truncation at 93, giving max ASCII 126
666    return "".join(
667        chr(min(126, int(round(phred_quality_from_solexa(qs))) + SANGER_SCORE_OFFSET))
668        for qs in qualities
669    )
670
671
672# Only map 0 to 62, we need to give a warning on truncating at 62
673assert 62 + SOLEXA_SCORE_OFFSET == 126
674_phred_to_illumina_quality_str = {
675    qp: chr(qp + SOLEXA_SCORE_OFFSET) for qp in range(0, 62 + 1)
676}
677# Only map -5 to 62, we need to give a warning on truncating at 62
678_solexa_to_illumina_quality_str = {
679    qs: chr(int(round(phred_quality_from_solexa(qs))) + SOLEXA_SCORE_OFFSET)
680    for qs in range(-5, 62 + 1)
681}
682
683
684def _get_illumina_quality_str(record):
685    """Return an Illumina 1.3 to 1.7 FASTQ encoded quality string (PRIVATE).
686
687    Notice that due to the limited range of printable ASCII characters, a
688    PHRED quality of 62 is the maximum that can be held in an Illumina FASTQ
689    file (using ASCII 126, the tilde). This function will issue a warning
690    in this situation.
691    """
692    # TODO - This functions works and is fast, but it is also ugly
693    # and there is considerable repetition of code for the other
694    # two FASTQ variants.
695    try:
696        # These take priority (in case both Solexa and PHRED scores found)
697        qualities = record.letter_annotations["phred_quality"]
698    except KeyError:
699        # Fall back on solexa scores...
700        pass
701    else:
702        # Try and use the precomputed mapping:
703        try:
704            return "".join(_phred_to_illumina_quality_str[qp] for qp in qualities)
705        except KeyError:
706            # Could be a float, or a None in the list, or a high value.
707            pass
708        if None in qualities:
709            raise TypeError("A quality value of None was found")
710        if max(qualities) >= 62.5:
711            warnings.warn(
712                "Data loss - max PHRED quality 62 in Illumina FASTQ", BiopythonWarning
713            )
714        # This will apply the truncation at 62, giving max ASCII 126
715        return "".join(
716            chr(min(126, int(round(qp)) + SOLEXA_SCORE_OFFSET)) for qp in qualities
717        )
718    # Fall back on the Solexa scores...
719    try:
720        qualities = record.letter_annotations["solexa_quality"]
721    except KeyError:
722        raise ValueError(
723            "No suitable quality scores found in "
724            "letter_annotations of SeqRecord (id=%s)." % record.id
725        ) from None
726    # Try and use the precomputed mapping:
727    try:
728        return "".join(_solexa_to_illumina_quality_str[qs] for qs in qualities)
729    except KeyError:
730        # Either no PHRED scores, or something odd like a float or None
731        pass
732    if None in qualities:
733        raise TypeError("A quality value of None was found")
734    # Must do this the slow way, first converting the PHRED scores into
735    # Solexa scores:
736    if max(qualities) >= 62.5:
737        warnings.warn(
738            "Data loss - max PHRED quality 62 in Illumina FASTQ", BiopythonWarning
739        )
740    # This will apply the truncation at 62, giving max ASCII 126
741    return "".join(
742        chr(min(126, int(round(phred_quality_from_solexa(qs))) + SOLEXA_SCORE_OFFSET))
743        for qs in qualities
744    )
745
746
747# Only map 0 to 62, we need to give a warning on truncating at 62
748assert 62 + SOLEXA_SCORE_OFFSET == 126
749_solexa_to_solexa_quality_str = {
750    qs: chr(min(126, qs + SOLEXA_SCORE_OFFSET)) for qs in range(-5, 62 + 1)
751}
752# Only map -5 to 62, we need to give a warning on truncating at 62
753_phred_to_solexa_quality_str = {
754    qp: chr(min(126, int(round(solexa_quality_from_phred(qp))) + SOLEXA_SCORE_OFFSET))
755    for qp in range(0, 62 + 1)
756}
757
758
759def _get_solexa_quality_str(record):
760    """Return a Solexa FASTQ encoded quality string (PRIVATE).
761
762    Notice that due to the limited range of printable ASCII characters, a
763    Solexa quality of 62 is the maximum that can be held in a Solexa FASTQ
764    file (using ASCII 126, the tilde). This function will issue a warning
765    in this situation.
766    """
767    # TODO - This functions works and is fast, but it is also ugly
768    # and there is considerable repetition of code for the other
769    # two FASTQ variants.
770    try:
771        # These take priority (in case both Solexa and PHRED scores found)
772        qualities = record.letter_annotations["solexa_quality"]
773    except KeyError:
774        # Fall back on PHRED scores...
775        pass
776    else:
777        # Try and use the precomputed mapping:
778        try:
779            return "".join(_solexa_to_solexa_quality_str[qs] for qs in qualities)
780        except KeyError:
781            # Could be a float, or a None in the list, or a high value.
782            pass
783        if None in qualities:
784            raise TypeError("A quality value of None was found")
785        if max(qualities) >= 62.5:
786            warnings.warn(
787                "Data loss - max Solexa quality 62 in Solexa FASTQ", BiopythonWarning
788            )
789        # This will apply the truncation at 62, giving max ASCII 126
790        return "".join(
791            chr(min(126, int(round(qs)) + SOLEXA_SCORE_OFFSET)) for qs in qualities
792        )
793    # Fall back on the PHRED scores...
794    try:
795        qualities = record.letter_annotations["phred_quality"]
796    except KeyError:
797        raise ValueError(
798            "No suitable quality scores found in "
799            "letter_annotations of SeqRecord (id=%s)." % record.id
800        ) from None
801    # Try and use the precomputed mapping:
802    try:
803        return "".join(_phred_to_solexa_quality_str[qp] for qp in qualities)
804    except KeyError:
805        # Either no PHRED scores, or something odd like a float or None
806        # or too big to be in the cache
807        pass
808    if None in qualities:
809        raise TypeError("A quality value of None was found")
810    # Must do this the slow way, first converting the PHRED scores into
811    # Solexa scores:
812    if max(qualities) >= 62.5:
813        warnings.warn(
814            "Data loss - max Solexa quality 62 in Solexa FASTQ", BiopythonWarning
815        )
816    return "".join(
817        chr(min(126, int(round(solexa_quality_from_phred(qp))) + SOLEXA_SCORE_OFFSET))
818        for qp in qualities
819    )
820
821
822# TODO - Default to nucleotide or even DNA?
823def FastqGeneralIterator(source):
824    """Iterate over Fastq records as string tuples (not as SeqRecord objects).
825
826    Arguments:
827     - source - input stream opened in text mode, or a path to a file
828
829    This code does not try to interpret the quality string numerically.  It
830    just returns tuples of the title, sequence and quality as strings.  For
831    the sequence and quality, any whitespace (such as new lines) is removed.
832
833    Our SeqRecord based FASTQ iterators call this function internally, and then
834    turn the strings into a SeqRecord objects, mapping the quality string into
835    a list of numerical scores.  If you want to do a custom quality mapping,
836    then you might consider calling this function directly.
837
838    For parsing FASTQ files, the title string from the "@" line at the start
839    of each record can optionally be omitted on the "+" lines.  If it is
840    repeated, it must be identical.
841
842    The sequence string and the quality string can optionally be split over
843    multiple lines, although several sources discourage this.  In comparison,
844    for the FASTA file format line breaks between 60 and 80 characters are
845    the norm.
846
847    **WARNING** - Because the "@" character can appear in the quality string,
848    this can cause problems as this is also the marker for the start of
849    a new sequence.  In fact, the "+" sign can also appear as well.  Some
850    sources recommended having no line breaks in the  quality to avoid this,
851    but even that is not enough, consider this example::
852
853        @071113_EAS56_0053:1:1:998:236
854        TTTCTTGCCCCCATAGACTGAGACCTTCCCTAAATA
855        +071113_EAS56_0053:1:1:998:236
856        IIIIIIIIIIIIIIIIIIIIIIIIIIIIICII+III
857        @071113_EAS56_0053:1:1:182:712
858        ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG
859        +
860        @IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+
861        @071113_EAS56_0053:1:1:153:10
862        TGTTCTGAAGGAAGGTGTGCGTGCGTGTGTGTGTGT
863        +
864        IIIIIIIIIIIICIIGIIIII>IAIIIE65I=II:6
865        @071113_EAS56_0053:1:3:990:501
866        TGGGAGGTTTTATGTGGA
867        AAGCAGCAATGTACAAGA
868        +
869        IIIIIII.IIIIII1@44
870        @-7.%<&+/$/%4(++(%
871
872    This is four PHRED encoded FASTQ entries originally from an NCBI source
873    (given the read length of 36, these are probably Solexa Illumina reads where
874    the quality has been mapped onto the PHRED values).
875
876    This example has been edited to illustrate some of the nasty things allowed
877    in the FASTQ format.  Firstly, on the "+" lines most but not all of the
878    (redundant) identifiers are omitted.  In real files it is likely that all or
879    none of these extra identifiers will be present.
880
881    Secondly, while the first three sequences have been shown without line
882    breaks, the last has been split over multiple lines.  In real files any line
883    breaks are likely to be consistent.
884
885    Thirdly, some of the quality string lines start with an "@" character.  For
886    the second record this is unavoidable.  However for the fourth sequence this
887    only happens because its quality string is split over two lines.  A naive
888    parser could wrongly treat any line starting with an "@" as the beginning of
889    a new sequence!  This code copes with this possible ambiguity by keeping
890    track of the length of the sequence which gives the expected length of the
891    quality string.
892
893    Using this tricky example file as input, this short bit of code demonstrates
894    what this parsing function would return:
895
896    >>> with open("Quality/tricky.fastq") as handle:
897    ...     for (title, sequence, quality) in FastqGeneralIterator(handle):
898    ...         print(title)
899    ...         print("%s %s" % (sequence, quality))
900    ...
901    071113_EAS56_0053:1:1:998:236
902    TTTCTTGCCCCCATAGACTGAGACCTTCCCTAAATA IIIIIIIIIIIIIIIIIIIIIIIIIIIIICII+III
903    071113_EAS56_0053:1:1:182:712
904    ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG @IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+
905    071113_EAS56_0053:1:1:153:10
906    TGTTCTGAAGGAAGGTGTGCGTGCGTGTGTGTGTGT IIIIIIIIIIIICIIGIIIII>IAIIIE65I=II:6
907    071113_EAS56_0053:1:3:990:501
908    TGGGAGGTTTTATGTGGAAAGCAGCAATGTACAAGA IIIIIII.IIIIII1@44@-7.%<&+/$/%4(++(%
909
910    Finally we note that some sources state that the quality string should
911    start with "!" (which using the PHRED mapping means the first letter always
912    has a quality score of zero).  This rather restrictive rule is not widely
913    observed, so is therefore ignored here.  One plus point about this "!" rule
914    is that (provided there are no line breaks in the quality sequence) it
915    would prevent the above problem with the "@" character.
916    """
917    try:
918        handle = open(source)
919    except TypeError:
920        handle = source
921        if handle.read(0) != "":
922            raise StreamModeError("Fastq files must be opened in text mode") from None
923    try:
924        try:
925            line = next(handle)
926        except StopIteration:
927            return  # Premature end of file, or just empty?
928
929        while True:
930            if line[0] != "@":
931                raise ValueError(
932                    "Records in Fastq files should start with '@' character"
933                )
934            title_line = line[1:].rstrip()
935            seq_string = ""
936            # There will now be one or more sequence lines; keep going until we
937            # find the "+" marking the quality line:
938            for line in handle:
939                if line[0] == "+":
940                    break
941                seq_string += line.rstrip()
942            else:
943                if seq_string:
944                    raise ValueError("End of file without quality information.")
945                else:
946                    raise ValueError("Unexpected end of file")
947            # The title here is optional, but if present must match!
948            second_title = line[1:].rstrip()
949            if second_title and second_title != title_line:
950                raise ValueError("Sequence and quality captions differ.")
951            # This is going to slow things down a little, but assuming
952            # this isn't allowed we should try and catch it here:
953            if " " in seq_string or "\t" in seq_string:
954                raise ValueError("Whitespace is not allowed in the sequence.")
955            seq_len = len(seq_string)
956
957            # There will now be at least one line of quality data, followed by
958            # another sequence, or EOF
959            line = None
960            quality_string = ""
961            for line in handle:
962                if line[0] == "@":
963                    # This COULD be the start of a new sequence. However, it MAY just
964                    # be a line of quality data which starts with a "@" character.  We
965                    # should be able to check this by looking at the sequence length
966                    # and the amount of quality data found so far.
967                    if len(quality_string) >= seq_len:
968                        # We expect it to be equal if this is the start of a new record.
969                        # If the quality data is longer, we'll raise an error below.
970                        break
971                    # Continue - its just some (more) quality data.
972                quality_string += line.rstrip()
973            else:
974                if line is None:
975                    raise ValueError("Unexpected end of file")
976                line = None
977
978            if seq_len != len(quality_string):
979                raise ValueError(
980                    "Lengths of sequence and quality values differs for %s (%i and %i)."
981                    % (title_line, seq_len, len(quality_string))
982                )
983
984            # Return the record and then continue...
985            yield (title_line, seq_string, quality_string)
986
987            if line is None:
988                break
989    finally:
990        if handle is not source:
991            handle.close()
992
993
994class FastqPhredIterator(SequenceIterator):
995    """Parser for FASTQ files."""
996
997    def __init__(self, source, alphabet=None, title2ids=None):
998        """Iterate over FASTQ records as SeqRecord objects.
999
1000        Arguments:
1001         - source - input stream opened in text mode, or a path to a file
1002         - alphabet - optional alphabet, no longer used. Leave as None.
1003         - title2ids - A function that, when given the title line from the FASTQ
1004           file (without the beginning >), will return the id, name and
1005           description (in that order) for the record as a tuple of strings.
1006           If this is not given, then the entire title line will be used as
1007           the description, and the first word as the id and name.
1008
1009        Note that use of title2ids matches that of Bio.SeqIO.FastaIO.
1010
1011        For each sequence in a (Sanger style) FASTQ file there is a matching string
1012        encoding the PHRED qualities (integers between 0 and about 90) using ASCII
1013        values with an offset of 33.
1014
1015        For example, consider a file containing three short reads::
1016
1017            @EAS54_6_R1_2_1_413_324
1018            CCCTTCTTGTCTTCAGCGTTTCTCC
1019            +
1020            ;;3;;;;;;;;;;;;7;;;;;;;88
1021            @EAS54_6_R1_2_1_540_792
1022            TTGGCAGGCCAAGGCCGATGGATCA
1023            +
1024            ;;;;;;;;;;;7;;;;;-;;;3;83
1025            @EAS54_6_R1_2_1_443_348
1026            GTTGCTTCTGGCGTGGGTGGGGGGG
1027            +
1028            ;;;;;;;;;;;9;7;;.7;393333
1029
1030        For each sequence (e.g. "CCCTTCTTGTCTTCAGCGTTTCTCC") there is a matching
1031        string encoding the PHRED qualities using a ASCII values with an offset of
1032        33 (e.g. ";;3;;;;;;;;;;;;7;;;;;;;88").
1033
1034        Using this module directly you might run:
1035
1036        >>> with open("Quality/example.fastq") as handle:
1037        ...     for record in FastqPhredIterator(handle):
1038        ...         print("%s %s" % (record.id, record.seq))
1039        EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC
1040        EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA
1041        EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG
1042
1043        Typically however, you would call this via Bio.SeqIO instead with "fastq"
1044        (or "fastq-sanger") as the format:
1045
1046        >>> from Bio import SeqIO
1047        >>> with open("Quality/example.fastq") as handle:
1048        ...     for record in SeqIO.parse(handle, "fastq"):
1049        ...         print("%s %s" % (record.id, record.seq))
1050        EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC
1051        EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA
1052        EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG
1053
1054        If you want to look at the qualities, they are record in each record's
1055        per-letter-annotation dictionary as a simple list of integers:
1056
1057        >>> print(record.letter_annotations["phred_quality"])
1058        [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
1059
1060        """
1061        if alphabet is not None:
1062            raise ValueError("The alphabet argument is no longer supported")
1063        self.title2ids = title2ids
1064        super().__init__(source, mode="t", fmt="Fastq")
1065
1066    def parse(self, handle):
1067        """Start parsing the file, and return a SeqRecord generator."""
1068        records = self.iterate(handle)
1069        return records
1070
1071    def iterate(self, handle):
1072        """Parse the file and generate SeqRecord objects."""
1073        title2ids = self.title2ids
1074        assert SANGER_SCORE_OFFSET == ord("!")
1075        # Originally, I used a list expression for each record:
1076        #
1077        # qualities = [ord(letter)-SANGER_SCORE_OFFSET for letter in quality_string]
1078        #
1079        # Precomputing is faster, perhaps partly by avoiding the subtractions.
1080        q_mapping = {
1081            chr(letter): letter - SANGER_SCORE_OFFSET
1082            for letter in range(SANGER_SCORE_OFFSET, 94 + SANGER_SCORE_OFFSET)
1083        }
1084
1085        for title_line, seq_string, quality_string in FastqGeneralIterator(handle):
1086            if title2ids:
1087                id, name, descr = title2ids(title_line)
1088            else:
1089                descr = title_line
1090                id = descr.split()[0]
1091                name = id
1092            record = SeqRecord(Seq(seq_string), id=id, name=name, description=descr)
1093            try:
1094                qualities = [q_mapping[letter] for letter in quality_string]
1095            except KeyError:
1096                raise ValueError("Invalid character in quality string") from None
1097            # For speed, will now use a dirty trick to speed up assigning the
1098            # qualities. We do this to bypass the length check imposed by the
1099            # per-letter-annotations restricted dict (as this has already been
1100            # checked by FastqGeneralIterator). This is equivalent to:
1101            # record.letter_annotations["phred_quality"] = qualities
1102            dict.__setitem__(record._per_letter_annotations, "phred_quality", qualities)
1103            yield record
1104
1105
1106def FastqSolexaIterator(source, alphabet=None, title2ids=None):
1107    r"""Parse old Solexa/Illumina FASTQ like files (which differ in the quality mapping).
1108
1109    The optional arguments are the same as those for the FastqPhredIterator.
1110
1111    For each sequence in Solexa/Illumina FASTQ files there is a matching string
1112    encoding the Solexa integer qualities using ASCII values with an offset
1113    of 64.  Solexa scores are scaled differently to PHRED scores, and Biopython
1114    will NOT perform any automatic conversion when loading.
1115
1116    NOTE - This file format is used by the OLD versions of the Solexa/Illumina
1117    pipeline. See also the FastqIlluminaIterator function for the NEW version.
1118
1119    For example, consider a file containing these five records::
1120
1121        @SLXA-B3_649_FC8437_R1_1_1_610_79
1122        GATGTGCAATACCTTTGTAGAGGAA
1123        +SLXA-B3_649_FC8437_R1_1_1_610_79
1124        YYYYYYYYYYYYYYYYYYWYWYYSU
1125        @SLXA-B3_649_FC8437_R1_1_1_397_389
1126        GGTTTGAGAAAGAGAAATGAGATAA
1127        +SLXA-B3_649_FC8437_R1_1_1_397_389
1128        YYYYYYYYYWYYYYWWYYYWYWYWW
1129        @SLXA-B3_649_FC8437_R1_1_1_850_123
1130        GAGGGTGTTGATCATGATGATGGCG
1131        +SLXA-B3_649_FC8437_R1_1_1_850_123
1132        YYYYYYYYYYYYYWYYWYYSYYYSY
1133        @SLXA-B3_649_FC8437_R1_1_1_362_549
1134        GGAAACAAAGTTTTTCTCAACATAG
1135        +SLXA-B3_649_FC8437_R1_1_1_362_549
1136        YYYYYYYYYYYYYYYYYYWWWWYWY
1137        @SLXA-B3_649_FC8437_R1_1_1_183_714
1138        GTATTATTTAATGGCATACACTCAA
1139        +SLXA-B3_649_FC8437_R1_1_1_183_714
1140        YYYYYYYYYYWYYYYWYWWUWWWQQ
1141
1142    Using this module directly you might run:
1143
1144    >>> with open("Quality/solexa_example.fastq") as handle:
1145    ...     for record in FastqSolexaIterator(handle):
1146    ...         print("%s %s" % (record.id, record.seq))
1147    SLXA-B3_649_FC8437_R1_1_1_610_79 GATGTGCAATACCTTTGTAGAGGAA
1148    SLXA-B3_649_FC8437_R1_1_1_397_389 GGTTTGAGAAAGAGAAATGAGATAA
1149    SLXA-B3_649_FC8437_R1_1_1_850_123 GAGGGTGTTGATCATGATGATGGCG
1150    SLXA-B3_649_FC8437_R1_1_1_362_549 GGAAACAAAGTTTTTCTCAACATAG
1151    SLXA-B3_649_FC8437_R1_1_1_183_714 GTATTATTTAATGGCATACACTCAA
1152
1153    Typically however, you would call this via Bio.SeqIO instead with
1154    "fastq-solexa" as the format:
1155
1156    >>> from Bio import SeqIO
1157    >>> with open("Quality/solexa_example.fastq") as handle:
1158    ...     for record in SeqIO.parse(handle, "fastq-solexa"):
1159    ...         print("%s %s" % (record.id, record.seq))
1160    SLXA-B3_649_FC8437_R1_1_1_610_79 GATGTGCAATACCTTTGTAGAGGAA
1161    SLXA-B3_649_FC8437_R1_1_1_397_389 GGTTTGAGAAAGAGAAATGAGATAA
1162    SLXA-B3_649_FC8437_R1_1_1_850_123 GAGGGTGTTGATCATGATGATGGCG
1163    SLXA-B3_649_FC8437_R1_1_1_362_549 GGAAACAAAGTTTTTCTCAACATAG
1164    SLXA-B3_649_FC8437_R1_1_1_183_714 GTATTATTTAATGGCATACACTCAA
1165
1166    If you want to look at the qualities, they are recorded in each record's
1167    per-letter-annotation dictionary as a simple list of integers:
1168
1169    >>> print(record.letter_annotations["solexa_quality"])
1170    [25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 23, 25, 25, 25, 25, 23, 25, 23, 23, 21, 23, 23, 23, 17, 17]
1171
1172    These scores aren't very good, but they are high enough that they map
1173    almost exactly onto PHRED scores:
1174
1175    >>> print("%0.2f" % phred_quality_from_solexa(25))
1176    25.01
1177
1178    Let's look at faked example read which is even worse, where there are
1179    more noticeable differences between the Solexa and PHRED scores::
1180
1181         @slxa_0001_1_0001_01
1182         ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
1183         +slxa_0001_1_0001_01
1184         hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<;
1185
1186    Again, you would typically use Bio.SeqIO to read this file in (rather than
1187    calling the Bio.SeqIO.QualtityIO module directly).  Most FASTQ files will
1188    contain thousands of reads, so you would normally use Bio.SeqIO.parse()
1189    as shown above.  This example has only as one entry, so instead we can
1190    use the Bio.SeqIO.read() function:
1191
1192    >>> from Bio import SeqIO
1193    >>> with open("Quality/solexa_faked.fastq") as handle:
1194    ...     record = SeqIO.read(handle, "fastq-solexa")
1195    >>> print("%s %s" % (record.id, record.seq))
1196    slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
1197    >>> print(record.letter_annotations["solexa_quality"])
1198    [40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5]
1199
1200    These quality scores are so low that when converted from the Solexa scheme
1201    into PHRED scores they look quite different:
1202
1203    >>> print("%0.2f" % phred_quality_from_solexa(-1))
1204    2.54
1205    >>> print("%0.2f" % phred_quality_from_solexa(-5))
1206    1.19
1207
1208    Note you can use the Bio.SeqIO.write() function or the SeqRecord's format
1209    method to output the record(s):
1210
1211    >>> print(record.format("fastq-solexa"))
1212    @slxa_0001_1_0001_01
1213    ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
1214    +
1215    hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<;
1216    <BLANKLINE>
1217
1218    Note this output is slightly different from the input file as Biopython
1219    has left out the optional repetition of the sequence identifier on the "+"
1220    line.  If you want the to use PHRED scores, use "fastq" or "qual" as the
1221    output format instead, and Biopython will do the conversion for you:
1222
1223    >>> print(record.format("fastq"))
1224    @slxa_0001_1_0001_01
1225    ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
1226    +
1227    IHGFEDCBA@?>=<;:9876543210/.-,++*)('&&%%$$##""
1228    <BLANKLINE>
1229
1230    >>> print(record.format("qual"))
1231    >slxa_0001_1_0001_01
1232    40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21
1233    20 19 18 17 16 15 14 13 12 11 10 10 9 8 7 6 5 5 4 4 3 3 2 2
1234    1 1
1235    <BLANKLINE>
1236
1237    As shown above, the poor quality Solexa reads have been mapped to the
1238    equivalent PHRED score (e.g. -5 to 1 as shown earlier).
1239    """
1240    if alphabet is not None:
1241        raise ValueError("The alphabet argument is no longer supported")
1242
1243    q_mapping = {
1244        chr(letter): letter - SOLEXA_SCORE_OFFSET
1245        for letter in range(SOLEXA_SCORE_OFFSET - 5, 63 + SOLEXA_SCORE_OFFSET)
1246    }
1247
1248    for title_line, seq_string, quality_string in FastqGeneralIterator(source):
1249        if title2ids:
1250            id, name, descr = title2ids(title_line)
1251        else:
1252            descr = title_line
1253            id = descr.split()[0]
1254            name = id
1255        record = SeqRecord(Seq(seq_string), id=id, name=name, description=descr)
1256        try:
1257            qualities = [q_mapping[letter] for letter in quality_string]
1258        # DO NOT convert these into PHRED qualities automatically!
1259        except KeyError:
1260            raise ValueError("Invalid character in quality string") from None
1261        # Dirty trick to speed up this line:
1262        # record.letter_annotations["solexa_quality"] = qualities
1263        dict.__setitem__(record._per_letter_annotations, "solexa_quality", qualities)
1264        yield record
1265
1266
1267def FastqIlluminaIterator(source, alphabet=None, title2ids=None):
1268    """Parse Illumina 1.3 to 1.7 FASTQ like files (which differ in the quality mapping).
1269
1270    The optional arguments are the same as those for the FastqPhredIterator.
1271
1272    For each sequence in Illumina 1.3+ FASTQ files there is a matching string
1273    encoding PHRED integer qualities using ASCII values with an offset of 64.
1274
1275    >>> from Bio import SeqIO
1276    >>> record = SeqIO.read("Quality/illumina_faked.fastq", "fastq-illumina")
1277    >>> print("%s %s" % (record.id, record.seq))
1278    Test ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN
1279    >>> max(record.letter_annotations["phred_quality"])
1280    40
1281    >>> min(record.letter_annotations["phred_quality"])
1282    0
1283
1284    NOTE - Older versions of the Solexa/Illumina pipeline encoded Solexa scores
1285    with an ASCII offset of 64. They are approximately equal but only for high
1286    quality reads. If you have an old Solexa/Illumina file with negative
1287    Solexa scores, and try and read this as an Illumina 1.3+ file it will fail:
1288
1289    >>> record2 = SeqIO.read("Quality/solexa_faked.fastq", "fastq-illumina")
1290    Traceback (most recent call last):
1291       ...
1292    ValueError: Invalid character in quality string
1293
1294    NOTE - True Sanger style FASTQ files use PHRED scores with an offset of 33.
1295    """
1296    if alphabet is not None:
1297        raise ValueError("The alphabet argument is no longer supported")
1298
1299    q_mapping = {
1300        chr(letter): letter - SOLEXA_SCORE_OFFSET
1301        for letter in range(SOLEXA_SCORE_OFFSET, 63 + SOLEXA_SCORE_OFFSET)
1302    }
1303
1304    for title_line, seq_string, quality_string in FastqGeneralIterator(source):
1305        if title2ids:
1306            id, name, descr = title2ids(title_line)
1307        else:
1308            descr = title_line
1309            id = descr.split()[0]
1310            name = id
1311        record = SeqRecord(Seq(seq_string), id=id, name=name, description=descr)
1312        try:
1313            qualities = [q_mapping[letter] for letter in quality_string]
1314        except KeyError:
1315            raise ValueError("Invalid character in quality string") from None
1316        # Dirty trick to speed up this line:
1317        # record.letter_annotations["phred_quality"] = qualities
1318        dict.__setitem__(record._per_letter_annotations, "phred_quality", qualities)
1319        yield record
1320
1321
1322class QualPhredIterator(SequenceIterator):
1323    """Parser for QUAL files with PHRED quality scores but no sequence."""
1324
1325    def __init__(self, source, alphabet=None, title2ids=None):
1326        """For QUAL files which include PHRED quality scores, but no sequence.
1327
1328        For example, consider this short QUAL file::
1329
1330            >EAS54_6_R1_2_1_413_324
1331            26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26
1332            26 26 26 23 23
1333            >EAS54_6_R1_2_1_540_792
1334            26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26
1335            26 18 26 23 18
1336            >EAS54_6_R1_2_1_443_348
1337            26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18
1338            24 18 18 18 18
1339
1340        Using this module directly you might run:
1341
1342        >>> with open("Quality/example.qual") as handle:
1343        ...     for record in QualPhredIterator(handle):
1344        ...         print("%s read of length %d" % (record.id, len(record.seq)))
1345        EAS54_6_R1_2_1_413_324 read of length 25
1346        EAS54_6_R1_2_1_540_792 read of length 25
1347        EAS54_6_R1_2_1_443_348 read of length 25
1348
1349        Typically however, you would call this via Bio.SeqIO instead with "qual"
1350        as the format:
1351
1352        >>> from Bio import SeqIO
1353        >>> with open("Quality/example.qual") as handle:
1354        ...     for record in SeqIO.parse(handle, "qual"):
1355        ...         print("%s read of length %d" % (record.id, len(record.seq)))
1356        EAS54_6_R1_2_1_413_324 read of length 25
1357        EAS54_6_R1_2_1_540_792 read of length 25
1358        EAS54_6_R1_2_1_443_348 read of length 25
1359
1360        Only the sequence length is known, as the QUAL file does not contain
1361        the sequence string itself.
1362
1363        The quality scores themselves are available as a list of integers
1364        in each record's per-letter-annotation:
1365
1366        >>> print(record.letter_annotations["phred_quality"])
1367        [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
1368
1369        You can still slice one of these SeqRecord objects:
1370
1371        >>> sub_record = record[5:10]
1372        >>> print("%s %s" % (sub_record.id, sub_record.letter_annotations["phred_quality"]))
1373        EAS54_6_R1_2_1_443_348 [26, 26, 26, 26, 26]
1374
1375        As of Biopython 1.59, this parser will accept files with negatives quality
1376        scores but will replace them with the lowest possible PHRED score of zero.
1377        This will trigger a warning, previously it raised a ValueError exception.
1378        """
1379        if alphabet is not None:
1380            raise ValueError("The alphabet argument is no longer supported")
1381        self.title2ids = title2ids
1382        super().__init__(source, mode="t", fmt="QUAL")
1383
1384    def parse(self, handle):
1385        """Start parsing the file, and return a SeqRecord generator."""
1386        records = self.iterate(handle)
1387        return records
1388
1389    def iterate(self, handle):
1390        """Parse the file and generate SeqRecord objects."""
1391        title2ids = self.title2ids
1392        # Skip any text before the first record (e.g. blank lines, comments)
1393        for line in handle:
1394            if line[0] == ">":
1395                break
1396        else:
1397            return
1398
1399        while True:
1400            if line[0] != ">":
1401                raise ValueError(
1402                    "Records in Fasta files should start with '>' character"
1403                )
1404            if title2ids:
1405                id, name, descr = title2ids(line[1:].rstrip())
1406            else:
1407                descr = line[1:].rstrip()
1408                id = descr.split()[0]
1409                name = id
1410
1411            qualities = []
1412            for line in handle:
1413                if line[0] == ">":
1414                    break
1415                qualities.extend(int(word) for word in line.split())
1416            else:
1417                line = None
1418
1419            if qualities and min(qualities) < 0:
1420                warnings.warn(
1421                    "Negative quality score %i found, substituting PHRED zero instead."
1422                    % min(qualities),
1423                    BiopythonParserWarning,
1424                )
1425                qualities = [max(0, q) for q in qualities]
1426
1427            # Return the record and then continue...
1428            sequence = Seq(None, length=len(qualities))
1429            record = SeqRecord(sequence, id=id, name=name, description=descr)
1430            # Dirty trick to speed up this line:
1431            # record.letter_annotations["phred_quality"] = qualities
1432            dict.__setitem__(record._per_letter_annotations, "phred_quality", qualities)
1433            yield record
1434
1435            if line is None:
1436                return  # StopIteration
1437        raise ValueError("Unrecognised QUAL record format.")
1438
1439
1440class FastqPhredWriter(SequenceWriter):
1441    """Class to write standard FASTQ format files (using PHRED quality scores) (OBSOLETE).
1442
1443    Although you can use this class directly, you are strongly encouraged
1444    to use the ``as_fastq`` function, or top level ``Bio.SeqIO.write()``
1445    function instead via the format name "fastq" or the alias "fastq-sanger".
1446
1447    For example, this code reads in a standard Sanger style FASTQ file
1448    (using PHRED scores) and re-saves it as another Sanger style FASTQ file:
1449
1450    >>> from Bio import SeqIO
1451    >>> record_iterator = SeqIO.parse("Quality/example.fastq", "fastq")
1452    >>> with open("Quality/temp.fastq", "w") as out_handle:
1453    ...     SeqIO.write(record_iterator, out_handle, "fastq")
1454    3
1455
1456    You might want to do this if the original file included extra line breaks,
1457    which while valid may not be supported by all tools.  The output file from
1458    Biopython will have each sequence on a single line, and each quality
1459    string on a single line (which is considered desirable for maximum
1460    compatibility).
1461
1462    In this next example, an old style Solexa/Illumina FASTQ file (using Solexa
1463    quality scores) is converted into a standard Sanger style FASTQ file using
1464    PHRED qualities:
1465
1466    >>> from Bio import SeqIO
1467    >>> record_iterator = SeqIO.parse("Quality/solexa_example.fastq", "fastq-solexa")
1468    >>> with open("Quality/temp.fastq", "w") as out_handle:
1469    ...     SeqIO.write(record_iterator, out_handle, "fastq")
1470    5
1471
1472    This code is also called if you use the .format("fastq") method of a
1473    SeqRecord, or .format("fastq-sanger") if you prefer that alias.
1474
1475    Note that Sanger FASTQ files have an upper limit of PHRED quality 93, which is
1476    encoded as ASCII 126, the tilde. If your quality scores are truncated to fit, a
1477    warning is issued.
1478
1479    P.S. To avoid cluttering up your working directory, you can delete this
1480    temporary file now:
1481
1482    >>> import os
1483    >>> os.remove("Quality/temp.fastq")
1484    """
1485
1486    assert SANGER_SCORE_OFFSET == ord("!")
1487
1488    def write_record(self, record):
1489        """Write a single FASTQ record to the file."""
1490        assert self._header_written
1491        assert not self._footer_written
1492        self._record_written = True
1493        # TODO - Is an empty sequence allowed in FASTQ format?
1494        seq = record.seq
1495        if seq is None:
1496            raise ValueError("No sequence for record %s" % record.id)
1497        qualities_str = _get_sanger_quality_str(record)
1498        if len(qualities_str) != len(seq):
1499            raise ValueError(
1500                "Record %s has sequence length %i but %i quality scores"
1501                % (record.id, len(seq), len(qualities_str))
1502            )
1503
1504        # FASTQ files can include a description, just like FASTA files
1505        # (at least, this is what the NCBI Short Read Archive does)
1506        id = self.clean(record.id)
1507        description = self.clean(record.description)
1508        if description and description.split(None, 1)[0] == id:
1509            # The description includes the id at the start
1510            title = description
1511        elif description:
1512            title = "%s %s" % (id, description)
1513        else:
1514            title = id
1515
1516        self.handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qualities_str))
1517
1518
1519def as_fastq(record):
1520    """Turn a SeqRecord into a Sanger FASTQ formatted string.
1521
1522    This is used internally by the SeqRecord's .format("fastq")
1523    method and by the SeqIO.write(..., ..., "fastq") function,
1524    and under the format alias "fastq-sanger" as well.
1525    """
1526    seq_str = _get_seq_string(record)
1527    qualities_str = _get_sanger_quality_str(record)
1528    if len(qualities_str) != len(seq_str):
1529        raise ValueError(
1530            "Record %s has sequence length %i but %i quality scores"
1531            % (record.id, len(seq_str), len(qualities_str))
1532        )
1533    id = _clean(record.id)
1534    description = _clean(record.description)
1535    if description and description.split(None, 1)[0] == id:
1536        title = description
1537    elif description:
1538        title = "%s %s" % (id, description)
1539    else:
1540        title = id
1541    return "@%s\n%s\n+\n%s\n" % (title, seq_str, qualities_str)
1542
1543
1544class QualPhredWriter(SequenceWriter):
1545    """Class to write QUAL format files (using PHRED quality scores) (OBSOLETE).
1546
1547    Although you can use this class directly, you are strongly encouraged
1548    to use the ``as_qual`` function, or top level ``Bio.SeqIO.write()``
1549    function instead.
1550
1551    For example, this code reads in a FASTQ file and saves the quality scores
1552    into a QUAL file:
1553
1554    >>> from Bio import SeqIO
1555    >>> record_iterator = SeqIO.parse("Quality/example.fastq", "fastq")
1556    >>> with open("Quality/temp.qual", "w") as out_handle:
1557    ...     SeqIO.write(record_iterator, out_handle, "qual")
1558    3
1559
1560    This code is also called if you use the .format("qual") method of a
1561    SeqRecord.
1562
1563    P.S. Don't forget to clean up the temp file if you don't need it anymore:
1564
1565    >>> import os
1566    >>> os.remove("Quality/temp.qual")
1567    """
1568
1569    def __init__(self, handle, wrap=60, record2title=None):
1570        """Create a QUAL writer.
1571
1572        Arguments:
1573         - handle - Handle to an output file, e.g. as returned
1574           by open(filename, "w")
1575         - wrap   - Optional line length used to wrap sequence lines.
1576           Defaults to wrapping the sequence at 60 characters. Use
1577           zero (or None) for no wrapping, giving a single long line
1578           for the sequence.
1579         - record2title - Optional function to return the text to be
1580           used for the title line of each record.  By default a
1581           combination of the record.id and record.description is
1582           used.  If the record.description starts with the record.id,
1583           then just the record.description is used.
1584
1585        The record2title argument is present for consistency with the
1586        Bio.SeqIO.FastaIO writer class.
1587        """
1588        super().__init__(handle)
1589        # self.handle = handle
1590        self.wrap = None
1591        if wrap:
1592            if wrap < 1:
1593                raise ValueError
1594        self.wrap = wrap
1595        self.record2title = record2title
1596
1597    def write_record(self, record):
1598        """Write a single QUAL record to the file."""
1599        assert self._header_written
1600        assert not self._footer_written
1601        self._record_written = True
1602
1603        handle = self.handle
1604        wrap = self.wrap
1605
1606        if self.record2title:
1607            title = self.clean(self.record2title(record))
1608        else:
1609            id = self.clean(record.id)
1610            description = self.clean(record.description)
1611            if description and description.split(None, 1)[0] == id:
1612                # The description includes the id at the start
1613                title = description
1614            elif description:
1615                title = "%s %s" % (id, description)
1616            else:
1617                title = id
1618        handle.write(">%s\n" % title)
1619
1620        qualities = _get_phred_quality(record)
1621        try:
1622            # This rounds to the nearest integer.
1623            # TODO - can we record a float in a qual file?
1624            qualities_strs = [("%i" % round(q, 0)) for q in qualities]
1625        except TypeError:
1626            if None in qualities:
1627                raise TypeError("A quality value of None was found") from None
1628            else:
1629                raise
1630
1631        if wrap > 5:
1632            # Fast wrapping
1633            data = " ".join(qualities_strs)
1634            while True:
1635                if len(data) <= wrap:
1636                    self.handle.write(data + "\n")
1637                    break
1638                else:
1639                    # By construction there must be spaces in the first X chars
1640                    # (unless we have X digit or higher quality scores!)
1641                    i = data.rfind(" ", 0, wrap)
1642                    handle.write(data[:i] + "\n")
1643                    data = data[i + 1 :]
1644        elif wrap:
1645            # Safe wrapping
1646            while qualities_strs:
1647                line = qualities_strs.pop(0)
1648                while qualities_strs and len(line) + 1 + len(qualities_strs[0]) < wrap:
1649                    line += " " + qualities_strs.pop(0)
1650                handle.write(line + "\n")
1651        else:
1652            # No wrapping
1653            data = " ".join(qualities_strs)
1654            handle.write(data + "\n")
1655
1656
1657def as_qual(record):
1658    """Turn a SeqRecord into a QUAL formatted string.
1659
1660    This is used internally by the SeqRecord's .format("qual")
1661    method and by the SeqIO.write(..., ..., "qual") function.
1662    """
1663    id = _clean(record.id)
1664    description = _clean(record.description)
1665    if description and description.split(None, 1)[0] == id:
1666        title = description
1667    elif description:
1668        title = "%s %s" % (id, description)
1669    else:
1670        title = id
1671    lines = [">%s\n" % title]
1672
1673    qualities = _get_phred_quality(record)
1674    try:
1675        # This rounds to the nearest integer.
1676        # TODO - can we record a float in a qual file?
1677        qualities_strs = [("%i" % round(q, 0)) for q in qualities]
1678    except TypeError:
1679        if None in qualities:
1680            raise TypeError("A quality value of None was found") from None
1681        else:
1682            raise
1683
1684    # Safe wrapping
1685    while qualities_strs:
1686        line = qualities_strs.pop(0)
1687        while qualities_strs and len(line) + 1 + len(qualities_strs[0]) < 60:
1688            line += " " + qualities_strs.pop(0)
1689        lines.append(line + "\n")
1690    return "".join(lines)
1691
1692
1693class FastqSolexaWriter(SequenceWriter):
1694    r"""Write old style Solexa/Illumina FASTQ format files (with Solexa qualities) (OBSOLETE).
1695
1696    This outputs FASTQ files like those from the early Solexa/Illumina
1697    pipeline, using Solexa scores and an ASCII offset of 64. These are
1698    NOT compatible with the standard Sanger style PHRED FASTQ files.
1699
1700    If your records contain a "solexa_quality" entry under letter_annotations,
1701    this is used, otherwise any "phred_quality" entry will be used after
1702    conversion using the solexa_quality_from_phred function. If neither style
1703    of quality scores are present, an exception is raised.
1704
1705    Although you can use this class directly, you are strongly encouraged
1706    to use the ``as_fastq_solexa`` function, or top-level ``Bio.SeqIO.write()``
1707    function instead.  For example, this code reads in a FASTQ file and re-saves
1708    it as another FASTQ file:
1709
1710    >>> from Bio import SeqIO
1711    >>> record_iterator = SeqIO.parse("Quality/solexa_example.fastq", "fastq-solexa")
1712    >>> with open("Quality/temp.fastq", "w") as out_handle:
1713    ...     SeqIO.write(record_iterator, out_handle, "fastq-solexa")
1714    5
1715
1716    You might want to do this if the original file included extra line breaks,
1717    which (while valid) may not be supported by all tools.  The output file
1718    from Biopython will have each sequence on a single line, and each quality
1719    string on a single line (which is considered desirable for maximum
1720    compatibility).
1721
1722    This code is also called if you use the .format("fastq-solexa") method of
1723    a SeqRecord. For example,
1724
1725    >>> record = SeqIO.read("Quality/sanger_faked.fastq", "fastq-sanger")
1726    >>> print(record.format("fastq-solexa"))
1727    @Test PHRED qualities from 40 to 0 inclusive
1728    ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN
1729    +
1730    hgfedcba`_^]\[ZYXWVUTSRQPONMLKJHGFECB@>;;
1731    <BLANKLINE>
1732
1733    Note that Solexa FASTQ files have an upper limit of Solexa quality 62, which is
1734    encoded as ASCII 126, the tilde.  If your quality scores must be truncated to fit,
1735    a warning is issued.
1736
1737    P.S. Don't forget to delete the temp file if you don't need it anymore:
1738
1739    >>> import os
1740    >>> os.remove("Quality/temp.fastq")
1741    """
1742
1743    def write_record(self, record):
1744        """Write a single FASTQ record to the file."""
1745        assert self._header_written
1746        assert not self._footer_written
1747        self._record_written = True
1748
1749        # TODO - Is an empty sequence allowed in FASTQ format?
1750        seq = record.seq
1751        if seq is None:
1752            raise ValueError("No sequence for record %s" % record.id)
1753        qualities_str = _get_solexa_quality_str(record)
1754        if len(qualities_str) != len(seq):
1755            raise ValueError(
1756                "Record %s has sequence length %i but %i quality scores"
1757                % (record.id, len(seq), len(qualities_str))
1758            )
1759
1760        # FASTQ files can include a description, just like FASTA files
1761        # (at least, this is what the NCBI Short Read Archive does)
1762        id = self.clean(record.id)
1763        description = self.clean(record.description)
1764        if description and description.split(None, 1)[0] == id:
1765            # The description includes the id at the start
1766            title = description
1767        elif description:
1768            title = "%s %s" % (id, description)
1769        else:
1770            title = id
1771
1772        self.handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qualities_str))
1773
1774
1775def as_fastq_solexa(record):
1776    """Turn a SeqRecord into a Solexa FASTQ formatted string.
1777
1778    This is used internally by the SeqRecord's .format("fastq-solexa")
1779    method and by the SeqIO.write(..., ..., "fastq-solexa") function.
1780    """
1781    seq_str = _get_seq_string(record)
1782    qualities_str = _get_solexa_quality_str(record)
1783    if len(qualities_str) != len(seq_str):
1784        raise ValueError(
1785            "Record %s has sequence length %i but %i quality scores"
1786            % (record.id, len(seq_str), len(qualities_str))
1787        )
1788    id = _clean(record.id)
1789    description = _clean(record.description)
1790    if description and description.split(None, 1)[0] == id:
1791        # The description includes the id at the start
1792        title = description
1793    elif description:
1794        title = "%s %s" % (id, description)
1795    else:
1796        title = id
1797    return "@%s\n%s\n+\n%s\n" % (title, seq_str, qualities_str)
1798
1799
1800class FastqIlluminaWriter(SequenceWriter):
1801    r"""Write Illumina 1.3+ FASTQ format files (with PHRED quality scores) (OBSOLETE).
1802
1803    This outputs FASTQ files like those from the Solexa/Illumina 1.3+ pipeline,
1804    using PHRED scores and an ASCII offset of 64. Note these files are NOT
1805    compatible with the standard Sanger style PHRED FASTQ files which use an
1806    ASCII offset of 32.
1807
1808    Although you can use this class directly, you are strongly encouraged to
1809    use the ``as_fastq_illumina`` or top-level ``Bio.SeqIO.write()`` function
1810    with format name "fastq-illumina" instead. This code is also called if you
1811    use the .format("fastq-illumina") method of a SeqRecord. For example,
1812
1813    >>> from Bio import SeqIO
1814    >>> record = SeqIO.read("Quality/sanger_faked.fastq", "fastq-sanger")
1815    >>> print(record.format("fastq-illumina"))
1816    @Test PHRED qualities from 40 to 0 inclusive
1817    ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN
1818    +
1819    hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@
1820    <BLANKLINE>
1821
1822    Note that Illumina FASTQ files have an upper limit of PHRED quality 62, which is
1823    encoded as ASCII 126, the tilde. If your quality scores are truncated to fit, a
1824    warning is issued.
1825    """
1826
1827    def write_record(self, record):
1828        """Write a single FASTQ record to the file."""
1829        assert self._header_written
1830        assert not self._footer_written
1831        self._record_written = True
1832
1833        # TODO - Is an empty sequence allowed in FASTQ format?
1834        seq = record.seq
1835        if seq is None:
1836            raise ValueError("No sequence for record %s" % record.id)
1837        qualities_str = _get_illumina_quality_str(record)
1838        if len(qualities_str) != len(seq):
1839            raise ValueError(
1840                "Record %s has sequence length %i but %i quality scores"
1841                % (record.id, len(seq), len(qualities_str))
1842            )
1843
1844        # FASTQ files can include a description, just like FASTA files
1845        # (at least, this is what the NCBI Short Read Archive does)
1846        id = self.clean(record.id)
1847        description = self.clean(record.description)
1848        if description and description.split(None, 1)[0] == id:
1849            # The description includes the id at the start
1850            title = description
1851        elif description:
1852            title = "%s %s" % (id, description)
1853        else:
1854            title = id
1855
1856        self.handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qualities_str))
1857
1858
1859def as_fastq_illumina(record):
1860    """Turn a SeqRecord into an Illumina FASTQ formatted string.
1861
1862    This is used internally by the SeqRecord's .format("fastq-illumina")
1863    method and by the SeqIO.write(..., ..., "fastq-illumina") function.
1864    """
1865    seq_str = _get_seq_string(record)
1866    qualities_str = _get_illumina_quality_str(record)
1867    if len(qualities_str) != len(seq_str):
1868        raise ValueError(
1869            "Record %s has sequence length %i but %i quality scores"
1870            % (record.id, len(seq_str), len(qualities_str))
1871        )
1872    id = _clean(record.id)
1873    description = _clean(record.description)
1874    if description and description.split(None, 1)[0] == id:
1875        title = description
1876    elif description:
1877        title = "%s %s" % (id, description)
1878    else:
1879        title = id
1880    return "@%s\n%s\n+\n%s\n" % (title, seq_str, qualities_str)
1881
1882
1883def PairedFastaQualIterator(fasta_source, qual_source, alphabet=None, title2ids=None):
1884    """Iterate over matched FASTA and QUAL files as SeqRecord objects.
1885
1886    For example, consider this short QUAL file with PHRED quality scores::
1887
1888        >EAS54_6_R1_2_1_413_324
1889        26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26
1890        26 26 26 23 23
1891        >EAS54_6_R1_2_1_540_792
1892        26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26
1893        26 18 26 23 18
1894        >EAS54_6_R1_2_1_443_348
1895        26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18
1896        24 18 18 18 18
1897
1898    And a matching FASTA file::
1899
1900        >EAS54_6_R1_2_1_413_324
1901        CCCTTCTTGTCTTCAGCGTTTCTCC
1902        >EAS54_6_R1_2_1_540_792
1903        TTGGCAGGCCAAGGCCGATGGATCA
1904        >EAS54_6_R1_2_1_443_348
1905        GTTGCTTCTGGCGTGGGTGGGGGGG
1906
1907    You can parse these separately using Bio.SeqIO with the "qual" and
1908    "fasta" formats, but then you'll get a group of SeqRecord objects with
1909    no sequence, and a matching group with the sequence but not the
1910    qualities.  Because it only deals with one input file handle, Bio.SeqIO
1911    can't be used to read the two files together - but this function can!
1912    For example,
1913
1914    >>> with open("Quality/example.fasta") as f:
1915    ...     with open("Quality/example.qual") as q:
1916    ...         for record in PairedFastaQualIterator(f, q):
1917    ...             print("%s %s" % (record.id, record.seq))
1918    ...
1919    EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC
1920    EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA
1921    EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG
1922
1923    As with the FASTQ or QUAL parsers, if you want to look at the qualities,
1924    they are in each record's per-letter-annotation dictionary as a simple
1925    list of integers:
1926
1927    >>> print(record.letter_annotations["phred_quality"])
1928    [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
1929
1930    If you have access to data as a FASTQ format file, using that directly
1931    would be simpler and more straight forward.  Note that you can easily use
1932    this function to convert paired FASTA and QUAL files into FASTQ files:
1933
1934    >>> from Bio import SeqIO
1935    >>> with open("Quality/example.fasta") as f:
1936    ...     with open("Quality/example.qual") as q:
1937    ...         SeqIO.write(PairedFastaQualIterator(f, q), "Quality/temp.fastq", "fastq")
1938    ...
1939    3
1940
1941    And don't forget to clean up the temp file if you don't need it anymore:
1942
1943    >>> import os
1944    >>> os.remove("Quality/temp.fastq")
1945    """
1946    if alphabet is not None:
1947        raise ValueError("The alphabet argument is no longer supported")
1948
1949    from Bio.SeqIO.FastaIO import FastaIterator
1950
1951    fasta_iter = FastaIterator(fasta_source, title2ids=title2ids)
1952    qual_iter = QualPhredIterator(qual_source, title2ids=title2ids)
1953
1954    # Using zip wouldn't load everything into memory, but also would not catch
1955    # any extra records found in only one file.
1956    while True:
1957        try:
1958            f_rec = next(fasta_iter)
1959        except StopIteration:
1960            f_rec = None
1961        try:
1962            q_rec = next(qual_iter)
1963        except StopIteration:
1964            q_rec = None
1965        if f_rec is None and q_rec is None:
1966            # End of both files
1967            break
1968        if f_rec is None:
1969            raise ValueError("FASTA file has more entries than the QUAL file.")
1970        if q_rec is None:
1971            raise ValueError("QUAL file has more entries than the FASTA file.")
1972        if f_rec.id != q_rec.id:
1973            raise ValueError(
1974                "FASTA and QUAL entries do not match (%s vs %s)." % (f_rec.id, q_rec.id)
1975            )
1976        if len(f_rec) != len(q_rec.letter_annotations["phred_quality"]):
1977            raise ValueError(
1978                "Sequence length and number of quality scores disagree for %s"
1979                % f_rec.id
1980            )
1981        # Merge the data....
1982        f_rec.letter_annotations["phred_quality"] = q_rec.letter_annotations[
1983            "phred_quality"
1984        ]
1985        yield f_rec
1986    # Done
1987
1988
1989def _fastq_generic(in_file, out_file, mapping):
1990    """FASTQ helper function where can't have data loss by truncation (PRIVATE)."""
1991    # For real speed, don't even make SeqRecord and Seq objects!
1992    count = 0
1993    null = chr(0)
1994    with as_handle(out_file, "w") as out_handle:
1995        for title, seq, old_qual in FastqGeneralIterator(in_file):
1996            count += 1
1997            # map the qual...
1998            qual = old_qual.translate(mapping)
1999            if null in qual:
2000                raise ValueError("Invalid character in quality string")
2001            out_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
2002    return count
2003
2004
2005def _fastq_generic2(in_file, out_file, mapping, truncate_char, truncate_msg):
2006    """FASTQ helper function where there could be data loss by truncation (PRIVATE)."""
2007    # For real speed, don't even make SeqRecord and Seq objects!
2008    count = 0
2009    null = chr(0)
2010    with as_handle(out_file, "w") as out_handle:
2011        for title, seq, old_qual in FastqGeneralIterator(in_file):
2012            count += 1
2013            # map the qual...
2014            qual = old_qual.translate(mapping)
2015            if null in qual:
2016                raise ValueError("Invalid character in quality string")
2017            if truncate_char in qual:
2018                qual = qual.replace(truncate_char, chr(126))
2019                warnings.warn(truncate_msg, BiopythonWarning)
2020            out_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
2021    return count
2022
2023
2024def _fastq_sanger_convert_fastq_sanger(in_file, out_file):
2025    """Fast Sanger FASTQ to Sanger FASTQ conversion (PRIVATE).
2026
2027    Useful for removing line wrapping and the redundant second identifier
2028    on the plus lines. Will check also check the quality string is valid.
2029
2030    Avoids creating SeqRecord and Seq objects in order to speed up this
2031    conversion.
2032    """
2033    # Map unexpected chars to null
2034    mapping = "".join(
2035        [chr(0) for ascii in range(0, 33)]
2036        + [chr(ascii) for ascii in range(33, 127)]
2037        + [chr(0) for ascii in range(127, 256)]
2038    )
2039    assert len(mapping) == 256
2040    return _fastq_generic(in_file, out_file, mapping)
2041
2042
2043def _fastq_solexa_convert_fastq_solexa(in_file, out_file):
2044    """Fast Solexa FASTQ to Solexa FASTQ conversion (PRIVATE).
2045
2046    Useful for removing line wrapping and the redundant second identifier
2047    on the plus lines. Will check also check the quality string is valid.
2048    Avoids creating SeqRecord and Seq objects in order to speed up this
2049    conversion.
2050    """
2051    # Map unexpected chars to null
2052    mapping = "".join(
2053        [chr(0) for ascii in range(0, 59)]
2054        + [chr(ascii) for ascii in range(59, 127)]
2055        + [chr(0) for ascii in range(127, 256)]
2056    )
2057    assert len(mapping) == 256
2058    return _fastq_generic(in_file, out_file, mapping)
2059
2060
2061def _fastq_illumina_convert_fastq_illumina(in_file, out_file):
2062    """Fast Illumina 1.3+ FASTQ to Illumina 1.3+ FASTQ conversion (PRIVATE).
2063
2064    Useful for removing line wrapping and the redundant second identifier
2065    on the plus lines. Will check also check the quality string is valid.
2066    Avoids creating SeqRecord and Seq objects in order to speed up this
2067    conversion.
2068    """
2069    # Map unexpected chars to null
2070    mapping = "".join(
2071        [chr(0) for ascii in range(0, 64)]
2072        + [chr(ascii) for ascii in range(64, 127)]
2073        + [chr(0) for ascii in range(127, 256)]
2074    )
2075    assert len(mapping) == 256
2076    return _fastq_generic(in_file, out_file, mapping)
2077
2078
2079def _fastq_illumina_convert_fastq_sanger(in_file, out_file):
2080    """Fast Illumina 1.3+ FASTQ to Sanger FASTQ conversion (PRIVATE).
2081
2082    Avoids creating SeqRecord and Seq objects in order to speed up this
2083    conversion.
2084    """
2085    # Map unexpected chars to null
2086    mapping = "".join(
2087        [chr(0) for ascii in range(0, 64)]
2088        + [chr(33 + q) for q in range(0, 62 + 1)]
2089        + [chr(0) for ascii in range(127, 256)]
2090    )
2091    assert len(mapping) == 256
2092    return _fastq_generic(in_file, out_file, mapping)
2093
2094
2095def _fastq_sanger_convert_fastq_illumina(in_file, out_file):
2096    """Fast Sanger FASTQ to Illumina 1.3+ FASTQ conversion (PRIVATE).
2097
2098    Avoids creating SeqRecord and Seq objects in order to speed up this
2099    conversion. Will issue a warning if the scores had to be truncated at 62
2100    (maximum possible in the Illumina 1.3+ FASTQ format)
2101    """
2102    # Map unexpected chars to null
2103    trunc_char = chr(1)
2104    mapping = "".join(
2105        [chr(0) for ascii in range(0, 33)]
2106        + [chr(64 + q) for q in range(0, 62 + 1)]
2107        + [trunc_char for ascii in range(96, 127)]
2108        + [chr(0) for ascii in range(127, 256)]
2109    )
2110    assert len(mapping) == 256
2111    return _fastq_generic2(
2112        in_file,
2113        out_file,
2114        mapping,
2115        trunc_char,
2116        "Data loss - max PHRED quality 62 in Illumina 1.3+ FASTQ",
2117    )
2118
2119
2120def _fastq_solexa_convert_fastq_sanger(in_file, out_file):
2121    """Fast Solexa FASTQ to Sanger FASTQ conversion (PRIVATE).
2122
2123    Avoids creating SeqRecord and Seq objects in order to speed up this
2124    conversion.
2125    """
2126    # Map unexpected chars to null
2127    mapping = "".join(
2128        [chr(0) for ascii in range(0, 59)]
2129        + [
2130            chr(33 + int(round(phred_quality_from_solexa(q))))
2131            for q in range(-5, 62 + 1)
2132        ]
2133        + [chr(0) for ascii in range(127, 256)]
2134    )
2135    assert len(mapping) == 256
2136    return _fastq_generic(in_file, out_file, mapping)
2137
2138
2139def _fastq_sanger_convert_fastq_solexa(in_file, out_file):
2140    """Fast Sanger FASTQ to Solexa FASTQ conversion (PRIVATE).
2141
2142    Avoids creating SeqRecord and Seq objects in order to speed up this
2143    conversion. Will issue a warning if the scores had to be truncated at 62
2144    (maximum possible in the Solexa FASTQ format)
2145    """
2146    # Map unexpected chars to null
2147    trunc_char = chr(1)
2148    mapping = "".join(
2149        [chr(0) for ascii in range(0, 33)]
2150        + [chr(64 + int(round(solexa_quality_from_phred(q)))) for q in range(0, 62 + 1)]
2151        + [trunc_char for ascii in range(96, 127)]
2152        + [chr(0) for ascii in range(127, 256)]
2153    )
2154    assert len(mapping) == 256
2155    return _fastq_generic2(
2156        in_file,
2157        out_file,
2158        mapping,
2159        trunc_char,
2160        "Data loss - max Solexa quality 62 in Solexa FASTQ",
2161    )
2162
2163
2164def _fastq_solexa_convert_fastq_illumina(in_file, out_file):
2165    """Fast Solexa FASTQ to Illumina 1.3+ FASTQ conversion (PRIVATE).
2166
2167    Avoids creating SeqRecord and Seq objects in order to speed up this
2168    conversion.
2169    """
2170    # Map unexpected chars to null
2171    mapping = "".join(
2172        [chr(0) for ascii in range(0, 59)]
2173        + [
2174            chr(64 + int(round(phred_quality_from_solexa(q))))
2175            for q in range(-5, 62 + 1)
2176        ]
2177        + [chr(0) for ascii in range(127, 256)]
2178    )
2179    assert len(mapping) == 256
2180    return _fastq_generic(in_file, out_file, mapping)
2181
2182
2183def _fastq_illumina_convert_fastq_solexa(in_file, out_file):
2184    """Fast Illumina 1.3+ FASTQ to Solexa FASTQ conversion (PRIVATE).
2185
2186    Avoids creating SeqRecord and Seq objects in order to speed up this
2187    conversion.
2188    """
2189    # Map unexpected chars to null
2190    mapping = "".join(
2191        [chr(0) for ascii in range(0, 64)]
2192        + [chr(64 + int(round(solexa_quality_from_phred(q)))) for q in range(0, 62 + 1)]
2193        + [chr(0) for ascii in range(127, 256)]
2194    )
2195    assert len(mapping) == 256
2196    return _fastq_generic(in_file, out_file, mapping)
2197
2198
2199def _fastq_convert_fasta(in_file, out_file):
2200    """Fast FASTQ to FASTA conversion (PRIVATE).
2201
2202    Avoids dealing with the FASTQ quality encoding, and creating SeqRecord and
2203    Seq objects in order to speed up this conversion.
2204
2205    NOTE - This does NOT check the characters used in the FASTQ quality string
2206    are valid!
2207    """
2208    # For real speed, don't even make SeqRecord and Seq objects!
2209    count = 0
2210    with as_handle(out_file, "w") as out_handle:
2211        for title, seq, qual in FastqGeneralIterator(in_file):
2212            count += 1
2213            out_handle.write(">%s\n" % title)
2214            # Do line wrapping
2215            for i in range(0, len(seq), 60):
2216                out_handle.write(seq[i : i + 60] + "\n")
2217    return count
2218
2219
2220def _fastq_convert_tab(in_file, out_file):
2221    """Fast FASTQ to simple tabbed conversion (PRIVATE).
2222
2223    Avoids dealing with the FASTQ quality encoding, and creating SeqRecord and
2224    Seq objects in order to speed up this conversion.
2225
2226    NOTE - This does NOT check the characters used in the FASTQ quality string
2227    are valid!
2228    """
2229    # For real speed, don't even make SeqRecord and Seq objects!
2230    count = 0
2231    with as_handle(out_file, "w") as out_handle:
2232        for title, seq, qual in FastqGeneralIterator(in_file):
2233            count += 1
2234            out_handle.write("%s\t%s\n" % (title.split(None, 1)[0], seq))
2235    return count
2236
2237
2238def _fastq_convert_qual(in_file, out_file, mapping):
2239    """FASTQ helper function for QUAL output (PRIVATE).
2240
2241    Mapping should be a dictionary mapping expected ASCII characters from the
2242    FASTQ quality string to PHRED quality scores (as strings).
2243    """
2244    # For real speed, don't even make SeqRecord and Seq objects!
2245    count = 0
2246    with as_handle(out_file, "w") as out_handle:
2247        for title, seq, qual in FastqGeneralIterator(in_file):
2248            count += 1
2249            out_handle.write(">%s\n" % title)
2250            # map the qual... note even with Sanger encoding max 2 digits
2251            try:
2252                qualities_strs = [mapping[ascii] for ascii in qual]
2253            except KeyError:
2254                raise ValueError("Invalid character in quality string") from None
2255            data = " ".join(qualities_strs)
2256            while len(data) > 60:
2257                # Know quality scores are either 1 or 2 digits, so there
2258                # must be a space in any three consecutive characters.
2259                if data[60] == " ":
2260                    out_handle.write(data[:60] + "\n")
2261                    data = data[61:]
2262                elif data[59] == " ":
2263                    out_handle.write(data[:59] + "\n")
2264                    data = data[60:]
2265                else:
2266                    assert data[58] == " ", "Internal logic failure in wrapping"
2267                    out_handle.write(data[:58] + "\n")
2268                    data = data[59:]
2269            out_handle.write(data + "\n")
2270    return count
2271
2272
2273def _fastq_sanger_convert_qual(in_file, out_file):
2274    """Fast Sanger FASTQ to QUAL conversion (PRIVATE)."""
2275    mapping = {chr(q + 33): str(q) for q in range(0, 93 + 1)}
2276    return _fastq_convert_qual(in_file, out_file, mapping)
2277
2278
2279def _fastq_solexa_convert_qual(in_file, out_file):
2280    """Fast Solexa FASTQ to QUAL conversion (PRIVATE)."""
2281    mapping = {
2282        chr(q + 64): str(int(round(phred_quality_from_solexa(q))))
2283        for q in range(-5, 62 + 1)
2284    }
2285    return _fastq_convert_qual(in_file, out_file, mapping)
2286
2287
2288def _fastq_illumina_convert_qual(in_file, out_file):
2289    """Fast Illumina 1.3+ FASTQ to QUAL conversion (PRIVATE)."""
2290    mapping = {chr(q + 64): str(q) for q in range(0, 62 + 1)}
2291    return _fastq_convert_qual(in_file, out_file, mapping)
2292
2293
2294if __name__ == "__main__":
2295    from Bio._utils import run_doctest
2296
2297    run_doctest(verbose=0)
2298