1# Copyright 2009-2020 by Peter Cock. All rights reserved. 2# 3# This file is part of the Biopython distribution and governed by your 4# choice of the "Biopython License Agreement" or the "BSD 3-Clause License". 5# Please see the LICENSE file that should have been included as part of this 6# package. 7"""Bio.SeqIO support for the FASTQ and QUAL file formats. 8 9Note that you are expected to use this code via the Bio.SeqIO interface, as 10shown below. 11 12The FASTQ file format is used frequently at the Wellcome Trust Sanger Institute 13to bundle a FASTA sequence and its PHRED quality data (integers between 0 and 1490). Rather than using a single FASTQ file, often paired FASTA and QUAL files 15are used containing the sequence and the quality information separately. 16 17The PHRED software reads DNA sequencing trace files, calls bases, and 18assigns a non-negative quality value to each called base using a logged 19transformation of the error probability, Q = -10 log10( Pe ), for example:: 20 21 Pe = 1.0, Q = 0 22 Pe = 0.1, Q = 10 23 Pe = 0.01, Q = 20 24 ... 25 Pe = 0.00000001, Q = 80 26 Pe = 0.000000001, Q = 90 27 28In typical raw sequence reads, the PHRED quality valuea will be from 0 to 40. 29In the QUAL format these quality values are held as space separated text in 30a FASTA like file format. In the FASTQ format, each quality values is encoded 31with a single ASCI character using chr(Q+33), meaning zero maps to the 32character "!" and for example 80 maps to "q". For the Sanger FASTQ standard 33the allowed range of PHRED scores is 0 to 93 inclusive. The sequences and 34quality are then stored in pairs in a FASTA like format. 35 36Unfortunately there is no official document describing the FASTQ file format, 37and worse, several related but different variants exist. For more details, 38please read this open access publication:: 39 40 The Sanger FASTQ file format for sequences with quality scores, and the 41 Solexa/Illumina FASTQ variants. 42 P.J.A.Cock (Biopython), C.J.Fields (BioPerl), N.Goto (BioRuby), 43 M.L.Heuer (BioJava) and P.M. Rice (EMBOSS). 44 Nucleic Acids Research 2010 38(6):1767-1771 45 https://doi.org/10.1093/nar/gkp1137 46 47The good news is that Roche 454 sequencers can output files in the QUAL format, 48and sensibly they use PHREP style scores like Sanger. Converting a pair of 49FASTA and QUAL files into a Sanger style FASTQ file is easy. To extract QUAL 50files from a Roche 454 SFF binary file, use the Roche off instrument command 51line tool "sffinfo" with the -q or -qual argument. You can extract a matching 52FASTA file using the -s or -seq argument instead. 53 54The bad news is that Solexa/Illumina did things differently - they have their 55own scoring system AND their own incompatible versions of the FASTQ format. 56Solexa/Illumina quality scores use Q = - 10 log10 ( Pe / (1-Pe) ), which can 57be negative. PHRED scores and Solexa scores are NOT interchangeable (but a 58reasonable mapping can be achieved between them, and they are approximately 59equal for higher quality reads). 60 61Confusingly early Solexa pipelines produced a FASTQ like file but using their 62own score mapping and an ASCII offset of 64. To make things worse, for the 63Solexa/Illumina pipeline 1.3 onwards, they introduced a third variant of the 64FASTQ file format, this time using PHRED scores (which is more consistent) but 65with an ASCII offset of 64. 66 67i.e. There are at least THREE different and INCOMPATIBLE variants of the FASTQ 68file format: The original Sanger PHRED standard, and two from Solexa/Illumina. 69 70The good news is that as of CASAVA version 1.8, Illumina sequencers will 71produce FASTQ files using the standard Sanger encoding. 72 73You are expected to use this module via the Bio.SeqIO functions, with the 74following format names: 75 76 - "qual" means simple quality files using PHRED scores (e.g. from Roche 454) 77 - "fastq" means Sanger style FASTQ files using PHRED scores and an ASCII 78 offset of 33 (e.g. from the NCBI Short Read Archive and Illumina 1.8+). 79 These can potentially hold PHRED scores from 0 to 93. 80 - "fastq-sanger" is an alias for "fastq". 81 - "fastq-solexa" means old Solexa (and also very early Illumina) style FASTQ 82 files, using Solexa scores with an ASCII offset 64. These can hold Solexa 83 scores from -5 to 62. 84 - "fastq-illumina" means newer Illumina 1.3 to 1.7 style FASTQ files, using 85 PHRED scores but with an ASCII offset 64, allowing PHRED scores from 0 86 to 62. 87 88We could potentially add support for "qual-solexa" meaning QUAL files which 89contain Solexa scores, but thus far there isn't any reason to use such files. 90 91For example, consider the following short FASTQ file:: 92 93 @EAS54_6_R1_2_1_413_324 94 CCCTTCTTGTCTTCAGCGTTTCTCC 95 + 96 ;;3;;;;;;;;;;;;7;;;;;;;88 97 @EAS54_6_R1_2_1_540_792 98 TTGGCAGGCCAAGGCCGATGGATCA 99 + 100 ;;;;;;;;;;;7;;;;;-;;;3;83 101 @EAS54_6_R1_2_1_443_348 102 GTTGCTTCTGGCGTGGGTGGGGGGG 103 + 104 ;;;;;;;;;;;9;7;;.7;393333 105 106This contains three reads of length 25. From the read length these were 107probably originally from an early Solexa/Illumina sequencer but this file 108follows the Sanger FASTQ convention (PHRED style qualities with an ASCII 109offet of 33). This means we can parse this file using Bio.SeqIO using 110"fastq" as the format name: 111 112>>> from Bio import SeqIO 113>>> for record in SeqIO.parse("Quality/example.fastq", "fastq"): 114... print("%s %s" % (record.id, record.seq)) 115EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC 116EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA 117EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG 118 119The qualities are held as a list of integers in each record's annotation: 120 121>>> print(record) 122ID: EAS54_6_R1_2_1_443_348 123Name: EAS54_6_R1_2_1_443_348 124Description: EAS54_6_R1_2_1_443_348 125Number of features: 0 126Per letter annotation for: phred_quality 127Seq('GTTGCTTCTGGCGTGGGTGGGGGGG') 128>>> print(record.letter_annotations["phred_quality"]) 129[26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 130 131You can use the SeqRecord format method to show this in the QUAL format: 132 133>>> print(record.format("qual")) 134>EAS54_6_R1_2_1_443_348 13526 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18 13624 18 18 18 18 137<BLANKLINE> 138 139Or go back to the FASTQ format, use "fastq" (or "fastq-sanger"): 140 141>>> print(record.format("fastq")) 142@EAS54_6_R1_2_1_443_348 143GTTGCTTCTGGCGTGGGTGGGGGGG 144+ 145;;;;;;;;;;;9;7;;.7;393333 146<BLANKLINE> 147 148Or, using the Illumina 1.3+ FASTQ encoding (PHRED values with an ASCII offset 149of 64): 150 151>>> print(record.format("fastq-illumina")) 152@EAS54_6_R1_2_1_443_348 153GTTGCTTCTGGCGTGGGTGGGGGGG 154+ 155ZZZZZZZZZZZXZVZZMVZRXRRRR 156<BLANKLINE> 157 158You can also get Biopython to convert the scores and show a Solexa style 159FASTQ file: 160 161>>> print(record.format("fastq-solexa")) 162@EAS54_6_R1_2_1_443_348 163GTTGCTTCTGGCGTGGGTGGGGGGG 164+ 165ZZZZZZZZZZZXZVZZMVZRXRRRR 166<BLANKLINE> 167 168Notice that this is actually the same output as above using "fastq-illumina" 169as the format! The reason for this is all these scores are high enough that 170the PHRED and Solexa scores are almost equal. The differences become apparent 171for poor quality reads. See the functions solexa_quality_from_phred and 172phred_quality_from_solexa for more details. 173 174If you wanted to trim your sequences (perhaps to remove low quality regions, 175or to remove a primer sequence), try slicing the SeqRecord objects. e.g. 176 177>>> sub_rec = record[5:15] 178>>> print(sub_rec) 179ID: EAS54_6_R1_2_1_443_348 180Name: EAS54_6_R1_2_1_443_348 181Description: EAS54_6_R1_2_1_443_348 182Number of features: 0 183Per letter annotation for: phred_quality 184Seq('TTCTGGCGTG') 185>>> print(sub_rec.letter_annotations["phred_quality"]) 186[26, 26, 26, 26, 26, 26, 24, 26, 22, 26] 187>>> print(sub_rec.format("fastq")) 188@EAS54_6_R1_2_1_443_348 189TTCTGGCGTG 190+ 191;;;;;;9;7; 192<BLANKLINE> 193 194If you wanted to, you could read in this FASTQ file, and save it as a QUAL file: 195 196>>> from Bio import SeqIO 197>>> record_iterator = SeqIO.parse("Quality/example.fastq", "fastq") 198>>> with open("Quality/temp.qual", "w") as out_handle: 199... SeqIO.write(record_iterator, out_handle, "qual") 2003 201 202You can of course read in a QUAL file, such as the one we just created: 203 204>>> from Bio import SeqIO 205>>> for record in SeqIO.parse("Quality/temp.qual", "qual"): 206... print("%s read of length %d" % (record.id, len(record.seq))) 207EAS54_6_R1_2_1_413_324 read of length 25 208EAS54_6_R1_2_1_540_792 read of length 25 209EAS54_6_R1_2_1_443_348 read of length 25 210 211Notice that QUAL files don't have a proper sequence present! But the quality 212information is there: 213 214>>> print(record) 215ID: EAS54_6_R1_2_1_443_348 216Name: EAS54_6_R1_2_1_443_348 217Description: EAS54_6_R1_2_1_443_348 218Number of features: 0 219Per letter annotation for: phred_quality 220Undefined sequence of length 25 221>>> print(record.letter_annotations["phred_quality"]) 222[26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 223 224Just to keep things tidy, if you are following this example yourself, you can 225delete this temporary file now: 226 227>>> import os 228>>> os.remove("Quality/temp.qual") 229 230Sometimes you won't have a FASTQ file, but rather just a pair of FASTA and QUAL 231files. Because the Bio.SeqIO system is designed for reading single files, you 232would have to read the two in separately and then combine the data. However, 233since this is such a common thing to want to do, there is a helper iterator 234defined in this module that does this for you - PairedFastaQualIterator. 235 236Alternatively, if you have enough RAM to hold all the records in memory at once, 237then a simple dictionary approach would work: 238 239>>> from Bio import SeqIO 240>>> reads = SeqIO.to_dict(SeqIO.parse("Quality/example.fasta", "fasta")) 241>>> for rec in SeqIO.parse("Quality/example.qual", "qual"): 242... reads[rec.id].letter_annotations["phred_quality"]=rec.letter_annotations["phred_quality"] 243 244You can then access any record by its key, and get both the sequence and the 245quality scores. 246 247>>> print(reads["EAS54_6_R1_2_1_540_792"].format("fastq")) 248@EAS54_6_R1_2_1_540_792 249TTGGCAGGCCAAGGCCGATGGATCA 250+ 251;;;;;;;;;;;7;;;;;-;;;3;83 252<BLANKLINE> 253 254It is important that you explicitly tell Bio.SeqIO which FASTQ variant you are 255using ("fastq" or "fastq-sanger" for the Sanger standard using PHRED values, 256"fastq-solexa" for the original Solexa/Illumina variant, or "fastq-illumina" 257for the more recent variant), as this cannot be detected reliably 258automatically. 259 260To illustrate this problem, let's consider an artificial example: 261 262>>> from Bio.Seq import Seq 263>>> from Bio.SeqRecord import SeqRecord 264>>> test = SeqRecord(Seq("NACGTACGTA"), id="Test", description="Made up!") 265>>> print(test.format("fasta")) 266>Test Made up! 267NACGTACGTA 268<BLANKLINE> 269>>> print(test.format("fastq")) 270Traceback (most recent call last): 271 ... 272ValueError: No suitable quality scores found in letter_annotations of SeqRecord (id=Test). 273 274We created a sample SeqRecord, and can show it in FASTA format - but for QUAL 275or FASTQ format we need to provide some quality scores. These are held as a 276list of integers (one for each base) in the letter_annotations dictionary: 277 278>>> test.letter_annotations["phred_quality"] = [0, 1, 2, 3, 4, 5, 10, 20, 30, 40] 279>>> print(test.format("qual")) 280>Test Made up! 2810 1 2 3 4 5 10 20 30 40 282<BLANKLINE> 283>>> print(test.format("fastq")) 284@Test Made up! 285NACGTACGTA 286+ 287!"#$%&+5?I 288<BLANKLINE> 289 290We can check this FASTQ encoding - the first PHRED quality was zero, and this 291mapped to a exclamation mark, while the final score was 40 and this mapped to 292the letter "I": 293 294>>> ord('!') - 33 2950 296>>> ord('I') - 33 29740 298>>> [ord(letter)-33 for letter in '!"#$%&+5?I'] 299[0, 1, 2, 3, 4, 5, 10, 20, 30, 40] 300 301Similarly, we could produce an Illumina 1.3 to 1.7 style FASTQ file using PHRED 302scores with an offset of 64: 303 304>>> print(test.format("fastq-illumina")) 305@Test Made up! 306NACGTACGTA 307+ 308@ABCDEJT^h 309<BLANKLINE> 310 311And we can check this too - the first PHRED score was zero, and this mapped to 312"@", while the final score was 40 and this mapped to "h": 313 314>>> ord("@") - 64 3150 316>>> ord("h") - 64 31740 318>>> [ord(letter)-64 for letter in "@ABCDEJT^h"] 319[0, 1, 2, 3, 4, 5, 10, 20, 30, 40] 320 321Notice how different the standard Sanger FASTQ and the Illumina 1.3 to 1.7 style 322FASTQ files look for the same data! Then we have the older Solexa/Illumina 323format to consider which encodes Solexa scores instead of PHRED scores. 324 325First let's see what Biopython says if we convert the PHRED scores into Solexa 326scores (rounding to one decimal place): 327 328>>> for q in [0, 1, 2, 3, 4, 5, 10, 20, 30, 40]: 329... print("PHRED %i maps to Solexa %0.1f" % (q, solexa_quality_from_phred(q))) 330PHRED 0 maps to Solexa -5.0 331PHRED 1 maps to Solexa -5.0 332PHRED 2 maps to Solexa -2.3 333PHRED 3 maps to Solexa -0.0 334PHRED 4 maps to Solexa 1.8 335PHRED 5 maps to Solexa 3.3 336PHRED 10 maps to Solexa 9.5 337PHRED 20 maps to Solexa 20.0 338PHRED 30 maps to Solexa 30.0 339PHRED 40 maps to Solexa 40.0 340 341Now here is the record using the old Solexa style FASTQ file: 342 343>>> print(test.format("fastq-solexa")) 344@Test Made up! 345NACGTACGTA 346+ 347;;>@BCJT^h 348<BLANKLINE> 349 350Again, this is using an ASCII offset of 64, so we can check the Solexa scores: 351 352>>> [ord(letter)-64 for letter in ";;>@BCJT^h"] 353[-5, -5, -2, 0, 2, 3, 10, 20, 30, 40] 354 355This explains why the last few letters of this FASTQ output matched that using 356the Illumina 1.3 to 1.7 format - high quality PHRED scores and Solexa scores 357are approximately equal. 358 359""" 360import warnings 361 362from math import log 363 364from Bio import BiopythonParserWarning 365from Bio import BiopythonWarning 366from Bio import StreamModeError 367from Bio.File import as_handle 368from Bio.Seq import Seq 369from Bio.SeqRecord import SeqRecord 370 371from .Interfaces import _clean 372from .Interfaces import _get_seq_string 373from .Interfaces import SequenceIterator 374from .Interfaces import SequenceWriter 375 376 377# define score offsets. See discussion for differences between Sanger and 378# Solexa offsets. 379SANGER_SCORE_OFFSET = 33 380SOLEXA_SCORE_OFFSET = 64 381 382 383def solexa_quality_from_phred(phred_quality): 384 """Covert a PHRED quality (range 0 to about 90) to a Solexa quality. 385 386 PHRED and Solexa quality scores are both log transformations of a 387 probality of error (high score = low probability of error). This function 388 takes a PHRED score, transforms it back to a probability of error, and 389 then re-expresses it as a Solexa score. This assumes the error estimates 390 are equivalent. 391 392 How does this work exactly? Well the PHRED quality is minus ten times the 393 base ten logarithm of the probability of error:: 394 395 phred_quality = -10*log(error,10) 396 397 Therefore, turning this round:: 398 399 error = 10 ** (- phred_quality / 10) 400 401 Now, Solexa qualities use a different log transformation:: 402 403 solexa_quality = -10*log(error/(1-error),10) 404 405 After substitution and a little manipulation we get:: 406 407 solexa_quality = 10*log(10**(phred_quality/10.0) - 1, 10) 408 409 However, real Solexa files use a minimum quality of -5. This does have a 410 good reason - a random base call would be correct 25% of the time, 411 and thus have a probability of error of 0.75, which gives 1.25 as the PHRED 412 quality, or -4.77 as the Solexa quality. Thus (after rounding), a random 413 nucleotide read would have a PHRED quality of 1, or a Solexa quality of -5. 414 415 Taken literally, this logarithic formula would map a PHRED quality of zero 416 to a Solexa quality of minus infinity. Of course, taken literally, a PHRED 417 score of zero means a probability of error of one (i.e. the base call is 418 definitely wrong), which is worse than random! In practice, a PHRED quality 419 of zero usually means a default value, or perhaps random - and therefore 420 mapping it to the minimum Solexa score of -5 is reasonable. 421 422 In conclusion, we follow EMBOSS, and take this logarithmic formula but also 423 apply a minimum value of -5.0 for the Solexa quality, and also map a PHRED 424 quality of zero to -5.0 as well. 425 426 Note this function will return a floating point number, it is up to you to 427 round this to the nearest integer if appropriate. e.g. 428 429 >>> print("%0.2f" % round(solexa_quality_from_phred(80), 2)) 430 80.00 431 >>> print("%0.2f" % round(solexa_quality_from_phred(50), 2)) 432 50.00 433 >>> print("%0.2f" % round(solexa_quality_from_phred(20), 2)) 434 19.96 435 >>> print("%0.2f" % round(solexa_quality_from_phred(10), 2)) 436 9.54 437 >>> print("%0.2f" % round(solexa_quality_from_phred(5), 2)) 438 3.35 439 >>> print("%0.2f" % round(solexa_quality_from_phred(4), 2)) 440 1.80 441 >>> print("%0.2f" % round(solexa_quality_from_phred(3), 2)) 442 -0.02 443 >>> print("%0.2f" % round(solexa_quality_from_phred(2), 2)) 444 -2.33 445 >>> print("%0.2f" % round(solexa_quality_from_phred(1), 2)) 446 -5.00 447 >>> print("%0.2f" % round(solexa_quality_from_phred(0), 2)) 448 -5.00 449 450 Notice that for high quality reads PHRED and Solexa scores are numerically 451 equal. The differences are important for poor quality reads, where PHRED 452 has a minimum of zero but Solexa scores can be negative. 453 454 Finally, as a special case where None is used for a "missing value", None 455 is returned: 456 457 >>> print(solexa_quality_from_phred(None)) 458 None 459 """ 460 if phred_quality is None: 461 # Assume None is used as some kind of NULL or NA value; return None 462 # e.g. Bio.SeqIO gives Ace contig gaps a quality of None. 463 return None 464 elif phred_quality > 0: 465 # Solexa uses a minimum value of -5, which after rounding matches a 466 # random nucleotide base call. 467 return max(-5.0, 10 * log(10 ** (phred_quality / 10.0) - 1, 10)) 468 elif phred_quality == 0: 469 # Special case, map to -5 as discussed in the docstring 470 return -5.0 471 else: 472 raise ValueError( 473 "PHRED qualities must be positive (or zero), not %r" % phred_quality 474 ) 475 476 477def phred_quality_from_solexa(solexa_quality): 478 """Convert a Solexa quality (which can be negative) to a PHRED quality. 479 480 PHRED and Solexa quality scores are both log transformations of a 481 probality of error (high score = low probability of error). This function 482 takes a Solexa score, transforms it back to a probability of error, and 483 then re-expresses it as a PHRED score. This assumes the error estimates 484 are equivalent. 485 486 The underlying formulas are given in the documentation for the sister 487 function solexa_quality_from_phred, in this case the operation is:: 488 489 phred_quality = 10*log(10**(solexa_quality/10.0) + 1, 10) 490 491 This will return a floating point number, it is up to you to round this to 492 the nearest integer if appropriate. e.g. 493 494 >>> print("%0.2f" % round(phred_quality_from_solexa(80), 2)) 495 80.00 496 >>> print("%0.2f" % round(phred_quality_from_solexa(20), 2)) 497 20.04 498 >>> print("%0.2f" % round(phred_quality_from_solexa(10), 2)) 499 10.41 500 >>> print("%0.2f" % round(phred_quality_from_solexa(0), 2)) 501 3.01 502 >>> print("%0.2f" % round(phred_quality_from_solexa(-5), 2)) 503 1.19 504 505 Note that a solexa_quality less then -5 is not expected, will trigger a 506 warning, but will still be converted as per the logarithmic mapping 507 (giving a number between 0 and 1.19 back). 508 509 As a special case where None is used for a "missing value", None is 510 returned: 511 512 >>> print(phred_quality_from_solexa(None)) 513 None 514 """ 515 if solexa_quality is None: 516 # Assume None is used as some kind of NULL or NA value; return None 517 return None 518 if solexa_quality < -5: 519 warnings.warn( 520 "Solexa quality less than -5 passed, %r" % solexa_quality, BiopythonWarning 521 ) 522 return 10 * log(10 ** (solexa_quality / 10.0) + 1, 10) 523 524 525def _get_phred_quality(record): 526 """Extract PHRED qualities from a SeqRecord's letter_annotations (PRIVATE). 527 528 If there are no PHRED qualities, but there are Solexa qualities, those are 529 used instead after conversion. 530 """ 531 try: 532 return record.letter_annotations["phred_quality"] 533 except KeyError: 534 pass 535 try: 536 return [ 537 phred_quality_from_solexa(q) 538 for q in record.letter_annotations["solexa_quality"] 539 ] 540 except KeyError: 541 raise ValueError( 542 "No suitable quality scores found in " 543 "letter_annotations of SeqRecord (id=%s)." % record.id 544 ) from None 545 546 547# Only map 0 to 93, we need to give a warning on truncating at 93 548_phred_to_sanger_quality_str = { 549 qp: chr(min(126, qp + SANGER_SCORE_OFFSET)) for qp in range(0, 93 + 1) 550} 551# Only map -5 to 93, we need to give a warning on truncating at 93 552_solexa_to_sanger_quality_str = { 553 qs: chr(min(126, int(round(phred_quality_from_solexa(qs)) + SANGER_SCORE_OFFSET))) 554 for qs in range(-5, 93 + 1) 555} 556 557 558def _get_sanger_quality_str(record): 559 """Return a Sanger FASTQ encoded quality string (PRIVATE). 560 561 >>> from Bio.Seq import Seq 562 >>> from Bio.SeqRecord import SeqRecord 563 >>> r = SeqRecord(Seq("ACGTAN"), id="Test", 564 ... letter_annotations = {"phred_quality":[50, 40, 30, 20, 10, 0]}) 565 >>> _get_sanger_quality_str(r) 566 'SI?5+!' 567 568 If as in the above example (or indeed a SeqRecord parser with Bio.SeqIO), 569 the PHRED qualities are integers, this function is able to use a very fast 570 pre-cached mapping. However, if they are floats which differ slightly, then 571 it has to do the appropriate rounding - which is slower: 572 573 >>> r2 = SeqRecord(Seq("ACGTAN"), id="Test2", 574 ... letter_annotations = {"phred_quality":[50.0, 40.05, 29.99, 20, 9.55, 0.01]}) 575 >>> _get_sanger_quality_str(r2) 576 'SI?5+!' 577 578 If your scores include a None value, this raises an exception: 579 580 >>> r3 = SeqRecord(Seq("ACGTAN"), id="Test3", 581 ... letter_annotations = {"phred_quality":[50, 40, 30, 20, 10, None]}) 582 >>> _get_sanger_quality_str(r3) 583 Traceback (most recent call last): 584 ... 585 TypeError: A quality value of None was found 586 587 If (strangely) your record has both PHRED and Solexa scores, then the PHRED 588 scores are used in preference: 589 590 >>> r4 = SeqRecord(Seq("ACGTAN"), id="Test4", 591 ... letter_annotations = {"phred_quality":[50, 40, 30, 20, 10, 0], 592 ... "solexa_quality":[-5, -4, 0, None, 0, 40]}) 593 >>> _get_sanger_quality_str(r4) 594 'SI?5+!' 595 596 If there are no PHRED scores, but there are Solexa scores, these are used 597 instead (after the appropriate conversion): 598 599 >>> r5 = SeqRecord(Seq("ACGTAN"), id="Test5", 600 ... letter_annotations = {"solexa_quality":[40, 30, 20, 10, 0, -5]}) 601 >>> _get_sanger_quality_str(r5) 602 'I?5+$"' 603 604 Again, integer Solexa scores can be looked up in a pre-cached mapping making 605 this very fast. You can still use approximate floating point scores: 606 607 >>> r6 = SeqRecord(Seq("ACGTAN"), id="Test6", 608 ... letter_annotations = {"solexa_quality":[40.1, 29.7, 20.01, 10, 0.0, -4.9]}) 609 >>> _get_sanger_quality_str(r6) 610 'I?5+$"' 611 612 Notice that due to the limited range of printable ASCII characters, a 613 PHRED quality of 93 is the maximum that can be held in an Illumina FASTQ 614 file (using ASCII 126, the tilde). This function will issue a warning 615 in this situation. 616 """ 617 # TODO - This functions works and is fast, but it is also ugly 618 # and there is considerable repetition of code for the other 619 # two FASTQ variants. 620 try: 621 # These take priority (in case both Solexa and PHRED scores found) 622 qualities = record.letter_annotations["phred_quality"] 623 except KeyError: 624 # Fall back on solexa scores... 625 pass 626 else: 627 # Try and use the precomputed mapping: 628 try: 629 return "".join(_phred_to_sanger_quality_str[qp] for qp in qualities) 630 except KeyError: 631 # Could be a float, or a None in the list, or a high value. 632 pass 633 if None in qualities: 634 raise TypeError("A quality value of None was found") 635 if max(qualities) >= 93.5: 636 warnings.warn( 637 "Data loss - max PHRED quality 93 in Sanger FASTQ", BiopythonWarning 638 ) 639 # This will apply the truncation at 93, giving max ASCII 126 640 return "".join( 641 chr(min(126, int(round(qp)) + SANGER_SCORE_OFFSET)) for qp in qualities 642 ) 643 # Fall back on the Solexa scores... 644 try: 645 qualities = record.letter_annotations["solexa_quality"] 646 except KeyError: 647 raise ValueError( 648 "No suitable quality scores found in " 649 "letter_annotations of SeqRecord (id=%s)." % record.id 650 ) from None 651 # Try and use the precomputed mapping: 652 try: 653 return "".join(_solexa_to_sanger_quality_str[qs] for qs in qualities) 654 except KeyError: 655 # Either no PHRED scores, or something odd like a float or None 656 pass 657 if None in qualities: 658 raise TypeError("A quality value of None was found") 659 # Must do this the slow way, first converting the PHRED scores into 660 # Solexa scores: 661 if max(qualities) >= 93.5: 662 warnings.warn( 663 "Data loss - max PHRED quality 93 in Sanger FASTQ", BiopythonWarning 664 ) 665 # This will apply the truncation at 93, giving max ASCII 126 666 return "".join( 667 chr(min(126, int(round(phred_quality_from_solexa(qs))) + SANGER_SCORE_OFFSET)) 668 for qs in qualities 669 ) 670 671 672# Only map 0 to 62, we need to give a warning on truncating at 62 673assert 62 + SOLEXA_SCORE_OFFSET == 126 674_phred_to_illumina_quality_str = { 675 qp: chr(qp + SOLEXA_SCORE_OFFSET) for qp in range(0, 62 + 1) 676} 677# Only map -5 to 62, we need to give a warning on truncating at 62 678_solexa_to_illumina_quality_str = { 679 qs: chr(int(round(phred_quality_from_solexa(qs))) + SOLEXA_SCORE_OFFSET) 680 for qs in range(-5, 62 + 1) 681} 682 683 684def _get_illumina_quality_str(record): 685 """Return an Illumina 1.3 to 1.7 FASTQ encoded quality string (PRIVATE). 686 687 Notice that due to the limited range of printable ASCII characters, a 688 PHRED quality of 62 is the maximum that can be held in an Illumina FASTQ 689 file (using ASCII 126, the tilde). This function will issue a warning 690 in this situation. 691 """ 692 # TODO - This functions works and is fast, but it is also ugly 693 # and there is considerable repetition of code for the other 694 # two FASTQ variants. 695 try: 696 # These take priority (in case both Solexa and PHRED scores found) 697 qualities = record.letter_annotations["phred_quality"] 698 except KeyError: 699 # Fall back on solexa scores... 700 pass 701 else: 702 # Try and use the precomputed mapping: 703 try: 704 return "".join(_phred_to_illumina_quality_str[qp] for qp in qualities) 705 except KeyError: 706 # Could be a float, or a None in the list, or a high value. 707 pass 708 if None in qualities: 709 raise TypeError("A quality value of None was found") 710 if max(qualities) >= 62.5: 711 warnings.warn( 712 "Data loss - max PHRED quality 62 in Illumina FASTQ", BiopythonWarning 713 ) 714 # This will apply the truncation at 62, giving max ASCII 126 715 return "".join( 716 chr(min(126, int(round(qp)) + SOLEXA_SCORE_OFFSET)) for qp in qualities 717 ) 718 # Fall back on the Solexa scores... 719 try: 720 qualities = record.letter_annotations["solexa_quality"] 721 except KeyError: 722 raise ValueError( 723 "No suitable quality scores found in " 724 "letter_annotations of SeqRecord (id=%s)." % record.id 725 ) from None 726 # Try and use the precomputed mapping: 727 try: 728 return "".join(_solexa_to_illumina_quality_str[qs] for qs in qualities) 729 except KeyError: 730 # Either no PHRED scores, or something odd like a float or None 731 pass 732 if None in qualities: 733 raise TypeError("A quality value of None was found") 734 # Must do this the slow way, first converting the PHRED scores into 735 # Solexa scores: 736 if max(qualities) >= 62.5: 737 warnings.warn( 738 "Data loss - max PHRED quality 62 in Illumina FASTQ", BiopythonWarning 739 ) 740 # This will apply the truncation at 62, giving max ASCII 126 741 return "".join( 742 chr(min(126, int(round(phred_quality_from_solexa(qs))) + SOLEXA_SCORE_OFFSET)) 743 for qs in qualities 744 ) 745 746 747# Only map 0 to 62, we need to give a warning on truncating at 62 748assert 62 + SOLEXA_SCORE_OFFSET == 126 749_solexa_to_solexa_quality_str = { 750 qs: chr(min(126, qs + SOLEXA_SCORE_OFFSET)) for qs in range(-5, 62 + 1) 751} 752# Only map -5 to 62, we need to give a warning on truncating at 62 753_phred_to_solexa_quality_str = { 754 qp: chr(min(126, int(round(solexa_quality_from_phred(qp))) + SOLEXA_SCORE_OFFSET)) 755 for qp in range(0, 62 + 1) 756} 757 758 759def _get_solexa_quality_str(record): 760 """Return a Solexa FASTQ encoded quality string (PRIVATE). 761 762 Notice that due to the limited range of printable ASCII characters, a 763 Solexa quality of 62 is the maximum that can be held in a Solexa FASTQ 764 file (using ASCII 126, the tilde). This function will issue a warning 765 in this situation. 766 """ 767 # TODO - This functions works and is fast, but it is also ugly 768 # and there is considerable repetition of code for the other 769 # two FASTQ variants. 770 try: 771 # These take priority (in case both Solexa and PHRED scores found) 772 qualities = record.letter_annotations["solexa_quality"] 773 except KeyError: 774 # Fall back on PHRED scores... 775 pass 776 else: 777 # Try and use the precomputed mapping: 778 try: 779 return "".join(_solexa_to_solexa_quality_str[qs] for qs in qualities) 780 except KeyError: 781 # Could be a float, or a None in the list, or a high value. 782 pass 783 if None in qualities: 784 raise TypeError("A quality value of None was found") 785 if max(qualities) >= 62.5: 786 warnings.warn( 787 "Data loss - max Solexa quality 62 in Solexa FASTQ", BiopythonWarning 788 ) 789 # This will apply the truncation at 62, giving max ASCII 126 790 return "".join( 791 chr(min(126, int(round(qs)) + SOLEXA_SCORE_OFFSET)) for qs in qualities 792 ) 793 # Fall back on the PHRED scores... 794 try: 795 qualities = record.letter_annotations["phred_quality"] 796 except KeyError: 797 raise ValueError( 798 "No suitable quality scores found in " 799 "letter_annotations of SeqRecord (id=%s)." % record.id 800 ) from None 801 # Try and use the precomputed mapping: 802 try: 803 return "".join(_phred_to_solexa_quality_str[qp] for qp in qualities) 804 except KeyError: 805 # Either no PHRED scores, or something odd like a float or None 806 # or too big to be in the cache 807 pass 808 if None in qualities: 809 raise TypeError("A quality value of None was found") 810 # Must do this the slow way, first converting the PHRED scores into 811 # Solexa scores: 812 if max(qualities) >= 62.5: 813 warnings.warn( 814 "Data loss - max Solexa quality 62 in Solexa FASTQ", BiopythonWarning 815 ) 816 return "".join( 817 chr(min(126, int(round(solexa_quality_from_phred(qp))) + SOLEXA_SCORE_OFFSET)) 818 for qp in qualities 819 ) 820 821 822# TODO - Default to nucleotide or even DNA? 823def FastqGeneralIterator(source): 824 """Iterate over Fastq records as string tuples (not as SeqRecord objects). 825 826 Arguments: 827 - source - input stream opened in text mode, or a path to a file 828 829 This code does not try to interpret the quality string numerically. It 830 just returns tuples of the title, sequence and quality as strings. For 831 the sequence and quality, any whitespace (such as new lines) is removed. 832 833 Our SeqRecord based FASTQ iterators call this function internally, and then 834 turn the strings into a SeqRecord objects, mapping the quality string into 835 a list of numerical scores. If you want to do a custom quality mapping, 836 then you might consider calling this function directly. 837 838 For parsing FASTQ files, the title string from the "@" line at the start 839 of each record can optionally be omitted on the "+" lines. If it is 840 repeated, it must be identical. 841 842 The sequence string and the quality string can optionally be split over 843 multiple lines, although several sources discourage this. In comparison, 844 for the FASTA file format line breaks between 60 and 80 characters are 845 the norm. 846 847 **WARNING** - Because the "@" character can appear in the quality string, 848 this can cause problems as this is also the marker for the start of 849 a new sequence. In fact, the "+" sign can also appear as well. Some 850 sources recommended having no line breaks in the quality to avoid this, 851 but even that is not enough, consider this example:: 852 853 @071113_EAS56_0053:1:1:998:236 854 TTTCTTGCCCCCATAGACTGAGACCTTCCCTAAATA 855 +071113_EAS56_0053:1:1:998:236 856 IIIIIIIIIIIIIIIIIIIIIIIIIIIIICII+III 857 @071113_EAS56_0053:1:1:182:712 858 ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG 859 + 860 @IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+ 861 @071113_EAS56_0053:1:1:153:10 862 TGTTCTGAAGGAAGGTGTGCGTGCGTGTGTGTGTGT 863 + 864 IIIIIIIIIIIICIIGIIIII>IAIIIE65I=II:6 865 @071113_EAS56_0053:1:3:990:501 866 TGGGAGGTTTTATGTGGA 867 AAGCAGCAATGTACAAGA 868 + 869 IIIIIII.IIIIII1@44 870 @-7.%<&+/$/%4(++(% 871 872 This is four PHRED encoded FASTQ entries originally from an NCBI source 873 (given the read length of 36, these are probably Solexa Illumina reads where 874 the quality has been mapped onto the PHRED values). 875 876 This example has been edited to illustrate some of the nasty things allowed 877 in the FASTQ format. Firstly, on the "+" lines most but not all of the 878 (redundant) identifiers are omitted. In real files it is likely that all or 879 none of these extra identifiers will be present. 880 881 Secondly, while the first three sequences have been shown without line 882 breaks, the last has been split over multiple lines. In real files any line 883 breaks are likely to be consistent. 884 885 Thirdly, some of the quality string lines start with an "@" character. For 886 the second record this is unavoidable. However for the fourth sequence this 887 only happens because its quality string is split over two lines. A naive 888 parser could wrongly treat any line starting with an "@" as the beginning of 889 a new sequence! This code copes with this possible ambiguity by keeping 890 track of the length of the sequence which gives the expected length of the 891 quality string. 892 893 Using this tricky example file as input, this short bit of code demonstrates 894 what this parsing function would return: 895 896 >>> with open("Quality/tricky.fastq") as handle: 897 ... for (title, sequence, quality) in FastqGeneralIterator(handle): 898 ... print(title) 899 ... print("%s %s" % (sequence, quality)) 900 ... 901 071113_EAS56_0053:1:1:998:236 902 TTTCTTGCCCCCATAGACTGAGACCTTCCCTAAATA IIIIIIIIIIIIIIIIIIIIIIIIIIIIICII+III 903 071113_EAS56_0053:1:1:182:712 904 ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG @IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+ 905 071113_EAS56_0053:1:1:153:10 906 TGTTCTGAAGGAAGGTGTGCGTGCGTGTGTGTGTGT IIIIIIIIIIIICIIGIIIII>IAIIIE65I=II:6 907 071113_EAS56_0053:1:3:990:501 908 TGGGAGGTTTTATGTGGAAAGCAGCAATGTACAAGA IIIIIII.IIIIII1@44@-7.%<&+/$/%4(++(% 909 910 Finally we note that some sources state that the quality string should 911 start with "!" (which using the PHRED mapping means the first letter always 912 has a quality score of zero). This rather restrictive rule is not widely 913 observed, so is therefore ignored here. One plus point about this "!" rule 914 is that (provided there are no line breaks in the quality sequence) it 915 would prevent the above problem with the "@" character. 916 """ 917 try: 918 handle = open(source) 919 except TypeError: 920 handle = source 921 if handle.read(0) != "": 922 raise StreamModeError("Fastq files must be opened in text mode") from None 923 try: 924 try: 925 line = next(handle) 926 except StopIteration: 927 return # Premature end of file, or just empty? 928 929 while True: 930 if line[0] != "@": 931 raise ValueError( 932 "Records in Fastq files should start with '@' character" 933 ) 934 title_line = line[1:].rstrip() 935 seq_string = "" 936 # There will now be one or more sequence lines; keep going until we 937 # find the "+" marking the quality line: 938 for line in handle: 939 if line[0] == "+": 940 break 941 seq_string += line.rstrip() 942 else: 943 if seq_string: 944 raise ValueError("End of file without quality information.") 945 else: 946 raise ValueError("Unexpected end of file") 947 # The title here is optional, but if present must match! 948 second_title = line[1:].rstrip() 949 if second_title and second_title != title_line: 950 raise ValueError("Sequence and quality captions differ.") 951 # This is going to slow things down a little, but assuming 952 # this isn't allowed we should try and catch it here: 953 if " " in seq_string or "\t" in seq_string: 954 raise ValueError("Whitespace is not allowed in the sequence.") 955 seq_len = len(seq_string) 956 957 # There will now be at least one line of quality data, followed by 958 # another sequence, or EOF 959 line = None 960 quality_string = "" 961 for line in handle: 962 if line[0] == "@": 963 # This COULD be the start of a new sequence. However, it MAY just 964 # be a line of quality data which starts with a "@" character. We 965 # should be able to check this by looking at the sequence length 966 # and the amount of quality data found so far. 967 if len(quality_string) >= seq_len: 968 # We expect it to be equal if this is the start of a new record. 969 # If the quality data is longer, we'll raise an error below. 970 break 971 # Continue - its just some (more) quality data. 972 quality_string += line.rstrip() 973 else: 974 if line is None: 975 raise ValueError("Unexpected end of file") 976 line = None 977 978 if seq_len != len(quality_string): 979 raise ValueError( 980 "Lengths of sequence and quality values differs for %s (%i and %i)." 981 % (title_line, seq_len, len(quality_string)) 982 ) 983 984 # Return the record and then continue... 985 yield (title_line, seq_string, quality_string) 986 987 if line is None: 988 break 989 finally: 990 if handle is not source: 991 handle.close() 992 993 994class FastqPhredIterator(SequenceIterator): 995 """Parser for FASTQ files.""" 996 997 def __init__(self, source, alphabet=None, title2ids=None): 998 """Iterate over FASTQ records as SeqRecord objects. 999 1000 Arguments: 1001 - source - input stream opened in text mode, or a path to a file 1002 - alphabet - optional alphabet, no longer used. Leave as None. 1003 - title2ids - A function that, when given the title line from the FASTQ 1004 file (without the beginning >), will return the id, name and 1005 description (in that order) for the record as a tuple of strings. 1006 If this is not given, then the entire title line will be used as 1007 the description, and the first word as the id and name. 1008 1009 Note that use of title2ids matches that of Bio.SeqIO.FastaIO. 1010 1011 For each sequence in a (Sanger style) FASTQ file there is a matching string 1012 encoding the PHRED qualities (integers between 0 and about 90) using ASCII 1013 values with an offset of 33. 1014 1015 For example, consider a file containing three short reads:: 1016 1017 @EAS54_6_R1_2_1_413_324 1018 CCCTTCTTGTCTTCAGCGTTTCTCC 1019 + 1020 ;;3;;;;;;;;;;;;7;;;;;;;88 1021 @EAS54_6_R1_2_1_540_792 1022 TTGGCAGGCCAAGGCCGATGGATCA 1023 + 1024 ;;;;;;;;;;;7;;;;;-;;;3;83 1025 @EAS54_6_R1_2_1_443_348 1026 GTTGCTTCTGGCGTGGGTGGGGGGG 1027 + 1028 ;;;;;;;;;;;9;7;;.7;393333 1029 1030 For each sequence (e.g. "CCCTTCTTGTCTTCAGCGTTTCTCC") there is a matching 1031 string encoding the PHRED qualities using a ASCII values with an offset of 1032 33 (e.g. ";;3;;;;;;;;;;;;7;;;;;;;88"). 1033 1034 Using this module directly you might run: 1035 1036 >>> with open("Quality/example.fastq") as handle: 1037 ... for record in FastqPhredIterator(handle): 1038 ... print("%s %s" % (record.id, record.seq)) 1039 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC 1040 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA 1041 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG 1042 1043 Typically however, you would call this via Bio.SeqIO instead with "fastq" 1044 (or "fastq-sanger") as the format: 1045 1046 >>> from Bio import SeqIO 1047 >>> with open("Quality/example.fastq") as handle: 1048 ... for record in SeqIO.parse(handle, "fastq"): 1049 ... print("%s %s" % (record.id, record.seq)) 1050 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC 1051 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA 1052 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG 1053 1054 If you want to look at the qualities, they are record in each record's 1055 per-letter-annotation dictionary as a simple list of integers: 1056 1057 >>> print(record.letter_annotations["phred_quality"]) 1058 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 1059 1060 """ 1061 if alphabet is not None: 1062 raise ValueError("The alphabet argument is no longer supported") 1063 self.title2ids = title2ids 1064 super().__init__(source, mode="t", fmt="Fastq") 1065 1066 def parse(self, handle): 1067 """Start parsing the file, and return a SeqRecord generator.""" 1068 records = self.iterate(handle) 1069 return records 1070 1071 def iterate(self, handle): 1072 """Parse the file and generate SeqRecord objects.""" 1073 title2ids = self.title2ids 1074 assert SANGER_SCORE_OFFSET == ord("!") 1075 # Originally, I used a list expression for each record: 1076 # 1077 # qualities = [ord(letter)-SANGER_SCORE_OFFSET for letter in quality_string] 1078 # 1079 # Precomputing is faster, perhaps partly by avoiding the subtractions. 1080 q_mapping = { 1081 chr(letter): letter - SANGER_SCORE_OFFSET 1082 for letter in range(SANGER_SCORE_OFFSET, 94 + SANGER_SCORE_OFFSET) 1083 } 1084 1085 for title_line, seq_string, quality_string in FastqGeneralIterator(handle): 1086 if title2ids: 1087 id, name, descr = title2ids(title_line) 1088 else: 1089 descr = title_line 1090 id = descr.split()[0] 1091 name = id 1092 record = SeqRecord(Seq(seq_string), id=id, name=name, description=descr) 1093 try: 1094 qualities = [q_mapping[letter] for letter in quality_string] 1095 except KeyError: 1096 raise ValueError("Invalid character in quality string") from None 1097 # For speed, will now use a dirty trick to speed up assigning the 1098 # qualities. We do this to bypass the length check imposed by the 1099 # per-letter-annotations restricted dict (as this has already been 1100 # checked by FastqGeneralIterator). This is equivalent to: 1101 # record.letter_annotations["phred_quality"] = qualities 1102 dict.__setitem__(record._per_letter_annotations, "phred_quality", qualities) 1103 yield record 1104 1105 1106def FastqSolexaIterator(source, alphabet=None, title2ids=None): 1107 r"""Parse old Solexa/Illumina FASTQ like files (which differ in the quality mapping). 1108 1109 The optional arguments are the same as those for the FastqPhredIterator. 1110 1111 For each sequence in Solexa/Illumina FASTQ files there is a matching string 1112 encoding the Solexa integer qualities using ASCII values with an offset 1113 of 64. Solexa scores are scaled differently to PHRED scores, and Biopython 1114 will NOT perform any automatic conversion when loading. 1115 1116 NOTE - This file format is used by the OLD versions of the Solexa/Illumina 1117 pipeline. See also the FastqIlluminaIterator function for the NEW version. 1118 1119 For example, consider a file containing these five records:: 1120 1121 @SLXA-B3_649_FC8437_R1_1_1_610_79 1122 GATGTGCAATACCTTTGTAGAGGAA 1123 +SLXA-B3_649_FC8437_R1_1_1_610_79 1124 YYYYYYYYYYYYYYYYYYWYWYYSU 1125 @SLXA-B3_649_FC8437_R1_1_1_397_389 1126 GGTTTGAGAAAGAGAAATGAGATAA 1127 +SLXA-B3_649_FC8437_R1_1_1_397_389 1128 YYYYYYYYYWYYYYWWYYYWYWYWW 1129 @SLXA-B3_649_FC8437_R1_1_1_850_123 1130 GAGGGTGTTGATCATGATGATGGCG 1131 +SLXA-B3_649_FC8437_R1_1_1_850_123 1132 YYYYYYYYYYYYYWYYWYYSYYYSY 1133 @SLXA-B3_649_FC8437_R1_1_1_362_549 1134 GGAAACAAAGTTTTTCTCAACATAG 1135 +SLXA-B3_649_FC8437_R1_1_1_362_549 1136 YYYYYYYYYYYYYYYYYYWWWWYWY 1137 @SLXA-B3_649_FC8437_R1_1_1_183_714 1138 GTATTATTTAATGGCATACACTCAA 1139 +SLXA-B3_649_FC8437_R1_1_1_183_714 1140 YYYYYYYYYYWYYYYWYWWUWWWQQ 1141 1142 Using this module directly you might run: 1143 1144 >>> with open("Quality/solexa_example.fastq") as handle: 1145 ... for record in FastqSolexaIterator(handle): 1146 ... print("%s %s" % (record.id, record.seq)) 1147 SLXA-B3_649_FC8437_R1_1_1_610_79 GATGTGCAATACCTTTGTAGAGGAA 1148 SLXA-B3_649_FC8437_R1_1_1_397_389 GGTTTGAGAAAGAGAAATGAGATAA 1149 SLXA-B3_649_FC8437_R1_1_1_850_123 GAGGGTGTTGATCATGATGATGGCG 1150 SLXA-B3_649_FC8437_R1_1_1_362_549 GGAAACAAAGTTTTTCTCAACATAG 1151 SLXA-B3_649_FC8437_R1_1_1_183_714 GTATTATTTAATGGCATACACTCAA 1152 1153 Typically however, you would call this via Bio.SeqIO instead with 1154 "fastq-solexa" as the format: 1155 1156 >>> from Bio import SeqIO 1157 >>> with open("Quality/solexa_example.fastq") as handle: 1158 ... for record in SeqIO.parse(handle, "fastq-solexa"): 1159 ... print("%s %s" % (record.id, record.seq)) 1160 SLXA-B3_649_FC8437_R1_1_1_610_79 GATGTGCAATACCTTTGTAGAGGAA 1161 SLXA-B3_649_FC8437_R1_1_1_397_389 GGTTTGAGAAAGAGAAATGAGATAA 1162 SLXA-B3_649_FC8437_R1_1_1_850_123 GAGGGTGTTGATCATGATGATGGCG 1163 SLXA-B3_649_FC8437_R1_1_1_362_549 GGAAACAAAGTTTTTCTCAACATAG 1164 SLXA-B3_649_FC8437_R1_1_1_183_714 GTATTATTTAATGGCATACACTCAA 1165 1166 If you want to look at the qualities, they are recorded in each record's 1167 per-letter-annotation dictionary as a simple list of integers: 1168 1169 >>> print(record.letter_annotations["solexa_quality"]) 1170 [25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 23, 25, 25, 25, 25, 23, 25, 23, 23, 21, 23, 23, 23, 17, 17] 1171 1172 These scores aren't very good, but they are high enough that they map 1173 almost exactly onto PHRED scores: 1174 1175 >>> print("%0.2f" % phred_quality_from_solexa(25)) 1176 25.01 1177 1178 Let's look at faked example read which is even worse, where there are 1179 more noticeable differences between the Solexa and PHRED scores:: 1180 1181 @slxa_0001_1_0001_01 1182 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN 1183 +slxa_0001_1_0001_01 1184 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<; 1185 1186 Again, you would typically use Bio.SeqIO to read this file in (rather than 1187 calling the Bio.SeqIO.QualtityIO module directly). Most FASTQ files will 1188 contain thousands of reads, so you would normally use Bio.SeqIO.parse() 1189 as shown above. This example has only as one entry, so instead we can 1190 use the Bio.SeqIO.read() function: 1191 1192 >>> from Bio import SeqIO 1193 >>> with open("Quality/solexa_faked.fastq") as handle: 1194 ... record = SeqIO.read(handle, "fastq-solexa") 1195 >>> print("%s %s" % (record.id, record.seq)) 1196 slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN 1197 >>> print(record.letter_annotations["solexa_quality"]) 1198 [40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5] 1199 1200 These quality scores are so low that when converted from the Solexa scheme 1201 into PHRED scores they look quite different: 1202 1203 >>> print("%0.2f" % phred_quality_from_solexa(-1)) 1204 2.54 1205 >>> print("%0.2f" % phred_quality_from_solexa(-5)) 1206 1.19 1207 1208 Note you can use the Bio.SeqIO.write() function or the SeqRecord's format 1209 method to output the record(s): 1210 1211 >>> print(record.format("fastq-solexa")) 1212 @slxa_0001_1_0001_01 1213 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN 1214 + 1215 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<; 1216 <BLANKLINE> 1217 1218 Note this output is slightly different from the input file as Biopython 1219 has left out the optional repetition of the sequence identifier on the "+" 1220 line. If you want the to use PHRED scores, use "fastq" or "qual" as the 1221 output format instead, and Biopython will do the conversion for you: 1222 1223 >>> print(record.format("fastq")) 1224 @slxa_0001_1_0001_01 1225 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN 1226 + 1227 IHGFEDCBA@?>=<;:9876543210/.-,++*)('&&%%$$##"" 1228 <BLANKLINE> 1229 1230 >>> print(record.format("qual")) 1231 >slxa_0001_1_0001_01 1232 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 1233 20 19 18 17 16 15 14 13 12 11 10 10 9 8 7 6 5 5 4 4 3 3 2 2 1234 1 1 1235 <BLANKLINE> 1236 1237 As shown above, the poor quality Solexa reads have been mapped to the 1238 equivalent PHRED score (e.g. -5 to 1 as shown earlier). 1239 """ 1240 if alphabet is not None: 1241 raise ValueError("The alphabet argument is no longer supported") 1242 1243 q_mapping = { 1244 chr(letter): letter - SOLEXA_SCORE_OFFSET 1245 for letter in range(SOLEXA_SCORE_OFFSET - 5, 63 + SOLEXA_SCORE_OFFSET) 1246 } 1247 1248 for title_line, seq_string, quality_string in FastqGeneralIterator(source): 1249 if title2ids: 1250 id, name, descr = title2ids(title_line) 1251 else: 1252 descr = title_line 1253 id = descr.split()[0] 1254 name = id 1255 record = SeqRecord(Seq(seq_string), id=id, name=name, description=descr) 1256 try: 1257 qualities = [q_mapping[letter] for letter in quality_string] 1258 # DO NOT convert these into PHRED qualities automatically! 1259 except KeyError: 1260 raise ValueError("Invalid character in quality string") from None 1261 # Dirty trick to speed up this line: 1262 # record.letter_annotations["solexa_quality"] = qualities 1263 dict.__setitem__(record._per_letter_annotations, "solexa_quality", qualities) 1264 yield record 1265 1266 1267def FastqIlluminaIterator(source, alphabet=None, title2ids=None): 1268 """Parse Illumina 1.3 to 1.7 FASTQ like files (which differ in the quality mapping). 1269 1270 The optional arguments are the same as those for the FastqPhredIterator. 1271 1272 For each sequence in Illumina 1.3+ FASTQ files there is a matching string 1273 encoding PHRED integer qualities using ASCII values with an offset of 64. 1274 1275 >>> from Bio import SeqIO 1276 >>> record = SeqIO.read("Quality/illumina_faked.fastq", "fastq-illumina") 1277 >>> print("%s %s" % (record.id, record.seq)) 1278 Test ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN 1279 >>> max(record.letter_annotations["phred_quality"]) 1280 40 1281 >>> min(record.letter_annotations["phred_quality"]) 1282 0 1283 1284 NOTE - Older versions of the Solexa/Illumina pipeline encoded Solexa scores 1285 with an ASCII offset of 64. They are approximately equal but only for high 1286 quality reads. If you have an old Solexa/Illumina file with negative 1287 Solexa scores, and try and read this as an Illumina 1.3+ file it will fail: 1288 1289 >>> record2 = SeqIO.read("Quality/solexa_faked.fastq", "fastq-illumina") 1290 Traceback (most recent call last): 1291 ... 1292 ValueError: Invalid character in quality string 1293 1294 NOTE - True Sanger style FASTQ files use PHRED scores with an offset of 33. 1295 """ 1296 if alphabet is not None: 1297 raise ValueError("The alphabet argument is no longer supported") 1298 1299 q_mapping = { 1300 chr(letter): letter - SOLEXA_SCORE_OFFSET 1301 for letter in range(SOLEXA_SCORE_OFFSET, 63 + SOLEXA_SCORE_OFFSET) 1302 } 1303 1304 for title_line, seq_string, quality_string in FastqGeneralIterator(source): 1305 if title2ids: 1306 id, name, descr = title2ids(title_line) 1307 else: 1308 descr = title_line 1309 id = descr.split()[0] 1310 name = id 1311 record = SeqRecord(Seq(seq_string), id=id, name=name, description=descr) 1312 try: 1313 qualities = [q_mapping[letter] for letter in quality_string] 1314 except KeyError: 1315 raise ValueError("Invalid character in quality string") from None 1316 # Dirty trick to speed up this line: 1317 # record.letter_annotations["phred_quality"] = qualities 1318 dict.__setitem__(record._per_letter_annotations, "phred_quality", qualities) 1319 yield record 1320 1321 1322class QualPhredIterator(SequenceIterator): 1323 """Parser for QUAL files with PHRED quality scores but no sequence.""" 1324 1325 def __init__(self, source, alphabet=None, title2ids=None): 1326 """For QUAL files which include PHRED quality scores, but no sequence. 1327 1328 For example, consider this short QUAL file:: 1329 1330 >EAS54_6_R1_2_1_413_324 1331 26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 1332 26 26 26 23 23 1333 >EAS54_6_R1_2_1_540_792 1334 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26 1335 26 18 26 23 18 1336 >EAS54_6_R1_2_1_443_348 1337 26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18 1338 24 18 18 18 18 1339 1340 Using this module directly you might run: 1341 1342 >>> with open("Quality/example.qual") as handle: 1343 ... for record in QualPhredIterator(handle): 1344 ... print("%s read of length %d" % (record.id, len(record.seq))) 1345 EAS54_6_R1_2_1_413_324 read of length 25 1346 EAS54_6_R1_2_1_540_792 read of length 25 1347 EAS54_6_R1_2_1_443_348 read of length 25 1348 1349 Typically however, you would call this via Bio.SeqIO instead with "qual" 1350 as the format: 1351 1352 >>> from Bio import SeqIO 1353 >>> with open("Quality/example.qual") as handle: 1354 ... for record in SeqIO.parse(handle, "qual"): 1355 ... print("%s read of length %d" % (record.id, len(record.seq))) 1356 EAS54_6_R1_2_1_413_324 read of length 25 1357 EAS54_6_R1_2_1_540_792 read of length 25 1358 EAS54_6_R1_2_1_443_348 read of length 25 1359 1360 Only the sequence length is known, as the QUAL file does not contain 1361 the sequence string itself. 1362 1363 The quality scores themselves are available as a list of integers 1364 in each record's per-letter-annotation: 1365 1366 >>> print(record.letter_annotations["phred_quality"]) 1367 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 1368 1369 You can still slice one of these SeqRecord objects: 1370 1371 >>> sub_record = record[5:10] 1372 >>> print("%s %s" % (sub_record.id, sub_record.letter_annotations["phred_quality"])) 1373 EAS54_6_R1_2_1_443_348 [26, 26, 26, 26, 26] 1374 1375 As of Biopython 1.59, this parser will accept files with negatives quality 1376 scores but will replace them with the lowest possible PHRED score of zero. 1377 This will trigger a warning, previously it raised a ValueError exception. 1378 """ 1379 if alphabet is not None: 1380 raise ValueError("The alphabet argument is no longer supported") 1381 self.title2ids = title2ids 1382 super().__init__(source, mode="t", fmt="QUAL") 1383 1384 def parse(self, handle): 1385 """Start parsing the file, and return a SeqRecord generator.""" 1386 records = self.iterate(handle) 1387 return records 1388 1389 def iterate(self, handle): 1390 """Parse the file and generate SeqRecord objects.""" 1391 title2ids = self.title2ids 1392 # Skip any text before the first record (e.g. blank lines, comments) 1393 for line in handle: 1394 if line[0] == ">": 1395 break 1396 else: 1397 return 1398 1399 while True: 1400 if line[0] != ">": 1401 raise ValueError( 1402 "Records in Fasta files should start with '>' character" 1403 ) 1404 if title2ids: 1405 id, name, descr = title2ids(line[1:].rstrip()) 1406 else: 1407 descr = line[1:].rstrip() 1408 id = descr.split()[0] 1409 name = id 1410 1411 qualities = [] 1412 for line in handle: 1413 if line[0] == ">": 1414 break 1415 qualities.extend(int(word) for word in line.split()) 1416 else: 1417 line = None 1418 1419 if qualities and min(qualities) < 0: 1420 warnings.warn( 1421 "Negative quality score %i found, substituting PHRED zero instead." 1422 % min(qualities), 1423 BiopythonParserWarning, 1424 ) 1425 qualities = [max(0, q) for q in qualities] 1426 1427 # Return the record and then continue... 1428 sequence = Seq(None, length=len(qualities)) 1429 record = SeqRecord(sequence, id=id, name=name, description=descr) 1430 # Dirty trick to speed up this line: 1431 # record.letter_annotations["phred_quality"] = qualities 1432 dict.__setitem__(record._per_letter_annotations, "phred_quality", qualities) 1433 yield record 1434 1435 if line is None: 1436 return # StopIteration 1437 raise ValueError("Unrecognised QUAL record format.") 1438 1439 1440class FastqPhredWriter(SequenceWriter): 1441 """Class to write standard FASTQ format files (using PHRED quality scores) (OBSOLETE). 1442 1443 Although you can use this class directly, you are strongly encouraged 1444 to use the ``as_fastq`` function, or top level ``Bio.SeqIO.write()`` 1445 function instead via the format name "fastq" or the alias "fastq-sanger". 1446 1447 For example, this code reads in a standard Sanger style FASTQ file 1448 (using PHRED scores) and re-saves it as another Sanger style FASTQ file: 1449 1450 >>> from Bio import SeqIO 1451 >>> record_iterator = SeqIO.parse("Quality/example.fastq", "fastq") 1452 >>> with open("Quality/temp.fastq", "w") as out_handle: 1453 ... SeqIO.write(record_iterator, out_handle, "fastq") 1454 3 1455 1456 You might want to do this if the original file included extra line breaks, 1457 which while valid may not be supported by all tools. The output file from 1458 Biopython will have each sequence on a single line, and each quality 1459 string on a single line (which is considered desirable for maximum 1460 compatibility). 1461 1462 In this next example, an old style Solexa/Illumina FASTQ file (using Solexa 1463 quality scores) is converted into a standard Sanger style FASTQ file using 1464 PHRED qualities: 1465 1466 >>> from Bio import SeqIO 1467 >>> record_iterator = SeqIO.parse("Quality/solexa_example.fastq", "fastq-solexa") 1468 >>> with open("Quality/temp.fastq", "w") as out_handle: 1469 ... SeqIO.write(record_iterator, out_handle, "fastq") 1470 5 1471 1472 This code is also called if you use the .format("fastq") method of a 1473 SeqRecord, or .format("fastq-sanger") if you prefer that alias. 1474 1475 Note that Sanger FASTQ files have an upper limit of PHRED quality 93, which is 1476 encoded as ASCII 126, the tilde. If your quality scores are truncated to fit, a 1477 warning is issued. 1478 1479 P.S. To avoid cluttering up your working directory, you can delete this 1480 temporary file now: 1481 1482 >>> import os 1483 >>> os.remove("Quality/temp.fastq") 1484 """ 1485 1486 assert SANGER_SCORE_OFFSET == ord("!") 1487 1488 def write_record(self, record): 1489 """Write a single FASTQ record to the file.""" 1490 assert self._header_written 1491 assert not self._footer_written 1492 self._record_written = True 1493 # TODO - Is an empty sequence allowed in FASTQ format? 1494 seq = record.seq 1495 if seq is None: 1496 raise ValueError("No sequence for record %s" % record.id) 1497 qualities_str = _get_sanger_quality_str(record) 1498 if len(qualities_str) != len(seq): 1499 raise ValueError( 1500 "Record %s has sequence length %i but %i quality scores" 1501 % (record.id, len(seq), len(qualities_str)) 1502 ) 1503 1504 # FASTQ files can include a description, just like FASTA files 1505 # (at least, this is what the NCBI Short Read Archive does) 1506 id = self.clean(record.id) 1507 description = self.clean(record.description) 1508 if description and description.split(None, 1)[0] == id: 1509 # The description includes the id at the start 1510 title = description 1511 elif description: 1512 title = "%s %s" % (id, description) 1513 else: 1514 title = id 1515 1516 self.handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qualities_str)) 1517 1518 1519def as_fastq(record): 1520 """Turn a SeqRecord into a Sanger FASTQ formatted string. 1521 1522 This is used internally by the SeqRecord's .format("fastq") 1523 method and by the SeqIO.write(..., ..., "fastq") function, 1524 and under the format alias "fastq-sanger" as well. 1525 """ 1526 seq_str = _get_seq_string(record) 1527 qualities_str = _get_sanger_quality_str(record) 1528 if len(qualities_str) != len(seq_str): 1529 raise ValueError( 1530 "Record %s has sequence length %i but %i quality scores" 1531 % (record.id, len(seq_str), len(qualities_str)) 1532 ) 1533 id = _clean(record.id) 1534 description = _clean(record.description) 1535 if description and description.split(None, 1)[0] == id: 1536 title = description 1537 elif description: 1538 title = "%s %s" % (id, description) 1539 else: 1540 title = id 1541 return "@%s\n%s\n+\n%s\n" % (title, seq_str, qualities_str) 1542 1543 1544class QualPhredWriter(SequenceWriter): 1545 """Class to write QUAL format files (using PHRED quality scores) (OBSOLETE). 1546 1547 Although you can use this class directly, you are strongly encouraged 1548 to use the ``as_qual`` function, or top level ``Bio.SeqIO.write()`` 1549 function instead. 1550 1551 For example, this code reads in a FASTQ file and saves the quality scores 1552 into a QUAL file: 1553 1554 >>> from Bio import SeqIO 1555 >>> record_iterator = SeqIO.parse("Quality/example.fastq", "fastq") 1556 >>> with open("Quality/temp.qual", "w") as out_handle: 1557 ... SeqIO.write(record_iterator, out_handle, "qual") 1558 3 1559 1560 This code is also called if you use the .format("qual") method of a 1561 SeqRecord. 1562 1563 P.S. Don't forget to clean up the temp file if you don't need it anymore: 1564 1565 >>> import os 1566 >>> os.remove("Quality/temp.qual") 1567 """ 1568 1569 def __init__(self, handle, wrap=60, record2title=None): 1570 """Create a QUAL writer. 1571 1572 Arguments: 1573 - handle - Handle to an output file, e.g. as returned 1574 by open(filename, "w") 1575 - wrap - Optional line length used to wrap sequence lines. 1576 Defaults to wrapping the sequence at 60 characters. Use 1577 zero (or None) for no wrapping, giving a single long line 1578 for the sequence. 1579 - record2title - Optional function to return the text to be 1580 used for the title line of each record. By default a 1581 combination of the record.id and record.description is 1582 used. If the record.description starts with the record.id, 1583 then just the record.description is used. 1584 1585 The record2title argument is present for consistency with the 1586 Bio.SeqIO.FastaIO writer class. 1587 """ 1588 super().__init__(handle) 1589 # self.handle = handle 1590 self.wrap = None 1591 if wrap: 1592 if wrap < 1: 1593 raise ValueError 1594 self.wrap = wrap 1595 self.record2title = record2title 1596 1597 def write_record(self, record): 1598 """Write a single QUAL record to the file.""" 1599 assert self._header_written 1600 assert not self._footer_written 1601 self._record_written = True 1602 1603 handle = self.handle 1604 wrap = self.wrap 1605 1606 if self.record2title: 1607 title = self.clean(self.record2title(record)) 1608 else: 1609 id = self.clean(record.id) 1610 description = self.clean(record.description) 1611 if description and description.split(None, 1)[0] == id: 1612 # The description includes the id at the start 1613 title = description 1614 elif description: 1615 title = "%s %s" % (id, description) 1616 else: 1617 title = id 1618 handle.write(">%s\n" % title) 1619 1620 qualities = _get_phred_quality(record) 1621 try: 1622 # This rounds to the nearest integer. 1623 # TODO - can we record a float in a qual file? 1624 qualities_strs = [("%i" % round(q, 0)) for q in qualities] 1625 except TypeError: 1626 if None in qualities: 1627 raise TypeError("A quality value of None was found") from None 1628 else: 1629 raise 1630 1631 if wrap > 5: 1632 # Fast wrapping 1633 data = " ".join(qualities_strs) 1634 while True: 1635 if len(data) <= wrap: 1636 self.handle.write(data + "\n") 1637 break 1638 else: 1639 # By construction there must be spaces in the first X chars 1640 # (unless we have X digit or higher quality scores!) 1641 i = data.rfind(" ", 0, wrap) 1642 handle.write(data[:i] + "\n") 1643 data = data[i + 1 :] 1644 elif wrap: 1645 # Safe wrapping 1646 while qualities_strs: 1647 line = qualities_strs.pop(0) 1648 while qualities_strs and len(line) + 1 + len(qualities_strs[0]) < wrap: 1649 line += " " + qualities_strs.pop(0) 1650 handle.write(line + "\n") 1651 else: 1652 # No wrapping 1653 data = " ".join(qualities_strs) 1654 handle.write(data + "\n") 1655 1656 1657def as_qual(record): 1658 """Turn a SeqRecord into a QUAL formatted string. 1659 1660 This is used internally by the SeqRecord's .format("qual") 1661 method and by the SeqIO.write(..., ..., "qual") function. 1662 """ 1663 id = _clean(record.id) 1664 description = _clean(record.description) 1665 if description and description.split(None, 1)[0] == id: 1666 title = description 1667 elif description: 1668 title = "%s %s" % (id, description) 1669 else: 1670 title = id 1671 lines = [">%s\n" % title] 1672 1673 qualities = _get_phred_quality(record) 1674 try: 1675 # This rounds to the nearest integer. 1676 # TODO - can we record a float in a qual file? 1677 qualities_strs = [("%i" % round(q, 0)) for q in qualities] 1678 except TypeError: 1679 if None in qualities: 1680 raise TypeError("A quality value of None was found") from None 1681 else: 1682 raise 1683 1684 # Safe wrapping 1685 while qualities_strs: 1686 line = qualities_strs.pop(0) 1687 while qualities_strs and len(line) + 1 + len(qualities_strs[0]) < 60: 1688 line += " " + qualities_strs.pop(0) 1689 lines.append(line + "\n") 1690 return "".join(lines) 1691 1692 1693class FastqSolexaWriter(SequenceWriter): 1694 r"""Write old style Solexa/Illumina FASTQ format files (with Solexa qualities) (OBSOLETE). 1695 1696 This outputs FASTQ files like those from the early Solexa/Illumina 1697 pipeline, using Solexa scores and an ASCII offset of 64. These are 1698 NOT compatible with the standard Sanger style PHRED FASTQ files. 1699 1700 If your records contain a "solexa_quality" entry under letter_annotations, 1701 this is used, otherwise any "phred_quality" entry will be used after 1702 conversion using the solexa_quality_from_phred function. If neither style 1703 of quality scores are present, an exception is raised. 1704 1705 Although you can use this class directly, you are strongly encouraged 1706 to use the ``as_fastq_solexa`` function, or top-level ``Bio.SeqIO.write()`` 1707 function instead. For example, this code reads in a FASTQ file and re-saves 1708 it as another FASTQ file: 1709 1710 >>> from Bio import SeqIO 1711 >>> record_iterator = SeqIO.parse("Quality/solexa_example.fastq", "fastq-solexa") 1712 >>> with open("Quality/temp.fastq", "w") as out_handle: 1713 ... SeqIO.write(record_iterator, out_handle, "fastq-solexa") 1714 5 1715 1716 You might want to do this if the original file included extra line breaks, 1717 which (while valid) may not be supported by all tools. The output file 1718 from Biopython will have each sequence on a single line, and each quality 1719 string on a single line (which is considered desirable for maximum 1720 compatibility). 1721 1722 This code is also called if you use the .format("fastq-solexa") method of 1723 a SeqRecord. For example, 1724 1725 >>> record = SeqIO.read("Quality/sanger_faked.fastq", "fastq-sanger") 1726 >>> print(record.format("fastq-solexa")) 1727 @Test PHRED qualities from 40 to 0 inclusive 1728 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN 1729 + 1730 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJHGFECB@>;; 1731 <BLANKLINE> 1732 1733 Note that Solexa FASTQ files have an upper limit of Solexa quality 62, which is 1734 encoded as ASCII 126, the tilde. If your quality scores must be truncated to fit, 1735 a warning is issued. 1736 1737 P.S. Don't forget to delete the temp file if you don't need it anymore: 1738 1739 >>> import os 1740 >>> os.remove("Quality/temp.fastq") 1741 """ 1742 1743 def write_record(self, record): 1744 """Write a single FASTQ record to the file.""" 1745 assert self._header_written 1746 assert not self._footer_written 1747 self._record_written = True 1748 1749 # TODO - Is an empty sequence allowed in FASTQ format? 1750 seq = record.seq 1751 if seq is None: 1752 raise ValueError("No sequence for record %s" % record.id) 1753 qualities_str = _get_solexa_quality_str(record) 1754 if len(qualities_str) != len(seq): 1755 raise ValueError( 1756 "Record %s has sequence length %i but %i quality scores" 1757 % (record.id, len(seq), len(qualities_str)) 1758 ) 1759 1760 # FASTQ files can include a description, just like FASTA files 1761 # (at least, this is what the NCBI Short Read Archive does) 1762 id = self.clean(record.id) 1763 description = self.clean(record.description) 1764 if description and description.split(None, 1)[0] == id: 1765 # The description includes the id at the start 1766 title = description 1767 elif description: 1768 title = "%s %s" % (id, description) 1769 else: 1770 title = id 1771 1772 self.handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qualities_str)) 1773 1774 1775def as_fastq_solexa(record): 1776 """Turn a SeqRecord into a Solexa FASTQ formatted string. 1777 1778 This is used internally by the SeqRecord's .format("fastq-solexa") 1779 method and by the SeqIO.write(..., ..., "fastq-solexa") function. 1780 """ 1781 seq_str = _get_seq_string(record) 1782 qualities_str = _get_solexa_quality_str(record) 1783 if len(qualities_str) != len(seq_str): 1784 raise ValueError( 1785 "Record %s has sequence length %i but %i quality scores" 1786 % (record.id, len(seq_str), len(qualities_str)) 1787 ) 1788 id = _clean(record.id) 1789 description = _clean(record.description) 1790 if description and description.split(None, 1)[0] == id: 1791 # The description includes the id at the start 1792 title = description 1793 elif description: 1794 title = "%s %s" % (id, description) 1795 else: 1796 title = id 1797 return "@%s\n%s\n+\n%s\n" % (title, seq_str, qualities_str) 1798 1799 1800class FastqIlluminaWriter(SequenceWriter): 1801 r"""Write Illumina 1.3+ FASTQ format files (with PHRED quality scores) (OBSOLETE). 1802 1803 This outputs FASTQ files like those from the Solexa/Illumina 1.3+ pipeline, 1804 using PHRED scores and an ASCII offset of 64. Note these files are NOT 1805 compatible with the standard Sanger style PHRED FASTQ files which use an 1806 ASCII offset of 32. 1807 1808 Although you can use this class directly, you are strongly encouraged to 1809 use the ``as_fastq_illumina`` or top-level ``Bio.SeqIO.write()`` function 1810 with format name "fastq-illumina" instead. This code is also called if you 1811 use the .format("fastq-illumina") method of a SeqRecord. For example, 1812 1813 >>> from Bio import SeqIO 1814 >>> record = SeqIO.read("Quality/sanger_faked.fastq", "fastq-sanger") 1815 >>> print(record.format("fastq-illumina")) 1816 @Test PHRED qualities from 40 to 0 inclusive 1817 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN 1818 + 1819 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@ 1820 <BLANKLINE> 1821 1822 Note that Illumina FASTQ files have an upper limit of PHRED quality 62, which is 1823 encoded as ASCII 126, the tilde. If your quality scores are truncated to fit, a 1824 warning is issued. 1825 """ 1826 1827 def write_record(self, record): 1828 """Write a single FASTQ record to the file.""" 1829 assert self._header_written 1830 assert not self._footer_written 1831 self._record_written = True 1832 1833 # TODO - Is an empty sequence allowed in FASTQ format? 1834 seq = record.seq 1835 if seq is None: 1836 raise ValueError("No sequence for record %s" % record.id) 1837 qualities_str = _get_illumina_quality_str(record) 1838 if len(qualities_str) != len(seq): 1839 raise ValueError( 1840 "Record %s has sequence length %i but %i quality scores" 1841 % (record.id, len(seq), len(qualities_str)) 1842 ) 1843 1844 # FASTQ files can include a description, just like FASTA files 1845 # (at least, this is what the NCBI Short Read Archive does) 1846 id = self.clean(record.id) 1847 description = self.clean(record.description) 1848 if description and description.split(None, 1)[0] == id: 1849 # The description includes the id at the start 1850 title = description 1851 elif description: 1852 title = "%s %s" % (id, description) 1853 else: 1854 title = id 1855 1856 self.handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qualities_str)) 1857 1858 1859def as_fastq_illumina(record): 1860 """Turn a SeqRecord into an Illumina FASTQ formatted string. 1861 1862 This is used internally by the SeqRecord's .format("fastq-illumina") 1863 method and by the SeqIO.write(..., ..., "fastq-illumina") function. 1864 """ 1865 seq_str = _get_seq_string(record) 1866 qualities_str = _get_illumina_quality_str(record) 1867 if len(qualities_str) != len(seq_str): 1868 raise ValueError( 1869 "Record %s has sequence length %i but %i quality scores" 1870 % (record.id, len(seq_str), len(qualities_str)) 1871 ) 1872 id = _clean(record.id) 1873 description = _clean(record.description) 1874 if description and description.split(None, 1)[0] == id: 1875 title = description 1876 elif description: 1877 title = "%s %s" % (id, description) 1878 else: 1879 title = id 1880 return "@%s\n%s\n+\n%s\n" % (title, seq_str, qualities_str) 1881 1882 1883def PairedFastaQualIterator(fasta_source, qual_source, alphabet=None, title2ids=None): 1884 """Iterate over matched FASTA and QUAL files as SeqRecord objects. 1885 1886 For example, consider this short QUAL file with PHRED quality scores:: 1887 1888 >EAS54_6_R1_2_1_413_324 1889 26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 1890 26 26 26 23 23 1891 >EAS54_6_R1_2_1_540_792 1892 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26 1893 26 18 26 23 18 1894 >EAS54_6_R1_2_1_443_348 1895 26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18 1896 24 18 18 18 18 1897 1898 And a matching FASTA file:: 1899 1900 >EAS54_6_R1_2_1_413_324 1901 CCCTTCTTGTCTTCAGCGTTTCTCC 1902 >EAS54_6_R1_2_1_540_792 1903 TTGGCAGGCCAAGGCCGATGGATCA 1904 >EAS54_6_R1_2_1_443_348 1905 GTTGCTTCTGGCGTGGGTGGGGGGG 1906 1907 You can parse these separately using Bio.SeqIO with the "qual" and 1908 "fasta" formats, but then you'll get a group of SeqRecord objects with 1909 no sequence, and a matching group with the sequence but not the 1910 qualities. Because it only deals with one input file handle, Bio.SeqIO 1911 can't be used to read the two files together - but this function can! 1912 For example, 1913 1914 >>> with open("Quality/example.fasta") as f: 1915 ... with open("Quality/example.qual") as q: 1916 ... for record in PairedFastaQualIterator(f, q): 1917 ... print("%s %s" % (record.id, record.seq)) 1918 ... 1919 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC 1920 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA 1921 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG 1922 1923 As with the FASTQ or QUAL parsers, if you want to look at the qualities, 1924 they are in each record's per-letter-annotation dictionary as a simple 1925 list of integers: 1926 1927 >>> print(record.letter_annotations["phred_quality"]) 1928 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 1929 1930 If you have access to data as a FASTQ format file, using that directly 1931 would be simpler and more straight forward. Note that you can easily use 1932 this function to convert paired FASTA and QUAL files into FASTQ files: 1933 1934 >>> from Bio import SeqIO 1935 >>> with open("Quality/example.fasta") as f: 1936 ... with open("Quality/example.qual") as q: 1937 ... SeqIO.write(PairedFastaQualIterator(f, q), "Quality/temp.fastq", "fastq") 1938 ... 1939 3 1940 1941 And don't forget to clean up the temp file if you don't need it anymore: 1942 1943 >>> import os 1944 >>> os.remove("Quality/temp.fastq") 1945 """ 1946 if alphabet is not None: 1947 raise ValueError("The alphabet argument is no longer supported") 1948 1949 from Bio.SeqIO.FastaIO import FastaIterator 1950 1951 fasta_iter = FastaIterator(fasta_source, title2ids=title2ids) 1952 qual_iter = QualPhredIterator(qual_source, title2ids=title2ids) 1953 1954 # Using zip wouldn't load everything into memory, but also would not catch 1955 # any extra records found in only one file. 1956 while True: 1957 try: 1958 f_rec = next(fasta_iter) 1959 except StopIteration: 1960 f_rec = None 1961 try: 1962 q_rec = next(qual_iter) 1963 except StopIteration: 1964 q_rec = None 1965 if f_rec is None and q_rec is None: 1966 # End of both files 1967 break 1968 if f_rec is None: 1969 raise ValueError("FASTA file has more entries than the QUAL file.") 1970 if q_rec is None: 1971 raise ValueError("QUAL file has more entries than the FASTA file.") 1972 if f_rec.id != q_rec.id: 1973 raise ValueError( 1974 "FASTA and QUAL entries do not match (%s vs %s)." % (f_rec.id, q_rec.id) 1975 ) 1976 if len(f_rec) != len(q_rec.letter_annotations["phred_quality"]): 1977 raise ValueError( 1978 "Sequence length and number of quality scores disagree for %s" 1979 % f_rec.id 1980 ) 1981 # Merge the data.... 1982 f_rec.letter_annotations["phred_quality"] = q_rec.letter_annotations[ 1983 "phred_quality" 1984 ] 1985 yield f_rec 1986 # Done 1987 1988 1989def _fastq_generic(in_file, out_file, mapping): 1990 """FASTQ helper function where can't have data loss by truncation (PRIVATE).""" 1991 # For real speed, don't even make SeqRecord and Seq objects! 1992 count = 0 1993 null = chr(0) 1994 with as_handle(out_file, "w") as out_handle: 1995 for title, seq, old_qual in FastqGeneralIterator(in_file): 1996 count += 1 1997 # map the qual... 1998 qual = old_qual.translate(mapping) 1999 if null in qual: 2000 raise ValueError("Invalid character in quality string") 2001 out_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual)) 2002 return count 2003 2004 2005def _fastq_generic2(in_file, out_file, mapping, truncate_char, truncate_msg): 2006 """FASTQ helper function where there could be data loss by truncation (PRIVATE).""" 2007 # For real speed, don't even make SeqRecord and Seq objects! 2008 count = 0 2009 null = chr(0) 2010 with as_handle(out_file, "w") as out_handle: 2011 for title, seq, old_qual in FastqGeneralIterator(in_file): 2012 count += 1 2013 # map the qual... 2014 qual = old_qual.translate(mapping) 2015 if null in qual: 2016 raise ValueError("Invalid character in quality string") 2017 if truncate_char in qual: 2018 qual = qual.replace(truncate_char, chr(126)) 2019 warnings.warn(truncate_msg, BiopythonWarning) 2020 out_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual)) 2021 return count 2022 2023 2024def _fastq_sanger_convert_fastq_sanger(in_file, out_file): 2025 """Fast Sanger FASTQ to Sanger FASTQ conversion (PRIVATE). 2026 2027 Useful for removing line wrapping and the redundant second identifier 2028 on the plus lines. Will check also check the quality string is valid. 2029 2030 Avoids creating SeqRecord and Seq objects in order to speed up this 2031 conversion. 2032 """ 2033 # Map unexpected chars to null 2034 mapping = "".join( 2035 [chr(0) for ascii in range(0, 33)] 2036 + [chr(ascii) for ascii in range(33, 127)] 2037 + [chr(0) for ascii in range(127, 256)] 2038 ) 2039 assert len(mapping) == 256 2040 return _fastq_generic(in_file, out_file, mapping) 2041 2042 2043def _fastq_solexa_convert_fastq_solexa(in_file, out_file): 2044 """Fast Solexa FASTQ to Solexa FASTQ conversion (PRIVATE). 2045 2046 Useful for removing line wrapping and the redundant second identifier 2047 on the plus lines. Will check also check the quality string is valid. 2048 Avoids creating SeqRecord and Seq objects in order to speed up this 2049 conversion. 2050 """ 2051 # Map unexpected chars to null 2052 mapping = "".join( 2053 [chr(0) for ascii in range(0, 59)] 2054 + [chr(ascii) for ascii in range(59, 127)] 2055 + [chr(0) for ascii in range(127, 256)] 2056 ) 2057 assert len(mapping) == 256 2058 return _fastq_generic(in_file, out_file, mapping) 2059 2060 2061def _fastq_illumina_convert_fastq_illumina(in_file, out_file): 2062 """Fast Illumina 1.3+ FASTQ to Illumina 1.3+ FASTQ conversion (PRIVATE). 2063 2064 Useful for removing line wrapping and the redundant second identifier 2065 on the plus lines. Will check also check the quality string is valid. 2066 Avoids creating SeqRecord and Seq objects in order to speed up this 2067 conversion. 2068 """ 2069 # Map unexpected chars to null 2070 mapping = "".join( 2071 [chr(0) for ascii in range(0, 64)] 2072 + [chr(ascii) for ascii in range(64, 127)] 2073 + [chr(0) for ascii in range(127, 256)] 2074 ) 2075 assert len(mapping) == 256 2076 return _fastq_generic(in_file, out_file, mapping) 2077 2078 2079def _fastq_illumina_convert_fastq_sanger(in_file, out_file): 2080 """Fast Illumina 1.3+ FASTQ to Sanger FASTQ conversion (PRIVATE). 2081 2082 Avoids creating SeqRecord and Seq objects in order to speed up this 2083 conversion. 2084 """ 2085 # Map unexpected chars to null 2086 mapping = "".join( 2087 [chr(0) for ascii in range(0, 64)] 2088 + [chr(33 + q) for q in range(0, 62 + 1)] 2089 + [chr(0) for ascii in range(127, 256)] 2090 ) 2091 assert len(mapping) == 256 2092 return _fastq_generic(in_file, out_file, mapping) 2093 2094 2095def _fastq_sanger_convert_fastq_illumina(in_file, out_file): 2096 """Fast Sanger FASTQ to Illumina 1.3+ FASTQ conversion (PRIVATE). 2097 2098 Avoids creating SeqRecord and Seq objects in order to speed up this 2099 conversion. Will issue a warning if the scores had to be truncated at 62 2100 (maximum possible in the Illumina 1.3+ FASTQ format) 2101 """ 2102 # Map unexpected chars to null 2103 trunc_char = chr(1) 2104 mapping = "".join( 2105 [chr(0) for ascii in range(0, 33)] 2106 + [chr(64 + q) for q in range(0, 62 + 1)] 2107 + [trunc_char for ascii in range(96, 127)] 2108 + [chr(0) for ascii in range(127, 256)] 2109 ) 2110 assert len(mapping) == 256 2111 return _fastq_generic2( 2112 in_file, 2113 out_file, 2114 mapping, 2115 trunc_char, 2116 "Data loss - max PHRED quality 62 in Illumina 1.3+ FASTQ", 2117 ) 2118 2119 2120def _fastq_solexa_convert_fastq_sanger(in_file, out_file): 2121 """Fast Solexa FASTQ to Sanger FASTQ conversion (PRIVATE). 2122 2123 Avoids creating SeqRecord and Seq objects in order to speed up this 2124 conversion. 2125 """ 2126 # Map unexpected chars to null 2127 mapping = "".join( 2128 [chr(0) for ascii in range(0, 59)] 2129 + [ 2130 chr(33 + int(round(phred_quality_from_solexa(q)))) 2131 for q in range(-5, 62 + 1) 2132 ] 2133 + [chr(0) for ascii in range(127, 256)] 2134 ) 2135 assert len(mapping) == 256 2136 return _fastq_generic(in_file, out_file, mapping) 2137 2138 2139def _fastq_sanger_convert_fastq_solexa(in_file, out_file): 2140 """Fast Sanger FASTQ to Solexa FASTQ conversion (PRIVATE). 2141 2142 Avoids creating SeqRecord and Seq objects in order to speed up this 2143 conversion. Will issue a warning if the scores had to be truncated at 62 2144 (maximum possible in the Solexa FASTQ format) 2145 """ 2146 # Map unexpected chars to null 2147 trunc_char = chr(1) 2148 mapping = "".join( 2149 [chr(0) for ascii in range(0, 33)] 2150 + [chr(64 + int(round(solexa_quality_from_phred(q)))) for q in range(0, 62 + 1)] 2151 + [trunc_char for ascii in range(96, 127)] 2152 + [chr(0) for ascii in range(127, 256)] 2153 ) 2154 assert len(mapping) == 256 2155 return _fastq_generic2( 2156 in_file, 2157 out_file, 2158 mapping, 2159 trunc_char, 2160 "Data loss - max Solexa quality 62 in Solexa FASTQ", 2161 ) 2162 2163 2164def _fastq_solexa_convert_fastq_illumina(in_file, out_file): 2165 """Fast Solexa FASTQ to Illumina 1.3+ FASTQ conversion (PRIVATE). 2166 2167 Avoids creating SeqRecord and Seq objects in order to speed up this 2168 conversion. 2169 """ 2170 # Map unexpected chars to null 2171 mapping = "".join( 2172 [chr(0) for ascii in range(0, 59)] 2173 + [ 2174 chr(64 + int(round(phred_quality_from_solexa(q)))) 2175 for q in range(-5, 62 + 1) 2176 ] 2177 + [chr(0) for ascii in range(127, 256)] 2178 ) 2179 assert len(mapping) == 256 2180 return _fastq_generic(in_file, out_file, mapping) 2181 2182 2183def _fastq_illumina_convert_fastq_solexa(in_file, out_file): 2184 """Fast Illumina 1.3+ FASTQ to Solexa FASTQ conversion (PRIVATE). 2185 2186 Avoids creating SeqRecord and Seq objects in order to speed up this 2187 conversion. 2188 """ 2189 # Map unexpected chars to null 2190 mapping = "".join( 2191 [chr(0) for ascii in range(0, 64)] 2192 + [chr(64 + int(round(solexa_quality_from_phred(q)))) for q in range(0, 62 + 1)] 2193 + [chr(0) for ascii in range(127, 256)] 2194 ) 2195 assert len(mapping) == 256 2196 return _fastq_generic(in_file, out_file, mapping) 2197 2198 2199def _fastq_convert_fasta(in_file, out_file): 2200 """Fast FASTQ to FASTA conversion (PRIVATE). 2201 2202 Avoids dealing with the FASTQ quality encoding, and creating SeqRecord and 2203 Seq objects in order to speed up this conversion. 2204 2205 NOTE - This does NOT check the characters used in the FASTQ quality string 2206 are valid! 2207 """ 2208 # For real speed, don't even make SeqRecord and Seq objects! 2209 count = 0 2210 with as_handle(out_file, "w") as out_handle: 2211 for title, seq, qual in FastqGeneralIterator(in_file): 2212 count += 1 2213 out_handle.write(">%s\n" % title) 2214 # Do line wrapping 2215 for i in range(0, len(seq), 60): 2216 out_handle.write(seq[i : i + 60] + "\n") 2217 return count 2218 2219 2220def _fastq_convert_tab(in_file, out_file): 2221 """Fast FASTQ to simple tabbed conversion (PRIVATE). 2222 2223 Avoids dealing with the FASTQ quality encoding, and creating SeqRecord and 2224 Seq objects in order to speed up this conversion. 2225 2226 NOTE - This does NOT check the characters used in the FASTQ quality string 2227 are valid! 2228 """ 2229 # For real speed, don't even make SeqRecord and Seq objects! 2230 count = 0 2231 with as_handle(out_file, "w") as out_handle: 2232 for title, seq, qual in FastqGeneralIterator(in_file): 2233 count += 1 2234 out_handle.write("%s\t%s\n" % (title.split(None, 1)[0], seq)) 2235 return count 2236 2237 2238def _fastq_convert_qual(in_file, out_file, mapping): 2239 """FASTQ helper function for QUAL output (PRIVATE). 2240 2241 Mapping should be a dictionary mapping expected ASCII characters from the 2242 FASTQ quality string to PHRED quality scores (as strings). 2243 """ 2244 # For real speed, don't even make SeqRecord and Seq objects! 2245 count = 0 2246 with as_handle(out_file, "w") as out_handle: 2247 for title, seq, qual in FastqGeneralIterator(in_file): 2248 count += 1 2249 out_handle.write(">%s\n" % title) 2250 # map the qual... note even with Sanger encoding max 2 digits 2251 try: 2252 qualities_strs = [mapping[ascii] for ascii in qual] 2253 except KeyError: 2254 raise ValueError("Invalid character in quality string") from None 2255 data = " ".join(qualities_strs) 2256 while len(data) > 60: 2257 # Know quality scores are either 1 or 2 digits, so there 2258 # must be a space in any three consecutive characters. 2259 if data[60] == " ": 2260 out_handle.write(data[:60] + "\n") 2261 data = data[61:] 2262 elif data[59] == " ": 2263 out_handle.write(data[:59] + "\n") 2264 data = data[60:] 2265 else: 2266 assert data[58] == " ", "Internal logic failure in wrapping" 2267 out_handle.write(data[:58] + "\n") 2268 data = data[59:] 2269 out_handle.write(data + "\n") 2270 return count 2271 2272 2273def _fastq_sanger_convert_qual(in_file, out_file): 2274 """Fast Sanger FASTQ to QUAL conversion (PRIVATE).""" 2275 mapping = {chr(q + 33): str(q) for q in range(0, 93 + 1)} 2276 return _fastq_convert_qual(in_file, out_file, mapping) 2277 2278 2279def _fastq_solexa_convert_qual(in_file, out_file): 2280 """Fast Solexa FASTQ to QUAL conversion (PRIVATE).""" 2281 mapping = { 2282 chr(q + 64): str(int(round(phred_quality_from_solexa(q)))) 2283 for q in range(-5, 62 + 1) 2284 } 2285 return _fastq_convert_qual(in_file, out_file, mapping) 2286 2287 2288def _fastq_illumina_convert_qual(in_file, out_file): 2289 """Fast Illumina 1.3+ FASTQ to QUAL conversion (PRIVATE).""" 2290 mapping = {chr(q + 64): str(q) for q in range(0, 62 + 1)} 2291 return _fastq_convert_qual(in_file, out_file, mapping) 2292 2293 2294if __name__ == "__main__": 2295 from Bio._utils import run_doctest 2296 2297 run_doctest(verbose=0) 2298