1\chapter{Cookbook -- Cool things to do with it} 2\label{chapter:cookbook} 3 4Biopython now has two collections of ``cookbook'' examples -- this chapter 5(which has been included in this tutorial for many years and has gradually 6grown), and \url{http://biopython.org/wiki/Category:Cookbook} which is a 7user contributed collection on our wiki. 8 9We're trying to encourage Biopython users to contribute their own examples 10to the wiki. In addition to helping the community, one direct benefit of 11sharing an example like this is that you could also get some feedback on 12the code from other Biopython users and developers - which could help you 13improve all your Python code. 14 15In the long term, we may end up moving all of the examples in this chapter 16to the wiki, or elsewhere within the tutorial. 17 18\section{Working with sequence files} 19\label{sec:cookbook-sequences} 20 21This section shows some more examples of sequence input/output, using the 22\verb|Bio.SeqIO| module described in Chapter~\ref{chapter:seqio}. 23 24\subsection{Filtering a sequence file} 25 26Often you'll have a large file with many sequences in it (e.g. FASTA file 27or genes, or a FASTQ or SFF file of reads), a separate shorter list of 28the IDs for a subset of sequences of interest, and want to make a new 29sequence file for this subset. 30 31Let's say the list of IDs is in a simple text file, as the first word on 32each line. This could be a tabular file where the first column is the ID. 33Try something like this: 34 35%not a doctest to avoid temp files being left behind, also no >>> 36%makes it easier to copy and paste the example to a script file. 37\begin{minted}{python} 38from Bio import SeqIO 39 40input_file = "big_file.sff" 41id_file = "short_list.txt" 42output_file = "short_list.sff" 43 44with open(id_file) as id_handle: 45 wanted = set(line.rstrip("\n").split(None, 1)[0] for line in id_handle) 46print("Found %i unique identifiers in %s" % (len(wanted), id_file)) 47 48records = (r for r in SeqIO.parse(input_file, "sff") if r.id in wanted) 49count = SeqIO.write(records, output_file, "sff") 50print("Saved %i records from %s to %s" % (count, input_file, output_file)) 51if count < len(wanted): 52 print("Warning %i IDs not found in %s" % (len(wanted) - count, input_file)) 53\end{minted} 54 55Note that we use a Python \verb|set| rather than a \verb|list|, this makes 56testing membership faster. 57 58As discussed in Section~\ref{sec:low-level-fasta-fastq}, for a large FASTA 59or FASTQ file for speed you would be better off not using the high-level 60\verb|SeqIO| interface, but working directly with strings. This next 61example shows how to do this with FASTQ files -- it is more complicated: 62 63%not a doctest to avoid temp files being left behind, also no >>> 64%makes it easier to copy and paste the example to a script file. 65\begin{minted}{python} 66from Bio.SeqIO.QualityIO import FastqGeneralIterator 67 68input_file = "big_file.fastq" 69id_file = "short_list.txt" 70output_file = "short_list.fastq" 71 72with open(id_file) as id_handle: 73 # Taking first word on each line as an identifer 74 wanted = set(line.rstrip("\n").split(None, 1)[0] for line in id_handle) 75print("Found %i unique identifiers in %s" % (len(wanted), id_file)) 76 77with open(input_file) as in_handle: 78 with open(output_file, "w") as out_handle: 79 for title, seq, qual in FastqGeneralIterator(in_handle): 80 # The ID is the first word in the title line (after the @ sign): 81 if title.split(None, 1)[0] in wanted: 82 # This produces a standard 4-line FASTQ entry: 83 out_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual)) 84 count += 1 85print("Saved %i records from %s to %s" % (count, input_file, output_file)) 86if count < len(wanted): 87 print("Warning %i IDs not found in %s" % (len(wanted) - count, input_file)) 88\end{minted} 89 90\subsection{Producing randomised genomes} 91 92Let's suppose you are looking at genome sequence, hunting for some sequence 93feature -- maybe extreme local GC\% bias, or possible restriction digest sites. 94Once you've got your Python code working on the real genome it may be sensible 95to try running the same search on randomised versions of the same genome for 96statistical analysis (after all, any ``features'' you've found could just be 97there just by chance). 98 99For this discussion, we'll use the GenBank file for the pPCP1 plasmid from 100\textit{Yersinia pestis biovar Microtus}. The file is included with the 101Biopython unit tests under the GenBank folder, or you can get it from our 102website, \href{https://raw.githubusercontent.com/biopython/biopython/master/Tests/GenBank/NC_005816.gb} 103{\texttt{NC\_005816.gb}}. 104This file contains one and only one record, so we can read it in as a 105\verb|SeqRecord| using the \verb|Bio.SeqIO.read()| function: 106 107%doctest ../Tests/GenBank 108\begin{minted}{pycon} 109>>> from Bio import SeqIO 110>>> original_rec = SeqIO.read("NC_005816.gb", "genbank") 111\end{minted} 112 113So, how can we generate a shuffled versions of the original sequence? I would 114use the built in Python \verb|random| module for this, in particular the function 115\verb|random.shuffle| -- but this works on a Python list. Our sequence is a 116\verb|Seq| object, so in order to shuffle it we need to turn it into a list: 117 118%cont-doctest 119\begin{minted}{pycon} 120>>> import random 121>>> nuc_list = list(original_rec.seq) 122>>> random.shuffle(nuc_list) # acts in situ! 123\end{minted} 124 125Now, in order to use \verb|Bio.SeqIO| to output the shuffled sequence, we need 126to construct a new \verb|SeqRecord| with a new \verb|Seq| object using this 127shuffled list. In order to do this, we need to turn the list of nucleotides 128(single letter strings) into a long string -- the standard Python way to do 129this is with the string object's join method. 130 131%cont-doctest 132\begin{minted}{pycon} 133>>> from Bio.Seq import Seq 134>>> from Bio.SeqRecord import SeqRecord 135>>> shuffled_rec = SeqRecord(Seq("".join(nuc_list)), 136... id="Shuffled", description="Based on %s" % original_rec.id) 137... 138\end{minted} 139 140Let's put all these pieces together to make a complete Python script which 141generates a single FASTA file containing 30 randomly shuffled versions of 142the original sequence. 143 144This first version just uses a big for loop and writes out the records one by one 145(using the \verb|SeqRecord|'s format method described in 146Section~\ref{sec:Bio.SeqIO-and-StringIO}): 147 148\begin{minted}{python} 149import random 150from Bio.Seq import Seq 151from Bio.SeqRecord import SeqRecord 152from Bio import SeqIO 153 154original_rec = SeqIO.read("NC_005816.gb", "genbank") 155 156with open("shuffled.fasta", "w") as output_handle: 157 for i in range(30): 158 nuc_list = list(original_rec.seq) 159 random.shuffle(nuc_list) 160 shuffled_rec = SeqRecord( 161 Seq("".join(nuc_list)), 162 id="Shuffled%i" % (i + 1), 163 description="Based on %s" % original_rec.id, 164 ) 165 output_handle.write(shuffled_rec.format("fasta")) 166\end{minted} 167 168Personally I prefer the following version using a function to shuffle the record 169and a generator expression instead of the for loop: 170 171\begin{minted}{python} 172import random 173from Bio.Seq import Seq 174from Bio.SeqRecord import SeqRecord 175from Bio import SeqIO 176 177 178def make_shuffle_record(record, new_id): 179 nuc_list = list(record.seq) 180 random.shuffle(nuc_list) 181 return SeqRecord( 182 Seq("".join(nuc_list)), id=new_id, description="Based on %s" % original_rec.id, 183 ) 184 185 186original_rec = SeqIO.read("NC_005816.gb", "genbank") 187shuffled_recs = ( 188 make_shuffle_record(original_rec, "Shuffled%i" % (i + 1)) for i in range(30) 189) 190 191SeqIO.write(shuffled_recs, "shuffled.fasta", "fasta") 192\end{minted} 193 194\subsection{Translating a FASTA file of CDS entries} 195\label{sec:SeqIO-translate} 196Suppose you've got an input file of CDS entries for some organism, and you 197want to generate a new FASTA file containing their protein sequences. i.e. 198Take each nucleotide sequence from the original file, and translate it. 199Back in Section~\ref{sec:translation} we saw how to use the \verb|Seq| 200object's \verb|translate method|, and the optional \verb|cds| argument 201which enables correct translation of alternative start codons. 202 203We can combine this with \verb|Bio.SeqIO| as 204shown in the reverse complement example in Section~\ref{sec:SeqIO-reverse-complement}. 205The key point is that for each nucleotide \verb|SeqRecord|, we need to create 206a protein \verb|SeqRecord| - and take care of naming it. 207 208You can write you own function to do this, choosing suitable protein identifiers 209for your sequences, and the appropriate genetic code. In this example we just 210use the default table and add a prefix to the identifier: 211 212\begin{minted}{python} 213from Bio.SeqRecord import SeqRecord 214 215 216def make_protein_record(nuc_record): 217 """Returns a new SeqRecord with the translated sequence (default table).""" 218 return SeqRecord( 219 seq=nuc_record.seq.translate(cds=True), 220 id="trans_" + nuc_record.id, 221 description="translation of CDS, using default table", 222 ) 223\end{minted} 224 225We can then use this function to turn the input nucleotide records into protein 226records ready for output. An elegant way and memory efficient way to do this 227is with a generator expression: 228 229\begin{minted}{python} 230from Bio import SeqIO 231 232proteins = ( 233 make_protein_record(nuc_rec) 234 for nuc_rec in SeqIO.parse("coding_sequences.fasta", "fasta") 235) 236SeqIO.write(proteins, "translations.fasta", "fasta") 237\end{minted} 238 239This should work on any FASTA file of complete coding sequences. 240If you are working on partial coding sequences, you may prefer to use 241\verb|nuc_record.seq.translate(to_stop=True)| in the example above, as 242this wouldn't check for a valid start codon etc. 243 244\subsection{Making the sequences in a FASTA file upper case} 245 246Often you'll get data from collaborators as FASTA files, and sometimes the 247sequences can be in a mixture of upper and lower case. In some cases this is 248deliberate (e.g. lower case for poor quality regions), but usually it is not 249important. You may want to edit the file to make everything consistent (e.g. 250all upper case), and you can do this easily using the \verb|upper()| method 251of the \verb|SeqRecord| object (added in Biopython 1.55): 252 253\begin{minted}{python} 254from Bio import SeqIO 255 256records = (rec.upper() for rec in SeqIO.parse("mixed.fas", "fasta")) 257count = SeqIO.write(records, "upper.fas", "fasta") 258print("Converted %i records to upper case" % count) 259\end{minted} 260 261How does this work? The first line is just importing the \verb|Bio.SeqIO| 262module. The second line is the interesting bit -- this is a Python 263generator expression which gives an upper case version of each record 264parsed from the input file (\texttt{mixed.fas}). In the third line we give 265this generator expression to the \verb|Bio.SeqIO.write()| function and it 266saves the new upper cases records to our output file (\texttt{upper.fas}). 267 268The reason we use a generator expression (rather than a list or list 269comprehension) is this means only one record is kept in memory at a time. 270This can be really important if you are dealing with large files with 271millions of entries. 272 273\subsection{Sorting a sequence file} 274\label{sec:SeqIO-sort} 275 276Suppose you wanted to sort a sequence file by length (e.g. a set of 277contigs from an assembly), and you are working with a file format like 278FASTA or FASTQ which \verb|Bio.SeqIO| can read, write (and index). 279 280If the file is small enough, you can load it all into memory at once 281as a list of \verb|SeqRecord| objects, sort the list, and save it: 282 283\begin{minted}{python} 284from Bio import SeqIO 285 286records = list(SeqIO.parse("ls_orchid.fasta", "fasta")) 287records.sort(key=lambda r: len(r)) 288SeqIO.write(records, "sorted_orchids.fasta", "fasta") 289\end{minted} 290 291The only clever bit is specifying a comparison method for how to 292sort the records (here we sort them by length). If you wanted the 293longest records first, you could flip the comparison or use the 294reverse argument: 295 296\begin{minted}{python} 297from Bio import SeqIO 298 299records = list(SeqIO.parse("ls_orchid.fasta", "fasta")) 300records.sort(key=lambda r: -len(r)) 301SeqIO.write(records, "sorted_orchids.fasta", "fasta") 302\end{minted} 303 304Now that's pretty straight forward - but what happens if you have a 305very large file and you can't load it all into memory like this? 306For example, you might have some next-generation sequencing reads 307to sort by length. This can be solved using the 308\verb|Bio.SeqIO.index()| function. 309 310\begin{minted}{python} 311from Bio import SeqIO 312 313# Get the lengths and ids, and sort on length 314len_and_ids = sorted( 315 (len(rec), rec.id) for rec in SeqIO.parse("ls_orchid.fasta", "fasta") 316) 317ids = reversed([id for (length, id) in len_and_ids]) 318del len_and_ids # free this memory 319record_index = SeqIO.index("ls_orchid.fasta", "fasta") 320records = (record_index[id] for id in ids) 321SeqIO.write(records, "sorted.fasta", "fasta") 322\end{minted} 323 324First we scan through the file once using \verb|Bio.SeqIO.parse()|, 325recording the record identifiers and their lengths in a list of tuples. 326We then sort this list to get them in length order, and discard the lengths. 327Using this sorted list of identifiers \verb|Bio.SeqIO.index()| allows us to 328retrieve the records one by one, and we pass them to \verb|Bio.SeqIO.write()| 329for output. 330 331These examples all use \verb|Bio.SeqIO| to parse the records into 332\verb|SeqRecord| objects which are output using \verb|Bio.SeqIO.write()|. 333What if you want to sort a file format which \verb|Bio.SeqIO.write()| doesn't 334support, like the plain text SwissProt format? Here is an alternative 335solution using the \verb|get_raw()| method added to \verb|Bio.SeqIO.index()| 336in Biopython 1.54 (see Section~\ref{sec:seqio-index-getraw}). 337 338\begin{minted}{python} 339from Bio import SeqIO 340 341# Get the lengths and ids, and sort on length 342len_and_ids = sorted( 343 (len(rec), rec.id) for rec in SeqIO.parse("ls_orchid.fasta", "fasta") 344) 345ids = reversed([id for (length, id) in len_and_ids]) 346del len_and_ids # free this memory 347 348record_index = SeqIO.index("ls_orchid.fasta", "fasta") 349with open("sorted.fasta", "wb") as out_handle: 350 for id in ids: 351 out_handle.write(record_index.get_raw(id)) 352\end{minted} 353 354Note with Python 3 onwards, we have to open the file for writing in 355binary mode because the \verb|get_raw()| method returns \verb|bytes| objects. 356 357As a bonus, because it doesn't parse the data into \verb|SeqRecord| objects 358a second time it should be faster. If you only want to use this with FASTA 359format, we can speed this up one step further by using the low-level FASTA 360parser to get the record identifiers and lengths: 361 362\begin{minted}{python} 363from Bio.SeqIO.FastaIO import SimpleFastaParser 364from Bio import SeqIO 365 366# Get the lengths and ids, and sort on length 367with open("ls_orchid.fasta") as in_handle: 368 len_and_ids = sorted( 369 (len(seq), title.split(None, 1)[0]) 370 for title, seq in SimpleFastaParser(in_handle) 371 ) 372ids = reversed([id for (length, id) in len_and_ids]) 373del len_and_ids # free this memory 374 375record_index = SeqIO.index("ls_orchid.fasta", "fasta") 376with open("sorted.fasta", "wb") as out_handle: 377 for id in ids: 378 out_handle.write(record_index.get_raw(id)) 379\end{minted} 380 381\subsection{Simple quality filtering for FASTQ files} 382\label{sec:FASTQ-filtering-example} 383 384The FASTQ file format was introduced at Sanger and is now widely used for 385holding nucleotide sequencing reads together with their quality scores. 386FASTQ files (and the related QUAL files) are an excellent example of 387per-letter-annotation, because for each nucleotide in the sequence there is 388an associated quality score. Any per-letter-annotation is held in a 389\verb|SeqRecord| in the \verb|letter_annotations| dictionary as a list, 390tuple or string (with the same number of elements as the sequence length). 391 392One common task is taking a large set of sequencing reads and filtering them 393(or cropping them) based on their quality scores. 394The following example is very simplistic, but should illustrate the basics of 395working with quality data in a \verb|SeqRecord| object. All we are going to 396do here is read in a file of FASTQ data, and filter it to pick out only those 397records whose PHRED quality scores are all above some threshold (here 20). 398 399For this example we'll use some real data downloaded from the ENA sequence 400read archive, 401\url{ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR020/SRR020192/SRR020192.fastq.gz} 402(2MB) which unzips to a 19MB file \texttt{SRR020192.fastq}. This is some 403Roche 454 GS FLX single end data from virus infected California sea lions 404(see \url{https://www.ebi.ac.uk/ena/data/view/SRS004476} for details). 405 406First, let's count the reads: 407 408\begin{minted}{python} 409from Bio import SeqIO 410 411count = 0 412for rec in SeqIO.parse("SRR020192.fastq", "fastq"): 413 count += 1 414print("%i reads" % count) 415\end{minted} 416 417\noindent Now let's do a simple filtering for a minimum PHRED quality of 20: 418 419\begin{minted}{python} 420from Bio import SeqIO 421 422good_reads = ( 423 rec 424 for rec in SeqIO.parse("SRR020192.fastq", "fastq") 425 if min(rec.letter_annotations["phred_quality"]) >= 20 426) 427count = SeqIO.write(good_reads, "good_quality.fastq", "fastq") 428print("Saved %i reads" % count) 429\end{minted} 430 431\noindent This pulled out only $14580$ reads out of the $41892$ present. 432A more sensible thing to do would be to quality trim the reads, but this 433is intended as an example only. 434 435FASTQ files can contain millions of entries, so it is best to avoid loading 436them all into memory at once. This example uses a generator expression, which 437means only one \verb|SeqRecord| is created at a time - avoiding any memory 438limitations. 439 440Note that it would be faster to use the low-level \verb|FastqGeneralIterator| 441parser here (see Section~\ref{sec:low-level-fasta-fastq}), but that does not 442turn the quality string into integer scores. 443 444\subsection{Trimming off primer sequences} 445\label{sec:FASTQ-slicing-off-primer} 446 447For this example we're going to pretend that \texttt{GATGACGGTGT} is a 5' primer 448sequence we want to look for in some FASTQ formatted read data. As in the example 449above, we'll use the \texttt{SRR020192.fastq} file downloaded from the ENA 450(\url{ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR020/SRR020192/SRR020192.fastq.gz}). 451 452By using the main \verb|Bio.SeqIO| interface, the same approach would work with 453any other supported file format (e.g. FASTA files). However, for large FASTQ 454files it would be faster the low-level \verb|FastqGeneralIterator| parser here 455(see the earlier example, and Section~\ref{sec:low-level-fasta-fastq}). 456 457This code uses \verb|Bio.SeqIO| with a generator expression (to avoid loading 458all the sequences into memory at once), and the \verb|Seq| object's 459\verb|startswith| method to see if the read starts with the primer sequence: 460 461\begin{minted}{python} 462from Bio import SeqIO 463 464primer_reads = ( 465 rec 466 for rec in SeqIO.parse("SRR020192.fastq", "fastq") 467 if rec.seq.startswith("GATGACGGTGT") 468) 469count = SeqIO.write(primer_reads, "with_primer.fastq", "fastq") 470print("Saved %i reads" % count) 471\end{minted} 472 473\noindent That should find $13819$ reads from \texttt{SRR014849.fastq} and save them to 474a new FASTQ file, \texttt{with\_primer.fastq}. 475 476Now suppose that instead you wanted to make a FASTQ file containing these reads 477but with the primer sequence removed? That's just a small change as we can slice the 478\verb|SeqRecord| (see Section~\ref{sec:SeqRecord-slicing}) to remove the first eleven 479letters (the length of our primer): 480 481\begin{minted}{python} 482from Bio import SeqIO 483 484trimmed_primer_reads = ( 485 rec[11:] 486 for rec in SeqIO.parse("SRR020192.fastq", "fastq") 487 if rec.seq.startswith("GATGACGGTGT") 488) 489count = SeqIO.write(trimmed_primer_reads, "with_primer_trimmed.fastq", "fastq") 490print("Saved %i reads" % count) 491\end{minted} 492 493\noindent Again, that should pull out the $13819$ reads from \texttt{SRR020192.fastq}, 494but this time strip off the first ten characters, and save them to another new 495FASTQ file, \texttt{with\_primer\_trimmed.fastq}. 496 497Now, suppose you want to create a new FASTQ file where these reads have 498their primer removed, but all the other reads are kept as they were? 499If we want to still use a generator expression, it is probably clearest to 500define our own trim function: 501 502\begin{minted}{python} 503from Bio import SeqIO 504 505 506def trim_primer(record, primer): 507 if record.seq.startswith(primer): 508 return record[len(primer) :] 509 else: 510 return record 511 512 513trimmed_reads = ( 514 trim_primer(record, "GATGACGGTGT") 515 for record in SeqIO.parse("SRR020192.fastq", "fastq") 516) 517count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq") 518print("Saved %i reads" % count) 519\end{minted} 520 521This takes longer, as this time the output file contains all $41892$ reads. 522Again, we're used a generator expression to avoid any memory problems. 523You could alternatively use a generator function rather than a generator 524expression. 525 526\begin{minted}{python} 527from Bio import SeqIO 528 529 530def trim_primers(records, primer): 531 """Removes perfect primer sequences at start of reads. 532 533 This is a generator function, the records argument should 534 be a list or iterator returning SeqRecord objects. 535 """ 536 len_primer = len(primer) # cache this for later 537 for record in records: 538 if record.seq.startswith(primer): 539 yield record[len_primer:] 540 else: 541 yield record 542 543 544original_reads = SeqIO.parse("SRR020192.fastq", "fastq") 545trimmed_reads = trim_primers(original_reads, "GATGACGGTGT") 546count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq") 547print("Saved %i reads" % count) 548\end{minted} 549 550This form is more flexible if you want to do something more complicated 551where only some of the records are retained -- as shown in the next example. 552 553\subsection{Trimming off adaptor sequences} 554\label{sec:FASTQ-slicing-off-adaptor} 555 556This is essentially a simple extension to the previous example. We are going 557to going to pretend \texttt{GATGACGGTGT} is an adaptor sequence in some FASTQ 558formatted read data, again the \texttt{SRR020192.fastq} file from the NCBI 559(\url{ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR020/SRR020192/SRR020192.fastq.gz}). 560 561This time however, we will look for the sequence \emph{anywhere} in the reads, 562not just at the very beginning: 563 564\begin{minted}{python} 565from Bio import SeqIO 566 567 568def trim_adaptors(records, adaptor): 569 """Trims perfect adaptor sequences. 570 571 This is a generator function, the records argument should 572 be a list or iterator returning SeqRecord objects. 573 """ 574 len_adaptor = len(adaptor) # cache this for later 575 for record in records: 576 index = record.seq.find(adaptor) 577 if index == -1: 578 # adaptor not found, so won't trim 579 yield record 580 else: 581 # trim off the adaptor 582 yield record[index + len_adaptor :] 583 584 585original_reads = SeqIO.parse("SRR020192.fastq", "fastq") 586trimmed_reads = trim_adaptors(original_reads, "GATGACGGTGT") 587count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq") 588print("Saved %i reads" % count) 589\end{minted} 590 591Because we are using a FASTQ input file in this example, the \verb|SeqRecord| 592objects have per-letter-annotation for the quality scores. By slicing the 593\verb|SeqRecord| object the appropriate scores are used on the trimmed 594records, so we can output them as a FASTQ file too. 595 596Compared to the output of the previous example where we only looked for 597a primer/adaptor at the start of each read, you may find some of the 598trimmed reads are quite short after trimming (e.g. if the adaptor was 599found in the middle rather than near the start). So, let's add a minimum 600length requirement as well: 601 602\begin{minted}{python} 603from Bio import SeqIO 604 605 606def trim_adaptors(records, adaptor, min_len): 607 """Trims perfect adaptor sequences, checks read length. 608 609 This is a generator function, the records argument should 610 be a list or iterator returning SeqRecord objects. 611 """ 612 len_adaptor = len(adaptor) # cache this for later 613 for record in records: 614 len_record = len(record) # cache this for later 615 if len(record) < min_len: 616 # Too short to keep 617 continue 618 index = record.seq.find(adaptor) 619 if index == -1: 620 # adaptor not found, so won't trim 621 yield record 622 elif len_record - index - len_adaptor >= min_len: 623 # after trimming this will still be long enough 624 yield record[index + len_adaptor :] 625 626 627original_reads = SeqIO.parse("SRR020192.fastq", "fastq") 628trimmed_reads = trim_adaptors(original_reads, "GATGACGGTGT", 100) 629count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq") 630print("Saved %i reads" % count) 631\end{minted} 632 633By changing the format names, you could apply this to FASTA files instead. 634This code also could be extended to do a fuzzy match instead of an exact 635match (maybe using a pairwise alignment, or taking into account the read 636quality scores), but that will be much slower. 637 638\subsection{Converting FASTQ files} 639\label{sec:SeqIO-fastq-conversion} 640 641Back in Section~\ref{sec:SeqIO-conversion} we showed how to use 642\verb|Bio.SeqIO| to convert between two file formats. Here we'll go into a 643little more detail regarding FASTQ files which are used in second generation 644DNA sequencing. Please refer to Cock \textit{et al.} (2009) \cite{cock2010} 645for a longer description. FASTQ files store both the DNA sequence (as a string) 646and the associated read qualities. 647 648PHRED scores (used in most FASTQ files, and also in QUAL files, ACE files 649and SFF files) have become a \textit{de facto} standard for representing 650the probability of a sequencing error (here denoted by $P_e$) at a given 651base using a simple base ten log transformation: 652 653\begin{equation} 654Q_{\textrm{PHRED}} = - 10 \times \textrm{log}_{10} ( P_e ) 655\end{equation} 656 657This means a wrong read ($P_e = 1$) gets a PHRED quality of $0$, while a very 658good read like $P_e = 0.00001$ gets a PHRED quality of $50$. While for raw 659sequencing data qualities higher than this are rare, with post processing 660such as read mapping or assembly, qualities of up to about $90$ are possible 661(indeed, the MAQ tool allows for PHRED scores in the range 0 to 93 inclusive). 662 663The FASTQ format has the potential to become a \textit{de facto} standard for 664storing the letters and quality scores for a sequencing read in a single plain 665text file. The only fly in the ointment is that there are at least three 666versions of the FASTQ format which are incompatible and difficult to 667distinguish... 668 669\begin{enumerate} 670\item The original Sanger FASTQ format uses PHRED qualities encoded with an 671ASCII offset of 33. The NCBI are using this format in their Short Read 672Archive. We call this the \texttt{fastq} (or \texttt{fastq-sanger}) format 673in \verb|Bio.SeqIO|. 674\item Solexa (later bought by Illumina) introduced their own version using 675Solexa qualities encoded with an ASCII offset of 64. We call this the 676\texttt{fastq-solexa} format. 677\item Illumina pipeline 1.3 onwards produces FASTQ files with PHRED qualities 678(which is more consistent), but encoded with an ASCII offset of 64. We call 679this the \texttt{fastq-illumina} format. 680\end{enumerate} 681 682The Solexa quality scores are defined using a different log transformation: 683 684\begin{equation} 685Q_{\textrm{Solexa}} = - 10 \times \textrm{log}_{10} \left( \frac{P_e}{1-P_e} \right) 686\end{equation} 687 688Given Solexa/Illumina have now moved to using PHRED scores in version 1.3 of 689their pipeline, the Solexa quality scores will gradually fall out of use. 690If you equate the error estimates ($P_e$) these two equations allow conversion 691between the two scoring systems - and Biopython includes functions to do this 692in the \verb|Bio.SeqIO.QualityIO| module, which are called if you use 693\verb|Bio.SeqIO| to convert an old Solexa/Illumina file into a standard Sanger 694FASTQ file: 695 696\begin{minted}{python} 697from Bio import SeqIO 698 699SeqIO.convert("solexa.fastq", "fastq-solexa", "standard.fastq", "fastq") 700\end{minted} 701 702If you want to convert a new Illumina 1.3+ FASTQ file, all that gets changed 703is the ASCII offset because although encoded differently the scores are all 704PHRED qualities: 705 706\begin{minted}{python} 707from Bio import SeqIO 708 709SeqIO.convert("illumina.fastq", "fastq-illumina", "standard.fastq", "fastq") 710\end{minted} 711 712Note that using \verb|Bio.SeqIO.convert()| like this is \emph{much} faster 713than combining \verb|Bio.SeqIO.parse()| and \verb|Bio.SeqIO.write()| 714because optimised code is used for converting between FASTQ variants 715(and also for FASTQ to FASTA conversion). 716 717For good quality reads, PHRED and Solexa scores are approximately equal, 718which means since both the \texttt{fasta-solexa} and \texttt{fastq-illumina} 719formats use an ASCII offset of 64 the files are almost the same. This was a 720deliberate design choice by Illumina, meaning applications expecting the old 721\texttt{fasta-solexa} style files will probably be OK using the newer 722\texttt{fastq-illumina} files (on good data). Of course, both variants are 723very different from the original FASTQ standard as used by Sanger, 724the NCBI, and elsewhere (format name \texttt{fastq} or \texttt{fastq-sanger}). 725 726For more details, see the built in help (also \href{http://www.biopython.org/docs/\bpversion/api/Bio.SeqIO.QualityIO.html}{online}): 727 728\begin{minted}{pycon} 729>>> from Bio.SeqIO import QualityIO 730>>> help(QualityIO) 731... 732\end{minted} 733 734\subsection{Converting FASTA and QUAL files into FASTQ files} 735\label{sec:SeqIO-fasta-qual-conversion} 736 737FASTQ files hold \emph{both} sequences and their quality strings. 738FASTA files hold \emph{just} sequences, while QUAL files hold \emph{just} 739the qualities. Therefore a single FASTQ file can be converted to or from 740\emph{paired} FASTA and QUAL files. 741 742Going from FASTQ to FASTA is easy: 743 744\begin{minted}{python} 745from Bio import SeqIO 746 747SeqIO.convert("example.fastq", "fastq", "example.fasta", "fasta") 748\end{minted} 749 750Going from FASTQ to QUAL is also easy: 751 752\begin{minted}{python} 753from Bio import SeqIO 754 755SeqIO.convert("example.fastq", "fastq", "example.qual", "qual") 756\end{minted} 757 758However, the reverse is a little more tricky. You can use \verb|Bio.SeqIO.parse()| 759to iterate over the records in a \emph{single} file, but in this case we have 760two input files. There are several strategies possible, but assuming that the 761two files are really paired the most memory efficient way is to loop over both 762together. The code is a little fiddly, so we provide a function called 763\verb|PairedFastaQualIterator| in the \verb|Bio.SeqIO.QualityIO| module to do 764this. This takes two handles (the FASTA file and the QUAL file) and returns 765a \verb|SeqRecord| iterator: 766 767\begin{minted}{python} 768from Bio.SeqIO.QualityIO import PairedFastaQualIterator 769 770for record in PairedFastaQualIterator(open("example.fasta"), open("example.qual")): 771 print(record) 772\end{minted} 773 774This function will check that the FASTA and QUAL files are consistent (e.g. 775the records are in the same order, and have the same sequence length). 776You can combine this with the \verb|Bio.SeqIO.write()| function to convert a 777pair of FASTA and QUAL files into a single FASTQ files: 778 779\begin{minted}{python} 780from Bio import SeqIO 781from Bio.SeqIO.QualityIO import PairedFastaQualIterator 782 783with open("example.fasta") as f_handle, open("example.qual") as q_handle: 784 records = PairedFastaQualIterator(f_handle, q_handle) 785 count = SeqIO.write(records, "temp.fastq", "fastq") 786print("Converted %i records" % count) 787\end{minted} 788 789\subsection{Indexing a FASTQ file} 790\label{sec:fastq-indexing} 791 792FASTQ files are usually very large, with millions of reads in them. Due to the 793sheer amount of data, you can't load all the records into memory at once. 794This is why the examples above (filtering and trimming) iterate over the file 795looking at just one \verb|SeqRecord| at a time. 796 797However, sometimes you can't use a big loop or an iterator - you may need 798random access to the reads. Here the \verb|Bio.SeqIO.index()| function 799may prove very helpful, as it allows you to access any read in the FASTQ file 800by its name (see Section~\ref{sec:SeqIO-index}). 801 802Again we'll use the \texttt{SRR020192.fastq} file from the ENA 803(\url{ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR020/SRR020192/SRR020192.fastq.gz}), 804although this is actually quite a small FASTQ file with less than $50,000$ reads: 805 806\begin{minted}{pycon} 807>>> from Bio import SeqIO 808>>> fq_dict = SeqIO.index("SRR020192.fastq", "fastq") 809>>> len(fq_dict) 81041892 811>>> list(fq_dict.keys())[:4] 812['SRR020192.38240', 'SRR020192.23181', 'SRR020192.40568', 'SRR020192.23186'] 813>>> fq_dict["SRR020192.23186"].seq 814Seq('GTCCCAGTATTCGGATTTGTCTGCCAAAACAATGAAATTGACACAGTTTACAAC...CCG') 815\end{minted} 816 817When testing this on a FASTQ file with seven million reads, 818indexing took about a minute, but record access was almost instant. 819 820The sister function \verb|Bio.SeqIO.index_db()| lets you save the index 821to an SQLite3 database file for near instantaneous reuse - see 822Section~\ref{sec:SeqIO-index} for more details. 823 824The example in Section~\ref{sec:SeqIO-sort} show how you can use the 825\verb|Bio.SeqIO.index()| function to sort a large FASTA file -- this 826could also be used on FASTQ files. 827 828\subsection{Converting SFF files} 829\label{sec:SeqIO-sff-conversion} 830 831If you work with 454 (Roche) sequence data, you will probably have access 832to the raw data as a Standard Flowgram Format (SFF) file. This contains 833the sequence reads (called bases) with quality scores and the original 834flow information. 835 836A common task is to convert from SFF to a pair of FASTA and QUAL files, 837or to a single FASTQ file. These operations are trivial using the 838\verb|Bio.SeqIO.convert()| function (see Section~\ref{sec:SeqIO-conversion}): 839 840\begin{minted}{pycon} 841>>> from Bio import SeqIO 842>>> SeqIO.convert("E3MFGYR02_random_10_reads.sff", "sff", "reads.fasta", "fasta") 84310 844>>> SeqIO.convert("E3MFGYR02_random_10_reads.sff", "sff", "reads.qual", "qual") 84510 846>>> SeqIO.convert("E3MFGYR02_random_10_reads.sff", "sff", "reads.fastq", "fastq") 84710 848\end{minted} 849 850\noindent Remember the convert function returns the number of records, in 851this example just ten. This will give you the \emph{untrimmed} reads, where 852the leading and trailing poor quality sequence or adaptor will be in lower 853case. If you want the \emph{trimmed} reads (using the clipping information 854recorded within the SFF file) use this: 855 856\begin{minted}{pycon} 857>>> from Bio import SeqIO 858>>> SeqIO.convert("E3MFGYR02_random_10_reads.sff", "sff-trim", "trimmed.fasta", "fasta") 85910 860>>> SeqIO.convert("E3MFGYR02_random_10_reads.sff", "sff-trim", "trimmed.qual", "qual") 86110 862>>> SeqIO.convert("E3MFGYR02_random_10_reads.sff", "sff-trim", "trimmed.fastq", "fastq") 86310 864\end{minted} 865 866If you run Linux, you could ask Roche for a copy of their ``off instrument'' 867tools (often referred to as the Newbler tools). This offers an alternative way to 868do SFF to FASTA or QUAL conversion at the command line (but currently FASTQ output 869is not supported), e.g. 870 871\begin{minted}{console} 872$ sffinfo -seq -notrim E3MFGYR02_random_10_reads.sff > reads.fasta 873$ sffinfo -qual -notrim E3MFGYR02_random_10_reads.sff > reads.qual 874$ sffinfo -seq -trim E3MFGYR02_random_10_reads.sff > trimmed.fasta 875$ sffinfo -qual -trim E3MFGYR02_random_10_reads.sff > trimmed.qual 876\end{minted} 877 878\noindent The way Biopython uses mixed case sequence strings to represent 879the trimming points deliberately mimics what the Roche tools do. 880 881For more information on the Biopython SFF support, consult the built in help: 882 883\begin{minted}{pycon} 884>>> from Bio.SeqIO import SffIO 885>>> help(SffIO) 886... 887\end{minted} 888 889\subsection{Identifying open reading frames} 890 891A very simplistic first step at identifying possible genes is to look for 892open reading frames (ORFs). By this we mean look in all six frames for long 893regions without stop codons -- an ORF is just a region of nucleotides with 894no in frame stop codons. 895 896Of course, to find a gene you would also need to worry about locating a start 897codon, possible promoters -- and in Eukaryotes there are introns to worry about 898too. However, this approach is still useful in viruses and Prokaryotes. 899 900To show how you might approach this with Biopython, we'll need a sequence to 901search, and as an example we'll again use the bacterial plasmid -- although 902this time we'll start with a plain FASTA file with no pre-marked genes: 903\href{https://raw.githubusercontent.com/biopython/biopython/master/Tests/GenBank/NC_005816.fna} 904{\texttt{NC\_005816.fna}}. This is a bacterial sequence, so we'll want to use 905NCBI codon table 11 (see Section~\ref{sec:translation} about translation). 906 907%doctest ../Tests/GenBank 908\begin{minted}{pycon} 909>>> from Bio import SeqIO 910>>> record = SeqIO.read("NC_005816.fna", "fasta") 911>>> table = 11 912>>> min_pro_len = 100 913\end{minted} 914 915Here is a neat trick using the \verb|Seq| object's \verb|split| method to 916get a list of all the possible ORF translations in the six reading frames: 917 918%cont-doctest 919\begin{minted}{pycon} 920>>> for strand, nuc in [(+1, record.seq), (-1, record.seq.reverse_complement())]: 921... for frame in range(3): 922... length = 3 * ((len(record)-frame) // 3) #Multiple of three 923... for pro in nuc[frame:frame+length].translate(table).split("*"): 924... if len(pro) >= min_pro_len: 925... print("%s...%s - length %i, strand %i, frame %i" \ 926... % (pro[:30], pro[-3:], len(pro), strand, frame)) 927GCLMKKSSIVATIITILSGSANAASSQLIP...YRF - length 315, strand 1, frame 0 928KSGELRQTPPASSTLHLRLILQRSGVMMEL...NPE - length 285, strand 1, frame 1 929GLNCSFFSICNWKFIDYINRLFQIIYLCKN...YYH - length 176, strand 1, frame 1 930VKKILYIKALFLCTVIKLRRFIFSVNNMKF...DLP - length 165, strand 1, frame 1 931NQIQGVICSPDSGEFMVTFETVMEIKILHK...GVA - length 355, strand 1, frame 2 932RRKEHVSKKRRPQKRPRRRRFFHRLRPPDE...PTR - length 128, strand 1, frame 2 933TGKQNSCQMSAIWQLRQNTATKTRQNRARI...AIK - length 100, strand 1, frame 2 934QGSGYAFPHASILSGIAMSHFYFLVLHAVK...CSD - length 114, strand -1, frame 0 935IYSTSEHTGEQVMRTLDEVIASRSPESQTR...FHV - length 111, strand -1, frame 0 936WGKLQVIGLSMWMVLFSQRFDDWLNEQEDA...ESK - length 125, strand -1, frame 1 937RGIFMSDTMVVNGSGGVPAFLFSGSTLSSY...LLK - length 361, strand -1, frame 1 938WDVKTVTGVLHHPFHLTFSLCPEGATQSGR...VKR - length 111, strand -1, frame 1 939LSHTVTDFTDQMAQVGLCQCVNVFLDEVTG...KAA - length 107, strand -1, frame 2 940RALTGLSAPGIRSQTSCDRLRELRYVPVSL...PLQ - length 119, strand -1, frame 2 941\end{minted} 942 943Note that here we are counting the frames from the 5' end (start) of 944\emph{each} strand. It is sometimes easier to always count from the 5' end 945(start) of the \emph{forward} strand. 946 947You could easily edit the above loop based code to build up a list of the 948candidate proteins, or convert this to a list comprehension. Now, one thing 949this code doesn't do is keep track of where the proteins are. 950 951You could tackle this in several ways. For example, the following code tracks 952the locations in terms of the protein counting, and converts back to the 953parent sequence by multiplying by three, then adjusting for the frame and 954strand: 955 956\begin{minted}{python} 957from Bio import SeqIO 958 959record = SeqIO.read("NC_005816.gb", "genbank") 960table = 11 961min_pro_len = 100 962 963 964def find_orfs_with_trans(seq, trans_table, min_protein_length): 965 answer = [] 966 seq_len = len(seq) 967 for strand, nuc in [(+1, seq), (-1, seq.reverse_complement())]: 968 for frame in range(3): 969 trans = nuc[frame:].translate(trans_table) 970 trans_len = len(trans) 971 aa_start = 0 972 aa_end = 0 973 while aa_start < trans_len: 974 aa_end = trans.find("*", aa_start) 975 if aa_end == -1: 976 aa_end = trans_len 977 if aa_end - aa_start >= min_protein_length: 978 if strand == 1: 979 start = frame + aa_start * 3 980 end = min(seq_len, frame + aa_end * 3 + 3) 981 else: 982 start = seq_len - frame - aa_end * 3 - 3 983 end = seq_len - frame - aa_start * 3 984 answer.append((start, end, strand, trans[aa_start:aa_end])) 985 aa_start = aa_end + 1 986 answer.sort() 987 return answer 988 989 990orf_list = find_orfs_with_trans(record.seq, table, min_pro_len) 991for start, end, strand, pro in orf_list: 992 print( 993 "%s...%s - length %i, strand %i, %i:%i" 994 % (pro[:30], pro[-3:], len(pro), strand, start, end) 995 ) 996\end{minted} 997 998\noindent And the output: 999 1000\begin{minted}{text} 1001NQIQGVICSPDSGEFMVTFETVMEIKILHK...GVA - length 355, strand 1, 41:1109 1002WDVKTVTGVLHHPFHLTFSLCPEGATQSGR...VKR - length 111, strand -1, 491:827 1003KSGELRQTPPASSTLHLRLILQRSGVMMEL...NPE - length 285, strand 1, 1030:1888 1004RALTGLSAPGIRSQTSCDRLRELRYVPVSL...PLQ - length 119, strand -1, 2830:3190 1005RRKEHVSKKRRPQKRPRRRRFFHRLRPPDE...PTR - length 128, strand 1, 3470:3857 1006GLNCSFFSICNWKFIDYINRLFQIIYLCKN...YYH - length 176, strand 1, 4249:4780 1007RGIFMSDTMVVNGSGGVPAFLFSGSTLSSY...LLK - length 361, strand -1, 4814:5900 1008VKKILYIKALFLCTVIKLRRFIFSVNNMKF...DLP - length 165, strand 1, 5923:6421 1009LSHTVTDFTDQMAQVGLCQCVNVFLDEVTG...KAA - length 107, strand -1, 5974:6298 1010GCLMKKSSIVATIITILSGSANAASSQLIP...YRF - length 315, strand 1, 6654:7602 1011IYSTSEHTGEQVMRTLDEVIASRSPESQTR...FHV - length 111, strand -1, 7788:8124 1012WGKLQVIGLSMWMVLFSQRFDDWLNEQEDA...ESK - length 125, strand -1, 8087:8465 1013TGKQNSCQMSAIWQLRQNTATKTRQNRARI...AIK - length 100, strand 1, 8741:9044 1014QGSGYAFPHASILSGIAMSHFYFLVLHAVK...CSD - length 114, strand -1, 9264:9609 1015\end{minted} 1016 1017If you comment out the sort statement, then the protein sequences will be 1018shown in the same order as before, so you can check this is doing the same 1019thing. Here we have sorted them by location to make it easier to compare 1020to the actual annotation in the GenBank file (as visualised in 1021Section~\ref{sec:gd_nice_example}). 1022 1023If however all you want to find are the locations of the open reading frames, 1024then it is a waste of time to translate every possible codon, including doing 1025the reverse complement to search the reverse strand too. All you need to do 1026is search for the possible stop codons (and their reverse complements). Using 1027regular expressions is an obvious approach here (see the Python module 1028\verb|re|). These are an extremely powerful (but rather complex) way of 1029describing search strings, which are supported in lots of programming 1030languages and also command line tools like \texttt{grep} as well). You can 1031find whole books about this topic! 1032 1033 1034\section{Sequence parsing plus simple plots} 1035\label{sec:sequence-parsing-plus-pylab} 1036 1037This section shows some more examples of sequence parsing, using the 1038\verb|Bio.SeqIO| module described in Chapter~\ref{chapter:seqio}, 1039plus the Python library matplotlib's \verb|pylab| plotting interface 1040(see \href{https://matplotlib.org}{the matplotlib website 1041for a tutorial}). Note that to follow these examples you will need 1042matplotlib installed - but without it you can still try the data 1043parsing bits. 1044 1045\subsection{Histogram of sequence lengths} 1046 1047There are lots of times when you might want to visualise the distribution of sequence 1048lengths in a dataset -- for example the range of contig sizes in a genome assembly 1049project. In this example we'll reuse our orchid FASTA file 1050\href{https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/ls_orchid.fasta}{ls\_orchid.fasta} 1051which has only 94 sequences. 1052 1053First of all, we will use \verb|Bio.SeqIO| to parse the FASTA file and compile a list 1054of all the sequence lengths. You could do this with a for loop, but I find a list 1055comprehension more pleasing: 1056 1057\begin{minted}{pycon} 1058>>> from Bio import SeqIO 1059>>> sizes = [len(rec) for rec in SeqIO.parse("ls_orchid.fasta", "fasta")] 1060>>> len(sizes), min(sizes), max(sizes) 1061(94, 572, 789) 1062>>> sizes 1063[740, 753, 748, 744, 733, 718, 730, 704, 740, 709, 700, 726, ..., 592] 1064\end{minted} 1065 1066Now that we have the lengths of all the genes (as a list of integers), we can use the 1067matplotlib histogram function to display it. 1068 1069\begin{minted}{python} 1070from Bio import SeqIO 1071 1072sizes = [len(rec) for rec in SeqIO.parse("ls_orchid.fasta", "fasta")] 1073 1074import pylab 1075 1076pylab.hist(sizes, bins=20) 1077pylab.title( 1078 "%i orchid sequences\nLengths %i to %i" % (len(sizes), min(sizes), max(sizes)) 1079) 1080pylab.xlabel("Sequence length (bp)") 1081pylab.ylabel("Count") 1082pylab.show() 1083\end{minted} 1084 1085% 1086% Have a HTML version and a PDF version to display nicely... 1087% 1088\begin{htmlonly} 1089\noindent That should pop up a new window containing the following graph: 1090 1091\imgsrc[width=600, height=450]{images/hist_plot.png} 1092 1093\end{htmlonly} 1094% 1095% Now the PDF equivalent where we cannot always expect the figure 1096% to be positioned right next to the text, so will use a reference. 1097% 1098\begin{latexonly} 1099\begin{figure}[htbp] 1100\centering 1101\includegraphics[width=0.8\textwidth]{images/hist_plot.png} 1102\caption{Histogram of orchid sequence lengths.} 1103\label{fig:seq-len-hist} 1104\end{figure} 1105\noindent That should pop up a new window containing the graph 1106shown in Figure~\ref{fig:seq-len-hist}. 1107\end{latexonly} 1108% 1109% The text now continues... 1110% 1111Notice that most of these orchid sequences are about $740$ bp long, and there could be 1112two distinct classes of sequence here with a subset of shorter sequences. 1113 1114\emph{Tip:} Rather than using \verb|pylab.show()| to show the plot in a window, 1115you can also use \verb|pylab.savefig(...)| to save the figure to a file 1116(e.g. as a PNG or PDF). 1117 1118\subsection{Plot of sequence GC\%} 1119 1120Another easily calculated quantity of a nucleotide sequence is the GC\%. You might 1121want to look at the GC\% of all the genes in a bacterial genome for example, and 1122investigate any outliers which could have been recently acquired by horizontal gene 1123transfer. Again, for this example we'll reuse our orchid FASTA file 1124\href{https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/ls_orchid.fasta}{ls\_orchid.fasta}. 1125 1126First of all, we will use \verb|Bio.SeqIO| to parse the FASTA file and compile a list 1127of all the GC percentages. Again, you could do this with a for loop, but I prefer this: 1128 1129\begin{minted}{python} 1130from Bio import SeqIO 1131from Bio.SeqUtils import GC 1132 1133gc_values = sorted(GC(rec.seq) for rec in SeqIO.parse("ls_orchid.fasta", "fasta")) 1134\end{minted} 1135 1136Having read in each sequence and calculated the GC\%, we then sorted them into ascending 1137order. Now we'll take this list of floating point values and plot them with matplotlib: 1138 1139\begin{minted}{python} 1140import pylab 1141 1142pylab.plot(gc_values) 1143pylab.title( 1144 "%i orchid sequences\nGC%% %0.1f to %0.1f" 1145 % (len(gc_values), min(gc_values), max(gc_values)) 1146) 1147pylab.xlabel("Genes") 1148pylab.ylabel("GC%") 1149pylab.show() 1150\end{minted} 1151 1152% 1153% Have an HTML version and a PDF version to display nicely... 1154% 1155\begin{htmlonly} 1156\noindent As in the previous example, that should pop up a new window containing a graph: 1157 1158\imgsrc[width=600, height=450]{images/gc_plot.png} 1159 1160\end{htmlonly} 1161% 1162% Now the PDF equivalent where we cannot always expect the figure 1163% to be positioned right next to the text, so we'll use a reference. 1164% 1165\begin{latexonly} 1166\begin{figure}[htbp] 1167\centering 1168\includegraphics[width=0.8\textwidth]{images/gc_plot.png} 1169\caption{Histogram of orchid sequence lengths.} 1170\label{fig:seq-gc-plot} 1171\end{figure} 1172\noindent As in the previous example, that should pop up a new window with the graph shown in Figure~\ref{fig:seq-gc-plot}. 1173\end{latexonly} 1174% 1175% The text now continues... 1176% 1177If you tried this on the full set of genes from one organism, you'd probably get a much 1178smoother plot than this. 1179 1180\subsection{Nucleotide dot plots} 1181A dot plot is a way of visually comparing two nucleotide sequences for similarity to 1182each other. A sliding window is used to compare short sub-sequences to each other, 1183often with a mis-match threshold. Here for simplicity we'll only look for perfect 1184matches (shown in black 1185\begin{latexonly} 1186in Figure~\ref{fig:nuc-dot-plot}). 1187\end{latexonly} 1188\begin{htmlonly} 1189in the plot below). 1190\end{htmlonly} 1191 1192% 1193% Now the PDF equivalent where we cannot always expect the figure 1194% to be positioned right next to the text, so we'll use a reference. 1195% 1196\begin{latexonly} 1197\begin{figure}[htbp] 1198\centering 1199\includegraphics[width=0.8\textwidth]{images/dot_plot.png} 1200\caption{Nucleotide dot plot of two orchid sequence lengths (using pylab's imshow function).} 1201\label{fig:nuc-dot-plot} 1202\end{figure} 1203\end{latexonly} 1204 1205To start off, we'll need two sequences. For the sake of argument, we'll just take 1206the first two from our orchid FASTA file \href{https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/ls_orchid.fasta}{ls\_orchid.fasta}: 1207 1208\begin{minted}{python} 1209from Bio import SeqIO 1210 1211with open("ls_orchid.fasta") as in_handle: 1212 record_iterator = SeqIO.parse(in_handle, "fasta") 1213 rec_one = next(record_iterator) 1214 rec_two = next(record_iterator) 1215\end{minted} 1216 1217We're going to show two approaches. Firstly, a simple naive implementation 1218which compares all the window sized sub-sequences to each other to compiles a 1219similarity matrix. You could construct a matrix or array object, but here we 1220just use a list of lists of booleans created with a nested list 1221comprehension: 1222 1223\begin{minted}{python} 1224window = 7 1225seq_one = rec_one.seq.upper() 1226seq_two = rec_two.seq.upper() 1227data = [ 1228 [ 1229 (seq_one[i : i + window] != seq_two[j : j + window]) 1230 for j in range(len(seq_one) - window) 1231 ] 1232 for i in range(len(seq_two) - window) 1233] 1234\end{minted} 1235 1236Note that we have \emph{not} checked for reverse complement matches here. 1237Now we'll use the matplotlib's \verb|pylab.imshow()| function to display this 1238data, first requesting the gray color scheme so this is done in black and 1239white: 1240 1241\begin{minted}{python} 1242import pylab 1243 1244pylab.gray() 1245pylab.imshow(data) 1246pylab.xlabel("%s (length %i bp)" % (rec_one.id, len(rec_one))) 1247pylab.ylabel("%s (length %i bp)" % (rec_two.id, len(rec_two))) 1248pylab.title("Dot plot using window size %i\n(allowing no mis-matches)" % window) 1249pylab.show() 1250\end{minted} 1251%pylab.savefig("dot_plot.png", dpi=75) 1252%pylab.savefig("dot_plot.pdf") 1253 1254% 1255% Have a HTML version and a PDF version to display nicely... 1256% 1257\begin{htmlonly} 1258\noindent That should pop up a new window containing a graph like this: 1259 1260\imgsrc[width=600, height=450]{images/dot_plot.png} 1261 1262\end{htmlonly} 1263\begin{latexonly} 1264\noindent That should pop up a new window showing the graph in Figure~\ref{fig:nuc-dot-plot}. 1265\end{latexonly} 1266% 1267% The text now continues... 1268% 1269As you might have expected, these two sequences are very similar with a 1270partial line of window sized matches along the diagonal. There are no off 1271diagonal matches which would be indicative of inversions or other interesting 1272events. 1273 1274The above code works fine on small examples, but there are two problems 1275applying this to larger sequences, which we will address below. 1276First off all, this brute force approach to the all against all comparisons 1277is very slow. Instead, we'll compile dictionaries mapping the window sized 1278sub-sequences to their locations, and then take the set intersection to find 1279those sub-sequences found in both sequences. This uses more memory, but is 1280\emph{much} faster. Secondly, the \verb|pylab.imshow()| function is limited 1281in the size of matrix it can display. As an alternative, we'll use the 1282\verb|pylab.scatter()| function. 1283 1284We start by creating dictionaries mapping the window-sized sub-sequences to locations: 1285\begin{minted}{python} 1286window = 7 1287dict_one = {} 1288dict_two = {} 1289for (seq, section_dict) in [ 1290 (rec_one.seq.upper(), dict_one), 1291 (rec_two.seq.upper(), dict_two), 1292]: 1293 for i in range(len(seq) - window): 1294 section = seq[i : i + window] 1295 try: 1296 section_dict[section].append(i) 1297 except KeyError: 1298 section_dict[section] = [i] 1299# Now find any sub-sequences found in both sequences 1300matches = set(dict_one).intersection(dict_two) 1301print("%i unique matches" % len(matches)) 1302\end{minted} 1303\noindent In order to use the \verb|pylab.scatter()| we need separate lists for the $x$ and $y$ co-ordinates: 1304\begin{minted}{python} 1305# Create lists of x and y co-ordinates for scatter plot 1306x = [] 1307y = [] 1308for section in matches: 1309 for i in dict_one[section]: 1310 for j in dict_two[section]: 1311 x.append(i) 1312 y.append(j) 1313\end{minted} 1314\noindent We are now ready to draw the revised dot plot as a scatter plot: 1315\begin{minted}{python} 1316import pylab 1317 1318pylab.cla() # clear any prior graph 1319pylab.gray() 1320pylab.scatter(x, y) 1321pylab.xlim(0, len(rec_one) - window) 1322pylab.ylim(0, len(rec_two) - window) 1323pylab.xlabel("%s (length %i bp)" % (rec_one.id, len(rec_one))) 1324pylab.ylabel("%s (length %i bp)" % (rec_two.id, len(rec_two))) 1325pylab.title("Dot plot using window size %i\n(allowing no mis-matches)" % window) 1326pylab.show() 1327\end{minted} 1328%pylab.savefig("dot_plot_scatter.png", dpi=75) 1329%pylab.savefig("dot_plot_scatter.pdf") 1330% 1331% Have a HTML version and a PDF version to display nicely... 1332% 1333\begin{htmlonly} 1334\noindent That should pop up a new window containing a graph like this: 1335 1336\imgsrc[width=600, height=450]{images/dot_plot_scatter.png} 1337 1338\end{htmlonly} 1339\begin{latexonly} 1340\noindent That should pop up a new window showing the graph in Figure~\ref{fig:nuc-dot-plot-scatter}. 1341\begin{figure}[htbp] 1342\centering 1343\includegraphics[width=0.8\textwidth]{images/dot_plot_scatter.png} 1344\caption{Nucleotide dot plot of two orchid sequence lengths (using pylab's scatter function).} 1345\label{fig:nuc-dot-plot-scatter} 1346\end{figure}\end{latexonly} 1347Personally I find this second plot much easier to read! 1348Again note that we have \emph{not} checked for reverse complement matches here 1349-- you could extend this example to do this, and perhaps plot the forward 1350matches in one color and the reverse matches in another. 1351 1352\subsection{Plotting the quality scores of sequencing read data} 1353 1354If you are working with second generation sequencing data, you may want to try plotting 1355the quality data. Here is an example using two FASTQ files containing paired end reads, 1356\texttt{SRR001666\_1.fastq} for the forward reads, and \texttt{SRR001666\_2.fastq} for 1357the reverse reads. These were downloaded from the ENA sequence read archive FTP site 1358(\url{ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001666/SRR001666_1.fastq.gz} and 1359\url{ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001666/SRR001666_2.fastq.gz}), and 1360are from \textit{E. coli} -- see \url{https://www.ebi.ac.uk/ena/data/view/SRR001666} 1361for details. 1362%Originally from ftp://ftp.ncbi.nlm.nih.gov/sra/static/SRX000/SRX000430/ 1363 1364In the following code the \verb|pylab.subplot(...)| function is used in order to show 1365the forward and reverse qualities on two subplots, side by side. There is also a little 1366bit of code to only plot the first fifty reads. 1367 1368\begin{minted}{python} 1369import pylab 1370from Bio import SeqIO 1371 1372for subfigure in [1, 2]: 1373 filename = "SRR001666_%i.fastq" % subfigure 1374 pylab.subplot(1, 2, subfigure) 1375 for i, record in enumerate(SeqIO.parse(filename, "fastq")): 1376 if i >= 50: 1377 break # trick! 1378 pylab.plot(record.letter_annotations["phred_quality"]) 1379 pylab.ylim(0, 45) 1380 pylab.ylabel("PHRED quality score") 1381 pylab.xlabel("Position") 1382pylab.savefig("SRR001666.png") 1383print("Done") 1384\end{minted} 1385 1386You should note that we are using the \verb|Bio.SeqIO| format name \texttt{fastq} 1387here because the NCBI has saved these reads using the standard Sanger FASTQ format 1388with PHRED scores. However, as you might guess from the read lengths, this data was 1389from an Illumina Genome Analyzer and was probably originally in one of the two 1390Solexa/Illumina FASTQ variant file formats instead. 1391 1392This example uses the \verb|pylab.savefig(...)| function instead of 1393\verb|pylab.show(...)|, but as mentioned before both are useful. 1394\begin{latexonly} 1395\begin{figure}[htbp] 1396\centering 1397\includegraphics[width=0.8\textwidth]{images/SRR001666.png} 1398\caption{Quality plot for some paired end reads.} 1399\label{fig:paired-end-qual-plot} 1400\end{figure} 1401The result is shown in Figure~\ref{fig:paired-end-qual-plot}. 1402\end{latexonly} 1403\begin{htmlonly} 1404Here is the result: 1405 1406%Blank lines here are important! 1407\imgsrc[width=600, height=600]{images/SRR001666.png} 1408 1409\end{htmlonly} 1410 1411\section{Dealing with alignments} 1412 1413This section can been seen as a follow on to Chapter~\ref{chapter:align}. 1414 1415\subsection{Calculating summary information} 1416\label{sec:summary_info} 1417 1418Once you have an alignment, you are very likely going to want to find out information about it. Instead of trying to have all of the functions that can generate information about an alignment in the alignment object itself, we've tried to separate out the functionality into separate classes, which act on the alignment. 1419 1420Getting ready to calculate summary information about an object is quick to do. Let's say we've got an alignment object called \verb|alignment|, for example read in using \verb|Bio.AlignIO.read(...)| as described in Chapter~\ref{chapter:align}. All we need to do to get an object that will calculate summary information is: 1421 1422\begin{minted}{python} 1423from Bio.Align import AlignInfo 1424 1425summary_align = AlignInfo.SummaryInfo(alignment) 1426\end{minted} 1427 1428The \verb|summary_align| object is very useful, and will do the following neat things for you: 1429 1430\begin{enumerate} 1431 \item Calculate a quick consensus sequence -- see section~\ref{sec:consensus} 1432 \item Get a position specific score matrix for the alignment -- see section~\ref{sec:pssm} 1433 \item Calculate the information content for the alignment -- see section~\ref{sec:getting_info_content} 1434 \item Generate information on substitutions in the alignment -- section~\ref{sec:sub_matrix} details using this to generate a substitution matrix. 1435\end{enumerate} 1436 1437\subsection{Calculating a quick consensus sequence} 1438\label{sec:consensus} 1439 1440The \verb|SummaryInfo| object, described in section~\ref{sec:summary_info}, provides functionality to calculate a quick consensus of an alignment. Assuming we've got a \verb|SummaryInfo| object called \verb|summary_align| we can calculate a consensus by doing: 1441 1442\begin{minted}{python} 1443consensus = summary_align.dumb_consensus() 1444\end{minted} 1445 1446As the name suggests, this is a really simple consensus calculator, and will just add up all of the residues at each point in the consensus, and if the most common value is higher than some threshold value will add the common residue to the consensus. If it doesn't reach the threshold, it adds an ambiguity character to the consensus. The returned consensus object is a \verb|Seq| object. 1447 1448You can adjust how \verb|dumb_consensus| works by passing optional parameters: 1449 1450\begin{description} 1451\item[the threshold] This is the threshold specifying how common a particular residue has to be at a position before it is added. The default is $0.7$ (meaning $70\%$). 1452 1453\item[the ambiguous character] This is the ambiguity character to use. The default is 'N'. 1454 1455\end{description} 1456 1457\subsection{Position Specific Score Matrices} 1458\label{sec:pssm} 1459 1460Position specific score matrices (PSSMs) summarize the alignment information in a different way than a consensus, and may be useful for different tasks. Basically, a PSSM is a count matrix. For each column in the alignment, the number of each alphabet letters is counted and totaled. The totals are displayed relative to some representative sequence along the left axis. This sequence may be the consesus sequence, but can also be any sequence in the alignment. For instance for the alignment, 1461 1462\begin{minted}{text} 1463GTATC 1464AT--C 1465CTGTC 1466\end{minted} 1467 1468\noindent the PSSM is: 1469 1470\begin{minted}{text} 1471 G A T C 1472 G 1 1 0 1 1473 T 0 0 3 0 1474 A 1 1 0 0 1475 T 0 0 2 0 1476 C 0 0 0 3 1477\end{minted} 1478 1479Let's assume we've got an alignment object called \verb|c_align|. To get a PSSM with the consensus sequence along the side we first get a summary object and calculate the consensus sequence: 1480 1481\begin{minted}{python} 1482summary_align = AlignInfo.SummaryInfo(c_align) 1483consensus = summary_align.dumb_consensus() 1484\end{minted} 1485 1486Now, we want to make the PSSM, but ignore any \verb|N| ambiguity residues when calculating this: 1487 1488\begin{minted}{python} 1489my_pssm = summary_align.pos_specific_score_matrix(consensus, chars_to_ignore=["N"]) 1490\end{minted} 1491 1492Two notes should be made about this: 1493 1494\begin{enumerate} 1495 \item To maintain strictness with the alphabets, you can only include characters along the top of the PSSM that are in the alphabet of the alignment object. Gaps are not included along the top axis of the PSSM. 1496 1497 \item The sequence passed to be displayed along the left side of the axis does not need to be the consensus. For instance, if you wanted to display the second sequence in the alignment along this axis, you would need to do: 1498 1499\begin{minted}{python} 1500second_seq = alignment.get_seq_by_num(1) 1501my_pssm = summary_align.pos_specific_score_matrix(second_seq, chars_to_ignore=["N"]) 1502\end{minted} 1503 1504\end{enumerate} 1505 1506The command above returns a \verb|PSSM| object. 1507To print out the PSSM as shown above, 1508we simply need to do a \verb|print(my_pssm)|, which gives: 1509 1510\begin{minted}{text} 1511 A C G T 1512T 0.0 0.0 0.0 7.0 1513A 7.0 0.0 0.0 0.0 1514T 0.0 0.0 0.0 7.0 1515A 7.0 0.0 0.0 0.0 1516C 0.0 7.0 0.0 0.0 1517A 7.0 0.0 0.0 0.0 1518T 0.0 0.0 0.0 7.0 1519T 1.0 0.0 0.0 6.0 1520... 1521\end{minted} 1522 1523You can access any element of the PSSM by subscripting like \verb|your_pssm[sequence_number][residue_count_name]|. For instance, to get the counts for the 'A' residue in the second element of the above PSSM you would do: 1524 1525\begin{minted}{pycon} 1526>>> print(my_pssm[1]["A"]) 15277.0 1528\end{minted} 1529 1530The structure of the PSSM class hopefully makes it easy both to access elements and to pretty print the matrix. 1531 1532\subsection{Information Content} 1533\label{sec:getting_info_content} 1534 1535A potentially useful measure of evolutionary conservation is the information content of a sequence. 1536 1537A useful introduction to information theory targeted towards molecular biologists can be found at \url{http://www.lecb.ncifcrf.gov/~toms/paper/primer/}. For our purposes, we will be looking at the information content of a consesus sequence, or a portion of a consensus sequence. We calculate information content at a particular column in a multiple sequence alignment using the following formula: 1538 1539\begin{displaymath} 1540IC_{j} = \sum_{i=1}^{N_{a}} P_{ij} \mathrm{log}\left(\frac{P_{ij}}{Q_{i}}\right) 1541\end{displaymath} 1542 1543\noindent where: 1544 1545\begin{itemize} 1546 \item $IC_{j}$ -- The information content for the $j$-th column in an alignment. 1547 \item $N_{a}$ -- The number of letters in the alphabet. 1548 \item $P_{ij}$ -- The frequency of a particular letter $i$ in the $j$-th column (i.~e.~if G occurred 3 out of 6 times in an aligment column, this would be 0.5) 1549 \item $Q_{i}$ -- The expected frequency of a letter $i$. This is an 1550 optional argument, usage of which is left at the user's 1551 discretion. By default, it is automatically assigned to $0.05 = 1/20$ for a 1552 protein alphabet, and $0.25 = 1/4$ for a nucleic acid alphabet. This is for 1553 geting the information content without any assumption of prior 1554 distributions. When assuming priors, or when using a non-standard 1555 alphabet, you should supply the values for $Q_{i}$. 1556\end{itemize} 1557 1558Well, now that we have an idea what information content is being calculated in Biopython, let's look at how to get it for a particular region of the alignment. 1559 1560First, we need to use our alignment to get an alignment summary object, which we'll assume is called \verb|summary_align| (see section~\ref{sec:summary_info}) for instructions on how to get this. Once we've got this object, calculating the information content for a region is as easy as: 1561 1562\begin{minted}{python} 1563info_content = summary_align.information_content(5, 30, chars_to_ignore=["N"]) 1564\end{minted} 1565 1566Wow, that was much easier then the formula above made it look! The variable \verb|info_content| now contains a float value specifying the information content over the specified region (from 5 to 30 of the alignment). We specifically ignore the ambiguity residue 'N' when calculating the information content, since this value is not included in our alphabet (so we shouldn't be interested in looking at it!). 1567 1568As mentioned above, we can also calculate relative information content by supplying a dictionary with the expected frequencies: 1569 1570\begin{minted}{python} 1571expect_freq = {"A": 0.3, "G": 0.2, "T": 0.3, "C": 0.2} 1572info_content = summary_align.information_content( 1573 5, 30, e_freq_table=e_freq_table, chars_to_ignore=["N"] 1574) 1575\end{minted} 1576 1577Now, \verb|info_content| will contain the relative information content over the region in relation to the expected frequencies. 1578 1579The value return is calculated using base 2 as the logarithm base in the formula above. You can modify this by passing the parameter \verb|log_base| as the base you want: 1580 1581\begin{minted}{python} 1582info_content = summary_align.information_content( 1583 5, 30, log_base=10, chars_to_ignore=["N"] 1584) 1585\end{minted} 1586 1587By default nucleotide or amino acid residues with a frequency of 0 in a column are not take into account when the relative information column for that column is computed. If this is not the desired result, you can use \verb|pseudo_count| instead. 1588 1589\begin{minted}{python} 1590info_content = summary_align.information_content( 1591 5, 30, chars_to_ignore=["N"], pseudo_count=1 1592) 1593\end{minted} 1594 1595In this case, the observed frequency $P_{ij}$ of a particular letter $i$ in the $j$-th column is computed as follow : 1596 1597\begin{displaymath} 1598P_{ij} = \frac{n_{ij} + k\times Q_{i}}{N_{j} + k} 1599\end{displaymath} 1600 1601\noindent where: 1602 1603\begin{itemize} 1604 \item $k$ -- the pseudo count you pass as argument. 1605 \item $k$ -- the pseudo count you pass as argument. 1606 \item $Q_{i}$ -- The expected frequency of the letter $i$ as described above. 1607\end{itemize} 1608 1609Well, now you are ready to calculate information content. If you want to try applying this to some real life problems, it would probably be best to dig into the literature on information content to get an idea of how it is used. Hopefully your digging won't reveal any mistakes made in coding this function! 1610 1611\section{Substitution Matrices} 1612\label{sec:sub_matrix} 1613 1614Substitution matrices are an extremely important part of everyday bioinformatics work. They provide the scoring terms for classifying how likely two different residues are to substitute for each other. This is essential in doing sequence comparisons. The book ``Biological Sequence Analysis'' by Durbin et al. provides a really nice introduction to Substitution Matrices and their uses. Some famous substitution matrices are the PAM and BLOSUM series of matrices. 1615 1616Biopython provides a ton of common substitution matrices, and also provides functionality for creating your own substitution matrices. 1617 1618\subsection{Using common substitution matrices} 1619 1620\subsection{Calculating a substitution matrix from a multiple sequence alignment} 1621\label{sec:subs_mat_ex} 1622 1623You can create your own substitution matrix from an alignment. In this 1624example, we'll first read a protein sequence alignment from the Clustalw file 1625\href{examples/protein.aln}{protein.aln} (also available online 1626\href{https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/protein.aln}{here}) 1627 1628%doctest ../Tests/Clustalw 1629\begin{minted}{pycon} 1630>>> from Bio import AlignIO 1631>>> filename = "protein.aln" 1632>>> alignment = AlignIO.read(filename, "clustal") 1633\end{minted} 1634 1635Section~\ref{sec:align_clustal} contains more information on doing this. 1636 1637The \verb+substitutions+ property of the alignment stores the number of times 1638different residues substitute for each other: 1639%cont-doctest 1640\begin{minted}{pycon} 1641>>> observed_frequencies = alignment.substitutions 1642\end{minted} 1643 1644To make the example more readable, we'll select only amino acids with polar charged side chains: 1645 1646%cont-doctest 1647\begin{minted}{pycon} 1648>>> observed_frequencies = observed_frequencies.select("DEHKR") 1649>>> print(observed_frequencies) 1650 D E H K R 1651D 2360.0 255.5 7.5 0.5 25.0 1652E 255.5 3305.0 16.5 27.0 2.0 1653H 7.5 16.5 1235.0 16.0 8.5 1654K 0.5 27.0 16.0 3218.0 116.5 1655R 25.0 2.0 8.5 116.5 2079.0 1656<BLANKLINE> 1657\end{minted} 1658Rows and columns for other amino acids were removed from the matrix. 1659 1660Next, we normalize the matrix: 1661%cont-doctest 1662\begin{minted}{pycon} 1663>>> import numpy 1664>>> observed_frequencies /= numpy.sum(observed_frequencies) 1665\end{minted} 1666 1667Summing over rows or columns gives the relative frequency of occurrence of 1668each residue: 1669%cont-doctest 1670\begin{minted}{pycon} 1671>>> residue_frequencies = numpy.sum(observed_frequencies, 0) 1672>>> print(residue_frequencies.format("%.4f")) 1673D 0.2015 1674E 0.2743 1675H 0.0976 1676K 0.2569 1677R 0.1697 1678<BLANKLINE> 1679>>> numpy.sum(residue_frequencies) 16801.0 1681\end{minted} 1682 1683The expected frequency of residue pairs is then 1684%cont-doctest 1685\begin{minted}{pycon} 1686>>> expected_frequencies = numpy.dot(residue_frequencies[:, None], residue_frequencies[None, :]) 1687>>> print(expected_frequencies.format("%.4f")) 1688 D E H K R 1689D 0.0406 0.0553 0.0197 0.0518 0.0342 1690E 0.0553 0.0752 0.0268 0.0705 0.0465 1691H 0.0197 0.0268 0.0095 0.0251 0.0166 1692K 0.0518 0.0705 0.0251 0.0660 0.0436 1693R 0.0342 0.0465 0.0166 0.0436 0.0288 1694<BLANKLINE> 1695\end{minted} 1696Here, \verb+residue_frequencies[:, None]+ creates a 2D array consisting of a single column with the values of \verb+residue_frequencies+, and \verb+residue_frequencies[None, :]+ a 2D array with these values as a single row. Taking their dot product (inner product) creates a matrix of expected frequencies where each entry consists of two \verb+residue_frequencies+ values multiplied with each other. For example, \verb+expected_frequencies['D', 'E']+ is equal to \verb+residue_frequencies['D'] * residue_frequencies['E']+. 1697 1698We can now calculate the log-odds matrix by dividing the observed frequencies by the expected freqencies and taking the logarithm: 1699%cont-doctest 1700\begin{minted}{pycon} 1701>>> m = numpy.log2(observed_frequencies/expected_frequencies) 1702>>> print(m) 1703 D E H K R 1704D 2.1 -1.5 -5.1 -10.4 -4.2 1705E -1.5 1.7 -4.4 -5.1 -8.3 1706H -5.1 -4.4 3.3 -4.4 -4.7 1707K -10.4 -5.1 -4.4 1.9 -2.3 1708R -4.2 -8.3 -4.7 -2.3 2.5 1709<BLANKLINE> 1710\end{minted} 1711 1712This matrix can be used as the substitution matrix when performing alignments. For example, 1713%cont-doctest 1714\begin{minted}{pycon} 1715>>> from Bio.Align import PairwiseAligner 1716>>> aligner = PairwiseAligner() 1717>>> aligner.substitution_matrix = m 1718>>> aligner.gap_score = -3.0 1719>>> alignments = aligner.align("DEHEK", "DHHKK") 1720>>> print(alignments[0]) 1721DEHEK 1722|.|.| 1723DHHKK 1724<BLANKLINE> 1725>>> print("%.2f" % alignments.score) 1726-2.18 1727>>> score = m['D','D'] + m['E','H'] + m['H','H'] + m['E','K'] + m['K','K'] 1728>>> print("%.2f" % score) 1729-2.18 1730\end{minted} 1731 1732 1733\section{BioSQL -- storing sequences in a relational database} 1734\label{sec:BioSQL} 1735\href{https://www.biosql.org/}{BioSQL} is a joint effort between the 1736\href{https://www.open-bio.org/wiki/Main_Page}{OBF} projects (BioPerl, BioJava etc) to support a 1737shared database schema for storing sequence data. In theory, you could load a 1738GenBank file into the database with BioPerl, then using Biopython extract this 1739from the database as a record object with features - and get more or less the same 1740thing as if you had loaded the GenBank file directly as a SeqRecord using 1741\verb|Bio.SeqIO| (Chapter~\ref{chapter:seqio}). 1742 1743Biopython's BioSQL module is currently documented at 1744\url{http://biopython.org/wiki/BioSQL} which is part of our wiki pages. 1745