1\chapter{Cookbook -- Cool things to do with it}
2\label{chapter:cookbook}
3
4Biopython now has two collections of ``cookbook'' examples -- this chapter
5(which has been included in this tutorial for many years and has gradually
6grown), and \url{http://biopython.org/wiki/Category:Cookbook} which is a
7user contributed collection on our wiki.
8
9We're trying to encourage Biopython users to contribute their own examples
10to the wiki. In addition to helping the community, one direct benefit of
11sharing an example like this is that you could also get some feedback on
12the code from other Biopython users and developers - which could help you
13improve all your Python code.
14
15In the long term, we may end up moving all of the examples in this chapter
16to the wiki, or elsewhere within the tutorial.
17
18\section{Working with sequence files}
19\label{sec:cookbook-sequences}
20
21This section shows some more examples of sequence input/output, using the
22\verb|Bio.SeqIO| module described in Chapter~\ref{chapter:seqio}.
23
24\subsection{Filtering a sequence file}
25
26Often you'll have a large file with many sequences in it (e.g. FASTA file
27or genes, or a FASTQ or SFF file of reads), a separate shorter list of
28the IDs for a subset of sequences of interest, and want to make a new
29sequence file for this subset.
30
31Let's say the list of IDs is in a simple text file, as the first word on
32each line. This could be a tabular file where the first column is the ID.
33Try something like this:
34
35%not a doctest to avoid temp files being left behind, also no >>>
36%makes it easier to copy and paste the example to a script file.
37\begin{minted}{python}
38from Bio import SeqIO
39
40input_file = "big_file.sff"
41id_file = "short_list.txt"
42output_file = "short_list.sff"
43
44with open(id_file) as id_handle:
45    wanted = set(line.rstrip("\n").split(None, 1)[0] for line in id_handle)
46print("Found %i unique identifiers in %s" % (len(wanted), id_file))
47
48records = (r for r in SeqIO.parse(input_file, "sff") if r.id in wanted)
49count = SeqIO.write(records, output_file, "sff")
50print("Saved %i records from %s to %s" % (count, input_file, output_file))
51if count < len(wanted):
52    print("Warning %i IDs not found in %s" % (len(wanted) - count, input_file))
53\end{minted}
54
55Note that we use a Python \verb|set| rather than a \verb|list|, this makes
56testing membership faster.
57
58As discussed in Section~\ref{sec:low-level-fasta-fastq}, for a large FASTA
59or FASTQ file for speed you would be better off not using the high-level
60\verb|SeqIO| interface, but working directly with strings. This next
61example shows how to do this with FASTQ files -- it is more complicated:
62
63%not a doctest to avoid temp files being left behind, also no >>>
64%makes it easier to copy and paste the example to a script file.
65\begin{minted}{python}
66from Bio.SeqIO.QualityIO import FastqGeneralIterator
67
68input_file = "big_file.fastq"
69id_file = "short_list.txt"
70output_file = "short_list.fastq"
71
72with open(id_file) as id_handle:
73    # Taking first word on each line as an identifer
74    wanted = set(line.rstrip("\n").split(None, 1)[0] for line in id_handle)
75print("Found %i unique identifiers in %s" % (len(wanted), id_file))
76
77with open(input_file) as in_handle:
78    with open(output_file, "w") as out_handle:
79        for title, seq, qual in FastqGeneralIterator(in_handle):
80            # The ID is the first word in the title line (after the @ sign):
81            if title.split(None, 1)[0] in wanted:
82                # This produces a standard 4-line FASTQ entry:
83                out_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
84                count += 1
85print("Saved %i records from %s to %s" % (count, input_file, output_file))
86if count < len(wanted):
87    print("Warning %i IDs not found in %s" % (len(wanted) - count, input_file))
88\end{minted}
89
90\subsection{Producing randomised genomes}
91
92Let's suppose you are looking at genome sequence, hunting for some sequence
93feature -- maybe extreme local GC\% bias, or possible restriction digest sites.
94Once you've got your Python code working on the real genome it may be sensible
95to try running the same search on randomised versions of the same genome for
96statistical analysis (after all, any ``features'' you've found could just be
97there just by chance).
98
99For this discussion, we'll use the GenBank file for the pPCP1 plasmid from
100\textit{Yersinia pestis biovar Microtus}.  The file is included with the
101Biopython unit tests under the GenBank folder, or you can get it from our
102website, \href{https://raw.githubusercontent.com/biopython/biopython/master/Tests/GenBank/NC_005816.gb}
103{\texttt{NC\_005816.gb}}.
104This file contains one and only one record, so we can read it in as a
105\verb|SeqRecord| using the \verb|Bio.SeqIO.read()| function:
106
107%doctest ../Tests/GenBank
108\begin{minted}{pycon}
109>>> from Bio import SeqIO
110>>> original_rec = SeqIO.read("NC_005816.gb", "genbank")
111\end{minted}
112
113So, how can we generate a shuffled versions of the original sequence?  I would
114use the built in Python \verb|random| module for this, in particular the function
115\verb|random.shuffle| -- but this works on a Python list.  Our sequence is a
116\verb|Seq| object, so in order to shuffle it we need to turn it into a list:
117
118%cont-doctest
119\begin{minted}{pycon}
120>>> import random
121>>> nuc_list = list(original_rec.seq)
122>>> random.shuffle(nuc_list)  # acts in situ!
123\end{minted}
124
125Now, in order to use \verb|Bio.SeqIO| to output the shuffled sequence, we need
126to construct a new \verb|SeqRecord| with a new \verb|Seq| object using this
127shuffled list.  In order to do this, we need to turn the list of nucleotides
128(single letter strings) into a long string -- the standard Python way to do
129this is with the string object's join method.
130
131%cont-doctest
132\begin{minted}{pycon}
133>>> from Bio.Seq import Seq
134>>> from Bio.SeqRecord import SeqRecord
135>>> shuffled_rec = SeqRecord(Seq("".join(nuc_list)),
136...                          id="Shuffled", description="Based on %s" % original_rec.id)
137...
138\end{minted}
139
140Let's put all these pieces together to make a complete Python script which
141generates a single FASTA file containing 30 randomly shuffled versions of
142the original sequence.
143
144This first version just uses a big for loop and writes out the records one by one
145(using the \verb|SeqRecord|'s format method described in
146Section~\ref{sec:Bio.SeqIO-and-StringIO}):
147
148\begin{minted}{python}
149import random
150from Bio.Seq import Seq
151from Bio.SeqRecord import SeqRecord
152from Bio import SeqIO
153
154original_rec = SeqIO.read("NC_005816.gb", "genbank")
155
156with open("shuffled.fasta", "w") as output_handle:
157    for i in range(30):
158        nuc_list = list(original_rec.seq)
159        random.shuffle(nuc_list)
160        shuffled_rec = SeqRecord(
161            Seq("".join(nuc_list)),
162            id="Shuffled%i" % (i + 1),
163            description="Based on %s" % original_rec.id,
164        )
165        output_handle.write(shuffled_rec.format("fasta"))
166\end{minted}
167
168Personally I prefer the following version using a function to shuffle the record
169and a generator expression instead of the for loop:
170
171\begin{minted}{python}
172import random
173from Bio.Seq import Seq
174from Bio.SeqRecord import SeqRecord
175from Bio import SeqIO
176
177
178def make_shuffle_record(record, new_id):
179    nuc_list = list(record.seq)
180    random.shuffle(nuc_list)
181    return SeqRecord(
182        Seq("".join(nuc_list)), id=new_id, description="Based on %s" % original_rec.id,
183    )
184
185
186original_rec = SeqIO.read("NC_005816.gb", "genbank")
187shuffled_recs = (
188    make_shuffle_record(original_rec, "Shuffled%i" % (i + 1)) for i in range(30)
189)
190
191SeqIO.write(shuffled_recs, "shuffled.fasta", "fasta")
192\end{minted}
193
194\subsection{Translating a FASTA file of CDS entries}
195\label{sec:SeqIO-translate}
196Suppose you've got an input file of CDS entries for some organism, and you
197want to generate a new FASTA file containing their protein sequences.  i.e.
198Take each nucleotide sequence from the original file, and translate it.
199Back in Section~\ref{sec:translation} we saw how to use the \verb|Seq|
200object's \verb|translate method|, and the optional \verb|cds| argument
201which enables correct translation of alternative start codons.
202
203We can combine this with \verb|Bio.SeqIO| as
204shown in the reverse complement example in Section~\ref{sec:SeqIO-reverse-complement}.
205The key point is that for each nucleotide \verb|SeqRecord|, we need to create
206a protein \verb|SeqRecord| - and take care of naming it.
207
208You can write you own function to do this, choosing suitable protein identifiers
209for your sequences, and the appropriate genetic code.  In this example we just
210use the default table and add a prefix to the identifier:
211
212\begin{minted}{python}
213from Bio.SeqRecord import SeqRecord
214
215
216def make_protein_record(nuc_record):
217    """Returns a new SeqRecord with the translated sequence (default table)."""
218    return SeqRecord(
219        seq=nuc_record.seq.translate(cds=True),
220        id="trans_" + nuc_record.id,
221        description="translation of CDS, using default table",
222    )
223\end{minted}
224
225We can then use this function to turn the input nucleotide records into protein
226records ready for output.  An elegant way and memory efficient way to do this
227is with a generator expression:
228
229\begin{minted}{python}
230from Bio import SeqIO
231
232proteins = (
233    make_protein_record(nuc_rec)
234    for nuc_rec in SeqIO.parse("coding_sequences.fasta", "fasta")
235)
236SeqIO.write(proteins, "translations.fasta", "fasta")
237\end{minted}
238
239This should work on any FASTA file of complete coding sequences.
240If you are working on partial coding sequences, you may prefer to use
241\verb|nuc_record.seq.translate(to_stop=True)| in the example above, as
242this wouldn't check for a valid start codon etc.
243
244\subsection{Making the sequences in a FASTA file upper case}
245
246Often you'll get data from collaborators as FASTA files, and sometimes the
247sequences can be in a mixture of upper and lower case. In some cases this is
248deliberate (e.g. lower case for poor quality regions), but usually it is not
249important. You may want to edit the file to make everything consistent (e.g.
250all upper case), and you can do this easily using the \verb|upper()| method
251of the \verb|SeqRecord| object (added in Biopython 1.55):
252
253\begin{minted}{python}
254from Bio import SeqIO
255
256records = (rec.upper() for rec in SeqIO.parse("mixed.fas", "fasta"))
257count = SeqIO.write(records, "upper.fas", "fasta")
258print("Converted %i records to upper case" % count)
259\end{minted}
260
261How does this work? The first line is just importing the \verb|Bio.SeqIO|
262module. The second line is the interesting bit -- this is a Python
263generator expression which gives an upper case version of each record
264parsed from the input file (\texttt{mixed.fas}). In the third line we give
265this generator expression to the \verb|Bio.SeqIO.write()| function and it
266saves the new upper cases records to our output file (\texttt{upper.fas}).
267
268The reason we use a generator expression (rather than a list or list
269comprehension) is this means only one record is kept in memory at a time.
270This can be really important if you are dealing with large files with
271millions of entries.
272
273\subsection{Sorting a sequence file}
274\label{sec:SeqIO-sort}
275
276Suppose you wanted to sort a sequence file by length (e.g. a set of
277contigs from an assembly), and you are working with a file format like
278FASTA or FASTQ which \verb|Bio.SeqIO| can read, write (and index).
279
280If the file is small enough, you can load it all into memory at once
281as a list of \verb|SeqRecord| objects, sort the list, and save it:
282
283\begin{minted}{python}
284from Bio import SeqIO
285
286records = list(SeqIO.parse("ls_orchid.fasta", "fasta"))
287records.sort(key=lambda r: len(r))
288SeqIO.write(records, "sorted_orchids.fasta", "fasta")
289\end{minted}
290
291The only clever bit is specifying a comparison method for how to
292sort the records (here we sort them by length). If you wanted the
293longest records first, you could flip the comparison or use the
294reverse argument:
295
296\begin{minted}{python}
297from Bio import SeqIO
298
299records = list(SeqIO.parse("ls_orchid.fasta", "fasta"))
300records.sort(key=lambda r: -len(r))
301SeqIO.write(records, "sorted_orchids.fasta", "fasta")
302\end{minted}
303
304Now that's pretty straight forward - but what happens if you have a
305very large file and you can't load it all into memory like this?
306For example, you might have some next-generation sequencing reads
307to sort by length. This can be solved using the
308\verb|Bio.SeqIO.index()| function.
309
310\begin{minted}{python}
311from Bio import SeqIO
312
313# Get the lengths and ids, and sort on length
314len_and_ids = sorted(
315    (len(rec), rec.id) for rec in SeqIO.parse("ls_orchid.fasta", "fasta")
316)
317ids = reversed([id for (length, id) in len_and_ids])
318del len_and_ids  # free this memory
319record_index = SeqIO.index("ls_orchid.fasta", "fasta")
320records = (record_index[id] for id in ids)
321SeqIO.write(records, "sorted.fasta", "fasta")
322\end{minted}
323
324First we scan through the file once using \verb|Bio.SeqIO.parse()|,
325recording the record identifiers and their lengths in a list of tuples.
326We then sort this list to get them in length order, and discard the lengths.
327Using this sorted list of identifiers \verb|Bio.SeqIO.index()| allows us to
328retrieve the records one by one, and we pass them to \verb|Bio.SeqIO.write()|
329for output.
330
331These examples all use \verb|Bio.SeqIO| to parse the records into
332\verb|SeqRecord| objects which are output using \verb|Bio.SeqIO.write()|.
333What if you want to sort a file format which \verb|Bio.SeqIO.write()| doesn't
334support, like the plain text SwissProt format? Here is an alternative
335solution using the \verb|get_raw()| method added to \verb|Bio.SeqIO.index()|
336in Biopython 1.54 (see Section~\ref{sec:seqio-index-getraw}).
337
338\begin{minted}{python}
339from Bio import SeqIO
340
341# Get the lengths and ids, and sort on length
342len_and_ids = sorted(
343    (len(rec), rec.id) for rec in SeqIO.parse("ls_orchid.fasta", "fasta")
344)
345ids = reversed([id for (length, id) in len_and_ids])
346del len_and_ids  # free this memory
347
348record_index = SeqIO.index("ls_orchid.fasta", "fasta")
349with open("sorted.fasta", "wb") as out_handle:
350    for id in ids:
351        out_handle.write(record_index.get_raw(id))
352\end{minted}
353
354Note with Python 3 onwards, we have to open the file for writing in
355binary mode because the \verb|get_raw()| method returns \verb|bytes| objects.
356
357As a bonus, because it doesn't parse the data into \verb|SeqRecord| objects
358a second time it should be faster. If you only want to use this with FASTA
359format, we can speed this up one step further by using the low-level FASTA
360parser to get the record identifiers and lengths:
361
362\begin{minted}{python}
363from Bio.SeqIO.FastaIO import SimpleFastaParser
364from Bio import SeqIO
365
366# Get the lengths and ids, and sort on length
367with open("ls_orchid.fasta") as in_handle:
368    len_and_ids = sorted(
369        (len(seq), title.split(None, 1)[0])
370        for title, seq in SimpleFastaParser(in_handle)
371    )
372ids = reversed([id for (length, id) in len_and_ids])
373del len_and_ids  # free this memory
374
375record_index = SeqIO.index("ls_orchid.fasta", "fasta")
376with open("sorted.fasta", "wb") as out_handle:
377    for id in ids:
378        out_handle.write(record_index.get_raw(id))
379\end{minted}
380
381\subsection{Simple quality filtering for FASTQ files}
382\label{sec:FASTQ-filtering-example}
383
384The FASTQ file format was introduced at Sanger and is now widely used for
385holding nucleotide sequencing reads together with their quality scores.
386FASTQ files (and the related QUAL files) are an excellent example of
387per-letter-annotation, because for each nucleotide in the sequence there is
388an associated quality score.  Any per-letter-annotation is held in a
389\verb|SeqRecord| in the \verb|letter_annotations| dictionary as a list,
390tuple or string (with the same number of elements as the sequence length).
391
392One common task is taking a large set of sequencing reads and filtering them
393(or cropping them) based on their quality scores.
394The following example is very simplistic, but should illustrate the basics of
395working with quality data in a \verb|SeqRecord| object.  All we are going to
396do here is read in a file of FASTQ data, and filter it to pick out only those
397records whose PHRED quality scores are all above some threshold (here 20).
398
399For this example we'll use some real data downloaded from the ENA sequence
400read archive,
401\url{ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR020/SRR020192/SRR020192.fastq.gz}
402(2MB) which unzips to a 19MB file \texttt{SRR020192.fastq}. This is some
403Roche 454 GS FLX single end data from virus infected California sea lions
404(see \url{https://www.ebi.ac.uk/ena/data/view/SRS004476} for details).
405
406First, let's count the reads:
407
408\begin{minted}{python}
409from Bio import SeqIO
410
411count = 0
412for rec in SeqIO.parse("SRR020192.fastq", "fastq"):
413    count += 1
414print("%i reads" % count)
415\end{minted}
416
417\noindent Now let's do a simple filtering for a minimum PHRED quality of 20:
418
419\begin{minted}{python}
420from Bio import SeqIO
421
422good_reads = (
423    rec
424    for rec in SeqIO.parse("SRR020192.fastq", "fastq")
425    if min(rec.letter_annotations["phred_quality"]) >= 20
426)
427count = SeqIO.write(good_reads, "good_quality.fastq", "fastq")
428print("Saved %i reads" % count)
429\end{minted}
430
431\noindent This pulled out only $14580$ reads out of the $41892$ present.
432A more sensible thing to do would be to quality trim the reads, but this
433is intended as an example only.
434
435FASTQ files can contain millions of entries, so it is best to avoid loading
436them all into memory at once.  This example uses a generator expression, which
437means only one \verb|SeqRecord| is created at a time - avoiding any memory
438limitations.
439
440Note that it would be faster to use the low-level \verb|FastqGeneralIterator|
441parser here (see Section~\ref{sec:low-level-fasta-fastq}), but that does not
442turn the quality string into integer scores.
443
444\subsection{Trimming off primer sequences}
445\label{sec:FASTQ-slicing-off-primer}
446
447For this example we're going to pretend that \texttt{GATGACGGTGT} is a 5' primer
448sequence we want to look for in some FASTQ formatted read data.  As in the example
449above, we'll use the \texttt{SRR020192.fastq} file downloaded from the ENA
450(\url{ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR020/SRR020192/SRR020192.fastq.gz}).
451
452By using the main \verb|Bio.SeqIO| interface, the same approach would work with
453any other supported file format (e.g. FASTA files). However, for large FASTQ
454files it would be faster the low-level \verb|FastqGeneralIterator| parser here
455(see the earlier example, and Section~\ref{sec:low-level-fasta-fastq}).
456
457This code uses \verb|Bio.SeqIO| with a generator expression (to avoid loading
458all the sequences into memory at once), and the \verb|Seq| object's
459\verb|startswith| method to see if the read starts with the primer sequence:
460
461\begin{minted}{python}
462from Bio import SeqIO
463
464primer_reads = (
465    rec
466    for rec in SeqIO.parse("SRR020192.fastq", "fastq")
467    if rec.seq.startswith("GATGACGGTGT")
468)
469count = SeqIO.write(primer_reads, "with_primer.fastq", "fastq")
470print("Saved %i reads" % count)
471\end{minted}
472
473\noindent That should find $13819$ reads from \texttt{SRR014849.fastq} and save them to
474a new FASTQ file, \texttt{with\_primer.fastq}.
475
476Now suppose that instead you wanted to make a FASTQ file containing these reads
477but with the primer sequence removed?  That's just a small change as we can slice the
478\verb|SeqRecord| (see Section~\ref{sec:SeqRecord-slicing}) to remove the first eleven
479letters (the length of our primer):
480
481\begin{minted}{python}
482from Bio import SeqIO
483
484trimmed_primer_reads = (
485    rec[11:]
486    for rec in SeqIO.parse("SRR020192.fastq", "fastq")
487    if rec.seq.startswith("GATGACGGTGT")
488)
489count = SeqIO.write(trimmed_primer_reads, "with_primer_trimmed.fastq", "fastq")
490print("Saved %i reads" % count)
491\end{minted}
492
493\noindent Again, that should pull out the $13819$ reads from \texttt{SRR020192.fastq},
494but this time strip off the first ten characters, and save them to another new
495FASTQ file, \texttt{with\_primer\_trimmed.fastq}.
496
497Now, suppose you want to create a new FASTQ file where these reads have
498their primer removed, but all the other reads are kept as they were?
499If we want to still use a generator expression, it is probably clearest to
500define our own trim function:
501
502\begin{minted}{python}
503from Bio import SeqIO
504
505
506def trim_primer(record, primer):
507    if record.seq.startswith(primer):
508        return record[len(primer) :]
509    else:
510        return record
511
512
513trimmed_reads = (
514    trim_primer(record, "GATGACGGTGT")
515    for record in SeqIO.parse("SRR020192.fastq", "fastq")
516)
517count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq")
518print("Saved %i reads" % count)
519\end{minted}
520
521This takes longer, as this time the output file contains all $41892$ reads.
522Again, we're used a generator expression to avoid any memory problems.
523You could alternatively use a generator function rather than a generator
524expression.
525
526\begin{minted}{python}
527from Bio import SeqIO
528
529
530def trim_primers(records, primer):
531    """Removes perfect primer sequences at start of reads.
532
533    This is a generator function, the records argument should
534    be a list or iterator returning SeqRecord objects.
535    """
536    len_primer = len(primer)  # cache this for later
537    for record in records:
538        if record.seq.startswith(primer):
539            yield record[len_primer:]
540        else:
541            yield record
542
543
544original_reads = SeqIO.parse("SRR020192.fastq", "fastq")
545trimmed_reads = trim_primers(original_reads, "GATGACGGTGT")
546count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq")
547print("Saved %i reads" % count)
548\end{minted}
549
550This form is more flexible if you want to do something more complicated
551where only some of the records are retained -- as shown in the next example.
552
553\subsection{Trimming off adaptor sequences}
554\label{sec:FASTQ-slicing-off-adaptor}
555
556This is essentially a simple extension to the previous example. We are going
557to going to pretend \texttt{GATGACGGTGT} is an adaptor sequence in some FASTQ
558formatted read data, again the \texttt{SRR020192.fastq} file from the NCBI
559(\url{ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR020/SRR020192/SRR020192.fastq.gz}).
560
561This time however, we will look for the sequence \emph{anywhere} in the reads,
562not just at the very beginning:
563
564\begin{minted}{python}
565from Bio import SeqIO
566
567
568def trim_adaptors(records, adaptor):
569    """Trims perfect adaptor sequences.
570
571    This is a generator function, the records argument should
572    be a list or iterator returning SeqRecord objects.
573    """
574    len_adaptor = len(adaptor)  # cache this for later
575    for record in records:
576        index = record.seq.find(adaptor)
577        if index == -1:
578            # adaptor not found, so won't trim
579            yield record
580        else:
581            # trim off the adaptor
582            yield record[index + len_adaptor :]
583
584
585original_reads = SeqIO.parse("SRR020192.fastq", "fastq")
586trimmed_reads = trim_adaptors(original_reads, "GATGACGGTGT")
587count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq")
588print("Saved %i reads" % count)
589\end{minted}
590
591Because we are using a FASTQ input file in this example, the \verb|SeqRecord|
592objects have per-letter-annotation for the quality scores. By slicing the
593\verb|SeqRecord| object the appropriate scores are used on the trimmed
594records, so we can output them as a FASTQ file too.
595
596Compared to the output of the previous example where we only looked for
597a primer/adaptor at the start of each read, you may find some of the
598trimmed reads are quite short after trimming (e.g. if the adaptor was
599found in the middle rather than near the start). So, let's add a minimum
600length requirement as well:
601
602\begin{minted}{python}
603from Bio import SeqIO
604
605
606def trim_adaptors(records, adaptor, min_len):
607    """Trims perfect adaptor sequences, checks read length.
608
609    This is a generator function, the records argument should
610    be a list or iterator returning SeqRecord objects.
611    """
612    len_adaptor = len(adaptor)  # cache this for later
613    for record in records:
614        len_record = len(record)  # cache this for later
615        if len(record) < min_len:
616            # Too short to keep
617            continue
618        index = record.seq.find(adaptor)
619        if index == -1:
620            # adaptor not found, so won't trim
621            yield record
622        elif len_record - index - len_adaptor >= min_len:
623            # after trimming this will still be long enough
624            yield record[index + len_adaptor :]
625
626
627original_reads = SeqIO.parse("SRR020192.fastq", "fastq")
628trimmed_reads = trim_adaptors(original_reads, "GATGACGGTGT", 100)
629count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq")
630print("Saved %i reads" % count)
631\end{minted}
632
633By changing the format names, you could apply this to FASTA files instead.
634This code also could be extended to do a fuzzy match instead of an exact
635match (maybe using a pairwise alignment, or taking into account the read
636quality scores), but that will be much slower.
637
638\subsection{Converting FASTQ files}
639\label{sec:SeqIO-fastq-conversion}
640
641Back in Section~\ref{sec:SeqIO-conversion} we showed how to use
642\verb|Bio.SeqIO| to convert between two file formats. Here we'll go into a
643little more detail regarding FASTQ files which are used in second generation
644DNA sequencing. Please refer to Cock \textit{et al.} (2009) \cite{cock2010}
645for a longer description. FASTQ files store both the DNA sequence (as a string)
646and the associated read qualities.
647
648PHRED scores (used in most FASTQ files, and also in QUAL files, ACE files
649and SFF files) have become a \textit{de facto} standard for representing
650the probability of a sequencing error (here denoted by $P_e$) at a given
651base using a simple base ten log transformation:
652
653\begin{equation}
654Q_{\textrm{PHRED}} = - 10 \times \textrm{log}_{10} ( P_e )
655\end{equation}
656
657This means a wrong read ($P_e = 1$) gets a PHRED quality of $0$, while a very
658good read like $P_e = 0.00001$ gets a PHRED quality of $50$. While for raw
659sequencing data qualities higher than this are rare, with post processing
660such as read mapping or assembly, qualities of up to about $90$ are possible
661(indeed, the MAQ tool allows for PHRED scores in the range 0 to 93 inclusive).
662
663The FASTQ format has the potential to become a \textit{de facto} standard for
664storing the letters and quality scores for a sequencing read in a single plain
665text file. The only fly in the ointment is that there are at least three
666versions of the FASTQ format which are incompatible and difficult to
667distinguish...
668
669\begin{enumerate}
670\item The original Sanger FASTQ format uses PHRED qualities encoded with an
671ASCII offset of 33. The NCBI are using this format in their Short Read
672Archive. We call this the \texttt{fastq} (or \texttt{fastq-sanger}) format
673in \verb|Bio.SeqIO|.
674\item Solexa (later bought by Illumina) introduced their own version using
675Solexa qualities encoded with an ASCII offset of 64. We call this the
676\texttt{fastq-solexa} format.
677\item Illumina pipeline 1.3 onwards produces FASTQ files with PHRED qualities
678(which is more consistent), but encoded with an ASCII offset of 64. We call
679this the \texttt{fastq-illumina} format.
680\end{enumerate}
681
682The Solexa quality scores are defined using a different log transformation:
683
684\begin{equation}
685Q_{\textrm{Solexa}} = - 10 \times \textrm{log}_{10} \left( \frac{P_e}{1-P_e} \right)
686\end{equation}
687
688Given Solexa/Illumina have now moved to using PHRED scores in version 1.3 of
689their pipeline, the Solexa quality scores will gradually fall out of use.
690If you equate the error estimates ($P_e$) these two equations allow conversion
691between the two scoring systems - and Biopython includes functions to do this
692in the \verb|Bio.SeqIO.QualityIO| module, which are called if you use
693\verb|Bio.SeqIO| to convert an old Solexa/Illumina file into a standard Sanger
694FASTQ file:
695
696\begin{minted}{python}
697from Bio import SeqIO
698
699SeqIO.convert("solexa.fastq", "fastq-solexa", "standard.fastq", "fastq")
700\end{minted}
701
702If you want to convert a new Illumina 1.3+ FASTQ file, all that gets changed
703is the ASCII offset because although encoded differently the scores are all
704PHRED qualities:
705
706\begin{minted}{python}
707from Bio import SeqIO
708
709SeqIO.convert("illumina.fastq", "fastq-illumina", "standard.fastq", "fastq")
710\end{minted}
711
712Note that using \verb|Bio.SeqIO.convert()| like this is \emph{much} faster
713than combining \verb|Bio.SeqIO.parse()| and \verb|Bio.SeqIO.write()|
714because optimised code is used for converting between FASTQ variants
715(and also for FASTQ to FASTA conversion).
716
717For good quality reads, PHRED and Solexa scores are approximately equal,
718which means since both the \texttt{fasta-solexa} and \texttt{fastq-illumina}
719formats use an ASCII offset of 64 the files are almost the same. This was a
720deliberate design choice by Illumina, meaning applications expecting the old
721\texttt{fasta-solexa} style files will probably be OK using the newer
722\texttt{fastq-illumina} files (on good data). Of course, both variants are
723very different from the original FASTQ standard as used by Sanger,
724the NCBI, and elsewhere (format name \texttt{fastq} or \texttt{fastq-sanger}).
725
726For more details, see the built in help (also \href{http://www.biopython.org/docs/\bpversion/api/Bio.SeqIO.QualityIO.html}{online}):
727
728\begin{minted}{pycon}
729>>> from Bio.SeqIO import QualityIO
730>>> help(QualityIO)
731...
732\end{minted}
733
734\subsection{Converting FASTA and QUAL files into FASTQ files}
735\label{sec:SeqIO-fasta-qual-conversion}
736
737FASTQ files hold \emph{both} sequences and their quality strings.
738FASTA files hold \emph{just} sequences, while QUAL files hold \emph{just}
739the qualities. Therefore a single FASTQ file can be converted to or from
740\emph{paired} FASTA and QUAL files.
741
742Going from FASTQ to FASTA is easy:
743
744\begin{minted}{python}
745from Bio import SeqIO
746
747SeqIO.convert("example.fastq", "fastq", "example.fasta", "fasta")
748\end{minted}
749
750Going from FASTQ to QUAL is also easy:
751
752\begin{minted}{python}
753from Bio import SeqIO
754
755SeqIO.convert("example.fastq", "fastq", "example.qual", "qual")
756\end{minted}
757
758However, the reverse is a little more tricky. You can use \verb|Bio.SeqIO.parse()|
759to iterate over the records in a \emph{single} file, but in this case we have
760two input files. There are several strategies possible, but assuming that the
761two files are really paired the most memory efficient way is to loop over both
762together. The code is a little fiddly, so we provide a function called
763\verb|PairedFastaQualIterator| in the \verb|Bio.SeqIO.QualityIO| module to do
764this. This takes two handles (the FASTA file and the QUAL file) and returns
765a \verb|SeqRecord| iterator:
766
767\begin{minted}{python}
768from Bio.SeqIO.QualityIO import PairedFastaQualIterator
769
770for record in PairedFastaQualIterator(open("example.fasta"), open("example.qual")):
771    print(record)
772\end{minted}
773
774This function will check that the FASTA and QUAL files are consistent (e.g.
775the records are in the same order, and have the same sequence length).
776You can combine this with the \verb|Bio.SeqIO.write()| function to convert a
777pair of FASTA and QUAL files into a single FASTQ files:
778
779\begin{minted}{python}
780from Bio import SeqIO
781from Bio.SeqIO.QualityIO import PairedFastaQualIterator
782
783with open("example.fasta") as f_handle, open("example.qual") as q_handle:
784    records = PairedFastaQualIterator(f_handle, q_handle)
785    count = SeqIO.write(records, "temp.fastq", "fastq")
786print("Converted %i records" % count)
787\end{minted}
788
789\subsection{Indexing a FASTQ file}
790\label{sec:fastq-indexing}
791
792FASTQ files are usually very large, with millions of reads in them. Due to the
793sheer amount of data, you can't load all the records into memory at once.
794This is why the examples above (filtering and trimming) iterate over the file
795looking at just one \verb|SeqRecord| at a time.
796
797However, sometimes you can't use a big loop or an iterator - you may need
798random access to the reads. Here the \verb|Bio.SeqIO.index()| function
799may prove very helpful, as it allows you to access any read in the FASTQ file
800by its name (see Section~\ref{sec:SeqIO-index}).
801
802Again we'll use the \texttt{SRR020192.fastq} file from the ENA
803(\url{ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR020/SRR020192/SRR020192.fastq.gz}),
804although this is actually quite a small FASTQ file with less than $50,000$ reads:
805
806\begin{minted}{pycon}
807>>> from Bio import SeqIO
808>>> fq_dict = SeqIO.index("SRR020192.fastq", "fastq")
809>>> len(fq_dict)
81041892
811>>> list(fq_dict.keys())[:4]
812['SRR020192.38240', 'SRR020192.23181', 'SRR020192.40568', 'SRR020192.23186']
813>>> fq_dict["SRR020192.23186"].seq
814Seq('GTCCCAGTATTCGGATTTGTCTGCCAAAACAATGAAATTGACACAGTTTACAAC...CCG')
815\end{minted}
816
817When testing this on a FASTQ file with seven million reads,
818indexing took about a minute, but record access was almost instant.
819
820The sister function \verb|Bio.SeqIO.index_db()| lets you save the index
821to an SQLite3 database file for near instantaneous reuse - see
822Section~\ref{sec:SeqIO-index} for more details.
823
824The example in Section~\ref{sec:SeqIO-sort} show how you can use the
825\verb|Bio.SeqIO.index()| function to sort a large FASTA file -- this
826could also be used on FASTQ files.
827
828\subsection{Converting SFF files}
829\label{sec:SeqIO-sff-conversion}
830
831If you work with 454 (Roche) sequence data, you will probably have access
832to the raw data as a Standard Flowgram Format (SFF) file. This contains
833the sequence reads (called bases) with quality scores and the original
834flow information.
835
836A common task is to convert from SFF to a pair of FASTA and QUAL files,
837or to a single FASTQ file. These operations are trivial using the
838\verb|Bio.SeqIO.convert()| function (see Section~\ref{sec:SeqIO-conversion}):
839
840\begin{minted}{pycon}
841>>> from Bio import SeqIO
842>>> SeqIO.convert("E3MFGYR02_random_10_reads.sff", "sff", "reads.fasta", "fasta")
84310
844>>> SeqIO.convert("E3MFGYR02_random_10_reads.sff", "sff", "reads.qual", "qual")
84510
846>>> SeqIO.convert("E3MFGYR02_random_10_reads.sff", "sff", "reads.fastq", "fastq")
84710
848\end{minted}
849
850\noindent Remember the convert function returns the number of records, in
851this example just ten. This will give you the \emph{untrimmed} reads, where
852the leading and trailing poor quality sequence or adaptor will be in lower
853case. If you want the \emph{trimmed} reads (using the clipping information
854recorded within the SFF file) use this:
855
856\begin{minted}{pycon}
857>>> from Bio import SeqIO
858>>> SeqIO.convert("E3MFGYR02_random_10_reads.sff", "sff-trim", "trimmed.fasta", "fasta")
85910
860>>> SeqIO.convert("E3MFGYR02_random_10_reads.sff", "sff-trim", "trimmed.qual", "qual")
86110
862>>> SeqIO.convert("E3MFGYR02_random_10_reads.sff", "sff-trim", "trimmed.fastq", "fastq")
86310
864\end{minted}
865
866If you run Linux, you could ask Roche for a copy of their ``off instrument''
867tools (often referred to as the Newbler tools). This offers an alternative way to
868do SFF to FASTA or QUAL conversion at the command line (but currently FASTQ output
869is not supported), e.g.
870
871\begin{minted}{console}
872$ sffinfo -seq -notrim E3MFGYR02_random_10_reads.sff > reads.fasta
873$ sffinfo -qual -notrim E3MFGYR02_random_10_reads.sff > reads.qual
874$ sffinfo -seq -trim E3MFGYR02_random_10_reads.sff > trimmed.fasta
875$ sffinfo -qual -trim E3MFGYR02_random_10_reads.sff > trimmed.qual
876\end{minted}
877
878\noindent The way Biopython uses mixed case sequence strings to represent
879the trimming points deliberately mimics what the Roche tools do.
880
881For more information on the Biopython SFF support, consult the built in help:
882
883\begin{minted}{pycon}
884>>> from Bio.SeqIO import SffIO
885>>> help(SffIO)
886...
887\end{minted}
888
889\subsection{Identifying open reading frames}
890
891A very simplistic first step at identifying possible genes is to look for
892open reading frames (ORFs).  By this we mean look in all six frames for long
893regions without stop codons -- an ORF is just a region of nucleotides with
894no in frame stop codons.
895
896Of course, to find a gene you would also need to worry about locating a start
897codon, possible promoters -- and in Eukaryotes there are introns to worry about
898too.  However, this approach is still useful in viruses and Prokaryotes.
899
900To show how you might approach this with Biopython, we'll need a sequence to
901search, and as an example we'll again use the bacterial plasmid -- although
902this time we'll start with a plain FASTA file with no pre-marked genes:
903\href{https://raw.githubusercontent.com/biopython/biopython/master/Tests/GenBank/NC_005816.fna}
904{\texttt{NC\_005816.fna}}. This is a bacterial sequence, so we'll want to use
905NCBI codon table 11 (see Section~\ref{sec:translation} about translation).
906
907%doctest ../Tests/GenBank
908\begin{minted}{pycon}
909>>> from Bio import SeqIO
910>>> record = SeqIO.read("NC_005816.fna", "fasta")
911>>> table = 11
912>>> min_pro_len = 100
913\end{minted}
914
915Here is a neat trick using the \verb|Seq| object's \verb|split| method to
916get a list of all the possible ORF translations in the six reading frames:
917
918%cont-doctest
919\begin{minted}{pycon}
920>>> for strand, nuc in [(+1, record.seq), (-1, record.seq.reverse_complement())]:
921...     for frame in range(3):
922...         length = 3 * ((len(record)-frame) // 3) #Multiple of three
923...         for pro in nuc[frame:frame+length].translate(table).split("*"):
924...             if len(pro) >= min_pro_len:
925...                 print("%s...%s - length %i, strand %i, frame %i" \
926...                       % (pro[:30], pro[-3:], len(pro), strand, frame))
927GCLMKKSSIVATIITILSGSANAASSQLIP...YRF - length 315, strand 1, frame 0
928KSGELRQTPPASSTLHLRLILQRSGVMMEL...NPE - length 285, strand 1, frame 1
929GLNCSFFSICNWKFIDYINRLFQIIYLCKN...YYH - length 176, strand 1, frame 1
930VKKILYIKALFLCTVIKLRRFIFSVNNMKF...DLP - length 165, strand 1, frame 1
931NQIQGVICSPDSGEFMVTFETVMEIKILHK...GVA - length 355, strand 1, frame 2
932RRKEHVSKKRRPQKRPRRRRFFHRLRPPDE...PTR - length 128, strand 1, frame 2
933TGKQNSCQMSAIWQLRQNTATKTRQNRARI...AIK - length 100, strand 1, frame 2
934QGSGYAFPHASILSGIAMSHFYFLVLHAVK...CSD - length 114, strand -1, frame 0
935IYSTSEHTGEQVMRTLDEVIASRSPESQTR...FHV - length 111, strand -1, frame 0
936WGKLQVIGLSMWMVLFSQRFDDWLNEQEDA...ESK - length 125, strand -1, frame 1
937RGIFMSDTMVVNGSGGVPAFLFSGSTLSSY...LLK - length 361, strand -1, frame 1
938WDVKTVTGVLHHPFHLTFSLCPEGATQSGR...VKR - length 111, strand -1, frame 1
939LSHTVTDFTDQMAQVGLCQCVNVFLDEVTG...KAA - length 107, strand -1, frame 2
940RALTGLSAPGIRSQTSCDRLRELRYVPVSL...PLQ - length 119, strand -1, frame 2
941\end{minted}
942
943Note that here we are counting the frames from the 5' end (start) of
944\emph{each} strand. It is sometimes easier to always count from the 5' end
945(start) of the \emph{forward} strand.
946
947You could easily edit the above loop based code to build up a list of the
948candidate proteins, or convert this to a list comprehension.  Now, one thing
949this code doesn't do is keep track of where the proteins are.
950
951You could tackle this in several ways.  For example, the following code tracks
952the locations in terms of the protein counting, and converts back to the
953parent sequence by multiplying by three, then adjusting for the frame and
954strand:
955
956\begin{minted}{python}
957from Bio import SeqIO
958
959record = SeqIO.read("NC_005816.gb", "genbank")
960table = 11
961min_pro_len = 100
962
963
964def find_orfs_with_trans(seq, trans_table, min_protein_length):
965    answer = []
966    seq_len = len(seq)
967    for strand, nuc in [(+1, seq), (-1, seq.reverse_complement())]:
968        for frame in range(3):
969            trans = nuc[frame:].translate(trans_table)
970            trans_len = len(trans)
971            aa_start = 0
972            aa_end = 0
973            while aa_start < trans_len:
974                aa_end = trans.find("*", aa_start)
975                if aa_end == -1:
976                    aa_end = trans_len
977                if aa_end - aa_start >= min_protein_length:
978                    if strand == 1:
979                        start = frame + aa_start * 3
980                        end = min(seq_len, frame + aa_end * 3 + 3)
981                    else:
982                        start = seq_len - frame - aa_end * 3 - 3
983                        end = seq_len - frame - aa_start * 3
984                    answer.append((start, end, strand, trans[aa_start:aa_end]))
985                aa_start = aa_end + 1
986    answer.sort()
987    return answer
988
989
990orf_list = find_orfs_with_trans(record.seq, table, min_pro_len)
991for start, end, strand, pro in orf_list:
992    print(
993        "%s...%s - length %i, strand %i, %i:%i"
994        % (pro[:30], pro[-3:], len(pro), strand, start, end)
995    )
996\end{minted}
997
998\noindent And the output:
999
1000\begin{minted}{text}
1001NQIQGVICSPDSGEFMVTFETVMEIKILHK...GVA - length 355, strand 1, 41:1109
1002WDVKTVTGVLHHPFHLTFSLCPEGATQSGR...VKR - length 111, strand -1, 491:827
1003KSGELRQTPPASSTLHLRLILQRSGVMMEL...NPE - length 285, strand 1, 1030:1888
1004RALTGLSAPGIRSQTSCDRLRELRYVPVSL...PLQ - length 119, strand -1, 2830:3190
1005RRKEHVSKKRRPQKRPRRRRFFHRLRPPDE...PTR - length 128, strand 1, 3470:3857
1006GLNCSFFSICNWKFIDYINRLFQIIYLCKN...YYH - length 176, strand 1, 4249:4780
1007RGIFMSDTMVVNGSGGVPAFLFSGSTLSSY...LLK - length 361, strand -1, 4814:5900
1008VKKILYIKALFLCTVIKLRRFIFSVNNMKF...DLP - length 165, strand 1, 5923:6421
1009LSHTVTDFTDQMAQVGLCQCVNVFLDEVTG...KAA - length 107, strand -1, 5974:6298
1010GCLMKKSSIVATIITILSGSANAASSQLIP...YRF - length 315, strand 1, 6654:7602
1011IYSTSEHTGEQVMRTLDEVIASRSPESQTR...FHV - length 111, strand -1, 7788:8124
1012WGKLQVIGLSMWMVLFSQRFDDWLNEQEDA...ESK - length 125, strand -1, 8087:8465
1013TGKQNSCQMSAIWQLRQNTATKTRQNRARI...AIK - length 100, strand 1, 8741:9044
1014QGSGYAFPHASILSGIAMSHFYFLVLHAVK...CSD - length 114, strand -1, 9264:9609
1015\end{minted}
1016
1017If you comment out the sort statement, then the protein sequences will be
1018shown in the same order as before, so you can check this is doing the same
1019thing.  Here we have sorted them by location to make it easier to compare
1020to the actual annotation in the GenBank file (as visualised in
1021Section~\ref{sec:gd_nice_example}).
1022
1023If however all you want to find are the locations of the open reading frames,
1024then it is a waste of time to translate every possible codon, including doing
1025the reverse complement to search the reverse strand too.  All you need to do
1026is search for the possible stop codons (and their reverse complements).  Using
1027regular expressions is an obvious approach here (see the Python module
1028\verb|re|). These are an extremely powerful (but rather complex) way of
1029describing search strings, which are supported in lots of programming
1030languages and also command line tools like \texttt{grep} as well). You can
1031find whole books about this topic!
1032
1033
1034\section{Sequence parsing plus simple plots}
1035\label{sec:sequence-parsing-plus-pylab}
1036
1037This section shows some more examples of sequence parsing, using the
1038\verb|Bio.SeqIO| module described in Chapter~\ref{chapter:seqio},
1039plus the Python library matplotlib's \verb|pylab| plotting interface
1040(see \href{https://matplotlib.org}{the matplotlib website
1041for a tutorial}).  Note that to follow these examples you will need
1042matplotlib installed - but without it you can still try the data
1043parsing bits.
1044
1045\subsection{Histogram of sequence lengths}
1046
1047There are lots of times when you might want to visualise the distribution of sequence
1048lengths in a dataset -- for example the range of contig sizes in a genome assembly
1049project.  In this example we'll reuse our orchid FASTA file
1050\href{https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/ls_orchid.fasta}{ls\_orchid.fasta}
1051which has only 94 sequences.
1052
1053First of all, we will use \verb|Bio.SeqIO| to parse the FASTA file and compile a list
1054of all the sequence lengths.  You could do this with a for loop, but I find a list
1055comprehension more pleasing:
1056
1057\begin{minted}{pycon}
1058>>> from Bio import SeqIO
1059>>> sizes = [len(rec) for rec in SeqIO.parse("ls_orchid.fasta", "fasta")]
1060>>> len(sizes), min(sizes), max(sizes)
1061(94, 572, 789)
1062>>> sizes
1063[740, 753, 748, 744, 733, 718, 730, 704, 740, 709, 700, 726, ..., 592]
1064\end{minted}
1065
1066Now that we have the lengths of all the genes (as a list of integers), we can use the
1067matplotlib histogram function to display it.
1068
1069\begin{minted}{python}
1070from Bio import SeqIO
1071
1072sizes = [len(rec) for rec in SeqIO.parse("ls_orchid.fasta", "fasta")]
1073
1074import pylab
1075
1076pylab.hist(sizes, bins=20)
1077pylab.title(
1078    "%i orchid sequences\nLengths %i to %i" % (len(sizes), min(sizes), max(sizes))
1079)
1080pylab.xlabel("Sequence length (bp)")
1081pylab.ylabel("Count")
1082pylab.show()
1083\end{minted}
1084
1085%
1086% Have a HTML version and a PDF version to display nicely...
1087%
1088\begin{htmlonly}
1089\noindent That should pop up a new window containing the following graph:
1090
1091\imgsrc[width=600, height=450]{images/hist_plot.png}
1092
1093\end{htmlonly}
1094%
1095% Now the PDF equivalent where we cannot always expect the figure
1096% to be positioned right next to the text, so will use a reference.
1097%
1098\begin{latexonly}
1099\begin{figure}[htbp]
1100\centering
1101\includegraphics[width=0.8\textwidth]{images/hist_plot.png}
1102\caption{Histogram of orchid sequence lengths.}
1103\label{fig:seq-len-hist}
1104\end{figure}
1105\noindent That should pop up a new window containing the graph
1106shown in Figure~\ref{fig:seq-len-hist}.
1107\end{latexonly}
1108%
1109% The text now continues...
1110%
1111Notice that most of these orchid sequences are about $740$ bp long, and there could be
1112two distinct classes of sequence here with a subset of shorter sequences.
1113
1114\emph{Tip:} Rather than using \verb|pylab.show()| to show the plot in a window,
1115you can also use \verb|pylab.savefig(...)| to save the figure to a file
1116(e.g. as a PNG or PDF).
1117
1118\subsection{Plot of sequence GC\%}
1119
1120Another easily calculated quantity of a nucleotide sequence is the GC\%.  You might
1121want to look at the GC\% of all the genes in a bacterial genome for example, and
1122investigate any outliers which could have been recently acquired by horizontal gene
1123transfer.  Again, for this example we'll reuse our orchid FASTA file
1124\href{https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/ls_orchid.fasta}{ls\_orchid.fasta}.
1125
1126First of all, we will use \verb|Bio.SeqIO| to parse the FASTA file and compile a list
1127of all the GC percentages.  Again, you could do this with a for loop, but I prefer this:
1128
1129\begin{minted}{python}
1130from Bio import SeqIO
1131from Bio.SeqUtils import GC
1132
1133gc_values = sorted(GC(rec.seq) for rec in SeqIO.parse("ls_orchid.fasta", "fasta"))
1134\end{minted}
1135
1136Having read in each sequence and calculated the GC\%, we then sorted them into ascending
1137order. Now we'll take this list of floating point values and plot them with matplotlib:
1138
1139\begin{minted}{python}
1140import pylab
1141
1142pylab.plot(gc_values)
1143pylab.title(
1144    "%i orchid sequences\nGC%% %0.1f to %0.1f"
1145    % (len(gc_values), min(gc_values), max(gc_values))
1146)
1147pylab.xlabel("Genes")
1148pylab.ylabel("GC%")
1149pylab.show()
1150\end{minted}
1151
1152%
1153% Have an HTML version and a PDF version to display nicely...
1154%
1155\begin{htmlonly}
1156\noindent As in the previous example, that should pop up a new window containing a graph:
1157
1158\imgsrc[width=600, height=450]{images/gc_plot.png}
1159
1160\end{htmlonly}
1161%
1162% Now the PDF equivalent where we cannot always expect the figure
1163% to be positioned right next to the text, so we'll use a reference.
1164%
1165\begin{latexonly}
1166\begin{figure}[htbp]
1167\centering
1168\includegraphics[width=0.8\textwidth]{images/gc_plot.png}
1169\caption{Histogram of orchid sequence lengths.}
1170\label{fig:seq-gc-plot}
1171\end{figure}
1172\noindent As in the previous example, that should pop up a new window with the graph shown in Figure~\ref{fig:seq-gc-plot}.
1173\end{latexonly}
1174%
1175% The text now continues...
1176%
1177If you tried this on the full set of genes from one organism, you'd probably get a much
1178smoother plot than this.
1179
1180\subsection{Nucleotide dot plots}
1181A dot plot is a way of visually comparing two nucleotide sequences for similarity to
1182each other.  A sliding window is used to compare short sub-sequences to each other,
1183often with a mis-match threshold.  Here for simplicity we'll only look for perfect
1184matches (shown in black
1185\begin{latexonly}
1186in Figure~\ref{fig:nuc-dot-plot}).
1187\end{latexonly}
1188\begin{htmlonly}
1189in the plot below).
1190\end{htmlonly}
1191
1192%
1193% Now the PDF equivalent where we cannot always expect the figure
1194% to be positioned right next to the text, so we'll use a reference.
1195%
1196\begin{latexonly}
1197\begin{figure}[htbp]
1198\centering
1199\includegraphics[width=0.8\textwidth]{images/dot_plot.png}
1200\caption{Nucleotide dot plot of two orchid sequence lengths (using pylab's imshow function).}
1201\label{fig:nuc-dot-plot}
1202\end{figure}
1203\end{latexonly}
1204
1205To start off, we'll need two sequences.  For the sake of argument, we'll just take
1206the first two from our orchid FASTA file \href{https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/ls_orchid.fasta}{ls\_orchid.fasta}:
1207
1208\begin{minted}{python}
1209from Bio import SeqIO
1210
1211with open("ls_orchid.fasta") as in_handle:
1212    record_iterator = SeqIO.parse(in_handle, "fasta")
1213    rec_one = next(record_iterator)
1214    rec_two = next(record_iterator)
1215\end{minted}
1216
1217We're going to show two approaches.  Firstly, a simple naive implementation
1218which compares all the window sized sub-sequences to each other to compiles a
1219similarity matrix.  You could construct a matrix or array object, but here we
1220just use a list of lists of booleans created with a nested list
1221comprehension:
1222
1223\begin{minted}{python}
1224window = 7
1225seq_one = rec_one.seq.upper()
1226seq_two = rec_two.seq.upper()
1227data = [
1228    [
1229        (seq_one[i : i + window] != seq_two[j : j + window])
1230        for j in range(len(seq_one) - window)
1231    ]
1232    for i in range(len(seq_two) - window)
1233]
1234\end{minted}
1235
1236Note that we have \emph{not} checked for reverse complement matches here.
1237Now we'll use the matplotlib's \verb|pylab.imshow()| function to display this
1238data, first requesting the gray color scheme so this is done in black and
1239white:
1240
1241\begin{minted}{python}
1242import pylab
1243
1244pylab.gray()
1245pylab.imshow(data)
1246pylab.xlabel("%s (length %i bp)" % (rec_one.id, len(rec_one)))
1247pylab.ylabel("%s (length %i bp)" % (rec_two.id, len(rec_two)))
1248pylab.title("Dot plot using window size %i\n(allowing no mis-matches)" % window)
1249pylab.show()
1250\end{minted}
1251%pylab.savefig("dot_plot.png", dpi=75)
1252%pylab.savefig("dot_plot.pdf")
1253
1254%
1255% Have a HTML version and a PDF version to display nicely...
1256%
1257\begin{htmlonly}
1258\noindent That should pop up a new window containing a graph like this:
1259
1260\imgsrc[width=600, height=450]{images/dot_plot.png}
1261
1262\end{htmlonly}
1263\begin{latexonly}
1264\noindent That should pop up a new window showing the graph in Figure~\ref{fig:nuc-dot-plot}.
1265\end{latexonly}
1266%
1267% The text now continues...
1268%
1269As you might have expected, these two sequences are very similar with a
1270partial line of window sized matches along the diagonal.  There are no off
1271diagonal matches which would be indicative of inversions or other interesting
1272events.
1273
1274The above code works fine on small examples, but there are two problems
1275applying this to larger sequences, which we will address below.
1276First off all, this brute force approach to the all against all comparisons
1277is very slow.  Instead, we'll compile dictionaries mapping the window sized
1278sub-sequences to their locations, and then take the set intersection to find
1279those sub-sequences found in both sequences. This uses more memory, but is
1280\emph{much} faster.  Secondly, the \verb|pylab.imshow()| function is limited
1281in the size of matrix it can display.  As an alternative, we'll use the
1282\verb|pylab.scatter()| function.
1283
1284We start by creating dictionaries mapping the window-sized sub-sequences to locations:
1285\begin{minted}{python}
1286window = 7
1287dict_one = {}
1288dict_two = {}
1289for (seq, section_dict) in [
1290    (rec_one.seq.upper(), dict_one),
1291    (rec_two.seq.upper(), dict_two),
1292]:
1293    for i in range(len(seq) - window):
1294        section = seq[i : i + window]
1295        try:
1296            section_dict[section].append(i)
1297        except KeyError:
1298            section_dict[section] = [i]
1299# Now find any sub-sequences found in both sequences
1300matches = set(dict_one).intersection(dict_two)
1301print("%i unique matches" % len(matches))
1302\end{minted}
1303\noindent In order to use the \verb|pylab.scatter()| we need separate lists for the $x$ and $y$ co-ordinates:
1304\begin{minted}{python}
1305# Create lists of x and y co-ordinates for scatter plot
1306x = []
1307y = []
1308for section in matches:
1309    for i in dict_one[section]:
1310        for j in dict_two[section]:
1311            x.append(i)
1312            y.append(j)
1313\end{minted}
1314\noindent We are now ready to draw the revised dot plot as a scatter plot:
1315\begin{minted}{python}
1316import pylab
1317
1318pylab.cla()  # clear any prior graph
1319pylab.gray()
1320pylab.scatter(x, y)
1321pylab.xlim(0, len(rec_one) - window)
1322pylab.ylim(0, len(rec_two) - window)
1323pylab.xlabel("%s (length %i bp)" % (rec_one.id, len(rec_one)))
1324pylab.ylabel("%s (length %i bp)" % (rec_two.id, len(rec_two)))
1325pylab.title("Dot plot using window size %i\n(allowing no mis-matches)" % window)
1326pylab.show()
1327\end{minted}
1328%pylab.savefig("dot_plot_scatter.png", dpi=75)
1329%pylab.savefig("dot_plot_scatter.pdf")
1330%
1331% Have a HTML version and a PDF version to display nicely...
1332%
1333\begin{htmlonly}
1334\noindent That should pop up a new window containing a graph like this:
1335
1336\imgsrc[width=600, height=450]{images/dot_plot_scatter.png}
1337
1338\end{htmlonly}
1339\begin{latexonly}
1340\noindent That should pop up a new window showing the graph in Figure~\ref{fig:nuc-dot-plot-scatter}.
1341\begin{figure}[htbp]
1342\centering
1343\includegraphics[width=0.8\textwidth]{images/dot_plot_scatter.png}
1344\caption{Nucleotide dot plot of two orchid sequence lengths (using pylab's scatter function).}
1345\label{fig:nuc-dot-plot-scatter}
1346\end{figure}\end{latexonly}
1347Personally I find this second plot much easier to read!
1348Again note that we have \emph{not} checked for reverse complement matches here
1349-- you could extend this example to do this, and perhaps plot the forward
1350matches in one color and the reverse matches in another.
1351
1352\subsection{Plotting the quality scores of sequencing read data}
1353
1354If you are working with second generation sequencing data, you may want to try plotting
1355the quality data. Here is an example using two FASTQ files containing paired end reads,
1356\texttt{SRR001666\_1.fastq} for the forward reads, and  \texttt{SRR001666\_2.fastq} for
1357the reverse reads. These were downloaded from the ENA sequence read archive FTP site
1358(\url{ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001666/SRR001666_1.fastq.gz} and
1359\url{ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001666/SRR001666_2.fastq.gz}), and
1360are from \textit{E. coli} -- see \url{https://www.ebi.ac.uk/ena/data/view/SRR001666}
1361for details.
1362%Originally from ftp://ftp.ncbi.nlm.nih.gov/sra/static/SRX000/SRX000430/
1363
1364In the following code the \verb|pylab.subplot(...)| function is used in order to show
1365the forward and reverse qualities on two subplots, side by side. There is also a little
1366bit of code to only plot the first fifty reads.
1367
1368\begin{minted}{python}
1369import pylab
1370from Bio import SeqIO
1371
1372for subfigure in [1, 2]:
1373    filename = "SRR001666_%i.fastq" % subfigure
1374    pylab.subplot(1, 2, subfigure)
1375    for i, record in enumerate(SeqIO.parse(filename, "fastq")):
1376        if i >= 50:
1377            break  # trick!
1378        pylab.plot(record.letter_annotations["phred_quality"])
1379    pylab.ylim(0, 45)
1380    pylab.ylabel("PHRED quality score")
1381    pylab.xlabel("Position")
1382pylab.savefig("SRR001666.png")
1383print("Done")
1384\end{minted}
1385
1386You should note that we are using the \verb|Bio.SeqIO| format name \texttt{fastq}
1387here because the NCBI has saved these reads using the standard Sanger FASTQ format
1388with PHRED scores. However, as you might guess from the read lengths, this data was
1389from an Illumina Genome Analyzer and was probably originally in one of the two
1390Solexa/Illumina FASTQ variant file formats instead.
1391
1392This example uses the \verb|pylab.savefig(...)| function instead of
1393\verb|pylab.show(...)|, but as mentioned before both are useful.
1394\begin{latexonly}
1395\begin{figure}[htbp]
1396\centering
1397\includegraphics[width=0.8\textwidth]{images/SRR001666.png}
1398\caption{Quality plot for some paired end reads.}
1399\label{fig:paired-end-qual-plot}
1400\end{figure}
1401The result is shown in Figure~\ref{fig:paired-end-qual-plot}.
1402\end{latexonly}
1403\begin{htmlonly}
1404Here is the result:
1405
1406%Blank lines here are important!
1407\imgsrc[width=600, height=600]{images/SRR001666.png}
1408
1409\end{htmlonly}
1410
1411\section{Dealing with alignments}
1412
1413This section can been seen as a follow on to Chapter~\ref{chapter:align}.
1414
1415\subsection{Calculating summary information}
1416\label{sec:summary_info}
1417
1418Once you have an alignment, you are very likely going to want to find out information about it. Instead of trying to have all of the functions that can generate information about an alignment in the alignment object itself, we've tried to separate out the functionality into separate classes, which act on the alignment.
1419
1420Getting ready to calculate summary information about an object is quick to do. Let's say we've got an alignment object called \verb|alignment|, for example read in using \verb|Bio.AlignIO.read(...)| as described in Chapter~\ref{chapter:align}. All we need to do to get an object that will calculate summary information is:
1421
1422\begin{minted}{python}
1423from Bio.Align import AlignInfo
1424
1425summary_align = AlignInfo.SummaryInfo(alignment)
1426\end{minted}
1427
1428The \verb|summary_align| object is very useful, and will do the following neat things for you:
1429
1430\begin{enumerate}
1431  \item Calculate a quick consensus sequence -- see section~\ref{sec:consensus}
1432  \item Get a position specific score matrix for the alignment -- see section~\ref{sec:pssm}
1433  \item Calculate the information content for the alignment -- see section~\ref{sec:getting_info_content}
1434  \item Generate information on substitutions in the alignment -- section~\ref{sec:sub_matrix} details using this to generate a substitution matrix.
1435\end{enumerate}
1436
1437\subsection{Calculating a quick consensus sequence}
1438\label{sec:consensus}
1439
1440The \verb|SummaryInfo| object, described in section~\ref{sec:summary_info}, provides functionality to calculate a quick consensus of an alignment. Assuming we've got a \verb|SummaryInfo| object called \verb|summary_align| we can calculate a consensus by doing:
1441
1442\begin{minted}{python}
1443consensus = summary_align.dumb_consensus()
1444\end{minted}
1445
1446As the name suggests, this is a really simple consensus calculator, and will just add up all of the residues at each point in the consensus, and if the most common value is higher than some threshold value will add the common residue to the consensus. If it doesn't reach the threshold, it adds an ambiguity character to the consensus. The returned consensus object is a \verb|Seq| object.
1447
1448You can adjust how \verb|dumb_consensus| works by passing optional parameters:
1449
1450\begin{description}
1451\item[the threshold] This is the threshold specifying how common a particular residue has to be at a position before it is added. The default is $0.7$ (meaning $70\%$).
1452
1453\item[the ambiguous character] This is the ambiguity character to use. The default is 'N'.
1454
1455\end{description}
1456
1457\subsection{Position Specific Score Matrices}
1458\label{sec:pssm}
1459
1460Position specific score matrices (PSSMs) summarize the alignment information in a different way than a consensus, and may be useful for different tasks. Basically, a PSSM is a count matrix. For each column in the alignment, the number of each alphabet letters is counted and totaled. The totals are displayed relative to some representative sequence along the left axis. This sequence may be the consesus sequence, but can also be any sequence in the alignment. For instance for the alignment,
1461
1462\begin{minted}{text}
1463GTATC
1464AT--C
1465CTGTC
1466\end{minted}
1467
1468\noindent the PSSM is:
1469
1470\begin{minted}{text}
1471      G A T C
1472    G 1 1 0 1
1473    T 0 0 3 0
1474    A 1 1 0 0
1475    T 0 0 2 0
1476    C 0 0 0 3
1477\end{minted}
1478
1479Let's assume we've got an alignment object called \verb|c_align|. To get a PSSM with the consensus sequence along the side we first get a summary object and calculate the consensus sequence:
1480
1481\begin{minted}{python}
1482summary_align = AlignInfo.SummaryInfo(c_align)
1483consensus = summary_align.dumb_consensus()
1484\end{minted}
1485
1486Now, we want to make the PSSM, but ignore any \verb|N| ambiguity residues when calculating this:
1487
1488\begin{minted}{python}
1489my_pssm = summary_align.pos_specific_score_matrix(consensus, chars_to_ignore=["N"])
1490\end{minted}
1491
1492Two notes should be made about this:
1493
1494\begin{enumerate}
1495  \item To maintain strictness with the alphabets, you can only include characters along the top of the PSSM that are in the alphabet of the alignment object. Gaps are not included along the top axis of the PSSM.
1496
1497  \item The sequence passed to be displayed along the left side of the axis does not need to be the consensus. For instance, if you wanted to display the second sequence in  the alignment along this axis, you would need to do:
1498
1499\begin{minted}{python}
1500second_seq = alignment.get_seq_by_num(1)
1501my_pssm = summary_align.pos_specific_score_matrix(second_seq, chars_to_ignore=["N"])
1502\end{minted}
1503
1504\end{enumerate}
1505
1506The command above returns a \verb|PSSM| object.
1507To print out the PSSM as shown above,
1508we simply need to do a \verb|print(my_pssm)|, which gives:
1509
1510\begin{minted}{text}
1511    A   C   G   T
1512T  0.0 0.0 0.0 7.0
1513A  7.0 0.0 0.0 0.0
1514T  0.0 0.0 0.0 7.0
1515A  7.0 0.0 0.0 0.0
1516C  0.0 7.0 0.0 0.0
1517A  7.0 0.0 0.0 0.0
1518T  0.0 0.0 0.0 7.0
1519T  1.0 0.0 0.0 6.0
1520...
1521\end{minted}
1522
1523You can access any element of the PSSM by subscripting like \verb|your_pssm[sequence_number][residue_count_name]|. For instance, to get the counts for the 'A' residue in the second element of the above PSSM you would do:
1524
1525\begin{minted}{pycon}
1526>>> print(my_pssm[1]["A"])
15277.0
1528\end{minted}
1529
1530The structure of the PSSM class hopefully makes it easy both to access elements and to pretty print the matrix.
1531
1532\subsection{Information Content}
1533\label{sec:getting_info_content}
1534
1535A potentially useful measure of evolutionary conservation is the information content of a sequence.
1536
1537A useful introduction to information theory targeted towards molecular biologists can be found at \url{http://www.lecb.ncifcrf.gov/~toms/paper/primer/}. For our purposes, we will be looking at the information content of a consesus sequence, or a portion of a consensus sequence. We calculate information content at a particular column in a multiple sequence alignment using the following formula:
1538
1539\begin{displaymath}
1540IC_{j} = \sum_{i=1}^{N_{a}} P_{ij} \mathrm{log}\left(\frac{P_{ij}}{Q_{i}}\right)
1541\end{displaymath}
1542
1543\noindent where:
1544
1545\begin{itemize}
1546  \item $IC_{j}$ -- The information content for the $j$-th column in an alignment.
1547  \item $N_{a}$ -- The number of letters in the alphabet.
1548  \item $P_{ij}$ -- The frequency of a particular letter $i$ in the $j$-th column (i.~e.~if G occurred 3 out of 6 times in an aligment column, this would be 0.5)
1549  \item $Q_{i}$ --  The expected frequency of a letter $i$. This is an
1550  optional argument, usage of which is left at the user's
1551  discretion. By default, it is automatically assigned to $0.05 = 1/20$ for a
1552  protein alphabet, and $0.25 = 1/4$ for a nucleic acid alphabet. This is for
1553  geting the information content without any assumption of prior
1554  distributions. When assuming priors, or when using a non-standard
1555  alphabet, you should supply the values for $Q_{i}$.
1556\end{itemize}
1557
1558Well, now that we have an idea what information content is being calculated in Biopython, let's look at how to get it for a particular region of the alignment.
1559
1560First, we need to use our alignment to get an alignment summary object, which we'll assume is called \verb|summary_align| (see section~\ref{sec:summary_info}) for instructions on how to get this. Once we've got this object, calculating the information content for a region is as easy as:
1561
1562\begin{minted}{python}
1563info_content = summary_align.information_content(5, 30, chars_to_ignore=["N"])
1564\end{minted}
1565
1566Wow, that was much easier then the formula above made it look! The variable \verb|info_content| now contains a float value specifying the information content over the specified region (from 5 to 30 of the alignment). We specifically ignore the ambiguity residue 'N' when calculating the information content, since this value is not included in our alphabet (so we shouldn't be interested in looking at it!).
1567
1568As mentioned above, we can also calculate relative information content by supplying a dictionary with the expected frequencies:
1569
1570\begin{minted}{python}
1571expect_freq = {"A": 0.3, "G": 0.2, "T": 0.3, "C": 0.2}
1572info_content = summary_align.information_content(
1573    5, 30, e_freq_table=e_freq_table, chars_to_ignore=["N"]
1574)
1575\end{minted}
1576
1577Now, \verb|info_content| will contain the relative information content over the region in relation to the expected frequencies.
1578
1579The value return is calculated using base 2 as the logarithm base in the formula above. You can modify this by passing the parameter \verb|log_base| as the base you want:
1580
1581\begin{minted}{python}
1582info_content = summary_align.information_content(
1583    5, 30, log_base=10, chars_to_ignore=["N"]
1584)
1585\end{minted}
1586
1587By default nucleotide or amino acid residues with a frequency of 0 in a column are not take into account when the relative information column for that column is computed. If this is not the desired result, you can use \verb|pseudo_count| instead.
1588
1589\begin{minted}{python}
1590info_content = summary_align.information_content(
1591    5, 30, chars_to_ignore=["N"], pseudo_count=1
1592)
1593\end{minted}
1594
1595In this case, the observed frequency $P_{ij}$ of a particular letter $i$ in the $j$-th column is computed as follow :
1596
1597\begin{displaymath}
1598P_{ij} = \frac{n_{ij} + k\times Q_{i}}{N_{j} + k}
1599\end{displaymath}
1600
1601\noindent where:
1602
1603\begin{itemize}
1604  \item $k$ -- the pseudo count you pass as argument.
1605  \item $k$ -- the pseudo count you pass as argument.
1606  \item $Q_{i}$ --  The expected frequency of the letter $i$ as described above.
1607\end{itemize}
1608
1609Well, now you are ready to calculate information content. If you want to try applying this to some real life problems, it would probably be best to dig into the literature on information content to get an idea of how it is used. Hopefully your digging won't reveal any mistakes made in coding this function!
1610
1611\section{Substitution Matrices}
1612\label{sec:sub_matrix}
1613
1614Substitution matrices are an extremely important part of everyday bioinformatics work. They provide the scoring terms for classifying how likely two different residues are to substitute for each other. This is essential in doing sequence comparisons. The book ``Biological Sequence Analysis'' by Durbin et al. provides a really nice introduction to Substitution Matrices and their uses. Some famous substitution matrices are the PAM and BLOSUM series of matrices.
1615
1616Biopython provides a ton of common substitution matrices, and also provides functionality for creating your own substitution matrices.
1617
1618\subsection{Using common substitution matrices}
1619
1620\subsection{Calculating a substitution matrix from a multiple sequence alignment}
1621\label{sec:subs_mat_ex}
1622
1623You can create your own substitution matrix from an alignment.  In this
1624example, we'll first read a protein sequence alignment from the Clustalw file
1625\href{examples/protein.aln}{protein.aln} (also available online
1626\href{https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/protein.aln}{here})
1627
1628%doctest ../Tests/Clustalw
1629\begin{minted}{pycon}
1630>>> from Bio import AlignIO
1631>>> filename = "protein.aln"
1632>>> alignment = AlignIO.read(filename, "clustal")
1633\end{minted}
1634
1635Section~\ref{sec:align_clustal} contains more information on doing this.
1636
1637The \verb+substitutions+ property of the alignment stores the number of times
1638different residues substitute for each other:
1639%cont-doctest
1640\begin{minted}{pycon}
1641>>> observed_frequencies = alignment.substitutions
1642\end{minted}
1643
1644To make the example more readable, we'll select only amino acids with polar charged side chains:
1645
1646%cont-doctest
1647\begin{minted}{pycon}
1648>>> observed_frequencies = observed_frequencies.select("DEHKR")
1649>>> print(observed_frequencies)
1650       D      E      H      K      R
1651D 2360.0  255.5    7.5    0.5   25.0
1652E  255.5 3305.0   16.5   27.0    2.0
1653H    7.5   16.5 1235.0   16.0    8.5
1654K    0.5   27.0   16.0 3218.0  116.5
1655R   25.0    2.0    8.5  116.5 2079.0
1656<BLANKLINE>
1657\end{minted}
1658Rows and columns for other amino acids were removed from the matrix.
1659
1660Next, we normalize the matrix:
1661%cont-doctest
1662\begin{minted}{pycon}
1663>>> import numpy
1664>>> observed_frequencies /= numpy.sum(observed_frequencies)
1665\end{minted}
1666
1667Summing over rows or columns gives the relative frequency of occurrence of
1668each residue:
1669%cont-doctest
1670\begin{minted}{pycon}
1671>>> residue_frequencies = numpy.sum(observed_frequencies, 0)
1672>>> print(residue_frequencies.format("%.4f"))
1673D 0.2015
1674E 0.2743
1675H 0.0976
1676K 0.2569
1677R 0.1697
1678<BLANKLINE>
1679>>> numpy.sum(residue_frequencies)
16801.0
1681\end{minted}
1682
1683The expected frequency of residue pairs is then
1684%cont-doctest
1685\begin{minted}{pycon}
1686>>> expected_frequencies = numpy.dot(residue_frequencies[:, None], residue_frequencies[None, :])
1687>>> print(expected_frequencies.format("%.4f"))
1688       D      E      H      K      R
1689D 0.0406 0.0553 0.0197 0.0518 0.0342
1690E 0.0553 0.0752 0.0268 0.0705 0.0465
1691H 0.0197 0.0268 0.0095 0.0251 0.0166
1692K 0.0518 0.0705 0.0251 0.0660 0.0436
1693R 0.0342 0.0465 0.0166 0.0436 0.0288
1694<BLANKLINE>
1695\end{minted}
1696Here, \verb+residue_frequencies[:, None]+ creates a 2D array consisting of a single column with the values of \verb+residue_frequencies+, and \verb+residue_frequencies[None, :]+ a 2D array with these values as a single row. Taking their dot product (inner product) creates a matrix of expected frequencies where each entry consists of two \verb+residue_frequencies+ values multiplied with each other. For example, \verb+expected_frequencies['D', 'E']+ is equal to \verb+residue_frequencies['D'] * residue_frequencies['E']+.
1697
1698We can now calculate the log-odds matrix by dividing the observed frequencies by the expected freqencies and taking the logarithm:
1699%cont-doctest
1700\begin{minted}{pycon}
1701>>> m = numpy.log2(observed_frequencies/expected_frequencies)
1702>>> print(m)
1703      D    E    H     K    R
1704D   2.1 -1.5 -5.1 -10.4 -4.2
1705E  -1.5  1.7 -4.4  -5.1 -8.3
1706H  -5.1 -4.4  3.3  -4.4 -4.7
1707K -10.4 -5.1 -4.4   1.9 -2.3
1708R  -4.2 -8.3 -4.7  -2.3  2.5
1709<BLANKLINE>
1710\end{minted}
1711
1712This matrix can be used as the substitution matrix when performing alignments. For example,
1713%cont-doctest
1714\begin{minted}{pycon}
1715>>> from Bio.Align import PairwiseAligner
1716>>> aligner = PairwiseAligner()
1717>>> aligner.substitution_matrix = m
1718>>> aligner.gap_score = -3.0
1719>>> alignments = aligner.align("DEHEK", "DHHKK")
1720>>> print(alignments[0])
1721DEHEK
1722|.|.|
1723DHHKK
1724<BLANKLINE>
1725>>> print("%.2f" % alignments.score)
1726-2.18
1727>>> score = m['D','D'] + m['E','H'] + m['H','H'] + m['E','K'] + m['K','K']
1728>>> print("%.2f" % score)
1729-2.18
1730\end{minted}
1731
1732
1733\section{BioSQL -- storing sequences in a relational database}
1734\label{sec:BioSQL}
1735\href{https://www.biosql.org/}{BioSQL} is a joint effort between the
1736\href{https://www.open-bio.org/wiki/Main_Page}{OBF} projects (BioPerl, BioJava etc) to support a
1737shared database schema for storing sequence data. In theory, you could load a
1738GenBank file into the database with BioPerl, then using Biopython extract this
1739from the database as a record object with features - and get more or less the same
1740thing as if you had loaded the GenBank file directly as a SeqRecord using
1741\verb|Bio.SeqIO| (Chapter~\ref{chapter:seqio}).
1742
1743Biopython's BioSQL module is currently documented at
1744\url{http://biopython.org/wiki/BioSQL} which is part of our wiki pages.
1745