\chapter{Advanced}
\label{chapter:advanced}

\section{Parser Design}

Many of the older Biopython parsers were built around an event-oriented
design that includes Scanner and Consumer objects.

A Scanner takes input from a data source and analyzes it line by line,
sending off an event whenever it recognizes some information in the
data.  For example, if the data includes information about an organism
name, the Scanner may generate an \verb|organism_name| event whenever it
encounters a line containing the name.

Consumers are objects that receive the events generated by Scanners.
Following the previous example, the Consumer receives the
\verb|organism_name| event, and then processes it in whatever manner is
necessary for the current application.

This is a very flexible framework, which is advantageous if you want to
be able to parse a file format into more than one representation.  For
example, the \verb|Bio.GenBank| module uses this to construct either
\verb|SeqRecord| objects or file-format-specific record objects.
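
The Scanner/Consumer split can be illustrated with a minimal sketch. This is not the code of any Biopython parser: the class names, the toy \verb|ORGANISM| line format and the \verb|feed| method are assumptions made for illustration only. It shows one scanner driving two different consumers, which is the flexibility described above.

```python
class OrganismScanner:
    """Hypothetical scanner: reads lines and fires events on a consumer."""

    def feed(self, lines, consumer):
        for line in lines:
            # Assumed toy format: a line starting with "ORGANISM" holds a name.
            if line.startswith("ORGANISM"):
                consumer.organism_name(line[len("ORGANISM"):].strip())


class ListConsumer:
    """Hypothetical consumer that stores every organism name it receives."""

    def __init__(self):
        self.names = []

    def organism_name(self, name):
        self.names.append(name)


class CountConsumer:
    """Hypothetical consumer that only counts the events it receives."""

    def __init__(self):
        self.count = 0

    def organism_name(self, name):
        self.count += 1


data = ["LOCUS       X", "ORGANISM  Homo sapiens", "ORGANISM  Mus musculus"]
scanner = OrganismScanner()
names = ListConsumer()
counter = CountConsumer()
scanner.feed(data, names)    # same scanner, first representation
scanner.feed(data, counter)  # same scanner, second representation
```

The scanner knows nothing about what is done with the names; swapping the consumer changes the resulting representation without touching the scanning code.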

More recently, many of the parsers added for \verb|Bio.SeqIO| and
\verb|Bio.AlignIO| take a much simpler approach, but only generate a
single object representation (\verb|SeqRecord| and
\verb|MultipleSeqAlignment| objects respectively). In some cases the
\verb|Bio.SeqIO| parsers actually wrap another Biopython parser. For
example, the \verb|Bio.SwissProt| parser produces SwissProt-format-specific
record objects, which are then converted into \verb|SeqRecord| objects.

\section{Substitution Matrices}

\textbf{Please note that Bio.SubsMat was deprecated in Release 1.78.} As an alternative, please consider using \verb|Bio.Align.substitution_matrices| (described in section~\ref{sec:substitution_matrices}).

\subsection{SubsMat}

This module provides a class and a few routines for generating substitution matrices, similar to BLOSUM or PAM matrices, but based on user-provided data. Additionally, you may select a matrix from \verb|MatrixInfo.py|, a collection of established substitution matrices.

The \verb+SeqMat+ class derives from a dictionary.
The dictionary is of the form \verb|{(i1,j1):n1, (i1,j2):n2,...,(ik,jk):nk}| where i, j are alphabet letters, and n is a value.

\begin{enumerate}
  \item Attributes
  \begin{enumerate}
    \item \verb|self.alphabet|: a string consisting of the alphabet letters.

    \item \verb|self.ab_list|: a list of the alphabet's letters, sorted. Needed mainly for internal purposes.
  \end{enumerate}

  \item Methods

  \begin{enumerate}

    \item
\begin{minted}{python}
__init__(self, data=None, alphabet=None, mat_name="", build_later=0)
\end{minted}

    \begin{enumerate}

      \item \verb|data|: can be either a dictionary or another \verb|SeqMat| instance.
      \item \verb|alphabet|: an iterable (e.g., a string) over the alphabet letters.

      \item \verb|mat_name|: matrix name, such as \verb|"BLOSUM62"| or \verb|"PAM250"|.

      \item \verb|build_later|: defaults to false. If true, the user may supply only an alphabet and an empty dictionary, with the intention of building the matrix later. This skips the sanity check of alphabet size against matrix size.

    \end{enumerate}

    \item
\begin{minted}{python}
entropy(self, obs_freq_mat)
\end{minted}

    \begin{enumerate}
      \item \verb|obs_freq_mat|: an observed frequency matrix. Returns the matrix's entropy, based on the frequencies in \verb|obs_freq_mat|. The matrix instance should be a log-odds (LO) or substitution (SUBS) matrix.
    \end{enumerate}

    \item
\begin{minted}{python}
sum(self)
\end{minted}
    Calculates the sum of values for each letter in the matrix's alphabet, and returns it as a dictionary of the form \verb|{i1: s1, i2: s2,...,in:sn}|, where:
    \begin{itemize}
      \item i: an alphabet letter;
      \item s: sum of all values in a half-matrix for that letter;
      \item n: number of letters in alphabet.
    \end{itemize}

    \item
\begin{minted}{python}
format(self, fmt="%4d", letterfmt="%4s", alphabet=None, full=False)
\end{minted}

    Creates a string representation of the matrix. \verb|fmt| is the format field for the matrix values; \verb|letterfmt| is the format field for the bottom row (in case of a half matrix) or the top row (in case of a full matrix), containing matrix letters. Example output for a 3-letter alphabet matrix:

\begin{minted}{text}
A 23
B 12 34
C 7  22  27
  A   B   C
\end{minted}

    The optional \verb|alphabet| argument is an iterable (e.g., a string) over all letters in the alphabet. If supplied, the order of letters along the axes is taken from the string, rather than by alphabetical order.

  \end{enumerate}
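
To make the half-matrix layout concrete, the following standalone sketch renders a dictionary keyed by letter pairs as lower-triangular text. It does not use \verb|SeqMat|; the function name \verb|format_half_matrix| is hypothetical, and only the parameter names \verb|fmt|, \verb|letterfmt| and \verb|alphabet| are borrowed from the description above. The exact spacing produced by the real method may differ.

```python
def format_half_matrix(mat, alphabet=None, fmt="%4d", letterfmt="%4s"):
    """Render a half matrix (dict keyed by letter pairs) as lower-triangular text."""
    # Default axis order is alphabetical; a supplied alphabet overrides it.
    letters = sorted({letter for pair in mat for letter in pair})
    if alphabet is not None:
        letters = list(alphabet)
    lines = []
    for row, i in enumerate(letters):
        cells = []
        for j in letters[: row + 1]:
            # Each unordered pair is stored once, under either key order.
            value = mat[(i, j)] if (i, j) in mat else mat[(j, i)]
            cells.append(fmt % value)
        lines.append(i + "".join(cells))
    # Bottom row: pad the row-label column, then print the letters.
    lines.append(" " + "".join(letterfmt % j for j in letters))
    return "\n".join(lines)


half = {
    ("A", "A"): 23,
    ("A", "B"): 12, ("B", "B"): 34,
    ("A", "C"): 7, ("B", "C"): 22, ("C", "C"): 27,
}
text = format_half_matrix(half)
reordered = format_half_matrix(half, alphabet="CBA")
```

Passing \verb|alphabet="CBA"| in this sketch reverses the axis order, mirroring the role of the optional \verb|alphabet| argument.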

\item Usage

   The following steps are laid out in the order in which a log-odds matrix is usually generated. The interim matrices can also be generated and
   investigated on their own, although most users only need the final log-odds matrix.

   \begin{enumerate}

   \item Generating an Accepted Replacement Matrix

   Initially, you should generate an accepted replacement matrix (ARM) from your data. The values in the ARM are the counted numbers of replacements according to your data. The data could be a set of pairs or multiple alignments. So for instance, if Alanine was replaced by Cysteine 10 times, and Cysteine by Alanine 12 times, the corresponding ARM entries would be:

\begin{minted}{text}
('A','C'): 10, ('C','A'): 12
\end{minted}

As order doesn't matter, the user can instead provide a single combined entry:

\begin{minted}{text}
('A','C'): 22
\end{minted}

A \verb|SeqMat| instance may be initialized with either a full matrix (the first method of counting: 10, 12) or a half matrix (the latter method: 22). A full protein alphabet matrix would have $20 \times 20 = 400$ entries. A half matrix of that alphabet would have $20 \times 20 / 2 + 20 / 2 = 210$ entries, because the same-letter entries on the matrix diagonal are counted only once. Given an alphabet size of $N$:

   \begin{enumerate}
     \item Full matrix size: $N \times N$

     \item Half matrix size: $N(N+1)/2$
   \end{enumerate}

If a full matrix is passed, the \verb|SeqMat| constructor automatically generates a half matrix.
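
The folding of a full matrix into a half matrix can be sketched in a few lines of plain Python. This is an illustration of the idea only, not the \verb|SeqMat| constructor: the function name \verb|to_half_matrix| and the choice of alphabetically sorted keys are assumptions.

```python
def to_half_matrix(full):
    """Fold a full count matrix into a half matrix by merging (i, j) with (j, i)."""
    half = {}
    for (i, j), count in full.items():
        # Assumed convention: keep the key whose letters are in sorted order.
        key = (i, j) if i <= j else (j, i)
        half[key] = half.get(key, 0) + count
    return half


full_arm = {("A", "A"): 50, ("A", "C"): 10, ("C", "A"): 12, ("C", "C"): 40}
half_arm = to_half_matrix(full_arm)

# Matrix sizes for a 20-letter protein alphabet.
n = 20
full_size = n * n
half_size = n * (n + 1) // 2
```

Here the two off-diagonal counts (10 and 12) end up in a single entry, while the diagonal entries are carried over unchanged.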

At this point, if all you wish to do is generate a log-odds matrix, please go to the section titled Example of use. The following text describes the details of the internal functions, for people who wish to investigate their nucleotide/amino-acid frequency data more thoroughly.

\item Generating the observed frequency matrix (OFM)

Use:
\begin{minted}{python}
OFM = SubsMat._build_obs_freq_mat(ARM)
\end{minted}

  The OFM is generated from the ARM, but contains replacement frequencies instead of replacement counts.

\item Generating an expected frequency matrix (EFM)

Use:

\begin{minted}{python}
EFM = SubsMat._build_exp_freq_mat(OFM, exp_freq_table)
\end{minted}

  \begin{enumerate}
    \item \verb|exp_freq_table|: should be a \verb|FreqTable| instance. See section~\ref{sec:freq_table} for detailed information on \verb|FreqTable|. Briefly, the expected frequency table holds the frequency of appearance of each letter in the alphabet. It is implemented as a dictionary with the alphabet letters as keys, and each letter's frequency as a value. The values sum to 1.
  \end{enumerate}

The expected frequency table can (and generally should) be generated from the observed frequency matrix. So in most cases you will generate \verb|exp_freq_table| using:

\begin{minted}{pycon}
>>> exp_freq_table = SubsMat._exp_freq_table_from_obs_freq(OFM)
>>> EFM = SubsMat._build_exp_freq_mat(OFM, exp_freq_table)
\end{minted}

However, you can supply your own \verb|exp_freq_table| if you wish.

\item Generating a substitution frequency matrix (SFM)

Use:

\begin{minted}{python}
SFM = SubsMat._build_subs_mat(OFM, EFM)
\end{minted}

  Accepts an OFM and an EFM, and returns a matrix in which each entry is the OFM value divided by the corresponding EFM value.

\item Generating a log-odds matrix (LOM)

   Use:
\begin{minted}{python}
LOM = SubsMat._build_log_odds_mat(SFM, logbase=10, factor=10.0, round_digit=1)
\end{minted}

   \begin{enumerate}
     \item Accepts an SFM.

     \item \verb|logbase|: base of the logarithm used to generate the log-odds values.

     \item \verb|factor|: factor used to multiply the log-odds values. Each entry is computed as \verb|log(SFM[key]) * factor| and, if required, rounded to \verb|round_digit| places after the decimal point.

\end{enumerate}

\end{enumerate}
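
The chain of steps above (ARM, OFM, EFM, SFM, LOM) can be followed end to end with a small standalone sketch. It deliberately does not call the private \verb|SubsMat| helpers. All function names are hypothetical, and two modelling choices are assumptions rather than statements about the module's internals: an off-diagonal frequency is split equally between its two letters, and the expected frequency of an off-diagonal pair in a half matrix is twice the product of the letter frequencies.

```python
import math


def build_obs_freq(arm):
    """Turn replacement counts into frequencies that sum to 1."""
    total = float(sum(arm.values()))
    return {pair: count / total for pair, count in arm.items()}


def exp_freq_table_from_obs(ofm):
    """Letter frequencies; an off-diagonal entry is split between its letters."""
    table = {}
    for (i, j), freq in ofm.items():
        if i == j:
            table[i] = table.get(i, 0.0) + freq
        else:
            table[i] = table.get(i, 0.0) + freq / 2
            table[j] = table.get(j, 0.0) + freq / 2
    return table


def build_exp_freq(ofm, table):
    """Expected pair frequencies assuming the two letters are independent."""
    return {
        (i, j): table[i] * table[j] * (1 if i == j else 2)
        for (i, j) in ofm
    }


def build_subs(ofm, efm):
    """Observed over expected frequency for every pair."""
    return {pair: ofm[pair] / efm[pair] for pair in ofm}


def build_log_odds(sfm, logbase=10, factor=10.0, round_digit=1):
    """Scaled, rounded logarithm of every substitution frequency."""
    return {
        pair: round(factor * math.log(value, logbase), round_digit)
        for pair, value in sfm.items()
    }


# Toy half-matrix ARM over a two-letter alphabet.
arm = {("A", "A"): 60, ("A", "C"): 20, ("C", "C"): 20}
ofm = build_obs_freq(arm)
table = exp_freq_table_from_obs(ofm)
efm = build_exp_freq(ofm, table)
sfm = build_subs(ofm, efm)
lom = build_log_odds(sfm)
```

In this toy data the A/C replacement is observed less often than expected by chance, so its log-odds value comes out negative, while both same-letter entries come out positive.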

\item Example of use

As most people want to generate a log-odds matrix with minimum hassle, SubsMat provides one function which does it all:

\begin{minted}{python}
make_log_odds_matrix(
    acc_rep_mat, exp_freq_table=None, logbase=10, factor=10.0, round_digit=0
)
\end{minted}

\begin{enumerate}
  \item \verb|acc_rep_mat|: user-provided accepted replacement matrix.
  \item \verb|exp_freq_table|: expected frequency table. Used if provided; if not, it is generated from \verb|acc_rep_mat|.
  \item \verb|logbase|: base of the logarithm for the log-odds matrix. Default base 10.
  \item \verb|factor|: factor used to multiply the log-odds values. Default 10.0.
  \item \verb|round_digit|: number of digits after the decimal point to which the result should be rounded. Default zero.
\end{enumerate}

\end{enumerate}

\subsection{FreqTable}
\label{sec:freq_table}

\begin{minted}{python}
FreqTable.FreqTable(dict)
\end{minted}

\begin{enumerate}

  \item Attributes:

  \begin{enumerate}
    \item \verb|alphabet|: a string containing the letters in the alphabet.
    \item \verb|data|: frequency dictionary.
    \item \verb|count|: count dictionary (in case counts are provided).
  \end{enumerate}

  \item Functions:
  \begin{enumerate}
    \item \verb|read_count(f)|: reads a count file from stream \verb|f|, then converts the counts to frequencies.
    \item \verb|read_freq(f)|: reads a frequency data file from stream \verb|f|. In this case the counts are not available, but it is usually the letter frequencies that are of interest.
  \end{enumerate}

  \item Example of use:
  The expected counts of the residues in the database are stored in a whitespace-delimited file, in the following format (example given for an alphabet consisting of three letters):

\begin{minted}{text}
A   35
B   65
C   100
\end{minted}

This file can be read using the \verb|FreqTable.read_count(file_handle)| function.

An equivalent frequency file:

\begin{minted}{text}
A  0.175
B  0.325
C  0.5
\end{minted}

Alternatively, the residue frequencies or counts can be passed as a dictionary.
Example of a count dictionary (same alphabet of three letters):

\begin{minted}{python}
{"A": 35, "B": 65, "C": 100}
\end{minted}

This means that the expected counts give a frequency of 0.5
for 'C', 0.325 for 'B' and 0.175 for 'A'
(out of a total of 200, the sum of the counts for A, B and C).

A frequency dictionary for the same data would be:

\begin{minted}{python}
{"A": 0.175, "B": 0.325, "C": 0.5}
\end{minted}

These values sum to 1.

When passing a dictionary as an argument, you should indicate whether it is a count or a frequency dictionary. Therefore the \verb|FreqTable| class constructor requires two arguments: the dictionary itself, and \verb|FreqTable.COUNT| or \verb|FreqTable.FREQ|, indicating counts or frequencies, respectively.

When reading expected counts, \verb|read_count| already generates the frequencies.
Any one of the following may be used to generate the frequency table (\verb|ftab|):

\begin{minted}{pycon}
>>> from Bio.SubsMat import FreqTable
>>> ftab = FreqTable.FreqTable(my_frequency_dictionary, FreqTable.FREQ)
>>> ftab = FreqTable.FreqTable(my_count_dictionary, FreqTable.COUNT)
>>> ftab = FreqTable.read_count(open("myCountFile"))
>>> ftab = FreqTable.read_freq(open("myFrequencyFile"))
\end{minted}

\end{enumerate}
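
The relationship between the count file and the frequency dictionary shown above can be checked with a short standalone sketch. It does not use the \verb|FreqTable| module: the helper names \verb|read_count_table| and \verb|counts_to_freqs| are hypothetical, and an in-memory \verb|io.StringIO| stream stands in for a real count file.

```python
import io


def read_count_table(handle):
    """Parse 'letter count' lines from a whitespace-delimited stream."""
    counts = {}
    for line in handle:
        if not line.strip():
            continue  # ignore blank lines
        letter, value = line.split()
        counts[letter] = int(value)
    return counts


def counts_to_freqs(counts):
    """Divide every count by the total so that the values sum to 1."""
    total = float(sum(counts.values()))
    return {letter: n / total for letter, n in counts.items()}


# Same three-letter example as in the text above.
handle = io.StringIO("A   35\nB   65\nC   100\n")
counts = read_count_table(handle)
freqs = counts_to_freqs(counts)
```

With the three-letter example, the total count is 200, so the resulting frequencies match the frequency file and frequency dictionary given earlier.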