1\chapter{Bio.phenotype: analyse phenotypic data}
2\label{chapter:phenotype}
3
4This chapter gives an overview of the functionalities of the
5\verb|Bio.phenotype| package included in Biopython. The scope of this
6package is the analysis of phenotypic data, which means parsing and
7analysing growth measurements of cell cultures.
8In its current state the package is focused on the analysis of
9high-throughput phenotypic experiments produced by the
10\href{https://en.wikipedia.org/wiki/Phenotype_microarray}{Phenotype Microarray technology},
11but future developments may include other platforms and formats.
12
13\section{Phenotype Microarrays}
14\label{sec:phenotypemicroarrays}
15
16The \href{https://en.wikipedia.org/wiki/Phenotype_microarray}{Phenotype Microarray}
17is a technology that measures the metabolism of bacterial
18and eukaryotic cells on roughly 2000 chemicals, divided in twenty 96-well
19plates.
20The technology measures the reduction of a tetrazolium dye by
21NADH, whose production by the cell is used as a proxy for cell metabolism;
22color development due to the reduction of this dye is typically measured
23once every 15 minutes.
24When cells are grown in a media that sustains cell metabolism, the
25recorded phenotypic data resembles a sigmoid growth curve, from which a
26series of growth parameters can be retrieved.
27
28\subsection{Parsing Phenotype Microarray data}
29
30The \verb|Bio.phenotype| package can parse two different formats of
31Phenotype Microarray data: the
32\href{https://en.wikipedia.org/wiki/Comma-separated_values}{CSV}
33(comma separated values) files produced by the machine's proprietary
34software and \href{https://en.wikipedia.org/wiki/JSON}{JSON}
35files produced by analysis software, like
36\href{https://www.dsmz.de/research/microorganisms/projects/analysis-of-omnilog-phenotype-microarray-data.html}{opm}
37or \href{https://combogenomics.github.io/DuctApe/}{DuctApe}.
38The parser will return one or a generator of PlateRecord objects, depending
39on whether the read or parse method is being used.
40You can test the parse function by using the \href{https://github.com/biopython/biopython/blob/master/Doc/examples/Plates.csv}{\texttt{Plates.csv}} file provided with the Biopython source code.
41
42%doctest examples lib:numpy
43\begin{minted}{pycon}
44>>> from Bio import phenotype
45>>> for record in phenotype.parse("Plates.csv", "pm-csv"):
46...     print("%s %i" % (record.id, len(record)))
47...
48PM01 96
49PM01 96
50PM09 96
51PM09 96
52\end{minted}
53
54The parser returns a series of PlateRecord objects, each one containing a series of WellRecord objects
55(holding each well's experimental data) arranged in 8 rows and 12 columns; each row is indicated by
56a uppercase character from A to H, while columns are indicated by a two digit number, from 01 to 12.
57There are several ways to access WellRecord objects from a PlateRecord objects:
58
59\begin{description}
60  \item[Well identifier]
61    If you know the well identifier (row + column identifiers) you can access the desired well directly.
62\begin{minted}{pycon}
63    >>> record["A02"]
64    \end{minted}
65
66  \item[Well plate coordinates]
67    The same well can be retrieved by using the row and columns numbers (0-based index).
68
69%doctest examples lib:numpy
70\begin{minted}{pycon}
71>>> from Bio import phenotype
72>>> record = list(phenotype.parse("Plates.csv", "pm-csv"))[-1]
73>>> print(record[0, 1].id)
74A02
75\end{minted}
76
77  \item[Row or column coordinates]
78    A series of WellRecord objects contiguous to each other in the plate can be retrieved in bulk by
79    using the python list slicing syntax on PlateRecord objects; rows and columns are numbered with
80    a 0-based index.
81
82%cont-doctest
83\begin{minted}{pycon}
84>>> print(record[0])
85Plate ID: PM09
86Well: 12
87Rows: 1
88Columns: 12
89PlateRecord('WellRecord['A01'], WellRecord['A02'], WellRecord['A03'], ..., WellRecord['A12']')
90>>> print(record[:, 0])
91Plate ID: PM09
92Well: 8
93Rows: 8
94Columns: 1
95PlateRecord('WellRecord['A01'], WellRecord['B01'], WellRecord['C01'], ..., WellRecord['H01']')
96>>> print(record[:3, :3])
97Plate ID: PM09
98Well: 9
99Rows: 3
100Columns: 3
101PlateRecord('WellRecord['A01'], WellRecord['A02'], WellRecord['A03'], ..., WellRecord['C03']')
102\end{minted}
103
104\end{description}
105
106\subsection{Manipulating Phenotype Microarray data}
107
108\subsubsection{Accessing raw data}
109The raw data extracted from the PM files is comprised of a series of tuples for each well,
110containing the time (in hours) and the colorimetric measure (in arbitrary units).
111Usually the instrument collects data every fifteen minutes, but that can vary between
112experiments. The raw data can be accessed by iterating on a WellRecord object;
113in the example below only the first ten time points are shown.
114
115%doctest examples lib:numpy
116\begin{minted}{pycon}
117>>> from Bio import phenotype
118>>> record = list(phenotype.parse("Plates.csv", "pm-csv"))[-1]
119>>> well = record["A02"]
120\end{minted}
121%rest of code snippet output is truncated
122\begin{minted}{pycon}
123>>> for time, signal in well:
124...    print(time, signal)
125...
126(0.0, 12.0)
127(0.25, 18.0)
128(0.5, 27.0)
129(0.75, 35.0)
130(1.0, 37.0)
131(1.25, 41.0)
132(1.5, 44.0)
133(1.75, 44.0)
134(2.0, 44.0)
135(2.25, 44.0)
136[...]
137\end{minted}
138
139This method, while providing a way to access the raw data, doesn't allow a direct
140comparison between different WellRecord objects, which may have measurements at
141different time points.
142
143\subsubsection{Accessing interpolated data}
144To make it easier to compare different experiments and in general to allow a more intuitive handling
145of the phenotypic data, the module allows to define a custom slicing of the time points that are present
146in the WellRecord object. Colorimetric data for time points that have not been directly measured are
147derived through a linear interpolation of the available data, otherwise a NaN is returned.
148This method only works in the time interval where actual data is available.
149Time intervals can be defined with the same syntax as list
150indexing; the default time interval is therefore one hour.
151
152%cont-doctest
153\begin{minted}{pycon}
154>>> well[:10]
155[12.0, 37.0, 44.0, 44.0, 44.0, 44.0, 44.0, 44.0, 44.0, 44.0]
156\end{minted}
157
158Different time intervals can be used, for instance five minutes:
159%Rounding makes this tricky as a doctest...
160\begin{minted}{pycon}
161>>> well[63:64:0.083]
162[12.0, 37.0, 44.0, 44.0, 44.0, 44.0, 44.0, 44.0, 44.0, 44.0]
163>>> well[9.55]
16444.0
165>>> well[63.33:73.33]
166[113.31999999999999,
167 117.0,
168 120.31999999999999,
169 128.0,
170 129.63999999999999,
171 132.95999999999998,
172 136.95999999999998,
173 140.0,
174 142.0,
175 nan]
176\end{minted}
177
178\subsubsection{Control well subtraction}
179Many Phenotype Microarray plates contain a control well (usually A01), that is a well where the media shouldn't support
180any growth; the low signal produced by this well can be subtracted from the other wells.
181The PlateRecord objects have a dedicated function for that, which returns another PlateRecord object
182with the corrected data.
183
184%cont-doctest
185\begin{minted}{pycon}
186>>> corrected = record.subtract_control(control="A01")
187>>> record["A01"][63]
188336.0
189>>> corrected["A01"][63]
1900.0
191\end{minted}
192
193\subsubsection{Parameters extraction}
194Those wells where metabolic activity is observed show a sigmoid behavior for the colorimetric data.
195To allow an easier way to compare different experiments a sigmoid curve can be fitted onto the data,
196so that a series of summary parameters can be extracted and used for comparisons.
197The parameters that can be extracted from the curve are:
198
199\begin{itemize}
200  \item Minimum (\textbf{min}) and maximum (\textbf{max}) signal;
201
202  \item Average height (\textbf{average\_height});
203
204  \item Area under the curve (\textbf{area});
205
206  \item Curve plateau point (\textbf{plateau});
207
208  \item Curve slope during exponential metabolic activity (\textbf{slope});
209
210  \item Curve lag time (\textbf{lag}).
211\end{itemize}
212
213All the parameters (except \textbf{min}, \textbf{max} and \textbf{average\_height}) require the
214\href{https://www.scipy.org/}{scipy library} to be installed.
215
216The fit function uses three sigmoid functions:
217
218\begin{description}
219  \item[Gompertz] $Ae^{-e^{(\frac{\mu_{m}e}{A}(\lambda - t) + 1)}} + y0$
220
221  \item[Logistic] $\frac{A}{1+e^{(\frac{4\mu_{m}}{A}(\lambda - t) + 2)}} + y_{0}$
222
223  \item[Richards] $A(1 + ve^{1 + v} + e^{\frac{\mu_{m}}{A}(1 + v)(1 + \frac{1}{v})(\lambda - t)})^{-\frac{1}{v}} + y0$
224
225\end{description}
226
227Where:
228\begin{itemize}
229  \item[\textbf{A}] corresponds to the \textbf{plateau}
230
231  \item[\textbf{$\mu_{m}$}] corresponds to the \textbf{slope}
232
233  \item[\textbf{$\lambda$}] corresponds to the \textbf{lag}
234
235\end{itemize}
236
237These functions have been derived from \href{https://www.ncbi.nlm.nih.gov/pubmed/16348228}{this publication}.
238The fit method by default tries first to fit the gompertz function: if it fails it will then try to fit
239the logistic and then the richards function. The user can also specify one of the three functions to be applied.
240
241%TODO: enable with doctest examples
242\begin{minted}{pycon}
243>>> from Bio import phenotype
244>>> record = list(phenotype.parse("Plates.csv", "pm-csv"))[-1]
245>>> well = record["A02"]
246>>> well.fit()
247>>> print("Function fitted: %s" % well.model)
248Function fitted: gompertz
249>>> for param in ["area", "average_height", "lag", "max", "min",
250...               "plateau", "slope"]:
251...     print("%s\t%.2f" % (param, getattr(well, param)))
252...
253area    4414.38
254average_height  61.58
255lag     48.60
256max     143.00
257min     12.00
258plateau 120.02
259slope   4.99
260\end{minted}
261
262\subsection{Writing Phenotype Microarray data}
263PlateRecord objects can be written to file in the form of
264\href{https://en.wikipedia.org/wiki/JSON}{JSON}
265files, a format compatible with other software packages such as
266\href{https://www.dsmz.de/research/microorganisms/projects/analysis-of-omnilog-phenotype-microarray-data.html}{opm}
267or \href{https://combogenomics.github.io/DuctApe/}{DuctApe}.
268\begin{minted}{pycon}
269>>> phenotype.write(record, "out.json", "pm-json")
2701
271\end{minted}
272