1\chapter{Bio.phenotype: analyse phenotypic data} 2\label{chapter:phenotype} 3 4This chapter gives an overview of the functionalities of the 5\verb|Bio.phenotype| package included in Biopython. The scope of this 6package is the analysis of phenotypic data, which means parsing and 7analysing growth measurements of cell cultures. 8In its current state the package is focused on the analysis of 9high-throughput phenotypic experiments produced by the 10\href{https://en.wikipedia.org/wiki/Phenotype_microarray}{Phenotype Microarray technology}, 11but future developments may include other platforms and formats. 12 13\section{Phenotype Microarrays} 14\label{sec:phenotypemicroarrays} 15 16The \href{https://en.wikipedia.org/wiki/Phenotype_microarray}{Phenotype Microarray} 17is a technology that measures the metabolism of bacterial 18and eukaryotic cells on roughly 2000 chemicals, divided in twenty 96-well 19plates. 20The technology measures the reduction of a tetrazolium dye by 21NADH, whose production by the cell is used as a proxy for cell metabolism; 22color development due to the reduction of this dye is typically measured 23once every 15 minutes. 24When cells are grown in a media that sustains cell metabolism, the 25recorded phenotypic data resembles a sigmoid growth curve, from which a 26series of growth parameters can be retrieved. 27 28\subsection{Parsing Phenotype Microarray data} 29 30The \verb|Bio.phenotype| package can parse two different formats of 31Phenotype Microarray data: the 32\href{https://en.wikipedia.org/wiki/Comma-separated_values}{CSV} 33(comma separated values) files produced by the machine's proprietary 34software and \href{https://en.wikipedia.org/wiki/JSON}{JSON} 35files produced by analysis software, like 36\href{https://www.dsmz.de/research/microorganisms/projects/analysis-of-omnilog-phenotype-microarray-data.html}{opm} 37or \href{https://combogenomics.github.io/DuctApe/}{DuctApe}. 38The parser will return one or a generator of PlateRecord objects, depending 39on whether the read or parse method is being used. 40You can test the parse function by using the \href{https://github.com/biopython/biopython/blob/master/Doc/examples/Plates.csv}{\texttt{Plates.csv}} file provided with the Biopython source code. 41 42%doctest examples lib:numpy 43\begin{minted}{pycon} 44>>> from Bio import phenotype 45>>> for record in phenotype.parse("Plates.csv", "pm-csv"): 46... print("%s %i" % (record.id, len(record))) 47... 48PM01 96 49PM01 96 50PM09 96 51PM09 96 52\end{minted} 53 54The parser returns a series of PlateRecord objects, each one containing a series of WellRecord objects 55(holding each well's experimental data) arranged in 8 rows and 12 columns; each row is indicated by 56a uppercase character from A to H, while columns are indicated by a two digit number, from 01 to 12. 57There are several ways to access WellRecord objects from a PlateRecord objects: 58 59\begin{description} 60 \item[Well identifier] 61 If you know the well identifier (row + column identifiers) you can access the desired well directly. 62\begin{minted}{pycon} 63 >>> record["A02"] 64 \end{minted} 65 66 \item[Well plate coordinates] 67 The same well can be retrieved by using the row and columns numbers (0-based index). 68 69%doctest examples lib:numpy 70\begin{minted}{pycon} 71>>> from Bio import phenotype 72>>> record = list(phenotype.parse("Plates.csv", "pm-csv"))[-1] 73>>> print(record[0, 1].id) 74A02 75\end{minted} 76 77 \item[Row or column coordinates] 78 A series of WellRecord objects contiguous to each other in the plate can be retrieved in bulk by 79 using the python list slicing syntax on PlateRecord objects; rows and columns are numbered with 80 a 0-based index. 81 82%cont-doctest 83\begin{minted}{pycon} 84>>> print(record[0]) 85Plate ID: PM09 86Well: 12 87Rows: 1 88Columns: 12 89PlateRecord('WellRecord['A01'], WellRecord['A02'], WellRecord['A03'], ..., WellRecord['A12']') 90>>> print(record[:, 0]) 91Plate ID: PM09 92Well: 8 93Rows: 8 94Columns: 1 95PlateRecord('WellRecord['A01'], WellRecord['B01'], WellRecord['C01'], ..., WellRecord['H01']') 96>>> print(record[:3, :3]) 97Plate ID: PM09 98Well: 9 99Rows: 3 100Columns: 3 101PlateRecord('WellRecord['A01'], WellRecord['A02'], WellRecord['A03'], ..., WellRecord['C03']') 102\end{minted} 103 104\end{description} 105 106\subsection{Manipulating Phenotype Microarray data} 107 108\subsubsection{Accessing raw data} 109The raw data extracted from the PM files is comprised of a series of tuples for each well, 110containing the time (in hours) and the colorimetric measure (in arbitrary units). 111Usually the instrument collects data every fifteen minutes, but that can vary between 112experiments. The raw data can be accessed by iterating on a WellRecord object; 113in the example below only the first ten time points are shown. 114 115%doctest examples lib:numpy 116\begin{minted}{pycon} 117>>> from Bio import phenotype 118>>> record = list(phenotype.parse("Plates.csv", "pm-csv"))[-1] 119>>> well = record["A02"] 120\end{minted} 121%rest of code snippet output is truncated 122\begin{minted}{pycon} 123>>> for time, signal in well: 124... print(time, signal) 125... 126(0.0, 12.0) 127(0.25, 18.0) 128(0.5, 27.0) 129(0.75, 35.0) 130(1.0, 37.0) 131(1.25, 41.0) 132(1.5, 44.0) 133(1.75, 44.0) 134(2.0, 44.0) 135(2.25, 44.0) 136[...] 137\end{minted} 138 139This method, while providing a way to access the raw data, doesn't allow a direct 140comparison between different WellRecord objects, which may have measurements at 141different time points. 142 143\subsubsection{Accessing interpolated data} 144To make it easier to compare different experiments and in general to allow a more intuitive handling 145of the phenotypic data, the module allows to define a custom slicing of the time points that are present 146in the WellRecord object. Colorimetric data for time points that have not been directly measured are 147derived through a linear interpolation of the available data, otherwise a NaN is returned. 148This method only works in the time interval where actual data is available. 149Time intervals can be defined with the same syntax as list 150indexing; the default time interval is therefore one hour. 151 152%cont-doctest 153\begin{minted}{pycon} 154>>> well[:10] 155[12.0, 37.0, 44.0, 44.0, 44.0, 44.0, 44.0, 44.0, 44.0, 44.0] 156\end{minted} 157 158Different time intervals can be used, for instance five minutes: 159%Rounding makes this tricky as a doctest... 160\begin{minted}{pycon} 161>>> well[63:64:0.083] 162[12.0, 37.0, 44.0, 44.0, 44.0, 44.0, 44.0, 44.0, 44.0, 44.0] 163>>> well[9.55] 16444.0 165>>> well[63.33:73.33] 166[113.31999999999999, 167 117.0, 168 120.31999999999999, 169 128.0, 170 129.63999999999999, 171 132.95999999999998, 172 136.95999999999998, 173 140.0, 174 142.0, 175 nan] 176\end{minted} 177 178\subsubsection{Control well subtraction} 179Many Phenotype Microarray plates contain a control well (usually A01), that is a well where the media shouldn't support 180any growth; the low signal produced by this well can be subtracted from the other wells. 181The PlateRecord objects have a dedicated function for that, which returns another PlateRecord object 182with the corrected data. 183 184%cont-doctest 185\begin{minted}{pycon} 186>>> corrected = record.subtract_control(control="A01") 187>>> record["A01"][63] 188336.0 189>>> corrected["A01"][63] 1900.0 191\end{minted} 192 193\subsubsection{Parameters extraction} 194Those wells where metabolic activity is observed show a sigmoid behavior for the colorimetric data. 195To allow an easier way to compare different experiments a sigmoid curve can be fitted onto the data, 196so that a series of summary parameters can be extracted and used for comparisons. 197The parameters that can be extracted from the curve are: 198 199\begin{itemize} 200 \item Minimum (\textbf{min}) and maximum (\textbf{max}) signal; 201 202 \item Average height (\textbf{average\_height}); 203 204 \item Area under the curve (\textbf{area}); 205 206 \item Curve plateau point (\textbf{plateau}); 207 208 \item Curve slope during exponential metabolic activity (\textbf{slope}); 209 210 \item Curve lag time (\textbf{lag}). 211\end{itemize} 212 213All the parameters (except \textbf{min}, \textbf{max} and \textbf{average\_height}) require the 214\href{https://www.scipy.org/}{scipy library} to be installed. 215 216The fit function uses three sigmoid functions: 217 218\begin{description} 219 \item[Gompertz] $Ae^{-e^{(\frac{\mu_{m}e}{A}(\lambda - t) + 1)}} + y0$ 220 221 \item[Logistic] $\frac{A}{1+e^{(\frac{4\mu_{m}}{A}(\lambda - t) + 2)}} + y_{0}$ 222 223 \item[Richards] $A(1 + ve^{1 + v} + e^{\frac{\mu_{m}}{A}(1 + v)(1 + \frac{1}{v})(\lambda - t)})^{-\frac{1}{v}} + y0$ 224 225\end{description} 226 227Where: 228\begin{itemize} 229 \item[\textbf{A}] corresponds to the \textbf{plateau} 230 231 \item[\textbf{$\mu_{m}$}] corresponds to the \textbf{slope} 232 233 \item[\textbf{$\lambda$}] corresponds to the \textbf{lag} 234 235\end{itemize} 236 237These functions have been derived from \href{https://www.ncbi.nlm.nih.gov/pubmed/16348228}{this publication}. 238The fit method by default tries first to fit the gompertz function: if it fails it will then try to fit 239the logistic and then the richards function. The user can also specify one of the three functions to be applied. 240 241%TODO: enable with doctest examples 242\begin{minted}{pycon} 243>>> from Bio import phenotype 244>>> record = list(phenotype.parse("Plates.csv", "pm-csv"))[-1] 245>>> well = record["A02"] 246>>> well.fit() 247>>> print("Function fitted: %s" % well.model) 248Function fitted: gompertz 249>>> for param in ["area", "average_height", "lag", "max", "min", 250... "plateau", "slope"]: 251... print("%s\t%.2f" % (param, getattr(well, param))) 252... 253area 4414.38 254average_height 61.58 255lag 48.60 256max 143.00 257min 12.00 258plateau 120.02 259slope 4.99 260\end{minted} 261 262\subsection{Writing Phenotype Microarray data} 263PlateRecord objects can be written to file in the form of 264\href{https://en.wikipedia.org/wiki/JSON}{JSON} 265files, a format compatible with other software packages such as 266\href{https://www.dsmz.de/research/microorganisms/projects/analysis-of-omnilog-phenotype-microarray-data.html}{opm} 267or \href{https://combogenomics.github.io/DuctApe/}{DuctApe}. 268\begin{minted}{pycon} 269>>> phenotype.write(record, "out.json", "pm-json") 2701 271\end{minted} 272