1\documentclass[12pt]{amsart} 2 3\usepackage{graphicx} 4\usepackage{amsmath} 5\usepackage[pdftex]{hyperref} 6 7\setlength{\textwidth}{6.2in} 8\setlength{\textheight}{8.4in} 9\setlength{\topmargin}{0.2in} 10\setlength{\oddsidemargin}{0in} 11\setlength{\evensidemargin}{0in} 12\setlength{\headsep}{.3in} 13\footskip 0.3in 14\newtheorem{thm}{Theorem} 15\newtheorem{lemma}[thm]{Lemma} 16\newtheorem{prop}[thm]{Proposition} 17\newtheorem{cor}[thm]{Corollary} 18\theoremstyle{definition} 19\newtheorem{defn}[thm]{Definition} 20\newtheorem{example}[thm]{Example} 21\newtheorem{remark}[thm]{Remark} 22\newtheorem{conj}[thm]{Conjecture} 23 24\newcommand{\cherry}[1]{\ensuremath{\langle #1 \rangle}} 25\newcommand{\ignore}[1]{} 26\newcommand{\half}{\frac{1}{2}} 27\newcommand{\eopf}{\framebox[6.5pt]{} \vspace{0.2in}} 28 29%------------------document begins----------------------- 30 31\begin{document} 32 33\title{Methods of Cufflinks} 34\author{Cole Trapnell \and Lior Pachter} 35%\address{} 36%\email{\{whoweare\}@math.berkeley.edu} 37%\dedicatory{\today} 38 39%\thanks{Supported in part by ....} 40%\subjclass{Primary 55M20; Secondary 47H10} 41 42\maketitle 43\markboth{C Trapnell and L Pachter}{Methods of Cufflinks} 44 45This document describes the methods used in the Cufflinks RNA-Seq analysis program. 46%\pagestyle{plain} 47\pagenumbering{roman} \tableofcontents \listoffigures \listoftables \newpage \pagenumbering{arabic} 48 49\section{Version 0.8.3} 50 51\subsection{Transcript abundance estimation \label{abundances}} 52\label{sec:abundances} 53 54\subsubsection{Definitions} 55 56A {\em transcript} is an RNA molecule that has been transcribed from DNA. A 57{\em primary transcript} is an RNA molecule that has yet to undergo 58modification. The {\em genomic location} of a primary transcript consists of a 59pair of coordinates in the genome representing the $5'$ transcription start 60site and the $3'$ polyadenylation cleavage site. We denote the set of all 61transcripts in a transcriptome by $T$. We partition transcripts into {\em 62transcription loci} (for simplicity we refer to these as loci) so that every 63locus contains a set of transcripts all of whose genomic locations do not 64overlap the genomic location of any transcript in any other locus. Formally, 65we consider a maximal partition of transcripts into loci, a partition denoted 66by $G$, where the genomic location of a transcript $t \in g \in G$ does not 67overlap the genomic location of any transcript $u $ where $u \in h \in G$ and 68$h \neq g$. We emphasize that the definition of a transcription locus is not 69biological; transcripts in the same locus may be regulated via different 70promoters, and may differ completely in sequence (for example if one 71transcript is in the intron of another) or have different functions. The 72reason for defining loci is that we will see that they are computationally 73convenient. 74 75 76We assume that at the time of an experiment, a transcriptome consists of an 77ensemble of transcripts $T$ where the proportion of transcript $t \in T$ is 78$\rho_t$, so that $\sum_{t \in T} \rho_t = 1$ and $0 \leq \rho_t \leq 1$ for 79all $t \in T$. Formally, a {\em transcriptome} is a set of transcripts $T$ 80together with the abundances $\rho=\{ \rho_t\}_{t \in T}$. For convenience we also introduce 81notation for the proportion of transcripts in each locus. We let $\sigma_g = 82\sum_{t \in g} \rho_t$. Similarly, within a locus $g$, we denote the 83proportion of each transcript $t \in g$ by $\tau_t = \frac{\rho_t}{\sigma_g}$. 84We refer to $\rho,\sigma$ and $\tau$ as {\em transcript abundances}. 85 86 87Transcripts have lengths, which we denote by $l(t)$. For a collection of transcripts $S \subset T$ in a transcriptome, we define the length of $S$ using the weighted mean: 88\begin{equation} 89\label{eq:effective_length} 90l(S) =\frac{\sum_{t \in S} \rho_tl(t)}{\sum_{t \in S}\rho_t}. 91\end{equation} 92It is important to note that the length of a set of transcripts depends on their relative abundances; the reason for this will be clear later. 93 94One grouping of transcripts that we will focus on is the set of transcripts 95within a locus that share the same transcription start site (TSS). Unlike the 96concept of a locus, grouping by TSS has a biological basis. Transcripts 97within such a group are by definition alternatively spliced, and if they have 98different expression levels, this is most likely due to the spliceosome and 99not due to differences in transcriptional regulation. 100 101\subsubsection{A statistical model for RNA-Seq} 102 103In order to analyze expression levels of transcripts with RNA-Seq data, it is 104necessary to have a model for the (stochastic) process of sequencing. A {\em 105sequencing experiment} consists of selecting a total of $M$ fragments of 106transcripts uniformly at random from the transcriptome. Each fragment is 107identified by sequencing from its ends, resulting in two reads called {\em 108mate pairs}. The length of a fragment is a random variable, with a 109distribution we will denote by $F$. That is, the probability that a fragment 110has length $i$ is $F(i)$ and $\sum_{i=1}^{\infty} F(i) = 1$. In this paper we 111assume that $F$ is normal, however in principle $F$ can be estimated using 112data from the experiment (e.g. spike-in sequences). We decided to use 113the normal approximation to $F$ (allowing for user specified 114parameters of the normal distribution) in order to simplify the 115requirements for running {\tt Cufflinks} at this time. 116 117The assumption of random 118fragment selection is known to oversimplify the complexities of a sequencing 119experiment, however without rigorous ways to normalize we decided to work with 120the uniform at random assumption. It is easy to adapt the model to include 121more complex models that address sequencing bias as RNA-Seq experiments mature 122and the technologies are better understood. 123 124The transcript abundance estimation problem in paired-end RNA-Seq is to 125estimate $\rho$ given a set of transcripts $T$ and a set of reads sequenced 126from the ends of fragments. In {\tt Cufflinks}, the transcripts $T$ can be 127specified by the user, or alternatively $T$ can be estimated directly from the 128reads. The latter problem is the transcript assembly problem which we discuss 129in Section \ref{sec:assembly}. We ran {\tt Cufflinks} in the latter 130``discovery'' mode where we assembled the transcripts without using the 131reference annotation. 132 133The fact that fragments have different lengths has bearing on the calculation 134of the probability of selecting a fragment from a transcript. Consider a 135transcript $t$ with length $l(t)$. The probability of selecting a fragment of 136length $k$ from $t$ at one of the positions in $t$ assuming that it is 137selected uniformly at random, is $\frac{1}{l(t)-k}$. For this reason, we will 138define an adjusted length for transcripts as 139 140\begin{equation} 141\tilde{l}(t) = \sum_{i=1}^{l(t)} F(i)(l(t)-i+1). 142\end{equation} 143We also revisit the definition of length for a group of transcripts, and define 144\begin{equation} 145\tilde{l}(S) =\frac{\sum_{t \in S} \rho_t\tilde{l}(t)}{\sum_{t \in S}\rho_t}. 146\end{equation} 147 148It is important to note that given a read it may not be obvious from which 149transcript the fragment it was sequenced from originated. The consistency of 150fragments with transcripts is important and we define the {\em 151fragment-transcript matrix} $A_{R,T}$ to be the $M \times |T|$ matrix with 152$A(r,t)=1$ if the fragment alignment $r$ is completely contained in the 153genomic interval spanned by $t$, and all the implied introns in $r$ match 154introns in $t$ (in order), and with $A(r,t)=0$ otherwise. Note that the reads 155in Figure 1c in the main text are colored according to the matrix $A_{R,T}$, 156with each column of the matrix corresponding to one of the three colors 157(yellow, blue, red) and reads colored according to the mixture of colors 158corresponding to the transcripts their fragments are contained in. 159 160Even given the read 161alignment to a reference genome, it may not be obvious what the length of the 162fragment was. Formally, in the case that $A_{R,T}(r,t)=1$ we denote by $I_t(r)$ the fragment length from within 163a transcript $t$ implied by the (presumably unique) sequences corresponding to 164the mate pairs of a fragment $r$. If $A_{R,T}(r,t)=0$ then $I_t(r)$ is set to be 165infinite and $F(I_t(r)) = 0$. 166 167Given a set of reads, we assume that we can identify for each of them the set 168of transcripts with which the fragments the reads belonged to are consistent. The 169rationale for this assumption is the following: we map the reads to a 170reference genome, and we assume that the read lengths are sufficiently long so 171that every mate-pair can be uniquely mapped to the genome. We refer to this 172mapping as the {\em fragment alignment}. We also assume that we know all the 173possible transcripts and their alignments to the genome. Therefore, we can 174identify for each read the possible transcripts from which the fragment it 175belonged to originated. 176 177\begin{figure}[!ht] 178 \includegraphics[scale=0.8]{pdfs/implied_length} 179 \caption[Implied length of a fragment alignment]{Alignments of reads to the genome (rectangles) may be consistent with multiple transcripts (in this case both $t_1$ and $t_2$). The transcripts $t_1$ and $t_2$ differ by an internal exon; introns are indicated by long dashed lines. If we denote the fragment alignment by $r$, this means that $A_{R,T}(r,t_1)=1$ and $A_{R,T}(r,t_2)=1$. It is apparent that the implied length $I_{t_1}(r) > I_{t_2}(r)$ due to the presence of the extra internal exon in $t_1$. } 180\end{figure} 181 182We are now ready to write down the likelihood equation for the model. We will 183write $L(\rho|R)$ for the likelihood of a set of fragment alignments $R$ 184constructed from $M$ reads. The notation $Pr(trans. = t)$ means ``the 185probability that a fragment selected at random originates from transcript $t$''. 186 187\begin{eqnarray} 188& & L(\rho|R) = \prod_{r \in R} Pr(rd.\ aln. =r)\\ 189& = & \prod_{r \in R} \sum_{t \in T} Pr(rd.\ aln. =r|trans. =t)Pr(trans. =t)\\ 190& = & \prod_{r \in R} \sum_{t \in T} \frac{\rho_t\tilde{l}(t)}{\sum_{u \in T}\rho_u\tilde{l}(u)} Pr(rd.\ aln. = r|trans. =t)\\ 191& = & \prod_{r \in R} \sum_{t \in T}\frac{\rho_t\tilde{l}(t)}{\sum_{u \in T}\rho_u\tilde{l}(u)} \left(\frac{F(I_t(r))}{l(t)-I_t(r)+1}\right) \\ \label{eq:likerho} 192& = & \prod_{r \in R} \sum_{t \in T} \alpha_t\left(\frac{F(I_t(r))}{l(t)-I_t(r)+1}\right), 193\end{eqnarray} 194where 195\begin{equation} 196\alpha_t = \frac{\rho_t\tilde{l}(t)}{\sum_{u \in T}\rho_u\tilde{l}(u)}. 197\end{equation} 198 199Observe that $\alpha_t$ is exactly the probability that a fragment selected at 200random comes from transcript $t$, and we have that $\sum_{t \in T}\alpha_t = 2011$. In light of the probabilistic meaning of the $\alpha=\{\alpha_t\}_{t \in T}$, we refer to them as 202{\em fragment abundances}. 203 204It is evident that the likelihood function is that of a linear model 205and that the likelihood function is 206concave (Proposition \ref{prop:linearmodel}) so a numerical method can be used 207to find the $\alpha$. It is then possible, in principle, to recover the $\rho$ 208using Lemma \ref{lemma:readstoprobs}. However the number of parameters is in the tens of 209thousands, and in practice this form of the likelihood function is unwieldy. 210Instead, we re-write the likelihood utilizing the fact that transcripts in 211distinct loci do not overlap in genomic location. 212 213We first calculate the probability that a fragment originates from a transcript within a given locus $g$: 214\begin{eqnarray} 215\beta_g & := & \sum_{t \in g} \alpha_t\\ 216& = & \frac{\sum_{t \in g} \rho_t \tilde{l}(t)}{\sum_{u \in T} \rho_u \tilde{l}(u)}\\ 217& = & \frac{\sum_{t \in g} \sigma_g\tau_t \tilde{l}(t)}{\sum_{h \in G} \sum_{u \in h} \sigma_h\tau_u \tilde{l}(u)}\\ 218& = & \frac{\sigma_g\sum_{t \in g} \tau_t \tilde{l}(t)}{\sum_{h \in G} \sigma_h \sum_{u \in h} \tau_u \tilde{l}(u)}\\ 219 & = & \frac{\sigma_g \tilde{l}(g)}{\sum_{h \in G} \sigma_h \tilde{l}(h)}. 220\end{eqnarray} 221Recall that $\sigma_g = \sum_{t \in g} \rho_t$ and that $\tau_t = \frac{\rho_t}{\sigma_g}$ for a locus $g$. 222 223Similarly, the probability of selecting a fragment from a single transcript $t$ conditioned on selecting a transcript from the locus $g$ in which $t$ is contained is 224\begin{equation} 225\label{eq:gammatau} 226\gamma_t = \frac{\tau_t \tilde{l}(t)}{\sum_{u \in g} \tau_u \tilde{l}(u)}. 227\end{equation} 228The parameters $\gamma=\{\gamma_t\}_{t \in g}$ are conditional fragment abundances, and they are the parameters we estimate from the data in the next Section. Note that for a transcript $t \in g$, $\alpha_t = \beta_g \cdot \gamma_t$ and it is easy to convert between fragment abundances and transcript abundances using Lemma \ref{lemma:readstoprobs}. 229 230We denote the fragment counts by $X$; specifically, we denote the number of 231alignments in locus $g$ by $X_g$. Note that $\sum_{g \in G} X_g = M$. We also 232use the notation $g_r$ to denote the (unique) locus from which a read 233alignment $r$ can be obtained. 234 235The likelihood function is given by 236\begin{eqnarray} 237& & L(\rho|R) = \prod_{r \in R} Pr(rd.\ aln. =r)\\ 238& = & \prod_{r \in R} \sum_{g \in G} Pr(rd.\ aln. =r|locus=g)Pr(locus=g)\\ 239& = & \prod_{r \in R} \frac{\sigma_{g_{r}}\tilde{l}(g_{r})}{\sum_{g \in G} \sigma_g\tilde{l}(g)} Pr(rd.\ aln. =r|locus=g_r)\\ 240& = & \prod_{r \in R} \beta_{g_r} 241\sum_{t \in g_r}Pr(rd.\ aln. =r|locus=g_{r},trans. = t)Pr(trans. =t|locus=g_r)\\ 242& = & \prod_{r \in R} \beta_{g_r} 243\sum_{t \in g_r} \frac{\tau_t\tilde{l}(t)}{\sum_{u \in g_r}\tau_u \tilde{l}(u)} Pr(rd.\ aln. =r|locus=g_{r},trans. = t)\\ 244& = & \left( \prod_{r \in R} \beta_{g_r} \right) \left( \prod_{r \in R} \sum_{t \in g} \gamma_t \cdot 245Pr(rd.\ aln. =r|locus=g_r,trans. =t) \right)\\ 246& = & \left( \prod_{r \in R} \beta_{g_r} \right) \left( \prod_{r \in R} \sum_{t \in g} \gamma_t\cdot \frac{F(I_t(r))}{l(t)-I_t(r)+1}\right)\\ 247& = & \left( \prod_{g \in G} \beta_g^{X_{g}} \right) \left( \prod_{g \in G} \left( \prod_{r \in R:r \in g} \sum_{t \in g} \gamma_t \cdot 248\frac{F(I_t(r))}{l(t)-I_t(r)+1}\right) \right). \label{eq:likebest} 249\end{eqnarray} 250 251Explicitly, in terms of the parameters $\rho$, Equation (\ref{eq:likebest}) 252simplifies to Equation (\ref{eq:likerho}) but we will see in the next section 253how the maximum likelihood estimates $\hat{\rho}$ are most conveniently 254obtained by first finding $\hat{\beta}$ and $\hat{\gamma}$ using Equation 255(\ref{eq:likebest}). 256 257We note that it is biologically meaningful to include prior distributions on 258$\sigma$ and $\tau$ that reflect the inherent stochasticity and resulting 259variability of transcription in a cell. This will be an interesting direction 260for further research as more RNA-Seq data (with replicates) becomes available 261allowing for the determination of biologically meaningful priors. In 262particular, it seems plausible that specific isoform abundances may vary 263considerably and randomly within cells from a single tissue and that this may 264be important in studying differential splicing. We mention to this to clarify 265that in this paper, the confidence intervals we report represent the 266variability in the maximum likelihood estimates $\hat{\sigma}_j$ and 267$\hat{\tau}^k_j$, and are not the variances of prior distributions. 268 269 270\subsubsection{Estimation of parameters} 271 272We begin with a discussion of identifiability of our 273model. Identifiability refers to the injectivity of the model, i.e., 274\begin{equation} 275\mbox{if } Pr_{\rho_1}(r) = Pr_{\rho_2}(r) \ \forall r \in R, \ \mbox{ then } \rho_1 = \rho_2. 276\end{equation} 277 278 279The identifiability of RNA-Seq models was discussed in 280\cite{Hiller2009}, where a standard analysis for linear models is 281applied to RNA-Seq (for another related biological example, see \cite{Pe'er2004} which discusses 282identifiability of haplotypes in mixed populations from genotype data). 283The results in these papers apply to our model. For completeness we review the conditions for 284identifiability. Recall that $A_{R,T}$ is the fragment-transcript matrix that specifies which transcripts each fragment is compatible with. The following theorem provides a simple characterization of identifiability: 285 286\begin{thm} 287The RNA-Seq model is identifiable iff $A_{R,T}$ is full rank. 288\end{thm} 289 290Therefore, for a given set of transcripts and a read set $R$, we can 291test whether the model is identifiable using elementary linear 292algebra. For the results in this paper, when estimating expression 293with given annotations, when the model was not identifiable we picked {\em a} maximum likelihood solution, 294although in principle it is possible to bound the total expression of 295the locus and/or report identifiability problems to the user. 296 297Returning to the likelihood function 298\begin{equation} 299\left( \prod_{g \in G} \beta_g^{X_{g}} \right) \left( \prod_{g \in G} \left( \prod_{r \in R:r \in g} \sum_{t \in g} \gamma_t \cdot 300\frac{F(I_t(r))}{l(t)-I_t(r)+1}\right) \right), 301\end{equation} 302we note that both the $\beta$ and $\gamma$ parameters depend on the $\rho$ parameters. However, we will see that if we maximize the $\beta$ separately from the $\gamma$, and also each of the sets $\{\gamma_t:t \in g\}$ separately, then it is always possible to find $\rho$ that match both the maximal $\beta$ and $\gamma$. In other words, 303the problem of finding $\hat{\rho}$ is equivalent 304to finding $\hat{\beta}$ that maximizes 305$ \prod_{g \in G} \beta_g^{X_g}$ 306and separately, for each locus $g$, the $\hat{\gamma}_t$ that maximize 307\begin{equation} 308\prod_{r \in R:r \in g} \sum_{t \in g} \gamma_t 309\frac{F(I_t(r))}{l(t)-I_t(r)+1}. 310\end{equation} 311 312We begin by solving for the $\hat{\beta}$ and $\hat{\gamma}$ and the variances 313of the maximum likelihood estimates, and then explain how these are used to 314report expression levels. 315 316We can solve for the $\hat{\gamma}$ using the fact that the model is linear. 317That is, the probability of each individual read is linear in the read 318abundances $\gamma_t$. It is a standard result in statistics (see, e.g., 319Proposition 1.4 in \cite{ASCB2005}) that the log likelihood function of a 320linear model is concave. Thus, a hill climbing method can be used to find the 321$\hat{\gamma}$. We used the EM algorithm for this purpose. 322 323Rather than using the direct ML estimates, we obtained a regularized 324estimate by importance sampling from the posterior distribution with a 325proposal distribution we explain below. The samples were also used to estimate 326variances for our estimates. 327 328It follows from standard MLE asymptotic 329theory that the $\hat{\gamma}$ are asymptotically multivariate normal with 330variance-covariance matrix given by the inverse of the observed Fisher 331information matrix. This matrix is defined as follows: 332 333\begin{defn}[Observed Fisher information matrix] 334The observed Fisher information matrix is the negative of the Hessian of the log likelihood function evaluated at the maximum likelihood estimate. That is, for parameters $\Theta=(\theta_1,\ldots,\theta_n)$, the $n \times n$ matrix is 335\begin{eqnarray} 336\mathcal{F}_{k,l}(\hat{\Theta}) & = & - \frac{\partial^2 log(\mathcal{L}(\Theta|R))}{\partial \theta_k \theta_l} |_{\theta=\hat{\theta}}. 337\end{eqnarray} 338\end{defn} 339In our case, considering a single locus $g$, the parameters are $\Theta = (\gamma_{t_1},\ldots,\gamma_{t_{|g|}})$, and as expected from Proposition \ref{prop:linearmodel}: 340\begin{eqnarray} 341\label{eq:Fisher} 342\mathcal{F}_{t_k,t_l}(\hat{\Theta}) & = & \sum_{r \in R:r \in g} 343\left[ \frac{1}{\left( \sum_{h \in g} \hat{\gamma}_h \frac{F(I_h(r))}{l(h) - I_h(r)+1} \right)^2} \frac{F(I_{t_k}(r)) F(I_{t_l}(r)) }{(l(t_k)-I_{t_k}+1)(l(t_l)-I_{t_l}+1)} \right]. 344\end{eqnarray} 345 346 347Because some of the transcript abundances may be close to zero, we 348adopted the 349Bayesian approach of \cite{Jiang2009} and instead sampled from the 350joint posterior distribution of $\Theta$ using the proposal 351distribution consisting of the multivariate normal with mean 352given by the MLE, and variance-covariance matrix given by the inverse of 353(\ref{eq:Fisher}). If the Observed Fisher Information Matrix is 354singular then the user is warned and the confidence intervals of all 355transcripts are set to $[0,1]$ (meaning that there is no information 356about relative abundances). 357 358The method used for sampling was importance sampling. The samples were used to obtain a maximum-a-posterior estimate for 359$\hat{\gamma}_t$ for each $t$ and for the variance-covariance matrix which we 360denote by $\Psi^g$ (where $g \in G$ denotes the locus). Note that $\Psi^g$ is 361a $|g| \times |g|$ matrix. The covariance between $\hat{\gamma}_{t_k}$ and 362$\hat{\gamma}_{t_l}$ for $t_k,t_l \in g$ is given by $\Psi^g_{t_k,t_l}$. 363 364Turning to the maximum likelihood estimates $\hat{\beta}$, we use the fact that the model is the log-linear. Therefore, 365\begin{equation} 366\label{eq:sigmahat} 367\hat{\beta_g} = \frac{X_{g}}{M}. 368\end{equation} 369 370Viewed as a random variable, the counts $X_{g}$ are approximately Poisson and therefore the variance of the MLE $\hat{\beta}_g$ is approximately $X_{g}$. We note that for the tests in this paper we directly used the total counts $M$ and the proportional counts $X_g$, however it is easy to incorporate recent suggestions for total count normalization, such as \cite{Bullard2010} into {\tt Cufflinks}. 371 372The favored units for reporting expression in RNA-Seq studies to date is not 373using the transcript abundances directly, but rather using a measure 374abbreviated as FPKM, which means ``expected number of fragments per kilobase 375of transcript sequence per millions base pairs sequenced''. These units are 376equivalent to measuring transcript abundances (multiplied by a scalar). The 377computational advantage of FPKM, is that the normalization constants 378conveniently simplify some of the formulas for the variances of transcript 379abundance estimates. 380 381For example, the abundance of a transcript $t \in g$ in FPKM units is 382\begin{equation} 383\label{eq:FPKM1} 384 \frac{10^6 \cdot 10^3 \cdot \alpha_t}{\tilde{l}(t)} = \frac{10^6 \cdot 10^3 \cdot \beta_g \cdot \gamma_t}{\tilde{l}(t)}. 385\end{equation} 386 387Equation (\ref{eq:FPKM1}) makes it clear that although the abundance of each 388transcript $t \in g$ in FPKM units is proportional to the transcript abundance 389$\rho_t$ it is given in terms of the read abundances $\beta_g$ and $\gamma_t$ 390which are the parameters estimated from the likelihood function. 391 392The maximum likelihood estimates of $\beta_g$ and $\gamma_t$ are random 393variables, and we denote their scaled product (in FPKM units) by $A_t$. That 394is $Pr(A_t = a)$ is the probability that for a random set of fragment alignments 395from a sequencing experiment, the maximum likelihood estimate of the 396transcript abundance for $t$ in FPKM units is $a$. 397 398Using the fact that the expectation of a product of independent random 399variables is the product of the expectations, for a transcript $t \in g$ we 400have 401\begin{equation} 402E[A_t] = \frac{10^9X_{g}\hat{\gamma}_t}{\tilde{l}(t)M}. 403\end{equation} 404 405Given the variance estimates for the $\hat{\gamma}_t$ we turn to the problem of estimating $Var[A_t]$ for a transcript $t \in g$. We use Lemma \ref{lemma:varproduct} to obtain 406\begin{eqnarray} 407Var[A_t]& = & \left(\frac{10^9}{\tilde{l}(t)M}\right)^2 \left( \Psi^g_{t,t}X_{g} + \Psi^g_{t,t}X^2_{g} + (\hat{\gamma}_t)^2 X_{g} \right)\\ 408& = & X_{g}\left(\frac{10^9}{\tilde{l}(t)M)}\right)^2 \left( \Psi^g_{t,t}(1+X_{g}) + (\hat{\gamma}_t)^2\right). 409\end{eqnarray} 410 411This variance calculation can be used to estimate a confidence interval by 412utilizing the fact \cite{Aroian1978} that when the expectation divided by the 413standard deviation of at least one of two random variables is large, their 414product is approximately normal. 415 416Next we turn to the problem of estimating expression levels (and variances of 417these estimates) for groups of transcripts. Let $S \subset T$ be a group of 418transcripts located in a single locus $g$, e.g. a collection of transcripts 419sharing a common TSS. 420 421The analogy of Equation (\ref{eq:FPKM1}) for the FPKM of the group is 422\begin{eqnarray} 423\label{eq:grouphard} 424& & \frac{10^6 \cdot 10^3 \cdot \beta_g \cdot \left( \sum_{t \in S} \gamma_t\right)}{\tilde{l}(S)}\\ 425& = & 10^6 \cdot 10^3 \cdot \beta_g \cdot \sum_{t \in S} \frac{\gamma_t}{\tilde{l}(t)}. \label{eq:groupeasy} 426\end{eqnarray} 427 428As before, we denote by $B_S$ the random variables for which $Pr(B_S = b)$ is 429the probability that for a random set of fragment alignments from a sequencing 430experiment, the maximum likelihood estimate of the transcript abundance for 431all the transcripts in $S$ in FPKM units is $b$. We note that the $B_S$ are 432products and sums of random variables (Equation (\ref{eq:groupeasy})). This 433makes Equation (\ref{eq:groupeasy}) more useful than the 434equivalent unsimplified Equation (\ref{eq:grouphard}), especially because 435$\tilde{l}(S)$ is, in general, a ratio of two random variables. 436 437We again use the fact that the expectation of independent random variables is 438the product of the expectation, in addition to the fact that expectation is a 439linear operator to conclude that for a group of transcripts $S$, 440 441\begin{equation} 442\label{eq:expectTSS} 443E[B_S] = \frac{10^9 \cdot X_g \cdot \sum_{t \in S} \frac{\hat{\gamma}_t}{\tilde{l}(t)}}{M}. 444\end{equation} 445In order to compute the variance of $B_S$, we first note that 446\begin{equation} 447Var\left[\sum_{t \in S} \frac{\hat{\gamma}_t}{\tilde{l}(t)}\right] = \sum_{t \in S}\frac{1}{\tilde{l}(t)^2}\Psi^g_{t,t} + \sum_{t,u \in S} \frac{1}{\tilde{l}(t)\tilde{l}(u)} \Psi^g_{t,u}. 448\end{equation} 449Therefore, 450\begin{eqnarray} 451& & \quad Var[B_S] \quad = \quad \nonumber \\ 452& & X_g\left(\frac{10^9}{M}\right)^2\left( \left(1+X_g\right) \left(\sum_{t \in S}\frac{1}{\tilde{l}(t)^2}\Psi^g_{t,t} + \sum_{t,u \in S} \frac{1}{\tilde{l}(t)\tilde{l}(u)} \Psi^g_{t,u}\right) + \left( \sum_{t \in S} \frac{\hat{\gamma}_t}{\tilde{l}(t)} \right)^2 \right). \label{eq:varTSS} 453\end{eqnarray} 454 455We can again estimate a confidence interval by utilizing the fact that $B_S$ 456is approximately normal \cite{Aroian1978}. 457 458\subsection{Transcript assembly} 459\label{sec:assembly} 460\subsubsection{Overview} 461 462{\tt Cufflinks} takes as input alignments of RNA-Seq fragments to a reference 463genome and, in the absence of an (optional) user provided annotation, initially assembles transcripts from the alignments. Transcripts in 464each of the loci are assembled independently. The assembly algorithm 465is designed to aim for the following: 466\begin{enumerate} 467\item Every fragment is consistent with at least one assembled 468 transcript. 469\item Every transcript is tiled by reads. 470\item The number of transcripts is the smallest required to satisfy requirement (1). 471\item The resulting RNA-Seq models (in the sense of Section \ref{sec:abundances}) are identifiable. 472\end{enumerate} 473In other words, we seek an assembly that parsimoniously “explains” the fragments from the RNA-Seq experiment; every 474fragment in the experiment (except those filtered out during a preliminary 475error-control step) should have come from a {\tt Cufflinks} transcript, and 476{\tt Cufflinks} should produce as few transcripts as possible with that property. 477Thus, {\tt Cufflinks} seeks to optimize the criterion suggested in \cite{Xing2004}, 478however, unlike the method in that paper, {\tt Cufflinks} leverages Dilworth's 479Theorem \cite{Dilworth1950} to solve the problem by reducing it 480to a matching problem via the equivalence of Dilworth's and K\"{o}nig's theorems (Theorem \ref{thm:dilko} in Appendix A). Our approach to isoform reconstruction is inspired by a similar approach used for haplotype reconstruction from HIV quasi-species \cite{Eriksson2008}. 481 482\subsection{A partial order on fragment alignments} 483 484The {\tt Cufflinks} program loads a set of alignments in SAM format sorted by reference 485position and assembles non-overlapping sets of alignments independently. After 486filtering out any erroneous spliced alignments or reads from incompletely 487spliced RNAs, {\tt Cufflinks} constructs a partial order (Definition \ref{def:po}), or equivalently a directed acyclic graph (DAG), with 488one node for each fragment that in turn consists of an aligned pair of mated reads. First, we note that fragment alignments are of two types: those where reads align in their entirety to the genome, and reads which have a split alignment (due to an implied intron). 489 490In the case of single reads, the partial order can be simply 491constructed by checking the reads for {\em compatibility}. Two reads 492are {\em compatible} if their overlap contains the exact same implied 493introns (or none). If two reads are not compatible they are {\em 494 incompatible}. The reads can be partially ordered by defining, for 495two reads $x,y$, that $x\leq y$ if the starting coordinate of $x$ is 496at or before the starting coordinate of $y$, and if they are 497compatible. 498 499In the case of paired-end RNA-Seq the situation is more complicated 500because the unknown sequence between mate pairs. To understand this, 501we first note that pairs of fragments can still be determined to be 502incompatible if they cannot have originated from the same 503transcript. As with single reads, this happens when there is 504disagreement on implied introns in the overlap. However compatibility 505is more subtle. We would like to define a pair of fragments $x,y$ to 506be compatible if they do not overlap, or if every implied intron in one fragment overlaps an identical implied intron in the other fragment. 507 508However it is important to note that it may be impossible to determine 509the compatibility (as defined above) or incompatibility of a pair of fragments. For example, an unknown region internal to a fragment may overlap two different introns (that are incompatible with each other). The fragment may be compatible with one of the introns (and the fragment from which it originates) in which case it is incompatible with the other. Since the opposite situation is also feasible, compatibility (or incompatibility) cannot be assigned. Fragments for which the compatibility/incompatibility cannot be determined with respect to every other fragment are called {\em uncertain}. Finally, two fragments are called {\em nested} if one is contained within the other. 510 511 512\begin{figure}[h] 513 \includegraphics[scale=0.5]{pdfs/compatibility.pdf} 514 \caption[Compatibility and incompatibility of 515 fragments]{Compatibility and incompatibility of 516 fragments. End-reads are solid lines, unknown sequences within 517 fragments are shown by dotted lines and implied introns are 518 dashed lines. The reads in (a) are compatible, whereas the 519 fragments in (b) are incompatible. The fragments in (c) are 520 nested. Fragment $x_4$ in (d) is uncertain, because $y_4$ and 521 $y_5$ are incompatible with each other.} 522\end{figure} 523 524Before constructing a partial order, fragments are extended to include 525their nested fragments and uncertain fragments are discarded. These discarded fragments 526are used in the abundance estimation. In theory, this may result in 527suboptimal (i.e. non-minimal assemblies) but we determined empirically 528that after assembly uncertain fragments are almost always consistent 529with one of the transcripts. When they are not, there was no 530completely tiled transcript that contained them. Thus, we employ a 531heuristic that substantially speeds up the program, and that works 532in practice. 533 534A partial order $P$ is then constructed from the remaining fragments 535by declaring that $x \leq y$ whenever the fragment corresponding to 536$x$ begins at, or before, the location of the fragment corresponding 537to $y$ and $x$ and $y$ are compatible. In what follows we identify $P$ 538with its Hasse diagram (or covering relation), equivalently a directed 539acyclic graph (DAG) that is the transitive reduction. 540 541\begin{prop} 542$P$ is a partial order. 543\end{prop} 544{\bf Proof}: The fragments can be totally ordered according to the locations where they begin. It therefore suffices to check that if $x,y,z$ are fragments with $x$ compatible with $y$ and $y$ compatible with $z$ then $x$ is compatible with $z$. Since $x$ is not uncertain, it must be either compatible or incompatible with $z$. The latter case can occur only if $x$ and/or $z$ contain implied introns that overlap and are not identical. Since $y$ is not nested within $z$ and $x$ is not nested within $y$, it must be that $y$ contains an implied intron that is not identical with an implied intron in either $x$ or $z$. Therefore $y$ cannot be compatible with both $x$ and $z$. \qed 545 546\subsection{Assembling a parsimonious set of transcripts} 547 548In order to assemble a set of transcripts, {\tt Cufflinks} finds a 549(minimum) partition of $P$ into chains (see Definition 550\ref{def:po}). A partition of $P$ into chains yields an assembly 551because every chain is a totally ordered set of compatible fragments 552$x_1,\ldots,x_l$ and therefore there is a set of overlapping fragments 553that connects them. By Dilworth's theorem (Theorem \ref{thm:Dilworth}), the problem of finding a minimum partition $P$ into chains is equivalent to finding a maximum antichain in $P$ (an antichain is a set of mutually incompatible fragments). Subsequently, by Theorem \ref{thm:dilko}, the problem of finding a maximum antichain in $P$ can be reduced to finding a maximum matching in a certain bipartite graph that emerges naturally in deducing Dilworth's theorem from K\"{o}nig's theorem \ref{thm:konig}. 554We call the key bipartite graph the ``reachability'' graph. It is the transitive closure of the DAG, i.e. it is the graph 555where each fragment $x$ has nodes $L_x$ and $R_x$ in the left and 556right partitions of the reachability graph respectively, and where there is an edge between $L_x$ and $R_y$ when $x \leq y$ in $P$. The maximum matching problem is a classic problem that admits a polynomial time algorithm. The Hopcroft-Karp algorithm \cite{Hopcroft1973} has a run time of $O(\sqrt{V}E)$ where in our case $V$ is the number of fragments and $E$ depends on the extent of overlap, but is bounded by a constant times the coverage depth. We note that our parsimony approach to assembly therefore has a better complexity than the $O(V^3)$ PASA algorithm \cite{Haas2003}. 557 558The minimum cardinality chain decomposition computed using the approach above may not be unique. 559For example, a locus may contain two putative distinct initial exons (defined by overlapping incompatible fragments), and one 560of two distinct terminal and a constitutive exon in between that is longer than any 561read or insert in the RNA-Seq experiment. In such a case, the parsimonious assembly will consist of two transcripts, but there are four possible solutions that are all minimal. In order to ``phase'' distant exons, we leverage the fact that abundance inhomogeneities can link distant exons via their coverage. We therefore weight the edges of the bipartite reachability graph based on the percent-spliced-in metric introduced by 562Wang \emph{et al.} in \cite{Wang2008}. In our setting, the percent-spliced-in 563$\psi_x$ for an alignment $x$ is computed by counting the alignments 564overlapping $x$ in the genome that are compatible with $x$ and dividing by the 565total number of alignments that overlap $x$, and normalizing for the length of 566the $x$. The cost $C(y,z)$ assigned to an edge between alignments $y$ and $z$ reflects 567the belief that they originate from different transcripts: 568 569 570\begin{equation} 571 C(y,z) = -\log(1 - |\psi_y - \psi_z|). 572 \label{eq:match_cost} 573\end{equation} 574 575Rather than using the Hopcroft-Karp algorithm, a modified version of the {\tt 576LEMON} \cite{LEMON} and {\tt Boost} \cite{Boost} graph libraries are used to 577compute a {\em min-cost} maximum cardinality matching on the bipartite compatibility 578graph. Even with the presence of weighted edges, our algorithm is 579very fast. The best known algorithm for weighted matching is $O(V^2logV+VE)$. 580 581Because we isolated total RNA, we 582expected that a small fraction of our reads would come from the intronic 583regions of incompletely processed primary transcripts. Moreover, transcribed 584repetitive elements and low-complexity sequence result in ``shadow'' 585transfrags that we wished to discard as artifacts. Thus, {\tt Cufflinks} 586heuristically identifies artifact transfrags and suppresses them in its 587output. We also filter extremely low-abundance minor isoforms of alternatively 588spliced genes, using the model described in Section \ref {abundances} as a 589means of reducing the variance of estimates for more abundant transcripts. A 590transcript $x$ meeting any of the following criteria is suppressed: 591 592\begin{enumerate} 593 \item $x$ aligns to the genome entirely within an intronic region of the alignment for a transcript $y$, and the abundance of $x$ is less than 15\% of $y$'s abundance. 594 \item $x$ is supported by only a single fragment alignment to the genome. 595 \item More than 75\% of the fragment alignments supporting $x$, are mappable to multiple genomic loci. 596 \item $x$ is an isoform of an alternatively spliced gene, and has an estimated abundance less than 5\% of the major isoform of the gene. 597\end{enumerate} 598 599Prior to transcript assembly, {\tt Cufflinks} also filters out some of the alignments 600for fragments that are likely to originate from incompletely spliced nuclear 601RNA, as these can reduce the accuracy abundance estimates for fully spliced 602mRNAs. These filters and the output filters above are detailed in the source 603file \verb!filters.cpp! of the source code for {\tt Cufflinks}. 604 605In the overview of this Section, we mentioned that our assembly algorithm has the property that the resulting models are identifiable. This is a convenient property that emerges naturally from the parsimony criterion for a ``minimal explanation'' of the fragment alignments. Formally, it is a corollary of Dilworth's theorem: 606 607\begin{prop} 608The assembly produced by the {\tt Cufflinks} algorithm always results in an 609identifiable RNA-Seq model. 610\end{prop} 611{\bf Proof}: By Dilworth's theorem, the minimum chain decomposition (assembly) 612we obtain has the same size as the maximum antichain in the partially 613ordered set we 614construct from the reads. An antichain consists of reads that are 615pairwise incompatible, and therefore those reads must form a 616permutation sub-matrix in the fragment-transcript matrix $A_{R,T}$ with columns corresponding to the transcripts in a locus, and with rows corresponding to the fragments in the antichain. The matrix $A_{R,T}$ therefore contains permutation sub-matrices that together span all the columns, and the matrix is full-rank. 617 618\subsection{Assessment of assembly quality} 619 620To compare {\tt Cufflinks} transfrags against annotated transcriptomes, and 621also to find transfrags common to multiple assemblies, we developed a tool 622called {\tt Cuffcompare} that builds structural equivalence classes of 623transcripts. We ran {\tt Cuffcompare} on each the assembly from each time point against the 624combined annotated transcriptomes of the {UCSC known genes}, {\tt Ensembl}, 625and {\tt Vega}. Because of the stochastic nature of sequencing, \emph{ab 626initio} assembly of the same transcript in two different samples may result in 627transfrags of slightly different lengths. A {\tt Cufflinks} transfrag was 628considered a complete match when there was a transcript with an identical 629chain of introns in the combined annotation. 630 631When no complete match is found between a {\tt Cufflinks} transfrag and the 632transcripts in the combined annotation, {\tt Cuffcompare} determines and reports if 633another substantial relationship exists with any of the annotation 634transcripts that can be found in or around the same genomic locus. For 635example, when all the introns of a transfrag match perfectly a part of the 636intron chain (sub-chain) of an annotation transcript, a ``containment'' 637relationship is reported. For single-exon transfrags, containment is also 638reported when the exon appears fully overlapped by any of the exons of an 639annotation transcript. If there is no perfect match for the intron chain of a 640transfrag but only some exons overlap and there is at least one intron-exon 641junction match, {\tt Cuffcompare} classifies the transfrag as a putative ``novel'' 642isoform of an annotated gene. When a transfrag is unspliced (single-exon) and 643it overlaps the intronic genomic space of a reference annotation transcript, 644the transfrag is classified as potential pre-mRNA fragment. Finally, when no 645other relationship is found between a {\tt Cufflinks} transfrag and an 646annotation transcript, {\tt Cuffcompare} can check the repeat content of the 647transfrag's genomic region (assuming the soft-masked genomic sequence was also 648provided) and it would classify the transfrag as ``repeat'' if most of its 649bases are found to be repeat-masked. 650 651When provided multiple time point assemblies, {\tt Cuffcompare} matches transcripts 652 653\subsubsection{Testing for changes in absolute expression} 654 655Between any two consecutive time points, we tested whether a transcript was 656significantly (after FDR control \cite{Benjamini1995}) up or down regulated with respect to the 657null hypothesis of no change, with variability in expression due solely to the 658uncertainties resulting from our abundance estimation procedure. This was done 659using the following testing procedure for absolute differential expression: 660 661We employed the standard method used in microarray-based expression analysis 662and proposed for RNA-Seq in \cite{Bullard2010}, which is to compute the 663logarithm of the ratio of intensities (in our case FPKM), and to then use the 664delta method to estimate the variance of the log odds. We describe this for 665testing differential expression of individual transcripts and also groups of 666transcripts (e.g. grouped by TSS). 667 668We recall that the MLE FPKM for a transcript $t$ in a locus $g$ is given by 669\begin{equation} 670\frac{10^9X_g\hat{\gamma}_t}{\tilde{l}(t)M}. 671\end{equation} 672Given two different experiments resulting in $X^a_g,M^a$ and $X^b_g,M^b$ respectively, as well as $\hat{\gamma}^a_t$ and $\hat{\gamma}^b_t$, we would like to test the significance of departures from unity of the ratio of MLE FPKMS, i.e. 673\begin{eqnarray} 674& & \left(\frac{10^9X^a_g\hat{\gamma}^a_t}{\tilde{l}(t)M^a}\right) / \left(\frac{10^9X^b_g\hat{\gamma}^b_t}{\tilde{l}(t)M^b}\right)\\ 675& = & \frac{X_g^a\hat{\gamma}^a_tM^b}{X_g^b \hat{\gamma}^b_tM^a}. 676\end{eqnarray} 677 678This can be turned into a test statistic that is approximately normal by 679taking the logarithm, and normalizing by the variance. We recall that using 680the delta method, if $X$ is a random variable then $Var[log(X)] \approx 681\frac{Var[X]}{E[X]^2}$. 682 683Therefore, our test statistic is 684\begin{equation} 685\frac{log(X^a_g)+log(\hat{\gamma}^a_t)+log(M^b)-log(X^b_g)-log(\hat{\gamma}^b_t)-log(M^a)} 686{ \sqrt{\frac{\left( \Psi^{g,a}_{t,t}(1+X^a_{g}) + (\hat{\gamma}^a_t)^2\right)}{X^a_g \left(\hat{\gamma}^a_t \right)^2}+ 687\frac{\left( \Psi^{g,b}_{t,t}(1+X^b_{g}) + (\hat{\gamma}^b_t)^2\right)}{X^b_{g}\left( \hat{\gamma}^b_t\right)^2} } }. 688\end{equation} 689 690In order to test for differential expression of a group of transcripts, we 691replace the numerator and denominator above by those from Equations 692(\ref{eq:expectTSS}) and (\ref{eq:varTSS}). 693 694It is has been noted that the power of differential expression tests in RNA-Seq depend on the length of the transcripts being tested, because longer transcripts accumulate more reads \cite{Oshlack2009}. This means that the results we report are biased towards discovering longer differentially expressed transcripts and genes. 695 696\subsection{Quantifying transcriptional and post-transcriptional overloading} 697 698In order to infer the extent of differential promoter usage, we 699decided to measure changes in relative abundances of primary 700transcripts of single genes. Similarly, we investigated changes in 701relative abundances of transcripts grouped by TSS in order to infer 702differential splicing. These inferences required two ingredients: 703\begin{enumerate} 704\item A metric on probability distributions (derived from relative 705 abundances). 706\item A test statistic for assessing significant changes in 707 differential promoter usage and splicing as measured using the 708 metric referred to above. 709\end{enumerate} 710 711In order to address the first requirement, namely a metric on 712probability distributions, we turned to an entropy-based metric. This 713was motivated by the methods in \cite{Ritchie2008} where tests for 714differences in relative isoform abundances were performed to 715distinguish cancer cells from normal cells. We extend 716this approach to be able to test for relative isoform abundance changes among multiple 717experiments in RNA-Seq. 718 719\begin{defn}[Entropy] 720The entropy of a discrete probability distribution 721$p=(p_1,\ldots,p_n)$ ($0 \leq p_i \leq 1$ and $\sum_{i=1}^n p_i = 1$) is 722\begin{equation} 723H(p) = -\sum_{i=1}^n p_i log p_i. 724\end{equation} 725If $p_i=0$ for some $i$ the value of $p_i log p_i$ is taken to be $0$. 726\end{defn} 727\begin{defn}[The Jensen-Shannon divergence] 728The Jensen-Shannon divergence of $m$ discrete probability distributions $p^1,\ldots,p^m$ is defined to be: 729\begin{equation} 730JS(p^1,\ldots,p^m) = H\left(\frac{p^1 + \cdots + p^m}{m}\right) - \frac{\sum_{j=1}^m H(p^j)}{m}. 731\end{equation} 732 733In other words, the Jensen-Shannon divergence of a set of probability 734distributions is the entropy of their average minus the average of their 735entropies. \end{defn} 736 737In the case where $m=2$, we remark that the Jensen-Shannon divergence can also 738be described in terms of the Kullback-Leibler divergence of two discrete 739probability distributions. If we denote Kullback-Leibler divergence by 740 741\begin{equation} 742D(p^1\|p^2) = \sum_i p^1_i log \frac{p^1_i}{p^2_i}, 743\end{equation} 744then 745\begin{equation} 746JS(p^1,p^2) = \frac{1}{2}D(p^1\|m)+\frac{1}{2}D(p^2\|m) 747\end{equation} 748where $m=\frac{1}{2}(p^1+p^2)$. In other words the Jensen-Shannon 749divergence is a symmetrized variant of the Kullback-Leibler divergence. 750 751The Jensen-Shannon divergence has a number of useful properties: for 752example it is symmetric and non-negative. However it is {\em not} a 753metric. The following theorem shows how to construct a metric 754from the Jensen-Shannon divergence: 755 756\begin{thm}[Fuglede and Tops{\o}e 2004 \cite{Fuglede2004}] 757The square root of the Jensen-Shannon divergence is a metric. 758\end{thm} 759 760The proof of this result is based on a harmonic analysis argument that is the basis for 761the remark in the main paper that ``transcript abundances move in time along a 762logarithmic spiral in Hilbert space''. We therefore call the square root of the Jensen-Shannon divergence the {\em Jensen-Shannon metric}. We employed this metric in 763order to quantify relative changes in expression in (groups of) transcripts. 764 765In order to test for significance, we introduce a bit of notation. 766Suppose that $S$ is a collection of transcripts (for example, they may 767 share a common TSS). We define 768\begin{equation} 769\kappa_t = \frac{ \frac{\gamma_t}{\tilde{l}(t)}}{\sum_{u \in S} \frac{\gamma_u}{\tilde{l}(u)}} 770\end{equation} 771to be the proportion of transcript $t$ among all the transcripts in a 772group $S$. We let 773$Z=\sum_{u \in S} \hat{\gamma}_u / \tilde{l}(u)$ so that $\hat{\kappa}_t = \frac{\gamma_t}{\tilde{l}(t)Z}$. We therefore have that 774\begin{eqnarray} 775\label{eq:variancekappa1} 776Var[\hat{\kappa}_t] & = &\frac{Var[\hat{\gamma}_t]}{\tilde{l}(t)^2Z^2}, \\ 777Cov[\hat{\kappa}_t,\hat{\kappa}_u] & = & \frac{Cov[\hat{\gamma}_t,\hat{\gamma}_u]}{\tilde{l}(t)\tilde{l}(u)Z^2}.\label{eq:variancekappa2} 778\end{eqnarray} 779 780Our test statistic for divergent relative expression was the 781Jensen-Shannon metric. The test could be applied to multiple time 782points simultaneously, but we focused on pairwise tests (involving 783consecutive time points). Under the null hypothesis of no change in 784relative expression, the Jensen-Shannon metric should be 785zero. We tested for this using a one-sided $t$-test, based on an 786asymptotic derivation of the distribution of the Jensen-Shannon metric 787under the null hypothesis. This asymptotic distribution is normal by applying the delta method approximation, 788which involves computing the linear component of the Taylor expansion of the 789variance of $\sqrt{JS}$. 790 791In order to simplify notation, we let $f(p^1,\ldots,p^m)$ be the 792Jensen-Shannon metric for $m$ probability distributions $p^1,\ldots,p^m$. 793 794\begin{lemma} 795The partial derivatives of the Jensen-Shannon metric are give by 796\begin{equation} 797\frac{\partial f}{\partial p^k_l} = \frac{1}{2m\sqrt{f(p^1,\ldots,p^m)}} log \left( \frac{p^k_l}{\frac{1}{m} \sum_{j=1}^m p_l^j} \right). 798\end{equation} 799\end{lemma} 800 801Let $\hat{\kappa}^1,\ldots,\hat{\kappa}^m$ denote $m$ probability 802distributions on the set of transcripts $S$, for example the MLE for the transcript abundances in a time course. Then from the delta method we 803have that $\sqrt{JS(\hat{\kappa}^1,\ldots,\hat{\kappa}^m)}$ is approximately normally 804distributed with variance given by 805 806\begin{equation} 807Var[\sqrt{JS(\hat{\kappa}^1,\ldots,\hat{\kappa}^m)}] \approx (\bigtriangledown f)^T \Sigma (\bigtriangledown f), 808\end{equation} 809 810where $\Sigma$ is the variance-covariance matrix for the 811$\kappa^1,\ldots,\kappa^m$, i.e., it is a block diagonal matrix where the 812$i$th block is the variance-covariance matrix for the $\kappa^i_t$ given by 813Equations (\ref{eq:variancekappa1},\ref{eq:variancekappa2}). 814 815 816There are two biologically meaningful groupings of transcripts whose 817relative abundances are interesting to track in a 818time course. Transcripts that share a TSS are likely to be regulated by 819the same promoter, and therefore tracking the change in relative 820abundances of groups of transcripts sharing a TSS may reveal how 821transcriptional regulation is affecting expression over 822time. Similarly, transcripts that share a TSS and exhibit changes in 823expression relative to each other are likely to be affected by 824splicing or other post-transcriptional regulation. We therefore 825grouped transcripts by TSS and compared relative abundance changes 826within and between groups. 827 828We define ``overloading'' to be a significant change in relative 829abundances for a set of transcripts (as determined by the Jensen-Shannon metric, see below). The term is intended to generalize the simple notion of 830``isoform switching'' that is well-defined in the case of two 831transcripts, to multiple transcripts. It is complementary to absolute 832differential changes in expression: the overall expression of a gene 833may remain constant while individual transcripts change drastically in 834relative abundances resulting in overloading. The term is borrowed from 835computer science, where in some statically-typed programming 836languages, a function may be used in multiple, specialized instances 837via ``method overloading''. 838 839We tested for overloaded genes by performing a one-sided $t$-test based on the asymptotics of the Jensen-Shannon metric under the null hypothesis of no change in relative abundnaces of isoforms (either grouped by shared TSS for for post-transcriptional overloading, or by comparison of groups of isoforms with shared TSS for transcriptional overloading). Type I errors were controlled with the Benjamini-Hochberg \cite{Benjamini1995} correction for multiple testing. 840 A selection of overloaded genes are displayed in Supplemental Figs. \ref{splice_overloaded} and \ref{promoter_overloaded}. 841 842\newpage 843 844\begin{figure}[!ht] 845 \includegraphics{pdfs/splice_overloaded} 846 \caption[Selected genes with post-transcriptional 847 overloading]{Selected genes with post-transcriptional 848 overloading. Trajectories indicate the expression of individual 849 isoforms in FPKM ($y$ axis) over time in hours ($x$ 850 axis). Dashed isoforms have not been previously 851 annotated. Isoform trajectories are colored by TSS, so isoforms 852 with the same color presumably share a common promoter and are 853 processed from the same primary transcript. It is evident that 854 total gene expression may remain constant during isoforms 855 switching (Eya3) while in other cases changes in relative 856 abundance are accompanied by changes in absolute expression. 857The Jensen-Shannon metric generalizes the notion of ``isoform 858switching'' and is useful in cases with multiple isoforms 859(e.g. Ddx17). 860 \label{splice_overloaded}} 861\end{figure} 862 863\begin{figure}[!ht] 864 \includegraphics{pdfs/promoter_overloaded} 865 \caption[Selected genes with transcriptional overloading]{Selected 866 genes with transcriptional overloading. Trajectories indicate 867 the expression of individual isoforms in FPKM (y axis) over time 868 in hours (x axis). Dashed isoforms have not been previously 869 annotated. Isoform trajectories are colored by TSS, so that 870 isoforms with different colors presumably vary in their promoter 871 and are processed from different primary transcripts. \label{promoter_overloaded}} 872\end{figure} 873 874We can visualize overloading and expression dynamics with a plot that 875superimposes transcriptional and post-transcriptional overloading and 876gene-level expression over the time course. We refer to these as 877``Minard plots'', after Charles Joseph Minard's famous depiction 878of the advance and retreat of 879Napoleon's armies in the campaign against Russia in 1812 880\cite{Tufte2001}. Minard made use of multiple visual cues to display 881numerous varying quantities in one diagram. An example of a Minard 882plot for the gene 883Myc is shown in Figure 3c, and others are given in Appendix B. The dotted line indicates gene-level FPKM, with measured 884FPKM indicated by black circles. Grey circles indicate the arithmetic mean of 885gene-level FPKM between consecutive measured time points, interpolating FPKM 886at intermediate time points. The total gene expression overloading is 887visualized as a swatch centered around the interpolated expression curve. 888The width of the swatch encodes the amount of expression overloading between 889successive time points. The color of the swatch indicates the relative 890contributions of transcriptional and post-transcriptional expression 891overloading. 892 893Some genes, such as tropomyosin I and II, feature a single primary 894transcript, and so all overloading is by definition post-transcriptional. 895Others, like Fhl3, have two primary transcripts, but only a single isoform 896arises from each, so all overloading is transcriptional. Genes with multiple 897primary transcripts, one or more of which are alternatively spliced, such as 898Myc or RTN4, display both forms. 899 900 901\subsection{The {\tt Cufflinks} software} 902 903The transcript assembly and abundance estimation algorithms are implemented in 904freely available open source software called {\tt Cufflinks} that is available 905from \newline {\tt http://cufflinks.cbcb.umd.edu/} \newline Furthermore 906methods for comparing annotations across time points, and for performing the 907differential expression, promoter usage and splicing tests are implemented in 908the companion programs {\tt Cuffdiff} and {\tt Cuffcompare}. Instructions on 909how to install and run the software are provided on the website. 910 911The input to {\tt Cufflinks} consists of fragment alignments in the SAM format 912\cite{Li2009a}. These may consist of either single fragment alignments, or alignments of mate-pairs (paired-end reads produce better assemblies and more accurate abundance estimates than single reads). {\tt Cufflinks} will assemble the transcripts using the 913algorithm in Section \ref{sec:assembly}, and transcript abundances will be 914estimated using the model in Section \ref{sec:abundances}. Transcript 915coordinates and abundances are reported in the Gene Transfer Format (GTF). 916User supplied annotations may be provided to {\tt Cufflinks} (optional input) in 917which case they form the basis for the transcript abundance estimation. 918 919Some of the algorithms here rely on sufficient depth of sequencing in order to 920produce reliable output. {\tt Cufflinks} determines that depth is sufficient where 921possible to check that required assumptions hold. For example, in loci where one 922or more isoforms have extremely low relative expression, the observed Fisher 923Information Matrix may not be positive definite after rounding errors. In this 924case, it is not possible to produce a reliable variance-covariance matrix for 925isoform fragment abundances. {\tt Cufflinks} will report a numerical 926exception in this and similar cases. When an exception is reported, the 927confidence intervals for the isoforms' abundances will be set from 0 928FPKM to the FPKM for the whole gene. If such an exception is generated during 929a {\tt Cuffdiff} run, no differential analysis involving the problematic 930sample will be performed on that locus. 931\newpage 932\section{Appendix A: Lemmas and Theorems} 933 934The following elementary/classical results are required for our methods and we include them so that the supplement is self-contained. 935 936\begin{lemma} 937\label{lemma:varsum} 938Let $X_1,\ldots,X_n$ be random variables and $a_1,\ldots,a_n$ real numbers with $Y=\sum_{i=1}^na_iX_i$. Then 939\begin{equation} 940Var[Y] = \sum_{i=1}^n a_i^2Var[X_i]+2\sum_{i<j} a_ia_j Cov[X_i,X_j]. 941\end{equation} 942\end{lemma} 943\begin{lemma}[Taylor Series] 944If $X$ and $Y$ are random variables then 945\begin{eqnarray} 946 Var[f(X,Y)] & \approx & \left(\frac{\partial f}{\partial X}(E[X],E[Y])\right)^2Var[X] \nonumber \\ & &+2\frac{\partial f}{\partial X}(E[X],E[Y])\frac{\partial f}{\partial Y}(E[X],E[Y])Cov[X,Y] \nonumber \\ & & +\left( \frac{\partial f}{\partial Y}(E[X],E[Y])\right)^2 Var[Y]. 947\end{eqnarray} 948\end{lemma} 949\begin{cor} 950If $X$ and $Y$ are independent then 951\begin{eqnarray} 952 Var\left[log\left(\frac{X}{Y}\right)\right] & \approx & \frac{V[X]}{E[X]^2} + \frac{V[Y]}{E[Y]^2}. 953\end{eqnarray} 954\end{cor} 955\begin{cor} 956\label{lemma:varproduct} 957If $X$ and $Y$ are independent random variables then 958\begin{equation} 959Var[XY] = Var[X]Var[Y]+E[X]^2Var[Y]+E[Y]^2Var[X]. 960\end{equation} 961\end{cor} 962 963The above result is exact using the 2nd order Taylor expansion (higher 964derivatives vanish). 965 966\begin{lemma}[\cite{Li2009b}] 967\label{lemma:readstoprobs} 968Let $a_1,\ldots,a_n,w_1,\ldots,w_n$ be real numbers satisfying: $w_i \neq 0$ and $0 \leq a_i \leq 1$ for all $i$, $\sum_{i=1}^n a_i = 1$ and $\sum_{i=1}^na_iw_i \neq 0$. Let 969$ b_j = \frac{a_jw_j}{\sum_{i=1}^n a_iw_i}$. Then $a_j = \frac{b_j\frac{1}{w_j}}{\sum_{i=1}^n b_i\frac{1}{w_i}}$. 970\end{lemma} 971{\bf Proof}: \begin{eqnarray} 972b_j & = & \frac{a_jw_j}{\sum_{i=1}^n a_iw_i} \\ 973\Rightarrow \, \sum_{k=1}^n \frac{b_k}{w_k} & = & \sum_{k=1}^n \frac{a_k}{\sum_{i=1}^na_iw_i}\\ 974& = & \frac{1}{\sum_{i=1}^n a_iw_i}\\ 975& = & \frac{b_j}{a_jw_j}\\ 976 \Rightarrow \, a_j & = & \frac{b_j\frac{1}{w_j}}{\sum_{i=1}^n b_i \frac{1}{w_i}}. 977\end{eqnarray} \qed 978\begin{prop}[\cite{ASCB2005}] 979\label{prop:linearmodel} 980Let $f_i(\theta) = \sum_{j=1}^d a_{ij}\theta_j + b_i$ ($1 \leq i \leq 981m$) describe a linear statistical model with $a_{ij} \geq$ for all 982$i,j$. That is, $\sum_{i=1}^m f_i(\theta) = 1$. If $u_i \geq 0$ for 983all $i$ then the log likelihood function 984\begin{equation} 985l(\theta) = \sum_{i=1}^m u_i log(f_i(\theta)) 986\end{equation} 987is concave. 988\end{prop} 989{\bf Proof}: It is easy to see that 990\begin{equation} 991\left( \frac{\partial^2 l}{\partial \theta_j \partial \theta_k} \right) = -A^Tdiag\left( \frac{u_1}{f_1(\theta)^2},\ldots,\frac{u_m}{f_m(\theta)^2}\right) A, 992\end{equation} 993where $A$ is the $m \times d$ matrix whose entry in row $i$ and column 994$j$ equals $a_{ij}$. Therefore the Hessian is a symmetric matrix with 995non-positive eigenvalues, and is therefore negative semi-definite. 996\qed 997 998\begin{defn} 999\label{def:po} 1000A partially ordered set is a set $S$ with a binary relation $\leq$ 1001satisfying: 1002\begin{enumerate} 1003\item $x \leq x$ for all $x \in S$, 1004\item If $x \leq y$ and $y \leq z$ then $x \leq z$, 1005\item If $x \leq y$ and $y \leq x$ then $x=y$. 1006\end{enumerate} 1007A {\em chain} is a set of elements in $C \subseteq S$ such that for every 1008$x,y \in C$ either $x \leq y$ or $y \leq x$. An {\em antichain} is a set of 1009elements that are pairwise incompatible. 1010\end{defn} 1011Partially ordered sets are equivalent to directed acyclic graphs (DAGs). The following min-max theorems relate chain partitions to antichains and are special cases of linear-programming duality. More details and complete proofs can be found in \cite{Lovasz2009}. 1012 1013\begin{thm}[Dilworth's theorem] 1014\label{thm:Dilworth} 1015Let $P$ be a finite partially ordered set. The maximum number of elements in any antichain of $P$ equals the minimum number of chains in any partition of $P$ into chains. 1016\end{thm} 1017\begin{thm}[K\"{o}nig's theorem] 1018\label{thm:konig} 1019In a bipartite graph, the number of edges in a maximum matching equals 1020the number of vertices in a minimum vertex cover. 1021\end{thm} 1022\begin{thm} 1023\label{thm:dilko} 1024Dilworth's theorem is equivalent to K\"{o}nig's theorem. 1025\end{thm} 1026{\bf Proof}: We first show that Dilworth's theorem follows from 1027K\"{o}nig's theorem. Let $P$ be a partially ordered set with $n$ 1028elements. We define a bipartite graph $G=(U,V,E)$ where 1029$U=V=P$, i.e. each partition in the bipartite graph is equally to 1030$P$. Two nodes $u,v$ form an edge $(u,v) \in E$ in the graph $G$ iff 1031$u<v$ in $P$. By K\"{o}nig's theorem there exist both a matching $M$ and a 1032a vertex cover $C$ in $G$ of the same cardinality. Let $T \subset 1033S$ be the set of elements not contained in $C$. Note that $T$ is an 1034antichain in $P$. We now form a partition $W$ of $P$ into chains by declaring $u$ and 1035$v$ to be in the same chain whenever there is an edge $(u,v) \in 1036M$. Since $C$ and $M$ have the same size, it follows that $T$ and $W$ 1037have the same size. 1038 1039To deduce K\"{o}nig's theorem from Dilworth's theorem, we begin with a 1040bipartite graph $G=(U,V,E)$ and form a partial order $P$ on the 1041vertices of $G$ by defining $u<v$ when $u \in U, v \in V$ and $(u,v) 1042\in E$. By Dilworth's theorem, there exists an antichain of $P$ and a 1043partition into chains of the same size. The non-trivial chains in $P$ 1044form a matching in the graph. Similarly, the complement of the 1045vertices corresponding to the anti-chain in $P$ is a vertex cover of 1046$G$ with the same cardinality as the matching. \qed 1047 1048\begin{figure}[!ht] 1049 \includegraphics[scale=0.8]{pdfs/Dilworth_Konig} 1050 1051%\caption[Equivalence of Dilworth's and K\"{o}nig's theorems]{ 1052\end{figure} 1053The equivalence of Dilworth's and K\"{o}nig's theorems is depicted above. The 1054 partially ordered set with 8 elements on the left is partitioned into 3 1055 chains. This is the size of a minimum partition into chains, and 1056 is equal to the maximum size of an antichain (Dilworth's 1057 theorem). The antichain is shown with double circles. On the 1058 right, the reachability graph constructed from the partially 1059 ordered set on the left is shown. The maximum matching 1060 corresponding to the chain partition consists of 5 edges and is equal in size to the 1061 number of vertices in a minimum vertex cover (K\"{o}nig's 1062 theorem). The vertex cover is shown with double circles. Note that 1063 8=3+5. 1064 1065 1066\newpage 1067\bibliographystyle{amsplain} 1068\bibliography{cufflinks} 1069\end{document} 1070 1071% long read length is required for uniquely determining genomic origin of fragments, but its fragment length that is critical for assignment of fragments to transcripts. 1072% The length of a set of transcripts depends on their abundances 1073% single read sequencing prevents the correct normalization for fragment sizes in the length. 1074