1\documentclass[12pt]{amsart}
2
3\usepackage{graphicx}
4\usepackage{amsmath}
5\usepackage[pdftex]{hyperref}
6
7\setlength{\textwidth}{6.2in}
8\setlength{\textheight}{8.4in}
9\setlength{\topmargin}{0.2in}
10\setlength{\oddsidemargin}{0in}
11\setlength{\evensidemargin}{0in}
12\setlength{\headsep}{.3in}
13\footskip 0.3in
14\newtheorem{thm}{Theorem}
15\newtheorem{lemma}[thm]{Lemma}
16\newtheorem{prop}[thm]{Proposition}
17\newtheorem{cor}[thm]{Corollary}
18\theoremstyle{definition}
19\newtheorem{defn}[thm]{Definition}
20\newtheorem{example}[thm]{Example}
21\newtheorem{remark}[thm]{Remark}
22\newtheorem{conj}[thm]{Conjecture}
23
24\newcommand{\cherry}[1]{\ensuremath{\langle #1 \rangle}}
25\newcommand{\ignore}[1]{}
26\newcommand{\half}{\frac{1}{2}}
27\newcommand{\eopf}{\framebox[6.5pt]{} \vspace{0.2in}}
28
29%------------------document begins-----------------------
30
31\begin{document}
32
33\title{Methods of Cufflinks}
34\author{Cole Trapnell \and Lior Pachter}
35%\address{}
36%\email{\{whoweare\}@math.berkeley.edu}
37%\dedicatory{\today}
38
39%\thanks{Supported in part by ....}
40%\subjclass{Primary 55M20; Secondary 47H10}
41
42\maketitle
43\markboth{C Trapnell and L Pachter}{Methods of Cufflinks}
44
45This document describes the methods used in the Cufflinks RNA-Seq analysis program.
46%\pagestyle{plain}
47\pagenumbering{roman} \tableofcontents \listoffigures \listoftables \newpage \pagenumbering{arabic}
48
49\section{Version 0.8.3}
50
51\subsection{Transcript abundance estimation \label{abundances}}
52\label{sec:abundances}
53
54\subsubsection{Definitions}
55
56A {\em transcript} is an RNA molecule that has been transcribed from DNA. A
57{\em primary transcript} is an RNA molecule that has yet to undergo
58modification. The {\em genomic location} of a primary transcript consists of a
59pair of coordinates in the genome representing the $5'$ transcription start
60site and the $3'$ polyadenylation cleavage site. We denote the set of all
61transcripts in a transcriptome by $T$. We partition transcripts into {\em
62transcription loci} (for simplicity we refer to these as loci) so that every
63locus contains a set of transcripts all of whose genomic locations do not
64overlap the genomic location of any transcript in any other locus. Formally,
65we consider a maximal partition of transcripts into loci, a partition denoted
66by $G$, where the genomic location of a transcript $t \in g \in G$ does not
67overlap the genomic location of any transcript $u $ where $u \in h \in G$ and
68$h \neq g$. We emphasize that the definition of a transcription locus is not
69biological; transcripts in the same locus may be regulated via different
70promoters, and may differ completely in sequence (for example if one
71transcript is in the intron of another) or have different functions. The
72reason for defining loci is that we will see that they are computationally
73convenient.
74
75
76We assume that at the time of an experiment, a transcriptome consists of an
77ensemble of transcripts $T$ where the proportion of transcript $t \in T$ is
78$\rho_t$, so that $\sum_{t \in T} \rho_t = 1$ and $0 \leq \rho_t \leq 1$ for
79all $t \in T$. Formally, a {\em transcriptome} is a set of transcripts $T$
80together with the abundances $\rho=\{ \rho_t\}_{t \in T}$. For convenience we also introduce
81notation for the proportion of transcripts in each locus. We let $\sigma_g =
82\sum_{t \in g} \rho_t$. Similarly, within a locus $g$, we denote the
83proportion of each transcript $t \in g$ by $\tau_t = \frac{\rho_t}{\sigma_g}$.
84We refer to $\rho,\sigma$ and $\tau$ as {\em transcript abundances}.
85
86
87Transcripts have lengths, which we denote by $l(t)$. For a collection of transcripts $S \subset T$ in a transcriptome, we define the length of $S$ using the weighted mean:
88\begin{equation}
89\label{eq:effective_length}
90l(S) =\frac{\sum_{t \in S} \rho_tl(t)}{\sum_{t \in S}\rho_t}.
91\end{equation}
92It is important to note that the length of a set of transcripts depends on their relative abundances; the reason for this will be clear later.
93
94One grouping of transcripts that we will focus on is the set of transcripts
95within a locus that share the same transcription start site (TSS). Unlike the
96concept of a locus, grouping by TSS has a biological basis. Transcripts
97within such a group are by definition alternatively spliced, and if they have
98different expression levels, this is most likely due to the spliceosome and
99not due to differences in transcriptional regulation.
100
101\subsubsection{A statistical model for RNA-Seq}
102
103In order to analyze expression levels of transcripts with RNA-Seq data, it is
104necessary to have a model for the (stochastic) process of sequencing. A {\em
105sequencing experiment} consists of selecting a total of $M$ fragments of
106transcripts uniformly at random from the transcriptome. Each fragment is
107identified by sequencing from its ends, resulting in two reads called {\em
108mate pairs}. The length of a fragment is a random variable, with a
109distribution we will denote by $F$. That is, the probability that a fragment
110has length $i$ is $F(i)$ and $\sum_{i=1}^{\infty} F(i) = 1$. In this paper we
111assume that $F$ is normal, however in principle $F$ can be estimated using
112data from the experiment (e.g. spike-in sequences). We decided to use
113the normal approximation to $F$ (allowing for user specified
114parameters of the normal distribution) in order to simplify the
115requirements for running {\tt Cufflinks} at this time.
116
117The assumption of random
118fragment selection is known to oversimplify the complexities of a sequencing
119experiment, however without rigorous ways to normalize we decided to work with
120the uniform at random assumption. It is easy to adapt the model to include
121more complex models that address sequencing bias as RNA-Seq experiments mature
122and the technologies are better understood.
123
124The transcript abundance estimation problem in paired-end RNA-Seq is to
125estimate $\rho$ given a set of transcripts $T$ and a set of reads sequenced
126from the ends of fragments. In {\tt Cufflinks}, the transcripts $T$ can be
127specified by the user, or alternatively $T$ can be estimated directly from the
128reads. The latter problem is the transcript assembly problem which we discuss
129in Section \ref{sec:assembly}. We ran {\tt Cufflinks} in the latter
130``discovery'' mode where we assembled the transcripts without using the
131reference annotation.
132
133The fact that fragments have different lengths has bearing on the calculation
134of the probability of selecting a fragment from a transcript. Consider a
135transcript $t$ with length $l(t)$. The probability of selecting a fragment of
136length $k$ from $t$ at one of the positions in $t$ assuming that it is
137selected uniformly at random, is $\frac{1}{l(t)-k}$. For this reason, we will
138define an adjusted length for transcripts as
139
140\begin{equation}
141\tilde{l}(t) = \sum_{i=1}^{l(t)} F(i)(l(t)-i+1).
142\end{equation}
143We also revisit the definition of length for a group of transcripts, and define
144\begin{equation}
145\tilde{l}(S) =\frac{\sum_{t \in S} \rho_t\tilde{l}(t)}{\sum_{t \in S}\rho_t}.
146\end{equation}
147
148It is important to note that given a read it may not be obvious from which
149transcript the fragment it was sequenced from originated. The consistency of
150fragments with transcripts is important and we define the {\em
151fragment-transcript matrix} $A_{R,T}$ to be the $M \times |T|$ matrix with
152$A(r,t)=1$ if the fragment alignment $r$ is completely contained in the
153genomic interval spanned by $t$, and all the implied introns in $r$ match
154introns in $t$ (in order), and with $A(r,t)=0$ otherwise. Note that the reads
155in Figure 1c in the main text are colored according to the matrix $A_{R,T}$,
156with each column of the matrix corresponding to one of the three colors
157(yellow, blue, red) and reads colored according to the mixture of colors
158corresponding to the transcripts their fragments are contained in.
159
160Even given the read
161alignment to a reference genome, it may not be obvious what the length of the
162fragment was. Formally, in the case that $A_{R,T}(r,t)=1$ we denote by $I_t(r)$ the fragment length from within
163a transcript $t$ implied by the (presumably unique) sequences corresponding to
164the mate pairs of a fragment $r$. If $A_{R,T}(r,t)=0$ then $I_t(r)$ is set to be
165infinite and $F(I_t(r)) = 0$.
166
167Given a set of reads, we assume that we can identify for each of them the set
168of transcripts with which the fragments the reads belonged to are consistent. The
169rationale for this assumption is the following: we map the reads to a
170reference genome, and we assume that the read lengths are sufficiently long so
171that every mate-pair can be uniquely mapped to the genome. We refer to this
172mapping as the {\em fragment alignment}. We also assume that we know all the
173possible transcripts and their alignments to the genome. Therefore, we can
174identify for each read the possible transcripts from which the fragment it
175belonged to originated.
176
177\begin{figure}[!ht]
178    \includegraphics[scale=0.8]{pdfs/implied_length}
179    \caption[Implied length of a fragment alignment]{Alignments of reads to the genome (rectangles) may be consistent with multiple transcripts (in this case both $t_1$ and $t_2$). The transcripts $t_1$ and $t_2$ differ by an internal exon; introns are indicated by long dashed lines. If we denote the fragment alignment by $r$, this means that $A_{R,T}(r,t_1)=1$ and $A_{R,T}(r,t_2)=1$. It is apparent that the implied length $I_{t_1}(r) > I_{t_2}(r)$ due to the presence of the extra internal exon in $t_1$. }
180\end{figure}
181
182We are now ready to write down the likelihood equation for the model. We will
183write $L(\rho|R)$ for the likelihood of a set of fragment alignments $R$
184constructed from $M$ reads. The notation $Pr(trans. = t)$ means ``the
185probability that a fragment selected at random originates from transcript $t$''.
186
187\begin{eqnarray}
188& & L(\rho|R) = \prod_{r \in R} Pr(rd.\ aln. =r)\\
189&  = & \prod_{r \in R} \sum_{t \in T} Pr(rd.\ aln. =r|trans. =t)Pr(trans. =t)\\
190& = & \prod_{r \in R} \sum_{t \in T} \frac{\rho_t\tilde{l}(t)}{\sum_{u \in T}\rho_u\tilde{l}(u)} Pr(rd.\ aln. = r|trans. =t)\\
191& = & \prod_{r \in R} \sum_{t \in T}\frac{\rho_t\tilde{l}(t)}{\sum_{u \in T}\rho_u\tilde{l}(u)} \left(\frac{F(I_t(r))}{l(t)-I_t(r)+1}\right) \\ \label{eq:likerho}
192& = &  \prod_{r \in R} \sum_{t \in T} \alpha_t\left(\frac{F(I_t(r))}{l(t)-I_t(r)+1}\right),
193\end{eqnarray}
194where
195\begin{equation}
196\alpha_t = \frac{\rho_t\tilde{l}(t)}{\sum_{u \in T}\rho_u\tilde{l}(u)}.
197\end{equation}
198
199Observe that $\alpha_t$ is exactly the probability that a fragment selected at
200random comes from transcript $t$, and we have that $\sum_{t \in T}\alpha_t =
2011$. In light of the probabilistic meaning of the $\alpha=\{\alpha_t\}_{t \in T}$, we refer to them as
202{\em fragment abundances}.
203
204It is evident that the likelihood function is that of a linear model
205and  that the likelihood function is
206concave (Proposition \ref{prop:linearmodel}) so a numerical method can be used
207to find the $\alpha$. It is then possible, in principle, to recover the $\rho$
208using Lemma \ref{lemma:readstoprobs}. However the number of parameters is in the tens of
209thousands, and in practice this form of the likelihood function is unwieldy.
210Instead, we re-write the likelihood utilizing the fact that transcripts in
211distinct loci do not overlap in genomic location.
212
213We first calculate the  probability that a fragment originates from a transcript within a given locus $g$:
214\begin{eqnarray}
215\beta_g & := &  \sum_{t \in g} \alpha_t\\
216& = & \frac{\sum_{t \in g} \rho_t \tilde{l}(t)}{\sum_{u \in T} \rho_u \tilde{l}(u)}\\
217& = & \frac{\sum_{t \in g} \sigma_g\tau_t \tilde{l}(t)}{\sum_{h \in G} \sum_{u \in h} \sigma_h\tau_u \tilde{l}(u)}\\
218& = & \frac{\sigma_g\sum_{t \in g} \tau_t \tilde{l}(t)}{\sum_{h \in G} \sigma_h \sum_{u \in h} \tau_u \tilde{l}(u)}\\
219 & = &  \frac{\sigma_g \tilde{l}(g)}{\sum_{h \in G} \sigma_h \tilde{l}(h)}.
220\end{eqnarray}
221Recall that $\sigma_g = \sum_{t \in g} \rho_t$ and that $\tau_t = \frac{\rho_t}{\sigma_g}$ for a locus $g$.
222
223Similarly, the probability of selecting a fragment from a single transcript $t$ conditioned on selecting a transcript from the locus $g$ in which $t$ is contained is
224\begin{equation}
225\label{eq:gammatau}
226\gamma_t = \frac{\tau_t \tilde{l}(t)}{\sum_{u \in g} \tau_u \tilde{l}(u)}.
227\end{equation}
228The parameters $\gamma=\{\gamma_t\}_{t \in g}$ are conditional fragment abundances, and they are the parameters we estimate from the data in the next Section. Note that for a transcript $t \in g$, $\alpha_t = \beta_g \cdot \gamma_t$ and it is easy to convert between fragment abundances and transcript abundances using Lemma \ref{lemma:readstoprobs}.
229
230We denote the fragment counts by $X$; specifically, we denote the number of
231alignments in locus $g$ by $X_g$. Note that $\sum_{g \in G} X_g = M$. We also
232use the notation $g_r$ to denote the (unique) locus from which a read
233alignment $r$ can be obtained.
234
235The likelihood function is given by
236\begin{eqnarray}
237& & L(\rho|R) = \prod_{r \in R} Pr(rd.\ aln. =r)\\
238&  = & \prod_{r \in R} \sum_{g \in G} Pr(rd.\ aln. =r|locus=g)Pr(locus=g)\\
239& = & \prod_{r \in R} \frac{\sigma_{g_{r}}\tilde{l}(g_{r})}{\sum_{g \in G} \sigma_g\tilde{l}(g)} Pr(rd.\ aln. =r|locus=g_r)\\
240& = & \prod_{r \in R}  \beta_{g_r}
241\sum_{t \in g_r}Pr(rd.\ aln. =r|locus=g_{r},trans. = t)Pr(trans. =t|locus=g_r)\\
242& = & \prod_{r \in R} \beta_{g_r}
243\sum_{t \in g_r} \frac{\tau_t\tilde{l}(t)}{\sum_{u \in g_r}\tau_u \tilde{l}(u)} Pr(rd.\ aln. =r|locus=g_{r},trans. = t)\\
244& = & \left( \prod_{r \in R} \beta_{g_r} \right) \left( \prod_{r \in R} \sum_{t \in g} \gamma_t \cdot
245Pr(rd.\ aln. =r|locus=g_r,trans. =t) \right)\\
246& = & \left( \prod_{r \in R}  \beta_{g_r} \right) \left( \prod_{r \in R} \sum_{t \in g}  \gamma_t\cdot  \frac{F(I_t(r))}{l(t)-I_t(r)+1}\right)\\
247& = & \left( \prod_{g \in G}  \beta_g^{X_{g}} \right) \left( \prod_{g \in G} \left( \prod_{r \in R:r \in g} \sum_{t \in g}  \gamma_t \cdot
248\frac{F(I_t(r))}{l(t)-I_t(r)+1}\right) \right). \label{eq:likebest}
249\end{eqnarray}
250
251Explicitly, in terms of the parameters $\rho$, Equation (\ref{eq:likebest})
252simplifies to Equation (\ref{eq:likerho}) but we will see in the next section
253how the maximum likelihood estimates $\hat{\rho}$ are most conveniently
254obtained by first finding $\hat{\beta}$ and $\hat{\gamma}$ using Equation
255(\ref{eq:likebest}).
256
257We note that it is biologically meaningful to include prior distributions on
258$\sigma$ and $\tau$ that reflect the inherent stochasticity and resulting
259variability of transcription in a cell. This will be an interesting direction
260for further research as more RNA-Seq data (with replicates) becomes available
261allowing for the determination of biologically meaningful priors. In
262particular, it seems plausible that specific isoform abundances may vary
263considerably and randomly within cells from a single tissue and that this may
264be important in studying differential splicing. We mention to this to clarify
265that in this paper, the confidence intervals we report represent the
266variability in the maximum likelihood estimates $\hat{\sigma}_j$ and
267$\hat{\tau}^k_j$, and are not the variances of prior distributions.
268
269
270\subsubsection{Estimation of parameters}
271
272We begin with a discussion of identifiability of our
273model. Identifiability refers to the injectivity of the model, i.e.,
274\begin{equation}
275\mbox{if } Pr_{\rho_1}(r) = Pr_{\rho_2}(r) \  \forall r \in R, \ \mbox{ then } \rho_1 = \rho_2.
276\end{equation}
277
278
279The identifiability of RNA-Seq models was discussed in
280\cite{Hiller2009}, where a standard analysis for linear models is
281applied to RNA-Seq (for another related biological example, see \cite{Pe'er2004} which discusses
282identifiability of haplotypes in mixed populations from genotype data).
283The results in these papers apply to our model. For completeness we review the conditions for
284identifiability. Recall that $A_{R,T}$ is the fragment-transcript matrix that specifies which transcripts each fragment is compatible with. The following theorem provides a simple characterization of identifiability:
285
286\begin{thm}
287The RNA-Seq model is identifiable iff $A_{R,T}$ is full rank.
288\end{thm}
289
290Therefore, for a given set of transcripts and a read set $R$, we can
291test whether the model is identifiable using elementary linear
292algebra. For the results in this paper, when estimating expression
293with given annotations, when the model was not identifiable we picked {\em a} maximum likelihood solution,
294although in principle it is possible to bound the total expression of
295the locus and/or report identifiability problems to the user.
296
297Returning to the likelihood function
298\begin{equation}
299\left( \prod_{g \in G}  \beta_g^{X_{g}} \right) \left( \prod_{g \in G} \left( \prod_{r \in R:r \in g} \sum_{t \in g}  \gamma_t \cdot
300\frac{F(I_t(r))}{l(t)-I_t(r)+1}\right) \right),
301\end{equation}
302we note that both the $\beta$ and $\gamma$ parameters depend on the $\rho$ parameters. However, we will see that if we maximize the $\beta$ separately from the $\gamma$, and also each of the sets $\{\gamma_t:t \in g\}$ separately, then it is always possible to find $\rho$ that match both the maximal $\beta$ and $\gamma$. In other words,
303the problem of finding $\hat{\rho}$ is equivalent
304to finding $\hat{\beta}$ that maximizes
305$ \prod_{g \in G} \beta_g^{X_g}$
306and separately, for each locus $g$, the $\hat{\gamma}_t$ that maximize
307\begin{equation}
308\prod_{r \in R:r \in g} \sum_{t \in g}  \gamma_t
309\frac{F(I_t(r))}{l(t)-I_t(r)+1}.
310\end{equation}
311
312We begin by solving for the $\hat{\beta}$ and $\hat{\gamma}$ and the variances
313of the maximum likelihood estimates, and then explain how these are used to
314report expression levels.
315
316We can solve for the $\hat{\gamma}$ using the fact that the model is linear.
317That is, the probability of each individual read is linear in the read
318abundances $\gamma_t$. It is a standard result in statistics (see, e.g.,
319Proposition 1.4 in \cite{ASCB2005}) that the log likelihood function of a
320linear model is concave. Thus, a hill climbing method can be used to find the
321$\hat{\gamma}$. We used the EM algorithm for this purpose.
322
323Rather than using the direct ML estimates, we obtained a regularized
324estimate by importance sampling from the posterior distribution with a
325proposal distribution we explain below. The samples were also used to estimate
326variances for our estimates.
327
328It follows from standard MLE asymptotic
329theory that the $\hat{\gamma}$ are asymptotically multivariate normal with
330variance-covariance matrix given by the inverse of the observed Fisher
331information matrix. This matrix is defined as follows:
332
333\begin{defn}[Observed Fisher information matrix]
334The observed Fisher information matrix is the negative of the Hessian of the log likelihood function evaluated at the maximum likelihood estimate. That is, for parameters $\Theta=(\theta_1,\ldots,\theta_n)$, the $n \times n$ matrix is
335\begin{eqnarray}
336\mathcal{F}_{k,l}(\hat{\Theta}) & = & - \frac{\partial^2 log(\mathcal{L}(\Theta|R))}{\partial \theta_k \theta_l} |_{\theta=\hat{\theta}}.
337\end{eqnarray}
338\end{defn}
339In our case, considering a single locus $g$, the parameters are $\Theta = (\gamma_{t_1},\ldots,\gamma_{t_{|g|}})$, and as expected from Proposition \ref{prop:linearmodel}:
340\begin{eqnarray}
341\label{eq:Fisher}
342\mathcal{F}_{t_k,t_l}(\hat{\Theta}) & = & \sum_{r \in R:r \in g}
343\left[ \frac{1}{\left( \sum_{h \in g}  \hat{\gamma}_h \frac{F(I_h(r))}{l(h) - I_h(r)+1} \right)^2} \frac{F(I_{t_k}(r)) F(I_{t_l}(r)) }{(l(t_k)-I_{t_k}+1)(l(t_l)-I_{t_l}+1)} \right].
344\end{eqnarray}
345
346
347Because some of the transcript abundances may be close to zero, we
348adopted the
349Bayesian approach of \cite{Jiang2009} and instead sampled from the
350joint posterior distribution of $\Theta$ using the proposal
351distribution consisting of the multivariate normal with mean
352given by the MLE, and variance-covariance matrix given by the inverse of
353(\ref{eq:Fisher}). If the Observed Fisher Information Matrix is
354singular then the user is warned and the confidence intervals of all
355transcripts are set to $[0,1]$ (meaning that there is no information
356about relative abundances).
357
358The method used for sampling was importance sampling. The samples were used to obtain a maximum-a-posterior estimate for
359$\hat{\gamma}_t$ for each $t$ and for the variance-covariance matrix which we
360denote by $\Psi^g$ (where $g \in G$ denotes the locus). Note that $\Psi^g$ is
361a $|g| \times |g|$ matrix. The covariance between $\hat{\gamma}_{t_k}$ and
362$\hat{\gamma}_{t_l}$ for $t_k,t_l \in g$ is given by $\Psi^g_{t_k,t_l}$.
363
364Turning to the maximum likelihood estimates $\hat{\beta}$, we use the fact that the model is the log-linear. Therefore,
365\begin{equation}
366\label{eq:sigmahat}
367\hat{\beta_g} = \frac{X_{g}}{M}.
368\end{equation}
369
370Viewed as a random variable, the counts $X_{g}$ are approximately Poisson and therefore the variance of the MLE $\hat{\beta}_g$ is approximately $X_{g}$. We note that for the tests in this paper we directly used the total counts $M$ and the proportional counts $X_g$, however it is easy to incorporate recent suggestions for total count normalization, such as \cite{Bullard2010} into {\tt Cufflinks}.
371
372The favored units for reporting expression in RNA-Seq studies to date is not
373using the transcript abundances directly, but rather using a measure
374abbreviated as FPKM, which means ``expected number of fragments per kilobase
375of transcript sequence per millions base pairs sequenced''. These units are
376equivalent to measuring transcript abundances (multiplied by a scalar). The
377computational advantage of FPKM, is that the normalization constants
378conveniently simplify some of the formulas for the variances of transcript
379abundance estimates.
380
381For example, the abundance of a transcript $t \in g$ in FPKM units is
382\begin{equation}
383\label{eq:FPKM1}
384 \frac{10^6 \cdot 10^3 \cdot \alpha_t}{\tilde{l}(t)} =  \frac{10^6 \cdot 10^3 \cdot \beta_g \cdot \gamma_t}{\tilde{l}(t)}.
385\end{equation}
386
387Equation (\ref{eq:FPKM1}) makes it clear that although the abundance of each
388transcript $t \in g$ in FPKM units is proportional to the transcript abundance
389$\rho_t$ it is given in terms of the read abundances $\beta_g$ and $\gamma_t$
390which are the parameters estimated from the likelihood function.
391
392The maximum likelihood estimates of $\beta_g$ and $\gamma_t$ are random
393variables, and we denote their scaled product (in FPKM units) by $A_t$. That
394is $Pr(A_t = a)$ is the probability that for a random set of fragment alignments
395from a sequencing experiment, the maximum likelihood estimate of the
396transcript abundance for $t$ in FPKM units is $a$.
397
398Using the fact that the expectation of a product of independent random
399variables is the product of the expectations, for a transcript $t \in g$ we
400have
401\begin{equation}
402E[A_t] = \frac{10^9X_{g}\hat{\gamma}_t}{\tilde{l}(t)M}.
403\end{equation}
404
405Given the variance estimates for the $\hat{\gamma}_t$ we turn to the problem of estimating $Var[A_t]$ for a transcript $t \in g$. We use Lemma \ref{lemma:varproduct} to obtain
406\begin{eqnarray}
407Var[A_t]& =  & \left(\frac{10^9}{\tilde{l}(t)M}\right)^2 \left( \Psi^g_{t,t}X_{g} + \Psi^g_{t,t}X^2_{g} + (\hat{\gamma}_t)^2 X_{g} \right)\\
408& = & X_{g}\left(\frac{10^9}{\tilde{l}(t)M)}\right)^2  \left( \Psi^g_{t,t}(1+X_{g}) + (\hat{\gamma}_t)^2\right).
409\end{eqnarray}
410
411This variance calculation can be used to estimate a confidence interval by
412utilizing the fact \cite{Aroian1978} that when the expectation divided by the
413standard deviation of at least one of two random variables is large, their
414product is approximately normal.
415
416Next we turn to the problem of estimating expression levels (and variances of
417these estimates) for groups of transcripts. Let $S \subset T$ be a group of
418transcripts located in a single locus $g$, e.g. a collection of transcripts
419sharing a common TSS.
420
421The analogy of Equation (\ref{eq:FPKM1}) for the FPKM of the group is
422\begin{eqnarray}
423\label{eq:grouphard}
424& & \frac{10^6 \cdot 10^3 \cdot \beta_g \cdot \left( \sum_{t \in S} \gamma_t\right)}{\tilde{l}(S)}\\
425& = & 10^6 \cdot 10^3 \cdot \beta_g \cdot \sum_{t \in S} \frac{\gamma_t}{\tilde{l}(t)}. \label{eq:groupeasy}
426\end{eqnarray}
427
428As before, we denote by $B_S$ the random variables for which $Pr(B_S = b)$ is
429the probability that for a random set of fragment alignments from a sequencing
430experiment, the maximum likelihood estimate of the transcript abundance for
431all the transcripts in $S$ in FPKM units is $b$. We note that the $B_S$ are
432products and sums of random variables (Equation (\ref{eq:groupeasy})). This
433makes Equation (\ref{eq:groupeasy})  more useful than the
434equivalent unsimplified Equation (\ref{eq:grouphard}), especially because
435$\tilde{l}(S)$ is, in general, a ratio of two random variables.
436
437We again use the fact that the expectation of independent random variables is
438the product of the expectation, in addition to the fact that expectation is a
439linear operator to conclude that for a group of transcripts $S$,
440
441\begin{equation}
442\label{eq:expectTSS}
443E[B_S] = \frac{10^9 \cdot X_g \cdot \sum_{t \in S} \frac{\hat{\gamma}_t}{\tilde{l}(t)}}{M}.
444\end{equation}
445In order to compute the variance of $B_S$, we first note that
446\begin{equation}
447Var\left[\sum_{t \in S} \frac{\hat{\gamma}_t}{\tilde{l}(t)}\right] = \sum_{t \in S}\frac{1}{\tilde{l}(t)^2}\Psi^g_{t,t} + \sum_{t,u \in S} \frac{1}{\tilde{l}(t)\tilde{l}(u)} \Psi^g_{t,u}.
448\end{equation}
449Therefore,
450\begin{eqnarray}
451& & \quad Var[B_S] \quad = \quad \nonumber \\
452& &  X_g\left(\frac{10^9}{M}\right)^2\left( \left(1+X_g\right) \left(\sum_{t \in S}\frac{1}{\tilde{l}(t)^2}\Psi^g_{t,t} + \sum_{t,u \in S} \frac{1}{\tilde{l}(t)\tilde{l}(u)} \Psi^g_{t,u}\right) + \left( \sum_{t \in S} \frac{\hat{\gamma}_t}{\tilde{l}(t)} \right)^2 \right). \label{eq:varTSS}
453\end{eqnarray}
454
455We can again estimate a confidence interval by utilizing the fact that $B_S$
456is approximately normal \cite{Aroian1978}.
457
458\subsection{Transcript assembly}
459\label{sec:assembly}
460\subsubsection{Overview}
461
462{\tt Cufflinks} takes as input alignments of RNA-Seq fragments to a reference
463genome and, in the absence of an (optional) user provided annotation, initially assembles transcripts from the alignments. Transcripts in
464each of the loci are assembled independently. The assembly algorithm
465is designed to aim for the following:
466\begin{enumerate}
467\item Every fragment is consistent with at least one assembled
468  transcript.
469\item Every transcript is tiled by reads.
470\item The number of transcripts is the smallest required to satisfy requirement (1).
471\item The resulting RNA-Seq models (in the sense of Section \ref{sec:abundances}) are identifiable.
472\end{enumerate}
473In other words, we seek an assembly that parsimoniously “explains” the fragments from the RNA-Seq experiment; every
474fragment in the experiment (except those filtered out during a preliminary
475error-control step) should have come from a {\tt Cufflinks} transcript, and
476{\tt Cufflinks} should produce as few transcripts as possible with that property.
477Thus, {\tt Cufflinks} seeks to optimize the criterion suggested in \cite{Xing2004},
478however, unlike the method in that paper, {\tt Cufflinks} leverages Dilworth's
479Theorem \cite{Dilworth1950} to solve the problem by reducing it
480to a matching problem via the equivalence of Dilworth's and K\"{o}nig's theorems (Theorem \ref{thm:dilko} in Appendix A). Our approach to isoform reconstruction is inspired by a similar approach used for haplotype reconstruction from HIV quasi-species \cite{Eriksson2008}.
481
482\subsection{A partial order on fragment alignments}
483
484The {\tt Cufflinks} program loads a set of alignments in SAM format sorted by reference
485position and assembles non-overlapping sets of alignments independently. After
486filtering out any erroneous spliced alignments or reads from incompletely
487spliced RNAs, {\tt Cufflinks} constructs a partial order (Definition \ref{def:po}), or equivalently a directed acyclic graph (DAG), with
488one node for each fragment that in turn consists of an aligned pair of mated reads. First, we note that fragment alignments are of two types: those where reads align in their entirety to the genome, and reads which have a split alignment (due to an implied intron).
489
490In the case of single reads, the partial order can be simply
491constructed by checking the reads for {\em compatibility}. Two reads
492are {\em compatible} if their overlap contains the exact same implied
493introns (or none). If two reads are not compatible they are {\em
494  incompatible}. The reads can be partially ordered by defining, for
495two reads $x,y$, that $x\leq y$ if the starting coordinate of $x$ is
496at or before the starting coordinate of $y$, and if they are
497compatible.
498
499In the case of paired-end RNA-Seq the situation is more complicated
500because the unknown sequence between mate pairs. To understand this,
501we first note that pairs of fragments can still be determined to be
502incompatible if they cannot have originated from the same
503transcript. As with single reads, this happens when there is
504disagreement on implied introns in the overlap. However compatibility
505is more subtle. We would like to define a pair of fragments $x,y$ to
506be compatible if they do not overlap, or if every implied intron in one fragment overlaps an identical implied intron in the other fragment.
507
508However it is important to note that it may be impossible to determine
509the compatibility (as defined above) or incompatibility of a pair of fragments. For example, an unknown region internal to a fragment may overlap two different introns (that are incompatible with each other). The fragment may be compatible with one of the introns (and the fragment from which it originates) in which case it is incompatible with the other. Since the opposite situation is also feasible, compatibility (or incompatibility) cannot be assigned. Fragments for which the compatibility/incompatibility cannot be determined with respect to every other fragment are called {\em uncertain}. Finally, two fragments are called {\em nested} if one is contained within the other.
510
511
512\begin{figure}[h]
513    \includegraphics[scale=0.5]{pdfs/compatibility.pdf}
514    \caption[Compatibility and incompatibility of
515    fragments]{Compatibility and incompatibility of
516      fragments. End-reads are solid lines, unknown sequences within
517      fragments are shown by dotted lines and implied introns are
518      dashed lines. The reads in (a) are compatible, whereas the
519      fragments in (b) are incompatible. The fragments in (c) are
520      nested. Fragment $x_4$ in (d) is uncertain, because $y_4$ and
521      $y_5$ are incompatible with each other.}
522\end{figure}
523
524Before constructing a partial order, fragments are extended to include
525their nested fragments and uncertain fragments are discarded. These discarded fragments
526are used in the abundance estimation. In theory, this may result in
527suboptimal (i.e. non-minimal assemblies) but we determined empirically
528that after assembly uncertain fragments are almost always consistent
529with one of the transcripts. When they are not, there was no
530completely tiled transcript that contained them. Thus, we employ a
531heuristic that substantially speeds up the program, and that works
532in practice.
533
534A partial order $P$ is then constructed from the remaining fragments
535by declaring that $x \leq y$ whenever the fragment corresponding to
536$x$ begins at, or before, the location of the fragment corresponding
537to $y$ and $x$ and $y$ are compatible. In what follows we identify $P$
538with its Hasse diagram (or covering relation), equivalently a directed
539acyclic graph (DAG) that is the transitive reduction.
540
541\begin{prop}
542$P$ is a partial order.
543\end{prop}
544{\bf Proof}: The fragments can be totally ordered according to the locations where they begin. It therefore suffices to check that if $x,y,z$ are fragments with $x$ compatible with $y$ and $y$ compatible with $z$ then $x$ is compatible with $z$. Since $x$ is not uncertain, it must be either compatible or incompatible with $z$. The latter case can occur only if $x$ and/or $z$ contain implied introns that overlap and are not identical. Since $y$ is not nested within $z$ and $x$ is not nested within $y$, it must be that $y$ contains an implied intron that is not identical with an implied intron in either $x$ or $z$. Therefore $y$ cannot be compatible with both $x$ and $z$. \qed
545
546\subsection{Assembling a parsimonious set of transcripts}
547
548In order to assemble a set of transcripts, {\tt Cufflinks} finds a
549(minimum) partition of $P$ into chains (see Definition
550\ref{def:po}). A partition of $P$ into chains yields an assembly
551because every chain is a totally ordered set of compatible fragments
552$x_1,\ldots,x_l$ and therefore there is a set of overlapping fragments
553that connects them. By Dilworth's theorem (Theorem \ref{thm:Dilworth}), the problem of finding a minimum partition $P$ into chains is equivalent to finding a maximum antichain in $P$ (an antichain is a set of mutually incompatible fragments). Subsequently, by Theorem \ref{thm:dilko}, the problem of finding a maximum antichain in $P$ can be reduced to finding a maximum matching in a certain bipartite graph that emerges naturally in deducing Dilworth's theorem from K\"{o}nig's theorem \ref{thm:konig}.
554We call the key bipartite graph the ``reachability'' graph. It is the transitive closure of the DAG, i.e. it is the graph
555where each fragment $x$ has nodes $L_x$ and $R_x$ in the left and
556right partitions of the reachability graph respectively, and where there is an edge between $L_x$ and $R_y$ when $x \leq y$ in $P$. The maximum matching problem is a classic problem that admits a polynomial time algorithm. The Hopcroft-Karp algorithm \cite{Hopcroft1973} has a run time of $O(\sqrt{V}E)$ where in our case $V$ is the number of fragments and $E$ depends on the extent of overlap, but is bounded by a constant times the coverage depth. We note that our parsimony approach to assembly therefore has a better complexity than the $O(V^3)$ PASA algorithm \cite{Haas2003}.
557
558The minimum cardinality chain decomposition computed using the approach above may not be unique.
559For example, a locus may contain two putative distinct initial exons (defined by overlapping incompatible fragments), and one
560of two distinct terminal and a constitutive exon in between that is longer than any
561read or insert in the RNA-Seq experiment. In such a case, the parsimonious assembly will consist of two transcripts, but there are four possible solutions that are all minimal. In order to ``phase'' distant exons, we leverage the fact that abundance inhomogeneities can link distant exons via their coverage. We therefore weight the edges of the bipartite reachability graph based on the percent-spliced-in metric introduced by
562Wang \emph{et al.} in \cite{Wang2008}. In our setting, the percent-spliced-in
563$\psi_x$ for an alignment $x$ is computed by counting the alignments
564overlapping $x$ in the genome that are compatible with $x$ and dividing by the
565total number of alignments that overlap $x$, and normalizing for the length of
566the $x$. The cost $C(y,z)$ assigned to an edge between alignments $y$ and $z$ reflects
567the belief that they originate from different transcripts:
568
569
570\begin{equation}
571    C(y,z) = -\log(1 - |\psi_y - \psi_z|).
572    \label{eq:match_cost}
573\end{equation}
574
575Rather than using the Hopcroft-Karp algorithm,  a modified version of the {\tt
576LEMON} \cite{LEMON} and {\tt Boost} \cite{Boost} graph libraries are used to
577compute a {\em min-cost} maximum cardinality matching on the bipartite compatibility
578graph. Even with the presence of weighted edges, our algorithm is
579very fast. The best known algorithm for weighted matching is $O(V^2logV+VE)$.
580
581Because we isolated total RNA, we
582expected that a small fraction of our reads would come from the intronic
583regions of incompletely processed primary transcripts. Moreover, transcribed
584repetitive elements and low-complexity sequence result in ``shadow''
585transfrags that we wished to discard as artifacts. Thus, {\tt Cufflinks}
586heuristically identifies artifact transfrags and suppresses them in its
587output. We also filter extremely low-abundance minor isoforms of alternatively
588spliced genes, using the model described in Section \ref {abundances} as a
589means of reducing the variance of estimates for more abundant transcripts. A
590transcript $x$ meeting any of the following criteria is suppressed:
591
592\begin{enumerate}
593    \item $x$ aligns to the genome entirely within an intronic region of the alignment for a transcript $y$, and the abundance of $x$ is less than 15\% of $y$'s abundance.
594    \item $x$ is supported by only a single fragment alignment to the genome.
595    \item More than 75\% of the fragment alignments supporting $x$, are mappable to multiple genomic loci.
596    \item $x$ is an isoform of an alternatively spliced gene, and has an estimated abundance less than 5\% of the major isoform of the gene.
597\end{enumerate}
598
599Prior to transcript assembly, {\tt Cufflinks} also filters out some of the alignments
600for fragments that are likely to originate from incompletely spliced nuclear
601RNA, as these can reduce the accuracy abundance estimates for fully spliced
602mRNAs. These filters and the output filters above are detailed in the source
603file \verb!filters.cpp! of the source code for {\tt Cufflinks}.
604
605In the overview of this Section, we mentioned that our assembly algorithm has the property that the resulting models are identifiable. This is a convenient property that emerges naturally from the parsimony criterion for a ``minimal explanation'' of the fragment alignments. Formally, it is a corollary of Dilworth's theorem:
606
607\begin{prop}
608The assembly produced by the {\tt Cufflinks} algorithm always results in an
609identifiable RNA-Seq model.
610\end{prop}
611{\bf Proof}:  By Dilworth's theorem, the minimum chain decomposition (assembly)
612we obtain has the same size as the maximum antichain in the partially
613ordered set  we
614construct from the reads. An antichain consists of reads that are
615pairwise incompatible, and therefore those reads must form a
616permutation sub-matrix in the fragment-transcript matrix $A_{R,T}$ with columns corresponding to the transcripts in a locus, and with rows corresponding to the fragments in the antichain. The matrix $A_{R,T}$ therefore contains permutation sub-matrices that together span all the columns, and the matrix is full-rank.
617
618\subsection{Assessment of assembly quality}
619
620To compare {\tt Cufflinks} transfrags against annotated transcriptomes, and
621also to find transfrags common to multiple assemblies, we developed a tool
622called {\tt Cuffcompare} that builds structural equivalence classes of
623transcripts. We ran {\tt Cuffcompare} on each the assembly from each time point against the
624combined annotated transcriptomes of the {UCSC known genes}, {\tt Ensembl},
625and {\tt Vega}. Because of the stochastic nature of sequencing, \emph{ab
626initio} assembly of the same transcript in two different samples may result in
627transfrags of slightly different lengths. A {\tt Cufflinks} transfrag was
628considered a complete match when there was a transcript with an identical
629chain of introns in the combined annotation.
630
631When no complete match is found between a {\tt Cufflinks} transfrag and the
632transcripts in the combined annotation, {\tt Cuffcompare} determines and reports if
633another substantial relationship exists with any of the annotation
634transcripts that can be found in or around the same genomic locus. For
635example, when all the introns of a transfrag match perfectly a part of the
636intron chain (sub-chain) of an annotation transcript, a ``containment''
637relationship is reported. For single-exon transfrags, containment is also
638reported when the exon appears fully overlapped by any of the exons of an
639annotation transcript. If there is no perfect match for the intron chain of a
640transfrag but only some exons overlap and there is at least one intron-exon
641junction match, {\tt Cuffcompare} classifies the transfrag as a putative ``novel''
642isoform of an annotated gene. When a transfrag is unspliced (single-exon) and
643it overlaps the intronic genomic space of a reference annotation transcript,
644the transfrag is classified as potential pre-mRNA fragment. Finally, when no
645other relationship is found between a {\tt Cufflinks} transfrag and an
646annotation transcript, {\tt Cuffcompare} can check the repeat content of the
647transfrag's genomic region (assuming the soft-masked genomic sequence was also
648provided) and it would classify the transfrag as ``repeat'' if most of its
649bases are found to be repeat-masked.
650
651When provided multiple time point assemblies, {\tt Cuffcompare} matches transcripts
652
653\subsubsection{Testing for changes in absolute expression}
654
655Between any two consecutive time points, we tested whether a transcript was
656significantly (after FDR control \cite{Benjamini1995}) up or down regulated with respect to the
657null hypothesis of no change, with variability in expression due solely to the
658uncertainties resulting from our abundance estimation procedure. This was done
659using the following testing procedure for absolute differential expression:
660
661We employed the standard method used in microarray-based expression analysis
662and proposed for RNA-Seq in \cite{Bullard2010}, which is to compute the
663logarithm of the ratio of intensities (in our case FPKM), and to then use the
664delta method to estimate the variance of the log odds. We describe this for
665testing differential expression of individual transcripts and also groups of
666transcripts (e.g. grouped by TSS).
667
668We recall that the MLE FPKM for a transcript $t$ in a locus $g$ is given by
669\begin{equation}
670\frac{10^9X_g\hat{\gamma}_t}{\tilde{l}(t)M}.
671\end{equation}
672Given two different experiments resulting in $X^a_g,M^a$ and $X^b_g,M^b$ respectively, as well as $\hat{\gamma}^a_t$ and $\hat{\gamma}^b_t$, we would like to test the significance of departures from unity of the ratio of MLE FPKMS, i.e.
673\begin{eqnarray}
674& & \left(\frac{10^9X^a_g\hat{\gamma}^a_t}{\tilde{l}(t)M^a}\right) / \left(\frac{10^9X^b_g\hat{\gamma}^b_t}{\tilde{l}(t)M^b}\right)\\
675& = & \frac{X_g^a\hat{\gamma}^a_tM^b}{X_g^b \hat{\gamma}^b_tM^a}.
676\end{eqnarray}
677
678This can be turned into a test statistic that is approximately normal by
679taking the logarithm, and normalizing by the variance. We recall that using
680the delta method, if $X$ is a random variable then $Var[log(X)] \approx
681\frac{Var[X]}{E[X]^2}$.
682
683Therefore, our test statistic is
684\begin{equation}
685\frac{log(X^a_g)+log(\hat{\gamma}^a_t)+log(M^b)-log(X^b_g)-log(\hat{\gamma}^b_t)-log(M^a)}
686{ \sqrt{\frac{\left( \Psi^{g,a}_{t,t}(1+X^a_{g}) + (\hat{\gamma}^a_t)^2\right)}{X^a_g \left(\hat{\gamma}^a_t \right)^2}+
687\frac{\left( \Psi^{g,b}_{t,t}(1+X^b_{g}) + (\hat{\gamma}^b_t)^2\right)}{X^b_{g}\left(  \hat{\gamma}^b_t\right)^2}    }  }.
688\end{equation}
689
690In order to test for differential expression of a group of transcripts, we
691replace the numerator and denominator above by those from Equations
692(\ref{eq:expectTSS}) and (\ref{eq:varTSS}).
693
694It is has been noted that the power of differential expression tests in RNA-Seq depend on the length of the transcripts being tested, because longer transcripts accumulate more reads \cite{Oshlack2009}. This means that the results we report are biased towards discovering longer differentially expressed transcripts and genes.
695
696\subsection{Quantifying transcriptional and post-transcriptional overloading}
697
698In order to infer the extent of differential promoter usage, we
699decided to measure changes in relative abundances of primary
700transcripts of single genes. Similarly, we investigated changes in
701relative abundances of transcripts grouped by TSS in order to infer
702differential splicing.  These inferences required two ingredients:
703\begin{enumerate}
704\item A metric on probability distributions (derived from relative
705  abundances).
706\item A test statistic for assessing significant changes in
707  differential promoter usage and splicing as measured using the
708  metric referred to above.
709\end{enumerate}
710
711In order to address the first requirement, namely a metric on
712probability distributions, we turned to an entropy-based metric. This
713was motivated by the methods in \cite{Ritchie2008} where tests for
714differences in relative isoform abundances were performed to
715distinguish cancer cells from normal cells. We extend
716this approach to be able to test for relative isoform abundance changes among multiple
717experiments in RNA-Seq.
718
719\begin{defn}[Entropy]
720The entropy of a discrete probability distribution
721$p=(p_1,\ldots,p_n)$ ($0 \leq p_i \leq 1$ and $\sum_{i=1}^n p_i = 1$) is
722\begin{equation}
723H(p) = -\sum_{i=1}^n p_i log p_i.
724\end{equation}
725If $p_i=0$ for some $i$ the value of $p_i log p_i$ is taken to be $0$.
726\end{defn}
727\begin{defn}[The Jensen-Shannon divergence]
728The Jensen-Shannon divergence of $m$ discrete probability distributions $p^1,\ldots,p^m$ is defined to be:
729\begin{equation}
730JS(p^1,\ldots,p^m) = H\left(\frac{p^1 + \cdots + p^m}{m}\right) - \frac{\sum_{j=1}^m H(p^j)}{m}.
731\end{equation}
732
733In other words, the Jensen-Shannon divergence of a set of probability
734distributions is the entropy of their average minus the average of their
735entropies. \end{defn}
736
737In the case where $m=2$, we remark that the Jensen-Shannon divergence can also
738be described in terms of the Kullback-Leibler divergence of two discrete
739probability distributions. If we denote Kullback-Leibler divergence by
740
741\begin{equation}
742D(p^1\|p^2) = \sum_i p^1_i log \frac{p^1_i}{p^2_i},
743\end{equation}
744then
745\begin{equation}
746JS(p^1,p^2) = \frac{1}{2}D(p^1\|m)+\frac{1}{2}D(p^2\|m)
747\end{equation}
748where $m=\frac{1}{2}(p^1+p^2)$. In other words the Jensen-Shannon
749divergence is a symmetrized variant of the Kullback-Leibler divergence.
750
751The Jensen-Shannon divergence has a number of useful properties: for
752example it is symmetric and non-negative. However it is {\em not} a
753metric. The following theorem shows how to construct a metric
754from the Jensen-Shannon divergence:
755
756\begin{thm}[Fuglede and Tops{\o}e 2004 \cite{Fuglede2004}]
757The square root of the Jensen-Shannon divergence is a metric.
758\end{thm}
759
760The proof of this result is based on a harmonic analysis argument that is the basis for
761the remark in the main paper that ``transcript abundances move in time along a
762logarithmic spiral in Hilbert space''.  We therefore call the square root of the Jensen-Shannon divergence the {\em Jensen-Shannon metric}. We employed this metric in
763order to quantify relative changes in expression in (groups of) transcripts.
764
765In order to test for significance, we introduce a bit of notation.
766Suppose that $S$ is a collection of transcripts (for example, they may
767 share a common TSS). We define
768\begin{equation}
769\kappa_t = \frac{ \frac{\gamma_t}{\tilde{l}(t)}}{\sum_{u \in S} \frac{\gamma_u}{\tilde{l}(u)}}
770\end{equation}
771to be the proportion of transcript $t$ among all the transcripts in a
772group $S$. We let
773$Z=\sum_{u \in S} \hat{\gamma}_u / \tilde{l}(u)$ so that $\hat{\kappa}_t = \frac{\gamma_t}{\tilde{l}(t)Z}$. We therefore have that
774\begin{eqnarray}
775\label{eq:variancekappa1}
776Var[\hat{\kappa}_t] & =  &\frac{Var[\hat{\gamma}_t]}{\tilde{l}(t)^2Z^2}, \\
777Cov[\hat{\kappa}_t,\hat{\kappa}_u] & =  & \frac{Cov[\hat{\gamma}_t,\hat{\gamma}_u]}{\tilde{l}(t)\tilde{l}(u)Z^2}.\label{eq:variancekappa2}
778\end{eqnarray}
779
780Our test statistic for divergent relative expression was the
781Jensen-Shannon metric. The test could be applied to multiple time
782points simultaneously, but we focused on pairwise tests (involving
783consecutive time points). Under the null hypothesis of no change in
784relative expression, the Jensen-Shannon metric should be
785zero. We tested for this using a one-sided $t$-test, based on an
786asymptotic derivation of the distribution of the Jensen-Shannon metric
787under the null hypothesis. This asymptotic distribution is normal by applying the delta method approximation,
788which involves computing the linear component of the Taylor expansion of the
789variance of $\sqrt{JS}$.
790
791In order to simplify notation, we let $f(p^1,\ldots,p^m)$ be the
792Jensen-Shannon metric for $m$ probability distributions $p^1,\ldots,p^m$.
793
794\begin{lemma}
795The partial derivatives of the Jensen-Shannon metric are give by
796\begin{equation}
797\frac{\partial f}{\partial p^k_l} = \frac{1}{2m\sqrt{f(p^1,\ldots,p^m)}} log \left( \frac{p^k_l}{\frac{1}{m} \sum_{j=1}^m p_l^j} \right).
798\end{equation}
799\end{lemma}
800
801Let $\hat{\kappa}^1,\ldots,\hat{\kappa}^m$ denote $m$ probability
802distributions on the set of transcripts $S$, for example the MLE for the transcript abundances in a time course. Then from the delta method we
803have that $\sqrt{JS(\hat{\kappa}^1,\ldots,\hat{\kappa}^m)}$ is approximately normally
804distributed with variance given by
805
806\begin{equation}
807Var[\sqrt{JS(\hat{\kappa}^1,\ldots,\hat{\kappa}^m)}] \approx (\bigtriangledown f)^T \Sigma (\bigtriangledown f),
808\end{equation}
809
810where $\Sigma$ is the variance-covariance matrix for the
811$\kappa^1,\ldots,\kappa^m$, i.e., it is a block diagonal matrix where the
812$i$th block is the variance-covariance matrix for the $\kappa^i_t$ given by
813Equations (\ref{eq:variancekappa1},\ref{eq:variancekappa2}).
814
815
816There are two biologically meaningful groupings of transcripts whose
817relative abundances are interesting to track in a
818time course. Transcripts that share a TSS are likely to be regulated by
819the same promoter, and therefore tracking the change in relative
820abundances of groups of transcripts sharing a TSS may reveal how
821transcriptional regulation is affecting expression over
822time. Similarly, transcripts that share a TSS and exhibit changes in
823expression relative to each other are likely to be affected by
824splicing or other post-transcriptional regulation. We therefore
825grouped transcripts by TSS and compared relative abundance changes
826within and between groups.
827
828We define ``overloading'' to be a significant change in relative
829abundances for a set of transcripts (as determined by the Jensen-Shannon metric, see below). The term is intended to generalize the simple notion of
830``isoform switching'' that is well-defined in the case of two
831transcripts, to multiple transcripts. It is complementary to absolute
832differential changes in expression: the overall expression of a gene
833may remain constant while individual transcripts change drastically in
834relative abundances resulting in overloading. The term is borrowed from
835computer science, where in some statically-typed programming
836languages, a function may be used in multiple, specialized instances
837via ``method overloading''.
838
839We tested for overloaded genes by performing a one-sided $t$-test based on the asymptotics of the Jensen-Shannon metric under the null hypothesis of no change in relative abundnaces of isoforms (either grouped by shared TSS for for post-transcriptional overloading, or by comparison of groups of isoforms with shared TSS for transcriptional overloading). Type I errors were controlled with the Benjamini-Hochberg \cite{Benjamini1995} correction for multiple testing.
840 A selection of overloaded genes are displayed in Supplemental Figs. \ref{splice_overloaded} and \ref{promoter_overloaded}.
841
842\newpage
843
844\begin{figure}[!ht]
845    \includegraphics{pdfs/splice_overloaded}
846    \caption[Selected genes with post-transcriptional
847    overloading]{Selected genes with post-transcriptional
848      overloading. Trajectories indicate the expression of individual
849      isoforms in FPKM ($y$ axis) over time in hours ($x$
850      axis). Dashed isoforms have not been previously
851      annotated. Isoform trajectories are colored by TSS, so isoforms
852      with the same color presumably share a common promoter and are
853      processed from the same primary transcript. It is evident that
854      total gene expression may remain constant during isoforms
855      switching (Eya3) while in other cases changes in relative
856      abundance are accompanied by changes in absolute expression.
857The Jensen-Shannon metric generalizes the notion of ``isoform
858switching'' and is useful in cases with multiple isoforms
859(e.g. Ddx17).
860  \label{splice_overloaded}}
861\end{figure}
862
863\begin{figure}[!ht]
864    \includegraphics{pdfs/promoter_overloaded}
865    \caption[Selected genes with transcriptional overloading]{Selected
866      genes with transcriptional overloading. Trajectories indicate
867      the expression of individual isoforms in FPKM (y axis) over time
868      in hours (x axis). Dashed isoforms have not been previously
869      annotated. Isoform trajectories are colored by TSS, so that
870      isoforms with different colors presumably vary in their promoter
871      and are processed from different primary transcripts.  \label{promoter_overloaded}}
872\end{figure}
873
874We can visualize overloading and expression dynamics with a plot that
875superimposes transcriptional and post-transcriptional overloading and
876gene-level expression over the time course. We refer to these as
877``Minard plots'', after Charles Joseph Minard's famous depiction
878of the advance and retreat of
879Napoleon's armies in the campaign against Russia in 1812
880\cite{Tufte2001}. Minard made use of multiple visual cues to display
881numerous varying quantities in one diagram. An example of a Minard
882plot for the gene
883Myc is shown in Figure 3c, and others are given in Appendix B. The dotted line indicates gene-level FPKM, with measured
884FPKM indicated by black circles. Grey circles indicate the arithmetic mean of
885gene-level FPKM between consecutive measured time points, interpolating FPKM
886at intermediate time points. The total gene expression overloading is
887visualized as a swatch centered around the interpolated expression curve.
888The width of the swatch encodes the amount of expression overloading between
889successive time points. The color of the swatch indicates the relative
890contributions of transcriptional and post-transcriptional expression
891overloading.
892
893Some genes, such as tropomyosin I and II, feature a single primary
894transcript, and so all overloading is by definition post-transcriptional.
895Others, like Fhl3, have two primary transcripts, but only a single isoform
896arises from each, so all overloading is transcriptional. Genes with multiple
897primary transcripts, one or more of which are alternatively spliced, such as
898Myc or RTN4, display both forms.
899
900
901\subsection{The {\tt Cufflinks} software}
902
903The transcript assembly and abundance estimation algorithms are implemented in
904freely available open source software called {\tt Cufflinks} that is available
905from \newline {\tt http://cufflinks.cbcb.umd.edu/} \newline Furthermore
906methods for comparing annotations across time points, and for performing the
907differential expression, promoter usage and splicing tests are implemented in
908the companion programs {\tt Cuffdiff} and {\tt Cuffcompare}. Instructions on
909how to install and run the software are provided on the website.
910
911The input to {\tt Cufflinks} consists of fragment alignments in the SAM format
912\cite{Li2009a}. These may consist of either single fragment alignments, or alignments of mate-pairs (paired-end reads produce better assemblies and more accurate abundance estimates than single reads). {\tt Cufflinks} will assemble the transcripts using the
913algorithm in Section \ref{sec:assembly}, and transcript abundances will be
914estimated using the model in Section \ref{sec:abundances}. Transcript
915coordinates and abundances are reported in the Gene Transfer Format (GTF).
916User supplied annotations may be provided to {\tt Cufflinks} (optional input) in
917which case they form the basis for the transcript abundance estimation.
918
919Some of the algorithms here rely on sufficient depth of sequencing in order to
920produce reliable output. {\tt Cufflinks} determines that depth is sufficient where
921possible to check that required assumptions hold. For example, in loci where one
922or more isoforms have extremely low relative expression, the observed Fisher
923Information Matrix may not be positive definite after rounding errors. In this
924case, it is not possible to produce a reliable variance-covariance matrix for
925isoform fragment abundances. {\tt Cufflinks} will report a numerical
926exception in this and similar cases. When an exception is reported, the
927confidence intervals for the isoforms' abundances will be set from 0
928FPKM to the FPKM for the whole gene. If such an exception is generated during
929a {\tt Cuffdiff} run, no differential analysis involving the problematic
930sample will be performed on that locus.
931\newpage
932\section{Appendix A: Lemmas and Theorems}
933
934The following elementary/classical results are required for our methods and we include them so that the supplement is self-contained.
935
936\begin{lemma}
937\label{lemma:varsum}
938Let $X_1,\ldots,X_n$ be random variables and $a_1,\ldots,a_n$ real numbers with $Y=\sum_{i=1}^na_iX_i$. Then
939\begin{equation}
940Var[Y] = \sum_{i=1}^n a_i^2Var[X_i]+2\sum_{i<j} a_ia_j Cov[X_i,X_j].
941\end{equation}
942\end{lemma}
943\begin{lemma}[Taylor Series]
944If $X$ and $Y$ are random variables then
945\begin{eqnarray}
946  Var[f(X,Y)] & \approx & \left(\frac{\partial f}{\partial X}(E[X],E[Y])\right)^2Var[X] \nonumber \\ & &+2\frac{\partial f}{\partial X}(E[X],E[Y])\frac{\partial f}{\partial Y}(E[X],E[Y])Cov[X,Y] \nonumber \\ & & +\left( \frac{\partial f}{\partial Y}(E[X],E[Y])\right)^2 Var[Y].
947\end{eqnarray}
948\end{lemma}
949\begin{cor}
950If $X$ and $Y$ are independent then
951\begin{eqnarray}
952  Var\left[log\left(\frac{X}{Y}\right)\right] & \approx & \frac{V[X]}{E[X]^2} + \frac{V[Y]}{E[Y]^2}.
953\end{eqnarray}
954\end{cor}
955\begin{cor}
956\label{lemma:varproduct}
957If $X$ and $Y$ are independent random variables then
958\begin{equation}
959Var[XY] = Var[X]Var[Y]+E[X]^2Var[Y]+E[Y]^2Var[X].
960\end{equation}
961\end{cor}
962
963The above result is exact using the 2nd order Taylor expansion (higher
964derivatives vanish).
965
966\begin{lemma}[\cite{Li2009b}]
967\label{lemma:readstoprobs}
968Let $a_1,\ldots,a_n,w_1,\ldots,w_n$ be real numbers satisfying: $w_i \neq 0$ and $0 \leq a_i \leq 1$ for all $i$, $\sum_{i=1}^n a_i = 1$ and $\sum_{i=1}^na_iw_i \neq 0$.  Let
969$ b_j = \frac{a_jw_j}{\sum_{i=1}^n a_iw_i}$. Then $a_j = \frac{b_j\frac{1}{w_j}}{\sum_{i=1}^n b_i\frac{1}{w_i}}$.
970\end{lemma}
971{\bf Proof}: \begin{eqnarray}
972b_j & =  & \frac{a_jw_j}{\sum_{i=1}^n a_iw_i} \\
973\Rightarrow \,  \sum_{k=1}^n \frac{b_k}{w_k} & = &   \sum_{k=1}^n \frac{a_k}{\sum_{i=1}^na_iw_i}\\
974&  = &  \frac{1}{\sum_{i=1}^n a_iw_i}\\
975& = & \frac{b_j}{a_jw_j}\\
976 \Rightarrow \,  a_j  & =  & \frac{b_j\frac{1}{w_j}}{\sum_{i=1}^n b_i \frac{1}{w_i}}.
977\end{eqnarray} \qed
978\begin{prop}[\cite{ASCB2005}]
979\label{prop:linearmodel}
980Let $f_i(\theta) = \sum_{j=1}^d a_{ij}\theta_j + b_i$ ($1 \leq i \leq
981m$) describe a linear statistical model with $a_{ij} \geq$ for all
982$i,j$.  That is, $\sum_{i=1}^m f_i(\theta) = 1$. If $u_i \geq 0$ for
983all $i$ then the log likelihood function
984\begin{equation}
985l(\theta) = \sum_{i=1}^m u_i log(f_i(\theta))
986\end{equation}
987is concave.
988\end{prop}
989{\bf Proof}: It is easy to see that
990\begin{equation}
991\left( \frac{\partial^2 l}{\partial \theta_j \partial \theta_k}  \right)  = -A^Tdiag\left( \frac{u_1}{f_1(\theta)^2},\ldots,\frac{u_m}{f_m(\theta)^2}\right) A,
992\end{equation}
993where $A$ is the $m \times d$ matrix whose entry in row $i$ and column
994$j$ equals $a_{ij}$. Therefore the Hessian is a symmetric matrix with
995non-positive eigenvalues, and is therefore negative semi-definite.
996\qed
997
998\begin{defn}
999\label{def:po}
1000A partially ordered set is a set $S$ with a binary relation $\leq$
1001satisfying:
1002\begin{enumerate}
1003\item $x \leq x$ for all $x \in S$,
1004\item If $x \leq y$ and $y \leq z$ then $x \leq z$,
1005\item If $x \leq y$ and $y \leq x$ then $x=y$.
1006\end{enumerate}
1007A {\em chain} is a set of elements in $C \subseteq S$ such that for every
1008$x,y \in C$ either $x \leq y$ or $y \leq x$. An {\em antichain} is a set of
1009elements that are pairwise incompatible.
1010\end{defn}
1011Partially ordered sets are equivalent to directed acyclic graphs (DAGs). The following min-max theorems relate chain partitions to antichains and are special cases of linear-programming duality. More details and complete proofs can be found in \cite{Lovasz2009}.
1012
1013\begin{thm}[Dilworth's theorem]
1014\label{thm:Dilworth}
1015Let $P$ be a finite partially ordered set. The maximum number of elements in any antichain of $P$ equals the minimum number of chains in any partition of $P$ into chains.
1016\end{thm}
1017\begin{thm}[K\"{o}nig's theorem]
1018\label{thm:konig}
1019In a bipartite graph, the number of edges in a maximum matching equals
1020the number of vertices in a minimum vertex cover.
1021\end{thm}
1022\begin{thm}
1023\label{thm:dilko}
1024Dilworth's theorem is equivalent to K\"{o}nig's theorem.
1025\end{thm}
1026{\bf Proof}: We first show that Dilworth's theorem follows from
1027K\"{o}nig's theorem. Let $P$ be a partially ordered set with $n$
1028elements. We define a bipartite graph $G=(U,V,E)$ where
1029$U=V=P$, i.e. each partition in the bipartite graph is equally to
1030$P$. Two nodes $u,v$ form an edge $(u,v) \in E$ in the graph $G$ iff
1031$u<v$ in $P$. By K\"{o}nig's theorem there exist both a matching $M$ and a
1032a vertex cover $C$ in $G$ of the same cardinality. Let $T \subset
1033S$ be the set of elements not contained in $C$. Note that $T$ is an
1034antichain in $P$. We now form a partition $W$ of $P$ into chains by declaring $u$ and
1035$v$ to be in the same chain whenever there is an edge $(u,v) \in
1036M$. Since $C$ and $M$ have the same size, it follows that $T$ and $W$
1037have the same size.
1038
1039To deduce K\"{o}nig's theorem from Dilworth's theorem, we begin with a
1040bipartite graph $G=(U,V,E)$ and form a partial order $P$ on the
1041vertices of $G$ by defining $u<v$ when $u \in U, v \in V$ and $(u,v)
1042\in E$. By Dilworth's theorem, there exists an antichain of $P$ and a
1043partition into chains of the same size. The non-trivial chains in $P$
1044form a matching in the graph. Similarly, the complement of the
1045vertices corresponding to the anti-chain in $P$ is a vertex cover of
1046$G$ with the same cardinality as the matching.  \qed
1047
1048\begin{figure}[!ht]
1049    \includegraphics[scale=0.8]{pdfs/Dilworth_Konig}
1050
1051%\caption[Equivalence of Dilworth's and K\"{o}nig's theorems]{
1052\end{figure}
1053The equivalence of Dilworth's and K\"{o}nig's theorems is depicted above. The
1054      partially ordered set with 8 elements on the left is partitioned into 3
1055      chains. This is the size of a minimum partition into chains, and
1056    is equal to the maximum size of an antichain (Dilworth's
1057    theorem). The antichain is shown with double circles. On the
1058    right, the reachability graph constructed from the partially
1059    ordered set on the left is shown. The maximum matching
1060    corresponding to the chain partition consists of 5 edges and is equal in size to the
1061    number of vertices in a minimum vertex cover (K\"{o}nig's
1062    theorem). The vertex cover is shown with double circles. Note that
1063    8=3+5.
1064
1065
1066\newpage
1067\bibliographystyle{amsplain}
1068\bibliography{cufflinks}
1069\end{document}
1070
1071% long read length is required for uniquely determining genomic origin of fragments, but its fragment length that is critical for assignment of fragments to transcripts.
1072% The length of a set of transcripts depends on their abundances
1073% single read sequencing prevents the correct normalization for fragment sizes in the length.
1074