1\documentclass[12pt]{amsart}
2
3\usepackage{graphicx}
4\usepackage{amsmath}
5\usepackage[pdftex]{hyperref}
6
7\setlength{\textwidth}{6.2in}
8\setlength{\textheight}{8.4in}
9\setlength{\topmargin}{0.2in}
10\setlength{\oddsidemargin}{0in}
11\setlength{\evensidemargin}{0in}
12\setlength{\headsep}{.3in}
13\footskip 0.3in
14\newtheorem{thm}{Theorem}
15\newtheorem{lemma}[thm]{Lemma}
16\newtheorem{prop}[thm]{Proposition}
17\newtheorem{cor}[thm]{Corollary}
18\theoremstyle{definition}
19\newtheorem{defn}[thm]{Definition}
20\newtheorem{example}[thm]{Example}
21\newtheorem{remark}[thm]{Remark}
22\newtheorem{conj}[thm]{Conjecture}
23
24\newcommand{\cherry}[1]{\ensuremath{\langle #1 \rangle}}
25\newcommand{\ignore}[1]{}
26\newcommand{\half}{\frac{1}{2}}
27\newcommand{\eopf}{\framebox[6.5pt]{} \vspace{0.2in}}
28
29%------------------document begins-----------------------
30
31\begin{document}
32
33\title{Supplementary methods for the paper\\
34Transcript assembly and quantification by RNA-Seq
35reveals unannotated transcripts and isoform switching
36during cell differentiation}
37\author{Cole Trapnell \and Brian A Williams \and Geo Pertea \and Ali Mortazavi \and Gordon Kwan \and Marijke J van Baren \and Steven L Salzberg
38\and Barbara J Wold \and Lior Pachter}
39%\address{}
40%\email{\{whoweare\}@math.berkeley.edu}
41%\dedicatory{\today}
42
43%\thanks{Supported in part by ....}
44%\subjclass{Primary 55M20; Secondary 47H10}
45
46\maketitle
47\markboth{C Trapnell et al.}{Transcript assembly and abundance estimation from RNA-Seq}
48
49This document is a companion to the paper ``Transcript assembly and abundance estimation from RNA-Seq reveals thousands of new transcripts and switching among isoforms''. It describes the experiment
50performed, the details of the {\tt Cufflinks} assembler and abundance
51estimation methods, and provides
52supplementary tables, figures and supporting evidence for claims in
53the paper.
54%\pagestyle{plain}
55\pagenumbering{roman} \tableofcontents \listoffigures \listoftables \newpage \pagenumbering{arabic}
56
57\section{Sequencing experiment}
58
59The data analyzed in the paper consisted of 430,467,018 paired 75bp reads
60sequenced from the transcriptome of mouse skeletal muscle C2C12 cells induced
61to undergo myogenic differentiation. Total RNA was extracted from these cells,
62and subsequently mRNA was isolated at four different time points (-24 hours,
6360 hours, 120 hours, 168 hours). cDNA was prepared following a similar
64procedure to the one described in \cite{Mortazavi2008}. Fragmentation of the
65mRNA followed by size selection resulted in fragment lengths
66approximately 200nt long for
67all of the time-points. The distribution of fragment lengths is shown in
68Supplementary Figure \ref{fragment_lengths} (in Section \ref{sec:abundances}
69this distribution of fragment lengths is referred to as $F$). These estimates
70are based on alignments of the spiked-in sequences using {\tt Bowtie} 0.12
71\cite{Langmead2009}.
72
73\begin{figure}[h]
74    \includegraphics{pdfs/library_frags.pdf}
75    \caption[Fragment length distributions of C2C12 time-course libraries]{Fragment length distributions of C2C12 time-course libraries. \label{fragment_lengths}}
76\end{figure}
77
78\section{Mapping fragments to the genome}
79
80In principle, an algorithm that infers individual transcript abundances by
81measuring the fraction of fragments originating from each of a set of known
82transcripts would begin by computing alignments between fragments and the set
83of known transcripts that may be contained in the sample. However, because the
84transcriptome for mouse is incompletely annotated, such an analysis requires
85mapping of fragments to the genome as a proxy for mapping directly to
86transcripts. This means that alignments of short sequencing reads must be
87allowed to span exon-exon splice junctions in genomic coordinate space. We
88previously developed a program called {\tt TopHat} to map RNA-Seq reads to the
89genome. {\tt TopHat} does not require a reference transcriptome and can
90therefore be used to discover novel splice junctions. \cite{Trapnell2009}
91
92Fragments were mapped to build 37.1 of the mouse genome with {\tt TopHat} version
931.0.13. We extended our previous algorithms described in
94\cite{Trapnell2009} to exploit the longer
95paired reads used in the study. The original {\tt TopHat} program used a
96seed-and-extend alignment strategy to find spliced alignments of unpaired
97RNA-Seq experiments. However, due to computational limitations, our original
98method reported only alignments across GT-AG introns shorter than 20Kb by
99default. This strategy also could not align reads that spanned multiple splice
100junctions. However, as sequencing technology has improved and longer (paired end) reads have
101become available, we have
102modified the software to employ new strategies to align reads across splice
103junctions. {\tt TopHat} version 1.0.7 and later splits a read 75bp or longer in
104three or more segments of approximately equal size (25bp), and maps them
105independently. Reads with segments that can be mapped to the genome only
106non-contiguously are marked as possible intron-spanning reads. These
107``contiguously unmappable'' reads are used to build a set of possible introns
108in the transcriptome.
109
110\subsection{Discovering splice junctions} Suppose $S$ is a read of length
111$l$ that crosses a splice junction. {\tt TopHat} splits $S$ into $n=\lfloor
112l/k \rfloor$ segments where $k=25$bp ($k$ is a parameter that can be
113adjusted by the user in {\tt TopHat}). At most
114one of these segments must cross the splice junction, and junctions
115can be discovered if they lie in any of the segments. {\tt TopHat} maps the
116segments $s_1,...,s_n$ with {\tt Bowtie} to the genome, and checks for
117internal segments $s_2,...,s_{n-1}$ that do not map anywhere to the genome, as
118well as for pairs of successive segments $s_i,s_{i+1}$ that both align to the
119genome, but not adjacently. When a segment $s_i$ fails to align because it
120crosses a splice junction, but $s_{i-1}$ and $s_{i+1}$ are aligned (say at
121starting at positions $x$ and $y$, respectively), {\tt TopHat} looks for the
122donor and acceptor sites for the junction near $x$ and $y$. Assuming
123(without loss of generality)  that the
124transcript is on Crick strand of the genome the
125donor must fall within $k$ bases upstream of position $x+k$, and the acceptor
126must be within $k$ bases downstream of $y$, a total of $k$ possible exon-exon
127splice junctions. Similarly, when successive segments $s_i$ and $s_{i+1}$
128align to the genome non-adjacently at positions $x$ and $y$, the junction
129spanned by the read must be from positions $x+k$ to $y$ in the genome.
130
131{\tt TopHat} accumulates an index of potential splice junctions by examining
132segment mapping for all contiguously unmappable reads. For each junction the
133program then concatenates $k$bp upstream of the donor to $k$bp downstream of
134the acceptor to form a synthetic spliced sequence around the junction. The
135segments of the contiguously unmappable reads are then aligned against these
136synthetic sequences with {\tt Bowtie}. The resulting contiguous and spliced
137segment alignments for these reads are merged to form complete alignments to
138the genome, each spanning one or more splice junctions.
139
140\subsection{Resolving multiple alignments for fragments}
141
142The alignments for both reads from a mate pair are examined together to
143produce a set of alignments for the corresponding library fragment as a whole,
144reported in SAM format \cite{Li2009a}. These fragment alignments are
145ranked according to the procedure described below, and only highest ranking alignments are reported. The ranks are
146designed to incorporate very loose assumptions on intron and gene length,
147namely that introns longer than 20kb are rare. Let $x$ and $y$ be fragment
148alignments. Then $x < y$ if {\em any} of the following (applied in order) are
149true:
150
151\begin {enumerate}
152    \item $x$ is a singleton, and $y$ has both ends mapped,
153    \item $x$ crosses more splice junctions than $y$,
154    \item the reads from $x$ map significantly farther apart in the genome than expected according to the library's fragment length distribution ($\geq$ 3 s.d.), and the reads from $y$ do not,
155    \item the reads from $x$ are significantly closer together than expected according to the library's fragment length distribution, and the reads from $y$ are not,
156    \item The reads from $x$ map more than 100bp farther apart than the reads from $y$,
157    \item $x$ and $y$ both span an intron, and $x$ spans a longer one,
158    \item $x$ has more mismatches than $y$ to the genome.
159\end {enumerate}
160
161Fragments that have multiple equally good alignments according to the above
162rules are all reported. If there are $n$ alignments for a fragment,
163each is assigned a probability
164of only $1/n$ of being correct. The SAM format encodes this probability in the
165mapping quality field, which is later used by {\tt Cufflinks} to reduce the
166contribution of multiply mapping fragments (to $1/n$ of a uniquely
167mappable read) in FPKM calculations (FPKM is a measurement of
168expression, and is formally defined in Section
169\ref{sec:abundances}). The recent work of Li et al. \cite{Li2009b}
170addresses the problem of probabilistically assigning multi-reads, and
171it should be possible to incorporate the ideas of that paper into
172future versions of {\tt Tophat} and {\tt Cufflinks}.
173
174\begin{table}[h]{
175    \Small
176    \label{table:reads}
177    \begin{tabular}{l|c|c|c|c|c|c}
178        \hline
179        \textbf{Sample} & \textbf{Sequenced} & \textbf{Aligned} & \textbf{Singleton} & \textbf{Spliced} & \textbf{Multi-mapping} & \textbf{Total} \\
180        & \textbf{fragments} & \textbf{fragments} & \textbf{fragments} & \textbf{fragments} & \textbf{fragments} & \textbf{alignments} \\
181        \hline
182
183        -24 hours & 42,184,539 & 35,852,366 & 11,031,886 & 8,824,825 & 1,768,041 & 41,663,170 \\
184        60 hours & 70,192,031 & 57,071,494 & 18,104,211 & 15,778,114 & 	2,265,378 & 64,637,511 \\
185        120 hours & 41,069,106 & 27,914,989 & 14,431,734 & 7,711,026 & 1,881,772 & 33,929,133 \\
186        168 hours & 61,787,833 & 50,705,080 & 20,396,250 & 14,585,287 & 	2,458,292 &	58,797,912 \\
187        Total & 215,233,509 & 171,543,929 & 63,964,081 & 46,899,252 & 8,373,483 & 199,027,726 \\
188
189        \hline
190
191    \end{tabular}
192    }
193\vskip 0.2in
194\caption[Number of fragments sequenced, aligned and mapped with {\tt TopHat}]{Number of fragments sequenced, aligned and mapped with {\tt
195    TopHat}. Singleton fragments are fragments for which only one end
196  could be mapped. Spliced fragments include at least one end that
197  maps across a junction. The numbers in the total alignment column
198  may not be the sum of the entries in each row, because some fragments
199  fall into multiple classes.}
200\end{table}
201
202\newpage
203\section{Transcript abundance estimation \label{abundances}}
204\label{sec:abundances}
205
206\subsection{Definitions}
207
208A {\em transcript} is an RNA molecule that has been transcribed from DNA. A
209{\em primary transcript} is an RNA molecule that has yet to undergo
210modification. The {\em genomic location} of a primary transcript consists of a
211pair of coordinates in the genome representing the $5'$ transcription start
212site and the $3'$ polyadenylation cleavage site. We denote the set of all
213transcripts in a transcriptome by $T$. We partition transcripts into {\em
214transcription loci} (for simplicity we refer to these as loci) so that every
215locus contains a set of transcripts all of whose genomic locations do not
216overlap the genomic location of any transcript in any other locus. Formally,
217we consider a maximal partition of transcripts into loci, a partition denoted
218by $G$, where the genomic location of a transcript $t \in g \in G$ does not
219overlap the genomic location of any transcript $u $ where $u \in h \in G$ and
220$h \neq g$. We emphasize that the definition of a transcription locus is not
221biological; transcripts in the same locus may be regulated via different
222promoters, and may differ completely in sequence (for example if one
223transcript is in the intron of another) or have different functions. The
224reason for defining loci is that we will see that they are computationally
225convenient.
226
227
228We assume that at the time of an experiment, a transcriptome consists of an
229ensemble of transcripts $T$ where the proportion of transcript $t \in T$ is
230$\rho_t$, so that $\sum_{t \in T} \rho_t = 1$ and $0 \leq \rho_t \leq 1$ for
231all $t \in T$. Formally, a {\em transcriptome} is a set of transcripts $T$
232together with the abundances $\rho=\{ \rho_t\}_{t \in T}$. For convenience we also introduce
233notation for the proportion of transcripts in each locus. We let $\sigma_g =
234\sum_{t \in g} \rho_t$. Similarly, within a locus $g$, we denote the
235proportion of each transcript $t \in g$ by $\tau_t = \frac{\rho_t}{\sigma_g}$.
236We refer to $\rho,\sigma$ and $\tau$ as {\em transcript abundances}.
237
238
239Transcripts have lengths, which we denote by $l(t)$. For a collection of transcripts $S \subset T$ in a transcriptome, we define the length of $S$ using the weighted mean:
240\begin{equation}
241\label{eq:effective_length}
242l(S) =\frac{\sum_{t \in S} \rho_tl(t)}{\sum_{t \in S}\rho_t}.
243\end{equation}
244It is important to note that the length of a set of transcripts depends on their relative abundances; the reason for this will be clear later.
245
246One grouping of transcripts that we will focus on is the set of transcripts
247within a locus that share the same transcription start site (TSS). Unlike the
248concept of a locus, grouping by TSS has a biological basis. Transcripts
249within such a group are by definition alternatively spliced, and if they have
250different expression levels, this is most likely due to the spliceosome and
251not due to differences in transcriptional regulation.
252
253\subsection{A statistical model for RNA-Seq}
254
255In order to analyze expression levels of transcripts with RNA-Seq data, it is
256necessary to have a model for the (stochastic) process of sequencing. A {\em
257sequencing experiment} consists of selecting a total of $M$ fragments of
258transcripts uniformly at random from the transcriptome. Each fragment is
259identified by sequencing from its ends, resulting in two reads called {\em
260mate pairs}. The length of a fragment is a random variable, with a
261distribution we will denote by $F$. That is, the probability that a fragment
262has length $i$ is $F(i)$ and $\sum_{i=1}^{\infty} F(i) = 1$. In this paper we
263assume that $F$ is normal, however in principle $F$ can be estimated using
264data from the experiment (e.g. spike-in sequences). We decided to use
265the normal approximation to $F$ (allowing for user specified
266parameters of the normal distribution) in order to simplify the
267requirements for running {\tt Cufflinks} at this time.
268
269The assumption of random
270fragment selection is known to oversimplify the complexities of a sequencing
271experiment, however without rigorous ways to normalize we decided to work with
272the uniform at random assumption. It is easy to adapt the model to include
273more complex models that address sequencing bias as RNA-Seq experiments mature
274and the technologies are better understood.
275
276The transcript abundance estimation problem in paired-end RNA-Seq is to
277estimate $\rho$ given a set of transcripts $T$ and a set of reads sequenced
278from the ends of fragments. In {\tt Cufflinks}, the transcripts $T$ can be
279specified by the user, or alternatively $T$ can be estimated directly from the
280reads. The latter problem is the transcript assembly problem which we discuss
281in Section \ref{sec:assembly}. We ran {\tt Cufflinks} in the latter
282``discovery'' mode where we assembled the transcripts without using the
283reference annotation.
284
285The fact that fragments have different lengths has bearing on the calculation
286of the probability of selecting a fragment from a transcript. Consider a
287transcript $t$ with length $l(t)$. The probability of selecting a fragment of
288length $k$ from $t$ at one of the positions in $t$ assuming that it is
289selected uniformly at random, is $\frac{1}{l(t)-k}$. For this reason, we will
290define an adjusted length for transcripts as
291
292\begin{equation}
293\tilde{l}(t) = \sum_{i=1}^{l(t)} F(i)(l(t)-i+1).
294\end{equation}
295We also revisit the definition of length for a group of transcripts, and define
296\begin{equation}
297\tilde{l}(S) =\frac{\sum_{t \in S} \rho_t\tilde{l}(t)}{\sum_{t \in S}\rho_t}.
298\end{equation}
299
300It is important to note that given a read it may not be obvious from which
301transcript the fragment it was sequenced from originated. The consistency of
302fragments with transcripts is important and we define the {\em
303fragment-transcript matrix} $A_{R,T}$ to be the $M \times |T|$ matrix with
304$A(r,t)=1$ if the fragment alignment $r$ is completely contained in the
305genomic interval spanned by $t$, and all the implied introns in $r$ match
306introns in $t$ (in order), and with $A(r,t)=0$ otherwise. Note that the reads
307in Figure 1c in the main text are colored according to the matrix $A_{R,T}$,
308with each column of the matrix corresponding to one of the three colors
309(yellow, blue, red) and reads colored according to the mixture of colors
310corresponding to the transcripts their fragments are contained in.
311
312Even given the read
313alignment to a reference genome, it may not be obvious what the length of the
314fragment was. Formally, in the case that $A_{R,T}(r,t)=1$ we denote by $I_t(r)$ the fragment length from within
315a transcript $t$ implied by the (presumably unique) sequences corresponding to
316the mate pairs of a fragment $r$. If $A_{R,T}(r,t)=0$ then $I_t(r)$ is set to be
317infinite and $F(I_t(r)) = 0$.
318
319Given a set of reads, we assume that we can identify for each of them the set
320of transcripts with which the fragments the reads belonged to are consistent. The
321rationale for this assumption is the following: we map the reads to a
322reference genome, and we assume that the read lengths are sufficiently long so
323that every mate-pair can be uniquely mapped to the genome. We refer to this
324mapping as the {\em fragment alignment}. We also assume that we know all the
325possible transcripts and their alignments to the genome. Therefore, we can
326identify for each read the possible transcripts from which the fragment it
327belonged to originated.
328
329\begin{figure}[!ht]
330    \includegraphics[scale=0.8]{pdfs/implied_length}
331    \caption[Implied length of a fragment alignment]{Alignments of reads to the genome (rectangles) may be consistent with multiple transcripts (in this case both $t_1$ and $t_2$). The transcripts $t_1$ and $t_2$ differ by an internal exon; introns are indicated by long dashed lines. If we denote the fragment alignment by $r$, this means that $A_{R,T}(r,t_1)=1$ and $A_{R,T}(r,t_2)=1$. It is apparent that the implied length $I_{t_1}(r) > I_{t_2}(r)$ due to the presence of the extra internal exon in $t_1$. }
332\end{figure}
333
334We are now ready to write down the likelihood equation for the model. We will
335write $L(\rho|R)$ for the likelihood of a set of fragment alignments $R$
336constructed from $M$ reads. The notation $Pr(trans. = t)$ means ``the
337probability that a fragment selected at random originates from transcript $t$''.
338
339\begin{eqnarray}
340& & L(\rho|R) = \prod_{r \in R} Pr(rd.\ aln. =r)\\
341&  = & \prod_{r \in R} \sum_{t \in T} Pr(rd.\ aln. =r|trans. =t)Pr(trans. =t)\\
342& = & \prod_{r \in R} \sum_{t \in T} \frac{\rho_t\tilde{l}(t)}{\sum_{u \in T}\rho_u\tilde{l}(u)} Pr(rd.\ aln. = r|trans. =t)\\
343& = & \prod_{r \in R} \sum_{t \in T}\frac{\rho_t\tilde{l}(t)}{\sum_{u \in T}\rho_u\tilde{l}(u)} \left(\frac{F(I_t(r))}{l(t)-I_t(r)+1}\right) \\ \label{eq:likerho}
344& = &  \prod_{r \in R} \sum_{t \in T} \alpha_t\left(\frac{F(I_t(r))}{l(t)-I_t(r)+1}\right),
345\end{eqnarray}
346where
347\begin{equation}
348\alpha_t = \frac{\rho_t\tilde{l}(t)}{\sum_{u \in T}\rho_u\tilde{l}(u)}.
349\end{equation}
350
351Observe that $\alpha_t$ is exactly the probability that a fragment selected at
352random comes from transcript $t$, and we have that $\sum_{t \in T}\alpha_t =
3531$. In light of the probabilistic meaning of the $\alpha=\{\alpha_t\}_{t \in T}$, we refer to them as
354{\em fragment abundances}.
355
356It is evident that the likelihood function is that of a linear model
357and  that the likelihood function is
358concave (Proposition \ref{prop:linearmodel}) so a numerical method can be used
359to find the $\alpha$. It is then possible, in principle, to recover the $\rho$
360using Lemma \ref{lemma:readstoprobs}. However the number of parameters is in the tens of
361thousands, and in practice this form of the likelihood function is unwieldy.
362Instead, we re-write the likelihood utilizing the fact that transcripts in
363distinct loci do not overlap in genomic location.
364
365We first calculate the  probability that a fragment originates from a transcript within a given locus $g$:
366\begin{eqnarray}
367\beta_g & := &  \sum_{t \in g} \alpha_t\\
368& = & \frac{\sum_{t \in g} \rho_t \tilde{l}(t)}{\sum_{u \in T} \rho_u \tilde{l}(u)}\\
369& = & \frac{\sum_{t \in g} \sigma_g\tau_t \tilde{l}(t)}{\sum_{h \in G} \sum_{u \in h} \sigma_h\tau_u \tilde{l}(u)}\\
370& = & \frac{\sigma_g\sum_{t \in g} \tau_t \tilde{l}(t)}{\sum_{h \in G} \sigma_h \sum_{u \in h} \tau_u \tilde{l}(u)}\\
371 & = &  \frac{\sigma_g \tilde{l}(g)}{\sum_{h \in G} \sigma_h \tilde{l}(h)}.
372\end{eqnarray}
373Recall that $\sigma_g = \sum_{t \in g} \rho_t$ and that $\tau_t = \frac{\rho_t}{\sigma_g}$ for a locus $g$.
374
375Similarly, the probability of selecting a fragment from a single transcript $t$ conditioned on selecting a transcript from the locus $g$ in which $t$ is contained is
376\begin{equation}
377\label{eq:gammatau}
378\gamma_t = \frac{\tau_t \tilde{l}(t)}{\sum_{u \in g} \tau_u \tilde{l}(u)}.
379\end{equation}
380The parameters $\gamma=\{\gamma_t\}_{t \in g}$ are conditional fragment abundances, and they are the parameters we estimate from the data in the next Section. Note that for a transcript $t \in g$, $\alpha_t = \beta_g \cdot \gamma_t$ and it is easy to convert between fragment abundances and transcript abundances using Lemma \ref{lemma:readstoprobs}.
381
382We denote the fragment counts by $X$; specifically, we denote the number of
383alignments in locus $g$ by $X_g$. Note that $\sum_{g \in G} X_g = M$. We also
384use the notation $g_r$ to denote the (unique) locus from which a read
385alignment $r$ can be obtained.
386
387The likelihood function is given by
388\begin{eqnarray}
389& & L(\rho|R) = \prod_{r \in R} Pr(rd.\ aln. =r)\\
390&  = & \prod_{r \in R} \sum_{g \in G} Pr(rd.\ aln. =r|locus=g)Pr(locus=g)\\
391& = & \prod_{r \in R} \frac{\sigma_{g_{r}}\tilde{l}(g_{r})}{\sum_{g \in G} \sigma_g\tilde{l}(g)} Pr(rd.\ aln. =r|locus=g_r)\\
392& = & \prod_{r \in R}  \beta_{g_r}
393\sum_{t \in g_r}Pr(rd.\ aln. =r|locus=g_{r},trans. = t)Pr(trans. =t|locus=g_r)\\
394& = & \prod_{r \in R} \beta_{g_r}
395\sum_{t \in g_r} \frac{\tau_t\tilde{l}(t)}{\sum_{u \in g_r}\tau_u \tilde{l}(u)} Pr(rd.\ aln. =r|locus=g_{r},trans. = t)\\
396& = & \left( \prod_{r \in R} \beta_{g_r} \right) \left( \prod_{r \in R} \sum_{t \in g} \gamma_t \cdot
397Pr(rd.\ aln. =r|locus=g_r,trans. =t) \right)\\
398& = & \left( \prod_{r \in R}  \beta_{g_r} \right) \left( \prod_{r \in R} \sum_{t \in g}  \gamma_t\cdot  \frac{F(I_t(r))}{l(t)-I_t(r)+1}\right)\\
399& = & \left( \prod_{g \in G}  \beta_g^{X_{g}} \right) \left( \prod_{g \in G} \left( \prod_{r \in R:r \in g} \sum_{t \in g}  \gamma_t \cdot
400\frac{F(I_t(r))}{l(t)-I_t(r)+1}\right) \right). \label{eq:likebest}
401\end{eqnarray}
402
403Explicitly, in terms of the parameters $\rho$, Equation (\ref{eq:likebest})
404simplifies to Equation (\ref{eq:likerho}) but we will see in the next section
405how the maximum likelihood estimates $\hat{\rho}$ are most conveniently
406obtained by first finding $\hat{\beta}$ and $\hat{\gamma}$ using Equation
407(\ref{eq:likebest}).
408
409We note that it is biologically meaningful to include prior distributions on
410$\sigma$ and $\tau$ that reflect the inherent stochasticity and resulting
411variability of transcription in a cell. This will be an interesting direction
412for further research as more RNA-Seq data (with replicates) becomes available
413allowing for the determination of biologically meaningful priors. In
414particular, it seems plausible that specific isoform abundances may vary
415considerably and randomly within cells from a single tissue and that this may
416be important in studying differential splicing. We mention to this to clarify
417that in this paper, the confidence intervals we report represent the
418variability in the maximum likelihood estimates $\hat{\sigma}_j$ and
419$\hat{\tau}^k_j$, and are not the variances of prior distributions.
420
421
422\subsection{Estimation of parameters}
423
424We begin with a discussion of identifiability of our
425model. Identifiability refers to the injectivity of the model, i.e.,
426\begin{equation}
427\mbox{if } Pr_{\rho_1}(r) = Pr_{\rho_2}(r) \  \forall r \in R, \ \mbox{ then } \rho_1 = \rho_2.
428\end{equation}
429
430
431The identifiability of RNA-Seq models was discussed in
432\cite{Hiller2009}, where a standard analysis for linear models is
433applied to RNA-Seq (for another related biological example, see \cite{Pe'er2004} which discusses
434identifiability of haplotypes in mixed populations from genotype data).
435The results in these papers apply to our model. For completeness we review the conditions for
436identifiability. Recall that $A_{R,T}$ is the fragment-transcript matrix that specifies which transcripts each fragment is compatible with. The following theorem provides a simple characterization of identifiability:
437
438\begin{thm}
439The RNA-Seq model is identifiable iff $A_{R,T}$ is full rank.
440\end{thm}
441
442Therefore, for a given set of transcripts and a read set $R$, we can
443test whether the model is identifiable using elementary linear
444algebra. For the results in this paper, when estimating expression
445with given annotations, when the model was not identifiable we picked {\em a} maximum likelihood solution,
446although in principle it is possible to bound the total expression of
447the locus and/or report identifiability problems to the user.
448
449Returning to the likelihood function
450\begin{equation}
451\left( \prod_{g \in G}  \beta_g^{X_{g}} \right) \left( \prod_{g \in G} \left( \prod_{r \in R:r \in g} \sum_{t \in g}  \gamma_t \cdot
452\frac{F(I_t(r))}{l(t)-I_t(r)+1}\right) \right),
453\end{equation}
454we note that both the $\beta$ and $\gamma$ parameters depend on the $\rho$ parameters. However, we will see that if we maximize the $\beta$ separately from the $\gamma$, and also each of the sets $\{\gamma_t:t \in g\}$ separately, then it is always possible to find $\rho$ that match both the maximal $\beta$ and $\gamma$. In other words,
455the problem of finding $\hat{\rho}$ is equivalent
456to finding $\hat{\beta}$ that maximizes
457$ \prod_{g \in G} \beta_g^{X_g}$
458and separately, for each locus $g$, the $\hat{\gamma}_t$ that maximize
459\begin{equation}
460\prod_{r \in R:r \in g} \sum_{t \in g}  \gamma_t
461\frac{F(I_t(r))}{l(t)-I_t(r)+1}.
462\end{equation}
463
464We begin by solving for the $\hat{\beta}$ and $\hat{\gamma}$ and the variances
465of the maximum likelihood estimates, and then explain how these are used to
466report expression levels.
467
468We can solve for the $\hat{\gamma}$ using the fact that the model is linear.
469That is, the probability of each individual read is linear in the read
470abundances $\gamma_t$. It is a standard result in statistics (see, e.g.,
471Proposition 1.4 in \cite{ASCB2005}) that the log likelihood function of a
472linear model is concave. Thus, a hill climbing method can be used to find the
473$\hat{\gamma}$. We used the EM algorithm for this purpose.
474
475Rather than using the direct ML estimates, we obtained a regularized
476estimate by importance sampling from the posterior distribution with a
477proposal distribution we explain below. The samples were also used to estimate
478variances for our estimates.
479
480It follows from standard MLE asymptotic
481theory that the $\hat{\gamma}$ are asymptotically multivariate normal with
482variance-covariance matrix given by the inverse of the observed Fisher
483information matrix. This matrix is defined as follows:
484
485\begin{defn}[Observed Fisher information matrix]
486The observed Fisher information matrix is the negative of the Hessian of the log likelihood function evaluated at the maximum likelihood estimate. That is, for parameters $\Theta=(\theta_1,\ldots,\theta_n)$, the $n \times n$ matrix is
487\begin{eqnarray}
488\mathcal{F}_{k,l}(\hat{\Theta}) & = & - \frac{\partial^2 log(\mathcal{L}(\Theta|R))}{\partial \theta_k \theta_l} |_{\theta=\hat{\theta}}.
489\end{eqnarray}
490\end{defn}
491In our case, considering a single locus $g$, the parameters are $\Theta = (\gamma_{t_1},\ldots,\gamma_{t_{|g|}})$, and as expected from Proposition \ref{prop:linearmodel}:
492\begin{eqnarray}
493\label{eq:Fisher}
494\mathcal{F}_{t_k,t_l}(\hat{\Theta}) & = & \sum_{r \in R:r \in g}
495\left[ \frac{1}{\left( \sum_{h \in g}  \hat{\gamma}_h \frac{F(I_h(r))}{l(h) - I_h(r)+1} \right)^2} \frac{F(I_{t_k}(r)) F(I_{t_l}(r)) }{(l(t_k)-I_{t_k}+1)(l(t_l)-I_{t_l}+1)} \right].
496\end{eqnarray}
497
498
499Because some of the transcript abundances may be close to zero, we
500adopted the
501Bayesian approach of \cite{Jiang2009} and instead sampled from the
502joint posterior distribution of $\Theta$ using the proposal
503distribution consisting of the multivariate normal with mean
504given by the MLE, and variance-covariance matrix given by the inverse of
505(\ref{eq:Fisher}). If the Observed Fisher Information Matrix is
506singular then the user is warned and the confidence intervals of all
507transcripts are set to $[0,1]$ (meaning that there is no information
508about relative abundances).
509
510The method used for sampling was importance sampling. The samples were used to obtain a maximum-a-posterior estimate for
511$\hat{\gamma}_t$ for each $t$ and for the variance-covariance matrix which we
512denote by $\Psi^g$ (where $g \in G$ denotes the locus). Note that $\Psi^g$ is
513a $|g| \times |g|$ matrix. The covariance between $\hat{\gamma}_{t_k}$ and
514$\hat{\gamma}_{t_l}$ for $t_k,t_l \in g$ is given by $\Psi^g_{t_k,t_l}$.
515
516Turning to the maximum likelihood estimates $\hat{\beta}$, we use the fact that the model is the log-linear. Therefore,
517\begin{equation}
518\label{eq:sigmahat}
519\hat{\beta_g} = \frac{X_{g}}{M}.
520\end{equation}
521
522Viewed as a random variable, the counts $X_{g}$ are approximately Poisson and therefore the variance of the MLE $\hat{\beta}_g$ is approximately $X_{g}$. We note that for the tests in this paper we directly used the total counts $M$ and the proportional counts $X_g$, however it is easy to incorporate recent suggestions for total count normalization, such as \cite{Bullard2010} into {\tt Cufflinks}.
523
524The favored units for reporting expression in RNA-Seq studies to date is not
525using the transcript abundances directly, but rather using a measure
526abbreviated as FPKM, which means ``expected number of fragments per kilobase
527of transcript sequence per millions base pairs sequenced''. These units are
528equivalent to measuring transcript abundances (multiplied by a scalar). The
529computational advantage of FPKM, is that the normalization constants
530conveniently simplify some of the formulas for the variances of transcript
531abundance estimates.
532
533For example, the abundance of a transcript $t \in g$ in FPKM units is
534\begin{equation}
535\label{eq:FPKM1}
536 \frac{10^6 \cdot 10^3 \cdot \alpha_t}{\tilde{l}(t)} =  \frac{10^6 \cdot 10^3 \cdot \beta_g \cdot \gamma_t}{\tilde{l}(t)}.
537\end{equation}
538
539Equation (\ref{eq:FPKM1}) makes it clear that although the abundance of each
540transcript $t \in g$ in FPKM units is proportional to the transcript abundance
541$\rho_t$ it is given in terms of the read abundances $\beta_g$ and $\gamma_t$
542which are the parameters estimated from the likelihood function.
543
544The maximum likelihood estimates of $\beta_g$ and $\gamma_t$ are random
545variables, and we denote their scaled product (in FPKM units) by $A_t$. That
546is $Pr(A_t = a)$ is the probability that for a random set of fragment alignments
547from a sequencing experiment, the maximum likelihood estimate of the
548transcript abundance for $t$ in FPKM units is $a$.
549
550Using the fact that the expectation of a product of independent random
551variables is the product of the expectations, for a transcript $t \in g$ we
552have
553\begin{equation}
554E[A_t] = \frac{10^9X_{g}\hat{\gamma}_t}{\tilde{l}(t)M}.
555\end{equation}
556
557Given the variance estimates for the $\hat{\gamma}_t$ we turn to the problem of estimating $Var[A_t]$ for a transcript $t \in g$. We use Lemma \ref{lemma:varproduct} to obtain
558\begin{eqnarray}
559Var[A_t]& =  & \left(\frac{10^9}{\tilde{l}(t)M}\right)^2 \left( \Psi^g_{t,t}X_{g} + \Psi^g_{t,t}X^2_{g} + (\hat{\gamma}_t)^2 X_{g} \right)\\
560& = & X_{g}\left(\frac{10^9}{\tilde{l}(t)M)}\right)^2  \left( \Psi^g_{t,t}(1+X_{g}) + (\hat{\gamma}_t)^2\right).
561\end{eqnarray}
562
563This variance calculation can be used to estimate a confidence interval by
564utilizing the fact \cite{Aroian1978} that when the expectation divided by the
565standard deviation of at least one of two random variables is large, their
566product is approximately normal.
567
568Next we turn to the problem of estimating expression levels (and variances of
569these estimates) for groups of transcripts. Let $S \subset T$ be a group of
570transcripts located in a single locus $g$, e.g. a collection of transcripts
571sharing a common TSS.
572
573The analogy of Equation (\ref{eq:FPKM1}) for the FPKM of the group is
574\begin{eqnarray}
575\label{eq:grouphard}
576& & \frac{10^6 \cdot 10^3 \cdot \beta_g \cdot \left( \sum_{t \in S} \gamma_t\right)}{\tilde{l}(S)}\\
577& = & 10^6 \cdot 10^3 \cdot \beta_g \cdot \sum_{t \in S} \frac{\gamma_t}{\tilde{l}(t)}. \label{eq:groupeasy}
578\end{eqnarray}
579
580As before, we denote by $B_S$ the random variables for which $Pr(B_S = b)$ is
581the probability that for a random set of fragment alignments from a sequencing
582experiment, the maximum likelihood estimate of the transcript abundance for
583all the transcripts in $S$ in FPKM units is $b$. We note that the $B_S$ are
584products and sums of random variables (Equation (\ref{eq:groupeasy})). This
585makes Equation (\ref{eq:groupeasy})  more useful than the
586equivalent unsimplified Equation (\ref{eq:grouphard}), especially because
587$\tilde{l}(S)$ is, in general, a ratio of two random variables.
588
589We again use the fact that the expectation of independent random variables is
590the product of the expectation, in addition to the fact that expectation is a
591linear operator to conclude that for a group of transcripts $S$,
592
593\begin{equation}
594\label{eq:expectTSS}
595E[B_S] = \frac{10^9 \cdot X_g \cdot \sum_{t \in S} \frac{\hat{\gamma}_t}{\tilde{l}(t)}}{M}.
596\end{equation}
597In order to compute the variance of $B_S$, we first note that
598\begin{equation}
599Var\left[\sum_{t \in S} \frac{\hat{\gamma}_t}{\tilde{l}(t)}\right] = \sum_{t \in S}\frac{1}{\tilde{l}(t)^2}\Psi^g_{t,t} + \sum_{t,u \in S} \frac{1}{\tilde{l}(t)\tilde{l}(u)} \Psi^g_{t,u}.
600\end{equation}
601Therefore,
602\begin{eqnarray}
603& & \quad Var[B_S] \quad = \quad \nonumber \\
604& &  X_g\left(\frac{10^9}{M}\right)^2\left( \left(1+X_g\right) \left(\sum_{t \in S}\frac{1}{\tilde{l}(t)^2}\Psi^g_{t,t} + \sum_{t,u \in S} \frac{1}{\tilde{l}(t)\tilde{l}(u)} \Psi^g_{t,u}\right) + \left( \sum_{t \in S} \frac{\hat{\gamma}_t}{\tilde{l}(t)} \right)^2 \right). \label{eq:varTSS}
605\end{eqnarray}
606
607We can again estimate a confidence interval by utilizing the fact that $B_S$
608is approximately normal \cite{Aroian1978}.
609
610\subsection{Assessment of abundance estimation}
611
612We evaluated the accuracy of {\tt Cufflinks}' transcript abundance estimates by first comparing the estimated FPKM values for the spiked-in sequences in each sample against their intended concentrations.  Spike FPKMs were highly correlated across a 5-log dynamic range in all four samples (Supplementary Figure \ref{spikes}).  However, because sequenced spike fragments were unambiguously mappable, we performed additional simulation to measure the accuracy of the software in alternatively spliced loci.
613
614\begin{figure}[h]
615    \includegraphics{pdfs/spikes.pdf}
616    \caption[{\tt Cufflinks}' abundance estimates of spiked-in sequences]{{\tt Cufflinks}' abundance estimates of spiked-in sequences.\label{spikes}}
617\end{figure}
618
619\begin{figure}[h]
620    \includegraphics[scale=0.5]{pdfs/sim_small.pdf}
621    \caption[Accuracy of {\tt Cufflinks} abundance estimates]{\emph{In silico} assessment of the accuracy of {\tt Cufflinks}’  abundance estimation when provided with a perfect assembly (a) and after {\it de novo} comparative assembly (b). Red points indicate in silico transcripts that were only partially recovered, where black points were fully reconstructed by {\tt Cufflinks}. Simulated reads were aligned with {\tt TopHat} and the alignments were provided to {\tt Cufflinks} along with the structures of the transcripts in the simulated sample. \label{sim}}
622\end{figure}
623
624To assess the accuracy of {\tt Cufflinks}' estimates,
625we simulated an RNA-Seq experiment using the FluxSimulator, a freely available
626software package that models whole-transcriptome sequencing experiments with
627the Illumina Genome Analyzer. \cite{Sammeth} The software works by first
628randomly assigning expression values to the transcripts provided by the user,
629constructing an amplified, size-selected library, and sequencing
630it. Mouse UCSC transcripts were supplied to the software, along with build
63137.1 of the genome. FluxSimulator then randomly assigned expression ranks to
63218,935 transcripts, with the expression value $y$ computed from the rank $x$
633according to the formula
634
635\begin{equation}
636    y =\left( \frac{x}{5.0 \times 10^7}\right)^{-0.6}e^{-\left(\frac{x}{9.5\times{10^3}}\right) - \left(\frac{x}{9.5\times{10^3}}\right)^2}.
637\end{equation}
638
639From these relative expression levels, the software constructed an \emph{in
640silico} RNA sample, with each transcript assigned a number of molecules
641according to its abundances. The software modeled the polyadenylation of each
642transcript by adding a poly-A tail (of mean length 125nt) after the terminal
643exon. FluxSimulator then simulated reverse transcription of \emph{in silico}
644mRNAs by random hexamer priming, followed by size selection of RT products to
645between 175 and 225 nt. The resulting ``library'' of 6,601,805 cDNA fragments
646was then sampled uniformly at random for simulated sequencing, where the
647initial and terminal 75bp of each selected fragment were reported as reads.
648FluxSimulator does not allow precise control over the number of reads
649generated (Michael Sammeth, personal communication), but nevertheless
650generated 13,203,516 75nt paired-end RNA-Seq reads. These reads included
651sequencing errors; FluxSimulator includes a position-specific sequencing
652error model.
653
654Fragments were mapped with {\tt TopHat} to the mouse genome using identical
655parameters to those used to map the C2C12 reads, mapping a total of 6,176,961
656(93\% of the library). These alignments were supplied along with the exact set
657of expressed transcripts to {\tt Cufflinks}, to measure {\tt Cufflinks}' abundance
658estimation accuracy when working with a ``perfect'' assembly (Supplementary
659Figure \ref{sim}). Estimated FPKM was very close to true \emph{in silico} FPKM
660across a dynamic range of expression of nearly six orders of magnitude
661($R^2=0.95$).
662
663\begin{figure}[!ht]
664    \includegraphics[scale=0.30]{pdfs/vs_no_discovery_small.pdf}
665    \caption[Improved abundance accuracy with novel
666    transcripts]{Excluding novel C2C12 transcripts from abundance
667      estimation results in inaccurate estimates for known transcripts. \label{vs_no_discovery}}
668\end{figure}
669
670Estimation of transcript abundances by assigning fragments to them may be inaccurate if
671one is working with an incomplete set of transcripts for a particular sample.
672To evaluate the impact of missing transcripts, we removed the newly discovered
673transcripts from our high-confidence set and re-estimated the abundances of
674known transcripts, and then compared them to those obtained when working with
675the complete high-confidence set. While estimates of known transcripts were
676overall similar or identical when working with both sets, reflecting
677single-isoform or fully annotated genes, isoforms of some alternatively
678spliced genes differed greatly. (Supplementary Figure \ref{vs_no_discovery})
679
680As a final note, we point out that a na\"{i}ve, yet popular, current
681approach to expression estimation is to sum the reads mapping to a
682gene (where the sum is taken across all exons appearing in all
683possible isoforms), and then to normalize the count by either the
684total number of exonic bases, or by the average length of the
685transcripts. We call the former method the ``projective normalization''
686method, and the latter the ``average length'' method.
687
688\begin{prop}
689The totally projective normalization method is correct only for single isoform
690genes. If a gene has two or more isoforms the expression is underestimated.
691\end{prop}
692{\bf Proof}: The effective length of the gene is overestimated, hence
693the expression level is underestimated. To see this, first note that
694the length of some transcript in a gene is less than the total number
695of exonic bases among all transcripts. Then,  if $a_1,\ldots,a_n$ are
696real numbers all greater than zero and $b_1,\ldots,b_n$ are not all equal, we have
697\begin{equation}
698\frac{\sum_{i=1}^na_ib_i}{\sum_{i=1}^n a_i} < max_{i}(b_i),
699\end{equation}
700so that the effective length in equation (\ref{eq:effective_length}) is always less than the total
701number of exonic bases among all transcripts.
702
703Stated differently, the projective normalization method has the problem that
704it produces numbers that are not proportional to the $\rho$, so that
705it is not additive. The average length method is flawed for the same reason. The
706transcript abundances are not taken into account in computing the
707effective lengths. In some cases the method might produce the correct
708answer (for the wrong reasons), but it is bound to be incorrect on most examples,
709especially in genes with transcripts of variable lengths and
710non-uniform abundances.
711
712\section{Transcript assembly}
713\label{sec:assembly}
714\subsection{Overview}
715
716{\tt Cufflinks} takes as input alignments of RNA-Seq fragments to a reference
717genome and, in the absence of an (optional) user provided annotation, initially assembles transcripts from the alignments. Transcripts in
718each of the loci are assembled independently. The assembly algorithm
719is designed to aim for the following:
720\begin{enumerate}
721\item Every fragment is consistent with at least one assembled
722  transcript.
723\item Every transcript is tiled by reads.
724\item The number of transcripts is the smallest required to satisfy requirement (1).
725\item The resulting RNA-Seq models (in the sense of Section \ref{sec:abundances}) are identifiable.
726\end{enumerate}
727In other words, we seek an assembly that parsimoniously “explains” the fragments from the RNA-Seq experiment; every
728fragment in the experiment (except those filtered out during a preliminary
729error-control step) should have come from a {\tt Cufflinks} transcript, and
730{\tt Cufflinks} should produce as few transcripts as possible with that property.
731Thus, {\tt Cufflinks} seeks to optimize the criterion suggested in \cite{Xing2004},
732however, unlike the method in that paper, {\tt Cufflinks} leverages Dilworth's
733Theorem \cite{Dilworth1950} to solve the problem by reducing it
734to a matching problem via the equivalence of Dilworth's and K\"{o}nig's theorems (Theorem \ref{thm:dilko} in Appendix A). Our approach to isoform reconstruction is inspired by a similar approach used for haplotype reconstruction from HIV quasi-species \cite{Eriksson2008}.
735
736\subsection{A partial order on fragment alignments}
737
738The {\tt Cufflinks} program loads a set of alignments in SAM format sorted by reference
739position and assembles non-overlapping sets of alignments independently. After
740filtering out any erroneous spliced alignments or reads from incompletely
741spliced RNAs, {\tt Cufflinks} constructs a partial order (Definition \ref{def:po}), or equivalently a directed acyclic graph (DAG), with
742one node for each fragment that in turn consists of an aligned pair of mated reads. First, we note that fragment alignments are of two types: those where reads align in their entirety to the genome, and reads which have a split alignment (due to an implied intron).
743
744In the case of single reads, the partial order can be simply
745constructed by checking the reads for {\em compatibility}. Two reads
746are {\em compatible} if their overlap contains the exact same implied
747introns (or none). If two reads are not compatible they are {\em
748  incompatible}. The reads can be partially ordered by defining, for
749two reads $x,y$, that $x\leq y$ if the starting coordinate of $x$ is
750at or before the starting coordinate of $y$, and if they are
751compatible.
752
753In the case of paired-end RNA-Seq the situation is more complicated
754because the unknown sequence between mate pairs. To understand this,
755we first note that pairs of fragments can still be determined to be
756incompatible if they cannot have originated from the same
757transcript. As with single reads, this happens when there is
758disagreement on implied introns in the overlap. However compatibility
759is more subtle. We would like to define a pair of fragments $x,y$ to
760be compatible if they do not overlap, or if every implied intron in one fragment overlaps an identical implied intron in the other fragment.
761
762However it is important to note that it may be impossible to determine
763the compatibility (as defined above) or incompatibility of a pair of fragments. For example, an unknown region internal to a fragment may overlap two different introns (that are incompatible with each other). The fragment may be compatible with one of the introns (and the fragment from which it originates) in which case it is incompatible with the other. Since the opposite situation is also feasible, compatibility (or incompatibility) cannot be assigned. Fragments for which the compatibility/incompatibility cannot be determined with respect to every other fragment are called {\em uncertain}. Finally, two fragments are called {\em nested} if one is contained within the other.
764
765
766\begin{figure}[h]
767    \includegraphics[scale=0.5]{pdfs/compatibility.pdf}
768    \caption[Compatibility and incompatibility of
769    fragments]{Compatibility and incompatibility of
770      fragments. End-reads are solid lines, unknown sequences within
771      fragments are shown by dotted lines and implied introns are
772      dashed lines. The reads in (a) are compatible, whereas the
773      fragments in (b) are incompatible. The fragments in (c) are
774      nested. Fragment $x_4$ in (d) is uncertain, because $y_4$ and
775      $y_5$ are incompatible with each other.}
776\end{figure}
777
778Before constructing a partial order, fragments are extended to include
779their nested fragments and uncertain fragments are discarded. These discarded fragments
780are used in the abundance estimation. In theory, this may result in
781suboptimal (i.e. non-minimal assemblies) but we determined empirically
782that after assembly uncertain fragments are almost always consistent
783with one of the transcripts. When they are not, there was no
784completely tiled transcript that contained them. Thus, we employ a
785heuristic that substantially speeds up the program, and that works
786in practice.
787
788A partial order $P$ is then constructed from the remaining fragments
789by declaring that $x \leq y$ whenever the fragment corresponding to
790$x$ begins at, or before, the location of the fragment corresponding
791to $y$ and $x$ and $y$ are compatible. In what follows we identify $P$
792with its Hasse diagram (or covering relation), equivalently a directed
793acyclic graph (DAG) that is the transitive reduction.
794
795\begin{prop}
796$P$ is a partial order.
797\end{prop}
798{\bf Proof}: The fragments can be totally ordered according to the locations where they begin. It therefore suffices to check that if $x,y,z$ are fragments with $x$ compatible with $y$ and $y$ compatible with $z$ then $x$ is compatible with $z$. Since $x$ is not uncertain, it must be either compatible or incompatible with $z$. The latter case can occur only if $x$ and/or $z$ contain implied introns that overlap and are not identical. Since $y$ is not nested within $z$ and $x$ is not nested within $y$, it must be that $y$ contains an implied intron that is not identical with an implied intron in either $x$ or $z$. Therefore $y$ cannot be compatible with both $x$ and $z$. \qed
799
800\subsection{Assembling a parsimonious set of transcripts}
801
802In order to assemble a set of transcripts, {\tt Cufflinks} finds a
803(minimum) partition of $P$ into chains (see Definition
804\ref{def:po}). A partition of $P$ into chains yields an assembly
805because every chain is a totally ordered set of compatible fragments
806$x_1,\ldots,x_l$ and therefore there is a set of overlapping fragments
807that connects them. By Dilworth's theorem (Theorem \ref{thm:Dilworth}), the problem of finding a minimum partition $P$ into chains is equivalent to finding a maximum antichain in $P$ (an antichain is a set of mutually incompatible fragments). Subsequently, by Theorem \ref{thm:dilko}, the problem of finding a maximum antichain in $P$ can be reduced to finding a maximum matching in a certain bipartite graph that emerges naturally in deducing Dilworth's theorem from K\"{o}nig's theorem \ref{thm:konig}.
808We call the key bipartite graph the ``reachability'' graph. It is the transitive closure of the DAG, i.e. it is the graph
809where each fragment $x$ has nodes $L_x$ and $R_x$ in the left and
810right partitions of the reachability graph respectively, and where there is an edge between $L_x$ and $R_y$ when $x \leq y$ in $P$. The maximum matching problem is a classic problem that admits a polynomial time algorithm. The Hopcroft-Karp algorithm \cite{Hopcroft1973} has a run time of $O(\sqrt{V}E)$ where in our case $V$ is the number of fragments and $E$ depends on the extent of overlap, but is bounded by a constant times the coverage depth. We note that our parsimony approach to assembly therefore has a better complexity than the $O(V^3)$ PASA algorithm \cite{Haas2003}.
811
812The minimum cardinality chain decomposition computed using the approach above may not be unique.
813For example, a locus may contain two putative distinct initial exons (defined by overlapping incompatible fragments), and one
814of two distinct terminal and a constitutive exon in between that is longer than any
815read or insert in the RNA-Seq experiment. In such a case, the parsimonious assembly will consist of two transcripts, but there are four possible solutions that are all minimal. In order to ``phase'' distant exons, we leverage the fact that abundance inhomogeneities can link distant exons via their coverage. We therefore weight the edges of the bipartite reachability graph based on the percent-spliced-in metric introduced by
816Wang \emph{et al.} in \cite{Wang2008}. In our setting, the percent-spliced-in
817$\psi_x$ for an alignment $x$ is computed by counting the alignments
818overlapping $x$ in the genome that are compatible with $x$ and dividing by the
819total number of alignments that overlap $x$, and normalizing for the length of
820the $x$. The cost $C(y,z)$ assigned to an edge between alignments $y$ and $z$ reflects
821the belief that they originate from different transcripts:
822
823
824\begin{equation}
825    C(y,z) = -\log(1 - |\psi_y - \psi_z|).
826    \label{eq:match_cost}
827\end{equation}
828
829Rather than using the Hopcroft-Karp algorithm,  a modified version of the {\tt
830LEMON} \cite{LEMON} and {\tt Boost} \cite{Boost} graph libraries are used to
831compute a {\em min-cost} maximum cardinality matching on the bipartite compatibility
832graph. Even with the presence of weighted edges, our algorithm is
833very fast. The best known algorithm for weighted matching is $O(V^2logV+VE)$.
834
835Because we isolated total RNA, we
836expected that a small fraction of our reads would come from the intronic
837regions of incompletely processed primary transcripts. Moreover, transcribed
838repetitive elements and low-complexity sequence result in ``shadow''
839transfrags that we wished to discard as artifacts. Thus, {\tt Cufflinks}
840heuristically identifies artifact transfrags and suppresses them in its
841output. We also filter extremely low-abundance minor isoforms of alternatively
842spliced genes, using the model described in Section \ref {abundances} as a
843means of reducing the variance of estimates for more abundant transcripts. A
844transcript $x$ meeting any of the following criteria is suppressed:
845
846\begin{enumerate}
847    \item $x$ aligns to the genome entirely within an intronic region of the alignment for a transcript $y$, and the abundance of $x$ is less than 15\% of $y$'s abundance.
848    \item $x$ is supported by only a single fragment alignment to the genome.
849    \item More than 75\% of the fragment alignments supporting $x$, are mappable to multiple genomic loci.
850    \item $x$ is an isoform of an alternatively spliced gene, and has an estimated abundance less than 5\% of the major isoform of the gene.
851\end{enumerate}
852
853Prior to transcript assembly, {\tt Cufflinks} also filters out some of the alignments
854for fragments that are likely to originate from incompletely spliced nuclear
855RNA, as these can reduce the accuracy abundance estimates for fully spliced
856mRNAs. These filters and the output filters above are detailed in the source
857file \verb!filters.cpp! of the source code for {\tt Cufflinks}.
858
859In the overview of this Section, we mentioned that our assembly algorithm has the property that the resulting models are identifiable. This is a convenient property that emerges naturally from the parsimony criterion for a ``minimal explanation'' of the fragment alignments. Formally, it is a corollary of Dilworth's theorem:
860
861\begin{prop}
862The assembly produced by the {\tt Cufflinks} algorithm always results in an
863identifiable RNA-Seq model.
864\end{prop}
865{\bf Proof}:  By Dilworth's theorem, the minimum chain decomposition (assembly)
866we obtain has the same size as the maximum antichain in the partially
867ordered set  we
868construct from the reads. An antichain consists of reads that are
869pairwise incompatible, and therefore those reads must form a
870permutation sub-matrix in the fragment-transcript matrix $A_{R,T}$ with columns corresponding to the transcripts in a locus, and with rows corresponding to the fragments in the antichain. The matrix $A_{R,T}$ therefore contains permutation sub-matrices that together span all the columns, and the matrix is full-rank.
871
872\subsection{Assessment of assembly quality}
873
874To compare {\tt Cufflinks} transfrags against annotated transcriptomes, and
875also to find transfrags common to multiple assemblies, we developed a tool
876called {\tt Cuffcompare} that builds structural equivalence classes of
877transcripts. We ran {\tt Cuffcompare} on each the assembly from each time point against the
878combined annotated transcriptomes of the {UCSC known genes}, {\tt Ensembl},
879and {\tt Vega}. Because of the stochastic nature of sequencing, \emph{ab
880initio} assembly of the same transcript in two different samples may result in
881transfrags of slightly different lengths. A {\tt Cufflinks} transfrag was
882considered a complete match when there was a transcript with an identical
883chain of introns in the combined annotation.
884
885When no complete match is found between a {\tt Cufflinks} transfrag and the
886transcripts in the combined annotation, {\tt Cuffcompare} determines and reports if
887another substantial relationship exists with any of the annotation
888transcripts that can be found in or around the same genomic locus. For
889example, when all the introns of a transfrag match perfectly a part of the
890intron chain (sub-chain) of an annotation transcript, a ``containment''
891relationship is reported. For single-exon transfrags, containment is also
892reported when the exon appears fully overlapped by any of the exons of an
893annotation transcript. If there is no perfect match for the intron chain of a
894transfrag but only some exons overlap and there is at least one intron-exon
895junction match, {\tt Cuffcompare} classifies the transfrag as a putative ``novel''
896isoform of an annotated gene. When a transfrag is unspliced (single-exon) and
897it overlaps the intronic genomic space of a reference annotation transcript,
898the transfrag is classified as potential pre-mRNA fragment. Finally, when no
899other relationship is found between a {\tt Cufflinks} transfrag and an
900annotation transcript, {\tt Cuffcompare} can check the repeat content of the
901transfrag's genomic region (assuming the soft-masked genomic sequence was also
902provided) and it would classify the transfrag as ``repeat'' if most of its
903bases are found to be repeat-masked.
904
905When provided multiple time point assemblies, {\tt Cuffcompare} matches transcripts
906between samples that have an identical intron structure, placing all mutually
907matching transcripts in the same equivalence class. The program reports a
908non-redundant set of transcript structures, choosing the longest transcript
909from each equivalence class as the representative transcript. {\tt Cuffcompare} also
910reports the relationships found between each equivalence class (transcripts
911that have a complete match across time points) and reference transcripts from
912the combined annotation set, where applicable.
913
914\begin{figure}[h]
915    \includegraphics{pdfs/categories.pdf}
916    \caption[Categorization of {\tt Cufflinks} transcripts by estimated depth of read coverage]{Categorization of {\tt Cufflinks} transcripts by estimated depth of read coverage. \label{categories}}
917\end{figure}
918
919Table \ref{category_table} includes the classifications of the transfrags
920reported by {\tt Cufflinks} after assembling the C2C12 reads. While only
92113.5\% of assembled transfrags represent known transcripts, {\tt Cufflinks}
922assigns more than 76\% of reads to these, reflecting the fact that moderate
923and highly-abundant transfrags generate most of the library fragments in the
924experiment. Less abundant transcripts receive less complete sequencing
925coverage, resulting in numerous transfrags that partially but compatibly match
926known transcripts. Supplementary Figure \ref{categories} shows the categories
927of {\tt Cufflinks} transfrags as estimated depth of sequencing coverage
928increasing.
929
930\begin{table}[h]{
931    \Small
932    \begin{tabular}{l|c|c|c}
933        \hline
934        \textbf{Category} & \textbf{Transfrags} & \textbf{\% of total transfrags} & \textbf{Assembled reads} (\%) \\
935        \hline
936        Match to known isoform & 39,857 & 13.5 &	76.7 \\
937        Novel isoform of known gene & 18,565 & 6.3 & 11.3 \\
938        Contained in known isoform & 71,029 & 24.1 & 4.6 \\
939        Repeat & 41,906 & 14.2 & 0.6 \\
940        Intronic & 32,658 & 11.1 & 0.6 \\
941        Polymerase run-on & 18,522 & 6.3 & 0.5 \\
942        Intergenic & 48,604 & 16.5 & 1.2 \\
943        Other artifacts & 22,483 & 7.7 & 4.5 \\
944        \hline
945        Total transfrags & 293,624 & 100.0 & 100.0 \\
946        \hline
947
948    \end{tabular}
949    }
950\caption[Types of assembled transfrags]{Classification of all transfrags produced at any time point with respect to annotated gene models and masked repeats in the mouse genome.  Transfrags that are present in multiple time point assemblies are multiply counted to preserve the relative distribution of transfrags among the categories across the full experiment. \label{category_table}}
951\end{table}
952
953We selected the {\tt Cufflinks} transfrags that did not have a complete match
954or ``containment'' relationship with a known annotation transcript, but were
955classified by {\tt Cuffcompare} as putative ``novel isoforms'' of known genes. We
956explored the sequence similarity between these transfrags and two sets of mRNA
957sequences: one set representing the mouse transcriptome and consisting of all
958mouse ESTs in dbEST plus all reviewed or validated RefSeq mouse mRNAs, and the
959other consisting of all reviewed or validated RefSeq mRNAs from other
960mammalian species.
961
962We used megablast to map all mouse ESTs onto this set of {\tt Cufflinks}
963transfrags, only keeping EST alignments where at least 80\% of the EST length
964was aligned with at least 95\% identity. We calculated transfrag coverage by
965tiling overlapping EST mappings on each transfrag and counted only those
966transfrags that are covered by ESTs for at least 80\% of the transfrag length
967without any coverage gaps, and with coverage discontinuities only allowed at
968no more than 10\% distance from either end. For the mouse mRNAs alignments we
969also used megablast with the same basic coverage cutoffs (minimum 80\% covered
970with no more than 10\% unaligned on either side of the overlap) but applied to
971each pairwise alignment independently (i.e. as opposed to EST alignments, no
972coverage tiling was considered for mRNA alignments). For alignments with the
973non-mouse mRNAs we used discontiguous megablast with a dual (combined)
974discontiguous word template (option -N 2), with the same coverage assessment
975protocol as in the case of mouse mRNA alignments but with the percent identity
976cutoff lowered to 80\%.
977
978\begin{figure}[h]
979    \includegraphics{pdfs/FHL3_RT-PCR.pdf}
980    \caption[New isoform of Fhl3]{New and known isoforms of Fhl3 recovered by {\tt Cufflinks} at each time point (a) were confirmed by form-specific RT-PCR (b). \label{FHL3_RT-PCR}}
981\end{figure}
982
983\begin{figure}[h]
984    \includegraphics{pdfs/Additional_RT-PCR.pdf}
985    \caption[RT-PCR validation of selected genes]{RT-PCR of selected genes.  For Schip1, {\tt Cufflinks} assembled a known and a novel isoform (with a new TSS), both of which are detected by RT-PCR.  Prkar1a is annotated with two alternate first exons and start sites in {\tt UCSC known genes}, both of which were detected. {\tt Cufflinks} assembles the known isoform of the splicing factor Sfpq, along with a novel variant that contains most of RIKEN clone. Tpm1, a gene known to have muscle- and non-muscle-specific isoforms displays previously observed alternative first and last exons.   \label{RT-PCR}}
986\end{figure}
987
988\begin{table}[h]{
989    \Small
990    \begin{tabular}{r|l|c|c}
991
992        \hline
993        \textbf{Primer name}	& \textbf{sequence}	& \textbf{product length} & \textbf{endpoint gel score} \\
994        \hline
995
996        \textbf{FHL3 Ex1Ex3}	& & & \\
997        Left &	CTCGCCGCTGCTCTCTCG & 221 & +++ \\
998        Right &	GTGTTGTCATAGCACGGAACG & &	\\
999        \textbf{FHL3 Ex2Ex3}	 & & & \\
1000        Left &	AGGAAGGGCTCACAAGTGG	& 407 & +++ \\
1001        Right &	ATAGCACGGAACGCAGTAGG & & \\
1002        \textbf{Sfpq Ex9Ex10} & & & \\
1003        Left & GTGGTGGCATAGGTTATGAAGC & 936 & +++ \\
1004        Right & CCATTTTCAAAAGCTTTCAAGG & & \\
1005        \textbf{Sfpq Ex9Ex11} & & & \\
1006        Left & GTGGTGGCATAGGTTATGAAGC & 172 & +++ \\
1007        Right & CTCAAGTAAATAAGACTCCAAAATCAGC & & \\
1008        \textbf{Prkar1aEx1Ex3} & & & \\
1009        Left & ACAGCAGGGATCTCCTTGTCC & 418 & +++ \\
1010        Right & CCTCTCAAAGTATTCCCGAAGG & & \\
1011        \textbf{Prkar1aEx2Ex3} & & & \\
1012        Left & GCTATCGCAGAGTGGTAGTGAGG & 279 & +++ \\
1013        Right & CCTCTCAAAGTATTCCCGAAGG & & \\
1014        \textbf{Schip1Ex1Ex3} & & & \\
1015        Left & GGCTATGAGGGTGAAAAGTGC & 1050	& +++ \\
1016        Right & GTATAGATTCCTGGGCCATCG & & \\
1017        \textbf{Schip1Ex2Ex3} & & & \\
1018        Left & CAGCATGAGTGGTAACCAAGG & 269 & +++ \\
1019        Right & GTATAGATTCCTGGGCCATCG & & \\
1020        \textbf{Tpm1Ex1Ex3} & & & \\
1021        Left & TGAACAAAAGACCCCAGAGG & 565 & +++ \\
1022        Right & CTGAAGTACAAGGCCATCAGC & & \\
1023        \textbf{Tpm1Ex2Ex3} & & & \\
1024        Left & AGTTTTATTGAGCGTTGAGACG &	318 & +++ \\
1025        Right & CTGAAGTACAAGGCCATCAGC & & \\
1026        \hline
1027
1028    \end{tabular}
1029    }
1030\caption[Form-specific RT-PCR primers for selected genes]{Form-specific RT-PCR primers for selected genes, designed with Primer3 \cite{Rozen2000}. \label{RT-PCR_primers}}
1031\end{table}
1032
1033
1034To assess the dependence of assembly quality on the depth of sequencing, we
1035mapped and assembled subsets of our reads at the 60 hour time point. We
1036partitioned the three Illumina lanes' worth of data (a total of ~140 million
1037reads) into 64 subsets. We then processed a single subset with {\tt TopHat}
1038and {\tt Cufflinks}, as above, and compared the resulting transfrags to the
1039output of {\tt Cufflinks} on all three lanes using {\tt Cuffcompare}. We
1040repeated the mapping and assembly with two subsets, four subsets, eight, and
1041so on. Figure 4 in the main text shows the fraction of reference transcripts
1042captured by {\tt Cufflinks} using all three lanes that are still captured when
1043less data is available. For transcripts with low abundance ($<$15
1044FPKM), increased sequencing yields more full-length transcripts. However, for
1045even moderately abundant transcripts ($\geq$15 FPKM), 75\% or more of
1046the transcripts are recovered with only ~40 million reads, or a lane's worth
1047of Illumina GA II sequencing.
1048
1049\section{Analysis of gene expression dynamics}
1050
1051Expression dynamics of genes are composed of absolute changes in
1052overall transcript abundances, as well as relative changes in
1053transcript abundances over time. Moreover, select groups of
1054transcripts, for example transcripts grouped by TSS, may exhibit
1055specific dynamics due to the underlying biological mechanisms that
1056drive expression.
1057
1058In this section we describe statistical tests we developed in the
1059multiple hypothesis testing framework for examining absolute and
1060relative changes in arbitrary groups of transcripts.
1061
1062\subsection{Selection of high-confidence transcripts for expression tracking}
1063
1064We first restricted our analysis of expression dynamics over the time-course
1065to a set of transcripts we believed were fully sequenced and correctly
1066assembled, and we focused only on known and reliable novel isoforms of
1067annotated genes. This set consisted of transcripts that either were present in
1068the {\tt UCSC genome browser}, {\tt Ensembl}, or {\tt Vega} annotated
1069transcriptomes, or were found in multiple C2C12 time point assemblies. We
1070ignored transfrags classified as intronic pre-mRNA or polymerase run-on, as
1071well as intergenic repeats to focus on coding genes and long non-coding RNAs.
1072This high-confidence set contained a total of 17,416 transcripts, 13,692 of
1073which were in {\tt UCSC known genes}, {\tt Ensembl} or {\tt VEGA} annotation
1074and 3,724 of which are novel. Running {\tt Cufflinks}' abundance estimation
1075algorithm on this high-confidence set of transcripts at each time point
1076allowed us to scan for differentially expressed transcripts, differentially
1077spliced pre-mRNAs, and genes with shifts in promoter preference.
1078
1079\subsection{Testing for changes in absolute expression}
1080
1081Between any two consecutive time points, we tested whether a transcript was
1082significantly (after FDR control \cite{Benjamini1995}) up or down regulated with respect to the
1083null hypothesis of no change, with variability in expression due solely to the
1084uncertainties resulting from our abundance estimation procedure. This was done
1085using the following testing procedure for absolute differential expression:
1086
1087We employed the standard method used in microarray-based expression analysis
1088and proposed for RNA-Seq in \cite{Bullard2010}, which is to compute the
1089logarithm of the ratio of intensities (in our case FPKM), and to then use the
1090delta method to estimate the variance of the log odds. We describe this for
1091testing differential expression of individual transcripts and also groups of
1092transcripts (e.g. grouped by TSS).
1093
1094We recall that the MLE FPKM for a transcript $t$ in a locus $g$ is given by
1095\begin{equation}
1096\frac{10^9X_g\hat{\gamma}_t}{\tilde{l}(t)M}.
1097\end{equation}
1098Given two different experiments resulting in $X^a_g,M^a$ and $X^b_g,M^b$ respectively, as well as $\hat{\gamma}^a_t$ and $\hat{\gamma}^b_t$, we would like to test the significance of departures from unity of the ratio of MLE FPKMS, i.e.
1099\begin{eqnarray}
1100& & \left(\frac{10^9X^a_g\hat{\gamma}^a_t}{\tilde{l}(t)M^a}\right) / \left(\frac{10^9X^b_g\hat{\gamma}^b_t}{\tilde{l}(t)M^b}\right)\\
1101& = & \frac{X_g^a\hat{\gamma}^a_tM^b}{X_g^b \hat{\gamma}^b_tM^a}.
1102\end{eqnarray}
1103
1104This can be turned into a test statistic that is approximately normal by
1105taking the logarithm, and normalizing by the variance. We recall that using
1106the delta method, if $X$ is a random variable then $Var[log(X)] \approx
1107\frac{Var[X]}{E[X]^2}$.
1108
1109Therefore, our test statistic is
1110\begin{equation}
1111\frac{log(X^a_g)+log(\hat{\gamma}^a_t)+log(M^b)-log(X^b_g)-log(\hat{\gamma}^b_t)-log(M^a)}
1112{ \sqrt{\frac{\left( \Psi^{g,a}_{t,t}(1+X^a_{g}) + (\hat{\gamma}^a_t)^2\right)}{X^a_g \left(\hat{\gamma}^a_t \right)^2}+
1113\frac{\left( \Psi^{g,b}_{t,t}(1+X^b_{g}) + (\hat{\gamma}^b_t)^2\right)}{X^b_{g}\left(  \hat{\gamma}^b_t\right)^2}    }  }.
1114\end{equation}
1115
1116In order to test for differential expression of a group of transcripts, we
1117replace the numerator and denominator above by those from Equations
1118(\ref{eq:expectTSS}) and (\ref{eq:varTSS}).
1119
1120It is has been noted that the power of differential expression tests in RNA-Seq depend on the length of the transcripts being tested, because longer transcripts accumulate more reads \cite{Oshlack2009}. This means that the results we report are biased towards discovering longer differentially expressed transcripts and genes.
1121
1122\subsection{Quantifying transcriptional and post-transcriptional overloading}
1123
1124In order to infer the extent of differential promoter usage, we
1125decided to measure changes in relative abundances of primary
1126transcripts of single genes. Similarly, we investigated changes in
1127relative abundances of transcripts grouped by TSS in order to infer
1128differential splicing.  These inferences required two ingredients:
1129\begin{enumerate}
1130\item A metric on probability distributions (derived from relative
1131  abundances).
1132\item A test statistic for assessing significant changes in
1133  differential promoter usage and splicing as measured using the
1134  metric referred to above.
1135\end{enumerate}
1136
1137In order to address the first requirement, namely a metric on
1138probability distributions, we turned to an entropy-based metric. This
1139was motivated by the methods in \cite{Ritchie2008} where tests for
1140differences in relative isoform abundances were performed to
1141distinguish cancer cells from normal cells. We extend
1142this approach to be able to test for relative isoform abundance changes among multiple
1143experiments in RNA-Seq.
1144
1145\begin{defn}[Entropy]
1146The entropy of a discrete probability distribution
1147$p=(p_1,\ldots,p_n)$ ($0 \leq p_i \leq 1$ and $\sum_{i=1}^n p_i = 1$) is
1148\begin{equation}
1149H(p) = -\sum_{i=1}^n p_i log p_i.
1150\end{equation}
1151If $p_i=0$ for some $i$ the value of $p_i log p_i$ is taken to be $0$.
1152\end{defn}
1153\begin{defn}[The Jensen-Shannon divergence]
1154The Jensen-Shannon divergence of $m$ discrete probability distributions $p^1,\ldots,p^m$ is defined to be:
1155\begin{equation}
1156JS(p^1,\ldots,p^m) = H\left(\frac{p^1 + \cdots + p^m}{m}\right) - \frac{\sum_{j=1}^m H(p^j)}{m}.
1157\end{equation}
1158
1159In other words, the Jensen-Shannon divergence of a set of probability
1160distributions is the entropy of their average minus the average of their
1161entropies. \end{defn}
1162
1163In the case where $m=2$, we remark that the Jensen-Shannon divergence can also
1164be described in terms of the Kullback-Leibler divergence of two discrete
1165probability distributions. If we denote Kullback-Leibler divergence by
1166
1167\begin{equation}
1168D(p^1\|p^2) = \sum_i p^1_i log \frac{p^1_i}{p^2_i},
1169\end{equation}
1170then
1171\begin{equation}
1172JS(p^1,p^2) = \frac{1}{2}D(p^1\|m)+\frac{1}{2}D(p^2\|m)
1173\end{equation}
1174where $m=\frac{1}{2}(p^1+p^2)$. In other words the Jensen-Shannon
1175divergence is a symmetrized variant of the Kullback-Leibler divergence.
1176
1177The Jensen-Shannon divergence has a number of useful properties: for
1178example it is symmetric and non-negative. However it is {\em not} a
1179metric. The following theorem shows how to construct a metric
1180from the Jensen-Shannon divergence:
1181
1182\begin{thm}[Fuglede and Tops{\o}e 2004 \cite{Fuglede2004}]
1183The square root of the Jensen-Shannon divergence is a metric.
1184\end{thm}
1185
1186The proof of this result is based on a harmonic analysis argument that is the basis for
1187the remark in the main paper that ``transcript abundances move in time along a
1188logarithmic spiral in Hilbert space''.  We therefore call the square root of the Jensen-Shannon divergence the {\em Jensen-Shannon metric}. We employed this metric in
1189order to quantify relative changes in expression in (groups of) transcripts.
1190
1191In order to test for significance, we introduce a bit of notation.
1192Suppose that $S$ is a collection of transcripts (for example, they may
1193 share a common TSS). We define
1194\begin{equation}
1195\kappa_t = \frac{ \frac{\gamma_t}{\tilde{l}(t)}}{\sum_{u \in S} \frac{\gamma_u}{\tilde{l}(u)}}
1196\end{equation}
1197to be the proportion of transcript $t$ among all the transcripts in a
1198group $S$. We let
1199$Z=\sum_{u \in S} \hat{\gamma}_u / \tilde{l}(u)$ so that $\hat{\kappa}_t = \frac{\gamma_t}{\tilde{l}(t)Z}$. We therefore have that
1200\begin{eqnarray}
1201\label{eq:variancekappa1}
1202Var[\hat{\kappa}_t] & =  &\frac{Var[\hat{\gamma}_t]}{\tilde{l}(t)^2Z^2}, \\
1203Cov[\hat{\kappa}_t,\hat{\kappa}_u] & =  & \frac{Cov[\hat{\gamma}_t,\hat{\gamma}_u]}{\tilde{l}(t)\tilde{l}(u)Z^2}.\label{eq:variancekappa2}
1204\end{eqnarray}
1205
1206Our test statistic for divergent relative expression was the
1207Jensen-Shannon metric. The test could be applied to multiple time
1208points simultaneously, but we focused on pairwise tests (involving
1209consecutive time points). Under the null hypothesis of no change in
1210relative expression, the Jensen-Shannon metric should be
1211zero. We tested for this using a one-sided $t$-test, based on an
1212asymptotic derivation of the distribution of the Jensen-Shannon metric
1213under the null hypothesis. This asymptotic distribution is normal by applying the delta method approximation,
1214which involves computing the linear component of the Taylor expansion of the
1215variance of $\sqrt{JS}$.
1216
1217In order to simplify notation, we let $f(p^1,\ldots,p^m)$ be the
1218Jensen-Shannon metric for $m$ probability distributions $p^1,\ldots,p^m$.
1219
1220\begin{lemma}
1221The partial derivatives of the Jensen-Shannon metric are give by
1222\begin{equation}
1223\frac{\partial f}{\partial p^k_l} = \frac{1}{2m\sqrt{f(p^1,\ldots,p^m)}} log \left( \frac{p^k_l}{\frac{1}{m} \sum_{j=1}^m p_l^j} \right).
1224\end{equation}
1225\end{lemma}
1226
1227Let $\hat{\kappa}^1,\ldots,\hat{\kappa}^m$ denote $m$ probability
1228distributions on the set of transcripts $S$, for example the MLE for the transcript abundances in a time course. Then from the delta method we
1229have that $\sqrt{JS(\hat{\kappa}^1,\ldots,\hat{\kappa}^m)}$ is approximately normally
1230distributed with variance given by
1231
1232\begin{equation}
1233Var[\sqrt{JS(\hat{\kappa}^1,\ldots,\hat{\kappa}^m)}] \approx (\bigtriangledown f)^T \Sigma (\bigtriangledown f),
1234\end{equation}
1235
1236where $\Sigma$ is the variance-covariance matrix for the
1237$\kappa^1,\ldots,\kappa^m$, i.e., it is a block diagonal matrix where the
1238$i$th block is the variance-covariance matrix for the $\kappa^i_t$ given by
1239Equations (\ref{eq:variancekappa1},\ref{eq:variancekappa2}).
1240
1241
1242There are two biologically meaningful groupings of transcripts whose
1243relative abundances are interesting to track in a
1244time course. Transcripts that share a TSS are likely to be regulated by
1245the same promoter, and therefore tracking the change in relative
1246abundances of groups of transcripts sharing a TSS may reveal how
1247transcriptional regulation is affecting expression over
1248time. Similarly, transcripts that share a TSS and exhibit changes in
1249expression relative to each other are likely to be affected by
1250splicing or other post-transcriptional regulation. We therefore
1251grouped transcripts by TSS and compared relative abundance changes
1252within and between groups.
1253
1254We define ``overloading'' to be a significant change in relative
1255abundances for a set of transcripts (as determined by the Jensen-Shannon metric, see below). The term is intended to generalize the simple notion of
1256``isoform switching'' that is well-defined in the case of two
1257transcripts, to multiple transcripts. It is complementary to absolute
1258differential changes in expression: the overall expression of a gene
1259may remain constant while individual transcripts change drastically in
1260relative abundances resulting in overloading. The term is borrowed from
1261computer science, where in some statically-typed programming
1262languages, a function may be used in multiple, specialized instances
1263via ``method overloading''.
1264
1265We tested for overloaded genes by performing a one-sided $t$-test based on the asymptotics of the Jensen-Shannon metric under the null hypothesis of no change in relative abundnaces of isoforms (either grouped by shared TSS for for post-transcriptional overloading, or by comparison of groups of isoforms with shared TSS for transcriptional overloading). Type I errors were controlled with the Benjamini-Hochberg \cite{Benjamini1995} correction for multiple testing.
1266 A selection of overloaded genes are displayed in Supplemental Figs. \ref{splice_overloaded} and \ref{promoter_overloaded}.
1267
1268\newpage
1269
1270\begin{figure}[!ht]
1271    \includegraphics{pdfs/splice_overloaded}
1272    \caption[Selected genes with post-transcriptional
1273    overloading]{Selected genes with post-transcriptional
1274      overloading. Trajectories indicate the expression of individual
1275      isoforms in FPKM ($y$ axis) over time in hours ($x$
1276      axis). Dashed isoforms have not been previously
1277      annotated. Isoform trajectories are colored by TSS, so isoforms
1278      with the same color presumably share a common promoter and are
1279      processed from the same primary transcript. It is evident that
1280      total gene expression may remain constant during isoforms
1281      switching (Eya3) while in other cases changes in relative
1282      abundance are accompanied by changes in absolute expression.
1283The Jensen-Shannon metric generalizes the notion of ``isoform
1284switching'' and is useful in cases with multiple isoforms
1285(e.g. Ddx17).
1286  \label{splice_overloaded}}
1287\end{figure}
1288
1289\begin{figure}[!ht]
1290    \includegraphics{pdfs/promoter_overloaded}
1291    \caption[Selected genes with transcriptional overloading]{Selected
1292      genes with transcriptional overloading. Trajectories indicate
1293      the expression of individual isoforms in FPKM (y axis) over time
1294      in hours (x axis). Dashed isoforms have not been previously
1295      annotated. Isoform trajectories are colored by TSS, so that
1296      isoforms with different colors presumably vary in their promoter
1297      and are processed from different primary transcripts.  \label{promoter_overloaded}}
1298\end{figure}
1299
1300We can visualize overloading and expression dynamics with a plot that
1301superimposes transcriptional and post-transcriptional overloading and
1302gene-level expression over the time course. We refer to these as
1303``Minard plots'', after Charles Joseph Minard's famous depiction
1304of the advance and retreat of
1305Napoleon's armies in the campaign against Russia in 1812
1306\cite{Tufte2001}. Minard made use of multiple visual cues to display
1307numerous varying quantities in one diagram. An example of a Minard
1308plot for the gene
1309Myc is shown in Figure 3c, and others are given in Appendix B. The dotted line indicates gene-level FPKM, with measured
1310FPKM indicated by black circles. Grey circles indicate the arithmetic mean of
1311gene-level FPKM between consecutive measured time points, interpolating FPKM
1312at intermediate time points. The total gene expression overloading is
1313visualized as a swatch centered around the interpolated expression curve.
1314The width of the swatch encodes the amount of expression overloading between
1315successive time points. The color of the swatch indicates the relative
1316contributions of transcriptional and post-transcriptional expression
1317overloading.
1318
1319Some genes, such as tropomyosin I and II, feature a single primary
1320transcript, and so all overloading is by definition post-transcriptional.
1321Others, like Fhl3, have two primary transcripts, but only a single isoform
1322arises from each, so all overloading is transcriptional. Genes with multiple
1323primary transcripts, one or more of which are alternatively spliced, such as
1324Myc or RTN4, display both forms.
1325
1326
1327\section{The {\tt Cufflinks} software}
1328
1329The transcript assembly and abundance estimation algorithms are implemented in
1330freely available open source software called {\tt Cufflinks} that is available
1331from \newline {\tt http://cufflinks.cbcb.umd.edu/} \newline Furthermore
1332methods for comparing annotations across time points, and for performing the
1333differential expression, promoter usage and splicing tests are implemented in
1334the companion programs {\tt Cuffdiff} and {\tt Cuffcompare}. Instructions on
1335how to install and run the software are provided on the website.
1336
1337The input to {\tt Cufflinks} consists of fragment alignments in the SAM format
1338\cite{Li2009a}. These may consist of either single fragment alignments, or alignments of mate-pairs (paired-end reads produce better assemblies and more accurate abundance estimates than single reads). {\tt Cufflinks} will assemble the transcripts using the
1339algorithm in Section \ref{sec:assembly}, and transcript abundances will be
1340estimated using the model in Section \ref{sec:abundances}. Transcript
1341coordinates and abundances are reported in the Gene Transfer Format (GTF).
1342User supplied annotations may be provided to {\tt Cufflinks} (optional input) in
1343which case they form the basis for the transcript abundance estimation.
1344
1345Some of the algorithms here rely on sufficient depth of sequencing in order to
1346produce reliable output. {\tt Cufflinks} determines that depth is sufficient where
1347possible to check that required assumptions hold. For example, in loci where one
1348or more isoforms have extremely low relative expression, the observed Fisher
1349Information Matrix may not be positive definite after rounding errors. In this
1350case, it is not possible to produce a reliable variance-covariance matrix for
1351isoform fragment abundances. {\tt Cufflinks} will report a numerical
1352exception in this and similar cases. When an exception is reported, the
1353confidence intervals for the isoforms' abundances will be set from 0
1354FPKM to the FPKM for the whole gene. If such an exception is generated during
1355a {\tt Cuffdiff} run, no differential analysis involving the problematic
1356sample will be performed on that locus.
1357\newpage
1358\section{Appendix A: Lemmas and Theorems}
1359
1360The following elementary/classical results are required for our methods and we include them so that the supplement is self-contained.
1361
1362\begin{lemma}
1363\label{lemma:varsum}
1364Let $X_1,\ldots,X_n$ be random variables and $a_1,\ldots,a_n$ real numbers with $Y=\sum_{i=1}^na_iX_i$. Then
1365\begin{equation}
1366Var[Y] = \sum_{i=1}^n a_i^2Var[X_i]+2\sum_{i<j} a_ia_j Cov[X_i,X_j].
1367\end{equation}
1368\end{lemma}
1369\begin{lemma}[Taylor Series]
1370If $X$ and $Y$ are random variables then
1371\begin{eqnarray}
1372  Var[f(X,Y)] & \approx & \left(\frac{\partial f}{\partial X}(E[X],E[Y])\right)^2Var[X] \nonumber \\ & &+2\frac{\partial f}{\partial X}(E[X],E[Y])\frac{\partial f}{\partial Y}(E[X],E[Y])Cov[X,Y] \nonumber \\ & & +\left( \frac{\partial f}{\partial Y}(E[X],E[Y])\right)^2 Var[Y].
1373\end{eqnarray}
1374\end{lemma}
1375\begin{cor}
1376If $X$ and $Y$ are independent then
1377\begin{eqnarray}
1378  Var\left[log\left(\frac{X}{Y}\right)\right] & \approx & \frac{V[X]}{E[X]^2} + \frac{V[Y]}{E[Y]^2}.
1379\end{eqnarray}
1380\end{cor}
1381\begin{cor}
1382\label{lemma:varproduct}
1383If $X$ and $Y$ are independent random variables then
1384\begin{equation}
1385Var[XY] = Var[X]Var[Y]+E[X]^2Var[Y]+E[Y]^2Var[X].
1386\end{equation}
1387\end{cor}
1388
1389The above result is exact using the 2nd order Taylor expansion (higher
1390derivatives vanish).
1391
1392\begin{lemma}[\cite{Li2009b}]
1393\label{lemma:readstoprobs}
1394Let $a_1,\ldots,a_n,w_1,\ldots,w_n$ be real numbers satisfying: $w_i \neq 0$ and $0 \leq a_i \leq 1$ for all $i$, $\sum_{i=1}^n a_i = 1$ and $\sum_{i=1}^na_iw_i \neq 0$.  Let
1395$ b_j = \frac{a_jw_j}{\sum_{i=1}^n a_iw_i}$. Then $a_j = \frac{b_j\frac{1}{w_j}}{\sum_{i=1}^n b_i\frac{1}{w_i}}$.
1396\end{lemma}
1397{\bf Proof}: \begin{eqnarray}
1398b_j & =  & \frac{a_jw_j}{\sum_{i=1}^n a_iw_i} \\
1399\Rightarrow \,  \sum_{k=1}^n \frac{b_k}{w_k} & = &   \sum_{k=1}^n \frac{a_k}{\sum_{i=1}^na_iw_i}\\
1400&  = &  \frac{1}{\sum_{i=1}^n a_iw_i}\\
1401& = & \frac{b_j}{a_jw_j}\\
1402 \Rightarrow \,  a_j  & =  & \frac{b_j\frac{1}{w_j}}{\sum_{i=1}^n b_i \frac{1}{w_i}}.
1403\end{eqnarray} \qed
1404\begin{prop}[\cite{ASCB2005}]
1405\label{prop:linearmodel}
1406Let $f_i(\theta) = \sum_{j=1}^d a_{ij}\theta_j + b_i$ ($1 \leq i \leq
1407m$) describe a linear statistical model with $a_{ij} \geq$ for all
1408$i,j$.  That is, $\sum_{i=1}^m f_i(\theta) = 1$. If $u_i \geq 0$ for
1409all $i$ then the log likelihood function
1410\begin{equation}
1411l(\theta) = \sum_{i=1}^m u_i log(f_i(\theta))
1412\end{equation}
1413is concave.
1414\end{prop}
1415{\bf Proof}: It is easy to see that
1416\begin{equation}
1417\left( \frac{\partial^2 l}{\partial \theta_j \partial \theta_k}  \right)  = -A^Tdiag\left( \frac{u_1}{f_1(\theta)^2},\ldots,\frac{u_m}{f_m(\theta)^2}\right) A,
1418\end{equation}
1419where $A$ is the $m \times d$ matrix whose entry in row $i$ and column
1420$j$ equals $a_{ij}$. Therefore the Hessian is a symmetric matrix with
1421non-positive eigenvalues, and is therefore negative semi-definite.
1422\qed
1423
1424\begin{defn}
1425\label{def:po}
1426A partially ordered set is a set $S$ with a binary relation $\leq$
1427satisfying:
1428\begin{enumerate}
1429\item $x \leq x$ for all $x \in S$,
1430\item If $x \leq y$ and $y \leq z$ then $x \leq z$,
1431\item If $x \leq y$ and $y \leq x$ then $x=y$.
1432\end{enumerate}
1433A {\em chain} is a set of elements in $C \subseteq S$ such that for every
1434$x,y \in C$ either $x \leq y$ or $y \leq x$. An {\em antichain} is a set of
1435elements that are pairwise incompatible.
1436\end{defn}
1437Partially ordered sets are equivalent to directed acyclic graphs (DAGs). The following min-max theorems relate chain partitions to antichains and are special cases of linear-programming duality. More details and complete proofs can be found in \cite{Lovasz2009}.
1438
1439\begin{thm}[Dilworth's theorem]
1440\label{thm:Dilworth}
1441Let $P$ be a finite partially ordered set. The maximum number of elements in any antichain of $P$ equals the minimum number of chains in any partition of $P$ into chains.
1442\end{thm}
1443\begin{thm}[K\"{o}nig's theorem]
1444\label{thm:konig}
1445In a bipartite graph, the number of edges in a maximum matching equals
1446the number of vertices in a minimum vertex cover.
1447\end{thm}
1448\begin{thm}
1449\label{thm:dilko}
1450Dilworth's theorem is equivalent to K\"{o}nig's theorem.
1451\end{thm}
1452{\bf Proof}: We first show that Dilworth's theorem follows from
1453K\"{o}nig's theorem. Let $P$ be a partially ordered set with $n$
1454elements. We define a bipartite graph $G=(U,V,E)$ where
1455$U=V=P$, i.e. each partition in the bipartite graph is equally to
1456$P$. Two nodes $u,v$ form an edge $(u,v) \in E$ in the graph $G$ iff
1457$u<v$ in $P$. By K\"{o}nig's theorem there exist both a matching $M$ and a
1458a vertex cover $C$ in $G$ of the same cardinality. Let $T \subset
1459S$ be the set of elements not contained in $C$. Note that $T$ is an
1460antichain in $P$. We now form a partition $W$ of $P$ into chains by declaring $u$ and
1461$v$ to be in the same chain whenever there is an edge $(u,v) \in
1462M$. Since $C$ and $M$ have the same size, it follows that $T$ and $W$
1463have the same size.
1464
1465To deduce K\"{o}nig's theorem from Dilworth's theorem, we begin with a
1466bipartite graph $G=(U,V,E)$ and form a partial order $P$ on the
1467vertices of $G$ by defining $u<v$ when $u \in U, v \in V$ and $(u,v)
1468\in E$. By Dilworth's theorem, there exists an antichain of $P$ and a
1469partition into chains of the same size. The non-trivial chains in $P$
1470form a matching in the graph. Similarly, the complement of the
1471vertices corresponding to the anti-chain in $P$ is a vertex cover of
1472$G$ with the same cardinality as the matching.  \qed
1473
1474\begin{figure}[!ht]
1475    \includegraphics[scale=0.8]{pdfs/Dilworth_Konig}
1476
1477%\caption[Equivalence of Dilworth's and K\"{o}nig's theorems]{
1478\end{figure}
1479The equivalence of Dilworth's and K\"{o}nig's theorems is depicted above. The
1480      partially ordered set with 8 elements on the left is partitioned into 3
1481      chains. This is the size of a minimum partition into chains, and
1482    is equal to the maximum size of an antichain (Dilworth's
1483    theorem). The antichain is shown with double circles. On the
1484    right, the reachability graph constructed from the partially
1485    ordered set on the left is shown. The maximum matching
1486    corresponding to the chain partition consists of 5 edges and is equal in size to the
1487    number of vertices in a minimum vertex cover (K\"{o}nig's
1488    theorem). The vertex cover is shown with double circles. Note that
1489    8=3+5.
1490
1491
1492\newpage
1493
1494\section{Appendix B: selected Minard plots}
1495\begin{figure}[!ht]
1496    \includegraphics{pdfs/Tpm1_Tpm2.pdf}
1497\end{figure}
1498%\newpage
1499\begin{figure}[!ht]
1500    \includegraphics{pdfs/Fhl3_RTN4.pdf}
1501\end{figure}
1502\newpage
1503\begin{figure}[!ht]
1504    \includegraphics{pdfs/Myf6_Myf5.pdf}
1505\end{figure}
1506%\newpage
1507\begin{figure}[!ht]
1508    \includegraphics{pdfs/MyoD_Myog.pdf}
1509\end{figure}
1510\newpage
1511\begin{figure}[!ht]
1512    \includegraphics{pdfs/Actn.pdf}
1513\end{figure}
1514%\newpage
1515\begin{figure}[!ht]
1516    \includegraphics{pdfs/Ddx.pdf}
1517\end{figure}
1518\newpage
1519\begin{figure}[!ht]
1520    \includegraphics{pdfs/Myl1.pdf}
1521\end{figure}
1522%\newpage
1523\begin{figure}[!ht]
1524    \includegraphics{pdfs/Mef2.pdf}
1525\end{figure}
1526
1527\newpage
1528\bibliographystyle{amsplain}
1529\bibliography{cufflinks}
1530\end{document}
1531
1532% long read length is required for uniquely determining genomic origin of fragments, but its fragment length that is critical for assignment of fragments to transcripts.
1533% The length of a set of transcripts depends on their abundances
1534% single read sequencing prevents the correct normalization for fragment sizes in the length.
1535