1\documentclass[12pt]{amsart} 2 3\usepackage{graphicx} 4\usepackage{amsmath} 5\usepackage[pdftex]{hyperref} 6 7\setlength{\textwidth}{6.2in} 8\setlength{\textheight}{8.4in} 9\setlength{\topmargin}{0.2in} 10\setlength{\oddsidemargin}{0in} 11\setlength{\evensidemargin}{0in} 12\setlength{\headsep}{.3in} 13\footskip 0.3in 14\newtheorem{thm}{Theorem} 15\newtheorem{lemma}[thm]{Lemma} 16\newtheorem{prop}[thm]{Proposition} 17\newtheorem{cor}[thm]{Corollary} 18\theoremstyle{definition} 19\newtheorem{defn}[thm]{Definition} 20\newtheorem{example}[thm]{Example} 21\newtheorem{remark}[thm]{Remark} 22\newtheorem{conj}[thm]{Conjecture} 23 24\newcommand{\cherry}[1]{\ensuremath{\langle #1 \rangle}} 25\newcommand{\ignore}[1]{} 26\newcommand{\half}{\frac{1}{2}} 27\newcommand{\eopf}{\framebox[6.5pt]{} \vspace{0.2in}} 28 29%------------------document begins----------------------- 30 31\begin{document} 32 33\title{Supplementary methods for the paper\\ 34Transcript assembly and quantification by RNA-Seq 35reveals unannotated transcripts and isoform switching 36during cell differentiation} 37\author{Cole Trapnell \and Brian A Williams \and Geo Pertea \and Ali Mortazavi \and Gordon Kwan \and Marijke J van Baren \and Steven L Salzberg 38\and Barbara J Wold \and Lior Pachter} 39%\address{} 40%\email{\{whoweare\}@math.berkeley.edu} 41%\dedicatory{\today} 42 43%\thanks{Supported in part by ....} 44%\subjclass{Primary 55M20; Secondary 47H10} 45 46\maketitle 47\markboth{C Trapnell et al.}{Transcript assembly and abundance estimation from RNA-Seq} 48 49This document is a companion to the paper ``Transcript assembly and abundance estimation from RNA-Seq reveals thousands of new transcripts and switching among isoforms''. It describes the experiment 50performed, the details of the {\tt Cufflinks} assembler and abundance 51estimation methods, and provides 52supplementary tables, figures and supporting evidence for claims in 53the paper. 54%\pagestyle{plain} 55\pagenumbering{roman} \tableofcontents \listoffigures \listoftables \newpage \pagenumbering{arabic} 56 57\section{Sequencing experiment} 58 59The data analyzed in the paper consisted of 430,467,018 paired 75bp reads 60sequenced from the transcriptome of mouse skeletal muscle C2C12 cells induced 61to undergo myogenic differentiation. Total RNA was extracted from these cells, 62and subsequently mRNA was isolated at four different time points (-24 hours, 6360 hours, 120 hours, 168 hours). cDNA was prepared following a similar 64procedure to the one described in \cite{Mortazavi2008}. Fragmentation of the 65mRNA followed by size selection resulted in fragment lengths 66approximately 200nt long for 67all of the time-points. The distribution of fragment lengths is shown in 68Supplementary Figure \ref{fragment_lengths} (in Section \ref{sec:abundances} 69this distribution of fragment lengths is referred to as $F$). These estimates 70are based on alignments of the spiked-in sequences using {\tt Bowtie} 0.12 71\cite{Langmead2009}. 72 73\begin{figure}[h] 74 \includegraphics{pdfs/library_frags.pdf} 75 \caption[Fragment length distributions of C2C12 time-course libraries]{Fragment length distributions of C2C12 time-course libraries. \label{fragment_lengths}} 76\end{figure} 77 78\section{Mapping fragments to the genome} 79 80In principle, an algorithm that infers individual transcript abundances by 81measuring the fraction of fragments originating from each of a set of known 82transcripts would begin by computing alignments between fragments and the set 83of known transcripts that may be contained in the sample. However, because the 84transcriptome for mouse is incompletely annotated, such an analysis requires 85mapping of fragments to the genome as a proxy for mapping directly to 86transcripts. This means that alignments of short sequencing reads must be 87allowed to span exon-exon splice junctions in genomic coordinate space. We 88previously developed a program called {\tt TopHat} to map RNA-Seq reads to the 89genome. {\tt TopHat} does not require a reference transcriptome and can 90therefore be used to discover novel splice junctions. \cite{Trapnell2009} 91 92Fragments were mapped to build 37.1 of the mouse genome with {\tt TopHat} version 931.0.13. We extended our previous algorithms described in 94\cite{Trapnell2009} to exploit the longer 95paired reads used in the study. The original {\tt TopHat} program used a 96seed-and-extend alignment strategy to find spliced alignments of unpaired 97RNA-Seq experiments. However, due to computational limitations, our original 98method reported only alignments across GT-AG introns shorter than 20Kb by 99default. This strategy also could not align reads that spanned multiple splice 100junctions. However, as sequencing technology has improved and longer (paired end) reads have 101become available, we have 102modified the software to employ new strategies to align reads across splice 103junctions. {\tt TopHat} version 1.0.7 and later splits a read 75bp or longer in 104three or more segments of approximately equal size (25bp), and maps them 105independently. Reads with segments that can be mapped to the genome only 106non-contiguously are marked as possible intron-spanning reads. These 107``contiguously unmappable'' reads are used to build a set of possible introns 108in the transcriptome. 109 110\subsection{Discovering splice junctions} Suppose $S$ is a read of length 111$l$ that crosses a splice junction. {\tt TopHat} splits $S$ into $n=\lfloor 112l/k \rfloor$ segments where $k=25$bp ($k$ is a parameter that can be 113adjusted by the user in {\tt TopHat}). At most 114one of these segments must cross the splice junction, and junctions 115can be discovered if they lie in any of the segments. {\tt TopHat} maps the 116segments $s_1,...,s_n$ with {\tt Bowtie} to the genome, and checks for 117internal segments $s_2,...,s_{n-1}$ that do not map anywhere to the genome, as 118well as for pairs of successive segments $s_i,s_{i+1}$ that both align to the 119genome, but not adjacently. When a segment $s_i$ fails to align because it 120crosses a splice junction, but $s_{i-1}$ and $s_{i+1}$ are aligned (say at 121starting at positions $x$ and $y$, respectively), {\tt TopHat} looks for the 122donor and acceptor sites for the junction near $x$ and $y$. Assuming 123(without loss of generality) that the 124transcript is on Crick strand of the genome the 125donor must fall within $k$ bases upstream of position $x+k$, and the acceptor 126must be within $k$ bases downstream of $y$, a total of $k$ possible exon-exon 127splice junctions. Similarly, when successive segments $s_i$ and $s_{i+1}$ 128align to the genome non-adjacently at positions $x$ and $y$, the junction 129spanned by the read must be from positions $x+k$ to $y$ in the genome. 130 131{\tt TopHat} accumulates an index of potential splice junctions by examining 132segment mapping for all contiguously unmappable reads. For each junction the 133program then concatenates $k$bp upstream of the donor to $k$bp downstream of 134the acceptor to form a synthetic spliced sequence around the junction. The 135segments of the contiguously unmappable reads are then aligned against these 136synthetic sequences with {\tt Bowtie}. The resulting contiguous and spliced 137segment alignments for these reads are merged to form complete alignments to 138the genome, each spanning one or more splice junctions. 139 140\subsection{Resolving multiple alignments for fragments} 141 142The alignments for both reads from a mate pair are examined together to 143produce a set of alignments for the corresponding library fragment as a whole, 144reported in SAM format \cite{Li2009a}. These fragment alignments are 145ranked according to the procedure described below, and only highest ranking alignments are reported. The ranks are 146designed to incorporate very loose assumptions on intron and gene length, 147namely that introns longer than 20kb are rare. Let $x$ and $y$ be fragment 148alignments. Then $x < y$ if {\em any} of the following (applied in order) are 149true: 150 151\begin {enumerate} 152 \item $x$ is a singleton, and $y$ has both ends mapped, 153 \item $x$ crosses more splice junctions than $y$, 154 \item the reads from $x$ map significantly farther apart in the genome than expected according to the library's fragment length distribution ($\geq$ 3 s.d.), and the reads from $y$ do not, 155 \item the reads from $x$ are significantly closer together than expected according to the library's fragment length distribution, and the reads from $y$ are not, 156 \item The reads from $x$ map more than 100bp farther apart than the reads from $y$, 157 \item $x$ and $y$ both span an intron, and $x$ spans a longer one, 158 \item $x$ has more mismatches than $y$ to the genome. 159\end {enumerate} 160 161Fragments that have multiple equally good alignments according to the above 162rules are all reported. If there are $n$ alignments for a fragment, 163each is assigned a probability 164of only $1/n$ of being correct. The SAM format encodes this probability in the 165mapping quality field, which is later used by {\tt Cufflinks} to reduce the 166contribution of multiply mapping fragments (to $1/n$ of a uniquely 167mappable read) in FPKM calculations (FPKM is a measurement of 168expression, and is formally defined in Section 169\ref{sec:abundances}). The recent work of Li et al. \cite{Li2009b} 170addresses the problem of probabilistically assigning multi-reads, and 171it should be possible to incorporate the ideas of that paper into 172future versions of {\tt Tophat} and {\tt Cufflinks}. 173 174\begin{table}[h]{ 175 \Small 176 \label{table:reads} 177 \begin{tabular}{l|c|c|c|c|c|c} 178 \hline 179 \textbf{Sample} & \textbf{Sequenced} & \textbf{Aligned} & \textbf{Singleton} & \textbf{Spliced} & \textbf{Multi-mapping} & \textbf{Total} \\ 180 & \textbf{fragments} & \textbf{fragments} & \textbf{fragments} & \textbf{fragments} & \textbf{fragments} & \textbf{alignments} \\ 181 \hline 182 183 -24 hours & 42,184,539 & 35,852,366 & 11,031,886 & 8,824,825 & 1,768,041 & 41,663,170 \\ 184 60 hours & 70,192,031 & 57,071,494 & 18,104,211 & 15,778,114 & 2,265,378 & 64,637,511 \\ 185 120 hours & 41,069,106 & 27,914,989 & 14,431,734 & 7,711,026 & 1,881,772 & 33,929,133 \\ 186 168 hours & 61,787,833 & 50,705,080 & 20,396,250 & 14,585,287 & 2,458,292 & 58,797,912 \\ 187 Total & 215,233,509 & 171,543,929 & 63,964,081 & 46,899,252 & 8,373,483 & 199,027,726 \\ 188 189 \hline 190 191 \end{tabular} 192 } 193\vskip 0.2in 194\caption[Number of fragments sequenced, aligned and mapped with {\tt TopHat}]{Number of fragments sequenced, aligned and mapped with {\tt 195 TopHat}. Singleton fragments are fragments for which only one end 196 could be mapped. Spliced fragments include at least one end that 197 maps across a junction. The numbers in the total alignment column 198 may not be the sum of the entries in each row, because some fragments 199 fall into multiple classes.} 200\end{table} 201 202\newpage 203\section{Transcript abundance estimation \label{abundances}} 204\label{sec:abundances} 205 206\subsection{Definitions} 207 208A {\em transcript} is an RNA molecule that has been transcribed from DNA. A 209{\em primary transcript} is an RNA molecule that has yet to undergo 210modification. The {\em genomic location} of a primary transcript consists of a 211pair of coordinates in the genome representing the $5'$ transcription start 212site and the $3'$ polyadenylation cleavage site. We denote the set of all 213transcripts in a transcriptome by $T$. We partition transcripts into {\em 214transcription loci} (for simplicity we refer to these as loci) so that every 215locus contains a set of transcripts all of whose genomic locations do not 216overlap the genomic location of any transcript in any other locus. Formally, 217we consider a maximal partition of transcripts into loci, a partition denoted 218by $G$, where the genomic location of a transcript $t \in g \in G$ does not 219overlap the genomic location of any transcript $u $ where $u \in h \in G$ and 220$h \neq g$. We emphasize that the definition of a transcription locus is not 221biological; transcripts in the same locus may be regulated via different 222promoters, and may differ completely in sequence (for example if one 223transcript is in the intron of another) or have different functions. The 224reason for defining loci is that we will see that they are computationally 225convenient. 226 227 228We assume that at the time of an experiment, a transcriptome consists of an 229ensemble of transcripts $T$ where the proportion of transcript $t \in T$ is 230$\rho_t$, so that $\sum_{t \in T} \rho_t = 1$ and $0 \leq \rho_t \leq 1$ for 231all $t \in T$. Formally, a {\em transcriptome} is a set of transcripts $T$ 232together with the abundances $\rho=\{ \rho_t\}_{t \in T}$. For convenience we also introduce 233notation for the proportion of transcripts in each locus. We let $\sigma_g = 234\sum_{t \in g} \rho_t$. Similarly, within a locus $g$, we denote the 235proportion of each transcript $t \in g$ by $\tau_t = \frac{\rho_t}{\sigma_g}$. 236We refer to $\rho,\sigma$ and $\tau$ as {\em transcript abundances}. 237 238 239Transcripts have lengths, which we denote by $l(t)$. For a collection of transcripts $S \subset T$ in a transcriptome, we define the length of $S$ using the weighted mean: 240\begin{equation} 241\label{eq:effective_length} 242l(S) =\frac{\sum_{t \in S} \rho_tl(t)}{\sum_{t \in S}\rho_t}. 243\end{equation} 244It is important to note that the length of a set of transcripts depends on their relative abundances; the reason for this will be clear later. 245 246One grouping of transcripts that we will focus on is the set of transcripts 247within a locus that share the same transcription start site (TSS). Unlike the 248concept of a locus, grouping by TSS has a biological basis. Transcripts 249within such a group are by definition alternatively spliced, and if they have 250different expression levels, this is most likely due to the spliceosome and 251not due to differences in transcriptional regulation. 252 253\subsection{A statistical model for RNA-Seq} 254 255In order to analyze expression levels of transcripts with RNA-Seq data, it is 256necessary to have a model for the (stochastic) process of sequencing. A {\em 257sequencing experiment} consists of selecting a total of $M$ fragments of 258transcripts uniformly at random from the transcriptome. Each fragment is 259identified by sequencing from its ends, resulting in two reads called {\em 260mate pairs}. The length of a fragment is a random variable, with a 261distribution we will denote by $F$. That is, the probability that a fragment 262has length $i$ is $F(i)$ and $\sum_{i=1}^{\infty} F(i) = 1$. In this paper we 263assume that $F$ is normal, however in principle $F$ can be estimated using 264data from the experiment (e.g. spike-in sequences). We decided to use 265the normal approximation to $F$ (allowing for user specified 266parameters of the normal distribution) in order to simplify the 267requirements for running {\tt Cufflinks} at this time. 268 269The assumption of random 270fragment selection is known to oversimplify the complexities of a sequencing 271experiment, however without rigorous ways to normalize we decided to work with 272the uniform at random assumption. It is easy to adapt the model to include 273more complex models that address sequencing bias as RNA-Seq experiments mature 274and the technologies are better understood. 275 276The transcript abundance estimation problem in paired-end RNA-Seq is to 277estimate $\rho$ given a set of transcripts $T$ and a set of reads sequenced 278from the ends of fragments. In {\tt Cufflinks}, the transcripts $T$ can be 279specified by the user, or alternatively $T$ can be estimated directly from the 280reads. The latter problem is the transcript assembly problem which we discuss 281in Section \ref{sec:assembly}. We ran {\tt Cufflinks} in the latter 282``discovery'' mode where we assembled the transcripts without using the 283reference annotation. 284 285The fact that fragments have different lengths has bearing on the calculation 286of the probability of selecting a fragment from a transcript. Consider a 287transcript $t$ with length $l(t)$. The probability of selecting a fragment of 288length $k$ from $t$ at one of the positions in $t$ assuming that it is 289selected uniformly at random, is $\frac{1}{l(t)-k}$. For this reason, we will 290define an adjusted length for transcripts as 291 292\begin{equation} 293\tilde{l}(t) = \sum_{i=1}^{l(t)} F(i)(l(t)-i+1). 294\end{equation} 295We also revisit the definition of length for a group of transcripts, and define 296\begin{equation} 297\tilde{l}(S) =\frac{\sum_{t \in S} \rho_t\tilde{l}(t)}{\sum_{t \in S}\rho_t}. 298\end{equation} 299 300It is important to note that given a read it may not be obvious from which 301transcript the fragment it was sequenced from originated. The consistency of 302fragments with transcripts is important and we define the {\em 303fragment-transcript matrix} $A_{R,T}$ to be the $M \times |T|$ matrix with 304$A(r,t)=1$ if the fragment alignment $r$ is completely contained in the 305genomic interval spanned by $t$, and all the implied introns in $r$ match 306introns in $t$ (in order), and with $A(r,t)=0$ otherwise. Note that the reads 307in Figure 1c in the main text are colored according to the matrix $A_{R,T}$, 308with each column of the matrix corresponding to one of the three colors 309(yellow, blue, red) and reads colored according to the mixture of colors 310corresponding to the transcripts their fragments are contained in. 311 312Even given the read 313alignment to a reference genome, it may not be obvious what the length of the 314fragment was. Formally, in the case that $A_{R,T}(r,t)=1$ we denote by $I_t(r)$ the fragment length from within 315a transcript $t$ implied by the (presumably unique) sequences corresponding to 316the mate pairs of a fragment $r$. If $A_{R,T}(r,t)=0$ then $I_t(r)$ is set to be 317infinite and $F(I_t(r)) = 0$. 318 319Given a set of reads, we assume that we can identify for each of them the set 320of transcripts with which the fragments the reads belonged to are consistent. The 321rationale for this assumption is the following: we map the reads to a 322reference genome, and we assume that the read lengths are sufficiently long so 323that every mate-pair can be uniquely mapped to the genome. We refer to this 324mapping as the {\em fragment alignment}. We also assume that we know all the 325possible transcripts and their alignments to the genome. Therefore, we can 326identify for each read the possible transcripts from which the fragment it 327belonged to originated. 328 329\begin{figure}[!ht] 330 \includegraphics[scale=0.8]{pdfs/implied_length} 331 \caption[Implied length of a fragment alignment]{Alignments of reads to the genome (rectangles) may be consistent with multiple transcripts (in this case both $t_1$ and $t_2$). The transcripts $t_1$ and $t_2$ differ by an internal exon; introns are indicated by long dashed lines. If we denote the fragment alignment by $r$, this means that $A_{R,T}(r,t_1)=1$ and $A_{R,T}(r,t_2)=1$. It is apparent that the implied length $I_{t_1}(r) > I_{t_2}(r)$ due to the presence of the extra internal exon in $t_1$. } 332\end{figure} 333 334We are now ready to write down the likelihood equation for the model. We will 335write $L(\rho|R)$ for the likelihood of a set of fragment alignments $R$ 336constructed from $M$ reads. The notation $Pr(trans. = t)$ means ``the 337probability that a fragment selected at random originates from transcript $t$''. 338 339\begin{eqnarray} 340& & L(\rho|R) = \prod_{r \in R} Pr(rd.\ aln. =r)\\ 341& = & \prod_{r \in R} \sum_{t \in T} Pr(rd.\ aln. =r|trans. =t)Pr(trans. =t)\\ 342& = & \prod_{r \in R} \sum_{t \in T} \frac{\rho_t\tilde{l}(t)}{\sum_{u \in T}\rho_u\tilde{l}(u)} Pr(rd.\ aln. = r|trans. =t)\\ 343& = & \prod_{r \in R} \sum_{t \in T}\frac{\rho_t\tilde{l}(t)}{\sum_{u \in T}\rho_u\tilde{l}(u)} \left(\frac{F(I_t(r))}{l(t)-I_t(r)+1}\right) \\ \label{eq:likerho} 344& = & \prod_{r \in R} \sum_{t \in T} \alpha_t\left(\frac{F(I_t(r))}{l(t)-I_t(r)+1}\right), 345\end{eqnarray} 346where 347\begin{equation} 348\alpha_t = \frac{\rho_t\tilde{l}(t)}{\sum_{u \in T}\rho_u\tilde{l}(u)}. 349\end{equation} 350 351Observe that $\alpha_t$ is exactly the probability that a fragment selected at 352random comes from transcript $t$, and we have that $\sum_{t \in T}\alpha_t = 3531$. In light of the probabilistic meaning of the $\alpha=\{\alpha_t\}_{t \in T}$, we refer to them as 354{\em fragment abundances}. 355 356It is evident that the likelihood function is that of a linear model 357and that the likelihood function is 358concave (Proposition \ref{prop:linearmodel}) so a numerical method can be used 359to find the $\alpha$. It is then possible, in principle, to recover the $\rho$ 360using Lemma \ref{lemma:readstoprobs}. However the number of parameters is in the tens of 361thousands, and in practice this form of the likelihood function is unwieldy. 362Instead, we re-write the likelihood utilizing the fact that transcripts in 363distinct loci do not overlap in genomic location. 364 365We first calculate the probability that a fragment originates from a transcript within a given locus $g$: 366\begin{eqnarray} 367\beta_g & := & \sum_{t \in g} \alpha_t\\ 368& = & \frac{\sum_{t \in g} \rho_t \tilde{l}(t)}{\sum_{u \in T} \rho_u \tilde{l}(u)}\\ 369& = & \frac{\sum_{t \in g} \sigma_g\tau_t \tilde{l}(t)}{\sum_{h \in G} \sum_{u \in h} \sigma_h\tau_u \tilde{l}(u)}\\ 370& = & \frac{\sigma_g\sum_{t \in g} \tau_t \tilde{l}(t)}{\sum_{h \in G} \sigma_h \sum_{u \in h} \tau_u \tilde{l}(u)}\\ 371 & = & \frac{\sigma_g \tilde{l}(g)}{\sum_{h \in G} \sigma_h \tilde{l}(h)}. 372\end{eqnarray} 373Recall that $\sigma_g = \sum_{t \in g} \rho_t$ and that $\tau_t = \frac{\rho_t}{\sigma_g}$ for a locus $g$. 374 375Similarly, the probability of selecting a fragment from a single transcript $t$ conditioned on selecting a transcript from the locus $g$ in which $t$ is contained is 376\begin{equation} 377\label{eq:gammatau} 378\gamma_t = \frac{\tau_t \tilde{l}(t)}{\sum_{u \in g} \tau_u \tilde{l}(u)}. 379\end{equation} 380The parameters $\gamma=\{\gamma_t\}_{t \in g}$ are conditional fragment abundances, and they are the parameters we estimate from the data in the next Section. Note that for a transcript $t \in g$, $\alpha_t = \beta_g \cdot \gamma_t$ and it is easy to convert between fragment abundances and transcript abundances using Lemma \ref{lemma:readstoprobs}. 381 382We denote the fragment counts by $X$; specifically, we denote the number of 383alignments in locus $g$ by $X_g$. Note that $\sum_{g \in G} X_g = M$. We also 384use the notation $g_r$ to denote the (unique) locus from which a read 385alignment $r$ can be obtained. 386 387The likelihood function is given by 388\begin{eqnarray} 389& & L(\rho|R) = \prod_{r \in R} Pr(rd.\ aln. =r)\\ 390& = & \prod_{r \in R} \sum_{g \in G} Pr(rd.\ aln. =r|locus=g)Pr(locus=g)\\ 391& = & \prod_{r \in R} \frac{\sigma_{g_{r}}\tilde{l}(g_{r})}{\sum_{g \in G} \sigma_g\tilde{l}(g)} Pr(rd.\ aln. =r|locus=g_r)\\ 392& = & \prod_{r \in R} \beta_{g_r} 393\sum_{t \in g_r}Pr(rd.\ aln. =r|locus=g_{r},trans. = t)Pr(trans. =t|locus=g_r)\\ 394& = & \prod_{r \in R} \beta_{g_r} 395\sum_{t \in g_r} \frac{\tau_t\tilde{l}(t)}{\sum_{u \in g_r}\tau_u \tilde{l}(u)} Pr(rd.\ aln. =r|locus=g_{r},trans. = t)\\ 396& = & \left( \prod_{r \in R} \beta_{g_r} \right) \left( \prod_{r \in R} \sum_{t \in g} \gamma_t \cdot 397Pr(rd.\ aln. =r|locus=g_r,trans. =t) \right)\\ 398& = & \left( \prod_{r \in R} \beta_{g_r} \right) \left( \prod_{r \in R} \sum_{t \in g} \gamma_t\cdot \frac{F(I_t(r))}{l(t)-I_t(r)+1}\right)\\ 399& = & \left( \prod_{g \in G} \beta_g^{X_{g}} \right) \left( \prod_{g \in G} \left( \prod_{r \in R:r \in g} \sum_{t \in g} \gamma_t \cdot 400\frac{F(I_t(r))}{l(t)-I_t(r)+1}\right) \right). \label{eq:likebest} 401\end{eqnarray} 402 403Explicitly, in terms of the parameters $\rho$, Equation (\ref{eq:likebest}) 404simplifies to Equation (\ref{eq:likerho}) but we will see in the next section 405how the maximum likelihood estimates $\hat{\rho}$ are most conveniently 406obtained by first finding $\hat{\beta}$ and $\hat{\gamma}$ using Equation 407(\ref{eq:likebest}). 408 409We note that it is biologically meaningful to include prior distributions on 410$\sigma$ and $\tau$ that reflect the inherent stochasticity and resulting 411variability of transcription in a cell. This will be an interesting direction 412for further research as more RNA-Seq data (with replicates) becomes available 413allowing for the determination of biologically meaningful priors. In 414particular, it seems plausible that specific isoform abundances may vary 415considerably and randomly within cells from a single tissue and that this may 416be important in studying differential splicing. We mention to this to clarify 417that in this paper, the confidence intervals we report represent the 418variability in the maximum likelihood estimates $\hat{\sigma}_j$ and 419$\hat{\tau}^k_j$, and are not the variances of prior distributions. 420 421 422\subsection{Estimation of parameters} 423 424We begin with a discussion of identifiability of our 425model. Identifiability refers to the injectivity of the model, i.e., 426\begin{equation} 427\mbox{if } Pr_{\rho_1}(r) = Pr_{\rho_2}(r) \ \forall r \in R, \ \mbox{ then } \rho_1 = \rho_2. 428\end{equation} 429 430 431The identifiability of RNA-Seq models was discussed in 432\cite{Hiller2009}, where a standard analysis for linear models is 433applied to RNA-Seq (for another related biological example, see \cite{Pe'er2004} which discusses 434identifiability of haplotypes in mixed populations from genotype data). 435The results in these papers apply to our model. For completeness we review the conditions for 436identifiability. Recall that $A_{R,T}$ is the fragment-transcript matrix that specifies which transcripts each fragment is compatible with. The following theorem provides a simple characterization of identifiability: 437 438\begin{thm} 439The RNA-Seq model is identifiable iff $A_{R,T}$ is full rank. 440\end{thm} 441 442Therefore, for a given set of transcripts and a read set $R$, we can 443test whether the model is identifiable using elementary linear 444algebra. For the results in this paper, when estimating expression 445with given annotations, when the model was not identifiable we picked {\em a} maximum likelihood solution, 446although in principle it is possible to bound the total expression of 447the locus and/or report identifiability problems to the user. 448 449Returning to the likelihood function 450\begin{equation} 451\left( \prod_{g \in G} \beta_g^{X_{g}} \right) \left( \prod_{g \in G} \left( \prod_{r \in R:r \in g} \sum_{t \in g} \gamma_t \cdot 452\frac{F(I_t(r))}{l(t)-I_t(r)+1}\right) \right), 453\end{equation} 454we note that both the $\beta$ and $\gamma$ parameters depend on the $\rho$ parameters. However, we will see that if we maximize the $\beta$ separately from the $\gamma$, and also each of the sets $\{\gamma_t:t \in g\}$ separately, then it is always possible to find $\rho$ that match both the maximal $\beta$ and $\gamma$. In other words, 455the problem of finding $\hat{\rho}$ is equivalent 456to finding $\hat{\beta}$ that maximizes 457$ \prod_{g \in G} \beta_g^{X_g}$ 458and separately, for each locus $g$, the $\hat{\gamma}_t$ that maximize 459\begin{equation} 460\prod_{r \in R:r \in g} \sum_{t \in g} \gamma_t 461\frac{F(I_t(r))}{l(t)-I_t(r)+1}. 462\end{equation} 463 464We begin by solving for the $\hat{\beta}$ and $\hat{\gamma}$ and the variances 465of the maximum likelihood estimates, and then explain how these are used to 466report expression levels. 467 468We can solve for the $\hat{\gamma}$ using the fact that the model is linear. 469That is, the probability of each individual read is linear in the read 470abundances $\gamma_t$. It is a standard result in statistics (see, e.g., 471Proposition 1.4 in \cite{ASCB2005}) that the log likelihood function of a 472linear model is concave. Thus, a hill climbing method can be used to find the 473$\hat{\gamma}$. We used the EM algorithm for this purpose. 474 475Rather than using the direct ML estimates, we obtained a regularized 476estimate by importance sampling from the posterior distribution with a 477proposal distribution we explain below. The samples were also used to estimate 478variances for our estimates. 479 480It follows from standard MLE asymptotic 481theory that the $\hat{\gamma}$ are asymptotically multivariate normal with 482variance-covariance matrix given by the inverse of the observed Fisher 483information matrix. This matrix is defined as follows: 484 485\begin{defn}[Observed Fisher information matrix] 486The observed Fisher information matrix is the negative of the Hessian of the log likelihood function evaluated at the maximum likelihood estimate. That is, for parameters $\Theta=(\theta_1,\ldots,\theta_n)$, the $n \times n$ matrix is 487\begin{eqnarray} 488\mathcal{F}_{k,l}(\hat{\Theta}) & = & - \frac{\partial^2 log(\mathcal{L}(\Theta|R))}{\partial \theta_k \theta_l} |_{\theta=\hat{\theta}}. 489\end{eqnarray} 490\end{defn} 491In our case, considering a single locus $g$, the parameters are $\Theta = (\gamma_{t_1},\ldots,\gamma_{t_{|g|}})$, and as expected from Proposition \ref{prop:linearmodel}: 492\begin{eqnarray} 493\label{eq:Fisher} 494\mathcal{F}_{t_k,t_l}(\hat{\Theta}) & = & \sum_{r \in R:r \in g} 495\left[ \frac{1}{\left( \sum_{h \in g} \hat{\gamma}_h \frac{F(I_h(r))}{l(h) - I_h(r)+1} \right)^2} \frac{F(I_{t_k}(r)) F(I_{t_l}(r)) }{(l(t_k)-I_{t_k}+1)(l(t_l)-I_{t_l}+1)} \right]. 496\end{eqnarray} 497 498 499Because some of the transcript abundances may be close to zero, we 500adopted the 501Bayesian approach of \cite{Jiang2009} and instead sampled from the 502joint posterior distribution of $\Theta$ using the proposal 503distribution consisting of the multivariate normal with mean 504given by the MLE, and variance-covariance matrix given by the inverse of 505(\ref{eq:Fisher}). If the Observed Fisher Information Matrix is 506singular then the user is warned and the confidence intervals of all 507transcripts are set to $[0,1]$ (meaning that there is no information 508about relative abundances). 509 510The method used for sampling was importance sampling. The samples were used to obtain a maximum-a-posterior estimate for 511$\hat{\gamma}_t$ for each $t$ and for the variance-covariance matrix which we 512denote by $\Psi^g$ (where $g \in G$ denotes the locus). Note that $\Psi^g$ is 513a $|g| \times |g|$ matrix. The covariance between $\hat{\gamma}_{t_k}$ and 514$\hat{\gamma}_{t_l}$ for $t_k,t_l \in g$ is given by $\Psi^g_{t_k,t_l}$. 515 516Turning to the maximum likelihood estimates $\hat{\beta}$, we use the fact that the model is the log-linear. Therefore, 517\begin{equation} 518\label{eq:sigmahat} 519\hat{\beta_g} = \frac{X_{g}}{M}. 520\end{equation} 521 522Viewed as a random variable, the counts $X_{g}$ are approximately Poisson and therefore the variance of the MLE $\hat{\beta}_g$ is approximately $X_{g}$. We note that for the tests in this paper we directly used the total counts $M$ and the proportional counts $X_g$, however it is easy to incorporate recent suggestions for total count normalization, such as \cite{Bullard2010} into {\tt Cufflinks}. 523 524The favored units for reporting expression in RNA-Seq studies to date is not 525using the transcript abundances directly, but rather using a measure 526abbreviated as FPKM, which means ``expected number of fragments per kilobase 527of transcript sequence per millions base pairs sequenced''. These units are 528equivalent to measuring transcript abundances (multiplied by a scalar). The 529computational advantage of FPKM, is that the normalization constants 530conveniently simplify some of the formulas for the variances of transcript 531abundance estimates. 532 533For example, the abundance of a transcript $t \in g$ in FPKM units is 534\begin{equation} 535\label{eq:FPKM1} 536 \frac{10^6 \cdot 10^3 \cdot \alpha_t}{\tilde{l}(t)} = \frac{10^6 \cdot 10^3 \cdot \beta_g \cdot \gamma_t}{\tilde{l}(t)}. 537\end{equation} 538 539Equation (\ref{eq:FPKM1}) makes it clear that although the abundance of each 540transcript $t \in g$ in FPKM units is proportional to the transcript abundance 541$\rho_t$ it is given in terms of the read abundances $\beta_g$ and $\gamma_t$ 542which are the parameters estimated from the likelihood function. 543 544The maximum likelihood estimates of $\beta_g$ and $\gamma_t$ are random 545variables, and we denote their scaled product (in FPKM units) by $A_t$. That 546is $Pr(A_t = a)$ is the probability that for a random set of fragment alignments 547from a sequencing experiment, the maximum likelihood estimate of the 548transcript abundance for $t$ in FPKM units is $a$. 549 550Using the fact that the expectation of a product of independent random 551variables is the product of the expectations, for a transcript $t \in g$ we 552have 553\begin{equation} 554E[A_t] = \frac{10^9X_{g}\hat{\gamma}_t}{\tilde{l}(t)M}. 555\end{equation} 556 557Given the variance estimates for the $\hat{\gamma}_t$ we turn to the problem of estimating $Var[A_t]$ for a transcript $t \in g$. We use Lemma \ref{lemma:varproduct} to obtain 558\begin{eqnarray} 559Var[A_t]& = & \left(\frac{10^9}{\tilde{l}(t)M}\right)^2 \left( \Psi^g_{t,t}X_{g} + \Psi^g_{t,t}X^2_{g} + (\hat{\gamma}_t)^2 X_{g} \right)\\ 560& = & X_{g}\left(\frac{10^9}{\tilde{l}(t)M)}\right)^2 \left( \Psi^g_{t,t}(1+X_{g}) + (\hat{\gamma}_t)^2\right). 561\end{eqnarray} 562 563This variance calculation can be used to estimate a confidence interval by 564utilizing the fact \cite{Aroian1978} that when the expectation divided by the 565standard deviation of at least one of two random variables is large, their 566product is approximately normal. 567 568Next we turn to the problem of estimating expression levels (and variances of 569these estimates) for groups of transcripts. Let $S \subset T$ be a group of 570transcripts located in a single locus $g$, e.g. a collection of transcripts 571sharing a common TSS. 572 573The analogy of Equation (\ref{eq:FPKM1}) for the FPKM of the group is 574\begin{eqnarray} 575\label{eq:grouphard} 576& & \frac{10^6 \cdot 10^3 \cdot \beta_g \cdot \left( \sum_{t \in S} \gamma_t\right)}{\tilde{l}(S)}\\ 577& = & 10^6 \cdot 10^3 \cdot \beta_g \cdot \sum_{t \in S} \frac{\gamma_t}{\tilde{l}(t)}. \label{eq:groupeasy} 578\end{eqnarray} 579 580As before, we denote by $B_S$ the random variables for which $Pr(B_S = b)$ is 581the probability that for a random set of fragment alignments from a sequencing 582experiment, the maximum likelihood estimate of the transcript abundance for 583all the transcripts in $S$ in FPKM units is $b$. We note that the $B_S$ are 584products and sums of random variables (Equation (\ref{eq:groupeasy})). This 585makes Equation (\ref{eq:groupeasy}) more useful than the 586equivalent unsimplified Equation (\ref{eq:grouphard}), especially because 587$\tilde{l}(S)$ is, in general, a ratio of two random variables. 588 589We again use the fact that the expectation of independent random variables is 590the product of the expectation, in addition to the fact that expectation is a 591linear operator to conclude that for a group of transcripts $S$, 592 593\begin{equation} 594\label{eq:expectTSS} 595E[B_S] = \frac{10^9 \cdot X_g \cdot \sum_{t \in S} \frac{\hat{\gamma}_t}{\tilde{l}(t)}}{M}. 596\end{equation} 597In order to compute the variance of $B_S$, we first note that 598\begin{equation} 599Var\left[\sum_{t \in S} \frac{\hat{\gamma}_t}{\tilde{l}(t)}\right] = \sum_{t \in S}\frac{1}{\tilde{l}(t)^2}\Psi^g_{t,t} + \sum_{t,u \in S} \frac{1}{\tilde{l}(t)\tilde{l}(u)} \Psi^g_{t,u}. 600\end{equation} 601Therefore, 602\begin{eqnarray} 603& & \quad Var[B_S] \quad = \quad \nonumber \\ 604& & X_g\left(\frac{10^9}{M}\right)^2\left( \left(1+X_g\right) \left(\sum_{t \in S}\frac{1}{\tilde{l}(t)^2}\Psi^g_{t,t} + \sum_{t,u \in S} \frac{1}{\tilde{l}(t)\tilde{l}(u)} \Psi^g_{t,u}\right) + \left( \sum_{t \in S} \frac{\hat{\gamma}_t}{\tilde{l}(t)} \right)^2 \right). \label{eq:varTSS} 605\end{eqnarray} 606 607We can again estimate a confidence interval by utilizing the fact that $B_S$ 608is approximately normal \cite{Aroian1978}. 609 610\subsection{Assessment of abundance estimation} 611 612We evaluated the accuracy of {\tt Cufflinks}' transcript abundance estimates by first comparing the estimated FPKM values for the spiked-in sequences in each sample against their intended concentrations. Spike FPKMs were highly correlated across a 5-log dynamic range in all four samples (Supplementary Figure \ref{spikes}). However, because sequenced spike fragments were unambiguously mappable, we performed additional simulation to measure the accuracy of the software in alternatively spliced loci. 613 614\begin{figure}[h] 615 \includegraphics{pdfs/spikes.pdf} 616 \caption[{\tt Cufflinks}' abundance estimates of spiked-in sequences]{{\tt Cufflinks}' abundance estimates of spiked-in sequences.\label{spikes}} 617\end{figure} 618 619\begin{figure}[h] 620 \includegraphics[scale=0.5]{pdfs/sim_small.pdf} 621 \caption[Accuracy of {\tt Cufflinks} abundance estimates]{\emph{In silico} assessment of the accuracy of {\tt Cufflinks}’ abundance estimation when provided with a perfect assembly (a) and after {\it de novo} comparative assembly (b). Red points indicate in silico transcripts that were only partially recovered, where black points were fully reconstructed by {\tt Cufflinks}. Simulated reads were aligned with {\tt TopHat} and the alignments were provided to {\tt Cufflinks} along with the structures of the transcripts in the simulated sample. \label{sim}} 622\end{figure} 623 624To assess the accuracy of {\tt Cufflinks}' estimates, 625we simulated an RNA-Seq experiment using the FluxSimulator, a freely available 626software package that models whole-transcriptome sequencing experiments with 627the Illumina Genome Analyzer. \cite{Sammeth} The software works by first 628randomly assigning expression values to the transcripts provided by the user, 629constructing an amplified, size-selected library, and sequencing 630it. Mouse UCSC transcripts were supplied to the software, along with build 63137.1 of the genome. FluxSimulator then randomly assigned expression ranks to 63218,935 transcripts, with the expression value $y$ computed from the rank $x$ 633according to the formula 634 635\begin{equation} 636 y =\left( \frac{x}{5.0 \times 10^7}\right)^{-0.6}e^{-\left(\frac{x}{9.5\times{10^3}}\right) - \left(\frac{x}{9.5\times{10^3}}\right)^2}. 637\end{equation} 638 639From these relative expression levels, the software constructed an \emph{in 640silico} RNA sample, with each transcript assigned a number of molecules 641according to its abundances. The software modeled the polyadenylation of each 642transcript by adding a poly-A tail (of mean length 125nt) after the terminal 643exon. FluxSimulator then simulated reverse transcription of \emph{in silico} 644mRNAs by random hexamer priming, followed by size selection of RT products to 645between 175 and 225 nt. The resulting ``library'' of 6,601,805 cDNA fragments 646was then sampled uniformly at random for simulated sequencing, where the 647initial and terminal 75bp of each selected fragment were reported as reads. 648FluxSimulator does not allow precise control over the number of reads 649generated (Michael Sammeth, personal communication), but nevertheless 650generated 13,203,516 75nt paired-end RNA-Seq reads. These reads included 651sequencing errors; FluxSimulator includes a position-specific sequencing 652error model. 653 654Fragments were mapped with {\tt TopHat} to the mouse genome using identical 655parameters to those used to map the C2C12 reads, mapping a total of 6,176,961 656(93\% of the library). These alignments were supplied along with the exact set 657of expressed transcripts to {\tt Cufflinks}, to measure {\tt Cufflinks}' abundance 658estimation accuracy when working with a ``perfect'' assembly (Supplementary 659Figure \ref{sim}). Estimated FPKM was very close to true \emph{in silico} FPKM 660across a dynamic range of expression of nearly six orders of magnitude 661($R^2=0.95$). 662 663\begin{figure}[!ht] 664 \includegraphics[scale=0.30]{pdfs/vs_no_discovery_small.pdf} 665 \caption[Improved abundance accuracy with novel 666 transcripts]{Excluding novel C2C12 transcripts from abundance 667 estimation results in inaccurate estimates for known transcripts. \label{vs_no_discovery}} 668\end{figure} 669 670Estimation of transcript abundances by assigning fragments to them may be inaccurate if 671one is working with an incomplete set of transcripts for a particular sample. 672To evaluate the impact of missing transcripts, we removed the newly discovered 673transcripts from our high-confidence set and re-estimated the abundances of 674known transcripts, and then compared them to those obtained when working with 675the complete high-confidence set. While estimates of known transcripts were 676overall similar or identical when working with both sets, reflecting 677single-isoform or fully annotated genes, isoforms of some alternatively 678spliced genes differed greatly. (Supplementary Figure \ref{vs_no_discovery}) 679 680As a final note, we point out that a na\"{i}ve, yet popular, current 681approach to expression estimation is to sum the reads mapping to a 682gene (where the sum is taken across all exons appearing in all 683possible isoforms), and then to normalize the count by either the 684total number of exonic bases, or by the average length of the 685transcripts. We call the former method the ``projective normalization'' 686method, and the latter the ``average length'' method. 687 688\begin{prop} 689The totally projective normalization method is correct only for single isoform 690genes. If a gene has two or more isoforms the expression is underestimated. 691\end{prop} 692{\bf Proof}: The effective length of the gene is overestimated, hence 693the expression level is underestimated. To see this, first note that 694the length of some transcript in a gene is less than the total number 695of exonic bases among all transcripts. Then, if $a_1,\ldots,a_n$ are 696real numbers all greater than zero and $b_1,\ldots,b_n$ are not all equal, we have 697\begin{equation} 698\frac{\sum_{i=1}^na_ib_i}{\sum_{i=1}^n a_i} < max_{i}(b_i), 699\end{equation} 700so that the effective length in equation (\ref{eq:effective_length}) is always less than the total 701number of exonic bases among all transcripts. 702 703Stated differently, the projective normalization method has the problem that 704it produces numbers that are not proportional to the $\rho$, so that 705it is not additive. The average length method is flawed for the same reason. The 706transcript abundances are not taken into account in computing the 707effective lengths. In some cases the method might produce the correct 708answer (for the wrong reasons), but it is bound to be incorrect on most examples, 709especially in genes with transcripts of variable lengths and 710non-uniform abundances. 711 712\section{Transcript assembly} 713\label{sec:assembly} 714\subsection{Overview} 715 716{\tt Cufflinks} takes as input alignments of RNA-Seq fragments to a reference 717genome and, in the absence of an (optional) user provided annotation, initially assembles transcripts from the alignments. Transcripts in 718each of the loci are assembled independently. The assembly algorithm 719is designed to aim for the following: 720\begin{enumerate} 721\item Every fragment is consistent with at least one assembled 722 transcript. 723\item Every transcript is tiled by reads. 724\item The number of transcripts is the smallest required to satisfy requirement (1). 725\item The resulting RNA-Seq models (in the sense of Section \ref{sec:abundances}) are identifiable. 726\end{enumerate} 727In other words, we seek an assembly that parsimoniously “explains” the fragments from the RNA-Seq experiment; every 728fragment in the experiment (except those filtered out during a preliminary 729error-control step) should have come from a {\tt Cufflinks} transcript, and 730{\tt Cufflinks} should produce as few transcripts as possible with that property. 731Thus, {\tt Cufflinks} seeks to optimize the criterion suggested in \cite{Xing2004}, 732however, unlike the method in that paper, {\tt Cufflinks} leverages Dilworth's 733Theorem \cite{Dilworth1950} to solve the problem by reducing it 734to a matching problem via the equivalence of Dilworth's and K\"{o}nig's theorems (Theorem \ref{thm:dilko} in Appendix A). Our approach to isoform reconstruction is inspired by a similar approach used for haplotype reconstruction from HIV quasi-species \cite{Eriksson2008}. 735 736\subsection{A partial order on fragment alignments} 737 738The {\tt Cufflinks} program loads a set of alignments in SAM format sorted by reference 739position and assembles non-overlapping sets of alignments independently. After 740filtering out any erroneous spliced alignments or reads from incompletely 741spliced RNAs, {\tt Cufflinks} constructs a partial order (Definition \ref{def:po}), or equivalently a directed acyclic graph (DAG), with 742one node for each fragment that in turn consists of an aligned pair of mated reads. First, we note that fragment alignments are of two types: those where reads align in their entirety to the genome, and reads which have a split alignment (due to an implied intron). 743 744In the case of single reads, the partial order can be simply 745constructed by checking the reads for {\em compatibility}. Two reads 746are {\em compatible} if their overlap contains the exact same implied 747introns (or none). If two reads are not compatible they are {\em 748 incompatible}. The reads can be partially ordered by defining, for 749two reads $x,y$, that $x\leq y$ if the starting coordinate of $x$ is 750at or before the starting coordinate of $y$, and if they are 751compatible. 752 753In the case of paired-end RNA-Seq the situation is more complicated 754because the unknown sequence between mate pairs. To understand this, 755we first note that pairs of fragments can still be determined to be 756incompatible if they cannot have originated from the same 757transcript. As with single reads, this happens when there is 758disagreement on implied introns in the overlap. However compatibility 759is more subtle. We would like to define a pair of fragments $x,y$ to 760be compatible if they do not overlap, or if every implied intron in one fragment overlaps an identical implied intron in the other fragment. 761 762However it is important to note that it may be impossible to determine 763the compatibility (as defined above) or incompatibility of a pair of fragments. For example, an unknown region internal to a fragment may overlap two different introns (that are incompatible with each other). The fragment may be compatible with one of the introns (and the fragment from which it originates) in which case it is incompatible with the other. Since the opposite situation is also feasible, compatibility (or incompatibility) cannot be assigned. Fragments for which the compatibility/incompatibility cannot be determined with respect to every other fragment are called {\em uncertain}. Finally, two fragments are called {\em nested} if one is contained within the other. 764 765 766\begin{figure}[h] 767 \includegraphics[scale=0.5]{pdfs/compatibility.pdf} 768 \caption[Compatibility and incompatibility of 769 fragments]{Compatibility and incompatibility of 770 fragments. End-reads are solid lines, unknown sequences within 771 fragments are shown by dotted lines and implied introns are 772 dashed lines. The reads in (a) are compatible, whereas the 773 fragments in (b) are incompatible. The fragments in (c) are 774 nested. Fragment $x_4$ in (d) is uncertain, because $y_4$ and 775 $y_5$ are incompatible with each other.} 776\end{figure} 777 778Before constructing a partial order, fragments are extended to include 779their nested fragments and uncertain fragments are discarded. These discarded fragments 780are used in the abundance estimation. In theory, this may result in 781suboptimal (i.e. non-minimal assemblies) but we determined empirically 782that after assembly uncertain fragments are almost always consistent 783with one of the transcripts. When they are not, there was no 784completely tiled transcript that contained them. Thus, we employ a 785heuristic that substantially speeds up the program, and that works 786in practice. 787 788A partial order $P$ is then constructed from the remaining fragments 789by declaring that $x \leq y$ whenever the fragment corresponding to 790$x$ begins at, or before, the location of the fragment corresponding 791to $y$ and $x$ and $y$ are compatible. In what follows we identify $P$ 792with its Hasse diagram (or covering relation), equivalently a directed 793acyclic graph (DAG) that is the transitive reduction. 794 795\begin{prop} 796$P$ is a partial order. 797\end{prop} 798{\bf Proof}: The fragments can be totally ordered according to the locations where they begin. It therefore suffices to check that if $x,y,z$ are fragments with $x$ compatible with $y$ and $y$ compatible with $z$ then $x$ is compatible with $z$. Since $x$ is not uncertain, it must be either compatible or incompatible with $z$. The latter case can occur only if $x$ and/or $z$ contain implied introns that overlap and are not identical. Since $y$ is not nested within $z$ and $x$ is not nested within $y$, it must be that $y$ contains an implied intron that is not identical with an implied intron in either $x$ or $z$. Therefore $y$ cannot be compatible with both $x$ and $z$. \qed 799 800\subsection{Assembling a parsimonious set of transcripts} 801 802In order to assemble a set of transcripts, {\tt Cufflinks} finds a 803(minimum) partition of $P$ into chains (see Definition 804\ref{def:po}). A partition of $P$ into chains yields an assembly 805because every chain is a totally ordered set of compatible fragments 806$x_1,\ldots,x_l$ and therefore there is a set of overlapping fragments 807that connects them. By Dilworth's theorem (Theorem \ref{thm:Dilworth}), the problem of finding a minimum partition $P$ into chains is equivalent to finding a maximum antichain in $P$ (an antichain is a set of mutually incompatible fragments). Subsequently, by Theorem \ref{thm:dilko}, the problem of finding a maximum antichain in $P$ can be reduced to finding a maximum matching in a certain bipartite graph that emerges naturally in deducing Dilworth's theorem from K\"{o}nig's theorem \ref{thm:konig}. 808We call the key bipartite graph the ``reachability'' graph. It is the transitive closure of the DAG, i.e. it is the graph 809where each fragment $x$ has nodes $L_x$ and $R_x$ in the left and 810right partitions of the reachability graph respectively, and where there is an edge between $L_x$ and $R_y$ when $x \leq y$ in $P$. The maximum matching problem is a classic problem that admits a polynomial time algorithm. The Hopcroft-Karp algorithm \cite{Hopcroft1973} has a run time of $O(\sqrt{V}E)$ where in our case $V$ is the number of fragments and $E$ depends on the extent of overlap, but is bounded by a constant times the coverage depth. We note that our parsimony approach to assembly therefore has a better complexity than the $O(V^3)$ PASA algorithm \cite{Haas2003}. 811 812The minimum cardinality chain decomposition computed using the approach above may not be unique. 813For example, a locus may contain two putative distinct initial exons (defined by overlapping incompatible fragments), and one 814of two distinct terminal and a constitutive exon in between that is longer than any 815read or insert in the RNA-Seq experiment. In such a case, the parsimonious assembly will consist of two transcripts, but there are four possible solutions that are all minimal. In order to ``phase'' distant exons, we leverage the fact that abundance inhomogeneities can link distant exons via their coverage. We therefore weight the edges of the bipartite reachability graph based on the percent-spliced-in metric introduced by 816Wang \emph{et al.} in \cite{Wang2008}. In our setting, the percent-spliced-in 817$\psi_x$ for an alignment $x$ is computed by counting the alignments 818overlapping $x$ in the genome that are compatible with $x$ and dividing by the 819total number of alignments that overlap $x$, and normalizing for the length of 820the $x$. The cost $C(y,z)$ assigned to an edge between alignments $y$ and $z$ reflects 821the belief that they originate from different transcripts: 822 823 824\begin{equation} 825 C(y,z) = -\log(1 - |\psi_y - \psi_z|). 826 \label{eq:match_cost} 827\end{equation} 828 829Rather than using the Hopcroft-Karp algorithm, a modified version of the {\tt 830LEMON} \cite{LEMON} and {\tt Boost} \cite{Boost} graph libraries are used to 831compute a {\em min-cost} maximum cardinality matching on the bipartite compatibility 832graph. Even with the presence of weighted edges, our algorithm is 833very fast. The best known algorithm for weighted matching is $O(V^2logV+VE)$. 834 835Because we isolated total RNA, we 836expected that a small fraction of our reads would come from the intronic 837regions of incompletely processed primary transcripts. Moreover, transcribed 838repetitive elements and low-complexity sequence result in ``shadow'' 839transfrags that we wished to discard as artifacts. Thus, {\tt Cufflinks} 840heuristically identifies artifact transfrags and suppresses them in its 841output. We also filter extremely low-abundance minor isoforms of alternatively 842spliced genes, using the model described in Section \ref {abundances} as a 843means of reducing the variance of estimates for more abundant transcripts. A 844transcript $x$ meeting any of the following criteria is suppressed: 845 846\begin{enumerate} 847 \item $x$ aligns to the genome entirely within an intronic region of the alignment for a transcript $y$, and the abundance of $x$ is less than 15\% of $y$'s abundance. 848 \item $x$ is supported by only a single fragment alignment to the genome. 849 \item More than 75\% of the fragment alignments supporting $x$, are mappable to multiple genomic loci. 850 \item $x$ is an isoform of an alternatively spliced gene, and has an estimated abundance less than 5\% of the major isoform of the gene. 851\end{enumerate} 852 853Prior to transcript assembly, {\tt Cufflinks} also filters out some of the alignments 854for fragments that are likely to originate from incompletely spliced nuclear 855RNA, as these can reduce the accuracy abundance estimates for fully spliced 856mRNAs. These filters and the output filters above are detailed in the source 857file \verb!filters.cpp! of the source code for {\tt Cufflinks}. 858 859In the overview of this Section, we mentioned that our assembly algorithm has the property that the resulting models are identifiable. This is a convenient property that emerges naturally from the parsimony criterion for a ``minimal explanation'' of the fragment alignments. Formally, it is a corollary of Dilworth's theorem: 860 861\begin{prop} 862The assembly produced by the {\tt Cufflinks} algorithm always results in an 863identifiable RNA-Seq model. 864\end{prop} 865{\bf Proof}: By Dilworth's theorem, the minimum chain decomposition (assembly) 866we obtain has the same size as the maximum antichain in the partially 867ordered set we 868construct from the reads. An antichain consists of reads that are 869pairwise incompatible, and therefore those reads must form a 870permutation sub-matrix in the fragment-transcript matrix $A_{R,T}$ with columns corresponding to the transcripts in a locus, and with rows corresponding to the fragments in the antichain. The matrix $A_{R,T}$ therefore contains permutation sub-matrices that together span all the columns, and the matrix is full-rank. 871 872\subsection{Assessment of assembly quality} 873 874To compare {\tt Cufflinks} transfrags against annotated transcriptomes, and 875also to find transfrags common to multiple assemblies, we developed a tool 876called {\tt Cuffcompare} that builds structural equivalence classes of 877transcripts. We ran {\tt Cuffcompare} on each the assembly from each time point against the 878combined annotated transcriptomes of the {UCSC known genes}, {\tt Ensembl}, 879and {\tt Vega}. Because of the stochastic nature of sequencing, \emph{ab 880initio} assembly of the same transcript in two different samples may result in 881transfrags of slightly different lengths. A {\tt Cufflinks} transfrag was 882considered a complete match when there was a transcript with an identical 883chain of introns in the combined annotation. 884 885When no complete match is found between a {\tt Cufflinks} transfrag and the 886transcripts in the combined annotation, {\tt Cuffcompare} determines and reports if 887another substantial relationship exists with any of the annotation 888transcripts that can be found in or around the same genomic locus. For 889example, when all the introns of a transfrag match perfectly a part of the 890intron chain (sub-chain) of an annotation transcript, a ``containment'' 891relationship is reported. For single-exon transfrags, containment is also 892reported when the exon appears fully overlapped by any of the exons of an 893annotation transcript. If there is no perfect match for the intron chain of a 894transfrag but only some exons overlap and there is at least one intron-exon 895junction match, {\tt Cuffcompare} classifies the transfrag as a putative ``novel'' 896isoform of an annotated gene. When a transfrag is unspliced (single-exon) and 897it overlaps the intronic genomic space of a reference annotation transcript, 898the transfrag is classified as potential pre-mRNA fragment. Finally, when no 899other relationship is found between a {\tt Cufflinks} transfrag and an 900annotation transcript, {\tt Cuffcompare} can check the repeat content of the 901transfrag's genomic region (assuming the soft-masked genomic sequence was also 902provided) and it would classify the transfrag as ``repeat'' if most of its 903bases are found to be repeat-masked. 904 905When provided multiple time point assemblies, {\tt Cuffcompare} matches transcripts 906between samples that have an identical intron structure, placing all mutually 907matching transcripts in the same equivalence class. The program reports a 908non-redundant set of transcript structures, choosing the longest transcript 909from each equivalence class as the representative transcript. {\tt Cuffcompare} also 910reports the relationships found between each equivalence class (transcripts 911that have a complete match across time points) and reference transcripts from 912the combined annotation set, where applicable. 913 914\begin{figure}[h] 915 \includegraphics{pdfs/categories.pdf} 916 \caption[Categorization of {\tt Cufflinks} transcripts by estimated depth of read coverage]{Categorization of {\tt Cufflinks} transcripts by estimated depth of read coverage. \label{categories}} 917\end{figure} 918 919Table \ref{category_table} includes the classifications of the transfrags 920reported by {\tt Cufflinks} after assembling the C2C12 reads. While only 92113.5\% of assembled transfrags represent known transcripts, {\tt Cufflinks} 922assigns more than 76\% of reads to these, reflecting the fact that moderate 923and highly-abundant transfrags generate most of the library fragments in the 924experiment. Less abundant transcripts receive less complete sequencing 925coverage, resulting in numerous transfrags that partially but compatibly match 926known transcripts. Supplementary Figure \ref{categories} shows the categories 927of {\tt Cufflinks} transfrags as estimated depth of sequencing coverage 928increasing. 929 930\begin{table}[h]{ 931 \Small 932 \begin{tabular}{l|c|c|c} 933 \hline 934 \textbf{Category} & \textbf{Transfrags} & \textbf{\% of total transfrags} & \textbf{Assembled reads} (\%) \\ 935 \hline 936 Match to known isoform & 39,857 & 13.5 & 76.7 \\ 937 Novel isoform of known gene & 18,565 & 6.3 & 11.3 \\ 938 Contained in known isoform & 71,029 & 24.1 & 4.6 \\ 939 Repeat & 41,906 & 14.2 & 0.6 \\ 940 Intronic & 32,658 & 11.1 & 0.6 \\ 941 Polymerase run-on & 18,522 & 6.3 & 0.5 \\ 942 Intergenic & 48,604 & 16.5 & 1.2 \\ 943 Other artifacts & 22,483 & 7.7 & 4.5 \\ 944 \hline 945 Total transfrags & 293,624 & 100.0 & 100.0 \\ 946 \hline 947 948 \end{tabular} 949 } 950\caption[Types of assembled transfrags]{Classification of all transfrags produced at any time point with respect to annotated gene models and masked repeats in the mouse genome. Transfrags that are present in multiple time point assemblies are multiply counted to preserve the relative distribution of transfrags among the categories across the full experiment. \label{category_table}} 951\end{table} 952 953We selected the {\tt Cufflinks} transfrags that did not have a complete match 954or ``containment'' relationship with a known annotation transcript, but were 955classified by {\tt Cuffcompare} as putative ``novel isoforms'' of known genes. We 956explored the sequence similarity between these transfrags and two sets of mRNA 957sequences: one set representing the mouse transcriptome and consisting of all 958mouse ESTs in dbEST plus all reviewed or validated RefSeq mouse mRNAs, and the 959other consisting of all reviewed or validated RefSeq mRNAs from other 960mammalian species. 961 962We used megablast to map all mouse ESTs onto this set of {\tt Cufflinks} 963transfrags, only keeping EST alignments where at least 80\% of the EST length 964was aligned with at least 95\% identity. We calculated transfrag coverage by 965tiling overlapping EST mappings on each transfrag and counted only those 966transfrags that are covered by ESTs for at least 80\% of the transfrag length 967without any coverage gaps, and with coverage discontinuities only allowed at 968no more than 10\% distance from either end. For the mouse mRNAs alignments we 969also used megablast with the same basic coverage cutoffs (minimum 80\% covered 970with no more than 10\% unaligned on either side of the overlap) but applied to 971each pairwise alignment independently (i.e. as opposed to EST alignments, no 972coverage tiling was considered for mRNA alignments). For alignments with the 973non-mouse mRNAs we used discontiguous megablast with a dual (combined) 974discontiguous word template (option -N 2), with the same coverage assessment 975protocol as in the case of mouse mRNA alignments but with the percent identity 976cutoff lowered to 80\%. 977 978\begin{figure}[h] 979 \includegraphics{pdfs/FHL3_RT-PCR.pdf} 980 \caption[New isoform of Fhl3]{New and known isoforms of Fhl3 recovered by {\tt Cufflinks} at each time point (a) were confirmed by form-specific RT-PCR (b). \label{FHL3_RT-PCR}} 981\end{figure} 982 983\begin{figure}[h] 984 \includegraphics{pdfs/Additional_RT-PCR.pdf} 985 \caption[RT-PCR validation of selected genes]{RT-PCR of selected genes. For Schip1, {\tt Cufflinks} assembled a known and a novel isoform (with a new TSS), both of which are detected by RT-PCR. Prkar1a is annotated with two alternate first exons and start sites in {\tt UCSC known genes}, both of which were detected. {\tt Cufflinks} assembles the known isoform of the splicing factor Sfpq, along with a novel variant that contains most of RIKEN clone. Tpm1, a gene known to have muscle- and non-muscle-specific isoforms displays previously observed alternative first and last exons. \label{RT-PCR}} 986\end{figure} 987 988\begin{table}[h]{ 989 \Small 990 \begin{tabular}{r|l|c|c} 991 992 \hline 993 \textbf{Primer name} & \textbf{sequence} & \textbf{product length} & \textbf{endpoint gel score} \\ 994 \hline 995 996 \textbf{FHL3 Ex1Ex3} & & & \\ 997 Left & CTCGCCGCTGCTCTCTCG & 221 & +++ \\ 998 Right & GTGTTGTCATAGCACGGAACG & & \\ 999 \textbf{FHL3 Ex2Ex3} & & & \\ 1000 Left & AGGAAGGGCTCACAAGTGG & 407 & +++ \\ 1001 Right & ATAGCACGGAACGCAGTAGG & & \\ 1002 \textbf{Sfpq Ex9Ex10} & & & \\ 1003 Left & GTGGTGGCATAGGTTATGAAGC & 936 & +++ \\ 1004 Right & CCATTTTCAAAAGCTTTCAAGG & & \\ 1005 \textbf{Sfpq Ex9Ex11} & & & \\ 1006 Left & GTGGTGGCATAGGTTATGAAGC & 172 & +++ \\ 1007 Right & CTCAAGTAAATAAGACTCCAAAATCAGC & & \\ 1008 \textbf{Prkar1aEx1Ex3} & & & \\ 1009 Left & ACAGCAGGGATCTCCTTGTCC & 418 & +++ \\ 1010 Right & CCTCTCAAAGTATTCCCGAAGG & & \\ 1011 \textbf{Prkar1aEx2Ex3} & & & \\ 1012 Left & GCTATCGCAGAGTGGTAGTGAGG & 279 & +++ \\ 1013 Right & CCTCTCAAAGTATTCCCGAAGG & & \\ 1014 \textbf{Schip1Ex1Ex3} & & & \\ 1015 Left & GGCTATGAGGGTGAAAAGTGC & 1050 & +++ \\ 1016 Right & GTATAGATTCCTGGGCCATCG & & \\ 1017 \textbf{Schip1Ex2Ex3} & & & \\ 1018 Left & CAGCATGAGTGGTAACCAAGG & 269 & +++ \\ 1019 Right & GTATAGATTCCTGGGCCATCG & & \\ 1020 \textbf{Tpm1Ex1Ex3} & & & \\ 1021 Left & TGAACAAAAGACCCCAGAGG & 565 & +++ \\ 1022 Right & CTGAAGTACAAGGCCATCAGC & & \\ 1023 \textbf{Tpm1Ex2Ex3} & & & \\ 1024 Left & AGTTTTATTGAGCGTTGAGACG & 318 & +++ \\ 1025 Right & CTGAAGTACAAGGCCATCAGC & & \\ 1026 \hline 1027 1028 \end{tabular} 1029 } 1030\caption[Form-specific RT-PCR primers for selected genes]{Form-specific RT-PCR primers for selected genes, designed with Primer3 \cite{Rozen2000}. \label{RT-PCR_primers}} 1031\end{table} 1032 1033 1034To assess the dependence of assembly quality on the depth of sequencing, we 1035mapped and assembled subsets of our reads at the 60 hour time point. We 1036partitioned the three Illumina lanes' worth of data (a total of ~140 million 1037reads) into 64 subsets. We then processed a single subset with {\tt TopHat} 1038and {\tt Cufflinks}, as above, and compared the resulting transfrags to the 1039output of {\tt Cufflinks} on all three lanes using {\tt Cuffcompare}. We 1040repeated the mapping and assembly with two subsets, four subsets, eight, and 1041so on. Figure 4 in the main text shows the fraction of reference transcripts 1042captured by {\tt Cufflinks} using all three lanes that are still captured when 1043less data is available. For transcripts with low abundance ($<$15 1044FPKM), increased sequencing yields more full-length transcripts. However, for 1045even moderately abundant transcripts ($\geq$15 FPKM), 75\% or more of 1046the transcripts are recovered with only ~40 million reads, or a lane's worth 1047of Illumina GA II sequencing. 1048 1049\section{Analysis of gene expression dynamics} 1050 1051Expression dynamics of genes are composed of absolute changes in 1052overall transcript abundances, as well as relative changes in 1053transcript abundances over time. Moreover, select groups of 1054transcripts, for example transcripts grouped by TSS, may exhibit 1055specific dynamics due to the underlying biological mechanisms that 1056drive expression. 1057 1058In this section we describe statistical tests we developed in the 1059multiple hypothesis testing framework for examining absolute and 1060relative changes in arbitrary groups of transcripts. 1061 1062\subsection{Selection of high-confidence transcripts for expression tracking} 1063 1064We first restricted our analysis of expression dynamics over the time-course 1065to a set of transcripts we believed were fully sequenced and correctly 1066assembled, and we focused only on known and reliable novel isoforms of 1067annotated genes. This set consisted of transcripts that either were present in 1068the {\tt UCSC genome browser}, {\tt Ensembl}, or {\tt Vega} annotated 1069transcriptomes, or were found in multiple C2C12 time point assemblies. We 1070ignored transfrags classified as intronic pre-mRNA or polymerase run-on, as 1071well as intergenic repeats to focus on coding genes and long non-coding RNAs. 1072This high-confidence set contained a total of 17,416 transcripts, 13,692 of 1073which were in {\tt UCSC known genes}, {\tt Ensembl} or {\tt VEGA} annotation 1074and 3,724 of which are novel. Running {\tt Cufflinks}' abundance estimation 1075algorithm on this high-confidence set of transcripts at each time point 1076allowed us to scan for differentially expressed transcripts, differentially 1077spliced pre-mRNAs, and genes with shifts in promoter preference. 1078 1079\subsection{Testing for changes in absolute expression} 1080 1081Between any two consecutive time points, we tested whether a transcript was 1082significantly (after FDR control \cite{Benjamini1995}) up or down regulated with respect to the 1083null hypothesis of no change, with variability in expression due solely to the 1084uncertainties resulting from our abundance estimation procedure. This was done 1085using the following testing procedure for absolute differential expression: 1086 1087We employed the standard method used in microarray-based expression analysis 1088and proposed for RNA-Seq in \cite{Bullard2010}, which is to compute the 1089logarithm of the ratio of intensities (in our case FPKM), and to then use the 1090delta method to estimate the variance of the log odds. We describe this for 1091testing differential expression of individual transcripts and also groups of 1092transcripts (e.g. grouped by TSS). 1093 1094We recall that the MLE FPKM for a transcript $t$ in a locus $g$ is given by 1095\begin{equation} 1096\frac{10^9X_g\hat{\gamma}_t}{\tilde{l}(t)M}. 1097\end{equation} 1098Given two different experiments resulting in $X^a_g,M^a$ and $X^b_g,M^b$ respectively, as well as $\hat{\gamma}^a_t$ and $\hat{\gamma}^b_t$, we would like to test the significance of departures from unity of the ratio of MLE FPKMS, i.e. 1099\begin{eqnarray} 1100& & \left(\frac{10^9X^a_g\hat{\gamma}^a_t}{\tilde{l}(t)M^a}\right) / \left(\frac{10^9X^b_g\hat{\gamma}^b_t}{\tilde{l}(t)M^b}\right)\\ 1101& = & \frac{X_g^a\hat{\gamma}^a_tM^b}{X_g^b \hat{\gamma}^b_tM^a}. 1102\end{eqnarray} 1103 1104This can be turned into a test statistic that is approximately normal by 1105taking the logarithm, and normalizing by the variance. We recall that using 1106the delta method, if $X$ is a random variable then $Var[log(X)] \approx 1107\frac{Var[X]}{E[X]^2}$. 1108 1109Therefore, our test statistic is 1110\begin{equation} 1111\frac{log(X^a_g)+log(\hat{\gamma}^a_t)+log(M^b)-log(X^b_g)-log(\hat{\gamma}^b_t)-log(M^a)} 1112{ \sqrt{\frac{\left( \Psi^{g,a}_{t,t}(1+X^a_{g}) + (\hat{\gamma}^a_t)^2\right)}{X^a_g \left(\hat{\gamma}^a_t \right)^2}+ 1113\frac{\left( \Psi^{g,b}_{t,t}(1+X^b_{g}) + (\hat{\gamma}^b_t)^2\right)}{X^b_{g}\left( \hat{\gamma}^b_t\right)^2} } }. 1114\end{equation} 1115 1116In order to test for differential expression of a group of transcripts, we 1117replace the numerator and denominator above by those from Equations 1118(\ref{eq:expectTSS}) and (\ref{eq:varTSS}). 1119 1120It is has been noted that the power of differential expression tests in RNA-Seq depend on the length of the transcripts being tested, because longer transcripts accumulate more reads \cite{Oshlack2009}. This means that the results we report are biased towards discovering longer differentially expressed transcripts and genes. 1121 1122\subsection{Quantifying transcriptional and post-transcriptional overloading} 1123 1124In order to infer the extent of differential promoter usage, we 1125decided to measure changes in relative abundances of primary 1126transcripts of single genes. Similarly, we investigated changes in 1127relative abundances of transcripts grouped by TSS in order to infer 1128differential splicing. These inferences required two ingredients: 1129\begin{enumerate} 1130\item A metric on probability distributions (derived from relative 1131 abundances). 1132\item A test statistic for assessing significant changes in 1133 differential promoter usage and splicing as measured using the 1134 metric referred to above. 1135\end{enumerate} 1136 1137In order to address the first requirement, namely a metric on 1138probability distributions, we turned to an entropy-based metric. This 1139was motivated by the methods in \cite{Ritchie2008} where tests for 1140differences in relative isoform abundances were performed to 1141distinguish cancer cells from normal cells. We extend 1142this approach to be able to test for relative isoform abundance changes among multiple 1143experiments in RNA-Seq. 1144 1145\begin{defn}[Entropy] 1146The entropy of a discrete probability distribution 1147$p=(p_1,\ldots,p_n)$ ($0 \leq p_i \leq 1$ and $\sum_{i=1}^n p_i = 1$) is 1148\begin{equation} 1149H(p) = -\sum_{i=1}^n p_i log p_i. 1150\end{equation} 1151If $p_i=0$ for some $i$ the value of $p_i log p_i$ is taken to be $0$. 1152\end{defn} 1153\begin{defn}[The Jensen-Shannon divergence] 1154The Jensen-Shannon divergence of $m$ discrete probability distributions $p^1,\ldots,p^m$ is defined to be: 1155\begin{equation} 1156JS(p^1,\ldots,p^m) = H\left(\frac{p^1 + \cdots + p^m}{m}\right) - \frac{\sum_{j=1}^m H(p^j)}{m}. 1157\end{equation} 1158 1159In other words, the Jensen-Shannon divergence of a set of probability 1160distributions is the entropy of their average minus the average of their 1161entropies. \end{defn} 1162 1163In the case where $m=2$, we remark that the Jensen-Shannon divergence can also 1164be described in terms of the Kullback-Leibler divergence of two discrete 1165probability distributions. If we denote Kullback-Leibler divergence by 1166 1167\begin{equation} 1168D(p^1\|p^2) = \sum_i p^1_i log \frac{p^1_i}{p^2_i}, 1169\end{equation} 1170then 1171\begin{equation} 1172JS(p^1,p^2) = \frac{1}{2}D(p^1\|m)+\frac{1}{2}D(p^2\|m) 1173\end{equation} 1174where $m=\frac{1}{2}(p^1+p^2)$. In other words the Jensen-Shannon 1175divergence is a symmetrized variant of the Kullback-Leibler divergence. 1176 1177The Jensen-Shannon divergence has a number of useful properties: for 1178example it is symmetric and non-negative. However it is {\em not} a 1179metric. The following theorem shows how to construct a metric 1180from the Jensen-Shannon divergence: 1181 1182\begin{thm}[Fuglede and Tops{\o}e 2004 \cite{Fuglede2004}] 1183The square root of the Jensen-Shannon divergence is a metric. 1184\end{thm} 1185 1186The proof of this result is based on a harmonic analysis argument that is the basis for 1187the remark in the main paper that ``transcript abundances move in time along a 1188logarithmic spiral in Hilbert space''. We therefore call the square root of the Jensen-Shannon divergence the {\em Jensen-Shannon metric}. We employed this metric in 1189order to quantify relative changes in expression in (groups of) transcripts. 1190 1191In order to test for significance, we introduce a bit of notation. 1192Suppose that $S$ is a collection of transcripts (for example, they may 1193 share a common TSS). We define 1194\begin{equation} 1195\kappa_t = \frac{ \frac{\gamma_t}{\tilde{l}(t)}}{\sum_{u \in S} \frac{\gamma_u}{\tilde{l}(u)}} 1196\end{equation} 1197to be the proportion of transcript $t$ among all the transcripts in a 1198group $S$. We let 1199$Z=\sum_{u \in S} \hat{\gamma}_u / \tilde{l}(u)$ so that $\hat{\kappa}_t = \frac{\gamma_t}{\tilde{l}(t)Z}$. We therefore have that 1200\begin{eqnarray} 1201\label{eq:variancekappa1} 1202Var[\hat{\kappa}_t] & = &\frac{Var[\hat{\gamma}_t]}{\tilde{l}(t)^2Z^2}, \\ 1203Cov[\hat{\kappa}_t,\hat{\kappa}_u] & = & \frac{Cov[\hat{\gamma}_t,\hat{\gamma}_u]}{\tilde{l}(t)\tilde{l}(u)Z^2}.\label{eq:variancekappa2} 1204\end{eqnarray} 1205 1206Our test statistic for divergent relative expression was the 1207Jensen-Shannon metric. The test could be applied to multiple time 1208points simultaneously, but we focused on pairwise tests (involving 1209consecutive time points). Under the null hypothesis of no change in 1210relative expression, the Jensen-Shannon metric should be 1211zero. We tested for this using a one-sided $t$-test, based on an 1212asymptotic derivation of the distribution of the Jensen-Shannon metric 1213under the null hypothesis. This asymptotic distribution is normal by applying the delta method approximation, 1214which involves computing the linear component of the Taylor expansion of the 1215variance of $\sqrt{JS}$. 1216 1217In order to simplify notation, we let $f(p^1,\ldots,p^m)$ be the 1218Jensen-Shannon metric for $m$ probability distributions $p^1,\ldots,p^m$. 1219 1220\begin{lemma} 1221The partial derivatives of the Jensen-Shannon metric are give by 1222\begin{equation} 1223\frac{\partial f}{\partial p^k_l} = \frac{1}{2m\sqrt{f(p^1,\ldots,p^m)}} log \left( \frac{p^k_l}{\frac{1}{m} \sum_{j=1}^m p_l^j} \right). 1224\end{equation} 1225\end{lemma} 1226 1227Let $\hat{\kappa}^1,\ldots,\hat{\kappa}^m$ denote $m$ probability 1228distributions on the set of transcripts $S$, for example the MLE for the transcript abundances in a time course. Then from the delta method we 1229have that $\sqrt{JS(\hat{\kappa}^1,\ldots,\hat{\kappa}^m)}$ is approximately normally 1230distributed with variance given by 1231 1232\begin{equation} 1233Var[\sqrt{JS(\hat{\kappa}^1,\ldots,\hat{\kappa}^m)}] \approx (\bigtriangledown f)^T \Sigma (\bigtriangledown f), 1234\end{equation} 1235 1236where $\Sigma$ is the variance-covariance matrix for the 1237$\kappa^1,\ldots,\kappa^m$, i.e., it is a block diagonal matrix where the 1238$i$th block is the variance-covariance matrix for the $\kappa^i_t$ given by 1239Equations (\ref{eq:variancekappa1},\ref{eq:variancekappa2}). 1240 1241 1242There are two biologically meaningful groupings of transcripts whose 1243relative abundances are interesting to track in a 1244time course. Transcripts that share a TSS are likely to be regulated by 1245the same promoter, and therefore tracking the change in relative 1246abundances of groups of transcripts sharing a TSS may reveal how 1247transcriptional regulation is affecting expression over 1248time. Similarly, transcripts that share a TSS and exhibit changes in 1249expression relative to each other are likely to be affected by 1250splicing or other post-transcriptional regulation. We therefore 1251grouped transcripts by TSS and compared relative abundance changes 1252within and between groups. 1253 1254We define ``overloading'' to be a significant change in relative 1255abundances for a set of transcripts (as determined by the Jensen-Shannon metric, see below). The term is intended to generalize the simple notion of 1256``isoform switching'' that is well-defined in the case of two 1257transcripts, to multiple transcripts. It is complementary to absolute 1258differential changes in expression: the overall expression of a gene 1259may remain constant while individual transcripts change drastically in 1260relative abundances resulting in overloading. The term is borrowed from 1261computer science, where in some statically-typed programming 1262languages, a function may be used in multiple, specialized instances 1263via ``method overloading''. 1264 1265We tested for overloaded genes by performing a one-sided $t$-test based on the asymptotics of the Jensen-Shannon metric under the null hypothesis of no change in relative abundnaces of isoforms (either grouped by shared TSS for for post-transcriptional overloading, or by comparison of groups of isoforms with shared TSS for transcriptional overloading). Type I errors were controlled with the Benjamini-Hochberg \cite{Benjamini1995} correction for multiple testing. 1266 A selection of overloaded genes are displayed in Supplemental Figs. \ref{splice_overloaded} and \ref{promoter_overloaded}. 1267 1268\newpage 1269 1270\begin{figure}[!ht] 1271 \includegraphics{pdfs/splice_overloaded} 1272 \caption[Selected genes with post-transcriptional 1273 overloading]{Selected genes with post-transcriptional 1274 overloading. Trajectories indicate the expression of individual 1275 isoforms in FPKM ($y$ axis) over time in hours ($x$ 1276 axis). Dashed isoforms have not been previously 1277 annotated. Isoform trajectories are colored by TSS, so isoforms 1278 with the same color presumably share a common promoter and are 1279 processed from the same primary transcript. It is evident that 1280 total gene expression may remain constant during isoforms 1281 switching (Eya3) while in other cases changes in relative 1282 abundance are accompanied by changes in absolute expression. 1283The Jensen-Shannon metric generalizes the notion of ``isoform 1284switching'' and is useful in cases with multiple isoforms 1285(e.g. Ddx17). 1286 \label{splice_overloaded}} 1287\end{figure} 1288 1289\begin{figure}[!ht] 1290 \includegraphics{pdfs/promoter_overloaded} 1291 \caption[Selected genes with transcriptional overloading]{Selected 1292 genes with transcriptional overloading. Trajectories indicate 1293 the expression of individual isoforms in FPKM (y axis) over time 1294 in hours (x axis). Dashed isoforms have not been previously 1295 annotated. Isoform trajectories are colored by TSS, so that 1296 isoforms with different colors presumably vary in their promoter 1297 and are processed from different primary transcripts. \label{promoter_overloaded}} 1298\end{figure} 1299 1300We can visualize overloading and expression dynamics with a plot that 1301superimposes transcriptional and post-transcriptional overloading and 1302gene-level expression over the time course. We refer to these as 1303``Minard plots'', after Charles Joseph Minard's famous depiction 1304of the advance and retreat of 1305Napoleon's armies in the campaign against Russia in 1812 1306\cite{Tufte2001}. Minard made use of multiple visual cues to display 1307numerous varying quantities in one diagram. An example of a Minard 1308plot for the gene 1309Myc is shown in Figure 3c, and others are given in Appendix B. The dotted line indicates gene-level FPKM, with measured 1310FPKM indicated by black circles. Grey circles indicate the arithmetic mean of 1311gene-level FPKM between consecutive measured time points, interpolating FPKM 1312at intermediate time points. The total gene expression overloading is 1313visualized as a swatch centered around the interpolated expression curve. 1314The width of the swatch encodes the amount of expression overloading between 1315successive time points. The color of the swatch indicates the relative 1316contributions of transcriptional and post-transcriptional expression 1317overloading. 1318 1319Some genes, such as tropomyosin I and II, feature a single primary 1320transcript, and so all overloading is by definition post-transcriptional. 1321Others, like Fhl3, have two primary transcripts, but only a single isoform 1322arises from each, so all overloading is transcriptional. Genes with multiple 1323primary transcripts, one or more of which are alternatively spliced, such as 1324Myc or RTN4, display both forms. 1325 1326 1327\section{The {\tt Cufflinks} software} 1328 1329The transcript assembly and abundance estimation algorithms are implemented in 1330freely available open source software called {\tt Cufflinks} that is available 1331from \newline {\tt http://cufflinks.cbcb.umd.edu/} \newline Furthermore 1332methods for comparing annotations across time points, and for performing the 1333differential expression, promoter usage and splicing tests are implemented in 1334the companion programs {\tt Cuffdiff} and {\tt Cuffcompare}. Instructions on 1335how to install and run the software are provided on the website. 1336 1337The input to {\tt Cufflinks} consists of fragment alignments in the SAM format 1338\cite{Li2009a}. These may consist of either single fragment alignments, or alignments of mate-pairs (paired-end reads produce better assemblies and more accurate abundance estimates than single reads). {\tt Cufflinks} will assemble the transcripts using the 1339algorithm in Section \ref{sec:assembly}, and transcript abundances will be 1340estimated using the model in Section \ref{sec:abundances}. Transcript 1341coordinates and abundances are reported in the Gene Transfer Format (GTF). 1342User supplied annotations may be provided to {\tt Cufflinks} (optional input) in 1343which case they form the basis for the transcript abundance estimation. 1344 1345Some of the algorithms here rely on sufficient depth of sequencing in order to 1346produce reliable output. {\tt Cufflinks} determines that depth is sufficient where 1347possible to check that required assumptions hold. For example, in loci where one 1348or more isoforms have extremely low relative expression, the observed Fisher 1349Information Matrix may not be positive definite after rounding errors. In this 1350case, it is not possible to produce a reliable variance-covariance matrix for 1351isoform fragment abundances. {\tt Cufflinks} will report a numerical 1352exception in this and similar cases. When an exception is reported, the 1353confidence intervals for the isoforms' abundances will be set from 0 1354FPKM to the FPKM for the whole gene. If such an exception is generated during 1355a {\tt Cuffdiff} run, no differential analysis involving the problematic 1356sample will be performed on that locus. 1357\newpage 1358\section{Appendix A: Lemmas and Theorems} 1359 1360The following elementary/classical results are required for our methods and we include them so that the supplement is self-contained. 1361 1362\begin{lemma} 1363\label{lemma:varsum} 1364Let $X_1,\ldots,X_n$ be random variables and $a_1,\ldots,a_n$ real numbers with $Y=\sum_{i=1}^na_iX_i$. Then 1365\begin{equation} 1366Var[Y] = \sum_{i=1}^n a_i^2Var[X_i]+2\sum_{i<j} a_ia_j Cov[X_i,X_j]. 1367\end{equation} 1368\end{lemma} 1369\begin{lemma}[Taylor Series] 1370If $X$ and $Y$ are random variables then 1371\begin{eqnarray} 1372 Var[f(X,Y)] & \approx & \left(\frac{\partial f}{\partial X}(E[X],E[Y])\right)^2Var[X] \nonumber \\ & &+2\frac{\partial f}{\partial X}(E[X],E[Y])\frac{\partial f}{\partial Y}(E[X],E[Y])Cov[X,Y] \nonumber \\ & & +\left( \frac{\partial f}{\partial Y}(E[X],E[Y])\right)^2 Var[Y]. 1373\end{eqnarray} 1374\end{lemma} 1375\begin{cor} 1376If $X$ and $Y$ are independent then 1377\begin{eqnarray} 1378 Var\left[log\left(\frac{X}{Y}\right)\right] & \approx & \frac{V[X]}{E[X]^2} + \frac{V[Y]}{E[Y]^2}. 1379\end{eqnarray} 1380\end{cor} 1381\begin{cor} 1382\label{lemma:varproduct} 1383If $X$ and $Y$ are independent random variables then 1384\begin{equation} 1385Var[XY] = Var[X]Var[Y]+E[X]^2Var[Y]+E[Y]^2Var[X]. 1386\end{equation} 1387\end{cor} 1388 1389The above result is exact using the 2nd order Taylor expansion (higher 1390derivatives vanish). 1391 1392\begin{lemma}[\cite{Li2009b}] 1393\label{lemma:readstoprobs} 1394Let $a_1,\ldots,a_n,w_1,\ldots,w_n$ be real numbers satisfying: $w_i \neq 0$ and $0 \leq a_i \leq 1$ for all $i$, $\sum_{i=1}^n a_i = 1$ and $\sum_{i=1}^na_iw_i \neq 0$. Let 1395$ b_j = \frac{a_jw_j}{\sum_{i=1}^n a_iw_i}$. Then $a_j = \frac{b_j\frac{1}{w_j}}{\sum_{i=1}^n b_i\frac{1}{w_i}}$. 1396\end{lemma} 1397{\bf Proof}: \begin{eqnarray} 1398b_j & = & \frac{a_jw_j}{\sum_{i=1}^n a_iw_i} \\ 1399\Rightarrow \, \sum_{k=1}^n \frac{b_k}{w_k} & = & \sum_{k=1}^n \frac{a_k}{\sum_{i=1}^na_iw_i}\\ 1400& = & \frac{1}{\sum_{i=1}^n a_iw_i}\\ 1401& = & \frac{b_j}{a_jw_j}\\ 1402 \Rightarrow \, a_j & = & \frac{b_j\frac{1}{w_j}}{\sum_{i=1}^n b_i \frac{1}{w_i}}. 1403\end{eqnarray} \qed 1404\begin{prop}[\cite{ASCB2005}] 1405\label{prop:linearmodel} 1406Let $f_i(\theta) = \sum_{j=1}^d a_{ij}\theta_j + b_i$ ($1 \leq i \leq 1407m$) describe a linear statistical model with $a_{ij} \geq$ for all 1408$i,j$. That is, $\sum_{i=1}^m f_i(\theta) = 1$. If $u_i \geq 0$ for 1409all $i$ then the log likelihood function 1410\begin{equation} 1411l(\theta) = \sum_{i=1}^m u_i log(f_i(\theta)) 1412\end{equation} 1413is concave. 1414\end{prop} 1415{\bf Proof}: It is easy to see that 1416\begin{equation} 1417\left( \frac{\partial^2 l}{\partial \theta_j \partial \theta_k} \right) = -A^Tdiag\left( \frac{u_1}{f_1(\theta)^2},\ldots,\frac{u_m}{f_m(\theta)^2}\right) A, 1418\end{equation} 1419where $A$ is the $m \times d$ matrix whose entry in row $i$ and column 1420$j$ equals $a_{ij}$. Therefore the Hessian is a symmetric matrix with 1421non-positive eigenvalues, and is therefore negative semi-definite. 1422\qed 1423 1424\begin{defn} 1425\label{def:po} 1426A partially ordered set is a set $S$ with a binary relation $\leq$ 1427satisfying: 1428\begin{enumerate} 1429\item $x \leq x$ for all $x \in S$, 1430\item If $x \leq y$ and $y \leq z$ then $x \leq z$, 1431\item If $x \leq y$ and $y \leq x$ then $x=y$. 1432\end{enumerate} 1433A {\em chain} is a set of elements in $C \subseteq S$ such that for every 1434$x,y \in C$ either $x \leq y$ or $y \leq x$. An {\em antichain} is a set of 1435elements that are pairwise incompatible. 1436\end{defn} 1437Partially ordered sets are equivalent to directed acyclic graphs (DAGs). The following min-max theorems relate chain partitions to antichains and are special cases of linear-programming duality. More details and complete proofs can be found in \cite{Lovasz2009}. 1438 1439\begin{thm}[Dilworth's theorem] 1440\label{thm:Dilworth} 1441Let $P$ be a finite partially ordered set. The maximum number of elements in any antichain of $P$ equals the minimum number of chains in any partition of $P$ into chains. 1442\end{thm} 1443\begin{thm}[K\"{o}nig's theorem] 1444\label{thm:konig} 1445In a bipartite graph, the number of edges in a maximum matching equals 1446the number of vertices in a minimum vertex cover. 1447\end{thm} 1448\begin{thm} 1449\label{thm:dilko} 1450Dilworth's theorem is equivalent to K\"{o}nig's theorem. 1451\end{thm} 1452{\bf Proof}: We first show that Dilworth's theorem follows from 1453K\"{o}nig's theorem. Let $P$ be a partially ordered set with $n$ 1454elements. We define a bipartite graph $G=(U,V,E)$ where 1455$U=V=P$, i.e. each partition in the bipartite graph is equally to 1456$P$. Two nodes $u,v$ form an edge $(u,v) \in E$ in the graph $G$ iff 1457$u<v$ in $P$. By K\"{o}nig's theorem there exist both a matching $M$ and a 1458a vertex cover $C$ in $G$ of the same cardinality. Let $T \subset 1459S$ be the set of elements not contained in $C$. Note that $T$ is an 1460antichain in $P$. We now form a partition $W$ of $P$ into chains by declaring $u$ and 1461$v$ to be in the same chain whenever there is an edge $(u,v) \in 1462M$. Since $C$ and $M$ have the same size, it follows that $T$ and $W$ 1463have the same size. 1464 1465To deduce K\"{o}nig's theorem from Dilworth's theorem, we begin with a 1466bipartite graph $G=(U,V,E)$ and form a partial order $P$ on the 1467vertices of $G$ by defining $u<v$ when $u \in U, v \in V$ and $(u,v) 1468\in E$. By Dilworth's theorem, there exists an antichain of $P$ and a 1469partition into chains of the same size. The non-trivial chains in $P$ 1470form a matching in the graph. Similarly, the complement of the 1471vertices corresponding to the anti-chain in $P$ is a vertex cover of 1472$G$ with the same cardinality as the matching. \qed 1473 1474\begin{figure}[!ht] 1475 \includegraphics[scale=0.8]{pdfs/Dilworth_Konig} 1476 1477%\caption[Equivalence of Dilworth's and K\"{o}nig's theorems]{ 1478\end{figure} 1479The equivalence of Dilworth's and K\"{o}nig's theorems is depicted above. The 1480 partially ordered set with 8 elements on the left is partitioned into 3 1481 chains. This is the size of a minimum partition into chains, and 1482 is equal to the maximum size of an antichain (Dilworth's 1483 theorem). The antichain is shown with double circles. On the 1484 right, the reachability graph constructed from the partially 1485 ordered set on the left is shown. The maximum matching 1486 corresponding to the chain partition consists of 5 edges and is equal in size to the 1487 number of vertices in a minimum vertex cover (K\"{o}nig's 1488 theorem). The vertex cover is shown with double circles. Note that 1489 8=3+5. 1490 1491 1492\newpage 1493 1494\section{Appendix B: selected Minard plots} 1495\begin{figure}[!ht] 1496 \includegraphics{pdfs/Tpm1_Tpm2.pdf} 1497\end{figure} 1498%\newpage 1499\begin{figure}[!ht] 1500 \includegraphics{pdfs/Fhl3_RTN4.pdf} 1501\end{figure} 1502\newpage 1503\begin{figure}[!ht] 1504 \includegraphics{pdfs/Myf6_Myf5.pdf} 1505\end{figure} 1506%\newpage 1507\begin{figure}[!ht] 1508 \includegraphics{pdfs/MyoD_Myog.pdf} 1509\end{figure} 1510\newpage 1511\begin{figure}[!ht] 1512 \includegraphics{pdfs/Actn.pdf} 1513\end{figure} 1514%\newpage 1515\begin{figure}[!ht] 1516 \includegraphics{pdfs/Ddx.pdf} 1517\end{figure} 1518\newpage 1519\begin{figure}[!ht] 1520 \includegraphics{pdfs/Myl1.pdf} 1521\end{figure} 1522%\newpage 1523\begin{figure}[!ht] 1524 \includegraphics{pdfs/Mef2.pdf} 1525\end{figure} 1526 1527\newpage 1528\bibliographystyle{amsplain} 1529\bibliography{cufflinks} 1530\end{document} 1531 1532% long read length is required for uniquely determining genomic origin of fragments, but its fragment length that is critical for assignment of fragments to transcripts. 1533% The length of a set of transcripts depends on their abundances 1534% single read sequencing prevents the correct normalization for fragment sizes in the length. 1535