1\documentclass[12pt]{report}
2\usepackage{graphicx,fancyvrb,url,comment,longtable,color,ccaption,hyperref}
3\usepackage{amsmath}
4
5\textwidth=6.6in
6\textheight=8.7in
7\oddsidemargin=-0.1in
8\evensidemargin=-0.1in
9\headheight=0in
10\headsep=0in
11\topmargin=-0.1in
12
13\DefineVerbatimEnvironment{Rcode}{Verbatim}{fontsize=\footnotesize}
14\newcommand{\code}[1]{{\small\texttt{#1}}}
15\newcommand{\Subread}{\textsf{Subread}}
16\newcommand{\Subjunc}{\textsf{Subjunc}}
17\newcommand{\Sublong}{\textsf{Sublong}}
18\newcommand{\Rsubread}{\textsf{Rsubread}}
19\newcommand{\ExactSNP}{\textsf{ExactSNP}}
20\newcommand{\limma}{\textsf{limma}}
21\newcommand{\edgeR}{\textsf{edgeR}}
22\newcommand{\DGEList}{\textsf{DGEList}}
23\newcommand{\voom}{\textsf{voom}}
24\newcommand{\featureCounts}{\textsf{featureCounts}}
25\newcommand{\repair}{\textsf{repair}}
26\newcommand{\R}{\textsf{R}}
27\newcommand{\C}{\textsf{C}}
28\newcommand{\Rpackage}[1]{\textsf{#1}}
29\excludecomment{hide}
30
31\makeindex
32\begin{document}
33
34\begin{titlepage}
35
36\begin{center}
37{\Huge\bf Rsubread/Subread Users Guide}\\
38\vspace{1 cm}
39{\centering\large Subread v2.0.2/Rsubread v2.4.3\\}
40\vspace{1 cm}
41\centering 29 March 2021\\
42\vspace{5 cm}
43\Large Wei Shi and Yang Liao\\
44\vspace{1 cm}
45\small
46{\large Olivia Newton-John Cancer Research Institute\\
47Melbourne, Australia\\}
48\vspace{7 cm}
49\centering Copyright \small{\copyright}  2011 - 2021\\
50\end{center}
51
52\end{titlepage}
53
54\tableofcontents
55
56\chapter{Introduction}
57
58The Subread/Rsubread packages comprise a suite of high-performance software programs for processing next-generation sequencing data.
59Included in these packages are \code{Subread} aligner, \code{Subjunc} aligner, \code{Sublong} long-read aligner, \code{Subindel} long indel detection program, \code{featureCounts} read quantification program, \code{exactSNP} SNP calling program and other utility programs.
60This document provides a detailed description to the programs included in the packages.
61
62\code{Subread} and \code{Subjunc} aligners adopt a mapping paradigm called ``seed-and-vote'' \cite{liao}.
63This is an elegantly simple multi-seed strategy for mapping reads to a reference genome.
64This strategy chooses the mapped genomic location for the read directly from the seeds.
65It uses a relatively large number of short seeds (called subreads) extracted from each read and allows all the seeds to vote on the optimal location.
66When the read length is $<$160 bp, overlapping subreads are used.
67More conventional alignment algorithms are then used to fill in detailed mismatch and indel information between the subreads that make up the winning voting block.
68The strategy is fast because the overall genomic location has already been chosen before the detailed alignment is done.
69It is sensitive because no individual subread is required to map exactly, nor are individual subreads constrained to map close by other subreads.
70It is accurate because the final location must be supported by several different subreads. The strategy extends easily to find exon junctions, by locating reads that contain sets of subreads mapping to different exons of the same gene.
71It scales up efficiently for longer reads.
72
73\code{Subread} is a general-purpose read aligner.
74It can be used to align reads generated from both genomic DNA sequencing and RNA sequencing technologies.
75It has been successfully used in a number of high-profile studies \cite{TangNC2013,ManNI2013,SpangenbergSCR2013,tang,ezh2}.
76\code{Subjunc} is specifically designed to detect exon-exon junctions and to perform full alignments for RNA-seq reads.
77Note that \code{Subread} performs local alignments for RNA-seq reads, whereas \code{Subjunc} performs global alignments for RNA-seq reads.
78\code{Subread} and \code{Subjunc} comprise a read re-alignment step in which reads are re-aligned using genomic variation data and junction data collected from the initial mapping.
79
80The \code{Subindel} program carries out local read assembly to discover long insertions and deletions.
81Read mapping should be performed before running this program.
82
83The {\featureCounts} program is designed to assign mapped reads or fragments (paired-end data) to genomic features such as genes, exons and promoters.
84It is a light-weight read counting program suitable for count both gDNA-seq and RNA-seq reads for genomic features\cite{fcounts}.
85The \textsf{Subread-featureCounts-limma/voom} pipeline has been found to be one of the best-performing pipelines for the analyses of RNA-seq data by the SEquencing Quality Control (SEQC) study, the third stage of the well-known MicroArray Quality Control (MAQC) project \cite{seqc}.
86
87Also included in this software suite is a very efficient SNP caller -- {\ExactSNP}.
88{\ExactSNP} measures local background noise for each candidate SNP and then uses that information to accurately call SNPs.
89
90These software programs support a variety of sequencing platforms.
91They are released in two packages -- SourceForge \emph{Subread} package and Bioconductor \emph{Rsubread} package\cite{Rsubread}.
92
93\chapter{Preliminaries}
94
95\section{Citation}
96
97If you use {\Rsubread}, you can cite:
98
99\begin{quote}
100Liao Y, Smyth GK and Shi W (2019).
101The R package Rsubread is easier, faster, cheaper and better for alignment and quantification of RNA sequencing reads.
102\emph{Nucleic Acids Research}, 47(8):e47.
103\\
104{\color{blue}{\url{http://www.ncbi.nlm.nih.gov/pubmed/30783653}} }
105\end{quote}
106
107If you use \featureCounts, you can cite:
108
109\begin{quote}
110Liao Y, Smyth GK and Shi W (2014).
111featureCounts: an efficient general purpose program for assigning sequence reads to genomic features.
112\emph{Bioinformatics}, 30(7):923-30.
113\\
114{\color{blue}{\url{http://www.ncbi.nlm.nih.gov/pubmed/24227677}}}
115\end{quote}
116
117If you use {\Subread} or {\Subjunc} aligners, you can cite:
118
119\begin{quote}
120Liao Y, Smyth GK and Shi W (2013).
121The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote.
122\emph{Nucleic Acids Research}, 41(10):e108.
123\\
124{\color{blue}{\url{http://www.ncbi.nlm.nih.gov/pubmed/23558742}} }
125\end{quote}
126
127\newpage
128
129\section{Download and installation}
130
131\subsection{Install Bioconductor {\Rsubread} package}
132
133{\R} software needs to be installed on my computer before you can install this package.
134Launch {\R} and issue the following command to install {\Rsubread}:
135
136\begin{Rcode}
137if (!requireNamespace("BiocManager", quietly = TRUE))
138    install.packages("BiocManager")
139BiocManager::install("Rsubread")
140\end{Rcode}
141
142Alternatively you may download it from {\Rsubread} web page {\color{blue}{\url{http://bioconductor.org/packages/release/bioc/html/Rsubread.html}}} and install it manually.
143
144\subsection{Install SourceForge {\Subread} package}
145
146\subsubsection{Install from a binary distribution}
147
148This is the easiest way to install the SourceForge {\Subread} package.
149Binary distributions are available for Linux, Macintosh and Windows operating systems and they can be downloaded from {\color{blue}{\url{http://subread.sourceforge.net}}}.
150The Linux binary distribution can be run on multiple Linux variants including Debian, Ubuntu, Fedora and Cent OS.
151
152To install {\Subread} package on FreeBSD or Solaris, you will have to install from source.
153
154\subsubsection{Install from source on a Unix or Macintosh computer}
155
156Download {\Subread} source package to your working directory from SourceForge \\
157{\color{blue}{\url{http://subread.sourceforge.net}}}, and type the following command to uncompress it:\\
158
159\code{tar zxvf subread-1.x.x.tar.gz}\\
160
161Enter \code{src} directory of the package and issue the following command to install it on a Linux operating system: \\
162
163\code{make -f Makefile.Linux}\\
164
165To install it on a Mac OS X operating system, issue the following command:\\
166
167\code{make -f Makefile.MacOS}\\
168
169To install it on a FreeBSD operating system, issue the following command:\\
170
171\code{make -f Makefile.FreeBSD}\\
172
173To install it on Oracle Solaris or OpenSolaris computer operating systems, issue the following command:\\
174
175\code{make -f Makefile.SunOS}\\
176
177A new directory called \code{bin} will be created under the home directory of the software package, and the executables generated from the compilation are saved to that directory.
178To enable easy access to these executables, you may copy them to a system directory such as \code{/usr/bin} or add the path to them to your search path (your search path is usually specified in the environment variable \code{`PATH'}).
179
180\subsubsection{Install from source on a Windows computer}
181
182The MinGW software tool  (http://www.mingw.org/) needs to installed to compile Subread.
183
184\section{How to get help}
185
186Bioconductor support site (\url{https://support.bioconductor.org/}) or Google Subread group (\url{https://groups.google.com/forum/#!forum/subread}) are the best place to post questions or make suggestions.
187
188
189\chapter{The seed-and-vote mapping paradigm}
190
191\section{Seed-and-vote}
192
193We have developed a new read mapping paradigm called ``seed-and-vote" for efficient, accurate and scalable read mapping \cite{liao}.
194The seed-and-vote strategy uses a number of overlapping seeds from each read, called \emph{subreads}.
195Instead of trying to pick the best seed, the strategy allows all the seeds to vote on the optimal location for the read.
196The algorithm then uses more conventional alignment algorithms to fill in detailed mismatch and indel information between the subreads that make up the winning voting block.
197The following figure illustrates the proposed seed-and-vote mapping approach with an toy example.
198
199\begin{center}
200\includegraphics[scale=0.3]{seed-and-vote.png}
201\end{center}
202
203Two aligners have been developed under the seed-and-vote paradigm, including \code{Subread} and \code{Subjunc}.
204\code{Subread} is a general-purpose read aligner, which can be used to map both genomic DNA-seq and RNA-seq read data.
205Its running time is determined by the number of \emph{subreads} extracted from each read, not by the read length.
206Thus it has an excellent maping scalability, ie. its running time has only very modest increase with the increase of read length.
207
208\code{Subread} uses the largest mappable region in the read to determine its mapping location, therefore it automatically determines whether a global alignment or a local alignment should be found for the read.
209For the exon-spanning reads in a RNA-seq dataset, \code{Subread} performs local alignments for them to find the target regions in the reference genome that have the largest overlap with them.
210Note that \code{Subread} does not perform global alignments for the exon-spanning reads and it soft clips those read bases which could not be mapped.
211However, the \code{Subread} mapping result is sufficient for carrying out the gene-level expression analysis using RNA-seq data, because the mapped read bases can be reliably used to assign reads, including both exonic reads and exon-spanning reads, to genes.
212
213To get the full alignments for exon-spanning RNA-seq reads, the \code{Subjunc} aligner can be used.
214\code{Subjunc} is designd to discover exon-exon junctions from using RNA-seq data, but it performs full alignments for all the reads at the same time.
215The \code{Subjunc} mapping results should be used for detecting genomic variations in RNA-seq data, allele-specific expression analysis and exon-level gene expression analysis.
216The Section~\ref{sec:junction} describes how exon-exon junctions are discovered and how exon-spanning reads are aligned using the seed-and-vote paradigm.
217
218\section{Detection of short indels}
219
220\begin{center}
221\includegraphics[scale=0.3]{indel.png}
222\end{center}
223
224The seed-and-vote paradigm is very powerful in detecting short indels (insertions and deletions).
225The figure below shows how we use the \emph{subreads} to confidently detect short indels.
226When there is an indel existing in a read, mapping locations of subreads extracted after the indel will be shifted to the left (insertion) or to the right (deletion), relative to the mapping locations of subreads at the left side of the indel.
227Therefore, indels in the reads can be readily detected by examining the difference in mapping locations of the extracted subreads.
228Moreover, the number of bases by which the mapping location of subreads are shifted gives the precise length of the indel.
229Since no mismatches are allowed in the mapping of the subreads, the indels can be detected with a very high accuracy.
230
231
232\section{Detection of exon-exon junctions}
233\label{sec:junction}
234
235Figure below shows the schematic of exon-exon junction under seed-and-vote paradigm.
236The first scan detects all possible exon-exon junctions using the mapping locations of the subreads extracted from each read.
237Exons as short as 16bp can be detected in this step.
238The second scan verifies the putative exon-exon junctions discovered from the first scan by read re-alignment.
239
240This approach is implemented in the \code{Subjunc} program.
241The output of \code{Subjunc} includes a list of discovered junctions, in addition to the mapping results.
242By default, \code{Subjunc} only reports canonical exon-exon junctions that contain canonical donor and receptor sites (`GT' and `AG' respectively).
243It was reported that such exon-exon junctions account for $>$98\% of all junctions.
244Orientation of donor and receptor sites is indicated by `XA' tag in the SAM/BAM output.
245\code{Subjunc} will report both canonical and non-canonical junctions when `--allJunctions' option is turned on.
246
247Accuracy of junction detection generally improves when external gene annotation data is provided.
248The annotation data should include chromosomal coordinates of known exons of each gene.
249\code{Subjunc} infers exon-exon junctions from the provided annotation data by connecting each pair of neighboring exons from the same gene.
250This should cover majority of known exon-exon junctions and the other junctions are expected to be discovered by the program.
251Note that although \code{Subread} aligner does not report exon-exon junctions, providing this annotation is useful for it to map junction reads more accurately.
252See `-a' parameter in Table 2 for more details.
253
254
255\begin{center}
256\includegraphics[scale=0.5]{junction.png}
257\end{center}
258
259
260\section{Detection of structural variants (SVs)}
261
262\code{Subread} and \code{Subjunc} can be used detect SV events including long indel, duplication, inversion and translocation, in RNA-seq and genomic DNA-seq data.
263
264Detection of long indels is conducted by performing local read assembly.
265When the specified indel length (`-I' option in SourceForge \code{C} or `indels' paradigm in \code{Rsubread}) is greater than 16, \code{Subread} and \code{Subjunc} will automatically start the read assembly process to detect long indels (up to 200bp).
266
267Breakpoints detected from SV events will be saved to a text file (`.breakpoint.txt'), which includes chromosomal coordinates of breakpoints and also the number of reads supporting each pair of breakpoints found from the same SV event.
268
269For the reads that were found to contain SV breakpoints, extra tags will be added for them in mapping output.
270These tags include CC(chromosome name), CP(mapping position), CG(CIGAR string) and CT(strand), and they describe the secondary alignment of the read (the primary alignment is described in the main fields).
271
272
273\section{Two-scan read alignment}
274
275\code{Subread} and \code{Subjunc} aligners employ a two-scan approach for read mapping.
276In the first scan, the aligners use seed-and-vote method to identify candidate mapping locations for each read and also discover short indels, exon-exon junctions and structural variants.
277In the second scan, they carry out final alignment for each read using the variant and junction information.
278Variant and junction data (including chromosomal coordinates and number of supporting reads) will be output along with the read mapping results.
279To the best of our knowledge, \code{Subread} and \code{Subjunc} are the first to employ a two-scan mapping strategy to achieve a superior mapping accuracy.
280This strategy was later seen in other aligners as well (called `two-pass').
281
282
283\section{Multi-mapping reads}
284
285Multi-mapping reads are those reads that map to more than one genomic location with the same similarity score (eg. number of mis-mismatched bases).
286\code{Subread} and \code{Subjunc} aligners can effectively detect multi-mapping reads by closely examining candidate locations which receive the highest number of votes or second highest number of votes.
287Numbers of mis-matched bases and matched bases are counted for each candidate location during the final re-alignment step and they are used for identifying multi-mapping reads.
288For RNA-seq data, a read is called as a multi-mapping read if it has two or more candidate mapping locations that have the same number of mis-matched bases and this number is the smallest in all candidate locations being considered.
289For genomic DNA-seq data, a read is called as a multi-mapping read if it has two or more candidate locations that have the same number of matched bases and this number is the largest among all candidate locations being considered.
290Note that for both RNA-seq and genomic DNA-seq data, any alignment reported for a multi-mapping read must not have more than threshold number of mis-matched bases (as specified in `-M' parameter).
291
292For the reporting of a multi-mapping read, users may choose to not report any alignments for the read (by default) or report up to a pre-defined number of alignments (`--multiMapping' and `-B' options).
293
294
295\section{Mapping of paired-end reads}
296
297For the mapping of paired-end reads, we use the following formula to obtain a list of candidate mapping locations for each read pair:
298
299$$PE_{score} = w * (V_1 + V_2) $$
300
301where $V_1$ and $V_2$ are the number of votes received from two reads from the same pair, respectively.
302$w$ has a value of 1.3 if mapping locations of the two reads are within the nominal paired-end distance (or nominal fragment length), and has a value of 1 otherwise.
303
304Up to 4,096 posssible alignments will be examined for each read pair and a maximum of three candidate alignments with the highest $PE_{score}$ will be chosen for final re-alignment.
305Total number of matched bases (for genomic DNA-seq data) or mis-matched bases (for RNA-seq data) will be used to determine the best mapping in the final re-alignment step.
306
307
308
309\chapter{Mapping reads generated by genomic DNA sequencing technologies}
310\label{chapter:subread-dnaseq}
311
312\section{A quick start for using SourceForge {\Subread} package}
313
314An index must be built for the reference first and then the read mapping can be performed.\\
315
316{\noindent\bf Step 1: Build an index}\\
317
318\noindent Build a base-space index (default). You can provide a list of FASTA files or a single FASTA file including all the reference sequences. The files can be gzipped.\\
319
320\code{subread-buildindex -o my\_index chr1.fa chr2.fa ...}\\
321
322{\noindent\bf Step 2: Align reads}\\
323
324\noindent Map single-end genomic DNA sequencing reads using 5 threads (only uniquely mapped reads are reported):\\
325\code{subread-align -t 1 -T 5 -i my\_index -r reads.txt.gz -o subread\_results.bam}\\
326
327\noindent Map paired-end reads:\\
328\code{subread-align -t 1 -d 50 -D 600 -i my\_index -r reads1.txt -R reads2.txt \newline -o subread\_results.bam}\\
329
330\noindent Detect indels of up to 16bp:\\
331\code{subread-align -t 1 -I 16 -i my\_index -r reads.txt -o subread\_results.bam}\\
332
333\noindent Report up to three best mapping locations:\\
334\code{subread-align -t 1 --multiMapping -B 3 -i my\_index -r reads.txt -o subread\_results.bam}\\
335
336
337\section{A quick start for using Bioconductor {\Rsubread} package}
338
339An index must be built for the reference first and then the read mapping can be performed.\\
340
341{\noindent\bf Step 1: Building an index}\\
342
343\noindent To build the index, you must provide a single FASTA file (eg. ``genome.fa'') which includes all the reference sequences.
344
345\begin{Rcode}
346library(Rsubread)
347buildindex(basename="my_index",reference="genome.fa")
348\end{Rcode}
349
350{\noindent\bf Step 2: Aligning the reads}\\
351
352\noindent Map single-end reads using 5 threads:
353\begin{Rcode}
354align(index="my_index",readfile1="reads.txt.gz",type="dna",output_file="rsubread.bam",nthreads=5)
355\end{Rcode}
356
357\noindent Detect indels of up to 16bp:
358\begin{Rcode}
359align(index="my_index",readfile1="reads.txt.gz",type="dna",output_file="rsubread.bam",indels=16)
360\end{Rcode}
361
362\noindent Report up to three best mapping locations:
363\begin{Rcode}
364align(index="my_index",readfile1="reads.txt.gz",type="dna",output_file="rsubread.bam",
365unique=FALSE,nBestLocations=3)
366\end{Rcode}
367
368\noindent Map paired-end reads:
369\begin{Rcode}
370align(index="my_index",readfile1="reads1.txt.gz",readfile2="reads2.txt.gz",type="dna",
371output_file="rsubread.bam",minFragLength=50,maxFragLength=600)
372\end{Rcode}
373
374
375\section{Index building}
376\label{sec:index}
377
378The \code{subread-buildindex} (\code{buildindex} function in \Rsubread) program builds an index for reference genome by creating a hash table in which keys are 16bp mers (subreads) extracted from the genome and values are their chromosomal locations.
379
380A full index or a gapped index can be built for a reference genome.
381In a full index, subreads are extracted from every location in the genome.
382In a gapped index, subreads are extracted in every three bases in the genome (ie. there is a 2bp gap between two subreads next to each other).
383When a full index is used in read mapping, only one set of subreads are extracted from a read.
384However three sets of subreads need to be extracted from a read when a gapped index is used for mapping.
385The first set starts from the first base of the read, the second set starts from the second base and the third set starts from the third base.
386This makes sure that a mapped read can always have a set of subreads that match those stored in the index.
387
388A full index is larger than a gapped index.
389However the full index enables faster mapping speed to be achieved.
390When a one-block full index is used for mapping, the maximum mapping speed is achieved.
391Size of one-block full index built for the human reference genome (GRCh38) is 17.8 GB.
392The \code{subread-buildindex} function needs 15 GB of memory to build this index.
393Size of a gapped index built for GRCh38 is less than 9 GB and \code{subread-buildindex} needs 5.7 GB of memory to build it.
394Options are available to generate index of any size.
395In \Rsubread, a one-block full index is built by default.
396
397The reference sequences should be in FASTA format.
398The \code{subread-buildindex} function divides each reference sequence name (which can be found in the header lines) into multiple substrings by using separators including `\code{|}', ` '(space) and `\code{<tab>}', and it uses the first substring as the name for the reference sequence during its index building.
399The first substrings must be distinct for different reference sequences (otherwise the index cannot be built).
400Note that the starting `\code{>}' character in the header line is not included in the first substrings.
401
402Sequences of reference genomes can be downloaded from public databases.
403For instance, the primary assembly of human genome GRCh38 or mouse genome GRCm38 can be downloaded from the GENCODE database via the following links:\\
404
405\noindent\href{ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/GRCh38.primary_assembly.genome.fa.gz}{\color{blue}\url{ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/GRCh38.primary_assembly.genome.fa.gz}}\\
406
407\noindent\href{ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M18/GRCm38.primary_assembly.genome.fa.gz}{\color{blue}\url{ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M18/GRCm38.primary_assembly.genome.fa.gz}}\\
408
409Table 1 describes the arguments used by the \code{subread-buildindex} program.
410
411\newpage
412
413\begin{table}[!tpb]
414\raggedright{Table 1: Arguments used by the \code{subread-buildindex} program (\code{buildindex} function in \Rsubread) in alphabetical order.
415Arguments in parenthesis in the first column are used by \code{buildindex}.\newline\\}
416\begin{tabular}{|p{4cm}|p{12cm}|}
417\hline
418Arguments & Description \\
419\hline
420chr1.fa, chr2.fa, ... \newline (\code{reference}) & Give names of chromosome files. Note for {\Rsubread} only a single FASTA file including all reference sequences should be provided. The files can be gzipped.\\
421\hline
422-B \newline (\code{indexSplit=FALSE}) & Create one block of index. The built index will not be split into multiple pieces. The more blocks an index has, the slower the mapping speed. This option will override `-M' option when it is also provided.\\
423\hline
424-c \newline (\code{colorspace}) & Build a color-space index.\\
425\hline
426-f $<int>$ \newline (\code{TH\_subread}) & Specify the threshold for removing uninformative subreads (highly repetitive 16bp mers). Subreads will be excluded from the index if they occur more than threshold number of times in the reference genome. Default value is 100.\\
427\hline
428-F \newline (\code{gappedIndex=FALSE}) & Build a full index for the reference genome. 16bp mers (subreads) will be extracted from every position of a reference genome.\\
429\hline
430-M $<int>$ \newline (\code{memory}) & Specify the size of computer memory(RAM) in megabytes that will be used to store the index during read mapping, 8000MB by default. If the index size  is greater than the specified value, the index will be split into multiple blocks. Only one block will be loaded into memory at anytime during the read alignment.\\
431\hline
432-o $<string>$ \newline (\code{basename}) & Specify the base name of the index to be created.\\
433\hline
434-v & Output version of the program. \\
435\hline
436\end{tabular}
437\end{table}
438
439\newpage
440
441\section{Read mapping}
442
443The {\Subread} aligner (\texttt{subread-align} program in SourceForge {\Subread} package or \code{align} function in Bioconductor {\Rsubread} package) extracts a number of subreads from each read and then uses these subreads to vote for the mapping location of the read.
444It uses the the ``seed-and-vote'' paradigm for read mapping and reports the largest mappable region for each read.
445Table 2 describes the arguments used by {\Subread} aligner (also {\Subjunc} and {\Sublong} aligners).
446Arguments used in Bioconductor \code{Rsubread} package are included in parenthesis.\\
447
448
449\begin{longtable}{|p{5.5cm}|p{10.5cm}|}
450\multicolumn{2}{p{16cm}}{Table 2: Arguments used by the \code{subread-align}/\code{subjunc}/\code{sublong} programs included in the SourceForge {\Subread} package in alphabetical order.
451Arguments in parenthesis in the first column are the equivalent arguments used in Bioconductor {\Rsubread} package.
452\newline
453($^{1}$\code{subread-align} arguments,
454$^{2}$\code{subjunc} arguments and
455$^{3}$\code{sublong} arguments)
456\newline
457}
458\endfirsthead
459\hline
460Arguments & Description \\
461\hline
462$^{1,2}$ -a $<string>$\newline (\code{useAnnotation, annot.inbuilt, annot.ext}) & Name of a gene annotation file that includes chromosomal coordinates of exons from each gene. GTF/GFF format by default. See -F option for supported formats. Users may use the inbuilt annotations included in this package (SAF format) for human and mouse data. Exon-exon junctions are inferred by connecting each pair of neighboring exons from the same gene. Gzipped file is accepted. \\
463\hline
464$^{1,2}$ -A $<string>$\newline (\code{chrAliases}) & Name of a comma-delimited text file that includes aliases of chromosome names. This file should contain two columns. First column contains names of chromosomes included in the SAF or GTF annotation and second column contains corresponding names of chromosomes in the reference genome. No column headers should be provided. Also note that chromosome names are case sensitive. This file can be used to match chromosome names between the annotation and the reference genome.\\
465\hline
466$^{1,2}$ -b \newline (\code{color2base=TRUE}) & Output base-space reads instead of color-space reads in mapping output for color space data (eg. LifTech SOLiD data). Note that the mapping itself will still be performed at color-space.\\
467\hline
468$^{1,2}$ -B $<int>$ \newline (\code{nBestLocations}) & Specify the maximal number of equally-best mapping locations to be reported for a read. 1 by default.  In the mapping output, the `NH' tag is used to indicate how many alignments are reported for the read and the `HI' tag is used for numbering the alignments reported for the same read. This option should be used together with the `$--$multiMapping' option.  \\
469\hline
470$^{1,2}$ -d $<int>$ \newline (\code{minFragLength}) & Specify the minimum fragment/template length, 50 by default.  Note that if the two reads from the same pair do not satisfy the fragment length criteria, they will be mapped individually as if they were single-end reads.\\
471\hline
472$^{1,2}$ -D $<int>$ \newline (\code{maxFragLength}) & Specify the maximum fragment/template length, 600 by default.\\
473\hline
474$^{1,2}$ -F $<string>$ \newline (\code{isGTF}) & Specify format of the provided annotation file. Acceptable formats include `GTF' (or compatible GFF format) and `SAF'. Default format in SourceForge {\Subread} is `GTF'. Default format in {\Rsubread} is `SAF'.   \\
475\hline
476$^{1,2,3}$ -i $<string> \newline (\code{index}) $ & Specify the base name of the index. The index used by \code{sublong} aligner must be a full index and has only one block, ie. `-F' and `-B' options must be specified when building index with \code{subread-buildindex}.\\
477\hline
478$^{1,2}$ -I $<int>$ \newline (\code{indels}) & Specify the number of INDEL bases allowed in the mapping. 5 by default. Indels of up to 200bp long can be detected.\\
479\hline
480$^{1,2}$ -m  $<int>$ \newline (\code{TH1}) & Specify the consensus threshold, which is the minimal number of consensus subreads required for reporting a hit. The consensus subreads are those subreads which vote for the same location in the reference genome for the read. If pair-end read data are provided, at least one of the two reads from the same pair must satisfy this criteria. The default value is 3 for \code{subread-align}, or 1 for \code{subjunc} and \code{sublong}.\\
481\hline
482$^{1,2}$ -M $<int>$ \newline (\code{maxMismatches}) & Specify the maximum number of mis-matched bases allowed in the alignment. 3 by default. Mis-matches found in soft-clipped bases are not counted.\\
483\hline
484$^{1,2}$ -n $<int>$ \newline (\code{nsubreads}) & Specify the number of subreads extracted from each read for mapping. The default value is 10 for \code{subread-align}, or 14 for \code{subjunc}. For \code{sublong}, this is number of subreads (85 by default) extracted from each readlet. A readlet is a 100bp sequence extracted from a long read.\\
485\hline
486$^{1,2,3}$ -o $<string>$ \newline (\code{output\_file}) & Give the name of output file. The default output format is BAM. All reads are included in mapping output, including both mapped and unmapped reads, and they are in the same order as in the input file.\\
487\hline
488$^{1,2}$ -p $<int>$ \newline (\code{TH2}) & Specify the minimum number of consensus subreads both reads from the same pair must have. This argument is only applicable for paired-end read data. The value of this argument should not be greater than that of `-m' option, so as to rescue those read pairs in which one read has a high mapping quality but the other does not. 1 by default.\\
489\hline
490$^{1,2}$ -P $<3:6>$ \newline (\code{phredOffset}) & Specify the format of Phred scores used in the input data, '3' for phred+33 and '6' for phred+64. '3' by default. For \code{align} function in \Rsubread, the possible values are `33' (for phred+33) and `64' (for phred+64). `33' by default.\\
491\hline
492$^{1,2,3}$ -r $<string>$ \newline (\code{readfile1}) & Give the name of an input file (multiple files are allowed to be provided to \code{align} and \code{subjunc} functions in {\Rsubread}). For paired-end read data, this gives the first read file and the other read file should be provided via the -R option. Supported input formats include FASTQ/FASTA (uncompressed or gzip compressed)(default), SAM and BAM.\\
493\hline
494$^{1,2}$ -R $<string>$ \newline (\code{readfile2}) & Provide name of the second read file from paired-end data. The program will switch to paired-end read mapping mode if this file is provided. (multiple files are allowed to be provided to \code{align} and \code{subjunc} functions in {\Rsubread}).\\
495\hline
496$^{1,2}$ -S $<ff:fr:rf>$ \newline (\code{PE\_orientation}) & Specify the orientation of the two reads from the same pair. It has three possible values including `fr', `ff' and `'rf. Letter `f' denotes the forward strand and letter `r' the reverse strand. `fr' by default (ie. the first read in the pair is on the forward strand and the second read on the reverse strand).\\
497\hline
498$^1$ -t $<int>$ \newline (\code{type}) & Specify the type of input sequencing data. Possible values include \code{0}, denoting RNA-seq data, or \code{1}, denoting genomic DNA-seq data. User must specify the value. Character values including `rna' and `dna' can also be used in the {\R} function.  For genomic DNA-seq data, the aligner takes into account both the number of matched bases and the number of mis-matched bases to determine the the best mapping location after applying the `seed-and-vote' approach for read mapping. For RNA-seq data, only the number of mis-matched bases is considered for determining the best mapping location. \\
499\hline
500$^{1,2,3}$ -T $<int>$ \newline (\code{nthreads}) & Specify the number of threads/CPUs used for mapping. The value should be between 1 and 32. 1 by default.\\
501%\hline
502%-u \newline (\code{unique=TRUE}) & Output uniquely mapped reads only. Reads that were found to have more than one best mapping location will not be reported.\\
503\hline
504$^{2}$$--$allJunctions \newline (\code{reportAllJunctions} \newline \code{=TRUE}) & This option should be used with \code{subjunc} for detecting canonical exon-exon junctions (with `GT/AG' donor/receptor sites), non-canonical exon-exon junctions and structural variants (SVs) in RNA-seq data. detected junctions will be saved to a file with suffix name ``.junction.bed". Detected SV breakpoints will be saved to a file with suffix name ``.breakpoints.txt", which includes chromosomal coordinates of detected SV breakpoints and also number of supporting reads. In the read mapping output, each breakpoint-containing read will contain the following extra fields for the description of its secondary alignment: CC(Chr), CP(Position),CG(CIGAR) and CT(strand). The primary alignment (described in the main field) and secondary alignment give respectively the mapping results for the two segments from the same read that were seperated by the breakpoint. Note that each breakpoint-containing read occupies only one row in mapping output. The mapping output includes mapping results for all the reads.\\
505\hline
506$^{1,2}$$--$BAMinput \newline (\code{input\_format="BAM"}) & Specify that the input read data are in BAM format.\\
507\hline
508$^{1,2}$$--$complexIndels & Detect multiple short indels that occur concurrently in a small genomic region (these indels could be as close as 1bp apart).\\
509\hline
510$^{1,2}$$--$DPGapExt $<int>$ \newline (\code{DP\_GapExtPenalty}) & Specify the penalty for extending the gap when performing the Smith-Waterman dynamic programming. 0 by default.\\
511\hline
512$^{1,2}$$--$DPGapOpen $<int>$ \newline (\code{DP\_GapOpenPenalty}) & Specify the penalty for opening a gap when applying the Smith-Waterman dynamic programming to detecting indels. -2 by default.\\
513\hline
514$^{1,2}$$--$DPMismatch $<int>$ \newline (\code{DP\_MismatchPenalty}) & Specify the penalty for mismatches when performing the Smith-Waterman dynamic programming.  0 by default.\\
515\hline
516$^{1,2}$$--$DPMatch $<int>$ \newline (\code{DP\_MatchScore}) & Specify the score for the matched base when performing the Smith-Waterman dynamic programming.  2 by default.\\
517\hline
518$^{1,2}$$--$gtfFeature $<string>$ \newline (\code{GTF.featureType}) & Specify the type of features that will be extracted from a GTF annotation. `exon' by default. Feature types can be found in the 3rd column of a GTF annotation.\\
519\hline
520$^{1,2}$$--$gtfAttr $<string>$ \newline (\code{GTF.attrType}) & Specify the type of attributes in a GTF annotation that will be used to group features. `gene\_id' by default. Attributes can be found in the 9th column of a GTF annotation.\\
521\hline
522$^{1,2}$$--$keepReadOrder \newline (\code{keepReadOrder} \newline \code{=FALSE}) & Output reads in the same order as in the input read file. This option only applies to BAM output. Note that in the output, reads from the same pair are always placed next to each other no matter this option is provided or not. \\
523\hline
524$^{1,2}$$--$multiMapping \newline (\code{unique=FALSE}) & Multi-mapping reads will also be reported in the mapping output. Number of alignments reported for each multi-mapping read is determined by the `-B' option. If the total number of equally best mapping locations found for a read is greater than the number specified by `-B', then random mapping locations  (total number of these locations is the same as `-B' value) will be selected. For example, if value of `-B' is 1, then one random location will be reported. \\
525\hline
526$^{1,2}$$--$rg $<string>$ \newline (\code{readGroup}) & Add a $<tag:value>$ to the read group (RG) header in the mapping output. \\
527\hline
528$^{1,2}$$--$rg-id $<string>$ \newline (\code{readGroupID}) & Specify the read group ID. If specified, the read group ID will be added to the read group header field and also to each read in the mapping output. \\
529\hline
530$^{1,2}$$--$SAMinput \newline (\code{input\_format="SAM"}) & Specify that the input read data are in SAM format.\\
531\hline
532$^{1,2,3}$$--$SAMoutput \newline (\code{output\_format="SAM"}) & Specify that mapping results are saved into a SAM format file. \\
533\hline
534$^{1,2}$$--$sortReadsByCoordinates \newline (\code{sortReadsByCoordinates} \newline \code{=FALSE}) & If specified, reads will be sorted by their mapping coordinates in the mapping output. This option is applicable for BAM output only. A BAI index file will also be generated for each BAM file so the BAM files can be directly loaded into a genome browser such as IGB or IGV.\\
535\hline
536$^1$$--$sv \newline (\code{detectSV=TRUE}) & This option should be used with \code{subread-align} for detecting structural variants (SVs) in genomic DNA sequencing data. Detected SV breakpoints will be saved to a file with suffix name ``.breakpoints.txt", which includes chromosomal coordinates of detected SV breakpoints and also number of supporting reads for each SV event. In the read mapping output, each breakpoint-containing read will contain the following extra fields for the description of its secondary alignment: CC(Chr), CP(Position),CG(CIGAR) and CT(strand). The primary alignment (described in the main field) and secondary alignment give respectively the mapping results for the two segments from the same read that were seperated by the breakpoint. Note that each breakpoint-containing read occupies only one row in mapping output. The mapping output includes mapping results for all the reads.\\
537\hline
538$^{1,2}$$--$trim5 $<int>$ \newline (\code{nTrim5}) & Trim off $<int>$ number of bases from 5' end of each read. 0 by default.\\
539\hline
540$^{1,2}$$--$trim3 $<int>$ \newline (\code{nTrim3}) & Trim off $<int>$ number of bases from 3' end of each read. 0 by default.\\
541\hline
542$^{1,2,3}$ -v & Output version of the program. \\
543\hline
544\end{longtable}
545
546\newpage
547
548\section{Memory use and speed}
549\label{sec:memoryspeed}
550
551\code{subread-buildindex} (\code{buildindex} function in \Rsubread) needs 15GB of memory to build a full index for human/mouse genome.
552With this index, \code{subread-align} (\code{align} in \Rsubread) require 17.8GB of memory for read mapping.
553This enables fastest mapping speed, but it is recommended that the full index should be on a unix server due to relatively large memory use.
554Mapping rate is $\sim$14 million reads per minute (10 CPU threads) when full index is used.
555
556A gapped index is recommended for use on a personal computer, which typically has 16GB of memory or less.
557\code{subread-buildindex} (\code{buildindex} function in \Rsubread) only needs 5.7GB of memory to build a gapped index for human/mouse genome.
558\code{subread-align} (\code{align} in \Rsubread) needs 8.2GB of memory for mapping with the gapped index.
559
560It takes \code{subread-buildindex} (\code{buildindex} function in \Rsubread) about 40 minutes to build a full index for human/mouse genome, and building a gapped index takes about 15 minutes.
561
562Memory use for index building and read mapping can be further reduced by building a split index using the \code{-B} and \code{-M} options in \code{subread-buildindex} (\code{indexSplit} and \code{memory} options in \code{buildindex} function in \Rsubread).
563
564
565\section{Mapping quality scores}
566
567{\Subread} and {\Subjunc} aligners determine the final mapping location of each read by taking into account vote number, number of mis-matched bases, number of matched bases and mapping distance between two reads from the same pair (for paired-end reads only) .
568They then assign a mapping quality score (MQS) to each mapped read to indicate the confidence of mapping using the following formula:\\
569
570\[ MQS = \left\{
571\begin{array}{l l}
572\frac{40}{(N_c + N_{mm})} & \quad \text{if only one best location found}\\
573& \\
574& \\
5750 & \quad \text{if $>1$ equally best locations were found}\\
576\end{array} \right.\] \\
577
578\noindent where $N_c$ is the number of candidate locations considered at the re-alignment step (note that no more than three candidate locations are considered at this step).
579$N_{mm}$ is the number of mismatches present in the final reported alignment for the read.
580
581
582\section{Mapping output}
583
584Read mapping results for each library will be saved to a BAM or SAM format file.
585Short indels detected from the read data will be saved to a text file (`.indel').
586If `$--$sv' is specified when running \code{subread-align}, breakpoints detected from structural variant events will be output to a text file for each library as well (`.breakpoints.txt').
587Screen output includes a brief mapping summary, including percentage of uniquely mapped reads, percentage of multi-mapping reads and percentage of unmapped reads.
588
589
590\section{Mapping of long reads}
591
592We developed a new long-read aligner, called {\Sublong}, for the mapping of long reads that were generated by long-read sequencing technologies such as Nanopore and PacBio sequencers.
593{\Sublong} is also based on the seed-and-vote mapping strategy.
594Parameters of {\Sublong} program can be found in Table 2.
595
596\newpage
597
598
599\chapter{Mapping reads generated by RNA sequencing technologies}
600
601\section{A quick start for using SourceForge {\Subread} package}
602
603\label{sec:rnaseq-subread}
604An index must be built for the reference first and then the read mapping and/or junction detection can be carried out.\\
605
606{\noindent\bf Step 1: Building an index}\\
607
608\noindent The following command can be used to build a base-space index.
609You can provide a list of FASTA files or a single FASTA file including all the reference sequences.
610The files can be gzipped.\\
611
612\code{subread-buildindex -o my\_index chr1.fa chr2.fa ...}\\
613
614\noindent For more details about index building, see Section~\ref{sec:index}.\\
615
616{\noindent\bf Step 2: Aligning the reads}\\
617
618\noindent{{\Subread}}\\
619
620\noindent If the purpose of an RNA-seq experiment is to quantify gene-level expression and discover differentially expressed genes, the {\Subread} aligner is recommended.
621{\Subread} carries out local alignments for RNA-seq reads.
622The commands used by {\Subread} to align RNA-seq reads are the same as those used to align gDNA-seq reads.
623Below is an example of using {\Subread} to map single-end RNA-seq reads.\\
624
625\code{subread-align -t 0 -i my\_index -r rnaseq-reads.txt -o subread\_results.bam}\\
626
627\noindent Another RNA-seq aligner included in this package is the {\Subjunc} aligner.
628{\Subjunc} not only performs read alignments but also detects exon-exon junctions.
629The main difference between {\Subread} and {\Subjunc} is that {\Subread} does not attempt to detect exon-exon junctions in the RNA-seq reads.
630For the alignments of the exon-spanning reads, {\Subread} just uses the largest mappable regions in the reads to find their mapping locations.
631This makes {\Subread} more computationally efficient.
632The largest mappable regions can then be used to reliably assign the reads to their target genes by using a read summarization program (eg. \featureCounts, see Section~\ref{sec:featureCounts}), and differential expression analysis can be readily performed based on the read counts yielded from read summarization.
633Therefore, {\Subread} is sufficient for read mapping if the purpose of RNA-seq analysis is to perform a differential expression analysis.
634Also, {\Subread} could report more mapped reads than {\Subjunc}.
635For example, the exon-spanning reads that are not aligned by {\Subjunc} due to the lack of canonical GT/AG splicing signals can be aligned by {\Subread} as long as they have a good match with the reference sequence.\\
636
637\noindent{{\Subjunc}}\\
638
639For other purposes of the RNA-seq data anlayses such as exon-exon junction detection, alternative splicing analysis and genomic mutation detection, {\Subjunc} aligner should be used because exon-spanning reads need to be fully aligned.
640Below is an example command of using {\Subjunc} to perform global alignments for paired-end RNA-seq reads.
641Note that there are two files produced after mapping: one is a BAM-format file including mapping results and the other a BED-format file including discovered exon-exon junctions.\\
642
643\code{subjunc -i my\_index -r rnaseq-reads1.txt -R rnaseq-reads2.txt -o subjunc\_result}
644
645\section{A quick start for using Bioconductor {\Rsubread} package}
646
647An index must be built for the reference first and then the read mapping can be performed.\\
648
649{\noindent\bf Step 1: Building an index}\\
650
651\noindent To build the index, you must provide a single FASTA file (eg. ``genome.fa'') which includes all the reference sequences.
652The FASTA file can be a gzipped file.
653
654\begin{Rcode}
655library(Rsubread)
656buildindex(basename="my_index",reference="genome.fa")
657\end{Rcode}
658
659{\noindent\bf Step 2: Aligning the reads}\\
660
661Please refer to Section~\ref{sec:rnaseq-subread} for difference between {\Subread} and {\Subjunc} in mapping RNA-seq data.
662Below is an example for mapping a single-end RNA-seq dataset using {\Subread}.
663Useful information about \code{align} function can be found in its help page (type \code{?align} in your {\R} prompt).
664
665\begin{Rcode}
666align(index="my_index",readfile1="rnaseq-reads.txt.gz",output_file="subread_results.bam")
667\end{Rcode}
668
669Below is an example for mapping a single-end RNA-seq dataset using {\Subjunc}.
670Useful information about \code{subjunc} function can be found in its help page (type \code{?subjunc} in your {\R} prompt).
671
672\begin{Rcode}
673subjunc(index="my_index",readfile1="rnaseq-reads.txt.gz",output_file="subjunc_results.bam")
674\end{Rcode}
675
676
677\section{Index building}
678
679Please refer to Section~\ref{sec:index}.
680Same index is used for the mapping of RNA and DNA sequencing reads.
681
682\section{Local read alignment}
683
684The \code{Subread} and \code{Subjunc} can both be used to map RNA-seq reads to the reference genome.
685If the goal of the RNA-seq data is to perform expression analysis, eg. finding genes expressing differentially between different conditions, then \code{Subread} is recommended.
686\code{Subread} performs fast local alignments for reads and reports the mapping locations that have the largest overlap with the reads.
687These reads can then be assigned to genes for expression analysis.
688For this type of analysis, global alignments for the exon-spanning reads are not required because local aligments are sufficient to get reads to be accurately assigned to genes.
689
690However, for other types of RNA-seq data analyses such as exon-exon junction discovery, genomic mutation detection and allele-specific gene expression analysis, global alignments are required.
691The next section describes the {\Subjunc} aligner, which performs global aligments for RNA-seq reads.
692
693\section{Global read alignment}
694
695{\Subjunc} aligns each exon-spanning read by firstly using a large number of subreads extracted from the read to identify multiple target regions matching the selected subreads, and then using the splicing signals (donor and receptor sites) to precisely determine the mapping locations of the read bases.
696It also includes a verification step to compare the quality of mapping reads as exon-spanning reads with the quality of mapping reads as exonic reads to finally decide how to best map the reads.
697Reads may be re-aligned if required.
698
699Output of {\Subjunc} aligner includes a list of discovered exon-exon junction locations and also the complete alignment results for the reads.
700Table 2 describes the arguments used by the {\Subjunc} program.\\
701
702\section{Memory use and speed}
703
704Memory use and running time of \code{subread-buildindex} and \code{subread-align} (\code{buildindex} and \code{align} in \Rsubread) are the same as their memory use and running time in the analysis of DNA sequencing data (see Section~\ref{sec:memoryspeed}).
705
706Compared to \code{subread-align} (\code{align} in \Rsubread), \code{subjunc} uses the same amount of memory when a full index is used and it uses slightly more memory (8.8GB of memory for human/mouse data) when a gapped index is used.
707\code{subjunc} is also slightly slower than \code{subread-align}.
708
709
710\section{Mapping output}
711
712Read mapping results for each library will be saved to a BAM/SAM file.
713Detected exon-exon junctions will be saved to a BED file for each library (`.junction.bed').
714Detected short indels will be saved to a text file (`.indel').
715Screen output includes a brief mapping summary, including percentage of uniquely mapped reads, percentage of multi-mapping reads and percentage of unmapped reads.
716
717\section{Mapping microRNA sequencing reads (miRNA-seq)}
718
719To use {\Subread} aligner to map miRNA-seq reads, a full index must be built for the reference genome before read mapping can be carried out.
720For example, the following command builds a full index for mouse reference genome \emph{mm10}:
721
722\code{\\
723subread-buildindex -F -B -o mm10\_full\_index mm10.fa \\
724}
725
726The full index includes 16bp mers extracted from every genomic location in the genome.
727Note that if \code{-F} is not specified, \code{subread-buildindex} builds a gapped index which includes 16bp mers extracted every three bases in the reference genome, ie. there is a 2bp gap between each pair of neighbouring 16bp mers.
728
729After the full index was built, read alignment can be performed.
730Reads do not need to be trimmed before feeding them to {\Subread} aligner since {\Subread} soft clips sequences in the reads that can not be properly mapped.
731The parameters used for mapping miRNA-seq reads need to be carefully designed due to the very short length of miRNA sequences ($\sim$22bp).
732The total number of subreads (16bp mers) extracted from each read should be the read length minus 15, which
733is the maximum number of subreads that can be possibly extracted from a read.
734The reason why we need to extract the maximum number of subreads is to achieve a high sensitivity in detecting the short miRNA sequences.
735
736The threshold for the number of consensus subreads required for reporting a hit should be within the range of 2 to 7 consensus subreads inclusive.
737The larger the number of consensus subreads required, the more stringent the mapping will be.
738Using a threshold of 2 consensus subreads allows the detection of miRNA sequences of as short as 17bp, but the mapping error rate could be relatively high.
739With this threshold, there will be at least 17 perfectly matched bases present in each reported alignment.
740If a threshold of 4 consensus subreads was used, length of miRNA sequences that can be detected is 19 bp or longer.
741With this threshold, there will be at least 19 perfectly matched bases present in each reported alignment.
742When a threshold of 7 consensus subreads was used, only miRNA sequences of 22bp or longer can be detected (at least 22 perfectly matched bases will be present in each reported alignment).
743
744We found that there was a significant decrease in the number of mapped reads when the requried number of consensus subreads increased from 4 to 5 when we tried to align a mouse miRNA-seq dataset, suggesting that there are a lot of miRNA sequences that are only 19bp long.
745We therefore used a threshold of 4 consensus subreads to map this dataset.
746However, what we observed might not be the case for other datasets that were generated from different cell types and different species.
747
748Below is an example of mapping 50bp long reads (adaptor sequences were included in the reads in addition to the miRNA sequences), with at least 4 consensus subreads required in the mapping.
749Note that `-t' option should have a value of 1 since miRNA-seq reads are more similar to gDNA-seq reads than mRNA-seq reads from the read mapping point of vew.
750
751\code{\\
752subread-align -t 1 -i mm10\_full\_index -n 35 -m 4 -M 3 -T 10 -I 0
753--multiMapping -B 10 -r miRNA\_reads.fastq -o result.bam\\
754}
755
756The `-B 10' parameter instructs {\Subread} aligner to report up to 10 best mapping locations (equally best) in the mapping results.
757The multiple locations reported for the reads could be useful for investigating their true origin, but they might need to be filtered out when assigning mapped reads to known miRNA genes to ensure a high-quality quantification of miRNA genes.
758The miRBase database (\url{http://www.mirbase.org/}) is a useful resource that includes annotations for miRNA genes in many species.
759The {\featureCounts} program can be readily used for summarizing reads to miRNA genes.
760
761
762
763\chapter{Read summarization}
764
765\section{Introduction}
766
767Sequencing reads often need to be assigned to genomic features of interest after they are mapped to the reference genome.
768This process is often called \emph{read summarization} or \emph{read quantification}.
769Read summarization is required by a number of downstream analyses such as gene expression analysis and histone modification analysis.
770The output of read summarization is a count table, in which the number of reads assigned to each feature in each library is recorded.
771
772A particular challenge to the read summarization is how to deal with those reads that overlap more than one feature (eg. an exon) or meta-feature (eg. a gene).
773Care must be taken to ensure that such reads are not over-counted or under-counted.
774Here we describe the {\featureCounts} program, an efficient and accurate read quantifier.
775{\featureCounts} has the following features:
776\begin{itemize}
777\item It carries out precise and accurate read assignments by taking care of indels, junctions and structural variants in the reads.
778\item It takes only half a minute to summarize 20 million reads.
779\item It supports GTF and SAF format annotation.
780\item It supports strand-specific read counting.
781\item It can count reads at feature (eg. exon) or meta-feature (eg. gene) level.
782\item Highly flexible in counting multi-mapping and multi-overlapping reads. Such reads can be excluded, fully counted or fractionally counted.
783\item It gives users full control on the summarization of paired-end reads, including allowing them to check if both ends are mapped and/or if the fragment length falls within the specified range.
784\item Reduce ambiguity in assigning read pairs by searching features that overlap with both reads from the pair.
785\item It allows users to specify whether chimeric fragments should be counted.
786\item Automatically detect input format (SAM or BAM).
787\item Automatically sort paired-end reads. Users can provide either location-sorted or name-sorted bams files to featureCounts. Read sorting is implemented on the fly and it only incurs minimal time cost.
788\end{itemize}
789
790\section{featureCounts}
791\label{sec:featureCounts}
792
793\subsection{Input data}
794
795The data input to {\featureCounts} consists of (i) one or more files of aligned reads (short or long reads) in either SAM or BAM format and (ii) a list of genomic features in either Gene Transfer Format (GTF) or General Feature Format (GFF) or Simplified Annotation Format (SAF). The format of input reads is automatically detected (SAM or BAM).
796
797If the input contains location-sorted paired-end reads, {\featureCounts} will automatically re-order the reads to place next to each other the reads from the same pair before counting them.
798Sometimes name-sorted paired-end input reads are not compatible with featureCounts (due to for example reporting of multi-mapping results) and in this case {\featureCounts} will also automatically re-order them.
799We provide an utility program {\repair} to allow users to pair up the reads before feeding them to {\featureCounts}.
800
801Both read alignment and read counting should use the same reference genome. For each read, the BAM/SAM file gives the name of the reference chromosome or contig the read mapped to, the start position of the read on the chromosome or contig/scaffold, and the so-called CIGAR string giving the detailed alignment information including insertions and deletions and so on relative to the start position.
802
803The genomic features can be specified in either GTF/GFF or SAF format. The SAF format is the simpler and includes only five required columns for each feature (see next section). In either format, the feature identifiers are assumed to be unique, in accordance with commonly used Gene Transfer Format (GTF) refinement of GFF.
804
805{\featureCounts} supports strand-specific read counting if strand-specific information is provided. Read mapping results usually include mapping quality scores for mapped reads. Users can optionally specify a minimum mapping quality score that the assigned reads must satisfy.
806
807\subsection{Annotation format}
808\label{sec:annotation}
809
810The genomic features can be specified in either GTF/GFF or SAF format.
811A definition of the GTF format can be found at UCSC website (\url{http://genome.ucsc.edu/FAQ/FAQformat.html#format4}).
812The SAF format includes five required columns for each feature: feature identifier, chromosome name, start position, end position and strand.
813Both start and end positions are inclusive.
814These five columns provide the minimal sufficient information for read quantification purposes.
815Extra annotation data are allowed to be added from the sixth column.
816
817A SAF-format annotation file should be a tab-delimited text file.
818It should also include a header line.
819An example of a SAF annotation is shown as below:
820
821\code{\\
822GeneID	Chr	Start	End	Strand\\
823497097	chr1	3204563	3207049	-\\
824497097	chr1	3411783	3411982	-\\
825497097	chr1	3660633	3661579	-\\
826100503874	chr1	3637390	3640590	-\\
827100503874	chr1	3648928	3648985	-\\
828100038431	chr1	3670236	3671869	-\\
829...
830}
831
832\code{GeneID} column includes gene identifiers that can be numbers or character strings.
833Chromosomal names included in the \code{Chr} column must match the chromosomal names of reference sequences to which the reads were aligned.
834
835\subsection{In-built annotations}
836
837In-built gene annotations for genomes \emph{hg38}, \emph{hg19}, \emph{mm10} and \emph{mm9} are included in both Bioconductor {\Rsubread} package and SourceForge {\Subread} package.
838These annotations were downloaded from NCBI RefSeq database and then adapted by merging overlapping exons from the same gene to form a set of disjoint exons for each gene.
839Genes with the same Entrez gene identifiers were also merged into one gene.
840
841Each row in the annotation represents an exon of a gene. There are five columns in the annotation data including Entrez gene identifier (\emph{GeneID}), chromosomal name (\emph{Chr}), chromosomal start position(\emph{Start}), chromosomal end position (\emph{End}) and strand (\emph{Strand}).
842
843In {\Rsubread}, users can access these annotations via the {\textsf{getInBuiltAnnotation}} function.
844In {\Subread}, these annotations are stored in directory `annotation' under home directory of the package.
845
846
847\subsection{Single and paired-end reads}
848
849Reads may be paired or unpaired.
850If paired reads are used, then each pair of reads defines a DNA or RNA fragment bookended by the two reads.
851In this case, {\featureCounts} can be instructed to count fragments rather than reads.
852{\featureCounts} automatically sorts reads by name if paired reads are not in consecutive positions in the SAM or BAM file, with minimal cost.
853Users do not need to sort their paired reads before providing them to {\featureCounts}.
854
855\subsection{Assign reads to features and meta-features}
856
857{\featureCounts} is a general-purpose read summarization function, which assigns mapped reads (RNA-seq reads or genomic DNA-seq reads) to genomic features or meta-features.
858A feature is an interval (range of positions) on one of the reference sequences.
859A meta-feature is a set of features that represents a biological construct of interest.
860For example, features often correspond to exons and meta-features to genes. Features sharing the same feature identifier in the GTF or SAF annotation are taken to belong to the same meta-feature. {\featureCounts} can summarize reads at either the feature or meta-feature levels.
861
862We recommend to use unique gene identifiers, such as NCBI Entrez gene identifiers, to cluster features into meta-features. Gene names are not recommended to use for this purpose because different genes may have the same names. Unique gene identifiers were often included in many publicly available GTF annotations which can be readily used for summarization. The Bioconductor {\Rsubread} package also includes NCBI RefSeq annotations for human and mice. Entrez gene identifiers are used in these annotations.
863
864{\featureCounts} preforms precise read assignment by comparing mapping location of every base in the read with the genomic region spanned by each feature.
865It takes account of any gaps (insertions, deletions, exon-exon junctions or structural variants) that are found in the read.
866It calls a hit if any overlap is found between read and feature.
867
868Users may use `--minOverlap (\code{minOverlap} in \R)' and `--fracOverlap (\code{fracOverlap} in \R)' options to specify the minimum number of overlapping bases and minimum fraction of overlapping bases requried for assigning a read to a feature, respectively.
869The `--fracOverlap' option might be particularly useful for counting reads with variable lengths.
870
871When counting reads at meta-feature level, a hit is called for a meta-feature if the read overlaps any component feature of the meta-feature.
872Note that if a read hits a meta-feature, it is always counted once no matter how many features in the meta-feature this read overalps with.
873For instance, an exon-spanning read overlapping with more than one exon within the same gene only contributes 1 count to the gene.
874
875When assigning reads to genes or exons, most reads can be successfully assigned without ambiguity.
876However if reads are to be assigned to transcripts, due to the high overlap between transcripts from the same gene, many reads will be found to overlap more than one transcript and therefore cannot be uniquely assigned.
877Specialized transcript-level quantification tools are recommended for counting reads to transcripts.
878Such tools use model-based approaches to deconvolve reads overlapping with multiple transcripts.
879
880
881\subsection{Count multi-mapping reads and multi-overlapping reads}
882
883A multi-mapping read is a read that maps to more than one location in the reference genome.
884There are multiple options for counting such reads.
885Users can specify the `-M' option (set \code{countMultiMappingReads} to \code{TRUE} in \R) to fully count every alignment reported for a multi-mapping read (each alignment carries 1 count), or specify both `-M' and `--fraction' options (set both \code{countMultiMappingReads} and \code{fraction} to \code{TRUE} in \R) to count each alignment fractionally  (each alignment carries $1/x$ count where $x$ is the total number of alignments reported for the read), or do not count such reads at all (this is the default behavior in SourceForge {\Subread} package; In \R, you need to set \code{countMultiMappingReads} to \code{FALSE}).
886
887A multi-overlapping read is a read that overlaps more than one meta-feature when counting reads at meta-feature level or overlaps more than one feature when counting reads at feature level.
888The decision of whether or not to counting these reads is often determined by the experiment type. We recommend that reads or fragments overlapping more than one gene are not counted for RNA-seq experiments, because any single fragment must originate from only one of the target genes but the identity of the true target gene cannot be confidently determined.
889On the other hand, we recommend that multi-overlapping reads or fragments are counted for ChIP-seq experiments because for example epigenetic modifications inferred from these reads may regulate the biological functions of all their overlapping genes.
890
891By default, {\featureCounts} does not count multi-overlapping reads.
892Users can specify the `-O' option (set \code{allowMultiOverlap} to \code{TRUE} in \R) to fully count them for each overlapping meta-feature/feature (each overlapping meta-feature/feature receives a count of 1 from a read), or specify both `-O' and `--fraction' options (set both \code{allowMultiOverlap} and \code{fraction} to \code{TRUE} in \R) to assign a fractional count to each overlapping meta-feature/feature (each overlapping meta-feature/feature receives a count of $1/y$ from a read where $y$ is the total number of meta-features/features overlapping with the read).
893
894If a read is both multi-mapping and multi-overlapping, then when `-O', `-M', and `--fraction' are all specified each overlapping meta-feature/feature will receive a fractional count of $1/(x*y)$.
895Note that each alignment reported for a multi-mapping read is assessed separately for overlapping with multiple meta-features/features.
896
897When multi-mapping reads are reported with primary and secondary alignments and both `-M' and `--primary' are specified, only primary alignments will be considered in counting and secondary alignments will be ignored.
898If `-M' is specified but `--primary' is not specified, both primary and secondary alignments will be considered in counting.
899Note that all the alignments reported for a multi-mapping read are expected to have a `NH' tag and whether an alignment is primary or secondary is determined by using bit {0x100} in the FLAG field of the alignment record.
900
901
902\subsection{Read filtering}
903\label{sec:read_filtering}
904
905{\featureCounts} implements a variety of read filters to facilitate flexible read counting, which should satisfy the requirement of most downstream analyses.
906The order of these filters being applied is as following (from first to last):
907unmapped
908$>$ read type
909$>$ singleton
910$>$ mapping quality
911$>$ chimeric fragment
912$>$ fragment length
913$>$ duplicate
914$>$ multi-mapping
915$>$ secondary alignment
916$>$ split reads (or nonsplit reads)
917$>$ no overlapping features
918$>$ overlapping length
919$>$ assignment ambiguity.
920
921Number of reads that were excluded from counting by each filter is reported in the program output, in addition to the reported read counts (see Section~\ref{sec:program_output}).
922The `read type' filter removes those reads that have an unexpected read type and also cannot be counted with confidence.
923For example, if there are single end reads included in a paired end read dataset (such data can be produced from a read trimming program for instance) and reads are required to be counted in a strand-specific manner, then all the single end reads will be excluded from counting because their strandness cannot be determined.
924However if such reads are to be counted in an unstranded manner then all the single end reads will be considered for counting.
925
926
927\subsection{Read manipulation}
928
929Reads can be shifted (\code{--readShiftType} and \code{--readShiftSize}), extended (\code{--readExtension5} and \code{--readExtension3}) or reduced to an end base (\code{--read2pos}), before being assigned to features/meta-features.
930These read manipulations are carried out by {\featureCounts} in the following order: shift $>$ extension $>$ reduction.
931
932
933\subsection{Program output}
934\label{sec:program_output}
935
936The output of {\featureCounts} program includes a count table and a summary of counting results.
937For SourceForge {\Subread}, the output data are saved to two tab-delimited files: one file contains read counts (file name is specified by the user) and the other file includes summary of counting results (file name is the name of read count file added with `.summary').
938For {\Rsubread}, all the output data are saved to an {\R} `List' object (for more details see the help page for {\featureCounts} function in {\Rsubread} package).
939
940The read count table includes annotation columns (`Geneid', `Chr', `Start', `End', `Strand' and `Length') and data columns (eg. read counts for genes for each library).
941When counting reads to meta-features (eg. genes) columns `Chr', `Start', `End' and `Strand' may each contain multiple values (separated by semi-colons), which correspond to individual features included in the same meta-feature.
942Column `Length' always contains one single value which is the total number of non-overlapping bases included in a meta-feature (or a feature), regardless of counting at meta-feature level or feature level.
943When counting RNA-seq reads to genes, the `Length' column typically contains the total number of non-overlapping bases in exons belonging to the same gene for each gene.
944
945The counting summary includes total number of alignments that were successfully assigned and also number of alignments that failed to be assigned due to various filters.
946Note that the counting summary includes the number of alignments, not the number of reads.
947Number of alignments will be higher than the number of reads when multi-mapping reads are included since each multi-mapping read contains more than one alignment.
948Number and percentage of successfully assigned alignments are also shown in featureCounts screen output.
949
950Filters supported by {\featureCounts} can be found in the list below:
951
952\begin{itemize}
953\item Unassigned\_Unmapped: unmapped reads cannot be assigned.
954\item Unassigned\_Read\_Type: reads that have an unexpected read type (eg. being a single end read included in a paired end dataset) and also cannot be counted with confidence (eg. due to stranded counting). Such reads are typically generated from a read trimming program.
955\item Unassigned\_Singleton: read pairs that have only one end mapped.
956\item Unassigned\_MappingQuality: alignments with a mapping quality score lower than the threshold.
957\item Unassigned\_Chimera: two ends in a paired end alignment are located on different chromosomes or have unexpected orientation.
958\item Unassigned\_FragementLength: fragment length inferred from paired end alignment does not meet the length criteria.
959\item Unassigned\_Duplicate: alignments marked as duplicate (indicated in the FLAG field).
960\item Unassigned\_MultiMapping: alignments reported for multi-mapping reads (indicated by `NH' tag).
961\item Unassigned\_Secondary: alignments reported as secondary alignments (indicated in the FLAG field).
962\item Unassigned\_Split (or Unassigned\_NonSplit): alignments that contain junctions (or do not contain junctions).
963\item Unassigned\_NoFeatures: alignments that do not overlap any feature.
964\item Unassigned\_Overlapping\_Length: alignments that do not overlap any feature (or meta-feature) with the minimum required overlap length.
965\item Unassigned\_Ambiguity: alignments that overlap two or more features (feature-level summarization) or meta-features (meta-feature-level summarization).
966\end{itemize}
967
968In the counting summary these filters are listed in the same order as they were applied in counting process (see Section~\ref{sec:read_filtering}).
969An unassigned alignment might fall into more than one category as listed above, however it will only be allocated to one category which is the category corresponding to the first filter that filtered this alignment out.
970
971
972\subsection{Program usage}
973
974Table 3 describes the parameters used by the {\featureCounts} program.
975
976\pagebreak
977
978\begin{longtable}{|p{5cm}|p{11cm}|}
979\multicolumn{2}{p{16cm}}{Table 3: Arguments used by the {\featureCounts} program included in the SourceForge {\Subread} package in alphabetical order.
980Arguments included in parenthesis are the equivalent parameters used by {\featureCounts} function in Bioconductor {\Rsubread} package.}
981\endfirsthead
982\hline
983Arguments & Description \\
984\hline
985input\_files \newline (\code{files}) & Give the names of input read files that include the read mapping results. The program automatically detects the file format (SAM or BAM). Multiple files can be provided at the same time. Files are allowed to be provided via $<stdin>$. \\
986\hline
987-a $<string>$ \newline (\code{annot.ext, annot.inbuilt})  & Provide name of an annotation file. See \code{-F} option for file format. Gzipped file is accepted.\\
988\hline
989-A \newline (\code{chrAliases}) & Provide a chromosome name alias file to match chr names in annotation with those in the reads. This should be a two-column comma-delimited text file. Its first column should include chr names in the annotation and its second column should include chr names in the reads. Chr names are case sensitive. No column header should be included in the file.\\
990\hline
991-B \newline (\code{requireBothEndsMapped}) & If specified, only fragments that have both ends successfully aligned will be considered for summarization. This option should be used together with \code{-p} (or \code{isPairedEnd} in {\Rsubread} {\featureCounts}).\\
992\hline
993-C \newline (\code{countChimericFragments}) & If specified, the chimeric fragments (those fragments that have their two ends aligned to different chromosomes) will NOT be counted. This option should be used together with \code{-p} (or \code{isPairedEnd} in {\Rsubread} {\featureCounts}).\\
994\hline
995-d $<int>$ \newline (\code{minFragLength}) & Minimum fragment/template length, 50 by default. This option must be used together with \code{-p} and \code{-P}.\\
996\hline
997-D $<int>$ \newline (\code{maxFragLength}) & Maximum fragment/template length, 600 by default. This option must be used together with \code{-p} and \code{-P}.\\
998\hline
999-f \newline (\code{useMetaFeatures}) & If specified, read summarization will be performed at feature level (eg. exon level). Otherwise, it is performed at meta-feature level (eg. gene level).\\
1000\hline
1001-F \newline (\code{isGTFAnnotationFile}) & Specify the format of the annotation file. Acceptable formats include `GTF' and `SAF' (see Section~\ref{sec:annotation} for details). By default, {\C} version of {\featureCounts} program accepts a GTF format annotation and R version accepts a SAF format annotation. In-built annotations in SAF format are provided.\\
1002\hline
1003-g $<string>$ \newline (\code{GTF.attrType}) & Specify the attribute type used to group features (eg. exons) into meta-features (eg. genes) when GTF annotation is provided. `gene\_id' by default. This attribute type is usually the gene identifier. This argument is useful for the meta-feature level summarization.\\
1004\hline
1005-G $<string>$ \newline (\code{genome}) & Provide the name of a FASTA-format file that contains the reference sequences used in
1006read mapping that produced the provided SAM/BAM files. This optional argument can be used with `-J' option to improve read counting for junctions.\\
1007\hline
1008-J \newline (\code{juncCounts}) & Count the number of reads supporting each exon-exon junction. Junctions are identified from those exon-spanning reads (containing `N' in CIGAR string) in input data (note that options `--splitOnly' and `--nonSplitOnly' are not considered by this parameter). The output result includes names of primary and secondary genes that overlap at least one of the two splice sites of a junction. Only one primary gene is reported, but there might be more than one secondary gene reported. Secondary genes do not overlap more splice sites than the primary gene. When the primary and secondary genes overlap same number of splice sites, the gene with the smallest leftmost base position is selected as the primary gene. Also included in the output result are the position information for the left splice site (`Site1') and the right splice site (`Site2') of a junction. These include chromosome name, coordinate and strand of the splice site. In the last columns of the output, number of supporting reads is provided for each junction for each library.\\
1009\hline
1010-L \newline (\code{isLongRead}) & Turn on long-read counting mode. This option should be used when counting long reads such as Nanopore or PacBio reads.\\
1011\hline
1012-M \newline (\code{countMultiMappingReads}) & If specified, multi-mapping reads/fragments will be counted. The program uses the `NH' tag to find multi-mapping reads. Each alignment reported for a multi-mapping read will be counted individually. Each alignment will carry \code{1} count or a fractional count (\code{--fraction}). See section ``Count multi-mapping reads and multi-overlapping reads'' for more details.\\
1013\hline
1014-o $<string>$ & Give the name of the output file. The output file contains the number of reads assigned to each meta-feature (or each feature if \code{-f} is specified). Note that the {\featureCounts} function in {\Rsubread} does not use this parameter. It returns a \code{list} object including read summarization results and other data. \\
1015\hline
1016-O \newline (\code{allowMultiOverlap}) & If specified, reads (or fragments if \code{-p} is specified) will be allowed to be assigned to more than one matched meta-feature (or feature if \code{-f} is specified). Reads/fragments overlapping with more than one meta-feature/feature will be counted more than once. Note that when performing meta-feature level summarization, a read (or fragment) will still be counted once if it overlaps with multiple features within the same meta-feature (as long as it does not overlap with other meta-features). Also note that this parameter is applied to each individual alignment when there are more than one alignment reported for a read (ie. multi-mapping read). See section ``Count multi-mapping reads and multi-overlapping reads'' for more details.\\
1017\hline
1018-p \newline (\code{isPairedEnd}) & Specify if the input libraries contain paired-end reads. {\featureCounts} will terminate if the read type of the input data is incorrectly specified.\\
1019\hline
1020-P \newline (\code{checkFragLength}) & If specified, the fragment length will be checked when assigning fragments to meta-features or features. This option must be used together with \code{-p}. The fragment length thresholds should be specified using \code{-d} and \code{-D} options.\\
1021\hline
1022-Q $<int>$ \newline (\code{minMQS}) & The minimum mapping quality score a read must satisfy in order to be counted. For paired-end reads, at least one end should satisfy this criteria. 0 by default.\\
1023\hline
1024-R $<string>$ \newline (\code{reportReads}) & Output detailed read assignment results for each read (or fragment if paired end). The detailed assignment results can be saved in three different formats including \code{CORE}, \code{SAM} and \code{BAM} (note that these values are case sensitive). \newline When \code{CORE} format is specified, a tab-delimited file will be generated for each input file. Name of each generated file is the input file name added with `.featureCounts'.  Each generated file contains four columns including read name, status (assigned or the reason if not assigned), number of targets and target list. A target is a feature or a meta-feature. Items in the target lists is separated by comma. If a read is not assigned, its number of targets will be set as -1. \newline When \code{SAM} or \code{BAM} format is specified, the detailed assignment results will be saved to SAM and BAM format files. Names of generated files are the input file names added with `.featureCounts.sam' or `.featureCounts.bam'. Three tags are used to describe read assignment results: XS, XN and XT. Tag XS gives the assignment status. Tag XN gives number of targets. Tag XT gives comma separated target list. \\
1025\hline
1026-s $<int or string>$ \newline (\code{isStrandSpecific}) & Indicate if strand-specific read counting should be performed. A single integer value (applied to all input files) or a string of comma-separated values (applied to each corresponding input file) should be provided. Possible values include: 0 (unstranded), 1 (stranded) and 2 (reversely stranded). Default value is 0 (ie. unstranded read counting carried out for all input files). For paired-end reads, strand of the first read is taken as the strand of the whole fragment. FLAG field is used to tell if a read is first or second read in a pair. Value of \code{isStrandSpecific} parameter in {\Rsubread} {\featureCounts} is a vector which has a length of either {\code{1}}, or the same with the total number of input files provided. \\
1027\hline
1028-t $<string>$ \newline (\code{GTF.featureType}) & Specify the feature type(s). If more than one feature type is provided, they should be separated by `,' (no space). Only rows which have a matched feature type in the provided GTF annotation file will be included for read counting. `exon' by default.\\
1029\hline
1030-T $<int>$ \newline (\code{nthreads}) & Number of the threads. The value should be between 1 and 32. 1 by default.\\
1031\hline
1032-v & Output version of the program. \\
1033\hline
1034$--$byReadGroup \newline (\code{byReadGroup}) & Count reads by read group. Read group information is identified from the header of BAM/SAM input files and the generated count table will include counts for each group in each library.\\
1035\hline
1036$--$countReadPairs \newline (\code{countReadPairs}) & Read pairs will be counted instead of reads. This parameter is only applicable when paired-end data were provided.\\
1037\hline
1038$--$donotsort \newline (\code{autosort}) & If specified, paired end reads will not be re-ordered even if reads from the same pair were found not to be next to each other in the input.\\
1039\hline
1040$--$extraAttributes $<string>$ \newline (\code{GTF.attrType.extra}) & Extract extra attribute types from the provided GTF annotation and include them in the counting output. These attribute types will not be used to group features. If more than one attribute type is provided they should be separated by comma (in {\Rsubread} {\featureCounts} its value is a character vector).  \\
1041\hline
1042$--$fraction \newline (\code{fraction}) & Assign fractional counts to features. This option must be used together with `-M' or `-O' or both. When `-M' is specified, each reported alignment from a multi-mapping read (identified via `NH' tag) will carry a count of 1/x, instead of 1 (one), where x is the total number of alignments reported for the same read. When `-O' is specified, each overlapping feature will receive a count of 1/y, where y is the total number of features overlapping with the read. When both `-M' and `-O' are specified, each alignment will carry a count of 1/(x*y).\\
1043\hline
1044$--$fracOverlap $<float>$ \newline (\code{fracOverlap}) & Minimum fraction of overlapping bases in a read that is required for read assignment. Value should be a float number in the range [0,1]. 0 by default. If paired end, number of overlapping bases is counted from both reads. Soft-clipped bases are counted when calculating total read length (but ignored when counting overlapping bases). Both this option and `--minOverlap' option need to be satisfied for read assignment. \\
1045\hline
1046$--$fracOverlapFeature $<float>$ \newline (\code{fracOverlapFeature}) & Minimum fraction of bases included in a feature that is required to overlap with a read or a read pair. Value should be within range [0,1]. 0 by default. \\
1047\hline
1048$--$ignoreDup \newline (\code{ignoreDup}) & If specified, reads that were marked as duplicates will be ignored. Bit Ox400 in FLAG field of SAM/BAM file is used for identifying duplicate reads. In paired end data, the entire read pair will be ignored if at least one end is found to be a duplicate read.\\
1049\hline
1050$--$largestOverlap \newline (\code{largestOverlap}) & If specified, reads (or fragments) will be assigned to the target that has the largest number of overlapping bases.\\
1051\hline
1052$--$maxMOp $<int>$ \newline (\code{maxMOp}) & Specify the maximum number of `M' operations (matches or mis-matches) allowed in a CIGAR string. 10 by default. Both `X' and `=' operations are treated as `M' and adjacent `M' operations are merged in the CIGAR string. When the number of `M' operations exceeds the limit, only the first `maxMOp' number of `M' operations will be used in read assignment.\\
1053\hline
1054$--$minOverlap $<int>$ \newline (\code{minOverlap}) & Minimum number of overlapping bases in a read that is required for read assignment. 1 by default. If a negative value is provided, then a gap of up to specified size will be allowed between read and the feature that the read is assigned to. For assignment of read pairs (fragments), number of overlapping bases from each read from the same pair will be summed. \\
1055\hline
1056$--$nonOverlap $<int>$ \newline (\code{nonOverlap}) & Maximum number of non-overlapping bases in a read (or a read pair) that is allowed when being assigned to a feature. No limit is set by default. \\
1057\hline
1058$--$nonOverlapFeature $<int>$ \newline (\code{nonOverlapFeature}) & Maximum number of non-overlapping bases in a feature that is allowed in read assignment. No limit is set by default. \\
1059\hline
1060$--$nonSplitOnly \newline (\code{nonSplitOnly}) & If specified, only non-split alignments (CIGAR strings do not contain letter `N') will be counted. All the other alignments will be ignored.\\
1061\hline
1062$--$primary \newline (\code{primaryOnly}) & If specified, only primary alignments will be counted. Primary and secondary alignments are identified using bit 0x100 in the Flag field of SAM/BAM files. All primary alignments in a dataset will be counted no matter they are from multi-mapping reads or not (ie. `-M' is ignored).\\
1063\hline
1064$--$read2pos $<int>$ \newline (\code{read2pos}) & Read is reduced to its 5' most base or 3' most base. Read summarization is then performed based on the single base position to which the read is reduced. By default no read reduction is performed. Read reduction is performed after read shifting and read extension if they are also specified. \\
1065\hline
1066$--$readExtension3 $<int>$ \newline (\code{readExtension3}) & Reads are extended downstream by $<int>$ bases from their 3' end. 0 by default. Negative value is not allowed. Read extension is performed after read shifting but before read reduction.\\
1067\hline
1068$--$readExtension5 $<int>$ \newline (\code{readExtension5}) & Reads are extended upstream by $<int>$ bases from their 5' end. 0 by default. Negative value is not allowed. \\
1069\hline
1070$--$readShiftSize $<int>$ \newline (\code{readShiftSize}) & Reads are shifted by $<int>$ bases. 0 by default. Negative value is not allowed. \\
1071\hline
1072$--$readShiftType $<string>$ \newline (\code{readShiftType}) & Specify the direction in which reads are being shifted. Possible values include \code{upstream}, \code{downstream}, \code{left} and \code{right}.  \code{upstream} by default. Read shifting is performed before read extension or reduction. \\
1073\hline
1074$--$Rpath $<string>$ \newline (\code{reportReadsPath}) & Specify a directory to save the detailed assignment results. If unspecified, the directory where counting results are saved is used. See `-R' option for obtaining detailed assignment results for reads.\\
1075\hline
1076$--$splitOnly \newline (\code{splitOnly}) & If specified, only split alignments (CIGAR strings contain letter `N') will be counted. All the other alignments will be ignored. An example of split alignments is the exon-spanning reads in RNA-seq data. If exon-spanning reads need to be assigned to all their overlapping exons, `-f' and `-O' options should be provided as well.\\
1077\hline
1078$--$tmpDir $<string>$ \newline (\code{tmpDir}) & Directory under which intermediate files are saved (later removed). By default, intermediate files will be saved to the directory specified in `-o' argument (In \R, intermediate files are saved to the current working directory by default).\\
1079\hline
1080$--$verbose \newline (\code{verbose}) & Output verbose information for debugging such as unmatched chromosomes/contigs between reads and annotation.\\
1081\hline
1082\end{longtable}
1083
1084\pagebreak
1085
1086\section{A quick start for {\featureCounts} in SourceForge \Subread}
1087
1088You need to provide read mapping results (in either SAM or BAM format) and an annotation file for the read summarization.
1089The example commands below assume your annotation file is in GTF format.\\
1090
1091\noindent Summarize BAM format single-end reads using 5 threads:\\
1092
1093\noindent\code{featureCounts -T 5 -a annotation.gtf -t exon -g gene\_id \\
1094 -o counts.txt mapping\_results\_SE.bam}\\
1095
1096\noindent Summarize BAM format single-end read data:\\
1097
1098\noindent\code{featureCounts -a annotation.gtf -t exon -g gene\_id \\
1099 -o counts.txt mapping\_results\_SE.bam}\\
1100
1101\noindent Summarize multiple libraries at the same time:\\
1102
1103\noindent\code{featureCounts -a annotation.gtf -t exon -g gene\_id \\
1104 -o counts.txt mapping\_results1.bam mapping\_results2.bam}\\
1105
1106\noindent Summarize paired-end reads and count fragments (instead of reads):\\
1107
1108\noindent\code{featureCounts -p -a annotation.gtf -t exon -g gene\_id \\
1109-o counts.txt mapping\_results\_PE.bam}\\
1110
1111\noindent Count fragments satisfying the fragment length criteria, eg. [50bp, 600bp]:\\
1112
1113\noindent\code{featureCounts -p -P -d 50 -D 600 -a annotation.gtf -t exon -g gene\_id\\
1114-o counts.txt mapping\_results\_PE.bam}\\
1115
1116\noindent Count fragments which have both ends successfully aligned without considering the fragment length constraint:\\
1117
1118\noindent\code{featureCounts -p -B -a annotation.gtf -t exon -g gene\_id\\
1119 -o counts.txt mapping\_results\_PE.bam}\\
1120
1121\noindent Exclude chimeric fragments from the fragment counting:\\
1122
1123\noindent\code{featureCounts -p -C -a annotation.gtf -t exon -g gene\_id\\
1124 -o counts.txt mapping\_results\_PE.bam}
1125
1126
1127\section{A quick start for {\featureCounts} in Bioconductor \Rsubread}
1128
1129You need to provide read mapping results (in either SAM or BAM format) and an annotation file for the read summarization.
1130The example commands below assume your annotation file is in GTF format.\\
1131
1132\noindent Load {\Rsubread} library from you {\R} session:
1133
1134\begin{Rcode}
1135library(Rsubread)
1136\end{Rcode}
1137
1138\noindent Summarize single-end reads using built-in RefSeq annotation for mouse genome `mm10' (`mm10' is the default inbuilt genome annotation):
1139\begin{Rcode}
1140featureCounts(files="mapping_results_SE.bam")
1141\end{Rcode}
1142
1143\noindent Summarize single-end reads using a user-provided GTF annotation file:
1144
1145\begin{Rcode}
1146featureCounts(files="mapping_results_SE.bam",annot.ext="annotation.gtf",
1147isGTFAnnotationFile=TRUE,GTF.featureType="exon",GTF.attrType="gene_id")
1148\end{Rcode}
1149
1150\noindent Summarize single-end reads using 5 threads:
1151
1152\begin{Rcode}
1153featureCounts(files="mapping_results_SE.bam",nthreads=5)
1154\end{Rcode}
1155
1156\noindent Summarize BAM format single-end read data:
1157
1158\begin{Rcode}
1159featureCounts(files="mapping_results_SE.bam")
1160\end{Rcode}
1161
1162\noindent Summarize multiple libraries at the same time:
1163
1164\begin{Rcode}
1165featureCounts(files=c("mapping_results1.bam","mapping_results2.bam"))
1166\end{Rcode}
1167
1168\noindent Summarize paired-end reads and counting fragments (instead of reads):
1169
1170\begin{Rcode}
1171featureCounts(files="mapping_results_PE.bam",isPairedEnd=TRUE)
1172\end{Rcode}
1173
1174\noindent Count fragments satisfying the fragment length criteria, eg. [50bp, 600bp]:
1175
1176\begin{Rcode}
1177featureCounts(files="mapping_results_PE.bam",isPairedEnd=TRUE,checkFragLength=TRUE,
1178minFragLength=50,maxFragLength=600)
1179\end{Rcode}
1180
1181\noindent Count fragments which have both ends successfully aligned without considering the fragment length constraint:
1182
1183\begin{Rcode}
1184featureCounts(files="mapping_results_PE.bam",isPairedEnd=TRUE,requireBothEndsMapped=TRUE)
1185\end{Rcode}
1186
1187\noindent Exclude chimeric fragments from fragment counting:
1188
1189\begin{Rcode}
1190featureCounts(files="mapping_results_PE.bam",isPairedEnd=TRUE,countChimericFragments=FALSE)
1191\end{Rcode}
1192
1193\chapter{Quantify single-cell RNA-seq data}
1194
1195\section{cellCounts}
1196
1197The \emph{cellCounts} program is developed for the quantification of single-cell RNA-seq (scRNA-seq) data generated from the 10X platform.
1198With \emph{cellCounts}, the entire quantification process is simply done by just one function call.
1199\emph{cellCounts} demultiplexes samples and cells and deduplicates reads.
1200It calls \emph{align} or \emph{subjunc} function to map scRNA-seq reads to a reference genome, and it calls \emph{featureCounts} function to assign UMIs (Unique Molecular Identifiers) to genomic features such as genes and exons.
1201
1202Note that \emph{cellCounts} is only available in the {\Rsubread} package.
1203It is not available in the {\Subread} package.
1204Its parameters can be found in the table below.
1205
1206
1207\begin{longtable}{|p{4.5cm}|p{11cm}|}
1208\multicolumn{2}{p{16cm}}{Table 4: Arguments used by the \code{cellCounts} program.
1209The arguments are listed in the alphabetical order.}
1210\endfirsthead
1211\hline
1212Arguments & Description \\
1213\hline
1214annot.ext & Specify an external annotation for UMI counting. See \code{featureCounts} for more details.\\
1215\hline
1216annot.inbuilt & Specify an inbuilt annotation for UMI counting. See \code{featureCounts} for more details.\\
1217\hline
1218cell.barcode & A character string giving the name of a text file (can be gzipped) that contains the set of cell barcodes used in sample preparation. If \code{NULL}, a cell barcode set will be determined for the input data by \code{cellCounts} based on the matching of cell barcodes sequences of the first 100,000 reads in the data with the three cell barcode sets used by 10X Genomics. \code{NULL} by default.\\
1219\hline
1220GTF.attrType & See \code{featureCounts} for more details. \\
1221\hline
1222GTF.featureType & See \code{featureCounts} for more details. \\
1223\hline
1224index & A character string giving the base name of index files, generated by the \code{buildindex} function, for a reference genome. \\
1225\hline
1226isGTFAnnotationFile & See \code{featureCounts} for more details. \\
1227\hline
1228sample.index & A data frame containing index set name for each sample. The data frame must contain four columns with column headers named `InputDirectory', `Lane', `SampleName' and `IndexSetName'. Note that this is not the Sample Sheet created by an Illumina sequencer. This data frame is used by \code{cellCounts} to produce a Sample Sheet which is then used to demultiplex samples. The name of an index set specifies a set of indices that were used for the sequencing of a sample. An example of the index set name is ‘SI-P01-A2’.\\
1229\hline
1230useMetaFeatures & Specify if UMI counting should be carried out at the meta-feature level (eg. gene level). See \code{featureCounts} for more details. \\
1231\hline
1232… & other parameters passed to \code{align}, \code{subjunc} and \code{featureCounts} functions. \\
1233\hline
1234\end{longtable}
1235
1236
1237The \code{cellCounts} program returns a \code{List} object to R, and it also outputs three gzipped FASTQ files and one BAM file for each sample.
1238The three gzipped FASTQ files include cell barcode and UMI sequences (R1), sample index sequences (I1) and the actual genomic sequences of the reads (R2), respectively.
1239The BAM file includes location-sorted read mapping results.
1240
1241The returned \code{List} object contains the following components:
1242
1243\noindent\code{counts} \\
1244A \code{List} object including UMI counts for each sample. Each component in this object is a matrix that contains UMI counts for a sample. Rows in the matrix are genes and columns are cells.\\
1245
1246\noindent\code{annotation}\\
1247A \code{data.frame} object containing an annotation (eg. a gene annotation). UMIs were assigned to features (eg. genes) in this annotation. Rows in the annotation are features. Columns of the annotation include \code{GeneID}, \code{Chr}, \code{Start}, \code{End} and \code{Length}.\\
1248
1249\noindent\code{sample.info}\\
1250A \code{data.frame} object containing sample information and also statistics related to the quantification result. It includes the following columns: \code{SampleName}, \code{InputDirectory}, \code{TotalCells}, \code{HighConfidenceCells}, \code{RescuedCells}, \code{TotalUMI}, \code{TotalReads}, \code{MappedReads} and \code{AssignedReads}. The order of samples in this object is the same as that in the components \code{counts} and \code{cell.confidence}.\\
1251
1252\noindent\code{cell.confidence}\\
1253A \code{List} object indicating if a cell is a high-confidence cell or a rescued cell (low confidence). Each component in the \code{List} object corresponds to a sample. Each component is a logical vector with a \code{TRUE} value indicating a high-confidence cell.
1254
1255
1256\chapter{SNP calling}
1257
1258\section{Algorithm}
1259
1260SNPs(Single Nucleotide Polymorphisms) are the mutations of single nucleotides in the genome.
1261It has been reported that many diseases were initiated and/or driven by such mutations.
1262Therefore, successful detection of SNPs is very useful in designing better diagnosis and treatments for a variety of diseases such as cancer.
1263SNP detection also is an important subject of many population studies.
1264
1265Next-gen sequencing technologies provide an unprecedented opportunity to identify SNPs at the highest resolution.
1266However, it is extremely computing-intensive to analyze the data generated from these technologies for the purpose of SNP discovery  because of the sheer volume of the data and the large number of chromosomal locations to be considered.
1267To discover SNPs, reads need to be mapped to the reference genome first and then all the read data mapped to a particular site will be used for SNP calling for that site.
1268Discovery of SNPs is often confounded by many sources of errors.
1269Mapping errors and sequencing errors are often the major sources of errors causing incorrect SNP calling.
1270Incorrect alignments of indels, exon-exon junctions and structural variants in the reads can also result in wrong placement of blocks of continuous read bases, likely giving rise to consecutive incorrectly reported SNPs.
1271
1272We have developed a highly accurate and efficient SNP caller, called \emph{exactSNP} \cite{exactSNP}.
1273\emph{exactSNP} calls SNPs for individual samples, without requiring control samples to be provided.
1274It tests the statistical significance of SNPs by comparing SNP signals to their background noises.
1275It has been found to be an order of magnitude faster than existing SNP callers.
1276
1277\section{exactSNP}
1278
1279Below is the command for running \code{exactSNP} program.
1280The complete list of parameters used by \code{exactSNP} can be found in Table 5.\\
1281
1282\noindent\code{exactSNP [options] -i input -g reference\_genome -o output}\\
1283
1284
1285\begin{longtable}{|p{4.5cm}|p{11cm}|}
1286\multicolumn{2}{p{16cm}}{Table 5: Arguments used by the \code{exactSNP} program included in the SourceForge {\Subread} package in alphabetical order.
1287Arguments included in parenthesis are the equivalent parameters used by \code{exactSNP} function in Bioconductor {\Rsubread} package.}
1288\endfirsthead
1289\hline
1290Arguments & Description \\
1291\hline
1292-a $<file>$ \newline (SNPAnnotationFile) & Specify name of a VCF-format file that includes annotated SNPs. Such annotation files can be downloaded from public databases such as the dbSNP database. Gzipped file is accepted. Incorporating known SNPs into SNP calling has been found to be helpful. However note that the annotated SNPs may or may not be called for the sample being analyzed. \\
1293\hline
1294-b \newline (isBAM) & Indicate the input file provided via $-i$ is in BAM format. \\
1295\hline
1296-f $<float>$ \newline (minAllelicFraction) & Specify the minimum fraction of mis-matched bases a SNP-containing location must have. Its value must between 0 and 1. 0 by default. \\
1297\hline
1298-g $<file>$ \newline (refGenomeFile) & Specify name of the file including all reference sequences. Only one single FASTA format file should be provided. \\
1299\hline
1300-i $<file> [-b\ if\ BAM] \newline (readFile)$ & Specify name of an input file including read mapping results. The format of input file can be SAM or BAM  (\code{-b} needs to be specified if a BAM file is provided).\\
1301\hline
1302-n $<int>$ \newline (minAllelicBases) & Specify the minimum number of mis-matched bases a SNP-containing location must have. 1 by default.\\
1303\hline
1304-o $<file>$ \newline (outputFile) & Specify name of the output file. This program outputs a VCF format file that includes discovered SNPs. \\
1305\hline
1306-Q $<int>$  \newline (qvalueCutoff) &  Specify the q-value cutoff for SNP calling at sequencing depth of 50X. 12 by default. The corresponding p-value cutoff is $10^{-Q}$. Note that this program automatically adjusts the q-value cutoff according to the sequencing depth at each chromosomal location.\\
1307\hline
1308-r $<int>$ \newline (minReads) & Specify the minimum number of mapped reads a SNP-containing location must have (ie. the minimum coverage). 1 by default. \\
1309\hline
1310-s $<int>$ \newline (minBaseQuality) & Specify the cutoff for base calling quality scores (Phred scores) read bases must satisfy to be used for SNP calling. 13 by default. Read bases that have Phred scores lower than the cutoff value will be excluded from the analysis. \\
1311\hline
1312-t $<int>$ \newline (nTrimmedBases) & Specify the number of bases trimmed off from each end of the read. 3 by default. \\
1313\hline
1314-T $<int>$ \newline (nthreads) & Specify the number of threads. 1 by default. \\
1315\hline
1316-v & Output version of the program. \\
1317\hline
1318-x $<int>$ \newline (maxReads) & Specify the maximum depth a SNP location is allowed to have. 1,000,000 reads by default. Any location having more reads than the maximum allowed depth will not be considered for SNP calling. This option is useful for removing PCR artefacts. \\
1319\hline
1320\end{longtable}
1321
1322
1323
1324\chapter{Utility programs}
1325
1326Usage info for each utility program can be seen by just typing the program name on the command prompt.
1327
1328\section{repair}
1329
1330This program takes as input a paired-end BAM file and places reads from the same pair next to each other in its output.
1331BAM files generated by {\repair} are compatible with {\featureCounts} program, ie they will not be re-sorted by {\featureCounts}.
1332Note that you do not have to run {\repair} before running {\featureCounts}.
1333{\featureCounts} calls {\repair} automatically if it finds that reads need to be re-sorted.
1334
1335The {\repair} program uses a novel approach to quickly find reads from the same pair, rather than performing time-consuming sort of read names.
1336It takes only about half a minute to re-order a location-sorted BAM file including 30 million read pairs.
1337
1338\section{flattenGTF}
1339
1340Flatten features (eg. exons) provided in a GTF annotation and output the modified annotation to a SAF format annotation.
1341If overlapping features are found in the GTF annotation, this function can combine them to form a single large feature encompassing all the original features, or chop them into non-overlapping bins.
1342
1343\section{promoterRegions}
1344
1345This function is only implemented in {\Rsubread}. It generates a SAF format annotation that includes coordinates of promoter regions for each gene.
1346
1347\section{propmapped}
1348
1349Get number of mapped reads from a BAM/SAM file.
1350
1351\section{qualityScores}
1352
1353Retrieve Phred scores for read bases from a Fastq/BAM/SAM file.
1354
1355\section{removeDup}
1356
1357Remove duplicated reads from a SAM/BAM file.
1358In {\Rsubread} this function is called \textsf{removeDupReads}.
1359
1360\section{subread-fullscan}
1361
1362Get all chromosomal locations that contain a genomic sequence sharing high homology with a given input sequence.
1363
1364\section{txUnique}
1365
1366This function is only implemented in {\Rsubread}. It counts the number of bases unique to each transcript.
1367
1368\chapter{Case studies}
1369
1370\section{A Bioconductor R pipeline for analyzing RNA-seq data}
1371
1372Here we illustrate how to use two Bioconductor packages - {\Rsubread} and {\limma} - to perform a complete RNA-seq analysis, including {\Subread} read mapping, {\featureCounts} read summarization, {\voom} normalization and {\limma} differential expresssion analysis.\\
1373
1374{\noindent\bf Data and software.} The RNA-seq data used in this case study include four libraries: A\_1, A\_2, B\_1 and B\_2.
1375Sample A is Universal Human Reference RNA (UHRR) and sample B is Human Brain Reference RNA (HBRR).
1376A\_1 and A\_2 are two replicates of sample A (undergoing separate sample preparation), and B\_1 and B\_2 are two replicates of sample B.
1377In this case study, A\_1 and A\_2 are treated as biological replicates although they are more like technical replicates.
1378B\_1 and B\_2 are treated as biological replicates as well.
1379
1380Note that these libraries only included reads originating from human chromosome 1 (according to {\Subread} aligner).
1381Reads were generated by the MAQC/SEQC Consortium.
1382Data used in this case study can be downloaded from\\
1383\url{http://bioinf.wehi.edu.au/RNAseqCaseStudy/data.tar.gz} (283MB).
1384Both read data and reference sequence for chromosome 1 of human genome (GRCh37) were included in the data.
1385
1386After downloading the data, you can uncompress it and save it to your current working directory.
1387Launch {\R} and load {\Rsubread}, {\limma} and {\edgeR} libraries by issuing the following commands at your R prompt.
1388Version of your {\R} should be 3.0.2 or later.
1389{\Rsubread} version should be 1.12.1 or later and {\limma} version should be 3.18.0 or later.
1390Note that this case study only runs on Linux/Unix and Mac OS X.
1391
1392\begin{Rcode}
1393library(Rsubread)
1394library(limma)
1395library(edgeR)
1396\end{Rcode}
1397
1398To install/update {\Rsubread} and {\limma} packages, issue the following commands at your R prompt:
1399\begin{Rcode}
1400source("http://bioconductor.org/biocLite.R")
1401biocLite(pkgs=c("Rsubread","limma","edgeR"))
1402\end{Rcode}
1403
1404{\noindent\bf Index building.} Build an index for human chromosome 1. This typically takes $\sim$3 minutes. Index files with basename `chr1' will be generated in your current working directory.
1405
1406\begin{Rcode}
1407buildindex(basename="chr1",reference="hg19_chr1.fa")
1408\end{Rcode}
1409
1410{\noindent\bf Alignment.} Perform read alignment for all four libraries and report uniquely mapped reads only. This typically takes $\sim$5 minutes. BAM files containing the mapping results will be generated in your current working directory.
1411
1412\begin{Rcode}
1413targets <- readTargets()
1414align(index="chr1",readfile1=targets$InputFile,output_file=targets$OutputFile)
1415\end{Rcode}
1416
1417{\noindent\bf Read summarization.} Summarize mapped reads to NCBI RefSeq genes.
1418This will only take a few seconds.
1419Note that the {\featureCounts} function contains built-in RefSeq annotations for human and mouse genes.
1420{\featureCounts} returns an {\R} `List' object, which includes raw read count for each gene in each library and also annotation information such as gene identifiers and gene lengths.
1421
1422\begin{Rcode}
1423fc <- featureCounts(files=targets$OutputFile,annot.inbuilt="hg19")
1424
1425fc$counts[1:5,]
1426          A_1.bam A_2.bam B_1.bam B_2.bam
1427653635        642     522     591     596
1428100422834       1       0       0       0
1429645520          5       3       0       0
143079501           0       0       0       0
1431729737         82      72      30      25
1432
1433fc$annotation[1:5,c("GeneID","Length")]
1434     GeneID Length
14351    653635   1769
14362 100422834    138
14373    645520   1130
14384     79501    918
14395    729737   3402
1440\end{Rcode}
1441
1442Create a {\DGEList} object.
1443\begin{Rcode}
1444x <- DGEList(counts=fc$counts, genes=fc$annotation[,c("GeneID","Length")])
1445\end{Rcode}
1446
1447%Calculate RPKM (reads per kilobases of exon per million reads mapped) values for genes:
1448%\begin{Rcode}
1449%x_rpkm <- rpkm(x,x$genes$Length,prior.count=0)
1450%
1451%x_rpkm[1:5,]
1452%          A_1.bam A_2.bam B_1.bam B_2.bam
1453%653635        939   905.0     709     736
1454%100422834      19     0.0       0       0
1455%645520         11     8.1       0       0
1456%79501           0     0.0       0       0
1457%729737         62    64.9      19      16
1458%\end{Rcode}
1459
1460
1461{\noindent\bf Filtering.} Only keep in the analysis those genes which had $>$10 reads per million mapped reads in at least two libraries.
1462
1463\begin{Rcode}
1464isexpr <- rowSums(cpm(x) > 10) >= 2
1465x <- x[isexpr,]
1466\end{Rcode}
1467
1468{\noindent\bf Design matrix.} Create a design matrix:
1469
1470\begin{Rcode}
1471celltype <- factor(targets$CellType)
1472design <- model.matrix(~0+celltype)
1473colnames(design) <- levels(celltype)
1474\end{Rcode}
1475
1476{\noindent\bf Normalization.} Perform {\voom} normalization:
1477
1478\begin{Rcode}
1479y <- voom(x,design,plot=TRUE)
1480\end{Rcode}
1481
1482The figure below shows the mean-variance relationship estimated by voom.
1483\begin{center}
1484\includegraphics[scale=0.3]{voom_mean_variance.png}
1485\end{center}
1486
1487{\noindent\bf Sample clustering.} Multi-dimensional scaling (MDS) plot shows that sample A libraries are clearly separated from sample B libraries.
1488
1489\begin{Rcode}
1490plotMDS(y,xlim=c(-2.5,2.5))
1491\end{Rcode}
1492
1493\begin{center}
1494\includegraphics[scale=0.5]{MDSplot.png}
1495\end{center}
1496
1497{\noindent\bf Linear model fitting and differential expression analysis.} Fit linear models to genes and assess differential expression using eBayes moderated t statistic.
1498Here we compare sample B vs sample A.
1499
1500\begin{Rcode}
1501fit <- lmFit(y,design)
1502contr <- makeContrasts(BvsA=B-A,levels=design)
1503fit.contr <- eBayes(contrasts.fit(fit,contr))
1504dt <- decideTests(fit.contr)
1505summary(dt)
1506   BvsA
1507-1  922
15080   333
15091   537
1510\end{Rcode}
1511
1512List top 10 differentially expressed genes:
1513
1514\begin{Rcode}
1515options(digits=2)
1516topTable(fit.contr)
1517             GeneID Length logFC AveExpr   t P.Value adj.P.Val  B
1518100131754 100131754   1019   1.6      16 113 3.5e-28   6.3e-25 54
15192023           2023   1812  -2.7      13 -91 2.2e-26   1.9e-23 51
15202752           2752   4950   2.4      13  82 1.5e-25   9.1e-23 49
152122883         22883   5192   2.3      12  64 1.8e-23   7.9e-21 44
15226135           6135    609  -2.2      12 -62 3.1e-23   9.5e-21 44
15236202           6202    705  -2.4      12 -62 3.2e-23   9.5e-21 44
15244904           4904   1546  -3.0      11 -60 5.5e-23   1.4e-20 43
152523154         23154   3705   3.7      11  55 2.9e-22   6.6e-20 41
15268682           8682   2469   2.6      12  49 2.2e-21   4.3e-19 39
15276125           6125   1031  -2.0      12 -48 3.1e-21   5.6e-19 39
1528\end{Rcode}
1529
1530
1531\bibliographystyle{unsrt}
1532\bibliography{SubreadUsersGuide}
1533
1534
1535\end{document}
1536