\documentclass[12pt]{article}
\usepackage[colorlinks]{hyperref}
%\usepackage{color}
\usepackage[usenames,dvipsnames,svgnames,table]{xcolor}
\usepackage{tabularx}
\usepackage{longtable}
\usepackage{hanging}
\usepackage{changepage}
\usepackage[top=1in, bottom=1in, left=0.7in, right=0.7in]{geometry}
\usepackage{enumitem}
\usepackage{tocloft}

\begin{document}
\hypersetup{
    linkcolor=MidnightBlue
    }


%%%%%% global commands
% option name, no reference
\newcommand{\optn}[1]{\sloppy\textcolor{violet}{\texttt{--#1}}}
% option name
\newcommand{\opt}[1]{\sloppy\hyperlink{#1}{\optn{#1}}}
%option value, to be replaced with user value
\newcommand{\optv}[1]{\sloppy\texttt{#1}}
\newcommand{\optvr}[1]{\sloppy\textit{\texttt{#1}}}

\newcommand{\code}[1]{\sloppy\texttt{#1}}

\newcommand{\codelines}[1]{\begin{adjustwidth}{0.5in}{0in}
    \raggedright\texttt{#1}
    \end{adjustwidth}}

\newcommand{\ofilen}[1]{\sloppy\texttt{#1}}

\newcommand{\sechyperref}[1]{\hyperref[#1]{Section \ref{#1}. \nameref{#1}}}

\title{STAR manual 2.7.9a}
\author{Alexander Dobin\\
dobin@cshl.edu}
\maketitle

%\setlength{\cftsubsecindent}{8em} % to control spacing between number and title in the ToC
\makeatletter
\renewcommand*\l@subsection{\@dottedtocline{2}{1.8em}{2.7 em}}
\renewcommand*\l@subsubsection{\@dottedtocline{2}{4.5em}{2.7 em}}
\makeatother
\tableofcontents

\newpage

%\section{Introduction}
\section{Getting started.}
\subsection{Installation.}

STAR source code and binaries can be downloaded from GitHub: named releases from \url{https://github.com/alexdobin/STAR/releases}, or the master branch from \url{https://github.com/alexdobin/STAR}. The pre-compiled STAR executables are located in the \code{bin/} subdirectory. The \code{static} executables are the easiest to use, as they are statically compiled and do not depend on external libraries.

To compile STAR from sources, run \code{make} in the source directory for a Linux-like environment, or run \code{make STARforMac} for Mac OS X. This will produce the executable \code{'STAR'} inside the source directory.

\subsubsection{Installation - in depth and troubleshooting.}
STAR is compiled with the gcc C++ compiler and depends only on standard gcc libraries. Some generic instructions on installing correct gcc environments are given below.

\paragraph{Ubuntu.}\hfill
\codelines{\$ sudo apt-get update\\
\$ sudo apt-get install g++\\
\$ sudo apt-get install make
}

\paragraph{Red Hat, CentOS, Fedora.}\hfill
\codelines{\$ sudo yum update\\
    \$ sudo yum install make\\
    \$ sudo yum install gcc-c++\\
    \$ sudo yum install glibc-static
    }

\paragraph{SUSE.}\hfill
\codelines{\$ sudo zypper update\\
    \$ sudo zypper in gcc gcc-c++
}

\paragraph{Mac OS X.\newline}
Current versions of Mac OS X Xcode are shipped with Clang replacing the standard gcc compiler. Presently, standard Clang does not support OpenMP, which creates problems for STAR compilation. One option to avoid this problem is to install gcc (preferably using the \code{homebrew} package manager). Another option is to add OpenMP functionality to Clang.

\subsection{Basic workflow.}
The basic STAR workflow consists of 2 steps:
\begin{enumerate}
\item
Generating genome index files (see \sechyperref{Generating_genome_indexes}).\newline
In this step the user supplies the reference genome sequences (FASTA files) and annotations (GTF file), from which STAR generates genome indexes that are utilized in the 2nd (mapping) step. The genome indexes are saved to disk and need only be generated \textbf{once} for each genome/annotation combination.
A limited collection of STAR genomes is available from \url{http://labshare.cshl.edu/shares/gingeraslab/www-data/dobin/STAR/STARgenomes/}; however, it is strongly recommended that users generate their own genome indexes with the most up-to-date assemblies and annotations.
\item
Mapping reads to the genome (see \sechyperref{Running_mapping_jobs}).\newline
In this step the user supplies the genome files generated in the 1st step, as well as the RNA-seq reads (sequences) in the form of FASTA or FASTQ files. STAR maps the reads to the genome, and writes several output files, such as alignments (SAM/BAM), mapping summary statistics, splice junctions, unmapped reads, signal (wiggle) tracks etc. Output files are described in \sechyperref{Output_files}. Mapping is controlled by a variety of input parameters (options) that are described in brief in \sechyperref{Running_mapping_jobs}, and in more detail in \sechyperref{Description_of_all_options}.
\end{enumerate}

The STAR command line has the following format:\codelines{
STAR --option1-name option1-value(s) --option2-name option2-value(s) ...
}
If an option can accept multiple values, they are separated by spaces, and in a few cases by commas.


\section{Generating genome indexes.}\label{Generating_genome_indexes}
\subsection{Basic options.}
The basic options to generate genome indices are as follows:
\codelines{\opt{runThreadN} \optvr{NumberOfThreads}\\
\opt{runMode} \optv{genomeGenerate}\\
\opt{genomeDir} \optvr{/path/to/genomeDir}\\
\opt{genomeFastaFiles} \optvr{/path/to/genome/fasta1 /path/to/genome/fasta2 ...} \\
\opt{sjdbGTFfile} \optvr{/path/to/annotations.gtf}\\
\opt{sjdbOverhang} \optvr{ReadLength-1}\\
}

\begin{itemize}
\item[]
\opt{runThreadN} option defines the number of threads to be used for genome generation; it has to be set to the number of available cores on the server node.

\item[]
\opt{runMode} \optv{genomeGenerate} option directs STAR to run a genome indices generation job.

\opt{genomeDir} specifies the path to the directory (henceforth called "genome directory") where the genome indices are stored. This directory has to be created (with \code{mkdir}) before the STAR run and needs to have write permissions. The file system needs to have at least 100GB of disk space available for a typical mammalian genome. It is recommended to remove all files from the genome directory before running the genome generation step. This directory path will have to be supplied at the mapping step to identify the reference genome.

\item[]
\opt{genomeFastaFiles} specifies one or more FASTA files with the genome reference sequences. Multiple reference sequences (henceforth called ``chromosomes'') are allowed for each FASTA file. You can rename the chromosomes in the chrName.txt file, keeping the order of the chromosomes in the file: the names from this file will be used in all output alignment files (such as .sam). Tabs are not allowed in chromosome names, and spaces are not recommended.

\item[]
\opt{sjdbGTFfile} specifies the path to the file with annotated transcripts in the standard GTF format. STAR will extract splice junctions from this file and use them to greatly improve accuracy of the mapping. While this is optional, and STAR can be run without annotations, using annotations is \textbf{highly recommended} whenever they are available. Starting from 2.4.1a, the annotations can also be included on the fly at the mapping step.

\item[]
\opt{sjdbOverhang} specifies the length of the genomic sequence around the annotated junction to be used in constructing the splice junctions database. Ideally, this length should be equal to \optvr{ReadLength-1}, where \optvr{ReadLength} is the length of the reads. For instance, for Illumina 2x100b paired-end reads, the ideal value is 100-1=99.
In case of reads of varying length, the ideal value is \optvr{max(ReadLength)-1}. \textbf{In most cases, the default value of 100 will work as well as the ideal value.}
\end{itemize}

Genome files comprise binary genome sequence, suffix arrays, text chromosome names/lengths, splice junctions coordinates, and transcripts/genes information. Most of these files use internal STAR format and are not intended to be utilized by the end user. It is strongly \textbf{not recommended} to change any of these files, with one exception: you can rename the chromosomes in chrName.txt while keeping the order of the chromosomes in this file; the chromosome names from this file will be used in all output files (e.g. SAM/BAM).

\subsection{Advanced options.}
\subsubsection{Which chromosomes/scaffolds/patches to include?}
It is strongly recommended to include major chromosomes (e.g., for human chr1-22, chrX, chrY, chrM) as well as un-placed and un-localized scaffolds. Typically, un-placed/un-localized scaffolds add just a few MegaBases to the genome length; however, a substantial number of reads may map to ribosomal RNA (rRNA) repeats on these scaffolds. These reads would be reported as unmapped if the scaffolds are not included in the genome, or, even worse, may be aligned to wrong loci on the chromosomes. Generally, patches and alternative haplotypes should \textbf{not} be included in the genome.

Examples of acceptable genome sequence files:
\begin{itemize}
\item \textbf{ENSEMBL:} files marked with .dna.primary\_assembly, such as:
\url{ftp://ftp.ensembl.org/pub/release-77/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz}
\item \textbf{GENCODE:} files marked with PRI (primary). Strongly recommended for mouse and human: \url{http://www.gencodegenes.org/}.
\end{itemize}
\subsubsection{Which annotations to use?}
The use of the most comprehensive annotations for a given species is strongly recommended.
Very importantly, chromosome names in the annotations GTF file have to match chromosome names in the FASTA genome sequence files. For example, one can use ENSEMBL FASTA files with ENSEMBL GTF files, and UCSC FASTA files with UCSC GTF files. However, since UCSC uses the \code{chr1, chr2, ...} naming convention, and ENSEMBL uses \code{1, 2, ...} naming, the ENSEMBL and UCSC FASTA and GTF files cannot be mixed together, unless chromosomes are renamed to match between the FASTA and GTF files.

\subsubsection{Annotations in GFF format.}
In addition to the aforementioned options, for GFF3 formatted annotations you need to use \opt{sjdbGTFtagExonParentTranscript} \optv{Parent}. In general, for \opt{sjdbGTFfile} files STAR only processes lines which have \opt{sjdbGTFfeatureExon} (=\optv{exon} by default) in the 3rd field (column). The exons are assigned to the transcripts using the parent-child relationship defined by the \opt{sjdbGTFtagExonParentTranscript} (=\optv{transcript\_id} by default) GTF/GFF attribute.

\subsubsection{Using a list of annotated junctions.}
STAR can also utilize annotations formatted as a list of splice junctions coordinates in a text file: \opt{sjdbFileChrStartEnd} \optvr{/path/to/sjdbFile.txt}. This file should contain 4 columns separated by tabs:
\codelines{Chr \quad{\textbackslash}tab\quad Start \quad{\textbackslash}tab\quad End \quad{\textbackslash}tab\quad Strand=+/-/.}
Here Start and End are the first and last bases of the introns (1-based chromosome coordinates).
This file can be used in addition to the \opt{sjdbGTFfile}, in which case STAR will extract junctions from both files.

Note that the \opt{sjdbFileChrStartEnd} file can contain duplicate (identical) junctions; STAR will collapse (remove) duplicate junctions.

\subsubsection{Very small genome.}
For small genomes, the parameter \opt{genomeSAindexNbases} \textbf{must} be scaled down, with a typical value of \code{min(14, log2(GenomeLength)/2 - 1)}.
For example, for a 1~megaBase genome this is equal to 9, and for a 100~kiloBase genome it is equal to 7.

\subsubsection{Genome with a large number of references.}
If you are using a genome with a large (\textgreater 5,000) number of references (chromosomes/scaffolds), you may need to reduce the \opt{genomeChrBinNbits} to reduce RAM consumption. The following scaling is recommended: \opt{genomeChrBinNbits} = \code{min(18,log2[max(GenomeLength/NumberOfReferences,ReadLength)])}. For example, for a 3~gigaBase genome with 100,000 chromosomes/scaffolds, this is equal to 15.

\section{Running mapping jobs.}\label{Running_mapping_jobs}
\subsection{Basic options.}
The basic options to run a mapping job are as follows:
\codelines{\opt{runThreadN} \optvr{NumberOfThreads}\\
\opt{genomeDir} \optvr{/path/to/genomeDir}\\
\opt{readFilesIn} \optvr{/path/to/read1} [\optvr{/path/to/read2}]
}

\begin{itemize}
\item[]
%\opt{runThreadN} option defines the number of threads to be used for mapping, it has to be set to the number of available cores on the server node.

\opt{genomeDir} specifies the path to the genome directory where the genome indices were generated (see \sechyperref{Generating_genome_indexes}).

\item[]
\opt{readFilesIn} specifies the name(s) (with path) of the files containing the sequences to be mapped (e.g. RNA-seq FASTQ files). If using Illumina paired-end reads, both the \optvr{read1} and \optvr{read2} files have to be supplied. STAR can process both FASTA and FASTQ files. Multi-line (i.e. sequence split in multiple lines) FASTA (but not FASTQ) files are supported.

If the read files are compressed, use the \opt{readFilesCommand} \optvr{UncompressionCommand} option, where \optvr{UncompressionCommand} is the un-compression command that takes the file name as an input parameter, and sends the uncompressed output to stdout.
For example, for gzipped files (*.gz) use
\code{\opt{readFilesCommand} \optv{zcat}}
OR
\code{\opt{readFilesCommand} \optv{gunzip -c}}.
For bzip2-compressed files, use
\code{\opt{readFilesCommand} \optv{bunzip2 -c}}.

\end{itemize}

\subsection{Mapping multiple files in one run.}
Multiple samples can be mapped in one run with a single output. This is equivalent to concatenating the read files before mapping, except that distinct read groups can be specified with the \opt{outSAMattrRGline} option to keep track of reads from different files. For single-end reads use a comma-separated list (no spaces around commas), e.g.:

\opt{readFilesIn} \optv{sample1.fq,sample2.fq,sample3.fq}

For paired-end reads, use a comma-separated list
for read1, followed by a space, followed by a comma-separated list for read2, e.g.:

\opt{readFilesIn}~\optv{s1read1.fq,s2read1.fq,s3read1.fq s1read2.fq,s2read2.fq,s3read2.fq}

For multiple read files, the corresponding read groups can be supplied with a space/comma/space-separated list in \opt{outSAMattrRGline}, e.g.

\opt{outSAMattrRGline} \optv{ID:sample1 , ID:sample2 , ID:sample3}

Note that this list is separated by commas surrounded by spaces (unlike the \opt{readFilesIn} list).

Another option for mapping multiple read files, especially convenient for a very large number of files, is to create a file manifest and supply it in \opt{readFilesManifest} \optv{/path/to/manifest.tsv}.
The manifest file should contain 3 tab-separated columns. For paired-end reads:

\ofilen{read1-file-name $tab$ read2-file-name $tab$ read-group-line}

For single-end reads, the 2nd column should contain a dash (\code{-}):

\ofilen{read1-file-name $tab$ - $tab$ read-group-line}

Spaces, but not tabs, are allowed in the file names.
If read-group-line does not start with ID:, it can only contain one ID field, and ID: will be added to it.
If read-group-line starts with ID:, it can contain several fields separated by $tab$, and all the fields will be copied verbatim into the SAM @RG header line.

\subsection{Advanced options.}
There are many advanced options that control STAR mapping behavior. All options are briefly described in \sechyperref{Description_of_all_options}.

\subsubsection{Using annotations at the mapping stage.}
Since 2.4.1a, the annotations can be included on the fly at the mapping step, without including them at the genome generation step. You can specify \opt{sjdbGTFfile} \optvr{/path/to/ann.gtf} and/or \opt{sjdbFileChrStartEnd} \optvr{/path/to/sj.tab}, as well as \opt{sjdbOverhang}, and any other \opt{sjdb*} options. The genome indices can be generated with or without another set of annotations/junctions. If the genome was generated with annotations, the new junctions will be added to the old ones. STAR will insert the junctions into the genome indices on the fly before mapping, which takes 1--2 minutes. The on-the-fly genome indices can be saved (for reuse) with \opt{sjdbInsertSave} \optv{All}, into the \optvr{\_STARgenome} directory inside the current run directory.
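
As an illustration, a mapping run that inserts annotations on the fly and saves the resulting indices might look as follows. This is only a sketch assembled from the options described above: the thread count and all paths are placeholders, and the \opt{sjdbOverhang} value of 99 assumes 2x100b reads.
\codelines{STAR \opt{runThreadN} \optvr{NumberOfThreads} \opt{genomeDir} \optvr{/path/to/genomeDir}\\
\opt{readFilesIn} \optvr{/path/to/read1} \optvr{/path/to/read2}\\
\opt{sjdbGTFfile} \optvr{/path/to/ann.gtf} \opt{sjdbOverhang} \optv{99}\\
\opt{sjdbInsertSave} \optv{All}
}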

\subsubsection{ENCODE options}
An example of ENCODE standard options for the long RNA-seq pipeline is given below:
\begin{itemize}
\item[]
\opt{outFilterType} BySJout\\
reduces the number of "spurious" junctions
\item[]
\opt{outFilterMultimapNmax} 20\\
max number of multiple alignments allowed for a read: if exceeded, the read is considered unmapped
\item[]
\opt{alignSJoverhangMin} 8\\
minimum overhang for unannotated junctions
\item[]
\opt{alignSJDBoverhangMin} 1\\
minimum overhang for annotated junctions
\item[]
\opt{outFilterMismatchNmax} 999\\
maximum number of mismatches per pair, large number switches off this filter
\item[]
\opt{outFilterMismatchNoverReadLmax} 0.04\\
max number of mismatches per pair relative to read length: for 2x100b, max number of mismatches is 0.04*200=8 for the paired read
\item[]
\opt{alignIntronMin} 20\\
minimum intron length
\item[]
\opt{alignIntronMax} 1000000\\
maximum intron length
\item[]
\opt{alignMatesGapMax} 1000000\\
maximum genomic distance between mates
\end{itemize}

\subsection{Using shared memory for the genome indexes.}
The \opt{genomeLoad} option controls how the genome is loaded into memory. By default (\opt{genomeLoad} \optvr{NoSharedMemory}), shared memory is not used.

With \opt{genomeLoad} \optvr{LoadAndKeep}, STAR loads the genome as a standard Linux shared memory piece. The genomes are identified by their unique directory paths. Before loading the genome, STAR checks if the genome has already been loaded into the shared memory. If the genome has not been loaded, STAR will load it and will keep it in memory even after the STAR job finishes. The genome will be shared with all the other STAR jobs. You can remove the genome from the shared memory by running STAR with \opt{genomeLoad} \optvr{Remove}. The shared memory piece will be physically removed only after all STAR jobs attached to it complete.
With \opt{genomeLoad} \optvr{LoadAndRemove}, STAR will load the genome into the shared memory, and mark it for removal, so that the genome will be removed from the shared memory once all STAR jobs using it exit. With \opt{genomeLoad} \optvr{LoadAndExit}, STAR will load the genome into the shared memory, and immediately exit, keeping the genome loaded in the shared memory for future runs.

If you need to check or remove shared memory pieces manually, use the standard Linux commands \code{ipcs} and \code{ipcrm}. If the genome residing in shared memory is not used for a long time, it may get paged out of RAM, which will slow down STAR runs considerably. It is strongly recommended to regularly re-load (i.e. remove and load again) the shared memory genomes.

Many standard Linux distributions do not allow large enough shared memory blocks. You can fix this issue if you have root privileges, or ask your system administrator to do it. To enable the shared memory, modify or add the following lines in /etc/sysctl.conf:\\
\code{kernel.shmmax = Nmax}\\
\code{kernel.shmall = Nall}\\
$Nmax$, $Nall$ numbers should be chosen as follows:\\
$Nmax > GenomeIndexSize=Genome + SA + SAindex$ ($\sim$31000000000 for human genome)\\
$Nall > GenomeIndexSize/PageSize$ \\
where PageSize is typically 4096 (this can be checked with \code{getconf PAGE\_SIZE}).
Then run:\\
\code{/sbin/sysctl -p}\\
This will increase the allowed shared memory blocks to $\sim$31GB, enough for the human or mouse genome.

\section{Output files.}\label{Output_files}
STAR produces multiple output files. All files have standard names; however, you can change the file prefixes using \opt{outFileNamePrefix} \optvr{/path/to/output/dir/prefix}. By default, this parameter is \optv{./}, i.e. all output files are written in the current directory.
\subsection{Log files.}
\begin{itemize}
\item[]
\ofilen{Log.out}: main log file with a lot of detailed information about the run.
This file is most useful for troubleshooting and debugging.
\item[]
\ofilen{Log.progress.out}: reports job progress statistics, such as the number of processed reads, \% of mapped reads etc. It is updated in $\sim$1 minute intervals.
\item[]
\ofilen{Log.final.out}: summary mapping statistics after the mapping job is complete, very useful for quality control. The statistics are calculated for each read (single- or paired-end) and then summed or averaged over all reads. Note that STAR counts a paired-end read as one read (unlike samtools flagstat/idxstats, which count each mate separately). Most of the information is collected about the UNIQUE mappers (unlike samtools flagstat/idxstats, which do not separate unique and multi-mappers). Each splicing is counted in the numbers of splices, which would correspond to summing the counts in \ofilen{SJ.out.tab}. The mismatch/indel error rates are calculated on a per base basis, i.e. as the total number of mismatches/indels in all unique mappers divided by the total number of mapped bases.
\end{itemize}

\subsection{SAM.}
\ofilen{Aligned.out.sam} - alignments in standard SAM format.
\subsubsection{Multimappers.}
The number of loci \code{Nmap} a read maps to is given by the \code{NH:i:Nmap} field. A value of 1 corresponds to unique mappers, while values \textgreater1 correspond to multi-mappers. The \code{HI} attribute enumerates multiple alignments of a read starting with 1 (this can be changed with \opt{outSAMattrIHstart}; setting it to 0 may be required for compatibility with downstream software such as Cufflinks).

The mapping quality MAPQ (column 5) is 255 for uniquely mapping reads, and int(-10*log10(1-1/Nmap)) for multi-mapping reads. This scheme is the same as the one used by TopHat and is compatible with Cufflinks. The default MAPQ=255 for the unique mappers may be changed with the \opt{outSAMmapqUnique} parameter (integer 0 to 255) to ensure compatibility with downstream tools such as GATK.
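
As a worked illustration of the multi-mapper formula above (arithmetic only; $N_{map}$ stands for \code{Nmap}, and int() is assumed to truncate toward zero):
\[
\begin{array}{lll}
N_{map}=2: & -10\log_{10}(1-1/2) \approx 3.01, & \mathrm{MAPQ}=3\\
N_{map}=3: & -10\log_{10}(1-1/3) \approx 1.76, & \mathrm{MAPQ}=1\\
N_{map}=4: & -10\log_{10}(1-1/4) \approx 1.25, & \mathrm{MAPQ}=1\\
N_{map}=5: & -10\log_{10}(1-1/5) \approx 0.97, & \mathrm{MAPQ}=0
\end{array}
\]
For $N_{map}>5$ the expression stays below 1, so under the same assumption MAPQ is 0.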

For multi-mappers, all alignments except one are marked with 0x100 (secondary alignment) in the FLAG (column 2 of the SAM). The unmarked alignment is selected from the best ones (i.e. highest scoring). This default behavior can be changed with the \opt{outSAMprimaryFlag} \optv{AllBestScore} option, which will output all alignments with the best score as primary alignments (i.e. 0x100 bit in the FLAG unset).

By default, the order of the multi-mapping alignments for each read is not truly random.
The \opt{outMultimapperOrder} \optv{Random} option outputs multiple alignments for each read in random order, and also randomizes the choice of the primary alignment from the highest scoring alignments. Parameter \opt{runRNGseed} can be used to set the random generator seed. With this option, the ordering of multi-mapping alignments of each read, and the choice of the primary alignment, will vary from run to run, unless only one thread is used and the seed is kept constant.

The \opt{outSAMmultNmax} parameter limits the number of output alignments (SAM lines) for multimappers. For instance, \opt{outSAMmultNmax} \optv{1} will output exactly one SAM line for each mapped read. Note that the \code{NH:i:} tag in STAR will still report the actual number of loci that the reads map to, while the number of reported alignments for a read in the SAM file is \code{min(NH,--outSAMmultNmax)}. If \opt{outSAMmultNmax} is equal to \optv{-1}, all the alignments are output according to the order specified in the \opt{outMultimapperOrder} option. If \opt{outSAMmultNmax} is not equal to -1, then top-scoring alignments will always be output first, even for the default \opt{outMultimapperOrder} \optv{Old\_2.4} option.


\subsubsection{SAM attributes.}
The SAM attributes can be specified by the user using the \opt{outSAMattributes} \optvr{A1 A2 A3 ...} option, which accepts a list of 2-character SAM attributes.
The attributes can be listed in any order, and will be recorded in that order in the SAM file. By default, STAR outputs \optv{NH HI AS nM} attributes.

\begin{itemize}[itemsep=1pt]
\item[]
\textbf{Presets:}
%
\item[]
\optv{None} : No SAM attributes
%
\item[]
\optv{Standard} : NH HI AS nM
%
\item[]
\optv{All} : NH HI AS nM NM MD jM jI MC ch

\textbf{Alignment:}
%
\item[]
\optv{NH} : number of loci the read maps to: $=1$ for unique mappers, $>1$ for multimappers. Standard SAM tag.
%
\item[]
\optv{HI} : multiple alignment index, starts with \opt{outSAMattrIHstart} ($=1$ by default). Standard SAM tag.
%
\item[]
\optv{AS} : local alignment score, $+1/-1$ for matches/mismatches, score* penalties for indels and gaps. For PE reads, total score for two mates. Standard SAM tag.
%
\item[]
\optv{NM} : edit distance to the reference (number of mismatched + inserted + deleted bases) for each mate. Standard SAM tag.
%
\item[]
\optv{nM} : number of mismatches per (paired) alignment, not to be confused with \optv{NM}, which is the number of mismatches+indels in each mate.
%
\item[]
\optv{jM:B:c,M1,M2,...} : intron motifs for all junctions (i.e. N in CIGAR): 0: non-canonical; 1: GT/AG, 2: CT/AC, 3: GC/AG, 4: CT/GC, 5: AT/AC, 6: GT/AT. If splice junctions database is used, and a junction is annotated, 20 is added to its motif value.
%
\item[]
\optv{MD} : string encoding mismatched and deleted reference bases (see standard SAM specifications). Standard SAM tag.
%
\item[]
\optv{jI:B:I,Start1,End1,Start2,End2,...} : Start and End of introns for all junctions (1-based).
%
\item[]
\optv{jM jI} : attributes require samtools 0.1.18 or later, and were reported to be incompatible with some downstream tools such as Cufflinks.

\textbf{Variation:}
%
\item[]
\optv{vA} : variant allele.
%
\item[]
\optv{vG} : genomic coordinate of the variant overlapped by the read.
%
\item[]
\optv{vW} : WASP filtering tag, see detailed description in Section \ref{section:WASP}. Requires \opt{waspOutputMode} \optv{SAMtag}.

\textbf{STARsolo:}
%
\item[]
\optv{CR CY UR UY} : sequences and quality scores of cell barcodes and UMIs for the solo* demultiplexing, not error corrected.
%
\item[]
\optv{GX GN} : gene ID and name.
%
\item[]
\optv{CB UB} : error-corrected cell barcodes and UMIs for solo* demultiplexing. Requires \opt{outSAMtype} \optv{BAM SortedByCoordinate}.
%
\item[]
\optv{sM} : assessment of CB and UMI.
%
\item[]
\optv{sS} : sequence of the entire barcode (CB,UMI,adapter...).
%
\item[]
\optv{sQ} : quality of the entire barcode.


\textbf{Unmapped reads:}
\item[]
\optv{uT} : for unmapped reads, reason for not mapping:
    \begin{itemize}[noitemsep,topsep=-3pt]
        \item[] 0 : no acceptable seed/windows, "Unmapped other" in the Log.final.out
        \item[] 1 : best alignment shorter than min allowed mapped length, "Unmapped: too short" in the Log.final.out
        \item[] 2 : best alignment has more mismatches than max allowed number of mismatches, "Unmapped: too many mismatches" in the Log.final.out
        \item[] 3 : read maps to more loci than the max number of multimapping loci, "Multimapping: mapped to too many loci" in the Log.final.out
        \item[] 4 : unmapped mate of a mapped paired-end read
    \end{itemize}

\end{itemize}

\subsubsection{Compatibility with Cufflinks/Cuffdiff.}
For unstranded RNA-seq data, Cufflinks/Cuffdiff require spliced alignments with the \optv{XS} strand attribute, which STAR will generate with the \opt{outSAMstrandField} \optv{intronMotif} option. As required, the XS strand attribute will be generated for all alignments that contain splice junctions. The spliced alignments that have undefined strand (i.e.
containing only non-canonical unannotated junctions) will be suppressed.

If you have stranded RNA-seq data, you do not need to use any specific STAR options. Instead, you need to run Cufflinks with the \optn{library-type} option. For example, \code{cufflinks ... --library-type fr-firststrand} should be used for the ``standard'' dUTP protocol, including Illumina's stranded TruSeq. This option has to be used only for Cufflinks runs and not for STAR runs.

In addition, it is recommended to remove the non-canonical junctions for Cufflinks runs using \opt{outFilterIntronMotifs} \optv{RemoveNoncanonical}.

\subsection{Unsorted and sorted-by-coordinate BAM.}
STAR can output alignments directly in binary BAM format, thus saving time on converting SAM files to BAM. It can also sort BAM files by coordinates, which is required by many downstream applications.
\begin{itemize}
\raggedright
\item[]
\opt{outSAMtype} \optv{BAM Unsorted}\\
output unsorted \ofilen{Aligned.out.bam} file. The paired ends of an alignment are always adjacent, and multiple alignments of a read are adjacent as well. This "unsorted" file can be directly used with downstream software such as \code{HTseq}, without the need for name sorting. The order of the reads will match that of the input FASTQ(A) files only if one thread is used (\opt{runThreadN} \optv{1}), and \opt{outFilterType} \optv{BySJout} is \textbf{not} used.
\item[]
\opt{outSAMtype} \optv{BAM SortedByCoordinate}\\
output sorted by coordinate \ofilen{Aligned.sortedByCoord.out.bam} file, similar to the \code{samtools sort} command. If this option causes problems, it is recommended to reduce \opt{outBAMsortingThreadN} from the default $6$ to lower values (as low as 1).
\item[]
\opt{outSAMtype} \optv{BAM Unsorted SortedByCoordinate}\\
output both unsorted and sorted files.
\end{itemize}

\subsection{Unmapped reads.}
Unmapped reads can be output into the SAM/BAM \ofilen{Aligned.*} file(s) with the
\opt{outSAMunmapped} \optv{Within} option. \opt{outSAMunmapped} \optv{Within KeepPairs} will (redundantly) record the unmapped mate for each alignment, and, in case of unsorted output, keep it adjacent to its mapped mate (this only affects multi-mapping reads).
The \optv{uT} SAM tag indicates the reason for not mapping:
\begin{itemize}[noitemsep]
        \item[] 0 : no acceptable seed/windows, "Unmapped other" in the Log.final.out
        \item[] 1 : best alignment shorter than min allowed mapped length, "Unmapped: too short" in the Log.final.out
        \item[] 2 : best alignment has more mismatches than max allowed number of mismatches, "Unmapped: too many mismatches" in the Log.final.out
        \item[] 3 : read maps to more loci than the max number of multimapping loci, "Multimapping: mapped to too many loci" in the Log.final.out
        \item[] 4 : unmapped mate of a mapped paired-end read
\end{itemize}

\opt{outReadsUnmapped} \optv{Fastx} will output unmapped and partially mapped (i.e. only one mate of a paired-end read mapped) reads into separate file(s) \ofilen{Unmapped.out.mate1(2)}, formatted the same way as the input read files (i.e. FASTQ or FASTA). Tags appended to the read name line indicate the mapping status of the read mates:
\begin{itemize}[noitemsep,topsep=-3pt]
    \item[] $00$: mates were not mapped;
    \item[] $10$: 1st mate mapped, 2nd unmapped
    \item[] $01$: 1st unmapped, 2nd mapped
\end{itemize}

\subsection{Splice junctions.}
\ofilen{SJ.out.tab} contains high confidence collapsed splice junctions in tab-delimited format.
Note that STAR defines the junction start/end as intronic bases, while many other software packages define them as exonic bases.
The columns have the following meaning:
%\begin{enumerate}[label=\bfseries Exercise \arabic*:]
\begin{itemize}[leftmargin=1in]
\item[column 1:] chromosome
\item[column 2:] first base of the intron (1-based)
\item[column 3:] last base of the intron (1-based)
\item[column 4:] strand (0: undefined, 1: +, 2: -)
\item[column 5:] intron motif: 0: non-canonical, 1: GT/AG, 2: CT/AC, 3: GC/AG, 4: CT/GC, 5: AT/AC, 6: GT/AT
\item[column 6:] 0: unannotated, 1: annotated in the splice junctions database. Note that in 2-pass mode, junctions detected in the 1st pass are reported as annotated, in addition to annotated junctions from GTF.
\item[column 7:] number of uniquely mapping reads crossing the junction
\item[column 8:] number of multi-mapping reads crossing the junction
\item[column 9:] maximum spliced alignment overhang
\end{itemize}
The filtering for this output file is controlled by the \optn{outSJfilter*} parameters, as described in \sechyperref{Output_Filtering:_Splice_Junctions}.
%\subsection{Wiggle/bedGraph.}

\section{Chimeric and circular alignments.}
To switch on detection of chimeric (fusion) alignments (in addition to normal mapping), \opt{chimSegmentMin} should be set to a positive value. Each chimeric alignment consists of two "segments". Each segment is non-chimeric on its own, but the segments are chimeric to each other (i.e. the segments belong to different chromosomes, or different strands, or are far from each other). Both segments may contain splice junctions, and one of the segments may contain portions of both mates. The \opt{chimSegmentMin} parameter controls the minimum allowed mapped length of each of the two segments. For example, if you have 2x75 reads and use \opt{chimSegmentMin} \optv{20}, a chimeric alignment with 130 bases on one chromosome and 20 bases on the other will be output, while one with 135 + 15 bases will not be.
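
The length rule in the 2x75 example above can be sketched as a small shell function. This is only an illustration of the segment-length criterion as described in this section, not STAR's actual implementation; the function name \code{chim\_length\_ok} is hypothetical, and other chimeric filters are ignored.
\begin{verbatim}
# Hypothetical helper: checks only the segment-length rule described above.
# usage: chim_length_ok SEG1_LEN SEG2_LEN CHIM_SEGMENT_MIN
chim_length_ok() {
    # both segments must be at least CHIM_SEGMENT_MIN bases long
    if [ "$1" -ge "$3" ] && [ "$2" -ge "$3" ]; then
        echo "pass"
    else
        echo "fail"
    fi
}

chim_length_ok 130 20 20   # prints "pass"
chim_length_ok 135 15 20   # prints "fail"
\end{verbatim}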

\subsection{STAR-Fusion.}
STAR-Fusion is a software package for detecting fusion transcripts from STAR chimeric output. It is developed and maintained by Brian Haas (@Broad Institute), whose effort was inspired by earlier work done by Nicolas Stransky in the landmark publication "The landscape of kinase fusions in cancer" by Stransky et al., Nat Commun 2014, in addition to very nice work done by Daniel Nicorici with his FusionCatcher software. Please visit its GitHub page for instructions and documentation: \url{https://github.com/STAR-Fusion/STAR-Fusion}.

\subsection{Chimeric alignments in the main BAM files.}
Chimeric alignments can be included together with normal alignments in the main (sorted or unsorted) BAM file(s) using \opt{chimOutType} \optv{WithinBAM}. In these files, formatting of chimeric alignments follows the latest SAM/BAM specifications.

\subsection{Chimeric alignments in \ofilen{Chimeric.out.sam}.}
With \opt{chimOutType} \optv{SeparateSAMold} STAR will output normal alignments into \ofilen{Aligned.*.sam/bam}, and will output chimeric alignments into a separate file \ofilen{Chimeric.out.sam}. Note that this option will be deprecated in the future, and \opt{chimOutType} \optv{WithinBAM} is strongly recommended.
Some reads may be output to both the normal SAM/BAM files and \ofilen{Chimeric.out.sam}, for the following reason. STAR will output a non-chimeric alignment into \ofilen{Aligned.out.sam}, soft-clipping a portion of the read. If this portion is long enough, and it maps well and uniquely somewhere else in the genome, there will also be a chimeric alignment output into \ofilen{Chimeric.out.sam}. For instance, consider a paired-end read where the second mate can be split chimerically into 70 and 30 bases. The 100 bases of the first mate plus 70 bases of the 2nd mate map non-chimerically, and the mapping length/score are big enough, so they will be output into the \ofilen{Aligned.out.sam} file.
At the same time, the chimeric segments 100-mate1 + 70-mate2 and 30-mate2 will be output into \ofilen{Chimeric.out.sam}.

\subsection{Chimeric alignments in \ofilen{Chimeric.out.junction}}
By default, or with \opt{chimOutType} \optv{Junctions}, STAR will generate the \ofilen{Chimeric.out.junction} file, which may be more convenient for downstream analysis.
The format of this file is as follows. Every line contains one chimerically aligned read, e.g.:
\begin{verbatim}
chr22 23632601 + chr9 133729450 + 1 0 0
SINATRA-0006:3:3:6387:5665#0 23632554 47M29S 133729451 47S29M40p76M
\end{verbatim}

The first 9 columns give information about the chimeric junction:
\begin{itemize}[leftmargin=1in]
\item[column 1:] \textbf{chr\_donorA} : chromosome of the donor
\item[column 2:] \textbf{brkpt\_donorA} : first base of the intron of the donor (1-based)
\item[column 3:] \textbf{strand\_donorA} : strand of the donor
\item[column 4:] \textbf{chr\_acceptorB} : chromosome of the acceptor
\item[column 5:] \textbf{brkpt\_acceptorB} : first base of the intron of the acceptor (1-based)
\item[column 6:] \textbf{strand\_acceptorB} : strand of the acceptor
\item[column 7:] \textbf{junction\_type} : -1=encompassing junction (between the mates), 1=GT/AG, 2=CT/AC
\item[column 8:] \textbf{repeat\_left\_lenA} : repeat length to the left of the junction
\item[column 9:] \textbf{repeat\_right\_lenB} : repeat length to the right of the junction
\end{itemize}

Columns 10-14 describe the alignments of the two chimeric segments in a SAM-like format.
Alignments are given with respect to the (+) strand.
\begin{itemize}[leftmargin=1in]
\item[column 10:] \textbf{read\_name} : name of the RNA-seq fragment
\item[column 11:] \textbf{start\_alnA} : first base of the first segment (on the + strand)
\item[column 12:] \textbf{cigar\_alnA} : CIGAR of the first segment
\item[column 13:] \textbf{start\_alnB} : first base of the second segment
\item[column 14:] \textbf{cigar\_alnB} : CIGAR of the second segment
\end{itemize}

Columns 15-20 provide alignment score information and relevant metadata. These columns are only output for the multimapping chimeric algorithm, \opt{chimMultimapNmax} \optv{>0}.
\begin{itemize}[leftmargin=1in]
\item[column 15:] \textbf{num\_chim\_aln} : number of sufficiently scoring chimeric alignments reported for this RNA-seq fragment.
\item[column 16:] \textbf{max\_poss\_aln\_score} : maximum possible alignment score for this fragment's read(s).
\item[column 17:] \textbf{non\_chim\_aln\_score} : best non-chimeric alignment score
\item[column 18:] \textbf{this\_chim\_aln\_score} : score for this individual chimeric alignment
\item[column 19:] \textbf{bestall\_chim\_aln\_score} : the highest chimeric alignment score encountered for this RNA-seq fragment among the \textbf{num\_chim\_aln} reported chimeric alignments.
\item[column 20:] \textbf{PEmerged\_bool} : boolean indicating that overlapping PE reads were first merged into a single contiguous sequence before alignment.
\item[column 21:] \textbf{readgrp} : read group assignment for the read as indicated in the BAM file
\end{itemize}

Unlike standard SAM, both mates are recorded in one line here. The gap of length \code{L} between the mates is marked by \code{p} in the CIGAR string (e.g. \code{40p} in the example above).
If the mates overlap, \code{L<0}.

For strand definitions, when aligning paired-end reads, the sequence of the second mate is reverse complemented.

For encompassing junctions, i.e.
junction type: -1=junction is between the mates, columns 2 and 5 represent the bounds on the chimeric junction loci. For the 1st mate, it will be the genomic base following the last 3' mapped base. For the 2nd mate (which is reverse complemented to have the same orientation as the 1st mate), it will be the genomic base preceding the 5' mapped base. For example, if there is a chimeric junction that connects chr1/+strand/base1000 to chr2/+strand/base2000, and read 1 maps to chr1/+strand/bases800-900, and read 2 (after reverse complementing) maps to chr2/+strand/bases2100-2200, then columns 2 and 5 will have 901 and 2099.

To filter chimeric junctions and find the number of reads supporting each junction you could use, for example:
\begin{verbatim}
cat Chimeric.out.junction |
awk '$1!="chrM" && $4!="chrM" && $7>0 && $8+$9<=5 {print $1,$2,$3,$4,$5,$6,$7,$8,$9}' |
sort | uniq -c | sort -k1,1rn
\end{verbatim}
This will keep only the canonical junctions with a total repeat length of at most 5 and will remove chimeras involving the mitochondrial genome.

When I do it for one of our K562 runs, I get:
\begin{verbatim}
181 chr1 144676873 - chr1 147917466 + 1 0 1
 29 chr5 69515744 - chr5 34182973 - 1 3 1
 28 chr1 143910077 - chr1 149459550 - 1 1 0
 27 chr22 23632601 + chr9 133729450 + 1 0 0
 20 chr12 90313405 - chr21 40684813 - 1 2 0
 20 chr22 23632601 + chr9 133655755 + 1 0 1
 20 chr9 123636256 - chr9 123578959 + 1 1 4
 15 chr16 85589970 + chr6 16762582 + 1 3 2
 15 chr3 197348574 - chr3 195392936 + 1 1 0
 14 chr18 39584506 + chr18 39560613 - 1 2 0
\end{verbatim}
Note that lines 4 and 6 here are BCR/ABL fusions. You would need to filter these junctions further to see which of them connect known but not homologous genes.
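
One possible further filtering step is sketched below, under the assumption that inter-chromosomal candidates with a minimum number of supporting reads are of interest. The function name \code{chim\_interchrom} and the read-support threshold are illustrative and not part of STAR; checking whether the connected genes are known and non-homologous would still require annotations.
\begin{verbatim}
# Hypothetical helper: reads Chimeric.out.junction-style lines on stdin and
# prints canonical inter-chromosomal junctions with their read support.
# usage: chim_interchrom MIN_READS < Chimeric.out.junction
chim_interchrom() {
    # $1!=$4  : donor and acceptor are on different chromosomes
    # $7+0>0  : canonical junction types only; the +0 forces a numeric
    #           comparison, so non-numeric (non-data) lines are skipped
    awk '$1!=$4 && $7+0>0 {print $1,$2,$3,$4,$5,$6}' |
    sort | uniq -c |
    awk -v min="$1" '$1>=min {print}' |
    sort -k1,1rn
}
\end{verbatim}
For example, \code{chim\_interchrom 10 < Chimeric.out.junction} would list inter-chromosomal junctions supported by at least 10 reads, most supported first.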


\section{Output in transcript coordinates.}
With the \opt{quantMode} \optv{TranscriptomeSAM} option STAR will output alignments translated into transcript coordinates in the \ofilen{Aligned.toTranscriptome.out.bam} file (in addition to alignments in genomic coordinates in \ofilen{Aligned.*.sam/bam} files). These transcriptomic alignments can be used with various transcript quantification software that require reads to be mapped to the transcriptome, such as RSEM or eXpress. For example, the RSEM command line would look as follows:
\codelines{rsem-calculate-expression ... --bam Aligned.toTranscriptome.out.bam /path/to/RSEM/reference RSEM}

Note that STAR first aligns reads to the entire genome, and only then searches for concordance between alignments and transcripts. This approach offers certain advantages compared to alignment to the transcriptome only, by not forcing the alignments to annotated transcripts. Note that the \opt{outFilterMultimapNmax} filter only applies to genomic alignments. If an alignment passes this filter, it is converted to all possible transcriptomic alignments and all of them are output.

By default, the output satisfies RSEM requirements: soft-clipping or indels are not allowed. Use \opt{quantTranscriptomeBan} \optv{Singleend} to allow insertions, deletions and soft-clips in the transcriptomic alignments, which can be used by some expression quantification software (e.g. eXpress).

\section{Counting number of reads per gene.}
With the \opt{quantMode} \optv{GeneCounts} option STAR will count the number of reads per gene while mapping.
A read is counted if it overlaps (1nt or more) one and only one gene. Both ends of the paired-end read are checked for overlaps.
The counts coincide with those produced by htseq-count with default parameters.
This option requires annotations in GTF format (i.e. \emph{gene\_id} tag for each exon) specified with \opt{sjdbGTFfile} at the genome generation step or at the mapping step.
STAR outputs read counts per gene into the \ofilen{ReadsPerGene.out.tab} file with 4 columns, which correspond to different strandedness options:
\begin{itemize}[leftmargin=1in]
\item[column 1:] gene ID
\item[column 2:] counts for unstranded RNA-seq
\item[column 3:] counts for the 1st read strand aligned with RNA (htseq-count option -s yes)
\item[column 4:] counts for the 2nd read strand aligned with RNA (htseq-count option -s reverse)
\end{itemize}
Select the output according to the strandedness of your data.
Note that if you have stranded data and choose one of the columns 3 or 4, the other column (4 or 3) will give you the count of antisense reads.
With \opt{quantMode} \optv{TranscriptomeSAM} \optv{GeneCounts}, STAR will produce both the \ofilen{Aligned.toTranscriptome.out.bam} and \ofilen{ReadsPerGene.out.tab} outputs.


\section{2-pass mapping.}

For the most sensitive novel junction discovery, it is recommended to run STAR in the 2-pass mode. It does not significantly increase the number of detected novel junctions, but allows detection of more spliced reads mapping to novel junctions. The basic idea is to run the 1st pass of STAR mapping with the usual parameters, then collect the junctions detected in the first pass, and use them as "annotated" junctions for the 2nd pass mapping.

\subsection{Multi-sample 2-pass mapping.}
For a study with multiple samples, it is recommended to collect 1st pass junctions from all samples.
\begin{enumerate}
\item Run the 1st mapping pass for all samples with "usual" parameters. Using annotations is recommended either at the genome generation step or at the mapping step.
\item Run the 2nd mapping pass for all samples, listing SJ.out.tab files from all samples in \opt{sjdbFileChrStartEnd} \optvr{/path/to/sj1.tab /path/to/sj2.tab ...}.
\end{enumerate}

\subsection{Per-sample 2-pass mapping.}

Annotated junctions will be included in both the 1st and 2nd passes.
To run STAR 2-pass mapping for each sample separately, use the \opt{twopassMode} \optv{Basic} option. STAR will perform the 1st pass mapping, then it will automatically extract junctions, insert them into the genome index, and, finally, re-map all reads in the 2nd mapping pass. This option can be used with annotations, which can be included either at the run-time (see \#1), or at the genome generation step.

\opt{twopass1readsN} defines the number of reads to be mapped in the 1st pass. The default and most sensitive approach is to set it to -1 (or make it bigger than the number of reads in the sample), in which case all reads in the input read file(s) are used in the 1st pass. While mapping only a portion of the reads in the 1st pass can reduce mapping time by $\sim40\%$, this is not recommended, since it will significantly reduce sensitivity for lowly expressed novel junctions. The idea to use a portion of the reads in the 1st pass was inspired by Kim, Langmead and Salzberg in Nature Methods 12, 357--360 (2015).

\subsection{2-pass mapping with re-generated genome.}
This is the original 2-pass method, which involves a genome re-generation step in-between the 1st and 2nd passes. Since 2.4.1a, it is recommended to use the on-the-fly 2-pass options as described above.
\begin{enumerate}
\item Run 1st pass STAR for all samples with "usual" parameters. Genome indices generated with annotations are recommended.
\item Collect all junctions detected in the 1st pass by merging \ofilen{SJ.out.tab} files from all runs. Filter the junctions by removing likely false positives, e.g. junctions in the mitochondrion genome, or non-canonical junctions supported by a few reads. If you are using annotations, only novel junctions need to be considered here, since annotated junctions will be re-used in the 2nd pass anyway.
\item Use the filtered list of junctions from the 1st pass with the \opt{sjdbFileChrStartEnd} option, together with annotations (via the \opt{sjdbGTFfile} option), to generate the new genome indices for the 2nd pass mapping. This needs to be done only once for all samples.
\item Run the 2nd pass mapping for all samples with the new genome index.
\end{enumerate}

\section{Merging and mapping of overlapping paired-end reads.}
This feature improves mapping accuracy for paired-end libraries with short insert sizes, where many reads have overlapping mates. Importantly, it allows detection of chimeric junctions in the overlap region.

STAR will search for an overlap between mates of at least \opt{peOverlapNbasesMin} bases, with the proportion of mismatches in the overlap area not exceeding \opt{peOverlapMMp}.
If the overlap is found, STAR will merge the mates and attempt to map the resulting (single-end) sequence.
If requested, the chimeric detection will be performed on the merged-mate sequence, thus allowing chimeric detection in the overlap region.
If the score of this alignment is higher than the original one, or if a chimeric alignment is found, STAR will report the merged-mate alignment instead of the original one.
In the output, the merged-mate alignment will be converted back to paired-end format.

The development of this algorithm was supported by Illumina, Inc.
Many thanks to June Snedecor, Xiao Chen, and Felix Schlesinger for their extensive help in developing this feature.

\section{Detection of personal variants overlapping alignments.}
Option \opt{varVCFfile} \optvr{/path/to/vcf/file} is used to input a VCF file with personal variants. Only single nucleotide variants (SNVs) are supported at the moment.
Each variant is expected to have a genotype with two alleles.
To output variants that overlap alignments, vG and vA have to be added to the \opt{outSAMattributes} list.
SAM attribute vG outputs the genomic coordinate of the variant, allowing for identification of the variant.
SAM attribute vA outputs which allele is detected in the read: $1$ or $2$ match one of the genotype alleles, $3$ - no match to genotype.

\section{WASP filtering of allele specific alignments.} \label{section:WASP}
This is a re-implementation of the original WASP algorithm by Bryce van de Geijn, Graham McVicker, Yoav Gilad and Jonathan K Pritchard. Please cite the original WASP paper: Nature Methods 12, 1061--1063 (2015) \url{https://www.nature.com/articles/nmeth.3582}.
WASP filtering is activated with \opt{waspOutputMode} \optv{SAMtag}, which will add the \optv{vW} tag to the SAM output:
\optv{vW:i:1} means the alignment passed WASP filtering, and all other values mean it did not pass:

\optv{vW:i:2} - multi-mapping read

\optv{vW:i:3} - variant base in the read is N (non-ACGT)

\optv{vW:i:4} - remapped read did not map

\optv{vW:i:5} - remapped read multi-maps

\optv{vW:i:6} - remapped read maps to a different locus

\optv{vW:i:7} - read overlaps too many variants

\section{STARconsensus}
STARconsensus allows for mapping RNA-seq reads to a consensus genome. It was introduced in STAR 2{.}7{.}7a (2020/12/28).

\begin{itemize}
\item
Provide the VCF file with consensus SNVs and InDels at the genome generation stage with \opt{genomeTransformVCF} \optv{Variants.vcf} \opt{genomeTransformType} \optv{Haploid}.
The alternative alleles in this VCF will be inserted into the reference genome to create a "transformed" genome.
Both the genome sequence and transcript/gene annotations are transformed.

\item
At the mapping stage, the reads will be mapped to the transformed (consensus) genome.
The quantification in the transformed annotations can be performed with standard \opt{quantMode} \optv{TranscriptomeSAM} and/or \optv{GeneCounts} options.
If desired, alignments (SAM/BAM) and spliced junctions (SJ.out.tab) can be transformed back to the original (reference) coordinates with \opt{genomeTransformOutput} \optv{SAM} and/or \optv{SJ}.
This is useful if downstream processing relies on reference coordinates.
\end{itemize}

\section{Detection of multimapping chimeras.}
The previous STAR chimeric detection algorithm only detected uniquely mapping chimeras, which reduced its sensitivity in some cases.
The new algorithm can detect and output multimapping chimeras. Presently, only output into \ofilen{Chimeric.out.junction} is supported.
This algorithm is activated by setting \optv{chimMultimapNmax}, which defines the maximum number of chimeric multi-alignments, to a value $>0$.
The \optv{chimMultimapScoreRange} ($=1$ by default) parameter defines the score range for multi-mapping chimeras below the best chimeric score, similar to the \optv{outFilterMultimapScoreRange} parameter for normal alignments.
The \optv{chimNonchimScoreDropMin} ($=20$ by default) parameter defines the threshold triggering chimeric detection: the drop in the best non-chimeric alignment score with respect to the read length has to be greater than this value.

\section{STARsolo: mapping, demultiplexing and gene quantification for single cell RNA-seq}

STARsolo is a turnkey solution for analyzing droplet single cell RNA sequencing data (e.g. 10X Genomics Chromium System) built directly into STAR code.
STARsolo takes the raw FASTQ read files as input, and performs the following operations:
\begin{itemize}
 \itemsep -0.5em
 \item
 error correction and demultiplexing of cell barcodes using user-input whitelist
 \item
 mapping the reads to the reference genome using the standard STAR spliced read alignment algorithm
 \item
 error correction and collapsing (deduplication) of Unique Molecular Identifiers (UMIs)
 \item
 quantification of per-cell gene expression by counting the number of reads per gene
\end{itemize}
STARsolo output is designed to be a drop-in replacement for 10X CellRanger gene quantification output.
It follows CellRanger logic for cell barcode whitelisting and UMI deduplication, and produces nearly identical gene counts in the same format. At the same time STARsolo is $\sim$10 times faster than CellRanger.

The STARsolo \optn{solo*} options can be found in Section \ref{STARsolo_(single_cell_RNA-seq)_parameters}.
For a more detailed description, please see \url{https://github.com/alexdobin/STAR/blob/master/docs/STARsolo.md}.

\subsection{Feature statistics summaries.}
Feature statistics summaries are recorded in the \optvr{Solo.out/} directory in files \optvr{<Feature>.stats} where features are those used in the \opt{soloFeatures} option, e.g. \optvr{Gene.stats}. The following metrics are recorded:
\begin{itemize}[leftmargin=1.5in]
 \itemsep -0.3em
 \item[\optv{nNinBarcode:}] number of reads with more than 2 Ns in cell barcode (CB)
 \item[\optv{nUMIhomopolymer:}] number of reads with homopolymer in UMI
 \item[\optv{nTooMany:}] not used at the moment
 \item[\optv{nNoMatch:}] number of reads with CBs that do not match whitelist even with one mismatch
\end{itemize}
All of the above reads are discarded from Solo output. Remaining reads are checked for overlap with features (e.g.
genes):
\begin{itemize}[leftmargin=2in]
 \itemsep -0.3em
 \item[\optv{nUnmapped:}] number of reads unmapped to the genome
 \item[\optv{nNoFeature:}] number of reads that map to the genome but do not belong to a feature
 \item[\optv{nAmbigFeature:}] number of reads that belong to more than one feature
 \item[\optv{nAmbigFeatureMultimap:}] number of reads that belong to more than one feature and are also multimapping to the genome (this is a subset of the nAmbigFeature)
 \item[\optv{nTooMany:}] number of reads with ambiguous CB (i.e. CB matches whitelist with one mismatch but with posterior probability $<0.95$)
 \item[\optv{nNoExactMatch:}] number of reads with CB that matches a whitelist barcode with 1 mismatch, but this whitelist barcode does not get any other reads with exact matches of CB
\end{itemize}
All of the reads above are output in feature (e.g. gene) / cell count matrices.
\begin{itemize}[leftmargin=1.5in]
 \itemsep -0.3em
 \item[\optv{nExactMatch:}] number of reads with CB that match the whitelist exactly
 \item[\optv{nMatch:}] total number of reads that match CB with 0 or 1 mismatches (this is a superset of nExactMatch)
 \item[\optv{nCellBarcodes:}] number of distinct CBs detected
 \item[\optv{nUMIs:}] number of distinct UMIs detected
\end{itemize}

These metrics can be grouped into broader categories:
\begin{itemize}
 \itemsep -0.3em
 \item[]\optv{nNinBarcode+nUMIhomopolymer+nNoMatch+nTooMany+nNoExactMatch} = number of reads with CBs that do not match whitelist.
 \item[]\optv{nUnmapped+nAmbigFeature} = number of reads without defined feature (gene)
 \item[]\optv{nMatch} = number of reads that are output as solo counts

\end{itemize}
The three categories above summed together should be equal to the total number of reads.
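
The grouping above can be checked with a short awk sketch. It assumes, without confirmation from this manual, that the statistics file holds one whitespace-separated \code{name value} pair per line; the function name \code{solo\_stats\_total} is hypothetical.
\begin{verbatim}
# Hypothetical helper: sums the three broad categories from a
# <Feature>.stats file, ASSUMING one "name value" pair per line.
# usage: solo_stats_total Solo.out/Gene.stats
solo_stats_total() {
    awk '
        $1=="nNinBarcode" || $1=="nUMIhomopolymer" || $1=="nNoMatch" ||
        $1=="nTooMany" || $1=="nNoExactMatch"      { noCB += $2 }
        $1=="nUnmapped" || $1=="nAmbigFeature"     { noFeature += $2 }
        $1=="nMatch"                               { counted += $2 }
        # prints: no-CB-match, no-feature, counted, and their sum
        END { print noCB+0, noFeature+0, counted+0, noCB+noFeature+counted }
    ' "$1"
}
\end{verbatim}
The last printed number can then be compared with the total number of input reads.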

\section{Description of all options.}\label{Description_of_all_options}
For each STAR version, the most up-to-date information about all STAR parameters can be found in the \code{parametersDefault} file in the STAR source directory. The parameters in the \code{parametersDefault}, as well as in the descriptions below, are grouped by function:
\begin{itemize}
\item[] Special attention has to be paid to parameters that start with \optn{out*}, as they control the STAR output.
\item[] In particular, \optn{outFilter*} parameters control the filtering of output alignments, which you might want to tweak to fit your needs.
\item[] Output of ``chimeric'' alignments is controlled by \optn{chim*} parameters.
\item[] Genome generation is controlled by \optn{genome*} parameters.
\item[] Annotations (splice junction database) are controlled by \optn{sjdb*} options at the genome generation step.
\item[] Tweaking \optn{score*}, \optn{align*}, \optn{seed*}, \optn{win*} parameters, which requires understanding of the STAR alignment algorithm, is recommended only for advanced users.
\end{itemize}

Below, allowed parameter values are typeset in blue, and default values in green.




\newcommand{\pright}[1]{\begin{flushright} \begin{minipage}{0.8\textwidth}\raggedright #1 \end{minipage} \end{flushright}}

\newcommand{\optSection}[1]{\subsection{#1}}
\newcommand{\optName}[1]{\hypertarget{#1}{\textcolor{violet}{\texttt{--#1}}}}
\newcommand{\optValue}[1]{\pright{\textcolor[rgb]{0,0.5,0}{default: \texttt{#1}}}}
\newcommand{\optLine}[1]{\pright{#1}}
\newenvironment{optTable}{}{}
\newenvironment{optOptTable}{\begin{flushright} \begin{minipage}{0.75\textwidth}\raggedright }{\end{minipage}\end{flushright}}

\newcommand{\optOpt}[1]{\textcolor{blue}{\texttt{#1}}\par}
\newcommand{\optOptLine}[1]{{\hangpara{0.3in}{0}#1\par}}

\input{parametersDefault.tex}


\end{document}