1
Introduction
============

What is HISAT2?
---------------
7
HISAT2 is a fast and sensitive alignment program for mapping next-generation sequencing reads
(whole-genome, transcriptome, and exome sequencing data) against the general human population
(as well as against a single reference genome). Based on [GCSA] (an extension of [BWT] for graphs), we designed and implemented a graph FM index (GFM),
an original approach and, to the best of our knowledge, its first implementation.
In addition to using one global GFM index that represents the general population,
HISAT2 uses a large set of small GFM indexes that collectively cover the whole genome
(each index represents a genomic region of 56 Kbp, and 55,000 indexes are needed to cover the human population).
These small indexes (called local indexes), combined with several alignment strategies, enable effective alignment of sequencing reads.
This new indexing scheme is called the Hierarchical Graph FM index (HGFM).
We developed HISAT2 based on the [HISAT] and [Bowtie2] implementations.
HISAT2 outputs alignments in [SAM] format, enabling interoperation with a large number of other tools (e.g. [SAMtools], [GATK]) that use SAM.
HISAT2 is distributed under the [GPLv3 license], and it runs on the command line under
Linux, Mac OS X and Windows.
21
22[HISAT2]:          http://ccb.jhu.edu/software/hisat2
23[HISAT]:           http://ccb.jhu.edu/software/hisat
24[Bowtie2]:         http://bowtie-bio.sf.net/bowtie2
25[Bowtie]:          http://bowtie-bio.sf.net
26[Bowtie1]:         http://bowtie-bio.sf.net
27[GCSA]:            http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6698337&tag=1
28[Burrows-Wheeler Transform]: http://en.wikipedia.org/wiki/Burrows-Wheeler_transform
29[BWT]:             http://en.wikipedia.org/wiki/Burrows-Wheeler_transform
30[FM Index]:        http://en.wikipedia.org/wiki/FM-index
31[SAM]:             http://samtools.sourceforge.net/SAM1.pdf
32[SAMtools]:        http://samtools.sourceforge.net
33[GATK]:            http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit
34[TopHat2]:         http://ccb.jhu.edu/software/tophat
35[Cufflinks]:       http://cufflinks.cbcb.umd.edu/
36[Crossbow]:        http://bowtie-bio.sf.net/crossbow
37[Myrna]:           http://bowtie-bio.sf.net/myrna
38[Bowtie paper]:    http://genomebiology.com/2009/10/3/R25
39[GPLv3 license]:   http://www.gnu.org/licenses/gpl-3.0.html
40
Obtaining HISAT2
================

Download HISAT2 sources and binaries from the Releases section on the right side.
Binaries are available for Intel architectures (`x86_64`) running Linux and Mac OS X.
46
Building from source
--------------------

Building HISAT2 from source requires a GNU-like environment with GCC, GNU Make
and other basics.  It should be possible to build HISAT2 on most vanilla Linux
installations or on a Mac installation with [Xcode] installed.  HISAT2 can
also be built on Windows using [Cygwin] or [MinGW] (MinGW recommended). For a
MinGW build, the choice of compiler is important because it determines whether
32-bit or 64-bit code can be compiled successfully. If both 32-bit and 64-bit
builds are needed on the same machine, a multilib MinGW must be properly
installed. [MSYS], the [zlib] library and, depending on the architecture, the
[pthreads] library are also required. We recommend a 64-bit build since it has
clear advantages for real-life research problems. To simplify the MinGW setup,
it may be worth investigating popular MinGW personal builds, since these come
already prepared with most of the toolchains needed.
62
63First, download the [source package] from the Releases section on the right side.
64Unzip the file, change to the unzipped directory, and build the
65HISAT2 tools by running GNU `make` (usually with the command `make`, but
66sometimes with `gmake`) with no arguments.  If building with MinGW, run `make`
67from the MSYS environment.
68
HISAT2 uses multithreading to speed up execution on SMP architectures where
this is possible. On POSIX platforms (like Linux, Mac OS, etc.) it needs the
pthread library. Although it is possible to use the pthread library on a
non-POSIX platform like Windows, for performance reasons HISAT2 will try to
use Windows native multithreading if possible.

To support SRA data access in HISAT2, please download and install the [NCBI-NGS] toolkit.
When running `make`, specify additional variables as follows:
`make USE_SRA=1 NCBI_NGS_DIR=/path/to/NCBI-NGS-directory NCBI_VDB_DIR=/path/to/NCBI-NGS-directory`,
where `NCBI_NGS_DIR` and `NCBI_VDB_DIR` are used in the Makefile for the `-I` and `-L` compilation options.
For example, `$(NCBI_NGS_DIR)/include` and `$(NCBI_NGS_DIR)/lib64` will be used.
81
82[Cygwin]:   http://www.cygwin.com/
83[MinGW]:    http://www.mingw.org/
84[MSYS]:     http://www.mingw.org/wiki/msys
85[zlib]:     http://cygwin.com/packages/mingw-zlib/
86[pthreads]: http://sourceware.org/pthreads-win32/
87[GnuWin32]: http://gnuwin32.sf.net/packages/coreutils.htm
88[Download]: https://sourceforge.net/projects/bowtie-bio/files/bowtie2/
89[sourceforge site]: https://sourceforge.net/projects/bowtie-bio/files/bowtie2/
90[source package]: http://ccb.jhu.edu/software/hisat2/downloads/hisat2-2.0.0-beta-source.zip
91[Xcode]:    http://developer.apple.com/xcode/
92[NCBI-NGS]: https://github.com/ncbi/ngs/wiki/Downloads
93
Running HISAT2
==============

Adding to PATH
--------------
99
100By adding your new HISAT2 directory to your [PATH environment variable], you
101ensure that whenever you run `hisat2`, `hisat2-build` or `hisat2-inspect`
102from the command line, you will get the version you just installed without
103having to specify the entire path.  This is recommended for most users.  To do
104this, follow your operating system's instructions for adding the directory to
105your [PATH].
106
107If you would like to install HISAT2 by copying the HISAT2 executable files
108to an existing directory in your [PATH], make sure that you copy all the
109executables, including `hisat2`, `hisat2-align-s`, `hisat2-align-l`, `hisat2-build`, `hisat2-build-s`, `hisat2-build-l`, `hisat2-inspect`, `hisat2-inspect-s` and
110`hisat2-inspect-l`.
111
112[PATH environment variable]: http://en.wikipedia.org/wiki/PATH_(variable)
113[PATH]: http://en.wikipedia.org/wiki/PATH_(variable)
114
115Reporting
116---------
117
118The reporting mode governs how many alignments HISAT2 looks for, and how to
119report them.
120
121In general, when we say that a read has an alignment, we mean that it has a
122[valid alignment].  When we say that a read has multiple alignments, we mean
123that it has multiple alignments that are valid and distinct from one another.
124
125By default, HISAT2 may soft-clip reads near their 5' and 3' ends.  Users can control this behavior by setting different penalties for soft-clipping (`--sp`) or by disallowing soft-clipping (`--no-softclip`).
126
127### Distinct alignments map a read to different places
128
129Two alignments for the same individual read are "distinct" if they map the same
130read to different places.  Specifically, we say that two alignments are distinct
131if there are no alignment positions where a particular read offset is aligned
132opposite a particular reference offset in both alignments with the same
133orientation.  E.g. if the first alignment is in the forward orientation and
134aligns the read character at read offset 10 to the reference character at
135chromosome 3, offset 3,445,245, and the second alignment is also in the forward
136orientation and also aligns the read character at read offset 10 to the
137reference character at chromosome 3, offset 3,445,245, they are not distinct
138alignments.
139
140Two alignments for the same pair are distinct if either the mate 1s in the two
141paired-end alignments are distinct or the mate 2s in the two alignments are
142distinct or both.
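
The definition above can be sketched as a set comparison. This is only an illustration, not HISAT2's internal representation: the `AlignedPosition` tuple, the `ungapped` helper and the coordinates other than the chromosome 3 example are assumptions, and reverse-strand offset conventions are deliberately ignored.

```python
from typing import List, NamedTuple


class AlignedPosition(NamedTuple):
    """One read character aligned opposite one reference character."""
    read_offset: int
    chrom: str
    ref_offset: int
    forward: bool  # True if the alignment is in the forward orientation


def are_distinct(aln_a: List[AlignedPosition], aln_b: List[AlignedPosition]) -> bool:
    """Two alignments are distinct if they share no aligned position
    (same read offset, same reference offset, same orientation)."""
    return set(aln_a).isdisjoint(aln_b)


def pairs_are_distinct(pair_a, pair_b) -> bool:
    """Paired-end alignments are distinct if the mate 1s are distinct,
    the mate 2s are distinct, or both. Each pair is (mate1, mate2)."""
    return are_distinct(pair_a[0], pair_b[0]) or are_distinct(pair_a[1], pair_b[1])


def ungapped(chrom: str, start: int, length: int, forward: bool = True):
    """Hypothetical helper: an ungapped alignment starting at `start`."""
    return [AlignedPosition(i, chrom, start + i, forward) for i in range(length)]


# Read offset 10 lands on chromosome 3, offset 3,445,245 in both alignments.
first = ungapped("chr3", 3_445_235, 50)
same_spot = ungapped("chr3", 3_445_235, 50)
elsewhere = ungapped("chr3", 7_000_000, 50)

print(are_distinct(first, same_spot))  # False
print(are_distinct(first, elsewhere))  # True
```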
143
### Default mode: search for one or more alignments, report each

HISAT2 searches for up to N distinct, primary alignments for
each read, where N equals the integer specified with the `-k` parameter.
Primary alignments are alignments whose alignment score is equal to or higher than that of any other alignment.
It is possible that multiple distinct alignments have the same score.
For example, if `-k 2` is specified, HISAT2 will search for at most 2 distinct
alignments. The alignment score for a paired-end alignment equals the sum of the
alignment scores of the individual mates.  Each reported read or pair alignment
beyond the first has the SAM 'secondary' bit (which equals 256) set in its FLAGS
field.  See the [SAM specification] for details.
155
156HISAT2 does not "find" alignments in any specific order, so for reads that
157have more than N distinct, valid alignments, HISAT2 does not guarantee that
158the N alignments reported are the best possible in terms of alignment score.
159Still, this mode can be effective and fast in situations where the user cares
160more about whether a read aligns (or aligns a certain number of times) than
161where exactly it originated.
162
163[SAM specification]: http://samtools.sourceforge.net/SAM1.pdf
164
165Alignment summary
166------------------
167
168When HISAT2 finishes running, it prints messages summarizing what happened.
169These messages are printed to the "standard error" ("stderr") filehandle.  For
170datasets consisting of unpaired reads, the summary might look like this:
171
172    20000 reads; of these:
173      20000 (100.00%) were unpaired; of these:
174        1247 (6.24%) aligned 0 times
175        18739 (93.69%) aligned exactly 1 time
176        14 (0.07%) aligned >1 times
177    93.77% overall alignment rate
178
179For datasets consisting of pairs, the summary might look like this:
180
181    10000 reads; of these:
182      10000 (100.00%) were paired; of these:
183        650 (6.50%) aligned concordantly 0 times
184        8823 (88.23%) aligned concordantly exactly 1 time
185        527 (5.27%) aligned concordantly >1 times
186        ----
187        650 pairs aligned concordantly 0 times; of these:
188          34 (5.23%) aligned discordantly 1 time
189        ----
190        616 pairs aligned 0 times concordantly or discordantly; of these:
191          1232 mates make up the pairs; of these:
192            660 (53.57%) aligned 0 times
193            571 (46.35%) aligned exactly 1 time
194            1 (0.08%) aligned >1 times
195    96.70% overall alignment rate
196
197The indentation indicates how subtotals relate to totals.
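
As a check on how the subtotals combine, the sketch below reproduces the overall alignment rate of the paired-end example. The formula is inferred from the numbers shown, not taken from the HISAT2 source: it assumes the rate is counted per mate, with each aligned pair contributing two mates.

```python
def overall_alignment_rate(total_pairs, concordant_once, concordant_multi,
                           discordant_once, mates_once, mates_multi):
    """Percentage of mates that aligned in any way (assumed formula)."""
    # Every concordantly or discordantly aligned pair contributes two mates.
    aligned_mates = 2 * (concordant_once + concordant_multi + discordant_once)
    # Mates from otherwise unaligned pairs that aligned on their own.
    aligned_mates += mates_once + mates_multi
    return 100.0 * aligned_mates / (2 * total_pairs)


rate = overall_alignment_rate(
    total_pairs=10000,
    concordant_once=8823,
    concordant_multi=527,
    discordant_once=34,
    mates_once=571,
    mates_multi=1,
)
print(f"{rate:.2f}% overall alignment rate")  # 96.70% overall alignment rate
```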
198
199Wrapper
200-------
201
202The `hisat2`, `hisat2-build` and `hisat2-inspect` executables are actually
203wrapper scripts that call binary programs as appropriate.  The wrappers shield
204users from having to distinguish between "small" and "large" index formats,
205discussed briefly in the following section.  Also, the `hisat2` wrapper
206provides some key functionality, like the ability to handle compressed inputs,
207and the functionality for `--un`, `--al` and related options.
208
209It is recommended that you always run the hisat2 wrappers and not run the
210binaries directly.
211
212Small and large indexes
213-----------------------
214
215`hisat2-build` can index reference genomes of any size.  For genomes less than
216about 4 billion nucleotides in length, `hisat2-build` builds a "small" index
217using 32-bit numbers in various parts of the index.  When the genome is longer,
218`hisat2-build` builds a "large" index using 64-bit numbers.  Small indexes are
219stored in files with the `.ht2` extension, and large indexes are stored in
220files with the `.ht2l` extension.  The user need not worry about whether a
221particular index is small or large; the wrapper scripts will automatically build
222and use the appropriate index.
223
224Performance tuning
225------------------
226
2271.  If your computer has multiple processors/cores, use `-p`
228
229    The `-p` option causes HISAT2 to launch a specified number of parallel
230    search threads.  Each thread runs on a different processor/core and all
231    threads find alignments in parallel, increasing alignment throughput by
232    approximately a multiple of the number of threads (though in practice,
233    speedup is somewhat worse than linear).
234
235Command Line
236------------
237
238### Setting function options
239
240Some HISAT2 options specify a function rather than an individual number or
241setting.  In these cases the user specifies three parameters: (a) a function
242type `F`, (b) a constant term `B`, and (c) a coefficient `A`.  The available
243function types are constant (`C`), linear (`L`), square-root (`S`), and natural
244log (`G`). The parameters are specified as `F,B,A` - that is, the function type,
245the constant term, and the coefficient are separated by commas with no
246whitespace.  The constant term and coefficient may be negative and/or
247floating-point numbers.
248
249For example, if the function specification is `L,-0.4,-0.6`, then the function
250defined is:
251
252    f(x) = -0.4 + -0.6 * x
253
254If the function specification is `G,1,5.4`, then the function defined is:
255
256    f(x) = 1.0 + 5.4 * ln(x)
257
See the documentation for the option in question to learn what the parameter `x`
is for.  For example, in the case of the `--score-min` option, the function
`f(x)` sets the minimum alignment score necessary for an alignment to be
considered valid, and `x` is the read length.
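
The `F,B,A` notation can be illustrated with a small parser. This is a sketch for understanding the notation, not HISAT2's own parsing code; `parse_function_option` is a hypothetical name, and treating the constant type as `f(x) = B` (ignoring the coefficient) is an assumption.

```python
import math

# Transform applied to x before it is multiplied by the coefficient A.
_TRANSFORMS = {
    "C": lambda x: 0.0,  # constant: assumed to ignore the coefficient
    "L": lambda x: x,    # linear
    "S": math.sqrt,      # square root
    "G": math.log,       # natural log
}


def parse_function_option(spec):
    """Turn a string such as 'L,-0.4,-0.6' into a callable f(x)."""
    func_type, constant, coefficient = spec.split(",")
    if func_type not in _TRANSFORMS:
        raise ValueError(f"unknown function type: {func_type!r}")
    b, a = float(constant), float(coefficient)
    transform = _TRANSFORMS[func_type]
    return lambda x: b + a * transform(x)


linear = parse_function_option("L,-0.4,-0.6")
natural_log = parse_function_option("G,1,5.4")

print(linear(10))       # -0.4 + -0.6 * 10
print(natural_log(1))   # 1.0 + 5.4 * ln(1) = 1.0
```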
262
263### Usage
264
265    hisat2 [options]* -x <hisat2-idx> {-1 <m1> -2 <m2> | -U <r> | --sra-acc <SRA accession number>} [-S <hit>]
266
267### Main arguments
268
269    -x <hisat2-idx>
270
271The basename of the index for the reference genome.  The basename is the name of
272any of the index files up to but not including the final `.1.ht2` / etc.
273`hisat2` looks for the specified index first in the current directory,
274then in the directory specified in the `HISAT2_INDEXES` environment variable.
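
The lookup order described above can be sketched as follows. `find_index` is a hypothetical helper, and testing for a `.1.ht2` or `.1.ht2l` file is an assumption about how to recognise an index; the real wrapper may check differently.

```python
import os
from pathlib import Path


def find_index(basename, suffixes=(".1.ht2", ".1.ht2l")):
    """Return the full index basename, or None if no index file is found.

    Searches the current directory first, then the directory named by the
    HISAT2_INDEXES environment variable (if it is set).
    """
    search_dirs = [Path.cwd()]
    env_dir = os.environ.get("HISAT2_INDEXES")
    if env_dir:
        search_dirs.append(Path(env_dir))

    for directory in search_dirs:
        for suffix in suffixes:
            if (directory / (basename + suffix)).is_file():
                return directory / basename
    return None
```

For example, with `HISAT2_INDEXES=/data/indexes` (a made-up path) and a file `/data/indexes/genome.1.ht2`, `find_index("genome")` would return `/data/indexes/genome` unless the current directory also holds a matching file.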
275
276    -1 <m1>
277
278Comma-separated list of files containing mate 1s (filename usually includes
279`_1`), e.g. `-1 flyA_1.fq,flyB_1.fq`.  Sequences specified with this option must
280correspond file-for-file and read-for-read with those specified in `<m2>`. Reads
281may be a mix of different lengths. If `-` is specified, `hisat2` will read the
282mate 1s from the "standard in" or "stdin" filehandle.
283
284    -2 <m2>
285
286Comma-separated list of files containing mate 2s (filename usually includes
287`_2`), e.g. `-2 flyA_2.fq,flyB_2.fq`.  Sequences specified with this option must
288correspond file-for-file and read-for-read with those specified in `<m1>`. Reads
289may be a mix of different lengths. If `-` is specified, `hisat2` will read the
290mate 2s from the "standard in" or "stdin" filehandle.
291
292    -U <r>
293
294Comma-separated list of files containing unpaired reads to be aligned, e.g.
295`lane1.fq,lane2.fq,lane3.fq,lane4.fq`.  Reads may be a mix of different lengths.
296If `-` is specified, `hisat2` gets the reads from the "standard in" or "stdin"
297filehandle.
298
299    --sra-acc <SRA accession number>
300
301Comma-separated list of SRA accession numbers, e.g. `--sra-acc SRR353653,SRR353654`.
Information about read types is available at `http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?sp=runinfo&acc=<sra-acc>&retmode=xml`,
where `<sra-acc>` is the SRA accession number.  If users run HISAT2 on a computer cluster, it is recommended to disable SRA-related caching (see the instructions at [SRA-MANUAL]).
304
305[SRA-MANUAL]:	     https://github.com/ncbi/sra-tools/wiki/Toolkit-Configuration
306
307    -S <hit>
308
309File to write SAM alignments to.  By default, alignments are written to the
310"standard out" or "stdout" filehandle (i.e. the console).
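
The main arguments can be assembled programmatically, for instance when launching HISAT2 from a pipeline script. The sketch below only builds and prints the command line; `build_hisat2_command` is a hypothetical helper, and the index basename `genome` and output name `out.sam` are made up. The read file names come from the `-1`/`-2` examples above.

```python
import shlex


def build_hisat2_command(index, mates1, mates2, sam_out, threads=1):
    """Build an argument list for a paired-end hisat2 run."""
    return [
        "hisat2",
        "-p", str(threads),       # number of parallel search threads
        "-x", index,              # index basename, without .1.ht2 etc.
        "-1", ",".join(mates1),   # comma-separated mate 1 files
        "-2", ",".join(mates2),   # comma-separated mate 2 files, same order
        "-S", sam_out,            # SAM output file
    ]


cmd = build_hisat2_command(
    "genome",
    ["flyA_1.fq", "flyB_1.fq"],
    ["flyA_2.fq", "flyB_2.fq"],
    "out.sam",
    threads=4,
)
print(shlex.join(cmd))
# hisat2 -p 4 -x genome -1 flyA_1.fq,flyB_1.fq -2 flyA_2.fq,flyB_2.fq -S out.sam
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it, provided `hisat2` is on the PATH.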
311
312### Options
313
314#### Input options
315
316    -q
317
Reads (specified with `<m1>`, `<m2>`, `<r>`) are FASTQ files.  FASTQ files
usually have extension `.fq` or `.fastq`.  FASTQ is the default format.  See
also: `--solexa-quals` and `--int-quals`.
321
322    --qseq
323
Reads (specified with `<m1>`, `<m2>`, `<r>`) are QSEQ files.  QSEQ files usually
end in `_qseq.txt`.  See also: `--solexa-quals` and `--int-quals`.
326
327    -f
328
Reads (specified with `<m1>`, `<m2>`, `<r>`) are FASTA files.  FASTA files
usually have extension `.fa`, `.fasta`, `.mfa`, `.fna` or similar.  FASTA files
do not have a way of specifying quality values, so when `-f` is set, the result
is as if `--ignore-quals` is also set.
333
334    -r
335
Reads (specified with `<m1>`, `<m2>`, `<r>`) are files with one input sequence
per line, without any other information (no read names, no qualities).  When
`-r` is set, the result is as if `--ignore-quals` is also set.
339
340    -c
341
The read sequences are given on the command line.  I.e. `<m1>`, `<m2>` and
`<r>` are comma-separated lists of reads rather than lists of read files.
There is no way to specify read names or qualities, so `-c` also implies
`--ignore-quals`.
346
347    -s/--skip <int>
348
349Skip (i.e. do not align) the first `<int>` reads or pairs in the input.
350
351    -u/--qupto <int>
352
353Align the first `<int>` reads or read pairs from the input (after the
354`-s`/`--skip` reads or pairs have been skipped), then stop.  Default: no limit.
355
356    -5/--trim5 <int>
357
358Trim `<int>` bases from 5' (left) end of each read before alignment (default: 0).
359
360    -3/--trim3 <int>
361
362Trim `<int>` bases from 3' (right) end of each read before alignment (default:
3630).
364
365    --phred33
366
367Input qualities are ASCII chars equal to the [Phred quality] plus 33.  This is
368also called the "Phred+33" encoding, which is used by the very latest Illumina
369pipelines.
370
371[Phred quality]: http://en.wikipedia.org/wiki/Phred_quality_score
372
373    --phred64
374
375Input qualities are ASCII chars equal to the [Phred quality] plus 64.  This is
376also called the "Phred+64" encoding.
377
378    --solexa-quals
379
380Convert input qualities from [Solexa][Phred quality] (which can be negative) to
381[Phred][Phred quality] (which can't).  This scheme was used in older Illumina GA
382Pipeline versions (prior to 1.3).  Default: off.
383
384    --int-quals
385
386Quality values are represented in the read input file as space-separated ASCII
387integers, e.g., `40 40 30 40`..., rather than ASCII characters, e.g., `II?I`....
388 Integers are treated as being on the [Phred quality] scale unless
389`--solexa-quals` is also specified. Default: off.
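
To make the quality encodings concrete, the sketch below decodes the same quality string under the Phred+33 and Phred+64 conventions and parses the integer form accepted by `--int-quals`. The function names are illustrative only and do not correspond to HISAT2 internals.

```python
def decode_qualities(quality_string, offset=33):
    """Convert ASCII-encoded qualities to Phred scores.

    offset=33 corresponds to --phred33, offset=64 to --phred64.
    """
    return [ord(char) - offset for char in quality_string]


def parse_int_qualities(quality_field):
    """Parse space-separated integer qualities, as used with --int-quals."""
    return [int(token) for token in quality_field.split()]


print(decode_qualities("II?I", offset=33))   # [40, 40, 30, 40]
print(decode_qualities("hh^h", offset=64))   # [40, 40, 30, 40]
print(parse_int_qualities("40 40 30 40"))    # [40, 40, 30, 40]
```

All three inputs describe the same four quality values, which is why choosing the wrong encoding option shifts every score by a constant.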
390
391#### Alignment options
392
393    --n-ceil <func>
394
Sets a function governing the maximum number of ambiguous characters (usually
`N`s and/or `.`s) allowed in a read as a function of read length.  For instance,
specifying `L,0,0.15` sets the N-ceiling function `f` to `f(x) = 0 + 0.15 * x`,
where `x` is the read length.  See also: [setting function options].  Reads
exceeding this ceiling are [filtered out].  Default: `L,0,0.15`.
400
401    --ignore-quals
402
403When calculating a mismatch penalty, always consider the quality value at the
404mismatched position to be the highest possible, regardless of the actual value.
405I.e. input is treated as though all quality values are high.  This is also the
406default behavior when the input doesn't specify quality values (e.g. in `-f`,
407`-r`, or `-c` modes).
408
409    --nofw/--norc
410
411If `--nofw` is specified, `hisat2` will not attempt to align unpaired reads to
412the forward (Watson) reference strand.  If `--norc` is specified, `hisat2` will
413not attempt to align unpaired reads against the reverse-complement (Crick)
414reference strand. In paired-end mode, `--nofw` and `--norc` pertain to the
415fragments; i.e. specifying `--nofw` causes `hisat2` to explore only those
416paired-end configurations corresponding to fragments from the reverse-complement
417(Crick) strand.  Default: both strands enabled.
418
419#### Scoring options
420
421    --mp MX,MN
422
Sets the maximum (`MX`) and minimum (`MN`) mismatch penalties, both integers.  A
number less than or equal to `MX` and greater than or equal to `MN` is
subtracted from the alignment score for each position where a read character
aligns to a reference character, the characters do not match, and neither is an
`N`.  If `--ignore-quals` is specified, the number subtracted equals `MX`.
Otherwise, the number subtracted is `MN + floor( (MX-MN)(MIN(Q, 40.0)/40.0) )`
where Q is the Phred quality value.  Default: `MX` = 6, `MN` = 2.
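
The penalty formula above can be written out directly. The function below is a transcription of that formula for illustration, with the documented defaults; `mismatch_penalty` is a hypothetical name, not a HISAT2 function.

```python
import math


def mismatch_penalty(quality, mx=6, mn=2, ignore_quals=False):
    """Penalty subtracted for one mismatch at a base with Phred quality `quality`."""
    if ignore_quals:
        # Qualities are treated as maximal, so the full penalty applies.
        return mx
    return mn + math.floor((mx - mn) * (min(quality, 40.0) / 40.0))


for q in (0, 10, 20, 40, 41):
    print(q, mismatch_penalty(q))
# 0 2
# 10 3
# 20 4
# 40 6
# 41 6
```

Low-quality bases are penalised less because a mismatch there is more likely to be a sequencing error than a true difference.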
430
431    --sp MX,MN
432
433Sets the maximum (`MX`) and minimum (`MN`) penalties for soft-clipping per base,
434both integers. A number less than or equal to `MX` and greater than or equal to `MN` is
435subtracted from the alignment score for each position.
436The number subtracted is `MN + floor( (MX-MN)(MIN(Q, 40.0)/40.0) )`
437where Q is the Phred quality value.  Default: `MX` = 2, `MN` = 1.
438
439    --no-softclip
440
441Disallow soft-clipping.
442
443    --np <int>
444
445Sets penalty for positions where the read, reference, or both, contain an
446ambiguous character such as `N`.  Default: 1.
447
448    --rdg <int1>,<int2>
449
450Sets the read gap open (`<int1>`) and extend (`<int2>`) penalties.  A read gap of
451length N gets a penalty of `<int1>` + N * `<int2>`.  Default: 5, 3.
452
453    --rfg <int1>,<int2>
454
455Sets the reference gap open (`<int1>`) and extend (`<int2>`) penalties.  A
456reference gap of length N gets a penalty of `<int1>` + N * `<int2>`.  Default:
4575, 3.
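
A short worked example of the gap penalty formula shared by `--rdg` and `--rfg`, using the documented defaults; `gap_penalty` is an illustrative name.

```python
def gap_penalty(gap_length, gap_open=5, gap_extend=3):
    """Penalty for a gap of `gap_length`: <int1> + N * <int2>."""
    return gap_open + gap_length * gap_extend


print(gap_penalty(1))  # 8
print(gap_penalty(3))  # 14
```

Note that, as the formula is stated, even a gap of length 1 pays both the open and one extend penalty.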
458
459    --score-min <func>
460
461Sets a function governing the minimum alignment score needed for an alignment to
462be considered "valid" (i.e. good enough to report).  This is a function of read
463length. For instance, specifying `L,0,-0.6` sets the minimum-score function `f`
464to `f(x) = 0 + -0.6 * x`, where `x` is the read length.  See also: [setting
465function options].  The default is `L,0,-0.2`.
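
To show how the threshold is applied, the sketch below evaluates the default linear function for a given read length and compares an alignment score against it. The scores used are invented for illustration, and the helper names are hypothetical.

```python
def min_valid_score(read_length, constant=0.0, coefficient=-0.2):
    """Minimum score for a linear --score-min function L,<constant>,<coefficient>."""
    return constant + coefficient * read_length


def is_valid_alignment(score, read_length, constant=0.0, coefficient=-0.2):
    """An alignment is valid if its score is at least the minimum score."""
    return score >= min_valid_score(read_length, constant, coefficient)


print(min_valid_score(100))          # -20.0
print(is_valid_alignment(-12, 100))  # True
print(is_valid_alignment(-24, 100))  # False
```

With the default `L,0,-0.2`, a 100-bp read tolerates penalties summing to 20, while a 50-bp read tolerates only 10.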
466
467#### Spliced alignment options
468
469    --pen-cansplice <int>
470
471Sets the penalty for each pair of canonical splice sites (e.g. GT/AG). Default: 0.
472
473    --pen-noncansplice <int>
474
475Sets the penalty for each pair of non-canonical splice sites (e.g. non-GT/AG). Default: 12.
476
477    --pen-canintronlen <func>
478
479Sets the penalty for long introns with canonical splice sites so that alignments with shorter introns are preferred
480to those with longer ones.  Default: G,-8,1
481
482    --pen-noncanintronlen <func>
483
484Sets the penalty for long introns with noncanonical splice sites so that alignments with shorter introns are preferred
485to those with longer ones.  Default: G,-8,1
486
487    --min-intronlen <int>
488
489Sets minimum intron length. Default: 20
490
491    --max-intronlen <int>
492
493Sets maximum intron length. Default: 500000
494
495    --known-splicesite-infile <path>
496
With this option, you can provide a list of known splice sites, which HISAT2 uses to align reads with small anchors.
You can create such a list using `python hisat2_extract_splice_sites.py genes.gtf > splicesites.txt`,
where `hisat2_extract_splice_sites.py` is included in the HISAT2 package, `genes.gtf` is a gene annotation file,
and `splicesites.txt` is the list of splice sites that you provide to HISAT2 with this option.
Note that using indexes built with annotated transcripts (such as *genome_tran* or *genome_snp_tran*) works better
than using this option.  Providing splice sites that are already included in the indexes has no effect.
503
504    --novel-splicesite-outfile <path>
505
In this mode, HISAT2 reports a list of splice sites in the file `<path>`, in the following format:

    chromosome name <tab> genomic position of the flanking base on the left side of an intron <tab> genomic position of the flanking base on the right <tab> strand (+, -, or .)

A strand of `.` indicates an unknown strand for non-canonical splice sites.
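
A file in this four-column format is straightforward to read back. The parser below is a minimal sketch: the `SpliceSite` type and the example line are made up, and it makes no assumption about whether positions are 0-based or 1-based.

```python
from typing import NamedTuple


class SpliceSite(NamedTuple):
    chrom: str
    left: int    # flanking base on the left side of the intron
    right: int   # flanking base on the right side of the intron
    strand: str  # '+', '-', or '.' for unknown


def parse_splice_site(line):
    """Parse one tab-separated line of a splice-site file."""
    chrom, left, right, strand = line.rstrip("\n").split("\t")
    if strand not in {"+", "-", "."}:
        raise ValueError(f"unexpected strand: {strand!r}")
    return SpliceSite(chrom, int(left), int(right), strand)


site = parse_splice_site("chr1\t14829\t14969\t-\n")
print(site.chrom, site.right - site.left, site.strand)  # chr1 140 -
```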
509
510    --novel-splicesite-infile <path>
511
512With this mode, you can provide a list of novel splice sites that were generated from the above option "--novel-splicesite-outfile".
513
514    --no-temp-splicesite
515
516HISAT2, by default, makes use of splice sites found by earlier reads to align later reads in the same run,
517in particular, reads with small anchors (<= 15 bp).
518The option disables this default alignment strategy.
519
520    --no-spliced-alignment
521
522Disable spliced alignment.
523
524    --rna-strandness <string>
525
526Specify strand-specific information: the default is unstranded.
527For single-end reads, use F or R.
528 'F' means a read corresponds to a transcript.
529 'R' means a read corresponds to the reverse complemented counterpart of a transcript.
530For paired-end reads, use either FR or RF.
531With this option being used, every read alignment will have an XS attribute tag:
532 '+' means a read belongs to a transcript on '+' strand of genome.
533 '-' means a read belongs to a transcript on '-' strand of genome.
534
535(TopHat has a similar option, --library-type option, where fr-firststrand corresponds to R and RF; fr-secondstrand corresponds to F and FR.)
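
The correspondence with TopHat's `--library-type` can be captured in a small lookup table, which may help when porting an existing TopHat pipeline. The function name is hypothetical; returning `None` for `fr-unstranded` reflects the unstranded default, in which case `--rna-strandness` would simply be omitted.

```python
# TopHat --library-type value -> HISAT2 --rna-strandness value
_STRANDNESS = {
    "fr-firststrand": {"single": "R", "paired": "RF"},
    "fr-secondstrand": {"single": "F", "paired": "FR"},
}


def rna_strandness(library_type, paired):
    """Return the --rna-strandness value, or None for unstranded libraries."""
    if library_type == "fr-unstranded":
        return None
    return _STRANDNESS[library_type]["paired" if paired else "single"]


print(rna_strandness("fr-firststrand", paired=True))    # RF
print(rna_strandness("fr-secondstrand", paired=False))  # F
```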
536
537    --tmo/--transcriptome-mapping-only
538
539Report only those alignments within known transcripts.
540
541    --dta/--downstream-transcriptome-assembly
542
543Report alignments tailored for transcript assemblers including StringTie.
544With this option, HISAT2 requires longer anchor lengths for de novo discovery of splice sites.
545This leads to fewer alignments with short-anchors,
546which helps transcript assemblers improve significantly in computation and memory usage.
547
548    --dta-cufflinks
549
Report alignments tailored specifically for Cufflinks. In addition to what HISAT2 does with the above option (`--dta`),
with this option HISAT2 looks for novel splice sites with three signals (GT/AG, GC/AG, AT/AC), but all user-provided splice sites are used irrespective of their signals.
HISAT2 produces an optional field, XS:A:[+-], for every spliced alignment.
553
554    --no-templatelen-adjustment
555
556Disables template length adjustment for RNA-seq reads.
557
558#### Reporting options
559
560    -k <int>
561
HISAT2 searches for at most `<int>` distinct, primary alignments for each read.
Primary alignments are alignments whose alignment score is equal to or higher than that of any other alignment.
The search terminates when it can't find more distinct valid alignments, or when it
finds `<int>`, whichever happens first. The alignment score for a paired-end
alignment equals the sum of the alignment scores of the individual mates. Each
reported read or pair alignment beyond the first has the SAM 'secondary' bit
(which equals 256) set in its FLAGS field.  For reads that have more than
`<int>` distinct, valid alignments, `hisat2` does not guarantee that the
`<int>` alignments reported are the best possible in terms of alignment score. Default: 5 (HFM) or 10 (HGFM)
571
572Note: HISAT2 is not designed with large values for `-k` in mind, and when
573aligning reads to long, repetitive genomes large `-k` can be very, very slow.
574
575    --max-seeds <int>
576
HISAT2, like other aligners, uses a seed-and-extend approach: it tries to extend seeds to full-length alignments. `--max-seeds` controls the maximum number of seeds that will be extended; HISAT2 extends up to this many seeds and skips the rest. Large values for `--max-seeds` may improve alignment sensitivity, but HISAT2 is not designed with large values for `--max-seeds` in mind, and when aligning reads to long, repetitive genomes, a large `--max-seeds` can be very, very slow. The default value is the maximum of 5 and the value given with `-k`.
578
579    --secondary
580
581Report secondary alignments.
582
583#### Paired-end options
584
585    -I/--minins <int>
586
The minimum fragment length for valid paired-end alignments.  This option is valid only with --no-spliced-alignment.
588E.g. if `-I 60` is specified and a paired-end alignment consists of two 20-bp alignments in the
589appropriate orientation with a 20-bp gap between them, that alignment is
590considered valid (as long as `-X` is also satisfied).  A 19-bp gap would not
591be valid in that case.  If trimming options `-3` or `-5` are also used, the
592`-I` constraint is applied with respect to the untrimmed mates.
593
594The larger the difference between `-I` and `-X`, the slower HISAT2 will
595run.  This is because larger differences between `-I` and `-X` require that
596HISAT2 scan a larger window to determine if a concordant alignment exists.
597For typical fragment length ranges (200 to 400 nucleotides), HISAT2 is very
598efficient.
599
600Default: 0 (essentially imposing no minimum)
601
602    -X/--maxins <int>
603
604The maximum fragment length for valid paired-end alignments.  This option is valid only with --no-spliced-alignment.
605E.g. if `-X 100` is specified and a paired-end alignment consists of two 20-bp alignments in the
606proper orientation with a 60-bp gap between them, that alignment is considered
607valid (as long as `-I` is also satisfied).  A 61-bp gap would not be valid in
608that case.  If trimming options `-3` or `-5` are also used, the `-X`
609constraint is applied with respect to the untrimmed mates, not the trimmed
610mates.
611
612The larger the difference between `-I` and `-X`, the slower HISAT2 will
613run.  This is because larger differences between `-I` and `-X` require that
614HISAT2 scan a larger window to determine if a concordant alignment exists.
615For typical fragment length ranges (200 to 400 nucleotides), HISAT2 is very
616efficient.
617
618Default: 500.
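
The `-I` and `-X` examples can be checked with a few lines of arithmetic. This is a sketch that assumes the fragment length is the two mate alignment lengths plus the gap between them, as in the examples; the function names are illustrative.

```python
def fragment_length(mate1_len, gap, mate2_len):
    """Fragment length spanned by two mates separated by `gap` bases."""
    return mate1_len + gap + mate2_len


def satisfies_fragment_constraints(frag_len, minins=0, maxins=500):
    """True if the fragment length lies within [-I, -X] (defaults 0 and 500)."""
    return minins <= frag_len <= maxins


# -I 60: two 20-bp mates with a 20-bp gap pass, a 19-bp gap fails.
print(satisfies_fragment_constraints(fragment_length(20, 20, 20), minins=60))  # True
print(satisfies_fragment_constraints(fragment_length(20, 19, 20), minins=60))  # False

# -X 100: a 60-bp gap passes, a 61-bp gap fails.
print(satisfies_fragment_constraints(fragment_length(20, 60, 20), maxins=100))  # True
print(satisfies_fragment_constraints(fragment_length(20, 61, 20), maxins=100))  # False
```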
619
620    --fr/--rf/--ff
621
622The upstream/downstream mate orientations for a valid paired-end alignment
623against the forward reference strand.  E.g., if `--fr` is specified and there is
624a candidate paired-end alignment where mate 1 appears upstream of the reverse
625complement of mate 2 and the fragment length constraints (`-I` and `-X`) are
626met, that alignment is valid.  Also, if mate 2 appears upstream of the reverse
627complement of mate 1 and all other constraints are met, that too is valid.
`--rf` likewise requires that an upstream mate 1 be reverse-complemented and a
downstream mate 2 be forward-oriented. `--ff` requires both an upstream mate 1
and a downstream mate 2 to be forward-oriented.  Default: `--fr` (appropriate
for Illumina's Paired-end Sequencing Assay).
632
633    --no-mixed
634
635By default, when `hisat2` cannot find a concordant or discordant alignment for
636a pair, it then tries to find alignments for the individual mates.  This option
637disables that behavior.
638
639    --no-discordant
640
641By default, `hisat2` looks for discordant alignments if it cannot find any
642concordant alignments.  A discordant alignment is an alignment where both mates
643align uniquely, but that does not satisfy the paired-end constraints
644(`--fr`/`--rf`/`--ff`, `-I`, `-X`).  This option disables that behavior.
645
646#### Output options
647
648    -t/--time
649
650Print the wall-clock time required to load the index files and align the reads.
651This is printed to the "standard error" ("stderr") filehandle.  Default: off.
652
653    --un <path>
654    --un-gz <path>
655    --un-bz2 <path>
656
657Write unpaired reads that fail to align to file at `<path>`.  These reads
658correspond to the SAM records with the FLAGS `0x4` bit set and neither the
659`0x40` nor `0x80` bits set.  If `--un-gz` is specified, output will be gzip
660compressed. If `--un-bz2` is specified, output will be bzip2 compressed.  Reads
661written in this way will appear exactly as they did in the input file, without
662any modification (same sequence, same name, same quality string, same quality
663encoding).  Reads will not necessarily appear in the same order as they did in
664the input.
665
666    --al <path>
667    --al-gz <path>
668    --al-bz2 <path>
669
670Write unpaired reads that align at least once to file at `<path>`.  These reads
671correspond to the SAM records with the FLAGS `0x4`, `0x40`, and `0x80` bits
672unset.  If `--al-gz` is specified, output will be gzip compressed. If `--al-bz2`
673is specified, output will be bzip2 compressed.  Reads written in this way will
674appear exactly as they did in the input file, without any modification (same
675sequence, same name, same quality string, same quality encoding).  Reads will
676not necessarily appear in the same order as they did in the input.
677
678    --un-conc <path>
679    --un-conc-gz <path>
680    --un-conc-bz2 <path>
681
682Write paired-end reads that fail to align concordantly to file(s) at `<path>`.
683These reads correspond to the SAM records with the FLAGS `0x4` bit set and
684either the `0x40` or `0x80` bit set (depending on whether it's mate #1 or #2).
685`.1` and `.2` strings are added to the filename to distinguish which file
686contains mate #1 and mate #2.  If a percent symbol, `%`, is used in `<path>`,
687the percent symbol is replaced with `1` or `2` to make the per-mate filenames.
688Otherwise, `.1` or `.2` are added before the final dot in `<path>` to make the
689per-mate filenames.  Reads written in this way will appear exactly as they did
690in the input files, without any modification (same sequence, same name, same
691quality string, same quality encoding).  Reads will not necessarily appear in
692the same order as they did in the inputs.
693
694    --al-conc <path>
695    --al-conc-gz <path>
696    --al-conc-bz2 <path>
697
698Write paired-end reads that align concordantly at least once to file(s) at
699`<path>`. These reads correspond to the SAM records with the FLAGS `0x4` bit
700unset and either the `0x40` or `0x80` bit set (depending on whether it's mate #1
701or #2). `.1` and `.2` strings are added to the filename to distinguish which
702file contains mate #1 and mate #2.  If a percent symbol, `%`, is used in
703`<path>`, the percent symbol is replaced with `1` or `2` to make the per-mate
704filenames. Otherwise, `.1` or `.2` are added before the final dot in `<path>` to
705make the per-mate filenames.  Reads written in this way will appear exactly as
706they did in the input files, without any modification (same sequence, same name,
707same quality string, same quality encoding).  Reads will not necessarily appear
708in the same order as they did in the inputs.
709
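The per-mate naming rule described for `--un-conc` and `--al-conc` can be sketched in a few lines of Python.  This is only an illustration of the rule as worded above, not code taken from HISAT2; the function name `per_mate_paths` is hypothetical, and two details are assumptions: the rule is applied to the file name rather than to directory components, and the suffix is simply appended when the name contains no dot.

```python
import os


def per_mate_paths(path: str) -> tuple:
    """Return the (mate 1, mate 2) file names derived from <path>."""
    if "%" in path:
        # A percent symbol is replaced with the mate number.
        return path.replace("%", "1"), path.replace("%", "2")
    directory, name = os.path.split(path)
    stem, dot, ext = name.rpartition(".")
    if dot:
        # ".1" / ".2" is added before the final dot of the file name.
        names = (f"{stem}.1.{ext}", f"{stem}.2.{ext}")
    else:
        # Assumption: with no dot in the name, the suffix is appended.
        names = (f"{name}.1", f"{name}.2")
    return tuple(os.path.join(directory, n) for n in names)


print(per_mate_paths("unaligned_%.fq"))
print(per_mate_paths("unaligned.fq"))
```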
710    --quiet
711
712Print nothing besides alignments and serious errors.
713
    --summary-file <path>

Print the alignment summary to the file at `<path>`.
717
718    --new-summary
719
720Print alignment summary in a new style, which is more machine-friendly.
721
722    --met-file <path>
723
Write `hisat2` metrics to file `<path>`.  Alignment metrics can be useful
for debugging certain problems, especially performance issues.  See also:
`--met`.  Default: metrics disabled.
727
728    --met-stderr
729
Write `hisat2` metrics to the "standard error" ("stderr") filehandle.  This is
not mutually exclusive with `--met-file`.  Alignment metrics can be
useful for debugging certain problems, especially performance issues.  See also:
`--met`.  Default: metrics disabled.
734
735    --met <int>
736
Write a new `hisat2` metrics record every `<int>` seconds.  Only matters if
`--met-stderr` or `--met-file` is specified.  Default: 1.
739
740#### SAM options
741
742    --no-unal
743
744Suppress SAM records for reads that failed to align.
745
746    --no-hd
747
748Suppress SAM header lines (starting with `@`).
749
750    --no-sq
751
752Suppress `@SQ` SAM header lines.
753
754    --rg-id <text>
755
756Set the read group ID to `<text>`.  This causes the SAM `@RG` header line to be
757printed, with `<text>` as the value associated with the `ID:` tag.  It also
758causes the `RG:Z:` extra field to be attached to each SAM output record, with
759value set to `<text>`.
760
761    --rg <text>
762
763Add `<text>` (usually of the form `TAG:VAL`, e.g. `SM:Pool1`) as a field on the
764`@RG` header line.  Note: in order for the `@RG` line to appear, `--rg-id`
765must also be specified.  This is because the `ID` tag is required by the [SAM
766Spec][SAM].  Specify `--rg` multiple times to set multiple fields.  See the
767[SAM Spec][SAM] for details about what fields are legal.
768
769    --remove-chrname
770
Remove `chr` from reference names in the alignment output (e.g., `chr18` becomes `18`).
772
773    --add-chrname
774
Add `chr` to reference names in the alignment output (e.g., `18` becomes `chr18`).
776
777    --omit-sec-seq
778
779When printing secondary alignments, HISAT2 by default will write out the `SEQ`
780and `QUAL` strings.  Specifying this option causes HISAT2 to print an asterisk
781in those fields instead.
782
783#### Performance options
784
785    -o/--offrate <int>
786
787Override the offrate of the index with `<int>`.  If `<int>` is greater
788than the offrate used to build the index, then some row markings are
789discarded when the index is read into memory.  This reduces the memory
790footprint of the aligner but requires more time to calculate text
791offsets.  `<int>` must be greater than the value used to build the
792index.
793
794    -p/--threads NTHREADS
795
796Launch `NTHREADS` parallel search threads (default: 1).  Threads will run on
797separate processors/cores and synchronize when parsing reads and outputting
798alignments.  Searching for alignments is highly parallel, and speedup is close
799to linear.  Increasing `-p` increases HISAT2's memory footprint. E.g. when
800aligning to a human genome index, increasing `-p` from 1 to 8 increases the
801memory footprint by a few hundred megabytes.  This option is only available if
802`hisat2` is linked with the `pthreads` library (i.e. if `HISAT2_PTHREADS=0` is
803not specified at build time).
804
805    --reorder
806
807Guarantees that output SAM records are printed in an order corresponding to the
808order of the reads in the original input file, even when `-p` is set greater
than 1.  Specifying `--reorder` and setting `-p` greater than 1 causes HISAT2
to run somewhat slower and use somewhat more memory than if `--reorder` were
not specified.  Has no effect if `-p` is set to 1, since output order will
naturally correspond to input order in that case.
813
814    --mm
815
816Use memory-mapped I/O to load the index, rather than typical file I/O.
817Memory-mapping allows many concurrent `hisat2` processes on the same computer to
818share the same memory image of the index (i.e. you pay the memory overhead just
819once).  This facilitates memory-efficient parallelization of `hisat2` in
820situations where using `-p` is not possible or not preferable.
821
822#### Other options
823
824    --qc-filter
825
826Filter out reads for which the QSEQ filter field is non-zero.  Only has an
827effect when read format is `--qseq`.  Default: off.
828
829    --seed <int>
830
831Use `<int>` as the seed for pseudo-random number generator.  Default: 0.
832
833    --non-deterministic
834
835Normally, HISAT2 re-initializes its pseudo-random generator for each read.  It
836seeds the generator with a number derived from (a) the read name, (b) the
837nucleotide sequence, (c) the quality sequence, (d) the value of the `--seed`
838option.  This means that if two reads are identical (same name, same
839nucleotides, same qualities) HISAT2 will find and report the same alignment(s)
840for both, even if there was ambiguity.  When `--non-deterministic` is specified,
841HISAT2 re-initializes its pseudo-random generator for each read using the
842current time.  This means that HISAT2 will not necessarily report the same
843alignment for two identical reads.  This is counter-intuitive for some users,
844but might be more appropriate in situations where the input consists of many
845identical reads.
846
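The default per-read seeding can be illustrated with a short Python sketch.  It is not HISAT2's implementation: SHA-256 and Python's `random` module are stand-ins chosen for the example, and the names `read_rng` and `pick_alignment` are hypothetical.  The point is only that a generator seeded from the read name, sequence, qualities, and `--seed` value makes the same choice for identical reads.

```python
import hashlib
import random


def read_rng(name: str, seq: str, qual: str, seed: int = 0) -> random.Random:
    """Build a generator seeded from the read itself plus the --seed value.

    Illustration only: SHA-256 is a stand-in, not HISAT2's actual hash.
    """
    key = "\t".join((name, seq, qual, str(seed))).encode()
    digest = hashlib.sha256(key).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


def pick_alignment(candidates, name, seq, qual, seed=0):
    """Choose among equally good candidate positions for one read."""
    return read_rng(name, seq, qual, seed).choice(candidates)


positions = [397896, 397984, 398069, 398131]
first = pick_alignment(positions, "read1", "ACGT", "IIII")
second = pick_alignment(positions, "read1", "ACGT", "IIII")
print(first == second)  # identical reads lead to the identical choice
```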
847    --version
848
849Print version information and quit.
850
851    -h/--help
852
853Print usage information and quit.
854
855SAM output
856----------
857
858Following is a brief description of the [SAM] format as output by `hisat2`.
859For more details, see the [SAM format specification][SAM].
860
861By default, `hisat2` prints a SAM header with `@HD`, `@SQ` and `@PG` lines.
862When one or more `--rg` arguments are specified, `hisat2` will also print
863an `@RG` line that includes all user-specified `--rg` tokens separated by
864tabs.
865
866Each subsequent line describes an alignment or, if the read failed to align, a
867read.  Each line is a collection of at least 12 fields separated by tabs; from
868left to right, the fields are:
869
8701.  Name of read that aligned.
871
    Note that the [SAM specification] disallows whitespace in the read name.
    If the read name contains any whitespace characters, HISAT2 will truncate
    the name at the first whitespace character.  This is similar to the
    behavior of other tools.
876
8772.  Sum of all applicable flags.  Flags relevant to HISAT2 are:
878
879        1
880
881    The read is one of a pair
882
883        2
884
885    The alignment is one end of a proper paired-end alignment
886
887        4
888
889    The read has no reported alignments
890
891        8
892
893    The read is one of a pair and has no reported alignments
894
895        16
896
897    The alignment is to the reverse reference strand
898
899        32
900
901    The other mate in the paired-end alignment is aligned to the
902    reverse reference strand
903
904        64
905
906    The read is mate 1 in a pair
907
908        128
909
910    The read is mate 2 in a pair
911
    Thus, an unpaired read that aligns to the reverse reference strand
    will have flag 16.  A paired-end read that is the first mate of a proper
    pair and aligns to the reverse reference strand will have flag 83
    (= 64 + 16 + 2 + 1).
915
9163.  Name of reference sequence where alignment occurs
917
9184.  1-based offset into the forward reference strand where leftmost
919    character of the alignment occurs
920
9215.  Mapping quality
922
9236.  CIGAR string representation of alignment
924
9257.  Name of reference sequence where mate's alignment occurs.  Set to `=` if the
926mate's reference sequence is the same as this alignment's, or `*` if there is no
927mate.
928
9298.  1-based offset into the forward reference strand where leftmost character of
930the mate's alignment occurs.  Offset is 0 if there is no mate.
931
9329.  Inferred fragment length.  Size is negative if the mate's alignment occurs
933upstream of this alignment.  Size is 0 if the mates did not align concordantly.
934However, size is non-0 if the mates aligned discordantly to the same
935chromosome.
936
93710. Read sequence (reverse-complemented if aligned to the reverse strand)
938
11. ASCII-encoded read qualities (reversed if the read aligned to
the reverse strand).  The encoded quality values are on the [Phred quality]
scale and the encoding is ASCII-offset by 33 (ASCII char `!`), similarly to a
[FASTQ] file.
943
94412. Optional fields.  Fields are tab-separated.  `hisat2` outputs zero or more
945of these optional fields for each alignment, depending on the type of the
946alignment:
947
948        AS:i:<N>
949
950    Alignment score.  Can be negative.  Only present if SAM record is for
951    an aligned read.
952
953        ZS:i:<N>
954
955    Alignment score for the best-scoring alignment found other than the
956    alignment reported.  Can be negative.  Only present if the SAM record is
957    for an aligned read and more than one alignment was found for the read.
958    Note that, when the read is part of a concordantly-aligned pair, this score
959    could be greater than `AS:i`.
960
961        YS:i:<N>
962
963    Alignment score for opposite mate in the paired-end alignment.  Only present
964    if the SAM record is for a read that aligned as part of a paired-end
965    alignment.
966
967        XN:i:<N>
968
969    The number of ambiguous bases in the reference covering this alignment.
970    Only present if SAM record is for an aligned read.
971
972        XM:i:<N>
973
974    The number of mismatches in the alignment.  Only present if SAM record is
975    for an aligned read.
976
977        XO:i:<N>
978
979    The number of gap opens, for both read and reference gaps, in the alignment.
980    Only present if SAM record is for an aligned read.
981
982        XG:i:<N>
983
984    The number of gap extensions, for both read and reference gaps, in the
985    alignment. Only present if SAM record is for an aligned read.
986
987        NM:i:<N>
988
989    The edit distance; that is, the minimal number of one-nucleotide edits
990    (substitutions, insertions and deletions) needed to transform the read
991    string into the reference string.  Only present if SAM record is for an
992    aligned read.
993
994        YF:Z:<S>
995
996    String indicating reason why the read was filtered out.  See also:
997    [Filtering].  Only appears for reads that were filtered out.
998
999        YT:Z:<S>
1000
    Value of `UU` indicates the read was not part of a pair.  Value of `CP`
    indicates the read was part of a pair and the pair aligned concordantly.
    Value of `DP` indicates the read was part of a pair and the pair aligned
    discordantly.  Value of `UP` indicates the read was part of a pair but the
    pair failed to align either concordantly or discordantly.
1006
1007        MD:Z:<S>
1008
1009    A string representation of the mismatched reference bases in the alignment.
1010    See [SAM] format specification for details.  Only present if SAM record is
1011    for an aligned read.
1012
1013        XS:A:<A>
1014
    Values of `+` and `-` indicate the read is mapped to transcripts on the sense and
    anti-sense strands, respectively.  Spliced alignments need this field, which is
    required by Cufflinks and StringTie.  HISAT2 can report this field for canonical
    splice sites (GT/AG), but not for non-canonical splice sites.  You can direct
    HISAT2 not to output alignments involving non-canonical splice sites by using
    `--pen-noncansplice 1000000`.
1019
1020        NH:i:<N>
1021
1022    The number of mapped locations for the read or the pair.
1023
1024        Zs:Z:<S>
1025
    When the alignment of a read involves SNPs that are in the index, this field
    indicates exactly where in the read those SNPs occur.  It is similar to the
    `MD:Z` field above.  For example, `Zs:Z:1|S|rs3747203,97|S|rs16990981` indicates
    that the second base of the read corresponds to a known SNP (ID: rs3747203).
    The 100th base of the read, 97 bases after the third base (the base following
    the first SNP), corresponds to another known SNP (ID: rs16990981).
    `S` indicates a single nucleotide polymorphism; `D` and `I` indicate a deletion
    and an insertion, respectively.
1031
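As a rough illustration of the fields described above, the following Python sketch decodes the flag sum, the optional `TAG:TYPE:VALUE` fields, and a `Zs:Z` value.  The function names are hypothetical, the flag names follow the SAM specification rather than any HISAT2 source code, and the example record is abbreviated (placeholder sequence and qualities) rather than real output.

```python
FLAG_NAMES = {
    1: "paired",
    2: "proper_pair",
    4: "unaligned",
    8: "mate_unaligned",
    16: "reverse_strand",
    32: "mate_reverse_strand",
    64: "mate_1",
    128: "mate_2",
}


def decode_flags(flag: int) -> list:
    """Return the names of the bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_NAMES.items() if flag & bit]


def optional_fields(record: str) -> dict:
    """Parse the TAG:TYPE:VALUE fields that follow the 11 mandatory columns."""
    fields = {}
    for item in record.rstrip("\n").split("\t")[11:]:
        tag, kind, value = item.split(":", 2)
        fields[tag] = int(value) if kind == "i" else value
    return fields


def parse_zs(value: str) -> list:
    """Split a Zs:Z value into (offset, variant type, SNP ID) triples."""
    triples = []
    for entry in value.split(","):
        offset, kind, snp_id = entry.split("|")
        triples.append((int(offset), kind, snp_id))
    return triples


# Abbreviated, made-up record: 11 mandatory columns plus optional fields.
record = "\t".join([
    "read1", "16", "22:20000001-21000000", "398131", "255", "100M",
    "*", "0", "0", "ACGT", "IIII",
    "AS:i:0", "NM:i:0", "YT:Z:UU", "NH:i:1", "Zs:Z:80|S|rs576159895",
])
print(decode_flags(int(record.split("\t")[1])))
tags = optional_fields(record)
print(tags["NH"], parse_zs(tags["Zs"]))
```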
1032[SAM format specification]: http://samtools.sf.net/SAM1.pdf
1033[FASTQ]: http://en.wikipedia.org/wiki/FASTQ_format
1034
1035The `hisat2-build` indexer
1036===========================
1037
1038`hisat2-build` builds a HISAT2 index from a set of DNA sequences.
`hisat2-build` outputs a set of 8 files with suffixes `.1.ht2`, `.2.ht2`,
`.3.ht2`, `.4.ht2`, `.5.ht2`, `.6.ht2`, `.7.ht2`, and `.8.ht2`.  In the case of a large
index these suffixes will end in `ht2l` instead of `ht2`.  These files together
1042constitute the index: they are all that is needed to align reads to that
1043reference.  The original sequence FASTA files are no longer used by HISAT2
1044once the index is built.
1045
1046Use of Karkkainen's [blockwise algorithm] allows `hisat2-build` to trade off
1047between running time and memory usage. `hisat2-build` has three options
governing how it makes this trade: `--packed`, `--bmax`/`--bmaxdivn`,
1049and `--dcv`.  By default, `hisat2-build` will automatically search for the
1050settings that yield the best running time without exhausting memory.  This
1051behavior can be disabled using the `-a`/`--noauto` option.
1052
1053The indexer provides options pertaining to the "shape" of the index, e.g.
1054`--offrate` governs the fraction of [Burrows-Wheeler]
1055rows that are "marked" (i.e., the density of the suffix-array sample; see the
1056original [FM Index] paper for details).  All of these options are potentially
1057profitable trade-offs depending on the application.  They have been set to
1058defaults that are reasonable for most cases according to our experiments.  See
1059[Performance tuning] for details.
1060
1061`hisat2-build` can generate either [small or large indexes].  The wrapper
1062will decide which based on the length of the input genome.  If the reference
1063does not exceed 4 billion characters but a large index is preferred,  the user
1064can specify `--large-index` to force `hisat2-build` to build a large index
1065instead.
1066
1067The HISAT2 index is based on the [FM Index] of Ferragina and Manzini, which in
1068turn is based on the [Burrows-Wheeler] transform.  The algorithm used to build
1069the index is based on the [blockwise algorithm] of Karkkainen.
1070
1071[Blockwise algorithm]: http://portal.acm.org/citation.cfm?id=1314852
1072[Burrows-Wheeler]: http://en.wikipedia.org/wiki/Burrows-Wheeler_transform
1073
1074Command Line
1075------------
1076
1077Usage:
1078
1079    hisat2-build [options]* <reference_in> <ht2_base>
1080
1081### Notes
    If you use --snp, --ss, and/or --exon, hisat2-build will need about 200GB of RAM for a genome the size of the human genome, because index building involves graph construction.
    Otherwise, you will be able to build an index on a desktop with 8GB of RAM.
1084
1085### Main arguments
1086
    <reference_in>

A comma-separated list of FASTA files containing the reference sequences to be
aligned to, or, if `-c` is specified, the sequences
themselves. E.g., `<reference_in>` might be `chr1.fa,chr2.fa,chrX.fa,chrY.fa`,
or, if `-c` is specified, this might be
`GGTCATCCT,ACGGGTCGT,CCGTTCTATGCGGCTTA`.

    <ht2_base>

The basename of the index files to write.  By default, `hisat2-build` writes
files named `NAME.1.ht2`, `NAME.2.ht2`, `NAME.3.ht2`, `NAME.4.ht2`,
`NAME.5.ht2`, `NAME.6.ht2`, `NAME.7.ht2`, and `NAME.8.ht2` where `NAME` is `<ht2_base>`.
1096
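Because an index is identified only by its basename, it can be handy to check that all eight files are present before aligning.  The Python sketch below builds the expected file names from the naming scheme described above; the helper names `index_files` and `missing_index_files` are hypothetical and are not part of the HISAT2 package.

```python
from pathlib import Path


def index_files(ht2_base: str, large: bool = False) -> list:
    """List the eight file names that make up one HISAT2 index."""
    ext = "ht2l" if large else "ht2"
    return [f"{ht2_base}.{i}.{ext}" for i in range(1, 9)]


def missing_index_files(ht2_base: str, large: bool = False) -> list:
    """Return the expected index files that are absent on disk."""
    return [f for f in index_files(ht2_base, large) if not Path(f).is_file()]


print(index_files("NAME"))
print(missing_index_files("NAME"))
```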
1097### Options
1098
1099    -f
1100
1101The reference input files (specified as `<reference_in>`) are FASTA files
1102(usually having extension `.fa`, `.mfa`, `.fna` or similar).
1103
1104    -c
1105
1106The reference sequences are given on the command line.  I.e. `<reference_in>` is
1107a comma-separated list of sequences rather than a list of FASTA files.
1108
1109    --large-index
1110
1111Force `hisat2-build` to build a [large index], even if the reference is less
1112than ~ 4 billion nucleotides long.
1113
1114    -a/--noauto
1115
1116Disable the default behavior whereby `hisat2-build` automatically selects
1117values for the `--bmax`, `--dcv` and `--packed` parameters according to
1118available memory.  Instead, user may specify values for those parameters.  If
1119memory is exhausted during indexing, an error message will be printed; it is up
1120to the user to try new parameters.
1121
1122    --bmax <int>
1123
1124The maximum number of suffixes allowed in a block.  Allowing more suffixes per
1125block makes indexing faster, but increases peak memory usage.  Setting this
1126option overrides any previous setting for `--bmax`, or `--bmaxdivn`.
1127Default (in terms of the `--bmaxdivn` parameter) is `--bmaxdivn` 4.  This is
1128configured automatically by default; use `-a`/`--noauto` to configure manually.
1129
1130    --bmaxdivn <int>
1131
1132The maximum number of suffixes allowed in a block, expressed as a fraction of
1133the length of the reference.  Setting this option overrides any previous setting
1134for `--bmax`, or `--bmaxdivn`.  Default: `--bmaxdivn` 4.  This is
1135configured automatically by default; use `-a`/`--noauto` to configure manually.
1136
1137    --dcv <int>
1138
1139Use `<int>` as the period for the difference-cover sample.  A larger period
1140yields less memory overhead, but may make suffix sorting slower, especially if
repeats are present.  Must be a power of 2 no greater than 4096.  Default: 1024.
This is configured automatically by default; use `-a`/`--noauto` to configure
manually.
1144
1145    --nodc
1146
1147Disable use of the difference-cover sample.  Suffix sorting becomes
1148quadratic-time in the worst case (where the worst case is an extremely
1149repetitive reference).  Default: off.
1150
1151    -r/--noref
1152
1153Do not build the `NAME.3.ht2` and `NAME.4.ht2` portions of the index, which
1154contain a bitpacked version of the reference sequences and are used for
1155paired-end alignment.
1156
1157    -3/--justref
1158
1159Build only the `NAME.3.ht2` and `NAME.4.ht2` portions of the index, which
1160contain a bitpacked version of the reference sequences and are used for
1161paired-end alignment.
1162
1163    -o/--offrate <int>
1164
1165To map alignments back to positions on the reference sequences, it's necessary
1166to annotate ("mark") some or all of the [Burrows-Wheeler] rows with their
1167corresponding location on the genome.
1168`-o`/`--offrate` governs how many rows get marked:
1169the indexer will mark every 2^`<int>` rows.  Marking more rows makes
1170reference-position lookups faster, but requires more memory to hold the
1171annotations at runtime.  The default is 4 (every 16th row is marked; for human
1172genome, annotations occupy about 680 megabytes).
1173
1174    -t/--ftabchars <int>
1175
1176The ftab is the lookup table used to calculate an initial [Burrows-Wheeler]
1177range with respect to the first `<int>` characters of the query.  A larger
1178`<int>` yields a larger lookup table but faster query times.  The ftab has size
11794^(`<int>`+1) bytes.  The default setting is 10 (ftab is 4MB).
1180
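The arithmetic behind `--offrate` and `--ftabchars` is easy to check with a few lines of Python.  The helper names are hypothetical; the formulas are the ones stated above (one marked row in every 2^`<int>` rows, and an ftab of 4^(`<int>`+1) bytes).

```python
def marked_row_interval(offrate: int) -> int:
    """Rows per marked row: the indexer marks one row in every 2^offrate."""
    return 2 ** offrate


def ftab_bytes(ftabchars: int) -> int:
    """Size of the ftab lookup table in bytes: 4^(ftabchars + 1)."""
    return 4 ** (ftabchars + 1)


for offrate in (3, 4, 5):
    print(f"--offrate {offrate}: 1 row in {marked_row_interval(offrate)} is marked")
for chars in (8, 10, 12):
    print(f"--ftabchars {chars}: ftab is {ftab_bytes(chars) / 2**20:g} MiB")
```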
1181    --localoffrate <int>
1182
1183This option governs how many rows get marked in a local index:
1184the indexer will mark every 2^`<int>` rows.  Marking more rows makes
1185reference-position lookups faster, but requires more memory to hold the
annotations at runtime.  The default is 3 (every 8th row is marked;
this occupies about 16KB per local index).
1188
1189    --localftabchars <int>
1190
1191The local ftab is the lookup table in a local index.
1192The default setting is 6 (ftab is 8KB per local index).
1193
1194    -p <int>
1195
Launch `<int>` parallel build threads (default: 1).
1197
1198    --snp <path>
1199
Provide a list of SNPs (in HISAT2's own format) with five tab-separated columns:

   SNP ID `<tab>` SNP type (single, deletion, or insertion) `<tab>` chromosome name `<tab>` zero-based genomic position of the SNP `<tab>` alternative base (single), length of the deletion (deletion), or inserted sequence (insertion)

   For example,
       rs58784443      single  13      18447947        T

Use `hisat2_extract_snps_haplotypes_UCSC.py` (in the HISAT2 package) to extract SNPs and haplotypes from a dbSNP file (e.g. http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/snp144Common.txt.gz),
or `hisat2_extract_snps_haplotypes_VCF.py` to extract SNPs and haplotypes from a VCF file (e.g. ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chr22.phase3_shapeit2_mvncall_integrated_v3plus_nounphased.rsID.genotypes.GRCh38_dbSNP_no_SVs.vcf.gz).
1209
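To make the five-column SNP format concrete, here is a small Python sketch that parses one line into a record.  The `Snp` class and `parse_snp_line` function are hypothetical helpers written for this example, not part of the HISAT2 package, and the check that a deletion carries a numeric length is an assumption based on the column description above.

```python
from dataclasses import dataclass


@dataclass
class Snp:
    snp_id: str
    kind: str    # "single", "deletion", or "insertion"
    chrom: str
    pos: int     # zero-based genomic position
    data: str    # alternative base, deletion length, or inserted sequence


def parse_snp_line(line: str) -> Snp:
    """Parse one line of a --snp file (whitespace-tolerant split)."""
    snp_id, kind, chrom, pos, data = line.split()
    if kind not in ("single", "deletion", "insertion"):
        raise ValueError(f"unknown SNP type: {kind}")
    if kind == "deletion" and not data.isdigit():
        # Assumption: deletions store their length as an integer.
        raise ValueError(f"deletion length is not a number: {data}")
    return Snp(snp_id, kind, chrom, int(pos), data)


print(parse_snp_line("rs58784443\tsingle\t13\t18447947\tT"))
```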
1210    --haplotype <path>
1211
Provide a list of haplotypes (in HISAT2's own format) with five tab-separated columns:

   Haplotype ID `<tab>` chromosome name `<tab>` zero-based left coordinate of the haplotype `<tab>` zero-based right coordinate of the haplotype `<tab>` a comma-separated list of the SNP IDs in the haplotype

   For example,
       ht35    13      18446877        18446945        rs12381094,rs12381056,rs192016659,rs538569910

See the `--snp` option above for how to extract haplotypes.  This option is not required, but haplotype information can keep index construction from exploding in size and can reduce the index size substantially.
1220
1221    --ss <path>
1222
Note that this option should be used together with the `--exon` option below.
Provide a list of splice sites (in HISAT2's own format) with four tab-separated columns:

   chromosome name `<tab>` zero-based genomic position of the flanking base on the left side of an intron `<tab>` zero-based genomic position of the flanking base on the right side `<tab>` strand
1227
1228Use `hisat2_extract_splice_sites.py` (in the HISAT2 package) to extract splice sites from a GTF file.
1229
1230    --exon <path>
1231
Note that this option should be used together with the `--ss` option above.
Provide a list of exons (in HISAT2's own format) with three tab-separated columns:

   chromosome name `<tab>` zero-based left genomic position of an exon `<tab>` zero-based right genomic position of an exon
1236
1237Use `hisat2_extract_exons.py` (in the HISAT2 package) to extract exons from a GTF file.
1238
1239    --seed <int>
1240
1241Use `<int>` as the seed for pseudo-random number generator.
1242
1243    --cutoff <int>
1244
1245Index only the first `<int>` bases of the reference sequences (cumulative across
1246sequences) and ignore the rest.
1247
1248    -q/--quiet
1249
1250`hisat2-build` is verbose by default.  With this option `hisat2-build` will
1251print only error messages.
1252
1253    -h/--help
1254
1255Print usage information and quit.
1256
1257    --version
1258
1259Print version information and quit.
1260
1261The `hisat2-inspect` index inspector
1262=====================================
1263
1264`hisat2-inspect` extracts information from a HISAT2 index about what kind of
1265index it is and what reference sequences were used to build it. When run without
1266any options, the tool will output a FASTA file containing the sequences of the
original references (with all non-`A`/`C`/`G`/`T` characters converted to `N`s).
It can also be used to extract just the reference sequence names using the
1269`-n`/`--names` option or a more verbose summary using the `-s`/`--summary`
1270option.
1271
1272Command Line
1273------------
1274
1275Usage:
1276
1277    hisat2-inspect [options]* <ht2_base>
1278
1279### Main arguments
1280
    <ht2_base>

The basename of the index to be inspected.  The basename is the name of any of the
index files but with the `.X.ht2` suffix omitted.
`hisat2-inspect` first looks in the current directory for the index files, then
in the directory specified in the `HISAT2_INDEXES` environment variable.
1285
1286### Options
1287
1288    -a/--across <int>
1289
1290When printing FASTA output, output a newline character every `<int>` bases
1291(default: 60).
1292
1293    -n/--names
1294
1295Print reference sequence names, one per line, and quit.
1296
1297    -s/--summary
1298
1299Print a summary that includes information about index settings, as well as the
1300names and lengths of the input sequences.  The summary has this format:
1301
1302    Colorspace	<0 or 1>
1303    SA-Sample	1 in <sample>
1304    FTab-Chars	<chars>
1305    Sequence-1	<name>	<len>
1306    Sequence-2	<name>	<len>
1307    ...
1308    Sequence-N	<name>	<len>
1309
1310Fields are separated by tabs.  Colorspace is always set to 0 for HISAT2.
1311
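The tab-separated summary is straightforward to consume from a script.  The Python sketch below parses text in the format shown above; the function name `parse_inspect_summary` is hypothetical, and the sample text is made up for illustration rather than captured from a real run.

```python
def parse_inspect_summary(text: str) -> dict:
    """Parse `hisat2-inspect -s` style output into settings and sequences."""
    settings, sequences = {}, []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # skip blank or unrecognized lines
        if fields[0].startswith("Sequence-") and len(fields) >= 3:
            sequences.append((fields[1], int(fields[2])))
        else:
            settings[fields[0]] = fields[1]
    return {"settings": settings, "sequences": sequences}


# Made-up sample in the documented layout.
sample = (
    "Colorspace\t0\n"
    "SA-Sample\t1 in 16\n"
    "FTab-Chars\t10\n"
    "Sequence-1\t22:20000001-21000000\t1000000\n"
)
summary = parse_inspect_summary(sample)
print(summary["settings"])
print(summary["sequences"])
```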
1312    --snp
1313
1314Print SNPs, and quit.
1315
1316    --ss
1317
1318Print splice sites, and quit.
1319
1320    --ss-all
1321
1322Print splice sites including those not in the global index, and quit.
1323
1324    --exon
1325
1326Print exons, and quit.
1327
1328    -v/--verbose
1329
1330Print verbose output (for debugging).
1331
1332    --version
1333
1334Print version information and quit.
1335
1336    -h/--help
1337
1338Print usage information and quit.
1339
1340Getting started with HISAT2
1341===================================================
1342
1343HISAT2 comes with some example files to get you started.  The example files
1344are not scientifically significant; these files will simply let you start running HISAT2 and
1345downstream tools right away.
1346
1347First follow the manual instructions to [obtain HISAT2].  Set the `HISAT2_HOME`
1348environment variable to point to the new HISAT2 directory containing the
1349`hisat2`, `hisat2-build` and `hisat2-inspect` binaries.  This is important,
1350as the `HISAT2_HOME` variable is used in the commands below to refer to that
1351directory.
1352
1353Indexing a reference genome
1354---------------------------
1355
1356To create an index for the genomic region (1 million bps from the human chromosome 22 between 20,000,000 and 20,999,999)
1357included with HISAT2, create a new temporary directory (it doesn't matter where), change into that directory, and run:
1358
1359    $HISAT2_HOME/hisat2-build $HISAT2_HOME/example/reference/22_20-21M.fa --snp $HISAT2_HOME/example/reference/22_20-21M.snp 22_20-21M_snp
1360
1361The command should print many lines of output then quit. When the command
completes, the current directory will contain eight new files that all start with
1363`22_20-21M_snp` and end with `.1.ht2`, `.2.ht2`, `.3.ht2`, `.4.ht2`, `.5.ht2`, `.6.ht2`,
1364`.7.ht2`, and `.8.ht2`.  These files constitute the index - you're done!
1365
1366You can use `hisat2-build` to create an index for a set of FASTA files obtained
1367from any source, including sites such as [UCSC], [NCBI], and [Ensembl]. When
1368indexing multiple FASTA files, specify all the files using commas to separate
1369file names.  For more details on how to create an index with `hisat2-build`,
1370see the [manual section on index building].  You may also want to bypass this
1371process by obtaining a pre-built index.
1372
1373[UCSC]: http://genome.ucsc.edu/cgi-bin/hgGateway
1374[NCBI]: http://www.ncbi.nlm.nih.gov/sites/genome
1375[Ensembl]: http://www.ensembl.org/
1376
1377Aligning example reads
1378----------------------
1379
Stay in the directory created in the previous step, which now contains the
`22_20-21M_snp` index files.  Next, run:
1382
1383    $HISAT2_HOME/hisat2 -f -x $HISAT2_HOME/example/index/22_20-21M_snp -U $HISAT2_HOME/example/reads/reads_1.fa -S eg1.sam
1384
1385This runs the HISAT2 aligner, which aligns a set of unpaired reads to the
1386genome region using the index generated in the previous step.
1387The alignment results in SAM format are written to the file `eg1.sam`, and a
1388short alignment summary is written to the console.  (Actually, the summary is
1389written to the "standard error" or "stderr" filehandle, which is typically
1390printed to the console.)
1391
1392To see the first few lines of the SAM output, run:
1393
1394    head eg1.sam
1395
1396You will see something like this:
1397
1398    @HD     VN:1.0  SO:unsorted
1399    @SQ     SN:22:20000001-21000000 LN:1000000
1400    @PG     ID:hisat2       PN:hisat2       VN:2.0.0-beta
1401    1       0       22:20000001-21000000    397984  255     100M    *       0       0       GCCTGTGAGGGAGCCCCGGACCCGGTCAGAGCAGGAGCCTGGCCTGGGGCCAAGTTCACCTTATGGACTCTCTTCCCTGCCCTTCCAGGAGCAGCTCACT    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:100        YT:Z:UU NH:i:1
1402    2       16      22:20000001-21000000    398131  255     100M    *       0       0       ATGACACACTGTACACACCAGGGGCCCTGTGCTCCCCAGGAAGAGGGCCCTCACTTGAAGCGGGGCCCGATGGCCGCCACGTGCCGGTTCATGCTCCCCT    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:80A19      YT:Z:UU NH:i:1  Zs:Z:80|S|rs576159895
1403    3       16      22:20000001-21000000    398222  255     100M    *       0       0       TGCTCCCCTTGGCCCCGCCGATGTTCAGGGACATGGAGCGCTGCAGCAGGCTGGAGAAGATCTCCACTTGGTCAGAGCTGCAGTACTTGGCGATCTCAAA    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:16A83      YT:Z:UU NH:i:1  Zs:Z:16|S|rs2629364
1404    4       16      22:20000001-21000000    398247  255     90M200N10M      *       0       0       CAGGGACATGGAGCGCTGCAGCAGGCTGGAGAAGATCTCCACTTGGTCAGAGCTGCAGTACTTGGCGATCTCAAACCGCTGCACCAGGAAGTCGATCCAG    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:100        YT:Z:UU XS:A:-  NH:i:1
1405    5       16      22:20000001-21000000    398194  255     100M    *       0       0       GGCCCGATGGCCGCCACGTGCCGGTTCATGCTCCCCTTGGCCCCGCCGATGTTCAGGGACATGGAGCGCTGCAGCAGGCTGGAGAAGATCTCCACTTGGT    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:17A26A55   YT:Z:UU NH:i:1  Zs:Z:17|S|rs576159895,26|S|rs2629364
1406    6       0       22:20000001-21000000    398069  255     100M    *       0       0       CAGGAGCAGCTCACTGAAATGTGTTCCCCGTCTACAGAAGTACCGTGATACACAGACGCCCCATGACACACTGTACACACCAGGGGCCCTGTGCTCCCCA    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:100        YT:Z:UU NH:i:1
1407    7       0       22:20000001-21000000    397896  255     100M    *       0       0       GTGGAGTAGATCTTCTCGCGAAGCACATTGCAGATGGTTGCATTTGGAACCACATCGGCATGCAGGAGGGACAGCCCCAGGGTCAGCAGCCTGTGAGGGA    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:31G68      YT:Z:UU NH:i:1  Zs:Z:31|S|rs562662261
1408    8       0       22:20000001-21000000    398150  255     100M    *       0       0       AGGGGCCCTGTGCTCCCCAGGAAGAGGGCCCTCACTTGAAGCGGGGCCCGATGGCCGCCACGTGCCGGTTCATGCTCCCCTTGGCCCCGCCGATGTTCAG    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:61A26A11   YT:Z:UU NH:i:1  Zs:Z:61|S|rs576159895,26|S|rs2629364
1409    9       16      22:20000001-21000000    398329  255     8M200N92M       *       0       0       ACCAGGAAGTCGATCCAGATGTAGTGGGGGGTCACTTCGGGGGGACAGGGTTTGGGTTGACTTGCTTCCGAGGCAGCCAGGGGGTCTGCTTCCTTTATCT    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:100        YT:Z:UU XS:A:-  NH:i:1
1410    10      16      22:20000001-21000000    398184  255     100M    *       0       0       CTTGAAGCGGGGCCCGATGGCCGCCACGTGCCGGTTCATGCTCCCCTTGGCCCCGCCGATGTTCAGGGACATGGAGCGCTGCAGCAGGCTGGAGAAGATC    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:27A26A45   YT:Z:UU NH:i:1  Zs:Z:27|S|rs576159895,26|S|rs2629364

The first few lines (beginning with `@`) are SAM header lines, and the rest of
the lines are SAM alignments, one line per read or mate. See the [HISAT2
manual section on SAM output] and the [SAM specification] for details about how
to interpret the SAM file format.
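
As a quick orientation, a SAM file can be summarized with `awk`. The sketch below assumes the standard tab-separated SAM columns (FLAG in column 2, CIGAR in column 6). It counts header lines, alignment records, reverse-strand records (FLAG bit 16, as in several of the records above), and spliced records (an `N` operation in the CIGAR string, as in `90M200N10M`). The function name `sam_summary` is hypothetical.

```shell
# sam_summary FILE
# Hypothetical helper: print one line of counts for a SAM file.
sam_summary() {
    awk -F '\t' '
        /^@/ { header++; next }                   # header lines start with @
        {
            records++
            if (int($2 / 16) % 2 == 1) reverse++  # FLAG bit 16: reverse strand
            if ($6 ~ /N/) spliced++               # N in CIGAR: skipped region
        }
        END {
            printf "header=%d records=%d reverse=%d spliced=%d\n",
                   header, records, reverse, spliced
        }
    ' "$1"
}
```

For example, `sam_summary eg1.sam` prints a single line of counts for the unpaired example.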

Paired-end example
------------------

To align paired-end reads included with HISAT2, stay in the same directory and
run:

    $HISAT2_HOME/hisat2 -f -x $HISAT2_HOME/example/index/22_20-21M_snp -1 $HISAT2_HOME/example/reads/reads_1.fa -2 $HISAT2_HOME/example/reads/reads_2.fa -S eg2.sam

This aligns a set of paired-end reads to the reference genome, with results
written to the file `eg2.sam`.

Using SAMtools/BCFtools downstream
----------------------------------

[SAMtools] is a collection of tools for manipulating and analyzing SAM and BAM
alignment files. [BCFtools] is a collection of tools for calling variants and
manipulating VCF and BCF files, and it is typically distributed with [SAMtools].
Using these tools together allows you to get from alignments in SAM format to
variant calls in VCF format. This example assumes that `samtools` and
`bcftools` are installed and that the directories containing these binaries are
in your [PATH environment variable].

Run the paired-end example:

    $HISAT2_HOME/hisat2 -f -x $HISAT2_HOME/example/index/22_20-21M_snp -1 $HISAT2_HOME/example/reads/reads_1.fa -2 $HISAT2_HOME/example/reads/reads_2.fa -S eg2.sam

Use `samtools view` to convert the SAM file into a BAM file. BAM is the
binary format corresponding to the SAM text format. Run:

    samtools view -bS eg2.sam > eg2.bam

Use `samtools sort` to convert the BAM file to a sorted BAM file. The following command requires samtools version 1.2 or higher.

    samtools sort eg2.bam -o eg2.sorted.bam

We now have a sorted BAM file called `eg2.sorted.bam`. Sorted BAM is a useful
format because the alignments are (a) compressed, which is convenient for
long-term storage, and (b) sorted, which is convenient for variant discovery.
To generate variant calls in VCF format, run:

    samtools mpileup -uf $HISAT2_HOME/example/reference/22_20-21M.fa eg2.sorted.bam | bcftools view -bvcg - > eg2.raw.bcf

Then to view the variants, run:

    bcftools view eg2.raw.bcf

See the official SAMtools guide to [Calling SNPs/INDELs with SAMtools/BCFtools]
for more details and variations on this process.
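
The `bcftools view -bvcg` command above follows the older SAMtools/BCFtools 0.1.x interface. In recent BCFtools 1.x releases, pileup generation and variant calling are done by `bcftools mpileup` and `bcftools call`. The following is a sketch of a roughly equivalent pipeline, assuming such a release is on your PATH. The helper name `call_variants` is hypothetical, and the multiallelic caller (`-m`) is one reasonable choice rather than the only one.

```shell
# call_variants REF BAM OUT
# Hypothetical helper for BCFtools 1.x: pile up reads against REF and
# write variant calls to OUT as compressed BCF.
call_variants() {
    ref=$1 bam=$2 out=$3
    # mpileup: -f reference FASTA, -Ou uncompressed BCF (cheap to pipe)
    # call:    -m multiallelic caller, -v variant sites only,
    #          -Ob compressed BCF output, -o output file
    bcftools mpileup -Ou -f "$ref" "$bam" | bcftools call -mv -Ob -o "$out"
}

# Run on the example data only if bcftools and the sorted BAM are present.
if command -v bcftools >/dev/null 2>&1 && [ -f eg2.sorted.bam ]; then
    call_variants "$HISAT2_HOME/example/reference/22_20-21M.fa" \
        eg2.sorted.bam eg2.raw.bcf
fi
```

The resulting `eg2.raw.bcf` can be inspected with `bcftools view eg2.raw.bcf`, as before.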

[BCFtools]: http://samtools.sourceforge.net/mpileup.shtml
[Calling SNPs/INDELs with SAMtools/BCFtools]: http://samtools.sourceforge.net/mpileup.shtml
