Introduction
============

What is HISAT2?
-----------------

HISAT2 is a fast and sensitive alignment program for mapping next-generation sequencing reads
(whole-genome, transcriptome, and exome sequencing data) against the general human population
(as well as against a single reference genome). Based on [GCSA] (an extension of [BWT] for a graph), we designed and implemented a graph FM index (GFM),
an original approach and its first implementation to the best of our knowledge.
In addition to using one global GFM index that represents the general population,
HISAT2 uses a large set of small GFM indexes that collectively cover the whole genome
(each index representing a genomic region of 56 Kbp, with 55,000 indexes needed to cover the human population).
These small indexes (called local indexes), combined with several alignment strategies, enable effective alignment of sequencing reads.
This new indexing scheme is called the Hierarchical Graph FM index (HGFM).
We have developed HISAT2 based on the [HISAT] and [Bowtie2] implementations.
HISAT2 outputs alignments in [SAM] format, enabling interoperation with a large number of other tools (e.g. [SAMtools], [GATK]) that use SAM.
HISAT2 is distributed under the [GPLv3 license], and it runs on the command line under
Linux, Mac OS X and Windows.

[HISAT2]:          http://ccb.jhu.edu/software/hisat2
[HISAT]:           http://ccb.jhu.edu/software/hisat
[Bowtie2]:         http://bowtie-bio.sf.net/bowtie2
[Bowtie]:          http://bowtie-bio.sf.net
[Bowtie1]:         http://bowtie-bio.sf.net
[GCSA]:            http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6698337&tag=1
[Burrows-Wheeler Transform]: http://en.wikipedia.org/wiki/Burrows-Wheeler_transform
[BWT]:             http://en.wikipedia.org/wiki/Burrows-Wheeler_transform
[FM Index]:        http://en.wikipedia.org/wiki/FM-index
[SAM]:             http://samtools.sourceforge.net/SAM1.pdf
[SAMtools]:        http://samtools.sourceforge.net
[GATK]:            http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit
[TopHat2]:         http://ccb.jhu.edu/software/tophat
[Cufflinks]:       http://cufflinks.cbcb.umd.edu/
[Crossbow]:        http://bowtie-bio.sf.net/crossbow
[Myrna]:           http://bowtie-bio.sf.net/myrna
[Bowtie paper]:    http://genomebiology.com/2009/10/3/R25
[GPLv3 license]:   http://www.gnu.org/licenses/gpl-3.0.html

Obtaining HISAT2
==================

Download HISAT2 sources and binaries from the Releases section on the right side.
Binaries are available for Intel architectures (`x86_64`) running Linux and Mac OS X.

Building from source
--------------------

Building HISAT2 from source requires a GNU-like environment with GCC, GNU Make
and other basics. It should be possible to build HISAT2 on most vanilla Linux
installations or on a Mac installation with [Xcode] installed. HISAT2 can
also be built on Windows using [Cygwin] or [MinGW] (MinGW recommended). For a
MinGW build, the choice of compiler is important since it
determines whether 32-bit or 64-bit code can be compiled successfully. If
both 32-bit and 64-bit builds are needed on the same machine, then a multilib
MinGW has to be properly installed. [MSYS], the [zlib] library, and, depending on
architecture, the [pthreads] library are also required.
We recommend a 64-bit
build since it has some clear advantages in real-life research problems. To
simplify the MinGW setup, it might be worth investigating popular MinGW personal
builds, since these come already prepared with most of the toolchains needed.

First, download the [source package] from the Releases section on the right side.
Unzip the file, change to the unzipped directory, and build the
HISAT2 tools by running GNU `make` (usually with the command `make`, but
sometimes with `gmake`) with no arguments. If building with MinGW, run `make`
from the MSYS environment.

HISAT2 uses the multithreading software model in order to speed up
execution times on SMP architectures where this is possible. On POSIX
platforms (like Linux, Mac OS, etc.) it needs the pthread library. Although
it is possible to use the pthread library on a non-POSIX platform like Windows,
for performance reasons HISAT2 will try to use Windows native multithreading
if possible.

For the support of SRA data access in HISAT2, please download and install the [NCBI-NGS] toolkit.
When running `make`, specify additional variables as follows:
`make USE_SRA=1 NCBI_NGS_DIR=/path/to/NCBI-NGS-directory NCBI_VDB_DIR=/path/to/NCBI-NGS-directory`,
where `NCBI_NGS_DIR` and `NCBI_VDB_DIR` will be used in the Makefile for the `-I` and `-L` compilation options.
For example, `$(NCBI_NGS_DIR)/include` and `$(NCBI_NGS_DIR)/lib64` will be used.

[Cygwin]:   http://www.cygwin.com/
[MinGW]:    http://www.mingw.org/
[MSYS]:     http://www.mingw.org/wiki/msys
[zlib]:     http://cygwin.com/packages/mingw-zlib/
[pthreads]: http://sourceware.org/pthreads-win32/
[GnuWin32]: http://gnuwin32.sf.net/packages/coreutils.htm
[Download]: https://sourceforge.net/projects/bowtie-bio/files/bowtie2/
[sourceforge site]: https://sourceforge.net/projects/bowtie-bio/files/bowtie2/
[source package]: http://ccb.jhu.edu/software/hisat2/downloads/hisat2-2.0.0-beta-source.zip
[Xcode]:    http://developer.apple.com/xcode/
[NCBI-NGS]: https://github.com/ncbi/ngs/wiki/Downloads

Running HISAT2
==============

Adding to PATH
--------------

By adding your new HISAT2 directory to your [PATH environment variable], you
ensure that whenever you run `hisat2`, `hisat2-build` or `hisat2-inspect`
from the command line, you will get the version you just installed without
having to specify the entire path. This is recommended for most users. To do
this, follow your operating system's instructions for adding the directory to
your [PATH].

If you would like to install HISAT2 by copying the HISAT2 executable files
to an existing directory in your [PATH], make sure that you copy all the
executables, including `hisat2`, `hisat2-align-s`, `hisat2-align-l`, `hisat2-build`, `hisat2-build-s`, `hisat2-build-l`, `hisat2-inspect`, `hisat2-inspect-s` and
`hisat2-inspect-l`.

[PATH environment variable]: http://en.wikipedia.org/wiki/PATH_(variable)
[PATH]: http://en.wikipedia.org/wiki/PATH_(variable)

Reporting
---------

The reporting mode governs how many alignments HISAT2 looks for, and how to
report them.

In general, when we say that a read has an alignment, we mean that it has a
[valid alignment]. When we say that a read has multiple alignments, we mean
that it has multiple alignments that are valid and distinct from one another.

By default, HISAT2 may soft-clip reads near their 5' and 3' ends. Users can control this behavior by setting different penalties for soft-clipping (`--sp`) or by disallowing soft-clipping (`--no-softclip`).

### Distinct alignments map a read to different places

Two alignments for the same individual read are "distinct" if they map the same
read to different places. Specifically, we say that two alignments are distinct
if there are no alignment positions where a particular read offset is aligned
opposite a particular reference offset in both alignments with the same
orientation. E.g. if the first alignment is in the forward orientation and
aligns the read character at read offset 10 to the reference character at
chromosome 3, offset 3,445,245, and the second alignment is also in the forward
orientation and also aligns the read character at read offset 10 to the
reference character at chromosome 3, offset 3,445,245, they are not distinct
alignments.

Two alignments for the same pair are distinct if either the mate 1s in the two
paired-end alignments are distinct or the mate 2s in the two alignments are
distinct or both.

### Default mode: search for one or more alignments, report each

HISAT2 searches for up to N distinct, primary alignments for
each read, where N equals the integer specified with the `-k` parameter.
Primary alignments are alignments whose alignment score is equal to or higher than that of any other alignment.
It is possible that multiple distinct alignments have the same score.
For example, if `-k 2` is specified, HISAT2 will search for at most 2 distinct
alignments. The alignment score for a paired-end alignment equals the sum of the
alignment scores of the individual mates. Each reported read or pair alignment
beyond the first has the SAM 'secondary' bit (which equals 256) set in its FLAGS
field. See the [SAM specification] for details.
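
As an illustration of the 'secondary' bit described above, here is a minimal Python sketch that separates SAM records by whether bit 256 is set in the FLAGS field (the second tab-separated column). The function names and the sample records are hypothetical and are not part of HISAT2.

```python
SECONDARY_BIT = 256  # SAM FLAGS bit for a secondary alignment (0x100)


def is_secondary(flag):
    """Return True if the SAM FLAGS value has the secondary bit set."""
    return bool(flag & SECONDARY_BIT)


def split_by_secondary_bit(sam_lines):
    """Partition SAM alignment lines into (bit unset, bit set).

    Header lines (starting with '@') are skipped.
    """
    unset, flagged = [], []
    for line in sam_lines:
        if line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])  # FLAGS is the second column
        (flagged if is_secondary(flag) else unset).append(line)
    return unset, flagged


if __name__ == "__main__":
    # Hypothetical records, only for demonstration.
    records = [
        "@HD\tVN:1.0",
        "read1\t0\tchr1\t100\t1\t4M\t*\t0\t0\tACGT\tIIII",
        "read1\t256\tchr2\t200\t1\t4M\t*\t0\t0\tACGT\tIIII",
    ]
    unset, flagged = split_by_secondary_bit(records)
    print(len(unset), len(flagged))
```

Testing the bit with `&` rather than comparing the whole value matters because FLAGS combines several bits; a reverse-strand secondary alignment, for instance, carries 256 + 16 = 272.
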

HISAT2 does not "find" alignments in any specific order, so for reads that
have more than N distinct, valid alignments, HISAT2 does not guarantee that
the N alignments reported are the best possible in terms of alignment score.
Still, this mode can be effective and fast in situations where the user cares
more about whether a read aligns (or aligns a certain number of times) than
where exactly it originated.

[SAM specification]: http://samtools.sourceforge.net/SAM1.pdf

Alignment summary
------------------

When HISAT2 finishes running, it prints messages summarizing what happened.
These messages are printed to the "standard error" ("stderr") filehandle. For
datasets consisting of unpaired reads, the summary might look like this:

    20000 reads; of these:
      20000 (100.00%) were unpaired; of these:
        1247 (6.24%) aligned 0 times
        18739 (93.69%) aligned exactly 1 time
        14 (0.07%) aligned >1 times
    93.77% overall alignment rate

For datasets consisting of pairs, the summary might look like this:

    10000 reads; of these:
      10000 (100.00%) were paired; of these:
        650 (6.50%) aligned concordantly 0 times
        8823 (88.23%) aligned concordantly exactly 1 time
        527 (5.27%) aligned concordantly >1 times
        ----
        650 pairs aligned concordantly 0 times; of these:
          34 (5.23%) aligned discordantly 1 time
        ----
        616 pairs aligned 0 times concordantly or discordantly; of these:
          1232 mates make up the pairs; of these:
            660 (53.57%) aligned 0 times
            571 (46.35%) aligned exactly 1 time
            1 (0.08%) aligned >1 times
    96.70% overall alignment rate

The indentation indicates how subtotals relate to totals.

Wrapper
-------

The `hisat2`, `hisat2-build` and `hisat2-inspect` executables are actually
wrapper scripts that call binary programs as appropriate.
The wrappers shield
users from having to distinguish between "small" and "large" index formats,
discussed briefly in the following section. Also, the `hisat2` wrapper
provides some key functionality, like the ability to handle compressed inputs,
and the functionality for `--un`, `--al` and related options.

It is recommended that you always run the `hisat2` wrappers and not run the
binaries directly.

Small and large indexes
-----------------------

`hisat2-build` can index reference genomes of any size. For genomes less than
about 4 billion nucleotides in length, `hisat2-build` builds a "small" index
using 32-bit numbers in various parts of the index. When the genome is longer,
`hisat2-build` builds a "large" index using 64-bit numbers. Small indexes are
stored in files with the `.ht2` extension, and large indexes are stored in
files with the `.ht2l` extension. The user need not worry about whether a
particular index is small or large; the wrapper scripts will automatically build
and use the appropriate index.

Performance tuning
------------------

1.  If your computer has multiple processors/cores, use `-p`

    The `-p` option causes HISAT2 to launch a specified number of parallel
    search threads. Each thread runs on a different processor/core and all
    threads find alignments in parallel, increasing alignment throughput by
    approximately a multiple of the number of threads (though in practice,
    speedup is somewhat worse than linear).

Command Line
------------

### Setting function options

Some HISAT2 options specify a function rather than an individual number or
setting. In these cases the user specifies three parameters: (a) a function
type `F`, (b) a constant term `B`, and (c) a coefficient `A`. The available
function types are constant (`C`), linear (`L`), square-root (`S`), and natural
log (`G`).
The parameters are specified as `F,B,A` - that is, the function type,
the constant term, and the coefficient are separated by commas with no
whitespace. The constant term and coefficient may be negative and/or
floating-point numbers.

For example, if the function specification is `L,-0.4,-0.6`, then the function
defined is:

    f(x) = -0.4 + -0.6 * x

If the function specification is `G,1,5.4`, then the function defined is:

    f(x) = 1.0 + 5.4 * ln(x)

See the documentation for the option in question to learn what the parameter `x`
is for. For example, in the case of the `--score-min` option, the function
`f(x)` sets the minimum alignment score necessary for an alignment to be
considered valid, and `x` is the read length.

### Usage

    hisat2 [options]* -x <hisat2-idx> {-1 <m1> -2 <m2> | -U <r> | --sra-acc <SRA accession number>} [-S <hit>]

### Main arguments

    -x <hisat2-idx>

The basename of the index for the reference genome. The basename is the name of
any of the index files up to but not including the final `.1.ht2` / etc.
`hisat2` looks for the specified index first in the current directory,
then in the directory specified in the `HISAT2_INDEXES` environment variable.

    -1 <m1>

Comma-separated list of files containing mate 1s (filename usually includes
`_1`), e.g. `-1 flyA_1.fq,flyB_1.fq`. Sequences specified with this option must
correspond file-for-file and read-for-read with those specified in `<m2>`. Reads
may be a mix of different lengths. If `-` is specified, `hisat2` will read the
mate 1s from the "standard in" or "stdin" filehandle.

    -2 <m2>

Comma-separated list of files containing mate 2s (filename usually includes
`_2`), e.g. `-2 flyA_2.fq,flyB_2.fq`. Sequences specified with this option must
correspond file-for-file and read-for-read with those specified in `<m1>`. Reads
may be a mix of different lengths.
If `-` is specified, `hisat2` will read the
mate 2s from the "standard in" or "stdin" filehandle.

    -U <r>

Comma-separated list of files containing unpaired reads to be aligned, e.g.
`lane1.fq,lane2.fq,lane3.fq,lane4.fq`. Reads may be a mix of different lengths.
If `-` is specified, `hisat2` gets the reads from the "standard in" or "stdin"
filehandle.

    --sra-acc <SRA accession number>

Comma-separated list of SRA accession numbers, e.g. `--sra-acc SRR353653,SRR353654`.
Information about read types is available at http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?sp=runinfo&acc=<b>sra-acc</b>&retmode=xml,
where <b>sra-acc</b> is the SRA accession number. If you run HISAT2 on a computer cluster, it is recommended to disable SRA-related caching (see the instructions at [SRA-MANUAL]).

[SRA-MANUAL]: https://github.com/ncbi/sra-tools/wiki/Toolkit-Configuration

    -S <hit>

File to write SAM alignments to. By default, alignments are written to the
"standard out" or "stdout" filehandle (i.e. the console).

### Options

#### Input options

    -q

Reads (specified with `<m1>`, `<m2>`, `<r>`) are FASTQ files. FASTQ files
usually have extension `.fq` or `.fastq`. FASTQ is the default format. See
also: `--solexa-quals` and `--int-quals`.

    --qseq

Reads (specified with `<m1>`, `<m2>`, `<r>`) are QSEQ files. QSEQ files usually
end in `_qseq.txt`. See also: `--solexa-quals` and `--int-quals`.

    -f

Reads (specified with `<m1>`, `<m2>`, `<r>`) are FASTA files. FASTA files
usually have extension `.fa`, `.fasta`, `.mfa`, `.fna` or similar. FASTA files
do not have a way of specifying quality values, so when `-f` is set, the result
is as if `--ignore-quals` is also set.

    -r

Reads (specified with `<m1>`, `<m2>`, `<r>`) are files with one input sequence
per line, without any other information (no read names, no qualities).
When
`-r` is set, the result is as if `--ignore-quals` is also set.

    -c

The read sequences are given on the command line. I.e. `<m1>`, `<m2>` and
`<r>` are comma-separated lists of reads rather than lists of read files.
There is no way to specify read names or qualities, so `-c` also implies
`--ignore-quals`.

    -s/--skip <int>

Skip (i.e. do not align) the first `<int>` reads or pairs in the input.

    -u/--qupto <int>

Align the first `<int>` reads or read pairs from the input (after the
`-s`/`--skip` reads or pairs have been skipped), then stop. Default: no limit.

    -5/--trim5 <int>

Trim `<int>` bases from 5' (left) end of each read before alignment (default: 0).

    -3/--trim3 <int>

Trim `<int>` bases from 3' (right) end of each read before alignment (default:
0).

    --phred33

Input qualities are ASCII chars equal to the [Phred quality] plus 33. This is
also called the "Phred+33" encoding, which is used by the very latest Illumina
pipelines.

[Phred quality]: http://en.wikipedia.org/wiki/Phred_quality_score

    --phred64

Input qualities are ASCII chars equal to the [Phred quality] plus 64. This is
also called the "Phred+64" encoding.

    --solexa-quals

Convert input qualities from [Solexa][Phred quality] (which can be negative) to
[Phred][Phred quality] (which can't). This scheme was used in older Illumina GA
Pipeline versions (prior to 1.3). Default: off.

    --int-quals

Quality values are represented in the read input file as space-separated ASCII
integers, e.g., `40 40 30 40`..., rather than ASCII characters, e.g., `II?I`....
Integers are treated as being on the [Phred quality] scale unless
`--solexa-quals` is also specified. Default: off.
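
To make the quality encodings above concrete, here is a minimal Python sketch of the Phred+33 and Phred+64 conversions and of the space-separated integer form accepted with `--int-quals`. It only illustrates the arithmetic; the function names are hypothetical and this is not HISAT2's own parsing code.

```python
def decode_qualities(qual_string, offset=33):
    """Convert an ASCII quality string to Phred values.

    offset is 33 for the Phred+33 encoding and 64 for Phred+64.
    """
    if offset not in (33, 64):
        raise ValueError("offset must be 33 or 64")
    return [ord(ch) - offset for ch in qual_string]


def parse_int_qualities(qual_field):
    """Parse space-separated integer qualities (the --int-quals form)."""
    return [int(token) for token in qual_field.split()]


if __name__ == "__main__":
    print(decode_qualities("II?I"))            # [40, 40, 30, 40]
    print(parse_int_qualities("40 40 30 40"))  # [40, 40, 30, 40]
```

The two calls give the same values because `I` is ASCII 73 and `?` is ASCII 63, so `II?I` under Phred+33 matches the integer example `40 40 30 40` in the `--int-quals` description.
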

#### Alignment options

    --n-ceil <func>

Sets a function governing the maximum number of ambiguous characters (usually
`N`s and/or `.`s) allowed in a read as a function of read length. For instance,
specifying `L,0,0.15` sets the N-ceiling function `f` to `f(x) = 0 + 0.15 * x`,
where x is the read length. See also: [setting function options]. Reads
exceeding this ceiling are [filtered out]. Default: `L,0,0.15`.

    --ignore-quals

When calculating a mismatch penalty, always consider the quality value at the
mismatched position to be the highest possible, regardless of the actual value.
I.e. input is treated as though all quality values are high. This is also the
default behavior when the input doesn't specify quality values (e.g. in `-f`,
`-r`, or `-c` modes).

    --nofw/--norc

If `--nofw` is specified, `hisat2` will not attempt to align unpaired reads to
the forward (Watson) reference strand. If `--norc` is specified, `hisat2` will
not attempt to align unpaired reads against the reverse-complement (Crick)
reference strand. In paired-end mode, `--nofw` and `--norc` pertain to the
fragments; i.e. specifying `--nofw` causes `hisat2` to explore only those
paired-end configurations corresponding to fragments from the reverse-complement
(Crick) strand. Default: both strands enabled.

#### Scoring options

    --mp MX,MN

Sets the maximum (`MX`) and minimum (`MN`) mismatch penalties, both integers. A
number less than or equal to `MX` and greater than or equal to `MN` is
subtracted from the alignment score for each position where a read character
aligns to a reference character, the characters do not match, and neither is an
`N`. If `--ignore-quals` is specified, the number subtracted equals `MX`.
Otherwise, the number subtracted is `MN + floor( (MX-MN)(MIN(Q, 40.0)/40.0) )`
where Q is the Phred quality value. Default: `MX` = 6, `MN` = 2.
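
The mismatch penalty formula above can be checked with a short Python sketch. It transcribes `MN + floor( (MX-MN)(MIN(Q, 40.0)/40.0) )` directly, with the documented defaults `MX` = 6 and `MN` = 2; the function name is hypothetical and the code is an illustration, not HISAT2's implementation.

```python
import math


def mismatch_penalty(q, mx=6, mn=2, ignore_quals=False):
    """Penalty subtracted for one mismatch at Phred quality q.

    With ignore_quals=True the penalty is always mx, as described
    for --ignore-quals.
    """
    if ignore_quals:
        return mx
    return mn + math.floor((mx - mn) * (min(q, 40.0) / 40.0))


if __name__ == "__main__":
    for q in (0, 10, 20, 30, 40):
        print(q, mismatch_penalty(q))
```

With the defaults, a mismatch at quality 0 costs 2, at quality 20 costs 4, and at quality 40 or above costs 6, so low-confidence base calls are penalized less than high-confidence ones.
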

    --sp MX,MN

Sets the maximum (`MX`) and minimum (`MN`) penalties for soft-clipping per base,
both integers. A number less than or equal to `MX` and greater than or equal to `MN` is
subtracted from the alignment score for each position.
The number subtracted is `MN + floor( (MX-MN)(MIN(Q, 40.0)/40.0) )`
where Q is the Phred quality value. Default: `MX` = 2, `MN` = 1.

    --no-softclip

Disallow soft-clipping.

    --np <int>

Sets penalty for positions where the read, reference, or both, contain an
ambiguous character such as `N`. Default: 1.

    --rdg <int1>,<int2>

Sets the read gap open (`<int1>`) and extend (`<int2>`) penalties. A read gap of
length N gets a penalty of `<int1>` + N * `<int2>`. Default: 5, 3.

    --rfg <int1>,<int2>

Sets the reference gap open (`<int1>`) and extend (`<int2>`) penalties. A
reference gap of length N gets a penalty of `<int1>` + N * `<int2>`. Default:
5, 3.

    --score-min <func>

Sets a function governing the minimum alignment score needed for an alignment to
be considered "valid" (i.e. good enough to report). This is a function of read
length. For instance, specifying `L,0,-0.6` sets the minimum-score function `f`
to `f(x) = 0 + -0.6 * x`, where `x` is the read length. See also: [setting
function options]. The default is `L,0,-0.2`.

#### Spliced alignment options

    --pen-cansplice <int>

Sets the penalty for each pair of canonical splice sites (e.g. GT/AG). Default: 0.

    --pen-noncansplice <int>

Sets the penalty for each pair of non-canonical splice sites (e.g. non-GT/AG). Default: 12.

    --pen-canintronlen <func>

Sets the penalty for long introns with canonical splice sites so that alignments with shorter introns are preferred
to those with longer ones.
Default: `G,-8,1`

    --pen-noncanintronlen <func>

Sets the penalty for long introns with noncanonical splice sites so that alignments with shorter introns are preferred
to those with longer ones. Default: `G,-8,1`

    --min-intronlen <int>

Sets minimum intron length. Default: 20

    --max-intronlen <int>

Sets maximum intron length. Default: 500000

    --known-splicesite-infile <path>

With this mode, you can provide a list of known splice sites, which HISAT2 makes use of to align reads with small anchors.
You can create such a list using `python hisat2_extract_splice_sites.py genes.gtf > splicesites.txt`,
where `hisat2_extract_splice_sites.py` is included in the HISAT2 package, `genes.gtf` is a gene annotation file,
and `splicesites.txt` is the list of splice sites that you provide to HISAT2 in this mode.
Note that indexes built using annotated transcripts (such as <i>genome_tran</i> or <i>genome_snp_tran</i>) work better
than this option. Providing splice sites that are already included in the indexes has no effect.

    --novel-splicesite-outfile <path>

In this mode, HISAT2 reports a list of splice sites in the file `<path>`:
   chromosome name `<tab>` genomic position of the flanking base on the left side of an intron `<tab>` genomic position of the flanking base on the right `<tab>` strand (+, -, and .)
   '.' indicates an unknown strand for non-canonical splice sites.

    --novel-splicesite-infile <path>

With this mode, you can provide a list of novel splice sites that were generated from the above option `--novel-splicesite-outfile`.

    --no-temp-splicesite

HISAT2, by default, makes use of splice sites found by earlier reads to align later reads in the same run,
in particular, reads with small anchors (<= 15 bp).
This option disables that default alignment strategy.
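
Several options in this manual (`--n-ceil`, `--score-min`, `--pen-canintronlen`, `--pen-noncanintronlen`) take a function specification of the form `F,B,A`. The following Python sketch evaluates such a specification as described under "Setting function options". It is an illustration only: the function names are hypothetical, and treating the constant type `C` as `f(x) = B` is an assumption based on its name, since the manual does not spell that case out.

```python
import math

# How x enters the formula f(x) = B + A * g(x) for each function type.
# "C" ignoring x entirely is an assumption; see the note above.
_TRANSFORMS = {
    "C": lambda x: 0.0,  # constant
    "L": lambda x: x,    # linear
    "S": math.sqrt,      # square-root
    "G": math.log,       # natural log
}


def parse_function(spec):
    """Split an 'F,B,A' string into (type, constant term, coefficient)."""
    ftype, b, a = spec.split(",")
    if ftype not in _TRANSFORMS:
        raise ValueError("unknown function type: " + ftype)
    return ftype, float(b), float(a)


def evaluate_function(spec, x):
    """Evaluate the function described by spec at x."""
    ftype, b, a = parse_function(spec)
    return b + a * _TRANSFORMS[ftype](x)


if __name__ == "__main__":
    # Default --score-min for a 100-bp read: 0 + -0.2 * 100
    print(evaluate_function("L,0,-0.2", 100))
    # Default --n-ceil for a 100-bp read: 0 + 0.15 * 100
    print(evaluate_function("L,0,0.15", 100))
```

For `--score-min` and `--n-ceil` the manual states that `x` is the read length, so the two calls give a minimum score of about -20 and an N ceiling of about 15 for a 100-bp read.
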

    --no-spliced-alignment

Disable spliced alignment.

    --rna-strandness <string>

Specify strand-specific information: the default is unstranded.
For single-end reads, use F or R.
 'F' means a read corresponds to a transcript.
 'R' means a read corresponds to the reverse complemented counterpart of a transcript.
For paired-end reads, use either FR or RF.
With this option being used, every read alignment will have an XS attribute tag:
 '+' means a read belongs to a transcript on '+' strand of genome.
 '-' means a read belongs to a transcript on '-' strand of genome.

(TopHat has a similar option, `--library-type`, where fr-firststrand corresponds to R and RF; fr-secondstrand corresponds to F and FR.)

    --tmo/--transcriptome-mapping-only

Report only those alignments within known transcripts.

    --dta/--downstream-transcriptome-assembly

Report alignments tailored for transcript assemblers including StringTie.
With this option, HISAT2 requires longer anchor lengths for de novo discovery of splice sites.
This leads to fewer alignments with short anchors,
which helps transcript assemblers improve significantly in computation and memory usage.

    --dta-cufflinks

Report alignments tailored specifically for Cufflinks. In addition to what HISAT2 does with the above option (`--dta`),
with this option HISAT2 looks for novel splice sites with three signals (GT/AG, GC/AG, AT/AC), but all user-provided splice sites are used irrespective of their signals.
HISAT2 produces an optional field, XS:A:[+-], for every spliced alignment.

    --no-templatelen-adjustment

Disables template length adjustment for RNA-seq reads.

#### Reporting options

    -k <int>

HISAT2 searches for at most `<int>` distinct, primary alignments for each read.
Primary alignments are alignments whose alignment score is equal to or higher than that of any other alignment.
The search terminates when it can't find more distinct valid alignments, or when it
finds `<int>`, whichever happens first. The alignment score for a paired-end
alignment equals the sum of the alignment scores of the individual mates. Each
reported read or pair alignment beyond the first has the SAM 'secondary' bit
(which equals 256) set in its FLAGS field. For reads that have more than
`<int>` distinct, valid alignments, `hisat2` does not guarantee that the
`<int>` alignments reported are the best possible in terms of alignment score. Default: 5 (HFM) or 10 (HGFM)

Note: HISAT2 is not designed with large values for `-k` in mind, and when
aligning reads to long, repetitive genomes large `-k` can be very, very slow.

    --max-seeds <int>

HISAT2, like other aligners, uses a seed-and-extend approach: it tries to extend seeds to full-length alignments. `--max-seeds` controls the maximum number of seeds that will be extended. HISAT2 extends up to this many seeds and skips the rest. Large values for `--max-seeds` may improve alignment sensitivity, but HISAT2 is not designed with large values for `--max-seeds` in mind, and when aligning reads to long, repetitive genomes large `--max-seeds` can be very, very slow. The default value is the maximum of 5 and the value given with `-k`.

    --secondary

Report secondary alignments.

#### Paired-end options

    -I/--minins <int>

The minimum fragment length for valid paired-end alignments. This option is valid only with `--no-spliced-alignment`.
E.g. if `-I 60` is specified and a paired-end alignment consists of two 20-bp alignments in the
appropriate orientation with a 20-bp gap between them, that alignment is
considered valid (as long as `-X` is also satisfied). A 19-bp gap would not
be valid in that case.
If trimming options `-3` or `-5` are also used, the
`-I` constraint is applied with respect to the untrimmed mates.

The larger the difference between `-I` and `-X`, the slower HISAT2 will
run. This is because larger differences between `-I` and `-X` require that
HISAT2 scan a larger window to determine if a concordant alignment exists.
For typical fragment length ranges (200 to 400 nucleotides), HISAT2 is very
efficient.

Default: 0 (essentially imposing no minimum)

    -X/--maxins <int>

The maximum fragment length for valid paired-end alignments. This option is valid only with `--no-spliced-alignment`.
E.g. if `-X 100` is specified and a paired-end alignment consists of two 20-bp alignments in the
proper orientation with a 60-bp gap between them, that alignment is considered
valid (as long as `-I` is also satisfied). A 61-bp gap would not be valid in
that case. If trimming options `-3` or `-5` are also used, the `-X`
constraint is applied with respect to the untrimmed mates, not the trimmed
mates.

The larger the difference between `-I` and `-X`, the slower HISAT2 will
run. This is because larger differences between `-I` and `-X` require that
HISAT2 scan a larger window to determine if a concordant alignment exists.
For typical fragment length ranges (200 to 400 nucleotides), HISAT2 is very
efficient.

Default: 500.

    --fr/--rf/--ff

The upstream/downstream mate orientations for a valid paired-end alignment
against the forward reference strand. E.g., if `--fr` is specified and there is
a candidate paired-end alignment where mate 1 appears upstream of the reverse
complement of mate 2 and the fragment length constraints (`-I` and `-X`) are
met, that alignment is valid. Also, if mate 2 appears upstream of the reverse
complement of mate 1 and all other constraints are met, that too is valid.
`--rf` likewise requires that an upstream mate 1 be reverse-complemented and a
downstream mate 2 be forward-oriented. `--ff` requires both an upstream mate 1
and a downstream mate 2 to be forward-oriented. Default: `--fr` (appropriate
for Illumina's Paired-end Sequencing Assay).

    --no-mixed

By default, when `hisat2` cannot find a concordant or discordant alignment for
a pair, it then tries to find alignments for the individual mates. This option
disables that behavior.

    --no-discordant

By default, `hisat2` looks for discordant alignments if it cannot find any
concordant alignments. A discordant alignment is an alignment where both mates
align uniquely, but that does not satisfy the paired-end constraints
(`--fr`/`--rf`/`--ff`, `-I`, `-X`). This option disables that behavior.

#### Output options

    -t/--time

Print the wall-clock time required to load the index files and align the reads.
This is printed to the "standard error" ("stderr") filehandle. Default: off.

    --un <path>
    --un-gz <path>
    --un-bz2 <path>

Write unpaired reads that fail to align to file at `<path>`. These reads
correspond to the SAM records with the FLAGS `0x4` bit set and neither the
`0x40` nor `0x80` bits set. If `--un-gz` is specified, output will be gzip
compressed. If `--un-bz2` is specified, output will be bzip2 compressed. Reads
written in this way will appear exactly as they did in the input file, without
any modification (same sequence, same name, same quality string, same quality
encoding). Reads will not necessarily appear in the same order as they did in
the input.

    --al <path>
    --al-gz <path>
    --al-bz2 <path>

Write unpaired reads that align at least once to file at `<path>`. These reads
correspond to the SAM records with the FLAGS `0x4`, `0x40`, and `0x80` bits
unset. If `--al-gz` is specified, output will be gzip compressed.
If `--al-bz2`
is specified, output will be bzip2 compressed. Reads written in this way will
appear exactly as they did in the input file, without any modification (same
sequence, same name, same quality string, same quality encoding). Reads will
not necessarily appear in the same order as they did in the input.

    --un-conc <path>
    --un-conc-gz <path>
    --un-conc-bz2 <path>

Write paired-end reads that fail to align concordantly to file(s) at `<path>`.
These reads correspond to the SAM records with the FLAGS `0x4` bit set and
either the `0x40` or `0x80` bit set (depending on whether it's mate #1 or #2).
`.1` and `.2` strings are added to the filename to distinguish which file
contains mate #1 and mate #2. If a percent symbol, `%`, is used in `<path>`,
the percent symbol is replaced with `1` or `2` to make the per-mate filenames.
Otherwise, `.1` or `.2` are added before the final dot in `<path>` to make the
per-mate filenames. Reads written in this way will appear exactly as they did
in the input files, without any modification (same sequence, same name, same
quality string, same quality encoding). Reads will not necessarily appear in
the same order as they did in the inputs.

    --al-conc <path>
    --al-conc-gz <path>
    --al-conc-bz2 <path>

Write paired-end reads that align concordantly at least once to file(s) at
`<path>`. These reads correspond to the SAM records with the FLAGS `0x4` bit
unset and either the `0x40` or `0x80` bit set (depending on whether it's mate #1
or #2). `.1` and `.2` strings are added to the filename to distinguish which
file contains mate #1 and mate #2. If a percent symbol, `%`, is used in
`<path>`, the percent symbol is replaced with `1` or `2` to make the per-mate
filenames. Otherwise, `.1` or `.2` are added before the final dot in `<path>` to
make the per-mate filenames.
Reads written in this way will appear exactly as
they did in the input files, without any modification (same sequence, same name,
same quality string, same quality encoding).  Reads will not necessarily appear
in the same order as they did in the inputs.

    --quiet

Print nothing besides alignments and serious errors.

    --summary-file

Print alignment summary to this file.

    --new-summary

Print alignment summary in a new style, which is more machine-friendly.

    --met-file <path>

Write `hisat2` metrics to file `<path>`.  Having alignment metrics can be useful
for debugging certain problems, especially performance issues.  See also:
`--met`.  Default: metrics disabled.

    --met-stderr

Write `hisat2` metrics to the "standard error" ("stderr") filehandle.  This is
not mutually exclusive with `--met-file`.  Having alignment metrics can be
useful for debugging certain problems, especially performance issues.  See also:
`--met`.  Default: metrics disabled.

    --met <int>

Write a new `hisat2` metrics record every `<int>` seconds.  Only matters if
either `--met-stderr` or `--met-file` is specified.  Default: 1.

#### SAM options

    --no-unal

Suppress SAM records for reads that failed to align.

    --no-hd

Suppress SAM header lines (starting with `@`).

    --no-sq

Suppress `@SQ` SAM header lines.

    --rg-id <text>

Set the read group ID to `<text>`.  This causes the SAM `@RG` header line to be
printed, with `<text>` as the value associated with the `ID:` tag.  It also
causes the `RG:Z:` extra field to be attached to each SAM output record, with
value set to `<text>`.

    --rg <text>

Add `<text>` (usually of the form `TAG:VAL`, e.g. `SM:Pool1`) as a field on the
`@RG` header line.  Note: in order for the `@RG` line to appear, `--rg-id`
must also be specified.
This is because the `ID` tag is required by the [SAM 766Spec][SAM]. Specify `--rg` multiple times to set multiple fields. See the 767[SAM Spec][SAM] for details about what fields are legal. 768 769 --remove-chrname 770 771Remove 'chr' from reference names in alignment (e.g., chr18 to 18) 772 773 --add-chrname 774 775Add 'chr' to reference names in alignment (e.g., 18 to chr18) 776 777 --omit-sec-seq 778 779When printing secondary alignments, HISAT2 by default will write out the `SEQ` 780and `QUAL` strings. Specifying this option causes HISAT2 to print an asterisk 781in those fields instead. 782 783#### Performance options 784 785 -o/--offrate <int> 786 787Override the offrate of the index with `<int>`. If `<int>` is greater 788than the offrate used to build the index, then some row markings are 789discarded when the index is read into memory. This reduces the memory 790footprint of the aligner but requires more time to calculate text 791offsets. `<int>` must be greater than the value used to build the 792index. 793 794 -p/--threads NTHREADS 795 796Launch `NTHREADS` parallel search threads (default: 1). Threads will run on 797separate processors/cores and synchronize when parsing reads and outputting 798alignments. Searching for alignments is highly parallel, and speedup is close 799to linear. Increasing `-p` increases HISAT2's memory footprint. E.g. when 800aligning to a human genome index, increasing `-p` from 1 to 8 increases the 801memory footprint by a few hundred megabytes. This option is only available if 802`hisat2` is linked with the `pthreads` library (i.e. if `HISAT2_PTHREADS=0` is 803not specified at build time). 804 805 --reorder 806 807Guarantees that output SAM records are printed in an order corresponding to the 808order of the reads in the original input file, even when `-p` is set greater 809than 1. 
Specifying `--reorder` and setting `-p` greater than 1 causes HISAT2
to run somewhat slower and use somewhat more memory than if `--reorder` were
not specified.  Has no effect if `-p` is set to 1, since output order will
naturally correspond to input order in that case.

    --mm

Use memory-mapped I/O to load the index, rather than typical file I/O.
Memory-mapping allows many concurrent `hisat2` processes on the same computer to
share the same memory image of the index (i.e. you pay the memory overhead just
once).  This facilitates memory-efficient parallelization of `hisat2` in
situations where using `-p` is not possible or not preferable.

#### Other options

    --qc-filter

Filter out reads for which the QSEQ filter field is non-zero.  Only has an
effect when read format is `--qseq`.  Default: off.

    --seed <int>

Use `<int>` as the seed for pseudo-random number generator.  Default: 0.

    --non-deterministic

Normally, HISAT2 re-initializes its pseudo-random generator for each read.  It
seeds the generator with a number derived from (a) the read name, (b) the
nucleotide sequence, (c) the quality sequence, (d) the value of the `--seed`
option.  This means that if two reads are identical (same name, same
nucleotides, same qualities) HISAT2 will find and report the same alignment(s)
for both, even if there was ambiguity.  When `--non-deterministic` is specified,
HISAT2 re-initializes its pseudo-random generator for each read using the
current time.  This means that HISAT2 will not necessarily report the same
alignment for two identical reads.  This is counter-intuitive for some users,
but might be more appropriate in situations where the input consists of many
identical reads.

    --version

Print version information and quit.

    -h/--help

Print usage information and quit.
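
The per-mate naming rule described for `--un-conc` and `--al-conc` under "Output options" can be illustrated with a short Python sketch. This is not HISAT2 code; it only restates the rule as written in this manual. The function name `per_mate_paths` is hypothetical, and two details are assumptions because the manual does not spell them out: every `%` in the path is replaced, and a file name with no dot simply gets `.1`/`.2` appended.

```python
import os


def per_mate_paths(path):
    """Return the (mate 1, mate 2) file names derived from one <path> value."""
    if "%" in path:
        # A percent symbol is replaced with the mate number
        # (assumption: all occurrences are replaced).
        return path.replace("%", "1"), path.replace("%", "2")

    directory, name = os.path.split(path)
    stem, dot, ext = name.rpartition(".")
    if not dot:
        # Assumption: with no dot in the file name, the suffix is appended.
        return path + ".1", path + ".2"

    # Otherwise .1 / .2 go before the final dot of the file name.
    return (
        os.path.join(directory, f"{stem}.1.{ext}"),
        os.path.join(directory, f"{stem}.2.{ext}"),
    )


if __name__ == "__main__":
    for example in ("unconc.fq", "unconc_mate%.fq", "unconc"):
        print(example, "->", per_mate_paths(example))
```

Under these assumptions, `--un-conc unconc.fq` would correspond to `unconc.1.fq` and `unconc.2.fq`, while `--un-conc unconc_mate%.fq` would correspond to `unconc_mate1.fq` and `unconc_mate2.fq`.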
854 855SAM output 856---------- 857 858Following is a brief description of the [SAM] format as output by `hisat2`. 859For more details, see the [SAM format specification][SAM]. 860 861By default, `hisat2` prints a SAM header with `@HD`, `@SQ` and `@PG` lines. 862When one or more `--rg` arguments are specified, `hisat2` will also print 863an `@RG` line that includes all user-specified `--rg` tokens separated by 864tabs. 865 866Each subsequent line describes an alignment or, if the read failed to align, a 867read. Each line is a collection of at least 12 fields separated by tabs; from 868left to right, the fields are: 869 8701. Name of read that aligned. 871 872 Note that the [SAM specification] disallows whitespace in the read name. 873 If the read name contains any whitespace characters, HISAT2 will truncate 874 the name at the first whitespace character. This is similar to the 875 behavior of other tools. 876 8772. Sum of all applicable flags. Flags relevant to HISAT2 are: 878 879 1 880 881 The read is one of a pair 882 883 2 884 885 The alignment is one end of a proper paired-end alignment 886 887 4 888 889 The read has no reported alignments 890 891 8 892 893 The read is one of a pair and has no reported alignments 894 895 16 896 897 The alignment is to the reverse reference strand 898 899 32 900 901 The other mate in the paired-end alignment is aligned to the 902 reverse reference strand 903 904 64 905 906 The read is mate 1 in a pair 907 908 128 909 910 The read is mate 2 in a pair 911 912 Thus, an unpaired read that aligns to the reverse reference strand 913 will have flag 16. A paired-end read that aligns and is the first 914 mate in the pair will have flag 83 (= 64 + 16 + 2 + 1). 915 9163. Name of reference sequence where alignment occurs 917 9184. 1-based offset into the forward reference strand where leftmost 919 character of the alignment occurs 920 9215. Mapping quality 922 9236. CIGAR string representation of alignment 924 9257. 
Name of reference sequence where mate's alignment occurs.  Set to `=` if the
mate's reference sequence is the same as this alignment's, or `*` if there is no
mate.

8.  1-based offset into the forward reference strand where leftmost character of
the mate's alignment occurs.  Offset is 0 if there is no mate.

9.  Inferred fragment length.  Size is negative if the mate's alignment occurs
upstream of this alignment.  Size is 0 if the mates did not align concordantly.
However, size is non-0 if the mates aligned discordantly to the same
chromosome.

10. Read sequence (reverse-complemented if aligned to the reverse strand)

11. ASCII-encoded read qualities (reversed if the read aligned to
the reverse strand).  The encoded quality values are on the [Phred quality]
scale and the encoding is ASCII-offset by 33 (ASCII char `!`), similarly to a
[FASTQ] file.

12. Optional fields.  Fields are tab-separated.  `hisat2` outputs zero or more
of these optional fields for each alignment, depending on the type of the
alignment:

        AS:i:<N>

    Alignment score.  Can be negative.  Only present if SAM record is for
    an aligned read.

        ZS:i:<N>

    Alignment score for the best-scoring alignment found other than the
    alignment reported.  Can be negative.  Only present if the SAM record is
    for an aligned read and more than one alignment was found for the read.
    Note that, when the read is part of a concordantly-aligned pair, this score
    could be greater than `AS:i`.

        YS:i:<N>

    Alignment score for opposite mate in the paired-end alignment.  Only present
    if the SAM record is for a read that aligned as part of a paired-end
    alignment.

        XN:i:<N>

    The number of ambiguous bases in the reference covering this alignment.
    Only present if SAM record is for an aligned read.

        XM:i:<N>

    The number of mismatches in the alignment.
    Only present if SAM record is
    for an aligned read.

        XO:i:<N>

    The number of gap opens, for both read and reference gaps, in the alignment.
    Only present if SAM record is for an aligned read.

        XG:i:<N>

    The number of gap extensions, for both read and reference gaps, in the
    alignment.  Only present if SAM record is for an aligned read.

        NM:i:<N>

    The edit distance; that is, the minimal number of one-nucleotide edits
    (substitutions, insertions and deletions) needed to transform the read
    string into the reference string.  Only present if SAM record is for an
    aligned read.

        YF:Z:<S>

    String indicating reason why the read was filtered out.  See also:
    [Filtering].  Only appears for reads that were filtered out.

        YT:Z:<S>

    Value of `UU` indicates the read was not part of a pair.  Value of `CP`
    indicates the read was part of a pair and the pair aligned concordantly.
    Value of `DP` indicates the read was part of a pair and the pair aligned
    discordantly.  Value of `UP` indicates the read was part of a pair but the
    pair failed to align either concordantly or discordantly.

        MD:Z:<S>

    A string representation of the mismatched reference bases in the alignment.
    See [SAM] format specification for details.  Only present if SAM record is
    for an aligned read.

        XS:A:<A>

    Values of `+` and `-` indicate the read is mapped to transcripts on sense and anti-sense
    strands, respectively.  Spliced alignments need to have this field, which is required by Cufflinks and StringTie.
    HISAT2 reports this field for canonical splice sites (GT/AG), but not for non-canonical splice sites.
    You can direct HISAT2 not to output alignments involving non-canonical splice sites using `--pen-noncansplice 1000000`.

        NH:i:<N>

    The number of mapped locations for the read or the pair.

        Zs:Z:<S>

    When the alignment of a read involves SNPs that are in the index, this field indicates where exactly the read involves the SNPs.
    This optional field is similar to the above MD:Z field.
    For example, `Zs:Z:1|S|rs3747203,97|S|rs16990981` indicates that the second base of the read corresponds to a known SNP (ID: rs3747203).
    After 97 more bases (the 3rd through the 99th), the 100th base of the read involves another known SNP (ID: rs16990981).
    'S' indicates a single nucleotide polymorphism.  'D' and 'I' indicate a deletion and an insertion, respectively.

[SAM format specification]: http://samtools.sf.net/SAM1.pdf
[FASTQ]: http://en.wikipedia.org/wiki/FASTQ_format

The `hisat2-build` indexer
===========================

`hisat2-build` builds a HISAT2 index from a set of DNA sequences.
`hisat2-build` outputs a set of 8 files with suffixes `.1.ht2`, `.2.ht2`,
`.3.ht2`, `.4.ht2`, `.5.ht2`, `.6.ht2`, `.7.ht2`, and `.8.ht2`.  In the case of a large
index these suffixes will have a `ht2l` termination.  These files together
constitute the index: they are all that is needed to align reads to that
reference.  The original sequence FASTA files are no longer used by HISAT2
once the index is built.

Use of Karkkainen's [blockwise algorithm] allows `hisat2-build` to trade off
between running time and memory usage.  `hisat2-build` has three options
governing how it makes this trade: `-p`/`--packed`, `--bmax`/`--bmaxdivn`,
and `--dcv`.  By default, `hisat2-build` will automatically search for the
settings that yield the best running time without exhausting memory.  This
behavior can be disabled using the `-a`/`--noauto` option.

The indexer provides options pertaining to the "shape" of the index, e.g.
1054`--offrate` governs the fraction of [Burrows-Wheeler] 1055rows that are "marked" (i.e., the density of the suffix-array sample; see the 1056original [FM Index] paper for details). All of these options are potentially 1057profitable trade-offs depending on the application. They have been set to 1058defaults that are reasonable for most cases according to our experiments. See 1059[Performance tuning] for details. 1060 1061`hisat2-build` can generate either [small or large indexes]. The wrapper 1062will decide which based on the length of the input genome. If the reference 1063does not exceed 4 billion characters but a large index is preferred, the user 1064can specify `--large-index` to force `hisat2-build` to build a large index 1065instead. 1066 1067The HISAT2 index is based on the [FM Index] of Ferragina and Manzini, which in 1068turn is based on the [Burrows-Wheeler] transform. The algorithm used to build 1069the index is based on the [blockwise algorithm] of Karkkainen. 1070 1071[Blockwise algorithm]: http://portal.acm.org/citation.cfm?id=1314852 1072[Burrows-Wheeler]: http://en.wikipedia.org/wiki/Burrows-Wheeler_transform 1073 1074Command Line 1075------------ 1076 1077Usage: 1078 1079 hisat2-build [options]* <reference_in> <ht2_base> 1080 1081### Notes 1082 If you use --snp, --ss, and/or --exon, hisat2-build will need about 200GB RAM for the human genome size as index building involves a graph construction. 1083 Otherwise, you will be able to build an index on your desktop with 8GB RAM. 1084 1085### Main arguments 1086 1087A comma-separated list of FASTA files containing the reference sequences to be 1088aligned to, or, if `-c` is specified, the sequences 1089themselves. E.g., `<reference_in>` might be `chr1.fa,chr2.fa,chrX.fa,chrY.fa`, 1090or, if `-c` is specified, this might be 1091`GGTCATCCT,ACGGGTCGT,CCGTTCTATGCGGCTTA`. 1092 1093The basename of the index files to write. 
By default, `hisat2-build` writes 1094files named `NAME.1.ht2`, `NAME.2.ht2`, `NAME.3.ht2`, `NAME.4.ht2`, 1095`NAME.5.ht2`, `NAME.6.ht2`, `NAME.7.ht2`, and `NAME.8.ht2` where `NAME` is `<ht2_base>`. 1096 1097### Options 1098 1099 -f 1100 1101The reference input files (specified as `<reference_in>`) are FASTA files 1102(usually having extension `.fa`, `.mfa`, `.fna` or similar). 1103 1104 -c 1105 1106The reference sequences are given on the command line. I.e. `<reference_in>` is 1107a comma-separated list of sequences rather than a list of FASTA files. 1108 1109 --large-index 1110 1111Force `hisat2-build` to build a [large index], even if the reference is less 1112than ~ 4 billion nucleotides long. 1113 1114 -a/--noauto 1115 1116Disable the default behavior whereby `hisat2-build` automatically selects 1117values for the `--bmax`, `--dcv` and `--packed` parameters according to 1118available memory. Instead, user may specify values for those parameters. If 1119memory is exhausted during indexing, an error message will be printed; it is up 1120to the user to try new parameters. 1121 1122 --bmax <int> 1123 1124The maximum number of suffixes allowed in a block. Allowing more suffixes per 1125block makes indexing faster, but increases peak memory usage. Setting this 1126option overrides any previous setting for `--bmax`, or `--bmaxdivn`. 1127Default (in terms of the `--bmaxdivn` parameter) is `--bmaxdivn` 4. This is 1128configured automatically by default; use `-a`/`--noauto` to configure manually. 1129 1130 --bmaxdivn <int> 1131 1132The maximum number of suffixes allowed in a block, expressed as a fraction of 1133the length of the reference. Setting this option overrides any previous setting 1134for `--bmax`, or `--bmaxdivn`. Default: `--bmaxdivn` 4. This is 1135configured automatically by default; use `-a`/`--noauto` to configure manually. 1136 1137 --dcv <int> 1138 1139Use `<int>` as the period for the difference-cover sample. 
A larger period 1140yields less memory overhead, but may make suffix sorting slower, especially if 1141repeats are present. Must be a power of 2 no greater than 4096. Default: 1024. 1142 This is configured automatically by default; use `-a`/`--noauto` to configure 1143manually. 1144 1145 --nodc 1146 1147Disable use of the difference-cover sample. Suffix sorting becomes 1148quadratic-time in the worst case (where the worst case is an extremely 1149repetitive reference). Default: off. 1150 1151 -r/--noref 1152 1153Do not build the `NAME.3.ht2` and `NAME.4.ht2` portions of the index, which 1154contain a bitpacked version of the reference sequences and are used for 1155paired-end alignment. 1156 1157 -3/--justref 1158 1159Build only the `NAME.3.ht2` and `NAME.4.ht2` portions of the index, which 1160contain a bitpacked version of the reference sequences and are used for 1161paired-end alignment. 1162 1163 -o/--offrate <int> 1164 1165To map alignments back to positions on the reference sequences, it's necessary 1166to annotate ("mark") some or all of the [Burrows-Wheeler] rows with their 1167corresponding location on the genome. 1168`-o`/`--offrate` governs how many rows get marked: 1169the indexer will mark every 2^`<int>` rows. Marking more rows makes 1170reference-position lookups faster, but requires more memory to hold the 1171annotations at runtime. The default is 4 (every 16th row is marked; for human 1172genome, annotations occupy about 680 megabytes). 1173 1174 -t/--ftabchars <int> 1175 1176The ftab is the lookup table used to calculate an initial [Burrows-Wheeler] 1177range with respect to the first `<int>` characters of the query. A larger 1178`<int>` yields a larger lookup table but faster query times. The ftab has size 11794^(`<int>`+1) bytes. The default setting is 10 (ftab is 4MB). 1180 1181 --localoffrate <int> 1182 1183This option governs how many rows get marked in a local index: 1184the indexer will mark every 2^`<int>` rows. 
Marking more rows makes
reference-position lookups faster, but requires more memory to hold the
annotations at runtime.  The default is 3 (every 8th row is marked;
this occupies about 16KB per local index).

    --localftabchars <int>

The local ftab is the lookup table in a local index.
The default setting is 6 (ftab is 8KB per local index).

    -p <int>

Launch `<int>` parallel build threads (default: 1).

    --snp <path>

Provide a list of SNPs (in HISAT2's own format) as follows (five columns).

SNP ID `<tab>` SNP type (single, deletion, or insertion) `<tab>` chromosome name `<tab>` zero-based genomic position of the SNP `<tab>` alternative base (single), the length of the SNP (deletion), or insertion sequence (insertion)

For example,

    rs58784443 single 13 18447947 T

Use `hisat2_extract_snps_haplotypes_UCSC.py` (in the HISAT2 package) to extract SNPs and haplotypes from a dbSNP file (e.g. http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/snp144Common.txt.gz),
or `hisat2_extract_snps_haplotypes_VCF.py` to extract SNPs and haplotypes from a VCF file (e.g. ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chr22.phase3_shapeit2_mvncall_integrated_v3plus_nounphased.rsID.genotypes.GRCh38_dbSNP_no_SVs.vcf.gz).

    --haplotype <path>

Provide a list of haplotypes (in HISAT2's own format) as follows (five columns).

Haplotype ID `<tab>` chromosome name `<tab>` zero-based left coordinate of the haplotype `<tab>` zero-based right coordinate of the haplotype `<tab>` a comma-separated list of SNP IDs in the haplotype

For example,

    ht35 13 18446877 18446945 rs12381094,rs12381056,rs192016659,rs538569910

See the above option, `--snp`, about how to extract haplotypes.
This option is not required, but haplotype information can keep the index construction from exploding and reduce the index size substantially. 1220 1221 --ss <path> 1222 1223Note this option should be used with the following --exon option. 1224Provide a list of splice sites (in the HISAT2's own format) as follows (four columns). 1225 1226 chromosome name `<tab>` zero-offset based genomic position of the flanking base on the left side of an intron `<tab>` zero-offset based genomic position of the flanking base on the right `<tab>` strand 1227 1228Use `hisat2_extract_splice_sites.py` (in the HISAT2 package) to extract splice sites from a GTF file. 1229 1230 --exon <path> 1231 1232Note this option should be used with the above --ss option. 1233Provide a list of exons (in the HISAT2's own format) as follows (three columns). 1234 1235 chromosome name `<tab>` zero-offset based left genomic position of an exon `<tab>` zero-offset based right genomic position of an exon 1236 1237Use `hisat2_extract_exons.py` (in the HISAT2 package) to extract exons from a GTF file. 1238 1239 --seed <int> 1240 1241Use `<int>` as the seed for pseudo-random number generator. 1242 1243 --cutoff <int> 1244 1245Index only the first `<int>` bases of the reference sequences (cumulative across 1246sequences) and ignore the rest. 1247 1248 -q/--quiet 1249 1250`hisat2-build` is verbose by default. With this option `hisat2-build` will 1251print only error messages. 1252 1253 -h/--help 1254 1255Print usage information and quit. 1256 1257 --version 1258 1259Print version information and quit. 1260 1261The `hisat2-inspect` index inspector 1262===================================== 1263 1264`hisat2-inspect` extracts information from a HISAT2 index about what kind of 1265index it is and what reference sequences were used to build it. 
When run without
any options, the tool will output a FASTA file containing the sequences of the
original references (with all non-`A`/`C`/`G`/`T` characters converted to `N`s).
It can also be used to extract just the reference sequence names using the
`-n`/`--names` option or a more verbose summary using the `-s`/`--summary`
option.

Command Line
------------

Usage:

    hisat2-inspect [options]* <ht2_base>

### Main arguments

The basename of the index to be inspected.  The basename is the name of any of the
index files but with the `.X.ht2` suffix omitted.
`hisat2-inspect` first looks in the current directory for the index files, then
in the directory specified in the `HISAT2_INDEXES` environment variable.

### Options

    -a/--across <int>

When printing FASTA output, output a newline character every `<int>` bases
(default: 60).

    -n/--names

Print reference sequence names, one per line, and quit.

    -s/--summary

Print a summary that includes information about index settings, as well as the
names and lengths of the input sequences.  The summary has this format:

    Colorspace  <0 or 1>
    SA-Sample   1 in <sample>
    FTab-Chars  <chars>
    Sequence-1  <name>  <len>
    Sequence-2  <name>  <len>
    ...
    Sequence-N  <name>  <len>

Fields are separated by tabs.  Colorspace is always set to 0 for HISAT2.

    --snp

Print SNPs, and quit.

    --ss

Print splice sites, and quit.

    --ss-all

Print splice sites including those not in the global index, and quit.

    --exon

Print exons, and quit.

    -v/--verbose

Print verbose output (for debugging).

    --version

Print version information and quit.

    -h/--help

Print usage information and quit.
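
The tab-separated summary printed by `-s`/`--summary` is easy to post-process. The following Python sketch parses text in exactly the format documented above. It is an illustration only: the function name `parse_inspect_summary` and the sample values are hypothetical, and real `hisat2-inspect` output may contain additional lines, which the sketch stores as plain key/value pairs.

```python
def parse_inspect_summary(text):
    """Parse summary text (tab-separated fields) into a dictionary.

    Sequence-N lines are collected under the "sequences" key as
    (name, length) tuples; every other line is stored as key -> value.
    """
    summary = {"sequences": []}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        key = fields[0]
        if key.startswith("Sequence-") and len(fields) >= 3:
            summary["sequences"].append((fields[1], int(fields[2])))
        else:
            summary[key] = fields[1]
    return summary


if __name__ == "__main__":
    # Hypothetical sample text in the documented format.
    sample = (
        "Colorspace\t0\n"
        "SA-Sample\t1 in 16\n"
        "FTab-Chars\t10\n"
        "Sequence-1\t22:20000001-21000000\t1000000\n"
    )
    print(parse_inspect_summary(sample))
```

In practice the text would come from capturing the standard output of `hisat2-inspect -s <ht2_base>`, for example with Python's `subprocess.run(..., capture_output=True, text=True)`.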

Getting started with HISAT2
===================================================

HISAT2 comes with some example files to get you started.  The example files
are not scientifically significant; these files will simply let you start running HISAT2 and
downstream tools right away.

First follow the manual instructions to [obtain HISAT2].  Set the `HISAT2_HOME`
environment variable to point to the new HISAT2 directory containing the
`hisat2`, `hisat2-build` and `hisat2-inspect` binaries.  This is important,
as the `HISAT2_HOME` variable is used in the commands below to refer to that
directory.

Indexing a reference genome
---------------------------

To create an index for the genomic region (1 million bp from human chromosome 22, between positions 20,000,000 and 20,999,999)
included with HISAT2, create a new temporary directory (it doesn't matter where), change into that directory, and run:

    $HISAT2_HOME/hisat2-build $HISAT2_HOME/example/reference/22_20-21M.fa --snp $HISAT2_HOME/example/reference/22_20-21M.snp 22_20-21M_snp

The command should print many lines of output then quit.  When the command
completes, the current directory will contain eight new files that all start with
`22_20-21M_snp` and end with `.1.ht2`, `.2.ht2`, `.3.ht2`, `.4.ht2`, `.5.ht2`, `.6.ht2`,
`.7.ht2`, and `.8.ht2`.  These files constitute the index: you're done!

You can use `hisat2-build` to create an index for a set of FASTA files obtained
from any source, including sites such as [UCSC], [NCBI], and [Ensembl].  When
indexing multiple FASTA files, specify all the files using commas to separate
file names.  For more details on how to create an index with `hisat2-build`,
see the [manual section on index building].  You may also want to bypass this
process by obtaining a pre-built index.
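
Before moving on, it can be handy to confirm that the index is complete. The Python sketch below lists the file names that should exist for a given basename, based only on the suffixes `.1.ht2` through `.8.ht2` named in this manual (with `ht2l` for a large index). The helper names `expected_index_files` and `missing_index_files` are hypothetical and not part of HISAT2.

```python
from pathlib import Path


def expected_index_files(ht2_base, large=False):
    """Return the eight index file names for a basename.

    Small indexes end in .ht2; large indexes end in .ht2l.
    """
    suffix = "ht2l" if large else "ht2"
    return [f"{ht2_base}.{i}.{suffix}" for i in range(1, 9)]


def missing_index_files(ht2_base, large=False):
    """Return the expected index files that are not present on disk."""
    return [
        name
        for name in expected_index_files(ht2_base, large)
        if not Path(name).is_file()
    ]


if __name__ == "__main__":
    missing = missing_index_files("22_20-21M_snp")
    if missing:
        print("Index incomplete, missing:", ", ".join(missing))
    else:
        print("All eight index files are present.")
```

Run from the temporary directory used for the build, the script reports any of the eight `22_20-21M_snp.*.ht2` files that are absent.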
1372 1373[UCSC]: http://genome.ucsc.edu/cgi-bin/hgGateway 1374[NCBI]: http://www.ncbi.nlm.nih.gov/sites/genome 1375[Ensembl]: http://www.ensembl.org/ 1376 1377Aligning example reads 1378---------------------- 1379 1380Stay in the directory created in the previous step, which now contains the 1381`22_20-21M` index files. Next, run: 1382 1383 $HISAT2_HOME/hisat2 -f -x $HISAT2_HOME/example/index/22_20-21M_snp -U $HISAT2_HOME/example/reads/reads_1.fa -S eg1.sam 1384 1385This runs the HISAT2 aligner, which aligns a set of unpaired reads to the 1386genome region using the index generated in the previous step. 1387The alignment results in SAM format are written to the file `eg1.sam`, and a 1388short alignment summary is written to the console. (Actually, the summary is 1389written to the "standard error" or "stderr" filehandle, which is typically 1390printed to the console.) 1391 1392To see the first few lines of the SAM output, run: 1393 1394 head eg1.sam 1395 1396You will see something like this: 1397 1398 @HD VN:1.0 SO:unsorted 1399 @SQ SN:22:20000001-21000000 LN:1000000 1400 @PG ID:hisat2 PN:hisat2 VN:2.0.0-beta 1401 1 0 22:20000001-21000000 397984 255 100M * 0 0 GCCTGTGAGGGAGCCCCGGACCCGGTCAGAGCAGGAGCCTGGCCTGGGGCCAAGTTCACCTTATGGACTCTCTTCCCTGCCCTTCCAGGAGCAGCTCACT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:100 YT:Z:UU NH:i:1 1402 2 16 22:20000001-21000000 398131 255 100M * 0 0 ATGACACACTGTACACACCAGGGGCCCTGTGCTCCCCAGGAAGAGGGCCCTCACTTGAAGCGGGGCCCGATGGCCGCCACGTGCCGGTTCATGCTCCCCT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:80A19 YT:Z:UU NH:i:1 Zs:Z:80|S|rs576159895 1403 3 16 22:20000001-21000000 398222 255 100M * 0 0 TGCTCCCCTTGGCCCCGCCGATGTTCAGGGACATGGAGCGCTGCAGCAGGCTGGAGAAGATCTCCACTTGGTCAGAGCTGCAGTACTTGGCGATCTCAAA 
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:16A83 YT:Z:UU NH:i:1 Zs:Z:16|S|rs2629364 1404 4 16 22:20000001-21000000 398247 255 90M200N10M * 0 0 CAGGGACATGGAGCGCTGCAGCAGGCTGGAGAAGATCTCCACTTGGTCAGAGCTGCAGTACTTGGCGATCTCAAACCGCTGCACCAGGAAGTCGATCCAG IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:100 YT:Z:UU XS:A:- NH:i:1 1405 5 16 22:20000001-21000000 398194 255 100M * 0 0 GGCCCGATGGCCGCCACGTGCCGGTTCATGCTCCCCTTGGCCCCGCCGATGTTCAGGGACATGGAGCGCTGCAGCAGGCTGGAGAAGATCTCCACTTGGT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:17A26A55 YT:Z:UU NH:i:1 Zs:Z:17|S|rs576159895,26|S|rs2629364 1406 6 0 22:20000001-21000000 398069 255 100M * 0 0 CAGGAGCAGCTCACTGAAATGTGTTCCCCGTCTACAGAAGTACCGTGATACACAGACGCCCCATGACACACTGTACACACCAGGGGCCCTGTGCTCCCCA IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:100 YT:Z:UU NH:i:1 1407 7 0 22:20000001-21000000 397896 255 100M * 0 0 GTGGAGTAGATCTTCTCGCGAAGCACATTGCAGATGGTTGCATTTGGAACCACATCGGCATGCAGGAGGGACAGCCCCAGGGTCAGCAGCCTGTGAGGGA IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:31G68 YT:Z:UU NH:i:1 Zs:Z:31|S|rs562662261 1408 8 0 22:20000001-21000000 398150 255 100M * 0 0 AGGGGCCCTGTGCTCCCCAGGAAGAGGGCCCTCACTTGAAGCGGGGCCCGATGGCCGCCACGTGCCGGTTCATGCTCCCCTTGGCCCCGCCGATGTTCAG IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:61A26A11 YT:Z:UU NH:i:1 Zs:Z:61|S|rs576159895,26|S|rs2629364 1409 9 16 22:20000001-21000000 398329 255 8M200N92M * 0 0 
ACCAGGAAGTCGATCCAGATGTAGTGGGGGGTCACTTCGGGGGGACAGGGTTTGGGTTGACTTGCTTCCGAGGCAGCCAGGGGGTCTGCTTCCTTTATCT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:100 YT:Z:UU XS:A:- NH:i:1 1410 10 16 22:20000001-21000000 398184 255 100M * 0 0 CTTGAAGCGGGGCCCGATGGCCGCCACGTGCCGGTTCATGCTCCCCTTGGCCCCGCCGATGTTCAGGGACATGGAGCGCTGCAGCAGGCTGGAGAAGATC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:27A26A45 YT:Z:UU NH:i:1 Zs:Z:27|S|rs576159895,26|S|rs2629364 1411 1412The first few lines (beginning with `@`) are SAM header lines, and the rest of 1413the lines are SAM alignments, one line per read or mate. See the [HISAT2 1414manual section on SAM output] and the [SAM specification] for details about how 1415to interpret the SAM file format. 1416 1417Paired-end example 1418------------------ 1419 1420To align paired-end reads included with HISAT2, stay in the same directory and 1421run: 1422 1423 $HISAT2_HOME/hisat2 -f -x $HISAT2_HOME/example/index/22_20-21M_snp -1 $HISAT2_HOME/example/reads/reads_1.fa -2 $HISAT2_HOME/example/reads/reads_2.fa -S eg2.sam 1424 1425This aligns a set of paired-end reads to the reference genome, with results 1426written to the file `eg2.sam`. 1427 1428Using SAMtools/BCFtools downstream 1429---------------------------------- 1430 1431[SAMtools] is a collection of tools for manipulating and analyzing SAM and BAM 1432alignment files. [BCFtools] is a collection of tools for calling variants and 1433manipulating VCF and BCF files, and it is typically distributed with [SAMtools]. 1434Using these tools together allows you to get from alignments in SAM format to 1435variant calls in VCF format. This example assumes that `samtools` and 1436`bcftools` are installed and that the directories containing these binaries are 1437in your [PATH environment variable]. 

Run the paired-end example:

    $HISAT2_HOME/hisat2 -f -x $HISAT2_HOME/example/index/22_20-21M_snp -1 $HISAT2_HOME/example/reads/reads_1.fa -2 $HISAT2_HOME/example/reads/reads_2.fa -S eg2.sam

Use `samtools view` to convert the SAM file into a BAM file.  BAM is the
binary format corresponding to the SAM text format.  Run:

    samtools view -bS eg2.sam > eg2.bam

Use `samtools sort` to convert the BAM file to a sorted BAM file.  The following command requires samtools version 1.2 or higher.

    samtools sort eg2.bam -o eg2.sorted.bam

We now have a sorted BAM file called `eg2.sorted.bam`.  Sorted BAM is a useful
format because the alignments are (a) compressed, which is convenient for
long-term storage, and (b) sorted, which is convenient for variant discovery.
To generate variant calls in VCF format, run:

    samtools mpileup -uf $HISAT2_HOME/example/reference/22_20-21M.fa eg2.sorted.bam | bcftools view -bvcg - > eg2.raw.bcf

Then to view the variants, run:

    bcftools view eg2.raw.bcf

The `bcftools view -bvcg` invocation above follows the legacy BCFtools syntax
bundled with SAMtools 0.1.x; BCFtools 1.x moved variant calling to
`bcftools call`.  See the official SAMtools guide to
[Calling SNPs/INDELs with SAMtools/BCFtools]
for more details and variations on this process.

[BCFtools]: http://samtools.sourceforge.net/mpileup.shtml
[Calling SNPs/INDELs with SAMtools/BCFtools]: http://samtools.sourceforge.net/mpileup.shtml
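
As a supplement to the SAM output description, the flag sums and `Zs:Z:` values seen in files such as `eg1.sam` can be decoded with a few lines of Python. This is a minimal sketch, not part of HISAT2: the function names are hypothetical, the flag meanings are copied from the flag list in this manual, and the read-position arithmetic assumes that every `Zs` entry is of type `S` (it is not meant for `D` or `I` entries).

```python
# Flag bits and meanings as listed in the "SAM output" section.
SAM_FLAGS = {
    1: "read is one of a pair",
    2: "alignment is one end of a proper paired-end alignment",
    4: "read has no reported alignments",
    8: "read is one of a pair and has no reported alignments",
    16: "alignment is to the reverse reference strand",
    32: "other mate is aligned to the reverse reference strand",
    64: "read is mate 1 in a pair",
    128: "read is mate 2 in a pair",
}


def decode_flag(flag):
    """Return the individual flag bits that are set in a SAM FLAG value."""
    return [bit for bit in SAM_FLAGS if flag & bit]


def parse_zs(value):
    """Split a Zs value such as '1|S|rs3747203,97|S|rs16990981'.

    Returns a list of (bases_skipped, type, snp_id) tuples.
    """
    entries = []
    for item in value.split(","):
        skipped, kind, snp_id = item.split("|")
        entries.append((int(skipped), kind, snp_id))
    return entries


def snp_read_positions(value):
    """Return 1-based read positions of SNPs (assumes only 'S' entries)."""
    positions = []
    position = 0
    for skipped, _kind, _snp_id in parse_zs(value):
        position += skipped + 1  # skip the matching bases, then land on the SNP
        positions.append(position)
    return positions


if __name__ == "__main__":
    for bit in decode_flag(83):
        print(bit, SAM_FLAGS[bit])
    print(snp_read_positions("1|S|rs3747203,97|S|rs16990981"))
```

For flag 83 the sketch reports bits 1, 2, 16 and 64, matching the worked sum 64 + 16 + 2 + 1 in the flag description, and for the `Zs` example from the manual it yields read positions 2 and 100.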