Introduction

Bowtie 2 is an ultrafast and memory-efficient tool for aligning
sequencing reads to long reference sequences. It is particularly good at
aligning reads of about 50 up to 100s of characters to relatively long
(e.g. mammalian) genomes. Bowtie 2 indexes the genome with an FM Index
(based on the Burrows-Wheeler Transform or BWT) to keep its memory
footprint small: for the human genome, its memory footprint is typically
around 3.2 gigabytes of RAM. Bowtie 2 supports gapped, local, and
paired-end alignment modes. Multiple processors can be used
simultaneously to achieve greater alignment speed.

Bowtie 2 outputs alignments in SAM format, enabling interoperation with
a large number of other tools (e.g. SAMtools, GATK) that use SAM. Bowtie
2 is distributed under the GPLv3 license, and it runs on the command
line under Windows, Mac OS X, Linux and BSD.

Bowtie 2 is often the first step in pipelines for comparative genomics,
including variation calling, ChIP-seq, RNA-seq and BS-seq. Bowtie 2 and
Bowtie (also called "Bowtie 1" here) are also tightly integrated into
many other tools, some of which are listed here.

If you use Bowtie 2 for your published research, please cite our work.
Papers describing Bowtie 2 are:

- Langmead B, Wilks C, Antonescu V, Charles R. Scaling read aligners
  to hundreds of threads on general-purpose processors.
  Bioinformatics. 2018 Jul 18. doi: 10.1093/bioinformatics/bty648.

- Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2.
  Nature Methods. 2012 Mar 4;9(4):357-9. doi: 10.1038/nmeth.1923.

How is Bowtie 2 different from Bowtie 1?

Bowtie 1 was released in 2009 and was geared toward aligning the
relatively short sequencing reads (up to 25-50 nucleotides) prevalent at
the time. Since then, technology has improved both sequencing throughput
(more nucleotides produced per sequencer per day) and read length (more
nucleotides per read).

The chief differences between Bowtie 1 and Bowtie 2 are:

1. For reads longer than about 50 bp, Bowtie 2 is generally faster, more
   sensitive, and uses less memory than Bowtie 1. For relatively short
   reads (e.g. less than 50 bp), Bowtie 1 is sometimes faster and/or
   more sensitive.

2. Bowtie 2 supports gapped alignment with affine gap penalties. The
   number of gaps and the gap lengths are not restricted, except by way
   of the configurable scoring scheme. Bowtie 1 finds just ungapped
   alignments.

3. Bowtie 2 supports local alignment, which doesn't require reads to
   align end-to-end. Local alignments might be "trimmed" ("soft
   clipped") at one or both extremes in a way that optimizes alignment
   score. Bowtie 2 also supports end-to-end alignment which, like
   Bowtie 1, requires that the read align entirely.

4. There is no upper limit on read length in Bowtie 2. Bowtie 1 had an
   upper limit of around 1000 bp.

5. Bowtie 2 allows alignments to overlap ambiguous characters (e.g. Ns)
   in the reference. Bowtie 1 does not.

6. Bowtie 2 does away with Bowtie 1's notion of alignment "stratum",
   and its distinction between "Maq-like" and "end-to-end" modes. In
   Bowtie 2 all alignments lie along a continuous spectrum of alignment
   scores, where the scoring scheme is similar to Needleman-Wunsch and
   Smith-Waterman.

7. Bowtie 2's paired-end alignment is more flexible. E.g. for pairs
   that do not align in a paired fashion, Bowtie 2 attempts to find
   unpaired alignments for each mate.

8. Bowtie 2 reports a spectrum of mapping qualities, in contrast to
   Bowtie 1, which reports either 0 or high.

9. Bowtie 2 does not align colorspace reads.

Bowtie 2 is not a "drop-in" replacement for Bowtie 1. Bowtie 2's
command-line arguments and genome index format are both different from
Bowtie 1's.

What isn't Bowtie 2?

Bowtie 2 is geared toward aligning relatively short sequencing reads to
long genomes.
That said, it handles arbitrarily small reference
sequences (e.g. amplicons) and very long reads (i.e. upwards of 10s or
100s of kilobases), though it is slower in those settings. It is
optimized for the read lengths and error modes yielded by typical
Illumina sequencers.

Bowtie 2 does not support alignment of colorspace reads. (Bowtie 1
does.)

Obtaining Bowtie 2

Bowtie 2 is available from various package managers, notably Bioconda.
With Bioconda installed, you should be able to install Bowtie 2 with
conda install bowtie2.

Containerized versions of Bowtie 2 are also available via the
Biocontainers project (e.g. via Docker Hub).

You can also download Bowtie 2 sources and binaries from the Download
section of the Sourceforge site. Binaries are available for the x86_64
architecture running Linux, Mac OS X, and Windows. FreeBSD users can
obtain the latest version of Bowtie 2 from ports using
pkg install bowtie2. If you plan to compile Bowtie 2 yourself, make sure
to get the source package, i.e., the file whose name ends in
"-source.zip".

Building from source

Building Bowtie 2 from source requires a GNU-like environment with
Clang/GCC, GNU Make and other basics. It should be possible to build
Bowtie 2 on most vanilla *NIX installations or on a Mac installation
with Xcode installed. Bowtie 2 can also be built on Windows using a
64-bit MinGW distribution and MSYS. To simplify the MinGW setup, it
might be worth investigating popular MinGW personal builds, since these
come already prepared with most of the toolchains needed.

First, download the source package from the Sourceforge site. Make sure
you're getting the source package; the file downloaded should end in
-source.zip.
Unzip the file, change to the unzipped directory, and build
the Bowtie 2 tools by running GNU make (usually with the command make,
but sometimes with gmake) with no arguments. If building with MinGW, run
make from the MSYS environment.

The Bowtie 2 Makefile also includes recipes for basic automatic
dependency management. Running make static-libs && make STATIC_BUILD=1
will issue a series of commands that will:

1. download zstd and zlib
2. compile them as static libraries
3. link the resulting libraries to the compiled Bowtie 2 binaries

As of version 2.3.5, bowtie2 supports aligning SRA reads. Prepackaged
builds will include a package that supports SRA. If you're building
bowtie2 from source, please make sure that the Java runtime is available
on your system. You can then proceed with the build by running
make sra-deps && make USE_SRA=1.

Adding to PATH

By adding your new Bowtie 2 directory to your PATH environment variable,
you ensure that whenever you run bowtie2, bowtie2-build or
bowtie2-inspect from the command line, you will get the version you just
installed without having to specify the entire path. This is recommended
for most users. To do this, follow your operating system's instructions
for adding the directory to your PATH.

If you would like to install Bowtie 2 by copying the Bowtie 2 executable
files to an existing directory in your PATH, make sure that you copy all
the executables, including bowtie2, bowtie2-align-s, bowtie2-align-l,
bowtie2-build, bowtie2-build-s, bowtie2-build-l, bowtie2-inspect,
bowtie2-inspect-s and bowtie2-inspect-l.

The bowtie2 aligner

bowtie2 takes a Bowtie 2 index and a set of sequencing read files and
outputs a set of alignments in SAM format.

"Alignment" is the process by which we discover how and where the read
sequences are similar to the reference sequence.
An "alignment" is a
result from this process, specifically: an alignment is a way of "lining
up" some or all of the characters in the read with some characters from
the reference in a way that reveals how they're similar. For example:

    Read:      GACTGGGCGATCTCGACTTCG
               |||||  |||||||||| |||
    Reference: GACTG--CGATCTCGACATCG

Where dash symbols represent gaps and vertical bars show where aligned
characters match.

We use alignment to make an educated guess as to where a read originated
with respect to the reference genome. It's not always possible to
determine this with certainty. For instance, if the reference genome
contains several long stretches of As (AAAAAAAAA etc.) and the read
sequence is a short stretch of As (AAAAAAA), we cannot know for certain
exactly where in the sea of As the read originated.

End-to-end alignment versus local alignment

By default, Bowtie 2 performs end-to-end read alignment. That is, it
searches for alignments involving all of the read characters. This is
also called an "untrimmed" or "unclipped" alignment.

When the --local option is specified, Bowtie 2 performs local read
alignment. In this mode, Bowtie 2 might "trim" or "clip" some read
characters from one or both ends of the alignment if doing so maximizes
the alignment score.

End-to-end alignment example

The following is an "end-to-end" alignment because it involves all the
characters in the read. Such an alignment can be produced by Bowtie 2 in
either end-to-end mode or in local mode.

    Read:      GACTGGGCGATCTCGACTTCG
    Reference: GACTGCGATCTCGACATCG

    Alignment:
      Read:      GACTGGGCGATCTCGACTTCG
                 |||||  |||||||||| |||
      Reference: GACTG--CGATCTCGACATCG

Local alignment example

The following is a "local" alignment because some of the characters at
the ends of the read do not participate.
In this case, 4 characters are
omitted (or "soft trimmed" or "soft clipped") from the beginning and 3
characters are omitted from the end. This sort of alignment can be
produced by Bowtie 2 only in local mode.

    Read:      ACGGTTGCGTTAATCCGCCACG
    Reference: TAACTTGCGTTAAATCCGCCTGG

    Alignment:
      Read:      ACGGTTGCGTTAA-TCCGCCACG
                     ||||||||| ||||||
      Reference: TAACTTGCGTTAAATCCGCCTGG

Scores: higher = more similar

An alignment score quantifies how similar the read sequence is to the
reference sequence it is aligned to. The higher the score, the more
similar they are. A score is calculated by subtracting penalties for
each difference (mismatch, gap, etc.) and, in local alignment mode,
adding bonuses for each match.

The scores can be configured with the --ma (match bonus), --mp (mismatch
penalty), --np (penalty for having an N in either the read or the
reference), --rdg (affine read gap penalty) and --rfg (affine reference
gap penalty) options.

End-to-end alignment score example

A mismatched base at a high-quality position in the read receives a
penalty of -6 by default. A length-2 read gap receives a penalty of -11
by default (-5 for the gap open, -3 for the first extension, -3 for the
second extension). Thus, in end-to-end alignment mode, if the read is 50
bp long and it matches the reference exactly except for one mismatch at
a high-quality position and one length-2 read gap, then the overall
score is -(6 + 11) = -17.

The best possible alignment score in end-to-end mode is 0, which happens
when there are no differences between the read and the reference.

Local alignment score example

A mismatched base at a high-quality position in the read receives a
penalty of -6 by default. A length-2 read gap receives a penalty of -11
by default (-5 for the gap open, -3 for the first extension, -3 for the
second extension).
A base that matches receives a bonus of +2 by
default. Thus, in local alignment mode, if the read is 50 bp long and it
matches the reference exactly except for one mismatch at a high-quality
position and one length-2 read gap, then the overall score equals the
total bonus, 2 * 49 = 98, minus the total penalty, 6 + 11 = 17, which
gives 81.

The best possible score in local mode equals the match bonus times the
length of the read. This happens when there are no differences between
the read and the reference.

Valid alignments meet or exceed the minimum score threshold

For an alignment to be considered "valid" (i.e. "good enough") by Bowtie
2, it must have an alignment score no less than the minimum score
threshold. The threshold is configurable and is expressed as a function
of the read length. In end-to-end alignment mode, the default minimum
score threshold is -0.6 + -0.6 * L, where L is the read length. In local
alignment mode, the default minimum score threshold is 20 + 8.0 * ln(L),
where L is the read length. This can be configured with the --score-min
option. For details on how to set options like --score-min that
correspond to functions, see the section on setting function options.

Mapping quality: higher = more unique

The aligner cannot always assign a read to its point of origin with high
confidence. For instance, a read that originated inside a repeat element
might align equally well to many occurrences of the element throughout
the genome, leaving the aligner with no basis for preferring one over
the others.

Aligners characterize their degree of confidence in the point of origin
by reporting a mapping quality: a non-negative integer Q = -10 log10 p,
where p is an estimate of the probability that the alignment does not
correspond to the read's true point of origin. Mapping quality is
sometimes abbreviated MAPQ, and is recorded in the SAM MAPQ field.
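
As a rough illustration of the relation Q = -10 log10 p, the following
Python sketch converts between a mapping quality and the estimated error
probability. The function names are invented for this example and are
not part of Bowtie 2.

```python
import math

def mapq_to_error_prob(q):
    """Return p, the estimated probability that the alignment is wrong,
    for a mapping quality q, by inverting Q = -10 * log10(p)."""
    return 10 ** (-q / 10)

def error_prob_to_mapq(p):
    """Return Q = -10 * log10(p), rounded to the nearest integer."""
    return round(-10 * math.log10(p))

# A few mapping qualities and the error probabilities they imply.
for q in (0, 10, 20, 30):
    print(q, mapq_to_error_prob(q))
```

Under this relation, each increase of 10 in mapping quality corresponds
to a tenfold decrease in the estimated probability of a wrong placement.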

Mapping quality is related to "uniqueness." We say an alignment is
unique if it has a much higher alignment score than all the other
possible alignments. The bigger the gap between the best alignment's
score and the second-best alignment's score, the more unique the best
alignment, and the higher its mapping quality should be.

Accurate mapping qualities are useful for downstream tools like variant
callers. For instance, a variant caller might choose to ignore evidence
from alignments with mapping quality less than, say, 10. A mapping
quality of 10 or less indicates that there is at least a 1 in 10 chance
that the read truly originated elsewhere.

Aligning pairs

A "paired-end" or "mate-pair" read consists of a pair of mates, called
mate 1 and mate 2. Pairs come with a prior expectation about (a) the
relative orientation of the mates, and (b) the distance separating them
on the original DNA molecule. Exactly what expectations hold for a given
dataset depends on the lab procedures used to generate the data. For
example, a common lab procedure for producing pairs is Illumina's
Paired-end Sequencing Assay, which yields pairs with a relative
orientation of FR ("forward, reverse") meaning that if mate 1 came from
the Watson strand, mate 2 very likely came from the Crick strand and
vice versa. Also, this protocol yields pairs where the expected genomic
distance from end to end is about 200-500 base pairs.

For simplicity, this manual uses the term "paired-end" to refer to any
pair of reads with some expected relative orientation and distance.
Depending on the protocol, these might actually be referred to as
"paired-end" or "mate-paired." Also, we always refer to the individual
sequences making up the pair as "mates."

Paired inputs

Pairs are often stored in a pair of files, one file containing the mate
1s and the other containing the mate 2s.
The first mate in the file for
mate 1 forms a pair with the first mate in the file for mate 2, the
second with the second, and so on. When aligning pairs with Bowtie 2,
specify the file with the mate 1s using the -1 argument and the
file with the mate 2s using the -2 argument. This causes Bowtie 2 to
take the paired nature of the reads into account when aligning them.

Paired SAM output

When Bowtie 2 prints a SAM alignment for a pair, it prints two records
(i.e. two lines of output), one for each mate. The first record
describes the alignment for mate 1 and the second record describes the
alignment for mate 2. In both records, some of the fields of the SAM
record describe various properties of the alignment; for instance, the
7th and 8th fields (RNEXT and PNEXT respectively) indicate the reference
name and position where the other mate aligned, and the 9th field
indicates the inferred length of the DNA fragment from which the two
mates were sequenced. See the SAM specification for more details
regarding these fields.

Concordant pairs match pair expectations, discordant pairs don't

A pair that aligns with the expected relative mate orientation and with
the expected range of distances between mates is said to align
"concordantly". If both mates have unique alignments, but the alignments
do not match paired-end expectations (i.e. the mates aren't in the
expected relative orientation, or aren't within the expected distance
range, or both), the pair is said to align "discordantly". Discordant
alignments may be of particular interest, for instance, when seeking
structural variants.

The expected relative orientation of the mates is set using the --ff,
--fr, or --rf options. The expected range of inter-mate distances (as
measured from the furthest extremes of the mates; also called "outer
distance") is set with the -I and -X options.
Note that setting -I and
-X far apart makes Bowtie 2 slower. See documentation for -I and -X.

To declare that a pair aligns discordantly, Bowtie 2 requires that both
mates align uniquely. This is a conservative threshold, but this is
often desirable when seeking structural variants.

By default, Bowtie 2 searches for both concordant and discordant
alignments, though searching for discordant alignments can be disabled
with the --no-discordant option.

Mixed mode: paired where possible, unpaired otherwise

If Bowtie 2 cannot find a paired-end alignment for a pair, by default it
will go on to look for unpaired alignments for the constituent mates.
This is called "mixed mode." To disable mixed mode, set the --no-mixed
option.

Bowtie 2 runs a little faster in --no-mixed mode, but will only consider
alignment status of pairs per se, not individual mates.

Some SAM FLAGS describe paired-end properties

The SAM FLAGS field, the second field in a SAM record, has multiple bits
that describe the paired-end nature of the read and alignment. The first
(least significant) bit (1 in decimal, 0x1 in hexadecimal) is set if the
read is part of a pair. The second bit (2 in decimal, 0x2 in
hexadecimal) is set if the read is part of a pair that aligned in a
paired-end fashion. The fourth bit (8 in decimal, 0x8 in hexadecimal) is
set if the read is part of a pair and the other mate in the pair is
unaligned. The sixth bit (32 in decimal, 0x20 in
hexadecimal) is set if the read is part of a pair and the other mate in
the pair aligned to the Crick strand (or, equivalently, if the reverse
complement of the other mate aligned to the Watson strand). The seventh
bit (64 in decimal, 0x40 in hexadecimal) is set if the read is mate 1 in
a pair. The eighth bit (128 in decimal, 0x80 in hexadecimal) is set if
the read is mate 2 in a pair.
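
The paired-end bits can be tested with bitwise AND. The sketch below is
only an illustration: the label strings and the function name are made
up for this example, and the meaning given for 0x8 (other mate
unaligned) follows the SAM specification.

```python
# Paired-end bits of the SAM FLAGS field. The labels are illustrative
# names chosen for this sketch, not identifiers used by Bowtie 2.
PAIRED_BITS = [
    (0x1, "paired"),
    (0x2, "aligned_as_pair"),
    (0x8, "mate_unaligned"),        # per the SAM specification
    (0x20, "mate_on_crick_strand"),
    (0x40, "is_mate1"),
    (0x80, "is_mate2"),
]

def decode_paired_flags(flags):
    """Return the labels of the paired-end bits set in a FLAGS value."""
    return [label for bit, label in PAIRED_BITS if flags & bit]

# 99 = 1 + 2 + 32 + 64
print(decode_paired_flags(99))
```

Bits that are not related to pairing (for example 0x4 or 0x10) are
ignored by this sketch.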
See the SAM specification for a more
detailed description of the FLAGS field.

Some SAM optional fields describe more paired-end properties

The last several fields of each SAM record usually contain SAM optional
fields, which are simply tab-separated strings conveying additional
information about the reads and alignments. A SAM optional field is
formatted like this: "XP:i:1" where "XP" is the TAG, "i" is the TYPE
("integer" in this case), and "1" is the VALUE. See the SAM
specification for details regarding SAM optional fields.

Mates can overlap, contain, or dovetail each other

The fragment and read lengths might be such that alignments for the two
mates from a pair overlap each other. Consider this example:

(For these examples, assume we expect mate 1 to align to the left of
mate 2.)

    Mate 1:    GCAGATTATATGAGTCAGCTACGATATTGTT
    Mate 2:                               TGTTTGGGGTGACACATTACGCGTCTTTGAC
    Reference: GCAGATTATATGAGTCAGCTACGATATTGTTTGGGGTGACACATTACGCGTCTTTGAC

It's also possible, though unusual, for one mate alignment to contain
the other, as in these examples:

    Mate 1:    GCAGATTATATGAGTCAGCTACGATATTGTTTGGGGTGACACATTACGC
    Mate 2:                               TGTTTGGGGTGACACATTACGC
    Reference: GCAGATTATATGAGTCAGCTACGATATTGTTTGGGGTGACACATTACGCGTCTTTGAC

    Mate 1:                   CAGCTACGATATTGTTTGGGGTGACACATTACGC
    Mate 2:                      CTACGATATTGTTTGGGGTGAC
    Reference: GCAGATTATATGAGTCAGCTACGATATTGTTTGGGGTGACACATTACGCGTCTTTGAC

And it's also possible, though unusual, for the mates to "dovetail",
with the mates seemingly extending "past" each other as in this example:

    Mate 1:                 GTCAGCTACGATATTGTTTGGGGTGACACATTACGC
    Mate 2:            TATGAGTCAGCTACGATATTGTTTGGGGTGACACAT
    Reference: GCAGATTATATGAGTCAGCTACGATATTGTTTGGGGTGACACATTACGCGTCTTTGAC

In some situations, it's desirable for the aligner to consider all these
cases as "concordant" as long as other paired-end constraints are not
violated.
Bowtie 2's default behavior is to consider overlapping and
containing as being consistent with concordant alignment. By default,
dovetailing is considered inconsistent with concordant alignment.

These defaults can be overridden. Setting --no-overlap causes Bowtie 2
to consider overlapping mates as non-concordant. Setting --no-contain
causes Bowtie 2 to consider cases where one mate alignment contains the
other as non-concordant. Setting --dovetail causes Bowtie 2 to consider
cases where the mate alignments dovetail as concordant.

Reporting

The reporting mode governs how many alignments Bowtie 2 looks for, and
how to report them. Bowtie 2 has three distinct reporting modes. The
default reporting mode is similar to the default reporting mode of many
other read alignment tools, including BWA. It is also similar to Bowtie
1's -M alignment mode.

In general, when we say that a read has an alignment, we mean that it
has a valid alignment. When we say that a read has multiple alignments,
we mean that it has multiple alignments that are valid and distinct from
one another.

Distinct alignments map a read to different places

Two alignments for the same individual read are "distinct" if they map
the same read to different places. Specifically, we say that two
alignments are distinct if there are no alignment positions where a
particular read offset is aligned opposite a particular reference offset
in both alignments with the same orientation. E.g. if the first
alignment is in the forward orientation and aligns the read character at
read offset 10 to the reference character at chromosome 3, offset
3,445,245, and the second alignment is also in the forward orientation
and also aligns the read character at read offset 10 to the reference
character at chromosome 3, offset 3,445,245, they are not distinct
alignments.
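
The definition above can be sketched in Python as follows. The Alignment
type and its fields are hypothetical, chosen only for this example: each
alignment is modeled as an orientation plus the set of (read offset,
reference name, reference offset) positions it aligns. This is not how
Bowtie 2 represents alignments internally.

```python
from typing import FrozenSet, NamedTuple, Tuple

class Alignment(NamedTuple):
    """Hypothetical alignment record used only in this sketch."""
    forward: bool  # True for forward orientation, False for reverse
    cells: FrozenSet[Tuple[int, str, int]]  # (read_off, ref_name, ref_off)

def are_distinct(a, b):
    """Two alignments of the same read are distinct unless they share at
    least one aligned position while having the same orientation."""
    if a.forward != b.forward:
        return True
    return not (a.cells & b.cells)

first = Alignment(True, frozenset({(10, "chr3", 3445245)}))
second = Alignment(True, frozenset({(10, "chr3", 3445245), (11, "chr3", 3445246)}))
print(are_distinct(first, second))  # the two share a position: not distinct
```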

Two alignments for the same pair are distinct if either the mate 1s in
the two paired-end alignments are distinct or the mate 2s in the two
alignments are distinct or both.

Default mode: search for multiple alignments, report the best one

By default, Bowtie 2 searches for distinct, valid alignments for each
read. When it finds a valid alignment, it generally will continue to
look for alignments that are nearly as good or better. It will
eventually stop looking, either because it exceeded a limit placed on
search effort (see -D and -R) or because it already knows all it needs
to know to report an alignment. Information from the best alignments is
used to estimate mapping quality (the MAPQ SAM field) and to set SAM
optional fields, such as AS:i and XS:i. Bowtie 2 does not guarantee that
the alignment reported is the best possible in terms of alignment score.

See also: -D, which puts an upper limit on the number of dynamic
programming problems (i.e. seed extensions) that can "fail" in a row
before Bowtie 2 stops searching. Increasing -D makes Bowtie 2 slower,
but increases the likelihood that it will report the correct alignment
for a read that aligns many places.

See also: -R, which sets the maximum number of times Bowtie 2 will
"re-seed" when attempting to align a read with repetitive seeds.
Increasing -R makes Bowtie 2 slower, but increases the likelihood that
it will report the correct alignment for a read that aligns many places.

-k mode: search for one or more alignments, report each

In -k mode, Bowtie 2 searches for up to N distinct, valid alignments for
each read, where N equals the integer specified with the -k parameter.
That is, if -k 2 is specified, Bowtie 2 will search for at most 2
distinct alignments. It reports all alignments found, in descending
order by alignment score.
The alignment score for a paired-end alignment
equals the sum of the alignment scores of the individual mates. Each
reported read or pair alignment beyond the first has the SAM 'secondary'
bit (which equals 256) set in its FLAGS field. Supplementary alignments
will also be assigned a MAPQ of 255. See the SAM specification for
details.

Bowtie 2 does not "find" alignments in any specific order, so for reads
that have more than N distinct, valid alignments, Bowtie 2 does not
guarantee that the N alignments reported are the best possible in terms
of alignment score. Still, this mode can be effective and fast in
situations where the user cares more about whether a read aligns (or
aligns a certain number of times) than where exactly it originated.

-a mode: search for and report all alignments

-a mode is similar to -k mode except that there is no upper limit on the
number of alignments Bowtie 2 should report. Alignments are reported in
descending order by alignment score. The alignment score for a
paired-end alignment equals the sum of the alignment scores of the
individual mates. Each reported read or pair alignment beyond the first
has the SAM 'secondary' bit (which equals 256) set in its FLAGS field.
Supplementary alignments will be assigned a MAPQ of 255. See the SAM
specification for details.

Some tools are designed with this reporting mode in mind. Bowtie 2 is
not! For very large genomes, this mode is very slow.

Randomness in Bowtie 2

Bowtie 2's search for alignments for a given read is "randomized." That
is, when Bowtie 2 encounters a set of equally-good choices, it uses a
pseudo-random number to choose. For example, if Bowtie 2 discovers a set
of 3 equally-good alignments and wants to decide which to report, it
picks a pseudo-random integer 0, 1 or 2 and reports the corresponding
alignment.
Arbitrary choices can crop up at various points during
alignment.

The pseudo-random number generator is re-initialized for every read, and
the seed used to initialize it is a function of the read name,
nucleotide string, quality string, and the value specified with --seed.
If you run the same version of Bowtie 2 on two reads with identical
names, nucleotide strings, and quality strings, and if --seed is set the
same for both runs, Bowtie 2 will produce the same output; i.e., it will
align the read to the same place, even if there are multiple equally
good alignments. This is intuitive and desirable in most cases. Most
users expect Bowtie 2 to produce the same output when run twice on the
same input.

However, when the user specifies the --non-deterministic option, Bowtie
2 will use the current time to re-initialize the pseudo-random number
generator. When this is specified, Bowtie 2 might report different
alignments for identical reads. This is counter-intuitive for some
users, but might be more appropriate in situations where the input
consists of many identical reads.

Multiseed heuristic

To rapidly narrow the number of possible alignments that must be
considered, Bowtie 2 begins by extracting substrings ("seeds") from the
read and its reverse complement and aligning them in an ungapped fashion
with the help of the FM Index. This is "multiseed alignment" and it is
similar to what Bowtie 1 does, except Bowtie 1 attempts to align the
entire read this way.

This initial step makes Bowtie 2 much faster than it would be without
such a filter, but at the expense of missing some valid alignments. For
instance, it is possible for a read to have a valid overall alignment
but to have no valid seed alignments because each potential seed
alignment is interrupted by too many mismatches or gaps.
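
The seed-extraction step described above can be sketched in Python. This
is a simplified illustration, not Bowtie 2's implementation: it assumes
a fixed seed length and a fixed interval between seed start positions,
and the function names are invented for this example.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Return the reverse complement of a nucleotide string."""
    return seq.translate(_COMPLEMENT)[::-1]

def extract_seeds(read, seed_len, interval):
    """Return (strand, offset, seed) tuples taken from the read ("+")
    and from its reverse complement ("-"), starting a new seed every
    `interval` positions. Reads shorter than seed_len yield no seeds."""
    seeds = []
    for strand, seq in (("+", read), ("-", reverse_complement(read))):
        for offset in range(0, len(seq) - seed_len + 1, interval):
            seeds.append((strand, offset, seq[offset:offset + seed_len]))
    return seeds

for seed in extract_seeds("ACGTACGTAC", seed_len=4, interval=3):
    print(seed)
```

With shorter seeds or a smaller interval, more seeds are produced per
read, so it is more likely that at least one of them aligns without
being interrupted by a mismatch or gap, at the cost of more work.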

The trade-off between speed and sensitivity/accuracy can be adjusted by
setting the seed length (-L), the interval between extracted seeds (-i),
and the number of mismatches permitted per seed (-N). For more sensitive
alignment, set these parameters to (a) make the seeds closer together,
(b) make the seeds shorter, and/or (c) allow more mismatches. You can
adjust these options one-by-one, though Bowtie 2 comes with some useful
combinations of options prepackaged as "preset options."

-D and -R are also options that adjust the trade-off between speed and
sensitivity/accuracy.

FM Index memory footprint

Bowtie 2 uses the FM Index to find ungapped alignments for seeds. This
step accounts for the bulk of Bowtie 2's memory footprint, as the FM
Index itself is typically the largest data structure used. For instance,
the memory footprint of the FM Index for the human genome is about 3.2
gigabytes of RAM.

Ambiguous characters

Non-whitespace characters besides A, C, G or T are considered
"ambiguous." N is a common ambiguous character that appears in reference
sequences. Bowtie 2 considers all ambiguous characters in the reference
(including IUPAC nucleotide codes) to be Ns.

Bowtie 2 allows alignments to overlap ambiguous characters in the
reference. An alignment position that contains an ambiguous character in
the read, reference, or both, is penalized according to --np. --n-ceil
sets an upper limit on the number of positions that may contain
ambiguous reference characters in a valid alignment. The optional field
XN:i reports the number of ambiguous reference characters overlapped by
an alignment.

Note that the multiseed heuristic cannot find seed alignments that
overlap ambiguous reference characters.
For an alignment overlapping an
ambiguous reference character to be found, it must have one or more seed
alignments that do not overlap ambiguous reference characters.

Presets: setting many settings at once

Bowtie 2 comes with some useful combinations of parameters packaged into
shorter "preset" parameters. For example, running Bowtie 2 with the
--very-sensitive option is the same as running with options:
-D 20 -R 3 -N 0 -L 20 -i S,1,0.50. The preset options that come with
Bowtie 2 are designed to cover a wide area of the
speed/sensitivity/accuracy trade-off space, with the presets ending in
fast generally being faster but less sensitive and less accurate, and
the presets ending in sensitive generally being slower but more
sensitive and more accurate. See the documentation for the preset
options for details.

As of Bowtie 2 v2.4.0, individual preset values can be overridden by
providing the specific options. For example, the configured seed length
of 20 in the --very-sensitive preset above can be changed to 25 by also
specifying the -L 25 parameter anywhere on the command line.

Filtering

Some reads are skipped or "filtered out" by Bowtie 2. For example, reads
may be filtered out because they are extremely short or have a high
proportion of ambiguous nucleotides. Bowtie 2 will still print a SAM
record for such a read, but no alignment will be reported and the YF:Z
SAM optional field will be set to indicate the reason the read was
filtered.

- YF:Z:LN: the read was filtered because it had length less than or
  equal to the number of seed mismatches set with the -N option.
- YF:Z:NS: the read was filtered because it contains a number of
  ambiguous characters (usually N or .) greater than the ceiling
  specified with --n-ceil.
- YF:Z:SC: the read was filtered because the read length and the match
  bonus (set with --ma) are such that the read can't possibly earn an
  alignment score greater than or equal to the threshold set with
  --score-min.
- YF:Z:QC: the read was filtered because it was marked as failing
  quality control and the user specified the --qc-filter option. This
  only happens when the input is in Illumina's QSEQ format (i.e. when
  --qseq is specified) and the last (11th) field of the read's QSEQ
  record contains 1.

If a read could be filtered for more than one reason, the value of the
YF:Z field will reflect only one of those reasons.

Alignment summary

When Bowtie 2 finishes running, it prints messages summarizing what
happened. These messages are printed to the "standard error" ("stderr")
filehandle. For datasets consisting of unpaired reads, the summary might
look like this:

    20000 reads; of these:
      20000 (100.00%) were unpaired; of these:
        1247 (6.24%) aligned 0 times
        18739 (93.69%) aligned exactly 1 time
        14 (0.07%) aligned >1 times
    93.77% overall alignment rate

For datasets consisting of pairs, the summary might look like this:

    10000 reads; of these:
      10000 (100.00%) were paired; of these:
        650 (6.50%) aligned concordantly 0 times
        8823 (88.23%) aligned concordantly exactly 1 time
        527 (5.27%) aligned concordantly >1 times
        ----
        650 pairs aligned concordantly 0 times; of these:
          34 (5.23%) aligned discordantly 1 time
        ----
        616 pairs aligned 0 times concordantly or discordantly; of these:
          1232 mates make up the pairs; of these:
            660 (53.57%) aligned 0 times
            571 (46.35%) aligned exactly 1 time
            1 (0.08%) aligned >1 times
    96.70% overall alignment rate

The indentation indicates how subtotals relate to totals.
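
As a worked check of the paired-end summary above, the overall alignment
rate can be reproduced from the subtotals. The sketch below is in Python
and assumes the rate counts individual mates: two per pair that aligned
concordantly or discordantly, plus each mate that aligned on its own.
This accounting is inferred from the sample numbers, not taken from
Bowtie 2's source code, and the function name is hypothetical.

```python
def overall_alignment_rate(pairs, conc_once, conc_multi, disc_once,
                           mates_once, mates_multi):
    """Return the percentage of mates that aligned in any way.

    Assumption: every pair contributes two mates, and a pair that aligned
    concordantly or discordantly counts both of its mates as aligned.
    """
    aligned_pairs = conc_once + conc_multi + disc_once
    aligned_mates = 2 * aligned_pairs + mates_once + mates_multi
    return 100.0 * aligned_mates / (2 * pairs)


# Counts copied from the paired-end sample summary.
rate = overall_alignment_rate(pairs=10000, conc_once=8823, conc_multi=527,
                              disc_once=34, mates_once=571, mates_multi=1)
print(f"{rate:.2f}% overall alignment rate")  # prints "96.70% overall alignment rate"
```

Under that assumption, 19340 of the 20000 mates aligned, which matches
the 96.70% in the sample output.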

Wrapper scripts

The bowtie2, bowtie2-build and bowtie2-inspect executables are actually
wrapper scripts that call binary programs as appropriate. The wrappers
shield users from having to distinguish between "small" and "large"
index formats, discussed briefly in the following section. Also, the
bowtie2 wrapper provides some key functionality, like the ability to
handle compressed inputs, and the functionality for --un, --al and
related options.

It is recommended that you always run the bowtie2 wrappers and not run
the binaries directly.

Small and large indexes

bowtie2-build can index reference genomes of any size. For genomes less
than about 4 billion nucleotides in length, bowtie2-build builds a
"small" index using 32-bit numbers in various parts of the index. When
the genome is longer, bowtie2-build builds a "large" index using 64-bit
numbers. Small indexes are stored in files with the .bt2 extension, and
large indexes are stored in files with the .bt2l extension. The user
need not worry about whether a particular index is small or large; the
wrapper scripts will automatically build and use the appropriate index.

Performance tuning

1.  If your computer has multiple processors/cores, use -p

    The -p option causes Bowtie 2 to launch a specified number of
    parallel search threads. Each thread runs on a different
    processor/core and all threads find alignments in parallel,
    increasing alignment throughput by approximately a multiple of the
    number of threads (though in practice, speedup is somewhat worse
    than linear).

2.  If reporting many alignments per read, try reducing
    bowtie2-build --offrate

    If you are using -k or -a options and Bowtie 2 is reporting many
    alignments per read, using an index with a denser SA sample can
    speed things up considerably.
    To do this, specify a smaller-than-default -o/--offrate value when
    running bowtie2-build. A denser SA sample yields a larger index, but
    is also particularly effective at speeding up alignment when many
    alignments are reported per read.

3.  If bowtie2 "thrashes", try increasing bowtie2-build --offrate

    If bowtie2 runs very slowly on a relatively low-memory computer, try
    setting -o/--offrate to a larger value when building the index. This
    decreases the memory footprint of the index.

Command Line

Setting function options

Some Bowtie 2 options specify a function rather than an individual
number or setting. In these cases the user specifies three parameters:
(a) a function type F, (b) a constant term B, and (c) a coefficient A.
The available function types are constant (C), linear (L), square-root
(S), and natural log (G). The parameters are specified as F,B,A; that
is, the function type, the constant term, and the coefficient are
separated by commas with no whitespace. The constant term and
coefficient may be negative and/or floating-point numbers.

For example, if the function specification is L,-0.4,-0.6, then the
function defined is:

    f(x) = -0.4 + -0.6 * x

If the function specification is G,1,5.4, then the function defined is:

    f(x) = 1.0 + 5.4 * ln(x)

See the documentation for the option in question to learn what the
parameter x is for. For example, in the case of the --score-min option,
the function f(x) sets the minimum alignment score necessary for an
alignment to be considered valid, and x is the read length.

Usage

    bowtie2 [options]* -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r> | --interleaved <i> | --sra-acc <acc> | -b <bam>} -S [<sam>]

Main arguments

    -x <bt2-idx>

The basename of the index for the reference genome.
The basename is the name of any of the index files up to but not
including the final .1.bt2 / .rev.1.bt2 / etc. bowtie2 looks for the
specified index first in the current directory, then in the directory
specified in the BOWTIE2_INDEXES environment variable.

    -1 <m1>

Comma-separated list of files containing mate 1s (filename usually
includes _1), e.g. -1 flyA_1.fq,flyB_1.fq. Sequences specified with this
option must correspond file-for-file and read-for-read with those
specified in <m2>. Reads may be a mix of different lengths. If - is
specified, bowtie2 will read the mate 1s from the "standard in" or
"stdin" filehandle.

    -2 <m2>

Comma-separated list of files containing mate 2s (filename usually
includes _2), e.g. -2 flyA_2.fq,flyB_2.fq. Sequences specified with this
option must correspond file-for-file and read-for-read with those
specified in <m1>. Reads may be a mix of different lengths. If - is
specified, bowtie2 will read the mate 2s from the "standard in" or
"stdin" filehandle.

    -U <r>

Comma-separated list of files containing unpaired reads to be aligned,
e.g. lane1.fq,lane2.fq,lane3.fq,lane4.fq. Reads may be a mix of
different lengths. If - is specified, bowtie2 gets the reads from the
"standard in" or "stdin" filehandle.

    --interleaved

Reads interleaved FASTQ files where the first two records (8 lines)
represent a mate pair.

    --sra-acc

Reads are SRA accessions. If the accession provided cannot be found in
local storage it will be fetched from the NCBI database. If you find
that SRA alignments are long running, please rerun your command with the
-p/--threads parameter set to the desired number of threads.

NB: this option is only available if Bowtie 2 is compiled with the
necessary SRA libraries. See Obtaining Bowtie 2 for details.

    -b <bam>

Reads are unaligned BAM records sorted by read name.
The --align-paired-reads and --preserve-tags options affect the way
Bowtie 2 processes records.

    -S <sam>

File to write SAM alignments to. By default, alignments are written to
the "standard out" or "stdout" filehandle (i.e. the console).

Options

Input options

    -q

Reads (specified with <m1>, <m2>, <s>) are FASTQ files. FASTQ files
usually have extension .fq or .fastq. FASTQ is the default format. See
also: --solexa-quals and --int-quals.

    --tab5

Each read or pair is on a single line. An unpaired read line is
[name]\t[seq]\t[qual]\n. A paired-end read line is
[name]\t[seq1]\t[qual1]\t[seq2]\t[qual2]\n. An input file can be a mix
of unpaired and paired-end reads and Bowtie 2 recognizes each according
to the number of fields, handling each as it should.

    --tab6

Similar to --tab5 except, for paired-end reads, the second end can have
a different name from the first:
[name1]\t[seq1]\t[qual1]\t[name2]\t[seq2]\t[qual2]\n

    --qseq

Reads (specified with <m1>, <m2>, <s>) are QSEQ files. QSEQ files
usually end in _qseq.txt. See also: --solexa-quals and --int-quals.

    -f

Reads (specified with <m1>, <m2>, <s>) are FASTA files. FASTA files
usually have extension .fa, .fasta, .mfa, .fna or similar. FASTA files
do not have a way of specifying quality values, so when -f is set, the
result is as if --ignore-quals is also set.

    -r

Reads (specified with <m1>, <m2>, <s>) are files with one input sequence
per line, without any other information (no read names, no qualities).
When -r is set, the result is as if --ignore-quals is also set.

    -F k:<int>,i:<int>

Reads are substrings (k-mers) extracted from a FASTA file <s>.
Specifically, for every reference sequence in FASTA file <s>, Bowtie 2
aligns the k-mers at offsets 1, 1+i, 1+2i, ... until reaching the end of
the reference.
Each k-mer is aligned as a separate read. Quality values are set to all
Is (40 on Phred scale). Each k-mer (read) is given a name like
<sequence>_<offset>, where <sequence> is the name of the FASTA sequence
it was drawn from and <offset> is its 0-based offset of origin with
respect to the sequence. Only single k-mers, i.e. unpaired reads, can be
aligned in this way.

    -c

The read sequences are given on the command line. I.e. <m1>, <m2> and
<singles> are comma-separated lists of reads rather than lists of read
files. There is no way to specify read names or qualities, so -c also
implies --ignore-quals.

    -s/--skip <int>

Skip (i.e. do not align) the first <int> reads or pairs in the input.

    -u/--qupto <int>

Align the first <int> reads or read pairs from the input (after the
-s/--skip reads or pairs have been skipped), then stop. Default: no
limit.

    -5/--trim5 <int>

Trim <int> bases from 5' (left) end of each read before alignment
(default: 0).

    -3/--trim3 <int>

Trim <int> bases from 3' (right) end of each read before alignment
(default: 0).

    --trim-to [3:|5:]<int>

Trim reads exceeding <int> bases. Bases will be trimmed from either the
3' (right) or 5' (left) end of the read. If the read end is not
specified, Bowtie 2 will default to trimming from the 3' (right) end of
the read. --trim-to and -3/-5 are mutually exclusive.

    --phred33

Input qualities are ASCII chars equal to the Phred quality plus 33. This
is also called the "Phred+33" encoding, which is used by the very latest
Illumina pipelines.

    --phred64

Input qualities are ASCII chars equal to the Phred quality plus 64. This
is also called the "Phred+64" encoding.

    --solexa-quals

Convert input qualities from Solexa (which can be negative) to Phred
(which can't). This scheme was used in older Illumina GA Pipeline
versions (prior to 1.3).
Default: off.

    --int-quals

Quality values are represented in the read input file as space-separated
ASCII integers, e.g., 40 40 30 40..., rather than ASCII characters,
e.g., II?I.... Integers are treated as being on the Phred quality scale
unless --solexa-quals is also specified. Default: off.

Preset options in --end-to-end mode

    --very-fast

Same as: -D 5 -R 1 -N 0 -L 22 -i S,0,2.50

    --fast

Same as: -D 10 -R 2 -N 0 -L 22 -i S,0,2.50

    --sensitive

Same as: -D 15 -R 2 -N 0 -L 22 -i S,1,1.15 (default in --end-to-end
mode)

    --very-sensitive

Same as: -D 20 -R 3 -N 0 -L 20 -i S,1,0.50

Preset options in --local mode

    --very-fast-local

Same as: -D 5 -R 1 -N 0 -L 25 -i S,1,2.00

    --fast-local

Same as: -D 10 -R 2 -N 0 -L 22 -i S,1,1.75

    --sensitive-local

Same as: -D 15 -R 2 -N 0 -L 20 -i S,1,0.75 (default in --local mode)

    --very-sensitive-local

Same as: -D 20 -R 3 -N 0 -L 20 -i S,1,0.50

Alignment options

    -N <int>

Sets the number of mismatches allowed in a seed alignment during
multiseed alignment. Can be set to 0 or 1. Setting this higher makes
alignment slower (often much slower) but increases sensitivity. Default:
0.

    -L <int>

Sets the length of the seed substrings to align during multiseed
alignment. Smaller values make alignment slower but more sensitive.
Default: the --sensitive preset is used by default, which sets -L to 22
in --end-to-end mode and to 20 in --local mode.

    -i <func>

Sets a function governing the interval between seed substrings to use
during multiseed alignment.
For instance, if the read has 30 characters, and seed length is 10, and
the seed interval is 6, the seeds extracted will be:

    Read:      TAGCTACGCTCTACGCTATCATGCATAAAC
    Seed 1 fw: TAGCTACGCT
    Seed 1 rc: AGCGTAGCTA
    Seed 2 fw:       CGCTCTACGC
    Seed 2 rc:       GCGTAGAGCG
    Seed 3 fw:             ACGCTATCAT
    Seed 3 rc:             ATGATAGCGT
    Seed 4 fw:                   TCATGCATAA
    Seed 4 rc:                   TTATGCATGA

Since it's best to use longer intervals for longer reads, this parameter
sets the interval as a function of the read length, rather than a single
one-size-fits-all number. For instance, specifying -i S,1,2.5 sets the
interval function f to f(x) = 1 + 2.5 * sqrt(x), where x is the read
length. See also: setting function options. If the function returns a
result less than 1, it is rounded up to 1. Default: the --sensitive
preset is used by default, which sets -i to S,1,1.15 in --end-to-end
mode and to S,1,0.75 in --local mode.

    --n-ceil <func>

Sets a function governing the maximum number of ambiguous characters
(usually Ns and/or .s) allowed in a read as a function of read length.
For instance, specifying L,0,0.15 sets the N-ceiling function f to
f(x) = 0 + 0.15 * x, where x is the read length. See also: setting
function options. Reads exceeding this ceiling are filtered out.
Default: L,0,0.15.

    --dpad <int>

"Pads" dynamic programming problems by <int> columns on either side to
allow gaps. Default: 15.

    --gbar <int>

Disallow gaps within <int> positions of the beginning or end of the
read. Default: 4.

    --ignore-quals

When calculating a mismatch penalty, always consider the quality value
at the mismatched position to be the highest possible, regardless of the
actual value. I.e. input is treated as though all quality values are
high. This is also the default behavior when the input doesn't specify
quality values (e.g.
in -f, -r, or -c modes).

    --nofw/--norc

If --nofw is specified, bowtie2 will not attempt to align unpaired reads
to the forward (Watson) reference strand. If --norc is specified,
bowtie2 will not attempt to align unpaired reads against the
reverse-complement (Crick) reference strand. In paired-end mode, --nofw
and --norc pertain to the fragments; i.e. specifying --nofw causes
bowtie2 to explore only those paired-end configurations corresponding to
fragments from the reverse-complement (Crick) strand. Default: both
strands enabled.

    --no-1mm-upfront

By default, Bowtie 2 will attempt to find either an exact or a
1-mismatch end-to-end alignment for the read before trying the multiseed
heuristic. Such alignments can be found very quickly, and many short
read alignments have exact or near-exact end-to-end alignments. However,
this can lead to unexpected alignments when the user also sets options
governing the multiseed heuristic, like -L and -N. For instance, if the
user specifies -N 0 and -L equal to the length of the read, the user
will be surprised to find 1-mismatch alignments reported. This option
prevents Bowtie 2 from searching for 1-mismatch end-to-end alignments
before using the multiseed heuristic, which leads to the expected
behavior when combined with options such as -L and -N. This comes at the
expense of speed.

    --end-to-end

In this mode, Bowtie 2 requires that the entire read align from one end
to the other, without any trimming (or "soft clipping") of characters
from either end. The match bonus --ma always equals 0 in this mode, so
all alignment scores are less than or equal to 0, and the greatest
possible alignment score is 0. This is mutually exclusive with --local.
--end-to-end is the default mode.

    --local

In this mode, Bowtie 2 does not require that the entire read align from
one end to the other. Rather, some characters may be omitted ("soft
clipped") from the ends in order to achieve the greatest possible
alignment score. The match bonus --ma is used in this mode, and the best
possible alignment score is equal to the match bonus (--ma) times the
length of the read. Specifying --local and one of the presets (e.g.
--local --very-fast) is equivalent to specifying the local version of
the preset (--very-fast-local). This is mutually exclusive with
--end-to-end. --end-to-end is the default mode.

Scoring options

    --ma <int>

Sets the match bonus. In --local mode <int> is added to the alignment
score for each position where a read character aligns to a reference
character and the characters match. Not used in --end-to-end mode.
Default: 2.

    --mp MX,MN

Sets the maximum (MX) and minimum (MN) mismatch penalties, both
integers. A number less than or equal to MX and greater than or equal to
MN is subtracted from the alignment score for each position where a read
character aligns to a reference character, the characters do not match,
and neither is an N. If --ignore-quals is specified, the number
subtracted equals MX. Otherwise, the number subtracted is
MN + floor( (MX-MN)(MIN(Q, 40.0)/40.0) ) where Q is the Phred quality
value. Default: MX = 6, MN = 2.

    --np <int>

Sets penalty for positions where the read, reference, or both, contain
an ambiguous character such as N. Default: 1.

    --rdg <int1>,<int2>

Sets the read gap open (<int1>) and extend (<int2>) penalties. A read
gap of length N gets a penalty of <int1> + N * <int2>. Default: 5, 3.

    --rfg <int1>,<int2>

Sets the reference gap open (<int1>) and extend (<int2>) penalties.
A reference gap of length N gets a penalty of <int1> + N * <int2>.
Default: 5, 3.

    --score-min <func>

Sets a function governing the minimum alignment score needed for an
alignment to be considered "valid" (i.e. good enough to report). This is
a function of read length. For instance, specifying L,0,-0.6 sets the
minimum-score function f to f(x) = 0 + -0.6 * x, where x is the read
length. See also: setting function options. The default in --end-to-end
mode is L,-0.6,-0.6 and the default in --local mode is G,20,8.

Reporting options

    -k <int>

By default, bowtie2 searches for distinct, valid alignments for each
read. When it finds a valid alignment, it continues looking for
alignments that are nearly as good or better. The best alignment found
is reported (randomly selected from among best if tied). Information
about the best alignments is used to estimate mapping quality and to set
SAM optional fields, such as AS:i and XS:i.

When -k is specified, however, bowtie2 behaves differently. Instead, it
searches for at most <int> distinct, valid alignments for each read. The
search terminates when it can't find more distinct valid alignments, or
when it finds <int>, whichever happens first. All alignments found are
reported in descending order by alignment score. The alignment score for
a paired-end alignment equals the sum of the alignment scores of the
individual mates. Each reported read or pair alignment beyond the first
has the SAM 'secondary' bit (which equals 256) set in its FLAGS field.
For reads that have more than <int> distinct, valid alignments, bowtie2
does not guarantee that the <int> alignments reported are the best
possible in terms of alignment score. -k is mutually exclusive with -a.
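
To illustrate the 'secondary' bit mentioned above, here is a minimal
Python sketch that tests it in a SAM FLAGS value. The bit values come
from the SAM specification; the helper name and the example FLAGS values
are hypothetical and are not taken from real Bowtie 2 output.

```python
# SAM FLAGS bits referred to in this manual (values per the SAM spec).
UNMAPPED = 0x4      # read did not align
MATE_1 = 0x40       # first mate of a pair
MATE_2 = 0x80       # second mate of a pair
SECONDARY = 0x100   # 256: alignment reported beyond the first


def is_secondary(flag: int) -> bool:
    """Return True if the 'secondary' bit is set in a SAM FLAGS value."""
    return bool(flag & SECONDARY)


# Hypothetical FLAGS for one read reported three times under -k 3.
# 272 = 256 (secondary) + 16 (reverse strand).
flags = [0, 256, 272]
for flag in flags:
    kind = "secondary" if is_secondary(flag) else "primary"
    print(flag, kind)
```

Downstream tools that expect one record per read can use a test like
this to keep only the first reported alignment.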

Note: Bowtie 2 is not designed with large values for -k in mind, and
when aligning reads to long, repetitive genomes large -k can be very,
very slow.

    -a

Like -k but with no upper limit on number of alignments to search for.
-a is mutually exclusive with -k.

Note: Bowtie 2 is not designed with -a mode in mind, and when aligning
reads to long, repetitive genomes this mode can be very, very slow.

Effort options

    -D <int>

Up to <int> consecutive seed extension attempts can "fail" before Bowtie
2 moves on, using the alignments found so far. A seed extension "fails"
if it does not yield a new best or a new second-best alignment. This
limit is automatically adjusted up when -k or -a are specified. Default:
15.

    -R <int>

<int> is the maximum number of times Bowtie 2 will "re-seed" reads with
repetitive seeds. When "re-seeding," Bowtie 2 simply chooses a new set
of seeds (same length, same number of mismatches allowed) at different
offsets and searches for more alignments. A read is considered to have
repetitive seeds if the total number of seed hits divided by the number
of seeds that aligned at least once is greater than 300. Default: 2.

Paired-end options

    -I/--minins <int>

The minimum fragment length for valid paired-end alignments. E.g. if
-I 60 is specified and a paired-end alignment consists of two 20-bp
alignments in the appropriate orientation with a 20-bp gap between them,
that alignment is considered valid (as long as -X is also satisfied). A
19-bp gap would not be valid in that case. If trimming options -3 or -5
are also used, the -I constraint is applied with respect to the
untrimmed mates.

The larger the difference between -I and -X, the slower Bowtie 2 will
run.
This is because larger differences between -I and -X require that
Bowtie 2 scan a larger window to determine if a concordant alignment
exists. For typical fragment length ranges (200 to 400 nucleotides),
Bowtie 2 is very efficient.

Default: 0 (essentially imposing no minimum).

    -X/--maxins <int>

The maximum fragment length for valid paired-end alignments. E.g. if
-X 100 is specified and a paired-end alignment consists of two 20-bp
alignments in the proper orientation with a 60-bp gap between them, that
alignment is considered valid (as long as -I is also satisfied). A 61-bp
gap would not be valid in that case. If trimming options -3 or -5 are
also used, the -X constraint is applied with respect to the untrimmed
mates, not the trimmed mates.

The larger the difference between -I and -X, the slower Bowtie 2 will
run. This is because larger differences between -I and -X require that
Bowtie 2 scan a larger window to determine if a concordant alignment
exists. For typical fragment length ranges (200 to 400 nucleotides),
Bowtie 2 is very efficient.

Default: 500.

    --fr/--rf/--ff

The upstream/downstream mate orientations for a valid paired-end
alignment against the forward reference strand. E.g., if --fr is
specified and there is a candidate paired-end alignment where mate 1
appears upstream of the reverse complement of mate 2 and the fragment
length constraints (-I and -X) are met, that alignment is valid. Also,
if mate 2 appears upstream of the reverse complement of mate 1 and all
other constraints are met, that too is valid. --rf likewise requires
that an upstream mate1 be reverse-complemented and a downstream mate2 be
forward-oriented. --ff requires both an upstream mate 1 and a downstream
mate 2 to be forward-oriented. Default: --fr (appropriate for Illumina's
Paired-end Sequencing Assay).
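
The -I and -X examples above reduce to simple arithmetic on the fragment
length. The Python sketch below assumes the fragment length is the sum
of the two mate alignment lengths and the gap between them, as in those
examples. It deliberately ignores mate orientation, trimming and soft
clipping, and the function name is hypothetical rather than part of
Bowtie 2.

```python
def fragment_length_ok(mate1_len, gap, mate2_len, minins=0, maxins=500):
    """Check a fragment length against -I (minins) and -X (maxins).

    Simplifying assumption: the fragment spans mate 1, the gap between
    the mates, and mate 2, with no overlap, trimming or soft clipping.
    """
    fragment_len = mate1_len + gap + mate2_len
    return minins <= fragment_len <= maxins


# -X 100: two 20-bp mates with a 60-bp gap fit; a 61-bp gap does not.
print(fragment_length_ok(20, 60, 20, maxins=100))  # prints True
print(fragment_length_ok(20, 61, 20, maxins=100))  # prints False

# -I 60: two 20-bp mates with a 20-bp gap fit; a 19-bp gap does not.
print(fragment_length_ok(20, 20, 20, minins=60))   # prints True
print(fragment_length_ok(20, 19, 20, minins=60))   # prints False
```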

    --no-mixed

By default, when bowtie2 cannot find a concordant or discordant
alignment for a pair, it then tries to find alignments for the
individual mates. This option disables that behavior.

    --no-discordant

By default, bowtie2 looks for discordant alignments if it cannot find
any concordant alignments. A discordant alignment is an alignment where
both mates align uniquely, but that does not satisfy the paired-end
constraints (--fr/--rf/--ff, -I, -X). This option disables that
behavior.

    --dovetail

If the mates "dovetail", that is if one mate alignment extends past the
beginning of the other such that the wrong mate begins upstream,
consider that to be concordant. See also: Mates can overlap, contain or
dovetail each other. Default: mates cannot dovetail in a concordant
alignment.

    --no-contain

If one mate alignment contains the other, consider that to be
non-concordant. See also: Mates can overlap, contain or dovetail each
other. Default: a mate can contain the other in a concordant alignment.

    --no-overlap

If one mate alignment overlaps the other at all, consider that to be
non-concordant. See also: Mates can overlap, contain or dovetail each
other. Default: mates can overlap in a concordant alignment.

BAM options

    --align-paired-reads

Bowtie 2 will, by default, attempt to align unpaired BAM reads. Use this
option to align paired-end reads instead.

    --preserve-tags

Preserve tags from the original BAM record by appending them to the end
of the corresponding Bowtie 2 SAM output.

Output options

    -t/--time

Print the wall-clock time required to load the index files and align the
reads. This is printed to the "standard error" ("stderr") filehandle.
Default: off.

    --un <path>
    --un-gz <path>
    --un-bz2 <path>
    --un-lz4 <path>

Write unpaired reads that fail to align to file at <path>. These reads
correspond to the SAM records with the FLAGS 0x4 bit set and neither the
0x40 nor 0x80 bits set. If --un-gz is specified, output will be gzip
compressed. If --un-bz2 or --un-lz4 is specified, output will be bzip2
or lz4 compressed. Reads written in this way will appear exactly as they
did in the input file, without any modification (same sequence, same
name, same quality string, same quality encoding). Reads will not
necessarily appear in the same order as they did in the input.

    --al <path>
    --al-gz <path>
    --al-bz2 <path>
    --al-lz4 <path>

Write unpaired reads that align at least once to file at <path>. These
reads correspond to the SAM records with the FLAGS 0x4, 0x40, and 0x80
bits unset. If --al-gz is specified, output will be gzip compressed. If
--al-bz2 is specified, output will be bzip2 compressed. Similarly if
--al-lz4 is specified, output will be lz4 compressed. Reads written in
this way will appear exactly as they did in the input file, without any
modification (same sequence, same name, same quality string, same
quality encoding). Reads will not necessarily appear in the same order
as they did in the input.

    --un-conc <path>
    --un-conc-gz <path>
    --un-conc-bz2 <path>
    --un-conc-lz4 <path>

Write paired-end reads that fail to align concordantly to file(s) at
<path>. These reads correspond to the SAM records with the FLAGS 0x4 bit
set and either the 0x40 or 0x80 bit set (depending on whether it's mate
#1 or #2). .1 and .2 strings are added to the filename to distinguish
which file contains mate #1 and mate #2. If a percent symbol, %, is used
in <path>, the percent symbol is replaced with 1 or 2 to make the
per-mate filenames.
Otherwise, .1 or .2 are added before the final dot in <path> to make the
per-mate filenames. Reads written in this way will appear exactly as
they did in the input files, without any modification (same sequence,
same name, same quality string, same quality encoding). Reads will not
necessarily appear in the same order as they did in the inputs.

    --al-conc <path>
    --al-conc-gz <path>
    --al-conc-bz2 <path>
    --al-conc-lz4 <path>

Write paired-end reads that align concordantly at least once to file(s)
at <path>. These reads correspond to the SAM records with the FLAGS 0x4
bit unset and either the 0x40 or 0x80 bit set (depending on whether it's
mate #1 or #2). .1 and .2 strings are added to the filename to
distinguish which file contains mate #1 and mate #2. If a percent
symbol, %, is used in <path>, the percent symbol is replaced with 1 or 2
to make the per-mate filenames. Otherwise, .1 or .2 are added before the
final dot in <path> to make the per-mate filenames. Reads written in
this way will appear exactly as they did in the input files, without any
modification (same sequence, same name, same quality string, same
quality encoding). Reads will not necessarily appear in the same order
as they did in the inputs.

    --quiet

Print nothing besides alignments and serious errors.

    --met-file <path>

Write bowtie2 metrics to file <path>. Having alignment metrics can be
useful for debugging certain problems, especially performance issues.
See also: --met. Default: metrics disabled.

    --met-stderr <path>

Write bowtie2 metrics to the "standard error" ("stderr") filehandle.
This is not mutually exclusive with --met-file. Having alignment metrics
can be useful for debugging certain problems, especially performance
issues. See also: --met. Default: metrics disabled.

    --met <int>

Write a new bowtie2 metrics record every <int> seconds. Only matters if
either --met-stderr or --met-file are specified. Default: 1.

SAM options

    --no-unal

Suppress SAM records for reads that failed to align.

    --no-hd

Suppress SAM header lines (starting with @).

    --no-sq

Suppress @SQ SAM header lines.

    --rg-id <text>

Set the read group ID to <text>. This causes the SAM @RG header line to
be printed, with <text> as the value associated with the ID: tag. It
also causes the RG:Z: extra field to be attached to each SAM output
record, with value set to <text>.

    --rg <text>

Add <text> (usually of the form TAG:VAL, e.g. SM:Pool1) as a field on
the @RG header line. Note: in order for the @RG line to appear, --rg-id
must also be specified. This is because the ID tag is required by the
SAM Spec. Specify --rg multiple times to set multiple fields. See the
SAM Spec for details about what fields are legal.

    --omit-sec-seq

When printing secondary alignments, Bowtie 2 by default will write out
the SEQ and QUAL strings. Specifying this option causes Bowtie 2 to
print an asterisk in those fields instead.

    --soft-clipped-unmapped-tlen

Consider soft-clipped bases unmapped when calculating TLEN. Only
available in --local mode.

    --sam-no-qname-trunc

Suppress the standard behavior of truncating the read name at the first
whitespace, at the expense of generating non-standard SAM.

    --xeq

Use '='/'X', instead of 'M', to specify matches/mismatches in the SAM
record.

    --sam-append-comment

Append FASTA/FASTQ comment to SAM record, where a comment is everything
after the first space in the read name.

Performance options

    -o/--offrate <int>

Override the offrate of the index with <int>.
If <int> is greater than
the offrate used to build the index, then some row markings are
discarded when the index is read into memory. This reduces the memory
footprint of the aligner but requires more time to calculate text
offsets. <int> must be greater than the value used to build the index.

    -p/--threads NTHREADS

Launch NTHREADS parallel search threads (default: 1). Threads will run
on separate processors/cores and synchronize when parsing reads and
outputting alignments. Searching for alignments is highly parallel, and
speedup is close to linear. Increasing -p increases Bowtie 2's memory
footprint. E.g. when aligning to a human genome index, increasing -p
from 1 to 8 increases the memory footprint by a few hundred megabytes.
This option is only available if bowtie is linked with the pthreads
library (i.e. if BOWTIE_PTHREADS=0 is not specified at build time).

    --reorder

Guarantees that output SAM records are printed in an order corresponding
to the order of the reads in the original input file, even when -p is
set greater than 1. Specifying --reorder and setting -p greater than 1
causes Bowtie 2 to run somewhat slower and use somewhat more memory than
if --reorder were not specified. Has no effect if -p is set to 1, since
output order will naturally correspond to input order in that case.

    --mm

Use memory-mapped I/O to load the index, rather than typical file I/O.
Memory-mapping allows many concurrent bowtie processes on the same
computer to share the same memory image of the index (i.e. you pay the
memory overhead just once). This facilitates memory-efficient
parallelization of bowtie in situations where using -p is not possible
or not preferable.

Other options

    --qc-filter

Filter out reads for which the QSEQ filter field is non-zero. Only has
an effect when read format is --qseq. Default: off.

    --seed <int>

Use <int> as the seed for pseudo-random number generator. Default: 0.

    --non-deterministic

Normally, Bowtie 2 re-initializes its pseudo-random generator for each
read. It seeds the generator with a number derived from (a) the read
name, (b) the nucleotide sequence, (c) the quality sequence, (d) the
value of the --seed option. This means that if two reads are identical
(same name, same nucleotides, same qualities) Bowtie 2 will find and
report the same alignment(s) for both, even if there was ambiguity. When
--non-deterministic is specified, Bowtie 2 re-initializes its
pseudo-random generator for each read using the current time. This means
that Bowtie 2 will not necessarily report the same alignment for two
identical reads. This is counter-intuitive for some users, but might be
more appropriate in situations where the input consists of many
identical reads.

    --version

Print version information and quit.

    -h/--help

Print usage information and quit.

SAM output

Following is a brief description of the SAM format as output by bowtie2.
For more details, see the SAM format specification.

By default, bowtie2 prints a SAM header with @HD, @SQ and @PG lines.
When one or more --rg arguments are specified, bowtie2 will also print
an @RG line that includes all user-specified --rg tokens separated by
tabs.

Each subsequent line describes an alignment or, if the read failed to
align, a read. Each line is a collection of at least 12 fields separated
by tabs; from left to right, the fields are:

1.  Name of read that aligned.

    Note that the SAM specification disallows whitespace in the read
    name. If the read name contains any whitespace characters, Bowtie 2
    will truncate the name at the first whitespace character. This is
    similar to the behavior of other tools.
    The standard behavior of
    truncating at the first whitespace can be suppressed with
    --sam-no-qname-trunc at the expense of generating non-standard SAM.

2.  Sum of all applicable flags. Flags relevant to Bowtie are:

        1

    The read is one of a pair

        2

    The alignment is one end of a proper paired-end alignment

        4

    The read has no reported alignments

        8

    The read is one of a pair and its mate has no reported alignments

        16

    The alignment is to the reverse reference strand

        32

    The other mate in the paired-end alignment is aligned to the reverse
    reference strand

        64

    The read is mate 1 in a pair

        128

    The read is mate 2 in a pair

    Thus, an unpaired read that aligns to the reverse reference strand
    will have flag 16. A paired-end read that aligns and is the first
    mate in the pair will have flag 83 (= 64 + 16 + 2 + 1).

3.  Name of reference sequence where alignment occurs

4.  1-based offset into the forward reference strand where leftmost
    character of the alignment occurs

5.  Mapping quality

6.  CIGAR string representation of alignment

7.  Name of reference sequence where mate's alignment occurs. Set to =
    if the mate's reference sequence is the same as this alignment's, or
    * if there is no mate.

8.  1-based offset into the forward reference strand where leftmost
    character of the mate's alignment occurs. Offset is 0 if there is no
    mate.

9.  Inferred fragment length. Size is negative if the mate's alignment
    occurs upstream of this alignment. Size is 0 if the mates did not
    align concordantly. However, size is non-0 if the mates aligned
    discordantly to the same chromosome.

10. Read sequence (reverse-complemented if aligned to the reverse
    strand)

11.
    ASCII-encoded read qualities (reversed if the read
    aligned to the reverse strand). The encoded quality values are on
    the Phred quality scale and the encoding is ASCII-offset by 33
    (ASCII char !), similarly to a FASTQ file.

12. Optional fields. Fields are tab-separated. bowtie2 outputs zero or
    more of these optional fields for each alignment, depending on the
    type of the alignment:

    AS:i:<N>

Alignment score. Can be negative. Can be greater than 0 in --local mode
(but not in --end-to-end mode). Only present if SAM record is for an
aligned read.

    XS:i:<N>

Alignment score for the best-scoring alignment found other than the
alignment reported. Can be negative. Can be greater than 0 in --local
mode (but not in --end-to-end mode). Only present if the SAM record is
for an aligned read and more than one alignment was found for the read.
Note that, when the read is part of a concordantly-aligned pair, this
score could be greater than AS:i.

    YS:i:<N>

Alignment score for opposite mate in the paired-end alignment. Only
present if the SAM record is for a read that aligned as part of a
paired-end alignment.

    XN:i:<N>

The number of ambiguous bases in the reference covering this alignment.
Only present if SAM record is for an aligned read.

    XM:i:<N>

The number of mismatches in the alignment. Only present if SAM record is
for an aligned read.

    XO:i:<N>

The number of gap opens, for both read and reference gaps, in the
alignment. Only present if SAM record is for an aligned read.

    XG:i:<N>

The number of gap extensions, for both read and reference gaps, in the
alignment. Only present if SAM record is for an aligned read.

    NM:i:<N>

The edit distance; that is, the minimal number of one-nucleotide edits
(substitutions, insertions and deletions) needed to transform the read
string into the reference string. Only present if SAM record is for an
aligned read.

    YF:Z:<S>

String indicating reason why the read was filtered out. See also:
Filtering. Only appears for reads that were filtered out.

    YT:Z:<S>

Value of UU indicates the read was not part of a pair. Value of CP
indicates the read was part of a pair and the pair aligned concordantly.
Value of DP indicates the read was part of a pair and the pair aligned
discordantly. Value of UP indicates the read was part of a pair but the
pair failed to align either concordantly or discordantly.

    MD:Z:<S>

A string representation of the mismatched reference bases in the
alignment. See SAM Tags format specification for details. Only present
if SAM record is for an aligned read.

The bowtie2-build indexer

bowtie2-build builds a Bowtie index from a set of DNA sequences.
bowtie2-build outputs a set of 6 files with suffixes .1.bt2, .2.bt2,
.3.bt2, .4.bt2, .rev.1.bt2, and .rev.2.bt2. In the case of a large index
these suffixes will have a bt2l termination. These files together
constitute the index: they are all that is needed to align reads to that
reference. The original sequence FASTA files are no longer used by
Bowtie 2 once the index is built.

Bowtie 2's .bt2 index format is different from Bowtie 1's .ebwt format,
and they are not compatible with each other.

Use of Karkkainen's blockwise algorithm allows bowtie2-build to trade
off between running time and memory usage. bowtie2-build has three
options governing how it makes this trade: -p/--packed,
--bmax/--bmaxdivn, and --dcv.
By default, bowtie2-build will
automatically search for the settings that yield the best running time
without exhausting memory. This behavior can be disabled using the
-a/--noauto option.

The indexer provides options pertaining to the "shape" of the index,
e.g. --offrate governs the fraction of Burrows-Wheeler rows that are
"marked" (i.e., the density of the suffix-array sample; see the original
FM Index paper for details). All of these options are potentially
profitable trade-offs depending on the application. They have been set
to defaults that are reasonable for most cases according to our
experiments. See Performance tuning for details.

bowtie2-build can generate either small or large indexes. The wrapper
will decide which based on the length of the input genome. If the
reference does not exceed 4 billion characters but a large index is
preferred, the user can specify --large-index to force bowtie2-build to
build a large index instead.

The Bowtie 2 index is based on the FM Index of Ferragina and Manzini,
which in turn is based on the Burrows-Wheeler transform. The algorithm
used to build the index is based on the blockwise algorithm of
Karkkainen.

Command Line

Usage:

    bowtie2-build [options]* <reference_in> <bt2_base>

Main arguments

    <reference_in>

A comma-separated list of FASTA files containing the reference sequences
to be aligned to, or, if -c is specified, the sequences themselves.
E.g., <reference_in> might be chr1.fa,chr2.fa,chrX.fa,chrY.fa, or, if -c
is specified, this might be GGTCATCCT,ACGGGTCGT,CCGTTCTATGCGGCTTA.

    <bt2_base>

The basename of the index files to write. By default, bowtie2-build
writes files named NAME.1.bt2, NAME.2.bt2, NAME.3.bt2, NAME.4.bt2,
NAME.rev.1.bt2, and NAME.rev.2.bt2, where NAME is <bt2_base>.
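
When the FASTA files are held as separate shell arguments, the
comma-separated <reference_in> value can be assembled before calling
bowtie2-build. The following is a minimal sketch; join_by_comma is an
illustrative helper rather than part of Bowtie 2, my_index is a
placeholder basename, and the sketch assumes no file name itself
contains a comma.

```shell
# Illustrative helper (not shipped with Bowtie 2): join all arguments
# with commas, the form bowtie2-build expects for <reference_in>.
join_by_comma() {
    # "$*" joins the arguments using the first character of IFS;
    # the subshell keeps the IFS change from leaking to the caller.
    (IFS=,; printf '%s\n' "$*")
}

# Usage (my_index is a placeholder basename):
#   bowtie2-build "$(join_by_comma chr1.fa chr2.fa chrX.fa chrY.fa)" my_index
```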

Options

    -f

The reference input files (specified as <reference_in>) are FASTA files
(usually having extension .fa, .mfa, .fna or similar).

    -c

The reference sequences are given on the command line. I.e.
<reference_in> is a comma-separated list of sequences rather than a list
of FASTA files.

    --large-index

Force bowtie2-build to build a large index, even if the reference is
less than ~ 4 billion nucleotides in length.

    -a/--noauto

Disable the default behavior whereby bowtie2-build automatically selects
values for the --bmax, --dcv and --packed parameters according to
available memory. Instead, the user may specify values for those
parameters. If memory is exhausted during indexing, an error message
will be printed; it is up to the user to try new parameters.

    -p/--packed

Use a packed (2-bits-per-nucleotide) representation for DNA strings.
This saves memory but makes indexing 2-3 times slower. Default: off.
This is configured automatically by default; use -a/--noauto to
configure manually.

    --bmax <int>

The maximum number of suffixes allowed in a block. Allowing more
suffixes per block makes indexing faster, but increases peak memory
usage. Setting this option overrides any previous setting for --bmax or
--bmaxdivn. Default (in terms of the --bmaxdivn parameter) is --bmaxdivn
4 * number of threads. This is configured automatically by default; use
-a/--noauto to configure manually.

    --bmaxdivn <int>

The maximum number of suffixes allowed in a block, expressed as a
fraction of the length of the reference. Setting this option overrides
any previous setting for --bmax or --bmaxdivn. Default: --bmaxdivn 4 *
number of threads. This is configured automatically by default; use
-a/--noauto to configure manually.

    --dcv <int>

Use <int> as the period for the difference-cover sample. A larger period
yields less memory overhead, but may make suffix sorting slower,
especially if repeats are present. Must be a power of 2 no greater than
4096. Default: 1024. This is configured automatically by default; use
-a/--noauto to configure manually.

    --nodc

Disable use of the difference-cover sample. Suffix sorting becomes
quadratic-time in the worst case (where the worst case is an extremely
repetitive reference). Default: off.

    -r/--noref

Do not build the NAME.3.bt2 and NAME.4.bt2 portions of the index, which
contain a bitpacked version of the reference sequences and are used for
paired-end alignment.

    -3/--justref

Build only the NAME.3.bt2 and NAME.4.bt2 portions of the index, which
contain a bitpacked version of the reference sequences and are used for
paired-end alignment.

    -o/--offrate <int>

To map alignments back to positions on the reference sequences, it's
necessary to annotate ("mark") some or all of the Burrows-Wheeler rows
with their corresponding location on the genome. -o/--offrate governs
how many rows get marked: the indexer will mark every 2^<int> rows.
Marking more rows makes reference-position lookups faster, but requires
more memory to hold the annotations at runtime. The default is 5 (every
32nd row is marked; for human genome, annotations occupy about 340
megabytes).

    -t/--ftabchars <int>

The ftab is the lookup table used to calculate an initial
Burrows-Wheeler range with respect to the first <int> characters of the
query. A larger <int> yields a larger lookup table but faster query
times. The ftab has size 4^(<int>+1) bytes. The default setting is 10
(ftab is 4MB).

    --seed <int>

Use <int> as the seed for pseudo-random number generator.
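
The two size formulas quoted for -o/--offrate and -t/--ftabchars can be
checked with shell arithmetic. This is only a sketch of the arithmetic;
the function names are illustrative and are not part of bowtie2-build.

```shell
# Illustrative helpers (not part of bowtie2-build).

# ftab size in bytes for a given -t/--ftabchars value: 4^(t+1).
# Since 4^n = 2^(2n), a left shift keeps this POSIX-compatible.
ftab_bytes() {
    echo $(( 1 << (2 * ($1 + 1)) ))
}

# Spacing between marked rows for a given -o/--offrate value: 2^o.
offrate_spacing() {
    echo $(( 1 << $1 ))
}

# Usage with the documented defaults:
#   ftab_bytes 10       # 4194304 bytes, i.e. the 4MB quoted above
#   offrate_spacing 5   # 32, i.e. every 32nd row is marked
```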

    --cutoff <int>

Index only the first <int> bases of the reference sequences (cumulative
across sequences) and ignore the rest.

    -q/--quiet

bowtie2-build is verbose by default. With this option bowtie2-build will
print only error messages.

    --threads <int>

By default bowtie2-build uses only one thread. Increasing the number
of threads will speed up the index building considerably in most cases.

    -h/--help

Print usage information and quit.

    --version

Print version information and quit.

The bowtie2-inspect index inspector

bowtie2-inspect extracts information from a Bowtie index about what kind
of index it is and what reference sequences were used to build it. When
run without any options, the tool will output a FASTA file containing
the sequences of the original references (with all non-A/C/G/T
characters converted to Ns). It can also be used to extract just the
reference sequence names using the -n/--names option or a more verbose
summary using the -s/--summary option.

Command Line

Usage:

    bowtie2-inspect [options]* <bt2_base>

Main arguments

    <bt2_base>

The basename of the index to be inspected. The basename is the name of
any of the index files but with the .X.bt2 or .rev.X.bt2 suffix omitted.
bowtie2-inspect first looks in the current directory for the index
files, then in the directory specified in the BOWTIE2_INDEXES
environment variable.

Options

    -a/--across <int>

When printing FASTA output, output a newline character every <int> bases
(default: 60).

    -n/--names

Print reference sequence names, one per line, and quit.

    -s/--summary

Print a summary that includes information about index settings, as well
as the names and lengths of the input sequences.
The summary has this
format:

    Colorspace  <0 or 1>
    SA-Sample   1 in <sample>
    FTab-Chars  <chars>
    Sequence-1  <name>  <len>
    Sequence-2  <name>  <len>
    ...
    Sequence-N  <name>  <len>

Fields are separated by tabs. Colorspace is always set to 0 for Bowtie
2.

    -v/--verbose

Print verbose output (for debugging).

    --version

Print version information and quit.

    -h/--help

Print usage information and quit.

Getting started with Bowtie 2: Lambda phage example

Bowtie 2 comes with some example files to get you started. The example
files are not scientifically significant; we use the Lambda phage
reference genome simply because it's short, and the reads were generated
by a computer program, not a sequencer. However, these files will let
you start running Bowtie 2 and downstream tools right away.

First follow the manual instructions to obtain Bowtie 2. Set the
BT2_HOME environment variable to point to the new Bowtie 2 directory
containing the bowtie2, bowtie2-build and bowtie2-inspect binaries. This
is important, as the BT2_HOME variable is used in the commands below to
refer to that directory.

Indexing a reference genome

To create an index for the Lambda phage reference genome included with
Bowtie 2, create a new temporary directory (it doesn't matter where),
change into that directory, and run:

    $BT2_HOME/bowtie2-build $BT2_HOME/example/reference/lambda_virus.fa lambda_virus

The command should print many lines of output then quit. When the
command completes, the current directory will contain six new files
that all start with lambda_virus and end with .1.bt2, .2.bt2, .3.bt2,
.4.bt2, .rev.1.bt2, and .rev.2.bt2. These files constitute the index -
you're done!
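
As a quick check after indexing, the six expected files can be tested
for by name. The following is a minimal sketch, not something shipped
with Bowtie 2: check_bt2_index is a hypothetical helper, and it assumes
a small index (a large index uses the bt2l termination instead).

```shell
# Hypothetical helper (not shipped with Bowtie 2): verify that all six
# small-index files exist for a given basename. Prints one line per
# missing file and returns non-zero if any file is missing.
check_bt2_index() {
    base=$1
    missing=0
    for suffix in 1 2 3 4 rev.1 rev.2; do
        if [ ! -e "$base.$suffix.bt2" ]; then
            echo "missing: $base.$suffix.bt2"
            missing=1
        fi
    done
    return "$missing"
}

# Usage, from the directory where bowtie2-build was run:
#   check_bt2_index lambda_virus && echo "index looks complete"
```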

You can use bowtie2-build to create an index for a set of FASTA files
obtained from any source, including sites such as UCSC, NCBI, and
Ensembl. When indexing multiple FASTA files, specify all the files using
commas to separate file names. For more details on how to create an
index with bowtie2-build, see the manual section on index building. You
may also want to bypass this process by obtaining a pre-built index. See
using a pre-built index below for an example.

Aligning example reads

Stay in the directory created in the previous step, which now contains
the lambda_virus index files. Next, run:

    $BT2_HOME/bowtie2 -x lambda_virus -U $BT2_HOME/example/reads/reads_1.fq -S eg1.sam

This runs the Bowtie 2 aligner, which aligns a set of unpaired reads to
the Lambda phage reference genome using the index generated in the
previous step. The alignment results in SAM format are written to the
file eg1.sam, and a short alignment summary is written to the console.
(Actually, the summary is written to the "standard error" or "stderr"
filehandle, which is typically printed to the console.)

To see the first few lines of the SAM output, run:

    head eg1.sam

You will see something like this:

    @HD VN:1.0 SO:unsorted
    @SQ SN:gi|9626243|ref|NC_001416.1| LN:48502
    @PG ID:bowtie2 PN:bowtie2 VN:2.0.1
    r1 0 gi|9626243|ref|NC_001416.1| 18401 42 122M * 0 0 TGAATGCGAACTCCGGGACGCTCAGTAATGTGACGATAGCTGAAAACTGTACGATAAACNGTACGCTGAGGGCAGAAAAAATCGTCGGGGACATTNTAAAGGCGGCGAGCGCGGCTTTTCCG +"@6<:27(F&5)9"B):%B+A-%5A?2$HCB0B+0=D<7E/<.03#!.F77@6B==?C"7>;))%;,3-$.A06+<-1/@@?,26">=?*@'0;$:;??G+:#+(A?9+10!8!?()?7C> AS:i:-5 XN:i:0 XM:i:3 XO:i:0 XG:i:0 NM:i:3 MD:Z:59G13G21G26 YT:Z:UU
    r2 0 gi|9626243|ref|NC_001416.1| 8886 42 275M * 0 0 NTTNTGATGCGGGCTTGTGGAGTTCAGCCGATCTGACTTATGTCATTACCTATGAAATGTGAGGACGCTATGCCTGTACCAAATCCTACAATGCCGGTGAAAGGTGCCGGGATCACCCTGTGGGTTTATAAGGGGATCGGTGACCCCTACGCGAATCCGCTTTCAGACGTTGACTGGTCGCGTCTGGCAAAAGTTAAAGACCTGACGCCCGGCGAACTGACCGCTGAGNCCTATGACGACAGCTATCTCGATGATGAAGATGCAGACTGGACTGC (#!!'+!$""%+(+)'%)%!+!(&++)''"#"#&#"!'!("%'""("+&%$%*%%#$%#%#!)*'(#")(($&$'&%+&#%*)*#*%*')(%+!%%*"$%"#+)$&&+)&)*+!"*)!*!("&&"*#+"&"'(%)*("'!$*!!%$&&&$!!&&"(*"$&"#&!$%'%"#)$#+%*+)!&*)+(""#!)!%*#"*)*')&")($+*%%)!*)!('(%""+%"$##"#+(('!*(($*'!"*('"+)&%#&$+('**$$&+*&!#%)')'(+(!%+ AS:i:-14 XN:i:0 XM:i:8 XO:i:0 XG:i:0 NM:i:8 MD:Z:0A0C0G0A108C23G9T81T46 YT:Z:UU
    r3 16 gi|9626243|ref|NC_001416.1| 11599 42 338M * 0 0 GGGCGCGTTACTGGGATGATCGTGAAAAGGCCCGTCTTGCGCTTGAAGCCGCCCGAAAGAAGGCTGAGCAGCAGACTCAAGAGGAGAAAAATGCGCAGCAGCGGAGCGATACCGAAGCGTCACGGCTGAAATATACCGAAGAGGCGCAGAAGGCTNACGAACGGCTGCAGACGCCGCTGCAGAAATATACCGCCCGTCAGGAAGAACTGANCAAGGCACNGAAAGACGGGAAAATCCTGCAGGCGGATTACAACACGCTGATGGCGGCGGCGAAAAAGGATTATGAAGCGACGCTGTAAAAGCCGAAACAGTCCAGCGTGAAGGTGTCTGCGGGCGAT
7F$%6=$:9B@/F'>=?!D?@0(:A*)7/>9C>6#1<6:C(.CC;#.;>;2'$4D:?&B!>689?(0(G7+0=@37F)GG=>?958.D2E04C<E,*AD%G0.%$+A:'H;?8<72:88?E6((CF)6DF#.)=>B>D-="C'B080E'5BH"77':"@70#4%A5=6.2/1>;9"&-H6)=$/0;5E:<8G!@::1?2DC7C*;@*#.1C0.D>H/20,!"C-#,6@%<+<D(AG-).?�.00'@)/F8?B!&"170,)>:?<A7#1(A@0E#&A.*DC.E")AH"+.,5,2>5"2?:G,F"D0B8D-6$65D<D!A/38860.*4;4B<*31?6 AS:i:-22 XN:i:0 XM:i:8 XO:i:0 XG:i:0 NM:i:8 MD:Z:80C4C16A52T23G30A8T76A41 YT:Z:UU
    r4 0 gi|9626243|ref|NC_001416.1| 40075 42 184M * 0 0 GGGCCAATGCGCTTACTGATGCGGAATTACGCCGTAAGGCCGCAGATGAGCTTGTCCATATGACTGCGAGAATTAACNGTGGTGAGGCGATCCCTGAACCAGTAAAACAACTTCCTGTCATGGGCGGTAGACCTCTAAATCGTGCACAGGCTCTGGCGAAGATCGCAGAAATCAAAGCTAAGT (=8B)GD04*G%&4F,1'A>.C&7=F$,+#6!))43C,5/5+)?-/0>/D3=-,2/+.1?@->;)00!'3!7BH$G)HG+ADC'#-9F)7<7"$?&.>0)@5;4,!0-#C!15CF8&HB+B==H>7,/)C5)5*+(F5A%D,EA<(>G9E0>7&/E?4%;#'92)<5+@7:A.(BG@BG86@.G AS:i:-1 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:77C106 YT:Z:UU
    r5 0 gi|9626243|ref|NC_001416.1| 48010 42 138M * 0 0 GTCAGGAAAGTGGTAAAACTGCAACTCAATTACTGCAATGCCCTCGTAATTAAGTGAATTTACAATATCGTCCTGTTCGGAGGGAAGAACGCGGGATGTTCATTCTTCATCACTTTTAATTGATGTATATGCTCTCTT 9''%<D)A03E1-*7=),:F/0!6,D9:H,<9D%:0B(%'E,(8EFG$E89B$27G8F*2+4,-!,0D5()&=(FGG:5;3*@/.0F-G#5#3->('FDFEG?)5.!)"AGADB3?6(@H(:B<>6!>;>6>G,."?% AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:138 YT:Z:UU
    r6 16 gi|9626243|ref|NC_001416.1| 41607 42 72M2D119M * 0 0 TCGATTTGCAAATACCGGAACATCTCGGTAACTGCATATTCTGCATTAAAAAATCAACGCAAAAAATCGGACGCCTGCAAAGATGAGGAGGGATTGCAGCGTGTTTTTAATGAGGTCATCACGGGATNCCATGTGCGTGACGGNCATCGGGAAACGCCAAAGGAGATTATGTACCGAGGAAGAATGTCGCT 1H#G;H"$E*E#&"*)2%66?=9/9'=;4)4/>@%+5#@#$4A*!<D=="8#1*A9BA=:(1+#C&.#(3#H=9E)AC*5,AC#E'536*2?)H14?>9'B=7(3H/B:+A:8%1-+#(E%&$$&14"76D?>7(&20H5%*&CF8!G5B+A4F$7(:"'?0$?G+$)B-?2<0<F=D!38BH,%=8&5@+ AS:i:-13 XN:i:0 XM:i:2 XO:i:1 XG:i:2 NM:i:4 MD:Z:72^TT55C15A47 YT:Z:UU
    r7 16 gi|9626243|ref|NC_001416.1| 4692 42 143M * 0 0
TCAGCCGGACGCGGGCGCTGCAGCCGTACTCGGGGATGACCGGTTACAACGGCATTATCGCCCGTCTGCAACAGGCTGCCAGCGATCCGATGGTGGACAGCATTCTGCTCGATATGGACANGCCCGGCGGGATGGTGGCGGGG -"/@*7A0)>2,AAH@&"%B)*5*23B/,)90.B@%=FE,E063C9?,:26$-0:,.,1849'4.;F>FA;76+5&$<C":$!A*,<B,<)@<'85D%C*:)30@85;?.B$05=@95DCDH<53!8G:F:B7/A.E':434> AS:i:-6 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:98G21C22 YT:Z:UU

The first few lines (beginning with @) are SAM header lines, and the
rest of the lines are SAM alignments, one line per read or mate. See the
Bowtie 2 manual section on SAM output and the SAM specification for
details about how to interpret the SAM file format.

Paired-end example

To align paired-end reads included with Bowtie 2, stay in the same
directory and run:

    $BT2_HOME/bowtie2 -x lambda_virus -1 $BT2_HOME/example/reads/reads_1.fq -2 $BT2_HOME/example/reads/reads_2.fq -S eg2.sam

This aligns a set of paired-end reads to the reference genome, with
results written to the file eg2.sam.

Local alignment example

To use local alignment to align some longer reads included with Bowtie
2, stay in the same directory and run:

    $BT2_HOME/bowtie2 --local -x lambda_virus -U $BT2_HOME/example/reads/longreads.fq -S eg3.sam

This aligns the long reads to the reference genome using local
alignment, with results written to the file eg3.sam.

Using SAMtools/BCFtools downstream

SAMtools is a collection of tools for manipulating and analyzing SAM and
BAM alignment files. BCFtools is a collection of tools for calling
variants and manipulating VCF and BCF files, and it is typically
distributed with SAMtools. Using these tools together allows you to get
from alignments in SAM format to variant calls in VCF format. This
example assumes that samtools and bcftools are installed and that the
directories containing these binaries are in your PATH environment
variable.
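
Whether both tools are reachable through PATH can be confirmed before
starting. This is a minimal sketch; missing_tools is an illustrative
helper name, not part of SAMtools, BCFtools or Bowtie 2.

```shell
# Illustrative helper: print the name of each given command that cannot
# be found on PATH (prints nothing if all are available).
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Usage: warn early if either downstream tool is not installed.
#   [ -z "$(missing_tools samtools bcftools)" ] || echo "install the tools listed above first"
```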

Run the paired-end example:

    $BT2_HOME/bowtie2 -x $BT2_HOME/example/index/lambda_virus -1 $BT2_HOME/example/reads/reads_1.fq -2 $BT2_HOME/example/reads/reads_2.fq -S eg2.sam

Use samtools view to convert the SAM file into a BAM file. BAM is the
binary format corresponding to the SAM text format. Run:

    samtools view -bS eg2.sam > eg2.bam

Use samtools sort to convert the BAM file to a sorted BAM file.

    samtools sort eg2.bam -o eg2.sorted.bam

We now have a sorted BAM file called eg2.sorted.bam. Sorted BAM is a
useful format because the alignments are (a) compressed, which is
convenient for long-term storage, and (b) sorted, which is convenient
for variant discovery. To generate variant calls in VCF format, run:

    bcftools mpileup -f $BT2_HOME/example/reference/lambda_virus.fa eg2.sorted.bam | bcftools view -Ov - > eg2.raw.bcf

Note that -Ov writes uncompressed VCF text, so eg2.raw.bcf holds VCF
despite its file extension. Then to view the variants, run:

    bcftools view eg2.raw.bcf

See the official SAMtools guide to Calling SNPs/INDELs with
SAMtools/BCFtools for more details and variations on this process.
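
When inspecting eg2.sam by hand, the FLAG column (field 2) can be
decoded with the bit values listed in the SAM output section. The
following is a minimal sketch; decode_sam_flag is an illustrative helper
and is not part of Bowtie 2 or SAMtools. It covers only the eight flags
listed in this manual.

```shell
# Illustrative helper (not part of Bowtie 2 or SAMtools): print one line
# for each bit set in a SAM FLAG value. Bit 8 refers to the mate, as
# defined by the SAM specification.
decode_sam_flag() {
    flag=$1
    for entry in \
        "1:read is one of a pair" \
        "2:alignment is one end of a proper paired-end alignment" \
        "4:read has no reported alignments" \
        "8:mate has no reported alignments" \
        "16:alignment is to the reverse reference strand" \
        "32:mate is aligned to the reverse reference strand" \
        "64:read is mate 1 in a pair" \
        "128:read is mate 2 in a pair"
    do
        bit=${entry%%:*}          # text before the first colon
        if [ $(( flag & bit )) -ne 0 ]; then
            echo "$bit ${entry#*:}"
        fi
    done
}

# Usage: decode_sam_flag 83    (83 = 64 + 16 + 2 + 1)
```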