1Introduction
2
3Bowtie 2 is an ultrafast and memory-efficient tool for aligning
4sequencing reads to long reference sequences. It is particularly good at
5aligning reads of about 50 up to 100s of characters to relatively long
6(e.g. mammalian) genomes. Bowtie 2 indexes the genome with an FM Index
7(based on the Burrows-Wheeler Transform or BWT) to keep its memory
8footprint small: for the human genome, its memory footprint is typically
9around 3.2 gigabytes of RAM. Bowtie 2 supports gapped, local, and
10paired-end alignment modes. Multiple processors can be used
11simultaneously to achieve greater alignment speed.
12
Bowtie 2 outputs alignments in SAM format, enabling interoperation with
a large number of other tools (e.g. SAMtools, GATK) that use SAM.
Bowtie 2 is distributed under the GPLv3 license, and it runs on the
command line under Windows, Mac OS X, Linux and BSD.
17
Bowtie 2 is often the first step in pipelines for comparative genomics,
including variation calling, ChIP-seq, RNA-seq and BS-seq. Bowtie 2 and
Bowtie (also called "Bowtie 1" here) are also tightly integrated into
many other tools, some of which are listed here.
22
23If you use Bowtie 2 for your published research, please cite our work.
24Papers describing Bowtie 2 are:
25
26-   Langmead B, Wilks C, Antonescu V, Charles R. Scaling read aligners
27    to hundreds of threads on general-purpose processors.
28    Bioinformatics. 2018 Jul 18. doi: 10.1093/bioinformatics/bty648.
29
30-   Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2.
31    Nature Methods. 2012 Mar 4;9(4):357-9. doi: 10.1038/nmeth.1923.
32
33How is Bowtie 2 different from Bowtie 1?
34
35Bowtie 1 was released in 2009 and was geared toward aligning the
36relatively short sequencing reads (up to 25-50 nucleotides) prevalent at
37the time. Since then, technology has improved both sequencing throughput
38(more nucleotides produced per sequencer per day) and read length (more
39nucleotides per read).
40
41The chief differences between Bowtie 1 and Bowtie 2 are:
42
431.  For reads longer than about 50 bp Bowtie 2 is generally faster, more
44    sensitive, and uses less memory than Bowtie 1. For relatively short
45    reads (e.g. less than 50 bp) Bowtie 1 is sometimes faster and/or
46    more sensitive.
47
2.  Bowtie 2 supports gapped alignment with affine gap penalties. The
    number of gaps and the gap lengths are not restricted, except by way
    of the configurable scoring scheme. Bowtie 1 finds only ungapped
    alignments.
52
533.  Bowtie 2 supports local alignment, which doesn't require reads to
54    align end-to-end. Local alignments might be "trimmed" ("soft
55    clipped") at one or both extremes in a way that optimizes alignment
56    score. Bowtie 2 also supports end-to-end alignment which, like
57    Bowtie 1, requires that the read align entirely.
58
594.  There is no upper limit on read length in Bowtie 2. Bowtie 1 had an
60    upper limit of around 1000 bp.
61
625.  Bowtie 2 allows alignments to overlap ambiguous characters (e.g. Ns)
63    in the reference. Bowtie 1 does not.
64
6.  Bowtie 2 does away with Bowtie 1's notion of alignment "stratum",
    and its distinction between "Maq-like" and "end-to-end" modes. In
    Bowtie 2 all alignments lie along a continuous spectrum of alignment
    scores, where the scoring scheme is similar to Needleman-Wunsch and
    Smith-Waterman.
70
717.  Bowtie 2's paired-end alignment is more flexible. E.g. for pairs
72    that do not align in a paired fashion, Bowtie 2 attempts to find
73    unpaired alignments for each mate.
74
8.  Bowtie 2 reports a spectrum of mapping qualities, in contrast to
    Bowtie 1, which reports either 0 or high.
77
789.  Bowtie 2 does not align colorspace reads.
79
80Bowtie 2 is not a "drop-in" replacement for Bowtie 1. Bowtie 2's
81command-line arguments and genome index format are both different from
82Bowtie 1's.
83
84What isn't Bowtie 2?
85
86Bowtie 2 is geared toward aligning relatively short sequencing reads to
87long genomes. That said, it handles arbitrarily small reference
88sequences (e.g. amplicons) and very long reads (i.e. upwards of 10s or
89100s of kilobases), though it is slower in those settings. It is
90optimized for the read lengths and error modes yielded by typical
91Illumina sequencers.
92
93Bowtie 2 does not support alignment of colorspace reads. (Bowtie 1
94does.)
95
96Obtaining Bowtie 2
97
98Bowtie 2 is available from various package managers, notably Bioconda.
99With Bioconda installed, you should be able to install Bowtie 2 with
100conda install bowtie2.
101
102Containerized versions of Bowtie 2 are also available via the
103Biocontainers project (e.g. via Docker Hub).
104
105You can also download Bowtie 2 sources and binaries from the Download
106section of the Sourceforge site. Binaries are available for the x86_64
107architecture running Linux, Mac OS X, and Windows. FreeBSD users can
108obtain the latest version of Bowtie 2 from ports using
109pkg install bowtie2. If you plan to compile Bowtie 2 yourself, make sure
110to get the source package, i.e., the filename that ends in
111"-source.zip".
112
113Building from source
114
Building Bowtie 2 from source requires a GNU-like environment with
Clang/GCC, GNU Make and other basics. It should be possible to build
Bowtie 2 on most vanilla *NIX installations or on a Mac installation
with Xcode installed. Bowtie 2 can also be built on Windows using a
64-bit MinGW distribution and MSYS. To simplify the MinGW setup, it
might be worth investigating popular MinGW personal builds, since these
come already prepared with most of the toolchains needed.
124
125First, download the source package from the sourceforge site. Make sure
126you're getting the source package; the file downloaded should end in
127-source.zip. Unzip the file, change to the unzipped directory, and build
128the Bowtie 2 tools by running GNU make (usually with the command make,
129but sometimes with gmake) with no arguments. If building with MinGW, run
130make from the MSYS environment.
131
The Bowtie 2 Makefile also includes recipes for basic automatic
dependency management. Running make static-libs && make STATIC_BUILD=1
will issue a series of commands that will:

1.  download zstd and zlib
2.  compile them as static libraries
3.  link the resulting libraries to the compiled Bowtie 2 binaries
137
138As of version 2.3.5 bowtie2 now supports aligning SRA reads. Prepackaged
139builds will include a package that supports SRA. If you're building
140bowtie2 from source please make sure that the Java runtime is available
141on your system. You can then proceed with the build by running
142make sra-deps && make USE_SRA=1.
143
144Adding to PATH
145
146By adding your new Bowtie 2 directory to your PATH environment variable,
147you ensure that whenever you run bowtie2, bowtie2-build or
148bowtie2-inspect from the command line, you will get the version you just
149installed without having to specify the entire path. This is recommended
150for most users. To do this, follow your operating system's instructions
151for adding the directory to your PATH.
152
153If you would like to install Bowtie 2 by copying the Bowtie 2 executable
154files to an existing directory in your PATH, make sure that you copy all
155the executables, including bowtie2, bowtie2-align-s, bowtie2-align-l,
156bowtie2-build, bowtie2-build-s, bowtie2-build-l, bowtie2-inspect,
157bowtie2-inspect-s and bowtie2-inspect-l.
158
159The bowtie2 aligner
160
161bowtie2 takes a Bowtie 2 index and a set of sequencing read files and
162outputs a set of alignments in SAM format.
163
164"Alignment" is the process by which we discover how and where the read
165sequences are similar to the reference sequence. An "alignment" is a
166result from this process, specifically: an alignment is a way of "lining
167up" some or all of the characters in the read with some characters from
168the reference in a way that reveals how they're similar. For example:
169
170      Read:      GACTGGGCGATCTCGACTTCG
171                 |||||  |||||||||| |||
172      Reference: GACTG--CGATCTCGACATCG
173
174Where dash symbols represent gaps and vertical bars show where aligned
175characters match.
176
177We use alignment to make an educated guess as to where a read originated
178with respect to the reference genome. It's not always possible to
179determine this with certainty. For instance, if the reference genome
180contains several long stretches of As (AAAAAAAAA etc.) and the read
181sequence is a short stretch of As (AAAAAAA), we cannot know for certain
182exactly where in the sea of As the read originated.
183
184End-to-end alignment versus local alignment
185
186By default, Bowtie 2 performs end-to-end read alignment. That is, it
187searches for alignments involving all of the read characters. This is
188also called an "untrimmed" or "unclipped" alignment.
189
190When the --local option is specified, Bowtie 2 performs local read
191alignment. In this mode, Bowtie 2 might "trim" or "clip" some read
192characters from one or both ends of the alignment if doing so maximizes
193the alignment score.
194
195End-to-end alignment example
196
197The following is an "end-to-end" alignment because it involves all the
198characters in the read. Such an alignment can be produced by Bowtie 2 in
199either end-to-end mode or in local mode.
200
201    Read:      GACTGGGCGATCTCGACTTCG
202    Reference: GACTGCGATCTCGACATCG
203
204    Alignment:
205      Read:      GACTGGGCGATCTCGACTTCG
206                 |||||  |||||||||| |||
207      Reference: GACTG--CGATCTCGACATCG
208
209Local alignment example
210
211The following is a "local" alignment because some of the characters at
212the ends of the read do not participate. In this case, 4 characters are
213omitted (or "soft trimmed" or "soft clipped") from the beginning and 3
214characters are omitted from the end. This sort of alignment can be
215produced by Bowtie 2 only in local mode.
216
217    Read:      ACGGTTGCGTTAATCCGCCACG
218    Reference: TAACTTGCGTTAAATCCGCCTGG
219
220    Alignment:
221      Read:      ACGGTTGCGTTAA-TCCGCCACG
222                     ||||||||| ||||||
223      Reference: TAACTTGCGTTAAATCCGCCTGG
224
225Scores: higher = more similar
226
227An alignment score quantifies how similar the read sequence is to the
228reference sequence aligned to. The higher the score, the more similar
229they are. A score is calculated by subtracting penalties for each
230difference (mismatch, gap, etc.) and, in local alignment mode, adding
231bonuses for each match.
232
233The scores can be configured with the --ma (match bonus), --mp (mismatch
234penalty), --np (penalty for having an N in either the read or the
235reference), --rdg (affine read gap penalty) and --rfg (affine reference
236gap penalty) options.
237
238End-to-end alignment score example
239
240A mismatched base at a high-quality position in the read receives a
241penalty of -6 by default. A length-2 read gap receives a penalty of -11
242by default (-5 for the gap open, -3 for the first extension, -3 for the
243second extension). Thus, in end-to-end alignment mode, if the read is 50
244bp long and it matches the reference exactly except for one mismatch at
245a high-quality position and one length-2 read gap, then the overall
246score is -(6 + 11) = -17.
247
248The best possible alignment score in end-to-end mode is 0, which happens
249when there are no differences between the read and the reference.
250
251Local alignment score example
252
A mismatched base at a high-quality position in the read receives a
penalty of -6 by default. A length-2 read gap receives a penalty of -11
by default (-5 for the gap open, -3 for the first extension, -3 for the
second extension). A base that matches receives a bonus of +2 by
default. Thus, in local alignment mode, if the read is 50 bp long and it
matches the reference exactly except for one mismatch at a high-quality
position and one length-2 read gap, then the overall score equals the
total bonus, 2 * 49 = 98, minus the total penalty, 6 + 11 = 17, which
gives 81.
261
262The best possible score in local mode equals the match bonus times the
263length of the read. This happens when there are no differences between
264the read and the reference.
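
The two worked score examples above can be reproduced with a few lines
of Python. This is a minimal sketch, not Bowtie 2's implementation: the
function name alignment_score and its parameters are hypothetical, the
penalty values are the defaults quoted above, and quality-aware mismatch
penalties and N penalties are ignored.

```python
def alignment_score(matches, mismatches, read_gap_lengths, local,
                    match_bonus=2, mismatch_penalty=6,
                    gap_open=5, gap_extend=3):
    """Score an alignment from its match, mismatch and read-gap counts."""
    # Each gap costs the open penalty plus one extension penalty per gapped position.
    gap_penalty = sum(gap_open + gap_extend * n for n in read_gap_lengths)
    penalty = mismatch_penalty * mismatches + gap_penalty
    # Match bonuses are only awarded in local mode.
    bonus = match_bonus * matches if local else 0
    return bonus - penalty

# 50 bp read: 49 matches, 1 high-quality mismatch, one length-2 read gap.
end_to_end = alignment_score(49, 1, [2], local=False)
local = alignment_score(49, 1, [2], local=True)
print(end_to_end, local)  # -17 81
```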
265
266Valid alignments meet or exceed the minimum score threshold
267
268For an alignment to be considered "valid" (i.e. "good enough") by Bowtie
2692, it must have an alignment score no less than the minimum score
270threshold. The threshold is configurable and is expressed as a function
271of the read length. In end-to-end alignment mode, the default minimum
272score threshold is -0.6 + -0.6 * L, where L is the read length. In local
273alignment mode, the default minimum score threshold is 20 + 8.0 * ln(L),
274where L is the read length. This can be configured with the --score-min
275option. For details on how to set options like --score-min that
276correspond to functions, see the section on setting function options.
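
As a rough illustration, the two default threshold functions can be
evaluated directly. The function names below are hypothetical, and the
sketch assumes the result is used as-is, without any rounding Bowtie 2
may apply internally.

```python
import math

def min_score_end_to_end(read_length):
    # Default end-to-end threshold quoted above: -0.6 + -0.6 * L
    return -0.6 + -0.6 * read_length

def min_score_local(read_length):
    # Default local threshold quoted above: 20 + 8.0 * ln(L)
    return 20 + 8.0 * math.log(read_length)

for length in (50, 100, 250):
    print(length,
          round(min_score_end_to_end(length), 2),
          round(min_score_local(length), 2))
```

For a 50 bp read the end-to-end threshold is -30.6 and the local
threshold is about 51.3, so the example scores of -17 (end-to-end) and
81 (local) from the previous sections would both count as valid.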
277
278Mapping quality: higher = more unique
279
280The aligner cannot always assign a read to its point of origin with high
281confidence. For instance, a read that originated inside a repeat element
282might align equally well to many occurrences of the element throughout
283the genome, leaving the aligner with no basis for preferring one over
284the others.
285
286Aligners characterize their degree of confidence in the point of origin
287by reporting a mapping quality: a non-negative integer Q = -10 log10 p,
288where p is an estimate of the probability that the alignment does not
289correspond to the read's true point of origin. Mapping quality is
290sometimes abbreviated MAPQ, and is recorded in the SAM MAPQ field.
291
292Mapping quality is related to "uniqueness." We say an alignment is
293unique if it has a much higher alignment score than all the other
294possible alignments. The bigger the gap between the best alignment's
295score and the second-best alignment's score, the more unique the best
296alignment, and the higher its mapping quality should be.
297
298Accurate mapping qualities are useful for downstream tools like variant
299callers. For instance, a variant caller might choose to ignore evidence
300from alignments with mapping quality less than, say, 10. A mapping
301quality of 10 or less indicates that there is at least a 1 in 10 chance
302that the read truly originated elsewhere.
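
The relation Q = -10 log10 p and its inverse can be checked numerically.
This sketch only converts between the two quantities; the function names
are hypothetical, rounding to the nearest integer is an assumption, and
it does not show how Bowtie 2 estimates p in the first place.

```python
import math

def mapq_from_error_prob(p):
    # Q = -10 * log10(p), rounded to the nearest integer (assumed rounding)
    return round(-10 * math.log10(p))

def error_prob_from_mapq(q):
    # Inverse relation: p = 10 ** (-Q / 10)
    return 10 ** (-q / 10)

print(mapq_from_error_prob(0.1))    # 10
print(error_prob_from_mapq(10))     # 0.1, i.e. a 1 in 10 chance of a wrong origin
```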
303
304Aligning pairs
305
A "paired-end" or "mate-pair" read consists of a pair of mates, called
mate 1 and mate 2. Pairs come with a prior expectation about (a) the
relative orientation of the mates, and (b) the distance separating them
on the original DNA molecule. Exactly what expectations hold for a given
dataset depends on the lab procedures used to generate the data. For
example, a common lab procedure for producing pairs is Illumina's
Paired-end Sequencing Assay, which yields pairs with a relative
orientation of FR ("forward, reverse") meaning that if mate 1 came from
the Watson strand, mate 2 very likely came from the Crick strand and
vice versa. Also, this protocol yields pairs where the expected genomic
distance from end to end is about 200-500 base pairs.
317
318For simplicity, this manual uses the term "paired-end" to refer to any
319pair of reads with some expected relative orientation and distance.
320Depending on the protocol, these might actually be referred to as
321"paired-end" or "mate-paired." Also, we always refer to the individual
322sequences making up the pair as "mates."
323
324Paired inputs
325
Pairs are often stored in a pair of files, one file containing the mate
1s and the other containing the mate 2s. The first mate in the file for
mate 1 forms a pair with the first mate in the file for mate 2, the
second with the second, and so on. When aligning pairs with Bowtie 2,
specify the file with the mate 1s using the -1 argument and the file
with the mate 2s using the -2 argument. This causes Bowtie 2 to take the
paired nature of the reads into account when aligning them.
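
A paired-end invocation might be assembled as follows. This is only a
sketch: the helper name and all file names are hypothetical, and the -x
(index) and -S (SAM output) options are bowtie2 options that are not
described in this section. The command is built and printed, not run.

```python
import shlex

def paired_end_command(index, mates1, mates2, output_sam):
    """Build a bowtie2 argument list for paired-end input."""
    return ["bowtie2", "-x", index, "-1", mates1, "-2", mates2, "-S", output_sam]

# Hypothetical index and file names.
cmd = paired_end_command("genome_index", "reads_1.fq", "reads_2.fq", "out.sam")
print(shlex.join(cmd))
```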
333
334Paired SAM output
335
336When Bowtie 2 prints a SAM alignment for a pair, it prints two records
337(i.e. two lines of output), one for each mate. The first record
338describes the alignment for mate 1 and the second record describes the
339alignment for mate 2. In both records, some of the fields of the SAM
340record describe various properties of the alignment; for instance, the
3417th and 8th fields (RNEXT and PNEXT respectively) indicate the reference
342name and position where the other mate aligned, and the 9th field
343indicates the inferred length of the DNA fragment from which the two
344mates were sequenced. See the SAM specification for more details
345regarding these fields.
346
347Concordant pairs match pair expectations, discordant pairs don't
348
349A pair that aligns with the expected relative mate orientation and with
350the expected range of distances between mates is said to align
351"concordantly". If both mates have unique alignments, but the alignments
352do not match paired-end expectations (i.e. the mates aren't in the
353expected relative orientation, or aren't within the expected distance
354range, or both), the pair is said to align "discordantly". Discordant
355alignments may be of particular interest, for instance, when seeking
356structural variants.
357
The expected relative orientation of the mates is set using the --ff,
--fr, or --rf options. The expected range of inter-mate distances (as
measured from the furthest extremes of the mates; also called "outer
distance") is set with the -I and -X options. Note that setting -I and
-X far apart makes Bowtie 2 slower. See documentation for -I and -X.
363
364To declare that a pair aligns discordantly, Bowtie 2 requires that both
365mates align uniquely. This is a conservative threshold, but this is
366often desirable when seeking structural variants.
367
368By default, Bowtie 2 searches for both concordant and discordant
369alignments, though searching for discordant alignments can be disabled
370with the --no-discordant option.
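
The "outer distance" used by -I and -X can be illustrated with a small
sketch. The function names and coordinates are hypothetical, alignments
are given as 0-based, end-exclusive reference intervals, and the
orientation check that concordance also requires is left out.

```python
def outer_distance(mate1, mate2):
    """Distance between the furthest extremes of two mate alignments.

    Each mate is a (start, end) pair of 0-based, end-exclusive reference offsets.
    """
    return max(mate1[1], mate2[1]) - min(mate1[0], mate2[0])

def within_expected_range(mate1, mate2, min_len, max_len):
    # min_len and max_len play the role of the -I and -X options.
    return min_len <= outer_distance(mate1, mate2) <= max_len

mate1 = (1000, 1050)   # hypothetical 50 bp mate
mate2 = (1250, 1300)   # hypothetical 50 bp mate
print(outer_distance(mate1, mate2))                   # 300
print(within_expected_range(mate1, mate2, 0, 500))    # True
print(within_expected_range(mate1, mate2, 0, 250))    # False
```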
371
372Mixed mode: paired where possible, unpaired otherwise
373
374If Bowtie 2 cannot find a paired-end alignment for a pair, by default it
375will go on to look for unpaired alignments for the constituent mates.
376This is called "mixed mode." To disable mixed mode, set the --no-mixed
377option.
378
379Bowtie 2 runs a little faster in --no-mixed mode, but will only consider
380alignment status of pairs per se, not individual mates.
381
382Some SAM FLAGS describe paired-end properties
383
The SAM FLAGS field, the second field in a SAM record, has multiple bits
that describe the paired-end nature of the read and alignment. The first
(least significant) bit (1 in decimal, 0x1 in hexadecimal) is set if the
read is part of a pair. The second bit (2 in decimal, 0x2 in
hexadecimal) is set if the read is part of a pair that aligned in a
paired-end fashion. The fourth bit (8 in decimal, 0x8 in hexadecimal) is
set if the read is part of a pair and the other mate in the pair has no
reported alignment. The sixth bit (32 in decimal, 0x20 in hexadecimal)
is set if the read is part of a pair and the other mate in the pair
aligned to the Crick strand (or, equivalently, if the reverse complement
of the other mate aligned to the Watson strand). The seventh bit (64 in
decimal, 0x40 in hexadecimal) is set if the read is mate 1 in a pair.
The eighth bit (128 in decimal, 0x80 in hexadecimal) is set if the read
is mate 2 in a pair. See the SAM specification for a more detailed
description of the FLAGS field.
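
Decoding these bits is a matter of testing each mask against the FLAGS
value. The sketch below is illustrative only: the dictionary and
function names are hypothetical, the descriptions are informal labels,
and the meaning given for 0x8 follows the SAM specification, where that
bit marks a mate with no alignment.

```python
# Bit masks for the paired-end FLAGS bits; labels are informal.
PAIR_FLAGS = {
    0x1: "read is part of a pair",
    0x2: "pair aligned in a paired-end fashion",
    0x8: "other mate is unaligned (per the SAM specification)",
    0x20: "other mate aligned to the reverse (Crick) strand",
    0x40: "read is mate 1",
    0x80: "read is mate 2",
}

def describe_pair_flags(flags):
    """Return descriptions of the paired-end bits set in a SAM FLAGS value."""
    return [text for bit, text in PAIR_FLAGS.items() if flags & bit]

for line in describe_pair_flags(99):   # 99 = 64 + 32 + 2 + 1
    print(line)
```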
399
400Some SAM optional fields describe more paired-end properties
401
402The last several fields of each SAM record usually contain SAM optional
403fields, which are simply tab-separated strings conveying additional
404information about the reads and alignments. A SAM optional field is
405formatted like this: "XP:i:1" where "XP" is the TAG, "i" is the TYPE
406("integer" in this case), and "1" is the VALUE. See the SAM
407specification for details regarding SAM optional fields.
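
Splitting an optional field into its three parts is straightforward. The
helper below is a hypothetical sketch that converts only the "i"
(integer) and "f" (float) types and leaves every other type as a string.

```python
def parse_optional_field(field):
    """Split a SAM optional field such as 'XP:i:1' into (tag, type, value)."""
    # Split at most twice, because the value itself may contain ':'.
    tag, type_code, value = field.split(":", 2)
    if type_code == "i":
        value = int(value)
    elif type_code == "f":
        value = float(value)
    return tag, type_code, value

print(parse_optional_field("XP:i:1"))    # ('XP', 'i', 1)
print(parse_optional_field("YF:Z:NS"))   # ('YF', 'Z', 'NS')
```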
408
409Mates can overlap, contain, or dovetail each other
410
411The fragment and read lengths might be such that alignments for the two
412mates from a pair overlap each other. Consider this example:
413
414(For these examples, assume we expect mate 1 to align to the left of
415mate 2.)
416
417    Mate 1:    GCAGATTATATGAGTCAGCTACGATATTGTT
418    Mate 2:                               TGTTTGGGGTGACACATTACGCGTCTTTGAC
419    Reference: GCAGATTATATGAGTCAGCTACGATATTGTTTGGGGTGACACATTACGCGTCTTTGAC
420
421It's also possible, though unusual, for one mate alignment to contain
422the other, as in these examples:
423
424    Mate 1:    GCAGATTATATGAGTCAGCTACGATATTGTTTGGGGTGACACATTACGC
425    Mate 2:                               TGTTTGGGGTGACACATTACGC
426    Reference: GCAGATTATATGAGTCAGCTACGATATTGTTTGGGGTGACACATTACGCGTCTTTGAC
427
428    Mate 1:                   CAGCTACGATATTGTTTGGGGTGACACATTACGC
429    Mate 2:                      CTACGATATTGTTTGGGGTGAC
430    Reference: GCAGATTATATGAGTCAGCTACGATATTGTTTGGGGTGACACATTACGCGTCTTTGAC
431
432And it's also possible, though unusual, for the mates to "dovetail",
433with the mates seemingly extending "past" each other as in this example:
434
435    Mate 1:                 GTCAGCTACGATATTGTTTGGGGTGACACATTACGC
436    Mate 2:            TATGAGTCAGCTACGATATTGTTTGGGGTGACACAT
437    Reference: GCAGATTATATGAGTCAGCTACGATATTGTTTGGGGTGACACATTACGCGTCTTTGAC
438
439In some situations, it's desirable for the aligner to consider all these
440cases as "concordant" as long as other paired-end constraints are not
441violated. Bowtie 2's default behavior is to consider overlapping and
442containing as being consistent with concordant alignment. By default,
443dovetailing is considered inconsistent with concordant alignment.
444
445These defaults can be overridden. Setting --no-overlap causes Bowtie 2
446to consider overlapping mates as non-concordant. Setting --no-contain
447causes Bowtie 2 to consider cases where one mate alignment contains the
448other as non-concordant. Setting --dovetail causes Bowtie 2 to consider
449cases where the mate alignments dovetail as concordant.
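
The four situations can be told apart by comparing alignment intervals.
The following is a simplified sketch rather than Bowtie 2's actual
logic: the function name is hypothetical, intervals are 0-based and
end-exclusive, mate 1 is assumed to be expected on the left, and the
coordinates are made up to resemble the examples above. Only the default
behavior is modeled; the --no-overlap, --no-contain and --dovetail
options are not.

```python
def mate_relationship(mate1, mate2):
    """Classify two mate alignments, assuming mate 1 is expected on the left."""
    s1, e1 = mate1
    s2, e2 = mate2
    if e2 <= s1 or e1 <= s2:
        return "disjoint"
    if (s1 <= s2 and e2 <= e1) or (s2 <= s1 and e1 <= e2):
        return "contain"
    if s2 < s1:
        return "dovetail"      # mate 2 starts to the left of mate 1
    return "overlap"

# Relationships that are consistent with concordance by default.
DEFAULT_CONCORDANT = {"disjoint", "overlap", "contain"}

for pair in [((0, 31), (27, 58)), ((0, 49), (27, 49)), ((13, 49), (8, 44))]:
    rel = mate_relationship(*pair)
    print(pair, rel, rel in DEFAULT_CONCORDANT)
```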
450
451Reporting
452
453The reporting mode governs how many alignments Bowtie 2 looks for, and
454how to report them. Bowtie 2 has three distinct reporting modes. The
455default reporting mode is similar to the default reporting mode of many
456other read alignment tools, including BWA. It is also similar to Bowtie
4571's -M alignment mode.
458
459In general, when we say that a read has an alignment, we mean that it
460has a valid alignment. When we say that a read has multiple alignments,
461we mean that it has multiple alignments that are valid and distinct from
462one another.
463
464Distinct alignments map a read to different places
465
466Two alignments for the same individual read are "distinct" if they map
467the same read to different places. Specifically, we say that two
468alignments are distinct if there are no alignment positions where a
469particular read offset is aligned opposite a particular reference offset
470in both alignments with the same orientation. E.g. if the first
471alignment is in the forward orientation and aligns the read character at
472read offset 10 to the reference character at chromosome 3, offset
4733,445,245, and the second alignment is also in the forward orientation
474and also aligns the read character at read offset 10 to the reference
475character at chromosome 3, offset 3,445,245, they are not distinct
476alignments.
477
478Two alignments for the same pair are distinct if either the mate 1s in
479the two paired-end alignments are distinct or the mate 2s in the two
480alignments are distinct or both.
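
One way to picture this definition is to treat each alignment as a set
of aligned cells and to call two alignments distinct when the sets share
no cell. The representation and the function names below are
hypothetical choices made for this sketch.

```python
def are_distinct(aln_a, aln_b):
    """Two alignments are distinct if they share no aligned cell.

    Each alignment is a set of (orientation, read_offset, ref_name, ref_offset)
    tuples, one per read character aligned opposite a reference character.
    """
    return aln_a.isdisjoint(aln_b)

def pairs_distinct(pair_a, pair_b):
    # A pair is (mate1_alignment, mate2_alignment); distinct if either mate differs.
    return are_distinct(pair_a[0], pair_b[0]) or are_distinct(pair_a[1], pair_b[1])

def ungapped_cells(orientation, ref_name, ref_start, read_length):
    # Helper for the example: cells of a gap-free alignment.
    return {(orientation, i, ref_name, ref_start + i) for i in range(read_length)}

a = ungapped_cells("+", "chr3", 3445235, 20)   # read offset 10 -> chr3:3445245
b = ungapped_cells("+", "chr3", 3445235, 20)   # same placement as a
c = ungapped_cells("+", "chr3", 3445236, 20)   # shifted by one base
print(are_distinct(a, b), are_distinct(a, c))  # False True
```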
481
482Default mode: search for multiple alignments, report the best one
483
By default, Bowtie 2 searches for distinct, valid alignments for each
read. When it finds a valid alignment, it generally will continue to
look for alignments that are nearly as good or better. It will
eventually stop looking, either because it exceeded a limit placed on
search effort (see -D and -R) or because it already knows all it needs
to know to report an alignment. Information from the best alignments is
used to estimate mapping quality (the MAPQ SAM field) and to set SAM
optional fields, such as AS:i and XS:i. Bowtie 2 does not guarantee that
the alignment reported is the best possible in terms of alignment score.
493
494See also: -D, which puts an upper limit on the number of dynamic
495programming problems (i.e. seed extensions) that can "fail" in a row
496before Bowtie 2 stops searching. Increasing -D makes Bowtie 2 slower,
497but increases the likelihood that it will report the correct alignment
498for a read that aligns many places.
499
500See also: -R, which sets the maximum number of times Bowtie 2 will
501"re-seed" when attempting to align a read with repetitive seeds.
502Increasing -R makes Bowtie 2 slower, but increases the likelihood that
503it will report the correct alignment for a read that aligns many places.
504
505-k mode: search for one or more alignments, report each
506
507In -k mode, Bowtie 2 searches for up to N distinct, valid alignments for
508each read, where N equals the integer specified with the -k parameter.
509That is, if -k 2 is specified, Bowtie 2 will search for at most 2
510distinct alignments. It reports all alignments found, in descending
511order by alignment score. The alignment score for a paired-end alignment
512equals the sum of the alignment scores of the individual mates. Each
513reported read or pair alignment beyond the first has the SAM 'secondary'
514bit (which equals 256) set in its FLAGS field. Supplementary alignments
515will also be assigned a MAPQ of 255. See the SAM specification for
516details.
517
518Bowtie 2 does not "find" alignments in any specific order, so for reads
519that have more than N distinct, valid alignments, Bowtie 2 does not
520guarantee that the N alignments reported are the best possible in terms
521of alignment score. Still, this mode can be effective and fast in
522situations where the user cares more about whether a read aligns (or
523aligns a certain number of times) than where exactly it originated.
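
The reporting behavior described for -k mode can be mimicked with a toy
model. Everything here is an assumption made for illustration: the
function name, the (score, flags) representation, and the idea that the
search simply stops after the first k alignments it happens to find.

```python
SECONDARY = 256  # SAM 'secondary' bit

def report_k(alignments, k):
    """Sort up to k found alignments by score and flag all but the first.

    alignments: list of (score, flags) tuples in the order they were found.
    """
    found = alignments[:k]   # toy model: the search stops after k hits
    ranked = sorted(found, key=lambda a: a[0], reverse=True)
    return [(score, (flags | SECONDARY) if i > 0 else flags)
            for i, (score, flags) in enumerate(ranked)]

# Three valid alignments exist, found in this (arbitrary) order.
reported = report_k([(-12, 0), (-3, 16), (-6, 0)], k=2)
print(reported)   # [(-3, 16), (-12, 256)]
```

In this toy run the alignment with score -6 is never reported even
though it beats the one with score -12, which mirrors the caveat that
the N reported alignments are not guaranteed to be the N best.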
524
525-a mode: search for and report all alignments
526
527-a mode is similar to -k mode except that there is no upper limit on the
528number of alignments Bowtie 2 should report. Alignments are reported in
529descending order by alignment score. The alignment score for a
530paired-end alignment equals the sum of the alignment scores of the
531individual mates. Each reported read or pair alignment beyond the first
532has the SAM 'secondary' bit (which equals 256) set in its FLAGS field.
533Supplementary alignments will be assigned a MAPQ of 255. See the SAM
534specification for details.
535
536Some tools are designed with this reporting mode in mind. Bowtie 2 is
537not! For very large genomes, this mode is very slow.
538
539Randomness in Bowtie 2
540
541Bowtie 2's search for alignments for a given read is "randomized." That
542is, when Bowtie 2 encounters a set of equally-good choices, it uses a
543pseudo-random number to choose. For example, if Bowtie 2 discovers a set
544of 3 equally-good alignments and wants to decide which to report, it
545picks a pseudo-random integer 0, 1 or 2 and reports the corresponding
546alignment. Arbitrary choices can crop up at various points during
547alignment.
548
The pseudo-random number generator is re-initialized for every read, and
the seed used to initialize it is a function of the read name,
nucleotide string, quality string, and the value specified with --seed.
If you run the same version of Bowtie 2 on two reads with identical
names, nucleotide strings, and quality strings, and if --seed is set the
same for both runs, Bowtie 2 will produce the same output; i.e., it will
align the read to the same place, even if there are multiple equally
good alignments. This is intuitive and desirable in most cases. Most
users expect Bowtie 2 to produce the same output when run twice on the
same input.
559
560However, when the user specifies the --non-deterministic option, Bowtie
5612 will use the current time to re-initialize the pseudo-random number
562generator. When this is specified, Bowtie 2 might report different
563alignments for identical reads. This is counter-intuitive for some
564users, but might be more appropriate in situations where the input
565consists of many identical reads.
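
The default, reproducible behavior can be imitated by seeding a
generator from the read itself. This is a sketch of the idea only: the
function name is hypothetical, and SHA-256 together with Python's random
module stand in for whatever hash and generator Bowtie 2 actually uses.

```python
import hashlib
import random

def pick_alignment(candidates, name, seq, qual, seed=0):
    """Choose among equally good candidates, reproducibly per read."""
    # Derive the generator's seed from the read and the user-supplied seed.
    # SHA-256 is used here only for illustration; it is not what Bowtie 2 uses.
    key = "\t".join([name, seq, qual, str(seed)]).encode()
    rng = random.Random(int.from_bytes(hashlib.sha256(key).digest()[:8], "big"))
    return candidates[rng.randrange(len(candidates))]

hits = ["chr1:100", "chr5:2000", "chrX:42"]   # hypothetical tied alignments
first = pick_alignment(hits, "read1", "AAAAAAA", "IIIIIII", seed=0)
second = pick_alignment(hits, "read1", "AAAAAAA", "IIIIIII", seed=0)
print(first == second)   # True: same read and seed give the same choice
```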
566
567Multiseed heuristic
568
569To rapidly narrow the number of possible alignments that must be
570considered, Bowtie 2 begins by extracting substrings ("seeds") from the
571read and its reverse complement and aligning them in an ungapped fashion
572with the help of the FM Index. This is "multiseed alignment" and it is
573similar to what Bowtie 1 does, except Bowtie 1 attempts to align the
574entire read this way.
575
576This initial step makes Bowtie 2 much faster than it would be without
577such a filter, but at the expense of missing some valid alignments. For
578instance, it is possible for a read to have a valid overall alignment
579but to have no valid seed alignments because each potential seed
580alignment is interrupted by too many mismatches or gaps.
581
582The trade-off between speed and sensitivity/accuracy can be adjusted by
583setting the seed length (-L), the interval between extracted seeds (-i),
584and the number of mismatches permitted per seed (-N). For more sensitive
585alignment, set these parameters to (a) make the seeds closer together,
586(b) make the seeds shorter, and/or (c) allow more mismatches. You can
587adjust these options one-by-one, though Bowtie 2 comes with some useful
588combinations of options prepackaged as "preset options."
589
590-D and -R are also options that adjust the trade-off between speed and
591sensitivity/accuracy.
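
Seed extraction can be sketched as taking fixed-length substrings at
regular offsets from the read and from its reverse complement. The
function names and the read are hypothetical, and the interval is passed
in as a plain integer, whereas -i expresses it as a function of read
length.

```python
def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def extract_seeds(read, seed_len, interval):
    """Return (strand, offset, seed) for seeds taken every `interval` bases."""
    seeds = []
    for strand, seq in (("+", read), ("-", reverse_complement(read))):
        # Only offsets where a full-length seed still fits are used.
        for offset in range(0, len(seq) - seed_len + 1, interval):
            seeds.append((strand, offset, seq[offset:offset + seed_len]))
    return seeds

read = "TAGCTACGCTCTACGCTATCATGCATAAAC"   # hypothetical 30 bp read
for strand, offset, seed in extract_seeds(read, seed_len=10, interval=6):
    print(strand, offset, seed)
```

Shorter seeds or a smaller interval produce more seeds per read, which
is why those settings increase sensitivity at the cost of speed.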
592
593FM Index memory footprint
594
595Bowtie 2 uses the FM Index to find ungapped alignments for seeds. This
596step accounts for the bulk of Bowtie 2's memory footprint, as the FM
597Index itself is typically the largest data structure used. For instance,
598the memory footprint of the FM Index for the human genome is about 3.2
599gigabytes of RAM.
600
601Ambiguous characters
602
603Non-whitespace characters besides A, C, G or T are considered
604"ambiguous." N is a common ambiguous character that appears in reference
605sequences. Bowtie 2 considers all ambiguous characters in the reference
606(including IUPAC nucleotide codes) to be Ns.
607
608Bowtie 2 allows alignments to overlap ambiguous characters in the
609reference. An alignment position that contains an ambiguous character in
610the read, reference, or both, is penalized according to --np. --n-ceil
611sets an upper limit on the number of positions that may contain
612ambiguous reference characters in a valid alignment. The optional field
613XN:i reports the number of ambiguous reference characters overlapped by
614an alignment.
615
616Note that the multiseed heuristic cannot find seed alignments that
617overlap ambiguous reference characters. For an alignment overlapping an
618ambiguous reference character to be found, it must have one or more seed
619alignments that do not overlap ambiguous reference characters.
620
621Presets: setting many settings at once
622
623Bowtie 2 comes with some useful combinations of parameters packaged into
624shorter "preset" parameters. For example, running Bowtie 2 with the
625--very-sensitive option is the same as running with options:
626-D 20 -R 3 -N 0 -L 20 -i S,1,0.50. The preset options that come with
627Bowtie 2 are designed to cover a wide area of the
628speed/sensitivity/accuracy trade-off space, with the presets ending in
629fast generally being faster but less sensitive and less accurate, and
630the presets ending in sensitive generally being slower but more
631sensitive and more accurate. See the documentation for the preset
632options for details.
633
As of Bowtie 2 v2.4.0, individual preset values can be overridden by
providing the specific options. For example, the seed length of 20 set
by the --very-sensitive preset above can be changed to 25 by also
specifying -L 25 anywhere on the command line.
638
639Filtering
640
641Some reads are skipped or "filtered out" by Bowtie 2. For example, reads
642may be filtered out because they are extremely short or have a high
proportion of ambiguous nucleotides. Bowtie 2 will still print a SAM
record for such a read, but no alignment will be reported and the YF:Z
SAM optional field will be set to indicate the reason the read was
filtered.
647
648-   YF:Z:LN: the read was filtered because it had length less than or
649    equal to the number of seed mismatches set with the -N option.
650-   YF:Z:NS: the read was filtered because it contains a number of
651    ambiguous characters (usually N or .) greater than the ceiling
652    specified with --n-ceil.
653-   YF:Z:SC: the read was filtered because the read length and the match
654    bonus (set with --ma) are such that the read can't possibly earn an
655    alignment score greater than or equal to the threshold set with
    --score-min.
657-   YF:Z:QC: the read was filtered because it was marked as failing
658    quality control and the user specified the --qc-filter option. This
659    only happens when the input is in Illumina's QSEQ format (i.e. when
660    --qseq is specified) and the last (11th) field of the read's QSEQ
661    record contains 1.
662
If a read could be filtered for more than one reason, the value of the
YF:Z field will reflect only one of those reasons.
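
A downstream script can read the filter reason straight from the SAM
record. The following is a minimal Python sketch, not part of Bowtie 2:
the helper name filter_reason and the sample record are hypothetical,
and only the YF:Z codes come from the list above.

```python
# Codes taken from the list above; the descriptions are abbreviated.
FILTER_REASONS = {
    "LN": "read length <= number of seed mismatches (-N)",
    "NS": "too many ambiguous characters (--n-ceil)",
    "SC": "cannot reach the minimum score (--score-min)",
    "QC": "failed quality control (--qc-filter)",
}

def filter_reason(sam_line):
    """Return the YF:Z code of a SAM record, or None if the field is absent."""
    fields = sam_line.rstrip("\n").split("\t")
    # Optional fields start after the 11 mandatory SAM columns.
    for field in fields[11:]:
        if field.startswith("YF:Z:"):
            return field[len("YF:Z:"):]
    return None

# Hypothetical unaligned record for a read made up mostly of Ns.
record = "read1\t4\t*\t0\t0\t*\t*\t0\t0\tNNNNNNNNAC\tIIIIIIIIII\tYT:Z:UU\tYF:Z:NS"
code = filter_reason(record)
print(code, "->", FILTER_REASONS.get(code))
```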
665
666Alignment summary
667
668When Bowtie 2 finishes running, it prints messages summarizing what
669happened. These messages are printed to the "standard error" ("stderr")
670filehandle. For datasets consisting of unpaired reads, the summary might
671look like this:
672
673    20000 reads; of these:
674      20000 (100.00%) were unpaired; of these:
675        1247 (6.24%) aligned 0 times
676        18739 (93.69%) aligned exactly 1 time
677        14 (0.07%) aligned >1 times
678    93.77% overall alignment rate
679
680For datasets consisting of pairs, the summary might look like this:
681
682    10000 reads; of these:
683      10000 (100.00%) were paired; of these:
684        650 (6.50%) aligned concordantly 0 times
685        8823 (88.23%) aligned concordantly exactly 1 time
686        527 (5.27%) aligned concordantly >1 times
687        ----
688        650 pairs aligned concordantly 0 times; of these:
689          34 (5.23%) aligned discordantly 1 time
690        ----
691        616 pairs aligned 0 times concordantly or discordantly; of these:
692          1232 mates make up the pairs; of these:
693            660 (53.57%) aligned 0 times
694            571 (46.35%) aligned exactly 1 time
695            1 (0.08%) aligned >1 times
696    96.70% overall alignment rate
697
698The indentation indicates how subtotals relate to totals.
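
The "overall alignment rate" lines can be reproduced from the subtotals.
The Python sketch below shows arithmetic that matches both examples
above; it is inferred from those numbers rather than taken from the
Bowtie 2 source, and the function names are hypothetical.

```python
def overall_rate_unpaired(total, aligned_once, aligned_multi):
    """Percentage of unpaired reads that aligned at least once."""
    return 100.0 * (aligned_once + aligned_multi) / total

def overall_rate_paired(pairs, conc_once, conc_multi, disc_once,
                        mates_once, mates_multi):
    """Percentage of mates that aligned, counting each pair as two mates."""
    aligned_mates = (2 * (conc_once + conc_multi + disc_once)
                     + mates_once + mates_multi)
    return 100.0 * aligned_mates / (2 * pairs)

# Subtotals copied from the two example summaries above.
print(overall_rate_unpaired(20000, 18739, 14))
print(overall_rate_paired(10000, 8823, 527, 34, 571, 1))
```

In the paired case every concordant or discordant pair contributes two
aligned mates, and the mates that aligned on their own are added on top.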
699
700Wrapper scripts
701
702The bowtie2, bowtie2-build and bowtie2-inspect executables are actually
703wrapper scripts that call binary programs as appropriate. The wrappers
704shield users from having to distinguish between "small" and "large"
705index formats, discussed briefly in the following section. Also, the
706bowtie2 wrapper provides some key functionality, like the ability to
707handle compressed inputs, and the functionality for --un, --al and
708related options.
709
710It is recommended that you always run the bowtie2 wrappers and not run
711the binaries directly.
712
713Small and large indexes
714
715bowtie2-build can index reference genomes of any size. For genomes less
716than about 4 billion nucleotides in length, bowtie2-build builds a
717"small" index using 32-bit numbers in various parts of the index. When
718the genome is longer, bowtie2-build builds a "large" index using 64-bit
719numbers. Small indexes are stored in files with the .bt2 extension, and
720large indexes are stored in files with the .bt2l extension. The user
721need not worry about whether a particular index is small or large; the
722wrapper scripts will automatically build and use the appropriate index.
723
724Performance tuning
725
7261.  If your computer has multiple processors/cores, use -p
727
728    The -p option causes Bowtie 2 to launch a specified number of
729    parallel search threads. Each thread runs on a different
730    processor/core and all threads find alignments in parallel,
731    increasing alignment throughput by approximately a multiple of the
732    number of threads (though in practice, speedup is somewhat worse
733    than linear).
734
7352.  If reporting many alignments per read, try reducing
736    bowtie2-build --offrate
737
738    If you are using -k or -a options and Bowtie 2 is reporting many
739    alignments per read, using an index with a denser SA sample can
740    speed things up considerably. To do this, specify a
741    smaller-than-default -o/--offrate value when running bowtie2-build.
742    A denser SA sample yields a larger index, but is also particularly
743    effective at speeding up alignment when many alignments are reported
744    per read.
745
7463.  If bowtie2 "thrashes", try increasing bowtie2-build --offrate
747
748    If bowtie2 runs very slowly on a relatively low-memory computer, try
749    setting -o/--offrate to a larger value when building the index. This
750    decreases the memory footprint of the index.
751
752Command Line
753
754Setting function options
755
756Some Bowtie 2 options specify a function rather than an individual
757number or setting. In these cases the user specifies three parameters:
758(a) a function type F, (b) a constant term B, and (c) a coefficient A.
759The available function types are constant (C), linear (L), square-root
760(S), and natural log (G). The parameters are specified as F,B,A - that
761is, the function type, the constant term, and the coefficient are
762separated by commas with no whitespace. The constant term and
763coefficient may be negative and/or floating-point numbers.
764
765For example, if the function specification is L,-0.4,-0.6, then the
766function defined is:
767
768    f(x) = -0.4 + -0.6 * x
769
770If the function specification is G,1,5.4, then the function defined is:
771
772    f(x) = 1.0 + 5.4 * ln(x)
773
See the documentation for the option in question to learn what the
parameter x is for. For example, in the case of the --score-min option,
the function f(x) sets the minimum alignment score necessary for an
alignment to be considered valid, and x is the read length.
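
To make the F,B,A notation concrete, here is a small Python sketch that
turns a specification string into a callable. The name parse_function is
hypothetical, and the behavior of the constant type (returning the
constant term and ignoring x) is an assumption.

```python
import math

def parse_function(spec):
    """Turn a string such as 'L,-0.4,-0.6' into a Python callable f(x)."""
    ftype, const, coeff = spec.split(",")
    b, a = float(const), float(coeff)
    if ftype == "C":
        # Assumption: a constant function ignores x and returns the constant term.
        return lambda x: b
    if ftype == "L":
        return lambda x: b + a * x
    if ftype == "S":
        return lambda x: b + a * math.sqrt(x)
    if ftype == "G":
        return lambda x: b + a * math.log(x)
    raise ValueError(f"unknown function type: {ftype!r}")

f = parse_function("L,-0.4,-0.6")
g = parse_function("G,1,5.4")
print(f(10))       # evaluates -0.4 + -0.6 * 10
print(g(math.e))   # evaluates 1.0 + 5.4 * ln(e)
```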
778
779Usage
780
    bowtie2 [options]* -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r> | --interleaved <i> | --sra-acc <acc> | -b <bam>} [-S <sam>]
782
783Main arguments
784
785    -x <bt2-idx>
786
787The basename of the index for the reference genome. The basename is the
788name of any of the index files up to but not including the final .1.bt2
789/ .rev.1.bt2 / etc. bowtie2 looks for the specified index first in the
790current directory, then in the directory specified in the
791BOWTIE2_INDEXES environment variable.
792
793    -1 <m1>
794
795Comma-separated list of files containing mate 1s (filename usually
796includes _1), e.g. -1 flyA_1.fq,flyB_1.fq. Sequences specified with this
797option must correspond file-for-file and read-for-read with those
798specified in <m2>. Reads may be a mix of different lengths. If - is
799specified, bowtie2 will read the mate 1s from the "standard in" or
800"stdin" filehandle.
801
802    -2 <m2>
803
804Comma-separated list of files containing mate 2s (filename usually
805includes _2), e.g. -2 flyA_2.fq,flyB_2.fq. Sequences specified with this
806option must correspond file-for-file and read-for-read with those
807specified in <m1>. Reads may be a mix of different lengths. If - is
808specified, bowtie2 will read the mate 2s from the "standard in" or
809"stdin" filehandle.
810
811    -U <r>
812
813Comma-separated list of files containing unpaired reads to be aligned,
814e.g. lane1.fq,lane2.fq,lane3.fq,lane4.fq. Reads may be a mix of
815different lengths. If - is specified, bowtie2 gets the reads from the
816"standard in" or "stdin" filehandle.
817
818    --interleaved
819
Reads are in interleaved FASTQ files, where the first two records (8
lines) represent a mate pair, the next two records the next pair, and so
on.
822
823    --sra-acc
824
825Reads are SRA accessions. If the accession provided cannot be found in
826local storage it will be fetched from the NCBI database. If you find
827that SRA alignments are long running please rerun your command with the
828-p/--threads parameter set to desired number of threads.
829
NB: this option is only available if Bowtie 2 is compiled with the
necessary SRA libraries. See Obtaining Bowtie 2 for details.
832
833    -b <bam>
834
835Reads are unaligned BAM records sorted by read name. The
836--align-paired-reads and --preserve-tags options affect the way Bowtie 2
837processes records.
838
839    -S <sam>
840
841File to write SAM alignments to. By default, alignments are written to
842the "standard out" or "stdout" filehandle (i.e. the console).
843
844Options
845
846Input options
847
848    -q
849
850Reads (specified with <m1>, <m2>, <s>) are FASTQ files. FASTQ files
851usually have extension .fq or .fastq. FASTQ is the default format. See
852also: --solexa-quals and --int-quals.
853
854    --tab5
855
856Each read or pair is on a single line. An unpaired read line is
857[name]\t[seq]\t[qual]\n. A paired-end read line is
858[name]\t[seq1]\t[qual1]\t[seq2]\t[qual2]\n. An input file can be a mix
859of unpaired and paired-end reads and Bowtie 2 recognizes each according
860to the number of fields, handling each as it should.
861
862    --tab6
863
864Similar to --tab5 except, for paired-end reads, the second end can have
865a different name from the first:
866[name1]\t[seq1]\t[qual1]\t[name2]\t[seq2]\t[qual2]\n
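
The field counts described for --tab5 and --tab6 are enough to tell the
record types apart. The Python sketch below illustrates this; the
function name parse_tab_line and the sample reads are hypothetical, and
it is not how Bowtie 2 itself parses input.

```python
def parse_tab_line(line):
    """Parse one --tab5 or --tab6 line into a list of (name, seq, qual) reads."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 3:      # unpaired read
        name, seq, qual = fields
        return [(name, seq, qual)]
    if len(fields) == 5:      # --tab5 pair: both mates share one name
        name, seq1, qual1, seq2, qual2 = fields
        return [(name, seq1, qual1), (name, seq2, qual2)]
    if len(fields) == 6:      # --tab6 pair: each mate has its own name
        name1, seq1, qual1, name2, seq2, qual2 = fields
        return [(name1, seq1, qual1), (name2, seq2, qual2)]
    raise ValueError(f"expected 3, 5 or 6 tab-separated fields, got {len(fields)}")

print(parse_tab_line("r1\tACGT\tIIII\n"))
print(parse_tab_line("p1\tACGT\tIIII\tTTGA\tIIII\n"))
print(parse_tab_line("p1/1\tACGT\tIIII\tp1/2\tTTGA\tIIII\n"))
```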
867
868    --qseq
869
870Reads (specified with <m1>, <m2>, <s>) are QSEQ files. QSEQ files
871usually end in _qseq.txt. See also: --solexa-quals and --int-quals.
872
873    -f
874
875Reads (specified with <m1>, <m2>, <s>) are FASTA files. FASTA files
876usually have extension .fa, .fasta, .mfa, .fna or similar. FASTA files
877do not have a way of specifying quality values, so when -f is set, the
878result is as if --ignore-quals is also set.
879
880    -r
881
882Reads (specified with <m1>, <m2>, <s>) are files with one input sequence
883per line, without any other information (no read names, no qualities).
884When -r is set, the result is as if --ignore-quals is also set.
885
886    -F k:<int>,i:<int>
887
888Reads are substrings (k-mers) extracted from a FASTA file <s>.
889Specifically, for every reference sequence in FASTA file <s>, Bowtie 2
890aligns the k-mers at offsets 1, 1+i, 1+2i, ... until reaching the end of
891the reference. Each k-mer is aligned as a separate read. Quality values
892are set to all Is (40 on Phred scale). Each k-mer (read) is given a name
893like <sequence>_<offset>, where <sequence> is the name of the FASTA
894sequence it was drawn from and <offset> is its 0-based offset of origin
895with respect to the sequence. Only single k-mers, i.e. unpaired reads,
can be aligned in this way.

    -c
898
899The read sequences are given on command line. I.e. <m1>, <m2> and
900<singles> are comma-separated lists of reads rather than lists of read
901files. There is no way to specify read names or qualities, so -c also
902implies --ignore-quals.
903
904    -s/--skip <int>
905
906Skip (i.e. do not align) the first <int> reads or pairs in the input.
907
908    -u/--qupto <int>
909
910Align the first <int> reads or read pairs from the input (after the
911-s/--skip reads or pairs have been skipped), then stop. Default: no
912limit.
913
914    -5/--trim5 <int>
915
916Trim <int> bases from 5' (left) end of each read before alignment
917(default: 0).
918
919    -3/--trim3 <int>
920
921Trim <int> bases from 3' (right) end of each read before alignment
922(default: 0).
923
924    --trim-to [3:|5:]<int>
925
Trim reads exceeding <int> bases. Bases will be trimmed from either the
3' (right) or 5' (left) end of the read. If the read end is not
specified, Bowtie 2 will default to trimming from the 3' (right) end of
the read. --trim-to and -3/-5 are mutually exclusive.
930
931    --phred33
932
933Input qualities are ASCII chars equal to the Phred quality plus 33. This
934is also called the "Phred+33" encoding, which is used by the very latest
935Illumina pipelines.
936
937    --phred64
938
939Input qualities are ASCII chars equal to the Phred quality plus 64. This
940is also called the "Phred+64" encoding.
941
942    --solexa-quals
943
944Convert input qualities from Solexa (which can be negative) to Phred
945(which can't). This scheme was used in older Illumina GA Pipeline
946versions (prior to 1.3). Default: off.
947
948    --int-quals
949
950Quality values are represented in the read input file as space-separated
951ASCII integers, e.g., 40 40 30 40..., rather than ASCII characters,
952e.g., II?I.... Integers are treated as being on the Phred quality scale
953unless --solexa-quals is also specified. Default: off.
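
The quality encodings above differ only in how each value is written
down. The Python sketch below decodes all three to Phred integers; the
function name decode_qualities is hypothetical, and Solexa-to-Phred
conversion (--solexa-quals) is deliberately left out.

```python
def decode_qualities(qual, offset=33, int_quals=False):
    """Return Phred-scaled integers for one read's quality string."""
    if int_quals:
        # --int-quals style: space-separated integers such as "40 40 30 40".
        return [int(q) for q in qual.split()]
    # --phred33 / --phred64 style: one ASCII character per base.
    return [ord(ch) - offset for ch in qual]

print(decode_qualities("II?I"))                          # Phred+33 input
print(decode_qualities("hh^h", offset=64))               # Phred+64 input
print(decode_qualities("40 40 30 40", int_quals=True))   # integer input
```

All three calls describe the same four quality values, which is why the
encoding must be specified correctly for the penalties to be meaningful.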
954
955Preset options in --end-to-end mode
956
957    --very-fast
958
959Same as: -D 5 -R 1 -N 0 -L 22 -i S,0,2.50
960
961    --fast
962
963Same as: -D 10 -R 2 -N 0 -L 22 -i S,0,2.50
964
965    --sensitive
966
967Same as: -D 15 -R 2 -N 0 -L 22 -i S,1,1.15 (default in --end-to-end
968mode)
969
970    --very-sensitive
971
972Same as: -D 20 -R 3 -N 0 -L 20 -i S,1,0.50
973
974Preset options in --local mode
975
976    --very-fast-local
977
978Same as: -D 5 -R 1 -N 0 -L 25 -i S,1,2.00
979
980    --fast-local
981
982Same as: -D 10 -R 2 -N 0 -L 22 -i S,1,1.75
983
984    --sensitive-local
985
986Same as: -D 15 -R 2 -N 0 -L 20 -i S,1,0.75 (default in --local mode)
987
988    --very-sensitive-local
989
990Same as: -D 20 -R 3 -N 0 -L 20 -i S,1,0.50
991
992Alignment options
993
994    -N <int>
995
Sets the number of mismatches allowed in a seed alignment during
multiseed alignment. Can be set to 0 or 1. Setting this higher makes
alignment slower (often much slower) but increases sensitivity. Default:
0.
1000
1001    -L <int>
1002
1003Sets the length of the seed substrings to align during multiseed
1004alignment. Smaller values make alignment slower but more sensitive.
Default: the --sensitive preset is used by default, which sets -L to 22
in --end-to-end mode and to 20 in --local mode.
1007
1008    -i <func>
1009
1010Sets a function governing the interval between seed substrings to use
1011during multiseed alignment. For instance, if the read has 30 characters,
1012and seed length is 10, and the seed interval is 6, the seeds extracted
1013will be:
1014
1015    Read:      TAGCTACGCTCTACGCTATCATGCATAAAC
1016    Seed 1 fw: TAGCTACGCT
1017    Seed 1 rc: AGCGTAGCTA
1018    Seed 2 fw:       CGCTCTACGC
1019    Seed 2 rc:       GCGTAGAGCG
1020    Seed 3 fw:             ACGCTATCAT
1021    Seed 3 rc:             ATGATAGCGT
1022    Seed 4 fw:                   TCATGCATAA
1023    Seed 4 rc:                   TTATGCATGA
1024
1025Since it's best to use longer intervals for longer reads, this parameter
1026sets the interval as a function of the read length, rather than a single
1027one-size-fits-all number. For instance, specifying -i S,1,2.5 sets the
1028interval function f to f(x) = 1 + 2.5 * sqrt(x), where x is the read
length. See also: setting function options. If the function returns a
result less than 1, it is rounded up to 1. Default: the --sensitive
preset is used by default, which sets -i to S,1,1.15 in --end-to-end
mode and to S,1,0.75 in --local mode.
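
The seed layout in the example above can be reproduced with a few lines
of Python. This is an illustrative sketch only: the function names are
hypothetical, the interval passed to extract_seeds is assumed to be a
whole number, and how Bowtie 2 rounds a fractional interval is not
covered here.

```python
import math

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

def extract_seeds(read, seed_len, interval):
    """Return (offset, forward seed, reverse-complement seed) tuples."""
    seeds = []
    # Only offsets where a full-length seed still fits inside the read.
    for offset in range(0, len(read) - seed_len + 1, interval):
        fw = read[offset:offset + seed_len]
        seeds.append((offset, fw, reverse_complement(fw)))
    return seeds

def seed_interval(read_len, const=1.0, coeff=2.5):
    """Evaluate an S,<const>,<coeff> interval function; results below 1 become 1."""
    return max(1.0, const + coeff * math.sqrt(read_len))

read = "TAGCTACGCTCTACGCTATCATGCATAAAC"
for offset, fw, rc in extract_seeds(read, seed_len=10, interval=6):
    print(f"{' ' * offset}{fw}  (rc: {rc})")
```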
1033
1034    --n-ceil <func>
1035
1036Sets a function governing the maximum number of ambiguous characters
1037(usually Ns and/or .s) allowed in a read as a function of read length.
For instance, specifying L,0,0.15 sets the N-ceiling function f to
1039f(x) = 0 + 0.15 * x, where x is the read length. See also: setting
1040function options. Reads exceeding this ceiling are filtered out.
1041Default: L,0,0.15.
1042
1043    --dpad <int>
1044
1045"Pads" dynamic programming problems by <int> columns on either side to
1046allow gaps. Default: 15.
1047
1048    --gbar <int>
1049
1050Disallow gaps within <int> positions of the beginning or end of the
1051read. Default: 4.
1052
1053    --ignore-quals
1054
1055When calculating a mismatch penalty, always consider the quality value
1056at the mismatched position to be the highest possible, regardless of the
1057actual value. I.e. input is treated as though all quality values are
1058high. This is also the default behavior when the input doesn't specify
1059quality values (e.g. in -f, -r, or -c modes).
1060
1061    --nofw/--norc
1062
1063If --nofw is specified, bowtie2 will not attempt to align unpaired reads
1064to the forward (Watson) reference strand. If --norc is specified,
1065bowtie2 will not attempt to align unpaired reads against the
1066reverse-complement (Crick) reference strand. In paired-end mode, --nofw
1067and --norc pertain to the fragments; i.e. specifying --nofw causes
1068bowtie2 to explore only those paired-end configurations corresponding to
1069fragments from the reverse-complement (Crick) strand. Default: both
1070strands enabled.
1071
1072    --no-1mm-upfront
1073
1074By default, Bowtie 2 will attempt to find either an exact or a
10751-mismatch end-to-end alignment for the read before trying the multiseed
1076heuristic. Such alignments can be found very quickly, and many short
1077read alignments have exact or near-exact end-to-end alignments. However,
1078this can lead to unexpected alignments when the user also sets options
1079governing the multiseed heuristic, like -L and -N. For instance, if the
1080user specifies -N 0 and -L equal to the length of the read, the user
1081will be surprised to find 1-mismatch alignments reported. This option
1082prevents Bowtie 2 from searching for 1-mismatch end-to-end alignments
1083before using the multiseed heuristic, which leads to the expected
1084behavior when combined with options such as -L and -N. This comes at the
1085expense of speed.
1086
1087    --end-to-end
1088
1089In this mode, Bowtie 2 requires that the entire read align from one end
1090to the other, without any trimming (or "soft clipping") of characters
1091from either end. The match bonus --ma always equals 0 in this mode, so
1092all alignment scores are less than or equal to 0, and the greatest
1093possible alignment score is 0. This is mutually exclusive with --local.
1094--end-to-end is the default mode.
1095
1096    --local
1097
1098In this mode, Bowtie 2 does not require that the entire read align from
1099one end to the other. Rather, some characters may be omitted ("soft
1100clipped") from the ends in order to achieve the greatest possible
1101alignment score. The match bonus --ma is used in this mode, and the best
1102possible alignment score is equal to the match bonus (--ma) times the
1103length of the read. Specifying --local and one of the presets (e.g.
1104--local --very-fast) is equivalent to specifying the local version of
1105the preset (--very-fast-local). This is mutually exclusive with
1106--end-to-end. --end-to-end is the default mode.
1107
1108Scoring options
1109
1110    --ma <int>
1111
1112Sets the match bonus. In --local mode <int> is added to the alignment
1113score for each position where a read character aligns to a reference
1114character and the characters match. Not used in --end-to-end mode.
1115Default: 2.
1116
1117    --mp MX,MN
1118
1119Sets the maximum (MX) and minimum (MN) mismatch penalties, both
1120integers. A number less than or equal to MX and greater than or equal to
1121MN is subtracted from the alignment score for each position where a read
1122character aligns to a reference character, the characters do not match,
and neither is an N. If --ignore-quals is specified, the number
subtracted equals MX. Otherwise, the number subtracted is
1125MN + floor( (MX-MN)(MIN(Q, 40.0)/40.0) ) where Q is the Phred quality
1126value. Default: MX = 6, MN = 2.
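
As a worked instance of the formula above, the following Python sketch
computes the penalty for a single mismatched position. The function name
mismatch_penalty is hypothetical; the defaults mirror MX = 6 and MN = 2.

```python
import math

def mismatch_penalty(q, mx=6, mn=2, ignore_quals=False):
    """Penalty for one mismatched position with Phred quality q."""
    if ignore_quals:
        # With --ignore-quals every mismatch costs the maximum penalty.
        return mx
    return mn + math.floor((mx - mn) * (min(q, 40.0) / 40.0))

for q in (0, 10, 20, 30, 40):
    print(q, mismatch_penalty(q))
```

Low-quality mismatches cost close to MN, while anything at or above
Phred 40 costs the full MX.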
1127
1128    --np <int>
1129
1130Sets penalty for positions where the read, reference, or both, contain
1131an ambiguous character such as N. Default: 1.
1132
1133    --rdg <int1>,<int2>
1134
1135Sets the read gap open (<int1>) and extend (<int2>) penalties. A read
1136gap of length N gets a penalty of <int1> + N * <int2>. Default: 5, 3.
1137
1138    --rfg <int1>,<int2>
1139
1140Sets the reference gap open (<int1>) and extend (<int2>) penalties. A
1141reference gap of length N gets a penalty of <int1> + N * <int2>.
1142Default: 5, 3.
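
The gap penalty formula is the same for --rdg and --rfg. A short Python
sketch with the default open and extend values (the function name
gap_penalty is hypothetical):

```python
def gap_penalty(length, gap_open=5, gap_extend=3):
    """Penalty for a single gap of `length` positions: open + length * extend."""
    return gap_open + length * gap_extend

for n in (1, 2, 3):
    print(n, gap_penalty(n))
```

Note that with this formula even a gap of length 1 pays the extend
penalty once in addition to the open penalty.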
1143
1144    --score-min <func>
1145
1146Sets a function governing the minimum alignment score needed for an
1147alignment to be considered "valid" (i.e. good enough to report). This is
1148a function of read length. For instance, specifying L,0,-0.6 sets the
1149minimum-score function f to f(x) = 0 + -0.6 * x, where x is the read
1150length. See also: setting function options. The default in --end-to-end
1151mode is L,-0.6,-0.6 and the default in --local mode is G,20,8.
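
To give a feel for the default thresholds, the Python sketch below
evaluates them for a 50-nucleotide read. The function names are
hypothetical, and any rounding Bowtie 2 may apply to the threshold is
not modeled.

```python
import math

def min_score_end_to_end(read_len):
    """Default --end-to-end threshold, L,-0.6,-0.6."""
    return -0.6 + -0.6 * read_len

def min_score_local(read_len):
    """Default --local threshold, G,20,8."""
    return 20 + 8 * math.log(read_len)

def is_valid(score, threshold):
    """An alignment is valid if its score is at least the threshold."""
    return score >= threshold

read_len = 50
e2e = min_score_end_to_end(read_len)
print("end-to-end threshold:", e2e)
# Four high-quality mismatches at the default MX = 6 score -24; six score -36.
print(is_valid(-24, e2e), is_valid(-36, e2e))
print("local threshold:", min_score_local(read_len))
```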
1152
1153Reporting options
1154
1155    -k <int>
1156
1157By default, bowtie2 searches for distinct, valid alignments for each
1158read. When it finds a valid alignment, it continues looking for
1159alignments that are nearly as good or better. The best alignment found
1160is reported (randomly selected from among best if tied). Information
1161about the best alignments is used to estimate mapping quality and to set
1162SAM optional fields, such as AS:i and XS:i.
1163
1164When -k is specified, however, bowtie2 behaves differently. Instead, it
1165searches for at most <int> distinct, valid alignments for each read. The
1166search terminates when it can't find more distinct valid alignments, or
1167when it finds <int>, whichever happens first. All alignments found are
1168reported in descending order by alignment score. The alignment score for
1169a paired-end alignment equals the sum of the alignment scores of the
1170individual mates. Each reported read or pair alignment beyond the first
1171has the SAM 'secondary' bit (which equals 256) set in its FLAGS field.
1172For reads that have more than <int> distinct, valid alignments, bowtie2
1173does not guarantee that the <int> alignments reported are the best
1174possible in terms of alignment score. -k is mutually exclusive with -a.
1175
1176Note: Bowtie 2 is not designed with large values for -k in mind, and
1177when aligning reads to long, repetitive genomes large -k can be very,
1178very slow.
1179
1180    -a
1181
1182Like -k but with no upper limit on number of alignments to search for.
1183-a is mutually exclusive with -k.
1184
1185Note: Bowtie 2 is not designed with -a mode in mind, and when aligning
1186reads to long, repetitive genomes this mode can be very, very slow.
1187
1188Effort options
1189
1190    -D <int>
1191
1192Up to <int> consecutive seed extension attempts can "fail" before Bowtie
11932 moves on, using the alignments found so far. A seed extension "fails"
1194if it does not yield a new best or a new second-best alignment. This
1195limit is automatically adjusted up when -k or -a are specified. Default:
119615.
1197
1198    -R <int>
1199
1200<int> is the maximum number of times Bowtie 2 will "re-seed" reads with
repetitive seeds. When "re-seeding," Bowtie 2 simply chooses a new set
of seeds (same length, same number of mismatches allowed) at different
offsets and searches for more alignments. A read is considered to have
1204repetitive seeds if the total number of seed hits divided by the number
1205of seeds that aligned at least once is greater than 300. Default: 2.
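
The repetitiveness test described above is a simple ratio. A Python
sketch, with a hypothetical function name and an assumed handling of
reads for which no seed aligned:

```python
def has_repetitive_seeds(total_seed_hits, seeds_aligned, threshold=300):
    """True if the average number of hits per aligned seed exceeds the threshold."""
    if seeds_aligned == 0:
        # Assumption: a read with no aligned seeds is not treated as repetitive.
        return False
    return total_seed_hits / seeds_aligned > threshold

print(has_repetitive_seeds(3000, 8))   # 375 hits per aligned seed
print(has_repetitive_seeds(2400, 8))   # exactly 300, which is not greater than 300
```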
1206
1207Paired-end options
1208
1209    -I/--minins <int>
1210
1211The minimum fragment length for valid paired-end alignments. E.g. if
1212-I 60 is specified and a paired-end alignment consists of two 20-bp
1213alignments in the appropriate orientation with a 20-bp gap between them,
1214that alignment is considered valid (as long as -X is also satisfied). A
121519-bp gap would not be valid in that case. If trimming options -3 or -5
1216are also used, the -I constraint is applied with respect to the
1217untrimmed mates.
1218
1219The larger the difference between -I and -X, the slower Bowtie 2 will
1220run. This is because larger differences between -I and -X require that
1221Bowtie 2 scan a larger window to determine if a concordant alignment
1222exists. For typical fragment length ranges (200 to 400 nucleotides),
1223Bowtie 2 is very efficient.
1224
1225Default: 0 (essentially imposing no minimum)
1226
1227    -X/--maxins <int>
1228
1229The maximum fragment length for valid paired-end alignments. E.g. if
1230-X 100 is specified and a paired-end alignment consists of two 20-bp
1231alignments in the proper orientation with a 60-bp gap between them, that
1232alignment is considered valid (as long as -I is also satisfied). A 61-bp
1233gap would not be valid in that case. If trimming options -3 or -5 are
1234also used, the -X constraint is applied with respect to the untrimmed
1235mates, not the trimmed mates.
1236
1237The larger the difference between -I and -X, the slower Bowtie 2 will
1238run. This is because larger differences between -I and -X require that
1239Bowtie 2 scan a larger window to determine if a concordant alignment
1240exists. For typical fragment length ranges (200 to 400 nucleotides),
1241Bowtie 2 is very efficient.
1242
1243Default: 500.
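
The -I and -X examples above reduce to a range check on the fragment
length. The Python sketch below covers only the simple case used in
those examples, two non-overlapping mates separated by a gap; the
function name fragment_length_ok is hypothetical.

```python
def fragment_length_ok(mate1_len, gap, mate2_len, minins=0, maxins=500):
    """Check the -I/-X constraints for two mates separated by `gap` bases."""
    fragment = mate1_len + gap + mate2_len
    return minins <= fragment <= maxins

print(fragment_length_ok(20, 20, 20, minins=60))    # 60-bp fragment with -I 60
print(fragment_length_ok(20, 19, 20, minins=60))    # 59-bp fragment with -I 60
print(fragment_length_ok(20, 60, 20, maxins=100))   # 100-bp fragment with -X 100
print(fragment_length_ok(20, 61, 20, maxins=100))   # 101-bp fragment with -X 100
```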
1244
1245    --fr/--rf/--ff
1246
1247The upstream/downstream mate orientations for a valid paired-end
1248alignment against the forward reference strand. E.g., if --fr is
1249specified and there is a candidate paired-end alignment where mate 1
1250appears upstream of the reverse complement of mate 2 and the fragment
1251length constraints (-I and -X) are met, that alignment is valid. Also,
1252if mate 2 appears upstream of the reverse complement of mate 1 and all
other constraints are met, that too is valid. --rf likewise requires
that an upstream mate 1 be reverse-complemented and a downstream mate 2
be forward-oriented. --ff requires both an upstream mate 1 and a
downstream
1256mate 2 to be forward-oriented. Default: --fr (appropriate for Illumina's
1257Paired-end Sequencing Assay).
1258
1259    --no-mixed
1260
1261By default, when bowtie2 cannot find a concordant or discordant
1262alignment for a pair, it then tries to find alignments for the
1263individual mates. This option disables that behavior.
1264
1265    --no-discordant
1266
1267By default, bowtie2 looks for discordant alignments if it cannot find
1268any concordant alignments. A discordant alignment is an alignment where
1269both mates align uniquely, but that does not satisfy the paired-end
1270constraints (--fr/--rf/--ff, -I, -X). This option disables that
1271behavior.
1272
1273    --dovetail
1274
1275If the mates "dovetail", that is if one mate alignment extends past the
1276beginning of the other such that the wrong mate begins upstream,
1277consider that to be concordant. See also: Mates can overlap, contain or
1278dovetail each other. Default: mates cannot dovetail in a concordant
1279alignment.
1280
1281    --no-contain
1282
1283If one mate alignment contains the other, consider that to be
1284non-concordant. See also: Mates can overlap, contain or dovetail each
1285other. Default: a mate can contain the other in a concordant alignment.
1286
1287    --no-overlap
1288
1289If one mate alignment overlaps the other at all, consider that to be
1290non-concordant. See also: Mates can overlap, contain or dovetail each
1291other. Default: mates can overlap in a concordant alignment.
1292
1293BAM options
1294
1295    --align-paired-reads
1296
1297Bowtie 2 will, by default, attempt to align unpaired BAM reads. Use this
1298option to align paired-end reads instead.
1299
1300    --preserve-tags
1301
1302Preserve tags from the original BAM record by appending them to the end
1303of the corresponding Bowtie 2 SAM output.
1304
1305Output options
1306
1307    -t/--time
1308
1309Print the wall-clock time required to load the index files and align the
1310reads. This is printed to the "standard error" ("stderr") filehandle.
1311Default: off.
1312
1313    --un <path>
1314    --un-gz <path>
1315    --un-bz2 <path>
1316    --un-lz4 <path>
1317
1318Write unpaired reads that fail to align to file at <path>. These reads
1319correspond to the SAM records with the FLAGS 0x4 bit set and neither the
13200x40 nor 0x80 bits set. If --un-gz is specified, output will be gzip
1321compressed. If --un-bz2 or --un-lz4 is specified, output will be bzip2
1322or lz4 compressed. Reads written in this way will appear exactly as they
1323did in the input file, without any modification (same sequence, same
1324name, same quality string, same quality encoding). Reads will not
1325necessarily appear in the same order as they did in the input.
1326
1327    --al <path>
1328    --al-gz <path>
1329    --al-bz2 <path>
1330    --al-lz4 <path>
1331
1332Write unpaired reads that align at least once to file at <path>. These
1333reads correspond to the SAM records with the FLAGS 0x4, 0x40, and 0x80
1334bits unset. If --al-gz is specified, output will be gzip compressed. If
1335--al-bz2 is specified, output will be bzip2 compressed. Similarly if
1336--al-lz4 is specified, output will be lz4 compressed. Reads written in
1337this way will appear exactly as they did in the input file, without any
1338modification (same sequence, same name, same quality string, same
1339quality encoding). Reads will not necessarily appear in the same order
1340as they did in the input.
1341
1342    --un-conc <path>
1343    --un-conc-gz <path>
1344    --un-conc-bz2 <path>
1345    --un-conc-lz4 <path>
1346
1347Write paired-end reads that fail to align concordantly to file(s) at
1348<path>. These reads correspond to the SAM records with the FLAGS 0x4 bit
1349set and either the 0x40 or 0x80 bit set (depending on whether it's mate
1350#1 or #2). .1 and .2 strings are added to the filename to distinguish
1351which file contains mate #1 and mate #2. If a percent symbol, %, is used
1352in <path>, the percent symbol is replaced with 1 or 2 to make the
1353per-mate filenames. Otherwise, .1 or .2 are added before the final dot
1354in <path> to make the per-mate filenames. Reads written in this way will
1355appear exactly as they did in the input files, without any modification
1356(same sequence, same name, same quality string, same quality encoding).
1357Reads will not necessarily appear in the same order as they did in the
1358inputs.
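
The per-mate filename rule above can be sketched in Python as follows.
The function name per_mate_paths and the example filenames are
hypothetical, and the handling of a <path> with no dot at all is an
assumption, since the text above does not cover that case.

```python
def per_mate_paths(path):
    """Return the (mate 1, mate 2) filenames derived from <path>."""
    if "%" in path:
        # A percent symbol is replaced with the mate number.
        return path.replace("%", "1"), path.replace("%", "2")
    stem, dot, ext = path.rpartition(".")
    if dot:
        # Otherwise .1 or .2 goes before the final dot.
        return f"{stem}.1.{ext}", f"{stem}.2.{ext}"
    # Assumption: with no dot in <path>, the suffix is simply appended.
    return f"{path}.1", f"{path}.2"

print(per_mate_paths("unconc_%.fq"))
print(per_mate_paths("unconc.fq"))
```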
1359
1360    --al-conc <path>
1361    --al-conc-gz <path>
1362    --al-conc-bz2 <path>
1363    --al-conc-lz4 <path>
1364
1365Write paired-end reads that align concordantly at least once to file(s)
1366at <path>. These reads correspond to the SAM records with the FLAGS 0x4
1367bit unset and either the 0x40 or 0x80 bit set (depending on whether it's
1368mate #1 or #2). .1 and .2 strings are added to the filename to
1369distinguish which file contains mate #1 and mate #2. If a percent
1370symbol, %, is used in <path>, the percent symbol is replaced with 1 or 2
1371to make the per-mate filenames. Otherwise, .1 or .2 are added before the
1372final dot in <path> to make the per-mate filenames. Reads written in
1373this way will appear exactly as they did in the input files, without any
1374modification (same sequence, same name, same quality string, same
1375quality encoding). Reads will not necessarily appear in the same order
1376as they did in the inputs.
1377
1378    --quiet
1379
1380Print nothing besides alignments and serious errors.
1381
1382    --met-file <path>
1383
Write bowtie2 metrics to file <path>. Having alignment metrics can be
useful for debugging certain problems, especially performance issues.
See also: --met. Default: metrics disabled.
1387
    --met-stderr
1389
Write bowtie2 metrics to the "standard error" ("stderr") filehandle.
This is not mutually exclusive with --met-file. Having alignment metrics
can be useful for debugging certain problems, especially performance
issues. See also: --met. Default: metrics disabled.
1394
1395    --met <int>
1396
Write a new bowtie2 metrics record every <int> seconds. Only matters if
either --met-stderr or --met-file is specified. Default: 1.
1399
1400SAM options
1401
1402    --no-unal
1403
1404Suppress SAM records for reads that failed to align.
1405
1406    --no-hd
1407
1408Suppress SAM header lines (starting with @).
1409
1410    --no-sq
1411
1412Suppress @SQ SAM header lines.
1413
1414    --rg-id <text>
1415
1416Set the read group ID to <text>. This causes the SAM @RG header line to
1417be printed, with <text> as the value associated with the ID: tag. It
1418also causes the RG:Z: extra field to be attached to each SAM output
1419record, with value set to <text>.
1420
1421    --rg <text>
1422
1423Add <text> (usually of the form TAG:VAL, e.g. SM:Pool1) as a field on
1424the @RG header line. Note: in order for the @RG line to appear, --rg-id
1425must also be specified. This is because the ID tag is required by the
1426SAM Spec. Specify --rg multiple times to set multiple fields. See the
1427SAM Spec for details about what fields are legal.
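As a rough illustration of how --rg-id and --rg combine, the following Python sketch assembles the @RG header line and the per-record RG:Z: field described above. The function names are hypothetical and the code is not taken from Bowtie 2; it assumes the fields are joined with tabs in the order they were given.

```python
def rg_header_line(rg_id, rg_fields=()):
    """Build the @RG header line implied by --rg-id and repeated --rg options."""
    if rg_id is None:
        # Without --rg-id no @RG line is printed, since the ID tag is required.
        return None
    return "\t".join(["@RG", f"ID:{rg_id}", *rg_fields])


def rg_record_tag(rg_id):
    """Optional field attached to every SAM record when --rg-id is set."""
    return f"RG:Z:{rg_id}"
```

With rg_id "grp1" and the single field SM:Pool1, the header line consists of @RG, ID:grp1 and SM:Pool1 separated by tabs, and each record carries RG:Z:grp1.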
1428
1429    --omit-sec-seq
1430
1431When printing secondary alignments, Bowtie 2 by default will write out
1432the SEQ and QUAL strings. Specifying this option causes Bowtie 2 to
1433print an asterisk in those fields instead.
1434
1435    --soft-clipped-unmapped-tlen
1436
1437Consider soft-clipped bases unmapped when calculating TLEN. Only
1438available in --local mode.
1439
1440    --sam-no-qname-trunc
1441
Suppress the standard behavior of truncating the read name at the first
whitespace, at the expense of generating non-standard SAM.
1444
1445    --xeq
1446
Use '='/'X', instead of 'M', to specify matches/mismatches in SAM records.
1448
1449    --sam-append-comment
1450
1451Append FASTA/FASTQ comment to SAM record, where a comment is everything
1452after the first space in the read name.
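The distinction between the read name and its comment can be sketched as follows. This is a minimal Python illustration with a hypothetical function name; it assumes that a run of several whitespace characters counts as a single separator.

```python
def split_read_name(raw_name):
    """Split a FASTA/FASTQ read name into (qname, comment)."""
    parts = raw_name.split(None, 1)  # split at the first run of whitespace
    qname = parts[0] if parts else ""
    comment = parts[1] if len(parts) > 1 else ""
    return qname, comment
```

By default only the first element would appear in the SAM record; --sam-append-comment additionally keeps the second.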
1453
1454Performance options
1455
1456    -o/--offrate <int>
1457
1458Override the offrate of the index with <int>. If <int> is greater than
1459the offrate used to build the index, then some row markings are
1460discarded when the index is read into memory. This reduces the memory
1461footprint of the aligner but requires more time to calculate text
1462offsets. <int> must be greater than the value used to build the index.
1463
1464    -p/--threads NTHREADS
1465
1466Launch NTHREADS parallel search threads (default: 1). Threads will run
1467on separate processors/cores and synchronize when parsing reads and
1468outputting alignments. Searching for alignments is highly parallel, and
1469speedup is close to linear. Increasing -p increases Bowtie 2's memory
1470footprint. E.g. when aligning to a human genome index, increasing -p
1471from 1 to 8 increases the memory footprint by a few hundred megabytes.
1472This option is only available if bowtie is linked with the pthreads
1473library (i.e. if BOWTIE_PTHREADS=0 is not specified at build time).
1474
1475    --reorder
1476
1477Guarantees that output SAM records are printed in an order corresponding
1478to the order of the reads in the original input file, even when -p is
1479set greater than 1. Specifying --reorder and setting -p greater than 1
1480causes Bowtie 2 to run somewhat slower and use somewhat more memory than
1481if --reorder were not specified. Has no effect if -p is set to 1, since
1482output order will naturally correspond to input order in that case.
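One generic way to restore input order when worker threads finish out of order is to buffer completed records until all earlier reads have been written. The Python sketch below shows that idea only; it is not Bowtie 2's implementation, and the function name is hypothetical. It also suggests where extra memory could go: records that finish early must be held until their turn.

```python
def emit_in_input_order(completed):
    """Yield records in input order from (read_index, record) pairs that
    arrive in arbitrary completion order."""
    pending = {}      # finished records waiting for earlier reads
    next_index = 0    # index of the next read that may be written
    for read_index, record in completed:
        pending[read_index] = record
        # Flush every record that is now contiguous with the output so far.
        while next_index in pending:
            yield pending.pop(next_index)
            next_index += 1
```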
1483
1484    --mm
1485
1486Use memory-mapped I/O to load the index, rather than typical file I/O.
1487Memory-mapping allows many concurrent bowtie processes on the same
1488computer to share the same memory image of the index (i.e. you pay the
1489memory overhead just once). This facilitates memory-efficient
1490parallelization of bowtie in situations where using -p is not possible
1491or not preferable.
1492
1493Other options
1494
1495    --qc-filter
1496
1497Filter out reads for which the QSEQ filter field is non-zero. Only has
1498an effect when read format is --qseq. Default: off.
1499
1500    --seed <int>
1501
1502Use <int> as the seed for pseudo-random number generator. Default: 0.
1503
1504    --non-deterministic
1505
1506Normally, Bowtie 2 re-initializes its pseudo-random generator for each
1507read. It seeds the generator with a number derived from (a) the read
1508name, (b) the nucleotide sequence, (c) the quality sequence, (d) the
1509value of the --seed option. This means that if two reads are identical
1510(same name, same nucleotides, same qualities) Bowtie 2 will find and
1511report the same alignment(s) for both, even if there was ambiguity. When
1512--non-deterministic is specified, Bowtie 2 re-initializes its
1513pseudo-random generator for each read using the current time. This means
1514that Bowtie 2 will not necessarily report the same alignment for two
1515identical reads. This is counter-intuitive for some users, but might be
1516more appropriate in situations where the input consists of many
1517identical reads.
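The effect of per-read seeding can be illustrated with a short Python sketch. The hash used here (CRC32) and the function names are stand-ins chosen for the example; Bowtie 2's actual seed derivation is not described in this section.

```python
import random
import zlib


def read_rng(name, seq, qual, seed=0, non_deterministic=False):
    """Return a pseudo-random generator for one read."""
    if non_deterministic:
        # Seeded from system entropy or the current time, so calls differ.
        return random.Random()
    # Stand-in hash: CRC32 over the read name, sequence, qualities and seed.
    key = "\t".join([name, seq, qual, str(seed)]).encode()
    return random.Random(zlib.crc32(key))


def pick_among_ties(candidates, rng):
    """Choose one of several equally good alignments."""
    return rng.choice(candidates)
```

Two identical reads produce the same generator state in the default mode, so pick_among_ties returns the same candidate for both; with non_deterministic=True the choices may differ.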
1518
1519    --version
1520
1521Print version information and quit.
1522
1523    -h/--help
1524
1525Print usage information and quit.
1526
1527SAM output
1528
1529Following is a brief description of the SAM format as output by bowtie2.
1530For more details, see the SAM format specification.
1531
1532By default, bowtie2 prints a SAM header with @HD, @SQ and @PG lines.
1533When one or more --rg arguments are specified, bowtie2 will also print
1534an @RG line that includes all user-specified --rg tokens separated by
1535tabs.
1536
1537Each subsequent line describes an alignment or, if the read failed to
1538align, a read. Each line is a collection of at least 12 fields separated
1539by tabs; from left to right, the fields are:
1540
15411.  Name of read that aligned.
1542
1543    Note that the SAM specification disallows whitespace in the read
1544    name. If the read name contains any whitespace characters, Bowtie 2
1545    will truncate the name at the first whitespace character. This is
1546    similar to the behavior of other tools. The standard behavior of
1547    truncating at the first whitespace can be suppressed with
1548    --sam-no-qname-trunc at the expense of generating non-standard SAM.
1549
15502.  Sum of all applicable flags. Flags relevant to Bowtie are:
1551
1552        1
1553
1554    The read is one of a pair
1555
1556        2
1557
1558    The alignment is one end of a proper paired-end alignment
1559
1560        4
1561
1562    The read has no reported alignments
1563
1564        8
1565
1566    The read is one of a pair and has no reported alignments
1567
1568        16
1569
1570    The alignment is to the reverse reference strand
1571
1572        32
1573
1574    The other mate in the paired-end alignment is aligned to the reverse
1575    reference strand
1576
1577        64
1578
1579    The read is mate 1 in a pair
1580
1581        128
1582
1583    The read is mate 2 in a pair
1584
    Thus, an unpaired read that aligns to the reverse reference strand
    will have flag 16. A paired-end read that is the first mate in a
    properly aligned pair and aligns to the reverse reference strand
    will have flag 83 (= 64 + 16 + 2 + 1).
1588
15893.  Name of reference sequence where alignment occurs
1590
15914.  1-based offset into the forward reference strand where leftmost
1592    character of the alignment occurs
1593
15945.  Mapping quality
1595
15966.  CIGAR string representation of alignment
1597
15987.  Name of reference sequence where mate's alignment occurs. Set to =
1599    if the mate's reference sequence is the same as this alignment's, or
1600    * if there is no mate.
1601
16028.  1-based offset into the forward reference strand where leftmost
1603    character of the mate's alignment occurs. Offset is 0 if there is no
1604    mate.
1605
16069.  Inferred fragment length. Size is negative if the mate's alignment
1607    occurs upstream of this alignment. Size is 0 if the mates did not
1608    align concordantly. However, size is non-0 if the mates aligned
1609    discordantly to the same chromosome.
1610
161110. Read sequence (reverse-complemented if aligned to the reverse
1612    strand)
1613
11. ASCII-encoded read qualities (reversed if the read aligned to the
    reverse strand). The encoded quality values are on the Phred
    quality scale and the encoding is ASCII-offset by 33 (ASCII char
    !), similarly to a FASTQ file.
1618
161912. Optional fields. Fields are tab-separated. bowtie2 outputs zero or
1620    more of these optional fields for each alignment, depending on the
1621    type of the alignment:
1622
1623    AS:i:<N>
1624
1625Alignment score. Can be negative. Can be greater than 0 in --local mode
1626(but not in --end-to-end mode). Only present if SAM record is for an
1627aligned read.
1628
1629    XS:i:<N>
1630
1631Alignment score for the best-scoring alignment found other than the
1632alignment reported. Can be negative. Can be greater than 0 in --local
1633mode (but not in --end-to-end mode). Only present if the SAM record is
1634for an aligned read and more than one alignment was found for the read.
1635Note that, when the read is part of a concordantly-aligned pair, this
1636score could be greater than AS:i.
1637
1638    YS:i:<N>
1639
1640Alignment score for opposite mate in the paired-end alignment. Only
1641present if the SAM record is for a read that aligned as part of a
1642paired-end alignment.
1643
1644    XN:i:<N>
1645
1646The number of ambiguous bases in the reference covering this alignment.
1647Only present if SAM record is for an aligned read.
1648
1649    XM:i:<N>
1650
1651The number of mismatches in the alignment. Only present if SAM record is
1652for an aligned read.
1653
1654    XO:i:<N>
1655
1656The number of gap opens, for both read and reference gaps, in the
1657alignment. Only present if SAM record is for an aligned read.
1658
1659    XG:i:<N>
1660
1661The number of gap extensions, for both read and reference gaps, in the
1662alignment. Only present if SAM record is for an aligned read.
1663
1664    NM:i:<N>
1665
1666The edit distance; that is, the minimal number of one-nucleotide edits
1667(substitutions, insertions and deletions) needed to transform the read
1668string into the reference string. Only present if SAM record is for an
1669aligned read.
1670
1671    YF:Z:<S>
1672
1673String indicating reason why the read was filtered out. See also:
1674Filtering. Only appears for reads that were filtered out.
1675
1676    YT:Z:<S>
1677
1678Value of UU indicates the read was not part of a pair. Value of CP
1679indicates the read was part of a pair and the pair aligned concordantly.
1680Value of DP indicates the read was part of a pair and the pair aligned
1681discordantly. Value of UP indicates the read was part of a pair but the
pair failed to align either concordantly or discordantly.
1683
1684    MD:Z:<S>
1685
1686A string representation of the mismatched reference bases in the
1687alignment. See SAM Tags format specification for details. Only present
1688if SAM record is for an aligned read.
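The field layout described above can be read back with a few lines of Python. The following is a minimal sketch, not a full SAM parser: all function names are hypothetical, only integer (i) optional fields are converted, and every other type is kept as a string.

```python
FLAG_MEANINGS = {
    1: "read is one of a pair",
    2: "alignment is one end of a proper paired-end alignment",
    4: "read has no reported alignments",
    8: "read is one of a pair and has no reported alignments",
    16: "alignment is to the reverse reference strand",
    32: "other mate is aligned to the reverse reference strand",
    64: "read is mate 1 in a pair",
    128: "read is mate 2 in a pair",
}


def decode_flag(flag):
    """Return the flag bits from the table above that are set in flag."""
    return [bit for bit in FLAG_MEANINGS if flag & bit]


def phred_scores(qual):
    """Convert a Phred+33 quality string to integer scores."""
    return [ord(ch) - 33 for ch in qual]


def parse_optional_fields(fields):
    """Turn ['AS:i:-5', 'YT:Z:UU'] into {'AS': -5, 'YT': 'UU'}."""
    tags = {}
    for field in fields:
        tag, type_code, value = field.split(":", 2)
        tags[tag] = int(value) if type_code == "i" else value
    return tags


def parse_sam_record(line):
    """Split one tab-separated alignment line into a small dict."""
    cols = line.rstrip("\n").split("\t")
    return {
        "qname": cols[0],
        "flag": int(cols[1]),
        "rname": cols[2],
        "pos": int(cols[3]),
        "mapq": int(cols[4]),
        "cigar": cols[5],
        "tags": parse_optional_fields(cols[11:]),
    }
```

For instance, decode_flag(83) returns the bits 1, 2, 16 and 64, matching the worked example in the flag description.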
1689
1690The bowtie2-build indexer
1691
1692bowtie2-build builds a Bowtie index from a set of DNA sequences.
1693bowtie2-build outputs a set of 6 files with suffixes .1.bt2, .2.bt2,
.3.bt2, .4.bt2, .rev.1.bt2, and .rev.2.bt2. For a large index, the
suffixes end in .bt2l instead of .bt2. These files together
1696constitute the index: they are all that is needed to align reads to that
1697reference. The original sequence FASTA files are no longer used by
1698Bowtie 2 once the index is built.
1699
1700Bowtie 2's .bt2 index format is different from Bowtie 1's .ebwt format,
1701and they are not compatible with each other.
1702
1703Use of Karkkainen's blockwise algorithm allows bowtie2-build to trade
1704off between running time and memory usage. bowtie2-build has three
1705options governing how it makes this trade: -p/--packed,
1706--bmax/--bmaxdivn, and --dcv. By default, bowtie2-build will
1707automatically search for the settings that yield the best running time
1708without exhausting memory. This behavior can be disabled using the
1709-a/--noauto option.
1710
1711The indexer provides options pertaining to the "shape" of the index,
1712e.g. --offrate governs the fraction of Burrows-Wheeler rows that are
1713"marked" (i.e., the density of the suffix-array sample; see the original
1714FM Index paper for details). All of these options are potentially
1715profitable trade-offs depending on the application. They have been set
1716to defaults that are reasonable for most cases according to our
1717experiments. See Performance tuning for details.
1718
1719bowtie2-build can generate either small or large indexes. The wrapper
1720will decide which based on the length of the input genome. If the
1721reference does not exceed 4 billion characters but a large index is
1722preferred, the user can specify --large-index to force bowtie2-build to
1723build a large index instead.
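The naming of the index files and the small/large decision can be summarized in a short Python sketch. The function names are hypothetical, and the cutoff constant takes "4 billion" literally; the exact limit applied by the wrapper is not stated in this section, so treat it as an assumption.

```python
SMALL_SUFFIX = "bt2"
LARGE_SUFFIX = "bt2l"
# Assumption: "4 billion characters" taken literally.
APPROX_SMALL_INDEX_LIMIT = 4_000_000_000


def wants_large_index(reference_length, force_large=False):
    """True if a large index would be built for this reference length."""
    return force_large or reference_length > APPROX_SMALL_INDEX_LIMIT


def index_file_names(base, large=False):
    """List the six index files written for basename base."""
    ext = LARGE_SUFFIX if large else SMALL_SUFFIX
    parts = ["1", "2", "3", "4", "rev.1", "rev.2"]
    return [f"{base}.{part}.{ext}" for part in parts]
```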
1724
1725The Bowtie 2 index is based on the FM Index of Ferragina and Manzini,
1726which in turn is based on the Burrows-Wheeler transform. The algorithm
1727used to build the index is based on the blockwise algorithm of
1728Karkkainen.
1729
1730Command Line
1731
1732Usage:
1733
1734    bowtie2-build [options]* <reference_in> <bt2_base>
1735
1736Main arguments
1737
1738    <reference_in>
1739
1740A comma-separated list of FASTA files containing the reference sequences
1741to be aligned to, or, if -c is specified, the sequences themselves.
1742E.g., <reference_in> might be chr1.fa,chr2.fa,chrX.fa,chrY.fa, or, if -c
1743is specified, this might be GGTCATCCT,ACGGGTCGT,CCGTTCTATGCGGCTTA.
1744
1745    <bt2_base>
1746
1747The basename of the index files to write. By default, bowtie2-build
1748writes files named NAME.1.bt2, NAME.2.bt2, NAME.3.bt2, NAME.4.bt2,
1749NAME.rev.1.bt2, and NAME.rev.2.bt2, where NAME is <bt2_base>.
1750
1751Options
1752
1753    -f
1754
1755The reference input files (specified as <reference_in>) are FASTA files
1756(usually having extension .fa, .mfa, .fna or similar).
1757
1758    -c
1759
1760The reference sequences are given on the command line. I.e.
1761<reference_in> is a comma-separated list of sequences rather than a list
1762of FASTA files.
1763
1764    --large-index
1765
Force bowtie2-build to build a large index, even if the reference is
less than ~4 billion nucleotides long.
1768
1769    -a/--noauto
1770
1771Disable the default behavior whereby bowtie2-build automatically selects
1772values for the --bmax, --dcv and --packed parameters according to
available memory. Instead, the user may specify values for those parameters.
1774If memory is exhausted during indexing, an error message will be
1775printed; it is up to the user to try new parameters.
1776
1777    -p/--packed
1778
1779Use a packed (2-bits-per-nucleotide) representation for DNA strings.
1780This saves memory but makes indexing 2-3 times slower. Default: off.
1781This is configured automatically by default; use -a/--noauto to
1782configure manually.
1783
1784    --bmax <int>
1785
1786The maximum number of suffixes allowed in a block. Allowing more
1787suffixes per block makes indexing faster, but increases peak memory
1788usage. Setting this option overrides any previous setting for --bmax, or
1789--bmaxdivn. Default (in terms of the --bmaxdivn parameter) is --bmaxdivn
17904 * number of threads. This is configured automatically by default; use
1791-a/--noauto to configure manually.
1792
1793    --bmaxdivn <int>
1794
1795The maximum number of suffixes allowed in a block, expressed as a
1796fraction of the length of the reference. Setting this option overrides
1797any previous setting for --bmax, or --bmaxdivn. Default: --bmaxdivn 4 *
1798number of threads. This is configured automatically by default; use
1799-a/--noauto to configure manually.
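The relationship between --bmax and --bmaxdivn amounts to a single division, sketched here in Python. The function names are hypothetical, and the rounding (round up, never below 1) is an assumption rather than documented behavior.

```python
def default_bmaxdivn(threads=1):
    """Default divisor described above: 4 * number of threads."""
    return 4 * threads


def bmax_from_divn(reference_length, bmaxdivn):
    """Maximum suffixes per block when the limit is given as a fraction.

    Assumption: the quotient is rounded up and is never below 1.
    """
    return max(1, -(-reference_length // bmaxdivn))
```

For a 48,502-base reference and one thread, bmax_from_divn(48502, default_bmaxdivn(1)) gives 12126 under these assumptions.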
1800
1801    --dcv <int>
1802
1803Use <int> as the period for the difference-cover sample. A larger period
1804yields less memory overhead, but may make suffix sorting slower,
1805especially if repeats are present. Must be a power of 2 no greater than
18064096. Default: 1024. This is configured automatically by default; use
1807-a/--noauto to configure manually.
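The constraint on the --dcv value can be checked with a short Python helper. This is a sketch with a hypothetical function name; the text states no lower bound, so none is enforced here beyond requiring a positive power of 2.

```python
def is_valid_dcv(period):
    """True if period is a power of 2 no greater than 4096."""
    is_power_of_two = period > 0 and (period & (period - 1)) == 0
    return is_power_of_two and period <= 4096
```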
1808
1809    --nodc
1810
1811Disable use of the difference-cover sample. Suffix sorting becomes
1812quadratic-time in the worst case (where the worst case is an extremely
1813repetitive reference). Default: off.
1814
1815    -r/--noref
1816
1817Do not build the NAME.3.bt2 and NAME.4.bt2 portions of the index, which
1818contain a bitpacked version of the reference sequences and are used for
1819paired-end alignment.
1820
1821    -3/--justref
1822
1823Build only the NAME.3.bt2 and NAME.4.bt2 portions of the index, which
1824contain a bitpacked version of the reference sequences and are used for
1825paired-end alignment.
1826
1827    -o/--offrate <int>
1828
1829To map alignments back to positions on the reference sequences, it's
1830necessary to annotate ("mark") some or all of the Burrows-Wheeler rows
1831with their corresponding location on the genome. -o/--offrate governs
1832how many rows get marked: the indexer will mark every 2^<int> rows.
1833Marking more rows makes reference-position lookups faster, but requires
1834more memory to hold the annotations at runtime. The default is 5 (every
183532nd row is marked; for human genome, annotations occupy about 340
1836megabytes).
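The memory cost of the marked rows can be estimated with a little arithmetic, sketched below in Python. The function names are hypothetical, and the size of 4 bytes per mark (one 32-bit reference offset) is an assumption that would not hold for large indexes.

```python
def marked_rows(total_rows, offrate):
    """Number of rows marked when every 2^offrate-th row is annotated."""
    step = 1 << offrate
    return -(-total_rows // step)  # ceiling division


def annotation_bytes(total_rows, offrate, bytes_per_mark=4):
    """Approximate memory needed to hold the marks.

    Assumption: each mark stores one 32-bit reference offset.
    """
    return marked_rows(total_rows, offrate) * bytes_per_mark
```

With a hypothetical 2.9 billion rows and the default offrate of 5, annotation_bytes gives 362,500,000 bytes, the same order of magnitude as the figure quoted above. Raising the offrate by one roughly halves that number.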
1837
1838    -t/--ftabchars <int>
1839
1840The ftab is the lookup table used to calculate an initial
1841Burrows-Wheeler range with respect to the first <int> characters of the
1842query. A larger <int> yields a larger lookup table but faster query
1843times. The ftab has size 4^(<int>+1) bytes. The default setting is 10
1844(ftab is 4MB).
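The ftab size formula is easy to check numerically. The Python one-liner below (hypothetical function name) simply evaluates 4^(<int>+1).

```python
def ftab_bytes(ftabchars):
    """Size of the ftab lookup table in bytes: 4^(ftabchars + 1)."""
    return 4 ** (ftabchars + 1)
```

ftab_bytes(10) returns 4194304, i.e. 4 MB, and each additional character multiplies the table size by 4.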
1845
1846    --seed <int>
1847
1848Use <int> as the seed for pseudo-random number generator.
1849
1850    --cutoff <int>
1851
1852Index only the first <int> bases of the reference sequences (cumulative
1853across sequences) and ignore the rest.
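"Cumulative across sequences" means the cutoff is counted over the concatenation of all references, not per sequence. A minimal Python sketch of that reading follows; the function name is hypothetical, and dropping sequences that lie entirely beyond the cutoff is an assumption.

```python
def apply_cutoff(sequences, cutoff):
    """Keep only the first cutoff bases, counted cumulatively across
    (name, sequence) pairs."""
    kept = []
    remaining = cutoff
    for name, seq in sequences:
        if remaining <= 0:
            break  # assumption: later sequences are dropped entirely
        kept.append((name, seq[:remaining]))
        remaining -= len(seq)
    return kept
```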
1854
1855    -q/--quiet
1856
1857bowtie2-build is verbose by default. With this option bowtie2-build will
1858print only error messages.
1859
1860    --threads <int>
1861
By default, bowtie2-build uses only one thread. Increasing the number
of threads will speed up the index building considerably in most cases.
1864
1865    -h/--help
1866
1867Print usage information and quit.
1868
1869    --version
1870
1871Print version information and quit.
1872
1873The bowtie2-inspect index inspector
1874
1875bowtie2-inspect extracts information from a Bowtie index about what kind
1876of index it is and what reference sequences were used to build it. When
1877run without any options, the tool will output a FASTA file containing
1878the sequences of the original references (with all non-A/C/G/T
1879characters converted to Ns). It can also be used to extract just the
1880reference sequence names using the -n/--names option or a more verbose
1881summary using the -s/--summary option.
1882
1883Command Line
1884
1885Usage:
1886
1887    bowtie2-inspect [options]* <bt2_base>
1888
1889Main arguments
1890
1891    <bt2_base>
1892
The basename of the index to be inspected. The basename is the name of any
1894of the index files but with the .X.bt2 or .rev.X.bt2 suffix omitted.
1895bowtie2-inspect first looks in the current directory for the index
1896files, then in the directory specified in the BOWTIE2_INDEXES
1897environment variable.
1898
1899Options
1900
1901    -a/--across <int>
1902
1903When printing FASTA output, output a newline character every <int> bases
1904(default: 60).
1905
1906    -n/--names
1907
1908Print reference sequence names, one per line, and quit.
1909
1910    -s/--summary
1911
1912Print a summary that includes information about index settings, as well
1913as the names and lengths of the input sequences. The summary has this
1914format:
1915
1916    Colorspace  <0 or 1>
1917    SA-Sample   1 in <sample>
1918    FTab-Chars  <chars>
1919    Sequence-1  <name>  <len>
1920    Sequence-2  <name>  <len>
1921    ...
1922    Sequence-N  <name>  <len>
1923
1924Fields are separated by tabs. Colorspace is always set to 0 for Bowtie
19252.
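Because the summary is tab-separated, it is straightforward to load into a script. The Python sketch below parses the format shown above into a dictionary. The function name is hypothetical, and the code assumes the SA-Sample value has the literal form "1 in N".

```python
def parse_inspect_summary(text):
    """Parse the tab-separated summary format into a dict."""
    summary = {"sequences": []}
    for line in text.splitlines():
        if not line.strip():
            continue
        cols = line.split("\t")
        key = cols[0]
        if key == "Colorspace":
            summary["colorspace"] = int(cols[1])
        elif key == "SA-Sample":
            # Assumption: value looks like "1 in 32"; keep the period.
            summary["sa_sample"] = int(cols[1].split()[-1])
        elif key == "FTab-Chars":
            summary["ftab_chars"] = int(cols[1])
        elif key.startswith("Sequence-"):
            summary["sequences"].append((cols[1], int(cols[2])))
    return summary
```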
1926
1927    -v/--verbose
1928
1929Print verbose output (for debugging).
1930
1931    --version
1932
1933Print version information and quit.
1934
1935    -h/--help
1936
1937Print usage information and quit.
1938
1939Getting started with Bowtie 2: Lambda phage example
1940
1941Bowtie 2 comes with some example files to get you started. The example
1942files are not scientifically significant; we use the Lambda phage
1943reference genome simply because it's short, and the reads were generated
1944by a computer program, not a sequencer. However, these files will let
1945you start running Bowtie 2 and downstream tools right away.
1946
1947First follow the manual instructions to obtain Bowtie 2. Set the
1948BT2_HOME environment variable to point to the new Bowtie 2 directory
1949containing the bowtie2, bowtie2-build and bowtie2-inspect binaries. This
1950is important, as the BT2_HOME variable is used in the commands below to
1951refer to that directory.
1952
1953Indexing a reference genome
1954
1955To create an index for the Lambda phage reference genome included with
1956Bowtie 2, create a new temporary directory (it doesn't matter where),
1957change into that directory, and run:
1958
1959    $BT2_HOME/bowtie2-build $BT2_HOME/example/reference/lambda_virus.fa lambda_virus
1960
The command should print many lines of output then quit. When the
command completes, the current directory will contain six new files
that all start with lambda_virus and end with .1.bt2, .2.bt2, .3.bt2,
.4.bt2, .rev.1.bt2, and .rev.2.bt2. These files constitute the index -
you're done!
1966
1967You can use bowtie2-build to create an index for a set of FASTA files
1968obtained from any source, including sites such as UCSC, NCBI, and
1969Ensembl. When indexing multiple FASTA files, specify all the files using
1970commas to separate file names. For more details on how to create an
1971index with bowtie2-build, see the manual section on index building. You
1972may also want to bypass this process by obtaining a pre-built index. See
1973using a pre-built index below for an example.
1974
1975Aligning example reads
1976
1977Stay in the directory created in the previous step, which now contains
1978the lambda_virus index files. Next, run:
1979
1980    $BT2_HOME/bowtie2 -x lambda_virus -U $BT2_HOME/example/reads/reads_1.fq -S eg1.sam
1981
1982This runs the Bowtie 2 aligner, which aligns a set of unpaired reads to
1983the Lambda phage reference genome using the index generated in the
1984previous step. The alignment results in SAM format are written to the
1985file eg1.sam, and a short alignment summary is written to the console.
1986(Actually, the summary is written to the "standard error" or "stderr"
1987filehandle, which is typically printed to the console.)
1988
1989To see the first few lines of the SAM output, run:
1990
1991    head eg1.sam
1992
1993You will see something like this:
1994
1995    @HD VN:1.0  SO:unsorted
1996    @SQ SN:gi|9626243|ref|NC_001416.1|  LN:48502
1997    @PG ID:bowtie2  PN:bowtie2  VN:2.0.1
1998    r1  0   gi|9626243|ref|NC_001416.1| 18401   42  122M    *   0   0   TGAATGCGAACTCCGGGACGCTCAGTAATGTGACGATAGCTGAAAACTGTACGATAAACNGTACGCTGAGGGCAGAAAAAATCGTCGGGGACATTNTAAAGGCGGCGAGCGCGGCTTTTCCG  +"@6<:27(F&5)9"B):%B+A-%5A?2$HCB0B+0=D<7E/<.03#!.F77@6B==?C"7>;))%;,3-$.A06+<-1/@@?,26">=?*@'0;$:;??G+:#+(A?9+10!8!?()?7C>  AS:i:-5 XN:i:0  XM:i:3  XO:i:0  XG:i:0  NM:i:3  MD:Z:59G13G21G26    YT:Z:UU
1999    r2  0   gi|9626243|ref|NC_001416.1| 8886    42  275M    *   0   0   NTTNTGATGCGGGCTTGTGGAGTTCAGCCGATCTGACTTATGTCATTACCTATGAAATGTGAGGACGCTATGCCTGTACCAAATCCTACAATGCCGGTGAAAGGTGCCGGGATCACCCTGTGGGTTTATAAGGGGATCGGTGACCCCTACGCGAATCCGCTTTCAGACGTTGACTGGTCGCGTCTGGCAAAAGTTAAAGACCTGACGCCCGGCGAACTGACCGCTGAGNCCTATGACGACAGCTATCTCGATGATGAAGATGCAGACTGGACTGC (#!!'+!$""%+(+)'%)%!+!(&++)''"#"#&#"!'!("%'""("+&%$%*%%#$%#%#!)*'(#")(($&$'&%+&#%*)*#*%*')(%+!%%*"$%"#+)$&&+)&)*+!"*)!*!("&&"*#+"&"'(%)*("'!$*!!%$&&&$!!&&"(*"$&"#&!$%'%"#)$#+%*+)!&*)+(""#!)!%*#"*)*')&")($+*%%)!*)!('(%""+%"$##"#+(('!*(($*'!"*('"+)&%#&$+('**$$&+*&!#%)')'(+(!%+ AS:i:-14    XN:i:0  XM:i:8  XO:i:0  XG:i:0  NM:i:8  MD:Z:0A0C0G0A108C23G9T81T46 YT:Z:UU
2000    r3  16  gi|9626243|ref|NC_001416.1| 11599   42  338M    *   0   0   GGGCGCGTTACTGGGATGATCGTGAAAAGGCCCGTCTTGCGCTTGAAGCCGCCCGAAAGAAGGCTGAGCAGCAGACTCAAGAGGAGAAAAATGCGCAGCAGCGGAGCGATACCGAAGCGTCACGGCTGAAATATACCGAAGAGGCGCAGAAGGCTNACGAACGGCTGCAGACGCCGCTGCAGAAATATACCGCCCGTCAGGAAGAACTGANCAAGGCACNGAAAGACGGGAAAATCCTGCAGGCGGATTACAACACGCTGATGGCGGCGGCGAAAAAGGATTATGAAGCGACGCTGTAAAAGCCGAAACAGTCCAGCGTGAAGGTGTCTGCGGGCGAT  7F$%6=$:9B@/F'>=?!D?@0(:A*)7/>9C>6#1<6:C(.CC;#.;>;2'$4D:?&B!>689?(0(G7+0=@37F)GG=>?958.D2E04C<E,*AD%G0.%$+A:'H;?8<72:88?E6((CF)6DF#.)=>B>D-="C'B080E'5BH"77':"@70#4%A5=6.2/1>;9"&-H6)=$/0;5E:<8G!@::1?2DC7C*;@*#.1C0.D>H/20,!"C-#,6@%<+<D(AG-).?&#0.00'@)/F8?B!&"170,)>:?<A7#1(A@0E#&A.*DC.E")AH"+.,5,2>5"2?:G,F"D0B8D-6$65D<D!A/38860.*4;4B<*31?6  AS:i:-22    XN:i:0  XM:i:8  XO:i:0  XG:i:0  NM:i:8  MD:Z:80C4C16A52T23G30A8T76A41   YT:Z:UU
    r4  0   gi|9626243|ref|NC_001416.1| 40075   42  184M    *   0   0   GGGCCAATGCGCTTACTGATGCGGAATTACGCCGTAAGGCCGCAGATGAGCTTGTCCATATGACTGCGAGAATTAACNGTGGTGAGGCGATCCCTGAACCAGTAAAACAACTTCCTGTCATGGGCGGTAGACCTCTAAATCGTGCACAGGCTCTGGCGAAGATCGCAGAAATCAAAGCTAAGT  (=8B)GD04*G%&4F,1'A>.C&7=F$,+#6!))43C,5/5+)?-/0>/D3=-,2/+.1?@->;)00!'3!7BH$G)HG+ADC'#-9F)7<7"$?&.>0)@5;4,!0-#C!15CF8&HB+B==H>7,/)C5)5*+(F5A%D,EA<(>G9E0>7&/E?4%;#'92)<5+@7:A.(BG@BG86@.G AS:i:-1 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:77C106 YT:Z:UU
2002    r5  0   gi|9626243|ref|NC_001416.1| 48010   42  138M    *   0   0   GTCAGGAAAGTGGTAAAACTGCAACTCAATTACTGCAATGCCCTCGTAATTAAGTGAATTTACAATATCGTCCTGTTCGGAGGGAAGAACGCGGGATGTTCATTCTTCATCACTTTTAATTGATGTATATGCTCTCTT  9''%<D)A03E1-*7=),:F/0!6,D9:H,<9D%:0B(%'E,(8EFG$E89B$27G8F*2+4,-!,0D5()&=(FGG:5;3*@/.0F-G#5#3->('FDFEG?)5.!)"AGADB3?6(@H(:B<>6!>;>6>G,."?%  AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:138    YT:Z:UU
2003    r6  16  gi|9626243|ref|NC_001416.1| 41607   42  72M2D119M   *   0   0   TCGATTTGCAAATACCGGAACATCTCGGTAACTGCATATTCTGCATTAAAAAATCAACGCAAAAAATCGGACGCCTGCAAAGATGAGGAGGGATTGCAGCGTGTTTTTAATGAGGTCATCACGGGATNCCATGTGCGTGACGGNCATCGGGAAACGCCAAAGGAGATTATGTACCGAGGAAGAATGTCGCT 1H#G;H"$E*E#&"*)2%66?=9/9'=;4)4/>@%+5#@#$4A*!<D=="8#1*A9BA=:(1+#C&.#(3#H=9E)AC*5,AC#E'536*2?)H14?>9'B=7(3H/B:+A:8%1-+#(E%&$$&14"76D?>7(&20H5%*&CF8!G5B+A4F$7(:"'?0$?G+$)B-?2<0<F=D!38BH,%=8&5@+ AS:i:-13    XN:i:0  XM:i:2  XO:i:1  XG:i:2  NM:i:4  MD:Z:72^TT55C15A47  YT:Z:UU
2004    r7  16  gi|9626243|ref|NC_001416.1| 4692    42  143M    *   0   0   TCAGCCGGACGCGGGCGCTGCAGCCGTACTCGGGGATGACCGGTTACAACGGCATTATCGCCCGTCTGCAACAGGCTGCCAGCGATCCGATGGTGGACAGCATTCTGCTCGATATGGACANGCCCGGCGGGATGGTGGCGGGG -"/@*7A0)>2,AAH@&"%B)*5*23B/,)90.B@%=FE,E063C9?,:26$-0:,.,1849'4.;F>FA;76+5&$<C":$!A*,<B,<)@<'85D%C*:)30@85;?.B$05=@95DCDH<53!8G:F:B7/A.E':434> AS:i:-6 XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:98G21C22   YT:Z:UU
2005
2006The first few lines (beginning with @) are SAM header lines, and the
2007rest of the lines are SAM alignments, one line per read or mate. See the
2008Bowtie 2 manual section on SAM output and the SAM specification for
2009details about how to interpret the SAM file format.
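The CIGAR, MD:Z and NM:i fields in the records above are mutually consistent, which can be checked with a short Python sketch. The function names are hypothetical; the code assumes well-formed CIGAR and MD strings and is not a validator.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
MD_TOKEN = re.compile(r"(\d+)|\^([A-Za-z]+)|([A-Za-z])")


def cigar_lengths(cigar):
    """Return (read_bases, reference_bases) covered by a CIGAR string."""
    read_len = ref_len = 0
    for count, op in CIGAR_OP.findall(cigar):
        n = int(count)
        if op in "MIS=X":
            read_len += n
        if op in "MDN=X":
            ref_len += n
    return read_len, ref_len


def md_counts(md):
    """Return (mismatches, deleted_bases) encoded in an MD:Z string."""
    mismatches = deleted = 0
    for _matched, deletion, mismatch in MD_TOKEN.findall(md):
        if deletion:
            deleted += len(deletion)
        elif mismatch:
            mismatches += 1
    return mismatches, deleted


def edit_distance(cigar, md):
    """NM = mismatches + deleted bases (MD) + inserted bases (CIGAR)."""
    inserted = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op == "I")
    mismatches, deleted = md_counts(md)
    return mismatches + deleted + inserted
```

For read r6 above (CIGAR 72M2D119M, MD:Z:72^TT55C15A47), md_counts reports 2 mismatches and 2 deleted bases, and edit_distance returns 4, in agreement with XM:i:2 and NM:i:4 on that line.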
2010
2011Paired-end example
2012
2013To align paired-end reads included with Bowtie 2, stay in the same
2014directory and run:
2015
2016    $BT2_HOME/bowtie2 -x lambda_virus -1 $BT2_HOME/example/reads/reads_1.fq -2 $BT2_HOME/example/reads/reads_2.fq -S eg2.sam
2017
2018This aligns a set of paired-end reads to the reference genome, with
2019results written to the file eg2.sam.
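A quick way to see how the pairs aligned is to tally the YT:Z field over the alignment lines of the output. The Python sketch below uses a hypothetical function name and works on any iterable of SAM lines, so it can be tried without running the aligner.

```python
from collections import Counter


def tally_pair_types(sam_lines):
    """Count YT:Z values (UU, CP, DP, UP) over SAM alignment lines."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        for field in line.rstrip("\n").split("\t")[11:]:
            if field.startswith("YT:Z:"):
                counts[field[5:]] += 1
    return counts
```

Passing an open handle on eg2.sam to tally_pair_types would return one count per pair type; the actual numbers depend on the Bowtie 2 version and options used.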
2020
2021Local alignment example
2022
2023To use local alignment to align some longer reads included with Bowtie
20242, stay in the same directory and run:
2025
2026    $BT2_HOME/bowtie2 --local -x lambda_virus -U $BT2_HOME/example/reads/longreads.fq -S eg3.sam
2027
2028This aligns the long reads to the reference genome using local
2029alignment, with results written to the file eg3.sam.
2030
2031Using SAMtools/BCFtools downstream
2032
2033SAMtools is a collection of tools for manipulating and analyzing SAM and
2034BAM alignment files. BCFtools is a collection of tools for calling
2035variants and manipulating VCF and BCF files, and it is typically
2036distributed with SAMtools. Using these tools together allows you to get
2037from alignments in SAM format to variant calls in VCF format. This
2038example assumes that samtools and bcftools are installed and that the
2039directories containing these binaries are in your PATH environment
2040variable.
2041
2042Run the paired-end example:
2043
2044    $BT2_HOME/bowtie2 -x $BT2_HOME/example/index/lambda_virus -1 $BT2_HOME/example/reads/reads_1.fq -2 $BT2_HOME/example/reads/reads_2.fq -S eg2.sam
2045
2046Use samtools view to convert the SAM file into a BAM file. BAM is the
2047binary format corresponding to the SAM text format. Run:
2048
2049    samtools view -bS eg2.sam > eg2.bam
2050
2051Use samtools sort to convert the BAM file to a sorted BAM file.
2052
2053    samtools sort eg2.bam -o eg2.sorted.bam
2054
We now have a sorted BAM file called eg2.sorted.bam. Sorted BAM is a
useful format because the alignments are (a) compressed, which is
convenient for long-term storage, and (b) sorted, which is convenient
for variant discovery. To generate variant calls in VCF format, run:
2059
2060    bcftools mpileup -f $BT2_HOME/example/reference/lambda_virus.fa eg2.sorted.bam | bcftools view -Ov - > eg2.raw.bcf
2061
2062Then to view the variants, run:
2063
2064    bcftools view eg2.raw.bcf
2065
2066See the official SAMtools guide to Calling SNPs/INDELs with
2067SAMtools/BCFtools for more details and variations on this process.
2068