1*** RazerS 3 - Faster, fully sensitive read mapping ***
2http://www.seqan.de/projects/razers.html
3
4------------------------------------------------------------------------------
5Table of Contents
6------------------------------------------------------------------------------
7 1. Overview
8 2. Installation
9 3. Usage
10 4. Output Formats
11 5. Examples
12 6. Contact
13 7. References
14
15------------------------------------------------------------------------------
161. Overview
17------------------------------------------------------------------------------
18
19RazerS 3 is a tool for mapping millions of short genomic reads onto a
reference genome. It was designed with a focus on mapping next-generation
21sequencing reads onto whole DNA genomes. RazerS 3 searches for matches of
22reads with a percent identity above a given threshold (-i X), whereby it
23detects alignments with mismatches as well as gaps.
24
25RazerS 3 consists of a filtration part, in which a k-mer filter scans the
26genome for regions that possibly contain read matches, and a verification
27part, where results from the filtration are then subjected to a verification
28algorithm. The user can choose between two filters: (1) a seed-based filter
29based on the pigeonhole principle or (2) a k-mer counting filter based on the
30SWIFT algorithm (Rasmussen et al., 2006). The pigeonhole filter (default) is
31faster for a broad range of read sets and error rates, whereas the swift
32filter (-fl swift) is faster for short reads (<50bp) and high error rates
33(10-20%).
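
The pigeonhole argument behind filter (1) can be illustrated with a small
Python sketch. This is not the RazerS 3 implementation: it assumes mismatches
only (no gaps) and non-overlapping seeds, and the helper names
split_into_seeds and error_free_seeds are hypothetical. If a read with at most
e errors is cut into e+1 seeds, at least one seed must be error-free.

```python
def split_into_seeds(read_length, max_errors):
    """Cut [0, read_length) into max_errors + 1 contiguous, non-overlapping seeds."""
    parts = max_errors + 1
    bounds = [i * read_length // parts for i in range(parts + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def error_free_seeds(read, genome_window, max_errors):
    """Return the seeds of `read` that match an equally long genome window exactly."""
    seeds = split_into_seeds(len(read), max_errors)
    return [(b, e) for b, e in seeds if read[b:e] == genome_window[b:e]]


read = "ACGTACGTACGT"
window = "ACGAACGTACCT"  # made-up window with two mismatches (positions 3 and 10)
print(error_free_seeds(read, window, max_errors=2))  # prints [(4, 8)]
```

Only the middle seed survives, which is enough for a filter to report the
window as a match candidate for verification.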
34
Both filters can be run in fully sensitive mode (-rr 100), i.e. given a
maximal error rate they will output every match as a match candidate, or in
lossy mode with a user-defined sensitivity (-rr X) at higher speed. To meet
the specified minimal sensitivity, RazerS 3 computes the expected loss rates
of different filter settings, based on the base-call qualities of the reads
and a user-defined mutation rate, and chooses the fastest setting.
41
42To verify the found candidates, we devised a banded version of the
43bit-parallel approximate string search algorithm proposed by Myers (1999). The
44found matches are recorded, duplicate-filtered, and ranked by the number of
45errors (and deviation from a given paired-end insert size). At the end, the
results are written to an output file (-o FILENAME). Among other formats,
RazerS 3 supports a very efficient native format (.razers) and the commonly
used SAM and BAM formats (.sam or .bam).
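
For orientation, the classic (unbanded) bit-parallel search of Myers (1999)
can be sketched in Python as follows. This is only an illustration of the
underlying technique, not the banded variant used by RazerS 3; the function
name myers_search is hypothetical, and Python's unbounded integers stand in
for machine words.

```python
def myers_search(pattern, text, max_errors):
    """Return (end_position, errors) for every text position at which
    `pattern` ends with edit distance <= max_errors."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must not be empty")
    mask = (1 << m) - 1
    top = 1 << (m - 1)

    # Bit i of peq[c] is set if pattern[i] == c.
    peq = {}
    for i, c in enumerate(pattern):
        peq[c] = peq.get(c, 0) | (1 << i)

    # pv/mv encode vertical +1/-1 differences of the current DP column;
    # score is the value in the last row of that column.
    pv, mv, score = mask, 0, m
    hits = []
    for j, c in enumerate(text):
        eq = peq.get(c, 0)
        xv = eq | mv
        xh = ((((eq & pv) + pv) ^ pv) | eq) & mask
        ph = (mv | ~(xh | pv)) & mask
        mh = pv & xh
        if ph & top:
            score += 1
        elif mh & top:
            score -= 1
        # Shift without setting bit 0: a match may start anywhere in the text.
        ph = (ph << 1) & mask
        mh = (mh << 1) & mask
        pv = (mh | ~(xv | ph)) & mask
        mv = ph & xv
        if score <= max_errors:
            hits.append((j, score))
    return hits


print(myers_search("ACGT", "TTACGTTT", 0))  # prints [(5, 0)]
```

Each text character is processed with a constant number of word operations,
which is what makes this kind of verification fast for short reads.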
49
50------------------------------------------------------------------------------
512. Installation
52------------------------------------------------------------------------------
53
To install RazerS 3, you can either compile the latest version from the Git
repository or use a precompiled binary.
56
57------------------------------------------------------------------------------
582.1. Compilation from source code
59------------------------------------------------------------------------------
60
Follow the "Getting Started" section on http://trac.seqan.de/wiki and clone
the latest Git repository. Instead of creating a project file in Debug mode,
switch to Release mode (-DCMAKE_BUILD_TYPE=Release) and compile razers3. This
can be done as follows:
65
66 1) git clone https://github.com/seqan/seqan.git
 2) mkdir seqan/build; cd seqan/build
68 3) cmake .. -DCMAKE_BUILD_TYPE=Release
69 4) make razers3
70 5) ./bin/razers3 --help
71
If RazerS 3 will be run on the same machine it is compiled on, consider
optimizing for the given system architecture. For gcc or llvm/clang
compilers, this can be done with:
75
76 cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS:STRING="-march=native"
77 make razers3
78
79After compilation, copy the binary to a folder in your PATH variable, e.g.
80/usr/local/bin:
81
82 sudo cp bin/razers3 /usr/local/bin
83
84
85------------------------------------------------------------------------------
862.2. Precompiled binaries
87------------------------------------------------------------------------------
88
We also provide a precompiled binary of RazerS 3 for 64-bit Linux. It was
successfully tested on Debian GNU/Linux 6.0.5 (squeeze) and Ubuntu 12.04.
Please download the binary from: http://www.seqan.de/projects/razers.html
92
93------------------------------------------------------------------------------
943. Usage
95------------------------------------------------------------------------------
96
97To get a short usage description of RazerS 3, you can execute razers3 -h or
98razers3 --help.
99
100Usage: razers3 [OPTION]... <GENOME FILE> <READS FILE>
101 razers3 [OPTION]... <GENOME FILE> <PE-READS FILE1> <PE-READS FILE2>
102
103RazerS 3 expects the names of two or three FASTA or FASTQ files. The first
104contains a reference genome and the second (and third) contain genomic reads
105that should be mapped onto the reference. If two read files are given, both
106have to contain exactly the same number of reads, which are considered as read
107pairs.
108
109------------------------------------------------------------------------------
1103.1. Main Options
111------------------------------------------------------------------------------
112
113 [ -fl STRING ], [ --filter STRING ]
114
115 Use the seed-based pigeonhole filter (-fl pigeonhole, default) or the k-mer
 counting SWIFT filter (-fl swift). See Section 3.3 for filter-specific
117 parameters.
118
119 [ -tc NUM ], [ --thread-count NUM ]
120
121 Use NUM threads (default: 1). Set to 0 to use the old RazerS 1/2 code path.
122
123 [ -i NUM ], [ --percent-identity NUM ]
124
125 Set the percent identity threshold. NUM must be a value between 50 and
126 100 (default is 95). RazerS 3 searches for matches with a percent identity
127 of at least NUM. A match of a read R with e errors has percent identity
128 of 100*(1 - e/|R|), whereby |R| is the read length. In other words, a
129 read is allowed to have not more than |R|*(100-NUM)/100 errors.
 The maximal error rate is the complement of the identity threshold,
 e.g. an error rate of 4% corresponds to an identity of 96%.
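
 The two formulas above can be checked with a few lines of Python. This is a
 sketch only: the function names are hypothetical, and rounding the error
 bound down to a whole number of errors is an assumption of this example.

```python
def percent_identity(read_length, errors):
    """Percent identity of a match with `errors` errors: 100*(1 - e/|R|)."""
    return 100.0 * (1.0 - errors / read_length)


def max_errors(read_length, min_identity):
    """Largest whole number of errors not exceeding |R|*(100-NUM)/100."""
    return int((read_length * (100 - min_identity)) // 100)


print(max_errors(100, 96))  # prints 4
print(max_errors(36, 95))   # prints 1
```

 For a 36 bp read and -i 95, one error (about 97.2% identity) is allowed,
 whereas two errors (about 94.4% identity) fall below the threshold.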
132
133 [ -rr NUM ], [ --recognition-rate NUM ]
134
 Set the percent recognition rate. NUM must be a value between 80 and 100
 (default is 99). The recognition rate controls the sensitivity of RazerS 3:
 the higher the recognition rate, the more sensitive the mapping; the lower
 the recognition rate, the faster RazerS 3 runs. A value of 100 corresponds
 to a lossless read mapping. The recognition rate is the expected fraction
 of matches RazerS 3 will find compared to a lossless mapping. Depending on
 the desired recognition rate, the percent identity, and the read length,
 the filter is configured to run as fast as possible. For this purpose,
 RazerS 3 either computes the sensitivities of different pigeonhole filter
 settings at runtime or (due to the much larger search space) uses
 precomputed sensitivities of the SWIFT filter. The latter are precompiled
 in RazerS 3 but can be replaced by user-specific settings in a parameter
 directory (-pd).
 The recognition rate (and with it the automatic sensitivity control) is
 disabled if filtration parameters, i.e. overlap length (-ol), shape (-s),
 or minimum threshold (-t), are set manually.
151
152 [ -ng ], [ --no-gaps ]
153
154 Consider only mismatches as errors (Hamming distance). By default,
155 insertions, deletions and mismatches are considered as errors (edit
156 distance).
157
158 [ -f ], [ --forward ]
159
160 Only map reads onto the positive/forward strand of the genome. By
161 default, both strands are scanned.
162
163 [ -r ], [ --reverse ]
164
165 Only map reads onto the negative/reverse-complement strand of the
166 genome. By default, both strands are scanned.
167
168 [ -m NUM ], [ --max-hits NUM ]
169
170 Output at most NUM of the best matches, default value is 100.
171
172 [ --unique ]
173
174 Output only unique best matches (like ELAND). This flag is equivalent to
175 '-m 1 -dr 0 -pa'.
176
177 [ -tr NUM ], [ --trim-reads NUM ]
178
179 Trim reads to length NUM bp.
180
181 [ -ll NUM ], [ --library-length NUM ]
182
183 Set the mean library size, default is 220. The library size is the outer
184 distance of the two mapped reads of a read pair. This value is used only for
185 paired-end read mapping.
186
187 [ -le NUM ], [ --library-error NUM ]
188
189 Set the tolerated absolute deviation of the library size, default value is
190 50. This value is used only for paired-end read mapping.
191
192 [ -o FILE ], [ --output FILE ]
193
194 Change the output filename to FILE. By default, this is the (first) read
195 filename extended by the suffix '.razers'.
196
197 [ -v ], [ --verbose ]
198
199 Verbose. Print extra information and running times.
200
201 [ -vv ], [ --vverbose ]
202
203 Very verbose. Like -v, but also print filtering statistics like number of
204 candidates and successful verifications.
205
206 [ --version ]
207
208 Print version information.
209
210 [ -h ], [ --help ]
211
212 Print a brief usage summary.
213
214------------------------------------------------------------------------------
2153.2. Output Format Options
216------------------------------------------------------------------------------
217
218 [ -a ], [ --alignment ]
219
220 Dump the alignment for each match in the ".result" file (only for razer or
221 fasta format). The alignment is written directly after the match and has the
222 following format:
223 #Read: CAGGAGATAAGCTGGATCGTTTACGGT
224 #Genome: CAGGAGATAAGC-GGATCTTTTACG--
225
226 [ -pa ], [ --purge-ambiguous ]
227
228 Omit reads with more than <max-hits> matches.
229
230 [ -dr NUM ], [ --distance-range NUM ]
231
232 If the best match of a read has E errors, only consider hits with
233 E <= X <= E+NUM errors as matches. Disabled by default.
234
235 [ -gn NUM ], [ --genome-naming NUM ]
236
237 Select how genomes are named in the output file. If NUM is 0, the Fasta
238 ids of the genome sequences are used (default). If NUM is 1, the genome
239 sequences are enumerated beginning with 1.
240
241 [ -rn NUM ], [ --read-naming NUM ]
242
243 Select how reads are named in the output file. If NUM is 0, the Fasta ids
244 of the reads are used (default). If NUM is 1, the reads are enumerated
245 beginning with 1. If NUM is 2, the read sequence itself is used. If NUM is
246 3, Fasta ids are used without a /L or /R suffix in paired-end mode.
247
248 [ --full-readid ]
249
 Use the whole Fasta id of each read in the output file. By default, only the
 prefix up to (but excluding) the first space is used.
252
253 [ -so NUM ], [ --sort-order NUM ]
254
255 Select how matches are ordered in the output file.
256 If NUM is 0, matches are sorted primarily by the read number and
257 secondarily by their position in the genome sequence (default).
258 If NUM is 1, matches are sorted primarily by their position in the genome
259 sequence and secondarily by the read number.
260
261 [ -pf NUM ], [ --position-format NUM ]
262
263 Select how positions are stored in the output file.
 If NUM is 0, the gap space is used, i.e. gaps around characters are
 enumerated beginning with 0 and the beginning and end positions are the
 positions of the gaps before and after a match (default).
 If NUM is 1, the position space is used, i.e. characters are enumerated
 beginning with 1 and the beginning and end positions are the positions of
 the first and last character involved in a match.
270
271 Example: Consider the string CONCAT. The beginning and end positions
272 of the substring CAT are (3,6) in gap space and (4,6) in position space.
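
 The CONCAT example can be reproduced with a short Python sketch. The
 function names are hypothetical; the conversion simply follows the two
 definitions above.

```python
def find_in_gap_space(text, substring):
    """Gap-space (begin, end) of the first occurrence of `substring`."""
    begin = text.index(substring)
    return begin, begin + len(substring)


def gap_to_position_space(begin, end):
    """Gap before the match -> first character; gap after -> last character."""
    return begin + 1, end


def position_to_gap_space(begin, end):
    """Inverse of gap_to_position_space."""
    return begin - 1, end


gap = find_in_gap_space("CONCAT", "CAT")
print(gap)                          # prints (3, 6)
print(gap_to_position_space(*gap))  # prints (4, 6)
```

 In gap space the match length is simply end - begin, which is one reason
 this representation is convenient for further processing.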
273
274------------------------------------------------------------------------------
2753.3. Filtration Options
276------------------------------------------------------------------------------
277
278 [ -fl STRING ], [ --filter STRING ]
279
280 Described in section 3.1.
281
282 [ -mr NUM ], [ --mutation-rate NUM ]
283
 Set the percent mutation rate used by the pigeonhole sensitivity estimation.
 The mutation rate specifies the rate of differences between the sequenced
 genome and the reference genome, i.e. all errors except sequencing errors.
 These errors include small variants (SNPs, indels) and errors in the
 assembly of the reference. Default value is 5 (=5%).
289
 [ -ol NUM ], [ --overlap-length NUM ]
291
292 Manually set the overlap length of adjacent k-mer seeds used in the
293 pigeonhole filter. If the overlap is 0, non-overlapping k-mers of the
294 specified shape (-s) are used. For overlaps of NUM > 0, the shape is
295 extended to the right by NUM characters that overlap with the next seed.
296 The seed positions in the reads are not affected by the overlap.
297 This option disables the automatic sensitivity control.
298
299 [ -pd DIR ], [ --param-dir DIR ]
300
301 Read user-computed parameter files of the SWIFT filter given in the
302 directory <DIR>. These parameters can be computed based on a machine
303 specific error distribution file and the param_chooser tool.
304
305 [ -t NUM ], [ --threshold NUM ]
306
 Depending on the percent identity and the read length, the SWIFT filter
 computes for each read a threshold of common k-mers between read and
 reference genome. These thresholds determine the filtration strictness and
 are crucial to the overall running time. With this option, the threshold
 values can be raised manually to a minimum value to reduce the running
 time at the cost of mapping sensitivity. All threshold values smaller than
 NUM are raised to NUM. The default value is 1.
314 This option disables the automatic sensitivity control.
315
316 [ -tl NUM ], [ --taboo-length NUM ]
317
 The taboo length is the minimal distance two k-mers must have in the
319 reference genome when counting common k-mers between reads and reference
320 genome (default is 1).
321
322 [ -s BITSTRING ], [ --shape BITSTRING ]
323
324 Define the k-mer shape. BITSTRING must be a sequence of bits beginning
325 and ending with 1, e.g. 1111111001101. A '1' defines a relevant and a
 '0' an irrelevant position. Two k-mers are equal if all characters at
 relevant positions are equal.
328 This option disables the automatic sensitivity control.
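
 How a gapped shape compares two k-mers can be sketched in Python. The helper
 names are hypothetical and the code only mirrors the definition above; it
 says nothing about how RazerS 3 indexes k-mers internally.

```python
def relevant_positions(bitstring):
    """Indices of the '1' positions; the shape must begin and end with '1'."""
    if (not bitstring or set(bitstring) - {"0", "1"}
            or bitstring[0] != "1" or bitstring[-1] != "1"):
        raise ValueError("shape must consist of 0/1 and begin and end with 1")
    return [i for i, bit in enumerate(bitstring) if bit == "1"]


def equal_under_shape(kmer_a, kmer_b, bitstring):
    """True if both k-mers agree at every relevant position of the shape."""
    if len(kmer_a) != len(bitstring) or len(kmer_b) != len(bitstring):
        raise ValueError("k-mers must span the whole shape")
    return all(kmer_a[i] == kmer_b[i] for i in relevant_positions(bitstring))


print(equal_under_shape("ACGT", "ACTT", "1101"))  # prints True
print(equal_under_shape("ACGT", "AGGT", "1101"))  # prints False
```

 The first pair differs only at the irrelevant third position, so the two
 k-mers count as equal under the shape 1101.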
329
330 [ -oc NUM ], [ --overabundance-cut NUM ]
331
332 Remove overabundant read k-mers from the k-mer index. k-mers with a
333 relative abundance above NUM are removed. NUM must be a value between
334 0 (remove all) and 1 (remove nothing, default).
335
336 [ -rl NUM ], [ --repeat-length NUM ]
337
338 The repeat length is the minimal length a simple-repeat in the
339 genome sequence must have to be filtered out by the repeat masker of
340 RazerS 3. Simple repeats are tandem repeats of only one repeated
341 character, e.g. AAAAA. Independently of this parameter, N characters in
342 the genome are filtered out automatically. Default value is 1000.
343
344 [ -lf NUM ], [ --load-factor NUM ]
345
346 Set the load factor for the open addressing k-mer index. Defines how many
347 entries should be used in the hash table. If the index stores at most X
348 different k-mers, X * NUM entries will be reserved for the hash table.
349
350------------------------------------------------------------------------------
3513.4. Verification Options
352------------------------------------------------------------------------------
353
354 [ -mN ], [ --match-N ]
355
 By default, an 'N' character in a read or genome sequence matches nothing,
 not even another 'N', and is counted as an error. With this option, 'N'
 matches every character and produces no mismatch in the verification
 process. The filtration is not affected by this option.
360
361 [ -ed FILE ], [ --error-distr FILE ]
362
363 Produce an error distribution file containing the relative frequencies of
364 mismatches for each read position. If the "--indels" option is given, the
365 relative frequencies of insertions and deletions are also recorded.
366
367 [ -mf FILE ], [ --mismatch-file FILE ]
368
 Produce a mismatch file containing, for each read alignment, a line of
 tab-separated 0's and 1's representing a match (0) or a mismatch (1).
371
372
373------------------------------------------------------------------------------
3744. Output Formats
375------------------------------------------------------------------------------
376
RazerS 3 currently supports five output formats, which are chosen
automatically from the output filename suffix.
379
380 .razers = Razer Format
381 .fa|.fasta = Enhanced Fasta Format
382 .eland = Eland Format
 .sam|.bam = Sequence Alignment/Map Format (SAM or BAM)
384 .afg = AMOS assembler format
385
386------------------------------------------------------------------------------
3874.1. Razer Format
388------------------------------------------------------------------------------
389
390The output file is a text file whose lines represent matches. A line
391consists of different tab separated match values in the following format:
392
393RNAME RBEGIN REND GSTRAND GNAME GBEGIN GEND PERCID [PAIRID PAIRSCR MATEDIST]
394
395Match value description:
396
397 RNAME Name of the read sequence (see --read-naming)
398 RBEGIN Beginning position in the read sequence (0/1 see -pf option)
399 REND End position in the read sequence (length of the read)
400 GSTRAND 'F'=forward strand or 'R'=reverse strand
401 GNAME Name of the genome sequence (see --genome-naming)
402 GBEGIN Beginning position in the genome sequence
403 GEND End position in the genome sequence
404 PERCID Percent identity (see --percent-identity)
405
406For paired-end read mapping 3 additional values are dumped:
407
408 PAIRID Unique number to identify the two corresponding mate matches
409 PAIRSCR The sum of the negative number of errors in both matches
410 MATEDIST Relative outer distance to the mate match
411 The absolute value is the insert size
412
413For matches on the reverse strand, GBEGIN and GEND are positions on the
414related forward strand. It holds GBEGIN < GEND, regardless of GSTRAND.
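
A line in this format can be split into named fields with a few lines of
Python. This is a minimal sketch: the class and function names are
hypothetical, the sample line is made up, and parsing PERCID as a float and
the paired-end values as integers are assumptions of this example.

```python
from typing import NamedTuple, Optional


class RazerMatch(NamedTuple):
    rname: str
    rbegin: int
    rend: int
    gstrand: str
    gname: str
    gbegin: int
    gend: int
    percid: float
    pairid: Optional[int] = None    # only present for paired-end mapping
    pairscr: Optional[int] = None
    matedist: Optional[int] = None


def parse_razer_line(line):
    """Parse one tab-separated match line (8 fields, or 11 for read pairs)."""
    f = line.rstrip("\n").split("\t")
    if len(f) not in (8, 11):
        raise ValueError("expected 8 or 11 tab-separated fields, got %d" % len(f))
    paired = [int(x) for x in f[8:]]
    return RazerMatch(f[0], int(f[1]), int(f[2]), f[3], f[4],
                      int(f[5]), int(f[6]), float(f[7]), *paired)


# Made-up example line, for illustration only.
line = "\t".join(["read1", "0", "36", "F", "chr1", "1000", "1036", "100"])
match = parse_razer_line(line)
print(match.gname, match.gbegin, match.gend)  # prints chr1 1000 1036
```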
415
416------------------------------------------------------------------------------
4174.2. Enhanced Fasta Format
418------------------------------------------------------------------------------
419
420The matches are stored in the same order as in the Razer format. Each match
421is stored in two lines:
422
423>GBEGIN,GEND[KEY1=VALUE1,KEY2=VALUE2,...]
424READSEQ
425
426Match value description:
427
428 GBEGIN Beginning position in the genome sequence
429 GEND End position in the genome sequence
430 READSEQ Read sequence.
431
432The following keys are output:
433
434 id ID value of the input file Fasta header (>..[id=ID,..]..)
435 fragId Fragment ID value (>..[..,fragId=FRAGID,..]..)
436 contigId Name of the genome sequence (see --genome-naming)
437 errors Absolute numbers of errors in this match
438 percId Percent identity (see --percent-identity)
439 ambiguity Number of matches of this read as good as or better than this
440
If the ID or fragment ID values of a read cannot be found in the reads file,
the read number (beginning with 0) is used instead. For matches on the
reverse strand, GBEGIN and GEND are positions on the related forward strand
and GBEGIN > GEND.
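
A header line of this format can be taken apart with a regular expression.
The Python sketch below is an illustration only: the function name and the
sample header are made up, and it assumes that keys and values contain no
commas or closing brackets.

```python
import re

# >GBEGIN,GEND[KEY1=VALUE1,KEY2=VALUE2,...]
HEADER = re.compile(r"^>(\d+),(\d+)\[(.*)\]$")


def parse_match_header(line):
    """Return (gbegin, gend, attributes) for one enhanced Fasta header line."""
    m = HEADER.match(line.strip())
    if m is None:
        raise ValueError("not an enhanced Fasta match header: %r" % line)
    attributes = dict(item.split("=", 1) for item in m.group(3).split(",") if item)
    return int(m.group(1)), int(m.group(2)), attributes


# Made-up example header, for illustration only.
header = ">1000,1036[id=read1,fragId=0,contigId=chr1,errors=1,percId=97.2,ambiguity=1]"
gbegin, gend, attrs = parse_match_header(header)
print(gbegin, gend, attrs["contigId"], attrs["errors"])  # prints 1000 1036 chr1 1
```

Since GBEGIN > GEND marks a reverse-strand match in this format, the strand
can be derived by comparing the two returned positions.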
445
446------------------------------------------------------------------------------
4474.3. Eland Format
448------------------------------------------------------------------------------
449
450Each line of the output file corresponds to a read appearing in the same
451order as they are stored in the reads file. A line consists of the following
452tab separated values:
453
454>RNAME READSEQ TYPE N0 N1 N2 GNAME GBEGIN* GSTRAND '..' SUBST1 SUBST2 ...
455
456Additional value description:
457
458 TYPE NM = No match found
459 QC = No matching done (too many Ns in read sequence)
460 Ux = Best match found was unique with x errors
461 Rx = Multiple best matches found having x errors
462 N0 N1 N2 Number of exact, 1-error, and 2-error matches
463 GBEGIN* Minimum of GBEGIN and GEND
464 SUBSTx Position and type of the x'th mismatch (not for --indels)
465 (e.g. 12A: 12'th base was A in the genome)
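
The TYPE column can be decoded as follows. This Python sketch only restates
the table above; the function name and the returned labels are hypothetical.

```python
def describe_eland_type(type_field):
    """Return (description, errors) for an Eland TYPE value such as U0 or R2."""
    if type_field == "NM":
        return ("no match found", None)
    if type_field == "QC":
        return ("no matching done", None)
    kind, errors = type_field[:1], type_field[1:]
    if kind in ("U", "R") and errors.isdigit():
        label = "unique best match" if kind == "U" else "multiple best matches"
        return (label, int(errors))
    raise ValueError("unknown TYPE value: %r" % type_field)


print(describe_eland_type("U0"))  # prints ('unique best match', 0)
print(describe_eland_type("R2"))  # prints ('multiple best matches', 2)
```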
466
467------------------------------------------------------------------------------
4684.4. SAM or BAM Output Format
469------------------------------------------------------------------------------
470
The SAM format has established itself as the standard output format for
read alignments. Although SAM is capable of representing a multiple alignment
between reads and contig (with paddings), RazerS 3 only outputs pairwise
alignments, as this has become the de facto standard.
The BAM format is the compressed binary representation of SAM (it uses BGZF,
a gzip-compatible block compression). Whenever possible, the more compact BAM
format should be preferred over SAM.
477
478See http://samtools.sourceforge.net/ for more details.
479
480------------------------------------------------------------------------------
4814.5. AMOS Output Format
482------------------------------------------------------------------------------
483
484The AMOS assembly format (aka AFG format) is used by the AMOS assembler and
485represents a multiple global alignment between reads and contig and also
486stores the consensus sequences (including gaps).
487
488See http://www.cbcb.umd.edu/research/contig_representation.shtml#AMOS for more
489details.
490
491------------------------------------------------------------------------------
4925. Examples
493------------------------------------------------------------------------------
494
495To map single-end reads with 4% error rate using 12 threads call:
496
497 razers3 -i 96 -tc 12 -o map.result hg18.fa reads.fq
498
499To map paired-end reads with up to 6% errors, 95% sensitivity, 12 threads, and
500only output aligned pairs with an outer distance of 200-360bp call:
501
 razers3 -i 94 -rr 95 -tc 12 -ll 280 -le 80 -o map.result hg18.fa r1.fq r2.fq
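
The library options in the paired-end example follow from the desired
distance range: the mean goes to -ll and the tolerated deviation to -le. The
Python sketch below shows the arithmetic; the function names are hypothetical
and whole-number distances are an assumption of this example.

```python
def library_options(min_distance, max_distance):
    """Return (library_length, library_error) covering [min, max] outer distance."""
    if min_distance > max_distance:
        raise ValueError("min_distance must not exceed max_distance")
    mean = (min_distance + max_distance) // 2
    # Use the larger deviation so that both ends of the range stay covered.
    error = max(mean - min_distance, max_distance - mean)
    return mean, error


def accepted_range(library_length, library_error):
    """Outer distances accepted for given -ll and -le values."""
    return library_length - library_error, library_length + library_error


ll, le = library_options(200, 360)
print(["-ll", str(ll), "-le", str(le)])  # prints ['-ll', '280', '-le', '80']
```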
503
504------------------------------------------------------------------------------
5056. Contact
506------------------------------------------------------------------------------
507
508For questions or comments, contact:
509 David Weese <david.weese@fu-berlin.de>
510
511------------------------------------------------------------------------------
5127. References
513------------------------------------------------------------------------------
514
Weese, D., Holtgrewe, M., & Reinert, K. (2012). RazerS 3: Faster, fully
sensitive read mapping. Bioinformatics, 28(20), 2592–2599.
517