Name                            Date         Size       #Lines   LOC

example/                        02-Feb-2018  -          34       30
tests/                          07-May-2022  -          293,519  293,213
LICENSE                         02-Feb-2018  34.3 KiB   675      553
README                          02-Feb-2018  21.2 KiB   517      373
compPHSens.cpp                  02-Feb-2018  4.5 KiB    137      78
job_queue.h                     02-Feb-2018  5.7 KiB    227      156
outputFormat.h                  02-Feb-2018  53.2 KiB   1,371    899
parallel_job_queue.h            02-Feb-2018  8.9 KiB    333      166
parallel_misc.h                 02-Feb-2018  1.7 KiB    49       38
parallel_store.h                02-Feb-2018  2 KiB      83       64
paramChooser.h                  02-Feb-2018  43.8 KiB   1,301    1,036
param_tabs.cpp                  02-Feb-2018  1.9 KiB    51       19
param_tabs.h                    02-Feb-2018  2.2 KiB    57       17
param_tabs.inc                  02-Feb-2018  3.6 MiB    68,259   68,257
profile_timeline.h              02-Feb-2018  7.5 KiB    231      148
quality2prob.cpp                02-Feb-2018  2 KiB      61       39
razers.cpp                      02-Feb-2018  39 KiB     881      702
razers.h                        02-Feb-2018  127.9 KiB  3,461    2,531
razers_match_filter.h           02-Feb-2018  8.8 KiB    266      179
razers_matepairs.h              02-Feb-2018  45.2 KiB   1,168    826
razers_matepairs_parallel.h     02-Feb-2018  75.8 KiB   1,669    1,095
razers_paired_match_filter.h    02-Feb-2018  6.8 KiB    182      117
razers_parallel.h               02-Feb-2018  52.5 KiB   1,209    823
razers_window.h                 02-Feb-2018  7.2 KiB    189      113
readSimulator.h                 02-Feb-2018  17.1 KiB   483      227
simulate_reads.cpp              02-Feb-2018  4.4 KiB    149      113

README

*** RazerS 3 - Faster, fully sensitive read mapping ***
http://www.seqan.de/projects/razers.html

------------------------------------------------------------------------------
Table of Contents
------------------------------------------------------------------------------
  1.   Overview
  2.   Installation
  3.   Usage
  4.   Output Formats
  5.   Examples
  6.   Contact
  7.   References

------------------------------------------------------------------------------
1. Overview
------------------------------------------------------------------------------

RazerS 3 is a tool for mapping millions of short genomic reads onto a
reference genome. It was designed with a focus on mapping next-generation
sequencing reads onto whole DNA genomes. RazerS 3 searches for matches of
reads with a percent identity of at least a given threshold (-i X), whereby
it detects alignments with mismatches as well as gaps.

RazerS 3 consists of a filtration part, in which a k-mer filter scans the
genome for regions that possibly contain read matches, and a verification
part, where results from the filtration are then subjected to a verification
algorithm. The user can choose between two filters: (1) a seed-based filter
based on the pigeonhole principle or (2) a k-mer counting filter based on the
SWIFT algorithm (Rasmussen et al., 2006). The pigeonhole filter (default) is
faster for a broad range of read sets and error rates, whereas the SWIFT
filter (-fl swift) is faster for short reads (<50bp) and high error rates
(10-20%).
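
The pigeonhole principle behind filter (1) can be sketched as follows: if a
read matches with at most e errors and is split into e+1 non-overlapping
seeds, at least one seed must occur without error. The Python sketch below is
only an illustration of that idea, not the RazerS 3 implementation; the
function names are hypothetical, and it uses a naive string search instead of
a k-mer index.

```python
def split_into_seeds(read, max_errors):
    """Split a read into max_errors + 1 non-overlapping seeds.

    Returns a list of (offset_in_read, seed) pairs. Seed lengths differ
    by at most one character.
    """
    parts = max_errors + 1
    base, extra = divmod(len(read), parts)
    seeds, start = [], 0
    for i in range(parts):
        length = base + (1 if i < extra else 0)
        seeds.append((start, read[start:start + length]))
        start += length
    return seeds


def candidate_positions(genome, read, max_errors):
    """Return implied read start positions for every exact seed hit.

    Each candidate still has to be verified; with indels the implied
    start is only approximate.
    """
    candidates = set()
    for offset, seed in split_into_seeds(read, max_errors):
        pos = genome.find(seed)
        while pos != -1:
            # Shift the hit by the seed's offset inside the read.
            candidates.add(pos - offset)
            pos = genome.find(seed, pos + 1)
    return sorted(candidates)
```

Every candidate produced this way is only a region that may contain a match;
the verification step decides whether it really does.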

Both filters can be run in full-sensitive mode (-rr 100), i.e. given a maximal
error rate they will output every match as a match candidate, or in lossy mode
with a user-defined sensitivity (-rr X) at higher speeds. To meet the
specified minimum sensitivity, RazerS 3 computes the expected loss rates of
different filter settings, based on base-call qualities of the reads and a
user-defined mutation rate, and chooses the fastest setting.

To verify the found candidates, we devised a banded version of the
bit-parallel approximate string search algorithm proposed by Myers (1999). The
found matches are recorded, duplicate-filtered, and ranked by the number of
errors (and deviation from a given paired-end insert size). At the end, the
results are written to an output file (-o FILENAME). Among others, RazerS 3
supports a very efficient native format (.razers) and the commonly used SAM
and BAM formats (.sam or .bam).
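
For readers unfamiliar with the bit-parallel approach, here is a minimal
Python sketch of the unbanded bit-vector search in the formulation commonly
attributed to Myers (1999) and Hyyro. It is not the banded variant used by
RazerS 3 and not taken from its source code; the function name is
hypothetical. It reports every end position in the text where the pattern
matches with at most max_errors edit operations.

```python
def myers_search(pattern, text, max_errors):
    """Return (end_index, errors) for each approximate match end in text."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must not be empty")
    mask = (1 << m) - 1   # keeps all bit vectors at m bits
    top = 1 << (m - 1)    # bit of the last pattern row

    # peq[c] has bit i set if and only if pattern[i] == c.
    peq = {}
    for i, c in enumerate(pattern):
        peq[c] = peq.get(c, 0) | (1 << i)

    pv, mv = mask, 0      # vertical +1 / -1 differences of the DP column
    score = m             # edit distance in the last row of the column
    hits = []
    for j, c in enumerate(text):
        eq = peq.get(c, 0)
        xv = eq | mv
        xh = ((((eq & pv) + pv) ^ pv) | eq) & mask
        ph = (mv | ~(xh | pv)) & mask   # horizontal +1 differences
        mh = pv & xh                    # horizontal -1 differences
        if ph & top:
            score += 1
        elif mh & top:
            score -= 1
        # Shifting in a 0 keeps row 0 at zero cost, so a match may
        # start anywhere in the text.
        ph = (ph << 1) & mask
        mh = (mh << 1) & mask
        pv = (mh | ~(xv | ph)) & mask
        mv = ph & xv
        if score <= max_errors:
            hits.append((j, score))
    return hits
```

Python integers are unbounded, so the explicit mask stands in for the machine
word boundary that a C++ implementation would get for free.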

------------------------------------------------------------------------------
2. Installation
------------------------------------------------------------------------------

To install RazerS 3, you can either compile the latest version from the Git
repository or use a precompiled binary.

------------------------------------------------------------------------------
2.1. Compilation from source code
------------------------------------------------------------------------------

Follow the "Getting Started" section on http://trac.seqan.de/wiki and check
out the latest Git repo. Instead of creating a project file in Debug mode,
switch to Release mode (-DCMAKE_BUILD_TYPE=Release) and compile razers3. This
can be done as follows:

  1)  git clone https://github.com/seqan/seqan.git
  2)  mkdir seqan/build; cd seqan/build
  3)  cmake .. -DCMAKE_BUILD_TYPE=Release
  4)  make razers3
  5)  ./bin/razers3 --help

If RazerS 3 will be run on the same machine it is compiled on, consider
optimizing for the given system architecture. For gcc or llvm/clang
compilers this can be done with:

  cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS:STRING="-march=native"
  make razers3

After compilation, copy the binary to a folder in your PATH variable, e.g.
/usr/local/bin:

  sudo cp bin/razers3 /usr/local/bin


------------------------------------------------------------------------------
2.2. Precompiled binaries
------------------------------------------------------------------------------

We also provide a precompiled binary of RazerS 3 for 64bit Linux. It was
successfully tested on Debian GNU/Linux 6.0.5 (squeeze) and Ubuntu 12.04.
Please download the binary from: http://www.seqan.de/projects/razers.html
93------------------------------------------------------------------------------
943. Usage
95------------------------------------------------------------------------------
96
97To get a short usage description of RazerS 3, you can execute razers3 -h or
98razers3 --help.
99
100Usage: razers3 [OPTION]... <GENOME FILE> <READS FILE>
101       razers3 [OPTION]... <GENOME FILE> <PE-READS FILE1> <PE-READS FILE2>
102
103RazerS 3 expects the names of two or three FASTA or FASTQ files. The first
104contains a reference genome and the second (and third) contain genomic reads
105that should be mapped onto the reference. If two read files are given, both
106have to contain exactly the same number of reads, which are considered as read
107pairs.
108
109------------------------------------------------------------------------------
1103.1. Main Options
111------------------------------------------------------------------------------
112
113  [ -fl STRING ],  [ --filter STRING ]
114
115  Use the seed-based pigeonhole filter (-fl pigeonhole, default) or the k-mer
116  counting SWIFT filter (-fl swift). See Section 3.3, for filter-specific
117  parameters.
118
119  [ -tc NUM ],  [ --thread-count NUM ]
120
121  Use NUM threads (default: 1). Set to 0 to use the old RazerS 1/2 code path.
122
123  [ -i NUM ],  [ --percent-identity NUM ]
124
125  Set the percent identity threshold. NUM must be a value between 50 and
126  100 (default is 95). RazerS 3 searches for matches with a percent identity
127  of at least NUM. A match of a read R with e errors has percent identity
128  of 100*(1 - e/|R|), whereby |R| is the read length. In other words, a
129  read is allowed to have not more than |R|*(100-NUM)/100 errors.
130  The maximal error rate is the direct opposite of the identity threshold,
131  e.g. an error rate of 4% corresponds to an identity of 96%.
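
  The two formulas above can be checked with a few lines of Python. This is
  only a worked illustration of the arithmetic; the function names are
  hypothetical and rounding down to whole errors is an assumption that
  follows from "not more than".

```python
import math


def percent_identity(read_length, errors):
    """Percent identity of a match with the given number of errors."""
    return 100 * (1 - errors / read_length)


def max_errors(read_length, identity_threshold):
    """Largest whole number of errors allowed by the -i threshold."""
    return math.floor(read_length * (100 - identity_threshold) / 100)


# A 100 bp read at -i 96 may contain up to 4 errors,
# a 36 bp read at the default -i 95 only 1 (36 * 5 / 100 = 1.8).
allowed_100bp = max_errors(100, 96)
allowed_36bp = max_errors(36, 95)
```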

  [ -rr NUM ],  [ --recognition-rate NUM ]

  Set the percent recognition rate. NUM must be a value between 80 and 100
  (default is 99). The recognition rate controls the sensitivity of RazerS 3.
  The higher the recognition rate, the more sensitive RazerS 3 is; the lower
  the recognition rate, the faster it runs. A value of 100 corresponds
  to a lossless read mapping. The recognition rate corresponds to the
  expected fraction of matches RazerS 3 will find compared to a lossless
  mapping. Depending on the desired recognition rate, the percent identity,
  and the read length, the filter is configured to run as fast as possible.
  For this purpose, it either computes the sensitivities of different
  pigeonhole filter settings at runtime or (due to the much larger search
  space) uses precomputed sensitivities of the SWIFT filter. The latter are
  precompiled in RazerS but can be replaced by user-specific settings in a
  parameter directory (-pd).
  The recognition rate value (and also the automatic sensitivity control) is
  disabled if filtration parameters, i.e. overlap length (-ol), shape (-s),
  or minimum threshold (-t), are set manually.

  [ -ng ],  [ --no-gaps ]

  Consider only mismatches as errors (Hamming distance). By default,
  insertions, deletions and mismatches are considered as errors (edit
  distance).

  [ -f ],  [ --forward ]

  Only map reads onto the positive/forward strand of the genome. By
  default, both strands are scanned.

  [ -r ],  [ --reverse ]

  Only map reads onto the negative/reverse-complement strand of the
  genome. By default, both strands are scanned.

  [ -m NUM ],  [ --max-hits NUM ]

  Output at most NUM of the best matches, default value is 100.

  [ --unique ]

  Output only unique best matches (like ELAND). This flag is equivalent to
  '-m 1 -dr 0 -pa'.

  [ -tr NUM ],  [ --trim-reads NUM ]

  Trim reads to length NUM bp.

  [ -ll NUM ],  [ --library-length NUM ]

  Set the mean library size, default is 220. The library size is the outer
  distance of the two mapped reads of a read pair. This value is used only for
  paired-end read mapping.

  [ -le NUM ],  [ --library-error NUM ]

  Set the tolerated absolute deviation of the library size, default value is
  50. This value is used only for paired-end read mapping.

  [ -o FILE ],  [ --output FILE ]

  Change the output filename to FILE. By default, this is the (first) read
  filename extended by the suffix '.razers'.

  [ -v ],  [ --verbose ]

  Verbose. Print extra information and running times.

  [ -vv ],  [ --vverbose ]

  Very verbose. Like -v, but also print filtering statistics like number of
  candidates and successful verifications.

  [ --version ]

  Print version information.

  [ -h ],  [ --help ]

  Print a brief usage summary.
214------------------------------------------------------------------------------
2153.2. Output Format Options
216------------------------------------------------------------------------------
217
218  [ -a ],  [ --alignment ]
219
220  Dump the alignment for each match in the ".result" file (only for razer or
221  fasta format). The alignment is written directly after the match and has the
222  following format:
223  #Read:   CAGGAGATAAGCTGGATCGTTTACGGT
224  #Genome: CAGGAGATAAGC-GGATCTTTTACG--
225
226  [ -pa ],  [ --purge-ambiguous ]
227
228  Omit reads with more than <max-hits> matches.
229
230  [ -dr NUM ], [ --distance-range NUM ]
231
232  If the best match of a read has E errors, only consider hits with
233  E <= X <= E+NUM errors as matches. Disabled by default.
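
  The effect of the distance range can be illustrated with a short Python
  sketch. It only mirrors the rule stated above on a list of error counts;
  the function name is hypothetical and this is not how RazerS 3 stores its
  matches.

```python
def within_distance_range(error_counts, distance_range):
    """Keep hits whose error count X satisfies E <= X <= E + distance_range,
    where E is the error count of the best hit of the read."""
    if not error_counts:
        return []
    best = min(error_counts)
    return [x for x in error_counts if x <= best + distance_range]


# Hits of one read with 2, 3, 5, 2 and 4 errors.
# With a distance range of 1, only hits with 2 or 3 errors remain.
kept = within_distance_range([2, 3, 5, 2, 4], 1)
```

  With a distance range of 0 only the hits that tie with the best match are
  kept, which is the setting that '--unique' combines with '-m 1 -pa'.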

  [ -gn NUM ],  [ --genome-naming NUM ]

  Select how genomes are named in the output file. If NUM is 0, the Fasta
  ids of the genome sequences are used (default). If NUM is 1, the genome
  sequences are enumerated beginning with 1.

  [ -rn NUM ],  [ --read-naming NUM ]

  Select how reads are named in the output file. If NUM is 0, the Fasta ids
  of the reads are used (default). If NUM is 1, the reads are enumerated
  beginning with 1. If NUM is 2, the read sequence itself is used. If NUM is
  3, Fasta ids are used without a /L or /R suffix in paired-end mode.

  [ --full-readid ]

  Use the whole Fasta id of each read in the output file. By default, only the
  prefix up to (but excluding) the first space is used.

  [ -so NUM ],  [ --sort-order NUM ]

  Select how matches are ordered in the output file.
  If NUM is 0, matches are sorted primarily by the read number and
  secondarily by their position in the genome sequence (default).
  If NUM is 1, matches are sorted primarily by their position in the genome
  sequence and secondarily by the read number.

  [ -pf NUM ],  [ --position-format NUM ]

  Select how positions are stored in the output file.
  If NUM is 0, the gap space is used, i.e. gaps around characters are
  enumerated beginning with 0 and the beginning and end position is the
  position of the gap before and after a match (default).
  If NUM is 1, the position space is used, i.e. characters are enumerated
  beginning with 1 and the beginning and end position is the position of the
  first and last character involved in a match.

  Example: Consider the string CONCAT. The beginning and end positions
  of the substring CAT are (3,6) in gap space and (4,6) in position space.
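
  The relation between the two coordinate systems can be sketched in Python,
  whose slices happen to use gap space as well. The helper names are
  hypothetical; the CONCAT example is the one given above.

```python
def gap_to_position_space(begin, end):
    """Convert a (begin, end) interval from gap space (-pf 0, gaps counted
    from 0) to position space (-pf 1, characters counted from 1)."""
    return begin + 1, end


def position_to_gap_space(begin, end):
    """Inverse conversion, from position space back to gap space."""
    return begin - 1, end


text = "CONCAT"
gap_interval = (3, 6)
# Python slicing uses gap space, so the interval can be applied directly.
substring = text[gap_interval[0]:gap_interval[1]]
position_interval = gap_to_position_space(*gap_interval)
```

  Only the begin coordinate changes between the two formats; the end
  coordinate is the same number in both.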

------------------------------------------------------------------------------
3.3. Filtration Options
------------------------------------------------------------------------------

  [ -fl STRING ],  [ --filter STRING ]

  Described in section 3.1.

  [ -mr NUM ],  [ --mutation-rate NUM ]

  Set the percent mutation rate used by the pigeonhole sensitivity estimation.
  The mutation rate specifies the rate of differences between the sequenced
  genome and the reference genome, i.e. all errors except sequencing errors.
  These errors include small variants (SNPs, indels) or errors in the assembly
  of the reference. Default value is 5 (=5%).

  [ -ol NUM ],  [ --overlap-length NUM ]

  Manually set the overlap length of adjacent k-mer seeds used in the
  pigeonhole filter. If the overlap is 0, non-overlapping k-mers of the
  specified shape (-s) are used. For overlaps of NUM > 0, the shape is
  extended to the right by NUM characters that overlap with the next seed.
  The seed positions in the reads are not affected by the overlap.
  This option disables the automatic sensitivity control.

  [ -pd DIR ],  [ --param-dir DIR ]

  Read user-computed parameter files of the SWIFT filter given in the
  directory <DIR>. These parameters can be computed based on a machine
  specific error distribution file and the param_chooser tool.

  [ -t NUM ],  [ --threshold NUM ]

  Depending on the percent identity and the read length, the SWIFT filter
  computes for each read a threshold of common k-mers between the read and the
  reference genome. These thresholds determine the filtration strictness and
  are crucial to the overall running time. With this option the threshold
  values can manually be raised to a minimum value to reduce the running time
  at the cost of mapping sensitivity. All threshold values smaller than NUM
  are raised to NUM. The default value is 1.
  This option disables the automatic sensitivity control.

  [ -tl NUM ],  [ --taboo-length NUM ]

  The taboo length is the minimal distance two k-mers must have in the
  reference genome when counting common k-mers between reads and reference
  genome (default is 1).

  [ -s BITSTRING ],  [ --shape BITSTRING ]

  Define the k-mer shape. BITSTRING must be a sequence of bits beginning
  and ending with 1, e.g. 1111111001101. A '1' defines a relevant and a
  '0' an irrelevant position. Two k-mers are equal if all characters at
  relevant positions are equal.
  This option disables the automatic sensitivity control.
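
  How a shape selects characters can be sketched in a few lines of Python.
  This is an illustration of the definition above, not RazerS 3 code; the
  function names are hypothetical.

```python
def is_valid_shape(shape):
    """A shape is a non-empty string of 0s and 1s that begins and ends
    with 1."""
    return (len(shape) > 0
            and set(shape) <= {"0", "1"}
            and shape[0] == "1"
            and shape[-1] == "1")


def gapped_kmer(sequence, start, shape):
    """Return the characters at the relevant ('1') positions of the shape,
    applied to the sequence at the given start position."""
    if not is_valid_shape(shape):
        raise ValueError("shape must consist of 0/1 and begin and end with 1")
    if start < 0 or start + len(shape) > len(sequence):
        raise ValueError("shape does not fit into the sequence at this position")
    return "".join(sequence[start + i]
                   for i, bit in enumerate(shape) if bit == "1")


# With shape 1101 the third character is ignored, so ACGT and ACTT
# yield the same gapped k-mer and count as equal.
same = gapped_kmer("ACGT", 0, "1101") == gapped_kmer("ACTT", 0, "1101")
```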

  [ -oc NUM ],  [ --overabundance-cut NUM ]

  Remove overabundant read k-mers from the k-mer index. k-mers with a
  relative abundance above NUM are removed. NUM must be a value between
  0 (remove all) and 1 (remove nothing, default).

  [ -rl NUM ],  [ --repeat-length NUM ]

  The repeat length is the minimal length a simple-repeat in the
  genome sequence must have to be filtered out by the repeat masker of
  RazerS 3. Simple repeats are tandem repeats of only one repeated
  character, e.g. AAAAA. Independently of this parameter, N characters in
  the genome are filtered out automatically. Default value is 1000.
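
  As an illustration of what counts as a simple repeat, the following Python
  sketch finds all runs of a single repeated character of at least a given
  length. It is a hypothetical helper for explanation only and says nothing
  about how the repeat masker of RazerS 3 is implemented.

```python
def find_simple_repeats(genome, min_length):
    """Return (begin, end) intervals in gap space of all runs of one
    repeated character that are at least min_length long."""
    repeats = []
    start = 0
    for i in range(1, len(genome) + 1):
        # A run ends at the end of the sequence or where the character changes.
        if i == len(genome) or genome[i] != genome[start]:
            if i - start >= min_length:
                repeats.append((start, i))
            start = i
    return repeats


# The run AAAAA occupies positions 3 to 7 of this sequence.
runs = find_simple_repeats("ACGAAAAATTG", 5)
```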

  [ -lf NUM ],  [ --load-factor NUM ]

  Set the load factor for the open addressing k-mer index. Defines how many
  entries should be used in the hash table. If the index stores at most X
  different k-mers, X * NUM entries will be reserved for the hash table.

------------------------------------------------------------------------------
3.4. Verification Options
------------------------------------------------------------------------------

  [ -mN ],  [ --match-N ]

  By default, 'N' characters in read or genome sequences match nothing,
  not even another 'N'; they are counted as errors. With this option, 'N'
  matches every other character and produces no mismatch in the
  verification process. The filtration is not affected by this option.

  [ -ed FILE ],  [ --error-distr FILE ]

  Produce an error distribution file containing the relative frequencies of
  mismatches for each read position. If the "--indels" option is given, the
  relative frequencies of insertions and deletions are also recorded.

  [ -mf FILE ],  [ --mismatch-file FILE ]

  Produce a mismatch file containing for each read alignment a line of
  tab-separated 0's and 1's representing a match (0) or a mismatch (1).

------------------------------------------------------------------------------
4. Output Formats
------------------------------------------------------------------------------

RazerS 3 currently supports 5 different output formats, which are chosen
automatically from the output filename suffix.

  .razers    = Razer Format
  .fa|.fasta = Enhanced Fasta Format
  .eland     = Eland Format
  .sam|.bam  = Sequence Alignment and Mapping Format (SAM)
  .afg       = AMOS assembler format

------------------------------------------------------------------------------
4.1. Razer Format
------------------------------------------------------------------------------

The output file is a text file whose lines represent matches. A line
consists of different tab separated match values in the following format:

RNAME RBEGIN REND GSTRAND GNAME GBEGIN GEND PERCID [PAIRID PAIRSCR MATEDIST]

Match value description:

  RNAME        Name of the read sequence (see --read-naming)
  RBEGIN       Beginning position in the read sequence (0/1 see -pf option)
  REND         End position in the read sequence (length of the read)
  GSTRAND      'F'=forward strand or 'R'=reverse strand
  GNAME        Name of the genome sequence (see --genome-naming)
  GBEGIN       Beginning position in the genome sequence
  GEND         End position in the genome sequence
  PERCID       Percent identity (see --percent-identity)

For paired-end read mapping 3 additional values are dumped:

  PAIRID       Unique number to identify the two corresponding mate matches
  PAIRSCR      The sum of the negative number of errors in both matches
  MATEDIST     Relative outer distance to the mate match
               The absolute value is the insert size

For matches on the reverse strand, GBEGIN and GEND are positions on the
related forward strand. It holds GBEGIN < GEND, regardless of GSTRAND.
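
A line in this format could be read with a small parser such as the Python
sketch below. The field order follows the description above; the class and
function names, the numeric types chosen for each field, and the sample
values in the comments are assumptions made for illustration.

```python
from typing import NamedTuple, Optional


class RazerMatch(NamedTuple):
    rname: str
    rbegin: int
    rend: int
    gstrand: str
    gname: str
    gbegin: int
    gend: int
    percid: float
    # Only present for paired-end mapping.
    pairid: Optional[int] = None
    pairscr: Optional[int] = None
    matedist: Optional[int] = None


def parse_razers_line(line):
    """Parse one tab-separated match line with 8 or 11 values."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) not in (8, 11):
        raise ValueError(
            f"expected 8 or 11 tab-separated values, got {len(fields)}")
    match = RazerMatch(
        rname=fields[0], rbegin=int(fields[1]), rend=int(fields[2]),
        gstrand=fields[3], gname=fields[4],
        gbegin=int(fields[5]), gend=int(fields[6]), percid=float(fields[7]))
    if len(fields) == 11:
        match = match._replace(pairid=int(fields[8]),
                               pairscr=int(fields[9]),
                               matedist=int(fields[10]))
    return match


# Made-up sample line, not taken from real RazerS 3 output.
example = parse_razers_line("read1\t0\t36\tF\tchr1\t1000\t1036\t97.2222\n")
```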

------------------------------------------------------------------------------
4.2. Enhanced Fasta Format
------------------------------------------------------------------------------

The matches are stored in the same order as in the Razer format. Each match
is stored in two lines:

>GBEGIN,GEND[KEY1=VALUE1,KEY2=VALUE2,...]
READSEQ

Match value description:

  GBEGIN       Beginning position in the genome sequence
  GEND         End position in the genome sequence
  READSEQ      Read sequence.

The following keys are output:

  id           ID value of the input file Fasta header (>..[id=ID,..]..)
  fragId       Fragment ID value (>..[..,fragId=FRAGID,..]..)
  contigId     Name of the genome sequence (see --genome-naming)
  errors       Absolute numbers of errors in this match
  percId       Percent identity (see --percent-identity)
  ambiguity    Number of matches of this read as good as or better than this

If the ID or fragment ID values of a read couldn't be found in the reads file
the read number (beginning with 0) is used instead. For matches on the
reverse strand, GBEGIN and GEND are positions on the related forward strand
and GBEGIN > GEND.

------------------------------------------------------------------------------
4.3. Eland Format
------------------------------------------------------------------------------

Each line of the output file corresponds to a read; the lines appear in the
same order as the reads in the reads file. A line consists of the following
tab separated values:

>RNAME READSEQ TYPE N0 N1 N2 GNAME GBEGIN* GSTRAND '..' SUBST1 SUBST2 ...

Additional value description:

  TYPE         NM = No match found
               QC = No matching done (too many Ns in read sequence)
               Ux = Best match found was unique with x errors
               Rx = Multiple best matches found having x errors
  N0 N1 N2     Number of exact, 1-error, and 2-error matches
  GBEGIN*      Minimum of GBEGIN and GEND
  SUBSTx       Position and type of the x'th mismatch (not for --indels)
               (e.g. 12A: 12'th base was A in the genome)

------------------------------------------------------------------------------
4.4. SAM or BAM Output Format
------------------------------------------------------------------------------

The SAM output format has established itself as the standard output format for
read alignments. Although SAM is capable of representing a multiple alignment
between reads and contig (with paddings), RazerS 3 only outputs pairwise
alignments, as this has become the de facto standard.
The BAM format is the binary representation of SAM compressed with gzip.
Whenever possible the more compact BAM format should be preferred over SAM.

See http://samtools.sourceforge.net/ for more details.

------------------------------------------------------------------------------
4.5. AMOS Output Format
------------------------------------------------------------------------------

The AMOS assembly format (aka AFG format) is used by the AMOS assembler and
represents a multiple global alignment between reads and contig and also
stores the consensus sequences (including gaps).

See http://www.cbcb.umd.edu/research/contig_representation.shtml#AMOS for more
details.

------------------------------------------------------------------------------
5. Examples
------------------------------------------------------------------------------

To map single-end reads with 4% error rate using 12 threads call:

 razers3 -i 96 -tc 12 -o map.result hg18.fa reads.fq

To map paired-end reads with up to 6% errors, 95% sensitivity, 12 threads, and
only output aligned pairs with an outer distance of 200-360bp call:

 razers3 -i 94 -rr 95 -tc 12 -ll 280 -le 80 -o map.result hg18.fa r1.fq r2.fq
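
The library options in the paired-end call follow from the desired outer
distance range: the library length is the midpoint of the range and the
library error is the tolerated deviation from it. A small Python sketch of
this arithmetic, with a hypothetical function name:

```python
def library_options(min_outer_distance, max_outer_distance):
    """Return (library_length, library_error) covering the given range of
    outer distances, as used for -ll and -le."""
    if min_outer_distance > max_outer_distance:
        raise ValueError("minimum must not exceed maximum")
    library_length = (min_outer_distance + max_outer_distance) // 2
    # Use the larger deviation so that both ends of the range are covered
    # even when the midpoint is not a whole number.
    library_error = max(max_outer_distance - library_length,
                        library_length - min_outer_distance)
    return library_length, library_error


# An outer distance of 200-360bp gives a library length of 280
# and a library error of 80.
ll, le = library_options(200, 360)
```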

------------------------------------------------------------------------------
6. Contact
------------------------------------------------------------------------------

For questions or comments, contact:
  David Weese <david.weese@fu-berlin.de>

------------------------------------------------------------------------------
7. References
------------------------------------------------------------------------------

Weese, D., Holtgrewe, M., & Reinert, K. (2012). RazerS 3: Faster, fully
sensitive read mapping. Bioinformatics, 28(20), 2592–2599.