Name                 Date          Size       #Lines   LOC
..                   03-May-2022   -
example/             03-May-2022   -          41       36
gapped_params/       03-May-2022   -          68,591   68,367
params/              07-May-2022   -          7,847    7,667
README               14-Mar-2012   20 KiB     502      370
outputFormat.h       14-Mar-2012   36.4 KiB   1,246    998
paramChooser.h       14-Mar-2012   33.1 KiB   1,178    919
razers.cpp           14-Mar-2012   27 KiB     615      490
razers.h             14-Mar-2012   65.2 KiB   2,259    1,715
razers_matepairs.h   14-Mar-2012   31.7 KiB   1,095    814
razers_parallel.h    14-Mar-2012   7.5 KiB    259      186
razers_spliced.h     14-Mar-2012   68.7 KiB   2,138    1,575
readSimulator.h      14-Mar-2012   13.4 KiB   470      221

README

1*** RazerS - Fast Read Mapping with Sensitivity Control ***
2http://www.seqan.de/projects/razers.html
3
4---------------------------------------------------------------------------
5Table of Contents
6---------------------------------------------------------------------------
7  1.   Overview
8  2.   Installation
9  3.   Usage
10  4.   Output Formats
11  5.   Example
12  6.   Contact
13
14---------------------------------------------------------------------------
151. Overview
16---------------------------------------------------------------------------
17
18RazerS is a tool for mapping millions of short genomic reads onto a
19reference genome. It was designed with focus on mapping next-generation
20sequencing reads onto whole DNA genomes. RazerS searches for matches of
21reads with a percent identity above a given threshold, whereby it detects
22matches with mismatches as well as gaps.
23RazerS uses a k-mer index of all reads and counts common k-mers of reads
24and the reference genome in parallelograms. Each parallelogram with a k-mer
25count above a certain threshold triggers a verification. On success, the
26genomic subsequence and the read number are stored and written to the
27output file.
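
The counting step can be illustrated with a deliberately simplified sketch.
It is not the RazerS implementation: it uses contiguous k-mers, scans only
the forward strand, and bins hits into fixed diagonal bands instead of
parallelograms. The function names and the band_width parameter are
assumptions made for illustration only.

```python
from collections import defaultdict


def build_kmer_index(reads, k):
    """Map every k-mer to the (read number, offset) pairs where it occurs."""
    index = defaultdict(list)
    for read_id, read in enumerate(reads):
        for offset in range(len(read) - k + 1):
            index[read[offset:offset + k]].append((read_id, offset))
    return index


def candidate_regions(genome, reads, k, threshold, band_width):
    """Return (read number, band number) pairs that would trigger verification.

    A hit at genome position g and read offset o lies on diagonal g - o.
    Hits are counted per band of band_width neighbouring diagonals, a
    simplification of the parallelograms used by RazerS.
    """
    index = build_kmer_index(reads, k)
    counts = defaultdict(int)
    for gpos in range(len(genome) - k + 1):
        for read_id, offset in index.get(genome[gpos:gpos + k], ()):
            diagonal = gpos - offset
            counts[(read_id, diagonal // band_width)] += 1
    return sorted(key for key, count in counts.items() if count >= threshold)


read = "ACGTACGGA"
genome = "GGGG" + read + "CCCC"
# The read shares all of its six 4-mers with the genome on one diagonal.
print(candidate_regions(genome, [read], k=4, threshold=6, band_width=8))
```

Each reported pair would then be handed to a verification step that computes
the actual alignment.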
28
29---------------------------------------------------------------------------
302. Installation
31---------------------------------------------------------------------------
32
33RazerS is distributed with SeqAn - The C++ Sequence Analysis Library (see
34http://www.seqan.de). To build RazerS do the following:
35
36  1)  Download the latest snapshot of SeqAn
37  2)  Unzip it to a directory of your choice (e.g. snapshot)
38  3)  cd snapshot/apps
39  4)  make razers
40  5)  cd razers
41  6)  ./razers --help
42
43Alternatively you can check out the latest SVN version of RazerS and SeqAn
44with:
45
46  1)  svn co http://svn.mi.fu-berlin.de/seqan/trunk/seqan
47  2)  cd seqan
48  3)  make forwards
49  4)  cd projects/library/apps
50  5)  make razers
51  6)  cd razers
52  7)  ./razers --help
53
54On success, an executable file razers has been built and a brief usage
55description is printed.
56
57---------------------------------------------------------------------------
583. Usage
59---------------------------------------------------------------------------
60
61To get a short usage description of RazerS, you can execute razers -h or
62razers --help.
63
64Usage: razers [OPTION]... <GENOME FILE> <READS FILE>
65       razers [OPTION]... <GENOME FILE> <MP-READS FILE1> <MP-READS FILE2>
66
67RazerS expects the names of two or three DNA (multi-)Fasta files. The first
68contains a reference genome and the second (and third) contains genomic reads
69that should be mapped onto the reference. If two read files are given, both
70have to contain exactly the same number of reads, which are considered as
71mate-pairs. To save memory, RazerS uses bit fields, which limit the maximum
72number of reads to 16.7 million and the read length to 256 bp. If your read
73set exceeds these limits, remove "#define RAZERS_MEMOPT" in razers.cpp
74and recompile. Without any additional parameters, RazerS maps all reads
75onto both strands of the reference genome with 92% identity (i.e. 8% errors
76per read) and dumps all found matches into an output file. The output file name
77is the read file name extended by the suffix ".result". The default behaviour
78can be modified by adding the following options to the command line:
79
80---------------------------------------------------------------------------
813.1. Main Options
82---------------------------------------------------------------------------
83
84  [ -f ],  [ --forward ]
85
86  Only map reads onto the positive/forward strand of the genome. By
87  default, both strands are scanned.
88
89  [ -r ],  [ --reverse ]
90
91  Only map reads onto the negative/reverse-complement strand of the
92  genome. By default, both strands are scanned.
93
94  [ -i NUM ],  [ --percent-identity NUM ]
95
96  Set the percent identity threshold. NUM must be a value between 50 and
97  100 (default is 92). RazerS searches for matches with a percent identity
98  of at least NUM. A match of a read R with e errors has percent identity
99  of 100*(1 - e/|R|), whereby |R| is the read length. In other words, a
100  read is allowed to have not more than |R|*(100-NUM)/100 errors.
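
The relation between percent identity and the number of allowed errors can
be checked with a few lines of Python. This is a sketch of the arithmetic
given above, not code from RazerS; the function names are chosen for
illustration.

```python
import math


def max_errors(read_length, min_identity):
    """Largest whole number of errors not exceeding |R|*(100-NUM)/100."""
    return math.floor(read_length * (100 - min_identity) / 100)


def percent_identity(read_length, errors):
    """Percent identity 100*(1 - e/|R|) of a match with e errors."""
    return 100.0 * (1 - errors / read_length)


# A 27 bp read at the default threshold of 92 may contain up to 2 errors.
print(max_errors(27, 92))                     # 2
print(round(percent_identity(27, 2), 3))      # 92.593
```

The value 92.593 is the percent identity reported for a 27 bp read with two
errors.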
101
102  [ -rr NUM ],  [ --recognition-rate NUM ]
103
104  Set the percent recognition rate. NUM must be a value between 80 and 100
105  (default is 99). The recognition rate controls the sensitivity of RazerS.
106  The higher the recognition rate, the more sensitive RazerS is. The lower
107  the recognition rate, the faster RazerS runs. A value of 100 corresponds
108  to a lossless read mapping. The recognition rate corresponds to the
109  expected fraction of matches RazerS will find compared to a lossless
110  mapping. Depending on the desired recognition rate, the percent identity
111  and the read length, the filter is configured to run as fast as possible.
112  Therefore it needs access to files with precomputed filtration settings
113  in a 'gapped_params' subfolder which resides in the razers folder. This
114  value is ignored if the shape (-s) or the minimum threshold (-t) is set
115  manually.
116
117  [ -id ],  [ --indels ]
118
119  Consider insertions, deletions and mismatches as errors. By default, only
120  mismatches are recognized.
121
122  [ -ll NUM ],  [ --library-length NUM ]
123
124  Set the mean library size, default is 220. The library size is the outer
125  distance of the two reads of a mate-pair. This value is used only for
126  paired-end read mapping.
127
128  [ -le NUM ],  [ --library-error NUM ]
129
130  Set the tolerated absolute deviation of the library size, default value is
131  50. This value is used only for paired-end read mapping.
132
133  [ -m NUM ],  [ --max-hits NUM ]
134
135  Output at most NUM of the best matches, default value is 100.
136
137  [ --unique ]
138
139  Output only unique best matches (like ELAND). This flag corresponds to
140  '-m 1 -dr 0 -pa'.
141
142  [ -tr NUM ],  [ --trim-reads NUM ]
143
144  Trim reads to length NUM.
145
146  [ -o FILE ],  [ --output FILE ]
147
148  Change the output filename to FILE. By default, this is the read file
149  name extended by the suffix ".result".
150
151  [ -v ],  [ --verbose ]
152
153  Verbose. Print extra information and running times.
154
155  [ -vv ],  [ --vverbose ]
156
157  Very verbose. Like -v, but also print filtering statistics like true and
158  false positives (TP/FP).
159
160  [ -V ],  [ --version ]
161
162  Print version information.
163
164  [ -h ],  [ --help ]
165
166  Print a brief usage summary.
167
168---------------------------------------------------------------------------
1693.2. Output Format Options
170---------------------------------------------------------------------------
171
172  [ -a ],  [ --alignment ]
173
174  Dump the alignment for each match in the ".result" file. The alignment is
175  written directly after the match and has the following format:
176  #Read:   CAGGAGATAAGCTGGATCGTTTACGGT
177  #Genome: CAGGAGATAAGC-GGATCTTTTACG--
178
179  [ -pa ],  [ --purge-ambiguous ]
180
181  Omit reads with more than max-hits matches (see --max-hits).
182
183  [ -dr NUM ], [ --distance-range NUM ]
184
185  If the best match of a read has E errors, only consider hits with
186  X errors, where E <= X <= E+NUM, as matches.
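
The effect of the distance range can be sketched as a filter over the error
counts of all hits of one read. This is an illustration of the rule stated
above, not the RazerS code; the function name is hypothetical.

```python
def within_distance_range(error_counts, num):
    """Keep the error counts X with E <= X <= E + num, E being the best hit."""
    if not error_counts:
        return []
    best = min(error_counts)
    return [errors for errors in error_counts if errors <= best + num]


# Hits of one read with 2, 0, 1 and 3 errors; the best hit has E = 0.
print(within_distance_range([2, 0, 1, 3], 1))   # [0, 1]
print(within_distance_range([2, 0, 1, 3], 0))   # [0]
```

With NUM set to 0 only hits as good as the best one remain, which is the
setting used by --unique.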
187
188  [ -of NUM ], [ --output-format NUM ]
189
190  Select the output format the matches should be stored in. See section 4.
191
192  [ -gn NUM ],  [ --genome-naming NUM ]
193
194  Select how genomes are named in the output file. If NUM is 0, the Fasta
195  ids of the genome sequences are used (default). If NUM is 1, the genome
196  sequences are enumerated beginning with 1.
197
198  [ -rn NUM ],  [ --read-naming NUM ]
199
200  Select how reads are named in the output file. If NUM is 0, the Fasta ids
201  of the reads are used (default). If NUM is 1, the reads are enumerated
202  beginning with 1. If NUM is 2, the read sequence itself is used.
203
204  [ -so NUM ],  [ --sort-order NUM ]
205
206  Select how matches are ordered in the output file.
207  If NUM is 0, matches are sorted primarily by the read number and
208  secondarily by their position in the genome sequence (default).
209  If NUM is 1, matches are sorted primarily by their position in the genome
210  sequence and secondarily by the read number.
211
212  [ -pf NUM ],  [ --position-format NUM ]
213
214  Select how positions are stored in the output file.
215  If NUM is 0, the gap space is used, i.e. gaps around characters are
216  enumerated beginning with 0 and the beginning and end positions are the
217  positions of the gaps before and after a match (default).
218  If NUM is 1, the position space is used, i.e. characters are enumerated
219  beginning with 1 and the beginning and end positions are the positions of
220  the first and last characters involved in a match.
221
222  Example: Consider the string CONCAT. The beginning and end positions
223  of the substring CAT are (3,6) in gap space and (4,6) in position space.
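
Gap space coincides with the half-open, 0-based intervals used by Python
slices, so the two conventions can be converted as in the following sketch.
The helper names are chosen for illustration.

```python
def gap_to_position_space(begin, end):
    """Convert a gap-space interval to 1-based, inclusive positions."""
    return begin + 1, end


def position_to_gap_space(begin, end):
    """Convert 1-based, inclusive positions to a gap-space interval."""
    return begin - 1, end


text = "CONCAT"
begin, end = 3, 6                          # gap space
print(text[begin:end])                     # CAT
print(gap_to_position_space(begin, end))   # (4, 6)
```

Only the beginning position changes between the two formats; the end
position has the same value in both.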
224
225---------------------------------------------------------------------------
2263.3. Filtration Options
227---------------------------------------------------------------------------
228
229  [ -s BITSTRING ],  [ --shape BITSTRING ]
230
231  Define the k-mer shape. BITSTRING must be a sequence of bits beginning
232  and ending with 1, e.g. 1111111001101. A '1' defines a relevant and a
233  '0' an irrelevant position. Two k-mers are equal if all characters at
234  relevant positions are equal.
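
How a shape selects the relevant positions of a k-mer can be sketched as
follows. This is an illustration of the definition above, not the SeqAn
shape implementation; shaped_kmer is a hypothetical helper.

```python
def shaped_kmer(sequence, start, shape):
    """Return the characters of sequence at the '1' positions of shape."""
    if not shape or set(shape) - {"0", "1"} or shape[0] != "1" or shape[-1] != "1":
        raise ValueError("shape must consist of 0/1 and begin and end with 1")
    if start < 0:
        raise ValueError("start must not be negative")
    window = sequence[start:start + len(shape)]
    if len(window) != len(shape):
        raise ValueError("shape does not fit into the sequence at this offset")
    return "".join(base for base, bit in zip(window, shape) if bit == "1")


# The two windows differ only at the irrelevant third position.
print(shaped_kmer("ACGTA", 0, "1101"))   # ACT
print(shaped_kmer("ACTTA", 0, "1101"))   # ACT
```

Both windows yield the same shaped k-mer and are therefore counted as equal
by a filter using this shape.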
235
236  [ -t NUM ],  [ --threshold NUM ]
237
238  Depending on the percent identity and the length, for each read a
239  threshold of common k-mers between read and reference genome is
240  calculated. These thresholds determine the filtration strictness and are
241  crucial to the overall running time. With this option the threshold values
242  can be raised manually to a minimum value to reduce the running time at
243  the cost of mapping sensitivity. All threshold values smaller than NUM
244  are raised to NUM. The default value is 1.
245
246  [ -oc NUM ],  [ --overabundance-cut NUM ]
247
248  Remove overabundant read k-mers from the k-mer index. k-mers with a
249  relative abundance above NUM are removed. NUM must be a value between
250  0 (remove all) and 1 (remove nothing, default).
251
252  [ -rl NUM ],  [ --repeat-length NUM ]
253
254  The repeat length is the minimal length a simple-repeat in the
255  genome sequence must have to be filtered out by the repeat masker of
256  RazerS. Simple repeats are tandem repeats of only one repeated
257  character, e.g. AAAAA. Independently of this parameter, N characters in
258  the genome are filtered out automatically. Default value is 1000.
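
Simple repeats in the sense of this option can be located with a run-length
scan. The sketch below only illustrates the definition; it is not the repeat
masker of RazerS, and the function name is hypothetical.

```python
from itertools import groupby


def simple_repeats(genome, min_length):
    """Return gap-space intervals of single-character runs >= min_length."""
    intervals = []
    position = 0
    for _, run in groupby(genome):
        length = sum(1 for _ in run)
        if length >= min_length:
            intervals.append((position, position + length))
        position += length
    return intervals


print(simple_repeats("ACAAAAAGT", 5))   # [(2, 7)]
```

Runs shorter than the repeat length are left untouched, so the default value
of 1000 masks only very long homopolymer stretches.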
259
260  [ -tl NUM ],  [ --taboo-length NUM ]
261
262  The taboo length is the minimal distance two k-mers must have in the
263  reference genome when counting common k-mers between reads and reference
264  genome (default is 1).
265
266---------------------------------------------------------------------------
2673.4. Verification Options
268---------------------------------------------------------------------------
269
270  [ -mN ],  [ --match-N ]
271
272  By default, 'N' characters in read or genome sequences match nothing,
273  not even another 'N'. They are considered errors. With this option
274  activated, 'N' matches every other character and produces no mismatch in
275  the verification process. The filtration is not affected by this option.
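
The two ways of treating 'N' can be sketched for ungapped comparisons as
follows. This illustrates the rule stated above under the assumption of
upper-case sequences of equal length; it is not the RazerS verifier.

```python
def bases_match(read_base, genome_base, match_n=False):
    """Compare two bases; an 'N' matches only if match_n is set."""
    if read_base == "N" or genome_base == "N":
        return match_n
    return read_base == genome_base


def count_mismatches(read, genome, match_n=False):
    """Count mismatching positions of two equally long sequences."""
    if len(read) != len(genome):
        raise ValueError("sequences must have the same length")
    return sum(not bases_match(r, g, match_n) for r, g in zip(read, genome))


read = "GCCATTAGAGGCCACCACACCAGACGT"
genome = "GCCATTAGAGGCCACCACACCAGNNNN"
print(count_mismatches(read, genome))                 # 4
print(count_mismatches(read, genome, match_n=True))   # 0
```

The two sequences are taken from the example in section 5.1, where the match
is reported with 100 percent identity because -mN is given.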
276
277  [ -ed FILE ],  [ --error-distr FILE ]
278
279  Produce an error distribution file containing the relative frequencies of
280  mismatches for each read position. If the "--indels" option is given, the
281  relative frequencies of insertions and deletions are also recorded.
282
283
284---------------------------------------------------------------------------
2854. Output Formats
286---------------------------------------------------------------------------
287
288RazerS currently supports 4 different output formats selectable via the
289"--output-format NUM" option. The following values for NUM are possible:
290
291  0 = Razer Format
292  1 = Enhanced Fasta Format
293  2 = Eland Format
294  3 = General Feature Format (GFF)
295
296---------------------------------------------------------------------------
2974.1. Razer Format
298---------------------------------------------------------------------------
299
300The output file is a text file whose lines represent matches. A line
301consists of different tab separated match values in the following format:
302
303RNAME RBEGIN REND GSTRAND GNAME GBEGIN GEND PERCID [PAIRID PAIRSCR MATEDIST]
304
305Match value description:
306
307  RNAME        Name of the read sequence (see --read-naming)
308  RBEGIN       Beginning position in the read sequence (0/1 see -pf option)
309  REND         End position in the read sequence (length of the read)
310  GSTRAND      'F'=forward strand or 'R'=reverse strand
311  GNAME        Name of the genome sequence (see --genome-naming)
312  GBEGIN       Beginning position in the genome sequence
313  GEND         End position in the genome sequence
314  PERCID       Percent identity (see --percent-identity)
315
316For paired-end read mapping 3 additional values are dumped:
317
318  PAIRID       Unique number to identify the two corresponding mate matches
319  PAIRSCR      The sum of the negative number of errors in both matches
320  MATEDIST     Relative outer distance to the mate match
321               The absolute value is the insert size
322
323For matches on the reverse strand, GBEGIN and GEND are positions on the
324related forward strand. It holds GBEGIN < GEND, regardless of GSTRAND.
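
A reader for this format could look like the following sketch. It assumes
tab-separated fields exactly as listed above and skips the '#' alignment
lines written with -a; the class and function names are hypothetical.

```python
from typing import NamedTuple, Optional


class RazerMatch(NamedTuple):
    rname: str
    rbegin: int
    rend: int
    gstrand: str
    gname: str
    gbegin: int
    gend: int
    percid: float
    pairid: Optional[int] = None
    pairscr: Optional[int] = None
    matedist: Optional[int] = None


def parse_razer_line(line):
    """Parse one match line; return None for empty and alignment lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) not in (8, 11):
        raise ValueError("expected 8 or 11 tab-separated values")
    match = RazerMatch(fields[0], int(fields[1]), int(fields[2]), fields[3],
                       fields[4], int(fields[5]), int(fields[6]),
                       float(fields[7]))
    if len(fields) == 11:
        # The three extra values are only present for paired-end mapping.
        match = match._replace(pairid=int(fields[8]), pairscr=int(fields[9]),
                               matedist=int(fields[10]))
    return match


print(parse_razer_line("read1\t0\t27\tF\tcontig1\t47\t73\t92.593"))
```

Splitting on tabs only, rather than on any whitespace, keeps read names that
contain spaces intact.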
325
326---------------------------------------------------------------------------
3274.2. Enhanced Fasta Format
328---------------------------------------------------------------------------
329
330The matches are stored in the same order as in the Razer format. Each match
331is stored in two lines:
332
333>GBEGIN,GEND[KEY1=VALUE1,KEY2=VALUE2,...]
334READSEQ
335
336Match value description:
337
338  GBEGIN       Beginning position in the genome sequence
339  GEND         End position in the genome sequence
340  READSEQ      Read sequence.
341
342The following keys are output:
343
344  id           ID value of the input file Fasta header (>..[id=ID,..]..)
345  fragId       Fragment ID value (>..[..,fragId=FRAGID,..]..)
346  contigId     Name of the genome sequence (see --genome-naming)
347  errors       Absolute numbers of errors in this match
348  percId       Percent identity (see --percent-identity)
349  ambiguity    Number of matches of this read as good as or better than this
350
351If the ID or fragment ID values of a read cannot be found in the reads file,
352the read number (beginning with 0) is used instead. For matches on the
353reverse strand, GBEGIN and GEND are positions on the related forward strand
354and GBEGIN > GEND.
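
The header line of a match can be taken apart as in the following sketch. The
sample header is hypothetical and only follows the pattern given above; the
sketch also assumes that no value contains a comma.

```python
import re

HEADER_RE = re.compile(r"^>(\d+),(\d+)\[(.*)\]\s*$")


def parse_enhanced_fasta_header(header):
    """Return (GBEGIN, GEND, strand, key/value dictionary) of a header."""
    found = HEADER_RE.match(header)
    if found is None:
        raise ValueError("not an Enhanced Fasta match header")
    gbegin, gend = int(found.group(1)), int(found.group(2))
    attributes = {}
    for item in found.group(3).split(","):
        if item:
            key, _, value = item.partition("=")
            attributes[key] = value
    # In this format reverse-strand matches are stored with GBEGIN > GEND.
    strand = "R" if gbegin > gend else "F"
    return gbegin, gend, strand, attributes


# Hypothetical header, made up for illustration.
sample = ">73,47[id=read1,fragId=1,contigId=contig1,errors=2,percId=92.593,ambiguity=1]"
print(parse_enhanced_fasta_header(sample))
```

All values are returned as strings; converting 'errors' or 'percId' to
numbers is left to the caller.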
355
356---------------------------------------------------------------------------
3574.3. Eland Format
358---------------------------------------------------------------------------
359
360Each line of the output file corresponds to a read. Lines appear in the same
361order as the reads are stored in the reads file. A line consists of the
362following tab-separated values:
363
364>RNAME READSEQ TYPE N0 N1 N2 GNAME GBEGIN* GSTRAND '..' SUBST1 SUBST2 ...
365
366Additional value description:
367
368  TYPE         NM = No match found
369               QC = No matching done (too many Ns in read sequence)
370               Ux = Best match found was unique with x errors
371               Rx = Multiple best matches found having x errors
372  N0 N1 N2     Number of exact, 1-error, and 2-error matches
373  GBEGIN*      Minimum of GBEGIN and GEND
374  SUBSTx       Position and type of the x'th mismatch (not for --indels)
375               (e.g. 12A: 12'th base was A in the genome)
376
377---------------------------------------------------------------------------
3784.4. General Feature Format
379---------------------------------------------------------------------------
380
381The General Feature Format is specified by the Sanger Institute as a tab-
382delimited text format with the following columns:
383
384<seqname> <src> <feat> <start> <end> <score> <strand> <frame> [attr] [cmts]
385
386See also: http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml
387Consistent with this specification razers GFF output looks as follows:
388
389GNAME razers read GBEGIN GEND PERCID GSTRAND . ATTRIBUTES
390
391Match value description:
392
393  GNAME        Name of the genome sequence (see --genome-naming)
394  razers       Constant
395  read         Constant
396  GBEGIN       Beginning position in the genome sequence
397               (positions are counted from 1)
398  GEND         End position in the genome sequence (included!)
399  PERCID       Percent identity (see --percent-identity)
400  GSTRAND      '+'=forward strand or '-'=reverse strand
401  .            Constant
402  ATTRIBUTES   A list of attributes in the format <tag_name>[=<tag>]
403               separated by ';'
404
405Attributes are:
406
407  ID=          Name of the read
408  quality=     Ascii coded quality values of the read
409  cigar=       Read-reference alignment description in cigar format*
410  mutations=   Positions and bases that differ from the reference
411               with respect to the read (counting from 1)
412  unique       This is the best read match and it is unique
413  multi        This is one of multiple best matches
414  suboptimal   This is a suboptimal read match
415
416The original read sequence can be retrieved using the genomic subsequence
417and the information contained in the 'cigar' and 'mutations' tags.
418
419For matches on the reverse strand, GBEGIN and GEND are positions on the
420related forward strand. It holds GBEGIN < GEND, regardless of GSTRAND.
421
422*http://may2005.archive.ensembl.org/Docs/wiki/html/EnsemblDocs/CigarFormat.html
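
A GFF line written by razers can be split into its columns and attributes as
sketched below. The sample line is hypothetical and only follows the column
layout described above. Splitting the attributes on ';' is an assumption that
breaks if a value, for example the quality string, itself contains ';'.

```python
def parse_razers_gff_line(line):
    """Split a razers GFF line into a dictionary of columns and attributes."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 9:
        raise ValueError("expected at least 9 tab-separated columns")
    attributes = {}
    for item in columns[8].split(";"):
        item = item.strip()
        if not item:
            continue
        tag, separator, value = item.partition("=")
        # Tags without '=' (unique, multi, suboptimal) are stored as flags.
        attributes[tag] = value if separator else True
    return {
        "gname": columns[0],
        "gbegin": int(columns[3]),   # counted from 1
        "gend": int(columns[4]),     # included
        "percid": float(columns[5]),
        "gstrand": columns[6],
        "attributes": attributes,
    }


# Hypothetical line, made up for illustration.
sample = "contig1\trazers\tread\t48\t73\t92.593\t+\t.\tID=read1;unique"
print(parse_razers_gff_line(sample))
```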
423
424---------------------------------------------------------------------------
4255. Example
426---------------------------------------------------------------------------
427
428---------------------------------------------------------------------------
4295.1. Single read mapping
430---------------------------------------------------------------------------
431
432There are example read and genome files in the folder
433snapshot/apps/razers/example containing three 27 bp reads and two short
434contigs. The three reads and their reverse complements were implanted, with
435errors, into the genome. To map the example reads onto the genome:
436
437  1)  cd snapshot/apps/razers
438  2)  ./razers example/genome.fa example/reads.fa -id -a -mN -v
439  3)  less example/reads.fa.result
440
441On success, RazerS dumps the resulting matches with their corresponding
442semi-global alignments into the file example/reads.fa.result:
443
444read1   0   27  F   contig1 47  73  92.593
445#Read:   AATTGAATGAGGTCTTGCAGCCATGGC
446#Genome: AATTGAATGACGTC-TGCAGCCATGGC
447read1   0   27  R   contig2 300 328 92.593
448#Read:   AATTGAATGAGGTCTT-GCAGCCATGGC
449#Genome: AATTGAATGAGGTCTTCGCAGTCATGGC
450read2   0   27  R   contig2 228 255 96.296
451#Read:   CAGGAGATAAGCTGGATCGTTTACGGT
452#Genome: CAGGAGATAAGCTGGATCGTTTACAGT
453read3   0   27  F   contig2 335 362 100
454#Read:   GCCATTAGAGGCCACCACACCAGACGT
455#Genome: GCCATTAGAGGCCACCACACCAGNNNN
456
457If alignments are not needed, '-a' can be omitted, resulting in:
458
459read1   0   27  F   contig1 47  73  92.593
460read1   0   27  R   contig2 300 328 92.593
461read2   0   27  R   contig2 228 255 96.296
462read3   0   27  F   contig2 335 362 100
463
464---------------------------------------------------------------------------
4655.2. Paired-end read mapping
466---------------------------------------------------------------------------
467
468To demonstrate how paired-end read mapping works, we provide a read file
469with mates of the reads in the example above. To run RazerS in paired-end
470mode you simply have to add the second read file to the command line.
471
472  1)  cd snapshot/apps/razers
473  2)  ./razers example/genome.fa example/reads.fa example/reads2.fa -id -mN
474  3)  less example/reads.fa.result
475
476read1/L 0   27  F   contig1 47  73  92.593  1   -3  217
477read1/R 0   27  R   contig1 236 264 96.296  1   -3  -217
478read2/L 0   27  R   contig2 228 255 96.296  3   -1  -207
479read2/R 0   27  F   contig2 48  75  100     3   -1  207
480read3/L 0   27  F   contig2 335 362 100     2   0   221
481read3/R 0   27  R   contig2 529 556 100     2   0   -221
482
483In this short example the library sizes vary between 207bp and 221bp. The
484default library size of RazerS is 220bp with a tolerance of 50bp, i.e. all
485libraries between 170bp and 270bp are recognized. To alter the library size
486settings use the -ll and -le options.
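
The acceptance rule for library sizes amounts to a simple interval check,
sketched here with the default values. The function name is hypothetical and
the check only mirrors the description above.

```python
def library_size_ok(mate_dist, library_length=220, library_error=50):
    """True if the insert size |MATEDIST| is within length +/- error."""
    return abs(abs(mate_dist) - library_length) <= library_error


# MATEDIST values from the example output above.
print([library_size_ok(d) for d in (217, -217, -207, 207, 221, -221)])
```

With the defaults, insert sizes from 170 to 270 are accepted, so all three
mate-pairs of the example pass.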
487
488The pair score value in the second-to-last column is the negative sum of
489errors in both matches, and the third-to-last column is a unique number to
490identify the two corresponding mate matches. Remember that RazerS can
491output more than one match per mate-pair and two corresponding mate matches
492are not necessarily in consecutive lines in the output file, e.g. when
493altering the sort-order with the -so option.
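
The pair scores of the example can be reproduced from the percent identities
of the two mates. The sketch assumes that the error count is recovered by
rounding |R|*(1 - PERCID/100), which holds for the 27 bp example reads; the
function names are chosen for illustration.

```python
def errors_from_identity(read_length, percid):
    """Recover the number of errors from a rounded percent identity."""
    return round(read_length * (1 - percid / 100))


def pair_score(read_length, percid_left, percid_right):
    """Negative sum of the errors of both mate matches."""
    return -(errors_from_identity(read_length, percid_left)
             + errors_from_identity(read_length, percid_right))


print(pair_score(27, 92.593, 96.296))   # -3 (read1)
print(pair_score(27, 96.296, 100))      # -1 (read2)
print(pair_score(27, 100, 100))         # 0  (read3)
```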
494
495---------------------------------------------------------------------------
4966. Contact
497---------------------------------------------------------------------------
498
499For questions or comments, contact:
500  David Weese <weese@inf.fu-berlin.de>
501  Anne-Katrin Emde <emde@inf.fu-berlin.de>
502