• Home
  • History
  • Annotate
Name Date Size #Lines LOC

..03-May-2022-

config/H17-Sep-2020-17,01713,406

src/H17-Sep-2020-590,333513,028

tests/H03-May-2022-5,1824,873

util/H17-Sep-2020-7,5156,044

AUTHORSH A D12-Mar-2020181 86

COPYINGH A D12-Mar-202028 21

ChangeLogH A D17-Sep-2020996.7 KiB29,26117,990

INSTALLH A D17-Sep-202015.4 KiB371289

LICENSEH A D12-Mar-2020583 1510

Makefile.amH A D12-Mar-2020260 176

Makefile.inH A D17-Sep-202025.2 KiB805709

NOTICEH A D12-Mar-20203.4 KiB5951

READMEH A D01-Jun-202060 KiB1,3401,011

VERSIONH A D13-Sep-202010 11

acinclude.m4H A D12-Mar-2020836 3323

aclocal.m4H A D17-Sep-202039.7 KiB1,1091,010

config.siteH A D12-Mar-20205.9 KiB1570

configureH A D03-May-2022385.4 KiB13,52511,229

configure.acH A D21-Apr-202020.7 KiB749616

README

1
20.  Availability
3============
4
5The source code for this package is available from
6http://research-pub.gene.com/gmap.  License terms are provided in the
7LICENSE file.
8
9
101.  Building and installing GMAP and GSNAP
11==========================================
12
13Prerequisites: a Unix system (including Cygwin on Windows), a C
14compiler, and Perl
15
16Step 1: Set your site-specific variables by editing the file
17config.site.  In particular, you should set appropriate values for
18"prefix" and probably for "with_gmapdb", as explained in that file.
19If you are compiling this package on a Macintosh, you may need to edit
20CFLAGS to be
21
22CFLAGS = '-O3 -m64'
23
24since Macintosh machines will make only 32-bit executables by default.
25
26
27Step 2: Build, test, and install the programs, by running the
28following GNU commands
29
30    ./configure
31    make
32    make check   (optional)
33    make install
34
35Note 1: Instead of editing the config.site file in step 1, you may type
36everything on the command line for the ./configure script in step 2,
37like this
38
39    ./configure --prefix=/your/usr/local/path --with-gmapdb=/path/to/gmapdb
40
41If you omit --with-gmapdb, it defaults to ${prefix}/share.  If you
42omit --prefix, it defaults to /usr/local.  Note that on the command
43line, it is "with-gmapdb" with a hyphen, but in a config.site file,
44it is "with_gmapdb" with an underscore.
45
46
47Note 2: If you want to keep your version of config.site or have
48multiple versions, you can save the file to a different filename, and
49then refer to it like this
50
51    ./configure CONFIG_SITE=<config site file>
52
53
54Note 3: GSNAP can read from gzip-compressed FASTA or FASTQ input
55files.  This feature requires the zlib library to be present
56(available from http://www.zlib.net).  The configure program will
57detect the availability of zlib automatically.  However, to disable
58this feature, you can add "--disable-zlib" to the ./configure command
59or edit your config.site file to have the command "disable_zlib".
60
61
62Note 4: GSNAP can read from bzip2-compressed FASTA or FASTQ input
63files.  This feature requires the bzlib library to be present.  The
64configure program will detect the availability of bzlib automatically.
65However, to disable this feature, you can add "--disable-bzlib" to the
66./configure command or edit your config.site file to have the command
67"disable_bzlib".
68
69
702.  Possible issues during compilation
71======================================
72
73Recent versions of GMAP and GSNAP use certain techniques to achieve
74increased speed.  One of these techniques is special SIMD
75(single-instruction, multiple data) instructions that can perform some
76computations in parallel.  There are various levels of SIMD
77instructions, and we use those from the SSE2, SSE4.1, AVX2, and AVX512
78instruction sets, depending on the capabilities of the computer.  Most
79computers built within the last 10 years should support SSE2 at least.
80
81However, you may run into the following issues:
82
83Issue 1: If you have a computer cluster with different types of SIMD
84capabilities, then you should compile the source code multiple times,
85one on each type of computer.  If your computer has SSE2 capabilities,
86the ./configure and "make install" commands will create the programs
87gmap.sse2 and gsnap.sse2.  Likewise, if your computer has AVX2
88capabilities, the ./configure and "make install" commands will create
89gmap.avx2 and gsnap.avx2.  The gmap and gsnap programs are dispatch
90programs that determine at run time what capabilities the computer has
91and call the appropriate version of the program (e.g., gsnap.sse2 or
92gsnap.avx2).  The ./configure and "make install" commands will also
93create versions of the programs, gmap.nosimd and gsnap.nosimd, that do
94not depend on SIMD.  If the gmap and gsnap dispatch programs cannot
95find a version of the program that matches the given computer, then it
96will run the non-SIMD version instead.
97
98Issue 2.  If you compile the code successfully, but when you run the
99program, you see something that says "Illegal instruction", then you
100must be running GMAP or GSNAP on a different computer than the one
101where you compiled it.  The dispatch program should not allow this to
102happen, especially if you have compiled the code on a computer of each
103type in your cluster, so you can report this error to the authors and
104we can try to help.  In the worst case, you can compile your program
105for the lowest common denominator by by providing the
106--simd-level=<level> flag to ./configure.
107
108
1093.  Building a GMAP/GSNAP index (one chromosome per FASTA entry)
110==============================================================================
111
112A GMAP/GSNAP "index" is a set of genomic index files, representing the
113genome in a hash table format.  You use the gmap_build program to
114build your own database from a FASTA file that contains the genome.
115The gmap and gsnap programs use the same index.
116
117Below I use the terms "genome" and "chromosome", but the input
118sequences can be anything you wish to align to, including transcripts
119or small fragments.
120
121A genome can be considered regular or large in size.  A regular genome
122has a total sequence length of up to 2^32 = 4,294,967,296 (about 4
123billion) bp, while a large genome may contain up to 2^40 (about 1
124trillion) bp.  However, each individual chromosome is still limited to
1252^32 bp.  The gmap_build program will automatically recognize when a
126genome is larger than 2^32 bp and build the index files accordingly.
127
128To handle large genomes, the ./configure and "make install" procedures
129produce the programs gmapl and gsnapl.  The gmapl and gsnapl programs
130work on large genomes, while the gmap and gsnap programs work on
131regular genomes.  If you try to use the wrong program on a given
132database, the programs will notify you of the error and quit.
133
134You will need to start with a set of FASTA files containing either
135entire chromosomes or contigs that represent pieces of chromosomes.
136For example, for the human genome, you can retrieve all of the FASTA
137files under the ftp directory at
138
139    ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/
140
141If your FASTA entries each contain a single chromosome, and the
142accession for each chromosome is the chromosome number/letter, you can
143simply run this command
144
145    gmap_build -d <genome> [-k <kmer size>] <fasta_files...>
146
147which will build and install the GMAP index files in your default
148GMAPDB location.
149
150You can see the full usage of gmap_build by doing "gmap_build --help",
151but here are some useful flags.  If your FASTA files are compressed
152using gzip, you can add the flag "-g" to gmap_build.  You can control
153the k-mer size for the genomic index with the -k flag, which can range
154from 12 to 15.  The default value for -k is 15, but this requires your
155machine to have 4 GB of RAM to build the indices.  If you do not have
1564 GB of RAM, then you will need to reduce the value of -k or find
157another machine.  Here are the RAM requirements for building various
158indices:
159
160    k-mer of 12: 64 MB
161    k-mer of 13: 256 MB
162    k-mer of 14: 1 GB
163    k-mer of 15: 4 GB
164
165Note that the term
166
167    <fasta_files...>
168
169above indicates that multiple files can be listed.  The files can be
170in any order, and the contigs can be in any order within these files.
171By default, the gmap_build process will sort the contigs and
172chromosomes into their appropriate "chrom" order.  For the human
173genome, this order is 1, 2, ..., 10, 11, ..., 22, X, Y, M, followed by
174all other chromosomes in numeric/alphabetical order.  If you don't
175want this sort, provide the "-s none" flag to gmap_build.  Other sort
176options besides "none" and "chrom" are "alpha" and "numeric-alpha".
177
178You can type "gmap_build --help" to see the full set of options.  We
179discuss some specific situations below.
180
181You can provide information to gmap_build that certain chromosomes are
182circular, with the -c or --circular flag.  The value for these flags
183is a list of chromosomes, separated by commas.  The gmap_build program
184will then allow GSNAP and GMAP to align reads across the ends of the
185chromosome.  For example, the mitochondrial genome in human beings is
186circular.
187
188If your genome files are compressed using gzip, you can simply add the
189flag "-g" to gmap_build.  But if your genome files require some other
190type of processing, you may need to write a small script that pipes
191the sequences in FASTA format to gmap_build.  You can tell gmap_build
192about your script with the -E (or --fasta-pipe) flag, like this
193
194    gmap_build -d <genome> -E 'gunzip -c chr*.gz'
195    gmap_build -d <genome> -E 'cat *.fa | ./add-chromosomal-info.pl'
196
197You can think of the command as a Unix pipe for processing each FASTA
198file before it is read by gmap_build.
199
200
2014.  Running GMAP
202================
203
204To see the full set of options, type "gmap --help".  The following are
205some common examples of usage.  For more examples, see the document
206available at http://www.gene.com/share/gmap/paper/demo-slides.pdf
207
208For each of the examples below, we assume that you have installed a
209genome database called <genome> in your GMAPDB directory.  (If your
210database is located elsewhere, you can specify the -D flag to gmap or
211set the environment variable GMAPDB to point to that directory.)
212
213* Mapping only: To map one or more cDNAs in a FASTA file onto a
214  genome, run GMAP as follows:
215
216    gmap -d <genome> <cdna_file>
217
218* Mapping and alignment: If you want to map the cDNAs to a genome,
219  and show the full alignment, provide the -A flag:
220
221    gmap -d <genome> -A <cdna_file>
222
223
224* Alignment only: To align one or more cDNAs in a FASTA file onto a
225  given genomic segment (also in a FASTA file), use the -g flag
226  instead of the -d flag:
227
228    gmap -g <genome_segment> -A <cdna_file>
229
230
231* Batch mode: If you have a large number of cDNAs to run, and you have
232  sufficient RAM to run in batch mode, add the "-B 3", "-B 4", or "-B
233  5" option.  Details for these options are provided by running "gmap
234  --help".
235
236    gmap -d <genome> -B 5 -A <cdna_file>
237
238
239* Multithreaded mode: If your machine has several processors, you can
240  make batch mode run even faster by specifying multiple threads with
241  the -t flag:
242
243    gmap -d <genome> -B 5 -A -t <nthreads> <cdna_file>
244
245  Note that with multiple threads, the output results will appear in
246  random order, depending on which thread finishes its computation
247  first.  If you wish your output to be in the same order as the
248  input cDNA file, add the '-O' (letter O, not the number 0) flag
249  to get ordered output.
250
251  Guidelines: The -t flag specifies the number of computational
252  threads.  In addition, if your machine supports threads, GMAP also
253  uses one thread for reading the input query sequences, and one
254  thread for printing the output results.  Therefore, the total number
255  of threads will be 2 plus the number you specify.  The program will
256  work optimally if it uses one thread per available processor.  If
257  you specify too many threads, you can cause your computer to thrash
258  and slow down.  Note that other programs running on your computer
259  also need processors.
260
261
262* Compressed output: If you want to store the alignment results in a
263  compressed format, use the -Z flag.  You can uncompress the results
264  by using the gmap_uncompress.pl program:
265
266    gmap -d <genome> -Z <cdna_file> > x
267    cat x | gmap_uncompress
268
269
270Note that gmap is written for small genomes (less than 2^32 bp in
271total length).  With large genomes, there is an equivalent program
272called gmapl, which you should run instead of gmap.  The gmapl program
273is equivalent to gmap, and is based on the same source code, but is
274compiled to use 64-bit index files instead of 32-bit files.  The gmap
275and gmapl programs will detect whether the genomes are the correct
276size, and will exit if you try to run them on the wrong-sized genomes.
277
278
279
2805.  Building map files
281======================
282
283This package includes an implementation of interval index trees
284(IITs), which permits efficient lookup of interval information.  The
285gmap program also allows you (with its -m flag) to look up pre-mapped
286annotation information that overlaps your query cDNA sequence.  These
287interval index trees (or map files) are built using the iit_store
288program included in this package.  To build a map file, do the
289following:
290
291Step 1: Put your map information for a given genome into a map file
292with the following FASTA-like format:
293
294    >label coords optional_tag
295    optional_annotation (which may be zero, one, or multiple lines)
296
297For example, the label may be an EST accession, with the coords
298representing its genomic position.  Labels may be duplicated if
299necessary.
300
301The coords should be of the form
302
303    chr:position
304    chr:startposition..endposition
305
306The term "chr:position" is equivalent to "chr:position..position".  If
307you want to indicate that the interval is on the minus strand or
308reverse direction, then <endposition> may be less than
309<startposition>.
310
311Tags are very general and can be used for a variety of purposes.  For
312example, you could
313
314
315Step 2: Run iit_store on this map file as follows
316
317    cat <mapfile> | iit_store -o <mapname>
318
319The program will create a file called <mapname>.iit.
320
321Now you can retrieve this information with iit_get
322
323    iit_get <mapname>.iit <coords>
324
325where <coords> has the format "chr:position" or
326"chr:startposition..endposition".  The iit_get program has other
327capabilities, including the ability to retrieve information by label,
328like this:
329
330    iit_get <mapname>.iit <label>
331
332More details can be found by running "iit_get --help".
333
334If you plan to use this map information frequently, you may want to
335place it with its corresponding genome for future use.  In each
336GMAP/GSNAP database, there is a subdirectory for storing map files,
337like this
338
339    /path/to/gmapdb/<genome>/<genome>.maps/
340
341(If you don't remember where your default gmap directory is, run "gmap
342--version" to find it.)  If you put your <mapname>.iit file into this
343maps subdirectory, you can get additional functionality.  First, you
344can run the program get-genome, which is used mainly for getting
345genome sequence, to get map information instead, like this
346
347    get-genome -d <genome> -m <mapname> <coords>
348
349Second, you can use GMAP with the -m flag to retrieve map information
350that corresponds to a given cDNA sequence like this
351
352    gmap -d <genome> -m <mapname> <cdna_file>
353
354Finally, GMAP and the IIT utilities support the GFF3 format.  GMAP can
355generate its results in GFF3 format, and iit_store can parse GFF3
356files using its -G and -l flags.  More details about iit_store can be
357found by doing "iit_store --help".
358
359
3606.  Running GSNAP
361=================
362
363GSNAP uses the same database as GMAP does, so you will need to process
364the genome using gmap_build as explained above, if you haven't done
365that already.  To see the full set of options for GSNAP, type "gsnap --help".
366
367Input to GSNAP should be either in FASTQ or FASTA format.  The FASTQ
368input may include quality scores, which will then be included in SAM
369output, if that output format is selected.  For single-end reads, the
370FASTQ file may be piped into GSNAP, or given as its command-line
371argument, like this
372
373    cat <fastq_file> | gsnap -d <genome>
374
375or
376
377    gsnap -d <genome> <fastq_file>
378
379
380For paired-end reads, the two corresponding FASTQ files should be
381given as command-line arguments in pairs, like this
382
383    gsnap -d <genome> <fastq_file_1> <fastq_file_2> [<fastq_file_3> <fastq_file_4>...]
384
385A pipe cannot work since GSNAP needs to access both FASTQ files in
386parallel.  The reads in FASTQ files may have varying lengths, if
387desired.  Note that GSNAP can process multiple sets of paired-end
388reads, by adding the files in pairs.  If you want to provide multiple
389single-end files, you can either use "cat" to concatenate them into
390the stdin of gsnap, like this:
391
392    cat <fastq_file_1> [<fastq_file_2>...] | gsnap -d <genome>
393
394or you can provide them all on the command line with the
395--force-single-end flag, like this:
396
397    gsnap -d <genome> --force-single-end <fastq_file_1> [<fastq_file_2>...]
398
399which will process each FASTQ file one at a time as single-end reads,
400and not try to pair them up.
401
402GSNAP also has the ability to deal with files compressed with gzip, if
403the configure script at compile time can find a zlib library in your
404system (see Note 3 in the section above about building and installing
405GMAP and GSNAP).  If so, and your files are gzipped, you can then read
406in gzipped files directly like this
407
408    gsnap --gunzip -d <genome> <fastq.gz>, or
409    gsnap --gunzip -d <genome> --force-single-end <fastq1.gz> [<fastq2.gz>...]
410
411for single-end reads, or
412
413    gsnap --gunzip -d <genome> <fastq_1.gz> <fastq_2.gz> [<fastq_3.gz> <fastq_4.gz>...]
414
415for paired-end reads.
416
417Likewise, GSNAP can handle files compressed with bzip2, if the
418configure script at compile time can find a bzlib library in your
419system (see Note 3 in the section above about building and installing
420GMAP and GSNAP).  If so, and your files are bzip2-compressed, you can
421then read in those files directly like this
422
423    gsnap --bunzip2 -d <genome> <fastq.bz2>, or
424    gsnap --bunzip2 -d <genome> --force-single-end <fastq1.bz2> [<fastq2.bz2>...]
425
426for single-end reads, or
427
428    gsnap --bunzip2 -d <genome> <fastq_1.bz2> <fastq_2.bz2> [<fastq_1.bz2> <fastq_2.bz2>...]
429
430for paired-end reads.
431
432
433For FASTA format, you should include one line per read (or end of a
434paired-end read).  The same FASTA file can have a mixture of
435single-end and paired-end reads of varying lengths, if desired.
436
437Single-end reads:
438
439Each FASTA entry should contain one short read per line, like this
440
441>Header information
442AAAACATTCTCCTCCGCATAAGCCTGCGTCAGATTA
443
444Each short read can have a different length.  However, the entire read
445needs to be on a single line, and may not wrap around multiple lines.
446If it extends to a second line, GSNAP will think that the read is
447paired-end.
448
449
450Paired-end reads:
451
452Each FASTA entry should contain two short reads, one per line, like
453this
454
455>Header information
456AAAACATTCTCCTCCGCATAAGCCTAGTAGATTA
457GGCGTAGGTAGAAGTAGAGGTTAAGGCGCGTCAG
458
459By default, the program assumes that the second end is in the reverse
460complement direction compared with the first end.  If they are in the
461same direction, you may need to use the --circular-input (or -c) flag.
462
463GSNAP and GMAP can also read an extended FASTA format that include
464quality scores, which look like this
465
466    >Header information
467    AAAACATTCTCCTCCGCATAAGCCTGCGTCAGATTA
468    +
469    <quality scores>
470
471for single-end reads, or
472
473    <Header information
474    AAAACATTCTCCTCCGCATAAGCCTGCGTCAGATTA
475    +
476    <quality scores>
477
478for the second-end of a paired-end read.  In addition, GSNAP can read
479an extended FASTA format for paired-end reads, like this:
480
481    >Header information
482    AAAACATTCTCCTCCGCATAAGCCTAGTAGATTA
483    GGCGTAGGTAGAAGTAGAGGTTAAGGCGCGTCAG
484    +
485    <quality scores 1>
486    <quality scores 2>
487
488This extended FASTA format is useful if paired-end information needs
489to be piped into GSNAP via stdin.
490
491
492As with gmap, gsnap is written for small genomes (less than 2^32 bp in
493total length).  With large genomes, there is an equivalent program
494called gsnapl, which you should run instead of gsnap.  The gsnapl
495program is equivalent to gmap, and is based on the same source code,
496but is compiled to use 64-bit index files instead of 32-bit files.
497The gsnap and gsnapl programs will detect whether the genomes are the
498correct size, and will exit if you try to run them on the wrong-sized
499genomes.
500
501
5027.  Detecting known and novel splice sites in GSNAP
503====================================================
504
505Whereas GMAP assumes RNA-seq input by default, GSNAP assumes DNA-seq
506input by default.  To tell GSNAP to find splice junctions in
507individual reads, you need to provide either the "--novelsplicing=1"
508(or "-N 1") flag and/or the "-s <splicing_file>" flag to GSNAP.
509
510GSNAP can detect splices using a probabilistic model using the
511--novelsplicing (or -N) flag.  You can also detect splices from a set
512of splice sites that you provide, using the --splicesites (or -s)
513flag.  You may specify both flags, which will report splice junctions
514involving both known and novel sites.
515
516Output for a splicing junction will look like this:
517
518>TCCGTGACGTGGATTGGTGCTGCACCCCTCATC      1       Header
519 TCCGTGACGTGGATTGgt---------------      1..16   +19:56050054..56050069  start:0..donor:0.99,splice_dist:1238,dir:sense,sub:0+0=0,label:NM_001648.KLK3.exon1/5|NM_001030047.KLK3.exon1/5|NM_001030048.KLK3.exon1/5|NM_001030049.KLK3.exon1/6|NM_001030050.KLK3.exon1/2
520,--------------agGTGCTGCACCCCTCATC      17..33  +19:56051308..56051324  acceptor:0.99..end:0,dir:sense,sub:0+0=0,label:NM_001648.KLK3.exon2/5|NM_001030047.KLK3.exon2/5|NM_001030048.KLK3.exon2/5|NM_001030049.KLK3.exon2/6|NM_001030050.KLK3.exon2/2
521
522After the "donor:" or "acceptor:" splice site type, the model score
523probability is given, even if the splice site is known.  For known
524splice sites, the "label:" field will provide information about the
525site.  If there is more than one known splice site at a genomic
526position, the labels are separated by a "|" delimiter.
527
528There are several advantages to specifying a database of known splice
529sites.  First, GSNAP will then be able to detect splicing involving
530atypical splice sites, that would otherwise give low scores using its
531probabilistic model.  A known splice site is treated as if its model
532probability is 1.0.  Second, GSNAP can find splicing involving short
533exons.  Such cases have a single end aligning to three exons,
534separated by two introns.  Third, GSNAP can identify splicing at the
535ends of reads with greater sensitivity, even if they have short
536overlaps onto the next exon.  Fourth, GSNAP can detect known long
537splices, because expected splice lengths can be included with the
538splice site information.
539
540GSNAP allows for known splicing at two levels: at the level of known
541splice sites and at the level of known introns.  At the site level,
542GSNAP finds splicing between arbitrary combinations of donor and
543acceptor splice sites, meaning that it can find alternative splicing
544events.  At the intron level, GSNAP finds splicing only between the
545set of given donor-acceptor pairs, so it is constrained not to find
546alternative splicing events, only introns included in the given list.
547For most purposes, I would recommend using known splice sites, rather
548than known introns, unless you are certain that all alternative
549splicing events are known are represented in your file.
550
551GSNAP can tell the difference between known site-level and known
552intron-level splicing based on the format of the input file.  To
553perform known site-level splicing, you will need to create a file with
554the following format:
555
556>NM_004448.ERBB2.exon1 17:35110090..35110091 donor 6678
557>NM_004448.ERBB2.exon2 17:35116768..35116769 acceptor 6678
558>NM_004448.ERBB2.exon2 17:35116920..35116921 donor 1179
559>NM_004448.ERBB2.exon3 17:35118099..35118100 acceptor 1179
560>NM_004449.ERG.exon1 21:38955452..38955451 donor 783
561>NM_004449.ERG.exon2 21:38878740..38878739 acceptor 783
562>NM_004449.ERG.exon2 21:38878638..38878637 donor 360
563>NM_004449.ERG.exon3 21:38869542..38869541 acceptor 360
564
565Each line must start with a ">" character, then be followed by an
566identifier, which may have duplicates and can have any format, with
567the gene name or exon number shown here only as a suggestion.  Then
568there should be the chromosomal coordinates which straddle the
569exon-intron boundary, so one coordinate is on the exon and one is on
570the intron.  (Coordinates are all 1-based, so the first character of a
571chromosome is number 1.)  Finally, there should be the splice type:
572"donor" or "acceptor".  You may optionally store the intron distance
573at the end.  GSNAP can use this intron distance, if it is longer than
574its value for --localsplicedist, to look for long introns at that
575splice site.  The same splice site may have different intron distances
576in the database; GSNAP will use the longest intron distance reported
577in searching for long introns.
578
579Note that the chromosomal coordinates are in the sense direction.
580Therefore, genes on the plus strand of the genome (like NM_004448) have
581the coordinates in ascending order (e.g., 35110090..35110091).
582Genes on the minus strand of the genome (like NM_004449) have the
583coordinates in descending order (e.g., 38955452..38955451).
584
585On the other hand, to perform known intron-level splicing, you will need
586to create a file with the following format:
587
588>NM_004448.ERBB2.intron1 17:35110090..35116769
589>NM_004448.ERBB2.intron2 17:35116920..35118100
590>NM_004449.ERG.intron1 21:38955452..38878739
591>NM_004449.ERG.intron2 21:38878638..38869541
592
593Again, coordinates are 1-based, and specify the exon coordinates
594surrounding the intron, with the first coordinate being from the donor
595exon and the second one being from the acceptor exon.
596
597
598There are several ways to help you generate these files.  First, if
599you have a GTF file, you can use the included programs gtf_splicesites
600and gtf_introns like this:
601
602    cat <gtf file> | gtf_splicesites > foo.splicesites
603    cat <gtf file> | gtf_introns > foo.introns
604
605Second, if you retrieve an alignment tracks from UCSC, like this:
606
607    ftp://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/refGene.txt.gz
608
609if you are aligning to genome hg18, or
610
611    ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz
612
613if you are aligning to genome hg19, you can process this track using
614the included program psl_splicesites or psl_introns, like this:
615
616    gunzip -c refGene.txt.gz | psl_splicesites -s 1 > foo.splicesites
617    gunzip -c refGene.txt.gz | psl_introns -s 1 > foo.introns
618
619Note that alignment tracks in UCSC sometimes have an extra column on
620the left.  The "-s" flag allows you to indicate how many columns
621should be skipped.
622
623Once you have built this splicesites or introns file, you process it
624as a map file (see "Building map files" above), by doing
625
626    cat foo.splicesites | iit_store -o <splicesitesfile>, or
627    cat foo.introns | iit_store -o <intronsfile>
628
629If you want to include more than one track, you can do this:
630
631    gunzip -c refGene.txt.gz | psl_splicesites -s 1 > foo
632    gunzip -c knownGene.txt.gz | psl_splicesites > bar
633    cat foo bar | iit_store -o <splicesitesfile>
634
635
636A third way to build a known splicesites or known introns file is
637useful if you have cDNA sequences rather than an alignment track, or
638if you do not trust the alignment track and prefer to use cDNA
639sequences.  GMAP has an option "-f splicesites" that finds splice
640sites in cDNA sequences and reports them in the correct splicesite
641format.  Likewise, GMAP can build an intron file, with the option "-f
642introns".
643
644When processing known cDNA sequences, you should also run GMAP with
645the "-n 1" flag, so you get the best alignment, and with the "-z
646sense_force" or "-z sense_filter" flag.  The sense_force option will
647help GMAP know that the introns in your cDNA sequences are in the
648correct GT-AG sense, and is applicable when you have a high quality
649set of cDNA sequences.  The sense_filter option will allow GMAP to try
650either sense or antisense, and to filter out sequences that appear to
651be antisense; this is applicable if you are uncertain about the
652validity of your cDNA sequences.
653
654Again once you have built either a known splicesites or known introns
655file, you process it as a map file by doing
656
657    cat <file> | iit_store -o <splicesitesfile>, or
658    cat <file> | iit_store -o <intronsfile>
659
660which creates <splicesitesfile>.iit or <intronsfile>.iit.
661
662
663Regardless of how you built <splicesitesfile>.iit or <intronsfile>.iit,
664you put it in the maps subdirectory by doing
665
666    cp <splicesitesfile>.iit /path/to/gmapdb/<genome>/<genome>.maps, or
667    cp <intronsfile>.iit /path/to/gmapdb/<genome>/<genome>.maps
668
669Then, you may use the file by doing this:
670
671    gsnap -d <genome> -s <splicesitesfile> <shortreads>, or
672    gsnap -d <genome> -s <intronsfile> <shortreads>, or
673
674
6758.  SAM output format
676=====================
677
678GSNAP can generate SAM output format, by providing the "-A sam" flag
679to GSNAP.  In addition, GMAP can also print its alignments in SAM
680output, using the "-f samse" or "-f sampe" options, for single-end or
681paired-end data.  The sampe option will generate SAM flags to indicate
682whether the read is the first or second end of a pair, which requires
683that you provide GMAP with an extended FASTA format having a ">" or
684"<" character in the header to indicate that information.  However,
685the sampe option will change only the SAM flags, and not change the
686underlying alignment algorithm.  GMAP does not know how to find
687concordance between paired-end reads like GSNAP does.
688
689GSNAP provides some special SAM flags as follows:
690
691XQ: A non-normalized mapping quality score
692
693X2: The second best XQ score among all multimapping alignments for a
694given read.  If there is only a single alignment, this value is 0.
695
696XO: Output type.  GSNAP categorizes its alignments into output types,
697as follows.  Note that the --split-output option will create separate
698output files for each output type.  Alternatively, if you use
699sam_sort, you should provide --split-output to that program instead
700and achieve the same functionality.  (The reason for this is that
701there may be situations where GSNAP assigns different output types to
702the first and second ends of the reads and sam_sort needs to see
703alignments from both ends together.)  In either case, the output types
704have the following meanings and filename suffixes:
705
706  NM (nomapping) (filename suffix ".nomapping"): The entire read
707  (single-end or paired-end) could not be aligned.  If the
708  --failed-input=FILENAME flag is specified, then these reads are also
709   printed in FASTQ or FASTA format (depending on the input format) in
710  the given file (plus a .1 or .2 ending for the first and second ends
711  of a paired-end read).
712
713  CU (concordant unique) (filename suffix ".concordant_uniq"): Both
714  ends of a paired-end read could be aligned concordantly to a single
715  position in the genome.  For a definition of concordance, see the
716  section "Output types" below.
717
718  CM (concordant multiple) (filename suffix ".concordant_mult"): Both
719  ends of a paired-end read could be aligned concordantly, but to more
720  than one position in the genome.
721
722  CX (concordant multiple excess) (filename suffix
723  ".concordant_mult_xs"): Multiple concordant alignments, but user
724  specified --quiet-if-excessive and the number of alignments exceeds
725  "-n" threshold.  If the --failed-input option is given, these reads
726  are also printed in FASTA or FASTQ format in the given file.
727
728  CT (concordant translocation) (filename suffix
729  ".concordant_transloc"): Both ends of a paired-end read could be
730  aligned concordantly, but one end requires a split alignment to a
731  distant location, such as another chromosome, or a different strand
732  on that chromosome, or a far distance on that strand.  Note that
733  translocation alignments need to be printed on two separate SAM
734  lines.
735
736  CC (concordant circular) (filename suffix ".concordant_circular"):
737  Both ends of a paired-end read could be aligned concordantly, but
738  one or both ends require an alignment that goes around the origin of
739  a circular chromosome.  Circular chromosomes are specified in the
740  gmap_build step by using the -c or --circular flag, as described
741  previously.  Note that circular alignments need to be printed on two
742  separate SAM lines.
743
744  PI (paired unique inversion) (filename suffix ".paired_uniq_inv"):
745  Both ends of a paired-end read could be aligned uniquely, but in a
746  way that indicates that a genomic inversion has occurred between the
747  two ends.
748
749  PS (paired unique scramble) (filename suffix ".paired_uniq_scr"):
750  Both ends of a paired-end read could be aligned uniquely, but in a
751  way that indicates that the genomic order is scrambled.  This
752  typically occurs because of tandem duplications.
753
754  PL (paired unique long) (filename suffix ".paired_uniq_long"): Both
755  ends of a paired-end read could be aligned uniquely, but in a way
756  that indicates that a large genomic deletion has occurred between
757  the two ends.
758
759  PC (paired unique circular) (filename suffix
760  ".paired_uniq_circular"): Both ends of a paired-end read could be
761  aligned uniquely, but not concordantly, representing an inversion,
762  scramble, or deletion.  In addition, one or both ends of the read
763  goes around the origin of a circular chromosome.
764
765  PM (paired multiple) (filename suffix ".paired_mult"): Both
766  ends of a paired-end read could be aligned near each other,
767  representing an inversion, scramble, or deletion, but there are
768  multiple places in the genome where an alignment is found.
769
770  PX (paired multiple excess) (filename suffix ".paired_mult_xs"):
771  Multiple paired alignments, but user specified --quiet-if-excessive
772  and the number of alignments exceeds the "-n" threshold.  If the
773  --failed-input option is given, these reads are also printed in
774  FASTA or FASTQ format in the given file.
775
776  HU, HM, HT, HC (halfmapping unique, halfmapping multiple,
777  halfmapping translocation, and halfmapping circular, respectively)
778  (filename suffixes: ".halfmapping_uniq", ".halfmapping_mult",
779  ".halfmapping_transloc", ".halfmapping_circular): Same as for the
780  concordant output types, except that only one end of the paired-end
781  read could be aligned, and the other end could not be aligned
782  anywhere in the genome.
783
784  HX (halfmapping multiple excess) (filename suffix
785  ".halfmapping_mult_xs"): Multiple halfmapping alignments, but user
786  specified --quiet-if-excessive and the number of alignments exceeds
787  the "-n" threshold.  If the --failed-input option is given, these
788  reads are also printed in FASTA or FASTQ format in the given file.
789
790  UU, UM, UT, UC (unpaired unique, unpaired multiple, unpaired
791  translocation, and unpaired circular, respectively) (filename
792  suffixes: ".unpaired_uniq", ".unpaired_mult", ".unpaired_transloc",
793  ".unpaired_circular): Same as for the concordant output types,
794  except that the two ends could not be aligned concordantly or even
795  paired.  These "unpaired" categories are also used for single-end
796  reads, since they lack a mate end that can allow for concordance,
797  pairing, or halfmapping.
798
799  UX (unpaired multiple excess) (filename suffix ".unpaired_mult_xs"):
800  Multiple unpaired alignments, but user specified
801  --quiet-if-excessive and the number of alignments on one end exceeds
802  the "-n" threshold.  If the --failed-input option is given, these
803  reads are also printed in FASTA or FASTQ format in the given file.
804
805
806XB: Prints the barcode extracted from the end of the read.  Applies only
807if --barcode-length is not 0.
808
809XP: Prints the primer inferred from a paired-end read.  Applies only
810if the --adapter-strip flag is specified.
811
812XH: Prints the part of the query sequence that was hard-clipped.
813Sequence is printed in plus-genomic order and replaces the "H" part of
814the CIGAR string.
815
816XI: Prints the part of the quality string that was hard-clipped.
817Sequence is printed in plus-genomic order and replaces the "H" part of
818the CIGAR string.
819
820XS: Prints the strand orientation (+ or -) for a splice.  Appears only
821if splicing is allowed (-N or -s flag provided), and only for reads
822containing a splice.  The value "+" means the expected GT-AG, GC-AG,
823or AT-AC dinucleotide pair is on the plus strand of the genome, and
824"-" means the dinucleotides are on the minus strand.  If the
825orientation is not obvious, because the dinucleotides do not match
826GT-AG, GC-AG, AT-AC, or their complements, then the program applies a
827probabilistic splice model to determine the orientation.  If the
828splice sites have low probability, then the program may not be able to
829determine an orientation, and the result will be printed as XS:A:?.
830To prevent this flag, which cannot be handled by such programs (such
831as Cufflinks), use the --force-xs-dir flag.  However, this flag will
832merely change occurrences of XS:A:? arbitrarily to XS:A:+.
833
834XA: Indicates an ambiguous splice.  If GSNAP finds two or more
835possible splices at a given position, it will try to resolve the
836ambiguity if possible based on the other end of the paired-end read.
837If the ambiguity cannot be resolved, GSNAP will not report any of the
838splices, but will report a soft clip instead.  The XA field indicates
839which end or ends are ambiguous and the number of matches found on
840each ambiguous end, based on the output XA:Z:i,j.  If i or j is
841greater than 0, that indicates that the lower or higher chromosomal
842end is ambiguous, respectively.  The value given indicates the number
843of matches found in the ambiguous end.  This number may be smaller
844than the number of bases soft-clipped, due to mismatches.
845
846XC: Indicates whether the alignment crosses over the origin of a
847circular chromosome.  If so, the string XC:A:+ is printed.
848
849XT: Prints the intron dinucleotides and splice probabilities around a
850distant splicing event (genomic deletion, inversion, scramble, or
851translocation).
852
853XW and XV: Printed only when SNP-tolerant alignment is enabled.  XW
854provides the number of mismatches against both the reference and
855alternate alleles (or the "World" population).  Therefore, these are
856true mismatches.  XV provides the number of positions that are
857mismatches against the reference genome, but do match the alternate
858genome.  Therefore, these are known variant positions.  The sum of XW
859and XV provides the number of differences relative to the reference
860genome, and with the exception of indels, should equal the value of
861NM.
862
863XG: Indicates which method within GSNAP generated the alignment.  B:
864GMAP alignment produced from suffix array, M: GMAP alignment produced
865from GSNAP hash table method, T: terminal alignment, O: merging of
866overlaps.  Absence of XG flag indicates the standard GSNAP hash table
867method.  (Note: older versions of GSNAP used "PG:", but some
868downstream software required all PG methods to be listed in the header
869section, so we changed the field name to "XG:")
870
871
8729.  GSNAP output format
873========================
874
875By default, GSNAP prints its output in a FASTA-like format, which we
876developed before we incorporated the SAM format.  The default GSNAP
877output has some advantages over SAM output, especially for debugging
878purposes.  However, we routinely use SAM output for our own pipeline,
879and it has been subject to the most testing by ourselves and by outside
880users.
881
882Here is some output from GSNAP on a paired-end read:
883
884>GGACTGCGCACCGAACGGCAGCGACTTCCCGTAGTAGCGGTGCTCCGCGAAGACCAGTAGAGCCCCCCGCTCGGCC   1 concordant    ILLUMINA-A1CCE9_0004:1:1:1510:2090#0
885 GGACTGCGCACCGAACGGCAGCGACTTCCCGTAGTAGCGct-----------------------------------   1..39   +9:139128263..139128301 start:0..acceptor:0.99,dir:antisense,splice_dist:214,sub:0+0=0,label:NM_013379.DPP7.exon4/13  segs:2,align_score:2  pair_score:5,pair_length:112
886,-------------------------------------acGTGCTCCGCGAAGACCAGTAGAGCCCCCCGCTCGGCC   40..76  +9:139128516..139128552 donor:0.96..end:0,dir:antisense,splice_dist:214,sub:0+0=0,label:NM_013379.DPP7.exon3/13
887
888<CTTCGCCAACAACTCGGGCTTCGTCGCGGAGCTGGCGGCCGAGCGGGGGGCTCTACTGGTCTTCGCGGAGCACCGC   1 concordant    ILLUMINA-A1CCE9_0004:1:1:1510:2090#0
889 CTTCGCCAACAACTCGGcCTTCGTCGCGGAGCTGGCGGCCGAGCGGGGGGCTCTACTGGTCTTCGCGGAGCACgtg   1..73   -9:139128588..139128516 start:0..end:3,sub:3+1=4      segs:1,align_score:3  pair_score:5,pair_length:112
890
891Each end of a read gets its own block, with the first read starting
892with ">" and the second read for paired-end reads starting with "<".
893The block starts with a header line that has in column 1, the query
894sequence in its original direction (and with lower-case preserved if
895any); in column 2, the number of hits for that query and if the read
896is paired-end, the relationship between the ends (as discussed in the
897next paragraph); and in column 3, the accession number for the query.
898
899
90010.  Output types
901=================
902
903The two ends of a paired-end read can have the following
904relationships: "concordant", "paired", or "unpaired".  A paired-end
905read is concordant if the ends align to the same chromosome, in the
906expected relative orientations, and having an inferred insert length
907greater than zero and within the "--pairmax" parameter.  The inferred
908insert length is the distance from the end of the first-end alignment
909to the start of the second-end alignment, plus the read lengths of the
910two ends.  There may be more than one concordant alignment for a given
911read, and if so, the alignments for each end are reported in
912corresponding order.
913
914If a concordant relationship cannot be found, then the program will
915report any paired relationships it can find.  A paired alignment
916occurs when the two ends align to the same chromosome, but fail some
917criterion for concordance.  There are different subtypes of paired
918alignments, depending on which criterion is violated.  If the
919orientations are opposite what is expected, the paired subtype is
920"inversion".  If they are in the expected orientation, but the
921distance is greater than the "--pairmax" parameter, then the paired
922subtype is "toolong".  If they are in the expected orientation, but
923the inferred insert length appears to be negative, then the paired
924subtype is "scramble".  In GSNAP output, a paired subtype is shown in
925a label called "pairtype", which can have the values
926"pairtype:inversion", "pairtype:toolong", and "pairtype:scramble".
927
928Otherwise, if neither a concordant nor paired alignment can be found,
929then the program will align each end separately, and report the
930relationship as being "unpaired".
931
932GSNAP can find translocation splices within a single end of a read,
933but it tries to be conservative about reporting them.  If there is any
934alignment that does not involve such a translocation, then it will not
935report the translocation.  It therefore reports translocation splices
936only when no other alignment is found within the concordant, paired,
937or unpaired categories.  Therefore, such results are listed in the
938header as having "(transloc)" appear after the "concordant", "paired",
939or "unpaired" result type.
940
941After the query line, each of the genomic hits is shown, up to the
942'-n' parameter.  If too many hits were found (more than the '-n'
943parameter), the behavior depends on whether the "--quiet-if-excessive"
944flag is given to GSNAP.  If not, then the first n hits will be printed
945and the rest will not be printed.  If the "--quiet-if-excessive" flag
946is given to GSNAP, then no hits will be printed if the number exceeds
947n.
948
949Each of the genomic hits contains one or more alignment segments,
950which is a region of continuous one-to-one correspondence (matches or
951mismatches) between the query and the genome.  Multiple segments occur
952when the alignment contains an insertion, deletion, or splice.  The
953first segment is marked by a space (" ") at the beginning of the line,
954while the second and following segments are marked by a comma (",") at
955the beginning of the line.  (In the current implementation of GSNAP
956that allows only a single indel or splice, the number of segments is
957at most two.)
958
959The segments contain information in tab-delimited columns as follows:
960
961Column 1: Genomic sequence with matches in capital letters, mismatches
962in lower-case letters, and regions outside the segment with dashes.
963For deletions in the query, the deleted genomic sequence is also
964included in lower case.  For spliced reads, the two dinucleotides at
965the intron ends are included in lower case.
966
967Column 2: Range of query sequence aligned in the segment.  Coordinates
968are inclusive, with the first nucleotide considered to be position 1.
969
970Column 3: Range of genomic segment aligned, again with inclusive
971coordinates, with the first nucleotide in each chromosome considered
972to be position 1.  Plus and minus strands are marked with a "+" or "-"
973sign.
974
975Column 4: Segment information, delimited by commas.  The first item
976reports on the ends of the segment, which can be of type "start",
977"end", "ins", "del", "donor", "acceptor", or "term".  After "start"
978and "end", we report the number of nucleotides clipped or trimmed from
979the segment.  In our example above, "end:3" means that 3 nucleotides
980should be trimmed from the end.  Trimming finds a local maximum of
981matches to mismatches from the end and is computed only if the "-T"
982flag is specified, and the value for "-T" limits the amount of
983trimming allowed.  After "ins" and "del", we report the number of
984nucleotides that were inserted or deleted in the query relative to the
985genome.  After "donor" or "acceptor", we report the probability of the
986splice site, based on the MaxEnt model.  The "term" label indicate a
987terminal segment, where the entire read could not be aligned, but more
988than half of the read could be aligned from either end.
989
990Each segment will also show after the "sub" tag, he number of
991mismatches in that segment including the part that is trimmed, if any.
992If SNP-tolerant alignment is chosen, then the number of SNPs is also
993shown (see details below under SNP-tolerant alignment).  Other
994information may also be included with the segment information, such as
995the orientation and distance of the splice or known splice labels, if
996appropriate flags and information are given to GSNAP.  Splices are
997marked with a splice_type, which can be "consistent", "inversion",
998"scramble", or "translocation".  A "translocation" splice includes
999splices on the same chromosome where the splice distance exceeds the
1000parameter for localsplicedist.
1001
1002Column 5: Alignment or hit information, delimited by commas.  For the
1003first segment in a hit (the one starting with a space), this column
1004provides the number of segments (denoted by "segs:") and the score of
1005the alignment (denoted by "align_score:").
1006
1007Column 6: Pair information (for paired-end reads only).  For the first
1008segment in a hit (with the same information repeated on both ends of a
1009concordant pair), this column provides the score of the pair (which is
1010the sum of the alignment scores) and the inferred length of the insert
1011(ignoring splices within each segment, but not between segments).
1012
1013
101411.  SNP-tolerant alignment in GSNAP
1015====================================
1016
1017GSNAP has the ability to align to a reference space of all possible
1018major and minor alleles in a set of known SNPs provided by the user.
1019This ability is provided by the -v flag, and produces output like
1020this:
1021
1022>ATGGTAATCCTGCTCAGTACGAGAGGAACCGCAGGA   2       Header
1023 ATGGTAATCCTGCTCAGTACGAGAGGAACCGCAGGt   1..36   -12:34263937..34263902  start:0..end:0,sub:1+0=1
1024 ATGGTAATCCTGCTCAGTAGGAGAGGAACCCCAGGt   1..36   -21:14379300..14379265  start:0..end:0,sub:1+2=3
1025
1026The notation "sub:1+0=1" indicates that the SNP-tolerant alignment has
1027one mismatch ("sub:1") and zero minor SNP alleles ("+0"), for a total
1028of one differences ("=1") relative to the reference genome with all
1029major alleles.  Likewise, the notation "sub:1+2=3" indicates one
1030SNP-tolerant mismatch, two minor SNP alleles, and 3 mismatches
1031relative to the reference sequence with all major alleles.
1032
1033Note that by default, GSNAP shows only differences relative to both
1034the reference and alternate genomes in lower case.  If you want to
1035show differences relative to the reference genome in lower case, you
1036will need to provide the flag --show-refdiff, which would give the
1037output above as:
1038
1039>ATGGTAATCCTGCTCAGTACGAGAGGAACCGCAGGA   2       Header
1040 ATGGTAATCCTGCTCAGTACGAGAGGAACCGCAGGt   1..36   -12:34263937..34263902  start:0..end:0,sub:1+0=1
1041 ATGGTAATCCTGCTCAGTAgGAGAGGAACCcCAGGt   1..36   -21:14379300..14379265  start:0..end:0,sub:1+2=3
1042
1043For SAM output format, all differences from the reference sequence are
1044shown in the NM and MD fields, although the alignment scoring and
1045mapping qualities are based on a SNP-tolerant alignment.
1046
1047Before deciding to use SNP-tolerant alignment, please note the
1048following.  First, SNP-tolerant alignment requires more memory than
1049standard alignment, perhaps twice as much.  Also, with longer reads
1050now of 75 or more bp, GSNAP alignments are generally fine without
1051SNP-tolerant alignment.  Finally, if your SNPs are incorrect (and many
1052in the databases are), then they can lead SNP-tolerant alignment to
1053favor the faulty SNPs.
1054
1055If you still wish to use SNP-tolerant alignment, you will need to
1056specify a set of known SNPs by creating a file with the following
1057format:
1058
1059>rs62211261 21:14379270 CG
1060>rs62211262 21:14379281 CG
1061
1062Each line must start with a ">" character, then be followed by an
1063identifier (which may have duplicates).  Then there should be the
1064chromosomal coordinate of the SNP.  (Coordinates are all 1-based, so
1065the first character of a chromosome is number 1.)  Finally, there
1066should be the two possible alleles.  (Previous versions required that
1067these be in alphabetical order: "AC", "AG", "AT", "CG", "CT", or "GT",
1068but that is no longer a requirement.)  These alleles must correspond
1069to the possible nucleotides on the plus strand of the genome.  If the
1070one of these two letters does not match the allele in the reference
1071sequence, that SNP will be ignored in subsequent processing as a
1072probable error.
1073
1074GSNAP also supports the idea of a wildcard SNP.  A wildcard SNP allows
1075all nucleotides to match at that position, not just a given reference
1076and alternate allele.  It is essentially as if an "N" were recorded at
1077that genomic location, although the index files still keep track of
1078the reference allele.  To indicate that a position has a wildcard SNP,
1079you can indicate the genotype as "WN", where "W" is the reference
1080allele.  Another indication of a wildcard SNP is to provide two
1081separate lines at that position with the genotypes "WX" and "WY",
1082where "W" is the reference allele and "X" and "Y" are two different
1083alternate alleles.
1084
1085To help you generate the file, this package includes programs that can
1086process known SNP data, from various formats, including dbSNP files,
1087VCF format, or GVF format.  The dbSNP files can be obtained from UCSC,
1088like this
1089
1090    ftp://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/snp130.txt.gz
1091
1092For versions before snp132, you may also want to exclude exceptions,
1093which will require this file:
1094
1095    ftp://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/snp130Exceptions.txt.gz
1096
1097You can then process these files using our program dbsnp_iit, like
1098this:
1099
1100    gunzip -c snp130.txt.gz | dbsnp_iit [-w <weight>] [-e snp130Exceptions.txt.gz] > <snpfile>.txt, or
1101    gunzip -c snp130.txt.gz | dbsnp_iit [-w <weight>] [-e snp130Exceptions.txt] > <snpfile>.txt
1102
1103For versions starting with snp132, the exceptions are included in
1104column 19 of the snp file, so the -e flag is not necessary.  The
1105dbsnp_iit program will read the exceptions from that column.
1106
1107The option "-w <weight>" makes use of the dbSNP item weight, a value
1108from 1 to 3, where lower weight means higher confidence.  Items will
1109be included if the item weight is the given value <weight> or less.
1110The default value of -w is 1, which is the criterion UCSC uses to
1111build its ambiguous version of the genome.  To allow all item weights,
1112specify "-w 3".
1113
1114SNPs have various types of exceptions, as provided either by the "-e
1115<exceptions_file>" flag, or starting with snp132, in the snp file
1116itself.  By default, dbsnp_iit will exclude all types of exceptions.
1117However, if you want to include some types of exceptions, you will
1118need to modify the dbsnp_iit program (written in Perl) to indicate a
1119"1" for the type of exception you want to include.
1120
1121Sometimes SNP data are provided in VCF format, and can be retrieved
1122like this:
1123
1124    ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/00-All.vcf.gz
1125
1126You can process VCF files using our program vcf_iit, like this:
1127
1128    gunzip -c 00-All.vcf.gz | vcf_iit [-v <version>] > <snpfile>.txt
1129
1130The VCF file contains multiple versions of dbSNP, so if you want a
1131particular version, such as 135, you would use the flag "-v 135".  The
1132vcf_iit program tries to pick a subset of SNPs that somewhat parallel
1133the ones without exceptions in the UCSC dbSNP file.  It keeps all SNPs
1134that have been validated (marked in the VCF file as "VLD") or have a
1135submitter link-out ("SLO").  Otherwise, it excludes SNPs that are
1136individual genotypes ("GNO").  If none of these conditions hold, then
1137the SNP is allowed.  These rules might not be the best ones; I made
1138them up by trying to compare version 135 of the VCF data with
1139version 135 of the UCSC dbSNP data.
1140
1141Likewise, SNP data are sometimes provided in GVF format, and these
1142files can be processed using our program gvf_iit, like this:
1143
1144    gunzip -c filename.gvf.gz | gvf_iit > <snpfile>.txt
1145
1146From my limited experience with GVF format, it appears that the file
1147contains variants from only one version of dbSNP.
1148
1149
1150Once you have built a <snpfile>.txt file using dbsnp_iit, vcf_iit, or
1151gvf_iit, you create an IIT file by doing this
1152
1153    cat <snpfile>.txt | iit_store -o <snpfile>
1154
1155which creates <snpfile>.iit.  If you wish to store this file for other
1156people to use, you can then put it in a central place:
1157
1158    cp <snpfile>.iit /path/to/gmapdb/<genome>/<genome>.maps
1159
1160Or you can keep it in your own directory for your own use.  Then you
1161need to create a reference space index and compressed genome by doing
1162
1163    snpindex -d <genome> [-V <snpdirectory>] -v <snpfile> <snpfile>.iit
1164
1165If you specify the [-V <snpdirectory>] option, then the resulting SNP
1166index files are placed in the given directory.  Otherwise, if you
1167don't specify this, then the SNP files are saved with the reference
1168index files.
1169
1170If your <snpfile>.iit file is already in a <genome>.maps subdirectory,
1171then you can simply run
1172
1173    snpindex -d <genome> [-V <snpdirectory>] -v <snpfile>
1174
1175and the program will find the indicated <snpfile>.iit file.
1176Additional options for snpindex can be found by doing "snpindex
1177--help".
1178
1179
1180Finally, you can perform SNP-tolerant alignment by doing
1181
1182    gsnap -d <genome> [-V <snpdirectory>] -v <snpfile> <shortreads>
1183
1184You can also retrieve snp information for genomic segments by running
1185get-genome with the -v and -f flags.  For more details, run
1186"get-genome --help".
1187
1188
118912.  Alignment of reads from bisulfite-treated DNA in GSNAP
1190===========================================================
1191
1192GSNAP has the ability to align reads from bisulfite-treated DNA, which
1193converts unmethylated cytosines to uracils that appear as thymines in
1194reads.  GSNAP is able to identify genomic-T to read-C mismatches, and
1195produces output like this:
1196
1197>CTACGTcGTAGACATATTGATTCAGAATTTGAAGTTTAGCTAGATCTTAc     1       Read
1198 C.ACG.tGTAGACATA.TGATTCAGAAT.TGAAGTTTAGCTAGA.C.TAg     1..50   sub:2   +9:1000000..1000049
1199
1200As with all GSNAP output, the first line is the query, and the ones
1201afterward represent genomic segments.  The "." symbol indicates an
1202unmethylated C in the genome that appears as a T in the read.
1203Mismatches are shown by lower case characters in the genomic segment.
1204In position 6, we see an example of a genomic-T that appears as a
1205read-C, representing a mismatch.  The last position also shows a
1206mismatch between a genomic-G and read-C.
1207
1208To process reads from bisulfite-treated DNA, you will first need to
1209create the necessary index files by doing
1210
1211    cmetindex -d <genome>
1212
1213Then, you can align the reads by doing
1214
1215    gsnap -d <genome> --mode=cmet-stranded, or
1216    gsnap -d <genome> --mode=cmet-nonstranded
1217
1218on your set of short reads.  The stranded flag works on data where
1219the laboratory allows only reads that go from 5' to 3' on each strand
1220of the genome, and excludes their reverse complement reads that would
1221be produced by PCR amplification.  The resulting reads should have a
1222preponderance of C's converted to T's.
1223
1224The non-stranded flag is for the laboratory protocol that allows both
1225the 5'-to-3' genomic reads on each strand, as well as their reverse
1226complements.  The reverse complemented reads will have the C-to-T
1227conversion show up as a G-to-A change.  The resulting reads should
1228have a preponderance of T's in half the reads, and a preponderance of
1229A's in half the reads.
1230
1231
1232
123313.  RNA-editing tolerance in GSNAP
1234===================================
1235
1236Just as GSNAP has a program cmetindex and a mode called "cmet" for
1237tolerance to C-to-T changes, it can be tolerant to A-to-G changes
1238using the program atoiindex and a mode called "atoi".  This mode is
1239designed to facilitate alignments that are tolerant to RNA editing
1240where A's are converted to I's, which appear as G's to a sequencer.
1241
1242To process reads under RNA-editing tolerance, you will first need to
1243create the necessary index files by doing
1244
1245    atoiindex -d <genome>
1246
1247Then, you can align the reads by doing
1248
1249    gsnap -d <genome> --mode=atoi-stranded, or
1250    gsnap -d <genome> --mode=atoi-nonstranded
1251
1252on your set of short reads.  As with bisulfite sequencing, the
1253stranded flag is for laboratory protocols that allow only the 5'-to-3'
1254RNA, or sense, reads, and the non-stranded flag is for laboratory
1255protocols that allow both sense and antisense reads.
1256
1257
125814.  Transcriptome-guided genomic alignment
1259===========================================
1260
1261Transcriptome-guided genomic alignment (TGGA) is a feature in GSNAP
1262that utilizes both a genome and a transcriptome.  In this alignment
1263mode, GSNAP aligns reads first against the set of transcripts.  This
1264alignment is generally gap-free, since splicing is not an issue.
1265Indels can still be present, but GSNAP knows how to identify them in
1266transcripts.  The advantage of aligning against a transcript is that
1267GSNAP can find seeds anywhere in the transcript, even if they span an
1268exon-exon junction.  In contrast, with standard genomic alignment, if
1269the end of a read is near a splice junction, there may not be enough
1270nucleotides on the distal end of the splice to find a seed.
1271Therefore, in such cases, GSNAP would have soft-clip the end of the
1272alignment.
1273
1274In TGGA, GSNAP converts alignments to transcripts to alignments in the
1275genome, using information about how the entire transcript maps to the
1276genome.  The ends of reads can be aligned more accurately, even if
1277only one or two nucleotides are on the distal end of a splice.  In
1278addition, gap-free alignment to transcripts is fast, so TGGA is many
1279times faster than regular genomic alignment.  A final advantage of
1280TGGA is that GSNAP reports the transcripts that are consistent with
1281the read, which can be useful for downstream analyses that measure
1282gene expression by transcript.  Therefore, TGGA serves a function
1283similar to that of alignment-free RNA-seq tools like Kallisto or
1284Salmon, but also provides the genomic alignment needed for the
1285analysis of splicing, mutations, and polymorphisms.
1286
1287To perform TGGA, in addition to the genome in FASTA format (suppose it
1288is called genome.fasta), you will need either a set of transcripts in
1289FASTA format (transcripts.fasta) or a set of transcript mappings to
1290the genome (transcripts.map).  The map format can be obtained from a
1291GFF3 or GTF file using the programs gff3_genes or gtf_genes included
1292with this software package, as follows:
1293
1294    gff3_genes <gff3_file> > transcripts.map
1295    gtf_genes <gtf_file> > transcripts.map
1296
1297The map file can also be generated by aligning the transcript FASTA
1298format against the genome using "gmap -f map_exons", but if you are
1299starting with a FASTA file, it is easier to let gmap_build perform
1300this process for you.
1301
1302To build the genome and transcriptome index files, there are two
1303cases:
1304
1305Case 1: The genome index has not yet been built.  In this case, you
1306can build both the genome and transcriptome indices with one command,
1307either
1308
1309    gmap_build [-D <directory>] -d <genome_name> -c <transcriptome_name> -T transcripts.fasta genome.fasta, or
1310
1311    gmap_build [-D <directory>] -d <genome_name> -c <transcriptome_name> -G transcripts.map genome.fasta
1312
1313depending on whether you have a FASTA or map file.
1314
1315Case 2: The genome index has already been built (as described in
1316section 3 above).  In this case, you do not need to build the genome
1317index again, just the transcriptome index, so you should omit the
1318genome FASTA file:
1319
1320    gmap_build [-D <directory>] -d <genome_name> -c <transcriptome_name> -T transcripts.fasta, or
1321
1322    gmap_build [-D <directory>] -d <genome_name> -c <transcriptome_name> -G transcripts.map
1323
1324The result of these procedures will be to create index files in one
1325directory for the genome, and in another directory for the
1326transcriptome.  In addition, there will be a subdirectory
1327<genome_name>.transcripts with files <transcriptome_name>.*, which
1328contain information about how transcripts map to the genome.
1329
1330After these indices are built, you can perform TGGA by doing
1331
1332    gsnap [-D <directory>] -d <genome_name> -c <transcriptome_name> -N 1 -A sam <fastq files...>
1333
1334In other words, the only addition is the flag "-c
1335<transcriptome_name>".  The "-N 1" flag is actually optional, because
1336the fact that you have a transcriptome tells GSNAP that your input
1337files are RNA-seq.
1338
1339
1340