1 20. Availability 3============ 4 5The source code for this package is available from 6http://research-pub.gene.com/gmap. License terms are provided in the 7LICENSE file. 8 9 101. Building and installing GMAP and GSNAP 11========================================== 12 13Prerequisites: a Unix system (including Cygwin on Windows), a C 14compiler, and Perl 15 16Step 1: Set your site-specific variables by editing the file 17config.site. In particular, you should set appropriate values for 18"prefix" and probably for "with_gmapdb", as explained in that file. 19If you are compiling this package on a Macintosh, you may need to edit 20CFLAGS to be 21 22CFLAGS = '-O3 -m64' 23 24since Macintosh machines will make only 32-bit executables by default. 25 26 27Step 2: Build, test, and install the programs, by running the 28following GNU commands 29 30 ./configure 31 make 32 make check (optional) 33 make install 34 35Note 1: Instead of editing the config.site file in step 1, you may type 36everything on the command line for the ./configure script in step 2, 37like this 38 39 ./configure --prefix=/your/usr/local/path --with-gmapdb=/path/to/gmapdb 40 41If you omit --with-gmapdb, it defaults to ${prefix}/share. If you 42omit --prefix, it defaults to /usr/local. Note that on the command 43line, it is "with-gmapdb" with a hyphen, but in a config.site file, 44it is "with_gmapdb" with an underscore. 45 46 47Note 2: If you want to keep your version of config.site or have 48multiple versions, you can save the file to a different filename, and 49then refer to it like this 50 51 ./configure CONFIG_SITE=<config site file> 52 53 54Note 3: GSNAP can read from gzip-compressed FASTA or FASTQ input 55files. This feature requires the zlib library to be present 56(available from http://www.zlib.net). The configure program will 57detect the availability of zlib automatically. However, to disable 58this feature, you can add "--disable-zlib" to the ./configure command 59or edit your config.site file to have the command "disable_zlib". 60 61 62Note 4: GSNAP can read from bzip2-compressed FASTA or FASTQ input 63files. This feature requires the bzlib library to be present. The 64configure program will detect the availability of bzlib automatically. 65However, to disable this feature, you can add "--disable-bzlib" to the 66./configure command or edit your config.site file to have the command 67"disable_bzlib". 68 69 702. Possible issues during compilation 71====================================== 72 73Recent versions of GMAP and GSNAP use certain techniques to achieve 74increased speed. One of these techniques is special SIMD 75(single-instruction, multiple data) instructions that can perform some 76computations in parallel. There are various levels of SIMD 77instructions, and we use those from the SSE2, SSE4.1, AVX2, and AVX512 78instruction sets, depending on the capabilities of the computer. Most 79computers built within the last 10 years should support SSE2 at least. 80 81However, you may run into the following issues: 82 83Issue 1: If you have a computer cluster with different types of SIMD 84capabilities, then you should compile the source code multiple times, 85one on each type of computer. If your computer has SSE2 capabilities, 86the ./configure and "make install" commands will create the programs 87gmap.sse2 and gsnap.sse2. Likewise, if your computer has AVX2 88capabilities, the ./configure and "make install" commands will create 89gmap.avx2 and gsnap.avx2. The gmap and gsnap programs are dispatch 90programs that determine at run time what capabilities the computer has 91and call the appropriate version of the program (e.g., gsnap.sse2 or 92gsnap.avx2). The ./configure and "make install" commands will also 93create versions of the programs, gmap.nosimd and gsnap.nosimd, that do 94not depend on SIMD. If the gmap and gsnap dispatch programs cannot 95find a version of the program that matches the given computer, then it 96will run the non-SIMD version instead. 97 98Issue 2. If you compile the code successfully, but when you run the 99program, you see something that says "Illegal instruction", then you 100must be running GMAP or GSNAP on a different computer than the one 101where you compiled it. The dispatch program should not allow this to 102happen, especially if you have compiled the code on a computer of each 103type in your cluster, so you can report this error to the authors and 104we can try to help. In the worst case, you can compile your program 105for the lowest common denominator by by providing the 106--simd-level=<level> flag to ./configure. 107 108 1093. Building a GMAP/GSNAP index (one chromosome per FASTA entry) 110============================================================================== 111 112A GMAP/GSNAP "index" is a set of genomic index files, representing the 113genome in a hash table format. You use the gmap_build program to 114build your own database from a FASTA file that contains the genome. 115The gmap and gsnap programs use the same index. 116 117Below I use the terms "genome" and "chromosome", but the input 118sequences can be anything you wish to align to, including transcripts 119or small fragments. 120 121A genome can be considered regular or large in size. A regular genome 122has a total sequence length of up to 2^32 = 4,294,967,296 (about 4 123billion) bp, while a large genome may contain up to 2^40 (about 1 124trillion) bp. However, each individual chromosome is still limited to 1252^32 bp. The gmap_build program will automatically recognize when a 126genome is larger than 2^32 bp and build the index files accordingly. 127 128To handle large genomes, the ./configure and "make install" procedures 129produce the programs gmapl and gsnapl. The gmapl and gsnapl programs 130work on large genomes, while the gmap and gsnap programs work on 131regular genomes. If you try to use the wrong program on a given 132database, the programs will notify you of the error and quit. 133 134You will need to start with a set of FASTA files containing either 135entire chromosomes or contigs that represent pieces of chromosomes. 136For example, for the human genome, you can retrieve all of the FASTA 137files under the ftp directory at 138 139 ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/ 140 141If your FASTA entries each contain a single chromosome, and the 142accession for each chromosome is the chromosome number/letter, you can 143simply run this command 144 145 gmap_build -d <genome> [-k <kmer size>] <fasta_files...> 146 147which will build and install the GMAP index files in your default 148GMAPDB location. 149 150You can see the full usage of gmap_build by doing "gmap_build --help", 151but here are some useful flags. If your FASTA files are compressed 152using gzip, you can add the flag "-g" to gmap_build. You can control 153the k-mer size for the genomic index with the -k flag, which can range 154from 12 to 15. The default value for -k is 15, but this requires your 155machine to have 4 GB of RAM to build the indices. If you do not have 1564 GB of RAM, then you will need to reduce the value of -k or find 157another machine. Here are the RAM requirements for building various 158indices: 159 160 k-mer of 12: 64 MB 161 k-mer of 13: 256 MB 162 k-mer of 14: 1 GB 163 k-mer of 15: 4 GB 164 165Note that the term 166 167 <fasta_files...> 168 169above indicates that multiple files can be listed. The files can be 170in any order, and the contigs can be in any order within these files. 171By default, the gmap_build process will sort the contigs and 172chromosomes into their appropriate "chrom" order. For the human 173genome, this order is 1, 2, ..., 10, 11, ..., 22, X, Y, M, followed by 174all other chromosomes in numeric/alphabetical order. If you don't 175want this sort, provide the "-s none" flag to gmap_build. Other sort 176options besides "none" and "chrom" are "alpha" and "numeric-alpha". 177 178You can type "gmap_build --help" to see the full set of options. We 179discuss some specific situations below. 180 181You can provide information to gmap_build that certain chromosomes are 182circular, with the -c or --circular flag. The value for these flags 183is a list of chromosomes, separated by commas. The gmap_build program 184will then allow GSNAP and GMAP to align reads across the ends of the 185chromosome. For example, the mitochondrial genome in human beings is 186circular. 187 188If your genome files are compressed using gzip, you can simply add the 189flag "-g" to gmap_build. But if your genome files require some other 190type of processing, you may need to write a small script that pipes 191the sequences in FASTA format to gmap_build. You can tell gmap_build 192about your script with the -E (or --fasta-pipe) flag, like this 193 194 gmap_build -d <genome> -E 'gunzip -c chr*.gz' 195 gmap_build -d <genome> -E 'cat *.fa | ./add-chromosomal-info.pl' 196 197You can think of the command as a Unix pipe for processing each FASTA 198file before it is read by gmap_build. 199 200 2014. Running GMAP 202================ 203 204To see the full set of options, type "gmap --help". The following are 205some common examples of usage. For more examples, see the document 206available at http://www.gene.com/share/gmap/paper/demo-slides.pdf 207 208For each of the examples below, we assume that you have installed a 209genome database called <genome> in your GMAPDB directory. (If your 210database is located elsewhere, you can specify the -D flag to gmap or 211set the environment variable GMAPDB to point to that directory.) 212 213* Mapping only: To map one or more cDNAs in a FASTA file onto a 214 genome, run GMAP as follows: 215 216 gmap -d <genome> <cdna_file> 217 218* Mapping and alignment: If you want to map the cDNAs to a genome, 219 and show the full alignment, provide the -A flag: 220 221 gmap -d <genome> -A <cdna_file> 222 223 224* Alignment only: To align one or more cDNAs in a FASTA file onto a 225 given genomic segment (also in a FASTA file), use the -g flag 226 instead of the -d flag: 227 228 gmap -g <genome_segment> -A <cdna_file> 229 230 231* Batch mode: If you have a large number of cDNAs to run, and you have 232 sufficient RAM to run in batch mode, add the "-B 3", "-B 4", or "-B 233 5" option. Details for these options are provided by running "gmap 234 --help". 235 236 gmap -d <genome> -B 5 -A <cdna_file> 237 238 239* Multithreaded mode: If your machine has several processors, you can 240 make batch mode run even faster by specifying multiple threads with 241 the -t flag: 242 243 gmap -d <genome> -B 5 -A -t <nthreads> <cdna_file> 244 245 Note that with multiple threads, the output results will appear in 246 random order, depending on which thread finishes its computation 247 first. If you wish your output to be in the same order as the 248 input cDNA file, add the '-O' (letter O, not the number 0) flag 249 to get ordered output. 250 251 Guidelines: The -t flag specifies the number of computational 252 threads. In addition, if your machine supports threads, GMAP also 253 uses one thread for reading the input query sequences, and one 254 thread for printing the output results. Therefore, the total number 255 of threads will be 2 plus the number you specify. The program will 256 work optimally if it uses one thread per available processor. If 257 you specify too many threads, you can cause your computer to thrash 258 and slow down. Note that other programs running on your computer 259 also need processors. 260 261 262* Compressed output: If you want to store the alignment results in a 263 compressed format, use the -Z flag. You can uncompress the results 264 by using the gmap_uncompress.pl program: 265 266 gmap -d <genome> -Z <cdna_file> > x 267 cat x | gmap_uncompress 268 269 270Note that gmap is written for small genomes (less than 2^32 bp in 271total length). With large genomes, there is an equivalent program 272called gmapl, which you should run instead of gmap. The gmapl program 273is equivalent to gmap, and is based on the same source code, but is 274compiled to use 64-bit index files instead of 32-bit files. The gmap 275and gmapl programs will detect whether the genomes are the correct 276size, and will exit if you try to run them on the wrong-sized genomes. 277 278 279 2805. Building map files 281====================== 282 283This package includes an implementation of interval index trees 284(IITs), which permits efficient lookup of interval information. The 285gmap program also allows you (with its -m flag) to look up pre-mapped 286annotation information that overlaps your query cDNA sequence. These 287interval index trees (or map files) are built using the iit_store 288program included in this package. To build a map file, do the 289following: 290 291Step 1: Put your map information for a given genome into a map file 292with the following FASTA-like format: 293 294 >label coords optional_tag 295 optional_annotation (which may be zero, one, or multiple lines) 296 297For example, the label may be an EST accession, with the coords 298representing its genomic position. Labels may be duplicated if 299necessary. 300 301The coords should be of the form 302 303 chr:position 304 chr:startposition..endposition 305 306The term "chr:position" is equivalent to "chr:position..position". If 307you want to indicate that the interval is on the minus strand or 308reverse direction, then <endposition> may be less than 309<startposition>. 310 311Tags are very general and can be used for a variety of purposes. For 312example, you could 313 314 315Step 2: Run iit_store on this map file as follows 316 317 cat <mapfile> | iit_store -o <mapname> 318 319The program will create a file called <mapname>.iit. 320 321Now you can retrieve this information with iit_get 322 323 iit_get <mapname>.iit <coords> 324 325where <coords> has the format "chr:position" or 326"chr:startposition..endposition". The iit_get program has other 327capabilities, including the ability to retrieve information by label, 328like this: 329 330 iit_get <mapname>.iit <label> 331 332More details can be found by running "iit_get --help". 333 334If you plan to use this map information frequently, you may want to 335place it with its corresponding genome for future use. In each 336GMAP/GSNAP database, there is a subdirectory for storing map files, 337like this 338 339 /path/to/gmapdb/<genome>/<genome>.maps/ 340 341(If you don't remember where your default gmap directory is, run "gmap 342--version" to find it.) If you put your <mapname>.iit file into this 343maps subdirectory, you can get additional functionality. First, you 344can run the program get-genome, which is used mainly for getting 345genome sequence, to get map information instead, like this 346 347 get-genome -d <genome> -m <mapname> <coords> 348 349Second, you can use GMAP with the -m flag to retrieve map information 350that corresponds to a given cDNA sequence like this 351 352 gmap -d <genome> -m <mapname> <cdna_file> 353 354Finally, GMAP and the IIT utilities support the GFF3 format. GMAP can 355generate its results in GFF3 format, and iit_store can parse GFF3 356files using its -G and -l flags. More details about iit_store can be 357found by doing "iit_store --help". 358 359 3606. Running GSNAP 361================= 362 363GSNAP uses the same database as GMAP does, so you will need to process 364the genome using gmap_build as explained above, if you haven't done 365that already. To see the full set of options for GSNAP, type "gsnap --help". 366 367Input to GSNAP should be either in FASTQ or FASTA format. The FASTQ 368input may include quality scores, which will then be included in SAM 369output, if that output format is selected. For single-end reads, the 370FASTQ file may be piped into GSNAP, or given as its command-line 371argument, like this 372 373 cat <fastq_file> | gsnap -d <genome> 374 375or 376 377 gsnap -d <genome> <fastq_file> 378 379 380For paired-end reads, the two corresponding FASTQ files should be 381given as command-line arguments in pairs, like this 382 383 gsnap -d <genome> <fastq_file_1> <fastq_file_2> [<fastq_file_3> <fastq_file_4>...] 384 385A pipe cannot work since GSNAP needs to access both FASTQ files in 386parallel. The reads in FASTQ files may have varying lengths, if 387desired. Note that GSNAP can process multiple sets of paired-end 388reads, by adding the files in pairs. If you want to provide multiple 389single-end files, you can either use "cat" to concatenate them into 390the stdin of gsnap, like this: 391 392 cat <fastq_file_1> [<fastq_file_2>...] | gsnap -d <genome> 393 394or you can provide them all on the command line with the 395--force-single-end flag, like this: 396 397 gsnap -d <genome> --force-single-end <fastq_file_1> [<fastq_file_2>...] 398 399which will process each FASTQ file one at a time as single-end reads, 400and not try to pair them up. 401 402GSNAP also has the ability to deal with files compressed with gzip, if 403the configure script at compile time can find a zlib library in your 404system (see Note 3 in the section above about building and installing 405GMAP and GSNAP). If so, and your files are gzipped, you can then read 406in gzipped files directly like this 407 408 gsnap --gunzip -d <genome> <fastq.gz>, or 409 gsnap --gunzip -d <genome> --force-single-end <fastq1.gz> [<fastq2.gz>...] 410 411for single-end reads, or 412 413 gsnap --gunzip -d <genome> <fastq_1.gz> <fastq_2.gz> [<fastq_3.gz> <fastq_4.gz>...] 414 415for paired-end reads. 416 417Likewise, GSNAP can handle files compressed with bzip2, if the 418configure script at compile time can find a bzlib library in your 419system (see Note 3 in the section above about building and installing 420GMAP and GSNAP). If so, and your files are bzip2-compressed, you can 421then read in those files directly like this 422 423 gsnap --bunzip2 -d <genome> <fastq.bz2>, or 424 gsnap --bunzip2 -d <genome> --force-single-end <fastq1.bz2> [<fastq2.bz2>...] 425 426for single-end reads, or 427 428 gsnap --bunzip2 -d <genome> <fastq_1.bz2> <fastq_2.bz2> [<fastq_1.bz2> <fastq_2.bz2>...] 429 430for paired-end reads. 431 432 433For FASTA format, you should include one line per read (or end of a 434paired-end read). The same FASTA file can have a mixture of 435single-end and paired-end reads of varying lengths, if desired. 436 437Single-end reads: 438 439Each FASTA entry should contain one short read per line, like this 440 441>Header information 442AAAACATTCTCCTCCGCATAAGCCTGCGTCAGATTA 443 444Each short read can have a different length. However, the entire read 445needs to be on a single line, and may not wrap around multiple lines. 446If it extends to a second line, GSNAP will think that the read is 447paired-end. 448 449 450Paired-end reads: 451 452Each FASTA entry should contain two short reads, one per line, like 453this 454 455>Header information 456AAAACATTCTCCTCCGCATAAGCCTAGTAGATTA 457GGCGTAGGTAGAAGTAGAGGTTAAGGCGCGTCAG 458 459By default, the program assumes that the second end is in the reverse 460complement direction compared with the first end. If they are in the 461same direction, you may need to use the --circular-input (or -c) flag. 462 463GSNAP and GMAP can also read an extended FASTA format that include 464quality scores, which look like this 465 466 >Header information 467 AAAACATTCTCCTCCGCATAAGCCTGCGTCAGATTA 468 + 469 <quality scores> 470 471for single-end reads, or 472 473 <Header information 474 AAAACATTCTCCTCCGCATAAGCCTGCGTCAGATTA 475 + 476 <quality scores> 477 478for the second-end of a paired-end read. In addition, GSNAP can read 479an extended FASTA format for paired-end reads, like this: 480 481 >Header information 482 AAAACATTCTCCTCCGCATAAGCCTAGTAGATTA 483 GGCGTAGGTAGAAGTAGAGGTTAAGGCGCGTCAG 484 + 485 <quality scores 1> 486 <quality scores 2> 487 488This extended FASTA format is useful if paired-end information needs 489to be piped into GSNAP via stdin. 490 491 492As with gmap, gsnap is written for small genomes (less than 2^32 bp in 493total length). With large genomes, there is an equivalent program 494called gsnapl, which you should run instead of gsnap. The gsnapl 495program is equivalent to gmap, and is based on the same source code, 496but is compiled to use 64-bit index files instead of 32-bit files. 497The gsnap and gsnapl programs will detect whether the genomes are the 498correct size, and will exit if you try to run them on the wrong-sized 499genomes. 500 501 5027. Detecting known and novel splice sites in GSNAP 503==================================================== 504 505Whereas GMAP assumes RNA-seq input by default, GSNAP assumes DNA-seq 506input by default. To tell GSNAP to find splice junctions in 507individual reads, you need to provide either the "--novelsplicing=1" 508(or "-N 1") flag and/or the "-s <splicing_file>" flag to GSNAP. 509 510GSNAP can detect splices using a probabilistic model using the 511--novelsplicing (or -N) flag. You can also detect splices from a set 512of splice sites that you provide, using the --splicesites (or -s) 513flag. You may specify both flags, which will report splice junctions 514involving both known and novel sites. 515 516Output for a splicing junction will look like this: 517 518>TCCGTGACGTGGATTGGTGCTGCACCCCTCATC 1 Header 519 TCCGTGACGTGGATTGgt--------------- 1..16 +19:56050054..56050069 start:0..donor:0.99,splice_dist:1238,dir:sense,sub:0+0=0,label:NM_001648.KLK3.exon1/5|NM_001030047.KLK3.exon1/5|NM_001030048.KLK3.exon1/5|NM_001030049.KLK3.exon1/6|NM_001030050.KLK3.exon1/2 520,--------------agGTGCTGCACCCCTCATC 17..33 +19:56051308..56051324 acceptor:0.99..end:0,dir:sense,sub:0+0=0,label:NM_001648.KLK3.exon2/5|NM_001030047.KLK3.exon2/5|NM_001030048.KLK3.exon2/5|NM_001030049.KLK3.exon2/6|NM_001030050.KLK3.exon2/2 521 522After the "donor:" or "acceptor:" splice site type, the model score 523probability is given, even if the splice site is known. For known 524splice sites, the "label:" field will provide information about the 525site. If there is more than one known splice site at a genomic 526position, the labels are separated by a "|" delimiter. 527 528There are several advantages to specifying a database of known splice 529sites. First, GSNAP will then be able to detect splicing involving 530atypical splice sites, that would otherwise give low scores using its 531probabilistic model. A known splice site is treated as if its model 532probability is 1.0. Second, GSNAP can find splicing involving short 533exons. Such cases have a single end aligning to three exons, 534separated by two introns. Third, GSNAP can identify splicing at the 535ends of reads with greater sensitivity, even if they have short 536overlaps onto the next exon. Fourth, GSNAP can detect known long 537splices, because expected splice lengths can be included with the 538splice site information. 539 540GSNAP allows for known splicing at two levels: at the level of known 541splice sites and at the level of known introns. At the site level, 542GSNAP finds splicing between arbitrary combinations of donor and 543acceptor splice sites, meaning that it can find alternative splicing 544events. At the intron level, GSNAP finds splicing only between the 545set of given donor-acceptor pairs, so it is constrained not to find 546alternative splicing events, only introns included in the given list. 547For most purposes, I would recommend using known splice sites, rather 548than known introns, unless you are certain that all alternative 549splicing events are known are represented in your file. 550 551GSNAP can tell the difference between known site-level and known 552intron-level splicing based on the format of the input file. To 553perform known site-level splicing, you will need to create a file with 554the following format: 555 556>NM_004448.ERBB2.exon1 17:35110090..35110091 donor 6678 557>NM_004448.ERBB2.exon2 17:35116768..35116769 acceptor 6678 558>NM_004448.ERBB2.exon2 17:35116920..35116921 donor 1179 559>NM_004448.ERBB2.exon3 17:35118099..35118100 acceptor 1179 560>NM_004449.ERG.exon1 21:38955452..38955451 donor 783 561>NM_004449.ERG.exon2 21:38878740..38878739 acceptor 783 562>NM_004449.ERG.exon2 21:38878638..38878637 donor 360 563>NM_004449.ERG.exon3 21:38869542..38869541 acceptor 360 564 565Each line must start with a ">" character, then be followed by an 566identifier, which may have duplicates and can have any format, with 567the gene name or exon number shown here only as a suggestion. Then 568there should be the chromosomal coordinates which straddle the 569exon-intron boundary, so one coordinate is on the exon and one is on 570the intron. (Coordinates are all 1-based, so the first character of a 571chromosome is number 1.) Finally, there should be the splice type: 572"donor" or "acceptor". You may optionally store the intron distance 573at the end. GSNAP can use this intron distance, if it is longer than 574its value for --localsplicedist, to look for long introns at that 575splice site. The same splice site may have different intron distances 576in the database; GSNAP will use the longest intron distance reported 577in searching for long introns. 578 579Note that the chromosomal coordinates are in the sense direction. 580Therefore, genes on the plus strand of the genome (like NM_004448) have 581the coordinates in ascending order (e.g., 35110090..35110091). 582Genes on the minus strand of the genome (like NM_004449) have the 583coordinates in descending order (e.g., 38955452..38955451). 584 585On the other hand, to perform known intron-level splicing, you will need 586to create a file with the following format: 587 588>NM_004448.ERBB2.intron1 17:35110090..35116769 589>NM_004448.ERBB2.intron2 17:35116920..35118100 590>NM_004449.ERG.intron1 21:38955452..38878739 591>NM_004449.ERG.intron2 21:38878638..38869541 592 593Again, coordinates are 1-based, and specify the exon coordinates 594surrounding the intron, with the first coordinate being from the donor 595exon and the second one being from the acceptor exon. 596 597 598There are several ways to help you generate these files. First, if 599you have a GTF file, you can use the included programs gtf_splicesites 600and gtf_introns like this: 601 602 cat <gtf file> | gtf_splicesites > foo.splicesites 603 cat <gtf file> | gtf_introns > foo.introns 604 605Second, if you retrieve an alignment tracks from UCSC, like this: 606 607 ftp://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/refGene.txt.gz 608 609if you are aligning to genome hg18, or 610 611 ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz 612 613if you are aligning to genome hg19, you can process this track using 614the included program psl_splicesites or psl_introns, like this: 615 616 gunzip -c refGene.txt.gz | psl_splicesites -s 1 > foo.splicesites 617 gunzip -c refGene.txt.gz | psl_introns -s 1 > foo.introns 618 619Note that alignment tracks in UCSC sometimes have an extra column on 620the left. The "-s" flag allows you to indicate how many columns 621should be skipped. 622 623Once you have built this splicesites or introns file, you process it 624as a map file (see "Building map files" above), by doing 625 626 cat foo.splicesites | iit_store -o <splicesitesfile>, or 627 cat foo.introns | iit_store -o <intronsfile> 628 629If you want to include more than one track, you can do this: 630 631 gunzip -c refGene.txt.gz | psl_splicesites -s 1 > foo 632 gunzip -c knownGene.txt.gz | psl_splicesites > bar 633 cat foo bar | iit_store -o <splicesitesfile> 634 635 636A third way to build a known splicesites or known introns file is 637useful if you have cDNA sequences rather than an alignment track, or 638if you do not trust the alignment track and prefer to use cDNA 639sequences. GMAP has an option "-f splicesites" that finds splice 640sites in cDNA sequences and reports them in the correct splicesite 641format. Likewise, GMAP can build an intron file, with the option "-f 642introns". 643 644When processing known cDNA sequences, you should also run GMAP with 645the "-n 1" flag, so you get the best alignment, and with the "-z 646sense_force" or "-z sense_filter" flag. The sense_force option will 647help GMAP know that the introns in your cDNA sequences are in the 648correct GT-AG sense, and is applicable when you have a high quality 649set of cDNA sequences. The sense_filter option will allow GMAP to try 650either sense or antisense, and to filter out sequences that appear to 651be antisense; this is applicable if you are uncertain about the 652validity of your cDNA sequences. 653 654Again once you have built either a known splicesites or known introns 655file, you process it as a map file by doing 656 657 cat <file> | iit_store -o <splicesitesfile>, or 658 cat <file> | iit_store -o <intronsfile> 659 660which creates <splicesitesfile>.iit or <intronsfile>.iit. 661 662 663Regardless of how you built <splicesitesfile>.iit or <intronsfile>.iit, 664you put it in the maps subdirectory by doing 665 666 cp <splicesitesfile>.iit /path/to/gmapdb/<genome>/<genome>.maps, or 667 cp <intronsfile>.iit /path/to/gmapdb/<genome>/<genome>.maps 668 669Then, you may use the file by doing this: 670 671 gsnap -d <genome> -s <splicesitesfile> <shortreads>, or 672 gsnap -d <genome> -s <intronsfile> <shortreads>, or 673 674 6758. SAM output format 676===================== 677 678GSNAP can generate SAM output format, by providing the "-A sam" flag 679to GSNAP. In addition, GMAP can also print its alignments in SAM 680output, using the "-f samse" or "-f sampe" options, for single-end or 681paired-end data. The sampe option will generate SAM flags to indicate 682whether the read is the first or second end of a pair, which requires 683that you provide GMAP with an extended FASTA format having a ">" or 684"<" character in the header to indicate that information. However, 685the sampe option will change only the SAM flags, and not change the 686underlying alignment algorithm. GMAP does not know how to find 687concordance between paired-end reads like GSNAP does. 688 689GSNAP provides some special SAM flags as follows: 690 691XQ: A non-normalized mapping quality score 692 693X2: The second best XQ score among all multimapping alignments for a 694given read. If there is only a single alignment, this value is 0. 695 696XO: Output type. GSNAP categorizes its alignments into output types, 697as follows. Note that the --split-output option will create separate 698output files for each output type. Alternatively, if you use 699sam_sort, you should provide --split-output to that program instead 700and achieve the same functionality. (The reason for this is that 701there may be situations where GSNAP assigns different output types to 702the first and second ends of the reads and sam_sort needs to see 703alignments from both ends together.) In either case, the output types 704have the following meanings and filename suffixes: 705 706 NM (nomapping) (filename suffix ".nomapping"): The entire read 707 (single-end or paired-end) could not be aligned. If the 708 --failed-input=FILENAME flag is specified, then these reads are also 709 printed in FASTQ or FASTA format (depending on the input format) in 710 the given file (plus a .1 or .2 ending for the first and second ends 711 of a paired-end read). 712 713 CU (concordant unique) (filename suffix ".concordant_uniq"): Both 714 ends of a paired-end read could be aligned concordantly to a single 715 position in the genome. For a definition of concordance, see the 716 section "Output types" below. 717 718 CM (concordant multiple) (filename suffix ".concordant_mult"): Both 719 ends of a paired-end read could be aligned concordantly, but to more 720 than one position in the genome. 721 722 CX (concordant multiple excess) (filename suffix 723 ".concordant_mult_xs"): Multiple concordant alignments, but user 724 specified --quiet-if-excessive and the number of alignments exceeds 725 "-n" threshold. If the --failed-input option is given, these reads 726 are also printed in FASTA or FASTQ format in the given file. 727 728 CT (concordant translocation) (filename suffix 729 ".concordant_transloc"): Both ends of a paired-end read could be 730 aligned concordantly, but one end requires a split alignment to a 731 distant location, such as another chromosome, or a different strand 732 on that chromosome, or a far distance on that strand. Note that 733 translocation alignments need to be printed on two separate SAM 734 lines. 735 736 CC (concordant circular) (filename suffix ".concordant_circular"): 737 Both ends of a paired-end read could be aligned concordantly, but 738 one or both ends require an alignment that goes around the origin of 739 a circular chromosome. Circular chromosomes are specified in the 740 gmap_build step by using the -c or --circular flag, as described 741 previously. Note that circular alignments need to be printed on two 742 separate SAM lines. 743 744 PI (paired unique inversion) (filename suffix ".paired_uniq_inv"): 745 Both ends of a paired-end read could be aligned uniquely, but in a 746 way that indicates that a genomic inversion has occurred between the 747 two ends. 748 749 PS (paired unique scramble) (filename suffix ".paired_uniq_scr"): 750 Both ends of a paired-end read could be aligned uniquely, but in a 751 way that indicates that the genomic order is scrambled. This 752 typically occurs because of tandem duplications. 753 754 PL (paired unique long) (filename suffix ".paired_uniq_long"): Both 755 ends of a paired-end read could be aligned uniquely, but in a way 756 that indicates that a large genomic deletion has occurred between 757 the two ends. 758 759 PC (paired unique circular) (filename suffix 760 ".paired_uniq_circular"): Both ends of a paired-end read could be 761 aligned uniquely, but not concordantly, representing an inversion, 762 scramble, or deletion. In addition, one or both ends of the read 763 goes around the origin of a circular chromosome. 764 765 PM (paired multiple) (filename suffix ".paired_mult"): Both 766 ends of a paired-end read could be aligned near each other, 767 representing an inversion, scramble, or deletion, but there are 768 multiple places in the genome where an alignment is found. 769 770 PX (paired multiple excess) (filename suffix ".paired_mult_xs"): 771 Multiple paired alignments, but user specified --quiet-if-excessive 772 and the number of alignments exceeds the "-n" threshold. If the 773 --failed-input option is given, these reads are also printed in 774 FASTA or FASTQ format in the given file. 775 776 HU, HM, HT, HC (halfmapping unique, halfmapping multiple, 777 halfmapping translocation, and halfmapping circular, respectively) 778 (filename suffixes: ".halfmapping_uniq", ".halfmapping_mult", 779 ".halfmapping_transloc", ".halfmapping_circular): Same as for the 780 concordant output types, except that only one end of the paired-end 781 read could be aligned, and the other end could not be aligned 782 anywhere in the genome. 783 784 HX (halfmapping multiple excess) (filename suffix 785 ".halfmapping_mult_xs"): Multiple halfmapping alignments, but user 786 specified --quiet-if-excessive and the number of alignments exceeds 787 the "-n" threshold. If the --failed-input option is given, these 788 reads are also printed in FASTA or FASTQ format in the given file. 789 790 UU, UM, UT, UC (unpaired unique, unpaired multiple, unpaired 791 translocation, and unpaired circular, respectively) (filename 792 suffixes: ".unpaired_uniq", ".unpaired_mult", ".unpaired_transloc", 793 ".unpaired_circular): Same as for the concordant output types, 794 except that the two ends could not be aligned concordantly or even 795 paired. These "unpaired" categories are also used for single-end 796 reads, since they lack a mate end that can allow for concordance, 797 pairing, or halfmapping. 798 799 UX (unpaired multiple excess) (filename suffix ".unpaired_mult_xs"): 800 Multiple unpaired alignments, but user specified 801 --quiet-if-excessive and the number of alignments on one end exceeds 802 the "-n" threshold. If the --failed-input option is given, these 803 reads are also printed in FASTA or FASTQ format in the given file. 804 805 806XB: Prints the barcode extracted from the end of the read. Applies only 807if --barcode-length is not 0. 808 809XP: Prints the primer inferred from a paired-end read. Applies only 810if the --adapter-strip flag is specified. 811 812XH: Prints the part of the query sequence that was hard-clipped. 813Sequence is printed in plus-genomic order and replaces the "H" part of 814the CIGAR string. 815 816XI: Prints the part of the quality string that was hard-clipped. 817Sequence is printed in plus-genomic order and replaces the "H" part of 818the CIGAR string. 819 820XS: Prints the strand orientation (+ or -) for a splice. Appears only 821if splicing is allowed (-N or -s flag provided), and only for reads 822containing a splice. The value "+" means the expected GT-AG, GC-AG, 823or AT-AC dinucleotide pair is on the plus strand of the genome, and 824"-" means the dinucleotides are on the minus strand. If the 825orientation is not obvious, because the dinucleotides do not match 826GT-AG, GC-AG, AT-AC, or their complements, then the program applies a 827probabilistic splice model to determine the orientation. If the 828splice sites have low probability, then the program may not be able to 829determine an orientation, and the result will be printed as XS:A:?. 830To prevent this flag, which cannot be handled by such programs (such 831as Cufflinks), use the --force-xs-dir flag. However, this flag will 832merely change occurrences of XS:A:? arbitrarily to XS:A:+. 833 834XA: Indicates an ambiguous splice. If GSNAP finds two or more 835possible splices at a given position, it will try to resolve the 836ambiguity if possible based on the other end of the paired-end read. 837If the ambiguity cannot be resolved, GSNAP will not report any of the 838splices, but will report a soft clip instead. The XA field indicates 839which end or ends are ambiguous and the number of matches found on 840each ambiguous end, based on the output XA:Z:i,j. If i or j is 841greater than 0, that indicates that the lower or higher chromosomal 842end is ambiguous, respectively. The value given indicates the number 843of matches found in the ambiguous end. This number may be smaller 844than the number of bases soft-clipped, due to mismatches. 845 846XC: Indicates whether the alignment crosses over the origin of a 847circular chromosome. If so, the string XC:A:+ is printed. 848 849XT: Prints the intron dinucleotides and splice probabilities around a 850distant splicing event (genomic deletion, inversion, scramble, or 851translocation). 852 853XW and XV: Printed only when SNP-tolerant alignment is enabled. XW 854provides the number of mismatches against both the reference and 855alternate alleles (or the "World" population). Therefore, these are 856true mismatches. XV provides the number of positions that are 857mismatches against the reference genome, but do match the alternate 858genome. Therefore, these are known variant positions. The sum of XW 859and XV provides the number of differences relative to the reference 860genome, and with the exception of indels, should equal the value of 861NM. 862 863XG: Indicates which method within GSNAP generated the alignment. B: 864GMAP alignment produced from suffix array, M: GMAP alignment produced 865from GSNAP hash table method, T: terminal alignment, O: merging of 866overlaps. Absence of XG flag indicates the standard GSNAP hash table 867method. (Note: older versions of GSNAP used "PG:", but some 868downstream software required all PG methods to be listed in the header 869section, so we changed the field name to "XG:") 870 871 8729. GSNAP output format 873======================== 874 875By default, GSNAP prints its output in a FASTA-like format, which we 876developed before we incorporated the SAM format. The default GSNAP 877output has some advantages over SAM output, especially for debugging 878purposes. However, we routinely use SAM output for our own pipeline, 879and it has been subject to the most testing by ourselves and by outside 880users. 881 882Here is some output from GSNAP on a paired-end read: 883 884>GGACTGCGCACCGAACGGCAGCGACTTCCCGTAGTAGCGGTGCTCCGCGAAGACCAGTAGAGCCCCCCGCTCGGCC 1 concordant ILLUMINA-A1CCE9_0004:1:1:1510:2090#0 885 GGACTGCGCACCGAACGGCAGCGACTTCCCGTAGTAGCGct----------------------------------- 1..39 +9:139128263..139128301 start:0..acceptor:0.99,dir:antisense,splice_dist:214,sub:0+0=0,label:NM_013379.DPP7.exon4/13 segs:2,align_score:2 pair_score:5,pair_length:112 886,-------------------------------------acGTGCTCCGCGAAGACCAGTAGAGCCCCCCGCTCGGCC 40..76 +9:139128516..139128552 donor:0.96..end:0,dir:antisense,splice_dist:214,sub:0+0=0,label:NM_013379.DPP7.exon3/13 887 888<CTTCGCCAACAACTCGGGCTTCGTCGCGGAGCTGGCGGCCGAGCGGGGGGCTCTACTGGTCTTCGCGGAGCACCGC 1 concordant ILLUMINA-A1CCE9_0004:1:1:1510:2090#0 889 CTTCGCCAACAACTCGGcCTTCGTCGCGGAGCTGGCGGCCGAGCGGGGGGCTCTACTGGTCTTCGCGGAGCACgtg 1..73 -9:139128588..139128516 start:0..end:3,sub:3+1=4 segs:1,align_score:3 pair_score:5,pair_length:112 890 891Each end of a read gets its own block, with the first read starting 892with ">" and the second read for paired-end reads starting with "<". 893The block starts with a header line that has in column 1, the query 894sequence in its original direction (and with lower-case preserved if 895any); in column 2, the number of hits for that query and if the read 896is paired-end, the relationship between the ends (as discussed in the 897next paragraph); and in column 3, the accession number for the query. 898 899 90010. Output types 901================= 902 903The two ends of a paired-end read can have the following 904relationships: "concordant", "paired", or "unpaired". A paired-end 905read is concordant if the ends align to the same chromosome, in the 906expected relative orientations, and having an inferred insert length 907greater than zero and within the "--pairmax" parameter. The inferred 908insert length is the distance from the end of the first-end alignment 909to the start of the second-end alignment, plus the read lengths of the 910two ends. There may be more than one concordant alignment for a given 911read, and if so, the alignments for each end are reported in 912corresponding order. 913 914If a concordant relationship cannot be found, then the program will 915report any paired relationships it can find. A paired alignment 916occurs when the two ends align to the same chromosome, but fail some 917criterion for concordance. There are different subtypes of paired 918alignments, depending on which criterion is violated. If the 919orientations are opposite what is expected, the paired subtype is 920"inversion". If they are in the expected orientation, but the 921distance is greater than the "--pairmax" parameter, then the paired 922subtype is "toolong". If they are in the expected orientation, but 923the inferred insert length appears to be negative, then the paired 924subtype is "scramble". In GSNAP output, a paired subtype is shown in 925a label called "pairtype", which can have the values 926"pairtype:inversion", "pairtype:toolong", and "pairtype:scramble". 927 928Otherwise, if neither a concordant nor paired alignment can be found, 929then the program will align each end separately, and report the 930relationship as being "unpaired". 931 932GSNAP can find translocation splices within a single end of a read, 933but it tries to be conservative about reporting them. If there is any 934alignment that does not involve such a translocation, then it will not 935report the translocation. It therefore reports translocation splices 936only when no other alignment is found within the concordant, paired, 937or unpaired categories. Therefore, such results are listed in the 938header as having "(transloc)" appear after the "concordant", "paired", 939or "unpaired" result type. 940 941After the query line, each of the genomic hits is shown, up to the 942'-n' parameter. If too many hits were found (more than the '-n' 943parameter), the behavior depends on whether the "--quiet-if-excessive" 944flag is given to GSNAP. If not, then the first n hits will be printed 945and the rest will not be printed. If the "--quiet-if-excessive" flag 946is given to GSNAP, then no hits will be printed if the number exceeds 947n. 948 949Each of the genomic hits contains one or more alignment segments, 950which is a region of continuous one-to-one correspondence (matches or 951mismatches) between the query and the genome. Multiple segments occur 952when the alignment contains an insertion, deletion, or splice. The 953first segment is marked by a space (" ") at the beginning of the line, 954while the second and following segments are marked by a comma (",") at 955the beginning of the line. (In the current implementation of GSNAP 956that allows only a single indel or splice, the number of segments is 957at most two.) 958 959The segments contain information in tab-delimited columns as follows: 960 961Column 1: Genomic sequence with matches in capital letters, mismatches 962in lower-case letters, and regions outside the segment with dashes. 963For deletions in the query, the deleted genomic sequence is also 964included in lower case. For spliced reads, the two dinucleotides at 965the intron ends are included in lower case. 966 967Column 2: Range of query sequence aligned in the segment. Coordinates 968are inclusive, with the first nucleotide considered to be position 1. 969 970Column 3: Range of genomic segment aligned, again with inclusive 971coordinates, with the first nucleotide in each chromosome considered 972to be position 1. Plus and minus strands are marked with a "+" or "-" 973sign. 974 975Column 4: Segment information, delimited by commas. The first item 976reports on the ends of the segment, which can be of type "start", 977"end", "ins", "del", "donor", "acceptor", or "term". After "start" 978and "end", we report the number of nucleotides clipped or trimmed from 979the segment. In our example above, "end:3" means that 3 nucleotides 980should be trimmed from the end. Trimming finds a local maximum of 981matches to mismatches from the end and is computed only if the "-T" 982flag is specified, and the value for "-T" limits the amount of 983trimming allowed. After "ins" and "del", we report the number of 984nucleotides that were inserted or deleted in the query relative to the 985genome. After "donor" or "acceptor", we report the probability of the 986splice site, based on the MaxEnt model. The "term" label indicate a 987terminal segment, where the entire read could not be aligned, but more 988than half of the read could be aligned from either end. 989 990Each segment will also show after the "sub" tag, he number of 991mismatches in that segment including the part that is trimmed, if any. 992If SNP-tolerant alignment is chosen, then the number of SNPs is also 993shown (see details below under SNP-tolerant alignment). Other 994information may also be included with the segment information, such as 995the orientation and distance of the splice or known splice labels, if 996appropriate flags and information are given to GSNAP. Splices are 997marked with a splice_type, which can be "consistent", "inversion", 998"scramble", or "translocation". A "translocation" splice includes 999splices on the same chromosome where the splice distance exceeds the 1000parameter for localsplicedist. 1001 1002Column 5: Alignment or hit information, delimited by commas. For the 1003first segment in a hit (the one starting with a space), this column 1004provides the number of segments (denoted by "segs:") and the score of 1005the alignment (denoted by "align_score:"). 1006 1007Column 6: Pair information (for paired-end reads only). For the first 1008segment in a hit (with the same information repeated on both ends of a 1009concordant pair), this column provides the score of the pair (which is 1010the sum of the alignment scores) and the inferred length of the insert 1011(ignoring splices within each segment, but not between segments). 1012 1013 101411. SNP-tolerant alignment in GSNAP 1015==================================== 1016 1017GSNAP has the ability to align to a reference space of all possible 1018major and minor alleles in a set of known SNPs provided by the user. 1019This ability is provided by the -v flag, and produces output like 1020this: 1021 1022>ATGGTAATCCTGCTCAGTACGAGAGGAACCGCAGGA 2 Header 1023 ATGGTAATCCTGCTCAGTACGAGAGGAACCGCAGGt 1..36 -12:34263937..34263902 start:0..end:0,sub:1+0=1 1024 ATGGTAATCCTGCTCAGTAGGAGAGGAACCCCAGGt 1..36 -21:14379300..14379265 start:0..end:0,sub:1+2=3 1025 1026The notation "sub:1+0=1" indicates that the SNP-tolerant alignment has 1027one mismatch ("sub:1") and zero minor SNP alleles ("+0"), for a total 1028of one differences ("=1") relative to the reference genome with all 1029major alleles. Likewise, the notation "sub:1+2=3" indicates one 1030SNP-tolerant mismatch, two minor SNP alleles, and 3 mismatches 1031relative to the reference sequence with all major alleles. 1032 1033Note that by default, GSNAP shows only differences relative to both 1034the reference and alternate genomes in lower case. If you want to 1035show differences relative to the reference genome in lower case, you 1036will need to provide the flag --show-refdiff, which would give the 1037output above as: 1038 1039>ATGGTAATCCTGCTCAGTACGAGAGGAACCGCAGGA 2 Header 1040 ATGGTAATCCTGCTCAGTACGAGAGGAACCGCAGGt 1..36 -12:34263937..34263902 start:0..end:0,sub:1+0=1 1041 ATGGTAATCCTGCTCAGTAgGAGAGGAACCcCAGGt 1..36 -21:14379300..14379265 start:0..end:0,sub:1+2=3 1042 1043For SAM output format, all differences from the reference sequence are 1044shown in the NM and MD fields, although the alignment scoring and 1045mapping qualities are based on a SNP-tolerant alignment. 1046 1047Before deciding to use SNP-tolerant alignment, please note the 1048following. First, SNP-tolerant alignment requires more memory than 1049standard alignment, perhaps twice as much. Also, with longer reads 1050now of 75 or more bp, GSNAP alignments are generally fine without 1051SNP-tolerant alignment. Finally, if your SNPs are incorrect (and many 1052in the databases are), then they can lead SNP-tolerant alignment to 1053favor the faulty SNPs. 1054 1055If you still wish to use SNP-tolerant alignment, you will need to 1056specify a set of known SNPs by creating a file with the following 1057format: 1058 1059>rs62211261 21:14379270 CG 1060>rs62211262 21:14379281 CG 1061 1062Each line must start with a ">" character, then be followed by an 1063identifier (which may have duplicates). Then there should be the 1064chromosomal coordinate of the SNP. (Coordinates are all 1-based, so 1065the first character of a chromosome is number 1.) Finally, there 1066should be the two possible alleles. (Previous versions required that 1067these be in alphabetical order: "AC", "AG", "AT", "CG", "CT", or "GT", 1068but that is no longer a requirement.) These alleles must correspond 1069to the possible nucleotides on the plus strand of the genome. If the 1070one of these two letters does not match the allele in the reference 1071sequence, that SNP will be ignored in subsequent processing as a 1072probable error. 1073 1074GSNAP also supports the idea of a wildcard SNP. A wildcard SNP allows 1075all nucleotides to match at that position, not just a given reference 1076and alternate allele. It is essentially as if an "N" were recorded at 1077that genomic location, although the index files still keep track of 1078the reference allele. To indicate that a position has a wildcard SNP, 1079you can indicate the genotype as "WN", where "W" is the reference 1080allele. Another indication of a wildcard SNP is to provide two 1081separate lines at that position with the genotypes "WX" and "WY", 1082where "W" is the reference allele and "X" and "Y" are two different 1083alternate alleles. 1084 1085To help you generate the file, this package includes programs that can 1086process known SNP data, from various formats, including dbSNP files, 1087VCF format, or GVF format. The dbSNP files can be obtained from UCSC, 1088like this 1089 1090 ftp://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/snp130.txt.gz 1091 1092For versions before snp132, you may also want to exclude exceptions, 1093which will require this file: 1094 1095 ftp://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/snp130Exceptions.txt.gz 1096 1097You can then process these files using our program dbsnp_iit, like 1098this: 1099 1100 gunzip -c snp130.txt.gz | dbsnp_iit [-w <weight>] [-e snp130Exceptions.txt.gz] > <snpfile>.txt, or 1101 gunzip -c snp130.txt.gz | dbsnp_iit [-w <weight>] [-e snp130Exceptions.txt] > <snpfile>.txt 1102 1103For versions starting with snp132, the exceptions are included in 1104column 19 of the snp file, so the -e flag is not necessary. The 1105dbsnp_iit program will read the exceptions from that column. 1106 1107The option "-w <weight>" makes use of the dbSNP item weight, a value 1108from 1 to 3, where lower weight means higher confidence. Items will 1109be included if the item weight is the given value <weight> or less. 1110The default value of -w is 1, which is the criterion UCSC uses to 1111build its ambiguous version of the genome. To allow all item weights, 1112specify "-w 3". 1113 1114SNPs have various types of exceptions, as provided either by the "-e 1115<exceptions_file>" flag, or starting with snp132, in the snp file 1116itself. By default, dbsnp_iit will exclude all types of exceptions. 1117However, if you want to include some types of exceptions, you will 1118need to modify the dbsnp_iit program (written in Perl) to indicate a 1119"1" for the type of exception you want to include. 1120 1121Sometimes SNP data are provided in VCF format, and can be retrieved 1122like this: 1123 1124 ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/00-All.vcf.gz 1125 1126You can process VCF files using our program vcf_iit, like this: 1127 1128 gunzip -c 00-All.vcf.gz | vcf_iit [-v <version>] > <snpfile>.txt 1129 1130The VCF file contains multiple versions of dbSNP, so if you want a 1131particular version, such as 135, you would use the flag "-v 135". The 1132vcf_iit program tries to pick a subset of SNPs that somewhat parallel 1133the ones without exceptions in the UCSC dbSNP file. It keeps all SNPs 1134that have been validated (marked in the VCF file as "VLD") or have a 1135submitter link-out ("SLO"). Otherwise, it excludes SNPs that are 1136individual genotypes ("GNO"). If none of these conditions hold, then 1137the SNP is allowed. These rules might not be the best ones; I made 1138them up by trying to compare version 135 of the VCF data with 1139version 135 of the UCSC dbSNP data. 1140 1141Likewise, SNP data are sometimes provided in GVF format, and these 1142files can be processed using our program gvf_iit, like this: 1143 1144 gunzip -c filename.gvf.gz | gvf_iit > <snpfile>.txt 1145 1146From my limited experience with GVF format, it appears that the file 1147contains variants from only one version of dbSNP. 1148 1149 1150Once you have built a <snpfile>.txt file using dbsnp_iit, vcf_iit, or 1151gvf_iit, you create an IIT file by doing this 1152 1153 cat <snpfile>.txt | iit_store -o <snpfile> 1154 1155which creates <snpfile>.iit. If you wish to store this file for other 1156people to use, you can then put it in a central place: 1157 1158 cp <snpfile>.iit /path/to/gmapdb/<genome>/<genome>.maps 1159 1160Or you can keep it in your own directory for your own use. Then you 1161need to create a reference space index and compressed genome by doing 1162 1163 snpindex -d <genome> [-V <snpdirectory>] -v <snpfile> <snpfile>.iit 1164 1165If you specify the [-V <snpdirectory>] option, then the resulting SNP 1166index files are placed in the given directory. Otherwise, if you 1167don't specify this, then the SNP files are saved with the reference 1168index files. 1169 1170If your <snpfile>.iit file is already in a <genome>.maps subdirectory, 1171then you can simply run 1172 1173 snpindex -d <genome> [-V <snpdirectory>] -v <snpfile> 1174 1175and the program will find the indicated <snpfile>.iit file. 1176Additional options for snpindex can be found by doing "snpindex 1177--help". 1178 1179 1180Finally, you can perform SNP-tolerant alignment by doing 1181 1182 gsnap -d <genome> [-V <snpdirectory>] -v <snpfile> <shortreads> 1183 1184You can also retrieve snp information for genomic segments by running 1185get-genome with the -v and -f flags. For more details, run 1186"get-genome --help". 1187 1188 118912. Alignment of reads from bisulfite-treated DNA in GSNAP 1190=========================================================== 1191 1192GSNAP has the ability to align reads from bisulfite-treated DNA, which 1193converts unmethylated cytosines to uracils that appear as thymines in 1194reads. GSNAP is able to identify genomic-T to read-C mismatches, and 1195produces output like this: 1196 1197>CTACGTcGTAGACATATTGATTCAGAATTTGAAGTTTAGCTAGATCTTAc 1 Read 1198 C.ACG.tGTAGACATA.TGATTCAGAAT.TGAAGTTTAGCTAGA.C.TAg 1..50 sub:2 +9:1000000..1000049 1199 1200As with all GSNAP output, the first line is the query, and the ones 1201afterward represent genomic segments. The "." symbol indicates an 1202unmethylated C in the genome that appears as a T in the read. 1203Mismatches are shown by lower case characters in the genomic segment. 1204In position 6, we see an example of a genomic-T that appears as a 1205read-C, representing a mismatch. The last position also shows a 1206mismatch between a genomic-G and read-C. 1207 1208To process reads from bisulfite-treated DNA, you will first need to 1209create the necessary index files by doing 1210 1211 cmetindex -d <genome> 1212 1213Then, you can align the reads by doing 1214 1215 gsnap -d <genome> --mode=cmet-stranded, or 1216 gsnap -d <genome> --mode=cmet-nonstranded 1217 1218on your set of short reads. The stranded flag works on data where 1219the laboratory allows only reads that go from 5' to 3' on each strand 1220of the genome, and excludes their reverse complement reads that would 1221be produced by PCR amplification. The resulting reads should have a 1222preponderance of C's converted to T's. 1223 1224The non-stranded flag is for the laboratory protocol that allows both 1225the 5'-to-3' genomic reads on each strand, as well as their reverse 1226complements. The reverse complemented reads will have the C-to-T 1227conversion show up as a G-to-A change. The resulting reads should 1228have a preponderance of T's in half the reads, and a preponderance of 1229A's in half the reads. 1230 1231 1232 123313. RNA-editing tolerance in GSNAP 1234=================================== 1235 1236Just as GSNAP has a program cmetindex and a mode called "cmet" for 1237tolerance to C-to-T changes, it can be tolerant to A-to-G changes 1238using the program atoiindex and a mode called "atoi". This mode is 1239designed to facilitate alignments that are tolerant to RNA editing 1240where A's are converted to I's, which appear as G's to a sequencer. 1241 1242To process reads under RNA-editing tolerance, you will first need to 1243create the necessary index files by doing 1244 1245 atoiindex -d <genome> 1246 1247Then, you can align the reads by doing 1248 1249 gsnap -d <genome> --mode=atoi-stranded, or 1250 gsnap -d <genome> --mode=atoi-nonstranded 1251 1252on your set of short reads. As with bisulfite sequencing, the 1253stranded flag is for laboratory protocols that allow only the 5'-to-3' 1254RNA, or sense, reads, and the non-stranded flag is for laboratory 1255protocols that allow both sense and antisense reads. 1256 1257 125814. Transcriptome-guided genomic alignment 1259=========================================== 1260 1261Transcriptome-guided genomic alignment (TGGA) is a feature in GSNAP 1262that utilizes both a genome and a transcriptome. In this alignment 1263mode, GSNAP aligns reads first against the set of transcripts. This 1264alignment is generally gap-free, since splicing is not an issue. 1265Indels can still be present, but GSNAP knows how to identify them in 1266transcripts. The advantage of aligning against a transcript is that 1267GSNAP can find seeds anywhere in the transcript, even if they span an 1268exon-exon junction. In contrast, with standard genomic alignment, if 1269the end of a read is near a splice junction, there may not be enough 1270nucleotides on the distal end of the splice to find a seed. 1271Therefore, in such cases, GSNAP would have soft-clip the end of the 1272alignment. 1273 1274In TGGA, GSNAP converts alignments to transcripts to alignments in the 1275genome, using information about how the entire transcript maps to the 1276genome. The ends of reads can be aligned more accurately, even if 1277only one or two nucleotides are on the distal end of a splice. In 1278addition, gap-free alignment to transcripts is fast, so TGGA is many 1279times faster than regular genomic alignment. A final advantage of 1280TGGA is that GSNAP reports the transcripts that are consistent with 1281the read, which can be useful for downstream analyses that measure 1282gene expression by transcript. Therefore, TGGA serves a function 1283similar to that of alignment-free RNA-seq tools like Kallisto or 1284Salmon, but also provides the genomic alignment needed for the 1285analysis of splicing, mutations, and polymorphisms. 1286 1287To perform TGGA, in addition to the genome in FASTA format (suppose it 1288is called genome.fasta), you will need either a set of transcripts in 1289FASTA format (transcripts.fasta) or a set of transcript mappings to 1290the genome (transcripts.map). The map format can be obtained from a 1291GFF3 or GTF file using the programs gff3_genes or gtf_genes included 1292with this software package, as follows: 1293 1294 gff3_genes <gff3_file> > transcripts.map 1295 gtf_genes <gtf_file> > transcripts.map 1296 1297The map file can also be generated by aligning the transcript FASTA 1298format against the genome using "gmap -f map_exons", but if you are 1299starting with a FASTA file, it is easier to let gmap_build perform 1300this process for you. 1301 1302To build the genome and transcriptome index files, there are two 1303cases: 1304 1305Case 1: The genome index has not yet been built. In this case, you 1306can build both the genome and transcriptome indices with one command, 1307either 1308 1309 gmap_build [-D <directory>] -d <genome_name> -c <transcriptome_name> -T transcripts.fasta genome.fasta, or 1310 1311 gmap_build [-D <directory>] -d <genome_name> -c <transcriptome_name> -G transcripts.map genome.fasta 1312 1313depending on whether you have a FASTA or map file. 1314 1315Case 2: The genome index has already been built (as described in 1316section 3 above). In this case, you do not need to build the genome 1317index again, just the transcriptome index, so you should omit the 1318genome FASTA file: 1319 1320 gmap_build [-D <directory>] -d <genome_name> -c <transcriptome_name> -T transcripts.fasta, or 1321 1322 gmap_build [-D <directory>] -d <genome_name> -c <transcriptome_name> -G transcripts.map 1323 1324The result of these procedures will be to create index files in one 1325directory for the genome, and in another directory for the 1326transcriptome. In addition, there will be a subdirectory 1327<genome_name>.transcripts with files <transcriptome_name>.*, which 1328contain information about how transcripts map to the genome. 1329 1330After these indices are built, you can perform TGGA by doing 1331 1332 gsnap [-D <directory>] -d <genome_name> -c <transcriptome_name> -N 1 -A sam <fastq files...> 1333 1334In other words, the only addition is the flag "-c 1335<transcriptome_name>". The "-N 1" flag is actually optional, because 1336the fact that you have a transcriptome tells GSNAP that your input 1337files are RNA-seq. 1338 1339 1340