1\documentclass[12pt]{article}
2\begin{document}
3
4\title{Using the Columbus extension to Velvet}
5\author{Daniel Zerbino}
6\date{June 12, 2010}
7\maketitle
8
9\section*{Abstract}
10
11Since its 1.0 release, the Velvet short-read assembler contains a module called Columbus which allows the user to provide reference sequences along with mappings of sequencing reads onto those reference sequences, to efficiently assist the assembly process. This short manual describes how to use the Columbus module within the Velvet package. Users unfamiliar with Velvet's use should first refer to the main Velvet Manual.
12
13\section{For impatient people}
14
15\begin{verbatim}
16
17> head myRegions.fa
18>chr1:123456789-123457789
19ATGTGTGTACTAGCTAGCGCGCTAGCTAGTCATGTGTGTACTAGCTAGCGCGCTAGCTAGTC
20[etc ...]
21
22> sort myReads.sam > mySortedReads.sam
23
24> velveth my_dir 21 -reference myRegions.fa \
25	-shortPaired -sam mySortedReads.sam
26
27> velvetg my_dir [etc ...]
28
29\end{verbatim}
30
31\section{How could it be used?}
32
33\begin{description}
34\item[Breakpoint assembly]
35
36You sequenced an individual genome by WGS, mapped the reads onto the reference, and detected a structural variant (SV) breakpoint using the algorithm of your choice.  As a further validation you wish to assemble the reads which span this breakpoint, and thus obtain an image of the rearrangement down basepair level.
37
38You would then select the two regions which appear to have been joined (allowing for some arbitrary margin on either side), and all the reads which map onto those two regions, along with their mate-pairs. Provided with all this, Velvet would then assemble the breakpoint, using the known references as scaffolds.
39
40If you are confounded by some form of heterzygocity at the breakpoint, then you can be sneaky and consider the assembly to be a problem between short
41overlapping sequences with different concentrations... a lot like transcripts! You would therefore use the Oases package to deconvolute the different
42``isoforms'' of your ``gene'' (it sounds pretty roundabout, but from an assembly point of view it actually makes sense).
43
44\item[Assisted transcriptome assembly]
45
46You sequenced the transcriptome of a new species, strain or individual, and you happen to know the gene sequences of a nearby species, strain or reference individual.
47
48You would then map the reads onto the reference genome, using the short-read mapper of your choice, and provide the alignments along with the known exonic sequences to Velvet. It would rebuild contigs based on the alignments, which could then be used by the Oases package.
49
50\end{description}
51
52\section{Overview of the process}
53
54\begin{enumerate}
55\item Map the reads against a set of \emph{target sequences} (typically, an entire reference genome, made up of chromosomal sequences).
56\item Select \emph{reference regions} within the target sequence (optionally, a reference region can be an entire target sequence) (e.g.: exon regions or SV regions).
57\item Prepare a FASTA file containing the sequences of the reference regions, along with their coordinates within the target sequences.
58\item Provide this FASTA file along with the SAM/BAM alignment file (or selected lines thereof).
59\item Run \emph{velvetg} as usual.
60\item If applicable, run Oases as usual.
61\end{enumerate}
62
63
64\section{Providing read mappings}
65
66Just like like in Velvet, you can provide paired or unpaired read alignments in SAM or BAM format, as produced by most short read aligners. You
67should simply remember to sort the reads by read name. If you turn on strand specificity (more on this later), you must sort your reads
68such that the forward read of each pair comes right before the reverse read of that pair.
69
70Because it is possible to specify regions within the target sequences, the mapping need not be specifically against the selected reference sequences.
71For example, if you wish to use exons as reference sequences, you can perform an alignment against the whole genome. Velvet will then sort out which
72alignments are within the desired regions, and ignore the rest.
73
74Beware that some read-aligners (e.g. Tophat) only report aligned reads in their output. To make the most of \emph{de novo} assembly, it is important to ensure that the unmapped reads are also provided to Velvet, whether in a separate file or in the SAM/BAM alignment file. It is generally better to put all the reads in the same file, so that pairings between mapped and unmapped reads are not lost.
75
76\section{Reference sequences}
77
78\subsection{How to choose reference sequences}
79
80The reference sequences can be any subset of the target sequences (whether whole genome, contigs, etc) used during the mapping process.
81
82\subsection{How \textbf{NOT} to choose reference sequences}
83
84\begin{itemize}
85\item Reference sequences must \textbf{NOT} overlap in the target sequence coordinate space
86
87This is because when reading the alignment files, Velvet must be able to assign a read to at most one reference sequence. Ambiguity in the coordinate space would defeat the purpose of providing alignments. Normally, Velvet will detect overlapping reference coordinates and exit with an error message.
88
89\item Reference sequences should \textbf{NOT} contain large amounts of repeats
90
91This is more a question of performance. If you load as reference large chunks of the reference genome (as in 100's of Mb) which contain large amounts of repeats, then Velvet will have to make many loops comparing identical sequences. This will slow down the assembly considerably.
92
93\item The coordinate system must \textbf{NOT} change between input files.
94
95This is a very obvious point, but it is important to that the coordinate system in the alignment file must be the same as that in the reference regions' file. If the coordinates are off, then Velvet will be unable to use the alignments (it will still be able to function and use the reference sequences, but not as efficiently as if it had reliable alignments). For example, if you align your reads to the human genome assembly HG19, then the reference regions must also be selected from HG19.
96
97\end{itemize}
98
99\subsection{How to provide reference sequences to Velvet}
100
101Reference sequences must necessarily be contained in a FASTA file. If the sequences correspond to complete target sequences used in the mapping
102process (e.g. when mapping to contigs) then the names in the FASTA headers must be identical to those of the target sequences. If the reference
103sequences correspond to a subset of one of the target sequences, then its header must correspond to the standard ``browser'' coordinates (i.e.
1041-based, inclusive on both ends, 5' start to 3' end, as illustrated in the quick example above).
105
106\subsection{Strand specificity in Columbus}
107
108If doing a strand specific analysis on Columbus, then the ``-strand\_specific'' velveth flag must be provided before any option, i.e. right after the
109hash length.
110
111In this case, the reference sequences are taken from the desired strand. The headers of the regions on the negative strand have their
112start and end coordinates reversed, i.e. the 3' coordinate comes before the 5' coordinate.
113
114For example, if the positive strand of a toy example is:
115\begin{verbatim}
116>chr1:1000-1010
117AGTCGATAGA
118\end{verbatim}
119
120then the reverse complement of this region is provided as:
121\begin{verbatim}
122>chr1:1010-1000
123TCTATCGACT
124\end{verbatim}
125
126\section{Some scripts I use}
127
128Below are a few scripts which I put up to prototype Columbus. They're pretty rough BioPerl code, but they serve their purpose...
129
130\begin{description}
131
132\item[enlarge\_exons.pl] This scripts adds an arbitrary buffer length (default $100 bp$) upstream and downstream of each exon. This avoid false positive
133overlaps just because one reference region happens to end at a homologous region.
134
135\item[merge\_gtf\_exons.pl] This script sorts regions, and merges overlapping ones (it assumes no strand specificity).
136
137\item[gff2fasta.pl] Once you have the GFF file you desire, it can be converted into a fasta with the appropriate headers with this script.
138
139\end{description}
140
141Typically, you would prepare the references as such:
142\begin{verbatim}
143grep exon my_annotation.gtf | enlarge_exons.pl | merge_gtf_exons.pl \
144	| gff2fasta.pl my_reference_assembly.fa > my_reference_sequences.fa
145\end{verbatim}
146
147\end{document}
148