1
2\documentclass{article}
3\usepackage{epsfig}
4
5\begin{document}
6\newcommand{\programtext}[1]{{\tt #1}}
7
8\title{Wise2 Documentation (version 2.2 series)}
9\author{Ewan Birney\\
10EBI\\
11Wellcome Trust Genome Campus\\
12Hinxton, Cambridge CB10 1SD,\\
13England.\\
14Email: birney@ebi.ac.uk}
15
16\maketitle
17
18\newpage
19\tableofcontents
20\newpage
21
22\section{Overview}
23
24Wise2 is a package focused on comparisons of biopolymers, commonly DNA
25sequence and protein sequence. There are many other packages which do
26this, probably the best known being BLAST package (from NCBI) and the
27Fasta package (from Bill Pearson). There are other packages, such as
28the HMMER package (Sean Eddy) or SAM package (UC Santa Cruz) focused
29on hidden Markov models (HMMs) of biopolymers.
30
31Wise2's is now a collection of algorithms which generally differ from
32the usualy, ``standard'' bioinformatics comparison methods. Probably
33the most used algorithm in Wise2 is \emph{genewise}, which is the
34comparison of DNA sequence at the level of its protein
35translation. This comparison allows the simultaneous prediction of say
36gene structure with homology based alignment. However Wise2 has a number
37of other methods which I hope will also appeal to users - \emph{promoterwise}
38for comparing upstream regions, \emph{genomewise} as a ``protein gene finisher''
39tool for combining disparate evidence strands and \emph{scanwisep} as a
40fast but sensitive search method.
41
42
43Wise2, although implemented in C makes heavy use of the Dynamite code
44generating language. Dynamite was written for this project, by myself
45(Ewan Birney). There is a separate documentation for Dynamite found in
46the same place as this file; to be honest noone except me can be
47expected to use Dynamite; it is a cranky, stupid, badly written code
48generating language with about two thirds of its code base dedicated
49to dynamic programming, and the remaining third a rather bad primitive
50object based system. It does however fit me like a glove, and makes my
51own personal code efficiency impressive.
52
53
54Wise2 has had a varied life. It started off as a rewrite of my old
55pairwise and searchwise package (called WiseTools), hence the name
56Wise2. For a while it looked as if it was going to be pensioned off
57and live its life out as a rather solid code base for genewise and
58estwise, but mid 2002 I came back to it for long and complex reasons
59not worth putting into documentation. This lead to the new methods
60such as promoterwise and scanwisep which I have high hopes for, though
61the real test is with real data used by real users...
62
63\subsection{Authors}
64
65The Wise2 package was principly written by Ewan Birney, who wrote the
66main genewise and estwise programs. The protein comparison database
67search program was written by Richard Copley using the underlying
68Wise2 libraries. Wise2 also uses code from Sean Eddy for reading HMMs
69and for Extreme value distribution fitting.
70
71However the authorship of Wise2 should be more fairly distributed
72between the main authors and the wonderful alpha testers I have had
73over the years. In particular Michele Clamp stands out as my longest
74running collaborator and tester, and indeed in effect all my
75successful algorithms have been best used and developed with
76Michele. Other mentions go to Gos Micklem and Niclas Jareborg and
77for their work at testing and their patience in my coding over the
78last couple of years. Other notables are (in no apparent order) -
79Enoch Huang, Erik Sonnhammer, Doug Rusch, Steve Jones, Ian Korf,
80Iftach Nachman, George Hartzell and Lars Arvestead. I believe that
81program writing is a 50-50 partnership between the coders and the
82testers or developers, and these people have actively helped me make a
83much better package.
84
85
86\newpage
87\section{Introduction for the impatient}
88
89It may well be that you want to understand Wise2's functionality now,
90without bothering with the concepts or the installation instructions.
91This section is designed for you.
92
93Wise2 has four main executable programs using sequence inputs which
94are designed to provide access to the main algorithms sensibly. The
95algorithms you are interested in is \emph{genewise} - compare protein
96information to genomic DNA and \emph{estwise} - compare protein
97information to EST/cDNA DNA.
98
99Other algorithms in Wise2 have their own single executables. In particular
100you might be interested in \emph{promoterwise}
101
102These are the programs which you might use for this.
103
104\begin{description}
105\item[genewise] a single protein vs a single genomic dna sequence
106\item[genewisedb] a database of proteins vs a database of genomic dna sequences. Read
107section \ref{genewise_large} before you use this in anger though.
108\item[estwise] a single protein vs a single EST/cDNA sequence.
109\item[estwisedb] a database of proteins vs a database of EST/cDNA sequences. Read
110section \ref{estwise_large} before you use this in anger though.
111\end{description}
112
113If you see error messages like
114\begin{verbatim}
115Warning Error
116        Could not open human.gf as a genefrequency file
117Warning Error
118        Could not read a GeneFrequency file in human.gf
119...
120\end{verbatim}
121This means that the enviroment variable WISECONFIGDIR has not been
122set up correctly. You need to find where the distribution was downloaded
123to (a directory called something like wise2.1.16b) and inside that
124directory should be the configuration directory wisecfg. You need to
125setenv WISECONFIGDIR to that directory.
126
127In each of the programs the protein can either be a protein sequence
128or a protein profile HMM, as made by the HMMER package (both version 1
129and version 2 HMMs can be read). Any of the databases can have one
130entry (in which case more efficient routines are used), and databases
131of profile HMMs, such as those provided by Pfam, can be used.
132
133The simple running of a protein sequence (drosophila) vs a human genomic sequence,
134using genewise is given below. The output comes on stdout, which in normal unix
135notation can be redirected to a file.
136
137\begin{verbatim}
138adnah:[/birney/search]<98>: genewise road.pep hngen.fa
139genewise (unreleased release)
140This program is freely distributed under a GPL. See source directory
141Copyright (c) GRL limited: portions of the code are from separate copyright
142
143Query protein:       roa1_drome
144Comp Matrix:         blosum62.bla
145Gap open:            12
146Gap extension:       2
147Start/End            local
148Target Sequence      HSHNRNPA
149Strand:              forward
150Gene Paras:          human.gf
151Codon Table:         codon.table
152Subs error:          1e-05
153Indel error:         1e-05
154Model splice?        model
155Model codon bias?    flat
156Model intron bias?   tied
157Null model           syn
158Algorithm            623
159Find start end points: [25,1387][346,3962] Score 87719
160Recovering alignment: Alignment recoveredExplicit read offone 94%
161genewise output
162Score 253.10 bits over entire alignment
163Scores as bits over a synchronous coding model
164
165Warning: The bits scores is not probablistically correct for single seqs
166See WWW help for more info
167
168
169
170roa1_drome        88 AQKSRPHKIDGRVVEPKRAVPRQ                       DID
171                     A  +RPHK+DGRVVEPKRAV R+                       D
172                     AMNARPHKVDGRVVEPKRAVSRE                       DSQ
173HSHNRNPA        1867 gaagaccagggagggcaaggtagGTGAGTG  Intron 2   TAGgtc
174                     ctacgcaataggttacagctcga<0-----[1936 : 2083]-0>aca
175                     tgtagacggtaatgaagatccaa                       tta
176
177
178roa1_drome       114 SPNAGATVKKLFVGALKDDHDEQSIRDYFQHFGNIVDINIVIDKETGKK
179                      P A  TVKK+FVG +K+D +E  +RDYF+ +G I  I I+ D+ +GKK
180                     RPGAHLTVKKIFVGGIKEDTEEHHLRDYFEQYGKIEVIEIMTDRGSGKK
181HSHNRNPA        2093 acggctagaaatgggaaggaggcccagttgctgaaggagaaagcgagaa
182                     gcgcatctaatttggtaaacaaaatgaataaagatattattcaggggaa
183                     aatccatgagatttctaactaatcaatttagtaatagtacgtcactcga
184
185
186roa1_drome       163 RGFAFVEFDDYDPVDKVV                          QKQHQ
187                     RGFAFV FDD+D VDK+V                          QK H
188                     RGFAFVTFDDHDSVDKIV          L:I[att]        QKYHT
189HSHNRNPA        2240 agtgtgatggcgtggaagAGTAAGTA  Intron 3   TAGTTcatca
190                     ggtcttctaaaactaatt <1-----[2295 : 2387]-1>  aaaac
191                     gctctactcctccgtgtc                          gactt
192
193
194roa1_drome       187 LNGKMVDVKKALPKQNDQQGGGGGR
195                     +NG   +V+KAL KQ         R
196                     VNGHNCEVRKALSKQEMASASSSQR          G:G[ggt]
197HSHNRNPA        2405 gagcatggaagctacgagagttacaGGTATGCT  Intron 4
198                     tagaagatgactcaaatcgcccgag <1-----[2481 : 2793]
199                     gtccctataacgagaggtttaccaa
200
201...truncated
202
203\end{verbatim}
204
205The output is as follows
206\begin{itemize}
207\item Parameters of the comparison used (it used default parameters)
208\item The alignment of a combined homology + gene prediction alignment
209\end{itemize}
210
211The pretty alignment shows the protein sequence on the first line,
212followed by a line indicating the similarity level of the match
213followed by 4 lines representing the DNA sequence. The DNA sequence in
214the exons descending in triplets, each triplet being a codon. The
215translation of each codon is shown above it. Between the two protein
216sequences a line indicating the similarity of the match is printed. In
217introns the DNA sequence is not shown but for the first 7 bases
218(making the 5' splice site) and the last 3 bases of the 3' splice
219site. The intervening sequence is indicated in the square
220brackets. Above each intron, for phase 1 and 2 introns (ones that
221split a codon) the implied protein to conceptual gene match is
222displayed, with the codon in square brackets.
223
224Generally the defaults of the options are reasonably sensible, and for
225the main part you should trust them until you become familar with the package.
226
227The following commands show how to run the other programs in a variety
228of different modes
229
230\subsection{Common running modes}
231\label{sec:commonmode}
232Running modes for genewise (genomic to protein comparisons).
233
234NB, the order of the -options are not important, but the protein file must be
235before the dna file
236
237\begin{itemize}
238\item \hbox{\tt{genewise protein.pep cosmid.dna}} compares a protein sequence to a DNA sequence (same
239as the example above)
240\item \hbox{\tt{genewise -hmmer pkinase.hmm cosmid.dna}} compares a protein profile HMM to
241a DNA sequence
242\item \hbox{\tt{genewisedb protein.pep human.fa}} compares a single protein sequence to a
243database of DNA sequences
244\item \hbox{\tt{genewisedb -hmmer pkinase.hmm human.fa}} compares a single protein profile HMM to
245a database of DNA sequences
246\item \hbox{\tt{genewisedb -prodb protein.pep -dnas cosmid.dna}} compares a database of protein
247sequences to a single dna sequence
248\item \hbox{\tt{genewisedb -pfam Pfam -dnas cosmid.dna}} compares a database of protein profile HMMs
249to a single dna sequence
250\item \hbox{\tt{genewisedb -prodb protein.pep human.fa}} compares a database of protein
251sequences to a database dna sequences - beware, this will take a while!
252\item \hbox{\tt{genewisedb -pfam Pfam human.fa}} compares a database of protein profile HMMs
253to a database of single sequences - beware, this will take a while
254\end{itemize}
255
256The estwise (protein to est/cDNA comparisons) have precisely the same running modes.
257Listed for completeness below
258
259\begin{itemize}
260\item \hbox{\tt{estwise protein.pep singleest.fa}} compares a protein sequence to a DNA sequence (same
261as the example above)
262\item \hbox{\tt{estwise -hmmer pkinase.hmm singleest.fa}} compares a protein profile HMM to
263a DNA sequence
264\item \hbox{\tt{estwisedb protein.pep est.fa}} compares a single protein sequence to a
265database of DNA sequences
266\item \hbox{\tt{estwisedb -hmmer pkinase.hmm est.fa}} compares a single protein profile HMM to
267a database of DNA sequences
268\item \hbox{\tt{estwisedb -prodb protein.pep -dnas singleest.fa}} compares a database of protein
269sequences to a single dna sequence
270\item \hbox{\tt{estwisedb -pfam Pfam -dnas singleest.fa}} compares a database of protein profile HMMs
271to a single dna sequence
272\item \hbox{\tt{estwisedb -prodb protein.pep est.fa}} compares a database of protein
273sequences to a database dna sequences - beware, this will take a while!
274\item \hbox{\tt{estwisedb -pfam Pfam est.fa}} compares a database of protein profile HMMs
275to a database of single sequences - beware, this will take a while
276\end{itemize}
277
278\subsection{Common options to change}
279
280There are a number of common options that can be used. Options can be issued anywhere
281on the command line.
282
283\begin{description}
284\item[-help] help on options
285\item[-version] show version and build date (useful for bug reporting)
286\item[-quiet] remove update line on stderr and informational messages
287\item[-silent] suppress all messages to stderr
288\item[-report] \emph{number} for database searching, issue a report on stderr every
289\emph{number} of comparisons (useful to ensure it is actually running)
290\item[-trev] genewise and estwise - use the reverse strand of the DNA
291\item[-both] genewise and estwise - use both strands of the DNA
292\item[-u position] The start point in the DNA sequence for the comparison
293\item[-v position] The end point in the DNA sequence for the comparison
294\item[-init]   [default/global/local/wing]  (see section \ref{sec:start_end})
295For protein sequences the default is to be local (like
296smith waterman). For protein profile HMMs, the default is read from the HMM - the
297HMM carries this information internally. The global mode is equivalent to to the ls building option
298(the default in the HMMer2 package). The local mode is equivalent to to the fs building option (-f)
299in the HMMer2 package. The wing model is local on the edges and global in the middle.
300\item[-gene file] change gene model parameters. Currently we have either
301human (human.gf) or worm (worm.gf)
302\item[-genes] Output option for genewise algorithms - show an easy to read gene
303structure report
304\item[-trans] Output option for genewise algorithms - provide an automatic
305translation of the predicted gene as a fasta format
306\item[-cdna] Output option for genewise algorithms - provide an automatic
307construction of the spliced dna sequence as a fasta format
308\item[-ace] Output option for genewise algorithms - provide an ACeDB
309subsequence model output
310\end{description}
311
312\subsection{Common gripes, Cookbook and FAQ}
313
314\subsubsection{It hasn't given me a complete gene prediction}
315
316The genewise algorithm does not attempt to predict an entire gene,
317from Met to STOP. It tries to predict regions which are justified with the
318protein homology and no more.
319
320This does mean you can be confident of the predictions that genewise makes
321
322\subsubsection{How can I get rid of the annoying messages on stderr?}
323
324Some people like them. use -quiet
325
326\subsubsection{It goes far too slow}
327
328Well... I have always had the philosophy that if it took you over a month
329to sequence a gene, then 4 hours in a computer is not an issue. However, in
330particular for times when people are using genewise simply to confirm
331that the a gene prediction is correct with respect to a protein sequence
332(sometimes the notional translation!) it is taking too long. In many cases
333you will know the rough region to compare the sequence to - if so use the -u
334and -v options to truncate your DNA at the correct points (the output will
335remain in the coordinates of the full length sequence).
336
337For database searching there is the option of using SMP boxes efficiently
338with the pthreads port.
339
340There are also a number of heurisitcs that use the BLAST program to
341provide the speed. These heuristics are found in the perl/scripts
342directory, called halfwise and blastwise and notes on how to use them
343are a later section (\ref{half_and_blast}). The scripts have extensive
344installation instructions, and I completely expect people to edit them
345for their system.
346
347There is functionality for providing a heurisitic bound to the space the
348algorithm explores in the alignment. This is done via the potential
349gene option in genewise. It is not well tested out.
350
351\subsubsection{I have a new cosmid. What do I do?}
352
353One thing to do is to use the halfwise script available in the perl/scripts
354package. Another is to use the blastwise script.
355
356
357\subsubsection{Can I modify or use the Wise2 source code?}
358
359Of course you can - it is Open Source code, licensed under the Gnu
360Public Licensed (GPL'd), like emacs or gcc. For more information on
361this License read the GNULICENSE file in the distribution.
362
363As well as using the source code, you can if you like contribute
364directly back into the Wise2 source code. Get in contact with me
365if you would like to do this.
366
367\subsubsection{Making a single gene prediction on the basis of a close homolog}
368
369This is perhaps the easiest use of genewise. The basic formulation is
370\begin{verbatim}
371%genewise protein.fasta dna.fasta
372\end{verbatim}
373To get out computer parsable formats of the gene prediction
374try -genes or -gff or -ace. To get out the protein translation
375in one go use -trans
376\subsubsection{Using non human/worm/fly genomic DNA}
377
378At the moment, genewise only has gene frequency files for
379human and worm sequences. The production of these files
380are based around somewhat annoying and non portable script. In
381any case, making a dataset requires alot of effort as it
382needs to be clean
383
384The consequence of all this is that the species that you
385are comparing against (eg, hamster) may not have a gene frequency
386(.gf) file. In which case you basically have two options
387\begin{itemize}
388\item Use a close species - ie, for hamster, use human or rat
389\item Use -splice flat -intron tied which switches the splice model to ``start at GT, finish at AG'' with no other information
390\end{itemize}
391
392\subsubsection{Working with non spliced (bacterial) genomic DNA}
393
394Use genewise with the -alg 333 or -alg 333L options. This has all the
395outputs of genewise but does not consider introns. The -gene option
396and -intron, -splice options are all pointless. The only options to worry
397about is the -subs and -indel for substitution and insertion and deletion
398errors respectively.
399
400\subsubsection{Working with ESTs}
401
402Use the estwise/estwisedb programs
403
404\subsubsection{Getting out the protein translation}
405
406You have three approaches for getting out protein translations
407
408\begin{itemize}
409\item -pep available on all programs, provides the translations moving
410over frameshifts and introns
411\item -trans available on genewise/genewisedb provides the translations
412across introns but breaks on frameshift errors. This means that the translations can be correctly placed on the genomic DNA provided
413
414\item -mul available only on estwisedb when a HMM is used, provides a
415protein multiple alignment of all the DNA hits derived against the HMM
416match
417\end{itemize}
418
419\subsubsection{Using Pfam}
420
421Pfam can be used with the genewisedb or the estwisedb program with the
422-pfam flag. Usually you want to also use the -dnas (single DNA
423sequence flag) as well. An example run would be
424\begin{verbatim}
425genewisedb -pfam Pfam -dnas myseq.fa
426\end{verbatim}
427If you have set up the HMMER package to work with Pfam using the enviroment
428variable HMMERDB, Wise2 will also pick that up as well.
429
430
431
432\subsubsection{Optimising alignment speed}
433
434Wise2 assummes you have a rather small amount of memory (20 MBytes).
435When it is making an alignment, if it cannot make the explicit matrix
436in that size (being length of query $\times$ length of target $\times$
437state number) it has to move to linear memory (length of query
438$\times$ state number). The linear memory is much slower (it is the
439one that starts with ``Find start end points'').
440
441If you have more memory than 20 Mbytes, then it is really sensible to
442up the number, using the -kbyte option. For a machine with say
44364Mbytes physical memory I would suggest putting an upper limit of
44450Mbytes with -kbyte. This does assumme you are not using it for
445anything else.
446
447You can change the compile time default in basematrix.h if you can't
448be bothered to remember to change it every time
449
450\subsubsection{Optimising search speed}
451
452See sections \ref{genewise_large} and \ref{estwise_large} for use of
453these programs in large scale throughput environments.
454
455Make sure you have compiled with optimisation. If you are using the
456make all from the top level you have. If you are using gcc, make
457sure you are using -O2 optimisation, and probably crank it all the
458way up.
459
460If you have a large SMP box, you can compile with pthread support. The
461searches work on SGI/Compaq alpha/Suns. There are some issues about some
462architecture ports, which I need to expand somewhere in the docs, but first
463off, just try compiling with pthreads (\ref{compile_pthread}) and using pthreads
464in the search.
465
466For real, order-of-magnitude speed ups, you are going to have to use a
467heuristic stage before the actual database search - in other words,
468using BLAST. I dislike this, but it is fact of life, and there are two
469scripts in perl/scripts, halfwise and blastwise
470(\ref{half_and_blast}), which help you do this. Both scripts use Steve
471Chervitz excellent perl Blast parser, which is available in
472bioperl. (Make sure you have a 0.05 release or later of bioperl, as
473the Blast parser in the 0.05 release is much better).
474
475\begin{itemize}
476\item halfwise is for the Pfam search. You need to pick up the halfwise
477database (done for a specific release of Pfam) from the ftp site.
478\item blastwise is for post processing blast results. It uses the Wise2
479perl port to do this, so you have to go make perl at the top level
480\end{itemize}
481
482halfwise is a pretty sensible, self contained script. blastwise I expect
483people to modify heavily to get to work as wished on their systems. Please
484read it, and add in your own heuristics (eg, figuring out start/end points).
485I am very interested in better heuristics in this area.
486
487\subsubsection{segmentation fault = bottle of champagne}
488
489You've found a bug? I am really keen to hear from you. I want
490to hear about the problems you've got. Each year I award my
491best tester with a prize. This year (1998/99) it will be
492a bottle of champagne. Send a mail to birney@sanger.ac.uk
493for your prize!
494
495\subsection{Using GeneWise in a large scale throughput manner}
496\label{genewise_large}
497
498If you are analysing genomic DNA in a large scale manner, you might
499wonder what is the best way to use genewise. Genewise is \emph{very
500CPU expensive} compared to other programs. Part of this is because
501I have concentrated much more on correctness of the algorithm, not
502its speed (it is probably about 2 fold slower than it could be optimally),
503but mainly this is because the algorithm is complicated and DNA sequence
504is generally very large. I do not believe that optimising genewise in
505the code will solve people's CPU problems.
506
507For these reasons, I do not advise the serious use of genewisedb as
508a single executable for comparing DNA sequence to either Pfam or
509protein databases. For these cases I suggest using the halfwise and
510blastwise scripts. See the section on Halfwise and Blastwise (\ref{half_and_blast}
511
512\begin{itemize}
513\item halfwise compares a DNA sequence to Pfam using a Blast speed up
514\item blastwise uses a Blastx result to provide the database search
515and provides sequence alignments ontop of that
516\end{itemize}
517
518Another option is to get in contact with Paracel, Compugen or
519TimeLogic, all of whom may be able to sell you specialised
520hardware. Paracel has successfully ported genewise to their hardware
521with only a few minor changes to the method.
522
523\subsection{Using EstWise in a large scale throughput manner}
524\label{estwise_large}
525
526Estwise in a large scale manner is a more troubling issue than genewise.
527Generally the DNA databases are as large, but the algorithm is smaller
528and often people are equally interested in sensitivity and alignment
529quality. Therefore it makes more sense to use estwise directly as the
530database search. Estwise is still pretty slow, so here is a check list
531of things to do
532
533\begin{itemize}
534\item Run estwise on a \emph{clustered} EST database, not raw reads
535\item Make sure you are using the 3:12 algorithm on estwise (-alg 312)
536for the database search. Try using the 3:12 quick (-alg 312Q).
537\item Use the pthread port if you have a large SGI/Sun/Dec multiprocessor
538\item For the final tuning, make sure you have switched on all the compiler
539options. egcs is a good thing to try. This will give you a 10-15% speed up
540at most
541\end{itemize}
542
543I am thinking about improvements to the estwise running time. I would very
544much like to collaborate with someone on estwise in terms of understanding
545its sensitivity and improving all aspects of the algorithm. Please get in
546contact with me.
547
548The hardware solutions from Compugen, Paracel and Time Logic are all very
549good in this area, and worth investigating if you have money to spend.
550
551
552\newpage
553\section{Installation}
554
555Installation is quite easy as long as you are au fait with
556standard UNIX utilities. You should ftp to ftp.sanger.ac.uk,
557log in as anonymous and move to pub/birney/wise2. You can
558then pick up the release - I would pick up the latest
559numbered in that directory. (NB, if you want to be working
560in the development release, go to the pub/birney/wise2/alpha
561directory, but be sure to read the html help at
562http://www.sanger.ac.uk/Software/Wise2/Programming).
563
564\subsection{Building the executables}
565The release is distributed as a gzipped, tar file. To
566unzip and untar in a single command you can type
567
568\begin{verbatim}
569%zcat wise2.1.12b.tar.gz | tar -xvf -
570\end{verbatim}
571
572This will untar into a directory called 'wise2.1.12b'
573(of course, your version of Wise2 might be different).
574
575Once you have made the tar file, it should build completely cleanly as
576long as you have an ANSI C compiler. If in doubt, just assumme that it
577is, but in particular sun users might want to use gcc (gnu cc) as the
578sun cc compiler installed by default is often non-ANSI. To change the
579cc compiler you only need to edit the line in the top level makefile
580called CC = cc to CC = gcc.
581
582
583To build the package type
584\begin{verbatim}
585%cd wise2.1.12b
586%make all
587%make bin
588\end{verbatim}
589
590The executable files will now be in wise2.1.12b/bin
591
592
593I am interested in all compiler errors, and consider most of them
594to be bugs (which means if you report them you could be on the
595champagne list!)
596
597\subsection{Environment set up}
598
599The Wise2 package needs to know where a number of files are (eg,
600the gene predicition statistics). These files are in the directory
601called wisecfg/. You will need to setenv WISECONFIGDIR to this
602directory (you can of course move the directory elsewhere, and set
603WISECONFIGDIR to it).
604
605\subsection{Building with thread support (for SMP machines)}
606\label{compile_pthread}
607To build with pthread support you must switch on some extra
608compile time options before you type make all. These are found
609at the top of the makefile in the top directory, and it is
610pretty clear from the makefile what to do. See the section \ref{running_pthread}
611for information on how to run pthreaded code.
612
613In some cases the pthreads do not schedule correctly, preventing multiple
614threads working on different processors at the same time. If you have
615this problem, trying compiling with -D HAS\_PTHREAD\_SETSCOPE on the CFLAGS
616line.
617
618The pthreaded code has been reported to be 97% efficient with 8 threads, and
619there have been reports of up to 100 multiple threads running fine.
620
621\subsection{Building Perl port}
622\label{perl_port}
623
624To build with Perl support you need to go
625\begin{verbatim}
626make perl
627\end{verbatim}
628at the top level. This should build everything correctly. The only
629problem is if you have a Solaris or *BSD box. If so you need to compile
630with -fpic or -fPIC depending on your compiler. This needs to go into
631the top level CFLAGS line. In addition, in the out-of-the box perl distribution
632for solaris they built it with a different compiler to the one it comes
633with (idiots!), so the perl generated makefile has the wrong -fpic option.
634You need to edit that by hand.
635
636\newpage
637\section{Concepts and conventions}
638
639The algorithms used in Wise2 have a strong theoretical justification,
640which is useful, though not necessary to understand.  For example to
641understand what most of the options do in the gene model part of
642genewise you need to understand the algorithm.
643
644\subsection{Technical Approach}
645
646You can miss this section which describes some of the theoretical
647background of the work.  The algorithms are based around a 'Bayesian'
648formalism that has been established in Bioinformatics by such people
649as David Haussler, Gary Churchill, Anders Krogh, Richard Durbin, Sean
650Eddy and Graeme Mitchinson, as well as many others. In this formalism
651there is assumed to be a generative model of the process that you are
652observing, which has probabilities to generate a number of different
653observations. Deciding whether this model fits a previously unseen
654piece of data or not is the first decision to make. Given that the
655data fits, a second question is what actual processes were the most
656likely to produce the observed data. Both these questions fit
657naturally into a Bayesian framework where the result is a posterior
658probability having seen the data.
659
660For people coming from a bioinformatics/biology background where the last
661paragraph may seem very confusing, it is only because this a different
662(and well established) field with their own terminology to describe the
663algorithms. In fact the methods a very close to standard techniques presented
664in bioinformatics. The generative models that we
665use are the models that are implied by the standard bioinformatics
666tools. For example, the Smith-Waterman algorithm implies a process of
667evolution with certain probabilities for seeing say an Leucine to  Valine
668substitution and certain probabilities for creating and extending a
669insertion (gap). As you can see you can almost replace the word
670'probability' with 'score' to return to the standard method, and
671mathematically it is almost that easy: the score is related to the log
672of the probability.
673
674Perhaps a better known example is the relationship between the old
675profile technology, as developped by Gribskov and Gibson along with
676others, and its probabilistic partner, profile Hidden Markov Models
677(profile HMMs).  In terms of the actual algorithm these two methods
678are very similar: it is simply that the profile HMM has a strong
679probabilistic model underlying it, allowing well established
680techniques to be used in its generation.
681
682\subsection{Introduction to Models in Wise2}
683Wise2 contains a number of algorithms, each of which are based around
684one of two biological models.
685
686\begin{description}
687\item[genewise] comparison of a related protein to genomic DNA
688\item[estwise] comparison of a related protein to cDNA (or ESTs)
689\end{description}
690
691This models themselves are built up from two component models, one for how
692protein residues are matched, and one for the gene prediction process. For the model
693of protein residues I have taken the established models of profile HMMs.
694The model of splicing and translation we developed with an eye to biology.
695It has many of the features of the GenScan model [chris Burge]. The model
696of translation (for estwise) is simple.
697
698\subsection{Model}
699
700The main model to understand is the genewise model (called genewise 21:93
701for reasons discussed below). It is this model which the other models
702are based on - for the estwise models, by removing the intron generating
703part of the models, and for the other genewise algorithms by making
704approximations to genewise21:93. A diagramatic representation of genewise21:93
705is shown in Figure \ref{Figure:genewise21}
706
707\begin{figure}
708\begin{center}
709\leavevmode
710\epsfxsize 300pt
711\epsfbox{genewise21.eps}
712\newline
713\caption{GeneWise21:93 Algorithm. The dark circles represent states, and the
714arrows between them transitions. Black transitions are standard
715protein transitions, red transitions are frameshifting transitions and
716green transitions are intronic transitions. Introns are each built of
717three states, listed at the bottem of the figure}
718\label{Figure:genewise2193}
719\end{center}
720\end{figure}
721
722
723The central part of the model is the Match-Insert-Delete trio common to both
724profile HMMs (such as HMMER models) and the smith waterman model. This trio
725of states is one model 'position' in the profile HMMs, where each model position
726contains a Match, Insert and Delete states. This means to interpret the figure
727of the model in the way the profile HMM models are usually displayed, you have
728to imagine a series of these states concatonated together. I imagine the model
729growing as stack of pages out from the figure, each new page being a new position
730in the profile HMM.
731
732The first addition to the model are the frameshifting transitions, shown in
733with x4 boxes above them. These occur whenever there is a transition which
734produces a codon: in effect all transitions that terminate at either match
735or insert states. There are four frameshifting transitions in each
736Notice that there are frameshifting transitions from Delete
737to Match, which is equivalent to saying that a frameshift occurs on the codon
738just after a run of deletions in the model. It is these sorts of frameshifts
739that are not well modelled by other algorithms.
740
741The second addition involves the intron emitting states found in the green boxes.
742Each intron is modelled by having 5 regions, two of which are fixed length. The
743five regions are
744
745\begin{itemize}
746\item 5'SS The splice site consensus region at the 5' end of the intron. Fixed length
747\item The central part of the intron that constitutes the major part of the intron
748\item The polypyrimidine tract (a region of C/T bias upstream of the 3'SS)
749\item an optional joining region between the poly-py tract and the 3'SS
750\item 3'SS The splice site consensus region at the 3' end of the intron. Fixed length
751\end{itemize}
752
753Notice that there is no branch site, because we could not produce a good enough
754statistical model for it.
755
756This model can be modelled using 3 states, with the fixed length regions being
757accommodated using transitions which emitted the appropiate length of sequence.
758
759Each of the intron models must be duplicated 3 times to account for
760the 3 different phases of introns (each phase being a different
761placement of the intron relative to the codon), so we need to
762duplicated these 3 states at least 3 times. In addition, if this
763intron lies in an insert state, ie, the surrounding protein sequence
764in the exons are being produced by an insert state in the underlying
765protein profile HMM, so we have to maintain that information across
766the intron. This means that we need to duplicate the intron states 6
767times in total: 3 times for the different phases and twice on top of
768that for the different protein states this intron could lie in.
769
770\subsubsection{Parameterisation of the model}
771
772The model presented above seems biological sensible, but how on earth are we
773going to parameterise it? Are we honestly going to let a user try to juggle the
774forty odd parameters inherent to this model? Clearly not. The approach we have
775taken to this is to provide set statistics derived from a maximum likelhood approach
776from known genes - this requires virtually no training - and then give switches
777to the user to turn on and off a variety of different parts of the algorithm.
778
779The model is parameterised as probabilities, but actually calculated
780in log space.  If you look in the code you would find that there is alot of switching
781between the two spaces: these are provided by the functions
782Probability2Score and Score2Probability (notice that the 'Score' here
783is very specific to the Wise2 package - you can't put any old score
784into Score2Probability to get a probability out as it depends on how
785that Score was converted into Log space).
786
787\subsubsection{The protein model}
788
789For the emissions of the actually underlying amino acids when we have
790a profile HMM, we are lucky - we can take the probabilies defined in
791the HMMer2 models. This is completely natural and means I don't have
792to worry about deriving probabilities for the profile HMMs
793
794In the case where we have a protein sequence, I somehow have to get to
795a profile HMM type representation. Thankfully the smith waterman
796algorithm in terms of architecture is very close to a profile HMM, and
797so the only problem is mapping the usual scores used in the smith
798waterman algorithm to probabilites. This is quite hard to do
799correctly, but I've hacked it by knowing that the blosum62 matrix is
800given in half bits, in other words using a 2*log2 mapping from
801probability space to the give scores in the matrix.  By reversing this
802process one can get pretty good emission probability for the amino
803acids. I now assumme that the gap penalities are \emph{as if} they were
804written in half bits. A certain amount of normalisation is required to
805make sure things add to one, and eh voila - one profile HMM from a
806single sequence.
807
808\subsubsection{Start End points}
809\label{sec:start_end}
810
811One interesting issue about the protein model is how the start end points
812work. For proteins it is obvious that for distant homology, it needs to
813be local - ie can start or finish anywhere in the sequence. For protein HMMs
814it is less clear. If a HMM really represents a single domain then global start
815end points are correct. However, many times local start end points are useful.
816
817The HMMer2 models internally carry whether this HMM is has global or local (or
818indeed any type) of start end policy.
819
820However, the genewise algorithm is quite dependent on the models being global
821to effectively predict introns in domains, when the looping algorithm (multiple
822copies of the domain) is present. This is because nearly always in a local
823HMM, an intron can be better modelled as the end of the domain half way
824through and the start of a new domain half way through, further down the sequence,
825thus not predicting the intron. To get clean intron prediction, one needs to go
826to global mode. However, using global mode forces the start and end point of the model
827to be really correct, and in some cases (in particular some Pfam models) this makes
828very incorrect results on the edges of the domain. To combat this another type
829of start end policy is introduced - wing. This has a local start mode for the first
83015 model positions and end mode for the last 15 model positions, but global in the
831central part of the model.
832
833In the programs one can set four types of start end policy
834
835\begin{itemize}
836\item default local for protein, and the HMM default for HMMs
837\item local   local
838\item global  global
839\item wing    local on the edges, global in the middle
840\end{itemize}
841
842
843\subsubsection{The gene model}
844
845For the emissions of the gene model we had to do more work. What we did was to
846make a database of known genes, with annotated gene structure. These
847genes then provided a raw set of counts for particular parts of the
848gene structure. It is these raw counts which are stored in the .gf files.
849(we store the raw counts because one might want to do something clever
850for deriving the probabilities of certain things using these counts.
851Counts are the basis for the probability derivations, not frequencies).
852
853The only issue here is what to do with the splice sites. We were well aware
854that the information in the splice sites is considerably more than just the
855simple position matrix. We chose to use a single branching (biased) decision
856tree, in which each branch either carried along the main trunk of the tree or
857ended in a leaf, each leaf representing a consensus build from A,T,G,C or N
858for any character. This decision tree could be easily constructed by chosing
859the most common consensus (where N is allowed where a position is better
860represented by N than any specific residue), and then removing that consensus
861from the list of observed consensi, and then repeating the process. This
862also gave us the same basis (counts) for each consensus used in the splice
863sites.
864
865One additional twist came about in the splice site development. The
866splice sites overlap between their consensi and the coding sequence
867region. These overlaps need to be treated correctly: the problem is
868that probabilistically we have two processes wanting to account for
869the same DNA bases. This was solved by assumming conditional
870independence between the two processes. A more formal mathematicall
871approach can be found in the documented called 'probappendix'.
872
873
874\subsubsection{The NULL model}
875
876The probability of the model has to compared to an alternative model
877(in fact to all alternative models which are possible) to allow proper
878Bayesian inference. This causes considerable difficulty in these
879algorithms because from a algorithmical point of view we would
880probably like to use an alternative model which is a single state,
881like the random model in profile-HMMs, where we can simply 'log-odd'
882the scored model, whereas from a biological point of view we probably
883want to use a full gene predicting alternative model.
884
885In addition we need to account for the fact that the protein HMM or
886protein homolog probably does not extend over all the gene sequence,
887nor in fact does the gene have to be the only gene in the DNA
888sequence. This means that there are very good splice
889sites/poly-pyrimidine tracts outside of the 'matched' alignment can
890severely de-rail the alignment.
891
892Basically we are in trouble with the random model parts of this
893problem.
894
895The solutions is different in the genewise21:93 compared to the genewise 6:23
896algorithms. Genewise 6:23 is shown in figure \ref{Figure:genewise623}
897
898\begin{figure}
899\begin{center}
900\leavevmode
901\epsfxsize 300pt
902\epsfbox{genewise6.eps}
903\newline
904\caption{GeneWise6:23}
905\label{Figure:genewise623}
906\end{center}
907\end{figure}
908
909\begin{itemize}
910\item In 6:23 we force the external match portions of the homology
911model to be identical to the alternative model, thus cancelling each
912other out. This is a pretty gross approximation and is sort of
913equivalent to the intron tie'ing. It makes things algorithmically
914easier... However this means a) 6:23 is nowhere near a probabilistic
915model and b) you really have to used a tied intron model in 6:23
916otherwise very bad edge effects (final introns being ridiculously
917long) occur.
918\item In 21:93 we have a full probabilistic model on each side
919of the homology segment. This is not reported in the -pretty output
920but you can see it in the -alb output if you like. Do not trust the
921gene model outside of the homology segment however. By having these
922external gene model parts we can use all the gene model features safe
923in the knowledge that if the homology segments do not justify the
924match then the external part of the model will soak up the additional
925intron/py-tract/splice site biases.
926\end{itemize}
927
928However this still does not solve the problem about what to compare it
929to.
930
931There are two approaches to the comparison
932
933\begin{description}
934\item[flat] The homology model is scored against a single state 0.25 emission
935model. This is effectively 'how likely is this DNA segement has any
936genes some with this homologous protein/HMM in it' for 21:93. It is,
937unsurprisingly, a massive 'yes' for nearly all biological DNA, and
938though a valid number in terms in bayesian inference pretty
939biologically uninteresing. There is also no decent interpretation of
940partial scores (ie, scores per domain).
941
942\item[syn] For synchronous model pretends that there is an alternative
943model of a complete gene which is dragged into the coding part of the
944gene when the homology model is in the coding part. This is not
945probabilistically valid, but gives better results and interpretable
946scores for partial regions, ie domain by domain. (in fact, very
947similar scores to protein sequences). However I'm worried about what I
948am doing It would be much better to get some mathematically
949justification for this.
950\end{description}
951
952
953\subsection{Algorithms}
954\label{sec:alg}
955The algorithms are then based around this central model, but
956have a variety of features removed from it progressively, either
957due to biological constraints (bacterial sequences have no introns,
958so there is no need to model them) or to speed up the the algorithm.
959
960Algorithms are named in two parts, \emph{descriptive-word} \emph{state-number:transition-number}.
961The descriptive word indicates the \emph{biological} model. At the moment
962there are 2 such biological models in the package
963\begin{description}
964\item[genewise] comparisons of protein information to genomic DNA
965\item[estwise] comparisons of protein information to cDNA/bacterial DNA (no
966introns)
967\end{description}
968There are many other models being worked on in development
969\begin{description}
970\item[sywise] comparisons of genomic DNA to genomic DNA
971\item[parawise] comparions of cDNA to cDNA
972\end{description}
973
974The \emph{state-number:transition-number} is the number of states in the model
975followed by the number of transitions. GeneWise 21:93 is the most complicated
976model, with 21 states and 93 transitions. The number of states is directly
977proportional to the memory usage of the program. The number of transitions
978is roughly proportional to the CPU time of the algorithm. For comparison the
979standard smithwaterman algorithm is a 3:7 algorithm (3 states, 7 transitions).
980These numbers are per compared residue - so as genomic DNA is some 1,000 fold
981longer than protein sequences on average, there is an additional massive
982CPU load.
983
984Finally the algorithms can be looping or not. A Looping algorithm is one in
985which the protein information can be repeated in the DNA target sequence.
986This could either be due to mutliple copies of the gene in the DNA sequence
987or multiple copies of a domain in a single gene. Looping algorithms are
988given a 'L' tag. By default, when you use profile-HMMs you use a looping model
989
990For the genewise family the following algorithms are available.
991\begin{description}
992\item[genewise 21:93] The largest genewise algorithm which also contains
993a complex flanking model to prevent inappropiate gene predictions
994\item[genewise 21:93L] The same algorithm with a looping mode. This allows
995a protein HMM (nearly always a HMM) to match multiple times a DNA sequence.
996This could be due to multiple domains in a single gene or multiple genes
997in a DNA sequence with the domain. The algorithm doesn't distinguish between
998these possibilities.
999\item[genewise 6:23] This is a smaller, (and so faster) algorithm. The
1000approximations made compared to genewise 21:93 are that there is no
1001poly-pyrimidine tract in the intron, and that introns from match states
1002are not distinct from introns in insert states.
1003
1004A side effect of these approximations is that 6:23 is much more robust
1005with respect to unmasked repeats and strange composition effects found
1006in the DNA sequences.
1007\item[genewise 6:23L] The same algorithm as 6:23 but in looping mode
1008\item[genewise 4:21] The smallest algorithm in the genewise family,
1009with an additional approximation of not distinguishing between introns
1010of different phases. This has been compiled for short protein sequences
1011only - effectively only profile-HMMs.
1012\end{description}
1013
1014For the estwise family the following algorithms are available
1015\begin{description}
1016\item[estwise 3:33] The largest estwise algorithm, modelling potential
1017insertion or deletions throughout the alignment of the protein
1018information to the DNA sequence.
1019\item[estwise 3:33L] The same algorithm but in looping mode.
1020\item[estwise 3:12] A slimmer algorithm designed for faster db searching.
1021The algorithm models enough insertions or deletions of DNA bases to
1022'ride through' a indel region without too much penalty, even if it
1023doesn't model the most correct one.
1024\end{description}
1025
1026\subsection{Scores}
1027
1028The scoring system for the algorithms, as eluded to earlier is a
1029Bayesian score. This score is related to the probability that model
1030provided in the algorithm exists in the sequence (often called the
1031posterior). Rather than expressing this probability directly I report
1032a log-odds ratio of the likelhoods of the model compared to a random
1033model of DNA sequence. This ratio (often called \emph{bits score}
1034because the log is base 2) should be such that a score of 0 means that
1035the two alternatives \emph{it has this homology} and \emph{it is a
1036random DNA sequence} are equally likely. However there are two
1037features of the scoring scheme that are not worked into the score that
1038means that some extra calculations are required
1039
1040\begin{itemize}
1041\item The score is reported as a likelhood of the models, and to
1042convert this to a posterior probability you need to factor in the
1043ratio of the prior probabilities for a match. Because you expect a far
1044greater number of sequences to be random than not, this probability of
1045your prior knowledge needs to be worked in. Offhand sensible priors
1046would in the order of probability that there is a match being roughly
1047proportional to the database size.
1048
1049\item The posterior probability should not merely be in favour of the
1050homology model over the random model but also be confident in it. In
1051other words you would want probabilities in the 0.95 or 0.99 range
1052before being confident that this match was correct.
1053\end{itemize}
1054
1055These two features mean that the reported bits score needs to be above
1056some threshold which combines the effect of the prior probabilities
1057and the need to have confidence in the posterior probability. In this
1058field people do not tend to work the threshold out rigorously using
1059the above technique, as in fact, deficiencies in the model mean that
1060you end up choosing some arbitary number for a cutoff. In my
1061experience, the following things hold true: bit scores above 35 nearly
1062always mean that there is something there, bit scores between 25-35
1063generally are true, and bit scores between 18-25 in some families are
1064true but in other families definitely noise. I don't trust anything
1065with a bit score less than 15 bits for these DNA based searches. For
1066protein-HMM to protein there are a number of cases where very negative
1067bit scores are still 'real' (this is best shown by a classical
1068statistical method, usually given as evalues, which is available from
1069the HMMer2 package), but this doesn't seem to occur in the DNA
1070searches.
1071
1072I have been thinking about using a classical statistic method on top
1073of the bit score, assumming the distribution is an extreme value
1074distribution (EVD), but for DNA it becomes difficult to know what to
1075do with the problem of different lengths of DNA. As these can be
1076wildly different, it is hard to know precisely how to handle
1077it. Currently a single HMM compared to a DNA database can produce
1078evalues using Sean Eddy's EVD fitting code but, I am not completely
1079confident that I am doing the correct thing.  Please use it, but keep
1080in mind that it is an experimental feature.
1081
1082\newpage
1083
1084\section{Halfwise and Blastwise}
1085\label{half_and_blast}
1086
1087The use of genewise in large scale analysis is beyond most people's CPU
1088abilities. To counter this I have written two scripts which allow people
1089to use genewise more sensibly.
1090\begin{itemize}
1091\item Halfwise - a Perl script that compares a DNA sequence to Pfam sensibly,
1092using BLAST to speed up the process.
1093\item Blastwise - a Perl script that compares a DNA sequence to a protein
1094database, using BLASTX and then calls genewise on a carefully selected
1095set of proteins
1096\end{itemize}
1097To run halfwise you will need
1098\begin{itemize}
1099\item The Wise2 package, compiled to provide the genewisedb executable at least
1100\item One of the blastx type programs, either blast 1 series, blast 2 series from ncbi or wublast from
1101warren gish (bioperl automatically detects the different flavours of blast and adjusts).
1102\item The bioperl distribution, preferably the 0.05 series
1103\item The halfwise protein database, found at ftp://ftp.sanger.ac.uk/pub/birney/wise2/halfwise
1104\item The halfwise Pfam database, at the same ftp site
1105\item The HMMER package, version 2.1 series
1106\end{itemize}
1107
1108The halfwise database is made from the Pfam FULL alignments, made non redundant to
110975%. This gives a good coverage of the protein sequences represented by Pfam whilst
1110being quite a small database.
1111
1112To install halfwise you need to
1113\begin{itemize}
1114\item place the halfwise protein database in the directory pointed by BLASTDB and either
1115pressdb or setdb depending on which version of blast you are going to use. (If you don't have
1116the BLASTDB directory set up, make a directory called blastdb and set the environment variable
1117BLASTDB to point to that)
1118\item install bioperl, best by following the instructions in the README
1119\item install Wise2
1120\item install the latest HMMER
1121\item place the HMM library in BLASTDB and run hmmindex (from the HMMER package) on it
1122\item edit the information at the top of the halfwise.pl to point to the correct executables, if need be
1123\end{itemize}
1124
1125To run halfwise go
1126\begin{verbatim}
1127halfwise dna.seq > dna.seq.hlf
1128\end{verbatim}
1129
1130halfwise by itself gives you help about it.
1131
1132To run blastwise you will need
1133\begin{itemize}
1134\item a blastable protein database
1135\item one of the blastx type programs, as above
1136\item The Wise2 package, having made the perl port. This is done by going
1137``make perl'' in the root directory
1138\item Bioperl version 0.05 or above
1139\item a way of fetching fasta formatted sequences from teh protein database, eg SRS
1140\end{itemize}
1141
1142Install bioperl and blast as before, install the Wise2 perl port. Edit
1143the blastwise.pl script, making sure you change protein database and
1144the GETZ line lower down to represent the way of getting sequences.
1145
1146To run blastwise go
1147\begin{verbatim}
1148blastwise.pl dna.seq > dna.seq.blw
1149\end{verbatim}
1150
1151The blastwise script is designed to be adjusted to fit your site. There
1152are a number of us world wide concentrating on extending and improving blastwise.
1153Please get in touch if you want to help.
1154\section{Principle Programs}
1155
1156The main programs are genewise, genewisedb, estwise, estwisedb.
1157These all have basically the same
1158running mode
1159
1160\begin{verbatim}
1161%genewise protein-file dna-file
1162\end{verbatim}
1163
1164A number of options are common to these programs from the point
1165of view of how they run
1166
1167\begin{description}
1168\item[-help] verbose help of all options
1169\item[-version]   show version and compile info
1170\item[-silent] No messages on stderr, whether reports or warnings
1171\item[-quiet] No reports or information messages on stderr
1172\item[-erroroffstd] No warning messages to stderr, but reports are still issued
1173\item[-errorlog] [file] Log warning messages to file (useful for sending to me)
1174\end{description}
1175
1176You will probably want to read the \ref{sec:commonmode} common modes of usage
1177section as well
1178
1179
1180
1181\subsection{genewise}
1182
1183Genewise compares a protein sequence or a protein profile HMM to a dna sequence
1184
1185\subsubsection{genewise - options: dna/protein}
1186
1187\begin{description}
1188\item[-u]               start position in dna
1189\item[-v]               end position in dna
1190\item[-trev]            Compare on the reverse strand
1191\item[-tfor] (default)  Compare on the forward strand
1192\item[-both]            Both strands
1193\item[-tabs]            Report positions as absolute to truncated/reverse sequence
1194\item[-s]               start position in protein - has no meaning for HMMs
1195\item[-t]               end   position in protein - has no meaning for HMMs
1196\item[-gap] [no] default [12]  gap penalty to use for protein comparisons. This
1197is used to estimate a probability per gap
1198\item[-ext] [no] default [2]  extension penalty to use for protein comparisons.
1199This is used to estimate a probability for an extension of a gap
1200\item[-matrix] default [blosum62.bla]  Comparison matrix. Must be in half-bit
1201units (blosum62 is in half bits). This is used to estimate a probability of amino
1202acid comparisons
1203\item[-hmmer]           Protein file is HMMer 2 HMM
1204\item[-hname]           Use this as the name of the HMM.
1205\item[-init]   [default/global/local/wing]  (see section \ref{sec:start_end})
1206For protein sequences the default is to be local (like
1207smith waterman). For protein profile HMMs, the default is read from the HMM - the
1208HMM carries this information internally. The global mode is equivalent to to the ls building option
1209(the default in the HMMer2 package). The local mode is equivalent to to the fs building option (-f)
1210in the HMMer2 package. The wing model is local on the edges and global in the middle.
1211\end{description}
1212\subsubsection{genewise - options: gene model}
1213
1214\begin{description}
1215\item[-codon] [codon.table] Codon file. The default is for the
1216universal code, but you can supply your own
1217\item[-gene] [human.gf] Gene parameter file. Provide statistics for
1218different gene models. Current human.gf and worm.gf are provided. The
1219statistics are basically too complicated to explain here.
1220\item[-subs] [1e-05] Substitution error rate, ie the assummed
1221probability of base substitutions in the sequencing reaction/assembly
1222that provided the DNA sequence. The substituion error is what
1223dominates the penalty for stop codons - a higher error rate implies a
1224smaller penalty for stop codons
1225\item[-indel] [1e-05] Insertion/deletion error rate, ie the assummed
1226probability of indel events in the sequencing reaction/assembly that
1227provided the DNA sequence. The indel rate is what provides the penalty
1228for frameshift errors. A higher error rate implies a smaller penalty
1229for indels.
1230\item[-cfreq] [model/flat] Using codon bias or not?  [default flat] -
1231a reasonably pointless option now, as it only applies when using -syn
1232flat. If codon bias is modelled, then common codons score more than
1233uncommons one for the same amino acid.
1234\item[-splice] [model/flat] Using splice model or GT/AG? [default
1235model] - use the full blown model for splice sites, or a simplistic
1236GT/AG. Generally if you are using a DNA sequence which is from human
1237or worm, then leave this on. If you are using a very different (eg
1238plant) species, switch it off.
1239\item[-intron] [model/tied] Use tied model for introns [default tied]
1240- whether intron base distribution effects the parse. Because varying
1241GC content and/or repeats can seriously drag the algorithm away from
1242correct parses when intron base distribution is used, this is usually
1243switched off.
1244\item[-null] [syn/flat] Random Model as synchronous or flat [default
1245syn] - whether to use a null model which is a simple base distribution
1246(called flat), or imagine that the viterbi path is being compared to a
1247gene based null model that is making all the same gene exon/intron
1248boundaries (synchronous). The latter is basically a hack which
1249demphaises the gene prediction machinery and tries to trust the
1250homology machinery. (not ideal!)
1251
1252\item[-pg] [file] Potential Gene file (heurestic for speeding
1253alignments). The potential gene file should look like
1254\begin{verbatim}
1255pgene # stands for potential gene
1256ptrans # stands for potential transcript
1257pexon <start-in-dna> <end-in-dna> <start-in-protein> <end-in-protein>
1258pexon <start-in-dna> <end-in-dna> <start-in-protein> <end-in-protein>
1259...
1260endptrans
1261<another ptrans if you like>
1262endpgene
1263\end{verbatim}
1264
1265When this file is read in, it provides a series of start/end in dna
1266and protein sequences around which is drawn an envelope of possibly
1267alignment area. The alignment is then calculated only in this area
1268
1269This feature has not been well tested yet. any potential bugs reported in are very useful.
1270\item[-alg] [623/623L/2193/2193L/6LITE] Algorithm used [default 623/623L]
1271You should read the section on algorithms (\ref{sec:alg}). Basically
1272623 and 623L are cheaper computationally and more robust with respect
1273to repeats etc. 2193 and 2193L are much more expensive, more sensitive
1274to changes in parameters but potentially more accurate.
1275\item[-kbyte] [ 2000] Max number of kilobytes used in main
1276calculation. Indicates how much memory can be used for the dynamic
1277programming calculation.
1278\end{description}
1279\subsubsection{genewise - options: output}
1280
1281All output options can be used at the same time. They are separated by
1282the value to -divide option
1283
1284\begin{description}
1285
1286\item[-pretty] show pretty ascii output, as see in Section 2
1287\item[-pseudo] For genes with frameshifts, mark them as pseudo genes
1288\item[-genes] show gene structure - as
1289\begin{verbatim}
1290Gene 1
1291  Gene 1386 3963
1292    Exon 1386 1493
1293    Exon 1789 1935
1294    Exon 2084 2294
1295    Exon 2388 2480
1296    Exon 2794 2868
1297    Exon 3073 3228
1298    Exon 3806 3963
1299//
1300\end{verbatim}
1301
1302\item[-para]            show parameters
1303\item[-sum]             show summary output. Shows output as
1304\begin{verbatim}
1305Bits   Query         start end Target      start end   idels introns
1306230.57 roa1_drome     26  347 HSHNRNPA     1386 3963    0    6
1307\end{verbatim}
1308This is useful for parsing, but probably if you want to do something
1309like that you want to get hold of the API directly.
1310\item[-cdna]            show cDNA
1311Show a fasta format of the predicted cDNA sequence
1312\item[-trans]           show protein translation
1313Show a fasta format of the predicted protein sequence. Breaks on frameshifts
1314\item[-pep]             show predicted peptide. Shows predicted peptide,
1315including frameshifts, which are X's in the proteins
1316\item[-ace]             ace file gene structure - ACeDB subsequence model
1317\begin{verbatim}
1318  Sequence HSHNRNPA
1319  subsequence HSHNRNPA.1 1386 3963
1320
1321  Sequence HSHNRNPA.1
1322  CDS
1323  CDS_predicted_by genewise 0.00
1324  source_Exons 1 108
1325  source_Exons 404 550
1326  source_Exons 699 909
1327  source_Exons 1003 1095
1328  source_Exons 1409 1483
1329  source_Exons 1688 1843
1330  source_Exons 2421 257
1331\end{verbatim}
1332\item[-gff]             Gene Feature Format file - useful for programs which also support GFF
1333\begin{verbatim}
1334  HSHNRNPA GeneWise cds_exon 1386 1494 0.00 + 0
1335  HSHNRNPA GeneWise cds_exon 1789 1936 0.00 + 0
1336  HSHNRNPA GeneWise cds_exon 2084 2295 0.00 + 0
1337\end{verbatim}
1338\item[-gener]           raw gene structure - a debugging output
1339\item[-alb]             show logical AlnBlock alignment - a debugging output
1340\item[-pal]             show raw matrix alignment - a debugging output
1341\item[-block]  [50]     Length of main block in pretty output
1342\item[-divide] [//]     divide string for multiple outputs
1343\end{description}
1344
1345\subsection{genewisedb}
1346
1347genewisedb is the database searching version of genewise.  It takes a
1348database of proteins and compares it to a database of dna sequences
1349\subsubsection{genewisedb - search modes}
1350\begin{description}
1351\item[-protein]  [default] single protein. Protein is a single protein sequence in fasta format
1352\item[-prodb]    protein fasta format db. Protein is a database of protein sequences in fasta format
1353\item[-pfam]     pfam hmm library. Protein is a database of HMMer2 models as a single file
1354\item[-pfam2]    pfam old style model directory (2.1). Protein is a directory of HMMs with a file
1355called HMMs in it indicating which HMMs there. This is how Pfam databases 2.1 and lower were distributed
1356\item[-hmmer]    single hmmer HMM (version 2 compatible). Protein is a single HMM
1357\item[-dnadb]    [default] dna fasta database. The DNA sequence is a fasta format
1358file with multiple sequences
1359\item[-dnas]     a single dna fasta sequence. The DNA sequence is a single sequence in fasta format
1360\end{description}
1361\subsubsection{genewisedb - protein comparison options}
1362\begin{description}
1363\item[-gap]    [ 12]  gap penalty - see genewise option
1364\item[-ext]    [  2]  extension penalty - see genewise option
1365\item[-matrix] [blosum62.bla]  Comparison matrix - see genewise option
1366\item[-hname]           For single hmms, use this as the name, not filename
1367\end{description}
1368\subsubsection{genewisedb - gene model options}
1369Many of these options are identical to the genewise options
1370listed above
1371\begin{description}
1372\item[-init]   [default/global/local/wing]  (see section \ref{sec:start_end})
1373For protein sequences the default is to be local (like
1374smith waterman). For protein profile HMMs, the default is read from the HMM - the
1375HMM carries this information internally. The global mode is equivalent to to the ls building option
1376(the default in the HMMer2 package). The local mode is equivalent to to the fs building option (-f)
1377in the HMMer2 package. The wing model is local on the edges and global in the middle.
1378\item[-codon]  [codon.table]  Codon file -see genewise option
1379\item[-gene]   [human.gf]  Gene parameter file - see genewise option
1380\item[-subs]   [1e-05] Substitution error rate - see genewise option
1381\item[-indel]  [1e-05] Insertion/deletion error rate - see genewise option
1382\item[-cfreq]  [model/flat] Using codon bias or not?     [default flat] - see genewise option
1383\item[-splice] [model/flat] Using splice model or GT/AG? [default model] - see genewise option
1384\item[-intron] [model/tied] Use tied model for introns   [default tied] - see genewise option
1385\item[-null]   [syn/flat]   Random Model as synchronous or flat [default syn] - see genewise option
1386\item[-alg]    [421/623/2193/]            Algorithm used for searching [default 623]
1387	The is the algorithm to use for the database search part of the process. 421 is the
1388	cheapest algorithm but can only be used with HMMs or small proteins as it has been compiled
1389	for a limited size of query. Looping algorithms (623L and 2193L) are not permitted as it is
1390	hard to interpret the results
1391\item[-aalg]   [623/623L/2193/2193L]  Algorithm used for alignment [default 623/623L]
1392	This is the algorithm used for the alignment of the matches. The default for proteins is
1393623, whereas for HMMs it is the looping model 623L.
1394\item[-kbyte]  [ 2000]  Max number of kilobytes used in alignments calculation. Maximum amount of
1395	memory allowed in the alignment process.
1396\item[-cut]    [20.00]   Bits cutoff for reporting in search algorithm. Comparisons scoring greater
1397than this cutoff are aligned.
1398\item[-ecut]   [n/a]     Evalue cutoff only for searches which can calculate evalues
1399\item[-aln]    [50]   Max number of alignments (even if above cut). A cutoff for the number of
1400alignments, whatever their bits score.
1401\item[-nohis]           Don't show histogram on single protein/hmm vs DNA search.
1402On a single protein (or hmm) vs DNA database search an on-the-fly evalue score is calculated.
1403This disables the production of a histogram
1404\item[-report] [0]      Issue a report every x comparisons (default 0 comparisons). Mainly for debugging
1405\end{description}
1406\subsubsection{genewisedb output - for each comparison}
1407For each alignment made by genewisedb you can output it as a number
1408of different options
1409\begin{description}
1410\item[-pretty]          show pretty ascii output, as in genewise
1411\item[-pseudo]          For genes with frameshifts, mark them as pseudo genes
1412\item[-genes]           show gene structure, as in genewise
1413\item[-para]            show parameters, as in genewise
1414\item[-sum]             show summary output, as in genewise
1415\item[-cdna]            show cDNA, as in genewise
1416\item[-trans]           show protein translation, as in genewise
1417\item[-ace]             ace file gene structure, as in genewise
1418\item[-gff]             Gene Feature Format file, as in genewise
1419\item[-gener]           raw gene structure, as in genewise
1420\item[-alb]             show logical AlnBlock alignment, as in genewise
1421\item[-pal]             show raw matrix alignment, as in genewise
1422\item[-block]  [50]     Length of main block in pretty output, as in genewise
1423\item[-divide] [//]     divide string for multiple outputs, as in genewise
1424\end{description}
1425\subsubsection{genewisedb output - complete analysis}
1426Each alignment produces a notional gene prediction. At the
1427end of the output, these gene predictions can be displayed
1428together. This only works for -pfam or -prodb and -dnas
1429options, ie a database of protein information vs a single
1430dna sequence
1431
1432In the future it is hoped that additional options (such
1433as merging consistent gene predictions) will operate
1434before these outptus are made
1435
1436\begin{description}
1437\item[-ctrans]          provide all translations
1438\item[-ccdna]           provide all cdna
1439\item[-cgene]           provide all gene structures
1440\item[-cace]            provide all gene structures in ace format
1441\end{description}
1442
1443\subsection{estwise}
1444Estwise runs very much like genewise with basically a subset of
1445options. For completeness they are all listed below
1446\subsubsection{estwise - options: dna/protein}
1447\begin{description}
1448\item[-u]               start position in dna
1449\item[-v]               end position in dna
1450\item[-trev]            reverse complement dna
1451\item[-tfor]            use forward strands only
1452\item[-both] [default]  do both strands
1453\item[-tabs]            Positions reported as absolute to DNA
1454\item[-s]               start position in protein
1455\item[-t]               end   position in protein
1456\item[-gap]    [ 12]  gap penalty
1457\item[-ext]    [  2]  extension penalty
1458\item[-matrix] [blosum62.bla]   Comparison matrix
1459\item[-hmmer]           Protein file is HMMer 1.x file
1460\item[-hname]           Name of HMM rather than using the filename
1461\end{description}
1462\subsubsection{estwise - options: model}
1463\begin{description}
1464\item[-init]   [default/global/local/wing]  (see section \ref{sec:start_end})
1465For protein sequences the default is to be local (like
1466smith waterman). For protein profile HMMs, the default is read from the HMM - the
1467HMM carries this information internally. The global mode is equivalent to to the ls building option
1468(the default in the HMMer2 package). The local mode is equivalent to to the fs building option (-f)
1469in the HMMer2 package. The wing model is local on the edges and global in the middle.
1470
1471\item[-codon]  [codon.table]  Codon file. The default is for the universal code, but
1472you can supply your own
1473\item[-subs]   [0.01] Substitution error rate, ie the assummed probability of base substitutions
1474in the sequencing reaction/assembly that provided the DNA sequence. The substituion error is what dominates
1475the penalty for stop codons - a higher error rate implies a smaller penalty for stop codons
1476\item[-indel]  [0.01] Insertion/deletion error rate, ie the assummed probability of indel events
1477in the sequencing reaction/assembly that provided the DNA sequence. The indel rate is what provides
1478the penalty for frameshift errors. A higher error rate implies a smaller penalty for indels.
1479\item[-null]   [syn/flat]   Random Model as synchronous or flat [default syn]
1480 whether to use
1481a null model which is a simple base distribution (called flat), or imagine that the viterbi path
1482is being compared to a gene based null model that is making all the same gene exon/intron boundaries
1483(synchronous). The latter is basically a hack which demphaises the placement of frameshifts and
1484tries to trust the homology machinery. (not ideal!)
1485\item[-alg]    [333,333L,333F]  Algorithm used. 333 is the normal algorithm. 333L is the looping
1486algorithm
1487\item[-kbyte]  [ 2000]  Max number of kilobytes used in main calculation
1488\item[-pretty]          show pretty ascii output as in genewise
1489\item[-para]            show parameters
1490\item[-sum]             show summary information as in genewise
1491\item[-alb]             show logical AlnBlock alignment, debugging output
1492\item[-pal]             show raw matrix alignment, debugging output
1493\item[-block]  [50]     Length of main block in pretty output - the length of the main text in the pretty
1494output
1495\item[-divide] [//]     divide string for multiple outputs, the string used to separate multiple outputs
1496\end{description}
1497
1498\subsection{estwisedb}
1499estwisedb is the database searching version of the
1500estwise program. Like estwise, it has the same sort of
1501running modes as genewisedb, but with more limited options.
1502\subsubsection{estwisedb - options: running modes}
1503\begin{description}
1504\item[-protein]  [default] single protein
1505\item[-prodb]    protein fasta format db
1506\item[-pfam]     pfam hmm library
1507\item[-pfam2]    pfam style model directory (2.1)
1508\item[-hmmer]    single hmmer 1.x HMM
1509\item[-dnadb]    [default] dna fasta database
1510\item[-dnas]     a single dna fasta sequence
1511\end{description}
1512\subsubsection{estwisedb - options: model}
1513\begin{description}
1514\item[-gap]    [ 12]  gap penalty
1515\item[-ext]    [  2]  extension penalty
1516\item[-matrix] [blosum62.bla]  Comparison matrix
1517\item[-hname]           For single hmms, use this as the name, not filename
1518\item[-codon]  [codon.table]  Codon file
1519\item[-subs]   [0.01] Substitution error rate
1520\item[-indel]  [0.01] Insertion/deletion error rate
1521\item[-null]   [syn/flat]   Random Model as synchronous or flat [default syn]
1522\item[-alg]    [333/312/312Q]            Algorithm used for searching [default 312]
1523\item[-aalg]   [333/333L]        Algorithm used for alignment [default 623]
1524\item[-kbyte]  [ 2000]  Max number of kilobytes used in alignments calculation
1525\item[-cut]    [20.00]   Bits cutoff for reporting in search algorithm
1526\item[-ecut]   [n/a]     Evalue cutoff only for searches which can calculate evalues
1527\item[-aln]    [50]   Max number of alignments (even if above cut)
1528\item[-nohis]           Don't show histogram on single protein/hmm vs DNA search
1529\item[-report] [0]      Issue a report every x comparisons (default 0 comparisons)
1530\end{description}
1531\subsubsection{estwisedb - options: output}
1532\begin{description}
1533\item[-pretty]          show pretty ascii output
1534\item[-para]            show parameters
1535\item[-sum]             show summary output
1536\item[-alb]             show logical AlnBlock alignment
1537\item[-pal]             show raw matrix alignment
1538\item[-mul]             produce complete protein multiple alignment from a HMM to DNA db search as a mul format M/A.
1539\item[-pep]             show predicted peptide. Shows predicted peptide,
1540including frameshifts, which are X's in the proteins
1541\item[-block]  [50]     Length of main block in pretty output
1542\item[-divide] [//]     divide string for multiple outputs
1543\item[-help]      help
1544\item[-version]   show version and compile info
1545\item[-silent]    No messages on stderr
1546\item[-quiet]    No report on stderr
1547\item[-erroroffstd] No warning messages to stderr
1548\item[-errorlog] [file] Log warning messages to file
1549\end{description}
1550
1551\subsection{Running with pthreads}
1552\label{running_pthread}
1553
1554The two database searching programs, genewisedb and estwisedb can be run with pthread
1555support on SMP boxes. To do so you need to compile the source code with pthread
1556support (it is very easy, see section \ref{compile_pthread}). Then the programs
1557need to be run with the additional option {\tt -pthread}. On most machines the
1558executable will pick up the number of available processors automatically and
1559run that number of threads. If you want to override this use the {\tt -pthr\_no} option.
1560
1561
1562\newpage
1563\section{Other Programs}
1564
1565There are other programs in the wise2 package which are sometimes pretty
1566well worked out (eg promoterwise) and sometimes just a little standard program
1567(eg, psw).
1568
1569
1570\subsection{promoterwise}
1571
1572promoterwise is a sort of next generation DBA (see next section). It
1573is designed for comparisons between two promoter sequences or
1574realistically any two orthologous regulatory regions (or homologous
1575for that matter, but in theory it should work better for orthologous
1576regulatory regions, depending on how much active change you expect
1577paralogous regulatory regions to have). Promoterwise reports alignments
1578between these two sequences assumming that alignments cannot overlap in
1579both sequences, but *not* assumming that the alignments have to be co
1580linear or on the same strand.
1581
1582
1583Promoterwise works by taking the two sequences and then finds all
1584common exact 7mers between them, in both the forward and reverse
1585strands. These are then merged such that close HSPs (whoes centers are
1586within the window size of each other) are considered one region.
1587These regions then have a local version of the DBA algorithm run over
1588them, which has a model of DNA similarity of small regions of
1589similarity, potentially with small gaps separated by large pieces of
1590unknown DNA.
1591
1592
1593The resulting set of alignments are then sorted by score, and a simple
1594greedy algorithm is used to discard ``bad'' subsequent alignments. By
1595default this is to discard alignments which overlap on the query
1596coordinate with alignments of a higher score (this can be
1597changed). The alignments are then outputted with bits score. In my
1598hands I think a bit score of over 20bits looks good.
1599
1600
1601Of course there are many options to change here.
1602
1603
1604\subsubsection{promoterwise - options}
1605\begin{description}
1606\item[-s]     query start position restriction
1607\item[-t]    query end position restriction
1608\item[-u]    target start position restriction
1609\item[-v]    target end position restriction
1610\item[-lhwindow]   sequence window given to alignment, default 50
1611\item[-lhseed]    seed score cutoff in bits, defualt 10.0
1612\item[-lhaln]    aln  score cutoff, default 8.0 bits
1613\item[-lhscore]    sort final list by score (default by position)
1614\item[-lhreject] [none/query/both] - overlap rejection criteria in greedy assembly [query]
1615\item[-lhmax]    maximum number of processed hits - default 20000
1616\item[-hitoutput] [pseudoblast/xml/tab] pseudoblast by default
1617\item[-hithelp]   more detailed help on hitlist formats
1618\item[-dymem]      memory style [default/linear/explicit]
1619\item[-kbyte]       memory amount to use [4000]
1620\item[-\[no\]dycache] implicitly cache dy matrix usage (default yes)
1621\item[-dydebug]     drop into dynamite dp matrix debugger
1622\item[-paldebug]    print PackAln after debugger run if used
1623\item[-help]      show help options
1624\item[-version]   show version and compile info
1625\item[-silent]    No messages on stderr
1626\item[-quiet]     No report on stderr
1627\item[-erroroffstd] No warning messages to stderr
1628\item[-errorlog] [file] Log warning messages to file
1629\end{description}
1630
1631
1632
1633\subsection{dba - Dna Block Aligner}
1634\label{sec:dba}
1635
1636dba - standing for Dna Block Aligner, was developped by Niclas Jareborg,
1637Richard Durbin and Ewan Birney for characterising shared regulatory regions
1638of genomic DNA, either in upstream regions or introns of genes
1639
1640The idea was that in these regions there would a series of shared motifs,
1641perhaps with one or two insertions or deletions but between motifs there
1642would be any length of sequence.
1643
1644The subsquent model was a 3 state model which was log-odd'd ratio to a null
1645model of their being no examples of a motif in the two sequences.
1646
1647\subsubsection{dba - options}
1648\begin{description}
1649\item[-match] [0.8]      match probability
1650\item[-gap] [0.05]       gap probability
1651\item[-blockopen] [0.01] block open probability
1652\item[-umatch] [0.99]    unmatched gap probability
1653\item[-nomatchn]         do not match N to any base
1654\item[-align]            show alignment
1655\item[-params]           print parameters
1656\item[-help]            print this message
1657\end{description}
1658
1659\subsection{psw - Protein Smith-Waterman and other comparisons}
1660\label{sec:psw}
1661
1662psw is a short and sweet program for calculating smith waterman alginments
1663quickly. It was mainly written as C driver to test the underlying code which
1664is more useful in things like the Perl port.
1665
1666More recently I added in the generalised gap penalty model of Stephen
1667Altschul, that is known as the \emph{abc} model in Wise2.  The abc
1668model is detailed in Proteins 1998 Jul 1, 32 pages 88-96.
1669
1670\subsubsection{psw - options}
1671\begin{description}
1672\item[-g] gap penalty (default 12) - gap penalty used for smith waterman
1673\item[-e] ext penatly (default 2) - ext penalty used for smith waterman
1674\item[-m] comp matrix (default blosum62.bla) - comparison matrix used for
1675both smith waterman and the abc model
1676\item[-abc] use the abc model: use Stephen Altschul's 'generalised gap penalty'
1677model (called the abc model in Wise2)
1678\item[-a]   a penalty for above (default 120) gap opening penalty in the abc model
1679\item[-b]   b penalty for above (default 10) gap extension penalty in the abc model
1680\item[-c]   c penalty for above (default 3) unmatched 'gap' region penalty in the abc model
1681\item[-r] show raw output - raw matrix output
1682\item[-l] show label output - label based output
1683\item[-f] show fancy output - pretty output
1684\end{description}
1685
1686\subsection{pswdb}
1687
1688pswdb - protein smith waterman database searching was written by Richard Copley using
1689the underlying Wise2 libraries
1690\subsubsection{psw - options}
1691\begin{description}
1692\item[-g] gap penalty (default 12) - gap penalty used for smith waterman
1693\item[-e] ext penatly (default 2) - ext penalty used for smith waterman
1694\item[-m] comp matrix (default blosum62.bla) - comparison matrix used for
1695both smith waterman and the abc model
1696\item[-abc] use the abc model: use Stephen Altschul's 'generalised gap penalty'
1697model (called the abc model in Wise2)
1698\item[-a]   a penalty for above (default 120) gap opening penalty in the abc model
1699\item[-b]   b penalty for above (default 10) gap extension penalty in the abc model
1700\item[-c]   c penalty for above (default 3) unmatched 'gap' region penalty in the abc model
1701\item[-max\_desc] Maximum number of description lines
1702\item[-max\_aln] Maximum number of alignments
1703\item[-ids] in alignments, show sequence names, not probe/target
1704\item[-r] show raw output - raw matrix output
1705\item[-l] show label output - label based output
1706\item[-f] show fancy output - pretty output
1707\end{description}
1708
1709\section{API}
1710
1711There used to be a direct Perl binding API. No longer. Frankly why I thought this
1712was a good idea is now beyond me (the excitment of youth. The thrill of binding
1713C directly to Perl. The head thumping complexity of XS). Wise2 programs are best
1714run on the command line or shell'd out from scripts and then parsed in.
1715
1716\end{document}
1717
1718
1719
1720
1721
1722
1723
1724
1725
1726
1727
1728