1\documentclass[11pt]{article}
2\RequirePackage{helvet}
3\newcommand{\CURRENT}{fasta-36.3.8}
4\renewcommand{\familydefault}{\sfdefault}
5\usepackage{cite}
6\usepackage{url}
7\usepackage{fancyhdr}
8\usepackage{url}
9\usepackage{needspace}
10\usepackage{longtable}
11\addtolength{\oddsidemargin}{-0.75in}
12\addtolength{\evensidemargin}{-0.75in}
13\addtolength{\textwidth}{1.5in}
14\addtolength{\topmargin}{-0.75in}
15\addtolength{\textheight}{1.5in}
16
17\lhead{\CURRENT}
18\rhead{\today}
19\cfoot{\thepage}
20\pagestyle{fancy}
21\newcommand{\FASTA}{\texttt{FASTA }}
22\hyphenation{Swiss-Prot}
23
24\parskip 0.5ex
25
26\begin{document}
27%% .he 'FASTA3.DOC''Release 3.6, March, 2011'
28
29\section*{\Large{The FASTA program package}}
30
31%% \begin{quote}
32%% \emph{This document is undergoing extensive revision.
33%% Some parts of it are old; while still accurate, they are less
34%% relevant.  Recent improvements to the FASTA programs are not well
35%% documented, particularly with respect to options for selecting
36%% databases and database sequences.}
37%% \end{quote}
38
39\section*{Introduction}
40
41This documentation describes the version 36 of the FASTA program
42package (see W. R. Pearson and D. J. Lipman (1988), ``Improved Tools
43for Biological Sequence Analysis'', PNAS 85:2444-2448,\cite{wrp881}
44W. R. Pearson (1996) ``Effective protein sequence comparison''
45Meth. Enzymol. 266:227-258 \cite{wrp960}; and Pearson et. al. (1997)
46Genomics 46:24-36 \cite{wrp973}.  Version 3 of the FASTA packages
47contains many programs for searching DNA and protein databases and for
48evaluating statistical significance from randomly shuffled sequences.
49
50This document is divided into four sections: (1) A summary overview of
51the programs in the FASTA3 package; (2) A guide to using the FASTA
52programs; (3) A guide to installing the programs and
53databases. Section (4) provides answers to some Frequently Asked
54Questions (FAQs).  In addition to this document, the
55\texttt{changes\_v36.html}, \texttt{changes\_v35.html} and
56\texttt{changes\_v34.html} files list functional changes to the programs.
57The \texttt{readme.v30..v36} files provide a more complete revision
58history of the programs, including bug fixes.
59
60The programs are easy to use; if you are using them on a machine that
61is administered by someone else, you can focus on sections (1) and (2)
62to learn how to use the programs.  If you are installing the programs
63on your own machine, you will need to read section (3) carefully.
64
65\emph{FASTA and BLAST} -- FASTA and BLAST have the same goal: to
66identify statistically significant sequence similarity that can be
67used to infer homology.  The FASTA programs offer several advantages
68over BLAST:
69\begin{enumerate}
70\item
71Rigorous algorithms unavailable in BLAST (Table I).  Smith-Waterman
72(\texttt{ssearch36}), global: global (\texttt{ggsearch36}), and
73global:local (\texttt{glsearch36}) programs are available, and these
74programs can be used with \texttt{psiblast} PSSM profiles.
75\item
76Better translated alignments. \texttt{fastx36}, \texttt{fasty36},
77\texttt{tfastx36}, and \texttt{tfastx36} allow frame-shifts in
78alignments; frame-shifts are treated like gap-penalties, alignments
79tend to be longer in error-prone reads.
80\item
81Better statistics. BLAST calculates very accurate statistics for
82protein:protein alignments, but its model-based strategy is less
83robust for translated-DNA:protein and DNA:DNA scores.  FASTA uses an
84empirical estimation strategy, and now provides both search-based, and
85high-scoring shuffle-based statistics (\texttt{-z 21}).
86\item
87More flexible library sequence formats.  The FASTA programs can read
88FASTA, NCBI/ \texttt{formatdb}, and several other sequence formats, and can
89directly query MySQL and Postgres databases. The programs offer
90several strategies for specifying subsets of databases.
91\item
92A very efficient threaded implementation.  The FASTA programs are
93fully threaded; both similarity scores and alignments can be
94calculated in parallel on multi-core hardware.  On multi-core
95machines, FASTA can be faster than BLAST while producing better
96alignments with more accurate statistical estimates.
97\item
98  A powerful annotation facility.  The FASTA programs can incorporate
99  functional site annotations, site variation, and domain-based
100  sub-alignment scoring using annotations from sequence libraries.
101  Scripts are available to download site and domain information from
102  Uniprot, and domain information from Pfam and CATH.  Domain
103  information can be used to sub-divide alignment scores to ensure
104  that the aligned domain is homologous.
105
106\end{enumerate}
107
108In addition, the FASTA programs from \texttt{fasta-36.3.4} on provide
109an option to produce very BLAST-like output (\texttt{-m BB}), so that
110analysis pipelines require minimal modification.
111
112\section{An overview of the \texttt{FASTA} programs}
113
114\begin{table}
115\caption{\label{table1} Comparison programs in the FASTA36 package}
116\vspace{0.5ex}
117\begin{tabular}{ p{0.8in} p{0.6in} p{4.6 in}}
118\hline \\[-1.0ex]
119FASTA \mbox{program} & BLAST equiv. & Description \\[1.2ex]
120\hline \\[-1.0ex]
121\texttt{fasta36} & \texttt{blastp}/ \texttt{blastn} &
122Compare a protein sequence to a protein sequence
123database or a DNA sequence to a DNA sequence database using the FASTA
124algorithm \cite{wrp881,wrp960}.  Search speed and selectivity are
125controlled with the \emph{ktup}(wordsize) parameter.  For protein
126comparisons, \emph{ktup} = 2 by default; \emph{ktup} =1 is more sensitive
127but slower.  For DNA comparisons, \emph{ktup}=6 by default; \emph{ktup}=3 or
128\emph{ktup}=4 provides higher sensitivity.\\[1 ex]
129
130\texttt{ssearch36} &  & Compare a protein sequence to a protein sequence
131database or a DNA sequence to a DNA sequence database using the
132Smith-Waterman algorithm \cite{wat815}. \texttt{ssearch36} uses SSE2
133acceleration, and is only 2 - 5X slower than \texttt{fasta36} \cite{farrar2007}. \\[1 ex]
134
135\texttt{ggsearch36}/ \texttt{glsearch36} &  & Compare a protein sequence to a protein sequence
136database or a DNA sequence to a DNA sequence database using
137an optimal global:global (\texttt{ggsearch36}) or global:local
138(\texttt{glsearch36}) algorithm.\\[1 ex]
139
140\texttt{fastx36}/ \texttt{fasty36} & \texttt{blastx} &
141Compare a DNA sequence to a protein
142sequence database, by comparing the translated DNA sequence in three
143frames and allowing gaps and frameshifts.  \texttt{fastx36} uses a
144simpler, faster algorithm for alignments that allows frameshifts only
145between codons; \texttt{fasty36} is slower but can produce better alignments
146because frameshifts are allowed within codons \cite{wrp971}.\\[1 ex]
147
148\texttt{tfastx36}/ \texttt{tfasty36}& \texttt{tblastn} &
149Compare a protein sequence to a DNA sequence
150database, calculating similarities with frameshifts to the forward and
151reverse orientations  \cite{wrp971}.\\[1 ex]
152
153\texttt{fastf36/ tfastf36} &  &
154Compares an ordered peptide mixture, as would be obtained by
155Edman degradation of a CNBr cleavage of a protein, against a protein
156(\texttt{fastf}) or DNA (\texttt{tfastf}) database \cite{wrp021}.\\[1 ex]
157
158\texttt{fasts36/ tfasts36} &  &
159Compares set of short peptide fragments, as would be obtained
160from mass-spec. analysis of a protein, against a
161protein (\texttt{fasts}) or DNA (\texttt{tfasts}) database \cite{wrp021}.\\[1 ex]
162
163\texttt{lalign36} & & Calculate multiple, non-intersecting alignments
164using the sim2 implementation of the Waterman-Eggert
165algorithm\cite{wat875} developed by Xiaoqui Huang and Web
166Miller\cite{mil908}.  Statistical estimates are calculated from
167Smith-Waterman scores of shuffled sequences. \\[1 ex]
168
169\hline \\
170\end{tabular}
171\end{table}
172
173Although there are a large number of programs in this package, they
174belong to three groups: (1) Traditional similarity searching programs:
175\texttt{fasta36}, \texttt{fastx36}, \texttt{fasty36},
176\texttt{tfastx36}, \texttt{tfasty36}, \texttt{ssearch36},
177\texttt{ggsearch36}, and \texttt{glsearch36}; (2) Programs for
178searching with short fragments: \texttt{fasts36}, \texttt{fastf36},
179\texttt{tfasts36}, \texttt{tfastf36}, and \texttt{fastm36}; (3) A
180program for finding non-overlapping local alignments: \texttt{lalign36}.
181Programs that start with \texttt{fast} search protein databases, while
182\texttt{tfast} programs search translated DNA databases.  Table I
183gives a brief description of the programs.
184
185In addition, there are several programs included. \texttt{map\_db} is
186used to index FASTA format sequence databases for more efficient
187scanning. \texttt{scripts/lav2plt.pl} can plot the \texttt{.lav}
188files produced by \texttt{lalign -m 11} as postscript
189(\texttt{lav2plt.pl --dev ps}) or SVG (\texttt{lav2plt.pl --dev svg}) output.
190
191\section{Using the FASTA Package}
192\subsection{Introduction/Overview}
193
194All the FASTA sequence comparison programs use similar command line
195options and arguments.  The simplest command line arguments are (in
196order): the name of a query sequence file, a library file, and
197(possibly) the \emph{ktup} parameter.  If command line options are
198provided, they \emph{must} precede the standard query-file and
199library-file arguments. Thus:
200\begin{quote}
201\texttt{fasta36 -s BP62 query.file library.file}
202\end{quote}
203will compare the sequences in \texttt{query.file} with those in
204\texttt{library.file} using the \texttt{BLOSUM62} scoring matrix with
205BLASTP gap penalties (\texttt{-11/-1}).
206
207The program can also be run by typing:
208\begin{quote}
209\texttt{fasta36 -I}
210\end{quote}
211which presents the ``classic'' interative mode (this was
212the default behavior before version \texttt{36.3.4}).
213In interactive mode,
214you will be prompted for: (1) the name of the test sequence file; (2)
215the name of the library file; (3) whether you want ktup = 1 or 2. (
2161 -- 6 for DNA sequences).
217
218Current versions of the FASTA programs expect a query file and library, if you simply type ``\texttt{fasta36}'', you will see a short help message:
219\begin{footnotesize}
220\begin{quote}
221\begin{verbatim}
222% ssearch36
223USAGE
224 ssearch36 [-options] query_file library_file
225 ssearch36 -help for a complete option list
226
227DESCRIPTION
228 SSEARCH performs a Smith-Waterman search
229 version: 36.3.4 Mar, 2011
230
231COMMON OPTIONS (options must precede query_file library_file)
232 -s:  [BL50] scoring matrix;
233 -f:  [-10] gap-open penalty;
234 -g:  [-2] gap-extension penalty;
235 -S   filter lowercase (seg) residues;
236 -b:  high scores reported (limited by -E by default);
237 -d:  number of alignments shown (limited by -E by default);
238 -I   interactive mode;
239\end{verbatim}
240\end{quote}
241\end{footnotesize}
242``\texttt{fasta36 -help}'' (or any of the other program names in Table
243I) provides complete listing of the options available for the program
244and their default values.
245
246The package includes several test files.  To check to make certain
247that everything is working, you can try:
248\begin{quote}
249\begin{verbatim}
250fasta36  ../seq/musplfm.aa ../seq/prot_test.lib
251or
252tfastx36 ../seq/mgstm1.aa ../seq/gst.nlib
253\end{verbatim}
254\end{quote}
255
256\subsection{Sequence files}
257
258The \texttt{fasta36} programs can read query and library files in many
259standard formats (Section \ref{fastlibs}). The default file format for query and library files
260-- the format that will be used if no additional file format
261information is provided -- is \texttt{FASTA} format. Like
262\texttt{BLAST}, version 36 can compare a query file with multiple
263query sequences to a sequence database, performing an independent
264search with each sequence in the query file.
265
266FASTA format files consist of a description line, beginning
267with a '$>$' character, followed by the sequence itself:
268\begin{quote}
269\begin{verbatim}
270>sequence name and description 1
271A F A S Y T .... actual sequence.
272F S S       .... second line of sequence.
273>sequence name and description 2
274PMILTYV ... sequence 2
275\end{verbatim}
276\end{quote}
277All of the characters of the description line are read, and special
278characters can be used to indicate additional information about the
279sequence. In general, non-amino-acid/non-nucleotide sequences in the
280sequence lines are ignored.
281
282FASTA format files from major sequence distributors, like the NCBI and
283EBI, have specially formatted description lines, e.g.:\\
284\indent
285\texttt{
286>gi|54321|ref|np\_12345| example NCBI refseq sequence\\
287}
288or\\
289\indent
290\texttt{
291>sw:gstm1\_human P01234 glutathione transferase GSTM1 - human\\
292}
293
294Several sample test files are included with the FASTA distribution:
295\texttt{seq/*.aa} and \texttt{seq/*.seq}, as well as two small sequence
296libraries, \texttt{seq/prot\_test.lib} and \texttt{seq/gst.nlib}.
297
298You can build your own library by concatenating several sequence
299files.  Just be sure that each sequence is preceded by a line
300beginning with a '\texttt{>}' followed by a sequence name/description.  Sequences
301entered with word processors should use a ``text'' mode, e.g. ``Save as
302text'' with MS-WORD, with end of line characters and no special
303formatting characters in the file.  The FASTA program cannot read
304Microsoft Word .DOC files, or rich text (.RTF) files; query and
305library sequence files should contain only sequence descriptions,
306sequences, and end-of-line characters.
307
308\subsection{Running the programs}
309As mentioned earlier, the FASTA programs can be run either
310interactively, by typing the name of a FASTA program (and possibly
311command line options), followed by \texttt{-I} (\texttt{fasta36 -I})
312or from the command line, entering command line options, and the
313query and library file names. For searches of large databases that
314may take several minutes (or longer), it is more convenient
315to run searches from the command line, e.g.:
316\begin{quote}
317\begin{verbatim}
318fasta36 query.file library.file > output.file
319\end{verbatim}
320\end{quote}
321The command line shown above could be typed in a Unix or MacOSX
322terminal window, or from the MS-Windows command line interface
323(command.exe).  The command line syntax shown above works for all
324the FASTA programs, e.g.:
325\begin{quote}
326\begin{verbatim}
327lalign36 mchu.aa mchu.aa > mchu.laln
328fastx36 mgstm1.seq prot_test.lseg > mgstm1.fx_out
329ssearch36 mgstm1.aa xurtg.aa > mgstm1_xurtg.ss
330\end{verbatim}
331\end{quote}
332
333\emph{Command line options} -- The FASTA programs provide a variety of
334command line options that modify the default scoring matrix
335(\texttt{-sBL62}) and gap penalties (\texttt{-f -11}, \texttt{-g -1}), other
336algorithm parameters, the output options (\texttt{-E 0.1}, \texttt{-d 20},
337\texttt{-m 9i}), and statistical procedures (\texttt{z -2}).  A complete
338list of command line options is shown near the end of this document.
339Unlike the \texttt{BLAST} programs, all \texttt{FASTA} command line options
340must precede the query file name and library file name (and there are
341no command line options available to specify the query and library
342file names).  Thus, you should type:
343\begin{quote}
344\begin{verbatim}
345ssearch36 -s BL62 -f -11 -g -1 query.file library.file > output.file
346\end{verbatim}
347\end{quote}
348If you include \texttt{-I} as one of the options, you can provide
349command line options (e.g. to change the scoring matrix or gap
350penalties) without a query file or library file, and the program will
351use the options but prompt for the necessary files .
352
353\subsection{Interpreting the results}
354
355Fig. \ref{ssearch_run} shows the output from a typical FASTA program
356(\texttt{ssearch36}).  The output file can be
357viewed as four parts: (a) the initial command line and description of
358the query sequence used (mgstm1.aa, 218 aa) and library (PIR1, 13,143
359entries); (b) a description of the search statistics, algorithm
360(Smith-Waterman, SSE2 accelerated), and search parameters (BLOSUM50
361matrix, gap penalties: -10 to open a gap, -2 for each residue in a
362gap); (c) a list of high scoring library sequences, descriptions, similarity scores, and statistical significance; (d) the alignments that produced the scores.
363
364\begin{figure}
365\include{fasta_guide.fg1}
366\caption{\label{ssearch_run}\texttt{ssearch36} results}
367\vspace{1.0ex}
368Comparison of \texttt{seq/mgstm1.aa} against a small protein database
369(\texttt{pir1.lseg}). Some high-scoring sequences and all but one
370alignment were removed to reduce the output size.
371\end{figure}
372
373\subsubsection{Identifying homologs}
374In the description section (which starts: \texttt{The best scores
375  are:}), four numbers after the description of each library sequence
376are shown: (i) (in parentheses) the length of the library sequence;
377(ii) the raw Smith-Waterman score for the alignment (\texttt{s-w}; for
378the \texttt{fasta36}, \texttt{[t]fast[x,y]36} programs, this column
379would be labeled \texttt{opt}, for the \emph{opt}imized -- banded
380Smith-Waterman -- score), (iii) the \emph{bit} score, and (iv) the
381expectation (E()), or statistical significance, of the alignment
382score.  The E()-value depends on the size of the database searched, in
383this case, 13,143 sequences, so the database size is given at the top
384of the list.
385
386The bit score is equivalent to a BLAST bit score; together with the
387length of query and library sequences, it can be used to calculate the
388significance of the alignment.\footnote{$E(D) = D m n 2^{-b}$,
389where $D$ is the number of sequences in the database, $m, n$ are the
390lengths of the two sequences, and $b$ is the bit score.}  Bit scores
391are convenient because they provide a matrix independent score that
392can be compared with other searches performed with other matrices and
393gap penalties against other databases.  However, the E()-value, or
394expectation, provides the most direct measure of the statistical
395significance of the match.
396
397In this example, the \texttt{GSTP1\_RAT}, \texttt{GSTA1\_RAT}, and
398\texttt{GSTA4\_RAT} proteins share strong significant similarity
399(better than $E() < 6.1 \times 10^{-7}$ ), while the
400\texttt{GSTF1\_MAIZE}, \texttt{GSTF3\_MAIZE}, and
401\texttt{GSTT1\_DROME} sequences do not share significant similarity
402($E() < 0.001$).  However, \texttt{GSTF1\_MAIZE},
403\texttt{GSTF3\_MAIZE}, and \texttt{GSTT1\_DROME} are all glutathione
404transferase homologs, they simply do not share statistically
405significant similarity with this particular \texttt{mGSTM1} query.
406Statistically significant sequence similarity scores \emph{can} be
407used to infer \emph{homology} (common ancestry), but non-significant
408scores \emph{cannot} be used to infer \emph{non-homology}.
409
410While percent identity is often used to characterize the quality of an
411alignment and the likelihood that it reflects homology, the E()-value
412is a much more reliable value for homology infernence (once homology is
413established, the percent identity is much more useful for estimating
414evolutionary distance).  Often sequences that share less than 30\%
415identity will share very significant similarity (in the example above
416\texttt{mgstm1.aa} and \texttt{GSTA4\_RAT}, with E() $<$ 6.1E-07 are
41725.6\% identical).  The expectation value captures information about
418conservative replacements, identities, and alignment length to provide
419a \emph{single} value that captures the significance of the alignment.
420
421For protein searches, library sequences with E()-values $<$ 0.001 for
422searches of a 10,000 entry protein database are almost always
423homologous. Some sequences with E()-values from 1 - 10 may also be
424related, but unrelated sequences ( 1--10 per search) will have scores
425in this range as well.
426
427E()-values $<$ 0.001 can reliably be used to infer homology, assuming
428that the statistical estimates are accurate.  The two most common
429causes of statistical problems are low-complexity regions and
430amino-acid composition bias.  Low-complexity regions are can be
431identified using the \texttt{pseg} program \cite{woo935}, and filtered
432out using the \texttt{-S} option. Composition bias rarely produces
433highly-signficiant E()-values, but can cause unrelated sequences to
434have E()-values between 0.01 and 0.001. The FASTA programs offer two
435shuffle-based strategies for evaluating composition bias; calculating
436similarity scores for random sequences with the same length and amino
437acid composition (\texttt{-z 11} $..$ \texttt{16}),\footnote{Random
438  shuffles are performed for pairwise alignments and \texttt{lalign36}
439  by default.} and calculating statistical estimates derived from
440shuffles of the high-scoring sequences (\texttt{-z 21, 22, 24, 25,
441  26}).
442
443When \texttt{-z 21 .. 26} shuffles are performed, the FASTA36 programs
444present two E()-values in the list of high scoring sequences and the
445alignments; the traditional one based on the library search, and a
446second \texttt{E2()} value, based on the shuffles of the high scoring
447sequences.  \texttt{-z 21 .. 26} shuffles are most useful for
448evaluating the significance of translated-DNA:protein searches like
449\texttt{fastx36}.  Out-of-frame translations can produce cryptic low
450complexity regions, which are most apparent when the high-scoring
451sequences are shuffled.  \texttt{-z 21} shuffles are more efficient
452than individual sequence shuffles, because the set of high scoring
453sequences is shuffled 500 -- 1,000 times, rather than 500 shuffles for
454each of 50 -- 100 high scoring library sequences.  It is almost as
455effective, because homologous sequences share similar amino-acid
456composition.
457
458The statistical routines assume that the library contains a large
459sample of unrelated sequences.  If the library contains fewer than
460500 sequences (\texttt{MAX\_RSTATS}), then the library sequences are
461shuffled to produce 500 random scores, from which lambda and K
462statistical parameters are estimated. If the library contains a large
463number of \emph{related} sequences, then the statistical parameters
464should be estimated by using the \texttt{-z 11-15}, options.
465\texttt{-z} options greater than 10 calculate a shuffled similarity
466score for each library sequence, in addition to the unshuffled score,
467and estimate the statistical parameters from the scores of the
468shuffled sequences.
469
470\subsubsection{Looking at alignments}
471
472The description section described above contains the critical
473information for inferring homology, the \texttt{E()}-value.  The
474alignment section shows the actual alignments that produced the
475similarity score and statistical estimates.  In
476Fig. \ref{ssearch_run}, the alignment display reports the percent
477identity, percent similarity (number of aligned residues with BLOSUM50
478values $\ge$ 0), and the boundaries of the alignment.  Note that for
479the \texttt{ssearch36} and \texttt{fasta36}, the alignment shown can
480include residues that are not part of the best local alignment
481(e.g. residues 1--5 and 207--218 in \texttt{mGSTM1} in
482Fig. \ref{ssearch_run}).  The amount of additional sequence context
483shown is the alignment line length (60 residues, set by \texttt{-w
484  len}) divided by 2 by default, but can be adjusted with the
485\texttt{-W context} option.
486
487\begin{figure}
488\include{fasta_guide.fg2}
489\vspace{-3.0ex}
490\caption{\label{seg-aln}Alignment with \texttt{-S} filtered sequence}
491\end{figure}
492
493Fig. \ref{seg-aln} shows an example of a \texttt{fasta36} alignment
494produced using the \texttt{-S} option to filter out lower-case (low
495complexity) residues.  Here, additional scores (\texttt{initn},
496\texttt{init1} are shown, in addition to the \texttt{opt} score which
497is used to rank the sequences and calculate statistical significance.
498The \texttt{init1} score is the highest scoring alignment without
499gaps; \texttt{initn} is a score that combines consistent
500(non-overlapping) runs without gaps, and \texttt{opt} is the score of
501a banded Smith-Waterman of width 16 for \emph{ktup=2} that is applied
502to sequences with \texttt{initn} scores over the optimization
503threshold. In Fig. \ref{seg-aln},  the \texttt{init1} score is
504based on the long, un-gapped region from residues 46--208 in
505\texttt{mGSTM1}, while the \texttt{initn} and \texttt{opt} scores
506include the other regions joined by gaps.  The \texttt{initn} score is
507higher than the \texttt{opt} score, because it uses a simpler,
508length-independent, gap penalty.
509
510The \texttt{init1} and \texttt{initn} scores are shown for
511historical reasons, and can be used to illustrate the FASTA algorithm.
512But the \texttt{opt} score is the most reliable and sensitive score
513for inferring homology; the others can be ignored.
514
515For \texttt{fasta36} with proteins, the final alignment and score is
516calculated with the Smith-Waterman algorithm. For DNA sequences, a
517banded Smith-Waterman is used. (The \texttt{-A} option produces banded
518Smith-Waterman alignments for proteins, and full Smith-Waterman for
519DNA.)  In Fig. \ref{seg-aln}, the \texttt{opt} score and
520\texttt{Smith-Waterman} scores are calculated on exactly the same
521alignment, but the \texttt{opt} score excludes the contribution from
522the ``low-complexity'' region between 19--30 in \texttt{GST26\_SCHMA}.
523
524\subsubsection{Results without alignments}
525
526While sequence alignments are very informative, it is often not
527practical to examine all the statistically significant alignments in
528large-scale searches. The \texttt{-m 9} and \texttt{-m 8} options
529present summaries of each alignment (alignment boundaries, percent
530identity, and other information) in a much more compact form.  In
531addition, \texttt{-m 9c} or \texttt{-m 9C} (see options below) provide
532a detailed encoding of the alignment, that allows it to be
533reconstructed. For large-scale searches, we routinely use \texttt{-m
534  8} with the \texttt{-d 0} option, which sets the number of
535alignments shown to 0 (thus none are shown).  Alternatively, the
536\texttt{-m 8} and \texttt{-m 8C} ouput options produce BLAST-format
537tabular results summaries (\texttt{-m 8C} provides commented tabular
538results).  \texttt{-m 8CC} adds an alignment CIGAR string and
539annotation string to the BLAST tabular format.
540\texttt{-m BB} produces an output that mimics BLAST output
541(with alignments).
542
543\subsubsection{Alignments with annotations}
544
545The command line \texttt{-V} option (described below) causes the FASTA
546programs to ``decorate'' its sequence alignments with annotation
547information, such as functional sites, variants, and domain-based
548sub-alignment scores.  For example, a comparison of the
549\texttt{seqs/gstm1\_human.vaa} sequence with SwissProt using the
550\texttt{scripts/ann\_feats\_up\_www2.pl} script:
551\begin{footnotesize}
552\begin{quote}
553\begin{verbatim}
554ssearch36 -m 9i -V \!../scripts/ann_feats_up_www2.pl ../seq/gstm1\_human.vaa /slib/swissprot.fa
555\end{verbatim}
556\end{quote}
557\end{footnotesize}
558Produces the following addtional output:
559\begin{footnotesize}
560\begin{verbatim}
561Annotation symbols:
562 = : Active site
563 * : Modified
564 # : Substrate binding
565 ^ : Metal binding
566 @ : Site
567
568The best scores are:                       s-w bits E(458668) %_id  %_sim  alen
569sp|P09488.3|GSTM1_HUMAN Glutathione ( 218) 1500 375.2 2.4e-103 1.000 1.000  218 |Var: K173N;S210T;
570sp|Q03013.3|GSTM4_HUMAN Glutathione ( 218) 1375 344.5 4.4e-94 0.904 0.963  218 |Var: S2P;A160V;L208V;Y209F;...
571sp|Q5R8E8.3|GSTM2_PONAB Glutathione ( 218) 1310 328.5 2.9e-89 0.862 0.959  218
572sp|Q9TSM5.3|GSTM1_MACFA Glutathione ( 218) 1308 328.0   4e-89 0.858 0.954  218
573sp|Q9TSM4.3|GSTM2_MACFA Glutathione ( 218) 1307 327.7 4.8e-89 0.862 0.954  218
574sp|P28161.2|GSTM2_HUMAN Glutathione ( 218) 1306 327.5 5.7e-89 0.853 0.959  218 |Var: S173N;
575sp|P46439.3|GSTM5_HUMAN Glutathione ( 218) 1305 327.2 6.7e-89 0.876 0.954  218 |Var: L179P;
576\end{verbatim}
577\end{footnotesize}
578This summary of high scoring hits shows one of the effects of
579annotation---substitution of variant residues to increase the score.
580In this case, the query sequence \texttt{gstm1\_human.vaa} is a known
581variant of the canonical \texttt{GSTM1\_HUMAN}/\texttt{P09488} UniProt
582sequence.  Without the \texttt{-V} annotation option,
583\texttt{gstm1\_human.vaa} would be 99\% identical to
584\texttt{GSTM1\_HUMAN}, but because UniProt documents the variant
585residues in the feature table, the \texttt{K173N} and \texttt{S210T}
586substutions are made in the library (subject) sequence, producing a
587perfect match.
588
589In addition to the variant substitution shown above, the alignments
590provide a more complete view of the annotations available on the
591library (subject) proteins.  Below is the report for the
592\texttt{GSTM4\_HUMAN} alignment:
593\begin{footnotesize}
594\begin{quote}
595\begin{verbatim}
596>>sp|Q03013.3|GSTM4_HUMAN Glutathione S-trans             (218 aa)
597 Variant: 2P=2P : S2P : UniProtKB FT ID: VAR_033979
598 Site:@ : 7Y=7Y : Site: Glutathione binding
599 Site:@ : 46W=46W : Site: Glutathione binding
600 Site:@ : 59N=59N : Site: Glutathione binding
601 Site:@ : 72Q=72Q : Site: Glutathione binding
602 Region: 2-88:2-88 : score=599; bits=150.1; Id=0.989; Q=415.2 :  GST N-terminal :1
603 Site:# : 116Y=116Y : Substrate binding: Substrate
604 Variant: 160V=160V : A160V : UniProtKB FT ID: VAR_033980
605 Variant: 208V=208V : L208V : UniProtKB FT ID: VAR_049487
606 Region: 90-208:90-208 : score=699; bits=175.1; Id=0.833; Q=489.3 :  GST C-terminal :2
607 Variant: 209F=209F : Y209F : UniProtKB FT ID: VAR_049488
608 Variant: 211K=211K : R211K : UniProtKB FT ID: VAR_049489
609 Variant: 212M=212M : V212M : UniProtKB FT ID: VAR_049490
610 s-w opt: 1375  Z-score: 1823.2  bits: 344.5 E(458668): 4.4e-94
611Smith-Waterman score: 1375; 90.4% identity (96.3% similar) in 218 aa overlap (1-218:1-218)
612\end{verbatim}
613\end{quote}
614\end{footnotesize}
615This report can be broken into three parts: (1) information on
616variants, described above, (2) information on annotated sites, and (3)
617information on annotated domains.  For each annotated site, the
618coordinate and amino-acid resdiue in the query and library (subject)
619sequence is shown, as well as the conservation state (\texttt{=} in
620all these examples).  For annotated domains, the overall alignment
621score is broken into pieces, based on the boundaries of the domains.
622In this case, the full alignment extends from residues 1--218,
623producing a raw Smith-Waterman score of 1375 and a bit score of
624344.5.  The GST N-terminal domain is annotated from residue 2--88 on
625\texttt{GSTM4\_HUMAN}, and the 2--88 region of the alignment produces
626a score of 599 and bit score of 150.1.  This region is 98.9\%
627identical, and the probability of that similarity score is
628$10^{-41.52}$ (the Qvalue score is $-10 log_{10} P$).  The
629alignment associated with GST C-terminal domain is slightly less well
630conserved (83.3\% identical), but longer, so it produces a higher
631Smith-Waterman score (699), bit score (175.1) and Q-value.
632
633Sub-alignment scores can be used to identify cases of alignment
634over-extension \cite{wrp136}, where an alignment extends well beyond
635the homologous domain.  In this case, the homologous region will
636produce the vast majority of the score, and the non-homologous
637over-extension will produce very little score.  For example, when
638\texttt{SRC8\_HUMAN} aligns with \texttt{LASP1\_MOUSE}, the alignment
639spans 200 residues, but only about 49 of those residues, an
640\texttt{SH3} domain, are homologous:
641
642\begin{footnotesize}
643\begin{quote}
644\begin{verbatim}
645>>sp|Q61792.1|LASP1_MOUSE LIM and SH3 domain protein 1;  LASP-1;              (263 aa)
646 Region: 369-398:66-95 : score=20; bits=13.7; Id=0.200; Q=0.0 :  Nebulin_repeat  InterPro
647 Region: 400-434:97-131 : score=-8; bits=8.7; Id=0.150; Q=0.0 :  Nebulin_repeat  InterPro
648 Region: 435-499:132-203 : score=13; bits=11.4; Id=0.197; Q=0.0 :  NODOM :0
649 Region: 499-547:204-261 : score=124; bits=47.8; Id=0.474; Q=92.1 :  SH3  InterPro
650 s-w opt: 148  Z-score: 253.4  bits: 55.6 E(459565): 1.2e-06
651Smith-Waterman score: 159; 26.3% identity (55.6% similar) in 205 aa overlap (369-547:66-261)
652\end{verbatim}
653\end{quote}
654\end{footnotesize}
655The 150 aligned residues outside the \texttt{SH3} homology produce
656less than 20\% of the alignment score, while spannign to
657non-homologous Nebulin repeat domains.
658
659\subsection{Program Options}
660
661Command line options are available to change the scoring parameters
662and output display. Unlike the NCBI BLAST programs, command line
663options \emph{must} precede the query file name and library file name
664arguments.  To see the command-line options for a program and their
665defaults, type \texttt{program\_name -help}, e.g. \texttt{fasta36
666  -help} or \texttt{ssearch36 -help}. For a quick list of the most
667common options, just type the program name without any options
668(e.g. \texttt{fasta36$<$ret$>$}).
669
670\subsubsection{Command line options}
671\begin{description}
672\item[\texttt{-a}] (\texttt{fasta36}, \texttt{ssearch36},
673  \texttt{glsearch36}, \texttt{fasts36}) show both sequences in their
674  entirety.
675\item[\texttt{-A}] force Smith-Waterman alignments for
676  \texttt{fasta36} DNA sequences.  By default, only \texttt{fasta36}
677  protein sequence comparisons use Smith-Waterman alignments.
678  Likewise, for proteins, use band alignments (Smith-Waterman is used
679  by default).
680\item[\texttt{-b \#}] Number of sequence scores to be shown on output.
681  In the absence of this option, \texttt{fasta36} (and
682  \texttt{ssearch36}) display all library sequences obtaining
683  similarity scores with expectations less than the expectation (-E)
684  threshold, 10.0 for proteins, and 2.0 for DNA:DNA and
685  protein:translated DNA.  The \texttt{-b \#} option can limit the
686  display further.  There are two ``sub-modes'' of \texttt{-b}.
687  \texttt{-b =100} will force 100 high scores to be displayed,
688  regardless of the expectation (\texttt{-E}) threshold, and
689  \texttt{-b >1} will show at least \texttt{1}, but is otherwise
690  limited by \texttt{-E}.  Thus, \texttt{-b 10} will show \emph{no
691    more than} 10 results, limited by \texttt{-E}; \texttt{-b =10}
692  will always show \emph{exactly} \texttt{10} results, and \texttt{-b
693    >5} will show \emph{at least} \texttt{5} results, but could show
694  many more if more results have e\-values $\le$ \texttt{-E e\_cut}.
695\item[\texttt{-c \#,\#}] (\texttt{fasta36}, \texttt{[t]fast[x,y]36}
696  only) Fraction of alignments optimized (second value is fraction of
697  sequences joined). FASTA36 uses a statistical threshold strategy
698  that joins and optimizes only the fraction of the alignments with an
699  \texttt{initn} score expected \texttt{-c} times.  Thus, \texttt{-c
700    0.05} should optimize about 5\% of sequences.  The actual number
701  of sequences optimized (and joined) is displayed in the scoring
702  parameters line.  Thus:
703\begin{quote}
704\begin{verbatim}
705Parameters: BL50 matrix (15:-5), open/ext: -10/-2
706 ktup: 2, E-join: 1 (0.687), E-opt: 0.2 (0.294), width:  16
707\end{verbatim}
708\end{quote}
709reports that 20\% of the sequences in the database should have been
710band-optimized, and 29.4\% were. Reducing the \texttt{-c opt} fraction
711improves performance, but dropping the fraction below 0.02 can
712reduce the accuracy of the statistical estimates.
713
714\texttt{-c O} (letter 'O') sets the joining/optimization
715thresholds as they were prior to \texttt{fasta-36.3.3} (original thresholds). Positive
716values set the thresholds to specific score values, as was the case
717in older versions of \texttt{fasta}.
718
719\item[\texttt{-C}]
720length of the sequence name printed at the beginning of alignment
721lines (default 6 characters).
722\item[\texttt{-d \#}]
723Maximum number of alignments to be displayed (must be \texttt{<=} to the number of descriptions, \texttt{-b \#})
724\item[\texttt{-D}]
725  Provide some debugging output.  Used in conjunction
726  with the \texttt{-e expand\_script.sh}, the \texttt{link\_acc\_file}
727  and \texttt{link\_lib\_file} are not deleted during the run; so that
728  \texttt{expand\_script.sh} scripts can be tested.
729
730\item[\texttt{-e expand\_script.sh}]
731
732  Expand the set of sequences that
733  are aligned to beyond the set of sequences searched. When the
734  \texttt{-e expand\_script.sh} option is used, the
735  \texttt{expand\_script.sh} script is run after the initial search
736  scan but before the list of high-scoring sequences is displayed.
737  \texttt{expand\_script.sh} is given a single argument, the name of a
738  file that contains a list of accession strings (the text between the
739  \texttt{>} and the first space ('\textvisiblespace') character
740  followed by the E()-value for the sequence (separated by a
741  \texttt{<tab>} character), e.g.:
742\begin{quote}
743\begin{verbatim}
744gi|121719|sp|P08010|GSTM2_RAT<tab>2.69e-86
745gi|121746|sp|P09211|GSTP1_HUMAN<tab>1.51e-20
746gi|121749|sp|P04906|GSTP1_RAT<tab>1.16e-19
747gi|62822551|sp|P00502|GSTA1_RAT<tab>9.5e-12
748\end{verbatim}
749\end{quote}
750The script should produce a fasta-formatted list of additional
751sequences printed to \texttt{stdout}.  The script is run with the command:
752\begin{quote}
753\begin{verbatim}
754expand_script.sh link_acc.tmp_file > link_lib.tmp_file
755\end{verbatim}
756\end{quote}
757The sequences in \texttt{link\_lib.tmp\_file} (a temporary file name is
758actually used, and the file is deleted unless the \texttt{-D} option
759is used) are then compared and, if they are significant, included in
760the list of high scoring sequences and the alignments.  The expanded
761set of sequences does not change the database size or statisical
762parameters, it simply expands the set of high-scoring sequences.
763
764The \texttt{fasta36/misc} directory contains
765\texttt{expand\_uniref50.pl} that uses a mySQL table based on the
766\texttt{uniref50} clusters.  Using the script and the
767\texttt{uniref50} cluster information, one can search
768\texttt{uniref50.fasta}, but then expand the hits so that
769\texttt{uniprot} appears to be searched.
770
771\item[\texttt{-E e\_cut [e\_cut\_r]}] Limit the number of scores and
772  alignments shown based on the expected number of scores.  Used to
773  override the expectation value of 10.0 (protein:protein; 5.0
774  translated-DNA:protein; 2.0 DNA:DNA) used by default.  \texttt{-E
775    2.0} will show all library sequences with scores with an
776  expectation value $<=$ 2.0.  With \texttt{fasta-36}, a second
777  value, \texttt{e\_cut\_r} is available to limit the E()-values of
778  additional sequence alignments between the query and library
779  sequences.  If not given, the threshold is \texttt{e\_cut}/10.0.  If
780  given with a value $>$ 1.0, \texttt{e\_cut\_r} = \texttt{e\_cut} /
781  value; for a value $<$ 1.0, \texttt{e\_cut\_r} = value; If
782  \texttt{e\_cut\_r} $<$ 0, then the additional alignment option is
783  disabled.
784\item[\texttt{-f \#}]
785Gap open penalty (-10 by default for proteins,
786-12 for DNA, -12 for \texttt{[t]fast[xy]}).
787\item[\texttt{-F \#}]
788Limit the number of scores and alignments shown based on the expected
789number of scores. \texttt{-E \#} sets the highest E()-value shown; \texttt{-F \#} sets
790the lowest E()-value displayed. Thus, \mbox{\texttt{-F 0.0001}} will not show any matches or
791alignments with E() $<$ 0.0001.  This allows one to skip over close
792relationships to search for more distant relationships.
793\item[\texttt{-g \#}]
794Penalty per residue in a gap (-2 by default for proteins,
795-4 for DNA, -2 for \texttt{[t]fast[xy]}).  A single residue gap costs \texttt{f} $+$ \texttt{g}.
796\item[\texttt{-h}]
797Short help message. Help options with \texttt{':'}, e.g. \texttt{-s:},
798require an argument (\texttt{-s BP62}).  Defaults are shown in square
799brackets, e.g.: \texttt{-s:\ [BL50]}.
800\item[\texttt{-help}]
801Long help message
802\item[\texttt{-H}]
803Show histogram.
804\item[\texttt{-i}]
805DNA queries - search with reverse complement.  For
806\texttt{tfastx36/y36}, search the reverse complement of the library sequence
807only (complement of \texttt{-3} option).
808\item[\texttt{-I}]
809Interactive mode (the default for versions older than \texttt{fasta-36.3.4}).
810\item[\texttt{-j \#}]
811Penalty for frameshift between codons (\texttt{[t]fastx36}, \texttt{[t]fasty36}) and within a codon (\texttt{fasty36}/ \texttt{tfasty36} only).
812\item[\texttt{-J}]
813(\texttt{lalign36} only) show the identity alignment (normally
814  suppressed, \texttt{-I} in versions before \texttt{fasta-36.3.4}).
815\item[\texttt{-k \#}]
816number of shuffles for statistical estimates from shuffling.
817\item[\texttt{-l file}]
818Location of library menu file (FASTLIBS).
819\item[\texttt{-L}]
820Display longer library sequence description.
821\item[\texttt{-M low-high}]
822Range of amino acid sequence lengths to be included in the search.
823\item[\texttt{-m \#}]
824Specify alignment type: 0, 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, B, BB, ``F\# out\_file''
825\begin{small}
826\begin{verbatim}
827    -m 0        -m 1          -m 2          -m 3        -m 4
828MWRTCGPPYT   MWRTCGPPYT    MWRTCGPPYT                 MWRTCGPPYT
829::..:: :::     xx  X       ..KS..Y...    MWKSCGYPYT   ----------
830MWKSCGYPYT   MWKSCGYPYT
831\end{verbatim}
832\end{small}
833If the \texttt{-V '*@\%'} annotation option has been used,
834annotations can be included in either the coordinate line (the default) or the
835middle alignment line (\texttt{-m 0M}, \texttt{-m 1M}), or both
836(\texttt{-m 0B}, \texttt{-m 1B}).  See \texttt{-V} for more details.
837
838\indent \texttt{-m 5}: a combination of \texttt{-m 4} and \texttt{-m
839  0}. \texttt{-m 6} provides \texttt{-m 5} plus HTML formatting.  In
840addition, independent \texttt{-m} options can be combined. Thus, one
841can use \texttt{-m 1 -m 6 -m 9}.
842
843\item[\texttt{-m 8}] provides BLAST tabular format output (a tab
844  delimited line with the query name, library name, percent identity,
845  and other alignment information). ``\texttt{-m 8C}'' provides the
846  additional information provided by the BLAST tabular format with
847  comment lines.  BLAST tabular format has been extended to include
848  either a CIGAR string alignment encoding (\texttt{-m 8CC} with BLAST
849  comments, \texttt{-m 8XC} without comments) and, if available, an
850  annotation encoding matching FASTA \texttt{-m 9C} output. All the
851  \texttt{-m 9c/C/d/D} encodings are available with BLAST tabular
852  output using \texttt{-m 8C[c/C/d/D]}.
853
854\item[\texttt{-m 9}] display alignment coordinates and scores with the
855  best score information.  \texttt{-m 9i} provides alignment length,
856  percent identity, and percent similarity only. \texttt{ -m 9} extends
857  the normal best score information:
858\begin{footnotesize}
859\begin{verbatim}
860The best scores are:                                      opt bits E(14548)
861XURTG4 glutathione transferase (EC 2.5.1.18) 4 -   ( 219) 1248 291.7 1.1e-79
862\end{verbatim}
863\end{footnotesize}
864
865to include the additional information (on the same line, separated by
866$<$tab$>$ characters):
867\begin{footnotesize}
868\begin{verbatim}
869%_id  %_gid   sw  alen  an0  ax0  pn0  px0  an1  ax1 pn1 px1 gapq gapl  fs
8700.771 0.771 1248  218    1  218    1  218    1  218    1  219   0   0   0
871\end{verbatim}
872\end{footnotesize}
873
874The first two values are fraction identical and fraction similar
875(score $\ge 0$), followed by the Smith-Waterman alignment score (\texttt{sw}), the
876alignment length (\texttt{alen}), and the coordinates of the beginning
877and end of the alignment in the query and target (library) sequences
878(\texttt{an0} beginning, \texttt{ax1} end in query; \texttt{an1}
879beginning, \texttt{ax1} end in target/library), and the coordinate
880system for the beginning and end of the query and target/library
881sequence (\texttt{pn0} is the displayed coordinate of the first
882residue of the query sequence, \texttt{px0} is the displayed
883coordinate of the last residue, \texttt{pn1},\texttt{px1} provide the
884coordinates for the target/library sequence). \texttt{gapq},
885\texttt{gapl} report the number of gaps in the query and library
886sequence; \texttt{fs} reports the number of frameshifts.
887
888\texttt{ -m 9c} provides additional information: an encoded alignment string.  For example, the alignment:
889\begin{footnotesize}
890\begin{verbatim}
891       10        20        30        40        50          60         70
892GT8.7  NVRGLTHPIRMLLEYTDSSYDEKRYTMGDAPDFDRSQWLNEKFKL--GLDFPNLPYL-IDGSHKITQ
893       :.::  . :: ::  .   .:::         : .:    ::.:   .: : ..:.. :::  :..:
894XURTG  NARGRMECIRWLLAAAGVEFDEK---------FIQSPEDLEKLKKDGNLMFDQVPMVEIDG-MKLAQ
895               20        30                 40        50        60
896\end{verbatim}
897\end{footnotesize}
898would be encoded:
899\begin{footnotesize}
900\texttt{=23+9=13-2=10-1=3+1=5}
901\end{footnotesize}.
902The numbers in the alignment encoding is with repect to the beginning
903of the alignment, not the sequences.  The beginning coordinate of the
904alignment is given earlier in the \texttt{-m 9c} line.  \texttt{-m 9C}
905provides the alignment encoding in CIGAR format:
906\begin{footnotesize}
907\texttt{28M9D13M2I10M1I3M1D5M}
908\end{footnotesize}.
909
910(June, 2014) The \texttt{-m 9c/C} option has been extended to
911\texttt{-m 9d/D}, which encodes the positions of mismatches as well as
912insertions and deletions.  For the example above, the \texttt{-m 9d}
913encoding would be:
914\begin{footnotesize}
915\texttt{=1x1=2x4=2x1=2x7=3-9=1x2=1x4=2x1=1x1+2x1=1x1=1x3=1x2+1=3-1x1=1x2=1}
916\end{footnotesize}
917while \texttt{-m 9D} would be:
918\begin{footnotesize}
919\texttt{1M1X2M4X2M1X2M7X3M9D1M2X1M4X2M1X1M1X2I1X1M1X1M3X1M2X1I3M1D1X1M2X1M}
920\end{footnotesize}
921\item[\texttt{-m 10}]
922a parseable format for use with other programs.
923\item[\texttt{-m 11}]
924Provide \texttt{lav}-like output (used by \texttt{lalign}) for graphical output.
925\begin{quote}
926\texttt{lalign36 -m 11 mchu.aa mchu.aa | lav2plt.pl --dev ps  > mchu\_laln.ps}
927\end{quote}
928Produces a postscript plot of the local alignments.  Likewise,
929\texttt{lav2plt.pl --dev svg} produces SVG output.
930
931\item[\texttt{-m BB}] Format output to mimic BLAST format.  \texttt{-m
932  B} formats alignments to look like BLAST alignments (Query/Sbjct),
933  but is FASTA output otherwise. \texttt{-mBB} imitates BLAST as much
934  as possible, and cannot be used with other \texttt{-m} options.
935
936\item[\texttt{-m "F\# out.file"}] Send an alternate result format to \texttt{out.file}.
937Normally, the \texttt{-m out\_fmt} option applies to the default output
938file, which is either \texttt{stdout}, or specified with \texttt{-O out\_file} (or within
939the program in interactive mode). With \texttt{-m F}, an output format can be
940associated with a separate output file, which will contain a complete
941FASTA program output.  Thus,
942\begin{quote}
943\begin{small}
944\begin{verbatim}
945  ssearch36 -m 9c -m "FBB blast.out" -m "F9c,10 m9c_10.out" query library
946\end{verbatim}
947\end{small}
948\end{quote}
949Sends the \texttt{-m 9c} output to \texttt{stdout}, but will also send
950\texttt{-m BB} output to the \texttt{blast.out} file, and \texttt{-m 9c -m
951  10} output to \texttt{m9\_c10.out}.  Consistent \texttt{-m out\_fmt}
952commands can be set to the same file by separating them with ','.
953Producing alternative format alignments in different files has little
954additional computational cost.
955
956Because a space (\textvisiblespace) is used to separate the output
957format (\texttt{-m}) values from the file name, the \texttt{-m F}
958argument must typically be surrounded by quotation marks (\texttt{"}).
959
960One of the shortcomings of this approach is that it affects only the
961output format, not the other options that modify the amount of output.
962Thus, if you specify \texttt{-E 0.001}; that expect threshold will be
963used for all the output files.  When a \texttt{-m} option does modify
964the output (e.g. \texttt{-m 8} sets \texttt{-d 0}), that modification
965is specific to the output file.
966
967\item[\texttt{-M low-high}]
968Include library sequences with lengths between low and
969high.
970\item[\texttt{-n}]
971Force the query sequence to be treated as a DNA sequence.
972Useful when query sequences contain a large number of
973ambiguous residues, e.g. transcription factor binding sites.
974\item[\texttt{-N \#}]
975break long library sequences into blocks of \# residues.  Useful for
976bacterial genomes, which have only one sequence entry.  -N 2000 works
977well for well for bacterial genomes. (This option was required when
978FASTA only provided one alignment between the query and library
979sequence.  It is not as useful, now that multiple alignments are
980available.)
981
982\item[\texttt{-o off1,off2}]
983(Previously \texttt{-X}.) Specifies offsets for the beginning of the query and library sequence.
984For example, if you are comparing upstream regions for two genes, and
985the first sequence contains 500 nt of upstream sequence while the
986second contains 300 nt of upstream sequence, you might try:
987\begin{quote}
988\texttt{fasta -o "-500 -300" seq1.nt seq2.nt}
989\end{quote}
990If the \texttt{-o} option is not used, FASTA assumes numbering starts with 1.
991(You should double check to be certain the negative numbering works
992properly.)
993
994\item[\texttt{-O}] Send a copy of results to \texttt{filename}.
995  Helpful for environments without STDOUT, but should be avoided (use
996  \texttt{> filename} instead).
997
998\item[\texttt{-p}]
999Force query to be treated as protein sequence.
1000
1001\item[\texttt{-P PSSM\_file}]
1002Specify a PSI-BLAST format PSSM (Position Specific Scoring Matrix)
1003file.  \texttt{ssearch36}, \texttt{ggsearch36}, and
1004\texttt{glsearch36} can use a PSSM file to improve the sensitivity of
1005a search. The FASTA programs accept two PSSM file formats:\\[2ex]
1006\begin{tabular}{l l l}
1007\hline\\[-1.5ex]
1008format & \texttt{blastpgp} & option \\[0.5ex]
1009\hline\\[-1.5ex]
10100 & \texttt{blastpgp -C pssm.chk -u 0} & byte-encoded \\
10112 & \texttt{blastpgp -C pssm.asnb -u 2} & binary ASN.1 \\
1012%  & \texttt{psiblast -out\_pssm\_text} \\[1.5ex]
1013\hline\\[-0.5ex]
1014\end{tabular}\\
1015which can be specified after the file name, e.g.:
1016\begin{quote}
1017\texttt{ssearch36 -P 'pssm.asnb 2' pssm\_query.aa +sp+}
1018\end{quote}
1019Searches with a PSI-BLAST PSSM must still require a query sequence
1020file, and the query sequence file must match the PSSM seed sequence.
1021The format 0 byte-encoded PSSM is machine dependent; it must be
1022created by \texttt{blastpgp} on the same architecture as
1023\texttt{ssearch36}.  In general, you should use the binary ASN.1 (format 2) file.
1024
1025With the release of \texttt{NCBI-BLAST+}, \texttt{psiblast} replaces
1026\texttt{blastpgp}, and \texttt{psiblast} does not produce the binary
1027ASN.1 PSSM checkpoint data.  However, the text ASN.1 PSSM checkpoint
1028file (produced with the \texttt{psiblast} option \texttt{-out\_pssm})
1029can be converted to a binary ASN.1 format that \texttt{ssearch36} can
1030read using the NCBI \texttt{datatool} program (available from
1031\url{ftp://ftp.ncbi.nlm.nih.gov/toolbox/ncbi_tools++/BIN/CURRENT/datatool})
1032together with
1033\url{http://www.ncbi.nlm.nih.gov/data_specs/asn/NCBI_all.asn}. More
1034information about \texttt{datatool} is available from
1035\url{http://www.ncbi.nlm.nih.gov/data_specs/NCBI_data_conversion.html}.
1036The NCBI BLASTP/PSI-BLAST website provides the same PSSM text ASN.1
1037file with the downloads link.  A text ASN.1 PSSM file can be converted
1038to a binary ASN.1 file using the command:
1039\begin{quote}
1040\texttt{datatool -m NCBI\_all.asn -v pssm.asn\_txt -e pssm.asnb}
1041\end{quote}
1042The \texttt{pssm.asnb} can then be used with
1043\texttt{ssearch36} with the \texttt{-P 'pssm.asnb 2'}
1044option shown above.
1045
1046\item[\texttt{-Q,-q}]
1047Quiet - does not prompt for any input.  Writes scores and alignments
1048to the terminal or standard output file (on by default, turned off
1049with \texttt{-I}).
1050\item[\texttt{-r +n/-m}]
1051Specify match/mismatch scores for DNA comparisons.  The default is
1052\texttt{+5/-4}. \texttt{+3/-2} can perform better in some cases.
1053\item[\texttt{-R file}]
1054Save a results summary line for every sequence in the sequence
1055library.  The summary line includes the sequence identifier,
1056superfamily number (if available) position
1057in the library, and the similarity scores calculated.  This option can
1058be used to evaluate the sensitivity and selectivity of different
1059search strategies \cite{wrp951,wrp981}.
1060\item[\texttt{-s file}] Specify the scoring matrix file.
1061  \texttt{fasta36} uses the same scoring matrice format as Blast.
1062  Several scoring matrix files are included in the standard
1063  distribution in the \texttt{data/} directory.  For protein
1064  sequences: \texttt{codaa.mat} - based on minimum mutation matrix;
1065  \texttt{idnaa.mat} - identity matrix; \texttt{pam250.mat} - the
1066  PAM250 matrix; \cite{day787}, (\texttt{-s P250}), and
1067  \texttt{pam120.mat} - a PAM120 matrix (\texttt{-s P120}).  The
1068  default scoring matrix is BLOSUM50 (\texttt{-s BL50}). Other
1069  matrices include a series of modern PAM-based matrices
1070  \cite{tay925}: MDM40/\texttt{-s MD40}, MDM20/\texttt{-s MD20}, and
1071  MDM10/\texttt{-s MD10}, and a selection from the BLOSUM series
1072  \cite{hen929} BLOSUM50, 62, and 80/\texttt{-s BL50}, \texttt{-s
1073    BL62}, \texttt{-s BL80}.  \texttt{-s BP62} sets the scoring matrix
1074  to BLOSUM62 and the gap penalties to -11/-1, identical to
1075  \texttt{BLASTP}.  In addition, the VTML160 matrix (\texttt{-s
1076    VT160}) \cite{muller2002} and OPTIMA\_5 (\texttt{-s OPT5})
1077  \cite{kan023} are available.
1078
1079If the scoring matrix is prefaced by a question mark,
1080e.g. \texttt{?BP62}, then the scoring matrix is adjusted for each
1081query to ensure that a 100\% identical match can produce a score of at
1082least 40 bits.  This is designed for \texttt{fastx36} searches with
1083potentially short DNA queries; A 120 nt DNA query can only produce a
108440 amino-acid alignment, which, with BLOSUM62 -11/-1, cannot produce
1085more than 23 bits of score. A scoring matrix with a higher information
1086content is required; in the set available by default, MD40, with 2.22
1087bits/position, would be used.  For more information about alignment
1088length and information content, see \cite{alt915}.
1089
1090\item[\texttt{-S}] Filter out lower-case characters in the query or
1091  library sequences for the initial score calculation (used to filter
1092  low-complexity -- \texttt{seg}-ed -- residues).  The \texttt{pseg}
1093  program \cite{woo935} can be used to lower-case mask low complexity
1094  regions in protein sequences. With the \texttt{-S} option, lower
1095  case characters in the query or database sequences are treated as
1096  \texttt{X}'s during the initial scan, but are treated as normal
1097  residues during the final alignment display.  Since statistical
1098  significance is calculated from the similarity score calculated
1099  during the library search, the lower case residues do not contribute
1100  to the score.  However, if a significant alignment contains low
1101  complexity regions, the residues are shown (as lower
1102  case characters, Fig. \ref{seg-aln}).
1103
1104The \texttt{pseg} program can be used to produce databases (or query
1105sequences) with lower case residues indicating low complexity regions
1106using the command:
1107\begin{verbatim}
1108pseg ./swissprot.fasta -z 1 -q > swissprot.lseg
1109\end{verbatim}
1110
1111The \texttt{-S} option should always be used with \texttt{FASTX/Y} and
1112\texttt{TFASTX/Y} because out-of-frame translations often generate
1113low-complexity protein sequences.  However, only lower case characters
1114in the protein sequence (or protein database) are masked; lower case
1115DNA sequences are translated into upper case protein sequences, and
1116not treated as low complexity by the translated alignment
1117programs. (There is an option in the \texttt{Makefile},
1118\texttt{-DDNALIB\_LC}, to enable preserving case in DNA sequences.)
1119
1120\item[\texttt{-t \#}]
1121Translation table - fastx36, tfastx36, fasty36, and
1122tfasty3 now support the BLAST translation tables.  See
1123\url{http://www.ncbi.nih.gov/Taxonomy/Utils/wprintgc.cgi}.
1124
1125\texttt{-t t} or \texttt{-t t\#} enables the addition of
1126an implicit termination codon to a protein:translated DNA match.  That
1127is, each protein sequence implicitly ends with \texttt{*}, which
1128matches the termination codes for the appropriate genetic code.
1129\texttt{-t t\#} sets implicit termination and a different genetic
1130code.
1131\item[\texttt{-T \#}]
1132set number of threads/workers.  Normally on a multi-core machine, the maximum
1133number of processors/cores is used.
1134\item[\texttt{-U}]
1135Treat the query sequence an RNA sequence.  In addition to selecting a
1136DNA/RNA alphabet, this option causes changes to the scoring matrix so
1137that \texttt{G:A} , \texttt{T:C} or \texttt{U:C} are scored as \mbox{\texttt{G:G -3}}.
1138\item[\texttt{-v \#}]
1139Do window shuffles with the window size specified.
1140\item[\texttt{-V str}] Specify annotation characters that can be
1141  included (and will be ignored), in the query sequence file, but are
1142  displayed in the alignments.  If a query file contains
1143  \texttt{"ACVS*ITRLFT?"}, where \texttt{"*"} and \texttt{"?"}  are
1144  used to indicate phosphorylation, giving the option \mbox{\texttt{-V
1145      '*?'}}, the annotated characters in the query will (\texttt{S*},
1146  \texttt{F?}) will be highlighted in the alignment (on the number
1147  line). A \texttt{fasts36} alignment of \texttt{seq/ngts.aa} compared
1148  to \texttt{seq/mgstm1.aa} with \texttt{-V '*?'} produces:
1149\begin{footnotesize}
1150\begin{verbatim}
1151             *                10??
1152GT8.7     ILGYWN------------EYTDSSYDEKR----------------------------
1153          ::::::            :::::::::::
1154GT8.7  MPMILGYWNVRGLTHPIRMLLEYTDSSYDEKRYTMGDAPDFDRSQWLNEKFKLGLDFPNL
1155               10        20        30        40        50        60
1156\end{verbatim}
1157\end{footnotesize}
1158In addition to showing the alignments of post-translationally modified
1159sites, the \texttt{-V} option can be used to highlight active sites in
1160library sequences. In the \texttt{-m 9c} output, the state of the
1161annotated sites is summarized when \texttt{-V} is used.
1162
1163(fasta-36.3.6 June 2012) The \texttt{-V} option has been extended to:
1164(1) allow feature descriptions to be specified in a file,
1165e.g. \texttt{-V =annot.defs} where \texttt{annot.defs} contains:
1166\begin{footnotesize}
1167\begin{verbatim}
1168*:phosphorylation
1169@:active site
1170^:binding site
1171\end{verbatim}
1172\end{footnotesize}
1173The annotation character is left of the  ':', the definition is on the
1174right.  The \texttt{annot.defs} file can also be specified by setting
1175the \texttt{FA\_ANNOT\_DEF} environment variable to the file name;
1176
1177(2) to include optional annotation file, e.g. \texttt{-V
1178  '<features.annot'}, or script, e.g. \texttt{-V '!features.pl'} for
1179library annotations and \texttt{-V 'q!features.pl'} for query annotations. (Some shells require \texttt{\textbackslash!features.pl}.) Similar to the library expansion script, the
1180\texttt{features.pl} script is run against a temporary file containing
1181the list of high scoring sequence accessions (the text before the
1182first space), e.g.
1183\begin{footnotesize}
1184\begin{verbatim}
1185gi|121735|sp|P09488.3|GSTM1_HUMAN
1186gi|1170096|sp|Q03013.3|GSTM4_HUMAN
1187gi|67461004|sp|Q5R8E8.3|GSTM2_PONAB
1188...
1189\end{verbatim}
1190\end{footnotesize}
1191The \texttt{features.pl} script then produces a file of annotations on
1192those sequences, in the format:
1193\begin{verbatim}
1194>accession1
1195position label value
1196>accession2
1197...
1198\end{verbatim}
1199For example:
1200\begin{footnotesize}
1201\begin{verbatim}
1202>gi|121735|sp|P09488.3|GSTM1_HUMAN
120323	*
120433	*
120534	*
1206116	^
1207173	V	N
1208210	V	T
1209>gi|1170096|sp|Q03013.3|GSTM4_HUMAN
12102	V	P
1211116	^
1212160	V	V
1213208	V	V
1214209	V	F
1215211	V	K
1216212	V	M
1217>gi|67461004|sp|Q5R8E8.3|GSTM2_PONAB
1218...
1219\end{verbatim}
1220\end{footnotesize}
1221The same format is used for the \texttt{-V '<feature.annot'} file.
1222
1223The \texttt{V} label is special; it indicates that the feature is a
1224variant residue and specifies the alternative residue in the label
1225field.  Thus, \texttt{GSTM4\_HUMAN} can have a \texttt{M} at position
12262.  Unlike modification or active site annotations, variant residues
1227can change the sequence of the library sequence if replacing the
1228canonical library residue with the variant residue improves the score.
1229Thus, without the \texttt{-V '!feature.pl'} script, the human
1230\texttt{GSTM1B} variant with dbSNP:rs449856 would align to
1231\texttt{GSTM1\_HUMAN} (\texttt{P04988}) like this (the \texttt{-m 1}
1232format option was used to highlight differences) :
1233\begin{footnotesize}
1234\begin{verbatim}
1235The best scores are:                                                          opt bits E(1)
1236sp|P09488.3|GSTM1_HUMAN Glutathione S-transferase Mu 1; GST HB subuni  ( 218) 1490 335.9 3.7e-97
1237
1238>>sp|P09488.3|GSTM1_HUMAN Glutathione S-transferase Mu 1; GST HB s            (218 aa)
1239 initn: 1490 init1: 1490 opt: 1490  Z-score: 1776.8  bits: 335.9 E(1): 3.7e-97
1240Smith-Waterman score: 1490; 99.1% identity (100.0% similar) in 218 aa overlap (1-218:1-218)
1241
1242...
1243              170       180       190       200       210
1244gtm1_h YDVLDLHRIFEPNCLDAFPNLKDFISRFEGLEKISAYMKSSRFLPRPVFTKMAVWGNK
1245                   x                                    x
1246sp|P09 YDVLDLHRIFEPKCLDAFPNLKDFISRFEGLEKISAYMKSSRFLPRPVFSKMAVWGNK
1247              170       180       190       200       210
1248\end{verbatim}
1249\end{footnotesize}
1250
1251With a \texttt{-V '!feature.pl'} script to annotate the variants,
1252the alignment becomes:
1253\begin{footnotesize}
1254\begin{verbatim}
1255The best scores are:                                                          opt bits E(1)
1256sp|P09488.3|GSTM1_HUMAN Glutathione S-transferase Mu 1; GST HB s       ( 218) 1500 338.1   8e-98
1257
1258>>sp|P09488.3|GSTM1_HUMAN Glutathione S-transferase Mu 1; GST HB s            (218 aa)
1259 Variant: K173N;S210T;
1260 initn: 1500 init1: 1500 opt: 1500  Z-score: 1788.7  bits: 338.1 E(1): 8e-98
1261Smith-Waterman score: 1500; 100.0% identity (100.0% similar) in 218 aa overlap (1-218:1-218)
1262...
1263              170       180       190       200       210
1264gtm1_h YDVLDLHRIFEPNCLDAFPNLKDFISRFEGLEKISAYMKSSRFLPRPVFTKMAVWGNK
1265
1266sp|P09 YDVLDLHRIFEPNCLDAFPNLKDFISRFEGLEKISAYMKSSRFLPRPVFTKMAVWGNK
1267              170  V    180       190       200       21V
1268\end{verbatim}
1269\end{footnotesize}
1270In addition to removing the two differences as residues 173 and 210,
1271which produces a 100\% identical alignment, inclusion of the variant
1272library sequence also improves the raw similarity score and $E()$-value.
1273An example script (\texttt{misc/up\_feats.pl}) that extracts
1274annotations from a mysql database of Uniprot features is provided.
1275
1276If the annotation script produces lines beginning with '=', then these
1277lines are taken as annotation definitions, similar to the
1278\texttt{annot.defs} file described above.  Thus:
1279\begin{footnotesize}
1280\begin{verbatim}
1281=*:phosphorylation
1282=@:active site
1283=^:binding site
1284>gi|121735|sp|P09488.3|GSTM1_HUMAN
128523	*
128633	*
128734	*
1288116	^
1289173	V	N
1290210	V	T
1291\end{verbatim}
1292\end{footnotesize}
1293will produce the same annotation descriptions as the
1294\texttt{annot.defs} file.
1295
1296Scripts to produce annotations are available in the \texttt{scripts/}
1297directory as \texttt{scripts/ann\_feats*.pl}. Scripts with
1298\texttt{www} in the name,
1299e.g. \texttt{scripts/ann\_feats\_up\_www2.pl} and
1300\texttt{scripts/ann\_pfam\_www.pl} download annotation information
1301from Uniprot or Pfam web services, respectively.  Scripts lacking
1302\texttt{www} require require a MySQL database that associates features
1303or domains with sequence identifiers (accessions).  With \CURRENT,
1304domain annotations are allows to overlap each other (which often
1305happens in Pfam and UniProt); FASTA 36.3.6 did not support overlapping
1306domains.  Scripts that can produce overlapping domain annotations have
1307\texttt{\_e} in their names, but will produce non-overlapping domain
1308annotations with the \texttt{--no-over} option. Thus:
1309\texttt{scripts/ann\_pfam\_www\_e.pl --acc sp|P43553|ALR2\_YEAST}
1310produces:
1311\begin{quote}
1312\begin{verbatim}
1313>sp|P43553|ALR2_YEAST
1314451  -  683   PF01544 :1
1315667  -  799   PF01544 :1
1316\end{verbatim}
1317\end{quote}
1318While \texttt{scripts/ann\_pfam\_www\_e.pl --acc --no-over
1319  sp|P43553|ALR2\_YEAST} produces:
1320\begin{quote}
1321\begin{verbatim}
1322>sp|P43553|ALR2_YEAST
1323451  -  675  PF01544 :1
1324676  -  799  PF01544 :1
1325\end{verbatim}
1326\end{quote}
1327
1328\item[\texttt{-w \#}]
1329  Display width value ($<$200). Sets the approximate width of the
1330  high-score descriptions and the length of residue
1331  alignments. \texttt{-w 60} by default.
1332
1333\item[\texttt{-W \#}] context length (default is 1/2 of line width -w)
1334  for alignment, for programs like \texttt{fasta36} and
1335  \texttt{ssearch36}, that provide additional sequence context.
1336
1337\item[\texttt{-X extended\_option}]
1338A number of rarely used options are now only available as extended options:
1339
1340\begin{description}
1341
1342\item[\texttt{X1}] sort output by \texttt{init1} score (for
1343  compatibility with FASTP; obsolete).
1344
1345\item[\texttt{XB}] (Previously \texttt{-B}.)  Show the z-score, rather
1346  than the bit-score in the list of best scores (rarely used, provided
1347  for backward compatibility).
1348
1349\item[\texttt{XI}] Modify rounding used in percent identity/percent
1350  similarity display to ensure that sequences that have a mismatch are
1351  not shown as 100.0\% identical.  Without this option, a single
1352  mismatch in a 10,000 residue alignment would be shown as 100.0\%
1353  identical; with this option, it would be shown as 99.9\%
1354  identical.
1355
1356\item[\texttt{Xo}] (\texttt{fasta36}, \texttt{[t]fast[x/y]36} only)
1357  (Previously \texttt{-o}.) Turn off the default \texttt{opt} score
1358  calculation and sort results by \texttt{initn} scores (reduces
1359  sensitivity and statistical accuracy, obsolete).
1360
1361\item[\texttt{XM}] The maximum amount of memory available for storing
1362  the library in multi-sequence searches. The value is specified in
1363  MBytes (\texttt{-XL16}) or GBytes (\texttt{-XL4G}) and can also be
1364  set using the \texttt{LIB\_MEMK} environment variable
1365  (\texttt{LIB\_MEMK=4G}).  Negative values remove the memory
1366  restriction. By default (set as a compile-time option,
1367  \texttt{-DMAX\_MEMK=2}), set to 2 GBytes in 32-bit environments and
1368  12 GBytes in 64-bit environments.
1369
1370\item[\texttt{XN/XX}] Alter the treatment of N:N (DNA) or X:X
1371  (protein) alignments for counts of identities and similarities. By
1372  default the \texttt{FASTA} programs count N:N or X:X as identical,
1373  but not similar, because their alignment scores are typically
1374  negative. \texttt{-XNS}, \texttt{-XN+}, \texttt{-XXS}, and
1375  \texttt{-XX+} treat N:N and X:X alignments as ``similar'' , even
1376  though their alignment scores are negative, when calculating percent
1377  similarity. \texttt{-XND}, \texttt{-XN-}, \texttt{-XXD}, and
1378  \texttt{-XX-} treat N:N and X:X alignments as non-identical for
1379  calculating percent identity.
1380
1381\item[\texttt{Xx}] (Previously \texttt{-x}) Specify the penalty for a
1382  match to an \texttt{X}, and mismatch to \texttt{X}, independently of
1383  the PAM matrix.  Particularly useful for \texttt{fastx3/fasty36},
1384  where termination codons are encoded as \texttt{X}.  For example,
1385  \texttt{-Xx=0,-1} scores an \texttt{X:X} match as 0, and
1386  \texttt{X:not-X} as -1.
1387
1388\item[\texttt{Xy}] (Previously \texttt{-y}.) Set the width of the band
1389  used for calculating "optimized" scores.  For proteins and ktup=2,
1390  the width is 16.  For proteins with ktup=1, the width is 32 by
1391  default.  For DNA the width is 16.
1392
1393\end{description}
1394
1395\item[\texttt{-z -1,0,1,2,3,4,5,6}]\hfill\\
1396\texttt{-z -1} turns off statistical calculations. \texttt{z 0} estimates
1397the significance of the match from the mean and standard deviation of
1398the library scores, without correcting for library sequence length.
1399\texttt{-z 1} (the default) uses a weighted regression of average score
1400vs library sequence length; \texttt{-z 2} uses maximum likelihood
1401estimates of $\lambda$
1402and $K$; \texttt{-z 3} uses Altschul-Gish parameters \cite{alt960};
1403\texttt{-z 4 - 5} uses two variations on the \texttt{-z 1}
1404strategy. \texttt{-z 1} and \texttt{-z 2} are the best methods, in
1405general.
1406\item[\texttt{-z 11,12,14,15,16}]\hfill\\
1407estimate the statistical parameters from shuffled copies of each
1408library sequence.  This allows accurate statistics to be estimated for libraries comprised of a single protein family.
1409
1410\item[\texttt{-z 21,22,24,25,26}]\hfill\\
1411estimate the statistical parameters from shuffled copies of the
1412highest scoring sequences reported in the search.
1413library sequence. This shuffling strategy is much more like
1414\texttt{prss}, since the sequences shuffled share compositional
1415similarity to the query.
1416\item[\texttt{-Z db\_size}]
1417sets the apparent size of the database to be used when calculating
1418expectation E()-values.  If you searched a database with 1,000
1419sequences, but would like to have the E()-values calculated in the
1420context of a 100,000 sequence database, use \texttt{-Z 100000}.
1421\item[\texttt{-3}]
1422translate only three forward frames or search with only the forward
1423strand (complement of \texttt{-i}).
1424\end{description}
1425
1426Thus, to tell \texttt{fasta36} to align \texttt{seq1.aa} with \texttt{seq2.aa} showing the entirety of both sequences, with 80 characters per line, one would type:
1427\begin{verbatim}
1428fasta36 -w 80 -s BP62 -a seq1.aa seq2.aa
1429\end{verbatim}
1430The \texttt{-w 80} and \texttt{-a} options must precede the file
1431names.  If you just enter the options on the command line followed by
1432\texttt{-I}, the program will prompt for the file names.
1433
1434In addition, the FASTA programs can accept query sequence data from
1435\texttt{STDIN}.  To specify that stdin be used as the query or library
1436file, the file name should be specified as \texttt{@}.  Thus:
1437\begin{quote}
1438\texttt{cat query.aa | fasta36 @:25-75 /slib/swissprot }
1439\end{quote}
1440would take residues 25-75 from \texttt{query.aa} and search the
1441\texttt{/slib/swissprot}.
1442
1443\subsubsection{Environment variables}
1444
1445FASTA allows virtually every option to be set on the command line
1446(except the \emph{ktup}, which must be set as the third command line
1447argument), but it is often convenient to set the \texttt{FASTLIBS}
1448environment variable to specify the location of the \texttt{fastlibs}
1449database description file.
1450
1451\texttt{FASTLIBS} -- \texttt{FASTLIBS}
1452specifies the location of the file that contains the list of library
1453descriptions, locations, and library types (see section on finding
1454library files).
1455
1456\texttt{LIB\_MEMK} -- Set the maximum amount of memory (MBytes) to be
1457available for library buffering (equivalent to \texttt{-XM\#}, see
1458above).  By default, \texttt{2GB} is available on 32-bit systems
1459(\texttt{LIB\_MEMK=2G}); \texttt{8GB} on 64-bit systems.
1460
1461\texttt{REF\_URL}, \texttt{SRCH\_URL} and \texttt{SRCH\_URL1} -- These
1462environment variables are used in HTML mode (\texttt{-m 6}) to provide
1463links from the sequence alignment (see the links at
1464\url{http://fasta.bioch.virginia.edu/fasta_www2/}). \texttt{REF\_URL}
1465is associated with the \texttt{Entrez Lookup} link; \texttt{SRCH\_URL}
1466with the \texttt{Re-search database} link, and \texttt{SRCH\_URL1}
1467with the \texttt{General re-search} link.  In each case, the text
1468corresponds to a HTML URL, but with positions containing the
1469\texttt{\%s} or \texttt{\%ld} (for numbers) part of a 'C'
1470\texttt{sprintf()} call for specific variables. \texttt{REF\_URL} uses
1471the database (\texttt{protein} or \texttt{nucleotide}), together with
1472a query term (typically the \texttt{gi} number). \texttt{SRCH\_URL}
1473and \texttt{SRCH\_URL1} use \texttt{db}, \texttt{query} (\texttt{gi},
1474\texttt{pgm} (\texttt{fa}, \texttt{ss}, \texttt{fx}, etc.), and
1475\texttt{start}, \texttt{stop}, and \texttt{n1} (library sequence
1476length), where \texttt{start} and \texttt{stop} are the boundaries of
1477the alignment, for sub-sequence searches.  The values of these
1478environment variables are used with \texttt{sprintf} to build a new
1479URL that is linked in the output.
1480
1481\texttt{TMP\_DIR} -- Location (if defined) of the temporary files used
1482by the \texttt{-e expand\_script.sh} option.
1483
1484In addition, environment variables can be used inside both the
1485\texttt{fastlibs} file and in the \texttt{@db.nam} files of file
1486names. The \texttt{fasta36/conf/fast\_libs\_e.www} file, included with
1487the distribution, shows an example, as do the descriptions of file of
1488file names files shown below. Whenever a word of the form
1489\texttt{\$\{WORD\}} is found in \texttt{fastlibs} or a file of file
1490names, the \texttt{\$\{WORD\}} environment variable is expanded and
1491inserted in the string.  Thus, if \texttt{<\$\{SLIB\}/blast\_dbs/}
1492describes where a list of files will be found and \texttt{\$\{SLIB\}}
1493is \texttt{"/seqdata"}, then the resulting substitution yields:
1494\texttt{</seqdata/blast\_dbs/}.
1495
1496\section{Installing FASTA and the sequence databases}
1497
1498\subsection{Obtaining/preparing the sequence libraries}
1499
1500The FASTA program package does not include any protein or DNA sequence
1501libraries.  Protein and DNA sequence databases are available via
1502anonymous FTP from the NCBI (\url{ftp://ftp.ncbi.nih.gov/blast/db},
1503\url{ftp://ftp.ncbi.nih.gov/blast/db}), UniProt
1504(\url{ftp://ftp.uniprot.org/pub/databases/uniprot}), and the EBI
1505(\texttt{ftp.ebi.ac.uk/pub/databases}).
1506
1507\emph{Protein Sequence Databases} -- Protein sequence databases are
1508available from the NCBI, UniProt, and the EBI.  The NCBI provides a
1509``raw'' database, \texttt{nr}, and a well-curated, less redundant
1510database, \texttt{refseq\_protein}, and a copy of the very well
1511annotated \texttt{swissprot} database. Protein sequence databases can
1512also be downloaded from UniProt and the EBI; both sites provide the
1513same UniProt\cite{uniprot11} database.
1514
1515Protein libraries, particularly those used for translated-DNA:protein
1516comparisons with \texttt{fastx36} or \texttt{fasty36}, show be scanned
1517to remove low-complexity regions.  Matches between low complexity
1518regions can violate the composition assumptions used by the FASTA
1519statistical estimates. The \texttt{pseg} program (\cite{woo935},
1520\url{ftp://ftp.ncbi.nih.gov/pub/seg/pseg}) can be used to lower-case
1521low complexity regions, which then can be ignored during the initial
1522database search by using the \texttt{-S} option.  To lower-case low
1523complexity regions, run the \texttt{pseg} program against the protein sequence database:
1524\begin{quote}
1525\begin{verbatim}
1526pseg /seqdata/swissprot.fa -z 1 -q > /seqdata/swissprot.lseg
1527\end{verbatim}
1528\end{quote}
1529And then you can run most FASTA programs with \texttt{-S}:
1530\begin{quote}
1531\begin{verbatim}
1532ssearch36 -S mgstm1.aa /seqdata/swissprot.lseg
1533\end{verbatim}
1534\end{quote}
1535
1536Fig. \ref{seg-aln} shows the effect of including the \texttt{-S}
1537option with lower-cased low-complexity sequences.  The \texttt{opt}
1538score (407), which is used to sort the results and calculate
1539statistics, is lower than the Smith-Waterman score (451), even though
1540exactly the same residues are aligned for each score. The \texttt{opt}
1541score excludes residues 19-30, because they were marked as
1542low-complexity by \texttt{pseg}; thus they are shown as lower-case.
1543The Smith-Waterman score includes the contribution from that part of
1544the alignment.
1545
1546Out-of-frame translated DNA sequences often produce low-complexity
1547regions \cite{wrp973}, so it is particularly important to avoid
1548low-complexity alignments when using \texttt{fastx36} and
1549\texttt{fasty36}
1550
1551\subsection{Searching taxonomic subsets}
1552
1553Because increasing database size reduces search sensitivity (an
1554alignment with an $E()$-value of $0.001$ in a search of a 100,000
1555entry database will have an $E()$-value of 0.1, not significant, if
1556found in a database of 10,000,000 sequences), it is much more
1557effective to search smaller, less redundant databases (you can always
1558search the larger database later).  Thus, the \texttt{refseq\_protein}
1559database from the NCBI is preferred over \texttt{nr}; even better are
1560databases that reflect a limited phylogenetic range
1561(e.g. \texttt{refseq\_human} for vertebrate sequences).
1562
1563While the NCBI provides organism-specific \texttt{refseq} subsets on
1564their FTP site, they can be difficult to find.  Alternatively, you can
1565use the NCBI \texttt{Entrez} web site to download a list of
1566\texttt{gi} numbers specific to a particular organism or taxonomic
1567range. The FASTA programs can search a subset of a large sequence
1568database that is specified by a list of \texttt{gi} numbers by using
1569library format 10.  For example, given a list of \texttt{gi} numbers
1570for the human proteins in \texttt{swissprot.lseg}, the file
1571\texttt{sp\_human.db}, with the content:
1572\begin{quote}
1573\begin{verbatim}
1574<${SLIB}/swissprot.lseg 0:2 4|
15753121763
157651701705
15777404340
1578205831112
157974735515
1580...
1581\end{verbatim}
1582\end{quote}
1583could be used to search the human subset of
1584\texttt{swissprot.lseg}. The \texttt{gi} numbers for the SwissProt
1585entries begin with the second line. The first line specifies the
1586location of the file where the sequences containing the \texttt{gi}
1587numbers can be found (\texttt{\$\{SLIB\}/swissprot.lseg}, the
1588\texttt{libtype} of that file (\texttt{0:fasta}), the character offset
1589to the beginning of the sequence identifier in that file (\texttt{2}),
1590the identifier type (\texttt{4}), and the character
1591that separates the fields in the FASTA descriptor (\texttt{|}).  The
1592identifier type can take four formats:
1593
1594\begin{tabular}{l l}
1595\hline\\[-1.5ex]
15961 & ordered accession strings  (letters or numbers)\\
15972 & ordered numbers (digits only) \\
15983 & un-ordered accession strings \\
15994 & un-ordered numbers \\
1600\hline\\
1601\end{tabular}\\
1602(Ordered accession strings/numbers are ordered in both the library and the subset file.)
1603
1604Thus, given the \texttt{0:2 4|} specification above, the line:
1605\begin{quote}
1606\texttt{>gi|3121763|sp|O15143.3|ARC1B\_HUMAN Actin-related protein 2/3 ...}
1607\end{quote}
1608would be parsed, looking for an number starting at column 4 (the first
1609column is numbered 0), and ending with \texttt{|}. The order of
1610sequences in the library do not have to correspond to the order in the
1611\texttt{sp\_human.db} file (un-ordered). Given a the
1612\texttt{sp\_human.db} file, a file \texttt{swissprot.lseg} in the
1613directory specified by the environment variable \texttt{\$\{SLIB\}},
1614and a command of the form:
1615\begin{quote}
1616\texttt{fasta36 -S mgstm1.aa 'sp\_human.db 10'}
1617\end{quote}
1618Would use the \texttt{sp\_human.db} file to search the subset of
1619\texttt{swissprot.lseg} that contained the specified \texttt{gi}
1620numbers.
1621
1622\subsection{DNA sequence libraries}
1623
1624Because of the large size of DNA databases, you will probably want to
1625keep DNA databases in only one format.  The FASTA3 programs that
1626search DNA databases --- \texttt{fasta36}, \texttt{fastm36}, and
1627\texttt{tfastx/y36} --- can read DNA databases in Genbank flatfile (not
1628ASN.1), FASTA, and BLAST2.0 (\texttt{formatdb}) formats, as well as
1629EMBL format.  BLAST2.0 format is preferred for DNA sequence libraries,
1630because the files are considerably more compact than GenBank format.
1631The NCBI does not provide software for converting from Genbank flat
1632files to Blast2.0 DNA databases, but you can use the Blast
1633\texttt{formatdb} program to convert ASN.1 formatted Genbank files,
1634which are available from the NCBI \texttt{ftp} site.
1635
1636The NCBI also provides the comprehensive \texttt{nt} DNA database, and
1637several EST databases in Blast2.0/\texttt{formatdb} format from
1638\texttt{ftp://ncbi.nih.gov/blast/db}.
1639
1640
1641\subsection{Finding the library files}
1642
1643All the FASTA programs comparison programs have the command line syntax:
1644\begin{quote}
1645\texttt{fasta36 query.file /seqdata/library}
1646\end{quote}
1647However, in addition to simply specifying the location of the database
1648to be searched
1649(\texttt{/seqdata/library}), the FASTA programs
1650provide several methods for referring to sequence databases without specifying a specific file.  These methods can be used to provide abbreviations for sequence libraries, e.g.:
1651\begin{quote}
1652\texttt{fasta36 query.file s}
1653or
1654\texttt{fasta36 query.file +sp+}
1655\end{quote}
1656To use abbreviations like \texttt{'s'} or \texttt{'+sp+'} to reference a
1657sequence database, a \texttt{FASTLIBS} file must be used, see section
1658\ref{fastlibs}.
1659
1660Large DNA and protein databases are often distributed across several
1661files.  For example, the NCBI \texttt{nr} protein database is found in
16625 files, \texttt{nr.00} ... \texttt{nr.04}.  To search databases in
1663multiple files, the names of the files are specified in a file of
1664filenames, \texttt{nr.nam}:
1665\begin{quote}
1666\begin{verbatim}
1667<${SLIB}/blast_dbs/
1668nr.00 12
1669nr.01 12
1670nr.02 12
1671nr.03 12
1672nr.04 12
1673\end{verbatim}
1674\end{quote}
1675In this file, the first line \texttt{<\$\{SLIB2\}/blast\_dbs/},
1676beginning with \texttt{<}, specifies the location and format (Blast2.0
1677\texttt{formatdb}) the data files. Text of the form
1678\texttt{\$\{SLIB\}} refers to Unix/MacOSX/Windows environment
1679variables; the value of \texttt{\$\{SLIB\}} is set by a Unix/MacOSX
1680shell environment command. Thus, if the value of \texttt{\$\{SLIB\}}
1681is \texttt{/seqdata}, then the first sequence library file to be read
1682will be \texttt{/seqdata/blast\_dbs/nr.00}, in format 12 (Blast2.0
1683\texttt{formatdb}).
1684
1685To refer to the \texttt{nr.nam} file as a file of file names, it must
1686be prefixed by a \texttt{@} character, e.g.
1687\begin{quote}
1688\texttt{fasta36 query.file \textbf{@}nr.nam}
1689\end{quote}
1690Files of file names can contain references to other files of file names:
1691\begin{quote}
1692\begin{verbatim}
1693<${SLIB}/fasta_dbs/
1694@pdb.nam
1695@swissprot.nam
1696\end{verbatim}
1697\end{quote}
1698The FASTA file of file names is similar to the NCBI
1699\texttt{prot\_db.pal} and \texttt{dna\_db.nal}, files, but
1700unfortunately they are different, and currently FASTA cannot read NCBI
1701\texttt{.pal} or \texttt{.nal} files that contain a \texttt{DBLIST}
1702line. FASTA can read NCBI \texttt{.pal} or \texttt{.nal} files that do
1703not contain a \texttt{DBLIST} line.
1704
1705FASTA version \texttt{fasta-36.3.6} provides an alternative way to
1706generate a database to be searched: the \texttt{!script.sh} file.
1707Like the \texttt{-e expand\_file.sh} script, a shell script or program
1708can be used to produce a database to a temporary file, which is then
1709seached.  For example, if the file \texttt{cat\_db.sh} contains the
1710command \texttt{echo /seqdb/swissprot.lseg}, the command:
1711\begin{quote}
1712\begin{verbatim}
1713fasta36 query.aa \!@cat_db.sh
1714\end{verbatim}
1715\end{quote}
1716will cause \texttt{cat\_db.sh} to produce a temporary file with the
1717line \texttt{swissprot.lseg}, which is interpreted as an indirect file
1718of filenames; thus, because of the \texttt{@}, the file will be
1719interpreted as an indirect file, and the \texttt{swissprot.lseg} file
1720will be searched.  Note that on Unix systems, the \texttt{'!'} must be
1721preceeded by a \texttt{'\textbackslash'} so that it is not interpreted by the
1722shell, as shown above.
1723
1724\subsection{\texttt{FASTLIBS}}
1725\label{fastlibs}
1726
1727All the search programs in the FASTA3 package can use the environment
1728variable \texttt{FASTLIBS} to find the protein and DNA sequence
1729libraries. (Alternatively, you can specify the \texttt{FASTLIBS} file
1730with the \texttt{-l fastlibs.file} option.) The \texttt{FASTLIBS}
1731variable contains the name of a file that has the actual filenames of
1732the libraries.  The \texttt{fastlibs} file included with the
1733distribution is an example of a file that can be referred to by
1734FASTLIBS. To use the \texttt{fastlibs} file, type:
1735\begin{quote}
1736\texttt{setenv FASTLIBS /seqdata/info/fastgbs} (csh/tcsh)\\
1737 or\\
1738\texttt{export FASTLIBS=/seqdata/info/fastgbs} (bash/ksh)
1739\end{quote}
1740 Then edit the \texttt{fastlibs} file to indicate the location of the
1741 protein and DNA sequence libraries.  If the protein sequence library is
1742 kept in the file \texttt{/seqdata/aa/swissprot.lseg} and your Genbank
1743 DNA sequence library is kept in the directory:
1744 \texttt{/seqdata/genbank}, then the \texttt{fastlibs} file might
1745 contain:
1746%%\pagebreak
1747\begin{verbatim}
1748SwissProt$0P/seqdata/aa/swissprot.lseg 0
1749UniProt$0+uniprot+@/seqdata/aa/uniprot.nam
1750GB Primate$1P@/seqdata/genbank/gpri.nam
1751GB Rodent$1R@/seqdata/genbank/grod.nam
1752GB Mammal$1M@/seqdata/genbank/gmammal.nam
1753^   1    ^^^^       4                ^ ^
1754          23                           (5)
1755\end{verbatim}
1756The first line of this file says that there is a copy of the SwissProt
1757sequence database (a protein database) that can be selected by typing
1758"P" on the command line or when the database menu is presented in
1759interactive mode.
1760
1761Note that there are 4 (or 5) fields in the lines in the
1762\texttt{fastlibs} file.  The first field describes library and is
1763displayed by FASTA program; it ends with the '\$'.  The second field
1764(1 character), is a 0 if the library is a protein library and 1 if it
1765is a DNA library.  The third field can either be a single character
1766(\texttt{P}) or a word surrounded by the \texttt{+} symbol
1767(\texttt{+uniprot+}), and can be used to specify the library on the command line or in interactive mode.
1768
1769The fourth field is the name of the library file.  In the example
1770above, the \texttt{/seqdata/aa/swissprot.lseg} file contains the
1771entire protein sequence library.  Alternatively,
1772\texttt{/seqdata/aa/uniprot.nam} is a file of file names, which
1773contains a list of one or more library files.  Likewise, the DNA
1774library files are files of file names.
1775
1776In addition, an optional fifth field can be used to specify the format
1777of the library file.  Alternatively, you can specify the library
1778format in a file of file names.  This field must be separated from the
1779file name by a space character ('\ ') from the filename.  FASTA can
1780read the libraries in the following formats:\\
1781
1782\begin{tabular}{r l}
17830 & FASTA (\texttt{>SEQID} - comment/sequence) \\
17841 & Uncompressed Genbank (LOCUS/DEFINITION/ORIGIN)\\
17852 & NBRF CODATA (ENTRY/SEQUENCE) (obsolete)\\
17863 & EMBL/SWISS-PROT (ID/DE/SQ)\\
17874 & Intelligenetics (;comment/SEQID/sequence) (obsolete)\\
17885 & NBRF/PIR VMS (\texttt{>P1;SEQID}/comment/sequence) (obsolete)\\
17896 & GCG (version 8.0) Unix Protein and DNA (compressed)\\
17907 & FASTQ (sequence only, quality ignored)\\
179110 & subset format (</slib2/swissprot.lseg 0:2 4|) \\
179211 & NCBI Blast1.3.2 format  (unix only) (obsolete)\\
179312 & NCBI Blast2.0 format\\
179416 & MySQL (requires special compilation) \\
179517 & Postgres (requires special compilation) \\
1796\end{tabular}
1797
1798Today, the most popular formats are \texttt{FASTA}, type \texttt{'0'},
1799the default, and the NCBI Blast2.0 \texttt{formatdb} formats (type
1800\texttt{'12'}).  The FASTA programs cannot read NCBI ASN.1 formatted databases.
1801If a library format is not specified, for example, because
1802you are just comparing two sequences, FASTA (format 0) is used by
1803default. To specify a library type on the command line, add it to the
1804library filename and surround the filename and library type in quotes:
1805\begin{quote}
1806\begin{verbatim}
1807fasta36 query.file "/seqdb/genbank/gbmam 12"
1808\end{verbatim}
1809\end{quote}
1810NCBI \texttt{formatdb} databases are built from multiple files,
1811e.g. \texttt{gbmam.nsq}, \texttt{gbmam.nhr}, \texttt{gbmam.nin}; to
1812refer to the complete set of files, simply use name before the
1813suffixes, e.g. \texttt{gbmam}.  When NCBI databases distributed across
1814several files, e.g. \texttt{gbbct.00}, \texttt{gbbct.01}, etc, those
1815files must be included in a \texttt{gbbct.nam} file of file names.
1816
1817FASTA subset format ({\tt 10}) allows users to search a subset of a
1818sequence database, by specifying a list of {\tt gi} numbers or accessions in
1819a larger database.  The format begins with a line naming the file
1820sequence file followed by information about how to
1821extract the {\tt gi} number or accession. Thus, the line.
1822\begin{quote}
1823\texttt{<library\_file lib\_fmt:id\_fmt id\_loc}
1824\end{quote}
1825where {\tt lib\_fmt} is the library format (0), {\tt id\_fmt} is the
1826format of the sequence identifier (:1, :2 - ordered strings or
1827numbers; :3, :4 - unordered strings or numbers),
1828and {\tt id\_loc} is the location of the sequence identifier. For example,
1829\begin{quote}
1830\begin{verbatim}
1831</slib2/blast/swissprot.lseg 0:2 4|
18323121763
183351701705
18347404340
183574735515
1836...
1837\end{verbatim}
1838\end{quote}
1839specifies the file containing all the sequences and the file is in
1840FASTA format ('0:'), the sequence identifier is a number (':2'),
1841and the identifier starts at character 4 and ends with the \texttt{'|'} symbol.
1842
1843The major problem that most new users of the FASTA package have is in
1844setting up the program to find the databases and their library type.
1845In general, if you cannot get \texttt{fasta36} to read a sequence
1846database, there is probably something wrong with the \texttt{FASTLIBS}
1847file.  A common problem is that the database file is found, but either
1848no sequences are read, or an incorrect number of entries is read.
1849This is almost always because the library format (\texttt{libtype}) is
1850incorrect.
1851
1852Test the setup by running FASTA.  Enter the sequence
1853file '\texttt{mgstm1.aa}' when the program requests it (this file is
1854included with the programs).  The program should then ask you to
1855select a protein sequence library.  Alternatively, if you run the
1856\texttt{tfastx36 -I} program and use the mgstm1.aa query sequence, the program
1857should show you a selection of DNA sequence libraries.
1858Once the \texttt{fastlibs} file has been set up correctly, you can
1859set FASTLIBS=fastgbs in your AUTOEXEC.BAT file, and you will not need to
1860remember where the libraries are kept or how they are named.
1861
1862%%\pagebreak
1863\section{Frequently Asked Questions (FAQs)}
1864
1865{\noindent}\textbf{Where can I get FASTA?} --
1866\url{http://faculty.virginia.edu/wrpearson/fasta} has the latest
1867versions of the FASTA programs.  This document describes
1868\texttt{\CURRENT}, which is available from
1869\url{http://faculty.virginia.edu/wrpearson/fasta/fasta3.tar.gz}.
1870In addition, pre-compiled versions of the programs are available for
1871MacOSX and Windows.
1872
1873\needspace{4\baselineskip}
1874{\noindent}\textbf{Which program should I use?} -- See Table I, also:\\
1875
1876\begin{tabular}{l l l l l }
1877\hline \\[-1.0ex]
1878Query & Library & FASTA pgm. & BLAST pgm. & \\[1.2ex]
1879\hline \\[-1.0ex]
1880Prot. & Prot. & \texttt{fasta36} & \texttt{blastp} & heuristic local similarity \\
1881 &  & \texttt{ssearch36} &  & optimal local sim.\\
1882 &  & \texttt{ggearch36} &  & global:global sim. \\
1883 &  & \texttt{ggearch36} &  & global:local sim.\\
1884DNA & DNA & \texttt{fasta36}$^*$ & \texttt{blastn} & \\[1.2ex]
1885\hline \\[-1.0ex]
1886Prot. & Prot. & \texttt{lalign36} & & multiple non-intersecting \\
1887DNA & DNA & & & alignments \\[1.2ex]
1888\hline \\[-1.0ex]
1889DNA & Prot. & \texttt{fastx36} & \texttt{blastx} & trans. DNA:protein sim. \\
1890 &  & \texttt{fasty36} & & \\[1.2ex]
1891\hline \\[-1.0ex]
1892Prot. & DNA & \texttt{tfastx36} & \texttt{blastn} & protein:trans. DNA \\
1893 &  & \texttt{tfasty36} & & \\[1.2ex]
1894\hline \\[-1.0ex]
1895Prot. & Prot. & \texttt{fasts36} & & Unordered peptides \\
1896Prot. & DNA & \texttt{tfasts36} & & Unordered peptides \\
1897DNA & DNA & \texttt{fasts36} & & Unordered oligonucleotides \\
1898Prot. & Prot. & \texttt{fastm36} & & Ordered peptides \\
1899DNA & DNA & \texttt{fastm36} & & Ordered oligos \\[1.2 ex]
1900\hline \\[-1.0ex]
1901\multicolumn{5}{l}{$^*$\texttt{ssearch36} can also be used for DNA:DNA, but is much slower and no more sensitive.}\\[0.2ex]
1902\hline \\
1903\end{tabular}
1904
1905\needspace{4 ex}
1906{\noindent}\textbf{How do I make FASTA act/look like BLAST}? --
1907\vspace{-0.5ex}
1908\begin{quote}
1909\texttt{fasta36 -s BP62 -m BB query.file library.file}
1910\end{quote}
1911\vspace{-0.5ex}
1912\texttt{-s BP62} sets the same scoring matrix (BLOSUM62) and
1913gap-penalties (-11/-1) as BLAST (FASTA uses BLOSUM50 by
1914default). \texttt{-m BB} produces very BLAST-like output.
1915
1916In addition, the \texttt{-m 8} and \texttt{-m 8C} options provide
1917BLAST tabular output, optionally with comments (\texttt{-m 8C}). This
1918compact output is effective for analysis pipelines.  In addtion,
1919\texttt{-m 8XC} (no comments) or \texttt{-m 8CC} provides two
1920additional blast-tabular fields, a CIGAR alignment string and (if
1921available), an annotation string.
1922
1923{\noindent}\textbf{When I search Genbank - the program reports:} \texttt{0 residues in 0
1924sequences}?  This typically happens because the program does not
1925know that you are searching a Genbank flatfile database and is looking
1926for a FASTA format database.  Be certain to specify the library type
1927("1" for Genbank flatfile) with the database name.
1928
1929{\noindent}\textbf{The search seemed to work, but I do not see any results.} -- In
1930command line mode (the default), all the FASTA programs limit the
1931number of high scoring sequences shown using an expectation value
1932cutoff ($E()<10$ for proteins; $E()<2$ for DNA).  Sometimes, a search
1933will complete successfully (you see the message \texttt{XXXX residues
1934  in YYY sequences}) but the message: \texttt{!! No sequences with E()
1935  < 10} instead of \texttt{The best scores are:}.  Typically, this
1936happens because of a problem with the statistical estimation process;
1937in particular, if the library contains only related sequences and
1938\texttt{-z 11} was not used, none of the hits may be ``significant''.
1939To trouble shoot this problem, you can search with \texttt{-z -1},
1940which turns off all the statistical estimation procedures, and will
1941show the 20 highest scoring sequences (\texttt{-b \#} sets the default
1942number of sequences shown).
1943
1944{\noindent}\textbf{What is the difference between} \texttt{fastx3} and
1945\texttt{fasty3}? (or \texttt{tfastx3} and \texttt{tfasty3})? --
1946\texttt{[t]fastx3} uses a simpler codon based model for alignments
1947that does not allow frameshifts in some codon positions (see
1948ref. \cite{wrp971}).  \texttt{fastx3} is about 30\% faster, but
1949\texttt{fasty3} can produce higher quality alignments in some cases.
1950
1951\vspace{0.5ex}
1952{\noindent}\textbf{What is ktup}? -- All of the programs with \texttt{fast} in their
1953name use a computer science method called a lookup table to speed the
1954search.  For proteins with \emph{ktup}=2, this means that the program
1955does not look at any sequence alignment that does not involve matching
1956two identical residues in both sequences.  Likewise with DNA and
1957\emph{ktup} = 6, the initial alignment of the sequences looks for 6
1958identical adjacent nucleotides in both sequences.  Because it is less
1959likely that two identical amino-acids will line up by chance in two
1960unrelated proteins, this speeds up the comparison.  But very distantly
1961related sequences may never have two identical residues in a row but
1962will have single aligned identities.  In this case, \emph{ktup} = 1 may
1963find alignments that \emph{ktup}=2 misses.
1964
1965\vspace{0.5ex} {\noindent}\textbf{How do I turn off statistics}? --
1966The FASTA programs are designed to identify homologs based on
1967statistically significant similarity; to infer homology you need
1968accurate statistical estimates.  Sometimes, however, you know the
1969sequences are related, and searching against libraries of related
1970sequences can confuse FASTA if you do not use \texttt{-z 11}. If all
1971you want are scores and alignments, use \texttt{-z -1} to turn off
1972statistical estimates.
1973
1974\vspace{0.5ex}
1975{\noindent}\textbf{Where are} \texttt{prss} {\noindent}\textbf{and} \texttt{prfx}? -- Earlier FASTA3
1976releases included \texttt{prss3} and \texttt{prfx3}. With FASTA
1977version 35 and 36, these programs have been incorporated into
1978\texttt{ssearch36} and \texttt{fastx36}.  FASTA version 35 and 36
1979programs now automatically estimate statistical parameters by
1980shuffling - the function of \texttt{prss} and \texttt{prfx}, when
1981searching for libraries with fewer than 500 members.
1982
1983\vspace{0.5ex}
1984{\noindent}\textbf{Where is} \texttt{tfasta}? -- Although it is possible to make
1985\texttt{tfasta36}, it is not compiled by default.  \texttt{tfastx36}
1986and \texttt{tfasty36} allow frame-shifts to be joined into a single
1987alignment; \texttt{tfasta} did not. \texttt{tfastx36} produces better
1988alignments with better statistics.
1989
1990\vspace{0.5ex}
1991{\noindent}\textbf{Can I run the FASTA programs on a cluster}? -- With version
199236.3.4, almost all of the FASTA programs can be run on clusters of
1993computers using MPI (Message Passaging Interface).  The programs can
1994be compiled using \texttt{make -f ../make/Makefile.mpi\_sse2} from the
1995\texttt{fasta36/src} directory.  Except for \texttt{lalign36}, all the
1996programs in Table I are available as \texttt{fasta36\_mpi},
1997\texttt{ssearch36\_mpi}, etc.
1998
1999Unfortunately, the current MPI implementation involves substantially
2000more communications overhead than the threaded versions.  The FASTA
2001programs are very efficient on threaded machines; if the preload
2002option is used (edit \texttt{make/Makefile36m.common} to use
2003\texttt{comp\_lib8.c}), the FASTA programs can obtain more than 40-fold speedup on a 48-core machine (the largest I have tested).
2004
2005\vspace{0.5ex}
2006{\noindent}\textbf{Sometimes, in the list of best scores, the same sequence is
2007  shown twice with exactly the same score.  Sometimes, the sequence is
2008  there twice, but the scores are slightly different}? -- When any of
2009the FASTA programs searches a long sequence, it breaks the sequence up
2010into \emph{overlapping} pieces.  If the highest scoring alignment is
2011at the end of one piece, it will be scored again at the beginning of
2012the next piece.  If the alignment is not be completely included in the
2013overlap region, one of the pieces will give a higher score than the
2014other.  These duplications can be detected by looking at the
2015coordinates of the alignment.  If either the beginning or end
2016coordinate is identical in two alignments, the alignments are at least
2017partially duplicates.
2018
2019\vspace{2ex}
2020As always, please inform me of bugs as soon as possible.
2021
2022\begin{quote}
2023William R. Pearson\\
2024Department of Biochemistry\\
2025Jordan Hall Box 800733\\
2026U. of Virginia\\
2027Charlottesville, VA\\
2028wrp@virginia.EDU
2029\end{quote}
2030
2031\bibliographystyle{plain}
2032\bibliography{fasta_guide}
2033
2034\appendix
2035\section*{Appendix}
2036
2037\section{FASTA Makefile compile time options}
2038
2039\begin{table}
2040\caption{\label{make-defs}FASTA \texttt{Makefile} compile time \texttt{\#defines}}
2041\vspace{1.0ex}
2042\begin{tabular}{l l p{1.00 in} p{3.0 in}}
2043\hline\\[-1.2ex]
2044\texttt{\#define} & Status$^*$ & Target file(s) & Function \\[1.0ex]
2045\hline\\[-1.5ex]
2046\texttt{ALLOCN0} & obs & \texttt{dropnfa.c}, \texttt{dropfx.c}, \texttt{dropfz2.c} & allows FASTA algorithm to use memory $\sim$ query length (n0), not query $+$ library (n0+n1). \\
2047\texttt{DNALIB\_LC} & undef & \texttt{initfa.c} & enable lower case masking for DNA libraries \\
2048\texttt{HTML\_HEAD} & undef & \texttt{comp\_lib5e.c}, \texttt{comp\_lib8.c} & wrap \texttt{-m 6} HTML output with \texttt{<html> <body> </body> </html>} \\
2049\texttt{M10\_CONS} & def & \texttt{c\_dispn.c} & show consensus line (\texttt{:. }) with \texttt{-m 10} output. \\
2050\texttt{OLD\_FASTA\_GAP} & undef & \texttt{drop*.c} & use first-residue/additional residue penalties, not open/extend. \\
2051\texttt{PGM\_DOC} & def & \texttt{comp\_lib5e.c}, \texttt{comp\_lib8.c}  & provide \texttt{\#pgm\_name -opt1 -opt2 query file} copy of command line \\
2052\texttt{PROGRESS} & def & \texttt{comp\_lib5e.c}, \texttt{comp\_lib8.c}  & provide progress symbols in interactive mode \\
2053\texttt{SAMP\_STATS} & def & \texttt{comp\_lib5e.c}, \texttt{comp\_lib8.c}  & scores are sampled for statistical estimates \\
2054\texttt{SAMP\_STATS\_LESS} & def & \texttt{compacc.c} & a slower sampling strategy is used \\
2055\texttt{SHOW\_ALIGN\_SCORE} & undef & \texttt{wm\_align.c} & print score, cummulative score, during alignment (for teaching) \\
2056\texttt{SHOW\_HELP} & def & \texttt{comp\_lib5e.c}, \texttt{comp\_lib8.c}, \texttt{initfa.c}, \texttt{doinit.c} & print out help information with '-help', or no arguments given.  Undef \texttt{SHOW\_HELP} reverts to pre-\texttt{fasta-35.4.4}.\\
2057\texttt{SHOW\_HIST} & undef & \texttt{doinit.c} & inverts current meaning of \texttt{-H} (shows by default for non-PCOMPLIB (MPI) programs). \\
2058\texttt{SHOWSIM} & def & \texttt{mshowbest.c} \texttt{mshowalign2.c} & display percent similarity \\
2059\texttt{USE\_LNSTATS} & obs & \texttt{scaleswn.c} & use $ln()$-scaling for scores, removed in \texttt{fasta2.0}.\\
2060\hline \\
2061\end{tabular}
2062$^*$Status: def: \#defined in standard \texttt{Makefiles}; undef: undefined; obs: obsolete, provided backwards compatibility with FASTA2.0 or earlier.
2063\end{table}
2064
2065The \texttt{fasta-36/make} directory includes \texttt{Makefile}s
2066appropriate for a broad range of environments, including Linux/Unix,
2067BSD, MacOSX, and Windows.  Makefiles are regularly tested against
2068MacOSX, Linux, and Windows. Table \ref{make-defs} summarizes the
2069Makefile options that can be modified.
2070
2071As distributed, the \texttt{Makefiles} in \texttt{fasta36/make}, build
2072a version of the FASTA programs that is optimized for single searches
2073against arbitrary sized databases, using bit scores, efficient sampled
2074statistics, and gap-open/extend penalties.  The default compilation
2075configuration can be changed either by changing the compile time
2076defines (Table \ref{make-defs}) in the main \texttt{Makefile},
2077e.g. \texttt{make/Makefile.linux64\_sse2}, or by editing
2078\texttt{make/Makefile36m.common}.
2079
2080\emph{High-performance searches with many queries} -- By default, the
2081\texttt{comp\_lib5e.c} program specified in
2082\texttt{Makefile36m.common} builds FASTA programs that re-read the
2083library sequence database for every query sequence.  This has the
2084advantage that sequence comparison begins almost immediately, but if
2085thousands of searches are being performed, the database is re-read
2086thousands of times.  \texttt{Makefile36m.common} can be edited to use
2087\texttt{comp\_lib8.c} in place of \texttt{comp\_lib5e.c} and the
2088database is read only once, then held in memory for additional
2089searches.  Of course, if \texttt{comp\_lib8.c} is used, the computer
2090must have enough memory to store the complete database.  Keeping the
2091database in memory allows the FASTA programs to very efficiently used
2092large, multicore computers.
2093
2094\needspace{5\baselineskip}
2095\emph{Parallel searches with MPI} -- By default under
2096Unix/Linux/MacOSX, the FASTA programs are threaded; they will spawn as
2097many threads as CPU cores are available (this can be limited with the
2098\texttt{-t n-threads} option).  Using \texttt{comp\_lib8.c}, we see
2099almost 48-fold speedup on a 48-core machine.  The FASTA programs can
2100also be run in parallel in the MPI environment on clusters of
2101computers.  To build the MPI versions of the programs, use
2102\texttt{make ../make/Makefile.mpi\_sse2 ssearch36\_mpi},
2103\texttt{fastx36\_mpi}, etc.  The MPI programs currently substantially
2104more communications overhead than the threaded versions, so they may
2105not scale as well to large clusters.
2106
2107\include{fasta.history}
2108\end{document}
2109