1\documentclass[11pt]{article} 2\RequirePackage{helvet} 3\newcommand{\CURRENT}{fasta-36.3.8} 4\renewcommand{\familydefault}{\sfdefault} 5\usepackage{cite} 6\usepackage{url} 7\usepackage{fancyhdr} 8\usepackage{url} 9\usepackage{needspace} 10\usepackage{longtable} 11\addtolength{\oddsidemargin}{-0.75in} 12\addtolength{\evensidemargin}{-0.75in} 13\addtolength{\textwidth}{1.5in} 14\addtolength{\topmargin}{-0.75in} 15\addtolength{\textheight}{1.5in} 16 17\lhead{\CURRENT} 18\rhead{\today} 19\cfoot{\thepage} 20\pagestyle{fancy} 21\newcommand{\FASTA}{\texttt{FASTA }} 22\hyphenation{Swiss-Prot} 23 24\parskip 0.5ex 25 26\begin{document} 27%% .he 'FASTA3.DOC''Release 3.6, March, 2011' 28 29\section*{\Large{The FASTA program package}} 30 31%% \begin{quote} 32%% \emph{This document is undergoing extensive revision. 33%% Some parts of it are old; while still accurate, they are less 34%% relevant. Recent improvements to the FASTA programs are not well 35%% documented, particularly with respect to options for selecting 36%% databases and database sequences.} 37%% \end{quote} 38 39\section*{Introduction} 40 41This documentation describes the version 36 of the FASTA program 42package (see W. R. Pearson and D. J. Lipman (1988), ``Improved Tools 43for Biological Sequence Analysis'', PNAS 85:2444-2448,\cite{wrp881} 44W. R. Pearson (1996) ``Effective protein sequence comparison'' 45Meth. Enzymol. 266:227-258 \cite{wrp960}; and Pearson et. al. (1997) 46Genomics 46:24-36 \cite{wrp973}. Version 3 of the FASTA packages 47contains many programs for searching DNA and protein databases and for 48evaluating statistical significance from randomly shuffled sequences. 49 50This document is divided into four sections: (1) A summary overview of 51the programs in the FASTA3 package; (2) A guide to using the FASTA 52programs; (3) A guide to installing the programs and 53databases. Section (4) provides answers to some Frequently Asked 54Questions (FAQs). In addition to this document, the 55\texttt{changes\_v36.html}, \texttt{changes\_v35.html} and 56\texttt{changes\_v34.html} files list functional changes to the programs. 57The \texttt{readme.v30..v36} files provide a more complete revision 58history of the programs, including bug fixes. 59 60The programs are easy to use; if you are using them on a machine that 61is administered by someone else, you can focus on sections (1) and (2) 62to learn how to use the programs. If you are installing the programs 63on your own machine, you will need to read section (3) carefully. 64 65\emph{FASTA and BLAST} -- FASTA and BLAST have the same goal: to 66identify statistically significant sequence similarity that can be 67used to infer homology. The FASTA programs offer several advantages 68over BLAST: 69\begin{enumerate} 70\item 71Rigorous algorithms unavailable in BLAST (Table I). Smith-Waterman 72(\texttt{ssearch36}), global: global (\texttt{ggsearch36}), and 73global:local (\texttt{glsearch36}) programs are available, and these 74programs can be used with \texttt{psiblast} PSSM profiles. 75\item 76Better translated alignments. \texttt{fastx36}, \texttt{fasty36}, 77\texttt{tfastx36}, and \texttt{tfastx36} allow frame-shifts in 78alignments; frame-shifts are treated like gap-penalties, alignments 79tend to be longer in error-prone reads. 80\item 81Better statistics. BLAST calculates very accurate statistics for 82protein:protein alignments, but its model-based strategy is less 83robust for translated-DNA:protein and DNA:DNA scores. FASTA uses an 84empirical estimation strategy, and now provides both search-based, and 85high-scoring shuffle-based statistics (\texttt{-z 21}). 86\item 87More flexible library sequence formats. The FASTA programs can read 88FASTA, NCBI/ \texttt{formatdb}, and several other sequence formats, and can 89directly query MySQL and Postgres databases. The programs offer 90several strategies for specifying subsets of databases. 91\item 92A very efficient threaded implementation. The FASTA programs are 93fully threaded; both similarity scores and alignments can be 94calculated in parallel on multi-core hardware. On multi-core 95machines, FASTA can be faster than BLAST while producing better 96alignments with more accurate statistical estimates. 97\item 98 A powerful annotation facility. The FASTA programs can incorporate 99 functional site annotations, site variation, and domain-based 100 sub-alignment scoring using annotations from sequence libraries. 101 Scripts are available to download site and domain information from 102 Uniprot, and domain information from Pfam and CATH. Domain 103 information can be used to sub-divide alignment scores to ensure 104 that the aligned domain is homologous. 105 106\end{enumerate} 107 108In addition, the FASTA programs from \texttt{fasta-36.3.4} on provide 109an option to produce very BLAST-like output (\texttt{-m BB}), so that 110analysis pipelines require minimal modification. 111 112\section{An overview of the \texttt{FASTA} programs} 113 114\begin{table} 115\caption{\label{table1} Comparison programs in the FASTA36 package} 116\vspace{0.5ex} 117\begin{tabular}{ p{0.8in} p{0.6in} p{4.6 in}} 118\hline \\[-1.0ex] 119FASTA \mbox{program} & BLAST equiv. & Description \\[1.2ex] 120\hline \\[-1.0ex] 121\texttt{fasta36} & \texttt{blastp}/ \texttt{blastn} & 122Compare a protein sequence to a protein sequence 123database or a DNA sequence to a DNA sequence database using the FASTA 124algorithm \cite{wrp881,wrp960}. Search speed and selectivity are 125controlled with the \emph{ktup}(wordsize) parameter. For protein 126comparisons, \emph{ktup} = 2 by default; \emph{ktup} =1 is more sensitive 127but slower. For DNA comparisons, \emph{ktup}=6 by default; \emph{ktup}=3 or 128\emph{ktup}=4 provides higher sensitivity.\\[1 ex] 129 130\texttt{ssearch36} & & Compare a protein sequence to a protein sequence 131database or a DNA sequence to a DNA sequence database using the 132Smith-Waterman algorithm \cite{wat815}. \texttt{ssearch36} uses SSE2 133acceleration, and is only 2 - 5X slower than \texttt{fasta36} \cite{farrar2007}. \\[1 ex] 134 135\texttt{ggsearch36}/ \texttt{glsearch36} & & Compare a protein sequence to a protein sequence 136database or a DNA sequence to a DNA sequence database using 137an optimal global:global (\texttt{ggsearch36}) or global:local 138(\texttt{glsearch36}) algorithm.\\[1 ex] 139 140\texttt{fastx36}/ \texttt{fasty36} & \texttt{blastx} & 141Compare a DNA sequence to a protein 142sequence database, by comparing the translated DNA sequence in three 143frames and allowing gaps and frameshifts. \texttt{fastx36} uses a 144simpler, faster algorithm for alignments that allows frameshifts only 145between codons; \texttt{fasty36} is slower but can produce better alignments 146because frameshifts are allowed within codons \cite{wrp971}.\\[1 ex] 147 148\texttt{tfastx36}/ \texttt{tfasty36}& \texttt{tblastn} & 149Compare a protein sequence to a DNA sequence 150database, calculating similarities with frameshifts to the forward and 151reverse orientations \cite{wrp971}.\\[1 ex] 152 153\texttt{fastf36/ tfastf36} & & 154Compares an ordered peptide mixture, as would be obtained by 155Edman degradation of a CNBr cleavage of a protein, against a protein 156(\texttt{fastf}) or DNA (\texttt{tfastf}) database \cite{wrp021}.\\[1 ex] 157 158\texttt{fasts36/ tfasts36} & & 159Compares set of short peptide fragments, as would be obtained 160from mass-spec. analysis of a protein, against a 161protein (\texttt{fasts}) or DNA (\texttt{tfasts}) database \cite{wrp021}.\\[1 ex] 162 163\texttt{lalign36} & & Calculate multiple, non-intersecting alignments 164using the sim2 implementation of the Waterman-Eggert 165algorithm\cite{wat875} developed by Xiaoqui Huang and Web 166Miller\cite{mil908}. Statistical estimates are calculated from 167Smith-Waterman scores of shuffled sequences. \\[1 ex] 168 169\hline \\ 170\end{tabular} 171\end{table} 172 173Although there are a large number of programs in this package, they 174belong to three groups: (1) Traditional similarity searching programs: 175\texttt{fasta36}, \texttt{fastx36}, \texttt{fasty36}, 176\texttt{tfastx36}, \texttt{tfasty36}, \texttt{ssearch36}, 177\texttt{ggsearch36}, and \texttt{glsearch36}; (2) Programs for 178searching with short fragments: \texttt{fasts36}, \texttt{fastf36}, 179\texttt{tfasts36}, \texttt{tfastf36}, and \texttt{fastm36}; (3) A 180program for finding non-overlapping local alignments: \texttt{lalign36}. 181Programs that start with \texttt{fast} search protein databases, while 182\texttt{tfast} programs search translated DNA databases. Table I 183gives a brief description of the programs. 184 185In addition, there are several programs included. \texttt{map\_db} is 186used to index FASTA format sequence databases for more efficient 187scanning. \texttt{scripts/lav2plt.pl} can plot the \texttt{.lav} 188files produced by \texttt{lalign -m 11} as postscript 189(\texttt{lav2plt.pl --dev ps}) or SVG (\texttt{lav2plt.pl --dev svg}) output. 190 191\section{Using the FASTA Package} 192\subsection{Introduction/Overview} 193 194All the FASTA sequence comparison programs use similar command line 195options and arguments. The simplest command line arguments are (in 196order): the name of a query sequence file, a library file, and 197(possibly) the \emph{ktup} parameter. If command line options are 198provided, they \emph{must} precede the standard query-file and 199library-file arguments. Thus: 200\begin{quote} 201\texttt{fasta36 -s BP62 query.file library.file} 202\end{quote} 203will compare the sequences in \texttt{query.file} with those in 204\texttt{library.file} using the \texttt{BLOSUM62} scoring matrix with 205BLASTP gap penalties (\texttt{-11/-1}). 206 207The program can also be run by typing: 208\begin{quote} 209\texttt{fasta36 -I} 210\end{quote} 211which presents the ``classic'' interative mode (this was 212the default behavior before version \texttt{36.3.4}). 213In interactive mode, 214you will be prompted for: (1) the name of the test sequence file; (2) 215the name of the library file; (3) whether you want ktup = 1 or 2. ( 2161 -- 6 for DNA sequences). 217 218Current versions of the FASTA programs expect a query file and library, if you simply type ``\texttt{fasta36}'', you will see a short help message: 219\begin{footnotesize} 220\begin{quote} 221\begin{verbatim} 222% ssearch36 223USAGE 224 ssearch36 [-options] query_file library_file 225 ssearch36 -help for a complete option list 226 227DESCRIPTION 228 SSEARCH performs a Smith-Waterman search 229 version: 36.3.4 Mar, 2011 230 231COMMON OPTIONS (options must precede query_file library_file) 232 -s: [BL50] scoring matrix; 233 -f: [-10] gap-open penalty; 234 -g: [-2] gap-extension penalty; 235 -S filter lowercase (seg) residues; 236 -b: high scores reported (limited by -E by default); 237 -d: number of alignments shown (limited by -E by default); 238 -I interactive mode; 239\end{verbatim} 240\end{quote} 241\end{footnotesize} 242``\texttt{fasta36 -help}'' (or any of the other program names in Table 243I) provides complete listing of the options available for the program 244and their default values. 245 246The package includes several test files. To check to make certain 247that everything is working, you can try: 248\begin{quote} 249\begin{verbatim} 250fasta36 ../seq/musplfm.aa ../seq/prot_test.lib 251or 252tfastx36 ../seq/mgstm1.aa ../seq/gst.nlib 253\end{verbatim} 254\end{quote} 255 256\subsection{Sequence files} 257 258The \texttt{fasta36} programs can read query and library files in many 259standard formats (Section \ref{fastlibs}). The default file format for query and library files 260-- the format that will be used if no additional file format 261information is provided -- is \texttt{FASTA} format. Like 262\texttt{BLAST}, version 36 can compare a query file with multiple 263query sequences to a sequence database, performing an independent 264search with each sequence in the query file. 265 266FASTA format files consist of a description line, beginning 267with a '$>$' character, followed by the sequence itself: 268\begin{quote} 269\begin{verbatim} 270>sequence name and description 1 271A F A S Y T .... actual sequence. 272F S S .... second line of sequence. 273>sequence name and description 2 274PMILTYV ... sequence 2 275\end{verbatim} 276\end{quote} 277All of the characters of the description line are read, and special 278characters can be used to indicate additional information about the 279sequence. In general, non-amino-acid/non-nucleotide sequences in the 280sequence lines are ignored. 281 282FASTA format files from major sequence distributors, like the NCBI and 283EBI, have specially formatted description lines, e.g.:\\ 284\indent 285\texttt{ 286>gi|54321|ref|np\_12345| example NCBI refseq sequence\\ 287} 288or\\ 289\indent 290\texttt{ 291>sw:gstm1\_human P01234 glutathione transferase GSTM1 - human\\ 292} 293 294Several sample test files are included with the FASTA distribution: 295\texttt{seq/*.aa} and \texttt{seq/*.seq}, as well as two small sequence 296libraries, \texttt{seq/prot\_test.lib} and \texttt{seq/gst.nlib}. 297 298You can build your own library by concatenating several sequence 299files. Just be sure that each sequence is preceded by a line 300beginning with a '\texttt{>}' followed by a sequence name/description. Sequences 301entered with word processors should use a ``text'' mode, e.g. ``Save as 302text'' with MS-WORD, with end of line characters and no special 303formatting characters in the file. The FASTA program cannot read 304Microsoft Word .DOC files, or rich text (.RTF) files; query and 305library sequence files should contain only sequence descriptions, 306sequences, and end-of-line characters. 307 308\subsection{Running the programs} 309As mentioned earlier, the FASTA programs can be run either 310interactively, by typing the name of a FASTA program (and possibly 311command line options), followed by \texttt{-I} (\texttt{fasta36 -I}) 312or from the command line, entering command line options, and the 313query and library file names. For searches of large databases that 314may take several minutes (or longer), it is more convenient 315to run searches from the command line, e.g.: 316\begin{quote} 317\begin{verbatim} 318fasta36 query.file library.file > output.file 319\end{verbatim} 320\end{quote} 321The command line shown above could be typed in a Unix or MacOSX 322terminal window, or from the MS-Windows command line interface 323(command.exe). The command line syntax shown above works for all 324the FASTA programs, e.g.: 325\begin{quote} 326\begin{verbatim} 327lalign36 mchu.aa mchu.aa > mchu.laln 328fastx36 mgstm1.seq prot_test.lseg > mgstm1.fx_out 329ssearch36 mgstm1.aa xurtg.aa > mgstm1_xurtg.ss 330\end{verbatim} 331\end{quote} 332 333\emph{Command line options} -- The FASTA programs provide a variety of 334command line options that modify the default scoring matrix 335(\texttt{-sBL62}) and gap penalties (\texttt{-f -11}, \texttt{-g -1}), other 336algorithm parameters, the output options (\texttt{-E 0.1}, \texttt{-d 20}, 337\texttt{-m 9i}), and statistical procedures (\texttt{z -2}). A complete 338list of command line options is shown near the end of this document. 339Unlike the \texttt{BLAST} programs, all \texttt{FASTA} command line options 340must precede the query file name and library file name (and there are 341no command line options available to specify the query and library 342file names). Thus, you should type: 343\begin{quote} 344\begin{verbatim} 345ssearch36 -s BL62 -f -11 -g -1 query.file library.file > output.file 346\end{verbatim} 347\end{quote} 348If you include \texttt{-I} as one of the options, you can provide 349command line options (e.g. to change the scoring matrix or gap 350penalties) without a query file or library file, and the program will 351use the options but prompt for the necessary files . 352 353\subsection{Interpreting the results} 354 355Fig. \ref{ssearch_run} shows the output from a typical FASTA program 356(\texttt{ssearch36}). The output file can be 357viewed as four parts: (a) the initial command line and description of 358the query sequence used (mgstm1.aa, 218 aa) and library (PIR1, 13,143 359entries); (b) a description of the search statistics, algorithm 360(Smith-Waterman, SSE2 accelerated), and search parameters (BLOSUM50 361matrix, gap penalties: -10 to open a gap, -2 for each residue in a 362gap); (c) a list of high scoring library sequences, descriptions, similarity scores, and statistical significance; (d) the alignments that produced the scores. 363 364\begin{figure} 365\include{fasta_guide.fg1} 366\caption{\label{ssearch_run}\texttt{ssearch36} results} 367\vspace{1.0ex} 368Comparison of \texttt{seq/mgstm1.aa} against a small protein database 369(\texttt{pir1.lseg}). Some high-scoring sequences and all but one 370alignment were removed to reduce the output size. 371\end{figure} 372 373\subsubsection{Identifying homologs} 374In the description section (which starts: \texttt{The best scores 375 are:}), four numbers after the description of each library sequence 376are shown: (i) (in parentheses) the length of the library sequence; 377(ii) the raw Smith-Waterman score for the alignment (\texttt{s-w}; for 378the \texttt{fasta36}, \texttt{[t]fast[x,y]36} programs, this column 379would be labeled \texttt{opt}, for the \emph{opt}imized -- banded 380Smith-Waterman -- score), (iii) the \emph{bit} score, and (iv) the 381expectation (E()), or statistical significance, of the alignment 382score. The E()-value depends on the size of the database searched, in 383this case, 13,143 sequences, so the database size is given at the top 384of the list. 385 386The bit score is equivalent to a BLAST bit score; together with the 387length of query and library sequences, it can be used to calculate the 388significance of the alignment.\footnote{$E(D) = D m n 2^{-b}$, 389where $D$ is the number of sequences in the database, $m, n$ are the 390lengths of the two sequences, and $b$ is the bit score.} Bit scores 391are convenient because they provide a matrix independent score that 392can be compared with other searches performed with other matrices and 393gap penalties against other databases. However, the E()-value, or 394expectation, provides the most direct measure of the statistical 395significance of the match. 396 397In this example, the \texttt{GSTP1\_RAT}, \texttt{GSTA1\_RAT}, and 398\texttt{GSTA4\_RAT} proteins share strong significant similarity 399(better than $E() < 6.1 \times 10^{-7}$ ), while the 400\texttt{GSTF1\_MAIZE}, \texttt{GSTF3\_MAIZE}, and 401\texttt{GSTT1\_DROME} sequences do not share significant similarity 402($E() < 0.001$). However, \texttt{GSTF1\_MAIZE}, 403\texttt{GSTF3\_MAIZE}, and \texttt{GSTT1\_DROME} are all glutathione 404transferase homologs, they simply do not share statistically 405significant similarity with this particular \texttt{mGSTM1} query. 406Statistically significant sequence similarity scores \emph{can} be 407used to infer \emph{homology} (common ancestry), but non-significant 408scores \emph{cannot} be used to infer \emph{non-homology}. 409 410While percent identity is often used to characterize the quality of an 411alignment and the likelihood that it reflects homology, the E()-value 412is a much more reliable value for homology infernence (once homology is 413established, the percent identity is much more useful for estimating 414evolutionary distance). Often sequences that share less than 30\% 415identity will share very significant similarity (in the example above 416\texttt{mgstm1.aa} and \texttt{GSTA4\_RAT}, with E() $<$ 6.1E-07 are 41725.6\% identical). The expectation value captures information about 418conservative replacements, identities, and alignment length to provide 419a \emph{single} value that captures the significance of the alignment. 420 421For protein searches, library sequences with E()-values $<$ 0.001 for 422searches of a 10,000 entry protein database are almost always 423homologous. Some sequences with E()-values from 1 - 10 may also be 424related, but unrelated sequences ( 1--10 per search) will have scores 425in this range as well. 426 427E()-values $<$ 0.001 can reliably be used to infer homology, assuming 428that the statistical estimates are accurate. The two most common 429causes of statistical problems are low-complexity regions and 430amino-acid composition bias. Low-complexity regions are can be 431identified using the \texttt{pseg} program \cite{woo935}, and filtered 432out using the \texttt{-S} option. Composition bias rarely produces 433highly-signficiant E()-values, but can cause unrelated sequences to 434have E()-values between 0.01 and 0.001. The FASTA programs offer two 435shuffle-based strategies for evaluating composition bias; calculating 436similarity scores for random sequences with the same length and amino 437acid composition (\texttt{-z 11} $..$ \texttt{16}),\footnote{Random 438 shuffles are performed for pairwise alignments and \texttt{lalign36} 439 by default.} and calculating statistical estimates derived from 440shuffles of the high-scoring sequences (\texttt{-z 21, 22, 24, 25, 441 26}). 442 443When \texttt{-z 21 .. 26} shuffles are performed, the FASTA36 programs 444present two E()-values in the list of high scoring sequences and the 445alignments; the traditional one based on the library search, and a 446second \texttt{E2()} value, based on the shuffles of the high scoring 447sequences. \texttt{-z 21 .. 26} shuffles are most useful for 448evaluating the significance of translated-DNA:protein searches like 449\texttt{fastx36}. Out-of-frame translations can produce cryptic low 450complexity regions, which are most apparent when the high-scoring 451sequences are shuffled. \texttt{-z 21} shuffles are more efficient 452than individual sequence shuffles, because the set of high scoring 453sequences is shuffled 500 -- 1,000 times, rather than 500 shuffles for 454each of 50 -- 100 high scoring library sequences. It is almost as 455effective, because homologous sequences share similar amino-acid 456composition. 457 458The statistical routines assume that the library contains a large 459sample of unrelated sequences. If the library contains fewer than 460500 sequences (\texttt{MAX\_RSTATS}), then the library sequences are 461shuffled to produce 500 random scores, from which lambda and K 462statistical parameters are estimated. If the library contains a large 463number of \emph{related} sequences, then the statistical parameters 464should be estimated by using the \texttt{-z 11-15}, options. 465\texttt{-z} options greater than 10 calculate a shuffled similarity 466score for each library sequence, in addition to the unshuffled score, 467and estimate the statistical parameters from the scores of the 468shuffled sequences. 469 470\subsubsection{Looking at alignments} 471 472The description section described above contains the critical 473information for inferring homology, the \texttt{E()}-value. The 474alignment section shows the actual alignments that produced the 475similarity score and statistical estimates. In 476Fig. \ref{ssearch_run}, the alignment display reports the percent 477identity, percent similarity (number of aligned residues with BLOSUM50 478values $\ge$ 0), and the boundaries of the alignment. Note that for 479the \texttt{ssearch36} and \texttt{fasta36}, the alignment shown can 480include residues that are not part of the best local alignment 481(e.g. residues 1--5 and 207--218 in \texttt{mGSTM1} in 482Fig. \ref{ssearch_run}). The amount of additional sequence context 483shown is the alignment line length (60 residues, set by \texttt{-w 484 len}) divided by 2 by default, but can be adjusted with the 485\texttt{-W context} option. 486 487\begin{figure} 488\include{fasta_guide.fg2} 489\vspace{-3.0ex} 490\caption{\label{seg-aln}Alignment with \texttt{-S} filtered sequence} 491\end{figure} 492 493Fig. \ref{seg-aln} shows an example of a \texttt{fasta36} alignment 494produced using the \texttt{-S} option to filter out lower-case (low 495complexity) residues. Here, additional scores (\texttt{initn}, 496\texttt{init1} are shown, in addition to the \texttt{opt} score which 497is used to rank the sequences and calculate statistical significance. 498The \texttt{init1} score is the highest scoring alignment without 499gaps; \texttt{initn} is a score that combines consistent 500(non-overlapping) runs without gaps, and \texttt{opt} is the score of 501a banded Smith-Waterman of width 16 for \emph{ktup=2} that is applied 502to sequences with \texttt{initn} scores over the optimization 503threshold. In Fig. \ref{seg-aln}, the \texttt{init1} score is 504based on the long, un-gapped region from residues 46--208 in 505\texttt{mGSTM1}, while the \texttt{initn} and \texttt{opt} scores 506include the other regions joined by gaps. The \texttt{initn} score is 507higher than the \texttt{opt} score, because it uses a simpler, 508length-independent, gap penalty. 509 510The \texttt{init1} and \texttt{initn} scores are shown for 511historical reasons, and can be used to illustrate the FASTA algorithm. 512But the \texttt{opt} score is the most reliable and sensitive score 513for inferring homology; the others can be ignored. 514 515For \texttt{fasta36} with proteins, the final alignment and score is 516calculated with the Smith-Waterman algorithm. For DNA sequences, a 517banded Smith-Waterman is used. (The \texttt{-A} option produces banded 518Smith-Waterman alignments for proteins, and full Smith-Waterman for 519DNA.) In Fig. \ref{seg-aln}, the \texttt{opt} score and 520\texttt{Smith-Waterman} scores are calculated on exactly the same 521alignment, but the \texttt{opt} score excludes the contribution from 522the ``low-complexity'' region between 19--30 in \texttt{GST26\_SCHMA}. 523 524\subsubsection{Results without alignments} 525 526While sequence alignments are very informative, it is often not 527practical to examine all the statistically significant alignments in 528large-scale searches. The \texttt{-m 9} and \texttt{-m 8} options 529present summaries of each alignment (alignment boundaries, percent 530identity, and other information) in a much more compact form. In 531addition, \texttt{-m 9c} or \texttt{-m 9C} (see options below) provide 532a detailed encoding of the alignment, that allows it to be 533reconstructed. For large-scale searches, we routinely use \texttt{-m 534 8} with the \texttt{-d 0} option, which sets the number of 535alignments shown to 0 (thus none are shown). Alternatively, the 536\texttt{-m 8} and \texttt{-m 8C} ouput options produce BLAST-format 537tabular results summaries (\texttt{-m 8C} provides commented tabular 538results). \texttt{-m 8CC} adds an alignment CIGAR string and 539annotation string to the BLAST tabular format. 540\texttt{-m BB} produces an output that mimics BLAST output 541(with alignments). 542 543\subsubsection{Alignments with annotations} 544 545The command line \texttt{-V} option (described below) causes the FASTA 546programs to ``decorate'' its sequence alignments with annotation 547information, such as functional sites, variants, and domain-based 548sub-alignment scores. For example, a comparison of the 549\texttt{seqs/gstm1\_human.vaa} sequence with SwissProt using the 550\texttt{scripts/ann\_feats\_up\_www2.pl} script: 551\begin{footnotesize} 552\begin{quote} 553\begin{verbatim} 554ssearch36 -m 9i -V \!../scripts/ann_feats_up_www2.pl ../seq/gstm1\_human.vaa /slib/swissprot.fa 555\end{verbatim} 556\end{quote} 557\end{footnotesize} 558Produces the following addtional output: 559\begin{footnotesize} 560\begin{verbatim} 561Annotation symbols: 562 = : Active site 563 * : Modified 564 # : Substrate binding 565 ^ : Metal binding 566 @ : Site 567 568The best scores are: s-w bits E(458668) %_id %_sim alen 569sp|P09488.3|GSTM1_HUMAN Glutathione ( 218) 1500 375.2 2.4e-103 1.000 1.000 218 |Var: K173N;S210T; 570sp|Q03013.3|GSTM4_HUMAN Glutathione ( 218) 1375 344.5 4.4e-94 0.904 0.963 218 |Var: S2P;A160V;L208V;Y209F;... 571sp|Q5R8E8.3|GSTM2_PONAB Glutathione ( 218) 1310 328.5 2.9e-89 0.862 0.959 218 572sp|Q9TSM5.3|GSTM1_MACFA Glutathione ( 218) 1308 328.0 4e-89 0.858 0.954 218 573sp|Q9TSM4.3|GSTM2_MACFA Glutathione ( 218) 1307 327.7 4.8e-89 0.862 0.954 218 574sp|P28161.2|GSTM2_HUMAN Glutathione ( 218) 1306 327.5 5.7e-89 0.853 0.959 218 |Var: S173N; 575sp|P46439.3|GSTM5_HUMAN Glutathione ( 218) 1305 327.2 6.7e-89 0.876 0.954 218 |Var: L179P; 576\end{verbatim} 577\end{footnotesize} 578This summary of high scoring hits shows one of the effects of 579annotation---substitution of variant residues to increase the score. 580In this case, the query sequence \texttt{gstm1\_human.vaa} is a known 581variant of the canonical \texttt{GSTM1\_HUMAN}/\texttt{P09488} UniProt 582sequence. Without the \texttt{-V} annotation option, 583\texttt{gstm1\_human.vaa} would be 99\% identical to 584\texttt{GSTM1\_HUMAN}, but because UniProt documents the variant 585residues in the feature table, the \texttt{K173N} and \texttt{S210T} 586substutions are made in the library (subject) sequence, producing a 587perfect match. 588 589In addition to the variant substitution shown above, the alignments 590provide a more complete view of the annotations available on the 591library (subject) proteins. Below is the report for the 592\texttt{GSTM4\_HUMAN} alignment: 593\begin{footnotesize} 594\begin{quote} 595\begin{verbatim} 596>>sp|Q03013.3|GSTM4_HUMAN Glutathione S-trans (218 aa) 597 Variant: 2P=2P : S2P : UniProtKB FT ID: VAR_033979 598 Site:@ : 7Y=7Y : Site: Glutathione binding 599 Site:@ : 46W=46W : Site: Glutathione binding 600 Site:@ : 59N=59N : Site: Glutathione binding 601 Site:@ : 72Q=72Q : Site: Glutathione binding 602 Region: 2-88:2-88 : score=599; bits=150.1; Id=0.989; Q=415.2 : GST N-terminal :1 603 Site:# : 116Y=116Y : Substrate binding: Substrate 604 Variant: 160V=160V : A160V : UniProtKB FT ID: VAR_033980 605 Variant: 208V=208V : L208V : UniProtKB FT ID: VAR_049487 606 Region: 90-208:90-208 : score=699; bits=175.1; Id=0.833; Q=489.3 : GST C-terminal :2 607 Variant: 209F=209F : Y209F : UniProtKB FT ID: VAR_049488 608 Variant: 211K=211K : R211K : UniProtKB FT ID: VAR_049489 609 Variant: 212M=212M : V212M : UniProtKB FT ID: VAR_049490 610 s-w opt: 1375 Z-score: 1823.2 bits: 344.5 E(458668): 4.4e-94 611Smith-Waterman score: 1375; 90.4% identity (96.3% similar) in 218 aa overlap (1-218:1-218) 612\end{verbatim} 613\end{quote} 614\end{footnotesize} 615This report can be broken into three parts: (1) information on 616variants, described above, (2) information on annotated sites, and (3) 617information on annotated domains. For each annotated site, the 618coordinate and amino-acid resdiue in the query and library (subject) 619sequence is shown, as well as the conservation state (\texttt{=} in 620all these examples). For annotated domains, the overall alignment 621score is broken into pieces, based on the boundaries of the domains. 622In this case, the full alignment extends from residues 1--218, 623producing a raw Smith-Waterman score of 1375 and a bit score of 624344.5. The GST N-terminal domain is annotated from residue 2--88 on 625\texttt{GSTM4\_HUMAN}, and the 2--88 region of the alignment produces 626a score of 599 and bit score of 150.1. This region is 98.9\% 627identical, and the probability of that similarity score is 628$10^{-41.52}$ (the Qvalue score is $-10 log_{10} P$). The 629alignment associated with GST C-terminal domain is slightly less well 630conserved (83.3\% identical), but longer, so it produces a higher 631Smith-Waterman score (699), bit score (175.1) and Q-value. 632 633Sub-alignment scores can be used to identify cases of alignment 634over-extension \cite{wrp136}, where an alignment extends well beyond 635the homologous domain. In this case, the homologous region will 636produce the vast majority of the score, and the non-homologous 637over-extension will produce very little score. For example, when 638\texttt{SRC8\_HUMAN} aligns with \texttt{LASP1\_MOUSE}, the alignment 639spans 200 residues, but only about 49 of those residues, an 640\texttt{SH3} domain, are homologous: 641 642\begin{footnotesize} 643\begin{quote} 644\begin{verbatim} 645>>sp|Q61792.1|LASP1_MOUSE LIM and SH3 domain protein 1; LASP-1; (263 aa) 646 Region: 369-398:66-95 : score=20; bits=13.7; Id=0.200; Q=0.0 : Nebulin_repeat InterPro 647 Region: 400-434:97-131 : score=-8; bits=8.7; Id=0.150; Q=0.0 : Nebulin_repeat InterPro 648 Region: 435-499:132-203 : score=13; bits=11.4; Id=0.197; Q=0.0 : NODOM :0 649 Region: 499-547:204-261 : score=124; bits=47.8; Id=0.474; Q=92.1 : SH3 InterPro 650 s-w opt: 148 Z-score: 253.4 bits: 55.6 E(459565): 1.2e-06 651Smith-Waterman score: 159; 26.3% identity (55.6% similar) in 205 aa overlap (369-547:66-261) 652\end{verbatim} 653\end{quote} 654\end{footnotesize} 655The 150 aligned residues outside the \texttt{SH3} homology produce 656less than 20\% of the alignment score, while spannign to 657non-homologous Nebulin repeat domains. 658 659\subsection{Program Options} 660 661Command line options are available to change the scoring parameters 662and output display. Unlike the NCBI BLAST programs, command line 663options \emph{must} precede the query file name and library file name 664arguments. To see the command-line options for a program and their 665defaults, type \texttt{program\_name -help}, e.g. \texttt{fasta36 666 -help} or \texttt{ssearch36 -help}. For a quick list of the most 667common options, just type the program name without any options 668(e.g. \texttt{fasta36$<$ret$>$}). 669 670\subsubsection{Command line options} 671\begin{description} 672\item[\texttt{-a}] (\texttt{fasta36}, \texttt{ssearch36}, 673 \texttt{glsearch36}, \texttt{fasts36}) show both sequences in their 674 entirety. 675\item[\texttt{-A}] force Smith-Waterman alignments for 676 \texttt{fasta36} DNA sequences. By default, only \texttt{fasta36} 677 protein sequence comparisons use Smith-Waterman alignments. 678 Likewise, for proteins, use band alignments (Smith-Waterman is used 679 by default). 680\item[\texttt{-b \#}] Number of sequence scores to be shown on output. 681 In the absence of this option, \texttt{fasta36} (and 682 \texttt{ssearch36}) display all library sequences obtaining 683 similarity scores with expectations less than the expectation (-E) 684 threshold, 10.0 for proteins, and 2.0 for DNA:DNA and 685 protein:translated DNA. The \texttt{-b \#} option can limit the 686 display further. There are two ``sub-modes'' of \texttt{-b}. 687 \texttt{-b =100} will force 100 high scores to be displayed, 688 regardless of the expectation (\texttt{-E}) threshold, and 689 \texttt{-b >1} will show at least \texttt{1}, but is otherwise 690 limited by \texttt{-E}. Thus, \texttt{-b 10} will show \emph{no 691 more than} 10 results, limited by \texttt{-E}; \texttt{-b =10} 692 will always show \emph{exactly} \texttt{10} results, and \texttt{-b 693 >5} will show \emph{at least} \texttt{5} results, but could show 694 many more if more results have e\-values $\le$ \texttt{-E e\_cut}. 695\item[\texttt{-c \#,\#}] (\texttt{fasta36}, \texttt{[t]fast[x,y]36} 696 only) Fraction of alignments optimized (second value is fraction of 697 sequences joined). FASTA36 uses a statistical threshold strategy 698 that joins and optimizes only the fraction of the alignments with an 699 \texttt{initn} score expected \texttt{-c} times. Thus, \texttt{-c 700 0.05} should optimize about 5\% of sequences. The actual number 701 of sequences optimized (and joined) is displayed in the scoring 702 parameters line. Thus: 703\begin{quote} 704\begin{verbatim} 705Parameters: BL50 matrix (15:-5), open/ext: -10/-2 706 ktup: 2, E-join: 1 (0.687), E-opt: 0.2 (0.294), width: 16 707\end{verbatim} 708\end{quote} 709reports that 20\% of the sequences in the database should have been 710band-optimized, and 29.4\% were. Reducing the \texttt{-c opt} fraction 711improves performance, but dropping the fraction below 0.02 can 712reduce the accuracy of the statistical estimates. 713 714\texttt{-c O} (letter 'O') sets the joining/optimization 715thresholds as they were prior to \texttt{fasta-36.3.3} (original thresholds). Positive 716values set the thresholds to specific score values, as was the case 717in older versions of \texttt{fasta}. 718 719\item[\texttt{-C}] 720length of the sequence name printed at the beginning of alignment 721lines (default 6 characters). 722\item[\texttt{-d \#}] 723Maximum number of alignments to be displayed (must be \texttt{<=} to the number of descriptions, \texttt{-b \#}) 724\item[\texttt{-D}] 725 Provide some debugging output. Used in conjunction 726 with the \texttt{-e expand\_script.sh}, the \texttt{link\_acc\_file} 727 and \texttt{link\_lib\_file} are not deleted during the run; so that 728 \texttt{expand\_script.sh} scripts can be tested. 729 730\item[\texttt{-e expand\_script.sh}] 731 732 Expand the set of sequences that 733 are aligned to beyond the set of sequences searched. When the 734 \texttt{-e expand\_script.sh} option is used, the 735 \texttt{expand\_script.sh} script is run after the initial search 736 scan but before the list of high-scoring sequences is displayed. 737 \texttt{expand\_script.sh} is given a single argument, the name of a 738 file that contains a list of accession strings (the text between the 739 \texttt{>} and the first space ('\textvisiblespace') character 740 followed by the E()-value for the sequence (separated by a 741 \texttt{<tab>} character), e.g.: 742\begin{quote} 743\begin{verbatim} 744gi|121719|sp|P08010|GSTM2_RAT<tab>2.69e-86 745gi|121746|sp|P09211|GSTP1_HUMAN<tab>1.51e-20 746gi|121749|sp|P04906|GSTP1_RAT<tab>1.16e-19 747gi|62822551|sp|P00502|GSTA1_RAT<tab>9.5e-12 748\end{verbatim} 749\end{quote} 750The script should produce a fasta-formatted list of additional 751sequences printed to \texttt{stdout}. The script is run with the command: 752\begin{quote} 753\begin{verbatim} 754expand_script.sh link_acc.tmp_file > link_lib.tmp_file 755\end{verbatim} 756\end{quote} 757The sequences in \texttt{link\_lib.tmp\_file} (a temporary file name is 758actually used, and the file is deleted unless the \texttt{-D} option 759is used) are then compared and, if they are significant, included in 760the list of high scoring sequences and the alignments. The expanded 761set of sequences does not change the database size or statisical 762parameters, it simply expands the set of high-scoring sequences. 763 764The \texttt{fasta36/misc} directory contains 765\texttt{expand\_uniref50.pl} that uses a mySQL table based on the 766\texttt{uniref50} clusters. Using the script and the 767\texttt{uniref50} cluster information, one can search 768\texttt{uniref50.fasta}, but then expand the hits so that 769\texttt{uniprot} appears to be searched. 770 771\item[\texttt{-E e\_cut [e\_cut\_r]}] Limit the number of scores and 772 alignments shown based on the expected number of scores. Used to 773 override the expectation value of 10.0 (protein:protein; 5.0 774 translated-DNA:protein; 2.0 DNA:DNA) used by default. \texttt{-E 775 2.0} will show all library sequences with scores with an 776 expectation value $<=$ 2.0. With \texttt{fasta-36}, a second 777 value, \texttt{e\_cut\_r} is available to limit the E()-values of 778 additional sequence alignments between the query and library 779 sequences. If not given, the threshold is \texttt{e\_cut}/10.0. If 780 given with a value $>$ 1.0, \texttt{e\_cut\_r} = \texttt{e\_cut} / 781 value; for a value $<$ 1.0, \texttt{e\_cut\_r} = value; If 782 \texttt{e\_cut\_r} $<$ 0, then the additional alignment option is 783 disabled. 784\item[\texttt{-f \#}] 785Gap open penalty (-10 by default for proteins, 786-12 for DNA, -12 for \texttt{[t]fast[xy]}). 787\item[\texttt{-F \#}] 788Limit the number of scores and alignments shown based on the expected 789number of scores. \texttt{-E \#} sets the highest E()-value shown; \texttt{-F \#} sets 790the lowest E()-value displayed. Thus, \mbox{\texttt{-F 0.0001}} will not show any matches or 791alignments with E() $<$ 0.0001. This allows one to skip over close 792relationships to search for more distant relationships. 793\item[\texttt{-g \#}] 794Penalty per residue in a gap (-2 by default for proteins, 795-4 for DNA, -2 for \texttt{[t]fast[xy]}). A single residue gap costs \texttt{f} $+$ \texttt{g}. 796\item[\texttt{-h}] 797Short help message. Help options with \texttt{':'}, e.g. \texttt{-s:}, 798require an argument (\texttt{-s BP62}). Defaults are shown in square 799brackets, e.g.: \texttt{-s:\ [BL50]}. 800\item[\texttt{-help}] 801Long help message 802\item[\texttt{-H}] 803Show histogram. 804\item[\texttt{-i}] 805DNA queries - search with reverse complement. For 806\texttt{tfastx36/y36}, search the reverse complement of the library sequence 807only (complement of \texttt{-3} option). 808\item[\texttt{-I}] 809Interactive mode (the default for versions older than \texttt{fasta-36.3.4}). 810\item[\texttt{-j \#}] 811Penalty for frameshift between codons (\texttt{[t]fastx36}, \texttt{[t]fasty36}) and within a codon (\texttt{fasty36}/ \texttt{tfasty36} only). 812\item[\texttt{-J}] 813(\texttt{lalign36} only) show the identity alignment (normally 814 suppressed, \texttt{-I} in versions before \texttt{fasta-36.3.4}). 815\item[\texttt{-k \#}] 816number of shuffles for statistical estimates from shuffling. 817\item[\texttt{-l file}] 818Location of library menu file (FASTLIBS). 819\item[\texttt{-L}] 820Display longer library sequence description. 821\item[\texttt{-M low-high}] 822Range of amino acid sequence lengths to be included in the search. 823\item[\texttt{-m \#}] 824Specify alignment type: 0, 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, B, BB, ``F\# out\_file'' 825\begin{small} 826\begin{verbatim} 827 -m 0 -m 1 -m 2 -m 3 -m 4 828MWRTCGPPYT MWRTCGPPYT MWRTCGPPYT MWRTCGPPYT 829::..:: ::: xx X ..KS..Y... MWKSCGYPYT ---------- 830MWKSCGYPYT MWKSCGYPYT 831\end{verbatim} 832\end{small} 833If the \texttt{-V '*@\%'} annotation option has been used, 834annotations can be included in either the coordinate line (the default) or the 835middle alignment line (\texttt{-m 0M}, \texttt{-m 1M}), or both 836(\texttt{-m 0B}, \texttt{-m 1B}). See \texttt{-V} for more details. 837 838\indent \texttt{-m 5}: a combination of \texttt{-m 4} and \texttt{-m 839 0}. \texttt{-m 6} provides \texttt{-m 5} plus HTML formatting. In 840addition, independent \texttt{-m} options can be combined. Thus, one 841can use \texttt{-m 1 -m 6 -m 9}. 842 843\item[\texttt{-m 8}] provides BLAST tabular format output (a tab 844 delimited line with the query name, library name, percent identity, 845 and other alignment information). ``\texttt{-m 8C}'' provides the 846 additional information provided by the BLAST tabular format with 847 comment lines. BLAST tabular format has been extended to include 848 either a CIGAR string alignment encoding (\texttt{-m 8CC} with BLAST 849 comments, \texttt{-m 8XC} without comments) and, if available, an 850 annotation encoding matching FASTA \texttt{-m 9C} output. All the 851 \texttt{-m 9c/C/d/D} encodings are available with BLAST tabular 852 output using \texttt{-m 8C[c/C/d/D]}. 853 854\item[\texttt{-m 9}] display alignment coordinates and scores with the 855 best score information. \texttt{-m 9i} provides alignment length, 856 percent identity, and percent similarity only. \texttt{ -m 9} extends 857 the normal best score information: 858\begin{footnotesize} 859\begin{verbatim} 860The best scores are: opt bits E(14548) 861XURTG4 glutathione transferase (EC 2.5.1.18) 4 - ( 219) 1248 291.7 1.1e-79 862\end{verbatim} 863\end{footnotesize} 864 865to include the additional information (on the same line, separated by 866$<$tab$>$ characters): 867\begin{footnotesize} 868\begin{verbatim} 869%_id %_gid sw alen an0 ax0 pn0 px0 an1 ax1 pn1 px1 gapq gapl fs 8700.771 0.771 1248 218 1 218 1 218 1 218 1 219 0 0 0 871\end{verbatim} 872\end{footnotesize} 873 874The first two values are fraction identical and fraction similar 875(score $\ge 0$), followed by the Smith-Waterman alignment score (\texttt{sw}), the 876alignment length (\texttt{alen}), and the coordinates of the beginning 877and end of the alignment in the query and target (library) sequences 878(\texttt{an0} beginning, \texttt{ax1} end in query; \texttt{an1} 879beginning, \texttt{ax1} end in target/library), and the coordinate 880system for the beginning and end of the query and target/library 881sequence (\texttt{pn0} is the displayed coordinate of the first 882residue of the query sequence, \texttt{px0} is the displayed 883coordinate of the last residue, \texttt{pn1},\texttt{px1} provide the 884coordinates for the target/library sequence). \texttt{gapq}, 885\texttt{gapl} report the number of gaps in the query and library 886sequence; \texttt{fs} reports the number of frameshifts. 887 888\texttt{ -m 9c} provides additional information: an encoded alignment string. For example, the alignment: 889\begin{footnotesize} 890\begin{verbatim} 891 10 20 30 40 50 60 70 892GT8.7 NVRGLTHPIRMLLEYTDSSYDEKRYTMGDAPDFDRSQWLNEKFKL--GLDFPNLPYL-IDGSHKITQ 893 :.:: . :: :: . .::: : .: ::.: .: : ..:.. ::: :..: 894XURTG NARGRMECIRWLLAAAGVEFDEK---------FIQSPEDLEKLKKDGNLMFDQVPMVEIDG-MKLAQ 895 20 30 40 50 60 896\end{verbatim} 897\end{footnotesize} 898would be encoded: 899\begin{footnotesize} 900\texttt{=23+9=13-2=10-1=3+1=5} 901\end{footnotesize}. 902The numbers in the alignment encoding is with repect to the beginning 903of the alignment, not the sequences. The beginning coordinate of the 904alignment is given earlier in the \texttt{-m 9c} line. \texttt{-m 9C} 905provides the alignment encoding in CIGAR format: 906\begin{footnotesize} 907\texttt{28M9D13M2I10M1I3M1D5M} 908\end{footnotesize}. 909 910(June, 2014) The \texttt{-m 9c/C} option has been extended to 911\texttt{-m 9d/D}, which encodes the positions of mismatches as well as 912insertions and deletions. For the example above, the \texttt{-m 9d} 913encoding would be: 914\begin{footnotesize} 915\texttt{=1x1=2x4=2x1=2x7=3-9=1x2=1x4=2x1=1x1+2x1=1x1=1x3=1x2+1=3-1x1=1x2=1} 916\end{footnotesize} 917while \texttt{-m 9D} would be: 918\begin{footnotesize} 919\texttt{1M1X2M4X2M1X2M7X3M9D1M2X1M4X2M1X1M1X2I1X1M1X1M3X1M2X1I3M1D1X1M2X1M} 920\end{footnotesize} 921\item[\texttt{-m 10}] 922a parseable format for use with other programs. 923\item[\texttt{-m 11}] 924Provide \texttt{lav}-like output (used by \texttt{lalign}) for graphical output. 925\begin{quote} 926\texttt{lalign36 -m 11 mchu.aa mchu.aa | lav2plt.pl --dev ps > mchu\_laln.ps} 927\end{quote} 928Produces a postscript plot of the local alignments. Likewise, 929\texttt{lav2plt.pl --dev svg} produces SVG output. 930 931\item[\texttt{-m BB}] Format output to mimic BLAST format. \texttt{-m 932 B} formats alignments to look like BLAST alignments (Query/Sbjct), 933 but is FASTA output otherwise. \texttt{-mBB} imitates BLAST as much 934 as possible, and cannot be used with other \texttt{-m} options. 935 936\item[\texttt{-m "F\# out.file"}] Send an alternate result format to \texttt{out.file}. 937Normally, the \texttt{-m out\_fmt} option applies to the default output 938file, which is either \texttt{stdout}, or specified with \texttt{-O out\_file} (or within 939the program in interactive mode). With \texttt{-m F}, an output format can be 940associated with a separate output file, which will contain a complete 941FASTA program output. Thus, 942\begin{quote} 943\begin{small} 944\begin{verbatim} 945 ssearch36 -m 9c -m "FBB blast.out" -m "F9c,10 m9c_10.out" query library 946\end{verbatim} 947\end{small} 948\end{quote} 949Sends the \texttt{-m 9c} output to \texttt{stdout}, but will also send 950\texttt{-m BB} output to the \texttt{blast.out} file, and \texttt{-m 9c -m 951 10} output to \texttt{m9\_c10.out}. Consistent \texttt{-m out\_fmt} 952commands can be set to the same file by separating them with ','. 953Producing alternative format alignments in different files has little 954additional computational cost. 955 956Because a space (\textvisiblespace) is used to separate the output 957format (\texttt{-m}) values from the file name, the \texttt{-m F} 958argument must typically be surrounded by quotation marks (\texttt{"}). 959 960One of the shortcomings of this approach is that it affects only the 961output format, not the other options that modify the amount of output. 962Thus, if you specify \texttt{-E 0.001}; that expect threshold will be 963used for all the output files. When a \texttt{-m} option does modify 964the output (e.g. \texttt{-m 8} sets \texttt{-d 0}), that modification 965is specific to the output file. 966 967\item[\texttt{-M low-high}] 968Include library sequences with lengths between low and 969high. 970\item[\texttt{-n}] 971Force the query sequence to be treated as a DNA sequence. 972Useful when query sequences contain a large number of 973ambiguous residues, e.g. transcription factor binding sites. 974\item[\texttt{-N \#}] 975break long library sequences into blocks of \# residues. Useful for 976bacterial genomes, which have only one sequence entry. -N 2000 works 977well for well for bacterial genomes. (This option was required when 978FASTA only provided one alignment between the query and library 979sequence. It is not as useful, now that multiple alignments are 980available.) 981 982\item[\texttt{-o off1,off2}] 983(Previously \texttt{-X}.) Specifies offsets for the beginning of the query and library sequence. 984For example, if you are comparing upstream regions for two genes, and 985the first sequence contains 500 nt of upstream sequence while the 986second contains 300 nt of upstream sequence, you might try: 987\begin{quote} 988\texttt{fasta -o "-500 -300" seq1.nt seq2.nt} 989\end{quote} 990If the \texttt{-o} option is not used, FASTA assumes numbering starts with 1. 991(You should double check to be certain the negative numbering works 992properly.) 993 994\item[\texttt{-O}] Send a copy of results to \texttt{filename}. 995 Helpful for environments without STDOUT, but should be avoided (use 996 \texttt{> filename} instead). 997 998\item[\texttt{-p}] 999Force query to be treated as protein sequence. 1000 1001\item[\texttt{-P PSSM\_file}] 1002Specify a PSI-BLAST format PSSM (Position Specific Scoring Matrix) 1003file. \texttt{ssearch36}, \texttt{ggsearch36}, and 1004\texttt{glsearch36} can use a PSSM file to improve the sensitivity of 1005a search. The FASTA programs accept two PSSM file formats:\\[2ex] 1006\begin{tabular}{l l l} 1007\hline\\[-1.5ex] 1008format & \texttt{blastpgp} & option \\[0.5ex] 1009\hline\\[-1.5ex] 10100 & \texttt{blastpgp -C pssm.chk -u 0} & byte-encoded \\ 10112 & \texttt{blastpgp -C pssm.asnb -u 2} & binary ASN.1 \\ 1012% & \texttt{psiblast -out\_pssm\_text} \\[1.5ex] 1013\hline\\[-0.5ex] 1014\end{tabular}\\ 1015which can be specified after the file name, e.g.: 1016\begin{quote} 1017\texttt{ssearch36 -P 'pssm.asnb 2' pssm\_query.aa +sp+} 1018\end{quote} 1019Searches with a PSI-BLAST PSSM must still require a query sequence 1020file, and the query sequence file must match the PSSM seed sequence. 1021The format 0 byte-encoded PSSM is machine dependent; it must be 1022created by \texttt{blastpgp} on the same architecture as 1023\texttt{ssearch36}. In general, you should use the binary ASN.1 (format 2) file. 1024 1025With the release of \texttt{NCBI-BLAST+}, \texttt{psiblast} replaces 1026\texttt{blastpgp}, and \texttt{psiblast} does not produce the binary 1027ASN.1 PSSM checkpoint data. However, the text ASN.1 PSSM checkpoint 1028file (produced with the \texttt{psiblast} option \texttt{-out\_pssm}) 1029can be converted to a binary ASN.1 format that \texttt{ssearch36} can 1030read using the NCBI \texttt{datatool} program (available from 1031\url{ftp://ftp.ncbi.nlm.nih.gov/toolbox/ncbi_tools++/BIN/CURRENT/datatool}) 1032together with 1033\url{http://www.ncbi.nlm.nih.gov/data_specs/asn/NCBI_all.asn}. More 1034information about \texttt{datatool} is available from 1035\url{http://www.ncbi.nlm.nih.gov/data_specs/NCBI_data_conversion.html}. 1036The NCBI BLASTP/PSI-BLAST website provides the same PSSM text ASN.1 1037file with the downloads link. A text ASN.1 PSSM file can be converted 1038to a binary ASN.1 file using the command: 1039\begin{quote} 1040\texttt{datatool -m NCBI\_all.asn -v pssm.asn\_txt -e pssm.asnb} 1041\end{quote} 1042The \texttt{pssm.asnb} can then be used with 1043\texttt{ssearch36} with the \texttt{-P 'pssm.asnb 2'} 1044option shown above. 1045 1046\item[\texttt{-Q,-q}] 1047Quiet - does not prompt for any input. Writes scores and alignments 1048to the terminal or standard output file (on by default, turned off 1049with \texttt{-I}). 1050\item[\texttt{-r +n/-m}] 1051Specify match/mismatch scores for DNA comparisons. The default is 1052\texttt{+5/-4}. \texttt{+3/-2} can perform better in some cases. 1053\item[\texttt{-R file}] 1054Save a results summary line for every sequence in the sequence 1055library. The summary line includes the sequence identifier, 1056superfamily number (if available) position 1057in the library, and the similarity scores calculated. This option can 1058be used to evaluate the sensitivity and selectivity of different 1059search strategies \cite{wrp951,wrp981}. 1060\item[\texttt{-s file}] Specify the scoring matrix file. 1061 \texttt{fasta36} uses the same scoring matrice format as Blast. 1062 Several scoring matrix files are included in the standard 1063 distribution in the \texttt{data/} directory. For protein 1064 sequences: \texttt{codaa.mat} - based on minimum mutation matrix; 1065 \texttt{idnaa.mat} - identity matrix; \texttt{pam250.mat} - the 1066 PAM250 matrix; \cite{day787}, (\texttt{-s P250}), and 1067 \texttt{pam120.mat} - a PAM120 matrix (\texttt{-s P120}). The 1068 default scoring matrix is BLOSUM50 (\texttt{-s BL50}). Other 1069 matrices include a series of modern PAM-based matrices 1070 \cite{tay925}: MDM40/\texttt{-s MD40}, MDM20/\texttt{-s MD20}, and 1071 MDM10/\texttt{-s MD10}, and a selection from the BLOSUM series 1072 \cite{hen929} BLOSUM50, 62, and 80/\texttt{-s BL50}, \texttt{-s 1073 BL62}, \texttt{-s BL80}. \texttt{-s BP62} sets the scoring matrix 1074 to BLOSUM62 and the gap penalties to -11/-1, identical to 1075 \texttt{BLASTP}. In addition, the VTML160 matrix (\texttt{-s 1076 VT160}) \cite{muller2002} and OPTIMA\_5 (\texttt{-s OPT5}) 1077 \cite{kan023} are available. 1078 1079If the scoring matrix is prefaced by a question mark, 1080e.g. \texttt{?BP62}, then the scoring matrix is adjusted for each 1081query to ensure that a 100\% identical match can produce a score of at 1082least 40 bits. This is designed for \texttt{fastx36} searches with 1083potentially short DNA queries; A 120 nt DNA query can only produce a 108440 amino-acid alignment, which, with BLOSUM62 -11/-1, cannot produce 1085more than 23 bits of score. A scoring matrix with a higher information 1086content is required; in the set available by default, MD40, with 2.22 1087bits/position, would be used. For more information about alignment 1088length and information content, see \cite{alt915}. 1089 1090\item[\texttt{-S}] Filter out lower-case characters in the query or 1091 library sequences for the initial score calculation (used to filter 1092 low-complexity -- \texttt{seg}-ed -- residues). The \texttt{pseg} 1093 program \cite{woo935} can be used to lower-case mask low complexity 1094 regions in protein sequences. With the \texttt{-S} option, lower 1095 case characters in the query or database sequences are treated as 1096 \texttt{X}'s during the initial scan, but are treated as normal 1097 residues during the final alignment display. Since statistical 1098 significance is calculated from the similarity score calculated 1099 during the library search, the lower case residues do not contribute 1100 to the score. However, if a significant alignment contains low 1101 complexity regions, the residues are shown (as lower 1102 case characters, Fig. \ref{seg-aln}). 1103 1104The \texttt{pseg} program can be used to produce databases (or query 1105sequences) with lower case residues indicating low complexity regions 1106using the command: 1107\begin{verbatim} 1108pseg ./swissprot.fasta -z 1 -q > swissprot.lseg 1109\end{verbatim} 1110 1111The \texttt{-S} option should always be used with \texttt{FASTX/Y} and 1112\texttt{TFASTX/Y} because out-of-frame translations often generate 1113low-complexity protein sequences. However, only lower case characters 1114in the protein sequence (or protein database) are masked; lower case 1115DNA sequences are translated into upper case protein sequences, and 1116not treated as low complexity by the translated alignment 1117programs. (There is an option in the \texttt{Makefile}, 1118\texttt{-DDNALIB\_LC}, to enable preserving case in DNA sequences.) 1119 1120\item[\texttt{-t \#}] 1121Translation table - fastx36, tfastx36, fasty36, and 1122tfasty3 now support the BLAST translation tables. See 1123\url{http://www.ncbi.nih.gov/Taxonomy/Utils/wprintgc.cgi}. 1124 1125\texttt{-t t} or \texttt{-t t\#} enables the addition of 1126an implicit termination codon to a protein:translated DNA match. That 1127is, each protein sequence implicitly ends with \texttt{*}, which 1128matches the termination codes for the appropriate genetic code. 1129\texttt{-t t\#} sets implicit termination and a different genetic 1130code. 1131\item[\texttt{-T \#}] 1132set number of threads/workers. Normally on a multi-core machine, the maximum 1133number of processors/cores is used. 1134\item[\texttt{-U}] 1135Treat the query sequence an RNA sequence. In addition to selecting a 1136DNA/RNA alphabet, this option causes changes to the scoring matrix so 1137that \texttt{G:A} , \texttt{T:C} or \texttt{U:C} are scored as \mbox{\texttt{G:G -3}}. 1138\item[\texttt{-v \#}] 1139Do window shuffles with the window size specified. 1140\item[\texttt{-V str}] Specify annotation characters that can be 1141 included (and will be ignored), in the query sequence file, but are 1142 displayed in the alignments. If a query file contains 1143 \texttt{"ACVS*ITRLFT?"}, where \texttt{"*"} and \texttt{"?"} are 1144 used to indicate phosphorylation, giving the option \mbox{\texttt{-V 1145 '*?'}}, the annotated characters in the query will (\texttt{S*}, 1146 \texttt{F?}) will be highlighted in the alignment (on the number 1147 line). A \texttt{fasts36} alignment of \texttt{seq/ngts.aa} compared 1148 to \texttt{seq/mgstm1.aa} with \texttt{-V '*?'} produces: 1149\begin{footnotesize} 1150\begin{verbatim} 1151 * 10?? 1152GT8.7 ILGYWN------------EYTDSSYDEKR---------------------------- 1153 :::::: ::::::::::: 1154GT8.7 MPMILGYWNVRGLTHPIRMLLEYTDSSYDEKRYTMGDAPDFDRSQWLNEKFKLGLDFPNL 1155 10 20 30 40 50 60 1156\end{verbatim} 1157\end{footnotesize} 1158In addition to showing the alignments of post-translationally modified 1159sites, the \texttt{-V} option can be used to highlight active sites in 1160library sequences. In the \texttt{-m 9c} output, the state of the 1161annotated sites is summarized when \texttt{-V} is used. 1162 1163(fasta-36.3.6 June 2012) The \texttt{-V} option has been extended to: 1164(1) allow feature descriptions to be specified in a file, 1165e.g. \texttt{-V =annot.defs} where \texttt{annot.defs} contains: 1166\begin{footnotesize} 1167\begin{verbatim} 1168*:phosphorylation 1169@:active site 1170^:binding site 1171\end{verbatim} 1172\end{footnotesize} 1173The annotation character is left of the ':', the definition is on the 1174right. The \texttt{annot.defs} file can also be specified by setting 1175the \texttt{FA\_ANNOT\_DEF} environment variable to the file name; 1176 1177(2) to include optional annotation file, e.g. \texttt{-V 1178 '<features.annot'}, or script, e.g. \texttt{-V '!features.pl'} for 1179library annotations and \texttt{-V 'q!features.pl'} for query annotations. (Some shells require \texttt{\textbackslash!features.pl}.) Similar to the library expansion script, the 1180\texttt{features.pl} script is run against a temporary file containing 1181the list of high scoring sequence accessions (the text before the 1182first space), e.g. 1183\begin{footnotesize} 1184\begin{verbatim} 1185gi|121735|sp|P09488.3|GSTM1_HUMAN 1186gi|1170096|sp|Q03013.3|GSTM4_HUMAN 1187gi|67461004|sp|Q5R8E8.3|GSTM2_PONAB 1188... 1189\end{verbatim} 1190\end{footnotesize} 1191The \texttt{features.pl} script then produces a file of annotations on 1192those sequences, in the format: 1193\begin{verbatim} 1194>accession1 1195position label value 1196>accession2 1197... 1198\end{verbatim} 1199For example: 1200\begin{footnotesize} 1201\begin{verbatim} 1202>gi|121735|sp|P09488.3|GSTM1_HUMAN 120323 * 120433 * 120534 * 1206116 ^ 1207173 V N 1208210 V T 1209>gi|1170096|sp|Q03013.3|GSTM4_HUMAN 12102 V P 1211116 ^ 1212160 V V 1213208 V V 1214209 V F 1215211 V K 1216212 V M 1217>gi|67461004|sp|Q5R8E8.3|GSTM2_PONAB 1218... 1219\end{verbatim} 1220\end{footnotesize} 1221The same format is used for the \texttt{-V '<feature.annot'} file. 1222 1223The \texttt{V} label is special; it indicates that the feature is a 1224variant residue and specifies the alternative residue in the label 1225field. Thus, \texttt{GSTM4\_HUMAN} can have a \texttt{M} at position 12262. Unlike modification or active site annotations, variant residues 1227can change the sequence of the library sequence if replacing the 1228canonical library residue with the variant residue improves the score. 1229Thus, without the \texttt{-V '!feature.pl'} script, the human 1230\texttt{GSTM1B} variant with dbSNP:rs449856 would align to 1231\texttt{GSTM1\_HUMAN} (\texttt{P04988}) like this (the \texttt{-m 1} 1232format option was used to highlight differences) : 1233\begin{footnotesize} 1234\begin{verbatim} 1235The best scores are: opt bits E(1) 1236sp|P09488.3|GSTM1_HUMAN Glutathione S-transferase Mu 1; GST HB subuni ( 218) 1490 335.9 3.7e-97 1237 1238>>sp|P09488.3|GSTM1_HUMAN Glutathione S-transferase Mu 1; GST HB s (218 aa) 1239 initn: 1490 init1: 1490 opt: 1490 Z-score: 1776.8 bits: 335.9 E(1): 3.7e-97 1240Smith-Waterman score: 1490; 99.1% identity (100.0% similar) in 218 aa overlap (1-218:1-218) 1241 1242... 1243 170 180 190 200 210 1244gtm1_h YDVLDLHRIFEPNCLDAFPNLKDFISRFEGLEKISAYMKSSRFLPRPVFTKMAVWGNK 1245 x x 1246sp|P09 YDVLDLHRIFEPKCLDAFPNLKDFISRFEGLEKISAYMKSSRFLPRPVFSKMAVWGNK 1247 170 180 190 200 210 1248\end{verbatim} 1249\end{footnotesize} 1250 1251With a \texttt{-V '!feature.pl'} script to annotate the variants, 1252the alignment becomes: 1253\begin{footnotesize} 1254\begin{verbatim} 1255The best scores are: opt bits E(1) 1256sp|P09488.3|GSTM1_HUMAN Glutathione S-transferase Mu 1; GST HB s ( 218) 1500 338.1 8e-98 1257 1258>>sp|P09488.3|GSTM1_HUMAN Glutathione S-transferase Mu 1; GST HB s (218 aa) 1259 Variant: K173N;S210T; 1260 initn: 1500 init1: 1500 opt: 1500 Z-score: 1788.7 bits: 338.1 E(1): 8e-98 1261Smith-Waterman score: 1500; 100.0% identity (100.0% similar) in 218 aa overlap (1-218:1-218) 1262... 1263 170 180 190 200 210 1264gtm1_h YDVLDLHRIFEPNCLDAFPNLKDFISRFEGLEKISAYMKSSRFLPRPVFTKMAVWGNK 1265 1266sp|P09 YDVLDLHRIFEPNCLDAFPNLKDFISRFEGLEKISAYMKSSRFLPRPVFTKMAVWGNK 1267 170 V 180 190 200 21V 1268\end{verbatim} 1269\end{footnotesize} 1270In addition to removing the two differences as residues 173 and 210, 1271which produces a 100\% identical alignment, inclusion of the variant 1272library sequence also improves the raw similarity score and $E()$-value. 1273An example script (\texttt{misc/up\_feats.pl}) that extracts 1274annotations from a mysql database of Uniprot features is provided. 1275 1276If the annotation script produces lines beginning with '=', then these 1277lines are taken as annotation definitions, similar to the 1278\texttt{annot.defs} file described above. Thus: 1279\begin{footnotesize} 1280\begin{verbatim} 1281=*:phosphorylation 1282=@:active site 1283=^:binding site 1284>gi|121735|sp|P09488.3|GSTM1_HUMAN 128523 * 128633 * 128734 * 1288116 ^ 1289173 V N 1290210 V T 1291\end{verbatim} 1292\end{footnotesize} 1293will produce the same annotation descriptions as the 1294\texttt{annot.defs} file. 1295 1296Scripts to produce annotations are available in the \texttt{scripts/} 1297directory as \texttt{scripts/ann\_feats*.pl}. Scripts with 1298\texttt{www} in the name, 1299e.g. \texttt{scripts/ann\_feats\_up\_www2.pl} and 1300\texttt{scripts/ann\_pfam\_www.pl} download annotation information 1301from Uniprot or Pfam web services, respectively. Scripts lacking 1302\texttt{www} require require a MySQL database that associates features 1303or domains with sequence identifiers (accessions). With \CURRENT, 1304domain annotations are allows to overlap each other (which often 1305happens in Pfam and UniProt); FASTA 36.3.6 did not support overlapping 1306domains. Scripts that can produce overlapping domain annotations have 1307\texttt{\_e} in their names, but will produce non-overlapping domain 1308annotations with the \texttt{--no-over} option. Thus: 1309\texttt{scripts/ann\_pfam\_www\_e.pl --acc sp|P43553|ALR2\_YEAST} 1310produces: 1311\begin{quote} 1312\begin{verbatim} 1313>sp|P43553|ALR2_YEAST 1314451 - 683 PF01544 :1 1315667 - 799 PF01544 :1 1316\end{verbatim} 1317\end{quote} 1318While \texttt{scripts/ann\_pfam\_www\_e.pl --acc --no-over 1319 sp|P43553|ALR2\_YEAST} produces: 1320\begin{quote} 1321\begin{verbatim} 1322>sp|P43553|ALR2_YEAST 1323451 - 675 PF01544 :1 1324676 - 799 PF01544 :1 1325\end{verbatim} 1326\end{quote} 1327 1328\item[\texttt{-w \#}] 1329 Display width value ($<$200). Sets the approximate width of the 1330 high-score descriptions and the length of residue 1331 alignments. \texttt{-w 60} by default. 1332 1333\item[\texttt{-W \#}] context length (default is 1/2 of line width -w) 1334 for alignment, for programs like \texttt{fasta36} and 1335 \texttt{ssearch36}, that provide additional sequence context. 1336 1337\item[\texttt{-X extended\_option}] 1338A number of rarely used options are now only available as extended options: 1339 1340\begin{description} 1341 1342\item[\texttt{X1}] sort output by \texttt{init1} score (for 1343 compatibility with FASTP; obsolete). 1344 1345\item[\texttt{XB}] (Previously \texttt{-B}.) Show the z-score, rather 1346 than the bit-score in the list of best scores (rarely used, provided 1347 for backward compatibility). 1348 1349\item[\texttt{XI}] Modify rounding used in percent identity/percent 1350 similarity display to ensure that sequences that have a mismatch are 1351 not shown as 100.0\% identical. Without this option, a single 1352 mismatch in a 10,000 residue alignment would be shown as 100.0\% 1353 identical; with this option, it would be shown as 99.9\% 1354 identical. 1355 1356\item[\texttt{Xo}] (\texttt{fasta36}, \texttt{[t]fast[x/y]36} only) 1357 (Previously \texttt{-o}.) Turn off the default \texttt{opt} score 1358 calculation and sort results by \texttt{initn} scores (reduces 1359 sensitivity and statistical accuracy, obsolete). 1360 1361\item[\texttt{XM}] The maximum amount of memory available for storing 1362 the library in multi-sequence searches. The value is specified in 1363 MBytes (\texttt{-XL16}) or GBytes (\texttt{-XL4G}) and can also be 1364 set using the \texttt{LIB\_MEMK} environment variable 1365 (\texttt{LIB\_MEMK=4G}). Negative values remove the memory 1366 restriction. By default (set as a compile-time option, 1367 \texttt{-DMAX\_MEMK=2}), set to 2 GBytes in 32-bit environments and 1368 12 GBytes in 64-bit environments. 1369 1370\item[\texttt{XN/XX}] Alter the treatment of N:N (DNA) or X:X 1371 (protein) alignments for counts of identities and similarities. By 1372 default the \texttt{FASTA} programs count N:N or X:X as identical, 1373 but not similar, because their alignment scores are typically 1374 negative. \texttt{-XNS}, \texttt{-XN+}, \texttt{-XXS}, and 1375 \texttt{-XX+} treat N:N and X:X alignments as ``similar'' , even 1376 though their alignment scores are negative, when calculating percent 1377 similarity. \texttt{-XND}, \texttt{-XN-}, \texttt{-XXD}, and 1378 \texttt{-XX-} treat N:N and X:X alignments as non-identical for 1379 calculating percent identity. 1380 1381\item[\texttt{Xx}] (Previously \texttt{-x}) Specify the penalty for a 1382 match to an \texttt{X}, and mismatch to \texttt{X}, independently of 1383 the PAM matrix. Particularly useful for \texttt{fastx3/fasty36}, 1384 where termination codons are encoded as \texttt{X}. For example, 1385 \texttt{-Xx=0,-1} scores an \texttt{X:X} match as 0, and 1386 \texttt{X:not-X} as -1. 1387 1388\item[\texttt{Xy}] (Previously \texttt{-y}.) Set the width of the band 1389 used for calculating "optimized" scores. For proteins and ktup=2, 1390 the width is 16. For proteins with ktup=1, the width is 32 by 1391 default. For DNA the width is 16. 1392 1393\end{description} 1394 1395\item[\texttt{-z -1,0,1,2,3,4,5,6}]\hfill\\ 1396\texttt{-z -1} turns off statistical calculations. \texttt{z 0} estimates 1397the significance of the match from the mean and standard deviation of 1398the library scores, without correcting for library sequence length. 1399\texttt{-z 1} (the default) uses a weighted regression of average score 1400vs library sequence length; \texttt{-z 2} uses maximum likelihood 1401estimates of $\lambda$ 1402and $K$; \texttt{-z 3} uses Altschul-Gish parameters \cite{alt960}; 1403\texttt{-z 4 - 5} uses two variations on the \texttt{-z 1} 1404strategy. \texttt{-z 1} and \texttt{-z 2} are the best methods, in 1405general. 1406\item[\texttt{-z 11,12,14,15,16}]\hfill\\ 1407estimate the statistical parameters from shuffled copies of each 1408library sequence. This allows accurate statistics to be estimated for libraries comprised of a single protein family. 1409 1410\item[\texttt{-z 21,22,24,25,26}]\hfill\\ 1411estimate the statistical parameters from shuffled copies of the 1412highest scoring sequences reported in the search. 1413library sequence. This shuffling strategy is much more like 1414\texttt{prss}, since the sequences shuffled share compositional 1415similarity to the query. 1416\item[\texttt{-Z db\_size}] 1417sets the apparent size of the database to be used when calculating 1418expectation E()-values. If you searched a database with 1,000 1419sequences, but would like to have the E()-values calculated in the 1420context of a 100,000 sequence database, use \texttt{-Z 100000}. 1421\item[\texttt{-3}] 1422translate only three forward frames or search with only the forward 1423strand (complement of \texttt{-i}). 1424\end{description} 1425 1426Thus, to tell \texttt{fasta36} to align \texttt{seq1.aa} with \texttt{seq2.aa} showing the entirety of both sequences, with 80 characters per line, one would type: 1427\begin{verbatim} 1428fasta36 -w 80 -s BP62 -a seq1.aa seq2.aa 1429\end{verbatim} 1430The \texttt{-w 80} and \texttt{-a} options must precede the file 1431names. If you just enter the options on the command line followed by 1432\texttt{-I}, the program will prompt for the file names. 1433 1434In addition, the FASTA programs can accept query sequence data from 1435\texttt{STDIN}. To specify that stdin be used as the query or library 1436file, the file name should be specified as \texttt{@}. Thus: 1437\begin{quote} 1438\texttt{cat query.aa | fasta36 @:25-75 /slib/swissprot } 1439\end{quote} 1440would take residues 25-75 from \texttt{query.aa} and search the 1441\texttt{/slib/swissprot}. 1442 1443\subsubsection{Environment variables} 1444 1445FASTA allows virtually every option to be set on the command line 1446(except the \emph{ktup}, which must be set as the third command line 1447argument), but it is often convenient to set the \texttt{FASTLIBS} 1448environment variable to specify the location of the \texttt{fastlibs} 1449database description file. 1450 1451\texttt{FASTLIBS} -- \texttt{FASTLIBS} 1452specifies the location of the file that contains the list of library 1453descriptions, locations, and library types (see section on finding 1454library files). 1455 1456\texttt{LIB\_MEMK} -- Set the maximum amount of memory (MBytes) to be 1457available for library buffering (equivalent to \texttt{-XM\#}, see 1458above). By default, \texttt{2GB} is available on 32-bit systems 1459(\texttt{LIB\_MEMK=2G}); \texttt{8GB} on 64-bit systems. 1460 1461\texttt{REF\_URL}, \texttt{SRCH\_URL} and \texttt{SRCH\_URL1} -- These 1462environment variables are used in HTML mode (\texttt{-m 6}) to provide 1463links from the sequence alignment (see the links at 1464\url{http://fasta.bioch.virginia.edu/fasta_www2/}). \texttt{REF\_URL} 1465is associated with the \texttt{Entrez Lookup} link; \texttt{SRCH\_URL} 1466with the \texttt{Re-search database} link, and \texttt{SRCH\_URL1} 1467with the \texttt{General re-search} link. In each case, the text 1468corresponds to a HTML URL, but with positions containing the 1469\texttt{\%s} or \texttt{\%ld} (for numbers) part of a 'C' 1470\texttt{sprintf()} call for specific variables. \texttt{REF\_URL} uses 1471the database (\texttt{protein} or \texttt{nucleotide}), together with 1472a query term (typically the \texttt{gi} number). \texttt{SRCH\_URL} 1473and \texttt{SRCH\_URL1} use \texttt{db}, \texttt{query} (\texttt{gi}, 1474\texttt{pgm} (\texttt{fa}, \texttt{ss}, \texttt{fx}, etc.), and 1475\texttt{start}, \texttt{stop}, and \texttt{n1} (library sequence 1476length), where \texttt{start} and \texttt{stop} are the boundaries of 1477the alignment, for sub-sequence searches. The values of these 1478environment variables are used with \texttt{sprintf} to build a new 1479URL that is linked in the output. 1480 1481\texttt{TMP\_DIR} -- Location (if defined) of the temporary files used 1482by the \texttt{-e expand\_script.sh} option. 1483 1484In addition, environment variables can be used inside both the 1485\texttt{fastlibs} file and in the \texttt{@db.nam} files of file 1486names. The \texttt{fasta36/conf/fast\_libs\_e.www} file, included with 1487the distribution, shows an example, as do the descriptions of file of 1488file names files shown below. Whenever a word of the form 1489\texttt{\$\{WORD\}} is found in \texttt{fastlibs} or a file of file 1490names, the \texttt{\$\{WORD\}} environment variable is expanded and 1491inserted in the string. Thus, if \texttt{<\$\{SLIB\}/blast\_dbs/} 1492describes where a list of files will be found and \texttt{\$\{SLIB\}} 1493is \texttt{"/seqdata"}, then the resulting substitution yields: 1494\texttt{</seqdata/blast\_dbs/}. 1495 1496\section{Installing FASTA and the sequence databases} 1497 1498\subsection{Obtaining/preparing the sequence libraries} 1499 1500The FASTA program package does not include any protein or DNA sequence 1501libraries. Protein and DNA sequence databases are available via 1502anonymous FTP from the NCBI (\url{ftp://ftp.ncbi.nih.gov/blast/db}, 1503\url{ftp://ftp.ncbi.nih.gov/blast/db}), UniProt 1504(\url{ftp://ftp.uniprot.org/pub/databases/uniprot}), and the EBI 1505(\texttt{ftp.ebi.ac.uk/pub/databases}). 1506 1507\emph{Protein Sequence Databases} -- Protein sequence databases are 1508available from the NCBI, UniProt, and the EBI. The NCBI provides a 1509``raw'' database, \texttt{nr}, and a well-curated, less redundant 1510database, \texttt{refseq\_protein}, and a copy of the very well 1511annotated \texttt{swissprot} database. Protein sequence databases can 1512also be downloaded from UniProt and the EBI; both sites provide the 1513same UniProt\cite{uniprot11} database. 1514 1515Protein libraries, particularly those used for translated-DNA:protein 1516comparisons with \texttt{fastx36} or \texttt{fasty36}, show be scanned 1517to remove low-complexity regions. Matches between low complexity 1518regions can violate the composition assumptions used by the FASTA 1519statistical estimates. The \texttt{pseg} program (\cite{woo935}, 1520\url{ftp://ftp.ncbi.nih.gov/pub/seg/pseg}) can be used to lower-case 1521low complexity regions, which then can be ignored during the initial 1522database search by using the \texttt{-S} option. To lower-case low 1523complexity regions, run the \texttt{pseg} program against the protein sequence database: 1524\begin{quote} 1525\begin{verbatim} 1526pseg /seqdata/swissprot.fa -z 1 -q > /seqdata/swissprot.lseg 1527\end{verbatim} 1528\end{quote} 1529And then you can run most FASTA programs with \texttt{-S}: 1530\begin{quote} 1531\begin{verbatim} 1532ssearch36 -S mgstm1.aa /seqdata/swissprot.lseg 1533\end{verbatim} 1534\end{quote} 1535 1536Fig. \ref{seg-aln} shows the effect of including the \texttt{-S} 1537option with lower-cased low-complexity sequences. The \texttt{opt} 1538score (407), which is used to sort the results and calculate 1539statistics, is lower than the Smith-Waterman score (451), even though 1540exactly the same residues are aligned for each score. The \texttt{opt} 1541score excludes residues 19-30, because they were marked as 1542low-complexity by \texttt{pseg}; thus they are shown as lower-case. 1543The Smith-Waterman score includes the contribution from that part of 1544the alignment. 1545 1546Out-of-frame translated DNA sequences often produce low-complexity 1547regions \cite{wrp973}, so it is particularly important to avoid 1548low-complexity alignments when using \texttt{fastx36} and 1549\texttt{fasty36} 1550 1551\subsection{Searching taxonomic subsets} 1552 1553Because increasing database size reduces search sensitivity (an 1554alignment with an $E()$-value of $0.001$ in a search of a 100,000 1555entry database will have an $E()$-value of 0.1, not significant, if 1556found in a database of 10,000,000 sequences), it is much more 1557effective to search smaller, less redundant databases (you can always 1558search the larger database later). Thus, the \texttt{refseq\_protein} 1559database from the NCBI is preferred over \texttt{nr}; even better are 1560databases that reflect a limited phylogenetic range 1561(e.g. \texttt{refseq\_human} for vertebrate sequences). 1562 1563While the NCBI provides organism-specific \texttt{refseq} subsets on 1564their FTP site, they can be difficult to find. Alternatively, you can 1565use the NCBI \texttt{Entrez} web site to download a list of 1566\texttt{gi} numbers specific to a particular organism or taxonomic 1567range. The FASTA programs can search a subset of a large sequence 1568database that is specified by a list of \texttt{gi} numbers by using 1569library format 10. For example, given a list of \texttt{gi} numbers 1570for the human proteins in \texttt{swissprot.lseg}, the file 1571\texttt{sp\_human.db}, with the content: 1572\begin{quote} 1573\begin{verbatim} 1574<${SLIB}/swissprot.lseg 0:2 4| 15753121763 157651701705 15777404340 1578205831112 157974735515 1580... 1581\end{verbatim} 1582\end{quote} 1583could be used to search the human subset of 1584\texttt{swissprot.lseg}. The \texttt{gi} numbers for the SwissProt 1585entries begin with the second line. The first line specifies the 1586location of the file where the sequences containing the \texttt{gi} 1587numbers can be found (\texttt{\$\{SLIB\}/swissprot.lseg}, the 1588\texttt{libtype} of that file (\texttt{0:fasta}), the character offset 1589to the beginning of the sequence identifier in that file (\texttt{2}), 1590the identifier type (\texttt{4}), and the character 1591that separates the fields in the FASTA descriptor (\texttt{|}). The 1592identifier type can take four formats: 1593 1594\begin{tabular}{l l} 1595\hline\\[-1.5ex] 15961 & ordered accession strings (letters or numbers)\\ 15972 & ordered numbers (digits only) \\ 15983 & un-ordered accession strings \\ 15994 & un-ordered numbers \\ 1600\hline\\ 1601\end{tabular}\\ 1602(Ordered accession strings/numbers are ordered in both the library and the subset file.) 1603 1604Thus, given the \texttt{0:2 4|} specification above, the line: 1605\begin{quote} 1606\texttt{>gi|3121763|sp|O15143.3|ARC1B\_HUMAN Actin-related protein 2/3 ...} 1607\end{quote} 1608would be parsed, looking for an number starting at column 4 (the first 1609column is numbered 0), and ending with \texttt{|}. The order of 1610sequences in the library do not have to correspond to the order in the 1611\texttt{sp\_human.db} file (un-ordered). Given a the 1612\texttt{sp\_human.db} file, a file \texttt{swissprot.lseg} in the 1613directory specified by the environment variable \texttt{\$\{SLIB\}}, 1614and a command of the form: 1615\begin{quote} 1616\texttt{fasta36 -S mgstm1.aa 'sp\_human.db 10'} 1617\end{quote} 1618Would use the \texttt{sp\_human.db} file to search the subset of 1619\texttt{swissprot.lseg} that contained the specified \texttt{gi} 1620numbers. 1621 1622\subsection{DNA sequence libraries} 1623 1624Because of the large size of DNA databases, you will probably want to 1625keep DNA databases in only one format. The FASTA3 programs that 1626search DNA databases --- \texttt{fasta36}, \texttt{fastm36}, and 1627\texttt{tfastx/y36} --- can read DNA databases in Genbank flatfile (not 1628ASN.1), FASTA, and BLAST2.0 (\texttt{formatdb}) formats, as well as 1629EMBL format. BLAST2.0 format is preferred for DNA sequence libraries, 1630because the files are considerably more compact than GenBank format. 1631The NCBI does not provide software for converting from Genbank flat 1632files to Blast2.0 DNA databases, but you can use the Blast 1633\texttt{formatdb} program to convert ASN.1 formatted Genbank files, 1634which are available from the NCBI \texttt{ftp} site. 1635 1636The NCBI also provides the comprehensive \texttt{nt} DNA database, and 1637several EST databases in Blast2.0/\texttt{formatdb} format from 1638\texttt{ftp://ncbi.nih.gov/blast/db}. 1639 1640 1641\subsection{Finding the library files} 1642 1643All the FASTA programs comparison programs have the command line syntax: 1644\begin{quote} 1645\texttt{fasta36 query.file /seqdata/library} 1646\end{quote} 1647However, in addition to simply specifying the location of the database 1648to be searched 1649(\texttt{/seqdata/library}), the FASTA programs 1650provide several methods for referring to sequence databases without specifying a specific file. These methods can be used to provide abbreviations for sequence libraries, e.g.: 1651\begin{quote} 1652\texttt{fasta36 query.file s} 1653or 1654\texttt{fasta36 query.file +sp+} 1655\end{quote} 1656To use abbreviations like \texttt{'s'} or \texttt{'+sp+'} to reference a 1657sequence database, a \texttt{FASTLIBS} file must be used, see section 1658\ref{fastlibs}. 1659 1660Large DNA and protein databases are often distributed across several 1661files. For example, the NCBI \texttt{nr} protein database is found in 16625 files, \texttt{nr.00} ... \texttt{nr.04}. To search databases in 1663multiple files, the names of the files are specified in a file of 1664filenames, \texttt{nr.nam}: 1665\begin{quote} 1666\begin{verbatim} 1667<${SLIB}/blast_dbs/ 1668nr.00 12 1669nr.01 12 1670nr.02 12 1671nr.03 12 1672nr.04 12 1673\end{verbatim} 1674\end{quote} 1675In this file, the first line \texttt{<\$\{SLIB2\}/blast\_dbs/}, 1676beginning with \texttt{<}, specifies the location and format (Blast2.0 1677\texttt{formatdb}) the data files. Text of the form 1678\texttt{\$\{SLIB\}} refers to Unix/MacOSX/Windows environment 1679variables; the value of \texttt{\$\{SLIB\}} is set by a Unix/MacOSX 1680shell environment command. Thus, if the value of \texttt{\$\{SLIB\}} 1681is \texttt{/seqdata}, then the first sequence library file to be read 1682will be \texttt{/seqdata/blast\_dbs/nr.00}, in format 12 (Blast2.0 1683\texttt{formatdb}). 1684 1685To refer to the \texttt{nr.nam} file as a file of file names, it must 1686be prefixed by a \texttt{@} character, e.g. 1687\begin{quote} 1688\texttt{fasta36 query.file \textbf{@}nr.nam} 1689\end{quote} 1690Files of file names can contain references to other files of file names: 1691\begin{quote} 1692\begin{verbatim} 1693<${SLIB}/fasta_dbs/ 1694@pdb.nam 1695@swissprot.nam 1696\end{verbatim} 1697\end{quote} 1698The FASTA file of file names is similar to the NCBI 1699\texttt{prot\_db.pal} and \texttt{dna\_db.nal}, files, but 1700unfortunately they are different, and currently FASTA cannot read NCBI 1701\texttt{.pal} or \texttt{.nal} files that contain a \texttt{DBLIST} 1702line. FASTA can read NCBI \texttt{.pal} or \texttt{.nal} files that do 1703not contain a \texttt{DBLIST} line. 1704 1705FASTA version \texttt{fasta-36.3.6} provides an alternative way to 1706generate a database to be searched: the \texttt{!script.sh} file. 1707Like the \texttt{-e expand\_file.sh} script, a shell script or program 1708can be used to produce a database to a temporary file, which is then 1709seached. For example, if the file \texttt{cat\_db.sh} contains the 1710command \texttt{echo /seqdb/swissprot.lseg}, the command: 1711\begin{quote} 1712\begin{verbatim} 1713fasta36 query.aa \!@cat_db.sh 1714\end{verbatim} 1715\end{quote} 1716will cause \texttt{cat\_db.sh} to produce a temporary file with the 1717line \texttt{swissprot.lseg}, which is interpreted as an indirect file 1718of filenames; thus, because of the \texttt{@}, the file will be 1719interpreted as an indirect file, and the \texttt{swissprot.lseg} file 1720will be searched. Note that on Unix systems, the \texttt{'!'} must be 1721preceeded by a \texttt{'\textbackslash'} so that it is not interpreted by the 1722shell, as shown above. 1723 1724\subsection{\texttt{FASTLIBS}} 1725\label{fastlibs} 1726 1727All the search programs in the FASTA3 package can use the environment 1728variable \texttt{FASTLIBS} to find the protein and DNA sequence 1729libraries. (Alternatively, you can specify the \texttt{FASTLIBS} file 1730with the \texttt{-l fastlibs.file} option.) The \texttt{FASTLIBS} 1731variable contains the name of a file that has the actual filenames of 1732the libraries. The \texttt{fastlibs} file included with the 1733distribution is an example of a file that can be referred to by 1734FASTLIBS. To use the \texttt{fastlibs} file, type: 1735\begin{quote} 1736\texttt{setenv FASTLIBS /seqdata/info/fastgbs} (csh/tcsh)\\ 1737 or\\ 1738\texttt{export FASTLIBS=/seqdata/info/fastgbs} (bash/ksh) 1739\end{quote} 1740 Then edit the \texttt{fastlibs} file to indicate the location of the 1741 protein and DNA sequence libraries. If the protein sequence library is 1742 kept in the file \texttt{/seqdata/aa/swissprot.lseg} and your Genbank 1743 DNA sequence library is kept in the directory: 1744 \texttt{/seqdata/genbank}, then the \texttt{fastlibs} file might 1745 contain: 1746%%\pagebreak 1747\begin{verbatim} 1748SwissProt$0P/seqdata/aa/swissprot.lseg 0 1749UniProt$0+uniprot+@/seqdata/aa/uniprot.nam 1750GB Primate$1P@/seqdata/genbank/gpri.nam 1751GB Rodent$1R@/seqdata/genbank/grod.nam 1752GB Mammal$1M@/seqdata/genbank/gmammal.nam 1753^ 1 ^^^^ 4 ^ ^ 1754 23 (5) 1755\end{verbatim} 1756The first line of this file says that there is a copy of the SwissProt 1757sequence database (a protein database) that can be selected by typing 1758"P" on the command line or when the database menu is presented in 1759interactive mode. 1760 1761Note that there are 4 (or 5) fields in the lines in the 1762\texttt{fastlibs} file. The first field describes library and is 1763displayed by FASTA program; it ends with the '\$'. The second field 1764(1 character), is a 0 if the library is a protein library and 1 if it 1765is a DNA library. The third field can either be a single character 1766(\texttt{P}) or a word surrounded by the \texttt{+} symbol 1767(\texttt{+uniprot+}), and can be used to specify the library on the command line or in interactive mode. 1768 1769The fourth field is the name of the library file. In the example 1770above, the \texttt{/seqdata/aa/swissprot.lseg} file contains the 1771entire protein sequence library. Alternatively, 1772\texttt{/seqdata/aa/uniprot.nam} is a file of file names, which 1773contains a list of one or more library files. Likewise, the DNA 1774library files are files of file names. 1775 1776In addition, an optional fifth field can be used to specify the format 1777of the library file. Alternatively, you can specify the library 1778format in a file of file names. This field must be separated from the 1779file name by a space character ('\ ') from the filename. FASTA can 1780read the libraries in the following formats:\\ 1781 1782\begin{tabular}{r l} 17830 & FASTA (\texttt{>SEQID} - comment/sequence) \\ 17841 & Uncompressed Genbank (LOCUS/DEFINITION/ORIGIN)\\ 17852 & NBRF CODATA (ENTRY/SEQUENCE) (obsolete)\\ 17863 & EMBL/SWISS-PROT (ID/DE/SQ)\\ 17874 & Intelligenetics (;comment/SEQID/sequence) (obsolete)\\ 17885 & NBRF/PIR VMS (\texttt{>P1;SEQID}/comment/sequence) (obsolete)\\ 17896 & GCG (version 8.0) Unix Protein and DNA (compressed)\\ 17907 & FASTQ (sequence only, quality ignored)\\ 179110 & subset format (</slib2/swissprot.lseg 0:2 4|) \\ 179211 & NCBI Blast1.3.2 format (unix only) (obsolete)\\ 179312 & NCBI Blast2.0 format\\ 179416 & MySQL (requires special compilation) \\ 179517 & Postgres (requires special compilation) \\ 1796\end{tabular} 1797 1798Today, the most popular formats are \texttt{FASTA}, type \texttt{'0'}, 1799the default, and the NCBI Blast2.0 \texttt{formatdb} formats (type 1800\texttt{'12'}). The FASTA programs cannot read NCBI ASN.1 formatted databases. 1801If a library format is not specified, for example, because 1802you are just comparing two sequences, FASTA (format 0) is used by 1803default. To specify a library type on the command line, add it to the 1804library filename and surround the filename and library type in quotes: 1805\begin{quote} 1806\begin{verbatim} 1807fasta36 query.file "/seqdb/genbank/gbmam 12" 1808\end{verbatim} 1809\end{quote} 1810NCBI \texttt{formatdb} databases are built from multiple files, 1811e.g. \texttt{gbmam.nsq}, \texttt{gbmam.nhr}, \texttt{gbmam.nin}; to 1812refer to the complete set of files, simply use name before the 1813suffixes, e.g. \texttt{gbmam}. When NCBI databases distributed across 1814several files, e.g. \texttt{gbbct.00}, \texttt{gbbct.01}, etc, those 1815files must be included in a \texttt{gbbct.nam} file of file names. 1816 1817FASTA subset format ({\tt 10}) allows users to search a subset of a 1818sequence database, by specifying a list of {\tt gi} numbers or accessions in 1819a larger database. The format begins with a line naming the file 1820sequence file followed by information about how to 1821extract the {\tt gi} number or accession. Thus, the line. 1822\begin{quote} 1823\texttt{<library\_file lib\_fmt:id\_fmt id\_loc} 1824\end{quote} 1825where {\tt lib\_fmt} is the library format (0), {\tt id\_fmt} is the 1826format of the sequence identifier (:1, :2 - ordered strings or 1827numbers; :3, :4 - unordered strings or numbers), 1828and {\tt id\_loc} is the location of the sequence identifier. For example, 1829\begin{quote} 1830\begin{verbatim} 1831</slib2/blast/swissprot.lseg 0:2 4| 18323121763 183351701705 18347404340 183574735515 1836... 1837\end{verbatim} 1838\end{quote} 1839specifies the file containing all the sequences and the file is in 1840FASTA format ('0:'), the sequence identifier is a number (':2'), 1841and the identifier starts at character 4 and ends with the \texttt{'|'} symbol. 1842 1843The major problem that most new users of the FASTA package have is in 1844setting up the program to find the databases and their library type. 1845In general, if you cannot get \texttt{fasta36} to read a sequence 1846database, there is probably something wrong with the \texttt{FASTLIBS} 1847file. A common problem is that the database file is found, but either 1848no sequences are read, or an incorrect number of entries is read. 1849This is almost always because the library format (\texttt{libtype}) is 1850incorrect. 1851 1852Test the setup by running FASTA. Enter the sequence 1853file '\texttt{mgstm1.aa}' when the program requests it (this file is 1854included with the programs). The program should then ask you to 1855select a protein sequence library. Alternatively, if you run the 1856\texttt{tfastx36 -I} program and use the mgstm1.aa query sequence, the program 1857should show you a selection of DNA sequence libraries. 1858Once the \texttt{fastlibs} file has been set up correctly, you can 1859set FASTLIBS=fastgbs in your AUTOEXEC.BAT file, and you will not need to 1860remember where the libraries are kept or how they are named. 1861 1862%%\pagebreak 1863\section{Frequently Asked Questions (FAQs)} 1864 1865{\noindent}\textbf{Where can I get FASTA?} -- 1866\url{http://faculty.virginia.edu/wrpearson/fasta} has the latest 1867versions of the FASTA programs. This document describes 1868\texttt{\CURRENT}, which is available from 1869\url{http://faculty.virginia.edu/wrpearson/fasta/fasta3.tar.gz}. 1870In addition, pre-compiled versions of the programs are available for 1871MacOSX and Windows. 1872 1873\needspace{4\baselineskip} 1874{\noindent}\textbf{Which program should I use?} -- See Table I, also:\\ 1875 1876\begin{tabular}{l l l l l } 1877\hline \\[-1.0ex] 1878Query & Library & FASTA pgm. & BLAST pgm. & \\[1.2ex] 1879\hline \\[-1.0ex] 1880Prot. & Prot. & \texttt{fasta36} & \texttt{blastp} & heuristic local similarity \\ 1881 & & \texttt{ssearch36} & & optimal local sim.\\ 1882 & & \texttt{ggearch36} & & global:global sim. \\ 1883 & & \texttt{ggearch36} & & global:local sim.\\ 1884DNA & DNA & \texttt{fasta36}$^*$ & \texttt{blastn} & \\[1.2ex] 1885\hline \\[-1.0ex] 1886Prot. & Prot. & \texttt{lalign36} & & multiple non-intersecting \\ 1887DNA & DNA & & & alignments \\[1.2ex] 1888\hline \\[-1.0ex] 1889DNA & Prot. & \texttt{fastx36} & \texttt{blastx} & trans. DNA:protein sim. \\ 1890 & & \texttt{fasty36} & & \\[1.2ex] 1891\hline \\[-1.0ex] 1892Prot. & DNA & \texttt{tfastx36} & \texttt{blastn} & protein:trans. DNA \\ 1893 & & \texttt{tfasty36} & & \\[1.2ex] 1894\hline \\[-1.0ex] 1895Prot. & Prot. & \texttt{fasts36} & & Unordered peptides \\ 1896Prot. & DNA & \texttt{tfasts36} & & Unordered peptides \\ 1897DNA & DNA & \texttt{fasts36} & & Unordered oligonucleotides \\ 1898Prot. & Prot. & \texttt{fastm36} & & Ordered peptides \\ 1899DNA & DNA & \texttt{fastm36} & & Ordered oligos \\[1.2 ex] 1900\hline \\[-1.0ex] 1901\multicolumn{5}{l}{$^*$\texttt{ssearch36} can also be used for DNA:DNA, but is much slower and no more sensitive.}\\[0.2ex] 1902\hline \\ 1903\end{tabular} 1904 1905\needspace{4 ex} 1906{\noindent}\textbf{How do I make FASTA act/look like BLAST}? -- 1907\vspace{-0.5ex} 1908\begin{quote} 1909\texttt{fasta36 -s BP62 -m BB query.file library.file} 1910\end{quote} 1911\vspace{-0.5ex} 1912\texttt{-s BP62} sets the same scoring matrix (BLOSUM62) and 1913gap-penalties (-11/-1) as BLAST (FASTA uses BLOSUM50 by 1914default). \texttt{-m BB} produces very BLAST-like output. 1915 1916In addition, the \texttt{-m 8} and \texttt{-m 8C} options provide 1917BLAST tabular output, optionally with comments (\texttt{-m 8C}). This 1918compact output is effective for analysis pipelines. In addtion, 1919\texttt{-m 8XC} (no comments) or \texttt{-m 8CC} provides two 1920additional blast-tabular fields, a CIGAR alignment string and (if 1921available), an annotation string. 1922 1923{\noindent}\textbf{When I search Genbank - the program reports:} \texttt{0 residues in 0 1924sequences}? This typically happens because the program does not 1925know that you are searching a Genbank flatfile database and is looking 1926for a FASTA format database. Be certain to specify the library type 1927("1" for Genbank flatfile) with the database name. 1928 1929{\noindent}\textbf{The search seemed to work, but I do not see any results.} -- In 1930command line mode (the default), all the FASTA programs limit the 1931number of high scoring sequences shown using an expectation value 1932cutoff ($E()<10$ for proteins; $E()<2$ for DNA). Sometimes, a search 1933will complete successfully (you see the message \texttt{XXXX residues 1934 in YYY sequences}) but the message: \texttt{!! No sequences with E() 1935 < 10} instead of \texttt{The best scores are:}. Typically, this 1936happens because of a problem with the statistical estimation process; 1937in particular, if the library contains only related sequences and 1938\texttt{-z 11} was not used, none of the hits may be ``significant''. 1939To trouble shoot this problem, you can search with \texttt{-z -1}, 1940which turns off all the statistical estimation procedures, and will 1941show the 20 highest scoring sequences (\texttt{-b \#} sets the default 1942number of sequences shown). 1943 1944{\noindent}\textbf{What is the difference between} \texttt{fastx3} and 1945\texttt{fasty3}? (or \texttt{tfastx3} and \texttt{tfasty3})? -- 1946\texttt{[t]fastx3} uses a simpler codon based model for alignments 1947that does not allow frameshifts in some codon positions (see 1948ref. \cite{wrp971}). \texttt{fastx3} is about 30\% faster, but 1949\texttt{fasty3} can produce higher quality alignments in some cases. 1950 1951\vspace{0.5ex} 1952{\noindent}\textbf{What is ktup}? -- All of the programs with \texttt{fast} in their 1953name use a computer science method called a lookup table to speed the 1954search. For proteins with \emph{ktup}=2, this means that the program 1955does not look at any sequence alignment that does not involve matching 1956two identical residues in both sequences. Likewise with DNA and 1957\emph{ktup} = 6, the initial alignment of the sequences looks for 6 1958identical adjacent nucleotides in both sequences. Because it is less 1959likely that two identical amino-acids will line up by chance in two 1960unrelated proteins, this speeds up the comparison. But very distantly 1961related sequences may never have two identical residues in a row but 1962will have single aligned identities. In this case, \emph{ktup} = 1 may 1963find alignments that \emph{ktup}=2 misses. 1964 1965\vspace{0.5ex} {\noindent}\textbf{How do I turn off statistics}? -- 1966The FASTA programs are designed to identify homologs based on 1967statistically significant similarity; to infer homology you need 1968accurate statistical estimates. Sometimes, however, you know the 1969sequences are related, and searching against libraries of related 1970sequences can confuse FASTA if you do not use \texttt{-z 11}. If all 1971you want are scores and alignments, use \texttt{-z -1} to turn off 1972statistical estimates. 1973 1974\vspace{0.5ex} 1975{\noindent}\textbf{Where are} \texttt{prss} {\noindent}\textbf{and} \texttt{prfx}? -- Earlier FASTA3 1976releases included \texttt{prss3} and \texttt{prfx3}. With FASTA 1977version 35 and 36, these programs have been incorporated into 1978\texttt{ssearch36} and \texttt{fastx36}. FASTA version 35 and 36 1979programs now automatically estimate statistical parameters by 1980shuffling - the function of \texttt{prss} and \texttt{prfx}, when 1981searching for libraries with fewer than 500 members. 1982 1983\vspace{0.5ex} 1984{\noindent}\textbf{Where is} \texttt{tfasta}? -- Although it is possible to make 1985\texttt{tfasta36}, it is not compiled by default. \texttt{tfastx36} 1986and \texttt{tfasty36} allow frame-shifts to be joined into a single 1987alignment; \texttt{tfasta} did not. \texttt{tfastx36} produces better 1988alignments with better statistics. 1989 1990\vspace{0.5ex} 1991{\noindent}\textbf{Can I run the FASTA programs on a cluster}? -- With version 199236.3.4, almost all of the FASTA programs can be run on clusters of 1993computers using MPI (Message Passaging Interface). The programs can 1994be compiled using \texttt{make -f ../make/Makefile.mpi\_sse2} from the 1995\texttt{fasta36/src} directory. Except for \texttt{lalign36}, all the 1996programs in Table I are available as \texttt{fasta36\_mpi}, 1997\texttt{ssearch36\_mpi}, etc. 1998 1999Unfortunately, the current MPI implementation involves substantially 2000more communications overhead than the threaded versions. The FASTA 2001programs are very efficient on threaded machines; if the preload 2002option is used (edit \texttt{make/Makefile36m.common} to use 2003\texttt{comp\_lib8.c}), the FASTA programs can obtain more than 40-fold speedup on a 48-core machine (the largest I have tested). 2004 2005\vspace{0.5ex} 2006{\noindent}\textbf{Sometimes, in the list of best scores, the same sequence is 2007 shown twice with exactly the same score. Sometimes, the sequence is 2008 there twice, but the scores are slightly different}? -- When any of 2009the FASTA programs searches a long sequence, it breaks the sequence up 2010into \emph{overlapping} pieces. If the highest scoring alignment is 2011at the end of one piece, it will be scored again at the beginning of 2012the next piece. If the alignment is not be completely included in the 2013overlap region, one of the pieces will give a higher score than the 2014other. These duplications can be detected by looking at the 2015coordinates of the alignment. If either the beginning or end 2016coordinate is identical in two alignments, the alignments are at least 2017partially duplicates. 2018 2019\vspace{2ex} 2020As always, please inform me of bugs as soon as possible. 2021 2022\begin{quote} 2023William R. Pearson\\ 2024Department of Biochemistry\\ 2025Jordan Hall Box 800733\\ 2026U. of Virginia\\ 2027Charlottesville, VA\\ 2028wrp@virginia.EDU 2029\end{quote} 2030 2031\bibliographystyle{plain} 2032\bibliography{fasta_guide} 2033 2034\appendix 2035\section*{Appendix} 2036 2037\section{FASTA Makefile compile time options} 2038 2039\begin{table} 2040\caption{\label{make-defs}FASTA \texttt{Makefile} compile time \texttt{\#defines}} 2041\vspace{1.0ex} 2042\begin{tabular}{l l p{1.00 in} p{3.0 in}} 2043\hline\\[-1.2ex] 2044\texttt{\#define} & Status$^*$ & Target file(s) & Function \\[1.0ex] 2045\hline\\[-1.5ex] 2046\texttt{ALLOCN0} & obs & \texttt{dropnfa.c}, \texttt{dropfx.c}, \texttt{dropfz2.c} & allows FASTA algorithm to use memory $\sim$ query length (n0), not query $+$ library (n0+n1). \\ 2047\texttt{DNALIB\_LC} & undef & \texttt{initfa.c} & enable lower case masking for DNA libraries \\ 2048\texttt{HTML\_HEAD} & undef & \texttt{comp\_lib5e.c}, \texttt{comp\_lib8.c} & wrap \texttt{-m 6} HTML output with \texttt{<html> <body> </body> </html>} \\ 2049\texttt{M10\_CONS} & def & \texttt{c\_dispn.c} & show consensus line (\texttt{:. }) with \texttt{-m 10} output. \\ 2050\texttt{OLD\_FASTA\_GAP} & undef & \texttt{drop*.c} & use first-residue/additional residue penalties, not open/extend. \\ 2051\texttt{PGM\_DOC} & def & \texttt{comp\_lib5e.c}, \texttt{comp\_lib8.c} & provide \texttt{\#pgm\_name -opt1 -opt2 query file} copy of command line \\ 2052\texttt{PROGRESS} & def & \texttt{comp\_lib5e.c}, \texttt{comp\_lib8.c} & provide progress symbols in interactive mode \\ 2053\texttt{SAMP\_STATS} & def & \texttt{comp\_lib5e.c}, \texttt{comp\_lib8.c} & scores are sampled for statistical estimates \\ 2054\texttt{SAMP\_STATS\_LESS} & def & \texttt{compacc.c} & a slower sampling strategy is used \\ 2055\texttt{SHOW\_ALIGN\_SCORE} & undef & \texttt{wm\_align.c} & print score, cummulative score, during alignment (for teaching) \\ 2056\texttt{SHOW\_HELP} & def & \texttt{comp\_lib5e.c}, \texttt{comp\_lib8.c}, \texttt{initfa.c}, \texttt{doinit.c} & print out help information with '-help', or no arguments given. Undef \texttt{SHOW\_HELP} reverts to pre-\texttt{fasta-35.4.4}.\\ 2057\texttt{SHOW\_HIST} & undef & \texttt{doinit.c} & inverts current meaning of \texttt{-H} (shows by default for non-PCOMPLIB (MPI) programs). \\ 2058\texttt{SHOWSIM} & def & \texttt{mshowbest.c} \texttt{mshowalign2.c} & display percent similarity \\ 2059\texttt{USE\_LNSTATS} & obs & \texttt{scaleswn.c} & use $ln()$-scaling for scores, removed in \texttt{fasta2.0}.\\ 2060\hline \\ 2061\end{tabular} 2062$^*$Status: def: \#defined in standard \texttt{Makefiles}; undef: undefined; obs: obsolete, provided backwards compatibility with FASTA2.0 or earlier. 2063\end{table} 2064 2065The \texttt{fasta-36/make} directory includes \texttt{Makefile}s 2066appropriate for a broad range of environments, including Linux/Unix, 2067BSD, MacOSX, and Windows. Makefiles are regularly tested against 2068MacOSX, Linux, and Windows. Table \ref{make-defs} summarizes the 2069Makefile options that can be modified. 2070 2071As distributed, the \texttt{Makefiles} in \texttt{fasta36/make}, build 2072a version of the FASTA programs that is optimized for single searches 2073against arbitrary sized databases, using bit scores, efficient sampled 2074statistics, and gap-open/extend penalties. The default compilation 2075configuration can be changed either by changing the compile time 2076defines (Table \ref{make-defs}) in the main \texttt{Makefile}, 2077e.g. \texttt{make/Makefile.linux64\_sse2}, or by editing 2078\texttt{make/Makefile36m.common}. 2079 2080\emph{High-performance searches with many queries} -- By default, the 2081\texttt{comp\_lib5e.c} program specified in 2082\texttt{Makefile36m.common} builds FASTA programs that re-read the 2083library sequence database for every query sequence. This has the 2084advantage that sequence comparison begins almost immediately, but if 2085thousands of searches are being performed, the database is re-read 2086thousands of times. \texttt{Makefile36m.common} can be edited to use 2087\texttt{comp\_lib8.c} in place of \texttt{comp\_lib5e.c} and the 2088database is read only once, then held in memory for additional 2089searches. Of course, if \texttt{comp\_lib8.c} is used, the computer 2090must have enough memory to store the complete database. Keeping the 2091database in memory allows the FASTA programs to very efficiently used 2092large, multicore computers. 2093 2094\needspace{5\baselineskip} 2095\emph{Parallel searches with MPI} -- By default under 2096Unix/Linux/MacOSX, the FASTA programs are threaded; they will spawn as 2097many threads as CPU cores are available (this can be limited with the 2098\texttt{-t n-threads} option). Using \texttt{comp\_lib8.c}, we see 2099almost 48-fold speedup on a 48-core machine. The FASTA programs can 2100also be run in parallel in the MPI environment on clusters of 2101computers. To build the MPI versions of the programs, use 2102\texttt{make ../make/Makefile.mpi\_sse2 ssearch36\_mpi}, 2103\texttt{fastx36\_mpi}, etc. The MPI programs currently substantially 2104more communications overhead than the threaded versions, so they may 2105not scale as well to large clusters. 2106 2107\include{fasta.history} 2108\end{document} 2109