.nr pp 11
.nr sp 11
.nr tp 11
.nr fp 10
.nr fi 0n
.sz 11
.if t \{
.po 1i
.he 'FASTA.DOC''Release 2.0u4, February 1996'
.fo ''- % -''
\}
.if n \{
.po 0
.na
.nh
\}
.ll 6.5i
.ce
\fB\s+2COPYRIGHT NOTICE\s0\fP
.lp
Copyright 1988, 1991, 1992, 1994, 1995, 1996 by William R. Pearson and
the University of Virginia.  All rights reserved. The FASTA program
and documentation may not be sold or incorporated into a commercial
product, in whole or in part, without written consent of William
R. Pearson and the University of Virginia.  For further information
regarding permission for use or reproduction, please contact: David
Hudson, Assistant Provost for Research, University of Virginia,
P.O. Box 9025, Charlottesville, VA 22906-9025, (804) 924-6853
.sp
.uh "\s+2The FASTA program package\s0"
.uh "Introduction"
.pp
This documentation describes version 2.0x of the FASTA program
package (see W. R. Pearson and D. J. Lipman (1988), "Improved Tools
for Biological Sequence Analysis", PNAS 85:2444-2448, and W. R.
Pearson (1990) "Rapid and Sensitive Sequence Comparison with FASTP and
FASTA" Methods in Enzymology 183:63-98). Version 2.0 modifies version
1.8 to include explicit statistical estimates for similarity scores
based on the extreme value distribution.  In addition, FASTA protein
alignments now use the Smith-Waterman algorithm with no limitation on
gap size. FASTA and SSEARCH now use the BLOSUM50 matrix by default,
with options to change gap penalties on the command line. Version 1.7
replaces rdf2 and rss with prdf and prss, which use the extreme-value
distribution to calculate accurate probability estimates.
.sp
.lp
Although there are a large number of programs in this package, they
belong to four groups:
.(l

Library search programs: FASTA, FASTX, TFASTA, TFASTX, SSEARCH

Local homology programs: LFASTA, PLFASTA, LALIGN, PLALIGN, FLALIGN

Statistical significance: PRDF, RELATE, PRSS, RANDSEQ

Global alignment: ALIGN

.)l
.lp
In addition, I have included several programs for protein sequence
analysis, including a Kyte-Doolittle hydropathicity plotting program
(GREASE, TGREASE), and a secondary structure prediction package
(GARNIER).
.pp
The FASTA sequence comparison programs on this disk are improved
versions of the FASTP program, originally described in Science (Lipman
and Pearson, (1985) Science 227:1435-1441).  We have made several
improvements.  First, the library search programs use a more sensitive
method for the initial comparison of two sequences which allows the
scores of several similar regions to be combined.  Consequently, a
library search now reports three scores:
\f2initn\fP (the new initial score which may include several similar
regions), \f2init1\fP (the old fastp initial score from the best
initial region), and \f2opt\fP (the old fastp optimized score allowing
gaps in a 32 residue wide band).
.pp
These programs have also been modified to become "universal" (hence
FAST-A, for FAST-All, as opposed to FAST-P (protein) or FAST-N
(nucleotides)); by changing the environment variable SMATRIX, the
programs can be used to search protein sequences, DNA sequences, or
whatever you like.  By default, FASTA, LFASTA, and the PRDF programs
automatically recognize protein and DNA sequences.  Sequences are
first read as amino acids, and then converted to nucleotides if the
sequence is greater than 85% A,C,G,T (the '-n' option can be used to
indicate DNA sequences).  TFASTA compares protein sequences to a
translated DNA sequence.  Alternative scoring matrices can also be
used.  In addition to the BLOSUM50 matrix for proteins, the PAM250
matrix or matrices based on simple identities or the genetic code can
also be used for sequence comparisons or evaluation of significance.
Several different protein sequence matrices have been included;
instructions for constructing your own scoring matrix are included in
the file FORMAT.DOC.
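.lp
The automatic recognition rule described above can be illustrated with
a short Python sketch.  This is not the code used by the FASTA
programs; the function name \fClooks_like_dna\fP and the decision to
count only alphabetic characters are assumptions made for
illustration.  Only the 85% A,C,G,T threshold comes from the text.
.(l
.ft C
```python
def looks_like_dna(sequence, threshold=0.85):
    """Return True if more than `threshold` of the residues are A, C, G, or T."""
    # Assumption: only letters count as residues; digits and spaces are skipped.
    residues = [c for c in sequence.upper() if c.isalpha()]
    if not residues:
        return False
    acgt = sum(1 for c in residues if c in "ACGT")
    return acgt / len(residues) > threshold


print(looks_like_dna("ACGTACGTTTGACCA"))      # every residue is A, C, G, or T
print(looks_like_dna("MKVLAAGIVGLCAQEPT"))    # typical protein residues
```
.ft R
.)l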
.sp 2
The remainder of this document is divided into three sections: (1) a
brief history of the changes to the FASTA package; (2) a guide to
installing the programs and databases; (3) a guide to using the FASTA
programs. The programs are very easy to use, so if you are using them
on a machine that is administered by someone else, you may want to
skip to section (3) to learn how to use the programs, and then read
section (1) to look at some of the more recent changes.  If you are
installing the programs on your own machine, you will need to read
section (2) carefully.
.sp
.sh 1 "Revision History"
.sh 2 "Changes with version 2.0u"
.pp
Version 2.0u provides several major improvements over previous
versions of FASTA (and SSEARCH).  The most important is the
incorporation of explicit statistical estimates and appropriate
normalization of similarity scores. This improvement is discussed in
more detail below in the section entitled \(lqStatistical
Significance\(rq.  In addition, all of the protein comparison programs
now use the BLOSUM50 matrix, with gap penalties of -12, -2, by
default.  BLOSUM50 performs significantly better than the older PAM250
matrix.  PAM250 can still be used with the command line option: \(lq-s
250\(rq.  (DNA sequence comparisons use a more stringent gap
penalty of -16, -4, which produces excellent statistical estimates
when optimized scores are used. TFASTA uses -16, -4 as well.)
.pp
The quality of the fit of the extreme value distribution to the actual
distribution of similarity scores is summarized with the
Kolmogorov-Smirnov statistic.  The acceptance limits for this
statistic can be found in many statistics books.  In general, values
<0.10 (N=30) indicate excellent agreement between the actual and
theoretical distributions.  If this statistic is > 0.2, consider
using a higher (more stringent) gap penalty, e.g. -16, -4 rather than
-12, -2.  The default scoring matrix for DNA has been changed to score
+5 for an identity and -4 for a mismatch.  These are the same scores
used by BLASTN.
.pp
With explicit expectation calculations, the program now shows all
scores and alignments with expectations less than 10.0 (with optimized
scores, 2.0 without optimization) when the "-Q" (quiet) mode is used.
The expectation threshold can be changed with the "-E" option.
.pp
Finally, the algorithm used to produce the final alignments of protein
sequences is now a full Smith-Waterman, with unlimited gaps.  (The
older band-limited alignments are used for DNA sequences and TFASTA by
default, because Smith-Waterman alignments are very slow for long
sequences.)  Both the \(lqoptimized\(rq and \(lqSmith-Waterman\(rq
scores are reported; if the Smith-Waterman score is higher, then
additional gaps allowed a better alignment and similarity score to be
calculated.
.pp
FASTA searches now optimize similarity scores by default (this slows
searches about 2-fold (worst case) for \fIktup=2\fP). Thus, the
meaning of the "-o" option has been reversed; "-o" now turns off
optimization and reports results sorted by "initn" scores.
Optimization significantly improves the sensitivity of FASTA, so that
it almost matches Smith-Waterman.  With version 2.0, the default band
width used for optimized calculations can be varied with the "-y"
option.  For proteins with ktup=2, a width of 16 (-y 16) is used; 16
is also used for DNA sequences.  For proteins and ktup=1, a width of
32 is used. Searches that disable optimization with the "-o" option
will work fine for sequences that share 25% or more identity in
general, but to detect evolutionary relationships with 20% \- 25%
identity, the more sensitive default optimization is often required.
Optimization is required for accurate statistical estimates with
either protein or DNA sequences.
.pp
The FASTA package now includes \f(CBFASTX\fP, a program that compares
a DNA sequence to a protein sequence database by translating the DNA
sequence in three frames (the reverse frames are selected with the
\fC-i\fP option) and aligning the three-frame translation with the
sequences in the protein database.  Alignment scores allow frameshifts
so that a cDNA or EST sequence with insertion/deletion errors can be
aligned with its homologues from beginning to end.
.pp
With release 2.0u6, there is also a \f(CBTFASTX\fP program, which is a
replacement for TFASTA.  TFASTA treats each of the six reading frames
of a DNA library sequence as a different sequence; \f(CBTFASTX\fP
compares a protein sequence against only two sequences from each DNA
sequence \- the forward and reverse orientation.  For a given
orientation, \f(CBTFASTX\fP calculates a similarity score for
alignments that allow frameshifts, thus considering all possible
reading frames.
.pp
Another new program is included - \f(CBrandseq\fP - which will produce a
randomly shuffled sequence (uniform or local shuffle) from an input sequence.
This randomly shuffled sequence can be used to evaluate the
statistical estimates produced by FASTA, SSEARCH, or BLAST.
.sh 2 "Changes with version 1.7"
.br
Version 1.7 has been released to provide the PRDF and PRSS programs
for shuffling sequences and estimating accurately the probabilities of
the unshuffled-sequence scores.
.ip "PRDF" 1i
a version of RDF2 that calculates the probability of a similarity
score more accurately by using a fit to an extreme value distribution.
Code to fit the extreme value distribution parameters and the impetus
to update RDF2 was provided by Phil Green, U. of Washington.
.ip "PRSS" 1i
a version of PRDF that uses a rigorous Smith-Waterman calculation to
score similarities
.sh 2 "Changes with version 1.6"
.pp
FASTA version 1.6 uses a new method for calculating optimal
scores in a band (the optimization or last step in the FASTA
algorithm). In addition, it uses a linear-space method for calculating
the actual alignments.  FASTA v1.6 package includes several new
programs:
.ip "SSEARCH" 1i
a program to search a sequence database using
the rigorous Smith-Waterman algorithm (this
program is about 100-fold slower than FASTA
with ktup=2 for proteins).
.ip "LALIGN" 1i
A rigorous local sequence alignment program that will display the
N-best local alignments (N=10 by default).
.ip "PLALIGN" 1i
a version of lalign that plots the local alignments to postscript.
.ip "FLALIGN" 1i
a version of lalign that plots the local alignments to a GCG Figure
file.
.pp
The LALIGN/PLALIGN/FLALIGN programs incorporate the "sim" algorithm
described by Huang and Miller (1991) Adv. Appl. Math. 12:337-357.
The SSEARCH and PRSS programs incorporate algorithms described by
Huang, Hardison, and Miller (1990) CABIOS 6:373-381.
.pp
LFASTA and PLFASTA now calculate a different number of local
similarities; they now behave more like LALIGN/PLALIGN.
Since local alignments of identical sequences produce "mirror-image"
alignments, lalign and lfasta consider only one-half of the potential
alignments between sequences from identical file names.  Thus
.(l I
\fClfasta mchu.aa mchu.aa\fP
.)l
displays only two alignments; with earlier versions of the program, it
would have displayed five, including the identity alignment.  PLFASTA
does display five alignments; when two identical filenames are given,
it draws the identity alignment, calculates the two unique local
alignments, draws them, and draws their mirror images.
LFASTA/PLFASTA and LALIGN/PLALIGN use the filenames, rather than the
actual sequences, to determine whether sequences are identical; you
can "trick" the programs into behaving the old way by putting the same
sequence in two different files.
.sh 2 "Changes with version 1.5"
.pp
FASTA version 1.5 includes a number of substantial revisions to
improve the performance and sensitivity of the program.  It is now
possible to tell the program to optimize all of the \f2initn\fP scores
greater than a threshold.  The threshold is set at the same value as
the old FASTA cutoff score.  Alternatively, you can tell FASTA to sort
the results by the \f2init1\fP, rather than the \f2initn\fP, score by
using the \fC-1\fP option.  \fCFASTA -1 ...\fP will report the results
the way the older FASTP program did.
.pp
A new method has been provided for selecting libraries. In the
past, one could enter the name of a sequence file to be searched or a
single letter that would specify a library from the list included in
the $FASTLIBS file. Now, you can specify a set of library files with a
string of letters preceded by a '%'.  Thus, if the FASTLIBS file has
the lines:
.(l I

Genbank 70 primates$1P/seqlib/gbpri.seq 1
Genbank 70 rodents$1R/seqlib/gbrod.seq 1
Genbank 70 other mammals$1M/seqlib/gbmam.seq 1
Genbank 70 vertebrates $1B/seqlib/gbvrt.seq 1

.)l
Then the string: "%PRMB" would tell FASTA to search the four libraries
listed above.  The %PRMB string can be entered either on the command
line or when the program asks for a filename or library letter.
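.lp
As an illustration of the selection rule, the following Python sketch
maps a choice such as "R" or "%PRMB" to library file names, using the
four FASTLIBS lines shown above.  The function name
\fCselect_libraries\fP is hypothetical, and the sketch does not
reproduce how FASTA itself reads the file.
.(l
.ft C
```python
FASTLIBS_LINES = [
    "Genbank 70 primates$1P/seqlib/gbpri.seq 1",
    "Genbank 70 rodents$1R/seqlib/gbrod.seq 1",
    "Genbank 70 other mammals$1M/seqlib/gbmam.seq 1",
    "Genbank 70 vertebrates $1B/seqlib/gbvrt.seq 1",
]


def select_libraries(lines, choice):
    """Map a choice such as 'P' or '%PRMB' to the matching library file names."""
    by_letter = {}
    for line in lines:
        _, rest = line.split("$", 1)
        letter = rest[1]                 # rest[0] is the 0/1 protein/DNA flag
        by_letter[letter] = rest[2:].split()[0]
    # A leading '%' means every following letter names one library.
    letters = choice[1:] if choice.startswith("%") else choice[:1]
    return [by_letter[c] for c in letters if c in by_letter]


for name in select_libraries(FASTLIBS_LINES, "%PRMB"):
    print(name)
```
.ft R
.)l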
.pp
FASTA1.5 also provides additional flexibility for specifying the
number of results and alignments to be displayed with the \fC-Q\fP
(quiet) option.  The \fC-b number\fP option allows you to specify the
number of sequence scores to show when the search is finished.  Thus
.(l I

\fCFASTA -b 100 ...\fP

.)l
tells the program to display the top 100 sequence scores. In the past,
if you displayed 100 scores (in \fC-Q\fP mode), you would also have
to store 100 alignments. The \fC-d\fP option allows you to limit the
number of alignments shown.  \fCFASTA -b 100 -d 20\fP would show 100
scores and 20 alignments.
.pp
Finally, FASTA can provide a complete list of all of the sequences and
scores calculated to a file with the \fC-r\fP (results) option.
\fCFASTA -r results.out ...\fP creates a file with a list of scores
for every sequence in the library.  The list is not sorted, and only
includes those scores calculated during the initial scan of the
library.
.sh 1 "Installing the FASTA package"
.sh 2 "Installing the programs"
.sh 3 "Unix version"
.pp
The FASTA distribution comes with several \fCmakefile\fP's that can be
used to compile the FASTA programs.  Over the years, as ATT Unix
System 5 and BSD unix have converged, these files have become very
similar. To begin with, I recommend using the standard \fCMakefile\fP.
There are two values in the \fCmakefile\fP that should be checked against
the values used on your system.  The first is the \fCHZ\fP value, the frequency
in ticks per second used by the \fCtimes()\fP system call; this value can
usually be found by running:
.(l I
\fCgrep HZ /usr/include/sys/*\fP
.)l
The second is the function used to return random numbers.  If you have a
\fCrand48()\fP function that returns a 32-bit random number, use it and use
the lines:
.(l I
\fCNRAND=nrand48
RANFLG= -DRAND32\fP
.)l
If not, you will need to use the \fCrand()\fP function call and determine
whether it returns a 16-bit or a 32-bit value.  These functions are
used by PRDF and PRSS.
.pp
If you have problems compiling the programs, you may want to examine
the \fCmakefile.unx\fP and \fCmakefile.sun\fP files, to look for differences.
I have tried to use very standard unix functions in these programs,
and they have been successfully compiled, with very small changes to
the \fCMakefile\fP, on Sun's (Sun OS 4.1), IBM RS/6000's (AIX), and MIPS
machines (under the BSD environment).
.sh 3 "IBM-PC/DOS version"
.pp
For the IBM-PC/DOS version, the FASTA source code disk contains the
complete source code to all of the programs on the other disks.  The
programs were compiled with Borland's Turbo 'C++', using Borland's
MAKE utility.  The graphics programs (PLFASTA, TGREASE) use the
graphics device drivers supplied with the Turbo 'C' V2.0 package.
Also included are the documentation files PROGRAMS.DOC and FORMAT.DOC.
You do not need any of the files on the source code disk to run the
programs.  The files on this disk are identical to the UNIX and VMS
versions that run on larger machines.  Also included is the code to
compile ALIGN0.EXE.  ALIGN0 is the same as ALIGN, but does not
penalize for end-gaps.
.pp
If you have the DOS or Macintosh version of the FASTA package, to
install the programs you should:
.np
Make a new directory (folder) for the FASTA programs.  This need not
be the same as the directory for your sequence databases.
.np
Copy the files from the FASTA source disk to the new directory.
.np
(DOS only) Edit your AUTOEXEC.BAT file to (a) modify your PATH command
to include the FASTA directory and (b) add the line:
.(l
\fCset FASTLIBS=c:\\yourfastadirectory\\fastgbs\fP
.)l
On the Macintosh, you may need to edit the "environment" file and
change the line that reads:
.(l
\fCFASTLIBS=fastgbs\fP
.)l
to indicate the full directory path for the \fCfastgbs\fP file, for
example:
.(l
\fCFASTLIBS=Q105:FASTA:fastgbs\fP
.)l
.np
Finally, you will need to edit the \fCfastgbs\fP file.  This is usually
the most confusing part of the installation.  An example of this file
is shown below; to customize this file for your machine, you will need
to change the file names from those provided in the \fCfastgbs\fP file to
ones that reflect the directory names and file names you use on your
machine. This is explained in more detail below.  In addition, some
entries in the \fCfastgbs\fP file refer to other files of file names.
These files of file names (as opposed to actual database files) may
also need to be edited.
.sh 2 "Installing the libraries"
.sh 3 "The NBRF protein sequence library"
.pp
The FASTA program package does not include any protein or DNA
sequence libraries.  You can obtain the PIR protein sequence database from:
.(l
National  Biomedical Research Foundation
Georgetown  University  Medical  Center
3900 Reservoir Rd, N.W.
Washington, D.C. 20007
.)l
In addition, this database is available via anonymous ftp from the
host "ftp.bchs.uh.edu". It is available in two formats, VMS and CODATA
format.  The "VMS" format (library type 5 below) can be searched much
faster, can be easily reformatted for use by the "BLAST" rapid
searching program, and is compatible with the Genetics Computer Group
package of programs.  The CODATA format is used by the EUGENE/MBIR
computing package from Baylor (library type 2).
.sh 3 "The GENBANK DNA sequence library"
.pp
FASTA and TFASTA search the GENBANK DNA sequence library in the
flat-file (not ASN.1) format distributed by the
National Center for Biotechnology Information, and the PIR format used
by EBI/EMBL.  CD-ROMs can be obtained from:
.(l
Genbank
National Center for Biotechnology Information
National Library of Medicine
National Institutes of Health
8600 Rockville Pike
Bethesda, MD  20894
.)l
.pp
The GenBank DNA sequence library is also available via anonymous FTP
from \fCncbi.nlm.nih.gov\fP.
.sh 3 "The EBI/EMBL CD-ROM libraries"
.pp
The European Bioinformatics Institute (EBI) is now distributing the
EMBL CD-ROM that contains both the complete EMBL DNA sequence database
(which should be essentially identical to the GenBank DNA sequence
database) and the SWISS-PROT protein sequence database. SWISS-PROT is
derived from the NBRF Protein sequence database with additions from
the EBI/EMBL DNA sequence database.  This CD-ROM is a "best-buy,"
since it provides both DNA and protein sequence libraries.  It is
available from:
.(l

European Bioinformatics Institute
Hinxton Genome Campus, Hinxton Hall
Hinxton, Cambridge CB10 1RQ,
United Kingdom
Tel: +44 1223 4944
Fax: +44 1223 494468
Email: DATALIB@ebi.ac.uk

.)l
.pp
In addition, the SWISS-PROT protein sequence database is available via
anonymous FTP from \fCncbi.nlm.nih.gov\fP.
.sh 2 "Finding the libraries: FASTLIBS"
.pp
FASTA and TFASTA use the environment variable FASTLIBS to find the
protein and DNA sequence libraries.  The FASTLIBS variable contains
the name of a file that has the actual filenames of the libraries.
The \fCFASTGBS\fP file is an example of a file that
can be referred to by FASTLIBS. To use the \fCFASTGBS\fP file, type:
.(l
\fCsetenv FASTLIBS /usr/lib/fasta/fastgbs\fP (BSD UNIX/csh)
or
\fCexport FASTLIBS=/usr/lib/fasta/fastgbs\fP (SysV UNIX/ksh)
.)l
Then edit the \fCFASTGBS\fP file to indicate where the protein and DNA
sequence libraries can be found.  If you have a hard disk and your
protein sequence library is kept in the file \fC/usr/lib/seq/aabank.lib\fP and
your Genbank DNA sequence library is kept in the directory:
\fC/usr/lib/genbank\fP, then \fCfastgbs\fP might contain:
.ne 8
.(l
.ft C
NBRF Protein$0P/usr/lib/seq/aabank.lib 0
SWISS PROT 10$0S/usr/lib/vmspir/swiss.seq 5
GB Primate$1P@/usr/lib/genbank/gpri.nam
GB Rodent$1R@/usr/lib/genbank/grod.nam
GB Mammal$1M@/usr/lib/genbank/gmammal.nam
^   1    ^^^^       4                   ^     ^
          23                             (5)
.ft R
.)l
The first line of this file says that there is a copy of the NBRF
protein sequence database (which is a protein database) in the file
\fC/usr/lib/seq/aabank.lib\fP; it can be selected by typing "P" on the
command line or when the database menu is presented.
.pp
Note that there are 4 or 5 fields in the lines in \fCfastgbs\fP.  The first
field is the description of the library which will be displayed by
FASTA; it ends with a '$'.  The second field (1 character), is a 0 if
the library is a protein library and 1 if it is a DNA library.  The
third field (1 character) is the character to be typed to select the
library.
.pp
The fourth field is the name of the library file.  In the example
above, the \fC/usr/lib/seq/aabank.lib\fP file contains the entire
protein sequence library.  However the DNA library file names are
preceded by a '@', because these files (\fCgpri.nam, grod.nam,
gmammal.nam\fP) do not contain the sequences; instead they contain the names
of the files which contain the sequences.  This is done because the
GENBANK DNA database is broken down into a large number of smaller
files.  In order to search the entire primate database, you must
search more than a dozen files.
.pp
In addition, an optional fifth field can be used to specify the format
of the library file.  Alternatively, you can specify the library
format in a file of file names (a file preceded by an '@').  This
field must be separated from the file name by a space character ('\ ').
In the example above, the \fCaabank.lib\fP file is
in Pearson/FASTA format, while the \fCswiss.seq\fP file is in PIR/VMS format
(from the EMBL CD-ROM). Currently, FASTA can read the following formats:
.(l I
.ft C
0 Pearson/FASTA (>SEQID - comment/sequence)
1 Uncompressed Genbank (LOCUS/DEFINITION/ORIGIN)
2 NBRF CODATA (ENTRY/SEQUENCE)
3 EMBL/SWISS-PROT (ID/DE/SQ)
4 Intelligenetics (;comment/SEQID/sequence)
5 NBRF/PIR VMS (>P1;SEQID/comment/sequence)
6 GCG (version 8.0) Unix Protein and DNA (compressed)
11 NCBI Blast1.3.2 format  (unix only)
.ft R
.)l
In particular, this version will work with the EMBL and PIR VMS
formats that are distributed on the EMBL CD-ROM. The latter format
(PIR VMS) is much faster to search than EMBL format.  This release
also works with the protein and DNA database formats created for the
BLASTP and BLASTN programs by SETDB and PRESSDB and with the new NCBI
search format.  If a library format is not specified, for example,
because you are just comparing two sequences, Pearson/FASTA (format 0)
is used by default.  To change this default, you may set the LIBTYPE
environment variable to a number.  For example,
.(l I
\fCsetenv LIBTYPE 1\fP
.)l
would cause the program to use the GenBank LOCUS format by default
for libraries (or the second sequence file), but the Pearson/FASTA
format would still be used for the query sequence.
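.lp
The field layout described above can be summarized with a small Python
sketch that splits one FASTLIBS line into its parts.  The names
\fCLibraryEntry\fP and \fCparse_fastlibs_line\fP are hypothetical, and
the \fCdefault_format\fP argument merely stands in for the LIBTYPE
default; this is an illustration of the file format, not the parser
used by the programs.
.(l
.ft C
```python
from typing import NamedTuple


class LibraryEntry(NamedTuple):
    description: str
    is_dna: bool
    letter: str
    path: str
    is_name_list: bool   # True when the file name was preceded by '@'
    lib_format: int      # optional fifth field; the default if it is absent


def parse_fastlibs_line(line, default_format=0):
    """Split one FASTLIBS line into the fields described in the text."""
    description, rest = line.rstrip().split("$", 1)
    is_dna = rest[0] == "1"          # second field: 0 protein, 1 DNA
    letter = rest[1]                 # third field: selection character
    parts = rest[2:].split()
    path = parts[0]
    is_name_list = path.startswith("@")
    if is_name_list:
        path = path[1:]
    lib_format = int(parts[1]) if len(parts) > 1 else default_format
    return LibraryEntry(description, is_dna, letter, path, is_name_list, lib_format)


print(parse_fastlibs_line("SWISS PROT 10$0S/usr/lib/vmspir/swiss.seq 5"))
print(parse_fastlibs_line("GB Primate$1P@/usr/lib/genbank/gpri.nam"))
```
.ft R
.)l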
.pp
You can specify a group of library files by putting a '@' symbol
before a file that contains a list of file names to be searched.  For
example, if @gmam.nam is in the fastgbs file, the file "gmam.nam"
might contain the lines:
.(l
.ft C
</usr/lib/genbank
gbpri.seq 1
gbrod.seq 1
gbmam.seq 1
.ft R
.)l
In this case, the line beginning with a '<' indicates the directory
the files will be found in.  The remaining lines name the actual
sequence files.  So the first sequence file to be searched would be:
.(l
.ft C
/usr/lib/genbank/gbpri.seq
.ft R
.)l
The notation "\fC<PIRNAQ:\fP" might be used under the VAX/VMS operating
system. Under UNIX, the trailing '/' is left off, so the library
directory might be written as "\fC</usr/seqlib\fP".
.pp
With version 1.4 of the FASTA package, the FASTA and TFASTA programs
can search a library composed of different files in different sequence
formats.  For example, you may wish to search the Genbank files (in
GenBank flat file format) and the EMBL DNA sequence database on
CD-ROM.  To do this, you simply list the names and filetypes of the
files to be searched in a file of filenames.  For example, to search
the mammalian portion of Genbank, the unannotated portion of Genbank,
and the unannotated portion of the EMBL library, you could use the
file:
.(l I
.ft C
</usr/lib/DNA
gbpri.seq 1
\&#  (this '#' causes the program to display the size of the library)
gbrod.seq 1
\&...
gbmam.seq 1
\&...
gbuna.seq 1
\&...
unanno.seq 5
\&#
.ft R
.)l
.(l I F
You do not need to include library format numbers if you only use the
Pearson/FASTA version of the PIR protein sequence library.  If no
library type is specified, the program assumes that type 0 is being
used (unless you have set LIBTYPE).
.)l
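.lp
A file of file names can be expanded into full path names along the
following lines.  This Python sketch assumes the Unix convention (a
directory line beginning with '<' and no trailing '/'), skips lines
beginning with '#', and uses the hypothetical function name
\fCexpand_name_file\fP; it is not the code used by FASTA.
.(l
.ft C
```python
def expand_name_file(lines, default_format=0):
    """Return (path, format) pairs from the lines of a file of file names."""
    directory = ""
    entries = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                      # '#' lines only affect the display
        if line.startswith("<"):
            directory = line[1:]          # Unix style: no trailing '/'
            continue
        parts = line.split()
        lib_format = int(parts[1]) if len(parts) > 1 else default_format
        path = directory + "/" + parts[0] if directory else parts[0]
        entries.append((path, lib_format))
    return entries


names = ["</usr/lib/genbank", "gbpri.seq 1", "gbrod.seq 1", "gbmam.seq 1"]
for path, lib_format in expand_name_file(names):
    print(path, lib_format)
```
.ft R
.)l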
.lp
Support for the old compressed GenBank files, which have not been
distributed for more than four years, has been removed from programs
in the FASTA package.
.sp
.pp
Test the setup by running FASTA.  Enter the sequence
file '\fCMUSPLFM.AA\fP' when the program requests it (this file is
included with the programs).  The program should then ask you to
select a protein sequence library.  Alternatively, if you run the
TFASTA program and use the MUSPLFM.AA query sequence, the program
should show you a selection of DNA sequence libraries.
Once the fastgbs file has been set up correctly, you can
set FASTLIBS=fastgbs in your AUTOEXEC.BAT file, and you will not need to
remember where the libraries are kept or how they are named.
.pp
FASTA and TFASTA must open a large number of files when searching and
reporting the results of a GENBANK floppy disk format library search.
You may have problems with the large number of files under DOS on IBM-PC's
(Unix and VMS users will not have these problems).  If you are going
to search the GENBANK floppy disk format DNA sequence library under
DOS, you should add the line:
.(l
.ft C
FILES=16
.ft R
.)l
to your \fCCONFIG.SYS\fP file.  (Typically this is already done for programs
like Windows or WordPerfect.)
.sh 1 "\s+2Using the FASTA Package\s0"
.sh 2 "Overview"
.pp
The FASTA sequence comparison programs all require similar
information: the name of a query sequence file, a library file, and
the \fIktup\fP parameter.  All of the programs can accept arguments
on the command line, or they will prompt for the file names and
\fIktup\fP value.
.lp
To use FASTA, simply type:
.(l
.ft C
\f(CBFASTA\fP
and you will be prompted for:
.in +0.5i
the name of the test sequence file
the name of the library file
and whether you want ktup = 1 or 2. (or 1 to 6 for DNA sequences)
.(l F
.ft R
ktup of 2 is about 5 times faster than ktup = 1.  For a 200 aa
sequence against a 10,000,000 aa library, the program takes about 30
min with ktup = 2, 150 min with ktup = 1, on a 12 MHz 286 IBM-PC.
.ft C
.)l
.ft R
.)l
The program can also be run by typing
.(l
.ft C
FASTA test.aa /lib/bigfile.lib ktup (1 or 2)
.ft R
.)l
.lp
Included with the package are the test files,
\fCMUSPLFM.AA\fP, \fCLCBO.AA\fP, \fCMCHU.AA\fP and \fCBOVPRL.SEQ\fP.
To check to make certain that everything is working, you can try:
.(l
.ft C
fasta musplfm.aa lcbo.aa
and
tfasta musplfm.aa bovprl.seq
.ft R
.)l
To test the local similarity programs LFASTA and PLFASTA, try:
.(l
.ft C
lfasta mchu.aa mchu.aa
and
plfasta mchu.aa mchu.aa
.ft R
.)l
\fCMCHU\fP (calmodulin) has four duplicated calcium binding sites
that are clearly detected by LFASTA.  For a more complicated example,
try \fCMWRTC1.aa\fP, myosin heavy chain.
.sh 2 "Sequence files"
.pp
The FASTA programs know about three kinds of sequence files (four
under VMS): (1) plain sequence files, which can only be used as query
sequences or for LFASTA, PRDF, and ALIGN; (2) standard library files,
which are the same as plain sequence files except that each sequence is
preceded by a comment line with a '>' in the first column; and (3) distributed
sequence libraries (a broad class that includes the NBRF/PIR
VMS and blocked ascii formats, Genbank flat-file format, EMBL
flat-file format, and Intelligenetics format).  All of the files that
you create should be of type (1) or (2).  Type (2) files (ones with a '>'
and comment before the sequence) are preferred, because they can
be used as query or library sequence files by all of the programs.
.pp
I have included several sample test files, *.AA.  The first line may begin
with a '>'  or ';' followed by a comment.  The text after ';' in other lines
will  be  ignored.   Spaces  and  tabs  (and anything else that  is  not  an
amino-acid code) are ignored.
.pp
Library files should have the form:
.(l
.ft C
>Sequence name and identifier
A F A S Y T .... actual sequence.
F S S       .... second line of sequence.
>Next sequence name and identifier
.ft R
.)l
This is often referred to as "FASTA" or "Pearson" format.  You can
build your own library by concatenating several sequence files.  Just
be sure that each sequence is preceded by a line beginning with a '>'
with a sequence name.
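.lp
To make the library layout concrete, here is a minimal Python sketch
that reads lines in this format and returns each comment line with its
sequence.  The function name \fCread_fasta\fP and the sample entries
are hypothetical, and the sketch simplifies the rule about ignored
characters by keeping letters only.
.(l
.ft C
```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from lines in Pearson/FASTA format."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Keep letters only, so spaces and tabs inside a line are dropped.
            chunks.append("".join(c for c in line if c.isalpha()))
    if header is not None:
        yield header, "".join(chunks)


library = [
    ">SEQ1 first example entry",
    "A F A S Y T",
    "F S S",
    ">SEQ2 second example entry",
    "MKVLA",
]
for name, seq in read_fasta(library):
    print(name, seq)
```
.ft R
.)l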
.pp
The test file should not have lines longer than 120 characters, and
sequences entered with word processors should use a document
mode, with normal carriage returns at the end of lines.
.uh "\s+2Program Summary\s0"
.sh 2 "Sequence search programs"
.nr ii 1i
.ip "FASTA"
universal sequence comparison. Defaults to comparing protein sequences;
if the sequences are > 85% A+C+G+T or the \fC-n\fP option is used, a
DNA sequence is assumed.
.ip "FASTX"
Search a protein sequence library using amino acid sequence comparison
to the forward three frames of a translated DNA query sequence. (The
reverse frames are specified with the \fC-i\fP option.) Alignment
scores allow frameshifts; the final alignment uses a Smith-Waterman
type alignment routine (no limit on gaps) that allows frameshifts.
.ip "TFASTA"
Search DNA library for a protein sequence by translating the DNA
sequence to protein in all six frames (three forward frames with the
-3 command line option). TFASTA with ktup=2 is about as fast as a DNA
FASTA with ktup=4, and is substantially more sensitive.
(also reads the GENBANK library)
.ip "TFASTX"
Search DNA library for a protein sequence by translating the DNA
sequence to protein in all six frames (three forward frames with the
-3 command line option) calculating similarity scores that allow
frameshifts. TFASTX produces an optimal Smith-Waterman alignment of
the query and translated-library sequence.
.ip "\f2SSEARCH\fP"
Universal sequence comparison using the Smith-Waterman algorithm
(T. F. Smith and M. S. Waterman (1981) J. Mol. Biol. 147:195-197).
This program uses code developed by Huang and Miller (X. Huang, R. C.
Hardison, W. Miller (1990) CABIOS 6:373-381) for calculating the local
similarity score and code from the ALIGN program (see below) for
calculating the local alignment.  \f2SSEARCH\fP is about 50-times
slower than FASTA with ktup=2 (for proteins).
.ip "ALIGN"
optimal global alignment of two sequences with no short-cuts.
This program is a slightly modified version of one taken from E.
Myers and W. Miller. The algorithm is described in E. Myers and W.
Miller, "Optimal Alignments in Linear Space" (CABIOS (1988) 4:11-17).
.sh 2 "Local similarity programs"
.ip "LFASTA"
local similarity searches showing local alignments.  The algorithm
used to calculate the local alignment in a band has been improved
(Chao, Pearson, and Miller, submitted).
.ip "PLFASTA"
local similarity searches with plot output (on the IBM,
this program requires that the environment variable BGIDIR be set).
.ip "PCLFASTA"
(unix only) local similarity searches with plot output using pic commands.
.ip "\f2LALIGN\fP"
Calculates the N-best local alignments using a rigorous algorithm.
(N=10 by default.) The algorithm, developed by Huang and Miller (X.
Huang and W.  Miller (1991) Adv. Appl. Math. 12:337-357), is a
linear-space version of an algorithm described by M. S. Waterman and
M. Eggert (J.  Mol. Biol. 197:723-728).  Like \f2SSEARCH\fP,
\f2LALIGN\fP is rigorous, but also very slow.
.ip "\f2PLALIGN\fP"
A version of \f2LALIGN\fP that plots its output to a PostScript file.

.sh 2 "Statistical Significance"
.pp
With version 2.0 of the FASTA program distribution, FASTA, TFASTA, and
SSEARCH now provide estimates of statistical significance for library
searches.  Work by Altschul, Arratia, Karlin, Mott, Waterman, and
others (see Altschul et al. (1994) Nature Genetics 6:119 for an
excellent review) suggests that local sequence similarity scores
follow the extreme value distribution, so that P(s > x) = 1 -
exp(-exp(-lambda(x-u))), where u = ln(Kmn)/lambda and m,n are the
lengths of the query and library sequence. This formula can be
rewritten as: 1 - exp(-Kmn exp(-lambda x)), which shows that the
average score for an unrelated library sequence increases with the
logarithm of the length of the library sequence.  FASTA and SSEARCH
use simple linear regression against the log of the library
sequence length to calculate a normalized "z-score" with mean 50,
regardless of library sequence length, and variance 10.  These
z-scores can then be used with the extreme value distribution and the
Poisson distribution (to account for the fact that each library
sequence comparison is an independent test) to calculate the number of
library sequences expected to obtain a score greater than or equal to
the score obtained in the search. The original idea and routines to do
the linear regression on library sequence length were provided by Phil
Green, U. Washington.  This version of FASTA and SSEARCH uses a slightly
different strategy for fitting the data than those originally provided
by Dr. Green.
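.lp
The two forms of the extreme value formula given above can be checked
numerically.  The following Python sketch is illustrative only and is
not taken from the FASTA source code: the function names and the values
chosen for lambda and K are hypothetical, and the expected count
assumes that the expectation is simply the per-sequence probability
multiplied by the number of library sequences.
.(l
.ft C
```python
import math

def evd_pvalue(x, lam, K, m, n):
    """P(s > x) written with the location parameter u = ln(Kmn)/lambda."""
    u = math.log(K * m * n) / lam
    return 1.0 - math.exp(-math.exp(-lam * (x - u)))

def evd_pvalue_kmn(x, lam, K, m, n):
    """The same probability in the form 1 - exp(-Kmn exp(-lambda x))."""
    return 1.0 - math.exp(-K * m * n * math.exp(-lam * x))

def expected_count(p, library_size):
    """Assumed expectation: per-sequence probability times library size."""
    return p * library_size

# Hypothetical parameters, chosen only for illustration.
lam, K = 0.25, 0.1
m, n = 200, 300          # query and library sequence lengths
p1 = evd_pvalue(50.0, lam, K, m, n)
p2 = evd_pvalue_kmn(50.0, lam, K, m, n)
print(p1, p2, expected_count(p1, 10000))
```
.ft R
.)l
.lp
Both functions return the same probability, and for a fixed score the
probability rises as the library sequence gets longer, which is the
length dependence that the regression is meant to remove.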
.pp
The expected number of sequences is plotted in the histogram using an
"*". Since the parameters for the extreme value distribution are not
calculated directly from the distribution of similarity scores, the
pattern of "*'s" in the histogram gives a qualitative view of how well
the statistical theory fits the similarity scores calculated by FASTA
and SSEARCH.  For FASTA, if optimized scores are calculated for each
sequence in the database (the default), the agreement between the
actual distribution of "z-scores" and the expected distribution based
on the length dependence of the score and the extreme value
distribution is usually very good.  Likewise, the expected
distribution typically agrees closely with the actual distribution of
SSEARCH Smith-Waterman "z-scores."  The agreement with unoptimized scores,
\fIktup=2\fP, is often not very good, with too many high scoring
sequences and too few low scoring sequences compared with the
predicted relationship between sequence length and similarity score.
In those cases, the expectation values may be overestimates.
.pp
The statistical routines assume that the library contains a large
sample of unrelated sequences.  If this is not the case, then the
expectation values are meaningless.  Likewise, if there are fewer than
20 sequences in the library, the statistical calculations are not
done.
.pp
For protein searches, library sequences with E() values < 0.01 for
searches of a 10,000-entry protein database are almost always
homologous. Frequently, sequences with E() values from 1 to 10 are
related as well. Remember, however, that these E() values also reflect
differences between the amino acid composition of the query sequence
and that of the "average" library sequence.  Thus, when searches are
done with query sequences with "biased" amino-acid composition,
unrelated sequences may have "significant" scores because of sequence
bias.  The programs below, PRDF and PRSS, can address this problem by
calculating similarity scores for random sequences with the same
length and amino acid composition.
.pp
If optimization is not used ("-o"), E-values for DNA sequences
overestimate the significance of the scores that are obtained and
unrelated sequences frequently have E()-values < 0.0005. With
optimization, the accuracy of the E()-values compares favorably with
that of protein sequence comparison.  This is in part due to the use of more
stringent gap penalties for DNA sequence comparison, -16, -4 rather
than -12, -2.  With the latter penalties, many unrelated sequences
appear to have significant similarity. Nevertheless, since protein
sequence comparison is much more sensitive, DNA sequence comparison
should not be used to identify sequences that encode protein.  Even
with ktup=6, optimization rarely increases run-times more than 50%
with mRNA-size query sequences.  Optimization should be used whenever
possible.
.pp
Similar comments apply to TFASTA, where higher gap penalties (-16,-4) are
required for accurate statistical estimates.  Because TFASTA produces
so many artificial "coding" sequences with atypical amino acid
compositions, the statistical estimates with TFASTA are often
overestimates.  With optimized scores, ktup=1, and gap penalties of -16,
-4, unrelated sequences will sometimes have E() values of 0.1.  If
initn scores are used, unrelated sequences may have E() values <
0.01.
.ip "PRDF"
improved version of RDF program that includes accurate probability
estimates for all three scoring methods (includes local or window
shuffle routine)
.ip "PRSS"
A version of PRDF that uses the rigorous Smith-Waterman calculation
used by SSEARCH.
.ip "RANDSEQ"
produces a randomly shuffled sequence from a query sequence.
.ip "RELATE"
significance program described by Dayhoff (Atlas of Protein Sequence
and Structure, Vol. 5, Supplement 3).  Each chunk of 25 residues in
one sequence is compared to every 25 residue fragment of the second
sequence. Sequences which are genuinely related will have a large
number of scores greater than 3 standard deviations above the mean
score of all of the comparisons.
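.lp
The shuffling that the randomization programs above rely on can be
sketched in a few lines of Python.  This is not the RANDSEQ or PRDF
source code; the function names are hypothetical, and the window
shuffle shown here (shuffling within consecutive fixed-size windows) is
only one assumed reading of the "local or window shuffle" mentioned for
PRDF.
.(l
.ft C
```python
import random

def uniform_shuffle(seq, rng):
    """Shuffle all residues; length and composition are preserved."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)

def window_shuffle(seq, window, rng):
    """Shuffle residues within consecutive windows of the given size."""
    return "".join(uniform_shuffle(seq[i:i + window], rng)
                   for i in range(0, len(seq), window))

rng = random.Random(0)   # fixed seed so the example is repeatable
query = "MWRTCGPPYTMWKSCGYPYT"
print(uniform_shuffle(query, rng))
print(window_shuffle(query, 10, rng))
```
.ft R
.)l
.lp
Either shuffle keeps the length and overall composition of the query;
the window version also keeps the composition of each window, so local
composition bias survives the randomization.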
.sh 2 "Other analysis programs"
.ip "AACOMP"
calculate the amino acid composition and molecular weight
of a sequence.
.ip "BESTSCOR"
calculate the best self-comparison score.
.ip "GREASE"
Kyte-Doolittle hydropathicity profile
.ip "TGREASE"
graphic plot of Kyte-Doolittle profile
.ip "FROMGB"
convert from GenBank LOCUS format (also used by the IBI-Pustell programs)
to Pearson/FASTA format.
.ip "GARNIER"
A secondary structure prediction program using the method of Garnier,
Osguthorpe, and Robson, J. Mol. Biol., (1978) 120:97-120.
.sh 2 "Options"
.pp
These programs have a number of output options, which are invoked
by the environment variables \fBLINLEN\fP, \fBSHOWALL\fP, and
\fBMARKX\fP.  Alternatively, these values can be controlled by
command line options.  The number of sequence residues per output
line is now adjustable by setting the environment variable
\fBLINLEN\fP, or the command line option \fB-w\fP.  \fBLINLEN\fP is
normally 60; to change it, set \fBLINLEN=80\fP before running the
program or add \fB-w 80\fP to the command line.  \fBLINLEN\fP can
be set up to 200.  \fBSHOWALL\fP (\fB-a\fP) determines whether all,
or just a portion, of the aligned sequences are displayed.
Previously, FASTP would show the entire length of both sequences in
an alignment while FASTN would only show the portions of the two
sequences that overlapped. Now the default is to show only the
overlap between the two sequences; to show complete sequences, set
\fBSHOWALL=1\fP, or use the \fB-a\fP option on the command line.
.pp
The differences between the two aligned sequences can be highlighted
in several different ways by changing the environment variable
\fBMARKX\fP or the \fB-m\fP option.  Normally (MARKX=0) the program
uses ':' to denote identities and '.' to denote conservative
replacements.  If MARKX=1, the program will not mark identities;
instead conservative replacements are denoted by an 'x' and
non-conservative substitutions by an 'X'.  If MARKX=2, the residues in
the second sequence are only shown if they are different from the
first. MARKX=3 displays the aligned library sequences without the
query sequence; these can be used to build a primitive multiple
alignment.  MARKX=4 provides a graphical display of the boundaries of
the alignments. Thus the five options are:
.(l
.ft C

 MARKX=0      MARKX=1       MARKX=2       MARKX=3      MARKX=4

MWRTCGPPYT   MWRTCGPPYT    MWRTCGPPYT                 MWRTCGPPYT
::..:: :::     xx  X       ..KS..Y...    MWKSCGYPYT   ----------
MWKSCGYPYT   MWKSCGYPYT
.ft R
.)l
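.lp
The MARKX=2 display in the table above follows a simple rule that can
be sketched in Python.  The function name is hypothetical and the
sketch is not taken from the FASTA source code; it assumes that both
sequences are already aligned and of equal length, and it does not
treat gap characters specially.
.(l
.ft C
```python
def markx2_line(query, library):
    """Return the library line as displayed with MARKX=2.

    Both arguments are already-aligned strings of equal length.
    """
    if len(query) != len(library):
        raise ValueError("aligned sequences must have the same length")
    # Identical residues become '.', differing residues are shown as is.
    return "".join("." if q == s else s for q, s in zip(query, library))

print(markx2_line("MWRTCGPPYT", "MWKSCGYPYT"))   # prints ..KS..Y...
```
.ft R
.)l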
.lp
(fasta20u4, Feb. 1996) In addition, MARKX=10 is a new, parseable format for use
with other programs.  See the file "readme.v20u4" for a more complete
description.
.sh 2 "Command line options"
.pp
It is now possible to specify several options on the command
line, instead of using environment variables.  The command line options
are preceded by a dash; the following options are available:
.ip "-a"
Same as SHOWALL=1.
.ip "-A"
Force Smith-Waterman alignments for DNA sequences and TFASTA.  By
default, only FASTA protein sequence comparisons use Smith-Waterman
alignments.
.ip "-b #"
Number of sequence scores to be shown on output.  In the absence of
this option, fasta (and tfasta and ssearch) display all library
sequences obtaining similarity scores with expectations less than
10.0 if optimized scores are used, or 2.0 if they are not. The -b
option can limit the display further, but it will not cause additional
sequences to be displayed.
.ip "-c #"
Threshold score for optimization (OPTCUT).  Set "-c 1" to
optimize every sequence in a database.  (This slows the program down
about 5-fold).
.ip "-E #"
Limit the number of scores and alignments shown based on the
expected number of scores.  Used to override the expectation value of 10.0
used by default.  When used with -Q, -E 2.0 will show all library sequences
with scores with an expectation value <= 2.0.
.ip "-d #"
Number of alignments to be reported by default. (Used in conjunction
with -Q).  No longer necessary, see "-b" above.
.ip "-f"
Penalty for the first residue in a gap (-12 by default for proteins,
-16 for DNA or for TFASTA).
.ip "-g"
Penalty for additional residues in a gap (-2 by default for proteins,
-4 for DNA and TFASTA).
.ip "-h"
Penalty for frameshift (FASTX, TFASTX only).
.ip "-H"
Omit histogram.
.ip "-i"
Invert (reverse complement) the query sequence if it is DNA.  For
TFASTX,
search the reverse complement of the library sequence only.
.ip "-k #"
Threshold for joining init1 segments to build an initn score (GAPCUT).
.ip "-l file"
Location of library menu file (FASTLIBS).
.ip "-L"
Display more information about the library sequence in the alignment.
.ip "-m #"
MARKX = # (0, 1, 2, 3, 4, 10)
.ip "-n"
Force the query sequence to be treated as a DNA sequence.  This is
particularly useful for query sequences that contain a large number of
ambiguous residues, e.g. transcription factor binding sites.
.ip "-O"
Send copy of results to "filename."  Helpful for environments without STDOUT.
.ip "-o"
Turn off default optimization of all scores greater than OPTCUT. Sort
results by "initn" scores.
.ip "-Q,-q"
Quiet - does not prompt for any input.  Writes scores and alignments
to the terminal or standard output file.
.ip "-r file"
Save a results summary line for every sequence in the sequence
library.  The summary line includes the sequence identifier,
superfamily number (if available), position
in the library, and the similarity scores calculated.  This option can
be used to evaluate the sensitivity and selectivity of different
search strategies (see W. R. Pearson (1991) Genomics 11:635-650.)
.ip "-s file"
SMATRIX is read from file.  Several SMATRIX files are provided with
the standard distribution.  For protein sequences: \fCcodaa.mat\fP -
based on minimum mutation matrix; \fCidnaa.mat\fP - identity matrix;
\fCpam250.mat\fP - the PAM250 matrix developed by Dayhoff et al (Atlas
of Protein Sequence and Structure, vol. 5, suppl. 3, 1978);
\fCpam120.mat\fP - a PAM120 matrix.  The default scoring matrix is
BLOSUM50; PAM250 is available with "-s 250", and BLOSUM62 ("-s BL62") is
also available.
.ip "-v \"#1 #2 #3\""
(LINEVAL) values used for line styles in plfasta
.ip "-w #"
Line length (width) = number (<200)
.ip "-x \"off1 off2\""
Specifies offsets for the beginning of the query and library sequence.
For example, if you are comparing upstream regions for two genes, and
the first sequence contains 500 nt of upstream sequence while the
second contains 300 nt of upstream sequence, you might try:
.(l I
\fCfasta -x "-500 -300" seq1.nt seq2.nt\fP
.)l
If the -x option is not used, FASTA assumes numbering starts with 1.
This option will not work properly with the translated library
sequence with tfasta.  (You should double check to be certain the
negative numbering works properly.)
.ip "-y"
Set the width of the band used for calculating "optimized" scores.
For proteins and ktup=2, the width is 16.  For proteins with ktup=1,
the width is 32 by default.  For DNA the width is 16.
.ip "-z"
Turn off statistical calculations.
.ip "-1"
sort output by init1 score (as FASTP used to do).
.ip "-3"
(TFASTA, TFASTX only) translate only three forward frames
.sp
.lp
For example:
.(l
\fCfasta -w 80 -a seq1.aa seq2.aa\fP
.)l
would compare the sequence in seq1.aa to that in seq2.aa and display the
results with 80 residues on an output line, showing all of the residues
in both sequences.  Be sure to enter the options before entering the file
names, or just enter the options on the command line, and the program will
prompt for the file names.
.pp
Not all of these options are appropriate for all of the programs.  The
options above are used by FASTA and TFASTA. RELATE uses the -s option,
ALIGN uses the -w, -m, and -s options, and the PRDF program uses -c,
-f, -k, and -s.
.sp
.pp
(November, 1997) In addition, it is now possible to provide the fasta
programs with the query sequence (fasta, fasty, ssearch, tfastx), or
two sequences (prss, lalign, plalign) from the unix "stdin" stream.  This
makes it much easier to set up FASTA or PRSS WWW pages.  To specify
that stdin be used, rather than a file, the file name should be
specified as '-' or '@' (the latter file name makes it possible to
specify a subset of the sequence).
Thus:
.(l
cat query.aa | fasta -q @:25-75 s
.)l
would take residues 25-75 from query.aa and search the 's' library
(see the discussion of FASTLIBS).  If DNA sequences are to be read
from stdin, the '-n' option must be used, as fasta cannot check for
DNA queries when stdin is used.

.sh 1 "Environment variable summary"
.pp
Environment variables allow you to set search parameters that will be
used frequently when you run a program; for example, if you prefer to
use the PAM250 scoring matrix, you might "set SMATRIX=250."  Command
line parameters, if used, always override environment variable
settings. The following environment variables are used by this
program:
.ip "AABANK"
the file name of the default sequence library.
.ip "FASTLIBS"
the location of the file which contains the list
of library files to be searched.
.ip "GAPCUT"
threshold used for joining init1 regions in the second step of FASTA.
Normally set based on sequence length and \fIktup\fP.
.ip "LIBTYPE"
used to specify the format of the library sequence for FASTA and TFASTA.
.ip "LINLEN"
output line length - can go up to 200
.ip "LINEVAL"
used by plfasta to determine the relationship between line style and
similarity score (-v).  This should be a string of three numbers, e.g.
"200 100 50"
.ip "MARKX"
symbol for denoting matches, mismatches. Note that this symbol is only
used across the optimized local region; sequences that are outside
this region are not marked.
.ip "OPTCUT"
Set the threshold to be used for optimization in a band around the
best initial region.  Normally the OPTCUT value is calculated from the
length of the sequence and the \fIktup\fP value (for a 200 residue
sequence, it is about 28).  If OPTCUT=1, every sequence in the
database will be optimized.  This is the most sensitive option.
.ip "PAMFACT"
This version of fasta uses a more sensitive method for identifying
initial regions. Instead of using a constant factor (fact) for each
match in a ktup, it uses the scoring matrix (PAM) scores.  While this
works well for protein sequences, it has not been as carefully tested
for DNA sequences, so by default, this modification is used for
proteins but not for DNA. Setting the PAMFACT environment variable to
1 forces the option on; PAMFACT=0 turns it off.
.ip "SHOWALL"
on output, show the complete sequence instead of just the
overlap of the two aligned sequences.
.ip "SMATRIX"
alternative scoring matrix file.
.sp
.lp
As always, please inform me of bugs as soon as possible.
.sp
.nf
William R. Pearson
Department of Biochemistry
Box 440, Jordan Hall
U. of Virginia
Charlottesville, VA

wrp@virginia.EDU