1
2                        COPYRIGHT NOTICE
3
4Copyright 1988, 1991, 1992, 1994, 1995, 1996 by William R.
5Pearson and the University of Virginia.  All rights reserved. The
6FASTA program and documentation may not be sold or incorporated
7into a commercial product, in whole or in part, without written
8consent of William R. Pearson and the University of Virginia.
9For further information regarding permission for use or
10reproduction, please contact: David Hudson, Assistant Provost for
11Research, University of Virginia, P.O. Box 9025, Charlottesville,
12VA 22906-9025, (804) 924-6853
13
14
15The FASTA program package
16
17Introduction
18
19     This documentation describes the version 2.0x of the FASTA
20program package (see W. R. Pearson and D. J. Lipman (1988),
21"Improved Tools for Biological Sequence Analysis", PNAS 85:2444-
222448, and W. R.  Pearson (1990) "Rapid and Sensitive Sequence
23Comparison with FASTP and FASTA" Methods in Enzymology 183:63-
2498). Version 2.0 modifies version 1.8 to include explicit
25statistical estimates for similarity scores based on the extreme
26value distribution.  In addition, FASTA protein alignments now
27use the Smith-Waterman algorithm with no limitation on gap size.
28FASTA and SSEARCH now use the BLOSUM50 matrix by default, with
29options to change gap penalties on the command line. Version 1.7
30replaces rdf2 and rss with prdf and prss, which use the extreme-
31value distribution to calculate accurate probability estimates.
32
33
34Although there are a large number of programs in this package,
35they belong to four groups:
36
37
38    Library search programs: FASTA, FASTX, TFASTA, TFASTX, SSEARCH
39
40    Local homology programs: LFASTA, PLFASTA, LALIGN, PLALIGN, FLALIGN
41
42    Statistical significance: PRDF, RELATE, PRSS, RANDSEQ
43
44    Global alignment: ALIGN
45
46
47
48In addition, I have included several programs for protein
49sequence analysis, including a Kyte-Doolittle hydropathicity
50plotting program (GREASE, TGREASE), and a secondary structure
51prediction package (GARNIER).
52
53     The FASTA sequence comparison programs on this disk are
54improved versions of the FASTP program, originally described in
55Science (Lipman and Pearson, (1985) Science 227:1435-1441).  We
56have made several improvements.  First, the library search
57programs use a more sensitive method for the initial comparison
58of two sequences which allows the scores of several similar
59regions to be combined.  As a result, the results of a library
60search are now given with three scores, initn (the new initial
61score which may include several similar regions), init1 (the old
62fastp initial score from the best initial region), and opt (the
63old fastp optimized score allowing gaps in a 32 residue wide
64band).
65
66     These programs have also been modified to become "universal"
67(hence FAST-A, for FASTA-All, as opposed to FAST-P (protein) or
68FAST-N (nucleotides)); by changing the environment variable
69SMATRIX, the programs can be used to search protein sequences,
70DNA sequences, or whatever you like.  By default, FASTA, LFASTA,
71and the PRDF programs automatically recognize protein and DNA
72sequences.  Sequences are first read as amino acids, and then
73converted to nucleotides if the sequence is greater than 85%
74A,C,G,T (the '-n' option can be used to indicate DNA sequences).
75TFASTA compares protein sequences to a translated DNA sequence.
76Alternative scoring matrices can also be used.  In addition to
77the BLOSUM50 matrix for proteins, the PAM250 matrix or matrices
78based on simple identities or the genetic code can also be used
79for sequence comparisons or evaluation of significance.  Several
80different protein sequence matrices have been included;
81instructions for constructing your own scoring matrix are
82included in the file FORMAT.DOC.
83
84
85The remainder of this document is divided into three sections:
86(1) a brief history of the changes to the FASTA package; (2) A
87guide to installing the programs and databases; (3) A guide to
88using the FASTA programs. The programs are very easy to use, so
89if you are using them on a machine that is administered by
90someone else, you may want to skip to section (3) to learn how to
91use the programs, and then read section (1) to look at some of
92the more recent changes.  If you are installing the programs on
93your own machine, you will need to read section (2) carefully.
94
95
961.  Revision History
97
981.1.  Changes with version 2.0u
99
100     Version 2.0u provides several major improvements over
101previous versions of FASTA (and SSEARCH).  The most important is
102the incorporation of explicit statistical estimates and
103appropriate normalization of similarity scores. This improvement
104is discussed in more detail below in the section entitled
105Statistical Significance.  In addition, all of the protein
106comparison programs now use the BLOSUM50 matrix, with gap
107penalties of -12, -2, by default.  BLOSUM50 performs
108significantly better than the older PAM250 matrix.  PAM250 can
109still be used with the command line option: -s 250.  (DNA
110sequence comparisons use a more stringent gap penalty of -16, -4,
111which produces excellent statistical estimates when optimized
112scores are used. TFASTA uses -16, -4 as well.)
113
114     The quality of the fit of the extreme value distribution to
115the actual distribution of similarity scores is summarized with
116the Kolmogorov-Smirnov statistic.  The acceptance limits for this
117statistic can be found in many statistics books.  In general,
118values <0.10 (N=30) indicate excellent agreement between the
119actual and theoretical distributions.  If this statistic is >
1200.2, consider using a higher (more stringent) gap penalty, e.g.
121-16, -4 rather than -12, -2.  The default scoring matrix for DNA
122has been changed to score +5 for an identity and -4 for a
123mismatch.  These are the same scores used by BLASTN.
124
125     With explicit expectation calculations, the program now
126shows all scores and alignments with expectations less than 10.0
127(with optimized scores, 2.0 without optimization) when the "-Q"
128(quiet) mode is used.  The expectation threshold can be changed
129with the "-E" option.
130
131     Finally, the algorithm used to produce the final alignments
132of protein sequences is now a full Smith-Waterman, with unlimited
133gaps.  (The older band-limited alignments are used for DNA
134sequences and TFASTA by default, because Smith-Waterman
135alignments are very slow for long sequences.)  Both the optimized
136and Smith-Waterman scores are reported; if the Smith-Waterman
137score is higher, then additional gaps allowed a better alignment
138and similarity score to be calculated.
139
140     FASTA searches now optimize similarity scores by default
141(this slows searches about 2-fold (worst case) for ktup=2). Thus,
142the meaning of the "-o" option has been reversed; "-o" now turns
143off optimization and reports results sorted by "initn" scores.
144Optimization significantly improves the sensitivity of FASTA, so
145that it almost matches Smith-Waterman.  With version 2.0, the
146default band width used for optimized calculations can be varied
147with the "-y" option.  For proteins with ktup=2, a width of 16
148(-y 16) is used; 16 is also used for DNA sequences.  For proteins
149and ktup=1, a width of 32 is used. Searches that disable
150optimization with the "-o" option will work fine for sequences
151that share 25% or more identity in general, but to detect
152evolutionary relationships with 20% - 25% identity, the more
153sensitive default optimization is often required.  Optimization
154is required for accurate statistical estimates with either
155protein or DNA sequences.
156
157     The FASTA package now includes FASTX, a program that
158compares a DNA sequence to a protein sequence database by
159translating the DNA sequence in three frames (the reverse frames
160are selected with the -i option) and aligning the three-frame
161translation with the sequences in the protein database.
162Alignment scores allow frameshifts so that a cDNA or EST sequence
163with insertion/deletion errors can be aligned with its homologues
164from beginning to end.
165
166     With release 20u6, there is also a TFASTX program, which is
167a replacement for TFASTA.  TFASTA treats each of the six reading
168frames of a DNA library sequence as a different sequence; TFASTX
169compares a protein sequence against only two sequences from each
170DNA sequence - the forward and reverse orientation.  For a given
171orientation, TFASTX calculates a similarity score for alignments
172that allow frameshifts, thus considering all possible reading
173frames.
174
175     Another new program is included - randseq - which will
176produce a randomly shuffled (uniform or local shuffle) from an
177input sequence.  This randomly shuffled sequence can be used to
178evaluate the statistical estimates produced by FASTA, SSEARCH, or
179BLAST.
180
1811.2.  Changes with version 1.7
182Version 1.7 has been released to provide the PRDF and PRSS
183programs for shuffling sequences and estimating accurately the
184probabilities of the unshuffled-sequence scores.
185
186PRDF      a version of RDF2 that uses calculates the probability
187          of a similarity score more accurately by using a fit to
188          an extreme value distribution.  Code to fit the extreme
189          value distribution parameters and the impetus to update
190          RDF2 was provided by Phil Green, U. of Washington.
191
192PRSS      a version of PRDF that uses a rigorous Smith-Waterman
193          calculation to score similarities
194
1951.3.  Changes with version 1.6
196
197     FASTA version 1.6 uses a new method for calculating optimal
198scores in a band (the optimization or last step in the FASTA
199algorithm). In addition, it uses a linear-space method for
200calculating the actual alignments.  FASTA v1.6 package includes
201several new programs:
202
203SSEARCH   a program to search a sequence database using the
204          rigorous Smith-Waterman algorithm (this program is
205          about 100-fold slower than FASTA with ktup=2 (for
206          proteins).
207
208LALIGN    A rigorous local sequence alignment program that will
209          display the N-best local alignments (N=10 by default).
210
211PLALIGN   a version of lalign that plots the local alignments to
212          a postscript file.
213
214FLALIGN   a version of lalign that plots the local alignments to
215          a GCG Figure file.
216
217     The LALIGN/PLALIGN/FLALIGN programs incorporate the "sim"
218algorithm described by Huang and Miller (1991) Adv. Appl. Math.
21912:337-357.  The SSEARCH and PRSS programs incorporate algorithms
220described by Huang, Hardison, and Miller (1990) CABIOS 6:373-381.
221
222     LFASTA and PLFASTA now calculate a different number of local
223similarities; they now behave more like LALIGN/PLALIGN.  Since
224local alignments of identical sequences produce "mirror-image"
225alignments, lalign and lfasta consider only one-half of the
226potential alignments between sequences from identical file names.
227Thus
228
229    lfasta mchu.aa mchu.aa
230
231Displays only two alignments, with earlier versions of the
232program, it would have displayed five, including the identity
233alignment.  PLFASTA does display five alignments; when two
234identical filenames are given, it draws the identity alignment,
235calculates the two unique local alignments, draws them, and draws
236their mirror images. LFASTA/PLFASTA and LALIGN/PLALIGN use the
237filenames, rather than the actual sequences, to determine whether
238sequences are identical; you can "trick" the programs into
239behaving the old way by putting the same sequence in two
240different files.
241
2421.4.  Changes with version 1.5
243
244     FASTA version 1.5 includes a number of substantial revisions
245to improve the performance and sensitivity of the program.  It is
246now possible to tell the program to optimize all of the initn
247scores greater than a threshold.  The threshold is set at the
248same value as the old FASTA cutoff score.  Alternatively, you can
249tell FASTA to sort the results by the init1, rather than the
250initn, score by using the -1 option.  FASTA -1 ... will report
251the results the way the older FASTP program did.
252
253     A new method has been provided for selecting libraries. In
254the past, one could enter the name of a sequence file to be
255searched or a single letter that would specify a library from the
256list included in the $FASTLIBS file. Now, you can specify a set
257of library files with a string of letters preceded by a '%'.
258Thus, if the FASTLIBS file has the lines:
259
260    Genbank 70 primates$1P/seqlib/gbpri.seq 1
261    Genbank 70 rodents$1R/seqlib/gbrod.seq 1
262    Genbank 70 other mammals$1M/seqlib/gbmam.seq 1
263    Genbank 70 vertebrates $1B/seqlib/gbvrt.seq 1
264
265Then the string: "%PRMB" would tell FASTA to search the four
266libraries listed above.  The %PRMB string can be entered either
267on the command line or when the program asks for a filename or
268library letter.
269
270     FASTA1.5 also provides additional flexibility for specifying
271the number of results and alignments to be displayed with the -Q
272(quiet) option.  The -b number option allows you to specify the
273number of sequence scores to show when the search is finished.
274Thus
275
276
277    FASTA -b 100 ...
278
279
280tells the program to display the top 100 sequence scores. In the
281past, if you displayed 100 scores (in -Q mode), you would also
282have store 100 alignments. The -d option allows you to limit the
283number of alignments shown.  FASTA -b 100 -d 20 would show 100
284scores and 20 alignments.
285
286     Finally, FASTA can provide a complete list of all of the
287sequences and scores calculated to a file with the -r (results)
288option.  FASTA -r results.out ... creates a file with a list of
289scores for every sequence in the library.  The list is not
290sorted, and only includes those scores calculated during the
291initial scan of the library.
292
2932.  Installing the FASTA package
294
2952.1.  Installing the programs
296
2972.1.1.  Unix version
298
299     The FASTA distribution comes with several makefile's that
300can be used to compile the FASTA programs.  Over the years, as
301ATT Unix System 5 and BSD unix have converged, these files have
302become very similar. To begin with, I recommend using the
303standard Makefile.  There are two values in the makefile that
304should be checked against the values used on your system: the HZ
305value, which is the frequency in ticks per second used by the
306times() system call, this value can usually be found by running:
307
308    grep HZ /usr/include/sys/*
309
310and the functions available to return random numbers.  If you
311have a rand48() function that returns a 32-bit random number, use
312it and use the lines:
313
314    NRAND=nrand48
315    RANFLG= -DRAND32
316
317If not, you will need to use the rand() function call and
318determine whether it returns a 16-bit or a 32-bit value.  These
319functions are used by PRDF and PRSS.  If you have problems
320compiling the programs, you may want to examine the makefile.unx
321and makefile.sun files, to look for differences.  I have tried to
322use very standard unix functions in these programs, and they have
323been successfully compiled, with very small changes to the
324Makefile, on Sun's (Sun OS 4.1), IBM RS/6000's (AIX), and MIPS
325machines (under the BSD environment).
326
3272.1.2.  IBM-PC/DOS version
328
329     For the IBM-PC/DOS version, the FASTA source code disk
330contains the complete source code to all of the programs on the
331other disks.  The programs were compiled with Borland's Turbo
332'C++', using Borland's MAKE utility.  The graphics programs
333(PLFASTA, TGREASE) use the graphics device drivers supplied with
334the Turbo 'C' V2.0 package.  Also included are the documentation
335files PROGRAMS.DOC and FORMAT.DOC.  You do not need any of the
336files the source code disk to run the programs.  The files on
337this disk are identical to the UNIX and VMS versions that run on
338larger machines.  Also included is the code to compile
339ALIGN0.EXE.  ALIGN0 is the same as ALIGN, but does not penalize
340for end-gaps.
341
342     If you have the DOS or Macintosh version of the FASTA
343package, to install the programs you should:
344
345 (1)   Make a new directory (folder) for the FASTA programs.
346       This need not be the same as the directory for your
347       sequence databases.
348
349 (2)   Copy the files from the FASTA source disk to the new
350       directory.
351
352 (3)   (DOS only) Edit your AUTOEXEC.BAT file to (a) modify your
353       PATH command to include the FASTA directory and (b) add
354       the line:
355
356           set FASTLIBS=c:\yourfastadirectory\fastgbs
357
358       On the Macintosh, you may need to edit the "environment"
359       file and change the line that reads:
360
361           FASTLIBS=fastgbs
362
363       to indicate the full directory path for the fastgbs file,
364       for example:
365
366           FASTLIBS=Q105:FASTA:fastgbs
367
368
369 (4)   Finally, you will need to edit the fastgbs file.  This is
370       usually the most confusing part of the installation.  An
371       example of this file is shown below; to customize this
372       file for your machine, you will need to change the file
373       names from those provided in the fastgbs file to ones that
374       reflect the directory names and file names you use on your
375       machine. This is explained in more detail below.  In
376       addition, some entries in the fastgbs file refer to other
377       files of file names.  These files of file names (as
378       opposed to actual database files) may also need to be
379       edited.
380
3812.2.  Installing the libraries
382
3832.2.1.  The NBRF protein sequence library
384
385     The FASTA program package does not include any protein or
386DNA sequence libraries.  You can obtain the PIR protein sequence
387database from:
388
389    National  Biomedical Research Foundation
390    Georgetown  University  Medical  Center
391    3900 Reservoir Rd, N.W.
392    Washington, D.C. 20007
393
394In addition, this database is available via anonymous ftp from
395the host "ftp.bchs.uh.edu". It is available in two formats, VMS
396and CODATA format.  The "VMS" format (library type 5 below) can
397be searched much faster, can be easily reformatted for use by the
398"BLAST" rapid searching program, and is compatible with the
399Genetics Computer Group package of programs.  The CODATA format
400is used by the EUGENE/MBIR computing package from Baylor (library
401type 2).
402
4032.2.2.  The GENBANK DNA sequence library
404
405     FASTA, and TFASTA search sequences from the GENBANK
406"flatfile" (not ASN.1) DNA sequence library in the flat-file
407format distributed by the National Center for Biotechnology
408Information and the PIR format used by EBI/EMBL.  CD-ROMs can be
409obtained from:
410
411    Genbank
412    National Center for Biotechnology Information
413    National Library of Medicine
414    National Institutes of Health
415    8600 Rockville Pike
416    Bethesda, MD  20894
417
418
419     The GenBank DNA sequence library is also available via
420anonymous FTP from ncbi.nlm.nih.gov.
421
4222.2.3.  The EBI/EMBL CD-ROM libraries
423
424     The European Bioinformatics Institute (EBI) is now
425distributing the EMBL CD-ROM that contains both the complete EMBL
426DNA sequence database (which should be essentially identical to
427the GenBank DNA sequence database) and the SWISS-PROT protein
428sequence database. SWISS-PROT is derived from the NBRF Protein
429sequence database with additions from the EBI/EMBL DNA sequence
430database.  This CD-ROM is a "best-buy," since it provides both
431DNA and protein sequence libraries.  It is available from:
432
433
434    European Bioinformatics Institute
435    Hinxton Genome Campus, Hinxton Hall
436    Hinxton, Cambridge CB10 1RQ,
437    United Kingdom
438    Tel: +44 1223 4944
439    Fax: +44 1223 494468
440    Email: DATALIB@ebi.ac.uk
441
442
443
444     In addition, the SWISS-PROT protein sequence database is
445available via anonymous FTP from ncbi.nlm.nih.gov.
446
4472.3.  Finding the libraries: FASTLIBS
448
449     FASTA and TFASTA use the environment variable FASTLIBS to
450find the protein and DNA sequence libraries.  The FASTLIBS
451variable contains the name of a file that has the actual
452filenames of the libraries.  The FASTGBS file on is an example of
453a file that can be referred to by FASTLIBS. To use the FASTGBS
454file, type:
455
456    setenv FASTLIBS /usr/lib/fasta/fastgbs (BSD UNIX/csh)
457    or
458    export FASTLIBS=/usr/lib/fasta/fastgbs (SysV UNIX/ksh)
459
460Then edit the FASTGBS file to indicate where the protein and DNA
461sequence libraries can be found.  If you have a hard disk and
462your protein sequence library is kept in the file
463/usr/lib/aabank.lib and your Genbank DNA sequence library is kept
464in the directory: /usr/lib/genbank, then fastgbs might contain:
465
466    NBRF Protein$0P/usr/lib/seq/aabank.lib 0
467    SWISS PROT 10$0S/usr/lib/vmspir/swiss.seq 5
468    GB Primate$1P@/usr/lib/genbank/gpri.nam
469    GB Rodent$1R@/usr/lib/genbank/grod.nam
470    GB Mammal$1M@/usr/lib/genbank/gmammal.nam
471    ^   1    ^^^^       4                   ^     ^
472              23                             (5)
473
474The first line of this file says that there is a copy of the NBRF
475protein sequence database (which is a protein database) that can
476be selected by typing "P" on the command line or when the
477database menu is presented in the file /usr/lib/seq/aabank.lib.
478
479     Note that there are 4 or 5 fields in the lines in fastgbs.
480The first field is the description of the library which will be
481displayed by FASTA; it ends with a '$'.  The second field (1
482character), is a 0 if the library is a protein library and 1 if
483it is a DNA library.  The third field (1 character) is the
484character to be typed to select the library.
485
486     The fourth field is the name of the library file.  In the
487example above, the /usr/lib/seq/aabank.lib file contains the
488entire protein sequence library.  However the DNA library file
489names are preceded by a '@', because these files (gpri.nam,
490grod.nam, gmammal.nam) do not contain the sequences; instead they
491contain the names of the files which contain the sequences.  This
492is done because the GENBANK DNA database is broken down in to a
493large number of smaller files.  In order to search the entire
494primate database, you must search more than a dozen files.
495
496     In addition, an optional fifth field can be used to specify
497the format of the library file.  Alternatively, you can specify
498the library format in a file of file names (a file preceded by an
499'@').  This field must be separated from the file name by a space
500character (' ') from the filename.  In the example above, the
501aabank.lib file is in Pearson/FASTA format, while the swiss.seq
502file is in PIR/VMS format (from the EMBL CD-ROM). Currently,
503FASTA can read the following formats:
504
505    0 Pearson/FASTA (>SEQID - comment/sequence)
506    1 Uncompressed Genbank (LOCUS/DEFINITION/ORIGIN)
507    2 NBRF CODATA (ENTRY/SEQUENCE)
508    3 EMBL/SWISS-PROT (ID/DE/SQ)
509    4 Intelligenetics (;comment/SEQID/sequence)
510    5 NBRF/PIR VMS (>P1;SEQID/comment/sequence)
511    6 GCG (version 8.0) Unix Protein and DNA (compressed)
512    11 NCBI Blast1.3.2 format  (unix only)
513
514In particular, this version will work with the EMBL and PIR VMS
515formats that are distributed on the EMBL CD-ROM. The latter
516format (PIR VMS) is much faster to search than EMBL format.  This
517release also works with the protein and DNA database formats
518created for the BLASTP and BLASTN programs by SETDB and PRESSDB
519and with the new NCBI search format.  If a library format is not
520specified, for example, because you are just comparing two
521sequences, Pearson/FASTA (format 0) is used by default.  To
522change this default, you may set the LIBTYPE environment variable
523to a number.  For example,
524
525    setenv LIBTYPE 1
526
527would cause the program to use the GenBank LOCUS format by
528default for libraries (or the second sequence file), but the
529Pearson/FASTA format would still be used for the query sequence.
530
531     You can specify a group of library files by putting a '@'
532symbol before a file that contains a list of file names to be
533searched.  For example, if @gmam.nam is in the fastgbs file, the
534file "gmam.nam" might contain the lines:
535
536    </usr/lib/genbank
537    gbpri.seq 1
538    gbrod.seq 1
539    gbmam.seq 1
540
541In this case, the line beginning with a '<' indicates the
542directory the files will be found in.  The remaining lines name
543the actual sequence files.  So the first sequence file to be
544searched would be:
545
546    /usr/lib/genbank/gbpri.seq
547
548The notation "<PIRNAQ:" might be used under the VAX/VMS operating
549system. Under UNIX, the trailing '/' is left off, so the library
550directory might be written as "</usr/seqlib".
551
552     With version 1.4 of the FASTA package, the FASTA and TFASTA
553programs can search a library composed of different files in
554different sequence formats.  For example, you may wish to search
555the Genbank files (in GenBank flat file format) and the EMBL DNA
556sequence database on CD-ROM.  To do this, you simply list the
557names and filetypes of the files to be searched in a file of
558filenames.  For example, to search the mammalian portion of
559Genbank, the unannotated portion of Genbank, and the unannotated
560portion of the EMBL library, you could use the file:
561
562    </usr/lib/DNA
563    gbpri.seq 1
564    #  (this '#' causes the program to display the size of the library)
565    gbrod.seq 1
566    gbmam.seq 1
567    gbuna.seq 1
568    unanno.seq 5
569    #
570
571    You do not need to include library format numbers if  you
572    only use the Pearson/FASTA version of the PIR protein se-
573    quence library.  If no library  type  is  specified,  the
574    program  assumes  that  type  0 is being used (unless you
575    have set LIBTYPE).
576
577Support for the old compressed GenBank files, which have not been
578distributed for more than four years, has been removed from
579programs in the FASTA package.
580
581
582     Test the setup by running FASTA.  Enter the sequence file
583'MUSPLFM.AA' when the program requests it (this file is included
584with the programs).  The program should then ask you to select a
585protein sequence library.  Alternatively, if you run the TFASTA
586program and use the MUSPLFM.AA query sequence, the program should
587show you a selection of DNA sequence libraries.  Once the fastgbs
588file has been set up correctly, you can set FASTLIBS=fastgbs in
589your AUTOEXEC.BAT file, and you will not need to remember where
590the libraries are kept or how they are named.
591
592     FASTA and TFASTA must open a large number of files when
593searching and reporting the results of a GENBANK floppy disk
594format library search.  You may have problems with the large
595number of files under DOS on IBM-PC's (Unix and VMS users will
596not have these problems).  If you are going to search the GENBANK
597floppy disk format DNA sequence library under DOS, you should add
598the line:
599
600    FILES=16
601
602to your CONFIG.SYS file.  (Typically this is already done for
603programs like Windows or WordPerfect.)
604
6053.  Using the FASTA Package
606
6073.1.  Overview
608
609     The FASTA sequence comparison programs all require similar
610information, the name of a query sequence file, a library file,
611and the ktup parameter.  All of the programs can accept arguments
612on the command line, or they will prompt for the file names and
613ktup value.
614
615To use FASTA, simply type:
616
617    FASTA
618    and you will be prompted for :
619         the name of the test sequence file
620         the name of the library file
621         and whether you want ktup = 1 or 2. (or 1 to 6 for DNA sequences)
622
623             ktup of 2 is about 5 times faster than ktup = 1.
624             For  a  200  aa sequence against a 10,000,000 aa
625             library, the program takes  about  30  min  with
626             ktup = 2, 150 min with ktup = 1, on a 12 Mhz 286
627             IBM-PC.
628
629
630The program can also be run by typing
631
632    FASTA test.aa /lib/bigfile.lib ktup (1 or 2)
633
634
635Included with the package are the test files, MUSPLFM.AA,
636LCBO.AA, MCHU.AA and BOVPRL.SEQ.  To check to make certain that
637everything is working, you can try:
638
639    fasta musplfm.aa lcbo.aa
640    and
641    tfasta musplfm.aa bovprl.seq
642
643To test the local similarity programs LFASTA and PLFASTA, try:
644
645    lfasta mchu.aa mchu.aa
646    and
647    plfasta mchu.aa mchu.aa
648
649MCHU (calmodulin) has four duplicated calcium binding sites that
650are clearly detected by LFASTA.  For a more complicated example,
651try MWRTC1.aa, myosin heavy chain.
652
6533.2.  Sequence files
654
655     The FASTA programs know about three kinds of sequence files
656(four under VMS): (1) plain sequence files that can only be used
657as query sequences or for LFASTA, PRDF, and ALIGN. (2) Standard
658library files.  These are the same as plain sequence files, each
659sequence is preceded by a comment line with a '>' in the first
660column. (3) distributed sequence libraries (this is a broad class
661that includes the NBRF/PIR VMS and blocked ascii formats, Genbank
662flat-file format, EMBL flat-file format, and Intelligenetics
663format.  All of the files that you create should be of type (1)
664or (2).  Type (2) files (ones with a be used as query or library
665sequence files by all of the programs.
666
667     I have included several sample test files, *.AA.  The first
668line may begin with a '>'  or ';' followed by a comment.  The
669text after ';' in other lines will  be  ignored.   Spaces  and
670tabs  (and anything else that  is  not  an amino-acid code) are
671ignored.
672
673     Library files should have the form:
674
675    >Sequence name and identifier
676    A F A S Y T .... actual sequence.
677    F S S       .... second line of sequence.
678    >Next sequence name and identifier
679
680This is often referred to as "FASTA" or "Pearson" format.  You
681can build your own library by concatenating several sequence
682files.  Just be sure that each sequence is preceded by a line
683beginning with a '>' with a sequence name.
684
685     The test file should not have lines longer than 120
686characters, and sequences entered with word processors should use
687a document mode, with normal carriage returns at the end of
688lines.
689
690Program Summary
691
6923.3.  Sequence search programs
693
694FASTA     universal sequence comparison. Defaults to comparing
695          protein sequences; if the sequences are > 85% A+C+G+T
696          or the -n option is used, a DNA sequence is assumed.
697
698FASTX     Search a protein sequence library using amino acid
699          sequence comparison to the forward three frames of a
700          translated DNA query sequence. (The reverse frames are
701          specified with the -i option.) Alignment scores allow
702          frameshifts; the final alignment uses a Smith-Waterman
703          type alignment routine (no limit on gaps) that allows
704          frameshifts.
705
706TFASTA    Search DNA library for a protein sequence by
707          translating the DNA sequence to protein in all six
708          frames (three forward frames with the -3 command line
709          option). TFASTA with ktup=2 is about as fast as a DNA
710          FASTA with ktup=4, and is substantially more sensitive.
711          (also reads the GENBANK library)
712
713TFASTX    Search DNA library for a protein sequence by
714          translating the DNA sequence to protein in all six
715          frames (three forward frames with the -3 command line
716          option) calculating similarity scores that allow
717          frameshifts. TFASTX produces an optimal Smith-Waterman
718          alignment of the query and translated-library sequence.
719
720SSEARCH   Universal sequence comparison using the Smith-Waterman
721          algorithm ( T. F. Smith and M. S. Waterman (1981) J.
722          Mol. Biol. 147:195-197).  This program uses code
723          developed by Huang and Miller (X. Huang, R. C.
724          Hardison, W. Miller (1990) CABIOS 6:373-381) for
725          calculating the local similarity score and code from
726          the ALIGN program (see below) for calculating the local
727          alignment.  SSEARCH is about 50-times slower than FASTA
728          with ktup=2 (for proteins).
729
730ALIGN     optimal global alignment of two sequences with no
731          short-cuts.  This program is a slightly modified
732          version of one taken from E.  Myers and W. Miller. The
733          algorithm is described in E. Myers and W.  Miller,
734          "Optimal Alignments in Linear Space" (CABIOS (1988)
735          4:11-17).
736
7373.4.  Local similarity programs
738
739LFASTA    local similarity searches showing local alignments.
740          The algorithm used to calculate the local alignment in
741          a band has been improved (Chao, Pearson, and Miller,
742          submitted).
743
744PLFASTA   local similarity searches with plot output (on the IBM,
745          this program requires that the environment variable
746          BGIDIR be set).
747
748PCLFASTA  (unix only) local similarity searches with plot output
749          using pic commands.
750
751LALIGN    Calculates the N-best local alignments using a rigorous
752          algorithm.  (N=10 by default.) The algorithm was
753          developed by Huang and Miller (X.  Huang and W.  Miller
754          (1991) Adv. Appl. Math. 12:337-357), which is a
755          linear-space version of an algorithm described by M. S.
756          Waterman and M. Eggert (J.  Mol. Biol. 197:723-728).
757          Like SSEARCH, LALIGN is rigorous, but also very slow.
758
759PLALIGN   A version of LALIGN that plots its output to a
760	  postscript file.
761
7623.5.  Statistical Significance
763
764     With version 2.0 of the FASTA program distribution, FASTA,
765TFASTA, and SSEARCH now provide estimates of statistical
766significance for library searches.  Work by Altschul, Arratia,
767Karlin, Mott, Waterman, and others (see Altschul et al. (1994)
768Nature Genetics 6:119 for an excellent review) suggests that
769local sequence similarity scores follow the extreme value
770distribution, so that P(s > x) = 1 - exp(-exp(-lambda(x-u)) where
771u = ln(Kmn)/lambda and m,m are the lengths of the query and
772library sequence. This formula can be rewritten as: 1 - exp(-Kmn
773exp(-lambda x), which shows that the average score for an
774unrelated library sequence increases with the logarithm of the
775length of the library sequence.  FASTA and SSEARCH use simple
776linear regression against the the log of the library sequence
777length to calculate a normalized "z-score" with mean 50,
778regardless of library sequence length, and variance 10.  These
779z-scores can then be used with the extreme value distribution and
780the poisson distribution (to account for the fact that each
781library sequence comparison is an independent test) to calculate
782the number of library sequences to obtain a score greater than or
783equal to the score obtained in the search. The original idea and
784routines to do the linear regression on library sequence length
785were provided Phil Green, U. Washington.  This version of FASTA
786and SSEARCH uses a slightly different strategy for fitting the
787data than those originally provided by Dr. Green.
788
789     The expected number of sequences is plotted in the histogram
790using an "*". Since the parameters for the extreme value
791distribution are not calculated directly from the distribution of
792similarity scores, the pattern of "*'s" in the histogram gives a
793qualitative view of how well the statistical theory fits the
794similarity scores calculated by FASTA and SSEARCH.  For FASTA, if
795optimized scores are calculated for each sequence in the database
796(the default), the agreement between the actual distribution of
797"z-scores" and the expected distribution based on the length
798dependence of the score and the extreme value distribution is
799usually very good.  Likewise, the distribution of SSEARCH Smith-
800Waterman scores typically agrees closely with the actual
801distribution of "z-scores."  The agreement with unoptimized
802scores, ktup=2, is often not very good, with too many high
803scoring sequences and too few low scoring sequences compared with
804the predicted relationship between sequence length and similarity
805score.  In those cases, the expectation values may be
806overestimates.
807
808     The statistical routines assume that the library contains a
809large sample of unrelated sequences.  If this is not the case,
810then the expectation values are meaningless.  Likewise, if there
811are fewer than 20 sequences in the library, the statistical
812calculations are not done.
813
814     For protein searches, library sequences with E() values <
8150.01 for searches of a 10,000 entry protein database are almost
816always homologous. Frequently sequences with E()-values from 1 -
81710 are related as well. Remember, however, that these E() values
818also reflect differences between the amino acid composition of
819the query sequence and that of the "average" library sequence.
820Thus, when searches are done with query sequences with "biased"
821amino-acid composition, unrelated sequences may have
822"significant" scores because of sequence bias.  The programs
823below, PRDF and PRSS, can address this problem by calculating
824similarity scores for random sequences with the same length and
825amino acid composition.
826
827     If optimization is not used ("-o"), E-values for DNA
828sequences overestimate the significance of the scores that are
829obtained and unrelated sequences frequently have E()-values <
8300.0005. With optimization, the agreement between E()-value
831compares favorably with protein sequence comparison.  This is in
832part due to the use of more stringent gap penalties for DNA
833sequence comparison, -16, -4 rather than -12, -2.  With the
834latter penalties, many unrelated sequences appear to have
835significant similarity. Nevertheless, since protein sequence
836comparison is much more sensitive, DNA sequence comparison should
837not be used to identify sequences that encode protein.  Even with
838ktup=6, optimization rarely increases run-times more than 50%
839with mRNA-size query sequences.  Optimization should be used
840whenever possible.
841
842     Similar comments apply to TFASTA, where  higher gap
843penalties (-16,-4) are required for accurate statistical
844estimates.  Because TFASTA produces so many artificial "coding"
845sequences with atypical amino acid compositions, the statistical
846estimates with TFASTA are often over estimates.  With optimized
847scores, ktup=1, and gap penalties of -16, -4, unrelated sequences
848will sometimes have E() values of 0.1.  If initn scores are used,
849unrelated sequences may have have E() values < 0.01.
850
851PRDF      improved version of RDF program that includes accurate
852          probability estimates for all three scoring methods
853          (includes local or window shuffle routine)
854
855PRSS      A version of PRDF that uses the rigorous Smith-Waterman
856          calculation used by SSEARCH.
857
858RANDSEQ   produces a randomly shuffled sequence from a query
859          sequence.
860
861RELATE    significance program described by Dayhoff (Atlas of
862          Protein Sequence and Structure, Vol. 5, Supplement 3).
863          Each chunk of 25 residues in one sequence is compared
864          to every 25 residue fragment of the second sequence.
865          Sequences which are genuinely related will have a large
866          number of scores greater than 3 standard deviations
867          above the mean score of all of the comparisons.
868
8693.6.  Other analysis programs
870
871AACOMP    calculate the amino acid composition and molecular
872          weight of a sequence.
873
874BESTSCOR  calculate the best self-comparison score.
875
876GREASE    Kyte-Doolittle hydropathicity profile
877
878TGREASE   graphic plot of Kyte-Doolittle profile
879
880FROMGB    convert from GenBank LOCUS format (also used by the
881          IBI-Pustell programs) to Pearson/FASTA format.
882
883GARNIER   A secondary structure prediction program using the
884          method of Garnier, Osgusthorpe, and Robson, J. Mol.
885          Biol., (1978) 120:97-120.
886
8873.7.  Options
888
889     These programs have a number of output options, which are
890invoked by the environment variables LINLEN, SHOWALL, and MARKX.
891Alternatively, these values can be controlled by command line
892options.  The number of sequence residues per output line is now
893adjustable by setting the environment variable LINLEN, or the
894command line option -w.  LINLEN is normally 60, to change it set
895LINLEN=80 before running the program or add -w 80 to the command
896line.  LINLEN can be set up to 200.  SHOWALL (-a) determines
897whether all, or just a portion, of the aligned sequences are
898displayed.  Previously, FASTP would show the entire length of
899both sequences in an alignment while FASTN would only show the
900portions of the two sequences that overlapped. Now the default is
901to show only the overlap between the two sequences, to show
902complete sequences, set SHOWALL=1, or use the -a option on the
903command line.
904
905     The differences between the two aligned sequences can be
906highlighted in three different ways by changing the environment
907variable MARKX or the -m option.  Normally (MARKX=0) the program
908uses ':' do denote identities and '.' to denote conservative
909replacements.  If MARKX=1, the program will not mark identities;
910instead conservative replacements are denoted by a 'x' and non-
911conservative substitutions by a 'X'.  If MARKX=2, the residues in
912the second sequence are only shown if they are different from the
913first. MARKX=3 displays the aligned library sequences without the
914query sequence; these can be used to build a primitive multiple
915alignment.  MARKX=4 provides a graphical display of the
916boundaries of the alignments. Thus the five options are:
917
918
919     MARKX=0      MARKX=1       MARKX=2       MARKX=3      MARKX=4
920
921    MWRTCGPPYT   MWRTCGPPYT    MWRTCGPPYT                 MWRTCGPPYT
922    ::..:: :::     xx  X       ..KS..Y...    MWKSCGYPYT   ----------
923    MWKSCGYPYT   MWKSCGYPYT
924
925
926(fasta20u4, Feb. 1996) In addition MARKX=10 is a new, parseable
927format for use with other programs.  See the file"readme.v20u4"
928for a more complete description.
929
9303.8.  Command line options
931
932     It is now possible to specify  several options on the
933command line, instead of using environment variables.  The
934command line options are preceded by a dash; the following
935options are available:
936
937-a        same as showall=1
938
939-A        force Smith-Waterman alignments for DNA sequences and
940          TFASA.  By default, only FASTA protein sequence
941          comparisons use Smith-Waterman alignments.
942
943-b #      Number of sequence scores to be shown on output.  In
944          the absence of this option, fasta (and tfasta and
945          ssearch) display all library sequences obtaining
946          similarity scores with expectations less than 10.0 if
947          optimized score are used, or 2.0 if they are not. The
948          -b option can limit the display further, but it will
949          not cause additional sequences to be displayed.
950
951-c #      Threshold score for optimization (OPTCUT).  Set "-c 1"
952          to optimize every sequence in a database.  (This slows
953          the program down about 5-fold).
954
955-E #      Limit the number of scores and alignments shown based
956          on the expected number of scores.  Used to override the
957          expectation value of 10.0 used by default.  When used
958          with -Q, -E 2.0 will show all library sequences with
959          scores with an expectation value <= 2.0.
960
961-d #      Number of alignments to be reported by default. (Used
962          in conjunction with -Q).  No longer necessary, see "-b"
963          above.
964
965-f        Penalty for the first residue in a gap (-12 by default
966          for proteins, -16 for DNA or for TFASTA).
967
968-g        Penalty for additional residues in a gap (-2 by default
969          for proteins, -4 for DNA and TFASTA ).
970
971-h        Penalty for frameshift (FASTX, TFASTX only).
972
973-H        Omit histogram.
974
975-i        Invert (reverse complement) the query sequence if it is
976          DNA.  For TFASTX, search the reverse complement of the
977          library sequence only.
978
979-k #      Threshold for joining init1 segments to build an initn
980          score (GAPCUT).
981
982-l file   Location of library menu file (FASTLIBS).
983
984-L        Display more information about the library sequence in
985          the alignment.
986
987-m #      MARKX = # (0, 1, 2, 3, 4, 10)
988
989-n        Force the query sequence to be treated as a DNA
990          sequence.  This is particularly useful for query
991          sequences that contain a large number of ambiguous
992          residues, e.g. transcription factor binding sites.
993
994-O        Send copy of results to "filename."  Helpful for
995          environments without STDOUT.
996
997-o        Turn off default optimization of all scores greater
998          than OPTCUT. Sort results by "initn" scores.
999
1000-Q,-q     Quiet - does not prompt for any input.  Writes scores
1001          and alignments to the terminal or standard output file.
1002
1003-r file   Save a results summary line for every sequence in the
1004          sequence library.  The summary line includes the
1005          sequence identifier, superfamily number (if available)
1006          position in the library, and the similarity scores
1007          calculated.  This option can be used to evaluate the
1008          sensitivity and selectivity of different search
1009          strategies (see W. R. Pearson (1991) Genomics 11:635-
1010          650.)
1011
1012-s file   SMATRIX is read from file.  Several SMATRIX files are
1013          provided with the standard distribution.  For protein
1014          sequences: codaa.mat - based on minimum mutation
1015          matrix; idnaa.mat - identity matrix; pam250.mat - the
1016          PAM250 matrix developed by Dayhoff et al (Atlas of
1017          Protein Sequence and Structure, vol. 5, suppl. 3,
1018          1978); pam120.mat - a PAM120 matrix.  The default
1019          scoring matrix is BLOSUM50, PAM250 is available with
1020          "-s 250", BLOSUM62 ("-s BL62") is also available.
1021
1022-v        (LINEVAL) values used for line styles in plfasta
1023
1024-w #      Line length (width) = number (<200)
1025
1026-x        Specifies offsets for the beginning of the query and
1027          library sequence.  For example, if you are comparing
1028          upstream regions for two genes, and the first sequence
1029          contains 500 nt of upstream sequence while the second
1030          contains 300 nt of upstream sequence, you might try:
1031
1032              fasta -x "-500 -300" seq1.nt seq2.nt
1033
1034          If the -x option is not used, FASTA assumes numbering
1035          starts with 1.  This option will not work properly with
1036          the translated library sequence with tfasta.  (You
1037          should double check to be certain the negative
1038          numbering works properly.)
1039
1040-y        Set the width of the band used for calculating
1041          "optimized" scores.  For proteins and ktup=2, the width
1042          is 16.  For proteins with ktup=1, the width is 32 by
1043          default.  For DNA the width is 16.
1044
1045-z        Turn off statistical calculations.
1046
1047-1        sort output by init1 score (as FASTP used to do).
1048
1049-3        (TFASTA, TFASTX only) translate only three forward
1050          frames
1051
1052
1053For example:
1054
1055    fasta -w 80 -a seq1.aa seq.aa
1056
1057would compare the sequence in seq1.aa to that in seq2.aa and
1058display the results with 80 residues on an output line, showing
1059all of the residues in both sequences.  Be sure to enter the
1060options before entering the file names, or just enter the options
1061on the command line, and the program will prompt for the file
1062names.
1063
1064     Not all of these options are appropriate for all of the
1065programs.  The options above are used by FASTA and TFASTA. RELATE
1066uses the -s option, ALIGN uses the -w, -m, and -s options, and
1067the PRDF program uses -c, -f, -k, and -s.
1068
10694.  Environment variable summary
1070
1071     Environment variables allow you to set search parameters
1072that will be used frequently when you run a program; for example,
1073if you prefer to use the PAM250 scoring matrix, you might "set
1074SMATRIX=250."  Command line parameters, if used, always override
1075environment variable settings. The following environment
1076variables are used by this program:
1077
1078AABANK    the file name  of the default sequence library.
1079
1080FASTLIBS  the location of the file which contains the list of
1081          library files to be searched.
1082
1083GAPCUT    threshold used for joining init1 regions in the second
1084          step of FASTA.  Normally set based on sequence length
1085          and ktup.
1086
1087LIBTYPE   used to specify the format of the library sequence for
1088          FASTA and TFASTA.
1089
1090LINLEN    output line length - can go up to 200
1091
1092LINEVAL   used by plfasta to determine the relationship between
1093          line style and similarity score (-v).  This should be a
1094          string of three numbers, e.g.  "200 100 50"
1095
1096MARKX     symbol for denoting matches, mismatches. Note that this
1097          symbol is only used across the optimized local region;
1098          sequences that are outside this region are not marked.
1099
1100OPTCUT    Set the threshold to be used for optimization in a band
1101          around the best initial region.  Normally the OPTCUT
1102          value is calculated from the length of the sequence and
1103          the ktup value (for a 200 residue sequence, it is about
1104          28).  If OPTCUT=1, every sequence in the database will
1105          be optimized.  This is the most sensitive option.
1106
1107PAMFACT   This version of fasta uses a more sensitive method for
1108          identifying initial regions. Instead of using a
1109          constant factor (fact) for each match in a ktup, it
1110          uses the scoring matrix (PAM) scores.  While this works
1111          well for protein sequences, it has not been as
1112          carefully tested for DNA sequences, so by default, this
1113          modification is used for proteins but not for DNA.
1114          Setting the PAMFACT environment variable to 1 forces
1115          the option on; PAMFACT=0 turns it off.
1116
1117SHOWALL   on output, show the complete sequence instead of just
1118          the overlap of the two aligned sequences.
1119
1120SMATRIX   alternative scoring matrix file.
1121
1122As always, please inform me of bugs as soon as possible.
1123
1124William R. Pearson
1125Department of Biochemistry
1126Box 440, Jordan Hall
1127U. of Virginia
1128Charlottesville, VA
1129
1130wrp@virginia.EDU
1131