1.nr pp 11 2.nr sp 11 3.nr tp 11 4.nr fp 10 5.nr fi 0n 6.sz 11 7.if t \{ 8.po 1i 9.he 'FASTA.DOC''Release 2.0u4, February 1996' 10.fo ''- % -'' 11\} 12.if n \{ 13.po 0 14.na 15.nh 16\} 17.ll 6.5i 18.ce 19\fB\s+2COPYRIGHT NOTICE\s0\fP 20.lp 21Copyright 1988, 1991, 1992, 1994, 1995, 1996 by William R. Pearson and 22the University of Virginia. All rights reserved. The FASTA program 23and documentation may not be sold or incorporated into a commercial 24product, in whole or in part, without written consent of William 25R. Pearson and the University of Virginia. For further information 26regarding permission for use or reproduction, please contact: David 27Hudson, Assistant Provost for Research, University of Virginia, 28P.O. Box 9025, Charlottesville, VA 22906-9025, (804) 924-6853 29.sp 30.uh "\s+2The FASTA program package\s0" 31.uh "Introduction" 32.pp 33This documentation describes the version 2.0x of the FASTA program 34package (see W. R. Pearson and D. J. Lipman (1988), "Improved Tools 35for Biological Sequence Analysis", PNAS 85:2444-2448, and W. R. 36Pearson (1990) "Rapid and Sensitive Sequence Comparison with FASTP and 37FASTA" Methods in Enzymology 183:63-98). Version 2.0 modifies version 381.8 to include explicit statistical estimates for similarity scores 39based on the extreme value distribution. In addition, FASTA protein 40alignments now use the Smith-Waterman algorithm with no limitation on 41gap size. FASTA and SSEARCH now use the BLOSUM50 matrix by default, 42with options to change gap penalties on the command line. Version 1.7 43replaces rdf2 and rss with prdf and prss, which use the extreme-value 44distribution to calculate accurate probability estimates. 45.sp 46.lp 47Although there are a large number of programs in this package, they 48belong to four groups: 49.(l 50 51Library search programs: FASTA, FASTX, TFASTA, TFASTX, SSEARCH 52 53Local homology programs: LFASTA, PLFASTA, LALIGN, PLALIGN, FLALIGN 54 55Statistical significance: PRDF, RELATE, PRSS, RANDSEQ 56 57Global alignment: ALIGN 58 59.)l 60.lp 61In addition, I have included several programs for protein sequence 62analysis, including a Kyte-Doolittle hydropathicity plotting program 63(GREASE, TGREASE), and a secondary structure prediction package 64(GARNIER). 65.pp 66The FASTA sequence comparison programs on this disk are improved 67versions of the FASTP program, originally described in Science (Lipman 68and Pearson, (1985) Science 227:1435-1441). We have made several 69improvements. First, the library search programs use a more sensitive 70method for the initial comparison of two sequences which allows the 71scores of several similar regions to be combined. As a result, the 72results of a library search are now given with three scores, 73\f2initn\fP (the new initial score which may include several similar 74regions), \f2init1\fP (the old fastp initial score from the best 75initial region), and \f2opt\fP (the old fastp optimized score allowing 76gaps in a 32 residue wide band). 77.pp 78These programs have also been modified to become "universal" (hence 79FAST-A, for FASTA-All, as opposed to FAST-P (protein) or FAST-N 80(nucleotides)); by changing the environment variable SMATRIX, the 81programs can be used to search protein sequences, DNA sequences, or 82whatever you like. By default, FASTA, LFASTA, and the PRDF programs 83automatically recognize protein and DNA sequences. Sequences are 84first read as amino acids, and then converted to nucleotides if the 85sequence is greater than 85% A,C,G,T (the '-n' option can be used to 86indicate DNA sequences). TFASTA compares protein sequences to a 87translated DNA sequence. Alternative scoring matrices can also be 88used. In addition to the BLOSUM50 matrix for proteins, the PAM250 89matrix or matrices based on simple identities or the genetic code can 90also be used for sequence comparisons or evaluation of significance. 91Several different protein sequence matrices have been included; 92instructions for constructing your own scoring matrix are included in 93the file FORMAT.DOC. 94.sp 2 95The remainder of this document is divided into three sections: (1) a 96brief history of the changes to the FASTA package; (2) A guide to 97installing the programs and databases; (3) A guide to using the FASTA 98programs. The programs are very easy to use, so if you are using them 99on a machine that is administered by someone else, you may want to 100skip to section (3) to learn how to use the programs, and then read 101section (1) to look at some of the more recent changes. If you are 102installing the programs on your own machine, you will need to read 103section (2) carefully. 104.sp 105.sh 1 "Revision History" 106.sh 2 "Changes with version 2.0u" 107.pp 108Version 2.0u provides several major improvements over previous 109versions of FASTA (and SSEARCH). The most important is the 110incorporation of explicit statistical estimates and appropriate 111normalization of similarity scores. This improvement is discussed in 112more detail below in the section entitled \(lqStatistical 113Significance\(rq. In addition, all of the protein comparison programs 114now use the BLOSUM50 matrix, with gap penalties of -12, -2, by 115default. BLOSUM50 performs significantly better than the older PAM250 116matrix. PAM250 can still be used with the command line option: \(lq-s 117250\(rq. (DNA sequence comparisons use a more stringent gap 118penalty of -16, -4, which produces excellent statistical estimates 119when optimized scores are used. TFASTA uses -16, -4 as well.) 120.pp 121The quality of the fit of the extreme value distribution to the actual 122distribution of similarity scores is summarized with the 123Kolmogorov-Smirnov statistic. The acceptance limits for this 124statistic can be found in many statistics books. In general, values 125<0.10 (N=30) indicate excellent agreement between the actual and 126theoretical distributions. If this statistic is > 0.2, consider 127using a higher (more stringent) gap penalty, e.g. -16, -4 rather than 128-12, -2. The default scoring matrix for DNA has been changed to score 129+5 for an identity and -4 for a mismatch. These are the same scores 130used by BLASTN. 131.pp 132With explicit expectation calculations, the program now shows all 133scores and alignments with expectations less than 10.0 (with optimized 134scores, 2.0 without optimization) when the "-Q" (quiet) mode is used. 135The expectation threshold can be changed with the "-E" option. 136.pp 137Finally, the algorithm used to produce the final alignments of protein 138sequences is now a full Smith-Waterman, with unlimited gaps. (The 139older band-limited alignments are used for DNA sequences and TFASTA by 140default, because Smith-Waterman alignments are very slow for long 141sequences.) Both the \(lqoptimized\(rq and \(lqSmith-Waterman\(rq 142scores are reported; if the Smith-Waterman score is higher, then 143additional gaps allowed a better alignment and similarity score to be 144calculated. 145.pp 146FASTA searches now optimize similarity scores by default (this slows 147searches about 2-fold (worst case) for \fIktup=2\fP). Thus, the 148meaning of the "-o" option has been reversed; "-o" now turns off 149optimization and reports results sorted by "initn" scores. 150Optimization significantly improves the sensitivity of FASTA, so that 151it almost matches Smith-Waterman. With version 2.0, the default band 152width used for optimized calculations can be varied with the "-y" 153option. For proteins with ktup=2, a width of 16 (-y 16) is used; 16 154is also used for DNA sequences. For proteins and ktup=1, a width of 15532 is used. Searches that disable optimization with the "-o" option 156will work fine for sequences that share 25% or more identity in 157general, but to detect evolutionary relationships with 20% \- 25% 158identity, the more sensitive default optimization is often required. 159Optimization is required for accurate statistical estimates with 160either protein or DNA sequences. 161.pp 162The FASTA package now includes \f(CBFASTX\fP, a program that compares 163a DNA sequence to a protein sequence database by translating the DNA 164sequence in three frames (the reverse frames are selected with the 165\fC-i\fB option) and aligning the three-frame translation with the 166sequences in the protein database. Alignment scores allow frameshifts 167so that a cDNA or EST sequence with insertion/deletion errors can be 168aligned with its homologues from beginning to end. 169.pp 170With release 20u6, there is also a \f(CBTFASTX\fP program, which is a 171replacement for TFASTA. TFASTA treats each of the six reading frames 172of a DNA library sequence as a different sequence; \f(CBTFASTX\fP 173compares a protein sequence against only two sequences from each DNA 174sequence \- the forward and reverse orientation. For a given 175orientation, \f(CBTFASTX\fP calculates a similarity score for 176alignments that allow frameshifts, thus considering all possible 177reading frames. 178.pp 179Another new program is included - \f(CBrandseq\fP - which will produce a 180randomly shuffled (uniform or local shuffle) from an input sequence. 181This randomly shuffled sequence can be used to evaluate the 182statistical estimates produced by FASTA, SSEARCH, or BLAST. 183.sh 2 "Changes with version 1.7" 184.br 185Version 1.7 has been released to provide the PRDF and PRSS programs 186for shuffling sequences and estimating accurately the probabilities of 187the unshuffled-sequence scores. 188.ip "PRDF" 1i 189a version of RDF2 that uses calculates the probability of a similarity 190score more accurately by using a fit to an extreme value distribution. 191Code to fit the extreme value distribution parameters and the impetus 192to update RDF2 was provided by Phil Green, U. of Washington. 193.ip "PRSS" 1i 194a version of PRDF that uses a rigorous Smith-Waterman calculation to 195score similarities 196.sh 2 "Changes with version 1.6" 197.pp 198FASTA version 1.6 uses a new method for calculating optimal 199scores in a band (the optimization or last step in the FASTA 200algorithm). In addition, it uses a linear-space method for calculating 201the actual alignments. FASTA v1.6 package includes several new 202programs: 203.ip "SSEARCH" 1i 204a program to search a sequence database using 205the rigorous Smith-Waterman algorithm (this 206program is about 100-fold slower than FASTA 207with ktup=2 (for proteins). 208.ip "LALIGN" 1i 209A rigorous local sequence alignment program that will display the 210N-best local alignments (N=10 by default). 211.ip "PLALIGN" 1i 212a version of lalign that plots the local alignments to postscript. 213.ip "FLALIGN" 1i 214a version of lalign that plots the local alignments to a GCG Figure 215file. 216.pp 217The LALIGN/PLALIGN/FLALIGN programs incorporate the "sim" algorithm 218described by Huang and Miller (1991) Adv. Appl. Math. 12:337-357. 219The SSEARCH and PRSS programs incorporate algorithms described by 220Huang, Hardison, and Miller (1990) CABIOS 6:373-381. 221.pp 222LFASTA and PLFASTA now calculate a different number of local 223similarities; they now behave more like LALIGN/PLALIGN. 224Since local alignments of identical sequences produce "mirror-image" 225alignments, lalign and lfasta consider only one-half of the potential 226alignments between sequences from identical file names. Thus 227.(l I 228\fClfasta mchu.aa mchu.aa\fP 229.)l 230Displays only two alignments, with earlier versions of the program, it 231would have displayed five, including the identity alignment. PLFASTA 232does display five alignments; when two identical filenames are given, 233it draws the identity alignment, calculates the two unique local 234alignments, draws them, and draws their mirror images. 235LFASTA/PLFASTA and LALIGN/PLALIGN use the filenames, rather than the 236actual sequences, to determine whether sequences are identical; you 237can "trick" the programs into behaving the old way by putting the same 238sequence in two different files. 239.sh 2 "Changes with version 1.5" 240.pp 241FASTA version 1.5 includes a number of substantial revisions to 242improve the performance and sensitivity of the program. It is now 243possible to tell the program to optimize all of the \f2initn\fP scores 244greater than a threshold. The threshold is set at the same value as 245the old FASTA cutoff score. Alternatively, you can tell FASTA to sort 246the results by the \f2init1\fP, rather than the \f2initn\fP, score by 247using the \fC-1\fP option. \fCFASTA -1 ...\fP will report the results 248the way the older FASTP program did. 249.pp 250A new method has been provided for selecting libraries. In the 251past, one could enter the name of a sequence file to be searched or a 252single letter that would specify a library from the list included in 253the $FASTLIBS file. Now, you can specify a set of library files with a 254string of letters preceded by a '%'. Thus, if the FASTLIBS file has 255the lines: 256.(l I 257 258Genbank 70 primates$1P/seqlib/gbpri.seq 1 259Genbank 70 rodents$1R/seqlib/gbrod.seq 1 260Genbank 70 other mammals$1M/seqlib/gbmam.seq 1 261Genbank 70 vertebrates $1B/seqlib/gbvrt.seq 1 262 263.)l 264Then the string: "%PRMB" would tell FASTA to search the four libraries 265listed above. The %PRMB string can be entered either on the command 266line or when the program asks for a filename or library letter. 267.pp 268FASTA1.5 also provides additional flexibility for specifying the 269number of results and alignments to be displayed with the \fC-Q\fP 270(quiet) option. The \fC-b number\fP option allows you to specify the 271number of sequence scores to show when the search is finished. Thus 272.(l I 273 274\fCFASTA -b 100 ...\fP 275 276.)l 277tells the program to display the top 100 sequence scores. In the past, 278if you displayed 100 scores (in \fC-Q\fP mode), you would also have 279store 100 alignments. The \fC-d\fP option allows you to limit the 280number of alignments shown. \fCFASTA -b 100 -d 20\fP would show 100 281scores and 20 alignments. 282.pp 283Finally, FASTA can provide a complete list of all of the sequences and 284scores calculated to a file with the \fC-r\fP (results) option. 285\fCFASTA -r results.out ...\fP creates a file with a list of scores 286for every sequence in the library. The list is not sorted, and only 287includes those scores calculated during the initial scan of the 288library. 289.sh 1 "Installing the FASTA package" 290.sh 2 "Installing the programs" 291.sh 3 "Unix version" 292.pp 293The FASTA distribution comes with several \fCmakefile\fP's that can be 294used to compile the FASTA programs. Over the years, as ATT Unix 295System 5 and BSD unix have converged, these files have become very 296similar. To begin with, I recommend using the standard \fCMakefile\fP. 297There are two values in the \fCmakefile\fP that should be checked against 298the values used on your system: the \fCHZ\fP value, which is the frequency 299in ticks per second used by the \fCtimes()\fP system call, this value can 300usually be found by running: 301.(l I 302\fCgrep HZ /usr/include/sys/*\fP 303.)l 304and the functions available to return random numbers. If you have a 305\fCrand48()\fP function that returns a 32-bit random number, use it and use 306the lines: 307.(l I 308\fCNRAND=nrand48 309RANFLG= -DRAND32\fP 310.)l 311If not, you will need to use the \fCrand()\fP function call and determine 312whether it returns a 16-bit or a 32-bit value. These functions are 313used by PRDF and PRSS. 314.p 315If you have problems compiling the programs, you may want to examine 316the \fCmakefile.unx\fP and \fCmakefile.sun\fP files, to look for differences. 317I have tried to use very standard unix functions in these programs, 318and they have been successfully compiled, with very small changes to 319the \fCMakefile\fP, on Sun's (Sun OS 4.1), IBM RS/6000's (AIX), and MIPS 320machines (under the BSD environment). 321.sh 3 "IBM-PC/DOS version" 322.pp 323For the IBM-PC/DOS version, the FASTA source code disk contains the 324complete source code to all of the programs on the other disks. The 325programs were compiled with Borland's Turbo 'C++', using Borland's 326MAKE utility. The graphics programs (PLFASTA, TGREASE) use the 327graphics device drivers supplied with the Turbo 'C' V2.0 package. 328Also included are the documentation files PROGRAMS.DOC and FORMAT.DOC. 329You do not need any of the files the source code disk to run the 330programs. The files on this disk are identical to the UNIX and VMS 331versions that run on larger machines. Also included is the code to 332compile ALIGN0.EXE. ALIGN0 is the same as ALIGN, but does not 333penalize for end-gaps. 334.pp 335If you have the DOS or Macintosh version of the FASTA package, to 336install the programs you should: 337.np 338Make a new directory (folder) for the FASTA programs. This need not 339be the same as the directory for your sequence databases. 340.np 341Copy the files from the FASTA source disk to the new directory. 342.np 343(DOS only) Edit your AUTOEXEC.BAT file to (a) modify your PATH command 344to include the FASTA directory and (b) add the line: 345.(l 346\fCset FASTLIBS=c:\\yourfastadirectory\\fastgbs\fP 347.)l 348On the Macintosh, you may need to edit the "environment" file and 349change the line that reads: 350.(l 351\fCFASTLIBS=fastgbs\fP 352.)l 353to indicate the full directory path for the \fCfastgbs\fP file, for 354example: 355.(l 356\fCFASTLIBS=Q105:FASTA:fastgbs\fP 357.)l 358.np 359Finally, you will need to edit the \fCfastgbs\fP file. This is usually 360the most confusing part of the installation. An example of this file 361is shown below; to customize this file for your machine, you will need 362to change the file names from those provided in the \fCfastgbs\fP file to 363ones that reflect the directory names and file names you use on your 364machine. This is explained in more detail below. In addition, some 365entries in the \fCfastgbs\fP file refer to other files of file names. 366These files of file names (as opposed to actual database files) may 367also need to be edited. 368.sh 2 "Installing the libraries" 369.sh 3 "The NBRF protein sequence library" 370.pp 371The FASTA program package does not include any protein or DNA 372sequence libraries. You can obtain the PIR protein sequence database from: 373.(l 374National Biomedical Research Foundation 375Georgetown University Medical Center 3763900 Reservoir Rd, N.W. 377Washington, D.C. 20007 378.)l 379In addition, this database is available via anonymous ftp from the 380host "ftp.bchs.uh.edu". It is available in two formats, VMS and CODATA 381format. The "VMS" format (library type 5 below) can be searched much 382faster, can be easily reformatted for use by the "BLAST" rapid 383searching program, and is compatible with the Genetics Computer Group 384package of programs. The CODATA format is used by the EUGENE/MBIR 385computing package from Baylor (library type 2). 386.sh 3 "The GENBANK DNA sequence library" 387.pp 388FASTA, and TFASTA search sequences from the GENBANK "flatfile" (not 389ASN.1) DNA sequence library in the flat-file format distributed by the 390National Center for Biotechnology Information and the PIR format used 391by EBI/EMBL. CD-ROMs can be obtained from: 392.(l 393Genbank 394National Center for Biotechnology Information 395National Library of Medicine 396National Institutes of Health 3978600 Rockville Pike 398Bethesda, MD 20894 399.)l 400.pp 401The GenBank DNA sequence library is also available via anonymous FTP 402from \fCncbi.nlm.nih.gov\fP. 403.sh 3 "The EBI/EMBL CD-ROM libraries" 404.pp 405The European Bioinformatics Institute (EBI) is now distributing the 406EMBL CD-ROM that contains both the complete EMBL DNA sequence database 407(which should be essentially identical to the GenBank DNA sequence 408database) and the SWISS-PROT protein sequence database. SWISS-PROT is 409derived from the NBRF Protein sequence database with additions from 410the EBI/EMBL DNA sequence database. This CD-ROM is a "best-buy," 411since it provides both DNA and protein sequence libraries. It is 412available from: 413.(l 414 415European Bioinformatics Institute 416Hinxton Genome Campus, Hinxton Hall 417Hinxton, Cambridge CB10 1RQ, 418United Kingdom 419Tel: +44 1223 4944 420Fax: +44 1223 494468 421Email: DATALIB@ebi.ac.uk 422 423.)l 424.pp 425In addition, the SWISS-PROT protein sequence database is available via 426anonymous FTP from \fCncbi.nlm.nih.gov\fP. 427.sh 2 "Finding the libraries: FASTLIBS" 428.pp 429FASTA and TFASTA use the environment variable FASTLIBS to find the 430protein and DNA sequence libraries. The FASTLIBS variable contains 431the name of a file that has the actual filenames of the libraries. 432The \fCFASTGBS\fP file on is an example of a file that 433can be referred to by FASTLIBS. To use the \fCFASTGBS\fP file, type: 434.(l 435\fCsetenv FASTLIBS /usr/lib/fasta/fastgbs\fP (BSD UNIX/csh) 436or 437\fCexport FASTLIBS=/usr/lib/fasta/fastgbs\fP (SysV UNIX/ksh) 438.)l 439Then edit the \fCFASTGBS\fP file to indicate where the protein and DNA 440sequence libraries can be found. If you have a hard disk and your 441protein sequence library is kept in the file \fC/usr/lib/aabank.lib\fP and 442your Genbank DNA sequence library is kept in the directory: 443\fC/usr/lib/genbank\fP, then \fCfastgbs\fP might contain: 444.ne 8 445.(l 446.ft C 447NBRF Protein$0P/usr/lib/seq/aabank.lib 0 448SWISS PROT 10$0S/usr/lib/vmspir/swiss.seq 5 449GB Primate$1P@/usr/lib/genbank/gpri.nam 450GB Rodent$1R@/usr/lib/genbank/grod.nam 451GB Mammal$1M@/usr/lib/genbank/gmammal.nam 452^ 1 ^^^^ 4 ^ ^ 453 23 (5) 454.ft R 455.)l 456The first line of this file says that there is a copy of the NBRF 457protein sequence database (which is a protein database) that can be 458selected by typing "P" on the command line or when the database menu 459is presented in the file \fC/usr/lib/seq/aabank.lib\fP. 460.pp 461Note that there are 4 or 5 fields in the lines in \fCfastgbs\fP. The first 462field is the description of the library which will be displayed by 463FASTA; it ends with a '$'. The second field (1 character), is a 0 if 464the library is a protein library and 1 if it is a DNA library. The 465third field (1 character) is the character to be typed to select the 466library. 467.pp 468The fourth field is the name of the library file. In the example 469above, the \fC/usr/lib/seq/aabank.lib\fP file contains the entire 470protein sequence library. However the DNA library file names are 471preceded by a '@', because these files (\fCgpri.nam, grod.nam, 472gmammal.nam\fP) do not contain the sequences; instead they contain the names 473of the files which contain the sequences. This is done because the 474GENBANK DNA database is broken down in to a large number of smaller 475files. In order to search the entire primate database, you must 476search more than a dozen files. 477.pp 478In addition, an optional fifth field can be used to specify the format 479of the library file. Alternatively, you can specify the library 480format in a file of file names (a file preceded by an '@'). This 481field must be separated from the file name by a space character ('\ ') 482from the filename. In the example above, the \fCaabank.lib\fP file is 483in Pearson/FASTA format, while the \fCswiss.seq\fP file is in PIR/VMS format 484(from the EMBL CD-ROM). Currently, FASTA can read the following formats: 485.(l I 486.ft C 4870 Pearson/FASTA (>SEQID - comment/sequence) 4881 Uncompressed Genbank (LOCUS/DEFINITION/ORIGIN) 4892 NBRF CODATA (ENTRY/SEQUENCE) 4903 EMBL/SWISS-PROT (ID/DE/SQ) 4914 Intelligenetics (;comment/SEQID/sequence) 4925 NBRF/PIR VMS (>P1;SEQID/comment/sequence) 4936 GCG (version 8.0) Unix Protein and DNA (compressed) 49411 NCBI Blast1.3.2 format (unix only) 495.ft R 496.)l 497In particular, this version will work with the EMBL and PIR VMS 498formats that are distributed on the EMBL CD-ROM. The latter format 499(PIR VMS) is much faster to search than EMBL format. This release 500also works with the protein and DNA database formats created for the 501BLASTP and BLASTN programs by SETDB and PRESSDB and with the new NCBI 502search format. If a library format is not specified, for example, 503because you are just comparing two sequences, Pearson/FASTA (format 0) 504is used by default. To change this default, you may set the LIBTYPE 505environment variable to a number. For example, 506.(l I 507\fCsetenv LIBTYPE 1\fP 508.)l 509would cause the program to use the GenBank LOCUS format by default 510for libraries (or the second sequence file), but the Pearson/FASTA 511format would still be used for the query sequence. 512.pp 513You can specify a group of library files by putting a '@' symbol 514before a file that contains a list of file names to be searched. For 515example, if @gmam.nam is in the fastgbs file, the file "gmam.nam" 516might contain the lines: 517.(l 518.ft C 519</usr/lib/genbank 520gbpri.seq 1 521gbrod.seq 1 522gbmam.seq 1 523.ft R 524.)l 525In this case, the line beginning with a '<' indicates the directory 526the files will be found in. The remaining lines name the actual 527sequence files. So the first sequence file to be searched would be: 528.(l 529.ft C 530/usr/lib/genbank/gbpri.seq 531.ft R 532.)l 533The notation "\fC<PIRNAQ:\fP" might be used under the VAX/VMS operating 534system. Under UNIX, the trailing '/' is left off, so the library 535directory might be written as "\fC</usr/seqlib\fP". 536.pp 537With version 1.4 of the FASTA package, the FASTA and TFASTA programs 538can search a library composed of different files in different sequence 539formats. For example, you may wish to search the Genbank files (in 540GenBank flat file format) and the EMBL DNA sequence database on 541CD-ROM. To do this, you simply list the names and filetypes of the 542files to be searched in a file of filenames. For example, to search 543the mammalian portion of Genbank, the unannotated portion of Genbank, 544and the unannotated portion of the EMBL library, you could use the 545file: 546.(l I 547.ft C 548</usr/lib/DNA 549gbpri.seq 1 550\&# (this '#' causes the program to display the size of the library) 551gbrod.seq 1 552\&... 553gbmam.seq 1 554\&... 555gbuna.seq 1 556\&... 557unanno.seq 5 558\&# 559.ft R 560.)l 561.(l I F 562You do not need to include library format numbers if you only use the 563Pearson/FASTA version of the PIR protein sequence library. If no 564library type is specified, the program assumes that type 0 is being 565used (unless you have set LIBTYPE). 566.)l 567.lp 568Support for the old compressed GenBank files, which have not been 569distributed for more than four years, has been removed from programs 570in the FASTA package. 571.sp 572.pp 573Test the setup by running FASTA. Enter the sequence 574file '\fCMUSPLFM.AA\fP' when the program requests it (this file is 575included with the programs). The program should then ask you to 576select a protein sequence library. Alternatively, if you run the 577TFASTA program and use the MUSPLFM.AA query sequence, the program 578should show you a selection of DNA sequence libraries. 579Once the fastgbs file has been set up correctly, you can 580set FASTLIBS=fastgbs in your AUTOEXEC.BAT file, and you will not need to 581remember where the libraries are kept or how they are named. 582.pp 583FASTA and TFASTA must open a large number of files when searching and 584reporting the results of a GENBANK floppy disk format library search. 585You may have problems with the large number of files under DOS on IBM-PC's 586(Unix and VMS users will not have these problems). If you are going 587to search the GENBANK floppy disk format DNA sequence library under 588DOS, you should add the line: 589.(l 590.ft C 591FILES=16 592.ft R 593.)l 594to your \fCCONFIG.SYS\fP file. (Typically this is already done for programs 595like Windows or WordPerfect.) 596.sh 1 "\s+2Using the FASTA Package\s0" 597.sh 2 "Overview" 598.pp 599The FASTA sequence comparison programs all require similar 600information, the name of a query sequence file, a library file, and 601the \fIktup\fP parameter. All of the programs can accept arguments 602on the command line, or they will prompt for the file names and 603\fIktup\fP value. 604.lp 605To use FASTA, simply type: 606.(l 607.ft C 608\f(CBFASTA\fP 609and you will be prompted for : 610.in +0.5i 611the name of the test sequence file 612the name of the library file 613and whether you want ktup = 1 or 2. (or 1 to 6 for DNA sequences) 614.(l F 615.ft R 616ktup of 2 is about 5 times faster than ktup = 1. For a 200 aa 617sequence against a 10,000,000 aa library, the program takes about 30 618min with ktup = 2, 150 min with ktup = 1, on a 12 Mhz 286 IBM-PC. 619.ft C 620.)l 621.ft R 622.)l 623The program can also be run by typing 624.(l 625.ft C 626FASTA test.aa /lib/bigfile.lib ktup (1 or 2) 627.ft R 628.)l 629.lp 630Included with the package are the test files, 631\fCMUSPLFM.AA\fP, \fCLCBO.AA\fP, \fCMCHU.AA\fP and \fCBOVPRL.SEQ\fP. 632To check to make certain that everything is working, you can try: 633.(l 634.ft C 635fasta musplfm.aa lcbo.aa 636and 637tfasta musplfm.aa bovprl.seq 638.ft R 639.)l 640To test the local similarity programs LFASTA and PLFASTA, try: 641.(l 642.ft C 643lfasta mchu.aa mchu.aa 644and 645plfasta mchu.aa mchu.aa 646.ft R 647.)l 648\fCMCHU\fP (calmodulin) has four duplicated calcium binding sites 649that are clearly detected by LFASTA. For a more complicated example, 650try \fCMWRTC1.aa\fP, myosin heavy chain. 651.sh 2 "Sequence files" 652.pp 653The FASTA programs know about three kinds of sequence files (four 654under VMS): (1) plain sequence files that can only be used as query 655sequences or for LFASTA, PRDF, and ALIGN. (2) Standard library files. 656These are the same as plain sequence files, each sequence is preceded 657by a comment line with a '>' in the first column. (3) distributed 658sequence libraries (this is a broad class that includes the NBRF/PIR 659VMS and blocked ascii formats, Genbank flat-file format, EMBL 660flat-file format, and Intelligenetics format. All of the files that 661you create should be of type (1) or (2). Type (2) files (ones with a 662'>' and comment before the sequence) are preferred, because they can 663be used as query or library sequence files by all of the programs. 664.pp 665I have included several sample test files, *.AA. The first line may begin 666with a '>' or ';' followed by a comment. The text after ';' in other lines 667will be ignored. Spaces and tabs (and anything else that is not an 668amino-acid code) are ignored. 669.pp 670Library files should have the form: 671.(l 672.ft C 673>Sequence name and identifier 674A F A S Y T .... actual sequence. 675F S S .... second line of sequence. 676>Next sequence name and identifier 677.ft R 678.)l 679This is often referred to as "FASTA" or "Pearson" format. You can 680build your own library by concatenating several sequence files. Just 681be sure that each sequence is preceded by a line beginning with a '>' 682with a sequence name. 683.pp 684The test file should not have lines longer than 120 characters, and 685sequences entered with word processors should use a document 686mode, with normal carriage returns at the end of lines. 687.uh "\s+2Program Summary\s0" 688.sh 2 "Sequence search programs" 689.nr ii 1i 690.ip "FASTA" 691universal sequence comparison. Defaults to comparing protein sequences; 692if the sequences are > 85% A+C+G+T or the \fC-n\fP option is used, a 693DNA sequence is assumed. 694.ip "FASTX" 695Search a protein sequence library using amino acid sequence comparison 696to the forward three frames of a translated DNA query sequence. (The 697reverse frames are specified with the \fC-i\fP option.) Alignment 698scores allow frameshifts; the final alignment uses a Smith-Waterman 699type alignment routine (no limit on gaps) that allows frameshifts. 700.ip "TFASTA" 701Search DNA library for a protein sequence by translating the DNA 702sequence to protein in all six frames (three forward frames with the 703-3 command line option). TFASTA with ktup=2 is about as fast as a DNA 704FASTA with ktup=4, and is substantially more sensitive. 705(also reads the GENBANK library) 706.ip "TFASTX" 707Search DNA library for a protein sequence by translating the DNA 708sequence to protein in all six frames (three forward frames with the 709-3 command line option) calculating similarity scores that allow 710frameshifts. TFASTX produces an optimal Smith-Waterman alignment of 711the query and translated-library sequence. 712.ip "\f2SSEARCH\fP" 713Universal sequence comparison using the Smith-Waterman algorithm ( 714T. F. Smith and M. S. Waterman (1981) J. Mol. Biol. 147:195-197). 715This program uses code developed by Huang and Miller (X. Huang, R. C. 716Hardison, W. Miller (1990) CABIOS 6:373-381) for calculating the local 717similarity score and code from the ALIGN program (see below) for 718calculating the local alignment. \f2SSEARCH\fP is about 50-times 719slower than FASTA with ktup=2 (for proteins). 720.ip "ALIGN" 721optimal global alignment of two sequences with no short-cuts. 722This program is a slightly modified version of one taken from E. 723Myers and W. Miller. The algorithm is described in E. Myers and W. 724Miller, "Optimal Alignments in Linear Space" (CABIOS (1988) 4:11-17). 725.sh 2 "Local similarity programs" 726.ip "LFASTA" 727local similarity searches showing local alignments. The algorithm 728used to calculate the local alignment in a band has been improved 729(Chao, Pearson, and Miller, submitted). 730.ip "PLFASTA" 731local similarity searches with plot output (on the IBM, 732this program requires that the environment variable BGIDIR be set). 733.ip "PCLFASTA" 734(unix only) local similarity searches with plot output using pic commands. 735.ip "\f2LALIGN\fP" 736Calculates the N-best local alignments using a rigorous algorithm. 737(N=10 by default.) The algorithm was developed by Huang and Miller (X. 738Huang and W. Miller (1991) Adv. Appl. Math. 12:337-357), which is a 739linear-space version of an algorithm described by M. S. Waterman and 740M. Eggert (J. Mol. Biol. 197:723-728). Like \f2SSEARCH\fP, 741\f2LALIGN\fP is rigorous, but also very slow. 742.ip "\f2PLALIGN\fP" 743A version of \f2LALIGN\fP that plots its output to a postscript file. 744 745.sh 2 "Statistical Significance" 746.pp 747With version 2.0 of the FASTA program distribution, FASTA, TFASTA, and 748SSEARCH now provide estimates of statistical significance for library 749searches. Work by Altschul, Arratia, Karlin, Mott, Waterman, and 750others (see Altschul et al. (1994) Nature Genetics 6:119 for an 751excellent review) suggests that local sequence similarity scores 752follow the extreme value distribution, so that P(s > x) = 1 - 753exp(-exp(-lambda(x-u)) where u = ln(Kmn)/lambda and m,m are the 754lengths of the query and library sequence. This formula can be 755rewritten as: 1 - exp(-Kmn exp(-lambda x), which shows that the 756average score for an unrelated library sequence increases with the 757logarithm of the length of the library sequence. FASTA and SSEARCH 758use simple linear regression against the the log of the library 759sequence length to calculate a normalized "z-score" with mean 50, 760regardless of library sequence length, and variance 10. These 761z-scores can then be used with the extreme value distribution and the 762poisson distribution (to account for the fact that each library 763sequence comparison is an independent test) to calculate the number of 764library sequences to obtain a score greater than or equal to the score 765obtained in the search. The original idea and routines to do the 766linear regression on library sequence length were provided Phil Green, 767U. Washington. This version of FASTA and SSEARCH uses a slightly 768different strategy for fitting the data than those originally provided 769by Dr. Green. 770.pp 771The expected number of sequences is plotted in the histogram using an 772"*". Since the parameters for the extreme value distribution are not 773calculated directly from the distribution of similarity scores, the 774pattern of "*'s" in the histogram gives a qualitative view of how well 775the statistical theory fits the similarity scores calculated by FASTA 776and SSEARCH. For FASTA, if optimized scores are calculated for each 777sequence in the database (the default), the agreement between the 778actual distribution of "z-scores" and the expected distribution based 779on the length dependence of the score and the extreme value 780distribution is usually very good. Likewise, the distribution of 781SSEARCH Smith-Waterman scores typically agrees closely with the actual 782distribution of "z-scores." The agreement with unoptimized scores, 783\fIktup=2\fP, is often not very good, with too many high scoring 784sequences and too few low scoring sequences compared with the 785predicted relationship between sequence length and similarity score. 786In those cases, the expectation values may be overestimates. 787.pp 788The statistical routines assume that the library contains a large 789sample of unrelated sequences. If this is not the case, then the 790expectation values are meaningless. Likewise, if there are fewer than 79120 sequences in the library, the statistical calculations are not 792done. 793.pp 794For protein searches, library sequences with E() values < 0.01 for 795searches of a 10,000 entry protein database are almost always 796homologous. Frequently sequences with E()-values from 1 - 10 are 797related as well. Remember, however, that these E() values also reflect 798differences between the amino acid composition of the query sequence 799and that of the "average" library sequence. Thus, when searches are 800done with query sequences with "biased" amino-acid composition, 801unrelated sequences may have "significant" scores because of sequence 802bias. The programs below, PRDF and PRSS, can address this problem by 803calculating similarity scores for random sequences with the same 804length and amino acid composition. 805.pp 806If optimization is not used ("-o"), E-values for DNA sequences 807overestimate the significance of the scores that are obtained and 808unrelated sequences frequently have E()-values < 0.0005. With 809optimization, the agreement between E()-value compares favorably with 810protein sequence comparison. This is in part due to the use of more 811stringent gap penalties for DNA sequence comparison, -16, -4 rather 812than -12, -2. With the latter penalties, many unrelated sequences 813appear to have significant similarity. Nevertheless, since protein 814sequence comparison is much more sensitive, DNA sequence comparison 815should not be used to identify sequences that encode protein. Even 816with ktup=6, optimization rarely increases run-times more than 50% 817with mRNA-size query sequences. Optimization should be used whenever 818possible. 819.pp 820Similar comments apply to TFASTA, where higher gap penalties (-16,-4) are 821required for accurate statistical estimates. Because TFASTA produces 822so many artificial "coding" sequences with atypical amino acid 823compositions, the statistical estimates with TFASTA are often over 824estimates. With optimized scores, ktup=1, and gap penalties of -16, 825-4, unrelated sequences will sometimes have E() values of 0.1. If 826initn scores are used, unrelated sequences may have have E() values < 8270.01. 828.ip "PRDF" 829improved version of RDF program that includes accurate probability 830estimates for all three scoring methods (includes local or window 831shuffle routine) 832.ip "PRSS" 833A version of PRDF that uses the rigorous Smith-Waterman calculation 834used by SSEARCH. 835.ip "RANDSEQ" 836produces a randomly shuffled sequence from a query sequence. 837.ip "RELATE" 838significance program described by Dayhoff (Atlas of Protein Sequence 839and Structure, Vol. 5, Supplement 3). Each chunk of 25 residues in 840one sequence is compared to every 25 residue fragment of the second 841sequence. Sequences which are genuinely related will have a large 842number of scores greater than 3 standard deviations above the mean 843score of all of the comparisons. 844.sh 2 "Other analysis programs" 845.ip "AACOMP" 846calculate the amino acid composition and molecular weight 847of a sequence. 848.ip "BESTSCOR" 849calculate the best self-comparison score. 850.ip "GREASE" 851Kyte-Doolittle hydropathicity profile 852.ip "TGREASE" 853graphic plot of Kyte-Doolittle profile 854.ip "FROMGB" 855convert from GenBank LOCUS format (also used by the IBI-Pustell programs) 856to Pearson/FASTA format. 857.ip "GARNIER" 858A secondary structure prediction program using the method of Garnier, 859Osgusthorpe, and Robson, J. Mol. Biol., (1978) 120:97-120. 860.sh 2 "Options" 861.pp 862These programs have a number of output options, which are invoked 863by the environment variables \fBLINLEN\fP, \fBSHOWALL\fP, and 864\fBMARKX\fP. Alternatively, these values can be controlled by 865command line options. The number of sequence residues per output 866line is now adjustable by setting the environment variable 867\fBLINLEN\fP, or the command line option \fB-w\fP. \fBLINLEN\fP is 868normally 60, to change it set \fBLINLEN=80\fP before running the 869program or add \fB-w 80\fP to the command line. \fBLINLEN\fP can 870be set up to 200. \fBSHOWALL\fP (\fB-a\fP) determines whether all, 871or just a portion, of the aligned sequences are displayed. 872Previously, FASTP would show the entire length of both sequences in 873an alignment while FASTN would only show the portions of the two 874sequences that overlapped. Now the default is to show only the 875overlap between the two sequences, to show complete sequences, set 876\fBSHOWALL=1\fP, or use the \fB-a\fP option on the command line. 877.pp 878The differences between the two aligned sequences can be highlighted 879in three different ways by changing the environment variable 880\fBMARKX\fP or the \fB-m\fP option. Normally (MARKX=0) the program 881uses ':' do denote identities and '.' to denote conservative 882replacements. If MARKX=1, the program will not mark identities; 883instead conservative replacements are denoted by a 'x' and 884non-conservative substitutions by a 'X'. If MARKX=2, the residues in 885the second sequence are only shown if they are different from the 886first. MARKX=3 displays the aligned library sequences without the 887query sequence; these can be used to build a primitive multiple 888alignment. MARKX=4 provides a graphical display of the boundaries of 889the alignments. Thus the five options are: 890.(l 891.ft C 892 893 MARKX=0 MARKX=1 MARKX=2 MARKX=3 MARKX=4 894 895MWRTCGPPYT MWRTCGPPYT MWRTCGPPYT MWRTCGPPYT 896::..:: ::: xx X ..KS..Y... MWKSCGYPYT ---------- 897MWKSCGYPYT MWKSCGYPYT 898.ft R 899.)l 900.lp 901(fasta20u4, Feb. 1996) In addition MARKX=10 is a new, parseable format for use 902with other programs. See the file"readme.v20u4" for a more complete 903description. 904.sh 2 "Command line options" 905.pp 906It is now possible to specify several options on the command 907line, instead of using environment variables. The command line options 908are preceded by a dash; the following options are available: 909.ip "-a" 910same as showall=1 911.ip "-A" 912force Smith-Waterman alignments for DNA sequences and TFASA. By 913default, only FASTA protein sequence comparisons use Smith-Waterman 914alignments. 915.ip "-b #" 916Number of sequence scores to be shown on output. In the absence of 917this option, fasta (and tfasta and ssearch) display all library 918sequences obtaining similarity scores with expectations less than 91910.0 if optimized score are used, or 2.0 if they are not. The -b 920option can limit the display further, but it will not cause additional 921sequences to be displayed. 922.ip "-c #" 923Threshold score for optimization (OPTCUT). Set "-c 1" to 924optimize every sequence in a database. (This slows the program down 925about 5-fold). 926.ip "-E #" 927Limit the number of scores and alignments shown based on the 928expected number of scores. Used to override the expectation value of 10.0 929used by default. When used with -Q, -E 2.0 will show all library sequences 930with scores with an expectation value <= 2.0. 931.ip "-d #" 932Number of alignments to be reported by default. (Used in conjunction 933with -Q). No longer necessary, see "-b" above. 934.ip "-f" 935Penalty for the first residue in a gap (-12 by default for proteins, 936-16 for DNA or for TFASTA). 937.ip "-g" 938Penalty for additional residues in a gap (-2 by default for proteins, 939-4 for DNA and TFASTA ). 940.ip "-h" 941Penalty for frameshift (FASTX, TFASTX only). 942.ip "-H" 943Omit histogram. 944.ip "-i" 945Invert (reverse complement) the query sequence if it is DNA. For 946TFASTX, 947search the reverse complement of the library sequence only. 948.ip "-k #" 949Threshold for joining init1 segments to build an initn score (GAPCUT). 950.ip "-l file" 951Location of library menu file (FASTLIBS). 952.ip "-L" 953Display more information about the library sequence in the alignment. 954.ip "-m #" 955MARKX = # (0, 1, 2, 3, 4, 10) 956.ip "-n" 957Force the query sequence to be treated as a DNA sequence. This is 958particularly useful for query sequences that contain a large number of 959ambiguous residues, e.g. transcription factor binding sites. 960.ip "-O" 961Send copy of results to "filename." Helpful for environments without STDOUT. 962.ip "-o " 963Turn off default optimization of all scores greater than OPTCUT. Sort 964results by "initn" scores. 965.ip "-Q,-q" 966Quiet - does not prompt for any input. Writes scores and alignments 967to the terminal or standard output file. 968.ip "-r file" 969Save a results summary line for every sequence in the sequence 970library. The summary line includes the sequence identifier, 971superfamily number (if available) position 972in the library, and the similarity scores calculated. This option can 973be used to evaluate the sensitivity and selectivity of different 974search strategies (see W. R. Pearson (1991) Genomics 11:635-650.) 975.ip "-s file" 976SMATRIX is read from file. Several SMATRIX files are provided with 977the standard distribution. For protein sequences: \fCcodaa.mat\fP - 978based on minimum mutation matrix; \fCidnaa.mat\fP - identity matrix; 979\fCpam250.mat\fP - the PAM250 matrix developed by Dayhoff et al (Atlas 980of Protein Sequence and Structure, vol. 5, suppl. 3, 1978); 981\fCpam120.mat\fP - a PAM120 matrix. The default scoring matrix is 982BLOSUM50, PAM250 is available with "-s 250", BLOSUM62 ("-s BL62") is 983also available. 984.ip "-v \"#1 #2 #3\"" 985(LINEVAL) values used for line styles in plfasta 986.ip "-w #" 987Line length (width) = number (<200) 988.ip "-x \"off1 off2\"" 989Specifies offsets for the beginning of the query and library sequence. 990For example, if you are comparing upstream regions for two genes, and 991the first sequence contains 500 nt of upstream sequence while the 992second contains 300 nt of upstream sequence, you might try: 993.(l I 994\fCfasta -x "-500 -300" seq1.nt seq2.nt\fP 995.)l 996If the -x option is not used, FASTA assumes numbering starts with 1. 997This option will not work properly with the translated library 998sequence with tfasta. (You should double check to be certain the 999negative numbering works properly.) 1000.ip "-y" 1001Set the width of the band used for calculating "optimized" scores. 1002For proteins and ktup=2, the width is 16. For proteins with ktup=1, 1003the width is 32 by default. For DNA the width is 16. 1004.ip "-z" 1005Turn off statistical calculations. 1006.ip "-1" 1007sort output by init1 score (as FASTP used to do). 1008.ip "-3" 1009(TFASTA, TFASTX only) translate only three forward frames 1010.sp 1011.lp 1012For example: 1013.(l 1014\fCfasta -w 80 -a seq1.aa seq.aa\fP 1015.)l 1016would compare the sequence in seq1.aa to that in seq2.aa and display the 1017results with 80 residues on an output line, showing all of the residues 1018in both sequences. Be sure to enter the options before entering the file 1019names, or just enter the options on the command line, and the program will 1020prompt for the file names. 1021.pp 1022Not all of these options are appropriate for all of the programs. The 1023options above are used by FASTA and TFASTA. RELATE uses the -s option, 1024ALIGN uses the -w, -m, and -s options, and the PRDF program uses -c, 1025-f, -k, and -s. 1026.sp 1027.pp 1028(November, 1997) In addition, it is now possible to provide the fasta 1029programs with the query sequence (fasta, fasty, ssearch, tfastx), or 1030two sequences (prss, lalign, plalign) from the unix "stdin" stream. This 1031makes it much easier to set up FASTA or PRSS WWW pages. To specify 1032that stdin be used, rather than a file, the file name should be 1033specified as '-' or '@' (the latter file name makes it possible to 1034specify a subset of the sequence). 1035Thus: 1036.(l 1037cat query.aa | fasta -q @:25-75 s 1038.)l 1039would take residues 25-75 from query.aa and search the 's' library 1040(see the discussion of FASTLIBS). If DNA sequences are to be read 1041from stdin, the '-n' option must be used, as fasta cannot check for 1042DNA queries when stdin is used. 1043 1044.sh 1 "Environment variable summary" 1045.pp 1046Environment variables allow you to set search parameters that will be 1047used frequently when you run a program; for example, if you prefer to 1048use the PAM250 scoring matrix, you might "set SMATRIX=250." Command 1049line parameters, if used, always override environment variable 1050settings. The following environment variables are used by this 1051program: 1052.ip "AABANK" 1053the file name of the default sequence library. 1054.ip "FASTLIBS" 1055the location of the file which contains the list 1056of library files to be searched. 1057.ip "GAPCUT" 1058threshold used for joining init1 regions in the second step of FASTA. 1059Normally set based on sequence length and \fIktup\fP. 1060.ip "LIBTYPE" 1061used to specify the format of the library sequence for FASTA and TFASTA. 1062.ip "LINLEN" 1063output line length - can go up to 200 1064.ip "LINEVAL" 1065used by plfasta to determine the relationship between line style and 1066similarity score (-v). This should be a string of three numbers, e.g. 1067"200 100 50" 1068.ip "MARKX" 1069symbol for denoting matches, mismatches. Note that this symbol is only 1070used across the optimized local region; sequences that are outside 1071this region are not marked. 1072.ip "OPTCUT" 1073Set the threshold to be used for optimization in a band around the 1074best initial region. Normally the OPTCUT value is calculated from the 1075length of the sequence and the \fIktup\fP value (for a 200 residue 1076sequence, it is about 28). If OPTCUT=1, every sequence in the 1077database will be optimized. This is the most sensitive option. 1078.ip "PAMFACT" 1079This version of fasta uses a more sensitive method for identifying 1080initial regions. Instead of using a constant factor (fact) for each 1081match in a ktup, it uses the scoring matrix (PAM) scores. While this 1082works well for protein sequences, it has not been as carefully tested 1083for DNA sequences, so by default, this modification is used for 1084proteins but not for DNA. Setting the PAMFACT environment variable to 10851 forces the option on; PAMFACT=0 turns it off. 1086.ip "SHOWALL" 1087on output, show the complete sequence instead of just the 1088overlap of the two aligned sequences. 1089.ip "SMATRIX" 1090alternative scoring matrix file. 1091.sp 1092.lp 1093As always, please inform me of bugs as soon as possible. 1094.sp 1095.nf 1096William R. Pearson 1097Department of Biochemistry 1098Box 440, Jordan Hall 1099U. of Virginia 1100Charlottesville, VA 1101 1102wrp@virginia.EDU 1103