1 2 COPYRIGHT NOTICE 3 4Copyright 1988, 1991, 1992, 1994, 1995, 1996 by William R. 5Pearson and the University of Virginia. All rights reserved. The 6FASTA program and documentation may not be sold or incorporated 7into a commercial product, in whole or in part, without written 8consent of William R. Pearson and the University of Virginia. 9For further information regarding permission for use or 10reproduction, please contact: David Hudson, Assistant Provost for 11Research, University of Virginia, P.O. Box 9025, Charlottesville, 12VA 22906-9025, (804) 924-6853 13 14 15The FASTA program package 16 17Introduction 18 19 This documentation describes the version 2.0x of the FASTA 20program package (see W. R. Pearson and D. J. Lipman (1988), 21"Improved Tools for Biological Sequence Analysis", PNAS 85:2444- 222448, and W. R. Pearson (1990) "Rapid and Sensitive Sequence 23Comparison with FASTP and FASTA" Methods in Enzymology 183:63- 2498). Version 2.0 modifies version 1.8 to include explicit 25statistical estimates for similarity scores based on the extreme 26value distribution. In addition, FASTA protein alignments now 27use the Smith-Waterman algorithm with no limitation on gap size. 28FASTA and SSEARCH now use the BLOSUM50 matrix by default, with 29options to change gap penalties on the command line. Version 1.7 30replaces rdf2 and rss with prdf and prss, which use the extreme- 31value distribution to calculate accurate probability estimates. 32 33 34Although there are a large number of programs in this package, 35they belong to four groups: 36 37 38 Library search programs: FASTA, FASTX, TFASTA, TFASTX, SSEARCH 39 40 Local homology programs: LFASTA, PLFASTA, LALIGN, PLALIGN, FLALIGN 41 42 Statistical significance: PRDF, RELATE, PRSS, RANDSEQ 43 44 Global alignment: ALIGN 45 46 47 48In addition, I have included several programs for protein 49sequence analysis, including a Kyte-Doolittle hydropathicity 50plotting program (GREASE, TGREASE), and a secondary structure 51prediction package (GARNIER). 52 53 The FASTA sequence comparison programs on this disk are 54improved versions of the FASTP program, originally described in 55Science (Lipman and Pearson, (1985) Science 227:1435-1441). We 56have made several improvements. First, the library search 57programs use a more sensitive method for the initial comparison 58of two sequences which allows the scores of several similar 59regions to be combined. As a result, the results of a library 60search are now given with three scores, initn (the new initial 61score which may include several similar regions), init1 (the old 62fastp initial score from the best initial region), and opt (the 63old fastp optimized score allowing gaps in a 32 residue wide 64band). 65 66 These programs have also been modified to become "universal" 67(hence FAST-A, for FASTA-All, as opposed to FAST-P (protein) or 68FAST-N (nucleotides)); by changing the environment variable 69SMATRIX, the programs can be used to search protein sequences, 70DNA sequences, or whatever you like. By default, FASTA, LFASTA, 71and the PRDF programs automatically recognize protein and DNA 72sequences. Sequences are first read as amino acids, and then 73converted to nucleotides if the sequence is greater than 85% 74A,C,G,T (the '-n' option can be used to indicate DNA sequences). 75TFASTA compares protein sequences to a translated DNA sequence. 76Alternative scoring matrices can also be used. In addition to 77the BLOSUM50 matrix for proteins, the PAM250 matrix or matrices 78based on simple identities or the genetic code can also be used 79for sequence comparisons or evaluation of significance. Several 80different protein sequence matrices have been included; 81instructions for constructing your own scoring matrix are 82included in the file FORMAT.DOC. 83 84 85The remainder of this document is divided into three sections: 86(1) a brief history of the changes to the FASTA package; (2) A 87guide to installing the programs and databases; (3) A guide to 88using the FASTA programs. The programs are very easy to use, so 89if you are using them on a machine that is administered by 90someone else, you may want to skip to section (3) to learn how to 91use the programs, and then read section (1) to look at some of 92the more recent changes. If you are installing the programs on 93your own machine, you will need to read section (2) carefully. 94 95 961. Revision History 97 981.1. Changes with version 2.0u 99 100 Version 2.0u provides several major improvements over 101previous versions of FASTA (and SSEARCH). The most important is 102the incorporation of explicit statistical estimates and 103appropriate normalization of similarity scores. This improvement 104is discussed in more detail below in the section entitled 105Statistical Significance. In addition, all of the protein 106comparison programs now use the BLOSUM50 matrix, with gap 107penalties of -12, -2, by default. BLOSUM50 performs 108significantly better than the older PAM250 matrix. PAM250 can 109still be used with the command line option: -s 250. (DNA 110sequence comparisons use a more stringent gap penalty of -16, -4, 111which produces excellent statistical estimates when optimized 112scores are used. TFASTA uses -16, -4 as well.) 113 114 The quality of the fit of the extreme value distribution to 115the actual distribution of similarity scores is summarized with 116the Kolmogorov-Smirnov statistic. The acceptance limits for this 117statistic can be found in many statistics books. In general, 118values <0.10 (N=30) indicate excellent agreement between the 119actual and theoretical distributions. If this statistic is > 1200.2, consider using a higher (more stringent) gap penalty, e.g. 121-16, -4 rather than -12, -2. The default scoring matrix for DNA 122has been changed to score +5 for an identity and -4 for a 123mismatch. These are the same scores used by BLASTN. 124 125 With explicit expectation calculations, the program now 126shows all scores and alignments with expectations less than 10.0 127(with optimized scores, 2.0 without optimization) when the "-Q" 128(quiet) mode is used. The expectation threshold can be changed 129with the "-E" option. 130 131 Finally, the algorithm used to produce the final alignments 132of protein sequences is now a full Smith-Waterman, with unlimited 133gaps. (The older band-limited alignments are used for DNA 134sequences and TFASTA by default, because Smith-Waterman 135alignments are very slow for long sequences.) Both the optimized 136and Smith-Waterman scores are reported; if the Smith-Waterman 137score is higher, then additional gaps allowed a better alignment 138and similarity score to be calculated. 139 140 FASTA searches now optimize similarity scores by default 141(this slows searches about 2-fold (worst case) for ktup=2). Thus, 142the meaning of the "-o" option has been reversed; "-o" now turns 143off optimization and reports results sorted by "initn" scores. 144Optimization significantly improves the sensitivity of FASTA, so 145that it almost matches Smith-Waterman. With version 2.0, the 146default band width used for optimized calculations can be varied 147with the "-y" option. For proteins with ktup=2, a width of 16 148(-y 16) is used; 16 is also used for DNA sequences. For proteins 149and ktup=1, a width of 32 is used. Searches that disable 150optimization with the "-o" option will work fine for sequences 151that share 25% or more identity in general, but to detect 152evolutionary relationships with 20% - 25% identity, the more 153sensitive default optimization is often required. Optimization 154is required for accurate statistical estimates with either 155protein or DNA sequences. 156 157 The FASTA package now includes FASTX, a program that 158compares a DNA sequence to a protein sequence database by 159translating the DNA sequence in three frames (the reverse frames 160are selected with the -i option) and aligning the three-frame 161translation with the sequences in the protein database. 162Alignment scores allow frameshifts so that a cDNA or EST sequence 163with insertion/deletion errors can be aligned with its homologues 164from beginning to end. 165 166 With release 20u6, there is also a TFASTX program, which is 167a replacement for TFASTA. TFASTA treats each of the six reading 168frames of a DNA library sequence as a different sequence; TFASTX 169compares a protein sequence against only two sequences from each 170DNA sequence - the forward and reverse orientation. For a given 171orientation, TFASTX calculates a similarity score for alignments 172that allow frameshifts, thus considering all possible reading 173frames. 174 175 Another new program is included - randseq - which will 176produce a randomly shuffled (uniform or local shuffle) from an 177input sequence. This randomly shuffled sequence can be used to 178evaluate the statistical estimates produced by FASTA, SSEARCH, or 179BLAST. 180 1811.2. Changes with version 1.7 182Version 1.7 has been released to provide the PRDF and PRSS 183programs for shuffling sequences and estimating accurately the 184probabilities of the unshuffled-sequence scores. 185 186PRDF a version of RDF2 that uses calculates the probability 187 of a similarity score more accurately by using a fit to 188 an extreme value distribution. Code to fit the extreme 189 value distribution parameters and the impetus to update 190 RDF2 was provided by Phil Green, U. of Washington. 191 192PRSS a version of PRDF that uses a rigorous Smith-Waterman 193 calculation to score similarities 194 1951.3. Changes with version 1.6 196 197 FASTA version 1.6 uses a new method for calculating optimal 198scores in a band (the optimization or last step in the FASTA 199algorithm). In addition, it uses a linear-space method for 200calculating the actual alignments. FASTA v1.6 package includes 201several new programs: 202 203SSEARCH a program to search a sequence database using the 204 rigorous Smith-Waterman algorithm (this program is 205 about 100-fold slower than FASTA with ktup=2 (for 206 proteins). 207 208LALIGN A rigorous local sequence alignment program that will 209 display the N-best local alignments (N=10 by default). 210 211PLALIGN a version of lalign that plots the local alignments to 212 a postscript file. 213 214FLALIGN a version of lalign that plots the local alignments to 215 a GCG Figure file. 216 217 The LALIGN/PLALIGN/FLALIGN programs incorporate the "sim" 218algorithm described by Huang and Miller (1991) Adv. Appl. Math. 21912:337-357. The SSEARCH and PRSS programs incorporate algorithms 220described by Huang, Hardison, and Miller (1990) CABIOS 6:373-381. 221 222 LFASTA and PLFASTA now calculate a different number of local 223similarities; they now behave more like LALIGN/PLALIGN. Since 224local alignments of identical sequences produce "mirror-image" 225alignments, lalign and lfasta consider only one-half of the 226potential alignments between sequences from identical file names. 227Thus 228 229 lfasta mchu.aa mchu.aa 230 231Displays only two alignments, with earlier versions of the 232program, it would have displayed five, including the identity 233alignment. PLFASTA does display five alignments; when two 234identical filenames are given, it draws the identity alignment, 235calculates the two unique local alignments, draws them, and draws 236their mirror images. LFASTA/PLFASTA and LALIGN/PLALIGN use the 237filenames, rather than the actual sequences, to determine whether 238sequences are identical; you can "trick" the programs into 239behaving the old way by putting the same sequence in two 240different files. 241 2421.4. Changes with version 1.5 243 244 FASTA version 1.5 includes a number of substantial revisions 245to improve the performance and sensitivity of the program. It is 246now possible to tell the program to optimize all of the initn 247scores greater than a threshold. The threshold is set at the 248same value as the old FASTA cutoff score. Alternatively, you can 249tell FASTA to sort the results by the init1, rather than the 250initn, score by using the -1 option. FASTA -1 ... will report 251the results the way the older FASTP program did. 252 253 A new method has been provided for selecting libraries. In 254the past, one could enter the name of a sequence file to be 255searched or a single letter that would specify a library from the 256list included in the $FASTLIBS file. Now, you can specify a set 257of library files with a string of letters preceded by a '%'. 258Thus, if the FASTLIBS file has the lines: 259 260 Genbank 70 primates$1P/seqlib/gbpri.seq 1 261 Genbank 70 rodents$1R/seqlib/gbrod.seq 1 262 Genbank 70 other mammals$1M/seqlib/gbmam.seq 1 263 Genbank 70 vertebrates $1B/seqlib/gbvrt.seq 1 264 265Then the string: "%PRMB" would tell FASTA to search the four 266libraries listed above. The %PRMB string can be entered either 267on the command line or when the program asks for a filename or 268library letter. 269 270 FASTA1.5 also provides additional flexibility for specifying 271the number of results and alignments to be displayed with the -Q 272(quiet) option. The -b number option allows you to specify the 273number of sequence scores to show when the search is finished. 274Thus 275 276 277 FASTA -b 100 ... 278 279 280tells the program to display the top 100 sequence scores. In the 281past, if you displayed 100 scores (in -Q mode), you would also 282have store 100 alignments. The -d option allows you to limit the 283number of alignments shown. FASTA -b 100 -d 20 would show 100 284scores and 20 alignments. 285 286 Finally, FASTA can provide a complete list of all of the 287sequences and scores calculated to a file with the -r (results) 288option. FASTA -r results.out ... creates a file with a list of 289scores for every sequence in the library. The list is not 290sorted, and only includes those scores calculated during the 291initial scan of the library. 292 2932. Installing the FASTA package 294 2952.1. Installing the programs 296 2972.1.1. Unix version 298 299 The FASTA distribution comes with several makefile's that 300can be used to compile the FASTA programs. Over the years, as 301ATT Unix System 5 and BSD unix have converged, these files have 302become very similar. To begin with, I recommend using the 303standard Makefile. There are two values in the makefile that 304should be checked against the values used on your system: the HZ 305value, which is the frequency in ticks per second used by the 306times() system call, this value can usually be found by running: 307 308 grep HZ /usr/include/sys/* 309 310and the functions available to return random numbers. If you 311have a rand48() function that returns a 32-bit random number, use 312it and use the lines: 313 314 NRAND=nrand48 315 RANFLG= -DRAND32 316 317If not, you will need to use the rand() function call and 318determine whether it returns a 16-bit or a 32-bit value. These 319functions are used by PRDF and PRSS. If you have problems 320compiling the programs, you may want to examine the makefile.unx 321and makefile.sun files, to look for differences. I have tried to 322use very standard unix functions in these programs, and they have 323been successfully compiled, with very small changes to the 324Makefile, on Sun's (Sun OS 4.1), IBM RS/6000's (AIX), and MIPS 325machines (under the BSD environment). 326 3272.1.2. IBM-PC/DOS version 328 329 For the IBM-PC/DOS version, the FASTA source code disk 330contains the complete source code to all of the programs on the 331other disks. The programs were compiled with Borland's Turbo 332'C++', using Borland's MAKE utility. The graphics programs 333(PLFASTA, TGREASE) use the graphics device drivers supplied with 334the Turbo 'C' V2.0 package. Also included are the documentation 335files PROGRAMS.DOC and FORMAT.DOC. You do not need any of the 336files the source code disk to run the programs. The files on 337this disk are identical to the UNIX and VMS versions that run on 338larger machines. Also included is the code to compile 339ALIGN0.EXE. ALIGN0 is the same as ALIGN, but does not penalize 340for end-gaps. 341 342 If you have the DOS or Macintosh version of the FASTA 343package, to install the programs you should: 344 345 (1) Make a new directory (folder) for the FASTA programs. 346 This need not be the same as the directory for your 347 sequence databases. 348 349 (2) Copy the files from the FASTA source disk to the new 350 directory. 351 352 (3) (DOS only) Edit your AUTOEXEC.BAT file to (a) modify your 353 PATH command to include the FASTA directory and (b) add 354 the line: 355 356 set FASTLIBS=c:\yourfastadirectory\fastgbs 357 358 On the Macintosh, you may need to edit the "environment" 359 file and change the line that reads: 360 361 FASTLIBS=fastgbs 362 363 to indicate the full directory path for the fastgbs file, 364 for example: 365 366 FASTLIBS=Q105:FASTA:fastgbs 367 368 369 (4) Finally, you will need to edit the fastgbs file. This is 370 usually the most confusing part of the installation. An 371 example of this file is shown below; to customize this 372 file for your machine, you will need to change the file 373 names from those provided in the fastgbs file to ones that 374 reflect the directory names and file names you use on your 375 machine. This is explained in more detail below. In 376 addition, some entries in the fastgbs file refer to other 377 files of file names. These files of file names (as 378 opposed to actual database files) may also need to be 379 edited. 380 3812.2. Installing the libraries 382 3832.2.1. The NBRF protein sequence library 384 385 The FASTA program package does not include any protein or 386DNA sequence libraries. You can obtain the PIR protein sequence 387database from: 388 389 National Biomedical Research Foundation 390 Georgetown University Medical Center 391 3900 Reservoir Rd, N.W. 392 Washington, D.C. 20007 393 394In addition, this database is available via anonymous ftp from 395the host "ftp.bchs.uh.edu". It is available in two formats, VMS 396and CODATA format. The "VMS" format (library type 5 below) can 397be searched much faster, can be easily reformatted for use by the 398"BLAST" rapid searching program, and is compatible with the 399Genetics Computer Group package of programs. The CODATA format 400is used by the EUGENE/MBIR computing package from Baylor (library 401type 2). 402 4032.2.2. The GENBANK DNA sequence library 404 405 FASTA, and TFASTA search sequences from the GENBANK 406"flatfile" (not ASN.1) DNA sequence library in the flat-file 407format distributed by the National Center for Biotechnology 408Information and the PIR format used by EBI/EMBL. CD-ROMs can be 409obtained from: 410 411 Genbank 412 National Center for Biotechnology Information 413 National Library of Medicine 414 National Institutes of Health 415 8600 Rockville Pike 416 Bethesda, MD 20894 417 418 419 The GenBank DNA sequence library is also available via 420anonymous FTP from ncbi.nlm.nih.gov. 421 4222.2.3. The EBI/EMBL CD-ROM libraries 423 424 The European Bioinformatics Institute (EBI) is now 425distributing the EMBL CD-ROM that contains both the complete EMBL 426DNA sequence database (which should be essentially identical to 427the GenBank DNA sequence database) and the SWISS-PROT protein 428sequence database. SWISS-PROT is derived from the NBRF Protein 429sequence database with additions from the EBI/EMBL DNA sequence 430database. This CD-ROM is a "best-buy," since it provides both 431DNA and protein sequence libraries. It is available from: 432 433 434 European Bioinformatics Institute 435 Hinxton Genome Campus, Hinxton Hall 436 Hinxton, Cambridge CB10 1RQ, 437 United Kingdom 438 Tel: +44 1223 4944 439 Fax: +44 1223 494468 440 Email: DATALIB@ebi.ac.uk 441 442 443 444 In addition, the SWISS-PROT protein sequence database is 445available via anonymous FTP from ncbi.nlm.nih.gov. 446 4472.3. Finding the libraries: FASTLIBS 448 449 FASTA and TFASTA use the environment variable FASTLIBS to 450find the protein and DNA sequence libraries. The FASTLIBS 451variable contains the name of a file that has the actual 452filenames of the libraries. The FASTGBS file on is an example of 453a file that can be referred to by FASTLIBS. To use the FASTGBS 454file, type: 455 456 setenv FASTLIBS /usr/lib/fasta/fastgbs (BSD UNIX/csh) 457 or 458 export FASTLIBS=/usr/lib/fasta/fastgbs (SysV UNIX/ksh) 459 460Then edit the FASTGBS file to indicate where the protein and DNA 461sequence libraries can be found. If you have a hard disk and 462your protein sequence library is kept in the file 463/usr/lib/aabank.lib and your Genbank DNA sequence library is kept 464in the directory: /usr/lib/genbank, then fastgbs might contain: 465 466 NBRF Protein$0P/usr/lib/seq/aabank.lib 0 467 SWISS PROT 10$0S/usr/lib/vmspir/swiss.seq 5 468 GB Primate$1P@/usr/lib/genbank/gpri.nam 469 GB Rodent$1R@/usr/lib/genbank/grod.nam 470 GB Mammal$1M@/usr/lib/genbank/gmammal.nam 471 ^ 1 ^^^^ 4 ^ ^ 472 23 (5) 473 474The first line of this file says that there is a copy of the NBRF 475protein sequence database (which is a protein database) that can 476be selected by typing "P" on the command line or when the 477database menu is presented in the file /usr/lib/seq/aabank.lib. 478 479 Note that there are 4 or 5 fields in the lines in fastgbs. 480The first field is the description of the library which will be 481displayed by FASTA; it ends with a '$'. The second field (1 482character), is a 0 if the library is a protein library and 1 if 483it is a DNA library. The third field (1 character) is the 484character to be typed to select the library. 485 486 The fourth field is the name of the library file. In the 487example above, the /usr/lib/seq/aabank.lib file contains the 488entire protein sequence library. However the DNA library file 489names are preceded by a '@', because these files (gpri.nam, 490grod.nam, gmammal.nam) do not contain the sequences; instead they 491contain the names of the files which contain the sequences. This 492is done because the GENBANK DNA database is broken down in to a 493large number of smaller files. In order to search the entire 494primate database, you must search more than a dozen files. 495 496 In addition, an optional fifth field can be used to specify 497the format of the library file. Alternatively, you can specify 498the library format in a file of file names (a file preceded by an 499'@'). This field must be separated from the file name by a space 500character (' ') from the filename. In the example above, the 501aabank.lib file is in Pearson/FASTA format, while the swiss.seq 502file is in PIR/VMS format (from the EMBL CD-ROM). Currently, 503FASTA can read the following formats: 504 505 0 Pearson/FASTA (>SEQID - comment/sequence) 506 1 Uncompressed Genbank (LOCUS/DEFINITION/ORIGIN) 507 2 NBRF CODATA (ENTRY/SEQUENCE) 508 3 EMBL/SWISS-PROT (ID/DE/SQ) 509 4 Intelligenetics (;comment/SEQID/sequence) 510 5 NBRF/PIR VMS (>P1;SEQID/comment/sequence) 511 6 GCG (version 8.0) Unix Protein and DNA (compressed) 512 11 NCBI Blast1.3.2 format (unix only) 513 514In particular, this version will work with the EMBL and PIR VMS 515formats that are distributed on the EMBL CD-ROM. The latter 516format (PIR VMS) is much faster to search than EMBL format. This 517release also works with the protein and DNA database formats 518created for the BLASTP and BLASTN programs by SETDB and PRESSDB 519and with the new NCBI search format. If a library format is not 520specified, for example, because you are just comparing two 521sequences, Pearson/FASTA (format 0) is used by default. To 522change this default, you may set the LIBTYPE environment variable 523to a number. For example, 524 525 setenv LIBTYPE 1 526 527would cause the program to use the GenBank LOCUS format by 528default for libraries (or the second sequence file), but the 529Pearson/FASTA format would still be used for the query sequence. 530 531 You can specify a group of library files by putting a '@' 532symbol before a file that contains a list of file names to be 533searched. For example, if @gmam.nam is in the fastgbs file, the 534file "gmam.nam" might contain the lines: 535 536 </usr/lib/genbank 537 gbpri.seq 1 538 gbrod.seq 1 539 gbmam.seq 1 540 541In this case, the line beginning with a '<' indicates the 542directory the files will be found in. The remaining lines name 543the actual sequence files. So the first sequence file to be 544searched would be: 545 546 /usr/lib/genbank/gbpri.seq 547 548The notation "<PIRNAQ:" might be used under the VAX/VMS operating 549system. Under UNIX, the trailing '/' is left off, so the library 550directory might be written as "</usr/seqlib". 551 552 With version 1.4 of the FASTA package, the FASTA and TFASTA 553programs can search a library composed of different files in 554different sequence formats. For example, you may wish to search 555the Genbank files (in GenBank flat file format) and the EMBL DNA 556sequence database on CD-ROM. To do this, you simply list the 557names and filetypes of the files to be searched in a file of 558filenames. For example, to search the mammalian portion of 559Genbank, the unannotated portion of Genbank, and the unannotated 560portion of the EMBL library, you could use the file: 561 562 </usr/lib/DNA 563 gbpri.seq 1 564 # (this '#' causes the program to display the size of the library) 565 gbrod.seq 1 566 gbmam.seq 1 567 gbuna.seq 1 568 unanno.seq 5 569 # 570 571 You do not need to include library format numbers if you 572 only use the Pearson/FASTA version of the PIR protein se- 573 quence library. If no library type is specified, the 574 program assumes that type 0 is being used (unless you 575 have set LIBTYPE). 576 577Support for the old compressed GenBank files, which have not been 578distributed for more than four years, has been removed from 579programs in the FASTA package. 580 581 582 Test the setup by running FASTA. Enter the sequence file 583'MUSPLFM.AA' when the program requests it (this file is included 584with the programs). The program should then ask you to select a 585protein sequence library. Alternatively, if you run the TFASTA 586program and use the MUSPLFM.AA query sequence, the program should 587show you a selection of DNA sequence libraries. Once the fastgbs 588file has been set up correctly, you can set FASTLIBS=fastgbs in 589your AUTOEXEC.BAT file, and you will not need to remember where 590the libraries are kept or how they are named. 591 592 FASTA and TFASTA must open a large number of files when 593searching and reporting the results of a GENBANK floppy disk 594format library search. You may have problems with the large 595number of files under DOS on IBM-PC's (Unix and VMS users will 596not have these problems). If you are going to search the GENBANK 597floppy disk format DNA sequence library under DOS, you should add 598the line: 599 600 FILES=16 601 602to your CONFIG.SYS file. (Typically this is already done for 603programs like Windows or WordPerfect.) 604 6053. Using the FASTA Package 606 6073.1. Overview 608 609 The FASTA sequence comparison programs all require similar 610information, the name of a query sequence file, a library file, 611and the ktup parameter. All of the programs can accept arguments 612on the command line, or they will prompt for the file names and 613ktup value. 614 615To use FASTA, simply type: 616 617 FASTA 618 and you will be prompted for : 619 the name of the test sequence file 620 the name of the library file 621 and whether you want ktup = 1 or 2. (or 1 to 6 for DNA sequences) 622 623 ktup of 2 is about 5 times faster than ktup = 1. 624 For a 200 aa sequence against a 10,000,000 aa 625 library, the program takes about 30 min with 626 ktup = 2, 150 min with ktup = 1, on a 12 Mhz 286 627 IBM-PC. 628 629 630The program can also be run by typing 631 632 FASTA test.aa /lib/bigfile.lib ktup (1 or 2) 633 634 635Included with the package are the test files, MUSPLFM.AA, 636LCBO.AA, MCHU.AA and BOVPRL.SEQ. To check to make certain that 637everything is working, you can try: 638 639 fasta musplfm.aa lcbo.aa 640 and 641 tfasta musplfm.aa bovprl.seq 642 643To test the local similarity programs LFASTA and PLFASTA, try: 644 645 lfasta mchu.aa mchu.aa 646 and 647 plfasta mchu.aa mchu.aa 648 649MCHU (calmodulin) has four duplicated calcium binding sites that 650are clearly detected by LFASTA. For a more complicated example, 651try MWRTC1.aa, myosin heavy chain. 652 6533.2. Sequence files 654 655 The FASTA programs know about three kinds of sequence files 656(four under VMS): (1) plain sequence files that can only be used 657as query sequences or for LFASTA, PRDF, and ALIGN. (2) Standard 658library files. These are the same as plain sequence files, each 659sequence is preceded by a comment line with a '>' in the first 660column. (3) distributed sequence libraries (this is a broad class 661that includes the NBRF/PIR VMS and blocked ascii formats, Genbank 662flat-file format, EMBL flat-file format, and Intelligenetics 663format. All of the files that you create should be of type (1) 664or (2). Type (2) files (ones with a be used as query or library 665sequence files by all of the programs. 666 667 I have included several sample test files, *.AA. The first 668line may begin with a '>' or ';' followed by a comment. The 669text after ';' in other lines will be ignored. Spaces and 670tabs (and anything else that is not an amino-acid code) are 671ignored. 672 673 Library files should have the form: 674 675 >Sequence name and identifier 676 A F A S Y T .... actual sequence. 677 F S S .... second line of sequence. 678 >Next sequence name and identifier 679 680This is often referred to as "FASTA" or "Pearson" format. You 681can build your own library by concatenating several sequence 682files. Just be sure that each sequence is preceded by a line 683beginning with a '>' with a sequence name. 684 685 The test file should not have lines longer than 120 686characters, and sequences entered with word processors should use 687a document mode, with normal carriage returns at the end of 688lines. 689 690Program Summary 691 6923.3. Sequence search programs 693 694FASTA universal sequence comparison. Defaults to comparing 695 protein sequences; if the sequences are > 85% A+C+G+T 696 or the -n option is used, a DNA sequence is assumed. 697 698FASTX Search a protein sequence library using amino acid 699 sequence comparison to the forward three frames of a 700 translated DNA query sequence. (The reverse frames are 701 specified with the -i option.) Alignment scores allow 702 frameshifts; the final alignment uses a Smith-Waterman 703 type alignment routine (no limit on gaps) that allows 704 frameshifts. 705 706TFASTA Search DNA library for a protein sequence by 707 translating the DNA sequence to protein in all six 708 frames (three forward frames with the -3 command line 709 option). TFASTA with ktup=2 is about as fast as a DNA 710 FASTA with ktup=4, and is substantially more sensitive. 711 (also reads the GENBANK library) 712 713TFASTX Search DNA library for a protein sequence by 714 translating the DNA sequence to protein in all six 715 frames (three forward frames with the -3 command line 716 option) calculating similarity scores that allow 717 frameshifts. TFASTX produces an optimal Smith-Waterman 718 alignment of the query and translated-library sequence. 719 720SSEARCH Universal sequence comparison using the Smith-Waterman 721 algorithm ( T. F. Smith and M. S. Waterman (1981) J. 722 Mol. Biol. 147:195-197). This program uses code 723 developed by Huang and Miller (X. Huang, R. C. 724 Hardison, W. Miller (1990) CABIOS 6:373-381) for 725 calculating the local similarity score and code from 726 the ALIGN program (see below) for calculating the local 727 alignment. SSEARCH is about 50-times slower than FASTA 728 with ktup=2 (for proteins). 729 730ALIGN optimal global alignment of two sequences with no 731 short-cuts. This program is a slightly modified 732 version of one taken from E. Myers and W. Miller. The 733 algorithm is described in E. Myers and W. Miller, 734 "Optimal Alignments in Linear Space" (CABIOS (1988) 735 4:11-17). 736 7373.4. Local similarity programs 738 739LFASTA local similarity searches showing local alignments. 740 The algorithm used to calculate the local alignment in 741 a band has been improved (Chao, Pearson, and Miller, 742 submitted). 743 744PLFASTA local similarity searches with plot output (on the IBM, 745 this program requires that the environment variable 746 BGIDIR be set). 747 748PCLFASTA (unix only) local similarity searches with plot output 749 using pic commands. 750 751LALIGN Calculates the N-best local alignments using a rigorous 752 algorithm. (N=10 by default.) The algorithm was 753 developed by Huang and Miller (X. Huang and W. Miller 754 (1991) Adv. Appl. Math. 12:337-357), which is a 755 linear-space version of an algorithm described by M. S. 756 Waterman and M. Eggert (J. Mol. Biol. 197:723-728). 757 Like SSEARCH, LALIGN is rigorous, but also very slow. 758 759PLALIGN A version of LALIGN that plots its output to a 760 postscript file. 761 7623.5. Statistical Significance 763 764 With version 2.0 of the FASTA program distribution, FASTA, 765TFASTA, and SSEARCH now provide estimates of statistical 766significance for library searches. Work by Altschul, Arratia, 767Karlin, Mott, Waterman, and others (see Altschul et al. (1994) 768Nature Genetics 6:119 for an excellent review) suggests that 769local sequence similarity scores follow the extreme value 770distribution, so that P(s > x) = 1 - exp(-exp(-lambda(x-u)) where 771u = ln(Kmn)/lambda and m,m are the lengths of the query and 772library sequence. This formula can be rewritten as: 1 - exp(-Kmn 773exp(-lambda x), which shows that the average score for an 774unrelated library sequence increases with the logarithm of the 775length of the library sequence. FASTA and SSEARCH use simple 776linear regression against the the log of the library sequence 777length to calculate a normalized "z-score" with mean 50, 778regardless of library sequence length, and variance 10. These 779z-scores can then be used with the extreme value distribution and 780the poisson distribution (to account for the fact that each 781library sequence comparison is an independent test) to calculate 782the number of library sequences to obtain a score greater than or 783equal to the score obtained in the search. The original idea and 784routines to do the linear regression on library sequence length 785were provided Phil Green, U. Washington. This version of FASTA 786and SSEARCH uses a slightly different strategy for fitting the 787data than those originally provided by Dr. Green. 788 789 The expected number of sequences is plotted in the histogram 790using an "*". Since the parameters for the extreme value 791distribution are not calculated directly from the distribution of 792similarity scores, the pattern of "*'s" in the histogram gives a 793qualitative view of how well the statistical theory fits the 794similarity scores calculated by FASTA and SSEARCH. For FASTA, if 795optimized scores are calculated for each sequence in the database 796(the default), the agreement between the actual distribution of 797"z-scores" and the expected distribution based on the length 798dependence of the score and the extreme value distribution is 799usually very good. Likewise, the distribution of SSEARCH Smith- 800Waterman scores typically agrees closely with the actual 801distribution of "z-scores." The agreement with unoptimized 802scores, ktup=2, is often not very good, with too many high 803scoring sequences and too few low scoring sequences compared with 804the predicted relationship between sequence length and similarity 805score. In those cases, the expectation values may be 806overestimates. 807 808 The statistical routines assume that the library contains a 809large sample of unrelated sequences. If this is not the case, 810then the expectation values are meaningless. Likewise, if there 811are fewer than 20 sequences in the library, the statistical 812calculations are not done. 813 814 For protein searches, library sequences with E() values < 8150.01 for searches of a 10,000 entry protein database are almost 816always homologous. Frequently sequences with E()-values from 1 - 81710 are related as well. Remember, however, that these E() values 818also reflect differences between the amino acid composition of 819the query sequence and that of the "average" library sequence. 820Thus, when searches are done with query sequences with "biased" 821amino-acid composition, unrelated sequences may have 822"significant" scores because of sequence bias. The programs 823below, PRDF and PRSS, can address this problem by calculating 824similarity scores for random sequences with the same length and 825amino acid composition. 826 827 If optimization is not used ("-o"), E-values for DNA 828sequences overestimate the significance of the scores that are 829obtained and unrelated sequences frequently have E()-values < 8300.0005. With optimization, the agreement between E()-value 831compares favorably with protein sequence comparison. This is in 832part due to the use of more stringent gap penalties for DNA 833sequence comparison, -16, -4 rather than -12, -2. With the 834latter penalties, many unrelated sequences appear to have 835significant similarity. Nevertheless, since protein sequence 836comparison is much more sensitive, DNA sequence comparison should 837not be used to identify sequences that encode protein. Even with 838ktup=6, optimization rarely increases run-times more than 50% 839with mRNA-size query sequences. Optimization should be used 840whenever possible. 841 842 Similar comments apply to TFASTA, where higher gap 843penalties (-16,-4) are required for accurate statistical 844estimates. Because TFASTA produces so many artificial "coding" 845sequences with atypical amino acid compositions, the statistical 846estimates with TFASTA are often over estimates. With optimized 847scores, ktup=1, and gap penalties of -16, -4, unrelated sequences 848will sometimes have E() values of 0.1. If initn scores are used, 849unrelated sequences may have have E() values < 0.01. 850 851PRDF improved version of RDF program that includes accurate 852 probability estimates for all three scoring methods 853 (includes local or window shuffle routine) 854 855PRSS A version of PRDF that uses the rigorous Smith-Waterman 856 calculation used by SSEARCH. 857 858RANDSEQ produces a randomly shuffled sequence from a query 859 sequence. 860 861RELATE significance program described by Dayhoff (Atlas of 862 Protein Sequence and Structure, Vol. 5, Supplement 3). 863 Each chunk of 25 residues in one sequence is compared 864 to every 25 residue fragment of the second sequence. 865 Sequences which are genuinely related will have a large 866 number of scores greater than 3 standard deviations 867 above the mean score of all of the comparisons. 868 8693.6. Other analysis programs 870 871AACOMP calculate the amino acid composition and molecular 872 weight of a sequence. 873 874BESTSCOR calculate the best self-comparison score. 875 876GREASE Kyte-Doolittle hydropathicity profile 877 878TGREASE graphic plot of Kyte-Doolittle profile 879 880FROMGB convert from GenBank LOCUS format (also used by the 881 IBI-Pustell programs) to Pearson/FASTA format. 882 883GARNIER A secondary structure prediction program using the 884 method of Garnier, Osgusthorpe, and Robson, J. Mol. 885 Biol., (1978) 120:97-120. 886 8873.7. Options 888 889 These programs have a number of output options, which are 890invoked by the environment variables LINLEN, SHOWALL, and MARKX. 891Alternatively, these values can be controlled by command line 892options. The number of sequence residues per output line is now 893adjustable by setting the environment variable LINLEN, or the 894command line option -w. LINLEN is normally 60, to change it set 895LINLEN=80 before running the program or add -w 80 to the command 896line. LINLEN can be set up to 200. SHOWALL (-a) determines 897whether all, or just a portion, of the aligned sequences are 898displayed. Previously, FASTP would show the entire length of 899both sequences in an alignment while FASTN would only show the 900portions of the two sequences that overlapped. Now the default is 901to show only the overlap between the two sequences, to show 902complete sequences, set SHOWALL=1, or use the -a option on the 903command line. 904 905 The differences between the two aligned sequences can be 906highlighted in three different ways by changing the environment 907variable MARKX or the -m option. Normally (MARKX=0) the program 908uses ':' do denote identities and '.' to denote conservative 909replacements. If MARKX=1, the program will not mark identities; 910instead conservative replacements are denoted by a 'x' and non- 911conservative substitutions by a 'X'. If MARKX=2, the residues in 912the second sequence are only shown if they are different from the 913first. MARKX=3 displays the aligned library sequences without the 914query sequence; these can be used to build a primitive multiple 915alignment. MARKX=4 provides a graphical display of the 916boundaries of the alignments. Thus the five options are: 917 918 919 MARKX=0 MARKX=1 MARKX=2 MARKX=3 MARKX=4 920 921 MWRTCGPPYT MWRTCGPPYT MWRTCGPPYT MWRTCGPPYT 922 ::..:: ::: xx X ..KS..Y... MWKSCGYPYT ---------- 923 MWKSCGYPYT MWKSCGYPYT 924 925 926(fasta20u4, Feb. 1996) In addition MARKX=10 is a new, parseable 927format for use with other programs. See the file"readme.v20u4" 928for a more complete description. 929 9303.8. Command line options 931 932 It is now possible to specify several options on the 933command line, instead of using environment variables. The 934command line options are preceded by a dash; the following 935options are available: 936 937-a same as showall=1 938 939-A force Smith-Waterman alignments for DNA sequences and 940 TFASA. By default, only FASTA protein sequence 941 comparisons use Smith-Waterman alignments. 942 943-b # Number of sequence scores to be shown on output. In 944 the absence of this option, fasta (and tfasta and 945 ssearch) display all library sequences obtaining 946 similarity scores with expectations less than 10.0 if 947 optimized score are used, or 2.0 if they are not. The 948 -b option can limit the display further, but it will 949 not cause additional sequences to be displayed. 950 951-c # Threshold score for optimization (OPTCUT). Set "-c 1" 952 to optimize every sequence in a database. (This slows 953 the program down about 5-fold). 954 955-E # Limit the number of scores and alignments shown based 956 on the expected number of scores. Used to override the 957 expectation value of 10.0 used by default. When used 958 with -Q, -E 2.0 will show all library sequences with 959 scores with an expectation value <= 2.0. 960 961-d # Number of alignments to be reported by default. (Used 962 in conjunction with -Q). No longer necessary, see "-b" 963 above. 964 965-f Penalty for the first residue in a gap (-12 by default 966 for proteins, -16 for DNA or for TFASTA). 967 968-g Penalty for additional residues in a gap (-2 by default 969 for proteins, -4 for DNA and TFASTA ). 970 971-h Penalty for frameshift (FASTX, TFASTX only). 972 973-H Omit histogram. 974 975-i Invert (reverse complement) the query sequence if it is 976 DNA. For TFASTX, search the reverse complement of the 977 library sequence only. 978 979-k # Threshold for joining init1 segments to build an initn 980 score (GAPCUT). 981 982-l file Location of library menu file (FASTLIBS). 983 984-L Display more information about the library sequence in 985 the alignment. 986 987-m # MARKX = # (0, 1, 2, 3, 4, 10) 988 989-n Force the query sequence to be treated as a DNA 990 sequence. This is particularly useful for query 991 sequences that contain a large number of ambiguous 992 residues, e.g. transcription factor binding sites. 993 994-O Send copy of results to "filename." Helpful for 995 environments without STDOUT. 996 997-o Turn off default optimization of all scores greater 998 than OPTCUT. Sort results by "initn" scores. 999 1000-Q,-q Quiet - does not prompt for any input. Writes scores 1001 and alignments to the terminal or standard output file. 1002 1003-r file Save a results summary line for every sequence in the 1004 sequence library. The summary line includes the 1005 sequence identifier, superfamily number (if available) 1006 position in the library, and the similarity scores 1007 calculated. This option can be used to evaluate the 1008 sensitivity and selectivity of different search 1009 strategies (see W. R. Pearson (1991) Genomics 11:635- 1010 650.) 1011 1012-s file SMATRIX is read from file. Several SMATRIX files are 1013 provided with the standard distribution. For protein 1014 sequences: codaa.mat - based on minimum mutation 1015 matrix; idnaa.mat - identity matrix; pam250.mat - the 1016 PAM250 matrix developed by Dayhoff et al (Atlas of 1017 Protein Sequence and Structure, vol. 5, suppl. 3, 1018 1978); pam120.mat - a PAM120 matrix. The default 1019 scoring matrix is BLOSUM50, PAM250 is available with 1020 "-s 250", BLOSUM62 ("-s BL62") is also available. 1021 1022-v (LINEVAL) values used for line styles in plfasta 1023 1024-w # Line length (width) = number (<200) 1025 1026-x Specifies offsets for the beginning of the query and 1027 library sequence. For example, if you are comparing 1028 upstream regions for two genes, and the first sequence 1029 contains 500 nt of upstream sequence while the second 1030 contains 300 nt of upstream sequence, you might try: 1031 1032 fasta -x "-500 -300" seq1.nt seq2.nt 1033 1034 If the -x option is not used, FASTA assumes numbering 1035 starts with 1. This option will not work properly with 1036 the translated library sequence with tfasta. (You 1037 should double check to be certain the negative 1038 numbering works properly.) 1039 1040-y Set the width of the band used for calculating 1041 "optimized" scores. For proteins and ktup=2, the width 1042 is 16. For proteins with ktup=1, the width is 32 by 1043 default. For DNA the width is 16. 1044 1045-z Turn off statistical calculations. 1046 1047-1 sort output by init1 score (as FASTP used to do). 1048 1049-3 (TFASTA, TFASTX only) translate only three forward 1050 frames 1051 1052 1053For example: 1054 1055 fasta -w 80 -a seq1.aa seq.aa 1056 1057would compare the sequence in seq1.aa to that in seq2.aa and 1058display the results with 80 residues on an output line, showing 1059all of the residues in both sequences. Be sure to enter the 1060options before entering the file names, or just enter the options 1061on the command line, and the program will prompt for the file 1062names. 1063 1064 Not all of these options are appropriate for all of the 1065programs. The options above are used by FASTA and TFASTA. RELATE 1066uses the -s option, ALIGN uses the -w, -m, and -s options, and 1067the PRDF program uses -c, -f, -k, and -s. 1068 10694. Environment variable summary 1070 1071 Environment variables allow you to set search parameters 1072that will be used frequently when you run a program; for example, 1073if you prefer to use the PAM250 scoring matrix, you might "set 1074SMATRIX=250." Command line parameters, if used, always override 1075environment variable settings. The following environment 1076variables are used by this program: 1077 1078AABANK the file name of the default sequence library. 1079 1080FASTLIBS the location of the file which contains the list of 1081 library files to be searched. 1082 1083GAPCUT threshold used for joining init1 regions in the second 1084 step of FASTA. Normally set based on sequence length 1085 and ktup. 1086 1087LIBTYPE used to specify the format of the library sequence for 1088 FASTA and TFASTA. 1089 1090LINLEN output line length - can go up to 200 1091 1092LINEVAL used by plfasta to determine the relationship between 1093 line style and similarity score (-v). This should be a 1094 string of three numbers, e.g. "200 100 50" 1095 1096MARKX symbol for denoting matches, mismatches. Note that this 1097 symbol is only used across the optimized local region; 1098 sequences that are outside this region are not marked. 1099 1100OPTCUT Set the threshold to be used for optimization in a band 1101 around the best initial region. Normally the OPTCUT 1102 value is calculated from the length of the sequence and 1103 the ktup value (for a 200 residue sequence, it is about 1104 28). If OPTCUT=1, every sequence in the database will 1105 be optimized. This is the most sensitive option. 1106 1107PAMFACT This version of fasta uses a more sensitive method for 1108 identifying initial regions. Instead of using a 1109 constant factor (fact) for each match in a ktup, it 1110 uses the scoring matrix (PAM) scores. While this works 1111 well for protein sequences, it has not been as 1112 carefully tested for DNA sequences, so by default, this 1113 modification is used for proteins but not for DNA. 1114 Setting the PAMFACT environment variable to 1 forces 1115 the option on; PAMFACT=0 turns it off. 1116 1117SHOWALL on output, show the complete sequence instead of just 1118 the overlap of the two aligned sequences. 1119 1120SMATRIX alternative scoring matrix file. 1121 1122As always, please inform me of bugs as soon as possible. 1123 1124William R. Pearson 1125Department of Biochemistry 1126Box 440, Jordan Hall 1127U. of Virginia 1128Charlottesville, VA 1129 1130wrp@virginia.EDU 1131