1.nr pp 11
2.nr sp 11
3.nr tp 11
4.nr fp 10
5.nr fi 0n
6.sz 11
7.if t \{
8.po 1i
9.he 'FASTA.DOC''Release 2.0u4, February 1996'
10.fo ''- % -''
11\}
12.if n \{
13.po 0
14.na
15.nh
16\}
17.ll 6.5i
18.ce
19\fB\s+2COPYRIGHT NOTICE\s0\fP
20.lp
21Copyright 1988, 1991, 1992, 1994, 1995, 1996 by William R. Pearson and
22the University of Virginia.  All rights reserved. The FASTA program
23and documentation may not be sold or incorporated into a commercial
24product, in whole or in part, without written consent of William
25R. Pearson and the University of Virginia.  For further information
26regarding permission for use or reproduction, please contact: David
27Hudson, Assistant Provost for Research, University of Virginia,
28P.O. Box 9025, Charlottesville, VA 22906-9025, (804) 924-6853
29.sp
30.uh "\s+2The FASTA program package\s0"
31.uh "Introduction"
32.pp
33This documentation describes the version 2.0x of the FASTA program
34package (see W. R. Pearson and D. J. Lipman (1988), "Improved Tools
35for Biological Sequence Analysis", PNAS 85:2444-2448, and W. R.
36Pearson (1990) "Rapid and Sensitive Sequence Comparison with FASTP and
37FASTA" Methods in Enzymology 183:63-98). Version 2.0 modifies version
381.8 to include explicit statistical estimates for similarity scores
39based on the extreme value distribution.  In addition, FASTA protein
40alignments now use the Smith-Waterman algorithm with no limitation on
41gap size. FASTA and SSEARCH now use the BLOSUM50 matrix by default,
42with options to change gap penalties on the command line. Version 1.7
43replaces rdf2 and rss with prdf and prss, which use the extreme-value
44distribution to calculate accurate probability estimates.
45.sp
46.lp
47Although there are a large number of programs in this package, they
48belong to four groups:
49.(l
50
51Library search programs: FASTA, FASTX, TFASTA, TFASTX, SSEARCH
52
53Local homology programs: LFASTA, PLFASTA, LALIGN, PLALIGN, FLALIGN
54
55Statistical significance: PRDF, RELATE, PRSS, RANDSEQ
56
57Global alignment: ALIGN
58
59.)l
60.lp
61In addition, I have included several programs for protein sequence
62analysis, including a Kyte-Doolittle hydropathicity plotting program
63(GREASE, TGREASE), and a secondary structure prediction package
64(GARNIER).
65.pp
66The FASTA sequence comparison programs on this disk are improved
67versions of the FASTP program, originally described in Science (Lipman
68and Pearson, (1985) Science 227:1435-1441).  We have made several
69improvements.  First, the library search programs use a more sensitive
70method for the initial comparison of two sequences which allows the
71scores of several similar regions to be combined.  As a result, the
72results of a library search are now given with three scores,
73\f2initn\fP (the new initial score which may include several similar
74regions), \f2init1\fP (the old fastp initial score from the best
75initial region), and \f2opt\fP (the old fastp optimized score allowing
76gaps in a 32 residue wide band).
77.pp
78These programs have also been modified to become "universal" (hence
79FAST-A, for FASTA-All, as opposed to FAST-P (protein) or FAST-N
80(nucleotides)); by changing the environment variable SMATRIX, the
81programs can be used to search protein sequences, DNA sequences, or
82whatever you like.  By default, FASTA, LFASTA, and the PRDF programs
83automatically recognize protein and DNA sequences.  Sequences are
84first read as amino acids, and then converted to nucleotides if the
85sequence is greater than 85% A,C,G,T (the '-n' option can be used to
86indicate DNA sequences).  TFASTA compares protein sequences to a
87translated DNA sequence.  Alternative scoring matrices can also be
88used.  In addition to the BLOSUM50 matrix for proteins, the PAM250
89matrix or matrices based on simple identities or the genetic code can
90also be used for sequence comparisons or evaluation of significance.
91Several different protein sequence matrices have been included;
92instructions for constructing your own scoring matrix are included in
93the file FORMAT.DOC.
94.sp 2
95The remainder of this document is divided into three sections: (1) a
96brief history of the changes to the FASTA package; (2) A guide to
97installing the programs and databases; (3) A guide to using the FASTA
98programs. The programs are very easy to use, so if you are using them
99on a machine that is administered by someone else, you may want to
100skip to section (3) to learn how to use the programs, and then read
101section (1) to look at some of the more recent changes.  If you are
102installing the programs on your own machine, you will need to read
103section (2) carefully.
104.sp
105.sh 1 "Revision History"
106.sh 2 "Changes with version 2.0u"
107.pp
108Version 2.0u provides several major improvements over previous
109versions of FASTA (and SSEARCH).  The most important is the
110incorporation of explicit statistical estimates and appropriate
111normalization of similarity scores. This improvement is discussed in
112more detail below in the section entitled \(lqStatistical
113Significance\(rq.  In addition, all of the protein comparison programs
114now use the BLOSUM50 matrix, with gap penalties of -12, -2, by
115default.  BLOSUM50 performs significantly better than the older PAM250
116matrix.  PAM250 can still be used with the command line option: \(lq-s
117250\(rq.  (DNA sequence comparisons use a more stringent gap
118penalty of -16, -4, which produces excellent statistical estimates
119when optimized scores are used. TFASTA uses -16, -4 as well.)
120.pp
121The quality of the fit of the extreme value distribution to the actual
122distribution of similarity scores is summarized with the
123Kolmogorov-Smirnov statistic.  The acceptance limits for this
124statistic can be found in many statistics books.  In general, values
125<0.10 (N=30) indicate excellent agreement between the actual and
126theoretical distributions.  If this statistic is > 0.2, consider
127using a higher (more stringent) gap penalty, e.g. -16, -4 rather than
128-12, -2.  The default scoring matrix for DNA has been changed to score
129+5 for an identity and -4 for a mismatch.  These are the same scores
130used by BLASTN.
131.pp
132With explicit expectation calculations, the program now shows all
133scores and alignments with expectations less than 10.0 (with optimized
134scores, 2.0 without optimization) when the "-Q" (quiet) mode is used.
135The expectation threshold can be changed with the "-E" option.
136.pp
137Finally, the algorithm used to produce the final alignments of protein
138sequences is now a full Smith-Waterman, with unlimited gaps.  (The
139older band-limited alignments are used for DNA sequences and TFASTA by
140default, because Smith-Waterman alignments are very slow for long
141sequences.)  Both the \(lqoptimized\(rq and \(lqSmith-Waterman\(rq
142scores are reported; if the Smith-Waterman score is higher, then
143additional gaps allowed a better alignment and similarity score to be
144calculated.
145.pp
146FASTA searches now optimize similarity scores by default (this slows
147searches about 2-fold (worst case) for \fIktup=2\fP). Thus, the
148meaning of the "-o" option has been reversed; "-o" now turns off
149optimization and reports results sorted by "initn" scores.
150Optimization significantly improves the sensitivity of FASTA, so that
151it almost matches Smith-Waterman.  With version 2.0, the default band
152width used for optimized calculations can be varied with the "-y"
153option.  For proteins with ktup=2, a width of 16 (-y 16) is used; 16
154is also used for DNA sequences.  For proteins and ktup=1, a width of
15532 is used. Searches that disable optimization with the "-o" option
156will work fine for sequences that share 25% or more identity in
157general, but to detect evolutionary relationships with 20% \- 25%
158identity, the more sensitive default optimization is often required.
159Optimization is required for accurate statistical estimates with
160either protein or DNA sequences.
161.pp
162The FASTA package now includes \f(CBFASTX\fP, a program that compares
163a DNA sequence to a protein sequence database by translating the DNA
164sequence in three frames (the reverse frames are selected with the
165\fC-i\fB option) and aligning the three-frame translation with the
166sequences in the protein database.  Alignment scores allow frameshifts
167so that a cDNA or EST sequence with insertion/deletion errors can be
168aligned with its homologues from beginning to end.
169.pp
170With release 20u6, there is also a \f(CBTFASTX\fP program, which is a
171replacement for TFASTA.  TFASTA treats each of the six reading frames
172of a DNA library sequence as a different sequence; \f(CBTFASTX\fP
173compares a protein sequence against only two sequences from each DNA
174sequence \- the forward and reverse orientation.  For a given
175orientation, \f(CBTFASTX\fP calculates a similarity score for
176alignments that allow frameshifts, thus considering all possible
177reading frames.
178.pp
179Another new program is included - \f(CBrandseq\fP - which will produce a
180randomly shuffled (uniform or local shuffle) from an input sequence.
181This randomly shuffled sequence can be used to evaluate the
182statistical estimates produced by FASTA, SSEARCH, or BLAST.
183.sh 2 "Changes with version 1.7"
184.br
185Version 1.7 has been released to provide the PRDF and PRSS programs
186for shuffling sequences and estimating accurately the probabilities of
187the unshuffled-sequence scores.
188.ip "PRDF" 1i
189a version of RDF2 that uses calculates the probability of a similarity
190score more accurately by using a fit to an extreme value distribution.
191Code to fit the extreme value distribution parameters and the impetus
192to update RDF2 was provided by Phil Green, U. of Washington.
193.ip "PRSS" 1i
194a version of PRDF that uses a rigorous Smith-Waterman calculation to
195score similarities
196.sh 2 "Changes with version 1.6"
197.pp
198FASTA version 1.6 uses a new method for calculating optimal
199scores in a band (the optimization or last step in the FASTA
200algorithm). In addition, it uses a linear-space method for calculating
201the actual alignments.  FASTA v1.6 package includes several new
202programs:
203.ip "SSEARCH" 1i
204a program to search a sequence database using
205the rigorous Smith-Waterman algorithm (this
206program is about 100-fold slower than FASTA
207with ktup=2 (for proteins).
208.ip "LALIGN" 1i
209A rigorous local sequence alignment program that will display the
210N-best local alignments (N=10 by default).
211.ip "PLALIGN" 1i
212a version of lalign that plots the local alignments to postscript.
213.ip "FLALIGN" 1i
214a version of lalign that plots the local alignments to a GCG Figure
215file.
216.pp
217The LALIGN/PLALIGN/FLALIGN programs incorporate the "sim" algorithm
218described by Huang and Miller (1991) Adv. Appl. Math. 12:337-357.
219The SSEARCH and PRSS programs incorporate algorithms described by
220Huang, Hardison, and Miller (1990) CABIOS 6:373-381.
221.pp
222LFASTA and PLFASTA now calculate a different number of local
223similarities; they now behave more like LALIGN/PLALIGN.
224Since local alignments of identical sequences produce "mirror-image"
225alignments, lalign and lfasta consider only one-half of the potential
226alignments between sequences from identical file names.  Thus
227.(l I
228\fClfasta mchu.aa mchu.aa\fP
229.)l
230Displays only two alignments, with earlier versions of the program, it
231would have displayed five, including the identity alignment.  PLFASTA
232does display five alignments; when two identical filenames are given,
233it draws the identity alignment, calculates the two unique local
234alignments, draws them, and draws their mirror images.
235LFASTA/PLFASTA and LALIGN/PLALIGN use the filenames, rather than the
236actual sequences, to determine whether sequences are identical; you
237can "trick" the programs into behaving the old way by putting the same
238sequence in two different files.
239.sh 2 "Changes with version 1.5"
240.pp
241FASTA version 1.5 includes a number of substantial revisions to
242improve the performance and sensitivity of the program.  It is now
243possible to tell the program to optimize all of the \f2initn\fP scores
244greater than a threshold.  The threshold is set at the same value as
245the old FASTA cutoff score.  Alternatively, you can tell FASTA to sort
246the results by the \f2init1\fP, rather than the \f2initn\fP, score by
247using the \fC-1\fP option.  \fCFASTA -1 ...\fP will report the results
248the way the older FASTP program did.
249.pp
250A new method has been provided for selecting libraries. In the
251past, one could enter the name of a sequence file to be searched or a
252single letter that would specify a library from the list included in
253the $FASTLIBS file. Now, you can specify a set of library files with a
254string of letters preceded by a '%'.  Thus, if the FASTLIBS file has
255the lines:
256.(l I
257
258Genbank 70 primates$1P/seqlib/gbpri.seq 1
259Genbank 70 rodents$1R/seqlib/gbrod.seq 1
260Genbank 70 other mammals$1M/seqlib/gbmam.seq 1
261Genbank 70 vertebrates $1B/seqlib/gbvrt.seq 1
262
263.)l
264Then the string: "%PRMB" would tell FASTA to search the four libraries
265listed above.  The %PRMB string can be entered either on the command
266line or when the program asks for a filename or library letter.
267.pp
268FASTA1.5 also provides additional flexibility for specifying the
269number of results and alignments to be displayed with the \fC-Q\fP
270(quiet) option.  The \fC-b number\fP option allows you to specify the
271number of sequence scores to show when the search is finished.  Thus
272.(l I
273
274\fCFASTA -b 100 ...\fP
275
276.)l
277tells the program to display the top 100 sequence scores. In the past,
278if you displayed 100 scores (in \fC-Q\fP mode), you would also have
279store 100 alignments. The \fC-d\fP option allows you to limit the
280number of alignments shown.  \fCFASTA -b 100 -d 20\fP would show 100
281scores and 20 alignments.
282.pp
283Finally, FASTA can provide a complete list of all of the sequences and
284scores calculated to a file with the \fC-r\fP (results) option.
285\fCFASTA -r results.out ...\fP creates a file with a list of scores
286for every sequence in the library.  The list is not sorted, and only
287includes those scores calculated during the initial scan of the
288library.
289.sh 1 "Installing the FASTA package"
290.sh 2 "Installing the programs"
291.sh 3 "Unix version"
292.pp
293The FASTA distribution comes with several \fCmakefile\fP's that can be
294used to compile the FASTA programs.  Over the years, as ATT Unix
295System 5 and BSD unix have converged, these files have become very
296similar. To begin with, I recommend using the standard \fCMakefile\fP.
297There are two values in the \fCmakefile\fP that should be checked against
298the values used on your system: the \fCHZ\fP value, which is the frequency
299in ticks per second used by the \fCtimes()\fP system call, this value can
300usually be found by running:
301.(l I
302\fCgrep HZ /usr/include/sys/*\fP
303.)l
304and the functions available to return random numbers.  If you have a
305\fCrand48()\fP function that returns a 32-bit random number, use it and use
306the lines:
307.(l I
308\fCNRAND=nrand48
309RANFLG= -DRAND32\fP
310.)l
311If not, you will need to use the \fCrand()\fP function call and determine
312whether it returns a 16-bit or a 32-bit value.  These functions are
313used by PRDF and PRSS.
314.p
315If you have problems compiling the programs, you may want to examine
316the \fCmakefile.unx\fP and \fCmakefile.sun\fP files, to look for differences.
317I have tried to use very standard unix functions in these programs,
318and they have been successfully compiled, with very small changes to
319the \fCMakefile\fP, on Sun's (Sun OS 4.1), IBM RS/6000's (AIX), and MIPS
320machines (under the BSD environment).
321.sh 3 "IBM-PC/DOS version"
322.pp
323For the IBM-PC/DOS version, the FASTA source code disk contains the
324complete source code to all of the programs on the other disks.  The
325programs were compiled with Borland's Turbo 'C++', using Borland's
326MAKE utility.  The graphics programs (PLFASTA, TGREASE) use the
327graphics device drivers supplied with the Turbo 'C' V2.0 package.
328Also included are the documentation files PROGRAMS.DOC and FORMAT.DOC.
329You do not need any of the files the source code disk to run the
330programs.  The files on this disk are identical to the UNIX and VMS
331versions that run on larger machines.  Also included is the code to
332compile ALIGN0.EXE.  ALIGN0 is the same as ALIGN, but does not
333penalize for end-gaps.
334.pp
335If you have the DOS or Macintosh version of the FASTA package, to
336install the programs you should:
337.np
338Make a new directory (folder) for the FASTA programs.  This need not
339be the same as the directory for your sequence databases.
340.np
341Copy the files from the FASTA source disk to the new directory.
342.np
343(DOS only) Edit your AUTOEXEC.BAT file to (a) modify your PATH command
344to include the FASTA directory and (b) add the line:
345.(l
346\fCset FASTLIBS=c:\\yourfastadirectory\\fastgbs\fP
347.)l
348On the Macintosh, you may need to edit the "environment" file and
349change the line that reads:
350.(l
351\fCFASTLIBS=fastgbs\fP
352.)l
353to indicate the full directory path for the \fCfastgbs\fP file, for
354example:
355.(l
356\fCFASTLIBS=Q105:FASTA:fastgbs\fP
357.)l
358.np
359Finally, you will need to edit the \fCfastgbs\fP file.  This is usually
360the most confusing part of the installation.  An example of this file
361is shown below; to customize this file for your machine, you will need
362to change the file names from those provided in the \fCfastgbs\fP file to
363ones that reflect the directory names and file names you use on your
364machine. This is explained in more detail below.  In addition, some
365entries in the \fCfastgbs\fP file refer to other files of file names.
366These files of file names (as opposed to actual database files) may
367also need to be edited.
368.sh 2 "Installing the libraries"
369.sh 3 "The NBRF protein sequence library"
370.pp
371The FASTA program package does not include any protein or DNA
372sequence libraries.  You can obtain the PIR protein sequence database from:
373.(l
374National  Biomedical Research Foundation
375Georgetown  University  Medical  Center
3763900 Reservoir Rd, N.W.
377Washington, D.C. 20007
378.)l
379In addition, this database is available via anonymous ftp from the
380host "ftp.bchs.uh.edu". It is available in two formats, VMS and CODATA
381format.  The "VMS" format (library type 5 below) can be searched much
382faster, can be easily reformatted for use by the "BLAST" rapid
383searching program, and is compatible with the Genetics Computer Group
384package of programs.  The CODATA format is used by the EUGENE/MBIR
385computing package from Baylor (library type 2).
386.sh 3 "The GENBANK DNA sequence library"
387.pp
388FASTA, and TFASTA search sequences from the GENBANK "flatfile" (not
389ASN.1) DNA sequence library in the flat-file format distributed by the
390National Center for Biotechnology Information and the PIR format used
391by EBI/EMBL.  CD-ROMs can be obtained from:
392.(l
393Genbank
394National Center for Biotechnology Information
395National Library of Medicine
396National Institutes of Health
3978600 Rockville Pike
398Bethesda, MD  20894
399.)l
400.pp
401The GenBank DNA sequence library is also available via anonymous FTP
402from \fCncbi.nlm.nih.gov\fP.
403.sh 3 "The EBI/EMBL CD-ROM libraries"
404.pp
405The European Bioinformatics Institute (EBI) is now distributing the
406EMBL CD-ROM that contains both the complete EMBL DNA sequence database
407(which should be essentially identical to the GenBank DNA sequence
408database) and the SWISS-PROT protein sequence database. SWISS-PROT is
409derived from the NBRF Protein sequence database with additions from
410the EBI/EMBL DNA sequence database.  This CD-ROM is a "best-buy,"
411since it provides both DNA and protein sequence libraries.  It is
412available from:
413.(l
414
415European Bioinformatics Institute
416Hinxton Genome Campus, Hinxton Hall
417Hinxton, Cambridge CB10 1RQ,
418United Kingdom
419Tel: +44 1223 4944
420Fax: +44 1223 494468
421Email: DATALIB@ebi.ac.uk
422
423.)l
424.pp
425In addition, the SWISS-PROT protein sequence database is available via
426anonymous FTP from \fCncbi.nlm.nih.gov\fP.
427.sh 2 "Finding the libraries: FASTLIBS"
428.pp
429FASTA and TFASTA use the environment variable FASTLIBS to find the
430protein and DNA sequence libraries.  The FASTLIBS variable contains
431the name of a file that has the actual filenames of the libraries.
432The \fCFASTGBS\fP file on is an example of a file that
433can be referred to by FASTLIBS. To use the \fCFASTGBS\fP file, type:
434.(l
435\fCsetenv FASTLIBS /usr/lib/fasta/fastgbs\fP (BSD UNIX/csh)
436or
437\fCexport FASTLIBS=/usr/lib/fasta/fastgbs\fP (SysV UNIX/ksh)
438.)l
439Then edit the \fCFASTGBS\fP file to indicate where the protein and DNA
440sequence libraries can be found.  If you have a hard disk and your
441protein sequence library is kept in the file \fC/usr/lib/aabank.lib\fP and
442your Genbank DNA sequence library is kept in the directory:
443\fC/usr/lib/genbank\fP, then \fCfastgbs\fP might contain:
444.ne 8
445.(l
446.ft C
447NBRF Protein$0P/usr/lib/seq/aabank.lib 0
448SWISS PROT 10$0S/usr/lib/vmspir/swiss.seq 5
449GB Primate$1P@/usr/lib/genbank/gpri.nam
450GB Rodent$1R@/usr/lib/genbank/grod.nam
451GB Mammal$1M@/usr/lib/genbank/gmammal.nam
452^   1    ^^^^       4                   ^     ^
453          23                             (5)
454.ft R
455.)l
456The first line of this file says that there is a copy of the NBRF
457protein sequence database (which is a protein database) that can be
458selected by typing "P" on the command line or when the database menu
459is presented in the file \fC/usr/lib/seq/aabank.lib\fP.
460.pp
461Note that there are 4 or 5 fields in the lines in \fCfastgbs\fP.  The first
462field is the description of the library which will be displayed by
463FASTA; it ends with a '$'.  The second field (1 character), is a 0 if
464the library is a protein library and 1 if it is a DNA library.  The
465third field (1 character) is the character to be typed to select the
466library.
467.pp
468The fourth field is the name of the library file.  In the example
469above, the \fC/usr/lib/seq/aabank.lib\fP file contains the entire
470protein sequence library.  However the DNA library file names are
471preceded by a '@', because these files (\fCgpri.nam, grod.nam,
472gmammal.nam\fP) do not contain the sequences; instead they contain the names
473of the files which contain the sequences.  This is done because the
474GENBANK DNA database is broken down in to a large number of smaller
475files.  In order to search the entire primate database, you must
476search more than a dozen files.
477.pp
478In addition, an optional fifth field can be used to specify the format
479of the library file.  Alternatively, you can specify the library
480format in a file of file names (a file preceded by an '@').  This
481field must be separated from the file name by a space character ('\ ')
482from the filename.  In the example above, the \fCaabank.lib\fP file is
483in Pearson/FASTA format, while the \fCswiss.seq\fP file is in PIR/VMS format
484(from the EMBL CD-ROM). Currently, FASTA can read the following formats:
485.(l I
486.ft C
4870 Pearson/FASTA (>SEQID - comment/sequence)
4881 Uncompressed Genbank (LOCUS/DEFINITION/ORIGIN)
4892 NBRF CODATA (ENTRY/SEQUENCE)
4903 EMBL/SWISS-PROT (ID/DE/SQ)
4914 Intelligenetics (;comment/SEQID/sequence)
4925 NBRF/PIR VMS (>P1;SEQID/comment/sequence)
4936 GCG (version 8.0) Unix Protein and DNA (compressed)
49411 NCBI Blast1.3.2 format  (unix only)
495.ft R
496.)l
497In particular, this version will work with the EMBL and PIR VMS
498formats that are distributed on the EMBL CD-ROM. The latter format
499(PIR VMS) is much faster to search than EMBL format.  This release
500also works with the protein and DNA database formats created for the
501BLASTP and BLASTN programs by SETDB and PRESSDB and with the new NCBI
502search format.  If a library format is not specified, for example,
503because you are just comparing two sequences, Pearson/FASTA (format 0)
504is used by default.  To change this default, you may set the LIBTYPE
505environment variable to a number.  For example,
506.(l I
507\fCsetenv LIBTYPE 1\fP
508.)l
509would cause the program to use the GenBank LOCUS format by default
510for libraries (or the second sequence file), but the Pearson/FASTA
511format would still be used for the query sequence.
512.pp
513You can specify a group of library files by putting a '@' symbol
514before a file that contains a list of file names to be searched.  For
515example, if @gmam.nam is in the fastgbs file, the file "gmam.nam"
516might contain the lines:
517.(l
518.ft C
519</usr/lib/genbank
520gbpri.seq 1
521gbrod.seq 1
522gbmam.seq 1
523.ft R
524.)l
525In this case, the line beginning with a '<' indicates the directory
526the files will be found in.  The remaining lines name the actual
527sequence files.  So the first sequence file to be searched would be:
528.(l
529.ft C
530/usr/lib/genbank/gbpri.seq
531.ft R
532.)l
533The notation "\fC<PIRNAQ:\fP" might be used under the VAX/VMS operating
534system. Under UNIX, the trailing '/' is left off, so the library
535directory might be written as "\fC</usr/seqlib\fP".
536.pp
537With version 1.4 of the FASTA package, the FASTA and TFASTA programs
538can search a library composed of different files in different sequence
539formats.  For example, you may wish to search the Genbank files (in
540GenBank flat file format) and the EMBL DNA sequence database on
541CD-ROM.  To do this, you simply list the names and filetypes of the
542files to be searched in a file of filenames.  For example, to search
543the mammalian portion of Genbank, the unannotated portion of Genbank,
544and the unannotated portion of the EMBL library, you could use the
545file:
546.(l I
547.ft C
548</usr/lib/DNA
549gbpri.seq 1
550\&#  (this '#' causes the program to display the size of the library)
551gbrod.seq 1
552\&...
553gbmam.seq 1
554\&...
555gbuna.seq 1
556\&...
557unanno.seq 5
558\&#
559.ft R
560.)l
561.(l I F
562You do not need to include library format numbers if you only use the
563Pearson/FASTA version of the PIR protein sequence library.  If no
564library type is specified, the program assumes that type 0 is being
565used (unless you have set LIBTYPE).
566.)l
567.lp
568Support for the old compressed GenBank files, which have not been
569distributed for more than four years, has been removed from programs
570in the FASTA package.
571.sp
572.pp
573Test the setup by running FASTA.  Enter the sequence
574file '\fCMUSPLFM.AA\fP' when the program requests it (this file is
575included with the programs).  The program should then ask you to
576select a protein sequence library.  Alternatively, if you run the
577TFASTA program and use the MUSPLFM.AA query sequence, the program
578should show you a selection of DNA sequence libraries.
579Once the fastgbs file has been set up correctly, you can
580set FASTLIBS=fastgbs in your AUTOEXEC.BAT file, and you will not need to
581remember where the libraries are kept or how they are named.
582.pp
583FASTA and TFASTA must open a large number of files when searching and
584reporting the results of a GENBANK floppy disk format library search.
585You may have problems with the large number of files under DOS on IBM-PC's
586(Unix and VMS users will not have these problems).  If you are going
587to search the GENBANK floppy disk format DNA sequence library under
588DOS, you should add the line:
589.(l
590.ft C
591FILES=16
592.ft R
593.)l
594to your \fCCONFIG.SYS\fP file.  (Typically this is already done for programs
595like Windows or WordPerfect.)
596.sh 1 "\s+2Using the FASTA Package\s0"
597.sh 2 "Overview"
598.pp
599The FASTA sequence comparison programs all require similar
600information, the name of a query sequence file, a library file, and
601the \fIktup\fP parameter.  All of the programs can accept arguments
602on the command line, or they will prompt for the file names and
603\fIktup\fP value.
604.lp
605To use FASTA, simply type:
606.(l
607.ft C
608\f(CBFASTA\fP
609and you will be prompted for :
610.in +0.5i
611the name of the test sequence file
612the name of the library file
613and whether you want ktup = 1 or 2. (or 1 to 6 for DNA sequences)
614.(l F
615.ft R
616ktup of 2 is about 5 times faster than ktup = 1.  For a 200 aa
617sequence against a 10,000,000 aa library, the program takes about 30
618min with ktup = 2, 150 min with ktup = 1, on a 12 Mhz 286 IBM-PC.
619.ft C
620.)l
621.ft R
622.)l
623The program can also be run by typing
624.(l
625.ft C
626FASTA test.aa /lib/bigfile.lib ktup (1 or 2)
627.ft R
628.)l
629.lp
630Included with the package are the test files,
631\fCMUSPLFM.AA\fP, \fCLCBO.AA\fP, \fCMCHU.AA\fP and \fCBOVPRL.SEQ\fP.
632To check to make certain that everything is working, you can try:
633.(l
634.ft C
635fasta musplfm.aa lcbo.aa
636and
637tfasta musplfm.aa bovprl.seq
638.ft R
639.)l
640To test the local similarity programs LFASTA and PLFASTA, try:
641.(l
642.ft C
643lfasta mchu.aa mchu.aa
644and
645plfasta mchu.aa mchu.aa
646.ft R
647.)l
648\fCMCHU\fP (calmodulin) has four duplicated calcium binding sites
649that are clearly detected by LFASTA.  For a more complicated example,
650try \fCMWRTC1.aa\fP, myosin heavy chain.
651.sh 2 "Sequence files"
652.pp
653The FASTA programs know about three kinds of sequence files (four
654under VMS): (1) plain sequence files that can only be used as query
655sequences or for LFASTA, PRDF, and ALIGN. (2) Standard library files.
656These are the same as plain sequence files, each sequence is preceded
657by a comment line with a '>' in the first column. (3) distributed
658sequence libraries (this is a broad class that includes the NBRF/PIR
659VMS and blocked ascii formats, Genbank flat-file format, EMBL
660flat-file format, and Intelligenetics format.  All of the files that
661you create should be of type (1) or (2).  Type (2) files (ones with a
662'>' and comment before the sequence) are preferred, because they can
663be used as query or library sequence files by all of the programs.
664.pp
665I have included several sample test files, *.AA.  The first line may begin
666with a '>'  or ';' followed by a comment.  The text after ';' in other lines
667will  be  ignored.   Spaces  and  tabs  (and anything else that  is  not  an
668amino-acid code) are ignored.
669.pp
670Library files should have the form:
671.(l
672.ft C
673>Sequence name and identifier
674A F A S Y T .... actual sequence.
675F S S       .... second line of sequence.
676>Next sequence name and identifier
677.ft R
678.)l
679This is often referred to as "FASTA" or "Pearson" format.  You can
680build your own library by concatenating several sequence files.  Just
681be sure that each sequence is preceded by a line beginning with a '>'
682with a sequence name.
683.pp
684The test file should not have lines longer than 120 characters, and
685sequences entered with word processors should use a document
686mode, with normal carriage returns at the end of lines.
687.uh "\s+2Program Summary\s0"
688.sh 2 "Sequence search programs"
689.nr ii 1i
690.ip "FASTA"
691universal sequence comparison. Defaults to comparing protein sequences;
692if the sequences are > 85% A+C+G+T or the \fC-n\fP option is used, a
693DNA sequence is assumed.
694.ip "FASTX"
695Search a protein sequence library using amino acid sequence comparison
696to the forward three frames of a translated DNA query sequence. (The
697reverse frames are specified with the \fC-i\fP option.) Alignment
698scores allow frameshifts; the final alignment uses a Smith-Waterman
699type alignment routine (no limit on gaps) that allows frameshifts.
700.ip "TFASTA"
701Search DNA library for a protein sequence by translating the DNA
702sequence to protein in all six frames (three forward frames with the
703-3 command line option). TFASTA with ktup=2 is about as fast as a DNA
704FASTA with ktup=4, and is substantially more sensitive.
705(also reads the GENBANK library)
706.ip "TFASTX"
707Search DNA library for a protein sequence by translating the DNA
708sequence to protein in all six frames (three forward frames with the
709-3 command line option) calculating similarity scores that allow
710frameshifts. TFASTX produces an optimal Smith-Waterman alignment of
711the query and translated-library sequence.
712.ip "\f2SSEARCH\fP"
713Universal sequence comparison using the Smith-Waterman algorithm (
714T. F. Smith and M. S. Waterman (1981) J. Mol. Biol. 147:195-197).
715This program uses code developed by Huang and Miller (X. Huang, R. C.
716Hardison, W. Miller (1990) CABIOS 6:373-381) for calculating the local
717similarity score and code from the ALIGN program (see below) for
718calculating the local alignment.  \f2SSEARCH\fP is about 50-times
719slower than FASTA with ktup=2 (for proteins).
720.ip "ALIGN"
721optimal global alignment of two sequences with no short-cuts.
722This program is a slightly modified version of one taken from E.
723Myers and W. Miller. The algorithm is described in E. Myers and W.
724Miller, "Optimal Alignments in Linear Space" (CABIOS (1988) 4:11-17).
725.sh 2 "Local similarity programs"
726.ip "LFASTA"
727local similarity searches showing local alignments.  The algorithm
728used to calculate the local alignment in a band has been improved
729(Chao, Pearson, and Miller, submitted).
730.ip "PLFASTA"
731local similarity searches with plot output (on the IBM,
732this program requires that the environment variable BGIDIR be set).
733.ip "PCLFASTA"
734(unix only) local similarity searches with plot output using pic commands.
735.ip "\f2LALIGN\fP"
736Calculates the N-best local alignments using a rigorous algorithm.
737(N=10 by default.) The algorithm was developed by Huang and Miller (X.
738Huang and W.  Miller (1991) Adv. Appl. Math. 12:337-357), which is a
739linear-space version of an algorithm described by M. S. Waterman and
740M. Eggert (J.  Mol. Biol. 197:723-728).  Like \f2SSEARCH\fP,
741\f2LALIGN\fP is rigorous, but also very slow.
742.ip "\f2PLALIGN\fP"
743A version of \f2LALIGN\fP that plots its output to a postscript file.
744
745.sh 2 "Statistical Significance"
746.pp
747With version 2.0 of the FASTA program distribution, FASTA, TFASTA, and
748SSEARCH now provide estimates of statistical significance for library
749searches.  Work by Altschul, Arratia, Karlin, Mott, Waterman, and
750others (see Altschul et al. (1994) Nature Genetics 6:119 for an
751excellent review) suggests that local sequence similarity scores
752follow the extreme value distribution, so that P(s > x) = 1 -
753exp(-exp(-lambda(x-u)) where u = ln(Kmn)/lambda and m,m are the
754lengths of the query and library sequence. This formula can be
755rewritten as: 1 - exp(-Kmn exp(-lambda x), which shows that the
756average score for an unrelated library sequence increases with the
757logarithm of the length of the library sequence.  FASTA and SSEARCH
758use simple linear regression against the the log of the library
759sequence length to calculate a normalized "z-score" with mean 50,
760regardless of library sequence length, and variance 10.  These
761z-scores can then be used with the extreme value distribution and the
762poisson distribution (to account for the fact that each library
763sequence comparison is an independent test) to calculate the number of
764library sequences to obtain a score greater than or equal to the score
765obtained in the search. The original idea and routines to do the
766linear regression on library sequence length were provided Phil Green,
767U. Washington.  This version of FASTA and SSEARCH uses a slightly
768different strategy for fitting the data than those originally provided
769by Dr. Green.
770.pp
771The expected number of sequences is plotted in the histogram using an
772"*". Since the parameters for the extreme value distribution are not
773calculated directly from the distribution of similarity scores, the
774pattern of "*'s" in the histogram gives a qualitative view of how well
775the statistical theory fits the similarity scores calculated by FASTA
776and SSEARCH.  For FASTA, if optimized scores are calculated for each
777sequence in the database (the default), the agreement between the
778actual distribution of "z-scores" and the expected distribution based
779on the length dependence of the score and the extreme value
780distribution is usually very good.  Likewise, the distribution of
781SSEARCH Smith-Waterman scores typically agrees closely with the actual
782distribution of "z-scores."  The agreement with unoptimized scores,
783\fIktup=2\fP, is often not very good, with too many high scoring
784sequences and too few low scoring sequences compared with the
785predicted relationship between sequence length and similarity score.
786In those cases, the expectation values may be overestimates.
787.pp
788The statistical routines assume that the library contains a large
789sample of unrelated sequences.  If this is not the case, then the
790expectation values are meaningless.  Likewise, if there are fewer than
79120 sequences in the library, the statistical calculations are not
792done.
793.pp
794For protein searches, library sequences with E() values < 0.01 for
795searches of a 10,000 entry protein database are almost always
796homologous. Frequently sequences with E()-values from 1 - 10 are
797related as well. Remember, however, that these E() values also reflect
798differences between the amino acid composition of the query sequence
799and that of the "average" library sequence.  Thus, when searches are
800done with query sequences with "biased" amino-acid composition,
801unrelated sequences may have "significant" scores because of sequence
802bias.  The programs below, PRDF and PRSS, can address this problem by
803calculating similarity scores for random sequences with the same
804length and amino acid composition.
805.pp
806If optimization is not used ("-o"), E-values for DNA sequences
807overestimate the significance of the scores that are obtained and
808unrelated sequences frequently have E()-values < 0.0005. With
809optimization, the agreement between E()-value compares favorably with
810protein sequence comparison.  This is in part due to the use of more
811stringent gap penalties for DNA sequence comparison, -16, -4 rather
812than -12, -2.  With the latter penalties, many unrelated sequences
813appear to have significant similarity. Nevertheless, since protein
814sequence comparison is much more sensitive, DNA sequence comparison
815should not be used to identify sequences that encode protein.  Even
816with ktup=6, optimization rarely increases run-times more than 50%
817with mRNA-size query sequences.  Optimization should be used whenever
818possible.
819.pp
820Similar comments apply to TFASTA, where  higher gap penalties (-16,-4) are
821required for accurate statistical estimates.  Because TFASTA produces
822so many artificial "coding" sequences with atypical amino acid
823compositions, the statistical estimates with TFASTA are often over
824estimates.  With optimized scores, ktup=1, and gap penalties of -16,
825-4, unrelated sequences will sometimes have E() values of 0.1.  If
826initn scores are used, unrelated sequences may have have E() values <
8270.01.
828.ip "PRDF"
829improved version of RDF program that includes accurate probability
830estimates for all three scoring methods (includes local or window
831shuffle routine)
832.ip "PRSS"
833A version of PRDF that uses the rigorous Smith-Waterman calculation
834used by SSEARCH.
835.ip "RANDSEQ"
836produces a randomly shuffled sequence from a query sequence.
837.ip "RELATE"
838significance program described by Dayhoff (Atlas of Protein Sequence
839and Structure, Vol. 5, Supplement 3).  Each chunk of 25 residues in
840one sequence is compared to every 25 residue fragment of the second
841sequence. Sequences which are genuinely related will have a large
842number of scores greater than 3 standard deviations above the mean
843score of all of the comparisons.
844.sh 2 "Other analysis programs"
845.ip "AACOMP"
846calculate the amino acid composition and molecular weight
847of a sequence.
848.ip "BESTSCOR"
849calculate the best self-comparison score.
850.ip "GREASE"
851Kyte-Doolittle hydropathicity profile
852.ip "TGREASE"
853graphic plot of Kyte-Doolittle profile
854.ip "FROMGB"
855convert from GenBank LOCUS format (also used by the IBI-Pustell programs)
856to Pearson/FASTA format.
857.ip "GARNIER"
858A secondary structure prediction program using the method of Garnier,
859Osgusthorpe, and Robson, J. Mol. Biol., (1978) 120:97-120.
860.sh 2 "Options"
861.pp
862These programs have a number of output options, which are invoked
863by the environment variables \fBLINLEN\fP, \fBSHOWALL\fP, and
864\fBMARKX\fP.  Alternatively, these values can be controlled by
865command line options.  The number of sequence residues per output
866line is now adjustable by setting the environment variable
867\fBLINLEN\fP, or the command line option \fB-w\fP.  \fBLINLEN\fP is
868normally 60, to change it set \fBLINLEN=80\fP before running the
869program or add \fB-w 80\fP to the command line.  \fBLINLEN\fP can
870be set up to 200.  \fBSHOWALL\fP (\fB-a\fP) determines whether all,
871or just a portion, of the aligned sequences are displayed.
872Previously, FASTP would show the entire length of both sequences in
873an alignment while FASTN would only show the portions of the two
874sequences that overlapped. Now the default is to show only the
875overlap between the two sequences, to show complete sequences, set
876\fBSHOWALL=1\fP, or use the \fB-a\fP option on the command line.
877.pp
878The differences between the two aligned sequences can be highlighted
879in three different ways by changing the environment variable
880\fBMARKX\fP or the \fB-m\fP option.  Normally (MARKX=0) the program
881uses ':' do denote identities and '.' to denote conservative
882replacements.  If MARKX=1, the program will not mark identities;
883instead conservative replacements are denoted by a 'x' and
884non-conservative substitutions by a 'X'.  If MARKX=2, the residues in
885the second sequence are only shown if they are different from the
886first. MARKX=3 displays the aligned library sequences without the
887query sequence; these can be used to build a primitive multiple
888alignment.  MARKX=4 provides a graphical display of the boundaries of
889the alignments. Thus the five options are:
890.(l
891.ft C
892
893 MARKX=0      MARKX=1       MARKX=2       MARKX=3      MARKX=4
894
895MWRTCGPPYT   MWRTCGPPYT    MWRTCGPPYT                 MWRTCGPPYT
896::..:: :::     xx  X       ..KS..Y...    MWKSCGYPYT   ----------
897MWKSCGYPYT   MWKSCGYPYT
898.ft R
899.)l
900.lp
901(fasta20u4, Feb. 1996) In addition MARKX=10 is a new, parseable format for use
902with other programs.  See the file"readme.v20u4" for a more complete
903description.
904.sh 2 "Command line options"
905.pp
906It is now possible to specify  several options on the command
907line, instead of using environment variables.  The command line options
908are preceded by a dash; the following options are available:
909.ip "-a"
910same as showall=1
911.ip "-A"
912force Smith-Waterman alignments for DNA sequences and TFASA.  By
913default, only FASTA protein sequence comparisons use Smith-Waterman
914alignments.
915.ip "-b #"
916Number of sequence scores to be shown on output.  In the absence of
917this option, fasta (and tfasta and ssearch) display all library
918sequences obtaining similarity scores with expectations less than
91910.0 if optimized score are used, or 2.0 if they are not. The -b
920option can limit the display further, but it will not cause additional
921sequences to be displayed.
922.ip "-c #"
923Threshold score for optimization (OPTCUT).  Set "-c 1" to
924optimize every sequence in a database.  (This slows the program down
925about 5-fold).
926.ip "-E #"
927Limit the number of scores and alignments shown based on the
928expected number of scores.  Used to override the expectation value of 10.0
929used by default.  When used with -Q, -E 2.0 will show all library sequences
930with scores with an expectation value <= 2.0.
931.ip "-d #"
932Number of alignments to be reported by default. (Used in conjunction
933with -Q).  No longer necessary, see "-b" above.
934.ip "-f"
935Penalty for the first residue in a gap (-12 by default for proteins,
936-16 for DNA or for TFASTA).
937.ip "-g"
938Penalty for additional residues in a gap (-2 by default for proteins,
939-4 for DNA and TFASTA ).
940.ip "-h"
941Penalty for frameshift (FASTX, TFASTX only).
942.ip "-H"
943Omit histogram.
944.ip "-i"
945Invert (reverse complement) the query sequence if it is DNA.  For
946TFASTX,
947search the reverse complement of the library sequence only.
948.ip "-k #"
949Threshold for joining init1 segments to build an initn score (GAPCUT).
950.ip "-l file"
951Location of library menu file (FASTLIBS).
952.ip "-L"
953Display more information about the library sequence in the alignment.
954.ip "-m #"
955MARKX = # (0, 1, 2, 3, 4, 10)
956.ip "-n"
957Force the query sequence to be treated as a DNA sequence.  This is
958particularly useful for query sequences that contain a large number of
959ambiguous residues, e.g. transcription factor binding sites.
960.ip "-O"
961Send copy of results to "filename."  Helpful for environments without STDOUT.
962.ip "-o "
963Turn off default optimization of all scores greater than OPTCUT. Sort
964results by "initn" scores.
965.ip "-Q,-q"
966Quiet - does not prompt for any input.  Writes scores and alignments
967to the terminal or standard output file.
968.ip "-r file"
969Save a results summary line for every sequence in the sequence
970library.  The summary line includes the sequence identifier,
971superfamily number (if available) position
972in the library, and the similarity scores calculated.  This option can
973be used to evaluate the sensitivity and selectivity of different
974search strategies (see W. R. Pearson (1991) Genomics 11:635-650.)
975.ip "-s file"
976SMATRIX is read from file.  Several SMATRIX files are provided with
977the standard distribution.  For protein sequences: \fCcodaa.mat\fP -
978based on minimum mutation matrix; \fCidnaa.mat\fP - identity matrix;
979\fCpam250.mat\fP - the PAM250 matrix developed by Dayhoff et al (Atlas
980of Protein Sequence and Structure, vol. 5, suppl. 3, 1978);
981\fCpam120.mat\fP - a PAM120 matrix.  The default scoring matrix is
982BLOSUM50, PAM250 is available with "-s 250", BLOSUM62 ("-s BL62") is
983also available.
984.ip "-v \"#1 #2 #3\""
985(LINEVAL) values used for line styles in plfasta
986.ip "-w #"
987Line length (width) = number (<200)
988.ip "-x \"off1 off2\""
989Specifies offsets for the beginning of the query and library sequence.
990For example, if you are comparing upstream regions for two genes, and
991the first sequence contains 500 nt of upstream sequence while the
992second contains 300 nt of upstream sequence, you might try:
993.(l I
994\fCfasta -x "-500 -300" seq1.nt seq2.nt\fP
995.)l
996If the -x option is not used, FASTA assumes numbering starts with 1.
997This option will not work properly with the translated library
998sequence with tfasta.  (You should double check to be certain the
999negative numbering works properly.)
1000.ip "-y"
1001Set the width of the band used for calculating "optimized" scores.
1002For proteins and ktup=2, the width is 16.  For proteins with ktup=1,
1003the width is 32 by default.  For DNA the width is 16.
1004.ip "-z"
1005Turn off statistical calculations.
1006.ip "-1"
1007sort output by init1 score (as FASTP used to do).
1008.ip "-3"
1009(TFASTA, TFASTX only) translate only three forward frames
1010.sp
1011.lp
1012For example:
1013.(l
1014\fCfasta -w 80 -a seq1.aa seq.aa\fP
1015.)l
1016would compare the sequence in seq1.aa to that in seq2.aa and display the
1017results with 80 residues on an output line, showing all of the residues
1018in both sequences.  Be sure to enter the options before entering the file
1019names, or just enter the options on the command line, and the program will
1020prompt for the file names.
1021.pp
1022Not all of these options are appropriate for all of the programs.  The
1023options above are used by FASTA and TFASTA. RELATE uses the -s option,
1024ALIGN uses the -w, -m, and -s options, and the PRDF program uses -c,
1025-f, -k, and -s.
1026.sp
1027.pp
1028(November, 1997) In addition, it is now possible to provide the fasta
1029programs with the query sequence (fasta, fasty, ssearch, tfastx), or
1030two sequences (prss, lalign, plalign) from the unix "stdin" stream.  This
1031makes it much easier to set up FASTA or PRSS WWW pages.  To specify
1032that stdin be used, rather than a file, the file name should be
1033specified as '-' or '@' (the latter file name makes it possible to
1034specify a subset of the sequence).
1035Thus:
1036.(l
1037cat query.aa | fasta -q @:25-75 s
1038.)l
1039would take residues 25-75 from query.aa and search the 's' library
1040(see the discussion of FASTLIBS).  If DNA sequences are to be read
1041from stdin, the '-n' option must be used, as fasta cannot check for
1042DNA queries when stdin is used.
1043
1044.sh 1 "Environment variable summary"
1045.pp
1046Environment variables allow you to set search parameters that will be
1047used frequently when you run a program; for example, if you prefer to
1048use the PAM250 scoring matrix, you might "set SMATRIX=250."  Command
1049line parameters, if used, always override environment variable
1050settings. The following environment variables are used by this
1051program:
1052.ip "AABANK"
1053the file name  of the default sequence library.
1054.ip "FASTLIBS"
1055the location of the file which contains the list
1056of library files to be searched.
1057.ip "GAPCUT"
1058threshold used for joining init1 regions in the second step of FASTA.
1059Normally set based on sequence length and \fIktup\fP.
1060.ip "LIBTYPE"
1061used to specify the format of the library sequence for FASTA and TFASTA.
1062.ip "LINLEN"
1063output line length - can go up to 200
1064.ip "LINEVAL"
1065used by plfasta to determine the relationship between line style and
1066similarity score (-v).  This should be a string of three numbers, e.g.
1067"200 100 50"
1068.ip "MARKX"
1069symbol for denoting matches, mismatches. Note that this symbol is only
1070used across the optimized local region; sequences that are outside
1071this region are not marked.
1072.ip "OPTCUT"
1073Set the threshold to be used for optimization in a band around the
1074best initial region.  Normally the OPTCUT value is calculated from the
1075length of the sequence and the \fIktup\fP value (for a 200 residue
1076sequence, it is about 28).  If OPTCUT=1, every sequence in the
1077database will be optimized.  This is the most sensitive option.
1078.ip "PAMFACT"
1079This version of fasta uses a more sensitive method for identifying
1080initial regions. Instead of using a constant factor (fact) for each
1081match in a ktup, it uses the scoring matrix (PAM) scores.  While this
1082works well for protein sequences, it has not been as carefully tested
1083for DNA sequences, so by default, this modification is used for
1084proteins but not for DNA. Setting the PAMFACT environment variable to
10851 forces the option on; PAMFACT=0 turns it off.
1086.ip "SHOWALL"
1087on output, show the complete sequence instead of just the
1088overlap of the two aligned sequences.
1089.ip "SMATRIX"
1090alternative scoring matrix file.
1091.sp
1092.lp
1093As always, please inform me of bugs as soon as possible.
1094.sp
1095.nf
1096William R. Pearson
1097Department of Biochemistry
1098Box 440, Jordan Hall
1099U. of Virginia
1100Charlottesville, VA
1101
1102wrp@virginia.EDU
1103