README.versions

 $Id: README.versions 120 2010-01-31 19:42:09Z wrp $
 $Revision: 210 $

January, 2010

This directory contains the newest version of FASTA, version 36.
FASTA36 is a major update to FASTA35 that provides the ability to
display multiple significant alignments to a query sequence. Previous
versions of FASTA displayed only the best alignment between the query
and library sequence; if the library sequence was long, with multiple
similar regions, only the best was shown. This contrasts with BLAST,
which has always displayed multiple "HSPs" when they are present.

FASTA36 provides some additional improvements; like BLAST, it now uses
statistical estimates to set thresholds for band optimization, which
can increase search speed as much as 2-fold, and it provides much more
flexibility in specifying the files that are searched (indirect files
of filenames can include additional indirection). But the main
improvement is the display of multiple HSPs.

All of the traditional alignment programs (ssearch36, fasta36,
[t]fast[xy]36 and glsearch36) display multiple HSPs. The peptide and
mixed peptide alignment programs ([t]fasts36, fastf36, fastm36) still
show a single HSP.

Currently, the PVM/MPI parallel versions of the programs still display
a single HSP.

As of late 2007, there is almost no reason to use the fasta2 programs;
the major programs present in fasta2 that were not present in fasta3
(version 34) -- align (global alignments) and lalign (non-overlapping
local alignments) -- are now available in fasta version 36.

For more information about the programs in the current FASTA v36
package, see the "changes_v36.html" and "readme.v36" files.

There are still a few programs in the fasta2 package that are not
available in the fasta3 package: programs for global alignments
without end-gap penalties, the "grease" Kyte-Doolittle plot, and
"garnier" and "chofas" for classic (but inaccurate) secondary
structure prediction. You should not use the fasta2 programs for
library searching; the fasta3 programs are more sensitive and have
better statistics.

Precompiled versions of the programs for Windows and MacOS are available in the
executables directory.


readme.v30
1
2Because of interdependencies in the Makefile, sometimes you must
3type "make" a second time to get everything built.
4
5June 12, 1996 - fasta30t1
6
7 Fixed bug in reading blast-format DNA sequence files.
8 Fixed core-dump for some large libraries on some machines.
9
10June 19, 1996 - fasta30t2
11
12 Fixed a serious bug in the Smith-Waterman alignment routines used
13 by both fasta3 (dropnfa.c) and ssearch3 (dropgsw.c) that caused
14 the amount of memory required to depend on the library sequence
15 size, rather than the query sequence size.
16
17 Fixed some memory-overwrite errors in showalign.c
18
19June 27, 1996 - fasta30t3
20
21 Found and fixed bugs in comp_thr.c and nxgetaa.c that caused core
22 dumps when reading DNA libraries with long sequences in fasta
23 format.
24
25July 6, 1996 - fasta30t4
26
27 ibm_pthread_subs.c available, Makefile.ibm for multiprocessor
28 IBM RS/6000 AIX systems.
29
30 Finally (?) fixed the previous bug that caused core dumps when
31 reading DNA libraries in fasta format.
32
33 Corrections to the fastx algorithm.
34
35July 10, 1996
36
37 Fixed reading of compressed GCG DNA format.
38
39
readme.v30t6
1
2>>August 24, 1996
3
4New programs - tfastx3, tfastx3_t, compare a protein sequence to
5forward and reverse translations of a DNA sequence database. An excellent
6replacement for tfasta3.
7
Sun multiprocessing - changed thr_create() to use all CPUs if available.
9
10GCG formats - now can search with simple GCG-format query sequences and
11results with GCG format Swissprot and Genpept are more readable.
12
13>>August 26, 1996
14
15Fixed bugs in tfastx3(_t) and fastx3(_t) including an ancient problem
16with aatran(). Less redundancy in gcg_ranlib().
17
18
19>>August 31, 1996
20
21Included support for BLOSUM62 (-s BL62) as per documentation.
22
Rearranged Makefiles so that they would make everything in one pass.
24
25>>September 6, 1996
26
27Corrected yet another problem with the fastx/tfastx code.
28
29Noticed that searching without optimized scores gave no optimized
30scores on the final list of scores - fixed this.
31
32The pvm version now does alignments - not thoroughly tested.
33
34>>September 13, 1996
35
36Fixed display of best scores to stdout.
37
38Fixed problem with alignments when -o flag used.
39
40pvcompfa/pvcompsw have now been tested on DEC Alpha, Solaris X86, and
41SGI PVM implementations. Several bugs were corrected.
42
43>>September 18, 1996
44
Fixed bug in selectbestz() that caused core dumps in pvcomplib.c
(changes to pvcomplib.c, comp_thr.c, complib.c).
47
48>>September 23, 1996
49
Corrected showalign.c/pvm_showalign.c addressing bug found and fixed
by Erik Wallin (erikw@biokemi.su.se).
52
53>>October 15, 1996
54
55Corrected bug so alternative scoring matrices are used.
56
57>>October 22, 1996
58
59Remove singularities from regression routine.
60
61-z 0 now means no statistics (same as -z -1).
62
63No longer show alignment for 0 score.
64
65>>October 26, 1996
66
67Fix problem with -b, -d when Z-values disabled.
68
69>>November 1, 1996
70
71Altschul-Gish statistical estimates (-z 3) now work properly.
72
73Fix problem with mean_var==0.0.
74
75
readme.v30t7
1>> October 30, 1996
2
3A new program, sc_to_e, can be used to calculate expectation values
4from the regression coefficients reported from a search. The
5expectation value is based on similarity score, sequence length, and
6database size.
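
As a rough illustration of how a similarity score, a library sequence
length, and a database size combine into an expectation value, here is a
small Python sketch. It is not a transcription of sc_to_e: the function
names are hypothetical, and both the regression-scaled z-score and the
extreme-value conversion are assumptions about the default statistics. The
coefficients are the ones printed on the mp_stats line of the sample run
later in this file.

```python
import math

def z_score(score, lib_len, rho, mu, mean_var):
    """Scale a raw score by the assumed ln(length) regression."""
    expected = rho * math.log(lib_len) + mu
    return 50.0 + 10.0 * (score - expected) / math.sqrt(mean_var)

def expectation(z, db_entries):
    """Assumed extreme-value conversion from a z-score to E()."""
    x = math.exp(-(math.pi / math.sqrt(6.0)) * (z - 50.0) / 10.0 - 0.5772156649)
    # For tiny x, 1 - exp(-x) is numerically x, so use x directly.
    p = 1.0 - math.exp(-x) if x > 0.01 else x
    return db_entries * p

# Values from the sample run: opt score 1490, library sequence of 217 aa,
# 52205 library sequences.
z = z_score(1490, 217, rho=5.8855, mu=1.5386, mean_var=73.0398)
print(round(z, 1))
print(expectation(z, 52205))
```

With these inputs the z-score rounds to 1754.6, the fa_z-score reported
for GTM1_MOUSE in that run, and the expectation is vanishingly small,
consistent with the reported fa_expect of 0.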
7
8>> November 8, 1996
9
10fasta30t7 differs from fasta30t6 in the amount of information provided
11with the -m 10 option.
12
13(1) The query and library sequence identifiers are no longer abbreviated.
14
(2) New information about the program and program version is provided:

 mp_name: program name (actually argv[0])
 mp_ver: main program version (can be different from function version)
 mp_argv: command line arguments (duplicates argv[0])

 Some statistical information is provided as well:
 mp_extrap: XXX YYY - statistics extrapolated from XXX to YYY
 mp_stats: indicates type of statistics used for E() value
 mp_KS: Kolmogorov-Smirnov statistic
27
28The "mp_" (main program) information is function independent, while the "pg_"
29information is produced by a particular comparison function (ssearch,
30fastx, fasta, etc). "pg_" should probably be called "fn_", and "mp_"
31called "pg_", but I remain backwards compatible.
32
33(3) The end of the "parseable" records is denoted with:
34
35 >>><<<
36
(4) There is now a compile-time option, -DM10_CONS, that allows you to
display a final alignment summary:
39
40;al_cons:
41 .::.:- .:: .. :. .:.---: : .--.:. :
42.. .--- ..: :: ... :..: .::.:. . .---. . .:
43 : . . . : .. . :..: .--. . : .:. .. : .
44 .:.::: ..:. :
45
46or, if M10_CONS_L is defined (in addition to M10_CONS), the output is:
47;al_cons:
48 p==p=-mmmp==mpzmm=pmmmmz=p---=mmm=mmp--p=zm=m
49pzmmp---mmzp=m==mzzzm=zp=mz==z=pmzmmz---pmmpmmmp=m
50m=mzmmzmpm=mmmmppmmmpmmmm=pp=mp--pmpm=mp=pmzzm=mmp
51mp=z===mmpz=zm=
52
53where '=' indicates identical residues, '-' a gap in one or the other
54sequence, 'p' indicates a positive pam value, 'm' indicates a negative
55pam value, and 'z' indicates a zero pam value.
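
The symbol legend above can be sketched in Python as follows. This is only
an illustration of the mapping, not the code behind M10_CONS_L; the
function name, the '-' gap convention for the input strings, and the toy
scores are all assumptions made for the example.

```python
def al_cons_long(seq_a, seq_b, score):
    """Build an M10_CONS_L-style string for two gapped, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    out = []
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            out.append("-")            # gap in one or the other sequence
        elif a == b:
            out.append("=")            # identical residues
        else:
            s = score(a, b)
            out.append("p" if s > 0 else "m" if s < 0 else "z")
    return "".join(out)

# Toy scores for illustration only; not taken from any real matrix.
toy = {("I", "L"): 2, ("D", "K"): 0, ("W", "G"): -7}

def toy_score(a, b):
    return toy.get((a, b), toy.get((b, a), -1))

print(al_cons_long("MILD-W", "MLLKAG", toy_score))
```

For the toy input this prints "=p=z-m": two identities, one positive
pair, one zero pair, one gap, and one negative pair.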
56
57A typical run now looks like:
58
59>>>gtm1_mouse.aa, 217 aa vs s library
60; mp_name: fasta3_t
61; mp_ver: version 3.0t7 November, 1996
62; mp_argv: fasta3_t -q -m 10 gtm1_mouse.aa s
63; pg_name: FASTA
64; pg_ver: 3.06 Sept, 1996
65; pg_matrix: BL50
66; pg_gap-pen: -12 -2
67; pg_ktup: 2
68; pg_optcut: 24
69; pg_cgap: 36
70; mp_extrap: 50000 51933
71; mp_stats: Expectation fit: rho(ln(x))= 5.8855+/-0.000527; mu= 1.5386+/- 0.029; mean_var=73.0398+/-15.283
72; mp_KS: 0.0133 (N=29) at 42
73>>GTM1_MOUSE GLUTATHIONE S-TRANSFERASE GT8.7 (EC 2.5.1.18) (GST 1-1) (CLASS-MU).
74; fa_initn: 1490
75; fa_init1: 1490
76; fa_opt: 1490
77; fa_z-score: 1754.6
78; fa_expect: 0
79; sw_score: 1490
80; sw_ident: 1.000
81; sw_overlap: 217
82>GTM1_MOUSE ..
83; sq_len: 217
84; sq_type: p
85; al_start: 1
86; al_stop: 217
87; al_display_start: 1
88PMILGYWNVRGLTHPIRMLLEYTDSSYDEKRYTMGDAPDFDRSQWLNEKF
89KLGLDFPNLPYLIDGSHKITQSNAILRYLARKHHLDGETEEERIRADIVE
90NQVMDTRMQLIMLCYNPDFEKQKPEFLKTIPEKMKLYSEFLGKRPWFAGD
91KVTYVDFLAYDILDQYRMFEPKCLDAFPNLRDFLARFEGLKKISAYMKSS
92RYIATPIFSKMAHWSNK
93>GTM1_MOUSE ..
94; sq_len: 217
95; sq_type: p
96; al_start: 1
97; al_stop: 217
98; al_display_start: 1
99PMILGYWNVRGLTHPIRMLLEYTDSSYDEKRYTMGDAPDFDRSQWLNEKF
100KLGLDFPNLPYLIDGSHKITQSNAILRYLARKHHLDGETEEERIRADIVE
101NQVMDTRMQLIMLCYNPDFEKQKPEFLKTIPEKMKLYSEFLGKRPWFAGD
102KVTYVDFLAYDILDQYRMFEPKCLDAFPNLRDFLARFEGLKKISAYMKSS
103RYIATPIFSKMAHWSNK
104>>GTM1_RAT GLUTATHIONE S-TRANSFERASE YB1 (EC 2.5.1.18) (CHAIN 3) (CLASS-MU).
105; fa_initn: 1406
106; fa_init1: 1406
107; fa_opt: 1406
108; fa_z-score: 1656.3
109; fa_expect: 0
110; sw_score: 1406
111; sw_ident: 0.931
112; sw_overlap: 217
113>GTM1_MOUSE ..
114; sq_len: 217
115; sq_type: p
116; al_start: 1
117; al_stop: 217
118; al_display_start: 1
119PMILGYWNVRGLTHPIRMLLEYTDSSYDEKRYTMGDAPDFDRSQWLNEKF
120KLGLDFPNLPYLIDGSHKITQSNAILRYLARKHHLDGETEEERIRADIVE
121NQVMDTRMQLIMLCYNPDFEKQKPEFLKTIPEKMKLYSEFLGKRPWFAGD
122KVTYVDFLAYDILDQYRMFEPKCLDAFPNLRDFLARFEGLKKISAYMKSS
123RYIATPIFSKMAHWSNK
124>GTM1_RAT ..
125; sq_len: 217
126; sq_type: p
127; al_start: 1
128; al_stop: 217
129; al_display_start: 1
130PMILGYWNVRGLTHPIRLLLEYTDSSYEEKRYAMGDAPDYDRSQWLNEKF
131KLGLDFPNLPYLIDGSRKITQSNAIMRYLARKHHLCGETEEERIRADIVE
132NQVMDNRMQLIMLCYNPDFEKQKPEFLKTIPEKMKLYSEFLGKRPWFAGD
133KVTYVDFLAYDILDQYHIFEPKCLDAFPNLKDFLARFEGLKKISAYMKSS
134RYLSTPIFSKLAQWSNK
135;al_cons:
136:::::::::::::::::.:::::::::.::::.::::::.::::::::::
137::::::::::::::::.::::::::.::::::::: ::::::::::::::
138:::::.::::::::::::::::::::::::::::::::::::::::::::
139::::::::::::::::..::::::::::::.:::::::::::::::::::
140::..::::::.:.::::
141>>><<<
142
143
144217 residues in 1 query sequences
14518531385 residues in 52205 library sequences
146 Tcomplib (4 proc)[version 3.0t7 November, 1996]
147 start: Fri Nov 8 18:20:26 1996 done: Fri Nov 8 18:20:41 1996
148 Scan time: 38.434 Display time: 2.166
149
150Function used was FASTA
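
A script that consumes the -m 10 records shown above might look roughly
like the following Python sketch. It assumes only what the sample run
shows: ">>>" starts a query, ">>" starts a library hit, ">" starts a
sequence block, "; key: value" lines carry the fields, and ">>><<<" ends
the parseable records. The function name and the returned layout are
hypothetical.

```python
def parse_m10(text):
    """Collect '; key: value' fields from -m 10 output.

    Returns (program_fields, hits); each hit is a dict holding the '>>'
    title under 'title' plus the fields that precede its first '>' block.
    """
    program, hits = {}, []
    target = program
    for line in text.splitlines():
        if line.startswith(">>><<<"):
            break                      # end of the parseable records
        if line.startswith(">>>"):
            target = program           # program-level fields follow
        elif line.startswith(">>"):
            target = {"title": line[2:].strip()}
            hits.append(target)
        elif line.startswith(">"):
            target = None              # per-sequence blocks are skipped here
        elif line.startswith(";") and target is not None and ":" in line:
            key, _, value = line[1:].partition(":")
            target[key.strip()] = value.strip()
    return program, hits

sample = """\
>>>gtm1_mouse.aa, 217 aa vs s library
; mp_name: fasta3_t
; pg_matrix: BL50
>>GTM1_MOUSE GLUTATHIONE S-TRANSFERASE GT8.7
; fa_opt: 1490
; sw_ident: 1.000
>GTM1_MOUSE ..
; sq_len: 217
PMILGYWNVRG
>>><<<
"""
program, hits = parse_m10(sample)
print(program["pg_matrix"], hits[0]["fa_opt"])
```

Values are kept as strings so that nothing is lost in conversion; a
caller can convert fa_opt or sw_ident to numbers as needed.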
151
152================================================================
153
154>> November 11, 1996
155
156 --> v30t71
157
158Made changes to complib.c, comp_thr.c, nxgetaa.c to allow scoring
159matrix to be modified in fastx3, fastx3_t.
160
161================================================================
162
163>> November 15, 1996
164
165 --> v30t72
166
167nxgetaa.c now accepts query sequences from "stdin" by using "-" as the
168input file name. If DNA sequences are read in this mode, the "-n"
169option must be used.
170
>> November 23, 1996
172
173Included code in nxgetaa.c and Makefile.sgi to get around a bug in SGI's
174sscanf() that prevented compressed GCG databases from being read properly.
175
176
readme.v31t0
1
2>>November 1, 1997
3
4 --> v31t0
5
6version 31t of the fasta program package uses a more modular
7structure for comparison functions. In addition to modular functions
8to initialize, calculate and align sequences, v31 provides a modular
9function for creating the alignment display. This was required for
10fasty and fastf, which have very different alignment strategies from
11the other search programs.
12
13>>February 13, 1998
14
15modified nascii[] so that 0, 1, 2 are no longer end of sequence
16characters.
17
18prss3 added. Unlike prss, prss3 uses -d # to specify the number of
19shuffles.
20
21>>March 18, 1998
22
First public release. Corrected problems with dropfz.c (which is
used in fasty3, tfasty3). Makefile is well tested, but the other
Makefiles are not. PVM versions not tested.
26
27>>March 19, 1998
28
29Problem with unthreaded tfastx3, tfasty3 caused by bug in complib.c
30fixed. All Makefiles (Makefile.alpha Makefile.sun, Makefile.sgi,
31Makefile.linux) have been tested and work properly. Threaded versions
32do not work on linux (yet). Function labeling problems with fasty3,
33tfasty3 corrected.
34
35>>March 20, 1998
36
37 --> v31t02
38
39Fixed problem with inconsistent openlib() calls that broke BLAST databases
40on some platforms.
41
42>>March 27, 1998
43
44 --> v31t04
45
46Fixed a long standing problem with fastx/tfastx and fasty/tfasty that
47caused various memory allocation problems and core dumps.
48
49The PVM version works again, but cannot produce alignments. The
50change in the location of the modular display functions will require
51significant changes in the pvm display functions. For the moment,
52showalign() has been commented out.
53
54Code tested on Macintosh without changes.
55
56Added some additional information in the results file.
57
58
59Please report bugs to wrp@virginia.edu
60
61>>April 3, 1998
62
63Removed some debugging code in faatran.c now that fastx/fasty bugs
64seem corrected.
65
66 FASTA --> v3.14
67
68Corrected uninitialized array elements in dropnfa.c.
69
70>>April 10, 1998
71
Added facility for specifying SRCH_URL (the URL string that will be
used to re-search the database) and REF_URL (the URL string that
will be used to look up the sequence) in url_subs.c. This allows perl
scripts to provide different databases for re-searching dynamically.
76
77>>April 16, 1998
78
79 --> v31t05
80
81Corrected problem with ignoring ','s in databases (','s are found in
82PIR).
83
84>>April 18, 1998
85
86Corrected some problems with sequence names for Entrez lookups and
87re-searching databases.
88
89Made minor modifications to nxgetaa.c and compacc.c for compatibility
90with Borland 'C' compiler for Win32 systems. Including makefile.tc
91fasta.rsp, prss.rsp, and test.bat for Borland 'C'/win32.
92
93>>April 24, 1998
94
95 --> v31t06
96
97Fixed another bug in fasty3/tfasty3 alignment routines.
98
99Added additional information to the do_url1() (url_subs.c) function.
100The re-search URL can now reference the start, stop, and length of the
101library sequence to be re-searched with. For DNA library sequences,
102these values are always in nucleotides, even with tfasta/x/y.
103
104
105>>May 12, 1998
106
107(no version change as v31t06 was not released prior to this)
108
109Correct nxgetaa.c GETLIB to deal correctly with BLAST NR database
110sequences with exceptionally long title lines.
111
112Fix bug with long -O results files.
113
114>>May 18, 1998
115
116 --> v31t07
117
118Corrected some bugs in information string lengths (e.g. gstring1,
119stat_str), disabling statistics with -z 0, translation of 'X' by
120saatran() (faatran.c) that caused problems with FASTX.
121
122A serious bug has been fixed in the FASTX alignment routines.
123For some pathological sequences, % identity increases from < 10%
124to 40%. The version number of the main program has not changed,
125but the version number of the fastx function has changed to 3.2.
126
127>>June 19, 1998
128
129 --> v31t08
130
131Corrected some problems with alignments with -m 10.
132
133Added -Z db_size option to modify apparent database size for
134expectation value calculation (used only for protein/protein FASTA and
135SSEARCH, FASTX, FASTY, TFASTX, and TFASTY).
136
137>>July 1, 1998
138
139 (no version change)
140
Corrected size of lbnames[], lb_size[] in structs.h to accommodate MAX_LF
files.
143
144>>July 13, 1998
145
146 --> v31t09
147
148Corrected problem in nxgetaa.c encountered when reading long sequences
149(that must be split) in fasta format.
150
151Corrected problem in statistics calculation encountered with a small number
152of very long DNA sequences.
153
154>>July 17, 1998
155
156 (no version change, date change for ssearch3)
157
158Corrected default expectation cutoff (it was 10, now it is 2.0) for
159DNA with ssearch3.
160
161
readme.v31t1
1>>July 22, 1998
2
3 --> v31t10
4
5Corrected problem with histogram when unscaled statistics used (e.g. prss3).
6
7Corrected problems with prss3 shuffled sequence prompt. Provided option
8to enter number of shuffles, window size, for prss3. Number of shuffles
9for prss3 can be entered as an option (-d #) or as the third argument
10on the command line (prss3 query lib 1000).
11
12Modified nrand.c, nrand48.c to use time to set random number.
13
14Corrected problems reading GCG formatted files with prss3.
15
16Corrected various problems with pvcomp* programs, but they still do
17not produce alignments with version 3.1.
18
19Two new programs, fastf3(_t) and tfastf3(_t) are available. These
20programs compare a set of mixed peptide sequences from an Edman
21sequencer to a protein (fastf3) or DNA (tfastf3) database, using
22the database sequences to de-convolve the peptide mixture.
23
24See fastf3.1
25
26>>August 11, 1998
27
28(no version change)
29
30Modified initfa.c so that using '-n' on the fastx/fasty command line
31would not cause problems.
32
33Changed labeling of query sequence length for fastx/fasty from 'aa' to 'nt'.
34
35>>August 18, 1998
36
37(no version change)
38
39Modified complib.c, comp_thr.c scaleswn.c, to report E()-value for only
40one related sequence if -z 3 is used.
41
42>>August 23, 1998
43
44 -->v31t11
45
46Some serious problems with prss3 have been corrected:
47
48(1) use dropnsw.c rather than dropgsw.c for more accurate low scores
49
50(2) modify estimation program; use scaleswe.c rather than scaleswn.c.
51 scaleswe.c has some improvements for estimation by moments and can
52 use MLE as well as mu/var (-z 3).
53
54(3) add p() estimate.
55
56(4) correct bugs in nrand48, which caused bad sequences for llgetaa.c
57
58(5) -Z number works properly for prss3 and other programs (fixed histogram).
59
60(6) a new program, ssearch3e, is available that uses the same scaling
61 routines as prss3 (scaleswe.c). prss3 will save the random
62 sequences it generates when the -r file option is given; the
63 sequences are in file_rlib. ssearch3e (or ssearch3 or fasta) can
64 then do a search on exactly the same sequences that were used by prss3.
65
66A bug reading GCG format compressed DNA databases was fixed.
67
68Fixed a bug that caused query sequence not to be displayed with -m 10.
69
70Simple optimization in dropnfa.c improves performance 10%.
71
72>>Sept. 1, 1998
73
74(no version change)
75
76Modified nxgetaa.c to recognize "ACGTX" as nucleotides.
77
78>>Sept. 7, 1998
79
80 --> v31t12
81
82Added -z 11 - 15, which use shuffled sequences, rather than real
83sequences to calculate statistical estimates. Because a shuffled
84sequence score is calculated for each sequence score, the search
85process takes twice as long. In this first version, codons are not
86preserved during shuffles, so tfasta/x/y shuffles may not be as
87informative as they should be.
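
To make the codon remark concrete, the Python sketch below contrasts a
plain residue shuffle, which keeps composition but scrambles codons, with
a hypothetical codon-preserving shuffle. Neither function is taken from
the package; the names and the handling of a trailing partial codon are
assumptions for illustration.

```python
import random

def shuffle_residues(seq, rng):
    """Uniform shuffle: composition is kept, codon structure is not."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)

def shuffle_codons(dna, rng):
    """Hypothetical alternative: permute whole codons, leaving any tail in place."""
    cut = len(dna) - len(dna) % 3
    codons = [dna[i:i + 3] for i in range(0, cut, 3)]
    rng.shuffle(codons)
    return "".join(codons) + dna[cut:]

rng = random.Random(0)     # seeded only so that runs are repeatable
dna = "ATGGCCAAGTTTGA"
print(shuffle_residues(dna, rng))
print(shuffle_codons(dna, rng))
```

Both shuffles keep the nucleotide composition, but only the second keeps
the set of codons, so a translated comparison against it sees the same
amino acids in a different order.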
88
89Also fix a problem with prss3 shuffles.
90
91>>Sept. 14, 1998
92
93 (no version change; previous version not released)
94
Corrected bugs in tfastx3/tfasty3 caused by using the -3 option with
or without -i. With the bug fixes, "-3" and "-3 -i" work as expected:
"-3" gives the forward three frames, while "-3 -i" gives the reverse
three frames.
99
100In addition, tfasta3/tfasta3_t was upgraded to perform the same way
101that tfastx/y3 does - i.e. a search with "-i -3" searches only frames
1024,5, and 6, while "-3" searches only frames 1, 2, and 3.
103
104>>Sept. 29, 1998
105
106 --> v31t13
107
108Corrected bugs in dropfx.c that were corrected in fasta30 last May,
109but lingered in fasta31. Also included code to ensure that tfastx/y
110alignments against long introns would not overrun the alignment
111buffer. Instead of overrunning the buffer, the message: ***aligment
112truncated *** is displayed.
113
114
readme.v32t0
1
2FASTX/Y and FASTA (DNA) are now half as fast, because the programs now
3search both the forward and reverse strands by default.
4
5The documentation in fasta3x.me/fasta3x.doc has been substantially
6revised.
7
23>>October 20, 1999
24(no version change)
25
26Modify nxgetaa.c/nmgetaa.c to recognize 'N' as a possible DNA character.
27
28>>October 9, 1999
29 --> v32t08 (no version number change)
30
Added "-M low-high" option, where low and high are inclusion limits
for library sequences. If a library sequence is shorter than "low" or
longer than "high", it will not be considered in the search. Thus,
"-M 200-250" limits the database search to proteins between 200 and
250 residues in length. This should be particularly useful for fasts3
and fastf3. "-M -500" searches library sequences < 500; "-M 200-"
searches sequences > 200. This limit applies only to protein
sequences.
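
The three forms of the -M argument can be illustrated with a short Python
sketch. The function names are hypothetical, and the bounds are treated as
inclusive, following the "shorter than low or longer than high" wording;
whether the one-sided forms are strict at the boundary is not settled by
the text above.

```python
def parse_length_limits(arg):
    """Parse a '-M' argument such as '200-250', '-500', or '200-'."""
    low_text, sep, high_text = arg.replace(" ", "").partition("-")
    if not sep:
        raise ValueError("expected low-high, -high, or low-")
    low = int(low_text) if low_text else None
    high = int(high_text) if high_text else None
    return low, high

def within_limits(length, low, high):
    """Inclusive test, treating a missing bound as unlimited."""
    return (low is None or length >= low) and (high is None or length <= high)

low, high = parse_length_limits("200-250")
print([n for n in (150, 200, 225, 250, 300) if within_limits(n, low, high)])
```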
39
40Modified scaleswn.c to fall back to maximum likelihood estimates of
41lambda, K rather than mean/variance estimates. (This allows MLE
42estimation to be used instead of proc_hist_n when a limited range of
43scores is examined.)
44
45>>October 2, 1999
46 --> v32t08
47
48Many changes:
49
50(1) memory mapped (mmap()ed) database reading - other database reading fixes
51(2) BLAST2 databases supported
52(3) true maximum likelihood estimates for Lambda, K
53(4) Misc. minor fixes
54
55(1) (Sept. 26 - Oct. 2, 1999) Memory mapped database access.
56It is now possible to use mmap()ed access to FASTA format databases,
57if the "map_db" program has been used to produce an ".xin" file. If
58USE_MMAP is defined at compile time and a ".xin" file is present, the
59".xin" will be used to access sequences directly after the file is
60mmap()ed. On my 4-processor Alpha, this can reduce elapsed time by
6150%. It is not quite as efficient as BLAST2 format, but it is close.
62
63Currently, memory mapping is supported for type 0 (FASTA), 5
64(PIR/GCG ascii), and 6 (GCG binary). Memory mapping is used if a
65".xin" file is present. ".xin" files are created by the new program
66"map_db". The syntax for "map_db" is:
67
68 map_db [-n] "/dir/database.fa"
69
70which creates the file /dir/database.fa.xin. Library types can be
71included in the filename; thus:
72
73 map_db -n "/gcggenbank/gb_om.seq 6"
74
75would be used for a type 6 GCG binary file.
76
77The ".xin" file must be updated each time the database file changes.
78map_db writes the size of the database file into the ".xin" file, so
79that if the database file changes, making the ".xin" offset
80information invalid, the ".xin" file is not used. "list_db" is
81provided to print out the offset information in the ".xin" file.
82
83(Oct 2, 1999) The memory mapping routines have been changed to
84allow several files to be memory mapped simultaneously. Indeed, once a
85database has been memory mapped, it will not be unmap()ed until the
86program finishes. This fixes a problem under Digital Unix, and should
87make re-access to mmap()ed files (as when displaying high scores and
88alignments) much more efficient. If no more memory is available for
89mmap()ing, the file will be read using conventional fread/fgets.
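
The two ideas above, rejecting a stale index by comparing file sizes and
falling back to conventional reads when mapping fails, can be sketched in
Python. This is an illustration only: it does not read the real ".xin"
layout, and both function names and the recorded_size parameter are
hypothetical.

```python
import mmap
import os

def index_is_current(db_path, recorded_size):
    """True if the database file still has the size stored with its index."""
    return os.path.getsize(db_path) == recorded_size

def open_library(db_path):
    """Return (buffer, mapped): a read-only mmap if possible, else the bytes."""
    with open(db_path, "rb") as handle:
        try:
            return mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ), True
        except (OSError, ValueError):
            # Mapping failed (for example, no memory or an empty file):
            # fall back to a conventional read of the whole file.
            handle.seek(0)
            return handle.read(), False
```

A size comparison is cheap but only catches changes that alter the file
length, which is why the index must still be rebuilt whenever the
database file is updated.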
90
(Oct 2, 1999) The names of the database reading functions have been
changed to allow both Blast1.4 and Blast2.0 databases to be read. In
addition, Makefile.common now includes an option to link both
ncbl_lib.o and ncbl2_lib.o, which provides support for both libraries.
However, Blast1.4 support has not been tested.
96
97The Makefile structure has been improved. Each architecture specific
98Makefile (Makefile.alpha, Makefile.linux, etc) now includes
99Makefile.common. Thus, changes to the program structure should be
100correct for all platforms. "map_db" and "list_db" are not made with
101"make all".
102
103The database reading functions in nxgetaa.c can now return a database
104length of 0, which indicates that no residues were read. Previously,
1050-length sequences returned a length of 1, which were ignored.
106Complib.c and comp_thr.c have changed to accommodate this
107modification. This change was made to ensure that each residue,
108including the last, of each sequence is read.
109
110Corrected bug in nxgetaa.c with FASTA format files with very long
111(>512 char) definition lines.
112
113(2) (September 20, 1999) BLAST2 format databases supported
114
115This release supports NCBI Blast2.0 format databases, using either
116conventional file reading or memory mapped files. The Blast2.0 format
117can be read very efficiently, so there is only a modest improvement in
118performance with memory mapping. The decision to use mmap()'ed files
119is made at compile time, by defining USE_MMAP. My thanks to Eamonn
120O'Toole of DEC/Compaq, and Daryl Madura of Sun Microsystems, for
121providing mmap()'ed modifications to fasta3. On my machines, Blast2.0
122format reduces search time by about 30%. At the moment, ambiguous DNA
123sequences are not decoded properly.
124
125(3) (September 30, 1999) A new statistical estimation option is
126available. -z 2 has been changed from ln()-scaling, which never
127should have been used, to scaling using Maximum Likelihood Estimates
128(MLEs) of Lambda and K. The MLE estimation routines were written by
129Aaron Mackey, based on a discussion of MLE estimates of Lambda and K
130written by Sean Eddy. The MLE estimation examines the middle 95% of
131scores, if there are fewer than 10000 sequences in the database;
132otherwise it excludes (censors) the top 250 scores and the bottom 250
133scores. This approach seems to effectively prevent related sequences
134from contaminating the estimation process. As with -z 1, -z 12 causes
135the program to generate a shuffled sequence score for each of the
136library sequences; in this case, no censoring is done. If the
137estimation process is reliable, Lambda and K should not vary much with
138different queries or query lengths. Lambda appears not to vary much
139with the comparison algorithm, although K does.
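
The censoring rule described in (3) can be written down in a few lines of
Python. This sketch covers only the selection of scores, not the MLE fit
itself; the function name is hypothetical, and the way the 2.5% tails are
rounded for small databases is an assumption.

```python
def censor_scores(scores, small_db=10000, tail=250):
    """Drop the extreme scores before fitting Lambda and K."""
    ordered = sorted(scores)
    n = len(ordered)
    if n < small_db:
        cut = int(round(n * 0.025))   # keep the middle 95%
    else:
        cut = tail                    # drop the top 250 and bottom 250
    return ordered[cut:n - cut]

print(len(censor_scores(range(1000))), len(censor_scores(range(20000))))
```

At exactly 10000 sequences the two rules agree, since 250 scores is 2.5%
of the library at each end.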
140
(4) Minor changes include fixes to some of the alignment display routines,
individual copies of the pstruct structure for each thread, and some
changes to ensure that every last residue in a library is available
for matching (sometimes the last residue could be ignored). This
version has undergone extensive testing with high-throughput sequences
to confirm that long sequences are read properly. Problems with
fastf3/fasts3 alignment display have also been addressed.
148
149>>August 26, 1999 (no version change - not released)
150
151Corrected problem in "apam.c" that prevented scoring matrices from
152being imported for [t]fasts3/[t]fastf3.
153
154>>August 17, 1999
155 --> v32t07
156
157Corrected problem with opt_cut initialization that only appeared
158with pvcomp* programs.
159
Improved calculation of FASTA optcut threshold for DNA sequence
comparison for match scores much less than +5 (e.g. +3). The previous
optcut threshold was too high when the match score was < 4 and
ktup=6; it is now scaled more appropriately.
164
165Optcut thresholds have also been raised slightly for
166fastx/y3/tfastx/y3. This should improve performance with minimal
167effects on sensitivity.
168
169>>July 29, 1999
170(no version change - date change)
171
172Corrected various uninitialized variables and buffer overruns
173detected.
174
175>>July 26, 1999 - new distribution
176(no version change - v32t06, previous version not released)
177
178Changed the location of "(reverse complement)" label in tfasta/x/y/s/f
179programs.
180
181Statistical calculations for tfasta/x/y in unthreaded version
182corrected. Statistical estimates for threaded and unthreaded versions
183of the tfasta/x/y/s/f programs should be much more consistent.
184
185Substantial modifications in alignment coordinate calculation/
186presentation. Minor error in fastx/y/tfastx/y end of alignment
187corrected. Major problems with tfasta alignment coordinates
188corrected. tfasta and tfastx/y coordinates should now be consistent.
189
190Corrected problem with -N 5000 in tfasta/x/y3(_t) searches encountered
191with long query sequences.
192
193Updated pthr_subs.c/Makefile.linux to increase the pthreads stacksize
194to try to avoid "cannot allocate diagonal arrays" error message.
195Pthreads stacksize can be changed with RedHat 6.0, but not RedHat 5.2,
196so Makefile.linux uses -DLINUX5 for RedHat5.* (no pthreads stack size).
197I am still getting this message, so it has not been completely
198successful. Makefile.linux now uses -DALLOCN0 to avoid this problem,
199at some cost in speed.
200
201The pvcomp* programs have been updated to work properly with
202forward/reverse DNA searches. See readme.pvm_3.2.
203
204>>July 7, 1999 - not released
205 --> v32t06
206
207Corrected bug in complib.c (fasta3, fastx3, etc) that caused core
208dumps with "-o" option.
209
210Corrected a subtle bug in fastx/y/tfastx/y alignment display.
211
212>>June 30, 1999 - new distribution
213(no version change)
214
215Corrected doinit.c to allow DNA substitution matrices with -s matrix
216option.
217
218Changed ".gbl" files to ".h" files.
219
220>>June 2 - 9, 1999 - new distribution
221(no version change)
222
Added additional DNA lambda/K/H to alt_parms.h. Corrected some
other problems with those tables for the case where (inf,inf)
gap penalties were not included.
226
227Fixed complib.c/comp_thr.c error message to properly report filename
228when library file is not found.
229
230Included approximate Lambda/K/H for BL80 in alt_parms.h.
231BL80 scoring matrix changed from 1/3 bit to 1/2 bit units.
232
233Included some additional perl files for searchfa.cgi, searchnn.cgi
234in the distribution (my-cgi.pl, cgi-lib.pl).
235
236>>May 30, 1999, June 2, 1999 - new distribution
237(no version number change)
238
239Added Makefile.NetBSD, if !defined(__NetBSD__) for values.h. Changed
240zs_to_E() and z_to_E() in scaleswn.c to correctly calculate E() value
241when only one sequence is compared and -z 3 is used.
242
243>>May 27, 1999
244(no version number change)
245
246Corrected bug in alignment numbering on the % identity line
247 27.4% identity in 234 aa (101-234:110-243)
248for reverse complements with offset coordinates (test.aa:101-250)
249
250>>May 23, 1999
251(no version number change)
252
253Correction to Makefile.linux (tgetaa.o : failed to -DTFAST).
254
255>>May 19, 1999
256(no version number change)
257
258Minor changes to pvm_showalign.c to allow #define FIRSTNODE 1.
259Changes to showsum.c to change off-end reporting. (Neither of these
260changes is likely to affect anyone outside my research group.)
261
262>>May 12, 1999
263 --> v32t05
264
265Fixed a serious bug in the fastx3/tfastx3 alignment display which
266caused t/fastx3 to produce incorrect alignments (and incorrectly low
267percent identities). The scores were correct, but the alignment
268percent identities were too low and the alignments were wrong.
269
270Numbering errors were also corrected in fastx3/tfastx3 and
271fasty3/tfasty3 and when partial query sequences were used.
272
273>>May 7, 1999
274
275Fixed a subtle bug in dropgsw.c that caused do_work() to calculate
276incorrect Smith-Waterman scores after do_walign() had been called.
277This affected only pvcompsw searches with the "-m 9" option.
278
279>>May 5, 1999
280
281Modified showalign.c to provide improved alignment information that
282includes explicitly the boundaries of the alignment. Default
283alignments now say:
284
285Smith-Waterman score: 175; 24.645% identity in 211 aa overlap (5:207-7:207)
286
287>>May 3, 1999
288
289Modified nxgetaa.c, showsum.c, showbest.c, manshowun.c to allow a
290"not" superfamily annotation for the query sequence only. The
291goal is to be able to specify that certain superfamily numbers be
292ignored in some of the search summaries. Thus, a description line
293of the form:
294
295>GT8.7 | 40001 ! 90043 | transl. of pa875.con, 19 to 675
296
297says that GT8.7 belongs to superfamily 40001, but any library
298sequences with superfamily number 90043 should be ignored in any
299listing or summary of best scores.
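
A minimal Python sketch of how a description line of this form might be
split into its parts. The function name and the returned layout are
illustrative assumptions, not taken from nxgetaa.c, and the sketch
assumes exactly the three '|'-separated fields shown above.

```python
def parse_superfamily_header(line):
    """Split '>name | sf ! not_sf | description' into its parts (hypothetical helper)."""
    # Drop the leading '>' and split into at most three '|'-separated fields.
    name, sf_field, description = [f.strip() for f in line.lstrip(">").split("|", 2)]
    # Numbers before '!' are the superfamilies the query belongs to;
    # numbers after '!' are the superfamilies to ignore in summaries.
    member_part, _, ignore_part = sf_field.partition("!")
    member = [int(x) for x in member_part.split()]
    ignored = [int(x) for x in ignore_part.split()]
    return name, member, ignored, description


name, member, ignored, desc = parse_superfamily_header(
    ">GT8.7 | 40001 ! 90043 | transl. of pa875.con, 19 to 675"
)
print(name, member, ignored, desc)
```

A line without a '!' simply yields an empty "ignored" list.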
300
301In addition, it is now possible to make a fasta3r/prcompfa, which is
302the converse of fasta3u/pucompfa. fasta3u reports the highest scoring
303unrelated sequences in a search using the superfamily annotation.
304fasta3r shows only the scores of related sequences. This might be
305used in combination with the -F e_val option to show the scores
306obtained by the most distantly related members of a family.
307
308>>April 25, 1999
309
310 -->v32t04 (not distributed)
311
312Modified nxgetaa.c to remove the dependence of tgetaa.o on TFASTA
313(necessary for a more rational Makefile structure). No code changes.
314
315>>April 19, 1999
316
317Fixed a bug in showalign.c that displayed incorrect alignment coordinates.
318(no version number change).
319
320>>April 17, 1999
321
322 --> v32t03
323
324A serious bug in DNA alignments when the sequence has been broken into
325multiple segments that was introduced in version fasta32 has been
326fixed. In addition, several minor problems with -z 3 statistics on
327DNA sequences were fixed.
328
329Added -m 9 option, which unfortunately does different things in
330pvcompfa/sw and fasta3/ssearch3. In both programs, -m 9 provides the
331id's of the two sequences, length, E(), %_ident, and start and end of
332the alignment in both sequences. pvcompfa/sw provides this
333information with the list of high scoring sequences. fasta3/ssearch3
334provides the information in lieu of an alignment.
335
336>>March 18, 1999
337
338 --> v32t02
339
340Added information on the algorithm/parameter description line to
341report the range of the pam matrices. Useful for matrices like
342MD_10, _20, and _40 which require much higher gap penalties.
343
344>>March 13, 1999 (not distributed)
345
346 --> v32t01
347
348 -r results.file has been changed to -R results.file to accommodate
349 DNA match/mismatch penalties of the form: -r "+1/-3".
350
351>>February 10, 1999
352
353Modify functions in scalesw*.c to prevent underflow after exp() on
354Alpha Linux machines. The Alpha/LINUX gcc compiler is buggy and
355doesn't behave properly with "denormalized" numbers, so "gcc -g -m
356ieee" is recommended.
357
358Add "Display alignments also (y/n)[n] "
359
360pvcomplib.c again provides alignments!! In addition, there is a
361new "-m 9" option, which reports alignments as:
362
363>>>/home/wrp/slib/hlibs/hum0.aa#5>HS5 gi:1280326 T-cell receptor beta chain 30 aa, 30 aa vs /home/wrp/slib/hlibs/hum0.seg library
364HS5 30 HS5 30 1.873e-11 1.000 30 1 30 1 30
365HS5 30 HS2249 40 1.061e-07 0.774 31 1 30 7 37
366HS5 30 HS2221 38 1.207e-07 0.833 30 1 30 7 35
367HS5 30 HS2283 40 1.455e-07 0.774 31 1 30 7 37
368HS5 30 HS2239 38 1.939e-07 0.800 30 1 30 7 35
369
370where the columns are:
371
372query-name q-len lib-name lib-len E() %id align-len q-start q-end l-start l-end
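
The column layout above can be read with a few lines of Python. This is
only a sketch: the M9Hit record and parse_m9_line() are hypothetical
names, and the code assumes whitespace-separated fields in exactly the
order listed. Note that the "%id" column in the sample output is a
fraction (0.774), not a percentage.

```python
from collections import namedtuple

# Field names follow the column list in the text; the record type is hypothetical.
M9Hit = namedtuple(
    "M9Hit",
    "query q_len lib lib_len e_value frac_id align_len q_start q_end l_start l_end",
)


def parse_m9_line(line):
    """Parse one whitespace-separated '-m 9' summary line into an M9Hit."""
    f = line.split()
    if len(f) != 11:
        raise ValueError("expected 11 columns, got %d" % len(f))
    return M9Hit(f[0], int(f[1]), f[2], int(f[3]), float(f[4]), float(f[5]),
                 int(f[6]), int(f[7]), int(f[8]), int(f[9]), int(f[10]))


hit = parse_m9_line("HS5 30 HS2249 40 1.061e-07 0.774 31 1 30 7 37")
print(hit.lib, hit.e_value, hit.l_start, hit.l_end)
```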
373
374>>February 9, 1999
375
376Corrected bug in showalign.c that offset reverse complement alignments
377by one.
378
379>>February 2, 1999
380
381Changed the formatting slightly in showbest.c to have columns line up better.
382
383>>January 11, 1999
384
385Corrected some bugs introduced into fastf3(_t) in the previous version.
386
387>>December 28, 1998
388
389Corrected various problems in dropfz.c affecting alignment scores
390and coordinates.
391
392Introduced a new program, fasts3(_t), for searching with peptide
393sequences.
394
395>>November 11, 1998
396
397 --> v32t0
398
399Added code to correct problems with coordinate number in long library
400sequences with tfastx/tfasty. With this release, sequences should be
401numbered properly, and sequence numbers count down with reverse
402complement library sequences.
403
404In addition, with this release, fastx/y and tfastx/y translated
405protein alignments are numbered as nucleotides (increasing by 3,
406labels every 30 nucleotides) rather than codons.
407
408
readme.v33t0
1
2 $Id: readme.v33t0 342 2010-06-28 19:57:56Z wrp $
3 $Revision: $
4
5================ readme.v33t0 ================
6
7This release includes an MPI implementation of the parallel
8library-vs-library comparison code. See readme.mpi_3.3 and
9readme.pvm_3.3 for more information.
10
11=====
12>>July 9, 2001
13
14Considerable changes to support no-global library functions.
15
16(1) Separate ascii/sequence mapping arrays are used by the
17 query-reading (qascii), library-reading (lascii), and sequence
18 comparison function (pascii) routines. As a result, there is no
19 longer a need for tgetlib.o/lgetlib.o - lgetlib.o can serve both
20 functions.
21
22(2) This also allows us to remove all #ifdef TFAST/FASTX conditionals
23 from complib.c/comp_thr.c/p2_complib.c. We no longer need
24 tcomp_thr.o, comp_thrx.o, etc. We still have a variety of
25 p2_complib.o variations to support the different c34.work* files.
26
27(3) Because non-global openlib/getlib functions are available, exactly
28 the same open/get functions are available for reading both the
29 query and reference libraries in pv34comp* programs. The
30 host-specific openlib/getlib functions in hxgetaa.c are now
31 provided by nmgetlib.c, etc. This has two effects:
32
33 (a) it is now possible to compare a query database generated by an
34 SQL query to a library database generated by a different SQL
35 query.
36
37 (b) pv34comp* has lost (at least in this version) the ability to
38 automatically detect the query sequence type. To search with a
39 DNA query, you MUST use "-n".
40
41(4) the resetp() function is now responsible for almost all of the
42 function-specific (TFAST/FASTX/etc) initializations. All of the
43 function-specific code has been removed from complib.c/comp_thr.c
44 and most of it has been moved to initfa.c/resetp().
45
46(5) manageacc.c has been merged into compacc.c (mostly prhist()).
47
48(6) Although it may reflect a subtle bug in my code, it is not
49 possible to reliably run threaded/memory mapped versions of the
50 fasta34_t code. I have spent considerable time tracking down the
51 problem, and have determined that, in threaded code, something
52 happens during the thread initialization to corrupt the
53 description offset information used when files are memory mapped.
54 This never occurs when the unthreaded versions of the code are
55 used. And it does not occur under MacOSX, Compaq Tru64Unix, Sun
56 Solaris/Sparc, or SGI IRIX.
57
58 Thus, I cannot recommend using the threaded code versions (_t)
59 under Linux (RH6.2 or 7.1).
60
61=====
62>>June 1, 2001
63
64Many changes to accommodate a new - no global variable - strategy for
65reading sequence databases. Every time a file is opened, a struct
66lmf_str is allocated which can be used for memory mapped files, ncbl2
67files, and mysql files.
68
69In addition, an open'ed file has a default sequence type: DNA or
70protein, or one can open a file in a mode that will allow the sequence
71type to be changed.
72
73=====
74>>May 18, 2001 CVS: fa33t09d0
75
76A new compile time parameter - -DGAP_OPEN, is available to change the
77definition of the "-f gap-open" parameter from the penalty for the
78first residue in a gap to a true gap-open penalty, as is used in BLAST
79and many other comparison algorithms. This will probably become the
80default for fasta in version 3.4.
81
82Fixes to conflicts between "-S" and "-s matrix". When a scoring
83matrix file was specified, lower-case alignments were not displayed
84with -S (although the scores were calculated properly).
85
86More extensive testing of mysql_lib.c (mySQL query-libraries) with
87the pv4comp* and mp4comp* programs.
88
89=====
90>>April 5, 2001 CVS: fa33t08d4b3
91
92Changes in nmgetlib.c and ncbl2_mlib.c to return long sequence
93descriptions for PCOMPLIB (pv4/mp3comp*). Also fix p2_complib.c to
94request DNA library for translated comparisons.
95
96Fix for prss33(_t) to read both sequences from stdin.
97
98=====
99>>March 27, 2001 CVS: fa33t08d4
100 --> fa33t08d4
101
102Problems in ncbl2_mlib.c found searching NCBI non-redundant nucleotide
103database "nt" were fixed. Testing revealed a minor memory leak, which
104was fixed by modifying showbest.c, showalign.c, comp_thr.c, complib.c,
105and p2_complib.c to remember the last opened database file more
106effectively.
107
108Modifications to allow 64-bit fseek/ftell on machines like Sun and
109Linux/Intel that support -D_FILE_OFFSET_BITS=64, -D_LARGEFILE_SOURCE,
110off_t, fseeko(), and ftello(); these are enabled with the option
111-DUSE_FSEEKO. Machines with 64-bit long's do not need this option.
112Machines with 32-bit long's can access files >2 Gb through the 64-bit
113file access functions fseeko() and ftello(), which work with off_t
114file offsets instead of long's.
115
116=====
117>>March 3, 2001 CVS: fa33t08d2
118
119Corrected problems in nmgetaa.c and mysql_lib.c with parallel
120programs, and one serious problem with alternate DNA scoring matrices
121(initfa.c, initsw.c) not being set properly. A subtle problem with
122the merge of scaleswn.c and scaleswg.c is fixed.
123
124>>February 17, 2001
125
126Modified mysql_lib.c to use "#", rather than "%ld", to indicate the
127position of the GID. This change was made because sprintf() cannot be
128used reliably to generate an SQL string, as '"' and '%' are used in
129such strings.
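
A short Python sketch of the placeholder idea (mysql_lib.c itself is C
and may do this differently; fill_gid() is a hypothetical name). Plain
text substitution of '#' leaves any '%' or '"' characters in the SQL
untouched, which is the property a printf-style format string cannot
guarantee. The sketch assumes '#' appears in the query only as the GID
placeholder.

```python
def fill_gid(sql_template, gid):
    """Replace the '#' placeholder with an integer GID (hypothetical helper)."""
    # int() guards against anything other than a numeric identifier
    # being pasted into the query text.
    return sql_template.replace("#", str(int(gid)))


query = fill_gid("select gid, sequence from proteins where gid=#;", 12345)
print(query)

# '%' and '"' in the template survive unchanged.
like_query = fill_gid('select gid from proteins where description like "%kinase%" and gid=#;', 7)
print(like_query)
```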
130
131=====
132>>January 17, 2001
133(no version change, date change)
134
135Minor fixes to initfa.c, initsw.c to deal with DNA scoring matrices
136properly. "-n -s dna.mat" is required for the sequence/matrix to be
137recognized as DNA.
138
139>>January 16, 2001
140-->v34t00
141
142Merge of the main CVS trunk - fa33t06 with the latest release branch,
143fa33t08.
144
145In addition, PCOMPLIB mods have been made to mysql_lib.c. Because
146p2_complib.c gets sequence description information during the first
147read of the database, the mysql_query must be changed to return:
148result[0]=GID, result[1]=description, result[2]=sequence. In the
149PCOMPLIB case, the other SQL queries (for GID description, sequence)
150are not necessary but must still be provided.
151
152=====
153>>January 16, 2001
154(no version change, previous version not released)
155
156changes to p2_complib.c to correct openlib() incompatibility.
157
158changes to nmgetaa.c, ncbl2_lib.c to incorporate PCOMPLIB. nxgetaa.c
159removed.
160
161=====
162>>January 12, 2001
163(no version change, previous version not released)
164
165Change to initfa.c to move ktup check from query_parm() to last_init().
166
167=====
168>>January 10, 2001
169--> v33t08
170
171Fixes to complib.c, comp_thr.c to deal properly with long query
172protein sequences when a short library chunk (e.g. -N 5000) was given.
173In the case where the chunk size is too short, it will be reset to a
174length which allows the search to proceed, by including an amount of
175new sequence that is equal to the amount of overlap sequence.
176
177scaleswn.c and scaleswg.c have been merged.
178
179v33t08 includes the initial implementation for mySQL described below
180for v33t07x.
181
182======
183>>Dec. 20, 2000
184--> v33t07x
185
186Initial implementation of a syntax for mySQL database queries. A new
187file, mysql_lib.c has been added, and changes have been made to
188nmgetaa.c (which should now replace nxgetaa.c) and altlib.h. A mySQL
189database search needs a file with 4 parts:
190
191(1) description of the database, user, password
192(2) a select statement that generates the set of protein sequences
193 as: UID, sequence
194(3) a select statement that generates a UID, description given a UID
195(4) a select statement that generates a single UID, sequence given a UID
196
197Each of the four parts should be separated by ';'. For example, in
198the database that we are using for testing, a file "demo.sql" that
199contains:
200
201================
202localhost taxonomy username secret;
203SELECT proteins.gid, proteins.sequence FROM proteins,swissprot WHERE proteins.gid=swissprot.gid AND swissprot.spid IS NOT NULL;
204select proteins.gid, concat(swissprot.spid," ",proteins.description) from proteins,swissprot where proteins.gid=%ld AND swissprot.gid=proteins.gid;
205select gid, sequence from proteins where gid=%ld;
206================
207
208will find all the proteins in the BLAST "nr" database that also have
209SwissProt ID's when given the command line:
210
211 fasta33 -q query.aa "demo.sql 16"
212
213At least for simple queries, there is surprisingly little overhead for the
214search. For more complex queries involving several tables, the overhead
215can be significant.
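
For illustration, the four-part layout of such a file can be split as
in the Python sketch below. The function and key names are assumptions
made for this example, the demo text uses shortened queries, and the
naive split on ';' assumes that no ';' occurs inside a statement.

```python
def split_sql_library_file(text):
    """Split a mySQL library file into its four ';'-separated parts."""
    parts = [p.strip() for p in text.split(";") if p.strip()]
    if len(parts) != 4:
        raise ValueError("expected 4 parts separated by ';', found %d" % len(parts))
    return {
        "connection": parts[0].split(),    # tokens of the first (connection) line
        "select_all": parts[1],            # generates UID, sequence for every entry
        "select_description": parts[2],    # UID, description for one UID
        "select_sequence": parts[3],       # UID, sequence for one UID
    }


demo = """localhost taxonomy username secret;
SELECT proteins.gid, proteins.sequence FROM proteins;
select gid, description from proteins where gid=%ld;
select gid, sequence from proteins where gid=%ld;
"""
lib = split_sql_library_file(demo)
print(lib["connection"])
print(lib["select_sequence"])
```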
216
217At the moment, libraries that need the functions in mysql_lib.c will
218use library type 16. We may also use file type 17 for SQL queries
219that return binary sequences.
220
221This implementation of mysql_lib.c was written to require a minimal
222amount of change to the other programs. Only nmgetaa.c and altlib.h
223needed to be changed to incorporate this new capability. One result
224of this limitation is that one cannot mix mySQL database queries with
225other databases in the same search. Eventually, I would like to make
226a mySQL database like any other, so that several mysql database
227queries could be searched in the same run, and mysql databases could
228be mixed with other (flat file) databases, but this will require some
229changes in the function calls throughout the code. (Right now, the
230various programs do not distinguish between an openlib() that is made
231before searching a large database, and one before retrieving a single
232sequence. This must be changed for a database query like mySQL to
233behave like other databases.)
234
235Several mySQL demo files have been provided: mysql_demo*.sql.
236
237(10 January 2001) The mySQL code has been tested on Intel Linux and
238Compaq/Alpha/Tru64 Unix.
239
240>>Dec. 9, 2000
241
242Changes to apam.c to tie different default gap penalties to
243alternate scoring matrices. In addition, changes to apam.c, to deal
244with user-specified matrices with or without '*'.
245
246>>Nov. 5, 2000 (date updated)
247
248pst.dnaseq can now have 3 values: -1 or 0 -> protein, 1 -> DNA, and 2 -> other.
249This becomes important for things like init_karlin_a, which needs a
250background frequency of residues.
251
252>>Nov. 1, 2000
253
254Significant bug fixes for the -z 6/-z 16 option. An uninitialized
255variable was fixed in karlin.c, and comp_thr.c did not pass the
256correct composition argument type in find_zp(). The -z 6/16 option
257has now been tested and works correctly on Alphas, Linux x86, SGI, Sun
258and Mac OSX. Another problem was fixed in scaleswn.c (simplex()) that
259prevented the code from being reused by the pv4/mp4 complib programs.
260
261>>Oct. 9, 2000
262
263Several changes made to accommodate Mac OSX. Longer lists of superfamily
264numbers now supported in p[su]4comp/m[su]4comp programs.
265
266>>Sept 25, 2000
267
268All global variables have been removed from scaleswn.c. The last to
269go, db_struct db, required many edits, because until now, the fasta
270programs have kept two versions of the db_struct data (entries,
271length). One version was kept by the main program, which updated entry
272number and db length as sequences were read; a second copy of this
273information was kept by the statistical estimation routines. Now
274there is only one copy, which means that the E() values will be a
275function of the complete database, not the database with some high
276scoring sequences removed.
277
278>>Sept 23, 2000
279
280Continued removal of global variables from scaleswn.c. Only one
281global is left, db_struct db, which contains the number of entries in
282the database and the number of residues. It will be the next to go
283(changing all the zs_to_*() functions) and scaleswn.c will be free
284of globals. scaleswg.c is gone - scaleswn.c compiles to scaleswg.o
285with -DNORMAL_DIST.
286
287>>Sept 20, 2000
288
289Removal of histogram globals required changes in p2_complib.c as well.
290p_complib.c has not been updated. scaleswg.c has been modified to
291reflect the new histogram strategy.
292
293>>Sept 19, 2000
294
295Substantial changes to remove globals for printing histogram. m_msg
296now contains a hist_str, which keeps histogram information.
297
298>>Sept. 19, 2000
299(no version change, previous version not released)
300
301Correct bug introduced into scaleswn.c (inithist()) by changing
302score2_sums[], score_sums[] from int to double.
303
304Reporting of version numbers is more consistent between fasta33,
305fasta33_t, and pv4compfa/mp4compfa. The programs now report the same
306numbers/dates in similar places.
307
308>>Sept. 15, 2000
309--> v33t07
310
311Changes to fix problems with statistical estimates when a large
312fraction (but not all) of the database is related. Several users
313reported problems when searching with rRNA genes with version 33t06.
314In some cases, a 100% identical match over 1500 nt would not be
315statistically significant against a search of the bacterial division
316of Genbank. This problem was not seen with some releases of v33t05.
317
318The cause of the problem was a change between v33t05 and v33t06 to
319allow scoring matrices with unusual scaling to be used. In v33t05,
320there was a line that excluded all scores > 300 from the statistical
321estimation procedure. While 300 is a high score with any "normal"
322scoring matrix, some investigators were using matrices scaled 10X, so
323that a score of 300 was really a score of 30 with a conventional
324matrix, and should not be excluded. Unfortunately, removing the test
325to exclude scores > 300 meant that when a rRNA sequence was used to
326search the bacterial division, tens of thousands of high scoring
327related sequences were treated as if they were unrelated, with the
328result that the variance estimates were much too high, and thus high
329real scores had low z-scores, and thus were not statistically
330significant. (There appear to be more than 20,000 rRNA sequences in
331the bacterial division of Genbank, almost 25% of all sequences).
332
333The solution to the problem is a substantial enhancement in the
334strategies used to exclude high-scoring, related sequences, the -z 1,
3354, and 5 parameter estimation strategies. The programs now estimate
336the expected high scoring sequence by calculating an ungapped Lambda
337and K, and then use a relatively conservative threshold for excluding
338scores that are higher than would be expected 0.01 times by chance.
339By calculating Lambda and K, we can scale the cutoff thresholds to
340allow scoring matrices with unusual scales. For "normal" searches,
341there should be little change, but there should be an improvement for
342searches with large numbers of related sequences in the database.
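
The idea of a threshold that scales with the matrix can be sketched in
Python using the standard Karlin-Altschul expectation
E = K*m*n*exp(-Lambda*S). This is an illustration only: the function
names, the use of query length times database residues as the search
space, and the numeric values are assumptions, not the code in
scaleswn.c.

```python
import math


def expected_hits(score, lam, k, query_len, db_residues):
    """Karlin-Altschul expectation E = K*m*n*exp(-Lambda*S)."""
    return k * query_len * db_residues * math.exp(-lam * score)


def exclusion_cutoff(lam, k, query_len, db_residues, e_thresh=0.01):
    """Score above which fewer than e_thresh chance hits are expected."""
    return math.log(k * query_len * db_residues / e_thresh) / lam


# Illustrative values only.
cut = exclusion_cutoff(0.3, 0.1, 300, 50000000)
# A matrix scaled 10X has Lambda/10 (K held fixed here for simplicity),
# so the cutoff scales 10X as well instead of staying at a fixed 300.
cut_10x = exclusion_cutoff(0.03, 0.1, 300, 50000000)
print(round(cut, 1), round(cut_10x, 1))
```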
343
344As a result of testing for this change, a bug in the karlin() function
345used with -z 6 was found and corrected.
346
347=======
348>>Sept. 9, 2000
349
350Changes to manshowbest.c to include correct display coordinates.
351
352Significant changes to structs.h, param.h, p2_complib.c,
353p2_workcomp.c, to store and use a reliable a_struct for alignment
354coordinates.
355
356Other cosmetic changes.
357
358>>Sept. 7, 2000
359
360Minor changes to complib.c, showrss.c, so that prss33 -q uses 200
361shuffles and prss33 provides bit scores, rather than z-scores.
362(no version number change).
363
364Modifications to p2_complib.c to include superfamily numbers for
365ps4comp* ms4comp*.
366
367>>Aug 22, 2000
368
369Changes to mmgetaa.c, ncbl2_mlib.c, dropfs.c to accommodate AIX.
37000README.1st updated to reflect the current version and correct
371outdated information on threads.
372
373>>Aug. 3, 2000
374
375Modifications to initpam2() in initsw.c to correct a problem with pam_x
376when the -S option is used.
377
378Modifications to compacc.c, scaleswn.c to ensure that residue numbers
379are calculated properly when more than 2 Gb of sequence is searched.
380
381>>July 12, 2000
382
383Modifications to dropnfa.c so that DNA matches to 'N' will be included
384in the "ungapped %identity". Thus, a sequence that is 100% identical
385for 100 nt on either side of a 100 nt region that has been masked to
386'NNNNN' will be reported as: "67% identical (100% ungapped)". This
387has been added to deal with masked BAC-end databases. It would be
388better if masking changed the letters to lowercase, but the mouse
389BAC-end sequences at TIGR use 'NNNNN'. This is currently available
390only for the fasta function, not [t]fast[x/y], etc, and only for DNA
391sequences.
392
393mk_n_pam() in apam.c modified to ensure that mismatch scores of -1
394remain -1.
395
396>>June 25, 2000
397
398Modification to nxgetaa.c, nmgetaa.c, mmgetaa.c to return Genbank Accession
399number as part of the descriptive string.
400
401>>June 11, 2000
402
403(no version change - not yet released)
404
405Modifications to calcons(), calc_id(), showbest(), p_workcomp.c to
406provide ngap_q (number of alignment gaps in query), ngap_l (number
407of gaps in library) information for -m 9 output.
408
409>>June 6, 2000
410
411(no version change - not yet released)
412
413Modified scaleswn.c to provide better support for unconventional
414scoring matrices, in particular, scoring matrices where every
415value is 50-times higher. Previous versions of the MLE estimator (-z
4162) started with lambda = 0.2, which is too high for a scoring matrix
417going from -500:+1500. The initial estimate for lambda is now
418calculated using the formula: lambda = pi/sqrt(6*variance). For the
419default -z 1, a restriction to limit scores to a maximum of 300 for
420the statistical analysis was removed.
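
As a sketch of why this starting value adapts to the matrix scale, the
Python fragment below computes lambda = pi/sqrt(6*variance) from a list
of scores. The function name, the sample scores, and the use of the
sample (n-1) variance are assumptions for illustration. The constant
pi/sqrt(6) is about 1.28255, the same factor that appears in the
Lambda = 1.28255/sqrt(mean_var) note of March 12, 2000.

```python
import math


def initial_lambda(scores):
    """Starting estimate lambda = pi / sqrt(6 * variance) of the scores."""
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return math.pi / math.sqrt(6.0 * variance)


scores = [28, 35, 31, 40, 33, 29, 37]      # made-up similarity scores
lam = initial_lambda(scores)
# Multiplying every score by 50 divides the estimate by 50.
lam_50x = initial_lambda([50 * s for s in scores])
print(lam, lam_50x)
```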
421
422>>June 3, 2000
423
424Modified alignment output, and -m 9 and -m10, to report an "ungapped"
425identity as well as the traditional "gapped" identity. The
426traditional "gapped" identity reports the number of identities divided
427by the overall length of the alignment, including gaps. The
428"ungapped" identity does not include gaps in the length of the
429alignment. This new value is included for alignments that include
430introns; thus, a tfastx33 search might find the 100% identical genomic
431sequence but report the gapped percent identity if a short intron were
432included in the alignment (the alignment probably would not span a
433long exon) as 66%. The "ungapped" identity would remain 100%. The
434ungapped identity value is also shown in the "-m 9" output line after
435the "gapped" fraction identical.
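
The two definitions can be made concrete with a small Python sketch.
The function name and the toy alignment are invented for this example;
the arithmetic follows the definitions given above (identities divided
by the full alignment length, and identities divided by the number of
columns that contain no gap).

```python
def percent_identities(query_aln, lib_aln, gap="-"):
    """Return (gapped %identity, ungapped %identity) for two aligned strings."""
    if len(query_aln) != len(lib_aln):
        raise ValueError("aligned strings must have equal length")
    pairs = list(zip(query_aln, lib_aln))
    identical = sum(1 for q, l in pairs if q == l and q != gap)
    ungapped_cols = sum(1 for q, l in pairs if q != gap and l != gap)
    gapped = 100.0 * identical / len(pairs)        # gaps count in the length
    ungapped = 100.0 * identical / ungapped_cols   # gap columns are excluded
    return gapped, ungapped


# Toy alignment: a perfect match interrupted by a 5-column gap in the query.
gapped, ungapped = percent_identities("MKVLA-----GHWTR", "MKVLAPPPPPGHWTR")
print(gapped, ungapped)
```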
436
437>>June 1, 2000
438
439Modified -m 9 output to provide fraction identical, alignment boundary
440information with the initial list of high scoring sequences, just as
441the pv3comp and mp_comp versions do. The -m 9 option now shows the
442same alignment display as -m 0, but the width of the alignment is
443increased by 40. Thus, by default, -m 9 will show the list of best
444hits, with percent identity, Smith-Waterman score, and alignment
445boundaries initially, and then show standard (-m 0)
446alignments with 100 residues/line.
447
448>>May 29, 2000
449
450Correct some problems with reading data files with <CR>'s under unix.
451
452nmgetaa.c/nxgetaa.c/mmgetaa.c have been modified to convert <TAB>
453('\t') to <SPC> (' ') in descriptive lines.
454
455=======
456
457>>May 3, 2000
458
459 Corrected problem with very low mean_var in fit_llen() in scaleswn.c.
460
461>>May 2, 2000
462 (no version number change - previous version not released)
463
464 Merged fasta33t05d2 with fasta33t06. Also removed restriction on
465"-M size-range" to proteins - the size range now can be applied to DNA
466as well.
467
468>>May 1, 2000
469 (changes to v33t05d merged into v33t06)
470
471Introduced changes to include '*' as a valid sequence character, which
472indicates termination. Thus, 'TGA', 'TAG', and 'TAA' are now
473translated to '*' rather than 'X', and the protein PAM matrices have
474been modified to provide a match score of approximately 1/2 the max
475identity score for a '*:*' match. Otherwise, '*' is the same as 'X'.
476This change only affects query sequences that include a '*' to
477indicate an end of sequence; the '*' is not there by default.
478
479The inclusion of '*' broke some things in tfasts33, tfastf33, fasty33,
480and tfasty33, which were fixed today.
481
482>>March 28, 2000/April 24, 2000
483 --> v33t06
484
485(a) -z 6 statistics that factor in composition
486(b) -smatrix-offset pam-offset parameter
487
488(a) This release provides a new statistics option, -z 6, which
489provides a more sophisticated model that accounts for sequence
490composition. When -z 6 is used (only for fasta33(_t) and
491ssearch33(_t)), the program calculates a composition parameter
492comp=1/lambda using a modified version of the Karlin-Altschul karlin()
493function. As a result, every sequence in the database has an
494associated length (n1) and composition (comp).
495
496The length n1 and composition comp are used in the maximum likelihood
497estimation described by Mott (1992) Bull. Math. Biol. 54:59-75. Four
498parameters are estimated, a0, a1, a2, and b1, and the probability of
499obtaining a score is then:
500
501p(s >= x) = 1-exp(-exp(-( a0 + a1*comp + a2*comp*log(n0*n1) + x)/(b1*comp)))
502
503The maximum likelihood estimates of a0, a1, a2, and b1 are calculated
504using the Nelder-Mead simplex search strategy.
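
The probability expression above, transcribed term by term into Python
as a sketch. The function name is hypothetical and the parameter values
in the example are made up purely to exercise the formula; real values
come from the maximum likelihood fit.

```python
import math


def p_score_at_least(x, n0, n1, comp, a0, a1, a2, b1):
    """p(s >= x) = 1 - exp(-exp(-(a0 + a1*comp + a2*comp*log(n0*n1) + x)/(b1*comp)))."""
    z = (a0 + a1 * comp + a2 * comp * math.log(n0 * n1) + x) / (b1 * comp)
    # -expm1(-y) equals 1 - exp(-y) but keeps precision when y is tiny.
    return -math.expm1(-math.exp(-z))


# Made-up parameters: query length 200, library length 300, comp = 4.0.
params = dict(n0=200, n1=300, comp=4.0, a0=-10.0, a1=-2.0, a2=-1.0, b1=1.0)
p_low = p_score_at_least(50.0, **params)
p_high = p_score_at_least(100.0, **params)
print(p_low, p_high)
```

With b1*comp positive, the probability falls as the score x rises, as
expected for an extreme value distribution.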
505
506The average Lambda is reported for the search using Lambda =
5071/(b1*ave_comp), where ave_comp is the geometric mean of the comp values
508calculated during the statistical estimates.
509
510The "lambda/comp" calculation can fail for sequences with very biased
511amino acid composition. When this occurs, 'comp' is set to -1.0 (as
512is 'H', the information content parameter) and the 'ave_comp' value is
513used to calculate statistical significance. (But obviously 'ave_comp'
514is not really appropriate, since if the sequence had an average 'comp'
515value, it would have been calculated.) When -z 6 is used, the
516alignment display shows the 'comp' and 'H' values for that library
517sequence.
518
519(b) Scoring matrix offsets - The main reason that the "lambda/comp"
520calculation fails is that, for the particular query/library sequence
521pair, the expected score is not < 0, instead, Sum {p_ij S_ij} >= 0.0.
522This problem is reported to 'stderr' when it occurs. The simplest
523solution to the problem is to provide an offset to the scoring matrix;
524for example, to use Blosum62 - 1, which ranges from +10 to -5, rather
525than the standard +11 to -4. This option used to be available with
526the -S offset option, but -S is now used to specify a lower-case
527seg-ed database. The offset can now be specified as part of the
528scoring matrix name. Thus, "-s BL62-1" uses Blosum62 reduced by 1 at
529each entry. The '-' character is used to indicate an offset, so
530scoring matrix files must not have a '-' in their name.
531Alternatively, "-s BL80+1" or "-s BL80--1" would add one to each value.
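
One way to read the naming rule, sketched in Python. The function name
is hypothetical and this is not the parser in the FASTA sources; it
only reproduces the three examples given above, treating the first '-'
or '+' in the name as the start of the offset.

```python
def parse_matrix_spec(spec):
    """Split a '-s' matrix name such as 'BL62-1' into (name, offset)."""
    for i, ch in enumerate(spec):
        if ch in "+-":
            name, rest = spec[:i], spec[i + 1:]
            value = int(rest)            # 'BL80--1' leaves rest == '-1'
            return name, (-value if ch == "-" else value)
    return spec, 0                       # no offset given


for spec in ("BL62-1", "BL80+1", "BL80--1", "BL50"):
    print(spec, parse_matrix_spec(spec))
```

This also shows why a matrix file name containing '-' cannot be used:
everything after the '-' would be read as an offset.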
532
533nxgetaa.c, nmgetaa.c, and mmgetaa.c have been edited to avoid string
534run-off problems after strncpy().
535
536Fixed problem where positive gap extension penalties in ssearch33
537were not converted to negative values.
538
539>>April 8, 2000
540
541Fixed problem in calculating corrected sequence lengths for
542Altschul-Gish probabilities.
543
544>>March 30, 2000
545 (no version change, date updated to March 30, 2000)
546
547Corrected problem with -m 9 option.
548
549The '*' character is now available to allow translated alignments to
550extend through the termination codon. Thus, if a protein sequence ends
551with a '*', and matches a translated termination codon, the
552score will be increased. The *:* match score is set to 1/2 the max
553positive score for the matrix (see upam.h). This strategy can also be
554used to upweight a match that extends all the way to the end of a
555full-length sequence by putting '*' at the end of both the query and
556library protein sequences. Recognition of '*' will probably become a
557command line option.
558
559>>March 21, 2000
560 (no version change, previous version not distributed)
561
562Changes to map_db.c, list_db.c, and mmgetaa.c to accommodate large
563sequence files. Long (64-bit on some systems) variables are now used
564to specify file and memory position for the memory mapped functions.
565As a result, there are now two *.xin (memory mapped index) file
566formats: MP0, which uses 32-bit longs, and MP1, which uses 64-bit
567longs. On 64-bit machines, MP0 32-bit indices are read properly, but
568limit the database size to 2 or 4 Gb; MP1 64-bit indices allow very
569large databases. Blast2.0 formatdb databases are still limited to
5704Gb. To compile map_db.c to generate 64-bit index files, include the
571compile time option -DBIG_LIB64 in the Makefile. (Currently this
572option has been tested only on the DEC Alpha and SGI platforms, and
573will work only with Unix versions that provide 64-bit longs and 64-bit
574ftell()'s.)
575
576The -R results file now uses sfn_cmp() to report a matching
577superfamily number, if one exists, and '0' otherwise.
578
579>>March 12, 2000
580 (no version change, previous version not distributed)
581
582Provide new strategy for specifying library abbreviations. In
583addition to:
584
585 fasta33 query.aa %anr
586
587one can also specify:
588
589 fasta33 query.aa %pir1+sp+nr
590or
591 fasta33 query.aa +pir1+sp+nr
592or
593 fasta33 query.aa %+pir1+sp+nr
594
595where the + anywhere in the library name string indicates that
596variable length library names, separated by '+', are being used (the
597last '+' is optional). The FASTLIBS file then becomes:
598
599================
600PIR1 Annotated Protein Database (rel 56)$0+pir1+/slib2/blast/pir1.lseg
601NBRF Protein database (complete)$0+nbrf+@/seqlib/lib/NBRF.nam
602NRL_3d structure database$0D/seqlib/lib/nrl_3d.seq 5
603NCBI/Blast non-redundant proteins$0+nr+/slib2/blast/nr.lseg
604NCBI/Blast Swissprot$0+sp+/slib2/blast/swissprot.lseg
605================
606
607The two abbreviation types, single letter and +word+, cannot be
608intermixed, and at least initially, +word+ specifiers are
609case-sensitive (single letter abbreviations are not) and will not be
610available interactively, only on the command line.
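
The rules above can be summarized in a Python sketch. The function name
and the returned "kind" labels are invented for this example, and
treating an argument with neither '%' nor '+' as a plain file name is
an assumption; the FASTA programs may decide this differently.

```python
def parse_library_spec(spec):
    """Classify a library argument and list the abbreviations it names."""
    body = spec[1:] if spec.startswith("%") else spec
    if "+" in body:
        # A '+' anywhere selects variable-length +word+ names; empty pieces
        # from a leading or optional trailing '+' are dropped.
        return "word", [w for w in body.split("+") if w]
    if spec.startswith("%"):
        # Classic form: each character is a single-letter abbreviation.
        return "letter", list(body)
    return "file", [spec]


for spec in ("%anr", "%pir1+sp+nr", "+pir1+sp+nr", "%+pir1+sp+nr"):
    print(spec, parse_library_spec(spec))
```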
611
612Removed 'K' estimate for Expectation_n, Expectation_i fits to the
613distribution of unrelated similarity scores. 'K' cannot be calculated
614from the data available. 'Lambda' can be calculated; it is
6151.28255/sqrt(mean_var), and is still available.
616
617>>March 3, 2000
618 (no version change)
619
620changed Makefile33.common, Makefile.common, to incorporate $(NRAND)
621rather than "rand48". Provide nrandom.c which uses random(), as
622replacement for nrand.c, which uses rand48().
623
624>>February 8, 2000
625 --> v33t05
626
627Fixes to scaleswn.c (proc_hist_ml) to set num_db_entries properly.
628Scaleswn.c also provides Lambda estimates for -z 1/11 (Expectation_n),
629and -z 1/14 (Expectation_i) statistical estimates.
630
631Modifications to calc_id() to correct bug in counting identities.
632Modified showalign() to use calc_id() with -m 9, for simpler
633debugging.
634
635Additional modifications to dropfa*.c files to deal properly with 'n's
636and 'x's.
637
638Added new option: -x #, which allows one to override the penalty for a
639match against 'x' (or 'N') provided by the scoring matrix. This
640option is particularly useful in fast[x/y] searches, where out of
641frame low complexity regions can generate high scores.
642
643The old function of '-x' - to specify an alternate coordinate system,
644is now available as '-X # #'.
645
646Updated scaleswn.c to provide window shuffle information for -z 12.
647
648Updated compacc.c, workacc.c, to fix serious bug in wshuffle()
649that destroyed aa1[n1]=0.
650
651>>January 25, 2000
652 --> v33t04
653
654 A serious bug in all of the fasta related programs has been
655corrected. The new code in fasta33 which ignores certain residues
656failed to initialize one of the arrays properly. As a result, in
657pathological situations, a very strong match could be missed.
658
659 Corrected minor bug in initsw.c that caused a misplaced "ktup" command
660line argument, which should be ignored by ssearch, to be read as -d
661ktup.
662
663 Improved error message for 0 length query sequence.
664
665>>January 17, 2000
666 --> no external version number change
667
668Modified mmgetaa.c, map_db.c, and nmgetaa.c to provide memory mapping
669of genbank flatfile (format=1) files. This format could be read much
670more efficiently, however.
671
672>>January 12, 2000
673 --> no external version number change
674
675Changed the behavior of the options that set the number of high scores
676(-b) and alignments (-d) that are displayed. Previously, fasta33 -E
67710.0 -d 10 would show 50 best scores, rather than all the scores with
678E() < 10.0. To get the -E threshold to limit, -E 10.0 -b 10000 -d 10
679was required. This is now fixed. Setting "-d 10" does not affect the
680number of best scores shown.
681
682Minor change in mw.h to remove unused defines.
683
684fasta3x.me (fasta3x.doc) updated.
685
686>>January 6, 2000
687 --> v33t03
688
689Corrected bug in memory mapped reads of gcg_binary format files
690that potentially caused the last 63 residues to be read improperly.
691
692Changes to comp_thr.c, pthr_subs.c, uthr_subs.c, ibm_pthr_subs.c to
693ensure that each thread has its own work_info structure. This solves
694some minor race conditions that sometimes caused some parameters
695not to be reported properly.
696
697Changes to most of the drop*.c files to correct some minor problems
698with sequence alphabets. Code in mmgetaa.c (memory mapped code for
699FASTA, GCG compressed files) reordered to prevent files from being
700memory mapped if appropriate index files are not available.
701
702See readme.pvm_3.3 for updates to the pvm programs.
703
704>>December 10, 1999
705 (no version change - modifications largely affect ps3comp*)
706
707Modifications to showsum.c to deal with 2 scores/sequence. Modifications
708to mmgetaa.c for superfamily numbers.
709
710>>December 7, 1999
711 (no version change, previous version not released)
712
713Corrected problem in mmgetaa.c that caused searches on a memory mapped
714single long sequence (e.g. Chr22) to fail. Corrected bug in map_db.c
715that caused it to crash on some architectures if a filename was not
716specified. Corrected off-by-three error in fasty/tfasty. Corrected
717indexing error in dropfz2.c.
718
719>>December 5, 1999
720 --> v33t02
721
722corrected some bugs in inifa.c/initsw.c/doinit.c that caused
723abbreviated function names to be lost.
724
725modify showbest.c, showalign.c to include information on position in
726library sequence (bbp->cont) to distinguish subsegment of very long
727sequences. Currently, the new label is available only with -m 6.
728
729>>November 29, 1999
730 [t]fastz33 uses v33t02 of fasty function.
731
732Replace dropfz.c with dropfz2.c. Dropfz2.c interprets any codon
733that includes the nucleotide 'N' as the amino acid 'X'. Previously, 'N' was
734treated as 'A', so 'NNN' ended up 'K'. This modification, together
735with the -S option and lower-case pseg'ed databases, should ensure
736that DNA queries with large numbers of 'N's do not match low
737complexity regions.
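
The difference between the old and new handling of 'N' can be shown
with a small Python sketch. It is not the dropfz2.c code; the names
below are invented for illustration, and the standard genetic code is
assumed. With n_as_x=False it mimics the earlier behavior, where 'N'
was read as 'A'.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna, n_as_x=True):
    """Translate frame 1 of dna; codons containing 'N' become 'X'."""
    dna = dna.upper()
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        codon = dna[pos:pos + 3]
        if "N" in codon:
            if n_as_x:
                protein.append("X")
                continue
            codon = codon.replace("N", "A")  # old behavior: 'N' read as 'A'
        # any other unrecognized character also yields 'X'
        protein.append(CODON_TABLE.get(codon, "X"))
    return "".join(protein)
```

With this sketch, translate("NNN") gives "X", while
translate("NNN", n_as_x=False) gives "K", as described above.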
738
739>>November 20, 1999
740 (no version change, previous version not released)
741
742Modify initfa.c to display initn, init1 scores for [t]fast[fs].
743Include "-B" option to show previous z-scores.
744
745>>November 17, 1999
746 (no version change, previous version not released)
747
748Modify dropfx.c to use saatran(), rather than aatran(). saatran
749translates any 'N' containing codon as 'X'. aatran() treats 'N' as
750an 'A'. Although more steps are required for translation, the program
751appears to run just as fast.
752
753>>November 7, 1999
754 --> v33t01
755
756Substantial changes to the output format in showbest.c (the list of
757high scoring sequences) and showalign.c (the alignments). The classic
758list of best scores:
759
760The best scores are: initn init1 opt z-sc E(82014)
761gi|121716|sp|P10649|GTM1_MOUSE GLUTATHIO ( 218) 1497 1497 1497 1761.1 2.3e-91
762gi|121717|sp|P04905|GTM1_RAT GLUTATHIONE ( 218) 1413 1413 1413 1662.9 6.7e-86
763
764has been replaced by:
765
766The best scores are: opt bits E(82138)
767gi|121716|sp|P10649|GTM1_MOUSE GLUTATHIONE S-TRAN ( 218) 1497 354 7.6e-98
768gi|121717|sp|P04905|GTM1_RAT GLUTATHIONE S-TRANSF ( 218) 1413 335 5.3e-92
769
770This display provides more information and removes the outdated initn
771and init1 scores, which are no longer used. The "bit" score is
772comparable to the blast2 bit score. It is calculated as: (lambda*S -
773ln K)/ln 2, where S is the raw similarity score, lambda and K are
774statistical parameters estimated from the distribution of unrelated
775sequence similarity scores. All of the similarity scores, including
776init1, initn, and z-scores are reported with the alignment data.
777Z-scores are displayed instead of bit scores in the list of high
778scores if the command line option "-B" is specified.
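
The bit score formula above is easy to check by hand. The Python
sketch below simply restates it; the function and argument names are
chosen here for illustration, and any lambda and K values passed in
would have to come from the program's own statistical estimates.

```python
import math


def bit_score(raw_score, lam, k):
    """Return (lambda*S - ln K) / ln 2 for raw similarity score S."""
    if k <= 0.0:
        raise ValueError("K must be positive")
    return (lam * raw_score - math.log(k)) / math.log(2.0)
```

Note that a smaller K raises the bit score for the same raw score,
since ln K is subtracted.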
779
780In addition, the alignment score line has changed from:
781
782>>gi|2506495|sp|P20136|GTM2_CHICK GLUTATHIONE S-TRANSFER (220 aa)
783 initn: 954 init1: 954 opt: 958 Z-score: 1130.9 expect() 1.1e-56
784Smith-Waterman score: 958; 61.927% identity in 218 aa overlap (1-218:1-218)
785
786to:
787
788>>gi|2506495|sp|P20136|GTM2_CHICK GLUTATHIONE S-TRANSFER (220 aa)
789 initn: 954 init1: 954 opt: 958 Z-score: 1130.9 bits: 216.4 E(): 2.8e-56
790Smith-Waterman score: 958; 61.927% identity in 218 aa overlap (1-218:1-218)
791
792Besides the new "bits:" score, the "expect()" label
793has changed to "E()" to save some space.
794
795>>November 4,12, 1999
796(no version change)
797
798Fixed serious bug in -z 2 lambda/K calculation in scaleswn.c
799
800Fixed bugs in llgetaa.c (openlib()) and definition of superfamily
801numbers.
802
803>>October 21, 1999
804(no version change)
805
806Begin using CVS for version control. Correct faulty error message in
807dropfs.c. Corrected bad "goto loopl;" in dropfz.c. Corrected prss3.rsp
808for Makefile.tc (Win32 version).
809
810>>October 18, 1999
811 --> v33t0
812
813Corrected some serious bugs with the various fasta/x/y programs when
814the -DALLOCN0 was used to save memory. Improvements to fasta3x.me/.doc
815documentation.
816
817>>October 12, 1999
818 --> v33tx
819
820For this initial release of version 33 of the FASTA programs, the
821Makefile's have been modified to make "fasta33(_t)", "fastx33(_t)",
822etc, so that you can test fasta33 while retaining fasta3 (from release
823v32t08). The FASTA33 programs are somewhat slower than previous
824releases, but I believe the ability to handle low complexity regions
825without 'X'ing them out outweighs the slowdown. By (temporarily)
826changing the names of the programs slightly, it will be easier for you
827to judge the relative cost and benefit. To "make" the programs as
828"fasta3(_t)", etc, simply replace "Makefile33.common" with
829"Makefile.common" in the "Makefile" that you use.
830
831>>September 30, 1999
832
833ssearch3/fasta3/fastx3/fasty3 have been modified to search databases
834containing both upper and lower case letters, where lower case letters
835indicate low-complexity regions. With the modified programs, lower
836case letters are treated as 'X's' in the initial scan, but are then
837treated normally in the final alignment. In addition, alignments can
838contain lower case letters. Lower case letters are treated as
839low-complexity regions during the search phase of the program, but as
840"conventional" residues during the alignment phase, with the "-S"
841option. Currently, lower case letters are mapped to 'X's during the
842scan of the entire library. In the future, alternate weights will be
843available. This is a substantial improvement for very large scale
844comparison, where one seeks both accurate statistical estimates and
845accurate %identities and alignments, and for translated DNA:protein
846comparisons, like "fastx3" and "fasty3", where out-of-frame
847translations tend to match low complexity regions (see Pearson et
848al. (1997) Genomics 46:24-36).
849
850Protein databases (and query sequences) can be generated in the
851appropriate format using John Wootton's "pseg" program, available from
852ftp://ftp.ncbi.nih.gov/pub/seg/pseg. Once you have compiled the "pseg"
853program, use the command:
854
855 pseg database.fasta -z 1 -q > database.lc_seg
856
857Once you have database.lc_seg, run the command "map_db" to generate
858a ".xin" file that can be used to efficiently memory map the database.
859
860You can then search database.lc_seg with or without the "-S" option.
861Without "-S", the database is treated as any other FASTA format file -
862all the residues are present. With "-S", lower case residues will be
863treated as 'x's' during the initial scan but as normal residues when
864final alignments are displayed.
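
The effect of "-S" on a library sequence can be pictured with the
following Python sketch. It only illustrates the idea and is not how
the programs store sequences; both function names are invented here.

```python
def scan_view(seq, soft_mask=True):
    """Residues as the initial scan would see them.

    With soft_mask=True (the "-S" case) lower case residues become 'X';
    otherwise every residue is used as written.
    """
    if not soft_mask:
        return seq.upper()
    return "".join("X" if c.islower() else c for c in seq)


def alignment_view(seq):
    """Residues as scored in the final alignment: case is ignored."""
    return seq.upper()
```

For a pseg'ed sequence such as "MKTaaaaLLV", the scan sees
"MKTXXXXLLV" with "-S", while the final alignment is scored against
all ten residues.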
865
866When the -S option is used, the matrix information line is changed
867from: "BL50 matrix (15:-5)" to "BL50 matrix (15:-5)xS". The "-S"
868option is no longer available to provide a scoring matrix offset.
869
870Unfortunately, Blast2.0 format files cannot contain lower case
871letters. We have addressed this problem by providing efficient memory
872mapped access to Fasta and GCG/PIR, and GCG/compressed-binary files in
873the last release of fasta32t08. The memory mapped file I/O
874improvements are provided in fasta33 as well.
875
876================ readme.v32 ================
877
878FASTX/Y and FASTA (DNA) are now half as fast, because the programs now
879search both the forward and reverse strands by default.
880
881The documentation in fasta3x.me/fasta3x.doc has been substantially
882revised.
883
884>>October 20, 1999
885(no version change)
886
887Modify nxgetaa.c/nmgetaa.c to recognize 'N' as a possible DNA character.
888
889>>October 9, 1999
890 --> v32t08 (no version number change)
891
892Added "-M low-high" option, where low and high are inclusion limits
893for library sequences. If a library sequence is shorter than "low" or
894longer than "high", it will not be considered in the search. Thus,
895"-M 200-250" limits the database search to proteins between 200 and
896250 residues in length. This should be particularly useful for fasts3
897and fastf3. -M -500 searches library sequences < 500; -M 200 -
898searches sequences > 200. This limit applies only to protein
899sequences.
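
The three forms of the "-M" argument can be sketched in Python as
follows. The function names are hypothetical, and the limits are
treated as inclusive here, which is an assumption; the programs may
handle a sequence whose length is exactly "low" or "high" differently.

```python
def parse_length_range(arg):
    """Parse "200-250", "-500" or "200 -" into (low, high).

    A missing low limit becomes 0; a missing high limit becomes None.
    """
    low_text, sep, high_text = arg.replace(" ", "").partition("-")
    if not sep:
        raise ValueError("expected low-high, got %r" % arg)
    low = int(low_text) if low_text else 0
    high = int(high_text) if high_text else None
    return low, high


def length_ok(n, low, high):
    """True if a library sequence of length n falls inside the limits."""
    return n >= low and (high is None or n <= high)
```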
900
901Modified scaleswn.c to fall back to maximum likelihood estimates of
902lambda, K rather than mean/variance estimates. (This allows MLE
903estimation to be used instead of proc_hist_n when a limited range of
904scores is examined.)
905
906>>October 2, 1999
907 --> v32t08
908
909Many changes:
910
911(1) memory mapped (mmap()ed) database reading - other database reading fixes
912(2) BLAST2 databases supported
913(3) true maximum likelihood estimates for Lambda, K
914(4) Misc. minor fixes
915
916(1) (Sept. 26 - Oct. 2, 1999) Memory mapped database access.
917It is now possible to use mmap()ed access to FASTA format databases,
918if the "map_db" program has been used to produce an ".xin" file. If
919USE_MMAP is defined at compile time and a ".xin" file is present, the
920".xin" will be used to access sequences directly after the file is
921mmap()ed. On my 4-processor Alpha, this can reduce elapsed time by
92250%. It is not quite as efficient as BLAST2 format, but it is close.
923
924Currently, memory mapping is supported for type 0 (FASTA), 5
925(PIR/GCG ascii), and 6 (GCG binary). Memory mapping is used if a
926".xin" file is present. ".xin" files are created by the new program
927"map_db". The syntax for "map_db" is:
928
929 map_db [-n] "/dir/database.fa"
930
931which creates the file /dir/database.fa.xin. Library types can be
932included in the filename; thus:
933
934 map_db -n "/gcggenbank/gb_om.seq 6"
935
936would be used for a type 6 GCG binary file.
937
938The ".xin" file must be updated each time the database file changes.
939map_db writes the size of the database file into the ".xin" file, so
940that if the database file changes, making the ".xin" offset
941information invalid, the ".xin" file is not used. "list_db" is
942provided to print out the offset information in the ".xin" file.
943
944(Oct 2, 1999) The memory mapping routines have been changed to
945allow several files to be memory mapped simultaneously. Indeed, once a
946database has been memory mapped, it will not be unmap()ed until the
947program finishes. This fixes a problem under Digital Unix, and should
948make re-access to mmap()ed files (as when displaying high scores and
949alignments) much more efficient. If no more memory is available for
950mmap()ing, the file will be read using conventional fread/fgets.
951
952(Oct 2, 1999) The names of the database reading functions have been
953changed to allow both Blast1.4 and Blast2.0 databases to be read. In
954addition, Makefile.common now includes an option to link both
955ncbl_lib.o and ncbl2_lib.o, which provides support for both libraries.
956However, Blast1.4 support has not been tested.
957
958The Makefile structure has been improved. Each architecture specific
959Makefile (Makefile.alpha, Makefile.linux, etc) now includes
960Makefile.common. Thus, changes to the program structure should be
961correct for all platforms. "map_db" and "list_db" are not made with
962"make all".
963
964The database reading functions in nxgetaa.c can now return a database
965length of 0, which indicates that no residues were read. Previously,
9660-length sequences returned a length of 1, which were ignored.
967Complib.c and comp_thr.c have changed to accommodate this
968modification. This change was made to ensure that each residue,
969including the last, of each sequence is read.
970
971Corrected bug in nxgetaa.c with FASTA format files with very long
972(>512 char) definition lines.
973
974(2) (September 20, 1999) BLAST2 format databases supported
975
976This release supports NCBI Blast2.0 format databases, using either
977conventional file reading or memory mapped files. The Blast2.0 format
978can be read very efficiently, so there is only a modest improvement in
979performance with memory mapping. The decision to use mmap()'ed files
980is made at compile time, by defining USE_MMAP. My thanks to Eamonn
981O'Toole of DEC/Compaq, and Daryl Madura of Sun Microsystems, for
982providing mmap()'ed modifications to fasta3. On my machines, Blast2.0
983format reduces search time by about 30%. At the moment, ambiguous DNA
984sequences are not decoded properly.
985
986(3) (September 30, 1999) A new statistical estimation option is
987available. -z 2 has been changed from ln()-scaling, which never
988should have been used, to scaling using Maximum Likelihood Estimates
989(MLEs) of Lambda and K. The MLE estimation routines were written by
990Aaron Mackey, based on a discussion of MLE estimates of Lambda and K
991written by Sean Eddy. The MLE estimation examines the middle 95% of
992scores, if there are fewer than 10000 sequences in the database;
993otherwise it excludes (censors) the top 250 scores and the bottom 250
994scores. This approach seems to effectively prevent related sequences
995from contaminating the estimation process. As with -z 1, -z 12 causes
996the program to generate a shuffled sequence score for each of the
997library sequences; in this case, no censoring is done. If the
998estimation process is reliable, Lambda and K should not vary much with
999different queries or query lengths. Lambda appears not to vary much
1000with the comparison algorithm, although K does.
1001
1002(4) Minor changes include fixes to some of the alignment display routines,
1003individual copies of the pstruct structure for each thread, and some
1004changes to ensure that every last residue in a library is available
1005for matching (sometimes the last residue could be ignored). This
1006version has undergone extensive testing with high-throughput sequences
1007to confirm that long sequences are read properly. Problems with
1008fastf3/fasts3 alignment display have also been addressed.
1009
1010>>August 26, 1999 (no version change - not released)
1011
1012Corrected problem in "apam.c" that prevented scoring matrices from
1013being imported for [t]fasts3/[t]fastf3.
1014
1015>>August 17, 1999
1016 --> v32t07
1017
1018Corrected problem with opt_cut initialization that only appeared
1019with pvcomp* programs.
1020
1021Improved calculation of FASTA optcut threshold for DNA sequence
1022comparison for match scores much less than +5 (e.g. +3). The previous
1023optcut threshold was too high when the match penalty was < 4 and
1024ktup=6; it is now scaled more appropriately.
1025
1026Optcut thresholds have also been raised slightly for
1027fastx/y3/tfastx/y3. This should improve performance with minimal
1028effects on sensitivity.
1029
1030>>July 29, 1999
1031(no version change - date change)
1032
1033Corrected various uninitialized variables and buffer overruns
1034detected.
1035
1036>>July 26, 1999 - new distribution
1037(no version change - v32t06, previous version not released)
1038
1039Changed the location of "(reverse complement)" label in tfasta/x/y/s/f
1040programs.
1041
1042Statistical calculations for tfasta/x/y in unthreaded version
1043corrected. Statistical estimates for threaded and unthreaded versions
1044of the tfasta/x/y/s/f programs should be much more consistent.
1045
1046Substantial modifications in alignment coordinate calculation/
1047presentation. Minor error in fastx/y/tfastx/y end of alignment
1048corrected. Major problems with tfasta alignment coordinates
1049corrected. tfasta and tfastx/y coordinates should now be consistent.
1050
1051Corrected problem with -N 5000 in tfasta/x/y3(_t) searches encountered
1052with long query sequences.
1053
1054Updated pthr_subs.c/Makefile.linux to increase the pthreads stacksize
1055to try to avoid "cannot allocate diagonal arrays" error message.
1056Pthreads stacksize can be changed with RedHat 6.0, but not RedHat 5.2,
1057so Makefile.linux uses -DLINUX5 for RedHat5.* (no pthreads stack size).
1058I am still getting this message, so it has not been completely
1059successful. Makefile.linux now uses -DALLOCN0 to avoid this problem,
1060at some cost in speed.
1061
1062The pvcomp* programs have been updated to work properly with
1063forward/reverse DNA searches. See readme.pvm_3.2.
1064
1065>>July 7, 1999 - not released
1066 --> v32t06
1067
1068Corrected bug in complib.c (fasta3, fastx3, etc) that caused core
1069dumps with "-o" option.
1070
1071Corrected a subtle bug in fastx/y/tfastx/y alignment display.
1072
1073>>June 30, 1999 - new distribution
1074(no version change)
1075
1076Corrected doinit.c to allow DNA substitution matrices with -s matrix
1077option.
1078
1079Changed ".gbl" files to ".h" files.
1080
1081>>June 2 - 9, 1999 - new distribution
1082(no version change)
1083
1084Added additional DNA lambda/K/H to alt_parms.h. Corrected some
1085other problems with those tables for the case where (inf,inf)
1086gap penalties were not included.
1087
1088Fixed complib.c/comp_thr.c error message to properly report filename
1089when library file is not found.
1090
1091Included approximate Lambda/K/H for BL80 in alt_parms.h.
1092BL80 scoring matrix changed from 1/3 bit to 1/2 bit units.
1093
1094Included some additional perl files for searchfa.cgi, searchnn.cgi
1095in the distribution (my-cgi.pl, cgi-lib.pl).
1096
1097>>May 30, 1999, June 2, 1999 - new distribution
1098(no version number change)
1099
1100Added Makefile.NetBSD, if !defined(__NetBSD__) for values.h. Changed
1101zs_to_E() and z_to_E() in scaleswn.c to correctly calculate E() value
1102when only one sequence is compared and -z 3 is used.
1103
1104>>May 27, 1999
1105(no version number change)
1106
1107Corrected bug in alignment numbering on the % identity line
1108 27.4% identity in 234 aa (101-234:110-243)
1109for reverse complements with offset coordinates (test.aa:101-250)
1110
1111>>May 23, 1999
1112(no version number change)
1113
1114Correction to Makefile.linux (tgetaa.o : failed to -DTFAST).
1115
1116>>May 19, 1999
1117(no version number change)
1118
1119Minor changes to pvm_showalign.c to allow #define FIRSTNODE 1.
1120Changes to showsum.c to change off-end reporting. (Neither of these
1121changes is likely to affect anyone outside my research group.)
1122
1123>>May 12, 1999
1124 --> v32t05
1125
1126Fixed a serious bug in the fastx3/tfastx3 alignment display which
1127caused t/fastx3 to produce incorrect alignments (and incorrectly low
1128percent identities). The scores were correct, but the alignment
1129percent identities were too low and the alignments were wrong.
1130
1131Numbering errors were also corrected in fastx3/tfastx3 and
1132fasty3/tfasty3 and when partial query sequences were used.
1133
1134>>May 7, 1999
1135
1136Fixed a subtle bug in dropgsw.c that caused do_work() to calculate
1137incorrect Smith-Waterman scores after do_walign() had been called.
1138This affected only pvcompsw searches with the "-m 9" option.
1139
1140>>May 5, 1999
1141
1142Modified showalign.c to provide improved alignment information that
1143includes explicitly the boundaries of the alignment. Default
1144alignments now say:
1145
1146Smith-Waterman score: 175; 24.645% identity in 211 aa overlap (5:207-7:207)
1147
1148>>May 3, 1999
1149
1150Modified nxgetaa.c, showsum.c, showbest.c, manshowun.c to allow a
1151"not" superfamily annotation for the query sequence only. The
1152goal is to be able to specify that certain superfamily numbers be
1153ignored in some of the search summaries. Thus, a description line
1154of the form:
1155
1156>GT8.7 | 40001 ! 90043 | transl. of pa875.con, 19 to 675
1157
1158says that GT8.7 belongs to superfamily 40001, but any library
1159sequences with superfamily number 90043 should be ignored in any
1160listing or summary of best scores.
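
A description line of this form could be taken apart as in the Python
sketch below. The function name and the returned tuple are invented
for illustration, and the sketch assumes one superfamily number before
the '!' and any numbers to be ignored, separated by spaces, after it.

```python
def parse_superfamily(defline):
    """Return (name, superfamily, ignored, description) from a query line."""
    fields = [f.strip() for f in defline.lstrip(">").split("|", 2)]
    if len(fields) < 2:
        raise ValueError("no superfamily field in %r" % defline)
    name = fields[0]
    member_text, _, ignore_text = fields[1].partition("!")
    superfamily = int(member_text)
    ignored = [int(tok) for tok in ignore_text.split()]
    description = fields[2] if len(fields) > 2 else ""
    return name, superfamily, ignored, description
```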
1161
1162In addition, it is now possible to make a fasta3r/prcompfa, which is
1163the converse of fasta3u/pucompfa. fasta3u reports the highest scoring
1164unrelated sequences in a search using the superfamily annotation.
1165fasta3r shows only the scores of related sequences. This might be
1166used in combination with the -F e_val option to show the scores
1167obtained by the most distantly related members of a family.
1168
1169>>April 25, 1999
1170
1171 -->v32t04 (not distributed)
1172
1173Modified nxgetaa.c to remove the dependence of tgetaa.o on TFASTA
1174(necessary for a more rational Makefile structure). No code changes.
1175
1176>>April 19, 1999
1177
1178Fixed a bug in showalign.c that displayed incorrect alignment coordinates.
1179(no version number change).
1180
1181>>April 17, 1999
1182
1183 --> v32t03
1184
1185A serious bug in DNA alignments when the sequence has been broken into
1186multiple segments that was introduced in version fasta32 has been
1187fixed. In addition, several minor problems with -z 3 statistics on
1188DNA sequences were fixed.
1189
1190Added -m 9 option, which unfortunately does different things in
1191pvcompfa/sw and fasta3/ssearch3. In both programs, -m 9 provides the
1192id's of the two sequences, length, E(), %_ident, and start and end of
1193the alignment in both sequences. pvcompfa/sw provides this
1194information with the list of high scoring sequences. fasta3/ssearch3
1195provides the information in lieu of an alignment.
1196
1197>>March 18, 1999
1198
1199 --> v32t02
1200
1201Added information on the algorithm/parameter description line to
1202report the range of the pam matrices. Useful for matrices like
1203MD_10, _20, and _40 which require much higher gap penalties.
1204
1205>>March 13, 1999 (not distributed)
1206
1207 --> v32t01
1208
1209 -r results.file has been changed to -R results.file to accommodate
1210 DNA match/mismatch penalties of the form: -r "+1/-3".
1211
1212>>February 10, 1999
1213
1214Modify functions in scalesw*.c to prevent underflow after exp() on
1215Alpha Linux machines. The Alpha/LINUX gcc compiler is buggy and
1216doesn't behave properly with "denormalized" numbers, so "gcc -g -m
1217ieee" is recommended.
1218
1219Add "Display alignments also (y/n)[n] "
1220
1221pvcomplib.c again provides alignments!! In addition, there is a
1222new "-m 9" option, which reports alignments as:
1223
1224>>>/home/wrp/slib/hlibs/hum0.aa#5>HS5 gi:1280326 T-cell receptor beta chain 30 aa, 30 aa vs /home/wrp/slib/hlibs/hum0.seg library
1225HS5 30 HS5 30 1.873e-11 1.000 30 1 30 1 30
1226HS5 30 HS2249 40 1.061e-07 0.774 31 1 30 7 37
1227HS5 30 HS2221 38 1.207e-07 0.833 30 1 30 7 35
1228HS5 30 HS2283 40 1.455e-07 0.774 31 1 30 7 37
1229HS5 30 HS2239 38 1.939e-07 0.800 30 1 30 7 35
1230
1231where the columns are:
1232
1233query-name q-len lib-name lib-len E() %id align-len q-start q-end l-start l-end
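
Since the "-m 9" lines are whitespace-separated, they are simple to
read into other programs. The Python sketch below follows the column
list above; the tuple and function names are made up here. Note that
in the example output the "%id" column holds a fraction (0.774), not a
percentage.

```python
from collections import namedtuple

M9Hit = namedtuple(
    "M9Hit",
    "query q_len lib lib_len evalue frac_id align_len "
    "q_start q_end l_start l_end",
)


def parse_m9_line(line):
    """Convert one 11-column "-m 9" result line into an M9Hit."""
    f = line.split()
    if len(f) != 11:
        raise ValueError("expected 11 columns, got %d" % len(f))
    return M9Hit(
        f[0], int(f[1]), f[2], int(f[3]),
        float(f[4]), float(f[5]), int(f[6]),
        int(f[7]), int(f[8]), int(f[9]), int(f[10]),
    )
```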
1234
1235>>February 9, 1999
1236
1237Corrected bug in showalign.c that offset reverse complement alignments
1238by one.
1239
1240>>February 2, 1999
1241
1242Changed the formatting slightly in showbest.c to have columns line up better.
1243
1244>>January 11, 1999
1245
1246Corrected some bugs introduced into fastf3(_t) in the previous version.
1247
1248>>December 28, 1998
1249
1250Corrected various problems in dropfz.c affecting alignment scores
1251and coordinates.
1252
1253Introduced a new program, fasts3(_t), for searching with peptide
1254sequences.
1255
1256>>November 11, 1998
1257
1258 --> v32t0
1259
1260Added code to correct problems with coordinate number in long library
1261sequences with tfastx/tfasty. With this release, sequences should be
1262numbered properly, and sequence numbers count down with reverse
1263complement library sequences.
1264
1265In addition, with this release, fastx/y and tfastx/y translated
1266protein alignments are numbered as nucleotides (increasing by 3,
1267labels every 30 nucleotides) rather than codons.
1268
1269
readme.v34t0
1
2 $Id: readme.v34t0 348 2010-07-20 21:33:22Z wrp $
3 $Revision: $
4
5>>May 28, 2007
6
7Small modification for GCG ASCII (libtype=5) header line.
8
9>>January 12, 2007 fasta-34_26_2
10
11Fix a problem with pssm_asn_subs.c reading strings (sequences) longer
12than 1024 bytes.
13
14Remove searchfa.cgi, searchnn.cgi, cgi-lib.pl, my-cgi.pl - this code
15was used for an ancient FASTA WWW implementation and has been replaced
16by the FASTA_WWW package.
17
18FASTA Version numbers are being modified to make releases easier to
19track, thus fa34t26b5 has become fasta-34_26_1. I would prefer to use
20decimal versions, but CVS does not allow '.' in tags.
21
22>>January 4, 2007 fasta-34_26_1
23
24Include scripts for building Mac OS X Universal binaries on a PPC
25machine. Programs are compiled first with Makefile.os_x (gcc-3.3 for
26PPC) and then installed into ./ppc/. Programs are next compiled with
27Makefile.os_x86 for i386, and the resulting executables installed into
28./i386/. Finally, the "make_osx_univ.sh" script is run to build the
29universal binaries from the two executables using "lipo".
30
31>>December 12, 2006
32
33Fix some problems with p2_workcomp.c: (1) no longer initialize pad
34characters for non-existent sequences. (2) deal with small libraries
35consistently with the serial versions.
36
37>>November 17, 2006 fa34t26b5
38
39Fixed a problem reading ASN.1 format 2 PSSM's. It is now possible to
40download a PSI-BLAST PSSM RID and search properly. Next, the query
41sequence from the PSSM should be used instead of the provided query
42sequence, so that the query sequence is ignored.
43
44>>October 19, 2006 fa34t26b4
45
46Fixed problem with SSE2 code when PSSM's are used.
47
48>>October 6, 2006 fa34t26b3
49
50A new set of WIN32 programs is now available that use the Intel C++
519.1 compiler, rather than the much older Borland Turbo-C compiler. All
52of the unthreaded programs that are part of the Unix and MacOSX FASTA
53distributions are now available. Threaded (multiprocessor) versions
54of the programs are available as well, as are sse2 accelerated versions
55of ssearch34 (ssearch34sse2.exe, ssearch34sse2_t.exe).
56
57The new WIN32 code also uses Microsoft's "nmake" program to build the
58programs, which allows much greater consistency between the Unix and
59Windows versions.
60
61
62>>September 18, 2006
63
64Static global alignment variables removed from dropnfa.c, dropfx.c,
65dropfz2.c. dropnfa.c, dropfx.c and dropfz2.c should be thread safe.
66Together with the earlier changes, all the FASTA functions should now
67be thread safe during the alignment process.
68
69>>August 17, 2006
70
71Begin removal of static variables from Smith-Waterman alignment
72functions. These variables kept the functions from being thread-safe.
73Now dropgsw.c and dropnsw.c are thread-safe.
74
75>>August 15, 2006 fa34t26b2
76
77Fixed a problem with pv34compfx/mp34compfx (and fy) producing
78improperly labeled alignments and de-allocating memory for the reverse
79complement.
80
81>>July 18, 2006
82
83The library file name parsing programs now provide the option for
84environment variable substitutions. For example, if SLIB2=/slib2 is set as an
85environment variable (e.g. export SLIB2=/slib2 for ksh and bash), then
86
87 fasta34 -q query.aa '${SLIB2}/swissprot.fa' expands as expected.
88
89While this is not important for command lines, where the Unix shell
90would expand things anyway, it is very helpful for various
91configuration files, such as files of file names, where:
92
93 <${SLIB2}/blast
94 swissprot.fa
95
96now expands properly, and in FASTLIBS files the line:
97
98 NCBI/Blast Swissprot$0S${SLIB2}/blast/swissprot.fa
99
100expands properly. Currently, environment variable expansion only
101takes place for library file names, and the <directory in a file of
102file names.
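
The kind of substitution described here can be sketched in Python as
follows. This is not the FASTA implementation; the function name is
invented, only the ${NAME} form shown in the examples is handled, and
leaving an unset variable untouched is an assumption.

```python
import os
import re

_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(name, env=None):
    """Replace each ${NAME} in a library file name with its value."""
    env = os.environ if env is None else env
    # an unset variable is left as written rather than replaced by ""
    return _VAR.sub(lambda m: env.get(m.group(1), m.group(0)), name)
```

Because only the ${NAME} pattern is matched, the "$0S" type and
abbreviation field of a FASTLIBS line passes through unchanged.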
103
104>>July 14, 2006 fa34t26b1
105
106Updated Farrar smith_waterman_sse2.c code to address possible bug
107(code from Michael Farrar). Include <sunmedia_intrin.h> for
108compilation with Sun compiler with Makefile.sun_x86.
109
110>>July 2, 2006 fa34t26b0
111
112This release provides an extremely efficient implementation of
113the Smith-Waterman algorithm for the SSE2 vector instructions, written
114by Michael Farrar (farrar.michael@gmail.com). The SSE code speeds up
115Smith-Waterman 8 - 10-fold in my tests, making it comparable to Eric
116Lindahl's Altivec code for the Apple/IBM G4/G5 architecture.
117
118The Farrar code is largely confined to smith_waterman_sse2.c and
119smith_waterman_sse2.h, which are copyright (2006) by Michael Farrar,
120and cannot be redistributed without his permission. Mr. Farrar has
121agreed to provide his code under the same policy used by FASTA -
122e.g. the code can be used without permission, but not redistributed.
123
124The Farrar code uses GCC version 4.0 SSE2 intrinsic functions to avoid
125assembly language code. Unfortunately, in my hands, "gcc -O3" causes
126"out of memory" errors, and other problems, so "gcc -O" is used instead.
127
128>>June 23, 2006 fa34t25d10
129
130Modifications to comp_lib.c, compacc.c, and other files to ensure that
131function-specific MAXTOT values are used properly. MAXTOT is now
132available as m_msg.max_tot, which is set in initfa.c (m_msg.max_tot =
133MAXTOT) to ensure that functions that need very large MAXTOT values
134(e.g. TFASTX) can get them. tfastx can now search successfully with
135titin, a 27,000 residue protein.
136
137Other changes have been made to accommodate long query sequences.
138
139A serious bug was found in fastx34(_t) that caused alignment
140coordinates to be calculated improperly when the DNA sequence was much
141longer than the protein sequence.
142
143>>May 31, 2006 fa34t25d9
144
145Fixed some problems with fasts/fastf alignments when -m 9 options were
146used. Unlike the other algorithms, the a_res structure does not
147capture all the information to re-produce an alignment, so do_walign
148now sets bptr->have_ares to indicate whether the a_res structure is
149valid.
150
151Various problems with bad library names, and short query titles were
152also fixed.
153
154Updated version number/date on all drop*.c functions.
155
156>>May 24, 2006 fa34t25d8
157
158Revised code for NCBI *.pal/*.nal databases has been tested on all
159architectures, including Windows.
160
161In addition, support for ASN.1 PSSM:2 files provided by the NCBI
162PSI-BLAST WWW site is included. This code will not work with
163iteration 0 PSSM's (which have no PSSM information). For ASN.1
164PSSM's, which provide the matrix name (and in some cases the gap
165penalties), the scoring matrix and gap penalties are set appropriately
166if they were not specified on the command line. ASN.1 PSSM's are type 2:
167 ssearch34 -P "pssm.asn1 2" .....
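
The -P argument is a file name optionally followed by a type number.
A small Python sketch of how such an argument could be split (an
illustration, not the FASTA parser; the function name and the
descriptive strings are hypothetical). Type 1 is a text PSSM, type 2
an ASN.1 PSSM, and no type number is taken to mean a binary blastpgp
checkpoint file.

```python
def parse_pssm_option(value):
    """Split a -P argument such as "pssm.asn1 2" into (filename, kind)."""
    kinds = {None: "binary checkpoint", 1: "text", 2: "ASN.1"}
    parts = value.split()
    if len(parts) == 1:
        # No type number given: treated here as a binary checkpoint file.
        return parts[0], kinds[None]
    if len(parts) == 2 and parts[1].isdigit() and int(parts[1]) in kinds:
        return parts[0], kinds[int(parts[1])]
    raise ValueError("unrecognized -P argument: %r" % value)


print(parse_pssm_option("pssm.asn1 2"))
print(parse_pssm_option("query.ckpt 1"))
print(parse_pssm_option("blast.ckpt"))
```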
168
169>>May 18, 2006
170
171Support for NCBI Blast formatdb databases has been expanded. The
172FASTA programs can now read some NCBI *.pal and *.nal files, which are
173used to specify subsets of databases. Specifically, the
174swissprot.00.pal and pdbaa.00.pal files are supported. FASTA supports
175files that refer to *.msk files (i.e. swissprot.00.pal refers to
176swissprot.00.msk); it does not currently support .pal files that
177simply list other .pal or database files (e.g. FASTA does not support
178nr.pal or swissprot.pal).
179
180In the process of providing this support, the routines used to read
181ASN.1 binary formatdb files were substantially improved. It is now
182possible to see multiple description lines for a single sequence.
183
184IS_BIG_ENDIAN has been removed from all of the Makefiles. The code
185now looks for the definition of __BIG_ENDIAN__ or _BIG_ENDIAN to
186decide whether the architecture IS_BIG_ENDIAN. If, for some reason,
187one of these macros is not defined on a BIG_ENDIAN architecture, then
188-DIS_BIG_ENDIAN is required.
189
190>>May 12, 2006 CVS fa34t25d7
191
192Corrected serious problem with coordinate display calculation for
193fasta34 and ssearch34 - in some cases the coordinates and alignment
194symbols were off by the length of the context (typically 30 residues).
195
196Added capability to read ASN.1 binary PSSM information. This
197information is provided (in an encoded form) from the NCBI PSI-BLAST
198WWW site. (What is actually provided from the WWW site is a bzip2-ed
199binary file that is converted to ASCII HEX. The ASCII HEX file must
200be converted to binary, and then bunzip'ed. This bunzip-ed file is
201binary ASN.1.) These files can also be generated by
202
203 blastpgp -J T -C pssm.asn1_bin -u 2
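
The conversion of the web-site file (ASCII HEX to binary, then
bunzip2) can be sketched in Python with the standard library. This is
an illustration under assumptions: the function name is hypothetical,
and the downloaded text is assumed to contain only hex digits and
whitespace.

```python
import binascii
import bz2


def hex_bz2_to_asn1(hex_text):
    """Convert ASCII HEX text of a bzip2-ed file to the original bytes."""
    # Remove line breaks and any other whitespace between hex digits.
    compact = "".join(hex_text.split())
    compressed = binascii.unhexlify(compact)
    # The decompressed result would be the binary ASN.1 PSSM.
    return bz2.decompress(compressed)


# Round trip on made-up bytes (not a real PSSM) to show the two steps.
payload = bytes(range(16)) * 4
hex_text = binascii.hexlify(bz2.compress(payload)).decode("ascii")
print(hex_bz2_to_asn1(hex_text) == payload)
```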
204
205I am parsing the ASN.1 binary manually, not using the NCBI toolkit, so
206there may be some files that are not parsed properly - if so, let me
207know.
208
209(May 12, 2006 - The NCBI changed the format of the psi-blast ASN.1
210PSSM - and has not yet provided documentation of the new structure, so
211this code does not work. It does work with blastpgp v 2.2.13, but not
212with the web site version 2.2.14. A fix was provided 24-May-2006)
213
214>>April 18, 2006
215
216Small modification in mshowbest.c to provide more consistent display
217widths with -m 9i in list of best hits.
218
219>>April 11, 2006 CVS fa34t25d6
220
221Corrected a problem introduced with the new, more efficient method for
222displaying alignments. For the tfast* programs, which must translate
223the library sequence, translations were not done when alignments were
224re-displayed.
225
226Corrected an older problem with tfastx34 against very long sequence
227databases - the code to more efficiently do the display alignment did
228not use the correct sequence coordinates.
229
230Modifications to dropfs2.c to ensure that exact peptide matches are
231captured more frequently.
232
233>>March 16, 2006 CVS fa34t25d5
234
235Change to initfa.c to allow lower case DNA libraries using the
236-DDNALIB_LC compile time option.
237
238Modify p2_complib.c, p2_worklib.c (and doinit.c, msg.h) to allow the
239-V annotation option for the parallel programs. Also modify to allow
240specification of the query range (but only for the first query, like
241fasta34) for the parallel programs.
242
243Modification of p2_workcomp.c to correct some problems presenting
244percent similarity. Also correct unreleased bugs in the alignment
245routines that allow more efficient alignment re-calculation.
246
247>>Nov 20, 2005
248
249Changes to support asymmetric matrices - a scoring matrix read in from
250a file can be asymmetric. Default matrices are all symmetric.
251
252>>Oct 24, 2005
253
254Modifications extended to p2_complib.c/p2_workcomp.c. Incorporation
255of drop_func.h into p2_workcomp.c greatly simplifies things. No
256changes in communication - struct a_res_str is internal to
257p2_workcomp.c.
258
259Additional changes to do_walign() so that aln_func_vals() must be
260called to set llfact, qlfact, etc in a_struct aln before or after
261do_walign is called. do_walign produces a_res_str a_res, which has
262all the information necessary to produce a calcons() or calc_code()
263alignment.
264
265>>Oct 19, 2005 CVS fa34t26b0
266
267Modifications to drop*.c and c_dispn.c to separate (and simplify) some
268of the alignment coordinate calculations. Before, the "a_struct" had
269the coordinates of the alignment used in the display (seqc0, seqc1)
270AND in the original sequences (aa0, aa1), as well as other information
271used to calculate alignment coordinates. In the new version, astruct
272coordinates always refer to seqc0,1, while a new structure, a_res_str,
273has coordinates for aa0, aa1 as well as the alignment encoding in res[nres].
274Eventually, this should make it possible to display multiple local
275alignments from the same two sequences.
276
277In addition, the file "drop_func.h" has been added to the project, and
278is included by many of the files (all the drop*.c functions,
279mshowbest.c, mshowalign.c) to ensure that the various functions are
280declared and used consistently.
281
282>>Sept 19, 2005 CVS fa34t25d4
283
284Changes to support Mac OS 10.4 - Tiger (include sys/types.h in more
285files). Documentation update for prss34/prfx34. Modifications to
286comp_lib.c to support prss34_t/prfx34_t. Shuffle numbers for
287prss/prfx can now be specified by "-k #".
288
289>>Sept 2, 2005
290
291The prss34 program has been modified to use the same display routines
292as the other search programs. To be more consistent with the other
293programs, the old "-w shuffle-window-size" is now "-v window-size".
294
295prss34/prfx34 will also show the optimal alignment for which the
296significance is calculated by using the "-A" option.
297
298Since the new program reports results exactly like other
299fasta/ssearch/fastxy34 programs, parsing for statistical significance
is considerably different. The old format program can be made using
301"make prss34o".
302
303>>Aug 26, 2005
304
305Modifications to save_best() in comp_lib.c to support prss34_t. It
306did not work before.
307
308>>July 25, 2005
309
310Modify mshowbest.c to suppress gi|12345 in HTML mode.
311
312>>July 18, 2005 CVS fa34t25d3
313
314Modifications to Makefile.tc to support NCBI formatdb formats under
315Windows.
316
317>>May 19, 2005 CVS fa34t25d2
318
319Modifications to dropfs2.c to fix an obscure bug that occurred when
320correctly ordered peptides aligned one residue apart.
321
322>>May 5, 2005 CVS fa34t25d1
323
Modification to the -x option, so that both an "X:X" match score and
an "X:not-X" mismatch score can be specified. (This score is also used
to give a positive score to a "*:*" match - the end of a reading frame,
while giving a negative score to "*:not-*".)
329
330>>March 14, 2005 CVS fa34t25b4
331
332Fixed some problems caused by padding characters required for
333Smith-Waterman ALTIVEC in the parallel (p2_complib.c, p2_workcomp.c)
334versions.
335
336>>Feb 24, 2005 CVS fa34t25b3
337
338Changes to comp_lib.c (and Makefile.pcom) to support prss34_t.
339
340>>Feb 12, 2005
341
342Modify dropfs.c to dynamically allocate space for alignments, so that
343queries with a large number of fragments can still place all the
344fragments on the alignment. Also fix a problem produced by removing
345-DBIGMEM from most of the Makefile's, but not fixing defs.h to use
346BIGMEM sizes by default.
347
348>>Jan 24, 2005
349
350Include a new program, "print_pssm", which reads a blastpgp binary
351checkpoint file and writes out the frequency values as text. These
352values can be used with a new option with ssearch34(_t) and prss34,
353which provides the ability to read a text PSSM file. To specify a
354text PSSM, use the option -P "query.ckpt 1" where the "1" indicates a
text, rather than a binary checkpoint file. "initfa.c" has also been
modified to work with PSSM files with zeros in the frequency
table. Presumably these positions (at the ends) do not provide
358information. (Jan 26, 2005) blastpgp actually uses BLOSUM62 values
359when zero frequencies are provided, so read_pssm() has been modified
360to use scoring matrix values for zero frequencies as well.
361
362>>Jan 13, 2005
363
364Change to initfa.c to have fasts34 do a protein comparison by default,
365rather than an unknown sequence type. Automatic checking for fasts34
366does not work reliably, because queries can be very short. Likewise
367for fastm34. [Jan 26, 2004] Undo this change, which broke DNA
368comparison when "-n" was specified.
369
370>>Jan 7, 2005
371
372Changes to tatstats.h, dropfs2.c to allow larger numbers of peptides
373to match when fasts is used to show coverage on a proteomics
experiment. Previously fasts could match no more than 30 peptides;
that limit has been increased to 50. In addition, ktup=2 can be used
to increase the likelihood that short exact matches trump longer
mismatched regions.
378
379>>Nov 11, 2004 CVS fa34t25
380
381Finished merge of earlier fa34t24 branch with HEAD. Correct
382labeling of TFASTM.
383
384>>Nov 4-8, 2004
385
386Incorporation of Erik Lindahl "anti-diagonal" Altivec code for
387Smith-Waterman, only. Altivec SSEARCH is now faster than FASTA for
388query sequences < 250 amino acids.
389
390Small modifications to output score display to ensure that the correct
391scores are shown, and that they are correctly labeled.
392
393>>Aug 25,26, 2004 CVS fa34t24b3
394
395Small change in output format for p34comp* programs in
396">>>query_file#1 string" line before alignments. This line is not present
397in the non-parallel versions - it would be better for them to be consistent.
398
399Change in last_stats.c to properly label fasts statistics with -z != 1.
400
401Change in dropfs2.c to ensure that tatprobs are not precalculated with -z 4.
402
403Modify -m 9i output option to show in HTML output.
404
405Add "#ifdef NOOVERHANG" to dropfs2.c that causes overlapping
406alignments to score a 0, rather than the partial overlap score.
407Useful for SAGE alignments, because "fasts" requires global alignments
(except for overhangs, unless NOOVERHANG is defined).
409
410>>Aug 23, 2004
411
412Fix problem with very long definition lines with formatdb version4
413ASN databases. Fix mshowalign.c to re-enable "-L" option.
414
415>>July 28, 2004
416
417Fix to re-enable -w window shuffle for PRSS. Modify comp_lib.c
418for PRSS to ensure that the unshuffled score and probability
are shown, even for very high probability alignments.
420
421>>July 21, 2004
422
423Modifications to support PostgreSQL databases with the same commands
424as MySQL databases. MySQL database libraries are type 16, PostgreSQL
425are type 17. Makefile.linux_sql and Makefile.pvm4_sql support both
426database types simultaneously.
427
428>>June 23, 2004 CVS fa34t24b2
429
430Additional fixes to enable -n or -p with fasts34 and
431fastm34. Makefile.pcom was fixed for fastm34_t. A new file,
432mgstm1.nts, of DNA fragments from mgstm1.seq, is included for testing
433fasts34 and fastm34.
434
435>>May 4, 2004
436
437Fixes to initfa.c to allow DNA:DNA for FASTS, FASTM. This change
438introduced a bug that broke FASTS completely, but was fixed June 18,
4392004 (and retagged fa34t24b2).
440
441>>April 23, 2004 CVS fa34t24b1
442
443Fix bug in initfa.c that caused tfasts/tfastf not to examine all six
444frames.
445
450>>March 19, 2004 CVS fa34t24b0
451
452Modify all the drop*.c files, plus mshowbest.c and mshowalign.c, to
453display percent similarity, rather than percent ungapped. An
454alignment is counted as similar if the score is greater than or equal
to zero (the same criterion used for placing "."). To disable this
456change, remove -DSHOWSIM from the appropriate Makefile.*.
457
458>>March 18, 2004 CVS fa34t23b8
459
460Fix bug in initfa.c tables that caused prss to generally compare
461proteins.
462
463>>March 15, 2004
464
465Fix bug in calls to revcomp(); make revcomp() guarantee NULL termination.
466
467>>March 2, 2004 CVS fa34t23b7
468
469Fix a very embarrassing and surprising bug that caused insertions
470in fasta alignments to appear in the wrong sequence.
471
472>>Feb 7, 2004 CVS fa34t23b6
473
474Change initfa.c to allow "-i" (reverse complement) and "-i -3" with
475"fastx34" and "prfx34". In addition, "prfx34" now examines both query
DNA strands in calculating the shuffled statistical significance.
477
478>>Feb 5, 2004
479
Reverse assignments for G:U base pairing in initfa.c.
481
482Fix memory allocation error caused by doubling DNA alignment width.
483
484>>Jan 7, 2004 CVS fa34t23b5
485
486Change in do_walign() in dropnfa.c to make final DNA alignments use a
487band that is 2X as large as the search band width.
488
489>>Dec 22, 2003 CVS fa34t23b4
490
491Fix typo in p2_complib.c that prevented compilation. Fix problem
492with karlin.c for asymmetrical matrices, such as used with -U.
493
494>>Dec 10, 2003 CVS fa34t23b3
495
496Fix problem in resetp()/initfa.c that disabled banded Smith-Waterman
497DNA alignments.
498
499Allow spam() to do extended alignments for DNA if one of the sequences
500is < 50 nt.
501
502Cause default ktup to drop for short sequences. For protein < 50, ktup=1;
503for DNA < 20, 50, 100 ktup = 1, 2, 3, respectively.
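
The thresholds above can be written out as a small Python sketch (an
illustration, not the FASTA code). The function name is hypothetical,
the length is assumed to be that of the short sequence being compared,
and the usual default ktup is passed in by the caller rather than
assumed.

```python
def short_sequence_ktup(length, is_dna, default_ktup):
    """Return the ktup to use, lowered for short sequences."""
    if is_dna:
        if length < 20:
            cap = 1
        elif length < 50:
            cap = 2
        elif length < 100:
            cap = 3
        else:
            return default_ktup
    else:
        if length < 50:
            cap = 1
        else:
            return default_ktup
    # Never raise ktup above the value that would be used anyway.
    return min(cap, default_ktup)


# 2 and 6 are used here only as example defaults for protein and DNA.
print(short_sequence_ktup(30, False, 2))
print(short_sequence_ktup(75, True, 6))
```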
504
505>>Dec 7, 2003
506
507A new option, "-U" is available for RNA sequence comparison. "-U"
508functions like "-n", indicating that the query is an RNA sequence. In
509addition, to account for "G:U" base pairs, "-U" modifies the scoring
510matrices so that a "G:A" match has the same score as "G:G" match,
511and "T:C" match has the same score as a "T:T" match. (Corrected
51213-July-2010 -- the G:A/T:C scores are score(G:G)-3.) The asymmetric
513matrix required changes in dropnfa.c that were similar to the changes
514in dropgsw.c required for profiles. In addition, m_msg.qdnaseq and pst.dnaseq
515 can now be SEQT_DNA, SEQT_RNA, SEQT_PROT, SEQT_UNK, or SEQT_OTHER.
516m_msg.ldnaseq does not use SEQT_RNA, only SEQT_DNA. A new member of
517struct pstruct: int nt_align, is used to indicate nucleotide
518alignments.
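
A Python sketch of the asymmetric "-U" matrix, following the corrected
rule that G:A and T:C score 3 less than G:G and T:T. This is an
illustration only: the match/mismatch values (+5/-4), the function
name, and the reading of "G:A" as (first sequence G, second sequence
A) are assumptions.

```python
def rna_matrix(match=5, mismatch=-4, gu_penalty=3):
    """Build a 4x4 nucleotide matrix with G:U-aware scores."""
    bases = "ACGT"
    m = {(a, b): (match if a == b else mismatch)
         for a in bases for b in bases}
    # Asymmetric entries: only G:A and T:C are raised, not A:G or C:T.
    m[("G", "A")] = m[("G", "G")] - gu_penalty
    m[("T", "C")] = m[("T", "T")] - gu_penalty
    return m


m = rna_matrix()
print(m[("G", "A")], m[("A", "G")])
print(m[("T", "C")], m[("C", "T")])
```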
519
520>>Nov 19, 2003
521
522Changes to Makefile's to distinguish between tatstats_fs.o and
523tatstats_ff.o.
524
525>>Nov 2, 2003
526
527Substantial changes to comp_lib.c, p2_complib.c, mshowbest.c, and
528mshowalign.c to support more sophisticated display options.
Previously, one could have only one "-m #" option, even though several
of the options were orthogonal (-m 9c is independent of -m 1 and -m 2,
531which is independent of -m 6 (HTML)). The programs now use a bitmask
532that allows independent options to be combined. In particular -m 9c
533can be combined with -m 6, which can be very helpful for runs that
534need HTML output but can also exploit the encoding provided by -m 9c.
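
The bitmask idea can be sketched in Python as follows. The flag names
and bit values are hypothetical and are not the constants used in the
FASTA source; the sketch only shows how independent "-m" options
combine.

```python
from enum import IntFlag


class Fmt(IntFlag):
    """Hypothetical output-format bits, one per independent option."""
    M1 = 1
    M2 = 2
    HTML = 4   # -m 6
    M9 = 8
    M9C = 16
    M9I = 32


# Assumed mapping from "-m" values to bits; "9c" and "9i" imply "9".
_OPTS = {
    "1": Fmt.M1,
    "2": Fmt.M2,
    "6": Fmt.HTML,
    "9": Fmt.M9,
    "9c": Fmt.M9 | Fmt.M9C,
    "9i": Fmt.M9 | Fmt.M9I,
}


def combine(m_values):
    """OR together the bits for every -m value given."""
    mask = Fmt(0)
    for value in m_values:
        mask |= _OPTS[value]
    return mask


mask = combine(["9c", "6"])
print(Fmt.HTML in mask, Fmt.M9C in mask, Fmt.M1 in mask)
```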
535
536The "-m 9" option now also allows "-m 9i", which shows the standard
537best score information, plus percent identity and alignment length.
538
539>>Oct 26, 2003 CVS fa34t23b1
540
541Additional fixes to Makefiles to enable tfastf34(_t). Changes to
542support ossearch34 (a non-Phil Green optimized Smith-Waterman).
543
544>>Oct 8, 2003 CVS fa34t23b0
545
546Fixes to get DNA queries working in both directions, and to fix PCOMPLIB
547programs for "-V" option. Currently, the parallel programs cannot use
548the "-V" option.
549
550>>Sept 25, 2003
551
552A new option is available for annotating alignments. -V '@#?!'
553can be used to annotate sites in a sequence, e.g:
554 >GTM1_HUMAN ...
555 PMILGYWDIRGLAHAIRLLLEYTDS@S?YEEKKYT@MG
556 DAPDYDRS@QWLNEKFKLGLDFPNLPYLIDGAHKIT
557might mark known and expected (S,T) phosphorylation sites. These
558symbols are then displayed on the query coordinate line:
559
560 10 20 @? 30 @ 40 @ 50 60
561GTM1_H PMILGYWDIRGLAHAIRLLLEYTDSSYEEKKYTMGDAPDYDRSQWLNEKFKLGLDFPNLP
562 ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
563gtm1_h PMILGYWDIRGLAHAIRLLLEYTDSSYEEKKYTMGDAPDYDRSQWLNEKFKLGLDFPNLP
564 10 20 30 40 50 60
565
566This annotation is mostly designed to display post-translational
567modifications detected by MassSpec with FASTS, but is also available
568with FASTA and SSEARCH.
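
A Python sketch of separating the annotation symbols from the
residues (an illustration, not the FASTA code; the function name is
hypothetical). As in the example above, each symbol is taken to
annotate the residue immediately before it, and positions are
1-based.

```python
def split_annotation(annotated, symbols="@#?!"):
    """Return (plain_sequence, {position: symbol})."""
    residues = []
    marks = {}
    for ch in annotated:
        if ch in symbols:
            if residues:
                # The symbol marks the residue just before it.
                marks[len(residues)] = ch
        else:
            residues.append(ch)
    return "".join(residues), marks


seq, marks = split_annotation("TDS@S?YEEKKYT@MG")
print(seq)
print(marks)
```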
569
570>>Sept 22, 2003 CVS fa34t22b5
571
572The Altivec Smith-Waterman code has been removed.
573
574>>Sept 17, 2003 CVS fa34t22b4
575
576A variety of different bugs have been fixed. (1) All the functions in
577the old initsw.c are now in initfa.c; initsw.c will be removed.
578Specifically, the Profile/PSSM code is now in initfa.c. initfa.c is
579now fully table driven. (2) various problems with prss34 and prfx34
580have been fixed in initfa.c. (3) An additional ncbl2_mlib.c buffer
581overrun has been fixed. (4) fastf34 is now available in this package.
582Its performance is very similar to, but not identical to, fastf33. I
583am tracking down the differences. In general, the raw scores
584calculated by both programs are the same, but the statistical analysis
585seems to be slightly different.
586
587>>July 30, 2003 CVS fa34t22b3
588
589Fix bug in ncbl2_mlib.c that caused buffer overrun with blast/formatdb
590v3 description lines.
591
592>>July 28, 2003
593
594The initfa.c file has been substantially re-structured to use a
595table-driven approach to parameter setting, rather than the previous
596confusing combinations of #ifdef's. Two tables of parameters are
597used, pgm_def_arr[] and msg_def_arr[], which specify values like the
598program name, reference, scoring matrix, default gap penalties, etc.
599msg_def_arr[] has the sequence types for the query, library, and
600algorithm, as well as other parameters (qframe, nframe, nrelv, etc),
601which greatly simplifies the sequence recognition logic. ppst->pgm_id
602can be used to identify the program that is running. Eventually,
603almost all of the program specific #ifdef's will be removed from
604initfa.c. initfa.c now provides initsw.c functionality, so that
605initsw.c is no longer needed.
606
607>>July 25, 2003
608
609A new file is included - fasta.defaults - that lists the scoring
610matrix, gap penalty, and other defaults for all of the fasta34
611programs. This file will be used soon to simplify parameter setting
612for the FASTA programs, and should also be used by Javascript WWW
613interfaces to the FASTA programs.
614
615>>July 22, 2003 CVS fa34t22b2
616
617Fixes to dropfs2.c, tatprobs.c to ensure that negative probabilities
618cannot occur. Negative probabilities were never seen with standard
619matrices, but did occur with BL50. Another optimization in dropfs.c
620considerably improves fasts34 performance in some cases.
621
622Fix a problem with formatdb v4 ASN.1 format files.
623
624>>July 12, 2003
625
626Fix a bug that prevented "-L" (long sequence descriptions) from
627working.
628
629>>July 9, 2003
630
631Fix reverse complement (M:K) error. Fix off-by-one error for FASTA
632DNA alignments that caused the first aligned residue pair to be
633missed.
634
635>>July 4 - 8, 2003
636
637Incorporate blast-def-line ASN.1 parsing so that NCBI formatdb version
6384 files can be read.
639
640>>June 26, 2003
641
642The strategy for displaying the match/mismatch line (" .:" for -m 0)
has been changed dramatically to accommodate more sophisticated
644strategies for indicating conservative replacements, e.g. because of
645PSSM's. In addition to seqc0 and seqc1, which hold the aligned
646sequences for display, there is also seqca, which holds the alignment
647symbol. calcons(), do_show(), and discons() have all changed to
648include seqca. calcons() is somewhat more complex; discons() is much
649simpler. (June 29, 2003 - dropgsw.c calcons() now displays profile
650similarity accurately - it is very very illuminating.)
651
652>>June 16, 2003 version: fasta34t22
653
654ssearch34 now supports PSI-BLAST PSSM/profiles. Currently, it only
655supports the "checkpoint" file produced by blastall, and only on
656certain architectures where byte-reordering is unnecessary. It has not
657been tested extensively with the -S option.
658
659 ssearch34 -P blast.ckpt -f -11 -g -1 -s BL62 query.aa library
660
Will use the frequency information in the blast.ckpt file to do a
662position specific scoring matrix (PSSM) search using the
663Smith-Waterman algorithm. Because ssearch34 calculates scores for
664each of the sequences in the database, we anticipate that PSSM
665ssearch34 statistics will be more reliable than PSI-Blast statistics.
666
667The Blast checkpoint file is mostly double precision frequency
668numbers, which are represented in a machine specific way. Thus, you
669must generate the checkpoint file on the same machine that you run
670ssearch34 or prss34 -P query.ckpt. To generate a checkpoint file,
671run:
672
673blastpgp -j 2 -h 1e-6 -i query.fa -d swissprot -C query.ckpt -o /dev/null
674
(This searches swissprot for 2 iterations ("-j 2") using an E()
threshold of 1e-6 ("-h 1e-6"), saving the resulting position specific
frequencies in query.ckpt. Note that the original query.fa and
query.ckpt must match.)
679
680>>June 5, 2003
681
682Fix to mshowbest.c to get -m 9 coordinates correct on reverse strand
683with pv34comp*. Some additional fixes for prfx34.
684
685>>May 22, 2003
686
687Changes to llgetaa.c, getseq.c, comp_lib.c to provide a different
688library residue lookup table (sascii) for queries and libraries. This
689allows one to make a prfx34 (like prss34, but using the fastx
690algorithm). prfx34 is now available.
691
692>>May 13,14 2003
693
694Fixes to most of the drop*.c files, and mshowbest.c, to ensure that
695coordinates displayed with -m 9(c) and the final alignment are
696consistent. They were consistent for fasta34/ssearch34/fasts34, but
not for fastx34/fasty34. The alignment coordinate system has been
revised for consistency in all the drop*.c programs (coordinates
used to be off-by-one for some, but not other functions).
700
701Fixes to -m 9c for fasty34/pv34compfy. In addition, a problem was
fixed with fastx34/fasty34 that appeared when a protein sequence was
703considerably longer than the DNA query, e.g. an EST vs titin (26K
704residues). This problem only appeared on pv34compfx/fy on Xserve's
705under OS_X; but it should improve fastx34/fasty34 performance with
706very long protein sequences on all platforms.
707
708>>May 7,8 2003
709
710Changes to p2_workcomp.c, compacc.c, and p_mw.h to fix persistent
711bugs in the -m 9c display. Previous pv34comp* programs would not
712return the correct coded alignment if more than 100 alignments came
713from the same node, or if an encoding was longer than 127 chars.
714
715Also, fixes to p2_complib.c, comp_lib.c, to allow long query sequences
716to be segmented. Previously, only the first 20,000 residues were
717used. The segmented queries are not overlapped; segmented library
718sequences are.
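
The difference between non-overlapped and overlapped segments can be
sketched in Python (an illustration, not the comp_lib.c code). The
function name and the overlap of 1,000 are hypothetical; 20,000 is
used as the segment size only because that was the old query limit
mentioned above.

```python
def segments(seq_len, maxn, overlap=0):
    """Return 0-based half-open (start, end) segments covering seq_len."""
    if not 0 <= overlap < maxn:
        raise ValueError("overlap must be >= 0 and smaller than maxn")
    out = []
    start = 0
    while True:
        end = min(start + maxn, seq_len)
        out.append((start, end))
        if end >= seq_len:
            break
        # The next segment re-reads the last `overlap` residues.
        start = end - overlap
    return out


print(segments(45000, 20000))          # query-style: no overlap
print(segments(45000, 20000, 1000))    # library-style: overlapped
```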
719
720>>May 5, 2003
721
722Changes to last_tat.c, scaleswt.c to ensure that all fasts alignments
723that are likely to have significant scores are displayed. In previous
724implementations, if the query had more than 10 fragments, only the 100
725best scores were shown. Now, we rescore up to 2500 alignments. The
726new approach allows large mixtures to be used for searches, where some
727of the fragments from the mixture match too many proteins
728(e.g. actins). Some differences between the fasts34 and pv34compfs
729implementations have been fixed. The two programs typically will not
730give exactly the same results, because of small differences in the
731sampling procedures, but the results are essentially equivalent.
732
733>>Apr 11, 2003 CVS fa34t21b3
734
Fixes for "-E" and "-F" with ssearch34, which were inadvertently disabled.
736
737A new option, "-t t", is available to specify that all the protein
738sequences have implicit termination codons "*" at the end. Thus, all
739protein sequences are one residue longer, and full length matches are
740extended one extra residue and get a higher score. For
741fastx34/tfastx34, this helps extend alignments to the very end in
742cases where there may be a mismatch at the C-terminal residues.
743
744-m 9c has also been modified to indicate locations of termination
745codons ( *1).
746
747>>Mar 17, 2003 CVS fa34t21b2
748
749A new option on scoring matrices "-MS" (e.g. "BL50-MS") can be used to
750turn the I/L, K/Q identities on or off. Thus, to make "fastm34" use
751the isobaric identities, use "-s M20-MS". To turn them off for "fasts34",
752use "-s M20".
753
754More fixes for correct alignment coordinates. There was a conflict between
755-m 9 and -m 9c and subsequent alignment displays.
756
757>>Mar 13, 2003
758
759Various fixes to produce correct fastm34 alignments. Changes to all
760functions to correct potential problem with -m 9 alignment coordinates
761when both -m 9 and actual alignments are shown.
762
763>>Feb 25,27, 2003
764
765Modifications to re-activate showsum.c, which included corrections to
766the showbest() call in p2_complib.c.
767
768>>Feb 13, 2003 CVS fa34t21b1
769
770Modifications to dropfx.c to dramatically improve alignment speed for
771cases where the DNA sequence is considerably longer than the protein
772sequence. Previously, a 200 aa vs 5000 nt comparison would do a full
773200 x 5000 Smith-Waterman alignment; with this modification, no more
774than a 200 x 1200 (2x3x200) alignment is done. This optimization has
775not (yet) been applied to dropfz2.c (fasty/tfasty).
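
The arithmetic behind this limit, as a short Python sketch (function
names are hypothetical; how dropfx.c chooses which part of the DNA
sequence to align is not described here and is not modeled):

```python
def fastx_dna_window(aa_len, nt_len):
    """Upper bound on the DNA length aligned: 2 x 3 x protein length."""
    return min(nt_len, 2 * 3 * aa_len)


def alignment_cells(aa_len, nt_len):
    """Return (full, limited) dynamic-programming matrix sizes."""
    return aa_len * nt_len, aa_len * fastx_dna_window(aa_len, nt_len)


full, limited = alignment_cells(200, 5000)
print(fastx_dna_window(200, 5000), full, limited)
```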
776
777>>Feb 11, 2003
778
779Small modifications to comp_lib.c, p2_complib.c, and nmgetlib.c to
780pass openlib() a possibly old lmf_str. This allows openlib() to
781re-use memory mapped files. closelib() no longer releases memory
782mapped file buffers. Under Linux, memory mapped file buffers were not
783really released, so when comparing a set of sequences against nr, the
784program could not mmap() the database after several searches. This
785will also speed up memory mapped multiple sequence searches.
786
787>>Jan 28-31, 2003 CVS fa34t21b0
788
789Fix another bug (all of v34t20) involved with overlapping long
790sequences. And another bug that occurred when using sampled
791statistics, but appeared only on the SGI platform - thanks to Dmitri
792Mikhailov. Several other issues have been addressed based on more
793instrumented runtime testing.
794
795Fix an old (all v34) bug that caused problems with -z 11-16 (shuffled
796sequence array was not allocated properly). Fixed another bug with -z
7976/16 when using threaded (_t) searches in fasta34_t.
798
799Restructure statistical analysis functions (scaleswn.c, scaleswt.c) to
800return the "final" statistical estimation routine done in pst.zsflag_f.
801This allows the program to cope with searches against a single sequence
802correctly.
803
804Corrected an error for DNA sequences needing Altschul-Gish statistics.
805
806>>Jan 25, 2003
807
808Add option "-J start:stop" to pv34comp*/mp34comp*. "-J x" used to
809allow one to start at query sequence "x"; now both start and stop can
810be specified.
811
812>>Jan 14, 2003
813
814Changes to apam.c to provide an error message on stderr when a scoring
815matrix cannot be found.
816
817Changes to dropfs2.c, initsw.c, initfa.c to provide -m9c information
818for fasts34 searches. Modify the alignment algorithm to use
819probabilistic scores properly.
820
821>>Dec 22, 2002
822
823Change to compacc.c (sortbeste()) to do a second sort on zscore when
824several sequences have E() == 0.
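
A Python sketch of the tie-breaking idea (an illustration, not the
sortbeste() code). The record layout is hypothetical, and this version
breaks every E() tie by z-score, not only ties at E() == 0.

```python
def sort_best(hits):
    """Sort by ascending E(), then by descending z-score."""
    return sorted(hits, key=lambda h: (h["escore"], -h["zscore"]))


hits = [
    {"name": "a", "escore": 0.0, "zscore": 310.0},
    {"name": "b", "escore": 1e-5, "zscore": 150.0},
    {"name": "c", "escore": 0.0, "zscore": 950.0},
]
print([h["name"] for h in sort_best(hits)])
```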
825
826>>Nov 27, 2002
827
828Change FSEEK_T to fseek_t to keep Borland BCC5 happy.
829
830>>Nov 14-22, 2002 CVS fa34t20b6
831
832Include compile-time define (-DPGM_DOC) that causes all the fasta
833programs to provide the same command line echo that is provided by the
834PVM and MPI parallel programs. Thus, if you run the program:
835
836 fasta34_t -q -S gtt1_drome.aa /slib/swissprot 12
837
838the first lines of output from FASTA will be:
839
840 # fasta34_t -q gtt1_drome.aa /slib/swissprot
841 FASTA searches a protein or DNA sequence data bank
842 version 3.4t20 Nov 10, 2002
843 Please cite:
844 W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448
845
846This has been turned on by default in most FASTA Makefiles.
847
848Fix p2_complib.c so that qstats[] is always allocated before it is used.
849
850Fix serious bug in non-threaded comp_lib.c that caused some high
851scoring sequences to be missed by fasts34. New tests are included in
852test.sh to detect this problem in the future.
853
854The shell sort algorithm in sortbeste(), sortbestz(), and sortbesto()
855has been modified to use an improved algorithm that will not go
856quadratic in pathological cases.
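
These notes do not say which gap sequence the revised sort uses. As
an illustration only, here is a Python shell sort using Knuth's
1, 4, 13, 40, ... gaps, a common choice that avoids the quadratic
worst case of halving gaps. The gap sequence and function name are
assumptions, not the FASTA implementation.

```python
def shell_sort(items, key=lambda x: x):
    """Return a new list sorted in ascending order of key(item)."""
    a = list(items)
    n = len(a)
    # Largest Knuth gap (3h + 1) that is below roughly n / 3.
    gap = 1
    while gap < n // 3:
        gap = 3 * gap + 1
    while gap >= 1:
        # Gapped insertion sort for the current gap.
        for i in range(gap, n):
            tmp = a[i]
            k = key(tmp)
            j = i
            while j >= gap and key(a[j - gap]) > k:
                a[j] = a[j - gap]
                j -= gap
            a[j] = tmp
        gap //= 3
    return a


print(shell_sort([5, 2, 9, 1, 5, 6, 0, 3]))
```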
857
858nmgetlib.c and mmgetaa.c have been modified to remove "^A" in libstr
859when used with p2_complib.c.
860
861Fix problem with MAXSEG in tatstats.h with IBM/AIX.
862
863Changes to most Makefiles to use -DSAMP_STATS; fixes to p2_complib.c
864for SAMP_STATS.
865
866>>Oct 22, Nov 3, Nov 9, 2002 CVS tag fa34t20b5
867
868Fix problem in comp_lib.c that caused the query sequence length to be
869counted twice.
870
871Fixed problem with prss34 (updated find_zp in showrss.c).
872
873Correct shuffling function in several places.
874
875Add jitter back to addhistz() - improves appearance with prss34.
876
877Changes to fix problems with aln_code using -m 9c.
878
879Fix to serious bug in scaleswt.c (fasts34, etc) that caused sorts on
the high scores to take much too long. The program is now 10X faster,
881and scales well on PVM/MPI.
882
883Fix to llgetaa.c to work with new getseq() API with automatic alphabet
884recognition.
885
886>>Oct 12, 2002 CVS tag fa34t20b4
887
888Several very obscure (and sometimes old) bugs that appeared in certain
889MPI environments have been fixed. This occurred because the pst.sq[]
890array did not always have a '\0' at the end. In addition,
891mshowalign.c/p2_workcomp.c sometimes failed to put the '\0' at the end
892of seqc0/seqc1. Correct bug introduced in fa34t20b3 for fasts34(_t).
893
894>>Oct 9, 2002 CVS tag fa34t20b3
895
896Fix to apam.c build_xascii() to not zero-out qascii[0]. Fix
Makefile.pvm4. Fix problem with -m 9c with compacc.c.
898
899>>Sept 28, 2002
900
901Additional fixes to -m 9c in p2_complib.c/compacc.c/mshowbest.c.
902Remove restriction in fasts34(_t) to less than 30 peptides (though no
903more than 30 peptides can be aligned currently).
904
905>>Sept 24, 2002
906
907Fix p2_workcomp.c so that e_scores are delivered correctly when
908last_calc flag is set, and -m 9c provides alignments when only one
909best hit is present.
910
911Fix comp_lib.c to use different maxn and overlap for each different
912query sequence. fasta34 and fasta34_t now have identical results when
913a long sequence is searched.
914
915Add '@C:101' support to memory mapped FASTA format files.
916
917Fix mshowalign.c so that coordinates returned by cal_coord() use
918loffset+l_off.
919
920>>Sept 14, 2002 CVS tag fa34t20b2
921
922Changes to p2_complib.c, compacc.c to fix statistics problems with
923pv34compfs on query sequences with more than 10 fragments.
924
925>>Aug 27, 2002
926
927Modifications to mshowbest.c and drop*.c (and p2_workcomp.c,
928compacc.c, doinit.c, etc.) to provide more information about the
929alignment with the -m 9 option. There is now a "-m 9c" option, which
930displays an encoded alignment after the -m 9 alignment information.
931The encoding is a string of the form: "=#mat+#ins=#mat-#del=#mat".
932Thus, an alignment over 218 amino acids with no gaps (not necessarily
933100% identical) would be =218. The alignment:
934
935 10 20 30 40 50 60 70
936GT8.7 NVRGLTHPIRMLLEYTDSSYDEKRYTMGDAPDFDRSQWLNEKFKL--GLDFPNLPYL-IDGSHKITQ
937 :.:: . :: :: . .::: : .: ::.: .: : ..:.. ::: :..:
938XURTG NARGRMECIRWLLAAAGVEFDEK---------FIQSPEDLEKLKKDGNLMFDQVPMVEIDG-MKLAQ
939 20 30 40 50 60
940
941would be encoded: "=23+9=13-2=10-1=3+1=5". The alignment encoding is
942with respect to the beginning of the alignment, not the beginning of
943either sequence. The beginning of the alignment in either sequence is
944given by the an0/an1 values. This capability is particularly useful
945for [t]fast[xy], where it can be used to indicate frameshift positions
946"/#\#" compactly. If "-m 9c" is used, the "The best scores" title
947line includes "aln_code".
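
The encoding can be unpacked mechanically. The following Python sketch is
illustrative only and is not part of the FASTA distribution. It assumes,
from the example above, that "+n" marks residues present only in the query
and "-n" marks residues present only in the library sequence, and it does
not handle the "/" and "\" frameshift codes used by [t]fast[xy].

```python
import re

def decode_aln_code(code):
    """Split an encoded alignment such as '=23+9=13' into (op, count) pairs."""
    pairs = re.findall(r"([=+-])(\d+)", code)
    # Reject anything that is not a pure sequence of =N, +N, -N runs
    # (for example the frameshift codes, which this sketch does not cover).
    if "".join(op + n for op, n in pairs) != code:
        raise ValueError("unsupported aln_code: %r" % code)
    return [(op, int(n)) for op, n in pairs]

def aligned_lengths(code):
    """Return (query_residues, library_residues, alignment_columns)."""
    query = library = columns = 0
    for op, n in decode_aln_code(code):
        columns += n
        if op in "=+":   # aligned run or insertion: query residues present
            query += n
        if op in "=-":   # aligned run or deletion: library residues present
            library += n
    return query, library, columns

print(aligned_lengths("=23+9=13-2=10-1=3+1=5"))  # (64, 57, 67)
```

For the GT8.7/XURTG example this gives 64 query residues and 57 library
residues spread over 67 alignment columns.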
948
949>>Aug 14, 2002 CVS tag fa34t20
950
951Changes to nmgetlib.c to allow multiple query searches coming from
952STDIN, either through pipes or input redirection. Thus, the command
953
954 cat prot_test.lseg | fasta34 -q -S @ /seqlib/swissprot
955
956produces 11 searches. If you use the multiple query functions, the
957query subset applies only to the first sequence.
958
959Unfortunately, it is not possible to search against a STDIN library,
960because the FASTA programs do not keep the entire library in memory
961and need to be able to re-read high-scoring library sequences. Since
962it is not possible to fseek() against STDIN, searching against a STDIN
963library is not possible.
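
The restriction comes down to whether the input stream supports seeking.
The Python sketch below is only an illustration of that property (the
FASTA programs are written in C and use fseek()); the function name is
hypothetical.

```python
import io
import os

def can_serve_as_library(stream):
    """A library must be re-readable, so the stream has to support seeking."""
    return stream.seekable()

# An in-memory file behaves like a regular file and can be repositioned.
print(can_serve_as_library(io.BytesIO(b">seq1\nACGT\n")))

# A pipe, which is what STDIN is in "cat lib | program", cannot.
read_fd, write_fd = os.pipe()
with os.fdopen(read_fd, "rb") as pipe_reader, os.fdopen(write_fd, "wb"):
    print(can_serve_as_library(pipe_reader))
```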
964
965>>Aug 5, 2002
966
967fasts34(_t) and fastm34(_t) have been modified to allow searches with
968DNA sequences. This gives a new capability to search for DNA motifs,
969or to search for ordered or unordered DNA sequences spaced at
970arbitrary distances.
971
972>>Aug 4, 2002
973
974comp_lib.c has been modified to provide comp_mlib.c function.
975comp_mlib.c is no longer used. comp_lib.c with the "mlib" function
976can now recognize protein or DNA sequences automatically, and reads
977from stdin can now detect DNA/protein sequence types automatically.
978Changes to compacc.c, getseq.c, doinit.c initfa.c, initsw.c, and
979nmgetlib.c to support automatic sequence type detection.
980
981>>July 28-31, 2002
982
983(1) The various Makefile's have been "normalized". The fast*34[_t]
984 (Makefile.34m.common[_sql]), Makefile.pvm4[_sql], and
985 Makefile.mpi4[_sql] make files all use a common set of filenames,
986 described in Makefile.fcom. This greatly simplifies adding
987 programs, but requires that all *.o files be deleted when moving
988 from fast*34* to pv34comp* to mp34comp*.
989
990(2) showalign.c/p_showalign.c have been merged into mshowalign.c;
991 showbest.c/manshowbest.c have been merged into mshowbest.c. Some
992 of the related files (showun.c, manshowun.c) have not been merged
993 or tested.
994
995(3) Code for ranking scores with valid e_value's incorporated.
996
997(4) Bug fixes in p2_complib.c, so that fasts34/fasts34_t/pvcompfs
998 provide identical statistics.
999
1000>>July 26, 2002
1001
1002Makefile.pvm4_sql and Makefile.pvm4 have been substantially simplified
1003by providing the worker program name from the h_init() function in the
1004initfa.c/initsw.c files.
1005
1006>>July 24, 2002
1007
1008Substantial modifications to param.h, structs.h to ensure that no
1009sequence specific information is kept in struct pstruct. This
1010structure now holds the pam[] matrix, and other scoring parameters,
1011but nothing that is dependent on aa0. The aa0 dependent stuff (nm0,
1012Lambda, K, etc) is now stored in struct mngmsg. This was mostly done
1013to support the pv34comp* programs, which have separate mngmsg
1014structures but the same pstructs.
1015
1016The fasts34, fasts34_t, and pv34compfs/c34.workfs have all been tested
1017successfully.
1018
1019>>July 19, 2002
1020
1021Fix an old bug in the calculation of E()-values in DNA databases
1022longer than 2147483647 residues on machines with 32-bit longs.
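
The limit is the largest value a signed 32-bit integer can hold. The note
above does not say how the calculation was corrected; the Python sketch
below only shows why a residue count kept in a 32-bit long goes wrong past
that point. The sequence lengths are hypothetical.

```python
import ctypes

def total_residues_32bit(lengths):
    """Sum sequence lengths the way a signed 32-bit long would."""
    total = 0
    for n in lengths:
        total = ctypes.c_int32(total + n).value  # wraps at 2**31
    return total

def total_residues_64bit(lengths):
    """The same sum with a 64-bit accumulator."""
    total = 0
    for n in lengths:
        total = ctypes.c_int64(total + n).value
    return total

lengths = [1_500_000_000, 1_000_000_000]  # hypothetical library, 2.5e9 residues
print(total_residues_32bit(lengths))  # -1794967296
print(total_residues_64bit(lengths))  # 2500000000
```

Any E()-value scaled by the wrapped, negative database size would be
meaningless, which is consistent with the symptom described.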
1023
1063>>July 8, 2002
1064
1065Modifications to comp_lib.c, initfa.c and new scaleswt.c, tatstats.c
1066to support FASTS with Tatusov statistics.
1067
1068last_params() has been introduced to allow aa0 dependent changes in m_msg/pstr.
1069
1070sortbest() has been moved into initfa.c/initsw.c to make it function specific.
1071
1072find_z() takes an additional parameter, escore.
1073
1074The do_work() results structure, beststr, and stat_str all accommodate
1075escores as well as integer scores (stat_str also saves segn and segl
1076but doesn't need them).
1077
1078In scaleswt.c, process_hist() now knows much more about Tatusov statistics.
1079
1080last_stats() provided to accommodate rank-based statistical corrections.
1081
1082scale_scores() is the last function to modify the beststr scores
1083(final calculation of E-value).
1084
1085Some sortbest*() calls and some bptr[i]->zscore=find_zp() loops have
1086been moved into scale_scores().
1087
1088>>July 3,5, 2002
1089
1090Modifications to allow mySQL comments (--) in "library.sql 16" files.
1091Thus, a first line of:
1092
1093 --host seqdb user password;
1094
1095is read by FASTA as the login information to a mySQL server, but is
1096ignored by mySQL. "DO" commands in FASTA mySQL files can also be
1097rendered invisible to mySQL in this way. See "do.sql".
1098
1099Modifications to mysql_lib.c to allow very long SQL statements. The
1100buffer is now dynamically reallocated in 4Kb chunks.
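
Growing a buffer in fixed chunks can be sketched as follows. This is a
Python illustration of the idea only; the class and function names are
hypothetical and do not correspond to the code in mysql_lib.c, and the
extra byte reserved for a terminator is an assumption.

```python
CHUNK = 4096  # 4 Kb

def capacity_for(length, chunk=CHUNK):
    """Smallest multiple of chunk that holds length bytes plus a terminator."""
    needed = length + 1
    return ((needed + chunk - 1) // chunk) * chunk

class StatementBuffer:
    """Accumulates SQL text, growing in whole chunks when it runs out of room."""

    def __init__(self):
        self.data = bytearray(CHUNK)
        self.used = 0

    def append(self, text):
        raw = text.encode("ascii")
        need = capacity_for(self.used + len(raw))
        if need > len(self.data):
            # Extend with zero bytes up to the next chunk boundary.
            self.data.extend(bytes(need - len(self.data)))
        self.data[self.used:self.used + len(raw)] = raw
        self.used += len(raw)

buf = StatementBuffer()
buf.append("select id from protein where " + "x" * 5000)
print(len(buf.data))  # 8192
```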
1101
1102The fasta3.1 man page has been updated and re-organized.
1103
1104>>June 26, 2002
1105
1106Minor modifications to nmgetaa.c (openlib()) to use the same arguments
1107for searching and PRSS. PRSS needs access to all of m_msg, but
1108searches do not. Other small fixes to comp_mlib.c, towards the goal
1109of merging comp_mlib.c and comp_lib.c.
1110
1111>>June 25, 2002
1112
1113Modify the statistical estimation strategy to sample all the sequences
1114in the database, not just the first 60,000. The histogram is still
1115based only on the first 60,000 scores and lengths, though all scores
1116and lengths are shown. The fit to the data may be better than the
1117histogram indicates, but it should not be worse.
1118
1119Currently, this modification is available only if the -DSAMPLE_STATS
1120option is defined.
1121
1122>>June 23, 2002 CVS fa34t11d4
1123
1124Fix a very long-standing bug in fasty/tfasty that caused 'NNN' to be
1125translated as 'S', rather than 'X'. fastx/tfastx has done this
1126correctly for many years, but the fasty/tfasty code that I received
1127from Zheng Zhang was not implemented correctly (my fault, his code was
1128fine).
1129
1130>>June 19, 2002
1131
1132Added "-C #" option, where 6 <= # <= MAX_UID (20), to specify the
1133length of the sequence name display on the alignment labels. Until
1134now, only 6 characters were ever displayed. Now, up to MAX_UID
1135characters are available.
1136
1137>>May 30, 2002 CVS fa34t11d3
1138
1139Fixed problem with programs using the default -E cutoff when -b was
1140provided. With this implementation, -E can override -b, but -b
1141overrides the default -E.
1142
1143Fixed problem with 64-bit file offsets in param.h (change USE_FSEEK0
1144-> USE_FSEEKO, include -D_LARGEFILE_SOURCE and -D_LARGEFILE64_SOURCE
1145in Makefile.linux_sql). Put limits on alignment display length (200
1146chars). More checks for null returns from SQL queries.
1147
1148>>Apr 17, 2002 CVS fa34t11d2
1149
1150Fixed bug in mm_file.h/ncbl2_mlib.c that caused the SGI version to be
1151unable to read blast2 format files.
1152
1153Changed "mp_*" tags to "pg_*" for -m 10 option.
1154
1155>>Mar 30, 2002
1156
1157Fix embarrassing bug in revcomp() (getseq.c) that failed to complement
1158the central nucleotide in a sequence with an odd number of residues.
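
This class of bug is easy to reproduce. The Python sketch below is an
illustration, not the code in getseq.c: an in-place reverse complement
that swaps from both ends must use "<=" rather than "<" in its loop
condition, or the middle residue of an odd-length sequence is never
complemented.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}

def revcomp(seq):
    """Reverse complement by swapping complemented residues from both ends."""
    bases = list(seq.upper())
    i, j = 0, len(bases) - 1
    while i <= j:  # "<=" so that the central residue (i == j) is complemented
        bases[i], bases[j] = COMPLEMENT[bases[j]], COMPLEMENT[bases[i]]
        i += 1
        j -= 1
    return "".join(bases)

print(revcomp("ACGTA"))  # TACGT
```

With "while i < j" the same input would give TAGGT, leaving the central G
unchanged.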
1159
1160Small changes to dropfs.c for more segments.
1161
1162>>Mar 16, 2002
1163
1164Added create_seq_demo.sql, nt_to_sql.pl to show how to build an SQL
1165protein sequence database that can be used with the mySQL
1166versions of the fasta34 programs. Once the mySQL seq_demo database
1167has been installed, it can be searched using the command:
1168
1169 fasta34 -q mgstm1.aa "seq_demo.sql 16"
1170
1171mysql_lib.c has been modified to remove the restriction that mySQL
1172protein sequence unique identifiers be integers. This allows the
1173program to be used with the PIRPSD database. The RANLIB() function
1174call has been changed to include "libstr", to support SQL text keys.
1175Due to the size of libstr[], unique ID's must be < MAX_UID (20)
1176characters.
1177
1178A "pirpsd.sql" file is available for searching the mySQL distribution
1179of the PIRPSD database. PIRPSD is available from
1180ftp://nbrfa.georgetown.edu/pir_databases/psd/mysql.
1181
1182>>Mar 6, 2002
1183
1184Fix showbest.c showbest() to report pst.zdb_size as database size.
1185Fix dropnfa.c spam() to address off-by-one on end of run, and double
1186counting on backwards scan. Fix dropnfa.c do_fasta() to fix another
1187problem introduced by -S. Changes to comp_lib.c to ensure that both
1188the beginning and end of the query and library sequence have '\0'
1189present. Changes to initfa.c, initsw.c to ensure that a match to a
1190lower-case letter with -S gets exactly the same score as a match to an
1191'X'. Changes to mmgetlib.c to work with 64-bit longs in *.xin files.
1192
1193>>Feb 26, 2002
1194
1195Fixes to doinit.c, initfa.c, initsw.c to allow DNA matrices using the
1196"-s dna.mat" option. A new matrix, "d50ry.mat" is available that
1197scores +5 for a match, -2 for a transition, and -5 for a
1198transversion. "d50ry.mat" corresponds to DNA PAM50 with transitions
1199twice as common as transversions. When "-s dna.mat" is used, "-n"
1200MUST be used as well.
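
The scoring rule described for "d50ry.mat" can be written out directly.
The Python sketch below covers only A, C, G and T; how the matrix file
scores ambiguity codes is not stated here, so they are rejected rather
than guessed.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES

def d50ry_score(a, b):
    """+5 for a match, -2 for a transition, -5 for a transversion."""
    a, b = a.upper(), b.upper()
    if a not in BASES or b not in BASES:
        raise ValueError("only A, C, G, T are handled in this sketch")
    if a == b:
        return 5
    pair = {a, b}
    # A transition stays within purines (A<->G) or within pyrimidines (C<->T).
    if pair <= PURINES or pair <= PYRIMIDINES:
        return -2
    return -5

print(d50ry_score("A", "A"), d50ry_score("A", "G"), d50ry_score("A", "C"))  # 5 -2 -5
```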
1201
1202Query sequence names ("aa", "nt") should be more accurate.
1203
1204>>Feb 22, 2002
1205
1206Fix to getseq.c to allow "plain" sequence files.
1207
1208>>Feb 12, 2002
1209
1210Minor fix to res_stats.c.
1211
1212>>Jan 28, 2002
1213
1214Fixes to resurrect res_stats.c. res_stats (cc -o res_stats
1215res_stats.c scaleswn.c -lm) takes the output from a current "-R
1216file.res" file and calculates statistical significance - this allows
1217one to take exactly the same set of scores (and lengths) and calculate
1218statistical estimates using different strategies.
1219
1220>>Jan 24, 2002
1221
1222Modifications to mmgetlib.c, ncbl2_mlib.c to more robustly read memory
1223mapped files (*.xin, map_db) on machines lacking "native" 64-bit
1224longs. If the machine provides some definition for a 64-bit long
1225(e.g. "long long", "int64_t"), things should work. 64-bit offsets into
1226memory mapped files work properly on Alpha, SGI, i386 Linux, and
1227MacOSX. The current implementation depends either on 64 bit longs
1228(Compaq Alpha's pre 4.0G) or the <sys/inttype.h> file. Makefile,
1229Makefile.alpha, and Makefile.linux have been modified.
1230
1231Modifications to nmgetlib.c, mmgetlib.c to provide GI numbers and
1232Accession versions for Genbank searches. If the GI:123456 number is
1233available, it will be used and the description line will be formatted:
1234
1235 gi|123456|gb|ACC1234.1|LOCUS description
1236
1237This should help FAST_PAN runs, where the version of a sequence
1238changes frequently.
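
Scripts that post-process search output may need to build or take apart
description lines of this form. A minimal Python sketch, with hypothetical
function names, assuming the five "|"-separated fields shown above:

```python
def format_description(gi, accession_version, locus, description):
    """Build a 'gi|...|gb|...|LOCUS description' line."""
    return "gi|%s|gb|%s|%s %s" % (gi, accession_version, locus, description)

def parse_description(line):
    """Split a line of the form above back into its fields."""
    tag, gi, db, accession_version, rest = line.split("|", 4)
    if tag != "gi":
        raise ValueError("no GI number present")
    locus, _, description = rest.partition(" ")
    return {"gi": gi, "db": db, "accession": accession_version,
            "locus": locus, "description": description}

line = format_description("123456", "ACC1234.1", "LOCUS", "description")
print(line)  # gi|123456|gb|ACC1234.1|LOCUS description
print(parse_description(line)["accession"])  # ACC1234.1
```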
1239
1240>>Jan 10, 2002
1241
1242Modifications to p2_complib.c, p2_workcomp.c to more reliably allocate
1243space for library sequence descriptions on the master and workers.
1244
1245>>Jan 2-3, 2002 CVS fa34t10c/fa34t10d3
1246
1247Fixes to comp_lib.c to support Macintosh and Windows/Turbo-C
1248compilation. New Makefile.tc. Macintosh version supports both
1249"Classic" and "Carbon" environments.
1250
1251"<values.h>" has been replaced with the more modern "<limits.h>"
1252
1253Fixes to p2_complib.c to support n_libstr (libstr length) in GETLIB().
1254
1255comp_thr.c, complib.c removed.
1256
1257>>Dec 16, 2001
1258
1259Complete integration of comp_mlib.c with both the unthreaded and
1260threaded programs. Comp_mlib allows fasta34 and fasta34_t to compare
1261a database with a second database, just as pv34compfa does. Using
1262multiple queries with fasta34_t is not as efficient as pv34compfa (and
1263it cannot use networks of Unix workstations), but it is much easier to
1264use and install.
1265
1266With the comp_mlib.c option, fasta34 cannot automatically recognize
1267DNA sequences, just as pv34compfa no longer recognizes DNA sequences.
1268You must use the "-n" option to search with DNA sequences. The other
1269programs (fastx34, tfastx34, etc) "know" the type of the query and
1270database sequences, so "-n" is only required for fasta34(_t).
1271
1272>>Dec 14, 2001 CVS tag fa34t10b
1273
1274Fix problems reading DNA databases in blast2 format.
1275
1276>>Dec 11, 2001
1277
1278Changes to spam() in dropnfa.c so that, for DNA sequences, the
1279boundaries of a local alignment region are found with the same
1280algorithm as in previous versions of fasta. For
1281protein sequences, the algorithm will extend the local region beyond
1282the "ktup" boundaries if a better score can be found. For DNA
1283sequences, this raises the noise rather than increasing sensitivity,
1284so it is turned off and "ktup" boundaries are respected. The old,
1285"ktup" boundary algorithm is available with -DNOSPAM_EXT.
1286
1287This version also includes a working res_stats.c, which can be used to
1288test various statistical estimates on exactly the same set of scores.
1289
1290Fixed problems with -m 9 percent identity for fastx/fasty/tfastx/tfasty.
1291These errors have been present since -m 9 was implemented.
1292
1293>>Dec 10, 2001
1294
1295Fix to map_db.c to work correctly with files > 2 Gb when 64-bit longs
1296are available. It is not yet designed to work with ftello() and other
1297offset types.
1298
1299>>Nov 11,21, 2001 CVS tag fa34t10a, fa34t10d1
1300
1301Substantial changes to revcomp(), getseq(), and other functions to
1302correct problems with -S on DNA sequences. Sequences with lower case
1303nucleotides were not recognized or reverse complemented properly.
1304
1305Fix to dropnfa.c (v34t07, Nov 21, 2001) bg_align() to re-initialize
1306static globals - this fixes a problem encountered with pv34compfa. A
1307new main program, comp_mlib.c has been added to the CVS archive,
1308although it is not referenced in any of the Makefiles. comp_mlib.c
1309works like p2_complib.c and compares a library against another
1310library.
1311
1312>>Nov 4, 2001
1313
1314Change to dropnfa.c spam () while(1) -> while(lpos <= dmax->stop).
1315This fixes a problem with ktup=1 on Suns only, so far.
1316
1317>>Oct 4, 2001 CVS tag fa34t10
1318
1319Add comp_lib.c file, which merges complib.c (unthreaded) and
1320comp_thr.c (threaded) code into one file.
1321
1322Modifications to nmgetlib.c, mmgetaa.c to allow Genbank flatfile
1323format without DESCRIPTION or ACCESSION lines.
1324
1325Additional fix for -S with ktup=1.
1326
1327>>Sept. 24, 2001
1328
1329Fix to have correct gap-penalties for short scoring matrices with
1330tfastx/fastx.
1331
1332>>Sept. 10, 2001 CVS tag fa34t05d6
1333
1334Fix a bug introduced by -S fix in fa34t05d5. Also, try to remove
1335changes in p34compfa compared to pv4compfa output.
1336
1337>>Sept. 6, 2001 CVS tag fa34t05d5
1338
1339Fix the -S dropnfa/fx/fz2 bug that was not actually fixed in
1340fa34t05d4. Incorporate the correct scaleswn.c referred to in
1341fa34t05d4.
1342
1343>>Sept. 5, 2001 CVS tag fa34t05d4
1344
1345Fix problem with m_msg.quiet that prevented interactive prompts for
1346ktup, file name, etc with threaded programs.
1347
1348Fix serious bug in dropnfa.c/dropfx.c/dropfz2.c that caused -S to work
1349improperly on sequences with effective length of 3 or less.
1350
1351Change to scaleswn.c to make mle_cen(), mle_cen2() more robust to cases
1352where the top and bottom scores are the same.
1353
1354Change p2_complib.c to avoid compiler complaints with (void *)wstage2p=NULL
1355on some platforms.
1356
1357>>Aug. 30, 2001 CVS tag fa34t05d3
1358
1359Fixed problem with uthr_subs.c for Suns, but changed Makefile.sun to
1360use pthreads rather than Sun Unix threads. Removed SQL stuff from
1361Makefile.mpi4/pvm4 and added Makefile.mpi4_sql/pvm4_sql.
1362
1363fa34t05d2 - fix to map_db.c to provide *sascii.
1364
1365fa34t05d1 - fixes to ibm_pthr_subs.c and Makefile.ibm from IBM.
1366
1367>>Aug. 20, 2001 CVS tag fa34t05d0
1368
1369The pvm/mpi complib programs have been substantially updated with
1370release 3.4. See readme.v34t0 for more information. With version
13713.4, the MPI programs are mp34comp*, mu34comp*, etc.
1372
1373A major effect of this change is to disable automatic sequence type
1374(protein/DNA) recognition with pv34compfa/mp34compfa. By default,
1375protein libraries are assumed. Thus, pv34compfa/mp34compfa require
1376the "-n" command line option when running pv34compfa/mp34compfa on DNA
1377sequence libraries. This issue does not occur with the other
1378programs, which will recognize the appropriate sequence type, because
1379it is determined by the program (e.g. pv34compfx requires
1380DNA:protein).
1381
1382Fixed substantial problem with 64-bit file offsets for Linux in
1383complib.c/comp_thr.c, p2_complib.c. This problem, solved by Doug
1384Blair, was preventing the threaded versions from working properly in
1385memory mapped mode.
1386
1387In all earlier versions of fasta, when very long sequences were
1388searched, the sequence length reported was that of the "chunk" that
1389was actually searched (typically 80,000-query_length) rather than the
1390actual library sequence length. This peculiar behavior has now changed,
1391and the full length of the library sequence, not the sequence chunk,
1392is reported as the library sequence length. Note that chunks are
1393still used, however, which can cause the same alignment to be shown
1394twice. In addition, the "-m 9" output format has changed to report
1395the coordinates of the query and library sequence (see below), which
1396may be different from 1-sequence_length because the query and
1397library sequences may have been extracted from larger sequences. Four
1398additional fields have been added, "pn0", "px0", "pn1", "px1", that are
1399the positions of the beginning (pn0/1) and end (px0/1) of the
1400query/library sequence. pn0/1 would typically be changed with the
1401"@C:#" directive, described below.
1402
1403Changes to doinit.c/initfa.c/initsw.c to provide a new function -
1404f_lastenv() - that allows function-specific adjustments to parameters
1405after the command line options have been read but before the first
1406sequence is read. This change solved problems with "mp/pv34compfx -S".
1407
1408fasts34/tfasts34 now recognize that 'I/L' are the same, as are 'Q/K'
1409(which are apparently indistinguishable by Mass-Spec). The latter
1410identity is on by default, but can be turned off with "-h 0".
1411
1412The MPI/PVM versions of the programs have been tested extensively with
1413compfa, compfx, and comptfx. Makefile.mpi4 now works properly.
1414Changes to p2_complib.c to support the PVM option "-T 1-4", which
1415allows one to run on nodes 1-4 of a (presumably larger) PVM virtual
1416machine. This option has no effect on the mp34comp* programs. The
1417old "-T 4" to run on 4 nodes, is also available. If each node has 2
1418cpu's, as indicated in the "pvmd hostfile", both CPU's will be used
1419for a total, in this example, of 8 processes. This allows one to
1420specify a large PVM machine and use separate parts of it
1421independently.
1422
1423Changes to nmgetlib.c to fix problems with longer dates in GCG files
1424(Y2K). Fixes to faatran.c for extended alphabets and 'X's. Various
1425code clean-ups to make "gcc -Wall" a little bit (not much) happier.
1426
1427This is the first distributed fasta34 version.
1428
1429================
1430>>Aug 9, 2001 CVS tag fa34t05
1431
1432Corrections to initfa.c to allow -S to work with tfastx/y.
1433Fix to manshowbest.c for query position with -m 9.
1434
1435>>July 18, 2001 CVS tag fa34t04
1436
1437Various changes to complib.c, comp_thr.c, p2_complib.c, showbest.c,
1438showalign.c to deal with overlapping alignments in long sequences that
1439have been segmented. When long sequences are segmented (lcont>0), the
1440eventual total length (n1tot_v) is saved at beststr->n1tot_p. If
1441there was no lcont, then beststr->n1tot_p = NULL, and beststr->n1
1442should be used as the sequence length. This has the advantage of
1443requiring space only when long sequences are encountered, and
1444requiring only one integer for several segments.
1445
1446m_msg.noshow has been removed.
1447
1448The -m 9 format has been changed - 5 fields have been added, 4
1449(pmn0/pmx0/pmn1/pmx1) provide the beginning and end coordinates of the
1450query and library sequence; the last (fs) reports the number of
1451frameshifts. The names of the alignment boundaries have been changed
1452from min0/max0/min1/max1 to amn0/amx0/amn1/amx1 (Alignment miN/maX).
1453
1454The SQL format has been extended to provide for statements that do
1455things but do not generate results, such as creating and selecting into a temporary table, e.g.:
1456================
1457 do
1458 create temporary table seq_pos (
1459 id int unsigned not null auto_increment primary key,
1460 prot_id int unsigned not null default 0,
1461 start int unsigned not null default 0,
1462 length int unsigned not null default 0
1463 )
1464 ;
1465 do
1466 insert into seq_pos (prot_id, start, length)
1467 select id, 11, len-10
1468 from protein, annot
1469 where len > 100
1470 and annot.protein_id = protein.id
1471 and annot.pref=1
1472 ;
1473 select seq_pos.id,
1474 substring(protein.seq, start, length),
1475 concat("@C:", start, " ", descr)
1476 from protein, seq_pos, annot
1477 where protein.id = annot.protein_id
1478 and protein.id = seq_pos.prot_id
1479 and annot.pref = 1
1480 ;
1481 select prot_id,
1482 concat("@C:", start, " ", descr)
1483 from seq_pos, annot
1484 where annot.protein_id = seq_pos.prot_id
1485 and seq_pos.id = #
1486 and annot.pref = 1
1487 ;
1488================
1489
1490 In the current implementation, these statements must start with "DO"
1491as the first two characters on the line, and come immediately after a
1492line ending with ';'. The text from "DO" to the next ";", excluding
1493the "DO", is executed when the database connection is made.
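
The separation of "DO" statements from result-producing queries can be
sketched as below. This Python illustration is deliberately simplified and
is not the parser in mysql_lib.c: it splits on every ";" (so a ";" inside
a quoted string would break it), accepts "do" in either case, and ignores
the requirement that "DO" start a line.

```python
def split_sql_library(text):
    """Return (setup_statements, queries) from the text of an SQL library file."""
    setup, queries = [], []
    for statement in text.split(";"):
        statement = statement.strip()
        if not statement:
            continue
        first_word = statement.split(None, 1)[0]
        if first_word.upper() == "DO":
            # Everything after the "DO" keyword is run at connection time.
            setup.append(statement[2:].strip())
        else:
            queries.append(statement)
    return setup, queries

example = """do
create temporary table t (id int)
;
do
insert into t values (1)
;
select id from t
;
"""
setup, queries = split_sql_library(example)
print(setup)    # ['create temporary table t (id int)', 'insert into t values (1)']
print(queries)  # ['select id from t']
```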
1494
1495===== >>July 12, 2001
1496
1497The allocation of the work_info data structure used to send
1498information to the worker threads has been changed. The old method
1499worked, possibly by accident.
1500
1501A bug in p2_complib.c that caused E()-values to be calculated
1502improperly for the first query sequence has been fixed.
1503
1504>>July 11, 2001 --> fa34t02
1505
1506It is now possible to specify output coordinates in library sequences
1507by including the string: "@C:number" on the description line, e.g.
1508
1509 >gtm1_human gi|12345 human glutathione transferase M1 @C:21
1510
1511would label the first residue in the library sequence "21" rather than
1512"1". This capability has been included to provide accurate
1513coordinates for searches done against subsequences generated by an SQL
1514query. For example, one could use a query of the form:
1515
1516 SELECT protein.id, substring(protein.seq,11,length(protein.seq)-20),
1517 concat(protein.name," @C:11 ",protein.descr)
1518 FROM protein;
1519
1520to generate a sequence set with each sequence starting with residue
152111. Without the "@C:11" option on the description line, the program
1522would number the alignment positions starting at 1, even though the
1523first residue of the sequence really started at 11. "@C:11" allows
1524one to correct the coordinate system.
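
The effect of the directive on coordinate labels can be sketched in a few
lines of Python. This is an illustration only, with hypothetical function
names; it assumes a non-negative integer offset and treats a missing
directive as an offset of 1.

```python
import re

def coordinate_offset(description):
    """Return the number given by '@C:number', or 1 if the directive is absent."""
    match = re.search(r"@C:(\d+)", description)
    return int(match.group(1)) if match else 1

def residue_label(index, description):
    """Coordinate shown for the residue at 0-based position index."""
    return coordinate_offset(description) + index

desc = ">gtm1_human gi|12345 human glutathione transferase M1 @C:21"
print(residue_label(0, desc))  # 21
print(residue_label(9, desc))  # 30
```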
1525
1526Currently, "@C:offset" is available only with library type 1 (fasta
1527format) and 16 (mySQL).
1528
1529The SQL-generated database with "@C:offset" can be used with both the
1530fast*34(_t) programs and with pv34comp*. However, the SQL syntax is
1531used differently in the fasta34 and pv34compfa programs. fast*34(_t)
1532requires three SQL statements during a search: (1) a statement to
1533generate a large set of library sequences; (2) a statement to generate
1534a description of a single sequence, given a unique identifier provided
1535by (1); and (3) a statement to generate a single sequence given a
1536unique identifier provided by (1). For fast*34 searches, the third
1537(3) SQL statement must provide the "@C:offset" information in the
1538third results field for the offset to be used. It is optional in (1)
1539and (2).
1540
1541The pv34comp* programs only require one SQL statement, statement (1)
1542above, which must provide three fields, a unique identifier, the
1543sequence, and a complete description that must include "@C:offset" if
1544substrings are used. If SQL queries (2) and (3) are provided, they
1545are ignored. Thus, the same files can be used by both programs, but
1546the "@C:offset" is required in different SQL queries by the fast*34
1547and pv34comp* programs.
1548
1549Other changes:
1550
1551Re-incorporation of GAP_OPEN option; fix to Altschul-Gish stats when
1552GAP_OPEN is used.
1553
1554Re-incorporation of A. Mackey's spam() improvement in dropnfa.
1555
1556Fixes to include file ordering to allow fast*34(_t) pv34comp* programs
1557to compile.
1558
1559Fix to lascii[] for SQL database queries.
1560
1561Fix to an old bug in comp_thr.c to send individual worker_info
1562structures to threads (does not fix LINUX threads problems, however).
1563
1564=====
1565>>July 9, 2001
1566
1567Considerable changes to support no-global library functions.
1568
1569(1) Separate ascii/sequence mapping arrays are used by the
1570 query-reading (qascii), library-reading (lascii), and sequence
1571 comparison function (pascii) routines. As a result, there is no
1572 longer a need for tgetlib.o/lgetlib.o - lgetlib.o can serve both
1573 functions.
1574
1575(2) This also allows us to remove all #ifdef TFAST/FASTX conditionals
1576 from complib.c/comp_thr.c/p2_complib.c. We no longer need
1577 tcomp_thr.o, comp_thrx.o, etc. We still have a variety of
1578 p2_complib.o variations to support the different c34.work* files.
1579
1580(3) Because non-global openlib/getlib functions are available, exactly
1581 the same open/get functions are available for reading both the
1582 query and reference libraries in pv34comp* programs. The
1583 host-specific openlib/getlib functions in hxgetaa.c are now
1584 provided by nmgetlib.c, etc. This has two effects:
1585
1586 (a) it is now possible to compare a query database generated by an
1587 SQL query to a library database generated by a different SQL
1588 query.
1589
1590 (b) pv34comp* has lost (at least in this version) the ability to
1591 automatically detect the query sequence type. To search with a
1592 DNA query, you MUST use "-n".
1593
1594(4) the resetp() function is now responsible for almost all of the
1595 function specific (TFAST/FASTX/etc) initializations. All of the
1596 function specific code has been removed from complib.c/comp_thr.c
1597 and most of it has been moved to initfa.c/resetp().
1598
1599(5) manageacc.c has been merged into compacc.c (mostly prhist()).
1600
1601=====
1602>>June 1, 2001
1603
1604Many changes to accommodate a new - no global variable - strategy for
1605reading sequence databases. Every time a file is opened, a struct
1606lmf_str is allocated which can be used for memory mapped files, ncbl2
1607files, and mysql files.
1608
1609In addition, an open'ed file has a default sequence type: DNA or
1610protein, or one can open a file in a mode that will allow the sequence
1611type to be changed.
1612
1613=====
1614>>May 18, 2001 CVS: fa33t09d0
1615
1616A new compile time parameter - -DGAP_OPEN, is available to change the
1617definition of the "-f gap-open" parameter from the penalty for the
1618first residue in a gap to a true gap-open penalty, as is used in BLAST
1619and many other comparison algorithms. This will probably become the
1620default for fasta in version 3.4.
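
The two definitions differ by exactly one extension penalty. In the Python
sketch below, f is the value given to "-f", g is the per-residue extension
penalty, and n is the gap length; the numeric values are illustrative and
are not taken from this file.

```python
def gap_cost_first_residue(f, g, n):
    """Older definition: f is the penalty for the first residue in the gap."""
    return f + (n - 1) * g

def gap_cost_true_open(f, g, n):
    """-DGAP_OPEN definition: f is charged once, then g for every residue."""
    return f + n * g

# A first-residue penalty of -12 with extension -2 corresponds to a true
# gap-open penalty of -10 with the same extension.
for n in range(1, 6):
    print(n, gap_cost_first_residue(-12, -2, n), gap_cost_true_open(-10, -2, n))
```

In general, the first-residue value equals the true gap-open value plus
one extension penalty.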
1621
1622Fixes to conflicts between "-S" and "-s matrix". When a scoring
1623matrix file was specified, lower-case alignments were not displayed
1624with -S (although the scores were calculated properly).
1625
1626More extensive testing of mysql_lib.c (mySQL query-libraries) with
1627the pv4comp* and mp4comp* programs.
1628
1629=====
1630>>April 5, 2001 CVS: fa33t08d4b3
1631
1632Changes in nmgetlib.c and ncbl2_mlib.c to return long sequence
1633descriptions for PCOMPLIB (pv4/mp3comp*). Also fix p2_complib.c to
1634request DNA library for translated comparisons.
1635
1636Fix for prss33(_t) to read both sequences from stdin.
1637
1638=====
1639>>March 27, 2001 CVS: fa33t08d4
1640
1641Modifications to allow 64-bit fseek/ftell on machines like Sun,
1642Linux/Intel, that support -D_FILE_OFFSET_BITS=64, -D_LARGEFILE_SOURCE,
1643off_t, and fseeko(), ftello() with the option -DUSE_FSEEKO. Machines
1644with 64-bit long's do not need this option. Machines with 32-bit
1645longs that allow files >2 Gb can do so with 64-bit file access
1646functions, including fseeko() and ftello(), which work with off_t file
1647offsets instead of long's.
1648
1649=====
1650>>March 3, 2001 CVS: fa33t08d2
1651
1652Corrected problems in nmgetaa.c and mysql_lib.c with parallel
1653programs, and one serious problem with alternate DNA scoring matrices
1654(initfa.c, initsw.c) not being set properly. A subtle problem with
1655the merge of scaleswn.c and scaleswg.c is fixed.
1656
1657>>February 17, 2001
1658
1659Modified mysql_lib.c to use "#", rather than "%ld", to indicate the
1660position of the GID. This change was made because sprintf() cannot be
1661used reliably to generate an SQL string, as '"' and '%' are used in
1662such strings.
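
The idea of a plain placeholder can be sketched as follows. This is
illustrative Python, not the C code in mysql_lib.c; the function name, the
table and the column names are hypothetical, and replacing every '#' in
the template is an assumption.

```python
def fill_gid(sql_template, gid):
    """Return the SQL text with each '#' replaced by the integer GID."""
    return sql_template.replace("#", str(int(gid)))


# Hypothetical query: the '%' wildcards pass through untouched because no
# printf-style formatting is applied to the template.
TEMPLATE = "SELECT descr FROM annot WHERE gi=# AND descr LIKE '%kinase%'"
```

Because the substitution is a literal string replacement, '%' and '"'
characters in the SQL text need no escaping.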
1663
1664=====
1665>>January 17, 2001
1666(no version change, date change)
1667
1668Minor fixes to initfa.c, initsw.c to deal with DNA scoring matrices
1669properly. "-n -s dna.mat" is required for the sequence/matrix to be
1670recognized as DNA.
1671
1672>>January 16, 2001
1673-->v34t00
1674
1675Merge of the main CVS trunk - fa33t06 with the latest release branch,
1676fa33t08.
1677
1678In addition, PCOMPLIB mods have been made to mysql_lib.c. Because
1679p2_complib.c gets sequence description information during the first
1680read of the database, the mysql_query must be changed to return:
1681result[0]=GID, result[1]=description, result[2]=sequence. In the
1682PCOMPLIB case, the other SQL queries (for GID description, sequence)
1683are not necessary but must still be provided.
1684
readme.v35
1
2 $Id: readme.v35 120 2010-01-31 19:42:09Z wrp $
3 $Revision: 55 $
4
5>>Sep. 10, 2008
6
7Fix problem in init_ascii() call for p2_complib2.c.
8
9>>Sep. 9, 2008
10
11Fix bug in display of library name when written to an output file
12(rather than stdout).
13
14>>Aug. 28, 2008 fa35_04_02 SVN Revision: 45
15
16Fix serious bug in alignment generation that only occurred when large
17libraries were used as a query with [t]fast[x/y]. This bug often
18resulted in a core dump.
19
20Address some other issues with uninitialized variables with -m 9c.
21
22>>Jul. 15, 2008 fa35_04_01 SVN Revision: 38
23
24Correct problems with Makefiles. Add information on compiling to README.
25Address issue with mp_KS for -m 10 when searching small libraries.
26
27>>Jul. 7, 2008 fa35_04_01 SVN Revision: 35
28
29Fix problems that occurred when statistics are disabled with -z -1,
30both for a normal library search, and for searches of a small library.
31
32>>Jul. 3, 2008 SVN Revision: 33
33
34Continue to fix an issue with 'J' and -S.
35
36>>Jun. 29, 2008 SVN Revision: 29, 31
37
38Fix additional problems with Makefiles, some issues uncovered with
39Solaris 'C' compiler (Rev. 30).
40
41Discover serious bug when searching long, overlapping sequences, such
42as genomes. The length of the library sequence was not updated to
43reflect the length of the new region plus the overlap.
44
45Fix inconsistency in the value of 'J' between uascii.h/aascii[] and
46pascii[]. Add code to ensure that lascii[], qascii[], never return a
47value outside pam2[][] (all <= pst.nsq) (particularly for 'O' and 'U'
48amino-acids).
49
50exit(0) returns for map_db, list_db.
51
52>>Jun. 11, 2008
53
54Correct bug in scaleswn.c that prevented exact matches to queries < 10
55residues from being scored and displayed.
56
57>>Jun. 1, 2008
58
59Address various cosmetic issues in FASTA output:
60
61(1) Modify comp_lib2.c so that -O outfile works when multiple queries are
62compared in one run.
63
64(2) remove the duplicated query sequence length in the 1>>>query line.
65
66(3) in -m 10 output, the tags "pg_name" and "pg_ver" were duplicated, e.g.
67
68>>>K1HUAG, 109 aa vs a library
69; pg_name: fasta35_t
70; pg_ver: 35.03
71; pg_argv: fasta35_t -q -b 10 -d 5 -m 10 ../seq/prot_test.lseg a
72; pg_name: FASTA
73; pg_ver: 3.5 Sept 2006
74
75The ; pg_name and ; pg_ver tags produced by the get_param() functions in
76drop*.c have been renamed ; pg_name_alg and ; pg_ver_rel.
77
78>>>K1HUAG, 109 aa vs a library
79; pg_name: fasta35_t
80; pg_ver: 35.03
81; pg_argv: fasta35_t -q -b 10 -d 5 -m 10 ../seq/prot_test.lseg a
82; pg_name_alg: FASTA
83; pg_ver_rel: 3.5 Sept 2006
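
One reason duplicated tags are a problem is that a simple key/value reader
keeps only one of them. The sketch below is a hypothetical Python reader
for the "; key: value" lines shown above (it is not part of the FASTA
package) and shows the effect on the old and new headers.

```python
def parse_m10_tags(lines):
    """Collect '; key: value' lines from -m 10 output into a dict.

    If a key occurs twice, the later value replaces the earlier one.
    """
    tags = {}
    for line in lines:
        if not line.startswith(";"):
            continue  # skip '>>>query' and sequence lines
        key, sep, value = line[1:].partition(":")
        if sep:
            tags[key.strip()] = value.strip()
    return tags


OLD_HEADER = [
    ">>>K1HUAG, 109 aa vs a library",
    "; pg_name: fasta35_t",
    "; pg_ver: 35.03",
    "; pg_name: FASTA",
    "; pg_ver: 3.5 Sept 2006",
]
NEW_HEADER = [
    ">>>K1HUAG, 109 aa vs a library",
    "; pg_name: fasta35_t",
    "; pg_ver: 35.03",
    "; pg_name_alg: FASTA",
    "; pg_ver_rel: 3.5 Sept 2006",
]
```

With the old header the program name "fasta35_t" is lost; with the renamed
tags all four values survive.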
84
85Modify mshowbest.c, mshowalign.c to highlight E() values (<font
86color="dark red"></font>) in HTML output.
87
88>>Apr. 16, 2008 fa35_03_07
89
90Merge fa35_ann1_br, which allows annotations in library sequences.
91
92The PVM/MPI parallel versions now support query sequence annotations
93and -m 9c annotation encoding. It does not yet support library
94annotations. Tested with both PVM and MPI.
95
96>>Apr. 2, 2008 fa35_03_06
97
98Ensure that code in last_init() to modify ktup never increases ktup value.
99
100Add fasta_versions.html to more explicitly describe programs available.
101
102>>Mar. 4, 2008
103
104Fix parsing of parameters (matrix, gap open, gap ext) in ASN.1 PSSM
105files produced by blastpgp.
106
107>>Feb. 18, 2008 fa35_03_05
108
109Re-implement -M low-high sequence range options. Sequence range
110restriction has probably been missing since the introduction of
111ggsearch and glsearch, which use a new approach to limiting the
112sequence range.
113
114>>Feb. 7, 2008 fa35_ann1_br
115
116Add annotations to library sequences (they were already available in
117query sequences). Currently, annotations are only available within
118sequences, but they should be available in FASTA format, or any of the
119other ascii text formats (EMBL/Swissprot, Genbank, PIR/GCG). If
120annotations are present in a library and the annotation characters
121include '*', then the -V '*' option MUST be used. However, special
122characters other than '*' are ignored, so annotations of '@', '%', or
123'#' should be transparent.
124
125In translated sequence comparisons, annotations are only available for
126the protein sequence.
127
128The format for encoded annotations has changed to support annotations
129in both the query and library sequence. If the -m 9c flag is provided
130and annotations are present, then an annotated position in the
131alignment will be encoded as:
132
133 '|'q-pos':'l-pos':'q-symbol'l-symbol':'match-symbol'q-residue'l-residue'
134
135For example:
136
137 |7:7:@@:=YY|14:14:##:=TT
138
139In cases where the query or library sequence does not have an
140annotation, then the q-symbol or l-symbol will be 'X' (which is not a
141valid annotation symbol).
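
A downstream script could split this field as sketched below. This is a
hypothetical Python reader written from the format description above, not
code shipped with FASTA; it assumes every symbol and residue is a single
character and that '|' is never used as an annotation symbol.

```python
import re
from typing import NamedTuple


class AnnotatedPosition(NamedTuple):
    q_pos: int
    l_pos: int
    q_symbol: str
    l_symbol: str
    match_symbol: str
    q_residue: str
    l_residue: str


# q-pos ':' l-pos ':' q-symbol l-symbol ':' match-symbol q-residue l-residue
_ENTRY = re.compile(r"(\d+):(\d+):(.)(.):(.)(.)(.)")


def parse_annotation_code(field):
    """Split a '|'-separated annotation field into AnnotatedPosition records."""
    entries = []
    for chunk in field.split("|"):
        if not chunk:
            continue  # the field starts with '|', so the first chunk is empty
        m = _ENTRY.fullmatch(chunk)
        if m is None:
            raise ValueError(f"unrecognized annotation entry: {chunk!r}")
        q_pos, l_pos, q_sym, l_sym, match, q_res, l_res = m.groups()
        entries.append(
            AnnotatedPosition(int(q_pos), int(l_pos), q_sym, l_sym, match, q_res, l_res)
        )
    return entries
```

For the example above, parse_annotation_code("|7:7:@@:=YY|14:14:##:=TT")
yields two records, at positions 7 and 14.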
142
143>>Jan. 25, 2008 fa35_03_04
144
145Map 'O' (pyrrolysine) to 'K', 'U' (seleno-cysteine) to 'C' in uascii.h
146('J' is already recognized and mapped to the average of 'I' and 'L').
147Thus, 'J' will appear in alignments, but 'O' and 'U' are transformed
148to 'K' and 'C'.
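
The effect of this mapping on a sequence can be illustrated with a short
Python sketch. The function name is hypothetical, and mapping the
lower-case letters in the same way is an assumption made for the example.

```python
# 'O' (pyrrolysine) -> 'K', 'U' (seleno-cysteine) -> 'C'; 'J' is left as is.
_RARE_RESIDUES = str.maketrans({"O": "K", "o": "k", "U": "C", "u": "c"})


def map_rare_residues(seq):
    """Return the sequence with 'O' and 'U' replaced as described above."""
    return seq.translate(_RARE_RESIDUES)
```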
149
150Because "Oo" and "Uu" are not (currently) part of aax[] ("Uu" is in
151ntx[]), apam.c/build_xascii() was extended to add characters from
152othx[] - "oth" for "other" so that they are not lost.
153
154Double check, and fix, some mappings for 'J/j' and 'Z/z'.
155
156>>Jan. 11, 2008 fa35_03_03
157
158Clean up some issues with -m 10 output; put "; mp_Algorithm", ";
159mp_Parameters" down with other -m 10 ";" lines. Also provide ";
160al_code" and "; al_code_ann" if -m 9c is specified. Remove duplicate
161">>>query" line.
162
163Add "; aln_code" and "; ann_code" to -m 10 -m 9c output. The
164alignment/annotation encoding is only produced once (in showbest()),
165and is then saved for the -m 10 alignment.
166
167>>Dec. 13, 2007 fa35_03_02m (merge of fa35_03_02 and fa35_02_08_br)
168
169Add ability to search a subset of a library using a file name and a
170list of accession/gi numbers. This version introduces a new filetype,
17110, which consists of a first line with a target filename, format, and
172accession number format-type, and optionally the accession number
173format in the database, followed by a list of accession numbers. For
174example:
175
176 </slib2/blast/swissprot.lseg 0:2 4|
177 3121763
178 51701705
179 7404340
180 74735515
181 ...
182
183This tells the program that the target database is swissprot.lseg, which is
184in FASTA (library type 0) format.
185
186The accession format comes after the ":". Currently, there are four
187accession formats, two that require ordered accessions (:1, :2), and
188two that hash the accessions (:3, :4) so they do not need to be
189ordered. The number and character after the accession format
190(e.g. "4|") indicate the offset of the beginning of the accession and
191the character that terminates the accession. Thus, in the typical
192NCBI Fasta definition line:
193
194 >gi|1170095|sp|P46419|GSTM1_DERPT Glutathione S-transferase (GST class-mu)
195
196The offset is 4 and the termination character is '|'. For databases
197distributed in FASTA format from the European Bioinformatics
198Institute, the offset depends on the name of the database, e.g.
199
200 >SW:104K_THEAN Q4U9M9 104 kDa microneme/rhoptry antigen precursor (p104).
201
202and the delimiter is ' ' (space, the default).
203
204Accession formats 1 and 3 expect strings; accession formats 2 and 4
205work with integers (e.g. gi numbers).
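
The header line and the offset/terminator rule can be sketched in Python
as follows. This is a hypothetical reader, not the FASTA implementation;
it assumes that the offset counts characters from the start of the
definition line (including the '>'), as the "offset is 4" example
suggests, and it leaves the offset unset when the header omits it.

```python
import re
from typing import NamedTuple, Optional


class SubsetHeader(NamedTuple):
    filename: str
    lib_type: int
    acc_format: int
    offset: Optional[int]  # None when the header gives no offset
    terminator: str


def parse_subset_header(line):
    """Parse a first line such as '</slib2/blast/swissprot.lseg 0:2 4|'."""
    fields = line.split()
    if len(fields) < 2 or not fields[0].startswith("<"):
        raise ValueError(f"not a subset header: {line!r}")
    lib_type, sep, acc_format = fields[1].partition(":")
    if not sep:
        raise ValueError("missing ':' before the accession format")
    offset, terminator = None, " "  # space is the default delimiter
    if len(fields) > 2:
        m = re.fullmatch(r"(\d+)(.?)", fields[2])
        if m is None:
            raise ValueError(f"bad offset/terminator: {fields[2]!r}")
        offset = int(m.group(1))
        terminator = m.group(2) or " "
    return SubsetHeader(fields[0][1:], int(lib_type), int(acc_format), offset, terminator)


def extract_accession(defline, offset, terminator=" "):
    """Return the accession that starts at `offset` and ends at `terminator`."""
    end = defline.find(terminator, offset)
    return defline[offset:] if end == -1 else defline[offset:end]
```

With the NCBI definition line above, an offset of 4 and a '|' terminator
select "1170095"; with the EBI line, an offset of 4 and the default space
terminator select "104K_THEAN".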
206
207>>Dec. 12, 2007 fa35_02_08
208
209Correct bug in ssearch35 gapped scores that only occurred in
210non-accelerated code. This bug has been present since fa35_02_06.
211Modified the Makefiles so that accelerated (ssearch35(_t)) and
212non-accelerated (ssearch35s(_t)) versions are available. Edited Makefiles to
213provide accelerated ssearch35 more specifically.
214
215Modifications to provide information about annotated residues in the
216-m9c coded output. Previously, -m 9c output added a field:
217
218 =26+9=15-2=9-1=3+1=74-2=3-3=63
219
220after the standard -m 9 output information. With the new version, an
221annotated query sequence ( -V '*#' ) adds the field:
222
223 |14:16:#<TM|24:26:#>TA|44:37:*>ST|71:66:#=TT
224
225which indicates that residue 14 in the query sequence aligns with
226residue 16 in the target (library) with annotation symbol '#', the
227alignment score is '<' less than zero, and the residues are 'T'
228(query) and 'M' (library). (The '|' is used to separate each
229annotation entry.)
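
The alignment field quoted above (=26+9=15-2=9-1=3+1=74-2=3-3=63) is a
run-length encoding and can be split into (symbol, length) pairs. The
Python sketch below is a hypothetical reader, not FASTA code; it handles
only the '=', '+' and '-' symbols seen in that example and treats them as
opaque labels rather than assigning gaps to the query or the library.

```python
import re


def parse_alignment_code(code):
    """Split an -m 9c alignment field into (symbol, run_length) pairs."""
    pairs = re.findall(r"([=+\-])(\d+)", code)
    # Rebuild the string to make sure nothing was skipped by the pattern.
    if "".join(sym + num for sym, num in pairs) != code:
        raise ValueError(f"unrecognized alignment code: {code!r}")
    return [(sym, int(num)) for sym, num in pairs]


def total_for(pairs, symbol):
    """Sum the run lengths recorded for one symbol."""
    return sum(length for sym, length in pairs if sym == symbol)
```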
230
231>>Nov. 10, 2007
232
233Parts of p2_complib.c and p2_workcomp.c, and the pvm/mpi Makefiles,
234have been updated to be consistent with name changes in the param.h
235and structs.h files.
236
237>>Nov. 20, 2007 fa35_02_08
238
239Parts of p2_complib.c and p2_workcomp.c, and the pvm/mpi Makefiles,
240have been updated to be consistent with name changes in the param.h
241and structs.h files.
242
243>>Nov. 6, 2007 fa35_02_07
244
245Correct problems with asymmetric RNA matrices in initfa.c and rna.mat.
246
247>>Oct. 18, 2007
248
249Correct problem parsing ASN1 FastaDefLines when the database is local.
250
251Recovering from a misplaced cvs commit of code that was supposed to be
252on a branch, code has been recovered from earlier versions (fa35_02_05
253because fa35_02_06 has some branch contamination).
254
255>>Oct. 4, 2007 fa35_02_06
256
257Correct error in gap penalties in dropnnw.c. Due to an unfortunate
258inconsistency, the gap parameter in FLOCAL_ALIGN (in dropgsw2.c) had a
259different meaning than that in almost all the other programs (it was
260the sum of gap_open and gap_ext). The FLOCAL_ALIGN function call was
261copied for FGLOBAL_ALIGN, even though the FGLOBAL_ALIGN function
262used the more conventional gap_open, gap_ext parameters. Thus,
263FGLOBAL_ALIGN and the subsequent do_walign() in dropnnw.c
264were wrong. dropgsw2.c:FLOCAL_ALIGN has been modified to use the
265conventional gap_open parameter, and calls to dropnnw.c:
266FGLOBAL_ALIGN() and do_walign() have been fixed.
267
268>>Sept. 20, 2007
269
270Modify the logic used when saving a seq_record *seq_p into beststr
271*bbp to ensure that if the seq_record is replaced, it is replaced at
272all the places where it is referenced. This involves adding a linked
273list into beststr (*bbp->bbp_link). When making the link (and freeing
274it up), be certain that the linked seq_p is the same as the one being
275replaced.
276
277>>Sept. 18, 2007 fa35_02_05
278
279A relatively obscure problem was found on the SGI platform when
280searching a library smaller than 500 sequences (thus requiring some
281shuffles). Two bugs were found and corrected; one involved not
282allocating aa1shuff with COMP_THR and not doing a m_file_p->ranliba()
283before re_getlib(). The second involved destroying a pointer to the
284list of seq_records when a sequence was being shuffled. The bugs were
285confirmed with Insure, and have been fixed.
286
287>>Sept. 7, 2007 fa35_02_04
288
289Revamp the offset handling code to provide better uniformity between
290query and library offsets and coordinate systems.
291
292Fix a problem with load_mmap() to load 64-bit sequence locations
293properly on machines with 32-bit integers.
294
295>>Sept. 4, 2007
296
297Modify ncbl2_mlib.c slightly to check to see whether the amino-acid
298mapping in blast databases is identical to the FASTA mapping (it
299should be). If they are identical, do not re-map the blast amino acid
300sequences (potentially a small speed up).
301
302>>Aug. 22, 2007
303
304Change ps_lav.c to lav2ps.c, and add lav2svg.c. It is now possible to
305generate a lalign35 HTML output that has both SVG (lav2svg) and PNG
306(lav2ps | gs) graphics.
307
308>>Aug. 10, 2007 CVS fa35_02_03
309
310Fix faatran.c:aacmap() bug.
311
312>>Aug. 6, 2007
313
314Extensive restructuring of pssm_asn_subs.c to parse PSSM:2 ASN.1's
315downloaded from NCBI WWW PSI-BLAST more robustly.
316
317>>July 25, 2007 CVS fa35_02_02
318
319Change default gap penalties for OPTIMA5 matrix to -20/-2 from -24/-4.
320
321>>July 24, 2007
322
323Correct bugs introduced by adding 'J' - 'J' was initially put before
324'X' and '*' in the alphabet, which led to problems because the
325one-dimensional lower-triangular pam[] matrices (abl50[], abl62[],
326etc) had entries for 'X', and '*', but not for 'J'. By placing 'J'
327after the other characters, the problem is resolved.
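
Why the position of 'J' matters can be seen from how a packed
lower-triangular matrix is indexed. The Python sketch below uses a toy
four-letter alphabet and assumes the usual row-packed layout,
index = i*(i+1)/2 + j; it illustrates the indexing problem only and is
not the FASTA code.

```python
def tri_index(i, j):
    """Index of element (i, j) in a row-packed lower-triangular matrix."""
    if j > i:
        i, j = j, i
    return i * (i + 1) // 2 + j


def lookup_table(alphabet):
    """Map each residue pair to its packed index for a given alphabet order."""
    pos = {aa: k for k, aa in enumerate(alphabet)}
    return {(a, b): tri_index(pos[a], pos[b]) for a in alphabet for b in alphabet}


OLD = "ACX*"        # toy alphabet; real alphabets are longer
APPENDED = "ACX*J"  # 'J' placed after the other characters
INSERTED = "ACJX*"  # 'J' placed before 'X' and '*'
```

Appending 'J' leaves every existing pair at its old index, so existing
matrix entries still line up; inserting it before 'X' and '*' moves
those pairs to indices that hold other entries.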
328
329Modify tatstats.c to accommodate 'J'.
330
331'*' is back in the aascii[] matrix, so that it is present by default
332(like fasta34).
333
334>>July 23, 2007
335
336Changes to support sub-sequence ranges for "library" sequences -
337necessary for fully functional prss (ssearch35) and lalign35. For all
338programs, it is now possible to specify a subset of both the query and
339the library, e.g.
340
341 lalign35 -q mchu.aa:1-74 mchu.aa:75-148
342
343Note, however, that the subset range applied to the library will be
344applied to every sequence in the library - not just the first - and
345that the same subset range is applied to each sequence. This probably
346makes sense only if the library contains a single sequence (this is
347also true for the query file).
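
The "file:start-end" notation can be sketched as below. This Python
fragment is a hypothetical illustration, not the FASTA parser; it assumes
that the range is 1-based and inclusive, as the 1-74 / 75-148 example
suggests.

```python
def parse_seq_spec(spec):
    """Split 'file:start-end' into (filename, start, end); the range is optional."""
    name, sep, rng = spec.rpartition(":")
    if sep:
        start, dash, end = rng.partition("-")
        if dash and start.isdigit() and end.isdigit():
            return name, int(start), int(end)
    return spec, None, None


def apply_range(seq, start, end):
    """Return residues start..end (1-based, inclusive) of one sequence."""
    return seq[start - 1:end]
```

Applying the same range to every sequence in a file, as described above,
would amount to calling apply_range() once per sequence with the same
start and end.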
348
349Correct bugs in the functions that produce lav output from lalign35 -m
35011 to properly report the begin and end coordinates of both sequences.
351Previously, coordinates always began with "1". Correct associated bug
352in ps_lav.c that assumed coordinates started with "1".
353
354>>June 29, 2007 CVS fa35_02_01
355
356Merge of HEAD with fasta35 branch.
357
358>>June 29, 2007 CVS fa35_01_06
359
360Add exit(0); to ps_lav.c for 0 return code.
361
362>>June 26, 2007
363
364Add amino-acid 'J' for 'I' or 'L'.
365
366Add Mueller and Vingron (2000) J. Comp. Biol. 7:761-776 VT160 matrix,
367"-s VT160", and OPTIMA_5 (Kann et al. (2000) Proteins 41:498-503).
368
369Changes to dropnnw.c documentation functions to remove #ifdef's from
370strncpy() - which apparently is a macro in some versions of gcc.
371
372>>June 7, 2007
373
374Modify initfa.c to allow ggsearch35(_t), glsearch35(_t) to use PSSMs.
375
376>>June 5, 2007 CVS fa35_01_05
377
378Modifications to p2_complib.c, p2_workcomp.c to support Intel C
379compiler. Fixed bug in p2_workcomp.c - gstring[2][MAX_STR] required -
380[MAX_SSTR] too short. mp35comp* programs now tested and working (as
381are pv35comp*, c35.work* programs).
382
383Fix problem with fasts/fastm/fastf last_tat.c with limited memory.
384
385Correct problem with lalign35.exe Makefile.nm_[fp]com.
386
387Add $(CFLAGS) to map_db to enable large file support.
388
389Address problem with PSSM's when '*' not defined (initfa.c:extend_pssm()).
390
391>>May 30, 2007 CVS fa35_01_04
392
393Complete work on ps_lav, which converts an lalign35 lav (-m 11) file
394into a postscript plot, which looks identical to the plots produced by
395plalign from fasta2. (ps_lav has been replaced by lav2ps and lav2svg).
396
397>>May 25,29, 2007
398
399Changes to defs.h, doinit.c mshowalign.c for -m 11, which produces lav
400output only for lalign35.
401
402Changes to comp_lib2.c to add m_msg.std_output, which provides all the
403standard print lines. This is turned off for -m 11 (lav) output.
404lalign35 -m 11 provides standard lav output, with the addition of
405#lalign35 -q ... .
406
407>>May 18, 2007
408
409Add m_msg.zsflag to preserve pst.zsflag when reset by global/global
410exclusion of many library sequences.
411
412>>May 9, 2007 CVS fa35_01_03
413
414Tested local database size determination with p2_complib2/p2_workcomp2.
415
416>>May 2, 2007 renamed fasta35, pv35comp, etc
417
418Separate thread buffer structures from param.h.
419
420Problems with incorrect alignments have been fixed by re-initializing the
421best_seqs and lib_buf2_list.buf2 structures after each query sequence.
422
423The labels on the alignment scores are much more informative (and more
424diverse). In the past, alignment scores looked like:
425
426>>gi|121716|sp|P10649|GSTM1_MOUSE Glutathione S-transfer (218 aa)
427 s-w opt: 1497 Z-score: 1857.5 bits: 350.8 E(): 8.3e-97
428Smith-Waterman score: 1497; 100.0% identity (100.0% similar) in 218 aa overlap (1-218:1-218)
429^^^^^^^^^^^^^^
430
431where the highlighted text was either: "Smith-Waterman" or "banded
432Smith-Waterman". In fact, scores were calculated in other ways,
433including global/local for fasts and fastf. With the addition of
434ggsearch35, glsearch35, and lalign35, there are many more ways to
435calculate alignments: "Smith-Waterman" (ssearch and protein fasta),
436"banded Smith-Waterman" (DNA fasta), "Waterman-Eggert",
437"trans. Smith-Waterman", "global/local", "trans. global/local",
438"global/global (N-W)". The last option is a global/global alignment,
439but with the affine gap penalties used in the Smith-Waterman
440algorithm.
441
442>>April 24, 2007
443
444The new program structure has been migrated to the PVM and MPI
445versions. In addition, the new global algorithms (pv35compgg,
446pv35compgl) have been moved, though the PVM/MPI versions do not
447(yet) do the appropriate size filtering.
448
449>>April 19, 2007
450
451Two new programs, ggsearch35(_t) and glsearch35(_t), are now available.
452ggsearch35(_t) calculates an alignment score that is global in the
453query and global in the library; glsearch35(_t) calculates an alignment
454that is global in the query and local in the library
455sequence. The latter program is designed for global alignments to domains.
456
457Both programs assume that scores are normally distributed. This
458appears to be an excellent approximation for ggsearch35 scores, but
459the distribution is somewhat skewed for global/local (glsearch)
460scores. ggsearch35(_t) only compares the query to library sequences
461that are between 80% and 125% of the length of the query; glsearch
462limits comparisons to library sequences that are longer than 80% of
463the query. Initial results suggest that there is relatively little
464length dependence of scores over this range (scores go down
465dramatically outside these ranges).
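
The length filters described above can be written as two small
predicates. This is a Python sketch with hypothetical function names;
whether the real programs include or exclude the exact 80% and 125%
boundaries is an assumption here. Integer arithmetic is used to avoid
floating-point rounding at the boundaries.

```python
def ggsearch_length_ok(query_len, lib_len):
    """True if lib_len is between 80% and 125% of query_len (inclusive)."""
    return 5 * lib_len >= 4 * query_len and 4 * lib_len <= 5 * query_len


def glsearch_length_ok(query_len, lib_len):
    """True if lib_len is longer than 80% of query_len."""
    return 5 * lib_len > 4 * query_len
```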
466
467A bug was found and fixed in showalign() and showbest() where the
468aa1save buffer was not preserved when some sequences needed to be
469re-read, while others were stored in the beststr.
470
471>>April 9, 2007
472
473Some of the drop*.c functions have been reconfigured to reduce the
474amount of duplicate code. For example, dropgsw.c, dropnsw.c, and
475dropnfa.c all used exactly the same code to produce global alignments
476(NW_ALIGN() and nw_align()); this code is now in wm_align.c.
477Likewise, those same files, as well as dropgw2.c, use identical code
478to produce consensus alignments (calcons(), calcons_a(), calc_id(),
479calc_code()). Rather than working with three or four copies of
480identical code, there is now one version.
481
482>>March 29, 2007
483
484At last, the lalign (SIM) algorithm has been moved from FASTA21 to
485FASTA35. Currently, only lalign35 is available. A plotting version
486will be available shortly (or perhaps a more general solution that
487produces lav output).
488
489The statistical estimates for lalign35 should be much more accurate
490than those from the earlier lalign, because lambda and K are estimated
491from shuffles.
492
493Many functions have been modified to reduce the number of times
494structures are passed as arguments, rather than pointers.
495
496>>February 23, 2007
497
498The threading strategy has been modified slightly to separate the end
499of the search phase (and a complete reading of all results buffers)
500from the termination phase. This will allow future threading of
501subsequent phases, including the Smith-Waterman alignments in
502showbest() and showalign() (though care will be required to ensure
503that the results are presented in the correct order).
504
505>>February 20, 2007 fasta-34_27_0 (released as fasta-35_1)
506
507The FASTA programs have been restructured to reduce the differences
508between the threaded and unthreaded versions (and ultimately the
509parallel versions) and to make more efficient use of modern large
510memory systems. This is the beginning of a move towards a more robust
511shuffling strategy when searching databases with modest numbers of
512related sequences.
513
514The major changes:
515
516 comp_lib.c -> comp_lib2.c - comp_lib.c will be removed
517 work_thr.c -> work_thr2.c - work_thr.c will be removed
518
519 mshowbest.c, mshowalign.c have been modified to remove aa1 as an
520 argument. They must allocate that space if they need it.
521
522 The system is set up to allocate a substantial amount of library
523 sequence memory, either to a single buffer (unthreaded) or to the
524 threaded buffer pool. For smaller databases, the library sequences
525 are read once, and then subsequently read from memory (this could be
526 extended for RANLIB(bline) as well).
527
528Soon, these changes will allow the program to re-read the beststr[]
529sequences and shuffle them to produce accurate lambda/K estimates.
530
531================================================================
532
533See readme.v34t0 for earlier changes.
534
535================================================================
536
readme.v36
1
2 $Id: readme.v36 1291 2014-08-28 18:32:58Z wrp $
3 $Revision: 55 $
4
5Version 3.6 of the FASTA programs is a significant update over version
63.5. It uses the same underlying structure as FASTA35 (specifically
7the strategies for ensuring accurate statistics), but it allows for
8multiple high-scoring alignments to be shown, rather than just one.
9This was the main functional difference between FASTA and BLAST -
10BLAST could show multiple HSPs, FASTA did not.
11
12>>Jul. 21, 2015 [fasta-36.3.8]
13[compacc2e.c, cal_cons2.c, dropfx2.c dropfz3.c, param.h]
14Fixed a major bug in the annotation code that had been added to
15accommodate overlapping domains. The original implementation was not
16thread-safe, because the array of annotations was modified during the
17scoring, but was also shared by threads. The new version keeps
18independent scoring arrays.
19
20>>Jun. 23, 2015 [released as fasta-36.3.7b]
21[dropnnw2.c]
22Fix problem where glsearch reset (ignored) the -M sequence limit.
23
24>>Jun. 18, 2015
25[dropfx.c, dropgsw.c, dropfx2.c, dropfz3.c]
26Fix problem in do_walign.c with comparison to score_thresh during
27recursive alignment.
28
29>>May. 21, 2015
30[compacc2e.c]
31Add additional checks to ensure that annotations are within the
32sequence boundaries.
33
34>>Jan. 26, 2015 [ re-released as fasta-36.3.7a]
35[compacc2e.c]
36Fix problem with domain boundary calculations for subsets of sequences.
37
38>>Jan. 21, 2015 [ released as fasta-36.3.7a]
39[calc_cons2.c, dropfx2.c, dropfy3.c]
40Fix problems with -m 9c / -m 9C alignment encodings in version
4136.3.7. Apparently, the Nov. 25, 2014 fix was not committed properly.
42In addition, make certain that the query sequence is ALWAYS the
43reference sequence, particularly in translated alignments. As a
44result, the insertion/deletion codes are now reversed for fast[xy]36
45and tfast[xy]36.
46
47>>Jan. 6, 2015
48[data/VTML_*.mat]
49Provided scoring matrix files for the VTML_10,20,40,80,120,160,200
50matrices available internally.
51
52>>Nov. 25, 2014 [ released as fasta-36.3.7]
53[cal_cons.c, dropfx2.c, dropfz3.c]
54Fix problem that prevented -m 9c and -m 8CC unless annotations were
55present.
56
57Added approved copyright notice and Apache 2.0 license to
58appropriate files.
59
60>>Nov. 19, 2014
61[mshowbest.c]
62Add alignment (CIGAR) string and annotation string to BLAST tabular
63(-m 8) alignments with -m 8C[cCdD]. To get alignment and annotation
64encoding without BLAST comments, use -m 8X[cCdD].
65
66>>Nov. 10, 2014
67[cal_cons2.c, dropfx2.c, dropfz3.c]
68Ensure that site annotations are shown when annotations are embedded
69in a sequence, not provided by a script.
70
71>>Oct. 27, 2014
72[cal_cons2.c]
73Fix a bug in the annotation alignment that put annotation symbols off
74by one (or more) in the coordinate lines. Add annotations that align
75in gaps.
76
77>>Oct. 6, 2014
78[most source files]
79The copyright notice for fasta-36.3.7 has been updated to include an
80open software license, Apache2.0, for redistribution.
81
82>>Sept. 28, 2014
83[url_subs.c]
84Substitute annot_p->s_annot_arr_p[] for annot_p->domain_arr_p[i] in
85display_domains(), encode_json_str(). Remove domain_arr_p from struct
86annot_entry. With domain_arr_p gone, n_domains is less useful, but it
87is still available, and used for checking for domain graphics.
88encode_json_domains() also now uses annot_p->n_annots, and skips over
89non-domains.
90
91>>Sept. 19, 2014
92[dropfx2.c, dropfz3.c]
93Fixes to produce correct coordinates with forward and reverse
94complement [t]fast[x,y].
95
96>>Sept. 17, 2014 [new version, fasta-36.3.7]
97[compacc2e.c, cal_cons2.c, dropfx2.c, dropfz3.c]
98The annotation domain scoring/plotting strategy has been extended to
99allow overlapping domains. To accommodate overlapping domain
100annotations, the annotation file format (e.g. gstm1_human.annot) has
101been extended to accept the form:
102
103>sp|P09388|GSTM1_HUMAN
1041 - 88 Glutathione_S-Trfase_N :1
1057 V F Mutagen: Reduces catalytic activity 100- fold.
10690 - 208 Glutathione-S-Trfase_C-like :2
107108 V Q Mutagen: Reduces catalytic activity by half.
108
109where a "-" in the second field indicates that the first and third
110fields specify the beginning and end of the domain. In previous
111versions, a '[' specified the beginning of a domain, and a ']' on a
112later line specified the end of the domain. '[' and ']' on separate
113lines required that domains not overlap (so that the '[' and ']' could
114be paired). fasta-36.3.7 will still read this format, but the "start -
115stop" format is both simpler and more flexible.
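
A reader for the "start - stop" form might look like the Python sketch
below. It is a hypothetical illustration based only on the example above,
not one of the ann_*.pl scripts; it assumes whitespace-separated fields
and does not handle the older '[' / ']' form.

```python
def parse_annot_lines(lines):
    """Group domain and site annotations under each '>' header."""
    records = {}
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        if line.startswith(">"):
            current = {"domains": [], "sites": []}
            records[line[1:].strip()] = current
            continue
        if current is None:
            raise ValueError("annotation line before any '>' header")
        fields = line.split(None, 3)
        if len(fields) < 3:
            raise ValueError(f"too few fields: {line!r}")
        rest = fields[3] if len(fields) > 3 else ""
        if fields[1] == "-":
            # start - stop description
            current["domains"].append((int(fields[0]), int(fields[2]), rest))
        else:
            # position symbol value comment
            current["sites"].append((int(fields[0]), fields[1], fields[2], rest))
    return records


EXAMPLE = [
    ">sp|P09388|GSTM1_HUMAN",
    "1 - 88 Glutathione_S-Trfase_N :1",
    "7 V F Mutagen: Reduces catalytic activity 100- fold.",
    "90 - 208 Glutathione-S-Trfase_C-like :2",
    "108 V Q Mutagen: Reduces catalytic activity by half.",
]
```

Because each domain line carries both coordinates, two domains may overlap
without any ambiguity about which start belongs to which stop.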
116
117Four new annotation scripts are available that use the new domain
118notation: ann_feats2ipr_e.pl, ann_feats_up_www2_e.pl, ann_pfam_e.pl,
119and ann_pfam_www_e.pl. All four scripts will report overlapping
120domains.
121
122Overlapping domains also allows domain annotations from different
123sources to be combined (e.g. InterPro Pfam, Panther, and Superfamily
124domain annotations), as well as domain annotations of different types,
125e.g. Uniprot domain and secondary structure annotations.
126
127>>Aug. 28, 2014 [re-released as fasta-36.3.6f]
128[ncbl2_mlib.c]
129The code used to parse blastfmtdb sequence description lines has not
130kept up with NCBI's use of ASN.1 in sequence descriptions. This code
131has been updated, and now works properly with the protein and DNA
132sequence databases.
133
134[comp_lib9.c]
135Fixed a seg-fault that occurred when an open-file error occurred.
136
137>>Aug. 22, 2014 [released as fasta-36.3.6f]
138[mshowbest.c]
139Change the alignment summary display for lalign so that the identical
140alignment score is not shown unless the '-J' option is used. Add "The best
141non-identical alignments" when '-J' is not used.
142
143[ann_pfam_www.pl] Fix bugs.
144
145[ncbl2_mlib.c]
146Modified to read NCBI ambiguity codes in
147blastdbfmt/formatdb nucleotide databases. Not extensively tested.
148
149>>Aug. 20, 2014
150[compacc2.c, cal_cons.c, dropfx.c, dropfz2.c]
151Modify sub-alignment score report to calculate bit-score by scaling the
152total alignment bit score by the ratio of sub-alignment raw score to total
153alignment raw score. This produces a bit score that is much more
154sensible than the previous strategy, which calculated a z-score from
155the sub-alignment.
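
Read as a proportional scaling (an interpretation of the description
above, not a transcription of the C code), the sub-alignment bit score
can be sketched in Python as follows. The function and parameter names
are hypothetical.

```python
def sub_alignment_bits(total_bits, sub_raw, total_raw):
    """Scale the total bit score by the sub-alignment's share of the raw score."""
    if total_raw <= 0:
        raise ValueError("total raw score must be positive")
    return total_bits * sub_raw / total_raw
```

Under this reading, a sub-alignment that contributes a quarter of the raw
score is reported with a quarter of the bit score.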
156
157>>Aug. 18, 2014
158[compacc2.c, cal_cons.c]
159Undo removal of '[]' from aa0a/aa1a (they are required to visualize
160domain boundaries in alignment). cal_cons.c now uses PSSMs when they
161are available.
162
163>>Aug. 8, 2014
164[comp_lib9.c, compacc2.c]
165Move the call to get query annotations via scripts out of compacc2.c
166and into comp_lib9.c.
167
168>>July 29,2014
169[comp_lib9.c, mshowbest.c, mshowalign2.c]
170Enable high scoring alignment display (like high scoring sequences)
171with lalign36, when -m 9 (-m 9c/d/C/D) option is provided, or with -m
1728. This allows lalign36 to provide a compact, tabular list of
173non-overlapping local alignments.
174
175>>June 30, 2014
176[pssm_asn_subs.c]
177Update the code for parsing ASN.1 binary PSSM files produced by
178psiblast+. The new code reads more of the optional fields in
179pssm_intermediate_data(). The fields are not used, but broke the
180earlier parser.
181
182>>June 11, 2014
183[cal_cons.c, initfa.c, dropfx.c, dropfz2.c]
184Extend the match/mismatch encoding provided by -m 9c and -m 9C with -m
1859d and -m 9D. The -m 9d/D options provide mismatch locations as well
186as insertion/deletion locations. For -m 9d, the list of codes has
187expanded from '=\/*' to '=\/*x'; for -m 9D, 'MDIMX'. Current
188implementation works for all programs except [t]fast[fms]. Updated
189version strings to June, 2014.
190
191>>May 28, 2014
192[mshowalign2.c, mshowbest.c, initfa.c, structs.h]
193Add the command line option -XI. Changes the calculation of percent
194identity to ensure that a single mismatch in a long sequence with >
19599.9% identity is displayed as 99.9% (0.999) identity, rather than
196100.0% identity. Without this option, a single mismatch in 10,000
197residues displays 100% identity; with the option, 99.9% identity is
198displayed (even though the identity is 99.99%).
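
The display rule can be sketched in Python as below. This is a
hypothetical illustration of the behaviour described for -XI, not the code
in mshowalign2.c; the function name and the "strict" parameter are
assumptions made for the example.

```python
def percent_identity(n_ident, n_aligned, strict=False):
    """Format percent identity to one decimal place.

    With strict=True, an alignment with at least one mismatch is never
    shown as 100.0, mimicking the behaviour described for -XI.
    """
    if n_aligned <= 0:
        raise ValueError("aligned length must be positive")
    text = f"{100.0 * n_ident / n_aligned:.1f}"
    if strict and n_ident < n_aligned and text == "100.0":
        text = "99.9"
    return text
```

For 9,999 identities in 10,000 aligned residues, the default call returns
"100.0" and the strict call returns "99.9".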
199
200[cal_consf.c]
201Fix the false error message "code begins with 0" in cal_consf.c.
202
203>>Feb. 12, 2014
204[compacc2.c]
205When providing "sequence length" to annotation scripts, add offsets.
206Also modify scripts to allow sequence lengths to increase.
207
208>>Jan. 28, 2014 (re-released as fasta-36.3.6d/Jan 2014)
209[dropfs2.c, calconsf.c, tatstats.c]
210The coordinate fix for fasts36/fastm36 (Dec 18, 2013) broke some
211fasts/fastm alignments. The alignment code has been reverted to the
212"classic" code that has been used for more than 10 years. However,
213that code always marked the first aligned residue as 1, even when the
214first part of the query did not align. The initial coordinate offset
215has been fixed; the coordinate is now the position in the first
216aligned fragment. This may be confusing, because with fasts, the
217first aligned fragment may not be the first fragment in the query
218list. The coordinate provided always provides the offset from the
219beginning of the first fragment in the alignment, not the first
220fragment in the list. This fix required changes to the definition of
221calc_astruct(), which required changes to build_ares.c, mshowalign.c,
222calc_cons.c, dropfx.c, and dropfz2.c.
223
224>>Jan. 24, 2014
225[mshowalign2.c]
226Add checks to assumption that '>gi|12345' is an NCBI library entry.
227[nmgetlib.c]
228Fix for nmgetlib.c with -DMYSQL_DB
229
230Some cleanup of old Makefiles.
231
232>>Jan. 1, 2014
233[url_subs.c]
234Fix off by one in domain coordinates in display_domains().
235
236>>Dec. 18, 2013
237[dropfs2.c, cal_consf.c]
238Fix problem with alignment display when query sequence is much longer than library sequence.
239
240>>Dec. 11, 2013
241[compacc2.c]
242Modified save_best2() to correctly exclude sequences outside
243-M n1_low-n1_high limits.
244
245>>Nov. 8, 2013 (re-released as fasta-36.3.6d)
246[ncbl2_mlib.c]
247Fix problem with src_long8_read() where int/uint64_t seems to cause
248problems with Linux intel icc. Using int/unsigned int solves the problem.
249
250>>Nov. 1, 2013
251[apam.c, ncbl2_mlib.c, map_db.c]
252[apam.c] Fix problem with query sequences and libraries that do not
253end in newline ('\n'). [ncbl2_mlib.c, map_db.c] provide grouping for
254shifts for byte extraction in src_int4/long8_read() to remove compiler
255warnings. [map_db.c] Fix problem reading sequences for indexing that
256caused crash.
257
258>>Oct. 8, 2013 (released as fasta-36.3.6d)
259[comp_lib9.c, initfa.c]
260Modify initfa.c/re_ascii() function to avoid qascii[] characters that
261had been remapped for annotations.
262
263>>Oct. 4, 2013
264[nmgetlib.c, ncbl2_mlib.c]
265Modify nmgetlib.c/re_openlib() to re-use memory mapped file arrays.
266This had been the intention for some time, but a check for libf != 0
267prevented the memory mapped arrays from being reused. libf is no
268longer checked, just mm_flag.
269
270>>Sep. 26, 2013
271[ncbl2_mlib.c]
272Fix a bug in ncbl2_mlib.c/parse_fastadl_asn() that prevented
273accessions longer than 20 characters from being parsed in description
274lines from BLAST formatted libraries.
275
276[compacc2.c]
277Fix a bug in compacc2.c/comment_var() that showed the wrong original
278sequence in qVariant changes.
279
280>>Sep. 2, 2013
281[dropfs2.c]
282Fix bug in dropfs2.c/init_work() that prevents correct tatusov
283statistics with -z >10.
284
285>>Aug. 21, 2013 (released as fasta-36.3.6c)
286[comp_lib9.c]
287Fix bug in comp_lib9.c/new_seqr_chain() that prevented memory from
288being allocated to the chain if a memory mapped database was followed
289by a non-memory mapped database.
290
291>>Aug. 9, 2013
292[scaleswn.c]
293Ensure shift to MLE_STATS if too many scores are excluded by trimming.
294
295>>July 31, 2013 (released as fasta-36.3.6b)
296[url_subs.c]
297Make JSON output for -m 6 (html) dependent on $ENV{JSON_HTML}. JSON
298output is not currently used.
299
300>>July 26, 2013
301[mshowalign2.c, scripts/lavplt_svg.pl]
302Correct offsets in -m 11 lav plots, and modify lav2plt.pl/
303lavplt_svg.pl/ lavplt_ps.pl to reflect the corrections.
304
305Move all perl scripts out of /src into /scripts.
306
307>>July 19, 2013 (released as fasta-36.3.6a)
308[compacc2.c, cal_cons.c, dropfx.c, dropfz2.c, build_ares.c]
309Provide dynamic string allocation/dyn_strcat for annotation string
310output. This fixes problems with long proteins with many domains or
311other annotations, which were too long for the fixed annotation output
312storage.
313
314Version date updated to July, 2013.
315Compiled and tested on Windows32.
316
317>>July 8, 2013
318[cal_cons.c, dropfx.c, dropfz2.c]
319Properly terminate annotations with offsets [cal_cons.c], and with
320domains beyond alignment [dropfx.c, dropfz2.c]
321
322>>July 5, 2013 (released as fasta-36.3.6)
323[comp_lib9.c, doinit.c, dropfx.c, dropfz2.c]
324Fix conflict between -m 9 and -z -1; fix annotation display using
325non-script annotations. Stop using calc_last_set in dropfx/fz2.c.
326
327>>June 24, 2013
328[scripts/ann_feats_up_www2.pl]
329Add script (ann_feats_up_www2.pl) for annotating UniProt sequences using:
330"http://www.ebi.ac.uk/Tools/dbfetch/dbfetch/uniprotkb".
331
332>>June 6, 2013
333[compacc2.c, cal_cons.c, initfa.c, dropfx.c, dropfz2.c]
334Provide the -XNS/-XXS/-XN+/-XX+ and -XND/-XXD/-XN-/-XX- options that
335specify how N:N and X:X alignments are counted for similarity and
336identity. By default, N:N (DNA) and X:X (protein) alignments are
337considered identical, but not similar (because their scores are
338typically negative to address statistical issues).
339-XNS/-XXS/-XN+/-XX+ cause N:N/X:X alignments to be counted as similar,
340even though their alignment scores are negative. Likewise,
341-XND/-XXD/-XN-/-XX- cause N:N and X:X alignments to be considered
342non-identical (and non-similar).
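
The three counting behaviours can be sketched as a small lookup. The mode names and the function below are hypothetical labels, not FASTA identifiers. The sketch assumes that the "similar" options leave N:N/X:X columns counted as identical, which the entry above does not state explicitly.

```python
# (identical, similar) contribution of one N:N or X:X column for each option family.
AMBIG_MODES = {
    "default": (True, False),         # identical, but not similar
    "similar": (True, True),          # -XNS/-XXS/-XN+/-XX+ (identity assumed unchanged)
    "non_identical": (False, False),  # -XND/-XXD/-XN-/-XX-
}


def count_columns(seq0, seq1, ambig="N", mode="default"):
    """Count identical and similar columns in an ungapped pair of aligned strings."""
    ident = sim = 0
    for a, b in zip(seq0, seq1):
        if a == ambig and b == ambig:
            is_ident, is_sim = AMBIG_MODES[mode]
        else:
            # Simplification: ordinary columns are similar only when identical.
            # A real scoring matrix would also count positive-scoring mismatches.
            is_ident = is_sim = (a == b)
        ident += is_ident
        sim += is_sim
    return ident, sim
```

For the aligned pair "ACNNT"/"ACNNA", the default mode counts 4 identities and 2 similarities; the "similar" mode raises similarities to 4, and the "non_identical" mode lowers both counts to 2.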
343
344>>May 28, 2013
345[url_subs.c]
346do_url1() has been modified to: (1) require env($REF_URL, $SRCH_URL,
347$SRCH_URL1) for these links to produce printout. (2) Link text is
348surrounded by <!-- LINK_START "lname" --> <!-- LINK_STOP -->. (3)
349do_url1() now produces <!-- JSON --> output automatically, which can
350be used to get all the information provided by earlier URL links.
351
352>>May 29, 2013
353[mshowalign2.c]
354Re-instate code in showalign() to ensure that original bbp->rst is
355used for first alignment, rather than that calculated by CHECK_SCORE
356(which is used for later sub-HSP's). The CHECK_SCORE -S alignment
357score is based on the non-S alignment, and is then re-scored with the
358low-complexity -S matrix. But the best alignment excluding
359low-complexity can have a higher score than the best all-complexity
360alignment rescored with -S.
361
362>>May 27, 2013
363[mshowalign2.c, url_subs.c]
364The plot_domain.cgi SVG code has been expanded to allow the domain
365structure of the entire query and library sequence, not just the
366aligned regions, to be displayed. Showing domains above the query or
367below the library takes an additional 18 px in each direction (36
368total); this size needs to be provided in the <object data=""
369width="660" height="54"> format string that is provided in
370$DOMAIN_PLOT_URL.
371
372Right now, the argument to $DOMAIN_PLOT_URL can get very long with
373lots of aligned domain (region), and query and library domain
374information. It would be better to provide this in some separate way.
375YAML might also be a more efficient strategy.
376
377>>May 9, 2013
378[dropfx.c, dropfz2.c, compacc2.c, url_subs.c]
379The web infrastructure for domain plots has been completed --
380plot_domain2.cgi which generates SVG for domain plots now understands
381reverse-complement cDNA fastx/y alignments, and plots coordinates
382accordingly. Testing with fastx36/fasty36 revealed some memory
383errors, which have been fixed. In addition, dropfz2.c has been
384updated to properly treat some region/alignment-boundary conditions;
385dropfx.c and dropfz2.c provide equivalent sub-alignment scores.
386
387[../scripts, ../misc]
388A new directory, ./scripts, has been created to collect the scripts
389used for sequence library expansion and domain/feature annotation.
390../scripts/README.scripts provides more information. Modify code to
391allow expansion scripts (-e) to start with '\!', like annotation
392scripts.
393
394>>Apr. 15, 2013
395(compacc2.c, cal_cons.c, dropfx.c, dropfz2.c, mshowalign2.c)
396Modifications to properly deal with sequence and coordinate offsets in
397annotation alignments. compacc2.c/get_annot_list() has been modified
398to only print/read an annotation once (the same sequence may appear
399twice with fastx/fasty). mshowalign2.c now includes <!--
400ANNOT_START/STOP --> and <!-- ALIGN_START/STOP --> in HTML mode. These
401comments are not on their own line, to save output space, so the
402remainder of the line should be captured.
403
404>>Apr. 5, 2013
405(doinit.c)
406Add the ability to specify HTML output using the -m '0H' option. This
407addresses the problem that -m "F6" does not fully specify the output
408format. In addition, -m 6 should probably explicitly set -m 0 (if it
409has not been set), rather than simply 'or'ing it, but right now we do
410not know when it is set.
411
412>>Mar. 17, 2013
413(compacc2.c, url_subs.c, plot_domain.cgi, ann_feats2l.pl)
414Modifications to url_subs.c to support SVG domain maps in HTML output.
415
416A new environment variable has been defined, DOMAIN_PLOT_URL, which can
417be used to plot (using SVG or PNG) a map of the domains on the library
418sequence. The argument to DOMAIN_PLOT_URL is the concatenated list of
419annotations provided by the -V options. All annotations (including
420sites) are passed; non-alpha-numeric characters are URL encoded.
421plot_domain.cgi is an example of a script that can be passed as
422DOMAIN_PLOT_URL. To use this script:
423
424$ENV{DOMAIN_PLOT_URL}="<object data=\"plot_domain.cgi?n0=%d&query=%s&db=%s&lib=%s&q_start=%ld&q_stop=%ld&l_start=%ld&l_stop=%ld&n1=%d&o_pgm=%s&doms=%s\" width=\"660\" height=\"72\"></object>\n";
425
426ann_feats2l.pl has been extended to allow the --neg
427(or --neg-dom) option, which puts a NODOM domain annotation
428between the domain annotations provided by the database.
429
430>>Mar. 7, 2013
431(cal_cons.c)
432Modify update code to properly begin global alignments that start with
433insertions or deletions.
434
435>>Feb. 20, 2013
436(compacc2.c)
437Annotation scripts (-V \!ann_feats.pl) were being inactivated if no
438annotations were returned, fixed.
439
440>>Feb. 2, 2013
441(comp_lib9.c)
442Prevent premature termination of query title in -m 9 mode (guarantees
443the full >accession text to first space is preserved).
444(compacc2.c)
445Provide domain information (;C=PF00016) in -m9 domain scoring.
446
447>>Jan 7-9, 2013
448(initfa.c, pssm_asn_subs.c)
449Modify pssm_asn_subs.c to properly parse binary PssmWithParameters
450produced by NCBI asntool from psiblast (blast+) text ASN.1 output.
451The text ASN.1 uses a binary encoded query sequence; get_lambda() in
452initfa.c was modified to work with a binary encoded query sequence
453(the query is used to find the p_i from rrcounts[query[i]]).
454
455Modify pssm_asn_subs.c to set query=NULL when PSSM does not include
456query sequence. Modify read_asn_pssm() to set query=aa0 if query==NULL;
457
458>>Dec. 14, 2012
459(cal_cons.c, dropfx.c, dropfz2.c)
460Enable percent identity calculation on domains. Merge
461cal_cons.c/calc_code() strategies into dropfx.c, dropfz2.c
462
463>>Dec. 6, 2012
464(comp_lib8.c, comp_lib9.c, nmgetlib.c)
465Fix code in close_lib_list() that did not properly re-initialize files
466for re-reading (not seen when library is in memory, or for single
467sequence search).
468
469>>Dec 2, 2012
470(wm_align.c, Makefiles)
471CHECK_SCORE() in wm_align.c must return different scores for local and
472global (#define GGSEARCH in wm_align.c). Requires modified Makefiles.
473
474>>Sep 24, 2012
475(doinit.c, compacc2.c, cal_cons.c)
476Fix bugs introduced with next_annot_entry() strategy for reallocating
477annot_arr[]; find a bug in cal_cons.c where i1_annot was indexing
478annot0_arr_p[]; ensure that m_msg.ann_arr_def[] is appropriately initialized.
479
480>>Sep 17, 2012
481(lav2plt.pl, lavplt_ps.pl, lavplt_svg.pl, lav_defs.pl, l_feat_dom.pl)
482Convert the lav*.c programs to perl. This simplifies adding the
483ability to script domain annotation. The format for domain
484annotations for the lav2plt.pl programs differs slightly from the
485current up_feats_dom.pl program, because it requires a beginning and
486end for each domain, e.g.:
487
488>sp|Q14247.2|SRC8_HUMAN
48980 [] 116 Cortactin 1.
490117 [] 153 Cortactin 2.
491154 [] 190 Cortactin 3.
492191 [] 227 Cortactin 4.
493228 [] 264 Cortactin 5.
494265 [] 301 Cortactin 6.
495302 [] 324 Cort. 7; trunc.
496492 [] 550 SH3.
497
498and takes a single accession from the command line, e.g.:
499"l_annot_dom.pl sp|P09488" rather than reading a file.
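
A minimal sketch of reading the begin/end domain format shown above. It is written in Python for illustration rather than Perl, the function name is hypothetical, and it assumes whitespace-separated fields with '[]' as the second field.

```python
def parse_domain_annotations(text):
    """Return (accession, [(start, end, description), ...]) from lav-style domain text."""
    accession = None
    domains = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            accession = line[1:]
            continue
        # Fields: start, the '[]' marker, end, then a free-text description.
        start, marker, end, description = line.split(None, 3)
        if marker != "[]":
            raise ValueError(f"unexpected marker {marker!r} in line: {line}")
        domains.append((int(start), int(end), description))
    return accession, domains
```

Each domain carries both a beginning and an end coordinate, which is the difference from the up_feats_dom.pl output noted above.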
500
501>>Sep 4, 2012
502(doinit.c, compacc2.c, fasta_guide.tex)
503Annotations can now be provided within a sequence (-V '%#!'), by a
504script (-V '\!up_feats.pl'), or from a file (-V '<annot.file
505q<annot.file'). Annotation files make particular sense for query
506annotations, where the user may know much more about the query than
507the database does.
508
509(doinit.c, compacc2.c, comp_lib9.c, structs.h)
510Ensure that calc_code() is called if any -m 'F9c file' requires it.
511
512>>Aug 31, 2012
513(cal_cons.c, compacc2.c, dropfx.c, dropfz2.c)
514The region score calculations have been corrected to include regions
515that overlap alignment boundaries, and regions that start in gaps.
516
517>>Aug 10, 2012
518(cal_cons.c, compacc2.c, dropfx.c, dropfz2.c)
519
520Introduce a second kind of annotation feature, the "Region" (denoted
521by '[' and ']'), that specifies a region that should be scored
522separately. These regions cannot be nested, each residue can belong
523to only one region. However, the scores in these regions can be
524calculated (perhaps percent identity and length later), and are
525displayed:
526
527>>sp|P09488|gstm1_human GLUTATHIONE S-TRANSFERASE MU 1 ( (218 aa)
528 Site:* : 23Y=23Y : MOD_RES: Phosphotyrosine (By similarity).
529 Site:* : 33Y=33Y : MOD_RES: Phosphotyrosine (By similarity).
530 Site:* : 34T=34T : MOD_RES: Phosphothreonine (By similarity).
531 Region : 3-82 : score=547; bits=146.4 : GST_N
532 Site:^ : 116Y=116Y : BINDING: Substrate.
533 Region : 104-171 : score=465; bits=125.8 : GST_C
534
535All information about the region should be provided with the '['
536(start) symbol.
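
The "no nesting, one region per residue" rule can be checked with a short sketch. The function name and the tuple layout are assumptions made for illustration; they are not part of the FASTA code.

```python
def check_regions(regions):
    """Sort (start, end, name) regions and raise ValueError if any two overlap."""
    ordered = sorted(regions)
    for start, end, name in ordered:
        if start > end:
            raise ValueError(f"region {name} has start > end")
    # After sorting by start, an overlap (or nesting) exists exactly when a
    # region starts at or before the end of the region preceding it.
    for (s0, e0, n0), (s1, e1, n1) in zip(ordered, ordered[1:]):
        if s1 <= e0:
            raise ValueError(f"regions {n0} and {n1} overlap")
    return ordered
```

The GST_N (3-82) and GST_C (104-171) regions shown above pass this check; a region spanning 50-100 alongside GST_N would be rejected.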
537
538>>Aug 1, 2012
539(dropfx.c, dropfz2.c, c_dispn.c)
540Fix some very old bugs that caused errors in coordinate displays of
541reverse-complement fastx/fasty alignments. Fix BLAST alignment
542display coordinates. Enable variant calculations for FASTY
543(dropfz2.c), and simplify calculations for dropfx.c
544
545>>Jul 29, 2012
546(doinit.c, compacc2.c, comp_lib9.c)
547Allow annotation descriptions to be delivered by annotation script,
548denoted by '=' in first line, e.g.:
549=*:phosphorylation
550=^:binding site
551=@:active site
552>gi|121735|sp|P09488.3|GSTM1_HUMAN
5537 V F Mutagen: Reduces catalytic activity 100- fold.
55423 * - MOD_RES: Phosphotyrosine (By similarity).
55533 * - MOD_RES: Phosphotyrosine (By similarity).
55634 * - MOD_RES: Phosphothreonine (By similarity).
557
558remove requirement for leading space before annotation script: e.g.:
559-V '\!up_feats_c.pl'
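
A sketch of how a consumer might read the annotation script output shown in this entry. It is illustrative only: the function name is hypothetical, and it assumes whitespace-separated fields (position, label, value, optional comment) with '=' lines of the form "=symbol:description".

```python
def parse_annotation_output(text):
    """Return (definitions, header, features) from annotation-script style text."""
    definitions = {}
    header = None
    features = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("="):
            # "=*:phosphorylation" maps the symbol '*' to its description.
            symbol, _, description = line[1:].partition(":")
            definitions[symbol] = description
        elif line.startswith(">"):
            header = line[1:]
        else:
            fields = line.split(None, 3)
            fields += [""] * (4 - len(fields))  # value and comment may be absent
            pos, label, value, comment = fields
            features.append((int(pos), label, value, comment))
    return definitions, header, features
```

The definitions dictionary gives the description for each annotation symbol, so a display layer can label a '*' site without hard-coding its meaning.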
560
561>>Jul 27, 2012
562(compacc2.c, cal_cons.c, dropfx.c)
563
564(1) Allow comments/descriptions on features other than type 'V' (variant)
565to be displayed with alignment. If a '@' SITE feature has a comment
566provided by the annotation script, the comment will be displayed in
567the alignment description, e.g.:
568
569>>sp|P28161.2|GSTM2_HUMAN Glutathione S-transf (218 aa)
570 ^ :116Y=116Y: BINDING: Substrate (By similarity).
571 @ :210S+210T: SITE: Important for substrate specificity.
572 initn: 632 init1: 632 opt: 632 Z-score: 1414.3 bits: 268.8 E(450603): 2.6e-71
573Smith-Waterman score: 945; 75.2% identity (93.6% similar) in 218 aa overlap (1-218:1-218)
574
575If no comment is provided, the annotation will only appear in the
576coordinate line. This provides a way to show annotation locations in
577BLAST output.
578
579(2) Also add code to ensure that symbols returned by annotation scripts
580are displayed on the coordinate line.
581
582(3) Add environment variable substitution to =${TMP_D}/annot.defs and
583\!${TMP_D}/up_feats_c.pl parsing.
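
The ${VAR} substitution in item (3) can be sketched as follows. The function name is hypothetical, and leaving undefined variables unchanged is an assumption; the entry does not say what the C code does in that case.

```python
import re


def expand_env(arg, env):
    """Replace each ${NAME} in arg with env[NAME]; unknown names are left as-is."""
    def substitute(match):
        return env.get(match.group(1), match.group(0))

    return re.sub(r"\$\{(\w+)\}", substitute, arg)
```

For example, with TMP_D set to /tmp/fa, the argument "=${TMP_D}/annot.defs" becomes "=/tmp/fa/annot.defs". In real use, os.environ would be passed as env.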
584
585>>Jul 24, 2012
586(uascii.h, map_db.c)
587Modify NANN, a value one more than the largest amino-acid encoding
588value, increasing it from 50 (too small for NCBIStdaa_ext_n) to 60;
589ESS changed to 59.
590
591>>Jul 20, 2012
592(mshowalign2.c, mshowbest.c, compacc2.c, comp_lib8.c)
593(transferred from fasta-36.3.5)
594(a) Fix bug in mshowalign2.c that occurred because of re-use of the
595"tmp_len" variable when adding '\n' to -L long descriptions. This
596typically occurred with -m 10. (b) Modify logic used to capture if an
597alignment had been calculated, reducing dramatically the number of
598re-alignments with multiple -m "F" output files.
599
600>>Jun 30, 2012
601(mshowbest.c)
602Ensure that opt score and E()-value are based on initial scan score,
603not later alignment score. score_delta is used to increment initial
604scan score. However, currently the E()-value of the alignment score
605is displayed in the alignment list, so the -m 9 and showalign()
606E()-values can be inconsistent.
607
608>>Jun 29, 2012 (from fasta-36.3.5c)
609(pssm_asn_subs.c)
610Add chk_asn_buf() before getting RPSPARAMS_MATRIX.
611
612>>Jun. 27, 2012 (from fasta-36.3.5c)
613(nmgetlib.c, compacc2.c)
614Fix bug that allocated unnecessary space for re-loading sequences in
615pre_load_best() (compacc2.c). Ensure that closed/NULL memory mapped
616file descriptors are not returned.
617
618>>Jun. 18, 2012
619(compacc2.c)
620Modify pre_load_best() to allocate memory for sequences to be aligned
621only if the sequences are not already in memory. (Searches against
622hg18 with repetitive queries caused very large amounts of memory to be
623allocated in duplicate.)
624
625>>Jun. 12, 2012
626(compacc2.c, doinit.c, dropfx.c, cal_consf.c)
627Implement variant scoring for fastx36. Also address problems with
628annotation location when -m markx is not set. Check function
629definitions for other drop functions where variant scoring is not yet
630implemented.
631
632>>Jun. 9, 2012
633(defs.h, doinit.c, c_dispn.c)
634Add 'M' and 'B' options to -m 0,1 to specify annotation location. For
635example, -m 0M (-m1) causes the annotation to be inserted in the
636"middle" alignment line, rather than in the coordinate line (making
637the sequence with the annotated feature ambiguous). -m 0B, -m1B
638puts the annotation in both the middle (alignment) line and the
639coordinate line.
640
641>>Jun. 8, 2012
642(doinit.c, compacc2.c, build_ares.c, mshowbest.c, mshowalign2.c,
643structs.h and others)
644
645Implement a script-driven strategy for feature annotation in
646alignments. In addition to: fasta36 -V '*%^@', which extracts the
647annotation characters from the library sequences, we can also do:
648fasta36 -V '*%^@ \!feature_script.pl' which expects the same
649annotation characters ('*%^@'), but expects them from the script
650'feature_script.pl'. This script gets the sequence description line,
651e.g: "gi|121746|sp|P09211|GSTP1_HUMAN Glutathione S-transferase P (GST
652class-pi) (GSTP1-1)", and is expected to return a tab-delimited file:
653====
654pos label value
65523 *
65633 *
65734 *
658116 ^
659173 V N
660210 V T
661====
662
663Currently, the "value" is ignored unless the label is "V", for
664variant. If 'V' annotations are present, then the alternative
665amino-acid residue values are tested in alignments; if the variant
666residue improves the score, the score is updated and the variant
667sequence is displayed, and a 'V' indicates the variant in the
668coordinate line. Currently, variant annotations can only affect
669library sequences.
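
The variant test described above can be illustrated on an ungapped alignment. This is a deliberately simplified sketch and not the FASTA algorithm: the function name, the 1-based position dictionary, and the toy scoring function are all assumptions.

```python
def score_with_variants(query, library, variants, score):
    """Score an ungapped alignment, using a variant residue where it scores higher.

    variants maps 1-based library positions to an alternative residue.
    Returns (total_score, positions_where_variant_was_used).
    """
    total = 0
    used = []
    for pos, (q, lib_res) in enumerate(zip(query, library), start=1):
        best = score(q, lib_res)
        alt = variants.get(pos)
        if alt is not None and score(q, alt) > best:
            # The variant improves the score, so it replaces the library residue.
            best = score(q, alt)
            used.append(pos)
        total += best
    return total, used
```

Positions returned in the second value correspond to the places where a 'V' would be shown in the coordinate line.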
670
671By default, annotation symbols are shown in the coordinate line for -m
6720 (default) and -m 1 (difference) alignments, sometimes overwriting
673the coordinate. Annotation symbols (from either sequence) can be shown
674in the middle alignment line by specifying -m 0M or -m 1M, or in both
675the middle alignment line and the coordinate line with -m 0B, -m 1B.
676
677>>May 5, 2012
678(dropnnw2.c)
679Enable rev-comp for ggsearch/glsearch.
680
681>>Mar. 13, 2012
682(defs.h)
683Increase default file name length to 256 from 120 to accommodate long
684file names at the EBI. Also allow much longer command line arguments
685argv_line[MAX_LSTR=4096] to be reported.
686
687>>Jan. 30, 2012
688(nmgetlib.c, altlib.h)
689Read .fastq sequence libraries (ignoring quality information) as library type '7'.
690
691>>Dec. 21, 2011 (released as fasta-36.3.5c)
692(nmgetlib.c)
693Fixed a problem reading multiple library files that produced
694segmentation faults because a data buffer was free()ed and then
695re-used.
696
697>>Nov. 17, 2011
698(initfa.c, mshowalign.c) (from fasta-36.3.5b)
699Fix problem with ppst->e_cut_r for LALIGN DNA sequences (set
700improperly to 0.001). Add ':' to s_bits: in -m 10 output. Also
701remove "score" from "lsw_s-w opt" score description (not present in
702non-LALIGN -m 10).
703
704>>Nov. 9, 2011 (from fasta-36.3.5b)
705(lavplt_svg.c, lavplt_ps.c, ncbl2_mlib.c)
706Fix buffer overrun for lav legend. Fix old problem re-opening NCBI
707blastdbfmt indirect OID files.
708
709>>Oct. 30, 2011
710(comp_lib9.c)
711Correct re-initialization bug that prevented the second query sequence
712from seeing the entire library.
713
714[from fasta-36.3.5a_svn]
715(comp_lib9.c, comp_lib8.c, ncbl2_mlib.c, nmgetlib.c)
716Address out-of-memory problems when searching memory mapped, and fix
717problem using fopen()/fread() rather than mmap for NCBI DNA databases. On
71832-bit machines, NCBI database files cannot be left open, and are now
719more aggressively closed. However, searches that produce very large
720numbers of alignments may still run out of memory on low-memory 32-bit
721machines.
722
723(compacc2.c, comp_lib8.c, comp_lib9.c, htime.c)
724Correct problems that produce negative scan times.
725
726>>Oct. 21, 2011
727(pcomp_subs2.c, work_thr2.c, mshowalign2.c, make/Makefile.mp_com2, Makefile.fcom)
728Fixes to re-enable MPI compilation and execution.
729
730>>Oct. 18, 2011
731(compacc2.c, mshowbest.c, comp_lib8.c, comp_lib9.c, initfa.c)
732Fix the logic for specifying the number of alignments displayed with
733the -b 123, -b '>123', -b '=123', -b '$' options, particularly when
734statistics are not used.
735
736>>September 21, 2011
737(initfa.c, apam.c, scaleswn.c, compacc2.c)
738Two major problems have been addressed (which also affect fasta-36.3.5
739and earlier versions): (a) specifying a -s dna.mat DNA matrix did not
740work properly; (b) too few shuffles, particularly with DNA sequences,
741were produced with pairwise comparisons. The problem with scoring
742matrix files was exacerbated by the use of fixed library alphabets.
743initfa.c has been modified to recognize that when a DNA scoring matrix
744is specified, the "-n" option is set. The shuffling problem appeared
745when, for pairwise DNA comparisons, fewer than 50 shuffles were
746reported. This occurred because the buffers used to communicate with
747threads no longer have a fixed amount of sequence buffer associated
748with them.
749
750>>August 23, 2011
751(tatstats.c, upam.h, apam.c)
752The remapping of the amino-acid encoding to NCBIstdaa broke some
753assumptions in tatstats.c, and elsewhere. In addition to the simple
754mapping problem, which changed the counts[] assignment in
755tatstats.c/calc_priors(), the fact that NCBIstdaa does not have
756contiguous real amino acids (e.g. B is at position 2), broke the
757generate_tatprobs() function because of a very old bug where priorptr
758was not always incremented.
759
760Some of the drop*.c functions have been updated to ensure that the
761space allocated for rapid pam[][] score lookup includes space for
762lower-case characters, which can be present in pseg'ed "map_db -b"
763libraries. In addition, binary format (currently all mmap'ed)
764libraries cannot include annotations, because common annotation values
765('*', '&') overlap the range of the NCBIstdaa_l (lowercase) mapping.
766
767>>August 1, 2011
768(map_db.c)
769map_db.c has been modified to provide a more efficient memory mapping
770for FASTA format files. map_db -b works like map_db, but, in addition
771to writing the .xin index file of descriptions and sequences in the
772FASTA library, it also produces a new protein_library.bsq file and
773protein_library.xin_b that contains binary encodings of the databases
774and an index for this file. The binary encoding can be memory mapped,
775so that database searches can proceed directly from memory. map_db -b
776.bsq files are very similar to the blastfmtdb files, except that they
777accommodate lower-case letters (masked) in the sequences. The
778implementation of blastfmtdb lower-case masking prevents it from being
779used in directly memory mapped files.
780
781map_db.c introduces a new memory mapped format encoding, MP2. I
782expect this format to be extended to allow not only directly memory
783mapped files, but also directly memory mapped lookup tables. A
784database can be hashed, and the hash and link files written to a
785library file, which can then be used for searches without the need to
786re-calculate the hash/link tables.
787
788(comp_lib9.c, mmgetaa.c, ncbl2_mlib.c, initfa.c, dropfz.c)
789Modifications to allow memory mapped files to be read and processed
790directly. Databases with lower-case characters can be memory mapped,
791which means that lower-case characters are coming into the alignment
792programs even when -S is not specified. As a result, all the protein
793scoring matrices must be built-out to allow lower-case
794characters. Likewise, the dropfz2.c matrices built by init_weights()
795must always be set for lower-case characters.
796
797>>July 20, 2011
798(mshowbest.c, mshowalign2.c)
799gi|12345 numbers are no longer shown in the list of best hits unless
800-m 8 or -m 9 are used. They are never shown in the alignments.
801(dropfz2.c)
802Modify MAX_UC, MAX_LC to be consistent with NCBIstdaa alphabet. Modify
803<= nsq for init_weights().
804
805>>July 16, 2011 fasta-36.3.6
806(comp_lib9.c, drop*.c, cal_cons*.c)
807The internal encoding of amino-acids has changed to NCBIstdaa
808throughout the programs. This allows the programs to use memory
809mapped NCBI blastdbfmt libraries directly, without re-encoding, but
810lower-case low-complexity mapping is not recognized. This allows
811substantial speedup in single query searching. However, to allow
812low-complexity searches, a new memory mapped format/encoding will be
813required.
814
815>>July 5, 2011 fasta-36.3.6
816(compacc2.c)
817Modify save_best2() logic for identifying scores to be used for
818statistics. An is_valid_stat is set for multi-frame results that
819specify which scores can be used for the stats[] and qstats[] arrays.
820Modifications to buf_do_work(), buf_shuf_work(), and buf_qshuf_work()
821to cause the calculation to be done in the thread, rather than the
822main program. Fix some bugs in the qshuffle code to ensure that all
823valid shuffles up to maxshuff are saved.
824
825(comp_lib5e.c, comp_lib7e.c, comp_lib8.c)
826Fix -m 9c/C core dump with -z -1.
827
828(cal_cons.c, cal_consf.c)
829Reverse 'I', 'D' with CIGAR string.
830
831>>June 26, 2011
832(comp_lib8.c, compacc2.c)
833
834Added the ability to search a library produced/specified by a script.
835Like the "-e expand_script.sh", searching against a library that
836begins with a '!', e.g. '!library_script.sh', causes the
837library_script.sh to be executed, producing a temporary file from
838stdout, which is then scanned as the database. As with expansion
839files, all the standard library syntax can be included. Thus, if
840cat_db.sh contains the command 'echo /seqdb/swissprot.lseg', the
841command:
842
843 fasta36 query.aa '\!@cat_db.sh'
844
845will cause cat_db.sh to produce a temporary file with the line
846"swissprot.lseg"; the temporary file will be interpreted as an
847indirect file of filenames; and swissprot.lseg will be searched. Note
848that in Unix systems, the '!' must be preceded by a '\' as shown
849above, so that it is not interpreted by the shell.
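
The script-library mechanism can be sketched in a few lines. This is illustrative only: the function name is hypothetical, and the sketch assumes that a leading '!' marks a script and that a following '@' marks the result as an indirect file of filenames, as in the example above.

```python
import subprocess
import tempfile


def resolve_library(spec):
    """Return (path, is_indirect) for a library spec, running '!' scripts first."""
    is_script = spec.startswith("!")
    name = spec[1:] if is_script else spec
    is_indirect = name.startswith("@")
    if is_indirect:
        name = name[1:]
    if not is_script:
        return name, is_indirect
    # Run the script and capture stdout into a temporary file that is then
    # treated as the library (or as an indirect file of filenames).
    result = subprocess.run(name, shell=True, check=True,
                            capture_output=True, text=True)
    with tempfile.NamedTemporaryFile("w", suffix=".lib", delete=False) as tmp:
        tmp.write(result.stdout)
    return tmp.name, is_indirect
```

The temporary file is not deleted here; a caller would remove it once the search has finished.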
850
851>>June 23,24 2011
852(compacc2.c, comp_lib8.c, mysql_lib.c)
853A new save_best2() function in compacc2.c has been designed to
854simplify the logic involved in saving best scores, with the goal of
855moving some of the save_best() calculations into individual threads.
856
857mysql_lib.c has a new command, close_tables, that allows a script to
858remove a table after it has been used. (It might make more sense to
859add this to the extension script option.)
860
861>>June 14, 2011 (released as fasta-36.3.5a June, 2011)
862(comp_lib7e.c, comp_lib8.c, compacc2.c)
863Fix a serious bug in next_sequence_p() that caused a portion of the library to
864be missed when long sequences filled the sequence buffer before the
865slots were filled.
866
867Make certain that thread buffers are cleared when running an expansion
868script.
869
870Return an extra '\n' before the final summary for consistency with
871earlier versions.
872
873>>June 2, 2011 (released as fasta-36.3.5 June, 2011)
874(comp_lib8.c, comp_lib5e.c, comp_lib7e.c)
875Fix a bug that indicated that linked expanded sequences were
876pre-loaded for alignment when they were not.
877
878>>May 24, 2011 (released as fasta-36.3.5)
879(comp_lib8.c, comp_lib7e.c, comp_lib5e.c, mshowalign2.c, compacc2.c,
880initfa.c, param.h, scaleswn.c)
881
882The in-memory versions of the program are allocating much more memory
883than they actually use, causing the memory limits to cut in too soon.
884Fix this by using a smaller MAXLIB_P (36000) for searches against
885protein libraries, and expanding/contracting the aa1b_size more
886sensibly. Also add lost_memK value to track lost memory. For protein
887searches, lost memory is now around 15% of allocated memory (down from
88840%).
889
890Numerous fixes to improve formatting of HTML output. Full statistics
891parameters are now available with the fdata output.
892
893Add fset_vars() to comp_lib8.c to set m_msg.max_memK properly.
894Parameters have been modified to ensure less memory waste (all buffers
895have 1000 sequences); Drop default 64-bit library memory limit to 8GB
896(-XM8G, LIB_MEMK=8G).
897
898>>May 25, 2011
899(comp_lib8.c, comp_lib7e.c, comp_lib5e.c, mshowbest.c)
900
901Add the '-b >1' option, which guarantees that at least 1 result is shown,
902but otherwise limits by E()-value. '-b =10' guarantees to show
903exactly 10 results (never more or less if the library is large
904enough), '-b 10' will show no more than 10 results, limited by -E
905e_cut, and '-b >1' will show at least 1 result, but is otherwise
906limited by -E e_cut.
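
The three forms of -b can be summarized with a small sketch. The function name is hypothetical, hits are represented only by their E()-values, and treating "E() <= e_cut" as passing is an assumption.

```python
def select_hits(evalues, b_opt, e_cut):
    """Return the E()-values that would be shown for a given -b option string."""
    ranked = sorted(evalues)
    passing = [e for e in ranked if e <= e_cut]
    if b_opt.startswith("="):
        # '=N': exactly N results (fewer only if the library is too small).
        return ranked[:int(b_opt[1:])]
    if b_opt.startswith(">"):
        # '>N': at least N results, otherwise limited by the E()-value cutoff.
        n_min = int(b_opt[1:])
        return ranked[:max(n_min, len(passing))]
    # 'N': no more than N results, and only those passing the cutoff.
    return passing[:int(b_opt)]
```

With five hits of which four pass -E, '-b 10' shows four, '-b =5' shows all five, and '-b >1' shows four; if no hit passes -E, '-b >1' still shows the single best hit.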
907
908>>May 19, 2011
909(comp_lib8.c, compacc2.c, param.h)
910comp_lib8.c is a version of comp_lib7e.c that keeps sequences in
911memory over multiple searches, but returns seqr_chains of buffers of
912sequences as they are read, rather than waiting for everything to be
913read. comp_lib8.c will automatically allocate up to 2 GB (32-bit
914machines) or 8 GB (64-bit machines) to hold the sequence database in
915a multiple query search. This number can be increased or decreased
916using the -XM# (megabytes) or -XM#G (gigabytes) option, or by setting
917the LIB_MEMK environment variable. -XM4G (LIB_MEMK=4G) makes 4GB
918available for sequence libraries; -XM-1 makes all machine memory
919available.
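
A sketch of interpreting the value given to -XM or LIB_MEMK. The function name is hypothetical, returning kilobytes is a choice made for this example, and 1 GB is taken as 1024 MB, which is an assumption.

```python
def parse_lib_mem(value):
    """Convert '512', '4G' or '-1' into a limit in kilobytes (None means no limit)."""
    value = value.strip()
    if value == "-1":
        return None  # -XM-1: make all machine memory available
    if value.upper().endswith("G"):
        return int(value[:-1]) * 1024 * 1024  # gigabytes to kilobytes
    return int(value) * 1024  # plain number: megabytes to kilobytes
```

For example, "4G" gives 4194304 kilobytes and "512" gives 524288 kilobytes.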
920
921>>May 5 2011
922(mshowbest.c)
923Fix problems that prevented "-b align_number" from properly limiting
924output with "-z -1". "-z -1" also broke multiple HSPs (since no
925threshold could be calculated); fixed.
926(dropnfa.c)
927Fix some offset arithmetic that prevented FASTA alignments from
928extending to full length in do_walign().
929
930>>May 4, 2011
931(scaleswn.c)
932Provide additional checks for division by low numbers in fit_llen2()
933and fit_llens(). The similarities between fit_llen(), fit_llens(),
934and fit_llen2() have been highlighted, and their differences
935documented. scaleswn.c now provides pstat_info, which writes all the
936values required to re-calculate zscores or E()-values from raw scores.
937
938>>May 2, 2011
939(dropnfa.c)
940Fix a problem with the traditional cgap(join)/optcut(opt) thresholds
941(no longer used by default) caused by allowing ktup=3 for proteins.
942The ktup=3 modification increased the cgap/opt thresholds by 6.
943
944(comp_lib5e.c, comp_lib7e.c, comp_lib8.c)
945Confirm identity of -m # and -m "F3 file.out". Small differences fixed.
946
947(mshowbest.c, mshowalign2.c)
948Remove gi|12345 information from -m B, -m BB blast-like output. NCBI
949Blast does not display gi numbers.
950
951>>Apr. 22, 2011
952(doinit.c, initfa.c)
953Several of the less common options have been changed to expanded
954options, changing the meaning of -X (which now specifies expanded
955options), as well as -o, -1, -B, -x, and -y. -o now provides the
956offset coordinates previously specified with -X; -B is now -XB, -o
957-Xo, -x -Xx1,-1, and -y -Xy, e.g. -Xy32.
958
959>>Apr. 19, 2011
960(comp_lib7e.c, comp_lib5e.c, doinit.c, mshowbest.c)
961Test latest version with -I interactive mode. Modifications
962required to ensure that alignments go to outfd, not stdout, when
963filename is entered. In addition, in interactive mode there can be
964more scores shown than e_cut, so bbp->repeat_thresh must be set in
965showbest() not main() program.
966
967>>Apr. 17, 2011
968(comp_lib7e.c, doinit.c, compacc.c)
969
970The FASTA programs now support multiple output files with different -m
971out_fmt types using the -m "F# out_file" or -m "F#,#,# out_file"
972option. Normally, the -m out_fmt option applies to the default output
973file, which is either stdout, or specified with -O out_file (or within
974the program in interactive mode). With -m F, an output format can be
975associated with a separate output file, which will contain a complete
976FASTA program output. Thus,
977
978 ssearch36 -m 9c -m "FBB blast.out_file" -m "F10 m10.out_file" query library
979
will send the -m 9c output to stdout, but will also send -m BB output
to blast.out_file, and -m 10 output to m10.out_file. Consistent -m
out_fmt commands can be sent to the same file by separating them with
','; e.g.:
984
 ssearch36 -m 9c -m "F9c,10 m9c_10.out_file" query library
986
987Producing alternative format alignments in different files has little
988additional computational cost.
989
990One of the shortcomings of this approach is that it affects only the
991output format, not the other options that modify the amount of output.
Thus, if you specify -E 0.001, that expect threshold will be used for
993all the output files. When a -m option can modify the output (e.g. -m
9948 sets -d 0), that modification persists only for that file.
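
The -m "F# out_file" syntax above can be sketched as follows. This is
only an illustration of how the arguments fit together: the helper
name build_search_argv() is hypothetical, and the code builds an
argument list without running any FASTA program. Each "F..." value
must reach the program as a single argument, which is why the format
codes and the file name are joined into one list element.

```python
import shlex


def build_search_argv(program, query, library, stdout_fmt, file_outputs):
    """Return an argv list for a search with several output formats.

    file_outputs maps an output file name to a list of -m format codes.
    (Hypothetical helper, not part of the FASTA package.)
    """
    argv = [program, "-m", stdout_fmt]
    for out_file, fmts in file_outputs.items():
        # One -m argument per file: "F<fmt>[,<fmt>...] <file>".
        argv += ["-m", "F" + ",".join(fmts) + " " + out_file]
    argv += [query, library]
    return argv


argv = build_search_argv(
    "ssearch36", "query", "library", "9c",
    {"blast.out_file": ["BB"], "m10.out_file": ["10"]},
)
# shlex.join() quotes the arguments that contain spaces.
print(shlex.join(argv))
```

The resulting list can be handed to subprocess.run() unchanged; no
shell quoting is needed in that case.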
995
996>>Apr. 14, 2011
997(initfa.c)
998Fix bugs in e_cut_r calculation that made it much too low for
999lalign36, and used the >1.0 divisor improperly for all programs
1000(change from e_cut_r = e_cut_r/divisor to e_cut_r = e_cut/divisor).
1001
1002>>Apr. 11, 2011
1003(comp_lib5e.c, comp_lib7e.c, compacc.c)
1004
1005The non-preload version of FASTA (comp_lib5.c) has been extended to
1006allow script expansion (comp_lib5e.c). To do this, the central score
1007calculation loops have been moved to getlib_buf_work(), just as
seqr_chain_work() was created for comp_lib7e.c. Moreover, the
function used to build the link_file names, build_link_data(), is now
in compacc.c. Differences between comp_lib5e.c and comp_lib7e.c have
1011been reduced.
1012
1013>>Apr. 5, 2011
1014(comp_lib7e.c)
1015Fix issue with closing unopened link_lib_list_p when no results are
1016found. Remove no-sequence error message for link library file.
1017
1018>>Apr. 1, 2011
1019(comp_lib7e.c)
1020The -e script.sh has been generalized to have all the capabilities of
1021a library file, in particular '@' specifies an indirect file, and
1022"script.sh #" allows a library type to be specified. Thus, the
1023script.sh invoked by "@script.sh" should not produce a fasta file; it
1024should produce a file that contains the name of a fasta file (or
1025possibly some other format). If '@' is used, the link_lib file
1026written to stdout will be prepended with '@', and treated as an
1027indirect file of file names.
1028
1029(comp_lib5.c, comp_lib7.c, comp_lib7e.c)
1030Fix problem with null refstr (no Please cite:).
1031
1032>>Mar. 31, 2011
1033(comp_lib7.c, comp_lib7e.c)
1034close_lib() was being called after each query. This is incorrect for
1035versions (like comp_lib7) that keep the entire database in memory; the
1036files must be kept open to allow ranlib() to get long descriptions
1037(alternatively, a long description could be read initially).
1038
1039(comp_lib5.c, comp_lib7.c, comp_lib7e.c)
1040Fix query offset coordinates for long queries that are broken up.
1041Allow query library to have zero-length sequences without stopping
1042(queries now stop when end-of-file is reached).
1043
1044(upam.h)
1045Fix gap penalties for BLOSUM80 matrix (change from -14, -2 to -10, -2).
1046
1047>>Mar. 29, 2011
1048(comp_lib7e.c, doinit.c)
1049
1050Add the ability to search an expanded set of sequences based on the
1051accessions from the initial search using "-e expand.sh" option.
If "-e expand.sh" is specified, the command:
1053
1054 expand.sh link_acc_file > link_lib_file
1055
1056is run by the program (fasta36, ssearch36, fastx36, etc), where
1057link_acc_file and link_lib_file are temporary file names produced by
1058the program. (The location of the temporary files can be specified
1059with the $TMP_DIR environment variable.) link_acc_file contains a
1060list of accession strings for the statistically significant hits - the
1061information in the description line to the first space, e.g.
1062
1063gi|121719|sp|P08010|GSTM2_RAT
1064gi|121746|sp|P09211|GSTP1_HUMAN
1065
1066from a search against my pir1.lseg library.
1067
1068"expand.sh" then reads that file, extracts the accession information,
1069expands the accessions to a new set of accessions, extracts the
1070expanded set of accessions from a database and writes them to
1071standard output (which is saved in the temporary link_lib_file
1072name). The sequences in expanded link_lib_file are then added to the
1073initial search, and included in the list of best scores (and
1074alignments) if their scores are statistically significant. The
1075additional sequences do not change the initial library size.
1076
1077To test the expansion capability, use an expand.sh script that simply
1078cat's a file of homologs to stdout (which will go to link_lib_file and
1079be read), e.g. expand.sh contains "cat ../seq/gst.lib".
1080
1081Building a program that can take an arbitrary list of accessions and
1082produce a library of homologs is more complicated (and slower), but
1083will allow a smaller database to be searched yet produce results
1084similar to those found from a larger database.
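
A minimal sketch of the expansion step described above, written in
Python rather than shell. Everything here is an assumption made for
illustration: the HOMOLOGS and SEQUENCES tables, the accession
"HYP00001", and the placeholder residues are hypothetical, and the
accession is assumed to be the fourth '|'-separated field as in the
two example lines. A real script would query a homolog database.

```python
import sys

# Hypothetical lookup tables, for illustration only.  A real script would
# query a database of homologs and a sequence library instead.
HOMOLOGS = {"P08010": ["P09211", "HYP00001"]}
SEQUENCES = {"HYP00001": "ACDEFGHIKLMNPQRSTVWY"}  # placeholder residues


def accession(line):
    """Return the accession from a 'gi|121719|sp|P08010|GSTM2_RAT' string."""
    fields = line.strip().split("|")
    # Assumes the gi|number|db|accession|name layout shown above.
    return fields[3] if len(fields) >= 4 else fields[0]


def expand(acc_lines, homologs=HOMOLOGS):
    """Return new accessions related to the hits, skipping the hits themselves."""
    hits = [accession(line) for line in acc_lines if line.strip()]
    seen = set(hits)
    expanded = []
    for hit in hits:
        for acc in homologs.get(hit, []):
            if acc not in seen:
                seen.add(acc)
                expanded.append(acc)
    return expanded


def write_library(accs, out=sys.stdout, sequences=SEQUENCES):
    """Write the expanded set to 'out' as FASTA records."""
    for acc in accs:
        out.write(">%s\n%s\n" % (acc, sequences[acc]))


hits = ["gi|121719|sp|P08010|GSTM2_RAT\n", "gi|121746|sp|P09211|GSTP1_HUMAN\n"]
write_library(expand(hits))
```

To use something like this with -e, the in-line "hits" list would be
replaced by the lines read from the link_acc_file named in the first
command-line argument; the records written to standard output then
become the link_lib_file.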
1085
1086>>Mar. 24, 2011 (released as fasta-36.3.4)
1087(comp_lib7.c, dropfx.c, dropfz2.c, doinit.c)
1088Fix a bug in the new help display; identify and correct various memory
1089leaks and references to uninitialized data.
1090
1091>>Mar. 15, 2011
1092(doc/fasta3x.me, fasta3x.tex)
1093The ancient, rarely updated, fasta3x.me has been replaced with
1094fasta3x.tex, with the goal of producing a more up-to-date, accurate,
1095and comprehensive document describing the capabilities of the FASTA
1096programs. In addition, fasta36.1 has been updated/corrected.
1097
1098(make/Makefile.os_x86_64)
1099Mac OS X clang 2.0, distributed with Xcode4.0, does not properly
1100optimize the smith_waterman_sse2_word() in smith_waterman_sse2.c when
1101clang -O is used to compile.
1102
1103>>Mar. 4, 2011
1104(doinit.c)
1105Histograms are now turned off by default. -H shows histograms for all
1106programs, not just the *_mpi (PCOMPLIB) programs.
1107
1108>>Feb. 27, 2011
1109(make/Makefile36m.common, Makefile.pcom_t, Makefile.pcom_s)
1110
1111The threaded programs are now the default, and the *_t versions of
programs have been removed from the Unix and Unix-like (Mac OS X)
1113distributions. Windows versions can have either threaded or
1114non-threaded versions, since the threaded windows programs require an
1115additional library. Serial versions of the programs can still be built
1116by editing the make/Makefile36m.common file, and using
1117include Makefile.pcom_s instead of include Makefile.pcom_t.
1118
1119The documentation has been edited to reflect these changes.
1120
>>Feb. 24, 2011
(comp_lib5.c, comp_lib7.c, doinit.c, initfa.c, structs.h)
The FASTA programs have a much more informative help system. If the
-DSHOW_HELP option is included in the Makefile, the following changes
occur: (1) the program is no longer interactive by default. To get
interaction, use the -I option (-I previously meant showing the
identity alignment in lalign; that option is now available with -J).
(2) fasta36 and fasta36 -h present a short help message. (3) fasta36
-help provides a complete list of options. The getopt() option
strings are now built dynamically.
1131
1132>>Feb. 18-21, 2011
1133(doinit.c)
1134Fix missing -m 9i percent identity/alignment length. Fix issues with
1135short sequence description in -m 6 (html) mode.
1136
1137>>Feb. 17, 2011
1138(comp_lib5.c, comp_lib7.c, doinit.c)
1139Implementation of -m BB which provides completely BLAST-like output
1140(not just alignments).
1141
1142Modification of the -b ### option. Previously, -b 100 guaranteed 100
1143alignments; now -b 100 limits to 100 alignments if more than 100
1144alignments have E()-values less than the -E threshold. An '=' symbol
before the number reverts to the previous behavior; e.g. -b =100
guarantees 100 alignments, regardless of E()-value (-b =100 is
equivalent to -b 100 -E 100000.0, and disables other setting of the
1148E()-value threshold).
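
The two -b behaviors can be compared with a small sketch. This is my
reading of the description above, not the code in doinit.c; the
function name alignments_shown() and its arguments are hypothetical.

```python
def alignments_shown(evalues, e_cut, b_arg):
    """Number of alignments displayed for a -b argument such as '100' or '=100'.

    evalues: E()-values of the candidate alignments.
    e_cut:   the -E threshold.
    """
    forced = b_arg.startswith("=")
    limit = int(b_arg.lstrip("="))
    if forced:
        # '=N': show N alignments regardless of E()-value
        # (bounded only by how many candidates exist).
        return min(limit, len(evalues))
    # Plain 'N': an upper limit applied to the significant alignments.
    significant = sum(1 for e in evalues if e < e_cut)
    return min(limit, significant)


scores = [1e-50, 1e-10, 0.5, 3.0, 20.0]
print(alignments_shown(scores, 10.0, "3"))
print(alignments_shown(scores, 10.0, "10"))
print(alignments_shown(scores, 10.0, "=5"))
```

With five candidates, four of them below the threshold of 10, the
plain limit never shows more than four alignments, while "=5" shows
all five.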
1149
1150>>Feb. 10, 2011
1151(doinit.c, mshowalign2.c, c_dispn.c)
1152The FASTA programs have a new alignment option, "-m B", which shows
1153alignments in BLAST format (no context, coordinates on the same line,
1154BLAST symbols for matches and mismatches.) This version does not
1155change the descriptions of the alignments, which are still FASTA like,
1156but the alignments themselves should look just like BLAST alignments.
1157Option -m BB makes output even more blast-like, showing not only the
1158alignments, but the initial set of high scoring sequences, and other
1159initial information, like BLAST+.
1160
1161>>Feb. 9, 2011 released as fasta-36.3.3
1162(dropfs2.c, initfa.c, comp_lib*.c)
1163Modify fasts36/fastm36 to allow up to ktup=3 for proteins; ktup=6 for
1164DNA (previously the max was ktup=2 for both).
1165
1166Modify version string to match release version number.
1167
1168>>Feb. 6, 2011
1169(initfa.c)
1170Fix bug that prevented fastm36 from working properly with DNA queries.
1171
1172>>Jan. 31, 2011
1173(pcomp_subs2.c, work_thr2.c)
1174Fixes to fasty36_mpi/tfastx36_mpi problem. Only fasty needs pascii[]
1175for alignments, but it wasn't being sent to workers. Fixed. The MPI
1176versions of the programs have now been tested much more thoroughly.
1177
1178>>Jan. 29, 2011
1179(comp_lib5.c, comp_lib6.c, comp_lib7.c, work_thr2.c, initfa.c,
1180param.h, dropfs2.c, scaleswt.c, dropfx.c)
1181
1182Translated DNA shuffles (tfastx36, tfasty36) now shuffle DNA as
codons. (1) Modify param.h pstruct to include shuffle_dna3,
initialized in resetp() [initfa.c]. (2) Modify buf_shuf_work() to use
ppst->zs_win and ppst->shuffle_dna3. (3) Add ppst->zs_off=0 to
scaleswt.c/process_hist(). (4) Fix some memory leaks in dropfx.c.
(5) Fix some other memory leaks in dropfs2.c.
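
Shuffling DNA as codons can be sketched as below. This is an
illustration of the idea only, not the code in buf_shuf_work(): the
function name is hypothetical, window shuffles (zs_win) are ignored,
and leaving any trailing one or two bases in place is my assumption.

```python
import random


def shuffle_codons(dna, rng=random):
    """Return dna with its codons (non-overlapping triplets) in random order."""
    n_full = len(dna) - len(dna) % 3
    codons = [dna[i:i + 3] for i in range(0, n_full, 3)]
    tail = dna[n_full:]  # 0, 1 or 2 bases that do not form a codon
    rng.shuffle(codons)
    return "".join(codons) + tail


seq = "ATGGCCAAATTTGGGTA"
shuffled = shuffle_codons(seq, random.Random(1))
print(shuffled)
```

Because whole triplets are moved, the codon composition of the
sequence is preserved, which a base-by-base shuffle would destroy.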
1188
1189>>Jan. 28, 2011
1190(initfa.c, scaleswn.c, mshowalign2.c)
1191Address crashes that occurred when novel scoring matrices and gap
1192penalties were specified, particularly for DNA. Fix memory problem
1193with long (-L) sequence descriptions.
1194
1195>>Jan. 23, 2011
1196(comp_lib7.c)
1197comp_lib7.c uses a more efficient strategy for reading chunks of
1198sequences that ensures that sequence data is contiguous for *_mpi
1199programs. comp_lib7.c replaces comp_lib6.c, which will be removed.
1200
1201>>Jan. 22, 2011
1202(many files)
1203Replace "mw.h" with "best_stats.h", a much more informative name.
1204
1205(drop*.c, p_mw.h, w_mw.h)
1206Remove p_mw.h, w_mw.h from code base and update_params() from
1207drop*.c. These files are left over from the old p2_complib.c parallel
1208programs.
1209
1210>>Jan. 21, 2011 released as fasta-36.3.2
1211(comp_lib5.c, comp_lib6.c, pcomp_subs2.c)
1212Fixes for MPI version of programs. Earlier versions did not handle
1213DNA/translated DNA comparisons properly, because duplicated sequences
1214(forward/reverse strand) were not handled properly. The current code
1215produces the correct scores and alignments, but probably is much less
1216efficient than it should be.
1217
1218>>Jan. 11, 2011
1219(initfa.c, scaleswn.c)
1220Re-enable DNALIB_LC (read lower-case DNA sequences as lower case).
1221
1222Reset ktup to default after change for short query in multi-query
1223searches.
1224
1225Address multiple issues associated with variable scoring matrices,
1226i.e. -s '?BP62'. Introduce pst->pam_name for the actual scoring
1227matrix, to distinguish it from pst->pam_file, which can correspond to
1228the std_pam->abbrev, for values like BP62 (which encodes both a matrix
1229and a specific set of gap penalties). Ensure that the new scoring
1230matrix is initialized and extended correctly. Fix some issues with
1231scoring matrix names in scaleswn.c
1232
>>Jan. 5, 2011
1234(dropnnw2.c, dropgsw2.h, global_sse2.c,h, glocal_sse2.c,h)
1235Include SSE2 optimization for global/global and global/local alignments
1236provided by Michael Farrar. Global and glocal alignments are now 20X
1237faster.
1238
1239>>Jan. 5, 2011 re-released as fasta-36.3.1
1240(initfa.c, last_tat.c)
1241Fix bug resetting pst.e_cut_r for DNA sequences. Modify last_tat.c
1242code to use pre-loaded sequence if available. Remove last_tat.c
1243PCOMPLIB code.
1244
1245>>Jan. 3, 2011 released as fasta-36.3.1
1246(comp_lib5.c, comp_lib6.c)
1247Add >>><<<, >>>/// to -m 9,10 output for separating multiple query
1248searches. Also clean up extra >>>query line before alignments when no
1249alignments are shown.
1250
1251>>Dec. 16, 2010
1252(dropgsw2.c, dropnnw2.c, dropnsw.c, comp_lib5.c, comp_lib6.c)
1253Fix bug that caused ssearch to not invert coordinates for
1254reverse-complement DNA alignments (I never imagined using ssearch for
1255DNA) in dropgsw2.c, dropnnw2.c, and dropnsw.c. Add SEQ_PAD to aa0[1]
1256(rev-comp copy) in comp_lib5.c, comp_lib6.c.
1257
1258>>Dec. 14, 2010
1259Modify CIGAR strings for frameshifts, including 1F and 1R for forward
1260and reverse frameshifts. Extensive documentation updates.
1261doc/fasta36.1 is the most comprehensive and accurate description of
1262FASTA options.
1263
1264>>Dec. 1, 2010
1265(drop*.c, comp_lib5.c, comp_lib6.c)
1266Correct problems with copying for recursive sub-alignments. Correct
1267bug in adler32_crc calculation that suggested a problem with continued
1268library sequences that did not exist.
1269
1270(initfa.c, defs.h)
1271Use MAXLIB, rather than MAXLIB+MAXTST for comp_lib6.c, which
1272pre-allocates the sequence database. Increase MAXLIB.
1273
1274>>Nov. 24, 2010
1275(drop*.c, drop_func.h)
1276Modify drop*.c functions that do recursive sub-alignments to avoid
1277modifying the aa1[] sequence array, which conceivably could be in use
1278by other threads. do_walign() now has const *aa0 AND const *aa1. To
1279prevent modification of aa1, sub-regions of aa1 are now copied into
1280newly allocated arrays.
1281
1282>>Nov. 20, 2010
1283(cal_cons.c, mshowbest.c, mshowalign2.c, doinit.c)
1284The -m 9C option displays an alignment code in CIGAR format. (-m 9c
1285shows the older alignment encoding.)
1286
1287>>Nov. 16, 2010 (beginning of fasta-36.3.*, verstr 36.07)
1288(initfa.c, apam.c, upam.h, param.h)
1289
1290Provide the ability to adjust the scoring matrix based on the length
1291of the query sequence for alignments using a protein alphabet (this
1292could certainly be extended to DNA as well). By including a '?'
1293before the scoring matrix, e.g. -s '?BP62', a shallower matrix will be
1294chosen if the entropy of the selected matrix (i.e. bit score per
aligned position) times the length of the protein query is
<=DEF_MIN_BITS (defs.h; currently 40, although this value should be
set based on the library size). The FASTA programs include BLOSUM50 (0.49
1298bits/pos) and BLOSUM62 (0.58 bits/pos) but can range to MD10 (3.44
1299bits/position). The variable scoring matrix option searches down the
1300list of scoring matrices to find one with information content high
1301enough to produce a 40 bit alignment score. This option is included
1302primarily for metagenomics scans, which can include relatively short
1303DNA reads, and correspondingly short protein translations.
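
The matrix selection rule can be sketched as follows, using only the
three bits/position values quoted above. The real list in the
programs contains more matrices, and names like BP62 also encode gap
penalties, which this sketch ignores; pick_matrix() is a hypothetical
name.

```python
# (name, bits per aligned position), ordered from deep to shallow.
# Only the three values quoted in the text; the real list is longer.
MATRICES = [("BLOSUM50", 0.49), ("BLOSUM62", 0.58), ("MD10", 3.44)]
DEF_MIN_BITS = 40


def pick_matrix(start_name, query_len, matrices=MATRICES, min_bits=DEF_MIN_BITS):
    """Search down the list from start_name for a matrix that can
    produce more than min_bits over the full query length."""
    names = [name for name, _ in matrices]
    for name, bits_per_pos in matrices[names.index(start_name):]:
        if bits_per_pos * query_len > min_bits:
            return name
    # No matrix is informative enough; fall back to the shallowest one.
    return matrices[-1][0]


for length in (200, 75, 30):
    print(length, pick_matrix("BLOSUM50", length))
```

A 200-residue query keeps BLOSUM50 (98 bits), a 75-residue query
moves to BLOSUM62 (36.75 bits is not enough, 43.5 is), and a
30-residue query needs MD10.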
1304
1305Also correct the short-query modification to ktup, so that it works
1306properly with translated FASTX/FASTY searches (ktup is set to 1 when
1307the query_length/3 <= 20).
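
As a sketch of the rule in parentheses (the function name is
hypothetical, and the use of integer division is my assumption about
the C arithmetic):

```python
def translated_ktup(ktup, dna_query_len):
    """ktup for a translated (FASTX/FASTY) search with a DNA query."""
    # A DNA query of n nucleotides translates to about n/3 residues.
    # Integer division mirrors C int arithmetic (an assumption).
    if dna_query_len // 3 <= 20:
        return 1
    return ktup


print(translated_ktup(2, 60), translated_ktup(2, 300))
```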
1308
1309(dropnfa.c, dropfx.c, dropfz2.c)
1310Shuffled sequence alignment scores are calculated identically to
1311library alignment scores. Previously, optimized scores were calculated
1312for all shuffled sequences for FASTA type alignments, even though
1313typically 20 - 40% of library sequences were optimized. Now the two
1314sampling strategies are consistent, though this may cause problems
1315when only a small fraction of sequences are optimized.
1316
1317Small changes to provide consistent dropnfa.c, dropfx.c, dropfz2.c
1318parameter display, and fix display with -m 10.
1319
1320>>Nov. 15, 2010
1321(initfa.c)
1322Enable statistical thresholds by default (previously, they were
1323enabled with -c -1 or -c 0.01 or anything < 1.0). The "classical"
1324join/opt threshold behavior can be restored with -c O (upper case
1325letter O), or by providing an optimization threshold >
13261.0. Statistical thresholds dramatically speed up searches (typically
13272-fold), and provide more accurate statistical estimates. The old
join/optimization thresholds were optimized for BLOSUM50, and other
13291/3-bit scaled scoring matrices, and did not work well with BLOSUM62.
1330Statistical thresholds have been tested extensively, particularly with
1331-z 21, and produce much more reliable statistical estimates.
1332
1333>>Oct. 14, 2010
1334(Makefile.fcom, cal_cons.c)
1335Edits to re-enable compilation and successful execution of
tfasta36(_t). tfasta36 has been superseded by tfastx36(_t), which is
1337faster, and treats frameshifts as a different type of gap.
1338
1339>>Oct. 13, 2010
1340(mshowbest.c)
Make it more difficult to request more descriptions/scores than are
available.
1343
1344>>Sep. 30, 2010 (released as fasta-36.2.7)
1345(comp_lib5.c, comp_lib6.c, dropnfa.c, dropfx.c, dropfz2.c)
1346Fix bugs in DEBUG versions with adler32_crc calculations on
1347overlapping sequences. Add more informative error messages when
1348debugging. Fix a problem with hist2.hist_a != NULL with some
1349compilers. Fix formats for some debugging error messages in dropnfa.c,
1350dropfx.c, and dropfz2.c.
1351
1352Also fix repeat_threshold calculation for very short sequences, to
1353guarantee that all matches as good as the best match with the sequence
1354are found. Fix some problems that prevented FASTA from finding short
1355repeats with short queries.
1356
1357This version of the FASTA36 package offers an alternate main program
1358file, comp_lib6.c, which reads the entire database into memory before
1359doing the search. Using comp_lib6.c can dramatically speed up
1360searches with multiple queries (there is no advantage with single
1361query sequences) on large multi-core computers, as each search is done
1362without re-reading the database. On a 48-core processor, we see
1363speedups greater than 40X with ssearch36_t and fastx36_t. To enable
1364comp_lib6.c, edit the make/Makefile36m.common file to comment out
lines referring to comp_lib5.c and un-comment lines referring to
1366comp_lib6.c.
1367
1368>>Sep. 29, 2010
1369(comp_lib5.c, comp_lib6.c, mshowbest.c)
1370Added -m 8C option, which mimics BLAST+ tabular with comment lines
1371format.
1372
1373>>Sep. 17, 2010
1374(dropfx.c)
1375
Fix a bug in dropfx.c/do_walign() that modified library sequences.
(This only caused a problem with comp_lib6.c, which reads the entire
database into memory and re-uses sequence buffers.) Check sequence
consistency with adler32 CRC calculation.
1380
1381>>Sep. 15, 2010
1382(mshowbest.c, mshowalign2.c)
1383Change the output format slightly. E2() expect values (-z 21+) no
1384longer contain the library size (which is always the same as the
1385E(library_size) value), and the -m 9 +- line no longer contains the
1386frame information, since it is redundant. (The redundant rev-comp
1387remains on the >-- HSP lines.)
1388
1389>>Sep. 14, 2010
1390(comp_lib5.c, mshowbest.c, drop*.c, cal_cons[f].c, etc.)
1391Implement BLAST -m 8 tabular output.
1392
1393>>Sep. 9, 2010
1394
1395(compacc.c) Fix a bug in pre_load_best() that disabled
1396-L long sequence descriptions.
1397
1398(doinit.c) Fix a bug that prevented non-overlapping alignments from
1399being displayed when the -E threshold was changed. Before -E 0.001
1400would disable additional alignments. Now, -E "0.001 0" is required to
1401disable the additional alignments.
1402
1403(drop*.c) The display of search parameters has changed to ensure that
1404gap penalties are displayed on the same line as the scoring
1405matrix. Previously, the FASTA "Parameters:" section looked like:
1406
1407Parameters: BL50 matrix (15:-5)xS ktup: 2
1408 join: 42 (0.0944), opt: 30 (0.601), open/ext: -10/-2, width: 16
1409 Scan time: 0.450
1410
1411With fasta-36.2.7 (and later), the Parameters: section is:
1412
1413Parameters: BL50 matrix (15:-5), open/ext: -10/-2
1414 ktup: 2, join: 42 (0.102), opt: 30 (0.574), width: 16
1415
1416The [T]FAST[X/Y] Parameters: section includes the frameshift/substitution penalties (tfasty36):
1417
1418Parameters: BL50 matrix (15:-5) open/ext: -12/ -2 shift: -20, subs: -24
1419 ktup: 2, E-join: 0.5 (0.224), E-opt: 0.1 (0.0536), width: 16
1420
1421>>Aug. 3, 2010 (released as fasta-36.2.6)
1422(scaleswn.c)
1423
1424Modifications to calc_thresh(), proc_hist_ml(), to better accommodate
1425search strategies (fast?? with statistical thresholds) that provide
1426complete scores only for a high-scoring fraction of sequences. For
1427some query sequences, the E()-values from the database were sometimes
1428much "worse" than E2()-values, an observation that is
1429counter-intuitive (if parameters are estimated against shuffled
1430related sequences, the E()-values should get worse, not better). For
1431some queries, the result was very dramatic (E() < 1E-80, E2() <
14321E-150). This error appears to occur because the z-trim or mle_cen
1433thresholds are including many related sequences. -z 2 was modified to
1434censor more sequences when only a subset are scored, and -z 1 was
1435modified to adjust z-trim more carefully. As a result, z-trim was
reduced, excluding more sequences. If too many sequences are excluded,
1437then regression statistics do not work, and the program fails over to
1438Altschul-Gish statistics.
1439
1440-z 21+ modified so that MLE statistics are used for shuffle E2()
1441values if Altschul-Gish statistics are used for the library
1442E()-values.
1443
1444>>July 30, 2010
1445(comp_lib5.c, pcomp_subs2.c)
1446
1447Fix bug in buf_align_seq() that allowed buffer over-runs with long DNA
1448sequences with MPI. Checks on buffer over-runs are now included in
1449pcomp_subs2.c/put_rbuf(),get_wbuf(). Aug. 1, 2010, fixed similar bug
1450in buf_shuf_seq(). -z 21 now works with long DNA sequences.
1451
1452>>July 28, 2010
1453(mshowalign2.c)
1454Fix lalign36/showalign() to show best sub-optimal E()-value, not
1455bptr[0] E()-value (often identical).
1456
1457>>July 19, 2010 (released as fasta-36.2.5)
1458(wm_align.c, dropfx.c,dropfz2.c)
1459Fix some off-by-one boundary calculations to ensure that every query
1460that can fit into a library is aligned correctly.
1461
1462>>May 18, 2010
1463Implement comp_lib5.c, which simplifies the structure of
1464comp_lib4.c by moving some calculations into functions.
1465
1466>>May 10, 2010
1467Fix problem setting nshow with small library in interactive mode.
1468
1469>>May 5, 2010 fasta-36.2.3
Fix bug that prevented shuffled scores from being used properly for
small databases (prss capability was lost).
1472
1473>>May 2, 2010 fasta-36.2.2
1474Fix problem with tat_score values from fasts and fastm. fasta35 did
1475not re-calculate the z-score after last_stats(). fasta36 does, so it
1476must ensure that the e-value (sometimes p-value) is used correctly.
1477
1478>>Apr. 29, 2010
More extensive testing of the MPI-PCOMPLIB programs revealed some
problems sending sequences when two (or more) frames for the same
sequence were used. This problem has been addressed, and large scale
testing of fastx36_mpi (with 100K sequence queries in a run) works.
1483
1484>>Apr. 16,19, 2010
1485(pcomp_subs2.c, comp_lib4.c, work_thr2.c)
1486The MPI-PCOMPLIB parallel version of the FASTA36 programs is
1487working. This PCOMPLIB version takes a very different approach from
the older PVM/MPI parallel programs (p2_complib2.c/p2_workcomp2.c):
it works virtually identically to the threaded programs, sharing the
same work_thr2.c code and get_rbuf/put_rbuf() (manager) and
get_wbuf/put_wbuf() (worker/thread) functions. As a result, in this
1492initial version, the database is NOT distributed to the nodes. During
1493multiple searches, the library is re-read each time. However, load is
1494distributed to workers exactly the way it would be for the threaded
1495system, so the workload should scale.
1496
1497To distinguish them from the earlier mp35compsw, mp35compfa, etc, the
new versions are ssearch36_mpi, fasta36_mpi, etc.
1499
The programs work with multiple queries, produce multiple
sub-alignments, and work with -m 9c encodings.
1502
1503>>Apr. 7, 2010
1504(various Makefiles, comp_lib4.c, pcomp_subs2.c, thr_bufs2.h,
1505thr_buf_structs.h)
1506
The MPI version of the threaded programs, ssearch36_mp, now compiles.
1508pcomp_subs2.c replaces pthr_subs2.c, and thr_bufs.h ->
1509thr_buf_structs.h, thr.h -> thr_bufs2.h, and pcomp_bufs2.h has been
1510added as the equivalent of thr_bufs2.h for PCOMPLIB.
1511
1512>>Apr. 2, 2010
1513(comp_lib4.c, work_thr2.c, compacc.c)
1514Implement init_aa0(), which isolates code that calls init_work and
1515sets up aa0s, aa1s, f_str[1] (reverse complement) and qf_str so that
1516the same code is used by the serial, threaded, and (future) PCOMP
1517versions.
1518
1519(work_thr2.c)
1520work_thr2.c now contains code for either threaded or PCOMPLIB
1521processes. Threaded processes get stuff from work_info; PCOMPLIB
1522processes get the same information via messages sent from init_thr()
1523called by main().
1524
1525>>Mar. 30, 2010
(comp_lib4.c, work_thr2.c, thr_bufs.c +pcomp_subs2.c)

The data buffers used to communicate between workers and threads
1529have been restructured to separate the old buf2_str, which contained
1530sequence, score results, and alignment results, into three buffers,
1531buf2_data_s, buf2_res_s, and buf2_ares_s, separating sequence data
1532from scores and alignments. This was done to simplify communication
1533in the MPI/PVM environment. Workers should be able to return results
1534directly into the appropriate buffer.
1535
1536>>Mar. 25, 2010 fasta-36.2.1
1537
1538(dropfx.c, dropfz2.c)
1539Found/removed two "static" declarations in small_global that caused problems
1540with [t]fastx/y with threaded alignments.
1541
1542>>Mar. 24, 2010 (now version 36.06 with threaded alignments)
1543(dropnfa.c)
1544The DNA band aligner in dropnfa.c was not thread safe. This has been
1545fixed.
1546
1547>>Mar. 23, 2010
1548Code for pre-loading/threaded-aligning sequences has been
1549significantly cleaned up. Checks are made before RANLIB() and
1550re_getlib() in showbest() and showalign() that should be consistent
1551with annotations AND functions that cannot encode alignments.
1552
1553Add mshowalign2.c (which does not do PCOMPLIB) to provide threaded
1554alignments. build_ares_code() and buf_do_align() modified to ignore
1555MX_M9SUMM so that alignments are produced whenever demanded (still
1556does not do alignment if a_res is available).
1557
1558>>Mar. 22, 2010
1559(comp_lib4.c, work_thr2.c, thr_bufs.h)
1560
1561comp_lib4.c has been modified to thread the alignment encoding
1562(build_ares) for -m 9c. If m_msg.quiet and alignments are required for
1563showbest(), then the program identifies the number of alignments
1564required, reads the sequences (and annotations) into a buffer, and
1565sends them to the threads to be encoded. Then, when showbest() is
1566called, bbp->have_ares has been set, and the alignments are not
1567re-calculated. This should be extended to thread actual alignment
1568production, and additional work is required to clean-up the sequence
1569and bline(description) buffers before a second search.
1570
1571>>Mar. 17, 2010
1572(comp_lib4.c, dropnfa,fx,fz2.c)
1573Modifications to provide more sensible E2() statistical estimates with
1574threshold-heuristic comparison functions and -z 21. Also fixed bug
1575that caused the wrong zs_off to be used with -z 21. dropnfa,fx,fz2.c
1576now optimize all scores when shuff_flg is set.
1577
1578>>Mar. 16, 2010
1579(comp_lib4.c, scaleswn.c, drop*.c)
1580
1581A new, relatively consistent, statistical estimation strategy has been
1582introduced for the heuristic programs that optimize only a fraction of
1583scores (fasta36, [t]fast[xy]36). Statistics-based heuristic
1584thresholds can increase search speed 2 - 4-fold by doing band
1585optimization on only a small fraction of library sequences (with the
1586-c -1 option, about 10% of alignments are band-optimized, compared
1587with more than 50% with the classic thresholds). However, optimizing
1588only a small part of the library produces two classes of scores,
1589optimized (10% or less) and non-optimized, with different statistical
1590properties. fasta36 addresses this problem by calculating statistical
1591estimates only for the optimized scores, and then correcting the
1592significance of the score by accounting for the frequency of
1593optimization. For example, sampling only 5% of scores increases the
1594z-value (std. deviation above the mean) by -logE(0.05)*sqrt(6)/Pi =
15952.34 which offsets the z-score by 23.4. This effect is only seen when
1596the -c option is used to specify statistical thresholds, and is most
1597apparent when looking at the histogram, which will be offset by the
1598appropriate z-score.
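
The arithmetic in the example can be checked with a few lines. The
function name is hypothetical, and the factor of 10 between the
z-value (in standard deviations) and the z-score is inferred from the
two numbers quoted above.

```python
import math


def z_value_shift(fraction_optimized):
    """Shift in z-value (std. deviations above the mean) when only
    fraction_optimized of the library scores are sampled."""
    return -math.log(fraction_optimized) * math.sqrt(6) / math.pi


shift = z_value_shift(0.05)
# z-scores are taken to be 10 units per standard deviation (assumption).
print("%.2f %.1f" % (shift, 10 * shift))  # prints "2.34 23.4"
```

Sampling a smaller fraction gives a larger offset; sampling every
score (fraction 1.0) gives no offset at all.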
1599
1600This strategy appears to produce more accurate statistics in general,
1601but can produce less accurate statistics for the heuristic programs when
1602the -z 21 option is used.
1603
1604>>Mar. 3, 2010
1605
1606(comp_lib4.c)
Fix the new stats[] sampling strategy to sample >60K sequences more
uniformly. The old code massively over-sampled later sequences,
1609because of several bugs. The new code works as expected. The first
161060K sequences are represented about 30% more than the rest, but after
161160K, sequences are sampled moderately uniformly. The older
1612SAMP_STATS_MORE is uniform across all the scores.
1613
1614(build_ares.c)
1615Move code to produce chains of alignments (a_res) produced by
1616do_walign, followed by subsequent calls to calc_id, calc_code, into a
1617new function, build_ares_code(), which is shared by the
1618serial/threaded and parallel (p2_workcomp.c) programs. This is a
1619first step towards having the parallel programs produce multiple HSP
1620alignments.
1621
1622>>Feb. 27, 2010
1623
1624(lib_sel.c)
1625Fix problem with new chained library access that prevented more than
1626two files from being searched. Also, library name string has been
1627lengthened to allow a list of libraries to be displayed.
1628
1629>>Feb. 26, 2010
1630
1631Parallel programs have been tested in both PVM and MPI versions, and
1632some additional bugs have been fixed. Currently, the PVM/MPI versions
1633are fully functional, but only with FASTA35 capabilities. The new
1634multiple HSP alignments and best-shuffle E2() scores are not yet
1635available.
1636
1637>>Feb. 24, 2010
1638
Fix some leaks, largely due to more complex alignment data structures
1640for multiple alignments. Currently, all the major leaks are in data
1641structures allocated in main(), and which I don't bother to
1642de-allocate (mostly library buffer memory).
1643
1644Change zsflag > 10 to zsflag >= 10 && zsflag < 20 in three places.
1645Too many shuffles were being done with zsflag==21.
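
The effect of the changed test can be seen by evaluating both forms
for a few -z values. This only restates the two conditions quoted
above; what the surrounding code does when the test succeeds is not
shown, and the function names are hypothetical.

```python
def old_test(zsflag):
    """The original condition: zsflag > 10."""
    return zsflag > 10


def new_test(zsflag):
    """The replacement: zsflag >= 10 && zsflag < 20."""
    return 10 <= zsflag < 20


for z in (1, 10, 11, 21):
    print(z, old_test(z), new_test(z))
```

The old test also succeeded for -z 21 and above, which is where the
extra shuffles came from; the new test is limited to the range 10
through 19, and now includes 10 itself.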
1646
1647>>Feb. 22, 2010
1648
1649Begin conversion of p2_complib2.c/p2_workcomp.c. Very old code to
1650allocate aln_d_base removed from v35 and v36. No code for best list
1651shuffle, or multiple high-scoring alignments. However, the code now
1652works properly with statistical thresholds. (Changes made to
1653p2_complib2.c, p2_workcomp.c to update pst struct after last_param.()).
1654
1655>>Feb. 19, 2010 fasta-36x6
1656
1657Fix issues with -z 26 statistics. Add description of E2() statistics.
1658
1659Added option to specify statistics routine for best-shuffled
1660statistics independently of library statistics by specifying a second
1661-z option. Thus, -z "21 2" uses regression scaled statistics for the
1662library estimate, and MLE statistics for the best-shuffled estimates.
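
As a rough illustration of how the -z values in these entries fit
together, here is a small Python sketch. It is not the FASTA option
parser: decode_z() and its field names are hypothetical, the
"last digit selects the method" rule is inferred from the -z "21 2"
example above, and the default for a missing second value is an
assumption.

```python
def decode_z(arg):
    """Split a -z argument such as "21 2" into its parts (illustrative only)."""
    fields = arg.split()
    zsflag = int(fields[0])

    lib_method = zsflag % 10             # 21 -> 1, 11 -> 1, 2 -> 2 (inferred)
    shuffle_library = 10 <= zsflag < 20  # range test from the Feb. 24, 2010 entry
    best_shuffle = zsflag > 20           # E2() from shuffled high-scoring sequences

    e2_method = None
    if best_shuffle:
        # Assumption: without a second value, E2() reuses the library method.
        e2_method = int(fields[1]) if len(fields) > 1 else lib_method

    return {
        "lib_method": lib_method,
        "shuffle_library": shuffle_library,
        "best_shuffle": best_shuffle,
        "e2_method": e2_method,
    }
```

Under these assumptions, decode_z("21 2") reports method 1 for the
library estimate and method 2 for the best-shuffled estimate.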
1663
1664>>Feb. 17, 2010 fasta-36x5
1665
1666Some of the simplifications dealing with threads in comp_lib4.c failed
1667on some compilers and architectures. The code for terminating threads
1668has been modified to allow sequence buffers with zero entries, to
1669simplify the empty_buffer logic. There is now an explicit option to
1670terminate threads by setting lib_bhead_p->stop_thread. However, this
1671flag is never set, as rbuf_done() stops the threads instead.
1672
1673Also fix problem with stats_idx being associated with wrong buf2_p in
1674two frame searches.
1675
1676>>Feb. 15, 2010 fasta-36x4
1677
1678fasta36 can now calculate both "search" (E()) and "shuffled" (E2())
1679E()-values and display them in the best scores and
1680alignments. If the -z option is greater than 20, then two E()-values are
1681calculated, one from the search (e.g. -z 1 uses regression scaled
1682scores) and a second derived from shuffling the high scoring
1683sequences. The high-scoring sequence shuffled scores are
1684approximately equivalent to doing a PRSS (pairwise shuffle), but more
1685efficient. High-scoring shuffled E()-values (labeled E2()) are
1686typically 2 - 5-fold more conservative for average composition
1687proteins, and 10 - 20X more conservative for biased composition
1688proteins.
1689
1690Fix another bug in -S alignment scores vs opt scores in ssearch36 (see
1691Feb. 8).
1692
1693>>February 12, 2010
1694(prev. version 142)
1695
1696Create comp_lib4.c (from comp_lib3.c), which simplifies some of the
1697processes for handling buffers of results (no more empty_reader_bufs)
1698and enables shuffles of high-scoring sequences to evaluate significance.
1699
1700>>February 8, 2010
1701
1702Fix a problem with scores and E()-values for SSEARCH sub-alignments
1703when the -S option is used. When the -S option was used to ignore
1704lower-case residues in query or library for the initial score, the
1705final alignments include the lower-case masked residues.
1706SSEARCH36 was using the non-masked alignment score, rather than the
1707original score (FASTA36 and [T]FAST[XY]36 used the masked score).
1708This was incorrect, as the statistics are calculated for masked
1709sequences. The corrected version calculates both a non-masked and a
1710masked score, where the masked score (for subalignments) uses the
1711non-masked alignment.
1712
1713[T]FAST[XY]36 had a related problem, which is that when multiple
1714sequences are in the query with the same pam2p[0] (no -S) score, then
1715the wrong alignment could be shown with the initial scores. Fixing
1716this requires that the alignment routine only work on the region
1717specified from the initial band (fixed in dropnfa.c, dropfx.c, and
1718dropfz2.c).
1719
1720>>February 4, 2010
1721
1722The more efficient statistical thresholds in fasta36 have been
1723disabled by default. They can be turned on with -c -1, or by setting
1724thresholds (-c "0.05 0.2" would set E_band_opt to 0.05 - target 5% of
1725sequences - and E_join at 20% target).
1726
1727My initial implementation produced very inaccurate statistics,
1728presumably because only a small fraction of unrelated sequences were
1729being band-optimized (fasta35 typically optimized about 60% of library
1730sequences, fasta36 with statistical thresholds optimizes about 2%,
1731which causes a 2 - 3X speed increase). The sampling strategy for
1732fasta36, and [t]fast[xy]36 scores has been adjusted to provide
1733relatively accurate scores for searches that optimize only a small
1734fraction of sequences. On the cases I have tested, statistical
1735accuracy is comparable to, or better than, the version 35 programs,
1736but probably not as robust as ssearch estimates.
1737
1738>>January 29, 2010
1739
1740The logic to predetermine where scores went for shuffling breaks when
1741some scores are not calculated (e.g. -M 200 - 300). Fix by using
1742nstats as the index for nstats < MAX_STATS, and then use stats_idx
1743afterwards.
1744
1745Provide more efficient score sampling logic. The old method (left
1746over from fasta34 or earlier) generated a random number for every
1747sequence after MAX_STATS; if it was less than MAX_STATS, the sample
1748was used. This logic is still available with -DSAMP_STATS_MORE. The
1749new logic samples every other sequence between MAX_STATS and
17502*MAX_STATS, every third between 2*MAX_STATS and 3*MAX_STATS, etc., and
1751randomly replaces one of the stats scores. For 430K SwissProt, this
1752reduces the number of samples from 178K to about 145K, and reduces the
1753number of calls to the random number generator from 430K to 85K.
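
The two sampling schemes can be compared with a small Python sketch.
It is not taken from comp_lib3.c: the function names are hypothetical,
MAX_STATS = 60000 is assumed from the 60K figure quoted elsewhere in
this file, and the stride rule is only one reading of "every other,
every third". The older scheme behaves like reservoir sampling, so the
expected number of stored scores is about
MAX_STATS * (1 + ln(N / MAX_STATS)), which gives roughly 178K for
N = 430K, in line with the figure above. The stride sketch is not
expected to reproduce the 145K figure exactly, since the details of
the real implementation are not given here.

```python
import math
import random

MAX_STATS = 60000  # assumed from the 60K boundary mentioned in this file


def expected_old_samples(n_seqs, max_stats=MAX_STATS):
    """Expected scores stored by the older scheme (one random draw per sequence)."""
    if n_seqs <= max_stats:
        return float(n_seqs)
    # Sequence number n (n > max_stats) is kept with probability max_stats / n.
    return max_stats * (1.0 + math.log(n_seqs / max_stats))


def strided_indices(n_seqs, max_stats=MAX_STATS):
    """Indices kept by a stride rule: all of the first max_stats, then
    every 2nd in [M, 2M), every 3rd in [2M, 3M), and so on."""
    kept = []
    for i in range(n_seqs):
        stride = i // max_stats + 1
        if i % stride == 0:
            kept.append(i)
    return kept


def sample_scores(scores, max_stats=MAX_STATS, seed=0):
    """Fill stats[] with the first max_stats scores, then let each later
    kept score overwrite a randomly chosen slot."""
    rng = random.Random(seed)
    stats = []
    for i in strided_indices(len(scores), max_stats):
        if len(stats) < max_stats:
            stats.append(scores[i])
        else:
            stats[rng.randrange(max_stats)] = scores[i]
    return stats
```

In this sketch the random number generator is called only for the
sequences that are actually kept after the first MAX_STATS, rather
than once per library sequence.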
1754
1755>>January 28, 2010
1756
1757(comp_lib3.c, mrandom.c) Tests of ssearch36 statistical accuracy
1758suggest that the default statistical estimates (-z 1) are not as
1759accurate as they should be with BLOSUM62, -11/-1. Both -z 11 and -z 2
1760work better. In FASTA35, -z 11 - 15 caused a 2X-slowdown (actually
1761more) because EVERY library sequence was shuffled, even though only a
1762fraction of the sequences (for libraries > 60,000) would be used for
1763the statistical calculation. comp_lib3.c uses a more sophisticated
1764strategy for sampling scores after 60,000 so that sequences are only
1765shuffled and aligned if they will be used in the statistical
1766calculation. Doing this on SwissProt, with 430,000 sequences, means
1767that ~180,000 additional shuffle alignments are done, not 430,000
1768additional.
1769
1770However, using -z 11 with the threaded program was much more than
17712X-slower -- random() is not re-entrant, and is designed to provide a
1772consistent set of random numbers over threads, so threads were waiting
1773on the random number generator, with a big performance penalty. Using
1774code from Wikipedia, I implemented a random number generator
1775(mrandom.c) that saves a local copy of state, so threaded -z 11 has
1776the correct performance penalty.
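
The idea of a generator that carries its own state can be sketched in
Python as follows. This is not mrandom.c: the LocalRandom class, the
worker functions, and the choice of a linear congruential generator
with these widely published constants are all assumptions made for
illustration. The point is only that each thread owns its state, so
no thread waits on a shared generator.

```python
import threading


class LocalRandom:
    """Linear congruential generator that keeps its own state."""

    # Commonly published LCG constants; an assumption, not taken from mrandom.c.
    A, C, M = 1103515245, 12345, 2 ** 31

    def __init__(self, seed):
        self.state = seed % self.M

    def next(self):
        self.state = (self.A * self.state + self.C) % self.M
        return self.state


def worker(seed, n, out, slot):
    rng = LocalRandom(seed)  # private state: no lock and no sharing
    out[slot] = [rng.next() for _ in range(n)]


def run_workers(seeds, n=5):
    """Start one thread per seed and collect each thread's numbers."""
    out = [None] * len(seeds)
    threads = [threading.Thread(target=worker, args=(s, n, out, i))
               for i, s in enumerate(seeds)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out
```

Two workers given the same seed produce the same sequence regardless
of how the threads are scheduled, which is not guaranteed when all
threads draw from one shared generator.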
1777
1778>>January 25, 2010 (initfa.c 36.04 January 2010)
1779
1780(dropfz2.c, aln_struct.h) At long last, tfasty36 correctly produces
1781multiple alignments on the reverse strand. (Jan. 26, 2010) Fixed
1782introduced bug in fasty36 that used wrong offset in recursion.
1783
1784>>January 17, 2010
1785
1786Extensive changes have been made to all the drop_* functions, so that
1787multiple alignment results are properly sorted from highest to lowest
1788sw_score. dropnfa.c, dropgsw2.c, dropfx.c and dropfz2.c now all use
1789similar strategies to calculate non-overlapping alternative alignments.
1790score_thresh thresholds are applied to rst.score[ppst->score_ix]
1791appropriately for all recursive functions.
1792
1793>>August 24, 2009
1794
1795Statistical thresholds have been adjusted to produce
1796approximately the correct number of joins/band optimizations. The
1797approximate fraction of joins/band optimizations is now shown in the
1798results.
1799
1800>>August 21, 2009
1801
1802fasta/fastx/fasty/tfastx/tfasty now use statistically based thresholds
1803for joining short segments and deciding to do a band optimization --
1804similar to the threshold strategy used by BLAST.
1805
1806The statistical thresholds used are set with the -c option, which used
1807to be used to set optcut. The -c option now has three ranges:
1808
1809-c < 0 -- use the old FASTA thresholds, calculated in the same way
18100 < -c < 1.0 -- use the statistical thresholds and set E_opt_cut.
1811-c >= 1.0 -- use the old FASTA threshold, and specify it.
1812
1813For 0 < -c < 1.0, a second argument can be supplied (-c "0.02 0.1")
1814for the joining E()-threshold. If this value is < 1.0, it is used as
1815E_join; if it is > 1.0, E_opt_cut is multiplied by the value to get
1816E_join.
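
The three ranges, as described in this entry, can be written out as a
short Python sketch (later entries change which behavior is the
default). parse_c() and its result fields are hypothetical names, not
the code in initfa.c. The text does not say what happens for -c 0 or
for a second value of exactly 1.0; here the first is rejected and the
second is treated as a multiplier, and both choices are assumptions.

```python
def parse_c(arg):
    """Interpret a -c argument such as "0.02 0.1" (illustrative only)."""
    fields = [float(x) for x in arg.split()]
    c = fields[0]
    result = {"mode": "old", "optcut": None, "E_opt_cut": None, "E_join": None}

    if c < 0.0:
        return result            # old FASTA thresholds, calculated as before
    if c >= 1.0:
        result["optcut"] = c     # old FASTA threshold, given explicitly
        return result
    if c == 0.0:
        raise ValueError("-c 0 is not covered by the description")

    # 0 < c < 1.0: statistical thresholds
    result["mode"] = "statistical"
    result["E_opt_cut"] = c
    if len(fields) > 1:
        second = fields[1]
        # < 1.0: used directly as E_join; otherwise a multiplier on E_opt_cut
        result["E_join"] = second if second < 1.0 else c * second
    return result
```

With this reading, -c "0.02 0.1" and -c "0.02 5" both give an E_join
of 0.1.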
1817
1818>>August 19, 2009
1819
1820Implement Lambda/K/H based c_gap, opt_cut in dropnfa.c, dropfx.c
1821(fastx), and dropfz2.c (fasty). Add ELK_to_s() to scaleswn.c.
1822
1823>>August 11, 2009
1824
1825Fix bug in dropfx.c that used the wrong variables for calculating
1826offsets into a long DNA sequence for subset alignments.
1827
1828Stop putting sw_score in score[0] when no score[0] was calculated.
1829Use 0 instead.
1830
1831>>July 31, 2009
1832
1833(dropgsw2.c) Fix problems with dropgsw2.c that allowed poor
1834sub-alignments to be shown. Consolidate merge_ares_acc() for all the
1835functions. Add pst.do_rep to disable multiple alignments.
1836
1837>>July 6, 2009
1838
1839(initfa.c, apam.c, complib2.c, p2_complib.c) move changes for
1840validate_novel_aa() from fasta35.
1841
1842(initfa.c) Enable checks for unusual characters ('Uu' in proteins) for
1843many more programs with the -p option.
1844
1845>>June 16, 2009
1846
1847Modify statistical sampling strategy to greatly simplify the
1848calculation.
1849
1850>>May 15, 2009
1851
1852Fix bug in lav2ps.c, lav2svg.c that occurred when displaying very long
1853sequence alignments (e.g. genome alignments). The maximum coordinate
1854is set properly now.
1855
1856>>May 5, 2009
1857
1858(initfa.c) Fix bug (int e_cut in pgm_def_arr[]) that prevented e_cut
1859from being set properly for lalign for DNA.
1860
1861>>May 4, 2009
1862
1863The functions that return multiple sub-alignments (HSPs) after the
1864best alignment have been modified to ensure that alignments are
1865returned sorted by score, by merging the list of alignments found to
1866the left and right of the best alignment.
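
The merge step can be sketched in Python as below. This is not the
merge_ares_acc() code; ordered_alignments() and the dictionary layout
are hypothetical. It assumes that the lists found to the left and to
the right of the best alignment are each already sorted from highest
to lowest score.

```python
import heapq


def ordered_alignments(best, left, right):
    """Return the best alignment first, then the left and right
    sub-alignments merged into one list by descending score."""
    merged = heapq.merge(left, right,
                         key=lambda aln: aln["score"], reverse=True)
    return [best] + list(merged)


# Hypothetical example data: each alignment carries only a score and a start.
best = {"score": 250, "start": 400}
left = [{"score": 120, "start": 50}, {"score": 60, "start": 200}]
right = [{"score": 90, "start": 700}, {"score": 40, "start": 900}]
scores = [aln["score"] for aln in ordered_alignments(best, left, right)]
```

Because both inputs are already sorted, a single merge pass is enough
and no full re-sort of the combined list is needed.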
1867
1868>>April 28, 2009
1869
1870(p2_complib2.c, p2_workcomp2.c, mshowbest.c, mshowalign.c) modified to
1871support new coordinate system, preliminary work on multiple HSPs in
1872parallel environment.
1873
1874>>April 14, 2009
1875
1876(comp_lib2.c, nmgetaa.c) Comprehensive restructuring of library file
1877list from a fixed length array to a variable length linked list. The
1878linked list allows library files to insert additional files into the
1879list, so that, for example, a file of accession numbers can refer to a
1880list of files for the accessions.
1881
1882Eventually, this should allow FASTA to support .pal/.nal files from
1883the NCBI, and to support files of file names most places file names
1884are allowed.
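
A minimal Python sketch of this kind of expansion follows. It is not
the comp_lib2.c or nmgetaa.c code: expand_library_list() is a
hypothetical name, the "@" prefix used to mark an indirect file is an
assumption for the example, and file contents are passed in as a
dictionary so that the sketch does not touch the file system.

```python
def expand_library_list(names, indirect):
    """Expand a list of library entries.

    names    -- initial list of entries
    indirect -- dict mapping an indirect file's name to the entries it holds
    """
    queue = list(names)
    result = []
    seen = set()
    i = 0
    while i < len(queue):
        entry = queue[i]
        if entry.startswith("@"):
            key = entry[1:]
            if key in seen:
                # Simplistic guard against an indirect file that names itself.
                raise ValueError("indirect file listed twice: " + key)
            seen.add(key)
            # Insert the new entries right after the current position,
            # as a linked list would, so the original order is preserved.
            queue[i + 1:i + 1] = indirect[key]
        else:
            result.append(entry)
        i += 1
    return result
```

An indirect file may itself name another indirect file; the entries
are spliced in at the point of reference, so the final list keeps the
order in which the libraries were named.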
1885
1886>>April 2, 2009 (from fasta35)
1887
1888(structs.h, comp_lib2.c, doinit.c, mshowbest.c, mshowalign.c) The code
1889that selects the number of high scores to display has been reorganized
1890to support the -F e_low option (which was not implemented properly if
1891-b and -d were specified). The code is simplified; m_msg.nshow is
1892used to specify the number of best scores listed, and min(m_msg.nshow,
1893m_msg.ashow) is used to specify the number of alignments shown.
1894
1895>>March 26, 2009 (from fasta35 - fa35_04_07)
1896
1897(initfa.c) Fix problems with 'U' recognition in DNA pam matrix,
1898correct implementation of -r +mat/-mis. Previous versions of fasta35
1899may not have used the correct DNA matrix when the -r +mat/-mis option
1900was specified.
1901
1902>>March 23, 2009 (initfa.c verstr -> 36.02)
1903
1904(mshowbest.c, aln_structs.h) Add loop for displaying multiple aligned
1905regions with -m 9, -m 9i, and -m 9c in mshowbest.c.
1906
1907>>March 22, 2009
1908
1909(dropgsw2.c, dropnnw2.c, wm_align.c) Rearrange code in dropgsw2.c,
1910dropnnw2.c (which replaces dropnnw.c) so that a single function,
1911wm_align.c:nsw_malign() is responsible for recursive alignments for
1912both dropgsw2.c (sw_walign) and dropnnw2.c (nw_walign). The strategy
1913for these (Smith-Waterman, Global-Local) alignments is
1914identical. nsw_malign() uses a function pointer that calculates S-W or
1915N-W that it gets from dropgsw2.c or dropnnw2.c.
1916
1917It might make sense to use a similar strategy for the recursive
1918translated alignments.
1919
1920>>March 19, 2009
1921
1922(map_db.c, mm_file.h) Fix another bug in map_db.c that appears for
1923sequence files larger than 2Gb. MM_OFF is now consistently used in
1924more of the places where an int64_t is required.
1925
1926>>March 17, 2009
1927
1928(list_db.c) Fix a bug in list_db that caused it to misread the maximum
1929sequence length, and then be off by 4-bytes for all the offsets.
1930Include list_db with map_db in the list of auxiliary programs.
1931
1932>>Mar. 8, 2009 fa35_04_06
1933
1934(comp_lib2.c, pthr_subs2.c, pthr_subs.h, doinit.c, dec_pthr_subs.c)
1935Dynamically allocate pthread_t *fa_threads, rather than limit it to
1936MAX_WORKERS. MAX_WORKERS is no longer used in the Unix environment;
1937the number of threads comes from sysconf(_SC_NPROCESSORS_CONF). If sysconf()
1938is not available, MAX_WORKERS is used. The threaded programs should now
1939automatically adjust the number of threads to the number of
1940processors. Moreover, the number of threads can be set to more than
1941the number of processors with -T #threads. Also, max_workers was
1942renamed fa_max_workers, and pthread_t *threads is now *fa_threads.
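
The selection order described here can be sketched in Python, which
exposes the same sysconf() query through os.sysconf(). The function
name and the fallback value of MAX_WORKERS are hypothetical; only the
order (explicit -T value, then sysconf(), then MAX_WORKERS) follows
the text.

```python
import os

MAX_WORKERS = 2  # hypothetical fallback; the real value is not given here


def default_thread_count(requested=None):
    """Pick a thread count: explicit request, else processors, else fallback."""
    if requested is not None and requested > 0:
        return requested  # corresponds to -T #threads
    try:
        n = os.sysconf("SC_NPROCESSORS_CONF")
    except (AttributeError, ValueError, OSError):
        n = -1  # sysconf() missing or the name is not supported
    return n if n > 0 else MAX_WORKERS
```

An explicit request is returned unchanged, so it may exceed the
number of processors, as the entry notes.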
1943
1944>>Mar. 6, 2009
1945
1946copied comp_lib2.c from v35 (fix for query offset coordinates)
1947
1948>>Oct. 22, 2008
1949
1950The programs that allow multiple alignments to be found include:
1951
1952 ssearch36(_t)
1953 fasta36(_t)
1954 fastx36(_t)
1955 fasty36(_t)
1956
1957fasts and fastf will probably not be updated in this way, because of
1958the difficulty in reconstructing alignments, but fastm may be.
1959
1960Right now, the pvm/mpi versions of the programs do not support
1961multiple sub-alignments.
1962
1963>>Sep. 25, 2008
1964
1965Modify the syntax for the -E option to allow the repeat E()-value
1966cutoff to be specified in either of two ways.
1967
1968 -E "e_cut e_rep"
1969
1970If the value of e_rep is less than one, it is taken as the absolute
1971E()-value threshold for additional local domains, for example:
1972
1973 -E "1.0 0.05" says use 1.0 for the main E()-value threshold,
1974 and 0.05 as the threshold for additional local alignments.
1975
1976Alternatively, if e_rep >= 1.0, it is taken as a divisor for the
1977E()-value threshold, thus:
1978
1979 -E "1.0 10.0"
1980
1981Sets the E()-value threshold for additional local alignments to
19821.0/10.0 = 0.1.
1983
1984Finally, if e_rep <= 0.0, no multiple alignments are done (equivalent
1985to previous versions of FASTA).
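
The rules for the second -E value can be summarized in a short Python
sketch. parse_E() is a hypothetical name and this is not the FASTA
option parser; it handles only the two-value form described above and
returns None as the second item when multiple alignments are turned
off.

```python
def parse_E(arg):
    """Return (e_cut, threshold for additional alignments) from "e_cut e_rep"."""
    fields = [float(x) for x in arg.split()]
    if len(fields) != 2:
        raise ValueError("this sketch handles only the two-value form")
    e_cut, e_rep = fields

    if e_rep <= 0.0:
        return e_cut, None           # no multiple alignments
    if e_rep < 1.0:
        return e_cut, e_rep          # absolute E()-value threshold
    return e_cut, e_cut / e_rep      # e_rep acts as a divisor
```

For the two examples in the text, parse_E("1.0 0.05") gives a repeat
threshold of 0.05 and parse_E("1.0 10.0") gives 1.0/10.0 = 0.1.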
1986
readme.w32
1October 6, 2006
2
3The FASTA programs for Windows32 environments (WindowsNT, 2000, XP)
4have undergone a major upgrade, so that now all the programs in the
5Unix/MacOSX distribution are available to Windows users. Moreover,
6Windows users with modern (SSE2 compatible) processors can run greatly
7accelerated versions of the Smith-Waterman ssearch program.
8
9Moreover, these programs work both with FASTA formatted files, and
10NCBI BLAST formatted files.
11
12The following programs are available:
13
14 fasta36.exe protein-protein or DNA-DNA database searches
15 fastf36.exe
16 fastm36.exe
17 fasts36.exe
18 fastx36.exe compare DNA query to protein library with frameshifts
19 fasty36.exe compare DNA query to protein library with frameshifts
20 prfx36.exe
21 prss36.exe evaluate statistical significance using shuffles
22 prss36sse2.exe
23 ssearch36.exe Smith-Waterman for prot-prot or DNA-DNA searches
24 ssearch36sse2.exe Smith-Waterman, accelerated with SSE2 extensions
25 tfastf36.exe
26 tfastm36.exe
27 tfasts36.exe
28 tfastx36.exe compare protein to DNA library with frameshifts
29 tfasty36.exe compare protein to DNA library with frameshifts
30
31Each of these programs also has a "threaded" version, which can run on
32multiple processors (or dual cores) if they are available. However,
33they are built using the Unix pthreads API, so to use these programs,
34you must download the pthreadVC2.dll from:
35
36ftp://sources.redhat.com/pub/pthreads-win32/dll-latest/lib/pthreadVC2.dll
37
38see also http://sourceware.org/pthreads-win32/
39
40 fasta36_t.exe
41 fastf36_t.exe
42 fastm36_t.exe
43 fasts36_t.exe
44 fastx36_t.exe
45 fasty36_t.exe
46 prfx36_t.exe
47 prss36_t.exe
48 prss36sse2_t.exe
49 ssearch36_t.exe
50 ssearch36sse2_t.exe
51 tfastf36_t.exe
52 tfasts36_t.exe
53 tfastx36_t.exe
54 tfasty36_t.exe
55
56Without that DLL, the threaded programs will not run at all. The
57current compilation supports two threads, and speeds up searches about
582-fold on dual-core processors.
59
60The programs have been tested with protein and DNA databases in FASTA
61format, PIR/GCG-text format, and Genbank flatfile format. The program
62does not work properly with GCG binary format databases, but it seems
63unlikely that Windows users would need these.
64
65Be certain to use a program that can work with long file names when
66unpacking the program source files.
67
68Please report bugs to:
69
70 wrp@virginia.edu
71