1
2  $Id: readme.v35 120 2010-01-31 19:42:09Z wrp $
3  $Revision: 55 $
4
5>>Sep. 10, 2008
6
7Fix problem in init_ascii() call for p2_complib2.c.
8
9>>Sep. 9, 2008
10
11Fix bug in display of library name when written to an output file
12(rather than stdout).
13
14>>Aug. 28, 2008		fa35_04_02	SVN Revision: 45
15
16Fix serious bug in alignment generation that only occurred when large
17libraries were used as a query with [t]fast[x/y].  This bug often
18resulted in a core dump.
19
20Address some other issues with uninitialized variables with -m 9c.
21
22>>Jul. 15, 2008		fa35_04_01	SVN Revision: 38
23
24Correct problems with Makefiles.  Add information on compiling to README.
25Address issue with mp_KS for -m 10 when searching small libraries.
26
27>>Jul. 7, 2008		fa35_04_01	SVN Revision: 35
28
29Fix problems that occurred when statistics are disabled with -z -1,
30both for a normal library search, and for searches of a small library.
31
32>>Jul. 3, 2008	  	  	SVN Revision: 33
33
34Continue to fix an issue with 'J' and -S.
35
36>>Jun. 29, 2008     	 	SVN Revision: 29, 31
37
Fix additional problems with Makefiles, and some issues uncovered with
the Solaris 'C' compiler (Rev. 30).
40
41Discover serious bug when searching long, overlapping sequences, such
42as genomes.  The length of the library sequence was not updated to
43reflect the length of the new region plus the overlap.
44
45Fix inconsistency in the value of 'J' between uascii.h/aascii[] and
46pascii[].  Add code to ensure that lascii[], qascii[], never return a
47value outside pam2[][] (all <= pst.nsq) (particularly for 'O' and 'U'
48amino-acids).
49
Add exit(0) returns to map_db and list_db.
51
52>>Jun. 11, 2008
53
54Correct bug in scaleswn.c that prevented exact matches to queries < 10
55residues from being scored and displayed.
56
57>>Jun. 1, 2008
58
59Address various cosmetic issues in FASTA output:
60
61(1) Modify comp_lib2.c so that -O outfile works when multiple queries are
62compared in one run.
63
(2) Remove the duplicated query sequence length in the 1>>>query line.

(3) In -m 10 output, the tags "pg_name" and "pg_ver" were duplicated, e.g.
67
68>>>K1HUAG, 109 aa vs a library
69; pg_name: fasta35_t
70; pg_ver: 35.03
71; pg_argv: fasta35_t -q -b 10 -d 5 -m 10 ../seq/prot_test.lseg a
72; pg_name: FASTA
73; pg_ver: 3.5 Sept 2006
74
The ; pg_name and ; pg_ver produced by the get_param() functions in
drop*.c have been renamed ; pg_name_alg and ; pg_ver_rel.
77
78>>>K1HUAG, 109 aa vs a library
79; pg_name: fasta35_t
80; pg_ver: 35.03
81; pg_argv: fasta35_t -q -b 10 -d 5 -m 10 ../seq/prot_test.lseg a
82; pg_name_alg: FASTA
83; pg_ver_rel: 3.5 Sept 2006
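
As a rough illustration of why unique tag names matter to downstream
scripts, the following Python sketch collects the "; key: value" lines
of a -m 10 record into a dictionary.  parse_m10_tags() is a
hypothetical helper, not part of the FASTA distribution.  With the old
duplicated tags, the second pg_name/pg_ver values would silently
overwrite the first; with the renamed tags every key is distinct.

```python
def parse_m10_tags(lines):
    """Collect '; key: value' lines from a -m 10 record into a dict.

    Hypothetical helper for illustration only.
    """
    tags = {}
    for line in lines:
        if not line.startswith(";"):
            continue  # skip the '>>>query' line and any other text
        key, sep, value = line[1:].partition(":")
        if sep:
            tags[key.strip()] = value.strip()
    return tags


record = """>>>K1HUAG, 109 aa vs a library
; pg_name: fasta35_t
; pg_ver: 35.03
; pg_argv: fasta35_t -q -b 10 -d 5 -m 10 ../seq/prot_test.lseg a
; pg_name_alg: FASTA
; pg_ver_rel: 3.5 Sept 2006""".splitlines()

tags = parse_m10_tags(record)
```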
84
Modify mshowbest.c, mshowalign.c to highlight E() values (<font
color="dark red"></font>) in HTML output.
87
88>>Apr. 16, 2008     fa35_03_07
89
90Merge fa35_ann1_br, which allows annotations in library sequences.
91
The PVM/MPI parallel version now supports query sequence annotations
and -m 9c annotation encoding.  It does not yet support library
annotations.  Tested with both PVM and MPI.
95
96>>Apr. 2, 2008	    fa35_03_06
97
98Ensure that code in last_init() to modify ktup never increases ktup value.
99
100Add fasta_versions.html to more explicitly describe programs available.
101
102>>Mar. 4, 2008
103
104Fix parsing of parameters (matrix, gap open, gap ext) in ASN.1 PSSM
105files produced by blastpgp.
106
107>>Feb. 18, 2008	  fa35_03_05
108
109Re-implement -M low-high sequence range options.  Sequence range
110restriction has probably been missing since the introduction of
111ggsearch and glsearch, which use a new approach to limiting the
112sequence range.
113
114>>Feb. 7, 2008	  fa35_ann1_br
115
116Add annotations to library sequences (they were already available in
117query sequences).  Currently, annotations are only available within
118sequences, but they should be available in FASTA format, or any of the
other ascii text formats (EMBL/Swissprot, Genbank, PIR/GCG).  If
annotations are present in a library and the annotation characters
include '*', then the -V '*' option MUST be used.  However, special
characters other than '*' are ignored, so annotations of '@' or '%'
should be transparent.
124
125In translated sequence comparisons, annotations are only available for
126the protein sequence.
127
128The format for encoded annotations has changed to support annotations
129in both the query and library sequence.  If the -m 9c flag is provided
130and annotations are present, then an annotated position in the
131alignment will be encoded as:
132
133 '|'q-pos':'l-pos':'q-symbol'l-symbol':'match-symbol'q-residue'l-residue'
134
135For example:
136
137    |7:7:@@:=YY|14:14:##:=TT
138
In cases where the query or library sequence does not have an
annotation, the q-symbol or l-symbol will be 'X' (which is not a
valid annotation symbol).
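
The encoding above can be unpacked with a few lines of script.  The
Python sketch below is only an illustration: AnnotatedPos and
parse_ann_code() are hypothetical names, and the code assumes that
annotation symbols are single characters and never ':' or '|'.

```python
from typing import NamedTuple


class AnnotatedPos(NamedTuple):
    q_pos: int
    l_pos: int
    q_symbol: str      # 'X' when the query has no annotation here
    l_symbol: str      # 'X' when the library has no annotation here
    match_symbol: str
    q_residue: str
    l_residue: str


def parse_ann_code(code):
    """Split '|q:l:QL:mQL|...' into one AnnotatedPos per entry."""
    entries = []
    for field in code.split("|"):
        if not field:
            continue  # the string starts with '|', so the first piece is empty
        q_pos, l_pos, symbols, match = field.split(":")
        entries.append(AnnotatedPos(int(q_pos), int(l_pos),
                                    symbols[0], symbols[1],
                                    match[0], match[1], match[2]))
    return entries


entries = parse_ann_code("|7:7:@@:=YY|14:14:##:=TT")
```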
142
143>>Jan. 25, 2008	   fa35_03_04
144
145Map 'O' (pyrrolysine) to 'K', 'U' (seleno-cysteine) to 'C' in uascii.h
146('J' is already recognized and mapped to the average of 'I' and 'L').
147Thus, 'J' will appear in alignments, but 'O' and 'U' are transformed
148to 'K' and 'C'.
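
The effect of this mapping can be pictured with a small Python sketch.
It is not the uascii.h implementation (which works through lookup
tables); map_rare_residues() is a hypothetical name, and the handling
of lower-case 'o' and 'u' is an assumption made for the example.

```python
# 'O' (pyrrolysine) -> 'K', 'U' (seleno-cysteine) -> 'C'; 'J' is left alone.
# The lower-case entries are an assumption for this illustration.
RARE_RESIDUE_MAP = str.maketrans({"O": "K", "o": "k", "U": "C", "u": "c"})


def map_rare_residues(seq):
    """Return seq with 'O' and 'U' replaced; all other residues unchanged."""
    return seq.translate(RARE_RESIDUE_MAP)


mapped = map_rare_residues("MOUJK")
```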
149
150Because "Oo" and "Uo" are not (currently) part of aax[] ("Uu" is in
151ntx[]), apam.c/build_xascii() was extended to add characters from
152othx[] - "oth" for "other" so that they are not lost.
153
154Double check, and fix, some mappings for 'J/j' and 'Z/z'.
155
156>>Jan. 11, 2008	   fa35_03_03
157
158Clean up some issues with -m 10 output; put "; mp_Algorithm", ";
159mp_Parameters" down with other -m 10 ";" lines.  Also provide ";
160al_code" and "; al_code_ann" if -m 9c is specified.  Remove duplicate
161">>>query" line.
162
Add "; aln_code" and "; ann_code" to -m 10 -m 9c output.  The
alignment/annotation encoding is only produced once (in showbest()),
and is then saved for the -m 10 alignment.
166
167>>Dec. 13, 2007	  fa35_03_02m (merge of fa35_03_02 and fa35_02_08_br)
168
169Add ability to search a subset of a library using a file name and a
170list of accession/gi numbers. This version introduces a new filetype,
17110, which consists of a first line with a target filename, format, and
172accession number format-type, and optionally the accession number
173format in the database, followed by a list of accession numbers.  For
174example:
175
176	  </slib2/blast/swissprot.lseg 0:2 4|
177	  3121763
178	  51701705
179	  7404340
180	  74735515
181	  ...
182
This tells the program that the target database is swissprot.lseg,
which is in FASTA (library type 0) format.
185
186The accession format comes after the ":".  Currently, there are four
187accession formats, two that require ordered accessions (:1, :2), and
188two that hash the accessions (:3, :4) so they do not need to be
189ordered.  The number and character after the accession format
190(e.g. "4|") indicate the offset of the beginning of the accession and
191the character that terminates the accession.  Thus, in the typical
192NCBI Fasta definition line:
193
194 >gi|1170095|sp|P46419|GSTM1_DERPT Glutathione S-transferase (GST class-mu)
195
196The offset is 4 and the termination character is '|'.  For databases
197distributed in FASTA format from the European Bioinformatics
198Institute, the offset depends on the name of the database, e.g.
199
200 >SW:104K_THEAN Q4U9M9 104 kDa microneme/rhoptry antigen precursor (p104).
201
202and the delimiter is ' ' (space, the default).
203
204Accession formats 1 and 3 expect strings; accession formats 2 and 4
205work with integers (e.g. gi numbers).
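
To make the header line and the offset/terminator rule concrete, here
is a minimal Python sketch.  It is not the FASTA parser:
parse_list_header() and extract_accession() are hypothetical names, and
the code assumes that the offset is counted from the start of the
definition line (so the leading '>' is position 0) and that a missing
terminator character means a space.

```python
import re


def parse_list_header(line):
    """Parse '<filename libtype:acctype [offset+terminator]'."""
    fields = line.strip().lstrip("<").split()
    if len(fields) < 2:
        raise ValueError("expected '<filename libtype:acctype [offset][term]'")
    filename = fields[0]
    lib_type, _, acc_type = fields[1].partition(":")
    offset, terminator = None, " "      # space is the default delimiter
    if len(fields) > 2:
        m = re.fullmatch(r"(\d+)(.?)", fields[2])
        if m is None:
            raise ValueError("bad accession format field: " + fields[2])
        offset = int(m.group(1))
        terminator = m.group(2) or " "
    return filename, int(lib_type), int(acc_type), offset, terminator


def extract_accession(defline, offset, terminator):
    """Return the text from offset up to (not including) the terminator."""
    end = defline.find(terminator, offset)
    if end < 0:
        end = len(defline)
    return defline[offset:end]


header = parse_list_header("</slib2/blast/swissprot.lseg 0:2 4|")
ncbi = ">gi|1170095|sp|P46419|GSTM1_DERPT Glutathione S-transferase (GST class-mu)"
ebi = ">SW:104K_THEAN Q4U9M9 104 kDa microneme/rhoptry antigen precursor (p104)."
```

With these assumptions, the NCBI line yields the gi number and the EBI
line (offset 4 for the "SW:" prefix) yields the entry name.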
206
207>>Dec. 12, 2007  fa35_02_08
208
209Correct bug in ssearch35 gapped scores that only occurred in
210non-accelerated code.  This bug has been present since fa35_02_06.
Modified the Makefiles so that accelerated (ssearch35(_t)) and
non-accelerated (ssearch35s(_t)) versions are available. Edited the
Makefiles to provide accelerated ssearch35 more specifically.
214
Modifications to provide information about annotated residues in the
-m 9c coded output. Previously, -m 9c output added a field:
217
218    =26+9=15-2=9-1=3+1=74-2=3-3=63
219
220after the standard -m 9 output information.  With the new version, an
221annotated query sequence ( -V '*#' ) adds the field:
222
223    |14:16:#<TM|24:26:#>TA|44:37:*>ST|71:66:#=TT
224
225which indicates that residue 14 in the query sequence aligns with
226residue 16 in the target (library) with annotation symbol '#', the
227alignment score is '<' less than zero, and the residues are 'T'
228(query) and 'M' (library). (The '|' is used to separate each
229annotation entry.)
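
Both fields are easy to take apart in a script.  The Python sketch
below is an illustration only; parse_aln_code() and parse_ann_field()
are hypothetical names.  The code does not interpret the '=', '+' and
'-' operators, it only splits the alignment field into operator/length
pairs, and it assumes that the annotation symbol is never ':' or '|'.

```python
import re


def parse_aln_code(code):
    """Split '=26+9=15-2...' into (operator, run_length) pairs."""
    return [(op, int(n)) for op, n in re.findall(r"([=+-])(\d+)", code)]


def parse_ann_field(code):
    """Split '|q:l:SmQL|...' into (q_pos, l_pos, symbol, score_sign, q_res, l_res)."""
    out = []
    for field in code.split("|"):
        if not field:
            continue  # leading '|' produces an empty first piece
        q_pos, l_pos, rest = field.split(":")
        out.append((int(q_pos), int(l_pos), rest[0], rest[1], rest[2], rest[3]))
    return out


aln = parse_aln_code("=26+9=15-2=9-1=3+1=74-2=3-3=63")
ann = parse_ann_field("|14:16:#<TM|24:26:#>TA|44:37:*>ST|71:66:#=TT")
```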
230
237>>Nov. 20, 2007   fa35_02_08
238
239Parts of p2_complib.c and p2_workcomp.c, and the pvm/mpi Makefiles,
240have been updated to be consistent with name changes in the param.h
241and structs.h directories.
242
243>>Nov. 6, 2007	  fa35_02_07
244
245Correct problems with asymmetric RNA matrices in initfa.c and rna.mat.
246
247>>Oct. 18, 2007
248
249Correct problem parsing ASN1 FastaDefLines when the database is local.
250
251Recovering from a misplaced cvs commit of code that was supposed to be
252on a branch, code has been recovered from earlier versions (fa35_02_05
253because fa35_02_06 has some branch contamination).
254
255>>Oct. 4, 2007	  fa35_02_06
256
Correct error in gap penalties in dropnnw.c.  Due to an unfortunate
inconsistency, the gap parameter in FLOCAL_ALIGN (in dropgsw2.c) had a
different meaning than that in almost all the other programs (it was
the sum of gap_open and gap_ext).  The FLOCAL_ALIGN function call was
copied for FGLOBAL_ALIGN, even though the FGLOBAL_ALIGN function
used the more conventional gap_open, gap_ext parameters.  Thus, the
FGLOBAL_ALIGN call and the subsequent do_walign() call in dropnnw.c
were wrong.  dropgsw2.c:FLOCAL_ALIGN has been modified to use the
conventional gap_open parameter, and calls to dropnnw.c:
FGLOBAL_ALIGN() and do_walign() have been fixed.
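
The size of the error can be seen in a small worked example.  The
Python sketch below is not the dropnnw.c code; the two functions are
hypothetical and assume that the conventional form charges gap_open
once plus gap_ext for every gapped residue, while the other form takes
the cost of the first gapped residue (gap_open + gap_ext) as its
parameter.

```python
def gap_cost_open_ext(length, gap_open, gap_ext):
    """Conventional form: gap_open once, gap_ext per gapped residue."""
    return gap_open + length * gap_ext


def gap_cost_first_ext(length, gap_first, gap_ext):
    """Alternate form: gap_first (= gap_open + gap_ext) covers residue 1."""
    return gap_first + (length - 1) * gap_ext


gap_open, gap_ext = -10, -2
gap_first = gap_open + gap_ext

# Each form is fine when it receives the parameter it expects.
correct = gap_cost_open_ext(3, gap_open, gap_ext)
same = gap_cost_first_ext(3, gap_first, gap_ext)

# Passing the summed value to the conventional form charges one
# extra gap_ext for every gap, whatever its length.
mixed_up = gap_cost_open_ext(3, gap_first, gap_ext)
```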
267
268>>Sept. 20, 2007
269
270Modify the logic used when saving a seq_record *seq_p into beststr
271*bbp to ensure that if the seq_record is replaced, it is replaced at
272all the places where it is referenced.  This involves adding a linked
273list into beststr (*bbp->bbp_link).  When making the link (and freeing
274it up), be certain that the linked seq_p is the same as the one being
275replaced.
276
277>>Sept. 18, 2007   fa35_02_05
278
A relatively obscure problem was found on the SGI platform when
searching a library smaller than 500 sequences (thus requiring some
shuffles). Two bugs were found and corrected.  The first involved not
allocating aa1shuff with COMP_THR and not doing a m_file_p->ranliba()
before re_getlib().  The second involved destroying a pointer to the
list of seq_records when a sequence was being shuffled.  The bugs were
confirmed with Insure, and have been fixed.
286
287>>Sept. 7, 2007	fa35_02_04
288
289Revamp the offset handling code to provide better uniformity between
290query and library offsets and coordinate systems.
291
292Fix a problem with load_mmap() to load 64-bit sequence locations
293properly on machines with 32-bit integers.
294
295>>Sept. 4, 2007
296
297Modify ncbl2_mlib.c slightly to check to see whether the amino-acid
298mapping in blast databases is identical to the FASTA mapping (it
299should be).  If they are identical, do not re-map the blast amino acid
300sequences (potentially a small speed up).
301
302>>Aug. 22, 2007
303
Change ps_lav.c to lav2ps.c, and add lav2svg.c.  It is now possible to
generate lalign35 HTML output that has both SVG (lav2svg) and PNG
(lav2ps | gs) graphics.
307
308>>Aug. 10, 2007	CVS fa35_02_03
309
310Fix faatran.c:aacmap() bug.
311
312>>Aug. 6, 2007
313
314Extensive restructuring of pssm_asn_subs.c to parse PSSM:2 ASN.1's
315downloaded from NCBI WWW PSI-BLAST more robustly.
316
317>>July 25, 2007	  CVS fa35_02_02
318
319Change default gap penalties for OPTIMA5 matrix to -20/-2 from -24/-4.
320
321>>July 24, 2007
322
323Correct bugs introduced by adding 'J' - 'J' was initially put before
324'X' and '*' in the alphabet, which led to problems because the
325one-dimensional lower-triangular pam[] matrices (abl50[], abl62[],
326etc) had entries for 'X', and '*', but not for 'J'.  By placing 'J'
327after the other characters, the problem is resolved.
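
Why the position of 'J' matters can be sketched with the usual
packed lower-triangular indexing.  The Python code below is an
illustration under assumptions: the six-letter alphabet is made up for
brevity, tri_index() uses zero-based indices, and the real pam[] layout
may differ in detail.

```python
def tri_index(i, j):
    """Offset of element (i, j) in a packed lower-triangular array."""
    if j > i:
        i, j = j, i
    return i * (i + 1) // 2 + j


def pair_index(alphabet, a, b):
    return tri_index(alphabet.index(a), alphabet.index(b))


old = "ARNDX*"          # hypothetical alphabet; packed matrix has 21 entries
inserted = "ARNDJX*"    # 'J' placed before 'X' and '*'
appended = "ARNDX*J"    # 'J' placed after the other characters

x_old = pair_index(old, "X", "A")
x_inserted = pair_index(inserted, "X", "A")   # offset moves: wrong entry is read
x_appended = pair_index(appended, "X", "A")   # offset unchanged
```

Appending leaves every existing offset where the packed matrix expects
it; inserting shifts the 'X' and '*' rows into positions the old
matrix never defined.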
328
329Modify tatstats.c to accommodate 'J'.
330
331'*' is back in the aascii[] matrix, so that it is present by default
332(like fasta34).
333
334>>July 23, 2007
335
336Changes to support sub-sequence ranges for "library" sequences -
337necessary for fully functional prss (ssearch35) and lalign35.  For all
338programs, it is now possible to specify a subset of both the query and
339the library, e.g.
340
341    lalign35 -q mchu.aa:1-74 mchu.aa:75-148
342
343Note, however, that the subset range applied to the library will be
344applied to every sequence in the library - not just the first - and
345that the same subset range is applied to each sequence.  This probably
346makes sense only if the library contains a single sequence (this is
347also true for the query file).
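
The file:start-end notation can be pictured with a short Python
sketch.  parse_range() and apply_range() are hypothetical helpers, not
FASTA functions, and the code assumes that coordinates are 1-based and
inclusive, as the 1-74 / 75-148 example suggests.

```python
import re


def parse_range(spec):
    """Split 'file:start-end' into (file, start, end); no range gives None, None."""
    m = re.fullmatch(r"(.*):(\d+)-(\d+)", spec)
    if m is None:
        return spec, None, None
    return m.group(1), int(m.group(2)), int(m.group(3))


def apply_range(seq, start, end):
    """Return the 1-based, inclusive sub-sequence; the same range is used for every sequence."""
    if start is None:
        return seq
    return seq[start - 1:end]


query = parse_range("mchu.aa:1-74")
library = parse_range("mchu.aa:75-148")
```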
348
349Correct bugs in the functions that produce lav output from lalign35 -m
35011 to properly report the begin and end coordinates of both sequences.
351Previously, coordinates always began with "1".  Correct associated bug
352in ps_lav.c that assumed coordinates started with "1".
353
354>>June 29, 2007	CVS fa35_02_01
355
356Merge of HEAD with fasta35 branch.
357
358>>June 29, 2007	CVS fa35_01_06
359
360Add exit(0); to ps_lav.c for 0 return code.
361
362>>June 26, 2007
363
364Add amino-acid 'J' for 'I' or 'L'.
365
366Add Mueller and Vingron (2000) J. Comp. Biol. 7:761-776 VT160 matrix,
367"-s VT160", and OPTIMA_5 (Kann et al. (2000) Proteins 41:498-503).
368
369Changes to dropnnw.c documentation functions to remove #ifdef's from
370strncpy() - which apparently is a macro in some versions of gcc.
371
372>>June 7, 2007
373
Modify initfa.c to allow ggsearch35(_t), glsearch35(_t) to use PSSMs.
375
376>>June 5, 2007  CVS fa35_01_05
377
378Modifications to p2_complib.c, p2_workcomp.c to support Intel C
379compiler.  Fixed bug in p2_workcomp.c - gstring[2][MAX_STR] required -
380[MAX_SSTR] too short.  mp35comp* programs now tested and working (as
381are pv35comp*, c35.work* programs).
382
383Fix problem with fasts/fastm/fastf last_tat.c with limited memory.
384
385Correct problem with lalign35.exe Makefile.nm_[fp]com.
386
387Add $(CFLAGS) to map_db to enable large file support.
388
389Address problem with PSSM's when '*' not defined (initfa.c:extend_pssm()).
390
391>>May 30, 2007	CVS fa35_01_04
392
393Complete work on ps_lav, which converts an lalign35 lav (-m 11) file
394into a postscript plot, which looks identical to the plots produced by
395plalign from fasta2.  (ps_lav has been replaced by lav2ps and lav2svg).
396
397>>May 25,29, 2007
398
399Changes to defs.h, doinit.c mshowalign.c for -m 11, which produces lav
400output only for lalign35.
401
402Changes to comp_lib2.c to add m_msg.std_output, which provides all the
403standard print lines.  This is turned off for -m 11 (lav) output.
404lalign35 -m 11 provides standard lav output, with the addition of
405#lalign35 -q ... .
406
407>>May 18, 2007
408
409Add m_msg.zsflag to preserve pst.zsflag when reset by global/global
410exclusion of many library sequences.
411
412>>May  9, 2007	CVS fa35_01_03
413
414Tested local database size determination with p2_complib2/p2_workcomp2.
415
416>>May 2, 2007	renamed fasta35, pv35comp, etc
417
418Separate thread buffer structures from param.h.
419
Problems with incorrect alignments have been fixed by re-initializing the
best_seqs and lib_buf2_list.buf2 structures after each query sequence.
422
423The labels on the alignment scores are much more informative (and more
424diverse).   In the past, alignment scores looked like:
425
426>>gi|121716|sp|P10649|GSTM1_MOUSE Glutathione S-transfer  (218 aa)
427 s-w opt: 1497  Z-score: 1857.5  bits: 350.8 E(): 8.3e-97
428Smith-Waterman score: 1497; 100.0% identity (100.0% similar) in 218 aa overlap (1-218:1-218)
429^^^^^^^^^^^^^^
430
431where the highlighted text was either: "Smith-Waterman" or "banded
432Smith-Waterman". In fact, scores were calculated in other ways,
433including global/local for fasts and fastf.  With the addition of
434ggsearch35, glsearch35, and lalign35, there are many more ways to
435calculate alignments: "Smith-Waterman" (ssearch and protein fasta),
436"banded Smith-Waterman" (DNA fasta), "Waterman-Eggert",
437"trans. Smith-Waterman", "global/local", "trans. global/local",
"global/global (N-W)".  The last option is a global/global alignment,
but with the affine gap penalties used in the Smith-Waterman
algorithm.
441
442>>April 24, 2007
443
The new program structure has been migrated to the PVM and MPI
versions.  In addition, the new global algorithms (pv35compgg,
pv35compgl) have been moved, though the PVM/MPI versions do not
(yet) do the appropriate size filtering.
448
449>>April 19, 2007
450
Two new programs, ggsearch35(_t) and glsearch35(_t), are now available.
ggsearch35(_t) calculates an alignment score that is global in the
query and global in the library; glsearch35(_t) calculates an alignment
that is global in the query and local in the library sequence.  The
latter program is designed for global alignments to domains.
456
457Both programs assume that scores are normally distributed.  This
458appears to be an excellent approximation for ggsearch35 scores, but
459the distribution is somewhat skewed for global/local (glsearch)
460scores.  ggsearch35(_t) only compares the query to library sequences
that are between 80% and 125% of the length of the query; glsearch
462limits comparisons to library sequences that are longer than 80% of
463the query.  Initial results suggest that there is relatively little
464length dependence of scores over this range (scores go down
465dramatically outside these ranges).
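
The length filters described above amount to two simple tests.  The
Python sketch below is an illustration, not the program's code: the
function names are hypothetical, and treating the 80% and 125% limits
as inclusive for ggsearch and the 80% limit as strict for glsearch is
an assumption.  Integer arithmetic is used to avoid rounding at the
boundaries.

```python
def ggsearch_accepts(q_len, l_len):
    """Library length between 80% and 125% of the query length."""
    return 4 * q_len <= 5 * l_len and 4 * l_len <= 5 * q_len


def glsearch_accepts(q_len, l_len):
    """Library length longer than 80% of the query length."""
    return 5 * l_len > 4 * q_len


# For a 100-residue query, ggsearch would compare against library
# sequences of 80 to 125 residues under these assumptions.
kept = [n for n in (79, 80, 100, 125, 126) if ggsearch_accepts(100, n)]
```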
466
467A bug was found and fixed in showalign() and showbest() where the
468aa1save buffer was not preserved when some sequences needed to be
469re-read, while others were stored in the beststr.
470
471>>April 9, 2007
472
473Some of the drop*.c functions have been reconfigured to reduce the
474amount of duplicate code.  For example, dropgsw.c, dropnsw.c, and
dropnfa.c all used exactly the same code to produce global alignments
(NW_ALIGN() and nw_align()); this code is now in wm_align.c.
477Likewise, those same files, as well as dropgw2.c, use identical code
478to produce consensus alignments (calcons(), calcons_a(), calc_id(),
479calc_code()).  Rather than working with three or four copies of
480identical code, there is now one version.
481
482>>March 29, 2007
483
484At last, the lalign (SIM) algorithm has been moved from FASTA21 to
485FASTA35.  Currently, only lalign35 is available.  A plotting version
486will be available shortly (or perhaps a more general solution that
487produces lav output).
488
489The statistical estimates for lalign35 should be much more accurate
490than those from the earlier lalign, because lambda and K are estimated
491from shuffles.
492
Many functions have been modified to pass pointers to structures as
arguments, rather than the structures themselves.
495
496>>February 23, 2007
497
498The threading strategy has been modified slightly to separate the end
499of the search phase (and a complete reading of all results buffers)
500from the termination phase.  This will allow future threading of
501subsequent phases, including the Smith-Waterman alignments in
502showbest() and showalign() (though care will be required to ensure
503that the results are presented in the correct order).
504
505>>February 20, 2007	fasta-34_27_0  (released as fasta-35_1)
506
507The FASTA programs have been restructured to reduce the differences
508between the threaded and unthreaded versions (and ultimately the
509parallel versions) and to make more efficient use of modern large
510memory systems.  This is the beginning of a move towards a more robust
511shuffling strategy when searching databases with modest numbers of
512related sequences.
513
514The major changes:
515
516  comp_lib.c -> comp_lib2.c  - comp_lib.c will be removed
517  work_thr.c -> work_thr2.c  - work_thr.c will be removed
518
519  mshowbest.c, mshowalign.c have been modified to remove aa1 as an
520    argument. They must allocate that space if they need it.
521
522  The system is set up to allocate a substantial amount of library
523  sequence memory, either to a single buffer (unthreaded) or to the
524  threaded buffer pool.  For smaller databases, the library sequences
525  are read once, and then subsequently read from memory (this could be
526  extended for RANLIB(bline) as well).
527
528Soon, these changes will allow the program to re-read the beststr[]
529sequences and shuffle them to produce accurate lambda/K estimates.
530
531================================================================
532
533See readme.v34t0 for earlier changes.
534
535================================================================
536