 $Id: readme.v35 120 2010-01-31 19:42:09Z wrp $
 $Revision: 55 $

>>Sep. 10, 2008

Fix problem in init_ascii() call for p2_complib2.c.

>>Sep. 9, 2008

Fix bug in display of library name when written to an output file
(rather than stdout).

>>Aug. 28, 2008 fa35_04_02 SVN Revision: 45

Fix serious bug in alignment generation that only occurred when large
libraries were used as a query with [t]fast[x/y]. This bug often
resulted in a core dump.

Address some other issues with uninitialized variables with -m 9c.

>>Jul. 15, 2008 fa35_04_01 SVN Revision: 38

Correct problems with Makefiles. Add information on compiling to README.
Address issue with mp_KS for -m 10 when searching small libraries.

>>Jul. 7, 2008 fa35_04_01 SVN Revision: 35

Fix problems that occurred when statistics are disabled with -z -1,
both for a normal library search, and for searches of a small library.

>>Jul. 3, 2008 SVN Revision: 33

Continue to fix an issue with 'J' and -S.

>>Jun. 29, 2008 SVN Revision: 29, 31

Fix additional problems with Makefiles, and some issues uncovered with
the Solaris 'C' compiler (Rev. 30).

Discover serious bug when searching long, overlapping sequences, such
as genomes. The length of the library sequence was not updated to
reflect the length of the new region plus the overlap.

Fix inconsistency in the value of 'J' between uascii.h/aascii[] and
pascii[]. Add code to ensure that lascii[], qascii[], never return a
value outside pam2[][] (all <= pst.nsq) (particularly for 'O' and 'U'
amino-acids).

exit(0) returns for map_db, list_db.

>>Jun. 11, 2008

Correct bug in scaleswn.c that prevented exact matches to queries < 10
residues from being scored and displayed.

>>Jun. 1, 2008

Address various cosmetic issues in FASTA output:

(1) Modify comp_lib2.c so that -O outfile works when multiple queries are
compared in one run.

(2) remove the duplicated query sequence length in the 1>>>query line.

(3) in -m 10 output, the tags "pg_name" and "pg_ver" were duplicated, e.g.

>>>K1HUAG, 109 aa vs a library
; pg_name: fasta35_t
; pg_ver: 35.03
; pg_argv: fasta35_t -q -b 10 -d 5 -m 10 ../seq/prot_test.lseg a
; pg_name: FASTA
; pg_ver: 3.5 Sept 2006

The ; pg_name and ; pg_ver produced by the get_param() functions in
drop*.c have been renamed ; pg_name_alg and ; pg_ver_rel:

>>>K1HUAG, 109 aa vs a library
; pg_name: fasta35_t
; pg_ver: 35.03
; pg_argv: fasta35_t -q -b 10 -d 5 -m 10 ../seq/prot_test.lseg a
; pg_name_alg: FASTA
; pg_ver_rel: 3.5 Sept 2006

Modify mshowbest.c, mshowalign.c to highlight E() values (<font
color="dark red"></font>) in HTML output.

>>Apr. 16, 2008 fa35_03_07

Merge fa35_ann1_br, which allows annotations in library sequences.

The PVM/MPI parallel version now supports query sequence annotations
and -m 9c annotation encoding. It does not yet support library
annotations. Tested with both PVM and MPI.

>>Apr. 2, 2008 fa35_03_06

Ensure that code in last_init() to modify ktup never increases ktup value.

Add fasta_versions.html to more explicitly describe programs available.

>>Mar. 4, 2008

Fix parsing of parameters (matrix, gap open, gap ext) in ASN.1 PSSM
files produced by blastpgp.

>>Feb. 18, 2008 fa35_03_05

Re-implement -M low-high sequence range options. Sequence range
restriction has probably been missing since the introduction of
ggsearch and glsearch, which use a new approach to limiting the
sequence range.

>>Feb. 7, 2008 fa35_ann1_br

Add annotations to library sequences (they were already available in
query sequences). Currently, annotations are only available within
sequences, but they should be available in FASTA format, or any of the
other ascii text formats (EMBL/Swissprot, Genbank, PIR/GCG).
If annotations are present in a library and the annotation characters
include '*', then the -V '*' option MUST be used. However, special
characters other than '*' are ignored, so annotations of '@', '%', or
'@' should be transparent.

In translated sequence comparisons, annotations are only available for
the protein sequence.

The format for encoded annotations has changed to support annotations
in both the query and library sequence. If the -m 9c flag is provided
and annotations are present, then an annotated position in the
alignment will be encoded as:

  '|'q-pos':'l-pos':'q-symbol'l-symbol':'match-symbol'q-residue'l-residue'

For example:

  |7:7:@@:=YY|14:14:##:=TT

In cases where the query or library sequence does not have an
annotation, then the q-symbol or l-symbol will be 'X' (which is not a
valid annotation symbol).

>>Jan. 25, 2008 fa35_03_04

Map 'O' (pyrrolysine) to 'K', 'U' (seleno-cysteine) to 'C' in uascii.h
('J' is already recognized and mapped to the average of 'I' and 'L').
Thus, 'J' will appear in alignments, but 'O' and 'U' are transformed
to 'K' and 'C'.

Because "Oo" and "Uo" are not (currently) part of aax[] ("Uu" is in
ntx[]), apam.c/build_xascii() was extended to add characters from
othx[] - "oth" for "other" so that they are not lost.

Double check, and fix, some mappings for 'J/j' and 'Z/z'.

>>Jan. 11, 2008 fa35_03_03

Clean up some issues with -m 10 output; put "; mp_Algorithm",
"; mp_Parameters" down with other -m 10 ";" lines. Also provide
"; al_code" and "; al_code_ann" if -m 9c is specified. Remove duplicate
">>>query" line.

Add "; aln_code" and "; ann_code" to -m 10 -m 9c output. The
alignment/annotation encoding is only produced once (in showbest()),
and is then saved for the -m 10 alignment.

>>Dec. 13, 2007 fa35_03_02m (merge of fa35_03_02 and fa35_02_08_br)

Add ability to search a subset of a library using a file name and a
list of accession/gi numbers. This version introduces a new filetype,
10, which consists of a first line with a target filename, format, and
accession number format-type, and optionally the accession number
format in the database, followed by a list of accession numbers. For
example:

  </slib2/blast/swissprot.lseg 0:2 4|
  3121763
  51701705
  7404340
  74735515
  ...

This tells the program that the target database is swissprot.lseg,
which is in FASTA (library type 0) format.

The accession format comes after the ":". Currently, there are four
accession formats, two that require ordered accessions (:1, :2), and
two that hash the accessions (:3, :4) so they do not need to be
ordered. The number and character after the accession format
(e.g. "4|") indicate the offset of the beginning of the accession and
the character that terminates the accession. Thus, in the typical
NCBI Fasta definition line:

  >gi|1170095|sp|P46419|GSTM1_DERPT Glutathione S-transferase (GST class-mu)

The offset is 4 and the termination character is '|'. For databases
distributed in FASTA format from the European Bioinformatics
Institute, the offset depends on the name of the database, e.g.

  >SW:104K_THEAN Q4U9M9 104 kDa microneme/rhoptry antigen precursor (p104).

and the delimiter is ' ' (space, the default).

Accession formats 1 and 3 expect strings; accession formats 2 and 4
work with integers (e.g. gi numbers).

>>Dec. 12, 2007 fa35_02_08

Correct bug in ssearch35 gapped scores that only occurred in
non-accelerated code. This bug has been present since fa35_02_06.
Modified the Makefiles so that accelerated (ssearch35(_t)) and
non-accelerated (ssearch35s(_t)) versions are available.
Edited Makefiles to provide accelerated ssearch35 more specifically.

Modifications to provide information about annotated residues in the
-m 9c coded output. Previously, -m 9c output added a field:

  =26+9=15-2=9-1=3+1=74-2=3-3=63

after the standard -m 9 output information. With the new version, an
annotated query sequence ( -V '*#' ) adds the field:

  |14:16:#<TM|24:26:#>TA|44:37:*>ST|71:66:#=TT

which indicates that residue 14 in the query sequence aligns with
residue 16 in the target (library) with annotation symbol '#', the
alignment score is '<' less than zero, and the residues are 'T'
(query) and 'M' (library). (The '|' is used to separate each
annotation entry.)

>>Nov. 10, 2007

Parts of p2_complib.c and p2_workcomp.c, and the pvm/mpi Makefiles,
have been updated to be consistent with name changes in the param.h
and structs.h files.

>>Nov. 20, 2007 fa35_02_08

Parts of p2_complib.c and p2_workcomp.c, and the pvm/mpi Makefiles,
have been updated to be consistent with name changes in the param.h
and structs.h files.

>>Nov. 6, 2007 fa35_02_07

Correct problems with asymmetric RNA matrices in initfa.c and rna.mat.

>>Oct. 18, 2007

Correct problem parsing ASN.1 FastaDefLines when the database is local.

Recovering from a misplaced cvs commit of code that was supposed to be
on a branch, code has been recovered from earlier versions (fa35_02_05
because fa35_02_06 has some branch contamination).

>>Oct. 4, 2007 fa35_02_06

Correct error in gap penalties in dropnnw.c. Due to an unfortunate
inconsistency, the gap parameter in FLOCAL_ALIGN (in dropgsw2.c) had a
different meaning than that in almost all the other programs (it was
the sum of gap_open and gap_ext).
The FLOCAL_ALIGN function call was copied for FGLOBAL_ALIGN, even
though the FGLOBAL_ALIGN function used the more conventional gap_open,
gap_ext parameters. Thus, the FGLOBAL_ALIGN call was wrong, as were the
subsequent do_walign() calls in dropnnw.c. dropgsw2.c:FLOCAL_ALIGN has
been modified to use the conventional gap_open parameter, and calls to
dropnnw.c: FGLOBAL_ALIGN() and do_walign() have been fixed.

>>Sept. 20, 2007

Modify the logic used when saving a seq_record *seq_p into beststr
*bbp to ensure that if the seq_record is replaced, it is replaced at
all the places where it is referenced. This involves adding a linked
list into beststr (*bbp->bbp_link). When making the link (and freeing
it up), be certain that the linked seq_p is the same as the one being
replaced.

>>Sept. 18, 2007 fa35_02_05

A relatively obscure problem was found on the SGI platform when
searching a library smaller than 500 sequences (thus requiring some
shuffles). Two bugs were found and corrected; one involved not
allocating aa1shuff with COMP_THR and not doing a m_file_p->ranliba()
before re_getlib(). The second involved destroying a pointer to the
list of seq_records when a sequence was being shuffled. The bugs were
confirmed with Insure, and have been fixed.

>>Sept. 7, 2007 fa35_02_04

Revamp the offset handling code to provide better uniformity between
query and library offsets and coordinate systems.

Fix a problem with load_mmap() to load 64-bit sequence locations
properly on machines with 32-bit integers.

>>Sept. 4, 2007

Modify ncbl2_mlib.c slightly to check to see whether the amino-acid
mapping in blast databases is identical to the FASTA mapping (it
should be). If they are identical, do not re-map the blast amino acid
sequences (potentially a small speed up).

>>Aug. 22, 2007

Change ps_lav.c to lav2ps.c, and add lav2svg.c.
It is now possible to generate a lalign35 HTML output that has both
SVG (lav2svg) and PNG (lav2ps | gs) graphics.

>>Aug. 10, 2007 CVS fa35_02_03

Fix faatran.c:aacmap() bug.

>>Aug. 6, 2007

Extensive restructuring of pssm_asn_subs.c to parse PSSM:2 ASN.1's
downloaded from NCBI WWW PSI-BLAST more robustly.

>>July 25, 2007 CVS fa35_02_02

Change default gap penalties for OPTIMA5 matrix to -20/-2 from -24/-4.

>>July 24, 2007

Correct bugs introduced by adding 'J' - 'J' was initially put before
'X' and '*' in the alphabet, which led to problems because the
one-dimensional lower-triangular pam[] matrices (abl50[], abl62[],
etc) had entries for 'X', and '*', but not for 'J'. By placing 'J'
after the other characters, the problem is resolved.

Modify tatstats.c to accommodate 'J'.

'*' is back in the aascii[] matrix, so that it is present by default
(like fasta34).

>>July 23, 2007

Changes to support sub-sequence ranges for "library" sequences -
necessary for fully functional prss (ssearch35) and lalign35. For all
programs, it is now possible to specify a subset of both the query and
the library, e.g.

  lalign35 -q mchu.aa:1-74 mchu.aa:75-148

Note, however, that the subset range applied to the library will be
applied to every sequence in the library - not just the first - and
that the same subset range is applied to each sequence. This probably
makes sense only if the library contains a single sequence (this is
also true for the query file).

Correct bugs in the functions that produce lav output from lalign35 -m
11 to properly report the begin and end coordinates of both sequences.
Previously, coordinates always began with "1". Correct associated bug
in ps_lav.c that assumed coordinates started with "1".

>>June 29, 2007 CVS fa35_02_01

Merge of HEAD with fasta35 branch.

>>June 29, 2007 CVS fa35_01_06

Add exit(0); to ps_lav.c for 0 return code.

>>June 26, 2007

Add amino-acid 'J' for 'I' or 'L'.

Add Mueller and Vingron (2000) J. Comp. Biol. 7:761-776 VT160 matrix,
"-s VT160", and OPTIMA_5 (Kann et al. (2000) Proteins 41:498-503).

Changes to dropnnw.c documentation functions to remove #ifdef's from
strncpy() - which apparently is a macro in some versions of gcc.

>>June 7, 2007

Modify initfa.c to allow ggsearch35(_t), glsearch35(_t) to use PSSMs.

>>June 5, 2007 CVS fa35_01_05

Modifications to p2_complib.c, p2_workcomp.c to support Intel C
compiler. Fixed bug in p2_workcomp.c - gstring[2][MAX_STR] required -
[MAX_SSTR] too short. mp35comp* programs now tested and working (as
are pv35comp*, c35.work* programs).

Fix problem with fasts/fastm/fastf last_tat.c with limited memory.

Correct problem with lalign35.exe Makefile.nm_[fp]com.

Add $(CFLAGS) to map_db to enable large file support.

Address problem with PSSMs when '*' not defined (initfa.c:extend_pssm()).

>>May 30, 2007 CVS fa35_01_04

Complete work on ps_lav, which converts an lalign35 lav (-m 11) file
into a postscript plot, which looks identical to the plots produced by
plalign from fasta2. (ps_lav has been replaced by lav2ps and lav2svg.)

>>May 25,29, 2007

Changes to defs.h, doinit.c, mshowalign.c for -m 11, which produces lav
output only for lalign35.

Changes to comp_lib2.c to add m_msg.std_output, which provides all the
standard print lines. This is turned off for -m 11 (lav) output.
lalign35 -m 11 provides standard lav output, with the addition of
#lalign35 -q ... .

>>May 18, 2007

Add m_msg.zsflag to preserve pst.zsflag when reset by global/global
exclusion of many library sequences.

>>May 9, 2007 CVS fa35_01_03

Tested local database size determination with p2_complib2/p2_workcomp2.

>>May 2, 2007 renamed fasta35, pv35comp, etc

Separate thread buffer structures from param.h.

Problems with incorrect alignments have been fixed by re-initializing the
best_seqs and lib_buf2_list.buf2 structures after each query sequence.

The labels on the alignment scores are much more informative (and more
diverse). In the past, alignment scores looked like:

>>gi|121716|sp|P10649|GSTM1_MOUSE Glutathione S-transfer (218 aa)
 s-w opt: 1497 Z-score: 1857.5 bits: 350.8 E(): 8.3e-97
Smith-Waterman score: 1497; 100.0% identity (100.0% similar) in 218 aa overlap (1-218:1-218)
^^^^^^^^^^^^^^

where the highlighted text was either: "Smith-Waterman" or "banded
Smith-Waterman". In fact, scores were calculated in other ways,
including global/local for fasts and fastf. With the addition of
ggsearch35, glsearch35, and lalign35, there are many more ways to
calculate alignments: "Smith-Waterman" (ssearch and protein fasta),
"banded Smith-Waterman" (DNA fasta), "Waterman-Eggert",
"trans. Smith-Waterman", "global/local", "trans. global/local",
"global/global (N-W)". The last option is a global/global alignment,
but with the affine gap penalties used in the Smith-Waterman
algorithm.

>>April 24, 2007

The new program structure has been migrated to the PVM and MPI
versions. In addition, the new global algorithms (pv35compgg,
pv35compgl) have been moved, though the PVM/MPI versions do not
(yet) do the appropriate size filtering.

>>April 19, 2007

Two new programs, ggsearch35(_t) and glsearch35_t are now available.
ggsearch35(_t) calculates an alignment score that is global in the
query and global in the library; glsearch35_t calculates an alignment
that is global in the query and local in the library sequence. The
latter program is designed for global alignments to domains.

Both programs assume that scores are normally distributed.
This appears to be an excellent approximation for ggsearch35 scores, but
the distribution is somewhat skewed for global/local (glsearch)
scores. ggsearch35(_t) only compares the query to library sequences
that are between 80% and 125% of the length of the query; glsearch
limits comparisons to library sequences that are longer than 80% of
the query. Initial results suggest that there is relatively little
length dependence of scores over this range (scores go down
dramatically outside these ranges).

A bug was found and fixed in showalign() and showbest() where the
aa1save buffer was not preserved when some sequences needed to be
re-read, while others were stored in the beststr.

>>April 9, 2007

Some of the drop*.c functions have been reconfigured to reduce the
amount of duplicate code. For example, dropgsw.c, dropnsw.c, and
dropnfa.c all used exactly the same code to produce global alignments
(NW_ALIGN() and nw_align()); this code is now in wm_align.c.
Likewise, those same files, as well as dropgw2.c, use identical code
to produce consensus alignments (calcons(), calcons_a(), calc_id(),
calc_code()). Rather than working with three or four copies of
identical code, there is now one version.

>>March 29, 2007

At last, the lalign (SIM) algorithm has been moved from FASTA21 to
FASTA35. Currently, only lalign35 is available. A plotting version
will be available shortly (or perhaps a more general solution that
produces lav output).

The statistical estimates for lalign35 should be much more accurate
than those from the earlier lalign, because lambda and K are estimated
from shuffles.

Many functions have been modified to reduce the number of times
structures are passed as arguments, rather than pointers.
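
As an illustration of the structure-versus-pointer change just
described, here is a minimal sketch. The names struct demo_params,
score_by_value() and score_by_pointer() are hypothetical and are not
taken from the FASTA source; the point is only that passing a pointer
avoids copying a large structure on every call:

```c
#include <assert.h>

#define DEMO_NSQ 64   /* hypothetical alphabet size, for illustration only */

/* Hypothetical parameter block; this is NOT the FASTA pstruct. */
struct demo_params {
    int gap_open;
    int gap_ext;
    int pam2[DEMO_NSQ][DEMO_NSQ];   /* roughly 16 KB with 4-byte ints */
};

/* Pass by value: the whole structure is copied for every call. */
static int score_by_value(struct demo_params p, int a, int b)
{
    assert(a >= 0 && a < DEMO_NSQ && b >= 0 && b < DEMO_NSQ);
    return p.pam2[a][b] + p.gap_open;
}

/* Pass by pointer: only an address is passed; const marks read-only use. */
static int score_by_pointer(const struct demo_params *p, int a, int b)
{
    assert(a >= 0 && a < DEMO_NSQ && b >= 0 && b < DEMO_NSQ);
    return p->pam2[a][b] + p->gap_open;
}
```

Both functions return the same value for the same inputs; the second
simply avoids the per-call copy, which matters for functions called
once per library sequence.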

>>February 23, 2007

The threading strategy has been modified slightly to separate the end
of the search phase (and a complete reading of all results buffers)
from the termination phase. This will allow future threading of
subsequent phases, including the Smith-Waterman alignments in
showbest() and showalign() (though care will be required to ensure
that the results are presented in the correct order).

>>February 20, 2007 fasta-34_27_0 (released as fasta-35_1)

The FASTA programs have been restructured to reduce the differences
between the threaded and unthreaded versions (and ultimately the
parallel versions) and to make more efficient use of modern large
memory systems. This is the beginning of a move towards a more robust
shuffling strategy when searching databases with modest numbers of
related sequences.

The major changes:

  comp_lib.c -> comp_lib2.c - comp_lib.c will be removed
  work_thr.c -> work_thr2.c - work_thr.c will be removed

  mshowbest.c, mshowalign.c have been modified to remove aa1 as an
  argument. They must allocate that space if they need it.

  The system is set up to allocate a substantial amount of library
  sequence memory, either to a single buffer (unthreaded) or to the
  threaded buffer pool. For smaller databases, the library sequences
  are read once, and then subsequently read from memory (this could be
  extended for RANLIB(bline) as well).

Soon, these changes will allow the program to re-read the beststr[]
sequences and shuffle them to produce accurate lambda/K estimates.

================================================================

See readme.v34t0 for earlier changes.

================================================================