1 2Version 2.1.11 (4 June 2003) 3 - corrected error where fasta header line is 2 chars or less long, e.g. '>1' 4 5Version 2.1.9 (jun 02) 6 - fixed Nexus parsing of "agc/..." common base patterns, and nchars/nspp reading 7 - fixed fasta -> embl/other documented formats bug (reset seqdoc variable b/n entries read) 8 - added obscure option to turn off interleave format for interleave output writes: 9 use '-inter[line]=0' command option 10 11 12Version 2.1.0 (25 May 2001) updates: 13 Added -reverse option for reverse-complement of sequence 14 Feature extraction of complement() locations now does reverse-complement 15 Added feature subrange extraction 16 Added ClustalW alignment, AceDB, ? SwissProt sequence formats 17 Added GFF/General-Feature-Format input/output 18 Added FFF/Flat-Feature-Format input/output (one-line DDBJ/EMBL/GenBank Feature Table) 19 Various bug fixes; Java 1.2/3 compatibility 20 21 22 -subrange=-1000..10 extract subrange of sequence for feature locations 23 -subrange=1..end e.g., 1..end is full feature range, 24 -subrange=end-10..99 -1000...10 is 1000 bases upstream/before to 10 in feature, 25 end-10..99 is 10 before end to 99 after end of feature 26 only valid with -feat/-nofeat 27 28Need to handle dna files in +300 MB range for genomes -- use something like FileBioseq() 29class which spools seq to/from disk. ? Ditto for genome feature tables - something 30like gnomap index feaures.tsv (have 150,000 features, 280 MB dna in man-chr1 as of may01) 31 32* to make 'very raw' output format, one line of sequence per sequence record, 33use format=raw width=-1 (or width=1999999999) 34 35? add HTML output (feature hyperlinks? colorized feat in seq) 36 37!! test paired seq ( .fff + .seq, .gff + .seq) merge option: 38 else if ( key.equals("pair-feature-seq") ) doPairedDocNSeq= boolval; 39 else if ( key.startsWith("pair") ) doPairedDocNSeq= boolval; //? 40 41x modify Reader before writeSeqRecord calls to create new output seqRecord for each 42 seqdoc feature indicated as separare record (e.g. source, gene, ...) 43 44!? add in missing formats handled by others - emboss at least 45 x * ClustalW - Clustal ALN format seqReadClustal 46 x * ACEDB format 47 * Fasta variants - NCBI w/ id acc and description 48 x * Does SWISSPROT differ from EMBL -- YES diff feat tab, other docs need subclass of EmblDoc, format handler 49 *? Fix also for GenBank-> amino GenPept, RefProt 50 ? Staden experiment file format -- current one is EMBL variant (old one is IG-like) 51 ? ASN.1 fix 52 ? PAUP fix 53 is PIR == NBRF or CODATA (emboss does both/ european PIR == NBRF 54 ? seqReadHennig86 55 56=============== 57 58 59From frist@cc.UManitoba.CA Sat May 27 11:42:36 2000 60Date: Sat, 27 May 2000 11:42:35 -0500 (CDT) 61From: frist <frist@cc.UManitoba.CA> 62Subject: readseq 2.0.8 63To: gilbertd@bio.indiana.edu 64MIME-Version: 1.0 65Content-MD5: TVztJn3WgTK1YI0yNskonw== 66 67Don, 68 69I've been trying readseq 2.0.8, and I have some problems 70to report. For your reference, I am using the jre from 71jdk1.2.2 on a Sun Solaris 2.6 system. 72 73>From my point of view, the biggest show stopper is the 74-item switch, which doesn't work at all. 75 76{goad:/home/plants/frist/testreadseq}readseqj -i1 -o junk CHS.gen 77Readseq version 2.0.8 (18 Jan 2000) 78java.lang.NullPointerException 79 at iubio.readseq.BioseqWriter.setSeq(Compiled Code) 80 at iubio.readseq.run.run(Compiled Code) 81 at iubio.readseq.run.<init>(Compiled Code) 82 at run.main(Compiled Code) 83 84^^^ fixed 85 86 87 88Further testing seems to show that -i crashes the program 89no matter how many sequences are in the input. I have 90tried -i reading several file formats (fasta, pir, GenBank), 91and several -i values. It crashes whenever used. 92 93Since I use -i for extracting individual sequences 94from temproary files using GDE, this will prevent 95upgrading to use the new readseq. 96 97Other notes: 98 99Phylip - With GenBank entries of variable length, output 100to Phylip interleaved format (phylip) results in infinite 101loop. Although it's not legitimate to try to write 102in interleaved format when sequences are not all the 103same length, perhaps readseq should give some sort of 104warning. 105 106^^^?? doesn't hang for me -- need example data 107 108Phylip2 - header line lists number of sequences and length as 109both being 0. If output format is phylip2, it would be 110necessary to make two passes through the data, one 111to determine numbers of sequences and lengths, and the other 112to generate the output. 113 114^^^ fixed 115 116msf - goes into infinite loop for input GenBank and output msf 117 118^^^?? doesn't hang for me -- need example data 119 120blast - with either Pearson or GenBank input, output to 121blast format crashes readseq. Same results with -fblast or -f19 122 123^^^ an input only format 124^^^ 'jre -cp readseq.jar help' will list all formats, read/write ability 125 126 127 128{goad:/home/plants/frist/testreadseq}readseqj aligned.pearson -fblast -o 129test.blast 130Readseq version 2.0.8 (18 Jan 2000) 131java.io.IOException: Null BioseqWriter 132 at java.lang.Throwable.<init>(Compiled Code) 133 at java.lang.Exception.<init>(Compiled Code) 134 at java.io.IOException.<init>(Compiled Code) 135 at iubio.readseq.run.run(Compiled Code) 136 at iubio.readseq.run.<init>(Compiled Code) 137 at run.main(Compiled Code) 138^^^ not a crash, the program is saying it doesn't have a Writer for that 139format -- made error report clearer 140 141 142scf - crashes with scf. I have no idea what this format is, 143so I can't really evaluate this option. 144^^^ an input only format - Standard Chromatagram Format (which includes sequence data) 145 146Readseq version 2.0.8 (18 Jan 2000) 147java.io.IOException: Null BioseqWriter 148 at java.lang.Throwable.<init>(Compiled Code) 149 at java.lang.Exception.<init>(Compiled Code) 150 at java.io.IOException.<init>(Compiled Code) 151 at iubio.readseq.run.run(Compiled Code) 152 at iubio.readseq.run.<init>(Compiled Code) 153 at run.main(Compiled Code) 154 155Let me know if you need further information. 156 157 158=============================================================================== 159Brian Fristensky | 160Department of Plant Science | Sun Microsystems CEO Scott McNealy, 161University of Manitoba | when asked recently about Microsoft, 162Winnipeg, MB R3T 2N2 CANADA | replied "Which one?" 163frist@cc.umanitoba.ca | 164Office phone: 204-474-6085 | source: ZDNet News 165FAX: 204-474-7528 | 166http://home.cc.umanitoba.ca/~frist/ 167=============================================================================== 168 169//// 17018 Jan 2000 - fixed subtle GCG-msdos bug (eof triggered exception on double newline) 171 172 17311 Dec 99 - added document field extraction/removal, as per feature ex/rm 174(needs testing) 175 176From rls@ebi.ac.uk Thu Oct 28 07:50:55 1999 177Date: Thu, 28 Oct 1999 13:53:06 +0100 178From: Rodrigo Lopez <rls@ebi.ac.uk> 179 180convert "genpept" sequences to embl properly... 181when using peptide sequences it does not write a SQ line. 182 183Also, it would be very usefull if command-line switches could be 184added to skip certain records from an entry (i.e. the COMMENT lines - 185CC). 186 187 188 189//////// 190 191From small@versailles.inra.fr Wed Sep 1 07:51:43 1999 192Date: Wed, 1 Sep 1999 14:52:45 +0200 (MET DST) 193Mime-Version: 1.0 194To: gilbertd@bio.indiana.edu 195From: Ian Small <small@versailles.inra.fr> 196Subject: readseq 197 198 I've just started using your Java version of the readseq package to 199provide an easy way to read in multiple formats into my program. Basically, 200I just unjarred readseq.jar and included all the packages in my jar file. 201Everything works fine on my Mac (MRJ 2.14 i.e. c. JDK 1.1.7ish) and I'm 202very happy with the result (thanks for all the hard work !). However, under 203Linux (JDK 1.2) or Windows95 (JDK1.3b), I get the following error message: 204 205Exception in thread "main" java.lang.VerifyError: (class: 206iubio/readseq/MsfSeqFormat, method: formatTestLine signature: 207(Lflybase/OpenString;II)Z) Incompatible argument to function 208 at java.lang.Class.forName0(Native Method) 209 at java.lang.Class.forName(Class.java:124) 210 at iubio.readseq.BioseqFormats.loadClasses(BioseqFormat.java) 211 at iubio.readseq.BioseqFormats.<clinit>(BioseqFormat.java) 212 at iubio.readseq.Readseq.<init>(readseq.java) 213 at Predotar.handleOpenFile(Predotar.java) 214 at Predotar.main(Predotar.java) 215 216Readseq version 2.0.5 (25 August 1999) 217Argument: 'data/5srna.msf', key: data/5srna.msf = null 218Free/total memory at start: 570136/867984: use 297848 bytes 219iubio.readseq.run -- starting 220Writing to data/5srna.msf.fasta 221checkInString 'data/5srna.msf' is file. 222Reading from data/5srna.msf 223_exceptionOccurred: java.lang.NullPointerException () 224java.lang.NullPointerException 225 at flybase.OpenString.indexOf(Compiled Code) 226 at iubio.readseq.Asn1SeqFormat.formatTestLine(seqread1f.java) 227 at iubio.readseq.Testseq.testFormat(Compiled Code) 228 at iubio.readseq.Readseq.isKnownFormat(readseq.java) 229 at iubio.readseq.run.run(Compiled Code) 230 at iubio.readseq.run.<init>(readseqrun.java) 231 at run.main(readseqmain.java) 232 at com.apple.mrj.JManager.JMStaticMethodDispatcher.run(JMAWTContextImpl.java) 233 at java.lang.Thread.run(Thread.java) 234 235/// 236 237 238app() -- swing 239 http://java.sun.com/products/jfc/download.html, good for java v 1.1.6 (?) + 240 if swingall.jar not available - trap exception and message 241 242 243//!! test interleave i/o 244 245/// � add classic readseq command-line processing 246/// � add window/gui/swing readseq 247 248/// � add BLAST output parsing (non-pairwise versions...) 249/// fix, test NEXUS/ PAUP reader 250/// XML � parser & � output? 251 252///? GCG seq formats update?? 253///? GDE format ? 254///? abi trace files, scf files ? staden format (?) and scf 255///? other new, popular seq formats? -? Sequencher, DNAstar, ??? 256 257///? see NCBI toolkit parser program for some format info 258 259/// Genbank <-> EMBL doc : 260 need to fiddle with field values to get correct match 261 embl uses ';' a lot as line ends - drop for gb, add from gb 262 Ref start: em: RN [3] gb: REFERENCE 3 (bases 1 to 1566) 263 Source/Organism lines differ: gb: Source=common name Organism=spp name /n taxa 264 em: OS= spp name (common name) OC=taxa 265 Locus line: need parsing of parts 266 Base count line: needs proper base counts, or parsing of items 267 gb: BASE COUNT 276 a 246 c 295 g 262 t 268 em: SQ Sequence 977 BP; 316 A; 188 C; 175 G; 298 T; 0 other; 269 270 271 ?? do I want to fix up current BioseqDoc to those details, or rewrite structure 272 (as per DOM?) before that? 273 274 275 276 277/// GCG seq formats update 278>> parse comments like -- at least skip lines starting with !! 279!!NA_MULTIPLE_ALIGNMENT 1.0 280!!AA_MULTIPLE_ALIGNMENT 1.0 281!!SEQUENCE_LIST 1.0 282!!RICH_SEQUENCE 1.0 283Multiple Sequence Format (MSF) and Rich Sequence Format (RSF) 284Files 285 286Reformat can be used to convert between MSF, RSF, single 287sequence format and list files. When single sequence 288files are specified using a list file, any sequence 289attributes specified in the list file (e.g. begin and end 290ranges) are ignored during the conversion to the new file 291type. When converting from an RSF file any sequence 292features are lost. Access to sequence features is 293currently available only from within SeqLab. (Refer to 294Chapter 2 of the Users' Guide, Using Sequence Files and 295Databases, for details. See "Using Multiple Sequence 296Format (MSF) Files", "Using Rich Sequence Format (RSF) 297Files", and "Using List Files" for information about list 298files.) 299 300 301/// fix, test NEXUS, PAUP reader 302 303 304/// other new, popular seq formats? -? Sequencher DNAstar ??? 305 306 307/// abi trace files, scf files ? 308/// staden format (?) and scf 309 310/// GDE format ? 311 312 313/// ? XML parser ? HMTL parser/de-frasser 314 315 316 317/// BLAST output parsing.... 318qblast The request ID is : 933566842-17349-12684 319 320From sauder@glinka.fccc.edu Wed Jun 23 16:56:08 1999 321Date: Wed, 23 Jun 1999 17:58:03 -0400 (EDT) 322From: sauder@glinka.fccc.edu (J. Michael Sauder) 323To: Don Gilbert <gilbertd@bio.indiana.edu> 324Subject: Re: SeqPup and Blast 325 326> I'll add your request for blast import; I can't promise 327 328 I already have Perl code that converts Blast to PIR format. 329Would this be helpful? 330 331-- Mike S. (m_sauder@fccc.edu) http://www.fccc.edu/research/labs/dunbrack 332 phone: 215.728.5661 FAX: 215.728.3574 or 2412 333 334{Mail}& 335Message 218: 336From sauder@glinka.fccc.edu Fri Jun 25 14:18:54 1999 337Date: Fri, 25 Jun 1999 15:20:48 -0400 (EDT) 338From: sauder@glinka.fccc.edu (J. Michael Sauder) 339To: Don Gilbert <gilbertd@bio.indiana.edu> 340Subject: Re: SeqPup and Blast 341X-Status: $$$$ 342X-UID: 343 344> Sure -- send on your code, it may make it easier for me to add blast 345> format. 346 347 Here's the code. 348It assumes that Blast was performed with the -m 4 or -m 6 option, which 349is NOT the default. I'm going to start working on a version that will 350translate the default (-m 0) output. This version can handle PSI-Blast 351output, and only keeps the sequences from the last iteration. 352 353---------------------------------------------------------------------------- 354#!/usr/bin/perl 355 356# BLASTm4FASTA.perl 357 358# Usage: blastm4fasta.perl file.blast > file.fasta 359 360# Written by J. Michael Sauder m_sauder@fccc.edu 361# Fox Chase Cancer Center 6-24-99 362 363# Convert blast -m 4 or 6 output to FASTA format (or clustalw PIR format) 364# (blastpgp -m 1, 2, 3, or 5 may produce spurious output) 365# Output format defined by $format (FASTA or PIR). 366# All sequences will be the same length, padded with either X's or 367# dashes, depending on the output format (FASTA or PIR). 368# Keeps only the last iteration in PSI-Blast output. 369# The "Query= " line must exist, as well as " Database:" at the end. 370 371 372$format = "fasta"; # output format 373# $format = "pir"; 374$m = 4; # default blast -m alignment view output option 375 # only -m 4 or 6 should be used with this program. 376 # This doesn't need to be modified. The program will 377 # identify whether -m 4 or -m 6 was used. 378$width = 70; # width of output sequence 379 380$qseq = ""; %qseqs=(); 381%hseq = (); %count = (); 382$hitnum=0; $found=0; $npad=0; $hlen=0; 383 384# Subroutine to print sequences $width characters wide 385sub printseq { 386 $seq = $_[0]; 387 $length=length($seq); 388 $ncol=int($length/$width); 389 for ($i=0; $i<=$ncol; $i++) { 390 print substr($seq,$i*$width,$width),"\n"; 391 } 392} 393 394# $file = $ARGV[0]; 395 396while(<>) { 397 chop; 398 if (/^Query= /) { # Get query name 399 @line = split; 400 $query = $line[1]; 401 next; 402 } 403 if (/^Searching/) { # handle multiple PSI-Blast iterations 404 %hseq=(); # keep only last iteration 405 $qseq=""; %qseqs=(); # by clearing everything 406 %count=(); 407 $hitnum=0; $found=0; $npad=0; $hlen=0; 408 } 409 if (/^QUERY/) { # Query sequence 410 @line = split; 411 $name = $line[0]; 412 $qstart = $line[1]; 413 if ($qseq eq "") { 414 $qfirst = $qstart; # $qfirst is probably 1 415 } 416# $qend = $line[3]; 417 $qlen = length($line[2]); 418 $qseq .= $line[2]; # append query sequence 419 $qseqs{$name} .= $line[2]; # also store in %qseqs 420 $found=1; # flag to identify sequence region of blast output 421 next; 422 } 423 424 # Stop trying to read sequences once this line is found 425 # at the end of the blast output 426 if (/^ Database: /) { $found = 0; next } 427 428 if ($found==1) { 429 if ($_ eq "") { next } 430 @line = split; 431 $name = $line[0]; 432# $hstart = $line[1]; 433 $hlen = length($line[2]); 434 if ($hlen==0) { # created with blast -m 6 435 $m=6; 436 $hlen = length($line[1]); 437 } 438 else { $m=4 } 439 440 # Add N-padding X's or dashes to match query sequence length 441 if (!$hseq{$name}) { 442 $npad = ($qlen - $hlen) + ($qstart - $qfirst); # -1 ? 443 if ($format eq "fasta") { $pad = 'X' x $npad } 444 elsif ($format eq "pir") { $pad = '-' x $npad } 445 if ($m==6) { $npad=0; $pad="" } # -m 6 is already padded 446 if ($npad>0) { $hseq{$name} = $pad } 447 $count{$hitnum} = $name; 448 $hitnum++; # Increment counter for next new sequence 449 } 450 if ($m==4) { $hseq{$name} .= $line[2] } 451 if ($m==6) { $hseq{$name} .= $line[1] } 452 } 453} 454 455$querylen = length($qseqs{QUERY}); 456 457if ($format eq "pir") { print ">$query\n\n" } 458if ($format eq "fasta") { print "\>$query $querylen\n" } 459if ($format eq "fasta") { $qseqs{QUERY} =~ s/\-/X/g } 460printseq($qseqs{QUERY}); # format and print sequence 461if ($format eq "pir") { print "\*\n" } 462 463 464# Preserve original alignment order by using %count instead of %hseq 465 466foreach $hitnum (sort keys %count) { 467 $name = $count{$hitnum}; 468 469 # Add trailing X's or dashes if necessary 470 $hlen = length($hseq{$name}); 471 $npad = $querylen-$hlen; 472 if ($format eq "fasta") { $pad = 'X' x $npad} 473 elsif ($format eq "pir") { $pad = '-' x $npad} 474 if ($npad>0) { $hseq{$name} .= $pad } 475 476 $hitlen = length($hseq{$name}); 477 478 if ($format eq "pir") { print ">$name\n\n" } 479 if ($format eq "fasta") { print "\>$name $hitlen\n" } 480 if ($format eq "fasta") { $hseq{$name} =~ s/\-/X/g } 481 printseq($hseq{$name}); 482 if ($format eq "pir") { print "\*\n" } 483} 484 485# Sample input from blastpgp -m 4 486# 487# Query= protein1 488# Database: nr 489# 490# Searching..................................................done 491# Results from round 1 492# ... [cut] 493# 494# Searching..................................................done 495# Results from round 2 496# 497# QUERY 1 MVTENPQRLTVLRLATNKGPLAQIWLASNMSN-IPRGSVIQTHIAESAKEIAKASGCDDE 59 498# S50979 1 MVTENPQRLTVLRLATNKGPLAQIWLASNMSN-IPRGSVIQTHIAESAKEIAKASGCDDE 59 499# CAA88356 1 MVTENPQRLTVLRLATNKGPLAQIWLASNMSN-IPRGSVIQTHIAESAKEIAKASGCDDE 59 500# P30776 7 ILSKKGPLAKVWLAAHWEKKLSKVQTLHTSIEQSVHAIV-------- 45 501# 4140710 7 ILAKKGPLARIWLAAHWDKKITKAHVFETNIEKSVEGILQPK----- 48 502# AAD33593.1 7 ILAKKGPLARIWLAAHWDKKITKAHVFETNIEKSVEGILQPK----- 48 503# 1022971 64 SKKGPLSKVWLAAHWEKKLSKAQIFETDVDEAVNEIMQPS----- 10 504# CAA66939 9 SKRGPLAKIWLAAHWDK-----KLTKAHVFECNLE----SSVESI 44 505# 506# ... [cut] 507# 508# QUERY 523 EASRGFFDILSLATEGCIGLSQTEAFGNIKIDAKPALFERFINA 566 509# S50979 523 EASRGFFDILSLATEGCIGLSQTEAFGNIKIDAKPALFERFINA 566 510# P40457 407 ERTKSLEHELKRSTE 421 511# BAA18707 139 STN 141 512# Database: nr 513 514 515# Sample input from blastpgp - m 6 516# 517# QUERY 1 MVTENPQRLTVLRLATNKGPLAQIWLASNMSN-IPRGSVIQTHIAESAKEIAKASGCDDE 59 518# S50979 1 MVTENPQRLTVLRLATNKGPLAQIWLASNMSN-IPRGSVIQTHIAESAKEIAKASGCDDE 59 519# CAA88356 1 MVTENPQRLTVLRLATNKGPLAQIWLASNMSN-IPRGSVIQTHIAESAKEIAKASGCDDE 59 520# P30776 7 -------------ILSKKGPLAKVWLAAHWEKKLSKVQTLHTSIEQSVHAIV-------- 45 521# 4140710 7 -------------ILAKKGPLARIWLAAHWDKKITKAHVFETNIEKSVEGILQPK----- 48 522# AAD33593.1 7 -------------ILAKKGPLARIWLAAHWDKKITKAHVFETNIEKSVEGILQPK----- 48 523# 1022971 64 ---------------SKKGPLSKVWLAAHWEKKLSKAQIFETDVDEAVNEIMQPS----- 10 524# CAA66939 9 ---------------SKRGPLAKIWLAAHWDK-----KLTKAHVFECNLE----SSVESI 44 525# Q46089 ------------------------------------------------------------ 526# 527# ... [cut] 528# 529# QUERY 523 EASRGFFDILSLATEGCIGLSQTEAFGNIKIDAKPALFERFINA 566 530# S50979 523 EASRGFFDILSLATEGCIGLSQTEAFGNIKIDAKPALFERFINA 566 531# CAA88356 284 -------------------------------------------- 285 532# P30776 121 -------------------------------------------- 122 533# 4140710 89 -------------------------------------------- 90 534# AAD33593.1 89 -------------------------------------------- 90 535# 1022971 145 -------------------------------------------- 146 536# CAA66939 89 -------------------------------------------- 90 537# Q46089 458 -------------------------------------------- 459 538---------------------------------------------------------------------------- 539 540 541-- Mike S. (m_sauder@fccc.edu) http://www.fccc.edu/research/labs/dunbrack 542 phone: 215.728.5661 FAX: 215.728.3574 or 2412 543