1
2Version 2.1.11 (4 June 2003)
3  - corrected error where fasta header line is 2 chars or less long, e.g. '>1'
4
5Version 2.1.9 (jun 02)
6  - fixed Nexus parsing of "agc/..." common base patterns, and nchars/nspp reading
7  - fixed fasta -> embl/other documented formats bug (reset seqdoc variable b/n entries read)
8  - added obscure option to turn off interleave format for interleave output writes:
9  	  use '-inter[line]=0' command option
10
11
12Version 2.1.0 (25 May 2001) updates:
13  Added -reverse option for reverse-complement of sequence
14  Feature extraction of complement() locations now does reverse-complement
15  Added feature subrange extraction
16  Added ClustalW alignment, AceDB, ? SwissProt sequence formats
17  Added GFF/General-Feature-Format input/output
18  Added FFF/Flat-Feature-Format input/output (one-line DDBJ/EMBL/GenBank Feature Table)
19  Various bug fixes; Java 1.2/3 compatibility
20
21
22    -subrange=-1000..10   extract subrange of sequence for feature locations
23    -subrange=1..end      e.g., 1..end is full feature range,
24    -subrange=end-10..99  -1000...10 is 1000 bases upstream/before to 10 in feature,
25    									 	  end-10..99 is 10 before end to 99 after end of feature
26                          only valid with -feat/-nofeat
27
28Need to handle dna files in +300 MB range for genomes -- use something like FileBioseq()
29class which spools seq to/from disk. ? Ditto for genome feature tables - something
30like gnomap index feaures.tsv (have 150,000 features,  280 MB dna in man-chr1 as of may01)
31
32* to make 'very raw' output format, one line of sequence per sequence record,
33use  format=raw  width=-1 (or width=1999999999)
34
35? add HTML output (feature hyperlinks? colorized feat in seq)
36
37!! test paired seq ( .fff + .seq, .gff + .seq) merge option:
38		else if ( key.equals("pair-feature-seq") ) doPairedDocNSeq= boolval;
39		else if ( key.startsWith("pair") ) doPairedDocNSeq= boolval;  //?
40
41x modify Reader before writeSeqRecord calls to create new output seqRecord for each
42  seqdoc feature indicated as separare record (e.g. source, gene, ...)
43
44!? add in missing formats handled by others - emboss at least
45  x * ClustalW - Clustal ALN format seqReadClustal
46	x * ACEDB format
47	* Fasta variants - NCBI w/ id acc and description
48	x * Does SWISSPROT differ from EMBL -- YES diff feat tab, other docs need subclass of EmblDoc, format handler
49	*? Fix also for GenBank-> amino GenPept, RefProt
50  ? Staden experiment file format  -- current one is EMBL variant (old one is IG-like)
51  ? ASN.1 fix
52  ? PAUP fix
53  is PIR == NBRF or CODATA (emboss does both/ european PIR == NBRF
54	? seqReadHennig86
55
56===============
57
58
59From frist@cc.UManitoba.CA  Sat May 27 11:42:36 2000
60Date: Sat, 27 May 2000 11:42:35 -0500 (CDT)
61From: frist <frist@cc.UManitoba.CA>
62Subject: readseq 2.0.8
63To: gilbertd@bio.indiana.edu
64MIME-Version: 1.0
65Content-MD5: TVztJn3WgTK1YI0yNskonw==
66
67Don,
68
69I've been trying readseq 2.0.8, and I have some problems
70to report. For your reference, I am using the jre from
71jdk1.2.2 on a Sun Solaris 2.6 system.
72
73>From my point of view, the biggest show stopper is the
74-item switch, which doesn't work at all.
75
76{goad:/home/plants/frist/testreadseq}readseqj -i1 -o junk CHS.gen
77Readseq version 2.0.8 (18 Jan 2000)
78java.lang.NullPointerException
79        at iubio.readseq.BioseqWriter.setSeq(Compiled Code)
80        at iubio.readseq.run.run(Compiled Code)
81        at iubio.readseq.run.<init>(Compiled Code)
82        at run.main(Compiled Code)
83
84^^^ fixed
85
86
87
88Further testing seems to show that -i crashes the program
89no matter how many sequences are in the input. I have
90tried -i reading several file formats (fasta, pir, GenBank),
91and several -i values. It crashes whenever used.
92
93Since I use -i for extracting individual sequences
94from temproary files using GDE, this will prevent
95upgrading to use the new readseq.
96
97Other notes:
98
99Phylip - With GenBank entries of variable length, output
100to Phylip interleaved format (phylip) results in infinite
101loop. Although it's not legitimate to try to write
102in interleaved format when sequences are not all the
103same length, perhaps readseq should give some sort of
104warning.
105
106^^^?? doesn't hang for me -- need example data
107
108Phylip2 - header line lists number of sequences and length as
109both being 0. If output format is phylip2, it would be
110necessary to make two passes through the data, one
111to determine numbers of sequences and lengths, and the other
112to generate the output.
113
114^^^ fixed
115
116msf - goes into infinite loop for input GenBank and output msf
117
118^^^?? doesn't hang for me -- need example data
119
120blast - with either Pearson or GenBank input, output to
121blast format crashes readseq. Same results with -fblast or -f19
122
123^^^ an input only format
124^^^ 'jre -cp readseq.jar help' will list all formats, read/write ability
125
126
127
128{goad:/home/plants/frist/testreadseq}readseqj aligned.pearson -fblast  -o
129test.blast
130Readseq version 2.0.8 (18 Jan 2000)
131java.io.IOException: Null BioseqWriter
132        at java.lang.Throwable.<init>(Compiled Code)
133        at java.lang.Exception.<init>(Compiled Code)
134        at java.io.IOException.<init>(Compiled Code)
135        at iubio.readseq.run.run(Compiled Code)
136        at iubio.readseq.run.<init>(Compiled Code)
137        at run.main(Compiled Code)
138^^^ not a crash, the program is saying it doesn't have a Writer for that
139format -- made error report clearer
140
141
142scf - crashes with scf. I have no idea what this format is,
143so I can't really evaluate this option.
144^^^ an input only format - Standard Chromatagram Format (which includes sequence data)
145
146Readseq version 2.0.8 (18 Jan 2000)
147java.io.IOException: Null BioseqWriter
148        at java.lang.Throwable.<init>(Compiled Code)
149        at java.lang.Exception.<init>(Compiled Code)
150        at java.io.IOException.<init>(Compiled Code)
151        at iubio.readseq.run.run(Compiled Code)
152        at iubio.readseq.run.<init>(Compiled Code)
153        at run.main(Compiled Code)
154
155Let me know if you need further information.
156
157
158===============================================================================
159Brian Fristensky                |
160Department of Plant Science     |  Sun Microsystems CEO Scott McNealy,
161University of Manitoba          |  when asked recently about Microsoft,
162Winnipeg, MB R3T 2N2  CANADA    |  replied "Which one?"
163frist@cc.umanitoba.ca           |
164Office phone:   204-474-6085    |  source: ZDNet News
165FAX:            204-474-7528    |
166http://home.cc.umanitoba.ca/~frist/
167===============================================================================
168
169////
17018 Jan 2000 - fixed subtle GCG-msdos bug (eof triggered exception on double newline)
171
172
17311 Dec 99 - added document field extraction/removal, as per feature ex/rm
174(needs testing)
175
176From rls@ebi.ac.uk  Thu Oct 28 07:50:55 1999
177Date: Thu, 28 Oct 1999 13:53:06 +0100
178From: Rodrigo Lopez <rls@ebi.ac.uk>
179
180convert "genpept" sequences to embl properly...
181when using peptide sequences it does not write a SQ line.
182
183Also, it would be very usefull if command-line switches could be
184added to skip certain records from an entry (i.e. the COMMENT lines -
185CC).
186
187
188
189////////
190
191From small@versailles.inra.fr  Wed Sep  1 07:51:43 1999
192Date: Wed, 1 Sep 1999 14:52:45 +0200 (MET DST)
193Mime-Version: 1.0
194To: gilbertd@bio.indiana.edu
195From: Ian Small <small@versailles.inra.fr>
196Subject: readseq
197
198        I've just started using your Java version of the readseq package to
199provide an easy way to read in multiple formats into my program. Basically,
200I just unjarred readseq.jar and included all the packages in my jar file.
201Everything works fine on my Mac (MRJ 2.14 i.e. c. JDK 1.1.7ish) and I'm
202very happy with the result (thanks for all the hard work !). However, under
203Linux (JDK 1.2) or Windows95 (JDK1.3b), I get the following error message:
204
205Exception in thread "main" java.lang.VerifyError: (class:
206iubio/readseq/MsfSeqFormat, method: formatTestLine signature:
207(Lflybase/OpenString;II)Z) Incompatible argument to function
208        at java.lang.Class.forName0(Native Method)
209        at java.lang.Class.forName(Class.java:124)
210        at iubio.readseq.BioseqFormats.loadClasses(BioseqFormat.java)
211        at iubio.readseq.BioseqFormats.<clinit>(BioseqFormat.java)
212        at iubio.readseq.Readseq.<init>(readseq.java)
213        at Predotar.handleOpenFile(Predotar.java)
214        at Predotar.main(Predotar.java)
215
216Readseq version 2.0.5 (25 August 1999)
217Argument: 'data/5srna.msf', key: data/5srna.msf = null
218Free/total memory at start:	570136/867984: use 297848 bytes
219iubio.readseq.run -- starting
220Writing to data/5srna.msf.fasta
221checkInString  'data/5srna.msf' is file.
222Reading from data/5srna.msf
223_exceptionOccurred: java.lang.NullPointerException ()
224java.lang.NullPointerException
225	at flybase.OpenString.indexOf(Compiled Code)
226	at iubio.readseq.Asn1SeqFormat.formatTestLine(seqread1f.java)
227	at iubio.readseq.Testseq.testFormat(Compiled Code)
228	at iubio.readseq.Readseq.isKnownFormat(readseq.java)
229	at iubio.readseq.run.run(Compiled Code)
230	at iubio.readseq.run.<init>(readseqrun.java)
231	at run.main(readseqmain.java)
232	at com.apple.mrj.JManager.JMStaticMethodDispatcher.run(JMAWTContextImpl.java)
233	at java.lang.Thread.run(Thread.java)
234
235///
236
237
238app() -- swing
239	http://java.sun.com/products/jfc/download.html, good for java v 1.1.6 (?) +
240	if swingall.jar not available - trap exception and message
241
242
243//!! test interleave i/o
244
245/// � add classic readseq command-line processing
246/// � add window/gui/swing readseq
247
248/// � add BLAST output parsing (non-pairwise versions...)
249/// fix, test NEXUS/ PAUP reader
250/// XML � parser & � output?
251
252///? GCG seq formats update??
253///? GDE format ?
254///? abi trace files, scf files ? staden format (?) and scf
255///? other new, popular seq formats? -? Sequencher, DNAstar, ???
256
257///? see NCBI toolkit parser program for some format info
258
259/// Genbank <-> EMBL doc :
260	need to fiddle with field values to get correct match
261  embl uses ';' a lot as line ends - drop for gb, add from gb
262  Ref start:  em: RN   [3]   gb: REFERENCE   3  (bases 1 to 1566)
263  Source/Organism lines differ: gb: Source=common name Organism=spp name /n taxa
264  		em: OS= spp name (common name) OC=taxa
265  Locus line: need parsing of parts
266  Base count line: needs proper base counts, or parsing of items
267		gb: BASE COUNT      276 a    246 c    295 g    262 t
268		em: SQ   Sequence 977 BP; 316 A; 188 C; 175 G; 298 T; 0 other;
269
270
271  ?? do I want to fix up current BioseqDoc to those details, or rewrite structure
272  (as per DOM?) before that?
273
274
275
276
277/// GCG seq formats update
278>> parse comments like -- at least skip lines starting with !!
279!!NA_MULTIPLE_ALIGNMENT 1.0
280!!AA_MULTIPLE_ALIGNMENT 1.0
281!!SEQUENCE_LIST 1.0
282!!RICH_SEQUENCE 1.0
283Multiple Sequence Format (MSF) and Rich Sequence Format (RSF)
284Files
285
286Reformat can be used to convert between MSF, RSF, single
287sequence format and list files. When single sequence
288files are specified using a list file, any sequence
289attributes specified in the list file (e.g. begin and end
290ranges) are ignored during the conversion to the new file
291type. When converting from an RSF file any sequence
292features are lost. Access to sequence features is
293currently available only from within SeqLab. (Refer to
294Chapter 2 of the Users' Guide, Using Sequence Files and
295Databases, for details. See "Using Multiple Sequence
296Format (MSF) Files", "Using Rich Sequence Format (RSF)
297Files", and "Using List Files" for information about list
298files.)
299
300
301/// fix, test NEXUS, PAUP reader
302
303
304/// other new, popular seq formats? -? Sequencher DNAstar ???
305
306
307/// abi trace files, scf files ?
308/// staden format (?) and scf
309
310/// GDE format ?
311
312
313/// ? XML parser  ? HMTL parser/de-frasser
314
315
316
317/// BLAST output parsing....
318qblast The request ID is : 933566842-17349-12684
319
320From sauder@glinka.fccc.edu  Wed Jun 23 16:56:08 1999
321Date: Wed, 23 Jun 1999 17:58:03 -0400 (EDT)
322From: sauder@glinka.fccc.edu (J. Michael Sauder)
323To: Don Gilbert <gilbertd@bio.indiana.edu>
324Subject: Re: SeqPup and Blast
325
326> I'll add your request for blast import; I can't promise
327
328        I already have Perl code that converts Blast to PIR format.
329Would this be helpful?
330
331-- Mike S. (m_sauder@fccc.edu) http://www.fccc.edu/research/labs/dunbrack
332           phone: 215.728.5661  FAX: 215.728.3574 or 2412
333
334{Mail}&
335Message 218:
336From sauder@glinka.fccc.edu  Fri Jun 25 14:18:54 1999
337Date: Fri, 25 Jun 1999 15:20:48 -0400 (EDT)
338From: sauder@glinka.fccc.edu (J. Michael Sauder)
339To: Don Gilbert <gilbertd@bio.indiana.edu>
340Subject: Re: SeqPup and Blast
341X-Status: $$$$
342X-UID:
343
344> Sure -- send on your code, it may make it easier for me to add blast
345> format.
346
347        Here's the code.
348It assumes that Blast was performed with the -m 4 or -m 6 option, which
349is NOT the default.  I'm going to start working on a version that will
350translate the default (-m 0) output.  This version can handle PSI-Blast
351output, and only keeps the sequences from the last iteration.
352
353----------------------------------------------------------------------------
354#!/usr/bin/perl
355
356# BLASTm4FASTA.perl
357
358# Usage: blastm4fasta.perl file.blast > file.fasta
359
360# Written by J. Michael Sauder  m_sauder@fccc.edu
361# Fox Chase Cancer Center       6-24-99
362
363# Convert blast -m 4 or 6 output to FASTA format (or clustalw PIR format)
364#   (blastpgp -m 1, 2, 3, or 5  may produce spurious output)
365# Output format defined by $format (FASTA or PIR).
366# All sequences will be the same length, padded with either X's or
367#  dashes, depending on the output format (FASTA or PIR).
368# Keeps only the last iteration in PSI-Blast output.
369# The "Query= " line must exist, as well as "  Database:" at the end.
370
371
372$format = "fasta";   # output format
373# $format = "pir";
374$m = 4;    # default blast -m alignment view output option
375           # only -m 4 or 6 should be used with this program.
376           # This doesn't need to be modified.  The program will
377           # identify whether -m 4 or -m 6 was used.
378$width = 70;  # width of output sequence
379
380$qseq = ""; %qseqs=();
381%hseq = (); %count = ();
382$hitnum=0; $found=0; $npad=0; $hlen=0;
383
384# Subroutine to print sequences $width characters wide
385sub printseq {
386    $seq = $_[0];
387    $length=length($seq);
388    $ncol=int($length/$width);
389    for ($i=0; $i<=$ncol; $i++) {
390        print substr($seq,$i*$width,$width),"\n";
391    }
392}
393
394# $file = $ARGV[0];
395
396while(<>) {
397    chop;
398    if (/^Query= /) {      # Get query name
399        @line = split;
400        $query = $line[1];
401        next;
402    }
403    if (/^Searching/) {       # handle multiple PSI-Blast iterations
404        %hseq=();             # keep only last iteration
405        $qseq=""; %qseqs=();  # by clearing everything
406        %count=();
407        $hitnum=0; $found=0; $npad=0; $hlen=0;
408    }
409    if (/^QUERY/) {        # Query sequence
410        @line = split;
411        $name = $line[0];
412        $qstart = $line[1];
413        if ($qseq eq "") {
414            $qfirst = $qstart;   # $qfirst is probably 1
415        }
416#        $qend = $line[3];
417        $qlen = length($line[2]);
418        $qseq .= $line[2];       # append query sequence
419        $qseqs{$name} .= $line[2];  # also store in %qseqs
420        $found=1;     # flag to identify sequence region of blast output
421        next;
422    }
423
424    # Stop trying to read sequences once this line is found
425    # at the end of the blast output
426    if (/^  Database: /) { $found = 0; next }
427
428    if ($found==1) {
429        if ($_ eq "") { next }
430        @line = split;
431        $name = $line[0];
432#        $hstart = $line[1];
433        $hlen = length($line[2]);
434        if ($hlen==0) {          # created with blast -m 6
435            $m=6;
436            $hlen = length($line[1]);
437        }
438        else { $m=4 }
439
440        # Add N-padding X's or dashes to match query sequence length
441        if (!$hseq{$name}) {
442            $npad = ($qlen - $hlen) + ($qstart - $qfirst);  # -1 ?
443            if ($format eq "fasta") { $pad = 'X' x $npad }
444            elsif ($format eq "pir") { $pad = '-' x $npad }
445            if ($m==6) { $npad=0; $pad="" }   # -m 6 is already padded
446            if ($npad>0) { $hseq{$name} = $pad }
447            $count{$hitnum} = $name;
448            $hitnum++;   # Increment counter for next new sequence
449        }
450        if ($m==4) { $hseq{$name} .= $line[2] }
451        if ($m==6) { $hseq{$name} .= $line[1] }
452    }
453}
454
455$querylen = length($qseqs{QUERY});
456
457if ($format eq "pir") {  print ">$query\n\n" }
458if ($format eq "fasta") { print "\>$query $querylen\n" }
459if ($format eq "fasta") { $qseqs{QUERY} =~ s/\-/X/g }
460printseq($qseqs{QUERY});   # format and print sequence
461if ($format eq "pir") { print "\*\n" }
462
463
464# Preserve original alignment order by using %count instead of %hseq
465
466foreach $hitnum (sort keys %count) {
467    $name = $count{$hitnum};
468
469    # Add trailing X's or dashes if necessary
470    $hlen = length($hseq{$name});
471    $npad = $querylen-$hlen;
472    if ($format eq "fasta") { $pad = 'X' x $npad}
473    elsif ($format eq "pir") { $pad = '-' x $npad}
474    if ($npad>0) { $hseq{$name} .= $pad }
475
476    $hitlen = length($hseq{$name});
477
478    if ($format eq "pir") { print ">$name\n\n" }
479    if ($format eq "fasta") { print "\>$name $hitlen\n" }
480    if ($format eq "fasta") { $hseq{$name} =~ s/\-/X/g }
481    printseq($hseq{$name});
482    if ($format eq "pir") { print "\*\n" }
483}
484
485# Sample input from blastpgp -m 4
486#
487# Query= protein1
488# Database: nr
489#
490# Searching..................................................done
491# Results from round 1
492# ... [cut]
493#
494# Searching..................................................done
495# Results from round 2
496#
497# QUERY      1   MVTENPQRLTVLRLATNKGPLAQIWLASNMSN-IPRGSVIQTHIAESAKEIAKASGCDDE 59
498# S50979     1   MVTENPQRLTVLRLATNKGPLAQIWLASNMSN-IPRGSVIQTHIAESAKEIAKASGCDDE 59
499# CAA88356   1   MVTENPQRLTVLRLATNKGPLAQIWLASNMSN-IPRGSVIQTHIAESAKEIAKASGCDDE 59
500# P30776     7                ILSKKGPLAKVWLAAHWEKKLSKVQTLHTSIEQSVHAIV-------- 45
501# 4140710    7                ILAKKGPLARIWLAAHWDKKITKAHVFETNIEKSVEGILQPK----- 48
502# AAD33593.1 7                ILAKKGPLARIWLAAHWDKKITKAHVFETNIEKSVEGILQPK----- 48
503# 1022971    64                 SKKGPLSKVWLAAHWEKKLSKAQIFETDVDEAVNEIMQPS----- 10
504# CAA66939   9                  SKRGPLAKIWLAAHWDK-----KLTKAHVFECNLE----SSVESI 44
505#
506# ... [cut]
507#
508# QUERY      523 EASRGFFDILSLATEGCIGLSQTEAFGNIKIDAKPALFERFINA 566
509# S50979     523 EASRGFFDILSLATEGCIGLSQTEAFGNIKIDAKPALFERFINA 566
510# P40457     407 ERTKSLEHELKRSTE                              421
511# BAA18707   139 STN                                          141
512#   Database: nr
513
514
515# Sample input from blastpgp - m 6
516#
517# QUERY      1   MVTENPQRLTVLRLATNKGPLAQIWLASNMSN-IPRGSVIQTHIAESAKEIAKASGCDDE 59
518# S50979     1   MVTENPQRLTVLRLATNKGPLAQIWLASNMSN-IPRGSVIQTHIAESAKEIAKASGCDDE 59
519# CAA88356   1   MVTENPQRLTVLRLATNKGPLAQIWLASNMSN-IPRGSVIQTHIAESAKEIAKASGCDDE 59
520# P30776     7   -------------ILSKKGPLAKVWLAAHWEKKLSKVQTLHTSIEQSVHAIV-------- 45
521# 4140710    7   -------------ILAKKGPLARIWLAAHWDKKITKAHVFETNIEKSVEGILQPK----- 48
522# AAD33593.1 7   -------------ILAKKGPLARIWLAAHWDKKITKAHVFETNIEKSVEGILQPK----- 48
523# 1022971    64  ---------------SKKGPLSKVWLAAHWEKKLSKAQIFETDVDEAVNEIMQPS----- 10
524# CAA66939   9   ---------------SKRGPLAKIWLAAHWDK-----KLTKAHVFECNLE----SSVESI 44
525# Q46089         ------------------------------------------------------------
526#
527# ... [cut]
528#
529# QUERY      523 EASRGFFDILSLATEGCIGLSQTEAFGNIKIDAKPALFERFINA 566
530# S50979     523 EASRGFFDILSLATEGCIGLSQTEAFGNIKIDAKPALFERFINA 566
531# CAA88356   284 -------------------------------------------- 285
532# P30776     121 -------------------------------------------- 122
533# 4140710    89  -------------------------------------------- 90
534# AAD33593.1 89  -------------------------------------------- 90
535# 1022971    145 -------------------------------------------- 146
536# CAA66939   89  -------------------------------------------- 90
537# Q46089     458 -------------------------------------------- 459
538----------------------------------------------------------------------------
539
540
541-- Mike S. (m_sauder@fccc.edu) http://www.fccc.edu/research/labs/dunbrack
542           phone: 215.728.5661  FAX: 215.728.3574 or 2412
543