SEQIO -- A Package for Sequence File I/O


SEQIO.DOC - The SEQIO Package Interface
***************************************
6
7
8
9The SEQIO package is a set of C functions which can read and write
10biological sequence files formatted using various file formats and which
11can be used to perform database searches on biological databases. I
12had five main goals in designing this package:
13
14 1. Keep the interface as similar as possible to the C stdio package,
15   given the fact that we're dealing with sequences and sequence
16   entries instead of characters and lines.
17 2. Support as many sequence file formats as possible, and make it
18   relatively quick and easy (at least for me) to add new formats.
19 3. Handle genome-sized sequences and large database searches,
20   so that the file I/O is no longer a time or space bottleneck for a
21   program.
 4. Make the package flexible enough so that sequence analysis,
   information retrieval and associating new information with
   sequences (such as the results of the analyses) are easy.
25 5. Try to help people do "something else" with the minimum of
26   hassle (where "something else" means getting information or
27   performing some computation that no one else has provided
28   software for).
29
30The package defines a SEQFILE data structure, similar to stdio's FILE
31data structure, that is opened and closed and is used to perform the
32reading and writing of the files. Sequences and sequence entries are
33read or written one at a time, as a stream.
34
35Because of the size and complexity of sequences and entries, internal
36SEQIO data structures always retain the last read, or "current",
37sequence and its entry. A number of access functions are provided to
38return information about that current sequence and entry. So, you
39could read the next sequence, examine it, and then use the access
40functions to look at the database identifiers for the sequence, and then
41get the complete entry text. This is different from the stdio package,
42which simply puts the characters into data structures that you create
43and then forgets about those characters.
44
45The package can retrieve a number of pieces of information stored in a
46sequence entry, although it can't get everything (yet). In addition to the
47sequence and the complete entry text, it can get things like the
48database identifiers, the description, the organism name, the entry
49creation/update date, and so on. A SEQINFO data structure (listed
50below in the Current Sequence Access Functions section) has been
51defined to hold all of this information about a sequence or entry. This is
52a transparent data structure, unlike the SEQFILE structure, so that you
53can access and modify the fields of the structure or create your own
54structures. Since the information in the SEQINFO structure is used
55along with the sequence when writing sequence entries, creating new
56sequence entries is as simple as filling in the fields of a SEQINFO
57structure and then passing it and the sequence to the function
58`seqfwrite'.
59
60The package can perform database searches using the BIOSEQ
61standard. BIOSEQ was developed as a successor to the FASTLIBS
62method of the FASTA program for specifying the files to be read. The
63user (not you, but the user of your program) creates a BIOSEQ file with
entries describing the various databases, like this one for GenBank:
65
>genbank,gb:   /databases/genbank
>Name: GenBank
>Alphabet:  DNA
    gbbct.seq, gbest.seq, gbinv.seq, gbmam.seq, gbpat.seq, gbphg.seq
    gbpri.seq, gbrna.seq, gbrod.seq, gbsts.seq, gbsyn.seq, gbuna.seq
    gbvrl.seq, gbvrt.seq

    bct:(gbbct.seq), est:(gbest.seq), inv:(gbinv.seq),  mam:(gbmam.seq)
    pat:(gbpat.seq), phg:(gbphg.seq), pri:(gbpri.seq),  rna:(gbrna.seq)
    rod:(gbrod.seq), sts:(gbsts.seq), syn:(gbsyn.seq),  una:(gbuna.seq)
    vrl:(gbvrl.seq), vrt:(gbvrt.seq)
77
78The SEQIO package can read this file and the BIOSEQ related
79procedures in the package, seqfopendb, bioseq_read, bioseq_info and
80bioseq_parse, can be used to perform database searches and to
81retrieve information about a database or the location of its files. The full
82details of the BIOSEQ standard are found in the file "user.doc".
83
84General Comments
85================
86
87As you read this file and use the package, here are some issues to keep
88in mind:
89
90 1. How are returned values allocated? The SEQIO package either
91   returns pointers to internal data structures (which may change
92   after the next SEQIO call) or returns malloc'ed buffers which you
93   must free to avoid a memory leak.
94
95 2. How are errors handled? The SEQIO package always sets an
96   error variable `seqferrno' and error string `seqferrstr' on an
97   error. Normally, the package also reports warnings and errors
98   on standard error, and will exit the program on fatal errors (like
99   running out of memory). Using `seqferrpolicy' though, you can
100   keep the package from doing any or all of that (so that it only
101   sets the error variables and returns an error return value). And
102   using `seqfsetperror', you can replace the package's default
103   print error function (which outputs to standard error) with your
104   own output function.
105
106 3. File formats are always specified by name, such as "GenBank",
107   "FASTA", "Stanford", and so on.
108
109 4. Except for filenames, just about everything is case-insensitive
110   (so "GENBANK", "fasta" and "sTAnForD" are acceptable in point
111   3).
112
113 5. One deficiency you might find in the package is that there is no
114   explicit link between a sequence and the SEQINFO structure for
115   that sequence. I could not reconcile the package's ability to
116   create new, malloc'ed sequence and SEQINFO buffers, my wish
117   to keep the amount of allocated space to a minimum, and any
118   mechanism for linking the SEQINFO structure with a sequence.
119   You'll have to keep track of the sequences and SEQINFO
120   structures, and make sure that you are not mixing them up.
121
122 6. Most of the names in the package should be fairly unique, except
123   possibly the predefined constants DNA, RNA, PROTEIN, AMINO
124   and UNKNOWN used to specify alphabets. File "quickref.doc"
   gives a complete list of the constants, typedef names and
   function names defined by the package. If these alphabetic
127   constants interfere with the constants in your program, just add
128   the following lines AFTER including "seqio.h", and then use the
129   constants SEQIO_DNA, SEQIO_RNA, SEQIO_PROTEIN, SEQIO_AMINO and
130   SEQIO_UNKNOWN when looking at SEQIO's alphabet values.
131
   #undef DNA
   #undef RNA
   #undef PROTEIN
   #undef AMINO
   #undef UNKNOWN

   #define SEQIO_UNKNOWN 0
   #define SEQIO_DNA 1
   #define SEQIO_RNA 2
   #define SEQIO_PROTEIN 3
   #define SEQIO_AMINO 3
143
144   Or you can just use the constants predefined in the package in
145   your program (you should be able to figure out what values they
146   have from above).
147
148
149
150Opening and Closing Files/Database-Searches
151*******************************************
152
153Seqfopen
154========
155
 SEQFILE *seqfopen(char *filename, char *mode, char *format)
157
158  o filename - the file to be opened
159  o mode - "r", "w" or "a"
160  o format - the file format name (optional for reading)
161
162  o returns an open SEQIO file structure (or NULL on error)
163
164The seqfopen function opens a file for reading or writing. It is similar to
165stdio's fopen, except it returns a SEQFILE structure and it has an extra
166`format' argument. This argument specifies the format of the file to be
167read or written. See the files "user.doc" and "format.doc" for a
168description of the valid formats.
169
170In addition to normal filenames, the package recognizes several special
171characters. First, the string containing only a dash ("-") specifies that
172standard input or standard output should be opened instead. Second,
filenames beginning with a `~' are expanded the same way the Unix
shells expand them (i.e., the `~' refers to a home directory). Finally, access
175to single entries of the file can be specified using an `@', followed by a
176single entry access specification. See "user.doc" for a description of
177how to specify single entry access.
(Note: This specification means that seqfopen will not accept filenames
containing an at sign. It will always assume the `@' denotes a single
entry access specification.)
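
As a rough illustration of the three special cases described above, here
is a minimal sketch in C. It is not SEQIO's own code: the names
`classify_name' and `entry_spec' are hypothetical, and splitting at the
first `@' is an assumption (the real single entry access syntax is
described in "user.doc").

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical classification of a seqfopen-style filename argument. */
enum name_kind { NAME_STDIO, NAME_HOME, NAME_PLAIN };

enum name_kind classify_name(const char *name)
{
    if (strcmp(name, "-") == 0)
        return NAME_STDIO;      /* a lone dash: standard input/output */
    if (name[0] == '~')
        return NAME_HOME;       /* leading tilde: home directory form */
    return NAME_PLAIN;          /* anything else: an ordinary filename */
}

/*
 * Return a pointer to the text following the first `@', or NULL if the
 * name contains no `@'.  Splitting at the FIRST `@' is an assumption.
 */
const char *entry_spec(const char *name)
{
    const char *at = strchr(name, '@');
    return (at != NULL) ? at + 1 : NULL;
}
```

Because any `@' is taken as the start of an entry specification, a name
such as "my@file.seq" could never be opened as a plain file under this
scheme, which is the restriction the note above describes.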
181
182When reading a file, the `format' argument can be NULL, in which case
183seqfopen tries to automatically determine the format of the file. If the file
184format is any of the valid formats except the Raw format, seqfopen
185should be able to determine the correct format. If it cannot determine
186the correct format, the SEQIO package does not fail to open the file. It
187simply triggers a warning and opens the file in the Plain format.
188
189Seqfopendb
190==========
191
 SEQFILE *seqfopendb(char *dbspec)
193
194  o dbspec - a BIOSEQ database search specification
195
196  o returns an open SEQIO file structure (or NULL on error)
197
198The seqfopendb function opens a SEQFILE structure to perform a
199database search. The one argument specifies what database (or part of
200a database) to search. All of the sequences in that database (or
201database part) can then be read as if they were stored in a single
202sequence file. See the file "user.doc" for a description of a valid
203BIOSEQ database specification.
204
205This function also looks for five information fields from the BIOSEQ
206entry for the database. These fields are not required to occur in an
207entry. The "Name" field specifies the name of the database described
208by that BIOSEQ entry and is used to distinguish between "official"
209databases and just collections of entries. The "IdPrefix" field gives the
210identifier prefix to attach to any identifier in an entry which does not
211already have a prefix. The "Format" field specifies the file format to pass
212to `seqfopen' for each database file read. The "Alphabet" field specifies
213the alphabet for each sequence in the database. And finally, the "Index"
214field gives the name of the file indexing all of the database's entries
215(which is used to randomly access the entries).
216
217Seqfopen2
218=========
219
 SEQFILE *seqfopen2(char *string)
221
222  o string - the filename (if it specifies an existing file) or
223    database search specifier (otherwise)
224
225  o returns an open SEQIO file structure (or NULL on error)
226
227The seqfopen2 function is a "combination" function which you can use
228to avoid having to decide whether to call seqfopen or seqfopendb. If
229the parameter specifies an existing file (tested using the stat system
230call), then seqfopen is called (with the second and third arguments of
231"r" and NULL, resp.). Otherwise, seqfopendb is called.
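
The existence test that decides between the two calls can be sketched as
follows. This is only an illustration under stated assumptions: the
helper name `names_existing_path' is hypothetical, and whether the
package treats directories or other non-regular files as "existing
files" is not stated here, so the sketch only checks that stat succeeds.

```c
#include <stddef.h>
#include <sys/stat.h>

/*
 * Return 1 if stat() succeeds on `string', 0 otherwise.
 * (Hypothetical helper; it accepts any path that stat can see.)
 */
int names_existing_path(const char *string)
{
    struct stat st;

    return (string != NULL && stat(string, &st) == 0) ? 1 : 0;
}
```

A caller following the description above would then use seqfopen(string,
"r", NULL) when the helper returns 1, and seqfopendb(string) otherwise.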
232
233Seqfclose
234=========
235
 void seqfclose(SEQFILE *sfp)
237
238  o sfp - the SEQFILE structure to be closed
239
240  o returns nothing
241
242The seqfclose function closes an open SEQFILE structure, closing any
243open FILE structures it uses and freeing up the memory allocated to it.
244
245
246
247Reading Sequences/Entries
248*************************
249
250Seqfread
251========
252
 int seqfread(SEQFILE *sfp, int flag)
254
255  o sfp - an open SEQFILE structure
256  o flag - read the next sequence (if zero) or entry (non-zero)
257
258  o returns 0 on success and -1 on EOF or error
259
260The seqfread function reads the next sequence or entry. The difference
261between reading the next sequence and reading the next entry only
262appears with formats that contain multiple sequences per entry, such
263as the PHYLIP or Clustalw formats. When reading "by sequence", each
264sequence of a multiple sequence entry is read and kept as the current
265sequence. If seqfread is called with a non-zero `flag' argument, then
266the next entry is always read, even if some sequences of the current
267entry have not been read.
268
269This function only returns a status value because the current sequence
270and entry text are kept in the internal SEQIO data structures. The
271function just changes the value of the current sequence and current
272entry.
273
274Seqfgetseq, Seqfgetrawseq, Seqfgetentry, Seqfgetinfo
275====================================================
276
 char *seqfgetseq(SEQFILE *sfp, int *length_out, int newbuffer)
 char *seqfgetrawseq(SEQFILE *sfp, int *length_out, int newbuffer)
 char *seqfgetentry(SEQFILE *sfp, int *length_out, int newbuffer)
 SEQINFO *seqfgetinfo(SEQFILE *sfp, int newbuffer)
281
282  o sfp - an open SEQFILE structure
283  o length_out - address where the returned string's length is
284    stored (if not NULL)
285  o newbuffer - malloc a new buffer for the object (if non-zero)
286    or return an internal buffer (if zero)
287
288  o returns the sequence/entry text or the SEQINFO structure
289    (or NULL on error)
290
291These functions simply call seqfread to read the next sequence or
292entry, and then call the access function to return the sequence text,
293entry text or sequence information. The seqfgetseq, seqfgetrawseq and
294seqfgetinfo functions call seqfread with a `flag' value of 0, while the
295seqfgetentry function calls seqfread with a `flag' value of 1. This way,
296the search by entry will look at each entry exactly once (it won't
297repeatedly look at the same multiple sequence entry), and the search
298by sequence and search by information will look at each sequence
299exactly once.
300
301See the access functions description next for a more complete
302description of the arguments and their use.
303
304
305
306Access Functions for the Current Sequence, Entry and Information
307****************************************************************
308
309Seqfsequence, Seqfrawseq, Seqfentry, Seqfinfo, Seqfallinfo
310==========================================================
311
 char *seqfsequence(SEQFILE *sfp, int *length_out, int newbuffer)
 char *seqfrawseq(SEQFILE *sfp, int *length_out, int newbuffer)
 char *seqfentry(SEQFILE *sfp, int *length_out, int newbuffer)
 SEQINFO *seqfinfo(SEQFILE *sfp, int newbuffer)
 SEQINFO *seqfallinfo(SEQFILE *sfp, int newbuffer)
317
318  o sfp - an open SEQFILE structure
319  o length_out - address where the returned string's length is
320    stored (if not NULL)
321  o newbuffer - malloc a new buffer for the object (if non-zero)
322    or return an internal buffer (if zero)
323
324  o returns the sequence/entry text or the SEQINFO structure
325    (or NULL on error)
326
327These functions are the access functions by which the current
328sequence text, the current entry text or the information about the
329current sequence can be retrieved. The seqfinfo and seqfallinfo
330functions are described in the next section.
331
332The seqfsequence, seqfrawseq and seqfentry functions return the
333current sequence text, the raw sequence text or the current entry text,
334respectively. They also return the length of the text, if the second
335argument is an integer variable pointer (such as `s =
336seqfsequence(sfp, &len, 0)').
337
338(NOTE: The function seqfrawseq differs from seqfsequence in that it
339also includes any alignment or notational characters in its returned
340"sequence". Typically, seqfsequence only extracts alphabetic
341characters in the sequence lines of the entry, whereas seqfrawseq
342extracts all characters except whitespace and digits.)
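
The difference between the two kinds of text can be sketched with two
small filters. These are illustrative stand-ins, not the package's own
extraction code (which is format-specific); the names `keep_letters' and
`keep_raw' are hypothetical, and the caller is assumed to supply an
output buffer at least as large as the input string plus one.

```c
#include <ctype.h>
#include <stddef.h>

/* Copy only the alphabetic characters of `text' into `out'
 * (roughly what seqfsequence typically returns). */
size_t keep_letters(const char *text, char *out)
{
    size_t n = 0;

    for (; *text != '\0'; text++)
        if (isalpha((unsigned char)*text))
            out[n++] = *text;
    out[n] = '\0';
    return n;
}

/* Copy everything except whitespace and digits
 * (roughly what seqfrawseq typically returns). */
size_t keep_raw(const char *text, char *out)
{
    size_t n = 0;

    for (; *text != '\0'; text++)
        if (!isspace((unsigned char)*text) && !isdigit((unsigned char)*text))
            out[n++] = *text;
    out[n] = '\0';
    return n;
}
```

For a sequence line containing gap or notational characters, the first
filter drops them while the second keeps them, so the second result is
never shorter than the first.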
343
344The returned text is stored either in an internal SEQIO buffer or in a
345newly malloc'ed buffer, depending on the value of the `newbuffer'
346argument. With both kinds of buffers, you are permitted to modify or
347rewrite the characters of the text as needed (these are NOT read-only
348buffers), with two exceptions. First, you may not write past the end of
349the returned text, because you would be writing off the end of a
350malloc'ed buffer or onto other text stored in an internal SEQIO buffer.
351So, you can make the string shorter, but not longer.
352
353Second, if you use the internal buffer, any permanent changes you
354make could affect future calls involving that sequence or entry (what is
355being returned is the buffer storing the only copy SEQIO has of the
356sequence/entry). The idea is to use the internal buffers when the
357sequence/entry will only be kept around temporarily and any changes
358made are undone when the text is no longer needed, and to use the
359malloc'ed buffer for the sequences/entries that must be kept around for
360a long time.
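
One way to follow the "change it, then undo it" advice for an internal
buffer is sketched below on an ordinary character array. The function
`count_in_prefix' is a hypothetical example, not part of the package; it
temporarily shortens the string (which is allowed), does its work, and
restores the overwritten character before returning.

```c
#include <stddef.h>
#include <string.h>

/*
 * Count occurrences of `c' in the first `n' characters of `text'.
 * The caller must guarantee n <= strlen(text).
 */
size_t count_in_prefix(char *text, size_t n, char c)
{
    char saved;
    size_t count = 0;
    char *p;

    if (c == '\0')
        return 0;

    saved = text[n];
    text[n] = '\0';            /* temporarily shorten the string */
    for (p = strchr(text, c); p != NULL; p = strchr(p + 1, c))
        count++;
    text[n] = saved;           /* undo the change before returning */
    return count;
}
```

If the text had come from a call with a zero `newbuffer' argument, the
buffer would be left exactly as the package stored it, so later calls
involving the same sequence or entry would be unaffected.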
361
362The SEQINFO Structure
363=====================
364
The seqfinfo function returns a SEQINFO structure containing various
366information about the current sequence and its entry. So, what
367information does the SEQINFO structure hold? Here is the C definition:
368
typedef struct {
  char *dbname, *filename, *format;
  int entryno, seqno, numseqs;

  char *date, *idlist, *description;
  char *comment, *organism, *history;
  int isfragment, iscircular, alphabet;
  int fragstart, truelen, rawlen;
} SEQINFO;
378
379The structure contains six fields which the SEQIO package has about
380the current sequence:
381
382dbname
383   The name of the database being searched (if this is a search of
384   an actual database).
385filename
386   The name of the file currently being read.
387format
388   The format of the file (and the current entry).
389entryno
390   The location of the current entry in the file (if entryno is 10, then
391   the current entry is the tenth entry in the file).
392seqno
393   The location of the current sequence in the current entry (if
394   seqno is 3, then the current sequence is the third in the current
395   entry).
396numseqs
397   The number of sequences contained in the current entry.
398
399So, the current sequence's location is the `seqno' sequence of the
400`entryno' entry of the file `filename' (possibly of the database
401`dbname'). The `format' string gives the entry's format, and the entry
402contains `numseqs' sequences.
403
404The other twelve fields are information extracted from the current entry
405(see "format.doc" for the details about which information is retrieved
406for each file format):
407
408date
409   A single date giving the last time the entry was either created or
410   updated. Its format should be day-month-year, as in
411   31-JAN-1995.
412idlist
   The list of identifiers given in the entry. The idlist is a string
   containing a vertical-bar-separated list of identifiers, each
   consisting of an identifier prefix, a ':' and the identifier.
416   See file "user.doc" for more information about identifiers and
417   identifier prefixes.
418description
419   A description of the sequence or sequences in the entry. This is
420   the "Title" or "Definition" line in some file formats. This string
421   should consist of a single "line" of text, although it can be of any
422   length. So, no newlines should appear in this text (they are
423   removed and added when the description is read from and
424   output in the sequence entries).
425comment
426   A block of text giving a comment about the sequence. The string
427   can contain one or more lines of any length. The one restriction
428   to the text appearing in a comment is that any block of lines at
429   the end of an entry's comment section where each line begins
430   with the string "SEQIO" is reserved for other use by the package
431   (this block holds extra identifiers or the `history' lines).
432organism
433   The name of the organism the sequence was taken from. Right
434   now, this field can contain any single "line" of text, although I
435   would like to standardize the contents of this field. It's on my
436   TODO list.
437history
438   This holds the lines of text placed in the comment section of
439   entries which describe previous SEQIO operations on this entry,
440   i.e., it holds the history of alterations and updates made to this
441   entry by programs using the SEQIO package. Any block of lines
442   at the end of a comment section where each line begins with the
443   string "SEQIO" is not considered part of the comment, but part
444   of the history.
445isfragment
446   This integer is non-zero if the sequence is a fragment of a larger
447   sequence, and zero if the sequence is complete (or if it is not
448   known whether the sequence is a fragment).
449iscircular
450   This integer is non-zero if the sequence is a circular sequence,
   and zero if it is a linear sequence (or if its circularity is not
452   known).
453alphabet
454   This integer is one of the predefined constants DNA, RNA,
455   PROTEIN or UNKNOWN. Its value is UNKNOWN unless either
456   the database's BIOSEQ entry (information field "Alphabet") or
457   the entry itself explicitly specifies the alphabet. The package
458   does not try to guess the alphabet.
459fragstart
460   When the sequence is a fragment of a larger sequence and the
461   location of this fragment in the larger sequence is known, this
462   value gives the starting position of the fragment. If this value is
463   not known (or the sequence is complete), fragstart is set to 0.
464truelen
465   This is the "true" length of the sequence, i.e., the length of the
466   sequence without any gap characters or notational characters.
467   Typically, these are just the alphabetic characters.
468rawlen
469   This is the "raw" length of the sequence, i.e., the length of the
470   sequence which includes the gap and notational characters.
471   Typically these are all characters except whitespace and digits.
472
473The seqfinfo function fills in as many fields of the SEQINFO structure as
474it can, given the information in the entry. If a particular piece of
475information could not be found in the entry, that field is set either to
476NULL or 0, depending on whether it is a character string or an integer.
477(And, yes, I know NULL and 0 are really the same value. I'm talking
478about good programming style here.)
479
480The seqfallinfo function also returns a SEQINFO structure and the
481information is the same as that returned by seqfinfo, with the exception
482of the `comment' field. With seqfallinfo, the entire header for the current
483entry is stored in the comment field of the SEQINFO structure. The
484meaning of "entire header" is different for the different formats, but the
485general idea is that it consists of everything in the entry except the
486sequence lines. This can be useful for converting from one format to
487another without losing the header lines describing the references and
488features of the sequence.
489
As with seqfsequence and seqfentry, the structure returned by
491these two functions is either an internal SEQIO structure or a malloc'ed
492structure. However, the malloc'ed structure is a bit more complicated
493here, since the SEQINFO structure contains character strings for some
494of its fields. What the SEQIO package does is malloc one big buffer in
495which it stores the SEQINFO structure and all of the character strings
496that its fields point to. So, when you call free to "free" the SEQINFO
497structure, you also automatically free all of the character strings. This
498means that the same restrictions specified for the returned
499sequence/entry text above also apply to these character strings.
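
The "one big buffer" idea can be sketched with a cut-down structure. This
is only an illustration of the technique: `MINIINFO' and `miniinfo_new'
are hypothetical names, the structure has just two string fields, and the
actual layout SEQIO uses inside its buffer is not documented here.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical cut-down stand-in for SEQINFO. */
typedef struct {
    char *idlist;
    char *description;
    int truelen;
} MINIINFO;

/*
 * Allocate the structure and copies of both strings in ONE malloc'ed
 * block, so that a single free() releases everything.
 */
MINIINFO *miniinfo_new(const char *idlist, const char *description,
                       int truelen)
{
    size_t idsize = strlen(idlist) + 1;
    size_t descsize = strlen(description) + 1;
    MINIINFO *info = malloc(sizeof(MINIINFO) + idsize + descsize);
    char *strings;

    if (info == NULL)
        return NULL;

    strings = (char *)(info + 1);      /* string storage follows the struct */
    info->idlist = memcpy(strings, idlist, idsize);
    info->description = memcpy(strings + idsize, description, descsize);
    info->truelen = truelen;
    return info;
}
```

Since the strings live inside the same block as the structure, they must
never be freed individually or grown in place; only the pointer returned
by the allocation is passed to free.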
500
501SEQINFO Field Access Functions
502==============================
503
 char *seqfdbname(SEQFILE *sfp, int newbuffer)
 char *seqffilename(SEQFILE *sfp, int newbuffer)
 char *seqfformat(SEQFILE *sfp, int newbuffer)
 int seqfentryno(SEQFILE *sfp)
 int seqfseqno(SEQFILE *sfp)
 int seqfnumseqs(SEQFILE *sfp)
 char *seqfdate(SEQFILE *sfp, int newbuffer)
 char *seqfidlist(SEQFILE *sfp, int newbuffer)
 char *seqfdescription(SEQFILE *sfp, int newbuffer)
 char *seqfcomment(SEQFILE *sfp, int newbuffer)
 char *seqforganism(SEQFILE *sfp, int newbuffer)
 int seqfisfragment(SEQFILE *sfp)
 int seqfiscircular(SEQFILE *sfp)
 int seqfalphabet(SEQFILE *sfp)
 int seqffragstart(SEQFILE *sfp)
 int seqftruelen(SEQFILE *sfp)
 int seqfrawlen(SEQFILE *sfp)
521
522  o sfp - an open SEQFILE structure
523  o newbuffer - malloc a new buffer for the object (if non-zero)
524    or return an internal buffer (if zero)
525
526  o returns the information string or integer
527
528These are the access functions used to retrieve individual pieces of
529information about the current sequence and current entry. Like
530seqfsequence and seqfentry, the functions which return strings can
531return either an internal SEQIO buffer or a malloc'ed buffer containing
532the string (with the same restrictions and requirements on its use).
533
534The functions returning strings all return NULL on an error. The other
535access functions all return 0 on an error, even though 0 may be a valid
536return value (such as in seqfiscircular). Note that the alphabet
537UNKNOWN is defined to be 0.
538
539Seqfmainid, Seqfmainacc
540=======================
541
 char *seqfmainid(SEQFILE *sfp, int newbuffer)
 char *seqfmainacc(SEQFILE *sfp, int newbuffer)
544
545  o sfp - an open SEQFILE structure
546  o newbuffer - malloc a new buffer for the object (if non-zero)
547    or return an internal buffer (if zero)
548
549  o returns the identifier string or NULL
550
551These functions access the idlist information collected from an entry
552and return either the main identifier or main accession number
553occurring in the entry. The main identifier is considered to be either the
554first identifier which is not an accession number, or the first accession
555number if no non-accession identifiers are found in the entry. Thus,
556seqfmainid is NULL only when no identifiers could be extracted from
557the entry. The main accession number is the first accession number
558found.
559
560The string returned by these functions consists of a single identifier,
561whose form is the same as each identifier in the idlist (i.e., an identifier
562prefix, a `:' and the identifier string). This string will be
NUL-terminated (it is not just a pointer into the idlist). And, as with all
564of the other information strings, it can be returned either in an internal
565buffer or a newly malloc'ed buffer.
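
The selection rule for the main identifier can be sketched as a walk over
a vertical-bar-separated idlist. This is a guess at the logic, not the
package's code: the function name `pick_main_id' is hypothetical, the
sketch ASSUMES accession numbers carry the prefix "acc" (see "user.doc"
for the real identifier prefixes), and it compares prefixes
case-sensitively for brevity.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy the main identifier of `idlist' into `out' (NUL-terminated).
 * Returns 1 if an identifier was found, 0 otherwise.
 * ASSUMPTION: accession numbers look like "acc:XXXXX".
 */
int pick_main_id(const char *idlist, char *out, size_t outlen)
{
    const char *p = idlist;
    const char *chosen = NULL, *first_acc = NULL;
    size_t chosen_len = 0, first_acc_len = 0;

    if (idlist == NULL || out == NULL || outlen == 0)
        return 0;

    while (*p != '\0') {
        size_t len = strcspn(p, "|");        /* length of this identifier */

        if (len > 0) {
            int is_acc = (len >= 4 && strncmp(p, "acc:", 4) == 0);

            if (!is_acc) {                   /* first non-accession wins */
                chosen = p;
                chosen_len = len;
                break;
            }
            if (first_acc == NULL) {         /* remember first accession */
                first_acc = p;
                first_acc_len = len;
            }
        }
        p += len;
        if (*p == '|')
            p++;
    }

    if (chosen == NULL) {                    /* fall back to an accession */
        chosen = first_acc;
        chosen_len = first_acc_len;
    }
    if (chosen == NULL)
        return 0;

    if (chosen_len >= outlen)
        chosen_len = outlen - 1;
    memcpy(out, chosen, chosen_len);
    out[chosen_len] = '\0';
    return 1;
}
```

Copying the identifier into a separate buffer mirrors the point made
above: the result is a terminated string of its own rather than a pointer
into the middle of the idlist.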
566
567Seqfoneline
568===========
569
 int seqfoneline(SEQINFO *info, char *buffer, int buflen, int idonly)
571
572  o info - a SEQINFO structure
573  o buffer - the buffer to store the oneline description
574  o buflen - the buffer length
575  o idonly - only store an identifier for the sequence
576
577  o returns the length of the string stored in buffer
578
579This function is used to construct a "oneline" description of a
580sequence, based on the information stored in the SEQINFO structure.
581See "user.doc" for a complete description of the format for a oneline
582description. That description is stored in the character array specified
583by the "buffer" argument.
584
585The function guarantees that it will fit the oneline description into the
586first "buflen-1" characters of buffer, and that the description will always
be NUL-terminated. So, you won't have to check for a buffer overflow
588(like you do with function `fgets').
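
The buffer guarantee described above (at most "buflen-1" characters, and
always terminated) can be sketched as follows. This shows only the
truncation contract, not the oneline format itself, which is defined in
"user.doc"; the helper name `store_bounded' is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/*
 * Store as much of `text' as fits into the first buflen-1 characters of
 * `buffer', always terminate it, and return the stored length.
 */
int store_bounded(char *buffer, int buflen, const char *text)
{
    int len;

    if (buffer == NULL || text == NULL || buflen <= 0)
        return 0;

    len = (int)strlen(text);
    if (len > buflen - 1)
        len = buflen - 1;          /* truncate rather than overflow */
    memcpy(buffer, text, (size_t)len);
    buffer[len] = '\0';
    return len;
}
```

With a contract like this, the caller can compare the returned length
against the buffer size to notice truncation, but never has to worry
about an unterminated result.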
589
590If the "idonly" value is non-zero, then the oneline description will
591consist only of a single identifier for the sequence, and the description
592will not contain any whitespace. This is useful for constructing short
593identifiers for sequences (most notably, the identifiers used in the
594PHYLIP, Clustalw and MSF formats).
595
596Seqfsetidpref, Seqfsetdbname, Seqfsetalpha
597==========================================
598
 void seqfsetidpref(SEQFILE *sfp, char *idprefix)
 void seqfsetdbname(SEQFILE *sfp, char *dbname)
 void seqfsetalpha(SEQFILE *sfp, char *alphabet)
602
603  o sfp - a SEQFILE structure open for reading
  o idprefix - the identifier prefix (if neither NULL nor empty)
  o dbname - the current database name (if neither NULL nor
    empty)
  o alphabet - the string used to determine the alphabet when
    the entry does not specify an alphabet (if neither NULL nor
    empty)
610
611  o returns nothing
612
613These are functions which allow you to set the identifier prefix,
614database name, and alphabet when reading a sequence file or
615performing a database search. The idea is that these functions provide
616the same capability as the inclusion of the "Name", "IdPrefix" and
617"Alphabet" information fields in the BIOSEQ entry used by a database
618search. The use of these values in a database search is described in
619the comments for `seqfopendb' above and in file "programr.doc" in the
620BIOSEQ Stuff section.
621
622If the second argument to the function is either NULL or an empty
623string, then the current value held in the SEQFILE structure is
624removed. This gives a way to unset any of those values as needed.
625
626(NOTE: The `alphabet' argument to "seqfsetalpha" is a character string,
627not one of the predefined constants RNA, DNA, PROTEIN or
628UNKNOWN. The reason for doing that is so that any string specifying a
629valid alphabet (i.e., strings like "tRNA", "Peptide", "cDNA",
630"pre-mRNA") can be input to the package, but the value set in the
SEQINFO structure is simplified so that programs that just need to
632know whether the sequence is DNA, RNA or PROTEIN can just test the
633integer SEQINFO field.
634
635One of the things on my TODO list is to add an `alphastr' field to the
636SEQINFO structure to return the actual alphabet string that occurs in
637the entry or is given to the package.)
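
To make the idea of simplifying an alphabet string concrete, here is a
minimal sketch. The matching rules are assumptions made for illustration
(a case-insensitive search for "dna", then "rna", then a few protein
words); the package's real rules are not described in this file. The
names `classify_alphabet', `contains_nocase' and the ALPHA_ constants are
hypothetical, with values chosen to match the SEQIO_ definitions shown
earlier.

```c
#include <ctype.h>
#include <stddef.h>

enum { ALPHA_UNKNOWN = 0, ALPHA_DNA = 1, ALPHA_RNA = 2, ALPHA_PROTEIN = 3 };

/* Return 1 if `word' occurs in `s', ignoring case; 0 otherwise. */
static int contains_nocase(const char *s, const char *word)
{
    size_t i, j;

    for (i = 0; s[i] != '\0'; i++) {
        for (j = 0; word[j] != '\0'; j++) {
            if (tolower((unsigned char)s[i + j]) !=
                tolower((unsigned char)word[j]))
                break;             /* also stops at the end of s */
        }
        if (word[j] == '\0')
            return 1;
    }
    return 0;
}

/* Reduce an alphabet string to one of the integer constants (ASSUMED rules). */
int classify_alphabet(const char *name)
{
    if (name == NULL || name[0] == '\0')
        return ALPHA_UNKNOWN;
    if (contains_nocase(name, "dna"))
        return ALPHA_DNA;
    if (contains_nocase(name, "rna"))
        return ALPHA_RNA;
    if (contains_nocase(name, "protein") ||
        contains_nocase(name, "peptide") ||
        contains_nocase(name, "amino"))
        return ALPHA_PROTEIN;
    return ALPHA_UNKNOWN;
}
```

Under these assumed rules the example strings from the note reduce as one
would expect: "cDNA" to DNA, "tRNA" and "pre-mRNA" to RNA, and "Peptide"
to PROTEIN.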
638
639
640
641Writing Sequences/Entries
642*************************
643
644Seqfwrite
645=========
646
 int seqfwrite(SEQFILE *sfp, char *seq, int seqlen, SEQINFO *info)
648
649  o sfp - a SEQFILE structure open for writing
650  o seq - the sequence
651  o seqlen - the sequence length
652  o info - information about the sequence
653
654  o returns 0 on success and -1 on error
655
656The seqfwrite function outputs a sequence entry using the sequence
and information given in the function arguments. Only the twelve "entry
information" fields of the SEQINFO structure (date, idlist, description,
comment, organism, history, isfragment, iscircular, alphabet, fragstart,
truelen, rawlen) are used when outputting the entry.
661
662(NOTE: For the ASN.1, PHYLIP and Clustalw formats, remember that it
663is important that you call seqfclose to close the file being written, as the
664package performs some output during that close operation.)
665
666Seqfconvert
667===========
668
 int seqfconvert(SEQFILE *input_sfp, SEQFILE *output_sfp)
670
671  o input_sfp - a SEQFILE structure open for reading
672  o output_sfp - a SEQFILE structure open for writing
673
674  o returns 0 on success and -1 on error
675
676The seqfconvert function retrieves the sequence and information about
677the current sequence of `input_sfp' and then calls seqfwrite with
678`output_sfp', the sequence and the information.
679
680Seqfputs
681========
682
 int seqfputs(SEQFILE *sfp, char *s, int len)
684
685  o sfp - a SEQFILE structure open for writing
686  o s - the string to output
687  o len - the number of chars to output (or 0, specifying to
688    output to the end of s)
689
690  o returns 0 on success and -1 on error
691
692The seqfputs function outputs a string on the output stream opened for
693the SEQFILE structure. The purpose of this function is to give a way to
694mix the output of complete entries with the output of entries produced
695by seqfwrite or seqfannotate. Thus, programs can take differently
formatted entries as input (e.g., a combination of GenBank, FASTA and
EMBL entries), and transform them into a single output file format
without losing any information in input entries whose format matches
the output format (e.g., output all in EMBL format using seqfwrite on the
700GenBank and FASTA entries and seqfputs on the EMBL entries).
701
702The function makes no checks on the string to ensure that the output
703consists of a single format. When using this function, you must ensure
704that the output consists of complete entries of a single format (since
705you can use seqfputs to output entries line by line).
706
707The `len' argument either specifies the number of characters to output
708(if non-zero), or that the complete string should be output (if zero). If
709the length is non-zero, exactly that many characters will be written to
710the output.
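
The `len' rule can be illustrated with a small stand-in written against
a plain stdio stream. This is only a sketch of the rule described
above: `demo_puts' is a hypothetical name, it is not part of the SEQIO
package, and it says nothing about how seqfputs is implemented
internally.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical stand-in for the `len' rule of seqfputs, written
 * against a plain stdio stream.  Returns 0 on success, -1 on error.
 */
int demo_puts(FILE *out, const char *s, int len)
{
  size_t n;

  if (out == NULL || s == NULL || len < 0)
    return -1;

  /* len == 0 means "write everything up to the terminating '\0'". */
  n = (len == 0) ? strlen(s) : (size_t) len;

  /* Otherwise exactly len characters are written. */
  return (fwrite(s, 1, n, out) == n) ? 0 : -1;
}
```

With this rule, a non-zero length lets you output a piece of a larger
buffer (for example, one entry out of a block of text you read
yourself) without first copying it into its own string.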
711
712Seqfannotate
713============
714
715 int seqfannotate(SEQFILE *sfp, char *entry, int entrylen, char
716 *newcomment, int flag)
717
718  o sfp - a SEQFILE structure open for writing
719  o entry - the entry text to output
720  o entrylen - the length of the entry text
721  o newcomment - the comment to add to the entry
722  o flag - remove existing comments (if zero) or append the
723    new comment (if non-zero)
724
725  o returns 0 on success and -1 on error
726
727The seqfannotate function adds extra comment text to an entry as it
728outputs the entry. This way, you can insert new information into an
729entry without losing any of the information in the entry. The new text
730will be output as part of the entry's comment, so that when the output
731entry is read again, seqfcomment can be used to access the inserted
732text. Also, the `flag' argument can be set to strip out any existing
733comments, so that when the output entry is read, the string returned by
734seqfcomment is just that inserted text. This should provide an easy
735method for running a number of experiments on sequences, while
736storing the results of those experiments in the sequences' entries.
737
738The format for the entry must be the same as the format specified when
739the SEQFILE structure was opened for writing. Any mismatch in the two
740formats will result in a parse error.
741
742See file "programr.doc" for a description of how this function can be
743used.
744
745Seqfgcgify
746==========
747
748 int seqfgcgify(SEQFILE *sfp, char *entry, int entrylen)
749
750  o sfp - a SEQFILE structure open for writing
751  o entry - the entry text to convert to the GCG format
752  o entrylen - the length of the entry text
753
754  o returns 0 on success and -1 on error
755
756This function converts an entry from its non-GCG format into its GCG
757format, without losing any of the header line information. This operation
758will work only for the GenBank, PIR, EMBL, Swiss-Prot, FASTA, NBRF
759or IG/Stanford formats. Also, the SEQFILE structure must have been
760opened either with the generic "GCG" format, or the "GCG-*" format
761matching the format of the entry. Any mismatch in formats will result in
762a parse error.
763
764Seqfungcgify
765============
766
767 int seqfungcgify(SEQFILE *sfp, char *entry, int entrylen)
768
769  o sfp - a SEQFILE structure open for writing
  o entry - the entry text to convert from the GCG format
771  o entrylen - the length of the entry text
772
773  o returns 0 on success and -1 on error
774
775This function converts an entry from its GCG format into its non-GCG
776format, without losing any of the header line information. This operation
777will work only for the GenBank, PIR, EMBL, Swiss-Prot, FASTA, NBRF
or IG/Stanford formats. Also, the SEQFILE structure must have been
opened with the non-GCG format, and the format of the entry must be
the corresponding "GCG-*" format. Any mismatch in formats will result
in a parse error.
782
783
784
785BIOSEQ Database Functions
786*************************
787
788Bioseq_read
789===========
790
791 int bioseq_read(char *filelist)
792
793  o filelist - a comma separated list of files (must be BIOSEQ
794    files)
795
796  o returns 0 on success and -1 on error
797
798The bioseq_read function reads all of the files in the comma separated
799list of files and adds the BIOSEQ entries it reads to the internal list of
800entries. These new entries are added to the front of the list, so that the
801newer entries will always be found before the older entries. Thus, you
802should call bioseq_read with the files in increasing priority (the entries
803you want examined first should be the last entries added).
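
The effect of adding entries to the front of the list can be sketched
with an ordinary singly linked list. Everything below is hypothetical
(the `demo_' names, the structure layout and the exact-match name
comparison); the package's real internal list is not described in this
document. The point is only that a front-to-back search finds the most
recently added entry first.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical entry record; the real internal structure is not
 * described in this document. */
struct demo_entry {
  const char *dbname;
  const char *source;          /* which file defined the entry */
  struct demo_entry *next;
};

/* Add an entry to the FRONT of the list.  Returns the new head, or
 * NULL if memory runs out (the old list is left untouched). */
struct demo_entry *demo_add(struct demo_entry *head,
                            const char *dbname, const char *source)
{
  struct demo_entry *e = malloc(sizeof *e);

  if (e == NULL)
    return NULL;
  e->dbname = dbname;
  e->source = source;
  e->next = head;
  return e;
}

/* Return the first entry matching dbname, i.e. the most recently
 * added one, or NULL if there is none. */
const struct demo_entry *demo_find(const struct demo_entry *head,
                                   const char *dbname)
{
  for (; head != NULL; head = head->next)
    if (strcmp(head->dbname, dbname) == 0)
      return head;
  return NULL;
}

/* Free every node of the list. */
void demo_free(struct demo_entry *head)
{
  while (head != NULL) {
    struct demo_entry *next = head->next;
    free(head);
    head = next;
  }
}
```

If a site-wide file is added first and a personal file second, a lookup
for a database defined in both returns the personal definition, which
is why the highest priority file should be read last.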
804
By default, the files specified by the environment variable "BIOSEQ" (if
it exists) are always the first files read in (this is done automatically
807before any bioseq_* function is processed). If this does not suit your
808priority scheme, simply call bioseq_read with the "BIOSEQ"
809environment variable value in its proper position in the priority
810scheme. Those entries will override the previously read entries.
811
812(Note: This function has the capability to read a complete comma
813separated list of files, but the typical use of this function will probably
814be to read a single file, except when handling the "BIOSEQ"
815environment variable.)
816
817Bioseq_check
818============
819
820 int bioseq_check(char *dbspec)
821
822  o dbspec - a database search specifier
823
824  o returns non-zero if the string refers to a known database,
825    or returns zero otherwise
826
827The bioseq_check function can be used to test whether a database
828search specification refers to a database known to the package (i.e., if a
829BIOSEQ entry exists for that database).
830
831Bioseq_info
832===========
833
834 char *bioseq_info(char *dbspec, char *fieldname)
835
836  o dbspec - a database search specifier
837  o fieldname - the name of the information field to be returned
838
839  o returns the text for that field of the BIOSEQ entry for that
840    database.
841    (NOTE: the returned string buffer is a malloc'ed buffer, and
842    it must be freed by you.)
843
844The bioseq_info function is used to retrieve an information field from a
845BIOSEQ entry. The file "user.doc" describes the BIOSEQ entry format
and how information fields can be added to an entry. The returned text
is always stored in a malloc'ed buffer, which you must free once you
have finished using it.
849
Note that the `dbspec' argument does NOT have to be a simple
database name, but can be a full database search specifier. This is so
852you can use the same specification to start a database search and get
853information about the database being searched, without having to
854explicitly extract the database name from the specification.
855
856Bioseq_matchinfo
857================
858
859 char *bioseq_matchinfo(char *fieldname, char *fieldvalue)
860
861  o fieldname - the name of an information field
862  o fieldvalue - the value that the information field should have
863
864  o returns the name of a database.
865    (NOTE: the returned string buffer is a malloc'ed buffer, and
866    it must be freed by you.)
867
868The bioseq_matchinfo function is used to determine which database
869contains an information field with a specific value. (The package uses it
870to find the database corresponding to a particular identifier prefix, as in
871`bioseq_matchinfo("IdPrefix", "ec")'.)
872
873The function traverses the list of BIOSEQ entries, looking for the first
874one which has an information field that matches both the fieldname and
875fieldvalue. The fieldname and fieldvalue matching are both
876case-insensitive, and any whitespace separating the fieldname and the
877fieldvalue is ignored. So, the call above would match the information
878line `>IDPREFIX: EC', regardless of how many spaces separate the `:'
879and `EC'.
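
The matching rule for a single information line can be sketched as
follows. This is an illustration of the rule as stated above, not the
package's own code: `demo_info_matches' and `demo_ncase_eq' are
hypothetical names, the line is assumed to have the `>NAME: VALUE' form
shown in the example, and ignoring trailing whitespace after the value
is an extra assumption of this sketch.

```c
#include <ctype.h>
#include <string.h>

/* Return 1 if the first n characters of a and b are equal when
 * compared case-insensitively, and 0 otherwise. */
static int demo_ncase_eq(const char *a, const char *b, size_t n)
{
  size_t i;

  for (i = 0; i < n; i++) {
    if (a[i] == '\0' || b[i] == '\0')
      return 0;
    if (tolower((unsigned char) a[i]) != tolower((unsigned char) b[i]))
      return 0;
  }
  return 1;
}

/*
 * Return 1 if `line' has the form ">NAME: VALUE" with NAME equal to
 * fieldname and VALUE equal to fieldvalue, ignoring case and any
 * whitespace between the ':' and the value.  Trailing whitespace
 * after the value is also ignored (an assumption of this sketch).
 */
int demo_info_matches(const char *line, const char *fieldname,
                      const char *fieldvalue)
{
  size_t nlen = strlen(fieldname);
  size_t vlen = strlen(fieldvalue);

  if (*line != '>')
    return 0;
  line++;

  if (!demo_ncase_eq(line, fieldname, nlen) || line[nlen] != ':')
    return 0;
  line += nlen + 1;

  while (isspace((unsigned char) *line))
    line++;

  if (!demo_ncase_eq(line, fieldvalue, vlen))
    return 0;
  line += vlen;

  while (isspace((unsigned char) *line))
    line++;
  return *line == '\0';
}
```

Under these assumptions, the line `>IDPREFIX:      EC' matches the
fieldname "IdPrefix" and fieldvalue "ec", while `>IDPREFIX: ECO' does
not.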
880
881Bioseq_parse
882============
883
884 char *bioseq_parse(char *dbspec)
885
886  o dbspec - a database search specifier
887
  o returns the list of files in a string where each file is
    terminated by a newline character and the whole string is
    terminated by a NUL character ('\0').
891    (NOTE: the returned string buffer is a malloc'ed buffer, and
892    it must be freed by you.)
893
894The bioseq_parse function parses a BIOSEQ database search
895specification and determines which files of the database need to be
896searched. That list of files is then returned in a malloc'ed buffer which
you must free. The returned string terminates each filename with a
newline character (even the last file in the list), and the list of
filenames is ended by a NUL character, '\0' (as with all C strings).
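
Because every filename, including the last, ends with a newline, the
returned list can be walked with strchr alone. The helper below is a
minimal sketch with hypothetical names (`demo_each_file',
`demo_print_file'); it works on any string laid out as described above
and does not call the SEQIO package itself.

```c
#include <stdio.h>
#include <string.h>

/*
 * Call fn(name, length, arg) for each newline-terminated filename in
 * `list'.  The names are not '\0'-terminated inside the list, so the
 * length is passed along.  Returns the number of filenames visited.
 */
int demo_each_file(const char *list,
                   void (*fn)(const char *name, size_t len, void *arg),
                   void *arg)
{
  int count = 0;
  const char *nl;

  while ((nl = strchr(list, '\n')) != NULL) {
    if (fn != NULL)
      fn(list, (size_t) (nl - list), arg);
    count++;
    list = nl + 1;
  }
  return count;
}

/* Example callback: print one filename per line to the stream in arg. */
void demo_print_file(const char *name, size_t len, void *arg)
{
  fprintf((FILE *) arg, "%.*s\n", (int) len, name);
}
```

In a real program the string passed as `list' would be the value
returned by bioseq_parse, and it would be released with free once the
loop is finished.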
900
901
902
903Miscellaneous Functions
904***********************
905
906Seqfisafile
907===========
908
909 int seqfisafile(char *filename)
910
911  o filename - a filename (with a possible "@..." single entry
912    access specification)
913
914  o returns non-zero (if the filename refers to an existing file)
915    or zero (if not)
916
917The seqfisafile function can be used to test whether a user-given
918filename (which may contain a single entry access specification in
919addition to the actual filename) refers to an existing file. It first checks
920for the single entry access specification (by looking for a `@'), and then
921checks the prefix to see if it refers to an existing file.
922
923What the function does not do is parse the single entry access
924specification (if it's there). So, a non-zero return value does NOT mean
925that seqfopen will succeed in opening the file.
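
The check described above can be approximated in portable C as shown
below. This is a sketch under stated assumptions, not the package's
code: `demo_isafile' is a hypothetical name, the split is made at the
first `@' in the string, and "existing file" is approximated by "can be
opened for reading with fopen".

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return non-zero if the part of `spec' before the first '@' (or
 * all of `spec' if it has no '@') names a file that can be opened
 * for reading.  Whatever follows the '@' is deliberately not parsed.
 */
int demo_isafile(const char *spec)
{
  const char *at = strchr(spec, '@');
  size_t len = (at != NULL) ? (size_t) (at - spec) : strlen(spec);
  char *name;
  FILE *fp;

  if (len == 0)
    return 0;

  /* Copy the filename prefix so that it can be '\0'-terminated. */
  name = malloc(len + 1);
  if (name == NULL)
    return 0;
  memcpy(name, spec, len);
  name[len] = '\0';

  fp = fopen(name, "r");
  free(name);
  if (fp == NULL)
    return 0;
  fclose(fp);
  return 1;
}
```

Note that the sketch accepts a string such as "myfile@not a valid spec"
as long as "myfile" exists, which mirrors the caveat above: a non-zero
result says nothing about whether the access specification is valid.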
926
927Seqfisaformat
928=============
929
930 int seqfisaformat(char *format)
931
932  o format - a file format string
933
934  o returns non-zero (if the string is a valid file format) or zero
935    (if not)
936
937The seqfisaformat function can be used to test whether a string
938specifies a valid file format or not. It checks the given string against all
939of the valid format names and returns a non-zero/zero value telling
940whether a match occurred.
941
942Seqffmttype
943===========
944
945 int seqffmttype(char *format)
946
947  o format - a file format string
948
949  o returns the format type or T_INVFORMAT (for an invalid
950    format)
951
952The seqffmttype function returns some type information about the
953given format. The possible returned types are T_SEQONLY,
954T_DATABANK, T_GENERAL, T_LIMITED, T_ALIGNMENT and T_OUTPUT.
955See file "format.doc" for more information about what these types
956mean.
957
958Seqfcanwrite
959============
960
961 int seqfcanwrite(char *format)
962
963  o format - a file format string
964
965  o returns non-zero (if that format is writeable) or zero (if not)
966
967The seqfcanwrite function tells whether or not the given file format has
968writing capabilities. Currently, every file format except FASTA-output
969has this capability.
970
971Seqfcanannotate
972===============
973
974 int seqfcanannotate(char *format)
975
976  o format - a file format string
977
978  o returns non-zero (if the format's entries can be annotated)
979    or zero (if not)
980
981The seqfcanannotate function tells whether or not the given file format
982has annotation capabilities. Currently, the file formats GenBank, EMBL,
983Swiss-Prot, PIR, FASTA, NBRF, IG/Stanford and ASN.1 do have this
984capability, and the formats Raw, Plain, FASTA-old, NBRF-old, IG-old,
985PHYLIP, Clustalw and FASTA-output do not.
986
987Seqfcangcgify
988=============
989
990 int seqfcangcgify(char *format)
991
992  o format - a file format string
993
994  o returns non-zero (if the format's entries can be
995    gcgified/ungcgified) or zero (if not)
996
997The seqfcangcgify function tells whether or not the given file format can
998be converted from and to its GCG format. Currently, the file formats
999GenBank, EMBL, Swiss-Prot, PIR, FASTA, FASTA-old, NBRF,
1000NBRF-old, IG/Stanford and IG/Stanford-old have this capability, and
1001the other formats do not.
1002
1003Seqfbytepos
1004===========
1005
 int seqfbytepos(SEQFILE *sfp)
1007
1008  o sfp - a SEQFILE structure open for reading
1009
1010  o returns a byte position, or -1 on error
1011
1012The seqfbytepos function returns the byte position in the current file of
1013the beginning of the current entry. If no current entry exists (because
1014of a previous error or because EOF was reached), the function returns
1015-1.
1016
1017Seqfsetpretty
1018=============
1019
1020 void seqfsetpretty(SEQFILE *sfp, int value)
1021
1022  o sfp - a SEQFILE structure open for writing
1023  o value - either non-zero or zero
1024
1025  o returns nothing
1026
The seqfsetpretty function sets whether the output operations should
add some whitespace to make the sequence look prettier. When
1029outputting in the Plain, FASTA, NBRF or IG/Stanford formats (and their
1030variants), the putseq operation looks at the sequence being output,
1031and may add whitespace to the outputted sequence to make it look
1032prettier.
1033
1034By default, the extra spaces are added when the sequence is DNA, RNA
1035or Protein and when there are no non-alphabetic characters in the
1036sequence (such as alignment characters).
1037
1038Seqfparseent
1039============
1040
1041 SEQINFO *seqfparseent(char *entry, int entrylen, char *format)
1042
1043  o entry - the text of an entry
1044  o entrylen - the length of the entry
1045  o format - the format of the entry
1046
1047  o returns a malloc'ed SEQINFO structure containing the
1048    information about the entry.
1049    (NOTE: This structure must be freed by you.)
1050
1051The seqfparseent function parses an entry and constructs a SEQINFO
1052structure containing the information occurring in the entry. It could be
1053useful if you read in entries on your own, but still want to retrieve the
1054information stored in them.
1055
1056(NOTE: This function cannot parse entries in the PHYLIP, Clustalw or
1057FASTA-output format. An error is triggered if you call seqfparseent
1058with an entry in one of these formats.)
1059
1060Asn_parse
1061=========
1062
1063 int asn_parse(char *begin, char *end, ...)
1064
1065  o begin - the beginning of the ASN.1 text
1066  o end - the end of the ASN.1 text (i.e., the last character of
1067    the ASN.1 text is at `end-1')
1068  o ... - a NULL terminated list of arguments specifying the
1069    sub-records to be searched and the variables to store the
1070    beginning and end positions of the sub-record text.
1071    (NOTE: These arguments must be given in groups of 3 until
1072    the NULL termination, such as in
1073
1074      "seq.id.genbank", &gbstart, &gbend,
1075      "seq.descr", &destart, &deend,
1076      NULL
1077
1078    The format for each triple is
1079
1080      char *subrecord, char **begin_out, char **end_out
1081
1082    and either begin_out or end_out can be NULL.)
1083
1084  o returns a count of the number of sub-records found or a
1085    -1 on error
1086
1087Since I found correctly parsing the ASN.1 text a much harder task than
1088the line oriented file formats, I've added this internal function to the
1089interface so that you might find it easier to move through the ASN.1
1090records. The description below assumes that you are familiar with the
1091ASN.1 text format, and in particular the structure of
1092"Bioseq-set.Seq-set.seq" records in the ASN.1 Bioseq-set hierarchy
1093defined in the NCBI toolkit. The function itself can work with any
1094correctly formatted ASN.1 text, however.
1095
The asn_parse function takes a piece of ASN.1 text (from `begin' to
`end') that specifies one or more records in a hierarchy (i.e., you
need not specify the top-most record, but the beginning of the text
must be at the beginning of some record in the hierarchy and every
record that starts inside the text must be complete). The
1101arguments that you give after `end' specify the sub-records you are
1102interested in, along with pointer variables which will be set to the
1103beginning and end of those sub-records, if they are found.
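
The calling convention (groups of three arguments ended by a NULL) can
be shown in isolation with a small variadic function. The function
below is a hypothetical stand-in with the same argument layout as
asn_parse; it does NOT parse ASN.1 and simply reports the whole
`begin'..`end' range for every triple, so only the argument handling
should be read from it.

```c
#include <stdarg.h>
#include <stddef.h>

/*
 * Hypothetical stand-in with the same argument layout as asn_parse.
 * It does NOT parse anything: for every triple it simply reports the
 * whole [begin, end) range, so that the calling convention can be
 * shown on its own.  Returns the number of triples seen.
 */
int demo_walk_triples(char *begin, char *end, ...)
{
  va_list ap;
  int count = 0;

  va_start(ap, end);
  for (;;) {
    char *subrecord = va_arg(ap, char *);
    char **begin_out, **end_out;

    if (subrecord == NULL)      /* end of the argument list */
      break;

    begin_out = va_arg(ap, char **);
    end_out = va_arg(ap, char **);

    if (begin_out != NULL)      /* either output pointer may be NULL */
      *begin_out = begin;
    if (end_out != NULL)
      *end_out = end;
    count++;
  }
  va_end(ap);
  return count;
}
```

One general C point applies to any function called this way: in a
variadic call the compiler cannot convert the terminator for you, so
strictly portable code writes it as `(char *) NULL' (and an omitted
output variable as `(char **) NULL') rather than as a bare NULL.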
1104
1105The strings naming the desired sub-records should match the
1106structure of sub-records in the text's hierarchy. In most cases, the
1107string is simply a list of the initial keywords of the sub-records
1108separated by periods. However, when a sub-record does not have an
1109initial keyword, but begins with an open brace, that open brace must be
included in the sub-record identifier string. One example of this in the
Bioseq hierarchy is the Bioseq-set.seq-set.seq.annot record, as in the
following:
1113
1114annot {
1115  {
1116    data
1117      ftable {
1118        {
1119          data
1120            prot {
1121              name {
1122                "nifS protein" } } ,
1123          location
1124            whole
1125              gi 77963 } } } }
1126
1127To access sub-records in this record, the strings must appear as
1128"annot.{.data" or "annot.{.data.ftable.{.location". The open braces after
1129keywords are not specified, but braces without keywords are. Also,
1130note that in this example, the keywords "prot", "whole" and "gi" are
1131NOT initial keywords recognized by asn_parse, because they do not
1132appear at the beginning of a sub-record (i.e., after an open brace
1133starting a sub-record or after a comma separating sub-records).
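
To make the structure of these identifier strings concrete, the sketch
below splits one at its periods. The helper names (`demo_path_part',
`demo_path_depth') are hypothetical and the code is not taken from the
package; it only shows that a keyword-less brace counts as a component
of its own.

```c
#include <string.h>

/*
 * Copy component number `index' (counting from 0) of a
 * period-separated sub-record string into buf.  Returns 1 on
 * success, or 0 if there is no such component or it does not fit.
 */
int demo_path_part(const char *path, int index, char *buf, size_t bufsize)
{
  size_t len;

  if (index < 0)
    return 0;

  /* Skip over `index' periods to reach the wanted component. */
  while (index > 0) {
    path = strchr(path, '.');
    if (path == NULL)
      return 0;
    path++;
    index--;
  }

  len = strcspn(path, ".");
  if (len == 0 || len >= bufsize)
    return 0;
  memcpy(buf, path, len);
  buf[len] = '\0';
  return 1;
}

/* Count the components of a period-separated sub-record string. */
int demo_path_depth(const char *path)
{
  int n = 0;
  char part[64];

  while (demo_path_part(path, n, part, sizeof part))
    n++;
  return n;
}
```

For "annot.{.data.ftable.{.location" the components are `annot', `{',
`data', `ftable', `{' and `location': six levels, two of which are
braces that have no keyword of their own.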
1134
1135In the parameter description above (the `...' description), I'm assuming
1136that the user gives asn_parse the text for a Bioseq `seq' record (see
1137the NCBI toolkit documentation for the structure of this record) and
1138wants to get the "id.genbank" and "descr" sub-records. As another
1139example, here is an excerpt from the SEQIO function implementing the
1140seqfinfo operation for the ASN.1 file format. The first part of it looks for
1141identifiers and accession numbers, and the second looks for
1142comments.
1143
  /*
   * Find the "id" and "descr" sub-records in the "seq" record.
   */
  idstr = destr = NULL;
  status = asn_parse(entry, entry + entrylen,
                     "seq.id", &idstr, &idend,
                     "seq.descr", &destr, &deend,
                     NULL);
  if (status == -1) {
    /* error handling code here */
  }

  /*
   * If there was an "id" sub-record, look for all of the possible
   * sub-records that specify database identifiers or accession
   * numbers.
   */
  if (idstr != NULL) {
    pirname = spname = gbname = emname = otname = prfname = dbjname = NULL;
    piracc = spacc = gbacc = emacc = otacc = prfacc = dbjacc = NULL;
    pdbmol = gistr = giim = gibbs = gibbm = NULL;
    status = asn_parse(idstr, idend,
                       "id.pir.name", &pirname, &pnend,
                       "id.swissprot.name", &spname, &spend,
                       "id.genbank.name", &gbname, &gbend,
                       "id.embl.name", &emname, &emend,
                       "id.ddbj.name", &dbjname, &dbjend,
                       "id.prf.name", &prfname, &prfend,
                       "id.other.name", &otname, &otend,
                       "id.pir.accession", &piracc, &paend,
                       "id.swissprot.accession", &spacc, &spaend,
                       "id.genbank.accession", &gbacc, &gbaend,
                       "id.embl.accession", &emacc, &emaend,
                       "id.ddbj.accession", &dbjacc, &dbjaend,
                       "id.prf.accession", &prfacc, &prfaend,
                       "id.other.accession", &otacc, &otaend,
                       "id.pdb.mol", &pdbmol, &pmend,
                       "id.gi", &gistr, &gisend,
                       "id.giim.id", &giim, &giimend,
                       "id.gibbsq", &gibbs, &gibbsend,
                       "id.gibbmt", &gibbm, &gibbmend,
                       NULL);
    if (status == -1) {
      /* error handling code here */
    }

    /*
     * If the PIR identifier is found, extract it from the string
     * pirname+4 (+4 to skip the "name" keyword) to pnend using
     * internal function `add_id'.
     */
    if (pirname != NULL)
      add_id(&info, "pir", pirname+4, pnend);

    ...
  }

  /*
   * If the "descr" sub-record exists, look for a "comment"
   * sub-record.
   */
  if (destr != NULL) {
    comment = NULL;
    status = asn_parse(destr, deend,
                       "descr.comment", &comment, &cmend,
                       NULL);
    if (status == -1) {
      /* error handling code here */
    }

    /*
     * If that first comment was found, use internal function
     * `add_comment' to add it to the SEQINFO structure, and then
     * look for other comments in the "descr" record, since there
     * can be more than one "comment" sub-record.
     *
     * Note the first two arguments to the asn_parse call are
     * `cmend+1' and `deend'.  So, these searches start from just
     * after the end of the last found comment and run until the
     * next "comment" record is found, or until the end of the
     * "descr" record.
     */
    if (comment != NULL) {
      add_comment(&info, comment+7, cmend, 1);
      while (asn_parse(cmend+1, deend,
                       "comment", &comment, &cmend,
                       NULL) == 1)
        add_comment(&info, comment+7, cmend, 1);
    }
  }
1231
A couple of notes. First, if both the beginning and end pointer variables
are specified (such as `&comment' and `&cmend' above), then either
both variables will be set to a value if the sub-record is found, or both
will be left unchanged if the sub-record is not found. Thus, in the
example above, I only needed to set and test whether the beginning
pointer variable was NULL, and did not need to worry about the value
of the end pointer variable (if the beginning variable had been changed,
then I was guaranteed that the end variable had also been changed).
1240
1241Second, the function returns the very beginning of the record it is
1242searching for, including the initial keyword starting the record. So,
1243when using the beginning and end variable values from one search as
1244the `begin' and `end' parameters to another search, the record
1245specifications for that second search must begin with that initial
1246keyword. In the example above, the first search looked for "seq.id" and
1247"seq.descr", and the later searches then looked for "id.pir.name" and
1248"descr.comment".
1249
Finally, the return value of the function is the count of the number of
sub-records found, or a -1 if a parse error occurred while scanning
the text.
1253
1254
1255
1256Error Handling/Reporting
1257************************
1258
1259Seqferrno, Seqferrstr
1260=====================
1261
1262 extern int seqferrno;
1263 extern char seqferrstr[];
1264
1265These are variables which are set whenever an error occurs in the
1266SEQIO package. seqferrno gives a numerical error similar to Unix's
1267errno, and seqferrstr holds the error string that the SEQIO package
1268would have output (or perhaps did output depending on the error
1269policy).
1270
1271The values that seqferrno can have are as follows:
1272
1273E_EOF (-1)
1274   Reached the end of the file or database search.
1275E_NOERROR (0)
1276   There is no error (the default value).
1277E_OPENFAILED (1)
1278   An error occurred when opening a file.
1279E_READFAILED (2)
1280   An error occurred when reading a file.
1281E_NOMEMORY (3)
1282   Ran out of memory.
1283E_PROGRAMERROR (4)
1284   A bug was detected in the SEQIO package itself.
1285E_PREVERROR (5)
1286   A previous error occurred which does not permit the current
1287   operation to succeed.
1288E_PARAMERROR (6)
   An invalid parameter was passed to a SEQIO function.
1290E_INVFORMAT (7)
1291   An unknown file format was specified as the argument to a
1292   function (like seqfopen).
1293E_DETFAILED (8)
1294   The format of a file could not be automatically determined from
1295   its contents.
1296E_PARSEERROR (9)
1297   A parse error occurred while scanning a file.
1298E_DBPARSEERROR (10)
1299   A parse error occurred while scanning a BIOSEQ entry or a
1300   database search specification.
1301E_DBFILEERROR (11)
1302   The BIOSEQ entry for a database specifies some files which
1303   could not be located or opened.
1304E_NOSEQ (12)
1305   A sequence entry does not contain a sequence. (This error only
1306   occurs if the sequence is actually requested, not for every entry
1307   without a sequence.)
1308E_DIFFLENGTH (13)
1309   There is a discrepancy between the length of a sequence, as
1310   specified in the sequence entry, and the number of characters
1311   found for the sequence.
1312E_INVINFO (14)
1313   When outputting an entry, one of the fields in the SEQINFO
1314   structure used to generate the output contains invalid
1315   information (i.e., it is not formatted correctly).
1316E_FILEERROR (15)
1317   A parse error occurred while parsing a single entry access
1318   specification that was given with a filename to seqfopen.
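
When logging or debugging, it can be handy to turn a seqferrno value
back into its symbolic name. The table below simply restates the list
above; `demo_errname' and `demo_errnames' are hypothetical names and
are not provided by the SEQIO package.

```c
#include <stddef.h>

/* Names of the seqferrno values, indexed by (value + 1) because the
 * lowest value, E_EOF, is -1. */
static const char *const demo_errnames[] = {
  "E_EOF", "E_NOERROR", "E_OPENFAILED", "E_READFAILED", "E_NOMEMORY",
  "E_PROGRAMERROR", "E_PREVERROR", "E_PARAMERROR", "E_INVFORMAT",
  "E_DETFAILED", "E_PARSEERROR", "E_DBPARSEERROR", "E_DBFILEERROR",
  "E_NOSEQ", "E_DIFFLENGTH", "E_INVINFO", "E_FILEERROR"
};

/* Return the symbolic name for a seqferrno value, or NULL if the
 * value is outside the documented range of -1 to 15. */
const char *demo_errname(int value)
{
  size_t count = sizeof demo_errnames / sizeof demo_errnames[0];

  if (value < -1 || (size_t) (value + 1) >= count)
    return NULL;
  return demo_errnames[value + 1];
}
```

A program would typically pass seqferrno to such a helper right after
a failed call, alongside the message text held in seqferrstr.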
1319
1320Seqfperror
1321==========
1322
1323 void seqfperror(char *s)
1324
1325  o s - a string (usually the program name) to be printed
1326    before the error string
1327
1328  o returns nothing
1329
The seqfperror function is similar to Unix's perror function. It
1331outputs to standard error the text of the last error message (i.e., the
1332contents of seqferrstr). If the argument to the function is not NULL,
1333then that argument string and the string ": " are first output. This
1334argument is typically used to output the program name before the error
1335message.
1336
1337Seqfsetperror
1338=============
1339
1340 void seqfsetperror(void (*perr_fn)(char *))
1341
1342  o perr_fn - a void function that takes a string as its
1343    argument
1344
1345  o returns nothing
1346
1347The seqfsetperror function can be used to replace the default method
1348the SEQIO package has for reporting errors. When an error is detected
1349and the error policy is such that an error message should be output, a
1350default print error function is used to output the error message to
1351stderr. This function resets the print error function to either one of
1352your own choosing (if the argument is not NULL), or back to the default
1353print error function (if the argument is NULL).
1354
1355Seqferrpolicy
1356=============
1357
1358 int seqferrpolicy(int pe)
1359
1360  o pe - sets the error policy
1361
1362  o returns the old error policy
1363
The seqferrpolicy function sets an error policy for the SEQIO package to
1365follow when it detects an error has occurred. The SEQIO package
1366detects three kinds of errors:
1367
1368 1. Warnings - when the package detects that something is wrong,
1369   but can still complete the current operation and return an actual
1370   value.
1371
1372   Examples of warnings are E_DIFFLENGTH, where a sequence is
1373   found but its length may be incorrect, or E_DETFAILED, where
1374   the file format can't be determined and the Plain format is used.
1375
1376   On a warning, an error message is output.
1377
1378   The warning errno values are: E_DETFAILED,
   E_DBFILEERROR (sometimes), E_DIFFLENGTH.
1380
1381 2. Errors - when the package is unable to complete the current
1382   operation because of some failure, but this failure does not
1383   adversely affect the state of the SEQFILE structures (so the
1384   SEQIO package can keep going in later calls).
1385
1386   Examples of errors are E_PARAMERROR, when an invalid
1387   parameter is given to a function, or E_NOSEQ, where no
1388   sequence can be returned since no sequence occurs in the
1389   current entry.
1390
1391   The one variation on the handling of these errors (this operation
1392   cannot finish, but future operations using that SEQFILE
1393   structure are okay) is when an E_READFAILED or
1394   E_PARSEERROR is detected while reading a file. In those cases
1395   (i.e., on the first parse error or when the file can't be read), the
1396   package stops reading that file and returns an error result.
1397   However, what happens to the next call to read the SEQFILE
1398   structure depends on whether seqfopen was called to read a
1399   single file or whether seqfopendb was called to read a number of
1400   files. If seqfopen was called, the next call to read another entry or
1401   sequence will get an EOF signal, whereas when a database is
1402   being read, the package will move on to the next file in the
1403   database and attempt to read the entries from there. So, the
1404   result of the next call to seqfread, seqfgetseq, seqfgetentry or
1405   seqfgetinfo after a read/parse error depends on whether there
1406   are any more files to read.
1407
1408   On an error, an error message is output and an error value is
1409   returned as the result of the called function.
1410
1411   The error errno values are: E_EOF, E_OPENFAILED,
1412   E_READFAILED, E_PREVERROR, E_PARAMERROR,
1413   E_INVFORMAT, E_PARSEERROR, E_DBPARSEERROR,
   E_DBFILEERROR (sometimes), E_NOSEQ, E_FILEERROR.
1415
1416 3. Fatal errors - when the error leaves the state of the SEQFILE
   structure (or the SEQIO package) in an unrecoverable state.
1418
1419   No more operations can be performed using that SEQFILE
1420   structure, and you may only call seqfclose to close the structure
1421   (if the program does not exit on the error). If further calls are
1422   made, the error status E_PREVERROR is always returned.
1423
   Examples of fatal errors are E_NOMEMORY, where the package
   runs out of memory, or E_PROGRAMERROR, when an internal
   program bug is detected.
1427
1428   On a fatal error, an error message is output, the system function
1429   `exit' is called (unless disabled using seqferrpolicy), and an
1430   error value is returned as the result of the function (if the exit call
1431   is disabled).
1432
1433   The fatal errno values are: E_NOMEMORY,
1434   E_PROGRAMERROR.
1435
1436How these errors are handled can be determined by which of the
1437following predefined constants is passed to seqferrpolicy:
1438
1439PE_NONE
1440   Disable all warning/error/fatal message output and all calls to
1441   exit. So, only seqferrno and seqferrstr are set on an error (and
1442   possibly an error value is returned).
1443PE_WARNONLY
1444   Allow warning messages to be printed, but disable error/fatal
1445   messages and calls to exit.
1446PE_ERRONLY
1447   Allow error/fatal messages, but disable warning messages and
1448   calls to exit.
1449PE_NOWARN
1450   Allow error/fatal messages and calls to exit, but disable warning
1451   messages.
1452PE_NOEXIT
1453   Allow all message output, but disable calls to exit.
1454PE_ALL
1455   Allow all message output and calls to exit.
1456
1457The default policy is PE_ALL, where all output is performed and fatal
1458errors cause the SEQIO package to call exit.
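
The six policies differ only in three on/off choices: whether warning
messages are printed, whether error and fatal messages are printed,
and whether fatal errors call exit. The table below restates the
descriptions above in that form. The `DEMO_PE_*' identifiers and the
`demo_' names are hypothetical and exist only for this sketch; the real
PE_* constants come from the SEQIO header, and their numeric values are
not given in this document.

```c
/* Hypothetical policy identifiers for this sketch only.  The real
 * PE_* constants come from the SEQIO header, and their numeric
 * values are not given in this document. */
enum demo_policy {
  DEMO_PE_NONE, DEMO_PE_WARNONLY, DEMO_PE_ERRONLY,
  DEMO_PE_NOWARN, DEMO_PE_NOEXIT, DEMO_PE_ALL
};

struct demo_actions {
  int warn_msgs;    /* warning messages are printed */
  int error_msgs;   /* error and fatal messages are printed */
  int exit_calls;   /* fatal errors call exit */
};

/* One row per policy, in the order of the enum above. */
static const struct demo_actions demo_table[] = {
  /* DEMO_PE_NONE     */ { 0, 0, 0 },
  /* DEMO_PE_WARNONLY */ { 1, 0, 0 },
  /* DEMO_PE_ERRONLY  */ { 0, 1, 0 },
  /* DEMO_PE_NOWARN   */ { 0, 1, 1 },
  /* DEMO_PE_NOEXIT   */ { 1, 1, 0 },
  /* DEMO_PE_ALL      */ { 1, 1, 1 }
};

/* Look up what a policy allows. */
struct demo_actions demo_policy_actions(enum demo_policy p)
{
  return demo_table[p];
}
```

Read this way, PE_NOEXIT is the natural choice for a long-running
program that wants to see every message but must decide for itself
what to do after a fatal error.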
1459
1460The function returns the old error policy as its return value. So, if you
1461want to turn off all error reporting for a SEQIO function call, you can do
1462the following:
1463
1464old_pe = seqferrpolicy(PE_NONE);
1465
1466/* Do the SEQIO function call */
1467
1468seqferrpolicy(old_pe);
1469
1470if (seqferrno != E_NOERROR && seqferrno != E_EOF) {
1471  /* Handle the error/warning that occurred during the function call */
1472}
1473
1474
1475James R. Knight, knight@cs.ucdavis.edu
1476June 26, 1996
1477