1SEQIO -- A Package for Sequence File I/O
2
3
4PROGRAMR.DOC - Guide to Using the SEQIO Package
5***********************************************
6
The main documentation on the SEQIO interface is given in "seqio.doc".
This file is more of a "how-to" guide to using the package. These are
the ideas I had for using the package while I was designing and
implementing it, broken up into five sections:
11
12 1. reading sequences and database searches,
13 2. extracting information from entries,
14 3. writing/converting/annotating entries,
15 4. BIOSEQ stuff (database information processing),
16 5. Error handling.
17
18At the end of the file, there is an additional section discussing how to
19port the package to other machines.
20
21I'm going to concentrate on the interface itself, so in all of the examples
22below, you will see constants for things like filenames, formats,
23database names, and so on. In a normal program those things would be
24specified as part of the user interface, but here I'm going to make them
25as simple as possible in order to illustrate the interface functions more
26clearly.
27
28Jim
29
30
31
32Reading Sequences and Database Searches
33***************************************
34
35This package actually evolved from a module of some sequence
36analysis software I was writing, as well as the three or four programs I
37had designed to some extent and was planning to implement (and still
38am). In all of those programs, I needed a module to read in the
39sequences in a sequence file, and I had three goals for that module: 1)
40make it simple for the rest of the program to use, 2) make it as fast as
41possible, and 3) remove as many size limitations as possible (from
42sequence size to maximum line length and so on). Those goals, and the
43focus on reading files and databases, remained in the design of the
44SEQIO package. However, in this file you won't hear much about goals
452 and 3, because they don't show up when your programs are written,
46only when they execute.
47
A program that reads a sequence file or database looks a lot like one
that uses the stdio package to do normal file I/O: it opens the file or
database, repeatedly calls a function to read the next sequence, and
closes the file or database when it hits EOF.
52
53    int len;
54    char *seq;
55    SEQFILE *sfp;
56
57    if ((sfp = seqfopen("my_sequences", "r", "FASTA")) == NULL)
58      exit(1);
59
60    while ((seq = seqfgetseq(sfp, &len, 0)) != NULL) {
61      if (len > 0 && isa_match(seq, len)) {
62        /* Found a match */
63      }
64    }
65
66    seqfclose(sfp);
67
This code snippet is an example of searching the sequences of a
FASTA-formatted file for the sequences that match (whatever it is
you might want to match). To read a database instead of a file, just
replace the "sfp = seqfopen(...)" call with
"sfp = seqfopendb("genbank")", to read the GenBank database for
example. Another simple change you can make to this example is to
read all of the file/database entries, instead of the sequences those
entries contain. To do that, simply replace the call to
"seq = seqfgetseq(sfp, &len, 0)" with
"entry = seqfgetentry(sfp, &len, 0)" and the entry text for each
entry is returned.
78
79With either or both of these alterations, the rest of the program will
80work in exactly the same way, with two minor exceptions. First, when
81the `seqfgetentry' for `seqfgetseq' substitution is made and the entries
82in the file or database contain more than one sequence, `seqfgetseq'
83will read each sequence in the entry, whereas `seqfgetentry' will only
84read the entry once regardless of how many sequences occur in the
85entry.
86
Second, when searching databases using `seqfopendb', a BIOSEQ file
must have been created and the "BIOSEQ" environment variable must
include that file. See the file "user.doc" for information on how to
create BIOSEQ files. That file also describes the strings `seqfopendb'
can take to specify a database search.
92
93Differences between SEQIO and stdio
94===================================
95
96There are some small differences between the SEQIO calls in the
97example above and the stdio calls used to do file I/O. First, the
`seqfopen' function takes a third argument which specifies the format of
the file being opened. That argument either must be a string naming a
100supported file format (see "user.doc" and "format.doc" for the list of
101those formats), or must be NULL, in which case the format of the file is
102automatically determined from the text in the file.
103
104Second, the arguments to `seqfgetseq' are different from any of the
105fget* functions in the stdio package. The reason is that one of the
106deficiencies of the stdio package (in my opinion) is that the
107programmer has to worry about where and how to store the characters
108read in. I wanted programs using this package to worry as little as
109possible about how to store the read-in sequences and entries. Thus,
110the SEQIO package always remembers a "current" sequence and entry,
111and the sequence, entry or information about the sequence can be
112retrieved as needed.
113
114In addition, the package can return the sequence/entry/information
115character strings in one of two ways, either using an internal buffer or
116by malloc'ing a new buffer to store the string. The third argument to
117`seqfgetseq' is a flag telling how the sequence text should be returned
118(zero specifies an internal buffer and non-zero specifies a malloc'ed
119buffer). So, the `seqfgetseq' call above tells the SEQIO package to read
120the next sequence in the file, make that the "current" sequence, and
121return that sequence's text using its internal buffers. As another
122example, the following snippet shows how to accumulate all of the
123sequences of a file into an array, using malloc'ed buffers so that each
124sequence remains available until the malloc'ed buffer is freed:
125
126    int i, len;
127    char *seq, *seqs[400];
128    SEQFILE *sfp;
129
130    if ((sfp = seqfopendb("swiss-prot")) == NULL)
131      exit(1);
132
133    for (i=0; i < 400 && (seq = seqfgetseq(sfp, NULL, 1)) != NULL; ) {
134      if (*seq != '\0')
135        seqs[i++] = seq;
136    }
137    seqfclose(sfp);
138
139    /* Do the analysis of the sequences. */
140
141    while (i > 0)
142      free(seqs[--i]);
143
144Giving a non-zero third argument to `seqfgetseq' tells the SEQIO
145package to malloc a new buffer for each sequence, so they can be kept
146around after the next call to the package (the internal buffers are
147reused, so their contents may be changed on the next call to a SEQIO
148function).
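
To make the buffer rule concrete, here is a minimal standalone sketch.
It does not call the SEQIO package at all: `next_in_shared_buffer' is a
hypothetical stand-in for any function that hands back a reused
internal buffer, and `keep_copy' is a hypothetical helper that does by
hand what a non-zero flag asks the package to do for you.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in (NOT a SEQIO function): returns its result in
 * one static buffer that is overwritten on every call.
 */
const char *next_in_shared_buffer(const char *text)
{
  static char buffer[64];

  strncpy(buffer, text, sizeof(buffer) - 1);
  buffer[sizeof(buffer) - 1] = '\0';
  return buffer;
}

/*
 * Hypothetical helper: copy a string into malloc'ed storage that the
 * caller owns and must free.  Returns NULL if malloc fails.
 */
char *keep_copy(const char *s)
{
  size_t size = strlen(s) + 1;
  char *copy = malloc(size);

  if (copy != NULL)
    memcpy(copy, s, size);
  return copy;
}
```

A pointer into the shared buffer sees the new text after the next call,
while the copy made with `keep_copy' still holds the old text until it
is freed.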
149
150Also, note in this example that the second argument to the `seqfgetseq'
151function is NULL. One of the guarantees the SEQIO package makes is
152that the character strings of sequences and entries will be
153NULL-terminated strings, so you don't necessarily need the string's
154length to know where the sequence/entry ends. This also makes it easy
155to output the sequence or entry text, as in this version of the first
156example above which outputs the text of each entry whose sequence
157matches:
158
159    while ((seq = seqfgetseq(sfp, &len, 0)) != NULL) {
160      if (len > 0 && isa_match(seq, len)) {
161        /* Found a match */
162        entry = seqfentry(sfp, NULL, 0);
163        fputs(entry, stdout);
164      }
165    }
166
167Note the use of `seqfentry' instead of `seqfgetentry'. The function
168`seqfentry' just returns the text of the "current" entry, and does NOT
169read the next entry in the file. With this use of a "current" sequence
170and entry, a program can get multiple pieces of information about a
171sequence/entry one piece at a time, without having to worry about
172getting everything it needs at once.
173
174The third and fourth differences between the stdio package and the
175SEQIO package in these examples are slightly harder to see. They
176involve the handling of errors. The third difference is that the program
177simply exits when `seqfopen' returns NULL, seemingly without printing
178an error message, and the fourth difference is the use of "len > 0"
179and "*seq != '\0'" as additional tests to see if a sequence was
180returned by `seqfgetseq'.
181
182The long answers for these differences are given in the Error Handling
183section and file "seqio.doc", where I talk about the error handling. The
184short answers are that the SEQIO package by default outputs error
185messages when an error occurs (but this can be disabled), and that the
186`seqfgetseq' and `seqfgetentry' functions are unique in that they return
187one of three values: 1) a string of characters on a successful read, 2)
188an empty string with length 0 if there is a problem reading the next
189sequence/entry (such as when the next entry contains no sequence),
190but that problem is not a fatal error, and 3) NULL if end-of-file is
191reached or a fatal error occurs (an error for which no more reading can
192be done).
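
The three cases can be written down as a small decision function. This
is only a sketch of the rule stated above: `classify_read' and the
`read_status' names are hypothetical and are not part of the SEQIO
interface. It takes the pointer and length that a call such as
`seqfgetseq' hands back.

```c
#include <stddef.h>

/* Hypothetical names for the three outcomes described in the text. */
enum read_status {
  READ_OK,    /* a string of characters was returned              */
  READ_SKIP,  /* empty string, length 0: non-fatal, keep reading  */
  READ_STOP   /* NULL: end-of-file or a fatal error               */
};

/* Classify the (pointer, length) pair returned by a read call. */
enum read_status classify_read(const char *text, int len)
{
  if (text == NULL)
    return READ_STOP;
  if (len == 0)
    return READ_SKIP;
  return READ_OK;
}
```

In the loops shown earlier, the "!= NULL" test covers READ_STOP and the
"len > 0" test separates READ_SKIP from READ_OK.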
193
194The functions `seqfopen' and `seqfopendb' are the common ways to
195open a file/database, and `seqfgetseq', `seqfgetentry' and `seqfgetinfo'
196(described in the next section) are the common ways to read in the
sequences/entries in the file/database. There are a couple of other
ways, using the functions `seqfopen2', `seqfread' and `seqfgetrawseq'.
Those functions are described in file "seqio.doc".
200
201
202
203Extracting Information from Entries
204***********************************
205
206For most of the sequence file formats that have been created (and most
207of the formats supported by the package), the entries in a file contain
208quite a bit more information than just the sequence itself. For instance,
209in the GenBank database, the sequence characters make up only a
210third of the characters in the database files. The rest of the database
211contains information about those sequences (identifiers, descriptions,
references, features, and so on). In designing the SEQIO package, I
tried to do two things: provide a method to automatically extract a
number of the more common (and less complex) pieces of information
stored in an entry, and make it as easy as possible to extract other
information.
217
218The Raw Sequence
219================
220
221One such piece of information that is often needed is the "raw"
222sequence text, giving both the sequence characters and any alignment
223or structural notation characters that are associated with the
224sequence. In many entries, the sequence is expressed not by itself, but
225in terms of an alignment of that sequence with others (such as the
226sequences in the other entries of the file).
227
228The function `seqfrawseq' can be used to retrieve both the sequence
229and the alignment/structure information specified with the sequence.
This function works exactly the same as `seqfentry' and `seqfsequence',
except that each function returns a different string. Function
232`seqfentry' returns the complete entry text, `seqfsequence' returns
233only the characters of the sequence (typically all of the alphabetic
234characters), and `seqfrawseq' returns the sequence and the
235alignment/structure characters (typically all characters except
236whitespace and digits).
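
As a rough illustration of the two "typical" character classes, here is
a standalone sketch that counts them in a piece of raw text. The
function names are hypothetical, and the rules are only the typical
ones quoted above; the actual rules for each format are those the
package implements (see "format.doc").

```c
#include <ctype.h>

/* Count the alphabetic characters (the typical sequence characters). */
int count_sequence_chars(const char *text)
{
  int count = 0;

  for (; *text != '\0'; text++)
    if (isalpha((unsigned char) *text))
      count++;
  return count;
}

/*
 * Count every character except whitespace and digits (the typical
 * raw sequence characters, including gap and notation characters).
 */
int count_raw_chars(const char *text)
{
  int count = 0;

  for (; *text != '\0'; text++)
    if (!isspace((unsigned char) *text) && !isdigit((unsigned char) *text))
      count++;
  return count;
}
```

For a line such as "  1 ACG-T..A 10", the first function counts the
five letters and the second also counts the '-' and '.' characters.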
237
238The SEQINFO Structure
239=====================
240
241Other information contained in an entry is extracted and typically
242returned through the use of a SEQINFO structure defined by the
243package. The file "seqio.h" defines the SEQINFO structure used to
244store information that the SEQIO package extracts from an entry. In
245addition, there are interface functions which can be used to retrieve
246each of the individual fields in the SEQINFO structure. The SEQINFO
247structure is defined as follows:
248
249typedef struct {
250  char *dbname, *filename, *format;
251  int entryno, seqno, numseqs;
252
253  char *date, *idlist, *description;
254  char *comment, *organism, *history;
255  int isfragment, iscircular, alphabet;
256  int fragstart, truelen, rawlen;
257} SEQINFO;
258
The first six fields hold what the SEQIO package itself knows about
the current sequence:
261
262dbname
263   The name of the database being searched (if this is a search of
264   an actual database).
265filename
266   The name of the file currently being read.
267format
268   The format of the file (and the current entry).
269entryno
270   The location of the current entry in the file (if entryno is 10, then
271   the current entry is the tenth entry in the file).
272seqno
273   The location of the current sequence in the current entry (if
274   seqno is 3, then the current sequence is the third in the current
275   entry).
276numseqs
277   The number of sequences contained in the current entry.
278
279So, the current sequence's location is the `seqno' sequence of the
280`entryno' entry of the file `filename' (possibly of the database
281`dbname'). The `format' string gives the entry's format, and the entry
282contains `numseqs' sequences.
283
284The other twelve fields are information extracted from the current entry
285(see "format.doc" for the details about which information is retrieved
286for each file format):
287
288date
289   A single date giving the last time the entry was either created or
290   updated. Its format should be day-month-year, as in
291   31-JAN-1995.
292idlist
   The list of identifiers given in the entry. The idlist is a string
   containing a vertical-bar-separated list of identifiers, each
   consisting of an identifier prefix, a ':' and the identifier.
296   See file "user.doc" for more information about identifiers and
297   identifier prefixes.
298description
299   A description of the sequence or sequences in the entry. This is
300   the "Title" or "Definition" line in some file formats. This string
301   should consist of a single "line" of text, although it can be of any
302   length. So, no newlines should appear in this text (they are
303   removed and added when the description is read from and
304   output in the sequence entries).
305comment
306   A block of text giving a comment about the sequence. The string
307   can contain one or more lines of any length. The one restriction
308   to the text appearing in a comment is that any block of lines at
309   the end of an entry's comment section where each line begins
310   with the string "SEQIO" is reserved for other use by the package
311   (this block holds extra identifiers or the `history' lines).
312organism
313   The name of the organism the sequence was taken from. Right
314   now, this field can contain any single "line" of text, although I
315   would like to standardize the contents of this field. It's on my
316   TODO list.
317history
318   This holds the lines of text placed in the comment section of
319   entries which describe previous SEQIO operations on this entry,
320   i.e., it holds the history of alterations and updates made to this
321   entry by programs using the SEQIO package. Any block of lines
322   at the end of a comment section where each line begins with the
323   string "SEQIO" is not considered part of the comment, but part
324   of the history.
325isfragment
326   This integer is non-zero if the sequence is a fragment of a larger
327   sequence, and zero if the sequence is complete (or if it is not
328   known whether the sequence is a fragment).
329iscircular
   This integer is non-zero if the sequence is a circular sequence,
   and zero if it is a linear sequence (or if its circularity is not
   known).
333alphabet
334   This integer is one of the predefined constants DNA, RNA,
335   PROTEIN or UNKNOWN. Its value is UNKNOWN unless either
336   the database's BIOSEQ entry (information field "Alphabet") or
337   the entry itself explicitly specifies the alphabet. The package
338   does not try to guess the alphabet.
339fragstart
340   When the sequence is a fragment of a larger sequence and the
341   location of this fragment in the larger sequence is known, this
342   value gives the starting position of the fragment. If this value is
343   not known (or the sequence is complete), fragstart is set to 0.
344truelen
345   This is the "true" length of the sequence, i.e., the length of the
346   sequence without any gap characters or notational characters.
347   Typically, these are just the alphabetic characters.
348rawlen
349   This is the "raw" length of the sequence, i.e., the length of the
350   sequence which includes the gap and notational characters.
351   Typically these are all characters except whitespace and digits.
352
Accessing this information for an entry is very similar to accessing
the sequence and entry text. The functions `seqfgetinfo' and
355`seqfinfo' work along the lines of `seqfgetentry' and `seqfentry', and so
356the following code snippet finds and outputs all of the entries with
357circular sequences:
358
359    char *entry;
360    SEQINFO *info;
361    SEQFILE *sfp;
362
363    if ((sfp = seqfopendb("genbank")) == NULL)
364      exit(1);
365
366    while ((info = seqfgetinfo(sfp, 0)) != NULL) {
367      if (info->iscircular) {
368        entry = seqfentry(sfp, NULL, 0);
369        fputs(entry, stdout);
370      }
371    }
372    seqfclose(sfp);
373
374and this code snippet finds and outputs the entry (or entries) with a
375given accession number:
376
    char *s, *t, *idlist, *entry;
    SEQFILE *sfp;

    if ((sfp = seqfopendb("genbank")) == NULL)
      exit(1);

    while (seqfread(sfp, 1) == 0) {
      idlist = seqfidlist(sfp, 0);
      if (idlist != NULL) {
        /*
         * Scan the idlist, looking for an identifier whose prefix is
         * "acc" and whose number matches the accession.
         */
        s = idlist;
        while (*s) {
          /* Advance s to the end of the identifier starting at t. */
          for (t=s; *s && *s != '|'; s++) ;

          /* Match the whole identifier, not just its first 10 chars. */
          if (s - t == 10 && strncmp(t, "acc:X01828", 10) == 0) {
            entry = seqfentry(sfp, NULL, 0);
            fputs(entry, stdout);
            break;
          }

          if (*s) s++;
        }
      }
    }
    seqfclose(sfp);
406
407A couple points to note about these examples and the fields of the
408SEQINFO structure. First, the string `idlist' is a vertical bar separated
409list of identifiers, where each identifier consists of a prefix naming the
410database or type of identifier and a suffix giving the actual id. See file
411"user.doc" for a complete description of these identifiers and identifier
412prefixes.
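
The scan in the second example can be pulled out into a helper. Here is
a minimal standalone sketch, assuming only the idlist layout described
above (identifiers separated by '|', each of the form prefix, ':',
identifier). The function name `find_id_by_prefix' is hypothetical and
is not part of the package.

```c
#include <stddef.h>
#include <string.h>

/*
 * Find the first identifier in a '|'-separated idlist whose prefix
 * (the text before the ':') equals `prefix'.  On success, return a
 * pointer to the start of that identifier inside `idlist' and store
 * its length (prefix included) in *idlen.  Return NULL if none match.
 */
const char *find_id_by_prefix(const char *idlist, const char *prefix,
                              size_t *idlen)
{
  size_t prefixlen = strlen(prefix);
  const char *start = idlist;

  while (*start != '\0') {
    /* Length of the identifier that starts here. */
    size_t len = strcspn(start, "|");

    if (len > prefixlen && start[prefixlen] == ':' &&
        strncmp(start, prefix, prefixlen) == 0) {
      *idlen = len;
      return start;
    }

    start += len;
    if (*start == '|')
      start++;
  }
  return NULL;
}
```

The returned pointer is not a separate string: it points into the
idlist itself, so use the stored length rather than `strlen' on it.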
413
Second, the functions like `seqfidlist' are similar to `seqfsequence',
`seqfentry', and `seqfinfo' in that they return some information about
the "current" sequence/entry. The package has one of these access
functions for every field in the SEQINFO structure (e.g., `seqfdate',
`seqfiscircular', ...). For the SEQINFO fields that are character strings,
these functions take two arguments, where the second argument is just
like the third argument of `seqfsequence' or `seqfentry'. It tells whether
the package should return the character string using an internal buffer
or in a malloc'ed buffer. (Again, be aware that the internal buffer strings
are guaranteed to remain unchanged only up to the next call to the
SEQIO package.)

Third, the previous point raises the question of what happens when
`seqfinfo' or `seqfgetinfo' is called with a second argument of 1, and the
SEQINFO structure is returned in a malloc'ed buffer. Where do the
character string fields of the structure point? And will it be hard to
free up the SEQINFO structure and its character strings? When
`seqfinfo' or `seqfgetinfo' is called with a second argument of 1, they
actually malloc one large buffer, and store both the SEQINFO structure
and the character string fields in that one buffer. And since the
SEQINFO structure is placed at the beginning of the malloc'ed buffer,
simply free'ing the SEQINFO structure will automatically free up all of
its character strings.

And fourth, note the use of `seqfread' in the second example. It was used
because there is no `seqfgetidlist' function in the package. The only
functions which both read the next entry/sequence and return
something about that entry/sequence are `seqfgetseq',
`seqfgetrawseq', `seqfgetentry' and `seqfgetinfo'. To perform searches
using the other information functions, you must first read the next
entry/sequence, either with `seqfread' or with one of the four reading
functions listed in this paragraph. Also, in case
the arguments to `seqfread' are confusing, the second argument to
`seqfread' is NOT the same as the second argument to `seqfidlist'. The
second argument to `seqfread' specifies whether to read the next
sequence (if zero) or to read the next entry (if non-zero).
449
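The single-buffer layout for a malloc'ed SEQINFO structure, described
in the points above, can be sketched in a few lines. This is only an
illustration of the idea, not the package's code: `MINIINFO' is a
hypothetical cut-down structure, and `miniinfo_new' is a hypothetical
constructor.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical cut-down stand-in for SEQINFO. */
typedef struct {
  char *idlist;
  char *description;
  int iscircular;
} MINIINFO;

/*
 * Allocate ONE block that holds the structure followed by copies of
 * both strings, and point the string fields into that same block.
 * Returns NULL if malloc fails.
 */
MINIINFO *miniinfo_new(const char *idlist, const char *description,
                       int iscircular)
{
  size_t idsize = strlen(idlist) + 1;
  size_t descsize = strlen(description) + 1;
  MINIINFO *info = malloc(sizeof(MINIINFO) + idsize + descsize);
  char *strings;

  if (info == NULL)
    return NULL;

  /* The string storage starts right after the structure itself. */
  strings = (char *) (info + 1);
  info->idlist = memcpy(strings, idlist, idsize);
  info->description = memcpy(strings + idsize, description, descsize);
  info->iscircular = iscircular;
  return info;
}
```

Because the structure sits at the start of the block, a single
"free(info)" releases the structure and both strings. Freeing
`info->idlist' separately would be an error.
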
450Seqfmainid, Seqfmainacc, Seqfoneline and Seqfallinfo
451====================================================
452
The SEQIO package includes four other functions for accessing and
collecting information about each sequence: `seqfmainid', `seqfmainacc',
`seqfoneline' and `seqfallinfo'.
456
`Seqfmainid' and `seqfmainacc' are variations of `seqfidlist' which
only return a "main" identifier, instead of returning the whole identifier
459list. This is useful in cases where you don't necessarily want to search
460the complete list of identifiers, but just want a single identifier to
461associate with a sequence. `Seqfmainid' returns the "main" identifier
462for a sequence, which specifically is the first non-accession identifier,
463if one exists, or the first accession number in the entry otherwise. The
464`seqfmainacc' function returns the first accession number in the entry,
465if one exists. Both have the same arguments as `seqfidlist', and both
466return a NULL-terminated string containing the single identifier, with
467an identifier prefix. So, the example above which searches for an
468accession number could be rewritten as the following, if we were just
469looking for the entry whose main accession number is "X01828":
470
    char *mainacc, *entry;
    SEQFILE *sfp;

    if ((sfp = seqfopendb("genbank")) == NULL)
      exit(1);

    while (seqfread(sfp, 1) == 0) {
      /* seqfmainacc returns the first accession number, with prefix. */
      if ((mainacc = seqfmainacc(sfp, 0)) != NULL &&
          strcmp(mainacc, "acc:X01828") == 0) {
        entry = seqfentry(sfp, NULL, 0);
        fputs(entry, stdout);
      }
    }
    seqfclose(sfp);
485
486The function `seqfoneline' can be used to create a "oneline"
487description of the information for an entry. A number of programs (and
488a number of file formats) have situations where they would like to
489present the user with a relatively compact, one line description of a
particular sequence. The SEQIO package defines a standard format for
this type of description for biological sequences, and `seqfoneline' is
492the function the package provides to construct these descriptions. The
493argument list for `seqfoneline' is the following:
494
495int seqfoneline(SEQINFO *info, char *buffer, int buflen, int idonly);
496
497where `info' is a SEQINFO structure, `buffer' is a character buffer where
498the oneline description will be stored, `buflen' is the length of the
499buffer, and `idonly' will be discussed momentarily.
500
This function operates in a manner similar to `fgets', in that the string it
constructs is stored in the buffer passed to it. It differs from fgets in two
503major respects (apart from the fact that it does no file reading). The first
504is that the oneline description is guaranteed to both fit in the buffer and
505to be NULL-terminated (i.e., no oneline description will ever be longer
506than "buflen-1" characters). The second is that the function returns
507the length of the oneline description stored in `buffer', instead of a
508pointer to buffer itself. Hopefully, both of these differences will be more
509useful in practice than the way fgets works.
510
511The final argument to `seqfoneline' is an `idonly' flag specifying
512whether the "oneline description" should in fact just contain a single
513identifier for the sequence. This flag is useful in cases where you just
514want a single identifier string that is guaranteed to be no longer than a
515certain length (most notably in the output of the PHYLIP, Clustalw and
516MSF formats). When the flag is non-zero, the string stored in `buffer' is
517guaranteed to contain a single word identifier or description, and is
518guaranteed not to contain any whitespace.
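
The buffer contract is easy to mimic. Here is a minimal standalone
sketch of a function with the same three promises: the result always
fits in the buffer, it is always terminated, and the stored length is
what gets returned. The name `format_oneline' and the output layout
(identifier, a space, then the description) are invented for
illustration; this is not the standard oneline format that
`seqfoneline' produces.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical formatter with the buffer contract described above.
 * Stores at most buflen-1 characters plus the terminator in `buffer'
 * and returns the number of characters stored.  If `idonly' is
 * non-zero, only the identifier (up to its first whitespace) is used.
 */
int format_oneline(const char *id, const char *description,
                   char *buffer, int buflen, int idonly)
{
  int idlen, len;

  if (buffer == NULL || buflen <= 0)
    return 0;

  /* Keep only the first whitespace-free word of the identifier. */
  idlen = (int) strcspn(id, " \t\r\n");

  if (idonly)
    len = snprintf(buffer, (size_t) buflen, "%.*s", idlen, id);
  else
    len = snprintf(buffer, (size_t) buflen, "%.*s %s",
                   idlen, id, description);

  if (len < 0) {
    buffer[0] = '\0';
    return 0;
  }

  /* snprintf reports the untruncated length, so clamp it. */
  return (len < buflen) ? len : buflen - 1;
}
```

With an 8 character buffer, the sketch stores the first 7 characters of
the line and returns 7, so the caller never has to call `strlen' or
check for a missing terminator.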
519
520The final variation on accessing information from an entry is `seqfallinfo'.
521This function works exactly like `seqfinfo', except that the comment field
522of the SEQINFO structure returned contains a different string. Using
523`seqfinfo', the comment string returned consists of whatever comment
524appears in the entry. With `seqfallinfo', the comment string contains
525the complete header of the entry. The specifics of what string this is
526depends on the particular file format, but generally it consists of all of
527the lines of the entry except the sequence lines.
528
529Extracting Other Information
530============================
531
532The code snippets above illustrate the two ways of using the SEQIO
533package to extract information from an entry. One way is to use
534`seqfgetinfo' or `seqfinfo' to have the SEQIO package extract all of the
535information it can from an entry, and then to access the fields of the
536SEQINFO structure to get that information. The other way is to use the
537access functions for the SEQINFO fields (`seqfidlist', `seqfiscircular',
538and so on) to get one or more pieces of information from the entry.
539
540If neither of those ways can get the information you're looking for, the
541third way of getting information from a sequence is to get the entry's
542text and scan that text for the information, as in this example which
543outputs all entries in the file "alu.human" of the "REPBASE" database
544which are classified in the "Alu-J" region:
545
546    char *entry, *s;
547    SEQFILE *sfp;
548
549    if ((sfp = seqfopendb("repbase:alu.human")) == NULL)
550      exit(1);
551
552    while ((entry = seqfgetentry(sfp, NULL, 0)) != NULL) {
553      if (strstr(entry, "\nFT                   \\rpt_family=\"Alu-J\""))
554        fputs(entry, stdout);
555    }
556    seqfclose(sfp);
557
558This works, because when looking at the "alu.human" file, the
559sequences are classified by the line
560
561FT                   \rpt_family="Alu-J"
562
563Thus, by reading each entry and doing a simple scan for that particular
564line, I can extract the appropriate entries. And of course, more
565complicated (or robust) searches of the entries could be written, but
566the point here is that the SEQIO package takes care of all of the file I/O
567and simplifies the programmer's task to just implementing the
568scanning.
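
As one example of a slightly more robust scan, here is a standalone
sketch that pulls the quoted value out of a line like the one above
instead of testing for one fixed string. The function name
`copy_quoted_value' is hypothetical. It assumes only that the value
appears in the entry text as name="value", and it works on any entry
string, such as the one returned by `seqfgetentry'.

```c
#include <stddef.h>
#include <string.h>

/*
 * Search `entry' for the first occurrence of name="value" and copy
 * value into `out', truncating if needed and always terminating it.
 * Returns the number of characters copied, or -1 if no such text
 * is found.
 */
int copy_quoted_value(const char *entry, const char *name,
                      char *out, size_t outsize)
{
  size_t namelen = strlen(name);
  const char *p = entry;

  if (namelen == 0 || outsize == 0)
    return -1;

  while ((p = strstr(p, name)) != NULL) {
    const char *value = p + namelen;

    if (value[0] == '=' && value[1] == '"') {
      const char *end;
      size_t len;

      value += 2;
      end = strchr(value, '"');
      if (end == NULL)
        return -1;           /* opening quote with no closing quote */

      len = (size_t) (end - value);
      if (len >= outsize)
        len = outsize - 1;
      memcpy(out, value, len);
      out[len] = '\0';
      return (int) len;
    }
    p += namelen;            /* not followed by ="; keep looking */
  }
  return -1;
}
```

Called with the name "rpt_family" on an entry containing the line
shown above, it copies "Alu-J" into the output buffer, and the caller
can then compare that value against any family it is interested in.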
569
570
571
572Writing, Creating and Annotating Entries
573****************************************
574
575Writing Entries
576===============
577
578The process for writing sequences and entries is very similar to that of
579the stdio package: open a file, call a function to write each entry, close
580the file. The difference is that the function which writes each entry takes
581a sequence and a SEQINFO structure as its arguments. Because of
582this, the easiest example to give is actually a file format conversion
583program. This one converts from EMBL to GenBank:
584
    int len;
    char *seq;
    SEQINFO *info;
    SEQFILE *insfp, *outsfp;

    if ((insfp = seqfopen("my_sequences", "r", "embl")) == NULL)
      exit(1);
    if ((outsfp = seqfopen("my_seqs.2", "w", "genbank")) == NULL)
      exit(1);

    while ((seq = seqfgetseq(insfp, &len, 0)) != NULL) {
      /* Skip entries with no sequence; seqfwrite cannot output them. */
      if (len > 0 && (info = seqfinfo(insfp, 0)) != NULL)
        seqfwrite(outsfp, seq, len, info);
    }
    seqfclose(insfp);
    seqfclose(outsfp);
601
602The SEQIO package also contains a `seqfconvert' function, which can
603simplify this code just a little bit (although there's not much farther that
604you can go):
605
    SEQFILE *insfp, *outsfp;

    if ((insfp = seqfopen("my_sequences", "r", "embl")) == NULL)
      exit(1);
    if ((outsfp = seqfopen("my_seqs.2", "w", "genbank")) == NULL)
      exit(1);

    /* As in the earlier examples, seqfread returns 0 on a good read. */
    while (seqfread(insfp, 0) == 0)
      seqfconvert(insfp, outsfp);

    seqfclose(insfp);
    seqfclose(outsfp);
621
622For the function `seqfopen', its second argument is the same as the
623second argument to `fopen', except that `seqfopen' only supports
624reading ("r"), writing ("w") and appending ("a") modes. Also, when
625writing a file, the third `seqfopen' argument specifying the format must
626be given. It cannot be NULL.
627
628Creating New Entries
629====================
630
631The `seqfwrite' function uses the sequence and the 12 entry
632information fields of the SEQINFO structure (date, idlist, description,
633comment, organism, history, isfragment, iscircular, alphabet, fragstart,
634truelen, rawlen) when outputting the entry. It does not use the other six
635SEQINFO fields. Also, any of the character string fields may be either
636NULL or the empty string, in which case `seqfwrite' assumes that that
637information is not available. The function does not require that all of the
638fields be filled with information (it does the best it can with the
639information it's given). The only requirement `seqfwrite' makes on its
640arguments is that a non-empty sequence is given. It cannot output
641entries with no sequence.
642
643So, if you want to create new entries containing information that you
644compute using some other method, simply declare a SEQINFO
645structure, fill in its fields with the strings and values you've computed,
646and pass it and the sequence to `seqfwrite'.
647
648    int len;
649    char *seq;
650    SEQINFO info;
651    SEQFILE *insfp, *outsfp;
652
653    if ((outsfp = seqfopen("new_seqs", "w", "sprot")) == NULL)
654      exit(1);
655
656    while (/* more entries to create */) {
657      memset(&info, 0, sizeof(SEQINFO));
658
659      /* Perform some computation to get a sequence and to fill in the
660         fields of the SEQINFO structure. */
661
662      seqfwrite(outsfp, seq, len, &info);
663    }
664    seqfclose(outsfp);
665
666The SEQINFO structure has been defined so that all of the default
667values for the fields are 0 (or NULL for character strings). Thus, setting
668all of the bytes of the structure to 0 sets all of the default values.
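
One portability footnote: `memset' sets every byte to zero, and the C
standard does not strictly promise that an all-zero-bytes pointer is
NULL, although that holds on common platforms. A sketch of an
alternative that avoids the question is to assign from a
zero-initialized structure. `SEQINFO_SKETCH' below is a local copy of
the layout shown earlier in this file, so that the sketch compiles
without "seqio.h", and `reset_info' is a hypothetical helper.

```c
#include <stddef.h>

/* Local copy of the SEQINFO layout shown earlier in this file. */
typedef struct {
  char *dbname, *filename, *format;
  int entryno, seqno, numseqs;

  char *date, *idlist, *description;
  char *comment, *organism, *history;
  int isfragment, iscircular, alphabet;
  int fragstart, truelen, rawlen;
} SEQINFO_SKETCH;

/*
 * Reset every field to its default: NULL for the character strings
 * and 0 for the integers.
 */
void reset_info(SEQINFO_SKETCH *info)
{
  /* Static objects are initialized to null pointers and zeros. */
  static const SEQINFO_SKETCH blank = { 0 };

  *info = blank;
}
```

Either way, the reset belongs at the top of the loop, so that a field
left over from the previous entry is never written out with the next
one.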
669
670Annotating Existing Entries
671===========================
672
673The function `seqfannotate' provides a solution to the common
674problem of associating new information with an existing entry and its
675sequence. A biologist runs a program or performs a database search to
676find entries or sequences with a particular feature or pattern, i.e., some
677new piece of information about that sequence. It would be nice to be
678able to tag that entry with the new information. But, the question is
679where to store the information? Keeping a separate file for the new
680information can become a management headache, and using `seqfinfo'
681and `seqfwrite' (or their cousins in other sequence I/O packages)
eliminates a lot of the other information the entry holds. The
`seqfannotate' function remedies this problem by allowing you to insert
new text as a comment in an entry as that entry is being output, so that
the output entry will contain all of the information in the original
entry plus the new, inserted information.
687
688The function takes a SEQFILE pointer (open for writing), an entry and a
689string, and it inserts the string into the comment section of the entry as
690it is outputting the entry. The arguments for `seqfannotate' are the
691following:
692
693  int seqfannotate(SEQFILE *sfp, char *entry, int entrylen,
694                   char *newcomment, int flag)
695
696where `sfp' is the SEQFILE structure, `entry' and `entrylen' give the
697necessary information about the entry, `newcomment' is the string to
698be inserted, and `flag' tells whether or not to retain any existing
699comments in the entry (zero says to remove all other comments and
non-zero says to retain the comments). As an example, here is the
example program given at the beginning of this file, extended so that it
adds the matching positions to the entry text.
703
704    int len, entrylen;
705    char *seq, *entry, *str;
706    SEQFILE *sfp, *sfpout;
707
708    if ((sfp = seqfopen("my_sequences.3", "r", "pir")) == NULL)
709      exit(1);
710
711    if ((sfpout = seqfopen("-", "w", seqfformat(sfp))) == NULL)
712      exit(1);
713
714    while ((seq = seqfgetseq(sfp, &len, 0)) != NULL) {
715      if (len > 0 && (str = isa_match(seq, len)) != NULL) {
716        /* Found a match */
717        entry = seqfentry(sfp, &entrylen, 0);
718        seqfannotate(sfpout, entry, entrylen, str, 1);
719      }
720    }
    seqfclose(sfp);
    seqfclose(sfpout);
722
723where the function `isa_match' now returns a character string such as
724
725  "Prosite Pattern:  GLYCOSAMINOGLYCAN  (S-G-x-G)\nMatches: 10-16, 503-508.\n"
726
727instead of just a boolean flag. (Note: the string here is a literal version
728of the string that isa_match might return.)
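
One way such a string might be assembled is sketched below. The helper
`format_match_comment' and its arguments are hypothetical (they are not
part of SEQIO, and `isa_match' itself is left to your program); the
sketch only shows how to build the two-line text from a pattern name
and a list of match positions.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of SEQIO): build the comment text for
   `n' matches of a named pattern, where starts[i]-ends[i] is the i-th
   match.  Returns a malloc'ed string the caller must free, or NULL if
   no memory is available. */
char *format_match_comment(const char *name, const int *starts,
                           const int *ends, int n)
{
  size_t size, used;
  char *buf;
  int i;

  assert(name != NULL && starts != NULL && ends != NULL && n > 0);

  /* Room for the fixed text and the name, plus up to 11 characters for
     each number and 4 for the punctuation around each pair. */
  size = strlen(name) + 64 + (size_t) n * 26;
  if ((buf = (char *) malloc(size)) == NULL)
    return NULL;

  used = (size_t) sprintf(buf, "Prosite Pattern:  %s\nMatches: ", name);
  for (i = 0; i < n; i++)
    used += (size_t) sprintf(buf + used, "%s%d-%d", (i ? ", " : ""),
                             starts[i], ends[i]);
  sprintf(buf + used, ".\n");

  return buf;
}
```

The returned buffer can be handed to `seqfannotate' as the new comment
and freed once the entry has been written.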
729
730One thing that might appear to be missing from the `seqfannotate' call
731is the format of the entry being passed to it. The format for the passed
732in entry is assumed to be the same as the format that was specified
733when the SEQFILE structure was opened for writing. (Note the use
734above of `seqfformat' when opening the output, and recall that giving
735"-" to `seqfopen' tells it to open standard input or standard output.) If
736the entry is not in the correct form, a parse error will occur and nothing
737will be output.
738
739With the example program above, if the entry text given to
740`seqfannotate' were the following (to use an actual PIR entry):
741
742ENTRY            CCMQR      #type complete
743TITLE            cytochrome c - rhesus macaque (tentative sequence)
744ORGANISM         #formal_name Macaca mulatta #common_name rhesus macaque
745DATE             17-Mar-1987 #sequence_revision 17-Mar-1987 #text_change
746                   05-Aug-1994
747ACCESSIONS       A00003
748REFERENCE        A00003
749   #authors      Rothfus, J.A.; Smith, E.L.
750   #journal      J. Biol. Chem. (1965) 240:4277-4283
751   #title        Amino acid sequence of rhesus monkey heart cytochrome c.
752   #cross-references MUID:66045191
753   #contents     Compositions of chymotryptic peptides and sequences of
754                   residues 55-61 and 68-70
755   #accession    A00003
756      ##molecule_type protein
757      ##residues      1-104 ##label ROT
758CLASSIFICATION   #superfamily cytochrome c; cytochrome c homology
759KEYWORDS         acetylated amino end; electron transfer; heme; mitochondrion;
760                   oxidative phosphorylation; respiratory chain
761FEATURE
762   1                  #modified_site acetylated amino end (Gly) #status
763                        experimental\
764   14,17              #binding_site heme (Cys) (covalent) #status predicted\
765   18,80              #binding_site heme iron (His, Met) (axial ligands)
766                        #status predicted
767SUMMARY          #length 104  #molecular-weight 11605  #checksum 9512
768SEQUENCE
769                5        10        15        20        25        30
770      1 G D V E K G K K I F I M K C S Q C H T V E K G G K H K T G P
771     31 N L H G L F G R K T G Q A P G Y S Y T A A N K N K G I T W G
772     61 E D T L M E Y L E N P K K Y I P G T K M I F V G I K K K E E
773     91 R A D L I A Y L K K A T N E
774///
775
776the output from `seqfannotate' would be
777
778ENTRY            CCMQR      #type complete
779TITLE            cytochrome c - rhesus macaque (tentative sequence)
780ORGANISM         #formal_name Macaca mulatta #common_name rhesus macaque
781DATE             17-Mar-1987 #sequence_revision 17-Mar-1987 #text_change
782                   05-Aug-1994
783ACCESSIONS       A00003
784REFERENCE        A00003
785   #authors      Rothfus, J.A.; Smith, E.L.
786   #journal      J. Biol. Chem. (1965) 240:4277-4283
787   #title        Amino acid sequence of rhesus monkey heart cytochrome c.
788   #cross-references MUID:66045191
789   #contents     Compositions of chymotryptic peptides and sequences of
790                   residues 55-61 and 68-70
791   #accession    A00003
792      ##molecule_type protein
793      ##residues      1-104 ##label ROT
794COMMENT    Prosite Pattern:  GLYCOSAMINOGLYCAN  (S-G-x-G)
795           Matches: 10-16, 503-508.
796
797           SEQIO annotation, lines 1-2.  02-Feb-1996
798CLASSIFICATION   #superfamily cytochrome c; cytochrome c homology
799KEYWORDS         acetylated amino end; electron transfer; heme; mitochondrion;
800                   oxidative phosphorylation; respiratory chain
801FEATURE
802   1                  #modified_site acetylated amino end (Gly) #status
803                        experimental\
804   14,17              #binding_site heme (Cys) (covalent) #status predicted\
805   18,80              #binding_site heme iron (His, Met) (axial ligands)
806                        #status predicted
807SUMMARY          #length 104  #molecular-weight 11605  #checksum 9512
808SEQUENCE
809                5        10        15        20        25        30
810      1 G D V E K G K K I F I M K C S Q C H T V E K G G K H K T G P
811     31 N L H G L F G R K T G Q A P G Y S Y T A A N K N K G I T W G
812     61 E D T L M E Y L E N P K K Y I P G T K M I F V G I K K K E E
813     91 R A D L I A Y L K K A T N E
814///
815
816Note the new COMMENT section between the REFERENCE and
817CLASSIFICATION sections. And when read back in again, the string
818returned by `seqfcomment' would be the string
819
820  "Prosite Pattern:  GLYCOSAMINOGLYCAN  (S-G-x-G)\nMatches: 10-16, 503-508.\n"
821
This is exactly what was inserted (because the original entry had no
other comments).
824
825
826
827BIOSEQ Stuff (Database Information Processing)
828**********************************************
829
830The first three sections present essentially all of the main functionality
831for reading and writing files and performing database searches. (There
832are a couple additional functions, but I'll leave you to read "seqio.doc"
833to find out what they are.) Sometimes, however, a program needs more
834control over the operations that are performed than the basic functions
835of the package permit. These next two sections describe additional
836features that can provide the extra control.
837
This section discusses four of the five functions related to the
839BIOSEQ standard for specifying and searching databases. I assume in
840this section that you have read the parts of "user.doc" that relate to the
841BIOSEQ standard and have some idea about what a BIOSEQ file looks
842like. Please go read that text first.
843
844The five BIOSEQ functions that are included in the SEQIO package
845(and in fact make up all of its functionality except for the standard itself)
846are `bioseq_read' which reads the BIOSEQ files, `bioseq_check' which
847can check to see if a database search specifier is valid, `bioseq_info'
848which is used to get an information field from a BIOSEQ entry,
849`bioseq_parse' which is used to get the list of files specified by a
database search, and `bioseq_matchinfo' which is used to determine
851which BIOSEQ entry for a database has an information field with a
852particular value. This section talks about all of these functions except
853`bioseq_matchinfo'.
854
855The function `bioseq_read' takes in the name of a file, reads the
856BIOSEQ entries in the file, checks the syntax of those entries, and
857stores all of the entry information in internal data structures. Those
858data structures are then used by the `bioseq_info', `bioseq_matchinfo'
859and `bioseq_parse' functions.
860
861By default, the first files read are always the files specified by the
862"BIOSEQ" environment variable, if it is defined. This is done before any
863of the bioseq_* functions perform their operation. Then, each call to
864`bioseq_read' reads subsequent files.
865
866The internal data structure used by the package is a list of the read-in
867entries, and the determination of which entry a database search
868specification refers to is performed by searching through the list. The
869entries in the list are stored in reverse order of the calls to
870bioseq_read, but in the given order within a specific call to
bioseq_read. So, the first entry checked is always the first entry of
the first file from the last call to bioseq_read. From there, the rest
of the entries in that last call are checked, and after the last entry
of that last call, the first entry of the next-to-last call to
bioseq_read is checked. This way,
875the later calls to `bioseq_read' will have priority over the previous calls
876to `bioseq_read' (or the "BIOSEQ" env. variable files), in case of
877duplicates.
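
The ordering rule can be modelled in a few lines. This is only a sketch
of the rule as described above, not the package's actual data
structure: `DemoEntry', `demo_read' and `demo_lookup' are hypothetical
names, and a fixed-size array stands in for the internal list.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_ENTRIES 32

/* Hypothetical model of the entry list; not the SEQIO data structure. */
typedef struct {
  const char *name;    /* database name */
  const char *source;  /* file the entry came from */
} DemoEntry;

static DemoEntry entries[MAX_ENTRIES];
static int num_entries = 0;

/* Model one "read" call: put the `n' new entries in front of everything
   read so far, keeping their own order.  Returns 0 on success, -1 if
   the table is full. */
int demo_read(const DemoEntry *batch, int n)
{
  assert(batch != NULL && n >= 0);
  if (num_entries + n > MAX_ENTRIES)
    return -1;

  memmove(entries + n, entries, (size_t) num_entries * sizeof(DemoEntry));
  memcpy(entries, batch, (size_t) n * sizeof(DemoEntry));
  num_entries += n;
  return 0;
}

/* Return the first (highest priority) entry with the given name, or
   NULL if no entry has that name. */
const DemoEntry *demo_lookup(const char *name)
{
  int i;

  assert(name != NULL);
  for (i = 0; i < num_entries; i++)
    if (strcmp(entries[i].name, name) == 0)
      return &entries[i];
  return NULL;
}
```

If a site-wide file is read first and a user's file second, a lookup
for a name defined in both finds the user's entry, while names defined
only in the site-wide file are still found.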
878
879Therefore, if you're writing a program and you want to allow the user to
880have multiple ways to specify BIOSEQ files (such as the BIOSEQ
881environment variable, plus other user-specified or program-specific
882files), use `bioseq_read' to read in the files in increasing priority, and
883the SEQIO package will always pick the highest priority BIOSEQ entry
884for each database. And, if you want the files specified by the "BIOSEQ"
885env. variable to have a higher priority than other files, simply call
886`bioseq_read' to reread the environment variable value. A BIOSEQ file
887can always be read in more than once, and the latest read will always
888override the entries from the previous read (unless the names of the
889BIOSEQ entries have changed between reads).
890
891The function `bioseq_check' takes a database search specifier and
892checks whether it refers to a known database (i.e., whether a BIOSEQ
893entry exists for that database). It returns non-zero if the BIOSEQ entry
894exists, and zero otherwise. This can be used for a quick error check
895testing whether the specifier given by the user is valid or not.
896
897The function `bioseq_info' is used to get the text from an information
898field in the BIOSEQ entry for a database. These information fields
899provide an easy way for the user to pass database-specific information
900to your program. One example of this is to allow the user to specify
901some command line options using an information field specific to the
902database. This way, the user can "tune" the program for each database,
903without having to always keep track of what option values must be
904specified for each database.
905
906The SEQIO package also "defines" several information fields that it
907uses when performing database searches. These fields are `Name',
908`Format', `Alphabet', `IdPrefix' and `Index'. The `Name' field gives the
909name of the database, and its presence distinguishes BIOSEQ entries
910for databases from entries for personal collections of files. The `Format'
911and `Alphabet' fields specify the format for the database files and the
912alphabet for the database sequences, respectively. The `IdPrefix' field
913specifies the identifier prefix that should be given to the main identifier
914in each entry. The `Index' field specifies the name of the file which
915indexes all of the database's entries (see "idxseq.doc" for more
916information about the index files).
917
918(NOTE: Information fields can only be "defined" in the sense that the
919user can be asked to place the requested text in information fields for
920the specified keywords. There is nothing requiring those fields to be
921there or restricting what text the user puts there, except maybe that
922improper text will trigger an error in the package or your program.)
923
924The `bioseq_parse' function is the function used to parse database
925search specifications and determine the list of files that should be read
926in that search. This function (along with the `bioseq_info' function for
927the four information fields above) is used by `seqfopendb' to open a
928database search. In fact, that initial example opening a database search
929could be replaced with the following code snippet, and it would perform
930the same operations (with one exception noted below):
931
932    int len;
933    char *s, *t, *files, *seq;
934    SEQFILE *sfp;
935
936    /*
937     * The next 9 lines replace the lines:
938     *     if ((sfp = seqfopendb("genbank")) == NULL)
939     *       exit(1);
940     */
941    if ((files = bioseq_parse("genbank")) == NULL)
942      exit(1);
943
944    for (s=files; *s; s++) {
945      for (t=s; *s != '\n'; s++) ;
946      *s = '\0';
947
948      if ((sfp = seqfopen(t, "r", NULL)) == NULL)
949        exit(1);
950
951      while ((seq = seqfgetseq(sfp, &len, 0)) != NULL) {
952        if (len > 0 && isa_match(seq, len)) {
953          /* Found a match */
954        }
955      }
956      seqfclose(sfp);
957    }
958    free(files);
959
960The string returned by `bioseq_parse' is a list of the database's files to
961be read, where each filename is terminated by a newline character
962(including the last filename), and the whole string is terminated by a
NUL character. This string is stored in a malloc'ed buffer, and so must
964be freed when no longer useful. (Why newline? Hey, it probably won't
965appear in a filename, it's different from '\0' and it makes printing the list
966of files look nice. Got better reasons for some other character?)
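
For programs that would rather have an array of names than walk the
string by hand, the same in-place splitting can be wrapped in a small
function. `split_file_list' is a hypothetical helper, not part of
SEQIO; it assumes only the list layout described above, and it stops
quietly if the final newline is missing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Split a newline-terminated list in place: each '\n' becomes '\0' and
   names[i] points at the i-th filename.  Stores at most `max' pointers
   and returns the total number of filenames found.
   (Hypothetical helper, not part of SEQIO.) */
int split_file_list(char *files, char **names, int max)
{
  char *s, *end;
  int count = 0;

  assert(files != NULL && names != NULL);

  for (s = files; *s != '\0'; s = end + 1) {
    if ((end = strchr(s, '\n')) == NULL)
      break;                    /* malformed tail: no terminating newline */
    *end = '\0';
    if (count < max)
      names[count] = s;
    count++;
  }
  return count;
}
```

The pointers stored in `names' point into the original buffer, so that
buffer must not be freed until the names are no longer needed.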
967
968The example above opens the same set of files and reads the same
969sequences. The only potential difference between the execution of that
970example and the example using `seqfopendb' is that the SEQIO
971package will not know about the four information fields associated with
972the database, and so minor differences may appear in the results (very
973minor differences in the fields of any SEQINFO structure and any
974output generated by SEQIO). This information could be included in the
975example using `bioseq_info', `seqfsetdbname', `seqfsetidpref' and
976`seqfsetalpha', as follows:
977
978    char *format, *dbname, *alpha, *idprefix;
979
980    if ((files = bioseq_parse("genbank")) == NULL)
981      exit(1);
982
983    format = bioseq_info("genbank", "Format");
984    dbname = bioseq_info("genbank", "Name");
985    alpha = bioseq_info("genbank", "Alphabet");
    idprefix = bioseq_info("genbank", "IdPrefix");
987
988    for (s=files; *s; s++) {
989      for (t=s; *s != '\n'; s++) ;
990      *s = '\0';
991
992      if ((sfp = seqfopen(t, "r", format)) == NULL)
993        exit(1);
994
995      if (dbname != NULL)
996        seqfsetdbname(sfp, dbname);
      if (alpha != NULL)
998        seqfsetalpha(sfp, alpha);
999      if (idprefix != NULL)
        seqfsetidpref(sfp, idprefix);
1001
1002      while ((seq = seqfgetseq(sfp, &len, 0)) != NULL) {
1003        if (len > 0 && isa_match(seq, len)) {
1004          /* Found a match */
1005        }
1006      }
1007      seqfclose(sfp);
1008    }
1009
1010    free(files);
1011    if (format != NULL)
1012      free(format);
1013    if (dbname != NULL)
1014      free(dbname);
1015    if (alpha != NULL)
1016      free(alpha);
1017    if (idprefix != NULL)
1018      free(idprefix);
1019
Note that the strings returned by `bioseq_info' are also returned in
malloc'ed buffers, and so must be freed after use.
1022
1023
1024
1025Error Handling
1026**************
1027
1028There are three things that any programmer must figure out when
1029writing a program (apart from what the program actually is supposed to
1030do). They are what the user interface will look like, how the program is
1031going to store the data it uses, and how the program handles errors.
1032Since this is a package and not a complete program, I leave the user
1033interface to your dreams and abilities, but I want to try to simplify the
1034other two tasks as much as possible. I've talked about how the SEQIO
1035package keeps track of a lot of data internally, and can return that data
1036when asked. Here, I want to describe how the package handles errors,
1037and the way you can specify how the package should handle them.
1038
1039The image I had when designing the error handling of the package was
1040that when the package is being used to create a "quick and dirty"
1041program that is written just to quickly get information or entries from a
1042database or file, the SEQIO package should do as much as possible to
1043descriptively report and properly handle errors. However, when the
1044package is used to create robust application software with either a
1045command line or a windowing user interface, the programmer should
1046have the ability to disable some or all of that reporting/handling
1047mechanism and replace it with their own error handling routines.
1048
1049By default, when the SEQIO package detects an error, it first sets the
1050values of variables `seqferrno' and `seqferrstr' to an integer error value
1051and the text of an error message, respectively. These variables are
1052defined in "seqio.h" as extern variables, so you have access to their
1053values at all times. (See file "seqio.doc" for a more complete
1054description of the values `seqferrno' can take.) The next thing that the
1055package does is output an error message on standard error. And
1056finally, depending on the seriousness of the error, the package may
1057either return an error value as the result of the SEQIO function call or it
1058may exit the program.
1059
1060Obviously, the outputting of an error message or the program exiting
1061can affect your user interface, so I've tried to design the package so
1062you can either work with these actions more easily or disable them
1063easily. The first thing I've done is try to write all of the error messages
1064so that they would be comprehensible to the user of your program, who
1065may not know about a SEQIO package. I could not handle all of the
cases (in particular, the error messages from calls to `seqfparseent'
and `seqfannotate' are not as informative, because those functions are
not given any information that originally came from the user, such as a
filename). But, for the most part, the error messages should not be
incomprehensible. If you do find an error message that you think could
be improved, please send a message to knight@cs.ucdavis.edu.
1072
The second thing I've done is to limit the times when the package exits
the program to when (1) the package detects that no more memory is
available or (2) it detects a bug in the package code. Thus,
1076(hopefully) there will be few occasions when the package will actually
1077exit the program. And, typically the "quick and dirty" programs don't
1078have any better handling of these errors.
1079
1080The third thing I've done is to include a function `seqfsetperror' to
1081allow you to redirect all of the error printing the package does. This
1082function takes another function as its argument, and, when given that
1083argument function, the SEQIO package will call that function for any
1084error printing, instead of calling its default print error function. Thus,
1085you can redirect all of the error output to an empty function, to a
1086function that changes the text of the error messages, or to a function
1087which pops up an error window with the text of the message.
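
The redirection idea can be sketched without the package itself. Every
name below (`demo_setperror', `demo_report', `capture_message') is a
hypothetical stand-in, and the callback signature is an assumption made
for the sketch; see "seqio.doc" for the real declaration of
`seqfsetperror'.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins; NOT the SEQIO declarations (see seqio.doc). */
typedef void (*demo_perror_fn)(const char *msg);

static void demo_default_perror(const char *msg)
{
  fputs(msg, stderr);
}

static demo_perror_fn demo_perror = demo_default_perror;

/* Plays the role of `seqfsetperror': install a replacement printer.
   In this sketch only, passing NULL restores the default printer. */
void demo_setperror(demo_perror_fn fn)
{
  demo_perror = (fn != NULL) ? fn : demo_default_perror;
}

/* What the package side does when it has a message to print. */
void demo_report(const char *msg)
{
  assert(msg != NULL);
  demo_perror(msg);
}

/* One possible replacement: keep the latest message (for a popup
   window, say) instead of writing it to standard error. */
static char last_message[256];

void capture_message(const char *msg)
{
  strncpy(last_message, msg, sizeof(last_message) - 1);
  last_message[sizeof(last_message) - 1] = '\0';
}
```

After `demo_setperror(capture_message)', nothing reaches standard
error; the text sits in `last_message' until the program decides what
to do with it. An empty function in place of `capture_message' would
silence the output entirely.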
1088
The fourth thing I've done is to add a function `seqferrpolicy' which
allows you to disable some or all of the error output, and to control
whether the program calls exit on memory errors and program bugs. See
the file "seqio.doc" for the details on `seqferrpolicy'. Thus, when you
want to handle the error reporting and handling yourself, the package
can be told to just set `seqferrno', set `seqferrstr' and return error
values from the package functions. And, even in that case, you still
have access to the message that the package would have output, since
that message is stored in `seqferrstr'. So, for example, if you are
writing a windowing program and you want some but not all error
messages to appear in a popup window, you can make the call
"seqferrpolicy(PE_NONE)", and then, after the SEQIO package calls
which may trigger an error worth reporting, check the value of
`seqferrno'. With that policy, the package is guaranteed never to
output any messages or exit the program (except if it core dumps, of
course).
1104
1105
1106
1107Porting the Package to Another Machine
1108**************************************
1109
1110Currently, the package has been tested under the following operating
1111systems:
1112
1113 Ultrix, SunOS, Solaris, IRIX, Windows NT/95
1114
1115If your machine is not one of these, there is a chance the program may
1116not compile on it. Based on my experience with other software I've
1117written, my guess is that the code should compile on most of the Unix
1118variants, with the exception that the proper include files needed to read
1119directory files may differ from those in the code. On non-Unix variants,
1120the code probably will not compile, as the code dealing with directory
1121files is specifically geared for the Unix and Windows operating systems.
1122
If you do have a machine not on the list, are not able to compile the
package and want to port it, first send me mail (at
knight@cs.ucdavis.edu). I am very interested in getting the program to
work on as many systems as possible, and will try to help as much as
possible (including implementing any changes on my latest version of
the code and immediately sending you a personal release, so that you
would not have to wait until the next version of the code came out).
Then, check the list of things below, which may narrow down where the
problem lies.
1132
1133First, the current version of the code uses the following include files:
1134
1135  #include <stdio.h>
1136  #include <stdlib.h>
1137  #include <ctype.h>
1138  #include <fcntl.h>
1139  #include <stdarg.h>
1140  #include <string.h>
1141  #include <time.h>
1142  #include <errno.h>
1143  #include <sys/types.h>
1144  #include <sys/stat.h>
1145
1146  #ifdef __unix
1147  #include <dirent.h>
1148  #ifdef SYSV
1149  #include <sys/dirent.h>
1150  #endif
1151  #endif
1152
1153  #ifdef WIN32
1154  #include <windows.h>
1155  #endif
1156
1157  #include "seqio.h"
1158
1159plus, the following include file
1160
1161  #include <sys/mman.h>
1162
1163is ifdef'ed inside the preprocessor define value ISMAPABLE (see below
1164for the discussion of the `mmap' system call and ISMAPABLE).
1165
1166If your machine does not have some of these includes, take them out,
figure out which variables/functions needed those includes, and then
1168figure out which include files your system needs to declare those
1169variables/functions.
1170
1171Second, here is a complete list of the external variables and function
1172calls used by the bulk of my program.
1173
1174  * Current set of external calls in main section of code:
1175  *      exit, fclose, fopen, fputc, fputc, fprintf, free, fwrite,
1176  *      getenv, getpagesize, isalpha, isalnum, isdigit, isspace,
1177  *      malloc, memcpy, memset, realloc, sizeof, sprintf,
1178  *      strcpy, strcmp, strlen, strncmp, tolower, va_arg, va_end,
1179  *      va_start, vsprintf
1180  *      mmap, munmap (these are ifdef'd inside `ISMAPABLE')
1181  *
1182  * Current set of (unusual?) data-structures/variables in main section:
1183  *      errno, va_list, __LINE__,
1184  *      caddr_t (this is ifdef'd inside `ISMAPABLE')
1185
1186In addition, I've encapsulated a lot of the system operations into
1187functions at the end of the file "seqio.c". My assumption was that the
1188functions and variables above are common to most or all machines,
1189whereas the functions and variables below are more machine specific.
1190So, I put all of the machine specific code at the end of the file, where it
is much easier to find. Here is a list of all of the
functions/variables/structures used in these encapsulated functions:
1193
1194  * Current set of external calls in end section of code:
1195  *      close, ctime, open, read, stat, time
1196  *
1197  *      closedir, opendir, readdir  (these are ifdef'd inside `__unix')
1198  *
1199  *      GetCurrentDirectory, SetCurrentDirectory,
1200  *      FindFirstFile, FindNextFile, CloseHandle
1201  *                              (these are ifdef'd inside `WIN32')
1202  *
1203  * Current set of (unusual?) data-structures/variables in end section:
1204  *      stat structure, time_t, stdin, stdout, stderr
1205  *      DIR, dirent structure  (these are ifdef'd inside `__unix')
1206  *      WIN32_FIND_DATA, HANDLE   (these are ifdef'd inside `WIN32')
1207
1208If any of these functions or variables are not supported on your
1209machine, please let me know and we can figure out how to work around
1210them.
1211
1212Here are some additional tips and requirements for the package:
1213
1214 1. For Unix variants, if the structures DIR and dirent, and the
1215   functions opendir, readdir and closedir, are problems for the
1216   compiler, check the man pages of those functions for the include
1217   files needed to use them. The current include files I've specified
1218   are the following:
1219
1220   #include <sys/types.h>
1221   #include <dirent.h>
1222   #ifdef SYSV
1223   #include <sys/dirent.h>
1224   #endif
1225
1226   These include files are compatible with SunOS, SOLARIS, Ultrix,
1227   OSF, DYNIX (or whatever the Sequent's Unix variant is called),
1228   IRIX and HPUX. I have tested the directory include files on all
1229   these.
1230
 2. If one of the string functions (strcmp, strlen, strcpy, ...) or the
   character class functions (isalpha, isspace, isdigit, ...) is not
   supported, then tell me about it and I will add my own version of
   that function to the code, remove the use of the missing function
   from the package and send you a new release. One of my goals
   for the package is that no compiler options ever need to be
   specified to get the program to compile correctly. So, it's better
   (from my point of view) to just replace any function that may not
   exist on a machine, rather than have the users worry about
   configuring the package for different machines.
1241
1242 3. The program requires that "int"'s be 4 bytes long, as they will
1243   take values larger than 65536. This shouldn't be a problem,
1244   except for the PC's. If you wish to port it to a PC, what I can do is
1245   create an "int4" typedef that can be set to the appropriate value
1246   for the different machines.
1247
1248 4. I've created typedefs to hide the datatype used when reading
1249   directories and when reading from raw files using open and read.
1250   If your system requires different data structures for those values,
1251   the typedef declarations are at the beginning of "seqio.c".
1252
1253 5. I've also created a "dirch" variable to hold the character used by
1254   the operating system to distinguish between directories in a
   path. Currently, that variable is set to the character '/' (for
   Unix) but it can be reset using system specific ifdefs to another
   character
1257   (such as '\' for Windows NT). This variable is also declared at the
1258   beginning of "seqio.c".
1259
1260   This variable is used in all of the BIOSEQ processing, which
1261   must know the format of directory pathnames. If directory paths
1262   use some format other than a string of names separated by the
1263   directory character (as the VMS systems do), we'll have to work
1264   together to reimplement the BIOSEQ processing.
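
Tips 3 and 5 can be sketched together. The `int4' typedef follows the
name suggested in tip 3, but the selection logic shown is only one
possible approach; the `dirch' declaration is an illustration rather
than a copy of what "seqio.c" contains, and `join_path' is a
hypothetical helper added to show the variable in use.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>

/* Tip 3: pick a type with at least 32 bits.  `int4' is the name
   suggested in the text; this selection logic is only an illustration. */
#if INT_MAX >= 2147483647
typedef int int4;
#else
typedef long int4;        /* long is at least 32 bits in standard C */
#endif

/* Tip 5: one directory character, chosen by system-specific ifdefs. */
#ifdef WIN32
static const char dirch = '\\';
#else
static const char dirch = '/';
#endif

/* Hypothetical helper: write "dir<dirch>file" into out (size outlen).
   Returns 0 on success, -1 if the result does not fit. */
int join_path(char *out, size_t outlen, const char *dir, const char *file)
{
  size_t dlen, flen;

  assert(out != NULL && dir != NULL && file != NULL);
  dlen = strlen(dir);
  flen = strlen(file);
  if (dlen + 1 + flen + 1 > outlen)
    return -1;

  sprintf(out, "%s%c%s", dir, dirch, file);
  return 0;
}
```

Keeping the directory character in one place means that code building
pathnames does not need its own system-specific ifdefs.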
1265
1266Finally, if your machine is not on the list and even if you are able to
1267compile the program successfully, I would like you to check one
1268additional feature. Some of the Unix variants support calls to a function
1269`mmap', which directly maps disk files into the memory of a program.
1270I've added code to use this function, because it speeds up file reading
1271by about 30-40%. I would like you to check to see if your machine
1272supports the `mmap' call on generic files (some systems, like Ultrix,
1273have the `mmap' call but it only works for device files).
1274
1275I have encapsulated all of the code dealing with the `mmap' call inside a
1276preprocessor define value ISMAPABLE, and at the beginning of
1277"seqio.c", I include an ifdef expression which, for the systems that
1278support the `mmap' call, defines ISMAPABLE. So, another way you can
1279check to see if the `mmap' call on your system exists is to compile the
1280program with the -DISMAPABLE option and see if it compiles. If so,
1281please send me mail so I can add that system to the ifdef expression
1282that turns on the mmap'ing.
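
A quick way to test `mmap' on generic files, separate from the package,
is a small function like the one below. It is a sketch that assumes a
POSIX system (open, fstat, mmap, munmap); `count_newlines_mmap' is a
hypothetical name and is not taken from "seqio.c".

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Count the newline characters in a regular file by mapping it into
   memory.  Returns the count, or -1 if the file cannot be opened or
   mapped (for example, where mmap only supports device files). */
long count_newlines_mmap(const char *path)
{
  struct stat st;
  char *data;
  long count = 0;
  off_t i;
  int fd;

  assert(path != NULL);
  if ((fd = open(path, O_RDONLY)) < 0)
    return -1;
  if (fstat(fd, &st) < 0) {
    close(fd);
    return -1;
  }
  if (st.st_size == 0) {          /* mmap rejects zero-length mappings */
    close(fd);
    return 0;
  }

  data = (char *) mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_PRIVATE,
                       fd, 0);
  close(fd);                      /* the mapping stays valid after close */
  if (data == (char *) MAP_FAILED)
    return -1;

  for (i = 0; i < st.st_size; i++)
    if (data[i] == '\n')
      count++;

  munmap(data, (size_t) st.st_size);
  return count;
}
```

If this returns -1 for an ordinary text file that `open' and `read'
handle without trouble, the system probably falls in the same category
as Ultrix and should be left out of the ISMAPABLE list.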
1283
1284
1285James R. Knight, knight@cs.ucdavis.edu
1286June 28, 1996
1287