SEQIO -- A Package for Sequence File I/O


PROGRAMR.DOC - Guide to Using the SEQIO Package
***********************************************

The main documentation on the SEQIO interface is given in "seqio.doc".
This file is more of a "how-to" guide to using the package. These are
the ideas I had for using the package while I was designing and
implementing it, broken up into five sections:

    1. reading sequences and database searches,
    2. extracting information from entries,
    3. writing/converting/annotating entries,
    4. BIOSEQ stuff (database information processing),
    5. error handling.

At the end of the file, there is an additional section discussing how to
port the package to other machines.

I'm going to concentrate on the interface itself, so in all of the examples
below, you will see constants for things like filenames, formats,
database names, and so on. In a normal program those things would be
specified as part of the user interface, but here I'm going to make them
as simple as possible in order to illustrate the interface functions more
clearly.

Jim



Reading Sequences and Database Searches
***************************************

This package actually evolved from a module of some sequence
analysis software I was writing, as well as the three or four programs I
had designed to some extent and was planning to implement (and still
am). In all of those programs, I needed a module to read in the
sequences in a sequence file, and I had three goals for that module: 1)
make it simple for the rest of the program to use, 2) make it as fast as
possible, and 3) remove as many size limitations as possible (from
sequence size to maximum line length and so on). Those goals, and the
focus on reading files and databases, remained in the design of the
SEQIO package.
However, in this file you won't hear much about goals
2 and 3, because they don't show up when your programs are written,
only when they execute.

A program that reads a sequence file or database looks a lot like one
that uses the stdio package to do normal file I/O: it opens the file or
database, repeatedly calls a function to read the next sequence, and
closes the file or database when it hits EOF.

    int len;
    char *seq;
    SEQFILE *sfp;

    if ((sfp = seqfopen("my_sequences", "r", "FASTA")) == NULL)
        exit(1);

    while ((seq = seqfgetseq(sfp, &len, 0)) != NULL) {
        if (len > 0 && isa_match(seq, len)) {
            /* Found a match */
        }
    }

    seqfclose(sfp);

This code snippet is an example of searching the sequences of a
FASTA-formatted file for the sequences that match (whatever it is
you might want to match). To read a database instead of a file, just
replace the "sfp = seqfopen(...)" call with
"sfp = seqfopendb("genbank")", to read the GenBank database for example.
Another simple change you can make to this example is to read all of the
file/database entries, instead of the sequences those entries contain.
To do that, simply replace the call to "seq = seqfgetseq(sfp,
&len, 0)" with "entry = seqfgetentry(sfp, &len, 0)" and the
entry text for each entry is returned.

With either or both of these alterations, the rest of the program will
work in exactly the same way, with two minor exceptions. First, when
the `seqfgetentry' for `seqfgetseq' substitution is made and the entries
in the file or database contain more than one sequence, `seqfgetseq'
will read each sequence in the entry, whereas `seqfgetentry' will only
read the entry once regardless of how many sequences occur in the
entry.

Second, when searching databases using `seqfopendb', a BIOSEQ file
must have been created and the "BIOSEQ" environment variable must
include that file.
See the file "user.doc" for information on how to create
BIOSEQ files. That file also describes the strings `seqfopendb' can take
to specify a database search.

Differences between SEQIO and stdio
===================================

There are some small differences between the SEQIO calls in the
example above and the stdio calls used to do file I/O. First, the
`seqfopen' function takes a third argument which specifies the format of
the file being opened. That argument either must be a string naming a
supported file format (see "user.doc" and "format.doc" for the list of
those formats), or must be NULL, in which case the format of the file is
automatically determined from the text in the file.

Second, the arguments to `seqfgetseq' are different from any of the
fget* functions in the stdio package. The reason is that one of the
deficiencies of the stdio package (in my opinion) is that the
programmer has to worry about where and how to store the characters
read in. I wanted programs using this package to worry as little as
possible about how to store the read-in sequences and entries. Thus,
the SEQIO package always remembers a "current" sequence and entry,
and the sequence, entry or information about the sequence can be
retrieved as needed.

In addition, the package can return the sequence/entry/information
character strings in one of two ways, either using an internal buffer or
by malloc'ing a new buffer to store the string. The third argument to
`seqfgetseq' is a flag telling how the sequence text should be returned
(zero specifies an internal buffer and non-zero specifies a malloc'ed
buffer). So, the `seqfgetseq' call above tells the SEQIO package to read
the next sequence in the file, make that the "current" sequence, and
return that sequence's text using its internal buffers.
As another
example, the following snippet shows how to accumulate the sequences
of a database (up to the capacity of the array) using malloc'ed buffers,
so that each sequence remains available until its buffer is freed:

    int i, len;
    char *seq, *seqs[400];
    SEQFILE *sfp;

    if ((sfp = seqfopendb("swiss-prot")) == NULL)
        exit(1);

    for (i=0; i < 400 && (seq = seqfgetseq(sfp, NULL, 1)) != NULL; ) {
        if (*seq != '\0')
            seqs[i++] = seq;
    }
    seqfclose(sfp);

    /* Do the analysis of the sequences. */

    while (i > 0)
        free(seqs[--i]);

Giving a non-zero third argument to `seqfgetseq' tells the SEQIO
package to malloc a new buffer for each sequence, so they can be kept
around after the next call to the package (the internal buffers are
reused, so their contents may be changed on the next call to a SEQIO
function).

Also, note in this example that the second argument to the `seqfgetseq'
function is NULL. One of the guarantees the SEQIO package makes is
that the character strings of sequences and entries will be
NULL-terminated strings, so you don't necessarily need the string's
length to know where the sequence/entry ends. This also makes it easy
to output the sequence or entry text, as in this version of the first
example above which outputs the text of each entry whose sequence
matches:

    while ((seq = seqfgetseq(sfp, &len, 0)) != NULL) {
        if (len > 0 && isa_match(seq, len)) {
            /* Found a match */
            entry = seqfentry(sfp, NULL, 0);
            fputs(entry, stdout);
        }
    }

Note the use of `seqfentry' instead of `seqfgetentry'. The function
`seqfentry' just returns the text of the "current" entry, and does NOT
read the next entry in the file.
With this use of a "current" sequence
and entry, a program can get multiple pieces of information about a
sequence/entry one piece at a time, without having to worry about
getting everything it needs at once.

The third and fourth differences between the stdio package and the
SEQIO package in these examples are slightly harder to see. They
involve the handling of errors. The third difference is that the program
simply exits when `seqfopen' returns NULL, seemingly without printing
an error message, and the fourth difference is the use of "len > 0"
and "*seq != '\0'" as additional tests to see if a sequence was
returned by `seqfgetseq'.

The long explanations of these differences are given in the Error
Handling section and in file "seqio.doc". The
short answers are that the SEQIO package by default outputs error
messages when an error occurs (but this can be disabled), and that the
`seqfgetseq' and `seqfgetentry' functions are unique in that they return
one of three values: 1) a string of characters on a successful read, 2)
an empty string with length 0 if there is a problem reading the next
sequence/entry (such as when the next entry contains no sequence),
but that problem is not a fatal error, and 3) NULL if end-of-file is
reached or a fatal error occurs (an error for which no more reading can
be done).

The functions `seqfopen' and `seqfopendb' are the common ways to
open a file/database, and `seqfgetseq', `seqfgetentry' and `seqfgetinfo'
(described in the next section) are the common ways to read in the
sequences/entries in the file/database. There are a couple of other
ways, using the functions `seqfopen2', `seqfread' and `seqfgetrawseq'.
Those functions are described in file "seqio.doc".
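
The three-valued return convention described above (a string, an empty
string, or NULL) can be tried out without the package itself. The sketch
below is self-contained and makes one loud assumption: `mock_getseq' is a
hypothetical stand-in written for this illustration, NOT a SEQIO
function. It only mimics the convention the text attributes to
`seqfgetseq', and `count_usable' is the kind of loop a caller might wrap
around it.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a SEQIO-style reader (NOT part of SEQIO).
 * It walks a NULL-terminated array of strings and mimics the convention
 * described in the text: non-empty string = successful read, empty
 * string = non-fatal problem, NULL = end of input or fatal error. */
typedef struct {
    const char **items;   /* NULL-terminated list of fake reads */
    int next;             /* index of the next item to hand out */
} MockFile;

static const char *mock_getseq(MockFile *mf, int *len)
{
    const char *s = mf->items[mf->next];

    if (s == NULL) {          /* "EOF": stay parked on the terminator */
        if (len != NULL)
            *len = 0;
        return NULL;
    }
    mf->next++;
    if (len != NULL)
        *len = (int) strlen(s);
    return s;
}

/* Count the usable sequences; *skipped receives the number of empty,
 * non-fatal reads that were passed over. */
static int count_usable(MockFile *mf, int *skipped)
{
    const char *seq;
    int len, usable = 0;

    *skipped = 0;
    while ((seq = mock_getseq(mf, &len)) != NULL) {
        if (len > 0)
            usable++;         /* a real sequence */
        else
            (*skipped)++;     /* empty string: keep reading */
    }
    return usable;
}
```

The point of the design is that the loop condition only has to test for
NULL, while the "len > 0" test inside the loop separates real sequences
from entries that could not be read but did not stop the scan.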



Extracting Information from Entries
***********************************

For most of the sequence file formats that have been created (and most
of the formats supported by the package), the entries in a file contain
quite a bit more information than just the sequence itself. For instance,
in the GenBank database, the sequence characters make up only a
third of the characters in the database files. The rest of the database
contains information about those sequences (identifiers, descriptions,
references, features, and so on). In designing the SEQIO package, I
tried to do two things: provide a method to automatically extract a
number of the more common (and less complex) pieces of information
stored in an entry, and make it as easy as possible to extract other
information.

The Raw Sequence
================

One such piece of information that is often needed is the "raw"
sequence text, giving both the sequence characters and any alignment
or structural notation characters that are associated with the
sequence. In many entries, the sequence is expressed not by itself, but
in terms of an alignment of that sequence with others (such as the
sequences in the other entries of the file).

The function `seqfrawseq' can be used to retrieve both the sequence
and the alignment/structure information specified with the sequence.
This function works exactly the same as `seqfentry' and `seqfsequence',
except that the string returned by each function is different. Function
`seqfentry' returns the complete entry text, `seqfsequence' returns
only the characters of the sequence (typically all of the alphabetic
characters), and `seqfrawseq' returns the sequence and the
alignment/structure characters (typically all characters except
whitespace and digits).
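
To make the difference between the "true" sequence and the "raw"
sequence concrete, here is a minimal, self-contained sketch of the two
filters as the text describes the typical case. The function names
`extract_true_seq' and `extract_raw_seq' are made up for this
illustration; they are not SEQIO functions, and the package's real rules
depend on the file format (see "format.doc").

```c
#include <ctype.h>
#include <stddef.h>

/* Illustration only (not SEQIO code): keep just the alphabetic
 * characters of `text', i.e. the typical "true" sequence. The result
 * is always terminated; returns the number of characters stored. */
static size_t extract_true_seq(const char *text, char *out, size_t outsize)
{
    size_t n = 0;

    if (outsize == 0)
        return 0;
    for (; *text != '\0' && n + 1 < outsize; text++) {
        if (isalpha((unsigned char) *text))
            out[n++] = *text;
    }
    out[n] = '\0';
    return n;
}

/* Illustration only: keep everything except whitespace and digits,
 * i.e. the typical "raw" sequence with its gap/notation characters. */
static size_t extract_raw_seq(const char *text, char *out, size_t outsize)
{
    size_t n = 0;

    if (outsize == 0)
        return 0;
    for (; *text != '\0' && n + 1 < outsize; text++) {
        unsigned char c = (unsigned char) *text;
        if (!isspace(c) && !isdigit(c))
            out[n++] = *text;
    }
    out[n] = '\0';
    return n;
}
```

Given the text "  1 AC-GT 10", the first filter stores "ACGT" and the
second stores "AC-GT": the gap character survives only in the raw form.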

The SEQINFO Structure
=====================

Other information contained in an entry is extracted and typically
returned through a SEQINFO structure, which is defined by the package
in the file "seqio.h". In
addition, there are interface functions which can be used to retrieve
each of the individual fields in the SEQINFO structure. The SEQINFO
structure is defined as follows:

    typedef struct {
        char *dbname, *filename, *format;
        int entryno, seqno, numseqs;

        char *date, *idlist, *description;
        char *comment, *organism, *history;
        int isfragment, iscircular, alphabet;
        int fragstart, truelen, rawlen;
    } SEQINFO;

The structure contains six fields describing where the current sequence
comes from:

dbname
    The name of the database being searched (if this is a search of
    an actual database).
filename
    The name of the file currently being read.
format
    The format of the file (and the current entry).
entryno
    The location of the current entry in the file (if entryno is 10, then
    the current entry is the tenth entry in the file).
seqno
    The location of the current sequence in the current entry (if
    seqno is 3, then the current sequence is the third in the current
    entry).
numseqs
    The number of sequences contained in the current entry.

So, the current sequence's location is the `seqno' sequence of the
`entryno' entry of the file `filename' (possibly of the database
`dbname'). The `format' string gives the entry's format, and the entry
contains `numseqs' sequences.
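
As a small illustration of how the six location fields fit together, the
following sketch builds a one-line description of a sequence's location.
It is self-contained and uses a local `LocationFields' type that merely
mirrors the six fields listed above; that type and the function
`describe_location' are assumptions made for this example, and a real
program would read the same fields from the SEQINFO structure in
"seqio.h".

```c
#include <stdio.h>

/* Local illustration only: mirrors the six location fields listed
 * above. Real programs would use SEQINFO from "seqio.h" instead. */
typedef struct {
    const char *dbname, *filename, *format;
    int entryno, seqno, numseqs;
} LocationFields;

/* Write a description such as "sequence 3 of 5 in entry 10 of FILE"
 * into buf, adding the database name when one is set. Returns the
 * value of snprintf (the length the full description needs). */
static int describe_location(const LocationFields *loc, char *buf,
                             size_t buflen)
{
    const char *file = (loc->filename != NULL) ? loc->filename
                                               : "(unknown file)";

    if (loc->dbname != NULL && loc->dbname[0] != '\0')
        return snprintf(buf, buflen,
                        "sequence %d of %d in entry %d of %s (database %s)",
                        loc->seqno, loc->numseqs, loc->entryno,
                        file, loc->dbname);

    return snprintf(buf, buflen, "sequence %d of %d in entry %d of %s",
                    loc->seqno, loc->numseqs, loc->entryno, file);
}
```

The file name "gbpri.seq" used when trying this out is only a sample
value, not something the package requires.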

The other twelve fields are information extracted from the current entry
(see "format.doc" for the details about which information is retrieved
for each file format):

date
    A single date giving the last time the entry was either created or
    updated. Its format should be day-month-year, as in
    31-JAN-1995.
idlist
    The list of identifiers given in the entry. The idlist is a
    string containing a vertical-bar-separated list of identifiers, each
    of which consists of an identifier prefix, a ':' and the identifier.
    See file "user.doc" for more information about identifiers and
    identifier prefixes.
description
    A description of the sequence or sequences in the entry. This is
    the "Title" or "Definition" line in some file formats. This string
    should consist of a single "line" of text, although it can be of any
    length. So, no newlines should appear in this text (they are
    removed and added when the description is read from and
    output in the sequence entries).
comment
    A block of text giving a comment about the sequence. The string
    can contain one or more lines of any length. The one restriction
    to the text appearing in a comment is that any block of lines at
    the end of an entry's comment section where each line begins
    with the string "SEQIO" is reserved for other use by the package
    (this block holds extra identifiers or the `history' lines).
organism
    The name of the organism the sequence was taken from. Right
    now, this field can contain any single "line" of text, although I
    would like to standardize the contents of this field. It's on my
    TODO list.
history
    This holds the lines of text placed in the comment section of
    entries which describe previous SEQIO operations on this entry,
    i.e., it holds the history of alterations and updates made to this
    entry by programs using the SEQIO package.
    Any block of lines
    at the end of a comment section where each line begins with the
    string "SEQIO" is not considered part of the comment, but part
    of the history.
isfragment
    This integer is non-zero if the sequence is a fragment of a larger
    sequence, and zero if the sequence is complete (or if it is not
    known whether the sequence is a fragment).
iscircular
    This integer is non-zero if the sequence is a circular sequence,
    and zero if it is a linear sequence (or if its circularity is not
    known).
alphabet
    This integer is one of the predefined constants DNA, RNA,
    PROTEIN or UNKNOWN. Its value is UNKNOWN unless either
    the database's BIOSEQ entry (information field "Alphabet") or
    the entry itself explicitly specifies the alphabet. The package
    does not try to guess the alphabet.
fragstart
    When the sequence is a fragment of a larger sequence and the
    location of this fragment in the larger sequence is known, this
    value gives the starting position of the fragment. If this value is
    not known (or the sequence is complete), fragstart is set to 0.
truelen
    This is the "true" length of the sequence, i.e., the length of the
    sequence without any gap characters or notational characters.
    Typically, these are just the alphabetic characters.
rawlen
    This is the "raw" length of the sequence, i.e., the length of the
    sequence which includes the gap and notational characters.
    Typically these are all characters except whitespace and digits.

Accessing this information for an entry is very similar to
accessing the sequence and entry text.
The functions `seqfgetinfo' and
`seqfinfo' work along the lines of `seqfgetentry' and `seqfentry', and so
the following code snippet finds and outputs all of the entries with
circular sequences:

    char *entry;
    SEQINFO *info;
    SEQFILE *sfp;

    if ((sfp = seqfopendb("genbank")) == NULL)
        exit(1);

    while ((info = seqfgetinfo(sfp, 0)) != NULL) {
        if (info->iscircular) {
            entry = seqfentry(sfp, NULL, 0);
            fputs(entry, stdout);
        }
    }
    seqfclose(sfp);

and this code snippet finds and outputs the entry (or entries) with a
given accession number:

    char *s, *t, *idlist, *entry;
    SEQFILE *sfp;

    if ((sfp = seqfopendb("genbank")) == NULL)
        exit(1);

    while (seqfread(sfp, 1) == 0) {
        idlist = seqfidlist(sfp, 0);
        if (idlist != NULL) {
            /*
             * Scan the idlist, looking for an identifier whose prefix is
             * "acc" and whose number matches the accession.
             */
            s = idlist;
            while (*s) {
                for (t=s; *s && *s != '|'; s++) ;

                /* Require the whole identifier to match, not just a prefix. */
                if (s - t == 10 && strncmp(t, "acc:X01828", 10) == 0) {
                    entry = seqfentry(sfp, NULL, 0);
                    fputs(entry, stdout);
                    break;
                }

                if (*s) s++;
            }
        }
    }
    seqfclose(sfp);

A couple points to note about these examples and the fields of the
SEQINFO structure. First, the string `idlist' is a vertical bar separated
list of identifiers, where each identifier consists of a prefix naming the
database or type of identifier and a suffix giving the actual id. See file
"user.doc" for a complete description of these identifiers and identifier
prefixes.

Third, the functions like `seqfidlist' are similar to `seqfsequence',
`seqfentry', and `seqfinfo' in that they return some information about
the "current" sequence/entry. The package has one of these access
functions for every field in the SEQINFO structure (i.e., `seqfdate',
`seqfiscircular', ...).
For the SEQINFO fields that are character strings,
these functions take two arguments, where the second argument is just
like the third argument of `seqfsequence' or `seqfentry'. It tells whether
the package should return the character string using an internal buffer
or in a malloc'ed buffer. (Again, be aware that the internal buffer strings
are guaranteed to remain unchanged only up to the next call to the
SEQIO package.)

Fourth, the previous point raises the question of what happens when
`seqfinfo' or `seqfgetinfo' is called with a second argument of 1, and the
SEQINFO structure is returned in a malloc'ed buffer. Where do the
character string fields of the structure point to? And will it be hard to
free up the SEQINFO structure and its character strings? When
`seqfinfo' or `seqfgetinfo' is called with a second argument of 1, they
actually malloc one large buffer, and store both the SEQINFO structure
and the character string fields in that one buffer. And since the
SEQINFO structure is placed at the beginning of the malloc'ed buffer,
simply free'ing the SEQINFO structure will automatically free up all of
its character strings.

And fifth, note the use of `seqfread' in the second example. It was used
because there is no `seqfgetidlist' function in the package. The only
functions which both read the next entry/sequence and return
something about that entry/sequence are `seqfgetseq',
`seqfgetrawseq', `seqfgetentry' and `seqfgetinfo'. To perform searches
using the other information functions, you must use `seqfread' or one
of the four entry/sequence reading functions listed in this paragraph
to advance to the next entry/sequence. Also, in case
the arguments to `seqfread' are confusing, the second argument to
`seqfread' is NOT the same as the second argument to `seqfidlist'. The
second argument to `seqfread' specifies whether to read the next
sequence (if zero) or to read the next entry (if non-zero).
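
The idlist scan in the accession-number example can be pulled out into a
small helper. The sketch below is self-contained and is only one way to
do it: the name `idlist_contains' is made up for this illustration and
is not part of the SEQIO interface. It assumes the idlist layout
described above (identifiers of the form "prefix:id" separated by
vertical bars) and compares each identifier in full, so that a longer
identifier that merely begins with the wanted text does not match.
Whether identifiers should be compared case-sensitively is not settled
here; this version is case-sensitive.

```c
#include <stddef.h>
#include <string.h>

/* Illustration only (not a SEQIO function): return 1 if the
 * vertical-bar-separated `idlist' contains an identifier exactly equal
 * to `wanted' (prefix included, e.g. "acc:X01828"), and 0 otherwise. */
static int idlist_contains(const char *idlist, const char *wanted)
{
    size_t wlen, toklen;
    const char *start, *end;

    if (idlist == NULL || wanted == NULL)
        return 0;
    wlen = strlen(wanted);

    for (start = idlist; ; start = end + 1) {
        /* The identifier runs up to the next '|' or the end of string. */
        end = strchr(start, '|');
        toklen = (end != NULL) ? (size_t) (end - start) : strlen(start);

        if (toklen == wlen && strncmp(start, wanted, wlen) == 0)
            return 1;
        if (end == NULL)
            return 0;
    }
}
```

With a helper like this, the body of the search loop reduces to a single
test on the string returned by `seqfidlist'.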

Seqfmainid, Seqfmainacc, Seqfoneline and Seqfallinfo
====================================================

The SEQIO package includes four other functions for accessing and
collecting information about each sequence: `seqfmainid', `seqfmainacc',
`seqfoneline' and `seqfallinfo'.

`Seqfmainid' and `seqfmainacc' are variations of `seqfidlist' which
only return a "main" identifier, instead of returning the whole identifier
list. This is useful in cases where you don't necessarily want to search
the complete list of identifiers, but just want a single identifier to
associate with a sequence. `Seqfmainid' returns the "main" identifier
for a sequence, which specifically is the first non-accession identifier,
if one exists, or the first accession number in the entry otherwise. The
`seqfmainacc' function returns the first accession number in the entry,
if one exists. Both have the same arguments as `seqfidlist', and both
return a NULL-terminated string containing the single identifier, with
an identifier prefix. So, the example above which searches for an
accession number could be rewritten as the following, if we were just
looking for the entry whose main accession number is "X01828":

    char *mainacc, *entry;
    SEQFILE *sfp;

    if ((sfp = seqfopendb("genbank")) == NULL)
        exit(1);

    while (seqfread(sfp, 1) == 0) {
        if ((mainacc = seqfmainacc(sfp, 0)) != NULL &&
            strcmp(mainacc, "acc:X01828") == 0) {
            entry = seqfentry(sfp, NULL, 0);
            fputs(entry, stdout);
        }
    }
    seqfclose(sfp);

The function `seqfoneline' can be used to create a "oneline"
description of the information for an entry. A number of programs (and
a number of file formats) have situations where they would like to
present the user with a relatively compact, one line description of a
particular sequence.
The SEQIO package defines a standard format for
this type of description for biological sequences, and `seqfoneline' is
the function the package provides to construct these descriptions. The
argument list for `seqfoneline' is the following:

    int seqfoneline(SEQINFO *info, char *buffer, int buflen, int idonly);

where `info' is a SEQINFO structure, `buffer' is a character buffer where
the oneline description will be stored, `buflen' is the length of the
buffer, and `idonly' will be discussed momentarily.

This function operates in a manner similar to `fgets', in that the string it
constructs is stored in the buffer passed to it. It differs from fgets in two
major respects (apart from the fact that it does no file reading). The first
is that the oneline description is guaranteed to both fit in the buffer and
to be NULL-terminated (i.e., no oneline description will ever be longer
than "buflen-1" characters). The second is that the function returns
the length of the oneline description stored in `buffer', instead of a
pointer to buffer itself. Hopefully, both of these differences will be more
useful in practice than the way fgets works.

The final argument to `seqfoneline' is an `idonly' flag specifying
whether the "oneline description" should in fact just contain a single
identifier for the sequence. This flag is useful in cases where you just
want a single identifier string that is guaranteed to be no longer than a
certain length (most notably in the output of the PHYLIP, Clustalw and
MSF formats). When the flag is non-zero, the string stored in `buffer' is
guaranteed to contain a single word identifier or description, and is
guaranteed not to contain any whitespace.

The final variation on accessing information from an entry is `seqfallinfo'.
This function works exactly like `seqfinfo', except that the comment field
of the SEQINFO structure returned contains a different string. Using
`seqfinfo', the comment string returned consists of whatever comment
appears in the entry. With `seqfallinfo', the comment string contains
the complete header of the entry. The specifics of what string this is
depend on the particular file format, but generally it consists of all of
the lines of the entry except the sequence lines.

Extracting Other Information
============================

The code snippets above illustrate the two ways of using the SEQIO
package to extract information from an entry. One way is to use
`seqfgetinfo' or `seqfinfo' to have the SEQIO package extract all of the
information it can from an entry, and then to access the fields of the
SEQINFO structure to get that information. The other way is to use the
access functions for the SEQINFO fields (`seqfidlist', `seqfiscircular',
and so on) to get one or more pieces of information from the entry.

If neither of those ways can get the information you're looking for, the
third way of getting information from a sequence is to get the entry's
text and scan that text for the information, as in this example which
outputs all entries in the file "alu.human" of the "REPBASE" database
which are classified in the "Alu-J" region:

    char *entry, *s;
    SEQFILE *sfp;

    if ((sfp = seqfopendb("repbase:alu.human")) == NULL)
        exit(1);

    while ((entry = seqfgetentry(sfp, NULL, 0)) != NULL) {
        if (strstr(entry, "\nFT \\rpt_family=\"Alu-J\""))
            fputs(entry, stdout);
    }
    seqfclose(sfp);

This works, because when looking at the "alu.human" file, the
sequences are classified by the line

    FT \rpt_family="Alu-J"

Thus, by reading each entry and doing a simple scan for that particular
line, I can extract the appropriate entries.
And of course, more
complicated (or robust) searches of the entries could be written, but
the point here is that the SEQIO package takes care of all of the file I/O
and simplifies the programmer's task to just implementing the
scanning.



Writing, Creating and Annotating Entries
****************************************

Writing Entries
===============

The process for writing sequences and entries is very similar to that of
the stdio package: open a file, call a function to write each entry, close
the file. The difference is that the function which writes each entry takes
a sequence and a SEQINFO structure as its arguments. Because of
this, the easiest example to give is actually a file format conversion
program. This one converts from EMBL to GenBank:

    int len;
    char *seq;
    SEQINFO *info;
    SEQFILE *insfp, *outsfp;

    if ((insfp = seqfopen("my_sequences", "r", "embl")) == NULL)
        exit(1);
    if ((outsfp = seqfopen("my_seqs.2", "w", "genbank")) == NULL)
        exit(1);

    while ((seq = seqfgetseq(insfp, &len, 0)) != NULL) {
        if (len > 0 && (info = seqfinfo(insfp, 0)) != NULL)
            seqfwrite(outsfp, seq, len, info);
    }
    seqfclose(insfp);
    seqfclose(outsfp);

The SEQIO package also contains a `seqfconvert' function, which can
simplify this code just a little bit (although there's not much farther that
you can go):

    SEQFILE *insfp, *outsfp;

    if ((insfp = seqfopen("my_sequences", "r", "embl")) == NULL)
        exit(1);
    if ((outsfp = seqfopen("my_seqs.2", "w", "genbank")) == NULL)
        exit(1);

    /* seqfread returns 0 on a successful read (see the earlier examples). */
    while (seqfread(insfp, 0) == 0)
        seqfconvert(insfp, outsfp);

    seqfclose(insfp);
    seqfclose(outsfp);

For the function `seqfopen', its second argument is the same as the
second argument to `fopen', except that `seqfopen' only supports
reading ("r"), writing ("w") and appending ("a")
modes. Also, when
writing a file, the third `seqfopen' argument specifying the format must
be given. It cannot be NULL.

Creating New Entries
====================

The `seqfwrite' function uses the sequence and the 12 entry
information fields of the SEQINFO structure (date, idlist, description,
comment, organism, history, isfragment, iscircular, alphabet, fragstart,
truelen, rawlen) when outputting the entry. It does not use the other six
SEQINFO fields. Also, any of the character string fields may be either
NULL or the empty string, in which case `seqfwrite' assumes that the
information is not available. The function does not require that all of the
fields be filled with information (it does the best it can with the
information it's given). The only requirement `seqfwrite' makes on its
arguments is that a non-empty sequence is given. It cannot output
entries with no sequence.

So, if you want to create new entries containing information that you
compute using some other method, simply declare a SEQINFO
structure, fill in its fields with the strings and values you've computed,
and pass it and the sequence to `seqfwrite'.

    int len;
    char *seq;
    SEQINFO info;
    SEQFILE *outsfp;

    if ((outsfp = seqfopen("new_seqs", "w", "sprot")) == NULL)
        exit(1);

    while (/* more entries to create */) {
        memset(&info, 0, sizeof(SEQINFO));

        /* Perform some computation to get a sequence and to fill in the
           fields of the SEQINFO structure. */

        seqfwrite(outsfp, seq, len, &info);
    }
    seqfclose(outsfp);

The SEQINFO structure has been defined so that all of the default
values for the fields are 0 (or NULL for character strings). Thus, setting
all of the bytes of the structure to 0 sets all of the default values.
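
The zero-then-fill pattern can be seen in isolation in the following
self-contained sketch. Two things in it are assumptions made for the
illustration: `SeqInfoSketch' is a local copy of the structure layout
quoted earlier in this guide (so the example compiles without
"seqio.h"), and `init_info' is a hypothetical helper, not a SEQIO
function. The sketch also relies on all-zero bytes reading back as NULL
pointers, which holds on common platforms.

```c
#include <string.h>

/* Local copy of the SEQINFO layout quoted earlier in this guide, for
 * illustration only; real programs include "seqio.h" instead. */
typedef struct {
    char *dbname, *filename, *format;
    int entryno, seqno, numseqs;

    char *date, *idlist, *description;
    char *comment, *organism, *history;
    int isfragment, iscircular, alphabet;
    int fragstart, truelen, rawlen;
} SeqInfoSketch;

/* Hypothetical helper: reset every field to its default (0 or NULL),
 * then fill in only the pieces of information that are known. */
static void init_info(SeqInfoSketch *info, char *description,
                      char *organism, int truelen)
{
    memset(info, 0, sizeof *info);   /* all defaults in one step */

    info->description = description;
    info->organism = organism;       /* may stay NULL if unknown */
    info->truelen = truelen;
}
```

Zeroing first means that a field the program never touches is reported
as "not available" rather than holding whatever happened to be in memory.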

Annotating Existing Entries
===========================

The function `seqfannotate' provides a solution to the common
problem of associating new information with an existing entry and its
sequence. A biologist runs a program or performs a database search to
find entries or sequences with a particular feature or pattern, i.e., some
new piece of information about that sequence. It would be nice to be
able to tag that entry with the new information. But, the question is
where to store the information? Keeping a separate file for the new
information can become a management headache, and using `seqfinfo'
and `seqfwrite' (or their cousins in other sequence I/O packages)
eliminates a lot of the other information the entry holds. The
`seqfannotate' function remedies this problem by allowing you to insert
new text as a comment in an entry as that entry is being output, so that
the outputted entry will contain all of the information in the original
entry plus the new, inserted information.

The function takes a SEQFILE pointer (open for writing), an entry and a
string, and it inserts the string into the comment section of the entry as
it is outputting the entry. The arguments for `seqfannotate' are the
following:

    int seqfannotate(SEQFILE *sfp, char *entry, int entrylen,
                     char *newcomment, int flag);

where `sfp' is the SEQFILE structure, `entry' and `entrylen' give the
necessary information about the entry, `newcomment' is the string to
be inserted, and `flag' tells whether or not to retain any existing
comments in the entry (zero says to remove all other comments and
non-zero says to retain the comments). As an example, here is the
example program given at the beginning of this file, extended so that it
adds the matching positions to the entry text.

    int len, entrylen;
    char *seq, *entry, *str;
    SEQFILE *sfp, *sfpout;

    if ((sfp = seqfopen("my_sequences.3", "r", "pir")) == NULL)
        exit(1);

    if ((sfpout = seqfopen("-", "w", seqfformat(sfp))) == NULL)
        exit(1);

    while ((seq = seqfgetseq(sfp, &len, 0)) != NULL) {
        if (len > 0 && (str = isa_match(seq, len)) != NULL) {
            /* Found a match */
            entry = seqfentry(sfp, &entrylen, 0);
            seqfannotate(sfpout, entry, entrylen, str, 1);
        }
    }
    seqfclose(sfp);
    seqfclose(sfpout);

where the function `isa_match' now returns a character string such as

    "Prosite Pattern: GLYCOSAMINOGLYCAN (S-G-x-G)\nMatches: 10-16, 503-508.\n"

instead of just a boolean flag. (Note: the string here is a literal version
of the string that isa_match might return.)

One thing that might appear to be missing from the `seqfannotate' call
is the format of the entry being passed to it. The format for the passed
in entry is assumed to be the same as the format that was specified
when the SEQFILE structure was opened for writing. (Note the use
above of `seqfformat' when opening the output, and recall that giving
"-" to `seqfopen' tells it to open standard input or standard output.) If
the entry is not in the correct form, a parse error will occur and nothing
will be output.

With the example program above, if the entry text given to
`seqfannotate' were the following (to use an actual PIR entry):

ENTRY            CCMQR      #type complete
TITLE            cytochrome c - rhesus macaque (tentative sequence)
ORGANISM         #formal_name Macaca mulatta #common_name rhesus macaque
DATE             17-Mar-1987 #sequence_revision 17-Mar-1987 #text_change
                   05-Aug-1994
ACCESSIONS       A00003
REFERENCE        A00003
   #authors      Rothfus, J.A.; Smith, E.L.
   #journal      J. Biol. Chem. (1965) 240:4277-4283
   #title        Amino acid sequence of rhesus monkey heart cytochrome c.
   #cross-references MUID:66045191
   #contents     Compositions of chymotryptic peptides and sequences of
                   residues 55-61 and 68-70
   #accession    A00003
      ##molecule_type protein
      ##residues      1-104 ##label ROT
CLASSIFICATION   #superfamily cytochrome c; cytochrome c homology
KEYWORDS         acetylated amino end; electron transfer; heme; mitochondrion;
                   oxidative phosphorylation; respiratory chain
FEATURE
   1             #modified_site acetylated amino end (Gly) #status
                   experimental\
   14,17         #binding_site heme (Cys) (covalent) #status predicted\
   18,80         #binding_site heme iron (His, Met) (axial ligands)
                   #status predicted
SUMMARY          #length 104 #molecular-weight 11605 #checksum 9512
SEQUENCE
                5        10        15        20        25        30
      1 G D V E K G K K I F I M K C S Q C H T V E K G G K H K T G P
     31 N L H G L F G R K T G Q A P G Y S Y T A A N K N K G I T W G
     61 E D T L M E Y L E N P K K Y I P G T K M I F V G I K K K E E
     91 R A D L I A Y L K K A T N E
///

the output from `seqfannotate' would be

ENTRY            CCMQR      #type complete
TITLE            cytochrome c - rhesus macaque (tentative sequence)
ORGANISM         #formal_name Macaca mulatta #common_name rhesus macaque
DATE             17-Mar-1987 #sequence_revision 17-Mar-1987 #text_change
                   05-Aug-1994
ACCESSIONS       A00003
REFERENCE        A00003
   #authors      Rothfus, J.A.; Smith, E.L.
   #journal      J. Biol. Chem. (1965) 240:4277-4283
   #title        Amino acid sequence of rhesus monkey heart cytochrome c.
   #cross-references MUID:66045191
   #contents     Compositions of chymotryptic peptides and sequences of
                   residues 55-61 and 68-70
   #accession    A00003
      ##molecule_type protein
      ##residues      1-104 ##label ROT
COMMENT          Prosite Pattern: GLYCOSAMINOGLYCAN (S-G-x-G)
                 Matches: 10-16, 503-508.

                 SEQIO annotation, lines 1-2.   02-Feb-1996
CLASSIFICATION   #superfamily cytochrome c; cytochrome c homology
KEYWORDS         acetylated amino end; electron transfer; heme; mitochondrion;
                   oxidative phosphorylation; respiratory chain
FEATURE
   1             #modified_site acetylated amino end (Gly) #status
                   experimental\
   14,17         #binding_site heme (Cys) (covalent) #status predicted\
   18,80         #binding_site heme iron (His, Met) (axial ligands)
                   #status predicted
SUMMARY          #length 104 #molecular-weight 11605 #checksum 9512
SEQUENCE
                5        10        15        20        25        30
      1 G D V E K G K K I F I M K C S Q C H T V E K G G K H K T G P
     31 N L H G L F G R K T G Q A P G Y S Y T A A N K N K G I T W G
     61 E D T L M E Y L E N P K K Y I P G T K M I F V G I K K K E E
     91 R A D L I A Y L K K A T N E
///

Note the new COMMENT section between the REFERENCE and
CLASSIFICATION sections. And when read back in again, the string
returned by `seqfcomment' would be the string

    "Prosite Pattern: GLYCOSAMINOGLYCAN (S-G-x-G)\nMatches: 10-16, 503-508.\n"

which is exactly what was inserted (because the original entry had no
other comments).



BIOSEQ Stuff (Database Information Processing)
**********************************************

The first three sections present essentially all of the main functionality
for reading and writing files and performing database searches. (There
are a couple of additional functions, but I'll leave you to read
"seqio.doc" to find out what they are.) Sometimes, however, a program
needs more control over the operations that are performed than the basic
functions of the package permit. These next two sections describe
additional features that can provide the extra control.

This section discusses four of the five functions related to the
BIOSEQ standard for specifying and searching databases.
I assume in this section that you have read the parts of "user.doc" that
relate to the BIOSEQ standard and have some idea about what a BIOSEQ
file looks like. Please go read that text first.

The five BIOSEQ functions that are included in the SEQIO package
(and in fact make up all of its functionality except for the standard itself)
are `bioseq_read', which reads the BIOSEQ files; `bioseq_check', which
checks whether a database search specifier is valid; `bioseq_info',
which gets an information field from a BIOSEQ entry; `bioseq_parse',
which gets the list of files specified by a database search; and
`bioseq_matchinfo', which determines which BIOSEQ entry for a
database has an information field with a particular value. This section
talks about all of these functions except `bioseq_matchinfo'.

The function `bioseq_read' takes in the name of a file, reads the
BIOSEQ entries in the file, checks the syntax of those entries, and
stores all of the entry information in internal data structures. Those
data structures are then used by the `bioseq_info', `bioseq_matchinfo'
and `bioseq_parse' functions.

By default, the first files read are always the files specified by the
"BIOSEQ" environment variable, if it is defined. This is done before any
of the bioseq_* functions perform their operation. Then, each call to
`bioseq_read' reads subsequent files.

The internal data structure used by the package is a list of the read-in
entries, and the determination of which entry a database search
specification refers to is performed by searching through the list. The
entries in the list are stored in reverse order of the calls to
bioseq_read, but in the given order within a specific call to
bioseq_read. So, the first entry checked is always the first entry of the
first file from the last call to bioseq_read.
From there, the rest of the entries in that last call are checked, and
after the last entry of that last call, the first entry of the
next-to-last call to bioseq_read is checked. This way, the later calls to
`bioseq_read' will have priority over the previous calls to
`bioseq_read' (or the "BIOSEQ" environment variable files), in case of
duplicates.

Therefore, if you're writing a program and you want to allow the user to
have multiple ways to specify BIOSEQ files (such as the BIOSEQ
environment variable, plus other user-specified or program-specific
files), use `bioseq_read' to read in the files in increasing priority, and
the SEQIO package will always pick the highest priority BIOSEQ entry
for each database. And, if you want the files specified by the "BIOSEQ"
environment variable to have a higher priority than other files, simply
call `bioseq_read' to reread the environment variable value. A BIOSEQ
file can always be read in more than once, and the latest read will always
override the entries from the previous read (unless the names of the
BIOSEQ entries have changed between reads).

The function `bioseq_check' takes a database search specifier and
checks whether it refers to a known database (i.e., whether a BIOSEQ
entry exists for that database). It returns non-zero if the BIOSEQ entry
exists, and zero otherwise. This can be used for a quick error check
testing whether the specifier given by the user is valid or not.

The function `bioseq_info' is used to get the text from an information
field in the BIOSEQ entry for a database. These information fields
provide an easy way for the user to pass database-specific information
to your program. One example of this is to allow the user to specify
some command line options using an information field specific to the
database.
This way, the user can "tune" the program for each database,
without having to always keep track of what option values must be
specified for each database.

The SEQIO package also "defines" several information fields that it
uses when performing database searches. These fields are `Name',
`Format', `Alphabet', `IdPrefix' and `Index'. The `Name' field gives the
name of the database, and its presence distinguishes BIOSEQ entries
for databases from entries for personal collections of files. The `Format'
and `Alphabet' fields specify the format for the database files and the
alphabet for the database sequences, respectively. The `IdPrefix' field
specifies the identifier prefix that should be given to the main identifier
in each entry. The `Index' field specifies the name of the file which
indexes all of the database's entries (see "idxseq.doc" for more
information about the index files).

(NOTE: Information fields can only be "defined" in the sense that the
user can be asked to place the requested text in information fields for
the specified keywords. There is nothing requiring those fields to be
there or restricting what text the user puts there, except maybe that
improper text will trigger an error in the package or your program.)

The `bioseq_parse' function is the function used to parse database
search specifications and determine the list of files that should be read
in that search. This function (along with the `bioseq_info' function for
the `Name', `Format', `Alphabet' and `IdPrefix' fields above) is used by
`seqfopendb' to open a database search.
In fact, that initial example opening a database search
could be replaced with the following code snippet, and it would perform
the same operations (with one exception noted below):

    int len;
    char *s, *t, *files, *seq;
    SEQFILE *sfp;

    /*
     * The next 9 lines replace the lines:
     *     if ((sfp = seqfopendb("genbank")) == NULL)
     *       exit(1);
     */
    if ((files = bioseq_parse("genbank")) == NULL)
      exit(1);

    for (s=files; *s; s++) {
      /* Find the end of the current filename and terminate it in place */
      for (t=s; *s != '\n'; s++) ;
      *s = '\0';

      if ((sfp = seqfopen(t, "r", NULL)) == NULL)
        exit(1);

      while ((seq = seqfgetseq(sfp, &len, 0)) != NULL) {
        if (len > 0 && isa_match(seq, len)) {
          /* Found a match */
        }
      }
      seqfclose(sfp);
    }
    free(files);

The string returned by `bioseq_parse' is a list of the database's files to
be read, where each filename is terminated by a newline character
(including the last filename), and the whole string is terminated by a
NUL ('\0') character. This string is stored in a malloc'ed buffer, and so
must be freed when no longer useful. (Why newline? Hey, it probably won't
appear in a filename, it's different from '\0' and it makes printing the list
of files look nice. Got better reasons for some other character?)

The example above opens the same set of files and reads the same
sequences. The only potential difference between the execution of that
example and the example using `seqfopendb' is that the SEQIO
package will not know about the four information fields associated with
the database, and so minor differences may appear in the results (very
minor differences in the fields of any SEQINFO structure and any
output generated by SEQIO).
This information could be included in the
example using `bioseq_info', `seqfsetdbname', `seqfsetidpref' and
`seqfsetalpha', as follows:

    char *format, *dbname, *alpha, *idprefix;

    if ((files = bioseq_parse("genbank")) == NULL)
      exit(1);

    format = bioseq_info("genbank", "Format");
    dbname = bioseq_info("genbank", "Name");
    alpha = bioseq_info("genbank", "Alphabet");
    idprefix = bioseq_info("genbank", "IdPrefix");

    for (s=files; *s; s++) {
      for (t=s; *s != '\n'; s++) ;
      *s = '\0';

      if ((sfp = seqfopen(t, "r", format)) == NULL)
        exit(1);

      /* Pass along whichever information fields the BIOSEQ entry has */
      if (dbname != NULL)
        seqfsetdbname(sfp, dbname);
      if (alpha != NULL)
        seqfsetalpha(sfp, alpha);
      if (idprefix != NULL)
        seqfsetidpref(sfp, idprefix);

      while ((seq = seqfgetseq(sfp, &len, 0)) != NULL) {
        if (len > 0 && isa_match(seq, len)) {
          /* Found a match */
        }
      }
      seqfclose(sfp);
    }

    free(files);
    if (format != NULL)
      free(format);
    if (dbname != NULL)
      free(dbname);
    if (alpha != NULL)
      free(alpha);
    if (idprefix != NULL)
      free(idprefix);

Note that each string returned by `bioseq_info' is also returned in
a malloc'ed buffer, and so must be freed after its use.



Error Handling
**************

There are three things that any programmer must figure out when
writing a program (apart from what the program actually is supposed to
do). They are what the user interface will look like, how the program is
going to store the data it uses, and how the program handles errors.
Since this is a package and not a complete program, I leave the user
interface to your dreams and abilities, but I want to try to simplify the
other two tasks as much as possible. I've talked about how the SEQIO
package keeps track of a lot of data internally, and can return that data
when asked.
Here, I want to describe how the package handles errors,
and the way you can specify how the package should handle them.

The image I had when designing the error handling of the package was
that when the package is being used to create a "quick and dirty"
program that is written just to quickly get information or entries from a
database or file, the SEQIO package should do as much as possible to
descriptively report and properly handle errors. However, when the
package is used to create robust application software with either a
command line or a windowing user interface, the programmer should
have the ability to disable some or all of that reporting/handling
mechanism and replace it with their own error handling routines.

By default, when the SEQIO package detects an error, it first sets the
values of variables `seqferrno' and `seqferrstr' to an integer error value
and the text of an error message, respectively. These variables are
defined in "seqio.h" as extern variables, so you have access to their
values at all times. (See file "seqio.doc" for a more complete
description of the values `seqferrno' can take.) The next thing that the
package does is output an error message on standard error. And
finally, depending on the seriousness of the error, the package may
either return an error value as the result of the SEQIO function call or it
may exit the program.

Obviously, the outputting of an error message or the program exiting
can affect your user interface, so I've tried to design the package so
you can either work with these actions more easily or disable them
easily. The first thing I've done is try to write all of the error messages
so that they would be comprehensible to the user of your program, who
may not know about the SEQIO package.
I could not handle all of the
cases (in particular, the error messages from calls to `seqfparseent' and
`seqfannotate' are not as informative, because those functions are not
given any information that originally came from the user, such as a
filename). But, for the most part, the error messages should not be
incomprehensible. If you do find an error message that you think could
be improved, please send a message to knight@cs.ucdavis.edu.

The second thing I've done is to limit the times when the package exits
the program only to when (1) the package detects that no more memory
is available or when (2) it detects a bug in the package code. Thus,
(hopefully) there will be few occasions when the package will actually
exit the program. And, typically the "quick and dirty" programs don't
have any better handling of these errors.

The third thing I've done is to include a function `seqfsetperror' to
allow you to redirect all of the error printing the package does. This
function takes another function as its argument, and, when given that
argument function, the SEQIO package will call that function for any
error printing, instead of calling its default print error function. Thus,
you can redirect all of the error output to an empty function, to a
function that changes the text of the error messages, or to a function
which pops up an error window with the text of the message.

The fourth thing I've done is to add a function `seqferrpolicy' which
allows you to disable some or all of the error output and to control
whether the program calls exit on memory errors and program bugs. See
the file "seqio.doc" for the details on `seqferrpolicy'. Thus, when you
want to handle the error reporting and handling yourself, the package can
be told to just set `seqferrno', set `seqferrstr' and return error values
from the package functions.
And, even in that case, you still have access to
the message that the package would have output, since that message
is stored in `seqferrstr'. So, for example, if you are writing a windowing
program and you want some but not all error messages to appear in a
popup window, you can make the call "seqferrpolicy(PE_NONE)",
and then, after each SEQIO package call which may trigger an error
worth reporting, check the value of `seqferrno'. With that policy, the
package is guaranteed never to output any messages or exit the program
(except if it core dumps, of course).



Porting the Package to Another Machine
**************************************

Currently, the package has been tested under the following operating
systems:

    Ultrix, SunOS, Solaris, IRIX, Windows NT/95

If your machine is not one of these, there is a chance the program may
not compile on it. Based on my experience with other software I've
written, my guess is that the code should compile on most of the Unix
variants, with the exception that the proper include files needed to read
directory files may differ from those in the code. On non-Unix variants,
the code probably will not compile, as the code dealing with directory
files is specifically geared for the Unix and Windows operating systems.

If you do have a machine not on the list, are not able to compile the
package and want to port it, first send me mail (at
knight@cs.ucdavis.edu). I am very interested in getting the program to
work on as many systems as possible, and will try to help as much as
possible (including implementing any changes on my latest version of the
code and immediately sending you a personal release, so that you would
not have to wait until the next version of the code came out). Then,
check the list of things below, which may narrow down where the problem
lies.

First, the current version of the code uses the following include files:

    #include <stdio.h>
    #include <stdlib.h>
    #include <ctype.h>
    #include <fcntl.h>
    #include <stdarg.h>
    #include <string.h>
    #include <time.h>
    #include <errno.h>
    #include <sys/types.h>
    #include <sys/stat.h>

    #ifdef __unix
    #include <dirent.h>
    #ifdef SYSV
    #include <sys/dirent.h>
    #endif
    #endif

    #ifdef WIN32
    #include <windows.h>
    #endif

    #include "seqio.h"

plus, the following include file

    #include <sys/mman.h>

is ifdef'ed inside the preprocessor define value ISMAPABLE (see below
for the discussion of the `mmap' system call and ISMAPABLE).

If your machine does not have some of these includes, take them out,
figure out which variable/functions needed those includes, and then
figure out which include files your system needs to declare those
variables/functions.

Second, here is a complete list of the external variables and function
calls used by the bulk of my program.

 * Current set of external calls in main section of code:
 *    exit, fclose, fopen, fputc, fputc, fprintf, free, fwrite,
 *    getenv, getpagesize, isalpha, isalnum, isdigit, isspace,
 *    malloc, memcpy, memset, realloc, sizeof, sprintf,
 *    strcpy, strcmp, strlen, strncmp, tolower, va_arg, va_end,
 *    va_start, vsprintf
 *    mmap, munmap            (these are ifdef'd inside `ISMAPABLE')
 *
 * Current set of (unusual?) data-structures/variables in main section:
 *    errno, va_list, __LINE__,
 *    caddr_t                 (this is ifdef'd inside `ISMAPABLE')

In addition, I've encapsulated a lot of the system operations into
functions at the end of the file "seqio.c". My assumption was that the
functions and variables above are common to most or all machines,
whereas the functions and variables below are more machine specific.
So, I put all of the machine specific code at the end of the file, where it
is much easier to find. Here is a list of all of the
functions/variables/structures used in these encapsulated functions:

 * Current set of external calls in end section of code:
 *    close, ctime, open, read, stat, time
 *
 *    closedir, opendir, readdir   (these are ifdef'd inside `__unix')
 *
 *    GetCurrentDirectory, SetCurrentDirectory,
 *    FindFirstFile, FindNextFile, CloseHandle
 *                                 (these are ifdef'd inside `WIN32')
 *
 * Current set of (unusual?) data-structures/variables in end section:
 *    stat structure, time_t, stdin, stdout, stderr
 *    DIR, dirent structure        (these are ifdef'd inside `__unix')
 *    WIN32_FIND_DATA, HANDLE      (these are ifdef'd inside `WIN32')

If any of these functions or variables are not supported on your
machine, please let me know and we can figure out how to work around
them.

Here are some additional tips and requirements for the package:

  1. For Unix variants, if the structures DIR and dirent, and the
     functions opendir, readdir and closedir, are problems for the
     compiler, check the man pages of those functions for the include
     files needed to use them. The current include files I've specified
     are the following:

         #include <sys/types.h>
         #include <dirent.h>
         #ifdef SYSV
         #include <sys/dirent.h>
         #endif

     These include files are compatible with SunOS, SOLARIS, Ultrix,
     OSF, DYNIX (or whatever the Sequent's Unix variant is called),
     IRIX and HPUX. I have tested the directory include files on all
     these.

  2. If one of the string functions (strcmp, strlen, strcpy, ...) or the
     character class functions (isalpha, isspace, isdigit, ...)
     is not supported, then tell me about it and I will add my own
     version of that function to the code, remove the use of that
     function from the package and send you a new release. One of my
     goals for the package is that no compiler options ever need to be
     specified to get the program to compile correctly. So, it's better
     (from my point of view) to just replace any function that may not
     exist on a machine, rather than have the users worry about
     configuring the package for different machines.

  3. The program requires that ints be 4 bytes long, as they will
     take values larger than 65536. This shouldn't be a problem,
     except on PCs. If you wish to port it to a PC, what I can do is
     create an "int4" typedef that can be set to the appropriate value
     for the different machines.

  4. I've created typedefs to hide the datatype used when reading
     directories and when reading from raw files using open and read.
     If your system requires different data structures for those values,
     the typedef declarations are at the beginning of "seqio.c".

  5. I've also created a "dirch" variable to hold the character used by
     the operating system to distinguish between directories in a
     path. Currently, that variable is set to the character '/' (for
     Unix) but it can be reset using system specific ifdefs to another
     character (such as '\' for Windows NT). This variable is also
     declared at the beginning of "seqio.c".

     This variable is used in all of the BIOSEQ processing, which
     must know the format of directory pathnames. If directory paths
     use some format other than a string of names separated by the
     directory character (as the VMS systems do), we'll have to work
     together to reimplement the BIOSEQ processing.
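
As a rough illustration of tips 3 and 5, here is a minimal sketch of how
an "int4" typedef and a per-platform directory character could be chosen
with the preprocessor. The names `int4' and `dirch' come from the tips
above, but the definitions shown here, the `join_path' helper and the
test for `_WIN32' are assumptions made for this example. They are not
taken from "seqio.c".

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical "int4": an integer type holding at least 32 bits. */
#if INT_MAX >= 2147483647
typedef int int4;
#else
typedef long int4;      /* long is guaranteed to be at least 32 bits */
#endif

/* Hypothetical "dirch": the character separating directories in a path. */
#if defined(WIN32) || defined(_WIN32)
static const char dirch = '\\';
#else
static const char dirch = '/';
#endif

/*
 * Join a directory name and a filename using dirch (illustrative helper).
 * Returns 0 on success, or -1 if the result would not fit in buf.
 */
static int join_path(char *buf, size_t bufsize,
                     const char *dir, const char *file)
{
  size_t dlen, flen;
  int need_sep;

  assert(buf != NULL && dir != NULL && file != NULL);

  dlen = strlen(dir);
  flen = strlen(file);
  /* Add a separator only if dir is non-empty and lacks a trailing one */
  need_sep = (dlen > 0 && dir[dlen - 1] != dirch);

  if (dlen + need_sep + flen + 1 > bufsize)
    return -1;

  memcpy(buf, dir, dlen);
  if (need_sep)
    buf[dlen++] = dirch;
  memcpy(buf + dlen, file, flen + 1);   /* copies the trailing '\0' too */
  return 0;
}
```

With definitions along these lines, the rest of the code never mentions
'/' or '\' directly, so only the ifdef at the top would need to change
for a new system. As tip 5 notes, this only helps when paths are a string
of names separated by a single character.
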

Finally, if your machine is not on the list and even if you are able to
compile the program successfully, I would like you to check one
additional feature. Some of the Unix variants support calls to a function
`mmap', which directly maps disk files into the memory of a program.
I've added code to use this function, because it speeds up file reading
by about 30-40%. I would like you to check to see if your machine
supports the `mmap' call on generic files (some systems, like Ultrix,
have the `mmap' call but it only works for device files).

I have encapsulated all of the code dealing with the `mmap' call inside a
preprocessor define value ISMAPABLE, and at the beginning of
"seqio.c", I include an ifdef expression which, for the systems that
support the `mmap' call, defines ISMAPABLE. So, another way you can
check to see if the `mmap' call on your system exists is to compile the
program with the -DISMAPABLE option and see if it compiles. If so,
please send me mail so I can add that system to the ifdef expression
that turns on the mmap'ing.


James R. Knight, knight@cs.ucdavis.edu
June 28, 1996