SEQIO -- A Package for Sequence File I/O


SEQIO.DOC - The SEQIO Package Interface
***************************************



The SEQIO package is a set of C functions which can read and write
biological sequence files formatted using various file formats and which
can be used to perform database searches on biological databases. I
had five main goals in designing this package:

   1. Keep the interface as similar as possible to the C stdio package,
      given the fact that we're dealing with sequences and sequence
      entries instead of characters and lines.
   2. Support as many sequence file formats as possible, and make it
      relatively quick and easy (at least for me) to add new formats.
   3. Handle genome-sized sequences and large database searches,
      so that the file I/O is no longer a time or space bottleneck for a
      program.
   4. Make the package flexible enough so that sequence analysis,
      information retrieval and associating new information with
      sequences (such as the results of the analyses) is easy.
   5. Try to help people do "something else" with the minimum of
      hassle (where "something else" means getting information or
      performing some computation that no one else has provided
      software for).

The package defines a SEQFILE data structure, similar to stdio's FILE
data structure, that is opened and closed and is used to perform the
reading and writing of the files. Sequences and sequence entries are
read or written one at a time, as a stream.

Because of the size and complexity of sequences and entries, internal
SEQIO data structures always retain the last read, or "current",
sequence and its entry. A number of access functions are provided to
return information about that current sequence and entry. So, you
could read the next sequence, examine it, and then use the access
functions to look at the database identifiers for the sequence, and then
get the complete entry text.
This is different from the stdio package,
which simply puts the characters into data structures that you create
and then forgets about those characters.

The package can retrieve a number of pieces of information stored in a
sequence entry, although it can't get everything (yet). In addition to the
sequence and the complete entry text, it can get things like the
database identifiers, the description, the organism name, the entry
creation/update date, and so on. A SEQINFO data structure (listed
below in the section "The SEQINFO Structure") has been
defined to hold all of this information about a sequence or entry. This is
a transparent data structure, unlike the SEQFILE structure, so that you
can access and modify the fields of the structure or create your own
structures. Since the information in the SEQINFO structure is used
along with the sequence when writing sequence entries, creating new
sequence entries is as simple as filling in the fields of a SEQINFO
structure and then passing it and the sequence to the function
`seqfwrite'.

The package can perform database searches using the BIOSEQ
standard. BIOSEQ was developed as a successor to the FASTLIBS
method of the FASTA program for specifying the files to be read.
The user (not you, but the user of your program) creates a BIOSEQ file
with entries describing the various databases, like this one for GenBank:

>genbank,gb: /databases/genbank
>Name: GenBank
>Alphabet: DNA
   gbbct.seq, gbest.seq, gbinv.seq, gbmam.seq, gbpat.seq, gbphg.seq
   gbpri.seq, gbrna.seq, gbrod.seq, gbsts.seq, gbsyn.seq, gbuna.seq
   gbvrl.seq, gbvrt.seq

   bct:(gbbct.seq), est:(gbest.seq), inv:(gbinv.seq), mam:(gbmam.seq)
   pat:(gbpat.seq), phg:(gbphg.seq), pri:(gbpri.seq), rna:(gbrna.seq)
   rod:(gbrod.seq), sts:(gbsts.seq), syn:(gbsyn.seq), una:(gbuna.seq)
   vrl:(gbvrl.seq), vrt:(gbvrt.seq)

The SEQIO package can read this file, and the BIOSEQ-related
procedures in the package (seqfopendb, bioseq_read, bioseq_info and
bioseq_parse) can be used to perform database searches and to
retrieve information about a database or the location of its files. The full
details of the BIOSEQ standard are found in the file "user.doc".

General Comments
================

As you read this file and use the package, here are some issues to keep
in mind:

   1. How are returned values allocated? The SEQIO package either
      returns pointers to internal data structures (which may change
      after the next SEQIO call) or returns malloc'ed buffers which you
      must free to avoid a memory leak.

   2. How are errors handled? The SEQIO package always sets an
      error variable `seqferrno' and error string `seqferrstr' on an
      error. Normally, the package also reports warnings and errors
      on standard error, and will exit the program on fatal errors (like
      running out of memory). Using `seqferrpolicy' though, you can
      keep the package from doing any or all of that (so that it only
      sets the error variables and returns an error return value). And
      using `seqfsetperror', you can replace the package's default
      print error function (which outputs to standard error) with your
      own output function.

   3. File formats are always specified by name, such as "GenBank",
      "FASTA", "Stanford", and so on.

   4. Except for filenames, just about everything is case-insensitive
      (so "GENBANK", "fasta" and "sTAnForD" are acceptable in point
      3).

   5. One deficiency you might find in the package is that there is no
      explicit link between a sequence and the SEQINFO structure for
      that sequence. I could not reconcile the package's ability to
      create new, malloc'ed sequence and SEQINFO buffers, my wish
      to keep the amount of allocated space to a minimum, and any
      mechanism for linking the SEQINFO structure with a sequence.
      You'll have to keep track of the sequences and SEQINFO
      structures, and make sure that you are not mixing them up.

   6. Most of the names in the package should be fairly unique, except
      possibly the predefined constants DNA, RNA, PROTEIN, AMINO
      and UNKNOWN used to specify alphabets. File "quickref.doc"
      gives a complete list of the constants, typedef names and
      function names defined by the package. If these alphabetic
      constants interfere with the constants in your program, just add
      the following lines AFTER including "seqio.h", and then use the
      constants SEQIO_DNA, SEQIO_RNA, SEQIO_PROTEIN, SEQIO_AMINO and
      SEQIO_UNKNOWN when looking at SEQIO's alphabet values.

         #undef DNA
         #undef RNA
         #undef PROTEIN
         #undef AMINO
         #undef UNKNOWN

         #define SEQIO_UNKNOWN 0
         #define SEQIO_DNA 1
         #define SEQIO_RNA 2
         #define SEQIO_PROTEIN 3
         #define SEQIO_AMINO 3

      Or you can just use the constants predefined in the package in
      your program (you should be able to figure out what values they
      have from above).
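
As an illustration of point 6, here is a minimal sketch of code that
uses the renamed constants. The values are the ones listed above; the
helper name `alphabet_label' is made up for this example and is not
part of SEQIO. Note that PROTEIN and AMINO share the value 3, so a
program cannot tell them apart.

```c
/* Values copied from the list above; in a real program these lines
 * would follow the #undef lines placed after #include "seqio.h". */
#define SEQIO_UNKNOWN 0
#define SEQIO_DNA     1
#define SEQIO_RNA     2
#define SEQIO_PROTEIN 3
#define SEQIO_AMINO   3   /* same value as SEQIO_PROTEIN */

/* Hypothetical helper (not a SEQIO function): map an alphabet value,
 * such as the `alphabet' field of a SEQINFO structure, to a label. */
const char *alphabet_label(int alphabet)
{
    switch (alphabet) {
    case SEQIO_DNA:     return "DNA";
    case SEQIO_RNA:     return "RNA";
    case SEQIO_PROTEIN: return "PROTEIN";  /* also covers SEQIO_AMINO */
    default:            return "UNKNOWN";
    }
}
```

Only one `case' can be written for the value 3, which is why the
sketch has no separate branch for SEQIO_AMINO.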



Opening and Closing Files/Database-Searches
*******************************************

Seqfopen
========

     SEQFILE *seqfopen(char *filename, char *mode, char *format)

        o filename - the file to be opened
        o mode - "r", "w" or "a"
        o format - the file format name (optional for reading)

        o returns an open SEQIO file structure (or NULL on error)

The seqfopen function opens a file for reading or writing. It is similar to
stdio's fopen, except it returns a SEQFILE structure and it has an extra
`format' argument. This argument specifies the format of the file to be
read or written. See the files "user.doc" and "format.doc" for a
description of the valid formats.

In addition to normal filenames, the package recognizes several special
characters. First, the string containing only a dash ("-") specifies that
standard input or standard output should be opened instead. Second,
filenames beginning with a `~' are treated the same way as in the Unix
shells (i.e., the `~' is used to refer to home directories). Finally, access
to single entries of the file can be specified using an `@', followed by a
single entry access specification. See "user.doc" for a description of
how to specify single entry access.
(Note: This specification means that seqfopen will not accept filenames
containing an `@' character. It will always assume the `@'
denotes a single entry access specification.)

When reading a file, the `format' argument can be NULL, in which case
seqfopen tries to automatically determine the format of the file. If the file
format is any of the valid formats except the Raw format, seqfopen
should be able to determine the correct format. If it cannot determine
the correct format, the SEQIO package does not fail to open the file. It
simply triggers a warning and opens the file in the Plain format.
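
The special filename forms above can be recognized with a few lines of
string handling. The following is only a sketch of that idea, not the
code seqfopen uses: the helper names `is_stdio_name' and
`split_entry_spec' are hypothetical, and the syntax of the text after
the `@' is defined in "user.doc", so the sketch does not interpret it.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of SEQIO: returns 1 if `name' is the
 * single dash that stands for standard input or standard output. */
int is_stdio_name(const char *name)
{
    return strcmp(name, "-") == 0;
}

/* Hypothetical helper: copy the part of `spec' before the first '@'
 * into `path' (at most pathlen-1 chars, always terminated; pathlen
 * must be at least 1).  Returns a pointer to the text after the '@',
 * or NULL if `spec' contains no '@' or pathlen is 0. */
const char *split_entry_spec(const char *spec, char *path, size_t pathlen)
{
    const char *at = strchr(spec, '@');
    size_t n = at ? (size_t)(at - spec) : strlen(spec);

    if (pathlen == 0)
        return NULL;
    if (n >= pathlen)
        n = pathlen - 1;            /* truncate rather than overflow */
    memcpy(path, spec, n);
    path[n] = '\0';
    return at ? at + 1 : NULL;
}
```

Because any `@' is taken as the start of an entry specification, a
filename that really contains an `@' cannot be expressed, which is the
restriction noted above.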

Seqfopendb
==========

     SEQFILE *seqfopendb(char *dbspec)

        o dbspec - a BIOSEQ database search specification

        o returns an open SEQIO file structure (or NULL on error)

The seqfopendb function opens a SEQFILE structure to perform a
database search. The one argument specifies what database (or part of
a database) to search. All of the sequences in that database (or
database part) can then be read as if they were stored in a single
sequence file. See the file "user.doc" for a description of a valid
BIOSEQ database specification.

This function also looks for five information fields from the BIOSEQ
entry for the database. These fields are not required to occur in an
entry. The "Name" field specifies the name of the database described
by that BIOSEQ entry and is used to distinguish between "official"
databases and just collections of entries. The "IdPrefix" field gives the
identifier prefix to attach to any identifier in an entry which does not
already have a prefix. The "Format" field specifies the file format to pass
to `seqfopen' for each database file read. The "Alphabet" field specifies
the alphabet for each sequence in the database. And finally, the "Index"
field gives the name of the file indexing all of the database's entries
(which is used to randomly access the entries).

Seqfopen2
=========

     SEQFILE *seqfopen2(char *string)

        o string - the filename (if it specifies an existing file) or
          database search specifier (otherwise)

        o returns an open SEQIO file structure (or NULL on error)

The seqfopen2 function is a "combination" function which you can use
to avoid having to decide whether to call seqfopen or seqfopendb. If
the parameter specifies an existing file (tested using the stat system
call), then seqfopen is called (with the second and third arguments of
"r" and NULL, respectively). Otherwise, seqfopendb is called.
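
The decision seqfopen2 makes can be sketched with the POSIX stat call.
This is an illustration under assumptions, not the package's source:
the enum and the function name `classify_open_arg' are invented here,
and the text above does not say whether SEQIO makes any further check
(for example, on the file type) beyond the stat test.

```c
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical names for this sketch; SEQIO does not define them. */
enum open_kind { OPEN_AS_FILE, OPEN_AS_DBSPEC };

enum open_kind classify_open_arg(const char *string)
{
    struct stat st;

    /* stat() returns 0 when the path names something that exists. */
    if (stat(string, &st) == 0)
        return OPEN_AS_FILE;     /* would call seqfopen(string, "r", NULL) */
    return OPEN_AS_DBSPEC;       /* would call seqfopendb(string) */
}
```

One consequence of this order of tests is that a file in the current
directory whose name matches a database name takes precedence over the
database.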

Seqfclose
=========

     void seqfclose(SEQFILE *sfp)

        o sfp - the SEQFILE structure to be closed

        o returns nothing

The seqfclose function closes an open SEQFILE structure, closing any
open FILE structures it uses and freeing up the memory allocated to it.



Reading Sequences/Entries
*************************

Seqfread
========

     int seqfread(SEQFILE *sfp, int flag)

        o sfp - an open SEQFILE structure
        o flag - read the next sequence (if zero) or entry (non-zero)

        o returns 0 on success and -1 on EOF or error

The seqfread function reads the next sequence or entry. The difference
between reading the next sequence and reading the next entry only
appears with formats that contain multiple sequences per entry, such
as the PHYLIP or Clustalw formats. When reading "by sequence", each
sequence of a multiple sequence entry is read and kept as the current
sequence. If seqfread is called with a non-zero `flag' argument, then
the next entry is always read, even if some sequences of the current
entry have not been read.

This function only returns a status value because the current sequence
and entry text are kept in the internal SEQIO data structures. The
function just changes the value of the current sequence and current
entry.

Seqfgetseq, Seqfgetrawseq, Seqfgetentry, Seqfgetinfo
====================================================

     char *seqfgetseq(SEQFILE *sfp, int *length_out, int newbuffer)
     char *seqfgetrawseq(SEQFILE *sfp, int *length_out, int newbuffer)
     char *seqfgetentry(SEQFILE *sfp, int *length_out, int newbuffer)
     SEQINFO *seqfgetinfo(SEQFILE *sfp, int newbuffer)

        o sfp - an open SEQFILE structure
        o length_out - address where the returned string's length is
          stored (if not NULL)
        o newbuffer - malloc a new buffer for the object (if non-zero)
          or return an internal buffer (if zero)

        o returns the sequence/entry text or the SEQINFO structure
          (or NULL on error)

These functions simply call seqfread to read the next sequence or
entry, and then call the access function to return the sequence text,
entry text or sequence information. The seqfgetseq, seqfgetrawseq and
seqfgetinfo functions call seqfread with a `flag' value of 0, while the
seqfgetentry function calls seqfread with a `flag' value of 1. This way,
the search by entry will look at each entry exactly once (it won't
repeatedly look at the same multiple sequence entry), and the search
by sequence and search by information will look at each sequence
exactly once.

See the access functions description next for a more complete
description of the arguments and their use.



Access Functions for the Current Sequence, Entry and Information
****************************************************************

Seqfsequence, Seqfrawseq, Seqfentry, Seqfinfo, Seqfallinfo
==========================================================

     char *seqfsequence(SEQFILE *sfp, int *length_out, int newbuffer)
     char *seqfrawseq(SEQFILE *sfp, int *length_out, int newbuffer)
     char *seqfentry(SEQFILE *sfp, int *length_out, int newbuffer)
     SEQINFO *seqfinfo(SEQFILE *sfp, int newbuffer)
     SEQINFO *seqfallinfo(SEQFILE *sfp, int newbuffer)

        o sfp - an open SEQFILE structure
        o length_out - address where the returned string's length is
          stored (if not NULL)
        o newbuffer - malloc a new buffer for the object (if non-zero)
          or return an internal buffer (if zero)

        o returns the sequence/entry text or the SEQINFO structure
          (or NULL on error)

These functions are the access functions by which the current
sequence text, the current entry text or the information about the
current sequence can be retrieved. The seqfinfo and seqfallinfo
functions are described in the next section.

The seqfsequence, seqfrawseq and seqfentry functions return the
current sequence text, the raw sequence text or the current entry text,
respectively. They also return the length of the text, if the second
argument is an integer variable pointer (such as
`s = seqfsequence(sfp, &len, 0)').

(NOTE: The function seqfrawseq differs from seqfsequence in that it
also includes any alignment or notational characters in its returned
"sequence". Typically, seqfsequence only extracts alphabetic
characters in the sequence lines of the entry, whereas seqfrawseq
extracts all characters except whitespace and digits.)

The returned text is stored either in an internal SEQIO buffer or in a
newly malloc'ed buffer, depending on the value of the `newbuffer'
argument.
With both kinds of buffers, you are permitted to modify or
rewrite the characters of the text as needed (these are NOT read-only
buffers), with two exceptions. First, you may not write past the end of
the returned text, because you would be writing off the end of a
malloc'ed buffer or onto other text stored in an internal SEQIO buffer.
So, you can make the string shorter, but not longer.

Second, if you use the internal buffer, any permanent changes you
make could affect future calls involving that sequence or entry (what is
being returned is the buffer storing the only copy SEQIO has of the
sequence/entry). The idea is to use the internal buffers when the
sequence/entry will only be kept around temporarily and any changes
made are undone when the text is no longer needed, and to use the
malloc'ed buffer for the sequences/entries that must be kept around for
a long time.

The SEQINFO Structure
=====================

The seqfinfo function returns a SEQINFO structure containing various
information about the current sequence and its entry. So, what
information does the SEQINFO structure hold? Here is the C definition:

typedef struct {
  char *dbname, *filename, *format;
  int entryno, seqno, numseqs;

  char *date, *idlist, *description;
  char *comment, *organism, *history;
  int isfragment, iscircular, alphabet;
  int fragstart, truelen, rawlen;
} SEQINFO;

The structure contains six fields which the SEQIO package has about
the current sequence:

dbname
      The name of the database being searched (if this is a search of
      an actual database).
filename
      The name of the file currently being read.
format
      The format of the file (and the current entry).
entryno
      The location of the current entry in the file (if entryno is 10, then
      the current entry is the tenth entry in the file).
seqno
      The location of the current sequence in the current entry (if
      seqno is 3, then the current sequence is the third in the current
      entry).
numseqs
      The number of sequences contained in the current entry.

So, the current sequence's location is the `seqno' sequence of the
`entryno' entry of the file `filename' (possibly of the database
`dbname'). The `format' string gives the entry's format, and the entry
contains `numseqs' sequences.

The other twelve fields are information extracted from the current entry
(see "format.doc" for the details about which information is retrieved
for each file format):

date
      A single date giving the last time the entry was either created or
      updated. Its format should be day-month-year, as in
      31-JAN-1995.
idlist
      The list of identifiers given in the entry. The idlist's form is a
      string containing a vertical-bar-separated list of identifiers, each
      of which consists of an identifier prefix, a ':' and the identifier.
      See file "user.doc" for more information about identifiers and
      identifier prefixes.
description
      A description of the sequence or sequences in the entry. This is
      the "Title" or "Definition" line in some file formats. This string
      should consist of a single "line" of text, although it can be of any
      length. So, no newlines should appear in this text (they are
      removed and added when the description is read from and
      output in the sequence entries).
comment
      A block of text giving a comment about the sequence. The string
      can contain one or more lines of any length. The one restriction
      to the text appearing in a comment is that any block of lines at
      the end of an entry's comment section where each line begins
      with the string "SEQIO" is reserved for other use by the package
      (this block holds extra identifiers or the `history' lines).
organism
      The name of the organism the sequence was taken from. Right
      now, this field can contain any single "line" of text, although I
      would like to standardize the contents of this field. It's on my
      TODO list.
history
      This holds the lines of text placed in the comment section of
      entries which describe previous SEQIO operations on this entry,
      i.e., it holds the history of alterations and updates made to this
      entry by programs using the SEQIO package. Any block of lines
      at the end of a comment section where each line begins with the
      string "SEQIO" is not considered part of the comment, but part
      of the history.
isfragment
      This integer is non-zero if the sequence is a fragment of a larger
      sequence, and zero if the sequence is complete (or if it is not
      known whether the sequence is a fragment).
iscircular
      This integer is non-zero if the sequence is a circular sequence,
      and zero if it is a linear sequence (or if its circularity is not
      known).
alphabet
      This integer is one of the predefined constants DNA, RNA,
      PROTEIN or UNKNOWN. Its value is UNKNOWN unless either
      the database's BIOSEQ entry (information field "Alphabet") or
      the entry itself explicitly specifies the alphabet. The package
      does not try to guess the alphabet.
fragstart
      When the sequence is a fragment of a larger sequence and the
      location of this fragment in the larger sequence is known, this
      value gives the starting position of the fragment. If this value is
      not known (or the sequence is complete), fragstart is set to 0.
truelen
      This is the "true" length of the sequence, i.e., the length of the
      sequence without any gap characters or notational characters.
      Typically, these are just the alphabetic characters.
rawlen
      This is the "raw" length of the sequence, i.e., the length of the
      sequence which includes the gap and notational characters.
      Typically these are all characters except whitespace and digits.

The seqfinfo function fills in as many fields of the SEQINFO structure as
it can, given the information in the entry. If a particular piece of
information could not be found in the entry, that field is set either to
NULL or 0, depending on whether it is a character string or an integer.
(And, yes, I know NULL and 0 are really the same value. I'm talking
about good programming style here.)

The seqfallinfo function also returns a SEQINFO structure and the
information is the same as that returned by seqfinfo, with the exception
of the `comment' field. With seqfallinfo, the entire header for the current
entry is stored in the comment field of the SEQINFO structure. The
meaning of "entire header" is different for the different formats, but the
general idea is that it consists of everything in the entry except the
sequence lines. This can be useful for converting from one format to
another without losing the header lines describing the references and
features of the sequence.

As with seqfsequence and seqfentry, the structure returned by
these two functions is either an internal SEQIO structure or a malloc'ed
structure. However, the malloc'ed structure is a bit more complicated
here, since the SEQINFO structure contains character strings for some
of its fields. What the SEQIO package does is malloc one big buffer in
which it stores the SEQINFO structure and all of the character strings
that its fields point to. So, when you call free to "free" the SEQINFO
structure, you also automatically free all of the character strings. This
means that the same restrictions specified for the returned
sequence/entry text above also apply to these character strings.
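
The "one big buffer" technique described above can be sketched in a
few lines of C. This is not SEQIO's code: MINI_INFO is a cut-down
stand-in for SEQINFO, and the name `mini_info_new' is invented for the
illustration. It shows why a single free releases the strings as well,
and why those strings cannot be lengthened in place.

```c
#include <stdlib.h>
#include <string.h>

/* Cut-down stand-in for SEQINFO with only two string fields. */
typedef struct {
    char *idlist;
    char *description;
    int   truelen;
} MINI_INFO;

/* Allocate the structure and both strings in ONE malloc'ed block, so
 * a single free(info) releases everything.  Returns NULL on failure. */
MINI_INFO *mini_info_new(const char *idlist, const char *desc, int truelen)
{
    size_t idsize = strlen(idlist) + 1;
    size_t descsize = strlen(desc) + 1;
    MINI_INFO *info = malloc(sizeof(MINI_INFO) + idsize + descsize);
    char *strings;

    if (info == NULL)
        return NULL;

    strings = (char *)(info + 1);        /* bytes just past the struct */
    info->idlist = memcpy(strings, idlist, idsize);
    info->description = memcpy(strings + idsize, desc, descsize);
    info->truelen = truelen;
    return info;
}
```

The string fields point into the same block as the structure, so they
must never be passed to free on their own, and writing past the end of
one string would overwrite the next.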

SEQINFO Field Access Functions
==============================

     char *seqfdbname(SEQFILE *sfp, int newbuffer)
     char *seqffilename(SEQFILE *sfp, int newbuffer)
     char *seqfformat(SEQFILE *sfp, int newbuffer)
     int seqfentryno(SEQFILE *sfp)
     int seqfseqno(SEQFILE *sfp)
     int seqfnumseqs(SEQFILE *sfp)
     char *seqfdate(SEQFILE *sfp, int newbuffer)
     char *seqfidlist(SEQFILE *sfp, int newbuffer)
     char *seqfdescription(SEQFILE *sfp, int newbuffer)
     char *seqfcomment(SEQFILE *sfp, int newbuffer)
     char *seqforganism(SEQFILE *sfp, int newbuffer)
     int seqfisfragment(SEQFILE *sfp)
     int seqfiscircular(SEQFILE *sfp)
     int seqfalphabet(SEQFILE *sfp)
     int seqffragstart(SEQFILE *sfp)
     int seqftruelen(SEQFILE *sfp)
     int seqfrawlen(SEQFILE *sfp)

        o sfp - an open SEQFILE structure
        o newbuffer - malloc a new buffer for the object (if non-zero)
          or return an internal buffer (if zero)

        o returns the information string or integer

These are the access functions used to retrieve individual pieces of
information about the current sequence and current entry. Like
seqfsequence and seqfentry, the functions which return strings can
return either an internal SEQIO buffer or a malloc'ed buffer containing
the string (with the same restrictions and requirements on its use).

The functions returning strings all return NULL on an error. The other
access functions all return 0 on an error, even though 0 may be a valid
return value (such as in seqfiscircular). Note that the alphabet
UNKNOWN is defined to be 0.
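
A string such as the one seqfidlist returns has the form given in the
SEQINFO description: identifiers separated by vertical bars, each one a
prefix, a ':' and the identifier. The sketch below walks such a string.
The function `idlist_get' is a hypothetical helper, not part of SEQIO,
and how it treats an item without a ':' is an assumption of this
example (see "user.doc" for the real rules on prefixes).

```c
#include <stddef.h>
#include <string.h>

/* Copy at most dstlen-1 chars of src[0..n) into dst and terminate it. */
static void copy_span(char *dst, size_t dstlen, const char *src, size_t n)
{
    if (dstlen == 0)
        return;
    if (n >= dstlen)
        n = dstlen - 1;
    memcpy(dst, src, n);
    dst[n] = '\0';
}

/* Hypothetical helper: fetch the n-th (0-based) identifier of a
 * "prefix:id|prefix:id|..." string.  Returns 1 and fills `prefix' and
 * `id' on success, 0 if there is no n-th identifier.  An item with no
 * ':' is returned with an empty prefix. */
int idlist_get(const char *idlist, int n,
               char *prefix, size_t prefixlen, char *id, size_t idlen)
{
    const char *start = idlist;

    if (idlist == NULL || n < 0)
        return 0;
    while (*start != '\0') {
        size_t len = strcspn(start, "|");    /* length of this item */
        if (n-- == 0) {
            const char *colon = memchr(start, ':', len);
            if (colon != NULL) {
                size_t plen = (size_t)(colon - start);
                copy_span(prefix, prefixlen, start, plen);
                copy_span(id, idlen, colon + 1, len - plen - 1);
            } else {
                copy_span(prefix, prefixlen, start, 0);
                copy_span(id, idlen, start, len);
            }
            return 1;
        }
        start += len;
        if (*start == '|')
            start++;                         /* skip the separator */
    }
    return 0;
}
```

The helper copies into caller-supplied buffers instead of writing
terminators into the idlist itself, so it is safe to use on an internal
SEQIO buffer without altering it.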

Seqfmainid, Seqfmainacc
=======================

     char *seqfmainid(SEQFILE *sfp, int newbuffer)
     char *seqfmainacc(SEQFILE *sfp, int newbuffer)

        o sfp - an open SEQFILE structure
        o newbuffer - malloc a new buffer for the object (if non-zero)
          or return an internal buffer (if zero)

        o returns the identifier string or NULL

These functions access the idlist information collected from an entry
and return either the main identifier or main accession number
occurring in the entry. The main identifier is considered to be either the
first identifier which is not an accession number, or the first accession
number if no non-accession identifiers are found in the entry. Thus,
seqfmainid returns NULL only when no identifiers could be extracted
from the entry. The main accession number is the first accession number
found.

The string returned by these functions consists of a single identifier,
whose form is the same as each identifier in the idlist (i.e., an identifier
prefix, a `:' and the identifier string). This string will be
NULL-terminated (it is not just a pointer into the idlist). And, as with all
of the other information strings, it can be returned either in an internal
buffer or a newly malloc'ed buffer.

Seqfoneline
===========

     int seqfoneline(SEQINFO *info, char *buffer, int buflen, int idonly)

        o info - a SEQINFO structure
        o buffer - the buffer to store the oneline description
        o buflen - the buffer length
        o idonly - only store an identifier for the sequence

        o returns the length of the string stored in buffer

This function is used to construct a "oneline" description of a
sequence, based on the information stored in the SEQINFO structure.
See "user.doc" for a complete description of the format for a oneline
description. That description is stored in the character array specified
by the "buffer" argument.

The function guarantees that it will fit the oneline description into the
first "buflen-1" characters of buffer, and that the description will always
be NULL-terminated. So, you won't have to check for a buffer overflow
(like you do with function `fgets').

If the "idonly" value is non-zero, then the oneline description will
consist only of a single identifier for the sequence, and the description
will not contain any whitespace. This is useful for constructing short
identifiers for sequences (most notably, the identifiers used in the
PHYLIP, Clustalw and MSF formats).

Seqfsetidpref, Seqfsetdbname, Seqfsetalpha
==========================================

     void seqfsetidpref(SEQFILE *sfp, char *idprefix)
     void seqfsetdbname(SEQFILE *sfp, char *dbname)
     void seqfsetalpha(SEQFILE *sfp, char *alphabet)

        o sfp - a SEQFILE structure open for reading
        o idprefix - the identifier prefix (if not NULL and not empty)
        o dbname - the current database name (if not NULL and not
          empty)
        o alphabet - the string used to determine the alphabet when
          the entry does not specify an alphabet (if not NULL and not
          empty)

        o returns nothing

These are functions which allow you to set the identifier prefix,
database name, and alphabet when reading a sequence file or
performing a database search. The idea is that these functions provide
the same capability as the inclusion of the "Name", "IdPrefix" and
"Alphabet" information fields in the BIOSEQ entry used by a database
search. The use of these values in a database search is described in
the comments for `seqfopendb' above and in file "programr.doc" in the
BIOSEQ Stuff section.

If the second argument to the function is either NULL or an empty
string, then the current value held in the SEQFILE structure is
removed. This gives a way to unset any of those values as needed.

(NOTE: The `alphabet' argument to "seqfsetalpha" is a character string,
not one of the predefined constants RNA, DNA, PROTEIN or
UNKNOWN. The reason for doing that is so that any string specifying a
valid alphabet (i.e., strings like "tRNA", "Peptide", "cDNA",
"pre-mRNA") can be input to the package, but the value set in the
SEQINFO structure is simplified so that programs that just need to
know whether the sequence is DNA, RNA or PROTEIN can just test the
integer SEQINFO field.

One of the things on my TODO list is to add an `alphastr' field to the
SEQINFO structure to return the actual alphabet string that occurs in
the entry or is given to the package.)



Writing Sequences/Entries
*************************

Seqfwrite
=========

     int seqfwrite(SEQFILE *sfp, char *seq, int seqlen, SEQINFO *info)

        o sfp - a SEQFILE structure open for writing
        o seq - the sequence
        o seqlen - the sequence length
        o info - information about the sequence

        o returns 0 on success and -1 on error

The seqfwrite function outputs a sequence entry using the sequence
and information given in the function arguments. Only the twelve "entry
information" fields of the SEQINFO structure (date, mainid, mainacc,
idlist, description, comment, organism, history, isfragment, iscircular,
alphabet, truelen) are used when outputting the entry.

(NOTE: For the ASN.1, PHYLIP and Clustalw formats, remember that it
is important that you call seqfclose to close the file being written, as the
package performs some output during that close operation.)
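
Since SEQINFO is a transparent structure, a program can build one by
hand before calling seqfwrite. The following sketch shows one cautious
way to do that. The structure definition is repeated from "The SEQINFO
Structure" only so that the example compiles on its own (a real program
gets it from "seqio.h"), the helper name `fill_info' is invented, and
which fields a given output format actually needs is not covered here.

```c
#include <string.h>

/* Copied from "The SEQINFO Structure" so this sketch is self-contained;
 * with the real package, include "seqio.h" instead. */
typedef struct {
    char *dbname, *filename, *format;
    int entryno, seqno, numseqs;

    char *date, *idlist, *description;
    char *comment, *organism, *history;
    int isfragment, iscircular, alphabet;
    int fragstart, truelen, rawlen;
} SEQINFO;

/* Hypothetical helper: clear every field, then set a few of them.
 * Unset strings stay NULL and unset integers stay 0, the same values
 * seqfinfo uses for information it could not find. */
void fill_info(SEQINFO *info, char *idlist, char *description,
               int alphabet)
{
    static const SEQINFO empty = { 0 };   /* NULL strings, zero ints */

    *info = empty;
    info->idlist = idlist;
    info->description = description;
    info->alphabet = alphabet;
}

/* With the real library, the entry would then be written with
 *     seqfwrite(sfp, seq, seqlen, &info)
 * where sfp was opened for writing; 0 means success, -1 an error. */
```

Starting from an all-zero structure avoids handing seqfwrite
uninitialized pointers in the fields the program does not care about.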

Seqfconvert
===========

     int seqfconvert(SEQFILE *input_sfp, SEQFILE *output_sfp)

        o input_sfp - a SEQFILE structure open for reading
        o output_sfp - a SEQFILE structure open for writing

        o returns 0 on success and -1 on error

The seqfconvert function retrieves the sequence and information about
the current sequence of `input_sfp' and then calls seqfwrite with
`output_sfp', the sequence and the information.

Seqfputs
========

     int seqfputs(SEQFILE *sfp, char *s, int len)

        o sfp - a SEQFILE structure open for writing
        o s - the string to output
        o len - the number of chars to output (or 0, specifying to
          output to the end of s)

        o returns 0 on success and -1 on error

The seqfputs function outputs a string on the output stream opened for
the SEQFILE structure. The purpose of this function is to give a way to
mix the output of complete entries with the output of entries produced
by seqfwrite or seqfannotate. Thus, programs can take differently
formatted entries as input (e.g., a combination of GenBank, FASTA and
EMBL entries), and transform them into a single output file format
without losing any information in input entries whose format matches
the output format (e.g., output all in EMBL format using seqfwrite on the
GenBank and FASTA entries and seqfputs on the EMBL entries).

The function makes no checks on the string to ensure that the output
consists of a single format. When using this function, you must ensure
that the output consists of complete entries of a single format (since
you can use seqfputs to output entries line by line).

The `len' argument either specifies the number of characters to output
(if non-zero), or that the complete string should be output (if zero). If
the length is non-zero, exactly that many characters will be written to
the output.
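
The `len' convention can be illustrated with a stand-in written
against a plain stdio stream. This is NOT seqfputs and does not use a
SEQFILE; the name `put_chars' is hypothetical, and rejecting a negative
length is a choice made for this sketch, since the text above does not
say how seqfputs treats one.

```c
#include <stdio.h>
#include <string.h>

/* Stand-in for the `len' convention described above: len == 0 writes
 * all of s, otherwise exactly len characters are written.
 * Returns 0 on success and -1 on error. */
int put_chars(FILE *fp, const char *s, int len)
{
    size_t n;

    if (fp == NULL || s == NULL || len < 0)
        return -1;
    n = (len == 0) ? strlen(s) : (size_t)len;
    return fwrite(s, 1, n, fp) == n ? 0 : -1;
}
```

Because a non-zero length is taken at face value, the caller has to
make sure that s really holds at least that many characters.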

Seqfannotate
============

    int seqfannotate(SEQFILE *sfp, char *entry, int entrylen,
                     char *newcomment, int flag)

     o sfp - a SEQFILE structure open for writing
     o entry - the entry text to output
     o entrylen - the length of the entry text
     o newcomment - the comment to add to the entry
     o flag - remove existing comments (if zero) or append the
       new comment (if non-zero)

     o returns 0 on success and -1 on error

The seqfannotate function adds extra comment text to an entry as it
outputs the entry. This way, you can insert new information into an
entry without losing any of the information in the entry. The new text
will be output as part of the entry's comment, so that when the output
entry is read again, seqfcomment can be used to access the inserted
text. Also, the `flag' argument can be set to strip out any existing
comments, so that when the output entry is read, the string returned by
seqfcomment is just that inserted text. This should provide an easy
method for running a number of experiments on sequences, while
storing the results of those experiments in the sequences' entries.

The format for the entry must be the same as the format specified when
the SEQFILE structure was opened for writing. Any mismatch in the two
formats will result in a parse error.

See file "programr.doc" for a description of how this function can be
used.

Seqfgcgify
==========

    int seqfgcgify(SEQFILE *sfp, char *entry, int entrylen)

     o sfp - a SEQFILE structure open for writing
     o entry - the entry text to convert to the GCG format
     o entrylen - the length of the entry text

     o returns 0 on success and -1 on error

This function converts an entry from its non-GCG format into its GCG
format, without losing any of the header line information.
This operation will work only for the GenBank, PIR, EMBL, Swiss-Prot,
FASTA, NBRF or IG/Stanford formats. Also, the SEQFILE structure must
have been opened either with the generic "GCG" format, or the "GCG-*"
format matching the format of the entry. Any mismatch in formats will
result in a parse error.

Seqfungcgify
============

    int seqfungcgify(SEQFILE *sfp, char *entry, int entrylen)

     o sfp - a SEQFILE structure open for writing
     o entry - the entry text to convert from the GCG format
     o entrylen - the length of the entry text

     o returns 0 on success and -1 on error

This function converts an entry from its GCG format into its non-GCG
format, without losing any of the header line information. This operation
will work only for the GenBank, PIR, EMBL, Swiss-Prot, FASTA, NBRF
or IG/Stanford formats. Also, the SEQFILE structure must have been
opened with the non-GCG format, and the format of the entry must be
the corresponding "GCG-*" format. Any mismatch in formats will result
in a parse error.



BIOSEQ Database Functions
*************************

Bioseq_read
===========

    int bioseq_read(char *filelist)

     o filelist - a comma separated list of files (must be BIOSEQ
       files)

     o returns 0 on success and -1 on error

The bioseq_read function reads all of the files in the comma separated
list of files and adds the BIOSEQ entries it reads to the internal list of
entries. These new entries are added to the front of the list, so that the
newer entries will always be found before the older entries. Thus, you
should call bioseq_read with the files in increasing priority (the entries
you want examined first should be the last entries added).

By default, the files specified by the environment variable "BIOSEQ" (if
it exists) are always the first files read in (this is done automatically
before any bioseq_* function is processed).
If this does not suit your priority scheme, simply call bioseq_read with
the "BIOSEQ" environment variable value in its proper position in the
priority scheme. Those entries will override the previously read entries.

(Note: This function has the capability to read a complete comma
separated list of files, but the typical use of this function will probably
be to read a single file, except when handling the "BIOSEQ"
environment variable.)

Bioseq_check
============

    int bioseq_check(char *dbspec)

     o dbspec - a database search specifier

     o returns non-zero if the string refers to a known database,
       or returns zero otherwise

The bioseq_check function can be used to test whether a database
search specification refers to a database known to the package (i.e., if a
BIOSEQ entry exists for that database).

Bioseq_info
===========

    char *bioseq_info(char *dbspec, char *fieldname)

     o dbspec - a database search specifier
     o fieldname - the name of the information field to be returned

     o returns the text for that field of the BIOSEQ entry for that
       database.
       (NOTE: the returned string buffer is a malloc'ed buffer, and
       it must be freed by you.)

The bioseq_info function is used to retrieve an information field from a
BIOSEQ entry. The file "user.doc" describes the BIOSEQ entry format
and how information fields can be added to an entry. The returned text
is always stored in a malloc'ed buffer which you must free after you've
finished using it.

Note that the `dbspec' argument does NOT have to be a simple
database name, but can be a full database search specifier. This is so
you can use the same specification to start a database search and get
information about the database being searched, without having to
explicitly extract the database name from the specification.

Bioseq_matchinfo
================

    char *bioseq_matchinfo(char *fieldname, char *fieldvalue)

     o fieldname - the name of an information field
     o fieldvalue - the value that the information field should have

     o returns the name of a database.
       (NOTE: the returned string buffer is a malloc'ed buffer, and
       it must be freed by you.)

The bioseq_matchinfo function is used to determine which database
contains an information field with a specific value. (The package uses it
to find the database corresponding to a particular identifier prefix, as in
`bioseq_matchinfo("IdPrefix", "ec")'.)

The function traverses the list of BIOSEQ entries, looking for the first
one which has an information field that matches both the fieldname and
fieldvalue. The fieldname and fieldvalue matching are both
case-insensitive, and any whitespace separating the fieldname and the
fieldvalue is ignored. So, the call above would match the information
line `>IDPREFIX: EC', regardless of how many spaces separate the `:'
and `EC'.

Bioseq_parse
============

    char *bioseq_parse(char *dbspec)

     o dbspec - a database search specifier

     o returns the list of files in a string where each file is
       terminated by a newline character and the whole string is
       terminated by a NUL ('\0') character.
       (NOTE: the returned string buffer is a malloc'ed buffer, and
       it must be freed by you.)

The bioseq_parse function parses a BIOSEQ database search
specification and determines which files of the database need to be
searched. That list of files is then returned in a malloc'ed buffer which
you must free. The returned string terminates each filename by a
newline character (even the last file in the list), and the list of filenames
is ended by a NUL character (as with all strings).
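
The list format returned by bioseq_parse can be walked with a few lines
of C. The sketch below is an illustration only and is not part of the
SEQIO package: `for_each_file' and the `file_fn' callback type are
hypothetical names, and the input is assumed to follow the format
described above (every filename terminated by a newline). It briefly
overwrites each newline with a NUL so the callback sees an ordinary C
string, so the buffer must be writable.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type: receives one filename at a time. */
typedef void (*file_fn)(const char *filename, void *ctx);

/*
 * Walk a list in the format bioseq_parse is documented to return:
 * each filename is terminated by '\n' and the list by '\0'.
 * Returns the number of filenames visited.  `list' must be writable.
 */
int for_each_file(char *list, file_fn fn, void *ctx)
{
    int count = 0;
    char *s = list, *nl;

    if (list == NULL)
        return 0;

    while ((nl = strchr(s, '\n')) != NULL) {
        *nl = '\0';              /* make the filename a C string */
        if (fn != NULL)
            fn(s, ctx);
        *nl = '\n';              /* restore the buffer */
        count++;
        s = nl + 1;
    }
    return count;
}
```

Assuming the bioseq_parse call succeeded, the returned buffer could be
passed straight to this helper and then released with free once the walk
is finished.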



Miscellaneous Functions
***********************

Seqfisafile
===========

    int seqfisafile(char *filename)

     o filename - a filename (with a possible "@..." single entry
       access specification)

     o returns non-zero (if the filename refers to an existing file)
       or zero (if not)

The seqfisafile function can be used to test whether a user-given
filename (which may contain a single entry access specification in
addition to the actual filename) refers to an existing file. It first checks
for the single entry access specification (by looking for a `@'), and then
checks the prefix to see if it refers to an existing file.

What the function does not do is parse the single entry access
specification (if it's there). So, a non-zero return value does NOT mean
that seqfopen will succeed in opening the file.

Seqfisaformat
=============

    int seqfisaformat(char *format)

     o format - a file format string

     o returns non-zero (if the string is a valid file format) or zero
       (if not)

The seqfisaformat function can be used to test whether a string
specifies a valid file format or not. It checks the given string against all
of the valid format names and returns a non-zero/zero value telling
whether a match occurred.

Seqffmttype
===========

    int seqffmttype(char *format)

     o format - a file format string

     o returns the format type or T_INVFORMAT (for an invalid
       format)

The seqffmttype function returns some type information about the
given format. The possible returned types are T_SEQONLY,
T_DATABANK, T_GENERAL, T_LIMITED, T_ALIGNMENT and T_OUTPUT.
See file "format.doc" for more information about what these types
mean.
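
The two-step check described for seqfisafile (find the `@', then test
the prefix) can be pictured with the sketch below. It is an illustration
under stated assumptions, not the package's code: the helper names are
hypothetical, the first `@' is assumed to start the access
specification, and "existing" is approximated by "can be opened for
reading with fopen".

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Length of the plain filename part of `spec': everything before the
 * first '@', or the whole string if there is no '@'.
 * (Assumption: the first '@' starts the access specification.)
 */
size_t filename_prefix_len(const char *spec)
{
    const char *at = strchr(spec, '@');
    return (at != NULL) ? (size_t)(at - spec) : strlen(spec);
}

/*
 * Return non-zero if the filename part of `spec' can be opened for
 * reading.  The "@..." part is ignored, not parsed.
 */
int prefix_is_readable_file(const char *spec)
{
    size_t len = filename_prefix_len(spec);
    char *name = malloc(len + 1);
    FILE *fp;

    if (name == NULL)
        return 0;
    memcpy(name, spec, len);
    name[len] = '\0';

    fp = fopen(name, "r");
    free(name);
    if (fp == NULL)
        return 0;
    fclose(fp);
    return 1;
}
```

As with seqfisafile itself, a non-zero result here says nothing about
whether the part after the `@' is a valid access specification.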

Seqfcanwrite
============

    int seqfcanwrite(char *format)

     o format - a file format string

     o returns non-zero (if that format is writeable) or zero (if not)

The seqfcanwrite function tells whether or not the given file format has
writing capabilities. Currently, every file format except FASTA-output
has this capability.

Seqfcanannotate
===============

    int seqfcanannotate(char *format)

     o format - a file format string

     o returns non-zero (if the format's entries can be annotated)
       or zero (if not)

The seqfcanannotate function tells whether or not the given file format
has annotation capabilities. Currently, the file formats GenBank, EMBL,
Swiss-Prot, PIR, FASTA, NBRF, IG/Stanford and ASN.1 do have this
capability, and the formats Raw, Plain, FASTA-old, NBRF-old, IG-old,
PHYLIP, Clustalw and FASTA-output do not.

Seqfcangcgify
=============

    int seqfcangcgify(char *format)

     o format - a file format string

     o returns non-zero (if the format's entries can be
       gcgified/ungcgified) or zero (if not)

The seqfcangcgify function tells whether or not the given file format can
be converted from and to its GCG format. Currently, the file formats
GenBank, EMBL, Swiss-Prot, PIR, FASTA, FASTA-old, NBRF,
NBRF-old, IG/Stanford and IG/Stanford-old have this capability, and
the other formats do not.

Seqfbytepos
===========

    int seqfbytepos(SEQFILE *sfp)

     o sfp - a SEQFILE structure open for reading

     o returns a byte position, or -1 on error

The seqfbytepos function returns the byte position in the current file of
the beginning of the current entry. If no current entry exists (because
of a previous error or because EOF was reached), the function returns
-1.
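
A byte position like the one seqfbytepos returns is the kind of value a
program can save in its own index and later hand to stdio to jump
straight back to an entry. The sketch below shows only that second half,
using plain stdio and no SEQIO calls. `read_at' is a hypothetical
helper, and it assumes that the saved position fits in a long and that
the caller already knows how many bytes it wants.

```c
#include <stdio.h>

/*
 * Seek to byte offset `pos' in `fp' and read up to `n' bytes into `buf'.
 * Returns the number of bytes read, or -1 if the arguments are bad or
 * the seek fails.
 */
long read_at(FILE *fp, long pos, char *buf, size_t n)
{
    if (fp == NULL || buf == NULL || pos < 0)
        return -1;
    if (fseek(fp, pos, SEEK_SET) != 0)
        return -1;
    return (long)fread(buf, 1, n, fp);
}
```

The check for a negative position matters here, since -1 is exactly what
seqfbytepos reports when there is no current entry.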

Seqfsetpretty
=============

    void seqfsetpretty(SEQFILE *sfp, int value)

     o sfp - a SEQFILE structure open for writing
     o value - either non-zero or zero

     o returns nothing

The seqfsetpretty function tells whether the output operations should
add some whitespace to make the sequence look prettier. When
outputting in the Plain, FASTA, NBRF or IG/Stanford formats (and their
variants), the putseq operation looks at the sequence being output,
and may add whitespace to the outputted sequence to make it look
prettier.

By default, the extra spaces are added when the sequence is DNA, RNA
or Protein and when there are no non-alphabetic characters in the
sequence (such as alignment characters).

Seqfparseent
============

    SEQINFO *seqfparseent(char *entry, int entrylen, char *format)

     o entry - the text of an entry
     o entrylen - the length of the entry
     o format - the format of the entry

     o returns a malloc'ed SEQINFO structure containing the
       information about the entry.
       (NOTE: This structure must be freed by you.)

The seqfparseent function parses an entry and constructs a SEQINFO
structure containing the information occurring in the entry. It could be
useful if you read in entries on your own, but still want to retrieve the
information stored in them.

(NOTE: This function cannot parse entries in the PHYLIP, Clustalw or
FASTA-output format. An error is triggered if you call seqfparseent
with an entry in one of these formats.)

Asn_parse
=========

    int asn_parse(char *begin, char *end, ...)

     o begin - the beginning of the ASN.1 text
     o end - the end of the ASN.1 text (i.e., the last character of
       the ASN.1 text is at `end-1')
     o ...
       - a NULL terminated list of arguments specifying the
       sub-records to be searched and the variables to store the
       beginning and end positions of the sub-record text.
       (NOTE: These arguments must be given in groups of 3 until
       the NULL termination, such as in

           "seq.id.genbank", &gbstart, &gbend,
           "seq.descr", &destart, &deend,
           NULL

       The format for each triple is

           char *subrecord, char **begin_out, char **end_out

       and either begin_out or end_out can be NULL.)

     o returns a count of the number of sub-records found or a
       -1 on error

Since I found correctly parsing the ASN.1 text a much harder task than
the line oriented file formats, I've added this internal function to the
interface so that you might find it easier to move through the ASN.1
records. The description below assumes that you are familiar with the
ASN.1 text format, and in particular the structure of
"Bioseq-set.Seq-set.seq" records in the ASN.1 Bioseq-set hierarchy
defined in the NCBI toolkit. The function itself can work with any
correctly formatted ASN.1 text, however.

The asn_parse function takes a piece of ASN.1 text (from `begin' to
`end') that specifies one or more records in a hierarchy (i.e., you
need not specify the top-most record, but the beginning of the text
must be at the beginning of some record in the hierarchy and every
record whose beginning starts inside the text must be complete). The
arguments that you give after `end' specify the sub-records you are
interested in, along with pointer variables which will be set to the
beginning and end of those sub-records, if they are found.

The strings naming the desired sub-records should match the
structure of sub-records in the text's hierarchy. In most cases, the
string is simply a list of the initial keywords of the sub-records
separated by periods.
However, when a sub-record does not have an initial keyword, but
begins with an open brace, that open brace must be included in the
sub-record identifier string. One example of this in the Bioseq
hierarchy is the Bioseq-set.seq-set.seq.annot records (here is an
example):

annot {
  {
    data
      ftable {
        {
          data
            prot {
              name {
                "nifS protein" } } ,
          location
            whole
              gi 77963 } } } }

To access sub-records in this record, the strings must appear as
"annot.{.data" or "annot.{.data.ftable.{.location". The open braces after
keywords are not specified, but braces without keywords are. Also,
note that in this example, the keywords "prot", "whole" and "gi" are
NOT initial keywords recognized by asn_parse, because they do not
appear at the beginning of a sub-record (i.e., after an open brace
starting a sub-record or after a comma separating sub-records).

In the parameter description above (the `...' description), I'm assuming
that the user gives asn_parse the text for a Bioseq `seq' record (see
the NCBI toolkit documentation for the structure of this record) and
wants to get the "id.genbank" and "descr" sub-records. As another
example, here is an excerpt from the SEQIO function implementing the
seqfinfo operation for the ASN.1 file format. The first part of it looks for
identifiers and accession numbers, and the second looks for
comments.

    /*
     * Find the "id" and "descr" sub-records in the "seq" record.
     */
    idstr = destr = NULL;
    status = asn_parse(entry, entry + entrylen,
                       "seq.id", &idstr, &idend,
                       "seq.descr", &destr, &deend,
                       NULL);
    if (status == -1) {
        /* error handling code here */
    }

    /*
     * If there was an "id" sub-record, look for all of the possible
     * sub-records that specify database identifiers or accession
     * numbers.
     */
    if (idstr != NULL) {
        pirname = spname = gbname = emname = otname = prfname = dbjname = NULL;
        piracc = spacc = gbacc = emacc = otacc = prfacc = dbjacc = NULL;
        pdbmol = gistr = giim = gibbs = gibbm = NULL;
        status = asn_parse(idstr, idend,
                           "id.pir.name", &pirname, &pnend,
                           "id.swissprot.name", &spname, &spend,
                           "id.genbank.name", &gbname, &gbend,
                           "id.embl.name", &emname, &emend,
                           "id.ddbj.name", &dbjname, &dbjend,
                           "id.prf.name", &prfname, &prfend,
                           "id.other.name", &otname, &otend,
                           "id.pir.accession", &piracc, &paend,
                           "id.swissprot.accession", &spacc, &spaend,
                           "id.genbank.accession", &gbacc, &gbaend,
                           "id.embl.accession", &emacc, &emaend,
                           "id.ddbj.accession", &dbjacc, &dbjaend,
                           "id.prf.accession", &prfacc, &prfaend,
                           "id.other.accession", &otacc, &otaend,
                           "id.pdb.mol", &pdbmol, &pmend,
                           "id.gi", &gistr, &gisend,
                           "id.giim.id", &giim, &giimend,
                           "id.gibbsq", &gibbs, &gibbsend,
                           "id.gibbmt", &gibbm, &gibbmend,
                           NULL);
        if (status == -1) {
            /* error handling code here */
        }

        /*
         * If the PIR identifier is found, extract it from the string
         * pirname+4 (+4 to skip the "name" keyword) to pnend using
         * internal function `add_id'.
         */
        if (pirname != NULL)
            add_id(&info, "pir", pirname+4, pnend);

        ...
    }

    /*
     * If the "descr" sub-record exists, look for a "comment"
     * sub-record.
     */
    if (destr != NULL) {
        comment = NULL;
        status = asn_parse(destr, deend,
                           "descr.comment", &comment, &cmend,
                           NULL);
        if (status == -1) {
            /* error handling code here */
        }

        /*
         * If that first comment was found, use internal function
         * `add_comment' to add it to the SEQINFO structure, and then
         * look for other comments in the "descr" record, since there
         * can be more than one "comment" sub-record.
         *
         * Note the first two arguments to the asn_parse call are
         * `cmend+1' and `deend'. So, these searches start from just
         * after the end of the last found comment and run until the
         * next "comment" record is found, or until the end of the
         * "descr" record.
         */
        if (comment != NULL) {
            add_comment(&info, comment+7, cmend, 1);
            while (asn_parse(cmend+1, deend,
                             "comment", &comment, &cmend,
                             NULL) == 1)
                add_comment(&info, comment+7, cmend, 1);
        }
    }

A couple of notes. First, if both the beginning and end pointer variables
are specified (such as `&comment' and `&cmend' above), then either
both variables will be set to a value if the sub-record is found, or both
will be left unchanged if the sub-record is not found. Thus, in the
example above, I only needed to set and test whether the beginning
pointer variable was NULL, and did not need to worry about the value
of the end pointer variable (if the beginning variable had been changed,
then I was guaranteed that the end variable had also been changed).

Second, the function returns the very beginning of the record it is
searching for, including the initial keyword starting the record. So,
when using the beginning and end variable values from one search as
the `begin' and `end' parameters to another search, the record
specifications for that second search must begin with that initial
keyword. In the example above, the first search looked for "seq.id" and
"seq.descr", and the later searches then looked for "id.pir.name" and
"descr.comment".

Finally, the return value of the function is the count of the number of
sub-records found, or a -1 if a parse error occurs while scanning
the text.
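
Because asn_parse hands back begin/end pointers into the original text
rather than NUL-terminated strings, a small helper for copying a span
out can be handy. The sketch below is an illustration only: `span_dup'
is a hypothetical name, not a SEQIO function, and it assumes that the
end pointer follows the same convention as the `end' argument (it points
one past the last character of the span). The `skip' argument plays the
role of the `+4' and `+7' offsets in the excerpt above.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Copy the text in [begin, end) into a new NUL-terminated string,
 * skipping the first `skip' characters (e.g. 7 for "comment").
 * Returns a malloc'ed string the caller must free, or NULL on
 * bad arguments or allocation failure.
 */
char *span_dup(const char *begin, const char *end, size_t skip)
{
    size_t len;
    char *copy;

    if (begin == NULL || end == NULL || end < begin)
        return NULL;
    len = (size_t)(end - begin);
    if (skip > len)
        return NULL;

    copy = malloc(len - skip + 1);
    if (copy == NULL)
        return NULL;
    memcpy(copy, begin + skip, len - skip);
    copy[len - skip] = '\0';
    return copy;
}
```

The copy still contains whatever whitespace and quoting followed the
keyword, so a real program would trim it further before use.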



Error Handling/Reporting
************************

Seqferrno, Seqferrstr
=====================

    extern int seqferrno;
    extern char seqferrstr[];

These are variables which are set whenever an error occurs in the
SEQIO package. seqferrno gives a numerical error code similar to Unix's
errno, and seqferrstr holds the error string that the SEQIO package
would have output (or perhaps did output depending on the error
policy).

The values that seqferrno can have are as follows:

E_EOF (-1)
     Reached the end of the file or database search.
E_NOERROR (0)
     There is no error (the default value).
E_OPENFAILED (1)
     An error occurred when opening a file.
E_READFAILED (2)
     An error occurred when reading a file.
E_NOMEMORY (3)
     Ran out of memory.
E_PROGRAMERROR (4)
     A bug was detected in the SEQIO package itself.
E_PREVERROR (5)
     A previous error occurred which does not permit the current
     operation to succeed.
E_PARAMERROR (6)
     An invalid parameter was passed to a SEQIO function.
E_INVFORMAT (7)
     An unknown file format was specified as the argument to a
     function (like seqfopen).
E_DETFAILED (8)
     The format of a file could not be automatically determined from
     its contents.
E_PARSEERROR (9)
     A parse error occurred while scanning a file.
E_DBPARSEERROR (10)
     A parse error occurred while scanning a BIOSEQ entry or a
     database search specification.
E_DBFILEERROR (11)
     The BIOSEQ entry for a database specifies some files which
     could not be located or opened.
E_NOSEQ (12)
     A sequence entry does not contain a sequence. (This error only
     occurs if the sequence is actually requested, not for every entry
     without a sequence.)
E_DIFFLENGTH (13)
     There is a discrepancy between the length of a sequence, as
     specified in the sequence entry, and the number of characters
     found for the sequence.
E_INVINFO (14)
     When outputting an entry, one of the fields in the SEQINFO
     structure used to generate the output contains invalid
     information (i.e., it is not formatted correctly).
E_FILEERROR (15)
     A parse error occurred while parsing a single entry access
     specification that was given with a filename to seqfopen.

Seqfperror
==========

    void seqfperror(char *s)

     o s - a string (usually the program name) to be printed
       before the error string

     o returns nothing

The seqfperror function is similar to Unix's perror function. It
outputs to standard error the text of the last error message (i.e., the
contents of seqferrstr). If the argument to the function is not NULL,
then that argument string and the string ": " are first output. This
argument is typically used to output the program name before the error
message.

Seqfsetperror
=============

    void seqfsetperror(void (*perr_fn)(char *))

     o perr_fn - a void function that takes a string as its
       argument

     o returns nothing

The seqfsetperror function can be used to replace the default method
the SEQIO package has for reporting errors. When an error is detected
and the error policy is such that an error message should be output, a
default print error function is used to output the error message to
stderr. This function resets the print error function to either one of
your own choosing (if the argument is not NULL), or back to the default
print error function (if the argument is NULL).
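
A replacement print error function only has to match the
`void (*)(char *)' signature shown above. The sketch below keeps the
most recent message in a buffer instead of writing it to stderr, which
might suit a program with its own logging or a graphical interface. All
of the names in it (`capture_error', `captured_error',
`captured_error_count', LAST_ERROR_MAX) are hypothetical, and it makes
no SEQIO calls itself; installing it would be a matter of calling
`seqfsetperror(capture_error)'.

```c
#include <string.h>

#define LAST_ERROR_MAX 512

static char last_error[LAST_ERROR_MAX];   /* most recent message */
static int  error_count;                  /* messages seen so far */

/* Has the signature seqfsetperror expects: void fn(char *). */
void capture_error(char *msg)
{
    if (msg == NULL)
        msg = "";
    /* Truncate rather than overflow if the message is too long. */
    strncpy(last_error, msg, LAST_ERROR_MAX - 1);
    last_error[LAST_ERROR_MAX - 1] = '\0';
    error_count++;
}

/* Accessors so the rest of the program need not touch the statics. */
const char *captured_error(void) { return last_error; }
int captured_error_count(void)   { return error_count; }
```

Passing NULL to seqfsetperror afterwards restores the default
behaviour, as described above.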

Seqferrpolicy
=============

    int seqferrpolicy(int pe)

     o pe - sets the error policy

     o returns the old error policy

The seqferrpolicy function sets an error policy for the SEQIO package to
follow when it detects that an error has occurred. The SEQIO package
detects three kinds of errors:

  1. Warnings - when the package detects that something is wrong,
     but can still complete the current operation and return an actual
     value.

     Examples of warnings are E_DIFFLENGTH, where a sequence is
     found but its length may be incorrect, or E_DETFAILED, where
     the file format can't be determined and the Plain format is used.

     On a warning, an error message is output.

     The warning errno values are: E_DETFAILED,
     E_DBFILEERROR (sometimes), E_DIFFLENGTH.

  2. Errors - when the package is unable to complete the current
     operation because of some failure, but this failure does not
     adversely affect the state of the SEQFILE structures (so the
     SEQIO package can keep going in later calls).

     Examples of errors are E_PARAMERROR, when an invalid
     parameter is given to a function, or E_NOSEQ, where no
     sequence can be returned since no sequence occurs in the
     current entry.

     The one variation on the handling of these errors (this operation
     cannot finish, but future operations using that SEQFILE
     structure are okay) is when an E_READFAILED or
     E_PARSEERROR is detected while reading a file. In those cases
     (i.e., on the first parse error or when the file can't be read), the
     package stops reading that file and returns an error result.
     However, what happens to the next call to read the SEQFILE
     structure depends on whether seqfopen was called to read a
     single file or whether seqfopendb was called to read a number of
     files.
     If seqfopen was called, the next call to read another entry or
     sequence will get an EOF signal, whereas when a database is
     being read, the package will move on to the next file in the
     database and attempt to read the entries from there. So, the
     result of the next call to seqfread, seqfgetseq, seqfgetentry or
     seqfgetinfo after a read/parse error depends on whether there
     are any more files to read.

     On an error, an error message is output and an error value is
     returned as the result of the called function.

     The error errno values are: E_EOF, E_OPENFAILED,
     E_READFAILED, E_PREVERROR, E_PARAMERROR,
     E_INVFORMAT, E_PARSEERROR, E_DBPARSEERROR,
     E_DBFILEERROR (sometimes), E_NOSEQ, E_FILEERROR.

  3. Fatal errors - when the error leaves the SEQFILE structure (or
     the SEQIO package) in an unrecoverable state.

     No more operations can be performed using that SEQFILE
     structure, and you may only call seqfclose to close the structure
     (if the program does not exit on the error). If further calls are
     made, the error status E_PREVERROR is always returned.

     Examples of fatal errors are E_NOMEMORY, where the package
     runs out of memory, or E_PROGRAMERROR, when an internal
     program bug is detected.

     On a fatal error, an error message is output, the system function
     `exit' is called (unless disabled using seqferrpolicy), and an
     error value is returned as the result of the function (if the exit call
     is disabled).

     The fatal errno values are: E_NOMEMORY,
     E_PROGRAMERROR.

How these errors are handled can be determined by which of the
following predefined constants is passed to seqferrpolicy:

PE_NONE
     Disable all warning/error/fatal message output and all calls to
     exit. So, only seqferrno and seqferrstr are set on an error (and
     possibly an error value is returned).
PE_WARNONLY
     Allow warning messages to be printed, but disable error/fatal
     messages and calls to exit.
PE_ERRONLY
     Allow error/fatal messages, but disable warning messages and
     calls to exit.
PE_NOWARN
     Allow error/fatal messages and calls to exit, but disable warning
     messages.
PE_NOEXIT
     Allow all message output, but disable calls to exit.
PE_ALL
     Allow all message output and calls to exit.

The default policy is PE_ALL, where all output is performed and fatal
errors cause the SEQIO package to call exit.

The function returns the old error policy as its return value. So, if you
want to turn off all error reporting for a SEQIO function call, you can do
the following:

    old_pe = seqferrpolicy(PE_NONE);

    /* Do the SEQIO function call */

    seqferrpolicy(old_pe);

    if (seqferrno != E_NOERROR && seqferrno != E_EOF) {
        /* Handle the error/warning that occurred during the function call */
    }


James R. Knight, knight@cs.ucdavis.edu
June 26, 1996