1The \eslmod{buffer} module provides an abstract layer for building 2input parsers. Different types of input -- including files, standard 3input, piped output from executed commands, C strings, and raw memory 4-- can be handled efficiently in a single API and a single object, an 5\ccode{ESL\_BUFFER}. 6%The API is summarized in Table~\ref{tbl:buffer_api}. 7 8The main rationale for \eslmod{buffer} is to enable multipass parsing 9of any input, even a nonrewindable stream or pipe. A canonical problem 10in sequence file parsing is that we need to know both the format ( 11FASTA or Genbank, for instance) and the alphabet (protein or nucleic 12acid, for instance) in order to parse Easel-digitized sequence data 13records. To write ``smart'' parsers that automagically determine the 14file format and alphabet, so programs work transparently on lots of 15different file types without users needing to specify them, we need 16three-pass parsing: one pass to read raw data and determine the 17format, a second pass to parse the format for sequence data and 18determine its alphabet, and finally the actual parsing of digitized 19sequences. Multiple pass parsing of a nonrewindable stream, such as 20standard input or the output of a \ccode{gunzip} call, isn't possible 21without extra support. The \eslmod{buffer} module standardizes that 22support for all Easel input. 23 24\subsection{Examples of using the buffer API} 25 26Here's an example of using \eslmod{buffer} to read a file line by 27line: 28 29\input{cexcerpts/buffer_example} 30 31This shows how to open an input, get each line sequentially, do 32something to each line (here, count the number of x's), and close the 33input. To compile this example, then run it on a file (any file would 34do, but here, \ccode{esl\_buffer.c} itself): 35 36\user{gcc -I. -o esl\_buffer\_example -DeslBUFFER\_EXAMPLE esl\_buffer.c easel.c -lm} 37\user{./esl\_buffer\_example esl\_buffer.c} 38\response{Counted 181 x's in 3080 lines.} 39 40The most important thing to notice here is that 41\ccode{esl\_buffer\_Open()} function implements a standard Easel idiom 42for finding input sources. If the \ccode{filename} argument is a 43single dash '-', it will read from \ccode{stdin}. If the 44\ccode{filename} argument ends in \ccode{.gz}, it will assume the file 45is a \ccode{gzip}-compressed input, and it will decompress it on the 46fly with \ccode{gzip -dc} before reading it. If it does not find the 47\ccode{filename} relative to the current directory, and if the second 48argument (here \ccode{"TESTDIR"}) is non-\ccode{NULL}, it looks at the 49setting of an environment variable \ccode{envvar}, which should 50contain a colon-delimited list of directories to search to try to find 51\ccode{filename}. Therefore all of the following commands will work 52and give the same result: 53 54\begin{userchunk} 55% ./esl_buffer_example esl_buffer.c 56\end{userchunk} 57 58\begin{userchunk} 59 % cat esl_buffer.c | ./esl_buffer_example - 60\end{userchunk} 61 62\begin{userchunk} 63 % cp esl_buffer.c foo 64 % gzip foo 65 % ./esl_buffer_example foo.gz 66\end{userchunk} 67 68\begin{userchunk} 69 % cp esl_buffer.c ${HOME}/mydir2/baz 70 % export TESTDIR=${HOME}/mydir1:${HOME}/mydir2 71 % ./esl_buffer_example baz 72\end{userchunk} 73 74This idiomatic flexibility comes in handy when using biological data. 75Data are are often kept in standard directories on systems (for 76example, we maintain a symlink \ccode{/misc/data0/databases/Uniprot} 77on ours), so having applications look for directory path listings in 78standardized environment variables can help users save a lot of typing 79of long paths. Data files can be big, so it's convenient to be able to 80compress them and not have to decompress them to use them. It's 81convenient to have applications support the power of using UNIX 82command invocations in pipes, chaining the output of one command into 83the input of another, so it's nice to automatically have any 84Easel-based application read from standard input. 85 86A couple of other things to notice about this example: 87 88\begin{enumerate} 89\item If the \ccode{esl\_buffer\_Open()} fails, it still returns a 90 valid \ccode{ESL\_BUFFER} structure, which contains nothing except a 91 user-directed error message \ccode{bf->errmsg}. If you were going to 92 continue past this error, you'd want to \ccode{esl\_buffer\_Close()} 93 the buffer. 94 95\item \ccode{esl\_buffer\_GetLine()} returns a pointer to the start of 96 the next line \ccode{p}, and its length in chars \ccode{n} 97 (exclusive of any newline character). It does \emph{not} return a 98 string - \ccode{p[n]} is \emph{not} a \ccode{NUL} byte 99 \verb+\0+. Standard C string functions, which expect 100 \ccode{NUL}-terminated strings, can't be used on \ccode{p}. The 101 reason is efficiency: the \ccode{ESL\_BUFFER} is potentially looking 102 at a read-only exact image of the input, and 103 \ccode{esl\_buffer\_GetLine()} is not wasting any time making a copy 104 of it. If you need a string, with an appended \verb+\0+ in the 105 right place, see \ccode{esl\_buffer\_FetchLineAsStr()}. 106\end{enumerate} 107 108\subsubsection{Reading tokens} 109 110Because \ccode{ESL\_BUFFER} prefers to give you pointers into a 111read-only image of the input, the standard C \ccode{strtok()} function 112can't be used to define tokens (whitespace-delimited fields, for 113example), because \ccode{strtok()} tries to write a \verb+\0+ byte 114after each token it defines. Therefore \ccode{ESL\_BUFFER} provides 115its own token parsing mechanism. Depending on whether or not you 116include newline characters (\verb+\r\n+) in the list of separator 117(delimiter) characters, it either ignores newlines altogether, or it 118detects newlines separately and expects to find a known number of 119tokens per line. 120 121For example, our x counting program could be implemented to parse 122every token instead of every line: 123 124\input{cexcerpts/buffer_example2} 125 126\user{gcc -I. -o esl\_buffer\_example2 -DeslBUFFER\_EXAMPLE2 esl\_buffer.c easel.c -lm} 127\user{./esl\_buffer\_example2 esl\_buffer.c} 128\response{Counted 181 x's in 14141 words.} 129 130In the \ccode{esl\_buffer\_GetToken()} call, including \verb+\r\n+ 131with \verb+" \t"+ in the separators causes newlines to be treated like 132delimiters like any space or tab character. If you omit \verb+\r\n+ 133newline characters from the separators, then the parser detects them 134specially anyway; when it sees a newline instead of a token, it 135returns \ccode{eslEOL} and sets the point to the next character 136following the newline. For example, we can count both lines and 137tokens: 138 139\input{cexcerpts/buffer_example3} 140 141\user{gcc -I. -o esl\_buffer\_example3 -DeslBUFFER\_EXAMPLE3 esl\_buffer.c easel.c -lm} 142\user{./esl\_buffer\_example3 esl\_buffer.c} 143\response{Counted 181 x's in 14141 words on 3080 lines.} 144 145What happens if the last line in a text file is missing its terminal 146newline? In the example above, the number of lines would be one fewer; 147the nonterminated last line wouldn't be 148counted. \ccode{esl\_buffer\_GetToken()} would return \ccode{eslEOF} 149on the last line of the file, rather than \ccode{eslEOL} followed by 150\ccode{eslEOF} at its next call as it'd do if the newline were there. 151 152 153\subsubsection{Reading fixed-width binary input} 154 155You can also read fixed-width binary input directly into storage, 156including scalar variables, using the \ccode{esl\_buffer\_Read()} 157call. This is similar to C's \ccode{fread()}: 158 159\input{cexcerpts/buffer_example4} 160 161The \ccode{Read()} call needs to know exactly how many bytes \ccode{n} 162it will read. For variable-width binary input, see the 163\ccode{esl\_buffer\_Get()}/\ccode{esl\_buffer\_Set()} calls. 164 165In fact all inputs are treated by \ccode{ESL\_BUFFER} as binary 166input. That is, platform-dependent newlines are not converted 167automatically to C \verb+\n+ characters, as would happen when using 168the C \ccode{stdio.h} library to read an input stream in ``text 169mode''. You can freely mix different types of \ccode{esl\_buffer\_*} 170parsing calls as you see appropriate. 171 172 173\subsubsection{A more complicated example, a FASTA parser} 174 175An example of a simple FASTA parsing function: 176 177\input{cexcerpts/buffer_example5a} 178 179and an example of using that function in a program: 180 181\input{cexcerpts/buffer_example5b} 182 183One thing to note here is the use of \ccode{esl\_buffer\_Set()} to 184push characters back into the parser. For example, when we look for 185the starting '>', we do a raw \ccode{esl\_buffer\_Get()}, look at the 186first character, then call \ccode{esl\_buffer\_Set()} with 187\ccode{nused=1} to tell the parser we used 1 character of what it gave 188us. This is an idiomatic usage of the 189\ccode{esl\_buffer\_Get()}/\ccode{esl\_buffer\_Set()} pair. The 190\ccode{esl\_buffer\_Get()} call doesn't even move the point until the 191companion \ccode{esl\_buffer\_Set()} tells it where to move to. 192 193The other idiomatic use of \ccode{esl\_buffer\_Set()} is to implement 194a ``peek'' at a next line or a next token, using a 195\ccode{esl\_buffer\_GetLine()}/\ccode{esl\_buffer\_Set()} or 196\ccode{esl\_buffer\_GetToken()}/\ccode{esl\_buffer\_Set()} 197combination. You see this when we're in the sequence reading loop, we 198get a line, and we want to peek at its first character. If it's a '>' 199we're seeing the start of the next sequence, so we want to return 200while leaving the point on the '>'. To do this, we use 201\ccode{esl\_buffer\_GetLine()} to get the line, and if the first char 202is a '>' we use \ccode{esl\_buffer\_Set()} to push the line pointer 203(with 0 used characters) back to the parser. 204 205You can also see examples here of using 206\ccode{esl\_buffer\_FetchTokenAsStr()} 207\ccode{esl\_buffer\_FetchLineAsStr()} to copy the name and description 208directly to allocated, \verb+\0+-terminated C strings. Note how they 209interact: because \ccode{esl\_buffer\_FetchTokenAsStr()} moves the 210point past any trailing separator characters to the start of the next 211token, and because \ccode{esl\_buffer\_FetchLineAsStr()} doesn't need 212the point to be at the start of a line, the 213\ccode{esl\_buffer\_FetchLineAsStr()} call finds the description 214without leading spaces or trailing newline (but with any trailing 215spaces). 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234\subsection{Using anchors: caller-defined limits on random access} 235 236The naive way to enable random access on a sequential stream is to 237slurp the whole stream into memory. If the stream is large, this may 238be very memory inefficient. Many parsers do not need full random 239access, but instead need a limited form of it -- for instance, the 240three-pass case of determining format and alphabet from the start of a 241sequence file. \ccode{ESL\_BUFFER} allows the caller to define an 242\emph{anchor} to define a start point in the input that is not allowed 243to go away until the caller says so. 244 245Setting an anchor declares that \ccode{mem[anchor..n-1]} is not be 246overwritten by new input reads. A new input read may first relocate 247(``reoffset'') \ccode{mem[anchor..n-1]} to \ccode{mem[0..n-anchor-1]} 248in order to use its current allocation efficiently. Setting an anchor 249may therefore cause \ccode{mem} to be reoffset and/or reallocated, and 250\ccode{balloc} may grow, if the buffer is not large enough to hold 251everything starting from the \ccode{anchor} position. When no anchors 252are set, \ccode{mem} will not be reoffset or reallocated. 253 254If we set an anchor at offset 0 in the input, then the entire input 255will be progressively slurped into a larger and larger allocation of 256memory as we read sequentially. We are guaranteed to be able to 257reposition the buffer anywhere from the anchor to n-1, even in a 258normally nonrewindable, nonpositionable stream. If we've read enough 259to determine what we need (format, alphabet...), we can release the 260anchor, and the buffer's memory usage will stop growing. 261 262The functions that get a defined chunk of memory -- 263\ccode{esl\_buffer\_GetLine()}, \ccode{esl\_buffer\_GetToken()}, and 264\ccode{esl\_buffer\_CopyBytes()} -- set an anchor at the start of the 265line, token, or chunk of bytes before they go looking for its end. 266This takes advantage of the anchor mechanism to make sure that the 267buffer will contain the entire line, token, or chunk of bytes, not just a 268truncated part. 269 270 271\subsection{Token-based parsing} 272 273A \esldef{token} is a substring consisting of characters not in a set 274of caller-defined \esldef{separator} characters. Typically, separator 275chararacters might be whitespace (\ccode{" \t"}). 276 277Additionally, newlines are always considered to be separators. Tokens 278cannot include newlines. 279 280In token-based parsing, we can handle newlines in two ways. Sometimes 281we might know exactly how many tokens we expect on the line. Sometimes 282we don't care. 283 284If the caller knows exactly how many tokens are expected on each line 285of the input, it should not include newline characters in its 286separator string. Now, if the caller asks for a token but no token 287remains on the line, it will see a special \ccode{eslEOL} return code 288(and the parser will be positioned at the next character after that 289newline). A caller can check for this deliberately with one last call 290to \ccode{esl\_buffer\_GetToken()} per line, to be sure that it sees 291\ccode{eslEOL} rather than an unexpected token. 292 293If the caller doesn't care how many tokens occur on each line, it 294should include newline characters (\verb+"\r\n"+) in the separator 295string. Then newlines are treated (and skipped) like any other 296separator. 297 298Starting from the current buffer position, the procedure for defining 299a token is: 300 301\begin{itemize} 302\item Skip characters in the separator string. (If end-of-file is 303 reached, return \ccode{eslEOF}.) 304\item If parser is on a newline, skip past it, and return 305 \ccode{eslEOL}. (Note that if the caller had newline characters 306 in the separator string, the first step already skipped any 307 newline, and no \ccode{eslEOL} return is possible.) 308\item Anchor at the current buffer position, \ccode{p}. 309\item From the current point, count characters \emph{not} in the 310 separator, \ccode{n}. (Expand/refill the buffer as needed.) 311\item Define the token: \ccode{p[0..n]}. 312\item Move the current point to the character following the token. 313\end{itemize} 314 315\subsection{Newline handling.} 316 317Easel assumes that newlines are encoded as \verb+\n+ (UNIX, Mac OS/X) 318or \verb+\r\n+ (MS Windows). 319 320All streams are opened as binary data. This is necessary to guarantee 321a one:one correspondence between data offsets in memory and data 322offsets on the filesystem, which we need for file positioning 323purposes. It is also necessary to guarantee that we can read text 324files that have been produced on a system other than the system we're 325reading them on (that we can read Windows text files on a Linux 326system, for example).\footnote{That is, the usual ANSI C convention of 327 reading/writing in ``text mode'' does not suffice, because it 328 assumes the newlines of the system we're on, not necessarily the 329 system that produced the file.} However, it makes us responsible 330for handling system-specific definition of ``newline'' character(s) in 331ASCII text files. 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363\subsection{Implementation notes (for developers)} 364 365\paragraph{The state guarantee.} An \ccode{ESL\_BUFFER} is exchangeable 366and sharable even amongst entirely different types of parsers because 367it is virtually always guaranteed to be in a well-defined 368state. Specifically: 369 370\begin{itemize} 371\item \ccode{bf->mem[bf->pos]} is ALWAYS positioned at the next byte 372 that a parser needs to parse, unless the buffer is at EOF. 373 374\item There are ALWAYS at least \ccode{pagesize} bytes available to 375 parse, provided the input stream has not reached EOF. 376\end{itemize} 377 378 379\paragraph{State in different input type modes} 380 381There are six types (``modes'') of inputs: 382 383\begin{tabular}{ll} 384 Mode & Description \\ \hline 385\ccode{eslBUFFER\_STDIN} & Standard input. \\ 386\ccode{eslBUFFER\_CMDPIPE} & Output piped from a command. \\ 387\ccode{eslBUFFER\_FILE} & A \ccode{FILE} being streamed. \\ 388\ccode{eslBUFFER\_ALLFILE} & A file entirely slurped into RAM. \\ 389\ccode{eslBUFFER\_MMAP} & A file that's memory mapped (\ccode{mmap()}). \\ 390\ccode{eslBUFFER\_STRING} & A string or memory. \\ \hline 391\end{tabular} 392 393The main difference between modes is whether the input is being read 394into the buffer's memory in chunks, or whether the buffer's memory 395effectively contains the entire input: 396 397\begin{tabular}{lll} 398 & \ccode{STDIN, CMDPIPE, FILE} & \ccode{ALLFILE, MMAP, STRING} \\ 399\ccode{mem} & input chunk: \ccode{mem[0..n-1]} is \ccode{input[baseoffset..baseoffset+n-1]} & entire input: \ccode{mem[0..n-1]} is \ccode{input[0..n-1]} \\ 400\ccode{n} & current chunk size & entire input size (exclusive of \verb+\0+ on a \ccode{STRING}) \\ 401\ccode{balloc} & $>0$; \ccode{mem} is reallocatable & 0; \ccode{mem} is not reallocated \\ 402\ccode{fp} & open; \ccode{feof(fp) = TRUE} near EOF & \ccode{NULL} \\ 403\ccode{baseoffset} & offset of byte \ccode{mem[0]} in input & 0 \\ 404\end{tabular} 405 406 407\paragraph{Behavior at end-of-input (``end-of-file'', EOF).} 408 409The buffer can three kinds of states with respect to how near to EOF 410it is, as follows. 411 412During normal parsing, \ccode{bf->n - bf->pos >= bf->pagesize}: 413 414\begin{cchunk} 415 mem-> {[. . . . . . . . . . . . . . . .] x x x x} 416 ^ baseoffset ^ pos ^ n ^ balloc 417 [~ ~ ~ ~ ~ ~ ~ ~] 418 n-pos >= pagesize 419\end{cchunk} 420 421As input is nearing EOF, and we are within last <pagesize> bytes, 422\ccode{bf->n - bf->pos < bf->pagesize}: 423 424\begin{cchunk} 425 mem-> {[. . . . . . . . . . . . . . . .] x x x x} 426 ^ baseoffset ^ pos ^ n ^ balloc 427\end{cchunk} 428 429In modes where we might be reading input in streamed chunks 430(\ccode{eslBUFFER\_STDIN}, \ccode{eslBUFFER\_CMDPIPE} 431\ccode{eslBUFFER\_FILE}), \ccode{feof(bf->fp)} becomes \ccode{TRUE} 432when the buffer nears EOF. 433 434When the input is entirely EOF, then \ccode{bf->pos == bf->n}: 435 436\begin{cchunk} 437 mem-> {[. . . . . . . . . . . . . . . .] x x x x} 438 ^ baseoffset ^ n ^ balloc 439 ^ pos 440\end{cchunk} 441 442 443 444 445 446\paragraph{ The use of \ccode{esl\_pos\_t}. } 447 448All integer variables for a position or length in memory or in a file 449are of type \ccode{esl\_pos\_t}. In POSIX, memory positions are an 450unsigned integer type \ccode{size\_t}, and file positions are a signed 451integer type \ccode{off\_t}. Easel wants to assure an integer type 452that we can safely cast to either \ccode{size\_t} or \ccode{off\_t}, 453and in which we can safely store a negative number as a status flag 454(such as -1 for ``currently unset''). \ccode{esl\_pos\_t} is defined 455as the largest signed integer type that can be safely cast to 456\ccode{size\_t} or \ccode{off\_t}. 457