1The \eslmod{buffer} module provides an abstract layer for building
2input parsers. Different types of input -- including files, standard
3input, piped output from executed commands, C strings, and raw memory
4-- can be handled efficiently in a single API and a single object, an
5\ccode{ESL\_BUFFER}.
6%The API is summarized in Table~\ref{tbl:buffer_api}.
7
8The main rationale for \eslmod{buffer} is to enable multipass parsing
9of any input, even a nonrewindable stream or pipe. A canonical problem
10in sequence file parsing is that we need to know both the format (
11FASTA or Genbank, for instance) and the alphabet (protein or nucleic
12acid, for instance) in order to parse Easel-digitized sequence data
13records. To write ``smart'' parsers that automagically determine the
14file format and alphabet, so programs work transparently on lots of
15different file types without users needing to specify them, we need
16three-pass parsing: one pass to read raw data and determine the
17format, a second pass to parse the format for sequence data and
18determine its alphabet, and finally the actual parsing of digitized
19sequences. Multiple pass parsing of a nonrewindable stream, such as
20standard input or the output of a \ccode{gunzip} call, isn't possible
21without extra support. The \eslmod{buffer} module standardizes that
22support for all Easel input.
23
24\subsection{Examples of using the buffer API}
25
26Here's an example of using \eslmod{buffer} to read a file line by
27line:
28
29\input{cexcerpts/buffer_example}
30
31This shows how to open an input, get each line sequentially, do
32something to each line (here, count the number of x's), and close the
33input.  To compile this example, then run it on a file (any file would
34do, but here, \ccode{esl\_buffer.c} itself):
35
36\user{gcc -I. -o esl\_buffer\_example -DeslBUFFER\_EXAMPLE esl\_buffer.c easel.c -lm}
37\user{./esl\_buffer\_example esl\_buffer.c}
38\response{Counted 181 x's in 3080 lines.}
39
40The most important thing to notice here is that
41\ccode{esl\_buffer\_Open()} function implements a standard Easel idiom
42for finding input sources. If the \ccode{filename} argument is a
43single dash '-', it will read from \ccode{stdin}. If the
44\ccode{filename} argument ends in \ccode{.gz}, it will assume the file
45is a \ccode{gzip}-compressed input, and it will decompress it on the
46fly with \ccode{gzip -dc} before reading it. If it does not find the
47\ccode{filename} relative to the current directory, and if the second
48argument (here \ccode{"TESTDIR"}) is non-\ccode{NULL}, it looks at the
49setting of an environment variable \ccode{envvar}, which should
50contain a colon-delimited list of directories to search to try to find
51\ccode{filename}. Therefore all of the following commands will work
52and give the same result:
53
54\begin{userchunk}
55% ./esl_buffer_example esl_buffer.c
56\end{userchunk}
57
58\begin{userchunk}
59  % cat esl_buffer.c | ./esl_buffer_example -
60\end{userchunk}
61
62\begin{userchunk}
63  % cp esl_buffer.c foo
64  % gzip foo
65  % ./esl_buffer_example foo.gz
66\end{userchunk}
67
68\begin{userchunk}
69  % cp esl_buffer.c ${HOME}/mydir2/baz
70  % export TESTDIR=${HOME}/mydir1:${HOME}/mydir2
71  % ./esl_buffer_example baz
72\end{userchunk}
73
74This idiomatic flexibility comes in handy when using biological data.
75Data are are often kept in standard directories on systems (for
76example, we maintain a symlink \ccode{/misc/data0/databases/Uniprot}
77on ours), so having applications look for directory path listings in
78standardized environment variables can help users save a lot of typing
79of long paths. Data files can be big, so it's convenient to be able to
80compress them and not have to decompress them to use them. It's
81convenient to have applications support the power of using UNIX
82command invocations in pipes, chaining the output of one command into
83the input of another, so it's nice to automatically have any
84Easel-based application read from standard input.
85
86A couple of other things to notice about this example:
87
88\begin{enumerate}
89\item If the \ccode{esl\_buffer\_Open()} fails, it still returns a
90  valid \ccode{ESL\_BUFFER} structure, which contains nothing except a
91  user-directed error message \ccode{bf->errmsg}. If you were going to
92  continue past this error, you'd want to \ccode{esl\_buffer\_Close()}
93  the buffer.
94
95\item \ccode{esl\_buffer\_GetLine()} returns a pointer to the start of
96  the next line \ccode{p}, and its length in chars \ccode{n}
97  (exclusive of any newline character). It does \emph{not} return a
98  string - \ccode{p[n]} is \emph{not} a \ccode{NUL} byte
99  \verb+\0+. Standard C string functions, which expect
100  \ccode{NUL}-terminated strings, can't be used on \ccode{p}. The
101  reason is efficiency: the \ccode{ESL\_BUFFER} is potentially looking
102  at a read-only exact image of the input, and
103  \ccode{esl\_buffer\_GetLine()} is not wasting any time making a copy
104  of it. If you need a string, with an appended \verb+\0+ in the
105  right place, see \ccode{esl\_buffer\_FetchLineAsStr()}.
106\end{enumerate}
107
108\subsubsection{Reading tokens}
109
110Because \ccode{ESL\_BUFFER} prefers to give you pointers into a
111read-only image of the input, the standard C \ccode{strtok()} function
112can't be used to define tokens (whitespace-delimited fields, for
113example), because \ccode{strtok()} tries to write a \verb+\0+ byte
114after each token it defines. Therefore \ccode{ESL\_BUFFER} provides
115its own token parsing mechanism. Depending on whether or not you
116include newline characters (\verb+\r\n+) in the list of separator
117(delimiter) characters, it either ignores newlines altogether, or it
118detects newlines separately and expects to find a known number of
119tokens per line.
120
121For example, our x counting program could be implemented to parse
122every token instead of every line:
123
124\input{cexcerpts/buffer_example2}
125
126\user{gcc -I. -o esl\_buffer\_example2 -DeslBUFFER\_EXAMPLE2 esl\_buffer.c easel.c -lm}
127\user{./esl\_buffer\_example2 esl\_buffer.c}
128\response{Counted 181 x's in 14141 words.}
129
130In the \ccode{esl\_buffer\_GetToken()} call, including \verb+\r\n+
131with \verb+" \t"+ in the separators causes newlines to be treated like
132delimiters like any space or tab character. If you omit \verb+\r\n+
133newline characters from the separators, then the parser detects them
134specially anyway; when it sees a newline instead of a token, it
135returns \ccode{eslEOL} and sets the point to the next character
136following the newline. For example, we can count both lines and
137tokens:
138
139\input{cexcerpts/buffer_example3}
140
141\user{gcc -I. -o esl\_buffer\_example3 -DeslBUFFER\_EXAMPLE3 esl\_buffer.c easel.c -lm}
142\user{./esl\_buffer\_example3 esl\_buffer.c}
143\response{Counted 181 x's in 14141 words on 3080 lines.}
144
145What happens if the last line in a text file is missing its terminal
146newline? In the example above, the number of lines would be one fewer;
147the nonterminated last line wouldn't be
148counted. \ccode{esl\_buffer\_GetToken()} would return \ccode{eslEOF}
149on the last line of the file, rather than \ccode{eslEOL} followed by
150\ccode{eslEOF} at its next call as it'd do if the newline were there.
151
152
153\subsubsection{Reading fixed-width binary input}
154
155You can also read fixed-width binary input directly into storage,
156including scalar variables, using the \ccode{esl\_buffer\_Read()}
157call. This is similar to C's \ccode{fread()}:
158
159\input{cexcerpts/buffer_example4}
160
161The \ccode{Read()} call needs to know exactly how many bytes \ccode{n}
162it will read. For variable-width binary input, see the
163\ccode{esl\_buffer\_Get()}/\ccode{esl\_buffer\_Set()} calls.
164
165In fact all inputs are treated by \ccode{ESL\_BUFFER} as binary
166input. That is, platform-dependent newlines are not converted
167automatically to C \verb+\n+ characters, as would happen when using
168the C \ccode{stdio.h} library to read an input stream in ``text
169mode''. You can freely mix different types of \ccode{esl\_buffer\_*}
170parsing calls as you see appropriate.
171
172
173\subsubsection{A more complicated example, a FASTA parser}
174
175An example of a simple FASTA parsing function:
176
177\input{cexcerpts/buffer_example5a}
178
179and an example of using that function in a program:
180
181\input{cexcerpts/buffer_example5b}
182
183One thing to note here is the use of \ccode{esl\_buffer\_Set()} to
184push characters back into the parser. For example, when we look for
185the starting '>', we do a raw \ccode{esl\_buffer\_Get()}, look at the
186first character, then call \ccode{esl\_buffer\_Set()} with
187\ccode{nused=1} to tell the parser we used 1 character of what it gave
188us. This is an idiomatic usage of the
189\ccode{esl\_buffer\_Get()}/\ccode{esl\_buffer\_Set()} pair.  The
190\ccode{esl\_buffer\_Get()} call doesn't even move the point until the
191companion \ccode{esl\_buffer\_Set()} tells it where to move to.
192
193The other idiomatic use of \ccode{esl\_buffer\_Set()} is to implement
194a ``peek'' at a next line or a next token, using a
195\ccode{esl\_buffer\_GetLine()}/\ccode{esl\_buffer\_Set()} or
196\ccode{esl\_buffer\_GetToken()}/\ccode{esl\_buffer\_Set()}
197combination. You see this when we're in the sequence reading loop, we
198get a line, and we want to peek at its first character. If it's a '>'
199we're seeing the start of the next sequence, so we want to return
200while leaving the point on the '>'. To do this, we use
201\ccode{esl\_buffer\_GetLine()} to get the line, and if the first char
202is a '>' we use \ccode{esl\_buffer\_Set()} to push the line pointer
203(with 0 used characters) back to the parser.
204
205You can also see examples here of using
206\ccode{esl\_buffer\_FetchTokenAsStr()}
207\ccode{esl\_buffer\_FetchLineAsStr()} to copy the name and description
208directly to allocated, \verb+\0+-terminated C strings. Note how they
209interact: because \ccode{esl\_buffer\_FetchTokenAsStr()} moves the
210point past any trailing separator characters to the start of the next
211token, and because \ccode{esl\_buffer\_FetchLineAsStr()} doesn't need
212the point to be at the start of a line, the
213\ccode{esl\_buffer\_FetchLineAsStr()} call finds the description
214without leading spaces or trailing newline (but with any trailing
215spaces).
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234\subsection{Using anchors: caller-defined limits on random access}
235
236The naive way to enable random access on a sequential stream is to
237slurp the whole stream into memory. If the stream is large, this may
238be very memory inefficient. Many parsers do not need full random
239access, but instead need a limited form of it -- for instance, the
240three-pass case of determining format and alphabet from the start of a
241sequence file. \ccode{ESL\_BUFFER} allows the caller to define an
242\emph{anchor} to define a start point in the input that is not allowed
243to go away until the caller says so.
244
245Setting an anchor declares that \ccode{mem[anchor..n-1]} is not be
246overwritten by new input reads. A new input read may first relocate
247(``reoffset'') \ccode{mem[anchor..n-1]} to \ccode{mem[0..n-anchor-1]}
248in order to use its current allocation efficiently. Setting an anchor
249may therefore cause \ccode{mem} to be reoffset and/or reallocated, and
250\ccode{balloc} may grow, if the buffer is not large enough to hold
251everything starting from the \ccode{anchor} position. When no anchors
252are set, \ccode{mem} will not be reoffset or reallocated.
253
254If we set an anchor at offset 0 in the input, then the entire input
255will be progressively slurped into a larger and larger allocation of
256memory as we read sequentially. We are guaranteed to be able to
257reposition the buffer anywhere from the anchor to n-1, even in a
258normally nonrewindable, nonpositionable stream. If we've read enough
259to determine what we need (format, alphabet...), we can release the
260anchor, and the buffer's memory usage will stop growing.
261
262The functions that get a defined chunk of memory --
263\ccode{esl\_buffer\_GetLine()}, \ccode{esl\_buffer\_GetToken()}, and
264\ccode{esl\_buffer\_CopyBytes()} -- set an anchor at the start of the
265line, token, or chunk of bytes before they go looking for its end.
266This takes advantage of the anchor mechanism to make sure that the
267buffer will contain the entire line, token, or chunk of bytes, not just a
268truncated part.
269
270
271\subsection{Token-based parsing}
272
273A \esldef{token} is a substring consisting of characters not in a set
274of caller-defined \esldef{separator} characters. Typically, separator
275chararacters might be whitespace (\ccode{" \t"}).
276
277Additionally, newlines are always considered to be separators. Tokens
278cannot include newlines.
279
280In token-based parsing, we can handle newlines in two ways. Sometimes
281we might know exactly how many tokens we expect on the line. Sometimes
282we don't care.
283
284If the caller knows exactly how many tokens are expected on each line
285of the input, it should not include newline characters in its
286separator string. Now, if the caller asks for a token but no token
287remains on the line, it will see a special \ccode{eslEOL} return code
288(and the parser will be positioned at the next character after that
289newline). A caller can check for this deliberately with one last call
290to \ccode{esl\_buffer\_GetToken()} per line, to be sure that it sees
291\ccode{eslEOL} rather than an unexpected token.
292
293If the caller doesn't care how many tokens occur on each line, it
294should include newline characters (\verb+"\r\n"+) in the separator
295string. Then newlines are treated (and skipped) like any other
296separator.
297
298Starting from the current buffer position, the procedure for defining
299a token is:
300
301\begin{itemize}
302\item Skip characters in the separator string. (If end-of-file is
303      reached, return \ccode{eslEOF}.)
304\item If parser is on a newline, skip past it, and return
305      \ccode{eslEOL}. (Note that if the caller had newline characters
306      in the separator string, the first step already skipped any
307      newline, and no \ccode{eslEOL} return is possible.)
308\item Anchor at the current buffer position, \ccode{p}.
309\item From the current point, count characters \emph{not} in the
310      separator, \ccode{n}. (Expand/refill the buffer as needed.)
311\item Define the token: \ccode{p[0..n]}.
312\item Move the current point to the character following the token.
313\end{itemize}
314
315\subsection{Newline handling.}
316
317Easel assumes that newlines are encoded as \verb+\n+ (UNIX, Mac OS/X)
318or \verb+\r\n+ (MS Windows).
319
320All streams are opened as binary data. This is necessary to guarantee
321a one:one correspondence between data offsets in memory and data
322offsets on the filesystem, which we need for file positioning
323purposes. It is also necessary to guarantee that we can read text
324files that have been produced on a system other than the system we're
325reading them on (that we can read Windows text files on a Linux
326system, for example).\footnote{That is, the usual ANSI C convention of
327  reading/writing in ``text mode'' does not suffice, because it
328  assumes the newlines of the system we're on, not necessarily the
329  system that produced the file.}  However, it makes us responsible
330for handling system-specific definition of ``newline'' character(s) in
331ASCII text files.
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363\subsection{Implementation notes (for developers)}
364
365\paragraph{The state guarantee.} An \ccode{ESL\_BUFFER} is exchangeable
366and sharable even amongst entirely different types of parsers because
367it is virtually always guaranteed to be in a well-defined
368state. Specifically:
369
370\begin{itemize}
371\item \ccode{bf->mem[bf->pos]} is ALWAYS positioned at the next byte
372      that a parser needs to parse, unless the buffer is at EOF.
373
374\item There are ALWAYS at least \ccode{pagesize} bytes available to
375      parse, provided the input stream has not reached EOF.
376\end{itemize}
377
378
379\paragraph{State in different input type modes}
380
381There are six types (``modes'') of inputs:
382
383\begin{tabular}{ll}
384    Mode                    &   Description                                   \\ \hline
385\ccode{eslBUFFER\_STDIN}    &  Standard input.                                \\
386\ccode{eslBUFFER\_CMDPIPE}  &  Output piped from a command.                   \\
387\ccode{eslBUFFER\_FILE}     &  A \ccode{FILE} being streamed.                 \\
388\ccode{eslBUFFER\_ALLFILE}  &  A file entirely slurped into RAM.              \\
389\ccode{eslBUFFER\_MMAP}     &  A file that's memory mapped (\ccode{mmap()}).  \\
390\ccode{eslBUFFER\_STRING}   &  A string or memory.                            \\ \hline
391\end{tabular}
392
393The main difference between modes is whether the input is being read
394into the buffer's memory in chunks, or whether the buffer's memory
395effectively contains the entire input:
396
397\begin{tabular}{lll}
398               &   \ccode{STDIN, CMDPIPE, FILE}                                                   & \ccode{ALLFILE, MMAP, STRING}        \\
399\ccode{mem}    &   input chunk: \ccode{mem[0..n-1]} is \ccode{input[baseoffset..baseoffset+n-1]}  & entire input: \ccode{mem[0..n-1]} is \ccode{input[0..n-1]}     \\
400\ccode{n}      &   current chunk size                                                             & entire input size (exclusive of \verb+\0+ on a \ccode{STRING}) \\
401\ccode{balloc} &   $>0$; \ccode{mem} is reallocatable                                             & 0; \ccode{mem} is not reallocated  \\
402\ccode{fp}     &   open; \ccode{feof(fp) = TRUE} near EOF                                         & \ccode{NULL}                        \\
403\ccode{baseoffset} &  offset of byte \ccode{mem[0]} in input                                      & 0                                  \\
404\end{tabular}
405
406
407\paragraph{Behavior at end-of-input (``end-of-file'', EOF).}
408
409The buffer can three kinds of states with respect to how near to EOF
410it is, as follows.
411
412During normal parsing, \ccode{bf->n - bf->pos >= bf->pagesize}:
413
414\begin{cchunk}
415  mem->  {[. . . . . . . . . . . . . . . .] x x x x}
416           ^ baseoffset    ^ pos            ^ n   ^ balloc
417                          [~ ~ ~ ~ ~ ~ ~ ~]
418                          n-pos >= pagesize
419\end{cchunk}
420
421As input is nearing EOF, and we are within last <pagesize> bytes,
422\ccode{bf->n - bf->pos < bf->pagesize}:
423
424\begin{cchunk}
425 mem->  {[. . . . . . . . . . . . . . . .] x x x x}
426          ^ baseoffset              ^ pos  ^ n   ^ balloc
427\end{cchunk}
428
429In modes where we might be reading input in streamed chunks
430(\ccode{eslBUFFER\_STDIN}, \ccode{eslBUFFER\_CMDPIPE}
431\ccode{eslBUFFER\_FILE}), \ccode{feof(bf->fp)} becomes \ccode{TRUE}
432when the buffer nears EOF.
433
434When the input is entirely EOF, then \ccode{bf->pos == bf->n}:
435
436\begin{cchunk}
437  mem->  {[. . . . . . . . . . . . . . . .] x x x x}
438           ^ baseoffset                     ^ n   ^ balloc
439                                            ^ pos
440\end{cchunk}
441
442
443
444
445
446\paragraph{ The use of \ccode{esl\_pos\_t}. }
447
448All integer variables for a position or length in memory or in a file
449are of type \ccode{esl\_pos\_t}. In POSIX, memory positions are an
450unsigned integer type \ccode{size\_t}, and file positions are a signed
451integer type \ccode{off\_t}. Easel wants to assure an integer type
452that we can safely cast to either \ccode{size\_t} or \ccode{off\_t},
453and in which we can safely store a negative number as a status flag
454(such as -1 for ``currently unset''). \ccode{esl\_pos\_t} is defined
455as the largest signed integer type that can be safely cast to
456\ccode{size\_t} or \ccode{off\_t}.
457